Observability

The full observability stack ships as cb-* sub-containers and is wired together by the master at first boot. This page walks the stack as a workflow: scrape, visualise, query logs, follow traces, read profiles, route alerts.

The stack

Every component is auto-deployed by the master via the Portainer API on first boot. No manual provisioning.

| Sub-container | Role | Receives from |
| --- | --- | --- |
| cb-prometheus | Time-series metrics + PromQL. | Master /metrics, every worker, every cb-*. |
| cb-alertmanager | Alert routing + dedup + silences. | cb-prometheus. |
| cb-grafana | Dashboards, alert UI, explore. | cb-prometheus, cb-loki, cb-tempo, cb-pyroscope (datasources auto-provisioned). |
| cb-loki | Log aggregation + LogQL. | Master, every worker (push), every cb-*. |
| cb-tempo | Distributed traces (OTLP/HTTP). | Master + workers (OTel SDK in-process). |
| cb-pyroscope | Continuous profiles (CPU + heap + mutex). | Master + workers. |

What the master exposes for itself

The master serves Prometheus metrics on contextbay:7480/metrics. This endpoint is intentionally open (no auth) because cb-prometheus has to scrape it on every interval — Prometheus convention. It lives outside the /api tree precisely so it can't accidentally inherit auth middleware.
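
A quick sanity check, assuming the contextbay hostname resolves from wherever you run this:

# Pull the raw exposition by hand; no API key needed
curl -sS http://contextbay:7480/metrics | grep '^contextbay_'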

Beyond the standard Go runtime metrics, CB exposes a small set of counters dedicated to the host-onboarding flow:

Host-onboarding counters

| Metric | Labels | Description |
| --- | --- | --- |
| contextbay_enroll_attempts_total | none | Total /api/enroll calls past the rate-limit check. |
| contextbay_enroll_accepted_total | none | Successful enrollments — token consumed, mesh authkey minted. |
| contextbay_enroll_rejections_total | reason | Rejections, by bounded reason: rate_limited, bad_request, bad_secret, invalid_token, expired, already_consumed, race_consumed, not_configured, mesh_mint_failed. |
| contextbay_fuse_attempts_total | none | POST /api/hosts/{id}/fuse calls that transitioned a node into fusing. |
| contextbay_fuse_succeeded_total | none | Fuses that reached fused (observed by the Portainer poller). |
| contextbay_fuse_failed_total | source | Fuse failures by source: portainer_error, watchdog_timeout. |
| contextbay_enrollment_sweep_deleted_total | none | Expired enroll tokens deleted by the periodic sweeper (one increment per row). |

Steady-state cardinality is 16 series (five unlabeled counters, nine rejection reasons, two fuse-failure sources). Labels use bounded vocabularies on purpose, so the metric set stays flat regardless of fleet size. Definitions live in internal/metrics/onboarding.go.

Useful PromQL queries:

# Enroll success rate (ratio over a 5m window)
rate(contextbay_enroll_accepted_total[5m])
  / rate(contextbay_enroll_attempts_total[5m])

# Top rejection reason in the last hour
sort_desc(
  sum by (reason) (
    increase(contextbay_enroll_rejections_total[1h])
  )
)

# Fuse success vs watchdog timeouts
sum(rate(contextbay_fuse_succeeded_total[5m]))
sum(rate(contextbay_fuse_failed_total{source="watchdog_timeout"}[5m]))

Reading dashboards

The master's Grafana provisioner ships a set of curated dashboards on first boot. Find them in the master UI under /grafana (embedded iframe — no separate login).

  • Fleet Overview — node health, uptime, container counts, alert volume.
  • Host Onboarding — uses the counters above. Enroll attempts vs accepted, rejection mix, fuse success rate, sweeper deletions.
  • Per-host detail — drill into a single host's CPU/memory/disk and the containers it runs.
  • CB master internals — HTTP latency by route, gRPC stream health, event-bus backpressure, DB query timing.

Add custom dashboards via POST /api/grafana/dashboards. User-saved dashboards are stored in cb-grafana's volume and listed alongside the provisioned ones via GET /api/grafana/dashboards. Star or un-star a dashboard with POST /api/grafana/dashboards/{uid}/star.
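
A minimal sketch of that flow; the create payload shown here (a title plus raw Grafana dashboard JSON) is an assumption, so check the API reference for the actual schema:

# Create a custom dashboard (payload shape illustrative)
curl -sS -X POST http://localhost:7480/api/grafana/dashboards \
  -H "X-API-Key: cb_..." \
  -H "Content-Type: application/json" \
  -d '{"title": "My Fleet View", "dashboard": {}}'

# Star it, using the uid from the create response
curl -sS -X POST http://localhost:7480/api/grafana/dashboards/<uid>/star \
  -H "X-API-Key: cb_..."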

Querying logs (Loki)

Every CB component pushes logs to cb-loki — master, workers, sub-containers. Reach them via /grafana → Explore → Loki or via the Logs page in CB itself (LogQL editor).

A few starter queries:

# Master errors in the last hour
{container="contextbay"} |= "level=error"

# Worker logs for a specific node
{job="contextbay-worker", node_name="<worker-1>"}

# All n8n workflow execution failures
{container="cb-n8n"} |~ "execution.*failed"

# Headscale authkey lifecycle
{container="cb-headscale"} |~ "preauth.*key"

Loki retention is configured via the [loki] block in the master config. Master push behaviour (batch size, flush interval) is also tunable from there.
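
A sketch of what that block might look like; the key names below are assumptions, so check the shipped default config for the real ones:

[loki]
retention_days      = 30      # how long cb-loki keeps logs (key name assumed)
push_batch_size     = 1024    # master-side push batching (key name assumed)
push_flush_interval = "5s"    # how often buffered logs flush (key name assumed)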

Reading traces (Tempo)

The master and every worker ship spans to cb-tempo via OTLP/HTTP. The sample rate, the endpoint, and the on/off switch are all configured under [tracing] (default sample_rate = 1.0 for dev; turn it down in production).
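
A sketch of the block; sample_rate and its 1.0 default come straight from this page, while the enabled and endpoint key names (and the cb-tempo URL) are assumptions:

[tracing]
enabled     = true                     # key name assumed
endpoint    = "http://cb-tempo:4318"   # 4318 is the standard OTLP/HTTP port; value assumed
sample_rate = 1.0                      # dev default per this page; lower it in production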

A typical request hitting /api/hosts produces a trace that crosses HTTP → service layer → gRPC client → the Headscale REST API. Trace IDs are echoed back via traceparent headers — copy one into /grafana → Explore → Tempo to see the full waterfall.
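
One way to grab a trace ID for that waterfall (curl's -D - dumps response headers; the 00-<trace-id>-<span-id>-<flags> layout is the W3C traceparent format):

# Capture the echoed traceparent from a real request
curl -sSD - -o /dev/null http://localhost:7480/api/hosts \
  -H "X-API-Key: cb_..." | grep -i traceparent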

When debugging a slow handler, jump from a Loki log line straight to its trace via the trace_id field — Grafana's data-link integration is auto-provisioned.

Continuous profiling (Pyroscope)

cb-pyroscope receives continuous profiles from the master and every worker — CPU, alloc heap, in-use heap, goroutine, mutex, and (optionally) block. The master process tags profiles with service=contextbay-master; workers use service=contextbay-worker plus a node_name tag.

Tunable from the master config:

[profiling]
enabled         = true
pyroscope_url   = "http://cb-pyroscope:4040"
mutex_profiling = true
block_profiling = false   # Off by default — Go runtime overhead is non-trivial

View flamegraphs in /grafana → Explore → Pyroscope or via the Pyroscope UI directly. Filter by service first, then drill into CPU vs alloc to compare.
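
For example, a selector along these lines (standard Pyroscope label-selector syntax; the service and node_name tags are the ones documented above) scopes the view to a single worker:

# CPU flamegraph for one worker
{service="contextbay-worker", node_name="worker-1"}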

Alerts (Alertmanager)

CB writes its alert rule files into /data/generated/ on the master, and Prometheus loads them from the mounted volume via its rule_files configuration. The three rule files:

  • recording_rules.yml — pre-aggregated PromQL used by dashboards (cheaper than ad-hoc).
  • contextbay_alerts.rules.yml — alerts written by users via /api/alert-rules and synced to disk by the master.
  • slo_alerts.rules.yml — SLO burn-rate alerts derived from CB's built-in SLOs.

Add a custom rule:

curl -sS -X POST http://localhost:7480/api/alert-rules \
  -H "X-API-Key: cb_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "EnrollRejectionsHigh",
    "expr": "sum(rate(contextbay_enroll_rejections_total[5m])) > 0.5",
    "for": "5m",
    "severity": "warning",
    "summary": "Enroll rejections high"
  }'

Routing is handled by cb-alertmanager. Notification channels are configured via /api/notification-channels (admin role): Discord, generic webhook, or n8n trigger. Alert silences are managed through /api/alertmanager/silences.
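
A sketch of registering a Discord channel; the endpoint and channel types come from this page, but the payload field names are assumptions:

# Create a Discord notification channel (field names illustrative; needs an admin-role key)
curl -sS -X POST http://localhost:7480/api/notification-channels \
  -H "X-API-Key: cb_..." \
  -H "Content-Type: application/json" \
  -d '{
    "type": "discord",
    "name": "ops-alerts",
    "webhook_url": "https://discord.com/api/webhooks/..."
  }'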

Inbound webhook from Alertmanager → CB lives at POST /api/webhooks/alertmanager — used to mirror alerts as CB events and Grafana annotations.

Reload after config changes

When the master rewrites prometheus.yml, alertmanager.yml, or one of the rule files, it triggers a config reload of the relevant sub-container automatically — no manual SIGHUP needed.

Verify a reload landed by checking Prometheus' loaded config via the master's proxy:

curl -sS http://localhost:7480/api/prometheus/status \
  -H "X-API-Key: cb_..." | jq .data.runtimeInfo

# Or directly, against Prometheus' own HTTP API:
curl -sS http://cb-prometheus:9090/api/v1/status/config | head

See GET /api/prometheus/rules to confirm the rule you just wrote is loaded, and GET /api/prometheus/targets for scrape health.
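
For example (the jq paths assume the master's proxy preserves Prometheus' usual response shape):

# Confirm the new rule group is loaded
curl -sS http://localhost:7480/api/prometheus/rules \
  -H "X-API-Key: cb_..." | jq '.data.groups[].name'

# Check scrape health across all targets
curl -sS http://localhost:7480/api/prometheus/targets \
  -H "X-API-Key: cb_..." | jq '.data.activeTargets[] | {job: .labels.job, health}'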

Wiring it all together

A typical incident loop:

  1. An alerting rule in cb-prometheus crosses its threshold (often over a pre-aggregated recording rule); cb-alertmanager dedupes and routes it via your notification channels.
  2. CB also receives the alert at /api/webhooks/alertmanager — emits a CB event and a Grafana annotation.
  3. The receiving channel (e.g. Discord) shows a link back to the relevant Grafana dashboard scoped to the affected host.
  4. From the dashboard you jump into Loki for logs around the timestamp, and from any log line you click into Tempo for the trace.
  5. If the trace points to a slow handler, Pyroscope's CPU/heap profile for the same window tells you where time went.