Observability
The full observability stack ships as cb-* sub-containers and is wired together by the master at first boot. This page walks the stack as a workflow: scrape, visualise, query logs, follow traces, read profiles, route alerts.
The stack
Every component is auto-deployed by the master via the Portainer API on first boot. No manual provisioning.
| Sub-container | Role | Receives from |
|---|---|---|
| cb-prometheus | Time-series metrics + PromQL. | Master /metrics, every worker, every cb-*. |
| cb-alertmanager | Alert routing + dedup + silences. | cb-prometheus. |
| cb-grafana | Dashboards, alert UI, explore. | cb-prometheus, cb-loki, cb-tempo, cb-pyroscope (datasources auto-provisioned). |
| cb-loki | Log aggregation + LogQL. | Master, every worker (push), every cb-*. |
| cb-tempo | Distributed traces (OTLP/HTTP). | Master + workers (OTel SDK in-process). |
| cb-pyroscope | Continuous profiles (CPU + heap + mutex). | Master + workers. |
What the master exposes for itself
The master serves Prometheus metrics on contextbay:7480/metrics. This endpoint is intentionally open (no auth) because cb-prometheus has to scrape it on every interval — Prometheus convention. It lives outside the /api tree precisely so it can't accidentally inherit auth middleware.
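A quick way to confirm the endpoint is reachable and the CB counters are registered (from any host that can resolve contextbay):

```bash
# No auth header needed: the endpoint is deliberately open
curl -sS http://contextbay:7480/metrics | grep '^contextbay_'
```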
Beyond the standard Go runtime metrics, CB exposes a small set of counters dedicated to the host-onboarding flow:
Host-onboarding counters
| Metric | Labels | Description |
|---|---|---|
| contextbay_enroll_attempts_total | none | Total /api/enroll calls past the rate-limit check. |
| contextbay_enroll_accepted_total | none | Successful enrollments — token consumed, mesh authkey minted. |
| contextbay_enroll_rejections_total | reason | Rejections, by bounded reason: rate_limited, bad_request, bad_secret, invalid_token, expired, already_consumed, race_consumed, not_configured, mesh_mint_failed. |
| contextbay_fuse_attempts_total | none | POST /api/hosts/{id}/fuse calls that transitioned a node into fusing. |
| contextbay_fuse_succeeded_total | none | Fuses that reached fused (observed by the Portainer poller). |
| contextbay_fuse_failed_total | source | Fuse failures by source: portainer_error, watchdog_timeout. |
| contextbay_enrollment_sweep_deleted_total | none | Expired enroll tokens deleted by the periodic sweeper (one increment per row). |
Steady-state cardinality is 16 series (five unlabelled counters, nine reason values, two source values) — labels use bounded vocabularies on purpose so the metric set stays flat regardless of fleet size. Definitions live in internal/metrics/onboarding.go.
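The bounded-label design is easy to see in code. Here is a minimal sketch of how such a counter could be declared with client_golang (illustrative only, not the actual contents of internal/metrics/onboarding.go):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// EnrollRejections counts /api/enroll rejections, partitioned by a
// bounded "reason" label so cardinality never grows with fleet size.
var EnrollRejections = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "contextbay_enroll_rejections_total",
		Help: "Enrollment rejections by bounded reason.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(EnrollRejections)
}

// Callers increment with one of the nine known reasons, e.g.:
//   metrics.EnrollRejections.WithLabelValues("invalid_token").Inc()
```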
Useful PromQL queries:

```promql
# Enroll success ratio (accepted / attempted, over 5m)
  rate(contextbay_enroll_accepted_total[5m])
/ rate(contextbay_enroll_attempts_total[5m])

# Top rejection reason in the last hour
sort_desc(
  sum by (reason) (
    increase(contextbay_enroll_rejections_total[1h])
  )
)

# Fuse success vs watchdog timeouts
sum(rate(contextbay_fuse_succeeded_total[5m]))
sum(rate(contextbay_fuse_failed_total{source="watchdog_timeout"}[5m]))
```

Reading dashboards
The master's Grafana provisioner ships a set of curated dashboards on first boot. Find them in the master UI under /grafana (embedded iframe — no separate login).
- Fleet Overview — node health, uptime, container counts, alert volume.
- Host Onboarding — uses the counters above. Enroll attempts vs accepted, rejection mix, fuse success rate, sweeper deletions.
- Per-host detail — drill into a single host's CPU/memory/disk and the containers it runs.
- CB master internals — HTTP latency by route, gRPC stream health, event-bus backpressure, DB query timing.
Add custom dashboards via POST /api/grafana/dashboards. User-saved dashboards are stored in cb-grafana's volume and listed alongside the provisioned ones via GET /api/grafana/dashboards. Star or un-star a dashboard with POST /api/grafana/dashboards/{uid}/star.
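For example (placeholder uid; the X-API-Key header matches the other examples on this page):

```bash
# List provisioned + user-saved dashboards
curl -sS http://localhost:7480/api/grafana/dashboards \
  -H "X-API-Key: cb_..."

# Star one by uid
curl -sS -X POST http://localhost:7480/api/grafana/dashboards/<uid>/star \
  -H "X-API-Key: cb_..."
```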
Querying logs (Loki)
Every CB component pushes logs to cb-loki — master, workers, sub-containers. Reach them via /grafana → Explore → Loki or via the Logs page in CB itself (LogQL editor).
A few starter queries:

```logql
# Master errors in the last hour
{container="contextbay"} |= "level=error"

# Worker logs for a specific node
{job="contextbay-worker", node_name="<worker-1>"}

# All n8n workflow execution failures
{container="cb-n8n"} |~ "execution.*failed"

# Headscale authkey lifecycle
{container="cb-headscale"} |~ "preauth.*key"
```

Loki retention is configured via the [loki] block in the master config. Master push behaviour (batch size, flush interval) is also tunable from there.
Reading traces (Tempo)
The master and every worker ship spans to cb-tempo via OTLP/HTTP. Sample rate, endpoint, and on/off are all configured under [tracing] (default sample_rate = 1.0 for dev — turn down for production).
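As a sketch, that block plausibly looks like this; sample_rate comes from the text above, while the other key names and the Tempo OTLP/HTTP port are assumptions:

```toml
[tracing]
enabled     = true
endpoint    = "http://cb-tempo:4318"  # assumed key and port (Tempo's default OTLP/HTTP listener)
sample_rate = 1.0                     # dev default per above; lower it (e.g. 0.1) in production
```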
A typical request hitting /api/hosts produces a trace that crosses HTTP → service layer → gRPC client → the Headscale REST API. Trace IDs are echoed back via traceparent headers — copy one into /grafana → Explore → Tempo to see the full waterfall.
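To grab a trace ID for a request you just made, dump the response headers with curl (localhost URL and key header match the other examples here):

```bash
# -D - prints response headers; -o /dev/null discards the body
curl -sSD - -o /dev/null http://localhost:7480/api/hosts \
  -H "X-API-Key: cb_..." | grep -i '^traceparent'
# traceparent: 00-<trace-id>-<span-id>-01  (paste <trace-id> into Explore → Tempo)
```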
When debugging a slow handler, jump from a Loki log line straight to its trace via the trace_id field — Grafana's data-link integration is auto-provisioned.
Continuous profiling (Pyroscope)
cb-pyroscope receives continuous profiles from the master and every worker — CPU, alloc heap, in-use heap, goroutine, mutex, and (optionally) block. The master process tags profiles with service=contextbay-master; workers use service=contextbay-worker plus a node_name tag.
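For reference, a minimal sketch of how a worker might wire this up, assuming the grafana/pyroscope-go client is the SDK in play (tag and address values mirror the text on this page):

```go
package main

import (
	"runtime"

	pyroscope "github.com/grafana/pyroscope-go"
)

// startProfiling pushes continuous profiles to cb-pyroscope.
// Block profiling is left off to match the default config below.
func startProfiling(nodeName string) error {
	// The Go runtime collects no mutex samples until a fraction is set.
	runtime.SetMutexProfileFraction(5)

	_, err := pyroscope.Start(pyroscope.Config{
		ApplicationName: "contextbay-worker", // surfaces as the service tag
		ServerAddress:   "http://cb-pyroscope:4040",
		Tags:            map[string]string{"node_name": nodeName},
		ProfileTypes: []pyroscope.ProfileType{
			pyroscope.ProfileCPU,
			pyroscope.ProfileAllocSpace,
			pyroscope.ProfileInuseSpace,
			pyroscope.ProfileGoroutines,
			pyroscope.ProfileMutexCount,
			pyroscope.ProfileMutexDuration,
		},
	})
	return err
}
```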
Tunable from the master config:

```toml
[profiling]
enabled = true
pyroscope_url = "http://cb-pyroscope:4040"
mutex_profiling = true
block_profiling = false  # Off by default — Go runtime overhead is non-trivial
```

View flamegraphs in /grafana → Explore → Pyroscope or via the Pyroscope UI directly. Filter by service first, then drill into CPU vs alloc to compare.
Alerts (Alertmanager)
CB writes its alert rule files into /data/generated/ on the master, and Prometheus picks them up through a rule_files glob over the mounted volume. The three rule files:
- recording_rules.yml — pre-aggregated PromQL used by dashboards (cheaper than ad-hoc).
- contextbay_alerts.rules.yml — alerts written by users via /api/alert-rules and synced to disk by the master.
- slo_alerts.rules.yml — SLO burn-rate alerts derived from CB's built-in SLOs.
Add a custom rule:

```bash
curl -sS -X POST http://localhost:7480/api/alert-rules \
  -H "X-API-Key: cb_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "EnrollRejectionsHigh",
    "expr": "sum(rate(contextbay_enroll_rejections_total[5m])) > 0.5",
    "for": "5m",
    "severity": "warning",
    "summary": "Enroll rejections high"
  }'
```

Routing is handled by cb-alertmanager. Notification channels are configured via /api/notification-channels (admin role) — Discord, generic webhook, or n8n trigger. Silences are managed through /api/alertmanager/silences.
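Once synced, the rule above lands in /data/generated/contextbay_alerts.rules.yml. Assuming CB renders standard Prometheus rule-file syntax, the entry looks roughly like this (the group name is a guess):

```yaml
groups:
  - name: contextbay_alerts   # group name assumed
    rules:
      - alert: EnrollRejectionsHigh
        expr: sum(rate(contextbay_enroll_rejections_total[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Enroll rejections high
```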
Inbound webhook from Alertmanager → CB lives at POST /api/webhooks/alertmanager — used to mirror alerts as CB events and Grafana annotations.
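To exercise that webhook by hand, you can replay a minimal payload in Alertmanager's standard webhook schema (whether this endpoint requires an API key is not stated here; add the header if it does):

```bash
curl -sS -X POST http://localhost:7480/api/webhooks/alertmanager \
  -H "Content-Type: application/json" \
  -d '{
    "version": "4",
    "status": "firing",
    "alerts": [{
      "status": "firing",
      "labels": {"alertname": "EnrollRejectionsHigh", "severity": "warning"},
      "annotations": {"summary": "Enroll rejections high"},
      "startsAt": "2024-01-01T00:00:00Z"
    }]
  }'
```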
Reload after config changes
When the master rewrites prometheus.yml, alertmanager.yml, or one of the rule files, it triggers a config reload of the relevant sub-container automatically — no manual SIGHUP needed.
Verify a reload landed by checking Prometheus' loaded config via the master's proxy:
```bash
curl -sS http://localhost:7480/api/prometheus/status \
  -H "X-API-Key: cb_..." | jq .data.runtimeInfo

# Or directly:
curl -sS http://cb-prometheus:9090/-/status/config | head
```

See GET /api/prometheus/rules to confirm the rule you just wrote is loaded, and GET /api/prometheus/targets for scrape health.
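For example (the jq paths assume the proxy mirrors Prometheus' native response shape, as the runtimeInfo example above suggests):

```bash
# Is the new rule group loaded?
curl -sS http://localhost:7480/api/prometheus/rules \
  -H "X-API-Key: cb_..." | jq '.data.groups[].name'

# Are all scrape targets healthy?
curl -sS http://localhost:7480/api/prometheus/targets \
  -H "X-API-Key: cb_..." | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```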
Wiring it all together
A typical incident loop:
- cb-prometheus fires on a recording-rule threshold; cb-alertmanager dedupes and routes via your notification channels.
- CB also receives the alert at /api/webhooks/alertmanager and emits a CB event and a Grafana annotation.
- The receiving channel (e.g. Discord) shows a link back to the relevant Grafana dashboard scoped to the affected host.
- From the dashboard you jump into Loki for logs around the timestamp, and from any log line you click into Tempo for the trace.
- If the trace points to a slow handler, Pyroscope's CPU/heap profile for the same window tells you where time went.

