Observability
Observability spans every layer of the platform — from the cloud infrastructure through Kubernetes to the application — and it’s where two promises of the offer come together: the live, shared dashboards that are your transparency window, and the alerting that drives The Operation.
We standardize on the Grafana open-source stack, deployed in-cluster.
The stack
| Signal | Tool | What it gives you |
|---|---|---|
| Dashboards | Grafana | A single pane over metrics, logs, and traces |
| Metrics | Prometheus (scaled with Mimir) | Time-series for infra, cluster, and app |
| Logs | Loki | Centralized, queryable logs, correlated with metrics |
| Traces | Tempo | Distributed tracing across services |
| Alerting | Alertmanager | Routing, grouping, and de-duplication of alerts |
Metrics, logs, and traces are correlated in Grafana, so an alert leads straight to the relevant logs and traces — not a hunt across disconnected tools.
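Correlation of this kind usually rests on metrics and logs carrying the same labels. As a minimal sketch, assuming alerts and log streams share `namespace` and `app` labels (hypothetical names chosen for illustration, not confirmed by this page), an alert's labels can be turned into a Loki stream selector like this:

```python
def logql_selector(labels: dict[str, str],
                   keep: tuple[str, ...] = ("namespace", "app")) -> str:
    """Build a Loki stream selector from the alert labels listed in `keep`.

    `keep` is an assumption: it names the labels that metrics and logs
    are presumed to have in common.
    """
    parts = []
    for name in keep:
        if name in labels:
            # Escape backslashes and double quotes so the value stays a valid LogQL string.
            value = labels[name].replace("\\", "\\\\").replace('"', '\\"')
            parts.append(f'{name}="{value}"')
    if not parts:
        raise ValueError("no shared labels to correlate on")
    return "{" + ", ".join(parts) + "}"


# Hypothetical alert labels, for illustration only.
alert_labels = {
    "alertname": "HighErrorRate",
    "severity": "critical",
    "namespace": "shop",
    "app": "checkout",
}
selector = logql_selector(alert_labels)
print(selector)  # {namespace="shop", app="checkout"}
```

Labels that exist only on the alert, such as `alertname` and `severity`, are left out on purpose: a selector that includes them would match no log stream.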
Your transparency window
The dashboards are shared with you and live, not a monthly PDF. They’re how you see exactly what we run on your behalf, in real time. This is the offer’s “observability stack + live shared dashboards” made concrete: no black box, no key-person knowledge, full visibility into the platform’s health, performance, and cost signals.
SLOs and alerting
- We define SLOs (service-level objectives) for the signals that matter to your product.
- Alertmanager routes alerts by severity to the on-call rotation.
- Alerts are the entry point to incident handling — see The Operation.
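To make the SLO idea concrete, here is a minimal sketch of the arithmetic behind an availability SLO: the error budget is the share of requests allowed to fail, and the burn rate is how fast that budget is being spent. The function names and the numbers are illustrative assumptions, not the SLOs we define for any particular product.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    1.0 means the budget is being spent exactly as fast as the SLO allows;
    higher values mean it will run out early.
    """
    if total <= 0:
        return 0.0  # no traffic in the window, nothing burned
    return (failed / total) / error_budget(slo_target)


# Illustrative numbers: 50 failures out of 10,000 requests against a 99.9% SLO.
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")  # burn rate: 5.0x
```

A burn rate of 5 means the budget is being consumed five times faster than the SLO allows, which is the kind of signal that would typically be mapped to an alert severity.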
infra · cluster · app
  │
  │  metrics → Prometheus / Mimir ┐
  │  logs    → Loki               ├─► Grafana (shared dashboards)
  │  traces  → Tempo              ┘
  │
  └────────────────────────────► Alertmanager ─► on-call (Teams / Slack)

This closes the loop between the Platform and The Operation: the platform emits the signals, and observability turns them into the dashboards you watch and the alerts we act on.
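The last hop in the diagram, from Alertmanager to on-call, can be pictured with a small sketch of severity-based routing with de-duplication. This illustrates the concept only; it is not how Alertmanager is implemented or configured, and the receiver names and severity values are hypothetical.

```python
from collections import defaultdict

# Hypothetical severity -> receiver mapping, for illustration only.
ROUTES = {"critical": "on-call-pager", "warning": "team-chat"}
DEFAULT_RECEIVER = "team-chat"


def route(alerts: list[dict[str, str]]) -> dict[str, list[dict[str, str]]]:
    """Group alerts per receiver, dropping alerts whose label set was already seen."""
    seen: set[frozenset] = set()
    grouped: dict[str, list[dict[str, str]]] = defaultdict(list)
    for labels in alerts:
        fingerprint = frozenset(labels.items())
        if fingerprint in seen:
            continue  # de-duplicate: identical label set already queued
        seen.add(fingerprint)
        receiver = ROUTES.get(labels.get("severity", ""), DEFAULT_RECEIVER)
        grouped[receiver].append(labels)
    return dict(grouped)


alerts = [
    {"alertname": "HighErrorRate", "severity": "critical", "app": "checkout"},
    {"alertname": "HighErrorRate", "severity": "critical", "app": "checkout"},  # duplicate
    {"alertname": "DiskFilling", "severity": "warning", "app": "db"},
]
for receiver, batch in route(alerts).items():
    print(receiver, [a["alertname"] for a in batch])
```

The duplicate `HighErrorRate` alert is delivered once, and each receiver gets a single grouped batch rather than one notification per alert.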