Diagnostics
Guides for inspecting systems and finding root causes: where to look, which commands and queries to run, and what healthy and unhealthy signals look like.
Diagnostics differ from incident runbooks: they don't assume something is on fire. They're the reusable "how do I see what's going on in X" guides that incident and maintenance runbooks link to.
A good diagnostics guide covers:
- Access — how to reach the system (dashboards, logs, a shell, a DB client).
- Key signals — the handful of metrics/log lines that actually matter, with expected ranges.
- Drill-downs — common questions ("is tenant X throttled?", "why is ingest lagging?") and the exact query/command to answer each.
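As a rough illustration of the "Key signals" item, a guide could record each signal with its expected range so that a reading can be classified mechanically. The sketch below is only one possible shape. The `Signal` class, the `evaluate` helper, the signal names, and the ranges are all hypothetical and do not describe any existing tool or real threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One key signal and the range a healthy system is expected to stay in."""
    name: str
    low: float
    high: float
    unit: str = ""

    def is_healthy(self, value: float) -> bool:
        # Both bounds are inclusive.
        return self.low <= value <= self.high


def evaluate(signals, observed):
    """Map each signal name to 'healthy', 'unhealthy', or 'missing'."""
    report = {}
    for signal in signals:
        if signal.name not in observed:
            # No reading at all is worth surfacing separately.
            report[signal.name] = "missing"
        elif signal.is_healthy(observed[signal.name]):
            report[signal.name] = "healthy"
        else:
            report[signal.name] = "unhealthy"
    return report


# Hypothetical signals and ranges, for illustration only.
SIGNALS = [
    Signal("ingest_lag_seconds", 0, 30, "s"),
    Signal("error_rate_percent", 0, 1, "%"),
]

if __name__ == "__main__":
    readings = {"ingest_lag_seconds": 95.0}  # made-up reading
    for name, status in evaluate(SIGNALS, readings).items():
        print(f"{name}: {status}")
```

Keeping the expected range next to the signal name means the guide states what "healthy" means in one place, and a missing reading is reported instead of being silently treated as fine.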
Available guides
There are no Library-native diagnostics guides yet. For service-specific inspection guides, see the cross-repo Runbooks index and the Systems overviews. Add cross-cutting diagnostics here (e.g. tracing a request from the control plane to the data plane).