Production Incident Response

Something is broken in production right now? Start here. This page is the investigation entrypoint: how to navigate this repo to figure out what's wrong, stop customer impact, and find root cause — whether you're a human on-call or an agent.

📟 Paged by an alert? Go straight to the Alert → Runbook index and look up the exact alert title from the PagerDuty page — it maps the alert name to the runbook that owns it.

Investigation flow — where to look, in order

Know the alert name? Use the Alert → Runbook index. It maps each PagerDuty/Grafana alert title (e.g. Vespa Disk Utilization > 80%, Ecom API 4XX Rate Anomalous) to its owning runbook. If your symptom isn't a named alert, or the alert isn't listed, continue with the steps below.
Find the runbook for the symptom. Open the cross-repo Runbooks index, which aggregates every runbook across all service repos in one place. Start with this index when you don't yet know which service owns the failure. (The index is generated at build time; if it's missing, run npm run aggregate.)
Identify the affected system / plane. Use the Systems overviews to map the symptom to a plane and its primary repos:
- Control plane — account/index management, APIs, auth, deployment.
- Data plane — running indexes: ingest, query, reindexing, EKS.
- Ecommerce — customer-facing ecommerce & agentic APIs.
The "how a request flows" section there is the fastest way to reason about where in the request path the failure is.
Diagnose — find root cause. Use the Diagnostics guides for how to inspect a system (dashboards, logs, queries, healthy-vs-unhealthy signals) without assuming you already know the cause.
Drill into the owning repo. Find the repo on the Repository Map, then read its imported docs under docs/repo-docs/<repo>/, or inspect the source directly in repos/<repo>/ (run git submodule update --init --recursive first if the submodule isn't checked out — see AGENTS.md at the repo root).
Escalate to the owning team. Teams lists ownership and contacts when you need the people who know the system.
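
The first step of the flow above is an exact-title lookup with a fallback to the remaining steps. The sketch below is illustrative only: it assumes the index can be treated as a plain title-to-runbook mapping, the two alert titles are the examples from this page, and the runbook paths and function names are hypothetical placeholders, not real files or tooling in this repo.

```python
from typing import Dict, Optional

# Hypothetical excerpt of the Alert -> Runbook index. The alert titles are the
# examples quoted on this page; the runbook paths are made-up placeholders.
ALERT_TO_RUNBOOK: Dict[str, str] = {
    "Vespa Disk Utilization > 80%": "runbooks/example-vespa-disk.md",
    "Ecom API 4XX Rate Anomalous": "runbooks/example-ecom-4xx.md",
}


def _normalize(title: str) -> str:
    """Collapse whitespace and ignore case so a pasted title still matches."""
    return " ".join(title.split()).casefold()


def find_runbook(alert_title: str,
                 index: Dict[str, str] = ALERT_TO_RUNBOOK) -> Optional[str]:
    """Return the owning runbook for an alert title, or None if unlisted.

    None means "not a named alert, or not in the index": continue with the
    symptom-based steps of the investigation flow.
    """
    wanted = _normalize(alert_title)
    for title, runbook in index.items():
        if _normalize(title) == wanted:
            return runbook
    return None
```

A None result is not an error; it corresponds to the "continue with the steps below" branch of the first step.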

What a good incident runbook covers

Service-specific incident runbooks live with the owning repo (and surface in the cross-repo index above). Each should answer, in order:

Detect — what alert/symptom triggers it, and how to confirm it.
Mitigate — the fastest safe action to stop customer impact (roll back, fail over, scale, disable a feature).
Diagnose — where to look for root cause (link to a diagnostics guide rather than duplicating it).
Recover — how to return to a healthy steady state and verify it.
Follow up — what to capture for the postmortem.
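
The ordering requirement above can be checked mechanically. This is a minimal sketch, assuming each section starts on its own line beginning with the section name (optionally behind Markdown "#" markers); that heading convention and the function name are assumptions, not something the runbook template is known to enforce.

```python
from typing import List

REQUIRED_SECTIONS = ["Detect", "Mitigate", "Diagnose", "Recover", "Follow up"]


def missing_or_misordered(runbook_text: str) -> List[str]:
    """Return the required sections that are absent or appear out of order."""
    # Assumption: a section begins on a line that starts with its name,
    # optionally preceded by '#' heading markers.
    lines = [ln.lstrip("# ").strip().casefold()
             for ln in runbook_text.splitlines()]
    problems: List[str] = []
    cursor = 0  # only search after the previously matched section
    for section in REQUIRED_SECTIONS:
        name = section.casefold()
        for i in range(cursor, len(lines)):
            if lines[i].startswith(name):
                cursor = i + 1
                break
        else:
            # Not found after the previous section: missing or misordered.
            problems.append(section)
    return problems
```

Because the match is a simple prefix test, a body sentence that happens to start with "Detect" or "Recover" would also count as a heading; a stricter check would need to know the real heading syntax.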

Library-native incident pages

Alert → Runbook index — every recurring PagerDuty/Grafana alert title mapped to its owning runbook. The fastest path from "I got paged" to "I'm following the runbook".
On-call incident doc review (June 2026) — a deduplicated review of recent #cloud-production-incidents alerts: what fired, whether the resolution is known, whether it's determinable from current docs, and the doc changes made in response.

Most incident runbooks still live with the service that owns the failure mode — see the cross-repo Runbooks index (the data plane and control plane carry the majority). Add a runbook here only when the response genuinely spans multiple systems (e.g. an outage that crosses control plane → data plane).

To add one, copy templates/runbook-template.md (repo root) into this folder and record an owner and last_reviewed date.
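
A quick way to confirm that a new runbook records both fields is sketched below. It assumes the template stores owner and last_reviewed as "key: value" lines in a front-matter block delimited by "---" and that the date is ISO formatted; check templates/runbook-template.md for the actual layout, since both points and the function names here are assumptions.

```python
from datetime import date
from typing import Dict, List


def read_front_matter(text: str) -> Dict[str, str]:
    """Parse simple 'key: value' lines between leading '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def metadata_problems(text: str) -> List[str]:
    """List what is wrong with the owner / last_reviewed metadata."""
    fields = read_front_matter(text)
    problems: List[str] = []
    if not fields.get("owner"):
        problems.append("owner is missing")
    try:
        date.fromisoformat(fields.get("last_reviewed", ""))
    except ValueError:
        problems.append("last_reviewed is missing or not an ISO date")
    return problems
```

An empty list means both fields are present and the date parses; the owner value itself is not validated against the Teams page.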
