Alert → Runbook index
When PagerDuty pages you from #cloud-production-incidents, the page title is the alert name. This page maps that alert name to the runbook that owns it, so you don't have to guess which repo or service is responsible.
How to use: find your alert in the table (Ctrl/Cmd-F the exact alert title), open the linked runbook, and follow it. If your alert isn't listed, fall back to the cross-repo Runbooks index and the investigation flow.
Runbooks live in the owning service repo and are imported here at build time. Links below point at the imported copy under repo-docs/. A few entries are marked (pending merge): the runbook exists in this working tree but hasn't landed upstream yet, so the link will resolve once the owning repo merges it.
Most frequent alerts (read these first)
| Alert (PagerDuty/Grafana title) | Plane | Runbook | What it usually is |
|---|---|---|---|
| Vespa Disk Utilization > 80% (per index, e.g. production-tenant-v3 / hyuhke9a) | Data plane | vespa_disk_utilization (pending merge) | A tenant's Vespa content disk filling up. Increase the index's storage; usually safe to do in business hours unless ≥~90% or climbing fast. Not the same as KServe node disk. |
| Ecom API 4XX Rate Anomalous vs Baseline / Ecom API 4XX Rate Exceeds 25% | Control plane (ecom) | ecom-4xx-errors | Often a customer sending a bad/unknown param (e.g. geoLocation) or an oversensitive alarm at low volume / around pixel updates. Triage customer-vs-platform via Cloudflare logs. |
| Ecom Internal Indexer Job Success Rate Below 99% | Control plane + data plane | ecom-internal-indexer-job-success-rate | Frequently timeouts on a healthy server when cell /documents latency rises: jobs are marked ERROR even though Marqo did the work. One cell wobble can fire a dozen or more per-index alerts at once. |
| Fluent Bit Unhealthy Pods (cell2-MultitenantEKSCluster) | Data plane | fluent_bit_unhealthy_pods | A per-node log-shipper daemonset pod transiently unhealthy. Worst case = delayed logs from some nodes. Usually self-resolves; often safe to do nothing once resolved. |
| ControlPlaneSimpleHealthCheckFailureAlarm | Control plane (CloudWatch) | No runbook yet — see the incident doc review | Synthetic control-plane health check failing intermittently. Resolution is not yet documented — capture findings when you next get paged so we can write the runbook. |
Data plane (cloud_data_plane)
| Alert | Runbook |
|---|---|
| Vespa Disk Utilization > 80% | vespa_disk_utilization (pending merge) |
| Index Unreachable | index_unreachability |
| Fluent Bit Unhealthy Pods | fluent_bit_unhealthy_pods |
| Cloudflare Requests 5xx Errors / Reverse Proxy 5xx | reverse_proxy_5xx |
| Cloudflare Origin p95 Latency | cf_origin_p95_latency |
| Cloudflare 429 Rate | cloudflare_429_rate |
| KServe Node Disk Utilization (GPU node root fs — not Vespa disk) | kserve_node_disk_util |
| KServe Inference Error Rate | kserve_inference_error_rate |
| Edge Unreachability | edge_unreachability |
| CoreDNS Error Rate / Unhealthy Pods | coredns_error_rate · coredns_unhealthy_pods |
| KEDA Operator / Scaler / Scaled Object | keda_operator_unhealthy · keda_scaler_errors · keda_scaled_object_errors |
| Prometheus Server / Autoscaling Unhealthy | prometheus_server_unhealthy · prometheus_autoscaling_unhealthy |
For the full data-plane runbook set, see the cross-repo Runbooks index.
Control plane / ecommerce (cloud_control_plane)
| Alert | Runbook |
|---|---|
| Ecom API 4XX Rate (Anomalous / Exceeds 25%) | ecom-4xx-errors |
| Ecom API 5XX Errors | ecom-5xx-errors |
| Ecom API Success Rate Below 98% | ecom-success-rate |
| Ecom Internal Indexer Job Success Rate Below 99% | ecom-internal-indexer-job-success-rate |
| Ecom Agentic Search / Converse 5XX RPS | ecom-agentic-search-5xx-rps · ecom-agentic-converse-5xx-rps |
| Ecom Queue Backlog Increasing | ecom-queue-backlog-increasing |
| Merchandising Exporter Lambda Has Errors | merchandising-exporter-lambda-errors |
| Ecom Settings / Metrics Worker Lambda Errors | ecom-settings-exporter-lambda-errors · ecom-metrics-worker-lambda-errors |
| Controller API 4XX Rate Anomalous vs Baseline | controller-api-4xx-rate-anomaly |
| Controller API 5XX Rate Exceeds 5% in 5m | controller-api-5xx-rate |
| Admin Worker Failed Requests | admin-api-failed-requests |
| Prod Ecom Monitoring Service Alarm | prod-ecom-monitoring-service-alarm |
| ControlPlaneSimpleHealthCheckFailureAlarm | No runbook yet — see incident doc review |
For the full control-plane runbook set (incidents, diagnostics, operations, post-incident reports), see the cross-repo Runbooks index.
Keeping this current
This index is hand-maintained. When you add or rename an incident runbook in a service repo, add the alert title here too — the alert name is what an on-call sees first, and it's the fastest path from "I got paged" to "I'm following the runbook". A recurring finding from the June 2026 incident doc review is that discoverability, not missing content, is the main gap.
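Because the index is maintained by hand, a small drift check can flag runbooks that exist but are not listed here. The following is a minimal sketch, not an existing tool: it assumes each imported runbook is a single Markdown file under repo-docs/ whose file stem matches the slug used in the tables (for example ecom-4xx-errors.md), and the function names are hypothetical.

```python
import re
from pathlib import Path


def runbook_slugs(docs_root: Path) -> set[str]:
    """Collect runbook slugs (file stems) from Markdown files under docs_root.

    Assumption: one .md file per runbook, stem == slug used in the index.
    """
    return {p.stem for p in docs_root.rglob("*.md")}


def unlisted_runbooks(index_text: str, slugs: set[str]) -> list[str]:
    """Return the slugs that never appear as a whole token in the index text."""
    # A token is a run of letters, digits, underscores, and hyphens, so a slug
    # only counts as listed when it appears in full, not as part of a longer name.
    tokens = set(re.findall(r"[A-Za-z0-9_-]+", index_text))
    return sorted(slugs - tokens)


# Inline demo with two rows copied from the tables above.
sample_index = (
    "| Index Unreachable | index_unreachability |\n"
    "| Ecom API 5XX Errors | ecom-5xx-errors |\n"
)
sample_slugs = {"index_unreachability", "ecom-5xx-errors", "ecom-success-rate"}
missing = unlisted_runbooks(sample_index, sample_slugs)
print(missing)
```

In practice you would pass the text of this page and `runbook_slugs(Path("repo-docs"))` instead of the inline sample. The check only catches runbooks with no row at all; it cannot tell whether the alert title in a row still matches what PagerDuty sends, so renamed alerts still need a manual look.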