Alert → Runbook index

When PagerDuty pages you from #cloud-production-incidents, the page title is the alert name. This page maps that alert name to the runbook that owns it, so you don't have to guess which repo or service is responsible.

How to use: find your alert in the table (Ctrl/Cmd-F the exact alert title), open the linked runbook, and follow it. If your alert isn't listed, fall back to the cross-repo Runbooks index and the investigation flow.
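The lookup described above can be sketched in a few lines of Python. This is only an illustration, not tooling that exists in any repo: `INDEX` is a hand-copied excerpt of the tables on this page, `find_runbook` and `FALLBACK` are hypothetical names, and the assumption that a page title contains the indexed alert title as a substring (for example with a per-index suffix) may not hold for every alert.

```python
# Hypothetical excerpt of this page's tables: alert title -> runbook name.
INDEX = {
    "Vespa Disk Utilization > 80%": "vespa_disk_utilization",
    "Ecom API 4XX Rate Anomalous vs Baseline": "ecom-4xx-errors",
    "Ecom API 4XX Rate Exceeds 25%": "ecom-4xx-errors",
    "Fluent Bit Unhealthy Pods": "fluent_bit_unhealthy_pods",
    "Index Unreachable": "index_unreachability",
}

# Where to go when the alert is not listed (see the paragraph above).
FALLBACK = "cross-repo Runbooks index"


def find_runbook(page_title: str) -> str:
    """Return the runbook for a page title, or FALLBACK if nothing matches."""
    # Collapse whitespace and ignore case so copy-pasted titles still match.
    normalized = " ".join(page_title.lower().split())
    # Try longer titles first so a more specific entry wins over a shorter one.
    for title in sorted(INDEX, key=len, reverse=True):
        if title.lower() in normalized:
            return INDEX[title]
    return FALLBACK
```

Substring matching is a deliberate simplification here: it tolerates extra text around the alert name in the page title, at the cost of possible false matches if one alert title is contained in another.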

Runbooks live in the owning service repo and are imported here at build time. Links below point at the imported copy under repo-docs/. A few entries are marked (pending merge) — the runbook exists in this working tree but hasn't landed upstream yet, so the link will resolve once the owning repo merges it.

Most frequent alerts (read these first)

Alert (PagerDuty/Grafana title)	Plane	Runbook	What it usually is
Vespa Disk Utilization > 80% (per index, e.g. `production-tenant-v3` / `hyuhke9a`)	Data plane	vespa_disk_utilization (pending merge)	A tenant's Vespa content disk is filling up. Increase the index's storage; this is usually safe to do in business hours unless utilization is at or above roughly 90% or climbing fast. Not the same as KServe node disk.
Ecom API 4XX Rate Anomalous vs Baseline / Ecom API 4XX Rate Exceeds 25%	Control plane (ecom)	ecom-4xx-errors	Often a customer sending a bad/unknown param (e.g. `geoLocation`) or an oversensitive alarm at low volume / around pixel updates. Triage customer-vs-platform via Cloudflare logs.
Ecom Internal Indexer Job Success Rate Below 99%	Control plane + data plane	ecom-internal-indexer-job-success-rate	Frequently timeouts on a healthy server when cell `/documents` latency rises — jobs are marked ERROR though marqo did the work. One cell wobble can fire a dozen+ per-index alerts at once.
Fluent Bit Unhealthy Pods (`cell2-MultitenantEKSCluster`)	Data plane	fluent_bit_unhealthy_pods	A per-node log-shipper daemonset pod transiently unhealthy. Worst case = delayed logs from some nodes. Usually self-resolves — often safe to do nothing once resolved.
ControlPlaneSimpleHealthCheckFailureAlarm	Control plane (CloudWatch)	No runbook yet — see the incident doc review	Synthetic control-plane health check failing intermittently. Resolution is not yet documented — capture findings when you next get paged so we can write the runbook.

Data plane (cloud_data_plane)

Alert	Runbook
Vespa Disk Utilization > 80%	vespa_disk_utilization (pending merge)
Index Unreachable	index_unreachability
Fluent Bit Unhealthy Pods	fluent_bit_unhealthy_pods
Cloudflare Requests 5xx Errors / Reverse Proxy 5xx	reverse_proxy_5xx
Cloudflare Origin p95 Latency	cf_origin_p95_latency
Cloudflare 429 Rate	cloudflare_429_rate
KServe Node Disk Utilization (GPU node root fs — not Vespa disk)	kserve_node_disk_util
KServe Inference Error Rate	kserve_inference_error_rate
Edge Unreachability	edge_unreachability
CoreDNS Error Rate / Unhealthy Pods	coredns_error_rate · coredns_unhealthy_pods
KEDA Operator / Scaler / Scaled Object	keda_operator_unhealthy · keda_scaler_errors · keda_scaled_object_errors
Prometheus Server / Autoscaling Unhealthy	prometheus_server_unhealthy · prometheus_autoscaling_unhealthy

For the full data-plane runbook set, see the cross-repo Runbooks index.

Control plane / ecommerce (cloud_control_plane)

Alert	Runbook
Ecom API 4XX Rate (Anomalous / Exceeds 25%)	ecom-4xx-errors
Ecom API 5XX Errors	ecom-5xx-errors
Ecom API Success Rate Below 98%	ecom-success-rate
Ecom Internal Indexer Job Success Rate Below 99%	ecom-internal-indexer-job-success-rate
Ecom Agentic Search / Converse 5XX RPS	ecom-agentic-search-5xx-rps · ecom-agentic-converse-5xx-rps
Ecom Queue Backlog Increasing	ecom-queue-backlog-increasing
Merchandising Exporter Lambda Has Errors	merchandising-exporter-lambda-errors
Ecom Settings / Metrics Worker Lambda Errors	ecom-settings-exporter-lambda-errors · ecom-metrics-worker-lambda-errors
Controller API 4XX Rate Anomalous vs Baseline	controller-api-4xx-rate-anomaly
Controller API 5XX Rate Exceeds 5% in 5m	controller-api-5xx-rate
Admin Worker Failed Requests	admin-api-failed-requests
Prod Ecom Monitoring Service Alarm	prod-ecom-monitoring-service-alarm
ControlPlaneSimpleHealthCheckFailureAlarm	No runbook yet — see incident doc review

For the full control-plane runbook set (incidents, diagnostics, operations, post-incident reports), see the cross-repo Runbooks index.

Keeping this current

This index is hand-maintained. When you add or rename an incident runbook in a service repo, add the alert title here too — the alert name is what an on-call sees first, and it's the fastest path from "I got paged" to "I'm following the runbook". A recurring finding from the June 2026 incident doc review is that discoverability, not missing content, is the main gap.
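Because the index is hand-maintained, a small check can flag runbooks that were added without an entry here. The sketch below is an assumption-laden illustration, not an existing script: `missing_from_index` and `runbook_names_in` are hypothetical helpers, and it assumes runbooks are Markdown files whose file stem is the runbook name used in the tables above.

```python
from pathlib import Path


def missing_from_index(index_text: str, runbook_names: list[str]) -> list[str]:
    """Return the runbook names that never appear in the index text, sorted."""
    # Plain substring check: enough to catch a runbook with no entry at all.
    return sorted(name for name in set(runbook_names) if name not in index_text)


def runbook_names_in(directory: Path) -> list[str]:
    """Collect runbook names as the stems of Markdown files under a directory.

    Assumes one runbook per .md file; adjust if the repo layout differs.
    """
    return [path.stem for path in directory.rglob("*.md")]
```

A possible use is to pass the text of this page and the output of `runbook_names_in` for the imported runbook directory, then review whatever comes back. The check only confirms that a runbook name is mentioned somewhere; it does not verify that the alert title next to it is correct.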
