
Alert → Runbook index

When PagerDuty pages you from #cloud-production-incidents, the page title is the alert name. This page maps that alert name to the runbook that owns it, so you don't have to guess which repo or service is responsible.

How to use: find your alert in the table (Ctrl/Cmd-F the exact alert title), open the linked runbook, and follow it. If your alert isn't listed, fall back to the cross-repo Runbooks index and the investigation flow.
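If you would rather script the lookup than search by hand, the same matching can be sketched in a few lines of Python. This is a minimal sketch, not an existing tool: the `RUNBOOKS` dict is a hand-copied subset of the tables on this page, `find_runbook` is a hypothetical helper, and it assumes the page title starts with the alert name and may carry a per-index or per-cluster suffix.

```python
from typing import Optional

# Hypothetical, hand-copied subset of this page: alert title -> runbook name.
RUNBOOKS = {
    "Vespa Disk Utilization > 80%": "vespa_disk_utilization",
    "Fluent Bit Unhealthy Pods": "fluent_bit_unhealthy_pods",
    "Index Unreachable": "index_unreachability",
    "Ecom API 5XX Errors": "ecom-5xx-errors",
}


def find_runbook(page_title: str) -> Optional[str]:
    """Return the runbook whose alert title is the longest prefix of the page title."""
    title = page_title.strip().casefold()
    matches = [alert for alert in RUNBOOKS if title.startswith(alert.casefold())]
    if not matches:
        # Not listed: fall back to the cross-repo Runbooks index.
        return None
    return RUNBOOKS[max(matches, key=len)]
```

Prefix matching is used so that a suffix such as `(cell2-MultitenantEKSCluster)` does not prevent a match; a `None` result corresponds to the fallback path described above.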

Runbooks live in the owning service repo and are imported here at build time. Links below point at the imported copy under repo-docs/. A few entries are marked (pending merge) — the runbook exists in this working tree but hasn't landed upstream yet, so the link will resolve once the owning repo merges it.

Most frequent alerts (read these first)

| Alert (PagerDuty/Grafana title) | Plane | Runbook | What it usually is |
| --- | --- | --- | --- |
| Vespa Disk Utilization > 80% (per index, e.g. production-tenant-v3 / hyuhke9a) | Data plane | vespa_disk_utilization (pending merge) | A tenant's Vespa content disk filling up. Increase the index's storage; usually safe to do in business hours unless ≥~90% or climbing fast. Not the same as KServe node disk. |
| Ecom API 4XX Rate Anomalous vs Baseline / Ecom API 4XX Rate Exceeds 25% | Control plane (ecom) | ecom-4xx-errors | Often a customer sending a bad/unknown param (e.g. geoLocation) or an oversensitive alarm at low volume / around pixel updates. Triage customer-vs-platform via Cloudflare logs. |
| Ecom Internal Indexer Job Success Rate Below 99% | Control plane + data plane | ecom-internal-indexer-job-success-rate | Frequently timeouts on a healthy server when cell /documents latency rises; jobs are marked ERROR though marqo did the work. One cell wobble can fire a dozen+ per-index alerts at once. |
| Fluent Bit Unhealthy Pods (cell2-MultitenantEKSCluster) | Data plane | fluent_bit_unhealthy_pods | A per-node log-shipper daemonset pod transiently unhealthy. Worst case = delayed logs from some nodes. Usually self-resolves; often safe to do nothing once resolved. |
| ControlPlaneSimpleHealthCheckFailureAlarm | Control plane (CloudWatch) | No runbook yet; see the incident doc review | Synthetic control-plane health check failing intermittently. Resolution is not yet documented; capture findings when you next get paged so we can write the runbook. |

Data plane (cloud_data_plane)

| Alert | Runbook |
| --- | --- |
| Vespa Disk Utilization > 80% | vespa_disk_utilization (pending merge) |
| Index Unreachable | index_unreachability |
| Fluent Bit Unhealthy Pods | fluent_bit_unhealthy_pods |
| Cloudflare Requests 5xx Errors / Reverse Proxy 5xx | reverse_proxy_5xx |
| Cloudflare Origin p95 Latency | cf_origin_p95_latency |
| Cloudflare 429 Rate | cloudflare_429_rate |
| KServe Node Disk Utilization (GPU node root fs, not Vespa disk) | kserve_node_disk_util |
| KServe Inference Error Rate | kserve_inference_error_rate |
| Edge Unreachability | edge_unreachability |
| CoreDNS Error Rate / Unhealthy Pods | coredns_error_rate · coredns_unhealthy_pods |
| KEDA Operator / Scaler / Scaled Object | keda_operator_unhealthy · keda_scaler_errors · keda_scaled_object_errors |
| Prometheus Server / Autoscaling Unhealthy | prometheus_server_unhealthy · prometheus_autoscaling_unhealthy |

For the full data-plane runbook set, see the cross-repo Runbooks index.

Control plane / ecommerce (cloud_control_plane)

| Alert | Runbook |
| --- | --- |
| Ecom API 4XX Rate (Anomalous / Exceeds 25%) | ecom-4xx-errors |
| Ecom API 5XX Errors | ecom-5xx-errors |
| Ecom API Success Rate Below 98% | ecom-success-rate |
| Ecom Internal Indexer Job Success Rate Below 99% | ecom-internal-indexer-job-success-rate |
| Ecom Agentic Search / Converse 5XX RPS | ecom-agentic-search-5xx-rps · ecom-agentic-converse-5xx-rps |
| Ecom Queue Backlog Increasing | ecom-queue-backlog-increasing |
| Merchandising Exporter Lambda Has Errors | merchandising-exporter-lambda-errors |
| Ecom Settings / Metrics Worker Lambda Errors | ecom-settings-exporter-lambda-errors · ecom-metrics-worker-lambda-errors |
| Controller API 4XX Rate Anomalous vs Baseline | controller-api-4xx-rate-anomaly |
| Controller API 5XX Rate Exceeds 5% in 5m | controller-api-5xx-rate |
| Admin Worker Failed Requests | admin-api-failed-requests |
| Prod Ecom Monitoring Service Alarm | prod-ecom-monitoring-service-alarm |
| ControlPlaneSimpleHealthCheckFailureAlarm | No runbook yet; see the incident doc review |

For the full control-plane runbook set (incidents, diagnostics, operations, post-incident reports), see the cross-repo Runbooks index.

Keeping this current

This index is hand-maintained. When you add or rename an incident runbook in a service repo, add the alert title here too — the alert name is what an on-call sees first, and it's the fastest path from "I got paged" to "I'm following the runbook". A recurring finding from the June 2026 incident doc review is that discoverability, not missing content, is the main gap.
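Because the index is maintained by hand, a small drift check can help catch alerts that never made it onto this page. The following is a minimal sketch under stated assumptions: `missing_from_index` is a hypothetical helper, and how you collect the two lists (alert titles from your alert definitions, indexed titles from this page) is left open because it is not described here.

```python
from typing import Iterable, List


def missing_from_index(alert_titles: Iterable[str], indexed_titles: Iterable[str]) -> List[str]:
    """Return alert titles that have no entry in the index (case-insensitive)."""
    indexed = {title.strip().casefold() for title in indexed_titles}
    return sorted(
        title for title in alert_titles
        if title.strip().casefold() not in indexed
    )


# Illustrative inputs only; "Some New Alert" is a made-up title.
alerts = ["Index Unreachable", "Ecom Queue Backlog Increasing", "Some New Alert"]
index = ["Index Unreachable", "Ecom Queue Backlog Increasing"]
print(missing_from_index(alerts, index))  # ['Some New Alert']
```

Run as part of a docs build or a review checklist, a non-empty result would flag titles to add here before the next on-call hits them.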