On-Call Incident Documentation Review — #cloud-production-incidents
Status — 2026-06-20 (actions taken in this pass). This review drove the following concrete changes:
- Committed to
library:
- New Alert → Runbook index (
docs/runbooks/incidents/alert-index.md) — closes the discoverability gap (finding D).- This report.
incidents/index.mdupdated to surface both, with a "paged by an alert?" fast path.- Drafted as submodule runbook uplifts, left UNCOMMITTED in the local working tree under
repos/for the owning teams to PR (per the task constraint — Library only aggregates these repos, it does not own them). The newrepos/cloud_data_plane/runbooks/vespa_disk_utilization.md(the one genuinely missing high-frequency runbook) and the seven enrich/new drafts in §6(B) are applied to the checked-out submodules but not committed.- Out of scope (owning-team engineering): the alert-tuning, indexer client-timeout/retry-backoff, and
ControlPlaneSimpleHealthCheckFailureAlarmroot-cause items in §6(C).
Scope: Production on-call alerts observed in #cloud-production-incidents Prepared: 2026-06-20 Audience: Marqo on-call engineers, data-plane and control-plane owning teams
1. Summary
Documentation coverage is broad — the data plane ships ~28 runbooks, and the control plane has a deep incident/diagnostics/maintenance runbook tree backed by real post-incident reports — so the dominant problem is not missing fixes. Across nine deduplicated incident types, eight have an owning (or near-owning) runbook and most resolutions are already known to the team. The genuine gaps are concentrated: (a) exactly one high-frequency alert has no runbook at all — "Vespa Disk Utilization > 80%", the single most frequent alert, whose Grafana rule points only at an external Notion page and whose real remediation lives unnarrated in index_workflows_v4 storage-resizing code; (b) there is no alert→runbook discoverability index, so several alerts (notably the Datadog-sourced "Cloudflare Requests 5xx Errors" and ControlPlaneSimpleHealthCheckFailureAlarm) cannot be routed from their firing name to any doc; and (c) a pervasive alert-oversensitivity/noise problem at low volume drives most of the recurring pages (per-index 4XX anomaly, internal-indexer success rate, single-error Lambda alerts) where the correct action is frequently "tune the alarm / do nothing," yet runbooks push toward over-remediation. The net recommendation is light on net-new runbooks and heavy on discoverability, "safe to do nothing" framing, and alert tuning owned by the service teams.
2. Method
- Source A — Slack: the full
#cloud-production-incidentshistory was read and deduplicated from raw alert firings into distinct incident types, capturing human diagnosis/resolution notes (Oliver, Pandu, Mehul, Owen) and the auto-resolve behavior of bot integrations (Prometheus, "Marqo Cloud v2 - Datadog"). - Source B — Repo audit: for each incident type, the candidate owning runbook and adjacent docs were read in full across the
librarysubmodules (repos/cloud_data_plane,repos/cloud_control_plane,repos/cloud_control_plane_cdk), plus the alert definitions themselves (Grafana alert JSON/.tf, AMP recording rules,ecom_alerts.tf, CDK alarms) and the remediation code paths (e.g.index_workflows_v4/src/storage_resizing/). Each assessment cross-checked the alert→runbook wiring (runbook_url) and grepped for the incident's true signature terms. - Scope window: alert firings observed 2026-06-18 → 2026-06-20, with supporting post-incident reports referenced where they exist (
docs/incidents/reports/2026-06-18-kserve-vram-oom.md,2026-06-19-kogan-latency-blips.md,2026-06-19-kark-p90.md). - Bar applied: "determinable from current docs" means a non-domain-expert on-call could investigate, diagnose, and resolve the incident by following the linked docs alone — not merely that an expert could.
3. Deduplicated incident table
| Alert / Symptom | System | Frequency | Root cause known? | Resolution known? | Determinable from docs? | Owning runbook | Verdict |
|---|---|---|---|---|---|---|---|
| Vespa Disk Utilization > 80% (per-tenant, e.g. production-tenant-v3) | data-plane | Most frequent; many/day, flaps near 80% | Yes | Yes | No | NONE (Grafana → external Notion only) | Missing |
| Ecom API 4XX Rate Anomalous / > 25% (fashionnova, karkkainen) | control-plane (ecom) | Recurring, multiple/day | Yes | Yes | Partial | ecom-4xx-errors.md | Partial |
| Ecom API Success Rate Below 98% (karkkainen-prod-000) | control-plane (ecom) | Occasional | Partial | Partial | Partial | ecom-success-rate.md | Partial |
| Index Unreachable (e.g. j3grp7ot, bf16) | data-plane | Bursty (19 in one burst) | Partial | Partial | Partial | index_unreachability.md | Partial |
| Fluent Bit Unhealthy Pods (cell2 EKS) | data-plane | Recurring, transient | Yes | Yes | Partial | fluent_bit_unhealthy_pods.md | Partial |
| Cloudflare Requests 5xx Errors (Datadog-sourced) | data-plane / edge | Recurring, auto-resolves ~2 min | Yes | Yes | No | reverse_proxy_5xx.md (different alert) | Partial |
| Merchandising Exporter Lambda Has Errors | control-plane (ecom) | Recurring, sev 2.5, auto-resolves | Yes | Yes | Partial | merchandising-exporter-lambda-errors.md | Partial |
| ControlPlaneSimpleHealthCheckFailureAlarm | control-plane (CloudWatch) | Recurring, ~daily; only ever acknowledged | No | No | No | NONE (no def in checked-out IaC) | Missing |
| Ecom Internal Indexer Job Success Rate < 99% (+ per-index cascade) | control-plane (ecom indexer) + data-plane (cell2 /documents latency) | Episodic; one wobble → dozen+ alerts | Yes | Partial | Partial | ecom-internal-indexer-job-success-rate.md | Partial |
4. Per-incident detail
4.1 Vespa Disk Utilization > 80% — Missing (highest priority)
What fired: the per-tenant recording rule min_over_time(recording:vespa_disk_utilization_pct[5m]) > 80, flapping right around threshold; chronic for production-tenant-v3. This is the most frequent alert by far.
Real root cause & resolution: per-tenant Vespa content PVC filling up; the fix is to increase index storage capacity. Two real levers exist in code: ModifyIndexWorkflow → StorageResizer (components/index_workflows_v4/src/storage_resizing/storage_resizer.py, adds shards/replicas — each content node has its own PVC), and the migration f_20260311091930-switch_to_balanced_storage.py (grows content PVCs 15Gi→63Gi by moving balanced_throughput → balanced_storage).
Thread evidence: Oliver: "this has been pretty flappy, sitting right around 75%. Should we increase something or just keep an eye on it for the weekend?"; Pandu: "maybe we can increase it tomorrow? the disk increasing thing script is pretty straightforward … it is more to keep the alarm quiet."
Doc gap: no runbook maps this alert to its remediation. The Grafana rule's runbook_url points at an external Notion page outside the audited docs. The nearest candidate, kserve_node_disk_util.md, is the wrong subject (GPU node root-fs > 90%, remediated by cordon/drain/replace) and would actively mislead the on-call into draining GPU nodes. index_unreachability.md only covers inference-replica autoscaling (a compute axis, not disk).
Recommended change: author repos/cloud_data_plane/runbooks/vespa_disk_utilization.md keyed to the alert; document the metric/flapping behavior, the PVC root cause (distinct from kserve root-fs), both remediation levers (StorageResizer add-shards/replicas; the balanced_storage PVC-growth migration), prerequisites, the "keep the alarm quiet" framing, and an act-now-vs-monitor decision. Repoint the Grafana runbook_url at it.
4.2 Ecom API 4XX Rate — Partial
What fired: "4XX Rate Anomalous vs Baseline" / "> 25%" on fashionnova and karkkainen tenants.
Real root cause & resolution: (a) a client sent an unrecognized param geoLocation to POST /recommendations/similar (Darin experiment traffic), which a strict Zod schema rejects with Unrecognized param(s) (recommendations.ts:220); fashionnova rolled back. (b)/(c)/(d) the anomaly alarm is oversensitive — pixel updates shift the baseline, low volume hair-triggers it, and a stale _bf16 duplicate alarm re-fired.
Thread evidence: Mehul: "Client error 400: Unrecognized param(s): geoLocation"; Owen: "they are sending it to recommend"; Pandu: "new alarm that seems oversensitive … due to pixel updates … if it is noise we will learn to ignore it … eventually tune the noisiness."
Doc gap: the runbook is good but doesn't explain endpoint-specific strict validation (geoLocation is valid on /search but rejected on recommendations endpoints), gives no worked example mapping a log line to the cause, and — unlike its sibling controller-api-4xx-rate-anomaly.md — has no alarm-noise/tuning guidance to let an on-call confidently conclude "this is noise, ignore/tune."
Recommended change: enrich ecom-4xx-errors.md with (1) an endpoint-aware param-validation section + the real geoLocation → /recommendations/similar worked example (grep prod-ecom-api Cloudflare logs for Unrecognized param), linking flows/recommendations.md and design/features.md; and (2) a "when it's safe to do nothing" subsection mirroring the controller-api sibling (low-volume hair-trigger, pixel-update baseline shift, stale _bf16 duplicate).
4.3 Ecom API Success Rate Below 98% — Partial
What fired: success rate < 98% on karkkainen-prod-000.
Real root cause & resolution: a wonky index cutover — "a fork of a fork, AND possibly because the original index had been cleaned up somewhat manually before." Self-resolved once the cutover settled.
Thread evidence: Oliver: "Resolved. The cutover was wonky because it was a fork of a fork, AND possibly because the original index had been cleaned up somewhat manually before. Will mini postmortem it later."
Doc gap: ecom-success-rate.md mentions "alias/fork change" only as a one-line generic clue; ecom-5xx-errors.md frames everything as component failures and never names forks/cutovers. The fork-of-a-fork and manually-cleaned-source failure modes are documented only as Future Work in fork.md/decommission-fork-source.md, none connected to a 5XX/success-rate symptom, and decommission-fork-source.md is never linked from the incident path. The deferred mini-postmortem was never written.
Recommended change: add a "wonky fork cutover" common-cause section (detect via the forks list / fork detail page / Aliases tab; call out the two hazards; state when to wait it out vs issue a fork rollback/cleanup), cross-link decommission-fork-source.md, and write the deferred postmortem.
4.4 Index Unreachable — Partial
What fired: "Index Unreachable" on multiple indexes (e.g. j3grp7ot, bf16); 19 triggers in one burst, self-resolved by Prometheus.
Real root cause & resolution: suspected forking/bf16 cutover, correlated with pixel-update bursts ("is it pixel oclock"); transient. The deeper documented root cause for the bf16 burst is shared-T4 GPU VRAM exhaustion (2026-06-18-kserve-vram-oom.md), where more replicas worsen VRAM pressure and the failure self-heals as over-scale unwinds.
Thread evidence: Pandu: "bf16 … hmm suspicious — wondering if it has to do with the forking"; Oliver: "19 triggers … is it pixel oclock."
Doc gap: the runbook assumes the only cause is underscaling and its sole remedy is scaling out via PUT /v2/indexes/autoscaling — actively misleading here, since scaling replicas can worsen VRAM. No transient/self-resolving path, no link to the kserve VRAM-OOM report or GPU layer, and "Check logs in Athena" gives no concrete query. The report itself notes the bf16/precision and slim-image theories were ruled out as red herrings.
Recommended change: branch index_unreachability.md on cause — add a transient/self-resolving path (did the gauge already recover? correlate with pixel bursts) and an inference/GPU path linking the VRAM-OOM report + DCGM_FI_DEV_FB_FREE; add a scale-out caution; replace the Athena step with a concrete query; flag the bf16/slim-image red herrings.
4.5 Fluent Bit Unhealthy Pods — Partial
What fired: unhealthy Fluent Bit pod(s) on cell2 EKS; transient, self-resolved.
Real root cause & resolution: a DaemonSet pod transiently unhealthy on some node; worst case = delayed logs from those nodes; no action needed, alarm can be relaxed.
Thread evidence: Oliver: "self-resolved. there seem to be dozens of fluent bits. not sure what I am looking for"; Mehul: "Fluent-bit is a daemonset. Runs on each node. Collects and pushes logs to s3. We can relax the alarm a little. As the worst case scenario is delayed logs from some nodes."
Doc gap (concrete, reproduces the on-call's complaint): Step 4 uses the wrong namespace and selector — kubectl get pods -n logging -l app.kubernetes.io/name=fluent-bit, but the Helm release deploys into namespace fluent-bit as aws-for-fluent-bit (addons.tf:694-698), and the alert's recording rule matches pod=~"aws-for-fluent-bit.*" in namespace fluent-bit (amp_recording_rules.tf:37-38). The runbook's selector matches nothing, even with -A → "step 4 has no results." It also lacks what-it-is/blast-radius framing (DaemonSet = one pod per node, hence "dozens"; worst case is delayed logs) and any "usually self-resolving → do nothing / relax the alarm" gate, jumping straight to invasive remediation.
Recommended change: fix Step 4 to kubectl get pods -n fluent-bit -l app.kubernetes.io/name=aws-for-fluent-bit (+ --field-selector=status.phase!=Running to isolate the firing pod); add "what it is / blast radius" and a "triage first / safe to do nothing" gate; reserve escalate/delete/Helm-memory steps for the persistent/CrashLoop/OOMKilled case.
4.6 Cloudflare Requests 5xx Errors — Partial (not determinable from docs)
What fired: the Datadog-sourced monitor "Cloudflare Requests 5xx Errors," each firing Acknowledged then Resolved by "Marqo Cloud v2 - Datadog" within ~2 min, no human action.
Real root cause & resolution: transient Cloudflare-edge 5xx blips; auto-resolve; investigate only if sustained.
Doc gap: the candidate reverse_proxy_5xx.md is keyed to a different alert — a Grafana/Prometheus rule on origin/reverse-proxy 5xx (> 20/min), not the Cloudflare edge request series. The Datadog monitor is not defined in-repo and no index routes its name to any runbook. Worse, following reverse_proxy_5xx.md verbatim against a 2-minute self-resolving edge blip drives the on-call toward an unnecessary production autoscaling change — the opposite of the correct "do nothing / wait." The runbook offers no edge-vs-origin disambiguation and no "transient, safe to ignore" exit.
Recommended change: add an edge-vs-origin section (or a dedicated runbook keyed to the Datadog monitor) stating it auto-resolves in ~2 min, with an "investigate only if SUSTAINED beyond N min / N occurrences" gate and an explicit do-not-autoscale warning for transient edge blips.
4.7 Merchandising Exporter Lambda Has Errors — Partial
What fired: sev 2.5 Lambda-errors alert; recurring, each resolved by Prometheus within minutes, no human discussion.
Real root cause & resolution: typically a single transient Lambda error (retry/throttle/cold-start); auto-resolves on the next clean 5-min window; no action.
Doc gap: the alert design is the reason it recurs and is undocumented in the runbook — ecom_alerts.tf defines it with for = "0s", Sum(Errors) > 0 over 5 min, route_sev_2_5 = true, so any single error pages instantly and clears on the next clean window. The runbook opens on staleness/customer impact and has no "is this transient? / when to stand down" calibration, pushing a non-expert to over-investigate a no-op firing. (Its deeper triage, DDB row map, exporter-transform detail, and Muji case study are already strong for genuine sustained failures.)
Recommended change: add a short "Expected behavior / is this transient?" section at the top (single-error trigger, auto-resolves, sev 2.5 low-urgency, only escalate if it persists across windows or coincides with stale-merchandising/search alerts); file a de-flapping follow-up (longer for, or Errors > 1).
4.8 ControlPlaneSimpleHealthCheckFailureAlarm — Missing (resolution genuinely unknown)
What fired: a CloudWatch alarm re-triggering ~daily; thread shows only Acknowledged/Triggered flips by Oliver, never diagnosed.
Real root cause & resolution: unknown from history — likely an intermittent/flaky synthetic check.
Doc gap (honest): the exact alarm name returns zero hits across the entire library tree, and the alarm is not even defined in the checked-out IaC (cloud_control_plane_cdk defines unrelated alarms; nothing constructs a ControlPlaneSimpleHealthCheck). The control-plane incidents index doesn't list it; the component diagnostic never mentions a synthetic check; only the generic "Why are alarms firing?" block applies. There is no documented resolution, consistent with the history.
Recommended change: author an investigation-scaffold runbook under cloud_control_plane/docs/runbooks/incidents/ and link it from incidents/index.md. First locate the alarm's actual definition in the deploying IaC (Terraform/CloudWatch synthetic / Route53 health check — it is not in the audited submodule SHAs), then document what it probes, how to confirm real-vs-flaky (probed endpoint, Monolith ECS service + NLB/API-Gateway target health, recent deploys), remediation for the common case, and — if confirmed flaky — a retune follow-up (evaluation periods / M-of-N / treat-missing-data). State explicitly that the resolution is unknown from history.
4.9 Ecom Internal Indexer Job Success Rate < 99% (cascade) — Partial
What fired: "Internal Indexer Job Success Rate Below 99%" plus a correlated cascade of a dozen+ per-index 4XX/success-rate alerts firing at once.
Real root cause & resolution: cell2 marqo /documents (re-embed) latency rose above the indexer client timeout (p95 ~1s→2s, p99 ~2.3s→4–5s) → UPDATE_DOCUMENTS jobs time out client-side and are marked ERROR even though marqo completed most of the work (mn-kogan-prod-000: 6,284 PATCH→200, ~13 5xx). Low-volume indexes drop to 0% → a dozen+ alerts fire together. Co-symptoms: marqo HAProxy/pod restart burst + Fluent Bit unhealthy. Self-heals on retry; real fix is to raise the indexer client timeout / add retry-backoff and investigate the cell latency wobble.
Thread evidence: Pandu: "these are timeouts on a healthy server, not server failures … a few timed-out jobs + ~0 completions in the 30-min window drops each to 0% → a dozen+ alerts fire at once … Real fix: raise the indexer client timeout / add retry-backoff … per-index internal-indexer success rate is hair-trigger at low volume — consider a min-volume floor or a fleet-level alert."
Doc gap: the runbook's failure model is binary (4XX = bad payload, 5XX = server failed) and has no "timeouts on a healthy server" class — the exact signature here (many PATCH→200 but job ERROR). There is no cross-alert correlation anywhere (on-call would open a dozen parallel investigations instead of one cell-level cause), the real fix is in no doc, the low-volume hair-trigger warning is absent, and the data-plane /documents latency wobble has no runbook — it appears only in 2026-06-19-kogan-latency-blips.md, which is not a runbook, isn't linked, and frames the slow write path as a separate standing ticket.
Recommended change: add to ecom-internal-indexer-job-success-rate.md: (1) a "timeouts on a healthy server" failure class (check destination-cell /documents latency before assuming payload/platform failure); (2) a cross-alert correlation note (one cell wobble → many alerts = one incident); (3) the real remediation (raise client timeout / retry-backoff; self-heals; no customer action); (4) the min-volume-floor / fleet-level alert-tuning follow-up. Create a companion data-plane runbook for marqo /documents (re-embed) latency and link it bidirectionally + to the Kogan report.
5. Cross-cutting findings
A. Alert oversensitivity / low-volume noise is the dominant recurring-page driver. Multiple distinct alerts page on essentially trivial absolute volume and the correct action is to tune or do nothing: the ecom 4XX anomaly alarm (pixel-update baseline shift, low-volume hair-trigger, stale _bf16 duplicate), the internal-indexer success rate (a handful of timed-out jobs mechanically drops a low-volume index to 0%), and the Merchandising Exporter Lambda alarm (for=0s, Errors>0 — any single error pages). The systemic fixes are the same family: min-volume floors, fleet-level aggregation, longer for-durations, and >1 thresholds. The strongest existing model is controller-api-4xx-rate-anomaly.md's request-count gate + "tune the sigma/baseline if chronically noisy" — its peers should mirror it.
B. Correlated cascades read as N incidents but are one. A single cell2 /documents latency wobble simultaneously fired the internal-indexer success-rate alert, a dozen+ per-index 4XX/success-rate alerts, a marqo HAProxy/pod-restart burst, and Fluent Bit unhealthy pods. No doc ties these together, so an on-call risks N parallel investigations. Runbooks for these alerts should carry a "this may be a co-symptom of a cell-level wobble — check /documents latency first" correlation note.
C. Recurring triggers cluster around forking / bf16 cutovers and pixel-update bursts. The success-rate drop (fork-of-a-fork cutover), the Index Unreachable burst (bf16/forking + "is it pixel oclock"), and the ecom 4XX anomaly (pixel updates) all trace back to cutover/forking churn and pixel-update traffic bursts. These are recognizable, recurring patterns that deserve a named "common triggers" callout, plus the corrective that the bf16/precision and slim-image theories were ruled out in the VRAM-OOM report (avoid the known dead end).
D. Discoverability gap — there is no alert→runbook index. Several alerts can't be routed from their firing name to a doc: "Vespa Disk Utilization > 80%" (runbook_url → external Notion only), the Datadog "Cloudflare Requests 5xx Errors" monitor (not defined in-repo; nearest runbook is a different origin-5xx alert), and ControlPlaneSimpleHealthCheckFailureAlarm (zero hits anywhere; not in checked-out IaC). A single index keyed on exact alert names would convert several "Missing/Partial-undiscoverable" cases into "findable."
6. Recommended changes
(A) Library repo changes — to commit in library
- New alert→runbook index (e.g.
docs/runbooks/alert-index.md): a table keyed on exact firing alert names → owning runbook path (or "external Notion" / "no runbook yet"), covering every alert in §3. This directly closes the discoverability gap (D) and is the highest-leverage Library-native change. - This report, committed under the Library docs hub as the on-call documentation review of record.
incidents/index.mdpointers: add entries routing the currently-unmapped alerts ("Vespa Disk Utilization > 80%", "Cloudflare Requests 5xx Errors",ControlPlaneSimpleHealthCheckFailureAlarm) to their owning runbooks once those land, so the index stops omitting them.
(B) Submodule runbook uplifts — drafted, left UNCOMMITTED for the owning team's PR
(Each is scoped to a submodule the Library only aggregates; open against the owning repo, not committed here.)
| Target path | One-line summary |
|---|---|
repos/cloud_data_plane/runbooks/vespa_disk_utilization.md (new) | New runbook keyed to "Vespa Disk Utilization > 80%": per-tenant PVC fill (distinct from kserve root-fs), the two resize levers (StorageResizer add-shards/replicas; balanced_storage PVC-growth migration), act-now-vs-monitor, "keep the alarm quiet" framing; repoint the Grafana runbook_url at it. |
repos/cloud_control_plane/docs/runbooks/incidents/ecom-4xx-errors.md | Add a strict-schema 400 cause (geoLocation sent to /recommendations/*, with the real fashionnova worked example) and a "safe to do nothing" section for the oversensitive anomaly alarm (low-volume hair-trigger, pixel-update baseline shift, stale _bf16 duplicate), mirroring the controller-api sibling. |
repos/cloud_control_plane/docs/runbooks/incidents/ecom-success-rate.md | Add a "wonky fork cutover" common-cause section (detect mid-cutover/forked index via forks list / fork detail / Aliases tab; fork-of-a-fork and manually-cleaned-source hazards; wait-it-out vs fork rollback), cross-linking decommission-fork-source.md and the fork API doc. |
repos/cloud_data_plane/runbooks/index_unreachability.md | Add cause-branching "Common causes" (transient/self-resolving + inference-GPU-VRAM paths) and a scale-out caution, grounded in 2026-06-18-kserve-vram-oom.md, so on-call doesn't chase replicas/bf16 when the layer is shared-T4 GPU VRAM. |
repos/cloud_data_plane/runbooks/fluent_bit_unhealthy_pods.md | Add "what it is / blast radius" + "triage first / safe to do nothing" framing and fix Step 4's wrong namespace/selector to the authoritative -n fluent-bit / aws-for-fluent-bit.* from the alert's recording rule. |
repos/cloud_data_plane/runbooks/reverse_proxy_5xx.md | Add edge-vs-origin disambiguation and a "safe to do nothing" gate for transient Cloudflare-edge 5xx blips (Datadog "Cloudflare Requests 5xx Errors" monitor, auto-resolves ~2 min), so on-call doesn't run an unnecessary autoscaling change. |
repos/cloud_control_plane/docs/runbooks/incidents/merchandising-exporter-lambda-errors.md | Add an "Expected behavior / is this transient?" calibration section (single-error trigger, auto-resolves on next clean 5-min window, sev 2.5 low-urgency, when it's safe to do nothing) plus an alert de-flapping pointer. |
repos/cloud_control_plane/docs/runbooks/incidents/control-plane-simple-health-check-failure.md (new) | New investigation-scaffold runbook for ControlPlaneSimpleHealthCheckFailureAlarm: confirm real-vs-flaky via Monolith ECS/NLB/API-Gateway health, locate the alarm in IaC (NOT in the checked-out submodule), retune the hair-trigger synthetic; resolution explicitly stated as unknown from history. |
repos/cloud_control_plane/docs/runbooks/incidents/ecom-internal-indexer-job-success-rate.md (+ new companion data-plane runbook) | Add the "timeouts on a healthy server" failure class, the one-cell-wobble-fires-many-alerts correlation, the raise-timeout/retry-backoff fix and the low-volume alert-tuning note; plus a new companion data-plane runbook for marqo /documents (re-embed) latency. |
(C) Out-of-scope for docs — owning-team engineering / ops follow-ups
- Alert-threshold tuning (cross-cutting A): add min-volume floors and/or fleet-level aggregation for the per-index internal-indexer and ecom 4XX anomaly alarms; de-flap the Merchandising Exporter Lambda alarm (longer
for, orErrors > 1); relax the Fluent Bit alarmfor/threshold; add a sustained-only gate to the Cloudflare edge 5xx monitor. - Indexer client-timeout / retry-backoff (incident 4.9): raise the internal-indexer client timeout and/or add retry-backoff for
UPDATE_DOCUMENTS, and investigate the cell2/documentsre-embed latency wobble at source. ControlPlaneSimpleHealthCheckFailureAlarmroot-cause investigation (incident 4.8): locate the alarm's definition in the deploying IaC, determine what the synthetic probes, and decide fix-vs-retune; resolution is currently unknown from history and cannot be derived from docs.- Deferred postmortems: write the karkkainen-prod-000 cutover mini-postmortem (4.3) and reconcile the chronically-slow
/documentswrite-path standing ticket (4.9) with the internal-indexer cascade.
Honesty note on what is not determinable from current docs: "Vespa Disk Utilization > 80%" (no in-repo runbook; remediation only in code), the Datadog "Cloudflare Requests 5xx Errors" monitor (not defined in-repo; nearest runbook is a different alert), and ControlPlaneSimpleHealthCheckFailureAlarm (no runbook, no in-repo alarm definition, and no human-recorded resolution). For these, the report states the gap plainly rather than implying coverage exists.