Controller API 4XX Rate Anomalous vs Baseline
This runbook covers the Grafana alert Controller API 4XX Rate Anomalous vs Baseline.
The alert fires when the control-plane controller's 4XX rate (controller_request_latency_milliseconds_count{status_class="4XX"} divided by total requests) rises more than 3.5 standard deviations above its trailing 1-hour baseline, gated on at least 20 requests in the window. It catches a large increase in client errors — a bad deploy, a broken console/UI flow, an auth or validation regression, or a client retry storm — that a fixed threshold would miss for a service whose normal 4XX rate is well under 1%.
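The arithmetic behind the alert condition can be sketched in a few lines of Python. This is an illustration only, not the rule's actual expression: the function names are hypothetical, the baseline is assumed to be a list of 4XX-rate samples taken over the trailing hour, and population standard deviation is an assumption about how the rule computes its spread.

```python
from statistics import fmean, pstdev


def four_xx_z_score(baseline_rates, errors_4xx, total_requests):
    """Return how many standard deviations the current 4XX rate sits above the baseline mean."""
    current_rate = errors_4xx / total_requests
    mean = fmean(baseline_rates)
    spread = pstdev(baseline_rates)  # population std dev: an assumption about the rule
    if spread == 0:
        # A perfectly flat baseline makes any increase infinitely anomalous.
        return float("inf") if current_rate > mean else 0.0
    return (current_rate - mean) / spread


def should_fire(baseline_rates, errors_4xx, total_requests, sigma=3.5, min_requests=20):
    """Apply the request-volume gate first, then the sigma threshold."""
    if total_requests < min_requests:
        return False  # too little traffic for the ratio to be meaningful
    return four_xx_z_score(baseline_rates, errors_4xx, total_requests) > sigma


baseline = [0.004, 0.005, 0.006, 0.005]  # made-up samples, roughly a 0.5% 4XX rate
print(should_fire(baseline, errors_4xx=30, total_requests=1000))  # True: 3% is far above baseline
print(should_fire(baseline, errors_4xx=5, total_requests=10))     # False: fewer than 20 requests
```

The second call shows why the volume gate matters: 5 errors out of 10 requests is a 50% rate, but with so few requests the ratio is too noisy to act on.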
Triage
- Check the alert time window and confirm the 4XX spike is ongoing rather than a brief, already-cleared burst.
- In Grafana Explore (AMP datasource), find which endpoints and accounts are driving the spike:
    topk(10, sum by (endpoint) (increase(controller_request_latency_milliseconds_count{status_class="4XX"}[5m])))
    sum by (account_id) (increase(controller_request_latency_milliseconds_count{status_class="4XX"}[15m]))
- Distinguish the failure mode:
  - 401/403 surge → auth/permissions regression or expired credentials.
  - 404 surge → a client calling a renamed/removed route, or a missing record.
  - 400/422 surge → a validation or request-shape change (often a frontend or API contract mismatch).
- Correlate with recent controller, console/UI, or client deploys around the alert start time.
- Tail the controller logs for the failing route to confirm the exact status and message.
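Once the logs yield a sample of failing status codes, the failure-mode mapping from the list above can be tallied mechanically. The sketch below assumes you have already extracted the codes into a plain list of integers. How you extract them depends on your log format, which this runbook does not specify, and both function names are hypothetical.

```python
from collections import Counter


def failure_mode(status):
    """Map a 4XX status code to the triage bucket it most likely belongs to."""
    if status in (401, 403):
        return "auth/permissions regression or expired credentials"
    if status == 404:
        return "renamed/removed route or missing record"
    if status in (400, 422):
        return "validation or request-shape change"
    return "other 4XX; inspect the log message"


def dominant_failure_mode(statuses):
    """Return the most common bucket and its share of the sampled 4XX responses."""
    if not statuses:
        raise ValueError("no status codes sampled")
    counts = Counter(failure_mode(s) for s in statuses)
    mode, hits = counts.most_common(1)[0]
    return mode, hits / len(statuses)


sample = [401, 403, 401, 404]  # made-up sample of failing responses
print(dominant_failure_mode(sample))
```

A dominant bucket with a high share points at a single regression. An even spread across buckets suggests looking at a broader cause such as a client retry storm.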
Remediation
- If a controller or client deploy introduced the regression, roll back or patch it.
- If a contract changed (request shape, route, auth), align the client and server or restore the previous contract.
- If the spike reflects a legitimate, expected change in traffic (e.g. a new client deliberately probing), confirm with the owning team, and tune the baseline window or sigma if the alert is now chronically noisy.
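Before changing sigma, it can help to replay historical Z-scores against candidate thresholds and see how many separate firing episodes each would have produced. The sketch below assumes a hypothetical list of Z-scores, one per evaluation, exported from Grafana. It ignores the rule's for duration, so it overcounts short bursts that would never have left the pending state.

```python
def count_firing_episodes(z_scores, sigma):
    """Count runs of consecutive evaluations whose Z-score exceeds sigma."""
    episodes = 0
    firing = False
    for z in z_scores:
        above = z > sigma
        if above and not firing:
            episodes += 1  # a new run above the threshold starts here
        firing = above
    return episodes


def compare_thresholds(z_scores, candidates):
    """Map each candidate sigma to the number of episodes it would have produced."""
    return {sigma: count_firing_episodes(z_scores, sigma) for sigma in candidates}


history = [0.5, 3.8, 4.1, 1.0, 3.6, 0.2, 5.2, 5.0]  # made-up Z-scores
print(compare_thresholds(history, [3.5, 4.0, 4.5]))  # {3.5: 3, 4.0: 2, 4.5: 1}
```

If raising sigma removes only the episodes the owning team already considers benign, the new threshold is probably safe. If it also hides episodes that were real incidents, widening the baseline window may be the better lever.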
Validation
- The 4XX rate returns toward its baseline:
    sum(increase(controller_request_latency_milliseconds_count{status_class="4XX"}[5m])) / sum(increase(controller_request_latency_milliseconds_count[5m]))
- The Z-score drops below 3.5σ.
- The alert clears after the next evaluation window (for = 10m).