Controller API 4XX Rate Anomalous vs Baseline

This runbook covers the Grafana alert Controller API 4XX Rate Anomalous vs Baseline.

The alert fires when the control-plane controller's 4XX rate (controller_request_latency_milliseconds_count{status_class="4XX"} divided by total requests) rises more than 3.5 standard deviations above its trailing 1-hour baseline, gated on at least 20 requests in the window. It catches a sharp increase in client errors (a bad deploy, a broken console/UI flow, an auth or validation regression, or a client retry storm) that a fixed threshold would miss for a service whose normal 4XX rate is well under 1%.
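As a rough illustration of the condition described above, the following Python sketch computes a Z-score for the current 4XX rate against a set of baseline rate samples and applies the 20-request gate. It is not the alert's actual rule expression: the function names, the shape of the inputs, and the use of a population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev

SIGMA_THRESHOLD = 3.5  # from the alert description
MIN_REQUESTS = 20      # volume gate from the alert description


def four_xx_zscore(baseline_rates, current_4xx, current_total):
    """Z-score of the current 4XX rate against baseline rate samples.

    baseline_rates: hypothetical list of 4XX/total ratios sampled over
    the trailing hour. Returns None when the volume gate is not met or
    the baseline has no variance (the score is undefined in that case).
    """
    if current_total < MIN_REQUESTS:
        return None
    mu = mean(baseline_rates)
    sigma = pstdev(baseline_rates)  # assumption: population std deviation
    if sigma == 0:
        return None
    return (current_4xx / current_total - mu) / sigma


def is_anomalous(baseline_rates, current_4xx, current_total):
    """True when the Z-score is defined and exceeds the sigma threshold."""
    z = four_xx_zscore(baseline_rates, current_4xx, current_total)
    return z is not None and z > SIGMA_THRESHOLD


if __name__ == "__main__":
    # Illustrative numbers only: a baseline around 0.5% and a window at 3%.
    baseline = [0.004, 0.006, 0.005, 0.005, 0.004, 0.006]
    print(is_anomalous(baseline, current_4xx=30, current_total=1000))
```

Because the baseline rate is normally tiny, its standard deviation is also tiny, so even a rate of a few percent can sit many sigmas out. That is the reason a relative test works here where a fixed threshold would not.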

Triage

  1. Check the alert time window and confirm the 4XX spike is ongoing rather than a brief, already-cleared burst.
  2. In Grafana Explore (AMP datasource), find which endpoints and accounts are driving the spike:
    • topk(10, sum by (endpoint) (increase(controller_request_latency_milliseconds_count{status_class="4XX"}[5m])))
    • sum by (account_id) (increase(controller_request_latency_milliseconds_count{status_class="4XX"}[15m]))
  3. Distinguish the failure mode:
    • 401/403 surge → auth/permissions regression or expired credentials.
    • 404 surge → a client calling a renamed/removed route, or a missing record.
    • 400/422 surge → a validation or request-shape change (often a frontend or API contract mismatch).
  4. Correlate with recent controller, console/UI, or client deploys around the alert start time.
  5. Tail the controller logs for the failing route to confirm the exact status and message.
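The failure-mode split in step 3 can be sketched as a small Python helper that groups per-status counts into the three buckets listed there. The metric shown in this runbook only exposes status_class, so the per-status counts are assumed to come from elsewhere, for example a tally of the controller logs from step 5. The bucket names and the function name are hypothetical.

```python
from collections import Counter

# Mapping taken from the triage list; bucket names are illustrative.
FAILURE_MODES = {
    401: "auth",
    403: "auth",
    404: "route-or-record",
    400: "validation",
    422: "validation",
}


def dominant_failure_mode(status_counts):
    """Return (bucket, count) for the largest failure-mode bucket.

    status_counts: dict of HTTP status code -> number of responses,
    e.g. tallied by hand from log lines. Returns None for empty input.
    """
    totals = Counter()
    for status, count in status_counts.items():
        totals[FAILURE_MODES.get(status, "other")] += count
    if not totals:
        return None
    return totals.most_common(1)[0]


if __name__ == "__main__":
    # Illustrative counts only.
    print(dominant_failure_mode({401: 40, 403: 25, 404: 10, 400: 5}))
```

A dominant "auth" bucket points at the credentials or permissions path first. A dominant "validation" bucket points at a contract mismatch between client and server.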

Remediation

  • If a controller or client deploy introduced the regression, roll back or patch it.
  • If a contract changed (request shape, route, auth), align the client and server or restore the previous contract.
  • If the spike is a legitimate, expected change in traffic (e.g. a new client deliberately probing), confirm with the owning team, and tune the baseline window or sigma only if the alert has become chronically noisy.
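Before changing sigma, it can help to estimate how often the alert condition would have been true at a candidate threshold. The Python sketch below assumes a list of historical Z-scores has already been exported, one per evaluation. The function name and the sample values are hypothetical, and the sketch ignores the for duration and the request gate.

```python
def firing_fraction(z_scores, sigma):
    """Fraction of evaluations whose Z-score exceeds the given sigma."""
    if not z_scores:
        return 0.0
    return sum(1 for z in z_scores if z > sigma) / len(z_scores)


if __name__ == "__main__":
    # Illustrative Z-scores only, not real data.
    history = [0.2, 1.1, 3.8, 4.0, 0.5, 3.6, 5.2, 0.9]
    for candidate in (3.5, 4.5):
        print(candidate, firing_fraction(history, candidate))
```

If a higher sigma removes most of the noise but still flags the evaluations that matched real incidents, it is a reasonable candidate. If it would also hide real incidents, a longer baseline window may be the better adjustment.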

Validation

  • The 4XX rate returns toward its baseline: sum(increase(controller_request_latency_milliseconds_count{status_class="4XX"}[5m])) / sum(increase(controller_request_latency_milliseconds_count[5m])).
  • The Z-score drops below 3.5σ.
  • The alert clears once the condition no longer holds at a subsequent evaluation (the rule uses for = 10m).