Controller API 5XX Rate Exceeds 5% in 5m
This runbook covers the Grafana alert Controller API 5XX Rate Exceeds 5% in 5m.
The alert fires when the control-plane controller's server-side error rate (controller_request_latency_milliseconds_count{status_class="5XX"} divided by total requests) exceeds 5% over a 5-minute window, gated on at least 20 requests in that window. The controller serves merchandising and search-config endpoints (e.g. GET api/merchandise/config, POST api/search, GET api/v2/indexes). Its baseline 5XX rate is normally 0, so any sustained 5XX fraction at this level is a real regression.
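The condition described above can be sketched as a single PromQL expression. This is an assumed reconstruction from the description in this runbook, not the deployed alert rule; the actual rule may use rate() instead of increase() or apply the request gate differently.

```promql
# Assumed reconstruction of the alert condition; the deployed rule may differ.
# Left side: fraction of requests in the last 5 minutes that returned 5XX.
# Right side: gate requiring at least 20 requests in the same window.
(
  sum(increase(controller_request_latency_milliseconds_count{status_class="5XX"}[5m]))
  /
  sum(increase(controller_request_latency_milliseconds_count[5m]))
) > 0.05
and
sum(increase(controller_request_latency_milliseconds_count[5m])) >= 20
```

Under this reading, 2 errors out of 30 requests (about 6.7%) would fire, while 2 errors out of 15 requests would not, because the 20-request gate is not met.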
Triage
- Check the alert time window and confirm whether 5XXs are ongoing or a single burst that has already cleared.
- In Grafana Explore (AMP datasource), break the errors down by endpoint and account:
  topk(10, sum by (endpoint) (increase(controller_request_latency_milliseconds_count{status_class="5XX"}[5m])))

  sum by (account_id) (increase(controller_request_latency_milliseconds_count{status_class="5XX"}[15m]))
- Determine whether the 5XXs are isolated to one endpoint/account or service-wide.
- Tail the controller logs for the failing route and correlate the log entries with the alert timestamps; identify the failing downstream dependency (DDB, Cognito, Step Functions, a microservice).
- Check for recent controller deploys or infrastructure changes (ECS/EBS health, instance saturation) around the alert start time.
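Raw 5XX counts favor high-traffic endpoints, so a per-endpoint error fraction can help when deciding whether the failures are isolated or service-wide. The query below is a sketch that assumes the endpoint label is present on all series of the metric, not only the 5XX ones.

```promql
# Per-endpoint 5XX fraction over the last 5 minutes (sketch).
# Assumes the "endpoint" label exists on both the 5XX and the total series,
# so the two sides match one-to-one on that label.
sum by (endpoint) (increase(controller_request_latency_milliseconds_count{status_class="5XX"}[5m]))
/
sum by (endpoint) (increase(controller_request_latency_milliseconds_count[5m]))
```

Endpoints with no 5XX samples in the window drop out of the result, so a short list with high fractions points to an isolated failure, while many endpoints with similar fractions suggests a shared dependency or host problem.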
Remediation
- If a deploy caused the failures, roll back or patch the controller change.
- If the failure is downstream, fix the broken dependency, or route around it if an approved manual path exists.
- If the controller hosts are unhealthy or saturated (e.g. an EBS/ECS capacity event), restore capacity before changing application code.
- If the route is low-impact but noisy, confirm with the owning team before muting or adjusting thresholds.
Validation
- The failing controller route returns 2XX again.
- sum(increase(controller_request_latency_milliseconds_count{status_class="5XX"}[5m])) returns to 0.
- The alert clears after the next evaluation window.
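Because the alert is gated on at least 20 requests, it can also clear when traffic drops rather than when errors stop. A minimal sketch of two queries, run separately, that may help distinguish the two cases:

```promql
# 5XX count over the last 5 minutes. "or vector(0)" makes the query return an
# explicit 0 instead of an empty result when no 5XX series have samples.
sum(increase(controller_request_latency_milliseconds_count{status_class="5XX"}[5m])) or vector(0)

# Total request count over the same window, to confirm the controller is
# still receiving traffic (the alert is gated on at least 20 requests).
sum(increase(controller_request_latency_milliseconds_count[5m]))
```

A 5XX count of 0 alongside a healthy total request count indicates recovery; a 5XX count of 0 with little or no traffic does not.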