Ecom API Success Rate Below 98%
This runbook covers the Grafana alert Ecom API Success Rate Below 98%.
This alert means the volume of 5XX errors on the ecom API is high enough to pull the success rate, (2XX + 4XX) / (2XX + 4XX + 5XX), below 98%. Note that 4XX responses count as successes, so only 5XX responses lower the rate. Act on this alert; depending on the severity, the customer may need to be contacted.
See also Ecom5XXErrors-{system_account_id}-{index_name} for dealing with 5XX errors.
The alert only fires when the account/index has more than 500 requests in the 5 minute window, so treat it as real customer impact rather than low-volume noise.
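As a rough illustration of the alert condition described above, the following Python sketch computes the success rate from per-class request counts and applies both the 98% threshold and the 500-request minimum. The names WindowCounts, success_rate and alert_would_fire are hypothetical and are not taken from the actual Grafana rule. The sketch also assumes that "requests" means the sum of 2XX, 4XX and 5XX responses and that a window with no traffic is treated as healthy.

```python
from dataclasses import dataclass

SUCCESS_RATE_THRESHOLD = 0.98  # from the alert name
MIN_REQUESTS = 500  # the alert only fires above this volume per 5-minute window


@dataclass
class WindowCounts:
    """Request counts by status class for one account/index over 5 minutes."""

    count_2xx: int
    count_4xx: int
    count_5xx: int

    @property
    def total(self) -> int:
        # Assumption: "requests" means 2XX + 4XX + 5XX responses.
        return self.count_2xx + self.count_4xx + self.count_5xx


def success_rate(counts: WindowCounts) -> float:
    """(2XX + 4XX) / (2XX + 4XX + 5XX); 4XX counts as a success."""
    if counts.total == 0:
        return 1.0  # assumption: no traffic is treated as healthy
    return (counts.count_2xx + counts.count_4xx) / counts.total


def alert_would_fire(counts: WindowCounts) -> bool:
    """True when volume exceeds the minimum and the rate is below threshold."""
    return (
        counts.total > MIN_REQUESTS
        and success_rate(counts) < SUCCESS_RATE_THRESHOLD
    )
```

For example, 900 2XX, 70 4XX and 30 5XX responses give a 97% success rate over 1000 requests, which would fire. The same 5XX share over only 400 requests would not, because it falls under the volume floor.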
Triage
- Use the alert labels to identify label_system_account_id and label_index_name.
- Check the prod ecom API worker logs in Cloudflare: prod-ecom-api logs.
- Determine which endpoint and status class is driving the failure.
- If the failed requests are 5XX, follow Ecom API 5XX Errors.
- Check whether a deploy, settings change, alias/fork change, or downstream cell incident started at the same time.
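If the worker logs have been exported for offline inspection, a small script can help with the "which endpoint and status class" step above. The sketch below is only illustrative: it assumes each log record is a dict with hypothetical "endpoint" and "status" keys, which may not match the real prod-ecom-api log schema, and the endpoint paths in the usage note are made up.

```python
from collections import Counter
from typing import Iterable, Mapping


def status_class(status: int) -> str:
    """Map an HTTP status code to its class, e.g. 503 -> '5XX'."""
    return f"{status // 100}XX"


def breakdown(records: Iterable[Mapping]) -> Counter:
    """Count requests per (endpoint, status class).

    Assumes hypothetical 'endpoint' and 'status' keys on each record.
    """
    return Counter(
        (record["endpoint"], status_class(record["status"]))
        for record in records
    )


def top_failures(records: Iterable[Mapping], limit: int = 5) -> list:
    """Return the endpoints with the most 5XX responses, highest first."""
    failures = Counter(
        {key: n for key, n in breakdown(records).items() if key[1] == "5XX"}
    )
    return failures.most_common(limit)
```

Calling top_failures on the exported records returns pairs such as (("/search", "5XX"), 2), which points at the endpoint to investigate first.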
Depending on the problem, you may be able to Edit ecommerce index settings to mitigate it temporarily (or if you’re lucky, fix it).
Validation
- Confirm the affected endpoint returns expected responses for the affected account/index.
- Confirm the success-rate metric is back above 98% and the alert clears.
- If customer writes/searches failed, capture whether retries are needed and hand that context to the account manager.