Ecom API 5XX Errors
This runbook is the primary 5XX investigation path for the Ecom API Success Rate Below 98% alert. The alert counts 2XX and 4XX responses as successful and pages when 5XX or other unexpected non-2XX/non-4XX responses push an account/index below 98% success.
There should usually be no sustained 5XXs. One-off timeouts can happen, but repeated 5XXs mean the platform is failing requests the customer cannot correct.
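The success-rate rule described above can be sketched as follows. This is a minimal illustration, not the alert's actual query: the function names, the list-of-status-codes input, and the choice to treat an empty window as healthy are all assumptions made for the example.

```python
from typing import Iterable

# Hypothetical threshold taken from the alert name (98%).
SUCCESS_THRESHOLD = 0.98


def is_success(status: int) -> bool:
    """2XX and 4XX count as successful; everything else counts against the rate."""
    return 200 <= status < 300 or 400 <= status < 500


def success_rate(statuses: Iterable[int]) -> float:
    """Fraction of responses that count as successful for one account/index."""
    statuses = list(statuses)
    if not statuses:
        # Assumption: a window with no traffic is treated as healthy.
        return 1.0
    ok = sum(1 for status in statuses if is_success(status))
    return ok / len(statuses)


def should_page(statuses: Iterable[int], threshold: float = SUCCESS_THRESHOLD) -> bool:
    """True when the success rate for the window falls below the threshold."""
    return success_rate(statuses) < threshold
```

Under this sketch, a window of 97 successful responses and 3 responses with status 500 would page, while a customer sending only 404s would not, which matches the intent that 4XX responses are the customer's to correct.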
Triage
- Use the alert labels to identify label_system_account_id and label_index_name.
- Check the prod ecom API worker logs in Cloudflare: prod-ecom-api logs.
- Capture the failing path, status, response body, request IDs, and any downstream error.
- Check whether failures are concentrated on one endpoint, for example /search, /recommend, document writes, settings, or merchandising.
- Check dependent systems based on the failure:
  - Search proxy settings/KV: Settings Sync
  - Ecom API/admin Lambda: Lambda
  - Search proxy and Cloudflare KV: Cloudflare Workers
  - Indexing/write path: Add Documents
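To check whether failures cluster on one endpoint, it can help to count 5XX responses per path from exported log entries. The sketch below assumes each log entry has already been parsed into a dict with "path" and "status" keys; that record shape is hypothetical and will need adapting to whatever the worker logs actually export.

```python
from collections import Counter
from typing import Iterable, Mapping


def failures_by_path(records: Iterable[Mapping]) -> list:
    """Count 5XX responses per request path, most affected path first.

    Each record is assumed (hypothetically) to look like:
        {"path": "/search", "status": 502, "request_id": "..."}
    """
    counts = Counter()
    for record in records:
        if record["status"] >= 500:
            counts[record["path"]] += 1
    # most_common() sorts by count, highest first.
    return counts.most_common()


# Example with made-up records, for illustration only.
sample = [
    {"path": "/search", "status": 502},
    {"path": "/search", "status": 200},
    {"path": "/search", "status": 504},
    {"path": "/recommend", "status": 500},
    {"path": "/recommend", "status": 404},
]
for path, count in failures_by_path(sample):
    print(path, count)
```

If one path dominates the output, start with the dependent system for that path; an even spread across paths points toward a shared component instead.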
Remediation
- Fix or roll back the component producing the 5XX.
- If the failure is caused by bad per-index config, use Edit ecommerce index settings and verify the settings exporter updates Cloudflare KV.
- If an index or downstream cell is unhealthy, use Polo/data-plane dashboards to confirm capacity and index health before scaling or routing changes.
- Contact the account manager if customer-facing impact is sustained or the customer may need to retry writes/searches.
Validation
- Re-run a failing request or confirm new traffic for the affected index is no longer returning 5XX.
- Confirm the success-rate alert clears.
- Check that no related queue-depth, settings-exporter, or internal-indexer alerts are still firing.
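Re-running a failing request can be scripted along these lines. This is a sketch under stated assumptions: it only issues unauthenticated GET requests, the helper names are invented for the example, and the real failing request may need a different method, headers, or body. Connection-level errors (urllib.error.URLError) are deliberately left unhandled so they surface during validation.

```python
import urllib.error
import urllib.request
from typing import Callable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET to url, including 4XX/5XX responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4XX/5XX; the status code is still what we want.
        return err.code


def still_failing(fetch_status: Callable[[], int], attempts: int = 5) -> List[int]:
    """Call fetch_status repeatedly and return any 5XX statuses observed."""
    observed = [fetch_status() for _ in range(attempts)]
    return [status for status in observed if status >= 500]
```

A call such as still_failing(lambda: http_status(failing_url)), where failing_url is the request captured during triage, should return an empty list once the fix is in place. Passing the fetch function in separately keeps the retry logic testable without network access.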