Ecom API 5XX Errors

This runbook is the primary 5XX investigation path for the Ecom API Success Rate Below 98% alert. The alert treats 2XX and 4XX responses as successful and pages when 5XX or other unexpected non-2XX/non-4XX responses push an account/index below 98% success.
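The success-rate rule described above can be sketched as follows. This is a minimal illustration, not the alert's actual query: the function names are hypothetical, and treating an empty window as healthy is an assumption.

```python
SUCCESS_THRESHOLD = 0.98  # taken from the alert name: "Success Rate Below 98%"


def is_success(status: int) -> bool:
    """2XX and 4XX count as successful; anything else counts against the rate."""
    return 200 <= status < 300 or 400 <= status < 500


def success_rate(statuses) -> float:
    """Fraction of responses in the window that count as successful."""
    statuses = list(statuses)
    if not statuses:
        return 1.0  # assumption: no traffic is treated as healthy
    ok = sum(1 for s in statuses if is_success(s))
    return ok / len(statuses)


def should_page(statuses) -> bool:
    """True when the window falls below the 98% threshold."""
    return success_rate(statuses) < SUCCESS_THRESHOLD


if __name__ == "__main__":
    # Hypothetical window for one account/index: 4XXs do not hurt the rate, 5XXs do.
    window = [200] * 90 + [404] * 5 + [503] * 5
    print(success_rate(window), should_page(window))
```

Because 4XX responses are counted as successful, a spike in customer errors such as 404s or 422s will not trigger this alert; only failures the customer cannot correct do.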

There should usually be no sustained 5XXs. One-off timeouts can happen, but repeated 5XXs mean the platform is failing requests the customer cannot correct.

Triage

  1. Use the alert labels to identify label_system_account_id and label_index_name.
  2. Check the prod ecom API worker logs in Cloudflare: prod-ecom-api logs.
  3. Capture the failing path, status, response body, request IDs, and any downstream error.
  4. Check whether failures are concentrated on one endpoint, for example /search, /recommend, document writes, settings, or merchandising.
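  5. Check dependent systems based on the failure, such as the settings exporter and Cloudflare KV, the index or downstream cell (Polo/data-plane dashboards), and any queue-depth or internal-indexer alerts.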
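
For step 4, once log records have been exported, a small script can show whether the 5XXs cluster on one endpoint. This is a rough sketch: the record fields `path` and `status` are assumed names, not the actual Cloudflare log schema, and `failures_by_path` is a hypothetical helper.

```python
from collections import Counter


def failures_by_path(records):
    """Count 5XX responses per request path.

    Returns (path, count, share_of_all_5xx) rows, most frequent first.
    """
    counts = Counter(r["path"] for r in records if 500 <= r["status"] < 600)
    total = sum(counts.values())
    # An empty Counter yields an empty list, so there is no division by zero.
    return [(path, n, n / total) for path, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical sample; real records would come from the worker logs.
    sample = [
        {"path": "/search", "status": 503},
        {"path": "/search", "status": 500},
        {"path": "/search", "status": 200},
        {"path": "/recommend", "status": 502},
        {"path": "/recommend", "status": 404},
    ]
    for path, n, share in failures_by_path(sample):
        print(f"{path}: {n} ({share:.0%})")
```

A single path holding most of the 5XXs points at that endpoint or its dependencies; an even spread suggests a shared component.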

Remediation

  • Fix or roll back the component producing the 5XX.
  • If the failure is caused by bad per-index config, use Edit ecommerce index settings and verify the settings exporter updates Cloudflare KV.
  • If an index or downstream cell is unhealthy, use Polo/data-plane dashboards to confirm capacity and index health before scaling or routing changes.
  • Contact the account manager if customer-facing impact is sustained or the customer may need to retry writes/searches.

Validation

  • Re-run a failing request or confirm new traffic for the affected index is no longer returning 5XX.
  • Confirm the success-rate alert clears.
  • Check that no related queue-depth, settings-exporter, or internal-indexer alerts are still firing.