Ecom Agentic Search 5XX RPS

This runbook covers the Grafana alert "Agentic Search Stream 5XX RPS Exceeds 2".

It fires when the agentic_search stream emits more than two 5XX completions per second for an account/index, averaged over a 5-minute window. The alert expression is sum_over_time(ecom_agentic_stream_e2e_latency_milliseconds_count{status_class="5XX", operation="agentic_search"}[5m]) / 300. The stream metrics are pushed by the ecom_metrics_consumer Lambda.
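
As a sanity check on the threshold: because the expression divides the 5-minute sum by 300 seconds, a rate above 2 corresponds to a sum above 600. The sketch below works through that arithmetic in shell. The variable and function names are illustrative only and are not part of the alert configuration.

```shell
# Alert math: rate = (5XX completions summed over 5 minutes) / 300 seconds.
WINDOW_SECONDS=300
THRESHOLD_RPS=2

# Window count at which the rate equals the threshold exactly.
min_count=$(( THRESHOLD_RPS * WINDOW_SECONDS ))
echo "rate equals ${THRESHOLD_RPS}/s at ${min_count} 5XX completions per 5m window"

# Hypothetical helper: succeeds when a window count is strictly above the
# threshold, matching the "exceeds 2" wording of the alert.
exceeds_threshold() {
  [ "$1" -gt "$min_count" ]
}
```

For example, `exceeds_threshold 750 && echo "alert would fire"` prints the message, while a count of exactly 600 does not exceed the threshold.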

Sustained 5XXs on the agentic search stream mean conversational/AI-powered searches are failing for the customer: the request reaches the agentic worker but cannot complete. See Flow: Agentic Search for the full request path.

Triage

  1. Read label_system_account_id and label_index_name from the alert labels to identify the affected account and index.

  2. Tail the agentic worker and the search proxy to see the failing completions (see Cloudflare Workers):

    npx wrangler tail prod-agentic-search
    npx wrangler tail prod-ecom-api

  3. Capture the failing path, status, response body, request IDs, and any downstream error.

  4. Check the most common agentic-specific failure sources (see the "What to Look For" table in Flow: Agentic Search):

    • LLM errors: missing/invalid GOOGLE_API_KEY or a Gemini outage. Check Secrets Manager and the Gemini API status.
    • Search proxy / Marqo failures: the per-category searches run via the search proxy RPC. If the underlying search path is unhealthy, follow Ecom API 5XX Errors.
    • Cache / DynamoDB: failures reading prod-AgenticCachedQueriesTable. Check DynamoDB.

  5. Check whether a deploy, settings change, or downstream cell incident started at the same time.
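
The capture and DynamoDB checks in the steps above can be collected into a small helper. This is a minimal sketch, not an official tool: it only prints the commands for review, since `wrangler tail` streams until interrupted and each tail is best run in its own terminal. The worker and table names come from this runbook. The output file names are assumptions, and the `aws` command assumes the AWS CLI is already configured for the right account and region.

```shell
#!/bin/sh
# Triage helper sketch: prints commands instead of running them.

# Worker and table names as given in this runbook.
WORKERS="prod-agentic-search prod-ecom-api"
CACHE_TABLE="prod-AgenticCachedQueriesTable"

triage_commands() {
  for worker in $WORKERS; do
    # --format json emits one JSON event per invocation, which is easier to
    # save and search than the default pretty output.
    # The tail-<worker>.jsonl file name is a hypothetical choice.
    echo "npx wrangler tail ${worker} --format json > tail-${worker}.jsonl"
  done
  # describe-table returns table metadata; --query narrows the output to
  # TableStatus so an unreachable or non-ACTIVE table stands out.
  echo "aws dynamodb describe-table --table-name ${CACHE_TABLE} --query Table.TableStatus"
}

triage_commands
```

Saving the tail output as JSON lines keeps the path, status, and request details from step 3 available after the terminal session ends.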

Remediation

  • Fix or roll back the component producing the 5XX (agentic worker deploy, Gemini key, or downstream search).
  • If Gemini is the cause, restore/rotate GOOGLE_API_KEY or wait out the provider incident; the short-query bypass and cached paths may keep some searches working in the meantime.
  • If the failure is in the underlying search path rather than the agentic layer, use Ecom API 5XX Errors.
  • Contact the account manager if customer-facing impact is sustained.

Validation

  • Re-run a failing agentic search for the affected index and confirm it streams a result without 5XX.
  • Confirm the agentic search 5XX RPS drops below the threshold and the alert clears.
  • Check that no related ecom 5XX, queue-depth, or settings-exporter alerts are still firing.