Ecom Agentic Converse 5XX RPS

This runbook covers the Grafana alert "Agentic Converse Stream 5XX RPS Exceeds 2".

It fires when the converse stream emits more than two 5XX completions per second for an account/index. The rate is computed as sum_over_time(ecom_agentic_stream_e2e_latency_milliseconds_count{status_class="5XX", operation="converse"}[5m]) / 300, that is, the 5XX completion count over the last 5 minutes divided by the 300 seconds in that window. The stream metrics are pushed by the ecom_metrics_consumer Lambda.
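
As a worked example of the arithmetic in the expression above, the minimal sketch below converts a 5-minute count of 5XX completions into a per-second rate and compares it with the threshold of 2. The function and constant names are illustrative and not taken from the alert definition; the only assumption is that the summed count over the window is divided by 300 seconds, as the expression states.

```python
WINDOW_SECONDS = 300  # the [5m] range in the alert expression
THRESHOLD_RPS = 2.0   # "Exceeds 2" in the alert name


def converse_5xx_rps(count_in_window: float) -> float:
    """Convert a 5-minute count of 5XX completions into a per-second rate."""
    return count_in_window / WINDOW_SECONDS


def alert_fires(count_in_window: float) -> bool:
    """True when the per-second rate is strictly above the threshold."""
    return converse_5xx_rps(count_in_window) > THRESHOLD_RPS


if __name__ == "__main__":
    # Illustrative counts only, not real data.
    for count in (300, 600, 601, 900):
        print(count, converse_5xx_rps(count), alert_fires(count))
```

Under these assumptions the alert needs more than 600 5XX completions in the 5-minute window: 600 gives exactly 2.0 and does not fire, while 900 gives 3.0 and does.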

Converse is the multi-turn chat path served by the agentic worker (POST /agentic-search/converse). Sustained 5XXs mean conversational replies are failing for the customer. Converse runs on the same agentic worker as agentic search, but it always uses function calling, supports document/image context, and persists conversation state in a Durable Object. See Flow: Agentic Search (the "Converse (Multi-Turn Chat)" section).

Triage

  1. Use the alert labels to identify label_system_account_id and label_index_name.

  2. Tail the agentic worker and the search proxy to watch the failing completions (see Cloudflare Workers):

    npx wrangler tail prod-agentic-search
    npx wrangler tail prod-ecom-api

  3. Capture the failing path, status, response body, request IDs, and any downstream error.

  4. Check the most common agentic-specific failure sources (see the "What to Look For" table in Flow: Agentic Search):

    • LLM errors — missing/invalid GOOGLE_API_KEY or a Gemini outage. Check Secrets Manager and the Gemini API status.
    • Conversation state — Durable Object failures (ConversationSqlDO) can break multi-turn replies. Inspect the Durable Object in the Cloudflare dashboard.
    • Document/image context — converse fetches docs by ID and resolves uploaded images from R2; a failure resolving that context can surface as a 5XX.
    • Search proxy / Marqo failures — the underlying category searches run via the search proxy RPC. If that path is unhealthy, follow Ecom API 5XX Errors.
  5. Check whether a deploy, settings change, or downstream cell incident started at the same time.
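
For steps 2 and 3, one way to cut the tail output down to failing requests is to run wrangler tail with --format json and filter the events. The sketch below is a hypothetical helper, not part of this runbook's tooling. It assumes each tail event is a JSON object with outcome, exceptions, and event.request / event.response.status fields; confirm those names against real output before relying on it. It also tolerates pretty-printed (multi-line) JSON by decoding objects incrementally.

```python
import json
import sys
from typing import Any, Dict, Iterable, Iterator, Optional


def iter_json_objects(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield JSON objects from a stream of concatenated, possibly multi-line JSON."""
    decoder = json.JSONDecoder()
    buffer = ""
    for line in lines:
        buffer += line
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # object is incomplete so far: wait for more input
            buffer = buffer[end:]
            if isinstance(obj, dict):
                yield obj


def summarize_5xx(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the fields worth capturing for a 5XX event, or None otherwise."""
    # Field names below are assumptions about the tail event shape.
    inner = event.get("event") or {}
    response = inner.get("response") or {}
    status = response.get("status")
    if not isinstance(status, int) or not 500 <= status <= 599:
        return None
    request = inner.get("request") or {}
    exceptions = event.get("exceptions") or []
    return {
        "url": request.get("url"),
        "method": request.get("method"),
        "status": status,
        "outcome": event.get("outcome"),
        "exceptions": [e.get("message") for e in exceptions if isinstance(e, dict)],
    }


if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    # Only read stdin when explicitly asked to with "-".
    for tail_event in iter_json_objects(sys.stdin):
        summary = summarize_5xx(tail_event)
        if summary is not None:
            print(json.dumps(summary), flush=True)
```

Saved under a hypothetical name such as filter_5xx.py, it could be fed with `npx wrangler tail prod-agentic-search --format json | python3 filter_5xx.py -`. Because an HTTP status is sent before a streamed body, a converse reply that fails mid-stream may not appear here with a 5XX response status, so the exceptions and logs in the unfiltered tail are still worth checking.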

Remediation

  • Fix or roll back the component producing the 5XX (agentic worker deploy, Gemini key, Durable Object, or downstream search).
  • If Gemini is the cause, restore/rotate GOOGLE_API_KEY or wait out the provider incident.
  • If the failure is in the underlying search path rather than the agentic layer, use Ecom API 5XX Errors.
  • Contact the account manager if customer-facing impact is sustained.

Validation

  • Re-run a failing converse turn for the affected index and confirm it streams a reply without 5XX.
  • Confirm the converse 5XX RPS drops below the threshold and the alert clears.
  • Check that no related ecom 5XX, queue-depth, or settings-exporter alerts are still firing.