Ecom Agentic Converse 5XX RPS
This runbook covers the Grafana alert Agentic Converse Stream 5XX RPS Exceeds 2.
It fires when the converse stream emits more than two 5XX completions per second for an account/index, computed as sum_over_time(ecom_agentic_stream_e2e_latency_milliseconds_count{status_class="5XX", operation="converse"}[5m]) / 300 (the 5-minute count divided by 300 seconds). The stream metrics are pushed by the ecom_metrics_consumer Lambda.
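The threshold arithmetic can be sketched as follows. This is a minimal illustration, assuming the summed metric value equals the number of 5XX converse completions in the 5-minute window. The function and constant names are hypothetical and not part of the alert definition.

```python
WINDOW_SECONDS = 300   # the alert's 5-minute window, per the expression above
THRESHOLD_RPS = 2.0    # the alert fires when the rate exceeds this value


def converse_5xx_rps(completions_in_window: float,
                     window_seconds: int = WINDOW_SECONDS) -> float:
    """Average 5XX completions per second over the window (count / 300)."""
    return completions_in_window / window_seconds


def alert_fires(completions_in_window: float) -> bool:
    """True when the averaged rate is strictly above the threshold."""
    return converse_5xx_rps(completions_in_window) > THRESHOLD_RPS
```

Under these assumptions, an account/index needs more than 600 failed completions within 5 minutes to trip the alert, so a short burst of errors averages out and may not fire it.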
Converse is the multi-turn chat path served by the agentic worker (POST /agentic-search/converse). Sustained 5XXs mean conversational replies are failing for the customer. Converse runs on the same agentic worker as agentic search, but it always uses function calling, supports document/image context, and persists conversation state in a Durable Object. See Flow: Agentic Search (the "Converse (Multi-Turn Chat)" section).
Triage
- Use the alert labels to identify label_system_account_id and label_index_name.
- Tail the agentic worker and the search proxy to see the failing completions (see Cloudflare Workers):
  npx wrangler tail prod-agentic-search
  npx wrangler tail prod-ecom-api
- Capture the failing path, status, response body, request IDs, and any downstream error.
- Check the most common agentic-specific failure sources (see the "What to Look For" table in Flow: Agentic Search):
  - LLM errors: missing/invalid GOOGLE_API_KEY or a Gemini outage. Check Secrets Manager and the Gemini API status.
  - Conversation state: Durable Object failures (ConversationSqlDO) can break multi-turn replies. Inspect the Durable Object in the Cloudflare dashboard.
  - Document/image context: converse fetches docs by ID and resolves uploaded images from R2; a failure resolving that context can surface as a 5XX.
  - Search proxy / Marqo failures: the underlying category searches run via the search proxy RPC. If that path is unhealthy, follow Ecom API 5XX Errors.
- Check whether a deploy, settings change, or downstream cell incident started at the same time.
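The capture step above can be partly scripted. The sketch below is a rough illustration, not an established tool. It assumes the tail output was saved with wrangler's JSON format option, and that each event is a JSON object carrying event.request.url, event.request.method, event.response.status, outcome, and exceptions. Verify those field names against real output before relying on them. The helper names are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, Optional


def iter_json_objects(text: str) -> Iterator[Dict[str, Any]]:
    """Yield each JSON object from concatenated JSON values.

    Works whether events are one per line or pretty-printed over many lines.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        if isinstance(obj, dict):
            yield obj


def summarize_failure(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the triage fields for a 5XX event, or None for anything else."""
    inner = event.get("event") or {}          # assumed event layout
    request = inner.get("request") or {}
    response = inner.get("response") or {}
    status = response.get("status")
    if not isinstance(status, int) or status < 500:
        return None
    return {
        "url": request.get("url"),
        "method": request.get("method"),
        "status": status,
        "outcome": event.get("outcome"),
        "exceptions": [e.get("message")
                       for e in (event.get("exceptions") or [])
                       if isinstance(e, dict)],
    }


def failures_in(text: str) -> Iterator[Dict[str, Any]]:
    """Yield a summary for every 5XX event found in the captured text."""
    for event in iter_json_objects(text):
        summary = summarize_failure(event)
        if summary is not None:
            yield summary
```

To use it, redirect the JSON tail output to a file, read the file into a string, and pass it to failures_in. The sketch does not extract response bodies or request IDs, so collect those separately.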
Remediation
- Fix or roll back the component producing the 5XX (agentic worker deploy, Gemini key, Durable Object, or downstream search).
- If Gemini is the cause, restore/rotate GOOGLE_API_KEY or wait out the provider incident.
- If the failure is in the underlying search path rather than the agentic layer, use Ecom API 5XX Errors.
- Contact the account manager if customer-facing impact is sustained.
Validation
- Re-run a failing converse turn for the affected index and confirm it streams a reply without 5XX.
- Confirm the converse 5XX RPS drops below the threshold and the alert clears.
- Check that no related ecom 5XX, queue-depth, or settings-exporter alerts are still firing.