Reindexing Pipeline Replay Lambda Has Errors

This runbook covers the Grafana alert Reindexing Pipeline Replay Lambda Has Errors.

The alert fires when reindexing-pipeline-prod-replay-lambda records one or more Lambda errors within a 5-minute window. Replay errors can prevent reindex traffic from reaching the destination index. This includes fork/reindex jobs that rely on side-effects-only replay into the ecom API.
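
To see what the alert condition looks like, the sketch below lists per-5-minute error counts for the function. It assumes the Grafana alert is backed by the standard AWS/Lambda Errors metric in CloudWatch, that GNU date is available, and that the AWS CLI is authenticated against the account that hosts the Lambda. The helper name replay_lambda_errors is hypothetical.

```shell
# Sketch: list per-5-minute error sums for the replay Lambda.
# Assumptions: GNU date, AWS CLI credentials for the Lambda's account,
# and that the alert is driven by the standard AWS/Lambda "Errors" metric.
FUNCTION_NAME="reindexing-pipeline-prod-replay-lambda"

replay_lambda_errors() {
  local minutes="${1:-30}"   # look-back window in minutes
  local start end
  start="$(date -u -d "${minutes} minutes ago" +%Y-%m-%dT%H:%M:%SZ)"
  end="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  # --period 300 mirrors the 5-minute alert window.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name Errors \
    --dimensions "Name=FunctionName,Value=${FUNCTION_NAME}" \
    --start-time "$start" \
    --end-time "$end" \
    --period 300 \
    --statistics Sum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Sum]' \
    --output text
}
```

Running replay_lambda_errors 60 prints one line per 5-minute bucket in the last hour. Any bucket with a non-zero sum would correspond to an alert evaluation that could fire.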

Triage

  1. Check reindexing-pipeline-prod-replay-lambda logs in the data-plane account around the alert window.
  2. Extract the reindex_id, source account/index, destination account/index, and any replay SQS queue URL from the failed log event.
  3. Look up the job in the reindexing table or admin endpoint:
    • Table: reindexing-pipeline-prod-reindexing-table
    • Admin route: /api/v1/admin/reindexing?systemAccountId=<account-id>
  4. Determine whether the replay Lambda error is caused by an ecom API 4XX/5XX response, missing settings/KV, destination index visibility, malformed documents, queue permissions, or a timeout/capacity limit.
  5. If this is part of a fork workflow, inspect the fork's docs_config and reindex queue state before changing anything.

Useful starting point:

aws logs tail /aws/lambda/reindexing-pipeline-prod-replay-lambda --since 30m --filter-pattern "?ERROR ?Traceback ?Exception"
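
Once a reindex_id has been extracted from a failed event, it can help to narrow the logs to that job. The sketch below wraps the same aws logs tail command with a quoted filter term. It assumes the reindex_id appears verbatim in the log text; the helper name replay_logs_for_job is hypothetical.

```shell
# Sketch: show replay Lambda log lines that mention one reindex job.
# Assumption: the reindex_id value appears verbatim in the log events.
LOG_GROUP="/aws/lambda/reindexing-pipeline-prod-replay-lambda"

replay_logs_for_job() {
  local reindex_id="$1"      # value copied from a failed log event
  local since="${2:-30m}"    # look-back window, same syntax as --since

  if [ -z "$reindex_id" ]; then
    echo "usage: replay_logs_for_job <reindex_id> [since]" >&2
    return 1
  fi

  # Double quotes inside the pattern make CloudWatch Logs match the
  # identifier as one exact term, even if it contains hyphens.
  aws logs tail "$LOG_GROUP" \
    --since "$since" \
    --filter-pattern "\"${reindex_id}\"" \
    --format short
}
```

For example, replay_logs_for_job <reindex_id> 2h widens the window to two hours for a single job.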

See Reindexing, Ecom API 4XX Rate Alerts, and Ecom API 5XX Errors.

Remediation

  • If replay requests are rejected by the ecom API, fix the underlying API/settings/document issue before replaying anything.
  • If destination settings are missing or stale, follow Ecom Index Settings Exporter Lambda Has Errors.
  • If the replay queue or event source mapping is broken, repair the queue/trigger configuration for the fork/reindex job.
  • If a code deploy caused valid replay payloads to fail, roll back or patch the relevant ecom API/reindexing pipeline component.
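
Before repairing a queue or trigger, it is worth confirming its current state. The following is a minimal sketch, assuming the replay queues are attached to the Lambda as event source mappings and that the queue URL is the one extracted during triage. The helper names are hypothetical, and both commands are read-only.

```shell
# Sketch: read-only checks of the replay Lambda's triggers and queue depth.
# Assumption: replay SQS queues are wired to the Lambda as event source mappings.
FUNCTION_NAME="reindexing-pipeline-prod-replay-lambda"

# List each mapping's UUID, state, source ARN, and last state change reason.
replay_event_source_mappings() {
  aws lambda list-event-source-mappings \
    --function-name "$FUNCTION_NAME" \
    --query 'EventSourceMappings[].[UUID, State, EventSourceArn, StateTransitionReason]' \
    --output text
}

# Show visible and in-flight message counts for one replay queue.
replay_queue_depth() {
  local queue_url="$1"   # replay SQS queue URL taken from the failed log event
  aws sqs get-queue-attributes \
    --queue-url "$queue_url" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query 'Attributes' \
    --output json
}
```

A mapping whose State is not Enabled, or a queue whose visible message count keeps growing while in-flight stays at zero, would point at the trigger rather than at the replay code.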

Do not manually replay documents until you know whether replay is idempotent for this job and whether newer writes could be overwritten.

Validation

  • The replay Lambda has no new errors.
  • The reindex job's replayed and completed document counts are increasing.
  • The destination index receives the expected documents or side effects.
  • Any related Reindexing Job Made No Progress in 15m alert clears.
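
The first validation item can be turned into a pass/fail check. The sketch below sums the Lambda's Errors metric from the time the fix was applied until now and returns a non-zero status if any error was recorded. It assumes the standard AWS/Lambda Errors metric and an ISO 8601 UTC start time supplied by the operator; the helper names are hypothetical.

```shell
# Sketch: fail if the replay Lambda recorded any error since a given time.
# Assumption: errors are visible in the standard AWS/Lambda "Errors" metric.
FUNCTION_NAME="reindexing-pipeline-prod-replay-lambda"

# Print the total error count since $1 (ISO 8601 UTC, e.g. the fix time).
replay_errors_since() {
  local start="$1"
  local end
  end="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name Errors \
    --dimensions "Name=FunctionName,Value=${FUNCTION_NAME}" \
    --start-time "$start" \
    --end-time "$end" \
    --period 300 \
    --statistics Sum \
    --query 'sum(Datapoints[].Sum)' \
    --output text
}

# Return 0 when no errors were recorded, 1 when there were, 2 on query failure.
no_new_replay_errors() {
  local total
  total="$(replay_errors_since "$1")" || return 2
  awk -v t="$total" 'BEGIN { exit (t + 0 > 0) }'
}
```

A passing check only covers the first item. The document counts and destination index still need to be confirmed separately.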