Reindexing Pipeline Replay Lambda Has Errors
This runbook covers the Grafana alert Reindexing Pipeline Replay Lambda Has Errors.
The alert fires when reindexing-pipeline-prod-replay-lambda records any Lambda error within a 5-minute window. Replay errors can prevent reindex traffic from reaching the destination index, including fork/reindex jobs that rely on side-effects-only replay into the ecom API.
Triage
- Check reindexing-pipeline-prod-replay-lambda logs in the data-plane account around the alert window.
- Extract the reindex_id, source account/index, destination account/index, and any replay SQS queue URL from the failed log event.
- Look up the job in the reindexing table or admin endpoint:
  - Table: reindexing-pipeline-prod-reindexing-table
  - Admin route: /api/v1/admin/reindexing?systemAccountId=<account-id>
- Check whether the replay Lambda error is caused by ecom API 4XX/5XX, missing settings/KV, destination index visibility, malformed documents, queue permissions, or timeout/capacity.
- If this is part of a fork workflow, inspect the fork's docs_config and reindex queue state before changing anything.
Useful starting point:
aws logs tail /aws/lambda/reindexing-pipeline-prod-replay-lambda --since 30m --filter-pattern "?ERROR ?Traceback ?Exception"
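Once the tail returns matching events, the identifiers needed for the rest of triage can be pulled out of each line and the error roughly bucketed. The following is a minimal Python sketch, assuming the replay Lambda emits JSON log payloads. The field names (reindex_id, source_index, destination_index, queue_url, status_code, message) and the cause buckets are hypothetical and need to be adjusted to the real log schema.

```python
import json
from typing import Optional

# Hypothetical field names; the real replay Lambda log schema may differ.
FIELDS = ("reindex_id", "source_index", "destination_index", "queue_url")


def parse_event(line: str) -> Optional[dict]:
    """Return the JSON object embedded in a log line, or None if there is none."""
    start = line.find("{")  # skip any timestamp/stream prefix before the payload
    if start == -1:
        return None
    try:
        payload = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


def classify(payload: dict) -> str:
    """Bucket an error by HTTP status or message text (illustrative heuristics only)."""
    status = payload.get("status_code")
    message = str(payload.get("message", "")).lower()
    if isinstance(status, int) and 400 <= status < 500:
        return "ecom-api-4xx"
    if isinstance(status, int) and 500 <= status < 600:
        return "ecom-api-5xx"
    if "timed out" in message or "timeout" in message:
        return "timeout-or-capacity"
    if "accessdenied" in message or "not authorized" in message:
        return "queue-permissions"
    return "unknown"


def summarize(line: str) -> Optional[dict]:
    """Extract triage identifiers and a rough cause bucket from one log line."""
    payload = parse_event(line)
    if payload is None:
        return None
    summary = {name: payload.get(name) for name in FIELDS}
    summary["cause"] = classify(payload)
    return summary
```

Lines without a parseable JSON object return None, so plain-text tracebacks still need to be read by hand. The "unknown" bucket is a prompt to check the remaining causes from the triage list (missing settings/KV, destination index visibility, malformed documents).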
See Reindexing, Ecom API 4XX Rate Alerts, and Ecom API 5XX Errors.
Remediation
- If replay requests are rejected by the ecom API, fix the underlying API/settings/document issue before replaying anything.
- If destination settings are missing or stale, follow Ecom Index Settings Exporter Lambda Has Errors.
- If the replay queue or event source mapping is broken, repair the queue/trigger configuration for the fork/reindex job.
- If a code deploy caused valid replay payloads to fail, roll back or patch the relevant ecom API/reindexing pipeline component.
Do not manually replay documents until you know whether replay is idempotent for this job and whether newer writes could be overwritten.
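One way to reason about the overwrite risk is to compare each replay candidate against the current destination state before sending anything. The sketch below assumes every document carries a monotonically increasing version (or an updated-at timestamp). The ReplayDoc type, its version field, and the in-memory destination map are hypothetical stand-ins for illustration, not part of the pipeline.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ReplayDoc:
    doc_id: str
    version: int  # hypothetical: any monotonically increasing marker


def split_safe_replays(
    candidates: Iterable[ReplayDoc],
    destination_versions: Dict[str, int],
) -> Tuple[List[ReplayDoc], List[ReplayDoc]]:
    """Split candidates into (safe, stale).

    A candidate is stale when the destination already holds a strictly newer
    version, so replaying it would overwrite a newer write.
    """
    safe: List[ReplayDoc] = []
    stale: List[ReplayDoc] = []
    for doc in candidates:
        current = destination_versions.get(doc.doc_id)
        if current is not None and current > doc.version:
            stale.append(doc)
        else:
            safe.append(doc)
    return safe, stale
```

Documents whose version equals the destination's are treated as safe here, which only holds if replay is idempotent for the job. If that is not established, treat equal versions as stale too.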
Validation
- The replay Lambda has no new errors.
- The reindex job's replayed and completed document counts are increasing.
- The destination index receives the expected documents or side effects.
- Any related Reindexing Job Made No Progress in 15m alert clears.
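The progress check can be scripted by comparing two snapshots of the job record taken a few minutes apart. A minimal sketch, assuming the job record exposes the replayed and completed document counts under the hypothetical keys replayed_count and completed_count:

```python
from typing import List, Mapping

# Hypothetical counter names; use whatever the job record actually exposes.
COUNTERS = ("replayed_count", "completed_count")


def stalled_counters(earlier: Mapping[str, int], later: Mapping[str, int]) -> List[str]:
    """Return the counters that did not increase between two snapshots."""
    return [
        name
        for name in COUNTERS
        if later.get(name, 0) <= earlier.get(name, 0)
    ]
```

An empty result means both counters moved forward between the snapshots. Any name returned is a counter that stayed flat or went backwards and is worth checking against the job's queue state.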