Ecom Internal Indexer Job Success Rate Below 99%
This runbook covers the Grafana alert Ecom Internal Indexer Job Success Rate Below 99%.
The alert fires when internal ecom indexer jobs for an account/index have less than 99% success over 30 minutes. Internal jobs are system-initiated writes, such as fork/reindex or maintenance traffic. Because Marqo controls these requests, failures usually indicate a platform, settings, data-shape, or destination-index issue rather than customer input.
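As a rough illustration of the threshold, the sketch below evaluates the 99% condition for one account/index over a single 30-minute window. The function name `alert_state` and the choice to treat an empty window as healthy are assumptions made for illustration; the actual Grafana query may handle these cases differently.

```shell
# Hypothetical helper: prints "firing" when the window's success rate is below 99%.
# Integer math avoids floating point:
#   completed / total < 99 / 100   <=>   completed * 100 < total * 99
alert_state() {
  completed=$1
  total=$2
  if [ "$total" -eq 0 ]; then
    # Assumption: a window with no internal jobs does not fire.
    echo "ok"
  elif [ $((completed * 100)) -lt $((total * 99)) ]; then
    echo "firing"
  else
    echo "ok"
  fi
}

alert_state 148 150   # 2 failures in 150 jobs is about 98.67% success
alert_state 199 200   # 1 failure in 200 jobs is 99.5% success
```

With low job volume a single failure is enough to cross the threshold: one failed job in 50 is 98% success.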
Triage
- Use the alert labels to identify label_system_account_id and label_index_name.
- Check prod-EcomIndexerFunction logs for the affected account/index and recent internal jobs.
- Inspect prod-EcomIndexerJobsTable for jobs with is_internal = true and terminal statuses other than COMPLETED.
- Check whether failures are concentrated on one operation, destination index, reindex/fork, or new field/backfill.
- Classify the failure:
  - 4XX-style failures usually mean the payload or destination index config is invalid.
  - 5XX-style failures usually mean Marqo, ecom API, or downstream infrastructure failed the request.
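The classification step can be sketched as a small shell helper. It assumes each failed job exposes an HTTP-style status code, which this runbook does not confirm; the function name `classify_failure` and the bucket labels are hypothetical.

```shell
# Hypothetical helper: map an HTTP-style status code to a triage bucket.
classify_failure() {
  case "$1" in
    4[0-9][0-9]) echo "payload-or-index-config" ;;   # 4XX-style
    5[0-9][0-9]) echo "platform-or-downstream" ;;    # 5XX-style
    *)           echo "unclassified" ;;              # anything else needs a manual look
  esac
}

# Tally buckets for a made-up list of status codes from failed jobs.
for code in 400 422 503 400; do
  classify_failure "$code"
done | sort | uniq -c
```

A tally dominated by one bucket indicates which remediation path to check first.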
Useful starting point:
aws logs tail /aws/lambda/prod-EcomIndexerFunction --since 30m --filter-pattern "?ERROR ?Traceback ?Exception"
See Add Documents, Debugging indexer failures, and Reindexing.
Remediation
- If payloads are invalid, fix the generator/replay path before retrying failed jobs.
- If settings, aliases, or destination index visibility are wrong, fix those first and verify the ecom API can route the internal request.
- If the destination index is overloaded, scale cautiously and monitor data-plane CPU/GPU/Vespa health before increasing indexing concurrency.
- Redrive or re-run internal jobs only after the root cause is fixed and after checking whether replay can overwrite newer writes.
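One way to think about the overwrite check before a redrive is a timestamp comparison, sketched below. It assumes both the failed job and the target document carry a write timestamp in Unix epoch seconds; the function name `replay_decision` and both inputs are hypothetical, and where the timestamps come from depends on the deployment.

```shell
# Hypothetical check: replay a failed job only if the target document
# has not been written since the job was created.
# Both arguments are Unix epoch seconds.
replay_decision() {
  job_created_at=$1
  doc_last_written_at=$2
  if [ "$doc_last_written_at" -gt "$job_created_at" ]; then
    echo "skip"     # a newer write exists; replaying would overwrite it
  else
    echo "replay"   # no newer write observed for this document
  fi
}

replay_decision 1700000000 1700000600   # document written 10 minutes after the job
replay_decision 1700000000 1699999000   # document last written before the job
```

If timestamps are unavailable, treat the replay as potentially destructive and check the destination index manually.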
Validation
- New internal jobs for the affected account/index complete successfully.
- Failed internal job count stops increasing.
- The alert clears once the 30-minute lookback window no longer contains enough failed internal jobs to keep the success rate below 99%.