Ecom Internal Indexer Job Success Rate Below 99%

This runbook covers the Grafana alert "Ecom Internal Indexer Job Success Rate Below 99%".

The alert fires when the success rate of internal ecom indexer jobs for an account/index falls below 99% over a 30-minute window. Internal jobs are system-initiated writes, such as fork/reindex or maintenance traffic. Because Marqo controls these requests, failures usually indicate a platform, settings, data-shape, or destination-index issue rather than bad customer input.
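
As a rough illustration of the threshold, assuming the rate is computed as completed internal jobs divided by all internal jobs that reached a terminal status in the window (the exact Grafana query is not shown in this runbook, and the function names below are hypothetical):

```shell
# Hypothetical helpers; the alert's real query may differ.
# Usage: success_rate_pct COMPLETED_COUNT TERMINAL_COUNT
success_rate_pct() {
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "n/a"; exit }   # no jobs in the window, nothing to rate
    printf "%.2f\n", 100 * ok / total
  }'
}

# Exit status 0 when the rate is strictly below 99%, otherwise 1.
# Integer arithmetic avoids floating-point edge cases at exactly 99%.
is_below_99() {
  [ "$2" -gt 0 ] && [ $(( $1 * 100 )) -lt $(( $2 * 99 )) ]
}
```

For example, 197 completed jobs out of 200 gives 98.50 and would breach the threshold, while 198 out of 200 is exactly 99% and would not.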

Triage

Use the alert labels label_system_account_id and label_index_name to identify the affected account and index.
Check prod-EcomIndexerFunction logs for the affected account/index and recent internal jobs.
Inspect prod-EcomIndexerJobsTable for jobs with is_internal = true and terminal statuses other than COMPLETED.
Check whether failures are concentrated on a single operation, destination index, reindex/fork run, or new-field backfill.
Classify the failure:
- 4XX-style failures usually mean the payload or destination index config is invalid.
- 5XX-style failures usually mean Marqo, ecom API, or downstream infrastructure failed the request.
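
The classification above can be sketched as a small helper, assuming the failed job record or log line exposes an HTTP-like status code (an assumption; the function name is hypothetical):

```shell
# Hypothetical helper: map an HTTP-like status code to a triage bucket.
# Usage: classify_failure STATUS_CODE
classify_failure() {
  case "$1" in
    4[0-9][0-9]) echo "4XX: check payload or destination index config" ;;
    5[0-9][0-9]) echo "5XX: check Marqo, ecom API, or downstream infrastructure" ;;
    *)           echo "unclassified: inspect the job error manually" ;;
  esac
}
```

Anything that does not carry a 4XX or 5XX code, such as a timeout with no response, falls into the last bucket and needs a manual look.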

Useful starting point:

aws logs tail /aws/lambda/prod-EcomIndexerFunction --since 30m --filter-pattern "?ERROR ?Traceback ?Exception"
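
A similar starting point for the jobs table check in the triage steps might look like the sketch below. Only the table name and is_internal come from this runbook; the status attribute name, the BOOL type of is_internal, and the function name are assumptions to verify against the real table schema.

```shell
# Sketch only: attribute names other than is_internal are assumptions.
# "status" is a DynamoDB reserved word, so it is aliased as #st.
# Usage: list_failed_internal_jobs [MAX_ITEMS]
list_failed_internal_jobs() {
  aws dynamodb scan \
    --table-name prod-EcomIndexerJobsTable \
    --filter-expression 'is_internal = :internal AND #st <> :done' \
    --expression-attribute-names '{"#st": "status"}' \
    --expression-attribute-values '{":internal": {"BOOL": true}, ":done": {"S": "COMPLETED"}}' \
    --max-items "${1:-50}"
}
```

The filter only excludes COMPLETED, so jobs that are still in progress also appear in the output and should be ignored. A scan reads the whole table before filtering, so run it sparingly against production.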

See Add Documents, Debugging indexer failures, and Reindexing.

Remediation

If payloads are invalid, fix the generator/replay path before retrying failed jobs.
If settings, aliases, or destination index visibility are wrong, fix those first and verify the ecom API can route the internal request.
If the destination index is overloaded, scale cautiously and monitor data-plane CPU/GPU/Vespa health before increasing indexing concurrency.
Redrive or re-run internal jobs only after the root cause is fixed and after checking whether replay can overwrite newer writes.
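
One way to reason about the overwrite check, assuming each failed job and each destination document carry comparable epoch timestamps (both hypothetical here; this runbook does not specify how write times are recorded):

```shell
# Hypothetical check: replay is treated as safe only when nothing has
# written to the same document after the failed job was created.
# Usage: safe_to_replay JOB_CREATED_EPOCH LATEST_WRITE_EPOCH
safe_to_replay() {
  [ "$2" -le "$1" ]
}
```

If the latest write is newer than the failed job, replaying the job would roll the document back, so regenerate the payload from current source data instead of redriving the original.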

Validation

New internal jobs for the affected account/index complete successfully.
Failed internal job count stops increasing.
The alert clears once the 30-minute lookback window no longer contains failed internal jobs.
