
Ecom Internal Indexer Job Success Rate Below 99%

This runbook covers the Grafana alert "Ecom Internal Indexer Job Success Rate Below 99%".

The alert fires when the success rate of internal ecom indexer jobs for an account/index falls below 99% over a 30-minute window. Internal jobs are system-initiated writes, such as fork/reindex or maintenance traffic. Because Marqo controls these requests, failures usually indicate a platform, settings, data-shape, or destination-index issue rather than bad customer input.
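
To make the alert condition concrete, here is a minimal sketch of the success-rate check over a 30-minute window. It is not the actual Grafana query. The job record keys (is_internal, status, finished_at) and the FAILED status in the usage note are assumptions for illustration; only COMPLETED and is_internal appear in this runbook.

```python
from datetime import datetime, timedelta, timezone

SUCCESS_STATUS = "COMPLETED"      # the only terminal status named in this runbook
THRESHOLD = 0.99                  # alert threshold from the alert name
WINDOW = timedelta(minutes=30)    # lookback window from the alert description


def internal_success_rate(jobs, now):
    """Return the success rate of internal jobs that finished within WINDOW of `now`.

    Each job is a dict with hypothetical keys: is_internal (bool),
    status (str), finished_at (timezone-aware datetime).
    Returns None when no internal jobs fall inside the window.
    """
    recent = [
        j for j in jobs
        if j["is_internal"] and now - WINDOW <= j["finished_at"] <= now
    ]
    if not recent:
        return None
    succeeded = sum(1 for j in recent if j["status"] == SUCCESS_STATUS)
    return succeeded / len(recent)


def alert_would_fire(jobs, now):
    """True when the windowed success rate is below THRESHOLD."""
    rate = internal_success_rate(jobs, now)
    return rate is not None and rate < THRESHOLD
```

With 200 internal jobs in the window, two failures are enough to fire (198/200 = 99% is not below the threshold, 197/200 is), so a small burst of failures on a low-volume index can trip the alert.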

Triage

  1. Use the alert labels to identify label_system_account_id and label_index_name.
  2. Check prod-EcomIndexerFunction logs for the affected account/index and recent internal jobs.
  3. Inspect prod-EcomIndexerJobsTable for jobs with is_internal = true and terminal statuses other than COMPLETED.
  4. Check whether failures cluster around a single operation, destination index, reindex/fork run, or new-field backfill.
  5. Classify the failure:
    • 4XX-style failures usually mean the payload or destination index config is invalid.
    • 5XX-style failures usually mean Marqo, the ecom API, or downstream infrastructure failed the request.
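
Steps 4 and 5 can be sketched as two small helpers that run over failed job records once you have exported them. This is an illustration under assumptions, not the indexer's own tooling: the record keys (for example operation) and the presence of an HTTP-style status code on each record are hypothetical.

```python
from collections import Counter


def classify_failure(status_code):
    """Map an HTTP-style status code to the two failure classes in this runbook."""
    if 400 <= status_code < 500:
        return "invalid payload or destination index config"
    if 500 <= status_code < 600:
        return "Marqo, ecom API, or downstream infrastructure failure"
    return "unclassified"


def failure_concentration(failed_jobs, key):
    """Count failed jobs per value of `key` (e.g. 'operation'), most common first.

    Records missing the key are grouped under 'unknown'.
    """
    return Counter(j.get(key, "unknown") for j in failed_jobs).most_common()
```

If one value dominates the counts for a key, start the investigation there; an even spread across operations and indexes points more toward a platform-wide problem.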

Useful starting point:

aws logs tail /aws/lambda/prod-EcomIndexerFunction --since 30m --filter-pattern "?ERROR ?Traceback ?Exception"

See Add Documents, Debugging indexer failures, and Reindexing.

Remediation

  • If payloads are invalid, fix the generator/replay path before retrying failed jobs.
  • If settings, aliases, or destination index visibility are wrong, fix those first and verify the ecom API can route the internal request.
  • If the destination index is overloaded, scale cautiously and monitor data-plane CPU/GPU/Vespa health before increasing indexing concurrency.
  • Redrive or re-run internal jobs only after the root cause is fixed and you have confirmed that the replay will not overwrite newer writes.
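
The overwrite check before a redrive can be sketched as follows. This is a minimal illustration, assuming you can obtain, per document, the time of its latest successful write; the keys doc_id and submitted_at and the last_write_at mapping are hypothetical names, not fields confirmed to exist in prod-EcomIndexerJobsTable.

```python
def split_replay_candidates(failed_jobs, last_write_at):
    """Split failed jobs into (safe, risky) lists for replay.

    failed_jobs: dicts with hypothetical keys doc_id and submitted_at.
    last_write_at: maps doc_id to the time of the latest successful write.
    A job is risky when its document was written again after the failed job
    was submitted, because replaying it could overwrite the newer write.
    """
    safe, risky = [], []
    for job in failed_jobs:
        newer = last_write_at.get(job["doc_id"])
        if newer is not None and newer > job["submitted_at"]:
            risky.append(job)
        else:
            safe.append(job)
    return safe, risky
```

Jobs in the risky list need a manual decision (skip, or regenerate from current source data) rather than a blind redrive.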

Validation

  • New internal jobs for the affected account/index complete successfully.
  • Failed internal job count stops increasing.
  • The alert clears after the 30-minute lookback no longer contains failed internal jobs.
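
As a rough guide to how long to wait before confirming the alert has cleared, the sketch below computes when the lookback window stops containing any known failure. It assumes you have the finish times of the failed internal jobs; the alert may clear earlier if enough successful jobs push the rate back to 99% or above.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(minutes=30)  # lookback window from the alert description


def earliest_clear_time(failed_finish_times):
    """Return the time at which the lookback window no longer contains
    any of the given failures, or None if there are no failures."""
    if not failed_finish_times:
        return None
    return max(failed_finish_times) + LOOKBACK


if __name__ == "__main__":
    # Illustrative timestamps only.
    last_failure = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
    print(earliest_clear_time([last_failure]))
```

If the alert is still firing well after this time, new failures are still arriving and the root cause is not fixed.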