Reindexing Job Made No Progress in 15m

This runbook covers the Grafana alert Reindexing Job Made No Progress in 15m.

The alert fires for an active reindex job when none of its extracted, replayed, or completed document counters have increased for 15 minutes. The alert labels include reindex_id, label_system_account_id, and label_index_name.
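As a rough illustration of the alert condition (not the actual Grafana rule, which this runbook does not show), the check amounts to comparing the three counters now against their values at least 15 minutes ago. The `CounterSample` type and `made_no_progress` function are hypothetical names, and the sketch assumes the counters only ever increase.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CounterSample:
    """One observation of a job's counters (hypothetical shape)."""
    at: datetime
    extracted: int
    replayed: int
    completed: int


def made_no_progress(samples, window=timedelta(minutes=15)):
    """Return True if no counter rose between the latest sample and the
    most recent sample that is at least `window` older than it.

    Returns False when there is not yet a full window of history.
    Assumes counters are monotonically non-decreasing.
    """
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.at)
    latest = ordered[-1]
    # Candidate baselines: samples at least one full window older than the latest.
    baselines = [s for s in ordered if latest.at - s.at >= window]
    if not baselines:
        return False
    baseline = baselines[-1]
    return (
        latest.extracted <= baseline.extracted
        and latest.replayed <= baseline.replayed
        and latest.completed <= baseline.completed
    )
```

Note that a single moving counter is enough to keep the alert quiet, so a job can be stalled at one stage while another stage still makes progress.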

Triage

  1. Use the alert labels to identify the job.
  2. Look up the job in the reindexing table or admin endpoint:
    • Table: reindexing-pipeline-prod-reindexing-table
    • Admin route: /api/v1/admin/reindexing?systemAccountId=<account-id>
  3. Compare docCount, extractedDocCount, replayedDocCount, completedDocCount, and lastUpdatedAt.
  4. Determine which stage is stuck:
    • extractedDocCount not moving: a problem with the source index or export path.
    • replayedDocCount not moving: a problem with the replay Lambda, ecom API, destination settings, or replay queue.
    • completedDocCount not moving: a problem with destination indexer jobs or Marqo writes.
  5. Check sibling alerts, especially Reindexing Pipeline Replay Lambda Has Errors, Ecom Queue Depth / Queue Backlog, and Ecom Internal Indexer Job Success Rate Below 99%.
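Because all three counters are flat by the time this alert fires, one way to approach steps 3 and 4 is to find the first stage whose counter lags the stage before it. The sketch below is only that: it assumes the counters are cumulative and that each stage eventually catches up to the previous one (extracted to docCount, replayed to extracted, completed to replayed), which this runbook does not state. The `job` dictionary is a hypothetical shape using the field names from step 3, and `first_lagging_stage` is an illustrative name, not part of any existing tool.

```python
from typing import Optional


def first_lagging_stage(job: dict) -> Optional[str]:
    """Return the earliest pipeline stage whose counter is behind its input.

    Assumption: each counter should eventually reach the count of the
    stage before it. Returns None when every stage has caught up.
    """
    doc_count = job["docCount"]
    extracted = job["extractedDocCount"]
    replayed = job["replayedDocCount"]
    completed = job["completedDocCount"]

    if extracted < doc_count:
        return "extraction"   # source index / export path
    if replayed < extracted:
        return "replay"       # replay Lambda, ecom API, settings, replay queue
    if completed < replayed:
        return "completion"   # destination indexer jobs or Marqo writes
    return None


# Example with made-up numbers: extraction finished, replay is behind.
snapshot = {
    "docCount": 1000,
    "extractedDocCount": 1000,
    "replayedDocCount": 640,
    "completedDocCount": 640,
}
print(first_lagging_stage(snapshot))  # replay
```

Treat the result as a starting point for the Remediation steps, and still confirm against lastUpdatedAt and the sibling alerts.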

Remediation

  • If extraction is stuck, check the source index/cell and the reindexing pipeline logs for the job.
  • If replay is stuck, follow Reindexing Pipeline Replay Lambda Has Errors, even if the replay Lambda alert has not fired.
  • If completion is stuck, inspect destination ecom indexer jobs and queue depth. Scale indexing throughput only after confirming the destination index has capacity.
  • If the job is genuinely abandoned or unsafe to continue, escalate before cancelling or deleting any reindex resources.

Validation

  • At least one of the job counters starts increasing again.
  • The alert clears after Grafana observes progress.
  • For fork/reindex workflows, confirm the destination index has the expected document count before cleanup or cutover.
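The first and third checks can be sketched as a small comparison of two job snapshots plus a document count check. This is a minimal sketch: `validate_recovery` is a hypothetical helper, the snapshot dictionaries reuse the counter field names from Triage, and the expected and destination document counts are assumed to be obtained separately (how to read the destination index count is not covered here).

```python
COUNTERS = ("extractedDocCount", "replayedDocCount", "completedDocCount")


def validate_recovery(earlier: dict, later: dict,
                      expected_docs: int, destination_docs: int) -> dict:
    """Summarize whether the job is moving again and whether the
    destination index holds the expected number of documents."""
    return {
        # True if at least one counter increased between the two snapshots.
        "progress_resumed": any(later[name] > earlier[name] for name in COUNTERS),
        # Exact match is assumed to be the bar for cleanup or cutover.
        "destination_count_matches": destination_docs == expected_docs,
        # Positive means documents are still missing from the destination.
        "doc_count_gap": expected_docs - destination_docs,
    }
```

A non-zero doc_count_gap is a reason to hold off on cleanup or cutover, even if the alert has already cleared.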