Debugging Indexer Failures
Guide for investigating ecom indexer job failures using CloudWatch logs, DynamoDB, and S3 chunk replay.
Prerequisites
- AWS CLI configured with the controller profile (prod account)
- Access to CloudWatch Logs, DynamoDB, and S3 in us-east-1
1. Check the Job Record
Query the job from DynamoDB to see current status and progress:
aws dynamodb get-item \
--profile controller --region us-east-1 \
--table-name prod-EcomIndexerJobsTable \
--key '{
"pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"},
"sk": {"S": "JOB#{created_at}#{job_id}"}
}' \
--projection-expression "job_status, error_message, failed_items, failed_items_details, processed_items, total_items, skipped_items, conflict_items, last_updated" \
--output json
Key fields:
- pk: PLATFORM#shopify#SHOP#{system_account_id}-{index_name}
- sk: JOB#{created_at_iso}#{job_id}
- job_status: PENDING, IN_PROGRESS, COMPLETED, FAILED, CANCELLED
- failed_items_details: Map of error reasons to affected doc IDs
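The key layout above can be assembled programmatically. This is a minimal sketch; the function name `job_key` and its parameters are illustrative and not part of the indexer codebase.

```python
import json


def job_key(system_account_id, index_name, created_at_iso, job_id, platform="shopify"):
    """Build the job record key in the typed-attribute form the AWS CLI expects."""
    # shop_id is the system account ID and index name joined by a hyphen.
    shop_id = f"{system_account_id}-{index_name}"
    return {
        "pk": {"S": f"PLATFORM#{platform}#SHOP#{shop_id}"},
        "sk": {"S": f"JOB#{created_at_iso}#{job_id}"},
    }


if __name__ == "__main__":
    # Placeholder values; paste the printed JSON into --key.
    print(json.dumps(job_key("acct1", "products", "2026-01-01T00:00:00Z", "job123")))
```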
To find the latest job for a shop:
aws dynamodb query \
--profile controller --region us-east-1 \
--table-name prod-EcomIndexerJobsTable \
--key-condition-expression "pk = :pk" \
--expression-attribute-values '{":pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"}}' \
--no-scan-index-forward --max-items 1 \
--output json
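The CLI returns attributes in DynamoDB's typed form (for example {"N": "30"}), which is awkward to scan by eye. The sketch below flattens a get-item or query response into plain values and reports progress. It assumes processed_items and total_items are stored as numbers; the helper names are hypothetical.

```python
def _number(text):
    """Parse a DynamoDB N value, preferring int over float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def unwrap(attr):
    """Convert one typed attribute such as {"S": "x"} to a plain Python value."""
    (type_tag, value), = attr.items()
    if type_tag == "N":
        return _number(value)
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {k: unwrap(v) for k, v in value.items()}
    if type_tag == "L":
        return [unwrap(v) for v in value]
    # S and BOOL are already plain; sets and binary pass through unchanged.
    return value


def summarize_job(cli_output):
    """Flatten the first item of a get-item or query response and add a percentage."""
    item = cli_output.get("Item") or (cli_output.get("Items") or [{}])[0]
    job = {k: unwrap(v) for k, v in item.items()}
    total = job.get("total_items") or 0
    processed = job.get("processed_items") or 0
    job["percent_processed"] = round(100.0 * processed / total, 1) if total else None
    return job
```

Pipe the CLI output through `json.load(sys.stdin)` and pass the result to `summarize_job`.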
2. Check CloudWatch Logs
Key log groups
| Log Group | Contains |
|---|---|
| /aws/lambda/prod-ShopifyAppAdminFunction | Webhook receipt, bulk sync triggers, job creation |
| /aws/lambda/prod-ShopifyWebhookWorker | Bulk finish processing, chunking, SQS enqueueing |
| /aws/lambda/prod-EcomIndexerFunction | Document processing, Marqo requests, errors |
Search for a specific job
aws logs start-query \
--profile controller --region us-east-1 \
--log-group-name "/aws/lambda/prod-EcomIndexerFunction" \
--start-time $(date -u -j -f "%Y-%m-%dT%H:%M:%S" "2026-01-01T00:00:00" "+%s") \
--end-time $(date -u -j -f "%Y-%m-%dT%H:%M:%S" "2026-01-02T00:00:00" "+%s") \
--query-string 'fields @timestamp, @message
| filter @message like /{job_id}/
and (@message like /ERROR/ or @message like /status/ or @message like /FAILED/)
| sort @timestamp asc
| limit 50'
# Then retrieve results:
aws logs get-query-results --profile controller --region us-east-1 --query-id "{query_id}"
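The get-query-results response nests each log row as a list of field/value pairs, and the query may still be running when first polled. A small sketch for turning the response into readable rows follows; the function name is illustrative.

```python
def rows_from_query_results(response):
    """Return (status, rows), with each row flattened to a {field: value} dict."""
    rows = [
        # @ptr is an internal record pointer and is dropped for readability.
        {cell["field"]: cell["value"] for cell in row if cell["field"] != "@ptr"}
        for row in response.get("results", [])
    ]
    return response.get("status"), rows
```

If the returned status is not "Complete", wait a few seconds and run get-query-results again with the same query ID.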
Search for bulk operation webhooks
aws logs start-query \
--profile controller --region us-east-1 \
--log-group-name "/aws/lambda/prod-ShopifyAppAdminFunction" \
--start-time ... --end-time ... \
--query-string 'fields @timestamp, @message
| filter @message like /bulk_operations/ and @message like /{shop_domain}/
| sort @timestamp asc | limit 50'
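The --start-time and --end-time arguments take epoch seconds. The `date -u -j -f` form used earlier relies on BSD/macOS flags and does not work with GNU date, so a portable way to compute the values may be useful. A minimal sketch, assuming timestamps are given in UTC in the same format as above:

```python
from datetime import datetime, timezone


def to_epoch_seconds(iso_utc):
    """Convert a UTC timestamp like 2026-01-01T00:00:00 to epoch seconds."""
    parsed = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


if __name__ == "__main__":
    print(to_epoch_seconds("2026-01-01T00:00:00"))  # prints 1767225600
```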
3. Check Shopify Bulk Operation Status
If a bulk sync job is stuck in PENDING, check whether Shopify completed the export:
# Get access token from sessions table
ACCESS_TOKEN=$(aws dynamodb query \
--profile controller --region us-east-1 \
--table-name prod-ShopifyEntitiesTable \
--key-condition-expression "pk = :pk AND begins_with(sk, :sk)" \
--expression-attribute-values '{
":pk": {"S": "SHOP#{shop_domain}"},
":sk": {"S": "USER#"}
}' \
--projection-expression "access_token" \
--output json | python3 -c "import json,sys; print(json.load(sys.stdin)['Items'][0]['access_token']['S'])")
# Query Shopify for bulk operation status
curl -s "https://{shop_domain}/admin/api/2024-10/graphql.json" \
-H "X-Shopify-Access-Token: $ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"query": "query { node(id: \"{bulk_operation_gid}\") { ... on BulkOperation { id status errorCode createdAt completedAt fileSize url objectCount } } }"}' \
| python3 -m json.tool
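The response can be reduced to a status and a suggested next step. This sketch assumes the GraphQL response shape implied by the query above (data.node with status and url); the hint wording is illustrative, and the status values listed are the ones the Shopify Admin API documents for BulkOperation as far as I am aware.

```python
def describe_bulk_operation(response):
    """Return (status, hint) for a BulkOperation node query response."""
    node = (response.get("data") or {}).get("node")
    if not node:
        return None, "No node returned; check the bulk operation GID and access token."
    status = node.get("status")
    if status == "COMPLETED":
        if node.get("url"):
            return status, "Export finished; the finish webhook may have been missed (see Section 6)."
        return status, "Export finished without a download URL; it may have produced no objects."
    if status in ("CREATED", "RUNNING", "CANCELING"):
        return status, "Export still in flight on the Shopify side; check again later."
    if status in ("FAILED", "CANCELED", "EXPIRED"):
        return status, "Export did not produce usable data; re-trigger the sync."
    return status, "Unrecognized status; inspect the raw response."
```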
4. Replay Failing Sub-chunks Against Marqo
When the indexer logs show a sub-chunk failed (e.g., chunk_0001.jsonl#part#27 marked failed due to fatal_error), replay it against Marqo to get the actual error.
Step 1: Download the chunk from S3
The S3 path is: s3://prod-ecom-product-data-bucket/{shop_id}//bulk/{job_id}/chunk_{N}.jsonl
aws s3 cp \
"s3://prod-ecom-product-data-bucket/{shop_id}//bulk/{job_id}/chunk_0000.jsonl" \
/tmp/chunk.jsonl --profile controller
Step 2: Extract the failing sub-chunk
The part#N suffix in the log identifies the Nth batch of docs within the chunk (batch size = progress_step, typically 30). Counting lines from zero, part 27 covers lines 810-839 (27×30 through 28×30-1).
import json

# Read the downloaded chunk; each line is one JSON document.
with open('/tmp/chunk.jsonl') as f:
    lines = f.readlines()

start = 27 * 30  # part number × batch size
end = min(start + 30, len(lines))
docs = [json.loads(line) for line in lines[start:end]]

payload = {
    "documents": docs,
    "tensorFields": ["variantImageMultimodal"],  # from index add_docs_config
    "useExistingTensors": True,
}
with open('/tmp/replay.json', 'w') as f:
    json.dump(payload, f)
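To avoid recomputing offsets by hand for each failure, the slice can be derived from the log label directly. This sketch assumes labels have the form shown earlier (chunk_0001.jsonl#part#27) and that part numbers are zero-based; both helper names are hypothetical.

```python
def parse_subchunk_label(label):
    """Split a label like chunk_0001.jsonl#part#27 into (chunk_file, part_number)."""
    chunk_file, separator, part = label.partition("#part#")
    if not separator or not part.isdigit():
        raise ValueError(f"unexpected sub-chunk label: {label!r}")
    return chunk_file, int(part)


def part_bounds(part, batch_size, total_lines):
    """Return the half-open [start, end) line range for a part, clamped to the file."""
    start = part * batch_size
    if start >= total_lines:
        raise ValueError(f"part {part} starts past the end of a {total_lines}-line chunk")
    return start, min(start + batch_size, total_lines)
```

For example, `part_bounds(27, 30, 1000)` returns `(810, 840)`, which matches `lines[810:840]` in the extraction step.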
Step 3: Get the index config
Find the Marqo endpoint and tensorFields from the index settings:
aws dynamodb scan \
--profile controller --region us-east-1 \
--table-name prod-EcomIndexSettingsTable \
--filter-expression "contains(sk, :idx)" \
--expression-attribute-values '{":idx": {"S": "{index_name}"}}' \
--projection-expression "pk, sk, add_docs_config, index_endpoint, dp_index_endpoint" \
--output json
Step 4: Replay the request
curl -s -X POST "{marqo_endpoint}/indexes/{index_name}/documents" \
-H "x-api-key: {api_key}" \
-H "Content-Type: application/json" \
-d @/tmp/replay.json | python3 -m json.tool
The response body will show the actual Marqo error (e.g., field limit exceeded or invalid document format).
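When only a few documents in the batch are rejected, it helps to list just those. The sketch below assumes the response carries an items list whose entries have _id, status, and an error or message field, and that a request rejected as a whole returns a top-level message instead. That shape is an assumption and should be checked against the Marqo version in use.

```python
def failed_documents(response):
    """Return (doc_id, status, reason) tuples for rejected documents."""
    items = response.get("items")
    if items is None:
        # Assumed shape for a request rejected as a whole: a top-level message.
        return [(None, None, response.get("message") or response.get("error"))]
    failures = []
    for item in items:
        status = item.get("status", 200)
        if status >= 400:
            reason = item.get("error") or item.get("message")
            failures.append((item.get("_id"), status, reason))
    return failures
```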
5. Unstick a Job
Cancel a stuck job
If a job is stuck in IN_PROGRESS or PENDING with no SQS messages left:
aws dynamodb update-item \
--profile controller --region us-east-1 \
--table-name prod-EcomIndexerJobsTable \
--key '{
"pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"},
"sk": {"S": "JOB#{created_at}#{job_id}"}
}' \
--update-expression "SET job_status = :s, status_created_at = :sc, completed_at = :ca, error_message = :em" \
--expression-attribute-values '{
":s": {"S": "CANCELLED"},
":sc": {"S": "CANCELLED#{created_at}"},
":em": {"S": "Manually cancelled: {reason}"},
":ca": {"S": "{now_iso}"}
}'
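The attribute values for the cancel command can be generated so that {now_iso} is filled in consistently. This is a sketch only: the helper name is illustrative, and the timestamp format written by the indexer is not stated here, so the `isoformat()` output (with a +00:00 offset) is an assumption to verify against an existing record.

```python
import json
from datetime import datetime, timezone


def cancel_values(created_at, reason, now=None):
    """Build the --expression-attribute-values map for a manual cancellation."""
    now = now or datetime.now(timezone.utc)
    return {
        ":s": {"S": "CANCELLED"},
        ":sc": {"S": f"CANCELLED#{created_at}"},
        ":em": {"S": f"Manually cancelled: {reason}"},
        ":ca": {"S": now.isoformat()},  # assumed timestamp format
    }


if __name__ == "__main__":
    # Placeholder inputs; paste the printed JSON into the CLI command.
    print(json.dumps(cancel_values("2026-01-01T00:00:00Z", "no SQS messages left")))
```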
Check SQS queue depth
Verify whether messages are still being processed:
aws sqs get-queue-attributes \
--profile controller --region us-east-1 \
--queue-url "https://sqs.us-east-1.amazonaws.com/023568249301/prod-{shop_id}-doc-queue" \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible ApproximateNumberOfMessagesDelayed
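A queue is only safe to treat as drained when all three counters are zero, since in-flight and delayed messages do not appear in ApproximateNumberOfMessages. A minimal sketch for checking the CLI output follows; the function names are illustrative.

```python
DEPTH_ATTRIBUTES = (
    "ApproximateNumberOfMessages",
    "ApproximateNumberOfMessagesNotVisible",
    "ApproximateNumberOfMessagesDelayed",
)


def queue_depth(cli_output):
    """Return the three depth counters as integers (SQS reports them as strings)."""
    attributes = cli_output.get("Attributes", {})
    return {name: int(attributes.get(name, 0)) for name in DEPTH_ATTRIBUTES}


def is_drained(cli_output):
    """True when no messages are visible, in flight, or delayed."""
    return sum(queue_depth(cli_output).values()) == 0
```

The counters are approximate, so it is worth sampling a couple of times before cancelling a job on the basis of an empty queue.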
6. Manually Trigger Bulk Finish Processing
If the bulk_operations/finish webhook was missed or the webhook worker failed, you can manually invoke the worker Lambda with a synthetic BULK_FINISH message. This works as long as the Shopify download URL hasn't expired (~7 days from export completion).
Prerequisites
- Confirm the Shopify bulk operation is COMPLETED (see Section 3)
- Get the shop_id, system_account_id, index_name, and shopify_domain for the shop
- Get the admin_graphql_api_id (the bulk operation GID) from the job record or Shopify query
Invoke the webhook worker
aws lambda invoke \
--profile controller --region us-east-1 \
--function-name prod-ShopifyWebhookWorker \
--invocation-type Event \
--cli-binary-format raw-in-base64-out \
--payload '{
"type": "BULK_FINISH",
"shop_id": "{system_account_id}-{index_name}",
"admin_graphql_api_id": "gid://shopify/BulkOperation/{bulk_op_id}",
"status": "completed",
"shopify_domain": "{shop}.myshopify.com",
"system_account_id": "{system_account_id}",
"index_name": "{index_name}"
}' \
/dev/null
The worker will:
- Find the matching PENDING BULK job by admin_graphql_api_id
- Query Shopify for the download URL
- Stream the bulk export file, chunk it, and enqueue the chunks to SQS
The indexer Lambdas then process the chunks as normal.
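The payload for the invoke command above repeats the same identifiers in several fields, which makes hand-editing error-prone. A sketch that assembles it from the four inputs follows; the function name is hypothetical, and the field set simply mirrors the example payload.

```python
import json


def bulk_finish_payload(system_account_id, index_name, shopify_domain, bulk_op_id):
    """Build the synthetic BULK_FINISH message for the webhook worker."""
    return {
        "type": "BULK_FINISH",
        "shop_id": f"{system_account_id}-{index_name}",
        "admin_graphql_api_id": f"gid://shopify/BulkOperation/{bulk_op_id}",
        "status": "completed",
        "shopify_domain": shopify_domain,
        "system_account_id": system_account_id,
        "index_name": index_name,
    }


if __name__ == "__main__":
    # Placeholder values; pass the printed JSON to --payload.
    print(json.dumps(bulk_finish_payload("acct1", "products", "example.myshopify.com", "123")))
```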
Notes
- The Shopify download URL is a signed GCS URL that expires ~7 days after export completion
- If expired, you'll need to re-trigger the bulk export from the Shopify app UI
- The invocation-type Event makes it async; check the webhook worker logs for progress
- If the job lookup fails ("No job found"), verify the job exists in DDB with the correct bulk_operation_id in platform_data
7. Common Failure Patterns
| Symptom | Cause | Fix |
|---|---|---|
| Job stuck in PENDING, 0/0 items | bulk_operations/finish webhook never received | Check Shopify bulk op status; re-trigger sync |
| Job thrashing FAILED ↔ IN_PROGRESS | Per-chunk error triggers full job failure + concurrent redrive reset | Fixed in PR #2331 + #2332 |
| HTTP request failed with status 400 | Marqo rejecting docs (field limits, invalid data) | Replay sub-chunk to get error body |
| Image download timeout for N docs | Marqo can't download product images | Transient; retried via SQS. After max retries, recorded as failed items |
| Retrieved 0 existing documents then 400 | Pre-write readback finds no existing docs (normal for new products); 400 is from Marqo rejecting the write for a different reason | Replay the sub-chunk |
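The log-message symptoms in the table can be matched mechanically when scanning many log lines. This is a rough sketch that assumes the indexer logs contain the symptom strings verbatim; the actual message formats may differ, and the names here are illustrative.

```python
import re

# (pattern, suggested action) pairs taken from the table above.
LOG_PATTERNS = [
    (re.compile(r"HTTP request failed with status 400"),
     "Replay the sub-chunk to get the Marqo error body"),
    (re.compile(r"Image download timeout for \d+ docs"),
     "Transient; retried via SQS, recorded as failed items after max retries"),
    (re.compile(r"Retrieved 0 existing documents"),
     "Normal for new products; look for a following 400 and replay the sub-chunk"),
]


def triage(log_line):
    """Return the suggested actions whose pattern appears in the log line."""
    return [action for pattern, action in LOG_PATTERNS if pattern.search(log_line)]
```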