Debugging Indexer Failures

Guide for investigating ecom indexer job failures using CloudWatch logs, DynamoDB, and S3 chunk replay.

Prerequisites

  • AWS CLI configured with the controller profile (prod account)
  • Access to CloudWatch Logs, DynamoDB, and S3 in us-east-1

1. Check the Job Record

Query the job from DynamoDB to see current status and progress:

aws dynamodb get-item \
--profile controller --region us-east-1 \
--table-name prod-EcomIndexerJobsTable \
--key '{
"pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"},
"sk": {"S": "JOB#{created_at}#{job_id}"}
}' \
--projection-expression "job_status, error_message, failed_items, failed_items_details, processed_items, total_items, skipped_items, conflict_items, last_updated" \
--output json

Key fields:

  • pk: PLATFORM#shopify#SHOP#{shop_id}, where shop_id is {system_account_id}-{index_name}
  • sk: JOB#{created_at_iso}#{job_id}
  • job_status: PENDING, IN_PROGRESS, COMPLETED, FAILED, CANCELLED
  • failed_items_details: Map of error reasons to affected doc IDs
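
The key format is easy to mistype by hand, so a small helper can assemble the --key JSON for the commands in this section. This is a minimal sketch, assuming shop_id is {system_account_id}-{index_name} as listed above; the function name build_job_key and the sample values are hypothetical.

```python
import json


def build_job_key(system_account_id: str, index_name: str,
                  created_at_iso: str, job_id: str) -> str:
    """Return the --key JSON for a job record (hypothetical helper)."""
    # Assumption: shop_id is "{system_account_id}-{index_name}", per the key fields above.
    shop_id = f"{system_account_id}-{index_name}"
    key = {
        "pk": {"S": f"PLATFORM#shopify#SHOP#{shop_id}"},
        "sk": {"S": f"JOB#{created_at_iso}#{job_id}"},
    }
    return json.dumps(key)


if __name__ == "__main__":
    # Sample values are placeholders, not real identifiers.
    print(build_job_key("acct123", "products", "2026-01-01T00:00:00Z", "job-abc"))
```

The printed string can be pasted as the value of --key in the get-item or update-item commands.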

To find the latest job for a shop:

aws dynamodb query \
--profile controller --region us-east-1 \
--table-name prod-EcomIndexerJobsTable \
--key-condition-expression "pk = :pk" \
--expression-attribute-values '{":pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"}}' \
--no-scan-index-forward --max-items 1 \
--output json

2. Check CloudWatch Logs

Key log groups

Log Group | Contains
/aws/lambda/prod-ShopifyAppAdminFunction | Webhook receipt, bulk sync triggers, job creation
/aws/lambda/prod-ShopifyWebhookWorker | Bulk finish processing, chunking, SQS enqueueing
/aws/lambda/prod-EcomIndexerFunction | Document processing, Marqo requests, errors

Search for a specific job

aws logs start-query \
--profile controller --region us-east-1 \
--log-group-name "/aws/lambda/prod-EcomIndexerFunction" \
--start-time $(date -u -j -f "%Y-%m-%dT%H:%M:%S" "2026-01-01T00:00:00" "+%s") \
--end-time $(date -u -j -f "%Y-%m-%dT%H:%M:%S" "2026-01-02T00:00:00" "+%s") \
--query-string 'fields @timestamp, @message
| filter @message like /{job_id}/
and (@message like /ERROR/ or @message like /status/ or @message like /FAILED/)
| sort @timestamp asc
| limit 50'

# Then retrieve results:
aws logs get-query-results --profile controller --region us-east-1 --query-id "{query_id}"
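
The date -u -j -f form used for --start-time and --end-time above is BSD/macOS syntax and is not accepted by GNU date on Linux. As a portable alternative, the epoch seconds can be computed in Python. A minimal sketch; the helper name to_epoch_seconds is hypothetical, and timestamps are assumed to be in UTC.

```python
from datetime import datetime, timezone


def to_epoch_seconds(ts: str) -> int:
    """Convert a UTC timestamp such as 2026-01-01T00:00:00 to epoch seconds."""
    # Parse the naive timestamp, then mark it explicitly as UTC.
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


if __name__ == "__main__":
    print(to_epoch_seconds("2026-01-01T00:00:00"))  # value for --start-time
    print(to_epoch_seconds("2026-01-02T00:00:00"))  # value for --end-time
```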

Search for bulk operation webhooks

aws logs start-query \
--profile controller --region us-east-1 \
--log-group-name "/aws/lambda/prod-ShopifyAppAdminFunction" \
--start-time ... --end-time ... \
--query-string 'fields @timestamp, @message
| filter @message like /bulk_operations/ and @message like /{shop_domain}/
| sort @timestamp asc | limit 50'

3. Check Shopify Bulk Operation Status

If a bulk sync job is stuck in PENDING, check whether Shopify completed the export:

# Get access token from sessions table
ACCESS_TOKEN=$(aws dynamodb query \
--profile controller --region us-east-1 \
--table-name prod-ShopifyEntitiesTable \
--key-condition-expression "pk = :pk AND begins_with(sk, :sk)" \
--expression-attribute-values '{
":pk": {"S": "SHOP#{shop_domain}"},
":sk": {"S": "USER#"}
}' \
--projection-expression "access_token" \
--output json | python3 -c "import json,sys; print(json.load(sys.stdin)['Items'][0]['access_token']['S'])")

# Query Shopify for bulk operation status
curl -s "https://{shop_domain}/admin/api/2024-10/graphql.json" \
-H "X-Shopify-Access-Token: $ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"query": "query { node(id: \"{bulk_operation_gid}\") { ... on BulkOperation { id status errorCode createdAt completedAt fileSize url objectCount } } }"}' \
| python3 -m json.tool
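
To turn the GraphQL response into a next step, the status and completion time can be checked programmatically. The sketch below assumes the response has the data.node shape requested by the query above and uses the roughly 7-day download URL lifetime mentioned in this guide; the function name summarize_bulk_operation is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: about 7 days, the URL expiry window stated in this guide.
URL_LIFETIME = timedelta(days=7)


def summarize_bulk_operation(response: dict, now: datetime) -> str:
    """Describe the bulk operation state from the GraphQL response (hypothetical helper)."""
    node = (response.get("data") or {}).get("node")
    if not node:
        return "not found: check the bulk operation GID"
    status = node.get("status")
    if status != "COMPLETED":
        return f"not ready: status={status} errorCode={node.get('errorCode')}"
    # completedAt is assumed to be ISO 8601 with a trailing Z.
    completed_at = datetime.fromisoformat(node["completedAt"].replace("Z", "+00:00"))
    if now - completed_at > URL_LIFETIME:
        return "completed, but the download URL has probably expired"
    return f"completed: {node.get('objectCount')} objects, URL should still be valid"


if __name__ == "__main__":
    sample = {"data": {"node": {"status": "RUNNING", "errorCode": None}}}
    print(summarize_bulk_operation(sample, datetime.now(timezone.utc)))
```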

4. Replay Failing Sub-chunks Against Marqo

When the indexer logs show a sub-chunk failed (e.g., chunk_0001.jsonl#part#27 marked failed due to fatal_error), replay it against Marqo to get the actual error.

Step 1: Download the chunk from S3

The S3 path is: s3://prod-ecom-product-data-bucket/{shop_id}//bulk/{job_id}/chunk_{N}.jsonl, where {N} is the zero-padded chunk number shown in the log (for example, chunk_0001.jsonl).

aws s3 cp \
"s3://prod-ecom-product-data-bucket/{shop_id}//bulk/{job_id}/chunk_0001.jsonl" \
/tmp/chunk.jsonl --profile controller

Step 2: Extract the failing sub-chunk

The part#N in the log corresponds to the Nth batch of docs, counting from zero (batch size = progress_step, typically 30). Part 27 therefore covers zero-based lines 810-839 of the chunk file (27×30 to 28×30-1).

import json

with open('/tmp/chunk.jsonl') as f:
    lines = f.readlines()

start = 27 * 30  # part number × batch size
end = min(start + 30, len(lines))
docs = [json.loads(l) for l in lines[start:end]]

payload = {
    "documents": docs,
    "tensorFields": ["variantImageMultimodal"],  # from index add_docs_config
    "useExistingTensors": True,
}

with open('/tmp/replay.json', 'w') as f:
    json.dump(payload, f)
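
The slice arithmetic can also be derived directly from the label in the log line, which avoids hand-computing offsets for each failed part. A minimal sketch, assuming labels always look like chunk_0001.jsonl#part#27, parts are numbered from zero, and the batch size is 30 unless progress_step says otherwise; the function name locate_sub_chunk is hypothetical.

```python
import re


def locate_sub_chunk(label: str, batch_size: int = 30) -> tuple:
    """Map a label like 'chunk_0001.jsonl#part#27' to (file, start, end).

    start and end are zero-based line indices, with end exclusive,
    so they can be used directly as lines[start:end].
    """
    match = re.fullmatch(r"(chunk_\d+\.jsonl)#part#(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognised sub-chunk label: {label!r}")
    part = int(match.group(2))
    start = part * batch_size
    return match.group(1), start, start + batch_size


if __name__ == "__main__":
    print(locate_sub_chunk("chunk_0001.jsonl#part#27"))
```

Slicing past the end of the file is safe in Python, so the exclusive end does not need clamping when it is used as lines[start:end].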

Step 3: Get the index config

Find the Marqo endpoint and tensorFields from the index settings:

aws dynamodb scan \
--profile controller --region us-east-1 \
--table-name prod-EcomIndexSettingsTable \
--filter-expression "contains(sk, :idx)" \
--expression-attribute-values '{":idx": {"S": "{index_name}"}}' \
--projection-expression "pk, sk, add_docs_config, index_endpoint, dp_index_endpoint" \
--output json

Step 4: Replay the request

curl -s -X POST "{marqo_endpoint}/indexes/{index_name}/documents" \
-H "x-api-key: {api_key}" \
-H "Content-Type: application/json" \
-d @/tmp/replay.json | python3 -m json.tool

The response body will show the actual Marqo error (e.g., field limit exceeded, invalid document format, etc.).
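
When a replay returns many per-document results, grouping them by error message makes the dominant cause easier to spot. The sketch below assumes the response body contains an items list whose entries carry a numeric status and an error or message string; that shape is an assumption about Marqo's add-documents response and should be checked against the Marqo version in use. The function name summarize_marqo_response is hypothetical.

```python
from collections import Counter


def summarize_marqo_response(body: dict) -> Counter:
    """Count failed documents per error message (hypothetical helper)."""
    counts = Counter()
    # Assumption: each entry in "items" describes one document.
    for item in body.get("items", []):
        status = item.get("status", 0)
        if status >= 400:
            reason = item.get("error") or item.get("message") or f"status {status}"
            counts[reason] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample body for illustration only.
    sample = {
        "errors": True,
        "items": [
            {"_id": "a", "status": 200},
            {"_id": "b", "status": 400, "error": "example error"},
            {"_id": "c", "status": 400, "error": "example error"},
        ],
    }
    for reason, count in summarize_marqo_response(sample).most_common():
        print(count, reason)
```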

5. Unstick a Job

Cancel a stuck job

If a job is stuck in IN_PROGRESS or PENDING with no SQS messages left:

aws dynamodb update-item \
--profile controller --region us-east-1 \
--table-name prod-EcomIndexerJobsTable \
--key '{
"pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"},
"sk": {"S": "JOB#{created_at}#{job_id}"}
}' \
--update-expression "SET job_status = :s, status_created_at = :sc, completed_at = :ca, error_message = :em" \
--expression-attribute-values '{
":s": {"S": "CANCELLED"},
":sc": {"S": "CANCELLED#{created_at}"},
":em": {"S": "Manually cancelled: {reason}"},
":ca": {"S": "{now_iso}"}
}'
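
The {now_iso} and {reason} placeholders in the command above have to be filled in by hand. A small helper can generate the --expression-attribute-values JSON instead. This is a sketch; the function name build_cancel_values is hypothetical, and the exact ISO format expected for completed_at is an assumption, so compare it with an existing job record before relying on it.

```python
import json
from datetime import datetime, timezone


def build_cancel_values(created_at: str, reason: str, now: datetime) -> str:
    """Return the --expression-attribute-values JSON for a manual cancel."""
    values = {
        ":s": {"S": "CANCELLED"},
        ":sc": {"S": f"CANCELLED#{created_at}"},
        ":em": {"S": f"Manually cancelled: {reason}"},
        # Assumption: completed_at is stored as an ISO 8601 UTC timestamp.
        ":ca": {"S": now.astimezone(timezone.utc).isoformat()},
    }
    return json.dumps(values)


if __name__ == "__main__":
    print(build_cancel_values("2026-01-01T00:00:00Z", "no SQS messages left",
                              datetime.now(timezone.utc)))
```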

Check SQS queue depth

Verify whether messages are still being processed:

aws sqs get-queue-attributes \
--profile controller --region us-east-1 \
--queue-url "https://sqs.us-east-1.amazonaws.com/023568249301/prod-{shop_id}-doc-queue" \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible ApproximateNumberOfMessagesDelayed
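
The command returns an Attributes map whose values are strings. A short check like the following can decide whether the queue looks empty before a job is cancelled. A minimal sketch; the function name queue_is_drained is hypothetical, and because the SQS counts are approximate, a single zero reading is a hint rather than proof.

```python
QUEUE_COUNTERS = (
    "ApproximateNumberOfMessages",
    "ApproximateNumberOfMessagesNotVisible",
    "ApproximateNumberOfMessagesDelayed",
)


def queue_is_drained(response: dict) -> bool:
    """Return True when all three approximate counters are zero."""
    attrs = response.get("Attributes", {})
    # Attribute values arrive as strings, so convert before comparing.
    return all(int(attrs.get(name, "0")) == 0 for name in QUEUE_COUNTERS)


if __name__ == "__main__":
    sample = {"Attributes": {name: "0" for name in QUEUE_COUNTERS}}
    print(queue_is_drained(sample))
```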

6. Manually Trigger Bulk Finish Processing

If the bulk_operations/finish webhook was missed or the webhook worker failed, you can manually invoke the worker Lambda with a synthetic BULK_FINISH message. This works as long as the Shopify download URL hasn't expired (~7 days from export completion).

Prerequisites

  1. Confirm the Shopify bulk operation is COMPLETED (see Section 3)
  2. Get the shop_id, system_account_id, index_name, and shopify_domain for the shop
  3. Get the admin_graphql_api_id (the bulk operation GID) from the job record or Shopify query

Invoke the webhook worker

aws lambda invoke \
--profile controller --region us-east-1 \
--function-name prod-ShopifyWebhookWorker \
--invocation-type Event \
--cli-binary-format raw-in-base64-out \
--payload '{
"type": "BULK_FINISH",
"shop_id": "{system_account_id}-{index_name}",
"admin_graphql_api_id": "gid://shopify/BulkOperation/{bulk_op_id}",
"status": "completed",
"shopify_domain": "{shop}.myshopify.com",
"system_account_id": "{system_account_id}",
"index_name": "{index_name}"
}' \
/dev/null

The worker will:

  1. Find the matching PENDING BULK job by admin_graphql_api_id
  2. Query Shopify for the download URL
  3. Stream the bulk export file, chunk it, and enqueue to SQS
  4. The indexer Lambdas process chunks as normal
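
The synthetic BULK_FINISH payload passed to aws lambda invoke repeats the same identifiers in several fields, so generating it reduces copy-paste mistakes. A minimal sketch that mirrors the field names in the command above; the function name build_bulk_finish_payload and the sample values are hypothetical.

```python
import json


def build_bulk_finish_payload(system_account_id: str, index_name: str,
                              bulk_op_id: str, shop: str) -> str:
    """Return the JSON payload for a synthetic BULK_FINISH message."""
    payload = {
        "type": "BULK_FINISH",
        "shop_id": f"{system_account_id}-{index_name}",
        "admin_graphql_api_id": f"gid://shopify/BulkOperation/{bulk_op_id}",
        "status": "completed",
        "shopify_domain": f"{shop}.myshopify.com",
        "system_account_id": system_account_id,
        "index_name": index_name,
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Sample values are placeholders, not real identifiers.
    print(build_bulk_finish_payload("acct123", "products", "1234567890", "example-shop"))
```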

Notes

  • The Shopify download URL is a signed GCS URL that expires ~7 days after export completion
  • If expired, you'll need to re-trigger the bulk export from the Shopify app UI
  • The --invocation-type Event flag makes the call asynchronous, so the command returns immediately; check the webhook worker logs for progress
  • If the job lookup fails ("No job found"), verify the job exists in DDB with the correct bulk_operation_id in platform_data

7. Common Failure Patterns

Symptom | Cause | Fix
Job stuck in PENDING, 0/0 items | bulk_operations/finish webhook never received | Check Shopify bulk op status; re-trigger sync
Job thrashing FAILED ↔ IN_PROGRESS | Per-chunk error triggers full job failure + concurrent redrive reset | Fixed in PR #2331 + #2332
HTTP request failed with status 400 | Marqo rejecting docs (field limits, invalid data) | Replay sub-chunk to get error body
Image download timeout for N docs | Marqo can't download product images | Transient; retried via SQS. After max retries, recorded as failed items
Retrieved 0 existing documents then 400 | Pre-write readback finds no existing docs (normal for new products); 400 is from Marqo rejecting the write for a different reason | Replay the sub-chunk