Debugging Indexer Failures

Guide for investigating ecom indexer job failures using CloudWatch logs, DynamoDB, and S3 chunk replay.

Prerequisites

AWS CLI configured with the controller profile (prod account)
Access to CloudWatch Logs, DynamoDB, and S3 in us-east-1

1. Check the Job Record

Query the job from DynamoDB to see current status and progress:

aws dynamodb get-item \
  --profile controller --region us-east-1 \
  --table-name prod-EcomIndexerJobsTable \
  --key '{
    "pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"},
    "sk": {"S": "JOB#{created_at}#{job_id}"}
  }' \
  --projection-expression "job_status, error_message, failed_items, failed_items_details, processed_items, total_items, skipped_items, conflict_items, last_updated" \
  --output json

Key fields:

pk: PLATFORM#shopify#SHOP#{shop_id}, where shop_id is {system_account_id}-{index_name}
sk: JOB#{created_at_iso}#{job_id}
job_status: PENDING, IN_PROGRESS, COMPLETED, FAILED, CANCELLED
failed_items_details: Map of error reasons to affected doc IDs
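Assembling the key by hand is an easy place to slip, so a small helper can build it from its parts. This is a minimal sketch that follows the key layout listed above; the function name and the sample values are hypothetical.

```python
import json


def job_key(system_account_id, index_name, created_at_iso, job_id):
    """Build the DynamoDB key for a job record in the typed JSON the CLI expects."""
    # shop_id is the system account ID and index name joined by a hyphen.
    shop_id = f"{system_account_id}-{index_name}"
    return {
        "pk": {"S": f"PLATFORM#shopify#SHOP#{shop_id}"},
        "sk": {"S": f"JOB#{created_at_iso}#{job_id}"},
    }


# Hypothetical values, for illustration only.
print(json.dumps(job_key("acct123", "products", "2026-01-01T00:00:00Z", "job-abc")))
```

The printed JSON can be passed to `--key` in the get-item call.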

To find the latest job for a shop:

aws dynamodb query \
  --profile controller --region us-east-1 \
  --table-name prod-EcomIndexerJobsTable \
  --key-condition-expression "pk = :pk" \
  --expression-attribute-values '{":pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"}}' \
  --no-scan-index-forward --max-items 1 \
  --output json
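Once the item is in hand, the counters can be condensed into a one-line progress summary. The sketch below assumes the counters are stored as DynamoDB numbers (`{"N": "..."}`) and treats missing counters as 0; the function name is hypothetical.

```python
def job_progress(item):
    """Summarize a job item taken from get-item ("Item") or query ("Items"[0]) output."""

    def num(name):
        # Assumption: counters are DynamoDB numbers; absent counters count as 0.
        return int(item.get(name, {}).get("N", "0"))

    total = num("total_items")
    processed = num("processed_items")
    return {
        "status": item.get("job_status", {}).get("S"),
        "processed": processed,
        "failed": num("failed_items"),
        "total": total,
        # A job with 0 total items (for example one stuck in PENDING) has no percentage.
        "percent": round(100.0 * processed / total, 1) if total else None,
    }
```

A result with `total` 0 and status PENDING matches the "stuck in PENDING, 0/0 items" pattern in Section 7.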

2. Check CloudWatch Logs

Key log groups

Log Group	Contains
`/aws/lambda/prod-ShopifyAppAdminFunction`	Webhook receipt, bulk sync triggers, job creation
`/aws/lambda/prod-ShopifyWebhookWorker`	Bulk finish processing, chunking, SQS enqueueing
`/aws/lambda/prod-EcomIndexerFunction`	Document processing, Marqo requests, errors

Search for a specific job

# Note: `date -j -f` is BSD/macOS syntax; on GNU/Linux use: date -u -d "2026-01-01T00:00:00" +%s
aws logs start-query \
  --profile controller --region us-east-1 \
  --log-group-name "/aws/lambda/prod-EcomIndexerFunction" \
  --start-time $(date -u -j -f "%Y-%m-%dT%H:%M:%S" "2026-01-01T00:00:00" "+%s") \
  --end-time $(date -u -j -f "%Y-%m-%dT%H:%M:%S" "2026-01-02T00:00:00" "+%s") \
  --query-string 'fields @timestamp, @message
    | filter @message like /{job_id}/
      and (@message like /ERROR/ or @message like /status/ or @message like /FAILED/)
    | sort @timestamp asc
    | limit 50'

# start-query prints a queryId; pass it to get-query-results once the query has finished:
aws logs get-query-results --profile controller --region us-east-1 --query-id "{query_id}"
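The output of get-query-results nests every row as a list of field/value pairs, which is awkward to scan by eye. Here is a minimal sketch of two helpers: one computes the epoch seconds that `--start-time` and `--end-time` expect without relying on a platform-specific `date`, and one flattens the rows. Both function names are hypothetical.

```python
from datetime import datetime, timezone


def epoch_seconds(iso_utc):
    """Convert a naive ISO timestamp, interpreted as UTC, to epoch seconds."""
    return int(datetime.fromisoformat(iso_utc).replace(tzinfo=timezone.utc).timestamp())


def flatten_results(response):
    """Turn get-query-results JSON into one dict per log row, dropping the @ptr field."""
    return [
        {cell["field"]: cell["value"] for cell in row if cell["field"] != "@ptr"}
        for row in response.get("results", [])
    ]
```

Check that the response `status` is `Complete` before reading the rows; a query that is still running returns partial or empty results.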

Search for bulk operation webhooks

aws logs start-query \
  --profile controller --region us-east-1 \
  --log-group-name "/aws/lambda/prod-ShopifyAppAdminFunction" \
  --start-time ... --end-time ... \
  --query-string 'fields @timestamp, @message
    | filter @message like /bulk_operations/ and @message like /{shop_domain}/
    | sort @timestamp asc | limit 50'

3. Check Shopify Bulk Operation Status

If a bulk sync job is stuck in PENDING, check whether Shopify completed the export:

# Get access token from sessions table
ACCESS_TOKEN=$(aws dynamodb query \
  --profile controller --region us-east-1 \
  --table-name prod-ShopifyEntitiesTable \
  --key-condition-expression "pk = :pk AND begins_with(sk, :sk)" \
  --expression-attribute-values '{
    ":pk": {"S": "SHOP#{shop_domain}"},
    ":sk": {"S": "USER#"}
  }' \
  --projection-expression "access_token" \
  --output json | python3 -c "import json,sys; print(json.load(sys.stdin)['Items'][0]['access_token']['S'])")

# Query Shopify for bulk operation status
curl -s "https://{shop_domain}/admin/api/2024-10/graphql.json" \
  -H "X-Shopify-Access-Token: $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "query { node(id: \"{bulk_operation_gid}\") { ... on BulkOperation { id status errorCode createdAt completedAt fileSize url objectCount } } }"}' \
  | python3 -m json.tool
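The status in the response decides what to do next. The sketch below maps it to a suggested action; the status names are those of Shopify's BulkOperationStatus enum, while the function name and the advice strings are illustrative rather than part of any tooling.

```python
IN_FLIGHT = {"CREATED", "RUNNING", "CANCELING"}


def bulk_op_next_step(response):
    """Suggest a next step from the parsed GraphQL response of the query above."""
    node = (response.get("data") or {}).get("node")
    if not node:
        return "No BulkOperation returned: check the GID, shop domain, and access token"
    status = node.get("status")
    if status == "COMPLETED":
        if node.get("url"):
            return "Export finished: the finish webhook was likely missed, see Section 6"
        return "Export finished without a download URL: there may be nothing to index"
    if status in IN_FLIGHT:
        return "Shopify is still working: wait and query again"
    # FAILED, CANCELED, and EXPIRED all end up here.
    return f"Export ended as {status}: re-trigger the sync"
```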

4. Replay Failing Sub-chunks Against Marqo

When the indexer logs show a sub-chunk failed (e.g., chunk_0001.jsonl#part#27 marked failed due to fatal_error), replay it against Marqo to get the actual error.

Step 1: Download the chunk from S3

The S3 path is: s3://prod-ecom-product-data-bucket/{shop_id}//bulk/{job_id}/chunk_{N}.jsonl

aws s3 cp \
  "s3://prod-ecom-product-data-bucket/{shop_id}//bulk/{job_id}/chunk_0001.jsonl" \
  /tmp/chunk.jsonl --profile controller

Step 2: Extract the failing sub-chunk

The part#N suffix in the log identifies the Nth batch of docs, counting from 0 (batch size = progress_step, typically 30). Part 27 therefore covers zero-based lines 810-839 (27×30 through 28×30-1).

import json

with open('/tmp/chunk.jsonl') as f:
    lines = f.readlines()

start = 27 * 30  # part number × batch size
end = min(start + 30, len(lines))
docs = [json.loads(l) for l in lines[start:end]]

payload = {
    "documents": docs,
    "tensorFields": ["variantImageMultimodal"],  # from index add_docs_config
    "useExistingTensors": True,
}

with open('/tmp/replay.json', 'w') as f:
    json.dump(payload, f)
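To avoid redoing the arithmetic for every failure, the chunk file and line range can be derived straight from the log marker. This is a sketch assuming the marker always has the `chunk_NNNN.jsonl#part#N` form quoted above and that parts count from 0; the function name is hypothetical.

```python
import re

MARKER = re.compile(r"(?P<chunk>chunk_\d+\.jsonl)#part#(?P<part>\d+)")


def locate_subchunk(log_text, batch_size=30):
    """Return (chunk_file, start, end) for the first marker in log_text.

    start and end are zero-based line indices suitable for lines[start:end].
    """
    match = MARKER.search(log_text)
    if match is None:
        raise ValueError("no chunk_*.jsonl#part#N marker found")
    start = int(match.group("part")) * batch_size
    return match.group("chunk"), start, start + batch_size
```

Slicing clamps automatically, so `lines[start:end]` stays safe when the last part of a chunk is shorter than the batch size.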

Step 3: Get the index config

Find the Marqo endpoint and tensorFields from the index settings:

aws dynamodb scan \
  --profile controller --region us-east-1 \
  --table-name prod-EcomIndexSettingsTable \
  --filter-expression "contains(sk, :idx)" \
  --expression-attribute-values '{":idx": {"S": "{index_name}"}}' \
  --projection-expression "pk, sk, add_docs_config, index_endpoint, dp_index_endpoint" \
  --output json
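The scan returns attributes in DynamoDB's typed JSON (`{"S": ...}`, `{"M": ...}` and so on), which makes nested settings hard to read. The following is a small sketch of a converter to plain Python values. How `add_docs_config` is actually laid out inside the item is not documented here, so treat any nested shape you see in examples as an assumption.

```python
def unmarshal(value):
    """Convert one DynamoDB typed-JSON value into a plain Python value."""
    (tag, inner), = value.items()
    if tag in ("S", "BOOL", "B"):
        return inner
    if tag == "N":
        # Numbers arrive as strings; keep integers as int where possible.
        return float(inner) if any(c in inner for c in ".eE") else int(inner)
    if tag == "NULL":
        return None
    if tag == "L":
        return [unmarshal(v) for v in inner]
    if tag == "M":
        return {k: unmarshal(v) for k, v in inner.items()}
    if tag in ("SS", "NS", "BS"):
        return list(inner)  # set members are left as the strings DynamoDB returned
    raise ValueError(f"unknown DynamoDB type tag: {tag}")


def unmarshal_item(item):
    """Convert a whole item, such as one entry of "Items" in the scan output."""
    return {name: unmarshal(value) for name, value in item.items()}
```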

Step 4: Replay the request

curl -s -X POST "{marqo_endpoint}/indexes/{index_name}/documents" \
  -H "x-api-key: {api_key}" \
  -H "Content-Type: application/json" \
  -d @/tmp/replay.json | python3 -m json.tool

The response body will show the actual Marqo error (e.g., field limit exceeded, invalid document format, etc.).
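When many documents fail at once, grouping them by error text gives the same shape as failed_items_details in the job record. The sketch below assumes the response carries a per-document `items` list with `_id`, `status`, and an `error` or `message` field, and that a request-level rejection carries a top-level `message` instead. Verify both against the response you actually get back; the function name is hypothetical.

```python
from collections import defaultdict


def group_errors(response):
    """Map each error reason to the list of affected document IDs."""
    if "items" not in response:
        # Assumed request-level failure: the whole batch was rejected.
        return {response.get("message", "unknown error"): []}
    grouped = defaultdict(list)
    for item in response["items"]:
        status = item.get("status", 200)
        if status >= 400:
            reason = item.get("error") or item.get("message") or f"status {status}"
            grouped[reason].append(item.get("_id"))
    return dict(grouped)
```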

5. Unstick a Job

Cancel a stuck job

If a job is stuck in IN_PROGRESS or PENDING and its SQS queue is empty (see Check SQS queue depth below), mark it CANCELLED:

aws dynamodb update-item \
  --profile controller --region us-east-1 \
  --table-name prod-EcomIndexerJobsTable \
  --key '{
    "pk": {"S": "PLATFORM#shopify#SHOP#{shop_id}"},
    "sk": {"S": "JOB#{created_at}#{job_id}"}
  }' \
  --update-expression "SET job_status = :s, status_created_at = :sc, completed_at = :ca, error_message = :em" \
  --expression-attribute-values '{
    ":s": {"S": "CANCELLED"},
    ":sc": {"S": "CANCELLED#{created_at}"},
    ":em": {"S": "Manually cancelled: {reason}"},
    ":ca": {"S": "{now_iso}"}
  }'
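The attribute values are easier to get right when generated than when typed inline, especially the timestamp. Below is a minimal sketch; it assumes an ISO 8601 UTC timestamp is acceptable for completed_at (match whatever format existing records use), and the function name is hypothetical.

```python
import json
from datetime import datetime, timezone


def cancel_values(created_at, reason, now=None):
    """Build the --expression-attribute-values JSON for the cancel update."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        ":s": {"S": "CANCELLED"},
        ":sc": {"S": f"CANCELLED#{created_at}"},
        ":em": {"S": f"Manually cancelled: {reason}"},
        # Assumption: completed_at holds an ISO 8601 string with a UTC offset.
        ":ca": {"S": now.isoformat()},
    })
```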

Check SQS queue depth

Verify whether messages are still being processed:

aws sqs get-queue-attributes \
  --profile controller --region us-east-1 \
  --queue-url "https://sqs.us-east-1.amazonaws.com/023568249301/prod-{shop_id}-doc-queue" \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible ApproximateNumberOfMessagesDelayed
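The three counters come back as strings under `Attributes`, and the queue is only drained when all of them are zero. A short sketch of that check follows, with hypothetical function names. The counts are approximate by design, so it is worth sampling more than once before cancelling a job.

```python
COUNTERS = (
    "ApproximateNumberOfMessages",
    "ApproximateNumberOfMessagesNotVisible",
    "ApproximateNumberOfMessagesDelayed",
)


def queue_depth(response):
    """Parse get-queue-attributes JSON into integer counters (missing ones count as 0)."""
    attrs = response.get("Attributes", {})
    return {name: int(attrs.get(name, "0")) for name in COUNTERS}


def is_drained(response):
    """True when no messages are waiting, in flight, or delayed."""
    return sum(queue_depth(response).values()) == 0
```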

6. Manually Trigger Bulk Finish Processing

If the bulk_operations/finish webhook was missed or the webhook worker failed, you can manually invoke the worker Lambda with a synthetic BULK_FINISH message. This works as long as the Shopify download URL hasn't expired (~7 days from export completion).

Prerequisites

Confirm the Shopify bulk operation is COMPLETED (see Section 3)
Get the shop_id, system_account_id, index_name, and shopify_domain for the shop
Get the admin_graphql_api_id (the bulk operation GID) from the job record or Shopify query

Invoke the webhook worker

aws lambda invoke \
  --profile controller --region us-east-1 \
  --function-name prod-ShopifyWebhookWorker \
  --invocation-type Event \
  --cli-binary-format raw-in-base64-out \
  --payload '{
    "type": "BULK_FINISH",
    "shop_id": "{system_account_id}-{index_name}",
    "admin_graphql_api_id": "gid://shopify/BulkOperation/{bulk_op_id}",
    "status": "completed",
    "shopify_domain": "{shop}.myshopify.com",
    "system_account_id": "{system_account_id}",
    "index_name": "{index_name}"
  }' \
  /dev/null
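Because shop_id repeats the account ID and index name, generating the payload avoids a mismatch between the fields. This sketch mirrors the payload above field for field; the function name and the sample arguments are hypothetical.

```python
import json


def bulk_finish_payload(system_account_id, index_name, bulk_op_id, shop):
    """Build the synthetic BULK_FINISH message as a JSON string for --payload."""
    return json.dumps({
        "type": "BULK_FINISH",
        "shop_id": f"{system_account_id}-{index_name}",
        "admin_graphql_api_id": f"gid://shopify/BulkOperation/{bulk_op_id}",
        "status": "completed",
        "shopify_domain": f"{shop}.myshopify.com",
        "system_account_id": system_account_id,
        "index_name": index_name,
    })


# Hypothetical values, for illustration only.
print(bulk_finish_payload("acct123", "products", "1234567890", "example-store"))
```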

The worker will:

Find the matching PENDING BULK job by admin_graphql_api_id
Query Shopify for the download URL
Stream the bulk export file, chunk it, and enqueue to SQS
After that, the indexer Lambdas process the chunks as normal

Notes

The Shopify download URL is a signed GCS URL that expires ~7 days after export completion
If expired, you'll need to re-trigger the bulk export from the Shopify app UI
`--invocation-type Event` makes the invocation asynchronous, so check the webhook worker logs for progress
If the job lookup fails ("No job found"), verify the job exists in DDB with the correct bulk_operation_id in platform_data

7. Common Failure Patterns

Symptom	Cause	Fix
Job stuck in PENDING, 0/0 items	`bulk_operations/finish` webhook never received	Check Shopify bulk op status; re-trigger sync
Job thrashing FAILED ↔ IN_PROGRESS	Per-chunk error triggers full job failure + concurrent redrive reset	Fixed in PR #2331 + #2332
`HTTP request failed with status 400`	Marqo rejecting docs (field limits, invalid data)	Replay sub-chunk to get error body
`Image download timeout for N docs`	Marqo can't download product images	Transient; retried via SQS. After max retries, recorded as failed items
`Retrieved 0 existing documents` then 400	Pre-write readback finds no existing docs (normal for new products); 400 is from Marqo rejecting the write for a different reason	Replay the sub-chunk
