FashionNova Queue Age Spike (2026-02-17)
Summary
- Queue: `prod-fashionnova-ecom-prod-queue`
- Spike: `ApproximateAgeOfOldestMessage` reached 912s around 2026-02-17 00:19:00 UTC
- High-confidence cause: exactly 7 stale/retried messages were processed around 00:20:36-00:20:39 UTC, matching the age spike behavior.
- Exact first-failure trigger for those 7 messages remains unproven from current logs.
What Was Verified
- Account/region: controller account 023568249301, us-east-1
- Queue live settings: `VisibilityTimeout=900`, `ReceiveMessageWaitTimeSeconds=20`
- DLQ max receive count: 5
- Event source mapping (`prod-EcomIndexerFunction`): `BatchSize=30`, `MaximumBatchingWindowInSeconds=20`, `ScalingConfig.MaximumConcurrency=15`, `FunctionResponseTypes=[ReportBatchItemFailures]`
Timeline (UTC)
- 00:06 age 132s
- 00:19 age 912s
- 00:20 age 949s
- 00:21 age 22s (reset)
In the same 00:20 minute:
- NumberOfMessagesSent=391
- NumberOfMessagesReceived=398
- Delta +7 receives (consistent with redelivery/retry behavior)
Correlated Jobs (FashionNova)
From prod-EcomIndexerJobsTable for shop c8tw6196-fashionnova-ecom-prod:
- Exactly 7 jobs had lag >600s in 00:00-01:00 UTC (out of 27,705 jobs total).
- The same 7 were the top lags (~958-969s).
- Statuses: 6 CONFLICT, 1 COMPLETED.
Job IDs:
- 50af0e12-4167-499b-a3e7-915f2878c73f
- 950ae038-d797-4d20-a04f-b5b0388c8aca
- 183374f3-54cc-4206-b2d8-520782cade97
- 6967613b-8170-440c-8b17-0f5f61402a9b
- 16d8b0bb-2e94-4102-8e10-4453ca9142af
- 54ba88a8-89bd-4893-b3c1-482b15b3b647
- d199ad46-67af-4090-b808-81bcc2e98160
Observed timing pattern:
- Created and S3-written around 00:04:27-00:04:40
- First observed processing logs around 00:20:36-00:20:39
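The timing pattern lines up with the queue's `VisibilityTimeout=900`: one failed first delivery would keep a message invisible for 900s before redelivery, which matches the ~958-969s lags. A quick arithmetic check (timestamps taken from the worst-lag job; the exact created_at within the 00:04:27-00:04:40 window is illustrative):

```python
from datetime import datetime, timezone

# Worst-lag job: created/S3-written at 00:04:27, first processing log at 00:20:36.
created_at = datetime(2026, 2, 17, 0, 4, 27, tzinfo=timezone.utc)
first_processing = datetime(2026, 2, 17, 0, 20, 36, tzinfo=timezone.utc)
VISIBILITY_TIMEOUT = 900  # queue's live VisibilityTimeout, in seconds

lag = (first_processing - created_at).total_seconds()
print(f"lag={lag:.0f}s")  # lag=969s
# Residual beyond one full invisibility cycle: initial delivery + failed attempt.
print(f"pre-redelivery window={lag - VISIBILITY_TIMEOUT:.0f}s")  # 69s
```

This does not prove the first attempt failed, only that one visibility-timeout cycle is consistent with the observed lags.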
Manual Tracing: Find Delayed Jobs (lag_sec > 600)
Lag was computed as:
- `lag_sec = started_at - created_at` (in seconds)
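The same computation, sketched in Python for ad-hoc spot checks (Python 3.7+'s `datetime.fromisoformat` accepts the table's `+00:00`-suffixed timestamps directly, so no normalization is needed as in the jq below):

```python
from datetime import datetime
from typing import Optional

def lag_sec(created_at: str, started_at: Optional[str]) -> Optional[float]:
    """Seconds between job creation and first processing; None if never started."""
    if started_at is None:
        return None
    return (datetime.fromisoformat(started_at)
            - datetime.fromisoformat(created_at)).total_seconds()

print(lag_sec("2026-02-17T00:04:27+00:00", "2026-02-17T00:20:36.000000+00:00"))  # 969.0
print(lag_sec("2026-02-17T00:04:27+00:00", None))                                # None
```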
Repro command used to find delayed jobs for this shop/hour:
```sh
PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T01:00:00+00:00~'
aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status" \
  --output json | jq -r '
def parse_ts($s): (($s|sub("\\+00:00$";"Z")|sub("\\.[0-9]+Z$";"Z")|strptime("%Y-%m-%dT%H:%M:%SZ")|mktime));
[ .Items[]
  | {
      job_id: .job_id.S,
      created_at: .created_at.S,
      started_at: (.started_at.S // null),
      status: (.job_status.S // "UNKNOWN")
    }
  | . + { lag_sec: (if (.started_at==null) then null else (parse_ts(.started_at) - parse_ts(.created_at)) end) }
] as $rows |
"total_jobs=\($rows|length)",
"delayed_gt_600=\($rows|map(select(.lag_sec != null and .lag_sec > 600))|length)",
"delayed_status_breakdown:",
(($rows|map(select(.lag_sec != null and .lag_sec > 600))|group_by(.status)|map(" \(.[0].status)=\(length)"))[]? // " none"),
"top_20_lags:",
($rows|map(select(.lag_sec!=null))|sort_by(.lag_sec)|reverse|.[0:20]|map(" \(.lag_sec|floor)s \(.status) \(.job_id)")[])
'
```
Expected output pattern for this incident:
- total_jobs=27705
- delayed_gt_600=7
- Breakdown: CONFLICT=6, COMPLETED=1
Additional Result: Why Delayed Jobs Became CONFLICT
For the 6 delayed jobs that ended `CONFLICT`, `conflict_details.older_or_equal_version` points to a newer job for the same `doc_id` that completed first.
| Delayed Job | Lag | Doc ID | Newer Job | Newer Created At (UTC) | Newer Started At (UTC) | Newer Status |
|---|---|---|---|---|---|---|
| 50af0e12-4167-499b-a3e7-915f2878c73f | 969s | 6541400440956 | 13e40a0b-8481-4d80-b15f-ec60416c05bc | 2026-02-17T00:20:25.745683+00:00 | 2026-02-17T00:20:34.141021+00:00 | COMPLETED |
| 950ae038-d797-4d20-a04f-b5b0388c8aca | 967s | 6543226208380 | a0946a86-83ae-4a3b-8c83-458c0c2e6b98 | 2026-02-17T00:20:25.108583+00:00 | 2026-02-17T00:20:35.700047+00:00 | COMPLETED |
| 6967613b-8170-440c-8b17-0f5f61402a9b | 959s | 6543038840956 | 2f1a1ba8-2182-496b-a666-d6b54c4511ce | 2026-02-17T00:19:37.276996+00:00 | 2026-02-17T00:19:47.271925+00:00 | COMPLETED |
| 54ba88a8-89bd-4893-b3c1-482b15b3b647 | 959s | 6540983730300 | 4fec410d-caf0-4181-bb03-3e783647367e | 2026-02-17T00:19:33.595623+00:00 | 2026-02-17T00:19:43.814537+00:00 | COMPLETED |
| 16d8b0bb-2e94-4102-8e10-4453ca9142af | 959s | 6543851782268 | 17f90ce9-8388-4105-95f4-41d4eb190539 | 2026-02-17T00:19:38.446487+00:00 | 2026-02-17T00:19:58.878720+00:00 | COMPLETED |
| d199ad46-67af-4090-b808-81bcc2e98160 | 958s | 6543625158780 | 02400322-1d67-4941-ac9e-460fff9d1ec7 | 2026-02-17T00:19:39.284058+00:00 | 2026-02-17T00:19:52.546012+00:00 | COMPLETED |
Interpretation:
- These delayed messages were stale by the time they were retried.
- Newer writes for the same docs landed first, so delayed retries became older_or_equal_version conflicts.
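A minimal sketch of the versioning rule this implies (the `should_apply` helper is hypothetical; the real check lives in the indexer's conflict handling and compares per-`doc_id` versions):

```python
from datetime import datetime
from typing import Dict

def should_apply(last_applied: Dict[str, str], doc_id: str, created_at: str) -> bool:
    """Apply a write only if it is strictly newer than the last applied write
    for this doc_id; otherwise it is an older_or_equal_version conflict."""
    prev = last_applied.get(doc_id)
    if prev is not None and datetime.fromisoformat(created_at) <= datetime.fromisoformat(prev):
        return False  # stale retry -> CONFLICT
    last_applied[doc_id] = created_at
    return True

state: Dict[str, str] = {}
# The newer job for doc 6541400440956 completes first at 00:20:25...
assert should_apply(state, "6541400440956", "2026-02-17T00:20:25.745683+00:00")
# ...so the delayed retry (created ~00:04; illustrative timestamp) loses.
assert not should_apply(state, "6541400440956", "2026-02-17T00:04:30+00:00")
```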
Command Log (New Results) - With Purpose
- Confirm delayed jobs and lag in jobs table. Why: prove that only a small set of jobs had extreme lag.
```sh
PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T01:00:00+00:00~'
aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status" \
  --output json | jq -r '
def parse_ts($s): (($s|sub("\\+00:00$";"Z")|sub("\\.[0-9]+Z$";"Z")|strptime("%Y-%m-%dT%H:%M:%SZ")|mktime));
[ .Items[]
  | {job_id: .job_id.S, created_at: .created_at.S, started_at: (.started_at.S // null), status: (.job_status.S // "UNKNOWN")}
  | . + { lag_sec: (if (.started_at==null) then null else (parse_ts(.started_at) - parse_ts(.created_at)) end) }
] as $rows |
"total_jobs=\($rows|length)",
"delayed_gt_600=\($rows|map(select(.lag_sec != null and .lag_sec > 600))|length)",
"delayed_status_breakdown:",
(($rows|map(select(.lag_sec != null and .lag_sec > 600))|group_by(.status)|map(" \(.[0].status)=\(length)"))[]? // " none")
'
```
- Map delayed `CONFLICT` jobs to the newer superseding jobs from `conflict_details`. Why: prove causality for `CONFLICT` status (older write lost to newer write).
```sh
# Save a 00:00-00:30 snapshot with conflict_details
PROFILE=controller
REGION=us-east-1
TABLE=prod-EcomIndexerJobsTable
PK='PLATFORM#ecom#SHOP#c8tw6196-fashionnova-ecom-prod'
SK_START='JOB#2026-02-17T00:00:00+00:00'
SK_END='JOB#2026-02-17T00:30:00+00:00~'
aws dynamodb query \
  --profile "$PROFILE" --region "$REGION" \
  --table-name "$TABLE" \
  --key-condition-expression "pk = :pk AND sk BETWEEN :s1 AND :s2" \
  --expression-attribute-values "{\":pk\":{\"S\":\"$PK\"},\":s1\":{\"S\":\"$SK_START\"},\":s2\":{\"S\":\"$SK_END\"}}" \
  --projection-expression "job_id,created_at,started_at,job_status,conflict_details" \
  --output json > /tmp/fn_jobs_0030.json
jq -r '
def tsnorm: sub("\\+00:00$";"Z") | sub("\\.[0-9]+Z$";"Z");
def tsec: (tsnorm | strptime("%Y-%m-%dT%H:%M:%SZ") | mktime);
[ .Items[]
  | {
      job_id: .job_id.S,
      created_at: .created_at.S,
      started_at: (.started_at.S // null),
      status: (.job_status.S // "UNKNOWN"),
      conflict_details: (.conflict_details.M // null)
    }
  | . + { lag_sec: (if .started_at then ((.started_at|tsec)-(.created_at|tsec)) else null end) }
] as $rows |
($rows | map({key:.job_id, value:.}) | from_entries) as $byid |
$rows
| map(select(.lag_sec != null and .lag_sec > 600 and .status=="CONFLICT"))
| .[] as $d
| ($d.conflict_details.older_or_equal_version.L[]?.M) as $c
| ($byid[$c.job_id.S] // null) as $n
| "delayed_job=\($d.job_id) doc=\($c.doc_id.S) newer_job=\($c.job_id.S) newer_created=\($n.created_at) newer_started=\($n.started_at) newer_status=\($n.status)"
' /tmp/fn_jobs_0030.json
```
- Check Lambda service-level failure pressure in spike window. Why: rule out Lambda-level failures/throttles as primary cause.
```sh
PROFILE=controller
REGION=us-east-1
FUNC=prod-EcomIndexerFunction
for metric in Errors Throttles Invocations; do
  aws cloudwatch get-metric-statistics \
    --profile "$PROFILE" --region "$REGION" \
    --namespace AWS/Lambda \
    --metric-name "$metric" \
    --dimensions Name=FunctionName,Value="$FUNC" \
    --start-time 2026-02-17T00:00:00Z \
    --end-time 2026-02-17T00:30:00Z \
    --period 60 \
    --statistics Sum \
    --output json | jq -r '.Datapoints | sort_by(.Timestamp)[] | "\(.Timestamp)\t\(.Sum)"'
done
```
- Check for FashionNova-specific retry/error signatures in logs. Why: test whether explicit app-level errors explain the delay.
```sh
PROFILE=controller
REGION=us-east-1
LOG_GROUP="/aws/lambda/prod-EcomIndexerFunction"
START=1771286400 # 2026-02-17 00:00:00 UTC
END=1771290000   # 2026-02-17 01:00:00 UTC
Q='fields @timestamp, @message
| filter @message like /fashionnova-ecom-prod/
| filter @message like /Rate limit error processing SQS record|Document not found error processing SQS record|Inference timeout error processing SQS record|Error processing SQS record|Failed to process record|Failed to process chunk|Task timed out|Failed to parse record for parallel mode|Deferring [0-9]+ message\\(s\\) from index/
| sort @timestamp asc
| limit 200'
QID=$(aws logs start-query --profile "$PROFILE" --region "$REGION" --log-group-name "$LOG_GROUP" --start-time "$START" --end-time "$END" --query-string "$Q" --output text --query queryId)
sleep 10  # queries run asynchronously; re-run get-query-results until "status": "Complete"
aws logs get-query-results --profile "$PROFILE" --region "$REGION" --query-id "$QID"
```
- (Attempted) check CloudTrail for SQS visibility-change API events. Why: try to directly confirm explicit visibility-shortening calls in the period.
```sh
PROFILE=controller
REGION=us-east-1
aws cloudtrail lookup-events \
  --profile "$PROFILE" --region "$REGION" \
  --lookup-attributes AttributeKey=EventName,AttributeValue=ChangeMessageVisibilityBatch \
  --start-time 2026-02-17T00:00:00Z \
  --end-time 2026-02-17T00:30:00Z \
  --max-results 50
```
Result note: this lookup returned no events for that window in this account/region.
S3 Payload Characteristics
- All 7 were single-doc `add_docs_0000.jsonl` payloads (variants=1).
- 6 later resolved as `CONFLICT` because newer writes for the same document IDs had already been processed.
Errors and Failures
No explicit FashionNova-specific error signatures were found in the target window for:
- Rate limit error processing SQS record
- Document not found error processing SQS record
- Inference timeout error processing SQS record
- Error processing SQS record
- Failed to process record
- Failed to process chunk
- Lambda runtime failures (Task timed out, OOM, runtime exit)
Lambda REPORT durations in the window peaked at ~99s, well below the 900s timeout.
Confidence
- High confidence: 7 stale/retried messages caused the age spike.
- Medium confidence: original first-attempt trigger is unknown due to insufficient per-message retry reason logging.
Recommended Next Step
Add targeted logging in ecom_indexer Lambda to emit, per failed message:
- messageId
- job_id
- ApproximateReceiveCount
- explicit retry reason when adding to batchItemFailures
This will make the next spike directly attributable.
Problem To Solve (Observability Gap)
We can prove delayed jobs and eventual CONFLICT, but we cannot always prove the first retry trigger that caused the delay.
Why this gap exists in current code:
- Retry outcome is returned as batchItemFailures without structured reason payload:
- components/ecom_indexer/ecom_indexer/lambda_function.py:1581
- At least one retry branch is effectively silent on cause details:
- components/ecom_indexer/ecom_indexer/lambda_function.py:586
- Job details are processing outcomes, not retry-decision provenance.
- Redrive can clear stored job detail fields, so first-cause evidence can be lost:
- components/ecom_indexer/ecom_indexer/job_management.py:170
Minimal Plan: Retry-Cause Logging (Phase 1)
Goal: capture the first retry cause at decision time, with enough detail for fast root-cause triage.
Design:
- Emit one structured log event per retry decision, e.g. INDEXER_RETRY_DECISION.
- Scope is per retried SQS message (job/chunk message), not per document.
- Required fields (always logged):
- message_id, job_id, shop_id, index_name
- receive_count (ApproximateReceiveCount)
- retry_trigger (single trigger that caused retry)
- mode (sequential, parallel_event, polled)
- request_id
- Optional context fields (log only when available):
- all_error_types_detected (for mixed-failure batches)
- error_type (exception class)
- Full Marqo error capture (untruncated):
- Persist full original Marqo error as first_retry_marqo_error once (write-once).
- Do not emit full Marqo error on every retry log line by default (keeps Phase 1 minimal).
- Write first_retry_* attribution to retry_attribution with write-once semantics so first cause survives later status transitions:
- first_retry_cause
- first_retry_at
- first_retry_receive_count
- first_retry_request_id
- first_retry_marqo_error
- Write-once requirement:
- Implement via conditional update (compare-and-set), e.g. attribute_not_exists(retry_attribution.first_retry_at).
- On conditional failure, no-op (do not overwrite earlier first-cause data).
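A sketch of both pieces under the design above. The field names, the `INDEXER_RETRY_DECISION` event name, and the `retry_attribution` map are the proposed schema, not existing code; the DynamoDB side is shown as the kwargs a boto3 `Table.update_item` call would take, with the `ConditionalCheckFailedException` no-op left to the caller:

```python
import json
import time
from typing import Any, Dict, Optional

def retry_decision_event(message_id: str, job_id: str, shop_id: str,
                         index_name: str, receive_count: int,
                         retry_trigger: str, mode: str, request_id: str,
                         **optional: Any) -> str:
    """Build one structured INDEXER_RETRY_DECISION log line (proposed schema)."""
    event: Dict[str, Any] = {
        "event": "INDEXER_RETRY_DECISION",
        "message_id": message_id,
        "job_id": job_id,
        "shop_id": shop_id,
        "index_name": index_name,
        "receive_count": receive_count,  # ApproximateReceiveCount
        "retry_trigger": retry_trigger,  # single trigger that caused the retry
        "mode": mode,                    # sequential | parallel_event | polled
        "request_id": request_id,
    }
    # Optional context (all_error_types_detected, error_type, ...) only when present.
    event.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(event)

def first_retry_update_kwargs(key: Dict[str, str], cause: str, receive_count: int,
                              request_id: str,
                              marqo_error: Optional[str] = None) -> Dict[str, Any]:
    """Write-once first-cause attribution: the condition only passes the first
    time, so later retries cannot overwrite the original cause."""
    attribution: Dict[str, Any] = {
        "first_retry_cause": cause,
        "first_retry_at": int(time.time()),
        "first_retry_receive_count": receive_count,
        "first_retry_request_id": request_id,
    }
    if marqo_error is not None:
        attribution["first_retry_marqo_error"] = marqo_error  # full, untruncated
    return {
        "Key": key,
        "UpdateExpression": "SET retry_attribution = :a",
        "ConditionExpression": "attribute_not_exists(retry_attribution.first_retry_at)",
        "ExpressionAttributeValues": {":a": attribution},
    }
```

`attribute_not_exists` on the nested path also passes when `retry_attribution` itself is absent, so the first writer always wins and every later writer no-ops.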
Primary code touchpoints:
- Retry branch logging + reason capture:
- components/ecom_indexer/ecom_indexer/lambda_function.py:586
- components/ecom_indexer/ecom_indexer/lambda_function.py:1408
- components/ecom_indexer/ecom_indexer/lambda_function.py:1459
- components/ecom_indexer/ecom_indexer/lambda_function.py:1469
- components/ecom_indexer/ecom_indexer/lambda_function.py:1503
- components/ecom_indexer/ecom_indexer/lambda_function.py:1537
- components/ecom_indexer/ecom_indexer/lambda_function.py:1581
- Persist first-cause metadata:
- components/ecom_indexer/ecom_indexer/job_management.py:360
Admin UI Wiring (Phase 2)
Goal: make triage easy in admin_worker without manual CloudWatch spelunking.
Plan:
- Add an admin API endpoint to query retry-cause logs by index/time window.
- Surface a summary plus recent events on the Monitoring page.
Likely touchpoints:
- Backend logs query plumbing:
- components/admin_lambda/admin_lambda/clients/cloudwatch_logs_client.py
- components/admin_lambda/admin_lambda/services/monitoring_service.py
- components/admin_lambda/admin_lambda/routes/monitoring_routes.py
- UI:
- components/admin_worker/app/routes/monitoring.tsx
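For the backend plumbing, the Logs Insights query the endpoint runs could be built along these lines (a hedged sketch: it assumes the Phase 1 event shape, and `build_retry_cause_query` is a hypothetical helper; Logs Insights auto-discovers fields from JSON-formatted log lines, so the proposed fields are filterable directly):

```python
def build_retry_cause_query(index_name: str, limit: int = 100) -> str:
    """Logs Insights query for recent INDEXER_RETRY_DECISION events on one index."""
    return "\n".join([
        "fields @timestamp, message_id, job_id, receive_count, retry_trigger, mode",
        'filter event = "INDEXER_RETRY_DECISION"',
        f'filter index_name = "{index_name}"',
        "sort @timestamp desc",
        f"limit {limit}",
    ])

print(build_retry_cause_query("c8tw6196-fashionnova-ecom-prod"))
```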
Investigation Links
- Main incident report (this file): `docs/incidents/reports/fashionnova-ecom-queue-age-spike-2026-02-17.md`
- Incident index: `docs/runbooks/incidents/index.md`
- Full working notes / evidence trail: `tmp/agent-plans/2026-02-17-queue-spike-debug/plan.md`