Ecom Queue Depth / Queue Backlog

Summary

This runbook covers the Grafana alerts “Ecom Queue Depth Exceeds 250K” and “Ecom Metrics Queue Oldest Message Age Exceeds 10m”, plus the older per-index queue backlog alarm.

The depth alert means a controller-account SQS queue has more than 250,000 visible messages. The alert label gives the QueueName. For per-index indexing queues, this means write operations are being sent to the async indexing queue faster than they can be processed.

The age alert means the oldest visible message on prod-EcomMetricsQueue is older than 10 minutes: the ecom metrics consumer is falling behind, even if the queue has not crossed the 250,000-message depth threshold. Treat it as a prod-EcomMetricsQueue backlog (see the prod-EcomMetricsQueue step under Actions) and check whether metrics processing is stuck.

Context

Queues usually start backing up when the customer performs a large batch of writes in a short time, e.g. reindexing all of their docs, or reconciling with some source of truth. These operations are not inherently wrong or bad, but can lead to bad customer experiences if not managed well.

For example, a customer wants to reconcile and reindex 15M docs over the weekend. Because everything is serverless, they expect it to be relatively quick and submit all the docs on Friday afternoon. If we fail to notice, they could come in on Monday morning to a queue still backed up with roughly 14M docs, and it could take the whole day (or worse) to flush it before any further changes can be processed.

The purpose of this alarm is to alert us to a situation that could become unpleasant for the customer. At a minimum, we should validate that nothing else about the index looks problematic (e.g. high add docs latency; many 4XX/5XX responses causing retries; resource starvation).

Conditions

Current Grafana rule:

Queue metric: AWS/SQS ApproximateNumberOfMessagesVisible
Query window: 1 hour average
Threshold: visible messages > 250000
Excludes prod-EcomMetricsQueueDLQ

Legacy per-index alarm:

Ecom index
Indexing queue has >= 1000 visible messages
The number of visible messages has increased every minute for >= 10 minutes
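
The two rule shapes above can be sketched in Python to make the conditions concrete. This is a minimal illustration, not the actual Grafana or alarm definition: the function names are hypothetical, the input is assumed to be a list of ApproximateNumberOfMessagesVisible samples (oldest first), and it assumes the legacy alarm requires both conditions at once with a strict increase each minute.

```python
from typing import Sequence

DEPTH_THRESHOLD = 250_000                      # current Grafana rule threshold
EXCLUDED_QUEUES = {"prod-EcomMetricsQueueDLQ"}  # excluded by the current rule
LEGACY_MIN_VISIBLE = 1000                      # legacy per-index threshold
LEGACY_RISING_MINUTES = 10                     # consecutive per-minute increases


def depth_alert(queue_name: str, samples_last_hour: Sequence[float]) -> bool:
    """Current rule: 1-hour average of visible messages > 250000."""
    if queue_name in EXCLUDED_QUEUES or not samples_last_hour:
        return False
    average = sum(samples_last_hour) / len(samples_last_hour)
    return average > DEPTH_THRESHOLD


def legacy_backlog_alarm(visible_per_minute: Sequence[int]) -> bool:
    """Legacy rule (assumed AND): >= 1000 visible and rising for >= 10 minutes."""
    needed = LEGACY_RISING_MINUTES + 1  # 10 increases need 11 samples
    if len(visible_per_minute) < needed:
        return False
    if visible_per_minute[-1] < LEGACY_MIN_VISIBLE:
        return False
    window = visible_per_minute[-needed:]
    # Every sample must be strictly greater than the one before it.
    return all(later > earlier for earlier, later in zip(window, window[1:]))
```

A queue that sits flat at a high depth does not trip the legacy shape in this sketch, since it only fires while the backlog is still growing.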

Actions

Identify the queue from the alert QueueName.
If the queue is prod-EcomMetricsQueue, follow Ecom Metrics Worker Lambda Has Errors and check whether metrics processing is stuck.
If the queue is a per-index indexing queue, this situation may require no action. Many writes may have occurred in a short span of time.
- If the request rate returns to normal soon, the queue will be processed normally.
- If the high rate of write requests continues for a long time, or the rate is exceptionally high for a short time, the queue could grow very large, and writes could become severely delayed.
Keep an eye on the 📈 prod-EcomDashboard (in [PROD] Controller account) to track the number of visible (backed up) messages on the queue. A very large backlog will become a problem.
- “Very large” depends on the customer’s documents and infra. Divide the queue’s visible messages by its messages deleted per minute. The result is the number of minutes the queue will take to clear if no further traffic arrives.
- Action may be warranted when that number exceeds roughly 100-500 minutes (a couple of hours to a business day). Beyond that, customers start to lose control of the contents of their index.
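
The time-to-clear arithmetic above can be sketched as follows. The function names and the triage labels are illustrative only, and the drain rate in the usage note is a made-up number rather than a measured one. The 100 and 500 minute boundaries come from the guidance above.

```python
def minutes_to_clear(visible_messages: int, deleted_per_minute: float) -> float:
    """Minutes to drain the queue, assuming no further traffic arrives."""
    if deleted_per_minute <= 0:
        # Nothing is being consumed, so the queue never clears on its own.
        return float("inf")
    return visible_messages / deleted_per_minute


def triage(minutes: float) -> str:
    """Map the estimate onto the rough 100-500 minute guidance (labels are hypothetical)."""
    if minutes < 100:
        return "watch"
    if minutes <= 500:
        return "consider action"
    return "action likely warranted"
```

For instance, a backlog of 14,000,000 messages draining at a hypothetical 10,000 deletes per minute gives `minutes_to_clear(14_000_000, 10_000)` = 1400.0 minutes, a little under a full day, which this sketch classifies as "action likely warranted".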
If the backlog continues to grow very large, escalate to the customer’s account manager.
- Understand whether this traffic is expected, whether we know when it will end, and whether the time to clear the queue is acceptable.
- Determine how urgent it is that we clear the queue (e.g. if everything must be processed by Monday morning).
If action is required to reduce the backlog, this usually means scaling out wherever the bottleneck is.
- Indexing is typically rate-limited to avoid overwhelming Vespa and interfering with serving search requests. The main rate limit is defined on the indexer Lambda trigger that consumes the index’s queue.
  - (For admin access, use Escalator: Self-Service Admin for the Controller account)
  - Go to the prod-EcomIndexerFunction Lambda’s triggers config
  - Find the trigger for the index in question
  - Select it and click Edit
  - Review the “Maximum concurrency” setting. Increasing this value increases the rate at which messages are processed, but also proportionally increases the RPS of write operations on the inference and Vespa nodes.
  - Before increasing the value, review the index’s infrastructure in Polo or the Cloud cell’s CustomerIndexConfigTable, and check the per-index dashboard for average CPU and GPU utilization across all components. If any are near their limits, consider scaling those components out before increasing the indexing concurrency.
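
The console steps above could also be scripted. The following is a minimal sketch, assuming a boto3 Lambda client created with `boto3.client("lambda")` under admin credentials for the Controller account; the helper name, the `dry_run` flag, and the queue ARN you pass in are hypothetical. It uses the Lambda `list_event_source_mappings` and `update_event_source_mapping` calls, where an SQS trigger's maximum concurrency lives under `ScalingConfig.MaximumConcurrency` (valid range 2 to 1000).

```python
def raise_max_concurrency(lambda_client, function_name, queue_arn,
                          new_value, dry_run=True):
    """Return (current, new) maximum concurrency for one SQS trigger.

    lambda_client is expected to behave like boto3.client("lambda").
    With dry_run=True (the default) nothing is changed.
    """
    if not 2 <= new_value <= 1000:
        raise ValueError("SQS maximum concurrency must be between 2 and 1000")

    # Find the trigger that connects this function to the index's queue.
    response = lambda_client.list_event_source_mappings(
        FunctionName=function_name,
        EventSourceArn=queue_arn,
    )
    mappings = response["EventSourceMappings"]
    if len(mappings) != 1:
        raise LookupError(f"expected exactly one trigger, found {len(mappings)}")
    mapping = mappings[0]

    # current is None when no maximum concurrency is configured on the trigger.
    current = mapping.get("ScalingConfig", {}).get("MaximumConcurrency")

    if not dry_run:
        lambda_client.update_event_source_mapping(
            UUID=mapping["UUID"],
            ScalingConfig={"MaximumConcurrency": new_value},
        )
    return current, new_value
```

Running it first with the default `dry_run=True` prints nothing and changes nothing, but returns the current value, which is worth recording so the trigger can be restored once the backlog has cleared.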
After the queue has been sufficiently processed and the write RPS has returned to normal, remember to reverse any scale-out operations so the index’s infra returns to its normal configuration.