Scale ecom indexing throughput

TODO(oliver): More detail. Contact @Oliver Lade and/or @Raynor Chavez for support.

Background

Ecom API handles customers’ document write requests via an async queue. Customers can enqueue writes at any rate, but the rate at which queued messages are processed depends on the scaling of both the queue trigger (the resource that pulls messages off the queue and invokes the indexer Lambda function with them) and the index data plane itself.

By default the ecom indexer scaling is relatively conservative, to ensure large writes to new indexes don’t immediately overwhelm the index and give a poor first impression. If more aggressive scaling is desired, finding a good balance between cost and throughput is currently a manual process. In the future, once the data plane supports autoscaling, we may be able to remove the throughput throttling altogether.

Process

TL;DR: Find the index’s SQS queue’s trigger under the prod-EcomIndexerFunction Lambda, under Configuration > Triggers. Select it via the radio button, click Edit, and increase the maximum concurrency. This sets the number of indexer Lambda invocations running in parallel to process the queue.
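The console steps above can also be scripted. A minimal boto3 sketch, assuming you have the trigger’s event source mapping UUID (listable via the Lambda API for prod-EcomIndexerFunction); the 2–1000 bounds check reflects AWS’s documented range for SQS maximum concurrency, and the helper names are hypothetical:

```python
def build_scaling_update(mapping_uuid, max_concurrency):
    """Build the update_event_source_mapping request for a new max concurrency."""
    # AWS documents SQS maximum concurrency as a value between 2 and 1000.
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("SQS maximum concurrency must be between 2 and 1000")
    return {
        "UUID": mapping_uuid,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }


def apply_scaling_update(update, client=None):
    """Apply the update; pass client=boto3.client('lambda') to actually run it."""
    if client is not None:
        client.update_event_source_mapping(**update)
    return update
```

Keeping the request-building step separate from the API call makes it easy to review the change (or dry-run it) before touching the prod trigger.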

Each Lambda invocation takes 10 jobs, which could be anywhere from 10 to 10,000 docs depending on how many docs the customer sends per ecom API request.
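Those numbers give a rough worst-case bound on how many documents can be in flight at once. A back-of-envelope helper (the 10-jobs-per-invocation figure is from above; the docs-per-job default is just an illustrative worst case):

```python
def max_docs_in_flight(max_concurrency, jobs_per_invocation=10, docs_per_job=1000):
    """Worst-case number of documents being indexed concurrently."""
    # e.g. 5 concurrent invocations x 10 jobs x 1,000 docs = 50,000 docs
    return max_concurrency * jobs_per_invocation * docs_per_job
```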

Increase in relatively small steps (e.g. doubling, not 5–10x) and monitor the throughput of the data plane (per-index dashboard). A common bottleneck when scaling out is the Vespa API nodes, which may need to be scaled out manually to support greater indexer throughput.
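The “double, don’t 5–10x” guidance can be expressed as a simple stepping rule. The 1000 ceiling here is AWS’s documented per-mapping maximum; in practice you would stop much earlier, based on the dashboard:

```python
def next_max_concurrency(current, ceiling=1000):
    """Double the trigger's maximum concurrency, capped at a ceiling."""
    return min(current * 2, ceiling)
```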

If the scale-out is temporary, to process a large backlog, remember to reduce the trigger’s maximum concurrency when scaling in the data plane infra.