Skip to main content

Runbook: reverse_proxy_5xx

This runbook covers steps to investigate and remediate when the reverse proxy is returning 5XX errors for an index. The alert fires when the 5XX request count is greater than 0.

Steps

1. Check metrics in the per-index dashboard

Open the per-index dashboard and confirm the impact for the affected index:

https://g-3d216b3ddc.grafana-workspace.us-east-1.amazonaws.com/d/per_index_dashboard

Filter by the index_name and account_id from the alert labels.

2. Check logs in Athena

Query Athena to see what the 5XX errors are. Look for error patterns, affected endpoints, and request volumes.

3. Scale out if underscaled

If the errors are caused by the index being underscaled, modify the autoscaling config via the Control Plane API Gateway.

  1. Open AWS Console -> API Gateway
  2. Select the Control Plane API (APIGateway)
  3. Go to Stages -> prod
  4. Locate the route: PUT /v2/indexes/autoscaling

Dry run first

{
"systemAccountId": "<system_account_id>",
"indexName": "<index_name>",
"autoscalingEnabled": true,
"dryRun": true,
"minInferenceReplicas": <desired_min>,
"maxInferenceReplicas": <desired_max>
}

Apply live

{
"systemAccountId": "<system_account_id>",
"indexName": "<index_name>",
"autoscalingEnabled": true,
"dryRun": false,
"minInferenceReplicas": <desired_min>,
"maxInferenceReplicas": <desired_max>
}

Validate

  1. Confirm the API call returned success.
  2. Monitor the index replica count trends toward the new bounds.
  3. Watch for the 5XX rate to decrease.

Rollback

If the change causes issues, disable autoscaling:

{
"systemAccountId": "<system_account_id>",
"indexName": "<index_name>",
"autoscalingEnabled": false
}