Runbook: cloudflare_429_rate

This runbook describes how to investigate and remediate a high ratio of 429 (rate-limited) responses from the Cloudflare origin for an account.

Steps

1. Check metrics in the per-index dashboard

Open the per-index dashboard and confirm the impact for the affected account and path:

https://g-3d216b3ddc.grafana-workspace.us-east-1.amazonaws.com/d/per_index_dashboard

Filter by the system_account_id and request_path from the alert labels.

2. Check logs in Athena

Query Athena for the 429 responses. Identify which endpoints are being rate limited, what the request patterns look like, and whether the traffic is legitimate.
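As a starting point, a query along the following lines groups 429 responses by endpoint. This is only a sketch: the table name cloudflare_logs and the columns system_account_id, request_path, status, and ts are hypothetical placeholders, so substitute the names from the actual Athena schema before running it.

```python
def build_429_query(system_account_id: str, hours: int = 1) -> str:
    """Build an Athena SQL query that counts 429 responses per request path.

    The table and column names are hypothetical; adjust them to the real schema.
    """
    # Reject anything other than letters, digits, '-' and '_' so the value
    # can be placed inside a quoted SQL literal safely.
    if not system_account_id.replace("-", "").replace("_", "").isalnum():
        raise ValueError("unexpected characters in system_account_id")
    if hours < 1:
        raise ValueError("hours must be >= 1")
    return (
        "SELECT request_path, COUNT(*) AS responses_429\n"
        "FROM cloudflare_logs\n"
        f"WHERE system_account_id = '{system_account_id}'\n"
        "  AND status = 429\n"
        f"  AND ts > now() - interval '{hours}' hour\n"
        "GROUP BY request_path\n"
        "ORDER BY responses_429 DESC\n"
        "LIMIT 50"
    )
```

Sorting by count puts the most heavily rate-limited endpoints first, which helps separate one noisy path from broad, account-wide limiting.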

3. Scale out if underscaled

If the 429s are caused by the index being underscaled and unable to handle the request volume, modify the autoscaling config via the Control Plane API Gateway.

  1. Open AWS Console -> API Gateway
  2. Select the Control Plane API (APIGateway)
  3. Go to Stages -> prod
  4. Locate the route: PUT /v2/indexes/autoscaling

Dry run first

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": true,
  "dryRun": true,
  "minInferenceReplicas": <desired_min>,
  "maxInferenceReplicas": <desired_max>
}

Apply live

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": true,
  "dryRun": false,
  "minInferenceReplicas": <desired_min>,
  "maxInferenceReplicas": <desired_max>
}
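The dry-run and live bodies differ only in the dryRun flag, so generating both from the same inputs avoids applying different bounds than the ones that were dry-run tested. The helper below is a minimal sketch, not part of the Control Plane API: the function name, the replica values 2 and 8, and the bounds check are illustrative assumptions.

```python
import json


def autoscaling_payload(system_account_id: str, index_name: str,
                        min_replicas: int, max_replicas: int,
                        dry_run: bool = True) -> str:
    """Return the JSON body for PUT /v2/indexes/autoscaling as a string."""
    # Assumed sanity check (not taken from the API): bounds are
    # non-negative and the minimum does not exceed the maximum.
    if not (0 <= min_replicas <= max_replicas):
        raise ValueError("expected 0 <= min_replicas <= max_replicas")
    body = {
        "systemAccountId": system_account_id,
        "indexName": index_name,
        "autoscalingEnabled": True,
        "dryRun": dry_run,
        "minInferenceReplicas": min_replicas,
        "maxInferenceReplicas": max_replicas,
    }
    return json.dumps(body, indent=2)


# Example values only; use the real account, index, and desired bounds.
dry = autoscaling_payload("<system_account_id>", "<index_name>", 2, 8)
live = autoscaling_payload("<system_account_id>", "<index_name>", 2, 8,
                           dry_run=False)
```

Paste the dry string into the request first, review the response, then send the live string unchanged.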

Validate

  1. Confirm the API call returned success.
  2. Confirm that the index replica count trends toward the new bounds.
  3. Watch for the 429 rate to decrease.
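To make "the 429 rate decreased" concrete, compare the ratio for a window before the change with one after it. The sketch below assumes the ratio is the 429 count divided by total responses; the alert's exact definition is not stated here, and the function names are hypothetical.

```python
def rate_429(status_counts: dict[int, int]) -> float:
    """Fraction of responses that were 429, given counts keyed by HTTP status."""
    total = sum(status_counts.values())
    # An empty window has no rate-limited traffic, so report 0.0.
    return status_counts.get(429, 0) / total if total else 0.0


def is_recovering(before: dict[int, int], after: dict[int, int]) -> bool:
    """True when the 429 ratio in the later window is strictly lower."""
    return rate_429(after) < rate_429(before)
```

Comparing ratios rather than raw counts keeps the check meaningful when total request volume changes between the two windows.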

Rollback

If the change causes issues, disable autoscaling:

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": false
}