Runbook: cf_origin_p95_latency

This runbook covers how to investigate and remediate the alert that fires when the Cloudflare origin P95 latency exceeds the configured threshold for an account/index/endpoint combination.

Steps

1. Check metrics in the per-index dashboard

Open the per-index dashboard and confirm the impact for the affected index:

https://g-3d216b3ddc.grafana-workspace.us-east-1.amazonaws.com/d/per_index_dashboard

Filter by the system_account_id, index_name, and endpoint from the alert labels.

2. Check logs in Athena

Query the request logs in Athena to find the high-latency requests. Look for slow endpoints, unusual request patterns, and specific operations that may be driving the latency spike.
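As a starting point, a query of roughly the following shape can surface the slowest requests for the alerting combination. This is only a sketch: the table name (cf_origin_logs) and every column name are hypothetical placeholders, since this runbook does not specify the Athena schema. Substitute the real table and columns before running anything.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_slow_request_query(
    system_account_id: str,
    index_name: str,
    endpoint: str,
    min_latency_ms: int,
    limit: int = 100,
) -> str:
    """Build a query listing the slowest requests for one account/index/endpoint.

    ASSUMPTION: the table `cf_origin_logs` and the columns below are
    hypothetical; the real Athena schema is not given in this runbook.
    """
    if min_latency_ms < 0:
        raise ValueError("min_latency_ms must be non-negative")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT request_time, endpoint, operation, latency_ms\n"
        "FROM cf_origin_logs\n"  # hypothetical table name
        f"WHERE system_account_id = {sql_literal(system_account_id)}\n"
        f"  AND index_name = {sql_literal(index_name)}\n"
        f"  AND endpoint = {sql_literal(endpoint)}\n"
        f"  AND latency_ms >= {int(min_latency_ms)}\n"
        "ORDER BY latency_ms DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Values come from the alert labels; these are made-up examples.
    print(build_slow_request_query("acct-123", "my-index", "/query", 500))
```

Sorting by latency in descending order puts the worst offenders first, which makes it easier to see whether one operation or request pattern dominates.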

3. Scale out if underscaled

If the latency is caused by the index being underscaled and unable to handle the request volume, modify the autoscaling config via the Control Plane API Gateway.

  1. Open AWS Console -> API Gateway
  2. Select the Control Plane API (APIGateway)
  3. Go to Stages -> prod
  4. Locate the route: PUT /v2/indexes/autoscaling

Dry run first

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": true,
  "dryRun": true,
  "minInferenceReplicas": <desired_min>,
  "maxInferenceReplicas": <desired_max>
}

Apply live

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": true,
  "dryRun": false,
  "minInferenceReplicas": <desired_min>,
  "maxInferenceReplicas": <desired_max>
}
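The dry-run and live bodies differ only in the dryRun flag, so generating both from one helper reduces the chance of a copy-paste mistake between the two calls. The sketch below is a hypothetical helper, not part of the Control Plane API. The field names are taken from the bodies above, while the sanity checks (non-negative integers, min not greater than max) are assumptions, because the API's own validation rules are not documented here.

```python
import json


def build_autoscaling_payload(
    system_account_id: str,
    index_name: str,
    min_replicas: int,
    max_replicas: int,
    dry_run: bool = True,
) -> dict:
    """Build the request body for PUT /v2/indexes/autoscaling.

    dry_run defaults to True so a live change must be requested explicitly.
    ASSUMPTION: the bounds checks below are local sanity checks only; the
    real API may enforce different or additional rules.
    """
    for name, value in (("min_replicas", min_replicas), ("max_replicas", max_replicas)):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return {
        "systemAccountId": system_account_id,
        "indexName": index_name,
        "autoscalingEnabled": True,
        "dryRun": dry_run,
        "minInferenceReplicas": min_replicas,
        "maxInferenceReplicas": max_replicas,
    }


if __name__ == "__main__":
    # Example values only; use the identifiers from the alert labels.
    dry = build_autoscaling_payload("acct-123", "my-index", 2, 8)
    live = build_autoscaling_payload("acct-123", "my-index", 2, 8, dry_run=False)
    print(json.dumps(dry, indent=2))
    print(json.dumps(live, indent=2))
```

The printed JSON can be pasted into the API Gateway test console for the route above.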

Validate

  1. Confirm the API call returned success.
  2. Confirm that the index replica count trends toward the new bounds.
  3. Watch for P95 latency to decrease below the threshold.
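To spot-check recovery from a set of raw latency samples (for example, values exported from the request logs), P95 can be computed by hand. The sketch below uses the nearest-rank definition, which is an assumption: the dashboard and the alert may compute P95 with a different method (such as interpolation or histogram buckets), so small differences from the dashboard value are expected. Both function names are hypothetical.

```python
def p95_nearest_rank(samples_ms: list[float]) -> float:
    """Return the 95th percentile of the samples using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_recovered(samples_ms: list[float], threshold_ms: float) -> bool:
    """True when the P95 of the samples is strictly below the alert threshold."""
    return p95_nearest_rank(samples_ms) < threshold_ms


if __name__ == "__main__":
    # Made-up samples: 19 fast requests and one slow outlier.
    samples = [10.0] * 19 + [500.0]
    print(p95_nearest_rank(samples))
    print(latency_recovered(samples, threshold_ms=200.0))
```

Compare samples from a window after the scaling change against the threshold configured on the alert; a single slow outlier in a small sample does not necessarily move P95.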

Rollback

If the change causes issues, disable autoscaling:

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": false
}