Runbook: edge_unreachability

This runbook covers how to investigate and remediate an unreachable edge compute endpoint. The alert fires when an edge endpoint has been unreachable for 5 minutes.

Steps

1. Check metrics in the per-index dashboard

Open the per-index dashboard and confirm the impact for the affected index:

https://g-3d216b3ddc.grafana-workspace.us-east-1.amazonaws.com/d/per_index_dashboard

Filter by the index_name and system_account_id from the alert labels.

2. Check logs in Athena

Query Athena to see what errors are occurring. Look for connection timeouts, DNS resolution failures, or upstream errors.
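As a starting point, the query can group recent errors by type for the affected index. The sketch below only builds the SQL string. The table name (edge_logs.requests) and the columns (error_type, event_time, index_name, system_account_id) are hypothetical placeholders and must be replaced with the real Athena schema.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table name; replace with the real Athena database and table.
LOG_TABLE = "edge_logs.requests"


def build_error_query(index_name, system_account_id, lookback_minutes=30, now=None):
    """Build a SQL string that counts recent errors by type for one index.

    Column names (error_type, event_time, ...) are assumptions for illustration.
    """
    for value in (index_name, system_account_id):
        # Reject characters that could break out of the quoted SQL literal.
        if "'" in value or ";" in value:
            raise ValueError(f"unsafe label value: {value!r}")

    now = now or datetime.now(timezone.utc)
    since = now - timedelta(minutes=lookback_minutes)

    return (
        "SELECT error_type, COUNT(*) AS occurrences\n"
        f"FROM {LOG_TABLE}\n"
        f"WHERE index_name = '{index_name}'\n"
        f"  AND system_account_id = '{system_account_id}'\n"
        f"  AND event_time >= TIMESTAMP '{since:%Y-%m-%d %H:%M:%S}'\n"
        "GROUP BY error_type\n"
        "ORDER BY occurrences DESC"
    )


if __name__ == "__main__":
    print(build_error_query("<index_name>", "<system_account_id>"))
```

Paste the printed query into the Athena console, then narrow the lookback window once the dominant error type is clear.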

3. Scale out if underscaled

If the unreachability is caused by the index being underscaled, modify the autoscaling config via the Control Plane API Gateway.

Navigate to the endpoint

Open AWS Console -> API Gateway
Select the Control Plane API (APIGateway)
Go to Stages -> prod
Locate the route: PUT /v2/indexes/autoscaling

Dry run first

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": true,
  "dryRun": true,
  "minInferenceReplicas": <desired_min>,
  "maxInferenceReplicas": <desired_max>
}

Apply live

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": true,
  "dryRun": false,
  "minInferenceReplicas": <desired_min>,
  "maxInferenceReplicas": <desired_max>
}
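The dry-run and live request bodies should differ only in the dryRun flag. One way to guarantee that is to generate both from the same values, as in this minimal sketch. The field names come from the payloads above; the function name, the example values, and the bounds check are assumptions, since the runbook does not state which limits the API enforces.

```python
import json


def build_autoscaling_payload(system_account_id, index_name,
                              min_replicas, max_replicas, dry_run=True):
    """Return the request body for PUT /v2/indexes/autoscaling.

    The bounds check is an assumed sanity check, not documented API behavior.
    """
    if min_replicas < 0 or max_replicas < min_replicas:
        raise ValueError(
            f"invalid replica bounds: min={min_replicas}, max={max_replicas}"
        )
    return {
        "systemAccountId": system_account_id,
        "indexName": index_name,
        "autoscalingEnabled": True,
        "dryRun": dry_run,
        "minInferenceReplicas": min_replicas,
        "maxInferenceReplicas": max_replicas,
    }


# Example values are placeholders; take the real ones from the alert labels.
dry_run_body = build_autoscaling_payload("acct-123", "my-index", 2, 6)
# The live body reuses the dry-run body and flips only the dryRun flag.
live_body = {**dry_run_body, "dryRun": False}

if __name__ == "__main__":
    print(json.dumps(dry_run_body, indent=2))
    print(json.dumps(live_body, indent=2))
```

Send the dry-run body first and review the response before sending the live body.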

Validate

Confirm the API call returned success.
Monitor the index replica count and confirm it trends toward the new bounds.
Watch for the edge reachability check to recover.
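The replica check above can also be scripted as a simple polling loop. This is a sketch only: get_replica_count is a hypothetical callable that you would back with whatever source reports the current replica count (for example, the metric behind the per-index dashboard), and the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_replicas(get_replica_count, min_replicas, max_replicas,
                      timeout_s=600, interval_s=30, sleep=time.sleep):
    """Poll until the replica count is within [min_replicas, max_replicas].

    get_replica_count is a hypothetical zero-argument callable returning an int.
    Returns True once the count is within bounds, False if polls run out.
    """
    polls = max(1, timeout_s // interval_s)
    for attempt in range(polls):
        count = get_replica_count()
        if min_replicas <= count <= max_replicas:
            return True
        # Do not sleep after the final poll.
        if attempt < polls - 1:
            sleep(interval_s)
    return False


if __name__ == "__main__":
    # Simulated counts standing in for a real metric source.
    simulated = iter([1, 1, 3])
    ok = wait_for_replicas(lambda: next(simulated), 2, 6,
                           timeout_s=90, interval_s=30, sleep=lambda s: None)
    print("replicas within bounds:", ok)
```

A False result means the count did not reach the new bounds in time. Recheck the API response, and use the rollback step if the change is causing issues.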

Rollback

If the change causes issues, disable autoscaling:

{
  "systemAccountId": "<system_account_id>",
  "indexName": "<index_name>",
  "autoscalingEnabled": false
}
