Runbook: prometheus_autoscaling_unhealthy

This runbook covers how to investigate and remediate the case where the Prometheus Autoscaling server stops sending metrics to AMP (Amazon Managed Service for Prometheus). The alert fires when no health metrics have been received for more than 5 minutes.

Impact

KEDA autoscaling may be impaired because the dedicated Prometheus instance that supplies autoscaling metrics is down or is not forwarding them. Models may fail to scale up or down in response to load changes.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
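
A minimal sketch of this step, assuming Escalator issues temporary AWS credentials as an access key, a secret key, and a session token. The runbook does not specify the format, so treat the variable names as an assumption. The helper check_aws_creds is hypothetical and only confirms that the variables are set before you continue.

```shell
# Assumption: Escalator provides temporary AWS credentials.
# Paste the real values in place of the placeholders, then uncomment.
# export AWS_ACCESS_KEY_ID="<access-key-id>"
# export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
# export AWS_SESSION_TOKEN="<session-token>"

# Hypothetical helper: report any credential variable that is still unset.
# Returns 0 when all three are set, 1 otherwise.
check_aws_creds() {
  status=0
  [ -n "${AWS_ACCESS_KEY_ID:-}" ] || { echo "missing: AWS_ACCESS_KEY_ID"; status=1; }
  [ -n "${AWS_SECRET_ACCESS_KEY:-}" ] || { echo "missing: AWS_SECRET_ACCESS_KEY"; status=1; }
  [ -n "${AWS_SESSION_TOKEN:-}" ] || { echo "missing: AWS_SESSION_TOKEN"; status=1; }
  return "$status"
}

# Usage, after exporting the credentials:
#   check_aws_creds && aws sts get-caller-identity
```

If the credentials are valid, aws sts get-caller-identity prints the account and identity they resolve to, which should be the admin role granted by Escalator.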

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
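
Before running further commands, it can help to confirm that kubectl is pointed at the intended cluster. The sketch below assumes the default behaviour of aws eks update-kubeconfig, which names the context after the cluster ARN (no --alias was passed). is_expected_cluster is a hypothetical helper, not part of any CLI.

```shell
# Hypothetical helper: succeed only if the context name ends with
# ":cluster/<expected cluster name>", the shape of an EKS cluster ARN.
#   $1: context name, $2: expected cluster name
is_expected_cluster() {
  case "$1" in
    *":cluster/$2") return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_expected_cluster "$(kubectl config current-context)" \
#     cell2-MultitenantEKSCluster || echo "wrong kubectl context"
```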

4. Check Prometheus autoscaling pod status

kubectl get pods -n prometheus -l app=prometheus-autoscaling

If the label selector returns no pods, list all pods in the prometheus namespace and find the autoscaling pod by name:

kubectl get pods -n prometheus

5. Describe the pod to identify issues

kubectl describe pod <prometheus-autoscaling-pod> -n prometheus

Look for:

  • Events: OOMKilled, FailedScheduling, ImagePullBackOff, CrashLoopBackOff
  • Conditions: Ready, ContainersReady
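
The describe output can be long, so one way to surface the failure keywords above is to filter for them. This is a sketch only: scan_describe is a hypothetical helper, and POD stands for the pod name found in step 4.

```shell
# Hypothetical helper: count occurrences of the failure keywords listed
# above in `kubectl describe pod` output read from stdin.
scan_describe() {
  grep -E -o 'OOMKilled|FailedScheduling|ImagePullBackOff|CrashLoopBackOff' \
    | sort | uniq -c
}

# Usage (POD is the pod name from step 4):
#   kubectl describe pod "$POD" -n prometheus | scan_describe
#
# The last termination reason can also be read directly:
#   kubectl get pod "$POD" -n prometheus \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```

Empty output from scan_describe means none of the listed keywords appeared; the Events and Conditions sections are still worth reading in full.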

6. Check pod logs

kubectl logs <prometheus-autoscaling-pod> -n prometheus --tail=200

Look for:

  • Remote write errors (connection refused, timeouts to AMP endpoint)
  • Scrape failures
  • WAL corruption errors
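
To narrow 200 lines of logs down to the problem areas above, a simple filter can help. The patterns below are illustrative guesses, not exact Prometheus log messages, and scan_logs is a hypothetical helper; adjust both to what the logs actually contain.

```shell
# Hypothetical helper: keep only log lines (from stdin) that mention remote
# write, connection or timeout problems, scraping, the WAL, or corruption.
# Matching is case-insensitive; "wal" must stand alone as a word.
scan_logs() {
  grep -E -i \
    'remote[_ ]write|connection refused|timeout|timed out|scrape|(^|[^a-z])wal([^a-z]|$)|corrupt'
}

# Usage (POD is the pod name from step 4):
#   kubectl logs "$POD" -n prometheus --tail=200 | scan_logs
#
# If the container has restarted, --previous shows the logs of the
# previous container instance, which usually hold the crash reason:
#   kubectl logs "$POD" -n prometheus --previous --tail=200 | scan_logs
```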

7. Verify remote write to AMP

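Check that the Prometheus remote write configuration points at the correct AMP workspace. The command below prints the remote_write section from the ConfigMaps in the prometheus namespace; it shows the configuration only and does not test whether the workspace is reachable: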

kubectl get configmap -n prometheus -o yaml | grep remote_write -A 10
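
For comparison with the url in the remote_write section, AMP remote write endpoints follow a fixed pattern built from the region and the workspace ID. The sketch below prints the expected URL. The workspace ID is a placeholder you must look up, and using us-east-1 assumes the workspace lives in the same region as the cluster; expected_remote_write_url is a hypothetical helper.

```shell
# Hypothetical helper: build the AMP remote write URL.
#   $1: AWS region, $2: AMP workspace ID
expected_remote_write_url() {
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/remote_write\n' \
    "$1" "$2"
}

# Usage (WORKSPACE_ID is the ID of the AMP workspace, looked up separately):
#   kubectl get configmap -n prometheus -o yaml \
#     | grep -F "$(expected_remote_write_url us-east-1 "$WORKSPACE_ID")"
```

No match from grep suggests the configured url differs from the expected one, for example a wrong region or workspace ID.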

8. Remediate

Depending on the root cause:

  • If OOMKilled: The pod may need increased memory limits (Terraform/Helm change).
  • If remote write failing: Check AMP workspace health in the AWS Console and verify IAM permissions.
  • If CrashLoopBackOff: Check the logs for the root cause. If a restart is warranted, delete the pod so that it is recreated:
    kubectl delete pod <prometheus-autoscaling-pod> -n prometheus
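
After remediation, confirm that the pod has come back healthy. A minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...) and the label used in step 4; all_pods_ready is a hypothetical helper.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output from stdin
# and succeed only if at least one pod is listed and every pod is Running
# with all of its containers ready (READY column like "2/2").
all_pods_ready() {
  awk '
    { n++; split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") bad++ }
    END { if (n > 0 && bad == 0) exit 0; exit 1 }
  '
}

# Usage:
#   kubectl get pods -n prometheus -l app=prometheus-autoscaling --no-headers \
#     | all_pods_ready && echo "prometheus-autoscaling is ready"
```

A ready pod does not by itself prove that metrics are reaching AMP again, so also wait for the alert to clear.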