Runbook: prometheus_autoscaling_unhealthy
This runbook covers how to investigate and remediate the case where the Prometheus autoscaling server stops sending metrics to Amazon Managed Service for Prometheus (AMP). The alert fires when no health metrics have been received for more than 5 minutes.
Impact
KEDA autoscaling may be impacted because the dedicated Prometheus instance that supplies autoscaling metrics is down or unable to send metrics. Models may not scale in response to load changes.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
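Before moving on, it can help to confirm that the credentials were actually exported. The following is a minimal sketch, assuming Escalator issues temporary credentials that map onto the standard AWS CLI environment variables. The session token in particular is an assumption, so drop it from the list if your credentials do not include one. check_aws_env is a hypothetical helper name, not part of any tool.

```shell
# check_aws_env: report which standard AWS credential variables are unset.
# Assumption: the Escalator credentials are exported under these three names.
check_aws_env() {
  missing=""
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

Once check_aws_env prints ok, running aws sts get-caller-identity shows which identity the terminal is using.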
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
4. Check Prometheus autoscaling pod status
kubectl get pods -n prometheus -l app=prometheus-autoscaling
If the label selector returns no pods, list all pods in the prometheus namespace:
kubectl get pods -n prometheus
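When the namespace holds many pods, a small filter can narrow the listing to the ones worth describing. This is a sketch only: unhealthy_pods is a hypothetical helper, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
# print "<name> <ready> <status>" for pods that are not Running or not fully ready.
# Usage: kubectl get pods -n prometheus --no-headers | unhealthy_pods
unhealthy_pods() {
  awk '{
    split($2, r, "/")   # READY column, e.g. "1/2"
    if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3
  }'
}
```

Pods that finished normally (STATUS Completed) are also printed by this filter, so read the output with that in mind.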
5. Describe the pod to identify issues
kubectl describe pod <prometheus-autoscaling-pod> -n prometheus
Look for:
- Events and container state: OOMKilled (reported under the container's Last State), FailedScheduling, ImagePullBackOff, CrashLoopBackOff
- Conditions: Ready, ContainersReady
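The describe output is long, so the reasons listed above can be pulled out with a filter. A minimal sketch, assuming the reason strings appear verbatim in the output; pod_failure_reasons is a hypothetical helper, and ErrImagePull is added here because it commonly accompanies ImagePullBackOff.

```shell
# pod_failure_reasons: scan `kubectl describe pod` output on stdin and print
# each known failure reason it mentions, once.
# Usage: kubectl describe pod <prometheus-autoscaling-pod> -n prometheus | pod_failure_reasons
pod_failure_reasons() {
  grep -oE 'OOMKilled|FailedScheduling|ImagePullBackOff|ErrImagePull|CrashLoopBackOff' | sort -u
}
```

Empty output means none of these reasons were found, not that the pod is healthy; the Conditions section still needs a manual look.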
6. Check pod logs
kubectl logs <prometheus-autoscaling-pod> -n prometheus --tail=200
Look for:
- Remote write errors (connection refused, timeouts to AMP endpoint)
- Scrape failures
- WAL corruption errors
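To scan the log tail for the three categories above, a case-insensitive filter is usually enough for a first pass. This is a sketch under stated assumptions: the patterns are generic guesses at how such errors are worded, not exact Prometheus log messages, and log_error_lines is a hypothetical helper.

```shell
# log_error_lines: keep only log lines on stdin that look like remote write,
# scrape, or WAL corruption problems. The patterns are assumptions; adjust
# them to the messages actually seen in your logs.
# Usage: kubectl logs <prometheus-autoscaling-pod> -n prometheus --tail=200 | log_error_lines
log_error_lines() {
  grep -iE 'connection refused|timeout|context deadline exceeded|scrape.*(fail|error)|wal.*corrupt|corrupt.*wal'
}
```

For a pod in CrashLoopBackOff, adding --previous to kubectl logs returns the logs of the prior container instance, which is often where the failure is recorded.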
7. Verify remote write to AMP
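Check that the Prometheus remote write configuration points at the correct AMP workspace. The command below prints the remote_write section from the ConfigMaps in the namespace; it shows configuration only and does not test whether the workspace is reachable: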
kubectl get configmap -n prometheus -o yaml | grep remote_write -A 10
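The configuration alone does not show whether samples are reaching AMP. One way to check is to read Prometheus's own remote write counters from its /metrics endpoint. The sketch below makes several assumptions: the pod serves metrics on port 9090 (the Prometheus default), the counter is named prometheus_remote_storage_samples_failed_total (the name differs in older Prometheus versions), and sum_metric is a hypothetical helper.

```shell
# sum_metric NAME: sum all series of counter NAME from Prometheus text-format
# metrics on stdin (the format served at /metrics).
# Usage (assumes port 9090 and a recent Prometheus version):
#   kubectl port-forward <prometheus-autoscaling-pod> 9090:9090 -n prometheus &
#   curl -s localhost:9090/metrics | sum_metric prometheus_remote_storage_samples_failed_total
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)           # drop the {label="..."} part
      if (metric == name) total += $NF   # last field is the sample value
    }
    END { printf "%d\n", total }
  '
}
```

A total that keeps growing between two runs a minute apart suggests remote write is failing; a total of 0 means no failed samples were reported, or that the metric name does not match this Prometheus version.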
8. Remediate
Depending on the root cause:
- If OOMKilled: The pod may need increased memory limits (Terraform/Helm change).
- If remote write failing: Check AMP workspace health in the AWS Console and verify IAM permissions.
- If CrashLoopBackOff: Check the logs for the root cause and, if needed, restart the pod by deleting it so that it is recreated:
kubectl delete pod <prometheus-autoscaling-pod> -n prometheus