Runbook: prometheus_server_unhealthy
This runbook describes how to investigate and remediate a Prometheus server that has no running pods. The alert fires when the running pod count stays below 1 for more than 5 minutes.
Impact
Metrics collection and alerting are degraded. Without a healthy Prometheus server, no new metrics are scraped and alert evaluation stops.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
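A minimal sketch of this step, assuming Escalator issues temporary AWS credentials that map onto the standard AWS CLI environment variables. That mapping is an assumption, so prefer whatever export commands Escalator actually displays. The helper name check_aws_env is hypothetical; it only confirms that the variables are set before you continue.

```shell
# Paste the values from Escalator first, for example:
#   export AWS_ACCESS_KEY_ID="<access-key-id>"
#   export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
#   export AWS_SESSION_TOKEN="<session-token>"

# check_aws_env (hypothetical helper): returns non-zero if any of the
# standard AWS credential variables is unset or empty.
check_aws_env() {
  missing=0
  [ -n "${AWS_ACCESS_KEY_ID:-}" ] || { echo "missing: AWS_ACCESS_KEY_ID" >&2; missing=1; }
  [ -n "${AWS_SECRET_ACCESS_KEY:-}" ] || { echo "missing: AWS_SECRET_ACCESS_KEY" >&2; missing=1; }
  [ -n "${AWS_SESSION_TOKEN:-}" ] || { echo "missing: AWS_SESSION_TOKEN" >&2; missing=1; }
  return "$missing"
}
```

Running check_aws_env && aws sts get-caller-identity afterwards shows which identity the credentials resolve to.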
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
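Before running any kubectl commands, it can help to confirm that the active context points at this cluster. The sketch below relies on the default behaviour of aws eks update-kubeconfig, which names the context after the cluster ARN (ending in the cluster name) unless an alias was supplied. The helper name in_expected_context is hypothetical.

```shell
# in_expected_context (hypothetical helper): succeeds if the context name
# passed as $1 refers to the cluster from step 3.
in_expected_context() {
  case "$1" in
    *cell2-MultitenantEKSCluster*) return 0 ;;
    *) echo "unexpected context: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   in_expected_context "$(kubectl config current-context)"
```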
4. Check Prometheus server pod status
kubectl get pods -n prometheus -l app=prometheus-server
5. Describe the pod to identify issues
kubectl describe pod -n prometheus -l app=prometheus-server
Look for:
- Events: OOMKilled, FailedScheduling, ImagePullBackOff, CrashLoopBackOff
- Conditions: Ready, ContainersReady
- Resource limits: Check if the pod is being OOMKilled due to memory limits
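The describe output can be long, so the checks above can also be narrowed to one line per pod. This is a sketch using kubectl's JSONPath output; the helper names pod_termination_reasons and oomkilled_pods are hypothetical, and the namespace and label are the ones used elsewhere in this runbook.

```shell
# pod_termination_reasons: one line per pod with its name, phase, and the
# reason each container last terminated (empty if it never terminated).
pod_termination_reasons() {
  kubectl get pods -n prometheus -l app=prometheus-server \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# oomkilled_pods: names of pods whose last termination reason was OOMKilled.
oomkilled_pods() {
  pod_termination_reasons | awk -F'\t' '$3 ~ /OOMKilled/ { print $1 }'
}
```

An empty result from oomkilled_pods means no matching pod reports OOMKilled as its last termination reason. It does not rule out the other events listed above.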
6. Check pod logs
kubectl logs -n prometheus -l app=prometheus-server --tail=200
If the pod is in CrashLoopBackOff, check the logs of the previous (crashed) container instance:
kubectl logs -n prometheus -l app=prometheus-server --previous --tail=200
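To cut 200 lines down to the likely culprits, the log output can be filtered by severity. This sketch assumes the logs are logfmt-style lines carrying a level= key, which is the Prometheus default format; if the server is configured for JSON logging, the pattern needs adjusting. The helper name prom_log_problems is hypothetical.

```shell
# prom_log_problems (hypothetical helper): keep only warning and error lines
# from logs piped on stdin. Matching is case-insensitive, so both
# level=error and level=ERROR are caught.
prom_log_problems() {
  grep -iE 'level=(warn|error)'
}

# Usage:
#   kubectl logs -n prometheus -l app=prometheus-server --previous --tail=200 | prom_log_problems
```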
7. Remediate
Depending on the root cause:
- If OOMKilled: The pod may need increased memory limits. This requires a Terraform/Helm change.
- If CrashLoopBackOff: Check logs for the root cause (corrupt WAL, bad config, etc.).
- If stuck in Pending: Check node resources and scheduling constraints.
- If the pod is simply missing: Check if the deployment/statefulset still exists and has the correct replica count:
kubectl get deployment -n prometheus
kubectl get statefulset -n prometheus
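The Pending and missing-pod checks above can be sketched as follows. The helper names are hypothetical, and the commands assume the workload lives in the prometheus namespace as in the earlier steps.

```shell
# replica_status: kind/name, desired replicas, ready replicas (tab-separated).
replica_status() {
  kubectl get deployment,statefulset -n prometheus \
    -o jsonpath='{range .items[*]}{.kind}{"/"}{.metadata.name}{"\t"}{.spec.replicas}{"\t"}{.status.readyReplicas}{"\n"}{end}'
}

# under_replicated: workloads with fewer ready replicas than desired.
# readyReplicas is omitted from the status when it is zero, so an empty
# third field is treated as 0.
under_replicated() {
  replica_status | awk -F'\t' 'NF >= 2 { ready = ($3 == "" ? 0 : $3) + 0; if (ready < $2 + 0) print $1 ": " ready "/" $2 " ready" }'
}

# scheduling_events: FailedScheduling events in the namespace, oldest first.
scheduling_events() {
  kubectl get events -n prometheus --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
}
```

If replica_status shows a desired count of 0, or no Prometheus workload at all, the pod is missing because nothing is asking for it. A line from under_replicated means the workload exists but its pods are not becoming ready, in which case scheduling_events shows whether the scheduler is rejecting them.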