Runbook: prometheus_server_unhealthy

This runbook covers how to investigate and remediate the prometheus_server_unhealthy alert. The alert fires when the Prometheus server has had fewer than 1 running pod for more than 5 minutes.

Impact

Metrics collection and alerting are impacted. Without a healthy Prometheus server, no new metrics are scraped and alert evaluation stops.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
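
Before moving on, it can help to confirm that the credentials actually made it into the current shell. The sketch below assumes Escalator hands out temporary credentials as the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN); that is an assumption, not something this runbook states, and check_aws_env is a hypothetical helper name.

```shell
# Hypothetical helper: report which standard AWS credential variables are unset.
# Assumes Escalator provides temporary credentials as environment variables.
check_aws_env() {
  missing=0
  [ -n "${AWS_ACCESS_KEY_ID:-}" ] || { echo "AWS_ACCESS_KEY_ID is not set" >&2; missing=1; }
  [ -n "${AWS_SECRET_ACCESS_KEY:-}" ] || { echo "AWS_SECRET_ACCESS_KEY is not set" >&2; missing=1; }
  [ -n "${AWS_SESSION_TOKEN:-}" ] || { echo "AWS_SESSION_TOKEN is not set" >&2; missing=1; }
  # Returns 0 when all three are present, 1 otherwise.
  return "$missing"
}
```

Run check_aws_env after exporting the credentials; a non-zero exit status means at least one variable is missing from the terminal session.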

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster

4. Check Prometheus server pod status

kubectl get pods -n prometheus -l app=prometheus-server
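
Since the alert is defined on the running pod count, a quick way to reproduce that number is to count the pods whose STATUS column reads Running. This is a minimal sketch: count_running is a hypothetical helper, and it assumes the default kubectl column order (NAME, READY, STATUS, ...) with headers suppressed. A pod can be Running without being Ready, so treat the result as a first check only.

```shell
# Hypothetical helper: count pods in the Running phase.
# Reads `kubectl get pods --no-headers` output on stdin; column 3 is STATUS.
count_running() {
  awk '$3 == "Running" { n++ } END { print n + 0 }'
}
```

For example: kubectl get pods -n prometheus -l app=prometheus-server --no-headers | count_running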

5. Describe the pod to identify issues

kubectl describe pod -n prometheus -l app=prometheus-server

Look for:

Events: OOMKilled, FailedScheduling, ImagePullBackOff, CrashLoopBackOff
Conditions: Ready, ContainersReady
Resource limits: If a container's Last State shows Reason: OOMKilled, check its memory limit
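
The describe output can be long, so one option is to filter it down to the failure reasons listed above. The helper name flag_pod_issues is hypothetical, and the pattern only covers the four reasons this runbook names; anything else still needs a manual read of the full output.

```shell
# Hypothetical helper: keep only lines that mention the failure reasons
# named in this runbook. Reads `kubectl describe pod` output on stdin.
flag_pod_issues() {
  grep -E 'OOMKilled|FailedScheduling|ImagePullBackOff|CrashLoopBackOff' \
    || echo "no known failure reasons found"
}
```

For example: kubectl describe pod -n prometheus -l app=prometheus-server | flag_pod_issues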

6. Check pod logs

kubectl logs -n prometheus -l app=prometheus-server --tail=200

If the pod is in CrashLoopBackOff, check previous logs:

kubectl logs -n prometheus -l app=prometheus-server --previous --tail=200
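
To narrow 200 lines of logs down to the likely culprits, the output can be filtered by log level. This sketch assumes the server logs in a key=value format with a level field (for example level=error), which may not hold if a different log format is configured; filter_log_problems is a hypothetical helper name.

```shell
# Hypothetical helper: keep only warning and error lines.
# Assumes key=value logs with a `level=` field; matching is case-insensitive.
filter_log_problems() {
  grep -iE 'level=(error|warn)' || true
}
```

For example: kubectl logs -n prometheus -l app=prometheus-server --previous --tail=200 | filter_log_problems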

7. Remediate

Depending on the root cause:

If OOMKilled: The pod may need increased memory limits. This requires a Terraform/Helm change.
If CrashLoopBackOff: Check logs for the root cause (corrupt WAL, bad config, etc.).
If stuck in Pending: Check node resources and scheduling constraints.
If the pod is simply missing: Check if the deployment/statefulset still exists and has the correct replica count:
```
kubectl get deployment -n prometheus
kubectl get statefulset -n prometheus
```
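
The decision list above can also be written down as a small lookup, which is handy when pasting a status straight from step 4 or 5. This is only a restatement of the guidance in this step, not additional tooling that exists in the cluster; suggest_action and the "missing" keyword are hypothetical names.

```shell
# Hypothetical helper: map an observed pod state to the guidance in this step.
suggest_action() {
  case "$1" in
    OOMKilled)        echo "Increase memory limits via a Terraform/Helm change." ;;
    CrashLoopBackOff) echo "Check logs for the root cause (corrupt WAL, bad config, etc.)." ;;
    Pending)          echo "Check node resources and scheduling constraints." ;;
    missing)          echo "Check that the deployment/statefulset exists and has the correct replica count." ;;
    *)                echo "No guidance for this state in the runbook; investigate manually." ;;
  esac
}
```

For example, suggest_action Pending prints the scheduling guidance.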
