Runbook: coredns_unhealthy_pods
This runbook covers how to investigate and remediate unhealthy CoreDNS pods. The alert fires when the unhealthy pod count stays above 0 for more than 5 minutes.
Impact
DNS resolution within the cluster may be impacted. Services may fail to resolve internal cluster DNS names, causing connection failures across workloads.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
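Before moving on, it can help to confirm that the credentials actually landed in the current shell. The sketch below assumes Escalator issues temporary AWS credentials that map onto the standard AWS CLI environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN). That mapping and the helper name check_aws_env are assumptions, not something this runbook specifies.

```shell
# Hypothetical helper: report any standard AWS credential variable that is
# unset or empty. Assumes Escalator hands out temporary credentials that
# include a session token.
check_aws_env() {
  missing=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    # Look up the value of the variable whose name is stored in $var.
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_aws_env && aws sts get-caller-identity` should then show which identity the shell is using.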
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
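A quick sanity check that kubectl now points at the intended cluster can prevent running later steps against the wrong one. By default, `aws eks update-kubeconfig` names the context after the cluster ARN, so the sketch below only checks the ARN suffix. The helper name context_matches_cluster is hypothetical, and the check will not match if the context was created with a custom alias.

```shell
# Hypothetical helper: succeed if a kubeconfig context name ends in
# ":cluster/<cluster-name>", the shape of the EKS cluster ARN that
# `aws eks update-kubeconfig` uses as the context name by default.
context_matches_cluster() {
  context="$1"
  cluster="$2"
  case "$context" in
    *":cluster/$cluster") return 0 ;;
    *) return 1 ;;
  esac
}
```

For example: `context_matches_cluster "$(kubectl config current-context)" cell2-MultitenantEKSCluster || echo "unexpected context"`.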
4. Check CoreDNS pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns
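When several replicas are listed, a small filter can pick out the ones that need attention. This is a minimal sketch: it assumes the default `kubectl get pods` column order (NAME, READY, STATUS), and the helper name unready_pods is hypothetical.

```shell
# Hypothetical helper: print the names of pods whose READY count is
# incomplete or whose STATUS is not Running.
# Reads `kubectl get pods --no-headers` output on stdin.
unready_pods() {
  awk '{ split($2, ready, "/"); if (ready[1] != ready[2] || $3 != "Running") print $1 }'
}
```

Usage: `kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers | unready_pods`.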
5. Describe unhealthy pods
kubectl describe pod <unhealthy-pod-name> -n kube-system
Look for:
- Events and container state: OOMKilled, FailedScheduling, CrashLoopBackOff, readiness probe failures
- Node issues: Check if the node the pod is scheduled on is healthy
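The node check above can be sketched as a single helper that finds the pod's node, shows that node's status, and lists the pod's events in time order. The helper name inspect_pod_node is hypothetical, and the namespace default simply mirrors the commands in this runbook.

```shell
# Hypothetical helper: print the node a pod is scheduled on, then that
# node's status line and the pod's events sorted by time.
inspect_pod_node() {
  pod="$1"
  ns="${2:-kube-system}"
  node="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')"
  if [ -z "$node" ]; then
    # An empty nodeName usually means the pod was never scheduled.
    echo "pod $pod is not scheduled on any node" >&2
    return 1
  fi
  echo "node: $node"
  kubectl get node "$node"
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
}
```

A node reporting anything other than Ready points to the node-level remediation in step 7.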
6. Check pod logs
kubectl logs <unhealthy-pod-name> -n kube-system --tail=200
Look for:
- Plugin errors
- Upstream DNS resolution failures
- Memory or resource issues
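To scan 200 lines of output quickly, the logs can be piped through a filter for the patterns listed above. The sketch below assumes the usual CoreDNS log prefixes such as [ERROR] and [WARNING] and treats "i/o timeout" and SERVFAIL as signs of upstream trouble. The pattern list and the helper name coredns_error_lines are assumptions and may need tuning for this cluster.

```shell
# Hypothetical filter: keep log lines that usually indicate trouble.
# Reads CoreDNS log text on stdin; prints nothing if no line matches.
coredns_error_lines() {
  grep -E '\[(ERROR|WARNING|FATAL)\]|i/o timeout|SERVFAIL' || true
}
```

Usage: `kubectl logs <unhealthy-pod-name> -n kube-system --tail=200 | coredns_error_lines`. For a pod that keeps crashing, adding `--previous` to `kubectl logs` reads the logs of the last terminated container instead of the current one.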
7. Remediate
Depending on the root cause:
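- If CrashLoopBackOff or OOMKilled: Delete the pod so the deployment recreates it (if the replacement is OOMKilled again, review the CoreDNS memory limit rather than restarting repeatedly):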
kubectl delete pod <unhealthy-pod-name> -n kube-system
- If node is unhealthy: Cordon and drain the node, then check if the pod reschedules successfully.
- If all CoreDNS pods are unhealthy: This is a critical situation. Check the CoreDNS deployment:
kubectl get deployment coredns -n kube-system
kubectl rollout restart deployment coredns -n kube-system
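The node and deployment remediations above can be sketched as two small helpers. The helper names are hypothetical, the drain flags are the ones commonly needed on nodes that run DaemonSets or pods with emptyDir volumes, and the 120s rollout timeout is an arbitrary choice rather than a value from this runbook.

```shell
# Hypothetical helper: take a node out of service so its pods reschedule.
drain_node() {
  node="$1"
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Hypothetical helper: restart CoreDNS and wait for the rollout to finish.
# The 120s timeout is an arbitrary choice, not a value from this runbook.
restart_coredns() {
  kubectl rollout restart deployment coredns -n kube-system &&
    kubectl rollout status deployment coredns -n kube-system --timeout=120s
}
```

After either action, re-running the pod status command from step 4 should show all CoreDNS pods Running and ready. One possible end-to-end check, assuming the busybox image can be pulled in this cluster, is `kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default`.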