Runbook: coredns_unhealthy_pods

This runbook covers how to investigate and remediate unhealthy CoreDNS pods. The alert fires when the unhealthy pod count stays above 0 for more than 5 minutes.

Impact

DNS resolution within the cluster may be impacted. Services may fail to resolve internal cluster DNS names, causing connection failures across workloads.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
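
A minimal sketch of this step, assuming Escalator issues standard AWS temporary credentials (access key, secret key, session token). The placeholder values and the `require_aws_env` helper are hypothetical and not part of Escalator itself.

```shell
# Paste the values shown by Escalator in place of the placeholders
# (assumed to be standard AWS temporary credentials):
#   export AWS_ACCESS_KEY_ID="<access-key-id>"
#   export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
#   export AWS_SESSION_TOKEN="<session-token>"

# Hypothetical helper: fail early if any credential variable is missing.
require_aws_env() {
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing $var" >&2
      return 1
    fi
  done
}
```

Running `require_aws_env && aws sts get-caller-identity` afterwards should show which identity the terminal is using before any cluster commands are run.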

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster

4. Check CoreDNS pod status

kubectl get pods -n kube-system -l k8s-app=kube-dns
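
To pick the unhealthy pods out of that listing, the output can be filtered. The `unready_pods` helper below is a hypothetical sketch; it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods that are not fully ready or not Running.
unready_pods() {
  awk '{
    split($2, ready, "/")   # READY column looks like "1/1"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Usage:
#   kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers | unready_pods
```

Adding `-o wide` to the original command also shows which node each pod is running on, which is useful for the node check in the next step.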

5. Describe unhealthy pods

kubectl describe pod <unhealthy-pod-name> -n kube-system

Look for:

Events and container state: OOMKilled, FailedScheduling, CrashLoopBackOff, readiness probe failures
Node issues: Check if the node the pod is scheduled on is healthy
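
One way to check the node, sketched under the assumption that the pod has already been scheduled. The `pod_node` helper name is hypothetical.

```shell
# Hypothetical helper: print the node a kube-system pod is scheduled on.
pod_node() {
  kubectl get pod "$1" -n kube-system -o jsonpath='{.spec.nodeName}'
}

# Usage:
#   node="$(pod_node <unhealthy-pod-name>)"
#   kubectl get node "$node"        # STATUS should be Ready
#   kubectl describe node "$node"   # review Conditions such as MemoryPressure or DiskPressure
```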

6. Check pod logs

kubectl logs <unhealthy-pod-name> -n kube-system --tail=200

Look for:

Plugin errors
Upstream DNS resolution failures
Memory or resource issues
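
The log check above can be narrowed with a filter. This is a sketch only: the `coredns_log_errors` helper is hypothetical and its pattern list is an assumption about what is worth surfacing, not an exhaustive set of CoreDNS failure messages.

```shell
# Hypothetical helper: keep only log lines that look like problems.
# Reads log text on stdin; returns success even when nothing matches.
coredns_log_errors() {
  grep -Ei '\[(error|fatal|warning)\]|timeout|servfail|unreachable' || true
}

# Usage:
#   kubectl logs <unhealthy-pod-name> -n kube-system --tail=200 | coredns_log_errors
#   # For a pod that keeps restarting, the previous container's logs are often more useful:
#   kubectl logs <unhealthy-pod-name> -n kube-system --previous --tail=200 | coredns_log_errors
```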

7. Remediate

Depending on the root cause:

If the pod is in CrashLoopBackOff or was OOMKilled: Restart it by deleting it; the coredns deployment creates a replacement:

kubectl delete pod <unhealthy-pod-name> -n kube-system

If node is unhealthy: Cordon and drain the node, then check if the pod reschedules successfully.
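
A sketch of the cordon-and-drain step. The `drain_node` helper is hypothetical, and the drain flags are assumptions about what this cluster needs: `--ignore-daemonsets` is usually required because DaemonSet pods cannot be evicted, and `--delete-emptydir-data` allows eviction of pods using emptyDir volumes at the cost of discarding that data.

```shell
# Hypothetical helper: stop new pods landing on the node, then evict existing ones.
drain_node() {
  kubectl cordon "$1" &&
    kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Usage:
#   drain_node <node-name>
#   kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide   # confirm the pod moved and is Ready
```

If the node recovers, `kubectl uncordon <node-name>` makes it schedulable again.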

If all CoreDNS pods are unhealthy: This is a critical situation. Check the CoreDNS deployment, then restart it:

kubectl get deployment coredns -n kube-system
kubectl rollout restart deployment coredns -n kube-system
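
After a restart it helps to wait for the rollout to finish before treating the incident as resolved. The sketch below is one way to do that; the `restart_coredns` helper name, the 120 second timeout, and the busybox image are assumptions rather than established values for this cluster.

```shell
# Hypothetical helper: restart CoreDNS and block until the rollout completes or times out.
restart_coredns() {
  kubectl rollout restart deployment coredns -n kube-system &&
    kubectl rollout status deployment coredns -n kube-system --timeout=120s
}

# Usage:
#   restart_coredns
#   # Optional end-to-end check from a throwaway pod:
#   kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default
```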
