Runbook: coredns_error_rate
This runbook covers steps to investigate and remediate when the CoreDNS SERVFAIL error rate exceeds 5%. The alert fires when the error rate is above 5% for more than 5 minutes.
Impact
DNS resolution within the cluster may be degraded. Services may experience intermittent DNS lookup failures, causing increased latency and connection errors.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
4. Check CoreDNS pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns
Ensure all pods are Running and Ready.
5. Check CoreDNS logs for SERVFAIL errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=300
Look for:
- Which domains are generating SERVFAIL responses
- Upstream DNS connectivity issues
- Plugin errors or timeouts
6. Test DNS resolution from within the cluster
kubectl run dns-test --image=busybox:1.28 --restart=Never --rm -it -- nslookup kubernetes.default
Test external resolution:
kubectl run dns-test --image=busybox:1.28 --restart=Never --rm -it -- nslookup google.com
7. Check upstream DNS
Verify that the VPC DNS resolver and any custom upstream DNS servers are functioning correctly. Check the CoreDNS ConfigMap for upstream configuration:
kubectl get configmap coredns -n kube-system -o yaml
8. Remediate
Depending on the root cause:
- If upstream DNS is failing: Check VPC DNS settings and any custom forwarders in the Corefile.
- If CoreDNS is overloaded: Check resource usage with the following command, then consider scaling:
kubectl top pods -n kube-system -l k8s-app=kube-dns
- If a specific domain is causing errors: Check if the domain exists and is resolvable outside the cluster.
- If pods are unhealthy: Restart CoreDNS:
kubectl rollout restart deployment coredns -n kube-system
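Added example for step 5 (not part of the original runbook): a shell function that tallies which query names are returning SERVFAIL or upstream errors in the CoreDNS logs. It assumes the default line formats of the CoreDNS log and errors plugins (query lines only appear if the log plugin is enabled in the Corefile); the function names and sample log lines are constructed for illustration.

```shell
#!/bin/sh
# Added example (not part of the original runbook): summarise failing names.
# Assumes the default line formats of the CoreDNS "log" and "errors" plugins.

# Reads CoreDNS log lines on stdin; prints "<count> <kind> <name>", highest first.
servfail_by_domain() {
  awk '
    # log plugin: ... "A IN name. udp 38 false 512" SERVFAIL qr,rd 38 2.001s
    / SERVFAIL / && match($0, /"[^"]*"/) {
      split(substr($0, RSTART + 1, RLENGTH - 2), q, " ")
      count["servfail " tolower(q[3])]++
      next
    }
    # errors plugin: [ERROR] plugin/errors: 2 name. A: <upstream error text>
    /plugin\/errors: / {
      sub(/.*plugin\/errors: /, "")
      split($0, e, " ")
      count["upstream-error " tolower(e[2])]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k1,1nr -k2
}

# Constructed sample log lines (names and addresses are made up).
sample_logs() {
  cat <<'EOF'
[INFO] 10.0.12.7:41522 - 4211 "A IN api.internal.example. udp 38 false 512" SERVFAIL qr,rd 38 2.001s
[INFO] 10.0.12.7:41522 - 4212 "AAAA IN api.internal.example. udp 38 false 512" SERVFAIL qr,rd 38 2.000s
[INFO] 10.0.14.9:50110 - 9001 "A IN API.internal.example. udp 38 false 512" SERVFAIL qr,rd 38 2.002s
[INFO] 10.0.14.9:50110 - 9002 "A IN db.internal.example. udp 37 false 512" SERVFAIL qr,rd 37 2.001s
[INFO] 10.0.13.3:38844 - 7710 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000s
[ERROR] plugin/errors: 2 api.internal.example. A: read udp 10.0.12.40:53211->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 api.internal.example. AAAA: read udp 10.0.12.40:40617->10.0.0.2:53: i/o timeout
EOF
}

# Offline demo; against the cluster, pipe the step 5 command in instead:
#   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=300 | servfail_by_domain
sample_logs | servfail_by_domain
```

A name that shows up under both kinds points at the upstream resolver (step 7) rather than at CoreDNS itself.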