Runbook: coredns_error_rate

This runbook covers how to investigate and remediate a high CoreDNS SERVFAIL error rate. The alert fires when the error rate stays above 5% for more than 5 minutes.

Impact

DNS resolution within the cluster may be degraded. Services may experience intermittent DNS lookup failures, causing increased latency and connection errors.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster

4. Check CoreDNS pod status

kubectl get pods -n kube-system -l k8s-app=kube-dns

Ensure all pods are Running and Ready.
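With several replicas it is easy to miss one unhealthy pod in the table. A minimal sketch of a filter for this check follows; the helper name `not_ready_pods` is hypothetical, and it assumes the default `kubectl get pods` columns (NAME, READY, STATUS, ...).

```shell
# Print the name of every pod that is not Running with all containers ready.
# Reads `kubectl get pods` table output on stdin.
# Assumption: default column order NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Usage (same selector as the command above):
#   kubectl get pods -n kube-system -l k8s-app=kube-dns | not_ready_pods
```

Empty output means every CoreDNS pod is Running and Ready.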

5. Check CoreDNS logs for SERVFAIL errors

kubectl logs -n kube-system -l k8s-app=kube-dns --tail=300

Look for:

Which domains are generating SERVFAIL responses
Upstream DNS connectivity issues
Plugin errors or timeouts
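To see which domains account for the SERVFAIL responses, the log output can be grouped by queried name. The sketch below assumes query logging is enabled (the CoreDNS `log` plugin) and that lines use its default format, where the quoted field is `<type> <class> <name> <proto> ...` and the response code follows the closing quote. The helper name `servfail_by_domain` is hypothetical.

```shell
# Count SERVFAIL responses per queried name, most frequent first.
# Reads CoreDNS query log lines on stdin.
# Assumption: default `log` plugin format, e.g.
#   [INFO] 10.0.1.5:41234 - 5321 "A IN example.com. udp 29 false 512" SERVFAIL qr,rd 29 0.002s
servfail_by_domain() {
  awk -F'"' '$3 ~ /^ *SERVFAIL/ {
    split($2, q, " ")   # q[1]=type, q[2]=class, q[3]=name
    print q[3]
  }' | sort | uniq -c | sort -rn
}

# Usage (same selector as the command above):
#   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=300 | servfail_by_domain
```

If the output is empty while the alert is firing, query logging may not be enabled in the Corefile, so the absence of matches is not evidence that lookups are succeeding.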

6. Test DNS resolution from within the cluster

kubectl run dns-test --image=busybox:1.28 --restart=Never --rm -it -- nslookup kubernetes.default

Test external resolution:

kubectl run dns-test --image=busybox:1.28 --restart=Never --rm -it -- nslookup google.com
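Because the alert threshold is a 5% error rate, a single lookup can succeed while the problem persists. A rough sketch of a repeated test follows; `lookup_failures` is a hypothetical helper, and it assumes the lookup command exits non-zero when resolution fails.

```shell
# Run a command N times and report how many attempts failed.
# Usage: lookup_failures <attempts> <command> [args...]
# Assumption: the command exits non-zero when the lookup fails.
lookup_failures() {
  attempts=$1
  shift
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
  done
  echo "$failed/$attempts lookups failed"
}

# Usage, for example pasted into a shell inside a test pod:
#   lookup_failures 20 nslookup kubernetes.default
```

The function uses only POSIX shell syntax, so it should also work in the busybox `sh` used by the test pod above.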

7. Check upstream DNS

Verify that the VPC DNS resolver and any custom upstream DNS servers are functioning correctly. Check the CoreDNS ConfigMap for upstream configuration:

kubectl get configmap coredns -n kube-system -o yaml
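The upstream configuration lives in the `forward` directives of the Corefile. The sketch below pulls out just those lines; `corefile_forwarders` is a hypothetical helper, and the usage line assumes the ConfigMap stores the configuration under the key `Corefile`.

```shell
# List each `forward` directive in a Corefile as "zone=<zone> upstreams=<list>".
# Reads Corefile text on stdin.
corefile_forwarders() {
  awk '$1 == "forward" {
    n = NF
    if ($n == "{") n--   # ignore the brace that opens an options block
    line = "zone=" $2 " upstreams="
    for (i = 3; i <= n; i++) line = line (i > 3 ? " " : "") $i
    print line
  }'
}

# Usage (assumes the ConfigMap key is named Corefile):
#   kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}' | corefile_forwarders
```

An upstream of `/etc/resolv.conf` means CoreDNS forwards to whatever resolver the node is configured with, which on a VPC is typically the VPC DNS resolver.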

8. Remediate

Depending on the root cause:

If upstream DNS is failing: Check VPC DNS settings and any custom forwarders in the Corefile.
If CoreDNS is overloaded: Check resource usage, and consider scaling the deployment if pods are close to their CPU or memory limits:
```
kubectl top pods -n kube-system -l k8s-app=kube-dns
```
If a specific domain is causing errors: Check if the domain exists and is resolvable outside the cluster.

If pods are unhealthy: Restart CoreDNS:

kubectl rollout restart deployment coredns -n kube-system
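For the overloaded case in the list above, the `kubectl top` output can be filtered to show only busy pods. This is a sketch under assumptions: `pods_over_cpu` is a hypothetical helper, the threshold is whatever you consider high for your CPU limits, and the input is assumed to use the default columns NAME, CPU(cores), MEMORY(bytes).

```shell
# Print pods whose CPU usage is at or above a threshold given in millicores.
# Reads `kubectl top pods` output on stdin.
# Usage: pods_over_cpu <millicores>
pods_over_cpu() {
  awk -v limit="$1" 'NR > 1 {
    cpu = $2
    if (cpu ~ /m$/) { sub(/m$/, "", cpu) }   # "480m" -> 480
    else            { cpu = cpu * 1000 }     # whole cores -> millicores
    if (cpu + 0 >= limit + 0) print $1, cpu "m"
  }'
}

# Usage (the 400m threshold is an illustrative value, not a recommendation):
#   kubectl top pods -n kube-system -l k8s-app=kube-dns | pods_over_cpu 400
#
# If most pods are listed, one option is to add replicas
# (the replica count here is hypothetical):
#   kubectl scale deployment coredns -n kube-system --replicas=4
```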
