Runbook: external_dns_unhealthy_pods
This runbook describes how to investigate and remediate unhealthy External DNS pods. The alert fires when the unhealthy pod count stays above 0 for more than 5 minutes.
Impact
DNS record management may be impacted. New services or ingresses may not get DNS records created in Route 53, and existing records may not be updated or cleaned up.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them as environment variables in the terminal session you will use for the remaining steps.
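Before continuing, it can help to confirm that the credentials actually landed in the current shell. The helper below is a minimal sketch, not part of Escalator: the function name check_aws_env is hypothetical, and it assumes Escalator hands out temporary credentials that use the standard AWS CLI variables, including a session token.

```shell
# Report any standard AWS credential variable that is not set in this shell.
# Assumption: Escalator issues temporary credentials, so a session token is expected.
check_aws_env() {
  missing=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    # Indirect lookup of the variable named in $var (POSIX-compatible).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}
```

If check_aws_env prints nothing, running aws sts get-caller-identity should show the admin identity you requested.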
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
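To avoid running the following steps against the wrong cluster, the active context can be checked first. This is a sketch under one assumption: update-kubeconfig was run without an alias, so the context name is the cluster ARN and ends with the cluster name. The function name confirm_cluster_context is hypothetical.

```shell
# Succeed only if the active kubectl context points at the expected cluster.
# Assumption: the context name ends with the cluster name (the default when no alias is set).
confirm_cluster_context() {
  expected="cell2-MultitenantEKSCluster"
  current=$(kubectl config current-context) || return 1
  case "$current" in
    *"$expected") echo "context ok: $current" ;;
    *) echo "unexpected context: $current"; return 1 ;;
  esac
}
```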
4. Check External DNS pod status
kubectl get pods -n kube-system -l app.kubernetes.io/name=external-dns
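When several replicas are running, it may be quicker to list only the pods that are not ready. The sketch below reuses the namespace and label selector from the command above. The function name is hypothetical, and "unhealthy" is taken to mean "at least one container not ready, or no container status reported yet", which may differ from the alert's own definition.

```shell
# Print the names of External DNS pods that have a container that is not ready,
# or that report no container status at all (for example, still Pending).
list_unready_external_dns_pods() {
  kubectl get pods -n kube-system -l app.kubernetes.io/name=external-dns \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].ready}{"\n"}{end}' |
    awk -F '\t' '$1 != "" && ($2 == "" || $2 ~ /false/) { print $1 }'
}
```

The names it prints can be used as <unhealthy-pod-name> in the steps that follow.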
5. Describe unhealthy pods
kubectl describe pod <unhealthy-pod-name> -n kube-system
Look for:
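- Container state and events: OOMKilled (shown as the reason under Last State), CrashLoopBackOff, ImagePullBackOff
- IAM/IRSA issues: Check that the pod's service account carries the eks.amazonaws.com/role-arn annotation and that it points to the correct IAM role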
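The IRSA check can also be done from the terminal. The following is a sketch, with a hypothetical function name, that looks up the pod's service account and prints the role ARN from its eks.amazonaws.com/role-arn annotation. It does not verify that the role is the right one or that its policies are correct.

```shell
# Print the IAM role ARN annotated on the service account used by a pod.
# Usage: pod_irsa_role <unhealthy-pod-name>
pod_irsa_role() {
  pod="$1"
  # Find which service account the pod runs as.
  sa=$(kubectl get pod "$pod" -n kube-system \
    -o jsonpath='{.spec.serviceAccountName}') || return 1
  # Read the IRSA annotation; dots in the key are escaped for jsonpath.
  role=$(kubectl get serviceaccount "$sa" -n kube-system \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}') || return 1
  if [ -z "$role" ]; then
    echo "service account $sa has no eks.amazonaws.com/role-arn annotation"
    return 1
  fi
  echo "$sa -> $role"
}
```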
6. Check pod logs
kubectl logs <unhealthy-pod-name> -n kube-system --tail=200
Look for:
- AWS API errors (Route 53 permission denied, throttling)
- Source or provider errors
- Configuration issues
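To narrow 200 lines of output down to the likely culprits, the logs can be filtered by keyword. This is only a sketch: the function name is hypothetical, and the keywords are assumptions about how the errors listed above are worded, so an empty result does not prove the logs are clean.

```shell
# Show recent log lines from a pod that match common failure keywords.
# Usage: scan_external_dns_logs <unhealthy-pod-name>
# Returns non-zero when nothing matches (grep's normal behaviour).
scan_external_dns_logs() {
  kubectl logs "$1" -n kube-system --tail=200 |
    grep -i -E 'denied|throttl|error|fatal'
}
```

If the container has already restarted, adding --previous to the kubectl logs command shows the output of the previous container instance, which usually contains the actual crash.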
7. Remediate
Depending on the root cause:
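- If CrashLoopBackOff: Check the logs for the root cause first, then delete the pod so that its controller recreates it (this assumes the pod is managed by a Deployment or similar controller):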
kubectl delete pod <unhealthy-pod-name> -n kube-system
- If IAM permission errors: Verify the IRSA role and its Route 53 policies in the AWS Console.
- If throttling: External DNS may be making too many Route 53 API calls. Check the sync interval configuration (the --interval flag) and consider increasing it.
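For the throttling case, the configured sync interval can be read from the workload's container arguments. The sketch below assumes that External DNS runs as a Deployment carrying the same app.kubernetes.io/name=external-dns label as its pods, and that the interval is passed as --interval=<value>. Neither is confirmed by this runbook, and the function name is hypothetical.

```shell
# Print the --interval argument configured for External DNS, if any.
# Assumption: a Deployment in kube-system carries the same label as the pods.
external_dns_interval() {
  kubectl get deployment -n kube-system -l app.kubernetes.io/name=external-dns \
    -o jsonpath='{.items[*].spec.template.spec.containers[*].args[*]}' |
    tr ' ' '\n' | grep -e '^--interval'
}
```

If nothing is printed, the flag is probably not set explicitly and External DNS is using its built-in default interval.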