Runbook: external_dns_unhealthy_pods

This runbook covers how to investigate and remediate unhealthy External DNS pods. The alert fires when the count of unhealthy pods stays above 0 for more than 5 minutes.

Impact

DNS record management may be impacted. New services or ingresses may not get DNS records created in Route 53, and existing records may not be updated or cleaned up.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
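
A minimal sketch of this step, assuming Escalator issues temporary AWS credentials that map to the three standard AWS environment variables (the exact format Escalator provides is not described here, and the placeholder values are hypothetical). The helper check_aws_env is illustrative and not part of any tool:

```shell
# Paste the values from Escalator in place of the placeholders (hypothetical values).
# export AWS_ACCESS_KEY_ID="<access-key-id>"
# export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
# export AWS_SESSION_TOKEN="<session-token>"

# check_aws_env: illustrative helper that reports which credential variables are unset.
check_aws_env() {
  missing=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}
```

After exporting, run check_aws_env, then aws sts get-caller-identity to confirm the shell is using the expected admin role before touching the cluster.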

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster

4. Check External DNS pod status

kubectl get pods -n kube-system -l app.kubernetes.io/name=external-dns
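
When several replicas are listed, a small filter can narrow the output to the pods that need attention. The sketch below assumes "unhealthy" means a pod that is not Running or has fewer ready containers than total; the alert's own definition may differ. The function name unhealthy_pods is hypothetical:

```shell
# unhealthy_pods: reads `kubectl get pods --no-headers` output on stdin and
# prints NAME, READY and STATUS for pods that are not Running or whose ready
# count differs from the container total.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage (needs cluster access):
# kubectl get pods -n kube-system -l app.kubernetes.io/name=external-dns --no-headers | unhealthy_pods
```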

5. Describe unhealthy pods

kubectl describe pod <unhealthy-pod-name> -n kube-system

Look for:

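  • Container state and events: OOMKilled (shown under Last State), CrashLoopBackOff, ImagePullBackOff
  • IAM/IRSA issues: Check that the pod's service account has the correct IAM role annotation (eks.amazonaws.com/role-arn). When IRSA is wired up, the container environment lists AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE.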
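
The IRSA annotation lives on the service account rather than the pod, so it can be read directly. This is a sketch that assumes the service account is named external-dns; take the real name from the "Service Account:" field of the describe output. The function name irsa_role_arn is hypothetical:

```shell
# irsa_role_arn: prints the IAM role ARN annotated on a service account in
# kube-system. The default name "external-dns" is an assumption.
irsa_role_arn() {
  sa_name="${1:-external-dns}"
  kubectl get serviceaccount "$sa_name" -n kube-system \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# Usage (needs cluster access):
# irsa_role_arn                 # uses the assumed default name
# irsa_role_arn <service-account-name>
```

An empty result means the annotation is missing, which would explain Route 53 permission errors in the logs.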

6. Check pod logs

kubectl logs <unhealthy-pod-name> -n kube-system --tail=200

Look for:

  • AWS API errors (Route 53 permission denied, throttling)
  • Source or provider errors
  • Configuration issues
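
To pull the lines listed above out of a noisy log, a grep filter can help. This sketch assumes the logs use the level=... text format and that AWS errors contain strings such as AccessDenied or Throttling; the patterns are illustrative, not exhaustive, and the function name dns_log_errors is hypothetical:

```shell
# dns_log_errors: reads pod logs on stdin and keeps lines that look like
# errors (error/fatal level, IAM denials, Route 53 throttling).
dns_log_errors() {
  grep -iE 'level=(error|fatal)|AccessDenied|Throttling|rate exceeded'
}

# Usage (needs cluster access). --previous shows logs from the previous
# container instance, which is useful when the pod is crash-looping:
# kubectl logs <unhealthy-pod-name> -n kube-system --tail=200 | dns_log_errors
# kubectl logs <unhealthy-pod-name> -n kube-system --previous --tail=200 | dns_log_errors
```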

7. Remediate

Depending on the root cause:

  • If CrashLoopBackOff: Check logs for the root cause, then restart the pod by deleting it so that its controller recreates it:
    kubectl delete pod <unhealthy-pod-name> -n kube-system
  • If IAM permission errors: Verify the IRSA role and its Route 53 policies in the AWS Console.
  • If throttling: External DNS may be making too many Route 53 API calls. Check the sync interval configuration (the --interval flag) and consider increasing it if it is very short.
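
For the throttling case, the configured interval can be read from the container arguments. This sketch assumes External DNS runs as a Deployment named external-dns in kube-system and that flags are passed in --flag=value form; both are assumptions about this cluster. The function name external_dns_interval is hypothetical:

```shell
# external_dns_interval: prints every container argument that mentions
# "interval" (one per line), or a note if none is set explicitly.
external_dns_interval() {
  deploy_name="${1:-external-dns}"
  kubectl get deployment "$deploy_name" -n kube-system \
    -o jsonpath='{range .spec.template.spec.containers[0].args[*]}{@}{"\n"}{end}' \
    | grep -i 'interval' || echo "no interval arguments set"
}

# Usage (needs cluster access):
# external_dns_interval
```

If no interval argument is set, External DNS falls back to its built-in default.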