Runbook: alb_controller_unhealthy_pods
This runbook covers steps to investigate and remediate when the AWS Load Balancer Controller has unhealthy pods. The alert fires when the unhealthy pod count is greater than 0 for more than 5 minutes.
Impact
Ingress and load balancer provisioning may be impacted. New Ingress resources or Service of type LoadBalancer may not be created or updated, and existing load balancers may not reflect configuration changes.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
4. Check ALB Controller pod status
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
5. Describe unhealthy pods
kubectl describe pod <unhealthy-pod-name> -n kube-system
Look for:
- Events: OOMKilled, CrashLoopBackOff, ImagePullBackOff
- Readiness/Liveness probe failures
- IAM/IRSA issues: Check if the service account has the correct IAM role annotation
6. Check pod logs
kubectl logs <unhealthy-pod-name> -n kube-system --tail=200
Look for:
- AWS API errors (permission denied, throttling)
- Webhook certificate issues
- Leader election failures
7. Remediate
Depending on the root cause:
- If CrashLoopBackOff: Check logs for root cause, then restart:
kubectl delete pod <unhealthy-pod-name> -n kube-system
- If IAM permission errors: Verify the IRSA role and policy in AWS Console.
- If webhook issues: Check the webhook configuration:
kubectl get validatingwebhookconfigurationskubectl get mutatingwebhookconfigurations
- If leader election issues: Delete all controller pods to force re-election:
kubectl rollout restart deployment aws-load-balancer-controller -n kube-system