Runbook: alb_controller_unhealthy_pods

This runbook covers steps to investigate and remediate when the AWS Load Balancer Controller has unhealthy pods. The alert fires when the unhealthy pod count is greater than 0 for more than 5 minutes.

Impact

Ingress and load balancer provisioning may be impacted. Load balancers for new Ingress resources or Services of type LoadBalancer may not be created or updated, and existing load balancers may not reflect configuration changes.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
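
Before moving on, it can help to confirm that the credentials actually landed in the current shell. The sketch below assumes Escalator issues temporary credentials as the three standard AWS environment variables; `check_aws_env` is a hypothetical helper, not part of Escalator or the AWS CLI.

```shell
# Hypothetical helper: reports which of the standard AWS credential
# variables are unset or empty. Assumes temporary (session) credentials.
check_aws_env() {
  missing=0
  [ -n "${AWS_ACCESS_KEY_ID:-}" ]     || { echo "missing: AWS_ACCESS_KEY_ID" >&2; missing=1; }
  [ -n "${AWS_SECRET_ACCESS_KEY:-}" ] || { echo "missing: AWS_SECRET_ACCESS_KEY" >&2; missing=1; }
  [ -n "${AWS_SESSION_TOKEN:-}" ]     || { echo "missing: AWS_SESSION_TOKEN" >&2; missing=1; }
  return "$missing"
}

# Usage, after pasting the export lines from Escalator:
#   check_aws_env && aws sts get-caller-identity
```

If `aws sts get-caller-identity` returns the admin role you requested, the credentials are active.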

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
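
To avoid running later commands against the wrong cluster, the active kubectl context can be checked first. This is a minimal sketch: `context_matches_cluster` is a hypothetical helper, and it assumes the default context naming, where `aws eks update-kubeconfig` names the context after the cluster ARN (an `--alias` would change that).

```shell
# Hypothetical helper: succeeds when a kubeconfig context name refers to
# the given cluster, i.e. it equals the name or ends in "/<name>".
context_matches_cluster() {
  case "$1" in
    */"$2" | "$2") return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   context_matches_cluster "$(kubectl config current-context)" cell2-MultitenantEKSCluster \
#     || echo "kubectl is pointing at a different cluster" >&2
```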

4. Check ALB Controller pod status

kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
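
When several replicas are listed, a small filter can pull out only the pods that need attention. The sketch below assumes the default `kubectl get pods` column order (NAME, READY, STATUS); `unhealthy_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the names of pods that are not fully ready or not Running.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")                      # READY column, e.g. "0/1"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Usage:
#   kubectl get pods -n kube-system \
#     -l app.kubernetes.io/name=aws-load-balancer-controller --no-headers | unhealthy_pods
```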

5. Describe unhealthy pods

kubectl describe pod <unhealthy-pod-name> -n kube-system

Look for:

Events: OOMKilled, CrashLoopBackOff, ImagePullBackOff
Readiness/Liveness probe failures
IAM/IRSA issues: Check that the controller's service account has the correct IAM role annotation (eks.amazonaws.com/role-arn)
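
The IRSA check can be done from the terminal by reading the role annotation off the service account. This is a sketch under one assumption: the service account is named `aws-load-balancer-controller`, which may differ in this cluster. `is_role_arn` is a hypothetical helper that only checks the shape of the value, not whether the role exists or has the right policy.

```shell
# Hypothetical helper: succeeds if the argument looks like an IAM role ARN.
is_role_arn() {
  case "$1" in
    arn:aws*:iam::[0-9]*:role/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (service account name is an assumption):
#   arn=$(kubectl get serviceaccount aws-load-balancer-controller -n kube-system \
#         -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
#   is_role_arn "$arn" && echo "IRSA role: $arn" || echo "annotation missing or malformed" >&2
```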

6. Check pod logs

kubectl logs <unhealthy-pod-name> -n kube-system --tail=200

Look for:

AWS API errors (permission denied, throttling)
Webhook certificate issues
Leader election failures
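
A quick way to see which of these categories dominates is to count matching log lines. The patterns below are assumptions based on common AWS error codes and generic TLS and leader-election wording, not the controller's exact messages, so treat the counts as a pointer for where to read more closely. `triage_logs` is a hypothetical helper.

```shell
# Hypothetical helper: reads log lines on stdin and prints one summary line
# with the number of lines matching each category (case-insensitive).
triage_logs() {
  awk '
    { line = tolower($0) }
    line ~ /accessdenied|unauthorizedoperation|not authorized/ { n["iam"]++ }
    line ~ /throttl|requestlimitexceeded|rate exceeded/        { n["throttling"]++ }
    line ~ /x509|certificate/                                  { n["cert"]++ }
    line ~ /leader ?election|lease/                            { n["leader"]++ }
    END {
      printf "iam=%d throttling=%d webhook-cert=%d leader-election=%d\n",
             n["iam"], n["throttling"], n["cert"], n["leader"]
    }'
}

# Usage:
#   kubectl logs <unhealthy-pod-name> -n kube-system --tail=200 | triage_logs
# For a container that has already crashed, add --previous to read the prior run.
```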

7. Remediate

Depending on the root cause:

If CrashLoopBackOff: Check the logs for the root cause, then delete the pod so that the Deployment recreates it:

```
kubectl delete pod <unhealthy-pod-name> -n kube-system
```

If IAM permission errors: Verify the IRSA role and its attached policy in the AWS Console.

If webhook issues: Check the webhook configuration:

kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
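
Both commands list every webhook in the cluster, so it may be easier to narrow the output to the controller's entries. The sketch assumes the controller's webhook configurations contain `load-balancer` in their names; `alb_webhooks` is a hypothetical helper.

```shell
# Hypothetical helper: keeps only resource names that mention "load-balancer".
# The naming convention is an assumption; adjust the pattern if it differs.
alb_webhooks() {
  grep -i 'load-balancer'
}

# Usage:
#   kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o name | alb_webhooks
# Then inspect a match, for example its caBundle and target service:
#   kubectl get <name-from-above> -o yaml
```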

If leader election issues: Restart the controller Deployment so that all pods are replaced and a new leader is elected:

kubectl rollout restart deployment aws-load-balancer-controller -n kube-system
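
After a restart it is worth waiting for the rollout to finish before closing the alert. The following sketch wraps the restart and the wait in one function; `restart_controller` is a hypothetical name and the 120s default timeout is an arbitrary choice.

```shell
# Hypothetical helper: restarts the controller Deployment, then blocks until
# the new pods are ready or the timeout (first argument, default 120s) expires.
restart_controller() {
  kubectl rollout restart deployment aws-load-balancer-controller -n kube-system &&
  kubectl rollout status deployment aws-load-balancer-controller -n kube-system \
    --timeout="${1:-120s}"
}

# Usage:
#   restart_controller 180s
```

A non-zero exit status means the rollout did not complete in time; return to the describe and log steps for the new pods.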
