Runbook: autoscaling_controller_unhealthy
This runbook covers how to investigate and remediate an Autoscaling Controller that has no running pods. The alert fires when the running pod count stays below 1 for more than 5 minutes.
Impact
The autoscaling controller manages scaling decisions and persists state. While it is down, KEDA event tracking and DynamoDB updates may be impacted, and autoscaling may not function correctly.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
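Before moving on, it can help to confirm the credentials actually landed in the current shell. The sketch below assumes Escalator hands out temporary AWS credentials as the three standard AWS environment variables; the runbook does not state the exact form, so treat that as an assumption. The helper name check_aws_env is hypothetical.

```shell
# Hypothetical helper: confirm the standard AWS credential variables are set.
# Assumption: Escalator provides temporary credentials as these three variables.
check_aws_env() {
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    # Indirect lookup via eval keeps this POSIX sh compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  # Returns 0 when all three are set, 1 otherwise.
  return "$missing"
}
```

If the check passes, running aws sts get-caller-identity should then show the admin role you requested.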
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
4. Check Autoscaling Controller pod status
kubectl get pods -A -l app=autoscaling-controller
If the label doesn't match, search more broadly:
kubectl get pods -A | grep autoscaling-controller
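To compare what you see against the alert condition (running pod count below 1), the pod listing can be reduced to a single number. This is a minimal sketch, not part of the controller tooling: the function name count_running_controller_pods is hypothetical, and it assumes pod names contain autoscaling-controller, as the grep above does.

```shell
# Hypothetical helper: count Running autoscaling-controller pods.
# Reads `kubectl get pods -A --no-headers` output on stdin, where the
# columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
count_running_controller_pods() {
  awk '$2 ~ /autoscaling-controller/ && $4 == "Running" { n++ } END { print n + 0 }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | count_running_controller_pods
```

A result of 0 matches the condition that triggers this alert.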
5. Describe the pod to identify issues
kubectl describe pod <pod-name> -n <namespace>
Look for:
- Events: OOMKilled, CrashLoopBackOff, ImagePullBackOff, FailedScheduling
- DynamoDB connectivity: Check if the pod can reach DynamoDB
6. Check pod logs
kubectl logs <pod-name> -n <namespace> --tail=200
If the pod is in CrashLoopBackOff:
kubectl logs <pod-name> -n <namespace> --previous --tail=200
Look for:
- DynamoDB errors (permission denied, table not found)
- KEDA integration errors
- Configuration or startup failures
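When the log output is long, a rough filter can sort lines into the categories above. This is only a sketch: the function name triage_logs is hypothetical, and the patterns are illustrative guesses (common AWS error names and generic keywords), not the controller's actual log messages.

```shell
# Hypothetical helper: tag log lines read from stdin by likely cause.
# The patterns are illustrative guesses, not the controller's real messages.
triage_logs() {
  awk '
    { line = tolower($0) }
    line ~ /accessdenied|not authorized|permission denied/ { print "iam: " $0; next }
    line ~ /resourcenotfoundexception|table not found/     { print "dynamodb: " $0; next }
    line ~ /keda/                                          { print "keda: " $0; next }
    line ~ /panic|fatal|invalid config/                    { print "startup: " $0; next }
  '
}

# Usage (requires cluster access):
#   kubectl logs <pod-name> -n <namespace> --tail=200 | triage_logs
```

Lines that match none of the patterns are dropped, so an empty result means only that nothing obvious was found; read the raw logs as well.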
7. Remediate
Depending on the root cause:
- If CrashLoopBackOff: Check the logs for the root cause, then restart the pod by deleting it:
kubectl delete pod <pod-name> -n <namespace>
- If OOMKilled: The pod may need increased memory limits (Terraform/Helm change).
- If IAM permission errors: Verify the IRSA role and DynamoDB policies.
- If the deployment is missing or scaled to 0:
kubectl get deployment -A | grep autoscaling-controller
kubectl scale deployment <deployment-name> -n <namespace> --replicas=1
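The remediation choices above can be summarized as a small lookup, which may be handy in a shell profile during an incident. This is a sketch only: the function name suggest_remediation and the cause labels NoDeployment and ScaledToZero are hypothetical, and the suggestions simply restate this runbook.

```shell
# Hypothetical helper: map an observed cause to the runbook's suggested action.
# Cause labels other than the Kubernetes reasons are made up for illustration.
suggest_remediation() {
  case "$1" in
    CrashLoopBackOff)          echo "check logs for root cause, then delete the pod" ;;
    OOMKilled)                 echo "increase memory limits (Terraform/Helm change)" ;;
    AccessDenied*)             echo "verify the IRSA role and DynamoDB policies" ;;
    NoDeployment|ScaledToZero) echo "find the deployment and scale it to 1 replica" ;;
    *)                         echo "unknown cause: review describe output and logs" ;;
  esac
}

# Usage:
#   suggest_remediation OOMKilled
```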