Runbook: autoscaling_controller_unhealthy

This runbook describes how to investigate and remediate the autoscaling_controller_unhealthy alert. The alert fires when the Autoscaling Controller has had fewer than 1 running pod for more than 5 minutes.

Impact

The autoscaling controller manages scaling decisions and persists state. While it has no running pods, KEDA event tracking and DynamoDB updates may be impacted, and autoscaling may not function correctly.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
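
As a minimal sketch, assuming Escalator issues standard temporary AWS credentials as the three environment variables below (an assumption; this runbook does not name them), the following checks that they are exported in the current shell. `check_aws_env` is a hypothetical helper name.

```shell
# Hypothetical helper: report which of the assumed AWS credential variables
# are unset or empty in the current shell. Returns non-zero if any is missing.
check_aws_env() {
  missing=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    eval "val=\${$var:-}"   # indirect variable lookup that works in POSIX sh
    if [ -z "$val" ]; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_aws_env && aws sts get-caller-identity` then shows which identity the exported credentials resolve to.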

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
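
Before running any further commands, it can help to confirm that kubectl now points at the intended cluster. The sketch below assumes the context name contains the cluster name, which is the case when `aws eks update-kubeconfig` uses its default naming (the cluster ARN). `context_matches` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the active kubectl context mentions
# the expected cluster name.
context_matches() {
  current=$(kubectl config current-context) || return 1
  case "$current" in
    *"$1"*) echo "ok: $current" ;;
    *) echo "unexpected context: $current"; return 1 ;;
  esac
}
```

For example: `context_matches cell2-MultitenantEKSCluster`.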

4. Check Autoscaling Controller pod status

kubectl get pods -A -l app=autoscaling-controller

If the label doesn't match, search more broadly:

kubectl get pods -A | grep autoscaling-controller
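
To compare the cluster state against the alert condition (fewer than 1 running pod), the pods can be counted by phase. This is a sketch only: it assumes the `app=autoscaling-controller` label from the command above is correct, and `running_pod_count` is a hypothetical helper name.

```shell
# Hypothetical helper: count pods with the given label whose phase is Running,
# across all namespaces. Prints 0 when nothing matches.
running_pod_count() {
  kubectl get pods -A -l "$1" \
    --field-selector=status.phase=Running \
    --no-headers 2>/dev/null | wc -l | tr -d ' '
}
```

For example: `running_pod_count app=autoscaling-controller`. Note that a pod phase of Running does not imply the pod is Ready; a crash-looping pod can still report Running between restarts, so check the READY and RESTARTS columns as well.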

5. Describe the pod to identify issues

kubectl describe pod <pod-name> -n <namespace>

Look for:

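State, Last State, and Events: OOMKilled, CrashLoopBackOff, ImagePullBackOff, FailedScheduling
DynamoDB connectivity: Check whether the pod can reach DynamoDB. The describe output does not show this directly, so confirm it in the pod logs.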
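
When the describe output is long, the same signals can be pulled out directly. The sketch below uses kubectl's JSONPath output and event field selectors; the helper names `last_termination_reason` and `pod_events` are hypothetical.

```shell
# Hypothetical helper: print why each container in the pod last terminated
# (for example OOMKilled or Error). Prints nothing if no container has restarted.
last_termination_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Hypothetical helper: list the events recorded for one pod, sorted by last timestamp.
pod_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

Both take the pod name first and the namespace second, for example `last_termination_reason <pod-name> <namespace>`.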

6. Check pod logs

kubectl logs <pod-name> -n <namespace> --tail=200

If the pod is in CrashLoopBackOff:

kubectl logs <pod-name> -n <namespace> --previous --tail=200

Look for:

DynamoDB errors (permission denied, table not found)
KEDA integration errors
Configuration or startup failures
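
To narrow a long log down to the categories above, the output can be filtered by keyword. This is a rough sketch: the keyword list is an assumption about what the controller's logs contain, not a confirmed log format, and `scan_logs` is a hypothetical helper name.

```shell
# Hypothetical helper: keep only log lines (read from stdin) that match the
# assumed error keywords, case-insensitively. Exit status is grep's:
# 0 if at least one line matched, 1 if none did.
scan_logs() {
  grep -iE 'dynamodb|keda|accessdenied|not authorized|resourcenotfound|panic|fatal'
}
```

For example: `kubectl logs <pod-name> -n <namespace> --tail=200 | scan_logs`. An empty result only means none of the assumed keywords appeared, so read the unfiltered log before ruling anything out.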

7. Remediate

Depending on the root cause:

If CrashLoopBackOff: Check the logs for the root cause, then delete the pod so that its Deployment recreates it:
```
kubectl delete pod <pod-name> -n <namespace>
```
If OOMKilled: The pod may need increased memory limits (Terraform/Helm change).
If IAM permission errors: Verify the IRSA role and DynamoDB policies.
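
For the IAM case, a first check is which IAM role the pod actually runs as. With IRSA, the role is set by the `eks.amazonaws.com/role-arn` annotation on the pod's service account. The sketch below reads both values; the helper names are hypothetical, and the DynamoDB policies attached to that role still have to be reviewed in IAM.

```shell
# Hypothetical helper: print the service account a pod runs as.
pod_service_account() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.serviceAccountName}'
}

# Hypothetical helper: print the IAM role ARN annotated on a service account.
# Prints nothing if the IRSA annotation is absent.
irsa_role_arn() {
  kubectl get serviceaccount "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}
```

For example: `irsa_role_arn "$(pod_service_account <pod-name> <namespace>)" <namespace>`.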

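If the deployment is scaled to 0, find it and scale it back to 1 replica. If the first command returns nothing, the deployment is missing and must be redeployed rather than scaled: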

kubectl get deployment -A | grep autoscaling-controller
kubectl scale deployment <deployment-name> -n <namespace> --replicas=1
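
After scaling, it is worth waiting for the rollout to finish before treating the incident as resolved. A minimal sketch, with `scale_up_and_wait` as a hypothetical helper name and 120s as an arbitrary default timeout:

```shell
# Hypothetical helper: scale a deployment to one replica, then block until the
# rollout completes or the timeout (default 120s) expires.
scale_up_and_wait() {
  kubectl scale deployment "$1" -n "$2" --replicas=1 &&
    kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-120s}"
}
```

For example: `scale_up_and_wait <deployment-name> <namespace>`. A non-zero exit means the scale command failed or the pod did not become available in time, in which case return to steps 5 and 6.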
