Runbook: autoscaling_controller_exceptions

This runbook covers how to investigate and remediate exceptions raised by the Autoscaling Controller. The alert fires when the exception count stays above 0 for more than 5 minutes.

Impact

KEDA event tracking and DynamoDB updates may be degraded. Although the controller is still running, the exceptions indicate that some operations are failing, which can lead to incorrect scaling decisions or stale state.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
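
Before moving on, it can help to confirm the credentials actually landed in the current shell. The sketch below assumes Escalator issues temporary AWS credentials that are exported under the standard AWS CLI environment variable names. The helper name check_aws_env is illustrative and not part of any existing tooling.

```shell
# Hypothetical helper: report which standard AWS credential variables are unset.
# Assumption: Escalator hands out temporary credentials (access key, secret key, session token).
check_aws_env() {
  missing=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    # Indirect lookup of the variable whose name is stored in $var (POSIX-compatible).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible follow-up is `check_aws_env && aws sts get-caller-identity`, which asks AWS which identity the exported credentials resolve to.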

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
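
To avoid running the later kubectl commands against the wrong cluster, the active context can be checked first. This is a minimal sketch: the helper name context_matches_cluster is hypothetical, and it relies on the default behaviour of `aws eks update-kubeconfig`, which names the context after the cluster ARN.

```shell
# Cluster name taken from the update-kubeconfig command in this step.
EXPECTED_CLUSTER="cell2-MultitenantEKSCluster"

# Hypothetical helper: succeed only if the active kubectl context points at the expected cluster.
# By default the context name is the cluster ARN, which ends in "cluster/<cluster-name>";
# a context created with a custom --alias would not pass this check.
context_matches_cluster() {
  ctx=$(kubectl config current-context) || return 1
  case "$ctx" in
    *"cluster/$EXPECTED_CLUSTER")
      echo "ok: $ctx"
      ;;
    *)
      echo "unexpected context: $ctx"
      return 1
      ;;
  esac
}
```

Calling `context_matches_cluster` after the update-kubeconfig command prints the context and returns non-zero if it does not match.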

4. Check Autoscaling Controller pod status

kubectl get pods -A | grep autoscaling-controller

Verify that the pod's STATUS is Running and that the READY column shows all containers ready (for example, 1/1).
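
The same check can be scripted so that an unhealthy pod stands out. The sketch below is an assumption-laden convenience, not existing tooling: the helper name check_controller_pods is made up, and it assumes the default column order of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS).

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and flag
# autoscaling-controller pods that are not Running or not fully Ready.
check_controller_pods() {
  awk '
    $2 ~ /autoscaling-controller/ {
      found = 1
      split($3, ready, "/")   # READY looks like "1/1": ready containers / total containers
      if ($4 == "Running" && ready[1] == ready[2]) {
        print "healthy: " $1 "/" $2
      } else {
        print "unhealthy: " $1 "/" $2 " (READY " $3 ", STATUS " $4 ")"
        bad = 1
      }
    }
    END {
      if (!found) {
        print "no autoscaling-controller pod found"
        exit 1
      }
      exit (bad ? 1 : 0)
    }
  '
}
```

Usage: `kubectl get pods -A | check_controller_pods`. The namespace and pod name it prints can be reused in the following steps.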

5. Check pod logs for exceptions

kubectl logs <pod-name> -n <namespace> --tail=500

Look for:

Exception stack traces
DynamoDB errors (throttling, conditional check failures)
KEDA API errors
Timeout or connectivity errors
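
With 500 lines of output, a filter can narrow the logs down to the categories listed above. The patterns in this sketch are guesses at typical wording, not the controller's confirmed log format, and the helper name filter_controller_errors is hypothetical.

```shell
# Hypothetical helper: keep only log lines that look like one of the categories above.
# Patterns are assumptions; adjust them to the controller's real log messages.
filter_controller_errors() {
  grep -Ei 'exception|traceback|throttl|conditionalcheckfailed|keda.*(error|fail)|time(d )?out|connection (refused|reset)'
}
```

Usage: `kubectl logs <pod-name> -n <namespace> --tail=500 | filter_controller_errors`. If the pod has restarted, adding `--previous` to `kubectl logs` shows the logs of the previous container instance.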

6. Check DynamoDB table health

If the exceptions are related to DynamoDB, check the table in the AWS Console:

Look for throttled read/write requests
Verify table capacity settings
Check for any ongoing table operations
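
The throttling check can also be done from the terminal via CloudWatch. The following is a sketch under assumptions: the table name is a placeholder argument because this runbook does not name the controller's table, the region is assumed to match the cluster's us-east-1, and the helper name dynamodb_throttle_events is hypothetical.

```shell
# Hypothetical helper: sum throttle events for one table over the last hour, in 5-minute buckets.
# Usage: dynamodb_throttle_events <table-name> [ReadThrottleEvents|WriteThrottleEvents]
dynamodb_throttle_events() {
  table=$1
  metric=${2:-WriteThrottleEvents}
  end=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # Try GNU date first, then fall back to the BSD/macOS form.
  start=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
          date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
  # Region is an assumption: same region as the EKS cluster.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB \
    --metric-name "$metric" \
    --dimensions Name=TableName,Value="$table" \
    --start-time "$start" --end-time "$end" \
    --period 300 --statistics Sum \
    --region us-east-1
}
```

For capacity settings and ongoing operations, `aws dynamodb describe-table --table-name <table-name>` returns the table's TableStatus (for example ACTIVE or UPDATING) along with its billing mode and provisioned throughput.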

7. Remediate

Depending on the root cause:

If DynamoDB throttling: Consider increasing table capacity or switching to on-demand mode.
If transient errors: The controller should recover automatically. Monitor to see if exceptions stop.

If persistent errors: Restart the controller by deleting its pod so that it is recreated:

kubectl delete pod <pod-name> -n <namespace>

If code/configuration issue: Check recent deployments for changes that may have introduced the issue.
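
Two of the remediation paths above map onto single CLI calls. The sketch below assumes the controller is managed by a Kubernetes Deployment, which this runbook does not confirm. The helper names and all arguments are placeholders to substitute with real values.

```shell
# Hypothetical helper: list the revision history of the controller's Deployment.
# Assumption: the controller is a Deployment; adjust the resource kind if it is not.
# Usage: show_recent_rollouts <deployment-name> <namespace>
show_recent_rollouts() {
  deployment=$1
  namespace=$2
  kubectl rollout history "deployment/$deployment" -n "$namespace"
}

# Hypothetical helper: switch a table to on-demand billing so it no longer
# depends on fixed provisioned capacity.
# Usage: switch_table_to_on_demand <table-name>
switch_table_to_on_demand() {
  table=$1
  aws dynamodb update-table --table-name "$table" --billing-mode PAY_PER_REQUEST
}
```

Changing the billing mode affects cost, so it is worth confirming throttling in the table metrics before running the second helper.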
