Runbook: autoscaling_controller_exceptions
This runbook covers how to investigate and remediate exceptions raised by the Autoscaling Controller. The alert fires when the exception count stays above 0 for more than 5 minutes.
Impact
KEDA event tracking and DynamoDB updates may be degraded. While the controller is still running, exceptions indicate that some operations are failing, which could lead to incorrect scaling decisions or stale state.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
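Before moving on, it can help to confirm that the credentials actually landed in the current shell. The sketch below assumes Escalator hands out temporary AWS credentials in the standard AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN variables. That is an assumption, not something this runbook states, and check_aws_env is a hypothetical helper name.

```shell
# Assumption: Escalator provides temporary AWS credentials that are exported
# as the standard AWS environment variables, for example:
#   export AWS_ACCESS_KEY_ID="<access-key-id>"
#   export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
#   export AWS_SESSION_TOKEN="<session-token>"

# Hypothetical helper: returns 0 if all three variables are set and non-empty,
# otherwise names the missing ones on stderr and returns 1.
check_aws_env() {
  missing=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

If check_aws_env passes, running aws sts get-caller-identity shows which identity the credentials resolve to.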
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
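After updating the kubeconfig, a quick check that kubectl now points at the intended cluster avoids running later commands against the wrong one. This is a minimal sketch. confirm_cluster is a hypothetical helper, and it relies on aws eks update-kubeconfig naming the context after the cluster ARN, which is the default when no --alias is passed.

```shell
# Hypothetical helper: succeeds only if the current kubectl context name
# contains the expected cluster name.
confirm_cluster() {
  expected="$1"
  current="$(kubectl config current-context)"
  case "$current" in
    *"$expected"*)
      echo "ok: $current"
      ;;
    *)
      echo "unexpected context: $current" >&2
      return 1
      ;;
  esac
}
```

Usage: confirm_cluster cell2-MultitenantEKSCluster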
4. Check Autoscaling Controller pod status
kubectl get pods -A | grep autoscaling-controller
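Verify that the pod's STATUS is Running and that the READY column shows all containers ready (for example, 1/1). Note the pod name and namespace; later steps use them as <pod-name> and <namespace>.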
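The status check can also be scripted so that the namespace, pod name, and a health verdict come out on one line. The sketch below assumes the default column order of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS). controller_pod_status is a hypothetical helper name.

```shell
# Hypothetical helper: prints "<namespace> <pod-name> healthy|unhealthy" for
# every pod whose line matches "autoscaling-controller".
# A pod counts as healthy when STATUS is Running and READY is n/n.
controller_pod_status() {
  kubectl get pods -A --no-headers | awk '
    /autoscaling-controller/ {
      split($3, ready, "/")
      state = ($4 == "Running" && ready[1] == ready[2]) ? "healthy" : "unhealthy"
      print $1, $2, state
    }'
}
```

The first two fields of each output line are the values to substitute for <namespace> and <pod-name> in later commands.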
5. Check pod logs for exceptions
kubectl logs <pod-name> -n <namespace> --tail=500
Look for:
- Exception stack traces
- DynamoDB errors (throttling, conditional check failures)
- KEDA API errors
- Timeout or connectivity errors
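Scanning 500 lines by eye is slow, so a filter over the log output can surface the categories listed above. The patterns below are guesses at what the controller logs, since its log format is not documented here. ProvisionedThroughputExceededException, ThrottlingException, and ConditionalCheckFailedException are real DynamoDB error names and are matched by the generic patterns. filter_controller_errors is a hypothetical helper name.

```shell
# Hypothetical helper: reads log lines on stdin and keeps those that look like
# exceptions, DynamoDB throttling or conditional check failures, timeouts, or
# connection problems. Patterns are assumptions; adjust to the real log format.
filter_controller_errors() {
  grep -iE 'exception|traceback|throttl|conditionalcheckfailed|timed? ?out|connection (refused|reset)'
}
```

Usage: kubectl logs <pod-name> -n <namespace> --tail=500 | filter_controller_errors

If the container has restarted, adding --previous to kubectl logs shows the logs of the prior container instance.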
6. Check DynamoDB table health
If the exceptions are related to DynamoDB, check the table in the AWS Console:
- Look for throttled read/write requests
- Verify table capacity settings
- Check for any ongoing table operations
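The same checks can be approximated from the terminal with the AWS CLI. This is a sketch under assumptions: the table name is not given in this runbook, so it is passed in as an argument, and the CLI is assumed to have a default region configured (otherwise add --region). table_summary and table_throttle_events are hypothetical helper names.

```shell
# Hypothetical helper: shows table status (reveals ongoing operations such as
# UPDATING), billing mode, and provisioned capacity.
# BillingMode may print as None for tables that have always been provisioned.
table_summary() {
  aws dynamodb describe-table --table-name "$1" \
    --query 'Table.{Status:TableStatus,BillingMode:BillingModeSummary.BillingMode,ReadCapacity:ProvisionedThroughput.ReadCapacityUnits,WriteCapacity:ProvisionedThroughput.WriteCapacityUnits}' \
    --output table
}

# Hypothetical helper: sums a throttle metric in 5-minute buckets.
# $1 table name, $2 ReadThrottleEvents or WriteThrottleEvents,
# $3 and $4 ISO 8601 start and end times (for example 2024-01-01T00:00:00Z).
table_throttle_events() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB \
    --metric-name "$2" \
    --dimensions "Name=TableName,Value=$1" \
    --start-time "$3" --end-time "$4" \
    --period 300 --statistics Sum \
    --query 'Datapoints[].Sum' --output text
}
```

An empty result from table_throttle_events means CloudWatch recorded no throttle events for that metric in the window.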
7. Remediate
Depending on the root cause:
- If DynamoDB throttling: Consider increasing table capacity or switching to on-demand mode.
- If transient errors: The controller should recover automatically. Monitor to see if exceptions stop.
- If persistent errors: Restart the controller by deleting its pod, which is expected to be recreated automatically:
kubectl delete pod <pod-name> -n <namespace>
- If code/configuration issue: Check recent deployments for changes that may have introduced the issue.
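Two follow-up checks for the remediation paths above can be sketched as shell helpers. Both rest on assumptions: that the controller is managed by a Kubernetes Deployment (its name is not given here, so it is an argument), and that exceptions appear in the logs with the words "exception" or "traceback". The helper names are hypothetical. The 5-minute window mirrors the alert condition.

```shell
# Hypothetical helper: lists recent rollout revisions for the controller.
# Assumption: the controller runs as a Deployment. $1 namespace, $2 deployment name.
show_controller_rollouts() {
  kubectl rollout history "deployment/$2" -n "$1"
}

# Hypothetical helper: counts log lines from the last 5 minutes that look like
# exceptions. $1 pod name, $2 namespace. Prints 0 when nothing matches.
count_recent_exceptions() {
  kubectl logs "$1" -n "$2" --since=5m | grep -ciE 'exception|traceback' || true
}
```

After a restart, use the new pod's name with count_recent_exceptions; a count that stays at 0 across a few checks suggests the exceptions have stopped.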