Runbook: keda_operator_unhealthy
This runbook covers how to investigate and remediate the KEDA Operator having no running pods. The alert fires when the running pod count stays below 1 for more than 5 minutes.
Impact
Kubernetes Event-Driven Autoscaling (KEDA) is impaired while the operator is down. ScaledObjects and ScaledJobs are not reconciled, so workloads do not scale in or out based on external metrics.
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
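A minimal sketch of the export, assuming Escalator issues temporary AWS credentials as the standard access key, secret key, and session token triple. The placeholder values and the helper name show_aws_identity are hypothetical; use whatever Escalator actually displays.

```shell
# Assumption: Escalator issues temporary AWS credentials as this standard triple.
# The values below are placeholders; paste the real ones from Escalator.
export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
export AWS_SESSION_TOKEN="<session-token>"

# Print the ARN the AWS CLI resolves to, to confirm the admin role is active.
show_aws_identity() {
  aws sts get-caller-identity --query Arn --output text
}
```

Running show_aws_identity before the next step helps catch expired or mistyped credentials early.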
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
4. Check KEDA Operator pod status
kubectl get pods -n keda -l app=keda-operator
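To compare against the alert condition directly, the running pods can be counted. This is a sketch only: the helper name count_running_keda_pods is hypothetical, and the Running phase may not match exactly how the alert counts pods (a pod in CrashLoopBackOff can still report phase Running).

```shell
# Count KEDA operator pods whose phase is Running.
# The alert condition described in this runbook is a count below 1.
count_running_keda_pods() {
  kubectl get pods -n keda -l app=keda-operator \
    --field-selector=status.phase=Running --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}

# Usage:
# running=$(count_running_keda_pods)
# [ "$running" -lt 1 ] && echo "No running KEDA operator pods"
```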
5. Describe the pod to identify issues
kubectl describe pod <pod-name> -n keda
Look for:
- Events and container states: OOMKilled, CrashLoopBackOff, ImagePullBackOff, FailedScheduling
- Leader election issues
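The same information can be pulled out without reading the full describe output. The sketch below uses two hypothetical helpers that take the pod name from step 4 as their argument: one prints the reason the previous container terminated (for example OOMKilled), the other lists the pod's events in time order.

```shell
# Print the termination reason of the previous container instance(s), if any.
last_termination_reason() {
  kubectl get pod "$1" -n keda \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# List events for one pod, oldest first.
pod_events() {
  kubectl get events -n keda \
    --field-selector "involvedObject.name=$1" --sort-by=.lastTimestamp
}

# Usage (POD_NAME is a placeholder for a pod name from step 4):
# last_termination_reason "$POD_NAME"
# pod_events "$POD_NAME"
```

An empty result from last_termination_reason means the container has not been restarted after a termination.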
6. Check pod logs
kubectl logs <pod-name> -n keda --tail=200
If the pod is in CrashLoopBackOff, get the logs of the previous (crashed) container:
kubectl logs <pod-name> -n keda --previous --tail=200
Look for:
- Metrics server connection errors
- Prometheus connectivity issues
- CRD or webhook errors
- Leader election failures
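To narrow 200 lines of log output down to the categories above, the logs can be piped through a keyword filter. This is a rough sketch: the helper name filter_keda_log_lines is hypothetical, and the keyword list is a guess at useful search terms, not KEDA's actual log wording, so adjust it as needed.

```shell
# Keep only log lines that mention one of the categories listed above.
# The keywords are heuristic search terms, not exact KEDA log messages.
filter_keda_log_lines() {
  grep -iE 'leader election|leaderelection|webhook|customresourcedefinition|crd|prometheus|metrics'
}

# Usage (POD_NAME is a placeholder for a pod name from step 4):
# kubectl logs "$POD_NAME" -n keda --tail=200 | filter_keda_log_lines
```

No output only means none of the keywords matched; it does not mean the logs are clean.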
7. Remediate
Depending on the root cause:
- If CrashLoopBackOff: Check logs for the root cause, then delete the pod so the deployment recreates it:
kubectl delete pod <pod-name> -n keda
- If OOMKilled: The pod may need increased memory limits (Terraform/Helm change).
- If CRD issues: Verify KEDA CRDs are installed:
kubectl get crd | grep keda
- If the deployment is scaled to 0 or missing:
kubectl get deployment -n keda
kubectl rollout restart deployment keda-operator -n keda
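Note that a rollout restart does not create pods while the deployment's replica count is 0. The sketch below checks the replica count and scales up if needed. The helper name ensure_keda_operator_scaled and the target of 1 replica are assumptions; if the deployment is managed by Terraform or Helm, the lasting fix belongs there, since a manual scale may be reverted.

```shell
# Scale keda-operator back up if its desired replica count is 0.
# Returns non-zero if the deployment cannot be read (for example, it is missing).
ensure_keda_operator_scaled() {
  replicas=$(kubectl get deployment keda-operator -n keda \
    -o jsonpath='{.spec.replicas}') || return 1
  if [ "$replicas" = "0" ]; then
    # Assumption: 1 replica is the intended count.
    kubectl scale deployment keda-operator -n keda --replicas=1
  else
    echo "keda-operator spec.replicas=$replicas, no scaling needed"
  fi
}

# Usage:
# ensure_keda_operator_scaled || echo "keda-operator deployment not found; redeploy via Terraform/Helm"
```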