Runbook: keda_operator_unhealthy

This runbook covers how to investigate and remediate the keda_operator_unhealthy alert. The alert fires when the KEDA Operator has had fewer than 1 running pod for more than 5 minutes.

Impact

Kubernetes Event-Driven Autoscaling (KEDA) is degraded while the operator has no running pods. ScaledObjects and ScaledJobs are not reconciled, so workloads do not scale in or out based on external metrics.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
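
A minimal sketch of this step, assuming Escalator hands out temporary AWS credentials as the standard AWS environment variables (this runbook does not state the exact format). The values are placeholders, and `require_aws_env` is a hypothetical helper that only reports which variables are still unset.

```shell
# Paste the values copied from Escalator (placeholders shown, assumed format):
#   export AWS_ACCESS_KEY_ID="<access-key-id>"
#   export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
#   export AWS_SESSION_TOKEN="<session-token>"

# Hypothetical helper: returns non-zero and names any variable that is unset or empty.
require_aws_env() {
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: confirm the variables are set, then check which identity AWS sees.
#   require_aws_env && aws sts get-caller-identity
```

If `aws sts get-caller-identity` returns an unexpected role or an expired-token error, request access through Escalator again before continuing.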

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
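
Before running any kubectl command it can help to confirm that the active context points at the intended cluster. The sketch below assumes the default context name written by `aws eks update-kubeconfig` (the cluster ARN, when no `--alias` is given); `context_matches_cluster` is a hypothetical helper, not part of kubectl.

```shell
# Hypothetical helper: succeeds only if the active kubectl context names the expected cluster.
context_matches_cluster() {
  expected="$1"
  current="$(kubectl config current-context)" || return 1
  case "$current" in
    *"$expected"*) echo "ok: $current" ;;
    *) echo "unexpected context: $current" >&2; return 1 ;;
  esac
}

# Usage:
#   context_matches_cluster cell2-MultitenantEKSCluster
```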

4. Check KEDA Operator pod status

kubectl get pods -n keda -l app=keda-operator
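
If the pod list is empty, comparing desired and ready replicas on the Deployment shows whether pods are failing or were never requested. This is a sketch that assumes the Deployment is named `keda-operator`, as used in the remediation step; `operator_replicas` is a hypothetical helper.

```shell
# Hypothetical helper: prints "desired=<n> ready=<n>" for the keda-operator Deployment.
# .status.readyReplicas is omitted by the API when no pod is ready, so it defaults to 0 here.
operator_replicas() {
  desired="$(kubectl get deployment keda-operator -n keda -o jsonpath='{.spec.replicas}')" || return 1
  ready="$(kubectl get deployment keda-operator -n keda -o jsonpath='{.status.readyReplicas}')" || return 1
  echo "desired=${desired:-0} ready=${ready:-0}"
}

# Usage:
#   operator_replicas
```

A result such as `desired=1 ready=0` points at a failing pod (continue with the describe and logs steps), while `desired=0` points at a deployment that has been scaled down.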

5. Describe the pod to identify issues

kubectl describe pod <pod-name> -n keda

Look for:

Container state (State / Last State) and Events: OOMKilled, CrashLoopBackOff, ImagePullBackOff, FailedScheduling
Leader election issues
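
The same signals can be pulled out directly instead of scanning the full describe output. A minimal sketch using standard kubectl JSONPath and event field selectors; `pod_failure_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: summarises why the containers in a pod are not running.
pod_failure_summary() {
  pod="$1"
  # Reason the previous container instance terminated, e.g. OOMKilled or Error.
  last="$(kubectl get pod "$pod" -n keda -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}')" || return 1
  # Reason the current container is waiting, e.g. CrashLoopBackOff or ImagePullBackOff.
  waiting="$(kubectl get pod "$pod" -n keda -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}')" || return 1
  echo "last_terminated=${last:-none} waiting=${waiting:-none}"
  # Events for this pod sorted by time; FailedScheduling shows up here.
  kubectl get events -n keda --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
}

# Usage:
#   pod_failure_summary <pod-name>
```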

6. Check pod logs

kubectl logs <pod-name> -n keda --tail=200

If the pod is in CrashLoopBackOff:

kubectl logs <pod-name> -n keda --previous --tail=200

Look for:

Metrics server connection errors
Prometheus connectivity issues
CRD or webhook errors
Leader election failures
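
To narrow 200 lines of logs down to the categories above, the output can be piped through a keyword filter. The patterns below are illustrative keywords derived from that list, not verbatim KEDA log messages, and `filter_operator_logs` is a hypothetical helper; widen or tighten the expression as needed.

```shell
# Hypothetical helper: keeps only log lines that mention the failure categories listed above.
# Reads log text on stdin; matching is case-insensitive.
filter_operator_logs() {
  grep -iE 'leader ?election|lease|webhook|customresourcedefinition|no matches for kind|prometheus|metrics|connection refused|timeout'
}

# Usage:
#   kubectl logs <pod-name> -n keda --tail=200 | filter_operator_logs
#   kubectl logs <pod-name> -n keda --previous --tail=200 | filter_operator_logs
```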

7. Remediate

Depending on the root cause:

If CrashLoopBackOff: Check logs for root cause, then restart:
```
kubectl delete pod <pod-name> -n keda
```
If OOMKilled: The pod may need increased memory limits (Terraform/Helm change).
If CRD issues: Verify KEDA CRDs are installed:
```
kubectl get crd | grep keda
```
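
Two of the checks above can be made more specific. The sketch below reports which core KEDA CRDs are absent and prints the operator's current resource requests and limits. The CRD names are the core ones shipped with KEDA 2.x; treat the list as an assumption and adjust it to the installed version. Both helper names are hypothetical.

```shell
# Hypothetical helper: prints the name of each expected KEDA CRD that is not installed.
# Prints nothing when all of them are present.
missing_keda_crds() {
  installed="$(kubectl get crd -o name)" || return 1
  for crd in scaledobjects.keda.sh scaledjobs.keda.sh \
             triggerauthentications.keda.sh clustertriggerauthentications.keda.sh; do
    case "$installed" in
      *"/$crd"*) ;;            # found, nothing to report
      *) echo "$crd" ;;        # not found
    esac
  done
}

# Hypothetical helper: prints the resources block (requests/limits) of the operator containers.
operator_resources() {
  kubectl get deployment keda-operator -n keda \
    -o jsonpath='{.spec.template.spec.containers[*].resources}'
}

# Usage:
#   missing_keda_crds
#   operator_resources
```

Any name printed by `missing_keda_crds` has to be restored through the KEDA installation itself; the memory limit shown by `operator_resources` is the value to raise in the Terraform/Helm change for the OOMKilled case.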

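If the deployment is scaled to 0 or missing, check its state first:

kubectl get deployment -n keda

If it exists with 0 replicas, scale it back up; a rollout restart does not change the replica count (use a higher value if the managed configuration specifies one):

kubectl scale deployment keda-operator -n keda --replicas=1

If it exists with replicas but the pods are stuck, restart the rollout:

kubectl rollout restart deployment keda-operator -n keda

If it is missing, redeploy it through the Terraform/Helm configuration that manages KEDA.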
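
A rollout restart on its own does not raise the replica count of a deployment scaled to 0, so the decision can be sketched as below. It assumes a target of 1 replica (the alert threshold) and that a manual change will not be reverted immediately by Terraform/Helm; `restore_operator` is a hypothetical helper.

```shell
# Hypothetical helper: scales up or restarts keda-operator, then waits for it to become ready.
restore_operator() {
  replicas="$(kubectl get deployment keda-operator -n keda -o jsonpath='{.spec.replicas}')" || {
    echo "deployment keda-operator not found in namespace keda; redeploy it via Terraform/Helm" >&2
    return 1
  }
  if [ "${replicas:-0}" -eq 0 ]; then
    # Scaled to zero: a restart would not create pods, so set the replica count.
    kubectl scale deployment keda-operator -n keda --replicas=1 || return 1
  else
    # Replicas requested but unhealthy: recreate the pods.
    kubectl rollout restart deployment keda-operator -n keda || return 1
  fi
  # Block until the rollout completes or the timeout expires.
  kubectl rollout status deployment keda-operator -n keda --timeout=120s
}

# Usage:
#   restore_operator
```

If `rollout status` times out, return to the describe and logs steps for the newly created pod.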
