Runbook: keda_scaler_errors

This runbook covers steps to investigate and remediate KEDA scaler errors. The alert fires when the scaler error count is greater than 0 for more than 5 minutes.

Impact

Autoscaling metrics collection may be impacted for the affected ScaledObject. The workload may not scale correctly in response to load changes if the scaler cannot fetch metrics.

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
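Before moving on, it can help to confirm that the exported credentials are actually visible to the AWS CLI. The sketch below assumes Escalator issues temporary AWS credentials through the standard AWS CLI environment variables, which this runbook does not state. `check_aws_credentials` is a hypothetical helper name.

```shell
# Assumption: credentials are exported as the standard AWS CLI environment variables.
check_aws_credentials() {
  local var
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    # printenv only sees exported variables, which is what the AWS CLI needs.
    if [ -z "$(printenv "$var")" ]; then
      echo "missing or not exported: $var" >&2
      return 1
    fi
  done
  # Ask AWS which identity these credentials resolve to.
  aws sts get-caller-identity --query Arn --output text
}
```

Running `check_aws_credentials` should print the ARN of the admin identity. A non-zero exit means a variable is missing or the credentials were rejected.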

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
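To avoid running the following steps against the wrong cluster, the active context can be checked after the kubeconfig update. This is a minimal sketch. It relies on the default behaviour of `aws eks update-kubeconfig`, which names the context after the cluster ARN, so a substring match on the cluster name is enough. `confirm_context` is a hypothetical helper name.

```shell
# Succeeds only if the active kubectl context contains the expected cluster name.
confirm_context() {
  local expected="$1" current
  current="$(kubectl config current-context)" || return 1
  case "$current" in
    *"$expected"*) echo "ok: $current" ;;
    *) echo "unexpected context: $current" >&2; return 1 ;;
  esac
}
```

Usage: `confirm_context cell2-MultitenantEKSCluster`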

4. Identify the affected ScaledObject

The alert labels include the scaledObject and namespace. Check the ScaledObject status:

kubectl get scaledobject <scaled-object-name> -n <namespace> -o yaml
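The full YAML can be long. KEDA normally records conditions such as Ready, Active, Fallback and Paused under `.status.conditions`, so a condensed view is often enough to see whether the ScaledObject is healthy. The sketch below prints one line per condition and exits non-zero unless Ready is True. `scaledobject_ready` is a hypothetical helper name, and the exact set of conditions may vary by KEDA version.

```shell
# Print each status condition of a ScaledObject and fail unless Ready=True.
# Arguments: <scaled-object-name> <namespace>
scaledobject_ready() {
  local conditions
  conditions="$(kubectl get scaledobject "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} reason={.reason} message={.message}{"\n"}{end}')" || return 1
  printf '%s\n' "$conditions"
  # The exit status of grep becomes the exit status of the function.
  printf '%s\n' "$conditions" | grep -q '^Ready=True'
}
```

Usage: `scaledobject_ready <scaled-object-name> <namespace>`. The `message` field of a failing condition usually points at the trigger that could not be evaluated.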

5. Check KEDA operator logs for scaler errors

kubectl logs -n keda -l app=keda-operator --tail=300 | grep -iE "scaler.*error|<scaled-object-name>"

Look for:

  • Prometheus query errors (connection refused, query syntax)
  • Metric not found errors
  • Authentication/authorization errors against the metrics source
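When the log output is noisy, a rough count per error category can show which of the causes above dominates. The sketch below reads log lines on stdin and counts matches. The substrings it looks for are assumptions about typical error wording, not a list of actual KEDA log messages, so treat the counts as a hint and read the matching lines before acting. `summarize_scaler_errors` is a hypothetical helper name.

```shell
# Count log lines per suspected error category (heuristic substring matching).
summarize_scaler_errors() {
  awk '
    { line = tolower($0) }
    line ~ /connection refused|no such host|timeout/       { connectivity++ }
    line ~ /parse error|bad_data/                          { query++ }
    line ~ /not found|no data/                             { metric++ }
    line ~ /unauthorized|forbidden|permission denied/      { auth++ }
    END {
      printf "connectivity=%d query=%d metric_not_found=%d auth=%d\n",
             connectivity, query, metric, auth
    }
  '
}
```

Usage: `kubectl logs -n keda -l app=keda-operator --tail=300 | summarize_scaler_errors`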

6. Verify the metrics source

If the scaler uses Prometheus, verify that the Prometheus endpoint is reachable and that the query returns data. Start by confirming that the Prometheus pods are running:

# Check that the Prometheus pods are Running and Ready
kubectl get pods -n prometheus
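Pod status alone does not show that the scaler's query returns data. One way to check is to run the query against the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes Prometheus is reachable from your terminal, for example through a port-forward. The service name and port in the comment are placeholders, not values taken from this environment, and `prom_query_has_data` is a hypothetical helper name.

```shell
# Example port-forward (service name and port are placeholders):
#   kubectl port-forward -n prometheus svc/<prometheus-service> 9090:<service-port> &

# Run an instant query and fail if the request errors or the result set is empty.
# Arguments: <prometheus-base-url> <promql-query>
prom_query_has_data() {
  local response
  response="$(curl -fsS -G "$1/api/v1/query" --data-urlencode "query=$2")" || return 1
  printf '%s\n' "$response"
  case "$response" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$response" in
    *'"result":[]'*) return 1 ;;
  esac
  return 0
}
```

Usage: `prom_query_has_data http://localhost:9090 '<query-from-the-trigger>'`. An empty result with a successful status suggests the metric is missing or the label selectors no longer match anything.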

7. Check the ScaledObject trigger configuration

kubectl get scaledobject <scaled-object-name> -n <namespace> -o jsonpath='{.spec.triggers}' | jq .

Verify that:

  • The metrics server address is correct
  • The query/metric name is valid
  • Authentication credentials are present and valid
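For the credentials check, a trigger normally points at its credentials through `authenticationRef.name`. The sketch below lists the names referenced by the ScaledObject and confirms that each one exists as a TriggerAuthentication in the same namespace. It assumes namespaced TriggerAuthentication objects. A trigger that uses a ClusterTriggerAuthentication would be reported as missing and needs a manual check. `check_trigger_auth` is a hypothetical helper name.

```shell
# Verify that every TriggerAuthentication referenced by a ScaledObject exists.
# Arguments: <scaled-object-name> <namespace>
check_trigger_auth() {
  local name="$1" ns="$2" refs ref
  refs="$(kubectl get scaledobject "$name" -n "$ns" \
    -o jsonpath='{range .spec.triggers[*]}{.authenticationRef.name}{"\n"}{end}')" || return 1
  # Triggers without an authenticationRef produce empty lines, which are skipped here.
  for ref in $refs; do
    if kubectl get triggerauthentication "$ref" -n "$ns" >/dev/null 2>&1; then
      echo "found: TriggerAuthentication/$ref"
    else
      echo "missing: TriggerAuthentication/$ref" >&2
      return 1
    fi
  done
}
```

Usage: `check_trigger_auth <scaled-object-name> <namespace>`. This only confirms that the object exists. Whether the secret it references is still valid has to be checked against the metrics source.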

8. Remediate

Depending on the root cause:

  • If Prometheus is down: Refer to the prometheus_server_unhealthy or prometheus_autoscaling_unhealthy runbook.
  • If the query is invalid: Fix the ScaledObject trigger query.
  • If the connectivity issue is transient: The scaler should recover automatically. Monitor until the errors stop.
  • If the errors persist: Restart the KEDA operator:
    kubectl rollout restart deployment keda-operator -n keda
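After a restart it is worth waiting for the rollout to finish before judging whether the errors have cleared. The sketch below chains the restart with `kubectl rollout status`. The 120 second timeout is an arbitrary choice, not a value from this runbook, and `restart_keda_operator` is a hypothetical helper name.

```shell
# Restart the KEDA operator and block until the new pods are rolled out.
restart_keda_operator() {
  kubectl rollout restart deployment keda-operator -n keda &&
    kubectl rollout status deployment keda-operator -n keda --timeout=120s
}
```

Once `restart_keda_operator` returns successfully, repeat step 5 to confirm that no new scaler errors appear in the operator logs.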