Runbook: kserve_scaling_capacity

This runbook covers steps to investigate and remediate scaling capacity alerts for KServe inference workloads.

Steps

1. Verify scaling metrics

Open the KServe Inference dashboard and check scaling metrics for the affected models:

https://g-3d216b3ddc.grafana-workspace.us-east-1.amazonaws.com/d/kserve_inference_dashboard/kserve-inference?orgId=1&refresh=30s

2. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

3. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them as environment variables in the terminal session you will use for the remaining steps.
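
As a rough sketch, assuming Escalator issues temporary AWS credentials as an access key, secret key, and session token (the runbook does not say which form they take), the export step might look like the following. The variable names are the standard ones the AWS CLI reads; the check_aws_env helper is hypothetical and only confirms the variables are set before you continue.

```shell
# Hypothetical helper: fail early if any expected AWS credential variable
# is unset or empty in the current shell.
check_aws_env() {
  missing=0
  [ -n "${AWS_ACCESS_KEY_ID:-}" ]     || { echo "missing: AWS_ACCESS_KEY_ID" >&2; missing=1; }
  [ -n "${AWS_SECRET_ACCESS_KEY:-}" ] || { echo "missing: AWS_SECRET_ACCESS_KEY" >&2; missing=1; }
  [ -n "${AWS_SESSION_TOKEN:-}" ]     || { echo "missing: AWS_SESSION_TOKEN" >&2; missing=1; }
  return "$missing"
}

# Usage (placeholders; paste the real values from Escalator):
#   export AWS_ACCESS_KEY_ID="<access-key-id>"
#   export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
#   export AWS_SESSION_TOKEN="<session-token>"
#   check_aws_env && aws sts get-caller-identity
```

Running aws sts get-caller-identity afterwards prints the identity the credentials resolve to, which is a quick way to confirm the admin role is active before touching the cluster.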

4. Get EKS cluster credentials

Update your local kubeconfig so that kubectl targets the cluster:

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
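
Before changing anything, it can help to confirm that kubectl is pointed at the intended cluster. The sketch below assumes the default behavior of update-kubeconfig, which names the context after the cluster ARN (no --alias given), so it matches on the cluster name as a suffix. The confirm_context helper is hypothetical.

```shell
# Hypothetical helper: confirm the active kubectl context ends with the
# expected cluster name. Assumes the context is named after the cluster ARN,
# e.g. arn:aws:eks:<region>:<account>:cluster/<name>.
confirm_context() {
  expected="$1"
  current="$(kubectl config current-context)" || return 1
  case "$current" in
    *"$expected") echo "ok: $current" ;;
    *) echo "unexpected context: $current" >&2; return 1 ;;
  esac
}

# Usage:
#   confirm_context cell2-MultitenantEKSCluster && kubectl get nodes
```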

5. Scale up the affected models

Update the KEDA ScaledObject for each affected model to raise its minimum and/or maximum replica count:

# List KEDA scaled objects to find the one for the affected model
kubectl get scaledobjects -A

# Edit the scaled object to increase min/max replicas
kubectl edit scaledobject <scaled-object-name> -n <namespace>

In the editor, increase spec.minReplicaCount and/or spec.maxReplicaCount as needed to handle the load, then save and exit to apply the change.
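
As a non-interactive alternative to kubectl edit, the same change can be sketched as a JSON merge patch. The replica_patch helper and the replica counts shown are illustrative assumptions, not values prescribed by this runbook; pick numbers that fit the observed load and the cluster's available capacity.

```shell
# Hypothetical helper: build a JSON merge patch that sets the replica bounds
# on a KEDA ScaledObject. Arguments: minimum replicas, maximum replicas.
replica_patch() {
  if ! [ "$1" -le "$2" ] 2>/dev/null; then
    echo "min replicas must be an integer no greater than max replicas" >&2
    return 1
  fi
  printf '{"spec":{"minReplicaCount":%d,"maxReplicaCount":%d}}' "$1" "$2"
}

# Usage (name, namespace, and counts are placeholders):
#   kubectl patch scaledobject <scaled-object-name> -n <namespace> \
#     --type merge -p "$(replica_patch 2 10)"
#
# Confirm the new bounds and check the HPA that KEDA manages:
#   kubectl get scaledobject <scaled-object-name> -n <namespace> \
#     -o jsonpath='{.spec.minReplicaCount}{" "}{.spec.maxReplicaCount}{"\n"}'
#   kubectl get hpa -n <namespace>
```

A merge patch is used here because strategic merge patches are not supported for custom resources such as ScaledObject.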