Runbook: kserve_node_disk_util

This runbook describes how to investigate and remediate high disk utilization alerts on KServe GPU nodes. The alert fires when root filesystem usage exceeds 90% on nodes labeled workload=kserve-gpu.

Steps

1. Verify disk utilization metrics

Open the KServe Inference dashboard and check node disk utilization:

https://g-3d216b3ddc.grafana-workspace.us-east-1.amazonaws.com/d/kserve_inference_dashboard/kserve-inference?orgId=1&refresh=30s

2. Get read-only credentials for prod-cell-1

https://d-9067a2ad56.awsapps.com/start/#/?tab=accounts

3. Identify nodes with high disk utilization

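List the kserve-gpu nodes with kubectl and match them against the nodes flagged on the dashboard. If kubectl is not yet configured for the cluster, run the update-kubeconfig command from step 6 first: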

kubectl get nodes -l workload=kserve-gpu -o wide
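
The listing above shows every kserve-gpu node but not which one is short on disk. As a rough cross-check against the dashboard, the sketch below prints each node's DiskPressure condition. The function name disk_pressure_by_node is hypothetical. DiskPressure follows the kubelet's eviction thresholds, which may not line up with the 90% alert threshold, so a node can trigger the alert while still reporting False here.

```shell
# Hypothetical helper: show the DiskPressure condition for each kserve-gpu node.
# Assumes kubectl is already pointed at the right cluster.
disk_pressure_by_node() {
  kubectl get nodes -l workload=kserve-gpu \
    -o custom-columns='NAME:.metadata.name,DISK_PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status'
}
```

Running disk_pressure_by_node prints one row per node; a value of True marks a node the kubelet already considers low on disk.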

4. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

5. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.

6. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
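
Before making changes it can help to confirm which identity and cluster the following steps will act on. This is a minimal sketch, assuming the Escalator credentials were exported as the standard AWS environment variables; the function name verify_access is hypothetical.

```shell
# Hypothetical helper: confirm the active AWS identity, kubectl context,
# and whether the current user may modify nodes.
verify_access() {
  # IAM identity behind the exported credentials
  aws sts get-caller-identity --query Arn --output text
  # Context selected by the update-kubeconfig command
  kubectl config current-context
  # Prints "yes" or "no"; cordoning a node requires permission to patch nodes
  kubectl auth can-i patch nodes
}
```

If the last line prints "no", the admin credentials from step 5 are probably not active in this terminal.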

7. Investigate disk usage on the affected node

SSH into the affected node, or exec into a pod running on it, to inspect disk usage:

# Find pods running on the affected node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>

# Check disk usage from within a pod on that node
kubectl exec -it <pod-name> -n <namespace> -- df -h
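
On a busy node the df output can be long. The sketch below filters it down to the filesystems at or above a threshold, defaulting to 90 to match the alert. The helper name over_threshold is hypothetical, and it assumes POSIX-format output from df -P with mount points that contain no spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above the given percentage (default 90).
over_threshold() {
  awk -v limit="${1:-90}" '
    NR > 1 {                      # skip the header row
      use = $5
      sub(/%/, "", use)           # "95%" -> "95"
      if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

One possible usage, dropping -it because the output is piped: kubectl exec <pod-name> -n <namespace> -- df -P | over_threshold 90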

8. Remediate

Depending on the root cause:

If old container images or layers are consuming space, cordon and drain the node so it can be replaced:
```
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
If model artifacts are filling the disk, identify and remove unused model data, or consider moving the workload to nodes with larger root volumes.
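
For the cordon-and-drain path, the sketch below wraps the two commands and then lists what is still scheduled on the node. The function name drain_and_check and the 300s timeout are assumptions, not part of the original procedure. The replacement itself depends on how the node group is managed and is outside this sketch.

```shell
# Hypothetical helper: cordon and drain a node, then list what is still scheduled on it.
drain_and_check() {
  node="$1"
  kubectl cordon "$node" || return 1
  # --timeout is an assumption; it stops a stuck eviction from blocking indefinitely
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1
  # After a successful drain, only DaemonSet-managed pods should remain
  kubectl get pods -A -o wide --field-selector spec.nodeName="$node"
}
```

Call it as drain_and_check <node-name>. If pods other than DaemonSet pods still appear in the final listing, the drain did not fully complete.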
