Runbook: kserve_node_disk_util
This runbook covers how to investigate and remediate high disk utilization alerts on KServe GPU nodes. The alert fires when root filesystem usage exceeds 90% on nodes labeled workload=kserve-gpu.
Steps
1. Verify disk utilization metrics
Open the KServe Inference dashboard and check node disk utilization.
2. Get read-only credentials for prod-cell-1
Log in to IAM Identity Center and copy read-only credentials for the prod-cell-1 account:
https://d-9067a2ad56.awsapps.com/start/#/?tab=accounts
3. Identify nodes with high disk utilization
Use kubectl to list the KServe GPU nodes, then match them against the nodes flagged on the dashboard:
kubectl get nodes -l workload=kserve-gpu -o wide
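The listing above shows every KServe GPU node but not which ones are low on disk. As a rough cross-check against the dashboard, the sketch below filters for nodes that report the DiskPressure condition. The function name nodes_with_disk_pressure is illustrative, and the sketch assumes kubectl already points at the right cluster. The kubelet sets DiskPressure from its own eviction thresholds, which may differ from the 90% alert threshold, so an empty result does not clear a node.

```shell
# List nodes matching a label selector whose DiskPressure condition is True.
# Assumption: kubectl is already configured for the target cluster.
nodes_with_disk_pressure() {
  selector="${1:-workload=kserve-gpu}"
  # Emit "<node-name><TAB><DiskPressure status>" per node, keep only "True".
  kubectl get nodes -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' |
    awk -F '\t' '$2 == "True" { print $1 }'
}
```

Run nodes_with_disk_pressure with no arguments to use the workload=kserve-gpu selector, or pass a different selector as the first argument.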
4. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
5. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
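A minimal sketch of this step, assuming Escalator hands out temporary credentials in the form of the standard AWS environment variables (access key, secret key, session token). The placeholder values and the helper name aws_env_ready are illustrative and not part of Escalator.

```shell
# Paste the real values from Escalator in place of the placeholders:
#   export AWS_ACCESS_KEY_ID="<access-key-id>"
#   export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
#   export AWS_SESSION_TOKEN="<session-token>"

# Return non-zero and name the first credential variable that is unset or empty.
aws_env_ready() {
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      return 1
    fi
  done
}
```

Once the variables are exported, aws_env_ready && aws sts get-caller-identity confirms which identity the terminal is using before any changes are made.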
6. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
7. Investigate disk usage on the affected node
SSH into the affected node, or exec into a pod running on it, to inspect disk usage:
# Find pods running on the affected node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>
# Check disk usage from within a pod on that node
kubectl exec -it <pod-name> -n <namespace> -- df -h
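To pick out the full filesystems from the df output without reading it by eye, a small filter along these lines may help. It is a sketch: the function name mounts_over_threshold is illustrative, it expects df -P (POSIX format, one filesystem per line) on standard input, and it assumes mount points contain no spaces.

```shell
# Print "<mount point> <use%>" for filesystems whose usage exceeds a threshold.
# Reads `df -P` output on stdin; the threshold defaults to 90 (percent).
mounts_over_threshold() {
  threshold="${1:-90}"
  awk -v limit="$threshold" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)            # strip the percent sign from the Capacity column
    if (pct + 0 > limit + 0) print $6, $5
  }'
}
```

For example, kubectl exec <pod-name> -n <namespace> -- df -P | mounts_over_threshold 90 prints only the mounts above 90%. Drop the -it flags when piping, since no terminal is needed.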
8. Remediate
Depending on the root cause:
- If old container images or layers are consuming space, cordon the node and drain it so it can be replaced:
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
- If model artifacts are filling the disk, identify and clean up unused model data, or consider moving to nodes with larger root volumes.
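To tell which of the two causes applies, it can help to see which directories hold the most data. The sketch below ranks the immediate subdirectories of a path by size. It assumes a shell that can see the node's filesystem (on the node itself, or in a pod that mounts the relevant path); the function name largest_dirs and any path passed to it are illustrative, since this runbook does not state where model artifacts are stored.

```shell
# Print the largest immediate subdirectories of a directory, sizes in KiB.
# -x stays on one filesystem, -d 1 limits the depth to direct children.
# The directory's own total is included in the ranking.
largest_dirs() {
  dir="${1:-/}"
  count="${2:-10}"
  du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, largest_dirs / 10 lists the ten largest top-level entries on the root filesystem. Re-run it on the largest entry to drill down one level at a time.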