Runbook: fluent_bit_unhealthy_pods
This runbook describes how to investigate and remediate unhealthy Fluent Bit pods. The alert fires when the unhealthy pod count stays above 0 for more than 5 minutes.
Impact
Log collection and forwarding may be impacted. Logs from pods running on nodes with unhealthy Fluent Bit instances will not be shipped to the log destination (e.g., CloudWatch, S3).
Steps
1. Get admin permissions via Escalator
Request admin access through Escalator:
https://escalator.marqo-staging.com/
2. Copy admin credentials to local terminal
Copy the admin credentials from Escalator and export them in your terminal.
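As a quick sanity check before moving on, the sketch below confirms that the credentials are present in the current shell. It assumes Escalator hands out temporary credentials as the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN); if it uses a different format, adjust the list. The function name check_aws_env is hypothetical.

```shell
# Assumption: credentials are exported as the standard AWS environment variables.
# Prints "ok" when all three are set; otherwise names the first missing one.
check_aws_env() {
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing: $var" >&2
      return 1
    fi
  done
  echo "ok"
}
```

If the check passes, running aws sts get-caller-identity should show the admin role you expect.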
3. Get EKS cluster credentials
aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster
4. Check Fluent Bit pod status
kubectl get pods -n logging -l app.kubernetes.io/name=fluent-bit
If no pods appear in the logging namespace, search all namespaces:
kubectl get pods -A -l app.kubernetes.io/name=fluent-bit
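On a large cluster the pod list can be long. The following is a minimal sketch of a filter that keeps only the rows that look unhealthy (status other than Running, or fewer ready containers than expected). The function name filter_unhealthy is hypothetical, and it expects the single-namespace table layout (NAME, READY, STATUS, ...), not the -A layout, which adds a leading NAMESPACE column.

```shell
# Reads `kubectl get pods -n <namespace>` table output on stdin and prints
# NAME, READY and STATUS for pods that are not Running or not fully ready.
filter_unhealthy() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage:
#   kubectl get pods -n logging -l app.kubernetes.io/name=fluent-bit | filter_unhealthy
```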
5. Describe unhealthy pods
kubectl describe pod <unhealthy-pod-name> -n <namespace>
Look for:
- Events: OOMKilled, CrashLoopBackOff
- Volume mount issues: Fluent Bit needs access to /var/log and container log directories
- Node issues: Check if the DaemonSet pod's node is healthy
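The describe output is verbose, so it can help to pull out just the two facts the checks above rely on: why the container last terminated, and which node the pod runs on. The helpers below are a sketch using kubectl's jsonpath output; the function names are hypothetical.

```shell
# Print the reason the pod's containers last terminated (for example OOMKilled or Error).
# Prints nothing if no container has terminated yet.
last_termination_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Print the name of the node the pod is scheduled on.
pod_node() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# Usage:
#   last_termination_reason <unhealthy-pod-name> <namespace>
#   kubectl describe node "$(pod_node <unhealthy-pod-name> <namespace>)"
```

In the node description, the Conditions section (Ready, MemoryPressure, DiskPressure) is usually the first place to look.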
6. Check pod logs
kubectl logs <unhealthy-pod-name> -n <namespace> --tail=200
Look for:
- Output plugin errors (CloudWatch, S3 permission denied)
- Buffer overflow or backpressure issues
- Parser errors
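To narrow 200 lines of output down to the likely culprits, a simple filter on Fluent Bit's log-level tags can help. This sketch assumes the default Fluent Bit log format, where each line carries a bracketed level such as [error] or [ warn]; the function name fluent_bit_problems is hypothetical.

```shell
# Keep only warning and error lines from Fluent Bit log output read on stdin.
# Assumes the default format, e.g. "[2024/01/01 00:00:00] [error] [output:...] ...".
fluent_bit_problems() {
  grep -E '\[(error| warn)\]'
}

# Usage:
#   kubectl logs <unhealthy-pod-name> -n <namespace> --tail=200 | fluent_bit_problems
# For a pod that keeps restarting, read the previous container's logs instead:
#   kubectl logs <unhealthy-pod-name> -n <namespace> --previous --tail=200 | fluent_bit_problems
```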
7. Remediate
Depending on the root cause:
- If OOMKilled: Fluent Bit may need increased memory limits (Terraform/Helm change).
- If CrashLoopBackOff: Check logs for the root cause, then delete the pod so the DaemonSet recreates it:
kubectl delete pod <unhealthy-pod-name> -n <namespace>
- If output plugin errors: Verify IAM permissions for the log destination.
- If buffer issues: Check disk space on the node and Fluent Bit buffer configuration.
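After deleting a pod, it is worth confirming that the replacement actually becomes ready. The sketch below deletes the pod and then waits on the DaemonSet rollout. The default DaemonSet name fluent-bit is an assumption (confirm it with kubectl get daemonset -n <namespace>), the function name restart_and_wait is hypothetical, and kubectl rollout status only works for DaemonSets that use the RollingUpdate strategy.

```shell
# Delete one pod, then wait until the DaemonSet reports all pods available again.
# Arguments: pod name, namespace, optional DaemonSet name.
# Assumption: the DaemonSet is called "fluent-bit" unless a third argument is given.
restart_and_wait() {
  pod="$1"
  ns="$2"
  ds="${3:-fluent-bit}"
  kubectl delete pod "$pod" -n "$ns" &&
    kubectl rollout status "daemonset/$ds" -n "$ns" --timeout=120s
}

# Usage:
#   restart_and_wait <unhealthy-pod-name> <namespace>
```

If the wait times out, the pod is most likely crashing again, and the describe and log steps above should be repeated on the new pod.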