Runbook: fluent_bit_unhealthy_pods

This runbook covers how to investigate and remediate unhealthy Fluent Bit pods. The alert fires when the unhealthy pod count stays above 0 for more than 5 minutes.

Impact

Log collection and forwarding may be impacted. Logs from pods running on nodes with unhealthy Fluent Bit instances will not be shipped to the log destination (e.g., CloudWatch, S3).

Steps

1. Get admin permissions via Escalator

Request admin access through Escalator:

https://escalator.marqo-staging.com/

2. Copy admin credentials to local terminal

Copy the admin credentials from Escalator and export them in your terminal.
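
Before moving on, it can help to confirm the export actually took effect in the current shell. The sketch below assumes Escalator hands out standard AWS temporary credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN); that is an assumption, and check_aws_env is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: report which of the standard AWS credential
# variables are missing from the environment.
# Assumption: Escalator provides temporary credentials in this form.
check_aws_env() {
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

If check_aws_env returns successfully, running aws sts get-caller-identity should show the admin role you requested.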

3. Get EKS cluster credentials

aws eks update-kubeconfig --region us-east-1 --name cell2-MultitenantEKSCluster

4. Check Fluent Bit pod status

kubectl get pods -n logging -l app.kubernetes.io/name=fluent-bit

If no pods appear in the logging namespace, search all namespaces:

kubectl get pods -A -l app.kubernetes.io/name=fluent-bit
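
On clusters with many nodes, the pod list is long and the unhealthy entries are easy to miss. A minimal sketch of a filter follows, assuming the default column layout of kubectl get pods for a single namespace (NAME, READY, STATUS, ...); filter_unhealthy is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print only pods that are not Running or not fully ready.
# Assumes single-namespace output, where the columns start NAME READY STATUS.
# Example:
#   kubectl get pods -n logging -l app.kubernetes.io/name=fluent-bit \
#     --no-headers | filter_unhealthy
filter_unhealthy() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}
```

With -A the output gains a leading NAMESPACE column, so the field numbers in the awk script would each need to shift by one.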

5. Describe unhealthy pods

kubectl describe pod <unhealthy-pod-name> -n <namespace>

Look for:

Container state and events: OOMKilled (shown as the termination reason under Last State), CrashLoopBackOff
Volume mount issues: Fluent Bit needs access to /var/log and the container log directories
Node issues: Check whether the node hosting the DaemonSet pod is healthy
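
The describe output is verbose, so it can be quicker to pull out just the node name and the last termination reason. The sketch below uses kubectl's jsonpath output for that; pod_last_state is a hypothetical helper name, and the pod name and namespace are whatever step 4 returned.

```shell
# Hypothetical helper: print the node a pod runs on and the reason its
# container last terminated (for example OOMKilled or Error).
# Usage: pod_last_state <unhealthy-pod-name> <namespace>
pod_last_state() {
  pod="$1"
  ns="$2"
  node=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')
  reason=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}')
  echo "node=${node} last_terminated_reason=${reason:-none}"
}
```

The node name can then be passed to kubectl describe node to check for conditions such as memory or disk pressure.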

6. Check pod logs

kubectl logs <unhealthy-pod-name> -n <namespace> --tail=200

Look for:

Output plugin errors (CloudWatch, S3 permission denied)
Buffer overflow or backpressure issues
Parser errors
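
For a pod that keeps restarting, the current container may have no useful output yet; kubectl logs accepts --previous to show the prior container instance. The sketch below narrows the output to warnings and errors. It assumes Fluent Bit's bracketed level tags (such as [error] and [ warn]) appear in the log lines, which may differ with a custom log format; fluent_bit_problems is a hypothetical helper name.

```shell
# Hypothetical helper: keep only warning and error lines from Fluent Bit
# log output read on stdin.
# Assumption: levels are printed as bracketed tags like [error] or [ warn].
# Example:
#   kubectl logs <unhealthy-pod-name> -n <namespace> --previous --tail=200 \
#     | fluent_bit_problems
fluent_bit_problems() {
  grep -E '\[ ?(warn|error)\]'
}
```

An empty result does not prove the pod is healthy; if nothing matches, read the unfiltered logs.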

7. Remediate

Depending on the root cause:

If OOMKilled: Fluent Bit may need increased memory limits (Terraform/Helm change).
If CrashLoopBackOff: Check logs for the root cause, then delete the pod so the DaemonSet recreates it:
```
kubectl delete pod <unhealthy-pod-name> -n <namespace>
```
If output plugin errors: Verify IAM permissions for the log destination.
If buffer issues: Check disk space on the node and Fluent Bit buffer configuration.
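
After remediating, confirm that the DaemonSet has recovered, since the alert clears only once no pods are unhealthy. A minimal sketch follows, comparing the ready and desired pod counts from the DaemonSet status. The DaemonSet name is an assumption (look it up with kubectl get daemonset -n <namespace>), and daemonset_ready is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only when every scheduled DaemonSet pod is ready.
# Usage: daemonset_ready <daemonset-name> <namespace>
daemonset_ready() {
  status=$(kubectl get daemonset "$1" -n "$2" \
    -o jsonpath='{.status.numberReady}/{.status.desiredNumberScheduled}')
  echo "ready/desired: ${status}"
  # Compare the part before the slash with the part after it.
  [ "${status%/*}" = "${status#*/}" ]
}
```

If the counts do not converge within a few minutes, repeat steps 4 to 6 for the pods that are still unhealthy.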
