Runbook: cf_metrics_collector_dlq
This runbook covers how to investigate and resolve failures of the Cloudflare Metrics Collector Lambda that cause messages to land in the DLQ.
| Field | Details |
|---|---|
| Owner | Dataplane |
| Required Access | Access to AWS and Cloudflare |
| Requires 2PR | Yes (for manual invocation/retry) |
| Metrics Dashboard | CloudflareMetricsCollectorLambda Monitoring |
Overview
The CloudflareMetricsCollector Lambda is triggered by an EventBridge schedule every minute. It calls Cloudflare's Analytics API to collect HTTP, Worker, and KV metrics, converts them, and publishes them to CloudWatch Metrics. Each minute, the Lambda makes 3 GraphQL Analytics API calls, 1 Cloudflare REST API call, and 4 CloudWatch API calls to put metrics.
If there are too many failed invocations, a message is pushed to the linked SQS DLQ. The alert monitors the DLQ's NumberOfMessagesSent metric over 5-minute periods.
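To look at the same signal outside the console, the sketch below builds a CloudWatch request for NumberOfMessagesSent in 5-minute buckets. It is a minimal example, not the alarm's actual definition: the queue name is a placeholder you must supply, and the 30-minute lookback (`periods=6`) is an assumption.

```python
from datetime import datetime, timedelta, timezone


def dlq_sent_metric_request(queue_name, end, periods=6):
    """Build kwargs for CloudWatch get_metric_statistics covering the
    last `periods` 5-minute buckets that end at `end`."""
    period_seconds = 300  # matches the 5-minute period the alert uses
    return {
        "Namespace": "AWS/SQS",
        "MetricName": "NumberOfMessagesSent",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "StartTime": end - timedelta(seconds=period_seconds * periods),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],
    }


def fetch_dlq_sent_counts(queue_name, end=None):
    """Return (timestamp, messages sent) per bucket, oldest first.
    Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the request builder works offline

    end = end or datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_statistics(
        **dlq_sent_metric_request(queue_name, end)
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], int(p["Sum"])) for p in points]


# "example-dlq" is a hypothetical queue name, used only for illustration.
request = dlq_sent_metric_request(
    "example-dlq", datetime(2025, 5, 13, 5, 0, tzinfo=timezone.utc)
)
print(request["StartTime"].isoformat(), request["Period"])
```

Buckets with no messages are simply absent from the returned datapoints, so an empty list means nothing was sent to the DLQ in the window.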
Steps
- Check CloudWatch Alarm and SQS Queue to determine how many errors have occurred. Also check Lambda Monitoring.
- Go to CloudWatch LogGroup and select the LogStream. Run a query to find the error in the logs.
- Check the Cloudflare Status page to see if there are any ongoing issues.
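The log search step can also be run as a CloudWatch Logs Insights query. The sketch below is one possible starting point, not the team's standard query: the filter pattern, the 30-minute lookback, and the log group name are all assumptions to adjust.

```python
from datetime import datetime, timedelta, timezone

# Logs Insights query: newest 50 log lines that look like failures.
# The pattern is a guess at common failure text; adjust it to the Lambda's logs.
ERROR_QUERY = (
    "fields @timestamp, @logStream, @message\n"
    "| filter @message like /(?i)(error|exception|timed out)/\n"
    "| sort @timestamp desc\n"
    "| limit 50"
)


def error_query_request(log_group, end, lookback_minutes=30):
    """Build kwargs for the CloudWatch Logs start_query call.
    startTime and endTime are epoch seconds."""
    start = end - timedelta(minutes=lookback_minutes)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),
        "endTime": int(end.timestamp()),
        "queryString": ERROR_QUERY,
    }


def run_error_query(log_group, end=None):
    """Run the query and poll until it finishes.
    Needs boto3 and AWS credentials, so it is not called here."""
    import time
    import boto3  # imported lazily so the request builder works offline

    logs = boto3.client("logs")
    end = end or datetime.now(timezone.utc)
    query_id = logs.start_query(**error_query_request(log_group, end))["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] not in ("Scheduled", "Running"):
            return result["results"]
        time.sleep(1)


# "/aws/lambda/example-collector" is a hypothetical log group name.
request = error_query_request(
    "/aws/lambda/example-collector",
    datetime(2025, 5, 13, 5, 0, tzinfo=timezone.utc),
)
print(request["queryString"])
```

The query string can be pasted directly into the Logs Insights console if you prefer not to run it from a script.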
Remediation
The remediation depends on the cause of the error:
- Cloudflare API Limits
  - The Cloudflare Analytics API has a default quota of 300 GraphQL queries per 5-minute window (Cloudflare docs: Limits).
  - The Cloudflare API allows 1,200 API calls per 5 minutes (Cloudflare docs: Rate limits).
  - At 3 GraphQL calls and 1 REST call per minute, this Lambda makes about 15 GraphQL queries and 5 REST calls per 5-minute window, so it should not exceed these limits on its own.
- AWS API Throttling
  - If the issue is CloudWatch Put Metrics API throttling, retry the failed attempts (see Manual Invocation - Retry).
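The arithmetic behind the Cloudflare limits can be checked with a short calculation. This sketch uses only the per-minute call counts from the Overview and the quotas listed above. It assumes the scheduled Lambda is the only consumer of those quotas; manual retries or other clients sharing the same credentials would add to the totals.

```python
# Calls the Lambda makes each minute (from the Overview).
CALLS_PER_MINUTE = {"graphql": 3, "rest": 1}
# Cloudflare quotas per 5-minute window (from the limits listed above).
LIMIT_PER_WINDOW = {"graphql": 300, "rest": 1200}
WINDOW_MINUTES = 5


def window_usage(calls_per_minute, window_minutes=WINDOW_MINUTES):
    """Calls made per quota window at a steady one-invocation-per-minute rate."""
    return {api: n * window_minutes for api, n in calls_per_minute.items()}


def headroom(calls_per_minute, limits):
    """Remaining calls per window before each quota is reached."""
    usage = window_usage(calls_per_minute)
    return {api: limits[api] - usage[api] for api in limits}


print(window_usage(CALLS_PER_MINUTE))                 # {'graphql': 15, 'rest': 5}
print(headroom(CALLS_PER_MINUTE, LIMIT_PER_WINDOW))   # {'graphql': 285, 'rest': 1195}
```

If the logs do show Cloudflare rate-limit errors, the gap between these numbers and the quotas suggests looking for another consumer of the same quota rather than at the scheduled invocations.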
Manual Invocation - Retry
Under 2PR, you can manually invoke the Lambda to reprocess metrics for a specified time frame. Navigate to the Lambda Testing tab and specify the JSON below. The example fetches metrics for 04:32 through 04:34 UTC: start_time is inclusive, and end_time (04:35) appears to be exclusive:
{
  "source": "manual",
  "start_time": "2025-05-13T04:32:00.000Z",
  "end_time": "2025-05-13T04:35:00.000Z"
}
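When retrying several windows, generating the event avoids off-by-one mistakes in the timestamps. The helper below is a hypothetical sketch, not part of the Lambda: it assumes start_time is inclusive and end_time is exclusive, which is how the example above reads, and it reproduces that example for a 3-minute window.

```python
import json
from datetime import datetime, timedelta, timezone


def format_utc(moment):
    """Format a UTC datetime like the example: 2025-05-13T04:32:00.000Z."""
    millis = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def build_manual_event(start, minutes):
    """Build a manual-invocation event covering `minutes` whole minutes
    beginning at `start`. Assumes end_time is exclusive.
    Pass a timezone-aware datetime; a naive one is treated as local time."""
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    start = start.astimezone(timezone.utc).replace(second=0, microsecond=0)
    end = start + timedelta(minutes=minutes)
    return {
        "source": "manual",
        "start_time": format_utc(start),
        "end_time": format_utc(end),
    }


# Reprocess 04:32, 04:33 and 04:34 UTC on 2025-05-13.
event = build_manual_event(
    datetime(2025, 5, 13, 4, 32, tzinfo=timezone.utc), minutes=3
)
print(json.dumps(event, indent=2))
```

Paste the printed JSON into the Lambda Testing tab as described above. The 2PR requirement still applies to the invocation itself.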