Cloudflare logpush → Parquet pipeline: per-environment cutover

The Parquet conversion pipeline (PR #1179, DP-1755) moves the logpush destination bucket into the multitenant_eks_cluster module and converts worker trace events to Parquet (S3-event Lambda + 12-hourly Glue compaction). Each environment cuts over independently by following the steps below in order; apart from the pre-apply step, they run after the terraform apply.

The buckets are env singletons (marqo-<env>-cf-logpush-logs, marqo-<env>-cf-parquet-logs): at most one stack per env may have enable_cloudflare_log_analytics on (currently prod2 for production; prod1 stays off).
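
The singleton constraint above can be checked mechanically before an apply. The following is a minimal sketch, not part of the pipeline: the function names and the idea of passing the per-stack flag values as a plain dict are assumptions for illustration, and the env name "prod" is a placeholder for whatever the real env identifier is.

```python
from typing import Dict, List


def bucket_names(env: str) -> Dict[str, str]:
    """Derive the two env-singleton bucket names from the naming pattern."""
    return {
        "logpush": f"marqo-{env}-cf-logpush-logs",
        "parquet": f"marqo-{env}-cf-parquet-logs",
    }


def check_singleton(env: str, stacks: Dict[str, bool]) -> List[str]:
    """Return the stacks with enable_cloudflare_log_analytics on.

    `stacks` maps stack name -> flag value (hypothetical input shape).
    Raises if more than one stack in the env would claim the buckets.
    """
    enabled = sorted(name for name, on in stacks.items() if on)
    if len(enabled) > 1:
        raise ValueError(
            f"{env}: {enabled} all enable cloudflare log analytics, but "
            f"{bucket_names(env)['logpush']} can belong to only one stack"
        )
    return enabled


if __name__ == "__main__":
    # Mirrors the current production layout: prod2 on, prod1 off.
    print(check_singleton("prod", {"prod1": False, "prod2": True}))
```

Turning the flag on for a second stack in the same env makes check_singleton raise instead of letting two stacks contend for the same bucket names.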

Steps

  1. Pre-apply: wire the upsert-workflow tfvars. If the upsert workflow's OIDC role lives in a different AWS account from this env's data plane, set cloudflare_logpush_upsert_role_arns in the env tfvars before the initial apply. The OwnershipChallengeRead statement on the logpush bucket policy is gated on this var, and the workflow's challenge-file read in step 3 will hit AccessDenied until that policy is in place. Same-account OIDC needs no tfvar change.

  2. Repoint the upsert workflow. Set the GitHub environment variable CLOUDFLARE_LOGPUSH_BUCKET_NAME to the new bucket (marqo-<env>-cf-logpush-logs).

    2a. Load-test the converter Lambda. Before relying on the max_upload_bytes: 200 MB batching params in prod, push one synthetic .gz of ~200 MB uncompressed NDJSON (matching the real logpush schema) into the logpush bucket under a test job prefix, then check the converter Lambda's CloudWatch REPORT line for Max Memory Used and confirm headroom under the 2 GB allocation before the repoint in step 3. Dev testing only exercised tiny files; the production batching params have not been load-validated.

  3. Repoint logpush. Re-run the cloudflare-upsert-log-push-job workflow for each job in cloudflare_logpush_job_names. This also applies the upload-batching params (300 s / 200 MB) to existing jobs.

    Observability gap until step 5: from this moment, new logs flow to the new bucket, but the JSON table (and the dashboard's default Live JSON datasource) still reads the legacy bucket and silently stops receiving new rows. The Parquet toggle is the fresh view in this window; switch to it for any live investigation, and keep the step 3 → step 5 window short.

  4. Optional history migration. Grant a data-plane role read access on the legacy bucket (CDK change in cloud_control_plane_cdk cloudflare_logs_stack.py), run aws s3 sync per job prefix into the new bucket, then run the Glue job with --day <yyyymmdd> for each copied day (~$20–40 one-off). If you skip this step, the legacy bucket's remaining history simply ages out.

  5. Follow the JSON table over. Remove cloudflare_logpush_bucket_name from the env tfvars and re-apply. The JSON table and Grafana read access move to the new bucket, closing the observability gap opened in step 3.

  6. Retire the legacy stack. Delete the CDK CloudflareLogsStack once the legacy bucket has drained (it is RemovalPolicy.RETAIN, so the bucket survives the stack and is emptied/deleted manually).

  7. Alerting before the default flip. Add Grafana alerts on the converter Lambda (Errors > 0, sustained Throttles; the function raises on any failed conversion, so the native metric is sufficient) and on Glue compaction job-run failures. Otherwise a broken converter after the flip in step 8 degrades silently into a stale dashboard.

  8. Flip the dashboard default to Parquet. Change the datasource template variable's default in infra/multitenant_eks_cluster/grafana-dashboards/cloudflare_worker_logs.json (and its help text) from Live JSON to Parquet.
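
For the converter Lambda load test described in the steps above, the headroom check can be scripted against the REPORT line copied from CloudWatch. This is a rough sketch: it relies on the standard Lambda REPORT fields "Memory Size" and "Max Memory Used", while the function names and the 80% threshold are assumptions chosen for illustration, not values taken from the pipeline.

```python
import re

# Matches the two memory fields of a Lambda REPORT log line
# (fields are tab-separated in CloudWatch; \s+ covers tabs and spaces).
REPORT_RE = re.compile(
    r"Memory Size: (?P<size>\d+) MB\s+Max Memory Used: (?P<used>\d+) MB"
)


def memory_headroom(report_line: str) -> dict:
    """Extract allocated and peak memory (MB) from one REPORT line."""
    match = REPORT_RE.search(report_line)
    if match is None:
        raise ValueError("not a Lambda REPORT line")
    size = int(match["size"])
    used = int(match["used"])
    return {
        "size_mb": size,
        "used_mb": used,
        "headroom_mb": size - used,
        "used_fraction": used / size,
    }


def has_headroom(report_line: str, max_fraction: float = 0.8) -> bool:
    """True if peak memory stayed at or below max_fraction of the allocation.

    The 0.8 default is an arbitrary illustrative threshold.
    """
    return memory_headroom(report_line)["used_fraction"] <= max_fraction


if __name__ == "__main__":
    # Illustrative line with made-up numbers, not a real measurement.
    line = (
        "REPORT RequestId: example\tDuration: 1000.00 ms\t"
        "Billed Duration: 1000 ms\tMemory Size: 2048 MB\t"
        "Max Memory Used: 1024 MB\t"
    )
    print(memory_headroom(line), has_headroom(line))
```

A load-test run that reports Max Memory Used close to Memory Size is the signal to lower the batching size or raise the allocation before repointing production jobs.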

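If the optional history migration is carried out, the per-prefix sync commands and per-day Glue arguments can be generated up front and reviewed before anything runs. This sketch only builds the plan and executes nothing. The legacy bucket name, the job names, and the assumption that each job's objects live under a top-level `<job>/` prefix are all hypothetical; check them against the real bucket layout.

```python
from datetime import date, timedelta
from typing import Dict, Iterator, List


def days_between(first: date, last: date) -> Iterator[str]:
    """Yield every day from first to last (inclusive) as yyyymmdd."""
    current = first
    while current <= last:
        yield current.strftime("%Y%m%d")
        current += timedelta(days=1)


def sync_commands(
    legacy_bucket: str, new_bucket: str, job_names: List[str]
) -> List[List[str]]:
    """One `aws s3 sync` argv per job prefix (prefix layout is assumed)."""
    return [
        ["aws", "s3", "sync",
         f"s3://{legacy_bucket}/{job}/", f"s3://{new_bucket}/{job}/"]
        for job in job_names
    ]


def glue_day_arguments(first: date, last: date) -> List[Dict[str, str]]:
    """One Glue job-run argument map per copied day, keyed by --day."""
    return [{"--day": day} for day in days_between(first, last)]


if __name__ == "__main__":
    # Placeholder names for illustration only.
    for argv in sync_commands(
        "legacy-logs-bucket", "marqo-dev-cf-logpush-logs", ["job-a", "job-b"]
    ):
        print(" ".join(argv))
    for args in glue_day_arguments(date(2024, 2, 28), date(2024, 3, 1)):
        print(args)
```

Each dict from glue_day_arguments has the shape Glue accepts as job-run arguments, so the list doubles as a checklist of days still to compact.
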
Day-boundary caveat (permanent)

Events from the last few minutes of a day can be delivered after midnight UTC and are stored under the next day's folder with their original hour=23 partition value. Dashboard windows crossing midnight can miss those rows on the Parquet datasource until the window's end passes 23:00 UTC of the later day. Use Live JSON or widen the range when investigating right at a day boundary. (Also noted in the dashboard's help panel.)
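
The caveat can be made concrete with a small model. This is a sketch under stated assumptions: the day folder follows delivery time and the hour partition follows event time (as described above), and the dashboard is assumed to scan one (day, hour) partition for each hour its time window touches. The second assumption is inferred from the caveat, not confirmed from the dashboard queries, and all names here are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Set, Tuple

Partition = Tuple[str, int]  # (day folder as yyyymmdd, hour partition value)


def partition_for(event_time: datetime, delivered_time: datetime) -> Partition:
    """Day folder comes from delivery time, hour from the event itself."""
    return (delivered_time.strftime("%Y%m%d"), event_time.hour)


def partitions_scanned(start: datetime, end: datetime) -> Set[Partition]:
    """Assumed pruning: one (day, hour) per hour touched by [start, end]."""
    cursor = start.replace(minute=0, second=0, microsecond=0)
    scanned = set()
    while cursor <= end:
        scanned.add((cursor.strftime("%Y%m%d"), cursor.hour))
        cursor += timedelta(hours=1)
    return scanned


if __name__ == "__main__":
    utc = timezone.utc
    # Event at 23:58 UTC, delivered a few minutes after midnight.
    late = partition_for(
        datetime(2024, 5, 1, 23, 58, tzinfo=utc),
        datetime(2024, 5, 2, 0, 3, tzinfo=utc),
    )
    narrow = partitions_scanned(
        datetime(2024, 5, 1, 23, 30, tzinfo=utc),
        datetime(2024, 5, 2, 0, 30, tzinfo=utc),
    )
    wide = partitions_scanned(
        datetime(2024, 5, 1, 23, 30, tzinfo=utc),
        datetime(2024, 5, 2, 23, 0, tzinfo=utc),
    )
    print(late, late in narrow, late in wide)
```

In this model the late event lands in the later day's folder with hour 23, so a one-hour window around midnight never scans it, while a window whose end reaches 23:00 UTC of the later day does.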