Cloudflare logpush → Parquet pipeline: per-environment cutover
The Parquet conversion pipeline (PR #1179, DP-1755) moves the logpush
destination bucket into the multitenant_eks_cluster module and converts
worker trace events to Parquet (S3-event Lambda + 12-hourly Glue compaction).
Each environment cuts over independently with the steps below, in order,
after the terraform apply.
The buckets are env singletons (marqo-<env>-cf-logpush-logs,
marqo-<env>-cf-parquet-logs): at most one stack per env may have
enable_cloudflare_log_analytics on (currently prod2 for production; prod1
stays off).
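The singleton rule can be checked mechanically before an apply. The sketch below is illustrative only: the stack inventory, the `prod` env label, and both helper names are assumptions, not part of the module. It derives the two bucket names from the naming pattern above and reports any env where more than one stack has enable_cloudflare_log_analytics on.

```python
from collections import defaultdict


def bucket_names(env):
    """Singleton bucket names for one env, following the naming pattern above."""
    return (f"marqo-{env}-cf-logpush-logs", f"marqo-{env}-cf-parquet-logs")


def envs_with_conflicts(stacks):
    """Return {env: [stack names]} for envs with more than one enabled stack.

    `stacks` maps a stack name to (env, enable_cloudflare_log_analytics).
    This mapping is a hypothetical inventory, not something the module emits.
    """
    enabled = defaultdict(list)
    for name, (env, flag) in stacks.items():
        if flag:
            enabled[env].append(name)
    return {env: sorted(names) for env, names in enabled.items() if len(names) > 1}


# Hypothetical inventory mirroring the current state: prod2 on, prod1 off.
stacks = {"prod1": ("prod", False), "prod2": ("prod", True)}
conflicts = envs_with_conflicts(stacks)  # empty dict means the rule holds
```

An empty result means each env has at most one stack writing to its bucket pair.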
Steps
0. Pre-apply: wire the upsert-workflow tfvars. If the upsert workflow's
   OIDC role lives in a different AWS account from this env's data plane,
   set cloudflare_logpush_upsert_role_arns in the env tfvars before the
   initial apply. The OwnershipChallengeRead statement on the logpush
   bucket policy is gated on this var, and the workflow's challenge-file
   read in step 2 will hit AccessDenied until that policy is in place.
   Same-account OIDC needs no tfvar change.
1. Repoint the upsert workflow. Set the GitHub environment variable
   CLOUDFLARE_LOGPUSH_BUCKET_NAME to the new bucket
   (marqo-<env>-cf-logpush-logs).
   1a. Load-test the converter Lambda. Before relying on the
       max_upload_bytes: 200 MB batching params in prod, push one
       synthetic .gz of ~200 MB uncompressed NDJSON (matching real
       logpush schema) into the logpush bucket under a test job prefix,
       then check the converter Lambda's CloudWatch REPORT line for
       Max Memory Used and confirm headroom under the 2 GB allocation
       before the repoint. Dev testing only exercised tiny files; the
       production batching params have not been load-validated.
2. Repoint logpush. Re-run the cloudflare-upsert-log-push-job workflow
   for each job in cloudflare_logpush_job_names. This also applies the
   upload-batching params (300 s / 200 MB) to existing jobs.
   Observability gap until step 4: from this moment, new logs flow to the
   new bucket, but the JSON table (and the dashboard's default Live JSON
   datasource) still reads the legacy bucket: it silently stops receiving
   new rows. The Parquet toggle is the fresh view in this window; switch
   to it for any live investigation, and keep the step 2 → step 4 window
   short.
3. Optional history migration. Grant a data-plane role read on the legacy
   bucket (CDK change in cloud_control_plane_cdk
   cloudflare_logs_stack.py), aws s3 sync per job prefix into the new
   bucket, then run the Glue job with --day <yyyymmdd> per copied day
   (~$20–40 one-off). Skip this and the legacy bucket's remaining history
   simply ages out.
4. Follow the JSON table over. Remove cloudflare_logpush_bucket_name from
   the env tfvars and re-apply. The JSON table and Grafana read access
   move to the new bucket, closing the step 2 gap.
5. Retire the legacy stack. Delete the CDK CloudflareLogsStack once the
   legacy bucket has drained (it is RemovalPolicy.RETAIN, so the bucket
   survives the stack and is emptied/deleted manually).
6. Alerting before the default flip. Add Grafana alerts on the converter
   Lambda (Errors > 0, Throttles sustained; the function raises on any
   failed conversion, so the native metric is sufficient) and on Glue
   compaction job-run failures. A broken converter after step 7 otherwise
   degrades silently into a stale dashboard.
7. Flip the dashboard default to Parquet. Change the datasource template
   variable's default in
   infra/multitenant_eks_cluster/grafana-dashboards/cloudflare_worker_logs.json
   (and its help text) from Live JSON to Parquet.
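For the converter load test in step 1a, a minimal sketch of the two local pieces: building a synthetic gzip of NDJSON at a target uncompressed size, and reading the memory headroom out of a Lambda REPORT line. The record fields and both function names are illustrative assumptions; substitute a record copied from a real logpush file so the schema matches. Uploading the file and fetching the REPORT line from CloudWatch are left out.

```python
import gzip
import io
import json
import re

# Placeholder record; replace with a real logpush record so the schema matches.
SAMPLE_RECORD = {"EventTimestampMs": 0, "ScriptName": "example-worker", "Outcome": "ok"}


def build_synthetic_gz(target_uncompressed_bytes, record=SAMPLE_RECORD):
    """Return gzip bytes whose decompressed NDJSON is at least the target size.

    Repeating one record compresses far better than real traffic, so only the
    uncompressed size is representative here, not the size of the .gz itself.
    """
    line = (json.dumps(record) + "\n").encode("utf-8")
    count = -(-target_uncompressed_bytes // len(line))  # ceiling division
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for _ in range(count):
            gz.write(line)
    return buf.getvalue()


REPORT_RE = re.compile(r"Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB")


def memory_headroom_mb(report_line):
    """Return allocated minus peak memory (MB) from a Lambda REPORT log line."""
    match = REPORT_RE.search(report_line)
    if match is None:
        raise ValueError("not a Lambda REPORT line")
    allocated, used = int(match.group(1)), int(match.group(2))
    return allocated - used
```

Calling `build_synthetic_gz(200 * 1024 * 1024)` produces the ~200 MB case; how much headroom counts as enough under the 2 GB allocation is a judgment call this sketch does not make.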
Day-boundary caveat (permanent)
Events from the last few minutes of a day can be delivered after midnight UTC
and are stored under the next day's folder with their original hour=23
partition value. Dashboard windows crossing midnight can miss those rows on
the Parquet datasource until the window's end passes 23:00 UTC of the later
day. Use Live JSON or widen the range when investigating right at a day
boundary. (Also noted in the dashboard's help panel.)
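The caveat can be illustrated with a small model. This is a sketch under assumptions, not the dashboard's actual query logic: it assumes the Parquet datasource prunes to the (day, hour) partitions covered by the time window, that the day folder follows delivery time, and that the hour value follows event time. The function names and dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stored_partition(event_time, delivered_time):
    """Day folder comes from delivery time; hour keeps the event's own hour."""
    return (delivered_time.strftime("%Y%m%d"), event_time.hour)


def partitions_for_window(start, end):
    """(day, hour) partitions a window-pruned query would scan (assumed)."""
    parts = set()
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        parts.add((t.strftime("%Y%m%d"), t.hour))
        t += timedelta(hours=1)
    return parts


utc = timezone.utc
event = datetime(2024, 5, 1, 23, 58, tzinfo=utc)      # emitted late on day 1
delivered = datetime(2024, 5, 2, 0, 1, tzinfo=utc)    # delivered after midnight
part = stored_partition(event, delivered)             # day-2 folder, hour=23

window_start = datetime(2024, 5, 1, 22, 0, tzinfo=utc)
narrow = partitions_for_window(window_start, datetime(2024, 5, 2, 1, 0, tzinfo=utc))
wide = partitions_for_window(window_start, datetime(2024, 5, 2, 23, 0, tzinfo=utc))

missed_in_narrow = part not in narrow  # window ends before 23:00 of day 2
found_in_wide = part in wide           # window end reaches 23:00 of day 2
```

Under this model the late row is found only once the window's end reaches hour 23 of the later day, which is why widening the range recovers it.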