Metrics Publishing

E-commerce metrics currently have two supported publishing transports:

  • AWS/Python producers publish to EcomMetricsQueue. The components/ecom_metrics_consumer service consumes those events and remote-writes them to Amazon Managed Service for Prometheus (AMP).
  • components/search_proxy can publish request metrics through the Cloudflare METRICS_GATEWAY Service Binding. This is the high-volume Worker producer and the first target for reducing SQS cost.

Python/AWS producers do not have a direct Cloudflare Service Binding path yet. Keep them on the existing SQS publishers until a supported AWS route exists.

Current SQS Path

Most e-commerce metrics are sent to EcomMetricsQueue and flushed by components/ecom_metrics_consumer to AMP.

Current event shapes include:

  • typed events such as request, controller_request, indexer_job, agentic_stream, onboarding_progress, and failure;
  • generic type: "timeseries" events produced by components/metrics_publisher.SqsMetricsPublisher.

The Python MetricsPublisher interface accepts pre-formatted Prometheus TimeSeries values. SqsMetricsPublisher serializes those as:

{
  "type": "timeseries",
  "env": "staging",
  "merge": "sum",
  "series": []
}

The merge field selects the bucket: the consumer coalesces additive series in the sum bucket and gauges in the last bucket before remote-writing to AMP.
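
The two buckets can be illustrated with a minimal sketch. It is not the consumer's actual code: the Sample shape, the "last" merge value, and the seriesKey and coalesce helpers are assumptions made for illustration, and real Prometheus TimeSeries values also carry timestamps, which are left out here.

```typescript
// Illustrative only: these types and helpers are assumptions, not the real consumer.
type MergeMode = "sum" | "last";

interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

interface TimeseriesEvent {
  type: "timeseries";
  env: string;
  merge: MergeMode;
  series: Sample[];
}

// Stable identity for a series: metric name plus labels sorted by key.
function seriesKey(sample: Sample): string {
  const labels = Object.keys(sample.labels)
    .sort()
    .map((k) => `${k}=${sample.labels[k]}`)
    .join(",");
  return `${sample.name}{${labels}}`;
}

// Coalesce events into one value per series, kept in separate sum and last buckets.
function coalesce(events: TimeseriesEvent[]): Record<MergeMode, Map<string, number>> {
  const buckets: Record<MergeMode, Map<string, number>> = {
    sum: new Map<string, number>(),
    last: new Map<string, number>(),
  };
  for (const event of events) {
    const bucket = buckets[event.merge];
    for (const sample of event.series) {
      const key = `${event.env}/${seriesKey(sample)}`;
      const previous = bucket.get(key) ?? 0;
      // Additive series accumulate; gauges keep only the most recent value.
      bucket.set(key, event.merge === "sum" ? previous + sample.value : sample.value);
    }
  }
  return buckets;
}
```

In this sketch, two "sum" events for the same series with values 2 and 3 coalesce to 5, while two "last" events with values 7 and 4 coalesce to 4.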

Search Proxy Gateway Worker

search_proxy still builds the same request metric event used by the SQS path. When METRICS_GATEWAY is bound, it converts that event to the same Prometheus series produced by components/ecom_metrics_consumer:

  • ecom_api_request_latency_milliseconds_bucket
  • ecom_api_request_latency_milliseconds_sum
  • ecom_api_request_latency_milliseconds_count
  • ecom_api_cache_hit_count
  • ecom_api_cache_miss_count
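
The conversion to the series listed above might look roughly like the following sketch. The RequestEvent fields (env, latencyMs, cacheHit), the bucket bounds, and the requestEventToDatapoints helper are hypothetical; the real event and the consumer's bucket layout are defined in the repo and may differ.

```typescript
// Datapoint shape follows the gateway batch example in this document.
interface Datapoint {
  name: string;
  type: "counter";
  value: number;
  labels: Record<string, string>;
}

// Hypothetical request event; the real event has its own field names.
interface RequestEvent {
  env: string;
  latencyMs: number;
  cacheHit: boolean;
}

// Hypothetical bucket bounds in milliseconds; the consumer defines the real ones.
const LATENCY_BUCKETS_MS: number[] = [50, 100, 250, 500, 1000];

const PREFIX = "ecom_api_request_latency_milliseconds";

function requestEventToDatapoints(event: RequestEvent): Datapoint[] {
  const counter = (
    name: string,
    value: number,
    extra: Record<string, string> = {},
  ): Datapoint => ({
    name,
    type: "counter",
    value,
    labels: { env: event.env, ...extra },
  });

  // Histogram buckets are cumulative: every bound >= the latency gets an increment.
  const buckets = LATENCY_BUCKETS_MS.map((bound) =>
    counter(`${PREFIX}_bucket`, event.latencyMs <= bound ? 1 : 0, { le: String(bound) }),
  );
  buckets.push(counter(`${PREFIX}_bucket`, 1, { le: "+Inf" }));

  return [
    ...buckets,
    counter(`${PREFIX}_sum`, event.latencyMs),
    counter(`${PREFIX}_count`, 1),
    counter("ecom_api_cache_hit_count", event.cacheHit ? 1 : 0),
    counter("ecom_api_cache_miss_count", event.cacheHit ? 0 : 1),
  ];
}
```

With these assumed bounds, a 120 ms cache hit increments the 250, 500, 1000, and +Inf buckets, adds 120 to the sum, and adds 1 to both the count and the cache-hit counter.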

search_proxy sends those datapoints through the gateway worker batch API:

env.METRICS_GATEWAY.reportAndConfirm([
  {
    name: "ecom_api_request_latency_milliseconds_count",
    type: "counter",
    value: 1,
    labels: { env: "dev-feat-grpc-push" },
  },
]);

Zero-valued counter increments are omitted from the gateway batch. They are no-ops for additive counters, and the SQS fallback still preserves the original request event contract.
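
A minimal sketch of that filtering step follows. The dropZeroCounterIncrements helper is a hypothetical name, and the "gauge" type is an assumption included only to show that the rule applies to counters alone.

```typescript
// Datapoint shape follows the gateway batch example; "gauge" is an assumed extra type.
interface Datapoint {
  name: string;
  type: "counter" | "gauge";
  value: number;
  labels: Record<string, string>;
}

// Drop counter increments of zero: adding 0 to an additive counter changes nothing.
// Other datapoint types are kept, since a zero value can be meaningful for them.
function dropZeroCounterIncrements(batch: Datapoint[]): Datapoint[] {
  return batch.filter((point) => !(point.type === "counter" && point.value === 0));
}
```

For a cache hit, for example, the miss counter with value 0 would be removed from the batch while the hit counter with value 1 is kept.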

Wrangler binds METRICS_GATEWAY only for dev search proxy deployments:

| Environment | Service | Entrypoint |
| --- | --- | --- |
| dev cells | cell1-dev-metrics-gateway-worker | MetricsReporter |
| local dev env.dev | cell1-dev-metrics-gateway-worker | MetricsReporter |

Staging, preprod, and prod do not declare this Service Binding yet. They keep using the existing SQS path until the gateway/ingest path is ready for those environments.

If the binding is absent locally, or if the gateway worker call fails at runtime, search_proxy falls back to the existing SQS path and logs the gateway failure. Metrics publishing remains best-effort and must not affect the request flow.
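
The fallback behavior can be sketched as below. This is an illustration under stated assumptions, not the search_proxy implementation: PublishDeps, sendToSqs, log, and publishRequestMetrics are hypothetical names, and only reportAndConfirm comes from this document (assumed here to return a promise, as Service Binding RPC calls do).

```typescript
interface Datapoint {
  name: string;
  type: "counter";
  value: number;
  labels: Record<string, string>;
}

// Hypothetical minimal view of the binding; only reportAndConfirm appears in this document.
interface MetricsGateway {
  reportAndConfirm(batch: Datapoint[]): Promise<unknown>;
}

interface PublishDeps {
  gateway?: MetricsGateway; // undefined when METRICS_GATEWAY is not bound
  sendToSqs: (event: object) => Promise<void>; // hypothetical SQS publisher
  log: (message: string, error?: unknown) => void;
}

type Transport = "gateway" | "sqs" | "dropped";

// Try the gateway first, fall back to SQS, and never throw into the request flow.
async function publishRequestMetrics(
  event: object,
  datapoints: Datapoint[],
  deps: PublishDeps,
): Promise<Transport> {
  if (deps.gateway !== undefined) {
    try {
      await deps.gateway.reportAndConfirm(datapoints);
      return "gateway";
    } catch (error) {
      deps.log("metrics gateway call failed, falling back to SQS", error);
    }
  }
  try {
    await deps.sendToSqs(event);
    return "sqs";
  } catch (error) {
    // Best-effort: swallow the failure so the request is unaffected.
    deps.log("SQS metrics publish failed", error);
    return "dropped";
  }
}
```

Returning the transport that was used is a choice made for this sketch so that the path taken is easy to assert in tests; the real code may not expose it.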

Ingest Service

The future ingest service will receive single-datapoint requests behind the metrics gateway worker. search_proxy does not call that ingest service directly and does not know its hostname, credentials, or request format.

Until that service exists, the gateway worker must own any forwarding or compatibility behavior needed to keep the published Prometheus metrics equivalent to the current SQS consumer output.

Tunnel

The metrics tunnel is not owned by search_proxy.

The metrics gateway worker owns any tunnel/private-connectivity setup required to reach an ingest service in the cluster. This repo's search proxy config only declares the METRICS_GATEWAY Service Binding and keeps SQS as the fallback transport.

Validation

Relevant checks:

cd components/search_proxy
npm test -- metrics.test.ts
npm run lint