Ecom Metrics Worker Lambda Has Errors

This runbook covers the Grafana alert Ecom Metrics Worker Lambda Has Errors.

The alert fires when prod-EcomMetricsWorker records more than 1 Lambda error in 5 minutes in the controller account. This worker consumes the ecom metrics queue and publishes the metrics used by the ecom API, queue, and indexing alerts. While it is failing, the downstream ecom alerts that depend on these metrics may be stale or inaccurate.
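To confirm the alert condition outside Grafana, the Lambda Errors metric can be read directly from CloudWatch. The sketch below is one way to do that; it assumes the alert is backed by the standard AWS/Lambda Errors metric and that credentials and region for the controller account are already selected. The function names lambda_error_buckets and flag_error_buckets are hypothetical helpers, not part of any existing tooling.

```shell
# Print one line per 5-minute bucket: "<timestamp>\t<error sum>".
# Assumption: the Grafana alert is based on the AWS/Lambda "Errors" metric.
lambda_error_buckets() {
  local fn="${1:-prod-EcomMetricsWorker}" minutes="${2:-30}" end start
  end=$(date -u +%s)                 # epoch seconds; the AWS CLI accepts these as timestamps
  start=$((end - minutes * 60))
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name Errors \
    --dimensions "Name=FunctionName,Value=${fn}" \
    --start-time "$start" --end-time "$end" \
    --period 300 --statistics Sum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Sum]' \
    --output text
}

# Keep only buckets whose error sum exceeds the threshold (default 1,
# matching "more than 1 Lambda error in 5 minutes").
flag_error_buckets() {
  awk -v t="${1:-1}" '$2 > t { print $1, $2, "ALERT" }'
}
```

For example, `lambda_error_buckets prod-EcomMetricsWorker 60 | flag_error_buckets 1` lists the buckets from the last hour that would have breached the threshold.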

Triage

  1. Check prod-EcomMetricsWorker logs in the controller account around the alert window.
  2. Check prod-EcomMetricsQueue depth and the age of its oldest message. If either is growing, the worker is not keeping up or is failing messages repeatedly.
  3. Check prod-EcomMetricsQueueDLQ. If messages are landing there, inspect a sample message without deleting it.
  4. Identify whether the error is:
    • malformed metric payload from the Cloudflare worker,
    • a code/schema mismatch in components/ecom_metrics_consumer,
    • AWS/AMP remote-write or credentials failure,
    • timeout/throttling/capacity on the Lambda.

Useful starting points:

aws logs tail /aws/lambda/prod-EcomMetricsWorker --since 30m --filter-pattern "?ERROR ?Traceback ?Exception"
aws sqs get-queue-attributes --queue-url <prod-EcomMetricsQueue-url> --attribute-names All
aws sqs get-queue-attributes --queue-url <prod-EcomMetricsQueueDLQ-url> --attribute-names All
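
The get-queue-attributes calls above return message counts but not message age; SQS exposes age as the CloudWatch metric ApproximateAgeOfOldestMessage. The sketch below adds that lookup and a non-destructive DLQ peek. It assumes the queue name matches the names used in this runbook and that the DLQ has no redrive policy of its own (a receive still increments the message's receive count). The helper names are hypothetical.

```shell
# Oldest-message age (seconds) per 5-minute bucket for a queue, by queue NAME.
queue_oldest_message_age() {
  local queue_name="$1" minutes="${2:-30}" end start
  end=$(date -u +%s)
  start=$((end - minutes * 60))
  aws cloudwatch get-metric-statistics \
    --namespace AWS/SQS \
    --metric-name ApproximateAgeOfOldestMessage \
    --dimensions "Name=QueueName,Value=${queue_name}" \
    --start-time "$start" --end-time "$end" \
    --period 300 --statistics Maximum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Maximum]' \
    --output text
}

# Read one message from the DLQ by queue URL without deleting it.
# --visibility-timeout 0 makes the message visible to other readers again
# immediately; nothing here calls delete-message.
peek_dlq() {
  local queue_url="$1"
  aws sqs receive-message \
    --queue-url "$queue_url" \
    --max-number-of-messages 1 \
    --visibility-timeout 0 \
    --attribute-names All \
    --message-attribute-names All \
    --output json
}
```

Usage: `queue_oldest_message_age prod-EcomMetricsQueue` and `peek_dlq <prod-EcomMetricsQueueDLQ-url>`. Save the peek output somewhere durable before any cleanup so the example survives a later purge or redrive.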

See also SQS and Lambda.

Remediation

  • If a deploy introduced a schema or parsing error, roll back or patch the producer/consumer pair.
  • If a small number of poison messages are blocking retries, preserve examples, fix the code/config issue, then decide whether to redrive or discard the affected metrics messages. Do not redrive before the root cause is fixed.
  • If the Lambda is throttled or timing out, check reserved concurrency, timeout, memory, and event source mapping settings.
  • If AMP writes are failing, check credentials, workspace availability, and recent data-plane observability changes.
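
The capacity check and the redrive step above can be sketched with the AWS CLI as follows. This is a minimal illustration, not an established procedure: the function names and the ROOT_CAUSE_FIXED guard variable are hypothetical, and the redrive assumes the DLQ should be moved back to its original source queue (the default when no destination ARN is given).

```shell
# Print the settings most relevant to throttling and timeouts.
worker_capacity_settings() {
  local fn="${1:-prod-EcomMetricsWorker}"
  # Timeout (seconds) and memory (MB), plus when the config last changed.
  aws lambda get-function-configuration --function-name "$fn" \
    --query '{Timeout:Timeout,MemorySize:MemorySize,LastModified:LastModified}' \
    --output json
  # Reserved concurrency; the result is empty when none is configured.
  aws lambda get-function-concurrency --function-name "$fn" --output json
  # Event source mapping state and batching settings.
  aws lambda list-event-source-mappings --function-name "$fn" \
    --query 'EventSourceMappings[].{State:State,BatchSize:BatchSize,Window:MaximumBatchingWindowInSeconds,MaxConcurrency:ScalingConfig.MaximumConcurrency}' \
    --output json
}

# Redrive the DLQ back to its source queue, but only after an explicit
# acknowledgement that the root cause is fixed (hypothetical guard).
redrive_dlq() {
  local dlq_arn="$1"
  if [ "${ROOT_CAUSE_FIXED:-no}" != "yes" ]; then
    echo "refusing to redrive: set ROOT_CAUSE_FIXED=yes once the fix is deployed" >&2
    return 1
  fi
  aws sqs start-message-move-task --source-arn "$dlq_arn"
}
```

The guard exists only to make the "do not redrive before the root cause is fixed" rule hard to skip by accident; invoke it as `ROOT_CAUSE_FIXED=yes redrive_dlq <prod-EcomMetricsQueueDLQ-arn>`.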

Validation

  • prod-EcomMetricsWorker has no new Lambda errors.
  • prod-EcomMetricsQueue drains and the DLQ is not growing.
  • Fresh ecom API metrics are visible in Grafana/AMP.
  • Any dependent ecom alerts are evaluated from current data again.
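
One way to check that the DLQ is not growing is to sample its depth twice and compare. The sketch below does that; queue_depth and dlq_is_growing are hypothetical helper names, and because SQS message counts are approximate, a single comparison is a hint rather than proof.

```shell
# Approximate number of visible messages in a queue, by queue URL.
queue_depth() {
  aws sqs get-queue-attributes --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# Succeeds (exit 0) if the queue depth increased between two samples
# taken wait_s seconds apart (default 60).
dlq_is_growing() {
  local queue_url="$1" wait_s="${2:-60}" before after
  before=$(queue_depth "$queue_url")
  sleep "$wait_s"
  after=$(queue_depth "$queue_url")
  echo "DLQ depth: ${before} -> ${after}"
  [ "$after" -gt "$before" ]
}
```

Usage: `dlq_is_growing <prod-EcomMetricsQueueDLQ-url> 120 && echo "still failing"`. The same queue_depth helper can be pointed at prod-EcomMetricsQueue to watch the main queue drain.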