
Onboarding Completion Metrics → Grafana

Bead: ccp-tjui · Status: Design / plan

Goal

Produce a Grafana time-series chart with one series per account where the Y axis is the count of onboarding checklist steps completed for that account. The effect is to watch accounts "stair-step" toward completion: filter by a selection of accounts, see who is incomplete and by how much, and see who changed recently.

The source request (paraphrased):

Set up some kind of cron job to run every minute or so, calling the admin_lambda to output onboarding progress metrics … a grafana time series chart with a series per account where the Y axis is the count of onboarding checklist steps that have been completed … We can't just publish a change metric on the "leading edge" because grafana treats metrics as stale if they don't have any new data points in the last 5 minutes.

Background: how telemetry reaches Grafana today

Grafana (Amazon Managed Grafana) is backed by Amazon Managed Prometheus (AMP). Every existing per-account dashboard (per_index_dashboard, ecom_overview_dashboard — see components/admin_worker/app/routes/my-customers*.tsx) queries AMP with PromQL, filtering on the account_id label (whose value is the system account id) and env.

Metrics get into AMP via a single, battle-tested pipeline:

producer ──SQS(JSON event)──▶ EcomMetricsQueue ──▶ ecom_metrics_consumer Lambda
    ──(convert → dedup → Prometheus remote_write)──▶ Amazon Managed Prometheus (AMP)
    ──▶ Grafana
  • Producers today: search_proxy (type: "request") and ecom_indexer (type: "indexer_job"), each send_message-ing a JSON event to the queue (components/ecom_indexer/ecom_indexer/job_metrics.py, components/search_proxy/src/metrics.ts).
  • Consumer: components/ecom_metrics_consumer/ parses events by type, converts to Prometheus TimeSeries, deduplicates, snappy-compresses protobuf, SigV4 signs, and POSTs to the AMP remote_write endpoint. It handles cross-account AMP writes via sts:AssumeRole and SQS partial-batch retries.

Account topology (decisive for transport choice)

Env     | admin acct   | ecom acct    | same acct? | AMP acct
prod    | 023568249301 | 023568249301 | yes        | 651774330118
staging | 468036072962 | 468036072962 | yes        | 468036072962

admin_lambda (admin stack) and EcomMetricsQueue (ecom stack) deploy to the same AWS account. The cross-account hop to AMP is already absorbed by the consumer. So admin_lambda can send onboarding events to the existing queue with a plain same-account sqs:SendMessage grant — no cross-account producer plumbing, no shared IAM-user secret.

The data we're measuring

Onboarding state lives in DynamoDB (ONBOARDING_TABLE_NAME), owned by admin_lambda. Each OnboardingRecordModel (components/admin_lambda/admin_lambda/models/onboarding_models.py) carries:

  • system_account_id — the account identifier used as the account_id label.
  • visible_account_id, account_name, status — context.
  • completed_steps: list[str] | None — the checklist progress.

OnboardingDataServiceDDB.list_all() already enumerates every record via the gsi1 GSI (paginated), and the hourly pixel-health batch already fans out over them (run_pixel_health_checks_batch in components/admin_lambda/admin_lambda/routes/onboarding_routes.py).

The full set of checklist steps (the denominator) is defined as:

  • pixel_health_service.py: AUTOMATABLE_STEPS = {pixel_installed, pixel_receiving, pixel_verified, pixel_complete}
  • index_health_service.py: INDEX_AUTOMATABLE_STEPS = {index_created, index_associated, docs_added}
  • plus the user-completable pixel_customized.

That is 4 + 3 + 1 = 8 steps in total today. The metric value for an account is len(completed_steps or []), so a record whose completed_steps is None reports 0.
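The counting rule above can be sketched as follows. This is a minimal illustration rather than the actual module layout: the step names come from the sets listed above, while USER_COMPLETABLE_STEPS, ALL_STEPS and the two helper functions are hypothetical names introduced only for this example.

```python
from typing import List, Optional

# Step names as listed above. In the real code these sets live in
# pixel_health_service.py and index_health_service.py and would be imported.
AUTOMATABLE_STEPS = frozenset(
    {"pixel_installed", "pixel_receiving", "pixel_verified", "pixel_complete"}
)
INDEX_AUTOMATABLE_STEPS = frozenset({"index_created", "index_associated", "docs_added"})
USER_COMPLETABLE_STEPS = frozenset({"pixel_customized"})  # hypothetical name

# Single source of truth for the denominator.
ALL_STEPS = AUTOMATABLE_STEPS | INDEX_AUTOMATABLE_STEPS | USER_COMPLETABLE_STEPS


def completed_step_count(completed_steps: Optional[List[str]]) -> int:
    """Metric value: number of completed steps, treating None as empty."""
    return len(completed_steps or [])


def total_step_count() -> int:
    """Denominator: number of defined checklist steps."""
    return len(ALL_STEPS)
```

Deriving the total from the sets rather than a literal means the denominator follows the step definitions if they change.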

Why not the obvious alternatives

  • Leading-edge / "publish on change" (sparse series). Rejected, and this is the crux of the request. AMP/Prometheus marks a series stale ~5 minutes after its last sample, so a chart built from change-only points goes blank between changes — you cannot see a steady "this account is stuck at 3/8" line. The fix is to re-publish every account's current value on a short cadence (every ~1 min ≪ 5 min staleness window) so each series stays continuously fresh and renders flat between changes and stepped on change.
  • CloudWatch custom metrics. admin_lambda already has cloudwatch:PutMetricData, so this is tempting, but it would put onboarding on a different datasource from every other per-account dashboard (AMP), require provisioning a Grafana CloudWatch datasource/read-role in the admin account, and incur per-account custom-metric cost. Inconsistent and worse; rejected.
  • admin_lambda writes to AMP directly. Would duplicate the remote_write / snappy / SigV4 / cross-account-assume-role machinery that already lives, tested, in ecom_metrics_consumer. Rejected in favour of reuse.

The chosen approach is therefore to reuse the existing SQS → consumer → AMP pipeline by adding a new gauge event type, and to drive it from admin_lambda on a 1-minute EventBridge schedule (honouring the request's explicit "call the admin_lambda" directive).

Metric model

  • onboarding_completed_steps (gauge): value = len(completed_steps or []), so a record with no recorded steps reports 0.
  • onboarding_total_steps (gauge, optional but recommended): value = the number of defined steps (8 today). Emitting it makes dashboards self-describing (percent-complete = completed/total) and keeps the denominator correct if steps are added.
  • Labels (kept deliberately stable): account_id (= system_account_id), env.
    • account_id matches the existing dashboards' label so onboarding can reuse the same account filter/variable conventions.
    • env = FULL_ENV, matching ecom_indexer's config.full_env so the env filter lines up across dashboards.
    • Do NOT label by status or account_name. A Prometheus series is identified by its full label set; a changing label value forks a new series and breaks the single continuous line per account that we want. status/name belong in Grafana display mapping, not in the series identity.
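To illustrate why a changing label forks a series, here is a small sketch that models series identity as the metric name plus the full label set. The series_key helper and the sample account id are hypothetical and exist only for this example; they are not part of the consumer.

```python
from typing import Dict, Tuple


def series_key(name: str, labels: Dict[str, str]) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
    """Model a Prometheus series identity: metric name plus the sorted full label set."""
    return (name, tuple(sorted(labels.items())))


METRIC = "onboarding_completed_steps"
stable_labels = {"account_id": "acct-123", "env": "prod"}  # illustrative values

# With only the stable labels, every tick maps to the same series.
tick_1 = series_key(METRIC, stable_labels)
tick_2 = series_key(METRIC, dict(stable_labels))

# Adding a mutable label such as status yields a different identity
# as soon as the status value changes.
with_status_before = series_key(METRIC, {**stable_labels, "status": "registered"})
with_status_after = series_key(METRIC, {**stable_labels, "status": "completed"})
```

Here tick_1 and tick_2 are equal, while with_status_before and with_status_after differ, which is the fork that would split one account's line into two.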

Gauge semantics — the one correctness trap

The consumer's _deduplicate (prometheus_service.py) sums samples that share a label-set + timestamp, and _process_and_flush stamps processing-time timestamps (not event time). That is correct for additive histogram/counter series, but wrong for a gauge: if two events for the same account land in one flush — e.g. an SQS redelivery/retry colliding with a fresh tick — summing would report completed_steps = 10 instead of 5.

The plan therefore adds gauge-aware dedup (last-value-wins per label-set + timestamp) for the onboarding path rather than reusing the summing dedup. Concretely: generalize _deduplicate to take a merge strategy (sum for the existing histogram/counter paths, last for gauges), or add a parallel _deduplicate_gauges. This must be covered by a test that feeds two events for the same account_id in one batch and asserts the latest value wins (not the sum).
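One way the merge-strategy generalization might look is sketched below. It is not the existing _deduplicate: the sample representation (a labels tuple, a millisecond timestamp and a value) and the names deduplicate, merge_sum and merge_last are assumptions made for this illustration, and "last" here means last in batch order.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Labels = Tuple[Tuple[str, str], ...]
Sample = Tuple[Labels, int, float]  # (labels, timestamp_ms, value)


def merge_sum(old: float, new: float) -> float:
    """Additive merge, suitable for histogram/counter style series."""
    return old + new


def merge_last(old: float, new: float) -> float:
    """Gauge merge: the later sample in the batch replaces the earlier one."""
    return new


def deduplicate(
    samples: Iterable[Sample], merge: Callable[[float, float], float]
) -> List[Sample]:
    """Collapse samples that share a label set and timestamp using `merge`."""
    merged: Dict[Tuple[Labels, int], float] = {}
    for labels, ts_ms, value in samples:
        key = (labels, ts_ms)
        merged[key] = merge(merged[key], value) if key in merged else value
    return [(labels, ts_ms, value) for (labels, ts_ms), value in merged.items()]
```

With two samples of value 5 for the same account and timestamp, merge_sum yields 10 while merge_last yields 5, which is the case the required test should pin down.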

Scope of which records to emit

Emit one gauge per onboarding record that represents a real account, to keep active-series cardinality bounded and meaningful:

  • Include records that have engaged: status in {visited, registered, submitted, completed} or completed_steps non-empty. This shows accounts climbing from their first step and keeps completed accounts pinned at 8/8 (so the stair-step to completion stays visible).
  • Exclude never-engaged, expired pending leads, which would otherwise emit a permanent flat-line-at-zero series forever.

Cardinality ≈ number of real onboarding accounts (bounded, grows slowly — each account onboards once). This filter is the main tunable knob; document it where it's defined.
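The scope filter described above could be expressed as a single predicate. This is a sketch under assumptions: the four engaged status values are taken from the list above, while the function name should_emit and the constant ENGAGED_STATUSES are hypothetical, and the predicate takes plain values rather than an OnboardingRecordModel.

```python
from typing import List, Optional

# Statuses that count as "engaged", per the scope rule above.
ENGAGED_STATUSES = frozenset({"visited", "registered", "submitted", "completed"})


def should_emit(status: Optional[str], completed_steps: Optional[List[str]]) -> bool:
    """Emit a gauge if the account has an engaged status or any completed step."""
    return status in ENGAGED_STATUSES or bool(completed_steps)
```

Keeping the rule in one documented place makes it easy to tune later, since it is the main knob on active-series cardinality.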

Component changes

1. components/ecom_metrics_consumer — accept the new event type

  • models.py: add OnboardingProgressMetricEvent (type: Literal["onboarding_progress"], env: str, accountId: str, completedSteps: int, totalSteps: int | None, messageId: str).
  • lambda_function.py: add an elif event_type == "onboarding_progress": branch that validates and collects these events, then dispatches them via prom.process_onboarding_events(...), folding failures into failed_message_ids and counting them in the parsed_*/success log/outcome accounting (don't leave the new type out of all_parsed_ids).
  • prometheus_service.py: add _build_gauge_timeseries(name, value, labels, ts), _onboarding_event_to_timeseries, and process_onboarding_events; route the gauge series through last-value-wins dedup (see "Gauge semantics" above).
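The consumer-side changes might be shaped roughly as follows. This is a sketch, not the consumer's code: it uses a standard-library dataclass and plain tuples in place of whatever model and TimeSeries types the consumer actually uses, and parse_onboarding_event and event_to_gauges are hypothetical names. Only the event field names and metric names come from the plan above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

GaugeSample = Tuple[str, Dict[str, str], float, int]  # (name, labels, value, ts_ms)


@dataclass(frozen=True)
class OnboardingProgressMetricEvent:
    env: str
    accountId: str
    completedSteps: int
    totalSteps: Optional[int]
    messageId: str


def _is_count(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def parse_onboarding_event(raw: Dict[str, Any], message_id: str) -> OnboardingProgressMetricEvent:
    """Validate a decoded JSON event; raise ValueError if it is malformed."""
    if raw.get("type") != "onboarding_progress":
        raise ValueError("not an onboarding_progress event")
    env, account_id = raw.get("env"), raw.get("accountId")
    if not isinstance(env, str) or not env:
        raise ValueError("env must be a non-empty string")
    if not isinstance(account_id, str) or not account_id:
        raise ValueError("accountId must be a non-empty string")
    if not _is_count(raw.get("completedSteps")):
        raise ValueError("completedSteps must be a non-negative integer")
    total = raw.get("totalSteps")
    if total is not None and not _is_count(total):
        raise ValueError("totalSteps must be a non-negative integer or absent")
    return OnboardingProgressMetricEvent(env, account_id, raw["completedSteps"], total, message_id)


def event_to_gauges(event: OnboardingProgressMetricEvent, ts_ms: int) -> List[GaugeSample]:
    """Build one gauge sample per metric, using only the stable labels."""
    labels = {"account_id": event.accountId, "env": event.env}
    samples: List[GaugeSample] = [
        ("onboarding_completed_steps", labels, float(event.completedSteps), ts_ms)
    ]
    if event.totalSteps is not None:
        samples.append(("onboarding_total_steps", labels, float(event.totalSteps), ts_ms))
    return samples
```

A ValueError from parsing would be the point at which the message id is added to failed_message_ids (or the event is skipped), depending on how the consumer treats malformed events of other types.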

2. components/admin_lambda — compute and emit (the producer)

  • New services/onboarding_metrics_service.py (or equivalent): list records via the existing onboarding data service, apply the scope filter, build one onboarding_progress event per account (accountId = system_account_id, completedSteps = len(completed_steps or []), totalSteps = <defined steps>, env = FULL_ENV), and enqueue them. Use sqs.send_message_batch (10/call) to minimise API calls. Be resilient per the pixel-health batch pattern: one account failing must not abort the tick; return a summary {emitted, skipped, errors}.
  • A thin SQS client/helper reading the queue URL from ECOM_METRICS_QUEUE_URL (mirror ecom_indexer/job_metrics.py).
  • run_lambda.py: register action "onboarding.emit_completion_metrics" in _ACTION_HANDLERS, wired through dependencies.py like the other onboarding actions.
  • Reuse a single source of truth for the step set / count (import the existing AUTOMATABLE_STEPS | INDEX_AUTOMATABLE_STEPS | {pixel_customized}) rather than hard-coding 8, so the denominator stays correct as steps evolve.
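A minimal sketch of the enqueue step, assuming the SQS client is injected (for example a boto3 SQS client) so the function can be tested with a fake. The names emit_events and chunked are hypothetical; send_message_batch with QueueUrl and Entries, and the Successful and Failed lists in its response, follow the boto3 SQS API. The skipped count from the plan's summary would come from the scope filter and is left out here.

```python
import json
from typing import Any, Dict, Iterator, List, Sequence

SQS_BATCH_LIMIT = 10  # SendMessageBatch accepts at most 10 entries per call


def chunked(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def emit_events(sqs_client: Any, queue_url: str, events: Sequence[Dict[str, Any]]) -> Dict[str, int]:
    """Send events in batches of 10; one failing batch does not abort the tick."""
    summary = {"emitted": 0, "errors": 0}
    for batch in chunked(events, SQS_BATCH_LIMIT):
        # Entry Ids only need to be unique within a single batch request.
        entries: List[Dict[str, str]] = [
            {"Id": str(index), "MessageBody": json.dumps(event)}
            for index, event in enumerate(batch)
        ]
        try:
            response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        except Exception:
            # Count the whole batch as failed and carry on with the next one.
            summary["errors"] += len(entries)
            continue
        summary["emitted"] += len(response.get("Successful", []))
        summary["errors"] += len(response.get("Failed", []))
    return summary
```

Because every account is re-published on the next tick, a failed batch only costs one sample per affected series, which is why this sketch counts the failure rather than retrying inline.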

3. infra/admin/stacks/admin_stack.py — wiring

  • Grant the admin_lambda role sqs:SendMessage on the EcomMetricsQueue ARN. Same account, so derive the ARN/URL from the deterministic config.envify("EcomMetricsQueue", …) name (or add the name/url to admin config). No cross-account policy needed.
  • Set ECOM_METRICS_QUEUE_URL in the admin_lambda environment.
  • Add a second EventBridge rule (config.envify("OnboardingCompletionMetricsSchedule", …), events.Schedule.rate(Duration.minutes(1)), enabled=config.is_prodish) targeting the lambda with {"action": "onboarding.emit_completion_metrics"} — exactly mirroring the existing setup_pixel_health_schedule. is_prodish keeps dev/branch cells from emitting.

4. Grafana dashboard (manual) + optional admin link

Grafana dashboards are not version-controlled in this repo (managed in the workspace). After metrics flow:

  • Create a time-series panel: onboarding_completed_steps{env="$env", account_id=~"$account_id"}, legend {{account_id}}, with an account_id multi-select template variable so the chart can be filtered to a selection of accounts. Optionally overlay onboarding_total_steps or compute percent-complete.
  • Optionally add a Grafana link from the onboarding admin view, mirroring the existing buildIndexGrafanaUrl/buildGrafanaDashboardUrl helpers in admin_worker.

Testing

  • Consumer: new-type parsing/dispatch in lambda_function; gauge series construction; gauge last-value-wins dedup (two same-account_id events in one batch → latest value, not the sum); unknown-type still skipped; partial batch failure accounting includes onboarding messages.
  • Producer (admin_lambda): count = len(completed_steps or []), including the completed_steps = None case; label values (account_id = system_account_id, env = FULL_ENV); scope filter includes/excludes the right records; one account's failure doesn't abort the batch; send_message_batch chunking at 10; summary shape.
  • Infra: snapshot/assertion that the 1-min rule exists, is gated on is_prodish, targets the right action, and that the role has sqs:SendMessage on the queue.

Open questions / confirm during implementation

  1. Confirm FULL_ENV in the admin account resolves to the same env string that ecom_indexer emits (config.full_env) so the dashboards' env filter matches. If they differ, normalise at the producer.
  2. Confirm the scope filter against real data (how many expired pending leads exist) to validate the cardinality assumption.
  3. Decide whether onboarding_total_steps is worth emitting now or deferred.

Suggested implementation breakdown (child beads)

  1. ecom_metrics_consumer: add onboarding_progress gauge event + gauge dedup + tests.
  2. admin_lambda: onboarding metrics producer service + SQS helper + action handler + tests.
  3. infra/admin: sqs:SendMessage grant, ECOM_METRICS_QUEUE_URL env, 1-min EventBridge rule.
  4. Grafana dashboard + optional admin_worker link + doc update.