Onboarding Completion Metrics → Grafana
Bead: ccp-tjui · Status: Design / plan
Goal
Produce a Grafana time-series chart with one series per account where the Y axis is the count of onboarding checklist steps completed for that account. The effect is to watch accounts "stair-step" toward completion: filter by a selection of accounts, see who is incomplete and by how much, and see who changed recently.
The source request (paraphrased):
Set up some kind of cron job to run every minute or so, calling the admin_lambda to output onboarding progress metrics … a grafana time series chart with a series per account where the Y axis is the count of onboarding checklist steps that have been completed … We can't just publish a change metric on the "leading edge" because grafana treats metrics as stale if they don't have any new data points in the last 5 minutes.
Background: how telemetry reaches Grafana today
Grafana (Amazon Managed Grafana) is backed by Amazon Managed Prometheus
(AMP). Every existing per-account dashboard (per_index_dashboard,
ecom_overview_dashboard — see components/admin_worker/app/routes/my-customers*.tsx)
queries AMP with PromQL, filtering on the account_id label (whose value is
the system account id) and env.
Metrics get into AMP via a single, battle-tested pipeline:
producer ──SQS(JSON event)──▶ EcomMetricsQueue ──▶ ecom_metrics_consumer Lambda
│
convert → dedup → Prometheus remote_write
▼
Amazon Managed Prometheus (AMP)
▼
Grafana
- Producers today:
search_proxy(type: "request") andecom_indexer(type: "indexer_job"), eachsend_message-ing a JSON event to the queue (components/ecom_indexer/ecom_indexer/job_metrics.py,components/search_proxy/src/metrics.ts). - Consumer:
components/ecom_metrics_consumer/parses events bytype, converts to PrometheusTimeSeries, deduplicates, snappy-compresses protobuf, SigV4 signs, and POSTs to the AMPremote_writeendpoint. It handles cross-account AMP writes viasts:AssumeRoleand SQS partial-batch retries.
Account topology (decisive for transport choice)
| Env | admin acct | ecom acct | same acct? | AMP acct |
|---|---|---|---|---|
| prod | 023568249301 | 023568249301 | yes | 651774330118 |
| staging | 468036072962 | 468036072962 | yes | 468036072962 |
admin_lambda (admin stack) and EcomMetricsQueue (ecom stack) deploy to the
same AWS account. The cross-account hop to AMP is already absorbed by the
consumer. So admin_lambda can send onboarding events to the existing queue
with a plain same-account sqs:SendMessage grant — no cross-account producer
plumbing, no shared IAM-user secret.
The data we're measuring
Onboarding state lives in DynamoDB (ONBOARDING_TABLE_NAME), owned by
admin_lambda. Each OnboardingRecordModel
(components/admin_lambda/admin_lambda/models/onboarding_models.py) carries:
system_account_id— the account identifier used as theaccount_idlabel.visible_account_id,account_name,status— context.completed_steps: list[str] | None— the checklist progress.
OnboardingDataServiceDDB.list_all() already enumerates every record via the
gsi1 GSI (paginated), and the hourly pixel-health batch already fans out over
them (run_pixel_health_checks_batch in
components/admin_lambda/admin_lambda/routes/onboarding_routes.py).
The full set of checklist steps (the denominator) is defined as:
pixel_health_service.py:AUTOMATABLE_STEPS = {pixel_installed, pixel_receiving, pixel_verified, pixel_complete}index_health_service.py:INDEX_AUTOMATABLE_STEPS = {index_created, index_associated, docs_added}- plus the user-completable
pixel_customized.
→ 8 steps total today. The metric value for an account is
len(completed_steps or []).
Why not the obvious alternatives
- Leading-edge / "publish on change" (sparse series). Rejected, and this is the crux of the request. AMP/Prometheus marks a series stale ~5 minutes after its last sample, so a chart built from change-only points goes blank between changes — you cannot see a steady "this account is stuck at 3/8" line. The fix is to re-publish every account's current value on a short cadence (every ~1 min ≪ 5 min staleness window) so each series stays continuously fresh and renders flat between changes and stepped on change.
- CloudWatch custom metrics.
admin_lambdaalready hascloudwatch:PutMetricData, so this is tempting, but it would put onboarding on a different datasource from every other per-account dashboard (AMP), require provisioning a Grafana CloudWatch datasource/read-role in the admin account, and incur per-account custom-metric cost. Inconsistent and worse; rejected. admin_lambdawrites to AMP directly. Would duplicate the remote_write / snappy / SigV4 / cross-account-assume-role machinery that already lives, tested, inecom_metrics_consumer. Rejected in favour of reuse.
Recommended design
Reuse the existing SQS → consumer → AMP pipeline by adding a new gauge event
type, and drive it from admin_lambda on a 1-minute EventBridge schedule (the
request's explicit "call the admin_lambda" directive).
Metric model
onboarding_completed_steps(gauge) — value= len(completed_steps).onboarding_total_steps(gauge, optional but recommended) — value=the number of defined steps (8 today). Emitting it makes dashboards self-describing (percent-complete = completed/total) and auto-corrects if steps are added.- Labels (kept deliberately stable):
account_id(=system_account_id),env.account_idmatches the existing dashboards' label so onboarding can reuse the same account filter/variable conventions.env=FULL_ENV, matchingecom_indexer'sconfig.full_envso theenvfilter lines up across dashboards.- Do NOT label by
statusoraccount_name. A Prometheus series is identified by its full label set; a changing label value forks a new series and breaks the single continuous line per account that we want.status/name belong in Grafana display mapping, not in the series identity.
Gauge semantics — the one correctness trap
The consumer's _deduplicate (prometheus_service.py) sums samples that
share a label-set + timestamp, and _process_and_flush stamps processing-time
timestamps (not event time). That is correct for additive histogram/counter
series, but wrong for a gauge: if two events for the same account land in one
flush — e.g. an SQS redelivery/retry colliding with a fresh tick — summing would
report completed_steps = 10 instead of 5.
The plan therefore adds gauge-aware dedup (last-value-wins per label-set +
timestamp) for the onboarding path rather than reusing the summing dedup.
Concretely: generalize _deduplicate to take a merge strategy (sum for the
existing histogram/counter paths, last for gauges), or add a parallel
_deduplicate_gauges. This must be covered by a test that feeds two events for
the same account_id in one batch and asserts the latest value wins (not the
sum).
Scope of which records to emit
Emit one gauge per onboarding record that represents a real account, to keep active-series cardinality bounded and meaningful:
- Include records that have engaged:
status in {visited, registered, submitted, completed}orcompleted_stepsnon-empty. This shows accounts climbing from their first step and keeps completed accounts pinned at 8/8 (so the stair-step to completion stays visible). - Exclude never-engaged, expired
pendingleads, which would otherwise emit a permanent flat-line-at-zero series forever.
Cardinality ≈ number of real onboarding accounts (bounded, grows slowly — each account onboards once). This filter is the main tunable knob; document it where it's defined.
Component changes
1. components/ecom_metrics_consumer — accept the new event type
models.py: addOnboardingProgressMetricEvent(type: Literal["onboarding_progress"],env: str,accountId: str,completedSteps: int,totalSteps: int | None,messageId: str).lambda_function.py: add anelif event_type == "onboarding_progress":branch that validates and collects these events, then dispatches them viaprom.process_onboarding_events(...), folding failures intofailed_message_idsand counting them in theparsed_*/successlog/outcome accounting (don't leave the new type out ofall_parsed_ids).prometheus_service.py: add_build_gauge_timeseries(name, value, labels, ts),_onboarding_event_to_timeseries, andprocess_onboarding_events; route the gauge series through last-value-wins dedup (see "Gauge semantics" above).
2. components/admin_lambda — compute and emit (the producer)
- New
services/onboarding_metrics_service.py(or equivalent): list records via the existing onboarding data service, apply the scope filter, build oneonboarding_progressevent per account (accountId = system_account_id,completedSteps = len(completed_steps or []),totalSteps = <defined steps>,env = FULL_ENV), and enqueue them. Usesqs.send_message_batch(10/call) to minimise API calls. Be resilient per the pixel-health batch pattern: one account failing must not abort the tick; return a summary{emitted, skipped, errors}. - A thin SQS client/helper reading the queue URL from
ECOM_METRICS_QUEUE_URL(mirrorecom_indexer/job_metrics.py). run_lambda.py: register action"onboarding.emit_completion_metrics"in_ACTION_HANDLERS, wired throughdependencies.pylike the other onboarding actions.- Reuse a single source of truth for the step set / count (import the existing
AUTOMATABLE_STEPS | INDEX_AUTOMATABLE_STEPS | {pixel_customized}) rather than hard-coding 8, so the denominator stays correct as steps evolve.
3. infra/admin/stacks/admin_stack.py — wiring
- Grant the admin_lambda role
sqs:SendMessageon theEcomMetricsQueueARN. Same account, so derive the ARN/URL from the deterministicconfig.envify("EcomMetricsQueue", …)name (or add the name/url to admin config). No cross-account policy needed. - Set
ECOM_METRICS_QUEUE_URLin the admin_lambda environment. - Add a second EventBridge rule
(
config.envify("OnboardingCompletionMetricsSchedule", …),events.Schedule.rate(Duration.minutes(1)),enabled=config.is_prodish) targeting the lambda with{"action": "onboarding.emit_completion_metrics"}— exactly mirroring the existingsetup_pixel_health_schedule.is_prodishkeeps dev/branch cells from emitting.
4. Grafana dashboard (manual) + optional admin link
Grafana dashboards are not version-controlled in this repo (managed in the workspace). After metrics flow:
- Create a time-series panel:
onboarding_completed_steps{env="$env", account_id=~"$account_id"}, legend{{account_id}}, with anaccount_idmulti-select template variable so the chart can be filtered to a selection of accounts. Optionally overlayonboarding_total_stepsor compute percent-complete. - Optionally add a Grafana link from the onboarding admin view, mirroring the
existing
buildIndexGrafanaUrl/buildGrafanaDashboardUrlhelpers inadmin_worker.
Testing
- Consumer: new-type parsing/dispatch in
lambda_function; gauge series construction; gauge last-value-wins dedup (two same-account_idevents in one batch → latest value, not the sum); unknown-type still skipped; partial batch failure accounting includes onboarding messages. - Producer (admin_lambda): count =
len(completed_steps); label values (account_id = system_account_id,env = FULL_ENV); scope filter includes/excludes the right records; one account's failure doesn't abort the batch;send_message_batchchunking at 10; summary shape. - Infra: snapshot/assertion that the 1-min rule exists, is
is_prodish-gated, targets the right action, and that the role hassqs:SendMessageon the queue.
Open questions / confirm during implementation
- Confirm
FULL_ENVin the admin account resolves to the sameenvstring thatecom_indexeremits (config.full_env) so the dashboards'envfilter matches. If they differ, normalise at the producer. - Confirm the scope filter against real data (how many expired
pendingleads exist) to validate the cardinality assumption. - Decide whether
onboarding_total_stepsis worth emitting now or deferred.
Suggested implementation breakdown (child beads)
ecom_metrics_consumer: addonboarding_progressgauge event + gauge dedup + tests.admin_lambda: onboarding metrics producer service + SQS helper + action handler + tests.infra/admin:sqs:SendMessagegrant,ECOM_METRICS_QUEUE_URLenv, 1-min EventBridge rule.- Grafana dashboard + optional
admin_workerlink + doc update.