Ecom Onboarding Design Doc
This document describes a new, mostly automated customer onboarding workflow that helps new customers get set up quickly and easily, with minimal manual effort and back-and-forth from solutions engineers (SEs) and more consistent results for customers. It's an input to the agents that will build the feature.
Implementation plan: See ecom_onboarding_implementation.md for all resolved design decisions, component topology, phased implementation plan, and risk register.
Executive Summary
Onboarding is slow, manual, and error-prone, and won’t scale with more customers. We will automate everything that we can and drop customers right into an onboarding flow to help them set themselves up.
Background
- Onboarding to the ecom platform requires many manual steps across many inconsistent surfaces. It's time consuming and error prone for SEs, and won't scale when onboarding more customers faster.
- There's no single source of truth for how complete a customer's onboarding is.
Proposed Solution
Create a "golden path" for getting a new customer from zero to a standard "completely onboarded" milestone: an ecom index with standard configs and some docs, plus an installed pixel that is ingesting events and updating those docs.
This is more than just a UI (though that is part of it); it's a series of tasks that are automated behind API calls, triggered automatically at the appropriate times.
- An Admin UI "Onboarding" page to start a new onboarding workflow by filling out as much for the customer as we can.
- A typical onboarding UX that the customer steps through to sign up and receive their tasks.
- A single pane of glass in the Admin UI for SEs to monitor, with health checks for each stream of onboarding (primarily index and pixel setup)
- A checklist in the Console for the customer to follow with the current status, visible to all members, with links to detailed docs for each step.
Functional Requirements
- Admin UI to create onboarding record and magic link
- Admin UI to observe onboarding status and progress, and list tasks required by SEs
- Admin APIs to automate otherwise-manual tasks (e.g. trigger customer pixel infrastructure setup)
- Onboarding UI for customer
- Checklist UI for customer
Non-Functional Requirements
- Security: magic links expire if unused and can’t be reused
- Simplicity: hide the machinery of the Marqo Cloud systems from both SEs and customers
Out of Scope
Not for MVP, but useful future features:
- Regenerate magic link for a customer that didn’t onboard in time, or if something went wrong (create a new onboarding record against the same account ID).
- Allow existing user to follow onboarding flow
- Capture storefront platform (e.g. Shopify or other), mode (headless or embedded), and checkout sandboxing (yes or no) to determine what else we need to produce. MVP captures only web vs mobile.
- Handling non-standard onboarding for e.g. multiple indexes or pixels.
- Regular check that customer read and write requests are not failing with 4XX (e.g. customer sending missing image URLs, or invalid search params), updating the "Index" health check status
- Regular check with test queries (canary test suite) that search is functional, updating "Index" health check status if not
- Regular check that incoming Pixel event docIds match index doc_ids, updating the "Pixel" health check status
- Proactive checks like whether the customer’s website’s CSP will allow fetching the Pixel script
- Pixel script telemetry to validate that it’s running on the customer site reliably (whether or not it’s producing events)
- Gradual rollout of Pixel script changes or Pixel feature flags: could have a place, but not necessary.
- GET /pixel/:id extended data: estimated event ingestion rate, ETL schedule/last update, Parquet file analysis. MVP returns core status only.
- Step Functions orchestration for the onboarding workflow (currently batch API calls in Admin endpoint).
- Migrating pixel configuration out of GitHub source code into DDB (removes PAT dependency).
UX Design
Onboarding Flow
- A wizard UI with a series of "screens" that the user clicks "Next" through. Similar to the current signup flow, but probably larger in size (and more screens).
Screens:
- Welcome to Marqo Cloud! We’re thrilled to have you on board. Let’s quickly get you set up.
- Sign up
- Standard email/password or Google registration form embedded
- Resolved: Controller detects onboarding context, creates Cognito user, creates OWNER membership to the pre-created account. Does NOT create a new account. Separate view from existing signup to avoid regression.
- Pixel:
- Show Customer ID
- Select storefront type: web or mobile (MVP; platform selection is future scope)
- Show web snippet for web storefront (“Don’t worry, you’ll be able to see this later.”)
- Index:
- Let’s get you set up with your first index
- Input index name
- Choose index type: dropdown populated from keys of create_index_config map in account’s DEFAULT_CONFIGS record (e.g. ecommerce, fashion, text)
- For any index type, optionally enter a custom model ID to override the default model in the config
- Enter a list of collapseFields
- Select index tier (staging or production) — separate dimension from type. Staging = smaller/cheaper infra, production = full-scale.
- Resolved: Index creation does NOT fire until the full onboarding form is complete and submitted. Config is immutable and creation is expensive; customer may want to go back and change it.
- Team
- Invite other members of your team now, or skip to do it later.
- Table of invites, + icon button to add a row, rubbish bin icon button to remove
- Email, login method, role (default Member, else Merchandiser)
- Pre-populated from SE prefill data, editable by customer.
- You’re all set here! The next steps will be indexing your product catalog, and installing your Pixel snippet to start ingesting user interaction data. Click the checklist button in the Console to see what’s left to do.
- On submit: All actions fire immediately — index creation (via ecom API), team invites, pixel_id → index association.
Progressive save:
- PATCH fires on every "Next" button click + attempt save on beforeunload using navigator.sendBeacon().
- Mid-step data may be lost if browser closes, but previous steps are preserved.
- Controller accepts both PATCH and POST for saves (sendBeacon sends POST).
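A minimal sketch of the progressive save described above, assuming the saved form state is a flat dictionary of field values. The function name `apply_partial_save` and the field names in the usage are hypothetical, not part of the existing Controller.

```python
from typing import Any

# POST is accepted alongside PATCH because navigator.sendBeacon() sends POST.
ALLOWED_METHODS = {"PATCH", "POST"}


def apply_partial_save(
    method: str, saved: dict[str, Any], partial: dict[str, Any]
) -> dict[str, Any]:
    """Merge a partial form payload into the previously saved form state."""
    if method.upper() not in ALLOWED_METHODS:
        raise ValueError(f"unsupported method: {method}")
    merged = dict(saved)    # copy, so the stored state is not mutated in place
    merged.update(partial)  # newer values win; untouched fields are preserved
    return merged
```

Because each save only overwrites the fields it carries, data from earlier steps survives even when a later step is never saved.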
Auth flow:
- /onboarding route works both unauthenticated (steps 1-2) and authenticated (steps 3+).
- Console checks onboarding state and resumes where user left off on reload.
- Add /onboarding to console's publicPaths so the auth guard doesn't redirect.
Sources:
- Page: components/console/src/pages/onboarding/OnboardingWizard.tsx
- Form config: components/console/src/form-configs/onboarding/onboarding.forms.ts: a single form for the whole page, incrementally populated. (Try the existing declarative form pattern, but if it's too fiddly, then eject and just use raw Formik and inputs.)
- API module: components/console/src/api/onboarding/api.ts (uses a separate axios instance without auth interceptor for unauthenticated calls)
Onboarding Checklist
The tasks that the customer is required to perform, presented in the Console. When the customer clicks the Checklist button in the header bar (next to current theme toggle button), toggle showing the checklist in the right side bar.
Resolved: Generalize the existing SlideDrawer from components/console/src/pages/merchandise/SlideDrawer.tsx (right MUI drawer with SidebarContext, resizable, close button). Lift to app layout level so it’s accessible from any page.
Visibility: Console calls GET /api/onboarding/checklist for the current account. If an onboarding record exists → show checklist button in header (next to theme toggle). If not → hide button.
- “What’s left to do:”
- Index [STATUS]
- Create your first index
- Add some documents
- Pixel [STATUS]
- Install your Pixel snippet (only for web storefronts)
- Install your checkout Pixel snippet (only for Shopify integrations with sandboxed checkout flows)
- Send some Pixel events (only for non-web storefronts)
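The conditional items above could be selected as in the following sketch. The function name and the `sandboxed_shopify_checkout` flag are assumptions for illustration; the MVP captures only web vs mobile, so the flag defaults to off.

```python
def checklist_items(
    storefront_type: str, sandboxed_shopify_checkout: bool = False
) -> dict[str, list[str]]:
    """Return the customer-facing checklist tasks, grouped by stream."""
    index_tasks = ["Create your first index", "Add some documents"]

    pixel_tasks: list[str] = []
    if storefront_type == "web":
        # Web storefronts install the snippet rather than sending events directly.
        pixel_tasks.append("Install your Pixel snippet")
    if sandboxed_shopify_checkout:
        # Only relevant for Shopify integrations with sandboxed checkout flows.
        pixel_tasks.append("Install your checkout Pixel snippet")
    if storefront_type != "web":
        pixel_tasks.append("Send some Pixel events")

    return {"Index": index_tasks, "Pixel": pixel_tasks}
```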
Sources:
- Component: components/console/src/components/onboarding/ChecklistDrawer.tsx
- Context: components/console/src/contexts/onboarding-context.tsx (provides hasOnboarding, checklistOpen, toggleChecklist, checklistData)
- Header: components/console/src/components/dashboard-navbar/dashboard-navbar.component.tsx (add checklist button between ThemeToggle and AccountButton)
Admin UI
New Admin UI to manage onboarding records (components/admin_worker/app/routes/onboarding.tsx)
- Table of onboarding records that have been created
- Each with a status summary and link to details page
- "New Onboarding" button opens create form
- SE pre-fills known customer data: index type, custom model, collapseFields, invitees, etc.
- Account IDs are auto-generated on submit. Cell ID is auto-selected (only 1 cell currently).
- On submit: Admin Lambda creates account (calls Controller), creates pixel account (runs infra script), creates onboarding record, returns magic link.
Onboarding details page (components/admin_worker/app/routes/onboarding.$onboardingId.tsx)
- Magic onboarding link and when it expires
- Contents of the onboarding record
- Onboarding checklist as the customer sees it
- SE checklist, with checkboxes to confirm tasks complete
  - Resolved: SE tasks are mostly manual pixel script customization (building customer-specific website.ts logic). SE marks customization "Done" → pixel status becomes CONFIGURED.
Add "Onboarding" to the sidebar navigation in components/admin_worker/app/components/layout/sidebar.tsx.
Data Design
Storage
Resolved: New {env}-OnboardingTable DDB table with a new shared library components/onboarding_service/ (pattern: components/merchandising_service/). Both Controller and Admin Lambda depend on the shared library. Avoids layering violations between Controller and ecom.
DDB key design:
| Field | Value |
|---|---|
| pk | ONBOARDING#{onboarding_id} |
| sk | DETAILS |
| gsi1pk | ONBOARDING_LIST |
| gsi1sk | ISO timestamp of created_at |
GSI1 enables efficient listing of all onboarding records in the Admin UI, sorted by creation date.
For looking up onboarding by account: either GSI2 (gsi2pk=ACCOUNT#{system_account_id}) or scan with filter (acceptable given low record volume).
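As a rough illustration of the key design, the item keys could be built as below. The helper name `onboarding_keys` is hypothetical, and normalising to UTC with fixed millisecond precision is an assumption made so that the string sort order of gsi1sk matches chronological order.

```python
from datetime import datetime, timezone


def onboarding_keys(onboarding_id: str, created_at: datetime) -> dict[str, str]:
    """Build the DDB key attributes for one onboarding record.

    `created_at` must be timezone-aware; it is converted to UTC so that all
    gsi1sk values share one offset and sort lexicographically by time.
    """
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    created_utc = created_at.astimezone(timezone.utc)
    return {
        "pk": f"ONBOARDING#{onboarding_id}",
        "sk": "DETAILS",
        "gsi1pk": "ONBOARDING_LIST",  # one partition holds the whole list
        "gsi1sk": created_utc.isoformat(timespec="milliseconds"),
    }
```

Listing for the Admin UI is then a single GSI1 query on gsi1pk = ONBOARDING_LIST, read in descending gsi1sk order.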
Onboarding Record
- onboarding_id: UUID, also serves as the magic link shared secret
- created_at: when the link was generated
- created_by: email of Marqo staff that created the link
- cell_id: Cloud cell the account is in (auto-selected; only 1 cell currently)
- visible_account_id: Visible account ID of the pre-created account
- system_account_id: System account ID of the pre-created account
- pixel_id: ID of the pre-created pixel account
- prefill_state: map of field name to preset field value (can be changed by customer during onboarding flow, but saved for prefill and later inspection)
- created_note: note from the creator with any context they want to add for later inspection. Not exposed to the customer.
- valid_until: timestamp after which the onboarding record/magic link can't be used for signup (30 days from creation)
- visited_at: when the customer first visited the link
- registered_at: when the customer completed the user registration step
- registered_by: user ID of the user registered via the link
- submitted_at: when the customer completed and submitted the full onboarding form
- status: pending|visited|registered|submitted|completed
- Then all the fields the customer will progressively complete:
  - Account:
    - Name
  - Index:
    - name
    - index type (key from create_index_config map, e.g. ecommerce, fashion, text)
    - custom model (optional override for any index type)
    - collapseFields: str[]
    - index tier: staging|production
  - Pixel:
    - Storefront type (web or mobile). MVP only; platform selection is future scope.
  - Team:
    - Invitees:
      - Login method
      - Role (member or merchandiser)
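The status values line up with the lifecycle timestamps on the record. The sketch below shows one way the two could be kept consistent. It assumes status is derivable from the timestamps, and it uses a hypothetical `checklist_complete` flag to stand in for whatever signal marks the record as completed, which this document does not define.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class OnboardingTimestamps:
    visited_at: Optional[datetime] = None
    registered_at: Optional[datetime] = None
    submitted_at: Optional[datetime] = None


def derive_status(ts: OnboardingTimestamps, checklist_complete: bool = False) -> str:
    """Map lifecycle timestamps to pending|visited|registered|submitted|completed."""
    # Check the latest milestone first so the furthest progress wins.
    if ts.submitted_at is not None:
        return "completed" if checklist_complete else "submitted"
    if ts.registered_at is not None:
        return "registered"
    if ts.visited_at is not None:
        return "visited"
    return "pending"
```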
Health check schema
The health or readiness of the customer's account is determined according to a number of dimensions, each of which has multiple states that they pass through to become ready.
Resolved: Admin Lambda aggregates health check status across all dimensions. Console gets basic status from Controller via GET /api/onboarding/checklist.
Internal Checklist
- Account:
  - User is signed up with status ACTIVE
  - Account exists with membership with status active
  - User has account membership with role OWNER
  - Account has user membership with role OWNER
  - Account is initialized for ecom (i.e. has INDEX#DEFAULT_CONFIGS record)
  - If yes, status: READY
  - If no, status: PROVISIONING
- Team:
  - At least one user has been invited to join (health status: PENDING)
  - At least one user has accepted (health status: READY)
- API keys:
  - At least one API key exists (health status: READY)
- Index:
  - At least one index exists
    - index is state CREATING (health status: CREATING)
    - index is state READY (health status: READY)
- Pixel:
  - At least one pixel account exists (else status: MISSING)
    - an entry exists in customer_config_data.ts in pixel repo on main
    - infrastructure is provisioned for the customer ID
    - the mapping from pixel ID to system account ID is defined in the analytics account
    - At least one pixel is attached to at least one index (in ecom index settings)
  - If no to any, status: PROVISIONING
  - If yes to all, status: PENDING (waiting for data)
  - If events are flowing into the ingest endpoint, status: RECEIVING
  - If _pixel_* fields are being written to some docs, status: READY
    - Perhaps later a separate validation that all incoming docId values match _ids of docs in the index; if not something like status: MISMATCH. Not for MVP.
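The Pixel dimension has the most states, so here is a sketch of how its checks could collapse into a single status. The `PixelChecks` field names are hypothetical; the aggregation order (MISSING, PROVISIONING, PENDING, RECEIVING, READY) follows the checklist above.

```python
from dataclasses import dataclass


@dataclass
class PixelChecks:
    account_exists: bool
    config_entry_on_main: bool
    infra_provisioned: bool
    account_mapping_defined: bool
    attached_to_index: bool
    events_flowing: bool = False
    pixel_fields_written: bool = False


def pixel_status(c: PixelChecks) -> str:
    """Collapse the individual Pixel checks into one health status."""
    if not c.account_exists:
        return "MISSING"
    provisioned = all(
        [
            c.config_entry_on_main,
            c.infra_provisioned,
            c.account_mapping_defined,
            c.attached_to_index,
        ]
    )
    if not provisioned:
        return "PROVISIONING"
    if c.pixel_fields_written:
        return "READY"      # events have made it all the way into the docs
    if c.events_flowing:
        return "RECEIVING"  # ingest endpoint sees events, docs not yet updated
    return "PENDING"        # fully provisioned, waiting for data
```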
API Design
Controller
New Django app at components/controller/onboarding/. Wraps onboarding_service shared library for DDB access.
Unauthenticated endpoints (magic link flow):
- GET /api/onboarding/:id
  - Validate: not expired (valid_until > now), not already submitted
  - Set visited_at on first visit
  - Return: prefill_state + progressive form fields + checklist status
- PATCH /api/onboarding/:id/save (also accepts POST for navigator.sendBeacon())
  - Validate onboarding_id exists and is not expired
  - Write partial form fields to DDB
- POST /api/onboarding/:id/signup
  - Critical integration: Creates Cognito user via existing helpers, but does NOT create a new account. Creates OWNER membership to the pre-created account. Separate view from existing signup.py to avoid regression.
  - Sets registered_at, registered_by on OnboardingRecord
  - Returns auth token
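A minimal sketch of the GET validation, assuming the record is a plain dictionary with ISO 8601 timestamp strings. The function name, the `form_fields` key, and the choice to answer an unknown ID with the same 401 as an expired link are assumptions; checklist status is omitted for brevity.

```python
from datetime import datetime
from typing import Any, Optional


def load_for_visit(
    record: Optional[dict[str, Any]], now: datetime
) -> tuple[int, dict[str, Any]]:
    """Validate a magic link visit and return (HTTP status, response body)."""
    if record is None:
        return 401, {}
    expired = datetime.fromisoformat(record["valid_until"]) <= now
    if expired or record.get("submitted_at"):
        return 401, {}  # expired or already used: the link cannot be reused
    if not record.get("visited_at"):
        # Stamp only the first visit; later visits keep the original value.
        record["visited_at"] = now.isoformat()
    return 200, {
        "prefill_state": record.get("prefill_state", {}),
        "form_fields": record.get("form_fields", {}),
    }
```

The caller is expected to persist the mutated record when visited_at changes.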
Authenticated endpoints:
- POST /api/onboarding/:id/submit
  - Fires: index creation (via ecom API), team invites (via member service), pixel_id → index association
  - Sets submitted_at, updates status
- GET /api/onboarding/checklist
  - Queries for OnboardingRecord matching current user's account
  - Returns checklist status items or 404 if no onboarding
Admin Service
Onboarding record management (components/admin_lambda/admin_lambda/routes/onboarding_routes.py)
- POST /api/v1/onboarding
  - Resolved: On submit, Admin Lambda orchestrates: (1) generate UUID onboarding_id, (2) call Controller to create Marqo account with placeholder Airwallex card, (3) run manage_customer_infrastructure.py directly for pixel infra, (4) write to GitHub repo via PAT (customer-index-data.ts + placeholder website.ts), (5) create OnboardingRecord in DDB, (6) return magic link URL + onboarding_id.
- GET /api/v1/onboarding
  - List all onboarding records (via GSI1 query, sorted by created_at desc)
- GET /api/v1/onboarding/:id
  - Return full record + aggregated health check status across all dimensions
- PATCH /api/v1/onboarding/:id
  - SE updates: notes, mark pixel customization done, etc.
- DELETE /api/v1/onboarding/:id
  - Revoke the onboarding invitation
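The six-step POST /api/v1/onboarding orchestration could be sketched as a plain sequence of calls. Every callable here is a hypothetical stand-in injected by the caller (the real Controller call, infra script, and GitHub write are not shown), and the sketch does not attempt rollback if a later step fails.

```python
import uuid
from datetime import datetime, timedelta
from typing import Any, Callable

LINK_TTL = timedelta(days=30)  # magic link validity window


def create_onboarding(
    created_by: str,
    now: datetime,
    create_account: Callable[[], dict[str, str]],
    provision_pixel: Callable[[], str],
    write_pixel_config: Callable[[str], None],
    save_record: Callable[[dict[str, Any]], None],
    console_url: str,
) -> dict[str, str]:
    onboarding_id = str(uuid.uuid4())   # (1) doubles as the magic link secret
    account = create_account()          # (2) Controller pre-creates the account
    pixel_id = provision_pixel()        # (3) pixel infrastructure
    write_pixel_config(pixel_id)        # (4) pixel repo config via PAT
    save_record(                        # (5) OnboardingRecord in DDB
        {
            "onboarding_id": onboarding_id,
            "created_at": now.isoformat(),
            "created_by": created_by,
            "valid_until": (now + LINK_TTL).isoformat(),
            "system_account_id": account["system_account_id"],
            "visible_account_id": account["visible_account_id"],
            "pixel_id": pixel_id,
            "status": "pending",
        }
    )
    return {                            # (6) link for the SE to share
        "onboarding_id": onboarding_id,
        "magic_link": f"{console_url}/onboarding?key={onboarding_id}",
    }
```

Injecting the steps keeps the sequence testable without AWS or GitHub access, and would make a later move to Step Functions a matter of re-hosting the same steps.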
Automation of actions (components/admin_lambda/admin_lambda/routes/pixel_routes.py)
Resolved: Full automation for MVP. Admin Lambda runs manage_customer_infrastructure.py directly (no GitHub workflow trigger). GitHub PAT stored in AWS Secrets Manager.
- POST /api/v1/pixel {type: "web", name: "Customer", personalization: bool}
  - Create a new Pixel account
  - Update customer-index-data.ts
  - Create a placeholder website.ts in the appropriate dir
  - Run manage_customer_infrastructure.py for infra provisioning
- GET /api/v1/pixel/:id
  - Resolved: Core status only for MVP:
    - ID
    - Status (per onboarding)
    - Customer name
    - Type
    - Personalization enabled?
    - Associated account(s)/index(es) in Analytics DDB
    - Indexes for which the pixel_id is set to this Pixel
    - Whether events are flowing
  - Deferred to future: estimated event ingestion rate, ETL schedule/last update, Parquet file analysis, script presence/content details
Security Design
Repeated access
- The onboarding flow must not be accessible in the console by customers without a magic link, i.e. the link will contain a key that is validated by the server (controller).
  - When an SE generates a magic link via admin UI, the Admin Lambda creates an account, pixel, and ONBOARDING#{onboarding_id} record in the {env}-OnboardingTable, where onboarding_id is a UUID shared secret (a 128-bit value; a random version 4 UUID carries 122 bits of entropy), and returns the link for the SE containing the onboarding_id.
  - New route in console at /onboarding?key={onboarding_id}, which on loading extracts the key param and sends it to GET /api/onboarding/:id on the Controller, which validates the onboarding ID (shows a spinner in the meantime, not the onboarding flow).
  - The secret is valid for 30 days from generation. After 30 days, or once the user has completed the whole onboarding flow, the onboarding record/magic link cannot be used again (it's basically an invite token for one user that expires on use). Visiting the link after that will return a 401 and trigger Console's auto-logout.
  - Rate limiting on unauthenticated endpoints to prevent abuse.
Controller ↔ Admin Integration (OUT OF SCOPE)
A lot of the onboarding automation on the backend will happen through the Admin Service, which customers (and hence Console/Controller) do not have access to.
If some customer action should trigger an Admin API, rather than opening an auth hole in the Admin API, a Lambda would listen for a DDB stream from the UsersAccountsTable (filtering for onboarding records) and call the Admin API when it happens (with a permissioned IAM role).
NOTE: Not required for the current scope, since customer changes need not trigger admin calls.
- So we need a new “onboarding executor” Lambda, aka a new component components/onboarding_executor, largely following the design and deployment of ecom_settings_exporter but part of the Admin system, not Controller or Ecom. onboarding_executor is provisioned as part of the Admin system in infra/admin by a new OnboardingStack. The lambda would run with a role that is allowed to invoke the admin API gateway without a CF JWT (may require a different internal domain, not sure)
- This DDB-triggered behaviour need not generalise to ongoing account/index/pixel health/readiness, since ongoing issues like new pixel data not matching docs in the index would not be detected by DDB triggers or recorded in the ONBOARDING record.
GitHub Integration for Pixel
Pixel config lives in source code in the pixel GitHub repo, and the Admin Service needs to access it. Fetching this alone requires read access which would expose all the code and configs, which is overbroad for the narrow requirement (even if only internal). However the Admin Service also needs to write new Pixel configs to the repo, which requires either direct write access or some blessed channel for getting data written (e.g. running a specific GitHub workflow). Either way, automated writing to the repo risks accidentally writing bad content and breaking CI, or worse, breaking logic of Pixel scripts that still manage to get published.
For a rapid MVP, write access is probably acceptable given some mitigations, such as ensuring CI prevents bad changes from getting out. But a safe and robust solution must be a fast follow.
For the customer infra creation at least the Admin Service could just run the same Python script that the workflow runs.
Implementation Details
There are some tricky bits to get right.
Account creation
- The best UX is to create the account at the same time the magic link is generated so there’s no need to slow the customer down with questions, latency, or possible bugs.
- However account creation is currently only done by a customer. We don’t support creating an account via APIs without a user doing it, so that needs to be enabled to pre-create the account.
- Resolved: Admin Lambda calls Controller’s existing account creation APIs to pre-create the account. Controller handles Cognito + Stripe + DDB.
- Resolved: Hardcode a real Airwallex card with a $1 limit (details stored in AWS Secrets Manager, loaded at deploy time into env var) to satisfy the Stripe integration logic. Temporary until billing is removed. Not exposed to customers.
- We could remove Stripe billing integration altogether, but it will take time and create risk of breaking existing logic that expects it.
Workflow Processing
A Step Functions state machine would seem like a good fit for this onboarding workflow. However, since almost everything for initial onboarding can be done upfront at magic link generation time, we can more easily just batch some API calls in an Admin endpoint.
In the future (out of scope for now) we may choose to migrate the batch of calls to a Step Function.
Pixel Creation
Setting up a new Pixel requires the most inconsistent tasks. Automating these will require new permissions for the Admin Service:
- Creating an account in the Analytics AWS account will require either an API or a role to assume
- Creating the Pixel script for the snippet to download requires either pushing code to the Pixel repo or pushing a generated file to R2 directly.
- We do actually want the code in Git to iterate on, and ideally an agent to iterate on it automatically and raise PRs.
- However for the initial onboarding we could just push a generic Pixel website script that checks all the usual suspects (e.g. search page if ?q or ?query or ?search etc.) to get some data flowing into the Pixel Event Tracking API while waiting for the real script to be developed.
- Updating Pixel Event Tracking API to accept the new customer ID requires writing to Git.
- Creating customer infrastructure (SQS queue etc.) requires running - and approving in prod - a GitHub workflow.
- Checking for events received requires read access to the Pixel event data S3 bucket.
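To make the "usual suspects" heuristic for the generic Pixel website script concrete, here is the search-page check sketched in Python. The real script would run in the browser; the function name and the exact parameter list are assumptions based on the examples given above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Query parameters commonly used for the search term, checked in this order.
SEARCH_PARAMS = ("q", "query", "search")


def search_term(url: str) -> Optional[str]:
    """Return the search term if the URL looks like a search page, else None."""
    params = parse_qs(urlparse(url).query)  # blank values are dropped by default
    for name in SEARCH_PARAMS:
        values = params.get(name)
        if values:
            return values[0]
    return None
```

A URL such as https://shop.example/search?q=red+shoes would yield the term "red shoes", which the generic script could report as a search event until the customer-specific script is ready.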
Resolved: Full automation for MVP. Admin Lambda runs manage_customer_infrastructure.py directly (no GitHub workflow trigger — faster, no GitHub dependency for infra). GitHub PAT stored in AWS Secrets Manager, fetched at runtime. Writes to the pixel repo via GitHub REST API.
Most of these actions require generous permissions on the pixel GitHub repo (e.g. a PAT). For an MVP this is probably tolerable, but it’s prone to abuse for overbroad GitHub operations or accidentally breaking CI. Possibly the PAT could be restricted to only running workflows (maybe only some workflows via some validation steps), which would limit the risk. We could also still require human approvals to deploy prod changes (another box on the SE’s checklist with a link and status). However a better medium-term solution might just be to externalise all the parts that depend on source code to DDB (out of scope for now).
Testing
With multiple systems interacting (Controller, Console, Admin API, Admin UI, Pixel), end-to-end (E2E) tests are essential to verify the whole workflow holds together.
A full E2E test of the happy path would go as follows:
- Playwright navigates to Admin UI > Onboarding > Create, pre-fills some data, and hits Save
- The Admin UI navigates to the Onboarding details page, and the magic link is extracted
- Verify the account was created
- Another Playwright tab navigates to the magic link
- Verify it shows the first page of the onboarding flow with a signup form
- Playwright creates a new user with username/password, then fetches confirmation code from MailSlurp inbox and enters it.
- Verify onboarding flow is shown
- Playwright clicks through the onboarding flow and inputs some data
- Fill an empty field
- Change a pre-filled field
- Check that the Admin UI onboarding state is updated
- Next next next
- Does choose to create an index
- Reload the page and verify it’s in the same state with no data lost
- Playwright completes the onboarding flow
- Verify on Console home page and checklist button present
- Verify the index exists
- Click the checklist button
- Verify the checklist shows up on the sidebar and the state is correct.
- Wait until the index is READY, then add some docs
- Verify the jobs succeed and the docs can be queried
- Verify the checklists on both Console and Admin UI are updated
- Playwright creates an index through the index onboarding flow
- New Playwright tab browses to a dev store storefront, injects the Pixel snippet with the expected script URL, and clicks around
- Verify the script exists and publishes events successfully
- Admin Playwright refreshes the onboarding details page
- Verify the Pixel status has updated to RECEIVING
Documentation
To make customer onboarding scalable, the steps that aren’t fully automated must at least be self-service. This requires high-quality docs to avert routine questions.
Docs for customer-facing tasks will live in the Marqo docs (e.g. Event Tracking Pixel for Pixel setup).
Docs for SEs will live in Library (repo), published from submodule cloud_control_plane/docs.
All docs should be short and concise (linking out if necessary), easy to understand and follow, and include screenshots where UI ops are described.