Storefront Admin: SSO with Marqo Console Accounts + Scoped Tokens for CLI
- Status: Draft for review (Raynor)
- Owner: storefront-admin-auth (feature-plans team)
- Date: 2026-06-10
- Related plans: settings concurrency control (task #1), settings versioning (task #2), theme-targeted deploys (task #4) — see Cross-plan interfaces
Summary
Replace "paste a raw, never-expiring, admin-scope API key into a login form and keep it in localStorage" with:
- SSO: log in to storefront_admin with a Marqo cloud console account (Cognito email/password or Google/GitHub via Cognito Hosted UI), reusing the Controller's existing identity stack and building directly on the in-flight worktree-feat+cognito-login branch.
- Sessions: short-lived, scoped session JWTs held in an httpOnly cookie; the React Router 7 worker proxies admin_server calls and guards routes in loaders. Raw API keys and Cognito tokens never sit in localStorage.
- CLI tokens: revocable, expiring, scope- and shop-limited tokens minted in the storefront_admin UI, stored hashed in DynamoDB, replacing raw --api-key usage in scripts/ecom/storefront_settings.py and agent tooling.
- Scopes: settings:read, settings:write, settings:deploy_live, tokens:manage, plus per-shop scoping, enforced by a new unified auth dependency in admin_server. Legacy raw-API-key callers keep working unchanged during a metered deprecation.
1. Current state (verified, with evidence)
1.1 storefront_admin (the editor)
React Router 7 app on a Cloudflare Worker (components/storefront_admin), deployed at shopify.marqo-ep.ai (prod), staging-shopify.dev-marqo.org, preprod-shopify.dev-marqo.org (components/storefront_admin/wrangler.toml). No KV/DO bindings; the worker only validates env (Zod) and renders (workers/app.ts:17-41, app/.server/env.ts:7-13 — only ENV, FULL_ENV, ADMIN_SERVER_BASE_URL).
- Login = paste raw Marqo API key, validate by calling listShops(), pick a shop (app/routes/_index.tsx:36-97).
- Auth state = three localStorage keys including the raw API key; isAuthenticated: !!storedKey (app/hooks/use-auth.ts:11-13,26-40). No expiry, no server-side session, no loader guards: the editor route is gated client-side only.
- API calls go browser → admin_server directly with Authorization: Bearer <raw key> (app/lib/api-client.ts:28-31), against https://admin.ecom.marqo.ai/api/v1/storefront/*.
- Live preview queries the search proxy directly with the read-only x-marqo-index-id credential, not the API key (README "How the live preview works"); it is unaffected by this plan.
1.2 admin_server (the API)
- Storefront routes (components/shopify/admin_server/admin_server/routes/storefront_routes.py) authenticate with authenticate_api_key_request (auth/dependencies.py:220-274). The flow is: decrypt the key locally (3DES-ECB, secret from Secrets Manager; utils/api_key_utils.py) to get {system_account_id, cell, token}, then validate it against the per-cell legacy control plane POST /account/key/validate through the IAM-signed ControlPlaneGateway (components/ecom_utils/ecom_utils/control_plane/gateway.py:30-120).
- No scope check exists in admin_server. Any valid key gets full access to every storefront route and every ecom route (index delete, doc writes, etc.; routes/ecom_routes.py). The validate response does return a scope field. The Controller checks it (components/controller/account/authentication/api_key_auth.py:16,45-60, ALLOWED_SCOPES = {"read", "read_write", "admin"}), but admin_server ignores it.
- Per-shop access control exists: resolve_storefront_shop returns 403 when the shop's system_account_id doesn't match the key's (admin_server/dependencies.py:725-755).
- Author identity: writes stamp updated_by_user_id = f"api_key:{auth.system_account_id}" (routes/storefront_routes.py:174-179 → services/settings_service.py:60-89; field on models/shopify_entities.py:111). Every write by every person on an account is attributed to the same string.
- CORS is allow_origins=["*"], allow_headers=["*"] with allow_credentials=True (admin_server/main.py:66-71), so any website can make authenticated requests if it has a key.
1.3 Console identity (the thing to reuse)
The live console (cloud.marqo.ai) is the Django Controller (components/controller), backed by AWS Cognito:
- Email/password sign-in → POST /account/signin; Google/GitHub SSO → Cognito Hosted UI code → POST /account/sso exchange (controller/account/views/sso.py).
- Token verification: standard Cognito JWKS (controller/account/authentication/verification.py:23-75, AccessTokenVerifier fetching https://cognito-idp.{region}.amazonaws.com/{userpool_id}/.well-known/jwks.json), then cognito_service.get_user_complete + membership resolution (backend.py:22-110).
- Users belong to accounts via memberships with roles OWNER/MEMBER/MERCHANDISER ("MERCH") and status PENDING/ACTIVE (controller/models/users_accounts.py:27-35,242-260). MERCH is a deliberately restricted role: the console gates merchandiser users into a limited view via the IS_MERCHANDISER feature flag (controller/account/feature_flagging_checks.py:126, with a standing TODO to move to proper RBAC).
- The monolith identity_service in this repo (Cognito + GCP backends; sso() is a TODO at identity_service/index.py:81-83) is not in production use (root CLAUDE.md: "These are not currently used"). Do not build on it.
1.4 In-flight prior art: worktree-feat+cognito-login
Branch worktree-feat+cognito-login (6 commits, dc4739966..928a05831) already implements Cognito login UX for storefront_admin:
- app/lib/auth-client.ts (new): ControllerAuthClient with POST /account/signin, POST /account/sso (code exchange with redirect_uri), GET /api_keys?accountId=; pickBestApiKey() prefers admin > read_write > any; resolveApiKeyAndShops() fetches the raw API key and shop list.
- app/routes/auth.callback.tsx (new): OAuth callback; OAuth state CSRF protection; single state across Google/GitHub buttons.
- use-auth.ts: stores Cognito token + account id + raw API key in localStorage; treats Cognito token expiry as the session anchor.
- Controller side: _validate_redirect_uri checks the redirect origin against CORS_ORIGIN_WHITELIST (account/views/sso.py), config additions per env, tests (test_sso.py).
- wrangler.toml/env.ts: adds CONTROLLER_BASE_URL (e.g. https://cloud.marqo.ai/api), COGNITO_HOSTED_UI_URL (https://auth.controller.marqo.com), COGNITO_CLIENT_ID.
Gap: it solves the login UX but the security posture is unchanged — the session still degenerates to a raw admin key in localStorage, with no scopes, no expiry on the key, no revocation, and no server-side session. This plan adopts its Controller-side work and login UI wholesale, and replaces the "resolve raw API key" step with scoped session tokens.
1.5 CLI / agent usage today
scripts/ecom/storefront_settings.py (backup/push/diff helper, BASE_URL = "https://admin.ecom.marqo.ai/api/v1/storefront", line 36) takes --api-key <raw key>. The storefront CSS customization guide and agent workflows pass raw keys on the command line; keys end up in shell history, tmp files, and agent context/memory. This is the second consumer of the new tokens.
1.6 Internal-staff prior art (not reused, for contrast)
admin_worker/admin_lambda use Cloudflare Access JWTs validated by an API Gateway JWT authorizer (admin_worker/app/.server/gateway.ts:67-78; admin_lambda/admin_lambda/auth/auth.py). That stack authenticates Marqo staff via Cloudflare Zero Trust and is unsuitable for customers (they aren't in our CF org), but two patterns carry over: the worker-as-auth-proxy (gateway.ts promotes a credential into Authorization before forwarding) and the Zod refine that forbids local-dev escape hatches outside dev (admin_worker/app/.server/env.ts:36-38).
2. Threat model
Assets: shop settings (defacement of live storefront UX, malicious custom CSS/JS-adjacent injection into merchant sites), live theme deploys (task #4 — direct write access to a customer's published Shopify theme), Marqo index/account control (the same key works on /api/v1/indexes — deletion, data exfiltration), Shopify access tokens reachable through settings writes (metafield sync uses stored shop tokens, storefront_routes.py:182-208).
Actors: merchant users (per-account), Marqo staff/SE doing integrations, CI/agents running CLI scripts, attackers (XSS on the editor, stolen laptops/shell history, leaked keys in repos/docs/screen shares).
Today's weaknesses (ranked):
| # | Weakness | Impact |
|---|---|---|
| W1 | Raw admin-scope key in localStorage; any XSS in the editor (it renders merchant-controlled CSS/templates in a live preview) exfiltrates a credential that can delete indexes | Critical |
| W2 | Keys never expire and revocation = deleting the key in the console, which breaks every other consumer of that key (widget provisioning, scripts) — so in practice nobody revokes | High |
| W3 | No scopes at admin_server; the validate response's scope is ignored (auth/dependencies.py:247-254), so even a read key gets write access on storefront routes | High |
| W4 | CLI keys in argv/shell history/agent memory; one paste into the wrong place = W1 without needing XSS | High |
| W5 | No per-user identity — updated_by_user_id is api_key:{account} for everyone; no audit trail, no conflict attribution (blocks tasks #1, #2) | Medium |
| W6 | admin_server sets allow_origins=["*"] with allow_credentials=True (main.py:66-71) — Starlette reflects the request Origin when credentials are enabled, so any origin can already make credentialed cross-site requests to admin_server. Today nothing is exploitable only because auth is a Bearer header (which an attacker page can't auto-attach without the key) and no cookie auth exists. This config must be fixed in the same PR that introduces cookies (§7 step 3), not after | Medium |
| W7 | API key cipher is 3DES-ECB (utils/api_key_utils.py) — legacy format, out of scope to replace here, but new tokens must not inherit it | Low (here) |
Design responses: W1 → httpOnly cookie + worker proxy, short-lived session; W2/W4 → dedicated revocable CLI tokens, key stays in the console vault; W3 → scope enforcement dependency; W5 → actor identity on every request; W6 → CORS allowlist shipped in the same PR as cookie auth; W7 → new tokens are asymmetric-JWT/SHA-256, no new uses of the 3DES path.
3. Goals and non-goals
Goals
- Console account (Cognito) login to storefront_admin — email/password and Google/GitHub SSO.
- No raw API key or Cognito token in localStorage; sessions expire and are server-verifiable.
- Scoped, shop-bound, revocable, expiring tokens for CLI/agents, self-served from the storefront_admin UI.
- Scope + shop enforcement in admin_server on all /api/v1/storefront/* routes (and exported for task #4's deploy routes).
- Per-request actor identity available to versioning (task #2) and concurrency (task #1).
- Zero-lockout rollout: legacy API-key auth keeps working until metrics show it's unused, and remains as a documented break-glass path.
Non-goals (out of scope)
- Changing auth on non-storefront admin_server routes (/api/v1/indexes, docs, sync, collections) or the Shopify-embedded-app session auth.
- Replacing the 3DES API-key format or the per-cell /account/key/validate contract.
- Console (cloud.marqo.ai) UI changes beyond Cognito app-client callback config + CORS allowlist entries (already drafted on the cognito-login branch).
- Cognito custom claims / groups; SCIM; fine-grained roles beyond OWNER/MEMBER passthrough.
- Registering the token prefix with GitHub secret scanning (worth doing later; noted, not planned).
- search_proxy auth (the preview's x-marqo-index-id read-only credential is unchanged).
4. Recommended design
4.1 Why reuse the console Cognito identity (and not invent anything)
- It's the only production customer IdP we have. Users already exist, with verified emails, MFA-capable Cognito pools, and Google/GitHub federation via the Hosted UI (auth.controller.marqo.com).
- The Controller-side enablement (redirect-uri allowlisting, SSO code exchange for an external origin) is already written and reviewed on worktree-feat+cognito-login. We adopt those commits rather than re-deriving them.
- Cognito access tokens are standard JWTs verifiable offline via JWKS. The verification pattern already exists in-repo (controller/account/authentication/verification.py:23-75), and admin_server can copy it without calling Cognito on the hot path.
- Alternatives rejected:
  - Cloudflare Access: staff-only, and customers are not in our ZT org.
  - Building a new IdP in identity_service: explicitly not in production, and sso() is unimplemented.
  - Passing Cognito tokens straight to admin_server on every request: Cognito tokens carry no scopes/shops/account claims. That forces a membership lookup per request and makes CLI tokens a separate mechanism anyway. One token format that the server mints itself is simpler and uniform.
4.2 SSO login flow
Browser storefront_admin worker Controller (cloud.marqo.ai) admin_server
| | | |
|--- GET / (login page) ---->| | |
| [Google/GitHub] redirect to Cognito Hosted UI (state=CSRF nonce, redirect_uri=/auth/callback)
|--- GET /auth/callback?code=... -->| | |
| |-- POST /account/sso (code, redirect_uri) ->| |
| |<------- cognito access token, account list ---------------|
| |-- POST /api/v1/storefront/auth/session ------------------>|
| | {cognito_token, account_id} verifies JWKS sig locally |
| | verifies membership via |
| | Controller, mints session JWT
| |<-------------------- {session_jwt, shops, email} ---------|
|<-- Set-Cookie: __sft_session=<jwt>; HttpOnly; Secure; SameSite=Lax; redirect /editor |
- Email/password follows the same shape: the login form posts to a worker action, which calls Controller POST /account/signin server-side, then the same auth/session exchange. The password transits the worker but is never stored. This matches what the console itself does and what the cognito-login branch does client-side; moving it server-side keeps credentials out of browser JS.
- Multi-account users: Controller signin/SSO responses include the account context (get_user_accounts_and_selected, controller/account/views/account_data.py). If the user has more than one account, the worker shows an account picker before the auth/session exchange. This replaces today's shop picker step at the same spot in the UX; shop picking stays after login as today.
- OAuth state: keep the branch's CSRF nonce, but store it in a short-lived httpOnly cookie set by the worker (not localStorage), and compare it in the callback action. This cookie must be SameSite=Lax with Max-Age ≤ 10 minutes, not Strict. The return from the Cognito Hosted UI is a cross-site top-level navigation, and Strict would withhold the cookie exactly when the callback needs it.
New admin_server endpoint — POST /api/v1/storefront/auth/session:
1. Verify the Cognito access token's signature, expiry, client_id, and issuer against the pool's JWKS (new admin_server/auth/cognito_verifier.py, ported from controller/account/authentication/verification.py; JWKS cached with TTL; pool id + client id from env). Fail fast on any mismatch (CLAUDE.md: no silent fallbacks).
2. Resolve membership authoritatively via a new, thin Controller endpoint, GET /account/memberships. This endpoint is a committed PR2 deliverable, not a fallback.
   - Why a new endpoint: no existing Controller endpoint fits. sso.py and the account-context view (get_user_accounts_and_selected, controller/account/views/account_data.py) return only the user's selected account, not the membership set. Nothing confirmed today takes a raw Cognito access token and returns memberships.
   - What it does: the new view authenticates the user's own Cognito access token through the existing CognitoAuthentication backend (controller/account/authentication/backend.py:22) and returns every membership as {account_id, system_account_id, role, cell_id}.
   - The join it requires: membership records carry only {cognito_username, visible_account_id, role, status} (BaseMembershipData, controller/models/users_accounts.py:242-249). Neither system_account_id nor cell_id is on them; both live on AccountData (users_accounts.py:174-181). The view therefore does a per-account AccountData read for each membership. That is N+1 reads at login time, which is acceptable because this endpoint is never on the editor's request path.
   - The same join is the authoritative account_id ↔ system_account_id mapping source. AccountData carries both visible_account_id and system_account_id, confirming they are distinct identifiers whose mapping is held on the account record (this feeds step 3's parity gate).
   - Checks in admin_server: the requested account_id must be in that set, and the membership's status must be ACTIVE (MemberStatus, users_accounts.py:33-35); a PENDING invitee must not mint a session. The request param is a selector and is never trusted. The membership's role feeds the role→scope mapping in step 4.
   - Without this endpoint, session issuance does not ship. There is no "trust the client's account_id" interim mode.
3. Take system_account_id (and cell_id) for the JWT only from the matched membership record returned by step 2, never from client input. Console account_id and system_account_id are distinct identifiers (step 2's join), and different code keys on each: the cognito-login branch lists API keys by account_id, while resolve_storefront_shop and the settings GSI key by system_account_id. The mapping is therefore gated by a PR2 design spike with a test gate: an integration test (staging data) asserting that GET /shops under a minted session returns exactly the same shop list as under a legacy API key of the same account. PR2 does not merge until that parity test passes.
4. Mint the session JWT (4.4) with scopes derived from the membership's role (role→scope mapping, 4.4A; unknown roles fail closed) and return it with the user's shop list.
This is a login-time exchange — Controller/Cognito are not on the per-request path. If the Controller is down, existing sessions keep working; only new logins fail (and the break-glass path in §7 still works).
4.3 Session handling in the RR7 worker
- Cookie: __sft_session, httpOnly, Secure, SameSite=Lax, Path=/, Max-Age = session lifetime. Stateless (the JWT is the session); no KV needed.
- Loader guards: a shared requireSession(request, context) helper in app/.server/session.ts parses and verifies the cookie, then calls redirect("/") on a missing, expired, or invalid session. Verification is an ES256 signature check against the public verification key, a plain Wrangler var SESSION_JWT_PUBLIC_KEY (see 4.4A); the worker holds no signing material. The guard is applied in the loaders of editor.tsx, editor.$section.tsx, and the new tokens route. The root loader exposes the {email, accountId, shops} claims to the UI, so components stop reading localStorage for auth.
- API proxying: new resource route app/routes/api.proxy.$.ts.
  - The browser calls same-origin /api/proxy/<path>. The worker verifies the cookie, then forwards to ADMIN_SERVER_BASE_URL with Authorization: Bearer <session JWT>. This mirrors admin_worker's gateway promotion pattern (gateway.ts:67-78).
  - ApiClient changes only its baseUrl and drops the key parameter.
  - Allowlist: methods GET/POST/PUT/DELETE, and paths /api/v1/storefront/* and /api/v1/account only.
- CSRF: SameSite=Lax + the same-origin proxy + JSON content-type checks on admin_server cover the cookie-auth CSRF surface. With Lax, the browser attaches the cookie cross-site only on top-level GET navigations, and those cannot carry a JSON body. The proxy additionally rejects requests whose Origin/Sec-Fetch-Site headers indicate cross-site.
- Session refresh: the proxy transparently refreshes near-expiry sessions.
  - When the cookie's JWT has < 30 minutes left, the worker calls POST /api/v1/storefront/auth/session/refresh (Bearer = the still-valid session JWT).
  - admin_server re-mints with the same claims and a new exp/jti, preserving orig_iat.
  - Refresh is refused once now - orig_iat > 24h (absolute cap → forced re-login).
  - Active users never see a logout mid-edit; idle sessions die within 2h.
- Logout: clears the cookie, with an optional best-effort Controller logout. Session JWTs are not individually revocable; the mitigations are the short 2h lifetime + 24h cap (see 4.4A). A jti deny-list checked on the proxy was considered and rejected: at a 2h lifetime the added state buys little over the cap, and the deny-list itself becomes an availability dependency. The global kill switch is rotating the signing key in Secrets Manager, which logs out every user.
- CORS fix (same PR as cookies, §7 step 3):
  - The problem: admin_server today runs allow_origins=["*"] with allow_credentials=True (main.py:66-71), which makes Starlette reflect any Origin on credentialed requests.
  - Why it isn't exploitable via the cookie: the cookie never reaches admin_server. It is scoped to the worker's host (shopify.marqo-ep.ai), and the worker swaps it for a Bearer header; that swap is the actual cross-site safety property of this design.
  - Why it must still change: leaving that CORS config in place while cookie auth ships anywhere in the system is indefensible as defense in depth. The cookie-introducing PR (PR5) also changes admin_server CORS to an explicit origin allowlist (storefront_admin origins, admin app origins, localhost dev ports) and drops allow_credentials, since nothing uses cookies against admin_server.
  - The CSRF surface that does exist (the worker proxy, on the same host as the cookie) is covered by the SameSite=Lax + Origin/Sec-Fetch-Site checks above.
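The per-request proxy decisions and the refresh rule above can be sketched as three small functions. They are shown in Python for consistency; the worker side would really be TypeScript. The sketch assumes signatures have already been verified and that claims are the decoded session JWT payload. The header handling treats only Sec-Fetch-Site: same-origin as acceptable and falls back to Origin when Sec-Fetch-Site is absent; that exact policy is an assumption. The function names are hypothetical.

```python
import uuid

SESSION_TTL_S = 2 * 3600    # each session JWT lives 2h
REFRESH_WINDOW_S = 30 * 60  # proxy refreshes when < 30 min remain
ABSOLUTE_CAP_S = 24 * 3600  # no refresh once now - orig_iat > 24h


def is_cross_site(headers: dict[str, str], own_origin: str) -> bool:
    """Worker proxy check: only the editor's own same-origin fetches pass."""
    h = {k.lower(): v for k, v in headers.items()}
    site = h.get("sec-fetch-site")
    if site is not None and site != "same-origin":
        return True
    # Older browsers may omit Sec-Fetch-Site; fall back to the Origin header.
    origin = h.get("origin")
    return origin is not None and origin != own_origin


def needs_refresh(claims: dict, now: float) -> bool:
    """Worker proxy: refresh near-expiry sessions before forwarding."""
    return claims["exp"] - now < REFRESH_WINDOW_S


def remint_claims(claims: dict, now: float) -> dict:
    """admin_server refresh: same claims, new iat/exp/jti, orig_iat preserved."""
    if claims["exp"] <= now:
        raise PermissionError("session expired; refresh needs a still-valid JWT")
    if now - claims["orig_iat"] > ABSOLUTE_CAP_S:
        raise PermissionError("absolute 24h cap reached; re-login required")
    fresh = dict(claims)  # copy, so the caller's claims are left untouched
    fresh.update(iat=int(now), exp=int(now) + SESSION_TTL_S, jti=str(uuid.uuid4()))
    return fresh
```

Keeping the re-mint a pure function of (claims, now) makes the 2h/24h arithmetic easy to unit-test without a clock or a signing key.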
4.4 Token design
Two token types, one verifier (admin_server), one scope model.
A. Session JWT (UI sessions) — stateless, short-lived.
{
  "iss": "marqo-storefront-admin",
  "token_use": "session",
  "sub": "user:<cognito_sub>",
  "email": "raynor@marqo.ai",
  "account_id": "<console account id>",
  "system_account_id": "<system account id>",
  "cell_id": "<cell id>",
  "scopes": ["settings:read", "settings:write", "settings:deploy_live", "tokens:manage"],
  "shops": ["*"],
  "jti": "<uuid>",
  "iat": 1760000000,
  "orig_iat": 1760000000,
  "exp": 1760007200
}
- ES256 (asymmetric).
  - The private signing key lives in AWS Secrets Manager (STOREFRONT_SESSION_SIGNING_KEY_NAME), and only admin_server can sign.
  - The worker verifies with the public key, delivered as a plain (non-secret) Wrangler var SESSION_JWT_PUBLIC_KEY. A compromised worker config therefore exposes only verification material: it can check sessions but never mint or alter one.
  - Rotation: tokens carry a kid header. Verifiers (admin_server and worker) accept the current + previous public keys, and the signer uses the current one.
- Lifetime: 2h sliding, 24h absolute cap.
  - Each JWT lives 2 hours. The worker proxy silently refreshes it when < 30 min remain (4.3) via POST .../auth/session/refresh, which re-mints with the same claims and the original orig_iat. Refresh is denied past orig_iat + 24h.
  - Rationale: this tool mutates live storefronts, sessions are not individually revocable, and the only global remedy (key rotation) logs out everyone. The exposure window of a stolen cookie must therefore be short.
  - Resulting windows: idle theft ≤ 2h. Active attacker ≤ 24h of refreshes plus the final token's remaining lifetime (≤ 2h more). Honest users re-login at most daily.
- jti is logged with every write for audit and is offered to task #1 as a same-user-two-tabs discriminator. Refresh issues a new jti; the chain is linkable via orig_iat + sub in logs.
- Role→scope mapping at mint (UserRole, users_accounts.py:27-31): session scopes are derived from the membership role, so interactive authentication alone does not grant everything:
| Membership role | Session scopes |
|---|---|
| OWNER, MEMBER | settings:read, settings:write, settings:deploy_live, tokens:manage |
| MERCHANDISER ("MERCH") | settings:read, settings:write, tokens:manage (no settings:deploy_live) |
| any other / future role | mint refused (403, explicit error); a role added in the Controller can never silently inherit full scopes |
  - Rationale: MERCH is the console's restricted role (feature-flag-gated limited view, feature_flagging_checks.py:126), and settings:deploy_live mutates live storefronts, the most privileged operation in this surface.
  - Composition with CLI tokens: the CLI-token subset rule (4.4B) composes with this. A MERCH session can only mint tokens ⊆ its own scopes, so MERCH can never produce a deploy-capable credential.
  - Honest consequence (same shape as the §4.5 CLI-default note): until task #4's staged records exist, every settings save is a live mutation, so MERCH sessions are effectively read-only on settings. Revisiting MERCH live-save once staged saves exist is a named follow-up (§11).
  - The deploy UI's confirmation step (task #4's gate) remains on top for roles that do hold the scope.
  - tokens:manage is session-only.
- Lambda runtime note: admin_server runs as Lambda via Mangum (run_lambda.py:5), so in-process caches (Cognito JWKS, key material from Secrets Manager) survive only per warm container. That's acceptable by construction:
  - Cognito JWKS verification happens only on the login exchange. Refresh presents a session JWT, not a Cognito token, so a cold start adds a JWKS fetch only to a login.
  - The session-key fetch from Secrets Manager is one call per cold container. After that, per-request session verification is local.
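The role→scope table translates directly into a closed lookup with no default branch, which is what makes unknown roles fail closed. A minimal sketch follows. It assumes the stored role strings are "OWNER", "MEMBER", and "MERCH" (the plan quotes MERCHANDISER as "MERCH"). The names ROLE_SCOPES, UnknownRoleError, and session_scopes_for_role are hypothetical.

```python
SCOPES_FULL = frozenset({"settings:read", "settings:write",
                         "settings:deploy_live", "tokens:manage"})

# Closed mapping: a role missing here gets NO scopes, never a default set.
ROLE_SCOPES: dict[str, frozenset[str]] = {
    "OWNER": SCOPES_FULL,
    "MEMBER": SCOPES_FULL,
    "MERCH": SCOPES_FULL - {"settings:deploy_live"},  # assumed stored role value
}


class UnknownRoleError(Exception):
    """Mapped to a 403 with an explicit error message at session mint."""


def session_scopes_for_role(role: str) -> frozenset[str]:
    try:
        return ROLE_SCOPES[role]
    except KeyError:
        raise UnknownRoleError(f"no session scopes defined for role {role!r}") from None
```

A role added later in the Controller therefore surfaces as a visible 403 at login until someone adds an explicit row here.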
B. CLI token (PAT) — opaque, stored, revocable.
- Format: mqsft_<token_id>_<secret>, where token_id = 12-char base32 and secret = 32 bytes urlsafe-base64.
  - The prefix is distinct and greppable.
  - It is never a JWT: there is nothing to decode offline, and a leaked session signing key cannot forge one.
  - Because urlsafe base64 can itself contain _, the parser splits only at the first _ after the prefix; base32 token_ids never contain one.
- Storage: ShopifyEntities table (same table as settings, no new infra).
  - Keys: PK=AUTHTOKEN#{token_id}, SK=DETAILS.
  - Attributes: secret_hash (SHA-256), system_account_id, account_id, cell_id (persisted at issuance for get_fields; see the 4.6 cell note), scopes: list, shops: list, name, created_by (actor string of the creating session), created_at, expires_at (required, default 90d, max 365d), revoked_at?, last_used_at.
  - last_used_at is updated via a conditional update_item that only fires when the stored value is older than one hour. This is concurrency-safe across parallel Lambda containers and bounds write load; a lost update here is cosmetic and acceptable.
  - New Pydantic StorefrontAuthToken(RecordModel) in models/shopify_entities.py.
  - GSI on system_account_id for listing (reuse the existing system-account GSI pattern used by list_settings_by_system_account).
- Issuance UX: new "CLI tokens" page in storefront_admin (session auth, tokens:manage).
  - Create: choose name, scopes (with the deploy scope behind an explicit warning toggle), shops, and expiry.
  - List: name, scopes, shops, last used, expiry.
  - Revoke.
  - The secret is displayed exactly once, at creation.
- Endpoints (session-auth only; CLI tokens deliberately cannot mint or revoke tokens, so there is no self-escalation):
  - POST /api/v1/storefront/auth/tokens → {token: "mqsft_...", ...metadata}. Requested scopes must be ⊆ the creating session's scopes; requested shops ⊆ the session's shops.
  - GET /api/v1/storefront/auth/tokens → metadata list (never secrets/hashes).
  - DELETE /api/v1/storefront/auth/tokens/{token_id} → sets revoked_at.
- Verification: prefix parse → DDB get by token_id → constant-time SHA-256 compare → reject if revoked_at is set or expires_at is past → build auth context. This is one DDB point-read per request, with no cache in v1, so revocation is immediate.
- CLI usage: scripts/ecom/storefront_settings.py gains MARQO_STOREFRONT_TOKEN env-var support (preferred) while keeping --api-key working. The CSS customization guide and agent prompts switch to the env var. The same Authorization: Bearer header is used; admin_server distinguishes credential types by shape (4.6).
4.5 Scope model
| Scope | Grants | In sessions (OWNER/MEMBER; see 4.4A for MERCH) | Default in CLI tokens |
|---|---|---|---|
| settings:read | GET shops/settings/fields/defaults, GET /api/v1/account | yes | yes |
| settings:write | settings mutations that do not touch the live record: staged/theme-scoped saves, theme-record deletes (task #4), restore-to-staged (task #2) | yes | yes (can be unticked) |
| settings:deploy_live | any mutation of the live settings record (sk=SETTINGS) or the live search_settings metafield: today's plain POST .../settings (no theme_id), task #4's POST /deploy, restores targeting live | yes (UI adds per-action confirm) | no; explicit opt-in with warning |
| tokens:manage | create/list/revoke CLI tokens | yes | never (not grantable) |
- Shop scoping: the shops claim/attribute is either ["*"] (all shops of the account, which tracks newly connected shops) or an explicit domain list. It is enforced in resolve_storefront_shop (4.6): a shop must pass both the existing account-ownership check and the token's shop list.
- Names are flat strings, validated against a closed registry (admin_server/auth/scopes.py, a frozenset + helpers). An unknown scope in a token record fails closed at verification (fail fast).
- settings:deploy_live name and semantics are taken from the theme-deploys plan (docs/plans/theme-targeted-deploys.md, Cross-plan interfaces section).
  - settings:write and settings:deploy_live are independent flags; deploy-without-edit is a valid reviewer persona.
  - The privileged scope gates live-record mutations, not a specific endpoint.
  - Until task #4's staged-settings records exist, every settings POST is a live mutation. A default CLI token (no settings:deploy_live) is therefore read-only in practice until staged saves ship. The token-creation UI must say this explicitly, so users minting a push-capable token know to tick the deploy scope.
- Enforcement timing (theme-deploys' recommendation, adopted): scoped tokens are enforced immediately on landing. Legacy API keys are exempt (they get all scopes except tokens:manage) during the §5 transition.
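A sketch of what the registry and the token-minting subset rule could look like follows. The plan names admin_server/auth/scopes.py as "a frozenset + helpers" but not its contents. Every name below is therefore hypothetical, including DEFAULT_CLI_SCOPES, which is read off the "Default in CLI tokens" column. Shop subsetting treats "*" on the session side as "any shop of the account" and never lets an explicit list widen to "*".

```python
# Hypothetical sketch of admin_server/auth/scopes.py (the plan names the file,
# not its contents).
KNOWN_SCOPES = frozenset({"settings:read", "settings:write",
                          "settings:deploy_live", "tokens:manage"})
NOT_GRANTABLE_TO_CLI = frozenset({"tokens:manage"})
DEFAULT_CLI_SCOPES = frozenset({"settings:read", "settings:write"})


class ScopeError(ValueError):
    """Fail closed: an unknown or over-broad scope rejects the whole request/record."""


def parse_scopes(raw: list[str]) -> frozenset[str]:
    scopes = frozenset(raw)
    unknown = scopes - KNOWN_SCOPES
    if unknown:
        raise ScopeError(f"unknown scope(s): {sorted(unknown)}")
    return scopes


def shops_within(requested: tuple[str, ...], allowed: tuple[str, ...]) -> bool:
    if "*" in allowed:
        return True
    # An explicit allow-list can never be widened to "*".
    return "*" not in requested and set(requested) <= set(allowed)


def check_cli_token_request(requested_scopes: list[str], requested_shops: tuple[str, ...],
                            session_scopes: frozenset[str],
                            session_shops: tuple[str, ...]) -> frozenset[str]:
    """POST .../auth/tokens: requested scopes/shops must be subsets of the session's."""
    scopes = parse_scopes(requested_scopes)
    if scopes & NOT_GRANTABLE_TO_CLI:
        raise ScopeError("tokens:manage cannot be granted to a CLI token")
    if not scopes <= session_scopes:
        raise ScopeError(f"exceeds session scopes: {sorted(scopes - session_scopes)}")
    if not requested_shops or not shops_within(requested_shops, session_shops):
        raise ScopeError("requested shops exceed the session's shops")
    return scopes
```

With a MERCH session's scopes as input, a request for settings:deploy_live fails the subset check. That is how the role mapping and the subset rule compose.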
4.6 Authorization enforcement in admin_server
New module admin_server/auth/storefront_auth.py. The full principal enum {console_user, cli_token, api_key, shopify_user} is defined exactly once, in admin_server/models/auth.py (the single-source file that also holds the actor grammar below); storefront auth uses a documented subset alias of it — storefront credentials can only ever be the first three, while shopify_user exists in the shared enum for the embedded-app surface (tasks #1/#2 author records). One definition, two views — implementers must not create a second enum:
# admin_server/models/auth.py (single source)
from typing import Literal

PrincipalType = Literal["console_user", "cli_token", "api_key", "shopify_user"]
# Subset produced by storefront credentials (no Shopify session auth on this surface):
StorefrontPrincipalType = Literal["console_user", "cli_token", "api_key"]

# admin_server/auth/storefront_auth.py
from pydantic import BaseModel

from admin_server.models.auth import StorefrontPrincipalType


class StorefrontAuthContext(BaseModel):
    actor_id: str  # "user:<sub>" | "token:<token_id>" | "api_key:<system_account_id>"
    actor_display: str | None  # email | token name | None for legacy keys
    principal_type: StorefrontPrincipalType  # subset of the shared PrincipalType (models/auth.py)
    principal_id: str  # cognito_sub | token_id | system_account_id (the bare id, no prefix)
    system_account_id: str
    cell_id: str  # decrypted from the key (api_key) or persisted at issuance (session claim / token record); see Cell note
    scopes: frozenset[str]
    shops: tuple[str, ...]  # ("*",) or explicit domains
    session_id: str | None  # jti for sessions; offered to task #1
authenticate_storefront_request (FastAPI dependency) classifies the Bearer credential:
- mqsft_ prefix → CLI-token path (4.4B).
- A three-segment (two-dot) JWT with iss=marqo-storefront-admin → verify session JWT.
- Otherwise → legacy path.
  - Delegate to the existing authenticate_api_key_request logic (auth/dependencies.py:220-274) unchanged.
  - Wrap the result in a StorefrontAuthContext with all scopes except tokens:manage (token CRUD requires a console session, §4.4B) + all shops. This is zero behavior change for existing callers, and no legacy route is lost because token CRUD is new.
  - Additionally, read the scope field that /account/key/validate already returns, and log (not enforce, in v1) when a read-scope key performs a write. This is input for the deprecation ratchet.
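Classification only routes the credential; it never authenticates it. The sketch below reads the JWT payload without verifying it, purely to check iss, and the chosen path then performs full verification. A malformed JWT-looking string falls through to the legacy path, where key decryption rejects it with a 401, so nothing is silently accepted. The function names and the CredentialKind labels are hypothetical.

```python
import base64
import json
from typing import Literal

CredentialKind = Literal["cli_token", "session", "legacy_api_key"]
SESSION_ISS = "marqo-storefront-admin"


def _b64url_json(segment: str) -> object:
    padded = segment + "=" * (-len(segment) % 4)  # JWTs strip base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def classify_credential(bearer: str) -> CredentialKind:
    """Route by shape only; the chosen path still performs full verification."""
    if bearer.startswith("mqsft_"):
        return "cli_token"
    parts = bearer.split(".")
    if len(parts) == 3:  # header.payload.signature
        try:
            payload = _b64url_json(parts[1])  # UNVERIFIED: routing decision only
        except ValueError:  # binascii.Error, JSONDecodeError, UnicodeDecodeError
            payload = None
        if isinstance(payload, dict) and payload.get("iss") == SESSION_ISS:
            return "session"
    return "legacy_api_key"
```

Checking the prefix first matters, because an mqsft_ token never contains dots and must never reach the JWT branch.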
require_scopes(*scopes) returns a dependency that raises 403 (insufficient_scope, listing the missing scope) — applied per-route:
- GET /shops, GET .../settings, GET .../fields, GET /defaults → settings:read
- POST .../settings targeting a staged/theme-scoped record (task #4) → settings:write
- token CRUD → tokens:manage
- task #4 deploy routes and the existing live POST .../settings → settings:deploy_live (dependency exported for task #4's routes)
resolve_storefront_shop (dependencies.py:725-755) gains the shop-list check after the existing ownership check; 403 message distinguishes "not your shop" from "token not scoped to this shop".
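The two checks can be kept framework-free and wrapped later. In the sketch below, a require_scopes dependency would call enforce_scopes with the StorefrontAuthContext's scopes, and resolve_storefront_shop would call enforce_shop after loading the shop. A FastAPI layer would then translate Forbidden into an HTTPException with status 403; that wiring is not shown. Forbidden, its code strings, and both function names are hypothetical.

```python
class Forbidden(Exception):
    """Carries a machine-readable code; a FastAPI layer would turn it into a 403."""

    def __init__(self, code: str, detail: str) -> None:
        super().__init__(detail)
        self.code = code
        self.detail = detail


def enforce_scopes(required: tuple[str, ...], granted: frozenset[str]) -> None:
    """Core of require_scopes(*scopes): list every missing scope in the error."""
    missing = sorted(set(required) - granted)
    if missing:
        raise Forbidden("insufficient_scope", f"missing scope(s): {', '.join(missing)}")


def enforce_shop(shop_domain: str, shop_system_account_id: str,
                 caller_system_account_id: str, caller_shops: tuple[str, ...]) -> None:
    # 1) Existing ownership check, unchanged and evaluated first.
    if shop_system_account_id != caller_system_account_id:
        raise Forbidden("shop_not_owned", "this shop does not belong to your account")
    # 2) New: the credential's shop list ("*" = all shops of the account).
    if "*" not in caller_shops and shop_domain not in caller_shops:
        raise Forbidden("shop_not_in_token_scope", "this credential is not scoped to this shop")
```

Running ownership first means a token holder probing another account's shop always sees "not your shop", never a hint about token scoping.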
Cell note: cell_id today comes from decrypting the raw key (auth/dependencies.py:236-251) and is used by get_fields to resolve a data-plane key (storefront_routes.py:299-301). For session/CLI credentials, persist cell_id into the session/token record at issuance (Controller membership data knows the account's cell) so get_fields keeps working for all credential types. This is a concrete implementation task, not an afterthought — get_fields breaks otherwise.
Identity propagation (tasks #1/#2): storefront_routes.save_settings passes auth.actor_id (and actor_display) instead of the hardcoded f"api_key:{...}" at storefront_routes.py:174-179. updated_by_user_id keeps receiving the actor string — format is backward compatible (legacy callers produce the identical string as today).
Canonical actor grammar (single source, shared with tasks #1/#2/#4): every canonical `actor_id` is uniformly prefixed as `<prefix>:<principal_id>`, with no unprefixed canonical form. The prefix ↔ `principal_type` mapping is defined once, as code, in `admin_server/models/auth.py` (importable by the versioning, concurrency, and deploy code, which all live in the same component). The locked string forms:
| actor_id form | principal_type | produced by |
|---|---|---|
| `user:{cognito_sub}` | `console_user` | session JWT (`sub`) |
| `token:{token_id}` | `cli_token` | CLI token verification |
| `api_key:{system_account_id}` | `api_key` | legacy key path |
| `shopify_user:{shopify_user_id}` | `shopify_user` | embedded-app writers, normalized at capture time by the tasks #1/#2 write paths (the raw Shopify JWT user id gets the prefix when an actor_id-bearing record is produced) |
The canonical enum is `{console_user, cli_token, api_key, shopify_user}`. Derivation is a plain split at the first `:`. A known prefix yields `(principal_type, principal_id)`; an unknown prefix fails closed (`ValueError`) and is never silently bucketed into `principal_id`.
Legacy-compat branch (data, not grammar): existing `updated_by_user_id` values predate the grammar. They are raw unprefixed Shopify user ids (embedded saves, `settings_routes.py:88`) and the literal `"system"` (webhook writers, `webhook_service.py:702`). The shared helper exposes a separate `classify_legacy(value)` for reading historical data only: no colon and not `"system"` → `shopify_user`; `"system"` → system writer. New writes never produce these forms; canonical actor_ids are always prefixed.
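A minimal sketch of the shared helper, assuming it exposes plain functions (every name except `classify_legacy` is hypothetical). Returning `"system"` from `classify_legacy` and delegating prefixed values to the canonical parser are assumptions, and the sketch also rejects an empty `principal_id`, which the grammar above does not spell out.

```python
from typing import Dict, Tuple

# prefix -> principal_type: the single mapping the other plans import.
PREFIX_TO_PRINCIPAL_TYPE: Dict[str, str] = {
    "user": "console_user",
    "token": "cli_token",
    "api_key": "api_key",
    "shopify_user": "shopify_user",
}
PRINCIPAL_TYPE_TO_PREFIX = {t: p for p, t in PREFIX_TO_PRINCIPAL_TYPE.items()}


def format_actor_id(principal_type: str, principal_id: str) -> str:
    prefix = PRINCIPAL_TYPE_TO_PREFIX.get(principal_type)
    if prefix is None or not principal_id:
        raise ValueError(f"cannot format actor_id for {principal_type!r}/{principal_id!r}")
    return f"{prefix}:{principal_id}"


def parse_actor_id(actor_id: str) -> Tuple[str, str]:
    """Split at the first ':'; unknown prefixes fail closed."""
    prefix, sep, principal_id = actor_id.partition(":")
    if not sep or prefix not in PREFIX_TO_PRINCIPAL_TYPE or not principal_id:
        raise ValueError(f"not a canonical actor_id: {actor_id!r}")
    return PREFIX_TO_PRINCIPAL_TYPE[prefix], principal_id


def classify_legacy(value: str) -> str:
    """For historical updated_by_user_id values only; never used on new writes."""
    if value == "system":
        return "system"  # webhook writers; deliberately outside the canonical enum
    if ":" not in value:
        return "shopify_user"  # raw, unprefixed Shopify user id
    return parse_actor_id(value)[0]  # assumption: prefixed values are canonical
```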
4.7 What does NOT change
- Ecom routes (`/api/v1/indexes/*` etc.) keep `authenticate_api_key_request` untouched. Session and CLI tokens are rejected there (they don't decrypt as keys). This is by design: a leaked storefront token cannot touch indexes; §9 includes an explicit test for this.
- One deliberate exception: `ecom_routes.get_account` (`routes/ecom_routes.py:122-141`, currently on `authenticate_api_key_request`) is the single ecom route that swaps to the unified dependency (with `settings:read`), because the editor needs `GET /api/v1/account` to build the preview credential. It is listed in PR1's route-annotation table; every other ecom route is untouched.
- The Shopify embedded app auth (`authenticate_shopify_request`), webhooks, and app proxy: untouched.
- Search preview credential (`x-marqo-index-id`): untouched.
5. Backward compatibility & deprecation path
- Phase A (land scopes, no enforcement change): unified dependency on storefront routes; legacy keys → all scopes except `tokens:manage`. Existing UI, scripts, and agents: zero change. Log principal_type and scope-mismatch metrics (CloudWatch, dimension on principal type).
- Phase B (SSO default): the storefront_admin login page defaults to SSO; "Use API key instead" remains as a secondary link (it exercises the legacy path end-to-end, which doubles as the break-glass path). CLI docs and scripts switch to tokens.
- Phase C (ratchet): when metrics show legacy-key traffic ≈ 0 on storefront routes (target: 30 consecutive days), flip the `STOREFRONT_LEGACY_KEYS` env var from `allow` → `warn` (response header + log) → `deny` per environment, staging first. The flag is read at request time, so flipping it back restores access as soon as the config update lands, with no code deploy (the lockout antidote). Raw keys on ecom routes are unaffected forever (out of scope).
- Never remove the API-key code path in v1 of this project; removal is a separate decision after Phase C holds in prod.
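A sketch of the request-time flag read, assuming the flag is a plain env var. The header name `X-Storefront-Legacy-Key` is hypothetical, and so is degrading an unrecognized value to `warn` (chosen here so a typo neither locks anyone out nor silently hides the deprecation).

```python
import logging
import os
from enum import Enum
from typing import Dict, Mapping, Tuple

logger = logging.getLogger("storefront.auth")

FLAG = "STOREFRONT_LEGACY_KEYS"
DEPRECATION_HEADER = "X-Storefront-Legacy-Key"  # hypothetical header name


class LegacyKeyMode(Enum):
    ALLOW = "allow"
    WARN = "warn"
    DENY = "deny"


def legacy_key_mode(env: Mapping[str, str]) -> LegacyKeyMode:
    raw = env.get(FLAG, "allow").strip().lower()  # unset = Phase A/B behavior
    try:
        return LegacyKeyMode(raw)
    except ValueError:
        logger.error("invalid %s=%r; treating as warn", FLAG, raw)
        return LegacyKeyMode.WARN


def legacy_key_decision(env: Mapping[str, str] = os.environ) -> Tuple[bool, Dict[str, str]]:
    """Evaluated on every legacy-key request, never cached at import time."""
    mode = legacy_key_mode(env)
    if mode is LegacyKeyMode.DENY:
        return False, {}
    if mode is LegacyKeyMode.WARN:
        logger.warning("legacy API key used on a storefront route")
        return True, {DEPRECATION_HEADER: "deprecated"}
    return True, {}
```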
6. Local development story
- Hippodrome already runs `fake_cognito` (a local Cognito replacement issuing JWTs, port 9012) plus the Controller and admin_server (`components/hippodrome/AGENTS.md:76,134,152`). Point `COGNITO_*`/`CONTROLLER_BASE_URL` at the local stack; admin_server's JWKS URL is env-configurable so it can verify fake_cognito-issued tokens. The full SSO loop is testable offline.
- Against real backends: `.dev.vars` with staging Controller/Cognito values (as the cognito-login branch documents in `wrangler.toml` comments).
- Escape hatch: API-key login stays available in the UI in all envs through Phase B, so local dev against prod data (`ADMIN_SERVER_BASE_URL=https://admin.ecom.marqo.ai`, per README) keeps working with no auth infra at all.
- No prod-weakening backdoors: any local-only bypass vars (e.g. a pre-made session secret) must be guarded by a Zod `.refine` that rejects them when `ENV` ∉ {local, dev}, the exact pattern admin_worker enforces for `LOCAL_CF_ACCESS_TOKEN` (`admin_worker/app/.server/env.ts:36-38`). Server-side mirror: admin_server refuses a session signing key from plain env (vs Secrets Manager) unless `FULL_ENV` ∈ {local, dev, test}; this is a hard gate that raises at startup, never a logged fallback. The existing `MARQO_API_KEY_SECRET` env fallback (`utils/api_key_utils.py:150-159`) has exactly the laxity we are avoiding (env fallback with no env check, reachable in prod if Secrets Manager errors); do not copy it, and file a follow-up to env-gate it (noted in §11).
7. Rollout (zero lockout)
| Step | Change | Lockout risk & mitigation |
|---|---|---|
| 1 | admin_server: scopes module, unified dependency (legacy = all scopes except `tokens:manage`), session/token endpoints, secrets | None: pure addition; legacy path byte-identical |
| 2 | Controller: adopt cognito-login branch commits (redirect allowlist, Cognito app-client callbacks for storefront origins, CORS entries) | None: console unaffected; new origins only |
| 3 | storefront_admin: SSO login + cookie sessions + proxy, API-key login kept; the same PR fixes admin_server CORS (explicit origin allowlist, drop `allow_credentials`, since nothing uses cookies against admin_server) so cookies and the CORS fix ship atomically | If SSO breaks, users click "Use API key instead" (old flow, fully server-side legacy path). CORS: an unknown-origin log-only report runs during steps 1–2 so the allowlist is data-backed before enforcement |
| 4 | CLI tokens UI + script env-var support | None: `--api-key` still accepted |
| 5 | Phase C ratchet (allow→warn→deny), staging→prod | Flag is runtime-flippable; deny only after 30 quiet days; ecom routes never affected |
Break-glass: legacy API-key auth (requests unchanged from today) works at every step until the final deny; even at deny, flipping the env var back restores it without a code deploy (an env var change is an ECS/Lambda config update, taking minutes). Signing-key rotation: an ES256 `kid` header plus verifiers that accept both the current and the previous public key mean rotation never invalidates every outstanding session at once unless intended (pulling both keys at once is the deliberate kill switch).
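The rotation rule amounts to a two-slot keyring. This sketch covers only key selection by `kid`; the ES256 signature check itself is left to whichever JWT library admin_server uses, and `VerifierKeyring`/`UnknownKeyId` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class UnknownKeyId(Exception):
    """Maps to 401: the token's kid is not trusted by this verifier."""


@dataclass(frozen=True)
class VerifierKeyring:
    current_kid: Optional[str] = None
    previous_kid: Optional[str] = None
    public_keys: Dict[str, str] = field(default_factory=dict)  # kid -> public key (e.g. PEM)

    def key_for(self, kid: Optional[str]) -> str:
        """Pick the public key for a JWT header's kid; a missing kid is rejected."""
        trusted = {k for k in (self.current_kid, self.previous_kid) if k is not None}
        if kid not in trusted:
            raise UnknownKeyId(kid)
        return self.public_keys[kid]

    def rotate(self, new_kid: str, new_public_key: str) -> "VerifierKeyring":
        """New key becomes current, old current stays valid as previous, older keys drop."""
        keys = {self.current_kid: self.public_keys[self.current_kid]} if self.current_kid else {}
        keys[new_kid] = new_public_key
        return VerifierKeyring(new_kid, self.current_kid, keys)

    def kill_switch(self) -> "VerifierKeyring":
        """Pull both keys at once: every outstanding session stops verifying."""
        return VerifierKeyring()
```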
8. Cross-plan interfaces
(Proposals messaged to all three teammates 2026-06-10; reconciled the same day against their published plans: `theme-targeted-deploys.md`, `settings-versioning.md`, `settings-concurrency-control.md`.)
8.1 theme-deploys (task #4) — agreed with their published plan (messages exchanged 2026-06-10)
- Scope names adopted from their plan: `settings:write` and `settings:deploy_live` (required for any mutation of the live record `sk=SETTINGS` or the live `search_settings` metafield). The flags are independent: deploy-without-edit is a valid reviewer persona. My earlier `themes:deploy-live` proposal is superseded.
- Operation → scope matrix (locked):

  | Operation | Required scopes |
  |---|---|
  | Staged (theme-targeted) save; theme-record delete | `settings:write` |
  | Direct live save (no `theme_id`; today's `POST .../settings`) | `settings:write` + `settings:deploy_live` |
  | `POST .../settings/deploy` (promote staged → live) | `settings:deploy_live` only |
  | `POST .../settings/deploy/rollback` (task #4) | `settings:write` + `settings:deploy_live` (same rule as restore-to-live: puts non-head content live) |
  | Restore-to-live (task #2) | `settings:write` + `settings:deploy_live` (stricter than promote on purpose: restore chooses arbitrary historical content; promote moves the single staged head) |

- Enforcement: they declare `Depends(require_scopes(...))` + `resolve_storefront_shop`; their plan states scoped-token enforcement "activates when the auth plan lands; until then the endpoint is full-access-API-key-only", which is consistent with §5 Phase A here.
- Their open question 2 ("enforce immediately for scoped tokens vs grace period") is answered here: enforce immediately for scoped tokens; legacy keys are exempt (all scopes except `tokens:manage`) during the transition. There is no grace-period ambiguity because scoped tokens are new, so nothing existing breaks.
- Defaults: OWNER/MEMBER sessions carry `settings:deploy_live` (their UI confirm is the interactive gate); MERCHANDISER sessions do not (role→scope mapping, §4.4A; post-review addition 2026-06-11); CLI tokens exclude it unless explicitly granted with a warning (§4.5). No interface change for them: the scope check is identical regardless of how the caller came to hold (or not hold) the scope.
- Audit: `StorefrontAuthContext.actor_id`/`actor_display`/`principal_type` are available for their deploy records (they stamp `actor_id` into `deployed_by`).
- Token claims: confirmed no per-token `theme_id` allowlist in v1; theme targeting is request-level, and the deploy gate is scope + confirm dialog.
- Status: confirmed both ways 2026-06-10/11; matrix pinned in their plan's Cross-plan interfaces section.
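The locked matrix and the role defaults are small enough to live as data next to the scopes registry. In this sketch the `Operation` names are hypothetical, and the MERCHANDISER set (read + write + `tokens:manage`, no `settings:deploy_live`) is inferred from §11 and the test plan rather than stated outright.

```python
from enum import Enum
from typing import FrozenSet, Mapping

READ = "settings:read"
WRITE = "settings:write"
DEPLOY_LIVE = "settings:deploy_live"
TOKENS = "tokens:manage"


class Operation(Enum):
    STAGED_SAVE = "staged_save"          # also theme-record delete
    LIVE_SAVE = "live_save"              # today's POST .../settings, no theme_id
    PROMOTE = "promote"                  # POST .../settings/deploy
    ROLLBACK = "rollback"                # POST .../settings/deploy/rollback
    RESTORE_TO_LIVE = "restore_to_live"  # task #2


REQUIRED_SCOPES: Mapping[Operation, FrozenSet[str]] = {
    Operation.STAGED_SAVE: frozenset({WRITE}),
    Operation.LIVE_SAVE: frozenset({WRITE, DEPLOY_LIVE}),
    Operation.PROMOTE: frozenset({DEPLOY_LIVE}),
    Operation.ROLLBACK: frozenset({WRITE, DEPLOY_LIVE}),
    Operation.RESTORE_TO_LIVE: frozenset({WRITE, DEPLOY_LIVE}),
}

# Session defaults per Controller role (MERCHANDISER's exact set is an assumption).
ROLE_SCOPES: Mapping[str, FrozenSet[str]] = {
    "OWNER": frozenset({READ, WRITE, DEPLOY_LIVE, TOKENS}),
    "MEMBER": frozenset({READ, WRITE, DEPLOY_LIVE, TOKENS}),
    "MERCHANDISER": frozenset({READ, WRITE, TOKENS}),
}


def session_scopes_for_role(role: str) -> FrozenSet[str]:
    scopes = ROLE_SCOPES.get(role)
    if scopes is None:  # unknown/future role: fail closed (403 at issuance)
        raise PermissionError(f"unknown role {role!r}")
    return scopes


def is_allowed(operation: Operation, held: FrozenSet[str]) -> bool:
    return REQUIRED_SCOPES[operation] <= held  # all required scopes must be held
```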
8.2 settings-versioning (task #2) — agreed (messages exchanged 2026-06-10)
- Author shape (their final form, accepted): version records store two fields. `author_id` is the canonical actor string (`user:<cognito_sub>`, `token:<token_id>`, `api_key:<system_account_id>`, or `shopify_user:<shopify_user_id>`; principal type derivable from the prefix), and `author_display` is the email / token name (denormalized at write time, nullable for legacy api_key callers). Their earlier structured-object draft is superseded. `StorefrontAuthContext` additionally exposes `principal_type`/`principal_id` as convenience fields, but the version-record interface is just the two strings. (Embedded-app saves keep their Shopify-user identity, populated from `ShopifyAuthRequestContext`, which is outside this plan's surface.)
- `updated_by_user_id` on ShopifySettings continues to receive the flat `actor_id` string; legacy writers produce today's exact `api_key:{system_account_id}` value, so there is no migration and no breakage for existing consumers. Their version capture copies it.
- Reads (version list/get/diff) record no actor; restore is a write and gets the same context, so the restoring actor becomes the new version's author (`event_type=restore`, their concern).
- They import the prefix↔type mapping from `admin_server/models/auth.py` (never re-derived). Grammar update (round-2 review, re-coordinated and confirmed 2026-06-11): version authors use the uniformly prefixed form; embedded-app authors are normalized to `shopify_user:{id}` at capture time (§4.6), and the `classify_legacy` helper covers historical raw values only. They confirmed adoption and updated their doc in all three relevant spots (model comment, cross-plan section, dependencies) as a post-approval interface correction. Their backfill stamps the script principal as author, so `classify_legacy` is only needed if historical `updated_by_user_id` values are ever attributed. The webhook writers' literal `"system"` can never appear as a version author on their surface (infra writes go through their non-capturing path), so no special case is needed there.
- Restore scoping (agreed): restore requires the same scopes as the equivalent save against the same target. Live: `settings:write` + `settings:deploy_live` (identical to the plain live POST under §8.1 semantics); staged (post task #4): `settings:write`. There is no separate restore scope.
8.3 settings-concurrency (task #1) — agreed (messages exchanged 2026-06-10; recorded in their plan §11)
- Storage: their write path persists `updated_by_user_id = actor_id` (same attribute as today, richer values, no schema migration) plus a new optional `updated_by_display = actor_display`. settings-versioning derives its author pair from the same two attributes, so one identity scheme feeds both plans.
- Payload shape (theirs, accepted): flat fields shared by GET and 409. GET returns `updatedBy` (actor_id) + `updatedByDisplay` + `lastUpdated`; the 409 detail adds `expectedVersion`, `currentVersion`, `changeSource`. (My earlier nested `conflicting_writer` proposal is superseded: same information, flatter shape.)
- Pre-SSO degradation (agreed): when the conflicting `actor_id` equals the caller's own `api_key:{system_account_id}`, the conflict dialog says "saved from another session of this account" instead of naming a writer; real identities light up automatically once SSO lands.
- `change_source` (`storefront_admin` | `script` | ...) is set by the route/caller, not by the auth context, so no auth work is needed.
- `session_id` (JWT `jti`) stays on the context; it is not required for their v1 and can be persisted later without a contract change.
- Sequencing: until `StorefrontAuthContext` replaces `ApiKeyAuthRequestContext` on storefront routes (PR1 here), their Phase 1 stores the legacy `api_key:{system_account_id}` string as actor_id-compatible attribution; the context swap is then a one-line change per write path on their side, with no contract change.
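A sketch of the flat 409 detail and the pre-SSO label, assuming the stored record exposes `last_updated`, `version`, and `change_source` attributes (those attribute names are hypothetical; only the payload field names come from the agreed shape). Falling back to the raw `actor_id` when no display name exists is also an assumption.

```python
from typing import Any, Dict, Mapping, Optional


def build_conflict_detail(current: Mapping[str, Any], expected_version: int) -> Dict[str, Any]:
    """Flat 409 detail: the GET fields plus the version/change-source fields."""
    return {
        "updatedBy": current["updated_by_user_id"],             # canonical actor_id
        "updatedByDisplay": current.get("updated_by_display"),  # None for legacy keys
        "lastUpdated": current["last_updated"],
        "expectedVersion": expected_version,
        "currentVersion": current["version"],
        "changeSource": current.get("change_source"),
    }


def conflict_writer_label(detail: Mapping[str, Any], caller_actor_id: str) -> str:
    updated_by: str = detail["updatedBy"]
    if updated_by == caller_actor_id and updated_by.startswith("api_key:"):
        # Pre-SSO: both writers share one account key, so no person can be named.
        return "saved from another session of this account"
    display: Optional[str] = detail.get("updatedByDisplay")
    return display or updated_by
```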
9. Test plan
admin_server (pants test, moto for DDB/Secrets Manager):
- Credential classification: legacy key → legacy path (existing tests keep passing unmodified; that is the back-compat proof); `mqsft_` → token path; session JWT → session path; garbage → 401.
- Session issuance: valid Cognito token + ACTIVE member account → JWT with correct claims; tampered/expired/wrong-audience Cognito token → 401; non-member `account_id` → 403; `PENDING` membership → 403; unknown/future role value → 403 (fail closed); Controller membership call failure → 502, never a silent default (fail fast).
- Role→scope mapping: OWNER and MEMBER sessions carry `settings:deploy_live`; a MERCHANDISER session does not. Its live `POST .../settings` → 403, and its attempt to mint a CLI token requesting `settings:deploy_live` → 403 (the subset rule composes with the role mapping).
- Session verification: expired / tampered / wrong-key-signed / unknown-`kid` / `token_use`-mismatch matrix; a token signed with the previous public key (rotation window) still verifies.
- Session refresh: valid session → new JWT with the same claims, fresh `exp`/`jti`, preserved `orig_iat`; expired session → 401; refresh past `orig_iat + 24h` → 401 (forced re-login); refresh never extends scopes or shops.
- CLI tokens: create (scope ⊆ session scopes enforced, so an escalation attempt → 403; `tokens:manage` ungrantable; expiry required and capped), verify (a wrong secret fails via constant-time compare, revoked → 401, expired → 401), list never returns hashes, revoke is immediate (no cache).
- Scope enforcement: per-route 403 matrix (read-only token POSTs settings → 403; no-deploy token hits deploy route → 403); shop scoping (token for shop A → shop B → 403 even within the same account).
- Identity propagation: `updated_by_user_id` receives each credential type's actor string (this also locks the tasks #1/#2 interface).
- `get_fields` works for all three credential types (cell_id resolution, the regression trap from §4.6).
- Ecom-route rejection (locks the §4.7 isolation property, currently true but untested): a session JWT and an `mqsft_` token presented to `/api/v1/indexes/...` (e.g. DELETE) → 401; `ecom_routes.get_account` is the only ecom route accepting them.
- Shop-list parity gate (PR2, C3): integration test asserting `GET /shops` under a minted session equals `GET /shops` under a legacy API key of the same account.
- Non-default-cell account: a member of an account whose `AccountData.cell_id` is not the default cell gets a session carrying that cell_id, and `get_fields` (`storefront_routes.py:299-301`) resolves the correct data-plane key with it. This guards the §4.2 join and the §4.6 cell note together.
- CORS (PR5): preflight from an allowlisted origin passes; an unknown origin gets no CORS headers; `allow_credentials` is off.
- `last_used_at`: concurrent token uses → exactly one conditional update within the hour window (no `ConditionalCheckFailed` surfacing to the caller).
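The CLI-token create and verify rules exercised above can be sketched as follows. The `mqsft_<token_id>_<secret>` layout, the 90-day cap, the 400 status for a missing or oversized expiry, and the in-memory dict standing in for the DynamoDB table are all assumptions; the subset rule, the ungrantable `tokens:manage`, hashed storage, constant-time comparison, and no-cache revocation come from this plan.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, FrozenSet, Optional

TOKEN_PREFIX = "mqsft_"
MAX_LIFETIME = timedelta(days=90)           # hypothetical cap
UNGRANTABLE = frozenset({"tokens:manage"})  # token CRUD needs a console session


class TokenError(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code


@dataclass
class TokenRecord:
    token_id: str
    secret_hash: bytes  # only the hash is stored; the secret is shown once
    scopes: FrozenSet[str]
    expires_at: datetime
    revoked: bool = False


def _hash(secret: str) -> bytes:
    # Assumption: one SHA-256 suffices because the secret is 256 random bits.
    return hashlib.sha256(secret.encode()).digest()


def create_token(store: Dict[str, TokenRecord], session_scopes: FrozenSet[str],
                 requested: FrozenSet[str], lifetime: Optional[timedelta], now: datetime) -> str:
    if lifetime is None or lifetime <= timedelta(0):
        raise TokenError(400, "expiry is required")
    if lifetime > MAX_LIFETIME:
        raise TokenError(400, "expiry exceeds the cap")
    if requested & UNGRANTABLE:
        raise TokenError(403, "tokens:manage cannot be granted to a CLI token")
    if not requested <= session_scopes:  # no escalation beyond the minting session
        raise TokenError(403, "requested scopes exceed the session's scopes")
    token_id = secrets.token_hex(8)  # hex never contains the '_' separator
    secret = secrets.token_urlsafe(32)
    store[token_id] = TokenRecord(token_id, _hash(secret), frozenset(requested), now + lifetime)
    return f"{TOKEN_PREFIX}{token_id}_{secret}"


def verify_token(store: Dict[str, TokenRecord], presented: str, now: datetime) -> TokenRecord:
    if not presented.startswith(TOKEN_PREFIX):
        raise TokenError(401, "not a CLI token")
    token_id, sep, secret = presented[len(TOKEN_PREFIX):].partition("_")
    record = store.get(token_id)  # read from the store on every call: no cache
    if not sep or record is None:
        raise TokenError(401, "unknown token")
    if not hmac.compare_digest(_hash(secret), record.secret_hash):  # constant time
        raise TokenError(401, "invalid token")
    if record.revoked or now >= record.expires_at:
        raise TokenError(401, "token revoked or expired")
    return record
```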
storefront_admin (vitest): session cookie parse/verify helpers (expired/garbage → null), loader guard redirects, proxy route (path allowlist, header promotion, cookie-less → 401, cross-site Origin → 403), auth-client (extend the branch's existing `auth-client.test.ts`), token-management UI state, and the OAuth state cookie round-trip.
E2E (CI, optionally Hippodrome + fake_cognito): SSO login → editor loads → save settings → version/author visible; create CLI token in UI → `storefront_settings.py` backup/push with `MARQO_STOREFRONT_TOKEN` → revoke → the same command 401s; legacy `--api-key` still works end-to-end.
Security checks (part of code review, not automated): no token/secret ever logged (request_logging review); secret displayed once; constant-time compares; cookie flags.
10. Work breakdown (suggested PR sequence)
- PR1 (admin_server auth substrate): scopes registry + canonical actor grammar (`admin_server/models/auth.py`), `StorefrontAuthContext`, unified dependency wrapping the legacy path, route annotations incl. `ecom_routes.get_account` (legacy = all scopes except `tokens:manage` ⇒ no behavioral change), metrics, unknown-origin CORS log-only report. Independently shippable.
- PR2 (Controller: memberships endpoint + branch adoption): the committed `GET /account/memberships` view (Cognito-token-authenticated, returns `{account_id, system_account_id, role, cell_id}` per membership; C2) plus cherry-pick/rebase of the `worktree-feat+cognito-login` Controller commits (SSO redirect validation, config); coordinate with that branch's owner. Includes the `account_id` ↔ `system_account_id` mapping spike; merge gate: the shop-list parity test (§9) must pass against staging (C3).
- PR3 (session issuance, admin_server): Cognito JWKS verifier, membership client, `POST /auth/session` + `/auth/session/refresh`, ES256 key plumbing (Secrets Manager private key, public-key distribution, `kid` rotation), cell_id persistence. Depends on PR2.
- PR4 (CLI tokens backend): token model/repo, CRUD endpoints, verification path.
- PR5 (storefront_admin SSO + CORS fix, atomic): login UI (reusing branch UI work), callback route, cookie session + silent refresh, loader guards, proxy route, API-key fallback link; same PR: admin_server CORS allowlist + drop `allow_credentials` (C1: cookies never ship while `allow_origins=["*"]` + `allow_credentials=True` stands).
- PR6 (tokens UI + CLI): tokens page; `storefront_settings.py` env-var support; docs (CSS customization guide, integration playbooks).
- PR7 (hardening): deprecation flag (allow→warn→deny), dashboards, scope-mismatch alarms.
Each PR follows CLAUDE.md Definition of Done (tests for new branches/paths land with the PR, not after).
11. Open questions / verification items for implementation
- Membership endpoint (resolved): committed PR2 deliverable (`GET /account/memberships`, §4.2 step 2). The cell_id location question is answered: it lives on `AccountData` (`users_accounts.py:178`), not on membership records, so the view joins memberships → `AccountData` per account. Remaining for PR2: exact serializer shape only.
- `account_id` ↔ `system_account_id` (resolved process-wise): PR2 design spike with the shop-list parity merge gate (§4.2 step 3, §9). The factual answer is discovered, not assumed.
- Coordinate with the owner of `worktree-feat+cognito-login` (the branch is ~review-complete; this plan supersedes its localStorage approach but adopts everything else).
- theme-deploys scope naming confirmed both ways (§8.1: `settings:deploy_live`); no rename expected before PR1 lands the registry.
- Follow-up issue (out of scope, S2): env-gate the existing `MARQO_API_KEY_SECRET` fallback in `utils/api_key_utils.py:150-159` the way the new signing-key gate works.
- Follow-up (named, post task #4): revisit whether MERCHANDISER sessions should get `settings:deploy_live` (or a staged-write-only workflow) once staged saves exist and MERCH is no longer effectively read-only on settings (§4.4A). Aligns with the Controller's standing RBAC TODO (`feature_flagging_checks.py:126`).