Settings Versioning: History, Rollback, and Audit for ShopifySettings
Author: settings-versioning (agent, feature-plans team)
Date: 2026-06-10
Stakeholders: Raynor (sign-off), settings-concurrency / theme-deploys / storefront-admin-auth plan owners
Status: Reviewer-approved (adversarial Opus plan-verifier, round 2, 2026-06-11) — awaiting Raynor sign-off
Executive Summary
Every save of a merchant's storefront settings (ShopifySettings: UI component HTML/CSS templates, DOM selectors, configuration) should produce a recoverable, attributable version. Today, rollback relies on ad-hoc manual JSON files in tmp/ (e.g. tmp/lg-settings-backup-2026-05-23T0900.json), made by whoever remembers to make them. This plan adds automatic version capture on every content write, a list/diff/restore API on the API-key-authenticated storefront admin routes, and a minimal restore UI in the storefront admin editor (components/storefront_admin).
Design in one paragraph: each content save writes a full-state snapshot to a new private S3 bucket and a small version-metadata item into the existing ShopifyEntities DynamoDB table. The metadata item lives in a dedicated partition (pk SHOPVER#{domain}, sk {scope}#{n:010d}), so history volume can never inflate reads of the live record or any query that sweeps the SHOP#{domain} partition. It is written atomically with the live-item write via TransactWriteItems (transactions span partitions within a table). The version number is the optimistic-lock counter (record_version) being introduced by the settings-concurrency plan: one counter, two purposes. Restore replays a snapshot's content fields through the existing save path, producing a new version (history is append-only) and re-syncing the Shopify metafield mirror exactly as a normal save does.
Problem Statement
ShopifySettings (DDB: pk SHOP#{domain}, sk SETTINGS, table ShopifyEntitiesTable) holds merchant-facing storefront configuration: ui_components and selector_components dicts containing large HTML/CSS templates, plus configuration. Real merchant payloads run ~115–120KB of JSON (Laura Geller, Muji backups in tmp/).
These settings are edited by:
- Merchants via the embedded Shopify app (settings_routes.py, session auth),
- Humans and AI agents via the storefront admin API (storefront_routes.py, API-key auth): the dominant path during integration/tuning work, often against live merchant storefronts.
Pain points:
- No rollback. repository.save_settings() is a full-item put_item, so the previous state is destroyed on every save. A bad agent push to a live merchant requires hunting for a manual backup file.
- No audit. Only updated_by_user_id + last_updated of the latest write survive. There is no record of who changed what, when, across time.
- Ad-hoc backups don't scale. The tmp/*.json convention is manual, local to one machine, unversioned, and routinely forgotten.
Requirement (from operations): every save produces a recoverable version, with list/diff/restore available to the same actors who can save.
Glossary
- Live item: the single ShopifySettings record read at runtime (pk SHOP#{domain}, sk SETTINGS).
- Content fields: ui_components, selector_components, configuration. These define storefront behavior and are what versioning protects.
- Infra fields: metadata (webhook registrations), active_index, system_account_id, cell_id. Written by webhook_service.py, index_service.py, api_key_routes.py via partial update_settings() calls. Not user-facing config; not versioned and never restored.
- Metafield mirror: save_settings_to_metafields() writes the full settings JSON to the Shopify shop metafield marqo.search_settings; the theme extension (marqo-search-embed.liquid) injects it into the page, where storefront_search consumes it as window.MarqoUIConfig.
- Version counter: the integer record_version attribute on the live item, introduced by the settings-concurrency plan for optimistic locking (agreed interface, see Cross-plan). Starts at 1 on first conditional write (legacy records without the attribute are treated as 0); every successful write increments it by exactly 1. Distinct from the existing settings_schema_version (payload schema version).
- Scope: which settings record a version belongs to. This is live today, and theme:{theme_id} once the theme-deploys plan lands per-theme records.
Tenets
- Recoverability over save availability. A settings save that cannot produce a version must fail (fail-fast per CLAUDE.md). We accept that an S3/DDB-transact outage blocks saves; we do not accept silent versionless saves.
- History is append-only. Restore never rewrites history; it creates a new version recording what was restored and by whom.
- One counter. The optimistic-lock counter and the version-history identifier are the same number. Keeping two counters with subtly different semantics would be a standing source of bugs.
- Restore is a save. Restore reuses the existing save path (validation, merge, metafield sync, concurrency check) rather than a parallel write path that drifts.
Functional Requirements
- FR-1: Every write that changes content fields produces a version capturing the full post-write state of content fields.
- FR-2: Versions record author identity, timestamp, event type (save/restore/deploy/backfill), and a summary of changed components.
- FR-3: API to list versions (paginated, newest first) and fetch one version's full payload.
- FR-4: API to diff two versions (or a version against current) at component/field granularity.
- FR-5: API to restore a version: content fields only, applied as a new version, with the metafield mirror re-synced.
- FR-6: Writes that touch only infra fields do not create versions.
- FR-7: Identical-content saves (no content change) do not create duplicate versions.
- FR-8: The storefront admin UI (components/storefront_admin, the standalone settings editor; not admin_worker, the internal staff dashboard) shows the version list and offers restore with confirmation.
Non-Functional Requirements
- NFR-1: Version capture adds ≤1 S3 PUT + 0 extra DDB round-trips to the save path (the metadata Put rides the same TransactWriteItems as the live write).
- NFR-2: Listing versions never fetches snapshot payloads (metadata items stay < 2KB).
- NFR-3: No version data is publicly readable (the existing assets bucket is public_read_access=True and is explicitly not used).
- NFR-4: Sized for ~200 shops each accumulating low thousands of versions (automated agent teams saving continuously), with ≥10x headroom; see Performance at scale.
- NFR-5: Reads of the live settings record remain a single GetItem whose cost is independent of version count; no existing query's result set grows with history.
Out Of Scope
- A version-history UI in the embedded Shopify app (a history page inside the Shopify admin app). Capture still happens for those saves (they share SettingsService), but the list/diff/restore UI is only built in the storefront admin editor (components/storefront_admin) for v1. A versions panel in admin_worker (the internal staff dashboard) is likewise out of scope.
- Versioning other entities (IndexSettings, API keys, sessions, merchandising rules).
- Importing the historical ad-hoc tmp/*.json backups (optional manual one-off, not part of the system).
- Scheduled snapshot pruning/retention automation beyond S3 lifecycle storage-class transitions (volume is tiny; deletion machinery is more risk than the cost it saves; revisit if growth demands).
- A separate finer-grained "restore" auth scope: restore requires the same scopes as the equivalent save against the same target (live: settings:write + settings:deploy_live; staged: settings:write); see Cross-plan interfaces.
- Cross-shop version copy ("apply Muji CA's v12 to Muji US"): useful, but a different feature.
Success Criteria
- 100% of content saves in prod produce a listed version (verifiable: compare save logs to version counts).
- A live-merchant rollback is a single API call / one button, taking < 1 minute end to end including metafield re-sync — measured against today's manual process (locate backup file, hand-craft POST).
- Zero ad-hoc tmp/ backups needed for new integration work.
API Design
All endpoints live on the storefront admin router (storefront_routes.py, API-key auth via authenticate_api_key_request + resolve_storefront_shop), alongside the existing GET/POST /shops/{shopify_domain}/settings. Shop access control is identical to the existing settings endpoints: the API key's system_account_id must own the shop.
List versions
GET /shops/{shopify_domain}/settings/versions?limit=20&cursor=<opaque>
{
  "versions": [
    {
      "version": 42,
      "scope": "live",
      "event_type": "restore",
      "restored_from": 38,
      "author_id": "api_key:acct_123",
      "author_display": null,
      "created_at": "2026-06-10T03:21:00+00:00",
      "settings_schema_version": 1,
      "content_hash": "sha256:ab12…",
      "snapshot_size_bytes": 119001,
      "changed": {"ui_components": ["results_grid", "instant_search"], "selector_components": [], "configuration": []}
    }
  ],
  "next_cursor": "eyJ2IjogNDF9"
}
Newest first (ScanIndexForward=False over pk SHOPVER#{domain}, sk prefix live#). cursor is the base64-encoded DDB LastEvaluatedKey. Metadata only — no payloads; page cost is limit × ~2KB regardless of total history depth (see Performance at scale).
Get one version (full payload)
GET /shops/{shopify_domain}/settings/versions/{version}
Returns the metadata above plus "settings": {"ui_components": …, "selector_components": …, "configuration": …} fetched from S3. 404 if the version doesn't exist (RecordNotFoundError → 404 via existing handler pattern).
Diff
GET /shops/{shopify_domain}/settings/versions/{version}/diff?against=current
GET /shops/{shopify_domain}/settings/versions/{version}/diff?against=38
{
  "from": 38, "to": "current", "to_version": 42,
  "changes": [
    {"path": "ui_components.results_grid.css", "change_type": "modified",
     "diff": "--- v38\n+++ current\n@@ -10,4 +10,6 @@\n …"},
    {"path": "ui_components.promo_banner", "change_type": "added"},
    {"path": "selector_components.search_input.selector", "change_type": "modified",
     "diff": "--- v38\n+++ current\n@@ …"}
  ]
}
Server-side structural diff: walk the three content dicts; report added/removed components and per-field changes. For changed string fields ≤ 64KB, include a difflib.unified_diff; larger fields report change_type and sizes only (client can fetch both payloads). This is deterministic stdlib work — no external deps. When against=current, the response's to_version records the live record_version read at diff time, so a client can review a diff and then restore with expected_version=to_version — guaranteeing the state it approved is the state being replaced.
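A minimal sketch of the structural walk described above, using only the standard library. The function names, the old_size_bytes/new_size_bytes keys, and the choice to measure the 64KB threshold in UTF-8 bytes are assumptions for illustration; the plan only fixes the path/change_type/diff shape.

```python
import difflib

CONTENT_FIELDS = ("ui_components", "selector_components", "configuration")
MAX_TEXT_DIFF_BYTES = 64 * 1024  # above this, report sizes instead of a text diff


def _walk(path, old, new, from_label, to_label):
    """Yield change records for one subtree of the content dicts."""
    if old == new:
        return
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}"
            if key not in old:
                yield {"path": child, "change_type": "added"}
            elif key not in new:
                yield {"path": child, "change_type": "removed"}
            else:
                yield from _walk(child, old[key], new[key], from_label, to_label)
        return
    change = {"path": path, "change_type": "modified"}
    if isinstance(old, str) and isinstance(new, str):
        sizes = (len(old.encode("utf-8")), len(new.encode("utf-8")))
        if max(sizes) <= MAX_TEXT_DIFF_BYTES:
            lines = difflib.unified_diff(
                old.splitlines(), new.splitlines(),
                fromfile=from_label, tofile=to_label, lineterm="",
            )
            change["diff"] = "\n".join(lines)
        else:
            change["old_size_bytes"], change["new_size_bytes"] = sizes
    yield change


def diff_settings(old: dict, new: dict, from_label: str, to_label: str) -> list:
    """Diff the three content dicts; paths are sorted so output is deterministic."""
    changes = []
    for field in CONTENT_FIELDS:
        changes.extend(
            _walk(field, old.get(field) or {}, new.get(field) or {}, from_label, to_label)
        )
    return changes
```

Sorting the keys at each level keeps the output order stable between calls, which matters if clients cache or compare diff responses.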
Restore
POST /shops/{shopify_domain}/settings/versions/{version}/restore
Body: {"expected_version": 42} # optional; optimistic-lock guard, see below
(Field name matches the settings-concurrency plan's save contract; if they expose the guard differently — e.g. a header — restore mirrors their choice.)
Scopes (per storefront-admin-auth): restore requires the same scopes as the equivalent save against the same target — live target: settings:write + settings:deploy_live (any mutation of the live record needs the deploy scope, matching the live POST /settings itself); staged/theme target (future): settings:write only. No separate restore scope. Legacy raw API keys retain access per their allow→warn→deny ratchet.
Semantics:
- Fetch the version metadata and snapshot for {version}; validate and (if needed) schema-migrate it (see Schema drift), and run the tier-1 metafield size guard (shared two-tier capacity interface, settings-concurrency's canonical wording). An oversized snapshot fails fast with 422 settings_too_large (the shared tier-1 status; tier 2's DDB cap is the one that maps to 413) before any write, since restore is a metafield-bound path.
- Apply content fields only through SettingsService.create_or_update_settings: infra fields on the live item are untouched, and last_updated/author are fresh.
- This produces a new version (e.g. restoring v38 when current is v42 creates v43 with event_type=restore, restored_from=38).
- Re-sync the metafield mirror via the shared save/restore helper, including settings-concurrency's post-write metafield reconciliation (their §4.5); the 207-partial contract applies when the access token is missing or Shopify errors (response shape reuses SaveSettingsResponse).
- Concurrency guard: the client-supplied expected_version is optional and advisory; the server does not depend on a client echo (today's GET doesn't even return a version; see prerequisite below). When provided, restore runs caller-managed: one attempt, 409 on conflict. When omitted, restore inherits service-managed mode (bounded conditional retry of the full pipeline), which is not last-writer-wins: an unguarded restore still never overwrites a concurrent save silently within an attempt.
Hard prerequisite (from settings-concurrency's plan, their §5.1 / fix M1): the storefront admin GET /shops/{domain}/settings switches to a response carrying the current version — today it returns only components (storefront_routes.py:140-143, SettingsResponse in api_requests.py), so no client could echo a guard. The restore UI reads version from that GET (or from the versions list, whose to_version the diff endpoint also reports) and passes it as expected_version.
Responses: 200 {"status": "success", "new_version": 43, …}, 207 partial (metafield sync failed; DDB state restored — same contract as save), 404 unknown version, 409 version conflict or future-schema snapshot, 422 snapshot fails current model validation (validation_failed) or exceeds the tier-1 metafield guard (settings_too_large).
Future evolution
- ?scope=theme:{theme_id} query param on list/get/diff/restore once theme-deploys lands (additive; defaults to live).
- A POST …/versions "pin/label" endpoint (named checkpoints) if tuning workflows want it.
Architecture
POST /settings (app or storefront admin)
│
SettingsService.create_or_update_settings
│ merge + validate (existing)
│
┌─ content hash unchanged? ── yes ──► plain save, no version
▼ no
SettingsVersionService.capture
│ 1. PUT snapshot JSON → S3 (settings-versions bucket, private)
│ 2. TransactWriteItems (same table, two partitions — transactions span pks):
│ • Update live item (pk SHOP#{domain}, sk SETTINGS; condition record_version = :expected OR
│ attribute_not_exists(record_version); SET record_version = :expected+1) ◄── settings-concurrency
│ • Put version item (pk SHOPVER#{domain}, sk live#{n:010d})
▼
metafield mirror sync (existing, unchanged)
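Step 2 of the flow above can be illustrated by building the two transaction items by hand in DynamoDB's low-level attribute format. This is a sketch, not the canonical primitive (that is owned by settings-concurrency): the function name is hypothetical, and the live-item update shows only the counter, whereas the real write would also set the content attributes.

```python
def build_capture_transaction(
    table_name: str, domain: str, expected_version: int, version_attrs: dict
) -> list[dict]:
    """Return TransactItems: conditional live-item update + version-metadata put."""
    next_version = expected_version + 1
    live_update = {
        "Update": {
            "TableName": table_name,
            "Key": {"pk": {"S": f"SHOP#{domain}"}, "sk": {"S": "SETTINGS"}},
            # Sketch only: the real update also writes the content fields.
            "UpdateExpression": "SET record_version = :next",
            "ConditionExpression": (
                "record_version = :expected OR attribute_not_exists(record_version)"
            ),
            "ExpressionAttributeValues": {
                ":expected": {"N": str(expected_version)},
                ":next": {"N": str(next_version)},
            },
        }
    }
    version_put = {
        "Put": {
            "TableName": table_name,
            "Item": {
                **version_attrs,  # event_type, author_id, snapshot_s3_key, ...
                "pk": {"S": f"SHOPVER#{domain}"},  # dedicated history partition
                "sk": {"S": f"live#{next_version:010d}"},
                "version": {"N": str(next_version)},
            },
        }
    }
    return [live_update, version_put]
```

The returned list is shaped for a boto3 DynamoDB client call such as client.transact_write_items(TransactItems=items); both items commit or neither does, even though they sit in different partitions.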
Components:
- SettingsSnapshotStore (new, admin_server): thin S3 wrapper. put_snapshot(domain, scope, version, payload) -> s3_key, get_snapshot(s3_key) -> dict. Key scheme: {shop_domain}/{scope}/{version:010d}.json.
- SettingsVersionRepository (new, extends BaseRepository): put happens inside the save transaction (see below); list_versions(domain, scope, limit, exclusive_start_key), get_version(domain, scope, version). Requires adding ExclusiveStartKey/LastEvaluatedKey passthrough to BaseRepository.query_by_pk_and_sk_prefix (currently unsupported; a small additive change).
- SettingsVersionService (new): capture (computes hash/changed-summary, S3 put, builds transact items), list, get, diff, restore. Injected into SettingsService and routes via the existing DependencyContainer @cached_property pattern (dependencies.py).
- SettingsService (modified): create_or_update_settings gains the capture step; the final write is the canonical save_settings(settings, *, expected_version, change_source, extra_transact_items) primitive owned by settings-concurrency (see Low Level Design; pinned verbatim in both plans); this plan contributes one Put to that transaction (theme-deploys contributes a second on deploys). Sequencing of the two PRs is covered in Release.
- Write-path coverage: all content writes funnel through create_or_update_settings (callers: settings_routes.save_settings, settings_routes.initialize_settings, storefront_routes.save_settings, api_key_routes onboarding default-init; the update_ui_components wrapper is being deleted by settings-concurrency, so routes call create_or_update_settings directly). Infra-field writers (webhook_service, index_service, api_key_routes association fields) use the non-capturing update_settings path and are deliberately not captured (FR-6). A unit test pins this taxonomy so future fields must be classified.
Failure ordering
- S3 put fails → save fails (5xx). Tenet 1: no versionless saves. S3 single-region availability is well above this endpoint's needs.
- Transaction fails after S3 put (condition conflict, throttle) → orphan S3 object. Harmless: listing is DDB-driven, orphans are unreachable. Accrual is bounded by the conflict rate (each conflicted attempt, caller-managed 409s and failed service-managed attempts alike, orphans at most one object): even at a pessimistic 5% conflict rate at the full agent scenario (360K saves/mo) that's ~18K orphans/mo × 120KB ≈ 2GB/mo ≈ $0.05/mo fleet-wide. Orphans are written under the same {shop}/{scope}/ prefixes as live snapshots, so the bucket-wide 90-day STANDARD_IA lifecycle rule covers them automatically; no cleanup machinery in v1.
- Metafield sync fails after commit → 207-partial via the shared helper, which includes settings-concurrency's post-write metafield reconciliation (their §4.5). The version exists and is correct (it reflects DDB state, the source of truth).
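The orphan estimate in the second bullet can be reproduced with a few lines of arithmetic. The inputs are the plan's own figures; the storage price of $0.023 per GB-month (S3 Standard list price) and the use of decimal units are assumptions made here to show how the ~$0.05 figure falls out.

```python
# Inputs taken from the Failure ordering bullet above.
SAVES_PER_MONTH = 360_000        # full agent scenario, fleet-wide
CONFLICT_RATE = 0.05             # pessimistic share of attempts that conflict
SNAPSHOT_BYTES = 120_000         # ~120KB per snapshot (decimal KB assumed)
PRICE_PER_GB_MONTH = 0.023       # ASSUMPTION: S3 Standard list price, USD

# Each conflicted attempt orphans at most one snapshot object.
orphans_per_month = round(SAVES_PER_MONTH * CONFLICT_RATE)
orphan_gb_per_month = orphans_per_month * SNAPSHOT_BYTES / 1e9
added_cost_per_month = orphan_gb_per_month * PRICE_PER_GB_MONTH

print(f"{orphans_per_month} orphans/mo")
print(f"{orphan_gb_per_month:.2f} GB/mo of unreachable snapshots")
print(f"${added_cost_per_month:.2f}/mo added storage cost per month of accrual")
```

Each month of accrual adds roughly that amount to the monthly bill until the objects transition to STANDARD_IA, so the total stays small even over years.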
Data Storage / Modeling
Version metadata item (DynamoDB, ShopifyEntitiesTable)
New model in shopify_entities.py:
class ShopifySettingsVersion(ShopifyEntityBase):
    """
    Immutable version-history record for a settings save.

    pk: SHOPVER#{shop_domain}  (dedicated partition, never mixed with live entities)
    sk: live#{version:010d}  (live scope)
        theme#{theme_id}#{version:010d}  (theme scope, future)
    """
    entity_type: Literal["SETTINGS_VERSION"] = "SETTINGS_VERSION"
    version: int  # == live item's record_version after the captured write
    scope: str = "live"  # "live" | "theme:{theme_id}"
    event_type: Literal["save", "restore", "deploy", "backfill"]
    # change_source is supplied by every write path per settings-concurrency
    # (admin_ui | storefront_admin | onboarding | webhook_registration
    # | webhook_cleanup | index_lifecycle | theme_deploy | script)
    change_source: str
    # author_id is a UNIFORMLY PREFIXED actor string (storefront-admin-auth §4.6):
    # user:{cognito_sub} | token:{token_id} | api_key:{system_account_id}
    # | shopify_user:{shopify_user_id}. Capture normalizes embedded-app
    # writers (raw Shopify user id today) to the shopify_user: prefix.
    # Prefix->type mapping imported from admin_server/models/auth.py
    # (their PR1), never re-derived; unknown prefixes fail closed.
    author_id: str
    author_display: str | None = None  # email / token name, denormalized at write time
    ttl: int | None = None  # expiry hook, unset in v1 (see Retention)
    settings_schema_version: int  # copied from the saved settings
    content_hash: str  # sha256 of canonical content-fields JSON
    snapshot_s3_key: str
    snapshot_size_bytes: int
    changed: dict  # {"ui_components": [keys], "selector_components": [...], "configuration": [...]}
    restored_from: int | None = None  # event_type == "restore"
    source_scope: str | None = None  # event_type == "deploy"
    source_version: int | None = None
Size: < 2KB. Why a dedicated SHOPVER# partition rather than new sks under SHOP#{domain}: existing code sweeps the whole SHOP# partition — ShopifySessionRepository.list_shop_sessions (shopify_session_repository.py:135) calls query_by_pk(pk) and filters entity_type == "SESSION" client-side, and it runs on every storefront settings save and restore (access-token lookup in storefront_routes._get_access_token_for_shop). Thousands of version items in that partition would be read, transferred, and discarded on every save (5,000 × 2KB = 10MB ≈ 10 paginated 1MB queries per token lookup). A separate partition makes history invisible to all current and future SHOP# partition queries by construction (NFR-5), while TransactWriteItems still covers both items (same table; different pks are fine). Independently of this plan, list_shop_sessions should switch to query_by_pk_and_sk_prefix(pk, "USER#") as hygiene — noted in Impact, not load-bearing here.
Listing history is query_by_pk_and_sk_prefix(pk="SHOPVER#{domain}", sk_prefix="live#"), no GSI needed. The theme# prefix shares the partition but never matches the live# prefix query (and vice versa). Version items carry no system_account_id, so they can never appear in the sparse GSI_SystemAccountId.
Zero-padded version in the sk gives lexicographic == numeric ordering. The counter itself has no gaps (settings-concurrency increments by exactly 1 on every successful write), but history sequence numbers are sparse: the counter advances on infra-only writes too, while history items are created only for content writes. Documented in the model docstring and the API docs.
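To make the ordering claim concrete, here is a small sketch of a sort-key builder following the model docstring (live#{version:010d}, theme#{theme_id}#{version:010d}). The helper names version_sk and version_from_sk are hypothetical, and it assumes theme ids never contain "#".

```python
MAX_VERSION = 10**10 - 1  # largest counter value that still fits in ten digits


def version_sk(scope: str, version: int) -> str:
    """Build the history sort key for a scope ("live" or "theme:{theme_id}")."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError(f"version out of range for 10-digit padding: {version}")
    if scope == "live":
        prefix = "live"
    elif scope.startswith("theme:") and len(scope) > len("theme:"):
        prefix = "theme#" + scope[len("theme:"):]
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return f"{prefix}#{version:010d}"


def version_from_sk(sk: str) -> int:
    """Recover the numeric version from the zero-padded suffix."""
    return int(sk.rsplit("#", 1)[1])


# Sparse history: the counter also advances on infra-only writes, so only
# some numbers appear as history items. Ordering is unaffected.
history = [version_sk("live", n) for n in (100, 9, 10)]
assert [version_from_sk(sk) for sk in sorted(history)] == [9, 10, 100]

# Without padding, string order and numeric order disagree.
assert sorted(["live#9", "live#10", "live#100"]) == ["live#10", "live#100", "live#9"]
```

Ten digits leaves room for just under ten billion writes per record, far beyond the volumes discussed in this plan.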
Snapshot object (S3, new private bucket)
- Bucket: config.envify("shopify-settings-versions") in shopify_admin_stack.py. Private (BlockPublicAccess.BLOCK_ALL), SSE-S3, removal_policy matching the table (RETAIN-equivalent in prod via deletion_protection convention; follow point_in_time_recovery=config.is_prod style), lifecycle: transition to STANDARD_IA after 90 days, no expiry. Why not the existing assets bucket: it is public_read_access=True behind CloudFront, and merchant CSS/HTML and author metadata must not be world-readable. Why not the job-details bucket: different domain/lifecycle; mixing makes future retention rules hazardous.
- Object body: full post-save state of the three content fields plus envelope:
  {
    "snapshot_format": 1,
    "shop_domain": "laurageller.myshopify.com",
    "scope": "live",
    "version": 42,
    "settings_schema_version": 1,
    "captured_at": "2026-06-10T03:21:00+00:00",
    "settings": {"ui_components": {…}, "selector_components": {…}, "configuration": {…}}
  }
- Grant: bucket.grant_read_write(...) to the admin server task role and the backfill script role; env var SETTINGS_VERSIONS_BUCKET added to config.py (fail fast at startup if unset, matching existing config style; no silent fallback).
Why S3 payload + DDB metadata (and not DDB-only)
Real payloads are ~120KB; the DDB item limit is 400KB. Full snapshots as DDB items would work today, but:
- Ceiling coupling. A snapshot item is the live item plus metadata, so it hits the 400KB wall at the same moment the live item does, and history would be the thing that breaks first as merchants add CSS (LG grew from 115KB→119KB in one month of tuning).
- Partition + query economics. At the 200-shop agent scenario, bodies-in-DDB means 600MB of history per shop in the operational table, and every history page reads megabytes; metadata-only items keep the version partition at ~10MB per 5,000 versions and list pages at ~40KB (NFR-2) without ProjectionExpression gymnastics. Full numbers in Performance at scale.
- Ops affordance. S3 objects are directly inspectable/downloadable with existing diagnostics tooling (the tmp/*.json workflow, formalized), and lifecycle-tiering is free.
Cost of the hybrid: one extra PUT per save and one IAM grant. See Alternative Solutions for the full comparison.
The 400KB live-item ceiling (split-record question)
Raynor's review question: what happens when a shop's settings outgrow the DDB 400KB item cap, and should we introduce split-records (N shard items assembled on read)? Point by point:
(a) Live-record exposure, stated explicitly. S3 snapshots remove the cap for history bodies, but the live SETTINGS item my capture transacts against is still one DDB item bounded at 400KB (including attribute names). Versioning does not enlarge the live item (the transaction's second item is the ~2KB metadata Put in another partition), so the exposure is pre-existing and orthogonal — but a save exceeding the cap also fails to capture (the whole transaction fails), so versioning inherits the ceiling exactly as saves do.
(b) Failure mode today. An oversized put_item raises botocore ClientError (ValidationException, "Item size has exceeded the maximum allowed size"). Nothing in the repository or routes handles it specifically — it bubbles to the routes' generic except Exception and surfaces as an opaque 500 "Failed to save settings" (logged with exc_info). It is surfaced, not swallowed, and it fails before the metafield sync, so DDB and the mirror both retain the previous state — no corruption, just an undiagnosable error for the caller. A typed ItemTooLargeError → 413 should land with whichever PR first reworks the repository write path (suggested to settings-concurrency). Critically, DDB is not the first ceiling: the Shopify metafield mirror caps values at ~128KB, and real payloads are already ~120KB — see the two-tier interface in (d).
(c) Capture survives a future sharded live record. If the live record later becomes manifest + N shards, this design holds because of two deliberate properties: (1) snapshots are format-independent — SettingsSnapshotStore writes the fully materialized content dict assembled in the service layer, never raw DDB items, so a snapshot is identical whether the live record is one item or twelve, and every existing snapshot stays restorable across the transition (restore replays content through the save path, whatever shape that path writes); (2) version identity stays singular — record_version and its conditional check live on the manifest item only; shards carry no counter. A sharded content save is then one TransactWriteItems: conditional manifest update + N shard puts + 1 version-metadata put. DynamoDB transactions allow 100 items / 4MB aggregate (the older 25-item limit was raised in 2022), so atomic capture survives settings up to ~3.8MB — ~10× past the point where sharding becomes necessary. Beyond that, the metafield mirror (single value) breaks first and the whole settings-distribution model needs rethinking, not just storage.
(d) Recommendation: defer sharding, with a measured trigger. Evidence (corpus ordering aligned with settings-concurrency §6.1a): the largest real merchant payload is Muji CA at 120.6KB (tmp/muji_ca_settings_2026-05-29T1527.json), with LG second at 119.0KB (tmp/lg-settings-backup-2026-05-26.json) — ~30% of the cap. Growth is episodic, not steady: LG grew 90.8KB → 119.0KB (+28KB) during its single heaviest month of CSS customization (Apr 22 → May 26 2026 backups), then flattened. Even sustaining that worst-case burst rate continuously, 400KB is ~10 months out; integration phases end. Sharding now would touch every read path, the lock design, and the mirror for headroom we don't need.
Capacity thresholds follow the shared two-tier capacity interface owned by settings-concurrency (their canonical wording governs; not restated here):
- Tier 1 — metafield-bound paths (the FIRST ceiling): ~128KB Shopify metafield value cap, with 100KB warn / 128KB guard, enforced on every path that re-syncs the mirror — save, deploy, and this plan's restore/replay: restoring an old snapshot runs the same tier-1 guard before committing, so a restore can never succeed in DDB and then strand the mirror. At ~120KB real payloads this tier is already near — it, not DDB, is the operative constraint.
- Tier 2 — DDB live item: 300KB warn / 400KB hard cap; the split-record trigger.
Versioning makes both tiers cheap to observe: snapshot_size_bytes is already recorded on every version item and logged at capture, giving a per-shop size time series against both thresholds with no new measurement code. Crossing tier 2 triggers a dedicated split-record design, executed via a settings_schema_version bump + migration-registry entry (the vehicle this plan already ships), with version history as the rollback anchor for that migration; crossing tier 1 is a content-budget conversation with the merchant integration, not a storage problem.
Record-shape trajectory — agreed across plans (settings-concurrency §6.1; theme-deploys notified): single item now. The shared design constraint is that the optimistic-lock condition always lives on exactly one item. If content ever outgrows the item, two trajectories preserve that: (1) manifest + shards in DDB — root sk=SETTINGS holds record_version + the shard list, all shard writes in one transaction conditioned on the root only (analyzed in (c) above); (2) — concurrency's preferred — the content body moves to S3 with the DDB item keeping record_version + an S3 key, which is exactly this plan's snapshot architecture: the snapshot store is the working precedent that trajectory 2 is sound. Under either layout, version metadata items (~2KB) and history sequence numbers are unaffected because the root counter's semantics don't change. Theme-deploys multiplies record count, not record size — each staged theme record has its own 400KB budget — so it adds no ceiling pressure. The split-record design is triggered by the 300KB alarm and executed via a settings_schema_version bump + migration-registry entry, with version history as the rollback anchor. Ownership note: the 300KB payload metric and the typed ItemTooLargeError → 413 mapping land with settings-concurrency's Phase 1 (today's oversized save is an opaque 500 on UI saves and silently swallowed in webhook_service.py:668 — their finding and fix).
Retention
Keep everything in v1; ship the expiry mechanism but leave it disabled:
- ShopifyEntitiesTable already has TTL enabled on the ttl attribute (used by sessions, time_to_live_attribute="ttl" in shopify_admin_stack.py), so version-item expiry is a single optional ttl: int | None field on the model, with zero new infra. v1 writes it as None.
- S3 lifecycle: transition to STANDARD_IA after 90 days from day one; an expiry rule is added only when TTL is activated (the two must be enabled together, S3 expiry ≥ DDB TTL so a listed version always has its body).
- Activation trigger (documented policy, not automation): fleet snapshot storage > 1TB or any shop > 100K versions → enable ttl = created_at + 400 days for event_type="save" items only; restore/deploy/backfill items are kept as durable anchors.
Justification: at the 200-shop agent-team scenario (see Performance at scale) steady-state accrual is ~$2/mo fleet-wide S3 — deletion machinery active from day one is more code and more risk (audit loss) than the cost it saves; but designing the field in now means activation is a config change, not a migration.
Low Level Design
Canonical shared write interface (pinned VERBATIM with settings-concurrency)
The repository primitives are owned and implemented by settings-concurrency; the block below is their FINAL canonical text (mirrored character-for-character from their §4.2/§4.3, confirmed 2026-06-11) and appears identically in both plans (a change to one must be made in both):
# ShopifySettingsRepository — the ONLY write paths for SETTINGS* records.
# Neither method retries internally; conflicts raise SettingsConflictError (409).
def save_settings(
self,
settings: ShopifySettings,
*,
expected_version: int,
change_source: str,
extra_transact_items: list[dict] | None = None,
) -> ShopifySettings: ...
# Conditional full save; persists expected_version + 1. Plain conditional
# put_item when extra_transact_items is None; one TransactWriteItems with the
# identical ConditionExpression when provided. ConditionalCheckFailedException
# or TransactionCanceledException(ConditionalCheckFailed) -> SettingsConflictError.
def update_settings(
self,
shopify_domain: str,
updates: dict,
*,
expected_version: int,
change_source: str,
) -> int: ...
# INFRA FIELDS ONLY: raises ValueError if updates touches
# ui_components / selector_components / configuration (fail-fast guard).
# Not captured by versioning (FR-6). Never retries internally; bounded
# retry (<=3, re-reading the version each attempt) lives at the calling service.
# SettingsService — single content-write funnel (update_ui_components is deleted).
def create_or_update_settings(
    self,
    shop_id: str,
    updates: dict,
    updated_by_user_id: str,
    *,
    expected_version: int | None,  # REQUIRED keyword, no default: int = user-facing one-shot; None = service-managed mode
    change_source: str,
    event_type: str = "save",  # versioning capture: save | restore | deploy | backfill
    restored_from: int | None = None,  # versioning capture: source version for restores
    source_scope: str | None = None,  # versioning capture: deploy events ("theme:{theme_id}")
    source_version: int | None = None,  # versioning capture: deploy events
) -> ShopifySettings: ...
Semantics riders pinned with the block (stated identically in both plans):
- Service-managed mode (expected_version=None) re-runs the FULL pipeline per attempt including capture (fresh read, fresh merge + content hash, fresh provisional version, fresh S3 snapshot, fresh conditional write), max 3 attempts; failed-attempt snapshots are the accepted orphan class (quantified in Failure ordering). None does not mean unconditional/last-writer-wins. User-facing mode (expected_version=int) is one attempt → 409. Capture happens only for content-changing writes, on whichever attempt commits: never more than one version item per successful save.
- Infra writers bypass capture entirely via update_settings (webhook metadata merges, active_index clears, association fields), the non-capturing path per FR-6, which fail-fast rejects content fields with a ValueError so the content/infra taxonomy is enforced at the primitive, not by convention; their bounded retries live at the calling services (webhook_service / index_service / api_key_routes) and involve no snapshots.
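The fail-fast guard in the second rider can be pictured with a small sketch. The helper name `assert_infra_only` and the constant `CONTENT_FIELDS` are hypothetical; only the three content field names and the ValueError behaviour come from the canonical block.

```python
# Sketch only: helper and constant names are hypothetical, not part of the canonical block.
CONTENT_FIELDS = frozenset({"ui_components", "selector_components", "configuration"})


def assert_infra_only(updates: dict) -> None:
    """Raise ValueError if an infra-field update touches any content field."""
    touched = CONTENT_FIELDS.intersection(updates)
    if touched:
        raise ValueError(
            f"update_settings is infra-only; content fields rejected: {sorted(touched)}"
        )


# An infra-only update passes silently; a content field is rejected up front.
assert_infra_only({"active_index": None})
```

Placing the check at the primitive means a writer that drifts across the content/infra boundary fails on its first call rather than silently producing an unversioned content change.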
Contract details this plan additionally depends on:
- The condition builder is a standalone _version_condition(expected_version) precisely so it can be reused inside transaction items; both ConditionalCheckFailedException and TransactionCanceledException(ConditionalCheckFailed) map to SettingsConflictError (extends ServiceError, 409).
- extra_transact_items may carry more than one Put: this plan's version-metadata item always; theme-deploys' rolling deploy-backup item additionally on deploys.
- The source_scope / source_version kwargs are the deploy-event addendum negotiated 2026-06-10, now part of the canonical block in both plans.
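As an illustration of why a standalone condition builder helps, the sketch below reuses one condition fragment for the live-item Put inside a transaction. It assumes low-level DynamoDB JSON and uses the condition expression quoted under Cross-plan interfaces; `build_transact_items` and the table/item values are hypothetical names for this example, not the settings-concurrency implementation.

```python
from __future__ import annotations


def _version_condition(expected_version: int) -> dict:
    """Condition fragment in low-level DynamoDB JSON, usable in put_item or a transact Put."""
    return {
        "ConditionExpression": (
            "record_version = :expected OR attribute_not_exists(record_version)"
        ),
        "ExpressionAttributeValues": {":expected": {"N": str(expected_version)}},
    }


def build_transact_items(
    table: str,
    live_item: dict,
    expected_version: int,
    extra_transact_items: list[dict] | None = None,
) -> list[dict]:
    """Hypothetical helper: the conditional live Put first, then any extra Puts."""
    live_put = {
        "Put": {"TableName": table, "Item": live_item, **_version_condition(expected_version)}
    }
    return [live_put, *(extra_transact_items or [])]


items = build_transact_items(
    "ShopifyEntities",
    {"pk": {"S": "SHOP#example.myshopify.com"}, "sk": {"S": "SETTINGS"}},
    expected_version=41,
    extra_transact_items=[{"Put": {"TableName": "ShopifyEntities", "Item": {}}}],
)
```

The resulting list is the shape a `TransactItems` argument takes, so the same fragment guards the single-item and the transactional path identically.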
Capture (inside create_or_update_settings, one attempt of the pipeline)
# Service-managed mode wraps THIS WHOLE BLOCK in a bounded loop (max 3) per the
# contract above; caller-managed mode runs it exactly once. The repo signature
# takes expected_version: int, so the service resolves None (service-managed)
# to the freshly-read counter on EVERY attempt before calling the repo.
def _save_attempt(self, shop_id, updates, updated_by_user_id, *, change_source,
                  event_type, restored_from, source_scope, source_version,
                  expected_version) -> ShopifySettings:
    existing = self.repository.get_settings(shop_id)  # fresh read every attempt
    expected = expected_version if expected_version is not None \
        else (existing.record_version if existing else 0)  # None -> per-attempt resolution (int for repo)
    settings = self._merge_settings(existing, updates) if existing else self._create_default_settings(...)
    new_hash = content_hash(settings)  # sha256 over canonical content-fields JSON
    if existing and new_hash == content_hash(existing):
        # No content change: plain conditional save of timestamps/author, no version. (FR-7)
        persisted = self.repository.save_settings(settings, expected_version=expected,
                                                  change_source=change_source)  # cheap path, no transact
        return persisted
    next_version = expected + 1  # provisional; == record_version after a successful commit
    s3_key = self.snapshot_store.put_snapshot(shop_id, "live", next_version, content_fields(settings))
    version_item = build_version_item(..., changed=changed_summary(existing, settings))
    persisted = self.repository.save_settings(  # one transact_write_items call
        settings, expected_version=expected, change_source=change_source,
        extra_transact_items=[version_item.to_transact_put()],
    )
    return persisted  # NOT the pre-save object: carries the COMMITTED record_version
Notes:
- content_hash canonicalizes via json.dumps(..., sort_keys=True, separators=(",", ":")) over only the three content fields: infra-field churn never creates versions (FR-6) and never suppresses real changes.
- next_version is provisional until the conditional write commits; a conflicted attempt orphans its S3 object (harmless, unreachable: DDB metadata is the index). In caller-managed mode the conflict surfaces as 409; in service-managed mode the next attempt recomputes everything under the new counter value.
- Transaction mechanics (verified against the codebase): the boto3 resource API does not support transactions, so save_settings uses a low-level client, and a standalone boto3.client("dynamodb"), not table.meta.client, because moto has a deserialization bug when transactions go through the resource's internal client (precedent with explanation: components/controller/merchandise/services/synonym_service.py:102-106). Transaction items use DynamoDB JSON, so version_item.to_transact_put() serializes via boto3.dynamodb.types.TypeSerializer (precedent: synonym_service.py _to_ddb_item, and its _execute_transaction shows the TransactionCanceledException mapping). Moto transaction tests follow the existing synonym-service test patterns.
- changed_summary lists component keys whose serialized value differs: cheap (dict compare), powers the list-view UX without payload fetches.
- Return the persisted object, never the pre-save one: save_settings returns the committed ShopifySettings (canonical block), and create_or_update_settings must propagate it. The pre-save object carries a stale record_version, which would leak into save/restore API responses and the metafield reconciliation helper. Pinned by Testing #19.
- The record_version field is added to ShopifySettings by settings-concurrency; if their PR hasn't merged when this builds, this plan introduces the attribute with the agreed semantics and they adopt it (see Cross-plan).
Restore
def restore(self, shop_id, version, author_id, expected_version=None) -> RestoreResult:
    meta = self.version_repo.get_version(shop_id, "live", version)  # 404 if missing
    snapshot = self.snapshot_store.get_snapshot(meta.snapshot_s3_key)
    content = migrate_snapshot(snapshot)  # schema drift, below
    settings = self.settings_service.create_or_update_settings(
        shop_id, content, author_id,
        change_source="storefront_admin", event_type="restore",
        restored_from=version, expected_version=expected_version,  # None => service-managed mode
    )
    return settings  # route then runs the shared save/restore mirror-sync helper
The route handler factors the existing metafield-sync-with-207 block in storefront_routes.save_settings into a shared helper used by both endpoints (it is currently ~70 inline lines; extraction is a pure refactor covered by existing route tests). The helper also carries settings-concurrency's post-write metafield reconciliation (their §4.5), so restore's mirror behavior is byte-identical to save's by construction — pinned by a failing-first test (Testing #14).
Schema drift between versions
ShopifySettings.settings_schema_version exists (currently always 1, never bumped). Restore handles drift explicitly:
1. Same version (the only case today): validate snapshot["settings"] through the current Pydantic model inside create_or_update_settings (which already constructs ShopifySettings(**merged)); a validation failure is a 422: fail fast, no partial restore.
2. Older snapshot (snapshot.settings_schema_version < CURRENT_SETTINGS_SCHEMA_VERSION): run an ordered migration registry before merging:
   SETTINGS_MIGRATIONS: dict[int, Callable[[dict], dict]] = {}  # {from_version: migrate_fn}, applied sequentially
   v1 ships the registry empty plus the loop and tests; whoever first bumps settings_schema_version is forced (by a unit test asserting registry coverage of 1..CURRENT-1) to provide a migration. This converts "schema drift" from a restore-time surprise into a compile-time-ish checklist item.
3. Newer snapshot (> CURRENT): 409 with an explicit message. This occurs only after a code rollback; restoring a future-schema payload through old code is undefined and refused. (Fail fast.)
4. snapshot_format (the envelope version) is checked the same way: unknown format → 500 with explicit error.
Status-code rationale (why 409 for case 3 but 500 for case 4): a newer settings_schema_version is a legitimate, client-visible state conflict — the system was rolled back and the caller can act on it (pick an older version, wait for roll-forward), which is what 409 means. An unrecognized snapshot_format is an internal invariant violation — we wrote every snapshot, so an unreadable envelope means our own writer/reader contract broke; that's a 500 and a bug, never a caller-actionable state.
Additionally, because restore replays through the normal save path, defaults-merging in get_settings_with_defaults continues to paper over additive component drift at read time exactly as it does for old live records today.
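The drift handling above could look roughly like the following sketch. The envelope keys (`snapshot_format`, `settings_schema_version`, `settings`) follow the names used in this section, but the exact envelope layout, the exception class names, and the `SUPPORTED_SNAPSHOT_FORMAT` constant are assumptions; the route layer would map the two exceptions to 409 and 500 respectively.

```python
from __future__ import annotations

from typing import Callable

CURRENT_SETTINGS_SCHEMA_VERSION = 1
SUPPORTED_SNAPSHOT_FORMAT = 1  # hypothetical envelope version
SETTINGS_MIGRATIONS: dict[int, Callable[[dict], dict]] = {}  # {from_version: migrate_fn}


class SnapshotTooNewError(Exception):
    """Hypothetical: snapshot schema is newer than the running code (409)."""


class SnapshotFormatError(Exception):
    """Hypothetical: unrecognized envelope format, an internal invariant violation (500)."""


def migrate_snapshot(snapshot: dict, *, current: int = CURRENT_SETTINGS_SCHEMA_VERSION,
                     migrations: dict[int, Callable[[dict], dict]] | None = None) -> dict:
    """Return the snapshot's settings upgraded to the current schema version."""
    migrations = SETTINGS_MIGRATIONS if migrations is None else migrations
    if snapshot.get("snapshot_format") != SUPPORTED_SNAPSHOT_FORMAT:
        raise SnapshotFormatError(f"unknown snapshot_format: {snapshot.get('snapshot_format')!r}")
    version = snapshot["settings_schema_version"]
    if version > current:
        raise SnapshotTooNewError(f"snapshot schema {version} > current {current}")
    settings = snapshot["settings"]
    while version < current:
        # A missing registry entry raises KeyError here; the coverage unit test
        # described above is what keeps that from happening in practice.
        settings = migrations[version](settings)
        version += 1
    return settings


# Synthetic v0 -> v1 migration, as in the planned test.
old = {"snapshot_format": 1, "settings_schema_version": 0, "settings": {"a": 1}}
migrated = migrate_snapshot(old, current=1, migrations={0: lambda s: {**s, "b": 2}})
```

Same-version snapshots pass through the loop untouched, so Pydantic validation inside create_or_update_settings remains the only gate for the common case.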
Diff
Pure function over two content dicts (current state is fetched from the live item; historical states from S3): walk ui_components / selector_components / configuration; per component key → added/removed; per scalar field → difflib.unified_diff for strings ≤ 64KB. Lives beside the service, fully unit-testable, no I/O.
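A rough sketch of such a pure diff over one component dict, using the standard-library difflib. The result shape, the helper names, and the use of character count as a stand-in for the 64KB limit are assumptions for illustration; non-string values and oversized strings are reported as modified without a diff body.

```python
import difflib

MAX_DIFF_CHARS = 64 * 1024  # assumption: character count approximates the 64KB limit


def diff_value(before, after):
    """Unified diff for two strings within the size limit, else None (no diff body)."""
    if (isinstance(before, str) and isinstance(after, str)
            and max(len(before), len(after)) <= MAX_DIFF_CHARS):
        lines = difflib.unified_diff(
            before.splitlines(), after.splitlines(),
            fromfile="before", tofile="after", lineterm="",
        )
        return "\n".join(lines)
    return None


def diff_components(old: dict, new: dict) -> dict:
    """Classify keys as added / removed / modified; no I/O, fully deterministic."""
    result = {"added": [], "removed": [], "modified": {}}
    for key in sorted(set(old) | set(new)):
        if key not in old:
            result["added"].append(key)
        elif key not in new:
            result["removed"].append(key)
        elif old[key] != new[key]:
            result["modified"][key] = diff_value(old[key], new[key])
    return result


example = diff_components(
    {"header": "<div>\nold\n</div>", "footer": "x"},
    {"header": "<div>\nnew\n</div>", "banner": "y"},
)
```

In this example `banner` is added, `footer` is removed, and `header` carries a unified diff whose body contains the `-old` and `+new` lines.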
Dependencies
- DynamoDB transactions: via the low-level client (transact_write_items; the resource API doesn't support transactions), with a standalone boto3.client per the moto bug documented at synonym_service.py:102-106, and TypeSerializer-encoded items. 2–3 item transactions, well under the 100-item/4MB limits. Moto supports client-level transactions (existing synonym-service tests prove it).
- S3: new bucket; admin server task role grant. Assumption: admin server's boto3 session already carries region/creds (it does: DDB + existing buckets).
- settings-concurrency plan (docs/plans/settings-concurrency-control.md): version counter + the canonical write primitives (pinned in LLD). Hard prerequisites: (1) the primitives themselves; (2) their M1 fix switching the storefront admin GET /settings to a response carrying the current version (their §5.1); without it no client can supply a restore/save guard. They also own: the payload-size metric + typed 413 for oversized items (their Phase 1), the post-write metafield reconciliation helper (their §4.5) that restore reuses, and the stale "DDB stream → KV export" docstring fixes in storefront_routes.py (their Phase-1 PR).
- Shopify uplift roadmap (parallel team): its admin_server track includes a fail-fast PR (B3) that funnels all settings writes through a single repository write path as an explicit "seam for the concurrency plan", plus golden metafield-payload tests (B1). If B3 lands first, this plan's capture hook attaches to that funnel; the golden tests also protect the restore→metafield path for free. No conflict, but implementers should rebase against that track.
- storefront-admin-auth plan (docs/plans/storefront-admin-sso.md): author identity (uniformly prefixed author_id + author_display; prefix↔type mapping imported from their admin_server/models/auth.py) + scope enforcement (settings:write + settings:deploy_live on live-target restore). Soft interface, agreed: v1 normalizes today's identities at capture (shopify_user:{auth.user_id}, api_key:{system_account_id}) and picks up their StorefrontAuthContext when it lands; legacy raw API keys remain a permanent author shape.
- theme-deploys plan (docs/plans/theme-targeted-deploys.md): consumes this plan's scope / event_type model and the deploy kwargs; their rolling SETTINGS#DEPLOY#BACKUP record is an interim undo mechanism retired when restore lands. They depend on us, not vice versa.
- Shopify metafields: unchanged usage; restore adopts the shared helper including concurrency's §4.5 reconciliation (see Restore). The metafield is the only mirror restore must re-sync (ShopifyEntitiesTable has no stream; ecom_settings_exporter reads IndexSettings).
Engineering Excellence
Consistency and Integrity
Race: two writers, version capture vs. lock.
- Agent A reads settings (lock version 41), prepares save.
- Human B reads settings (41), saves first → live item now 42, version item live#0000000042 written atomically.
- A's transaction conditions on record_version = 41 → TransactionCanceledException → SettingsConflictError → 409 to A (caller-managed) or a fresh full-pipeline attempt (service-managed). A's pre-written S3 object for "42" is orphaned and unreachable (DDB metadata is the index). No torn state: metadata and live item commit together or not at all.
Race: restore vs. concurrent save — identical to the above. With expected_version supplied (caller-managed), the window is closed: 409. Without it, restore runs in service-managed mode — each attempt is still conditional (a concurrent save fails the attempt, never silently loses), and the bounded re-run re-reads before re-applying; this is strictly stronger than today's unconditional put.
Invariant: a version item exists ⇒ its snapshot exists (S3 written first); a snapshot may exist without a version item (orphan, ignorable). Stated in code comments and pinned by a test on the write ordering.
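The race and the invariant can be exercised with a toy in-memory model. Everything here is hypothetical scaffolding (the `FakeStore` class, the snapshot key format with a uniqueness suffix, the `before_write` test hook); it only mimics the ordering described above: snapshot first, then one conditional commit of live item plus version item.

```python
class Conflict(Exception):
    """Stands in for SettingsConflictError."""


class FakeStore:
    def __init__(self):
        self.live = {"record_version": 41, "content": "old"}
        self.versions = {}  # version number -> snapshot key (the DDB metadata index)
        self.s3 = {}        # snapshot key -> content

    def put_snapshot(self, version, content):
        key = f"live/{version:010d}/{len(self.s3)}"  # hypothetical key format
        self.s3[key] = content
        return key

    def transact_save(self, content, expected, s3_key):
        """Live item and version item commit together or not at all."""
        if self.live["record_version"] != expected:
            raise Conflict
        self.live = {"record_version": expected + 1, "content": content}
        self.versions[expected + 1] = s3_key


def save_service_managed(store, content, before_write=None, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        expected = store.live["record_version"]          # fresh read every attempt
        key = store.put_snapshot(expected + 1, content)  # S3 written first
        if before_write:
            before_write(attempt)                        # test hook: concurrent writer
        try:
            store.transact_save(content, expected, key)
            return store.live["record_version"]
        except Conflict:
            continue
    raise Conflict


store = FakeStore()


def rival(attempt):
    # Human B commits between A's read and A's write, on A's first attempt only.
    if attempt == 1:
        store.transact_save("B", 41, store.put_snapshot(42, "B"))


committed = save_service_managed(store, "A", before_write=rival)
orphans = set(store.s3) - set(store.versions.values())
```

After the run, B holds version 42, A's retry holds version 43, every version item points at an existing snapshot, and exactly one snapshot (A's first attempt) is an unreferenced orphan.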
Reliability & Resilience
Failure modes are enumerated in Architecture → Failure ordering. The deliberate availability trade (S3 outage blocks saves) is Tenet 1 and called out for sign-off. Restore is idempotent in effect: restoring the same version twice produces two identical-content saves, the second of which is deduped by FR-7 (no duplicate version, plain timestamp save).
Scalability
See the dedicated Performance at scale section below — sized for 200 shops × thousands of versions each.
Observability
- Structured logs on capture (shop, version, event_type, snapshot_size_bytes, changed keys) and restore (restored_from, new_version, mirror result).
- Metric/alarm candidates (follow existing admin-server logging patterns; no new metrics infra in v1): count of 5xx on save (would now include S3 failures), count of restores.
- Size alarms: the two-tier capacity interface (tier 1: 100KB warn / 128KB metafield guard; tier 2: 300KB warn / 400KB DDB) and the typed 413 mapping are owned by settings-concurrency; their canonical wording governs. This plan's restore/replay paths enforce the tier-1 guard, and snapshot_size_bytes on every version item supplies the complementary historical per-shop size series against both tiers at zero extra measurement cost.
Security
- New bucket private + SSE; least privilege: only the admin-server task role (and one-off backfill role) gets grant_read_write. No CDN, no public ACLs (NFR-3).
- Version payloads contain merchant HTML/CSS and author IDs: same sensitivity class as the live table; no new data categories.
- All new endpoints sit behind the existing API-key auth + shop-ownership resolution; once storefront-admin-auth lands, live-target restore declares require_scopes("settings:write", "settings:deploy_live"), identical to the live save it is equivalent to (Cross-plan).
- Snapshots are immutable by convention; no update/delete API is exposed.
Testing
moto for DDB + S3; fixtures extend admin_server.testing.fixtures (add settings_versions_bucket fixture mirroring existing bucket fixtures). Per CLAUDE.md, each new branch/error path gets a test that fails before the change:
Capture:
1. Content save creates exactly one version item + one S3 object; payload matches post-save state.
2. Metadata-only writes (webhook_service metadata, active_index clear, association fields via update_settings) create no version (taxonomy pin test).
3. Identical-content save creates no version but updates last_updated (FR-7), via the cheap non-transact path.
4. First-ever save for a shop creates v1 with event_type="save".
5. Caller-managed conflict (single attempt, conditional failure) → 409, no version item written, S3 orphan tolerated.
6. S3 put failure → save fails 5xx, live item unchanged.
7. changed summary correctness for add/modify/remove component cases.
8. Service-managed retry × capture: simulate a conflict on attempt 1 (concurrent writer bumps the counter between read and write), success on attempt 2 → exactly one version item exists, keyed to the committed counter value, with content from the fresh attempt-2 merge; attempt-1's snapshot is an unreferenced orphan; never two version items for one logical save.
API:
9. List: ordering (newest first), pagination cursor round-trip, empty history.
10. Get: payload round-trip; 404 unknown version.
11. Diff: added/removed/modified components; large-string fallback (no diff body); against=current and against=N; to_version equals the live record_version at diff time.
12. Restore happy path: new version with restored_from, content matches snapshot, infra fields untouched, mirror synced via the shared helper (mock ShopifyService).
13. Restore with no access token → 207 partial, DDB restored (mirrors existing save test).
14. Restore mirror parity (failing-first): restore invokes the same shared helper as save, including concurrency's §4.5 post-write metafield reconciliation — asserted by replacing the helper with a sentinel and verifying both endpoints call it with equivalent arguments.
15. Restore 409 on expected_version mismatch (caller-managed); restore without expected_version runs service-managed (asserted: conditional write attempted, not unconditional); 409 on future settings_schema_version; 422 on model-invalid snapshot; migration registry applied for older schema version (synthetic v0→v1 migration in test).
16. Auth: endpoints reject keys not owning the shop; live-target restore requires settings:write + settings:deploy_live once scope enforcement lands (reuse existing storefront route auth tests pattern).
Backfill:
17. Script is idempotent (re-run creates no duplicate version items), skips shops with existing history, and its transaction's ConditionCheck aborts cleanly when live traffic bumps the counter mid-backfill (no version item written for that shop on that pass; re-run picks it up).
Committed-version propagation:
19. Save and restore responses, and the payload handed to the metafield reconciliation helper, carry the committed record_version (the persisted object returned by save_settings), never the pre-save object's stale value — failing-first against an implementation that returns the merged input.
20. Restore of a snapshot exceeding the tier-1 metafield guard fails 422 settings_too_large (shared tier-1 status per concurrency §6.1) before any DDB write (mirror can never be stranded by a restore).
Performance isolation:
21. Partition-isolation pin: insert 1,000 version items for a shop; get_settings, list_shop_sessions, and list_settings_by_system_account results are unchanged (NFR-5).
Run via pants test //components/shopify/admin_server:: plus the e2e suite in CI; no local e2e.
Performance at scale
Sizing basis (per Raynor's review question): 200 shops, each with an automated agent team saving settings edits — low thousands of version records per shop. Concretely: 200 shops × 5,000 versions × ~120KB payloads, with hot shops sustaining ~1 save/minute during agent sessions.
1. Reads of the live settings record — O(1), provably unaffected
get_settings is GetItem(pk="SHOP#{domain}", sk="SETTINGS"): an exact composite key, with cost fixed at ~1 RCU per 4KB of that one item (about 30 RCU for a ~120KB item on a strongly consistent read), independent of anything else in the table. Version items live in a different partition (SHOPVER#{domain}), so they cannot appear in:
- any GetItem/Query on SHOP#{domain}, including the full-partition sweep in list_shop_sessions (query_by_pk, filters client-side), which runs on every storefront save/restore for the access-token lookup. This sweep is exactly the failure mode the sk-prefix placement would have caused: at 5,000 versions × 2KB it would have added ~10MB (≈10 paginated queries, ~1,250 RCU) to every save. The dedicated partition eliminates the class, not just the instance.
- a hypothetical future begins_with(sk, "SETTINGS") query: version sks (live#…) don't share the prefix, and version pks don't share the partition.
- GSI_SystemAccountId: sparse; version items never carry system_account_id.
A pin test asserts that list_shop_sessions and get_settings results are byte-identical before/after inserting 1,000 version items for the shop.
2. List-versions pagination at thousands of items
Query reads only the requested page: limit=20 × 2KB ≈ 40KB ≈ 5 RCU (eventually consistent), independent of total history depth. DDB Query with Limit + ExclusiveStartKey never scans past the page boundary, and newest-first ordering means the common request (recent history) is always the first page. Walking all 5,000 versions costs ~10MB, i.e. ~10 sequential pages of 1MB; only the backfill/ops tooling ever does that, never the UI (page size capped at 100). No FilterExpression is used anywhere in the listing path (filters would read-then-discard and break Limit semantics).
3. Partition growth and the 10GB question
- Version metadata partition (SHOPVER#{domain}): 5,000 × 2KB = 10MB; even 100,000 versions = 200MB. The ~10GB partition guidance is 3 orders of magnitude away; on-demand tables also split hot partitions automatically. Because snapshot bodies are in S3, partition size grows at 2KB/version, not 120KB/version. This is the decisive reason for S3-bodies + DDB-pointers (storing bodies in DDB would put 5,000 × 120KB = 600MB per shop into the hot operational table and make every history page read megabytes).
- S3: 5,000 × 120KB = 600MB/shop; ×200 shops = 120GB fleet ≈ $1.5/mo at IA. Unbounded growth is handled by the retention mechanism (TTL field + lifecycle, pre-plumbed, disabled in v1; see Retention).
- Live item partition: unchanged — still exactly one ~120KB settings item plus sessions/API-key items.
4. Write amplification per save
| | Today | With versioning (+ concurrency) |
|---|---|---|
| DDB | 1 PutItem, ~120KB ≈ 120 WCU | 1 TransactWriteItems (live update 120KB + version put 2KB), transact 2× premium ≈ 244 WCU |
| S3 | — | 1 PUT (~120KB) |
| Latency | ~10–20ms | +~10ms (transact) +20–50ms (S3 PUT, same region) |
Latency context: the save path already makes 1–3 Shopify GraphQL calls (metafield sync, token checks) at 200–800ms each; +60ms is noise. Cost context at the aggressive scenario — 200 shops × 60 saves/day = 12K saves/day ≈ 360K/mo: today 43M WCU ≈ $54/mo; after, 88M WCU ≈ $110/mo + $2/mo S3 PUTs. The increment is dominated by the transact 2× premium, which optimistic locking (settings-concurrency) requires regardless of versioning; versioning's own marginal cost is the 2KB item + S3 ($3/mo fleet). High-frequency agent saves hit two external ceilings long before DDB: Shopify GraphQL rate limits on the metafield mirror, and the per-partition 1000 WCU/s cap on the live item (~8 saves/s/shop at 120KB — pre-existing, unchanged by this design since the version item lands in a different partition).
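The write-cost figures above can be reproduced with a few lines of arithmetic. The per-million price below is an assumption chosen because it is the rate implied by the $54 / 43M WCU figure; the 30-day month is likewise implied by 12K saves/day ≈ 360K/mo.

```python
import math

WCU_PRICE_PER_MILLION = 1.25  # assumption: rate implied by the figures in this section


def wcu(size_kb: float, transactional: bool = False) -> int:
    """1 WCU per 1KB (rounded up); transactional writes cost double."""
    units = math.ceil(size_kb)
    return units * (2 if transactional else 1)


saves_per_month = 200 * 60 * 30  # shops x saves/day x days

today_wcu = wcu(120) * saves_per_month
after_wcu = (wcu(120, transactional=True) + wcu(2, transactional=True)) * saves_per_month

today_cost = today_wcu / 1_000_000 * WCU_PRICE_PER_MILLION
after_cost = after_wcu / 1_000_000 * WCU_PRICE_PER_MILLION
```

This gives 43.2M WCU today against 87.84M WCU after, matching the rounded 43M and 88M in the text; the 2KB version item contributes only 4 of the 244 WCU per save.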
5. Retention/TTL as the mitigation
Pre-plumbed, disabled in v1, activation is a config change (details in Data Storage → Retention): table TTL already enabled on ttl; version items gain an optional ttl; matching S3 expiry rule activated together with it; restore/deploy/backfill versions exempt so rollback anchors survive. The scale analysis above shows TTL is a cost lever, not a performance prerequisite — no read or write path degrades with version count.
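One way to picture the pre-plumbed TTL field is the sketch below. The function name, the `retention_days` parameter, and the "None means retention disabled" convention are assumptions; the exemption set and the optional ttl attribute come from the description above, and the value is epoch seconds because that is the format DynamoDB TTL expects.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

EXEMPT_EVENT_TYPES = frozenset({"restore", "deploy", "backfill"})  # rollback anchors survive


def version_ttl(event_type: str, created_at: datetime, retention_days: int | None) -> int | None:
    """Epoch-seconds ttl for a version item, or None to omit the attribute."""
    if retention_days is None or event_type in EXEMPT_EVENT_TYPES:
        return None
    return int((created_at + timedelta(days=retention_days)).timestamp())


created = datetime(2026, 6, 10, tzinfo=timezone.utc)
disabled = version_ttl("save", created, None)    # v1 behaviour: no ttl written
exempt = version_ttl("restore", created, 90)     # anchors never expire
expiring = version_ttl("save", created, 90)      # ordinary save under a 90-day policy
```

With this shape, switching retention on is a matter of supplying a non-None `retention_days` from configuration; existing items without the attribute are simply never expired.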
Counter interaction
None of this changes the shared record_version design: the counter lives on the live item and its conditional increment is unaffected by where history rows are stored. (Confirmed no change needed with settings-concurrency.)
Key Risks
- Sequencing with settings-concurrency. The transactional write primitive is shared. Mitigation: agreed attribute + primitive interface (below); whichever PR lands first introduces it.
- Save-path availability now includes S3. Accepted trade (Tenet 1); flagged for explicit sign-off.
- Author identity churn. Mitigated: agreed with storefront-admin-auth on self-contained author_id actor strings + denormalized author_display, with no foreign key into their system.
- tmp/-era expectations. Operators must learn the new flow; mitigated by updating docs/integrations/storefront-css-customization-guide.md (it currently teaches manual backups) in the UI/API PR.
- Size ceilings. Pre-existing, not introduced by versioning, but shared with it: the metafield mirror (~128KB) binds before DDB (400KB). Mitigated by the shared two-tier capacity interface (concurrency-owned thresholds; restore/replay enforce the tier-1 guard) plus the snapshot_size_bytes time series and a pre-agreed sharding trajectory; see "The 400KB live-item ceiling".
Cost Analysis
Full derivation in Performance at scale §4. Summary at the 200-shop agent-team scenario (360K saves/mo fleet-wide): DDB write cost rises from ~$54/mo to ~$110/mo — almost entirely the TransactWriteItems 2× premium required by optimistic locking (settings-concurrency) regardless of versioning; versioning's own marginal cost (2KB items + S3 PUTs + storage) is ~$5/mo fleet-wide. Steady-state (today's real save rates, ~2 orders of magnitude lower): noise. New bucket: $0 fixed.
Release / Roll-out
Phased, each phase shippable and independently revertible:
- PR1 (infra): bucket + grants + SETTINGS_VERSIONS_BUCKET config (fail-fast startup check deferred to PR2 so PR1 deploys cleanly before the env var consumer exists; PR2 adds the strict check).
- PR2 (capture): models, SettingsSnapshotStore, SettingsVersionRepository, capture in create_or_update_settings, transact primitive (coordinated with settings-concurrency: if their conditional-write PR is merged, extend it; else introduce the agreed primitive). From this deploy, every content save is versioned. Rollback = revert PR2; saves regress to current behavior, existing version data stays valid.
- Backfill (one-off script, scripts/ecom/backfill_settings_versions.py): scan ShopifyEntities for sk=SETTINGS; for each shop with no history, snapshot the current state and write the version item (event_type="backfill", version = the counter value read, author = script principal). Collision safety with live traffic: the version item is written in a TransactWriteItems of [ConditionCheck on the live item asserting record_version still equals the value read (or attribute_not_exists for legacy), Put version item with attribute_not_exists(sk)]. If a live save lands mid-backfill the transaction cancels cleanly and the re-run picks the shop up at its new counter value; a backfill item can never collide with or shadow a live-traffic version number. Idempotent; run once per env after PR2. Production write → human-confirmation gate per CLAUDE.md (--dry-run defaults on; show output first).
- PR3 (API): list/get/diff/restore endpoints + shared metafield-sync helper refactor (carrying concurrency's §4.5 reconciliation). (The stale stream-docstring fix is owned by settings-concurrency's Phase-1 PR.)
- PR4 (UI; components/storefront_admin, the standalone settings editor): versions panel as a new section alongside the existing settings sections: table (time, author, event type, changed components), "View"/"Diff vs current", "Restore…" with confirmation dialog showing the diff summary. No editing. Why storefront_admin and not admin_worker: it is the surface that already consumes the storefront admin API (storefront_routes) these endpoints extend, and it is where storefront-admin-auth's session/token auth and the settings:deploy_live scope land; admin_worker (the internal staff dashboard) has neither.
Backward compatibility: all changes are additive (new SHOPVER# partitions, new model, new endpoints). Old code never reads version items (different partition). Rollback strategy per phase: revert the PR; version data is inert without the code.
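The backfill's collision-safe write from the roll-out list can be sketched as a pure builder for the two transaction items. The function name and example values are hypothetical; the key layout (SHOP#{domain} / SETTINGS, SHOPVER#{domain} / live#…) and the two conditions follow the description above, expressed in low-level DynamoDB JSON.

```python
from __future__ import annotations


def backfill_transact_items(table: str, domain: str, read_version: int | None,
                            version_item: dict) -> list[dict]:
    """Build [ConditionCheck on the live item, conditional Put of the version item].

    read_version is the counter value read before snapshotting; None means a
    legacy record with no record_version attribute.
    """
    live_key = {"pk": {"S": f"SHOP#{domain}"}, "sk": {"S": "SETTINGS"}}
    if read_version is None:
        check = {"ConditionExpression": "attribute_not_exists(record_version)"}
    else:
        check = {
            "ConditionExpression": "record_version = :read",
            "ExpressionAttributeValues": {":read": {"N": str(read_version)}},
        }
    return [
        {"ConditionCheck": {"TableName": table, "Key": live_key, **check}},
        {"Put": {"TableName": table, "Item": version_item,
                 "ConditionExpression": "attribute_not_exists(sk)"}},
    ]


items = backfill_transact_items(
    "ShopifyEntities", "example.myshopify.com", 7,
    {"pk": {"S": "SHOPVER#example.myshopify.com"}, "sk": {"S": "live#0000000007"}},
)
```

If live traffic bumps the counter after the read, the ConditionCheck fails and the whole transaction cancels, so no version item is written on that pass; the Put's own condition makes a re-run a no-op once the item exists.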
Impact on other components
- admin_server: described above. BaseRepository gains pagination passthrough (additive). Recommended (independent) hygiene fix: list_shop_sessions should query by sk prefix USER# instead of sweeping the whole SHOP# partition; not load-bearing for this design (version items are in a separate partition) but it's the cheap fix for a latent inefficiency discovered during this analysis.
- infra/ecom/stacks/shopify_admin_stack.py: new bucket + grants + env var.
- storefront_admin: new versions section (PR4). admin_worker: no change.
- storefront_search, search_proxy, theme extension: no change; they consume the metafield/MarqoUIConfig, and restore goes through the same mirror sync.
- scripts/: backfill script; docs/integrations/storefront-css-customization-guide.md updated to retire manual backups.
- Teammate plans: interfaces below.
Alternative Solutions Considered
| | A. Full snapshots as DDB items | B. S3 payload + DDB metadata (chosen) | C. S3 only (no DDB metadata) | D. DDB streams → async archiver |
|---|---|---|---|---|
| Atomic with save | yes (transact) | yes (metadata; payload pre-written) | no (list = S3 ListObjects) | no — capture lags/can drop |
| Item-size ceiling | inherits 400KB minus overhead — binds first | payload unbounded | unbounded | unbounded |
| List-versions cost | high (or Projection gymnastics) | trivial | S3 List + per-object HEAD for metadata | trivial |
| "Every save versioned" guarantee | yes | yes | yes | no (stream lambda failure ⇒ silent gap — violates Tenet 1) |
| Extra infra | none | 1 private bucket | 1 bucket | stream + lambda + DLQ |
| Notes | viable today; rejected for ceiling coupling + hot-table bloat | — | rejected: no transactional listing/audit record | rejected: also requires enabling a table stream that doesn't exist |
D deserves a sentence: an async archiver is the standard "don't touch the write path" answer, but it cannot satisfy the operational requirement that motivated this work — a save whose version silently failed to materialize is exactly the incident we're trying to end.
Cross-plan interfaces
All three interfaces negotiated by message 2026-06-10 and reconciled against the teammates' published plans.
settings-concurrency (docs/plans/settings-concurrency-control.md) — AGREED; canonical block ACKed verbatim
- Single int attribute record_version on the live item (avoids clash with settings_schema_version). Starts at 1 on creation; every successful write (content and infra, all writers) increments it by exactly 1 via _version_condition(expected_version) (record_version = :expected OR attribute_not_exists(record_version)); legacy records count as 0. Strictly monotonic, no counter gaps; history sequence numbers are sparse over it (content writes only). Storage name record_version; all API payloads expose version.
- Canonical signatures pinned verbatim in my LLD and their §4.2/§4.3: save_settings(settings, *, expected_version, change_source, extra_transact_items=None) and the non-capturing update_settings(shopify_domain, updates, *, expected_version, change_source), plus the create_or_update_settings funnel (with my ACKed source_scope / source_version addendum). Old unconditioned methods are deleted, not aliased. Conflicts → SettingsConflictError (409). Repository primitives never retry; bounded retry is service-managed mode re-running the full pipeline (fresh snapshot per attempt). update_ui_components is deleted (their B1); routes call the funnel directly.
- Every write path supplies change_source (admin_ui | storefront_admin | onboarding | webhook_registration | webhook_cleanup | index_lifecycle | theme_deploy | script); version items persist it alongside event_type and author fields.
- Their M1 fix is my hard prerequisite: storefront admin GET /settings returns the current version (their §5.1) so clients can supply restore/save guards.
- They own: payload-size metric + 300KB threshold + typed 413 (Phase 1), §4.5 post-write metafield reconciliation (reused by my restore via the shared helper), stale stream-docstring fixes (Phase 1), and the 400KB §6.1 trajectory statement (single-root-lock constraint; my S3 snapshot store cited as the trajectory-2 precedent).
- Their verified caveats, both honored by my design: version items must carry a distinct entity_type (SETTINGS_VERSION) and must never carry system_account_id (sparse-GSI key; it would bloat GSI_SystemAccountId and admin_lambda scan costs); both hold, and the dedicated SHOPVER# partition goes further by removing version items from the list_shop_sessions sweep entirely.
theme-deploys (docs/plans/theme-targeted-deploys.md) — AGREED
- Their staged theme records: sk=SETTINGS#THEME#{theme_id} under SHOP#{domain}, plus one internal rolling sk=SETTINGS#DEPLOY#BACKUP (pre-versioning undo snapshot, retired when my restore lands). Staged/backup records carry no system_account_id (invisible to the sparse GSI) and their plan pins tests that GSI listing excludes them. Each theme record has its own record_version counter, so my per-theme history numbering matches their read of the concurrency plan.
- My history stays in the dedicated SHOPVER#{domain} partition with per-source-record scopes: live#{n} and theme#{theme_id}#{n}. Histories never interleave; version items carry a scope attribute.
- Deploy-to-live is a version event on the live history: they call create_or_update_settings(shop_id, content, updated_by, expected_version=expected_live_version, change_source="theme_deploy", event_type="deploy", source_scope=f"theme:{theme_id}", source_version=staged.record_version). Their backup Put rides the same transaction via extra_transact_items (alongside my version Put: a 3-item transaction on deploys). Capture, and rollback-of-a-deploy via restore, come for free. change_source="theme_deploy" (their writer identity, added to concurrency's enum); event_type="deploy" is the semantic discriminator, so there is no conflict with my model.
- v1 ships live-scope capture; theme-scope capture (versioning staged-record saves) is an extension owned by their plan; the model, S3 keying, and repository are already scope-aware.
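To make the scope-aware keying concrete, here is a minimal sketch of how a version-metadata item might be assembled. The pk/sk shapes, the zero-padded counter, the SETTINGS_VERSION entity type, and the absence of system_account_id come from this plan; the function name version_item, the "version" attribute name, and the exact string stored in the scope attribute are assumptions made for illustration.

```python
from typing import Optional


def version_item(domain: str, n: int, change_source: str, event_type: str,
                 author_id: str, author_display: Optional[str] = None,
                 theme_id: Optional[str] = None) -> dict:
    """Build the metadata item for version n of one source record (sketch)."""
    if n < 0:
        raise ValueError("version numbers are non-negative")
    # One scope per source record: the live record, or one staged theme record.
    scope = "live" if theme_id is None else f"theme#{theme_id}"
    return {
        "pk": f"SHOPVER#{domain}",      # dedicated history partition
        "sk": f"{scope}#{n:010d}",      # zero-padded so string order == numeric order
        "entity_type": "SETTINGS_VERSION",
        "scope": scope,                 # assumed attribute format
        "version": n,                   # assumed attribute name
        "change_source": change_source,
        "event_type": event_type,
        "author_id": author_id,
        "author_display": author_display,
        # Deliberately no system_account_id: it is the sparse-GSI key.
    }
```

The zero padding is what lets a single Query with a begins_with condition on the scope prefix return one history in version order without touching any other scope.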
storefront-admin-auth (docs/plans/storefront-admin-sso.md) — AGREED
- Version records store the two-field author shape (their §8.2): author_id + denormalized author_display (email | token name | None). Identity grammar (their §4.6, corrected after their round-2 review) is uniformly prefixed: user:{cognito_sub} | token:{token_id} | api_key:{system_account_id} | shopify_user:{shopify_user_id}; capture normalizes embedded-app writers (which stamp raw Shopify user ids into updated_by_user_id today) to the shopify_user: prefix at write time, so split-at-first-colon extraction can never misparse. The canonical enum {console_user, cli_token, api_key, shopify_user} and the prefix↔type mapping live as code in admin_server/models/auth.py (their PR1) and are imported, never re-derived; unknown prefixes fail closed. The shared module's classify_legacy(value) exists for reading historical unprefixed data only (no colon + not "system" → shopify_user); this is relevant if old updated_by_user_id values are ever attributed during backfill; new writes always produce prefixed forms. The literal "system" written by webhook writers can never appear as a version author because infra writes are not captured (FR-6). The live item's updated_by_user_id continues to receive whatever the auth context supplies, so existing consumers are unaffected.
- Legacy raw API keys remain valid on storefront routes indefinitely (allow→warn→deny ratchet is config); api_key:{system_account_id} with null display is a permanent author shape, and historical updated_by_user_id values already use it.
- Scopes settled: settings:read / settings:write / settings:deploy_live (the deploy-live scope gates any mutation of the live record, per theme-deploys' semantics). Restore requires the same scopes as the equivalent save against the same target. Live: settings:write + settings:deploy_live (identical to live POST /settings); staged (future): settings:write. No separate restore scope. Invisible to console sessions (which carry all scopes); the boundary bites only CLI tokens minted without the deploy scope, which is intended.
- Restore is a write and receives the same StorefrontAuthContext; the restoring actor becomes the new version's author (event_type=restore).
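The identity grammar above can be illustrated with a short sketch of split-at-first-colon parsing that fails closed. This is not the shared module in admin_server/models/auth.py, which remains the source of truth and is meant to be imported: the names PREFIX_TO_TYPE, parse_author_id, and normalize_embedded_writer are hypothetical, and the pairing of the user and token prefixes with console_user and cli_token is an assumption inferred from the enum.

```python
from typing import Tuple

# Assumed prefix -> author-type pairing; the real mapping lives in the
# auth plan's shared module and should be imported, not re-derived.
PREFIX_TO_TYPE = {
    "user": "console_user",
    "token": "cli_token",
    "api_key": "api_key",
    "shopify_user": "shopify_user",
}


def parse_author_id(author_id: str) -> Tuple[str, str]:
    """Split at the first colon only; reject anything outside the grammar."""
    prefix, sep, raw_id = author_id.partition(":")
    if not sep or not raw_id or prefix not in PREFIX_TO_TYPE:
        raise ValueError(f"unrecognized author id: {author_id!r}")  # fail closed
    return PREFIX_TO_TYPE[prefix], raw_id


def normalize_embedded_writer(shopify_user_id: str) -> str:
    """Prefix a raw Shopify user id at capture time, as described above."""
    return f"shopify_user:{shopify_user_id}"
```

Splitting only at the first colon means an identifier that itself contains a colon survives intact, which is why every new write must carry a prefix.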
FAQs
- Why is restore "content fields only"? Infra fields (active_index, system_account_id, webhook metadata) describe current infrastructure, not merchant intent; restoring a 3-week-old active_index could point metafields at a deleted index (index_service.delete_index clears it deliberately).
- Why snapshot post-save state rather than pre-save? Post-save means "version N is exactly what the world looked like after write N", so restore is a pure put of one object. The pre-change state of the very first instrumented save is covered by the backfill.
- Can agents still pull a JSON backup? Yes: GET …/versions/{v} returns the full payload; the manual curl-to-file workflow keeps working but is no longer load-bearing.
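The "content fields only" rule can be sketched as an allowlist filter applied to a snapshot before it is replayed through the save path. The field names below follow the problem statement (ui_components, selector_components, configuration); the CONTENT_FIELDS constant and the restore_payload helper are hypothetical, and the authoritative field list belongs to the settings model.

```python
import copy

# Assumed allowlist of merchant-intent fields; anything else in a snapshot
# (active_index, system_account_id, webhook metadata) is never restored.
CONTENT_FIELDS = ("ui_components", "selector_components", "configuration")


def restore_payload(snapshot: dict) -> dict:
    """Return only the content fields of a snapshot, deep-copied."""
    return {
        field: copy.deepcopy(snapshot[field])
        for field in CONTENT_FIELDS
        if field in snapshot
    }
```

An allowlist rather than a denylist means an infra field added to the model later is excluded from restores by default.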
References
- components/shopify/admin_server/admin_server/models/shopify_entities.py (live model)
- components/shopify/admin_server/admin_server/services/settings_service.py (save path, metafield mirror)
- components/shopify/admin_server/admin_server/routes/storefront_routes.py (API-key settings routes, 207 contract)
- infra/ecom/stacks/shopify_admin_stack.py (table + bucket definitions)
- docs/integrations/storefront-css-customization-guide.md (current manual-backup workflow this replaces)
- components/controller/merchandise/services/synonym_service.py (in-repo precedent for client-level transact_write_items: standalone client for the moto bug, TypeSerializer item encoding, TransactionCanceledException mapping)
- Sibling plans (paths confirmed by their authors): docs/plans/settings-concurrency-control.md, docs/plans/theme-targeted-deploys.md, docs/plans/storefront-admin-sso.md
Design Review Follow-ups
(To be filled after Raynor's review.)