Settings Versioning: History, Rollback, and Audit for ShopifySettings

Author: settings-versioning (agent, feature-plans team)

Date: 2026-06-10

Stakeholders: Raynor (sign-off), settings-concurrency / theme-deploys / storefront-admin-auth plan owners

Status: Reviewer-approved (adversarial Opus plan-verifier, round 2, 2026-06-11) — awaiting Raynor sign-off

Executive Summary

Every save of a merchant's storefront settings (ShopifySettings: UI component HTML/CSS templates, DOM selectors, configuration) should produce a recoverable, attributable version. Today, rollback relies on ad-hoc manual JSON files in tmp/ (e.g. tmp/lg-settings-backup-2026-05-23T0900.json), made by whoever remembers to make them. This plan adds automatic version capture on every content write, a list/diff/restore API on the API-key-authenticated storefront admin routes, and a minimal restore UI in the storefront admin editor (components/storefront_admin).

Design in one paragraph: each content save writes a full-state snapshot to a new private S3 bucket and a small version-metadata item into the existing ShopifyEntities DynamoDB table — in a dedicated partition (pk SHOPVER#{domain}, sk {scope}#{n:010d}), so history volume can never inflate reads of the live record or any query that sweeps the SHOP#{domain} partition — atomically with the live-item write via TransactWriteItems (transactions span partitions within a table). The version number is the optimistic-lock counter (record_version) being introduced by the settings-concurrency plan — one counter, two purposes. Restore replays a snapshot's content fields through the existing save path, producing a new version (history is append-only) and re-syncing the Shopify metafield mirror exactly as a normal save does.

Problem Statement

ShopifySettings (DDB: pk SHOP#{domain}, sk SETTINGS, table ShopifyEntitiesTable) holds merchant-facing storefront configuration: ui_components and selector_components dicts containing large HTML/CSS templates, plus configuration. Real merchant payloads run ~115–120KB of JSON (Laura Geller, Muji backups in tmp/).

These settings are edited by:

  • Merchants via the embedded Shopify app (settings_routes.py, session auth),
  • Humans and AI agents via the storefront admin API (storefront_routes.py, API-key auth) — the dominant path during integration/tuning work, often against live merchant storefronts.

Pain points:

  1. No rollback. repository.save_settings() is a full-item put_item — the previous state is destroyed on every save. A bad agent push to a live merchant requires hunting for a manual backup file.
  2. No audit. Only updated_by_user_id + last_updated of the latest write survive. No record of who changed what, when, across time.
  3. Ad-hoc backups don't scale. The tmp/*.json convention is manual, local to one machine, unversioned, and routinely forgotten.

Requirement (from operations): every save produces a recoverable version, with list/diff/restore available to the same actors who can save.

Glossary

  • Live item: the single ShopifySettings record read at runtime (pk SHOP#{domain}, sk SETTINGS).
  • Content fields: ui_components, selector_components, configuration. These define storefront behavior and are what versioning protects.
  • Infra fields: metadata (webhook registrations), active_index, system_account_id, cell_id. Written by webhook_service.py, index_service.py, api_key_routes.py via partial update_settings() calls. Not user-facing config; not versioned and never restored.
  • Metafield mirror: save_settings_to_metafields() writes the full settings JSON to Shopify shop metafield marqo.search_settings; the theme extension (marqo-search-embed.liquid) injects it into the page where storefront_search consumes it as window.MarqoUIConfig.
  • Version counter: the integer record_version attribute on the live item introduced by the settings-concurrency plan for optimistic locking (agreed interface, see Cross-plan). Starts at 1 on first conditional write (legacy records without the attribute are treated as 0); every successful write increments it by exactly 1. Distinct from the existing settings_schema_version (payload schema version).
  • Scope: the settings record a version belongs to, which is live today and theme:{theme_id} once the theme-deploys plan lands per-theme records.

Tenets

  1. Recoverability over save availability. A settings save that cannot produce a version must fail (fail-fast per CLAUDE.md). We accept that an S3/DDB-transact outage blocks saves; we do not accept silent versionless saves.
  2. History is append-only. Restore never rewrites history; it creates a new version recording what was restored and by whom.
  3. One counter. The optimistic-lock counter and the version-history identifier are the same number. Two counters with subtly different semantics is a standing source of bugs.
  4. Restore is a save. Restore reuses the existing save path (validation, merge, metafield sync, concurrency check) rather than a parallel write path that drifts.

Functional Requirements

  • FR-1: Every write that changes content fields produces a version capturing the full post-write state of content fields.
  • FR-2: Versions record author identity, timestamp, event type (save/restore/deploy/backfill), and a summary of changed components.
  • FR-3: API to list versions (paginated, newest first) and fetch one version's full payload.
  • FR-4: API to diff two versions (or a version against current) at component/field granularity.
  • FR-5: API to restore a version: content fields only, applied as a new version, with the metafield mirror re-synced.
  • FR-6: Writes that touch only infra fields do not create versions.
  • FR-7: Identical-content saves (no content change) do not create duplicate versions.
  • FR-8: The storefront admin UI (components/storefront_admin — the standalone settings editor; not admin_worker, the internal staff dashboard) shows the version list and offers restore with confirmation.

Non-Functional Requirements

  • NFR-1: Version capture adds ≤1 S3 PUT + 0 extra DDB round-trips to the save path (metadata Put rides the same TransactWriteItems as the live write).
  • NFR-2: Listing versions never fetches snapshot payloads (metadata items stay < 2KB).
  • NFR-3: No version data is publicly readable (the existing assets bucket is public_read_access=True and is explicitly not used).
  • NFR-4: Sized for ~200 shops each accumulating low thousands of versions (automated agent teams saving continuously), with ≥10x headroom — see Performance at scale.
  • NFR-5: Reads of the live settings record remain a single GetItem whose cost is independent of version count; no existing query's result set grows with history.

Out Of Scope

  • A version-history UI in the embedded Shopify app (a history page inside the Shopify admin app). Capture still happens for those saves (they share SettingsService), but the list/diff/restore UI is built only in the storefront admin editor (components/storefront_admin) for v1. A versions panel in admin_worker (the internal staff dashboard) is likewise out of scope.
  • Versioning other entities (IndexSettings, API keys, sessions, merchandising rules).
  • Importing the historical ad-hoc tmp/*.json backups (optional manual one-off, not part of the system).
  • Scheduled snapshot pruning/retention automation beyond S3 lifecycle storage-class transitions (volume is tiny; deletion machinery is more risk than the cost it saves — revisit if growth demands).
  • A separate finer-grained "restore" auth scope — restore requires the same scopes as the equivalent save against the same target (live: settings:write + settings:deploy_live; staged: settings:write); see Cross-plan interfaces.
  • Cross-shop version copy ("apply Muji CA's v12 to Muji US") — useful, but a different feature.

Success Criteria

  • 100% of content saves in prod produce a listed version (verifiable: compare save logs to version counts).
  • A live-merchant rollback is a single API call / one button, taking < 1 minute end to end including metafield re-sync — measured against today's manual process (locate backup file, hand-craft POST).
  • Zero ad-hoc tmp/ backups needed for new integration work.

API Design

All endpoints live on the storefront admin router (storefront_routes.py, API-key auth via authenticate_api_key_request + resolve_storefront_shop), alongside the existing GET/POST /shops/{shopify_domain}/settings. Shop access control is identical to the existing settings endpoints: the API key's system_account_id must own the shop.

List versions

GET /shops/{shopify_domain}/settings/versions?limit=20&cursor=<opaque>
{
  "versions": [
    {
      "version": 42,
      "scope": "live",
      "event_type": "restore",
      "restored_from": 38,
      "author_id": "api_key:acct_123",
      "author_display": null,
      "created_at": "2026-06-10T03:21:00+00:00",
      "settings_schema_version": 1,
      "content_hash": "sha256:ab12…",
      "snapshot_size_bytes": 119001,
      "changed": {"ui_components": ["results_grid", "instant_search"], "selector_components": [], "configuration": []}
    }
  ],
  "next_cursor": "eyJ2IjogNDF9"
}

Newest first (ScanIndexForward=False over pk SHOPVER#{domain}, sk prefix live#). cursor is the base64-encoded DDB LastEvaluatedKey. Metadata only — no payloads; page cost is limit × ~2KB regardless of total history depth (see Performance at scale).
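A minimal sketch of the opaque cursor, assuming the LastEvaluatedKey is JSON-serializable and that URL-safe base64 is acceptable. The helper names encode_cursor and decode_cursor are hypothetical and not part of the plan.

```python
import base64
import json


def encode_cursor(last_evaluated_key: dict) -> str:
    """Serialize a DynamoDB LastEvaluatedKey into an opaque, URL-safe string."""
    raw = json.dumps(last_evaluated_key).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> dict:
    """Inverse of encode_cursor; raises ValueError on malformed input."""
    try:
        return json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    except (ValueError, UnicodeError) as exc:
        # binascii.Error and json.JSONDecodeError are both ValueError subclasses.
        raise ValueError("invalid cursor") from exc


# Hypothetical key shape for the version partition.
key = {"pk": "SHOPVER#example.myshopify.com", "sk": "live#0000000041"}
round_tripped = decode_cursor(encode_cursor(key))
```

Under this scheme the sample next_cursor in the response above decodes to {"v": 41}; a real key would carry pk and sk as in the hypothetical example. A malformed cursor should map to a 4xx rather than reach DynamoDB.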

Get one version (full payload)

GET /shops/{shopify_domain}/settings/versions/{version}

Returns the metadata above plus "settings": {"ui_components": …, "selector_components": …, "configuration": …} fetched from S3. 404 if the version doesn't exist (RecordNotFoundError → 404 via existing handler pattern).

Diff

GET /shops/{shopify_domain}/settings/versions/{version}/diff?against=current
GET /shops/{shopify_domain}/settings/versions/{version}/diff?against=38
{
  "from": 38, "to": "current", "to_version": 42,
  "changes": [
    {"path": "ui_components.results_grid.css", "change_type": "modified",
     "diff": "--- v38\n+++ current\n@@ -10,4 +10,6 @@\n …"},
    {"path": "ui_components.promo_banner", "change_type": "added"},
    {"path": "selector_components.search_input.selector", "change_type": "modified",
     "diff": "--- v38\n+++ current\n@@ …"}
  ]
}

Server-side structural diff: walk the three content dicts; report added/removed components and per-field changes. For changed string fields ≤ 64KB, include a difflib.unified_diff; larger fields report change_type and sizes only (client can fetch both payloads). This is deterministic stdlib work — no external deps. When against=current, the response's to_version records the live record_version read at diff time, so a client can review a diff and then restore with expected_version=to_version — guaranteeing the state it approved is the state being replaced.
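The structural diff described above could look roughly like the following sketch. It assumes the content dicts are plain JSON-like dicts; the function names and the "sizes" key for oversized fields are hypothetical, and the recursion goes as deep as the data does, where the plan only commits to component/field granularity.

```python
import difflib

MAX_DIFF_BYTES = 64 * 1024  # larger string fields report sizes only
CONTENT_FIELDS = ("ui_components", "selector_components", "configuration")


def _diff_value(path, old, new, from_label, to_label):
    """Recurse through dicts; emit one entry per added, removed, or modified leaf."""
    if old == new:
        return []
    if isinstance(old, dict) and isinstance(new, dict):
        entries = []
        for key in sorted(old.keys() | new.keys()):
            sub = f"{path}.{key}"
            if key not in old:
                entries.append({"path": sub, "change_type": "added"})
            elif key not in new:
                entries.append({"path": sub, "change_type": "removed"})
            else:
                entries.extend(_diff_value(sub, old[key], new[key], from_label, to_label))
        return entries

    entry = {"path": path, "change_type": "modified"}
    if isinstance(old, str) and isinstance(new, str):
        old_size, new_size = len(old.encode("utf-8")), len(new.encode("utf-8"))
        if max(old_size, new_size) <= MAX_DIFF_BYTES:
            lines = difflib.unified_diff(
                old.splitlines(), new.splitlines(),
                fromfile=from_label, tofile=to_label, lineterm="",
            )
            entry["diff"] = "\n".join(lines)
        else:
            entry["sizes"] = {"from_bytes": old_size, "to_bytes": new_size}
    return [entry]


def diff_content(old, new, from_label, to_label):
    """Diff the three content dicts of two snapshots."""
    changes = []
    for section in CONTENT_FIELDS:
        changes.extend(
            _diff_value(section, old.get(section, {}), new.get(section, {}), from_label, to_label)
        )
    return changes
```

Sorting keys keeps the output deterministic, so two diffs of the same pair of versions are byte-identical.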

Restore

POST /shops/{shopify_domain}/settings/versions/{version}/restore
Body: {"expected_version": 42} # optional; optimistic-lock guard, see below

(Field name matches the settings-concurrency plan's save contract; if they expose the guard differently — e.g. a header — restore mirrors their choice.)

Scopes (per storefront-admin-auth): restore requires the same scopes as the equivalent save against the same target — live target: settings:write + settings:deploy_live (any mutation of the live record needs the deploy scope, matching the live POST /settings itself); staged/theme target (future): settings:write only. No separate restore scope. Legacy raw API keys retain access per their allow→warn→deny ratchet.

Semantics:

  1. Fetch the version metadata and snapshot for {version}; validate and (if needed) schema-migrate it (see Schema drift), and run the tier-1 metafield size guard (shared two-tier capacity interface, settings-concurrency's canonical wording) — an oversized snapshot fails fast with 422 settings_too_large (the shared tier-1 status; tier 2's DDB cap is the one that maps to 413) before any write, since restore is a metafield-bound path.
  2. Apply content fields only through SettingsService.create_or_update_settings — infra fields on the live item are untouched, last_updated/author are fresh.
  3. This produces a new version (e.g. restoring v38 when current is v42 creates v43 with event_type=restore, restored_from=38).
  4. Re-sync the metafield mirror via the shared save/restore helper, including settings-concurrency's post-write metafield reconciliation (their §4.5); the 207-partial contract applies when the access token is missing or Shopify errors (response shape reuses SaveSettingsResponse).
  5. Concurrency guard: the client-supplied expected_version is optional and advisory — the server does not depend on a client echo (today's GET doesn't even return a version; see prerequisite below). When provided, restore runs caller-managed: one attempt, 409 on conflict. When omitted, restore inherits service-managed mode (bounded conditional retry of the full pipeline) — not last-writer-wins; an unguarded restore still never overwrites a concurrent save silently within an attempt.
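The append-only restore semantics (steps 2, 3, and the caller-managed guard in step 5) can be illustrated with an in-memory model. This is a sketch only: SettingsHistory and VersionConflict are hypothetical names, there is no S3, DynamoDB, or metafield sync, and the FR-7 no-op check and service-managed retry are left out.

```python
import copy
from dataclasses import dataclass, field


class VersionConflict(Exception):
    """Stands in for the 409 returned when expected_version does not match."""


@dataclass
class SettingsHistory:
    current_version: int = 0
    content: dict = field(default_factory=dict)
    versions: dict = field(default_factory=dict)  # version -> metadata + content

    def save(self, content, *, event_type="save", restored_from=None, expected_version=None):
        # Caller-managed guard: one attempt, conflict on mismatch.
        if expected_version is not None and expected_version != self.current_version:
            raise VersionConflict(f"expected {expected_version}, found {self.current_version}")
        self.current_version += 1
        self.content = copy.deepcopy(content)
        self.versions[self.current_version] = {
            "event_type": event_type,
            "restored_from": restored_from,
            "content": copy.deepcopy(content),
        }
        return self.current_version

    def restore(self, version, *, expected_version=None):
        # A missing version raises KeyError here, standing in for the 404.
        snapshot = self.versions[version]["content"]
        # Restore is a save: it appends a new version and never rewrites old ones.
        return self.save(
            snapshot,
            event_type="restore",
            restored_from=version,
            expected_version=expected_version,
        )
```

Restoring version 1 when the current version is 2 yields version 3 with restored_from=1, and versions 1 and 2 remain in the history unchanged.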

Hard prerequisite (from settings-concurrency's plan, their §5.1 / fix M1): the storefront admin GET /shops/{domain}/settings switches to a response carrying the current version. Today it returns only components (storefront_routes.py:140-143, SettingsResponse in api_requests.py), so no client could echo a guard. The restore UI reads version from that GET (or from the versions list; the diff endpoint also reports it as to_version) and passes it as expected_version.

Responses: 200 {"status": "success", "new_version": 43, …}, 207 partial (metafield sync failed; DDB state restored — same contract as save), 404 unknown version, 409 version conflict or future-schema snapshot, 422 snapshot fails current model validation (validation_failed) or exceeds the tier-1 metafield guard (settings_too_large).

Future evolution

  • ?scope=theme:{theme_id} query param on list/get/diff/restore once theme-deploys lands (additive; defaults to live).
  • A POST …/versions "pin/label" endpoint (named checkpoints) if tuning workflows want it.

Architecture

POST /settings (app or storefront admin)
        │
        ▼
SettingsService.create_or_update_settings
        │  merge + validate (existing)
        ▼
content hash unchanged? ── yes ──► plain save, no version
        │ no
        ▼
SettingsVersionService.capture
        │  1. PUT snapshot JSON → S3 (settings-versions bucket, private)
        │  2. TransactWriteItems (same table, two partitions; transactions span pks):
        │       • Update live item (pk SHOP#{domain}, sk SETTINGS; condition record_version = :expected OR
        │         attribute_not_exists(record_version); SET record_version = :expected+1)   ◄── settings-concurrency
        │       • Put version item (pk SHOPVER#{domain}, sk live#{n:010d})
        ▼
metafield mirror sync (existing, unchanged)
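As a rough illustration of step 2, the two transaction items could be assembled as below in the typed attribute format of the boto3 low-level DynamoDB client. The function name, the table argument, and the version_attrs parameter are hypothetical, and the SET clauses for the content fields are omitted. In the plan itself the live-item write belongs to settings-concurrency's save_settings and this plan contributes only the Put.

```python
def build_capture_transaction(table, domain, expected_version, version_attrs):
    """Return TransactItems for one content save: live-item Update + version-item Put."""
    new_version = expected_version + 1
    live_update = {
        "Update": {
            "TableName": table,
            "Key": {"pk": {"S": f"SHOP#{domain}"}, "sk": {"S": "SETTINGS"}},
            # Content-field SET clauses are omitted to keep the sketch short.
            "UpdateExpression": "SET record_version = :next",
            "ConditionExpression": (
                "record_version = :expected OR attribute_not_exists(record_version)"
            ),
            "ExpressionAttributeValues": {
                ":expected": {"N": str(expected_version)},
                ":next": {"N": str(new_version)},
            },
        }
    }
    version_put = {
        "Put": {
            "TableName": table,
            "Item": {
                **version_attrs,
                # Dedicated partition, zero-padded sort key.
                "pk": {"S": f"SHOPVER#{domain}"},
                "sk": {"S": f"live#{new_version:010d}"},
                "version": {"N": str(new_version)},
            },
        }
    }
    return [live_update, version_put]


items = build_capture_transaction(
    "ShopifyEntitiesTable",
    "example.myshopify.com",
    42,
    {"event_type": {"S": "save"}},
)
```

The resulting list would be passed as client.transact_write_items(TransactItems=items). Because both items are in one transaction, a failed version check cancels the version Put as well.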

Components:

  • SettingsSnapshotStore (new, admin_server): thin S3 wrapper. put_snapshot(domain, scope, version, payload) -> s3_key, get_snapshot(s3_key) -> dict. Key scheme: {shop_domain}/{scope}/{version:010d}.json.
  • SettingsVersionRepository (new, extends BaseRepository): put happens inside the save transaction (see below); list_versions(domain, scope, limit, exclusive_start_key), get_version(domain, scope, version). Requires adding ExclusiveStartKey/LastEvaluatedKey passthrough to BaseRepository.query_by_pk_and_sk_prefix (currently unsupported — small additive change).
  • SettingsVersionService (new): capture (computes hash/changed-summary, S3 put, builds transact items), list, get, diff, restore. Injected into SettingsService and routes via the existing DependencyContainer @cached_property pattern (dependencies.py).
  • SettingsService (modified): create_or_update_settings gains the capture step; the final write is the canonical save_settings(settings, *, expected_version, change_source, extra_transact_items) primitive owned by settings-concurrency (see Low Level Design — pinned verbatim in both plans); this plan contributes one Put to that transaction (theme-deploys contributes a second on deploys). Sequencing of the two PRs is covered in Release.
  • Write-path coverage: all content writes funnel through create_or_update_settings (callers: settings_routes.save_settings, settings_routes.initialize_settings, storefront_routes.save_settings, api_key_routes onboarding default-init; the update_ui_components wrapper is being deleted by settings-concurrency — routes call create_or_update_settings directly). Infra-field writers (webhook_service, index_service, api_key_routes association fields) use the non-capturing update_settings path and are deliberately not captured (FR-6). A unit test pins this taxonomy so future fields must be classified.
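The hash and changed-summary computed by capture might be sketched as follows. The canonicalization (sorted keys, compact separators, UTF-8) is an assumption, since the plan says only "canonical content-fields JSON"; the function names are hypothetical.

```python
import hashlib
import json

CONTENT_FIELDS = ("ui_components", "selector_components", "configuration")
_MISSING = object()


def content_hash(settings: dict) -> str:
    """sha256 over the content fields only, so infra-field changes never alter it."""
    content = {name: settings.get(name, {}) for name in CONTENT_FIELDS}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_summary(old: dict, new: dict) -> dict:
    """Per content field, the sorted keys that were added, removed, or modified."""
    summary = {}
    for name in CONTENT_FIELDS:
        before, after = old.get(name, {}), new.get(name, {})
        summary[name] = sorted(
            key
            for key in before.keys() | after.keys()
            if before.get(key, _MISSING) != after.get(key, _MISSING)
        )
    return summary
```

An unchanged hash is what routes a save to the plain, versionless path (FR-7), and hashing only the content fields means infra-only differences never affect it.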

Failure ordering

  1. S3 put fails → save fails (5xx). Tenet 1: no versionless saves. S3 single-region availability is well above this endpoint's needs.
  2. Transaction fails after S3 put (condition conflict, throttle) → orphan S3 object. Harmless: listing is DDB-driven, orphans are unreachable. Accrual is bounded by the conflict rate (each conflicted attempt — caller-managed 409s and failed service-managed attempts alike — orphans at most one object): even at a pessimistic 5% conflict rate at the full agent scenario (360K saves/mo) that's ~18K orphans/mo × 120KB ≈ 2GB/mo ≈ $0.05/mo fleet-wide. Orphans are written under the same {shop}/{scope}/ prefixes as live snapshots, so the bucket-wide 90-day STANDARD_IA lifecycle rule covers them automatically; no cleanup machinery in v1.
  3. Metafield sync fails after commit → 207-partial via the shared helper, which includes settings-concurrency's post-write metafield reconciliation (their §4.5). The version exists and is correct (it reflects DDB state, the source of truth).
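The orphan estimate in item 2 can be reproduced as a short calculation. The per-GB price is an assumption (S3 Standard list price of about $0.023 per GB-month); the other inputs come from the text.

```python
SAVES_PER_MONTH = 360_000   # full agent scenario, from the text
CONFLICT_RATE = 0.05        # pessimistic conflict rate, from the text
SNAPSHOT_KB = 120           # typical snapshot size, from the text
USD_PER_GB_MONTH = 0.023    # ASSUMPTION: S3 Standard list price

orphans_per_month = SAVES_PER_MONTH * CONFLICT_RATE
orphan_gb_per_month = orphans_per_month * SNAPSHOT_KB / 1_000_000  # decimal KB -> GB
storage_usd = orphan_gb_per_month * USD_PER_GB_MONTH

print(f"{orphans_per_month:,.0f} orphans, {orphan_gb_per_month:.2f} GB, ${storage_usd:.2f}")
```

This prints 18,000 orphans, 2.16 GB, $0.05, matching the figures above. The dollar amount is the monthly storage price of one month's accrual.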

Data Storage / Modeling

Version metadata item (DynamoDB, ShopifyEntitiesTable)

New model in shopify_entities.py:

class ShopifySettingsVersion(ShopifyEntityBase):
    """
    Immutable version-history record for a settings save.
    pk: SHOPVER#{shop_domain}              (dedicated partition, never mixed with live entities)
    sk: live#{version:010d}                (live scope)
        theme#{theme_id}#{version:010d}    (theme scope, future)
    """
    entity_type: Literal["SETTINGS_VERSION"] = "SETTINGS_VERSION"
    version: int                       # == live item's record_version after the captured write
    scope: str = "live"                # "live" | "theme:{theme_id}"
    event_type: Literal["save", "restore", "deploy", "backfill"]
    change_source: str                 # supplied by every write path per settings-concurrency
                                       # (admin_ui | storefront_admin | onboarding | webhook_registration
                                       # | webhook_cleanup | index_lifecycle | theme_deploy | script)
    author_id: str                     # UNIFORMLY PREFIXED actor string (storefront-admin-auth §4.6):
                                       # user:{cognito_sub} | token:{token_id} | api_key:{system_account_id}
                                       # | shopify_user:{shopify_user_id}. Capture normalizes embedded-app
                                       # writers (raw Shopify user id today) to the shopify_user: prefix.
                                       # Prefix->type mapping imported from admin_server/models/auth.py
                                       # (their PR1), never re-derived; unknown prefixes fail closed.
    author_display: str | None = None  # email / token name, denormalized at write time
    ttl: int | None = None             # expiry hook, unset in v1 (see Retention)
    settings_schema_version: int       # copied from the saved settings
    content_hash: str                  # sha256 of canonical content-fields JSON
    snapshot_s3_key: str
    snapshot_size_bytes: int
    changed: dict                      # {"ui_components": [keys], "selector_components": [...], "configuration": [...]}
    restored_from: int | None = None   # event_type == "restore"
    source_scope: str | None = None    # event_type == "deploy"
    source_version: int | None = None

Size: < 2KB. Why a dedicated SHOPVER# partition rather than new sks under SHOP#{domain}: existing code sweeps the whole SHOP# partition — ShopifySessionRepository.list_shop_sessions (shopify_session_repository.py:135) calls query_by_pk(pk) and filters entity_type == "SESSION" client-side, and it runs on every storefront settings save and restore (access-token lookup in storefront_routes._get_access_token_for_shop). Thousands of version items in that partition would be read, transferred, and discarded on every save (5,000 × 2KB = 10MB ≈ 10 paginated 1MB queries per token lookup). A separate partition makes history invisible to all current and future SHOP# partition queries by construction (NFR-5), while TransactWriteItems still covers both items (same table; different pks are fine). Independently of this plan, list_shop_sessions should switch to query_by_pk_and_sk_prefix(pk, "USER#") as hygiene — noted in Impact, not load-bearing here.

Listing history is query_by_pk_and_sk_prefix(pk="SHOPVER#{domain}", sk_prefix="live#"), no GSI needed. The theme# prefix shares the partition but never matches the live# prefix query (and vice versa). Version items carry no system_account_id, so they can never appear in the sparse GSI_SystemAccountId.

Zero-padded version in the sk gives lexicographic == numeric ordering. The counter itself has no gaps (settings-concurrency increments by exactly 1 on every successful write), but history sequence numbers are sparse: the counter advances on infra-only writes too, while history items are created only for content writes. Documented in the model docstring and the API docs.
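A small demonstration of why the padding matters, using the sk and S3 key formats given in this plan. The helper names are hypothetical.

```python
def version_sk(scope_prefix: str, version: int) -> str:
    """Sort key for a version item, e.g. live#0000000042."""
    return f"{scope_prefix}#{version:010d}"


def snapshot_key(shop_domain: str, scope: str, version: int) -> str:
    """S3 object key for a snapshot, following {shop_domain}/{scope}/{version:010d}.json."""
    return f"{shop_domain}/{scope}/{version:010d}.json"


versions = [9, 10, 100, 42]  # sparse on purpose: history numbers may skip values

padded = sorted(version_sk("live", v) for v in versions)
unpadded = sorted(f"live#{v}" for v in versions)

# Padded keys sort in numeric order; unpadded keys sort as text (10, 100, 42, 9).
padded_order = [int(sk.split("#")[1]) for sk in padded]
unpadded_order = [int(sk.split("#")[1]) for sk in unpadded]
```

Ten digits leave room for versions up to 9,999,999,999 before the ordering property would break.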

Snapshot object (S3, new private bucket)

  • Bucket: config.envify("shopify-settings-versions") in shopify_admin_stack.py. Private (BlockPublicAccess.BLOCK_ALL), SSE-S3, removal_policy matching the table (RETAIN-equivalent in prod via deletion_protection convention — follow point_in_time_recovery=config.is_prod style), lifecycle: transition to STANDARD_IA after 90 days, no expiry. Why not the existing assets bucket: it is public_read_access=True behind CloudFront — merchant CSS/HTML and author metadata must not be world-readable. Why not the job-details bucket: different domain/lifecycle; mixing makes future retention rules hazardous.
  • Object body: full post-save state of the three content fields plus envelope:
    {
      "snapshot_format": 1,
      "shop_domain": "laurageller.myshopify.com",
      "scope": "live",
      "version": 42,
      "settings_schema_version": 1,
      "captured_at": "2026-06-10T03:21:00+00:00",
      "settings": {"ui_components": {}, "selector_components": {}, "configuration": {}}
    }
  • Grant: bucket.grant_read_write(...) to the admin server task role and the backfill script role; env var SETTINGS_VERSIONS_BUCKET added to config.py (fail fast at startup if unset, matching existing config style — no silent fallback).

Why S3 payload + DDB metadata (and not DDB-only)

Real payloads are ~120KB; the DDB item limit is 400KB. Full snapshots as DDB items would work today, but:

  1. Ceiling coupling. A snapshot item is the live item plus metadata — it hits the 400KB wall at the same moment the live item does, and history would be the thing that breaks first as merchants add CSS (LG grew from 115KB→119KB in one month of tuning).
  2. Partition + query economics. At the 200-shop agent scenario, bodies-in-DDB means 600MB of history per shop in the operational table, and every history page reads megabytes; metadata-only items keep the version partition at ~10MB per 5,000 versions and list pages at ~40KB (NFR-2) without ProjectionExpression gymnastics. Full numbers in Performance at scale.
  3. Ops affordance. S3 objects are directly inspectable/downloadable with existing diagnostics tooling (the tmp/*.json workflow, formalized), and lifecycle-tiering is free.
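The sizing figures quoted in this section follow from a few multiplications, shown here with decimal units. The 5,000 versions per shop, 120KB body, 2KB metadata item, and limit=20 come from the text; the 1MB page size is DynamoDB's per-Query response limit.

```python
import math

VERSIONS_PER_SHOP = 5_000
SNAPSHOT_KB = 120
METADATA_KB = 2
LIST_PAGE_LIMIT = 20
QUERY_PAGE_MB = 1  # DynamoDB returns at most 1 MB per Query page

bodies_in_ddb_mb = VERSIONS_PER_SHOP * SNAPSHOT_KB / 1000       # history bodies per shop
metadata_partition_mb = VERSIONS_PER_SHOP * METADATA_KB / 1000  # metadata-only partition
list_page_kb = LIST_PAGE_LIMIT * METADATA_KB                    # one history page
sweep_pages = math.ceil(metadata_partition_mb / QUERY_PAGE_MB)  # full-partition sweep cost

print(bodies_in_ddb_mb, metadata_partition_mb, list_page_kb, sweep_pages)
```

The results are 600MB of bodies versus 10MB of metadata per shop, 40KB per list page, and 10 Query pages to sweep a partition that held all the metadata items.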

Cost of the hybrid: one extra PUT per save and one IAM grant. See Alternative Solutions for the full comparison.

The 400KB live-item ceiling (split-record question)

Raynor's review question: what happens when a shop's settings outgrow the DDB 400KB item cap, and should we introduce split-records (N shard items assembled on read)? Point by point:

(a) Live-record exposure, stated explicitly. S3 snapshots remove the cap for history bodies, but the live SETTINGS item that this plan's capture transacts against is still one DDB item bounded at 400KB (including attribute names). Versioning does not enlarge the live item (the transaction's second item is the ~2KB metadata Put in another partition), so the exposure is pre-existing and orthogonal. However, a save exceeding the cap also fails to capture (the whole transaction fails), so versioning inherits the ceiling exactly as saves do.

(b) Failure mode today. An oversized put_item raises botocore ClientError (ValidationException, "Item size has exceeded the maximum allowed size"). Nothing in the repository or routes handles it specifically — it bubbles to the routes' generic except Exception and surfaces as an opaque 500 "Failed to save settings" (logged with exc_info). It is surfaced, not swallowed, and it fails before the metafield sync, so DDB and the mirror both retain the previous state — no corruption, just an undiagnosable error for the caller. A typed ItemTooLargeError → 413 should land with whichever PR first reworks the repository write path (suggested to settings-concurrency). Critically, DDB is not the first ceiling: the Shopify metafield mirror caps values at ~128KB, and real payloads are already ~120KB — see the two-tier interface in (d).

(c) Capture survives a future sharded live record. If the live record later becomes manifest + N shards, this design holds because of two deliberate properties: (1) snapshots are format-independent: SettingsSnapshotStore writes the fully materialized content dict assembled in the service layer, never raw DDB items, so a snapshot is identical whether the live record is one item or twelve, and every existing snapshot stays restorable across the transition (restore replays content through the save path, whatever shape that path writes); (2) version identity stays singular: record_version and its conditional check live on the manifest item only; shards carry no counter. A sharded content save is then one TransactWriteItems: conditional manifest update + N shard puts + 1 version-metadata put. DynamoDB transactions allow 100 items / 4MB aggregate (the older 25-item limit was raised in 2022), so atomic capture survives settings up to ~3.8MB, roughly 10× past the point where sharding becomes necessary. Beyond that, the metafield mirror (single value) breaks first and the whole settings-distribution model needs rethinking, not just storage.

(d) Recommendation: defer sharding, with a measured trigger. Evidence (corpus ordering aligned with settings-concurrency §6.1a): the largest real merchant payload is Muji CA at 120.6KB (tmp/muji_ca_settings_2026-05-29T1527.json), with LG second at 119.0KB (tmp/lg-settings-backup-2026-05-26.json) — ~30% of the cap. Growth is episodic, not steady: LG grew 90.8KB → 119.0KB (+28KB) during its single heaviest month of CSS customization (Apr 22 → May 26 2026 backups), then flattened. Even sustaining that worst-case burst rate continuously, 400KB is ~10 months out; integration phases end. Sharding now would touch every read path, the lock design, and the mirror for headroom we don't need.
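The headroom estimate in (d) works out as follows, treating the Apr 22 to May 26 window as one month, as the text does.

```python
DDB_ITEM_CAP_KB = 400.0
LG_START_KB, LG_END_KB = 90.8, 119.0  # LG backups, Apr 22 -> May 26 2026
MUJI_CA_KB = 120.6                    # largest real payload

burst_kb_per_month = LG_END_KB - LG_START_KB                        # about 28.2
months_to_cap = (DDB_ITEM_CAP_KB - LG_END_KB) / burst_kb_per_month  # about 9.96
largest_share_of_cap = MUJI_CA_KB / DDB_ITEM_CAP_KB                 # about 0.30

print(f"{burst_kb_per_month:.1f} KB/month, {months_to_cap:.1f} months, {largest_share_of_cap:.0%} of cap")
```

The projection assumes the worst observed burst continues without pause, so it is a lower bound on time to the cap rather than a forecast.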

Capacity thresholds follow the shared two-tier capacity interface owned by settings-concurrency (their canonical wording governs; not restated here):

  • Tier 1 — metafield-bound paths (the FIRST ceiling): ~128KB Shopify metafield value cap, with 100KB warn / 128KB guard, enforced on every path that re-syncs the mirror — save, deploy, and this plan's restore/replay: restoring an old snapshot runs the same tier-1 guard before committing, so a restore can never succeed in DDB and then strand the mirror. At ~120KB real payloads this tier is already near — it, not DDB, is the operative constraint.
  • Tier 2 — DDB live item: 300KB warn / 400KB hard cap; the split-record trigger.

Versioning makes both tiers cheap to observe: snapshot_size_bytes is already recorded on every version item and logged at capture, giving a per-shop size time series against both thresholds with no new measurement code. Crossing tier 2 triggers a dedicated split-record design, executed via a settings_schema_version bump + migration-registry entry (the vehicle this plan already ships), with version history as the rollback anchor for that migration; crossing tier 1 is a content-budget conversation with the merchant integration, not a storage problem.
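A sketch of checking snapshot_size_bytes against both tiers. The canonical thresholds and status names belong to settings-concurrency; the unit (binary kilobytes), the boundary handling, and the names below are assumptions made for illustration.

```python
KB = 1024  # ASSUMPTION: binary kilobytes; the plan does not state the unit

TIER1_WARN, TIER1_GUARD = 100 * KB, 128 * KB  # Shopify metafield mirror
TIER2_WARN, TIER2_CAP = 300 * KB, 400 * KB    # DynamoDB live item


def _status(size_bytes: int, warn: int, limit: int) -> str:
    """Classify a size as ok, warn, or reject (boundary handling is an assumption)."""
    if size_bytes > limit:
        return "reject"
    return "warn" if size_bytes >= warn else "ok"


def classify_snapshot(size_bytes: int) -> dict:
    """Evaluate one snapshot size against the metafield tier and the DDB tier."""
    return {
        "tier1_metafield": _status(size_bytes, TIER1_WARN, TIER1_GUARD),
        "tier2_ddb_item": _status(size_bytes, TIER2_WARN, TIER2_CAP),
    }


sample = classify_snapshot(119_001)  # snapshot_size_bytes from the list example
```

With these assumed boundaries the 119,001-byte sample already sits in the tier-1 warn band while remaining far from tier 2, consistent with tier 1 being the operative constraint.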

Record-shape trajectory — agreed across plans (settings-concurrency §6.1; theme-deploys notified): single item now. The shared design constraint is that the optimistic-lock condition always lives on exactly one item. If content ever outgrows the item, two trajectories preserve that: (1) manifest + shards in DDB — root sk=SETTINGS holds record_version + the shard list, all shard writes in one transaction conditioned on the root only (analyzed in (c) above); (2) — concurrency's preferred — the content body moves to S3 with the DDB item keeping record_version + an S3 key, which is exactly this plan's snapshot architecture: the snapshot store is the working precedent that trajectory 2 is sound. Under either layout, version metadata items (~2KB) and history sequence numbers are unaffected because the root counter's semantics don't change. Theme-deploys multiplies record count, not record size — each staged theme record has its own 400KB budget — so it adds no ceiling pressure. The split-record design is triggered by the 300KB alarm and executed via a settings_schema_version bump + migration-registry entry, with version history as the rollback anchor. Ownership note: the 300KB payload metric and the typed ItemTooLargeError → 413 mapping land with settings-concurrency's Phase 1 (today's oversized save is an opaque 500 on UI saves and silently swallowed in webhook_service.py:668 — their finding and fix).

Retention

Keep everything in v1; ship the expiry mechanism but leave it disabled:

  • ShopifyEntitiesTable already has TTL enabled on the ttl attribute (used by sessions, time_to_live_attribute="ttl" in shopify_admin_stack.py), so version-item expiry is a single optional ttl: int | None field on the model — zero new infra. v1 writes it as None.
  • S3 lifecycle: transition to STANDARD_IA after 90 days from day one; an expiry rule is added only when TTL is activated (the two must be enabled together, S3 expiry ≥ DDB TTL so a listed version always has its body).
  • Activation trigger (documented policy, not automation): fleet snapshot storage > 1TB or any shop > 100K versions → enable ttl = created_at + 400 days for event_type="save" items only; restore/deploy/backfill items are kept as durable anchors.

Justification: at the 200-shop agent-team scenario (see Performance at scale) steady-state accrual is ~$2/mo fleet-wide S3 — deletion machinery active from day one is more code and more risk (audit loss) than the cost it saves; but designing the field in now means activation is a config change, not a migration.
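Once activated, the retention rule could be expressed as a single function. This is a sketch: ttl_for and the retention_active flag are hypothetical names, and v1 would always take the None branch.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION_DAYS = 400


def ttl_for(event_type: str, created_at: datetime, retention_active: bool) -> Optional[int]:
    """Epoch-seconds TTL for a version item, or None to keep it indefinitely."""
    # restore/deploy/backfill items are durable anchors and never expire.
    if not retention_active or event_type != "save":
        return None
    return int((created_at + timedelta(days=RETENTION_DAYS)).timestamp())


created = datetime(2026, 6, 10, 3, 21, tzinfo=timezone.utc)
v1_ttl = ttl_for("save", created, retention_active=False)      # v1 behaviour
active_ttl = ttl_for("save", created, retention_active=True)
anchor_ttl = ttl_for("restore", created, retention_active=True)
```

DynamoDB TTL expects epoch seconds, which is why the function returns an integer rather than a timestamp string.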

Low Level Design

Canonical shared write interface (pinned VERBATIM with settings-concurrency)

The repository primitives are owned and implemented by settings-concurrency; the block below is their FINAL canonical text (mirrored character-for-character from their §4.2/§4.3, confirmed 2026-06-11) and appears identically in both plans (a change to one must be made in both):

# ShopifySettingsRepository — the ONLY write paths for SETTINGS* records.
# Neither method retries internally; conflicts raise SettingsConflictError (409).

def save_settings(
    self,
    settings: ShopifySettings,
    *,
    expected_version: int,
    change_source: str,
    extra_transact_items: list[dict] | None = None,
) -> ShopifySettings: ...
# Conditional full save; persists expected_version + 1. Plain conditional
# put_item when extra_transact_items is None; one TransactWriteItems with the
# identical ConditionExpression when provided. ConditionalCheckFailedException
# or TransactionCanceledException(ConditionalCheckFailed) -> SettingsConflictError.

def update_settings(
    self,
    shopify_domain: str,
    updates: dict,
    *,
    expected_version: int,
    change_source: str,
) -> int: ...
# INFRA FIELDS ONLY: raises ValueError if updates touches
# ui_components / selector_components / configuration (fail-fast guard).
# Not captured by versioning (FR-6). Never retries internally; bounded
# retry (<=3, re-reading the version each attempt) lives at the calling service.

# SettingsService — single content-write funnel (update_ui_components is deleted).

def create_or_update_settings(
    self,
    shop_id: str,
    updates: dict,
    updated_by_user_id: str,
    *,
    expected_version: int | None,  # REQUIRED keyword, no default: int = user-facing one-shot; None = service-managed mode
    change_source: str,
    event_type: str = "save",  # versioning capture: save | restore | deploy | backfill
    restored_from: int | None = None,  # versioning capture: source version for restores
    source_scope: str | None = None,  # versioning capture: deploy events ("theme:{theme_id}")
    source_version: int | None = None,  # versioning capture: deploy events
) -> ShopifySettings: ...

Semantics riders pinned with the block (stated identically in both plans):

  • Service-managed mode (expected_version=None) re-runs the FULL pipeline per attempt including capture — fresh read, fresh merge + content hash, fresh provisional version, fresh S3 snapshot, fresh conditional write — max 3 attempts; failed-attempt snapshots are the accepted orphan class (quantified in Failure ordering). None does not mean unconditional/last-writer-wins. User-facing mode (expected_version=int) is one attempt → 409. Capture happens only for content-changing writes, on whichever attempt commits — never more than one version item per successful save.
  • Infra writers bypass capture entirely via update_settings (webhook metadata merges, active_index clears, association fields) — the non-capturing path per FR-6, which fail-fast rejects content fields with a ValueError so the content/infra taxonomy is enforced at the primitive, not by convention; their bounded retries live at the calling services (webhook_service/index_service/api_key_routes) and involve no snapshots.
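As an illustration of the two modes described in the riders above, the following is a minimal in-memory sketch, not the pinned implementation. FakeRepo, save_with_mode, and the before_write hook are hypothetical names; snapshot capture is left out so the focus stays on the per-attempt re-read and the attempt bound.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 3  # bound stated in the riders above


class SettingsConflictError(Exception):
    """Stand-in for the 409 conflict error named in the contract."""


class FakeRepo:
    """Hypothetical in-memory repository exposing one conditional write."""

    def __init__(self) -> None:
        self.version = 0
        self.content: dict = {}

    def get(self) -> Tuple[int, dict]:
        return self.version, dict(self.content)

    def save(self, content: dict, expected_version: int) -> int:
        if expected_version != self.version:
            raise SettingsConflictError(
                f"expected {expected_version}, found {self.version}")
        self.version, self.content = expected_version + 1, dict(content)
        return self.version


def save_with_mode(repo: FakeRepo, updates: dict, expected_version: Optional[int],
                   before_write: Optional[Callable[[int], None]] = None) -> int:
    """int = one conditional attempt; None = bounded re-run of the full pipeline."""
    attempts = 1 if expected_version is not None else MAX_ATTEMPTS
    for attempt in range(1, attempts + 1):
        current, content = repo.get()           # fresh read every attempt
        expected = expected_version if expected_version is not None else current
        merged = {**content, **updates}         # fresh merge every attempt
        if before_write:
            before_write(attempt)               # hook used to simulate a concurrent writer
        try:
            return repo.save(merged, expected)  # conditional in both modes
        except SettingsConflictError:
            if attempt == attempts:
                raise
    raise AssertionError("unreachable")


repo = FakeRepo()


def racer(attempt: int) -> None:
    # A concurrent writer lands between read and write on attempt 1 only.
    if attempt == 1:
        repo.save({"configuration": "other"}, repo.version)


committed = save_with_mode(repo, {"ui_components": "new"}, None, before_write=racer)
print(committed, repo.content)
```

In the usage above the first attempt conflicts, the second re-reads and merges on top of the concurrent writer's content, and neither write is lost. Passing an int instead of None makes the same conflict surface immediately.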

Contract details this plan additionally depends on:

  • The condition builder is a standalone _version_condition(expected_version) precisely so it can be reused inside transaction items; both ConditionalCheckFailedException and TransactionCanceledException(ConditionalCheckFailed) map to SettingsConflictError (extends ServiceError, 409).
  • extra_transact_items may carry more than one Put: this plan's version-metadata item always; theme-deploys' rolling deploy-backup item additionally on deploys.
  • The source_scope/source_version kwargs are the deploy-event addendum negotiated 2026-06-10, now part of the canonical block in both plans.

Capture (inside create_or_update_settings, one attempt of the pipeline)

# Service-managed mode wraps THIS WHOLE BLOCK in a bounded loop (max 3) per the
# contract above; caller-managed mode runs it exactly once. The repo signature
# takes expected_version: int, so the service resolves None (service-managed)
# to the freshly-read counter on EVERY attempt before calling the repo.
def _save_attempt(self, shop_id, updates, updated_by_user_id, *, change_source,
                  event_type, restored_from, source_scope, source_version,
                  expected_version) -> ShopifySettings:
    existing = self.repository.get_settings(shop_id)  # fresh read every attempt
    expected = expected_version if expected_version is not None \
        else (existing.record_version if existing else 0)  # None -> per-attempt resolution (int for repo)
    settings = self._merge_settings(existing, updates) if existing else self._create_default_settings(...)
    new_hash = content_hash(settings)  # sha256 over canonical content-fields JSON
    if existing and new_hash == content_hash(existing):
        # No content change: plain conditional save of timestamps/author, no version. (FR-7)
        persisted = self.repository.save_settings(settings, expected_version=expected,
                                                  change_source=change_source)  # cheap path, no transact
        return persisted
    next_version = expected + 1  # provisional; == record_version after a successful commit
    s3_key = self.snapshot_store.put_snapshot(shop_id, "live", next_version, content_fields(settings))
    version_item = build_version_item(..., changed=changed_summary(existing, settings))
    persisted = self.repository.save_settings(  # one transact_write_items call
        settings, expected_version=expected, change_source=change_source,
        extra_transact_items=[version_item.to_transact_put()],
    )
    return persisted  # NOT the pre-save object: carries the COMMITTED record_version

Notes:

  • content_hash canonicalizes via json.dumps(..., sort_keys=True, separators=(",", ":")) over only the three content fields — infra-field churn never creates versions (FR-6) and never suppresses real changes.
  • next_version is provisional until the conditional write commits; a conflicted attempt orphans its S3 object (harmless, unreachable — DDB metadata is the index). In caller-managed mode the conflict surfaces as 409; in service-managed mode the next attempt recomputes everything under the new counter value.
  • Transaction mechanics (verified against the codebase): the boto3 resource API does not support transactions, so save_settings uses a low-level client — and a standalone boto3.client("dynamodb"), not table.meta.client, because moto has a deserialization bug when transactions go through the resource's internal client (precedent with explanation: components/controller/merchandise/services/synonym_service.py:102-106). Transaction items use DynamoDB JSON, so version_item.to_transact_put() serializes via boto3.dynamodb.types.TypeSerializer (precedent: synonym_service.py _to_ddb_item, and its _execute_transaction shows the TransactionCanceledException mapping). Moto transaction tests follow the existing synonym-service test patterns.
  • changed_summary lists component keys whose serialized value differs — cheap (dict compare), powers the list-view UX without payload fetches.
  • Return the persisted object, never the pre-save one: save_settings returns the committed ShopifySettings (canonical block), and create_or_update_settings must propagate it — the pre-save object carries a stale record_version, which would leak into save/restore API responses and the metafield reconciliation helper. Pinned by Testing #19.
  • The record_version field is added to ShopifySettings by settings-concurrency; if their PR hasn't merged when this builds, this plan introduces the attribute with the agreed semantics and they adopt it (see Cross-plan).
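The hashing and change-summary notes above can be sketched as follows. This is an illustrative version operating on plain dicts rather than the ShopifySettings model, and the summary shape (field name to list of changed keys) is an assumption; only the json.dumps canonicalization and the three content fields come from the text.

```python
import hashlib
import json

CONTENT_FIELDS = ("ui_components", "selector_components", "configuration")
_MISSING = object()  # distinguishes "key absent" from "key present with value None"


def content_fields(settings: dict) -> dict:
    """Project a settings dict onto the three content fields only."""
    return {name: settings.get(name) for name in CONTENT_FIELDS}


def content_hash(settings: dict) -> str:
    """sha256 over canonical JSON: key order and infra fields cannot affect it."""
    canonical = json.dumps(content_fields(settings), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_summary(old: dict, new: dict) -> dict:
    """Map each content field to the sorted keys whose value differs (assumed shape)."""
    summary = {}
    for field in CONTENT_FIELDS:
        before = old.get(field) or {}
        after = new.get(field) or {}
        keys = [key for key in sorted(set(before) | set(after))
                if before.get(key, _MISSING) != after.get(key, _MISSING)]
        if keys:
            summary[field] = keys
    return summary


old = {"ui_components": {"banner": "<div>1</div>"}, "configuration": {"k": 1},
       "last_updated": "t1"}
new = {"ui_components": {"banner": "<div>2</div>"}, "configuration": {"k": 1},
       "last_updated": "t2"}
print(content_hash(old) == content_hash(new), changed_summary(old, new))
```

Because the hash is taken over the projected content fields, a write that only touches last_updated or other infra fields hashes identically and takes the cheap non-transact path.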

Restore

def restore(self, shop_id, version, author_id, expected_version=None) -> RestoreResult:
    meta = self.version_repo.get_version(shop_id, "live", version)  # 404 if missing
    snapshot = self.snapshot_store.get_snapshot(meta.snapshot_s3_key)
    content = migrate_snapshot(snapshot)  # schema drift, below
    settings = self.settings_service.create_or_update_settings(
        shop_id, content, author_id,
        change_source="storefront_admin", event_type="restore",
        restored_from=version, expected_version=expected_version,  # None => service-managed mode
    )
    return settings  # route then runs the shared save/restore mirror-sync helper

The route handler factors the existing metafield-sync-with-207 block in storefront_routes.save_settings into a shared helper used by both endpoints (it is currently ~70 inline lines; extraction is a pure refactor covered by existing route tests). The helper also carries settings-concurrency's post-write metafield reconciliation (their §4.5), so restore's mirror behavior is byte-identical to save's by construction — pinned by a failing-first test (Testing #14).

Schema drift between versions

ShopifySettings.settings_schema_version exists (currently always 1, never bumped). Restore handles drift explicitly:

  1. Same version (the only case today): validate snapshot["settings"] through the current Pydantic model inside create_or_update_settings (which already constructs ShopifySettings(**merged)); a validation failure is a 422 — fail fast, no partial restore.
  2. Older snapshot (snapshot.settings_schema_version < CURRENT_SETTINGS_SCHEMA_VERSION): run an ordered migration registry before merging:
    SETTINGS_MIGRATIONS: dict[int, Callable[[dict], dict]] = {} # {from_version: migrate_fn}, applied sequentially
    v1 ships the registry empty plus the loop and tests; whoever first bumps settings_schema_version is forced (by a unit test asserting registry coverage of 1..CURRENT-1) to provide a migration. This converts "schema drift" from a restore-time surprise into a compile-time-ish checklist item.
  3. Newer snapshot (> CURRENT): 409 with an explicit message — this occurs only after a code rollback; restoring a future-schema payload through old code is undefined and refused. (Fail fast.)
  4. snapshot_format (the envelope version) is checked the same way — unknown format → 500 with explicit error.

Status-code rationale (why 409 for case 3 but 500 for case 4): a newer settings_schema_version is a legitimate, client-visible state conflict — the system was rolled back and the caller can act on it (pick an older version, wait for roll-forward), which is what 409 means. An unrecognized snapshot_format is an internal invariant violation — we wrote every snapshot, so an unreadable envelope means our own writer/reader contract broke; that's a 500 and a bug, never a caller-actionable state.

Additionally, because restore replays through the normal save path, defaults-merging in get_settings_with_defaults continues to paper over additive component drift at read time exactly as it does for old live records today.
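The drift cases above could be wired together roughly as below. This is a sketch under assumptions: migrate_content, missing_migrations, and SnapshotTooNew are hypothetical names, and the mapping of exceptions to 409 is left to the route layer.

```python
from typing import Callable, Dict, List, Optional

CURRENT_SETTINGS_SCHEMA_VERSION = 1
SETTINGS_MIGRATIONS: Dict[int, Callable[[dict], dict]] = {}  # {from_version: migrate_fn}


class SnapshotTooNew(Exception):
    """Case 3: snapshot written by newer code; the route would map this to 409."""


def migrate_content(content: dict, from_version: int,
                    current: int = CURRENT_SETTINGS_SCHEMA_VERSION,
                    registry: Optional[Dict[int, Callable[[dict], dict]]] = None) -> dict:
    """Apply registered migrations sequentially from from_version up to current."""
    registry = SETTINGS_MIGRATIONS if registry is None else registry
    if from_version > current:
        raise SnapshotTooNew(f"snapshot schema {from_version} > current {current}")
    for version in range(from_version, current):
        if version not in registry:
            raise LookupError(f"no migration registered from schema version {version}")
        content = registry[version](content)
    return content  # case 1 (same version) falls through unchanged


def missing_migrations(current: int,
                       registry: Dict[int, Callable[[dict], dict]]) -> List[int]:
    """Versions in 1..current-1 lacking a migration; a unit test can assert this is empty."""
    return sorted(set(range(1, current)) - set(registry))


# Synthetic v0 -> v1 migration, as a test might define it.
synthetic = {0: lambda c: {**c, "configuration": c.get("configuration", {})}}
print(migrate_content({"ui_components": {}}, 0, current=1, registry=synthetic))
```

The coverage check is what turns a forgotten migration into a failing test at the time settings_schema_version is bumped, rather than a failure at restore time.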

Diff

Pure function over two content dicts (current state is fetched from the live item; historical states from S3): walk ui_components / selector_components / configuration; per component key → added/removed; per scalar field → difflib.unified_diff for strings ≤ 64KB. Lives beside the service, fully unit-testable, no I/O.
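A minimal sketch of such a pure diff for a single component dict follows. The result shape and the names diff_components and diff_value are assumptions for illustration; the 64KB threshold and the use of difflib.unified_diff come from the description above. Returning None stands in for the large-string case where no diff body is produced.

```python
import difflib
from typing import Any, Optional

MAX_DIFF_BYTES = 64 * 1024  # strings above this size get no diff body


def diff_value(before: Any, after: Any) -> Optional[Any]:
    """Unified diff for strings within the size cap; before/after pair otherwise."""
    if isinstance(before, str) and isinstance(after, str):
        largest = max(len(before.encode("utf-8")), len(after.encode("utf-8")))
        if largest > MAX_DIFF_BYTES:
            return None  # large-string fallback: report "modified" without a body
        lines = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                     fromfile="before", tofile="after", lineterm="")
        return "\n".join(lines)
    return {"before": before, "after": after}


def diff_components(old: dict, new: dict) -> dict:
    """Classify each key as added, removed, or modified. No I/O."""
    result = {"added": [], "removed": [], "modified": {}}
    for key in sorted(set(old) | set(new)):
        if key not in old:
            result["added"].append(key)
        elif key not in new:
            result["removed"].append(key)
        elif old[key] != new[key]:
            result["modified"][key] = diff_value(old[key], new[key])
    return result


example = diff_components(
    {"banner": "<p>old</p>", "gone": "x"},
    {"banner": "<p>new</p>", "fresh": "y"},
)
print(example["added"], example["removed"])
print(example["modified"]["banner"])
```

A full diff would call diff_components once per content field (ui_components, selector_components, configuration) and collect the three results.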

Dependencies

  • DynamoDB transactions — via the low-level client (transact_write_items; the resource API doesn't support transactions), with a standalone boto3.client per the moto bug documented at synonym_service.py:102-106, and TypeSerializer-encoded items. 2–3 item transactions, well under the 100-item/4MB limits. Moto supports client-level transactions (existing synonym-service tests prove it).
  • S3 — new bucket; admin server task role grant. Assumption: admin server's boto3 session already carries region/creds (it does — DDB + existing buckets).
  • settings-concurrency plan (docs/plans/settings-concurrency-control.md) — version counter + the canonical write primitives (pinned in LLD). Hard prerequisites: (1) the primitives themselves; (2) their M1 fix switching the storefront admin GET /settings to a response carrying the current version (their §5.1) — without it no client can supply a restore/save guard. They also own: the payload-size metric + typed 413 for oversized items (their Phase 1), the post-write metafield reconciliation helper (their §4.5) that restore reuses, and the stale "DDB stream → KV export" docstring fixes in storefront_routes.py (their Phase-1 PR).
  • Shopify uplift roadmap (parallel team) — its admin_server track includes a fail-fast PR (B3) that funnels all settings writes through a single repository write path as an explicit "seam for the concurrency plan", plus golden metafield-payload tests (B1). If B3 lands first, this plan's capture hook attaches to that funnel; the golden tests also protect the restore→metafield path for free. No conflict, but implementers should rebase against that track.
  • storefront-admin-auth plan (docs/plans/storefront-admin-sso.md) — author identity (uniformly prefixed author_id + author_display; prefix↔type mapping imported from their admin_server/models/auth.py) + scope enforcement (settings:write + settings:deploy_live on live-target restore). Soft interface, agreed: v1 normalizes today's identities at capture (shopify_user:{auth.user_id}, api_key:{system_account_id}) and picks up their StorefrontAuthContext when it lands; legacy raw API keys remain a permanent author shape.
  • theme-deploys plan (docs/plans/theme-targeted-deploys.md) — consumes this plan's scope/event_type model and the deploy kwargs; their rolling SETTINGS#DEPLOY#BACKUP record is an interim undo mechanism retired when restore lands. They depend on us, not vice versa.
  • Shopify metafields — unchanged usage; restore adopts the shared helper including concurrency's §4.5 reconciliation (see Restore). The metafield is the only mirror restore must re-sync (ShopifyEntitiesTable has no stream; ecom_settings_exporter reads IndexSettings).

Engineering Excellence

Consistency and Integrity

Race: two writers, version capture vs. lock.

  1. Agent A reads settings (lock version 41), prepares save.
  2. Human B reads settings (41), saves first → live item now 42, version item live#0000000042 written atomically.
  3. A's transaction conditions on record_version = 41 → TransactionCanceledException → SettingsConflictError → 409 to A (caller-managed) or a fresh full-pipeline attempt (service-managed). A's pre-written S3 object for "42" is orphaned and unreachable (DDB metadata is the index). No torn state: metadata and live item commit together or not at all.

Race: restore vs. concurrent save — identical to the above. With expected_version supplied (caller-managed), the window is closed: 409. Without it, restore runs in service-managed mode — each attempt is still conditional (a concurrent save fails the attempt, never silently loses), and the bounded re-run re-reads before re-applying; this is strictly stronger than today's unconditional put.

Invariant: a version item exists ⇒ its snapshot exists (S3 written first); a snapshot may exist without a version item (orphan, ignorable). Stated in code comments and pinned by a test on the write ordering.

Reliability & Resilience

Failure modes are enumerated in Architecture → Failure ordering. The deliberate availability trade (S3 outage blocks saves) is Tenet 1 and called out for sign-off. Restore is idempotent in effect: restoring the same version twice produces two identical-content saves, the second of which is deduped by FR-7 (no duplicate version, plain timestamp save).

Scalability

See the dedicated Performance at scale section below — sized for 200 shops × thousands of versions each.

Observability

  • Structured logs on capture (shop, version, event_type, snapshot_size_bytes, changed keys) and restore (restored_from, new_version, mirror result).
  • Metric/alarm candidates (follow existing admin-server logging patterns; no new metrics infra in v1): count of 5xx on save (would now include S3 failures), count of restores.
  • Size alarms: the two-tier capacity interface (tier 1: 100KB warn / 128KB metafield guard; tier 2: 300KB warn / 400KB DDB) and the typed 413 mapping are owned by settings-concurrency — their canonical wording governs; this plan's restore/replay paths enforce the tier-1 guard, and snapshot_size_bytes on every version item supplies the complementary historical per-shop size series against both tiers at zero extra measurement cost.

Security

  • New bucket private + SSE; least privilege: only the admin-server task role (and one-off backfill role) gets grant_read_write. No CDN, no public ACLs (NFR-3).
  • Version payloads contain merchant HTML/CSS and author IDs — same sensitivity class as the live table; no new data categories.
  • All new endpoints sit behind the existing API-key auth + shop-ownership resolution; once storefront-admin-auth lands, live-target restore declares require_scopes("settings:write", "settings:deploy_live") — identical to the live save it is equivalent to (Cross-plan).
  • Snapshots are immutable by convention; no update/delete API is exposed.

Testing

moto for DDB + S3; fixtures extend admin_server.testing.fixtures (add settings_versions_bucket fixture mirroring existing bucket fixtures). Per CLAUDE.md, each new branch/error path gets a test that fails before the change:

Capture:

  1. Content save creates exactly one version item + one S3 object; payload matches post-save state.
  2. Metadata-only writes (webhook_service metadata, active_index clear, association fields via update_settings) create no version (taxonomy pin test).
  3. Identical-content save creates no version but updates last_updated (FR-7), via the cheap non-transact path.
  4. First-ever save for a shop creates v1 with event_type="save".
  5. Caller-managed conflict (single attempt, conditional failure) → 409, no version item written, S3 orphan tolerated.
  6. S3 put failure → save fails 5xx, live item unchanged.
  7. changed summary correctness for add/modify/remove component cases.
  8. Service-managed retry × capture: simulate a conflict on attempt 1 (concurrent writer bumps the counter between read and write), success on attempt 2 → exactly one version item exists, keyed to the committed counter value, with content from the fresh attempt-2 merge; attempt-1's snapshot is an unreferenced orphan; never two version items for one logical save.

API:

  9. List: ordering (newest first), pagination cursor round-trip, empty history.
  10. Get: payload round-trip; 404 unknown version.
  11. Diff: added/removed/modified components; large-string fallback (no diff body); against=current and against=N; to_version equals the live record_version at diff time.
  12. Restore happy path: new version with restored_from, content matches snapshot, infra fields untouched, mirror synced via the shared helper (mock ShopifyService).
  13. Restore with no access token → 207 partial, DDB restored (mirrors existing save test).
  14. Restore mirror parity (failing-first): restore invokes the same shared helper as save, including concurrency's §4.5 post-write metafield reconciliation; asserted by replacing the helper with a sentinel and verifying both endpoints call it with equivalent arguments.
  15. Restore 409 on expected_version mismatch (caller-managed); restore without expected_version runs service-managed (asserted: conditional write attempted, not unconditional); 409 on future settings_schema_version; 422 on model-invalid snapshot; migration registry applied for older schema version (synthetic v0→v1 migration in test).
  16. Auth: endpoints reject keys not owning the shop; live-target restore requires settings:write + settings:deploy_live once scope enforcement lands (reuse existing storefront route auth tests pattern).

Backfill:

  17. Script is idempotent (re-run creates no duplicate version items), skips shops with existing history, and its transaction's ConditionCheck aborts cleanly when live traffic bumps the counter mid-backfill (no version item written for that shop on that pass; re-run picks it up).

Committed-version propagation:

  19. Save and restore responses, and the payload handed to the metafield reconciliation helper, carry the committed record_version (the persisted object returned by save_settings), never the pre-save object's stale value; failing-first against an implementation that returns the merged input.
  20. Restore of a snapshot exceeding the tier-1 metafield guard fails 422 settings_too_large (shared tier-1 status per concurrency §6.1) before any DDB write (mirror can never be stranded by a restore).

Performance isolation:

  21. Partition-isolation pin: insert 1,000 version items for a shop; get_settings, list_shop_sessions, and list_settings_by_system_account results are unchanged (NFR-5).

Run via pants test //components/shopify/admin_server:: plus the e2e suite in CI; no local e2e.

Performance at scale

Sizing basis (per Raynor's review question): 200 shops, each with an automated agent team saving settings edits — low thousands of version records per shop. Concretely: 200 shops × 5,000 versions × ~120KB payloads, with hot shops sustaining ~1 save/minute during agent sessions.

1. Reads of the live settings record — O(1), provably unaffected

get_settings is GetItem(pk="SHOP#{domain}", sk="SETTINGS"): an exact composite-key lookup whose cost is fixed by the size of that one item (1 RCU per 4KB strongly consistent, so about 30 RCU at ~120KB), independent of anything else in the table. Version items live in a different partition (SHOPVER#{domain}), so they cannot appear in:

  • any GetItem/Query on SHOP#{domain} — including the full-partition sweep in list_shop_sessions (query_by_pk, filters client-side), which runs on every storefront save/restore for the access-token lookup. This sweep is exactly the failure mode the sk-prefix placement would have caused: at 5,000 versions × 2KB it would have added ~10MB (≈10 paginated queries, ~1,250 RCU) to every save. The dedicated partition eliminates the class, not just the instance.
  • a hypothetical future begins_with(sk, "SETTINGS") query — version sks (live#…) don't share the prefix, and version pks don't share the partition.
  • GSI_SystemAccountId — sparse; version items never carry system_account_id.

A pin test asserts that list_shop_sessions and get_settings results are byte-identical before/after inserting 1,000 version items for the shop.

2. List-versions pagination at thousands of items

Query reads only the requested page: limit=20 × 2KB ≈ 40KB ≈ 5 RCU (eventually consistent), independent of total history depth. A DDB Query with Limit + ExclusiveStartKey never scans past the page boundary, and newest-first ordering means the common request (recent history) is always the first page. Walking all 5,000 versions reads ~10MB, i.e. ~10 sequential Query responses at DynamoDB's 1MB-per-response cap; only the backfill/ops tooling ever does that, never the UI (page size capped at 100). No FilterExpression is used anywhere in the listing path (filters would read-then-discard and break Limit semantics).
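The page-bounded behaviour can be illustrated with an in-memory stand-in for the SHOPVER# partition: a sorted list of sort keys plus binary search, which touches only the requested page regardless of history depth. list_versions and version_sk are hypothetical names; the {scope}#{n:010d} key format is the plan's own, and the zero-padding is what makes lexicographic order equal numeric order.

```python
import bisect
from typing import List, Optional, Tuple


def version_sk(scope: str, n: int) -> str:
    """Sort key in the plan's format; zero-padding keeps string order numeric."""
    return f"{scope}#{n:010d}"


def list_versions(sort_keys: List[str], scope: str, limit: int = 20,
                  exclusive_start_key: Optional[str] = None
                  ) -> Tuple[List[str], Optional[str]]:
    """Newest-first page of one scope from an ascending-sorted key list.

    Emulates a descending Query with Limit + ExclusiveStartKey: the cursor is
    the last key returned, and the next page resumes strictly below it.
    """
    prefix = f"{scope}#"
    lo = bisect.bisect_left(sort_keys, prefix)
    hi = bisect.bisect_left(sort_keys, prefix + "\uffff")
    if exclusive_start_key is not None:
        hi = bisect.bisect_left(sort_keys, exclusive_start_key, lo, hi)
    start = max(lo, hi - limit)
    page = sort_keys[start:hi][::-1]
    cursor = page[-1] if start > lo else None  # None once history is exhausted
    return page, cursor


partition = sorted(
    [version_sk("live", n) for n in range(1, 46)]
    + [version_sk("theme#7", n) for n in range(1, 4)]
)
first_page, cursor = list_versions(partition, "live", limit=20)
print(first_page[0], cursor)
```

Theme-scoped keys share the partition but not the prefix, so the two histories never interleave in a listing.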

3. Partition growth and the 10GB question

  • Version metadata partition (SHOPVER#{domain}): 5,000 × 2KB = 10MB; even 100,000 versions = 200MB. The ~10GB partition guidance is 3 orders of magnitude away; on-demand tables also split hot partitions automatically. Because snapshot bodies are in S3, partition size grows at 2KB/version, not 120KB/version — this is the decisive reason for S3-bodies + DDB-pointers (storing bodies in DDB would put 5,000 × 120KB = 600MB per shop into the hot operational table and make every history page read megabytes).
  • S3: 5,000 × 120KB = 600MB/shop; ×200 shops = 120GB fleet ≈ $1.5/mo at IA. Unbounded growth is handled by the retention mechanism (TTL field + lifecycle, pre-plumbed, disabled in v1 — see Retention).
  • Live item partition: unchanged — still exactly one ~120KB settings item plus sessions/API-key items.

4. Write amplification per save

| | Today | With versioning (+ concurrency) |
|---|---|---|
| DDB | 1 PutItem, ~120KB ≈ 120 WCU | 1 TransactWriteItems (live update 120KB + version put 2KB), transact 2× premium ≈ 244 WCU |
| S3 | none | 1 PUT (~120KB) |
| Latency | ~10–20ms | +~10ms (transact) +20–50ms (S3 PUT, same region) |

Latency context: the save path already makes 1–3 Shopify GraphQL calls (metafield sync, token checks) at 200–800ms each; +60ms is noise. Cost context at the aggressive scenario — 200 shops × 60 saves/day = 12K saves/day ≈ 360K/mo: today 43M WCU ≈ $54/mo; after, 88M WCU ≈ $110/mo + $2/mo S3 PUTs. The increment is dominated by the transact 2× premium, which optimistic locking (settings-concurrency) requires regardless of versioning; versioning's own marginal cost is the 2KB item + S3 ($3/mo fleet). High-frequency agent saves hit two external ceilings long before DDB: Shopify GraphQL rate limits on the metafield mirror, and the per-partition 1000 WCU/s cap on the live item (~8 saves/s/shop at 120KB — pre-existing, unchanged by this design since the version item lands in a different partition).
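The WCU and dollar figures above can be reproduced with a few lines of arithmetic. The on-demand price of $1.25 per million write units is an assumption, chosen because it reproduces the 43M WCU ≈ $54 figure; the item sizes, the 2× transactional premium, and the save rate are the ones stated in this section.

```python
import math

WRITE_PRICE_PER_MILLION = 1.25  # assumed on-demand price, USD per million write units
SAVES_PER_MONTH = 200 * 60 * 30  # 200 shops x 60 saves/day x 30 days


def wcu(item_kb: float, transactional: bool = False) -> int:
    """1 WCU per KB (rounded up); transactional writes cost double."""
    units = math.ceil(item_kb)
    return units * 2 if transactional else units


def monthly_cost(wcu_per_save: int) -> float:
    return SAVES_PER_MONTH * wcu_per_save / 1_000_000 * WRITE_PRICE_PER_MILLION


today = wcu(120)                                           # plain PutItem of the live item
after = wcu(120, transactional=True) + wcu(2, transactional=True)  # live item + version item

print(today, after)
print(round(monthly_cost(today)), round(monthly_cost(after)))
```

Splitting the "after" figure shows where the increment comes from: 240 of the 244 WCU belong to the live item under the transactional premium, and only 4 WCU belong to the version item itself.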

5. Retention/TTL as the mitigation

Pre-plumbed, disabled in v1, activation is a config change (details in Data Storage → Retention): table TTL already enabled on ttl; version items gain an optional ttl; matching S3 expiry rule activated together with it; restore/deploy/backfill versions exempt so rollback anchors survive. The scale analysis above shows TTL is a cost lever, not a performance prerequisite — no read or write path degrades with version count.

Counter interaction

None of this changes the shared record_version design: the counter lives on the live item and its conditional increment is unaffected by where history rows are stored. (Confirmed no change needed with settings-concurrency.)

Key Risks

  1. Sequencing with settings-concurrency. The transactional write primitive is shared. Mitigation: agreed attribute + primitive interface (below); whichever PR lands first introduces it.
  2. Save-path availability now includes S3. Accepted trade (Tenet 1); flagged for explicit sign-off.
  3. Author identity churn. Mitigated: agreed with storefront-admin-auth on self-contained author_id actor strings + denormalized author_display — no foreign key into their system.
  4. tmp/-era expectations. Operators must learn the new flow; mitigated by updating docs/integrations/storefront-css-customization-guide.md (it currently teaches manual backups) in the UI/API PR.
  5. Size ceilings. Pre-existing, not introduced by versioning, but shared with it: the metafield mirror (~128KB) binds before DDB (400KB). Mitigated by the shared two-tier capacity interface (concurrency-owned thresholds; restore/replay enforce the tier-1 guard) plus the snapshot_size_bytes time series and a pre-agreed sharding trajectory — see "The 400KB live-item ceiling".

Cost Analysis

Full derivation in Performance at scale §4. Summary at the 200-shop agent-team scenario (360K saves/mo fleet-wide): DDB write cost rises from ~$54/mo to ~$110/mo — almost entirely the TransactWriteItems 2× premium required by optimistic locking (settings-concurrency) regardless of versioning; versioning's own marginal cost (2KB items + S3 PUTs + storage) is ~$5/mo fleet-wide. Steady-state (today's real save rates, ~2 orders of magnitude lower): noise. New bucket: $0 fixed.

Release / Roll-out

Phased, each phase shippable and independently revertible:

  1. PR1 — infra: bucket + grants + SETTINGS_VERSIONS_BUCKET config (fail-fast startup check deferred to PR2 so PR1 deploys cleanly before the env var consumer exists; PR2 adds the strict check).
  2. PR2 — capture: models, SettingsSnapshotStore, SettingsVersionRepository, capture in create_or_update_settings, transact primitive (coordinated with settings-concurrency: if their conditional-write PR is merged, extend it; else introduce the agreed primitive). From this deploy, every content save is versioned. Rollback = revert PR2; saves regress to current behavior, existing version data stays valid.
  3. Backfill (one-off script, scripts/ecom/backfill_settings_versions.py): scan ShopifyEntities for sk=SETTINGS; for each shop with no history, snapshot the current state and write the version item (event_type="backfill", version = the counter value read, author = script principal). Collision safety with live traffic: the version item is written in a TransactWriteItems of [ConditionCheck on the live item asserting record_version still equals the value read (or attribute_not_exists for legacy), Put version item with attribute_not_exists(sk)] — if a live save lands mid-backfill the transaction cancels cleanly and the re-run picks the shop up at its new counter value; a backfill item can never collide with or shadow a live-traffic version number. Idempotent; run once per env after PR2. Production write → human-confirmation gate per CLAUDE.md (--dry-run defaults on; show output first).
  4. PR3 — API: list/get/diff/restore endpoints + shared metafield-sync helper refactor (carrying concurrency's §4.5 reconciliation). (The stale stream-docstring fix is owned by settings-concurrency's Phase-1 PR.)
  5. PR4 — UI (components/storefront_admin, the standalone settings editor): versions panel as a new section alongside the existing settings sections: table (time, author, event type, changed components), "View"/"Diff vs current", "Restore…" with confirmation dialog showing the diff summary. No editing. Why storefront_admin and not admin_worker: it is the surface that already consumes the storefront admin API (storefront_routes) these endpoints extend, and it is where storefront-admin-auth's session/token auth and the settings:deploy_live scope land — admin_worker (the internal staff dashboard) has neither.
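For the backfill step above, the collision-safe transaction could be assembled roughly as below. This sketch only builds the TransactItems list in DynamoDB JSON and does not call AWS. build_backfill_transaction is a hypothetical name, the version item is trimmed to a few attributes (snapshot key, author, and timestamps are omitted), and numbering a legacy record's backfill version as 0 is an assumption based on "legacy records count as 0".

```python
from typing import List, Optional


def build_backfill_transaction(table: str, domain: str,
                               observed_version: Optional[int]) -> List[dict]:
    """Return [ConditionCheck on the live item, conditional Put of the version item].

    observed_version is the record_version read during the scan, or None for a
    legacy record that has no counter attribute yet.
    """
    live_key = {"pk": {"S": f"SHOP#{domain}"}, "sk": {"S": "SETTINGS"}}
    if observed_version is None:
        check = {"ConditionExpression": "attribute_not_exists(record_version)"}
    else:
        check = {
            "ConditionExpression": "record_version = :seen",
            "ExpressionAttributeValues": {":seen": {"N": str(observed_version)}},
        }
    n = observed_version or 0  # assumption: legacy records backfill as version 0
    version_item = {
        "pk": {"S": f"SHOPVER#{domain}"},
        "sk": {"S": f"live#{n:010d}"},
        "entity_type": {"S": "SETTINGS_VERSION"},
        "event_type": {"S": "backfill"},
    }
    return [
        {"ConditionCheck": {"TableName": table, "Key": live_key, **check}},
        {"Put": {"TableName": table, "Item": version_item,
                 "ConditionExpression": "attribute_not_exists(sk)"}},
    ]


items = build_backfill_transaction("ShopifyEntities", "example.myshopify.com", 41)
print(items[0]["ConditionCheck"]["ConditionExpression"])
print(items[1]["Put"]["Item"]["sk"])
```

If a live save bumps the counter between the scan and the transaction, the ConditionCheck fails and the whole transaction cancels, so the Put never lands. The attribute_not_exists(sk) guard on the Put is what makes a re-run a no-op for shops already backfilled.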

Backward compatibility: all changes are additive (new SHOPVER# partitions, new model, new endpoints). Old code never reads version items (different partition). Rollback strategy per phase: revert the PR; version data is inert without the code.

Impact on other components

  • admin_server: described above. BaseRepository gains pagination passthrough (additive). Recommended (independent) hygiene fix: list_shop_sessions should query by sk prefix USER# instead of sweeping the whole SHOP# partition — not load-bearing for this design (version items are in a separate partition) but it's the cheap fix for a latent inefficiency discovered during this analysis.
  • infra/ecom/stacks/shopify_admin_stack.py: new bucket + grants + env var.
  • storefront_admin: new versions section (PR4). admin_worker: no change.
  • storefront_search, search_proxy, theme extension: no change — they consume the metafield/MarqoUIConfig, and restore goes through the same mirror sync.
  • scripts/: backfill script; docs/integrations/storefront-css-customization-guide.md updated to retire manual backups.
  • Teammate plans: interfaces below.

Alternative Solutions Considered

| | A. Full snapshots as DDB items | B. S3 payload + DDB metadata (chosen) | C. S3 only (no DDB metadata) | D. DDB streams → async archiver |
|---|---|---|---|---|
| Atomic with save | yes (transact) | yes (metadata; payload pre-written) | no (list = S3 ListObjects) | no (capture lags/can drop) |
| Item-size ceiling | inherits 400KB minus overhead; binds first | payload unbounded | unbounded | unbounded |
| List-versions cost | high (or Projection gymnastics) | trivial | S3 List + per-object HEAD for metadata | trivial |
| "Every save versioned" guarantee | yes | yes | yes | no (stream lambda failure ⇒ silent gap; violates Tenet 1) |
| Extra infra | none | 1 private bucket | 1 bucket | stream + lambda + DLQ |
| Notes | viable today; rejected for ceiling coupling + hot-table bloat | | rejected: no transactional listing/audit record | rejected: also requires enabling a table stream that doesn't exist |

D deserves a sentence: an async archiver is the standard "don't touch the write path" answer, but it cannot satisfy the operational requirement that motivated this work — a save whose version silently failed to materialize is exactly the incident we're trying to end.

Cross-plan interfaces

All three interfaces negotiated by message 2026-06-10 and reconciled against the teammates' published plans.

settings-concurrency (docs/plans/settings-concurrency-control.md) — AGREED; canonical block ACKed verbatim

  • Single int attribute record_version on the live item (avoids clash with settings_schema_version). Starts at 1 on creation; every successful write (content and infra, all writers) increments it by exactly 1 via _version_condition(expected_version) (record_version = :expected OR attribute_not_exists(record_version)); legacy records count as 0. Strictly monotonic, no counter gaps; history sequence numbers are sparse over it (content writes only). Storage name record_version; all API payloads expose version.
  • Canonical signatures pinned verbatim in my LLD and their §4.2/§4.3: save_settings(settings, *, expected_version, change_source, extra_transact_items=None) and the non-capturing update_settings(shopify_domain, updates, *, expected_version, change_source), plus the create_or_update_settings funnel (with my ACKed source_scope/source_version addendum). Old unconditioned methods are deleted, not aliased. Conflicts → SettingsConflictError (409). Repository primitives never retry; bounded retry is service-managed mode re-running the full pipeline (fresh snapshot per attempt). update_ui_components is deleted (their B1) — routes call the funnel directly.
  • Every write path supplies change_source (admin_ui | storefront_admin | onboarding | webhook_registration | webhook_cleanup | index_lifecycle | theme_deploy | script); version items persist it alongside event_type and author fields.
  • Their M1 fix is my hard prerequisite: storefront admin GET /settings returns the current version (their §5.1) so clients can supply restore/save guards.
  • They own: payload-size metric + 300KB threshold + typed 413 (Phase 1), §4.5 post-write metafield reconciliation (reused by my restore via the shared helper), stale stream-docstring fixes (Phase 1), and the 400KB §6.1 trajectory statement (single-root-lock constraint; my S3 snapshot store cited as the trajectory-2 precedent).
  • Their verified caveats, both honored by my design: version items must carry a distinct entity_type (SETTINGS_VERSION) and must never carry system_account_id (sparse-GSI key — would bloat GSI_SystemAccountId and admin_lambda scan costs); both hold, and the dedicated SHOPVER# partition goes further by removing version items from the list_shop_sessions sweep entirely.
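
The counter semantics agreed above can be illustrated with a small in-memory model. This is a sketch only, not the repository code: guarded_write and its capture flag are hypothetical names, the dict stands in for the live DynamoDB item, and the guard simply mirrors the condition quoted above (record_version = :expected OR attribute_not_exists(record_version)).

```python
class SettingsConflictError(Exception):
    """Stand-in for the plan's conflict error (surfaced as HTTP 409)."""


def guarded_write(item, updates, *, expected_version, capture, history):
    """Apply one write to an in-memory item, mimicking the version guard."""
    # Mirrors: record_version = :expected OR attribute_not_exists(record_version)
    if "record_version" in item and item["record_version"] != expected_version:
        raise SettingsConflictError(
            f"expected {expected_version}, found {item['record_version']}"
        )
    # Legacy records without the attribute count as 0, so their first write yields 1.
    new_version = item.get("record_version", 0) + 1
    item.update(updates)
    item["record_version"] = new_version
    if capture:  # only content writes are captured into history
        history.append(new_version)
    return new_version


live, history = {}, []
# Content write, infra write, content write: the counter has no gaps,
# while the history sequence is sparse over it.
guarded_write(live, {"ui_components": {"a": 1}}, expected_version=0, capture=True, history=history)
guarded_write(live, {"active_index": "idx-2"}, expected_version=1, capture=False, history=history)
guarded_write(live, {"ui_components": {"a": 2}}, expected_version=2, capture=True, history=history)
print(history)  # [1, 3]
```

A writer holding a stale expected_version gets SettingsConflictError instead of overwriting; per the agreement above, any retry belongs to the service layer, not to this primitive.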

theme-deploys (docs/plans/theme-targeted-deploys.md) — AGREED

  • Their staged theme records: sk=SETTINGS#THEME#{theme_id} under SHOP#{domain}, plus one internal rolling sk=SETTINGS#DEPLOY#BACKUP (pre-versioning undo snapshot — retired when my restore lands). Staged/backup records carry no system_account_id (invisible to the sparse GSI) and their plan pins tests that GSI listing excludes them. Each theme record has its own record_version counter, so my per-theme history numbering matches their read of the concurrency plan.
  • My history stays in the dedicated SHOPVER#{domain} partition with per-source-record scopes: live#{n} and theme#{theme_id}#{n}. Histories never interleave; version items carry a scope attribute.
  • Deploy-to-live is a version event on the live history: they call create_or_update_settings(shop_id, content, updated_by, expected_version=expected_live_version, change_source="theme_deploy", event_type="deploy", source_scope=f"theme:{theme_id}", source_version=staged.record_version). Their backup Put rides the same transaction via extra_transact_items (alongside my version Put — 3-item transaction on deploys). Capture, and rollback-of-a-deploy via restore, come for free. change_source="theme_deploy" (their writer identity, added to concurrency's enum); event_type="deploy" is the semantic discriminator — no conflict with my model.
  • v1 ships live-scope capture; theme-scope capture (versioning staged-record saves) is an extension owned by their plan — model, S3 keying, and repository are already scope-aware.
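
The per-scope keying described above can be sketched as a small key builder. The pk/sk formats follow this plan (pk SHOPVER#{domain}, sk {scope}#{n:010d}); the helper names version_key and theme_scope are hypothetical and exist only for illustration.

```python
def theme_scope(theme_id) -> str:
    """Scope string for a staged theme record's history (hypothetical helper)."""
    return f"theme#{theme_id}"


def version_key(domain: str, scope: str, n: int) -> dict:
    """Build the DynamoDB key for version n of one scope.

    The 10-digit zero padding makes lexicographic sort-key order match
    numeric version order within a scope.
    """
    return {"pk": f"SHOPVER#{domain}", "sk": f"{scope}#{n:010d}"}


print(version_key("example.myshopify.com", "live", 42))
# {'pk': 'SHOPVER#example.myshopify.com', 'sk': 'live#0000000042'}
print(version_key("example.myshopify.com", theme_scope(123), 7)["sk"])
# theme#123#0000000007
```

Because each scope is a distinct sort-key prefix, a begins_with query on "live#" or "theme#{theme_id}#" returns one history without touching the others, which is what keeps the histories from interleaving.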

storefront-admin-auth (docs/plans/storefront-admin-sso.md) — AGREED

  • Version records store the two-field author shape (their §8.2): author_id + denormalized author_display (email | token name | None). Identity grammar (their §4.6, corrected after their round-2 review): uniformly prefixed: user:{cognito_sub} | token:{token_id} | api_key:{system_account_id} | shopify_user:{shopify_user_id}; capture normalizes embedded-app writers (which stamp raw Shopify user ids into updated_by_user_id today) to the shopify_user: prefix at write time, so split-at-first-colon extraction can never misparse. The canonical enum {console_user, cli_token, api_key, shopify_user} and the prefix↔type mapping live as code in admin_server/models/auth.py (their PR1) and are imported, never re-derived; unknown prefixes fail closed. The shared module's classify_legacy(value) exists for reading historical unprefixed data only (no colon + not "system" → shopify_user). This matters only if old updated_by_user_id values are ever attributed during backfill; new writes always produce prefixed forms. The literal "system" written by webhook writers can never appear as a version author because infra writes are not captured (FR-6). The live item's updated_by_user_id continues to receive whatever the auth context supplies, so existing consumers are unaffected.
  • Legacy raw API keys remain valid on storefront routes indefinitely (allow→warn→deny ratchet is config); api_key:{system_account_id} with null display is a permanent author shape, and historical updated_by_user_id values already use it.
  • Scopes settled: settings:read / settings:write / settings:deploy_live (the deploy-live scope gates any mutation of the live record, per theme-deploys' semantics). Restore requires the same scopes as the equivalent save against the same target. For live: settings:write + settings:deploy_live (identical to live POST /settings); for staged (future): settings:write. There is no separate restore scope. This requirement is invisible to console sessions (which carry all scopes); the boundary bites only CLI tokens minted without the deploy scope, which is intended.
  • Restore is a write and receives the same StorefrontAuthContext; the restoring actor becomes the new version's author (event_type=restore).
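
The identity grammar above can be sketched as a split-at-first-colon parser that fails closed. This is an illustration only: the real enum and mapping live in admin_server/models/auth.py and must be imported from there, so AuthorType, the prefix-to-type pairing, parse_author_id, and normalize_embedded_writer below are assumptions inferred from the order in which the prefixes and enum values are listed in this section.

```python
from enum import Enum


class AuthorType(str, Enum):
    CONSOLE_USER = "console_user"
    CLI_TOKEN = "cli_token"
    API_KEY = "api_key"
    SHOPIFY_USER = "shopify_user"


# Assumed pairing of prefix to type; the canonical mapping is owned by the auth plan.
_PREFIX_TO_TYPE = {
    "user": AuthorType.CONSOLE_USER,
    "token": AuthorType.CLI_TOKEN,
    "api_key": AuthorType.API_KEY,
    "shopify_user": AuthorType.SHOPIFY_USER,
}


def parse_author_id(author_id: str):
    """Split at the first colon; unknown or missing prefixes fail closed."""
    prefix, sep, value = author_id.partition(":")
    if not sep or not value or prefix not in _PREFIX_TO_TYPE:
        raise ValueError(f"unrecognized author_id: {author_id!r}")
    return _PREFIX_TO_TYPE[prefix], value


def normalize_embedded_writer(raw_shopify_user_id: str) -> str:
    """Prefix a raw Shopify user id at capture time so parsing is unambiguous."""
    return f"shopify_user:{raw_shopify_user_id}"


print(parse_author_id(normalize_embedded_writer("987654")))
# (<AuthorType.SHOPIFY_USER: 'shopify_user'>, '987654')
```

Splitting only at the first colon means an identifier that itself contains a colon survives intact, and rejecting everything else keeps the literal "system" and any raw unprefixed id from being accepted as a version author.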

FAQs

  • Why is restore "content fields only"? Infra fields (active_index, system_account_id, webhook metadata) describe current infrastructure, not merchant intent; restoring a 3-week-old active_index could point metafields at a deleted index (index_service.delete_index clears it deliberately).
  • Why snapshot post-save state rather than pre-save? Post-save means "version N is exactly what the world looked like after write N" — restore is a pure put of one object. The pre-change state of the very first instrumented save is covered by the backfill.
  • Can agents still pull a JSON backup? Yes — GET …/versions/{v} returns the full payload; the manual curl-to-file workflow keeps working but is no longer load-bearing.
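
The "content fields only" rule from the first FAQ can be sketched as a filter over a snapshot. The field names come from this plan's description of ShopifySettings; CONTENT_FIELDS as a constant and build_restore_payload are hypothetical names, and the real restore replays the result through the existing save path rather than writing it directly.

```python
import copy

# Content fields as described in this plan; everything else is treated as infra.
CONTENT_FIELDS = ("ui_components", "selector_components", "configuration")


def build_restore_payload(snapshot: dict) -> dict:
    """Return only the content fields of a snapshot, deep-copied.

    Infra fields such as active_index or system_account_id are dropped,
    so a restore cannot resurrect stale infrastructure state.
    """
    return {
        name: copy.deepcopy(snapshot[name])
        for name in CONTENT_FIELDS
        if name in snapshot
    }


old_snapshot = {
    "ui_components": {"banner": "<div>old</div>"},
    "selector_components": {"cart": "#cart"},
    "configuration": {"enabled": True},
    "active_index": "index-deleted-weeks-ago",
    "system_account_id": "acct-1",
}
payload = build_restore_payload(old_snapshot)
print(sorted(payload))  # ['configuration', 'selector_components', 'ui_components']
```

The live item's current active_index is left untouched because it never appears in the payload handed to the save path.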

References

  • components/shopify/admin_server/admin_server/models/shopify_entities.py (live model)
  • components/shopify/admin_server/admin_server/services/settings_service.py (save path, metafield mirror)
  • components/shopify/admin_server/admin_server/routes/storefront_routes.py (API-key settings routes, 207 contract)
  • infra/ecom/stacks/shopify_admin_stack.py (table + bucket definitions)
  • docs/integrations/storefront-css-customization-guide.md (current manual-backup workflow this replaces)
  • components/controller/merchandise/services/synonym_service.py (in-repo precedent for client-level transact_write_items: standalone client for the moto bug, TypeSerializer item encoding, TransactionCanceledException mapping)
  • Sibling plans (paths confirmed by their authors): docs/plans/settings-concurrency-control.md, docs/plans/theme-targeted-deploys.md, docs/plans/storefront-admin-sso.md

Design Review Follow-ups

(To be filled after Raynor's review.)