Concurrency Control for ShopifyEntities Settings Records

Status: APPROVED by adversarial plan-verifier (Opus), round 3 of 3 — awaiting Raynor's sign-off
Owner: settings-concurrency (planning agent), for Raynor
Date: 2026-06-10 (reviewed 2026-06-11)
Related plans: settings-versioning (history/rollback), theme-deploys (per-theme records), storefront-admin-auth (writer identity). See Cross-plan interfaces.

1. Problem statement

The ShopifySettings record (DynamoDB table {env}-ShopifyEntitiesTable, pk SHOP#{domain}, sk SETTINGS; defined in infra/ecom/stacks/shopify_admin_stack.py:828-876) is written by at least seven independent code paths with no concurrency control of any kind. Every full-record save is an unconditioned put_item and every partial save is an unconditioned update_item:

components/shopify/admin_server/admin_server/repositories/shopify_settings_repository.py:44-52 — save_settings() → blind put_item of the entire record.
components/shopify/admin_server/admin_server/repositories/shopify_settings_repository.py:54-66 — update_settings() → unconditioned update_item (SET/REMOVE per field, base_repository.py:219-271).

The dominant write path is a classic read-modify-write (RMW) cycle with a wide race window:

SettingsService.create_or_update_settings()        # services/settings_service.py:59-95
    existing = repository.get_settings(shop_id)    # line 74  ← READ
    settings = self._merge_settings(existing, ...)  # line 78  ← MODIFY (in memory)
    self.repository.save_settings(settings)         # line 93  ← WRITE (full put_item, no condition)

Any two concurrent writers that interleave between line 74 and line 93 silently lose one writer's changes — including changes to fields the loser never touched, because save_settings rewrites the whole record.

1.1 Inventory of writers (file:line evidence)

#	Writer	Path	Write style	Fields touched
W1	Admin UI save (Shopify embedded app)	`routes/settings_routes.py:60-127` (POST `/api/settings`) → `update_ui_components` (`settings_service.py:470-499`) → `create_or_update_settings`	RMW + full `put_item`	`ui_components`, `selector_components`, `updated_by_user_id`, `last_updated`
W2	Admin UI settings initialize	`routes/settings_routes.py:158-234` (POST `/api/settings/initialize`) → same RMW path	RMW + full `put_item`	same as W1
W3	Storefront admin editor (Cloudflare Worker UI) and any API-key caller	`routes/storefront_routes.py:149-256` (POST `/api/v1/storefront/shops/{domain}/settings`) → same RMW path	RMW + full `put_item`	same as W1 (writer id `api_key:{system_account_id}`, `storefront_routes.py:178`)
W4	Settings push script / CSS-workflow agents	`scripts/ecom/storefront_settings.py:58-77` (`push` subcommand POSTs a previously backed-up JSON file to the W3 endpoint)	RMW + full `put_item`, with arbitrarily stale source data	entire `ui_components` / `selector_components` payload
W5	API-key onboarding	`routes/api_key_routes.py:158-184` (default-settings init via `create_or_update_settings`) and `:186-195` (partial `update_settings` of `system_account_id`, `cell_id`, `active_index`)	RMW + full `put_item`; then partial `update_item`	UI defaults; index-association fields
W6	Webhook registration / cleanup	`services/webhook_service.py:613-669` (`_store_webhook_metadata`: read `metadata` dict, merge, then partial `update_settings` or full `save_settings` of a brand-new record with `ui_components={}`); `:689-709` (`_clear_webhook_metadata`: RMW of `metadata`)	RMW on the `metadata` attribute + partial `update_item`, or full `put_item`	`metadata`, `last_updated`, `updated_by_user_id`
W7	Index lifecycle	`routes/index_routes.py:70-85` (set `active_index` on create), `services/index_service.py:569-579` (clear `active_index` on delete)	partial `update_item`	`active_index`, `system_account_id`, `cell_id`

Readers that depend on these fields and are corrupted by lost updates: webhook domain resolution and sync (services/sync_service.py:230), metafield writes and KV lookups via active_index (dependencies.py: get_active_index_id), storefront shop listing via the GSI_SystemAccountId sparse GSI (storefront_routes.py:51-96), and the admin console store browser (components/admin_lambda/admin_lambda/services/shopify_stores_service.py:75-143, read-only scan/query).

1.2 Concrete race scenarios

UI editor vs UI editor (lost update). Two people (or a person and a CSS-workflow agent using W4) edit the same shop. Both load settings, both save. The second put_item silently reverts the first save. This is the storefront-CSS-onboarding workflow's standing footgun: scripts/ecom/storefront_settings.py push posts a file captured at backup time, which may be hours stale, wiping any merchant edits made in between.
Webhook registration vs UI save (cross-field clobber). W6's full put branch (webhook_service.py:651-664) fires when get_settings returns None, creating a record with ui_components={}. If it races W1/W5's default-settings creation (both see None, both write), one writer's record wins entirely — either webhook metadata is lost (webhook cleanup later leaks orphaned Shopify webhooks) or the UI defaults are lost and the shop renders with empty components. Both have the same first-write window during onboarding, which is exactly when both run.
UI save vs index lifecycle. W1/W3's full put_item at settings_service.py:93 overwrites active_index/system_account_id/cell_id with the values it read at line 74. If W7 sets active_index (index create, index_routes.py:74) between a UI editor's read and save, the UI save reverts active_index to the stale value (or to None), breaking metafield indexId writes and KV lookups until something rewrites it.
Webhook metadata merge races itself. _store_webhook_metadata and _clear_webhook_metadata both RMW the metadata dict. Concurrent registration of different topics, or registration racing cleanup, loses webhook IDs → orphaned webhook subscriptions on Shopify that we can no longer delete by ID.
DDB ↔ metafield divergence. After the DDB write, W1/W3 serialize the settings into the marqo.search_settings Shopify metafield (settings_service.py:158-199 save_settings_to_metafields). Two concurrent savers can commit DDB in order A→B but metafields in order B→A, leaving the storefront (which reads the metafield) rendering settings that DDB no longer contains — indefinitely, until the next save.

1.3 A note on the FN incident

The FN customer's production lost-update conflicts were document-level write conflicts inside the ecom indexer pipeline (_mq_version conflicts between customer batches and Pixel updates). The ecom_indexer does not write the ShopifyEntities SETTINGS record — its DDB writes go to the IndexSettings and indexer-jobs tables (components/ecom_indexer/ecom_indexer/config.py:130-131; no SHOPIFY_ENTITIES_TABLE_NAME reference exists in that component). FN is cited here as evidence that the same class of bug (unconditioned concurrent writes) has already caused customer-visible data loss in this system; fixing the document pipeline is explicitly out of scope. The settings record has the identical structural flaw with more writers and zero protection.

2. Goals and non-goals

Goals

No silent lost updates on the SETTINGS record: every write either observes the latest committed state or fails loudly (CLAUDE.md fail-fast).
User-facing saves that lose a race return a structured HTTP 409 the clients can act on.
System writers (W5–W7) become safe without burdening them with user-facing conflict semantics.
Zero-downtime, backward-compatible rollout: old clients keep working during the transition; no big-bang migration.
The mechanism composes with version history (settings-versioning), per-theme records (theme-deploys), and identity attribution (storefront-admin-auth).

Non-goals — see Out of scope.

3. Chosen mechanism: optimistic locking with a `record_version` attribute and DynamoDB conditional writes

Add an integer attribute record_version to the SETTINGS item. Every write is a conditional write that asserts the version it read and increments it by exactly 1. A failed condition (ConditionalCheckFailedException) means a concurrent write happened; the caller either retries the RMW (system writers) or surfaces 409 (user-facing saves).

3.1 Why this and not the alternatives

Alternative	Verdict	Reason
Condition on `last_updated` timestamp	Rejected	`base_repository.update_item` already unconditionally overwrites `last_updated` on every partial write (`base_repository.py:235`), and `webhook_service` writes its own `datetime.now().isoformat()` without timezone (`webhook_service.py:646`) — the attribute is not trustworthy. ISO-string comparison is also vulnerable to clock skew and equal-timestamp collisions. A counter has exact, machine-checkable semantics.
DynamoDB transactions (TransactWriteItems) alone	Rejected as the primary mechanism	Transactions give atomicity across items but do not, by themselves, solve the RMW lost update — you still need a version/condition to assert "unchanged since my read". They remain available on top of `record_version` (and `settings-versioning` may use them for snapshot+write atomicity).
Pessimistic locking (lock item / lease)	Rejected	Requires lock acquisition/expiry machinery, breaks on client crashes, and serializes human editing sessions (a merchant with a tab open would block the indexer). Optimistic locking matches the actual contention profile: conflicts are rare but catastrophic.
Field-level partial updates everywhere (never full put)	Rejected as sufficient	Eliminates cross-field clobbering (scenario 3) but not same-field lost updates (scenarios 1, 2, 4) — `ui_components` and `metadata` are single map attributes that the service merges in memory. We do adopt narrower writes as a complementary hardening (§4.4), but the version check is the correctness mechanism.
Last-writer-wins + version history for recovery	Rejected	"We can roll back" is a silent-failure pattern; CLAUDE.md requires failing fast. History (settings-versioning) is for auditing and intentional rollback, not for masking races.
Streams-based async merge (CRDT-style)	Rejected	Requires adding a table stream plus merge semantics per field; massive complexity for a record edited by humans a few times a day.

DynamoDB-native optimistic locking is the textbook fix (it is what the AWS SDK DynamoDBVersionAttribute implements for Java/.NET); we implement it explicitly because the codebase uses raw boto3 table resources via BaseRepository.

3.2 Version attribute semantics (shared contract with settings-versioning)

Name: record_version (avoids collision with the existing settings_schema_version Pydantic field, models/shopify_entities.py:114-117, which versions the schema shape, not the record instance).
Starts at 1 when a record is first created; every successful DDB write to the record (full or partial, user or system) increments it by exactly 1. No gaps, no reuse, strictly monotonic per record. This makes it directly usable as the version-history sequence number (settings-versioning snapshots the write that produced record_version=N under its own sk).
Legacy records (written before this change) lack the attribute and are defined to be at version 0. The conditional expression for "expected version 0" is attribute_not_exists(record_version). This is what makes the rollout migration-free (§6).
Every write also persists a change_source string attribute (admin_ui | storefront_admin | onboarding | webhook_registration | webhook_cleanup | index_lifecycle | script | theme_deploy) for conflict diagnostics and for settings-versioning's history annotation. Writer identity continues to land in updated_by_user_id (see Cross-plan interfaces for the auth teammate's identity scheme).

4. Detailed design — server

4.1 Model (`models/shopify_entities.py`)

class ShopifySettings(ShopifyEntityBase):
    ...
    record_version: int = Field(
        default=0,
        description="Optimistic-lock counter. 0 = legacy record predating "
        "concurrency control (attribute absent in DDB). Incremented by 1 on "
        "every successful write.",
    )
    change_source: str | None = Field(
        default=None, description="Code path that produced the last write"
    )

record_version keeps the model frozen/Pydantic-validated per CLAUDE.md ("Model Everything"). When loading a legacy item, the attribute is absent → default 0. When persisting, the repository computes the new version; the model value is treated as "the version I read".

A note on model_dump(): ShopifySettings.model_dump() already strips system_account_id when None for the sparse GSI (shopify_entities.py:133-143); record_version is always written (no sparse-GSI concern) and change_source is written as-is.

4.2 Repository (`repositories/shopify_settings_repository.py` + `repositories/base_repository.py`)

New error type. Note the existing SettingsError hierarchy in exceptions/settings_exceptions.py is currently dead code — nothing raises it and no handler is registered (settings_exceptions.py:11 carries a TODO(raynor): Use these in settings routes). The live mechanism is ServiceError (models/base.py:12-29, carries status_code) with its handler at main.py:189-206. SettingsConflictError therefore extends ServiceError so it is handled even if a route forgets the specific handler, and additionally gets a dedicated handler for the structured payload (the generic ServiceError handler only emits {"detail": user_message}):

class SettingsConflictError(ServiceError):
    """The settings record was modified concurrently; the caller's view is stale."""
    def __init__(self, shop_id, expected_version, current_version=None,
                 updated_by=None, change_source=None):
        super().__init__(
            user_message="Settings were modified after you loaded them.",
            internal_message=f"settings conflict shop={shop_id} expected={expected_version} current={current_version}",
            status_code=409,
        )
        self.expected_version = expected_version
        self.current_version = current_version
        self.updated_by = updated_by
        self.change_source = change_source

BaseRepository gains two generic conditional primitives (other repositories can adopt them later; nothing else changes for them now):

def _version_condition(expected_version: int):
    """expected_version == 0 covers both 'no item yet' and 'legacy item
    without the attribute' — attribute_not_exists(record_version) passes
    in both cases."""
    if expected_version == 0:
        return Attr("record_version").not_exists()
    return Attr("record_version").eq(expected_version)

def put_item_versioned(self, item: dict, expected_version: int) -> None:
    """Full put with optimistic-lock condition on `record_version`;
    persists expected_version + 1."""
    item["record_version"] = expected_version + 1
    self.table.put_item(
        Item=item,
        ConditionExpression=_version_condition(expected_version),
        ReturnValuesOnConditionCheckFailure="ALL_OLD",
    )  # ConditionalCheckFailedException bubbles to caller

def update_item_versioned(self, pk, sk, updates: dict, expected_version: int) -> None:
    """Partial SET/REMOVE update; additionally SETs
    record_version = expected_version + 1. Reuses the SET/REMOVE expression
    builder from update_item(). The condition is STRICTER than the put path:
        Attr("pk").exists() & _version_condition(expected_version)
    — UpdateItem upserts by default, so without the attribute_exists conjunct
    an expected_version=0 partial write against a deleted/missing record would
    CREATE a malformed partial SETTINGS item (e.g. webhook metadata with no
    ui_components). With it, a missing item fails the condition; the caller
    inspects the ALL_OLD payload — empty item => RecordNotFoundError,
    non-empty => SettingsConflictError."""

The condition builder (_version_condition) is deliberately a standalone function so it can be reused inside TransactWriteItems items (settings-versioning attaches a snapshot Put to the same transaction; theme-deploys may add a source ConditionCheck).

Important subtlety for expected_version == 0, which means different things on the two primitives:

On put (put_item_versioned): attribute_not_exists(record_version) is evaluated against the existing item with the same key; if no item exists at all the condition also passes. So a single expression covers "create new" and "first write to a legacy record". Two racing creators: exactly one wins, the other gets ConditionalCheckFailedException — which fixes scenario 2 outright.
On update (update_item_versioned): "no item" must fail, not create — hence the extra attribute_exists(pk) conjunct above. expected_version=0 on the update path therefore means strictly "a legacy item exists"; a record deleted between an infra writer's read and write surfaces as RecordNotFoundError (distinguished from a version conflict via the empty ReturnValuesOnConditionCheckFailure payload) instead of resurrecting a partial item.

ShopifySettingsRepository methods become the following. This block plus the service signature in §4.3 is the canonical shared-primitive contract — settings-versioning has ACKed it and is pinning the identical text in their LLD (their save_settings_conditional/expected_record_version/VersionConflictError rename to save_settings/expected_version/SettingsConflictError; their proposed repo-side-retrying update_settings_system is dropped in favor of update_settings below, with retry at the calling service):

# ShopifySettingsRepository — the ONLY write paths for SETTINGS* records.
# Neither method retries internally; conflicts raise SettingsConflictError (409).

def save_settings(
    self,
    settings: ShopifySettings,
    *,
    expected_version: int,
    change_source: str,
    extra_transact_items: list[dict] | None = None,
) -> ShopifySettings: ...
    # Conditional full save; persists expected_version + 1. Plain conditional
    # put_item when extra_transact_items is None; one TransactWriteItems with the
    # identical ConditionExpression when provided. ConditionalCheckFailedException
    # or TransactionCanceledException(ConditionalCheckFailed) -> SettingsConflictError.

def update_settings(
    self,
    shopify_domain: str,
    updates: dict,
    *,
    expected_version: int,
    change_source: str,
) -> int: ...
    # INFRA FIELDS ONLY: raises ValueError if updates touches
    # ui_components / selector_components / configuration (fail-fast guard).
    # Not captured by versioning (FR-6). Never retries internally; bounded
    # retry (<=3, re-reading the version each attempt) lives at the calling service.

Method semantics (normative, carried as docstrings in the implementation):

save_settings — conditional full save; persists expected_version + 1 and returns the persisted model (with the new version). With extra_transact_items (settings-versioning's version-item Put; theme-deploys may add more), issues a single client-level transact_write_items carrying the identical ConditionExpression; otherwise a plain conditional put_item (no transact WCU premium). Raises SettingsConflictError on ConditionalCheckFailedException or the equivalent TransactionCanceledException cancellation reason.
update_settings — conditional partial update, infra fields only: raises ValueError (fail-fast taxonomy guard) if updates touches a content field (ui_components/selector_components/configuration), so content cannot bypass the service funnel and versioning capture. Returns the new record_version. Deliberately not captured by settings-versioning (their FR-6). Bounded retry (≤3, re-reading the version each attempt) lives at the calling service (§4.4 W5–W7). Raises SettingsConflictError on conflict, RecordNotFoundError if the item vanished.

Naming decisions (agreed with settings-versioning): the method keeps the name save_settings — the old unconditioned signature is deleted, so there is exactly one save method and a _conditional suffix would be redundant; the parameter is expected_version everywhere (the DDB attribute remains record_version). expected_version is normally settings.record_version as read; it is an explicit keyword so call sites can't accidentally save a model whose version they didn't consciously source.

The repository never retries internally. settings-versioning pre-writes its S3 snapshot keyed to the provisional next version before the conditional write; an internal retry under a new counter value would strand that snapshot under a stale number (their plan's hard requirement, settings-versioning.md LLD notes). Retry policy lives one level up, in the service (§4.3).

The repository catches only botocore ClientError with Error.Code == "ConditionalCheckFailedException" and re-raises SettingsConflictError (using ReturnValuesOnConditionCheckFailure="ALL_OLD" to populate current_version/updated_by/change_source in the error without a second read; supported on put_item/update_item since 2023). All other exceptions bubble (CLAUDE.md: never catch generic Exception).

There must be no remaining unconditioned write path for SETTINGS records. The old signatures are removed, not aliased (CLAUDE.md rename rule); the compiler/tests force every call site in §4.4 to choose an explicit expected_version and change_source. delete_settings stays unconditioned (it is only used in tests/cleanup; deleting under a lost race is idempotent and not data loss of the kind we're fixing — revisit if a user-facing delete appears).

4.3 Service (`services/settings_service.py`)

create_or_update_settings becomes the single content-write funnel and gains explicit concurrency semantics. Canonical signature (pinned verbatim with settings-versioning; the event_type/restored_from kwargs are reserved for and consumed by their capture step inside this function — this plan only threads them):

# SettingsService — single content-write funnel (update_ui_components is deleted).

def create_or_update_settings(
    self,
    shop_id: str,
    updates: dict,
    updated_by_user_id: str,
    *,
    expected_version: int | None,       # REQUIRED keyword, no default: int = user-facing one-shot; None = service-managed mode
    change_source: str,
    event_type: str = "save",           # versioning capture: save | restore | deploy | backfill
    restored_from: int | None = None,   # versioning capture: source version for restores
    source_scope: str | None = None,    # versioning capture: deploy events ("theme:{theme_id}")
    source_version: int | None = None,  # versioning capture: deploy events
) -> ShopifySettings: ...

Two deliberate signature choices: expected_version has no default — every caller must consciously pass an int (strict) or None (service-managed), so a new call site can't silently get either behavior; and the positional identity param keeps the existing name updated_by_user_id (it stores the auth context's actor_id, §11 — keeping the name avoids churn across every existing call site for zero semantic gain). source_scope/source_version are versioning's deploy-event metadata (theme-deploys passes them; pure pass-through here).

User-facing mode (expected_version is an int, supplied from the API request): one attempt. Read current settings; if current.record_version != expected_version, raise SettingsConflictError immediately (cheap pre-check, saves a doomed write and a doomed snapshot); otherwise merge and save_settings. The conditional write still backstops the pre-check (TOCTOU between read and write).
Service-managed mode (expected_version=None, used by W5's content init and by Phase-1 versionless API saves): bounded retry loop at this level, not in the repository — each attempt re-runs the entire pipeline: re-read, re-merge, re-run settings-versioning's capture (fresh provisional version, fresh S3 snapshot), conditional save_settings with the version just read. On SettingsConflictError, re-attempt, max 3 attempts, then raise. Each failed attempt orphans one pre-written snapshot — the same harmless-orphan class settings-versioning already accepts for its 409 path. This placement honors their hard no-internal-retry requirement while keeping system writers safe; it is correct because these merges are re-computable from current state (defaults-if-absent), unlike a human's edit buffer. Retries and conflicts are logged at WARNING with both writers' change_source. Note expected_version=None therefore does not mean unconditional last-writer-wins — every attempt is still a conditional write.
Infra writers (W6/W7 and W5's index-association write) do not pass through this funnel at all: they call update_settings (partial, FR-6 non-capturing) with their own read-version-then-write retry (≤3), which involves no snapshots.

update_ui_components is deleted (round-1 blocking fix). It is a 6-line wrapper (settings_service.py:470-499) that only builds {ui_components, selector_components} and forwards to create_or_update_settings — with the new keyword-only parameters it would be a dead alias that hides the concurrency arguments from the route (CLAUDE.md: no backward-compatible aliases on rename). Deletion checklist:

Call sites: settings_routes.py:88, settings_routes.py:195, storefront_routes.py:174 — each builds the updates dict and calls create_or_update_settings directly with explicit expected_version/change_source.
Docs: remove the update_ui_components() bullet from components/shopify/admin_server/CLAUDE.md:141 (Settings Service section) in the same PR; sweep docs/ for any other mention.
Tests: settings_routes_test.py contains 8 update_ui_components references (mocks/assertions) — all rewritten to mock/assert create_or_update_settings with the new kwargs (test plan item 11a).
settings-versioning's write-path taxonomy drops it from their funnel-caller list (acked by them).

_merge_settings (settings_service.py:407-434) is unchanged in behavior but must never accept record_version/change_source/pk/sk via updates — add them to an explicit reject-list so an API payload cannot smuggle a version bump (fail fast with ValueError, which the routes map to 400).

4.4 Per-writer changes

Writer	Change
W1 `settings_routes.py` POST `/api/settings`	Route builds `{ui_components, selector_components}` and calls `create_or_update_settings` directly (the `update_ui_components` wrapper at `settings_service.py:470-499` is deleted, §4.3), passing the request's `version` as `expected_version` and `change_source="admin_ui"`. Replace the blanket `except Exception → 500` (`settings_routes.py:123-127`) with explicit handling: let `SettingsConflictError` propagate to the 409 handler; keep 500 for genuinely unexpected errors.
W2 `settings_routes.py` POST `/api/settings/initialize`	Same direct call as W1. Already guards on `existing_settings` (`:172`), but the guard races; rely on the create-condition (`expected_version=0`). On `SettingsConflictError`, return the existing `already_initialized` response — initialization losing a race is "already initialized".
W3 `storefront_routes.py` POST save	Same direct call as W1 with `change_source="storefront_admin"`. 409 propagates with the structured payload (§5).
W4 `scripts/ecom/storefront_settings.py`	`push` first GETs current settings, takes `version` from the response, and sends it. Add `--force` (re-fetch version and retry once, intentionally overwriting) and `--base-version <file>` (use the version captured in a backup file — fails with 409 if anything changed since backup, which is exactly the safety the CSS workflow needs). Backup files start embedding `version`.
W5 `api_key_routes.py:158-195`	The default-settings init (a content write) moves to service-managed mode: `create_or_update_settings(..., expected_version=None, change_source="onboarding")` — versioning captures it, retries re-run capture (§4.3). The trailing index-association write (`:195`) is an infra write: read-version-then-`update_settings(..., expected_version=v)` with ≤3 retries, no capture (FR-6). The existing broad `except Exception → log warning` wrappers stay (onboarding intentionally tolerates these as non-fatal) but now log conflicts distinctly.
W6 `webhook_service.py:613-709`	Infra writes, not through the content funnel. `_store_webhook_metadata` / `_clear_webhook_metadata`: read settings incl. version → merge `metadata` → `update_settings(..., expected_version=v, change_source="webhook_registration"/"webhook_cleanup")`, retry ≤3 on conflict (no snapshots involved, FR-6). The brand-new-record branch (`:651-664`) uses `save_settings` with `expected_version=0`; on conflict, re-read and take the update path. Remove the `except Exception: log` swallow (`:668-669`) for conflict errors — after 3 failed retries this should fail the registration call loudly.
W7 `index_routes.py:70-85`, `index_service.py:569-579`	Infra writes: read-version-then-`update_settings` with ≤3 retries, `change_source="index_lifecycle"`, no capture. These only touch index-association fields, so the retry merge is trivial (re-read version, re-issue same SET/REMOVE).

4.5 DDB ↔ metafield consistency (`save_settings_to_metafields`)

The metafield mirror (settings_service.py:158-199) is a non-transactional follower of DDB and can be reordered by concurrent savers (scenario 5). Versions make it strictly better — not fully convergent: the residual divergence window is bounded to the case where the last writers crash between their DDB write and their reconciliation pass (true convergence requires the stream-based exporter deferred in §9):

Embed record_version in the StorefrontSettingsData JSON written to the marqo.search_settings metafield. (Storefront widget ignores unknown fields; verify in storefront_search during implementation.)
Post-write reconciliation: after a route finishes its metafield write for version N, it re-reads the settings record (single get_item). If record_version != N, a concurrent writer committed a newer version; the route re-runs save_settings_to_metafields once with the fresh record (and only logs if that also detects a newer version — the later writer's own reconciliation will converge it). This bounds staleness to the case where the last two writers both crash between DDB and metafield writes, which the existing 207-partial path already accepts and reports (storefront_routes.py:210-236).
Metafield write failures keep today's explicit-warning behavior (207 with warnings) — no change to that contract, no new silent fallback.

Ownership note: the stale "Settings will propagate via DDB stream → KV export" docstrings/messages in storefront_routes.py:160,227,233 are corrected in this plan's Phase-1 PR (which already touches those routes); settings-versioning has been told to leave them out of theirs.

A stream-driven exporter would be the fully robust design, but the ShopifyEntities table currently has no DynamoDB stream (shopify_admin_stack.py:845-874 configures none), and ecom_settings_exporter consumes the IndexSettings table, not this one (ecom_settings_exporter/lambda_function.py:106,156 reads INDEX_SETTINGS_TABLE_NAME). The "Settings will propagate via DDB stream → KV export" comments in storefront_routes.py:160,227,233 describe machinery that does not exist for this table; the plan corrects scope accordingly and leaves stream-based mirroring as future work (§9). (Caveat for the reviewer: verified against the CDK source in this repo, not against live AWS — if a stream was added out-of-band, item 2 still stands and a stream consumer can later replace it.)

5. API contract changes

5.1 Read endpoints — expose the version

GET /api/settings (settings_routes.py:46-57) and GET /api/v1/storefront/shops/{domain}/settings (storefront_routes.py:133-143) responses gain:

{
  "uiComponents": { ... },
  "selectorComponents": { ... },
  "version": 17,              // record_version; 0 for legacy records
  "lastUpdated": "2026-06-10T03:21:44+00:00",
  "updatedBy": "user:9f3a...",        // actor_id per storefront-admin-auth (legacy: api_key:{account})
  "updatedByDisplay": "raynor@marqo.ai"   // actor_display; null for legacy writes
}

Per-model enumeration (models/api_requests.py) — exactly which models change and which do not:

Model	Used by	Change
`SettingsRequest` (`api_requests.py:12`)	POST body of both save endpoints; also currently the response model of admin GET `/api/settings` (`settings_routes.py:54`)	Gains optional `version: int \| None = None` (the write-side expected version, §5.2).
`SettingsResponse` (`api_requests.py:22`)	Storefront GET (`storefront_routes.py:140`)	Gains `version: int`, `lastUpdated`, `updatedBy`, `updatedByDisplay`.
Admin GET `/api/settings` response	admin_ui `useSettings.ts:11-15`	Switches from `SettingsRequest` to `SettingsResponse` so the embedded app actually receives `version` (round-1 fix: extending only the request model would have left the admin GET versionless). Field names are identical camelCase, purely additive — no admin_ui break.
`SaveSettingsResponse` (`api_requests.py:39`)	Save success/207 responses	Gains `version: int` of the committed write (§5.2).
`PublicSettingsResponse` (`api_requests.py:75`, served by `auth_routes.py:30` GET `/api/public-settings`)	Storefront display path	Unchanged. It is read-only for rendering; no save flow originates from it, so exposing a lock version there would invite misuse as a save basis.

All additions are camelCase via the existing to_camel alias generator. get_settings_with_defaults (settings_service.py:107-156) already returns last_updated/updated_by_user_id; add record_version (0 when is_new_shop) and updated_by_display.

Additive response fields are backward compatible for all current consumers (admin_ui useSettings.ts, storefront_admin api-client.ts:55-67, the backup script).

5.2 Write endpoints — accept the expected version

POST /api/settings and POST /api/v1/storefront/shops/{domain}/settings request body gains an optional version field on SettingsRequest:

{ "uiComponents": {...}, "selectorComponents": {...}, "version": 17 }

version present → strict optimistic locking (§4.3 user-facing mode).
version absent → Phase-1 compat behavior: system-mode write (read-current + conditional save with retry). This is not silent last-writer-wins relative to today — it still closes the cross-field clobber (the merge re-reads current state immediately before a conditional write) — but it cannot protect the caller's stale edit buffer. Each versionless save increments a CloudWatch metric (settings_save_without_version, dimension change_source) and logs at WARNING.
Phase-3 enforcement (§7): versionless saves to these two endpoints return 428 Precondition Required with an explanatory body. 428 (not 400) so clients/scripts get an unambiguous, retryable signal distinct from validation errors. Mechanism (round-1 fix — this is not a model-config flip): SettingsRequest.version stays int | None in the Pydantic model permanently; enforcement is an explicit route-level branch in both save handlers, gated by a new config flag — if config.settings_version_required and settings.version is None: raise HTTPException(status_code=428, detail=...). The flag lives in admin_server/config.py (env var SETTINGS_VERSION_REQUIRED, default false), so Phase 3 is a config change plus flipping the dedicated test (test plan item 17), not a code rewrite; per CLAUDE.md fail-fast, the branch raises before any service call.

Success response (200 and the existing 207-partial) gains the committed version, so clients can chain saves without a re-GET: {"status": "success", "message": ..., "version": 18} — added to SaveSettingsResponse (models/api_requests.py:39) and to the plain-dict response of settings_routes.py:121, which becomes a Pydantic model in the process (CLAUDE.md: model everything).

Body field vs If-Match header: body field chosen because both UIs already round-trip the full settings JSON through converters (storefront_admin/app/lib/settings-converter.ts, admin_ui MarqoSettings), the backup script persists JSON files (the version travels with the backup naturally), and FastAPI/Pydantic validation of a body field is first-class. An If-Match: "17" header alias MAY be accepted for REST hygiene but is not required by any client; decision: body-only to keep one code path.

5.3 Conflict response

SettingsConflictError → HTTP 409:

{
  "detail": {
    "code": "settings_conflict",
    "message": "Settings were modified after you loaded them.",
    "expectedVersion": 17,
    "currentVersion": 19,
    "lastUpdated": "2026-06-10T03:25:02+00:00",
    "updatedBy": "api_key:acct_123",          // actor_id per storefront-admin-auth
    "updatedByDisplay": "API key",            // actor_display; null for legacy/system writes
    "changeSource": "webhook_registration"
  }
}

Served by a dedicated @app.exception_handler(SettingsConflictError) in main.py (alongside the existing ServiceError handler at main.py:189). The payload gives clients everything needed for a "reload & reapply" UX without a second request.

6. Migration / backfill

None required. The design is deliberately migration-free:

Legacy records (no record_version attribute) ≡ version 0; the first conditional write uses attribute_not_exists(record_version) and stamps version 1. Reads of legacy records return version: 0, which clients echo back and which the server translates to the attribute_not_exists condition. Two concurrent "first writes" to a legacy record: exactly one succeeds — strictly better than today.
No GSI changes: record_version is not a key attribute anywhere; GSI_SystemAccountId (shopify_admin_stack.py:859-873) is untouched.
No table/CDK changes at all in the core mechanism (DynamoDB is schemaless for non-key attributes).
An optional one-shot script (scripts/ecom/backfill_settings_record_version.py, scan for sk = SETTINGS lacking the attribute, conditional SET record_version = 1 with attribute_not_exists guard) may be run later purely to make dashboards uniform — explicitly not a rollout dependency, and it must use the same conditional guard so it can never race a real writer. (Per CLAUDE.md production-write gates, running it in prod requires human sign-off; it is listed for completeness, not scheduled.)

6.1 Capacity ceilings: a shared two-tier model (metafield 128KB, then DDB 400KB)

(a) Current-size evidence — and which ceiling binds. The CSS-workflow backup corpus (78 real tmp/*settings*.json files captured from production via the storefront GET endpoint) shows the largest live merchants clustering tightly: Muji CA 120.6KB, LG 119.0KB, MSQC 118.3KB, Muji US 118.2KB (API JSON; the stored DDB item is the same payload in snake_case plus small infra fields, so the same order of magnitude). Growth is tuning-driven, not organic: LG grew ~4KB over a month of active CSS tuning and is static otherwise; the dominant contributor is ui_components HTML/CSS template text. The binding constraint is not DDB. Per theme-deploys §7 (the canonical description of the Shopify numbers — API-version pins, grandfathering, and the tripwire live there, not here): json-type metafield values drop to a 128KB design ceiling on Shopify API 2026-04+, and Muji CA is already at ~94% of it. The ordering of walls is metafield 128KB → DDB 400KB → TransactWriteItems 4MB. Against the DDB cap there is ~3.3x headroom; against the ceiling that fails first, there is almost none.

(a′) The shared two-tier capacity interface (one model across this plan, theme-deploys §7, and settings-versioning — agreed wording, cross-referenced rather than restated):

Tier	Ceiling	Thresholds	Owner / where enforced
1 — Metafield mirror (binding)	128KB design ceiling on the `search_settings` JSON metafield value	warn-metric at 100KB per (shop, theme); new write paths (staged save, deploy, restore) fail fast at the ceiling with 422 `settings_too_large` before the DDB write; the legacy live path keeps today's 207-partial behavior byte-identical	theme-deploys §7 (guard lives in their new service methods, not in the shared `save_settings_to_metafields`); settings-versioning's restore adopts the same guard
2 — DDB durability	400KB item cap	warn-metric at 300KB; over-cap maps to typed 413 `SettingsPayloadTooLargeError`	this plan, Phase 1 (repository write path)

Phase 1's size guard protects DDB durability (tier 2) — it is not the early warning for the realistic failure mode, because a payload crossing 128KB starts degrading the metafield mirror (207-partials) some 170KB before the 300KB alarm would fire. The tier-1 early warning is theme-deploys' 100KB metric; this plan's tier-2 metric backstops the durability wall behind it.

(b) Failure mode today when a ceiling is hit. Tier 1 first: an oversized metafield write fails Shopify-side and surfaces through the explicit 207-partial path (storefront_routes.py:210-236) — visible but easy to ignore, which is why tier 1 gets the 100KB metric and new-path 422. Tier 2: put_item raises botocore ClientError with Error.Code == "ValidationException" ("Item size has exceeded the maximum allowed size"). On the UI save paths this lands in the blanket except Exception handlers (settings_routes.py:123-127, storefront_routes.py:251-256) → an opaque 500 "Failed to save settings" (stack-traced in logs, but indistinguishable from any other failure to the user, and the merchant's edit is simply lost); on the webhook path it is swallowed entirely (webhook_service.py:668-669 logs and returns). Phase 1 fixes the tier-2 surfacing cheaply: typed 413 + the 300KB metric.

(c) Does the version counter survive a sharded layout? Yes — via a design constraint we commit to now: the version condition always lives on exactly one item. If a record must ever split, the lock does not shard with it. Two viable trajectories, both preserving this plan's API contract (version field, 409 semantics) unchanged:

Manifest + shards in DDB: root item (sk SETTINGS) keeps record_version, a shard list, and content hashes; component payloads move to sibling items (e.g. sk SETTINGS#shard-{n} or per-component). Every save is one TransactWriteItems writing all changed shards with the ConditionExpression on the root only; shards are never written independently, so they need no counters. (TransactWriteItems allows 100 items / 4MB — far above need.) Reads fetch root then shards; the root's version is what clients echo back.
Body in S3, pointer in DDB (preferred if it comes to it): content fields move to an S3 object keyed by shop+version; the DDB item keeps record_version + the S3 key + hash. This is exactly the architecture settings-versioning already builds for snapshots — proven shape, no shard-assembly logic, and the lock mechanics are byte-for-byte this plan's existing single-item conditional write.

DDB sharding itself is deferred, with the honest rationale: DDB sharding would not relieve the binding ceiling at all — a payload big enough to need DDB shards has already broken the single-metafield distribution model ~270KB earlier, and the realistic response to tier-1 growth is splitting the mirror (per-component metafields, or serving settings from CDN/KV), per theme-deploys §7. The deferral of tier-2 sharding therefore rests on tier 2's genuine ~3.3x headroom plus the 300KB backstop metric; tier-1 pressure is alarmed at 100KB and answered with mirror-side changes, not record sharding. Both escape trajectories above preserve the locking contract without API changes. Cross-plan agreement on the record-shape trajectory: settings-versioning's VER metadata items are ~2KB (unaffected; their S3 snapshots are precedent for trajectory 2); theme-deploys multiplies record count (one ~120KB staged item per theme), not item size — each staged record is independently under the cap and independently locked, and their deploy transaction (~120KB Put + ConditionCheck) sits far below the 4MB transact limit.

Rollback safety: verified against the actual base model — ShopifySettings inherits admin_server's RecordModel (models/base.py:31-40), whose model_config = ConfigDict(frozen=True) leaves Pydantic's default extra="ignore" in place. So if the server change is rolled back while items carry record_version: loads succeed (unknown key dropped), the next full put_item from old code rewrites the record without the attribute, returning it to legacy/version-0 state. That degrades cleanly to today's (unprotected) behavior, and re-rolling forward resumes via the attribute_not_exists path. No deploy-ordering constraint.

7. Rollout phases (backward compatible)

Phase	Contents	Exit criteria
1. Server-side safety net	Model field; conditional repo primitives; all 7 writers converted (system writers in retry mode; user endpoints accept optional `version`); GET returns `version`; 409 handler; metafield reconciliation; metrics. Deployed alone, this already eliminates scenarios 2, 3, 4 and the cross-field half of 1 — without any client change.	All admin_server tests green; `settings_save_without_version` metric flowing; no 5xx regression on save endpoints in staging + e2e.
2. Clients send versions	admin_ui and storefront_admin send `version` and implement 409 UX (§8); backup/push script updated. Scope dependency once SSO lands: per `storefront-admin-sso.md` §4.5, every settings POST is a live-record mutation until theme-deploys ships staged records, so it requires the `settings:deploy_live` scope — the updated push script and any CLI token used by agent workflows must be minted with that scope ticked, and the script's 401/403 error text should say so.	Metric shows versionless saves ≈ 0 (excluding `change_source=script` stragglers); 409 UX verified in staging (Playwright per `feedback_test_via_playwright_before_push`).
3. Enforcement	The two UI-save endpoints require `version` (428 on absence). System writers keep internal retry mode (they don't take client-supplied versions by design).	A release note; metric at 0 for ≥2 weeks first.

Each phase is independently deployable and revertible. Phase 1 → 2 ordering matters (servers must accept version before clients send it); Phase 3 is a deliberate, announced breaking change for any unknown external caller of the storefront save endpoint (the API is currently used by our own worker UI and scripts; docs/integrations workflows updated in the same PR).

8. Client handling

8.1 storefront_admin editor (`components/storefront_admin`, the standalone settings editor served by a Cloudflare Worker — distinct from `components/admin_worker`, the internal admin dashboard)

State home: app/hooks/use-settings.ts. It already keeps backendSnapshot (:271, the last-loaded backend JSON, merged on save at :338) and an editVersionRef for local dirty-tracking — the remote version slots in beside these:

Keep serverVersion in state, set from GET (snapshot set at :298) and from each successful save response (save response gains version of the committed write).
save() (:325-357) sends version: serverVersion; on 409, surface a typed result so routes/editor.tsx:33-41 can show a conflict dialog instead of the generic "Failed to save settings" toast: "Settings changed since you loaded them (modified by {updatedBy}). [Reload and reapply my changes] [Discard my changes]".
- Reload and reapply (default): re-GET, set new backendSnapshot + serverVersion, re-run frontendToBackend(settings, freshSnapshot) — the existing converter already merges the user's frontend state over a backend snapshot, so "reapply onto fresh server state" is its natural semantics — then save again with the new version. One automatic retry; if it 409s again, show the dialog again (no silent loop).
- Discard: re-GET and backendToFrontend into state, clearing dirty.
ApiClient.request (app/lib/api-client.ts) maps 409 to a typed ConflictError {currentVersion, updatedBy, lastUpdated} rather than a generic error.

8.2 admin_ui (Shopify embedded app)

Note: the UI-customization editor area (pages/ui-customization.tsx, components/settings/UIComponentEditor.tsx) is being deleted concurrently with this plan (track-adminui, chore/rm-uicustom) — it is superseded by the storefront_admin editor, so do not build 409 UX there. The surviving admin_ui settings writers are the useSaveSettings consumers on the settings page: src/components/PathExclusions.tsx:3,8 and src/components/TagNamespacingConfiguration.tsx:14,36.

src/hooks/useSettings.ts:70-82 (useSaveSettings react-query mutation) is the single choke point: include version from the last query data (extend MarqoSettings type with version); on 409, onError triggers queryClient.invalidateQueries({queryKey: queryKeys.settings}) and the two consumer components surface a banner/toast: "Settings were updated elsewhere — your view has been refreshed; please reapply your change." These are small form-style edits (path exclusion list, tag namespace config), so refresh-and-reapply-manually is acceptable UX; no auto-merge. The primary rich-editor 409 UX lives in storefront_admin (§8.1).

8.3 Scripts and agent workflows

scripts/ecom/storefront_settings.py per §4.4 W4. The storefront CSS customization guide (docs/integrations/storefront-css-customization-guide.md) gets a short update: backup files now pin a version; a 409 on push means the live settings moved — re-backup, re-inject, re-diff. This converts the workflow's worst failure mode (stale push silently reverting merchant edits) into a loud, recoverable error.

9. Out of scope

Ecom indexer / document-level _mq_version conflicts (FN pipeline) — different system, different fix (pixel bypass etc., tracked separately).
IndexSettings table concurrency (ecom_utils/index_settings_service/...) — same disease, different table; the BaseRepository conditional primitives are written generically so a follow-up can adopt them, but no IndexSettings call sites change in this plan.
Version history snapshots, rollback UX — settings-versioning plan; this plan only guarantees the counter semantics it builds on.
Per-theme record keying and deploy-copy operations — theme-deploys plan; this plan guarantees per-record locking works for any sk and specifies the copy-write rule (Cross-plan interfaces).
Auth/identity for storefront_admin — storefront-admin-auth; this plan only consumes a resolved writer-identity string.
Stream-driven metafield/KV mirroring for ShopifyEntities — future hardening (§4.5); would require adding a DDB stream in CDK.
Sessions / API-key items in the same table — only sk = SETTINGS records get versioning now.
Record sharding for the 400KB DDB limit — deferred per §6.1: the binding ceiling is the 128KB metafield mirror (tier 1, owned by theme-deploys §7, answered with mirror-side splitting rather than DDB shards); tier 2 has genuine ~3.3x headroom plus a 300KB backstop metric, and the committed single-root-lock constraint means either future trajectory preserves this plan's contract.

10. Test plan

Per CLAUDE.md, tests are a completion precondition for each implementation PR, not a follow-up. All Python tests run via pants test //components/shopify/admin_server:: with moto; TS via npm test (vitest).

Repository (shopify_settings_repository_test.py, moto)

Conditional save at correct version succeeds and increments record_version.
Conditional save at stale version raises SettingsConflictError carrying current_version/updated_by/change_source from ReturnValuesOnConditionCheckFailure.
expected_version=0 creates a brand-new record at version 1.
expected_version=0 against a legacy item (seeded via raw put_item without the attribute) succeeds — the lazy-migration path.
expected_version=0 against an already-versioned item conflicts.
update_item_versioned SET/REMOVE behavior preserved (incl. None → REMOVE) + version bump + conflict case. 6a. No-upsert regression: update_settings(..., expected_version=0) against a missing item raises RecordNotFoundError and creates nothing (table item count unchanged) — guards against UpdateItem's default upsert creating a malformed partial SETTINGS item when a record is deleted between an infra writer's read and write. Same scenario at expected_version=N also raises RecordNotFoundError (empty ALL_OLD payload), not SettingsConflictError.
Two interleaved RMW writers (read same version, write sequentially): second write raises — the regression test that would have failed before this change. 7a. Item-size ValidationException maps to SettingsPayloadTooLargeError (413), not a generic 500; the >300KB warning metric fires (moto can't enforce the real 400KB cap reliably, so the mapping is tested by raising the ClientError from a stubbed table).

Service (settings_service_test.py) 8. User-mode create_or_update_settings with stale expected_version raises without writing (and without invoking the versioning capture hook, once that lands). 9. Service-managed mode retry lives in the service, not the repo: repo save_settings conflicts once (mock first call), service re-runs the full pipeline — assert the repo was called twice with different expected_version values (fresh read each attempt) and, once versioning lands, that capture ran per attempt (fresh snapshot each time, none reused). Then assert both writers' fields are present afterward (lost-update regression for scenario 3: a full save no longer reverts a concurrent active_index write). 10. Service-managed mode exhausts 3 attempts → raises; repo itself never retried (call count == attempts). 11. _merge_settings rejects record_version/pk/sk/change_source in updates. 11a. update_ui_components no longer exists; the 8 mocks/assertions referencing it in settings_routes_test.py are rewritten to mock/assert create_or_update_settings with explicit expected_version/change_source kwargs (listed here so the deletion isn't "fixed" by re-adding an alias). 11b. update_settings content-field guard: passing ui_components/selector_components/configuration in updates raises ValueError before any DDB call.

Webhook service (webhook_service_test.py additions) 12. _store_webhook_metadata merging concurrently with a settings save preserves both webhook IDs and UI components (scenario 4 regression). 13. Create-branch race: settings record appears between the None read and the write → falls back to update path, no clobber of ui_components (scenario 2 regression).

Routes (settings_routes_test.py, storefront_routes_test.py) 14. GET returns version (0 for legacy/new shop, N otherwise) + updatedBy. 15. POST with matching version → 200 and response carries the new version. 16. POST with stale version → 409 with the full §5.3 payload shape (Pydantic-modeled, not a raw dict). 17. POST without version: with SETTINGS_VERSION_REQUIRED=false → 200 via service-managed mode + metric/log hook invoked; with the flag true → 428 before any service call. Both branches tested from Phase 1 (the Phase-3 change is then only the default flip). 18. POST /settings/initialize race returns already_initialized (not 500). 19. Existing tests for metafield 207-partials unchanged (no contract regression).

Metafield reconciliation 20. After save of version N, mock a concurrent bump to N+1 before reconciliation read → save_settings_to_metafields called again with the N+1 record.

Clients (vitest) 21. storefront_admin: ApiClient maps 409 → ConflictError; use-settings save sends version; reload-and-reapply path re-GETs, rebuilds payload via frontendToBackend, retries once. 22. admin_ui: useSaveSettings includes version; 409 invalidates the settings query.

E2E (CI only, shopify/e2e_tests) 23. One concurrent-save scenario: two API saves with the same base version → exactly one 200, one 409 (storefront endpoint; pure API-level, no UI driving needed).

11. Cross-plan interfaces

Interface proposals were sent to all three teammates (2026-06-10); replies incorporated below where received. Where a teammate's plan doc lands later, the contracts below are the ones this plan commits to.

→ settings-versioning (their plan: docs/plans/settings-versioning.md; all interfaces ACKED by them — agreed, not pending)

Shared counter: the §3.2 semantics are the agreed contract — monotonic +1 per write (content and infra writes), no gaps in the counter itself, legacy=0/absent. Their history captures only content writes, so gaps in the history sequence are expected and documented on their side. change_source gives them the content/infra distinction without a second counter.
History-item location (their revision after Raynor's scale question): version-history items live in a dedicated partition pk=SHOPVER#{domain}, sk live#{record_version:010d} — not under SHOP#{domain} — precisely to avoid inflating the unfiltered list_shop_sessions partition sweep this plan flagged. TransactWriteItems spans partitions in one table, so the extra_transact_items hook is unaffected. (This supersedes the earlier SETTINGS#VER# caveats; the remaining one: their items must not carry a system_account_id attribute, which would bloat the sparse GSI_SystemAccountId.)
Attribute name (agreed): DDB/model attribute record_version (collision-safe vs settings_schema_version), exposed in all API payloads as version — exactly one external name, one storage name.
Canonical shared signatures (round-1 blocking fix — ACKED): the verbatim contract is the code blocks in §4.2 (repository) and §4.3 (service). Their renames: save_settings_conditional → save_settings, expected_record_version → expected_version, VersionConflictError → SettingsConflictError; their proposed update_settings_system (with repo-internal retry) is dropped — the infra primitive is update_settings with the content-field guard and no internal retry (retry at the calling service), preserving the B3 invariant. Their deploy-event kwargs source_scope/source_version are adopted into the canonical service signature. They additionally treat this plan's M1 (admin GET exposing version via SettingsResponse) as a hard prerequisite for their restore UX, and their restore re-reads the current record_version server-side rather than trusting a client echo.
Transact-capable, never-retrying primitive (their hard requirement): extra_transact_items carries their version-item Put in one TransactWriteItems with the identical ConditionExpression on the live item; when None, a plain conditional put_item (no transact 2x WCU premium). On TransactionCanceledException with a ConditionalCheckFailed cancellation reason, the same SettingsConflictError is raised (ReturnValuesOnConditionCheckFailure is supported per-item in transactions, so the 409 payload stays uniform). The repository never auto-retries — their S3 snapshot is pre-written under the provisional next version, so a hidden retry would strand it. Retry for service-managed mode lives in create_or_update_settings and re-runs their capture per attempt (fresh snapshot; failed attempts orphan one snapshot each, the harmless class their plan already accepts on 409). Infra writers bypass capture entirely via update_settings (their FR-6).
Their restore endpoint passes expected_version straight through to this plan's user-facing mode; restore-without-version gets service-managed mode (conditional with bounded retry), not last-writer-wins — their plan's "LWW like today when omitted" wording should align to this (flagged to them).
Their write-path taxonomy ("all content writes funnel through create_or_update_settings") must drop update_ui_components from the caller list — it is deleted by this plan (§4.3).
Sequencing: whichever PR lands first introduces record_version + the conditional primitive with these exact semantics; the other adopts it (mirrored in their Release section). The stale stream-docstring fix in storefront_routes.py is owned by this plan's Phase-1 PR (§4.5).
(The earlier same-partition SETTINGS#VER# caveats — list_shop_sessions sweep amplification and the sparse-GSI pollution — are resolved by their move to the SHOPVER# partition above; the list_shop_sessions prefix-query hardening remains a worthwhile independent cleanup.)

→ theme-deploys (their plan: docs/plans/theme-targeted-deploys.md; key shape confirmed by them)

Confirmed key shape: the live record stays pk=SHOP#{domain}, sk=SETTINGS (no change to this plan's writers); staged per-theme records live in the same table at sk=SETTINGS#THEME#{theme_id} using the same ShopifySettings model. Locking applies uniformly per-(pk, sk): every record whose sk starts with SETTINGS carries its own independent record_version. The BaseRepository primitives (§4.2) are key-agnostic by construction; ShopifySettingsRepository's domain methods currently hard-code sk=SETTINGS (DynamoDBKeys.SETTINGS_SK), so theme-deploys adds sk-parameterized variants on top of the same primitives rather than new conditional machinery.
Their deploy-to-live operation reuses this plan's user-facing mode exactly: read staged record, then conditional write to the live record with the caller-supplied expected live version; mismatch raises SettingsConflictError → the §5.3 409 payload (client refreshes). Recommended: route the live write through SettingsService.create_or_update_settings(..., expected_version=expected_live_version, change_source="theme_deploy") rather than the repository directly, so validation, metafield sync, and settings-versioning's history capture (event_type="deploy") all fire for free.
Their endpoint POST /shops/{domain}/settings/deploy body {source_theme_id, expected_live_version}: naming is theirs; semantics of expected_live_version are identical to this plan's version field on saves. Deploy success response should return the new live version, same as §5.2.
Copy-version rule (agreed): never transplant the source record's counter onto the target; the target's counter increments from its own current value (or starts at 1 via attribute_not_exists). If deploy must also assert the staged source didn't change since read, compose a transaction: ConditionCheck(source.record_version = vs) + conditional write on live — same _version_condition builder.
Sequencing: the conditional-write repo methods (this plan, Phase 1) land first; deploy-copy builds on them.

→ storefront-admin-auth (their identity scheme received; reconciled)

Their plan introduces a unified StorefrontAuthContext on storefront routes with actor_id (user:{cognito_sub} | token:{token_id} | api_key:{system_account_id} legacy) and actor_display (email / token name / "API key"). Agreed mapping: this plan stores updated_by_user_id = actor_id (no schema change — same attribute, richer values) plus a new optional updated_by_display attribute written alongside it; both flow into GET responses (updatedBy, updatedByDisplay) and the §5.3 409 payload. Flat fields rather than their proposed nested conflicting_writer object — same information, one payload shape across GET and 409; lastUpdated serves as their written_at.
Canonical principal-type enum (locked across all three plans): {console_user, cli_token, api_key, shopify_user}. This plan defines no local variant: where a principal type is needed (conflict diagnostics, settings-versioning's author struct), it is their context's principal_type for storefront writes, and shopify_user for Shopify-embedded writes (W1/W2, outside their context's coverage). updated_by in payloads stays a mixed-format opaque string (their heads-up: Shopify JWT ids and system writers don't carry actor_id form) — machine logic uses principal_type/change_source, never parses updatedBy. change_source (§3.2) remains orthogonal: it names the code path, not the principal kind. The actor grammar is uniformly prefixed (their round-2 revision — no unprefixed canonical form): user:{cognito_sub}→console_user, token:{token_id}→cli_token, api_key:{system_account_id}→api_key, shopify_user:{shopify_user_id}→shopify_user; principal type is derived by splitting at the first colon, and unknown prefixes fail closed. The mapping plus a classify_legacy(value) reader for historical raw values (no colon + not "system" → shopify_user; "system" → system writer) live once in code in admin_server/models/auth.py (their PR1); any logic in this plan that branches on writer type — e.g. the §8.1 "another session of this account" degradation — imports those rather than string-matching actor_id prefixes locally.
Shopify-embedded routes (W1/W2): since this plan rewrites their write path anyway, new writes normalize the Shopify JWT user id to shopify_user:{auth.user_id} (their recommended option — one line at the call site, removes the unprefixed ambiguity going forward); historical raw values in existing records are handled by classify_legacy on read. The webhook-cleanup "system" value is replaced by change_source-specific attribution.
Graceful degradation (their note): until SSO lands, every storefront write attributes to the same api_key:{system_account_id} per account, so conflict messaging must not imply a different person — when the conflicting actor_id equals the caller's own, the §8.1 dialog says "saved from another session of this account" instead of naming a writer.
Their offered session_id/jti claim (distinguish two tabs of one user): not required for v1 — the version check drives correctness and the dialog copy above covers the UX. If they include it in the auth context anyway, this plan can later persist it next to updated_by_display without contract change; deferred.
settings-versioning's author struct (author_principal_type/id/display_name) derives from the same actor_id/actor_display pair — one identity scheme feeds both plans.
Scope dependency on this plan's Phase 2: per their plan (§4.5 scope table), any mutation of the live settings record requires settings:deploy_live, and until theme-deploys ships staged records every settings POST is a live mutation. So once SSO lands, the version-aware push script and agent CLI tokens must be minted with settings:deploy_live ticked (default CLI tokens are effectively read-only until staged saves exist); the script's auth-error output should state this. Legacy raw API keys are scope-exempt during their transition window, so Phase 2 does not block on SSO.

12. Open questions for Raynor

Phase-3 enforcement (428 for versionless saves) — confirm we actually want to break unknown external callers of the storefront save endpoint, or keep optional-version indefinitely with the metric as a watchdog.
Should delete_settings also gain a version condition? Currently no user-facing delete path exists; plan says no.
Retry count for system writers (3) and whether onboarding (W5) should remain best-effort-with-warning after retries or fail the API-key validation call.

1. Problem statement​

1.1 Inventory of writers (file:line evidence)​

1.2 Concrete race scenarios​

1.3 A note on the FN incident​

2. Goals and non-goals​

3. Chosen mechanism: optimistic locking with a record_version attribute and DynamoDB conditional writes​

3.1 Why this and not the alternatives​

3.2 Version attribute semantics (shared contract with settings-versioning)​

4. Detailed design — server​

4.1 Model (models/shopify_entities.py)​

4.2 Repository (repositories/shopify_settings_repository.py + repositories/base_repository.py)​

4.3 Service (services/settings_service.py)​

4.4 Per-writer changes​

4.5 DDB ↔ metafield consistency (save_settings_to_metafields)​

5. API contract changes​

5.1 Read endpoints — expose the version​

5.2 Write endpoints — accept the expected version​

5.3 Conflict response​

6. Migration / backfill​

6.1 Capacity ceilings: a shared two-tier model (metafield 128KB, then DDB 400KB)​

7. Rollout phases (backward compatible)​

8. Client handling​

8.1 storefront_admin editor (components/storefront_admin, the standalone settings editor served by a Cloudflare Worker — distinct from components/admin_worker, the internal admin dashboard)​

8.2 admin_ui (Shopify embedded app)​

8.3 Scripts and agent workflows​

9. Out of scope​

10. Test plan​

11. Cross-plan interfaces​

12. Open questions for Raynor​