Decommission a fork source index

Background

After a fork has cut over and the target index has held production traffic through its soak window, the source index is no longer serving customer reads out of its own data plane. Decommissioning has two parts:

Finalize the source's read routing to the target — confirm the target is serving, then run cleanup (POST /forks/{id}/cleanup). Cleanup removes the source's self-referencing read alias and makes the target read alias visible. It deliberately retains the source→target write alias, so writes addressed to the source keep fanning out to the target.
Tear down the source's data-plane resources — the Marqo (Vespa) index on the source cell, the source's SQS queue, S3 prefix, etc.

prepare_transfer:  source writes → [source, target]   (dual-write via write alias)
cutover (100%):    source reads  → target             (read alias hides source)
cleanup:           source reads  → target             (source self-read removed)
                   source writes → [source, target]   (write fan-out RETAINED)
teardown:          source data-plane resources deleted  (separate manual step — see note below)

The source stays in the write fan-out after cleanup. Writes fan out to the source's explicit alias targets in addition to the source's own implicit self-write, and the alias model cannot express removing that self-write. So even after cleanup the source remains a live relay in the write path; reads, however, resolve exclusively to the target. Removing the source from writes automatically — so a chained parent routes straight past it — is gated on chain collapse (see Future work).
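The routing rules above can be illustrated with a toy model. This is a minimal sketch, not the real alias implementation: the `IndexAliases` class, its field names, and both resolver functions are hypothetical and exist only to show why the source's self-write survives cleanup while its self-read does not.

```python
from dataclasses import dataclass, field


@dataclass
class IndexAliases:
    """Hypothetical, simplified stand-in for the alias config on a settings record."""
    read_targets: list[str] = field(default_factory=list)   # explicit read alias targets
    write_targets: list[str] = field(default_factory=list)  # explicit write alias targets


def resolve_writes(index: str, aliases: IndexAliases) -> list[str]:
    # The implicit self-write is always present; aliases can only add targets.
    return [index] + [t for t in aliases.write_targets if t != index]


def resolve_reads(index: str, aliases: IndexAliases) -> list[str]:
    # With no read alias the index serves itself; otherwise reads go to the alias targets.
    return list(aliases.read_targets) or [index]


after_cleanup = IndexAliases(read_targets=["target"], write_targets=["target"])
print(resolve_reads("source", after_cleanup))   # ['target']
print(resolve_writes("source", after_cleanup))  # ['source', 'target']
```

In this model there is no value of `write_targets` that drops `"source"` from the write result, which mirrors the limitation described above.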

Critically, the source's EcomIndexSettings record stays in place. It remains the canonical entry point for the Ecom API: customers continue to address the source's index name in their URLs, and the read + write aliases on the source's settings record route those operations through to the target. Searches to the source should return what the target returns (modulo whatever change the fork was made to accommodate), not a 404.

Per-step code: fork_routes.py, fork_service.py cleanup
Design: Fork design doc

Decommissioning the data plane is one-way. Once the source's Vespa data is gone, the only recovery path is restoring from an EBS snapshot of the source's data volume, so confirm one exists before tearing down (see preflight check 6).

When to decommission

A typical fork lifecycle:

POST /forks → workflow runs → READY
POST /forks/{id}/cutover (optionally ramping targetTrafficPercent) → COMPLETE
Soak at 100% cutover with dual-write still active (default: 24–48 hours) — gives time to catch latent issues while rollback is still cheap.
POST /forks/{id}/cleanup → source read routing finalized to the target (the write fan-out to the target is retained).
Tear down the source's data-plane resources.

cleanup is only valid while the fork is in COMPLETE (any other status returns 409). It does not change the fork status — COMPLETE is terminal.

Preflight checks

The cleanup endpoint does no safety checks of its own, and there is no automation around the teardown step. Run through this list against the target environment before proceeding. Most checks are done from the fork detail page in the Admin UI: https://admin.marqo-internal.org/forks/{fork_id} (preprod: admin.preprod-marqo.org, staging: staging-admin.dev-marqo.org).

1. Cutover is at 100% and stable

On the fork detail page, confirm:

Status badge shows complete.
Transfer Config → progress shows cutover_target_percent: 100.
The Updated timestamp is at least the agreed soak window in the past (default 24–48 hours).

If reads are still split (for example, cutover_target_percent: 80), finish the ramp via the Cutover action before decommissioning.
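If you want to script this check rather than eyeball it, the three conditions reduce to a small predicate. The sketch below assumes the fork detail has already been loaded into a dict; the `status` and `updated_at` keys are hypothetical names, and only `cutover_target_percent` is taken from the page itself.

```python
from datetime import datetime, timedelta, timezone


def cutover_problems(fork: dict, soak: timedelta, now: datetime) -> list[str]:
    """Return the reasons the fork fails check 1 (an empty list means it passes)."""
    problems = []
    if fork["status"] != "complete":
        problems.append(f"status is {fork['status']!r}, expected 'complete'")
    if fork["cutover_target_percent"] != 100:
        problems.append(f"cutover at {fork['cutover_target_percent']}%, expected 100")
    if now - fork["updated_at"] < soak:
        problems.append("last update is inside the soak window")
    return problems


# Example: a fork still at 80% that was updated 12 hours ago fails two conditions.
now = datetime(2024, 1, 3, 12, tzinfo=timezone.utc)
fork = {
    "status": "complete",
    "cutover_target_percent": 80,
    "updated_at": now - timedelta(hours=12),
}
for problem in cutover_problems(fork, soak=timedelta(hours=24), now=now):
    print(problem)
```

Returning the list of reasons, rather than a bare boolean, makes it easy to paste the result into the fork ticket.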

2. No unexpected config divergences

The Config Divergences panel on the fork detail page lists the categories where source and target configs differ (Infra, IndexConfig, Merchandising, etc.).

You'll usually be forking in order to change something — a new embedding model, a different cell, updated merchandising rules — so divergences in the categories you intentionally changed are expected and fine. What you're checking for is divergence in categories you didn't intend to change, which would mean someone (or some pipeline) has been editing the source or target during the soak window. Click Refresh divergences to get a fresh comparison, then sanity-check the listed categories against the original fork intent.

Unexpected divergences are not a hard blocker — after cleanup, reads serve exclusively from the target — but they're worth understanding before you lock in the change. If something on the source has been drifting, decide whether you want that change reflected on the target before continuing.
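The comparison against fork intent is a set difference. As a minimal sketch, assuming you have copied the category names from the Config Divergences panel by hand (the function name is hypothetical):

```python
def unexpected_divergences(observed: set[str], intended: set[str]) -> list[str]:
    """Categories that diverge but were not part of the original fork intent."""
    return sorted(observed - intended)


# Example: the fork was made to change Infra, but IndexConfig has also drifted.
print(unexpected_divergences({"Infra", "IndexConfig"}, {"Infra"}))  # ['IndexConfig']
```

An empty result means every listed divergence is one you planned for.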

3. Doc counts within tolerance

Click Re-verify on the fork detail page. This reruns fork.verify against the control plane stats API and shows a report including a hit_count check with source_value / target_value.

Don't expect exact equality. Some drift between source and target is normal:

Writes and deletes can be processed in different orders on the two queues, so transient counts diverge even when the steady state is consistent.
Individual writes can fail on one index but not the other (e.g. a transient Vespa or model error on one cell), and there is no automatic re-reconciliation today.
The current verify is a point-in-time hit count, not a per-document diff — equal counts don't prove equal contents either.

A more thorough reconciliation action is planned but not yet implemented. Until then, what you're looking for here is gross divergence (large percentage gap, or a gap that's growing rather than oscillating around a small delta), which indicates a downstream pipeline isn't actually dual-writing. A small steady-state delta is acceptable.
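One way to make "gross divergence" concrete is to record source_value / target_value from a few Re-verify runs and test both the size of the gap and its trend. The sketch below is only an illustration: the 1% threshold, the three-sample minimum, and the function names are assumptions, not agreed tolerances.

```python
def relative_gap(source_value: int, target_value: int) -> float:
    """Gap between the two hit counts as a fraction of the larger one."""
    largest = max(source_value, target_value)
    return abs(source_value - target_value) / largest if largest else 0.0


def is_gross_divergence(samples: list[tuple[int, int]], max_gap: float = 0.01) -> bool:
    """samples: (source_value, target_value) pairs from repeated runs, oldest first."""
    if not samples:
        raise ValueError("need at least one verify sample")
    gaps = [relative_gap(s, t) for s, t in samples]
    too_large = gaps[-1] > max_gap
    # Strictly increasing across at least three samples: growing, not oscillating.
    growing = len(gaps) >= 3 and all(a < b for a, b in zip(gaps, gaps[1:]))
    return too_large or growing


print(is_gross_divergence([(1000, 998), (1000, 999), (1000, 998)]))  # False
print(is_gross_divergence([(1000, 999), (1000, 997), (1000, 994)]))  # True
```

The second example stays under 1% but is flagged because the gap widens on every run, which is the pattern that suggests a pipeline is not dual-writing.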

4. Target SQS queue is healthy

Writes hit the ecom API first, which fans out to all aliased queues. By the time you reach this step, the target queue has already been receiving the full write stream for the soak window. Check the target's queue (and DLQ).

From the target index detail page (/indexes/{target_index_id} → Infra tab), follow the SQS console link to inspect:

ApproximateNumberOfMessagesVisible ≈ 0
ApproximateNumberOfMessagesNotVisible ≈ 0
ApproximateAgeOfOldestMessage low / decreasing
DLQ depth is 0 (or whatever the steady-state baseline is) — see ecom-queue-backlog-increasing for triage.
The target ESM (event source mapping) is enabled — prepare_transfer disables it during snapshot/restore and activate_target re-enables it. The Infra tab shows ESM concurrency / status; a disabled ESM here means activation didn't run.

You can focus on the target queue for this check. Writes fan out at the ecom API to both queues, but reads now serve exclusively from the target, so it's the target queue's health that determines whether the cutover is safe to finalize.
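The queue checks above can be collected into one pass/fail helper. This is a sketch under stated assumptions: the metric values are read off the SQS console or CloudWatch and passed in as a dict, and the thresholds for "≈ 0" and "low" (`max_backlog`, `max_age_seconds`) are placeholders to replace with your own baselines.

```python
def queue_problems(metrics: dict, dlq_depth: int, esm_enabled: bool,
                   max_backlog: int = 10, max_age_seconds: int = 60,
                   dlq_baseline: int = 0) -> list[str]:
    """Return the reasons the target queue fails check 4 (empty list means healthy)."""
    problems = []
    for name in ("ApproximateNumberOfMessagesVisible",
                 "ApproximateNumberOfMessagesNotVisible"):
        if metrics[name] > max_backlog:
            problems.append(f"{name} is {metrics[name]}, above {max_backlog}")
    if metrics["ApproximateAgeOfOldestMessage"] > max_age_seconds:
        problems.append("oldest message is older than the allowed age")
    if dlq_depth > dlq_baseline:
        problems.append(f"DLQ depth {dlq_depth} exceeds baseline {dlq_baseline}")
    if not esm_enabled:
        problems.append("target ESM is disabled; activation may not have run")
    return problems


healthy = {
    "ApproximateNumberOfMessagesVisible": 0,
    "ApproximateNumberOfMessagesNotVisible": 2,
    "ApproximateAgeOfOldestMessage": 5,
}
print(queue_problems(healthy, dlq_depth=0, esm_enabled=True))  # []
```

A single snapshot cannot show whether the oldest-message age is decreasing, so for that condition compare two readings taken a few minutes apart.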

5. Search-proxy error rate is normal on the target

From the target index detail page, follow the Grafana links to the per-index dashboards. Pull the search-proxy 4xx/5xx rate for the target over the soak window and compare against the pre-cutover baseline on the source.

A sustained elevation here (vs. baseline) means something on the target is wrong — fix before removing the fallback. See search-proxy diagnostics for what to look for.
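"Sustained" is worth pinning down before you look at the graph, so that a single spike does not block the decommission. A minimal sketch, assuming error rates exported from Grafana as a list of per-interval fractions; the 2x factor and the three-interval run length are illustrative assumptions.

```python
def sustained_elevation(rates: list[float], baseline: float,
                        factor: float = 2.0, min_run: int = 3) -> bool:
    """True if the rate exceeds baseline * factor for min_run consecutive intervals."""
    run = 0
    for rate in rates:
        # Extend the current run of elevated intervals, or reset it.
        run = run + 1 if rate > baseline * factor else 0
        if run >= min_run:
            return True
    return False


print(sustained_elevation([0.01, 0.05, 0.01, 0.05], baseline=0.01))  # False
print(sustained_elevation([0.01, 0.05, 0.06, 0.07], baseline=0.01))  # True
```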

6. An EBS snapshot of the source data volume exists

Tearing down the source's data plane has no software-level rollback. Once the Marqo index on the source cell is deleted, the only recovery path if the target later turns out to be broken in some unrecoverable way is restoring the source's data volume from an EBS snapshot.

From the source index detail page (/indexes/{source_index_id} → Infra tab), identify the source's data volume(s) and follow the AWS console link. Confirm:

At least one snapshot exists from after the cutover (so its contents represent a fully-populated source).
It is in completed state.

Note the snapshot ID in the fork ticket / change log before proceeding.
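If you list the snapshots programmatically instead of through the console, the two conditions can be applied as a filter. The sketch below assumes each snapshot is a dict with `SnapshotId`, `State`, and `StartTime` keys, which is the shape of the entries boto3's EC2 `describe_snapshots` returns; fetching them for the right volume is left to the caller, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def usable_snapshots(snapshots: list[dict], cutover_at: datetime) -> list[str]:
    """IDs of completed snapshots started after cutover, newest first."""
    good = [
        s for s in snapshots
        if s["State"] == "completed" and s["StartTime"] > cutover_at
    ]
    good.sort(key=lambda s: s["StartTime"], reverse=True)
    return [s["SnapshotId"] for s in good]


cutover_at = datetime(2024, 1, 2, tzinfo=timezone.utc)
snapshots = [
    {"SnapshotId": "snap-old", "State": "completed",
     "StartTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-pending", "State": "pending",
     "StartTime": datetime(2024, 1, 4, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-good", "State": "completed",
     "StartTime": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
print(usable_snapshots(snapshots, cutover_at))  # ['snap-good']
```

An empty result means there is nothing to restore from, and teardown should wait until a post-cutover snapshot completes.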

Run cleanup (finalize source read routing to the target)

The cleanup action is not exposed as a button on the fork detail page — call the endpoint directly:

curl -X POST ".../api/v1/accounts/{account}/indexes/{source_index}/forks/{fork_id}/cleanup" \
  -H "Content-Type: application/json"

Expected response:

{ "fork_id": "...", "status": "complete", "cleaned_up": true }

cleaned_up: false means the alias update raised an error (for example, a transient DynamoDB failure); the cutover itself is unaffected. The endpoint is idempotent, so re-run it after investigating. Common causes:

A concurrent edit bumped the index settings record version — just retry.

After this runs, the source's read routing is finalized: the source's self-referencing read alias is removed and the target read alias is made visible, so 100% of reads resolve to the target. The write fan-out is intentionally left in place — writes addressed to the source continue fanning out to the target (and to the source itself, which stays a live relay). Verify on the source index detail page → Aliases tab that the doc-reads target is the target index, and that the doc-writes target still includes the target (source still serves as the entry point; reads resolve to the target while writes fan out to it).
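Because the endpoint is idempotent, the version-conflict case lends itself to a bounded retry. The following is a minimal sketch, not part of the fork API: `post_cleanup` is a hypothetical caller-supplied function that issues the POST shown above and returns the parsed JSON body, and the attempt count and delay are arbitrary. For anything other than a version conflict, investigate before retrying.

```python
import time
from typing import Callable


def run_cleanup(post_cleanup: Callable[[], dict], attempts: int = 3,
                delay_seconds: float = 2.0) -> dict:
    """Call cleanup until it reports cleaned_up: true, up to a fixed number of attempts."""
    last: dict = {}
    for attempt in range(1, attempts + 1):
        last = post_cleanup()
        if last.get("cleaned_up"):
            return last
        if attempt < attempts:
            time.sleep(delay_seconds)  # give a concurrent settings edit time to finish
    raise RuntimeError(f"cleanup did not complete after {attempts} attempts: {last}")


# Example with canned responses: the first call hits a conflict, the second succeeds.
responses = iter([
    {"fork_id": "f1", "status": "complete", "cleaned_up": False},
    {"fork_id": "f1", "status": "complete", "cleaned_up": True},
])
result = run_cleanup(lambda: next(responses), delay_seconds=0)
print(result["cleaned_up"])  # True
```

Raising after the final attempt keeps a persistent failure from being mistaken for success in a script.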

Tear down source data-plane resources

The source is still receiving writes after cleanup. Because cleanup retains the source→target write alias (and the source's implicit self-write cannot be removed via the alias model), writes addressed to the source still fan out to the source's own data plane as well as the target. Deleting the source's Marqo index and queue below stops the source from holding a copy of new writes; writes addressed to the source still reach the target via the retained write alias. Confirm with the team that owns the source cell that the ecom write path tolerates the now-missing source queue (rather than failing the whole write) before deleting it. Fully removing the source from the write path is gated on chain collapse (see Future work).

There is no single API call for this. Do not call DELETE /api/v1/indexes/{source_index} on the Ecom API — that endpoint deletes the EcomIndexSettings record unconditionally (see index_service.delete_index in components/shopify/admin_server/admin_server/services/index_service.py), which would destroy the alias config that's still routing customer traffic to the target.

Instead, tear down just the data-plane pieces, leaving the settings record and KV routing in place:

Marqo (Vespa) index on the source cell — call the marqo-classic API directly to delete the index on the source cell. This does not touch the control-plane EcomIndexSettings record.
Source's SQS queue(s) and event source mapping — delete via the AWS console (the source index detail page → Infra tab links to the SQS console).
Source's S3 prefix ({source_shop_id}/) — delete via S3 console.
Coordinate with whoever owns the source cell — if the source cell is being retired entirely as part of a fork (e.g. blue-green infra migration), some of these are subsumed by cell teardown.

Keep in place:

The source's EcomIndexSettings record in {env}-EcomIndexSettingsTable. It holds the read/write aliases that route customer requests through to the target.
The source's KV entry in the search-proxy KV namespace (key: {source_account}-{source_index}). Search-proxy uses this to resolve incoming requests addressed to the source so they can be routed onward.

Follow-up: the Ecom API's DELETE /api/v1/indexes/{index_name} should refuse to delete the settings record when an index_aliases config is present, so that operators can use the same delete endpoint to safely retire a fork source. Until that lands, the manual marqo-classic + AWS console path above is the correct way.
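The guard described in the follow-up could look roughly like the sketch below. It is an illustration of the intended behavior only, not the code in index_service.delete_index: the exception class, the function name, and the assumption that the settings record is available as a dict with an `index_aliases` key are all hypothetical.

```python
class AliasRoutingPresent(Exception):
    """Raised when deleting a settings record would destroy live alias routing."""


def check_safe_to_delete_settings(settings: dict) -> None:
    """Refuse deletion while the record still carries an alias config."""
    aliases = settings.get("index_aliases")
    if aliases:
        raise AliasRoutingPresent(
            "settings record still routes traffic via index_aliases; "
            "tear down the data plane only"
        )


# A record with no aliases passes; a fork source's record does not.
check_safe_to_delete_settings({"index_name": "plain-index"})
try:
    check_safe_to_delete_settings(
        {"index_name": "fork-source", "index_aliases": {"reads": ["target"]}}
    )
except AliasRoutingPresent as exc:
    print(f"refused: {exc}")
```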

Post-decommission verification

Searches to the source still succeed and return target data. Send a search request addressing the source by ID (x-marqo-index-id: {source_account}-{source_index}) and confirm it returns results equivalent to what the target returns directly. A 404 here means either the source's KV entry was inadvertently cleared or its EcomIndexSettings record was deleted — investigate.
No new sync jobs land for the source. On the source index detail page → Jobs tab (or the global /reindexing list filtered to the source), confirm no new jobs appear after teardown. Customer writes that were addressed to the source should be visible as jobs on the target instead.
Target metrics unchanged. Target write rate, doc-count growth, and search-proxy latencies should look the same as during soak (you're removing redundancy, not changing the target's load profile).
Snapshot ID recorded. The EBS snapshot ID from preflight check 6 is in the fork ticket / change log for the agreed retention window.
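For the first check in the list above, comparing the two responses by document ID is usually enough to confirm the source is routing through to the target. A minimal sketch, assuming both search responses have been parsed into dicts with a `hits` list whose entries carry an `_id` field; if the Ecom API's response shape differs, adjust the key names accordingly.

```python
def same_hit_ids(via_source: dict, direct_target: dict) -> bool:
    """True if both responses return the same document IDs in the same order."""
    def ids(response: dict) -> list[str]:
        return [hit["_id"] for hit in response.get("hits", [])]

    return ids(via_source) == ids(direct_target)


via_source = {"hits": [{"_id": "sku-1"}, {"_id": "sku-2"}]}
direct_target = {"hits": [{"_id": "sku-1"}, {"_id": "sku-2"}]}
print(same_hit_ids(via_source, direct_target))  # True
```

Run the two searches back to back with the same query, since ongoing writes to the target can legitimately change the results between calls.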

Future work

Chain collapse: when a chain of forks exists (A → B → C), today running cleanup on the B → C fork leaves B alive as a relay because A's aliases still point at B. The intent (see TODO in fork_service.cleanup) is to update A's aliases to point directly at C so B's settings record can also be retired. Until that lands, intermediary settings records from chained forks must be kept alive in the same way the source's settings record is preserved above.
Per-document reconciliation: today's verify is a point-in-time hit count comparison, so it can't tell you which documents differ between source and target — only that the totals are close. A reconciliation action that diffs document IDs (and optionally re-writes missing docs) is planned but not implemented.
Alias-aware Ecom delete: extend DELETE /api/v1/indexes/{index_name} to skip deleting the EcomIndexSettings record when index_aliases is set, so operators can use the same endpoint to retire a fork source data plane without destroying the alias routing.
Data-plane teardown automation: the source data-plane teardown above is manual. A future POST /forks/{id}/teardown (or extension to the existing cleanup) could orchestrate the marqo-classic delete + AWS resource cleanup end-to-end with the same idempotency guarantees as the rest of the fork API.
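To make the chain collapse item concrete: for a chain A → B → C, collapsing B means rewriting every alias that points at B so it points at C, after which B's record is no longer referenced. The sketch below is a hypothetical illustration of that rewrite over a plain mapping of index name to alias targets; it is not the planned implementation in fork_service.cleanup.

```python
def collapse_chain(aliases: dict[str, list[str]], retired: str,
                   successor: str) -> dict[str, list[str]]:
    """Return a copy in which aliases pointing at `retired` point at `successor`."""
    collapsed = {}
    for index, targets in aliases.items():
        if index == retired:
            continue  # the intermediary's own record can now be retired
        rewritten: list[str] = []
        for target in targets:
            target = successor if target == retired else target
            if target not in rewritten:  # avoid duplicate targets after the rewrite
                rewritten.append(target)
        collapsed[index] = rewritten
    return collapsed


print(collapse_chain({"A": ["B"], "B": ["C"]}, retired="B", successor="C"))
# {'A': ['C']}
```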
