Finnish compound-word query splitter — Implementation Plan

Status: Draft 2026-05-21. Companion to the Finnish compound-word query splitter — Spec.
Branch: vicilliar/finnish-compound-splitter
Notion mirror: https://www.notion.so/36775d43da4c81cc85bccf0f5f3213a7

This plan describes the how of shipping the Finnish compound-word query splitter. The spec page is the what. Read the spec first — this plan assumes its decisions as given.

Conventions for this plan

  • Copy existing patterns wherever they exist. New code should not invent shapes when a working precedent is in this repo:
    • LexicalEnrichmentConfig mirrors AddDocsConfig (components/ecom_utils/ecom_utils/index_settings_service/index_settings_model.py) and its repository methods follow the add_merchandising_field atomic-update pattern in index_settings_repository.py.
    • The settings exporter changes copy patterns from existing exporters in components/ecom_settings_exporter/ (see existing tests like agentic_config_export_test.py, llm_image_search_config_export_test.py, profiles_export_test.py — these are the templates to clone).
    • The lexical analyzer service's S3 client uses a 5-minute TTL cache for the synonym dictionary.
    • Admin lambda surface follows existing index settings routes (components/admin_lambda/admin_lambda/services/index_settings_admin_service.py).
    • Indexer enrichment module is structured like ecom_indexer/merchandising.py (per-field transform applied to each batch of docs).
  • Every new module ships with sufficient unit tests. Bare minimum per PR: golden-output tests against fixed fixtures, success-path coverage, failure-path coverage, configuration-disabled coverage. No PR merges with "compile pass" as the only validation.
  • All PRs are dark-deployable. Until the first index has lexical_enrichment_config.parse_language set, every PR must produce zero behavior change for existing customers. This invariant guarantees safe rollback at any point.
  • No __init__.py files — Pants handles Python imports without them.
  • Run ruff after editing Python files; npm run format in components/search_proxy after editing TypeScript.

PR dependency graph

PR1 lexical analyzer service ──┐
                               ├──→ PR4 ecom_indexer ──┐
PR2 lexical analyzer infra ────┘ (uses PR1)            │

PR3 ecom_utils model ──┬──→ PR5 settings exp ──→ PR6 search_proxy ──┐
                       │                                            │
                       ├──→ PR7 admin_lambda ───────────────────────┤
                       │                                            │
                       └───────────────────────────────────────────→├──→ PR8 E2E
PR4 (above) ────────────────────────────────────────────────────────┘

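Critical path: PR1 + PR2 → PR4 → PR6 → PR8 (≈4 sequential steps). The PR4 → PR6 step reflects rollout order rather than a code dependency: per the Dependencies sections below, PR6 needs PR1 + PR2, PR3, and PR5. Parallelizable: PR3, PR5, and PR7 can be developed alongside the critical path, provided PR3 lands before PR4 and PR5 lands before PR6.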
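As a sanity check on the ordering, the dependency edges can be fed to a topological sort. The sketch below transcribes the per-PR "Dependencies" sections of this plan into a dict (the dict name and the `merge_waves` helper are illustrative, not part of the repo) and groups the PRs into waves that could be worked on in parallel.

```python
from graphlib import TopologicalSorter

# Each PR mapped to the PRs it depends on, transcribed from the
# per-PR "Dependencies" sections of this plan.
DEPENDS_ON = {
    "PR1": set(),
    "PR2": {"PR1"},
    "PR3": set(),
    "PR4": {"PR1", "PR2", "PR3"},
    "PR5": {"PR3"},
    "PR6": {"PR1", "PR2", "PR3", "PR5"},
    "PR7": {"PR3"},
    "PR8": {"PR1", "PR2", "PR3", "PR4", "PR5", "PR6", "PR7"},
}


def merge_waves(depends_on):
    """Group PRs into waves; every PR in a wave has all its dependencies met."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(merge_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With the edges as listed, this yields four waves: PR1 and PR3 first, then PR2, PR5, and PR7, then PR4 and PR6, and finally PR8.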

One-time prep (not a PR; ad-hoc tasks)

These tasks gate PR1's acceptance criteria but are operational, not code.

Convert the source synonym dictionary to JSON

The linguistic source file synonym_config.py is a Python module exporting SYNONYM_GROUPS: list[list[str] | dict]. Each entry is either a flat list (applies to all categories) or a dict with terms + only + exclude keys.

Write a small conversion script components/lexical_analyzer/scripts/build_synonyms_json.py that imports the source file (or vendors its contents) and writes a normalised JSON form:

[
{"terms": ["kenkä", "jalkine"]},
{"terms": ["verkka", "college", "treeni"], "only": ["Clothing & Shoes", "Sports"]},
{"terms": ["tuoli", "istuin"], "exclude": ["Automotive & Caravan Supplies"]}
]

Output path: components/lexical_analyzer/dictionaries/fi/synonyms.json. Commit to the repo so the canonical source is version-controlled. The S3 copy is the runtime mirror.
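A minimal sketch of the normalisation step in build_synonyms_json.py, assuming the source entries look exactly as described above. The inline SYNONYM_GROUPS is a stand-in for the real import from synonym_config.py, and the function names and the strict key validation are illustrative choices rather than settled design.

```python
import json

# Stand-in for the real SYNONYM_GROUPS exported by synonym_config.py;
# these three entries mirror the JSON example above.
SYNONYM_GROUPS = [
    ["kenkä", "jalkine"],
    {"terms": ["verkka", "college", "treeni"], "only": ["Clothing & Shoes", "Sports"]},
    {"terms": ["tuoli", "istuin"], "exclude": ["Automotive & Caravan Supplies"]},
]

ALLOWED_KEYS = {"terms", "only", "exclude"}


def normalise_group(group):
    """Turn one source entry (flat list or dict) into the normalised dict form."""
    if isinstance(group, list):
        return {"terms": list(group)}
    if isinstance(group, dict):
        unknown = set(group) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"unexpected keys in synonym group: {sorted(unknown)}")
        if "terms" not in group:
            raise ValueError("synonym group dict is missing 'terms'")
        normalised = {"terms": list(group["terms"])}
        for key in ("only", "exclude"):
            if group.get(key):  # drop empty or absent category filters
                normalised[key] = list(group[key])
        return normalised
    raise TypeError(f"unsupported synonym group type: {type(group).__name__}")


def build_synonyms_json(groups):
    """Serialise all groups; ensure_ascii=False keeps 'ä' and 'ö' readable in the file."""
    return json.dumps([normalise_group(g) for g in groups], ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(build_synonyms_json(SYNONYM_GROUPS))
```

Failing loudly on unknown keys is a deliberate choice in this sketch: a typo such as "exlude" in the source file would otherwise silently widen a synonym group to every category.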

S3 upload command

Upload the JSON to the per-env {env}-lexical-analyzer bucket under the dictionaries/{language}/ prefix:

# Dev
aws s3 cp components/lexical_analyzer/dictionaries/fi/synonyms.json \
s3://dev-lexical-analyzer/dictionaries/fi/synonyms.json \
--profile MarqoControlPlane

# Staging
aws s3 cp components/lexical_analyzer/dictionaries/fi/synonyms.json \
s3://staging-lexical-analyzer/dictionaries/fi/synonyms.json \
--profile MarqoControlPlane

# Prod
aws s3 cp components/lexical_analyzer/dictionaries/fi/synonyms.json \
s3://prod-lexical-analyzer/dictionaries/fi/synonyms.json \
--profile MarqoControlPlane

Enable S3 object versioning on the {env}-lexical-analyzer bucket as part of PR2 so dictionary edits can be rolled back.

Document this command in components/lexical_analyzer/README.md so SEs can use it without engineering involvement once the format is locked.

PR1 — Lexical analyzer service (new component)

Goal: New components/lexical_analyzer/ HTTP service that wraps libvoikko + four enrichment transforms behind a single /text-variants endpoint.

Files (new)

components/lexical_analyzer/
├── Dockerfile                      # libvoikko + FI dict (pinned ver) + Python
├── BUILD                           # Pants target
├── README.md                       # local-dev + S3 upload commands
├── dictionaries/
│   └── fi/synonyms.json            # checked-in canonical source
├── scripts/
│   └── build_synonyms_json.py      # `synonym_config.py` → JSON converter
└── lexical_analyzer/
    ├── app.py                      # FastAPI app: /text-variants, /health
    ├── main.py                     # entrypoint
    ├── models.py                   # request/response models
    ├── s3_client.py                # synonym dict loader, 5-min TTL
    ├── voikko_pool.py              # warm Voikko instance per worker
    ├── app_test.py                 # /text-variants + /health integration
    ├── s3_client_test.py           # cache TTL + miss/hit + error
    └── transforms/
        ├── parsed.py               # Voikko compound splitter
        ├── parsed_test.py
        ├── synonyms.py             # curated-dictionary noun expansion
        ├── synonyms_test.py        # exercises dictionary lookup
        ├── typos.py                # double-consonant / digraph / adj-key
        ├── typos_test.py
        ├── numeric_tokens.py       # model-code / dimension extraction
        └── numeric_tokens_test.py

Tests

  • Per-transform unit tests with golden-output fixtures. Each transform's tests cover both common cases (compound, multi-word sentence) and edge cases (empty string, single char, all-numeric, non-Finnish text).
  • /text-variants integration tests — single-transform requests, multi-transform requests, batch input, malformed input (422), unsupported language (400). Includes explicit tests that synonyms / typos / numeric_tokens always operate on the parsed text even when "parsed" is not in the transforms list.
  • S3 client tests — cache hit, cache miss, TTL expiry, S3 unavailable (the load raises, and /health returns 503 until S3 recovers).
  • /health tests — returns 503 until dictionary loaded, 200 thereafter.
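The cache behaviour these tests exercise can be sketched as a small TTL wrapper around a loader function. This is an illustration only: the class name, the injectable clock, and the `ready` property are assumptions, and the real s3_client.py would pass a loader that reads synonyms.json from S3.

```python
import time


class TTLCache:
    """Caches one loaded value and reloads it once the TTL has elapsed."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader      # e.g. a function that reads synonyms.json from S3
        self._ttl = ttl_seconds    # 300 s = the 5-minute TTL from this plan
        self._clock = clock        # injectable so tests can fake the passage of time
        self._value = None
        self._loaded_at = None

    @property
    def ready(self):
        """True once at least one load has succeeded (a /health check could use this)."""
        return self._loaded_at is not None

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # Loader errors propagate to the caller; the previously cached
            # value and timestamp are left untouched in that case.
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

Injecting the clock keeps the TTL-expiry test deterministic, with no sleeping in the test suite. Whether a failed refresh should flip /health back to 503 when a stale copy is still cached is left to the spec; this sketch only tracks whether a first load ever succeeded.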

Dependencies

None (greenfield component).

Acceptance criteria

  • POST /text-variants {"language":"fi","transforms":["parsed"],"queries":["rattikelkka"]} returns {"parsed":["ratti kelkka"]} (with real Voikko in the docker image; brand-prefixed compounds like "stigarattikelkka" return unchanged because the brand is not in Voikko's dictionary).
  • All 4 transforms produce output matching their checked-in golden fixtures.
  • /health 503 until FI dictionary loaded, 200 after.
  • Docker image builds; final image size < 1 GB.
  • Voikko library + Finnish dictionary versions pinned and noted in Dockerfile and README.md.
  • pants test //components/lexical_analyzer:: passes.
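The rule that synonyms / typos / numeric_tokens always run on the parsed text, even when "parsed" is not requested, can be sketched as a plain dispatcher. Everything here is a stand-in: `parse` replaces the Voikko splitter, `downstream` replaces the real transform modules, and the assumption that each transform returns one string per query is illustrative, not taken from the spec.

```python
def text_variants(queries, transforms, parse, downstream):
    """Run the requested transforms over a batch of queries.

    parse:      callable str -> str (stand-in for the Voikko compound splitter)
    downstream: dict of transform name -> callable str -> str, each fed parsed text
    """
    unknown = [t for t in transforms if t != "parsed" and t not in downstream]
    if unknown:
        raise ValueError(f"unknown transforms: {unknown}")

    # Parse once per query, whether or not "parsed" itself was requested.
    parsed = [parse(q) for q in queries]

    result = {}
    for name in transforms:
        if name == "parsed":
            result[name] = parsed
        else:
            # Downstream transforms never see the raw query.
            result[name] = [downstream[name](p) for p in parsed]
    return result


# Toy splitter: knows one compound, returns anything else unchanged,
# the way an out-of-dictionary brand prefix would come back unsplit.
FAKE_SPLITS = {"rattikelkka": "ratti kelkka"}


def fake_parse(query):
    return FAKE_SPLITS.get(query, query)


if __name__ == "__main__":
    print(text_variants(["rattikelkka"], ["parsed"], fake_parse, {}))
```

The response is keyed by transform name with one entry per input query, matching the {"parsed":["ratti kelkka"]} shape in the first acceptance criterion.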

PR2 — Lexical analyzer infra (AWS CDK)

Goal: Deploy the lexical analyzer service to ECS Fargate as a standalone CDK stack alongside the existing ecom_stack, with an internal NLB frontend, per the cost estimate in the spec. Stack lives under infra/ecom/stacks/ because the consumers (search proxy, ecom_indexer) are ecom-domain — but it is its own stack with an independent lifecycle so PR2 can deploy ahead of PR4/PR6.

Files (new + edit)

  • infra/ecom/stacks/lexical_analyzer_stack.py (new) — ECS Fargate service, internal NLB, security groups, IAM (S3 read access scoped to {env}-lexical-analyzer/dictionaries/*), CloudWatch log group, autoscaling, ECR repo
  • infra/ecom/stacks/lexical_analyzer_stack_test.py (new) — CDK snapshot test
  • infra/ecom/app.py (edit) — instantiate LexicalAnalyzerStack per environment alongside the existing ecom stacks
  • New S3 bucket {env}-lexical-analyzer with object versioning enabled on the entire bucket; encryption + private access; the bucket is dedicated to the analyzer (not piggy-backing on cloud-controller)

Sizing

Per cost estimate in spec: 2 baseline Fargate tasks (1 vCPU, 2 GB RAM) on-demand for prod; FARGATE_SPOT for dev/staging. Autoscale to 4 tasks at 70% CPU/memory.

Tests

  • CDK snapshot test asserts the synthesised template matches a checked-in golden snapshot.
  • Smoke test after cdk deploy: curl http://<internal-nlb>/health from an in-VPC host returns 200.

Dependencies

PR1 (Docker image exists; ECR has a build).

Acceptance criteria

  • cdk synth clean for dev/staging/prod.
  • Stack deploys to dev; /health returns 200 within 5 minutes.
  • Grafana alerts configured in cloud_data_plane, sourced from the CloudWatch metrics the stack emits: lexical analyzer error rate > 1% over 5 minutes, lexical analyzer p99 latency > 200 ms over 5 minutes.
  • Realised monthly cost matches spec estimate within 10% after first month.

PR3 — Shared model (ecom_utils + DDB repository)

Goal: Add LexicalEnrichmentConfig as a peer of AddDocsConfig on IndexSettings, with repository setter methods.

Pattern reference

  • Model: mirror AddDocsConfig in index_settings_model.py. LexicalEnrichmentConfig(RecordModel) with model_config = ConfigDict(frozen=True).
  • Repository methods: mirror add_merchandising_field / set_* atomic-update patterns in index_settings_repository.py. Targeted UpdateItem with expression_attribute_names to avoid clobbering sibling fields.
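As an illustration of a targeted update that leaves sibling fields alone, the sketch below builds the keyword arguments for a single-field UpdateItem call in the shape boto3's Table.update_item accepts. The key schema (`index_id`) and the function name are hypothetical; the real setters should follow whatever add_merchandising_field does in index_settings_repository.py.

```python
def build_set_parse_language_update(index_id, parse_language):
    """Build kwargs for a DynamoDB UpdateItem call that touches one nested field."""
    return {
        "Key": {"index_id": index_id},  # hypothetical key schema
        # Only the nested path is written; add_docs_config, search_config and
        # merchandising_fields are never mentioned, so they cannot be clobbered.
        "UpdateExpression": "SET #cfg.#field = :value",
        "ExpressionAttributeNames": {
            "#cfg": "lexical_enrichment_config",
            "#field": "parse_language",
        },
        "ExpressionAttributeValues": {":value": parse_language},
        # A nested SET fails if the parent map is missing; the condition makes
        # that failure explicit so the caller can create the map first.
        "ConditionExpression": "attribute_exists(#cfg)",
    }


if __name__ == "__main__":
    print(build_set_parse_language_update("idx-123", "fi"))
```

One consequence worth testing: DynamoDB cannot SET a nested attribute under a map that does not exist yet, so set_parse_language() on an index with no lexical_enrichment_config needs either a prior set_lexical_enrichment_config() or a fallback write of the whole map.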

Files (edit)

  • components/ecom_utils/ecom_utils/index_settings_service/index_settings_model.py — add LexicalEnrichmentConfig class + IndexSettings.lexical_enrichment_config: LexicalEnrichmentConfig | None = None field
  • components/ecom_utils/ecom_utils/index_settings_service/index_settings_repository.py — add set_lexical_enrichment_config() and set_parse_language() / set_extract_numeric_tokens() / set_enriched_fields() atomic setters
  • components/ecom_utils/ecom_utils/index_settings_service/index_settings_repository_test.py — round-trip + atomic-update tests
  • components/ecom_utils/ecom_utils/index_settings_service/index_settings_model_test.py — pydantic validation tests

Tests

  • Pydantic validation: parse_language="xx" rejected (not in Literal["fi"]); defaults are parse_language=None, extract_numeric_tokens=False, enriched_fields=None.
  • DDB round-trip: write settings with lexical_enrichment_config, read back, all fields match.
  • Atomic setter doesn't clobber add_docs_config, search_config, or merchandising_fields.
  • Each setter is independently exercised against moto-backed DDB.

Dependencies

None.

Acceptance criteria

  • pants test //components/ecom_utils:: passes.
  • No __init__.py files added.
  • New fields exported on the IndexSettings pydantic model.

PR4 — Ecom indexer integration

Goal: Indexer calls the lexical analyzer to populate _mq_* enrichment fields on incoming docs. Includes one-shot backfill script.

Pattern reference

  • Enrichment orchestrator: mirror ecom_indexer/merchandising.py shape. A plain function that takes a list of docs + config and adds the new fields to each doc in place. Module is import-light to keep cold-start fast.
  • Lexical analyzer client: mirror existing HTTP client patterns in the indexer (e.g. how it talks to Marqo).
  • Backfill script: mirror existing one-shot scripts under components/ecom_indexer/ecom_indexer/scripts/.
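A rough sketch of the orchestrator shape described above: collect the target values across the whole batch, make one analyzer call, then write the results back. The `_mq_<lang>_parsed_<field>` naming is inferred from the `_mq_fi_parsed_title` field mentioned in the E2E section; the function signature and the `analyze_batch` callable are assumptions standing in for the real client.

```python
def enrich_docs(docs, parse_language, fields, analyze_batch):
    """Add _mq_<lang>_parsed_<field> to each doc using ONE analyzer call per batch.

    analyze_batch: callable list[str] -> list[str], same order and length as
                   its input (stand-in for the lexical analyzer client).
    """
    if not parse_language:
        return docs  # feature disabled: no analyzer call, docs untouched

    # Collect (doc index, field name) for every non-empty string value.
    targets = [
        (i, field)
        for i, doc in enumerate(docs)
        for field in fields
        if isinstance(doc.get(field), str) and doc[field]
    ]
    if not targets:
        return docs

    parsed = analyze_batch([docs[i][field] for i, field in targets])
    if len(parsed) != len(targets):
        raise ValueError("analyzer returned a different number of results than requested")

    # Write results back in place, next to the original fields.
    for (i, field), value in zip(targets, parsed):
        docs[i][f"_mq_{parse_language}_parsed_{field}"] = value
    return docs
```

Passing the analyzer in as a callable keeps the module import-light and lets enrichment_test.py assert the "no call when unconfigured" case with a simple counter.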

Files (new + edit)

  • components/ecom_indexer/ecom_indexer/lexical_analyzer_client.py (new) — HTTP client with batch endpoint, retry semantics
  • components/ecom_indexer/ecom_indexer/enrichment.py (new) — collect target-field values per doc, call lexical-analyzer batch, write back _mq_* fields
  • components/ecom_indexer/ecom_indexer/document_operations.py (edit) — wire enrichment.py into the per-doc transform sequence
  • components/ecom_indexer/ecom_indexer/lambda_function.py (edit) — initialise lexical analyzer client at boot, lexical analyzer URL from env var
  • components/ecom_indexer/ecom_indexer/scripts/backfill_enrichment.py (new) — one-shot patch via update_documents: scan an index, batch-parse, patch with new fields only. Dry-run + execute modes; resumable from state file; idempotent
  • components/ecom_indexer/ecom_indexer/lexical_analyzer_client_test.py (new)
  • components/ecom_indexer/ecom_indexer/enrichment_test.py (new)
  • components/ecom_indexer/ecom_indexer/scripts/backfill_enrichment_test.py (new)

Tests

  • lexical_analyzer_client: success path, retry on transient HTTP error, hard fail after retry budget, malformed response, timeout.
  • enrichment: with parse_language="fi" and extract_numeric_tokens=True configured, ingested docs gain expected _mq_* fields; with neither configured, no lexical analyzer call is made; field set derives from lexical-searchable attrs when enriched_fields is null; enriched_fields overrides the derived set when provided.
  • document_operations integration: end-to-end transform sequence on a fake batch, asserts merchandising + enrichment + numeric splitting all compose correctly.
  • backfill_enrichment: dry-run prints sample diff and exits without writing; execute is idempotent on re-run; state file resumes from kill point; lexical analyzer failure mid-run is surfaced and the script can be rerun without re-processing finished docs.
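The retry cases listed for lexical_analyzer_client can be pictured with a small retry-budget helper. The exception class, default attempt count, and backoff values below are illustrative assumptions; the real client should reuse whatever retry conventions the indexer's existing HTTP clients follow.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 503 or a timeout."""


def call_with_retry(send, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call send() until it succeeds or the retry budget is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except TransientError:
            if attempt == max_attempts:
                raise  # hard fail: the caller fails the whole batch
            # Exponential backoff: base_delay, 2x base_delay, 4x base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Re-raising after the last attempt matches the acceptance criterion below that an analyzer failure fails the batch and leaves recovery to the existing retry machinery. Injecting `sleep` keeps the tests instant.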

Dependencies

PR1 + PR2 (lexical analyzer deployed at a known URL), PR3 (model exists).

Acceptance criteria

  • One HTTP call per batch (≈100 docs), not per-doc.
  • Lexical analyzer failure during indexing fails the batch — existing retry machinery picks it up.
  • Backfill: resumable, idempotent, dry-runnable.
  • pants test //components/ecom_indexer:: passes.

PR5 — Settings exporter

Goal: Export LexicalEnrichmentConfig from DDB to Cloudflare KV so the search proxy can read it.

Pattern reference

Copy the structure of existing exports in components/ecom_settings_exporter/ecom_settings_exporter/lambda_function.py — the tests agentic_config_export_test.py, llm_image_search_config_export_test.py, profiles_export_test.py are the templates. Each existing exporter test asserts a specific config slice round-trips into the KV payload while leaving the rest untouched.

Files (edit)

  • components/ecom_settings_exporter/ecom_settings_exporter/lambda_function.py — include lexical_enrichment_config in the KV payload (passthrough; no transformation)
  • components/ecom_settings_exporter/ecom_settings_exporter/lambda_function_test.py — add an enrichment-config round-trip assertion (mirror the existing agentic_config_export_test.py shape)
  • components/ecom_settings_exporter/ecom_settings_exporter/lexical_enrichment_config_export_test.py (new) — dedicated test module for the new config slice, matching the pattern of agentic_config_export_test.py / llm_image_search_config_export_test.py

Tests

  • DDB record with lexical_enrichment_config produces a KV record containing the same fields.
  • DDB record without lexical_enrichment_config produces KV output byte-identical to pre-PR.
  • Partial lexical_enrichment_config (e.g. only extract_numeric_tokens=True, no parse_language) round-trips correctly.
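The three cases above reduce to one property: the new slice is copied through when present and leaves no trace when absent. A minimal sketch, assuming the settings record is a plain dict and that an unset config is omitted from the payload rather than written as null (both are assumptions; the real lambda_function.py serialisation may differ):

```python
import json


def build_kv_payload(settings):
    """Build the KV value for one index from its settings record (a plain dict here)."""
    payload = {k: v for k, v in settings.items() if k != "lexical_enrichment_config"}
    config = settings.get("lexical_enrichment_config")
    if config is not None:
        # Passthrough: copy the slice as-is, no transformation.
        payload["lexical_enrichment_config"] = dict(config)
    # Deterministic key order and separators make byte-level comparison meaningful.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))
```

Comparing serialised bytes rather than parsed dicts is what makes the "byte-identical to pre-PR" test catch accidental additions such as a stray null key.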

Dependencies

PR3 (model exists).

Acceptance criteria

  • pants test //components/ecom_settings_exporter:: passes.
  • Backward compatibility: existing test fixtures still pass without modification.

PR6 — Search proxy integration

Goal: Search proxy reads LexicalEnrichmentConfig from KV, calls lexical analyzer on q when parse_language set, builds hybridParameters override, applies skip-list, falls back to raw on failure.

Pattern reference

  • Lexical analyzer client: mirror components/search_proxy/src/synonyms.ts for CF Cache integration shape — cachePut/cachePutAsync, key construction, TTL handling.
  • Zod schema: add to IndexSettings schema in platform.ts alongside existing slices.
  • Skip-list / Marqo request construction: extend search.ts:handleSearch minimally; isolate the new logic in a helper for testability.

Files (new + edit)

  • components/search_proxy/src/platform.ts (edit) — add lexical_enrichment_config to zod schema
  • components/search_proxy/src/lexical-analyzer-client.ts (new) — HTTP wrapper with CF Cache integration (7-day TTL, key compound-split/<lang>/<urlencoded(q)>), HTTP timeout configurable via LEXICAL_ANALYZER_TIMEOUT_MS env var (default 200 ms), fallback-to-raw + WARN log on failure
  • components/search_proxy/src/search.ts (edit) — invoke lexical-analyzer-client inside handleSearch when configured; build hybridParameters with queryLexical = "<raw> <parsed>", queryTensor = raw, searchableAttributesLexical = [<originals>, <_mq_*>], searchableAttributesTensor = <originals>; apply skip-list
  • components/search_proxy/src/lexical-analyzer-client.test.ts (new)
  • components/search_proxy/src/search.test.ts (edit) — extend with analyzer-on/off cases, skip-list cases, failure-fallback case, debug-envelope case
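The cache key and the hybridParameters override described in the file list can be written down compactly. The search proxy itself is TypeScript; the sketch below uses Python only to keep the examples in this plan in one language, and both function names are illustrative.

```python
from urllib.parse import quote


def cache_key(language, query):
    """Cache key in the compound-split/<lang>/<urlencoded(q)> form."""
    # safe="" also percent-encodes "/", so a query cannot add path segments.
    return f"compound-split/{language}/{quote(query, safe='')}"


def build_hybrid_parameters(raw, parsed, original_fields, enriched_fields):
    """Lexical search sees raw + parsed text and the _mq_* fields; tensor search is unchanged."""
    return {
        "queryLexical": f"{raw} {parsed}",
        "queryTensor": raw,
        "searchableAttributesLexical": [*original_fields, *enriched_fields],
        "searchableAttributesTensor": list(original_fields),
    }


if __name__ == "__main__":
    print(cache_key("fi", "kenkä kauppa"))
    print(build_hybrid_parameters("rattikelkka", "ratti kelkka", ["title"], ["_mq_fi_parsed_title"]))
```

Keeping queryTensor and searchableAttributesTensor on the raw query and original fields means the embedding side of hybrid search behaves exactly as it does today; only lexical recall widens.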

Tests

  • lexical-analyzer-client: cache miss → HTTP call → response cached; cache hit → no HTTP call; timeout → raw fallback + WARN log; HTTP error → raw fallback + WARN log.
  • search.ts (integration): with lexical_enrichment_config.parse_language="fi", the Marqo request body has the expected hybridParameters override; without lexical_enrichment_config, the request is byte-identical to the pre-PR baseline (regression-protect existing snapshot suite).
  • Skip-list cases: empty, image, ≤2 chars, numeric-only, >200 chars — each verified to bypass the lexical analyzer.
  • Debug envelope: with MARQO_DEBUG_HEADER_VALUE set, response carries _marqo_debug.parsed with language, raw_query, parsed_query, analyzer_latency_ms, cache_hit.
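The skip-list cases can be captured in one predicate. This is a Python sketch of logic that would live in the TypeScript proxy; the function name, the `is_image_query` flag, and the choice to measure length after trimming whitespace are assumptions that the spec should confirm.

```python
def should_skip_analyzer(query, is_image_query=False):
    """Return True when the query should bypass the lexical analyzer."""
    if is_image_query:
        return True                # image search: nothing to split
    text = (query or "").strip()
    if not text:
        return True                # empty
    if len(text) <= 2:
        return True                # too short to be a compound
    if len(text) > 200:
        return True                # over the length cap
    if text.replace(" ", "").isdigit():
        return True                # numeric-only, e.g. "391" or "12 34"
    return False
```

Each branch corresponds to one of the five skip-list test cases, so a table-driven test over (query, expected) pairs covers the predicate completely.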

Dependencies

PR1 + PR2 (lexical analyzer deployed), PR3 (model), PR5 (KV exporter writes the field).

Acceptance criteria

  • Existing search snapshot tests still pass byte-identical when lexical_enrichment_config is unset.
  • New tests cover all configured-on and skip-list paths.
  • npm test in components/search_proxy passes.

PR7 — Admin lambda surface

Goal: Surface LexicalEnrichmentConfig in admin endpoints so SEs can enable per-index without DDB direct writes.

Pattern reference

Copy the existing index-settings admin endpoint shape — components/admin_lambda/admin_lambda/services/index_settings_admin_service.py and the corresponding routes module. Extend the request/response models; reuse the existing partial-merge persistence pattern.

Files (edit)

  • components/admin_lambda/admin_lambda/services/index_settings_admin_service.py — accept lexical_enrichment_config in update payload, delegate to set_lexical_enrichment_config repository method (PR3)
  • components/admin_lambda/admin_lambda/models/index_models.py — extend request and response models with lexical_enrichment_config: LexicalEnrichmentConfig | None
  • components/admin_lambda/admin_lambda/routes/<index_settings_routes>.py — wire it up (file name TBD by following existing route)
  • Corresponding tests in matching *_test.py files

Tests

  • PATCH .../settings accepts lexical_enrichment_config payload, persists to DDB.
  • GET .../settings returns the configured lexical_enrichment_config.
  • Invalid language code ("xx") → 422.
  • Empty lexical_enrichment_config (null) clears the field.
  • Partial update of lexical_enrichment_config doesn't clobber the rest of the index settings.

Dependencies

PR3 (model exists).

Acceptance criteria

  • pants test //components/admin_lambda:: passes.

PR8 — E2E tests

Goal: End-to-end coverage that the feature works against real services in dev. Tests are tagged @scheduled so they run a couple of times a week (not on every PR/merge) — Finnish-customer regressions are caught within a day or two without burning CI minutes on a feature that few customers exercise. The PR8 acceptance criterion still requires one successful pass against a real dev-deployed stack before merging into mainline; ongoing coverage is via the scheduled cron.

Pattern reference

Follow the existing pattern in components/shopify/e2e_tests/e2e_tests/tests/ — each test uses EcomClient to drive real customer-facing endpoints in dev, with fresh test indexes per test. Tag each test function with @pytest.mark.scheduled so it is gated to the scheduled workflow rather than the per-PR run.

Files (new)

  • components/shopify/e2e_tests/e2e_tests/tests/lexical_enrichment_test.py

Test cases

  • test_parse_language_fi_finds_split_compound_via_component_query — fresh test index with lexical_enrichment_config.parse_language="fi". Index a product titled "Rattikelkka" (a compound Voikko can split; a brand-prefixed title such as "Stigarattikelkka" would come back unsplit, per the PR1 acceptance criteria). Query for a single component word, e.g. "kelkka". Without splitting, lexical search sees the title as one opaque token and "kelkka" does not match; with the lexical analyzer engaged, _mq_fi_parsed_title contains "kelkka" and the product appears in the top results.
  • test_parse_language_unset_does_not_match_component — control case. Same index seeded the same way but no lexical_enrichment_config. Same "kelkka" query. Assert: product does NOT appear (proves the feature is doing the work, not free recall from another mechanism).
  • test_lexical_analyzer_down_falls_back_to_raw — inject lexical analyzer unavailability (misconfigured URL). Assert: search still returns 200 with raw-query results; no 5xx surfaced.
  • test_numeric_tokens_finds_partial_model_code — index with extract_numeric_tokens=True. Product with title "DCS391N power drill". Query "391". Assert: product appears.
  • test_debug_envelope_populated — with debug header set, assert _marqo_debug.parsed is present in the response with raw_query, parsed_query, cache_hit, analyzer_latency_ms.
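To make test_numeric_tokens_finds_partial_model_code concrete, here is one way digit runs could be pulled out of a model code so that "391" becomes matchable. This is a hypothetical sketch, not the numeric_tokens.py rules, which are defined by the spec; the function name and the decision to ignore words that are already pure numbers are assumptions.

```python
import re

_DIGIT_RUN = re.compile(r"\d+")


def extract_numeric_tokens(text):
    """Return digit runs found inside mixed alphanumeric words, in first-seen order."""
    tokens = []
    for word in text.split():
        # A bare number such as "391" is already its own lexical token,
        # so only words that mix digits with other characters are split.
        if word.isdigit():
            continue
        for run in _DIGIT_RUN.findall(word):
            if run not in tokens:
                tokens.append(run)
    return tokens


if __name__ == "__main__":
    print(extract_numeric_tokens("DCS391N power drill"))
```

Under these assumptions, the title "DCS391N power drill" contributes the token "391", which is what lets the query in the test case match.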

Dependencies

All previous PRs deployed to dev.

Acceptance criteria

  • All 5 tests pass at least once against a real dev-deployed stack before this PR merges.
  • Tests carry the @scheduled marker so they run on the existing scheduled E2E workflow (≈twice a week) and are skipped on the default per-PR run.
  • Tests use the existing EcomClient pattern and don't introduce new infra fixtures.

Rollout sequence (after all PRs merged)

  1. Deploy PR1 + PR2 to dev, staging, prod. Lexical analyzer service live, receiving zero traffic.
  2. Deploy PR3 + PR5. Model + exporter shipped dark. No index has lexical_enrichment_config set → zero behaviour change.
  3. Deploy PR4. Indexer reads lexical analyzer when configured. Still no index configured → zero behaviour change.
  4. Deploy PR6. Search proxy reads lexical analyzer when configured. Same.
  5. Deploy PR7. Admin tooling available for SE use.
  6. Deploy PR8. E2E tests run on the scheduled E2E workflow (≈twice a week) going forward, not on every CI build.
  7. Enable for Kärkkäinen's staging index. SE uploads Finnish synonym dictionary to S3 (staging-lexical-analyzer/dictionaries/fi/synonyms.json), sets lexical_enrichment_config.parse_language="fi" and extract_numeric_tokens=True via admin endpoint. Run backfill_enrichment.py against the staging index. Validate end-to-end with sample Finnish queries.
  8. Enable for Kärkkäinen's prod index. Same sequence, prod bucket. Watch CloudWatch alarms (lexical analyzer error rate, search-proxy fallback rate, Kärkkäinen-specific search latency).

Each PR is independently deployable and reversible. Steps 1–6 are dark deploys; step 7 is the first behaviour-visible change, scoped to one staging index.

Test coverage summary

| PR  | Component                | Unit tests | Integration tests | E2E tests |
|-----|--------------------------|------------|-------------------|-----------|
| PR1 | lexical analyzer service | per-transform golden fixtures; S3 client cache; /health states | /text-variants single + batch + multi-transform | |
| PR2 | lexical analyzer infra   | CDK snapshot | post-deploy smoke (curl /health) | |
| PR3 | ecom_utils               | pydantic validation + repo round-trips + atomic setters | | |
| PR4 | ecom_indexer             | lexical analyzer client; enrichment orchestrator; backfill script | document_operations transform composition | |
| PR5 | settings exporter        | per-config-slice export tests | | |
| PR6 | search proxy             | lexical-analyzer-client (cache, fallback, timeout); skip-list | search.ts integration (hybridParameters built correctly; debug envelope) | |
| PR7 | admin lambda             | request validation; persistence; partial updates | | |
| PR8 | shopify/e2e_tests        | | | 5 cases against dev stack |

Every PR must include the tests listed in its row above. "Tests follow in a later PR" is not an acceptable framing.