Finnish compound-word query splitter — Implementation Plan
Status: Draft 2026-05-21. Companion to the Finnish compound-word query splitter — Spec.
Branch: vicilliar/finnish-compound-splitter
Notion mirror: https://www.notion.so/36775d43da4c81cc85bccf0f5f3213a7
This plan describes the how of shipping the Finnish compound-word query splitter. The spec page is the what. Read the spec first — this plan assumes its decisions as given.
Conventions for this plan
- Copy existing patterns wherever they exist. New code should not invent shapes when a working precedent is in this repo:
  - LexicalEnrichmentConfig mirrors AddDocsConfig (components/ecom_utils/ecom_utils/index_settings_service/index_settings_model.py), and its repository methods follow the add_merchandising_field atomic-update pattern in index_settings_repository.py.
  - The settings exporter changes copy patterns from existing exporters in components/ecom_settings_exporter/ (see existing tests like agentic_config_export_test.py, llm_image_search_config_export_test.py, profiles_export_test.py; these are the templates to clone).
  - The lexical analyzer service's S3 client uses a 5-minute TTL cache for the synonym dictionary.
  - Admin lambda surface follows existing index settings routes (components/admin_lambda/admin_lambda/services/index_settings_admin_service.py).
  - Indexer enrichment module is structured like ecom_indexer/merchandising.py (per-field transform applied to each batch of docs).
- Every new module ships with sufficient unit tests. Bare minimum per PR: golden-output tests against fixed fixtures, success-path coverage, failure-path coverage, configuration-disabled coverage. No PR merges with "compile pass" as the only validation.
- All PRs are dark-deployable. Until the first index has lexical_enrichment_config.parse_language set, every PR must produce zero behaviour change for existing customers. This invariant guarantees safe rollback at any point.
- No __init__.py files: Pants handles Python imports without them.
- Run ruff after editing Python files; npm run format in components/search_proxy after editing TypeScript.
PR dependency graph
PR1 lexical analyzer service ──┐
├──→ PR4 ecom_indexer ──┐
PR2 lexical analyzer infra ────┘ (uses PR1) │
│
PR3 ecom_utils model ──┬──→ PR5 settings exp ──→ PR6 search_proxy ──┐
│ │
├──→ PR7 admin_lambda ────────────────────────┤
│ │
└─────────────────────────────────────────────→├──→ PR8 E2E
│
PR4 (above) ─────────────────┘
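Critical path: PR1 + PR2 → PR4 → PR6 → PR8 (≈4 sequential dependencies). Parallelisable: PR3, PR5 and PR7 can be developed alongside the critical path, but PR3 must land before PR4, PR5 and PR7, and PR5 must land before PR6 (see each PR's Dependencies section).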
One-time prep (not a PR; ad-hoc tasks)
These tasks gate PR1's acceptance criteria but are operational, not code.
Convert the source synonym dictionary to JSON
The linguistic source file synonym_config.py is a Python module exporting SYNONYM_GROUPS: list[list[str] | dict]. Each entry is either a flat list (applies to all categories) or a dict with terms + only + exclude keys.
Write a small conversion script components/lexical_analyzer/scripts/build_synonyms_json.py that imports the source file (or vendors its contents) and writes a normalised JSON form:
[
{"terms": ["kenkä", "jalkine"]},
{"terms": ["verkka", "college", "treeni"], "only": ["Clothing & Shoes", "Sports"]},
{"terms": ["tuoli", "istuin"], "exclude": ["Automotive & Caravan Supplies"]}
]
Output path: components/lexical_analyzer/dictionaries/fi/synonyms.json. Commit to the repo so the canonical source is version-controlled. The S3 copy is the runtime mirror.
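The normalisation step described above could be sketched roughly as follows. This is a minimal illustration, not the actual build_synonyms_json.py: the function names normalise_groups and to_json and the SAMPLE_GROUPS data are hypothetical, and it assumes that empty only / exclude lists are simply omitted from the output.

```python
import json


def normalise_groups(groups):
    """Convert SYNONYM_GROUPS-style entries into the normalised JSON-ready form."""
    normalised = []
    for entry in groups:
        if isinstance(entry, dict):
            item = {"terms": list(entry["terms"])}
            # Copy the optional category filters only when present and non-empty
            # (assumption: empty filters carry no meaning).
            for key in ("only", "exclude"):
                if entry.get(key):
                    item[key] = list(entry[key])
        else:
            # A flat list applies to all categories.
            item = {"terms": list(entry)}
        normalised.append(item)
    return normalised


def to_json(groups):
    # ensure_ascii=False keeps Finnish characters such as "ä" readable in the file.
    return json.dumps(normalise_groups(groups), ensure_ascii=False, indent=2)


# Hypothetical sample in the shape of the source module's SYNONYM_GROUPS.
SAMPLE_GROUPS = [
    ["kenkä", "jalkine"],
    {"terms": ["tuoli", "istuin"], "exclude": ["Automotive & Caravan Supplies"]},
]
```

Writing the file with ensure_ascii=False means it should also be opened with encoding="utf-8" on both the write and the read side.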
S3 upload command
Upload the JSON to the per-env {env}-lexical-analyzer bucket under the dictionaries/{language}/ prefix:
# Dev
aws s3 cp components/lexical_analyzer/dictionaries/fi/synonyms.json \
s3://dev-lexical-analyzer/dictionaries/fi/synonyms.json \
--profile MarqoControlPlane
# Staging
aws s3 cp components/lexical_analyzer/dictionaries/fi/synonyms.json \
s3://staging-lexical-analyzer/dictionaries/fi/synonyms.json \
--profile MarqoControlPlane
# Prod
aws s3 cp components/lexical_analyzer/dictionaries/fi/synonyms.json \
s3://prod-lexical-analyzer/dictionaries/fi/synonyms.json \
--profile MarqoControlPlane
Enable S3 object versioning on the {env}-lexical-analyzer bucket as part of PR2 so dictionary edits can be rolled back.
Document this command in components/lexical_analyzer/README.md so SEs can use it without engineering involvement once the format is locked.
PR1 — Lexical analyzer service (new component)
Goal: New components/lexical_analyzer/ HTTP service that wraps libvoikko + four enrichment transforms behind a single /text-variants endpoint.
Files (new)
components/lexical_analyzer/
├── Dockerfile                     # libvoikko + FI dict (pinned ver) + Python
├── BUILD                          # Pants target
├── README.md                      # local-dev + S3 upload commands
├── dictionaries/
│   └── fi/synonyms.json           # checked-in canonical source
├── scripts/
│   └── build_synonyms_json.py     # `synonym_config.py` → JSON converter
└── lexical_analyzer/
    ├── app.py                     # FastAPI app: /text-variants, /health
    ├── main.py                    # entrypoint
    ├── models.py                  # request/response models
    ├── s3_client.py               # synonym dict loader, 5-min TTL
    ├── voikko_pool.py             # warm Voikko instance per worker
    ├── app_test.py                # /text-variants + /health integration
    ├── s3_client_test.py          # cache TTL + miss/hit + error
    └── transforms/
        ├── parsed.py              # Voikko compound splitter
        ├── parsed_test.py
        ├── synonyms.py            # curated-dictionary noun expansion
        ├── synonyms_test.py       # exercises dictionary lookup
        ├── typos.py               # double-consonant / digraph / adj-key
        ├── typos_test.py
        ├── numeric_tokens.py      # model-code / dimension extraction
        └── numeric_tokens_test.py
Tests
- Per-transform unit tests with golden-output fixtures. Each transform exercises both common cases (compound, multi-word sentence) and edge cases (empty string, single char, all-numeric, non-Finnish text).
- /text-variants integration tests: single-transform requests, multi-transform requests, batch input, malformed input (422), unsupported language (400). Includes explicit tests that synonyms / typos / numeric_tokens always operate on the parsed text even when "parsed" is not in the transforms list.
- S3 client tests: cache hit, cache miss, TTL expiry, S3 unavailable (raises and 503s /health until recovered).
- /health tests: returns 503 until dictionary loaded, 200 thereafter.
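The cache behaviour these S3 client tests target (hit, miss, TTL expiry, loader failure) can be illustrated with a small sketch. The class name TTLCache and its injectable loader and clock are assumptions made for testability; the real s3_client.py may be shaped differently, and the loader here stands in for the actual S3 read.

```python
import time


class TTLCache:
    """Caches one loaded value and reloads it once ttl_seconds have elapsed."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader        # hypothetical callable that reads the dictionary
        self._ttl = ttl_seconds      # 300 s matches the 5-minute TTL in this plan
        self._clock = clock          # injectable so tests can control time
        self._value = None
        self._loaded_at = None

    @property
    def is_loaded(self):
        # A /health handler could report 503 while this is False.
        return self._loaded_at is not None

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # If the loader raises, the exception propagates and the
            # previously cached value and timestamp stay untouched.
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

Injecting the clock lets the TTL-expiry test advance time deterministically instead of sleeping.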
Dependencies
None (greenfield component).
Acceptance criteria
- POST /text-variants {"language":"fi","transforms":["parsed"],"queries":["rattikelkka"]} returns {"parsed":["ratti kelkka"]} (with real Voikko in the docker image; brand-prefixed compounds like "stigarattikelkka" return unchanged because the brand is not in Voikko's dictionary).
- All 4 transforms produce output matching their checked-in golden fixtures.
- /health returns 503 until FI dictionary loaded, 200 after.
- Docker image builds; final image size < 1 GB.
- Voikko library + Finnish dictionary versions pinned and noted in Dockerfile and README.md.
- pants test //components/lexical_analyzer:: passes.
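The rule that synonyms / typos / numeric_tokens always operate on the parsed text, even when "parsed" is not requested, can be sketched as below. This is a hedged illustration only: text_variants, stub_split and the derived-transform mapping are hypothetical names, stub_split stands in for the real Voikko splitter, and the response shape for the non-parsed transforms is an assumption.

```python
def stub_split(query):
    # Stand-in for the Voikko compound splitter: unknown words come back unchanged.
    known = {"rattikelkka": "ratti kelkka"}
    return known.get(query, query)


def text_variants(queries, transforms, split, derived):
    """Build a response dict with one list per requested transform.

    split:   callable(str) -> str, the compound splitter
    derived: dict mapping transform name -> callable(parsed_text) -> result
    """
    # The parsed form is always computed, whether or not it was requested.
    parsed = [split(q) for q in queries]
    result = {}
    for name in transforms:
        if name == "parsed":
            result["parsed"] = parsed
        else:
            # Derived transforms receive the parsed text, never the raw query.
            result[name] = [derived[name](p) for p in parsed]
    return result
```

For example, text_variants(["rattikelkka"], ["parsed"], stub_split, {}) yields {"parsed": ["ratti kelkka"]}, matching the first acceptance criterion above.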
PR2 — Lexical analyzer infra (AWS CDK)
Goal: Deploy the lexical analyzer service to ECS Fargate as a standalone CDK stack alongside the existing ecom_stack, with an internal NLB frontend, per the cost estimate in the spec. Stack lives under infra/ecom/stacks/ because the consumers (search proxy, ecom_indexer) are ecom-domain — but it is its own stack with an independent lifecycle so PR2 can deploy ahead of PR4/PR6.
Files (new + edit)
- infra/ecom/stacks/lexical_analyzer_stack.py (new): ECS Fargate service, internal NLB, security groups, IAM (S3 read access scoped to {env}-lexical-analyzer/dictionaries/*), CloudWatch log group, autoscaling, ECR repo
- infra/ecom/stacks/lexical_analyzer_stack_test.py (new): CDK snapshot test
- infra/ecom/app.py (edit): instantiate LexicalAnalyzerStack per environment alongside the existing ecom stacks
- New S3 bucket {env}-lexical-analyzer with object versioning enabled on the entire bucket; encryption + private access; the bucket is dedicated to the analyzer (not piggy-backing on cloud-controller)
Sizing
Per cost estimate in spec: 2 baseline Fargate tasks (1 vCPU, 2 GB RAM) on-demand for prod; FARGATE_SPOT for dev/staging. Autoscale to 4 tasks at 70% CPU/memory.
Tests
- CDK snapshot test asserts the synthesised template matches a checked-in golden snapshot.
- Smoke test after cdk deploy: curl http://<internal-nlb>/health from an in-VPC host returns 200.
Dependencies
PR1 (Docker image exists; ECR has a build).
Acceptance criteria
- cdk synth clean for dev/staging/prod.
- Stack deploys to dev; /health returns 200 within 5 minutes.
- Grafana alerts configured in cloud_data_plane, sourced from the CloudWatch metrics the stack emits: lexical analyzer error rate > 1% over 5 minutes, lexical analyzer p99 latency > 200 ms over 5 minutes.
- Realised monthly cost matches spec estimate within 10% after first month.
PR3 — Shared model (ecom_utils + DDB repository)
Goal: Add LexicalEnrichmentConfig as a peer of AddDocsConfig on IndexSettings, with repository setter methods.
Pattern reference
- Model: mirror AddDocsConfig in index_settings_model.py. LexicalEnrichmentConfig(RecordModel) with model_config = ConfigDict(frozen=True).
- Repository methods: mirror add_merchandising_field / set_* atomic-update patterns in index_settings_repository.py. Targeted UpdateItem with expression_attribute_names to avoid clobbering sibling fields.
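To make the targeted-UpdateItem idea concrete, here is a minimal sketch of how the update arguments might be assembled. It assumes the repository ultimately issues a DynamoDB UpdateItem through boto3's Table.update_item (whose keyword names are used below); the helper names and the index_id key attribute are hypothetical, and the existing repository's own wrapper may expose these under different names.

```python
def build_set_config_update(index_id, config):
    """Arguments for replacing only the lexical_enrichment_config attribute."""
    return {
        "Key": {"index_id": index_id},  # hypothetical key schema
        # SET touches one attribute; sibling attributes on the item are left alone.
        "UpdateExpression": "SET #lec = :cfg",
        "ExpressionAttributeNames": {"#lec": "lexical_enrichment_config"},
        "ExpressionAttributeValues": {":cfg": config},
    }


def build_set_parse_language_update(index_id, language):
    """Arguments for updating a single nested field inside the config map."""
    return {
        "Key": {"index_id": index_id},  # hypothetical key schema
        "UpdateExpression": "SET #lec.#pl = :lang",
        "ExpressionAttributeNames": {
            "#lec": "lexical_enrichment_config",
            "#pl": "parse_language",
        },
        "ExpressionAttributeValues": {":lang": language},
    }
```

One design point worth testing: DynamoDB rejects a nested path such as #lec.#pl when the parent map does not exist yet, so a per-field setter needs a strategy for the first write (for example, falling back to writing the whole map).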
Files (edit)
- components/ecom_utils/ecom_utils/index_settings_service/index_settings_model.py: add LexicalEnrichmentConfig class + IndexSettings.lexical_enrichment_config: LexicalEnrichmentConfig | None = None field
- components/ecom_utils/ecom_utils/index_settings_service/index_settings_repository.py: add set_lexical_enrichment_config() and set_parse_language() / set_extract_numeric_tokens() / set_enriched_fields() atomic setters
- components/ecom_utils/ecom_utils/index_settings_service/index_settings_repository_test.py: round-trip + atomic-update tests
- components/ecom_utils/ecom_utils/index_settings_service/index_settings_model_test.py: pydantic validation tests
Tests
- Pydantic validation: parse_language="xx" rejected (not in Literal["fi"]); defaults are parse_language=None, extract_numeric_tokens=False, enriched_fields=None.
- DDB round-trip: write settings with lexical_enrichment_config, read back, all fields match.
- Atomic setter doesn't clobber add_docs_config, search_config, or merchandising_fields.
- Each setter is independently exercised against moto-backed DDB.
Dependencies
None.
Acceptance criteria
- pants test //components/ecom_utils:: passes.
- No __init__.py files added.
- New fields exported on the IndexSettings pydantic model.
PR4 — Ecom indexer integration
Goal: Indexer calls the lexical analyzer to populate _mq_* enrichment fields on incoming docs. Includes one-shot backfill script.
Pattern reference
- Enrichment orchestrator: mirror ecom_indexer/merchandising.py shape. A plain function that takes a list of docs + config and adds the new fields to each doc in place. Module is import-light to keep cold-start fast.
- Lexical analyzer client: mirror existing HTTP client patterns in the indexer (e.g. how it talks to Marqo).
- Backfill script: mirror existing one-shot scripts under components/ecom_indexer/ecom_indexer/scripts/.
Files (new + edit)
- components/ecom_indexer/ecom_indexer/lexical_analyzer_client.py (new): HTTP client with batch endpoint, retry semantics
- components/ecom_indexer/ecom_indexer/enrichment.py (new): collect target-field values per doc, call lexical-analyzer batch, write back _mq_* fields
- components/ecom_indexer/ecom_indexer/document_operations.py (edit): wire enrichment.py into the per-doc transform sequence
- components/ecom_indexer/ecom_indexer/lambda_function.py (edit): initialise lexical analyzer client at boot, lexical analyzer URL from env var
- components/ecom_indexer/ecom_indexer/scripts/backfill_enrichment.py (new): one-shot patch via update_documents: scan an index, batch-parse, patch with new fields only. Dry-run + execute modes; resumable from state file; idempotent
- components/ecom_indexer/ecom_indexer/lexical_analyzer_client_test.py (new)
- components/ecom_indexer/ecom_indexer/enrichment_test.py (new)
- components/ecom_indexer/ecom_indexer/scripts/backfill_enrichment_test.py (new)
Tests
- lexical_analyzer_client: success path, retry on transient HTTP error, hard fail after retry budget, malformed response, timeout.
- enrichment: with parse_language="fi" and extract_numeric_tokens=True configured, ingested docs gain expected _mq_* fields; with neither configured, no lexical analyzer call is made; field set derives from lexical-searchable attrs when enriched_fields is null; enriched_fields overrides the derived set when provided.
- document_operations integration: end-to-end transform sequence on a fake batch, asserts merchandising + enrichment + numeric splitting all compose correctly.
- backfill_enrichment: dry-run prints sample diff and exits without writing; execute is idempotent on re-run; state file resumes from kill point; lexical analyzer failure mid-run is surfaced and the script can be rerun without re-processing finished docs.
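The backfill properties tested above (dry-run writes nothing, failures surface, reruns skip finished docs) might look roughly like this in miniature. It is a sketch under stated assumptions: run_backfill, the patch callable and the JSON state-file format are hypothetical, and the real script would scan an index and patch via update_documents rather than take a list of ids.

```python
import json
import os


def run_backfill(doc_ids, patch, state_path, dry_run=False):
    """Patch each pending doc once, recording progress so a rerun can resume."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as f:
            done = set(json.load(f))

    pending = [d for d in doc_ids if d not in done]
    if dry_run:
        # Dry-run reports what would be processed and writes nothing.
        return pending

    for doc_id in pending:
        # A failure here propagates to the caller; ids finished earlier
        # are already recorded in the state file.
        patch(doc_id)
        done.add(doc_id)
        with open(state_path, "w", encoding="utf-8") as f:
            json.dump(sorted(done), f)
    return pending
```

Persisting state after every document keeps the sketch simple; a real script would more likely checkpoint per batch to limit file writes.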
Dependencies
PR1 + PR2 (lexical analyzer deployed at a known URL), PR3 (model exists).
Acceptance criteria
- One HTTP call per batch (≈100 docs), not per-doc.
- Lexical analyzer failure during indexing fails the batch — existing retry machinery picks it up.
- Backfill: resumable, idempotent, dry-runnable.
- pants test //components/ecom_indexer:: passes.
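The "one HTTP call per batch" criterion and the configuration-disabled path can be illustrated with a small orchestrator sketch. The names enrich_docs, analyze and default_fields are hypothetical; analyze stands in for the lexical analyzer client's batch call, and the _mq_<language>_parsed_<field> naming is inferred from the _mq_fi_parsed_title field mentioned in this plan.

```python
def enrich_docs(docs, config, analyze, default_fields):
    """Add parsed-text fields to docs in place using one batched analyzer call.

    config:         dict-like lexical enrichment config, or None when unset
    analyze:        callable(language, list_of_texts) -> list_of_parsed_texts
    default_fields: fields to use when config has no enriched_fields override
    """
    language = (config or {}).get("parse_language")
    if not language:
        # Dark path: nothing configured, so no analyzer call is made.
        return docs

    fields = config.get("enriched_fields") or default_fields
    # Every (doc index, field) slot that actually holds text.
    slots = [
        (i, field)
        for i, doc in enumerate(docs)
        for field in fields
        if isinstance(doc.get(field), str)
    ]
    if not slots:
        return docs

    # Single batched call for the whole batch, not one call per doc.
    parsed = analyze(language, [docs[i][field] for i, field in slots])
    for (i, field), value in zip(slots, parsed):
        docs[i][f"_mq_{language}_parsed_{field}"] = value
    return docs
```

Because analyze is injected, an exception it raises propagates out of enrich_docs, which is consistent with failing the whole batch and leaving retries to the existing machinery.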
PR5 — Settings exporter
Goal: Export LexicalEnrichmentConfig from DDB to CloudFlare KV so the search proxy can read it.
Pattern reference
Copy the structure of existing exports in components/ecom_settings_exporter/ecom_settings_exporter/lambda_function.py — the tests agentic_config_export_test.py, llm_image_search_config_export_test.py, profiles_export_test.py are the templates. Each existing exporter test asserts a specific config slice round-trips into the KV payload while leaving the rest untouched.
Files (edit)
- components/ecom_settings_exporter/ecom_settings_exporter/lambda_function.py: include lexical_enrichment_config in the KV payload (passthrough; no transformation)
- components/ecom_settings_exporter/ecom_settings_exporter/lambda_function_test.py: add an enrichment-config round-trip assertion (mirror the existing agentic_config_export_test.py shape)
- components/ecom_settings_exporter/ecom_settings_exporter/lexical_enrichment_config_export_test.py (new): dedicated test module for the new config slice, matching the pattern of agentic_config_export_test.py / llm_image_search_config_export_test.py
Tests
- DDB record with lexical_enrichment_config produces a KV record containing the same fields.
- DDB record without lexical_enrichment_config produces KV output byte-identical to pre-PR.
- Partial lexical_enrichment_config (e.g. only extract_numeric_tokens=True, no parse_language) round-trips correctly.
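The passthrough and the "byte-identical when absent" property can be shown with a minimal sketch. build_kv_payload and base_payload are hypothetical names; the real exporter assembles its payload inside lambda_function.py and may structure this differently.

```python
import json


def build_kv_payload(record, base_payload):
    """Return the KV payload, adding the config slice only when it is set."""
    payload = dict(base_payload)  # copy so the caller's dict is not mutated
    config = record.get("lexical_enrichment_config")
    if config is not None:
        # Passthrough: the slice is copied as-is, with no transformation.
        payload["lexical_enrichment_config"] = config
    return payload


def serialise(payload):
    # Stable key order so "byte-identical" comparisons are meaningful.
    return json.dumps(payload, sort_keys=True, ensure_ascii=False)
```

Omitting the key entirely, rather than writing null, is what keeps the serialised output unchanged for records that never had the config.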
Dependencies
PR3 (model exists).
Acceptance criteria
- pants test //components/ecom_settings_exporter:: passes.
- Backward compatibility: existing test fixtures still pass without modification.
PR6 — Search proxy integration
Goal: Search proxy reads LexicalEnrichmentConfig from KV, calls lexical analyzer on q when parse_language set, builds hybridParameters override, applies skip-list, falls back to raw on failure.
Pattern reference
- Lexical analyzer client: mirror components/search_proxy/src/synonyms.ts for CF Cache integration shape: cachePut / cachePutAsync, key construction, TTL handling.
- Zod schema: add to IndexSettings schema in platform.ts alongside existing slices.
- Skip-list / Marqo request construction: extend search.ts:handleSearch minimally; isolate the new logic in a helper for testability.
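The cache-then-call-then-fallback flow for the search-proxy client can be sketched as below. It is written in Python only to stay consistent with the other examples in this plan; the real client would be TypeScript. The function names, the dict standing in for CF Cache and the warn callback are all hypothetical, and urllib.parse.quote is only an approximation of whatever URL-encoding the proxy uses.

```python
from urllib.parse import quote


def cache_key(language, query):
    # Key shape from this plan: compound-split/<lang>/<urlencoded(q)>
    return f"compound-split/{language}/{quote(query, safe='')}"


def parsed_or_raw(query, language, analyze, cache, warn):
    """Return (text, cache_hit); falls back to the raw query on any failure.

    analyze: callable(language, query) -> parsed text; may raise on timeout/HTTP error
    cache:   dict-like stand-in for the edge cache
    warn:    callable(str) used for the WARN log line
    """
    key = cache_key(language, query)
    if key in cache:
        return cache[key], True
    try:
        parsed = analyze(language, query)
    except Exception as exc:
        # Failures are never cached, so the next request retries the analyzer.
        warn(f"lexical analyzer failed, using raw query: {exc}")
        return query, False
    cache[key] = parsed
    return parsed, False
```

Returning the cache_hit flag alongside the text makes it straightforward to populate a debug envelope later without re-deriving it.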
Files (new + edit)
- components/search_proxy/src/platform.ts (edit): add lexical_enrichment_config to zod schema
- components/search_proxy/src/lexical-analyzer-client.ts (new): HTTP wrapper with CF Cache integration (7-day TTL, key compound-split/<lang>/<urlencoded(q)>), HTTP timeout configurable via LEXICAL_ANALYZER_TIMEOUT_MS env var (default 200 ms), fallback-to-raw + WARN log on failure
- components/search_proxy/src/search.ts (edit): invoke lexical-analyzer-client inside handleSearch when configured; build hybridParameters with queryLexical = "<raw> <parsed>", queryTensor = raw, searchableAttributesLexical = [<originals>, <_mq_*>], searchableAttributesTensor = <originals>; apply skip-list
- components/search_proxy/src/lexical-analyzer-client.test.ts (new)
- components/search_proxy/src/search.test.ts (edit): extend with analyzer-on/off cases, skip-list cases, failure-fallback case, debug-envelope case
Tests
- lexical-analyzer-client: cache miss → HTTP call → response cached; cache hit → no HTTP call; timeout → raw fallback + WARN log; HTTP error → raw fallback + WARN log.
- search.ts (integration): with lexical_enrichment_config.parse_language="fi", the Marqo request body has the expected hybridParameters override; without lexical_enrichment_config, the request is byte-identical to the pre-PR baseline (regression-protect existing snapshot suite).
- Skip-list cases: empty, image, ≤2 chars, numeric-only, >200 chars. Each is verified to bypass the lexical analyzer.
- Debug envelope: with MARQO_DEBUG_HEADER_VALUE set, response carries _marqo_debug.parsed with language, raw_query, parsed_query, analyzer_latency_ms, cache_hit.
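The skip-list and the hybridParameters override exercised by these tests can be sketched as two small helpers. As with the other examples, this is Python for consistency only (the real code would live in search.ts); should_skip and build_hybrid_parameters are hypothetical names, and how an image query is detected is left to the caller via the is_image flag because the plan does not specify it.

```python
def should_skip(query, is_image=False):
    """True when the query should bypass the lexical analyzer."""
    q = query.strip()
    numeric_only = q.replace(" ", "").isdigit()
    # Empty queries are covered by the <= 2 character rule.
    return is_image or len(q) <= 2 or len(q) > 200 or numeric_only


def build_hybrid_parameters(raw, parsed, original_fields, enriched_fields):
    """Assemble the override described for search.ts in this plan."""
    return {
        # Lexical search sees both the raw and the split form of the query.
        "queryLexical": f"{raw} {parsed}",
        # Tensor search keeps the raw query untouched.
        "queryTensor": raw,
        "searchableAttributesLexical": [*original_fields, *enriched_fields],
        "searchableAttributesTensor": list(original_fields),
    }
```

Keeping both helpers free of I/O mirrors the "isolate the new logic in a helper for testability" guidance in the pattern reference.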
Dependencies
PR1 + PR2 (lexical analyzer deployed), PR3 (model), PR5 (KV exporter writes the field).
Acceptance criteria
- Existing search snapshot tests still pass byte-identical when lexical_enrichment_config is unset.
- New tests cover all configured-on and skip-list paths.
- npm test in components/search_proxy passes.
PR7 — Admin lambda surface
Goal: Surface LexicalEnrichmentConfig in admin endpoints so SEs can enable per-index without DDB direct writes.
Pattern reference
Copy the existing index-settings admin endpoint shape — components/admin_lambda/admin_lambda/services/index_settings_admin_service.py and the corresponding routes module. Extend the request/response models; reuse the existing partial-merge persistence pattern.
Files (edit)
- components/admin_lambda/admin_lambda/services/index_settings_admin_service.py: accept lexical_enrichment_config in update payload, delegate to set_lexical_enrichment_config repository method (PR3)
- components/admin_lambda/admin_lambda/models/index_models.py: extend request and response models with lexical_enrichment_config: LexicalEnrichmentConfig | None
- components/admin_lambda/admin_lambda/routes/<index_settings_routes>.py: wire it up (file name TBD by following existing route)
- Corresponding tests in matching *_test.py files
Tests
- PATCH .../settings accepts lexical_enrichment_config payload, persists to DDB.
- GET .../settings returns the configured lexical_enrichment_config.
- Invalid language code ("xx") → 422.
- Empty lexical_enrichment_config (null) clears the field.
- Partial update of lexical_enrichment_config doesn't clobber the rest of the index settings.
Dependencies
PR3 (model exists).
Acceptance criteria
- pants test //components/admin_lambda:: passes.
PR8 — E2E tests
Goal: End-to-end coverage that the feature works against real services in dev. Tests are tagged @scheduled so they run about twice a week rather than on every PR/merge: Finnish-customer regressions are caught within a few days without burning CI minutes on a feature that few customers exercise. The PR8 acceptance criterion still requires one successful pass against a real dev-deployed stack before merging into mainline; ongoing coverage is via the scheduled cron.
Pattern reference
Follow the existing pattern in components/shopify/e2e_tests/e2e_tests/tests/ — each test uses EcomClient to drive real customer-facing endpoints in dev, with fresh test indexes per test. Tag each test function with @pytest.mark.scheduled so it is gated to the scheduled workflow rather than the per-PR run.
Files (new)
components/shopify/e2e_tests/e2e_tests/tests/lexical_enrichment_test.py
Test cases
- test_parse_language_fi_finds_split_compound_via_component_query: fresh test index with lexical_enrichment_config.parse_language="fi". Index a product titled "Rattikelkka" (a brand-prefixed title such as "Stigarattikelkka" would come back from Voikko unsplit, per the PR1 acceptance criteria). Query for a single component word, e.g. "kelkka". Without splitting, lexical sees the title as one opaque token and "kelkka" does not match; with the lexical analyzer engaged, _mq_fi_parsed_title contains "kelkka" and the product appears in the top results.
- test_parse_language_unset_does_not_match_component: control case. Same index seeded the same way but no lexical_enrichment_config. Same "kelkka" query. Assert: product does NOT appear (proves the feature is doing the work, not free recall from another mechanism).
- test_lexical_analyzer_down_falls_back_to_raw: inject lexical analyzer unavailability (misconfigured URL). Assert: search still returns 200 with raw-query results; no 5xx surfaced.
- test_numeric_tokens_finds_partial_model_code: index with extract_numeric_tokens=True. Product with title "DCS391N power drill". Query "391". Assert: product appears.
- test_debug_envelope_populated: with debug header set, assert _marqo_debug.parsed is present in the response with raw_query, parsed_query, cache_hit, analyzer_latency_ms.
Dependencies
All previous PRs deployed to dev.
Acceptance criteria
- All 5 tests pass at least once against a real dev-deployed stack before this PR merges.
- Tests carry the @scheduled marker so they run on the existing scheduled E2E workflow (≈twice a week) and are skipped on the default per-PR run.
- Tests use the existing EcomClient pattern and don't introduce new infra fixtures.
Rollout sequence (after all PRs merged)
1. Deploy PR1 + PR2 to dev, staging, prod. Lexical analyzer service live, receiving zero traffic.
2. Deploy PR3 + PR5. Model + exporter shipped dark. No index has lexical_enrichment_config set → zero behaviour change.
3. Deploy PR4. Indexer calls the lexical analyzer when configured. Still no index configured → zero behaviour change.
4. Deploy PR6. Search proxy calls the lexical analyzer when configured. Same.
5. Deploy PR7. Admin tooling available for SE use.
6. Deploy PR8. E2E tests run on the scheduled E2E workflow going forward (not on every CI build).
7. Enable for Kärkkäinen's staging index. SE uploads Finnish synonym dictionary to S3 (staging-lexical-analyzer/dictionaries/fi/synonyms.json), sets lexical_enrichment_config.parse_language="fi" and extract_numeric_tokens=True via admin endpoint. Run backfill_enrichment.py against the staging index. Validate end-to-end with sample Finnish queries.
8. Enable for Kärkkäinen's prod index. Same sequence, prod bucket. Watch CloudWatch alarms (lexical analyzer error rate, search-proxy fallback rate, Kärkkäinen-specific search latency).
Each PR is independently deployable and reversible. Steps 1–6 are dark deploys; step 7 is the first behaviour-visible change, scoped to one staging index.
Test coverage summary
| PR | Component | Unit tests | Integration tests | E2E tests |
|---|---|---|---|---|
| PR1 | lexical analyzer service | per-transform golden fixtures; S3 client cache; /health states | /text-variants single + batch + multi-transform | — |
| PR2 | lexical analyzer infra | CDK snapshot | post-deploy smoke (curl /health) | — |
| PR3 | ecom_utils | pydantic validation + repo round-trips + atomic setters | — | — |
| PR4 | ecom_indexer | lexical analyzer client; enrichment orchestrator; backfill script | document_operations transform composition | — |
| PR5 | settings exporter | per-config-slice export tests | — | — |
| PR6 | search proxy | lexical-analyzer-client (cache, fallback, timeout); skip-list | search.ts integration (hybridParameters built correctly; debug envelope) | — |
| PR7 | admin lambda | request validation; persistence; partial updates | — | — |
| PR8 | shopify/e2e_tests | — | — | 5 cases against dev stack |
Every PR must include the tests in its column above. "Tests follow in a later PR" is not an acceptable framing.