Cloud Control Plane Development Guide
Architecture
- FastAPI web server for CRUD of Marqo Cloud resources, primarily users, accounts, and index configs (specifications for a Marqo search index that we host for customers).
- Operations delegated by a BFF to microservices responsible for Identity, UsersAccounts, Indexes, Clusters, ApiKeys, Integrations, and Billing.
- CRUD operations on indexes implemented via long-running workflows triggered by changes to index config records.
- BFF and all microservices deployed as Monolith, a single Docker image.
- Hosted on AWS with DDB for persisting records, API Gateway and ECS for serving Monolith, and Step Functions for invoking workflows.
Project Tracking
- Jira: s2search.atlassian.net, project key CP (Control Plane)
- Issues labeled "agent" are suitable for AI agents to pick up
Project Structure
Monorepo with Pants build system for Python code and npm scripts for TypeScript/frontend code.
Top-Level Directories
- components/ - All application code (services, workers, utilities)
- tests/ - Component-level test suites (mirrors components/ structure)
- infra/ - Infrastructure as code (AWS CDK)
- scripts/ - Utility scripts for project management
- docs/ - Developer documentation
Core Control Plane Services (Python/FastAPI)
Located in components/. Note: These are not currently used; ignore when implementing.
- bff_console - Backend-for-frontend serving the console UI
- identity_service - Authentication and identity management
- users_accounts_service - User and account CRUD
- indexes_service - Index configuration management
- cluster_service - Cluster lifecycle management
- api_keys_service - API key management
- billing_service - Billing and subscription handling
- monolith - Composes all services into a single Docker image
E-commerce Platform (Python)
- shopify/admin_server - Ecommerce API and Shopify app backend (FastAPI)
- shopify/admin_ui - Shopify app frontend (React)
- shopify/storefront_search - Storefront search widget
- ecom_indexer - Product indexing service
- ecom_utils - Shared e-commerce utilities and models
- ecom_metrics_consumer - Metrics processing
- ecom_settings_exporter - Settings export to KV
- merchandising_exporter - Merchandising rules export
- ecom_merchandising_sync - Merchandising rule sync to write-alias target indexes
Cloudflare Workers (TypeScript)
- search_proxy - Search API proxy with auth and caching
- agentic_search - AI-powered search features
- admin_worker - Admin dashboard (React Router + Cloudflare)
- admin_lambda - Admin API Lambda functions
Shared Utilities
- apis - Pydantic API models shared across microservices
- service_utils - Common service utilities (logging, config, errors)
- ecom_utils - E-commerce models, repositories, testing fixtures
Infrastructure
- infra/aws - AWS CDK stacks
- infra/ecom - E-commerce infrastructure (Lambdas, Step Functions)
- infra/admin - Admin infrastructure
- infra/controller - Controller infrastructure
Testing
- tests/console/ - Playwright-based UI tests for the console (TypeScript)
- components/shopify/e2e_tests - Ecommerce Platform and Shopify app E2E tests (run in CI, or locally against Hippodrome; see components/hippodrome/AGENTS.md, "Running E2E Tests")
Core Principles
- Fail Fast: No silent failures or fallbacks unless absolutely necessary. Raise exceptions early when validation fails.
- Model Everything: Never pass raw dicts/objects. Use Pydantic models (Python) or TypeScript interfaces.
- Test Everything: A change isn't complete until tests cover its new behavior. Compilation passing and pre-existing tests passing are not substitutes. New exported functions, new branches in dispatch/decision logic, and new error paths each need a test that would have failed before the change.
- Dependency Injection: Use constructor injection in Python services. Dependencies are injected via FastAPI Depends().
- Catch Errors Sensibly: Use semantically meaningful error types (e.g., RecordNotFoundError). The API layer converts these to HTTP status codes. Never catch generic Exception unless re-raising.
Definition of Done
A code change is done when:
1. Typecheck and the full test suite pass.
2. Tests have been written for the new behavior: added, modified, or branched logic.
3. CLAUDE.md guidance applicable to the change has been followed (no fallbacks, no dead aliases on rename, etc.).
Treat (2) as a precondition for marking implementation tasks complete, not a follow-up. "Build green" is not "feature done".
Executing Actions with Care
The following actions ALWAYS require explicit human confirmation regardless of which skill / team / workflow / agent is asking. They are universal gates — every teammate, in every flow, honors them. Auto Mode does NOT relax these.
- Merging PRs to main/master: humans merge. Teammates raise + iterate; humans gate.
- Force pushes (git push --force, --force-with-lease): never without explicit human ask. Never force-push to main/master.
- Production writes to customer / shared systems: storefront settings APIs, customer DDB writes, third-party app enables, search-config DDB writes (override-header testing is fine; DDB writes are not), live Shopify theme installs, infra teardowns. Show the diff/plan and wait.
- Destructive git: git reset --hard, branch deletion, history rewrites, rm -rf on tracked dirs, git checkout -- over uncommitted work. Investigate unexpected state before deleting; it may be in-progress work.
- Third-party API writes outside dev cells: anything mutating state in a system you don't own (GitHub releases, Slack posts, external dashboards, payment providers).
- Credential / secret access: reading .env, credentials.json, AWS profile rotations. Only when the user has authorized that specific scope.
When in doubt, surface the action to the user before taking it. The cost of pausing is low; the cost of an unwanted destructive action is high.
Per-skill or per-domain playbooks (e.g. docs/integrations/AGENTS.md for Shopify, docs/dev/orchestration/code-review-guide.md for general dev) may add additional domain-specific gates on top of these. They do not override the universal list.
Adding New Python Components
When asked to add a new Python microservice component (e.g., for domain-specific CRUD operations with business logic), follow the Microservice Pattern guide. This pattern applies to services that:
- Provide domain-specific CRUD operations against a database
- Require business logic validation
- Need to be composable via dependency injection
Do not use this pattern for UIs, scripts, or shared library code.
Python Guidelines
Code Style
Run ruff for formatting and import ordering after changing Python files.
Renaming Internal Symbols
When renaming a function, class, or variable within the same component, update all call sites and imports directly. Do not leave backward-compatible aliases (e.g., old_name = new_name). Callers within the same component should just use the new name.
Package Structure
- Avoid __init__.py files unless they serve a specific purpose (e.g., re-exporting public APIs, package initialization logic)
- Pants handles Python imports without __init__.py files
- Import directly from modules: from mypackage.module import MyClass
Dependency Injection
For ecom and admin lambdas, use a centralized DependencyContainer with @cached_property for singleton services. Wrap container attributes in FastAPI dependency functions for route injection.
Note: The monolith uses injector instead.
Example: components/shopify/admin_server/admin_server/dependencies.py
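from functools import cached_property


class DependencyContainer:
    @cached_property
    def get_sync_service(self) -> SyncService:
        # cached_property builds the service once per container and reuses it.
        return SyncService(
            sync_job_repo=self.get_sync_job_repository,
            api_key_repo=self.get_shopify_api_key_repository,
        )


# FastAPI dependency wrapper
def get_sync_service() -> SyncService:
    # Attribute access, not a call: cached_property exposes the cached instance.
    return container.get_sync_service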
Error Handling
- Create structured exception hierarchies with semantically meaningful error types
- Service layer raises domain errors (e.g., RecordNotFoundError, ValidationError)
- API layer converts domain errors to HTTP status codes
- Reuse common errors from service_utils/errors
- Example: components/shopify/admin_server/admin_server/exceptions/
class SettingsError(Exception):
    """Base exception for settings-related errors."""

    def __init__(self, message: str, shop_id: str, context: dict | None = None):
        # Pass only the message to Exception so str(error) is the message.
        super().__init__(message)
        self.message = message
        self.shop_id = shop_id
        self.context = context or {}


class RecordNotFoundError(SettingsError):
    """Record not found in database."""
    pass


class ValidationError(SettingsError):
    """User input validation failed."""
    pass
Data Modeling
- All models inherit from a base RecordModel with frozen=True
- Use Pydantic for validation; never pass raw dicts between functions
- Example: components/ecom_utils/ecom_utils/models/base.py
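from pydantic import BaseModel, ConfigDict

class RecordModel(BaseModel):
    model_config = ConfigDict(frozen=True)  # instances are immutable

    def to_dynamodb_item(self) -> dict:
        # floats_to_decimals is defined elsewhere (not shown in this excerpt).
        return self.floats_to_decimals(self.model_dump())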
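The floats_to_decimals step is not shown in the excerpt above. A plausible reason for it is that boto3's DynamoDB serializer expects Decimal values and rejects Python floats. The following is a minimal standalone sketch of such a conversion, written as a free function under that assumption; the real method in ecom_utils may differ in signature and in the types it handles.

```python
from decimal import Decimal
from typing import Any


def floats_to_decimals(value: Any) -> Any:
    """Recursively replace floats with Decimals in nested dicts and lists."""
    if isinstance(value, float):
        # Converting via str() keeps the short repr ("19.99") instead of the
        # long binary expansion that Decimal(19.99) would produce.
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: floats_to_decimals(item) for key, item in value.items()}
    if isinstance(value, list):
        return [floats_to_decimals(item) for item in value]
    # Strings, ints, bools, None, and existing Decimals pass through unchanged.
    return value
```

For example, floats_to_decimals({"price": 19.99}) yields {"price": Decimal("19.99")}, while integer and string fields are left as they are.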
Repository & Service Patterns
- Repositories handle data access; services handle business logic
- Services receive repositories via constructor injection
- Let exceptions bubble up to FastAPI (fail-fast)
- Example: components/ecom_utils/ecom_utils/repositories/base_repository.py
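The split between repositories and services can be illustrated with a small, self-contained sketch. Everything in it is hypothetical (Shop, InMemoryShopRepository, ShopService, and the rename rule), and a frozen dataclass stands in for a Pydantic RecordModel so the example needs only the standard library.

```python
from dataclasses import dataclass


class RecordNotFoundError(Exception):
    """Raised when a requested record does not exist."""


@dataclass(frozen=True)
class Shop:
    shop_id: str
    name: str


class InMemoryShopRepository:
    """Data access only: no business rules live here."""

    def __init__(self) -> None:
        self._shops: dict[str, Shop] = {}

    def put(self, shop: Shop) -> None:
        self._shops[shop.shop_id] = shop

    def get(self, shop_id: str) -> Shop:
        try:
            return self._shops[shop_id]
        except KeyError:
            raise RecordNotFoundError(f"shop {shop_id} not found") from None


class ShopService:
    """Business logic only: the repository arrives through the constructor."""

    def __init__(self, shop_repo: InMemoryShopRepository) -> None:
        self._shop_repo = shop_repo

    def rename(self, shop_id: str, new_name: str) -> Shop:
        if not new_name.strip():
            raise ValueError("shop name must not be blank")
        # No try/except here: RecordNotFoundError bubbles up to the API layer.
        current = self._shop_repo.get(shop_id)
        renamed = Shop(shop_id=current.shop_id, name=new_name)
        self._shop_repo.put(renamed)
        return renamed
```

Because the service only sees the repository it was given, a test can pass in an in-memory or moto-backed repository without patching anything.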
Testing
- Services should reuse fixtures from components they depend on via pytest_plugins in conftest.py
- Extract test environment and test data setup to fixtures and builder functions. Keep test case function bodies clean, concise, and focused on the logic under test.
- Mock AWS services with moto; reset config between tests
- For Python/backend development, always run pants test during development
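# conftest.py: pull in fixtures from the components this service depends on
pytest_plugins = [
    "ecom_utils.testing.fixtures",
    "admin_server.testing.fixtures",
]


# Test module: fixtures do the setup, the body covers only the logic under test
def test_sync_service_creates_job(mock_table, sync_service, sample_shop):
    mock_table.put_item(Item=sample_shop.to_dynamodb_item())

    result = sync_service.create_job(sample_shop.shop_id)

    assert result.status == "pending"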
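A builder function, as recommended in the list above, might look like the sketch below. The Shop fields, their default values, and the build_shop name are all made up for illustration, and a frozen dataclass again stands in for a Pydantic model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Shop:
    shop_id: str
    domain: str
    plan: str


def build_shop(**overrides: str) -> Shop:
    """Return a valid Shop, letting each test override only what it cares about."""
    defaults = Shop(shop_id="shop-1", domain="example.myshopify.com", plan="basic")
    # replace() raises TypeError for unknown field names, so typos fail fast.
    return replace(defaults, **overrides)
```

A test that only cares about the plan can then write build_shop(plan="pro") and leave the remaining fields to the defaults.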
TypeScript Guidelines
Code Style
- 2-space indentation
- Double quotes
- 120 char line width
- Trailing commas
- Semicolons
Error Handling (Server-side only)
- Use ClientError class with status, userMessage, and extra fields
- Fail fast in middleware; throw immediately when validation fails
- Example: components/search_proxy/src/errors.ts
export class ClientError extends Error {
  public readonly status: number;
  public readonly userMessage?: string;
  public readonly extra?: Record<string, any>;

  constructor({ message, status, userMessage, extra }: ErrorParams) {
    super(message ?? userMessage ?? "Unknown client error");
    this.status = status ?? 400;
    this.userMessage = userMessage;
    this.extra = extra;
  }
}
Data Modeling
- Define interfaces for domain models; never use plain objects
- Use Zod schemas for configuration validation
- Example: components/search_proxy/src/platform.ts
Logging (Server-side only)
- Use structured JSON logging with request IDs
- Use fluent .with() API for context propagation
- Example: components/search_proxy/src/logger.ts
Testing
- Use Vitest
- Mock with vi.spyOn() and vi.mock()
- Restore mocks after each test
- Keep test bodies focused on logic under test
describe("ClientError", () => {
  it("should create error with default status", () => {
    const error = new ClientError({ message: "Test" });
    expect(error.status).toBe(400);
  });
});
Build, Lint, Test Commands
# Run all tests for a component
pants test //components/{component_name}::
# Run a single test file
pants test //components/{component_name}/path/to/test_file.py
# Run tests with auto-rerun on changes
pants test {targets} --loop
# All formatting and fixing (requires targets)
pants green ::
# TypeScript: Run tests (from component directory)
npm test
# TypeScript: Format component
npm run format
Important Notes
- Always run the stack-appropriate tests during development using docs/dev/component_command_registry.md.
- Don't run E2E tests locally (shopify/e2e_tests) - these run in CI
- Use ruff directly for formatting and fixing Python code. pants fmt fix :: may miss some issues.
Branch Naming
- main is protected and requires a PR to merge.
- Keep branch names ≤ 20 characters (hard limit: 24). infra/sanitize.py (MAX_ENV_BRANCH_LEN = 24) silently truncates branch names at 24 characters when generating dev cell resource names (CloudFormation stacks dev-<branch>-EcomApiStack, API Gateway custom domains dev-<branch>-admin.ecom.marqo-staging.com, Route 53 records, KV namespaces). Two branches sharing the same first 24 characters collapse to the same env_branch token, so when one branch's PR is closed and its dev cell is destroyed, destroy.py deletes the API Gateway domain by name and tears down the other branch's still-live resources. The 20-character target provides headroom; never exceed 24.
- AI/kata agents generating branch names must respect this limit. Auto-generated names like agent-fix-very-long-description are a known source of collisions.
- Good: fix/ecom-sync, feat/v4-workflow
- Bad: feature/workflow-version-header-shopify-e2e (too long: truncated at 24 chars, collides with other long-named branches)
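The collision described above can be reproduced with a few lines. This is a simplified sketch, not the code in infra/sanitize.py: env_branch_token models only the 24-character truncation (the real sanitizer may also rewrite characters), and check_branch_name and within_target are hypothetical helpers an agent could run before creating a branch.

```python
MAX_ENV_BRANCH_LEN = 24  # hard limit quoted above from infra/sanitize.py
TARGET_BRANCH_LEN = 20   # recommended limit from this section


def env_branch_token(branch: str) -> str:
    """Model only the truncation step; the real sanitizing may do more."""
    return branch[:MAX_ENV_BRANCH_LEN]


def within_target(branch: str) -> bool:
    """True when the name leaves the recommended headroom."""
    return len(branch) <= TARGET_BRANCH_LEN


def check_branch_name(branch: str) -> None:
    """Fail fast on names that would be silently truncated."""
    if len(branch) > MAX_ENV_BRANCH_LEN:
        raise ValueError(
            f"branch name {branch!r} has {len(branch)} characters; "
            f"hard limit is {MAX_ENV_BRANCH_LEN}"
        )


# Two distinct long names share their first 24 characters, so both
# map to the same token and would share dev cell resource names.
first = env_branch_token("feature/workflow-version-header-shopify-e2e")
second = env_branch_token("feature/workflow-version-header-admin")
collides = first == second
```

Here both names reduce to feature/workflow-version, which is why closing one branch's PR can tear down the other branch's resources.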
PR Titles (Conventional Commits)
Use conventional commit prefixes in PR titles. The prefix determines how changes appear in release notes:
User-facing (appear in main changelog section):
- feat: - New features
- fix: - Bug fixes
- perf: - Performance improvements
Internal (grouped under "Internal Changes" section):
- build: - Build system changes
- chore: - Maintenance tasks
- ci: - CI/CD changes
- docs: - Documentation only
- refactor: - Code refactoring
- revert: - Reverting changes
- style: - Code style/formatting
- test: - Test changes
Examples:
- feat: Add bulk product sync - User-facing feature
- fix: Correct pagination in search results - User-facing bug fix
- chore: Update dependencies - Internal maintenance
- ci: Add workflow for staging deploys - Internal CI change
- refactor: Extract common search logic - Internal refactoring
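The prefix-to-section rule above can be expressed as a small check. This sketch is illustrative only and is not the release tooling: changelog_section and its return values are hypothetical, and it deliberately ignores Conventional Commits extras such as scopes, e.g. feat(search):, and the ! breaking-change marker.

```python
USER_FACING_PREFIXES = frozenset({"feat", "fix", "perf"})
INTERNAL_PREFIXES = frozenset(
    {"build", "chore", "ci", "docs", "refactor", "revert", "style", "test"}
)


def changelog_section(pr_title: str) -> str:
    """Return which release-notes section a PR title would land in."""
    prefix, separator, summary = pr_title.partition(":")
    if not separator or not summary.strip():
        raise ValueError(f"PR title {pr_title!r} is not in 'prefix: summary' form")
    prefix = prefix.strip()
    if prefix in USER_FACING_PREFIXES:
        return "user-facing"
    if prefix in INTERNAL_PREFIXES:
        return "internal"
    # Unknown prefixes are rejected instead of defaulting to a section.
    raise ValueError(f"unknown conventional commit prefix {prefix!r}")
```

Rejecting unknown prefixes, rather than falling back to a default section, follows the fail-fast principle stated earlier in this guide.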
PR Descriptions
PR descriptions should have a ## Summary section with bullet points summarizing the change. Do NOT add a separate ## Test plan section — any notable testing decisions, patterns, or process details should be included in the summary itself.
Documentation
- Developer documentation in Markdown under /docs
- Ecommerce Diagnostics - jobs, logs, S3 inspection, search proxy queries, common issues
- Shopify Diagnostics - Shopify GraphQL queries, metafield debugging, bulk export constraints
- Shopify Metafield Indexing - how product and variant metafields flow through the indexing pipeline
- Storefront CSS Customization Guide - workflow for onboarding new merchants with custom CSS, agent prompt templates, settings helper scripts
- Per-merchant integration docs in /docs/integrations/ (e.g., MSQC)