Cloud Control Plane Development Guide

Architecture

  • FastAPI web server for CRUD of Marqo Cloud resources, primarily users, accounts, and index configs (specifications for a Marqo search index that we host for customers).
  • Operations delegated by a BFF to microservices responsible for Identity, UsersAccounts, Indexes, Clusters, ApiKeys, Integrations, and Billing.
  • CRUD operations on indexes implemented via long-running workflows triggered by changes to index config records.
  • BFF and all microservices deployed as Monolith, a single Docker image.
  • Hosted on AWS with DDB for persisting records, API Gateway and ECS for serving Monolith, and Step Functions for invoking workflows.

Project Tracking

  • Jira: s2search.atlassian.net, project key CP (Control Plane)
  • Issues labeled agent are suitable for AI agents to pick up

Project Structure

Monorepo with Pants build system for Python code and npm scripts for TypeScript/frontend code.

Top-Level Directories

  • components/ - All application code (services, workers, utilities)
  • tests/ - Component-level test suites (mirrors components/ structure)
  • infra/ - Infrastructure as code (AWS CDK)
  • scripts/ - Utility scripts for project management
  • docs/ - Developer documentation

Core Control Plane Services (Python/FastAPI)

Located in components/. Note: These are not currently used; ignore when implementing.

  • bff_console - Backend-for-frontend serving the console UI
  • identity_service - Authentication and identity management
  • users_accounts_service - User and account CRUD
  • indexes_service - Index configuration management
  • cluster_service - Cluster lifecycle management
  • api_keys_service - API key management
  • billing_service - Billing and subscription handling
  • monolith - Composes all services into a single Docker image

E-commerce Platform (Python)

  • shopify/admin_server - Ecommerce API and Shopify app backend (FastAPI)
  • shopify/admin_ui - Shopify app frontend (React)
  • shopify/storefront_search - Storefront search widget
  • ecom_indexer - Product indexing service
  • ecom_utils - Shared e-commerce utilities and models
  • ecom_metrics_consumer - Metrics processing
  • ecom_settings_exporter - Settings export to KV
  • merchandising_exporter - Merchandising rules export
  • ecom_merchandising_sync - Merchandising rule sync to write-alias target indexes

Cloudflare Workers (TypeScript)

  • search_proxy - Search API proxy with auth and caching
  • agentic_search - AI-powered search features
  • admin_worker - Admin dashboard (React Router + Cloudflare)
  • admin_lambda - Admin API Lambda functions

Shared Utilities

  • apis - Pydantic API models shared across microservices
  • service_utils - Common service utilities (logging, config, errors)
  • ecom_utils - E-commerce models, repositories, testing fixtures

Infrastructure

  • infra/aws - AWS CDK stacks
  • infra/ecom - E-commerce infrastructure (Lambdas, Step Functions)
  • infra/admin - Admin infrastructure
  • infra/controller - Controller infrastructure

Testing

  • tests/console/ - Playwright-based UI tests for the console (TypeScript)
  • components/shopify/e2e_tests - Ecommerce Platform and Shopify app E2E tests (run in CI, or locally against Hippodrome — see components/hippodrome/AGENTS.md "Running E2E Tests")

Core Principles

  1. Fail Fast: No silent failures or fallbacks unless absolutely necessary. Raise exceptions early when validation fails.
  2. Model Everything: Never pass raw dicts/objects. Use Pydantic models (Python) or TypeScript interfaces.
  3. Test Everything: A change isn't complete until tests cover its new behavior. Compilation passing and pre-existing tests passing are not substitutes. New exported functions, new branches in dispatch/decision logic, and new error paths each need a test that would have failed before the change.
  4. Dependency Injection: Use constructor injection in Python services. Dependencies injected via FastAPI Depends().
  5. Catch Errors Sensibly: Use semantically meaningful error types (e.g., RecordNotFoundError). The API layer converts these to HTTP status codes. Never catch generic Exception unless re-raising.

Definition of Done

A code change is done when:

  1. Typecheck and the full test suite pass.
  2. Tests have been written for the new behavior — added, modified, or branched logic.
  3. CLAUDE.md guidance applicable to the change has been followed (no fallbacks, no dead aliases on rename, etc.).

Treat (2) as a precondition for marking implementation tasks complete, not a follow-up. "Build green" is not "feature done".

Executing Actions with Care

The following actions ALWAYS require explicit human confirmation regardless of which skill / team / workflow / agent is asking. They are universal gates — every teammate, in every flow, honors them. Auto Mode does NOT relax these.

  • Merging PRs to main / master — humans merge. Teammates raise + iterate; humans gate.
  • Force pushes (git push --force, --force-with-lease) — never without explicit human ask. Never force-push to main/master.
  • Production writes to customer / shared systems — storefront settings APIs, customer DDB writes, third-party app enables, search-config DDB writes (override-header testing is fine; DDB writes are not), live Shopify theme installs, infra teardowns. Show the diff/plan and wait.
  • Destructive git: git reset --hard, branch deletion, history rewrites, rm -rf on tracked dirs, git checkout -- over uncommitted work. Investigate unexpected state before deleting, since it may be in-progress work.
  • Third-party API writes outside dev cells — anything mutating state in a system you don't own (GitHub releases, Slack posts, external dashboards, payment providers).
  • Credential / secret access — reading .env, credentials.json, AWS profile rotations — only when the user has authorized that specific scope.

When in doubt, surface the action to the user before taking it. The cost of pausing is low; the cost of an unwanted destructive action is high.

Per-skill or per-domain playbooks (e.g. docs/integrations/AGENTS.md for Shopify, docs/dev/orchestration/code-review-guide.md for general dev) may add additional domain-specific gates on top of these. They do not override the universal list.

Adding New Python Components

When asked to add a new Python microservice component (e.g., for domain-specific CRUD operations with business logic), follow the Microservice Pattern guide. This pattern applies to services that:

  • Provide domain-specific CRUD operations against a database
  • Require business logic validation
  • Need to be composable via dependency injection

Do not use this pattern for UIs, scripts, or shared library code.


Python Guidelines

Code Style

Run ruff for formatting and import ordering after changing Python files.

Renaming Internal Symbols

When renaming a function, class, or variable within the same component, update all call sites and imports directly. Do not leave backward-compatible aliases (e.g., old_name = new_name). Callers within the same component should just use the new name.

Package Structure

  • Avoid __init__.py files unless they serve a specific purpose (e.g., re-exporting public APIs, package initialization logic)
  • Pants handles Python imports without __init__.py files
  • Import directly from modules: from mypackage.module import MyClass

Dependency Injection

For ecom and admin lambdas, use a centralized DependencyContainer with @cached_property for singleton services. Wrap container methods in FastAPI dependency functions for route injection.

Note: The monolith uses injector instead.

Example: components/shopify/admin_server/admin_server/dependencies.py

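from functools import cached_property

class DependencyContainer:
    @cached_property
    def get_sync_service(self) -> SyncService:
        # Built once on first access; later accesses reuse the cached instance.
        return SyncService(
            sync_job_repo=self.get_sync_job_repository,
            api_key_repo=self.get_shopify_api_key_repository,
        )

# FastAPI dependency wrapper
def get_sync_service() -> SyncService:
    # cached_property is read as an attribute, so there is no call here.
    return container.get_sync_service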
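
To illustrate why @cached_property yields singleton services, here is a minimal self-contained sketch. The Repo, Service, and Container classes are hypothetical stand-ins, not the actual admin_server types.

```python
from functools import cached_property


class Repo:
    """Hypothetical repository stand-in."""


class Service:
    """Hypothetical service that receives its repository via the constructor."""

    def __init__(self, repo: Repo) -> None:
        self.repo = repo


class Container:
    @cached_property
    def repo(self) -> Repo:
        # Built on first access, then cached on the container instance.
        return Repo()

    @cached_property
    def service(self) -> Service:
        return Service(repo=self.repo)


container = Container()

# Repeated access returns the same objects, so every caller shares one instance.
assert container.service is container.service
assert container.service.repo is container.repo
```

Caching is per container instance, so a test can create a fresh container to get fresh services.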

Error Handling

  • Create structured exception hierarchies with semantically meaningful error types
  • Service layer raises domain errors (e.g., RecordNotFoundError, ValidationError)
  • API layer converts domain errors to HTTP status codes
  • Reuse common errors from service_utils/errors
  • Example: components/shopify/admin_server/admin_server/exceptions/
class SettingsError(Exception):
    """Base exception for settings-related errors."""

    def __init__(self, message: str, shop_id: str, context: dict | None = None):
        # Pass the message to Exception so str(error) is readable.
        super().__init__(message)
        self.message = message
        self.shop_id = shop_id
        self.context = context or {}

class RecordNotFoundError(SettingsError):
    """Record not found in database."""

class ValidationError(SettingsError):
    """User input validation failed."""
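
The conversion from domain errors to HTTP status codes can be sketched as a lookup table. This is a framework-free illustration: STATUS_BY_ERROR and to_http_status are hypothetical names, the specific status codes are assumptions, and the real API layer may rely on FastAPI exception handlers instead.

```python
class RecordNotFoundError(Exception):
    """Record not found in database."""


class ValidationError(Exception):
    """User input validation failed."""


# Hypothetical mapping owned by the API layer; services never see status codes.
STATUS_BY_ERROR: dict[type[Exception], int] = {
    RecordNotFoundError: 404,
    ValidationError: 422,
}


def to_http_status(error: Exception) -> int:
    """Return the HTTP status for a known domain error; re-raise anything else."""
    for error_type, status in STATUS_BY_ERROR.items():
        if isinstance(error, error_type):
            return status
    # Fail fast: an unmapped error is a bug, not something to hide behind a default.
    raise error
```

Keeping the table in the API layer means services stay unaware of HTTP, which matches the layering described above.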

Data Modeling

  • All models inherit from a base RecordModel with frozen=True
  • Use Pydantic for validation; never pass raw dicts between functions
  • Example: components/ecom_utils/ecom_utils/models/base.py
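from pydantic import BaseModel, ConfigDict

class RecordModel(BaseModel):
    # frozen=True makes instances immutable after construction.
    model_config = ConfigDict(frozen=True)

    def to_dynamodb_item(self) -> dict:
        # boto3's DynamoDB serializer rejects floats, so convert them to Decimal.
        return self.floats_to_decimals(self.model_dump())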
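
The floats_to_decimals helper is not shown in this guide. boto3's DynamoDB serializer rejects Python floats and expects Decimal, so a helper of this kind typically walks the dumped model recursively. The following is a sketch under that assumption, written as a standalone function; it is not the actual implementation in base.py.

```python
from decimal import Decimal
from typing import Any


def floats_to_decimals(value: Any) -> Any:
    """Recursively replace floats with Decimal so the item is DynamoDB-safe."""
    if isinstance(value, float):
        # Convert via str() so 0.1 becomes Decimal("0.1"), not its binary expansion.
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: floats_to_decimals(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [floats_to_decimals(item) for item in value]
    # Strings, ints, bools, None, and existing Decimals pass through unchanged.
    return value
```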

Repository & Service Patterns

  • Repositories handle data access; services handle business logic
  • Services receive repositories via constructor injection
  • Let exceptions bubble up to FastAPI (fail-fast)
  • Example: components/ecom_utils/ecom_utils/repositories/base_repository.py
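
The split between repository and service can be sketched as follows. All names here (Shop, InMemoryShopRepository, ShopService) are hypothetical, and a frozen dataclass stands in for a Pydantic RecordModel so the sketch has no third-party dependencies.

```python
from dataclasses import dataclass


class RecordNotFoundError(Exception):
    """Raised when a requested record does not exist."""


@dataclass(frozen=True)
class Shop:
    shop_id: str
    name: str


class InMemoryShopRepository:
    """Data access only: no business rules live here."""

    def __init__(self) -> None:
        self._shops: dict[str, Shop] = {}

    def put(self, shop: Shop) -> None:
        self._shops[shop.shop_id] = shop

    def get(self, shop_id: str) -> Shop:
        try:
            return self._shops[shop_id]
        except KeyError:
            raise RecordNotFoundError(f"shop {shop_id} not found") from None


class ShopService:
    """Business logic; the repository is injected through the constructor."""

    def __init__(self, shop_repo: InMemoryShopRepository) -> None:
        self._shop_repo = shop_repo

    def rename(self, shop_id: str, new_name: str) -> Shop:
        if not new_name.strip():
            raise ValueError("new_name must not be blank")
        # No try/except here: RecordNotFoundError bubbles up to the API layer.
        current = self._shop_repo.get(shop_id)
        renamed = Shop(shop_id=current.shop_id, name=new_name)
        self._shop_repo.put(renamed)
        return renamed
```

Because the service only depends on what is passed to its constructor, a test can hand it an in-memory repository without patching anything.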

Testing

  • Services should reuse fixtures from components they depend on via pytest_plugins in conftest.py
  • Extract test environment and test data setup to fixtures and builder functions. Keep test case function bodies clean, concise, and focused on the logic under test.
  • Mock AWS services with moto; reset config between tests
  • For Python/backend development, always run pants test during development
# conftest.py
pytest_plugins = [
    "ecom_utils.testing.fixtures",
    "admin_server.testing.fixtures",
]

# Test module: fixtures supply the environment; the body covers only the logic under test.
def test_sync_service_creates_job(mock_table, sync_service, sample_shop):
    mock_table.put_item(Item=sample_shop.to_dynamodb_item())
    result = sync_service.create_job(sample_shop.shop_id)
    assert result.status == "pending"
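
A builder function, as recommended above, might look like the following sketch. Shop, build_shop, and is_over_limit are hypothetical names used only for illustration, and a frozen dataclass stands in for a Pydantic model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shop:
    shop_id: str
    plan: str
    product_count: int


def build_shop(**overrides) -> Shop:
    """Return a valid Shop, letting each test override only what it cares about."""
    defaults = {"shop_id": "shop-1", "plan": "basic", "product_count": 10}
    return Shop(**{**defaults, **overrides})


def is_over_limit(shop: Shop, limit: int) -> bool:
    """Hypothetical logic under test."""
    return shop.product_count > limit


def test_shop_over_limit() -> None:
    # Only the field relevant to this test is spelled out.
    shop = build_shop(product_count=501)
    assert is_over_limit(shop, limit=500)
```

The test body stays at two lines because every irrelevant field comes from the builder's defaults.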

TypeScript Guidelines

Code Style

  • 2-space indentation
  • Double quotes
  • 120 char line width
  • Trailing commas
  • Semicolons

Error Handling (Server-side only)

  • Use ClientError class with status, userMessage, and extra fields
  • Fail fast in middleware; throw immediately when validation fails
  • Example: components/search_proxy/src/errors.ts
export class ClientError extends Error {
  public readonly status: number;
  public readonly userMessage?: string;
  public readonly extra?: Record<string, any>;

  constructor({ message, status, userMessage, extra }: ErrorParams) {
    super(message ?? userMessage ?? "Unknown client error");
    this.status = status ?? 400;
    this.userMessage = userMessage;
    this.extra = extra;
  }
}

Data Modeling

  • Define interfaces for domain models; never use plain objects
  • Use Zod schemas for configuration validation
  • Example: components/search_proxy/src/platform.ts

Logging (Server-side only)

  • Use structured JSON logging with request IDs
  • Use fluent .with() API for context propagation
  • Example: components/search_proxy/src/logger.ts

Testing

  • Use Vitest
  • Mock with vi.spyOn() and vi.mock()
  • Restore mocks after each test
  • Keep test bodies focused on logic under test
describe("ClientError", () => {
  it("should create error with default status", () => {
    const error = new ClientError({ message: "Test" });
    expect(error.status).toBe(400);
  });
});

Build, Lint, Test Commands

# Run all tests for a component
pants test //components/{component_name}::

# Run a single test file
pants test //components/{component_name}/path/to/test_file.py

# Run tests with auto-rerun on changes
pants test {targets} --loop

# All formatting and fixing (requires targets)
pants green ::

# TypeScript: Run tests (from component directory)
npm test

# TypeScript: Format component
npm run format

Important Notes

  • Always run the stack-appropriate tests during development using docs/dev/component_command_registry.md.
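  • Don't run E2E tests (shopify/e2e_tests) locally by default; they run in CI. The exception is running them against Hippodrome, as described in components/hippodrome/AGENTS.md.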
  • Use ruff directly for formatting and fixing Python code. pants fmt fix :: may miss some issues.

Branch Naming

  • main is protected and requires a PR to merge.
  • Keep branch names ≤ 20 characters (hard limit: 24). infra/sanitize.py (MAX_ENV_BRANCH_LEN = 24) silently truncates branch names at 24 characters when generating dev cell resource names (CloudFormation stacks dev-<branch>-EcomApiStack, API Gateway custom domains dev-<branch>-admin.ecom.marqo-staging.com, Route 53 records, KV namespaces). Two branches sharing the same first 24 characters collapse to the same env_branch token, so when one branch's PR is closed and its dev cell is destroyed, destroy.py deletes the API Gateway domain by name and tears down the other branch's still-live resources. The 20-character target provides headroom; never exceed 24.
  • AI/kata agents generating branch names must respect this limit. Auto-generated names like agent-fix-very-long-description are a known source of collisions.
  • Good: fix/ecom-sync, feat/v4-workflow
  • Bad: feature/workflow-version-header-shopify-e2e (too long — truncated at 24 chars, collides with other long-named branches)
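
The collision described above can be reproduced with a simplified model of the truncation. This is a sketch, not the code in infra/sanitize.py: env_branch and check_branch_name are hypothetical helpers, and the real sanitization may also rewrite characters before truncating.

```python
MAX_ENV_BRANCH_LEN = 24  # hard limit stated above
TARGET_BRANCH_LEN = 20   # recommended target stated above


def env_branch(branch: str) -> str:
    """Simplified stand-in for the truncation step: keep the first 24 characters."""
    return branch[:MAX_ENV_BRANCH_LEN]


def check_branch_name(branch: str) -> None:
    """Fail fast instead of letting a long name be truncated silently."""
    if len(branch) > MAX_ENV_BRANCH_LEN:
        raise ValueError(
            f"branch name {branch!r} is {len(branch)} characters; "
            f"hard limit is {MAX_ENV_BRANCH_LEN}, target is {TARGET_BRANCH_LEN}"
        )


a = "feature/workflow-version-header-shopify-e2e"
b = "feature/workflow-version-header-console"  # hypothetical second branch
# Both names collapse to the same token, so their dev cells would share resource names.
assert env_branch(a) == env_branch(b) == "feature/workflow-version"
```

A check like check_branch_name could run before a branch is created, so an agent-generated name is rejected up front rather than discovered through a torn-down dev cell.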

PR Titles (Conventional Commits)

Use conventional commit prefixes in PR titles. The prefix determines how changes appear in release notes:

User-facing (appear in main changelog section):

  • feat: - New features
  • fix: - Bug fixes
  • perf: - Performance improvements

Internal (grouped under "Internal Changes" section):

  • build: - Build system changes
  • chore: - Maintenance tasks
  • ci: - CI/CD changes
  • docs: - Documentation only
  • refactor: - Code refactoring
  • revert: - Reverting changes
  • style: - Code style/formatting
  • test: - Test changes

Examples:

  • feat: Add bulk product sync - User-facing feature
  • fix: Correct pagination in search results - User-facing bug fix
  • chore: Update dependencies - Internal maintenance
  • ci: Add workflow for staging deploys - Internal CI change
  • refactor: Extract common search logic - Internal refactoring
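
The prefix-to-section rule above can be expressed as a small classifier. This is an illustrative sketch: changelog_section is a hypothetical helper, not part of the release tooling, and it does not handle scoped prefixes such as feat(scope): because this guide does not mention them.

```python
USER_FACING = {"feat", "fix", "perf"}
INTERNAL = {"build", "chore", "ci", "docs", "refactor", "revert", "style", "test"}


def changelog_section(pr_title: str) -> str:
    """Return "user-facing" or "internal" based on the conventional commit prefix."""
    prefix, separator, summary = pr_title.partition(":")
    if not separator or not summary.strip():
        raise ValueError(f"PR title {pr_title!r} must look like '<prefix>: <summary>'")
    if prefix in USER_FACING:
        return "user-facing"
    if prefix in INTERNAL:
        return "internal"
    # Unknown prefixes are rejected rather than silently filed somewhere.
    raise ValueError(f"unknown conventional commit prefix {prefix!r}")
```

For example, changelog_section("feat: Add bulk product sync") returns "user-facing", while a title without a recognized prefix raises ValueError.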

PR Descriptions

PR descriptions should have a ## Summary section with bullet points summarizing the change. Do NOT add a separate ## Test plan section — any notable testing decisions, patterns, or process details should be included in the summary itself.

Documentation

  • Developer documentation in Markdown under /docs
  • Ecommerce Diagnostics — jobs, logs, S3 inspection, search proxy queries, common issues
  • Shopify Diagnostics — Shopify GraphQL queries, metafield debugging, bulk export constraints
  • Shopify Metafield Indexing — how product and variant metafields flow through the indexing pipeline
  • Storefront CSS Customization Guide — workflow for onboarding new merchants with custom CSS, agent prompt templates, settings helper scripts
  • Per-merchant integration docs in /docs/integrations/ (e.g., MSQC)