Hippodrome Component

Workflow Tips

Check @/AGENTS.md whenever you want repo-wide context or optional workflow suggestions, and adapt whatever helps this task.

Development environment orchestrator for running all Cloud Control Plane components in one place for system testing. Starts all control plane services with proper configuration and service discovery.

Quick Start

# `pants hd` is a built-in alias (see pants.toml) for running the Hippodrome CLI.
# It handles PANTS_CONCURRENT automatically.

# Explain the CLI
pants hd --help

# Use a fixed seed so that the generated seed data is the same on every run.
export HIPPODROME_RANDOM_SEED=42

# Start with ecom profile (includes admin_server, search_proxy, etc.)
pants hd up --profile ecom

# Connect to staging cell instead of fake_cell
pants hd up --profile ecom --cell staging

# Use custom table prefix
pants hd up --profile ecom --table-prefix my-feature-

# With explicit project root
pants hd up --project-root /path/to/project

Running E2E Tests Against Local Hippodrome

All E2E tests run fully locally against moto (AWS emulation) and fake_cell. No remote credentials are required (no AWS SSO, Shopify, or Cloudflare tokens). Tests that need real external credentials will skip automatically in hippodrome mode.

Environment setup

Source the convenience env file (sets ENVIRONMENT=hippodrome, HIPPODROME_RANDOM_SEED=42, PANTS_CONCURRENT=True):

source components/hippodrome/.env.e2e

This tells the E2E config loader to read components/shopify/e2e_tests/.env.hippodrome, which points service URLs at localhost and uses dummy AWS credentials for moto.

1) Verify Hippodrome is running and ready

After pants hd up, services take 60-90 seconds to become fully healthy. Use this to check:

# Quick check: is the dashboard responding?
curl -sf http://localhost:9000/api/status

# Detailed check: wait until all services report "running"
curl -s http://localhost:9000/api/status | python3 -c "
import sys, json
data = json.load(sys.stdin)
for name, info in data.items():
    status = info.get('status', 'unknown')
    symbol = '✓' if status == 'running' else '✗'
    print(f'  {symbol} {name}: {status}')
not_running = [n for n, i in data.items() if i.get('status') != 'running']
if not_running:
    print('\nWaiting for: ' + ', '.join(not_running))
    sys.exit(1)
print('\nAll services ready.')
"
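For scripted waits, the same check can be written as a small Python helper. This is a minimal sketch, assuming the response shape used above (a JSON object mapping each service name to an object with a `status` field). The names `not_running` and `wait_until_ready` are illustrative and not part of Hippodrome.

```python
import json
import time
import urllib.error
import urllib.request

STATUS_URL = "http://localhost:9000/api/status"  # dashboard endpoint from this guide


def not_running(status: dict) -> list[str]:
    """Return names of services whose status is anything other than 'running'."""
    return [name for name, info in status.items() if info.get("status") != "running"]


def wait_until_ready(timeout_s: float = 120.0, interval_s: float = 5.0) -> bool:
    """Poll the dashboard until every service reports 'running' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
                pending = not_running(json.load(resp))
        except (urllib.error.URLError, OSError, ValueError):
            pending = ["dashboard"]  # unreachable, or the body was not valid JSON
        if not pending:
            return True
        time.sleep(interval_s)
    return False
```

The helper only reads the status API and never starts or stops anything, so it stays within the rule that agents do not manage the stack themselves.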

Layered startup order: Services start in dependency layers. Each layer waits for its health-checked services to become healthy before the next layer starts. Shutdown happens in reverse layer order.

Layer	Services	Approx time
0	fake_cognito, fake_cell	~5-15s
1	controller	~15-30s
2	admin_server, admin_lambda, search_proxy, ecom_indexer, exporters, console	~30-60s
3	agentic_search, admin_worker	~60-90s
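The layering rule can be sketched in a few lines of Python. This is an illustration only, not the orchestrator's actual code: `start`, `is_healthy`, and `stop` are hypothetical callables standing in for process spawning and health checks, and the sketch treats every service as health-checked.

```python
import time
from typing import Callable, Sequence


def start_in_layers(
    layers: Sequence[Sequence[str]],
    start: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    timeout_s: float = 90.0,
    poll_s: float = 0.5,
) -> list[str]:
    """Start each layer, then block until all of its services are healthy."""
    started: list[str] = []
    for layer in layers:
        for name in layer:
            start(name)
            started.append(name)
        deadline = time.monotonic() + timeout_s
        while not all(is_healthy(name) for name in layer):
            if time.monotonic() > deadline:
                raise TimeoutError(f"layer did not become healthy: {list(layer)}")
            time.sleep(poll_s)
    return started


def stop_in_reverse(layers: Sequence[Sequence[str]], stop: Callable[[str], None]) -> None:
    """Stop services layer by layer, last layer first."""
    for layer in reversed(layers):
        for name in layer:
            stop(name)
```

Because a layer only starts once the previous one is healthy, a service in layer 2 can assume its layer 0 and layer 1 dependencies are already answering requests.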

If the status check fails or returns no response, Hippodrome is not running. Do not start Hippodrome yourself; tell the user:

Hippodrome does not appear to be running (localhost:9000 is not responding). Please start it with pants hd up --profile ecom and let me know when it's ready.

2) If E2E tests fail but status looks healthy, run smoke tests

pants test //tests/hippodrome::

This runs lightweight checks against every Hippodrome service (dashboard, moto, fake_cell, fake_cognito, admin_server, search_proxy, seeded data). Completes in under 10 seconds and is safe to run alongside other work. If smoke tests pass, the problem is in the component or test code, not Hippodrome.

3) Run Shopify E2E tests

# Single test file
pants test //components/shopify/e2e_tests/e2e_tests/tests/ecom_onboarding_test.py -- -v

# All E2E tests (tests needing external creds will skip)
pants test //components/shopify/e2e_tests/e2e_tests/tests:: -- -v

Tests that work fully locally (no external creds):

ecom_onboarding_test.py - Ecom account onboarding flow
ecom_onboarding_text_test.py - Text-based onboarding variant
ecom_import_export_test.py - Import/export workflows
ecom_aliasing_test.py - Index aliasing
ecom_analytics_test.py - Analytics tracking
index_forking_test.py - Index fork workflow

Tests that skip in hippodrome mode (need Shopify creds):

shopify_onboarding_test.py

4) Run Hippothesis smoke tests

Hippothesis is a lightweight smoke-test suite that probes service health. No env file is needed; it defaults to the local environment.

# Hello-world probe (just checks services are reachable)
pants test //components/hippothesis/hippothesis:tests

# Full hippothesis suite (includes property-based search tests)
pants test //components/hippothesis::

How it works (no-creds architecture)

┌─────────────────────────────────────────────────────────┐
│  Hippodrome (all local, no remote dependencies)         │
│                                                         │
│  moto (:9003)          ← DynamoDB, S3, SQS, SNS, SFN   │
│  fake_cognito (:9012)  ← Auth token validation          │
│  fake_cell (:9001)     ← Marqo data plane mock          │
│  controller (:9002)    ← Control plane API (Django)     │
│  admin_server (:9004)  ← Ecom backend (FastAPI)         │
│  search_proxy (:9005)  ← Search API (Cloudflare Worker) │
│  console (:9008)       ← Web UI (React)                 │
└─────────────────────────────────────────────────────────┘
         ▲
         │  ENVIRONMENT=hippodrome
         │  AWS creds → dummy (testing/testing)
         │  Service URLs → localhost
         │
┌────────┴────────┐
│  E2E test suite  │
└─────────────────┘

moto emulates DynamoDB, S3, SQS, SNS, Secrets Manager, Step Functions — no real AWS account needed
fake_cognito issues JWT tokens locally — no real Cognito pool needed
fake_cell mocks the Marqo data plane — no real Marqo deployment needed
Shopify/Cloudflare credentials are optional; tests that need them skip gracefully

Known local behavior

Under moto + local queue wiring, admin add-docs can intermittently fail with SQS 500s.
The simple health test falls back to direct index POST /documents for the single health doc when that happens.
This fallback is only for the deterministic health scenario; fuzz suites still exercise normal API paths.
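The fallback can be pictured roughly as follows. This is a hedged sketch, not the health test's real code: `AdminApiError`, `add_via_admin`, and `add_direct` are hypothetical names standing in for the admin add-docs call and the direct POST /documents call.

```python
from typing import Callable


class AdminApiError(Exception):
    """Raised by the hypothetical admin client when the API returns an error status."""

    def __init__(self, status: int) -> None:
        super().__init__(f"admin API returned {status}")
        self.status = status


def add_health_doc(
    doc: dict,
    add_via_admin: Callable[[dict], None],
    add_direct: Callable[[dict], None],
) -> str:
    """Try the normal admin path first; fall back to the direct index write on 5xx."""
    try:
        add_via_admin(doc)
        return "admin"
    except AdminApiError as exc:
        if exc.status < 500:
            raise  # client errors are real failures, not local queue flakiness
        add_direct(doc)
        return "direct"
```

The fallback is deliberately narrow: it covers one known document in one deterministic test, which is why it does not conflict with the rule against bypassing services elsewhere.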

Never Bypass Services for Expedience

Hippodrome runs the same service topology as production. Never skip a service in the request path (e.g., calling admin_server directly instead of going through search_proxy) to work around a bug or speed things up. If a request fails at search_proxy, fix the root cause in search_proxy — do not route around it. Bypassing a service hides bugs and creates divergence between local and production behavior.

Hippodrome Is Long-Running — Do Not Kill Services

Hippodrome is designed to run continuously in the background. The user starts it once and leaves it running. Agents must not kill, restart, or stop Hippodrome services unless the user explicitly asks.

Hot reload handles code changes for most services. admin_server and controller use --reload; the console uses HMR. Editing those source files triggers automatic restarts. fake_cell and fake_cognito run without --reload (it causes deadlocks under concurrent load); restart hippodrome to pick up changes to those services.
Restarting Hippodrome is needed for changes to Hippodrome itself (e.g., config/config.py, management/orchestrator.py, cli.py) or to fake_cell/fake_cognito source code.
If you believe a restart is needed, ask the user to do it rather than running restart/stop/kill commands yourself. The user may have other work depending on the running stack.
Never run kill -9 against processes on Hippodrome ports (9000–9019) unless the user explicitly instructs you to.
To check if Hippodrome is running, call curl -s http://localhost:9000/api/status. This returns JSON with the status of all services. Do not run pants hd commands just to check status — the dashboard API is instant and non-disruptive.

Agent Iteration Loop

# User starts daemon once (auto-starts services by default)
pants hd up --profile ecom

# Check status via the dashboard API (preferred — no pants overhead)
curl -s http://localhost:9000/api/status | python3 -m json.tool

# If needed, user manages stack lifecycle
pants hd control restart
pants hd control stop
pants hd control start

# User ends the session cleanly
pants hd control shutdown

Troubleshooting First

Before debugging any error, read troubleshooting.md in this directory. It contains solutions to common problems. If you solve a new problem, add it to troubleshooting.md.

Log Files

All service logs are written to timestamped directories in components/hippodrome/.logs/:

components/hippodrome/.logs/
├── latest -> 20260120-143022  # Symlink to most recent run
├── 20260120-143022/
│   ├── fake_cell.log
│   ├── controller.log
│   ├── console.log
│   ├── admin_server.log
│   └── ...
└── 20260120-140815/
    └── ...

Use these log files when:

Debugging test failures that occur in hippodrome services
Investigating service startup issues
Analyzing errors that scroll off the terminal
Reviewing complete service output history

Quick access to latest logs:

# View all logs from latest run
ls components/hippodrome/.logs/latest/

# Tail a specific service
tail -f components/hippodrome/.logs/latest/controller.log

# Search for errors across all services
grep -i error components/hippodrome/.logs/latest/*.log

Component Structure

hippodrome/
├── AGENTS.md                  # This file - instructions for AI assistants
├── troubleshooting.md         # Common problems and solutions (read first!)
├── BUILD                      # Pants build configuration
└── hippodrome/
    ├── cli.py                 # Click CLI entry point
    ├── config/config.py       # Service configurations (ports, env vars)
    ├── management/orchestrator.py  # Process spawner and manager
    ├── dashboard/dashboard.py      # HTTP server for status page
    ├── dashboard/dashboard.html    # Single-file HTML dashboard
    ├── events/                # Event handling (tunnels, etc.)
    ├── workflows/             # Step Functions workflow engine
    └── wrappers/              # Service wrappers (moto, etc.)

Profiles

The --profile flag controls which services are started:

Profile	Services	Use Case
`core`	fake_cell, controller, console	Basic control plane development
`ecom`	core + admin_server, search_proxy, indexers, exporters	E-commerce development
`full` (default)	All services	Full stack testing

Profile Examples

# Default (full): Starts all services
pants hd up

# Ecom: Starts core + admin_server (9004), search_proxy (9005), indexers, exporters
pants hd up --profile ecom

Cell Connection

The --cell flag controls which cell to connect to:

Cell	Description	fake_cell Started?
`local` (default)	Use local fake_cell	Yes
`staging`	Connect to staging cell	No
`prod`	Connect to production cell	No

When using staging or prod, the fake_cell service is not started and services connect to the real deployed cell.

Warning: Using --cell prod connects to production data. A warning message is displayed.

# Use staging cell (skips fake_cell)
pants hd up --profile ecom --cell staging

Table Prefix

The --table-prefix flag controls DynamoDB table naming:

# Default: dev-{git-branch}- (e.g., "dev-feature-auth-")
pants hd up --profile ecom

# Custom prefix
pants hd up --profile ecom --table-prefix my-feature-

Branch names with special characters (slashes, etc.) are sanitized: feature/add-auth → dev-feature-add-auth-
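A sanitizer that produces the example above could look like this. It is a sketch under an assumption: every run of characters other than letters and digits is replaced with a single hyphen. The real rule in Hippodrome may differ (for example in case handling), and `default_table_prefix` is an illustrative name.

```python
import re


def default_table_prefix(branch: str) -> str:
    """Build a 'dev-<branch>-' prefix, replacing runs of unsafe characters with '-'."""
    safe = re.sub(r"[^A-Za-z0-9]+", "-", branch).strip("-")
    return f"dev-{safe}-"
```

For instance, `default_table_prefix("feature/add-auth")` returns `"dev-feature-add-auth-"`.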

Services Managed

Core Services (Profile: core, ecom, full)

Service	Port	Description
dashboard	9000	Status dashboard (http://localhost:9000)
fake_cell	9001	Data plane mock (skipped with --cell staging/prod)
controller	9002	Control plane API (Django)
console	9008	Web dashboard UI (React, non-blocking)

E-commerce Services (Profile: ecom, full)

Service	Port	Description
admin_server	9004	E-commerce backend (FastAPI)
search_proxy	9005	Search API gateway (Cloudflare Worker)
ecom_indexer_service	9018	Go document indexer service
ecom_ingest	9019	Local document ingest buffer API

External Services (Manual Setup Required)

Service	Port	Description
global_worker	9012	Search query router (external repo, Cloudflare Worker)

global_worker Setup (External Repository)

The global_worker is a Cloudflare Worker that handles search query routing, merchandising rules, and caching. It lives in a separate repository and must be set up manually for full search functionality.

Why global_worker is Needed

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│search_proxy │────▶│global_worker│────▶│  fake_cell  │
│   :9005     │     │   :9012     │     │   :9001     │
└─────────────┘     └─────────────┘     └─────────────┘

search_proxy receives search API requests
global_worker applies merchandising rules and caching
global_worker proxies the final request to the cell (fake_cell or real cell)

Setup Instructions

Clone the global-worker repository:

# Clone to a directory outside this repo
cd ~/dev
git clone git@github.com:marqo-ai/global-worker.git
cd global-worker

Install dependencies:
```
npm install
```

Create local wrangler configuration: Create a wrangler.local.toml file (or use an existing one) with:

name = "local-global-worker"
main = "src/index.ts"
compatibility_date = "2024-09-23"
compatibility_flags = ["nodejs_compat"]

[vars]
ENV = "dev"
FULL_ENV = "dev-local"
# Point to fake_cell for local development
CELL_URL = "http://localhost:9001"

[dev]
port = 9012
local_protocol = "http"

Start global_worker:

npx wrangler dev --config wrangler.local.toml --port 9012

Running Without global_worker

If you don't need to test the full search flow, you can skip global_worker setup:

search_proxy will return errors for search requests that require global_worker
admin_server and other write operations will still work
Direct cell operations (via controller) are unaffected

Verifying Setup

Once global_worker is running:

# Check global_worker health (if health endpoint exists)
curl http://localhost:9012/health

# Test full search flow (requires valid index and data)
curl -X POST http://localhost:9005/indexes/test-index/search \
  -H "Content-Type: application/json" \
  -d '{"q": "test query"}'

Key Files

File	Purpose
`cli.py`	Entry point - parses arguments, runs orchestrator
`config/config.py`	`ServiceConfig` dataclass, `get_service_configs()` returns services in dependency order
`management/orchestrator.py`	`Orchestrator` class handles process lifecycle, log aggregation, graceful shutdown
`dashboard/dashboard.py`	`DashboardServer` runs HTTP server in a thread, serves status API
`dashboard/dashboard.html`	Single-file HTML/CSS/JS dashboard with auto-refresh

Service Configuration Pattern

Services are configured in config/config.py:

ServiceConfig(
    name="service_name",
    port=9001,
    command=["pants", "run", "//path/to/app.py", "--", "--reload"],
    env={"ENV_VAR": "value"},
    cwd=project_root,  # or relative path like "components/console"
    blocking=True,      # False = orchestrator continues if service fails
)

Adding a New Service

Add a ServiceConfig in config/config.py in the appropriate dependency layer
Services in Layer 0 have no dependencies
Services in Layer 1+ depend on earlier layers
Set blocking=False for optional services (like console)
Set profiles=frozenset({Profile.ECOM, Profile.FULL}) to include in specific profiles

Example for an ecom service:

ServiceConfig(
    name="my_service",
    port=9020,
    command=["pants", "run", "//components/my_service:local", "--", "--reload"],
    env={
        "CELL_URL": cell_url,
        **table_names,  # Inject all table names
    },
    profiles=frozenset({Profile.ECOM, Profile.FULL}),
)
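To show how the `profiles` field might be used to select services, here is a self-contained sketch. `ServiceSketch` and `services_for` are hypothetical stand-ins, and the enum values and the "all profiles" default are assumptions; the real `ServiceConfig` and `Profile` live in config/config.py and may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Profile(Enum):
    CORE = "core"
    ECOM = "ecom"
    FULL = "full"


ALL_PROFILES = frozenset(Profile)


@dataclass(frozen=True)
class ServiceSketch:
    """Cut-down stand-in for ServiceConfig with only the fields used here."""

    name: str
    port: int
    blocking: bool = True
    profiles: frozenset = ALL_PROFILES  # assumed default: included in every profile


def services_for(profile: Profile, configs: list[ServiceSketch]) -> list[str]:
    """Keep services tagged with the requested profile, preserving list order."""
    return [c.name for c in configs if profile in c.profiles]
```

Keeping the filter order-preserving matters because the configs are listed in dependency order, so the filtered list can still be started front to back.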

Development Commands

# Run linting
pants lint //components/hippodrome::

# Run tests
pants test //components/hippodrome::

# Check CLI help
pants hd --help
pants hd up --help

Design Decisions

No __init__.py files - Uses namespace packages per project convention
Async subprocess management - Uses asyncio.create_subprocess_exec for non-blocking process spawning
Threaded dashboard server - Uses Python's built-in http.server in a daemon thread to avoid external dependencies
Colored log prefixes - Each service gets a unique ANSI color for easy identification
Graceful shutdown - SIGTERM with 5-second timeout, then SIGKILL
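The async subprocess and graceful shutdown decisions combine into a pattern like the one below. It is a minimal sketch using only the standard library; `stop_gracefully` and `demo` are illustrative names, and the orchestrator's real implementation may be structured differently.

```python
import asyncio
import sys


async def stop_gracefully(proc: asyncio.subprocess.Process, timeout_s: float = 5.0) -> int:
    """Send SIGTERM, wait up to timeout_s, then SIGKILL if the process is still alive."""
    if proc.returncode is None:
        proc.terminate()  # SIGTERM on POSIX
        try:
            await asyncio.wait_for(proc.wait(), timeout=timeout_s)
        except asyncio.TimeoutError:
            proc.kill()  # SIGKILL on POSIX
            await proc.wait()
    return proc.returncode


async def demo() -> int:
    # Spawn a child that would otherwise sleep for a minute, then stop it.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await stop_gracefully(proc)
```

Running `asyncio.run(demo())` returns the child's exit status, which is nonzero because the child was terminated rather than allowed to finish.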

Environment Variables

Controller

The orchestrator sets these for controller to connect to fake_cell:

CONTROL_PLANE_URL_OVERRIDE=http://localhost:9001 - Overrides cell URL (dynamically set based on --cell flag)
CONTROLLER_CELL=local - Uses local configuration
DEBUG=true - Enables debug mode
SECRET_KEY=local-dev-secret-key-not-for-production - Local-only secret

E-commerce Services (--profile ecom)

Table names are injected based on --table-prefix, which defaults to dev-{sanitized-git-branch}-. The flag is normally safe to omit when running locally on a dev branch.

admin_server also receives:

DATA_PLANE_CELLS - JSON config for cell API gateway (based on --cell flag)
MARQO_BASE_URL - Marqo Cloud API URL

Console Notes

Console is non-blocking: orchestrator continues if it fails to start
If node_modules/ is missing, orchestrator runs npm ci automatically
Console uses PORT env var to set port 9008
BROWSER=none prevents auto-opening browser

Service Resilience Principles

When hardening hippodrome services (or any service that runs under fuzz testing), follow these principles:

Never return HTML error pages from APIs. JSON error responses always, including CSRF failures, Django DEBUG 500s, and upstream proxy errors.
Use .get() with defaults on external/upstream response dicts. Bracket access on responses from other services or AWS SDK calls is a crash waiting to happen. This includes e.response["Error"]["Code"] in botocore exception handlers.
Classify upstream failures as 502, not 500. When a service can't reach its dependency (ECONNREFUSED, timeout, malformed response), that's a Bad Gateway, not an Internal Server Error. This makes it possible to distinguish "our code is broken" from "our dependency is down".
Guard all response.json() / JSON.parse() calls. Upstream services can return HTML error pages, empty bodies, or garbage. Wrap and return 502.
Set default timeouts on all outbound HTTP. No request should hang indefinitely. Use a _TimeoutSession wrapper or equivalent.
Use HTTP health checks, not TCP port checks. A process listening on a port doesn't mean the app is initialized. Use /healthz endpoints.
Expose /healthz on every service. Simple endpoint returning 200 when the app is ready to serve.
Cap in-memory caches. Unbounded dicts (e.g., token-to-user mappings) leak memory under sustained load. Cap and evict.
Use atomic file writes for state persistence. Write to a temp file, then rename(). Never write directly to the state file.
Release locks before blocking operations. Holding a lock while calling thread.join() or doing network I/O invites deadlocks.
Make seeding idempotent. Duplicate-record errors during seed are success, not failure. Only abort on actual errors.
Return 409 for duplicate creation, not 500. Idempotent-safe clients depend on this.
Validate inputs at the boundary. Return 400/422 for malformed JSON bodies, missing required fields, and invalid enum values before they propagate deeper.
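Three of these principles are easy to show in isolation: defensive access to upstream error dicts, atomic state writes, and capped caches. The helpers below are a sketch with illustrative names (`error_code`, `atomic_write_json`, `BoundedCache`), not code from the Hippodrome services.

```python
import json
import os
import tempfile
from collections import OrderedDict


def error_code(response: dict) -> str:
    """Read Error.Code defensively; the shape mirrors botocore's e.response."""
    return (response.get("Error") or {}).get("Code", "Unknown")


def atomic_write_json(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


class BoundedCache:
    """Small cache that evicts the least recently used entry once full."""

    def __init__(self, max_entries: int) -> None:
        self._max = max_entries
        self._data: OrderedDict = OrderedDict()

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        return default
```

The temp file is created next to the target so the final rename never crosses a filesystem boundary, which is what keeps readers from ever seeing a half-written state file.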

Implicit Dependencies and Gotchas

These are things that can cause confusing failures if not set up correctly:

`HIPPODROME_RANDOM_SEED` is required for reproducible tests

Without this env var, seed data (account IDs, system_account_ids) is generated randomly on each startup. Tests that hard-code IDs from a previous run will fail. Always set it:

export HIPPODROME_RANDOM_SEED=42  # or source .env.e2e
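The effect of the seed can be illustrated with a small sketch. How Hippodrome actually derives its IDs is not documented here, so the UUID-shaped format and the names `seeded_rng` and `make_account_id` are assumptions chosen for illustration.

```python
import os
import random
import uuid


def seeded_rng() -> random.Random:
    """Use HIPPODROME_RANDOM_SEED when set; otherwise seed from OS entropy."""
    seed = os.environ.get("HIPPODROME_RANDOM_SEED")
    return random.Random(int(seed)) if seed is not None else random.Random()


def make_account_id(rng: random.Random) -> str:
    """Derive a UUID-shaped ID from the RNG so the same seed gives the same ID."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))
```

With the variable set, two fresh generators produce the same first ID, which is what lets tests rely on IDs staying stable across restarts. Without it, each startup draws from a different random stream.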

Node.js services need `npm install` in their directories

search_proxy, agentic_search, admin_worker, and console all need node_modules/. The orchestrator auto-runs npm ci for console but NOT for search_proxy or agentic_search. Run manually if missing:

cd components/search_proxy && npm install
cd components/agentic_search && npm install

Controller requires a working venv

The controller runs via a local venv (components/controller/.venv), not pants. If the venv is missing or stale, the orchestrator creates it automatically. If Django imports fail despite the venv existing, delete and let it recreate:

rm -rf components/controller/.venv
# Restart hippodrome

Seed data requires fake_cell + fake_cognito to be healthy first

API keys are created in fake_cell during seeding. If fake_cell is slow to start, API key seeding is skipped and dependent services (admin_server, search_proxy) can't authenticate. The orchestrator uses layered startup to prevent this, but if seeding logs show "Failed to seed API key", restart hippodrome.

Moto resources are created concurrently at startup

DynamoDB tables, S3 buckets, SQS queues, Lambda functions, and SFN state machines are all created concurrently in moto during startup. If any fail, they log at WARNING or ERROR level with specific resource names. Check moto_server.log if services fail with ResourceNotFoundException.

The encryption secret has a default fallback

If HIPPODROME_ENCRYPTION_SECRET is not set, a default key (1234567890ABCDEF) is used. This is fine for local development but means API keys created with one secret can't be decrypted with another. If you see "Failed to decrypt" errors, ensure the secret matches what was used when the API key was created.

Testing Notes

Tests use MagicMock to avoid starting real processes
Dashboard tests handle missing dashboard.html (sandbox doesn't include resources)
Use pytest-asyncio for async test methods
