Hippodrome Component
Workflow Tips
Check @/AGENTS.md whenever you want repo-wide context or optional workflow suggestions, and adapt whatever helps this task.
Development environment orchestrator for running all Cloud Control Plane components in one place for system testing. Starts all control plane services with proper configuration and service discovery.
Quick Start
# `pants hd` is a built-in alias (see pants.toml) for running the Hippodrome CLI.
# It handles PANTS_CONCURRENT automatically.
# Explain the CLI
pants hd --help
# Use a fixed random seed so that the seed data is identical on every run.
export HIPPODROME_RANDOM_SEED=42
# Start with ecom profile (includes admin_server, search_proxy, etc.)
pants hd up --profile ecom
# Connect to staging cell instead of fake_cell
pants hd up --profile ecom --cell staging
# Use custom table prefix
pants hd up --profile ecom --table-prefix my-feature-
# With explicit project root
pants hd up --project-root /path/to/project
Running E2E Tests Against Local Hippodrome
All E2E tests run fully locally against moto (AWS emulation) and fake_cell. No remote credentials are required (no AWS SSO, Shopify, or Cloudflare tokens). Tests that need real external credentials will skip automatically in hippodrome mode.
Environment setup
Source the convenience env file (sets ENVIRONMENT=hippodrome, HIPPODROME_RANDOM_SEED=42, PANTS_CONCURRENT=True):
source components/hippodrome/.env.e2e
This tells the E2E config loader to read components/shopify/e2e_tests/.env.hippodrome, which points service URLs at localhost and uses dummy AWS credentials for moto.
1) Verify Hippodrome is running and ready
After pants hd up, services take 60-90 seconds to become fully healthy. Use this to check:
# Quick check: is the dashboard responding?
curl -sf http://localhost:9000/api/status
# Detailed check: wait until all services report "running"
curl -s http://localhost:9000/api/status | python3 -c "
import sys, json
data = json.load(sys.stdin)
for name, info in data.items():
    status = info.get('status', 'unknown')
    symbol = '✓' if status == 'running' else '✗'
    print(f'  {symbol} {name}: {status}')
not_running = [n for n, i in data.items() if i.get('status') != 'running']
if not_running:
    # Join outside the f-string: reusing single quotes inside one needs Python 3.12+.
    names = ', '.join(not_running)
    print(f'\nWaiting for: {names}')
    sys.exit(1)
print('\nAll services ready.')
"
Layered startup order: Services start in dependency layers. Each layer waits for its health-checked services to become healthy before the next layer starts. Shutdown happens in reverse layer order.
| Layer | Services | Approx time |
|---|---|---|
| 0 | fake_cognito, fake_cell | ~5-15s |
| 1 | controller | ~15-30s |
| 2 | admin_server, admin_lambda, search_proxy, ecom_indexer, exporters, console | ~30-60s |
| 3 | agentic_search, admin_worker | ~60-90s |
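The layered ordering in the table can be sketched as follows. This is an illustration only, not the orchestrator's actual code: `start_in_layers`, `stop_in_reverse`, and the `start`, `wait_healthy`, and `stop` callables are hypothetical names introduced here.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# Layers copied from the table above; the list structure itself is an assumption.
LAYERS: List[List[str]] = [
    ["fake_cognito", "fake_cell"],
    ["controller"],
    ["admin_server", "admin_lambda", "search_proxy", "ecom_indexer", "exporters", "console"],
    ["agentic_search", "admin_worker"],
]

ServiceAction = Callable[[str], Awaitable[None]]


async def start_in_layers(
    layers: Sequence[Sequence[str]],
    start: ServiceAction,
    wait_healthy: ServiceAction,
) -> List[str]:
    """Start each layer, then block until it is healthy before moving on."""
    started: List[str] = []
    for layer in layers:
        # Services within one layer are started concurrently.
        await asyncio.gather(*(start(name) for name in layer))
        # The next layer only begins once every service here reports healthy.
        await asyncio.gather(*(wait_healthy(name) for name in layer))
        started.extend(layer)
    return started


async def stop_in_reverse(layers: Sequence[Sequence[str]], stop: ServiceAction) -> None:
    """Shut down the last layer first, mirroring the startup order."""
    for layer in reversed(layers):
        await asyncio.gather(*(stop(name) for name in layer))
```

Passing the start, health-check, and stop actions in as callables keeps the ordering logic separate from process management, which also makes it easy to exercise with stubs.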
If the status check fails or returns no response, Hippodrome is not running. Do not start Hippodrome yourself. Instead, tell the user:
Hippodrome does not appear to be running (localhost:9000 is not responding). Please start it with `pants hd up --profile ecom` and let me know when it's ready.
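Because services take 60-90 seconds to become healthy, a one-shot check can fail simply because it ran too early. The sketch below polls the dashboard API until every service reports "running". It assumes the response shape used by the status check in this section (`{service_name: {"status": ...}}`); `fetch_status`, `not_running`, and `wait_until_ready` are hypothetical helper names, not part of Hippodrome.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List

STATUS_URL = "http://localhost:9000/api/status"  # dashboard endpoint from this guide


def fetch_status(url: str = STATUS_URL, timeout: float = 5.0) -> Dict[str, dict]:
    """Fetch the status JSON (assumed shape: {service_name: {"status": "..."}})."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def not_running(data: Dict[str, dict]) -> List[str]:
    """Names of services whose status is anything other than "running"."""
    return [name for name, info in data.items() if info.get("status") != "running"]


def wait_until_ready(
    fetch: Callable[[], Dict[str, dict]] = fetch_status,
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until all services are running; return False once the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            pending = not_running(fetch())
        except (OSError, ValueError):
            # Connection refused or invalid JSON: the dashboard is not up yet.
            pending = ["dashboard"]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A `False` result means Hippodrome did not come up in time, and the guidance above still applies: report it to the user rather than starting the stack.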
2) If E2E tests fail but status looks healthy, run smoke tests
pants test //tests/hippodrome::
This runs lightweight checks against every Hippodrome service (dashboard, moto, fake_cell, fake_cognito, admin_server, search_proxy, seeded data). Completes in under 10 seconds and is safe to run alongside other work. If smoke tests pass, the problem is in the component or test code, not Hippodrome.
3) Run Shopify E2E tests
# Single test file
pants test //components/shopify/e2e_tests/e2e_tests/tests/ecom_onboarding_test.py -- -v
# All E2E tests (tests needing external creds will skip)
pants test //components/shopify/e2e_tests/e2e_tests/tests:: -- -v
Tests that work fully locally (no external creds):
- `ecom_onboarding_test.py` - Ecom account onboarding flow
- `ecom_onboarding_text_test.py` - Text-based onboarding variant
- `ecom_import_export_test.py` - Import/export workflows
- `ecom_aliasing_test.py` - Index aliasing
- `ecom_analytics_test.py` - Analytics tracking
- `index_forking_test.py` - Index fork workflow
Tests that skip in hippodrome mode (need Shopify creds):
- `shopify_onboarding_test.py`
4) Run Hippothesis smoke tests
Hippothesis is a lightweight smoke-test suite that probes service health. No env file needed — it defaults to local environment.
# Hello-world probe (just checks services are reachable)
pants test //components/hippothesis/hippothesis:tests
# Full hippothesis suite (includes property-based search tests)
pants test //components/hippothesis::
How it works (no-creds architecture)
┌─────────────────────────────────────────────────────────┐
│ Hippodrome (all local, no remote dependencies) │
│ │
│ moto (:9003) ← DynamoDB, S3, SQS, SNS, SFN │
│ fake_cognito (:9012) ← Auth token validation │
│ fake_cell (:9001) ← Marqo data plane mock │
│ controller (:9002) ← Control plane API (Django) │
│ admin_server (:9004) ← Ecom backend (FastAPI) │
│ search_proxy (:9005) ← Search API (Cloudflare Worker) │
│ console (:9008) ← Web UI (React) │
└─────────────────────────────────────────────────────────┘
▲
│ ENVIRONMENT=hippodrome
│ AWS creds → dummy (testing/testing)
│ Service URLs → localhost
│
┌────────┴────────┐
│ E2E test suite │
└─────────────────┘
- moto emulates DynamoDB, S3, SQS, SNS, Secrets Manager, Step Functions — no real AWS account needed
- fake_cognito issues JWT tokens locally — no real Cognito pool needed
- fake_cell mocks the Marqo data plane — no real Marqo deployment needed
- Shopify/Cloudflare credentials are optional; tests that need them skip gracefully
Known local behavior
- Under moto + local queue wiring, admin add-docs can intermittently fail with SQS 500s.
- The simple health test falls back to direct index `POST /documents` for the single health doc when that happens.
- This fallback is only for the deterministic health scenario; fuzz suites still exercise normal API paths.
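The fallback described above might look roughly like the sketch below. Everything here is hypothetical: `QueueUnavailable`, `add_health_doc`, and the two callables stand in for whatever the real health test uses. The point is only that the fallback is scoped to one narrow error on one document.

```python
from typing import Callable, Dict


class QueueUnavailable(Exception):
    """Hypothetical error type for an SQS 5xx surfaced by the admin add-docs path."""


def add_health_doc(
    doc: Dict[str, str],
    via_admin: Callable[[Dict[str, str]], None],
    via_index: Callable[[Dict[str, str]], None],
) -> str:
    """Index the single health doc and report which path succeeded."""
    try:
        via_admin(doc)  # normal path through the admin add-docs API
        return "admin"
    except QueueUnavailable:
        # Only the known intermittent queue failure triggers the direct POST;
        # any other exception propagates so real bugs stay visible.
        via_index(doc)
        return "direct"
```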
Never Bypass Services for Expedience
Hippodrome runs the same service topology as production. Never skip a service in the request path (e.g., calling admin_server directly instead of going through search_proxy) to work around a bug or speed things up. If a request fails at search_proxy, fix the root cause in search_proxy — do not route around it. Bypassing a service hides bugs and creates divergence between local and production behavior.
Hippodrome Is Long-Running — Do Not Kill Services
Hippodrome is designed to run continuously in the background. The user starts it once and leaves it running. Agents must not kill, restart, or stop Hippodrome services unless the user explicitly asks.
- Hot reload handles code changes for most services. admin_server and controller use `--reload`; the console uses HMR. Editing those source files triggers automatic restarts. fake_cell and fake_cognito run without `--reload` (it causes deadlocks under concurrent load); restart hippodrome to pick up changes to those services.
- Restarting Hippodrome is needed for changes to Hippodrome itself (e.g., `config/config.py`, `management/orchestrator.py`, `cli.py`) or to fake_cell/fake_cognito source code.
- If you believe a restart is needed, ask the user to do it rather than running restart/stop/kill commands yourself. The user may have other work depending on the running stack.
- Never `kill -9` Hippodrome ports (9000–9019) unless the user explicitly instructs you to.
- To check if Hippodrome is running, call `curl -s http://localhost:9000/api/status`. This returns JSON with the status of all services. Do not run `pants hd` commands just to check status: the dashboard API is instant and non-disruptive.
Agent Iteration Loop
# User starts daemon once (auto-starts services by default)
pants hd up --profile ecom
# Check status via the dashboard API (preferred — no pants overhead)
curl -s http://localhost:9000/api/status | python3 -m json.tool
# If needed, user manages stack lifecycle
pants hd control restart
pants hd control stop
pants hd control start
# User ends the session cleanly
pants hd control shutdown
Troubleshooting First
Before debugging any error, read troubleshooting.md in this directory. It contains solutions to common problems. If you solve a new problem, add it to troubleshooting.md.
Log Files
All service logs are written to timestamped directories in components/hippodrome/.logs/:
components/hippodrome/.logs/
├── latest -> 20260120-143022    # Symlink to most recent run
├── 20260120-143022/
│   ├── fake_cell.log
│   ├── controller.log
│   ├── console.log
│   ├── admin_server.log
│   └── ...
└── 20260120-140815/
    └── ...
Use these log files when:
- Debugging test failures that occur in hippodrome services
- Investigating service startup issues
- Analyzing errors that scroll off the terminal
- Reviewing complete service output history
Quick access to latest logs:
# View all logs from latest run
ls components/hippodrome/.logs/latest/
# Tail a specific service
tail -f components/hippodrome/.logs/latest/controller.log
# Search for errors across all services
grep -i error components/hippodrome/.logs/latest/*.log
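When the grep output needs to be grouped per service, for example to summarize which services logged errors, a small script can do the same search. This is a sketch that assumes the `*.log` layout shown above; `find_error_lines` is a hypothetical helper, not part of Hippodrome.

```python
import re
from pathlib import Path
from typing import Dict, List


def find_error_lines(log_dir: Path) -> Dict[str, List[str]]:
    """Map each log file name to its lines containing "error" (case-insensitive)."""
    pattern = re.compile("error", re.IGNORECASE)
    hits: Dict[str, List[str]] = {}
    for log_file in sorted(log_dir.glob("*.log")):
        # errors="replace" avoids crashing on stray non-UTF-8 bytes in service output.
        text = log_file.read_text(errors="replace")
        matching = [line for line in text.splitlines() if pattern.search(line)]
        if matching:
            hits[log_file.name] = matching
    return hits
```

Calling `find_error_lines(Path("components/hippodrome/.logs/latest"))` returns only the services that logged at least one matching line.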
Component Structure
hippodrome/
├── AGENTS.md                          # This file - instructions for AI assistants
├── troubleshooting.md                 # Common problems and solutions (read first!)
├── BUILD                              # Pants build configuration
└── hippodrome/
    ├── cli.py                         # Click CLI entry point
    ├── config/config.py               # Service configurations (ports, env vars)
    ├── management/orchestrator.py     # Process spawner and manager
    ├── dashboard/dashboard.py         # HTTP server for status page
    ├── dashboard/dashboard.html       # Single-file HTML dashboard
    ├── events/                        # Event handling (tunnels, etc.)
    ├── workflows/                     # Step Functions workflow engine
    └── wrappers/                      # Service wrappers (moto, etc.)
Profiles
The --profile flag controls which services are started:
| Profile | Services | Use Case |
|---|---|---|
| `core` | fake_cell, controller, console | Basic control plane development |
| `ecom` | core + admin_server, search_proxy, indexers, exporters | E-commerce development |
| `full` (default) | All services | Full stack testing |
Profile Examples
# Default (full): Starts all services
pants hd up
# Ecom: Starts core + admin_server (9004), search_proxy (9005), indexers, exporters
pants hd up --profile ecom
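Profile selection can be pictured as a membership test on each service's set of profiles. The sketch below is an assumption about the mechanism, built around the `profiles=frozenset({Profile.ECOM, Profile.FULL})` field shown later in this file; the `Service` class, `Profile.CORE`, and `services_for` are illustrative names, and the service list is abbreviated.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List


class Profile(Enum):
    CORE = "core"
    ECOM = "ecom"
    FULL = "full"


ALL_PROFILES: FrozenSet[Profile] = frozenset(Profile)
ECOM_AND_FULL: FrozenSet[Profile] = frozenset({Profile.ECOM, Profile.FULL})


@dataclass(frozen=True)
class Service:
    """Illustrative stand-in for a service definition; not the real ServiceConfig."""
    name: str
    profiles: FrozenSet[Profile] = ALL_PROFILES


# Abbreviated list based on the profile table above.
SERVICES = [
    Service("fake_cell"),
    Service("controller"),
    Service("console"),
    Service("admin_server", ECOM_AND_FULL),
    Service("search_proxy", ECOM_AND_FULL),
]


def services_for(profile: Profile) -> List[str]:
    """Return service names enabled for the given profile, in declaration order."""
    return [s.name for s in SERVICES if profile in s.profiles]
```

With this model, `services_for(Profile.CORE)` yields only the three core services, while `ecom` and `full` also pick up admin_server and search_proxy.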
Cell Connection
The --cell flag controls which cell to connect to:
| Cell | Description | fake_cell Started? |
|---|---|---|
| `local` (default) | Use local fake_cell | Yes |
| `staging` | Connect to staging cell | No |
| `prod` | Connect to production cell | No |
When using staging or prod, the fake_cell service is not started and services connect to the real deployed cell.
Warning: Using --cell prod connects to production data. A warning message is displayed.
# Use staging cell (skips fake_cell)
pants hd up --profile ecom --cell staging
Table Prefix
The --table-prefix flag controls DynamoDB table naming:
# Default: dev-{git-branch}- (e.g., "dev-feature-auth-")
pants hd up --profile ecom
# Custom prefix
pants hd up --profile ecom --table-prefix my-feature-
Branch names with special characters (slashes, etc.) are sanitized: feature/add-auth → dev-feature-add-auth-
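A minimal sketch of that sanitization, assuming every run of characters other than letters, digits, and hyphens collapses to a single hyphen. The exact rules Hippodrome applies are not spelled out here, so treat `table_prefix_for_branch` as a hypothetical helper that merely reproduces the documented example.

```python
import re


def table_prefix_for_branch(branch: str) -> str:
    """Derive a default table prefix such as "dev-feature-add-auth-" from a branch name."""
    # Assumption: anything outside [A-Za-z0-9-] is replaced, and edge hyphens are trimmed.
    sanitized = re.sub(r"[^A-Za-z0-9-]+", "-", branch).strip("-")
    return f"dev-{sanitized}-"


print(table_prefix_for_branch("feature/add-auth"))  # dev-feature-add-auth-
```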
Services Managed
Core Services (Profile: core, ecom, full)
| Service | Port | Description |
|---|---|---|
| dashboard | 9000 | Status dashboard (http://localhost:9000) |
| fake_cell | 9001 | Data plane mock (skipped with --cell staging/prod) |
| controller | 9002 | Control plane API (Django) |
| console | 9008 | Web dashboard UI (React, non-blocking) |
E-commerce Services (Profile: ecom, full)
| Service | Port | Description |
|---|---|---|
| admin_server | 9004 | E-commerce backend (FastAPI) |
| search_proxy | 9005 | Search API gateway (Cloudflare Worker) |
| ecom_indexer_service | 9018 | Go document indexer service |
| ecom_ingest | 9019 | Local document ingest buffer API |
External Services (Manual Setup Required)
| Service | Port | Description |
|---|---|---|
| global_worker | 9012 | Search query router (external repo, Cloudflare Worker) |
global_worker Setup (External Repository)
The global_worker is a Cloudflare Worker that handles search query routing, merchandising rules, and caching. It lives in a separate repository and must be set up manually for full search functionality.
Why global_worker is Needed
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│search_proxy │────▶│global_worker│────▶│ fake_cell │
│ :9005 │ │ :9012 │ │ :9001 │
└─────────────┘ └─────────────┘ └─────────────┘
- search_proxy receives search API requests
- global_worker applies merchandising rules and caching
- global_worker proxies the final request to the cell (fake_cell or real cell)
Setup Instructions
- Clone the global-worker repository:

      # Clone to a directory outside this repo
      cd ~/dev
      git clone git@github.com:marqo-ai/global-worker.git
      cd global-worker

- Install dependencies:

      npm install

- Create local wrangler configuration: Create a `wrangler.local.toml` file (or use an existing one) with:

      name = "local-global-worker"
      main = "src/index.ts"
      compatibility_date = "2024-09-23"
      compatibility_flags = ["nodejs_compat"]

      [vars]
      ENV = "dev"
      FULL_ENV = "dev-local"
      # Point to fake_cell for local development
      CELL_URL = "http://localhost:9001"

      [dev]
      port = 9012
      local_protocol = "http"

- Start global_worker:

      npx wrangler dev --config wrangler.local.toml --port 9012
Running Without global_worker
If you don't need to test the full search flow, you can skip global_worker setup:
- search_proxy will return errors for search requests that require global_worker
- admin_server and other write operations will still work
- Direct cell operations (via controller) are unaffected
Verifying Setup
Once global_worker is running:
# Check global_worker health (if health endpoint exists)
curl http://localhost:9012/health
# Test full search flow (requires valid index and data)
curl -X POST http://localhost:9005/indexes/test-index/search \
-H "Content-Type: application/json" \
-d '{"q": "test query"}'
Key Files
| File | Purpose |
|---|---|
| `cli.py` | Entry point - parses arguments, runs orchestrator |
| `config/config.py` | ServiceConfig dataclass, get_service_configs() returns services in dependency order |
| `management/orchestrator.py` | Orchestrator class handles process lifecycle, log aggregation, graceful shutdown |
| `dashboard/dashboard.py` | DashboardServer runs HTTP server in a thread, serves status API |
| `dashboard/dashboard.html` | Single-file HTML/CSS/JS dashboard with auto-refresh |
Service Configuration Pattern
Services are configured in config/config.py:
ServiceConfig(
name="service_name",
port=9001,
command=["pants", "run", "//path/to/app.py", "--", "--reload"],
env={"ENV_VAR": "value"},
cwd=project_root, # or relative path like "components/console"
blocking=True, # False = orchestrator continues if service fails
)
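The effect of the `blocking` flag can be sketched as below. This is an assumption about how an orchestrator might honor the flag, not the real implementation; the trimmed `ServiceConfig` and the `start_all` helper are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ServiceConfig:
    """Illustrative subset of the fields shown above; not the real class."""
    name: str
    blocking: bool = True


def start_all(
    configs: Iterable[ServiceConfig],
    start: Callable[[ServiceConfig], None],
) -> List[str]:
    """Start services in order; a failure aborts only if the service is blocking."""
    started: List[str] = []
    for config in configs:
        try:
            start(config)
        except Exception as exc:
            if config.blocking:
                # A required service failed: stop the whole startup sequence.
                raise RuntimeError(f"blocking service {config.name} failed to start") from exc
            # Optional service: report the failure and keep going.
            print(f"[warn] optional service {config.name} failed: {exc}")
            continue
        started.append(config.name)
    return started
```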
Adding a New Service
- Add a `ServiceConfig` in `config/config.py` in the appropriate dependency layer
  - Services in Layer 0 have no dependencies
  - Services in Layer 1+ depend on earlier layers
- Set `blocking=False` for optional services (like console)
- Set `profiles=frozenset({Profile.ECOM, Profile.FULL})` to include in specific profiles
Example for an ecom service:
ServiceConfig(
name="my_service",
port=9020,
command=["pants", "run", "//components/my_service:local", "--", "--reload"],
env={
"CELL_URL": cell_url,
**table_names, # Inject all table names
},
profiles=frozenset({Profile.ECOM, Profile.FULL}),
)
Development Commands
# Run linting
pants lint //components/hippodrome::
# Run tests
pants test //components/hippodrome::
# Check CLI help
pants hd --help
pants hd up --help
Design Decisions
- No `__init__.py` files - Uses namespace packages per project convention
- Async subprocess management - Uses `asyncio.create_subprocess_exec` for non-blocking process spawning
- Threaded dashboard server - Uses Python's built-in `http.server` in a daemon thread to avoid external dependencies
- Colored log prefixes - Each service gets a unique ANSI color for easy identification
- Graceful shutdown - SIGTERM with 5-second timeout, then SIGKILL
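The graceful-shutdown decision can be illustrated with standard asyncio subprocess calls. This is a sketch of the SIGTERM-then-SIGKILL pattern under the 5-second timeout stated above, not a copy of the orchestrator's code; `stop_gracefully` is a hypothetical name.

```python
import asyncio
import sys


async def stop_gracefully(proc: asyncio.subprocess.Process, timeout: float = 5.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    if proc.returncode is None:
        proc.terminate()  # SIGTERM on POSIX
        try:
            await asyncio.wait_for(proc.wait(), timeout=timeout)
        except asyncio.TimeoutError:
            proc.kill()  # SIGKILL on POSIX
            await proc.wait()
    return proc.returncode


async def demo() -> int:
    # Spawn a child that would otherwise sleep for a minute, then stop it.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await stop_gracefully(proc)
```

Running `asyncio.run(demo())` returns a non-zero exit status once the child has been terminated. Per the guidance earlier in this file, agents should not use this kind of code against a running Hippodrome stack unless the user asks.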
Environment Variables
Controller
The orchestrator sets these for controller to connect to fake_cell:
- `CONTROL_PLANE_URL_OVERRIDE=http://localhost:9001` - Overrides cell URL (dynamically set based on --cell flag)
- `CONTROLLER_CELL=local` - Uses local configuration
- `DEBUG=true` - Enables debug mode
- `SECRET_KEY=local-dev-secret-key-not-for-production` - Local-only secret
E-commerce Services (--profile ecom)
Table names are injected based on --table-prefix, which defaults to dev-{sanitized-git-branch}-. The flag is therefore safe to omit when running locally on a dev branch.
admin_server also receives:
- `DATA_PLANE_CELLS` - JSON config for cell API gateway (based on `--cell` flag)
- `MARQO_BASE_URL` - Marqo Cloud API URL
Console Notes
- Console is non-blocking: orchestrator continues if it fails to start
- If `node_modules/` is missing, orchestrator runs `npm ci` automatically
- Console uses `PORT` env var to set port 9008
- `BROWSER=none` prevents auto-opening browser
Service Resilience Principles
When hardening hippodrome services (or any service that runs under fuzz testing), follow these principles:
- Never return HTML error pages from APIs. JSON error responses always, including CSRF failures, Django DEBUG 500s, and upstream proxy errors.
- Use `.get()` with defaults on external/upstream response dicts. Bracket access on responses from other services or AWS SDK calls is a crash waiting to happen. This includes `e.response["Error"]["Code"]` in botocore exception handlers.
- Classify upstream failures as 502, not 500. When a service can't reach its dependency (ECONNREFUSED, timeout, malformed response), that's a Bad Gateway, not an Internal Server Error. This makes it possible to distinguish "our code is broken" from "our dependency is down".
- Guard all `response.json()` / `JSON.parse()` calls. Upstream services can return HTML error pages, empty bodies, or garbage. Wrap and return 502.
- Set default timeouts on all outbound HTTP. No request should hang indefinitely. Use a `_TimeoutSession` wrapper or equivalent.
- Use HTTP health checks, not TCP port checks. A process listening on a port doesn't mean the app is initialized. Use `/healthz` endpoints.
- Expose `/healthz` on every service. Simple endpoint returning 200 when the app is ready to serve.
- Cap in-memory caches. Unbounded dicts (e.g., token-to-user mappings) leak memory under sustained load. Cap and evict.
- Use atomic file writes for state persistence. Write to a temp file, then `rename()`. Never write directly to the state file.
- Release locks before blocking operations. Holding a lock while calling `thread.join()` or doing network I/O invites deadlocks.
- Make seeding idempotent. Duplicate-record errors during seed are success, not failure. Only abort on actual errors.
- Return 409 for duplicate creation, not 500. Idempotent-safe clients depend on this.
- Validate inputs at the boundary. Return 400/422 for malformed JSON bodies, missing required fields, and invalid enum values before they propagate deeper.
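Three of these principles are easy to show in a few lines of standard-library Python. The helpers below are illustrative sketches with hypothetical names (`error_code`, `atomic_write_json`, `BoundedCache`), not code from the Hippodrome services. The atomic write uses `os.replace`, which performs the rename step and overwrites an existing target.

```python
import json
import os
import tempfile
from collections import OrderedDict
from typing import Any, Hashable, Optional


def error_code(response: Optional[dict]) -> str:
    """Read an AWS-style error code defensively instead of response["Error"]["Code"]."""
    error = (response or {}).get("Error") or {}
    return error.get("Code", "Unknown")


def atomic_write_json(path: str, data: Any) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


class BoundedCache:
    """Dict-like cache that evicts the least recently used entry once full."""

    def __init__(self, max_entries: int) -> None:
        self._max = max_entries
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        return default

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self) -> int:
        return len(self._data)
```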
Implicit Dependencies and Gotchas
These are things that can cause confusing failures if not set up correctly:
HIPPODROME_RANDOM_SEED is required for reproducible tests
Without this env var, seed data (account IDs, system_account_ids) is generated randomly on each startup. Tests that hard-code IDs from a previous run will fail. Always set it:
export HIPPODROME_RANDOM_SEED=42 # or source .env.e2e
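Why the seed matters can be shown with a small sketch: a generator seeded from the environment variable produces the same IDs on every run, while an unseeded one does not. The UUID-shaped IDs and the `make_rng` and `new_account_id` helpers are assumptions for illustration; the real seeding code may generate IDs differently.

```python
import os
import random
import uuid
from typing import Mapping


def make_rng(env: Mapping[str, str] = os.environ) -> random.Random:
    """Seed from HIPPODROME_RANDOM_SEED when it is set, otherwise use OS entropy."""
    seed = env.get("HIPPODROME_RANDOM_SEED")
    return random.Random(int(seed)) if seed is not None else random.Random()


def new_account_id(rng: random.Random) -> str:
    """Hypothetical ID format: a version-4 UUID built from the generator's bits."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


# Two "startups" with the same seed yield identical account IDs.
first = new_account_id(make_rng({"HIPPODROME_RANDOM_SEED": "42"}))
second = new_account_id(make_rng({"HIPPODROME_RANDOM_SEED": "42"}))
assert first == second
```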
Node.js services need npm install in their directories
search_proxy, agentic_search, admin_worker, and console all need node_modules/. The orchestrator auto-runs npm ci for console but NOT for search_proxy or agentic_search. Run manually if missing:
cd components/search_proxy && npm install
cd components/agentic_search && npm install
Controller requires a working venv
The controller runs via a local venv (components/controller/.venv), not pants. If the venv is missing or stale, the orchestrator creates it automatically. If Django imports fail despite the venv existing, delete and let it recreate:
rm -rf components/controller/.venv
# Restart hippodrome
Seed data requires fake_cell + fake_cognito to be healthy first
API keys are created in fake_cell during seeding. If fake_cell is slow to start, API key seeding is skipped and dependent services (admin_server, search_proxy) can't authenticate. The orchestrator uses layered startup to prevent this, but if seeding logs show "Failed to seed API key", restart hippodrome.
Moto resources are created concurrently at startup
DynamoDB tables, S3 buckets, SQS queues, Lambda functions, and SFN state machines are all created concurrently in moto during startup. If any fail, they log at WARNING or ERROR level with specific resource names. Check moto_server.log if services fail with ResourceNotFoundException.
The encryption secret has a default fallback
If HIPPODROME_ENCRYPTION_SECRET is not set, a default key (1234567890ABCDEF) is used. This is fine for local development but means API keys created with one secret can't be decrypted with another. If you see "Failed to decrypt" errors, ensure the secret matches what was used when the API key was created.
Testing Notes
- Tests use `MagicMock` to avoid starting real processes
- Dashboard tests handle missing `dashboard.html` (sandbox doesn't include resources)
- Use `pytest-asyncio` for async test methods
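A minimal sketch of the mocking approach, assuming the code under test accepts its spawn function as a parameter. `spawn_service` is a hypothetical function, not one from the Hippodrome codebase, and the test drives the coroutine with `asyncio.run` so the example stays self-contained; in the repo, a `pytest-asyncio` test would await it directly.

```python
import asyncio
from typing import Awaitable, Callable, List
from unittest.mock import AsyncMock, MagicMock


async def spawn_service(
    command: List[str],
    spawn: Callable[..., Awaitable] = asyncio.create_subprocess_exec,
) -> int:
    """Hypothetical function under test: spawn a process and return its PID."""
    proc = await spawn(*command)
    return proc.pid


def test_spawn_service_does_not_start_a_real_process() -> None:
    fake_proc = MagicMock(pid=1234)  # stands in for asyncio.subprocess.Process
    fake_spawn = AsyncMock(return_value=fake_proc)

    pid = asyncio.run(spawn_service(["pants", "run", "//some:target"], spawn=fake_spawn))

    assert pid == 1234
    fake_spawn.assert_awaited_once_with("pants", "run", "//some:target")
```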