Local Stack Troubleshooting Guide
This document contains solutions to common problems encountered while running hippodrome. When you encounter an issue, search this document first. If you solve a new problem, add it here following the format below.
Problem: Port already in use
Symptoms:
- Error message: "Address already in use"
- Error message: "OSError: [Errno 48] Address already in use"
- Service fails to start with port conflict
Cause: Another process is using one of the required ports (9000, 9001, 9002, or 9008).
Solution:
- Find the process using the port:
lsof -i :9001 # Replace with the conflicting port
- Kill the process:
kill -9 <PID>
- Or use a single command:
kill -9 $(lsof -ti :9001)
Or on Mac, try witr --port 9001 (brew install witr)
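To check all of the required ports in one pass before starting the stack, a small script can help. This is a minimal sketch, not part of hippodrome: the port list is copied from the Cause line above, and the helper names `port_in_use` and `busy_ports` are made up for illustration.

```python
import socket

# Ports named in the Cause line above; adjust if your stack uses others.
REQUIRED_PORTS = (9000, 9001, 9002, 9008)


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. a listener exists.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=REQUIRED_PORTS) -> list[int]:
    """Return the subset of ports that already have a listener."""
    return [p for p in ports if port_in_use(p)]


if __name__ == "__main__":
    for port in busy_ports():
        print(f"port {port} is in use; find the owner with: lsof -i :{port}")
```

The script only reports conflicts; killing the owning process is still done with the commands above.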
Problem: Pants fails with PermissionError on sandboxer binary
Symptoms:
- Error like:
PermissionError: [Errno 1] Operation not permitted: .../pants/bin/sandboxer
- Stack trace mentions os.chmod on a Pants cache path
- pants run or pants test exits immediately
Cause: The Pants cache directory is not writable/executable in the current environment (e.g., sandboxed execution, restrictive filesystem permissions, or a corrupted cache).
Solution:
- Ensure the Pants cache path is writable (Mac default: ~/Library/Caches/nce/)
- If running in a sandboxed environment, run Pants outside the sandbox or disable the sandbox
- Clear the Pants cache and retry:
rm -rf ~/Library/Caches/nce
Problem: Service fails to start - command not found
Symptoms:
- Error message: "FileNotFoundError: [Errno 2] No such file or directory: 'pants'"
- Service shows status "failed" immediately after start
Cause:
The pants command is not in your PATH, or you're running from the wrong directory.
Solution:
- Ensure you're in the project root (where pants.toml lives)
- Verify pants is installed:
./pants --version
- If using a custom pants installation, ensure it's in your PATH
Problem: Console fails to start - npm not found
Symptoms:
- Error message: "FileNotFoundError: [Errno 2] No such file or directory: 'npm'"
- Console shows status "failed"
Cause: Node.js and npm are not installed or not in PATH.
Solution:
- Install Node.js (includes npm): https://nodejs.org/
- Verify installation:
npm --version
- Console is non-blocking, so other services will continue running
Problem: Console fails to start - npm ci fails
Symptoms:
- Error during "Running npm ci..."
- npm ci returns non-zero exit code
Cause:
- package-lock.json is out of sync with package.json
- npm cache is corrupted
- Node version mismatch
Solution:
- Navigate to console directory:
cd components/console
- Clear npm cache and reinstall:
rm -rf node_modules
npm cache clean --force
npm install
- If version mismatch, check required Node version in package.json
Problem: Hot reload not working
Symptoms:
- Changes to Python files don't trigger restart
- Changes to console files don't refresh
Cause:
- Python: uvicorn's --reload may not detect changes in all cases
- Console: React's hot module replacement requires specific setup
Solution: For Python services:
- Verify the service was started with the --reload flag (check logs)
- Ensure you're editing files within the watched directory
- Try manually restarting the local stack
For console:
- Check browser console for HMR errors
- Try hard refresh (Cmd+Shift+R)
- Restart the console service
Problem: Controller can't connect to fake_cell
Symptoms:
- Controller logs show connection errors to localhost:9001
- API requests to controller return 500 errors
- Logs show "Connection refused"
Cause:
- fake_cell hasn't finished starting
- fake_cell crashed
- Environment variables not set correctly
Solution:
- Check fake_cell status in dashboard (http://localhost:9000)
- Check fake_cell logs for errors
- Verify fake_cell is running:
curl http://localhost:9001/health
- If fake_cell is not running, check its logs and restart hippodrome
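Because one listed cause is that fake_cell has not finished starting, it can help to poll the health endpoint for a short while before concluding it has crashed. The following is a minimal sketch under stated assumptions: the URL is the one from the curl command above, while `wait_for_health`, its timeout, and its polling interval are illustrative choices rather than hippodrome behavior.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll url until it answers HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeouts, and non-2xx replies all mean "not ready yet".
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    ready = wait_for_health("http://localhost:9001/health")
    print("fake_cell is up" if ready else "fake_cell did not become healthy; check its logs")
```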
Problem: Dashboard shows services but no logs
Symptoms:
- Dashboard at http://localhost:9000 shows service status
- Log viewer tabs are empty
Cause:
- Services just started (logs haven't been captured yet)
- Log capture encountered an error
Solution:
- Wait a few seconds for logs to be captured
- Check terminal output - logs should be visible there with colored prefixes
- If terminal has logs but dashboard doesn't, check browser console for JavaScript errors
Problem: Ctrl+C doesn't stop services
Symptoms:
- Pressing Ctrl+C doesn't terminate all services
- Some services continue running after shutdown
Cause:
- Signal handler may not have triggered shutdown
- Some processes may be orphaned
Solution:
- Use the stop command to cleanly terminate all processes:
pants hd stop
This will find and kill the main orchestrator process and all service subprocesses.
- If you prefer manual cleanup:
- Press Ctrl+C again (multiple times if needed)
- Find and kill processes manually:
# Find pants/python processes for hippodrome services
ps aux | grep -E "(fake_cell|controller)" | grep -v grep
# Kill by PID
kill -9 <PID>
- Kill processes by port:
kill -9 $(lsof -ti :9001) $(lsof -ti :9002) 2>/dev/null
Problem: Django settings module not found
Symptoms:
- Controller fails with "ModuleNotFoundError: No module named 'config'"
- Error related to DJANGO_SETTINGS_MODULE
- Traceback shows Django trying to import config.settings
Cause:
Django's module system doesn't work well with Pants' sandbox environment. When running via pants run //components/controller/manage.py, Pants creates a sandbox that doesn't have the correct Python path structure for Django imports.
Solution: The controller now runs using its local venv instead of pants (as of iteration 4). If you're seeing this error:
- Ensure the controller venv exists:
ls components/controller/.venv
- If it doesn't exist, the orchestrator will create it automatically. You can also create it manually:
cd components/controller
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
- Verify Django is installed:
components/controller/.venv/bin/python -c "import django; print(django.VERSION)"
Note: The orchestrator's _setup_controller_venv() method automatically creates and configures the venv if it's missing.
Problem: Services stuck waiting for pants invocation
Symptoms:
- Log shows "Another pants invocation is running. Will wait up to 60.0 seconds..."
- Services fail to start after the timeout
- Only affects services run via pants (fake_cell)
Cause: By default, pants doesn't allow concurrent invocations. When the orchestrator starts fake_cell via pants while another pants process is running (including the orchestrator itself), it waits for the lock.
Solution:
Run the local stack with PANTS_CONCURRENT=True set in the environment:
PANTS_CONCURRENT=True pants hd up
This allows multiple pants processes to run concurrently.
Problem: E-commerce services not starting
Symptoms:
- admin_server and search_proxy are not started
- Only fake_cell, controller, and console are running
- Dashboard at http://localhost:9000 shows only core services
Cause:
E-commerce services are not included in the default core profile. They require the ecom or full profile to be explicitly selected.
Solution:
Use the --profile ecom flag to start e-commerce services:
pants hd up --profile ecom
Available profiles:
- core (default): fake_cell, controller, console
- ecom: core + admin_server, search_proxy
- full: all services
Problem: admin_server fails to start - validation error
Symptoms:
- admin_server exits immediately after starting
- Error message: "ValidationError" or "field required"
- Logs show missing environment variable
Cause: admin_server requires several environment variables for configuration. The orchestrator sets placeholder values, but some features may require real values.
Solution:
- For basic local development, the default placeholder values should work
- For Shopify integration, create a .env file in components/shopify/admin_server/ with real credentials:
SHOPIFY_API_KEY=your-api-key
SHOPIFY_API_SECRET=your-api-secret
- For AWS integrations, ensure valid AWS credentials are available in your environment
Problem: search_proxy fails to start - wrangler not found
Symptoms:
- search_proxy shows status "failed"
- Error message: "npx: command not found" or "wrangler: command not found"
Cause: search_proxy is a Cloudflare Worker that requires Node.js and wrangler to run locally.
Solution:
- Install Node.js if missing: https://nodejs.org/
- Install dependencies in search_proxy:
cd components/search_proxy
npm install
- Verify wrangler works:
cd components/search_proxy
npx wrangler --version
Problem: Connected to wrong DynamoDB tables
Symptoms:
- Data operations affect unexpected tables
- Table not found errors
- Reading/writing to tables from another developer's branch
Cause:
The table prefix defaults to dev-{git-branch}-. If you're on a different branch than expected, or the branch name contains special characters, the table names may not match what you expect.
Solution:
- Check the startup message for the table prefix being used:
Table prefix: dev-feature-auth-
- Use --table-prefix to explicitly set the prefix:
pants hd up --profile ecom --table-prefix my-feature-
- Branch names with slashes are sanitized:
feature/add-auth → dev-feature-add-auth-
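To predict which prefix a given branch will produce, the sanitization can be approximated as below. This is a sketch, not the orchestrator's actual code: the only confirmed behavior is the `dev-{git-branch}-` format and the slash example above; treating every other non-alphanumeric character the same way is an assumption, and `table_prefix` is a hypothetical name.

```python
import re


def table_prefix(branch: str) -> str:
    """Build a dev-{branch}- prefix, replacing unsafe characters with hyphens."""
    # Assumption: anything other than letters, digits, and hyphens becomes "-".
    safe = re.sub(r"[^A-Za-z0-9-]+", "-", branch).strip("-")
    return f"dev-{safe}-"


if __name__ == "__main__":
    print(table_prefix("feature/add-auth"))
```

If the printed value differs from the "Table prefix:" line in the startup message, trust the startup message and pass --table-prefix explicitly.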
Problem: Production warning when starting
Symptoms:
- Red warning message: "WARNING: Connecting to PRODUCTION cell!"
- Concern about affecting production data
Cause:
You're using --cell prod which connects to the production cell instead of fake_cell.
Solution:
- If this was intentional, proceed carefully - you're working with production data
- If unintentional, stop the orchestrator and restart without --cell prod:
pants hd up --profile ecom
# or explicitly use local cell:
pants hd up --profile ecom --cell local
- For testing against non-production deployed infrastructure, use --cell staging
Problem: fake_cell not starting with --cell flag
Symptoms:
- Using --cell staging or --cell prod
- fake_cell is not running (expected)
- Services can't connect to cell
Cause:
When using --cell staging or --cell prod, fake_cell is intentionally skipped because services connect to the real deployed cell. However, network issues or authentication may prevent connection.
Solution:
- Verify AWS credentials are configured for the target cell
- Check VPN connection if required for staging/prod access
- Verify the cell is accessible:
# For staging cell
curl -I https://n6wwdwmk2m.execute-api.us-east-1.amazonaws.com/prod/health
- If you need local development without real cell access, use --cell local (the default)
Problem: Search requests fail - global_worker not running
Symptoms:
- Search requests to http://localhost:9005/indexes/.../search fail
- Error message: "ECONNREFUSED" or "Failed to fetch from global_worker"
- search_proxy logs show connection error to port 9012
Cause: The global_worker is an external service (separate repository) that must be running for search requests to work. When search_proxy receives a search request, it routes to global_worker which then forwards to the cell.
Solution:
- The global_worker is in a separate repository. Clone and set it up:
cd ~/dev
git clone git@github.com:marqo-ai/global-worker.git
cd global-worker
npm install
- Create a wrangler.local.toml configuration pointing to fake_cell:
name = "local-global-worker"
main = "src/index.ts"
compatibility_date = "2024-09-23"
compatibility_flags = ["nodejs_compat"]
[vars]
ENV = "dev"
CELL_URL = "http://localhost:9001"
[dev]
port = 9012
local_protocol = "http"
- Start global_worker:
npx wrangler dev --config wrangler.local.toml --port 9012
- If you don't need search functionality, you can ignore this - admin operations will still work.
Problem: global_worker can't connect to fake_cell
Symptoms:
- global_worker is running on port 9012
- Search requests fail with "connection refused" to port 9001
- global_worker logs show errors connecting to CELL_URL
Cause:
- fake_cell is not running (check with --cell flag)
- CELL_URL in global_worker's config is wrong
- hippodrome not started with correct profile
Solution:
- Verify fake_cell is running (dashboard shows it at http://localhost:9000)
- Check global_worker's wrangler.local.toml has CELL_URL = "http://localhost:9001"
- Ensure hippodrome is running:
pants hd up --profile ecom
- If using --cell staging or --cell prod, update global_worker's CELL_URL to match the real cell URL
Problem: agentic_search fails to start
Symptoms:
- agentic_search shows status "failed" in dashboard
- Error message: "npx: command not found" or "wrangler: command not found"
- Error about missing dependencies
Cause: agentic_search is a Cloudflare Worker that requires Node.js, npm, and its dependencies to be installed.
Solution:
- Verify Node.js and npm are installed:
node --version
npm --version
- Install dependencies in agentic_search directory:
cd components/agentic_search
npm install
- Verify wrangler works:
npx wrangler --version
- If you see .dev.vars errors, create the file with required credentials:
cd components/agentic_search
cat > .dev.vars << EOF
GOOGLE_API_KEY=your-google-api-key
AGENTIC_AWS_ACCESS_KEY_ID=your-aws-access-key
AGENTIC_AWS_SECRET_ACCESS_KEY=your-aws-secret-key
EOF
Problem: agentic_search can't communicate with search_proxy
Symptoms:
- agentic_search starts successfully on port 9007
- API calls to agentic_search fail with service binding errors
- Logs show "SEARCH_PROXY_WORKER is not defined" or similar
Cause:
In local development mode, Cloudflare service bindings don't work. agentic_search uses SEARCH_PROXY_WORKER service binding in production, but needs HTTP fallback for local development.
Solution:
- Verify SEARCH_PROXY_URL is set in the environment. The orchestrator sets this automatically:
# Check the orchestrator config sets this
SEARCH_PROXY_URL=http://localhost:9005
- Ensure search_proxy is running on port 9005
- The agentic_search code should fall back to HTTP when SEARCH_PROXY_URL is set. If you see service binding errors, the HTTP fallback code may not be working correctly.
- Check that wrangler.local.toml has the correct configuration:
[vars]
SEARCH_PROXY_URL = "http://localhost:9005"
Problem: Cloudflare Worker Durable Objects not persisting
Symptoms:
- Durable Object data (like conversation history in agentic_search) is lost between restarts
- Each request creates a new Durable Object instance
Cause: By default, wrangler dev does not persist Durable Object state. State is stored in memory and lost when the worker restarts.
Solution:
- Use the --persist flag when running wrangler dev:
npx wrangler dev --config wrangler.local.toml --port 9007 --persist
- Data is stored in the .wrangler/state/ directory
- Note: The orchestrator doesn't add --persist by default. You can:
  - Stop the orchestrator and run agentic_search manually with --persist
  - Or accept that state will be lost on restart (usually fine for development)
Problem: AWS SSO token expired - services fail to validate tables
Symptoms:
- admin_server logs show: "Could not validate X table: Error when retrieving token from sso: Token has expired"
- ecom services start but show warning messages
- DynamoDB operations fail with authentication errors
Cause: E-commerce services (admin_server, ecom_indexer, etc.) need AWS credentials to access DynamoDB tables. When using AWS SSO, the temporary credentials expire and need to be refreshed.
Solution:
- Refresh your AWS SSO credentials:
aws sso login --profile your-profile-name
- Export the profile if needed:
export AWS_PROFILE=your-profile-name
- Restart hippodrome to pick up the new credentials
- Note: Services can still start with expired credentials (they show warnings) but DynamoDB operations will fail
Prevention:
- Refresh SSO credentials before starting hippodrome
- Consider using long-lived credentials for local development (in
~/.aws/credentials)
Problem: Multiple Cloudflare Workers conflict on debug port
Symptoms:
- Error: "Address already in use" for port 9229
- Second Cloudflare Worker fails to start
- agentic_search or search_proxy won't start when both are running
Cause: Wrangler's Node.js debugger uses port 9229 by default. When running multiple workers, they conflict on this port.
Solution:
- The workers should use different inspector ports. Check if --inspector-port is set in the commands.
- If you see this error, you can give one worker a different inspector port:
# Run agentic_search with its inspector on port 9230
npx wrangler dev --config wrangler.local.toml --port 9007 --inspector-port 9230
- Or kill the process using port 9229:
lsof -ti :9229 | xargs kill -9
- Restart hippodrome
Problem: KV namespace errors in Cloudflare Workers
Symptoms:
- Error: "KV namespace binding not found"
- search_proxy or agentic_search fails to read/write KV data
- 500 errors from workers when accessing cached data
Cause: KV namespaces in wrangler.local.toml use placeholder IDs that wrangler dev accepts, but the local KV storage starts empty.
Solution:
- For basic functionality, the workers should handle missing KV data gracefully
- If you need specific KV data for testing, use the wrangler CLI to populate local KV:
cd components/search_proxy
npx wrangler kv:key put --binding SEARCH_PROXY_KV "key-name" "value" --local
- Local KV data is stored in .wrangler/state/ and can be cleared:
rm -rf components/search_proxy/.wrangler/state/
rm -rf components/agentic_search/.wrangler/state/
Problem: EventBridge webhook returns "No route configured"
Symptoms:
- POST to /webhook/eventbridge returns {"status": "ignored", "reason": "No route configured for detail-type: ..."}
- Events are not forwarded to local services
Cause:
The detail-type in the event doesn't match any configured route prefix. Routes are configured for EcomIndexSettings.* and Merchandising.* only.
Solution:
- Verify your event has the correct detail-type field:
  - For index settings: EcomIndexSettings.MODIFY, EcomIndexSettings.INSERT, EcomIndexSettings.REMOVE
  - For merchandising: Merchandising.MODIFY, Merchandising.INSERT, Merchandising.REMOVE
- Check your event format matches the expected structure:
{"source": "marqo.dynamodb", "detail-type": "EcomIndexSettings.MODIFY", "detail": { ... }}
- If you need to route events to a new service, add the route to EVENTBRIDGE_ROUTES in dashboard.py
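The prefix matching described in the Cause can be pictured roughly as follows. This is an illustrative sketch only: the real shape of EVENTBRIDGE_ROUTES in dashboard.py is not shown in this guide, so the `ROUTES` dict, the `resolve_route` helper, and the "forward" status are assumptions. The target ports are the ones this guide lists for ecom_settings_exporter (9010) and merchandising_exporter (9011), and pairing each prefix with that exporter is also assumed.

```python
# Hypothetical stand-in for EVENTBRIDGE_ROUTES; the real structure may differ.
ROUTES = {
    "EcomIndexSettings.": "http://localhost:9010",  # assumed: ecom_settings_exporter
    "Merchandising.": "http://localhost:9011",      # assumed: merchandising_exporter
}


def resolve_route(event: dict) -> dict:
    """Pick a forwarding target by matching detail-type against route prefixes."""
    detail_type = event.get("detail-type")
    if detail_type is None:
        return {"error": "Missing 'detail-type' field in event"}
    for prefix, target in ROUTES.items():
        if detail_type.startswith(prefix):
            return {"status": "forward", "target": target}
    return {
        "status": "ignored",
        "reason": f"No route configured for detail-type: {detail_type}",
    }


if __name__ == "__main__":
    print(resolve_route({"detail-type": "EcomIndexSettings.MODIFY"}))
    print(resolve_route({"detail-type": "Orders.INSERT"}))
```

Under this model, adding a route for a new service amounts to adding one more prefix entry.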
Problem: EventBridge webhook returns service unavailable
Symptoms:
- POST to /webhook/eventbridge returns status 503
- Response contains {"error": "Service unavailable: ..."}
- Logs show "Failed to forward event to localhost:9010"
Cause: The target service (ecom_settings_exporter or merchandising_exporter) is not running on the expected port.
Solution:
- Verify the target service is running with the ecom profile:
pants hd up --profile ecom
- Check the dashboard at http://localhost:9000 for service status
- Test the service directly:
# For ecom_settings_exporter
curl http://localhost:9010/healthz
# For merchandising_exporter
curl http://localhost:9011/healthz
- If the service shows as failed, check its logs in the dashboard or terminal output
Problem: EventBridge webhook returns 400 - missing detail-type
Symptoms:
- POST to /webhook/eventbridge returns {"error": "Missing 'detail-type' field in event"}
- HTTP status 400
Cause:
The event JSON is missing the required detail-type field.
Solution:
- Ensure your event includes detail-type (note the hyphen, not underscore):
{"source": "marqo.dynamodb", "detail-type": "EcomIndexSettings.MODIFY", "detail": { ... }}
- Common mistake: Using detailType or detail_type instead of detail-type
- Verify JSON is valid:
echo '{"detail-type": "test"}' | jq .
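When building test events by hand, a quick local check can catch the misspelled key before the webhook rejects it. This is a small sketch; `check_detail_type` is a hypothetical helper, and the misspellings it looks for are the two named above.

```python
import json

COMMON_MISSPELLINGS = ("detailType", "detail_type")


def check_detail_type(raw_event: str) -> str:
    """Return "ok" or a hint describing what is wrong with the event's key."""
    event = json.loads(raw_event)  # raises json.JSONDecodeError on invalid JSON
    if "detail-type" in event:
        return "ok"
    for key in COMMON_MISSPELLINGS:
        if key in event:
            return f"rename '{key}' to 'detail-type'"
    return "add a 'detail-type' field"


if __name__ == "__main__":
    print(check_detail_type('{"detail_type": "EcomIndexSettings.MODIFY"}'))
```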
Problem: Events not reaching local services from AWS
Symptoms:
- Events sent from AWS EventBridge don't reach the local webhook
- Manual webhook testing works fine
- No logs showing incoming events
Cause: The tunnel or EventBridge rule may not be set up correctly, or AWS credentials may be missing.
Solution:
- Use the --enable-events flag to automatically set up tunneling and EventBridge rules:
pants hd up --profile ecom --enable-events
- Ensure you have:
  - cloudflared installed and in your PATH
  - AWS credentials configured with EventBridge and IAM permissions
  - Run aws sso login if using AWS SSO
- Check the startup output for:
  - [tunnel] EventBridge webhook exposed at ... - confirms tunnel is working
  - [eventbridge] Rule '...' created for ... - confirms rule was created
- If rule creation fails but tunnel works, you can manually create a rule:
  - Copy the tunnel URL from the startup output
  - Create an EventBridge rule in the AWS Console targeting that URL
- For manual testing without --enable-events, use curl:
curl -X POST http://localhost:9000/webhook/eventbridge \
  -H "Content-Type: application/json" \
  -d '{"detail-type": "EcomIndexSettings.MODIFY", "detail": {...}}'
Problem: EventBridge rule creation fails
Symptoms:
- Tunnel starts successfully
- "[eventbridge] Failed to create EventBridge rule" message appears
- Events are not forwarded from AWS
Cause: Missing AWS credentials or insufficient IAM permissions.
Solution:
- Ensure AWS credentials are configured:
aws sts get-caller-identity  # Should show your identity
aws sso login  # If using AWS SSO
- Required IAM permissions:
  - events:PutRule, events:PutTargets, events:DeleteRule, events:RemoveTargets
  - events:CreateConnection, events:DeleteConnection, events:DescribeConnection
  - events:CreateApiDestination, events:DeleteApiDestination, events:DescribeApiDestination
  - iam:CreateRole, iam:PutRolePolicy, iam:GetRole (for first-time setup)
- The orchestrator will still work without the rule - you can test manually with curl
Problem: EventBridge Pipes not running or missing
Symptoms:
- Validation script shows [MISSING] or [WARN] for pipes
- EventBridge events never reach the custom event buses
- No events forwarded to local services even with --enable-events
Cause: The CDK infrastructure (EventBridge Pipes) hasn't been deployed or failed to deploy.
Solution:
- Run the validation script to check infrastructure status:
python components/hippodrome/scripts/validate_eventbridge.py --table-prefix dev-main- --verbose
- If pipes are missing, deploy the CDK stacks:
# Deploy ecom stack (creates EcomEventBus and IndexSettingsEventPipe)
cd infra/ecom
npx cdk deploy --require-approval never
# Deploy controller stack (creates MerchandisingEventBus and MerchandisingEventPipe)
cd ../controller
npx cdk deploy MerchandisingStack --require-approval never
- If pipes exist but aren't RUNNING, check CloudWatch Logs for errors:
- Go to CloudWatch Logs in the AWS Console
- Search for log groups containing "EventBridge" or "Pipes"
- Check for permission errors or configuration issues
- Verify DynamoDB streams are enabled on the source tables
Problem: EventBridge events not appearing in console
Symptoms:
- EventBridge Pipes are RUNNING
- Local webhook never receives events
- No activity in EventBridge console metrics
Cause: The custom event buses may not have any subscribers, or the DynamoDB table isn't receiving writes.
Solution:
- Verify the event bus exists and has the correct name format:
  - For table prefix dev-main-, event buses should be dev-main-EcomEventBus and dev-main-MerchandisingEventBus
- Make a test write to trigger an event:
aws dynamodb put-item \
  --table-name dev-main-EcomIndexSettingsTable \
  --item '{"pk": {"S": "test"}, "sk": {"S": "INDEX#test"}}'
- Check EventBridge metrics in CloudWatch:
- Go to CloudWatch → Metrics → Events → By Event Bus
- Look for MatchedEvents and FailedInvocations metrics
- Enable EventBridge archive/logging for debugging:
- In EventBridge console, create an archive rule to capture all events
- This helps identify if events are reaching the bus
Problem: Pants resolve conflicts when starting Lambda services
Symptoms:
- Error: "Targets that are in different resolves cannot be mixed"
- Lambda wrappers fail to build with dependency resolution errors
- Services like ecom_indexer, ecom_settings_exporter, merchandising_exporter fail to start
Cause: Lambda services (ecom_indexer, ecom_settings_exporter, merchandising_exporter) each have their own Pants resolve. If hippodrome tries to run wrapper scripts that import from these services while using a different resolve, the build fails.
Solution:
The Lambda services should be started using their native :local targets in their respective directories, not through wrapper scripts in hippodrome:
//components/ecom_indexer:local
//components/ecom_settings_exporter:local
//components/merchandising_exporter:local
//components/shopify/admin_server:local
The config.py's create_python_lambda_service function should use f"//components/{name}:local" format.
Problem: API key authentication fails with "Invalid API key"
Symptoms:
- Requests to admin_server return 401 "Invalid API key"
- admin_server logs show: "API key encryption secret not available, cannot decrypt API key"
- admin_server logs show: "Failed to extract system account ID from API key"
Cause:
Marqo API keys are encrypted. The admin_server needs access to the encryption secret (either from AWS Secrets Manager or via MARQO_API_KEY_SECRET environment variable) to decrypt them. In local development, AWS Secrets Manager access may not be available.
Solution:
- Ensure you have valid AWS credentials:
aws sso login --profile staging
export AWS_PROFILE=staging
- The hippodrome config.py should automatically fetch the encryption secret from AWS Secrets Manager (dev/api_key_encryption_key_secret) and set it as MARQO_API_KEY_SECRET.
- If the automatic fetch fails, you can set the secret manually (this still requires working AWS credentials in your shell):
export MARQO_API_KEY_SECRET=$(aws secretsmanager get-secret-value --secret-id dev/api_key_encryption_key_secret --query 'SecretString' --output text)
- Restart hippodrome after setting credentials
Problem: API key validation fails with KeyError for cell ID
Symptoms:
- Requests to admin_server return 500 error
- Error message: KeyError: 'S' or KeyError: 'D'
- Traceback shows error in ControlPlaneGateway looking up cell in DATA_PLANE_CELLS
Cause:
The API key was created for a different cell (e.g., "S" for staging) but hippodrome's DATA_PLANE_CELLS only contains the "local" cell configuration. The admin_server tries to validate the API key against the cell specified in the key, but that cell isn't configured.
Solution:
The config.py's get_data_plane_cells_json function should include staging cell configuration when using local cell, so that API keys created for staging can be validated:
# In config.py - get_data_plane_cells_json should include staging config for local cell
if cell == Cell.LOCAL:
    staging_config = CELL_CONFIGS[Cell.STAGING]
    cells_config[staging_config.id] = {
        "aws_region": staging_config.region,
        "gateway_id": staging_config.gateway_id,
    }
This allows staging API keys to be validated against the staging cell while other operations use fake_cell.
Problem: E2E tests fail with ResourceNotFoundException for DynamoDB tables
Symptoms:
- Tests return 500 error with ResourceNotFoundException when calling PutItem
- Error mentions table like dev-hippodrome-EcomIndexSettingsTable not found
- admin_server logs show DynamoDB table not found
Cause:
The hippodrome uses a table prefix based on the git branch (e.g., dev-hippodrome-). These DynamoDB tables don't exist in AWS - they would need to be created, or an existing table prefix should be used.
Solution:
- Use an existing table prefix by specifying --table-prefix when starting hippodrome:
pants hd up --profile ecom --table-prefix dev-main-
- Or create the tables for your branch using the CDK infrastructure:
cd infra/ecom
npx cdk deploy --context tablePrefix=dev-hippodrome-
- Or run e2e tests against a deployed dev environment instead of hippodrome, using the environment variables in .env.dev
Note: Running e2e tests against hippodrome requires DynamoDB tables to exist with the matching prefix. The fake_cell handles cell operations but DynamoDB tables are always in AWS.
Problem: E2E tests require specific environment configuration
Symptoms:
- Tests fail with various errors when run against hippodrome
- Errors about missing environment variables or configuration
Cause: E2E tests are designed to run against deployed AWS infrastructure. Running them against hippodrome requires specific configuration.
Solution: Run e2e tests with these environment variables:
ENVIRONMENT=dev \
ENV_BRANCH=hippodrome \
ADMIN_SERVER_BASE_URL=http://localhost:9004 \
SEARCH_PROXY_URL=http://localhost:9005 \
AWS_PROFILE=staging \
pants test //components/shopify/e2e_tests/e2e_tests/tests/ecom_onboarding_test.py -- -v
Key variables:
- ENVIRONMENT=dev - Tells tests to use dev environment
- ENV_BRANCH - Must match your hippodrome table prefix (default is git branch)
- ADMIN_SERVER_BASE_URL - Points to local admin_server (port 9004)
- SEARCH_PROXY_URL - Points to local search_proxy (port 9005)
- AWS_PROFILE - AWS profile with access to DynamoDB tables
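A pre-flight check can confirm these variables are set before invoking the test target. This is a minimal sketch: the variable names come from the list above, while `missing_vars` is a hypothetical helper and not something the e2e suite provides.

```python
import os

# Variable names taken from the "Key variables" list above.
REQUIRED_VARS = (
    "ENVIRONMENT",
    "ENV_BRANCH",
    "ADMIN_SERVER_BASE_URL",
    "SEARCH_PROXY_URL",
    "AWS_PROFILE",
)


def missing_vars(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All e2e environment variables are set")
```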
Problem: Marqo API calls return 401 "Invalid API Key" despite successful decryption
Symptoms:
- admin_server logs show: "HTTP error fetching indexes: 401 - {"error":"Invalid API Key."}"
- The encryption secret is retrieved successfully (from env var MARQO_API_KEY_SECRET)
- API key decryption appears to work (can decrypt to get system_account_id, cell, token)
- But calls to Marqo API fail with 401
Cause:
The admin_server stores encrypted API keys in the ShopifyApiKeyRecord.marqo_api_key field. When calling Marqo, it passes this encrypted key directly to MarqoClient, which sends it in the x-api-key header. Marqo expects the decrypted token, not the encrypted key.
The flow is:
- User sends encrypted API key as Bearer token
- Admin_server decrypts it to get {system_account_id, cell, token}
- Admin_server stores the ENCRYPTED key in DynamoDB (marqo_api_key field)
- Later, when calling Marqo, it retrieves api_key_record.marqo_api_key (encrypted)
- Passes encrypted key to MarqoClient(api_key=encrypted_key)
- Marqo receives encrypted key and rejects it with 401
Solution: This is a design issue in the admin_server that shows up in local development. It doesn't manifest when the admin_server is deployed, because there the encrypted key in the database matches what Marqo expects (there may be additional infrastructure handling this).
For now, running e2e tests against hippodrome requires either:
- Running against a deployed dev environment (using AWS-hosted admin_server) instead of local admin_server
- Or modifying the admin_server code to decrypt the stored key before calling Marqo
To run e2e tests against deployed infrastructure:
# Don't set ADMIN_SERVER_BASE_URL - let it default to deployed AWS environment
ENVIRONMENT=dev \
ENV_BRANCH=admin-api \
pants test //components/shopify/e2e_tests/e2e_tests/tests:e2e_tests -- -k ecom_onboarding -v
Problem: fake_cell stops responding under concurrent load (event loop hang)
Symptoms:
- HTTP requests to fake_cell (:9001) time out or return connection errors
- The fake_cell process is still alive (ss -tlnp | grep 9001 shows LISTEN)
- The process is consuming 80-95% CPU with a single thread
- The TCP accept backlog grows (visible in ss output as large numbers)
- Other services that depend on fake_cell (search_proxy, controller) also start failing
Cause: fake_cell previously ran both the main API and fake index servers in a shared asyncio event loop. Under concurrent HTTP load, the loop could enter CPU-spinning state. This was fixed by running each fake index server in its own dedicated thread with a separate event loop.
Solution: The root cause has been addressed — each fake index server now runs in a dedicated thread. Additionally, two layers of self-healing provide defense in depth:
- fake_cell watchdog (in-process): A daemon thread checks /healthz every 5 seconds. After 2 consecutive failures (~10-15s), it saves state to disk and force-exits with os._exit(1).
- hippodrome health monitor (orchestrator-level): Periodically probes all RUNNING services with auto_restart=True. After 3 consecutive health failures (~30s), hippodrome restarts the service.
Both mechanisms work with hippodrome's auto_restart to bring the service back within seconds. State is persisted — accounts, API keys, and index metadata survive restarts automatically via /tmp/fake_cell_state.json.
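Both layers rely on the same idea: act only after N consecutive failed probes, and reset the count on any success. The sketch below illustrates that rule in isolation. It is not the actual watchdog or health monitor code; `FailureCounter` and `probes_until_trip` are hypothetical names, and the thresholds of 2 and 3 are the ones quoted above.

```python
class FailureCounter:
    """Trip after `threshold` consecutive failed probes; any success resets the count."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when the threshold has been reached."""
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.threshold


def probes_until_trip(probe, counter: FailureCounter, max_probes: int):
    """Run up to max_probes probes; return the probe number that tripped, or None."""
    for n in range(1, max_probes + 1):
        if counter.record(probe()):
            return n
    return None


if __name__ == "__main__":
    # One healthy probe followed by failures, checked with the watchdog's threshold of 2.
    results = iter([True, False, False, False])
    print(probes_until_trip(lambda: next(results), FailureCounter(2), 4))
```

Requiring consecutive failures keeps a single slow response from triggering a restart.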
If self-healing fails or fake_cell is stuck:
- Kill the fake_cell process:
kill -9 $(ss -tlnp 2>/dev/null | grep ':9001 ' | grep -oP 'pid=\K[0-9]+')
- hippodrome will auto-restart it within a few seconds
- State (accounts, API keys, indexes) is restored automatically from disk
Note: The --reload flag is disabled for fake_cell because uvicorn's reload mode causes deadlocks under concurrent load.
Problem: Controller returns 500 for missing accounts/records instead of 404
Symptoms:
- Controller endpoints return HTTP 500 with "Record does not exist" in the response body
- Controller endpoints return 500 with "Account not found" when accountId doesn't match any account
- Fuzz tests flag these as server errors but they're actually "not found" scenarios
Cause: Two issues combined:
- get_system_account_id() raised a generic Exception("Account not found") instead of a domain-specific error
- RecordNotFoundError and RecordAccessDeniedError from the DDB client could bubble up through views that lacked specific exception handling
Solution: Fixed at multiple layers:
- get_system_account_id() now raises AccountRecordNotFoundError (subclass of RecordNotFoundError)
- A DomainExceptionMiddleware in config/middleware.py catches unhandled RecordNotFoundError → 404 and RecordAccessDeniedError → 403 at the Django level
- The auth backend explicitly raises AuthenticationFailed (401) for MembershipRecordNotFoundError instead of falling through to the generic except Exception handler
- All except Exception blocks in index v2 views, merchandise views, and integrations views now re-raise RecordNotFoundError/RecordAccessDeniedError before the generic handler, ensuring domain exceptions reach the middleware instead of being swallowed as 500s
These changes are transparent to existing views that already catch these exceptions explicitly.
- The DomainExceptionMiddleware also has a DEBUG-mode catch-all: any unhandled exception returns a JSON 500 response ({"error": "ExceptionType: message"}) instead of Django's default HTML error page. This ensures API clients always receive parseable JSON, even for unexpected failures.
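The status mapping described above can be sketched without Django. The exception classes below are local stand-ins for the real ones, to_json_response is a hypothetical helper, and reusing the "ExceptionType: message" body for the 404 and 403 cases is an assumption (the source only specifies that format for the 500 catch-all).

```python
import json
from typing import Tuple


class RecordNotFoundError(Exception):
    """Stand-in for the DDB client's not-found error."""


class RecordAccessDeniedError(Exception):
    """Stand-in for the DDB client's access-denied error."""


class AccountRecordNotFoundError(RecordNotFoundError):
    """Subclass, so it is handled by the RecordNotFoundError branch."""


def to_json_response(exc: Exception) -> Tuple[int, str]:
    """Map an unhandled exception to an HTTP status and a JSON body."""
    if isinstance(exc, RecordNotFoundError):
        status = 404
    elif isinstance(exc, RecordAccessDeniedError):
        status = 403
    else:
        status = 500  # catch-all: still JSON, never an HTML error page
    body = {"error": f"{type(exc).__name__}: {exc}"}
    return status, json.dumps(body)
```

Because AccountRecordNotFoundError subclasses RecordNotFoundError, the isinstance check maps it to 404 without a dedicated branch.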
Problem: fake_cell _reindex_jobs race condition under concurrent load
Symptoms:
- Occasional RuntimeError or lost reindex job records when multiple reindex operations run concurrently
- Difficult to reproduce: only manifests under real concurrent load (e.g., fuzz testing)
Cause:
The module-level _reindex_jobs list was mutated (append + length check + slice delete) without synchronization. The compound operation was not atomic under concurrent access.
Solution:
Added a threading.Lock (_reindex_jobs_lock) to protect both the append+evict operation and the list query in list_reindex_jobs.
Problem: Services report "healthy" before fully initialized
Symptoms:
- Dependent services get connection errors or malformed responses despite health checks passing
- Race conditions during startup where a service accepts connections but isn't ready to serve
- ResourceNotFoundException errors from admin_lambda during the first few seconds after startup
Cause:
Services without an HTTP health_url configured used TCP-only port checks. A service can bind to its port and accept connections before the FastAPI/Django app is fully initialized (routes registered, middleware loaded, etc.).
Solution:
All FastAPI/Django services (fake_cell, fake_cognito, controller, admin_server, admin_lambda) now have health_url="/healthz" configured in their ServiceConfig. The orchestrator polls these HTTP endpoints during startup and marks a service healthy only when the app responds with 200.
Wrangler Workers (search_proxy, agentic_search) still use TCP-only checks since Wrangler doesn't expose a configurable health endpoint, but Wrangler only opens the port when fully ready.
The health check also now logs non-200 responses during startup polling, making it visible when a service is up but returning errors.
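The difference between a TCP-only check and HTTP polling can be sketched like this. It is an illustration rather than the orchestrator's code: http_status and wait_until_healthy are hypothetical names, and the timeouts are placeholder values.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout_s: float = 2.0) -> Optional[int]:
    """Return the HTTP status code, or None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the app answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # refused, reset, or timed out


def wait_until_healthy(
    get_status: Callable[[], Optional[int]],
    deadline_s: float = 60.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll until get_status returns 200 or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == 200:
            return True
        if status is not None:
            # Port is open and the app responds, but it is not ready yet.
            print(f"health check returned {status}, still waiting")
        time.sleep(interval_s)
    return False
```

A call such as wait_until_healthy(lambda: http_status("http://localhost:9001/healthz")) would then gate on the app itself rather than on the port being bound.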
Problem: Hippothesis fuzz test reports spurious 500 errors
Symptoms:
- _assert_ok assertions fail with "Internal server error", "Transport error", "Bad gateway", etc.
- Circuit breaker trips for one or more services
- _error_codes summary shows many label:500 entries
Cause:
These are infrastructure-level failures, not real application bugs. The fuzz test's _INFRA_ERRORS allowlist in cloud_machine.py recognizes known patterns:
- "Transport error": Connection drops (service process crashed or restarting)
- "ResourceNotFoundException": DDB table missing (moto not fully seeded)
- "Failed to authenticate": Auth failure (fake_cell slow/down)
- "Bad gateway": Network error to downstream service (502 from search_proxy, controller, or admin_lambda)
Previous entries removed after service hardening:
- "Internal server error": search_proxy now classifies network errors as 502 instead of generic 500
- "Service misconfigured": search_proxy env validation returns 503 with clear message
- "CSRF": controller now returns JSON 403 instead of HTML CSRF failure page
Solution:
- Ensure hippodrome has been running for at least 60-90 seconds before starting fuzz tests
- Check curl -s http://localhost:9000/api/status; all services should show running
- If a specific service keeps failing, check its log:
tail components/hippodrome/.logs/latest/<service>.log
- If errors persist after all services are healthy, the issue may be a real bug:
  - Check if the error body matches any _INFRA_ERRORS pattern; if not, it's a genuine assertion failure
  - Look at the _error_codes summary in the test output to identify which endpoints fail most
  - The circuit breaker (CB_OPEN=<service>) indicates which service is unhealthy
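Checking an error body against the allowlist can be sketched as below. The four patterns are the ones listed under Cause; the substring match and the is_infra_error helper are assumptions, and the real logic in cloud_machine.py may differ.

```python
_INFRA_ERRORS = (
    "Transport error",
    "ResourceNotFoundException",
    "Failed to authenticate",
    "Bad gateway",
)


def is_infra_error(body: str) -> bool:
    """True if the response body contains a known infrastructure pattern."""
    return any(pattern in body for pattern in _INFRA_ERRORS)
```

Under this sketch a body such as {"error": "Bad gateway"} counts as infrastructure noise, while {"error": "Internal server error"} no longer does and would surface as a genuine failure.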
Problem: Moto resource creation fails during startup
Symptoms:
- RuntimeError: Moto resource creation failed for: _create_tables (or _create_s3_buckets, etc.)
- RuntimeError: Failed to create DynamoDB tables: <table_suffix>
- Services that depend on DDB tables fail with ResourceNotFoundException
Cause: Moto creates AWS resources (tables, buckets, queues, Lambdas, secrets, state machines) concurrently at startup. If any creation fails, the error message now identifies which resource group(s) and which specific table(s) failed. Common causes:
- Network contention during startup (rare)
- Memory pressure from too many concurrent moto operations
- Moto bug triggered by specific table definitions
Solution:
- Check the moto server log for specific errors:
grep -i error components/hippodrome/.logs/latest/moto_server.log
- The log now shows "Moto resource group ready: <name>" for each successful group and "Moto resource group failed: <name>" for failures
- For table failures, the log shows "Failed to create table <suffix>: <error>"
- Restart hippodrome; transient failures usually resolve on retry
- If a specific table consistently fails, check its definition in wrappers/moto_server.py:TABLE_DEFINITIONS
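How concurrent creation can still report every failed group by name is sketched below. This is not the moto wrapper itself: create_all is a hypothetical helper, and the creator functions passed to it are placeholders. Only the message formats are taken from the entries above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def create_all(creators: Iterable[Callable[[], None]]) -> None:
    """Run every creator concurrently; raise one error naming all failures."""
    failed: List[str] = []
    with ThreadPoolExecutor() as pool:
        # Submit everything first so the groups are created in parallel.
        futures = {pool.submit(fn): fn.__name__ for fn in creators}
        for future, name in futures.items():
            try:
                future.result()
                print(f"Moto resource group ready: {name}")
            except Exception:
                print(f"Moto resource group failed: {name}")
                failed.append(name)
    if failed:
        raise RuntimeError(
            "Moto resource creation failed for: " + ", ".join(failed)
        )
```

Collecting failures instead of raising on the first one is what lets a single error message list every group that needs attention.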
Problem: Controller returns 502 Bad Gateway for upstream service failures
Symptoms:
- Controller API returns {"error": "Bad gateway"} with status 502
- Usually during merchandise/search operations that call search_proxy or admin_server
Cause:
The controller's DomainExceptionMiddleware classifies requests.ConnectionError, requests.Timeout, and other requests.RequestException errors as 502 Bad Gateway. This means the controller couldn't reach an upstream service (search_proxy, admin_server, or fake_cell).
Solution:
- Check that the upstream service is running:
curl -s http://localhost:9000/api/status
- Check the upstream service's health endpoint directly (e.g., curl http://localhost:9005/ for search_proxy)
- If the upstream service is running, it may be overloaded; wait a moment and retry
- Check the upstream service log:
tail components/hippodrome/.logs/latest/<service>.log
Problem: fake_cell state save race condition in tests
Symptoms:
- Warning: Failed to save state to /tmp/fake_cell_state.json
- FileNotFoundError: [Errno 2] No such file or directory: '/tmp/fake_cell_state.tmp' -> '/tmp/fake_cell_state.json'
- Occurs during parallel test execution (pytest-xdist)
Cause:
Multiple test workers share the same predictable temp file path (/tmp/fake_cell_state.tmp). When workers save concurrently, the first worker's rename moves the shared temp file into place, so the second worker's rename fails because its source file no longer exists.
Solution:
Fixed in state_manager.py — now uses tempfile.mkstemp() with a unique random suffix per write, and Path.replace() for atomic rename. No manual intervention needed.
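The write-then-rename pattern looks roughly like this. It is a sketch of the technique named above, not a copy of state_manager.py; save_state and its parameters are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(state: dict, target: Path) -> None:
    """Write state to a uniquely named temp file, then rename it into place."""
    # mkstemp picks a random unique name, so concurrent writers never share
    # a temp file. Creating it next to the target keeps the rename on one
    # filesystem, which is what makes the replace atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.stem + "_", suffix=".tmp"
    )
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
        tmp_path.replace(target)  # readers see the old or the new file
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # do not leave temp files behind
        raise
```

With concurrent writers the last rename wins, but no writer can remove another writer's temp file.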
Problem: Deep health check shows "unhealthy" with unclear error
Symptoms:
- Deep health check returns "error_type": "connection_refused" (the service is not listening on its port yet)
- Deep health check returns "error_type": "timeout" (the service is listening but not responding in time)
- Deep health check returns "error_type": "http_error" with "status_code" (the service responded with an error)
Cause: Different root causes require different remediation:
- connection_refused: Service hasn't finished starting. Wait longer or check logs for startup errors.
- timeout: Service is overloaded or hung. Check for deadlocks in the service log.
- http_error: Service is running but returning errors. Check the status code and service logs.
- parse_error: Service returned unexpected response format. Check for version mismatches.
Solution:
- Check the error_type field in the health check response for targeted diagnosis
- For connection_refused: tail -f components/hippodrome/.logs/latest/<service>.log
- For timeout: increase request_timeout_seconds or investigate service performance
- For http_error: check the service-specific error at the reported status code
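One way a probe failure could be mapped onto these error_type values is sketched below. classify_probe_error is a hypothetical helper that assumes the probe uses urllib; the "unknown" fallback is an assumption and is not one of the documented values.

```python
import socket
import urllib.error


def classify_probe_error(exc: BaseException) -> dict:
    """Translate a probe exception into an error_type payload."""
    if isinstance(exc, urllib.error.HTTPError):
        return {"error_type": "http_error", "status_code": exc.code}
    if isinstance(exc, urllib.error.URLError):
        # urllib wraps the underlying socket error in URLError.reason.
        if isinstance(exc.reason, BaseException):
            exc = exc.reason
    if isinstance(exc, ConnectionRefusedError):
        return {"error_type": "connection_refused"}
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return {"error_type": "timeout"}
    if isinstance(exc, ValueError):
        # json.JSONDecodeError subclasses ValueError.
        return {"error_type": "parse_error"}
    return {"error_type": "unknown"}  # assumption: not a documented value
```

HTTPError is checked first because it subclasses URLError; reversing the order would hide the status code.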
Problem: Service in higher layer starts before its dependencies are ready
Symptoms:
- Controller logs show connection errors to fake_cell/fake_cognito during first few seconds
- admin_server fails to register with search_proxy because controller isn't ready yet
- Cascading 502 errors during startup that resolve after a few seconds
Cause: Services were starting nearly simultaneously regardless of dependency order. Higher-layer services (like controller at Layer 1) could attempt to connect to lower-layer services (like fake_cell at Layer 0) before they were fully initialized.
Solution: This is now handled automatically by the layered startup system. Services are grouped into dependency layers (0-3) and each layer waits for health checks to pass before the next layer starts. If you still see startup ordering issues:
- Check the service's layer value in config/config.py
- Ensure the service has a health_url configured (services without health URLs are not gated)
- Check dashboard logs for [orchestrator] Layer N healthy/not healthy messages
- If a layer times out (60s), the next layer starts anyway; check for slow service initialization
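The layered startup can be pictured with the sketch below. It is a simplified model, not the orchestrator: start_in_layers, start, and wait_healthy are hypothetical names, and the layer numbers used in any call are illustrative apart from fake_cell at Layer 0 and controller at Layer 1.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Mapping


def start_in_layers(
    layers: Mapping[str, int],
    start: Callable[[str], None],
    wait_healthy: Callable[[List[str]], bool],
) -> List[List[str]]:
    """Start services layer by layer, gating each layer on health checks."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for name, layer in layers.items():
        grouped[layer].append(name)

    order: List[List[str]] = []
    for layer in sorted(grouped):
        names = sorted(grouped[layer])
        for name in names:
            start(name)
        # wait_healthy is expected to enforce its own timeout. On failure
        # the next layer still starts, matching the behavior noted above.
        if wait_healthy(names):
            print(f"[orchestrator] Layer {layer} healthy")
        else:
            print(f"[orchestrator] Layer {layer} not healthy")
        order.append(names)
    return order
```

For example, start_in_layers({"controller": 1, "fake_cell": 0}, start, wait_healthy) starts fake_cell, waits for it, and only then starts controller.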
Problem: Shutdown leaves orphan processes on ports
Symptoms:
- After stopping hippodrome, ports are still in use
- Restarting hippodrome fails with "Port already in use"
- lsof -i :9001 shows processes still running
Cause: Shutdown proceeds in reverse layer order (Layer 3 → 0). If a higher-layer service spawns child processes that don't respond to SIGTERM within 5 seconds, they're force-killed with SIGKILL. Orphaned grandchild processes may not be in the same process group.
Solution:
- Run pants hd kill to clean up all hippodrome ports
- If that doesn't work:
kill -9 $(lsof -ti :9001,9002,9004,9005)
- The orchestrator uses start_new_session=True for process groups, so kill_tree() should catch most cases
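The SIGTERM-then-SIGKILL sequence for a process group can be sketched as follows (POSIX only). stop_process_group is a hypothetical helper and is not the orchestrator's kill_tree(); it assumes the child was launched with start_new_session=True, so its process group ID equals its PID.

```python
import os
import signal
import subprocess


def stop_process_group(proc: subprocess.Popen, grace_s: float = 5.0) -> int:
    """Send SIGTERM to the child's group, then SIGKILL after grace_s."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # the group is already gone
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # exited between the timeout and the kill
        return proc.wait()
```

Signalling the group reaches children that stayed in it, but a grandchild that started its own session or group is not covered, which is how orphans end up holding ports.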
Problem: Controller returns HTML error page instead of JSON
Symptoms:
- API calls to controller return HTML 500 page
- Fuzz test or search_proxy gets unparseable HTML response
- Error like "JSONDecodeError" in caller logs
Cause: Django returns HTML error pages for unhandled exceptions by default. The DomainExceptionMiddleware catches these and returns JSON instead, but its catch-all is active only when DEBUG=True, which hippodrome always sets.
Solution: This should not happen in hippodrome (DEBUG is always True). If it does:
- Check that the controller's DEBUG env var is set to true in config/config.py
- Check config/middleware.py: DomainExceptionMiddleware must be last in the MIDDLEWARE list
- The middleware catches: RecordNotFoundError → 404, RecordAccessDeniedError → 403, requests.RequestException → 502, and all other unhandled exceptions → JSON 500
Adding New Entries
When you solve a new problem, add an entry following this format:
## Problem: [Brief description]
**Symptoms:**
- [What error messages you see]
- [What behavior you observe]
**Cause:**
[Why this happens]
**Solution:**
[Step-by-step instructions to fix it]
---
Place new entries in a logical order (common problems first, edge cases later).