Local Stack Troubleshooting Guide

This document contains solutions to common problems encountered while running hippodrome. When you encounter an issue, search this document first. If you solve a new problem, add it here following the format below.

Problem: Port already in use

Symptoms:

  • Error message: "Address already in use"
  • Error message: "OSError: [Errno 48] Address already in use"
  • Service fails to start with port conflict

Cause: Another process is using one of the required ports (9000, 9001, 9002, or 9008).

Solution:

  1. Find the process using the port:
    lsof -i :9001 # Replace with the conflicting port
  2. Kill the process:
    kill -9 <PID>
  3. Or use a single command:
    kill -9 $(lsof -ti :9001)

Or on Mac, try witr --port 9001 (brew install witr)
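
To check every required port in one pass before killing anything, a helper along these lines may be useful. It is a minimal sketch: the function name report_ports is made up for illustration, and it only looks at listening sockets (lsof's -sTCP:LISTEN filter), so an outbound client connection that happens to use the port is not reported.

```shell
# Print one line per port saying whether something is listening on it.
# With no arguments, checks the ports listed in the Cause above.
report_ports() {
  if ! command -v lsof >/dev/null 2>&1; then
    echo "lsof not found" >&2
    return 127
  fi
  if [ "$#" -eq 0 ]; then
    set -- 9000 9001 9002 9008
  fi
  local port pids
  for port in "$@"; do
    # -t prints bare PIDs; -sTCP:LISTEN keeps only listening sockets.
    pids=$(lsof -t -iTCP:"$port" -sTCP:LISTEN 2>/dev/null | tr '\n' ' ')
    if [ -n "$pids" ]; then
      echo "port $port: in use by PID(s) ${pids% }"
    else
      echo "port $port: free"
    fi
  done
}
```

Run report_ports with no arguments to check the default ports, or pass specific ports such as report_ports 9001 9002.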


Problem: Pants fails with PermissionError on sandboxer binary

Symptoms:

  • Error like PermissionError: [Errno 1] Operation not permitted: .../pants/bin/sandboxer
  • Stack trace mentions os.chmod on a Pants cache path
  • pants run or pants test exits immediately

Cause: The Pants cache directory is not writable/executable in the current environment (e.g., sandboxed execution, restrictive filesystem permissions, or a corrupted cache).

Solution:

  1. Ensure the Pants cache path is writable (Mac default: ~/Library/Caches/nce/)
  2. If running in a sandboxed environment, run Pants outside the sandbox or disable the sandbox
  3. Clear the Pants cache and retry:
    rm -rf ~/Library/Caches/nce
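
Step 1 can be checked quickly from the shell. The following is a rough sketch, with a hypothetical function name (check_pants_cache) and the Mac default path from step 1 as the fallback. It only tests the top-level directory, not every file inside it.

```shell
# Report whether the Pants cache directory is usable by the current user.
# The default path is the Mac default named in step 1.
check_pants_cache() {
  local dir="${1:-$HOME/Library/Caches/nce}"
  if [ ! -d "$dir" ]; then
    # A missing directory is not an error here: step 3 removes it on purpose.
    echo "missing: $dir"
  elif [ -w "$dir" ] && [ -x "$dir" ]; then
    echo "ok: $dir"
  else
    echo "not writable or not searchable: $dir" >&2
    return 1
  fi
}
```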

Problem: Service fails to start - command not found

Symptoms:

  • Error message: "FileNotFoundError: [Errno 2] No such file or directory: 'pants'"
  • Service shows status "failed" immediately after start

Cause: The pants command is not in your PATH, or you're running from the wrong directory.

Solution:

  1. Ensure you're in the project root (where pants.toml lives)
  2. Verify pants is installed: ./pants --version
  3. If using a custom pants installation, ensure it's in your PATH

Problem: Console fails to start - npm not found

Symptoms:

  • Error message: "FileNotFoundError: [Errno 2] No such file or directory: 'npm'"
  • Console shows status "failed"

Cause: Node.js and npm are not installed or not in PATH.

Solution:

  1. Install Node.js (includes npm): https://nodejs.org/
  2. Verify installation: npm --version
  3. Console is non-blocking, so other services will continue running

Problem: Console fails to start - npm ci fails

Symptoms:

  • Error during "Running npm ci..."
  • npm ci returns non-zero exit code

Cause:

  • package-lock.json is out of sync with package.json
  • npm cache is corrupted
  • Node version mismatch

Solution:

  1. Navigate to console directory:
    cd components/console
  2. Clear npm cache and reinstall:
    rm -rf node_modules
    npm cache clean --force
    npm install
  3. If version mismatch, check required Node version in package.json

Problem: Hot reload not working

Symptoms:

  • Changes to Python files don't trigger restart
  • Changes to console files don't refresh

Cause:

  • Python: uvicorn's --reload may not detect changes in all cases
  • Console: React's hot module replacement requires specific setup

Solution: For Python services:

  1. Verify the service was started with --reload flag (check logs)
  2. Ensure you're editing files within the watched directory
  3. Try manually restarting the local stack

For console:

  1. Check browser console for HMR errors
  2. Try hard refresh (Cmd+Shift+R)
  3. Restart the console service

Problem: Controller can't connect to fake_cell

Symptoms:

  • Controller logs show connection errors to localhost:9001
  • API requests to controller return 500 errors
  • Logs show "Connection refused"

Cause:

  • fake_cell hasn't finished starting
  • fake_cell crashed
  • Environment variables not set correctly

Solution:

  1. Check fake_cell status in dashboard (http://localhost:9000)
  2. Check fake_cell logs for errors
  3. Verify fake_cell is running:
    curl http://localhost:9001/health
  4. If fake_cell is not running, check its logs and restart hippodrome
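
If the cause is simply that fake_cell has not finished starting, polling the health endpoint can tell a slow start from a crash. This is a sketch only: wait_for_health is a hypothetical helper, and the 30-attempt default is an arbitrary choice, not a hippodrome setting.

```shell
# Poll a health URL until it answers without an HTTP error or attempts run out.
wait_for_health() {
  local url="$1" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s keeps the loop quiet.
    if curl -fs --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example: wait_for_health http://localhost:9001/health 30 || echo "fake_cell did not become healthy".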

Problem: Dashboard shows services but no logs

Symptoms:

  • The dashboard lists the services, but no log output appears for them

Cause:

  • Services just started (logs haven't been captured yet)
  • Log capture encountered an error

Solution:

  1. Wait a few seconds for logs to be captured
  2. Check terminal output - logs should be visible there with colored prefixes
  3. If terminal has logs but dashboard doesn't, check browser console for JavaScript errors

Problem: Ctrl+C doesn't stop services

Symptoms:

  • Pressing Ctrl+C doesn't terminate all services
  • Some services continue running after shutdown

Cause:

  • Signal handler may not have triggered shutdown
  • Some processes may be orphaned

Solution:

  1. Use the stop command to cleanly terminate all processes:

    pants hd stop

    This will find and kill the main orchestrator process and all service subprocesses.

  2. If you prefer manual cleanup:

    • Press Ctrl+C again (multiple times if needed)
    • Find and kill processes manually:
      # Find pants/python processes for hippodrome services
      ps aux | grep -E "(fake_cell|controller)" | grep -v grep
      # Kill by PID
      kill -9 <PID>
    • Kill processes by port:
      kill -9 $(lsof -ti :9001) $(lsof -ti :9002) 2>/dev/null

Problem: Django settings module not found

Symptoms:

  • Controller fails with "ModuleNotFoundError: No module named 'config'"
  • Error related to DJANGO_SETTINGS_MODULE
  • Traceback shows Django trying to import config.settings

Cause: Django's module system doesn't work well with Pants' sandbox environment. When running via pants run //components/controller/manage.py, Pants creates a sandbox that doesn't have the correct Python path structure for Django imports.

Solution: The controller now runs using its local venv instead of pants (as of iteration 4). If you're seeing this error:

  1. Ensure the controller venv exists: ls components/controller/.venv
  2. If it doesn't exist, the orchestrator will create it automatically. You can also create it manually:
    cd components/controller
    python3 -m venv .venv
    .venv/bin/pip install -r requirements.txt
  3. Verify Django is installed: components/controller/.venv/bin/python -c "import django; print(django.VERSION)"

Note: The orchestrator's _setup_controller_venv() method automatically creates and configures the venv if it's missing.


Problem: Services stuck waiting for pants invocation

Symptoms:

  • Log shows "Another pants invocation is running. Will wait up to 60.0 seconds..."
  • Services fail to start after the timeout
  • Only affects services run via pants (fake_cell)

Cause: By default, pants doesn't allow concurrent invocations. When the orchestrator starts fake_cell via pants while another pants process is running (including the orchestrator itself), it waits for the lock.

Solution: Run the local stack with PANTS_CONCURRENT=True:

PANTS_CONCURRENT=True pants hd up

This allows multiple pants processes to run concurrently.


Problem: E-commerce services not starting

Symptoms:

  • admin_server and search_proxy are not started
  • Only fake_cell, controller, and console are running
  • Dashboard at http://localhost:9000 shows only core services

Cause: E-commerce services are not included in the default core profile. They require the ecom or full profile to be explicitly selected.

Solution: Use the --profile ecom flag to start e-commerce services:

pants hd up --profile ecom

Available profiles:

  • core (default): fake_cell, controller, console
  • ecom: core + admin_server, search_proxy
  • full: all services

Problem: admin_server fails to start - validation error

Symptoms:

  • admin_server exits immediately after starting
  • Error message: "ValidationError" or "field required"
  • Logs show missing environment variable

Cause: admin_server requires several environment variables for configuration. The orchestrator sets placeholder values, but some features may require real values.

Solution:

  1. For basic local development, the default placeholder values should work
  2. For Shopify integration, create a .env file in components/shopify/admin_server/ with real credentials:
    SHOPIFY_API_KEY=your-api-key
    SHOPIFY_API_SECRET=your-api-secret
  3. For AWS integrations, ensure valid AWS credentials are available in your environment

Problem: search_proxy fails to start - wrangler not found

Symptoms:

  • search_proxy shows status "failed"
  • Error message: "npx: command not found" or "wrangler: command not found"

Cause: search_proxy is a Cloudflare Worker that requires Node.js and wrangler to run locally.

Solution:

  1. Install Node.js if missing: https://nodejs.org/
  2. Install dependencies in search_proxy:
    cd components/search_proxy
    npm install
  3. Verify wrangler works:
    cd components/search_proxy
    npx wrangler --version

Problem: Connected to wrong DynamoDB tables

Symptoms:

  • Data operations affect unexpected tables
  • Table not found errors
  • Reading/writing to tables from another developer's branch

Cause: The table prefix defaults to dev-{git-branch}-. If you're on a different branch than expected, or the branch name contains special characters, the table names may not match what you expect.

Solution:

  1. Check the startup message for the table prefix being used:
    Table prefix: dev-feature-auth-
  2. Use --table-prefix to explicitly set the prefix:
    pants hd up --profile ecom --table-prefix my-feature-
  3. Branch names with slashes are sanitized: feature/add-auth → dev-feature-add-auth-
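
To predict the prefix before starting the stack, the default can be approximated from the branch name. This is a sketch under one stated assumption: sanitizing only replaces "/" with "-", as in the feature/add-auth example. The real rules for other special characters are not documented here, so treat the startup message as authoritative. The function name table_prefix_for is hypothetical.

```shell
# Derive the default table prefix from a branch name.
# Assumption: only "/" is rewritten (to "-"); other characters pass through.
table_prefix_for() {
  printf 'dev-%s-\n' "$(printf '%s' "$1" | tr '/' '-')"
}
```

For the current branch: table_prefix_for "$(git rev-parse --abbrev-ref HEAD)".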

Problem: Production warning when starting

Symptoms:

  • Red warning message: "WARNING: Connecting to PRODUCTION cell!"
  • Concern about affecting production data

Cause: You're using --cell prod which connects to the production cell instead of fake_cell.

Solution:

  1. If this was intentional, proceed carefully - you're working with production data
  2. If unintentional, stop the orchestrator and restart without --cell prod:
    pants hd up --profile ecom
    # or explicitly use local cell:
    pants hd up --profile ecom --cell local
  3. For testing against non-production deployed infrastructure, use --cell staging

Problem: fake_cell not starting with --cell flag

Symptoms:

  • Using --cell staging or --cell prod
  • fake_cell is not running (expected)
  • Services can't connect to cell

Cause: When using --cell staging or --cell prod, fake_cell is intentionally skipped because services connect to the real deployed cell. However, network issues or authentication may prevent connection.

Solution:

  1. Verify AWS credentials are configured for the target cell
  2. Check VPN connection if required for staging/prod access
  3. Verify the cell is accessible:
    # For staging cell
    curl -I https://n6wwdwmk2m.execute-api.us-east-1.amazonaws.com/prod/health
  4. If you need local development without real cell access, use --cell local (the default)

Problem: Search requests fail - global_worker not running

Symptoms:

  • Search requests to http://localhost:9005/indexes/.../search fail
  • Error message: "ECONNREFUSED" or "Failed to fetch from global_worker"
  • search_proxy logs show connection error to port 9012

Cause: The global_worker is an external service (separate repository) that must be running for search requests to work. When search_proxy receives a search request, it routes to global_worker which then forwards to the cell.

Solution:

  1. The global_worker is in a separate repository. Clone and set it up:

    cd ~/dev
    git clone git@github.com:marqo-ai/global-worker.git
    cd global-worker
    npm install
  2. Create a wrangler.local.toml configuration pointing to fake_cell:

    name = "local-global-worker"
    main = "src/index.ts"
    compatibility_date = "2024-09-23"
    compatibility_flags = ["nodejs_compat"]

    [vars]
    ENV = "dev"
    CELL_URL = "http://localhost:9001"

    [dev]
    port = 9012
    local_protocol = "http"
  3. Start global_worker:

    npx wrangler dev --config wrangler.local.toml --port 9012
  4. If you don't need search functionality, you can ignore this - admin operations will still work.


Problem: global_worker can't connect to fake_cell

Symptoms:

  • global_worker is running on port 9012
  • Search requests fail with "connection refused" to port 9001
  • global_worker logs show errors connecting to CELL_URL

Cause:

  • fake_cell is not running (it is skipped when hippodrome is started with --cell staging or --cell prod)
  • CELL_URL in global_worker's config is wrong
  • hippodrome not started with correct profile

Solution:

  1. Verify fake_cell is running (dashboard shows it at http://localhost:9000)
  2. Check global_worker's wrangler.local.toml has CELL_URL = "http://localhost:9001"
  3. Ensure hippodrome is running: pants hd up --profile ecom
  4. If using --cell staging or --cell prod, update global_worker's CELL_URL to match the real cell URL
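
Step 2 can be scripted. The sketch below reads CELL_URL out of the TOML file with sed and compares it to the fake_cell address used in this guide. Both function names are hypothetical, and the sed pattern assumes the simple one-line KEY = "value" form shown in the global_worker setup.

```shell
# Print the CELL_URL value from a wrangler TOML file (empty output if unset).
cell_url_from_toml() {
  sed -n 's/^[[:space:]]*CELL_URL[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}

# Compare the configured value with the expected cell address.
check_cell_url() {
  local file="${1:-wrangler.local.toml}" expected="${2:-http://localhost:9001}" actual
  actual=$(cell_url_from_toml "$file")
  if [ "$actual" = "$expected" ]; then
    echo "CELL_URL ok: $actual"
  else
    echo "CELL_URL is '$actual', expected '$expected'" >&2
    return 1
  fi
}
```

Run check_cell_url from the global-worker directory. Pass a second argument with the real cell URL when using --cell staging or --cell prod.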

Problem: agentic_search fails to start

Symptoms:

  • agentic_search shows status "failed" in dashboard
  • Error message: "npx: command not found" or "wrangler: command not found"
  • Error about missing dependencies

Cause: agentic_search is a Cloudflare Worker that requires Node.js, npm, and its dependencies to be installed.

Solution:

  1. Verify Node.js and npm are installed:
    node --version
    npm --version
  2. Install dependencies in agentic_search directory:
    cd components/agentic_search
    npm install
  3. Verify wrangler works:
    npx wrangler --version
  4. If you see .dev.vars errors, create the file with required credentials:
    cd components/agentic_search
    cat > .dev.vars << EOF
    GOOGLE_API_KEY=your-google-api-key
    AGENTIC_AWS_ACCESS_KEY_ID=your-aws-access-key
    AGENTIC_AWS_SECRET_ACCESS_KEY=your-aws-secret-key
    EOF

Problem: agentic_search can't communicate with search_proxy

Symptoms:

  • agentic_search starts successfully on port 9007
  • API calls to agentic_search fail with service binding errors
  • Logs show "SEARCH_PROXY_WORKER is not defined" or similar

Cause: In local development mode, Cloudflare service bindings don't work. agentic_search uses SEARCH_PROXY_WORKER service binding in production, but needs HTTP fallback for local development.

Solution:

  1. Verify SEARCH_PROXY_URL is set in the environment. The orchestrator sets this automatically:
    # Check the orchestrator config sets this
    SEARCH_PROXY_URL=http://localhost:9005
  2. Ensure search_proxy is running on port 9005
  3. The agentic_search code should fall back to HTTP when SEARCH_PROXY_URL is set. If you see service binding errors, the HTTP fallback code may not be working correctly.
  4. Check that wrangler.local.toml has the correct configuration:
    [vars]
    SEARCH_PROXY_URL = "http://localhost:9005"

Problem: Cloudflare Worker Durable Objects not persisting

Symptoms:

  • Durable Object data (like conversation history in agentic_search) is lost between restarts
  • Each request creates a new Durable Object instance

Cause: By default, wrangler dev does not persist Durable Object state. State is stored in memory and lost when the worker restarts.

Solution:

  1. Use the --persist flag when running wrangler dev:
    npx wrangler dev --config wrangler.local.toml --port 9007 --persist
  2. Data is stored in .wrangler/state/ directory
  3. Note: The orchestrator doesn't add --persist by default. You can:
    • Stop the orchestrator and run agentic_search manually with --persist
    • Or accept that state will be lost on restart (usually fine for development)

Problem: AWS SSO token expired - services fail to validate tables

Symptoms:

  • admin_server logs show: "Could not validate X table: Error when retrieving token from sso: Token has expired"
  • ecom services start but show warning messages
  • DynamoDB operations fail with authentication errors

Cause: E-commerce services (admin_server, ecom_indexer, etc.) need AWS credentials to access DynamoDB tables. When using AWS SSO, the temporary credentials expire and need to be refreshed.

Solution:

  1. Refresh your AWS SSO credentials:
    aws sso login --profile your-profile-name
  2. Export the profile if needed:
    export AWS_PROFILE=your-profile-name
  3. Restart hippodrome to pick up the new credentials
  4. Note: Services can still start with expired credentials (they show warnings) but DynamoDB operations will fail

Prevention:

  • Refresh SSO credentials before starting hippodrome
  • Consider using long-lived credentials for local development (in ~/.aws/credentials)
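
The first prevention step can be wrapped in a small pre-flight helper. This sketch uses aws sts get-caller-identity as the validity check and only runs aws sso login when that check fails. The function name ensure_aws_session is made up for illustration, and your-profile-name is the same placeholder used in the steps above.

```shell
# Log in only when the current session is no longer valid, then export the profile.
ensure_aws_session() {
  local profile="$1"
  if ! aws sts get-caller-identity --profile "$profile" >/dev/null 2>&1; then
    aws sso login --profile "$profile" || return 1
  fi
  export AWS_PROFILE="$profile"
}
```

For example: ensure_aws_session your-profile-name && pants hd up --profile ecom.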

Problem: Multiple Cloudflare Workers conflict on debug port

Symptoms:

  • Error: "Address already in use" for port 9229
  • Second Cloudflare Worker fails to start
  • agentic_search or search_proxy won't start when both are running

Cause: Wrangler's Node.js debugger uses port 9229 by default. When running multiple workers, they conflict on this port.

Solution:

  1. The workers should use different inspector ports. Check if --inspector-port is set in the commands.
  2. If you see this error, give one worker a different inspector port:
    # Run agentic_search with its inspector on port 9230
    npx wrangler dev --config wrangler.local.toml --port 9007 --inspector-port 9230
  3. Or kill the process using port 9229:
    lsof -ti :9229 | xargs kill -9
  4. Restart hippodrome

Problem: KV namespace errors in Cloudflare Workers

Symptoms:

  • Error: "KV namespace binding not found"
  • search_proxy or agentic_search fails to read/write KV data
  • 500 errors from workers when accessing cached data

Cause: KV namespaces in wrangler.local.toml use placeholder IDs that wrangler dev accepts, but the local KV storage starts empty.

Solution:

  1. For basic functionality, the workers should handle missing KV data gracefully
  2. If you need specific KV data for testing:
    • Use wrangler CLI to populate local KV:
      cd components/search_proxy
      npx wrangler kv:key put --binding SEARCH_PROXY_KV "key-name" "value" --local
  3. Local KV data is stored in .wrangler/state/ and can be cleared:
    rm -rf components/search_proxy/.wrangler/state/
    rm -rf components/agentic_search/.wrangler/state/

Problem: EventBridge webhook returns "No route configured"

Symptoms:

  • POST to /webhook/eventbridge returns {"status": "ignored", "reason": "No route configured for detail-type: ..."}
  • Events are not forwarded to local services

Cause: The detail-type in the event doesn't match any configured route prefix. Routes are configured for EcomIndexSettings.* and Merchandising.* only.

Solution:

  1. Verify your event has the correct detail-type field:
    • For index settings: EcomIndexSettings.MODIFY, EcomIndexSettings.INSERT, EcomIndexSettings.REMOVE
    • For merchandising: Merchandising.MODIFY, Merchandising.INSERT, Merchandising.REMOVE
  2. Check your event format matches the expected structure:
    {
      "source": "marqo.dynamodb",
      "detail-type": "EcomIndexSettings.MODIFY",
      "detail": { ... }
    }
  3. If you need to route events to a new service, add the route to EVENTBRIDGE_ROUTES in dashboard.py
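
Before sending an event, its detail-type can be checked against the two configured prefixes. This is an illustrative sketch, not the matching code in dashboard.py: it assumes a plain case-sensitive prefix match, and the function name matching_route_prefix is hypothetical.

```shell
# Succeed and print the matching prefix when a detail-type would be routed.
# The two prefixes are the ones named in the Cause above.
matching_route_prefix() {
  case "$1" in
    EcomIndexSettings.*) echo "EcomIndexSettings." ;;
    Merchandising.*)     echo "Merchandising." ;;
    *) return 1 ;;
  esac
}
```

For example: matching_route_prefix "EcomIndexSettings.MODIFY" || echo "this detail-type would be ignored".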

Problem: EventBridge webhook returns service unavailable

Symptoms:

  • POST to /webhook/eventbridge returns status 503
  • Response contains {"error": "Service unavailable: ..."}
  • Logs show "Failed to forward event to localhost:9010"

Cause: The target service (ecom_settings_exporter or merchandising_exporter) is not running on the expected port.

Solution:

  1. Verify the target service is running with the ecom profile:
    pants hd up --profile ecom
  2. Check the dashboard at http://localhost:9000 for service status
  3. Test the service directly:
    # For ecom_settings_exporter
    curl http://localhost:9010/healthz

    # For merchandising_exporter
    curl http://localhost:9011/healthz
  4. If the service shows as failed, check its logs in the dashboard or terminal output

Problem: EventBridge webhook returns 400 - missing detail-type

Symptoms:

  • POST to /webhook/eventbridge returns {"error": "Missing 'detail-type' field in event"}
  • HTTP status 400

Cause: The event JSON is missing the required detail-type field.

Solution:

  1. Ensure your event includes detail-type (note the hyphen, not underscore):
    {
      "source": "marqo.dynamodb",
      "detail-type": "EcomIndexSettings.MODIFY",
      "detail": { ... }
    }
  2. Common mistake: Using detailType or detail_type instead of detail-type
  3. Verify JSON is valid:
    echo '{"detail-type": "test"}' | jq .

Problem: Events not reaching local services from AWS

Symptoms:

  • Events sent from AWS EventBridge don't reach the local webhook
  • Manual webhook testing works fine
  • No logs showing incoming events

Cause: The tunnel or EventBridge rule may not be set up correctly, or AWS credentials may be missing.

Solution:

  1. Use the --enable-events flag to automatically set up tunneling and EventBridge rules:
    pants hd up --profile ecom --enable-events
  2. Ensure you have:
    • cloudflared installed and in your PATH
    • AWS credentials configured with EventBridge and IAM permissions
    • Run aws sso login if using AWS SSO
  3. Check the startup output for:
    • [tunnel] EventBridge webhook exposed at ... - confirms tunnel is working
    • [eventbridge] Rule '...' created for ... - confirms rule was created
  4. If rule creation fails but tunnel works, you can manually create a rule:
    • Copy the tunnel URL from the startup output
    • Create an EventBridge rule in the AWS Console targeting that URL
  5. For manual testing without --enable-events, use curl:
    curl -X POST http://localhost:9000/webhook/eventbridge \
      -H "Content-Type: application/json" \
      -d '{"detail-type": "EcomIndexSettings.MODIFY", "detail": {...}}'

Problem: EventBridge rule creation fails

Symptoms:

  • Tunnel starts successfully
  • [eventbridge] Failed to create EventBridge rule message appears
  • Events are not forwarded from AWS

Cause: Missing AWS credentials or insufficient IAM permissions.

Solution:

  1. Ensure AWS credentials are configured:
    aws sts get-caller-identity # Should show your identity
    aws sso login # If using AWS SSO
  2. Required IAM permissions:
    • events:PutRule, events:PutTargets, events:DeleteRule, events:RemoveTargets
    • events:CreateConnection, events:DeleteConnection, events:DescribeConnection
    • events:CreateApiDestination, events:DeleteApiDestination, events:DescribeApiDestination
    • iam:CreateRole, iam:PutRolePolicy, iam:GetRole (for first-time setup)
  3. The orchestrator will still work without the rule - you can test manually with curl

Problem: EventBridge Pipes not running or missing

Symptoms:

  • Validation script shows [MISSING] or [WARN] for pipes
  • EventBridge events never reach the custom event buses
  • No events forwarded to local services even with --enable-events

Cause: The CDK infrastructure (EventBridge Pipes) hasn't been deployed or failed to deploy.

Solution:

  1. Run the validation script to check infrastructure status:
    python components/hippodrome/scripts/validate_eventbridge.py --table-prefix dev-main- --verbose
  2. If pipes are missing, deploy the CDK stacks:
    # Deploy ecom stack (creates EcomEventBus and IndexSettingsEventPipe)
    cd infra/ecom
    npx cdk deploy --require-approval never

    # Deploy controller stack (creates MerchandisingEventBus and MerchandisingEventPipe)
    cd ../controller
    npx cdk deploy MerchandisingStack --require-approval never
  3. If pipes exist but aren't RUNNING, check CloudWatch Logs for errors:
    • Go to CloudWatch Logs in the AWS Console
    • Search for log groups containing "EventBridge" or "Pipes"
    • Check for permission errors or configuration issues
  4. Verify DynamoDB streams are enabled on the source tables

Problem: EventBridge events not appearing in console

Symptoms:

  • EventBridge Pipes are RUNNING
  • Local webhook never receives events
  • No activity in EventBridge console metrics

Cause: The custom event buses may not have any subscribers, or the DynamoDB table isn't receiving writes.

Solution:

  1. Verify the event bus exists and has the correct name format:
    • For table prefix dev-main-, event buses should be:
      • dev-main-EcomEventBus
      • dev-main-MerchandisingEventBus
  2. Make a test write to trigger an event:
    aws dynamodb put-item \
      --table-name dev-main-EcomIndexSettingsTable \
      --item '{"pk": {"S": "test"}, "sk": {"S": "INDEX#test"}}'
  3. Check EventBridge metrics in CloudWatch:
    • Go to CloudWatch → Metrics → Events → By Event Bus
    • Look for MatchedEvents and FailedInvocations metrics
  4. Enable EventBridge archive/logging for debugging:
    • In EventBridge console, create an archive rule to capture all events
    • This helps identify if events are reaching the bus

Problem: Pants resolve conflicts when starting Lambda services

Symptoms:

  • Error: "Targets that are in different resolves cannot be mixed"
  • Lambda wrappers fail to build with dependency resolution errors
  • Services like ecom_indexer, ecom_settings_exporter, merchandising_exporter fail to start

Cause: Lambda services (ecom_indexer, ecom_settings_exporter, merchandising_exporter) each have their own Pants resolve. If hippodrome tries to run wrapper scripts that import from these services while using a different resolve, the build fails.

Solution: The Lambda services should be started using their native :local targets in their respective directories, not through wrapper scripts in hippodrome:

  • //components/ecom_indexer:local
  • //components/ecom_settings_exporter:local
  • //components/merchandising_exporter:local
  • //components/shopify/admin_server:local

The config.py's create_python_lambda_service function should use f"//components/{name}:local" format.


Problem: API key authentication fails with "Invalid API key"

Symptoms:

  • Requests to admin_server return 401 "Invalid API key"
  • admin_server logs show: "API key encryption secret not available, cannot decrypt API key"
  • admin_server logs show: "Failed to extract system account ID from API key"

Cause: Marqo API keys are encrypted. The admin_server needs access to the encryption secret (either from AWS Secrets Manager or via MARQO_API_KEY_SECRET environment variable) to decrypt them. In local development, AWS Secrets Manager access may not be available.

Solution:

  1. Ensure you have valid AWS credentials:
    aws sso login --profile staging
    export AWS_PROFILE=staging
  2. The hippodrome config.py should automatically fetch the encryption secret from AWS Secrets Manager (dev/api_key_encryption_key_secret) and set it as MARQO_API_KEY_SECRET.
  3. If AWS access fails, you can manually set the secret:
    export MARQO_API_KEY_SECRET=$(aws secretsmanager get-secret-value --secret-id dev/api_key_encryption_key_secret --query 'SecretString' --output text)
  4. Restart hippodrome after setting credentials

Problem: API key validation fails with KeyError for cell ID

Symptoms:

  • Requests to admin_server return 500 error
  • Error message: KeyError: 'S' or KeyError: 'D'
  • Traceback shows error in ControlPlaneGateway looking up cell in DATA_PLANE_CELLS

Cause: The API key was created for a different cell (e.g., "S" for staging) but hippodrome's DATA_PLANE_CELLS only contains the "local" cell configuration. The admin_server tries to validate the API key against the cell specified in the key, but that cell isn't configured.

Solution: The config.py's get_data_plane_cells_json function should include staging cell configuration when using local cell, so that API keys created for staging can be validated:

# In config.py - get_data_plane_cells_json should include staging config for local cell
if cell == Cell.LOCAL:
    staging_config = CELL_CONFIGS[Cell.STAGING]
    cells_config[staging_config.id] = {
        "aws_region": staging_config.region,
        "gateway_id": staging_config.gateway_id,
    }

This allows staging API keys to be validated against the staging cell while other operations use fake_cell.


Problem: E2E tests fail with ResourceNotFoundException for DynamoDB tables

Symptoms:

  • Tests return 500 error with ResourceNotFoundException when calling PutItem
  • Error mentions table like dev-hippodrome-EcomIndexSettingsTable not found
  • admin_server logs show DynamoDB table not found

Cause: The hippodrome uses a table prefix based on the git branch (e.g., dev-hippodrome-). These DynamoDB tables don't exist in AWS - they would need to be created, or an existing table prefix should be used.

Solution:

  1. Use an existing table prefix by specifying --table-prefix when starting hippodrome:
    pants hd up --profile ecom --table-prefix dev-main-
  2. Or create the tables for your branch using the CDK infrastructure:
    cd infra/ecom
    npx cdk deploy --context tablePrefix=dev-hippodrome-
  3. Or run e2e tests against a deployed dev environment instead of hippodrome, using the environment variables in .env.dev

Note: Running e2e tests against hippodrome requires DynamoDB tables to exist with the matching prefix. The fake_cell handles cell operations but DynamoDB tables are always in AWS.


Problem: E2E tests require specific environment configuration

Symptoms:

  • Tests fail with various errors when run against hippodrome
  • Errors about missing environment variables or configuration

Cause: E2E tests are designed to run against deployed AWS infrastructure. Running them against hippodrome requires specific configuration.

Solution: Run e2e tests with these environment variables:

ENVIRONMENT=dev \
ENV_BRANCH=hippodrome \
ADMIN_SERVER_BASE_URL=http://localhost:9004 \
SEARCH_PROXY_URL=http://localhost:9005 \
AWS_PROFILE=staging \
pants test //components/shopify/e2e_tests/e2e_tests/tests/ecom_onboarding_test.py -- -v

Key variables:

  • ENVIRONMENT=dev - Tells tests to use dev environment
  • ENV_BRANCH - Must match your hippodrome table prefix (default is git branch)
  • ADMIN_SERVER_BASE_URL - Points to local admin_server (port 9004)
  • SEARCH_PROXY_URL - Points to local search_proxy (port 9005)
  • AWS_PROFILE - AWS profile with access to DynamoDB tables

Problem: Marqo API calls return 401 "Invalid API Key" despite successful decryption

Symptoms:

  • admin_server logs show: "HTTP error fetching indexes: 401 - {"error":"Invalid API Key."}"
  • The encryption secret is retrieved successfully (from env var MARQO_API_KEY_SECRET)
  • API key decryption appears to work (can decrypt to get system_account_id, cell, token)
  • But calls to Marqo API fail with 401

Cause: The admin_server stores encrypted API keys in the ShopifyApiKeyRecord.marqo_api_key field. When calling Marqo, it passes this encrypted key directly to MarqoClient, which sends it in the x-api-key header. Marqo expects the decrypted token, not the encrypted key.

The flow is:

  1. User sends encrypted API key as Bearer token
  2. Admin_server decrypts it to get {system_account_id, cell, token}
  3. Admin_server stores the ENCRYPTED key in DynamoDB (marqo_api_key field)
  4. Later, when calling Marqo, it retrieves api_key_record.marqo_api_key (encrypted)
  5. Passes encrypted key to MarqoClient(api_key=encrypted_key)
  6. Marqo receives encrypted key and rejects it with 401

Solution: This is a design issue in the admin_server. It does not show up when the admin_server is deployed, because the encrypted key in the database matches what Marqo expects (there may be additional infrastructure handling this). It appears only when the admin_server runs locally.

For now, running e2e tests against hippodrome requires either:

  1. Running against a deployed dev environment (using AWS-hosted admin_server) instead of local admin_server
  2. Or modifying the admin_server code to decrypt the stored key before calling Marqo

To run e2e tests against deployed infrastructure:

# Don't set ADMIN_SERVER_BASE_URL - let it default to deployed AWS environment
ENVIRONMENT=dev \
ENV_BRANCH=admin-api \
pants test //components/shopify/e2e_tests/e2e_tests/tests:e2e_tests -- -k ecom_onboarding -v

Problem: fake_cell stops responding under concurrent load (event loop hang)

Symptoms:

  • HTTP requests to fake_cell (:9001) time out or return connection errors
  • The fake_cell process is still alive (ss -tlnp | grep 9001 shows LISTEN)
  • The process is consuming 80-95% CPU with a single thread
  • The TCP accept backlog grows (visible in ss output as large numbers)
  • Other services that depend on fake_cell (search_proxy, controller) also start failing

Cause: fake_cell previously ran both the main API and fake index servers in a shared asyncio event loop. Under concurrent HTTP load, the loop could enter a CPU-spinning state. This was fixed by running each fake index server in its own dedicated thread with a separate event loop.

Solution: The root cause has been addressed — each fake index server now runs in a dedicated thread. Additionally, two layers of self-healing provide defense in depth:

  1. fake_cell watchdog (in-process): A daemon thread checks /healthz every 5 seconds. After 2 consecutive failures (~10-15s), it saves state to disk and force-exits with os._exit(1).
  2. hippodrome health monitor (orchestrator-level): Periodically probes all RUNNING services with auto_restart=True. After 3 consecutive health failures (~30s), hippodrome restarts the service.

Both mechanisms work with hippodrome's auto_restart to bring the service back within seconds. State is persisted — accounts, API keys, and index metadata survive restarts automatically via /tmp/fake_cell_state.json.
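
The in-process watchdog described above can be sketched roughly as follows. This is a minimal illustration, not fake_cell's actual code: the names run_watchdog, start_watchdog, probe, and on_unhealthy are hypothetical, and the real watchdog is described as saving state and then calling os._exit(1) where this sketch calls on_unhealthy().

```python
import threading
from typing import Callable


def run_watchdog(
    probe: Callable[[], bool],
    on_unhealthy: Callable[[], None],
    stop: threading.Event,
    interval: float = 5.0,
    max_failures: int = 2,
) -> None:
    """Call on_unhealthy() once probe() has failed max_failures times in a row."""
    failures = 0
    # stop.wait() doubles as the sleep between checks and returns True once stopped.
    while not stop.wait(interval):
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a probe that raises counts as a failed check
        failures = 0 if healthy else failures + 1
        if failures >= max_failures:
            # Assumption: the real watchdog saves state to disk, then os._exit(1).
            on_unhealthy()
            return


def start_watchdog(probe, on_unhealthy, **kwargs) -> threading.Event:
    """Run the loop in a daemon thread; set the returned event to stop it."""
    stop = threading.Event()
    threading.Thread(
        target=run_watchdog,
        args=(probe, on_unhealthy, stop),
        kwargs=kwargs,
        daemon=True,
    ).start()
    return stop
```

A single successful check resets the failure counter, so only consecutive failures trigger the exit.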

If self-healing fails or fake_cell is stuck:

  1. Kill the fake_cell process:
    kill -9 $(ss -tlnp 2>/dev/null | grep ':9001 ' | grep -oP 'pid=\K[0-9]+')
  2. hippodrome will auto-restart it within a few seconds
  3. State (accounts, API keys, indexes) is restored automatically from disk

Note: The --reload flag is disabled for fake_cell because uvicorn's reload mode causes deadlocks under concurrent load.


Problem: Controller returns 500 for missing accounts/records instead of 404

Symptoms:

  • Controller endpoints return HTTP 500 with "Record does not exist" in the response body
  • Controller endpoints return 500 with "Account not found" when accountId doesn't match any account
  • Fuzz tests flag these as server errors but they're actually "not found" scenarios

Cause: Two issues combined:

  1. get_system_account_id() raised a generic Exception("Account not found") instead of a domain-specific error
  2. RecordNotFoundError and RecordAccessDeniedError from the DDB client could bubble up through views that lacked specific exception handling

Solution: Fixed at multiple layers:

  1. get_system_account_id() now raises AccountRecordNotFoundError (subclass of RecordNotFoundError)
  2. A DomainExceptionMiddleware in config/middleware.py catches unhandled RecordNotFoundError → 404 and RecordAccessDeniedError → 403 at the Django level
  3. The auth backend explicitly raises AuthenticationFailed(401) for MembershipRecordNotFoundError instead of falling through to the generic except Exception handler
  4. All except Exception blocks in index v2 views, merchandise views, and integrations views now re-raise RecordNotFoundError/RecordAccessDeniedError before the generic handler, ensuring domain exceptions reach the middleware instead of being swallowed as 500s

These changes are transparent to existing views that already catch these exceptions explicitly.

The DomainExceptionMiddleware also has a DEBUG-mode catch-all: any unhandled exception returns a JSON 500 response ({"error": "ExceptionType: message"}) instead of Django's default HTML error page. This ensures API clients always receive parseable JSON, even for unexpected failures.
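
The re-raise pattern from the fix can be illustrated with a framework-free sketch. The exception class names come from this entry; the class bodies, the get_system_account_id signature, and example_view are simplified stand-ins, not the controller's real code.

```python
class RecordNotFoundError(Exception):
    """Stand-in for the DDB client's not-found error."""


class RecordAccessDeniedError(Exception):
    """Stand-in for the DDB client's access-denied error."""


class AccountRecordNotFoundError(RecordNotFoundError):
    """Domain-specific subclass, so handlers for RecordNotFoundError also catch it."""


def get_system_account_id(accounts: dict, account_id: str) -> str:
    # Hypothetical signature: the real function's arguments are not shown in this guide.
    try:
        return accounts[account_id]
    except KeyError:
        raise AccountRecordNotFoundError("Account not found") from None


def example_view(accounts: dict, account_id: str):
    try:
        system_id = get_system_account_id(accounts, account_id)
        return 200, {"system_account_id": system_id}
    except (RecordNotFoundError, RecordAccessDeniedError):
        raise  # let the middleware turn these into 404 / 403
    except Exception as exc:
        # Generic handler: without the clause above, domain errors would end up here as 500s.
        return 500, {"error": str(exc)}
```

The order of the except clauses matters: the domain exceptions must be listed before the generic handler.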

Problem: fake_cell _reindex_jobs race condition under concurrent load

Symptoms:

  • Occasional RuntimeError or lost reindex job records when multiple reindex operations run concurrently
  • Difficult to reproduce — only manifests under real concurrent load (e.g., fuzz testing)

Cause: The module-level _reindex_jobs list was mutated (append + length check + slice delete) without synchronization. The compound operation was not atomic under concurrent access.

Solution: Added a threading.Lock (_reindex_jobs_lock) to protect both the append+evict operation and the list query in list_reindex_jobs.
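
A minimal sketch of the locking described above. The names _reindex_jobs, _reindex_jobs_lock, and list_reindex_jobs come from this entry; MAX_JOBS and record_reindex_job are assumptions made for illustration.

```python
import threading

MAX_JOBS = 100  # assumed cap; the real limit is not stated in this guide

_reindex_jobs: list = []
_reindex_jobs_lock = threading.Lock()


def record_reindex_job(job: dict) -> None:
    """Append a job and evict the oldest entries as one atomic step."""
    with _reindex_jobs_lock:
        _reindex_jobs.append(job)
        overflow = len(_reindex_jobs) - MAX_JOBS
        if overflow > 0:
            del _reindex_jobs[:overflow]


def list_reindex_jobs() -> list:
    """Return a snapshot copy so callers never iterate the live list."""
    with _reindex_jobs_lock:
        return list(_reindex_jobs)
```

Returning a copy from list_reindex_jobs keeps readers from iterating the list while another thread is evicting from it.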


Problem: Services report "healthy" before fully initialized

Symptoms:

  • Dependent services get connection errors or malformed responses despite health checks passing
  • Race conditions during startup where a service accepts connections but isn't ready to serve
  • ResourceNotFoundException errors from admin_lambda during the first few seconds after startup

Cause: Services without an HTTP health_url configured used TCP-only port checks. A service can bind to its port and accept connections before the FastAPI/Django app is fully initialized (routes registered, middleware loaded, etc.).

Solution: All FastAPI and Django services (fake_cell, fake_cognito, controller, admin_server, admin_lambda) now have health_url="/healthz" configured in their ServiceConfig. The orchestrator polls these HTTP endpoints during startup, only marking services as healthy when the app responds with 200.

Wrangler Workers (search_proxy, agentic_search) still use TCP-only checks since Wrangler doesn't expose a configurable health endpoint, but Wrangler only opens the port when fully ready.

The health check also now logs non-200 responses during startup polling, making it visible when a service is up but returning errors.
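
The difference between a TCP port check and an HTTP readiness check can be sketched as below. This is an illustration under assumptions, not the orchestrator's code: http_probe and wait_until_healthy are hypothetical names, and the timeout and interval values are placeholders.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the app itself answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as exc:
        # The port is open and the app answered, but with an error status.
        print(f"health check {url} returned {exc.code}")
        return False
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, reset, or a malformed response.
        return False


def wait_until_healthy(
    probe: Callable[[], bool], timeout: float = 60.0, interval: float = 0.5
) -> bool:
    """Poll probe() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

For example, wait_until_healthy(lambda: http_probe("http://localhost:9001/healthz")) would wait for fake_cell, assuming it serves /healthz on port 9001.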


Problem: Hippothesis fuzz test reports spurious 500 errors

Symptoms:

  • _assert_ok assertions fail with "Internal server error", "Transport error", "Bad gateway", etc.
  • Circuit breaker trips for one or more services
  • _error_codes summary shows many label:500 entries

Cause: These are infrastructure-level failures, not real application bugs. The fuzz test's _INFRA_ERRORS allowlist in cloud_machine.py recognizes known patterns:

  • "Transport error" — Connection drops (service process crashed or restarting)
  • "ResourceNotFoundException" — DDB table missing (moto not fully seeded)
  • "Failed to authenticate" — Auth failure (fake_cell slow/down)
  • "Bad gateway" — Network error to downstream service (502 from search_proxy, controller, or admin_lambda)

Previous entries removed after service hardening:

  • "Internal server error" — search_proxy now classifies network errors as 502 instead of generic 500
  • "Service misconfigured" — search_proxy env validation returns 503 with clear message
  • "CSRF" — controller now returns JSON 403 instead of HTML CSRF failure page
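
As a rough sketch of how such an allowlist can be applied, assuming plain substring matching against the response body (the actual matching logic in cloud_machine.py may differ, and is_infra_error is a hypothetical helper):

```python
# Patterns currently listed in this entry as known infrastructure failures.
_INFRA_ERRORS = (
    "Transport error",
    "ResourceNotFoundException",
    "Failed to authenticate",
    "Bad gateway",
)


def is_infra_error(body: str) -> bool:
    """Return True if the error body contains a known infrastructure pattern."""
    return any(pattern in body for pattern in _INFRA_ERRORS)
```

An error body that matches none of the patterns is the case worth investigating as a possible real bug.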

Solution:

  1. Ensure hippodrome has been running for at least 60-90 seconds before starting fuzz tests
  2. Check curl -s http://localhost:9000/api/status — all services should show running
  3. If a specific service keeps failing, check its log: tail components/hippodrome/.logs/latest/<service>.log
  4. If errors persist after all services are healthy, the issue may be a real bug:
    • Check if the error body matches any _INFRA_ERRORS pattern — if not, it's a genuine assertion failure
    • Look at the _error_codes summary in the test output to identify which endpoints fail most
    • The circuit breaker (CB_OPEN=<service>) indicates which service is unhealthy

Problem: Moto resource creation fails during startup

Symptoms:

  • RuntimeError: Moto resource creation failed for: _create_tables (or _create_s3_buckets, etc.)
  • RuntimeError: Failed to create DynamoDB tables: <table_suffix>
  • Services that depend on DDB tables fail with ResourceNotFoundException

Cause: Moto creates AWS resources (tables, buckets, queues, Lambdas, secrets, state machines) concurrently at startup. If any creation fails, the error message now identifies which resource group(s) and which specific table(s) failed. Common causes:

  • Network contention during startup (rare)
  • Memory pressure from too many concurrent moto operations
  • Moto bug triggered by specific table definitions
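
The concurrent creation and per-group error reporting can be sketched as follows. This is a simplified illustration, not the code in wrappers/moto_server.py: create_all and its creators argument are hypothetical, and the log lines imitate the formats quoted in this entry.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def create_all(creators: Dict[str, Callable[[], None]]) -> None:
    """Run every creator concurrently and report which groups failed."""
    failed = []
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in creators.items()}
        for name, future in futures.items():
            # future.exception() blocks until that creator has finished.
            if future.exception() is not None:
                print(f"Moto resource group failed: {name}")
                failed.append(name)
            else:
                print(f"Moto resource group ready: {name}")
    if failed:
        raise RuntimeError("Moto resource creation failed for: " + ", ".join(failed))
```

Collecting all failures before raising means one failing group does not hide the status of the others.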

Solution:

  1. Check the moto server log for specific errors: grep -i error components/hippodrome/.logs/latest/moto_server.log
  2. The log now shows Moto resource group ready: <name> for each successful group and Moto resource group failed: <name> for failures
  3. For table failures, the log shows Failed to create table <suffix>: <error>
  4. Restart hippodrome — transient failures usually resolve on retry
  5. If a specific table consistently fails, check its definition in wrappers/moto_server.py:TABLE_DEFINITIONS

Problem: Controller returns 502 Bad Gateway for upstream service failures

Symptoms:

  • Controller API returns {"error": "Bad gateway"} with status 502
  • Usually during merchandise/search operations that call search_proxy or admin_server

Cause: The controller's DomainExceptionMiddleware classifies requests.ConnectionError, requests.Timeout, and other requests.RequestException errors as 502 Bad Gateway. This means the controller couldn't reach an upstream service (search_proxy, admin_server, or fake_cell).

Solution:

  1. Check that the upstream service is running: curl -s http://localhost:9000/api/status
  2. Check the upstream service's health endpoint directly (e.g., curl http://localhost:9005/ for search_proxy)
  3. If the upstream service is running, it may be overloaded — wait a moment and retry
  4. Check the upstream service log: tail components/hippodrome/.logs/latest/<service>.log

Problem: fake_cell state save race condition in tests

Symptoms:

  • Warning: Failed to save state to /tmp/fake_cell_state.json
  • FileNotFoundError: [Errno 2] No such file or directory: '/tmp/fake_cell_state.tmp' -> '/tmp/fake_cell_state.json'
  • Occurs during parallel test execution (pytest-xdist)

Cause: Multiple test workers share the same predictable temp file path (/tmp/fake_cell_state.tmp). When workers save concurrently, one worker's rename deletes the other worker's temp file.

Solution: Fixed in state_manager.py — now uses tempfile.mkstemp() with a unique random suffix per write, and Path.replace() for atomic rename. No manual intervention needed.
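
The technique named in the fix looks roughly like this. It is a sketch, not the code in state_manager.py: save_state is a hypothetical function, and the temp file naming is an assumption.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(state: dict, target: Path) -> None:
    """Write state to a uniquely named temp file, then atomically replace target."""
    # mkstemp creates a new file with a random name, so concurrent writers never
    # share a temp path. Keeping it in the target's directory means the rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.stem + ".", suffix=".tmp"
    )
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
        tmp_path.replace(target)  # atomic rename on POSIX filesystems
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # do not leave stray temp files behind
        raise
```

With concurrent writers the last rename wins, but a reader always sees a complete file and no writer removes another writer's temp file.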


Problem: Deep health check shows "unhealthy" with unclear error

Symptoms:

  • Deep health check returns "error_type": "connection_refused" — the service is not listening on its port yet
  • Deep health check returns "error_type": "timeout" — the service is listening but not responding in time
  • Deep health check returns "error_type": "http_error" with "status_code" — the service responded with an error

Cause: Different root causes require different remediation:

  • connection_refused: Service hasn't finished starting. Wait longer or check logs for startup errors.
  • timeout: Service is overloaded or hung. Check for deadlocks in the service log.
  • http_error: Service is running but returning errors. Check the status code and service logs.
  • parse_error: Service returned unexpected response format. Check for version mismatches.
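
One way such a classification could be derived from the exception raised by an HTTP call is sketched below, using urllib from the standard library. The error_type values are the ones listed above; classify_health_error and the "unknown" fallback are assumptions, not hippodrome's actual implementation.

```python
import socket
import urllib.error


def classify_health_error(exc: Exception) -> dict:
    """Map an exception from a health probe to an error_type payload."""
    if isinstance(exc, urllib.error.HTTPError):
        return {"error_type": "http_error", "status_code": exc.code}
    # URLError wraps the underlying socket error in .reason.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, ConnectionRefusedError):
        return {"error_type": "connection_refused"}
    if isinstance(reason, (TimeoutError, socket.timeout)):
        return {"error_type": "timeout"}
    if isinstance(exc, ValueError):  # json.JSONDecodeError subclasses ValueError
        return {"error_type": "parse_error"}
    return {"error_type": "unknown"}  # hypothetical fallback, not from the source
```

HTTPError is checked first because it is itself a subclass of URLError.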

Solution:

  1. Check the error_type field in the health check response for targeted diagnosis
  2. For connection_refused: tail -f components/hippodrome/.logs/latest/<service>.log
  3. For timeout: increase request_timeout_seconds or investigate service performance
  4. For http_error: check the service-specific error at the reported status code

Problem: Service in higher layer starts before its dependencies are ready

Symptoms:

  • Controller logs show connection errors to fake_cell/fake_cognito during first few seconds
  • admin_server fails to register with search_proxy because controller isn't ready yet
  • Cascading 502 errors during startup that resolve after a few seconds

Cause: Services were starting nearly simultaneously regardless of dependency order. Higher-layer services (like controller at Layer 1) could attempt to connect to lower-layer services (like fake_cell at Layer 0) before they were fully initialized.

Solution: This is now handled automatically by the layered startup system. Services are grouped into dependency layers (0-3) and each layer waits for health checks to pass before the next layer starts. If you still see startup ordering issues:

  1. Check the service's layer value in config/config.py
  2. Ensure the service has a health_url configured (services without health URLs are not gated)
  3. Check dashboard logs for [orchestrator] Layer N healthy/not healthy messages
  4. If a layer times out (60s), the next layer starts anyway — check for slow service initialization
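
The layered startup behavior described in this entry can be sketched as follows. The Service tuple and start_in_layers are hypothetical and much simpler than the real orchestrator; the log lines only approximate the messages mentioned in step 3.

```python
import time
from collections import defaultdict
from typing import Callable, List, NamedTuple


class Service(NamedTuple):
    name: str
    layer: int
    start: Callable[[], None]
    is_healthy: Callable[[], bool]  # services without a health URL would return True


def start_in_layers(
    services: List[Service], layer_timeout: float = 60.0, interval: float = 0.5
) -> List[str]:
    """Start services layer by layer, waiting for each layer's health checks."""
    layers = defaultdict(list)
    for svc in services:
        layers[svc.layer].append(svc)

    order = []
    for layer in sorted(layers):
        for svc in layers[layer]:
            svc.start()
            order.append(svc.name)
        deadline = time.monotonic() + layer_timeout
        while time.monotonic() < deadline:
            if all(svc.is_healthy() for svc in layers[layer]):
                print(f"[orchestrator] Layer {layer} healthy")
                break
            time.sleep(interval)
        else:
            # Timed out: log it and let the next layer start anyway.
            print(f"[orchestrator] Layer {layer} not healthy")
    return order
```

The while/else branch mirrors step 4: a layer that times out does not block the layers above it.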

Problem: Shutdown leaves orphan processes on ports

Symptoms:

  • After stopping hippodrome, ports are still in use
  • Restarting hippodrome fails with "Port already in use"
  • lsof -i :9001 shows processes still running

Cause: Shutdown proceeds in reverse layer order (Layer 3 → 0). If a higher-layer service spawns child processes that don't respond to SIGTERM within 5 seconds, they're force-killed with SIGKILL. Orphaned grandchild processes may not be in the same process group.

Solution:

  1. Run pants hd kill to clean up all hippodrome ports
  2. If that doesn't work: kill -9 $(lsof -t -i :9001 -i :9002 -i :9004 -i :9005)
  3. The orchestrator uses start_new_session=True for process groups, so kill_tree() should catch most cases
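
The process-group approach can be sketched like this on POSIX systems. It illustrates the technique only: start_service and kill_process_group are hypothetical names, and this is not hippodrome's kill_tree().

```python
import os
import signal
import subprocess
from typing import List


def start_service(cmd: List[str]) -> subprocess.Popen:
    """Start cmd in its own session, so its process group ID equals its PID."""
    return subprocess.Popen(cmd, start_new_session=True)


def kill_process_group(proc: subprocess.Popen, grace: float = 5.0) -> None:
    """SIGTERM the whole group, wait up to grace seconds, then SIGKILL."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return  # group already gone
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass
    try:
        # Also catches children that ignored SIGTERM but stayed in the group.
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()
```

A grandchild that starts its own session or process group leaves the group and is not reached by killpg, which is the orphan case this entry describes.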

Problem: Controller returns HTML error page instead of JSON

Symptoms:

  • API calls to controller return HTML 500 page
  • Fuzz test or search_proxy gets unparseable HTML response
  • Error like "JSONDecodeError" in caller logs

Cause: Django returns HTML error pages for unhandled exceptions by default. The DomainExceptionMiddleware catches these and returns JSON instead, but only when DEBUG=True, which hippodrome always sets.

Solution: This should not happen in hippodrome (DEBUG is always True). If it does:

  1. Check that the controller's DEBUG env var is set to true in config/config.py
  2. Check config/middleware.py: DomainExceptionMiddleware must be last in the MIDDLEWARE list
  3. The middleware catches: RecordNotFoundError → 404, RecordAccessDeniedError → 403, requests.RequestException → 502, and all other unhandled exceptions → JSON 500
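
The mapping in step 3 can be pictured with a framework-free sketch. The real DomainExceptionMiddleware is a Django middleware; here to_json_response is a hypothetical function, UpstreamRequestError stands in for requests.RequestException, and the 404/403 body shapes are assumptions.

```python
from typing import Optional, Tuple


class RecordNotFoundError(Exception):
    """Stand-in for the DDB client's not-found error."""


class RecordAccessDeniedError(Exception):
    """Stand-in for the DDB client's access-denied error."""


class UpstreamRequestError(Exception):
    """Stand-in for requests.RequestException, so the sketch has no dependencies."""


def to_json_response(exc: Exception, debug: bool) -> Optional[Tuple[int, dict]]:
    """Return (status, JSON body) for exc, or None to use default handling."""
    if isinstance(exc, RecordNotFoundError):
        return 404, {"error": str(exc)}
    if isinstance(exc, RecordAccessDeniedError):
        return 403, {"error": str(exc)}
    if isinstance(exc, UpstreamRequestError):
        return 502, {"error": "Bad gateway"}
    if debug:
        # DEBUG-mode catch-all: JSON instead of the HTML error page.
        return 500, {"error": f"{type(exc).__name__}: {exc}"}
    return None  # without DEBUG, fall through to the default HTML 500 page
```

The final branch shows why the HTML page can reappear: with DEBUG off, unexpected exceptions are not converted.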

Adding New Entries

When you solve a new problem, add an entry following this format:

## Problem: [Brief description]

**Symptoms:**
- [What error messages you see]
- [What behavior you observe]

**Cause:**
[Why this happens]

**Solution:**
[Step-by-step instructions to fix it]

---

Place new entries in a logical order (common problems first, edge cases later).