Local Stack Troubleshooting Guide
This document contains solutions to common problems encountered while running hippodrome. When you encounter an issue, search this document first. If you solve a new problem, add it here following the format below.
Problem: Port already in use
Symptoms:
- Error message: "Address already in use"
- Error message: "OSError: [Errno 48] Address already in use"
- Service fails to start with a port conflict
Cause: Another process is using one of the required ports (9000, 9001, 9002, or 9008).
Solution:
1. Find the process using the port:
lsof -i :9001 # Replace with the conflicting port
2. Kill it by PID:
kill -9 <PID>
Or do both in one step:
kill -9 $(lsof -ti :9001)
On Mac, you can also try witr --port 9001 (brew install witr)
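To check all of the required ports in one pass, a small loop works (the port list is taken from this section):

```shell
# Report which of hippodrome's ports are free and which are taken.
for port in 9000 9001 9002 9008; do
  pid=$(lsof -ti :"$port" 2>/dev/null)
  if [ -n "$pid" ]; then
    echo "port $port is in use by PID $pid"
  else
    echo "port $port is free"
  fi
done
```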
Problem: Pants fails with PermissionError on sandboxer binary
Symptoms:
- Error like PermissionError: [Errno 1] Operation not permitted: .../pants/bin/sandboxer
- Stack trace mentions os.chmod on a Pants cache path
- pants run or pants test exits immediately
Cause: The Pants cache directory is not writable/executable in the current environment (e.g., sandboxed execution, restrictive filesystem permissions, or a corrupted cache).
Solution:
1. Ensure the Pants cache path is writable (Mac default: ~/Library/Caches/nce/)
2. If running in a sandboxed environment, run Pants outside the sandbox or disable the sandbox
3. Clear the Pants cache and retry:
rm -rf ~/Library/Caches/nce
Problem: Service fails to start - command not found
Symptoms:
- Error message: "FileNotFoundError: [Errno 2] No such file or directory: 'pants'"
- Service shows status "failed" immediately after start
Cause:
The pants command is not in your PATH, or you're running from the wrong directory.
Solution:
1. Ensure you're in the project root (where pants.toml lives)
2. Verify pants is installed: ./pants --version
3. If using a custom pants installation, ensure it's in your PATH
Problem: Console fails to start - npm not found
Symptoms:
- Error message: "FileNotFoundError: [Errno 2] No such file or directory: 'npm'"
- Console shows status "failed"
Cause: Node.js and npm are not installed or not in PATH.
Solution:
1. Install Node.js (includes npm): https://nodejs.org/
2. Verify installation: npm --version
3. Console is non-blocking, so other services will continue running
Problem: Console fails to start - npm ci fails
Symptoms:
- Error during "Running npm ci..."
- npm ci returns a non-zero exit code
Cause:
- package-lock.json is out of sync with package.json
- npm cache is corrupted
- Node version mismatch
Solution:
1. Navigate to the console directory and do a clean reinstall:
cd components/console
rm -rf node_modules
npm cache clean --force
npm install
2. npm install regenerates node_modules and package-lock.json from package.json
Problem: Hot reload not working
Symptoms:
- Changes to Python files don't trigger a restart
- Changes to console files don't refresh
Cause:
- Python: uvicorn's --reload may not detect changes in all cases
- Console: React's hot module replacement requires specific setup
Solution:
For Python services:
1. Verify the service was started with --reload flag (check logs)
2. Ensure you're editing files within the watched directory
3. Try manually restarting the local stack
For console:
1. Check the browser console for HMR errors
2. Try a hard refresh (Cmd+Shift+R)
3. Restart the console service
Problem: Controller can't connect to fake_cell
Symptoms:
- Controller logs show connection errors to localhost:9001
- API requests to the controller return 500 errors
- Logs show "Connection refused"
Cause:
- fake_cell hasn't finished starting
- fake_cell crashed
- Environment variables not set correctly
Solution:
1. Check fake_cell status in the dashboard (http://localhost:9000)
2. Check fake_cell logs for errors
3. Verify fake_cell is running:
curl http://localhost:9001/health
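If fake_cell is slow to come up, a short polling helper saves re-running the curl above by hand. This is a generic sketch (the /health URL comes from this section; the retry counts are arbitrary):

```python
import time
import urllib.request


def wait_for_health(url: str, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # not up yet (connection refused, timeout, ...); retry
        time.sleep(delay)
    return False
```

Example: `wait_for_health("http://localhost:9001/health")` returns True once fake_cell answers.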
Problem: Dashboard shows services but no logs
Symptoms:
- Dashboard at http://localhost:9000 shows service status
- Log viewer tabs are empty
Cause:
- Services just started (logs haven't been captured yet)
- Log capture encountered an error
Solution:
1. Wait a few seconds for logs to be captured
2. Check terminal output - logs should be visible there with colored prefixes
3. If the terminal has logs but the dashboard doesn't, check the browser console for JavaScript errors
Problem: Ctrl+C doesn't stop services
Symptoms:
- Pressing Ctrl+C doesn't terminate all services
- Some services continue running after shutdown
Cause:
- Signal handler may not have triggered shutdown
- Some processes may be orphaned
Solution:
1. Use the stop command to cleanly terminate all processes:
pants run //components/hippodrome/hippodrome/cli.py -- stop
2. If you prefer manual cleanup:
- Press Ctrl+C again (multiple times if needed)
- Find and kill processes manually:
# Find pants/python processes for hippodrome services
ps aux | grep -E "(fake_cell|controller)" | grep -v grep
# Kill by PID
kill -9 <PID>
- Or kill processes by port:
kill -9 $(lsof -ti :9001) $(lsof -ti :9002) 2>/dev/null
Problem: Django settings module not found
Symptoms:
- Controller fails with "ModuleNotFoundError: No module named 'config'"
- Error related to DJANGO_SETTINGS_MODULE
- Traceback shows Django trying to import config.settings
Cause:
Django's module system doesn't work well with Pants' sandbox environment. When running via pants run //components/controller/manage.py, Pants creates a sandbox that doesn't have the correct Python path structure for Django imports.
Solution:
The controller now runs using its local venv instead of pants (as of iteration 4). If you're seeing this error:
1. Ensure the controller venv exists: ls components/controller/.venv
2. If it doesn't exist, the orchestrator will create it automatically. You can also create it manually:
cd components/controller
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
components/controller/.venv/bin/python -c "import django; print(django.VERSION)"
Note: The orchestrator's _setup_controller_venv() method automatically creates and configures the venv if it's missing.
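The same check-then-create logic can be wrapped in a small shell function. This is a sketch of the behavior described for _setup_controller_venv(), not its actual implementation:

```shell
# ensure_venv DIR - create DIR/.venv if missing and install DIR/requirements.txt.
ensure_venv() {
  dir="$1"
  if [ ! -x "$dir/.venv/bin/python" ]; then
    python3 -m venv "$dir/.venv"
    "$dir/.venv/bin/pip" install -q -r "$dir/requirements.txt"
  fi
}
```

Run `ensure_venv components/controller` from the repo root; it is a no-op if the venv already exists.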
Problem: Services stuck waiting for pants invocation
Symptoms:
- Log shows "Another pants invocation is running. Will wait up to 60.0 seconds..."
- Services fail to start after the timeout
- Only affects services run via pants (fake_cell)
Cause: By default, pants doesn't allow concurrent invocations. When the orchestrator starts fake_cell via pants while another pants process is running (including the orchestrator itself), it waits for the lock.
Solution:
Run the local stack with PANTS_CONCURRENT=True:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start
This allows multiple pants processes to run concurrently.
Problem: E-commerce services not starting
Symptoms:
- admin_server and search_proxy are not started
- Only fake_cell, controller, and console are running
- Dashboard at http://localhost:9000 shows only core services
Cause:
E-commerce services are not included in the default core profile. They require the ecom or full profile to be explicitly selected.
Solution:
Use the --profile ecom flag to start e-commerce services:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom
Available profiles:
- core (default): fake_cell, controller, console
- ecom: core + admin_server, search_proxy
- full: all services
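The profile list above can be expressed as a simple mapping. This is a sketch using the service names from this guide; the actual structure in hippodrome's config may differ:

```python
# Profile -> services mapping, per the list above. "full" additionally
# includes the remaining services (exporters, indexers, ...), elided here.
CORE = ["fake_cell", "controller", "console"]
ECOM = CORE + ["admin_server", "search_proxy"]

PROFILES = {"core": CORE, "ecom": ECOM, "full": ECOM}


def services_for(profile: str) -> list:
    """Resolve a --profile flag to the list of services to start."""
    try:
        return PROFILES[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile}") from None
```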
Problem: admin_server fails to start - validation error
Symptoms:
- admin_server exits immediately after starting
- Error message: "ValidationError" or "field required"
- Logs show a missing environment variable
Cause: admin_server requires several environment variables for configuration. The orchestrator sets placeholder values, but some features may require real values.
Solution:
1. For basic local development, the default placeholder values should work
2. For Shopify integration, create a .env file in components/shopify/admin_server/ with real credentials:
SHOPIFY_API_KEY=your-api-key
SHOPIFY_API_SECRET=your-api-secret
Problem: search_proxy fails to start - wrangler not found
Symptoms:
- search_proxy shows status "failed"
- Error message: "npx: command not found" or "wrangler: command not found"
Cause: search_proxy is a Cloudflare Worker that requires Node.js and wrangler to run locally.
Solution:
1. Install Node.js if missing: https://nodejs.org/
2. Install dependencies in search_proxy:
cd components/search_proxy
npm install
3. Verify wrangler is available:
npx wrangler --version
Problem: Connected to wrong DynamoDB tables
Symptoms:
- Data operations affect unexpected tables
- "Table not found" errors
- Reading/writing tables from another developer's branch
Cause:
The table prefix defaults to dev-{git-branch}-. If you're on a different branch than expected, or the branch name contains special characters, the table names may not match what you expect.
Solution:
1. Check the startup message for the table prefix being used:
Table prefix: dev-feature-auth-
2. Use --table-prefix to explicitly set the prefix:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --table-prefix my-feature-
3. Note that branch names are sanitized when deriving the default prefix, e.g. feature/add-auth → dev-feature-add-auth-
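The sanitization implied by the example above (feature/add-auth → dev-feature-add-auth-) can be sketched as follows; the exact rule hippodrome applies may differ:

```python
import re


def table_prefix(branch: str) -> str:
    """Derive a dev table prefix from a git branch name (illustrative sketch)."""
    # Collapse any run of non-alphanumeric characters (e.g. "/") into "-".
    sanitized = re.sub(r"[^A-Za-z0-9]+", "-", branch).strip("-").lower()
    return f"dev-{sanitized}-"
```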
Problem: Production warning when starting
Symptoms:
- Red warning message: "WARNING: Connecting to PRODUCTION cell!"
- Concern about affecting production data
Cause:
You're using --cell prod which connects to the production cell instead of fake_cell.
Solution:
1. If this was intentional, proceed carefully - you're working with production data
2. If unintentional, stop the orchestrator and restart without --cell prod:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom
# or explicitly use local cell:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --cell local
3. If you need a deployed cell but not production data, consider --cell staging instead
Problem: fake_cell not starting with --cell flag
Symptoms:
- Using --cell staging or --cell prod
- fake_cell is not running (expected)
- Services can't connect to cell
Cause:
When using --cell staging or --cell prod, fake_cell is intentionally skipped because services connect to the real deployed cell. However, network issues or authentication may prevent connection.
Solution:
1. Verify AWS credentials are configured for the target cell
2. Check VPN connection if required for staging/prod access
3. Verify the cell is accessible:
# For staging cell
curl -I https://9ok9ywt6u5.execute-api.us-east-1.amazonaws.com/prod/health
4. To use fake_cell again, restart without the flag or with --cell local (the default)
Problem: Search requests fail - global_worker not running
Symptoms:
- Search requests to http://localhost:9005/indexes/.../search fail
- Error message: "ECONNREFUSED" or "Failed to fetch from global_worker"
- search_proxy logs show connection error to port 9012
Cause: The global_worker is an external service (separate repository) that must be running for search requests to work. When search_proxy receives a search request, it routes to global_worker which then forwards to the cell.
Solution:
1. The global_worker lives in a separate repository. Clone and set it up:
cd ~/dev
git clone git@github.com:marqo-ai/global-worker.git
cd global-worker
npm install
2. Create a wrangler.local.toml configuration pointing to fake_cell:
name = "local-global-worker"
main = "src/index.ts"
compatibility_date = "2024-09-23"
compatibility_flags = ["nodejs_compat"]

[vars]
ENV = "dev"
CELL_URL = "http://localhost:9001"

[dev]
port = 9012
local_protocol = "http"
3. Start global_worker:
npx wrangler dev --config wrangler.local.toml --port 9012
4. If you don't need search functionality, you can ignore this - admin operations will still work.
Problem: global_worker can't connect to fake_cell
Symptoms:
- global_worker is running on port 9012
- Search requests fail with "connection refused" to port 9001
- global_worker logs show errors connecting to CELL_URL
Cause:
- fake_cell is not running (check with --cell flag)
- CELL_URL in global_worker's config is wrong
- hippodrome not started with correct profile
Solution:
1. Verify fake_cell is running (dashboard shows it at http://localhost:9000)
2. Check global_worker's wrangler.local.toml has CELL_URL = "http://localhost:9001"
3. Ensure hippodrome is running: PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom
4. If using --cell staging or --cell prod, update global_worker's CELL_URL to match the real cell URL
Problem: agentic_search fails to start
Symptoms:
- agentic_search shows status "failed" in the dashboard
- Error message: "npx: command not found" or "wrangler: command not found"
- Errors about missing dependencies
Cause: agentic_search is a Cloudflare Worker that requires Node.js, npm, and its dependencies to be installed.
Solution:
1. Verify Node.js and npm are installed:
node --version
npm --version
2. Install dependencies:
cd components/agentic_search
npm install
3. Verify wrangler is available:
npx wrangler --version
4. If you see .dev.vars errors, create the file with the required credentials:
cd components/agentic_search
cat > .dev.vars << EOF
GOOGLE_API_KEY=your-google-api-key
AGENTIC_AWS_ACCESS_KEY_ID=your-aws-access-key
AGENTIC_AWS_SECRET_ACCESS_KEY=your-aws-secret-key
EOF
Problem: agentic_search can't communicate with search_proxy
Symptoms:
- agentic_search starts successfully on port 9007
- API calls to agentic_search fail with service binding errors
- Logs show "SEARCH_PROXY_WORKER is not defined" or similar
Cause:
In local development mode, Cloudflare service bindings don't work. agentic_search uses SEARCH_PROXY_WORKER service binding in production, but needs HTTP fallback for local development.
Solution:
1. Verify SEARCH_PROXY_URL is set in the environment. The orchestrator sets this automatically:
# Check the orchestrator config sets this
SEARCH_PROXY_URL=http://localhost:9005
2. Restart agentic_search and confirm SEARCH_PROXY_URL is set.
3. If service binding errors persist, the HTTP fallback code may not be working correctly.
4. Check that wrangler.local.toml has the correct configuration:
[vars]
SEARCH_PROXY_URL = "http://localhost:9005"
Problem: Cloudflare Worker Durable Objects not persisting
Symptoms:
- Durable Object data (like conversation history in agentic_search) is lost between restarts
- Each request creates a new Durable Object instance
Cause: By default, wrangler dev does not persist Durable Object state. State is stored in memory and lost when the worker restarts.
Solution:
1. Use the --persist flag when running wrangler dev:
npx wrangler dev --config wrangler.local.toml --port 9007 --persist
2. Persisted state is written to the .wrangler/state/ directory
3. Note: The orchestrator doesn't add --persist by default. You can:
- Stop the orchestrator and run agentic_search manually with --persist
- Or accept that state will be lost on restart (usually fine for development)
Problem: AWS SSO token expired - services fail to validate tables
Symptoms:
- admin_server logs show: "Could not validate X table: Error when retrieving token from sso: Token has expired"
- ecom services start but show warning messages
- DynamoDB operations fail with authentication errors
Cause: E-commerce services (admin_server, ecom_indexer, etc.) need AWS credentials to access DynamoDB tables. When using AWS SSO, the temporary credentials expire and need to be refreshed.
Solution:
1. Refresh your AWS SSO credentials:
aws sso login --profile your-profile-name
2. Make sure the profile is exported in the shell running hippodrome:
export AWS_PROFILE=your-profile-name
Prevention:
- Refresh SSO credentials before starting hippodrome
- Consider using long-lived credentials for local development (in ~/.aws/credentials)
Problem: Multiple Cloudflare Workers conflict on debug port
Symptoms:
- Error: "Address already in use" for port 9229
- The second Cloudflare Worker fails to start
- agentic_search or search_proxy won't start when both are running
Cause: Wrangler's Node.js debugger uses port 9229 by default. When running multiple workers, they conflict on this port.
Solution:
1. The workers should use different inspector ports. Check if --inspector-port is set in the commands.
2. If you see this error, you can disable the inspector for one worker:
# Run agentic_search without inspector
npx wrangler dev --config wrangler.local.toml --port 9007 --inspector-port 9230
3. Or free port 9229 by killing whatever holds it:
lsof -ti :9229 | xargs kill -9
Problem: KV namespace errors in Cloudflare Workers
Symptoms:
- Error: "KV namespace binding not found"
- search_proxy or agentic_search fails to read/write KV data
- 500 errors from workers when accessing cached data
Cause: KV namespaces in wrangler.local.toml use placeholder IDs that wrangler dev accepts, but the local KV storage starts empty.
Solution:
1. For basic functionality, the workers should handle missing KV data gracefully
2. If you need specific KV data for testing, use the wrangler CLI to populate local KV:
cd components/search_proxy
npx wrangler kv:key put --binding SEARCH_PROXY_KV "key-name" "value" --local
3. Local KV data is stored under .wrangler/state/ and can be cleared:
rm -rf components/search_proxy/.wrangler/state/
rm -rf components/agentic_search/.wrangler/state/
Problem: EventBridge webhook returns "No route configured"
Symptoms:
- POST to /webhook/eventbridge returns {"status": "ignored", "reason": "No route configured for detail-type: ..."}
- Events are not forwarded to local services
Cause:
The detail-type in the event doesn't match any configured route prefix. Routes are configured for EcomIndexSettings.* and Merchandising.* only.
Solution:
1. Verify your event has the correct detail-type field:
- For index settings: EcomIndexSettings.MODIFY, EcomIndexSettings.INSERT, EcomIndexSettings.REMOVE
- For merchandising: Merchandising.MODIFY, Merchandising.INSERT, Merchandising.REMOVE
2. Check your event format matches the expected structure:
{
"source": "marqo.dynamodb",
"detail-type": "EcomIndexSettings.MODIFY",
"detail": { ... }
}
3. To forward additional event types, add a route prefix to EVENTBRIDGE_ROUTES in dashboard.py
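The prefix-based routing described above can be sketched like this. The prefixes and target ports come from this guide; the real EVENTBRIDGE_ROUTES structure in dashboard.py may differ:

```python
from typing import Optional

# Route table: detail-type prefix -> local forwarding target.
EVENTBRIDGE_ROUTES = {
    "EcomIndexSettings.": "http://localhost:9010",  # ecom_settings_exporter
    "Merchandising.": "http://localhost:9011",      # merchandising_exporter
}


def route_for(detail_type: str) -> Optional[str]:
    """Return the forwarding target for an event, or None if unrouted."""
    for prefix, target in EVENTBRIDGE_ROUTES.items():
        if detail_type.startswith(prefix):
            return target
    return None
```

An event whose detail-type matches no prefix gets the "No route configured" response.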
Problem: EventBridge webhook returns service unavailable
Symptoms:
- POST to /webhook/eventbridge returns status 503
- Response contains {"error": "Service unavailable: ..."}
- Logs show "Failed to forward event to localhost:9010"
Cause: The target service (ecom_settings_exporter or merchandising_exporter) is not running on the expected port.
Solution:
1. Verify the target service is running with the ecom profile:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom
2. Check the service health endpoints:
# For ecom_settings_exporter
curl http://localhost:9010/healthz
# For merchandising_exporter
curl http://localhost:9011/healthz
Problem: EventBridge webhook returns 400 - missing detail-type
Symptoms:
- POST to /webhook/eventbridge returns {"error": "Missing 'detail-type' field in event"}
- HTTP status 400
Cause:
The event JSON is missing the required detail-type field.
Solution:
1. Ensure your event includes detail-type (note the hyphen, not underscore):
{
"source": "marqo.dynamodb",
"detail-type": "EcomIndexSettings.MODIFY",
"detail": { ... }
}
2. A common mistake is sending detailType or detail_type instead of detail-type
3. Verify JSON is valid:
echo '{"detail-type": "test"}' | jq .
Problem: Events not reaching local services from AWS
Symptoms:
- Events sent from AWS EventBridge don't reach the local webhook
- Manual webhook testing works fine
- No logs showing incoming events
Cause: The tunnel or EventBridge rule may not be set up correctly, or AWS credentials may be missing.
Solution:
1. Use the --enable-events flag to automatically set up tunneling and EventBridge rules:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --enable-events
2. Verify AWS credentials are available - run aws sso login if using AWS SSO
3. Check the startup output for:
- [tunnel] EventBridge webhook exposed at ... - confirms tunnel is working
- [eventbridge] Rule '...' created for ... - confirms rule was created
4. If rule creation fails but tunnel works, you can manually create a rule:
- Copy the tunnel URL from the startup output
- Create an EventBridge rule in the AWS Console targeting that URL
5. For manual testing without --enable-events, use curl:
curl -X POST http://localhost:9000/webhook/eventbridge \
-H "Content-Type: application/json" \
-d '{"detail-type": "EcomIndexSettings.MODIFY", "detail": {...}}'
Problem: EventBridge rule creation fails
Symptoms:
- Tunnel starts successfully
- [eventbridge] Failed to create EventBridge rule message appears
- Events are not forwarded from AWS
Cause: Missing AWS credentials or insufficient IAM permissions.
Solution:
1. Ensure AWS credentials are configured:
aws sts get-caller-identity # Should show your identity
aws sso login # If using AWS SSO
2. Ensure your AWS identity has the required EventBridge and IAM permissions:
- events:PutRule, events:PutTargets, events:DeleteRule, events:RemoveTargets
- events:CreateConnection, events:DeleteConnection, events:DescribeConnection
- events:CreateApiDestination, events:DeleteApiDestination, events:DescribeApiDestination
- iam:CreateRole, iam:PutRolePolicy, iam:GetRole (for first-time setup)
3. The orchestrator will still work without the rule - you can test manually with curl
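The permissions listed above can be collected into a single IAM policy statement. This is a sketch; in practice, scope the Resource down rather than using "*":

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "events:PutRule", "events:PutTargets", "events:DeleteRule", "events:RemoveTargets",
        "events:CreateConnection", "events:DeleteConnection", "events:DescribeConnection",
        "events:CreateApiDestination", "events:DeleteApiDestination", "events:DescribeApiDestination",
        "iam:CreateRole", "iam:PutRolePolicy", "iam:GetRole"
      ],
      "Resource": "*"
    }
  ]
}
```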
Problem: EventBridge Pipes not running or missing
Symptoms:
- Validation script shows [MISSING] or [WARN] for pipes
- EventBridge events never reach the custom event buses
- No events forwarded to local services even with --enable-events
Cause: The CDK infrastructure (EventBridge Pipes) hasn't been deployed or failed to deploy.
Solution:
1. Run the validation script to check infrastructure status:
python components/hippodrome/scripts/validate_eventbridge.py --table-prefix dev-main- --verbose
2. If pipes are missing, deploy the CDK stacks:
# Deploy ecom stack (creates EcomEventBus and IndexSettingsEventPipe)
cd infra/ecom
npx cdk deploy --require-approval never
# Deploy controller stack (creates MerchandisingEventBus and MerchandisingEventPipe)
cd ../controller
npx cdk deploy MerchandisingStack --require-approval never
Problem: EventBridge events not appearing in console
Symptoms:
- EventBridge Pipes are RUNNING
- The local webhook never receives events
- No activity in EventBridge console metrics
Cause: The custom event buses may not have any subscribers, or the DynamoDB table isn't receiving writes.
Solution:
1. Verify the event bus exists and has the correct name format:
- For table prefix dev-main-, event buses should be:
- dev-main-EcomEventBus
- dev-main-MerchandisingEventBus
2. Make a test write to trigger an event:
aws dynamodb put-item \
--table-name dev-main-EcomIndexSettingsTable \
--item '{"pk": {"S": "test"}, "sk": {"S": "INDEX#test"}}'
3. Check the EventBridge rule's MatchedEvents and FailedInvocations metrics
4. Enable EventBridge archive/logging for debugging:
- In EventBridge console, create an archive rule to capture all events
- This helps identify if events are reaching the bus
Problem: Pants resolve conflicts when starting Lambda services
Symptoms:
- Error: "Targets that are in different resolves cannot be mixed"
- Lambda wrappers fail to build with dependency resolution errors
- Services like ecom_indexer, ecom_settings_exporter, merchandising_exporter fail to start
Cause: Lambda services (ecom_indexer, ecom_settings_exporter, merchandising_exporter) each have their own Pants resolve. If hippodrome tries to run wrapper scripts that import from these services while using a different resolve, the build fails.
Solution:
The Lambda services should be started using their native :local targets in their respective directories, not through wrapper scripts in hippodrome:
- //components/ecom_indexer:local
- //components/ecom_settings_exporter:local
- //components/merchandising_exporter:local
- //components/shopify/admin_server:local
The config.py's create_python_lambda_service function should use f"//components/{name}:local" format.
Problem: ecom_settings_exporter fails with "No module named 'uvicorn'"
Symptoms:
- ecom_settings_exporter service exits immediately with ModuleNotFoundError: No module named 'uvicorn'
- Other Lambda services start fine
Cause:
The 3rdparty/python/ecom_settings_exporter.lock lockfile doesn't include fastapi and uvicorn packages which are needed for the local development HTTP wrapper.
Solution:
1. Regenerate the lockfile to include the local development dependencies:
pants generate-lockfiles --resolve=ecom_settings_exporter
2. Verify the lockfile now includes uvicorn:
grep uvicorn 3rdparty/python/ecom_settings_exporter.lock
Problem: API key authentication fails with "Invalid API key"
Symptoms:
- Requests to admin_server return 401 "Invalid API key"
- admin_server logs show: "API key encryption secret not available, cannot decrypt API key"
- admin_server logs show: "Failed to extract system account ID from API key"
Cause:
Marqo API keys are encrypted. The admin_server needs access to the encryption secret (either from AWS Secrets Manager or via MARQO_API_KEY_SECRET environment variable) to decrypt them. In local development, AWS Secrets Manager access may not be available.
Solution:
1. Ensure you have valid AWS credentials:
aws sso login --profile staging
export AWS_PROFILE=staging
2. The orchestrator fetches the encryption secret from AWS Secrets Manager (dev/api_key_encryption_key_secret) and sets it as MARQO_API_KEY_SECRET.
3. If AWS access fails, you can manually set the secret:
export MARQO_API_KEY_SECRET=$(aws secretsmanager get-secret-value --secret-id dev/api_key_encryption_key_secret --query 'SecretString' --output text)
Problem: API key validation fails with KeyError for cell ID
Symptoms:
- Requests to admin_server return 500 error
- Error message: KeyError: 'S' or KeyError: 'D'
- Traceback shows error in ControlPlaneGateway looking up cell in DATA_PLANE_CELLS
Cause:
The API key was created for a different cell (e.g., "S" for staging) but hippodrome's DATA_PLANE_CELLS only contains the "local" cell configuration. The admin_server tries to validate the API key against the cell specified in the key, but that cell isn't configured.
Solution:
The config.py's get_data_plane_cells_json function should include staging cell configuration when using local cell, so that API keys created for staging can be validated:
# In config.py - get_data_plane_cells_json should include staging config for local cell
if cell == Cell.LOCAL:
staging_config = CELL_CONFIGS[Cell.STAGING]
cells_config[staging_config.id] = {
"aws_region": staging_config.region,
"gateway_id": staging_config.gateway_id,
}
This allows staging API keys to be validated against the staging cell while other operations use fake_cell.
Problem: E2E tests fail with ResourceNotFoundException for DynamoDB tables
Symptoms:
- Tests return 500 error with ResourceNotFoundException when calling PutItem
- Error mentions table like dev-hippodrome-EcomIndexSettingsTable not found
- admin_server logs show DynamoDB table not found
Cause:
Hippodrome uses a table prefix based on the git branch (e.g., dev-hippodrome-). These DynamoDB tables don't exist in AWS - they must either be created, or an existing table prefix must be used.
Solution:
1. Use an existing table prefix by specifying --table-prefix when starting hippodrome:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --table-prefix dev-main-
2. Or create the tables for your branch prefix by deploying the CDK stack:
cd infra/ecom
npx cdk deploy --context tablePrefix=dev-hippodrome-
3. Make sure the prefix also matches the e2e tests' configuration (.env.dev)
Note: Running e2e tests against hippodrome requires DynamoDB tables to exist with the matching prefix. The fake_cell handles cell operations but DynamoDB tables are always in AWS.
Problem: E2E tests require specific environment configuration
Symptoms:
- Tests fail with various errors when run against hippodrome
- Errors about missing environment variables or configuration
Cause: E2E tests are designed to run against deployed AWS infrastructure. Running them against hippodrome requires specific configuration.
Solution: Run e2e tests with these environment variables:
ENVIRONMENT=dev \
ENV_BRANCH=hippodrome \
ADMIN_SERVER_BASE_URL=http://localhost:9004 \
SEARCH_PROXY_URL=http://localhost:9005 \
AWS_PROFILE=staging \
pants test //components/shopify/e2e_tests/e2e_tests/tests/ecom_onboarding_test.py -- -v
Key variables:
- ENVIRONMENT=dev - Tells tests to use dev environment
- ENV_BRANCH - Must match your hippodrome table prefix (default is git branch)
- ADMIN_SERVER_BASE_URL - Points to local admin_server (port 9004)
- SEARCH_PROXY_URL - Points to local search_proxy (port 9005)
- AWS_PROFILE - AWS profile with access to DynamoDB tables
Problem: Marqo API calls return 401 "Invalid API Key" despite successful decryption
Symptoms:
- admin_server logs show: "HTTP error fetching indexes: 401 - {\"error\":\"Invalid API Key.\"}"
- The encryption secret is retrieved successfully (from env var MARQO_API_KEY_SECRET)
- API key decryption appears to work (can decrypt to get system_account_id, cell, token)
- But calls to Marqo API fail with 401
Cause:
The admin_server stores encrypted API keys in the ShopifyApiKeyRecord.marqo_api_key field. When calling Marqo, it passes this encrypted key directly to MarqoClient, which sends it in the x-api-key header. Marqo expects the decrypted token, not the encrypted key.
The flow is:
1. User sends encrypted API key as Bearer token
2. Admin_server decrypts it to get {system_account_id, cell, token}
3. Admin_server stores the ENCRYPTED key in DynamoDB (marqo_api_key field)
4. Later, when calling Marqo, it retrieves api_key_record.marqo_api_key (encrypted)
5. Passes encrypted key to MarqoClient(api_key=encrypted_key)
6. Marqo receives encrypted key and rejects it with 401
Solution: This is a design issue in the admin_server. It doesn't manifest against the deployed admin_server, where the encrypted key in the database matches what Marqo expects (there may be additional infrastructure handling this).
For now, running e2e tests against hippodrome requires either:
1. Running against a deployed dev environment (using the AWS-hosted admin_server) instead of the local admin_server
2. Or modifying the admin_server code to decrypt the stored key before calling Marqo
To run e2e tests against deployed infrastructure:
# Don't set ADMIN_SERVER_BASE_URL - let it default to deployed AWS environment
ENVIRONMENT=dev \
ENV_BRANCH=admin-api \
pants test //components/shopify/e2e_tests/e2e_tests/tests:e2e_tests -- -k ecom_onboarding -v
Adding New Entries
When you solve a new problem, add an entry following this format:
## Problem: [Brief description]
**Symptoms:**
- [What error messages you see]
- [What behavior you observe]
**Cause:**
[Why this happens]
**Solution:**
[Step-by-step instructions to fix it]
---
Place new entries in a logical order (common problems first, edge cases later).