Local Stack Troubleshooting Guide
This document contains solutions to common problems encountered while running hippodrome. When you encounter an issue, search this document first. If you solve a new problem, add it here following the format below.
Problem: Port already in use
Symptoms:
- Error message: "Address already in use"
- Error message: "OSError: [Errno 48] Address already in use"
- Service fails to start with a port conflict
Cause: Another process is using one of the required ports (9000, 9001, 9002, or 9008).
Solution:
1. Find the process using the port:
lsof -i :9001 # Replace with the conflicting port
2. Kill it by PID:
kill -9 <PID>
Or do both in one step:
kill -9 $(lsof -ti :9001)
On Mac, you can also try witr --port 9001 (brew install witr)
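To check all of the required ports in one pass, a small loop works (the port list is taken from this section):

```shell
# Report which of hippodrome's ports are free and which are taken.
for port in 9000 9001 9002 9008; do
  pid=$(lsof -ti :"$port" 2>/dev/null)
  if [ -n "$pid" ]; then
    echo "port $port is in use by PID $pid"
  else
    echo "port $port is free"
  fi
done
```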
Problem: Pants fails with PermissionError on sandboxer binary
Symptoms:
- Error like PermissionError: [Errno 1] Operation not permitted: .../pants/bin/sandboxer
- Stack trace mentions os.chmod on a Pants cache path
- pants run or pants test exits immediately
Cause: The Pants cache directory is not writable/executable in the current environment (e.g., sandboxed execution, restrictive filesystem permissions, or a corrupted cache).
Solution:
1. Ensure the Pants cache path is writable (Mac default: ~/Library/Caches/nce/)
2. If running in a sandboxed environment, run Pants outside the sandbox or disable the sandbox
3. Clear the Pants cache and retry:
rm -rf ~/Library/Caches/nce
Problem: Service fails to start - command not found
Symptoms:
- Error message: "FileNotFoundError: [Errno 2] No such file or directory: 'pants'"
- Service shows status "failed" immediately after start
Cause:
The pants command is not in your PATH, or you're running from the wrong directory.
Solution:
1. Ensure you're in the project root (where pants.toml lives)
2. Verify pants is installed: ./pants --version
3. If using a custom pants installation, ensure it's in your PATH
Problem: Console fails to start - npm not found
Symptoms:
- Error message: "FileNotFoundError: [Errno 2] No such file or directory: 'npm'"
- Console shows status "failed"
Cause: Node.js and npm are not installed or not in PATH.
Solution:
1. Install Node.js (includes npm): https://nodejs.org/
2. Verify installation: npm --version
3. Console is non-blocking, so other services will continue running
Problem: Console fails to start - npm ci fails
Symptoms:
- Error during "Running npm ci..."
- npm ci returns a non-zero exit code
Cause:
- package-lock.json is out of sync with package.json
- npm cache is corrupted
- Node version mismatch
Solution:
1. Navigate to the console directory and do a clean reinstall:
cd components/console
rm -rf node_modules
npm cache clean --force
npm install
2. npm install regenerates node_modules and package-lock.json from package.json
Problem: Hot reload not working
Symptoms:
- Changes to Python files don't trigger a restart
- Changes to console files don't refresh
Cause:
- Python: uvicorn's --reload may not detect changes in all cases
- Console: React's hot module replacement requires specific setup
Solution:
For Python services:
1. Verify the service was started with --reload flag (check logs)
2. Ensure you're editing files within the watched directory
3. Try manually restarting the local stack
For console:
1. Check the browser console for HMR errors
2. Try a hard refresh (Cmd+Shift+R)
3. Restart the console service
Problem: Controller can't connect to fake_cell
Symptoms:
- Controller logs show connection errors to localhost:9001
- API requests to the controller return 500 errors
- Logs show "Connection refused"
Cause:
- fake_cell hasn't finished starting
- fake_cell crashed
- Environment variables not set correctly
Solution:
1. Check fake_cell status in the dashboard (http://localhost:9000)
2. Check fake_cell logs for errors
3. Verify fake_cell is running:
curl http://localhost:9001/health
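If fake_cell is slow to come up, a short polling helper saves re-running the curl above by hand. This is a generic sketch (the /health URL comes from this section; the retry counts are arbitrary):

```python
import time
import urllib.request


def wait_for_health(url: str, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # not up yet (connection refused, timeout, ...); retry
        time.sleep(delay)
    return False
```

Example: `wait_for_health("http://localhost:9001/health")` returns True once fake_cell answers.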
Problem: Dashboard shows services but no logs
Symptoms:
- Dashboard at http://localhost:9000 shows service status
- Log viewer tabs are empty
Cause:
- Services just started (logs haven't been captured yet)
- Log capture encountered an error
Solution:
1. Wait a few seconds for logs to be captured
2. Check terminal output - logs should be visible there with colored prefixes
3. If the terminal has logs but the dashboard doesn't, check the browser console for JavaScript errors
Problem: Ctrl+C doesn't stop services
Symptoms:
- Pressing Ctrl+C doesn't terminate all services
- Some services continue running after shutdown
Cause:
- Signal handler may not have triggered shutdown
- Some processes may be orphaned
Solution:
1. Use the stop command to cleanly terminate all processes:
pants run //components/hippodrome/hippodrome/cli.py -- stop
2. If you prefer manual cleanup:
- Press Ctrl+C again (multiple times if needed)
- Find and kill processes manually:
# Find pants/python processes for hippodrome services
ps aux | grep -E "(fake_cell|controller)" | grep -v grep
# Kill by PID
kill -9 <PID>
- Or kill processes by port:
kill -9 $(lsof -ti :9001) $(lsof -ti :9002) 2>/dev/null
Problem: Django settings module not found
Symptoms:
- Controller fails with "ModuleNotFoundError: No module named 'config'"
- Error related to DJANGO_SETTINGS_MODULE
- Traceback shows Django trying to import config.settings
Cause:
Django's module system doesn't work well with Pants' sandbox environment. When running via pants run //components/controller/manage.py, Pants creates a sandbox that doesn't have the correct Python path structure for Django imports.
Solution:
The controller now runs using its local venv instead of pants (as of iteration 4). If you're seeing this error:
1. Ensure the controller venv exists: ls components/controller/.venv
2. If it doesn't exist, the orchestrator will create it automatically. You can also create it manually:
cd components/controller
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
components/controller/.venv/bin/python -c "import django; print(django.VERSION)"
Note: The orchestrator's _setup_controller_venv() method automatically creates and configures the venv if it's missing.
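The same check-then-create logic can be wrapped in a small shell function. This is a sketch of the behavior described for _setup_controller_venv(), not its actual implementation:

```shell
# ensure_venv DIR - create DIR/.venv if missing and install DIR/requirements.txt.
ensure_venv() {
  dir="$1"
  if [ ! -x "$dir/.venv/bin/python" ]; then
    python3 -m venv "$dir/.venv"
    "$dir/.venv/bin/pip" install -q -r "$dir/requirements.txt"
  fi
}
```

Run `ensure_venv components/controller` from the repo root; it is a no-op if the venv already exists.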
Problem: Services stuck waiting for pants invocation
Symptoms:
- Log shows "Another pants invocation is running. Will wait up to 60.0 seconds..."
- Services fail to start after the timeout
- Only affects services run via pants (fake_cell)
Cause: By default, pants doesn't allow concurrent invocations. When the orchestrator starts fake_cell via pants while another pants process is running (including the orchestrator itself), it waits for the lock.
Solution:
Run the local stack with PANTS_CONCURRENT=True:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start
This allows multiple pants processes to run concurrently.
Problem: E-commerce services not starting
Symptoms:
- admin_server and search_proxy are not started
- Only fake_cell, controller, and console are running
- Dashboard at http://localhost:9000 shows only core services
Cause:
E-commerce services are not included in the default core profile. They require the ecom or full profile to be explicitly selected.
Solution:
Use the --profile ecom flag to start e-commerce services:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom
Available profiles:
- core (default): fake_cell, controller, console
- ecom: core + admin_server, search_proxy
- full: all services
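The profile list above can be expressed as a simple mapping. This is a sketch using the service names from this guide; the actual structure in hippodrome's config may differ:

```python
# Profile -> services mapping, per the list above. "full" additionally
# includes the remaining services (exporters, indexers, ...), elided here.
CORE = ["fake_cell", "controller", "console"]
ECOM = CORE + ["admin_server", "search_proxy"]

PROFILES = {"core": CORE, "ecom": ECOM, "full": ECOM}


def services_for(profile: str) -> list:
    """Resolve a --profile flag to the list of services to start."""
    try:
        return PROFILES[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile}") from None
```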
Problem: admin_server fails to start - validation error
Symptoms:
- admin_server exits immediately after starting
- Error message: "ValidationError" or "field required"
- Logs show a missing environment variable
Cause: admin_server requires several environment variables for configuration. The orchestrator sets placeholder values, but some features may require real values.
Solution:
1. For basic local development, the default placeholder values should work
2. For Shopify integration, create a .env file in components/shopify/admin_server/ with real credentials:
SHOPIFY_API_KEY=your-api-key
SHOPIFY_API_SECRET=your-api-secret
Problem: search_proxy fails to start - wrangler not found
Symptoms:
- search_proxy shows status "failed"
- Error message: "npx: command not found" or "wrangler: command not found"
Cause: search_proxy is a Cloudflare Worker that requires Node.js and wrangler to run locally.
Solution:
1. Install Node.js if missing: https://nodejs.org/
2. Install dependencies in search_proxy:
cd components/search_proxy
npm install
3. Verify wrangler is available:
npx wrangler --version
Problem: Connected to wrong DynamoDB tables
Symptoms:
- Data operations affect unexpected tables
- "Table not found" errors
- Reading/writing tables from another developer's branch
Cause:
The table prefix defaults to dev-{git-branch}-. If you're on a different branch than expected, or the branch name contains special characters, the table names may not match what you expect.
Solution:
1. Check the startup message for the table prefix being used:
Table prefix: dev-feature-auth-
2. Use --table-prefix to explicitly set the prefix:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --table-prefix my-feature-
3. Note that branch names are sanitized when deriving the default prefix, e.g. feature/add-auth → dev-feature-add-auth-
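The sanitization implied by the example above (feature/add-auth → dev-feature-add-auth-) can be sketched as follows; the exact rule hippodrome applies may differ:

```python
import re


def table_prefix(branch: str) -> str:
    """Derive a dev table prefix from a git branch name (illustrative sketch)."""
    # Collapse any run of non-alphanumeric characters (e.g. "/") into "-".
    sanitized = re.sub(r"[^A-Za-z0-9]+", "-", branch).strip("-").lower()
    return f"dev-{sanitized}-"
```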
Problem: Production warning when starting
Symptoms:
- Red warning message: "WARNING: Connecting to PRODUCTION cell!"
- Concern about affecting production data
Cause:
You're using --cell prod which connects to the production cell instead of fake_cell.
Solution:
1. If this was intentional, proceed carefully - you're working with production data
2. If unintentional, stop the orchestrator and restart without --cell prod:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom
# or explicitly use local cell:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --cell local
3. If you need a deployed cell but not production data, consider --cell staging instead
Problem: fake_cell not starting with --cell flag
Symptoms:
- Using --cell staging or --cell prod
- fake_cell is not running (expected)
- Services can't connect to cell
Cause:
When using --cell staging or --cell prod, fake_cell is intentionally skipped because services connect to the real deployed cell. However, network issues or authentication may prevent connection.
Solution:
1. Verify AWS credentials are configured for the target cell
2. Check VPN connection if required for staging/prod access
3. Verify the cell is accessible:
# For staging cell
curl -I https://9ok9ywt6u5.execute-api.us-east-1.amazonaws.com/prod/health
4. To use fake_cell again, restart without the flag or with --cell local (the default)
Problem: Search requests fail - global_worker not running
Symptoms:
- Search requests to http://localhost:9005/indexes/.../search fail
- Error message: "ECONNREFUSED" or "Failed to fetch from global_worker"
- search_proxy logs show connection error to port 9012
Cause: The global_worker is an external service (separate repository) that must be running for search requests to work. When search_proxy receives a search request, it routes to global_worker which then forwards to the cell.
Solution:
1. The global_worker lives in a separate repository. Clone and set it up:
cd ~/dev
git clone git@github.com:marqo-ai/global-worker.git
cd global-worker
npm install
2. Create a wrangler.local.toml configuration pointing to fake_cell:
name = "local-global-worker"
main = "src/index.ts"
compatibility_date = "2024-09-23"
compatibility_flags = ["nodejs_compat"]

[vars]
ENV = "dev"
CELL_URL = "http://localhost:9001"

[dev]
port = 9012
local_protocol = "http"
3. Start global_worker:
npx wrangler dev --config wrangler.local.toml --port 9012
4. If you don't need search functionality, you can ignore this - admin operations will still work.
Problem: global_worker can't connect to fake_cell
Symptoms:
- global_worker is running on port 9012
- Search requests fail with "connection refused" to port 9001
- global_worker logs show errors connecting to CELL_URL
Cause:
- fake_cell is not running (check with --cell flag)
- CELL_URL in global_worker's config is wrong
- hippodrome not started with correct profile
Solution:
1. Verify fake_cell is running (dashboard shows it at http://localhost:9000)
2. Check global_worker's wrangler.local.toml has CELL_URL = "http://localhost:9001"
3. Ensure hippodrome is running: PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom
4. If using --cell staging or --cell prod, update global_worker's CELL_URL to match the real cell URL
Problem: agentic_search fails to start
Symptoms:
- agentic_search shows status "failed" in the dashboard
- Error message: "npx: command not found" or "wrangler: command not found"
- Errors about missing dependencies
Cause: agentic_search is a Cloudflare Worker that requires Node.js, npm, and its dependencies to be installed.
Solution:
1. Verify Node.js and npm are installed:
node --version
npm --version
2. Install dependencies:
cd components/agentic_search
npm install
3. Verify wrangler is available:
npx wrangler --version
4. If you see .dev.vars errors, create the file with the required credentials:
cd components/agentic_search
cat > .dev.vars << EOF
GOOGLE_API_KEY=your-google-api-key
AGENTIC_AWS_ACCESS_KEY_ID=your-aws-access-key
AGENTIC_AWS_SECRET_ACCESS_KEY=your-aws-secret-key
EOF
Problem: agentic_search can't communicate with search_proxy
Symptoms:
- agentic_search starts successfully on port 9007
- API calls to agentic_search fail with service binding errors
- Logs show "SEARCH_PROXY_WORKER is not defined" or similar
Cause:
In local development mode, Cloudflare service bindings don't work. agentic_search uses SEARCH_PROXY_WORKER service binding in production, but needs HTTP fallback for local development.
Solution:
1. Verify SEARCH_PROXY_URL is set in the environment. The orchestrator sets this automatically:
# Check the orchestrator config sets this
SEARCH_PROXY_URL=http://localhost:9005
2. Restart agentic_search and confirm SEARCH_PROXY_URL is set.
3. If service binding errors persist, the HTTP fallback code may not be working correctly.
4. Check that wrangler.local.toml has the correct configuration:
[vars]
SEARCH_PROXY_URL = "http://localhost:9005"
Problem: Cloudflare Worker Durable Objects not persisting
Symptoms:
- Durable Object data (like conversation history in agentic_search) is lost between restarts
- Each request creates a new Durable Object instance
Cause: By default, wrangler dev does not persist Durable Object state. State is stored in memory and lost when the worker restarts.
Solution:
1. Use the --persist flag when running wrangler dev:
npx wrangler dev --config wrangler.local.toml --port 9007 --persist
2. Persisted state is written to the .wrangler/state/ directory
3. Note: The orchestrator doesn't add --persist by default. You can:
- Stop the orchestrator and run agentic_search manually with --persist
- Or accept that state will be lost on restart (usually fine for development)
Problem: AWS SSO token expired - services fail to validate tables
Symptoms:
- admin_server logs show: "Could not validate X table: Error when retrieving token from sso: Token has expired"
- ecom services start but show warning messages
- DynamoDB operations fail with authentication errors
Cause: E-commerce services (admin_server, ecom_indexer, etc.) need AWS credentials to access DynamoDB tables. When using AWS SSO, the temporary credentials expire and need to be refreshed.
Solution:
1. Refresh your AWS SSO credentials:
aws sso login --profile your-profile-name
2. Make sure the profile is exported in the shell running hippodrome:
export AWS_PROFILE=your-profile-name
Prevention:
- Refresh SSO credentials before starting hippodrome
- Consider using long-lived credentials for local development (in ~/.aws/credentials)
Problem: Multiple Cloudflare Workers conflict on debug port
Symptoms:
- Error: "Address already in use" for port 9229
- The second Cloudflare Worker fails to start
- agentic_search or search_proxy won't start when both are running
Cause: Wrangler's Node.js debugger uses port 9229 by default. When running multiple workers, they conflict on this port.
Solution:
1. The workers should use different inspector ports. Check if --inspector-port is set in the commands.
2. If you see this error, you can disable the inspector for one worker:
# Run agentic_search without inspector
npx wrangler dev --config wrangler.local.toml --port 9007 --inspector-port 9230
3. Or free port 9229 by killing whatever holds it:
lsof -ti :9229 | xargs kill -9
Problem: KV namespace errors in Cloudflare Workers
Symptoms:
- Error: "KV namespace binding not found"
- search_proxy or agentic_search fails to read/write KV data
- 500 errors from workers when accessing cached data
Cause: KV namespaces in wrangler.local.toml use placeholder IDs that wrangler dev accepts, but the local KV storage starts empty.
Solution:
1. For basic functionality, the workers should handle missing KV data gracefully
2. If you need specific KV data for testing, use the wrangler CLI to populate local KV:
cd components/search_proxy
npx wrangler kv:key put --binding SEARCH_PROXY_KV "key-name" "value" --local
3. Local KV data is stored under .wrangler/state/ and can be cleared:
rm -rf components/search_proxy/.wrangler/state/
rm -rf components/agentic_search/.wrangler/state/
Problem: EventBridge webhook returns "No route configured"
Symptoms:
- POST to /webhook/eventbridge returns {"status": "ignored", "reason": "No route configured for detail-type: ..."}
- Events are not forwarded to local services
Cause:
The detail-type in the event doesn't match any configured route prefix. Routes are configured for EcomIndexSettings.* and Merchandising.* only.
Solution:
1. Verify your event has the correct detail-type field:
- For index settings: EcomIndexSettings.MODIFY, EcomIndexSettings.INSERT, EcomIndexSettings.REMOVE
- For merchandising: Merchandising.MODIFY, Merchandising.INSERT, Merchandising.REMOVE
2. Check your event format matches the expected structure:
{
"source": "marqo.dynamodb",
"detail-type": "EcomIndexSettings.MODIFY",
"detail": { ... }
}
3. To forward additional event types, add a route prefix to EVENTBRIDGE_ROUTES in dashboard.py
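The prefix-based routing described above can be sketched like this. The prefixes and target ports come from this guide; the real EVENTBRIDGE_ROUTES structure in dashboard.py may differ:

```python
from typing import Optional

# Route table: detail-type prefix -> local forwarding target.
EVENTBRIDGE_ROUTES = {
    "EcomIndexSettings.": "http://localhost:9010",  # ecom_settings_exporter
    "Merchandising.": "http://localhost:9011",      # merchandising_exporter
}


def route_for(detail_type: str) -> Optional[str]:
    """Return the forwarding target for an event, or None if unrouted."""
    for prefix, target in EVENTBRIDGE_ROUTES.items():
        if detail_type.startswith(prefix):
            return target
    return None
```

An event whose detail-type matches no prefix gets the "No route configured" response.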
Problem: EventBridge webhook returns service unavailable
Symptoms:
- POST to /webhook/eventbridge returns status 503
- Response contains {"error": "Service unavailable: ..."}
- Logs show "Failed to forward event to localhost:9010"
Cause: The target service (ecom_settings_exporter or merchandising_exporter) is not running on the expected port.
Solution:
1. Verify the target service is running with the ecom profile:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom
2. Check the service health endpoints:
# For ecom_settings_exporter
curl http://localhost:9010/healthz
# For merchandising_exporter
curl http://localhost:9011/healthz
Problem: EventBridge webhook returns 400 - missing detail-type
Symptoms:
- POST to /webhook/eventbridge returns {"error": "Missing 'detail-type' field in event"}
- HTTP status 400
Cause:
The event JSON is missing the required detail-type field.
Solution:
1. Ensure your event includes detail-type (note the hyphen, not underscore):
{
"source": "marqo.dynamodb",
"detail-type": "EcomIndexSettings.MODIFY",
"detail": { ... }
}
2. A common mistake is sending detailType or detail_type instead of detail-type
3. Verify JSON is valid:
echo '{"detail-type": "test"}' | jq .
Problem: Events not reaching local services from AWS
Symptoms:
- Events sent from AWS EventBridge don't reach the local webhook
- Manual webhook testing works fine
- No logs showing incoming events
Cause: The tunnel or EventBridge rule may not be set up correctly, or AWS credentials may be missing.
Solution:
1. Use the --enable-events flag to automatically set up tunneling and EventBridge rules:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --enable-events
2. Verify AWS credentials are available - run aws sso login if using AWS SSO
3. Check the startup output for:
- [tunnel] EventBridge webhook exposed at ... - confirms tunnel is working
- [eventbridge] Rule '...' created for ... - confirms rule was created
4. If rule creation fails but tunnel works, you can manually create a rule:
- Copy the tunnel URL from the startup output
- Create an EventBridge rule in the AWS Console targeting that URL
5. For manual testing without --enable-events, use curl:
curl -X POST http://localhost:9000/webhook/eventbridge \
-H "Content-Type: application/json" \
-d '{"detail-type": "EcomIndexSettings.MODIFY", "detail": {...}}'
Problem: EventBridge rule creation fails
Symptoms:
- Tunnel starts successfully
- [eventbridge] Failed to create EventBridge rule message appears
- Events are not forwarded from AWS
Cause: Missing AWS credentials or insufficient IAM permissions.
Solution:
1. Ensure AWS credentials are configured:
aws sts get-caller-identity # Should show your identity
aws sso login # If using AWS SSO
2. Ensure your AWS identity has the required EventBridge and IAM permissions:
- events:PutRule, events:PutTargets, events:DeleteRule, events:RemoveTargets
- events:CreateConnection, events:DeleteConnection, events:DescribeConnection
- events:CreateApiDestination, events:DeleteApiDestination, events:DescribeApiDestination
- iam:CreateRole, iam:PutRolePolicy, iam:GetRole (for first-time setup)
3. The orchestrator will still work without the rule - you can test manually with curl
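The permissions listed above can be collected into a single IAM policy statement. This is a sketch; in practice, scope the Resource down rather than using "*":

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "events:PutRule", "events:PutTargets", "events:DeleteRule", "events:RemoveTargets",
        "events:CreateConnection", "events:DeleteConnection", "events:DescribeConnection",
        "events:CreateApiDestination", "events:DeleteApiDestination", "events:DescribeApiDestination",
        "iam:CreateRole", "iam:PutRolePolicy", "iam:GetRole"
      ],
      "Resource": "*"
    }
  ]
}
```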
Problem: EventBridge Pipes not running or missing
Symptoms:
- Validation script shows [MISSING] or [WARN] for pipes
- EventBridge events never reach the custom event buses
- No events forwarded to local services even with --enable-events
Cause: The CDK infrastructure (EventBridge Pipes) hasn't been deployed or failed to deploy.
Solution:
1. Run the validation script to check infrastructure status:
python components/hippodrome/scripts/validate_eventbridge.py --table-prefix dev-main- --verbose
2. If pipes are missing, deploy the CDK stacks:
# Deploy ecom stack (creates EcomEventBus and IndexSettingsEventPipe)
cd infra/ecom
npx cdk deploy --require-approval never
# Deploy controller stack (creates MerchandisingEventBus and MerchandisingEventPipe)
cd ../controller
npx cdk deploy MerchandisingStack --require-approval never
Problem: EventBridge events not appearing in console
Symptoms:
- EventBridge Pipes are RUNNING
- The local webhook never receives events
- No activity in EventBridge console metrics
Cause: The custom event buses may not have any subscribers, or the DynamoDB table isn't receiving writes.
Solution:
1. Verify the event bus exists and has the correct name format:
- For table prefix dev-main-, event buses should be:
- dev-main-EcomEventBus
- dev-main-MerchandisingEventBus
2. Make a test write to trigger an event:
aws dynamodb put-item \
--table-name dev-main-EcomIndexSettingsTable \
--item '{"pk": {"S": "test"}, "sk": {"S": "INDEX#test"}}'
3. Check the EventBridge rule's MatchedEvents and FailedInvocations metrics
4. Enable EventBridge archive/logging for debugging:
- In EventBridge console, create an archive rule to capture all events
- This helps identify if events are reaching the bus
Problem: Pants resolve conflicts when starting Lambda services
Symptoms:
- Error: "Targets that are in different resolves cannot be mixed"
- Lambda wrappers fail to build with dependency resolution errors
- Services like ecom_indexer, ecom_settings_exporter, merchandising_exporter fail to start
Cause: Lambda services (ecom_indexer, ecom_settings_exporter, merchandising_exporter) each have their own Pants resolve. If hippodrome tries to run wrapper scripts that import from these services while using a different resolve, the build fails.
Solution:
The Lambda services should be started using their native :local targets in their respective directories, not through wrapper scripts in hippodrome:
- //components/ecom_indexer:local
- //components/ecom_settings_exporter:local
- //components/merchandising_exporter:local
- //components/shopify/admin_server:local
The config.py's create_python_lambda_service function should use f"//components/{name}:local" format.
Problem: ecom_settings_exporter fails with "No module named 'uvicorn'"
Symptoms:
- ecom_settings_exporter service exits immediately with ModuleNotFoundError: No module named 'uvicorn'
- Other Lambda services start fine
Cause:
The 3rdparty/python/ecom_settings_exporter.lock lockfile doesn't include fastapi and uvicorn packages which are needed for the local development HTTP wrapper.
Solution:
1. Regenerate the lockfile to include the local development dependencies:
pants generate-lockfiles --resolve=ecom_settings_exporter
2. Verify the lockfile now includes uvicorn:
grep uvicorn 3rdparty/python/ecom_settings_exporter.lock
Problem: API key authentication fails with "Invalid API key"
Symptoms:
- Requests to admin_server return 401 "Invalid API key"
- admin_server logs show: "API key encryption secret not available, cannot decrypt API key"
- admin_server logs show: "Failed to extract system account ID from API key"
Cause:
Marqo API keys are encrypted. The admin_server needs access to the encryption secret (either from AWS Secrets Manager or via MARQO_API_KEY_SECRET environment variable) to decrypt them. In local development, AWS Secrets Manager access may not be available.
Solution:
1. Ensure you have valid AWS credentials:
aws sso login --profile staging
export AWS_PROFILE=staging
2. The orchestrator fetches the encryption secret from AWS Secrets Manager (dev/api_key_encryption_key_secret) and sets it as MARQO_API_KEY_SECRET.
3. If AWS access fails, you can manually set the secret:
export MARQO_API_KEY_SECRET=$(aws secretsmanager get-secret-value --secret-id dev/api_key_encryption_key_secret --query 'SecretString' --output text)
Problem: API key validation fails with KeyError for cell ID
Symptoms:
- Requests to admin_server return 500 error
- Error message: KeyError: 'S' or KeyError: 'D'
- Traceback shows error in ControlPlaneGateway looking up cell in DATA_PLANE_CELLS
Cause:
The API key was created for a different cell (e.g., "S" for staging) but hippodrome's DATA_PLANE_CELLS only contains the "local" cell configuration. The admin_server tries to validate the API key against the cell specified in the key, but that cell isn't configured.
Solution:
The config.py's get_data_plane_cells_json function should include staging cell configuration when using local cell, so that API keys created for staging can be validated:
# In config.py - get_data_plane_cells_json should include staging config for local cell
if cell == Cell.LOCAL:
staging_config = CELL_CONFIGS[Cell.STAGING]
cells_config[staging_config.id] = {
"aws_region": staging_config.region,
"gateway_id": staging_config.gateway_id,
}
This allows staging API keys to be validated against the staging cell while other operations use fake_cell.
Problem: E2E tests fail with ResourceNotFoundException for DynamoDB tables
Symptoms:
- Tests return 500 error with ResourceNotFoundException when calling PutItem
- Error mentions table like dev-hippodrome-EcomIndexSettingsTable not found
- admin_server logs show DynamoDB table not found
Cause:
Hippodrome uses a table prefix based on the git branch (e.g., dev-hippodrome-). These DynamoDB tables don't exist in AWS - they must either be created, or an existing table prefix must be used.
Solution:
1. Use an existing table prefix by specifying --table-prefix when starting hippodrome:
PANTS_CONCURRENT=True pants run //components/hippodrome/hippodrome/cli.py -- start --profile ecom --table-prefix dev-main-
2. Or create the tables for your branch prefix by deploying the CDK stack:
cd infra/ecom
npx cdk deploy --context tablePrefix=dev-hippodrome-
3. Make sure the prefix also matches the e2e tests' configuration (.env.dev)
Note: Running e2e tests against hippodrome requires DynamoDB tables to exist with the matching prefix. The fake_cell handles cell operations but DynamoDB tables are always in AWS.
Problem: E2E tests require specific environment configuration
Symptoms:
- Tests fail with various errors when run against hippodrome
- Errors about missing environment variables or configuration
Cause: E2E tests are designed to run against deployed AWS infrastructure. Running them against hippodrome requires specific configuration.
Solution: Run e2e tests with these environment variables:
ENVIRONMENT=dev \
ENV_BRANCH=hippodrome \
ADMIN_SERVER_BASE_URL=http://localhost:9004 \
SEARCH_PROXY_URL=http://localhost:9005 \
AWS_PROFILE=staging \
pants test //components/shopify/e2e_tests/e2e_tests/tests/ecom_onboarding_test.py -- -v
Key variables:
- ENVIRONMENT=dev - Tells tests to use dev environment
- ENV_BRANCH - Must match your hippodrome table prefix (default is git branch)
- ADMIN_SERVER_BASE_URL - Points to local admin_server (port 9004)
- SEARCH_PROXY_URL - Points to local search_proxy (port 9005)
- AWS_PROFILE - AWS profile with access to DynamoDB tables
Problem: Marqo API calls return 401 "Invalid API Key" despite successful decryption
Symptoms:
- admin_server logs show: "HTTP error fetching indexes: 401 - {\"error\":\"Invalid API Key.\"}"
- The encryption secret is retrieved successfully (from env var MARQO_API_KEY_SECRET)
- API key decryption appears to work (can decrypt to get system_account_id, cell, token)
- But calls to Marqo API fail with 401
Cause:
The admin_server stores encrypted API keys in the ShopifyApiKeyRecord.marqo_api_key field. When calling Marqo, it passes this encrypted key directly to MarqoClient, which sends it in the x-api-key header. Marqo expects the decrypted token, not the encrypted key.
The flow is:
1. User sends encrypted API key as Bearer token
2. Admin_server decrypts it to get {system_account_id, cell, token}
3. Admin_server stores the ENCRYPTED key in DynamoDB (marqo_api_key field)
4. Later, when calling Marqo, it retrieves api_key_record.marqo_api_key (encrypted)
5. Passes encrypted key to MarqoClient(api_key=encrypted_key)
6. Marqo receives encrypted key and rejects it with 401
Solution: This is a design issue in the admin_server. It doesn't manifest against the deployed admin_server, where the encrypted key in the database matches what Marqo expects (there may be additional infrastructure handling this).
For now, running e2e tests against hippodrome requires either:
1. Running against a deployed dev environment (using the AWS-hosted admin_server) instead of the local admin_server
2. Or modifying the admin_server code to decrypt the stored key before calling Marqo
To run e2e tests against deployed infrastructure:
# Don't set ADMIN_SERVER_BASE_URL - let it default to deployed AWS environment
ENVIRONMENT=dev \
ENV_BRANCH=admin-api \
pants test //components/shopify/e2e_tests/e2e_tests/tests:e2e_tests -- -k ecom_onboarding -v
Adding New Entries
When you solve a new problem, add an entry following this format:
## Problem: [Brief description]
**Symptoms:**
- [What error messages you see]
- [What behavior you observe]
**Cause:**
[Why this happens]
**Solution:**
[Step-by-step instructions to fix it]
---
Place new entries in a logical order (common problems first, edge cases later).