Runbook: log_investigation_with_claude
This runbook covers how to investigate logs and metrics during an incident using the local
Grafana MCP tool (tools/local-grafana-mcp.sh) driven by Claude Code. It is aimed at any
on-call engineer, including non–data-plane engineers, who needs to look through logs but is
not sure where to start.
The tool spins up a local Grafana wired to this environment's AMP (Prometheus), Athena, and
CloudWatch datasources, then launches Claude with a Grafana MCP server. The
marqo-log-investigation skill loads automatically and supplies the architecture map — which
datasource and AWS account holds each component's logs.
Prerequisites
- Docker Desktop running
- AWS CLI v2 with SSO access to the affected environment's account
- claude CLI on your PATH (claude --version)
- jq, curl
- A local clone of cloud_data_plane
- For Athena log queries: your SSO profile must be able to write Athena query results to the workgroup output bucket (s3:PutObject). The ReadOnlyAccess permission set is granted this on the *-multitenantekscluster-*-results and athena-query-output-prod buckets. If queries fail with "Access denied when writing to location", see Troubleshooting.
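The command-line prerequisites can be checked before launching. The following is a minimal sketch, not part of the tool itself: missing_tools is a hypothetical helper name, and the check only confirms that each command is on your PATH, not that its version is suitable.

```shell
# Print the name of every command in the argument list that is NOT on PATH.
# (missing_tools is a hypothetical helper, not something the tool provides.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (run manually before launching the tool):
#   missing_tools docker aws claude jq curl
#   docker info >/dev/null 2>&1 || echo "Docker daemon is not running"
#   aws --version    # should report aws-cli/2.x
```

An empty result from missing_tools means every listed command was found. The docker info check is separate because Docker can be installed without the daemon running.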
Steps
1. Log into the affected environment's AWS account
aws sso login --profile <env-profile>
Data-plane (cell) accounts the tool attaches to:
| Environment | Cell AWS account |
|---|---|
| staging / dev | 468036072962 |
| preprod | 339712831429 |
| prod (prod1 / prod2) | 651774330118 |
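To confirm that your SSO session landed in the account you intended, you can compare your caller identity against the table above. This is a sketch under assumptions: env_for_account is a hypothetical helper, and it is not how the tool itself detects the environment.

```shell
# Map a cell AWS account ID to its environment, using the table above.
# (env_for_account is a hypothetical helper written for illustration.)
env_for_account() {
  case "$1" in
    468036072962) echo "staging / dev" ;;
    339712831429) echo "preprod" ;;
    651774330118) echo "prod (prod1 / prod2)" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

# Usage (requires a valid SSO session; <env-profile> is your profile name):
#   account_id=$(aws sts get-caller-identity --profile <env-profile> \
#                  --query Account --output text)
#   env_for_account "$account_id"
```

If the result is "unknown", the profile most likely points at a non-cell account, and the account list should be re-checked before launching.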
2. Launch the tool
From your cloud_data_plane checkout:
./tools/local-grafana-mcp.sh --agent claude
It auto-detects the environment from your AWS account, provisions Grafana on
http://localhost:3000, and launches Claude with the Grafana MCP server. The first run builds
a one-time local image with the Athena plugin (slower); later runs are faster.
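If you are unsure whether the local Grafana came up, its standard health endpoint can be queried from a second terminal. This is a minimal sketch: grafana_health_ok is a hypothetical helper, and it assumes the provisioned instance exposes Grafana's usual /api/health endpoint without authentication.

```shell
# Succeed if the JSON on stdin reports a healthy Grafana database.
# (grafana_health_ok is a hypothetical helper; it only pattern-matches the
# "database": "ok" field of the health response.)
grafana_health_ok() {
  grep -q '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage (in a second terminal, once the tool has started):
#   curl -fsS http://localhost:3000/api/health | grafana_health_ok \
#     && echo "Grafana is up"
```

A failure here during the first run may only mean the one-time image build has not finished yet.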
3. Ask where to start
The marqo-log-investigation skill loads automatically. Open with a where-to-start prompt and
let it route you to the right datasource/account. Examples:
- The ecom search endpoint is throwing 5xx. Where do I start: which logs and which datasource?
- Index <name> for account <id> is unreachable. Walk me through the data-plane signals and pull the relevant pod logs from Athena.
- Latency spike on prod2. Find the right Prometheus metrics and the matching dashboard.
- Is this a data-plane or control-plane problem? Symptom: <describe>. Tell me which account and store to look in.
4. Tear down
Press Ctrl-C or exit Claude. The tool removes the container, temporary credentials, and generated config automatically.
What the tool can and cannot reach
| Reachable via the tool | NOT reachable — use something else |
|---|---|
| Data-plane metrics (Prometheus/AMP) | Live pod state / crashloops → kubectl |
| Data-plane pod & host logs (Athena parquet) | Control-plane logs in a different account (prod/preprod) → AWS CLI |
| Cloudflare edge worker logs (Athena) | Any CloudWatch Logs → AWS CLI Logs Insights |
| This cell account's CloudWatch metrics | ecom service logs (metrics are visible; logs in the mgmt account) |
The skill explains the boundary in detail. For deeper architecture and per-store query
templates, see .claude/skills/marqo-log-investigation/SKILL.md and tools/README.md.
Caveats
- Credentials expire (~1 hour). They are baked into the Grafana datasources from your SSO session. If queries start failing with auth errors, exit and re-run the script to refresh.
- Live pod logs (crashloops, --previous) need kubectl, not this tool. Fluent Bit ships pod logs to Athena, but with a delay.
- The MCP service-account token is a Grafana Viewer, so the agent is read-only and cannot modify dashboards or alerts.
Troubleshooting
- "Access denied when writing to location: s3://...": your SSO profile can start Athena queries and read data but cannot write query results to the workgroup's output bucket. Athena writes results using your credentials, so the profile needs s3:PutObject (plus s3:AbortMultipartUpload) on that bucket. Confirm the ReadOnlyAccess permission set grants it for the *-multitenantekscluster-*-results buckets.
- For other startup/datasource errors, see the Troubleshooting section of tools/README.md.
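When chasing the Athena write error, it helps to know exactly which bucket the query tried to write to. The sketch below pulls the bucket name out of the error text. bucket_from_error is a hypothetical helper, the bucket name in the example is made up, and <workgroup> is a placeholder for your Athena workgroup name.

```shell
# Print the bucket name from the s3:// URI in an error message.
# Prints nothing if the message contains no s3:// URI.
# (bucket_from_error is a hypothetical helper written for illustration.)
bucket_from_error() {
  printf '%s\n' "$1" | sed -n 's|.*s3://\([^/ ]*\).*|\1|p'
}

# Usage:
#   bucket_from_error "Access denied when writing to location: s3://example-results/prefix/q.csv"
#
# To cross-check against the workgroup's configured output location:
#   aws athena get-work-group --work-group <workgroup> --profile <env-profile> \
#     --query 'WorkGroup.Configuration.ResultConfiguration.OutputLocation' \
#     --output text
```

If the bucket in the error does not match one of the bucket patterns that the ReadOnlyAccess permission set covers, that mismatch is the likely cause.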