
Runbook: log_investigation_with_claude

This runbook covers how to investigate logs and metrics during an incident using the local Grafana MCP tool (tools/local-grafana-mcp.sh) driven by Claude Code. It is aimed at any on-call engineer, including non–data-plane engineers, who needs to look through logs but is not sure where to start.

The tool spins up a local Grafana wired to this environment's AMP (Prometheus), Athena, and CloudWatch datasources, then launches Claude with a Grafana MCP server. The marqo-log-investigation skill loads automatically and supplies the architecture map — which datasource and AWS account holds each component's logs.

Prerequisites

  • Docker Desktop running
  • AWS CLI v2 with SSO access to the affected environment's account
  • claude CLI on your PATH (claude --version)
  • jq, curl
  • A local clone of cloud_data_plane
  • For Athena log queries: your SSO profile must be able to write Athena query results to the workgroup output bucket (s3:PutObject). The ReadOnlyAccess permission set is granted this on the *-multitenantekscluster-*-results and athena-query-output-prod buckets. If queries fail with Access denied when writing to location, see Troubleshooting.
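The CLI prerequisites above can be checked in one pass before an incident. The following is a minimal sketch, assuming the binaries are named docker, aws, claude, jq, and curl as in the list; missing_tools is a hypothetical helper and is not part of cloud_data_plane.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the CLI prerequisites listed above and report anything missing.
missing="$(missing_tools docker aws claude jq curl)"
if [ -n "$missing" ]; then
    printf 'Missing prerequisites:\n%s\n' "$missing" >&2
else
    echo "All CLI prerequisites found on PATH."
fi
```

This only confirms the binaries exist. It does not verify that Docker Desktop is running or that your SSO profile has the S3 permissions described above.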

Steps

1. Log into the affected environment's AWS account

aws sso login --profile <env-profile>

Data-plane (cell) accounts the tool attaches to:

Environment           Cell AWS account
staging / dev         468036072962
preprod               339712831429
prod (prod1 / prod2)  651774330118
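To confirm that the profile you logged in with points at the intended cell account, you can compare the account ID reported by AWS STS against the list above. This is a sketch only: env_for_account is a hypothetical helper, and <env-profile> stays a placeholder for your own profile name.

```shell
#!/bin/sh
# Hypothetical helper: map a cell account ID (from the list above) to its environment.
env_for_account() {
    case "$1" in
        468036072962) echo "staging / dev" ;;
        339712831429) echo "preprod" ;;
        651774330118) echo "prod" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Usage, after `aws sso login` (replace <env-profile> with your profile name):
#   account="$(aws sts get-caller-identity --profile <env-profile> --query Account --output text)"
#   env_for_account "$account"
```

An "unknown" result suggests the profile targets an account the tool does not attach to, such as a control-plane or management account.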

2. Launch the tool

From your cloud_data_plane checkout:

./tools/local-grafana-mcp.sh --agent claude

It auto-detects the environment from your AWS account, provisions Grafana on http://localhost:3000, and launches Claude with the Grafana MCP server. The first run is slower because it builds a one-time local image with the Athena plugin; later runs are faster.
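If you want to confirm from a second terminal that the local Grafana is answering, you can poll its health endpoint. The sketch below assumes the instance exposes Grafana's standard /api/health endpoint without authentication; wait_for_url is a hypothetical helper, not something the tool provides.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers with HTTP success or attempts run out.
# $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 2)
wait_for_url() {
    url="$1"
    tries="${2:-30}"
    delay="${3:-2}"
    while [ "$tries" -gt 0 ]; do
        # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
        if curl -fs -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage from a second terminal while the tool starts up:
#   wait_for_url http://localhost:3000/api/health && echo "Grafana is up"
```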

3. Ask where to start

The marqo-log-investigation skill loads automatically. Open with a where-to-start prompt and let it route you to the right datasource/account. Examples:

The ecom search endpoint is throwing 5xx. Where do I start — which logs and which datasource?
Index <name> for account <id> is unreachable. Walk me through the data-plane signals and pull the relevant pod logs from Athena.
Latency spike on prod2. Find the right Prometheus metrics and the matching dashboard.
Is this a data-plane or control-plane problem? Symptom: <describe>. Tell me which account and store to look in.

4. Tear down

Press Ctrl-C or exit Claude. The tool removes the container, temporary credentials, and generated config automatically.

What the tool can and cannot reach

Reachable via the tool                       NOT reachable (use something else)
Data-plane metrics (Prometheus/AMP)          Live pod state / crashloops → kubectl
Data-plane pod & host logs (Athena parquet)  Control-plane logs in a different account (prod/preprod) → AWS CLI
Cloudflare edge worker logs (Athena)         Any CloudWatch Logs → AWS CLI Logs Insights
This cell account's CloudWatch metrics       ecom service logs (metrics are visible; logs in the mgmt account)

The skill explains the boundary in detail. For deeper architecture and per-store query templates, see .claude/skills/marqo-log-investigation/SKILL.md and tools/README.md.
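For the CloudWatch Logs case, which the tool cannot reach, a Logs Insights query can be run from the AWS CLI. The following is a rough sketch: window_start is a hypothetical helper, and the profile, log group name, and query string are placeholders to replace with values for the account that actually holds the logs.

```shell
#!/bin/sh
# Hypothetical helper: epoch seconds for "N minutes before a given time".
# $1 = end time in epoch seconds, $2 = window length in minutes
window_start() {
    echo $(( $1 - $2 * 60 ))
}

# Usage sketch (profile, log group, and query string are placeholders):
#   end="$(date +%s)"
#   start="$(window_start "$end" 30)"
#   query_id="$(aws logs start-query --profile <profile> \
#       --log-group-name <log-group> \
#       --start-time "$start" --end-time "$end" \
#       --query-string 'fields @timestamp, @message | sort @timestamp desc | limit 50' \
#       --query queryId --output text)"
#   aws logs get-query-results --profile <profile> --query-id "$query_id"
```

Logs Insights queries run asynchronously, so get-query-results may report a Running status at first; re-run it until the status is Complete.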

Caveats

  • Credentials expire (~1 hour). They are baked into the Grafana datasources from your SSO session. If queries start failing with auth errors, exit and re-run the script to refresh.
  • Live pod logs (crashloops, --previous) need kubectl, not this tool. Fluent Bit ships pod logs to Athena, but with a delay.
  • The MCP service-account token is a Grafana Viewer — the agent is read-only and cannot modify dashboards or alerts.

Troubleshooting

  • Access denied when writing to location: s3://... means your SSO profile can start Athena queries and read data but cannot write query results to the workgroup's output bucket. Athena writes results using your credentials, so the profile needs s3:PutObject (plus s3:AbortMultipartUpload) on that bucket. Confirm the ReadOnlyAccess permission set grants it for the *-multitenantekscluster-*-results and athena-query-output-prod buckets.
  • For other startup/datasource errors, see the Troubleshooting section of tools/README.md.
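When chasing the Access denied error, it can help to see which bucket the Athena workgroup actually writes to, so you know which bucket to check permissions on. This is a sketch under assumptions: bucket_from_s3_uri is a hypothetical helper, and both the workgroup name and the profile are placeholders because this runbook does not name them.

```shell
#!/bin/sh
# Hypothetical helper: extract the bucket name from an s3://bucket/prefix URI.
bucket_from_s3_uri() {
    rest="${1#s3://}"      # drop the scheme
    echo "${rest%%/*}"     # keep everything before the first slash
}

# Usage sketch (workgroup and profile are placeholders):
#   location="$(aws athena get-work-group --profile <env-profile> \
#       --work-group <workgroup> \
#       --query 'WorkGroup.Configuration.ResultConfiguration.OutputLocation' \
#       --output text)"
#   bucket_from_s3_uri "$location"
```

If the printed bucket does not match one of the bucket patterns named above, that mismatch is a likely explanation for the write failure.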