Polo v2 Documentation
North Star
Every feature in Polo exists to answer one question: Why are costs the way they are?
Not "what do things cost" — the team roughly knows that. The problem is that when costs change, there's uncertainty about why. Someone says "costs are up $3,000 this week" and the response is hand-waving: "probably that performance test, or maybe a customer scaled up." That uncertainty is the gap through which cost waste creeps in unchallenged.
Polo eliminates that uncertainty. For any time range, for any cost delta, Polo decomposes the change into concrete, attributable causes: this customer's index scaled up (+$800), this test cluster ran 3 days longer than expected (+$1,200), these orphaned volumes accumulated (+$400), NAT traffic spiked because of a deployment bug (+$600).
The two fundamental interaction patterns
- "Costs are up by $X. Show me why." Start from a delta, drill into what caused it. Click the spike -> see which hierarchy nodes moved -> see which resources changed -> see what event caused it. This drill-down path must be frictionless.
- "Explain costs across this time range." Decompose total cost into a tree of attributable causes with evidence. Every dollar traces to a reason. Dollars without reasons are the highest-priority items to investigate.
When making design or prioritisation decisions, ask: "does this help answer why?" If yes, do it well. If no, do it minimally or defer it.
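The second interaction pattern can be illustrated with a minimal Python sketch. The `Cause` record and the `explain_delta` function are hypothetical names chosen for illustration and are not part of Polo's API; the figures are the $3,000 example from the text above. The sketch ranks attributed causes by impact and reports whatever part of the delta has no reason attached.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    """One attributable reason for a cost change (hypothetical shape)."""
    description: str
    delta_usd: float


def explain_delta(total_delta_usd: float, causes: list[Cause]) -> tuple[list[Cause], float]:
    """Return causes sorted by absolute impact, plus the unexplained remainder."""
    ranked = sorted(causes, key=lambda c: abs(c.delta_usd), reverse=True)
    unexplained = total_delta_usd - sum(c.delta_usd for c in ranked)
    return ranked, unexplained


# Figures taken from the example above: a $3,000 weekly increase.
causes = [
    Cause("customer index scaled up", 800.0),
    Cause("test cluster ran 3 days longer than expected", 1200.0),
    Cause("orphaned volumes accumulated", 400.0),
    Cause("NAT traffic spike from a deployment bug", 600.0),
]
ranked, unexplained = explain_delta(3000.0, causes)
for cause in ranked:
    print(f"{cause.delta_usd:+10.2f}  {cause.description}")
print(f"{unexplained:+10.2f}  unexplained (investigate first)")
```

A non-zero remainder is the signal that matters: it is the portion of the change that nobody can yet account for.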
What Polo Is
Polo v2 is an infrastructure observability and cost analytics platform for Marqo Cloud.
Currently implemented:
- Collects configuration, cost, lifecycle, tag, and relationship data from AWS accounts
- Stores everything in ClickHouse as a unified event stream
- Serves a React SPA on Cloudflare Workers, protected by Cloudflare Access
- Provides cost-by-customer, cost-by-account-role, hierarchy tree navigation, and resource browsing
- Decomposes cost deltas with drill-down (hierarchy nodes -> resources -> events)
- Offers a hierarchy explorer with an expandable tree and cost trend chart
- Uses the shadcn/ui component system with CSS variable theming (system/light/dark mode)
- Includes a Playwright e2e test suite (26 tests, 14 of them screenshot tests)
Planned (not yet implemented):
- Multi-account collection from 10-20+ AWS accounts (currently single-account)
- Collection from GitHub, Cloudflare, and Datadog
- Budget enforcement, anomaly detection, and actionable remediation
- Savings tracking from actions taken
- CUR v2 as primary cost data source (currently using Cost Explorer API)
Implementation Status
| Layer | Status | Details |
|---|---|---|
| Schema | 10 migrations | resource_events, resource_snapshots, cost_hourly + MV, resource_relationships, hierarchy_nodes, resource_ancestry, hierarchy_dict, cost_rollup_daily |
| Collectors | 9 implemented | config_ec2, config_ebs, config_network, cost_explorer, lifecycle, tags, relationships, snapshot_builder, hierarchy_builder |
| API | 14 endpoints | Cost (4), Delta (3), Resources (4), Hierarchy (2), Health (1) — all under /api/v1/ |
| UI | 5 pages | Dashboard, Cost Delta, Costs, Resources, Hierarchy — shadcn/ui components, Lucide icons |
| Tests | 26 e2e tests | Playwright smoke tests (12) + screenshot tests (14, dark/light) |
| Infrastructure | AWS CDK | Python CDK stacks in infra/legacy/ |
Documentation Map
Architecture — How the system is built
| Document | Contents |
|---|---|
| overview | Design principles, why ClickHouse, system-level architecture |
| schema | ClickHouse table DDL, engine choices, rationale (implemented + planned tables) |
| hierarchy | Resource hierarchy model, closure table, tag inheritance, metadata dictionaries |
| collection | Collector inventory, normalisation layer, ingestion pipeline |
| delta-decomposition | The core query pattern — "why are costs different?" SQL queries and UI flow |
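The closure-table idea listed for the hierarchy document can be sketched in a few lines of Python. This is an illustration only: the `build_ancestry` function, the `(ancestor, descendant, depth)` row shape, and the node IDs are assumptions, not Polo's actual `resource_ancestry` schema. Each node gets one row per ancestor, so "everything under this customer" becomes a single filter instead of a recursive walk.

```python
from __future__ import annotations


def build_ancestry(parent_of: dict[str, str | None]) -> list[tuple[str, str, int]]:
    """Expand child -> parent links into (ancestor, descendant, depth) rows.

    Every node also gets a depth-0 row pointing at itself, which is the usual
    closure-table convention.
    """
    rows: list[tuple[str, str, int]] = []
    for node in parent_of:
        depth = 0
        current: str | None = node
        seen: set[str] = set()
        while current is not None:
            if current in seen:
                raise ValueError(f"cycle detected at {current!r}")
            seen.add(current)
            rows.append((current, node, depth))
            current = parent_of.get(current)
            depth += 1
    return rows


# Hypothetical IDs following the logical chain resource -> index -> cluster -> customer.
parent_of = {
    "customer:acme": None,
    "cluster:c1": "customer:acme",
    "index:i1": "cluster:c1",
    "instance:i-0abc": "index:i1",
}
rows = build_ancestry(parent_of)
under_customer = sorted(d for a, d, depth in rows if a == "customer:acme" and depth > 0)
print(under_customer)
```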
Features — Design specs for planned features
These docs describe features that are not yet implemented. They serve as design specifications for future development. See features/README.md for the implementation priority order.
| Document | Contents | Priority |
|---|---|---|
| actions | Remediation system: action types, safety constraints, execution, savings | High |
| budget-rules | Zero-based budgeting: rule types, violation detection, default rules | High |
| anomalies | Anomaly detection (z-score), Slack notifications, weekly digest | High |
| multi-account | Account discovery, IAM role provisioning, coverage monitoring | High |
| multi-platform | GitHub, Cloudflare, Datadog cost collection | Medium |
| forecasting | Daily snapshots, changelog, cost forecasting | Medium |
| shared-costs | Allocation rules for shared infrastructure costs | Medium |
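For orientation, the z-score approach named in the anomalies row can be sketched as below. The feature is not implemented, so everything here is an assumption: the function names, the use of a trailing daily-cost window, and the threshold of 3.0 are illustrative choices, not the design in the anomalies document.

```python
from __future__ import annotations

import math
from statistics import mean, stdev


def zscore(history: list[float], latest: float) -> float:
    """Z-score of `latest` against the sample mean and stdev of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    diff = latest - mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any deviation at all is infinitely unusual.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / sigma


def is_anomaly(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` when it sits more than `threshold` deviations from the mean."""
    return abs(zscore(history, latest)) > threshold


# Hypothetical daily costs (USD) for one hierarchy node, followed by a spike.
daily_costs = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 100.0]
print(is_anomaly(daily_costs, 130.0))
print(is_anomaly(daily_costs, 101.0))
```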
Components — Design and decisions per component
| Document | Contents |
|---|---|
| cli | polo CLI: design decisions, command reference, adding commands |
Development — How to build and test
| Document | Contents |
|---|---|
| getting-started | Developer onboarding: setup, run, test |
| project-structure | Directory layout, conventions, build system |
| testing | Testing strategy: unit -> integration -> performance -> e2e, phase checkpoints |
| api | Cloudflare Worker API: endpoints, query pattern, auth |
| ui | React SPA: routing, components, hooks |
Operations — How to deploy and run
| Document | Contents |
|---|---|
| deployment | ClickHouse deployment, Cloudflare Workers, Lambda, IaC (CDK) |
| cur-setup | CUR v2 enablement step-by-step (planned — currently using Cost Explorer) |
| new-account | How to add a new AWS account (planned — automated discovery not yet built) |
| inspecting-aws | How to find what's deployed, check logs, metrics, and errors |
| monitoring | Self-monitoring: collector health, system status, alerting (planned) |
Historical design documents
The root-level files polo-data-architecture.md and polo-implementation-prompt.md are the original design documents that informed this implementation. The docs/ directory is the canonical, maintained reference.
Key Design Decisions
These are the settled decisions that shape the entire system. Don't revisit them unless something is fundamentally broken.
- ClickHouse as the primary store — columnar, fast analytical queries, Map type for schemaless properties, MergeTree for time-series.
- Everything is an event in `resource_events`: append-only, one table, unified envelope.
- Hierarchies are reified at ingest, metadata at query time: identity columns (account, cluster, index) stamped on events; mutable metadata (account role, customer tier) resolved via ClickHouse dictionaries.
- Physical resource parentage (snapshot->volume->instance) is a first-class hierarchy dimension alongside the Marqo logical hierarchy (resource->index->cluster->customer).
- Actions will execute via a separate Lambda with scoped IAM permissions — the Cloudflare Worker never holds AWS credentials for writes. (Planned)
- Savings will be measured conservatively — five basis categories, no assumption of perpetual savings. (Planned)
- CUR v2 is the planned primary cost data source (hourly granularity). Currently using Cost Explorer API as a daily source.
- EC2 instance rightsizing is deprioritised — the infra team is actively reshuffling architecture, so CPU-based recommendations would be noise right now.
- Pants monorepo: follows the cloud_control_plane pattern. Legacy code in `components/polo-legacy/` (Pants-ignored). Root `tasks.py` and `requirements.dev.txt` are deployment shims for CDK compatibility.
- Migration runner is incremental: tracks applied migrations in `polo._migrations`, only executes pending ones. Safe to re-run.
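The incremental-migration decision can be illustrated with a small, self-contained sketch. It is not Polo's runner: SQLite stands in for ClickHouse so the example runs anywhere, the `_migrations` table stands in for `polo._migrations`, and the `run_pending` function and migration names are hypothetical.

```python
from __future__ import annotations

import sqlite3


def run_pending(conn: sqlite3.Connection, migrations: list[tuple[str, str]]) -> list[str]:
    """Apply migrations not yet recorded in the tracking table; return their names.

    `migrations` is an ordered list of (name, sql) pairs.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS _migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM _migrations")}
    newly_applied: list[str] = []
    for name, sql in migrations:
        if name in applied:
            continue  # already executed on a previous run
        conn.execute(sql)
        conn.execute("INSERT INTO _migrations (name) VALUES (?)", (name,))
        conn.commit()
        newly_applied.append(name)
    return newly_applied


conn = sqlite3.connect(":memory:")
migrations = [
    ("001_resource_events", "CREATE TABLE resource_events (id TEXT, payload TEXT)"),
    ("002_cost_hourly", "CREATE TABLE cost_hourly (hour TEXT, cost_usd REAL)"),
]
first = run_pending(conn, migrations)
second = run_pending(conn, migrations)  # re-run: nothing is pending
print(first, second)
```

Recording each name immediately after its statement succeeds is what makes a re-run safe: a failure partway through leaves earlier migrations marked as applied and the failed one still pending.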
Current State & What to Work On
The core data pipeline, API, and UI are implemented and tested. Cost delta decomposition (the primary analytical feature) is live with 3 API endpoints and a drill-down UI. The hierarchy explorer provides tree navigation with cost trends. The UI uses a shadcn/ui component system with CSS variable theming (system/light/dark mode), Lucide icons, and a responsive sidebar. Playwright e2e tests cover all pages.
Recommended next steps (in priority order):
- Resource detail page + timeline: Complete the delta drill-down flow. The `useResourceEvents` hook and `/cost/delta/events` endpoint are ready; only the UI component is missing.
- Actions & remediation: Suggestion generation, preview/execute flow, savings tracking
- Multi-account collection — Account discovery, cross-account IAM roles, CUR v2
- Budget rules & anomaly detection — Zero-based budgeting, z-score anomalies, Slack notifications
- TanStack Router migration — Replace useState routing with URL-based routing for bookmarkable pages
Refer to development/testing.md for the phased checkpoint approach — phases 0-8 cover the implemented foundation, phases 9-14 cover the planned features.