Polo v2 Documentation

North Star

Every feature in Polo exists to answer one question: Why are costs the way they are?

Not "what do things cost" — the team roughly knows that. The problem is that when costs change, there's uncertainty about why. Someone says "costs are up $3,000 this week" and the response is hand-waving: "probably that performance test, or maybe a customer scaled up." That uncertainty is the gap through which cost waste creeps in unchallenged.

Polo eliminates that uncertainty. For any time range, for any cost delta, Polo decomposes the change into concrete, attributable causes: this customer's index scaled up (+$800), this test cluster ran 3 days longer than expected (+$1,200), these orphaned volumes accumulated (+$400), NAT traffic spiked because of a deployment bug (+$600).

The two fundamental interaction patterns

  1. "Costs are up by $X. Show me why." Start from a delta, drill into what caused it. Click the spike -> see which hierarchy nodes moved -> see which resources changed -> see what event caused it. This drill-down path must be frictionless.

  2. "Explain costs across this time range." Decompose total cost into a tree of attributable causes with evidence. Every dollar traces to a reason. Dollars without reasons are the highest-priority items to investigate.
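
As a rough illustration of both patterns, the sketch below takes a total delta and a list of attributed causes, ranks the causes by size, and reports whatever remains unexplained. It is a minimal Python sketch, not Polo's implementation: the `Cause` class and `decompose_delta` function are hypothetical names, and the figures reuse the $3,000 example from earlier on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    """One attributable reason for a cost change (hypothetical shape)."""
    description: str
    delta_usd: float


def decompose_delta(total_delta_usd: float, causes: list[Cause]) -> dict:
    """Split a cost delta into attributed causes plus an unexplained remainder."""
    attributed = sum(c.delta_usd for c in causes)
    unexplained = total_delta_usd - attributed
    # Largest movers first, so a drill-down starts with the biggest cause.
    ranked = sorted(causes, key=lambda c: abs(c.delta_usd), reverse=True)
    return {
        "causes": ranked,
        "attributed_usd": attributed,
        "unexplained_usd": unexplained,
    }


# Figures taken from the example earlier on this page.
causes = [
    Cause("customer index scaled up", 800.0),
    Cause("test cluster ran 3 days longer than expected", 1200.0),
    Cause("orphaned volumes accumulated", 400.0),
    Cause("NAT traffic spike from a deployment bug", 600.0),
]
result = decompose_delta(3000.0, causes)
```

If the observed delta were $3,500 instead, the same call would report $500 as unexplained, which is the amount to investigate first.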

When making design or prioritisation decisions, ask: "does this help answer why?" If yes, do it well. If no, do it minimally or defer it.


What Polo Is

Polo v2 is an infrastructure observability and cost analytics platform for Marqo Cloud.

Currently implemented:

  • Collects configuration, cost, lifecycle, tag, and relationship data from AWS accounts
  • Stores everything in ClickHouse as a unified event stream
  • Serves a React SPA on Cloudflare Workers, protected by Cloudflare Access
  • Provides cost-by-customer, cost-by-account-role, hierarchy tree navigation, and resource browsing
  • Cost delta decomposition with drill-down (hierarchy nodes -> resources -> events)
  • Hierarchy explorer with expandable tree and cost trend chart
  • shadcn/ui component system with CSS variable theming (system/light/dark mode)
  • Playwright e2e test suite (26 tests, 14 screenshots)

Planned (not yet implemented):

  • Multi-account collection from 10-20+ AWS accounts (currently single-account)
  • Collection from GitHub, Cloudflare, and Datadog
  • Budget enforcement, anomaly detection, and actionable remediation
  • Savings tracking from actions taken
  • CUR v2 as primary cost data source (currently using Cost Explorer API)

Implementation Status

| Layer | Status | Details |
| --- | --- | --- |
| Schema | 10 migrations | resource_events, resource_snapshots, cost_hourly + MV, resource_relationships, hierarchy_nodes, resource_ancestry, hierarchy_dict, cost_rollup_daily |
| Collectors | 9 implemented | config_ec2, config_ebs, config_network, cost_explorer, lifecycle, tags, relationships, snapshot_builder, hierarchy_builder |
| API | 14 endpoints | Cost (4), Delta (3), Resources (4), Hierarchy (2), Health (1); all under /api/v1/ |
| UI | 5 pages | Dashboard, Cost Delta, Costs, Resources, Hierarchy; shadcn/ui components, Lucide icons |
| Tests | 26 e2e tests | Playwright smoke tests (12) + screenshot tests (14, dark/light) |
| Infrastructure | AWS CDK | Python CDK stacks in infra/legacy/ |

Documentation Map

Architecture — How the system is built

| Document | Contents |
| --- | --- |
| overview | Design principles, why ClickHouse, system-level architecture |
| schema | ClickHouse table DDL, engine choices, rationale (implemented + planned tables) |
| hierarchy | Resource hierarchy model, closure table, tag inheritance, metadata dictionaries |
| collection | Collector inventory, normalisation layer, ingestion pipeline |
| delta-decomposition | The core query pattern ("why are costs different?"): SQL queries and UI flow |

Features — Design specs for planned features

These docs describe features that are not yet implemented. They serve as design specifications for future development. See features/README.md for the implementation priority order.

| Document | Contents | Priority |
| --- | --- | --- |
| actions | Remediation system: action types, safety constraints, execution, savings | High |
| budget-rules | Zero-based budgeting: rule types, violation detection, default rules | High |
| anomalies | Anomaly detection (z-score), Slack notifications, weekly digest | High |
| multi-account | Account discovery, IAM role provisioning, coverage monitoring | High |
| multi-platform | GitHub, Cloudflare, Datadog cost collection | Medium |
| forecasting | Daily snapshots, changelog, cost forecasting | Medium |
| shared-costs | Allocation rules for shared infrastructure costs | Medium |
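
The anomalies spec names z-score as its detection method. The following is a minimal Python sketch of that idea, assuming a trailing window of daily costs and a threshold of 3 standard deviations. The function names, the window, and the threshold are illustrative assumptions, not values taken from the spec.

```python
import math
from statistics import mean, stdev


def zscore(history: list[float], today: float) -> float:
    """Standard score of today's cost against a trailing window of daily costs."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        # Flat history: any change at all is infinitely many deviations away.
        return 0.0 if today == mu else math.copysign(math.inf, today - mu)
    return (today - mu) / sigma


def is_anomaly(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag a day whose cost is at least `threshold` deviations from the mean."""
    return abs(zscore(history, today)) >= threshold


# Hypothetical daily costs in USD for one hierarchy node.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
```

With this window, a day at $110 is flagged while a day at $101 is not. A production detector would also need to handle weekly seasonality and short histories, which this sketch ignores.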

Components — Design and decisions per component

| Document | Contents |
| --- | --- |
| cli | polo CLI: design decisions, command reference, adding commands |

Development — How to build and test

| Document | Contents |
| --- | --- |
| getting-started | Developer onboarding: setup, run, test |
| project-structure | Directory layout, conventions, build system |
| testing | Testing strategy: unit -> integration -> performance -> e2e, phase checkpoints |
| api | Cloudflare Worker API: endpoints, query pattern, auth |
| ui | React SPA: routing, components, hooks |

Operations — How to deploy and run

| Document | Contents |
| --- | --- |
| deployment | ClickHouse deployment, Cloudflare Workers, Lambda, IaC (CDK) |
| cur-setup | CUR v2 enablement step-by-step (planned; currently using Cost Explorer) |
| new-account | How to add a new AWS account (planned; automated discovery not yet built) |
| inspecting-aws | How to find what's deployed, check logs, metrics, and errors |
| monitoring | Self-monitoring: collector health, system status, alerting (planned) |

Historical design documents

The root-level files polo-data-architecture.md and polo-implementation-prompt.md are the original design documents that informed this implementation. The docs/ directory is the canonical, maintained reference.


Key Design Decisions

These are the settled decisions that shape the entire system. Don't revisit them unless something is fundamentally broken.

  1. ClickHouse as the primary store — columnar, fast analytical queries, Map type for schemaless properties, MergeTree for time-series.
  2. Everything is an event in resource_events — append-only, one table, unified envelope.
  3. Hierarchies are reified at ingest, metadata at query time — identity columns (account, cluster, index) stamped on events; mutable metadata (account role, customer tier) resolved via ClickHouse dictionaries.
  4. Physical resource parentage (snapshot->volume->instance) is a first-class hierarchy dimension alongside the Marqo logical hierarchy (resource->index->cluster->customer).
  5. Actions will execute via a separate Lambda with scoped IAM permissions — the Cloudflare Worker never holds AWS credentials for writes. (Planned)
  6. Savings will be measured conservatively — five basis categories, no assumption of perpetual savings. (Planned)
  7. CUR v2 is the planned primary cost data source (hourly granularity). Currently using Cost Explorer API as a daily source.
  8. EC2 instance rightsizing is deprioritised — the infra team is actively reshuffling architecture, so CPU-based recommendations would be noise right now.
  9. Pants monorepo — follows the cloud_control_plane pattern. Legacy code in components/polo-legacy/ (Pants-ignored). Root tasks.py and requirements.dev.txt are deployment shims for CDK compatibility.
  10. Migration runner is incremental — tracks applied migrations in polo._migrations, only executes pending ones. Safe to re-run.
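
Decisions 2 and 3 can be pictured with a small Python sketch. Only the account, cluster, and index identity columns come from the text above; the other `ResourceEvent` fields, the `ACCOUNT_ROLE` mapping (a plain dict standing in for a ClickHouse dictionary), and the sample values are all hypothetical. The real schema is described in the schema document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ResourceEvent:
    """Hypothetical envelope for one row in resource_events."""
    event_time: datetime
    event_type: str
    resource_id: str
    # Identity columns: stamped once at ingest and never rewritten.
    account: str
    cluster: str
    index: str
    # Schemaless properties (the role the ClickHouse Map type plays).
    properties: dict[str, str] = field(default_factory=dict)


# Stand-in for a ClickHouse dictionary: mutable metadata keyed by account.
ACCOUNT_ROLE = {"111111111111": "production"}


def account_role(event: ResourceEvent) -> str:
    """Resolve mutable metadata at query time instead of storing it on the event."""
    return ACCOUNT_ROLE.get(event.account, "unknown")


event = ResourceEvent(
    event_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    event_type="config",
    resource_id="vol-0123",
    account="111111111111",
    cluster="cluster-a",
    index="index-1",
    properties={"size_gb": "100"},
)
```

Changing an account's role updates one dictionary entry and takes effect for every past event, while the identity columns on stored events stay untouched.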

Current State & What to Work On

The core data pipeline, API, and UI are implemented and tested. Cost delta decomposition (the primary analytical feature) is live with 3 API endpoints and a drill-down UI. The hierarchy explorer provides tree navigation with cost trends. The UI uses a shadcn/ui component system with CSS variable theming (system/light/dark mode), Lucide icons, and a responsive sidebar. Playwright e2e tests cover all pages.

Recommended next steps (in priority order):

  1. Resource detail page + timeline: Complete the delta drill-down flow. The useResourceEvents hook and the /cost/delta/events endpoint are ready; only the UI component is missing.
  2. Actions & remediation: Suggestion generation, preview/execute flow, savings tracking
  3. Multi-account collection: Account discovery, cross-account IAM roles, CUR v2
  4. Budget rules & anomaly detection: Zero-based budgeting, z-score anomalies, Slack notifications
  5. TanStack Router migration: Replace useState routing with URL-based routing for bookmarkable pages

Refer to development/testing.md for the phased checkpoint approach — phases 0-8 cover the implemented foundation, phases 9-14 cover the planned features.