Architecture Overview
Why ClickHouse
Polo previously tried S3 + Athena (cheap archival but no indexed lookups or low-latency dashboards), a graph DB (good for topology but wrong paradigm for aggregation and time-series), SQL/Postgres (schema explosion — every resource type has different attributes), and DynamoDB (functional for key-value lookups but rigid — any analytical query requires full scans or pre-planned GSIs).
ClickHouse resolves all four: columnar compression handles sparse/heterogeneous schemas, its SQL interface supports ad-hoc analytics, the MergeTree engine handles time-series natively, and the Map(String, String) type gives schema-free properties without migrations.
Core Design Principles
- Everything is an event. Resource creation, state changes, cost accrual, metric samples: all are modeled as timestamped events with a common envelope in resource_events.
- Properties are schemaless. Resource-type-specific attributes go in a Map(String, String) column. No migrations when you add a new resource type.
- ARN is the universal join key. Every event ties back to a resource ARN (or platform-specific equivalent like github:org/repo), which ties to an account, region, service, and (via tags) to a customer/cluster/index.
- Ingest once, materialise many views. Raw events land in one table. Materialised views pre-aggregate for dashboards (hourly cost rollups, daily snapshots).
- Collection is decoupled from storage. Collectors are independent Lambda functions on schedules. Adding a new collector requires no schema changes and no changes to existing collectors.
- Hierarchies are reified at ingest, metadata is inherited at query time. Identity columns (account, cluster, index) are stamped on events at ingest because they don't change. Mutable metadata (account role, customer tier) lives in a small dimension table and is resolved via ClickHouse dictionaries (in-memory hash maps, so lookups are effectively free). This avoids both expensive query-time graph traversal and brittle baking of metadata into every row.
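To make the first three principles concrete, here is a minimal sketch of how a collector might normalise a raw observation into the common envelope. The ResourceEvent field names, the parse_resource_id and normalise helpers, and the sample attributes are all assumptions for illustration; the actual resource_events schema is not shown in this section.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Mapping, Optional


@dataclass(frozen=True)
class ResourceEvent:
    """Hypothetical common envelope; field names are illustrative only."""
    event_time: datetime
    event_type: str          # e.g. "state_change", "cost", "metric"
    resource_id: str         # ARN or platform-specific key
    platform: str
    service: str
    region: str
    account_id: str
    properties: Dict[str, str] = field(default_factory=dict)


def parse_resource_id(resource_id: str) -> Dict[str, str]:
    """Derive identity fields from an ARN
    (arn:partition:service:region:account-id:resource), or fall back to
    '<platform>:<rest>' for non-AWS keys such as github:org/repo."""
    parts = resource_id.split(":", 5)
    if len(parts) == 6 and parts[0] == "arn":
        _, _partition, service, region, account_id, _resource = parts
        return {"platform": "aws", "service": service,
                "region": region, "account_id": account_id}
    platform, sep, _rest = resource_id.partition(":")
    if not sep or not platform:
        raise ValueError(f"unrecognised resource id: {resource_id!r}")
    return {"platform": platform, "service": "", "region": "", "account_id": ""}


def normalise(event_type: str, resource_id: str, raw_attrs: Mapping[str, Any],
              event_time: Optional[datetime] = None) -> ResourceEvent:
    """Build an envelope; type-specific attributes become string pairs."""
    # A Map(String, String) column holds only strings, so values are
    # stringified and None values are dropped rather than stored as "None".
    properties = {str(k): str(v) for k, v in raw_attrs.items() if v is not None}
    return ResourceEvent(
        event_time=event_time or datetime.now(timezone.utc),
        event_type=event_type,
        resource_id=resource_id,
        properties=properties,
        **parse_resource_id(resource_id),
    )
```

Under this sketch, a new resource type only contributes new keys to properties, so the envelope and the table definition stay unchanged.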
System Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                            Data Sources                             │
├──────────┬──────────┬──────────┬────────────────────────────────────┤
│CloudTrail│Cost      │EC2 API   │ (Planned: CUR v2, CloudWatch,      │
│ Events   │Explorer  │Describe* │  GitHub, Cloudflare, Datadog)      │
└────┬─────┴────┬─────┴────┬─────┴────────────────────────────────────┘
     │          │          │
     ▼          ▼          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                 Collector Lambdas (per data source)                 │
│                                  │                                  │
│                         Normalisation Layer                         │
│  • Map to ResourceEvent envelope    • Resolve ARNs                  │
│  • Derive marqo_* fields from tags  • Stamp account_role            │
└─────────────────────────┬───────────────────────────────────────────┘
                          │
                          │ Direct HTTP insert (clickhouse-connect)
                          │ Batched (1000 rows) with retry
                          ▼
              ┌───────────────────────┐
              │      ClickHouse       │
              │                       │
              │ resource_events       │──→ cost_hourly_mv
              │ resource_snapshots    │──→ cost_rollup_daily
              │ resource_ancestry     │
              │ resource_relationships│
              └───────────┬───────────┘
                          │
                          │ HTTPS
                          ▼
              ┌───────────────────────┐
              │   Cloudflare Worker   │
              │  (API + auth + SPA)   │
              │                       │
              │ • Named SQL queries   │
              │ • CF Access JWT auth  │
              │ • Static SPA assets   │
              └───────────────────────┘
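The "Batched (1000 rows) with retry" step in the diagram could be sketched roughly as follows. This is an illustration, not the collectors' actual code: batched, insert_with_retry, the three-attempt limit, and the exponential backoff are assumptions. The insert itself is passed in as a callable so the sketch stays independent of any client library.

```python
import time
from typing import Callable, Iterable, Iterator, List, Sequence

Row = Sequence[object]


def batched(rows: Iterable[Row], size: int = 1000) -> Iterator[List[Row]]:
    """Yield lists of at most `size` rows from any iterable."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def insert_with_retry(
    insert_batch: Callable[[List[Row]], None],
    rows: Iterable[Row],
    batch_size: int = 1000,
    max_attempts: int = 3,          # assumed limit, not from the source
    base_delay: float = 0.5,        # assumed backoff base, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Insert rows batch by batch, retrying each failed batch with
    exponential backoff. Returns the number of rows inserted."""
    inserted = 0
    for batch in batched(rows, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                insert_batch(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; the caller decides what to do next
                sleep(base_delay * 2 ** (attempt - 1))
        inserted += len(batch)
    return inserted
```

With clickhouse-connect, the callable might be something like `lambda b: client.insert("resource_events", b, column_names=COLUMNS)`, where client comes from clickhouse_connect.get_client(...) and COLUMNS is a hypothetical list of column names. Retrying a whole batch can write duplicate rows if an earlier attempt partially succeeded, so a real collector would need to tolerate or deduplicate them.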
Key architectural boundaries
- Cloudflare Worker serves the SPA and proxies named queries to ClickHouse. It never holds AWS write credentials.
- Collector Lambdas have read-only access to AWS accounts (via {env}-PoloReadRole). They insert directly to ClickHouse via the HTTP interface.
- ClickHouse is the sole datastore. No other database. Every query, every dashboard, every alert runs against ClickHouse.
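One way to read "named queries" is that clients send only a query name plus parameters, and the Worker looks up fixed SQL so no client-supplied SQL ever reaches ClickHouse. The sketch below shows that idea in Python for consistency with the collector examples; the Worker itself may be written in another language. The registry contents, the resolve_query helper, and the column names in cost_rollup_daily are hypothetical. The `{name:Type}` placeholders and `param_` prefix follow the ClickHouse HTTP interface's parameterised-query convention.

```python
from typing import Dict, Mapping, Tuple

# Hypothetical registry: query name -> (SQL with ClickHouse {name:Type}
# placeholders, required parameter names). Column names are assumed.
NAMED_QUERIES: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "daily_cost_by_account": (
        "SELECT day, account_id, sum(cost) AS cost "
        "FROM cost_rollup_daily "
        "WHERE day >= {start:Date} AND day < {end:Date} "
        "GROUP BY day, account_id ORDER BY day",
        ("start", "end"),
    ),
}


def resolve_query(name: str, params: Mapping[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return the fixed SQL for a named query plus the HTTP parameters
    (param_<name>=value) that bind its placeholders."""
    if name not in NAMED_QUERIES:
        raise KeyError(f"unknown query: {name}")
    sql, required = NAMED_QUERIES[name]
    if set(params) != set(required):
        raise ValueError(
            f"expected parameters {sorted(required)}, got {sorted(params)}"
        )
    # Values travel as separate HTTP parameters and are never spliced
    # into the SQL text.
    return sql, {f"param_{key}": str(params[key]) for key in required}
```

A request for an unregistered name fails before anything is sent to ClickHouse, which keeps the set of queries reachable from the SPA explicit and reviewable.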
Planned additions (not yet implemented)
- Action Lambda: write access to target accounts (via polo-action-role) for remediation actions, invoked by the Worker.
- SQS buffer: between collectors and ClickHouse for backpressure and dead-letter handling at scale.
- Multi-account collection: iterating over 10-20+ AWS accounts per collector run.
- Non-AWS data sources: GitHub, Cloudflare, Datadog cost collection.
- Materialised views for anomalies and rule violations: downstream of cost_rollup_daily.
What's deprioritised
- EC2 instance rightsizing — The infra team is actively reshuffling architecture, making CPU-based resize recommendations unreliable noise. The schema supports it for later.
- CloudWatch Agent / memory metrics — Required for rightsizing. Not needed until rightsizing is prioritised.
- Embeddable widgets — Nice to have for other internal tools, but not a launch requirement.