Schema Reference
All tables live in the polo database. Migrations are numbered SQL files in components/schema/migrations/, applied in order by migrate.py.
Table Inventory
Implemented Tables
Core Event Stream
| Table | Engine | Purpose |
|---|---|---|
resource_events | MergeTree | Source of truth. Every piece of data enters Polo as a row here. Append-only. |
resource_snapshots | ReplacingMergeTree | Latest state of each resource. Deduplicates on resource_arn via _version. |
cost_hourly + cost_hourly_mv | SummingMergeTree + MV | Pre-aggregated hourly cost. MV auto-populates from cost events. |
cost_rollup_daily | SummingMergeTree | Pre-aggregated daily cost at every hierarchy level. Populated by scheduled job. |
Hierarchy & Topology
| Table | Engine | Purpose |
|---|---|---|
resource_relationships | ReplacingMergeTree | Direct edges between resources (physical graph). Has is_cost_parent flag. |
hierarchy_nodes | ReplacingMergeTree | Logical hierarchy levels (accounts, customers, clusters, indexes) with metadata. |
resource_ancestry | ReplacingMergeTree | Closure table. Pre-computed transitive ancestors for every resource across all 3 hierarchy dimensions. |
hierarchy_dict | Dictionary | In-memory hash of hierarchy_nodes for near-free metadata lookups in queries. |
Planned Tables
These tables will be created as their corresponding features are implemented. DDL is defined in the feature design docs and will become migrations when implemented.
Actions & Remediation — see actions.md
| Table | Engine | Purpose |
|---|---|---|
action_types | ReplacingMergeTree | Registry of supported action types (terminate, resize, delete, etc.). |
action_suggestions | ReplacingMergeTree | Suggested actions from waste detectors, with estimated savings. |
action_log | ReplacingMergeTree | Audit trail of every action Polo executes on AWS. |
Governance — see budget-rules.md, anomalies.md
| Table | Engine | Purpose |
|---|---|---|
budget_rules | ReplacingMergeTree | Zero-based budgeting rules. |
rule_violations | ReplacingMergeTree | Active and historical violations of budget rules. |
anomalies | ReplacingMergeTree | Cost anomalies detected by z-score analysis. |
notifications | MergeTree | Log of all notifications sent (Slack). |
Multi-Account & Ops — see multi-account.md, monitoring.md
| Table | Engine | Purpose |
|---|---|---|
aws_accounts | ReplacingMergeTree | Registry of all AWS accounts, their Polo status, roles, and last scan time. |
collector_runs | MergeTree | Health log — every collector run records success/failure, duration, event count. |
Analytics & Forecasting — see forecasting.md, shared-costs.md
| Table | Engine | Purpose |
|---|---|---|
resource_snapshots_daily | MergeTree | Nightly archival copy of resource_snapshots. Powers changelog diffs. |
cost_forecasts | MergeTree | Projected costs with actuals filled in later for accuracy tracking. |
allocation_rules | ReplacingMergeTree | Rules for allocating shared infrastructure costs to customers. |
allocated_costs | MergeTree | Results of applying allocation rules. |
aws_pricing | ReplacingMergeTree | AWS pricing data. Maps (service, region, SKU) -> price. |
Core Table DDL
resource_events
CREATE TABLE polo.resource_events
(
-- Envelope
event_id UUID DEFAULT generateUUIDv4(),
event_time DateTime64(3),
ingested_at DateTime64(3) DEFAULT now64(3),
-- Source
collector LowCardinality(String),
aws_account_id LowCardinality(String),
aws_region LowCardinality(String),
-- Resource identity
resource_arn String, -- universal join key
resource_type LowCardinality(String), -- 'ec2:instance', 'ebs:volume', etc.
resource_id String,
resource_name String DEFAULT '',
-- Event classification
event_type LowCardinality(String), -- 'lifecycle', 'cost', 'metric', 'config', 'tag_change'
event_action LowCardinality(String),
-- Who (from CloudTrail)
actor_arn String DEFAULT '',
actor_name String DEFAULT '',
-- Marqo logical grouping (derived from tags at ingest)
marqo_customer LowCardinality(String) DEFAULT '',
marqo_cluster String DEFAULT '',
marqo_index String DEFAULT '',
marqo_env LowCardinality(String) DEFAULT '',
marqo_purpose LowCardinality(String) DEFAULT '',
-- Denormalised hierarchy metadata (stamped at ingest)
account_role LowCardinality(String) DEFAULT '',
-- Numeric payload
value Float64 DEFAULT 0,
unit LowCardinality(String) DEFAULT '',
-- Flexible properties
properties Map(String, String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (resource_type, aws_account_id, resource_arn, event_time)
TTL event_time + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;
Properties by resource type (implemented collectors):
| resource_type | Typical keys |
|---|---|
ec2:instance | instance_type, ami_id, subnet_id, vpc_id, launch_time, state, platform |
ebs:volume | volume_type, size_gb, iops, throughput_mbps, encrypted, attached_instance |
vpc:nat_gateway | vpc_id, subnet_id, bytes_processed, active_connections |
elastic_ip | associated, instance_id, public_ip |
cost_allocation | usage_type, operation, pricing_term, line_item_type |
Properties for planned resource types:
| resource_type | Typical keys |
|---|---|
s3:bucket | storage_class, versioning, object_count, total_size_bytes |
sagemaker:notebook | instance_type, status, started_by, idle_hours |
resource_snapshots
CREATE TABLE polo.resource_snapshots
(
resource_arn String,
resource_type LowCardinality(String),
resource_id String,
resource_name String,
aws_account_id LowCardinality(String),
aws_region LowCardinality(String),
state LowCardinality(String),
first_seen DateTime64(3),
last_seen DateTime64(3),
last_updated DateTime64(3),
marqo_customer LowCardinality(String),
marqo_cluster String,
marqo_index String,
marqo_env LowCardinality(String),
marqo_purpose LowCardinality(String),
account_role LowCardinality(String),
properties Map(String, String),
tags Map(String, String),
cost_daily_usd Float64 DEFAULT 0,
cost_monthly_usd Float64 DEFAULT 0,
_version UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY (resource_arn)
SETTINGS index_granularity = 8192;
resource_relationships
CREATE TABLE polo.resource_relationships
(
source_arn String,
target_arn String,
relationship LowCardinality(String),
is_cost_parent UInt8 DEFAULT 0,
properties Map(String, String),
discovered_at DateTime64(3),
last_seen DateTime64(3),
_version UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY (source_arn, target_arn, relationship);
Edge types:
| relationship | Example | is_cost_parent | Rationale |
|---|---|---|---|
attached_to | EBS vol -> EC2 instance | 1 | Volume cost rolls up to its instance |
snapshot_of | EBS snapshot -> EBS volume | 1 | Snapshot cost rolls up to source volume |
associated_with | EIP -> EC2 instance | 1 | IPv4 cost rolls up to instance |
eni_of | ENI -> EC2 instance | 1 | ENI cost rolls up to instance |
backed_by | ELB -> target -> instance | 1 | LB cost rolls up to what it serves |
log_group_for | CW log group -> producer | 1 | Log cost rolls up to producer |
nat_for | NAT -> subnet | 0 | Topological only |
in_subnet | Instance -> subnet | 0 | Topological only |
in_vpc | Subnet -> VPC | 0 | Topological only |
routes_via | Route table -> NAT/IGW | 0 | Topological only |
secured_by | Instance -> security group | 0 | Topological only |
Hierarchy tables
See hierarchy.md for full DDL and design rationale for hierarchy_nodes, resource_ancestry, hierarchy_dict, and cost_rollup_daily.
Migration Files
Implemented (001-010)
components/schema/migrations/
├── 001_create_database.sql
├── 002_resource_events.sql
├── 003_resource_snapshots.sql
├── 004_cost_hourly.sql # Table + materialised view
├── 005_resource_relationships.sql
├── 006_hierarchy_nodes.sql
├── 007_resource_ancestry.sql
├── 008_hierarchy_dict.sql # Dictionary
├── 009_cost_rollup_daily.sql
├── 010_migrations_table.sql # Migration tracking
└── 011_retention_policies.sql # TTL: raw 90d, hourly 1y, daily indefinite
Planned (012-025)
These migrations will be added as their corresponding features are implemented:
012_action_types.sql # Seed action type registry
013_action_suggestions.sql
014_action_log.sql
015_aws_accounts.sql
016_collector_runs.sql
017_notifications.sql
018_budget_rules.sql # Seed default rules
019_rule_violations.sql
020_anomalies.sql
021_aws_pricing.sql
022_cost_forecasts.sql
023_resource_snapshots_daily.sql
024_allocation_rules.sql
025_allocated_costs.sql
Migrations are idempotent (CREATE TABLE IF NOT EXISTS). Applied by components/schema/migrate.py which tracks state in polo._migrations.
Data Retention & TTL Policies
Three retention tiers ensure raw data ages out while aggregates persist:
| Tier | Table | TTL | Partitioning | Rationale |
|---|---|---|---|---|
| Raw events | resource_events | 90 days | Monthly (toYYYYMM(event_time)) | Append-only stream; hourly rollups preserve older data |
| Hourly rollups | cost_hourly | 1 year | Monthly (toYYYYMM(hour)) | Hour-grain drill-down for the past year |
| Daily rollups | cost_rollup_daily | Indefinite | Monthly (toYYYYMM(day)) | Permanent historical record |
Dimension/state tables (resource_snapshots, hierarchy_nodes, resource_relationships, resource_ancestry) have no TTL — they are small and deduplicated via ReplacingMergeTree.
Partition Pruning
All time-series tables use monthly partitions (PARTITION BY toYYYYMM(...)). This enables:
- Automatic TTL cleanup — ClickHouse drops entire partitions when all rows expire, avoiding row-by-row deletion overhead.
- Manual partition drops — For immediate cleanup or one-off purges:
-- Drop a specific month of raw eventsALTER TABLE polo.resource_events DROP PARTITION 202501;-- List partitions with row countsSELECT partition, table, rows, formatReadableSize(bytes_on_disk) AS sizeFROM system.partsWHERE database = 'polo' AND activeGROUP BY partition, table, rows, bytes_on_diskORDER BY table, partition;
- Efficient query pruning — Queries with
WHERE event_time >= ...automatically skip irrelevant partitions.
Retention Tiers Applied by Migration
Migration 011_retention_policies.sql applies ALTER TABLE ... MODIFY TTL to existing tables. Base migration files (002, 004) are updated so fresh installations get the correct TTLs from the start.
Schema Evolution
- Additive changes (new columns with defaults): use
ALTER TABLE ... ADD COLUMN. Safe on MergeTree. - Structural changes (changing sort key, engine): requires creating a new table and migrating data. Document the migration carefully.
- New tables: add a new numbered migration file.
- Map properties: just add new keys in collector code. No schema change needed.