Skip to main content

Schema Reference

All tables live in the polo database. Migrations are numbered SQL files in components/schema/migrations/, applied in order by migrate.py.

Table Inventory

Implemented Tables

Core Event Stream

TableEnginePurpose
resource_eventsMergeTreeSource of truth. Every piece of data enters Polo as a row here. Append-only.
resource_snapshotsReplacingMergeTreeLatest state of each resource. Deduplicates on resource_arn via _version.
cost_hourly + cost_hourly_mvSummingMergeTree + MVPre-aggregated hourly cost. MV auto-populates from cost events.
cost_rollup_dailySummingMergeTreePre-aggregated daily cost at every hierarchy level. Populated by scheduled job.

Hierarchy & Topology

TableEnginePurpose
resource_relationshipsReplacingMergeTreeDirect edges between resources (physical graph). Has is_cost_parent flag.
hierarchy_nodesReplacingMergeTreeLogical hierarchy levels (accounts, customers, clusters, indexes) with metadata.
resource_ancestryReplacingMergeTreeClosure table. Pre-computed transitive ancestors for every resource across all 3 hierarchy dimensions.
hierarchy_dictDictionaryIn-memory hash of hierarchy_nodes for near-free metadata lookups in queries.

Planned Tables

These tables will be created as their corresponding features are implemented. DDL is defined in the feature design docs and will become migrations when implemented.

Actions & Remediation — see actions.md

TableEnginePurpose
action_typesReplacingMergeTreeRegistry of supported action types (terminate, resize, delete, etc.).
action_suggestionsReplacingMergeTreeSuggested actions from waste detectors, with estimated savings.
action_logReplacingMergeTreeAudit trail of every action Polo executes on AWS.

Governance — see budget-rules.md, anomalies.md

TableEnginePurpose
budget_rulesReplacingMergeTreeZero-based budgeting rules.
rule_violationsReplacingMergeTreeActive and historical violations of budget rules.
anomaliesReplacingMergeTreeCost anomalies detected by z-score analysis.
notificationsMergeTreeLog of all notifications sent (Slack).

Multi-Account & Ops — see multi-account.md, monitoring.md

TableEnginePurpose
aws_accountsReplacingMergeTreeRegistry of all AWS accounts, their Polo status, roles, and last scan time.
collector_runsMergeTreeHealth log — every collector run records success/failure, duration, event count.

Analytics & Forecasting — see forecasting.md, shared-costs.md

TableEnginePurpose
resource_snapshots_dailyMergeTreeNightly archival copy of resource_snapshots. Powers changelog diffs.
cost_forecastsMergeTreeProjected costs with actuals filled in later for accuracy tracking.
allocation_rulesReplacingMergeTreeRules for allocating shared infrastructure costs to customers.
allocated_costsMergeTreeResults of applying allocation rules.
aws_pricingReplacingMergeTreeAWS pricing data. Maps (service, region, SKU) -> price.

Core Table DDL

resource_events

CREATE TABLE polo.resource_events
(
-- Envelope
event_id UUID DEFAULT generateUUIDv4(),
event_time DateTime64(3),
ingested_at DateTime64(3) DEFAULT now64(3),

-- Source
collector LowCardinality(String),
aws_account_id LowCardinality(String),
aws_region LowCardinality(String),

-- Resource identity
resource_arn String, -- universal join key
resource_type LowCardinality(String), -- 'ec2:instance', 'ebs:volume', etc.
resource_id String,
resource_name String DEFAULT '',

-- Event classification
event_type LowCardinality(String), -- 'lifecycle', 'cost', 'metric', 'config', 'tag_change'
event_action LowCardinality(String),

-- Who (from CloudTrail)
actor_arn String DEFAULT '',
actor_name String DEFAULT '',

-- Marqo logical grouping (derived from tags at ingest)
marqo_customer LowCardinality(String) DEFAULT '',
marqo_cluster String DEFAULT '',
marqo_index String DEFAULT '',
marqo_env LowCardinality(String) DEFAULT '',
marqo_purpose LowCardinality(String) DEFAULT '',

-- Denormalised hierarchy metadata (stamped at ingest)
account_role LowCardinality(String) DEFAULT '',

-- Numeric payload
value Float64 DEFAULT 0,
unit LowCardinality(String) DEFAULT '',

-- Flexible properties
properties Map(String, String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (resource_type, aws_account_id, resource_arn, event_time)
TTL event_time + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;

Properties by resource type (implemented collectors):

resource_typeTypical keys
ec2:instanceinstance_type, ami_id, subnet_id, vpc_id, launch_time, state, platform
ebs:volumevolume_type, size_gb, iops, throughput_mbps, encrypted, attached_instance
vpc:nat_gatewayvpc_id, subnet_id, bytes_processed, active_connections
elastic_ipassociated, instance_id, public_ip
cost_allocationusage_type, operation, pricing_term, line_item_type

Properties for planned resource types:

resource_typeTypical keys
s3:bucketstorage_class, versioning, object_count, total_size_bytes
sagemaker:notebookinstance_type, status, started_by, idle_hours

resource_snapshots

CREATE TABLE polo.resource_snapshots
(
resource_arn String,
resource_type LowCardinality(String),
resource_id String,
resource_name String,
aws_account_id LowCardinality(String),
aws_region LowCardinality(String),
state LowCardinality(String),
first_seen DateTime64(3),
last_seen DateTime64(3),
last_updated DateTime64(3),
marqo_customer LowCardinality(String),
marqo_cluster String,
marqo_index String,
marqo_env LowCardinality(String),
marqo_purpose LowCardinality(String),
account_role LowCardinality(String),
properties Map(String, String),
tags Map(String, String),
cost_daily_usd Float64 DEFAULT 0,
cost_monthly_usd Float64 DEFAULT 0,
_version UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY (resource_arn)
SETTINGS index_granularity = 8192;

resource_relationships

CREATE TABLE polo.resource_relationships
(
source_arn String,
target_arn String,
relationship LowCardinality(String),
is_cost_parent UInt8 DEFAULT 0,
properties Map(String, String),
discovered_at DateTime64(3),
last_seen DateTime64(3),
_version UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY (source_arn, target_arn, relationship);

Edge types:

relationshipExampleis_cost_parentRationale
attached_toEBS vol -> EC2 instance1Volume cost rolls up to its instance
snapshot_ofEBS snapshot -> EBS volume1Snapshot cost rolls up to source volume
associated_withEIP -> EC2 instance1IPv4 cost rolls up to instance
eni_ofENI -> EC2 instance1ENI cost rolls up to instance
backed_byELB -> target -> instance1LB cost rolls up to what it serves
log_group_forCW log group -> producer1Log cost rolls up to producer
nat_forNAT -> subnet0Topological only
in_subnetInstance -> subnet0Topological only
in_vpcSubnet -> VPC0Topological only
routes_viaRoute table -> NAT/IGW0Topological only
secured_byInstance -> security group0Topological only

Hierarchy tables

See hierarchy.md for full DDL and design rationale for hierarchy_nodes, resource_ancestry, hierarchy_dict, and cost_rollup_daily.


Migration Files

Implemented (001-010)

components/schema/migrations/
├── 001_create_database.sql
├── 002_resource_events.sql
├── 003_resource_snapshots.sql
├── 004_cost_hourly.sql # Table + materialised view
├── 005_resource_relationships.sql
├── 006_hierarchy_nodes.sql
├── 007_resource_ancestry.sql
├── 008_hierarchy_dict.sql # Dictionary
├── 009_cost_rollup_daily.sql
├── 010_migrations_table.sql # Migration tracking
└── 011_retention_policies.sql # TTL: raw 90d, hourly 1y, daily indefinite

Planned (012-025)

These migrations will be added as their corresponding features are implemented:

012_action_types.sql # Seed action type registry
013_action_suggestions.sql
014_action_log.sql
015_aws_accounts.sql
016_collector_runs.sql
017_notifications.sql
018_budget_rules.sql # Seed default rules
019_rule_violations.sql
020_anomalies.sql
021_aws_pricing.sql
022_cost_forecasts.sql
023_resource_snapshots_daily.sql
024_allocation_rules.sql
025_allocated_costs.sql

Migrations are idempotent (CREATE TABLE IF NOT EXISTS). Applied by components/schema/migrate.py which tracks state in polo._migrations.

Data Retention & TTL Policies

Three retention tiers ensure raw data ages out while aggregates persist:

TierTableTTLPartitioningRationale
Raw eventsresource_events90 daysMonthly (toYYYYMM(event_time))Append-only stream; hourly rollups preserve older data
Hourly rollupscost_hourly1 yearMonthly (toYYYYMM(hour))Hour-grain drill-down for the past year
Daily rollupscost_rollup_dailyIndefiniteMonthly (toYYYYMM(day))Permanent historical record

Dimension/state tables (resource_snapshots, hierarchy_nodes, resource_relationships, resource_ancestry) have no TTL — they are small and deduplicated via ReplacingMergeTree.

Partition Pruning

All time-series tables use monthly partitions (PARTITION BY toYYYYMM(...)). This enables:

  1. Automatic TTL cleanup — ClickHouse drops entire partitions when all rows expire, avoiding row-by-row deletion overhead.
  2. Manual partition drops — For immediate cleanup or one-off purges:
    -- Drop a specific month of raw events
    ALTER TABLE polo.resource_events DROP PARTITION 202501;

    -- List partitions with row counts
    SELECT partition, table, rows, formatReadableSize(bytes_on_disk) AS size
    FROM system.parts
    WHERE database = 'polo' AND active
    GROUP BY partition, table, rows, bytes_on_disk
    ORDER BY table, partition;
  3. Efficient query pruning — Queries with WHERE event_time >= ... automatically skip irrelevant partitions.

Retention Tiers Applied by Migration

Migration 011_retention_policies.sql applies ALTER TABLE ... MODIFY TTL to existing tables. Base migration files (002, 004) are updated so fresh installations get the correct TTLs from the start.


Schema Evolution

  • Additive changes (new columns with defaults): use ALTER TABLE ... ADD COLUMN. Safe on MergeTree.
  • Structural changes (changing sort key, engine): requires creating a new table and migrating data. Document the migration carefully.
  • New tables: add a new numbered migration file.
  • Map properties: just add new keys in collector code. No schema change needed.