
Resource Hierarchy & Cost Rollup

The Problem

Polo's most important analytical capability is rolling costs up through multiple overlapping hierarchies that extend all the way down to physical AWS resource parentage.

A snapshot's cost rolls up to its volume, which rolls up to its instance, which rolls up to its index, cluster, and customer. The same snapshot simultaneously rolls up through the account hierarchy. The UI needs total cost at every level, and each level carries metadata that all descendants inherit (account role, customer tier, cluster version).

Marqo logical hierarchy:

Customer "Acme Corp" (tier: enterprise)
├── Cluster "acme-prod-1" (version: v2)
│   ├── Index "products"
│   │   ├── ALB alb-xxx ────────────── $0.02/hr (backed_by → i-aaa, i-ggg)
│   │   ├── EC2 i-aaa ─────────────── $2.10/hr
│   │   │   ├── EBS vol-bbb ──────── $0.08/hr (attached_to i-aaa)
│   │   │   │   └── Snapshot snap-ccc ── $0.01 (snapshot_of vol-bbb)
│   │   │   ├── ENI eni-ddd ──────── $0.00 (eni_of i-aaa)
│   │   │   └── Public IPv4 ──────── $0.005/hr (associated_with i-aaa)
│   │   └── NAT nat-fff ────────────── $0.045/hr
│   └── Index "images"
│       └── EC2 i-ggg ─────────────── $1.50/hr
└── Cluster "acme-staging-1"
    └── ...

Network topology (non-cost edges):

VPC vpc-111
├── Subnet subnet-aaa (us-east-1a)
│   ├── EC2 i-aaa (in_subnet)
│   ├── ENI eni-ddd (in_subnet)
│   └── NAT nat-fff (in_subnet)
└── Subnet subnet-bbb (us-east-1b)
    └── EC2 i-ggg (in_subnet)

Three Hierarchy Dimensions

| Dimension | Edges | Example path | Purpose |
| --- | --- | --- | --- |
| aws_physical | is_cost_parent=1 edges from resource_relationships | snapshot → volume → instance | Roll up child resource costs to their physical parents |
| marqo_logical | Tags on resources (or inherited from physical parents) | resource → index → cluster → customer | Business-level cost attribution |
| aws_account | Resource's aws_account_id | resource → account | Account-level cost tracking and role-based filtering |

All three coexist in resource_ancestry and can be queried independently or together.

Reify vs Infer — The Hybrid Approach

| Strategy | Pro | Con |
| --- | --- | --- |
| Query-time inference | Always correct | ClickHouse terrible at recursive CTEs; every query walks the graph |
| Full ingest-time reification | Blazing fast GROUP BY | Metadata changes require backfilling billions of rows |
| Hybrid (chosen) | Fast GROUP BY on identity; metadata changes are instant | Slight query complexity for metadata columns |

Identity columns (account_id, cluster, index) are stamped on events at ingest — they don't change after creation.

Metadata (account role, customer tier) lives in hierarchy_nodes and is resolved through a ClickHouse dictionary at query time, so a change takes effect at the next dictionary reload (LIFETIME is 60 to 300 seconds) without any backfill.

account_role is the one exception — reified on events too because it's universally useful for filtering, with a backfill job if it changes (which is rare).
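The hybrid split can be illustrated with a small in-memory sketch. This is not Polo code: the event shape, the cost_by_tier function, and the sample customers are hypothetical stand-ins for the events table and hierarchy_dict. It groups by the identity stamped at ingest and resolves the tier only at query time, so changing a tier never touches stored events.

```python
from collections import defaultdict

# Hypothetical stand-in for hierarchy_nodes / hierarchy_dict: node_id -> metadata.
hierarchy_metadata = {
    "customer:acme": {"tier": "enterprise"},
    "customer:globex": {"tier": "free"},
}

# Events carry identity columns stamped at ingest; they are never rewritten.
events = [
    {"customer": "acme", "cost_usd": 2.10},
    {"customer": "acme", "cost_usd": 0.08},
    {"customer": "globex", "cost_usd": 1.50},
]


def cost_by_tier(events, metadata):
    """Group by the stamped identity, then resolve the tier at query time."""
    by_customer = defaultdict(float)
    for event in events:
        by_customer[event["customer"]] += event["cost_usd"]

    by_tier = defaultdict(float)
    for customer, cost in by_customer.items():
        tier = metadata.get(f"customer:{customer}", {}).get("tier", "unknown")
        by_tier[tier] += cost
    return dict(by_tier)
```

Editing hierarchy_metadata and calling cost_by_tier again regroups the same events under the new tier, which is the property the hybrid approach relies on.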

Key Tables

hierarchy_nodes — Logical hierarchy levels

Small table (hundreds of rows) defining accounts, customers, clusters, indexes with inheritable metadata. Individual AWS resources are NOT stored here — they participate via resource_ancestry.

CREATE TABLE polo.hierarchy_nodes
(
    node_id    String,                  -- 'account:123456789012', 'cluster:abc123', etc.
    node_type  LowCardinality(String),  -- 'account', 'customer', 'cluster', 'index'
    node_name  String,
    parent_id  String DEFAULT '',       -- '' = root
    hierarchy  LowCardinality(String),  -- 'aws_account', 'marqo_logical'
    metadata   Map(String, String),     -- 'role' → 'customer', 'tier' → 'enterprise', etc.
    created_at DateTime64(3),
    updated_at DateTime64(3),
    _version   UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY (hierarchy, node_id);

resource_ancestry — Closure table

Pre-computed transitive closure of the full hierarchy. For every resource, every ancestor at every depth. Enables "total cost of everything under node X" without recursion.

CREATE TABLE polo.resource_ancestry
(
    resource_arn  String,                  -- the leaf resource
    ancestor_id   String,                  -- node_id or resource ARN
    ancestor_type LowCardinality(String),  -- 'account', 'customer', 'cluster', 'index', 'ec2:instance', etc.
    hierarchy     LowCardinality(String),  -- 'aws_physical', 'marqo_logical', 'aws_account'
    depth         UInt8,                   -- 1 = parent, 2 = grandparent, etc.
    _version      UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY (ancestor_id, resource_arn);

Example rows for snapshot snap-ccc:

| resource_arn | ancestor_id | ancestor_type | hierarchy | depth |
| --- | --- | --- | --- | --- |
| arn:...:snap-ccc | arn:...:vol-bbb | ebs:volume | aws_physical | 1 |
| arn:...:snap-ccc | arn:...:i-aaa | ec2:instance | aws_physical | 2 |
| arn:...:snap-ccc | index:acme-products | index | marqo_logical | 3 |
| arn:...:snap-ccc | cluster:acme-prod-1 | cluster | marqo_logical | 4 |
| arn:...:snap-ccc | customer:acme | customer | marqo_logical | 5 |
| arn:...:snap-ccc | account:111111111111 | account | aws_account | 1 |

Size estimate: ~5K resources × ~6 ancestors each = ~30K rows. At tens of thousands of resources, O(100K) rows — trivial for ClickHouse. Rebuilt in < 1 second every 15 minutes.
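To make the closure concrete, here is a minimal sketch of expanding direct parent edges into ancestry rows. It is an illustration, not the hierarchy_builder implementation: build_ancestry and the edge dictionary are hypothetical, ARNs are shortened to bare IDs, and it assumes each resource has a single cost parent (a load balancer with several backed_by targets would need a multi-parent variant).

```python
def build_ancestry(parent_of, hierarchy):
    """Expand direct child -> parent edges into closure rows.

    parent_of maps each node to (parent_id, parent_type). Returns tuples of
    (resource, ancestor_id, ancestor_type, hierarchy, depth), depth 1 = parent.
    """
    rows = []
    for resource in parent_of:
        seen = {resource}
        current, depth = resource, 0
        while current in parent_of:
            ancestor_id, ancestor_type = parent_of[current]
            if ancestor_id in seen:  # guard against cycles in bad edge data
                break
            seen.add(ancestor_id)
            depth += 1
            rows.append((resource, ancestor_id, ancestor_type, hierarchy, depth))
            current = ancestor_id
    return rows


# Direct cost-parent edges taken from the example tree (assumed shape).
physical_edges = {
    "snap-ccc": ("vol-bbb", "ebs:volume"),
    "vol-bbb": ("i-aaa", "ec2:instance"),
    "eni-ddd": ("i-aaa", "ec2:instance"),
}
rows = build_ancestry(physical_edges, "aws_physical")
```

For snap-ccc this yields the two aws_physical rows shown in the table above (volume at depth 1, instance at depth 2). The row count grows with resources times chain length, which is why the table stays small.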

hierarchy_dict — In-memory metadata lookups

CREATE DICTIONARY polo.hierarchy_dict
(
    node_id   String,
    node_type String,
    node_name String,
    parent_id String,
    hierarchy String,
    metadata  Map(String, String)
)
PRIMARY KEY node_id
SOURCE(CLICKHOUSE(TABLE 'hierarchy_nodes' DB 'polo'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 60 MAX 300);

Usage: dictGet('polo.hierarchy_dict', 'metadata', 'account:111')['role'] — near-zero cost per lookup.

cost_rollup_daily — Pre-aggregated hierarchical costs

Daily cost at every hierarchy level (logical AND physical). Powers treemap and sunburst views.

CREATE TABLE polo.cost_rollup_daily
(
    day            Date,
    node_id        String,                  -- node_id or resource ARN
    node_type      LowCardinality(String),
    node_name      String,
    hierarchy      LowCardinality(String),
    parent_id      String,
    cost_usd       Float64,
    resource_count UInt32,
    cost_by_type   Map(String, Float64)     -- 'ec2:instance' → 142.50, etc.
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (hierarchy, node_type, node_id, day);
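The aggregation that feeds this table can be sketched in a few lines. The function rollup_costs and the sample figures are hypothetical (the daily costs are simply the hourly rates from the example tree multiplied by 24), and the sketch assumes a node's total includes its own direct cost, since resource_ancestry holds no depth-0 self row.

```python
from collections import defaultdict


def rollup_costs(resource_costs, ancestry_pairs):
    """Sum each resource's cost into itself and into every ancestor.

    resource_costs: {resource_id: cost_usd} for one day.
    ancestry_pairs: iterable of (resource_id, ancestor_id) closure pairs.
    """
    totals = defaultdict(float)
    for resource, cost in resource_costs.items():
        totals[resource] += cost  # assumption: a node's rollup includes its own cost
    for resource, ancestor in ancestry_pairs:
        totals[ancestor] += resource_costs.get(resource, 0.0)
    return dict(totals)


costs = {"i-aaa": 50.40, "vol-bbb": 1.92, "snap-ccc": 0.01}
ancestry = [
    ("snap-ccc", "vol-bbb"),
    ("snap-ccc", "i-aaa"),
    ("snap-ccc", "index:acme-products"),
    ("vol-bbb", "i-aaa"),
    ("vol-bbb", "index:acme-products"),
    ("i-aaa", "index:acme-products"),
]
totals = rollup_costs(costs, ancestry)
```

Because every (resource, ancestor) pair is already materialised, the rollup is a single pass with no recursion, which mirrors a plain JOIN plus GROUP BY against resource_ancestry.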

Tag Inheritance Through Physical Parents

Snapshots and other child resources typically have no Marqo tags. The HierarchyResolver walks up the physical parent chain (snapshot → volume → instance) until it finds an ancestor with marqo:index tags, then uses those for the logical hierarchy. This means an untagged snapshot still rolls up correctly.

MARQO_TAG_MAP = {
    'marqo_customer': ['marqo:customer', 'Customer', 'customer'],
    'marqo_cluster': ['marqo:cluster', 'marqo:cluster-id', 'Cluster'],
    'marqo_index': ['marqo:index', 'marqo:index-name', 'Index'],
    'marqo_env': ['marqo:environment', 'Environment', 'env'],
    'marqo_purpose': ['marqo:purpose', 'Purpose'],
}
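The walk described above might look roughly like the following. This is a sketch rather than the actual HierarchyResolver: normalise_tags, resolve_inherited_tags, and the input dictionaries are hypothetical, and the rule that the first matching alias wins is an assumption about how the tag map is applied.

```python
MARQO_TAG_MAP = {
    "marqo_customer": ["marqo:customer", "Customer", "customer"],
    "marqo_cluster": ["marqo:cluster", "marqo:cluster-id", "Cluster"],
    "marqo_index": ["marqo:index", "marqo:index-name", "Index"],
    "marqo_env": ["marqo:environment", "Environment", "env"],
    "marqo_purpose": ["marqo:purpose", "Purpose"],
}


def normalise_tags(tags):
    """Map raw AWS tags to canonical keys; the first matching alias wins."""
    resolved = {}
    for canonical, aliases in MARQO_TAG_MAP.items():
        for alias in aliases:
            if alias in tags:
                resolved[canonical] = tags[alias]
                break
    return resolved


def resolve_inherited_tags(resource_id, tags_by_resource, physical_parent):
    """Walk up the physical parent chain until an ancestor has an index tag."""
    current, visited = resource_id, set()
    while current is not None and current not in visited:
        visited.add(current)  # stop if the parent chain loops
        resolved = normalise_tags(tags_by_resource.get(current, {}))
        if "marqo_index" in resolved:
            return resolved
        current = physical_parent.get(current)
    return {}  # no tagged ancestor: the resource stays unattributed


tags_by_resource = {"i-aaa": {"marqo:index": "products", "Cluster": "acme-prod-1"}}
physical_parent = {"snap-ccc": "vol-bbb", "vol-bbb": "i-aaa"}
inherited = resolve_inherited_tags("snap-ccc", tags_by_resource, physical_parent)
```

Here the untagged snapshot picks up the index and cluster of instance i-aaa two hops up, while a resource with no tagged ancestor resolves to an empty mapping.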

Resource Relationship Types

The relationships collector discovers edges between resources from AWS APIs. Each edge has a type and an is_cost_parent flag that controls whether the edge participates in cost rollup.

Cost-parent edges (roll up costs)

| Relationship | Source → Target | AWS API | Purpose |
| --- | --- | --- | --- |
| attached_to | EBS volume → EC2 instance | describe_volumes Attachments[].InstanceId | Volume cost attributed to instance |
| snapshot_of | EBS snapshot → EBS volume | describe_snapshots VolumeId | Snapshot cost attributed to volume |
| associated_with | EIP → EC2 instance | describe_addresses InstanceId | EIP cost attributed to instance |
| eni_of | ENI → EC2 instance | describe_network_interfaces Attachment.InstanceId | ENI cost attributed to instance |
| backed_by | Load balancer → EC2 instance | describe_target_groups + describe_target_health → target instances | LB cost attributed to target instances |

Topology edges (no cost rollup)

| Relationship | Source → Target | AWS API | Purpose |
| --- | --- | --- | --- |
| in_subnet | EC2 instance → Subnet | describe_instances SubnetId | Network placement |
| in_subnet | ENI → Subnet | describe_network_interfaces SubnetId | Network placement |
| in_subnet | NAT gateway → Subnet | describe_nat_gateways SubnetId | Network placement |
| in_vpc | Subnet → VPC | describe_subnets VpcId | Network containment |

Topology edges enable "show me all resources in VPC X" queries without polluting the cost-optimised closure table. They're queried directly from resource_relationships, not through resource_ancestry.
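As an illustration of the "all resources in VPC X" lookup, the sketch below resolves it with two passes over raw edges. The function resources_in_vpc and the (source, relationship, target) tuple shape are assumptions about how rows from resource_relationships might be represented, not the real schema.

```python
def resources_in_vpc(vpc_id, edges):
    """Return resources placed in any subnet of the given VPC.

    edges: iterable of (source, relationship, target) tuples.
    """
    # Pass 1: subnets contained in the VPC.
    subnets = {src for src, rel, dst in edges if rel == "in_vpc" and dst == vpc_id}
    # Pass 2: resources placed in those subnets.
    return sorted(
        src for src, rel, dst in edges if rel == "in_subnet" and dst in subnets
    )


edges = [
    ("subnet-aaa", "in_vpc", "vpc-111"),
    ("subnet-bbb", "in_vpc", "vpc-111"),
    ("i-aaa", "in_subnet", "subnet-aaa"),
    ("eni-ddd", "in_subnet", "subnet-aaa"),
    ("nat-fff", "in_subnet", "subnet-aaa"),
    ("i-ggg", "in_subnet", "subnet-bbb"),
    ("vol-bbb", "attached_to", "i-aaa"),  # cost edge, ignored by this query
]
in_vpc = resources_in_vpc("vpc-111", edges)
```

The topology is only two hops deep (resource → subnet → VPC), so a direct edge scan is enough and no closure rows are needed.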

Relationship Between Tables

resource_relationships (direct edges, typed)
    │
    │  hierarchy_builder reads is_cost_parent=1 edges
    ▼
resource_ancestry (transitive closure, all 3 hierarchies)
    │
    │  cost queries JOIN here for rollups
    ▼
cost_rollup_daily (pre-aggregated, for dashboards)

resource_relationships stores the raw graph. resource_ancestry is the query-optimised closure table derived from it. The hierarchy_builder collector rebuilds resource_ancestry every 15 minutes.

Marqo Logical Hierarchy — Data Sources

The hierarchy_admin collector populates the marqo_logical hierarchy from multiple DynamoDB tables and resource tags. This section documents the data sources, join logic, and heuristics.

Data Sources

| Source | Table | Account | What it provides |
| --- | --- | --- | --- |
| UsersAccountsTable | DynamoDB | 468036072962 (Staging) | Customer accounts: visible_account_id (UUID), system_account_id (short ID), organization, name |
| CustomerIndexConfigTable | DynamoDB | 468036072962 (Staging) | Indexes: index_name (SK), system_account_id (PK), index_status, inference_type, storage_class, marqo_version, marqo_endpoint |
| Resource tags | ClickHouse resource_snapshots | All collected accounts | Clusters and indexes via marqo:cluster, marqo:index AWS tags (fallback source) |

Join Logic

UsersAccountsTable (ACCOUNT entities)
    │
    │  system_account_id = system_account_id (PK)
    ▼
CustomerIndexConfigTable (index records)
    │
    │  visible_account_id from UsersAccountsTable
    ▼
hierarchy_nodes: customer:{visible_account_id} → index:{system_account_id}:{index_name}

The system_account_id is an opaque short identifier (e.g., nh236hm2) that serves as:

  • The partition key in CustomerIndexConfigTable
  • The join key between customers and their indexes
  • Part of the Marqo endpoint subdomain: {index-name}-{hash}-{system_account_id}.marqo-staging.com

It is stored in hierarchy_nodes customer metadata so that subsequent runs can read the mapping from ClickHouse via _load_sys_to_visible() without re-scanning DynamoDB.

Index Node Identity

Index node IDs use the (system_account_id, index_name) tuple: index:{system_account_id}:{index_name}. This is necessary because different customers can have indexes with the same name (e.g., kogan-prod-20251122 exists under two different accounts).

Index Status Filtering

Only indexes with active statuses are collected: CREATING, DELETING, MODIFYING, READY. The DELETED status (which accounts for the vast majority of records — ~13,800 of ~13,850) is excluded.
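The join, node identity, and status filter described above can be sketched together. This is not the hierarchy_admin handler: build_logical_nodes, the record dictionaries, and the sample IDs (other than the nh236hm2 and kogan-prod-20251122 examples from the text) are hypothetical, and dropping indexes whose account is unknown is an assumption.

```python
ACTIVE_STATUSES = {"CREATING", "DELETING", "MODIFYING", "READY"}


def build_logical_nodes(accounts, index_records):
    """Join accounts to indexes on system_account_id and emit node dicts."""
    sys_to_visible = {
        a["system_account_id"]: a["visible_account_id"] for a in accounts
    }
    nodes = []
    for account in accounts:
        nodes.append({
            "node_id": f"customer:{account['visible_account_id']}",
            "node_type": "customer",
            "parent_id": "",
            # Kept so later runs can rebuild the mapping without DynamoDB.
            "metadata": {"system_account_id": account["system_account_id"]},
        })
    for record in index_records:
        if record["index_status"] not in ACTIVE_STATUSES:
            continue  # DELETED records are skipped
        visible_id = sys_to_visible.get(record["system_account_id"])
        if visible_id is None:
            continue  # assumption: indexes without a known account are dropped
        nodes.append({
            "node_id": f"index:{record['system_account_id']}:{record['index_name']}",
            "node_type": "index",
            "parent_id": f"customer:{visible_id}",
            "metadata": {"index_status": record["index_status"]},
        })
    return nodes


accounts = [
    {"visible_account_id": "uuid-a", "system_account_id": "nh236hm2"},
    {"visible_account_id": "uuid-b", "system_account_id": "zz000000"},
]
index_records = [
    {"system_account_id": "nh236hm2", "index_name": "kogan-prod-20251122", "index_status": "READY"},
    {"system_account_id": "zz000000", "index_name": "kogan-prod-20251122", "index_status": "CREATING"},
    {"system_account_id": "nh236hm2", "index_name": "old-index", "index_status": "DELETED"},
]
nodes = build_logical_nodes(accounts, index_records)
```

The two kogan-prod-20251122 indexes end up with distinct node IDs because the system_account_id is part of the key, and the DELETED record produces no node.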

Index Metadata

Each index node stores metadata from CustomerIndexConfigTable:

| Field | Example | Purpose |
| --- | --- | --- |
| index_status | READY | Current lifecycle state |
| inference_type | CPU, CPU.SMALL, GPU, GPU.LARGE | Compute tier |
| storage_class | BASIC, BALANCED_STORAGE, PERFORMANCE, BALANCED_THROUGHPUT_PLUS | Storage tier |
| marqo_version | 2.25.0-cloud | Marqo engine version |
| marqo_endpoint | my-index-abc123.marqo-staging.com | Customer-facing endpoint |
| index_namespace | deadbeef1234... | Internal namespace hash |
| dev_cell | mehul or empty | Dev cell identifier (see below) |

Dev Cells

In staging, dev cells are isolated Marqo Cloud instances used by developers for testing. They're identified by naming patterns in resource names and index names:

| Pattern | Example | Dev cell ID |
| --- | --- | --- |
| Known developer prefix | mehul-dev-MultitenantEKSCluster-* | mehul |
| Known developer prefix | keshavjois-test-index | keshavjois |
| Generic {name}-dev- prefix | cell1-dev-MultitenantEKSCluster-* | cell1-dev |
| Branch-based deployment | dev-controller-branchname-* | branchname |

Known developer prefixes (maintained in hierarchy_admin/handler.py): mehul, mehulporuwal, keshav, keshavjois, vfacabado.

Regular staging, preprod, and prod resources have an empty dev_cell value. The dev_cell metadata field enables filtering the hierarchy to show only production-grade resources or only a specific developer's environment.
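One way to express these naming heuristics is shown below. It is a guess at the matching rules, not the code in hierarchy_admin/handler.py: detect_dev_cell, the two regular expressions, and the order in which the patterns are tried are all assumptions, and branch names containing hyphens are not handled.

```python
import re

KNOWN_DEV_PREFIXES = ["mehul", "mehulporuwal", "keshav", "keshavjois", "vfacabado"]
BRANCH_PATTERN = re.compile(r"^dev-controller-([^-]+)-")
GENERIC_PATTERN = re.compile(r"^([A-Za-z0-9]+)-dev-")


def detect_dev_cell(name):
    """Return the dev cell ID for a resource or index name, or '' if none."""
    # Requiring the trailing hyphen keeps "keshav" from matching "keshavjois-...".
    for prefix in KNOWN_DEV_PREFIXES:
        if name.startswith(prefix + "-"):
            return prefix
    branch = BRANCH_PATTERN.match(name)
    if branch:
        return branch.group(1)
    generic = GENERIC_PATTERN.match(name)
    if generic:
        return f"{generic.group(1)}-dev"
    return ""  # regular staging, preprod, and prod resources
```

Applied to the four example names in the table, this returns mehul, keshavjois, cell1-dev, and branchname respectively, and an empty string for a name such as prod-search.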

Hierarchy Shape

The Marqo logical hierarchy is currently two levels (customer → index), not three:

Customer "Acme Corp" (organization, name from UsersAccountsTable)
├── Index "prod-search" (READY, GPU, PERFORMANCE)
├── Index "staging-search" (READY, CPU.SMALL, BASIC)
└── Index "dev-experiment" (CREATING, CPU, BASIC, dev_cell: mehul)

There is no explicit "cluster" concept in the CustomerIndexConfigTable. Clusters only appear when AWS resources carry marqo:cluster tags (currently none do in staging). If Marqo introduces a cluster concept in the control plane, a third hierarchy level can be added.

Three Query Paths

| Path | When to use | Joins required |
| --- | --- | --- |
| Fast path | GROUP BY a denormalised column (account_role, marqo_customer) | None |
| Full rollup | Cost of a node including all physical children | JOIN resource_ancestry |
| Metadata | Resolve account role, customer tier for display | dictGet(), near free |