Data Quality Validation

Why This Exists

Polo's analytical value — cost rollups, hierarchy queries, anomaly detection, budget rules — depends entirely on the quality of the data flowing through it. If an EC2 instance is missing its marqo:customer tag, its cost disappears from the customer's rollup. If a collector stops running, resources go stale and cost analysis drifts from reality. If a relationship edge is missing, a volume's cost won't roll up to its parent instance.

These problems are silent. Nothing breaks visibly. Costs just quietly stop adding up correctly, and by the time someone notices, the gap may span weeks. Backfilling is possible but expensive, and some data (CloudTrail events, point-in-time metrics) can't be recovered at all.

The data quality system exists to make these problems loud and fast. It continuously measures the completeness and consistency of Polo's data, tracks those measurements over time, and alerts when quality degrades.

What We Measure

Metadata Completeness

Every resource type has a set of metadata fields that should be filled for Polo's queries to work correctly. An EC2 instance without marqo_customer is invisible in customer cost views. An EBS volume without account_role won't appear in account-level filtering.

The expected fields vary by resource type because different resources serve different analytical purposes:

Resource Type	Expected Fields	Why
`ec2:instance`	marqo_customer, marqo_cluster, marqo_index, role, system, audience, account_role	EC2 instances are the primary cost drivers. They must be fully attributed to enable customer billing, cluster-level analysis, and role-based cost breakdowns.
`ebs:volume`	account_role, system	Volumes inherit business context (customer, cluster, index) from their parent instance via the physical hierarchy. They only need their own account-level metadata.
`ebs:snapshot`	account_role	Snapshots inherit everything from volume → instance. Minimal own metadata needed.
`elbv2:load_balancer`	role, system, account_role	Load balancers need functional classification for infrastructure analysis.
`s3:bucket`, `ec2:eip`, `vpc:nat_gateway`	account_role	Infrastructure resources need account context for cost partitioning.

This table is defined in code at components/collectors/data_quality/expectations.py and is designed to evolve. When we add new resource types or new analytical dimensions, the expectations grow with them.
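As an illustration of how the table above might look in code, here is a minimal sketch. The name EXPECTED_FIELDS matches the one referenced later in this document, but its exact shape, the fill_rate helper, and the rule that empty strings count as unfilled are assumptions, not the actual contents of expectations.py.

```python
from typing import Dict, Optional, Tuple

# Hypothetical shape for expectations.py; the real module may differ.
EXPECTED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "ec2:instance": (
        "marqo_customer", "marqo_cluster", "marqo_index",
        "role", "system", "audience", "account_role",
    ),
    "ebs:volume": ("account_role", "system"),
    "ebs:snapshot": ("account_role",),
    "elbv2:load_balancer": ("role", "system", "account_role"),
    "s3:bucket": ("account_role",),
    "ec2:eip": ("account_role",),
    "vpc:nat_gateway": ("account_role",),
}


def fill_rate(resource: dict) -> Optional[float]:
    """Return filled_fields / expected_fields, or None for untracked types."""
    expected = EXPECTED_FIELDS.get(resource["resource_type"])
    if not expected:
        return None
    # Assumption: a missing key, None, or an empty string all count as unfilled.
    filled = sum(1 for field in expected if resource.get(field))
    return filled / len(expected)
```

Under these assumptions, a volume with account_role set but system empty would score 0.5.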

Orphan Resources

A resource is an "orphan" if it has no rows in resource_ancestry — it exists in resource_snapshots but isn't connected to any hierarchy. Orphan resources:

Don't appear in customer/cluster/index cost views
Don't roll up to account-level totals
Are invisible to budget rules scoped to hierarchy nodes

Some orphans are expected (newly created resources before the hierarchy_builder runs, resources in testing accounts). A sudden increase in orphan count signals a problem.
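The orphan definition is a set difference, and the "sudden increase" check can be as simple as comparing today's count with a recent baseline. The sketch below works on in-memory rows rather than ClickHouse. The resource_id key, the function names, and the 2x spike factor are all illustrative assumptions.

```python
from typing import Iterable, List


def find_orphans(snapshot_ids: Iterable[str], ancestry_rows: Iterable[dict]) -> List[str]:
    """Return resource IDs present in snapshots but absent from ancestry."""
    # Assumption: each ancestry row carries a `resource_id` key.
    connected = {row["resource_id"] for row in ancestry_rows}
    return sorted(set(snapshot_ids) - connected)


def orphan_spike(today: int, previous_counts: List[int], factor: float = 2.0) -> bool:
    """Flag a sudden increase: today's count exceeds `factor` times the recent mean."""
    if not previous_counts:
        return False  # no history yet, so nothing to compare against
    baseline = sum(previous_counts) / len(previous_counts)
    # Floor the baseline at 1 so a history of zeros does not flag a single orphan.
    return today > factor * max(baseline, 1.0)
```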

Missing Cost Data

A resource in resource_snapshots with cost_daily_usd = 0 is either genuinely free (expected for some resource types) or sits behind a broken cost attribution pipeline (a problem). Tracking the count of such resources helps catch Cost Explorer or CUR failures early.
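To separate the expected zero-cost case from the problematic one, the count can skip resource types that are known to be free. This is a sketch only: the ZERO_COST_EXPECTED set and its two entries are invented for illustration and do not come from Polo's code.

```python
from typing import FrozenSet, Iterable

# Hypothetical allow-list of resource types that legitimately cost nothing.
ZERO_COST_EXPECTED: FrozenSet[str] = frozenset({"ec2:security_group", "iam:role"})


def count_missing_cost(
    resources: Iterable[dict],
    zero_cost_expected: FrozenSet[str] = ZERO_COST_EXPECTED,
) -> int:
    """Count resources with zero daily cost whose type is expected to cost money."""
    return sum(
        1
        for r in resources
        if r["cost_daily_usd"] == 0 and r["resource_type"] not in zero_cost_expected
    )
```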

Collector Freshness

Each collector has an expected run interval. If config_ec2 hasn't produced events in 2 hours, its data is stale and downstream tables (snapshots, relationships, hierarchy) are working with outdated state. The data quality system tracks the last event time per collector.
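A freshness check compares each collector's last event time with its allowed age. In the sketch below, the 2-hour limit for config_ec2 follows the example above. The 26-hour limit for cost_explorer, the MAX_AGE mapping, and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Hypothetical per-collector staleness limits; the real intervals may differ.
MAX_AGE: Dict[str, timedelta] = {
    "config_ec2": timedelta(hours=2),
    "cost_explorer": timedelta(hours=26),  # assumed: daily run plus some slack
}


def stale_collectors(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_age: Dict[str, timedelta] = MAX_AGE,
    default: timedelta = timedelta(hours=2),
) -> List[str]:
    """Return collectors whose last event is older than their allowed age."""
    stale = []
    for collector, seen_at in last_seen.items():
        if now - seen_at > max_age.get(collector, default):
            stale.append(collector)
    return sorted(stale)


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
seen = {"config_ec2": now - timedelta(hours=3), "cost_explorer": now - timedelta(hours=6)}
print(stale_collectors(seen, now))
```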

Tag Coverage

The fraction of resources with at least one marqo_* field filled. This is a broader, less precise signal than per-field completeness — useful for spotting systemic tagging failures (e.g., a new account's resources are all untagged).
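Tag coverage as defined here is a single ratio. A minimal sketch, assuming each resource is a flat dict whose marqo_* keys hold tag-derived values and that an empty value counts as unfilled:

```python
from typing import List


def tag_coverage(resources: List[dict]) -> float:
    """Fraction of resources with at least one non-empty marqo_* field."""
    if not resources:
        return 0.0  # assumption: an empty group reports 0 rather than raising
    tagged = sum(
        1
        for r in resources
        if any(key.startswith("marqo_") and value for key, value in r.items())
    )
    return tagged / len(resources)
```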

How It Works

The `data_quality` Collector

The data quality collector is a derived collector — it doesn't call AWS APIs. It reads from ClickHouse tables that other collectors populate, computes quality metrics, and writes them to data_quality_daily.

Data sources:

resource_snapshots FINAL — current state of every resource
resource_ancestry FINAL — which resources have hierarchy connections
resource_events — last event time per collector

Computation:

For each resource, look up expected fields from expectations.py
Count how many expected fields are non-empty → per-resource fill_rate
Aggregate by 4 dimensions: resource_type, account, region, overall
For each dimension: average fill rate, complete count, orphan count, per-field rates
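The aggregation steps above can be sketched as a single pass that files each scored resource under every dimension it belongs to. The input keys (account, region, fill_rate, is_orphan) and the 'all' placeholder for the overall dimension are assumptions, and per-field rates are left out to keep the example short.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate(scored: List[dict]) -> Dict[Tuple[str, str], dict]:
    """Group scored resources by (dimension, dimension_value) and summarise each group."""
    groups: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for r in scored:
        for dimension in ("resource_type", "account", "region"):
            groups[(dimension, r[dimension])].append(r)
        groups[("overall", "all")].append(r)  # 'all' is a placeholder value

    rows = {}
    for key, members in groups.items():
        total = len(members)
        rows[key] = {
            "total_resources": total,
            # A resource is complete when every expected field is filled.
            "complete_resources": sum(1 for m in members if m["fill_rate"] == 1.0),
            "orphan_resources": sum(1 for m in members if m["is_orphan"]),
            # Average of the per-resource fill rates.
            "completeness_score": sum(m["fill_rate"] for m in members) / total,
        }
    return rows
```

Each returned entry corresponds to one row of data_quality_daily for the day being computed.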

Schedule: Runs daily at 07:00 UTC, after cost_explorer (06:00) so cost data is fresh.

No account exclusions: All accounts — production, staging, testing, development — are scored equally. Testing and development accounts arguably matter more because we have full control over their tagging and lifecycle. Untagged test resources represent waste we can act on immediately. If a testing account has low completeness, that's a signal to either tag those resources or shut them down.

Storage: `data_quality_daily`

CREATE TABLE polo.data_quality_daily (
    day                  Date,
    dimension            LowCardinality(String),   -- 'resource_type', 'account', 'region', 'overall'
    dimension_value      String,
    total_resources      UInt32,
    complete_resources   UInt32,
    orphan_resources     UInt32,
    missing_cost         UInt32,
    completeness_score   Float64,                  -- 0.0 to 1.0
    tag_coverage         Float64,
    collector_last_seen  Map(String, DateTime64(3)),
    field_fill_rates     Map(String, Float64),
    _version             UInt64
) ENGINE = ReplacingMergeTree(_version)
  ORDER BY (day, dimension, dimension_value);

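The table uses ReplacingMergeTree because the collector recomputes the same day's rows on every run. Rows that share the sort key (day, dimension, dimension_value) collapse to the one with the highest _version, so later runs replace earlier ones. ClickHouse performs this deduplication during background merges, so queries should read with FINAL (as the collector does for its own source tables) to avoid seeing superseded rows.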

The dimension + dimension_value pattern keeps the table generic. Adding a new dimension (e.g., marqo_env) is a code change in the collector, not a schema migration.
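The replace-by-version behaviour of this table can be modelled in a few lines of Python. This is a simulation of the semantics for reasoning and testing, not how ClickHouse implements merges. The collapse name is hypothetical.

```python
from typing import Dict, List, Tuple


def collapse(rows: List[dict]) -> List[dict]:
    """Keep, per (day, dimension, dimension_value), the row with the highest _version."""
    latest: Dict[Tuple[str, str, str], dict] = {}
    for row in rows:
        key = (row["day"], row["dimension"], row["dimension_value"])
        # On a version tie, the later-inserted row wins.
        if key not in latest or row["_version"] >= latest[key]["_version"]:
            latest[key] = row
    return list(latest.values())
```

Two runs for the same day and dimension therefore leave exactly one logical row, carrying the metrics from the later run.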

Alerting

Quality degradation alerts flow through the existing rule_evaluator system. A data_quality rule type queries data_quality_daily and creates violations when:

completeness_score drops below a threshold (e.g., 80% for EC2 instances)
orphan_resources exceeds a count (e.g., 50)
Any collector hasn't been seen in N hours

Violations feed into the standard notification pipeline (Slack, etc.) with a 24-hour cooldown.
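A rule check of this kind reduces to a comparison plus a cooldown test. The sketch below reuses the condition shape shown under Adjusting Thresholds. The "above" direction, the function names, and the way the cooldown is applied are assumptions about rule_evaluator, not its actual behaviour.

```python
import json
from datetime import datetime, timedelta
from typing import Optional

COOLDOWN = timedelta(hours=24)


def is_violation(condition_json: str, row: dict) -> bool:
    """Compare one data_quality_daily row against a rule condition."""
    cond = json.loads(condition_json)
    value = row[cond["metric"]]
    threshold = float(cond["threshold"])  # the example rule stores it as a string
    if cond["direction"] == "below":
        return value < threshold
    return value > threshold  # assumed counterpart direction, e.g. "above"


def should_notify(last_notified: Optional[datetime], now: datetime) -> bool:
    """Suppress repeat notifications inside the cooldown window."""
    return last_notified is None or now - last_notified >= COOLDOWN
```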

Aspirations

Where We Want To Be

Target state: Every production resource in Polo has 100% metadata completeness. Every customer-facing EC2 instance is tagged with marqo:customer, marqo:cluster, marqo:index, and classified with role, system, audience. The overall completeness score is ≥ 95% and trending upward.

Why 100% matters: Polo's cost analysis is only as trustworthy as its metadata. If 10% of EC2 instances are missing marqo:customer, then 10% of compute costs are invisible in customer views. That's not a 10% error — it's an unknown error, because the missing costs could be disproportionately expensive. Partial coverage is worse than no coverage because it creates false confidence.
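A small worked example makes the "unknown error" point concrete. The fleet below is invented: nine tagged instances at $50/day and one untagged instance at $500/day.

```python
from typing import List, Tuple


def invisible_share(daily_costs: List[float], has_customer_tag: List[bool]) -> Tuple[float, float]:
    """Return (share of instances untagged, share of cost untagged)."""
    untagged = [c for c, tagged in zip(daily_costs, has_customer_tag) if not tagged]
    by_count = len(untagged) / len(daily_costs)
    by_cost = sum(untagged) / sum(daily_costs)
    return by_count, by_cost


# Hypothetical fleet: the single untagged instance is also the most expensive.
costs = [50.0] * 9 + [500.0]
tags = [True] * 9 + [False]
by_count, by_cost = invisible_share(costs, tags)
print(f"{by_count:.0%} of instances, {by_cost:.1%} of cost")
```

In this made-up fleet, 10% of instances account for roughly 53% of compute cost, and nothing in the customer views reveals which of the two figures applies.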

What "Good" Looks Like

Metric	Target	Acceptable	Investigate
EC2 completeness_score	≥ 0.95	≥ 0.80	< 0.80
Overall tag_coverage	≥ 0.90	≥ 0.75	< 0.75
Orphan resources	< 20	< 50	≥ 50
Collector freshness	All < 30 min	All < 2 hours	Any > 4 hours
Missing cost (non-free)	0	< 10	≥ 10

These thresholds are starting points. As we improve tagging discipline and collector reliability, we should ratchet them tighter. The data_quality_daily trend data will show us when we're ready.
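The three bands in the table can be expressed as one small classifier. The function name and the higher_is_better flag are illustrative. The boundaries follow the table: scores use ≥ comparisons, counts use < comparisons.

```python
def classify(value: float, target: float, acceptable: float, higher_is_better: bool = True) -> str:
    """Map a metric value to 'target', 'acceptable', or 'investigate'."""
    if higher_is_better:
        # Scores such as completeness_score: bigger is better, bounds are inclusive.
        if value >= target:
            return "target"
        if value >= acceptable:
            return "acceptable"
        return "investigate"
    # Counts such as orphan resources: smaller is better, bounds are exclusive.
    if value < target:
        return "target"
    if value < acceptable:
        return "acceptable"
    return "investigate"


print(classify(0.85, target=0.95, acceptable=0.80))               # EC2 completeness
print(classify(50, target=20, acceptable=50, higher_is_better=False))  # orphan count
```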

What We're Building Toward

Automated tagging remediation: When the quality system detects an untagged EC2 instance in a production account, it should be able to infer the correct tags (from the instance's subnet, security group, or naming pattern) and either apply them automatically or create an action suggestion.
Backfill triggers: When quality drops because a collector failed, the system should detect the gap and trigger a targeted backfill for the affected time range and account/region.
Schema evolution tracking: When we add new metadata fields (like role, system, audience in migration 018), the quality system should automatically start tracking fill rates for those fields and alert on low coverage — even before expectations.py is updated.
Cross-resource consistency: Beyond per-resource completeness, verify that relationships are consistent. If instance i-123 has marqo:customer=acme but its attached volume has marqo:customer=beta, that's a data inconsistency worth flagging.
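As one possible shape for the cross-resource consistency check, the sketch below walks parent/child edges and reports pairs whose values disagree. The edge representation, the function name, and the choice to ignore empty values are assumptions. None of this exists in the collector today.

```python
from typing import Dict, Iterable, List, Tuple


def find_mismatches(
    resources: Dict[str, dict],
    edges: Iterable[Tuple[str, str]],
    field: str = "marqo_customer",
) -> List[Tuple[str, str, str, str]]:
    """Return (parent_id, child_id, parent_value, child_value) for disagreeing pairs."""
    mismatches = []
    for parent_id, child_id in edges:
        parent_val = resources.get(parent_id, {}).get(field)
        child_val = resources.get(child_id, {}).get(field)
        # Only flag when both sides are set and differ. An empty child value is
        # a completeness question (or expected inheritance), not an inconsistency.
        if parent_val and child_val and parent_val != child_val:
            mismatches.append((parent_id, child_id, parent_val, child_val))
    return mismatches
```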

How to Extend

Adding a New Resource Type

Add the resource type to EXPECTED_FIELDS in expectations.py with the fields that should be filled
The collector will automatically start tracking it on the next run
If the new resource type needs special handling (e.g., it's always expected to be untagged), add it to the appropriate exclusion set

Adding a New Metadata Field

When a new field is added to ResourceEvent / resource_snapshots, decide whether it should be tracked
Add it to the relevant entries in EXPECTED_FIELDS
The field_fill_rates map in data_quality_daily will automatically include the new field

Adding a New Dimension

Add the dimension to the collector's aggregation loop (e.g., dimension='marqo_env')
The table schema doesn't need to change — dimension is a generic string
Add API queries and UI views for the new dimension
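One way to keep a new dimension a one-line change is a registry of extractor functions. The DIMENSIONS mapping, the "(unset)" bucket, and the 'all' value for the overall dimension are assumptions about how the aggregation loop could be organised, not a description of the current collector.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: dimension name -> function that derives dimension_value.
DIMENSIONS: Dict[str, Callable[[dict], str]] = {
    "resource_type": lambda r: r["resource_type"],
    "account": lambda r: r["account"],
    "region": lambda r: r["region"],
    "overall": lambda r: "all",
}

# Adding a dimension is one more entry; the table schema stays the same.
DIMENSIONS["marqo_env"] = lambda r: r.get("marqo_env") or "(unset)"


def dimension_keys(resource: dict) -> List[Tuple[str, str]]:
    """Return every (dimension, dimension_value) pair a resource contributes to."""
    return [(name, extract(resource)) for name, extract in DIMENSIONS.items()]
```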

Adjusting Thresholds

Quality rule thresholds are stored in budget_rules (the existing rule infrastructure). Adjust them via:

UPDATE polo.budget_rules
SET condition = '{"metric": "completeness_score", "dimension": "resource_type",
                  "dimension_value": "ec2:instance", "threshold": "0.90", "direction": "below"}'
WHERE rule_name = 'EC2 metadata completeness';

Adding New Quality Metrics

The data_quality_daily table uses Map(String, Float64) for field_fill_rates, which is extensible without schema changes. For fundamentally new metric types:

Add a new column to data_quality_daily (requires a migration)
Compute it in the collector
Add an alert rule if appropriate

What Can Change

Things That Are Stable

The principle: Measure quality, track trends, alert on degradation. This won't change.
The storage pattern: Daily aggregated metrics in ClickHouse with ReplacingMergeTree. This fits the data lifecycle well.
The expectations model: Per-resource-type expected fields. The specific fields will evolve, but the pattern works.

Things That Might Change

Scoring algorithm: The current fill-rate approach (filled_fields / expected_fields) is simple and interpretable. We might move to weighted scoring if some fields matter more than others (e.g., marqo_customer is more important than marqo_purpose).
Real-time vs daily: The collector runs daily. If we need faster feedback (e.g., detecting a tagging regression within minutes of deployment), we could add a lightweight ClickHouse materialized view that computes a live completeness score from resource_events, separate from the daily deep analysis.
Grafana integration: ClickHouse handles the storage and querying well for now. If we deploy Grafana, the data_quality_daily table is a natural datasource for dashboards. No architectural change needed — just point Grafana at the table.
Per-event validation: Currently quality is measured post-hoc from resource_snapshots. A future evolution could validate each ResourceEvent at ingest time and reject or flag events with critical fields missing. This would prevent bad data from entering the system rather than detecting it after the fact.
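Two of the possible changes above, weighted scoring and per-event validation, are small enough to sketch. The weights, the CRITICAL_FIELDS mapping, and both function names are hypothetical and only show the shape such code could take.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: a missing marqo_customer hurts more than a missing role.
FIELD_WEIGHTS: Dict[str, float] = {"marqo_customer": 3.0, "marqo_cluster": 2.0, "role": 1.0}

# Hypothetical per-type critical fields checked at ingest time.
CRITICAL_FIELDS: Dict[str, Tuple[str, ...]] = {
    "ec2:instance": ("marqo_customer", "account_role"),
}


def weighted_fill_rate(resource: dict, weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Weighted variant of filled_fields / expected_fields."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    filled = sum(w for field, w in weights.items() if resource.get(field))
    return filled / total


def validate_event(event: dict, critical: Dict[str, Tuple[str, ...]] = CRITICAL_FIELDS) -> List[str]:
    """Return the critical fields missing from one event; an empty list means it passes."""
    required = critical.get(event.get("resource_type"), ())
    return [field for field in required if not event.get(field)]
```

With these weights, a resource that has only role filled scores 1/6 rather than the 1/3 it would get from the unweighted fill rate.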

Relationship to Other Systems

Hierarchy Builder

The hierarchy builder creates resource_ancestry, which the data quality system reads to detect orphans. If the hierarchy builder fails, orphan count spikes — but that's the hierarchy builder's problem. The quality system reports the symptom; the monitoring system (collector_runs, when implemented) reports the root cause.

Anomaly Detector

The anomaly detector watches cost trends. The data quality system watches metadata trends. They complement each other: a cost anomaly might be caused by a quality issue (e.g., costs shifting between customers because of a tagging change), and the quality trend data helps diagnose it.

Rule Evaluator

The rule evaluator is the alerting mechanism for quality metrics, just as it is for budget and compliance rules. Quality rules are stored in budget_rules alongside budget rules — they share the same evaluation, violation, and notification infrastructure.

Snapshot Builder

The snapshot builder materializes resource_snapshots from resource_events. The data quality system reads resource_snapshots — so snapshot builder failures directly affect quality measurement. Stale snapshots will show outdated metadata and undercount new resources.
