Data Quality Runbook

Procedures for monitoring, investigating, and resolving data quality issues in Polo.

Daily Health Check

Run this query to get the current quality snapshot:

SELECT
    dimension_value AS what,
    total_resources,
    complete_resources,
    round(completeness_score, 3) AS score,
    round(tag_coverage, 3) AS tags,
    orphan_resources AS orphans,
    missing_cost
FROM polo.data_quality_daily FINAL
WHERE day = today() AND dimension = 'resource_type'
ORDER BY total_resources DESC;

What to look for:

score < 0.80 for any resource type → investigate missing metadata (see "Low Completeness Score" below)
orphans > 0 for ec2:instance → hierarchy_builder may be failing, or relationships are not being discovered
missing_cost > 0 for ec2:instance or ebs:volume → cost pipeline may be broken
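
The checklist above can be applied mechanically to the query output. The following is a minimal sketch, assuming each row arrives as a dict keyed by the query's column aliases (`what`, `score`, `orphans`, `missing_cost`). The function name `health_findings` and the message strings are hypothetical and not part of Polo.

```python
# Thresholds copied from the checklist; the row layout is an assumption.
SCORE_FLOOR = 0.80
ORPHAN_TYPES = {"ec2:instance"}
COST_TYPES = {"ec2:instance", "ebs:volume"}


def health_findings(rows):
    """Return (resource_type, message) pairs for rows that need attention."""
    findings = []
    for row in rows:
        rtype = row["what"]
        if row["score"] < SCORE_FLOOR:
            findings.append((rtype, "low completeness score"))
        if rtype in ORPHAN_TYPES and row["orphans"] > 0:
            findings.append((rtype, "orphan resources"))
        if rtype in COST_TYPES and row["missing_cost"] > 0:
            findings.append((rtype, "missing cost data"))
    return findings
```

An empty result means none of the three conditions fired for today's snapshot.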

Investigating Low Completeness Score

Step 1: Identify which fields are missing

SELECT
    field_fill_rates
FROM polo.data_quality_daily FINAL
WHERE day = today()
  AND dimension = 'resource_type'
  AND dimension_value = 'ec2:instance';  -- substitute the resource type under investigation

This returns a map of field name to fill rate, such as {'marqo_customer': 0.85, 'role': 0.92, 'system': 0.95, ...}. The fields with the lowest rates are pulling the completeness score down; start with the lowest.
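
When the map has many entries, ranking it makes the weakest fields obvious. A minimal sketch, assuming the map has already been loaded into a Python dict; `lowest_fill_rates` is a hypothetical helper name.

```python
def lowest_fill_rates(field_fill_rates, limit=3):
    """Return up to `limit` (field, rate) pairs, lowest fill rate first.

    Ties are broken alphabetically by field name so the output is stable.
    """
    ranked = sorted(field_fill_rates.items(), key=lambda item: (item[1], item[0]))
    return ranked[:limit]


rates = {"marqo_customer": 0.85, "role": 0.92, "system": 0.95}
worst_field, worst_rate = lowest_fill_rates(rates, limit=1)[0]
```

With the example map from the text, `worst_field` is `marqo_customer`.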

Step 2: Find the specific resources

SELECT
    resource_arn, resource_name, aws_account_id, aws_region,
    marqo_customer, marqo_cluster, marqo_index, role, system, account_role
FROM polo.resource_snapshots FINAL
WHERE resource_type = 'ec2:instance'
  AND marqo_customer = ''  -- replace with the missing field
ORDER BY aws_account_id, resource_name;
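
If the filter is widened to several fields, it helps to see which resources lack which field. The sketch below assumes the query rows have been fetched as dicts keyed by column name and treats an empty string as missing, matching the `= ''` filter in the query. The helper names are hypothetical.

```python
from collections import defaultdict

# Metadata columns selected by the Step 2 query.
METADATA_FIELDS = (
    "marqo_customer", "marqo_cluster", "marqo_index",
    "role", "system", "account_role",
)


def missing_fields(row):
    """Return the metadata fields that are empty ('' or absent) in one row."""
    return [field for field in METADATA_FIELDS if not row.get(field)]


def arns_by_missing_field(rows):
    """Map each missing field to the ARNs of the resources that lack it."""
    gaps = defaultdict(list)
    for row in rows:
        for field in missing_fields(row):
            gaps[field].append(row["resource_arn"])
    return dict(gaps)
```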

Step 3: Determine root cause

Common causes of missing metadata:

| Missing Field | Likely Cause | Resolution |
| --- | --- | --- |
| `marqo_customer` | Resource not tagged with `marqo:customer` in AWS | Tag the resource in AWS. Check if it's a Marqo Cloud resource or internal infrastructure. |
| `marqo_cluster` | Resource not tagged with `marqo:cluster` | Tag the resource. If it's a standalone instance (not in a cluster), this may be expected; consider adjusting expectations. |
| `role` | Instance name doesn't match any pattern in `enrichment.py`, and no `polo.role` tag | Add a `polo.role` tag to the instance, or add a name pattern to `_instance_role()` in `enrichment.py`. |
| `system` | Account not in `_ACCOUNT_SYSTEM` mapping, and role not in `_ROLE_TO_SYSTEM` | Add the account to `_ACCOUNT_SYSTEM` in `enrichment.py`, or ensure the instance has a role that maps to a system. |
| `account_role` | Account missing from `hierarchy_nodes` | Run the `hierarchy_admin` collector to discover the account, or add it manually. |

Step 4: Verify the fix

After tagging resources or updating enrichment logic:

Wait for the next collector run (~15 min for config collectors)
Wait for snapshot_builder to update snapshots (~15 min after config)
Check resource_snapshots to verify the field is now populated
The next data_quality run (daily at 07:00 UTC) will reflect the improvement
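
The timings above determine which daily run first reflects a fix. A rough sketch, assuming the two ~15 minute delays add up and that the data_quality run reads whatever is in resource_snapshots at 07:00 UTC (an assumption, not confirmed here). All function names are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

COLLECTOR_DELAY = timedelta(minutes=15)   # ~15 min for config collectors
SNAPSHOT_DELAY = timedelta(minutes=15)    # ~15 min for snapshot_builder
DAILY_RUN_UTC = time(7, 0)                # data_quality runs daily at 07:00 UTC


def earliest_snapshot_check(fix_time):
    """Earliest time the fix should be visible in resource_snapshots."""
    return fix_time + COLLECTOR_DELAY + SNAPSHOT_DELAY


def next_quality_run(now):
    """Return the next 07:00 UTC run at or after `now` (timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), DAILY_RUN_UTC, tzinfo=timezone.utc)
    if candidate < now_utc:
        candidate += timedelta(days=1)
    return candidate


def first_run_reflecting_fix(fix_time):
    """First daily run that happens after the snapshot has caught up."""
    return next_quality_run(earliest_snapshot_check(fix_time))
```

Under these assumptions, a fix applied at 06:50 UTC reaches the snapshot at about 07:20 UTC, so it would first show up in the following day's metrics.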

For immediate verification without waiting for the daily run:

SELECT
    resource_type,
    count() AS total,
    countIf(marqo_customer != '') AS has_customer,
    countIf(role != '') AS has_role,
    countIf(account_role != '') AS has_account_role
FROM polo.resource_snapshots FINAL
GROUP BY resource_type
ORDER BY total DESC;

Investigating Orphan Resources

Step 1: List orphans

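-- With the ClickHouse default (join_use_nulls = 0), unmatched LEFT JOIN rows are
-- filled with '' rather than NULL, so the setting is enabled for the IS NULL filter.
SELECT rs.resource_arn, rs.resource_type, rs.resource_name, rs.aws_account_id
FROM polo.resource_snapshots rs FINAL
LEFT JOIN (
    SELECT DISTINCT resource_arn FROM polo.resource_ancestry FINAL
) ra ON rs.resource_arn = ra.resource_arn
WHERE ra.resource_arn IS NULL
ORDER BY rs.resource_type, rs.aws_account_id
SETTINGS join_use_nulls = 1;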

Step 2: Diagnose why they're orphaned

Common causes:

New resource: Created after the last hierarchy_builder run. Wait 15 min.
Missing relationship edge: The resource has no cost-parent edge in resource_relationships. Check if the relationships collector is discovering it.
Missing tags: The resource has no marqo:* tags AND no physical parent, so the hierarchy_builder can't place it in the logical hierarchy.
hierarchy_builder failure: Check whether the collector ran successfully. Look at Lambda logs or query resource_events for recent hierarchy_builder events.
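
The causes above are not mutually exclusive, so a triage helper can list every candidate. This is a sketch only: the inputs are facts gathered by hand, "no physical parent" is approximated by the absence of a cost-parent edge, and the function name is hypothetical.

```python
def orphan_candidate_causes(age_minutes, has_cost_parent_edge,
                            has_marqo_tags, builder_healthy):
    """Return every cause from the list that is consistent with the facts."""
    causes = []
    if age_minutes < 15:
        # Created after the last hierarchy_builder run; may resolve by itself.
        causes.append("new resource")
    if not has_cost_parent_edge:
        causes.append("missing relationship edge")
    if not has_marqo_tags and not has_cost_parent_edge:
        # No tags and no physical parent: nothing to place it under.
        causes.append("missing tags")
    if not builder_healthy:
        causes.append("hierarchy_builder failure")
    return causes
```

An empty list means none of the listed causes apply and the orphan needs a closer look.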

Step 3: Check relationship edges

-- Does this resource have any relationships?
SELECT * FROM polo.resource_relationships FINAL
WHERE source_arn = 'arn:aws:ec2:...'  -- the orphan's ARN
   OR target_arn = 'arn:aws:ec2:...';

If the query returns no rows, the relationships collector isn't finding edges for this resource. Check:

Is it a resource type that the relationships collector covers? (EC2, EBS, EIP, ENI, ELB)
Is it in a region/account that the collector is scanning?

Investigating Collector Freshness

Step 1: Check last seen times

SELECT
    collector_last_seen
FROM polo.data_quality_daily FINAL
WHERE day = today() AND dimension = 'overall';

Or query directly from events:

SELECT
    collector,
    max(event_time) AS last_event,
    dateDiff('minute', max(event_time), now()) AS minutes_ago
FROM polo.resource_events
WHERE event_time > now() - INTERVAL 2 DAY
GROUP BY collector
ORDER BY minutes_ago DESC;
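
Note that the query only looks back two days, so a collector that has been silent for longer does not appear in the result at all. The sketch below flags both cases. The expected intervals are supplied by the caller; apart from the ~15 minute figure for config collectors mentioned in this runbook, the values in the example are placeholders, and the function name is hypothetical.

```python
def stale_collectors(rows, expected_minutes, grace=2.0):
    """Flag collectors whose last event is older than grace * interval.

    rows: iterable of (collector, minutes_ago) from the freshness query.
    expected_minutes: dict of collector name -> expected run interval.
    Returns collector -> minutes_ago, or None when the collector is
    absent from the query window entirely.
    """
    seen = dict(rows)
    stale = {}
    for collector, interval in expected_minutes.items():
        minutes_ago = seen.get(collector)
        if minutes_ago is None:
            stale[collector] = None
        elif minutes_ago > interval * grace:
            stale[collector] = minutes_ago
    return stale


# Placeholder intervals, not Polo's actual schedule.
expected = {"config_ec2": 15, "relationships": 15, "tags": 15}
result = stale_collectors([("config_ec2", 12), ("relationships", 95)], expected)
```

Here `relationships` is flagged because 95 minutes exceeds twice its interval, and `tags` is flagged with `None` because it produced no events in the window.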

Step 2: Investigate stale collectors

| Stale Collector | What It Means | Impact |
| --- | --- | --- |
| `config_ec2` | EC2 snapshots are outdated | New/terminated instances not reflected, stale metadata |
| `config_ebs` | Volume state outdated | Detached volumes still show as attached |
| `relationships` | Physical hierarchy edges outdated | New attachments not tracked, cost rollup incomplete |
| `hierarchy_builder` | Ancestry closure table outdated | Orphan count inflated, cost rollups stale |
| `cost_explorer` | Cost data not flowing | `missing_cost` > 0 for resources that should have costs |
| `tags` | Tag changes not reflected | Metadata completeness may be understated |

Step 3: Check Lambda execution

Look at CloudWatch Logs for the collector's Lambda function. Common failures:

AssumeRole failure: The cross-account role may have expired or been modified
ClickHouse connection timeout: ClickHouse instance may be overloaded or unreachable
API throttling: AWS API rate limits hit during collection

Investigating Missing Cost Data

Step 1: Which resources are missing costs?

SELECT
    resource_arn, resource_type, resource_name,
    aws_account_id, first_seen, cost_daily_usd
FROM polo.resource_snapshots FINAL
WHERE cost_daily_usd = 0
  AND resource_type IN ('ec2:instance', 'ebs:volume', 'vpc:nat_gateway', 'elbv2:load_balancer')
  AND first_seen < today() - INTERVAL 2 DAY  -- exclude brand new resources
ORDER BY resource_type, aws_account_id;

Step 2: Check if cost data exists in events

SELECT count(), sum(value), max(event_time)
FROM polo.resource_events
WHERE resource_arn = 'arn:aws:...'
  AND event_type = 'cost';

If no cost events exist, the cost_explorer or cost_cur collector may not be covering this account or resource.

Step 3: Check Cost Explorer coverage

SELECT DISTINCT aws_account_id
FROM polo.resource_events
WHERE collector = 'cost_explorer'
  AND event_time > today() - INTERVAL 1 DAY;

Compare with the list of active accounts. Any account not in the results needs investigation.
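
The comparison is a set difference. A minimal sketch, assuming both lists are available as strings (account IDs can have leading zeros, so they should not be parsed as integers); the function name and the account IDs in the example are placeholders.

```python
def uncovered_accounts(active_accounts, covered_accounts):
    """Return active account IDs with no recent cost_explorer events."""
    return sorted(set(active_accounts) - set(covered_accounts))


active = ["000000000001", "000000000002", "000000000003"]
covered = ["000000000001", "000000000003"]
gaps = uncovered_accounts(active, covered)
```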

Tracking Quality Trends

Week-over-week trend

SELECT
    day,
    round(completeness_score, 3) AS score,
    total_resources,
    complete_resources,
    orphan_resources
FROM polo.data_quality_daily FINAL
WHERE dimension = 'overall' AND dimension_value = '*'
ORDER BY day DESC
LIMIT 30;

Look for:

Gradual decline: Suggests new resources being created without proper tagging. Check recent instance launches.
Sudden drop: Suggests a collector failure, a bulk resource creation without tags, or a code change that broke enrichment.
Steady improvement: The system is working. Ratchet alert thresholds tighter.
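
The three patterns above can be told apart with a simple rule of thumb. This is a sketch under stated assumptions: the 0.05 day-over-day drop threshold is arbitrary, the labels are illustrative, and scores must be ordered oldest first (the query returns newest first, so reverse its output).

```python
def classify_trend(scores, drop_threshold=0.05):
    """Classify a series of daily completeness scores, oldest first."""
    if len(scores) < 2:
        return "insufficient data"
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    if min(deltas) <= -drop_threshold:
        return "sudden drop"
    net_change = scores[-1] - scores[0]
    if net_change < 0:
        return "gradual decline"
    if net_change > 0:
        return "improvement"
    return "flat"


newest_first = [0.80, 0.91, 0.92]          # order returned by the query
label = classify_trend(newest_first[::-1])
```

A single large drop takes priority over the net change, since it usually points at a discrete event such as a collector failure.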

Per-account trend

SELECT
    day,
    dimension_value AS account_id,
    round(completeness_score, 3) AS score,
    total_resources
FROM polo.data_quality_daily FINAL
WHERE dimension = 'account'
ORDER BY day DESC, score ASC
LIMIT 50;

Accounts with consistently low scores need a tagging audit. Accounts with declining scores may have new untagged workloads.
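
The two account categories can be separated once the rows are grouped per account. A sketch only: it reuses the 0.80 floor from the daily health check as the "low" cutoff (an assumption), compares the first and last day to detect decline, and expects `day` values that sort chronologically, such as ISO date strings.

```python
from collections import defaultdict


def account_flags(rows, low=0.80):
    """rows: iterable of (day, account_id, score) in any order.

    Returns account_id -> "consistently low" or "declining".
    Accounts matching neither pattern are omitted.
    """
    by_account = defaultdict(list)
    for day, account_id, score in rows:
        by_account[account_id].append((day, score))

    flags = {}
    for account_id, series in by_account.items():
        series.sort()  # chronological order
        scores = [score for _, score in series]
        if all(score < low for score in scores):
            flags[account_id] = "consistently low"
        elif len(scores) >= 2 and scores[-1] < scores[0]:
            flags[account_id] = "declining"
    return flags
```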

Backfilling After Quality Issues

When to backfill

Backfill is needed when data quality issues caused incorrect data to be stored and the incorrect data is still being used for analysis. Common scenarios:

Collector was down for N hours: Config snapshots are stale. Cost events may have gaps.
Enrichment bug: Resources were tagged incorrectly for a period.
Missing relationship edges: Cost rollups were incomplete.

How to backfill

Config data (snapshots, relationships): Re-run the affected collectors. They perform full state snapshots, so the next successful run will correct the data. No historical backfill needed — the snapshot is always "now."

Cost data: If cost_explorer missed a day, re-run it with a custom date range (if the collector supports it). Cost Explorer API data is available for up to 12 months retroactively. CUR data in S3 is immutable and can be reprocessed.

Hierarchy/ancestry: Re-run hierarchy_builder. It performs a full rebuild from current state. No historical dimension — it's always "now."

Quality metrics: Re-run the data_quality collector. It re-computes the current day's metrics. Historical quality scores from past days cannot be retroactively corrected (and shouldn't be — they accurately reflect what the system saw at that time).
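
For the cost data case, the first step is knowing which days are missing. A minimal sketch, assuming the days that do have cost events have already been queried into a list of `date` objects; the function name is hypothetical.

```python
from datetime import date, timedelta


def missing_days(days_present, start, end):
    """Return the dates in [start, end] (inclusive) not in days_present."""
    present = set(days_present)
    total = (end - start).days + 1
    candidates = (start + timedelta(days=offset) for offset in range(total))
    return [day for day in candidates if day not in present]


have = [date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 4)]
gaps = missing_days(have, date(2024, 3, 1), date(2024, 3, 5))
```

Each date in `gaps` is a candidate for a custom date range re-run, if the collector supports one.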

Preventive Measures

Tagging Discipline

The most impactful thing you can do for data quality is ensure resources are tagged at creation time:

CloudFormation / CDK templates should include marqo:customer, marqo:cluster, marqo:index tags on all taggable resources
Tag policies in AWS Organizations can enforce required tags
Polo tag validation rules (via rule_evaluator) can flag untagged resources immediately
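
A required-tag check is simple enough to run in a template linter or a pre-deploy script. The sketch below is not the rule_evaluator implementation; it assumes tags arrive as a plain dict of key to value and treats blank values as missing.

```python
REQUIRED_TAGS = ("marqo:customer", "marqo:cluster", "marqo:index")


def missing_required_tags(tags):
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not (tags.get(key) or "").strip()]


example = {"marqo:customer": "acme", "marqo:cluster": " ", "Name": "worker-1"}
problems = missing_required_tags(example)
```

In this example `marqo:cluster` is reported because its value is only whitespace, and `marqo:index` because it is absent.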

Collector Monitoring

Set up CloudWatch alarms on Lambda function errors and duration. A collector that takes 10x longer than usual may be encountering API throttling or processing unexpectedly large datasets.
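
The 10x heuristic can be expressed as a comparison against the median of recent runs, which is less sensitive to a single earlier outlier than the mean. This is a sketch of the check itself, not of a CloudWatch alarm definition; the function name and the sample durations are made up.

```python
from statistics import median


def is_slow_run(recent_durations, latest, factor=10.0):
    """True when the latest run took at least `factor` times the median.

    Durations can be in any unit as long as it is consistent.
    Returns False when there is no history to compare against.
    """
    if not recent_durations:
        return False
    return latest >= factor * median(recent_durations)


history_seconds = [30, 32, 28]
slow = is_slow_run(history_seconds, latest=400)
```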

Regular Quality Reviews

Review the quality trend dashboard weekly, aiming for monotonically improving scores. When scores plateau, investigate whether the remaining gaps are:

Fixable (untagged resources that should be tagged)
Expected (infrastructure resources that genuinely don't need business metadata)
Expectations too strict (fields in expectations.py that aren't actually needed)

Adjust expectations to match reality, not the other way around — but only after confirming the gap is genuinely expected, not just unfixed.
