Data Quality Runbook
Procedures for monitoring, investigating, and resolving data quality issues in Polo.
Daily Health Check
Run this query to get the current quality snapshot. The data_quality collector runs daily at 07:00 UTC, so today's rows may not exist before that run:
SELECT
dimension_value AS what,
total_resources,
complete_resources,
round(completeness_score, 3) AS score,
round(tag_coverage, 3) AS tags,
orphan_resources AS orphans,
missing_cost
FROM polo.data_quality_daily FINAL
WHERE day = today() AND dimension = 'resource_type'
ORDER BY total_resources DESC;
What to look for:
- score < 0.80 for any resource type → investigate missing metadata (see "Investigating Low Completeness Score" below)
- orphans > 0 for ec2:instance → hierarchy builder may be failing or relationships not being discovered
- missing_cost > 0 for ec2:instance or ebs:volume → cost pipeline may be broken
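The three checks above can be applied mechanically to the query output. The following is a minimal sketch, assuming the rows have already been fetched as dictionaries keyed by the query's column aliases; the function name `triage` and the message wording are illustrative, not part of Polo.

```python
from typing import Dict, List

# Thresholds copied from the "What to look for" list above.
MIN_SCORE = 0.80
ORPHAN_TYPES = {"ec2:instance"}
COST_TYPES = {"ec2:instance", "ebs:volume"}


def triage(rows: List[Dict]) -> List[str]:
    """Return one finding per rule violated by the health-check rows.

    Each row is assumed to carry the aliases from the query:
    what, score, orphans, missing_cost.
    """
    findings = []
    for row in rows:
        what = row["what"]
        if row["score"] < MIN_SCORE:
            findings.append(f"{what}: low completeness score {row['score']}")
        if what in ORPHAN_TYPES and row["orphans"] > 0:
            findings.append(f"{what}: {row['orphans']} orphan resources")
        if what in COST_TYPES and row["missing_cost"] > 0:
            findings.append(f"{what}: {row['missing_cost']} resources missing cost")
    return findings
```

An empty result means none of the three conditions fired for today's snapshot.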
Investigating Low Completeness Score
Step 1: Identify which fields are missing
SELECT
field_fill_rates
FROM polo.data_quality_daily FINAL
WHERE day = today()
AND dimension = 'resource_type'
AND dimension_value = 'ec2:instance'; -- or whatever resource type
This returns a map like {'marqo_customer': 0.85, 'role': 0.92, 'system': 0.95, ...}. The field with the lowest rate is the culprit.
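When the map has many fields, sorting it makes the culprit obvious. This is a small sketch that assumes the map has been read into a Python dict; `lowest_fill_rates` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def lowest_fill_rates(
    field_fill_rates: Dict[str, float], n: int = 3
) -> List[Tuple[str, float]]:
    """Return the n fields with the lowest fill rate, worst first."""
    return sorted(field_fill_rates.items(), key=lambda item: item[1])[:n]
```

Use the first field returned as the filter column in Step 2.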
Step 2: Find the specific resources
SELECT
resource_arn, resource_name, aws_account_id, aws_region,
marqo_customer, marqo_cluster, marqo_index, role, system, account_role
FROM polo.resource_snapshots FINAL
WHERE resource_type = 'ec2:instance'
AND marqo_customer = '' -- replace with the missing field
ORDER BY aws_account_id, resource_name;
Step 3: Determine root cause
Common causes of missing metadata:
| Missing Field | Likely Cause | Resolution |
|---|---|---|
| marqo_customer | Resource not tagged with marqo:customer in AWS | Tag the resource in AWS. Check if it's a Marqo Cloud resource or internal infrastructure. |
| marqo_cluster | Resource not tagged with marqo:cluster | Tag the resource. If it's a standalone instance (not in a cluster), this may be expected; consider adjusting expectations. |
| role | Instance name doesn't match any pattern in enrichment.py, and no polo.role tag | Add a polo.role tag to the instance, or add a name pattern to _instance_role() in enrichment.py. |
| system | Account not in _ACCOUNT_SYSTEM mapping, and role not in _ROLE_TO_SYSTEM | Add the account to _ACCOUNT_SYSTEM in enrichment.py, or ensure the instance has a role that maps to a system. |
| account_role | Account missing from hierarchy_nodes | Run the hierarchy_admin collector to discover the account, or add it manually. |
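The role and system rows describe two-step fallbacks, which is why both conditions must fail before the field ends up empty. The sketch below illustrates that shape only. It is not the code in enrichment.py: the patterns, mappings, function names, and the precedence order (tag before name pattern, account before role) are all assumptions made for illustration.

```python
import re
from typing import Dict

# Hypothetical data; the real patterns and mappings live in enrichment.py.
NAME_PATTERNS = [
    (re.compile(r"^search-"), "search"),
    (re.compile(r"^ingest-"), "ingest"),
]
ACCOUNT_SYSTEM = {"111111111111": "platform"}
ROLE_TO_SYSTEM = {"search": "serving"}


def resolve_role(tags: Dict[str, str], name: str) -> str:
    """Use an explicit polo.role tag first (assumed order), then name patterns."""
    if tags.get("polo.role"):
        return tags["polo.role"]
    for pattern, role in NAME_PATTERNS:
        if pattern.search(name):
            return role
    return ""  # empty string means "missing", as in the Step 2 query


def resolve_system(account_id: str, role: str) -> str:
    """Use the account mapping first (assumed order), then the role mapping."""
    return ACCOUNT_SYSTEM.get(account_id) or ROLE_TO_SYSTEM.get(role, "")
```

Under these assumptions, an instance named "misc-1" with no polo.role tag in an unmapped account resolves to an empty role and an empty system, which matches the two table rows.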
Step 4: Verify the fix
After tagging resources or updating enrichment logic:
- Wait for the next collector run (~15 min for config collectors)
- Wait for snapshot_builder to update snapshots (~15 min after config)
- Check resource_snapshots to verify the field is now populated
- The next data_quality run (daily at 07:00 UTC) will reflect the improvement
For immediate verification without waiting for the daily run:
SELECT
resource_type,
count() AS total,
countIf(marqo_customer != '') AS has_customer,
countIf(role != '') AS has_role,
countIf(account_role != '') AS has_account_role
FROM polo.resource_snapshots FINAL
GROUP BY resource_type
ORDER BY total DESC;
Investigating Orphan Resources
Step 1: List orphans
SELECT rs.resource_arn, rs.resource_type, rs.resource_name, rs.aws_account_id
FROM polo.resource_snapshots rs FINAL
LEFT JOIN (
SELECT DISTINCT resource_arn FROM polo.resource_ancestry FINAL
) ra ON rs.resource_arn = ra.resource_arn
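WHERE ra.resource_arn IS NULL
ORDER BY rs.resource_type, rs.aws_account_id
-- By default ClickHouse fills unmatched LEFT JOIN columns with type defaults
-- ('' for String), so IS NULL only matches when join_use_nulls = 1.
SETTINGS join_use_nulls = 1;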
Step 2: Diagnose why they're orphaned
Common causes:
- New resource: Created after the last hierarchy_builder run. Wait 15 min.
- Missing relationship edge: The resource has no cost-parent edge in resource_relationships. Check if the relationships collector is discovering it.
- Missing tags: The resource has no marqo:* tags AND no physical parent, so the hierarchy_builder can't place it in the logical hierarchy.
- Hierarchy_builder failure: Check if the collector ran successfully. Look at Lambda logs or query resource_events for recent hierarchy_builder events.
Step 3: Check relationship edges
-- Does this resource have any relationships?
SELECT * FROM polo.resource_relationships FINAL
WHERE source_arn = 'arn:aws:ec2:...' -- the orphan's ARN
OR target_arn = 'arn:aws:ec2:...';
If no rows, the relationships collector isn't finding edges for this resource. Check:
- Is it a resource type that the relationships collector covers? (EC2, EBS, EIP, ENI, ELB)
- Is it in a region/account that the collector is scanning?
Investigating Collector Freshness
Step 1: Check last seen times
SELECT
collector_last_seen
FROM polo.data_quality_daily FINAL
WHERE day = today() AND dimension = 'overall';
Or query directly from events:
SELECT
collector,
max(event_time) AS last_event,
dateDiff('minute', max(event_time), now()) AS minutes_ago
FROM polo.resource_events
WHERE event_time > now() - INTERVAL 2 DAY
GROUP BY collector
ORDER BY minutes_ago DESC;
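To turn the last_event column into a yes/no staleness signal, compare each collector's age against an allowed maximum. This is a minimal sketch: the 45-minute default (three missed 15-minute runs) and the 26-hour override for the daily data_quality run are assumed thresholds, and `stale_collectors` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumed thresholds; tune to each collector's real schedule.
DEFAULT_MAX_AGE = timedelta(minutes=45)
MAX_AGE_OVERRIDES = {"data_quality": timedelta(hours=26)}


def stale_collectors(last_event: Dict[str, datetime], now: datetime) -> List[str]:
    """Return collectors whose most recent event is older than allowed."""
    stale = []
    for collector, seen in last_event.items():
        limit = MAX_AGE_OVERRIDES.get(collector, DEFAULT_MAX_AGE)
        if now - seen > limit:
            stale.append(collector)
    return sorted(stale)
```

Pass timezone-aware datetimes for both arguments, or naive ones for both, so the subtraction is valid.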
Step 2: Investigate stale collectors
| Collector Stale | What It Means | Impact |
|---|---|---|
| config_ec2 | EC2 snapshots are outdated | New/terminated instances not reflected, stale metadata |
| config_ebs | Volume state outdated | Detached volumes still show as attached |
| relationships | Physical hierarchy edges outdated | New attachments not tracked, cost rollup incomplete |
| hierarchy_builder | Ancestry closure table outdated | Orphan count inflated, cost rollups stale |
| cost_explorer | Cost data not flowing | Missing cost > 0 for resources that should have costs |
| tags | Tag changes not reflected | Metadata completeness may be understated |
Step 3: Check Lambda execution
Look at CloudWatch Logs for the collector's Lambda function. Common failures:
- AssumeRole failure: The cross-account role may have expired or been modified
- ClickHouse connection timeout: ClickHouse instance may be overloaded or unreachable
- API throttling: AWS API rate limits hit during collection
Investigating Missing Cost Data
Step 1: Which resources are missing costs?
SELECT
resource_arn, resource_type, resource_name,
aws_account_id, first_seen, cost_daily_usd
FROM polo.resource_snapshots FINAL
WHERE cost_daily_usd = 0
AND resource_type IN ('ec2:instance', 'ebs:volume', 'vpc:nat_gateway', 'elbv2:load_balancer')
AND first_seen < today() - INTERVAL 2 DAY -- exclude brand new resources
ORDER BY resource_type, aws_account_id;
Step 2: Check if cost data exists in events
SELECT count(), sum(value), max(event_time)
FROM polo.resource_events
WHERE resource_arn = 'arn:aws:...'
AND event_type = 'cost';
If no cost events exist, the cost_explorer or cost_cur collector may not be covering this account or resource.
Step 3: Check Cost Explorer coverage
SELECT DISTINCT aws_account_id
FROM polo.resource_events
WHERE collector = 'cost_explorer'
AND event_time > today() - INTERVAL 1 DAY;
Compare with the list of active accounts. Any account not in the results needs investigation.
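The comparison is a set difference. A short sketch, assuming the active account list is available from some other source (for example, an export from AWS Organizations); `uncovered_accounts` is a hypothetical helper name.

```python
from typing import Iterable, List


def uncovered_accounts(active: Iterable[str], covered: Iterable[str]) -> List[str]:
    """Return active account IDs that have no recent cost_explorer events."""
    return sorted(set(active) - set(covered))
```

Here `covered` is the aws_account_id column from the query above.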
Tracking Quality Trends
Overall trend (last 30 days)
SELECT
day,
round(completeness_score, 3) AS score,
total_resources,
complete_resources,
orphan_resources
FROM polo.data_quality_daily FINAL
WHERE dimension = 'overall' AND dimension_value = '*'
ORDER BY day DESC
LIMIT 30;
Look for:
- Gradual decline: Suggests new resources being created without proper tagging. Check recent instance launches.
- Sudden drop: Suggests a collector failure, a bulk resource creation without tags, or a code change that broke enrichment.
- Steady improvement: The system is working. Ratchet alert thresholds tighter.
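The three patterns can be told apart with two numbers: the largest single-day change and the net change over the window. The sketch below assumes scores ordered oldest to newest (the query returns newest first, so reverse its output), and both thresholds are illustrative defaults rather than values used by Polo.

```python
from typing import List


def classify_trend(
    scores: List[float],
    drop_threshold: float = 0.05,   # assumed: one-day fall that counts as sudden
    drift_threshold: float = 0.01,  # assumed: net change that counts as a trend
) -> str:
    """Classify a series of daily completeness scores, oldest first."""
    if len(scores) < 2:
        return "insufficient data"
    day_changes = [later - earlier for earlier, later in zip(scores, scores[1:])]
    if min(day_changes) <= -drop_threshold:
        return "sudden drop"
    net = scores[-1] - scores[0]
    if net <= -drift_threshold:
        return "gradual decline"
    if net >= drift_threshold:
        return "steady improvement"
    return "flat"
```

A sudden drop takes priority over the net change, since a collector failure can hide inside an otherwise flat month.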
Per-account trend
SELECT
day,
dimension_value AS account_id,
round(completeness_score, 3) AS score,
total_resources
FROM polo.data_quality_daily FINAL
WHERE dimension = 'account'
ORDER BY day DESC, score ASC
LIMIT 50;
Accounts with consistently low scores need a tagging audit. Accounts with declining scores may have new untagged workloads.
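The two cases can be separated by grouping the query rows per account. This is a sketch under assumptions: rows arrive as (day, account_id, score) tuples with ISO date strings, the 0.80 floor reuses the health-check threshold, and the 0.05 decline margin is an illustrative value.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def account_flags(
    rows: List[Tuple[str, str, float]],
    low: float = 0.80,      # same floor as the daily health check
    decline: float = 0.05,  # assumed margin between first and last score
) -> Dict[str, str]:
    """Flag accounts as 'consistently low' or 'declining'."""
    by_account = defaultdict(list)
    # Sorting by the ISO day string puts each account's scores oldest first.
    for day, account_id, score in sorted(rows):
        by_account[account_id].append(score)
    flags = {}
    for account_id, scores in by_account.items():
        if max(scores) < low:
            flags[account_id] = "consistently low"
        elif scores[0] - scores[-1] >= decline:
            flags[account_id] = "declining"
    return flags
```

Accounts absent from the result are neither persistently below the floor nor falling by more than the margin.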
Backfilling After Quality Issues
When to backfill
Backfill is needed when data quality issues caused incorrect data to be stored and the incorrect data is still being used for analysis. Common scenarios:
- Collector was down for N hours: Config snapshots are stale. Cost events may have gaps.
- Enrichment bug: Resources were tagged incorrectly for a period.
- Missing relationship edges: Cost rollups were incomplete.
How to backfill
Config data (snapshots, relationships): Re-run the affected collectors. They perform full state snapshots, so the next successful run will correct the data. No historical backfill needed — the snapshot is always "now."
Cost data: If cost_explorer missed a day, re-run it with a custom date range (if the collector supports it). Cost Explorer API data is available for up to 12 months retroactively. CUR data in S3 is immutable and can be reprocessed.
Hierarchy/ancestry: Re-run hierarchy_builder. It performs a full rebuild from current state. No historical dimension — it's always "now."
Quality metrics: Re-run the data_quality collector. It re-computes the current day's metrics. Historical quality scores from past days cannot be retroactively corrected (and shouldn't be — they accurately reflect what the system saw at that time).
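For the cost data case, the first task is working out which days are missing. The sketch below assumes you have collected the set of days that do have cost events (for example, from a query on resource_events grouped by day) and returns the inclusive date ranges to request; `missing_day_ranges` is a hypothetical helper and does not call the collector itself.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def missing_day_ranges(
    days_with_cost: Iterable[date], start: date, end: date
) -> List[Tuple[date, date]]:
    """Return inclusive (first, last) ranges in [start, end] with no cost events."""
    present = set(days_with_cost)
    ranges = []
    gap_start = None
    day = start
    while day <= end:
        if day not in present:
            if gap_start is None:
                gap_start = day  # a new gap opens here
        elif gap_start is not None:
            ranges.append((gap_start, day - timedelta(days=1)))
            gap_start = None  # the gap closed on the previous day
        day += timedelta(days=1)
    if gap_start is not None:
        ranges.append((gap_start, end))
    return ranges
```

Each returned range maps to one custom date range for a re-run, if the collector supports it.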
Preventive Measures
Tagging Discipline
The most impactful thing you can do for data quality is ensure resources are tagged at creation time:
- CloudFormation / CDK templates should include marqo:customer, marqo:cluster, marqo:index tags on all taggable resources
- Tag policies in AWS Organizations can enforce required tags
- Polo tag validation rules (via rule_evaluator) can flag untagged resources immediately
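A required-tag check is simple enough to run in a template review or a pre-deploy script. This is a minimal sketch, not the rule_evaluator implementation: it assumes tags are available as a plain dict and treats blank values as missing.

```python
from typing import Dict, List

# Tag keys named in the list above.
REQUIRED_TAGS = ("marqo:customer", "marqo:cluster", "marqo:index")


def missing_required_tags(tags: Dict[str, str]) -> List[str]:
    """Return required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]
```

An empty list means the resource carries all three tags with non-blank values.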
Collector Monitoring
Set up CloudWatch alarms on Lambda function errors and duration. A collector that takes 10x longer than usual may be encountering API throttling or processing unexpectedly large datasets.
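The "10x longer than usual" rule needs a baseline for "usual". One hedged option is the median of recent run durations, which a single earlier slow run does not distort much. The helper name and the choice of median are assumptions; this is not a CloudWatch feature.

```python
from statistics import median
from typing import List


def is_slow_run(
    duration_s: float, recent_durations_s: List[float], factor: float = 10.0
) -> bool:
    """Return True when a run took at least `factor` times the recent median."""
    if not recent_durations_s:
        return False  # no baseline yet, so nothing to compare against
    return duration_s >= factor * median(recent_durations_s)
```

With recent runs of 28, 30, and 35 seconds, a 300-second run is flagged and a 120-second run is not.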
Regular Quality Reviews
Review the quality trend dashboard weekly. Aim for monotonically improving scores. When scores plateau, investigate whether the remaining gaps are:
- Fixable (untagged resources that should be tagged)
- Expected (infrastructure resources that genuinely don't need business metadata)
- Expectations too strict (fields in expectations.py that aren't actually needed)
Adjust expectations to match reality, not the other way around, but only after confirming the gap is genuinely expected, not just unfixed.