
Data Quality Runbook

Procedures for monitoring, investigating, and resolving data quality issues in Polo.

Daily Health Check

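Run this query to get the current quality snapshot. The data_quality collector runs daily at 07:00 UTC, so before that run completes the query may return no rows for today(); in that case, substitute yesterday() for today().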

SELECT
dimension_value AS what,
total_resources,
complete_resources,
round(completeness_score, 3) AS score,
round(tag_coverage, 3) AS tags,
orphan_resources AS orphans,
missing_cost
FROM polo.data_quality_daily FINAL
WHERE day = today() AND dimension = 'resource_type'
ORDER BY total_resources DESC;

What to look for:

  • score < 0.80 for any resource type → investigate missing metadata (see "Low Completeness Score" below)
  • orphans > 0 for ec2:instance → hierarchy builder may be failing or relationships not being discovered
  • missing_cost > 0 for ec2:instance or ebs:volume → cost pipeline may be broken
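The checks above can be applied mechanically to the rows the query returns. The following is a minimal sketch, assuming each row is available as a dict keyed by the query's column aliases; the function name flag_daily_issues and the row format are illustrative and not part of Polo.

```python
# Hypothetical helper: applies the daily health-check rules to query rows.
# Assumes each row is a dict keyed by the column aliases used in the query.

SCORE_THRESHOLD = 0.80  # from the runbook: score < 0.80 needs investigation


def flag_daily_issues(rows):
    """Return (resource_type, message) tuples for rows that need attention."""
    issues = []
    for row in rows:
        rtype = row["what"]
        if row["score"] < SCORE_THRESHOLD:
            issues.append((rtype, "low completeness score"))
        if rtype == "ec2:instance" and row["orphans"] > 0:
            issues.append((rtype, "orphan instances"))
        if rtype in ("ec2:instance", "ebs:volume") and row["missing_cost"] > 0:
            issues.append((rtype, "missing cost data"))
    return issues


# Made-up sample rows for illustration only.
rows = [
    {"what": "ec2:instance", "score": 0.91, "orphans": 2, "missing_cost": 0},
    {"what": "ebs:volume", "score": 0.75, "orphans": 0, "missing_cost": 3},
]
for rtype, message in flag_daily_issues(rows):
    print(f"{rtype}: {message}")
```

Each message corresponds to one of the investigation sections in this runbook.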

Investigating Low Completeness Score

Step 1: Identify which fields are missing

SELECT
field_fill_rates
FROM polo.data_quality_daily FINAL
WHERE day = today()
AND dimension = 'resource_type'
AND dimension_value = 'ec2:instance'; -- or whatever resource type

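This returns a map like {'marqo_customer': 0.85, 'role': 0.92, 'system': 0.95, ...}, where each value is the fraction of resources with that field populated. The fields with the lowest rates contribute most to the low score, so start with the lowest one.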
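When the map has many fields, sorting it makes the weakest ones obvious. This is a small sketch, assuming the map has been loaded into a Python dict; the helper name lowest_fill_rates is hypothetical.

```python
# Hypothetical helper: ranks fields by fill rate, lowest first.
def lowest_fill_rates(field_fill_rates, limit=3):
    """Return up to `limit` (field, rate) pairs with the lowest fill rates."""
    ranked = sorted(field_fill_rates.items(), key=lambda item: item[1])
    return ranked[:limit]


# Values taken from the sample map above.
rates = {"marqo_customer": 0.85, "role": 0.92, "system": 0.95}
for field, rate in lowest_fill_rates(rates, limit=2):
    print(f"{field}: {rate:.2f}")
```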

Step 2: Find the specific resources

SELECT
resource_arn, resource_name, aws_account_id, aws_region,
marqo_customer, marqo_cluster, marqo_index, role, system, account_role
FROM polo.resource_snapshots FINAL
WHERE resource_type = 'ec2:instance'
AND marqo_customer = '' -- replace with the missing field
ORDER BY aws_account_id, resource_name;

Step 3: Determine root cause

Common causes of missing metadata:

Missing Field | Likely Cause | Resolution
marqo_customer | Resource not tagged with marqo:customer in AWS | Tag the resource in AWS. Check if it's a Marqo Cloud resource or internal infrastructure.
marqo_cluster | Resource not tagged with marqo:cluster | Tag the resource. If it's a standalone instance (not in a cluster), this may be expected; consider adjusting expectations.
role | Instance name doesn't match any pattern in enrichment.py, and no polo.role tag | Add a polo.role tag to the instance, or add a name pattern to _instance_role() in enrichment.py.
system | Account not in _ACCOUNT_SYSTEM mapping, and role not in _ROLE_TO_SYSTEM | Add the account to _ACCOUNT_SYSTEM in enrichment.py, or ensure the instance has a role that maps to a system.
account_role | Account missing from hierarchy_nodes | Run the hierarchy_admin collector to discover the account, or add it manually.

Step 4: Verify the fix

After tagging resources or updating enrichment logic:

  1. Wait for the next collector run (~15 min for config collectors)
  2. Wait for snapshot_builder to update snapshots (~15 min after config)
  3. Check resource_snapshots to verify the field is now populated
  4. The next data_quality run (daily at 07:00 UTC) will reflect the improvement
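The waiting times in these steps can be turned into concrete timestamps. The sketch below assumes the approximate intervals listed above (about 15 minutes each) and the 07:00 UTC daily run; the function name verification_times is hypothetical, and actual collector timing may differ.

```python
from datetime import datetime, time, timedelta, timezone

# Timings taken from the steps above; treat them as approximate.
COLLECTOR_INTERVAL = timedelta(minutes=15)
SNAPSHOT_DELAY = timedelta(minutes=15)
DATA_QUALITY_RUN_UTC = time(7, 0)


def verification_times(fix_applied_at):
    """Return (earliest snapshot check, first data_quality run after it).

    `fix_applied_at` must be a timezone-aware datetime in UTC.
    """
    snapshot_ready = fix_applied_at + COLLECTOR_INTERVAL + SNAPSHOT_DELAY
    run = datetime.combine(
        snapshot_ready.date(), DATA_QUALITY_RUN_UTC, tzinfo=timezone.utc
    )
    if run < snapshot_ready:
        # Today's run happens before the snapshot is updated; use tomorrow's.
        run += timedelta(days=1)
    return snapshot_ready, run


fix_time = datetime(2024, 1, 10, 6, 50, tzinfo=timezone.utc)
ready, next_run = verification_times(fix_time)
print("check resource_snapshots after:", ready.isoformat())
print("data_quality reflects the fix after:", next_run.isoformat())
```

A fix applied shortly before 07:00 UTC will not appear in that day's quality metrics, which is why the direct query against resource_snapshots is useful.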

For immediate verification without waiting for the daily run:

SELECT
resource_type,
count() AS total,
countIf(marqo_customer != '') AS has_customer,
countIf(role != '') AS has_role,
countIf(account_role != '') AS has_account_role
FROM polo.resource_snapshots FINAL
GROUP BY resource_type
ORDER BY total DESC;

Investigating Orphan Resources

Step 1: List orphans

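SELECT rs.resource_arn, rs.resource_type, rs.resource_name, rs.aws_account_id
FROM polo.resource_snapshots rs FINAL
LEFT JOIN (
SELECT DISTINCT resource_arn FROM polo.resource_ancestry FINAL
) ra ON rs.resource_arn = ra.resource_arn
WHERE ra.resource_arn IS NULL
ORDER BY rs.resource_type, rs.aws_account_id
-- By default ClickHouse fills unmatched LEFT JOIN columns with type defaults
-- ('' for String) rather than NULL, so IS NULL only matches with this setting.
SETTINGS join_use_nulls = 1;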

Step 2: Diagnose why they're orphaned

Common causes:

  • New resource: Created after the last hierarchy_builder run. Wait 15 min.
  • Missing relationship edge: The resource has no cost-parent edge in resource_relationships. Check if the relationships collector is discovering it.
  • Missing tags: The resource has no marqo:* tags AND no physical parent, so the hierarchy_builder can't place it in the logical hierarchy.
  • Hierarchy_builder failure: Check if the collector ran successfully. Look at Lambda logs or query resource_events for recent hierarchy_builder events.

Step 3: Check relationship edges

-- Does this resource have any relationships?
SELECT * FROM polo.resource_relationships FINAL
WHERE source_arn = 'arn:aws:ec2:...' -- the orphan's ARN
OR target_arn = 'arn:aws:ec2:...';

If the query returns no rows, the relationships collector isn't finding edges for this resource. Check:

  • Is it a resource type that the relationships collector covers? (EC2, EBS, EIP, ENI, ELB)
  • Is it in a region/account that the collector is scanning?
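The first check can be done directly from the orphan's ARN. The sketch below assumes the standard AWS ARN layout (arn:partition:service:region:account-id:resource) and maps ARN prefixes to the five covered types listed above; the mapping and the function name covered_type are illustrative and do not reflect how the relationships collector itself decides coverage.

```python
# Resource types the runbook lists as covered by the relationships collector,
# keyed by (ARN service, resource prefix). This mapping is an assumption
# based on standard AWS ARN formats, not taken from the collector's code.
COVERED_PREFIXES = {
    ("ec2", "instance/"): "EC2",
    ("ec2", "volume/"): "EBS",
    ("ec2", "elastic-ip/"): "EIP",
    ("ec2", "network-interface/"): "ENI",
    ("elasticloadbalancing", "loadbalancer/"): "ELB",
}


def covered_type(arn):
    """Return the covered type label for an ARN, or None if it is not covered."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        return None
    service, resource = parts[2], parts[5]
    for (svc, prefix), label in COVERED_PREFIXES.items():
        if service == svc and resource.startswith(prefix):
            return label
    return None


# Example ARNs with placeholder account and resource IDs.
for arn in (
    "arn:aws:ec2:us-east-1:123456789012:volume/vol-0abc",
    "arn:aws:s3:::example-bucket",
):
    print(arn, "->", covered_type(arn))
```

The region and account ID are fields 4 and 5 of the same split, which helps with the second check.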

Investigating Collector Freshness

Step 1: Check last seen times

SELECT
collector_last_seen
FROM polo.data_quality_daily FINAL
WHERE day = today() AND dimension = 'overall';

Or query directly from events:

SELECT
collector,
max(event_time) AS last_event,
dateDiff('minute', max(event_time), now()) AS minutes_ago
FROM polo.resource_events
WHERE event_time > now() - INTERVAL 2 DAY
GROUP BY collector
ORDER BY minutes_ago DESC;
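To decide which collectors count as stale, the minutes_ago values can be compared against each collector's expected cadence. The sketch below is hypothetical: only the roughly 15 minute cadence for config collectors comes from this runbook, and the fallback interval and staleness factor are assumptions to adjust for your deployment.

```python
# Hypothetical staleness check over the rows returned by the query above.
# Only the ~15 minute config cadence comes from this runbook.
EXPECTED_INTERVAL_MINUTES = {
    "config_ec2": 15,
    "config_ebs": 15,
}
DEFAULT_INTERVAL_MINUTES = 60  # assumed fallback for collectors not listed
STALE_FACTOR = 3               # assumed: stale after 3 missed intervals


def stale_collectors(minutes_ago_by_collector):
    """Return the sorted names of collectors whose last event is too old."""
    stale = []
    for collector, minutes_ago in minutes_ago_by_collector.items():
        expected = EXPECTED_INTERVAL_MINUTES.get(collector, DEFAULT_INTERVAL_MINUTES)
        if minutes_ago > expected * STALE_FACTOR:
            stale.append(collector)
    return sorted(stale)


# Made-up sample values for illustration only.
observed = {"config_ec2": 50, "config_ebs": 10, "cost_explorer": 100}
print(stale_collectors(observed))
```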

Step 2: Investigate stale collectors

Collector Stale | What It Means | Impact
config_ec2 | EC2 snapshots are outdated | New/terminated instances not reflected, stale metadata
config_ebs | Volume state outdated | Detached volumes still show as attached
relationships | Physical hierarchy edges outdated | New attachments not tracked, cost rollup incomplete
hierarchy_builder | Ancestry closure table outdated | Orphan count inflated, cost rollups stale
cost_explorer | Cost data not flowing | Missing cost > 0 for resources that should have costs
tags | Tag changes not reflected | Metadata completeness may be understated

Step 3: Check Lambda execution

Look at CloudWatch Logs for the collector's Lambda function. Common failures:

  • AssumeRole failure: The cross-account role may have expired or been modified
  • ClickHouse connection timeout: ClickHouse instance may be overloaded or unreachable
  • API throttling: AWS API rate limits hit during collection

Investigating Missing Cost Data

Step 1: Which resources are missing costs?

SELECT
resource_arn, resource_type, resource_name,
aws_account_id, first_seen, cost_daily_usd
FROM polo.resource_snapshots FINAL
WHERE cost_daily_usd = 0
AND resource_type IN ('ec2:instance', 'ebs:volume', 'vpc:nat_gateway', 'elbv2:load_balancer')
AND first_seen < today() - INTERVAL 2 DAY -- exclude brand new resources
ORDER BY resource_type, aws_account_id;

Step 2: Check if cost data exists in events

SELECT count(), sum(value), max(event_time)
FROM polo.resource_events
WHERE resource_arn = 'arn:aws:...'
AND event_type = 'cost';

If no cost events exist, the cost_explorer or cost_cur collector may not be covering this account or resource.

Step 3: Check Cost Explorer coverage

SELECT DISTINCT aws_account_id
FROM polo.resource_events
WHERE collector = 'cost_explorer'
AND event_time > today() - INTERVAL 1 DAY;

Compare with the list of active accounts. Any account not in the results needs investigation.
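The comparison is a set difference. A minimal sketch, assuming both lists have been loaded as strings of account IDs (the function name and the sample IDs are placeholders):

```python
def accounts_missing_cost_events(active_accounts, accounts_with_cost_events):
    """Return active account IDs that have no recent cost_explorer events."""
    return sorted(set(active_accounts) - set(accounts_with_cost_events))


# Placeholder account IDs for illustration only.
active = ["111111111111", "222222222222", "333333333333"]
seen = ["111111111111", "333333333333"]
print(accounts_missing_cost_events(active, seen))
```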

Week-over-week trend

SELECT
day,
round(completeness_score, 3) AS score,
total_resources,
complete_resources,
orphan_resources
FROM polo.data_quality_daily FINAL
WHERE dimension = 'overall' AND dimension_value = '*'
ORDER BY day DESC
LIMIT 30;

Look for:

  • Gradual decline: Suggests new resources being created without proper tagging. Check recent instance launches.
  • Sudden drop: Suggests a collector failure, a bulk resource creation without tags, or a code change that broke enrichment.
  • Steady improvement: The system is working. Ratchet alert thresholds tighter.
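These three patterns can be told apart from the score series alone. The sketch below is one possible classification; the thresholds (a 0.05 single-day drop, a 0.005 tolerance) and the function name classify_trend are assumptions, not values used by Polo. It expects scores in chronological order, so reverse the query results first because the query sorts by day DESC.

```python
def classify_trend(scores, sudden_drop=0.05, tolerance=0.005):
    """Classify a chronological list of daily scores (oldest first)."""
    if len(scores) < 2:
        return "insufficient data"
    # Day-over-day changes; a large negative change marks a sudden drop.
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    if min(deltas) <= -sudden_drop:
        return "sudden drop"
    net = scores[-1] - scores[0]
    if net < -tolerance:
        return "gradual decline"
    if net > tolerance:
        return "steady improvement"
    return "stable"


# Made-up series for illustration only.
print(classify_trend([0.90, 0.89, 0.88, 0.87]))
print(classify_trend([0.90, 0.91, 0.80]))
```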

Per-account trend

SELECT
day,
dimension_value AS account_id,
round(completeness_score, 3) AS score,
total_resources
FROM polo.data_quality_daily FINAL
WHERE dimension = 'account'
ORDER BY day DESC, score ASC
LIMIT 50;

Accounts with consistently low scores need a tagging audit. Accounts with declining scores may have new untagged workloads.

Backfilling After Quality Issues

When to backfill

Backfill is needed when data quality issues caused incorrect data to be stored and the incorrect data is still being used for analysis. Common scenarios:

  1. Collector was down for N hours: Config snapshots are stale. Cost events may have gaps.
  2. Enrichment bug: Resources were tagged incorrectly for a period.
  3. Missing relationship edges: Cost rollups were incomplete.

How to backfill

Config data (snapshots, relationships): Re-run the affected collectors. They perform full state snapshots, so the next successful run will correct the data. No historical backfill needed — the snapshot is always "now."

Cost data: If cost_explorer missed a day, re-run it with a custom date range (if the collector supports it). Cost Explorer API data is available for up to 12 months retroactively. CUR data in S3 is immutable and can be reprocessed.

Hierarchy/ancestry: Re-run hierarchy_builder. It performs a full rebuild from current state. No historical dimension — it's always "now."

Quality metrics: Re-run the data_quality collector. It re-computes the current day's metrics. Historical quality scores from past days cannot be retroactively corrected (and shouldn't be — they accurately reflect what the system saw at that time).

Preventive Measures

Tagging Discipline

The most impactful thing you can do for data quality is ensure resources are tagged at creation time:

  1. CloudFormation / CDK templates should include marqo:customer, marqo:cluster, marqo:index tags on all taggable resources
  2. Tag policies in AWS Organizations can enforce required tags
  3. Polo tag validation rules (via rule_evaluator) can flag untagged resources immediately
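A pre-deployment check for the three tags named above could look like the following. This is a standalone sketch, not the rule_evaluator implementation; it assumes a resource's tags are available as a plain dict of key to value.

```python
# Tag keys named in this runbook as required on taggable resources.
REQUIRED_TAGS = ("marqo:customer", "marqo:cluster", "marqo:index")


def missing_required_tags(tags):
    """Return required tag keys that are absent or empty in a tag dict."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


# Made-up tag set for illustration only.
example_tags = {"marqo:customer": "acme", "marqo:cluster": "", "Name": "node-1"}
print(missing_required_tags(example_tags))
```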

Collector Monitoring

Set up CloudWatch alarms on Lambda function errors and duration. A collector that takes 10x longer than usual may be encountering API throttling or processing unexpectedly large datasets.
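As an illustration, the alarm definitions can be built as plain dicts and then passed to the CloudWatch PutMetricAlarm API. The sketch below only constructs the parameters and makes no AWS calls; the function name lambda_alarm_params, the alarm names, the 5 minute period, and the example Lambda name are all assumptions, and the 10x factor follows the rule of thumb above.

```python
def lambda_alarm_params(function_name, baseline_duration_ms, duration_factor=10):
    """Build PutMetricAlarm keyword arguments for a Lambda's errors and duration."""
    common = {
        "Namespace": "AWS/Lambda",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Period": 300,             # assumed 5 minute evaluation window
        "EvaluationPeriods": 1,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    errors = {
        **common,
        "AlarmName": f"{function_name}-errors",
        "MetricName": "Errors",
        "Statistic": "Sum",
        "Threshold": 0,            # alarm on any error in the window
    }
    duration = {
        **common,
        "AlarmName": f"{function_name}-duration",
        "MetricName": "Duration",  # reported in milliseconds
        "Statistic": "Maximum",
        "Threshold": baseline_duration_ms * duration_factor,
    }
    return [errors, duration]


# "polo-config-ec2" is a placeholder function name.
for params in lambda_alarm_params("polo-config-ec2", baseline_duration_ms=4000):
    print(params["AlarmName"], params["MetricName"], params["Threshold"])
```

With boto3, each dict can be applied with `boto3.client("cloudwatch").put_metric_alarm(**params)`; add AlarmActions if the alarm should notify an SNS topic.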

Regular Quality Reviews

Review the quality trend dashboard weekly. Aim for scores that improve or hold steady from week to week. When scores plateau, investigate whether the remaining gaps are:

  • Fixable (untagged resources that should be tagged)
  • Expected (infrastructure resources that genuinely don't need business metadata)
  • Expectations too strict (fields in expectations.py that aren't actually needed)

Adjust expectations to match reality, not the other way around — but only after confirming the gap is genuinely expected, not just unfixed.