Operational Monitoring

Status: PLANNED — The collector_runs table, /api/v1/system/status endpoint, and System Status UI page are not yet implemented. This document is the design specification for Polo's self-monitoring.

Polo monitors itself. Its own infrastructure (Lambdas, ClickHouse) should be tagged with polo:purpose=polo-infrastructure and appear in the hierarchy. The System Status page will show operational health at a glance.

Collector Health: polo.collector_runs

Every collector writes a row at the end of each run:

CREATE TABLE polo.collector_runs
(
    run_id          UUID DEFAULT generateUUIDv4(),
    collector       LowCardinality(String),
    aws_account_id  LowCardinality(String),
    aws_region      LowCardinality(String),
    started_at      DateTime64(3),
    completed_at    DateTime64(3),
    status          LowCardinality(String),  -- 'succeeded', 'failed', 'partial'
    events_produced UInt32,
    rows_inserted   UInt32,
    error           String DEFAULT '',
    duration_ms     UInt32
)
ENGINE = MergeTree()
ORDER BY (collector, started_at)
TTL started_at + INTERVAL 90 DAY;
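As a minimal sketch of what a collector might assemble at the end of a run, the helper below builds one row matching the columns above. The function name `build_run_row` is hypothetical, and so is the rule for choosing `status` (an error with some rows inserted counts as 'partial', an error with none counts as 'failed'); the spec does not define when a run is partial.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_run_row(collector: str, account_id: str, region: str,
                  started_at: datetime, completed_at: datetime,
                  events_produced: int, rows_inserted: int,
                  error: Optional[str] = None) -> dict:
    """Assemble one polo.collector_runs row; run_id is left to the column default."""
    # Assumed status rule (not from the spec): an error after some rows
    # were written is 'partial', an error with no rows is 'failed'.
    if not error:
        status = "succeeded"
    elif rows_inserted > 0:
        status = "partial"
    else:
        status = "failed"
    # Integer division by a 1 ms timedelta gives whole milliseconds.
    duration_ms = (completed_at - started_at) // timedelta(milliseconds=1)
    return {
        "collector": collector,
        "aws_account_id": account_id,
        "aws_region": region,
        "started_at": started_at,
        "completed_at": completed_at,
        "status": status,
        "events_produced": events_produced,
        "rows_inserted": rows_inserted,
        "error": error or "",
        "duration_ms": duration_ms,
    }
```

The resulting dict can be handed to whichever ClickHouse client the collector uses; the insert call itself is omitted because the spec does not name a client library.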

Freshness monitoring

Each collector has an expected interval. If it hasn't run successfully in 2× that interval, it's stale:

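SELECT
    collector,
    max(completed_at) AS last_run,
    -- Staleness is measured from the last *successful* run, not the last run of any status
    maxIf(completed_at, status = 'succeeded') AS last_success,
    dateDiff('minute', last_success, now()) AS minutes_since_success,
    argMax(status, completed_at) AS last_status,
    argMax(error, completed_at) AS last_error
FROM polo.collector_runs
GROUP BY collector
HAVING minutes_since_success > 60  -- set to 2× the collector's expected interval
ORDER BY minutes_since_success DESC;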
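Because the threshold differs per collector, the 2× rule may be easier to apply in application code than with a single constant in the query. The sketch below assumes a map of expected intervals; the `EXPECTED_INTERVAL_MIN` values and the `cur_ingest` collector name are hypothetical placeholders, not part of the spec.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected intervals in minutes; real values depend on each collector's schedule.
EXPECTED_INTERVAL_MIN = {"config_ec2": 15, "config_ebs": 15, "cur_ingest": 360}


def stale_collectors(last_success: dict, now: datetime,
                     intervals: dict = EXPECTED_INTERVAL_MIN) -> list:
    """Return names of collectors with no success inside 2x their expected interval."""
    stale = []
    for name, interval in intervals.items():
        seen = last_success.get(name)  # None means the collector has never succeeded
        if seen is None or now - seen > timedelta(minutes=2 * interval):
            stale.append(name)
    return sorted(stale)
```

Here `last_success` maps each collector name to the timestamp of its most recent successful run. A collector that has never succeeded is treated as stale.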

Dependency sequencing

The hierarchy_builder checks that config collectors completed recently before running:

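-- Returns 1 only if all three config collectors have a recent successful run
SELECT uniqExact(collector) = 3 AS all_config_fresh
FROM polo.collector_runs
WHERE collector IN ('config_ec2', 'config_ebs', 'config_network')
    AND status = 'succeeded'
    AND completed_at > now() - INTERVAL 20 MINUTE;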
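The gate only makes sense if every config collector is fresh, not just the most recent one. A minimal sketch of that per-collector check follows, assuming the caller has already fetched the last successful `completed_at` for each collector; the function name `dependencies_fresh` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

REQUIRED = ("config_ec2", "config_ebs", "config_network")


def dependencies_fresh(last_success: dict, now: datetime,
                       max_age: timedelta = timedelta(minutes=20),
                       required: tuple = REQUIRED) -> bool:
    """True only if every required collector succeeded within max_age."""
    return all(
        name in last_success and now - last_success[name] < max_age
        for name in required
    )
```

If the check returns False, the hierarchy_builder could skip the run or retry later; which of the two it does is left open by the spec.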

Account Coverage

Coverage is summarized from polo.aws_accounts, grouped by status:

SELECT
    polo_status,
    count() AS account_count,
    groupArray(account_name) AS accounts
FROM polo.aws_accounts FINAL
GROUP BY polo_status;

Accounts in pending_setup or no_access appear as warnings. Accounts that have not been scanned in the last 24 hours generate alerts.
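The two rules can be sketched as a single pass over the account rows. The `last_scanned_at` field is a hypothetical column name (the spec does not say where the last scan time is stored), and the sketch assumes the 24-hour check applies only to accounts that are not already flagged by status.

```python
from datetime import datetime, timedelta, timezone

WARNING_STATUSES = {"pending_setup", "no_access"}


def coverage_findings(accounts: list, now: datetime,
                      max_scan_age: timedelta = timedelta(hours=24)) -> tuple:
    """Split accounts into status warnings and scan-age alerts."""
    warnings, alerts = [], []
    for acct in accounts:
        if acct["polo_status"] in WARNING_STATUSES:
            warnings.append(acct["account_name"])
            continue  # assumption: do not also alert on scan age for these
        last_scanned = acct.get("last_scanned_at")  # hypothetical column
        if last_scanned is None or now - last_scanned > max_scan_age:
            alerts.append(acct["account_name"])
    return warnings, alerts
```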

System Status UI Page

/system shows:

  • Collector grid: One card per collector showing last run time, status (green/amber/red), duration, event count, and last error
  • Account coverage: Table of accounts by status, with red highlight for no_access and pending_setup
  • ClickHouse health: Disk usage, query latency (p50/p95/p99), active queries count
  • SQS health: Queue depth, dead-letter queue depth (should be 0)
  • Collector error rate: Sparkline of failures over last 7 days per collector

API Endpoint

GET /api/v1/system/status

Returns:

{
  "collectors": [
    {
      "name": "config_ec2",
      "last_run": "2025-03-28T14:00:12Z",
      "status": "succeeded",
      "minutes_ago": 15,
      "health": "green"
    },
    ...
  ],
  "accounts": {
    "active": 18,
    "pending_setup": 2,
    "no_access": 0
  },
  "clickhouse": {
    "disk_used_gb": 12.4,
    "query_p95_ms": 180
  }
}
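The spec does not define how the "health" field is derived. One possible mapping, offered purely as an assumption, follows the alert severities in the Alerting section: a stale collector (critical) is red, a collector whose last run did not succeed (warning level) is amber, and everything else is green. The function name and parameters are hypothetical.

```python
def collector_health(last_status: str, minutes_since_success: float,
                     expected_interval_min: float) -> str:
    """Map a collector's state to green/amber/red (assumed mapping, not from the spec)."""
    if minutes_since_success > 2 * expected_interval_min:
        return "red"    # stale: no success in 2x the expected interval
    if last_status != "succeeded":
        return "amber"  # last run failed or was partial, but not yet stale
    return "green"
```

With a 15-minute interval, a collector that last succeeded 15 minutes ago maps to green, matching the config_ec2 entry in the example response.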

Alerting

Operational alerts funnel through the same notification system as anomalies and rule violations:

| Alert | Trigger | Severity |
|---|---|---|
| Collector failure | 3 consecutive failures | warning |
| Collector stale | No success in 2× expected interval | critical |
| Account no access | AssumeRole fails | warning |
| DLQ non-empty | Dead-letter queue has messages | critical |
| ClickHouse disk > 80% | Disk usage threshold | warning |
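The rules in the table could be evaluated along the following lines. This is a sketch under stated assumptions: the function names and input shapes are hypothetical, a 'partial' run is assumed to break a failure streak, and delivery to the notification system is left out.

```python
from typing import Dict, List, Tuple


def trailing_failures(statuses: List[str]) -> int:
    """Count consecutive 'failed' runs at the end of an oldest-first status list."""
    count = 0
    for status in reversed(statuses):
        if status != "failed":  # assumption: 'partial' does not extend a streak
            break
        count += 1
    return count


def evaluate_alerts(run_history: Dict[str, List[str]], stale: List[str],
                    no_access_accounts: List[str], dlq_depth: int,
                    disk_used_pct: float) -> List[Tuple[str, str, str]]:
    """Return (alert, subject, severity) tuples for the rules in the table."""
    alerts = []
    for collector, statuses in sorted(run_history.items()):
        if trailing_failures(statuses) >= 3:
            alerts.append(("Collector failure", collector, "warning"))
    for collector in stale:
        alerts.append(("Collector stale", collector, "critical"))
    for account in no_access_accounts:
        alerts.append(("Account no access", account, "warning"))
    if dlq_depth > 0:
        alerts.append(("DLQ non-empty", "dlq", "critical"))
    if disk_used_pct > 80:
        alerts.append(("ClickHouse disk > 80%", "clickhouse", "warning"))
    return alerts
```

`run_history` maps each collector to its recent statuses, oldest first; `stale` is the list of collectors that failed the freshness check.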

Polo's Own Cost

Tag Polo's own resources with polo:purpose=polo-infrastructure:

  • Lambda functions
  • SQS queues
  • ClickHouse instance (or Cloud subscription)
  • CUR S3 bucket
  • Secrets Manager secrets

These show up in the hierarchy under a dedicated node, giving a first-class "how much does Polo cost to run?" metric.