Operational Monitoring
Status: PLANNED. The collector_runs table, /api/v1/system/status endpoint, and System Status UI page are not yet implemented. This document is the design specification for Polo's self-monitoring.
Polo monitors itself. Its own infrastructure (Lambdas, ClickHouse) should be tagged with polo:purpose=polo-infrastructure and appear in the hierarchy. The System Status page will show operational health at a glance.
Collector Health: polo.collector_runs
Every collector writes a row at the end of each run:
CREATE TABLE polo.collector_runs
(
    run_id           UUID DEFAULT generateUUIDv4(),
    collector        LowCardinality(String),
    aws_account_id   LowCardinality(String),
    aws_region       LowCardinality(String),
    started_at       DateTime64(3),
    completed_at     DateTime64(3),
    status           LowCardinality(String),  -- 'succeeded', 'failed', 'partial'
    events_produced  UInt32,
    rows_inserted    UInt32,
    error            String DEFAULT '',
    duration_ms      UInt32
)
ENGINE = MergeTree()
ORDER BY (collector, started_at)
TTL started_at + INTERVAL 90 DAY;
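As a minimal sketch of the end-of-run write, the helper below assembles one row matching the columns above. The function name `build_run_row`, the use of Python, and the rule for choosing between 'failed' and 'partial' are assumptions for illustration; the spec does not define them. Inserting the row into ClickHouse is left out.

```python
import uuid
from datetime import datetime, timedelta, timezone


def build_run_row(collector, aws_account_id, aws_region, started_at,
                  completed_at, events_produced, rows_inserted, error=""):
    """Build one polo.collector_runs row as a dict (hypothetical helper)."""
    # Assumed rule: an error with nothing inserted is 'failed',
    # an error with some rows inserted is 'partial'.
    if error and rows_inserted == 0:
        status = "failed"
    elif error:
        status = "partial"
    else:
        status = "succeeded"
    # Exact integer milliseconds, avoiding float rounding.
    duration_ms = (completed_at - started_at) // timedelta(milliseconds=1)
    return {
        "run_id": str(uuid.uuid4()),
        "collector": collector,
        "aws_account_id": aws_account_id,
        "aws_region": aws_region,
        "started_at": started_at,
        "completed_at": completed_at,
        "status": status,
        "events_produced": events_produced,
        "rows_inserted": rows_inserted,
        "error": error,
        "duration_ms": duration_ms,
    }
```

Building the row in one place keeps duration_ms consistent with started_at and completed_at, rather than having each collector measure it separately.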
Freshness monitoring
Each collector has an expected interval. If it hasn't run successfully in 2× that interval, it's stale:
SELECT
    collector,
    max(completed_at) AS last_run,
    -- staleness is measured from the last *successful* run
    maxIf(completed_at, status = 'succeeded') AS last_success,
    dateDiff('minute', last_success, now()) AS minutes_ago,
    argMax(status, completed_at) AS last_status,
    argMax(error, completed_at) AS last_error
FROM polo.collector_runs
GROUP BY collector
HAVING minutes_ago > 60  -- 2× the expected interval; adjust per collector
ORDER BY minutes_ago DESC;
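The query uses a single 60-minute threshold. A per-collector version of the 2× rule could look like the sketch below. The interval values in `EXPECTED_INTERVAL` are placeholders, not figures from this spec, and `is_stale` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Placeholder intervals; the real values are not defined in this document.
EXPECTED_INTERVAL = {
    "config_ec2": timedelta(minutes=15),
    "hierarchy_builder": timedelta(minutes=30),
}


def is_stale(collector, last_success, now):
    """True when the last successful run is older than 2x the expected interval.

    last_success is the completed_at of the latest 'succeeded' run,
    or None if the collector has never succeeded.
    """
    if last_success is None:
        return True
    return now - last_success > 2 * EXPECTED_INTERVAL[collector]
```

A collector that has never succeeded is treated as stale here, so newly deployed collectors show up until their first good run.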
Dependency sequencing
The hierarchy_builder checks that config collectors completed recently before running:
SELECT max(completed_at)
FROM polo.collector_runs
WHERE collector IN ('config_ec2', 'config_ebs', 'config_network')
  AND status = 'succeeded'
  AND completed_at > now() - INTERVAL 20 MINUTE;
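Note that this query returns one timestamp across all three collectors, so it cannot tell which of them ran. If the intent is that every config collector must have succeeded within the window (an assumption, since the spec does not say), the check could be made per collector, as in this sketch. `dependencies_ready` is a hypothetical name, and its input is assumed to come from a query grouped by collector.

```python
from datetime import datetime, timedelta, timezone

REQUIRED = ("config_ec2", "config_ebs", "config_network")


def dependencies_ready(last_success, now, window=timedelta(minutes=20)):
    """Return (ready, missing) for the hierarchy_builder precondition.

    last_success maps collector name -> completed_at of its latest
    'succeeded' run; collectors with no successful run are absent.
    """
    cutoff = now - window
    missing = [
        name for name in REQUIRED
        if last_success.get(name) is None or last_success[name] <= cutoff
    ]
    return (not missing, missing)
```

Returning the list of missing collectors lets the hierarchy_builder record why it skipped a run in its own collector_runs error column.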
Account Coverage
From aws_accounts:
SELECT
    polo_status,
    count() AS account_count,
    groupArray(account_name) AS accounts
FROM polo.aws_accounts FINAL
GROUP BY polo_status;
Accounts in pending_setup or no_access appear as warnings. Accounts not scanned in 24h generate alerts.
System Status UI Page
/system shows:
- Collector grid: One card per collector showing last run time, status (green/amber/red), duration, event count, and last error
- Account coverage: Table of accounts by status, with red highlight for no_access and pending_setup
- ClickHouse health: Disk usage, query latency (p50/p95/p99), active queries count
- SQS health: Queue depth, dead-letter queue depth (should be 0)
- Collector error rate: Sparkline of failures over last 7 days per collector
API Endpoint
GET /api/v1/system/status
Returns:
{
  "collectors": [
    {
      "name": "config_ec2",
      "last_run": "2025-03-28T14:00:12Z",
      "status": "succeeded",
      "minutes_ago": 15,
      "health": "green"
    },
    ...
  ],
  "accounts": {
    "active": 18,
    "pending_setup": 2,
    "no_access": 0
  },
  "clickhouse": {
    "disk_used_gb": 12.4,
    "query_p95_ms": 180
  }
}
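The spec does not yet say how "health" is derived. One possible mapping is sketched below, purely as an assumption: red for a failed last run or a gap beyond 2× the expected interval, amber for a partial run or a gap beyond 1× the interval, green otherwise. `collector_entry` and `expected_interval_min` are hypothetical names, and last_run is assumed to be a UTC datetime.

```python
from datetime import datetime, timezone


def collector_entry(name, last_run, status, now, expected_interval_min):
    """Build one item of the "collectors" array (assumed health mapping)."""
    minutes_ago = int((now - last_run).total_seconds() // 60)
    if status == "failed" or minutes_ago > 2 * expected_interval_min:
        health = "red"
    elif status == "partial" or minutes_ago > expected_interval_min:
        health = "amber"
    else:
        health = "green"
    return {
        "name": name,
        # Assumes last_run is in UTC, hence the literal 'Z' suffix.
        "last_run": last_run.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": status,
        "minutes_ago": minutes_ago,
        "health": health,
    }
```

With a 15-minute expected interval, a run that completed at 14:00:12Z and is read at 14:15:30Z reproduces the config_ec2 entry shown in the response above.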
Alerting
Operational alerts funnel through the same notification system as anomalies and rule violations:
| Alert | Trigger | Severity |
|---|---|---|
| Collector failure | 3 consecutive failures | warning |
| Collector stale | No success in 2× expected interval | critical |
| Account no access | AssumeRole fails | warning |
| DLQ non-empty | Dead-letter queue has messages | critical |
| ClickHouse disk > 80% | Disk usage threshold | warning |
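
The triggers in this table can be evaluated with very little logic. The sketch below covers four of the five rows (the AssumeRole check is omitted) and assumes that only 'failed' runs count toward the consecutive-failure rule, which the spec leaves open for 'partial'. Function and parameter names are illustrative, not part of the design.

```python
def consecutive_failures(statuses_newest_first):
    """Count 'failed' runs at the head of a newest-first status list."""
    count = 0
    for status in statuses_newest_first:
        if status != "failed":
            break
        count += 1
    return count


def operational_alerts(statuses_newest_first, stale, dlq_depth, disk_used_pct):
    """Return (alert, severity) pairs using the names from the alert table."""
    alerts = []
    if consecutive_failures(statuses_newest_first) >= 3:
        alerts.append(("Collector failure", "warning"))
    if stale:
        alerts.append(("Collector stale", "critical"))
    if dlq_depth > 0:
        alerts.append(("DLQ non-empty", "critical"))
    if disk_used_pct > 80:
        alerts.append(("ClickHouse disk > 80%", "warning"))
    return alerts
```

The status list would come from the collector's most recent rows in polo.collector_runs, ordered by started_at descending.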
Polo's Own Cost
Tag Polo's own resources with polo:purpose=polo-infrastructure:
- Lambda functions
- SQS queues
- ClickHouse instance (or Cloud subscription)
- CUR S3 bucket
- Secrets Manager secrets
These show up in the hierarchy under a dedicated node, giving a first-class "how much does Polo cost to run?" metric.