Operational Monitoring

Status: PLANNED — The collector_runs table, /api/v1/system/status endpoint, and System Status UI page are not yet implemented. This document is the design specification for Polo's self-monitoring.

Polo monitors itself. Its own infrastructure (Lambdas, ClickHouse) should be tagged with polo:purpose=polo-infrastructure and appear in the hierarchy. The System Status page will show operational health at a glance.

Collector Health: `polo.collector_runs`

Every collector writes a row at the end of each run:

CREATE TABLE polo.collector_runs
(
    run_id           UUID DEFAULT generateUUIDv4(),
    collector        LowCardinality(String),
    aws_account_id   LowCardinality(String),
    aws_region       LowCardinality(String),
    started_at       DateTime64(3),
    completed_at     DateTime64(3),
    status           LowCardinality(String),       -- 'succeeded', 'failed', 'partial'
    events_produced  UInt32,
    rows_inserted    UInt32,
    error            String DEFAULT '',
    duration_ms      UInt32
)
ENGINE = MergeTree()
ORDER BY (collector, started_at)
TTL started_at + INTERVAL 90 DAY;
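
As a rough illustration of what "writes a row at the end of each run" could look like on the collector side, the sketch below assembles one row matching the columns above. The `CollectorRun` class, the `build_run_row` helper, and the rule for choosing between 'failed' and 'partial' are assumptions made for illustration; the spec does not define them, and the actual insert into ClickHouse is left out.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone


@dataclass
class CollectorRun:
    """One row for polo.collector_runs; run_id is left to the column default."""
    collector: str
    aws_account_id: str
    aws_region: str
    started_at: datetime
    completed_at: datetime
    status: str            # 'succeeded', 'failed', or 'partial'
    events_produced: int
    rows_inserted: int
    error: str
    duration_ms: int


def build_run_row(collector, aws_account_id, aws_region, started_at,
                  completed_at, events_produced, rows_inserted, error=""):
    """Derive status and duration_ms from the outcome of a run.

    Assumed convention (not from the spec): an error with some rows
    inserted counts as 'partial', an error with none counts as 'failed'.
    """
    if not error:
        status = "succeeded"
    elif rows_inserted > 0:
        status = "partial"
    else:
        status = "failed"
    # Integer milliseconds, matching the UInt32 duration_ms column.
    duration_ms = (completed_at - started_at) // timedelta(milliseconds=1)
    return CollectorRun(collector, aws_account_id, aws_region, started_at,
                        completed_at, status, events_produced, rows_inserted,
                        error, duration_ms)
```

Calling `asdict()` on the result gives a plain dictionary keyed by column name, which most ClickHouse clients can take as insert input in one form or another.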

Freshness monitoring

Each collector has an expected interval. If it hasn't run successfully in 2× that interval, it's stale:

SELECT
    collector,
    maxIf(completed_at, status = 'succeeded') AS last_success,
    dateDiff('minute', maxIf(completed_at, status = 'succeeded'), now()) AS minutes_ago,
    argMax(status, completed_at) AS last_status,
    argMax(error, completed_at) AS last_error
FROM polo.collector_runs
GROUP BY collector
HAVING minutes_ago > 60  -- example threshold: 2× a 30-minute interval; adjust per collector
ORDER BY minutes_ago DESC;
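
The 2× rule itself is easy to express outside SQL when each collector has a different interval. The following is a minimal sketch, assuming a hypothetical `EXPECTED_INTERVAL` mapping; the interval values and the `cur_import` collector name are placeholders and are not taken from this spec.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected intervals; real values would come from each
# collector's schedule.
EXPECTED_INTERVAL = {
    "config_ec2": timedelta(minutes=15),
    "cur_import": timedelta(hours=24),
}


def is_stale(collector, last_success, now):
    """True if the collector has no success within 2x its expected interval."""
    if last_success is None:  # never succeeded
        return True
    return now - last_success > 2 * EXPECTED_INTERVAL[collector]
```

Treating "never succeeded" as stale means a collector that is newly deployed but broken is flagged straight away rather than silently ignored.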

Dependency sequencing

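Before running, the hierarchy_builder checks that every config collector it depends on has succeeded within the last 20 minutes:

SELECT
    collector,
    max(completed_at) AS last_success
FROM polo.collector_runs
WHERE collector IN ('config_ec2', 'config_ebs', 'config_network')
  AND status = 'succeeded'
  AND completed_at > now() - INTERVAL 20 MINUTE
GROUP BY collector;  -- proceed only if all three collectors are returned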
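
One way the builder might apply this gate in code is sketched below. It assumes the caller has already loaded the most recent successful `completed_at` per collector into a dictionary; the `dependencies_fresh` helper and its argument shape are illustrative, not part of the spec.

```python
from datetime import datetime, timedelta, timezone

REQUIRED = ("config_ec2", "config_ebs", "config_network")


def dependencies_fresh(last_success_by_collector, now,
                       window=timedelta(minutes=20)):
    """True only if every required collector succeeded within the window.

    last_success_by_collector maps a collector name to the completed_at of
    its most recent successful run; a missing key means no success on record.
    """
    for name in REQUIRED:
        last = last_success_by_collector.get(name)
        if last is None or now - last >= window:
            return False
    return True
```

Checking each collector separately matters: a single recent success from one collector should not let the builder run against stale data from the other two.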

Account Coverage

Account coverage is read from polo.aws_accounts, grouped by status:

SELECT
    polo_status,
    count() AS account_count,
    groupArray(account_name) AS accounts
FROM polo.aws_accounts FINAL
GROUP BY polo_status;

Accounts in pending_setup or no_access are shown as warnings. Any account that has not been scanned in the last 24 hours generates an alert.

System Status UI Page

/system shows:

Collector grid: One card per collector showing last run time, status (green/amber/red), duration, event count, and last error
Account coverage: Table of accounts by status, with red highlight for no_access and pending_setup
ClickHouse health: Disk usage, query latency (p50/p95/p99), active queries count
SQS health: Queue depth, dead-letter queue depth (should be 0)
Collector error rate: Sparkline of failures over last 7 days per collector

API Endpoint

GET /api/v1/system/status

Returns:

{
  "collectors": [
    {
      "name": "config_ec2",
      "last_run": "2025-03-28T14:00:12Z",
      "status": "succeeded",
      "minutes_ago": 15,
      "health": "green"
    },
    ...
  ],
  "accounts": {
    "active": 18,
    "pending_setup": 2,
    "no_access": 0
  },
  "clickhouse": {
    "disk_used_gb": 12.4,
    "query_p95_ms": 180
  }
}
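
The response does not say how `health` is derived. The sketch below shows one possible mapping from a collector's last run to an entry in the `collectors` array. The colour thresholds and the `collector_entry` helper are assumptions made for illustration; only the field names come from the example above.

```python
from datetime import datetime, timedelta, timezone


def collector_entry(name, last_run, status, expected_interval, now):
    """Build one item of the "collectors" array.

    Assumed colour rules (the spec does not define them):
      red   - last run failed, or older than 2x the expected interval
      amber - last run was 'partial', or older than 1x the interval
      green - otherwise
    Both datetimes are expected to be timezone-aware UTC values.
    """
    age = now - last_run
    if status == "failed" or age > 2 * expected_interval:
        health = "red"
    elif status == "partial" or age > expected_interval:
        health = "amber"
    else:
        health = "green"
    return {
        "name": name,
        "last_run": last_run.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": status,
        "minutes_ago": age // timedelta(minutes=1),  # whole minutes
        "health": health,
    }
```

Keeping the mapping in one function means the UI cards and the API would report the same colour for the same run.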

Alerting

Operational alerts funnel through the same notification system as anomalies and rule violations:

Alert	Trigger	Severity
Collector failure	3 consecutive failures	warning
Collector stale	No success in 2× expected interval	critical
Account no access	AssumeRole fails	warning
DLQ non-empty	Dead-letter queue has messages	critical
ClickHouse disk > 80%	Disk usage threshold	warning
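
The triggers in this table can be evaluated with very little logic. The sketch below is a minimal illustration: the function names and input shapes are hypothetical, 'partial' runs are assumed not to count as failures, and the AssumeRole check is left out because it depends on a live AWS call.

```python
def consecutive_failures(statuses):
    """Count trailing 'failed' runs; statuses are ordered oldest to newest."""
    count = 0
    for status in reversed(statuses):
        if status != "failed":
            break
        count += 1
    return count


def operational_alerts(statuses, stale, dlq_depth, disk_used_pct):
    """Return (alert, severity) pairs for the conditions that hold."""
    alerts = []
    if consecutive_failures(statuses) >= 3:
        alerts.append(("Collector failure", "warning"))
    if stale:
        alerts.append(("Collector stale", "critical"))
    if dlq_depth > 0:
        alerts.append(("DLQ non-empty", "critical"))
    if disk_used_pct > 80:
        alerts.append(("ClickHouse disk > 80%", "warning"))
    return alerts
```

For example, `operational_alerts(["succeeded", "failed", "failed", "failed"], False, 0, 42.0)` yields a single "Collector failure" warning.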

Polo's Own Cost

Tag Polo's own resources with polo:purpose=polo-infrastructure:

Lambda functions
SQS queues
ClickHouse instance (or Cloud subscription)
CUR S3 bucket
Secrets Manager secrets

These show up in the hierarchy under a dedicated node, giving a first-class "how much does Polo cost to run?" metric.
