Runbook: grpc_metrics_gateway_alerts
Remediation for the gRPC metrics gateway Grafana alerts
(infra/multitenant_eks_cluster/grpc_metrics_gateway_alerts.tf). The gateway is
a Go service that ingests Prometheus samples over gRPC/gRPC-Web, aggregates them
in memory, and re-exposes them on /metrics for a dedicated 5-minute scrape.
Cloudflare Workers reach it over a private Cloudflare Tunnel (cloudflared).
| Field | Details |
|---|---|
| Owner | Dataplane |
| Required Access | kubectl for the cell, AWS (AMP/Grafana), Cloudflare dashboard |
| Component | components/grpc_metrics_gateway, components/metrics_gateway_worker |
| Dashboard | Grafana → "gRPC Metrics Gateway" (grpc_metrics_gateway_dashboard) |
| Namespace | metrics-gateway (var grpc_metrics_gateway_namespace) |
Quick context for every alert below: the gateway pods are
deployment/grpc-metrics-gateway and the tunnel pods are
deployment/cloudflared-grpc-metrics-gateway, both in the metrics-gateway
namespace. Self-metrics (grpc_metrics_gateway_*) come from the 5-minute
grpc-metrics-gateway scrape job; cloudflared_* from the 60-second
cloudflared-metrics-gateway job.
Requests failing
Alert: reject ratio samples_rejected / (accepted + rejected) > 10% for 15m.
A high share of pushed samples is being rejected. The gateway rejects a sample
when it is malformed (invalid metric/label name, reserved __ label, non-finite
value), is a negative counter increment, is a gauge (unsupported), has a
type/label mismatch against an already-registered series, or, at the extreme,
arrives when the in-memory series cap is hit (that also raises the Ingestion
saturation alert and increments grpc_metrics_gateway_series_cap_rejected_total).
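The rejection rules above can be sketched as a single check. This is an illustration only, not the gateway's Go implementation: the function name, the reason strings, the `registered` map, and the Prometheus-style name patterns are all assumptions, and the series-cap case is left out.

```python
import math
import re

# Prometheus-style name patterns (assumed; the gateway's real checks may differ).
METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
LABEL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def reject_reason(name, kind, labels, value, registered):
    """Return a rejection reason, or None if the sample would be accepted.

    `registered` maps metric name -> (kind, sorted label names) for series
    that already exist. All names here are hypothetical.
    """
    if not METRIC_NAME.match(name):
        return "invalid metric name"
    for label in labels:
        if label.startswith("__"):
            return "reserved label"
        if not LABEL_NAME.match(label):
            return "invalid label name"
    if not math.isfinite(value):
        return "non-finite value"
    if kind == "gauge":
        return "gauges are unsupported"
    if kind == "counter" and value < 0:
        return "negative counter increment"
    shape = (kind, tuple(sorted(labels)))
    if name in registered and registered[name] != shape:
        return "type/label mismatch"
    return None
```

A client can run a check like this before pushing, so invalid samples are caught locally instead of inflating the reject counter.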
1. Decide bad-input vs capacity
Compare the cap-rejection rate against the total reject rate:
sum(rate(grpc_metrics_gateway_series_cap_rejected_total[15m]))
sum(rate(grpc_metrics_gateway_samples_rejected_total[15m]))
If cap rejections dominate, treat this as Ingestion saturation (below). Otherwise it is a misbehaving client sending invalid samples.
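As a rough aid for this decision, the two query results can be compared as below. The 0.5 cut-off for "dominate" and the function name are assumptions, and the sketch assumes cap rejections are also counted in the total reject counter.

```python
def triage_rejections(cap_rate, reject_rate, dominate_share=0.5):
    """Classify a reject spike from the two rates queried above.

    cap_rate:    value of the series_cap_rejected_total rate query
    reject_rate: value of the samples_rejected_total rate query
    dominate_share is an assumed threshold, not a documented one.
    """
    if reject_rate <= 0:
        return "no rejections"
    share = cap_rate / reject_rate
    if share > dominate_share:
        return "ingestion saturation"
    return "bad client input"
```

For example, 0.2 cap rejections/s out of 40 rejections/s is a share of 0.005, which points at a misbehaving client rather than capacity.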
2. Identify the misbehaving client
The gateway logs each rejection at debug with the reason. Tail a pod:
kubectl -n metrics-gateway logs deploy/grpc-metrics-gateway --tail=200 | grep -i "rejected metric"
The per-sample error strings are also returned to the caller in the PushResponse
(errors, capped at 100/response) — check the calling Worker/service logs.
3. Remediate
Fix the client to send valid counters/histograms (no gauges, finite values, stable label sets per metric name). A bad client cannot poison other series — it only inflates the reject counter — so this is a warning, not a page.
Ingestion saturation
Alert: per-pod weighted series cap utilisation
grpc_metrics_gateway_series_cost / grpc_metrics_gateway_series_max > 0.9 for 15m.
There is no request queue (ingest is synchronous), so the truest "backlog
building up" signal is the in-memory series store filling toward MAX_SERIES
(var.grpc_metrics_gateway_max_series). A counter costs 1; a histogram costs
#buckets + 3. At 1.0 the gateway rejects new series (re-pushes of existing
series still succeed).
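The weighted cost can be worked through by hand. The sketch below applies the stated weights (counter = 1, histogram = #buckets + 3); the function names and the example cap of 10,000 are hypothetical.

```python
def series_cost(kind, buckets=0):
    """Weighted cost of one series, using the weights stated above."""
    if kind == "counter":
        return 1
    if kind == "histogram":
        return buckets + 3
    raise ValueError(f"unsupported metric type: {kind}")


def utilisation(series, max_series):
    """Fraction of the cap used by an iterable of (kind, buckets) pairs."""
    return sum(series_cost(kind, buckets) for kind, buckets in series) / max_series


# 1,000 counters plus 500 histograms with 12 buckets each:
# 1000 * 1 + 500 * (12 + 3) = 8500 against an assumed cap of 10,000.
example = [("counter", 0)] * 1000 + [("histogram", 12)] * 500
ratio = utilisation(example, 10_000)
```

Here `ratio` is 0.85, just under the 0.9 alert threshold. Histograms account for 7,500 of the 8,500 cost units even though they are only a third of the series.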
1. Confirm and find the source
grpc_metrics_gateway_series_cost / grpc_metrics_gateway_series_max
topk(20, count by (__name__) ({__name__=~"mgw_.*"}))
High cardinality usually means a client put an unbounded value (request id, raw URL, timestamp) into a label.
2. Remediate
- Client side (preferred): drop the high-cardinality label; bound label values to a small set.
- Capacity: raise grpc_metrics_gateway_max_series in the env tfvars and apply (this is a per-pod weighted cost; account for histogram weight and pod memory grpc_metrics_gateway_memory_limit).
- Restarting a pod clears its in-memory series but they rebuild on the next pushes, so it is only a temporary reset.
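For the client-side fix, one common way to bound a label is to map anything outside a known set to a single fallback value. The route set and helper name below are hypothetical.

```python
# Hypothetical allow-list; a real client would use its own route table.
ALLOWED_ROUTES = frozenset({"/login", "/checkout", "/search"})


def bounded_label(value, allowed=ALLOWED_ROUTES, fallback="other"):
    """Return value if it is in the allowed set, else a fixed fallback.

    This caps the label at len(allowed) + 1 distinct values, so a raw URL
    or request id can no longer create a new series per request.
    """
    return value if value in allowed else fallback
```

With this in place, `bounded_label("/checkout")` keeps its value, while `bounded_label("/item/8f3a91")` collapses to `"other"`.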
Scrape failing
Alert: max(up{job="grpc-metrics-gateway"}) < 1 for 15m (no_data ⇒ Alerting).
Prometheus cannot scrape any gateway pod's /metrics, so ingest visibility is
lost. If the pods themselves are down, pushed metrics are failing too.
1. Check pods and targets
kubectl -n metrics-gateway get pods -l app=grpc-metrics-gateway -o wide
kubectl -n metrics-gateway describe deploy/grpc-metrics-gateway
In Grafana Explore: up{job="grpc-metrics-gateway"}. If there are no series,
the Service annotations or scrape job changed; if series are 0, the pods are
unreachable on the metrics port.
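The three outcomes of that query can be written down as a small decision helper. This is a sketch of the reasoning only; the function name and return strings are hypothetical.

```python
def diagnose_up(values):
    """Interpret the list of up{job="grpc-metrics-gateway"} sample values."""
    if not values:
        # No series at all: the target was never discovered.
        return "no series: check Service annotations and the scrape job"
    if max(values) < 1:
        # Targets exist but every scrape fails.
        return "all targets down: pods unreachable on the metrics port"
    return "at least one pod is being scraped"
```

Because the alert uses max(), a single healthy pod keeps it quiet, so `diagnose_up([0, 1, 0])` reports a working scrape even though two pods are failing.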
2. Common causes
- Pods crash-looping / OOMKilled → check kubectl -n metrics-gateway logs and raise grpc_metrics_gateway_memory_limit if OOM.
- The metrics Service lost the prometheus.io/grpc-gateway-scrape=true annotation or the metrics port (grpc_metrics_gateway.tf).
- HPA scaled to zero / scheduling failure (no nodes for workload=general-workload-v2).
3. Remediate
Restore the pods (kubectl -n metrics-gateway rollout restart deploy/grpc-metrics-gateway)
or fix scheduling/resources, then confirm up == 1.
Tunnel down
Alert: max(cloudflared_tunnel_ha_connections{job="cloudflared-metrics-gateway"}) < 1
for 5m (no_data ⇒ Alerting).
No cloudflared connector for the metrics-gateway tunnel has an edge connection, so Workers cannot reach the gateway over the tunnel. (In-cluster pushes are unaffected — this is the external Worker path.)
1. Check the connector pods
kubectl -n metrics-gateway get pods -l app=cloudflared-grpc-metrics-gateway
kubectl -n metrics-gateway logs deploy/cloudflared-grpc-metrics-gateway --tail=200
In Grafana Explore: cloudflared_tunnel_ha_connections{job="cloudflared-metrics-gateway"}
(healthy is ~4 per connector).
2. Common causes
- Invalid/rotated tunnel token → the grpc-metrics-gateway-tunnel Secret is out of sync with the Cloudflare tunnel. Re-apply the tunnel Terraform to regenerate both.
- Egress blocked → cloudflared dials out over IPv6 via the egress-only IGW; check node networking / NetworkPolicy.
- cloudflared image too old → must be ≥ 2025.7.0 for Workers VPC (var.cloudflared_image).
- Tunnel deleted/disabled in Cloudflare, or the connector can't resolve the gateway Service.
3. Remediate
Restart the connector (kubectl -n metrics-gateway rollout restart deploy/cloudflared-grpc-metrics-gateway).
If the token is stale, re-apply grpc_metrics_gateway_tunnel.tf. Confirm
cloudflared_tunnel_ha_connections >= 1 and that a Worker can reach the gateway
(components/metrics_gateway_worker/perf/run_perf.sh --verify-gateway).