Runbook: grpc_metrics_gateway_alerts

Remediation for the gRPC metrics gateway Grafana alerts (infra/multitenant_eks_cluster/grpc_metrics_gateway_alerts.tf). The gateway is a Go service that ingests Prometheus samples over gRPC/gRPC-Web, aggregates them in memory, and re-exposes them on /metrics for a dedicated 5-minute scrape. Cloudflare Workers reach it over a private Cloudflare Tunnel (cloudflared).

Field	Details
Owner	Dataplane
Required Access	`kubectl` for the cell, AWS (AMP/Grafana), Cloudflare dashboard
Component	`components/grpc_metrics_gateway`, `components/metrics_gateway_worker`
Dashboard	Grafana → "gRPC Metrics Gateway" (`grpc_metrics_gateway_dashboard`)
Namespace	`metrics-gateway` (var `grpc_metrics_gateway_namespace`)

Quick context for every alert below: the gateway pods are deployment/grpc-metrics-gateway and the tunnel pods are deployment/cloudflared-grpc-metrics-gateway, both in the metrics-gateway namespace. Self-metrics (grpc_metrics_gateway_*) come from the 5-minute grpc-metrics-gateway scrape job; cloudflared_* from the 60-second cloudflared-metrics-gateway job.

Requests failing

Alert: reject ratio samples_rejected / (accepted + rejected) > 10% for 15m.

A high share of pushed samples is being rejected. The gateway rejects a sample when it is malformed (invalid metric or label name, reserved __ label, non-finite value), is a negative counter increment, is a gauge (unsupported), or has a type or label set that does not match an already-registered series. At the extreme, it also rejects new series when the in-memory series cap is hit; that case additionally shows up as the Ingestion saturation alert and in grpc_metrics_gateway_series_cap_rejected_total.
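As a rough aid for reproducing a rejection on the client side, the stateless rules above can be sketched in Python. This is not the gateway's code: the function name, argument shapes, and reason strings are hypothetical, and the name patterns are the classic Prometheus naming rules, which the gateway is only assumed to follow. The type/label mismatch and series-cap checks need gateway state and are left out.

```python
import math
import re

# Classic Prometheus naming rules (assumed to match the gateway's checks).
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def rejection_reason(name, kind, value, labels):
    """Return a reason string if the sample would be rejected, else None.

    Hypothetical helper: mirrors the rules listed in this runbook, not the
    gateway's actual implementation or error strings.
    """
    if not METRIC_NAME_RE.fullmatch(name):
        return "invalid metric name"
    for label in labels:
        # Labels starting with a double underscore are reserved.
        if label.startswith("__"):
            return "reserved label"
        if not LABEL_NAME_RE.fullmatch(label):
            return "invalid label name"
    if kind == "gauge":
        return "gauges unsupported"
    if not math.isfinite(value):
        return "non-finite value"
    if kind == "counter" and value < 0:
        return "negative counter increment"
    return None
```

Running a suspect client's samples through a check like this before pushing can narrow down which rule it is tripping.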

1. Decide bad-input vs capacity

Compare the cap-rejection rate against the total reject rate:

sum(rate(grpc_metrics_gateway_series_cap_rejected_total[15m]))
sum(rate(grpc_metrics_gateway_samples_rejected_total[15m]))

If cap rejections dominate, treat this as Ingestion saturation (below). Otherwise it is a misbehaving client sending invalid samples.
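The decision can be written down as a small calculation over the query results. This is a sketch under assumptions: it takes an accepted-samples rate as input alongside the two rates above, treats cap rejections as a subset of total rejections, and reads "dominate" as more than half. None of those details are confirmed by the alert definition, and the function name is hypothetical.

```python
REJECT_RATIO_THRESHOLD = 0.10  # alert threshold from this runbook
CAP_SHARE_CUTOFF = 0.5         # assumption: "dominate" means more than half


def triage(accepted_rate, rejected_rate, cap_rejected_rate):
    """Classify a Requests failing alert from three per-second rates."""
    total = accepted_rate + rejected_rate
    if total <= 0:
        return "no traffic"
    reject_ratio = rejected_rate / total
    if reject_ratio <= REJECT_RATIO_THRESHOLD:
        return "below threshold"
    # rejected_rate is positive here because reject_ratio exceeds the threshold.
    cap_share = cap_rejected_rate / rejected_rate
    return "capacity" if cap_share > CAP_SHARE_CUTOFF else "bad input"
```

For example, `triage(50, 50, 45)` classifies as capacity, while `triage(50, 50, 1)` points at a misbehaving client.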

2. Identify the misbehaving client

The gateway logs each rejection at debug with the reason. Tail a pod:

kubectl -n metrics-gateway logs deploy/grpc-metrics-gateway --tail=200 | grep -i "rejected metric"

The per-sample error strings are also returned to the caller in the PushResponse (errors, capped at 100/response) — check the calling Worker/service logs.

3. Remediate

Fix the client to send valid counters/histograms (no gauges, finite values, stable label sets per metric name). A bad client cannot poison other series — it only inflates the reject counter — so this is a warning, not a page.

Ingestion saturation

Alert: per-pod weighted series cap utilisation grpc_metrics_gateway_series_cost / grpc_metrics_gateway_series_max > 0.9 for 15m.

There is no request queue (ingest is synchronous), so the truest "backlog building up" signal is the in-memory series store filling toward MAX_SERIES (var.grpc_metrics_gateway_max_series). A counter costs 1; a histogram costs #buckets + 3. At 1.0 the gateway rejects new series (re-pushes of existing series still succeed).
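To make the weighting concrete, here is a minimal sketch of the cost arithmetic. The function names and the cap of 10000 in the usage note are illustrative assumptions, not values from the Terraform configuration.

```python
def series_cost(kind, buckets=0):
    """Weighted cost of one series: counter = 1, histogram = buckets + 3."""
    if kind == "counter":
        return 1
    if kind == "histogram":
        return buckets + 3
    raise ValueError(f"unsupported metric type: {kind}")


def utilisation(series, max_series):
    """Fraction of the per-pod cap used by (kind, buckets) pairs."""
    total_cost = sum(series_cost(kind, buckets) for kind, buckets in series)
    return total_cost / max_series
```

With 500 counters and 100 histograms of 12 buckets each, the cost is 500 + 100 × 15 = 2000, so a hypothetical cap of 10000 would sit at 20% utilisation. Histograms dominate the cost long before they dominate the series count.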

1. Confirm and find the source

grpc_metrics_gateway_series_cost / grpc_metrics_gateway_series_max
topk(20, count by (__name__) ({__name__=~"mgw_.*"}))

High cardinality usually means a client put an unbounded value (request id, raw URL, timestamp) into a label.

2. Remediate

Client side (preferred): drop the high-cardinality label; bound label values to a small set.
Capacity: raise grpc_metrics_gateway_max_series in the env tfvars and apply (this is a per-pod weighted cost; account for histogram weight and pod memory grpc_metrics_gateway_memory_limit).
Restarting a pod clears its in-memory series but they rebuild on the next pushes, so it is only a temporary reset.
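For the client-side fix, one common approach is to map every label value onto a fixed set before pushing. The sketch below is illustrative only: the route allowlist, the fallback value, and the helper names are hypothetical and would need to match the real client.

```python
# Hypothetical allowlist; a real client would list its own route templates.
ALLOWED_ROUTES = frozenset({"/users", "/orders", "/health"})


def bounded(value, allowed, fallback="other"):
    """Return value if it is in the allowlist, otherwise a single fallback."""
    return value if value in allowed else fallback


def status_class(code):
    """Collapse an HTTP status code into one of 1xx..5xx, or 'other'."""
    return f"{code // 100}xx" if 100 <= code <= 599 else "other"
```

With this shape, a route label can take at most `len(ALLOWED_ROUTES) + 1` distinct values and a status label at most six, regardless of what raw paths or codes the client sees.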

Scrape failing

Alert: max(up{job="grpc-metrics-gateway"}) < 1 for 15m (no_data ⇒ Alerting).

Prometheus cannot scrape any gateway pod's /metrics, so ingest visibility is lost. If the pods themselves are down, pushed metrics are failing too.

1. Check pods and targets

kubectl -n metrics-gateway get pods -l app=grpc-metrics-gateway -o wide
kubectl -n metrics-gateway describe deploy/grpc-metrics-gateway

In Grafana Explore: up{job="grpc-metrics-gateway"}. If no series are returned, the Service annotations or scrape job changed; if the series exist with value 0, the pods are unreachable on the metrics port.
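The reading of the up query can be summarised as a small decision function. This is a sketch: the function name and return strings are hypothetical, and the input is assumed to be the list of current up values, one per scrape target.

```python
def scrape_state(up_values):
    """Interpret the per-target values of up{job="grpc-metrics-gateway"}."""
    if not up_values:
        # No series at all: the alert treats no_data as Alerting.
        return "no series: check Service annotations and scrape job"
    if max(up_values) < 1:
        # Matches the alert condition max(up) < 1.
        return "all targets down: pods unreachable on the metrics port"
    if min(up_values) < 1:
        # Some pods still scrape, so max(up) is 1 and the alert does not fire.
        return "partially down"
    return "healthy"
```

Because the alert uses max, it only fires when every gateway pod fails to scrape (or no series exist); a single unhealthy pod shows up as "partially down" and does not alert.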

2. Common causes

Pods crash-looping / OOMKilled → check kubectl -n metrics-gateway logs and raise grpc_metrics_gateway_memory_limit if OOM.
The metrics Service lost the prometheus.io/grpc-gateway-scrape=true annotation or the metrics port (grpc_metrics_gateway.tf).
HPA scaled to zero / scheduling failure (no nodes for workload=general-workload-v2).

3. Remediate

Restore the pods (kubectl -n metrics-gateway rollout restart deploy/grpc-metrics-gateway) or fix scheduling/resources, then confirm up == 1.

Tunnel down

Alert: max(cloudflared_tunnel_ha_connections{job="cloudflared-metrics-gateway"}) < 1 for 5m (no_data ⇒ Alerting).

No cloudflared connector for the metrics-gateway tunnel has an edge connection, so Workers cannot reach the gateway over the tunnel. (In-cluster pushes are unaffected — this is the external Worker path.)

1. Check the connector pods

kubectl -n metrics-gateway get pods -l app=cloudflared-grpc-metrics-gateway
kubectl -n metrics-gateway logs deploy/cloudflared-grpc-metrics-gateway --tail=200

In Grafana Explore: cloudflared_tunnel_ha_connections{job="cloudflared-metrics-gateway"} (healthy is ~4 per connector).

2. Common causes

Invalid/rotated tunnel token → the grpc-metrics-gateway-tunnel Secret is out of sync with the Cloudflare tunnel. Re-apply the tunnel Terraform to regenerate both.
Egress blocked → cloudflared dials out over IPv6 via the egress-only IGW; check node networking / NetworkPolicy.
cloudflared image too old → must be ≥ 2025.7.0 for Workers VPC (var.cloudflared_image).
Tunnel deleted/disabled in Cloudflare, or the connector can't resolve the gateway Service.
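When checking the image version, compare the components numerically rather than as strings, since "2025.10.1" sorts before "2025.7.0" in a plain string comparison. A minimal sketch, assuming the tag starts with three dot-separated numbers; the function names are hypothetical.

```python
import re

MINIMUM_VERSION = (2025, 7, 0)  # minimum for Workers VPC, per this runbook


def parse_version(tag):
    """Extract the leading year.month.patch numbers from an image tag."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(tag, minimum=MINIMUM_VERSION):
    """True if the tag's version is at least the required minimum."""
    return parse_version(tag) >= minimum
```

For example, `meets_minimum("2025.10.1")` is true and `meets_minimum("2025.6.1")` is false. Tags that do not start with a version, such as `latest`, raise `ValueError` and need to be resolved by hand.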

3. Remediate

Restart the connector (kubectl -n metrics-gateway rollout restart deploy/cloudflared-grpc-metrics-gateway). If the token is stale, re-apply grpc_metrics_gateway_tunnel.tf. Confirm cloudflared_tunnel_ha_connections >= 1 and that a Worker can reach the gateway (components/metrics_gateway_worker/perf/run_perf.sh --verify-gateway).
