
Runbook: grpc_metrics_gateway_alerts

Remediation for the gRPC metrics gateway Grafana alerts (infra/multitenant_eks_cluster/grpc_metrics_gateway_alerts.tf). The gateway is a Go service that ingests Prometheus samples over gRPC/gRPC-Web, aggregates them in memory, and re-exposes them on /metrics for a dedicated 5-minute scrape. Cloudflare Workers reach it over a private Cloudflare Tunnel (cloudflared).

Field            Details
Owner            Dataplane
Required Access  kubectl for the cell, AWS (AMP/Grafana), Cloudflare dashboard
Component        components/grpc_metrics_gateway, components/metrics_gateway_worker
Dashboard        Grafana → "gRPC Metrics Gateway" (grpc_metrics_gateway_dashboard)
Namespace        metrics-gateway (var grpc_metrics_gateway_namespace)

Quick context for every alert below: the gateway pods are deployment/grpc-metrics-gateway and the tunnel pods are deployment/cloudflared-grpc-metrics-gateway, both in the metrics-gateway namespace. Self-metrics (grpc_metrics_gateway_*) come from the 5-minute grpc-metrics-gateway scrape job; cloudflared_* from the 60-second cloudflared-metrics-gateway job.

Requests failing

Alert: reject ratio samples_rejected / (accepted + rejected) > 10% for 15m.

A high share of pushed samples is being rejected. The gateway rejects a sample when any of the following holds:

  • It is malformed (invalid metric/label name, reserved __ label, non-finite value).
  • It is a negative counter increment.
  • It is a gauge (gauges are unsupported).
  • Its type or label set does not match an already-registered series.
  • The in-memory series cap has been hit. This is the extreme case; it also increments grpc_metrics_gateway_series_cap_rejected_total and raises the Ingestion saturation alert.
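The rejection rules above can be sketched as a single check. This is an illustrative Python model, not the gateway's Go code: the function name reject_reason, the sample dict shape, and the reason strings are all hypothetical. The name patterns are the standard Prometheus metric and label name rules, and the cap check is left out because it depends on store state.

```python
import math
import re

# Standard Prometheus naming rules; the gateway's exact checks are assumed to match.
METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
LABEL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def reject_reason(sample, registered):
    """Return a (hypothetical) rejection reason, or None if the sample is acceptable.

    sample:     dict with keys "name", "type", "labels" (dict) and "value".
    registered: dict mapping metric name -> (type, frozenset of label names).
    """
    name, kind = sample["name"], sample["type"]
    labels, value = sample["labels"], sample["value"]

    if not METRIC_NAME.match(name):
        return "invalid metric name"
    for label in labels:
        if label.startswith("__"):
            return "reserved label"
        if not LABEL_NAME.match(label):
            return "invalid label name"
    if not math.isfinite(value):
        return "non-finite value"
    if kind == "gauge":
        return "gauge unsupported"
    if kind == "counter" and value < 0:
        return "negative counter increment"

    # A metric name must keep the type and label set it was first registered with.
    known = registered.get(name)
    if known is not None and known != (kind, frozenset(labels)):
        return "type/label mismatch"
    return None
```

Running a suspect client's payload through a check like this before it is pushed is one way to reproduce a rejection locally without touching the gateway.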

1. Decide bad-input vs capacity

Compare the cap-rejection rate against the total reject rate:

sum(rate(grpc_metrics_gateway_series_cap_rejected_total[15m]))
sum(rate(grpc_metrics_gateway_samples_rejected_total[15m]))

If cap rejections dominate, treat this as Ingestion saturation (below). Otherwise it is a misbehaving client sending invalid samples.
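As a rough aid for this decision, the comparison can be written as a share of the total. The sketch below assumes cap rejections are also counted in grpc_metrics_gateway_samples_rejected_total, and it treats "dominate" as more than half of all rejections. Both the 0.5 threshold and the function name classify_rejects are assumptions for illustration, not values taken from the alert definition.

```python
def classify_rejects(cap_rate, total_rate, dominance=0.5):
    """Classify the reject alert from the two query results (samples per second).

    cap_rate:   result of the series_cap_rejected_total query.
    total_rate: result of the samples_rejected_total query.
    dominance:  assumed share above which cap rejections "dominate".
    """
    if total_rate <= 0:
        return "no rejections"
    share = cap_rate / total_rate
    return "capacity" if share > dominance else "bad input"
```

For example, classify_rejects(9.0, 10.0) returns "capacity" (follow Ingestion saturation), while classify_rejects(0.5, 10.0) returns "bad input" (continue with step 2).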

2. Identify the misbehaving client

The gateway logs each rejection, with its reason, at debug level (if nothing matches, confirm that debug logging is enabled on the pod). Tail a pod:

kubectl -n metrics-gateway logs deploy/grpc-metrics-gateway --tail=200 | grep -i "rejected metric"

The per-sample error strings are also returned to the caller in the PushResponse (errors, capped at 100/response) — check the calling Worker/service logs.

3. Remediate

Fix the client to send valid counters/histograms (no gauges, finite values, stable label sets per metric name). A bad client cannot poison other series — it only inflates the reject counter — so this is a warning, not a page.

Ingestion saturation

Alert: per-pod weighted series cap utilisation grpc_metrics_gateway_series_cost / grpc_metrics_gateway_series_max > 0.9 for 15m.

There is no request queue (ingest is synchronous), so the truest "backlog building up" signal is the in-memory series store filling toward MAX_SERIES (var.grpc_metrics_gateway_max_series). A counter costs 1; a histogram costs #buckets + 3. At 1.0 the gateway rejects new series (re-pushes of existing series still succeed).
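The weighting can be worked through with a small calculation. This is a sketch of the cost rule as stated above (counter = 1, histogram = buckets + 3). The helper names are hypothetical, and whether the bucket count includes the +Inf bucket is not specified here, so pass whatever count the gateway uses.

```python
def series_cost(kind, buckets=0):
    """Weighted cost of one series under the rule stated in this runbook."""
    if kind == "counter":
        return 1
    if kind == "histogram":
        return buckets + 3
    raise ValueError(f"unsupported metric type: {kind}")


def utilisation(series, max_series):
    """Fraction of the per-pod cap used by an iterable of (kind, buckets) pairs."""
    return sum(series_cost(kind, buckets) for kind, buckets in series) / max_series


# Example: 1000 counters plus 200 histograms with 12 buckets each.
example = [("counter", 0)] * 1000 + [("histogram", 12)] * 200
# Cost is 1000 * 1 + 200 * 15 = 4000; against a cap of 5000 that is 0.8,
# which is below the 0.9 alert threshold.
```

The same arithmetic shows why histograms dominate: in this example 200 histograms account for 3000 of the 4000 cost units.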

1. Confirm and find the source

grpc_metrics_gateway_series_cost / grpc_metrics_gateway_series_max
topk(20, count by (__name__) ({__name__=~"mgw_.*"}))

High cardinality usually means a client put an unbounded value (request id, raw URL, timestamp) into a label.

2. Remediate

  • Client side (preferred): drop the high-cardinality label; bound label values to a small set.
  • Capacity: raise grpc_metrics_gateway_max_series in the env tfvars and apply. The cap is a per-pod weighted cost, so account for histogram weight and make sure pod memory (grpc_metrics_gateway_memory_limit) can hold the larger store.
  • Restarting a pod clears its in-memory series but they rebuild on the next pushes, so it is only a temporary reset.
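For the client-side fix, bounding label values usually means mapping raw input onto a small fixed vocabulary before it becomes a label. The helpers below are a minimal sketch with hypothetical names (bound_label, route_template). The ":id" placeholder and the "looks like an ID" pattern are assumptions to adapt per client.

```python
import re

# Path segments that look like identifiers: all digits, long hex strings, or UUIDs.
_ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F]{8,}|[0-9a-fA-F-]{36})$")


def bound_label(value, allowed, fallback="other"):
    """Keep a label value only if it is in a known small set."""
    return value if value in allowed else fallback


def route_template(path):
    """Collapse a raw URL path into a low-cardinality route template."""
    without_query = path.split("?")[0]
    parts = [":id" if _ID_SEGMENT.match(p) else p for p in without_query.split("/")]
    return "/".join(parts)
```

With these, route_template("/users/12345/orders/9f8b7c6d5e4f?x=1") yields "/users/:id/orders/:id", and bound_label("BREW", {"GET", "POST"}) yields "other", so the number of series per metric stays fixed regardless of traffic.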

Scrape failing

Alert: max(up{job="grpc-metrics-gateway"}) < 1 for 15m (no_data ⇒ Alerting).

Prometheus cannot scrape any gateway pod's /metrics, so ingest visibility is lost. If the pods themselves are down, pushed metrics are failing too.

1. Check pods and targets

kubectl -n metrics-gateway get pods -l app=grpc-metrics-gateway -o wide
kubectl -n metrics-gateway describe deploy/grpc-metrics-gateway

In Grafana Explore: up{job="grpc-metrics-gateway"}. If the query returns no series, the Service annotations or the scrape job changed; if series exist but their value is 0, the pods are unreachable on the metrics port.
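The reading of the up query can be summarised as a small decision function. This is only an illustration of the triage logic: the function name diagnose_up and the result labels are hypothetical, and the input is assumed to be the list of per-target values returned by the query.

```python
def diagnose_up(values):
    """Interpret the per-target values of up{job="grpc-metrics-gateway"}."""
    if not values:
        # No series at all: discovery is broken (annotations or scrape job).
        return "no_series"
    if max(values) < 1:
        # Targets are discovered but none can be scraped; this is the alert condition.
        return "all_down"
    if min(values) < 1:
        # At least one pod is scrapeable, so max(up) < 1 does not fire.
        return "partial"
    return "healthy"
```

Note that "partial" never fires the alert, since the rule takes max(up) across pods; a single unreachable pod has to be spotted on the dashboard instead.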

2. Common causes

  • Pods crash-looping / OOMKilled → check kubectl -n metrics-gateway logs <pod> --previous (the last state in kubectl describe pod shows OOMKilled) and raise grpc_metrics_gateway_memory_limit if OOM.
  • The metrics Service lost the prometheus.io/grpc-gateway-scrape=true annotation or the metrics port (grpc_metrics_gateway.tf).
  • HPA scaled to zero / scheduling failure (no nodes for workload=general-workload-v2).

3. Remediate

Restore the pods (kubectl -n metrics-gateway rollout restart deploy/grpc-metrics-gateway) or fix scheduling/resources, then confirm up == 1.

Tunnel down

Alert: max(cloudflared_tunnel_ha_connections{job="cloudflared-metrics-gateway"}) < 1 for 5m (no_data ⇒ Alerting).

No cloudflared connector for the metrics-gateway tunnel has an edge connection, so Workers cannot reach the gateway over the tunnel. (In-cluster pushes are unaffected — this is the external Worker path.)

1. Check the connector pods

kubectl -n metrics-gateway get pods -l app=cloudflared-grpc-metrics-gateway
kubectl -n metrics-gateway logs deploy/cloudflared-grpc-metrics-gateway --tail=200

In Grafana Explore: cloudflared_tunnel_ha_connections{job="cloudflared-metrics-gateway"} (healthy is ~4 per connector).

2. Common causes

  • Invalid/rotated tunnel token → the grpc-metrics-gateway-tunnel Secret is out of sync with the Cloudflare tunnel. Re-apply the tunnel Terraform to regenerate both.
  • Egress blocked → cloudflared dials out over IPv6 via the egress-only IGW; check node networking / NetworkPolicy.
  • cloudflared image too old → must be ≥ 2025.7.0 for Workers VPC (var.cloudflared_image).
  • Tunnel deleted/disabled in Cloudflare, or the connector can't resolve the gateway Service.

3. Remediate

Restart the connector (kubectl -n metrics-gateway rollout restart deploy/cloudflared-grpc-metrics-gateway). If the token is stale, re-apply grpc_metrics_gateway_tunnel.tf. Confirm cloudflared_tunnel_ha_connections >= 1 and that a Worker can reach the gateway (components/metrics_gateway_worker/perf/run_perf.sh --verify-gateway).