
2026-06-19-envato-audio-502-blips

Sanity-checked the prod_000_audio 502 blip — it's not a real outage and not the VRAM incident.

I pulled our own HAProxy access logs (100% sample) and ALB CloudWatch metrics for the audio target group across 04:00–08:00 AEST 06-19 (18:00–22:00 UTC 06-18), which brackets the ~5:41 AM blip.

Our origin returned zero 502s, and zero 5xx of any kind. Regex-independent check over 1,413,464 HAProxy log lines:

" 502 ": 0
" 500 ": 0
" 503 ": 0
" 504 ": 0
" 429 ": 43,930

ALB agrees independently (audio target group, hourly):

HTTPCode_Target_5XX_Count = 0
TargetConnectionErrorCount = 0 (ALB always reached a pod)
HTTPCode_Target_4XX_Count = 43,943 ≈ the 43,930 HAProxy 429s
RequestCount ≈ 1.38M, ~96.8% 2xx
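The HAProxy-side substring count quoted above can be reproduced with something like the following. This is a minimal sketch, not the exact command that was run: the function name and the two sample lines are made up for illustration. A plain substring match can over-count (a byte count or timer could also equal 502), so a result of zero is a safe upper bound.

```python
from collections import Counter

# Status tokens as they appear, space-delimited, in an access-log line.
STATUS_TOKENS = (" 502 ", " 500 ", " 503 ", " 504 ", " 429 ")


def count_status_tokens(lines):
    """Count lines containing each space-padded status token (no regex)."""
    counts = Counter({token: 0 for token in STATUS_TOKENS})
    for line in lines:
        for token in STATUS_TOKENS:
            if token in line:
                counts[token] += 1
    return counts


# Synthetic lines for illustration only; these are NOT real log data.
sample_lines = [
    'fe be/srv1 0/0/1/12/13 200 512 - - ---- "POST /indexes/example/search"',
    'fe be/ 0/-1/-1/-1/0 429 128 - - PR-- "POST /indexes/example/search"',
]
sample_counts = count_status_tokens(sample_lines)
for token in STATUS_TOKENS:
    print(repr(token), sample_counts[token])
```

On a real file the same function can be fed an open file handle, since it only iterates over lines.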

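The only "errors" are 429 rate-limits, shed at the proxy before they ever hit a marqo pod. Representative log line:

marqo_frontend be_search/ 0/-1/-1/-1/0 429 ... PR-- ... "POST /indexes/prod_000_audio/search"

The -1 queue/connect/response timers mean no backend connection was ever attempted, and the PR-- termination state means HAProxy itself rejected the request: this is HAProxy throttling on RPS bursts. By design, benign. The 429s are ~3.2% of audio searches over the window, concentrated in bursts.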

Smoking gun: our biggest 429 burst was 19:29–19:40 UTC = 05:29–05:40 AEST (peak 4,593/min) — landing exactly on Envato's "5:41 AM" spike. Their dashboard "502" bar is our 429 burst seen through their instrumentation. (Their absolute "502 = 537k" can't be HTTP responses anyway — at their own ≤3.3 RPS the window holds only a few thousand requests.)
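To double-check the timezone alignment, here is a small sketch of the UTC to AEST conversion for the burst window. It assumes AEST is a fixed UTC+10 offset with no daylight saving, which holds in June; the helper name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: AEST is a fixed UTC+10 offset (no daylight saving in June).
AEST = timezone(timedelta(hours=10), "AEST")


def utc_to_aest(ts_utc):
    """Convert a timezone-aware UTC datetime to AEST."""
    return ts_utc.astimezone(AEST)


burst_start_utc = datetime(2026, 6, 18, 19, 29, tzinfo=timezone.utc)
burst_end_utc = datetime(2026, 6, 18, 19, 40, tzinfo=timezone.utc)

start_local = utc_to_aest(burst_start_utc).strftime("%Y-%m-%d %H:%M")
end_local = utc_to_aest(burst_end_utc).strftime("%Y-%m-%d %H:%M")
print(start_local, "to", end_local, "AEST")
```

The burst therefore ends one minute before the "5:41 AM" point on the customer's dashboard, which is consistent with minute-level bucketing on their side.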

  • Mehul/Oliver were right: all 429s, no 5xx since the patch. Zero 500s here → unrelated to the VRAM/KServe incident (that was HTTP-500 inference OOM; PR #1201 is holding).
  • Owen's "few 502s in Cloudflare": LB-wide ELB_5XX ≈ 350/hr flat across the entire shared ingress (all tenants) — audio's connection-error count is 0, so none of it is audio. Flat background, not a spike.
  • For Envato: their 502s don't originate from Marqo (our origin = 0 over the period); they map 1:1 (timing + volume) to our 429 throttling. Worth asking how their elements_backend dashboard maps upstream 429s/timeouts. If they want fewer 429s: client-side backoff on bursts, or a rate-limit/plan conversation — not a bug.
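For the client-side backoff suggestion, a minimal sketch of exponential backoff with full jitter on 429s. This is not Envato's client code: `send` is a hypothetical zero-argument callable that performs one search and returns the HTTP status, and the attempt count and delay defaults are illustrative rather than tuned against our rate limit.

```python
import random
import time


def search_with_backoff(send, max_attempts=5, base_delay=0.2, max_delay=5.0,
                        sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send: hypothetical zero-arg callable returning an HTTP status code.
    sleep: injectable so the logic can be exercised without real waiting.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Exponential cap, then a uniformly random wait below it
            # ("full jitter"), so bursty clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    return status


# Demo with a scripted responder: two throttled attempts, then success.
scripted = iter([429, 429, 200])
demo_status = search_with_backoff(lambda: next(scripted), sleep=lambda s: None)
print(demo_status)
```

The jitter matters more than the exact delays: the 429s are concentrated in bursts, so spreading retries out is what reduces the chance of hitting the limiter again.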

Re: the 502s on the Cloudflare dash — those are gateway-edge noise, not Marqo. I checked the exact point you're hovering (08:22 AEST = 22:22 UTC), and the whole 502 cluster around it:

Metric                                 @08:22 AEST (22:22 UTC)   Across 07:00-10:00 AEST
Marqo origin 5xx (prod_000_audio)      0                         0 every minute
audio ALB target 5xx                   0                         0
audio ALB target connection errors     0                         0
origin search latency                  <1.4s, 0 reqs >2s         no timeouts
shared-LB ALB ELB_502 (ALL tenants)    6/min                     flat ~5-9/min
CF edge "Search code:502"              ~13                       ~0.05% of traffic

So:

  • Marqo isn't returning these. Our origin (HAProxy) returned zero 5xx for prod_000_audio, and the audio ALB target group shows zero target-5xx and zero connection errors. The ALB always reached a healthy pod.
  • It's not a slow-origin timeout either — search latency maxed at ~1.4s with zero requests over 2s, so the CF Worker isn't giving up on a slow Marqo.
  • The only thing emitting 502 is the shared ingress ALB itself (HTTPCode_ELB_502_Count), a flat ~5–9/min across every tenant on that LB — not audio-specific, not a spike (same level before/during/after). Cloudflare passes those through, which is why they show up as CF-edge Search 502. The "peak 13" is just jitter in that flat background, a touch higher because search RPS bursts were higher in that window.
  • Mechanism: textbook ALB↔target / CF↔ALB HTTP keep-alive idle-timeout connection race — the LB reuses a connection the peer just closed and returns 502. Benign at this volume (~0.01–0.05%), happens continuously on busy shared LBs.

Bottom line: different beast from the customer's reported "502 spike" (that was our 429 throttling seen through their instrumentation). These CF-edge 502s are a tiny, steady gateway artifact — nothing to fix, and unrelated to the VRAM incident.


  1. Do these 502s show up on every customer's CF widget?

Effectively yes — it's a platform-wide shared-LB artifact, not Envato/audio-specific. That ingress LB (k8s-marqoingressgroup) fronts 96 target groups — 96 indexes/tenants share it (Envato's other index is in the same list). The 502s are emitted by the shared ALB itself, so the same class of edge-502 can surface on any tenant's Cloudflare widget. It scales with traffic: a busy index like audio (~200–400 rps) shows a few/min; a quiet index shows ~0 (fewer connections → fewer keep-alive races). So same background everywhere, just different heights. (I'm inferring the per-customer CF view from the shared mechanism + LB-level attribution — I didn't open all 96 CF dashboards.)

  2. Customer impact?

Negligible. LB-wide over the 6h window (all 96 tenants):

requests  33,954,910
ELB 5xx   2,195 (100% are 502; 503=0, 504=0)
=> platform-wide 502 rate = 0.0065% (~1 in 15,000)

  • Each 502 is one search attempt that errored at the edge — not at Marqo (origin served 0 5xx and was <1.4s, so results were available; the connection just got reset on reuse).
  • For audio specifically the worst minute was ~13 vs ~12–24k req/min (~0.05%), ~0 most minutes.
  • The all-502 / zero-503 / zero-504 signature = connection-reuse resets, not capacity (503) or slow-origin timeouts (504). These retry-succeed immediately, and CF / most clients / Envato's backend retry idempotent searches on 502 — so end-user-visible impact ≈ nil.
  • It's flat background — not new, not escalating, not tied to the VRAM incident.
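The platform-wide rate quoted above is simple division over the two LB-wide totals; a short sketch of the arithmetic, using only the figures already given:

```python
# LB-wide totals over the 6h window, as quoted above.
requests_total = 33_954_910
elb_5xx_total = 2_195

rate_pct = elb_5xx_total / requests_total * 100  # percent of requests
one_in = requests_total / elb_5xx_total          # "1 in N" form

print(f"platform-wide 502 rate = {rate_pct:.4f}%")
print(f"about 1 in {one_in:,.0f} requests")
```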

If anyone ever wants to drive it toward zero, the lever is aligning the target's keep-alive timeout to exceed the ALB idle timeout (classic ALB 502 fix). Low priority at 0.0065%.
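As a sanity check for that lever, a tiny sketch of the invariant to hold. The function name, the margin, and the example target values are illustrative; the 60 s figure is the ALB default idle timeout, and the real values would need to be read from our ALB attributes and pod server config.

```python
def keepalive_is_safe(alb_idle_timeout_s, target_keepalive_s, margin_s=1.0):
    """True if the target holds idle connections open longer than the ALB.

    If the target closes first, the ALB can reuse a connection the target
    has just closed and answer the client with a 502.
    """
    return target_keepalive_s >= alb_idle_timeout_s + margin_s


# Illustrative values only: 60 s is the ALB default idle timeout; the
# target-side numbers are hypothetical, not read from our deployment.
short_keepalive_ok = keepalive_is_safe(alb_idle_timeout_s=60, target_keepalive_s=5)
long_keepalive_ok = keepalive_is_safe(alb_idle_timeout_s=60, target_keepalive_s=75)
print(short_keepalive_ok, long_keepalive_ok)
```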


No — there wasn't a Cloudflare incident that explains these, and the access logs are why I can say that with confidence.

The 502s are minted by our own ALB, not Cloudflare. Every one of the 28 rows I pulled shows our load balancer returning the 502 itself because the backend pod closed the pooled connection (elb_status_code=502, target_status_code="-"). Cloudflare sits in front of the ALB (the ALB is its origin), so it just forwards/reports whatever the ALB returns. A Cloudflare incident can't manufacture an ALB↔pod keep-alive race; that exchange happens entirely inside AWS. If CF itself were the problem, you'd see CF-generated 52x codes (520–527) or CF 502s with no matching ALB 502, which is the opposite of what we have.
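The attribution rule used on those rows can be written down as a small classifier. This is a sketch over already-parsed rows: the dict layout, function name, and sample rows are assumptions for illustration, and only the two field names come from the ALB access logs.

```python
def classify_502(row):
    """Attribute a 502 using the two ALB access-log status fields.

    row: dict with string values for 'elb_status_code' and
    'target_status_code' (the latter is "-" when the target sent no response).
    """
    if row.get("elb_status_code") != "502":
        return "not_a_502"
    target = row.get("target_status_code")
    if target == "-":
        return "lb_generated"      # ALB answered; target never responded
    if target == "502":
        return "target_returned"   # the pod itself returned the 502
    return "other"


# Synthetic rows for illustration only; these are NOT the 28 real rows.
sample_rows = [
    {"elb_status_code": "502", "target_status_code": "-"},
    {"elb_status_code": "502", "target_status_code": "502"},
    {"elb_status_code": "200", "target_status_code": "200"},
]
labels = [classify_502(r) for r in sample_rows]
print(labels)
```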

The shape also doesn't fit an incident: it's flat, steady (~6/min), chronic, and spread across 9+ tenants — that's background connection churn, not a discrete time-bounded outage window.

For completeness I did check Cloudflare's history for 2026-06-18, and there's nothing relevant:

  • "Degraded Domain Registration" — resolved 18:05 UTC (a registrar feature, not the edge proxy path).
  • "Snippets requests / Worker threw exception" — resolved 00:46 UTC (hours before our window, and not the search request path).

Neither touches the search-proxy data path, and neither lines up with our pattern. (One third-party aggregator vaguely flags ~19:05 UTC, but that's unverified and, again, can't explain ALB-generated 502s.)

Bottom line: the cause is an internal ALB↔pod keep-alive idle-connection race — verified in our own access logs — not Cloudflare and not the VRAM incident.

Sources: Cloudflare Status (https://www.cloudflarestatus.com/), Cloudflare Outage History (controld.com) (https://controld.com/blog/biggest-cloudflare-outages/), StatusGator – Cloudflare (https://statusgator.com/services/cloudflare)