2026-06-19-kogan-latency-blips

Slack reply (paste-ready)

▎ Dug into this — good news: nothing actually broke last night. Kogan's Marqo backend behaved normally in both windows. I pulled search latency across all 14 indexes (account 23pqyh37, cell2) and compared the exact windows to the previous two nights:
▎
▎ | window                  | Jun 16      | Jun 17      | Jun 18 (last night) |
▎ |-------------------------|-------------|-------------|---------------------|
▎ | 7:20–8:20pm p95 / p99   | 153 / 351ms | 143 / 281ms | 151 / 362ms         |
▎ | 10:40–11:40pm p95 / p99 | 174 / 457ms | 134 / 277ms | 170 / 300ms         |
▎
▎ Last night is in line with the prior two nights. The mild evening rise and occasional 1–2.5s p99 blips happen every night in these windows (the 16th hit 2.1s, the 17th 1.6s). The one thing that stands out, a p99 of 2.5s at exactly 7:22pm, lasted a single minute: ~75 of 7,500 requests. Normal tail noise.
▎
▎ Ruled out: ❌ request surge (volume was declining, not spiking, which matches what you saw) · ❌ the shared GPU/e5 encoder (steady 14ms, 0 failures, 7.4GB VRAM free; yesterday's OOM did not recur) · ❌ Vespa (CPU ~20%, flat) · ❌ Marqo API tier (CPU <33%).
▎
▎ @Mehul re Pixel updates — checked: indexing/write load was normal last night, not a burst (docs declining with traffic, delete-batch flat ~880/min). So Pixel didn't cause these windows. (Side note: the write path is chronically slow, p95 ~1–2.3s — worth a separate look, but it's not new and isn't bleeding into search.)
▎
▎ So whatever the dashboard flagged, the backend was healthy. Two things would help pin it down: (1) which metric/percentile did the alert use, and does it include Cloudflare/edge time? (2) if the spike looked much bigger than ~185ms p95 on your side, it most likely came from the edge/CDN, the one layer our in-cluster metrics don't see. Happy to scan the Cloudflare logs for those windows if you think it was customer-facing.

What I checked & why I'm confident

Method: everything came from AMP (cell2 workspace), with no Athena queries ($0 cost). I aggregated across all 14 Kogan indexes via account_id="23pqyh37", decomposed search latency into an embedding step and a Vespa step, swept a full 42h to locate peaks regardless of timezone, then compared the reported windows across 3 nights.
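For anyone rerunning this, here is a minimal sketch of how the per-account percentile queries could be built. It assumes the AMP workspace accepts standard PromQL and that search latency is exported as a Prometheus histogram. The metric name `marqo_search_latency_seconds_bucket` is hypothetical (the real names are in results.md), and only the `account_id` label comes from the method above.

```python
# Sketch only: builds PromQL strings; it does not call AMP.
# ASSUMPTION: latency is a Prometheus histogram with an `le` bucket label.
# The metric name below is hypothetical, not the real exported name.
SEARCH_LATENCY_METRIC = "marqo_search_latency_seconds_bucket"


def quantile_query(metric: str, account_id: str, q: float, window: str = "5m") -> str:
    """Return a PromQL expression for quantile q across all indexes of one account.

    Summing bucket rates by `le` merges every index's histogram before the
    quantile is taken, so the result is one account-wide percentile.
    """
    selector = f'{metric}{{account_id="{account_id}"}}'
    return f"histogram_quantile({q}, sum by (le) (rate({selector}[{window}])))"


# p95 and p99 for the Kogan account used in this investigation.
P95_QUERY = quantile_query(SEARCH_LATENCY_METRIC, "23pqyh37", 0.95)
P99_QUERY = quantile_query(SEARCH_LATENCY_METRIC, "23pqyh37", 0.99)
```

Aggregating by `le` before taking the quantile matters here: averaging 14 per-index percentiles would not give a valid account-wide p95.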

The decisive findings:

3-night comparison: Jun 18 ≈ Jun 16 ≈ Jun 17 at the same clock times. No anomaly.
42h sweep: peak search p95 anywhere was only ~221ms, so there's no hidden spike in any timezone.
Backend decomposition: Vespa flat (~70ms, CPU ~20%), shared e5 encoder flat (14ms, 0 fails, GPU healthy), Marqo API tier unsaturated. The only elevated thing is the embedding round-trip tail (p99 ~0.9–2.4s), and it's higher at baseline than in the windows, i.e. chronic, not a Jun-18 event.
Volume declining through both windows: confirms Owen's "requests didn't spike"; latency rose as traffic fell.
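The 3-night comparison can be sanity-checked with a few lines. This is a rough sketch, not the method used for the writeup: it compares Jun 18 against the worse of the two prior nights, and the 10% tolerance is an assumed threshold chosen for illustration. The numbers are copied from the table in the Slack reply.

```python
# (p95_ms, p99_ms) per night and window, copied from the comparison table.
WINDOWS = {
    "7:20-8:20pm": {"Jun 16": (153, 351), "Jun 17": (143, 281), "Jun 18": (151, 362)},
    "10:40-11:40pm": {"Jun 16": (174, 457), "Jun 17": (134, 277), "Jun 18": (170, 300)},
}

# ASSUMPTION: a night counts as anomalous only if it exceeds the worst
# prior night by more than 10%. This threshold is illustrative.
TOLERANCE = 1.10


def ratio_to_prior_max(nights, target="Jun 18"):
    """Return (p95_ratio, p99_ratio) of the target night vs the worst prior night."""
    prior = [values for night, values in nights.items() if night != target]
    target_p95, target_p99 = nights[target]
    worst_p95 = max(p95 for p95, _ in prior)
    worst_p99 = max(p99 for _, p99 in prior)
    return target_p95 / worst_p95, target_p99 / worst_p99


def is_anomalous(nights, target="Jun 18", tolerance=TOLERANCE):
    """True if either percentile exceeds the worst prior night by the tolerance."""
    return any(ratio > tolerance for ratio in ratio_to_prior_max(nights, target))
```

With only two prior nights this is a coarse check rather than a statistical test. On the table's numbers, neither window crosses the assumed threshold.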

Full writeup with all tables and queries is in 2026-06-19-kogan-latency-spike/results.md.

One genuine standing issue surfaced that's worth a separate ticket (it is unrelated to last night's spike): Kogan's write path is chronically slow (/documents p95 ~1–2.3s at 1,300–3,300/min), likely the same EBS-throughput/feed-pressure theme from the 2026-05-20 investigation. Want me to profile that, or scan the Cloudflare edge logs for last night's windows if Owen confirms the spike was customer-facing?