Research

DeepMind's TurboQuant: 6.2× KV-cache compression, no perplexity loss

At ICLR 2026, DeepMind's Yury Makarychev presented TurboQuant, which composes PolarQuant (a randomized rotation that makes key/value distributions near-Gaussian) with a quantized Johnson–Lindenstrauss projection. Together they compress the KV cache 6.2× at identical perplexity. On a Gemini 3.1 Ultra 2M-token workload, GPU memory dropped from 380GB to 62GB per request. Google says the technique ships in Gemini's April 18 update. On-device long-context inference suddenly looks tractable, and data-center inference costs fall sharply.
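Both ingredients are standard primitives, so the pipeline is easy to sketch. Below is a minimal NumPy illustration of the general idea (rotate, project down, quantize): the Gaussian JL matrix, the 4-bit symmetric quantizer, and all dimensions are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix;
    # the sign fix makes the distribution uniform over rotations.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def jl_projection(d, k):
    # Gaussian JL matrix: maps d dims down to k while approximately
    # preserving pairwise distances with high probability.
    return rng.standard_normal((k, d)) / np.sqrt(k)

def quantize_int4(x):
    # Symmetric per-row uniform quantizer to 4 bits (stored in int8).
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy KV cache: 1024 cached key vectors, head dim 128.
d, k = 128, 96  # projected dim is a made-up example value
keys = rng.standard_normal((1024, d)).astype(np.float32)

R = random_rotation(d)   # rotation step (PolarQuant-style, a sketch)
P = jl_projection(d, k)  # JL projection step (a sketch)

rotated = keys @ R.T      # spreads energy so coordinates look Gaussian
projected = rotated @ P.T # 128 -> 96 dims
q, scale = quantize_int4(projected)

approx = dequantize(q, scale)
err = np.linalg.norm(approx - projected) / np.linalg.norm(projected)
print(f"relative quantization error: {err:.3f}")
```

Note that R and P compose into a single fixed matrix, so in a real serving stack this would be one matmul applied once per token at cache-write time.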

DeepMind · ICLR 2026 · Quantization · KV Cache · Inference

Why it matters

6.2× KV-cache compression with no perplexity loss is the kind of infrastructure win that quietly reshapes inference economics. A 2M-token context suddenly becomes plausible on a laptop GPU, and data-center inference costs drop fast. If TurboQuant generalizes beyond Gemini, it's a step toward making long-context inference a commodity rather than a premium tier (see the back-of-envelope check below).
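For a sense of scale, here is a back-of-envelope check of the memory claim. The model dimensions below are hypothetical, chosen only so the fp16 baseline lands near the reported 380GB; Gemini's actual architecture is not public.

```python
# Back-of-envelope KV-cache sizing. All model dimensions are
# hypothetical placeholders, not Gemini's real configuration.
layers, kv_heads, head_dim = 48, 8, 128  # assumed
tokens = 2_000_000
bytes_per_elem = 2  # fp16/bf16 baseline

# Keys + values: two cached tensors per layer.
kv_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem
print(f"uncompressed KV cache: {kv_bytes / 1e9:.0f} GB")   # ~393 GB
print(f"at 6.2x compression:   {kv_bytes / 6.2 / 1e9:.0f} GB")  # ~63 GB
```

Under those assumptions the numbers land close to the reported 380GB → 62GB, which is consistent with the KV cache dominating per-request memory at 2M tokens.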

Impact scorecard

Overall: 7.8/10
Stakes: 8.0
Novelty: 8.0
Authority: 9.0
Coverage: 5.5
Concreteness: 9.0
Social: 6.0
FUD risk: 2.0
Coverage: 13 outlets · 2 tier-1
Google Research blog, ICLR 2026 proceedings, The Gradient, HPCwire, The Register, Semianalysis
X / Twitter: 4,200 mentions
@GoogleResearch · 6,700 likes
@dylan522p · 5,100 likes
Reddit: 2,800 upvotes
r/MachineLearning, r/LocalLLaMA

Trust check

high

Google Research paper + ICLR peer review + code released. The speedup numbers are self-reported, but the mechanism (rotation + JL projection) is theoretically sound and already has third-party reimplementations.

Primary source ↗