TurboQuant: Google's New Compression Tricks That Actually Work

Google Research just dropped three new compression algorithms at ICLR and AISTATS 2026, and I’ve been digging into the papers. The main one everyone’s talking about is TurboQuant, but there’s also Quantized Johnson-Lindenstrauss (QJL) and PolarQuant. Let me cut through the academic jargon and tell you what actually matters.

The problem nobody talks about

Vectors are how AI models think. Small vectors describe simple stuff like a point on a graph; high-dimensional ones capture complex things like image features or word meanings. The problem? These high-dimensional vectors eat memory for breakfast. The key-value cache, the high-speed cheat sheet a model keeps so it doesn't have to recompute the keys and values for earlier tokens, gets clogged fast.

Vector quantization is the classic fix: compress those vectors down to a few bits per number. But traditional methods have this annoying habit of adding their own memory overhead. You compress the data, but then you need to store quantization constants (typically a scale and an offset for every small block of numbers) in full precision, which adds 1-2 extra bits per number. It’s like patching a leaky pipe and springing a new leak.
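To put numbers on that overhead, here is a minimal sketch of ordinary block-wise min-max quantization. It is not taken from the papers; the function names, the 4-bit width, the block size of 32, and the float16 constants are all my assumptions, chosen only to show where the extra bits come from.

```python
import numpy as np

def blockwise_quantize(x, bits=4, block=32):
    """Min-max quantize x in blocks (illustrative; len(x) must divide by block)."""
    x = x.reshape(-1, block)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)   # guard against constant blocks
    codes = np.round((x - lo) / scale).astype(np.uint8)
    # The two per-block constants are the overhead: stored as float16 here.
    return codes, scale.astype(np.float16), lo.astype(np.float16)

def effective_bits(bits, block, const_bits=16, n_consts=2):
    """Bits actually stored per number once the constants are counted."""
    return bits + n_consts * const_bits / block

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
codes, scale, lo = blockwise_quantize(x)
x_hat = (codes * scale.astype(np.float32) + lo.astype(np.float32)).reshape(-1)

print(effective_bits(4, 32))  # 5.0 -> one extra bit per number
print(effective_bits(4, 16))  # 6.0 -> two extra bits per number
```

With a nominal 4-bit budget, the constants push the real cost to 5 or 6 bits per number depending on block size, which is where the 1-2 bit figure comes from under these assumptions.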

TurboQuant skips the overhead

TurboQuant does two things in sequence. First, it randomly rotates the data vectors. Sounds weird, but the rotation spreads each vector's energy evenly across its coordinates, which simplifies the geometry enough that a standard quantizer can handle each coordinate individually. This first stage does the heavy lifting, using most of the bits to capture the vector’s essence.
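Here is a rough sketch of the rotate-then-quantize idea, not the paper's implementation. I'm assuming a dense random orthogonal matrix from a QR decomposition, a plain uniform grid covering three standard deviations, and one stored norm per vector; the quantizer in the paper may well differ. All function names are hypothetical.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the rotation is uniformly random

def quantize_rotated(x, rot, bits=4):
    """Rotate a nonzero vector, then quantize every coordinate on one fixed grid."""
    d = x.shape[0]
    norm = np.linalg.norm(x)
    y = rot @ (x / norm)              # coordinates now look roughly N(0, 1/d)
    lim = 3.0 / np.sqrt(d)            # grid depends only on d, not on the data
    step = 2 * lim / 2**bits
    codes = np.clip(np.floor((y + lim) / step), 0, 2**bits - 1).astype(np.uint8)
    return codes, norm

def dequantize_rotated(codes, norm, rot, bits=4):
    """Map codes back to bin centers and undo the rotation."""
    d = codes.shape[0]
    lim = 3.0 / np.sqrt(d)
    step = 2 * lim / 2**bits
    y_hat = (codes + 0.5) * step - lim
    return norm * (rot.T @ y_hat)

d = 128
rot = random_rotation(d)
x = np.zeros(d)
x[7] = 5.0                            # worst case for a naive per-coordinate quantizer
codes, norm = quantize_rotated(x, rot)
x_hat = dequantize_rotated(codes, norm, rot)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(rel_err)
```

The input is deliberately spiky: all of its energy sits in one coordinate. After rotation that energy is smeared across all 128 coordinates, so a single grid fixed in advance works and no per-block scale or offset has to be stored.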

Then comes the clever part: it takes the leftover error from that first pass and applies QJL with just 1 bit. QJL acts as a mathematical error-corrector that removes the bias the first stage introduces into inner products, and therefore into the attention score. The reported result is massive compression with no measurable accuracy loss.
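The two-stage idea can be sketched like this. Treat it as a toy under stated assumptions: the first stage is a stand-in that simply rounds to a fixed grid, the projection matrix is Gaussian, and the number of projections `m` is set far larger than anything practical so that a single estimate visibly lands near the true value. The function names are mine, not the paper's.

```python
import numpy as np

def coarse_quantize(x, step=0.1):
    """Stage 1 stand-in: round each coordinate to a fixed grid."""
    return np.round(x / step) * step

def encode_residual(residual, proj):
    """Stage 2: keep one sign bit per projection, plus the residual's norm."""
    return np.sign(proj @ residual).astype(np.int8), np.linalg.norm(residual)

def estimate_dot(q, x_hat, signs, r_norm, proj):
    """<q, x> ~= <q, x_hat> + 1-bit estimate of <q, residual>."""
    m = proj.shape[0]
    # sqrt(pi/2) undoes the shrinkage caused by keeping only signs.
    correction = np.sqrt(np.pi / 2) / m * r_norm * ((proj @ q) @ signs)
    return q @ x_hat + correction

rng = np.random.default_rng(0)
d, m = 64, 16384                       # m is oversized purely for the demo
x = rng.standard_normal(d); x /= np.linalg.norm(x)
q = rng.standard_normal(d); q /= np.linalg.norm(q)
proj = rng.standard_normal((m, d))     # shared random projection

x_hat = coarse_quantize(x)
signs, r_norm = encode_residual(x - x_hat, proj)

true_dot = q @ x
stage1_only = q @ x_hat
corrected = estimate_dot(q, x_hat, signs, r_norm, proj)
print(true_dot, stage1_only, corrected)
```

The query stays in full precision; only the stored vector is compressed. The correction term is an unbiased estimate of the part of the inner product that the first stage lost, which is the sense in which the second stage removes bias.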

QJL: the 1-bit trick that works

QJL is deceptively simple. It uses the Johnson-Lindenstrauss Transform, a random projection, to shrink high-dimensional data while approximately preserving the distances and inner products between points. Then it reduces each resulting number to a single sign bit, just +1 or -1. Zero memory overhead. To maintain accuracy, it uses a special estimator that pairs a full-precision query with the 1-bit data. It’s the kind of hack that makes you wonder why nobody did it sooner.
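For the curious, here is a minimal sketch of a sign-bit estimator in this spirit. I'm assuming a Gaussian projection matrix and the standard sqrt(pi/2) rescaling for sign quantization; the function names are hypothetical, and `m` is again much larger than you would use in practice so the single estimate is easy to eyeball.

```python
import numpy as np

def qjl_encode(key, proj):
    """Project the key, keep only the signs, and pack them 8 per byte."""
    bits = np.packbits(proj @ key > 0)
    return bits, np.linalg.norm(key)

def qjl_dot(query, bits, key_norm, proj):
    """Estimate <query, key> from a full-precision query and 1-bit key."""
    m = proj.shape[0]
    # Unpack to 0/1, then map to -1/+1 in float to avoid uint8 wraparound.
    signs = np.unpackbits(bits, count=m).astype(np.float64) * 2.0 - 1.0
    # E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / |k| for Gaussian s,
    # so multiplying by sqrt(pi/2) * |k| makes the estimate unbiased.
    return np.sqrt(np.pi / 2) / m * key_norm * ((proj @ query) @ signs)

rng = np.random.default_rng(1)
d, m = 64, 16384
key = rng.standard_normal(d); key /= np.linalg.norm(key)
query = rng.standard_normal(d); query /= np.linalg.norm(query)
proj = rng.standard_normal((m, d))

bits, key_norm = qjl_encode(key, proj)
estimate = qjl_dot(query, bits, key_norm, proj)
print(bits.nbytes, query @ key, estimate)
```

Note the asymmetry: the key is squashed to sign bits, while the query is projected but never quantized. Apart from one norm per key, there are no scales or offsets to store.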

PolarQuant changes the angle

PolarQuant takes a completely different approach. Instead of representing vectors in standard Cartesian (X, Y, Z) coordinates, it converts them into polar coordinates: angles and a magnitude. This sidesteps the memory overhead problem because angles live in a fixed, known range, so the quantizer doesn't need stored per-block constants to normalize them. TurboQuant uses PolarQuant as its first-stage compressor.
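A one-level toy version of the polar idea looks like this. It is a sketch under my own assumptions, not the paper's scheme: coordinates are paired up, each pair becomes a radius and an angle, the angle is quantized on a fixed grid, and the radii are simply left in floating point. The function names are made up for illustration.

```python
import numpy as np

def to_polar_pairs(x):
    """Treat consecutive coordinate pairs as 2-D points: (radius, angle)."""
    pairs = x.reshape(-1, 2)
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])   # always in [-pi, pi]
    return r, theta

def quantize_angles(theta, bits=4):
    """Uniform grid over the full circle; no data-dependent constants."""
    levels = 2**bits
    step = 2 * np.pi / levels
    codes = np.floor((theta + np.pi) / step).astype(np.int64) % levels
    return codes.astype(np.uint8)

def from_polar_pairs(r, codes, bits=4):
    """Rebuild Cartesian coordinates from radii and angle bin centers."""
    step = 2 * np.pi / 2**bits
    theta_hat = (codes + 0.5) * step - np.pi
    return np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
r, theta = to_polar_pairs(x)
codes = quantize_angles(theta, bits=4)
x_hat = from_polar_pairs(r, codes, bits=4)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(rel_err)
```

Because every angle falls in the same range no matter what the data looks like, the grid is known ahead of time. In this toy the radii still cost full precision, so it demonstrates the fixed-range property rather than any real compression ratio.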

What this means in practice

I’ve been testing these on some internal benchmarks, and the results are solid. The key-value cache bottlenecks that plague inference at scale? Significantly reduced. Vector search engines? Faster similarity lookups without the memory tax.

Google tested these across multiple model sizes and tasks. The accuracy holds up even at extreme compression ratios. I’m particularly interested in how this plays out for on-device AI where memory is tight. If you can run a decent model with a fraction of the cache memory, that opens up a lot of possibilities.

The papers are worth reading if you’re into the math, but the takeaway is straightforward: we now have compression methods that don’t suck. No measurable accuracy loss in the reported results, no hidden overhead, just efficient representation.
