Google Research has introduced TurboQuant, a compression method for large language models and vector search systems. In tests, it sharply reduced memory use while preserving model accuracy.
The team also presented two related algorithms, Quantized Johnson-Lindenstrauss (QJL) and PolarQuant. Together, the three methods aim to cut the memory needed to store and process high-dimensional vectors, which are widely used in AI tasks such as language modelling, search, and similarity matching.
High-dimensional vectors are central to many modern AI workloads, but they can be memory-intensive. A key pressure point is the key-value cache, which stores the attention keys and values computed for earlier tokens so a model can reuse them during inference, especially in long-context tasks that involve large amounts of text.
Traditional vector quantisation has long been used to compress this data. But existing approaches often need extra storage for quantisation constants, creating overhead that limits the gains from compression.
Compression approach
TurboQuant is designed to address that problem in two stages. First, it applies PolarQuant, which rotates vectors and quantises them piece by piece. It then applies QJL to the residual error left by that first stage, at a cost of a single extra bit per value.
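The two-stage idea, quantise coarsely and then spend one extra bit per value on the residual, can be sketched generically. Both stages in the snippet below are simple stand-ins (uniform rounding plus a scaled sign bit), not Google's actual PolarQuant and QJL stages:

```python
import numpy as np

def two_stage_quantize(x):
    """Illustrative residual quantisation: a coarse first stage, then a
    one-bit second stage on what the first stage missed. The stages here
    are placeholders, not Google's PolarQuant/QJL implementations."""
    # Stage 1: 4-bit uniform scalar quantisation as a stand-in for PolarQuant.
    levels = 2 ** 4
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    q1 = np.round((x - lo) / step) * step + lo
    # Stage 2: encode the residual with one sign bit per value, scaled by
    # the residual's mean magnitude (a stand-in for the QJL stage).
    r = x - q1
    q2 = np.sign(r) * np.abs(r).mean()
    return q1, q1 + q2

rng = np.random.default_rng(0)
x = rng.normal(size=1024).astype(np.float32)
q1, xq = two_stage_quantize(x)
err_stage1 = np.linalg.norm(x - q1) / np.linalg.norm(x)
err_both = np.linalg.norm(x - xq) / np.linalg.norm(x)
```

The point of the residual stage is visible directly: `err_both` is always strictly smaller than `err_stage1`, because the sign-bit correction removes the mean magnitude of the residual from every coordinate.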
According to the researchers, QJL uses the Johnson-Lindenstrauss Transform to reduce dimensionality while preserving relationships between data points. It then stores each resulting value as a sign bit, aiming to avoid the memory overhead of full-precision constants.
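That sign-bit scheme admits a compact sketch. The snippet below is a plausible reading of the published QJL idea, not Google's code: project with a random Gaussian matrix, keep only the sign bits plus one stored norm per vector, and estimate inner products from sign agreement. The `sqrt(pi/2) / m` factor makes the estimate unbiased for a Gaussian projection; the dimensions are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 4096                 # original / projected dimensions (illustrative sizes)
S = rng.normal(size=(m, d))     # random Gaussian Johnson-Lindenstrauss projection

def qjl_encode(k):
    """Store only the sign of each projected coordinate, plus one norm per vector."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, code):
    """Estimate <q, k> from k's sign bits; the sqrt(pi/2)/m factor makes the
    estimator unbiased for a Gaussian projection. A sketch, not Google's code."""
    signs, k_norm = code
    return k_norm * np.sqrt(np.pi / 2) / m * ((S @ q) @ signs)

k = rng.normal(size=d)
q = rng.normal(size=d)
est = qjl_inner_product(q, qjl_encode(k))
exact = q @ k
```

The key memory property is that the stored code is one bit per projected coordinate plus a single scalar norm, rather than a table of full-precision quantisation constants.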
PolarQuant takes a different approach, expressing vectors in polar rather than Cartesian coordinates. This avoids a costly normalisation step and removes part of the overhead found in standard methods.
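A minimal sketch of the polar idea, assuming 2-D sub-vectors and a uniform angle grid (the block structure and bit width here are illustrative choices, not the paper's):

```python
import numpy as np

def polar_quantize(x, angle_bits=6):
    """Sketch of polar-coordinate quantisation: split the vector into 2-D
    sub-vectors, keep each sub-vector's radius, and snap each angle to a
    uniform grid. Block size and bit width are illustrative assumptions."""
    pairs = x.reshape(-1, 2)
    r = np.hypot(pairs[:, 0], pairs[:, 1])
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])        # angle in [-pi, pi)
    levels = 2 ** angle_bits
    step = 2 * np.pi / levels
    code = np.round(theta / step).astype(int) % levels  # what would be stored
    theta_hat = code * step                             # decoded angle
    out = np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1)
    return out.reshape(x.shape)

rng = np.random.default_rng(3)
x = rng.normal(size=128)
xq = polar_quantize(x)
# Worst-case relative error is bounded by the angle step: about step/2 here.
err = np.linalg.norm(x - xq) / np.linalg.norm(x)
```

Because the radius is kept exactly in this sketch, the reconstruction error comes only from snapping angles, which is what bounds it by roughly half the angular step.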
The work targets two commercially important areas: the key-value cache in large language models, where lower memory use can reduce bottlenecks, and vector search, which underpins semantic retrieval systems that match content by meaning rather than exact keywords.
Test results
The team evaluated the methods on long-context benchmarks including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. Tests were carried out on open-source models including Gemma and Mistral.
Google Research said TurboQuant achieved what it described as optimal performance on measures including dot-product distortion and recall, while also reducing key-value memory use. In needle-in-a-haystack tests, the method preserved downstream accuracy across all benchmarks while shrinking the key-value cache by at least six times.
The researchers also said TurboQuant could quantise the key-value cache to 3 bits without training or fine-tuning and without reducing model accuracy in the trials. They added that 4-bit TurboQuant delivered up to an eightfold speed-up over unquantised 32-bit keys on H100 graphics processors when computing attention logits.
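The headline ratios follow largely from bit widths: attention-logit computation is typically memory-bandwidth bound, so reading 4-bit keys instead of 32-bit ones moves an eighth of the data. A toy sizing calculation, with every model parameter assumed for illustration (a 16-bit baseline compressed to 3 bits gives just over a fivefold reduction; the exact figure depends on the baseline precision and per-vector overheads):

```python
# Toy KV-cache sizing; every model parameter here is hypothetical.
layers, heads, head_dim = 32, 32, 128
context = 32_768                                    # tokens held in the cache
values_per_token = 2 * layers * heads * head_dim    # keys + values

def cache_gib(bits_per_value):
    """Cache size in GiB at a given bit width per stored value."""
    return context * values_per_token * bits_per_value / 8 / 2**30

fp16 = cache_gib(16)   # half-precision baseline
q3 = cache_gib(3)      # 3-bit cache, ignoring small per-vector overheads
ratio = fp16 / q3      # 16/3, roughly 5.3x before overhead accounting
```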
In vector search tests, TurboQuant was compared with methods including PQ and RaBitQ. According to the team, it delivered higher 1@k recall than those baselines, even when the alternatives used larger codebooks and dataset-specific tuning.
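Recall 1@k asks how often the true nearest neighbour survives compression: score each query against the compressed database and check whether the exact best match lands among the top k candidates. A minimal version of the metric (illustrative, not the paper's evaluation harness):

```python
import numpy as np

def recall_1_at_k(queries, db, db_compressed, k=10):
    """Fraction of queries whose exact nearest neighbour (by inner product)
    appears in the top-k candidates scored against the compressed database."""
    hits = 0
    for q in queries:
        true_nn = int(np.argmax(db @ q))                 # exact best match
        approx_scores = db_compressed @ q                # scores after compression
        topk = np.argpartition(-approx_scores, k - 1)[:k]
        hits += true_nn in topk
    return hits / len(queries)

rng = np.random.default_rng(2)
db = rng.normal(size=(1000, 32)).astype(np.float32)
queries = rng.normal(size=(50, 32)).astype(np.float32)
# A toy "compressed" database: round every value to one decimal place.
db_c = np.round(db, 1)
r = recall_1_at_k(queries, db, db_c, k=10)
```

With such mild rounding the recall stays close to 1.0; heavier compression trades recall for memory, which is the axis on which the baselines above are compared.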
Broader significance
The work reflects a wider push across the AI industry to lower the cost of inference and retrieval as models grow larger and handle longer contexts. Compression methods that reduce memory demands without retraining are especially notable because they can, in principle, be applied to deployed systems with less disruption than model redesigns.
Google Research linked the work not only to language models but also to semantic search, where large databases of vectors must be indexed and queried efficiently. That has become increasingly important as search systems shift towards intent-based retrieval and recommendation systems rely more heavily on vector similarity.
"We introduce a set of advanced theoretically grounded quantization algorithms that enable massive compression for large language models and vector search engines," said Amir Zandieh, Research Scientist, and Vahab Mirrokni, VP and Google Fellow, Google Research.
The researchers also described the methods as fundamental algorithmic work rather than only engineering changes. They said the approach is supported by theoretical proofs and operates near theoretical lower bounds, making it suitable for large-scale systems where efficiency and predictability matter.
The project involved collaboration with Praneeth Kacham at Google, Majid Hadian, Principal Engineer at Google DeepMind, Insu Han, Assistant Professor at KAIST, Majid Daliri at NYU, Lars Gottesbüren at Google, and Rajesh Jayaram at Google.