Jin Daily AI Trivia – 1M Context Window? NVIDIA Turns KV Cache into JPEG
NVIDIA Research just dropped KVTC (KV Cache Transform Coding) — a technique that treats the KV cache in LLMs like an image compression problem. It applies a linear transform (PCA-style) to the features, quantizes the coefficients, and then runs DEFLATE on the GPU to pack everything tightly.
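For intuition, here's a toy NumPy sketch of that three-step pipeline. It's illustrative only: the function names and the 4-bit setting are made up for this post, and zlib's DEFLATE stands in for the GPU-side DEFLATE the paper runs on device:

```python
import numpy as np
import zlib

def kv_encode(kv_block, n_bits=4):
    """Toy KV transform coding: decorrelate -> quantize -> entropy-code.
    kv_block: float32 array of shape (num_tokens, feature_dim)."""
    # 1. Decorrelate: rotate features into a PCA basis (eigenvectors of
    #    the feature covariance), concentrating energy in few channels.
    mean = kv_block.mean(axis=0)
    centered = kv_block - mean
    cov = centered.T @ centered / len(kv_block)
    _, basis = np.linalg.eigh(cov)
    coeffs = centered @ basis

    # 2. Quantize: uniform per-channel quantization to n_bits integers.
    scale = np.abs(coeffs).max(axis=0) / (2 ** (n_bits - 1) - 1)
    scale = np.maximum(scale, 1e-8)          # guard dead channels
    quantized = np.round(coeffs / scale).astype(np.int8)

    # 3. Entropy-code: zlib's DEFLATE stands in for the GPU DEFLATE pass.
    packed = zlib.compress(quantized.tobytes())
    return packed, (mean, basis, scale, quantized.shape)
```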
The result? Up to 20x smaller KV memory (and up to 40x in some cases), with under 1% accuracy drop on long-context and reasoning benchmarks. It also delivers up to 8x faster time-to-first-token by reducing memory thrashing.
Until now, most KV “compression” has been pretty basic — lower precision, token dropping, or paging tricks. Basically turning a 24-bit BMP into a crusty 8-bit GIF.
KVTC is the first clean attempt to apply the full JPEG/MP3 mindset to KV: decorrelate, quantize, entropy encode — then plug it into existing systems via a KV block manager. No model retraining needed, since the KV cache is restored to its original precision before compute.
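That restore step is just the same toy chain run backwards. Continuing the sketch above (again illustrative, not the paper's code):

```python
def kv_decode(packed, side_info):
    """Invert the toy pipeline: inflate -> dequantize -> rotate back."""
    mean, basis, scale, shape = side_info
    quantized = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
    coeffs = quantized.reshape(shape).astype(np.float32) * scale
    # Undo the PCA rotation and re-add the mean; attention then runs on
    # full-precision KV, which is why no retraining is needed.
    return coeffs @ basis.T + mean

# Roundtrip on a fake KV block with correlated features (as real KV is):
kv = (np.random.randn(4096, 16) @ np.random.randn(16, 128)).astype(np.float32)
packed, info = kv_encode(kv)
restored = kv_decode(packed, info)
print(f"compression: {kv.nbytes / len(packed):.1f}x, "
      f"max abs error: {np.abs(kv - restored).max():.4f}")
```

The toy numbers won't match the paper's 20x, but the shape of the pipeline is the point: decorrelation is what makes the quantized stream cheap to entropy-code.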
So why is this only happening now?
Previously, most LLMs shipped with relatively “small” context windows — 128K or 256K tokens. But with the rise of agentic AI, 1M-token contexts are becoming the norm, and KV cache memory has suddenly become a real bottleneck — especially with current memory cost pressures.
PS: Google recently published a similar idea, TurboQuant, focused on the quantization stage alone with near-zero loss. KVTC goes a step further: the full transform-quantize-entropy-code stack, aiming at higher compression ratios while staying near-lossless on model quality.