KV Cache Compression Benchmark
Benchmarked 7 KV cache compression methods — ExpectedAttention, SnapKV, AdaKV, and others — on Needle-in-a-Haystack retrieval at 4K–8K context lengths using Llama 3.1 8B.
Finding: 5 of 7 methods maintain perfect retrieval accuracy at 70% token eviction. Despite this, post-prefill compression delivers under 1 GB of peak-memory reduction in practice, because the KV cache only dominates memory after prefill — by which point the prefill activations (the actual memory spike) are already gone.
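A back-of-envelope size check is consistent with the sub-1 GB figure. This is a sketch, not a measurement, assuming Llama 3.1 8B's published config (32 layers, 8 KV heads under grouped-query attention, head dim 128) and an fp16 cache:

```python
# Rough KV cache size for Llama 3.1 8B at a given context length.
# Assumes: 32 layers, 8 KV heads (GQA), head_dim 128, 2 bytes/element (fp16).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 covers the separate K and V tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

full = kv_cache_bytes(8192)   # ~1.0 GiB at 8K context
freed = 0.7 * full            # 70% eviction frees ~0.7 GiB
print(f"cache: {full / 2**30:.2f} GiB, freed by 70% eviction: {freed / 2**30:.2f} GiB")
```

So even total eviction of an 8K-context cache recovers only about 1 GiB, and it recovers it after the prefill activation spike has already set the peak.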
The real fix is chunked prefill with interleaved eviction: evict tokens during prefill so they never accumulate. This approach is largely absent from the KV cache compression literature, which optimises for decode-time efficiency rather than peak memory.
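A minimal sketch of the interleaved-eviction loop, with KV entries stood in by `(position, token)` pairs and a hypothetical `score` function in place of a real importance metric (a real system would score by attention, as SnapKV and ExpectedAttention do):

```python
# Chunked prefill with interleaved eviction: ingest the prompt in fixed-size
# chunks and prune the cache back to `budget` after each chunk, so the cache's
# high-water mark is bounded by budget + chunk_size rather than the full
# prompt length.
def chunked_prefill_with_eviction(tokens, chunk_size, budget, score):
    cache = []   # (position, token) pairs standing in for per-token KV entries
    peak = 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        cache.extend((start + i, t) for i, t in enumerate(chunk))
        peak = max(peak, len(cache))  # cache high-water mark
        if len(cache) > budget:
            # Keep the `budget` highest-scoring entries, then restore
            # positional order for the next attention step.
            cache.sort(key=score, reverse=True)
            cache = sorted(cache[:budget])
    return cache, peak

# Usage: a 1000-token prompt with a 300-entry budget never holds more than
# budget + chunk_size entries at once, versus 1000 for monolithic prefill.
cache, peak = chunked_prefill_with_eviction(
    list(range(1000)), chunk_size=128, budget=300, score=lambda e: e[0]
)
```

The point of the sketch is the bound: eviction happens inside the prefill loop, so evicted tokens never contribute to peak memory, which is exactly what post-prefill compression cannot achieve.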
Full write-up at the demo link.