KV Cache Compression Benchmark
Benchmarked 7 KV cache compression methods — ExpectedAttention, SnapKV, AdaKV, and others — on Needle-in-a-Haystack retrieval at 4K–8K context lengths using Llama 3.1 8B.
Finding: 5 of 7 methods maintain perfect retrieval accuracy at 70% token eviction. Despite this, post-prefill compression delivers under 1 GB of peak-memory reduction in practice, because the KV cache only dominates memory after prefill — by which point the prefill activations (the actual memory spike) are already gone.
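A back-of-envelope size check is consistent with the sub-1 GB figure. This is a sketch, not a measurement, assuming Llama 3.1 8B's published config (32 layers, 8 KV heads under grouped-query attention, head dim 128) and an fp16 cache:

```python
# Rough KV cache size for Llama 3.1 8B at a given context length.
# Assumes: 32 layers, 8 KV heads (GQA), head_dim 128, 2 bytes/element (fp16).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 covers the separate K and V tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

full = kv_cache_bytes(8192)   # ~1.0 GiB at 8K context
freed = 0.7 * full            # 70% eviction frees ~0.7 GiB
print(f"cache: {full / 2**30:.2f} GiB, freed by 70% eviction: {freed / 2**30:.2f} GiB")
```

So even total eviction of an 8K-context cache recovers only about 1 GiB, and it recovers it after the prefill activation spike has already set the peak.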
The real fix is chunked prefill with interleaved eviction: evict tokens during prefill so they never accumulate. This approach is largely absent from the KV cache compression literature, which optimises for decode-time efficiency rather than peak memory.
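A minimal sketch of the interleaved-eviction loop, with KV entries stood in by `(position, token)` pairs and a hypothetical `score` function in place of a real importance metric (a real system would score by attention, as SnapKV and ExpectedAttention do):

```python
# Chunked prefill with interleaved eviction: ingest the prompt in fixed-size
# chunks and prune the cache back to `budget` after each chunk, so the cache's
# high-water mark is bounded by budget + chunk_size rather than the full
# prompt length.
def chunked_prefill_with_eviction(tokens, chunk_size, budget, score):
    cache = []   # (position, token) pairs standing in for per-token KV entries
    peak = 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        cache.extend((start + i, t) for i, t in enumerate(chunk))
        peak = max(peak, len(cache))  # cache high-water mark
        if len(cache) > budget:
            # Keep the `budget` highest-scoring entries, then restore
            # positional order for the next attention step.
            cache.sort(key=score, reverse=True)
            cache = sorted(cache[:budget])
    return cache, peak

# Usage: a 1000-token prompt with a 300-entry budget never holds more than
# budget + chunk_size entries at once, versus 1000 for monolithic prefill.
cache, peak = chunked_prefill_with_eviction(
    list(range(1000)), chunk_size=128, budget=300, score=lambda e: e[0]
)
```

The point of the sketch is the bound: eviction happens inside the prefill loop, so evicted tokens never contribute to peak memory, which is exactly what post-prefill compression cannot achieve.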
Full write-up at the demo link.