The KV cache in LLMs is a performance bottleneck that most people try to optimize, but the real win is realizing that it's not just about size: it's about access patterns.
Let’s see it in action. Imagine a simple LLM generating text. To produce each new token, the attention mechanism needs the key and value vectors of every token that came before. A naive implementation recomputes those projections from scratch at every step: generating token 2 reprojects token 1, generating token 3 reprojects tokens 1 and 2, and so on. This is wasteful, because the key and value vectors of past tokens never change. The KV cache stores them once, so the LLM doesn’t have to recompute them.
Here’s a simplified look at what’s happening under the hood. For each token t in the input sequence, the LLM computes a key vector K_t and a value vector V_t. These are then used in the attention mechanism:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Where Q is the query vector for the current token. Without the KV cache, for each new token t+1, the LLM would recompute K_0, V_0, K_1, V_1, ..., K_t, V_t and then compute Q_{t+1} against all of them. With the KV cache, K_0...K_t and V_0...V_t are stored. When Q_{t+1} arrives, it’s only compared against the cached keys, and the corresponding cached values are used.
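A minimal NumPy sketch makes the caching explicit (toy dimensions; random vectors stand in for the model’s learned Q/K/V projections). Each decode step appends one key/value pair to the cache instead of recomputing the whole prefix, and the result matches full recomputation exactly:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_k = 64
rng = np.random.default_rng(0)

# Per-token projections for a 5-token sequence (random stand-ins for
# what the model's weight matrices would produce).
keys = rng.standard_normal((5, d_k))
values = rng.standard_normal((5, d_k))
queries = rng.standard_normal((5, d_k))

# Decode loop with a KV cache: append one K/V pair per step.
k_cache, v_cache = [], []
cached_outputs = []
for t in range(5):
    k_cache.append(keys[t])
    v_cache.append(values[t])
    K = np.stack(k_cache)  # (t+1, d_k) -- cached, never recomputed
    V = np.stack(v_cache)  # (t+1, d_k)
    scores = queries[t] @ K.T / np.sqrt(d_k)
    cached_outputs.append(softmax(scores) @ V)

# Reference: naively recompute attention over the full prefix each step.
ref_outputs = [
    softmax(queries[t] @ keys[: t + 1].T / np.sqrt(d_k)) @ values[: t + 1]
    for t in range(5)
]

assert np.allclose(cached_outputs, ref_outputs)
```

The cached loop does O(1) new projection work per step; the naive version does O(t), which is where the quadratic recomputation cost comes from.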
This drastically reduces computation, especially for long sequences. The memory footprint of the KV cache grows linearly with the sequence length and batch size. For a model with n_layers layers, n_heads attention heads, head dimension d_head, batch size batch_size, and sequence length seq_len, the KV cache size is approximately:

2 * n_layers * batch_size * seq_len * n_heads * d_head * sizeof(dtype)
The 2 is for both Key and Value. If n_layers=32, n_heads=32, batch_size=4, seq_len=2048, d_head=64, and using float16 (2 bytes per element), this is:

2 * 32 * 4 * 2048 * 32 * 64 * 2 bytes = 2,147,483,648 bytes ≈ 2 GiB
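This arithmetic is easy to get wrong by a factor of n_heads or a power of two, so it’s worth scripting. A small helper (a sketch; the dtype size is just bytes per element):

```python
def kv_cache_bytes(n_layers, n_heads, d_head, batch_size, seq_len,
                   bytes_per_elem=2):
    """Peak KV cache size: 2 (K and V) * layers * batch * seq * heads * head_dim."""
    return 2 * n_layers * batch_size * seq_len * n_heads * d_head * bytes_per_elem

size = kv_cache_bytes(n_layers=32, n_heads=32, d_head=64,
                      batch_size=4, seq_len=2048, bytes_per_elem=2)
print(size / 2**30, "GiB")  # -> 2.0 GiB
```

Note how the footprint scales linearly in every factor: doubling seq_len or batch_size doubles the cache.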
This is where TensorRT comes in. TensorRT can optimize the KV cache by:
- Quantization: Reducing the precision of the KV cache from float16 to int8. This can halve the memory footprint with minimal accuracy loss.
- Kernel Fusion: Combining multiple operations (like matrix multiplications and activations) into a single GPU kernel. This reduces kernel launch overhead and improves memory bandwidth utilization for KV cache reads/writes.
- Optimized Data Layout: Using formats like [batch_size, seq_len, num_heads, head_dim] or [batch_size, num_heads, seq_len, head_dim] strategically based on access patterns.
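To make the quantization point concrete, here is a rough NumPy illustration of symmetric per-tensor int8 quantization of one head’s cached keys. This is a sketch of the general technique, not TensorRT’s actual calibration pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# One head's cached keys for a 2048-token sequence, stored in float16.
kv_fp16 = rng.standard_normal((2048, 64)).astype(np.float16)

# Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127].
scale = np.abs(kv_fp16).max() / 127.0
kv_int8 = np.clip(np.round(kv_fp16 / scale), -127, 127).astype(np.int8)

# Dequantize on read; storage is halved (1 byte vs 2 per element).
kv_restored = kv_int8.astype(np.float16) * scale

print(kv_fp16.nbytes, kv_int8.nbytes)  # 262144 131072 -- half the memory
err = np.abs(kv_restored.astype(np.float32) - kv_fp16.astype(np.float32)).max()
print(float(err))  # small reconstruction error relative to the value range
```

Real deployments typically use per-channel or per-block scales rather than one scale per tensor, which tightens the reconstruction error further.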
Key levers you control are the max_batch_size and max_seq_len parameters when building your TensorRT engine. These define the maximum dimensions the engine will support, and the KV cache is allocated based on those maximums. If you consistently run smaller batch sizes or shorter sequences than the maximums, you’re wasting significant memory.
Consider builder.max_batch_size and builder.max_workspace_size in the TensorRT Python API. max_batch_size directly determines the KV cache allocation, while max_workspace_size bounds the scratch memory for intermediate tensors; a leaner KV cache leaves more GPU memory available for that workspace.
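Note that in TensorRT 8 and later, with explicit-batch networks, these knobs have moved: batch and sequence bounds are expressed through optimization profiles on the builder config, and workspace through a memory-pool limit. A hedged sketch of the modern API (the input name "input_ids" and the network definition itself are placeholders for your model):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
config = builder.create_builder_config()

# Replaces builder.max_workspace_size: cap scratch memory at 1 GiB.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# Replaces builder.max_batch_size: an optimization profile gives min/opt/max
# shapes per input, here (batch, seq_len) for a hypothetical "input_ids".
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (4, 512), (8, 2048))
config.add_optimization_profile(profile)
```

The "opt" shape is the one TensorRT tunes kernels for, so it should match your typical traffic, not your worst case.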
The most surprising thing about KV cache optimization is that increasing max_seq_len in your TensorRT engine can sometimes decrease overall memory usage per token for very long sequences, even though the theoretical peak KV cache size grows. Certain kernel implementations and memory allocators perform better on larger, contiguous blocks of memory, fragmentation drops, and the fixed per-sequence overhead is amortized over more tokens.
When you’re using TensorRT, you’ll often find yourself tweaking the opt_batch_size and opt_max_seq_len during engine building to find the sweet spot between latency, throughput, and memory usage for your specific deployment scenario.
The next hurdle you’ll likely encounter is managing the KV cache across multiple GPUs for distributed inference.