The most surprising thing about vLLM’s prefix caching is that it doesn’t just save computation: it fundamentally changes the cost structure of serving language models, making shared prefixes a first-class citizen rather than an emergent optimization.

Let’s see this in action. Imagine we have two prompts:

Prompt A: "The capital of France is"
Prompt B: "The capital of Spain is"

If we process these sequentially with a standard setup and no caching, each prompt’s prefill recomputes the KV cache for every one of its tokens, even the tokens the two prompts share.

Now, let’s enable prefix caching. When Prompt A is processed, vLLM computes and stores the KV cache for each of its tokens: "The", "capital", "of", "France", "is". When Prompt B arrives, vLLM recognizes that "The capital of" is a shared prefix. Instead of recomputing the KV cache for those tokens, it reuses the entries already written for Prompt A; only the computation for "Spain" and subsequent tokens is performed from scratch. (In practice the match happens at the granularity of fixed-size cache blocks, so only complete blocks of the shared prefix are reused.)
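As a toy illustration in plain Python (this is a sketch of the idea, not vLLM internals, and the function name `split_reuse` is invented here), the decision for Prompt B can be modeled as splitting its tokens into a reusable prefix and a suffix that needs fresh computation:

```python
# Toy model of the reuse decision (not vLLM code): find the longest
# common prefix, then split Prompt B into reused tokens and tokens
# that still need fresh KV computation.

def split_reuse(prompt_a, prompt_b):
    """Given that prompt_a's KV cache already exists, return which of
    prompt_b's tokens can reuse it and which must be recomputed."""
    i = 0
    while i < min(len(prompt_a), len(prompt_b)) and prompt_a[i] == prompt_b[i]:
        i += 1
    return prompt_b[:i], prompt_b[i:]

reused, fresh = split_reuse(
    "The capital of France is".split(),
    "The capital of Spain is".split(),
)
print(reused)  # ['The', 'capital', 'of']
print(fresh)   # ['Spain', 'is']
```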

This mechanism is called prefix caching in vLLM, and it builds on PagedAttention, vLLM’s attention algorithm. PagedAttention manages KV-cache memory by dividing it into fixed-size blocks, similar to how operating systems manage physical memory with pages. When a new sequence is generated, its KV cache is assigned new blocks. However, if a subsequent sequence shares a prefix, its KV cache can point to the same memory blocks that were populated by the earlier sequence. This is where the "reuse" happens.

The internal workings are quite elegant. vLLM maintains a block table that maps each logical block index (a fixed-size run of tokens in the sequence) to a physical memory block. When a new request comes in, vLLM hashes each full block of its prompt together with all the tokens that precede it (KV values depend on the entire prefix, not just the block’s own tokens) and checks whether physical blocks with those hashes are already allocated. If a match is found, the new request’s block table simply points at the existing physical blocks for the shared prefix. Only the unique suffix of the new request requires new KV cache blocks and computation.
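A minimal sketch of this logical-to-physical mapping, with invented names (`ToyBlockManager` is not a vLLM class) and a tiny block size for readability; the real implementation uses content hashes rather than raw token tuples as keys:

```python
# Toy sketch, not vLLM code: physical blocks are looked up by the
# tokens up to and including each block (a stand-in for the content
# hash vLLM computes), so a request whose prefix matches an existing
# entry just records the existing physical block number in its table.

BLOCK_SIZE = 2  # vLLM's default is larger; 2 keeps the example readable

class ToyBlockManager:
    def __init__(self):
        self.cache = {}      # prefix-content key -> physical block id
        self.next_block = 0  # next free physical block
        self.computed = 0    # blocks that needed fresh KV computation

    def allocate(self, tokens):
        """Build the logical->physical block table for `tokens`,
        reusing any complete block whose prefix is already cached."""
        table = []
        for i in range(0, len(tokens), BLOCK_SIZE):
            block = tokens[i:i + BLOCK_SIZE]
            # KV values depend on *all* preceding tokens, so the key
            # covers the whole prefix, not just this block's tokens.
            key = tuple(tokens[:i + BLOCK_SIZE])
            if len(block) == BLOCK_SIZE and key in self.cache:
                table.append(self.cache[key])  # reuse: no new compute
                continue
            phys = self.next_block
            self.next_block += 1
            self.computed += 1
            if len(block) == BLOCK_SIZE:       # partial blocks aren't shared
                self.cache[key] = phys
            table.append(phys)
        return table

mgr = ToyBlockManager()
t1 = mgr.allocate(["The", "capital", "of", "France", "is"])
t2 = mgr.allocate(["The", "capital", "of", "Spain", "is"])
print(t1)  # [0, 1, 2]
print(t2)  # [0, 3, 4] -- the ("The", "capital") block is shared
```

Note that the second request’s table begins with physical block 0: the shared block was never recomputed or re-stored, only referenced again.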

Two parameters matter when initializing the LLM object. The first is enable_prefix_caching, the explicit switch for the feature (recent vLLM versions enable it by default; older ones require you to set it). The second is gpu_memory_utilization, which dictates how much GPU memory vLLM may use for KV cache storage and other internal structures. A higher utilization allows more concurrent requests and keeps more cached blocks resident before they are evicted, increasing the likelihood of prefix hits. The effectiveness is directly proportional to the number of concurrent requests that share common prefixes.
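A minimal configuration sketch, assuming a recent vLLM version; the model name is a placeholder, and running this requires a GPU with the weights available:

```python
# Configuration sketch -- requires vLLM installed, a GPU, and model
# weights; the model name below is a placeholder, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,       # explicit opt-in on older versions
    gpu_memory_utilization=0.9,       # fraction of GPU memory vLLM may claim
)

prompts = [
    "The capital of France is",
    "The capital of Spain is",  # shares a cached prefix with the first
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=8))
```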

The more sequences that share a prefix, the more dramatic the savings. This is why prefix caching is particularly effective for tasks like:

  • Chatbots: every turn of a conversation re-sends the same system prompt and prior history.
  • Summarization: Different documents might begin with similar introductory phrases.
  • Instruction Following: A set of instructions might share a common preamble.

The actual memory savings come from avoiding redundant allocations and writes in GPU memory. Instead of allocating and filling new KV cache blocks for each token of a shared prefix across multiple requests, vLLM simply adds entries to its block tables that point to the already populated blocks. This is a significant win for memory capacity and bandwidth, beyond the compute saved by not re-running the attention calculations for the prefix.
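A back-of-envelope model of those savings (toy numbers chosen for illustration, not measurements, and simplified by assuming the shared prefix length is a multiple of the block size):

```python
# Toy arithmetic: N concurrent requests share a P-token system prompt
# and each adds S unique tokens. Without sharing, every request stores
# its own copy of the prefix blocks; with sharing, the prefix blocks
# are stored once and referenced by all N block tables.

def kv_blocks(num_tokens, block_size=16):
    """Number of KV-cache blocks needed for num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)

N, P, S = 64, 512, 128
without_sharing = N * kv_blocks(P + S)           # each request stores the prefix
with_sharing = kv_blocks(P) + N * kv_blocks(S)   # prefix blocks stored once

print(without_sharing, with_sharing)  # 2560 vs 544 blocks
```

Under these assumed numbers, sharing cuts KV-cache block usage by roughly 79%, and the gap widens as the shared prefix grows relative to the unique suffixes.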

The next logical step after understanding how vLLM efficiently manages KV cache for shared prefixes is to explore how to strategically influence which prefixes are likely to be shared, perhaps through prompt engineering or batching strategies that group similar requests.

Want structured learning?

Take the full vLLM course →