PagedAttention is a memory management system for large language models that achieves near-optimal memory utilization by treating GPU memory like virtual memory in an operating system.
Let’s see it in action. Imagine we have a vLLM server processing requests. Each request generates a sequence of tokens. These tokens are stored in GPU memory.
# Example of a vLLM server running with PagedAttention
python -m vllm.entrypoints.api_server --model lmsys/vicuna-7b-v1.5 --port 8000
When a client sends a prompt like "Tell me a story about a dragon," vLLM starts generating tokens. The first few tokens might be "Once upon a time, in a land of fire and…" Each of these tokens, along with their associated attention key-value (KV) states, needs to be stored in GPU memory.
The core innovation of PagedAttention is how it manages this memory. Instead of allocating a contiguous block of memory for each sequence, it breaks memory down into fixed-size "pages." Each page can hold a certain number of KV cache entries for tokens.
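To make the page arithmetic concrete, here is a minimal sketch (the block size of 16 and the helper name are illustrative, not vLLM's actual API):

```python
# Illustrative only: how many fixed-size pages a sequence's KV cache needs.
BLOCK_SIZE = 16  # tokens' worth of KV states per page (an assumed value)

def pages_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size pages required to hold num_tokens KV entries."""
    return -(-num_tokens // block_size)  # ceiling division

print(pages_needed(16))  # 1: exactly one full page
print(pages_needed(17))  # 2: one extra token spills into a second page
```

The only waste is the partially filled last page, which is what makes fixed-size paging so memory-efficient.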
Here’s a simplified view of what happens internally:
- KV Cache: Large language models use a KV cache to store intermediate attention calculations. This cache is essential for efficient generation, as it avoids recomputing attention for previous tokens. However, it can consume a massive amount of GPU memory.
- PagedAttention’s Approach: PagedAttention divides the KV cache into blocks. These blocks are the "pages." Each block has a fixed size, say, 16 tokens' worth of KV states.
- Logical vs. Physical: A sequence’s KV cache is represented logically as a list of block indices. Physically, these blocks can be scattered across GPU memory. This is analogous to how an operating system maps virtual memory pages to physical RAM frames.
- Dynamic Allocation: When a sequence needs more memory (i.e., it generates more tokens), PagedAttention simply allocates a new block from the available pool of memory pages and links it into the sequence’s logical chain. If a sequence is shorter than expected, its unused allocated pages can be returned to the pool.
- Sharing: A critical benefit is efficient sharing. If multiple requests share a common prefix (e.g., system prompts, or identical initial user prompts), their initial KV cache blocks can be physically shared. PagedAttention tracks reference counts for each block. When a block is shared, its reference count increases. When a sequence ends or a block is no longer needed by any active sequence, its reference count decreases, and the block is freed if the count reaches zero.
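The allocation, sharing, and reference-counting behavior described above can be sketched as a toy allocator. All names here are hypothetical illustrations, not vLLM's internal API:

```python
# Toy sketch of a paged KV-cache block allocator with reference counting.
class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.ref_count = {}                   # physical block id -> refs

    def allocate(self) -> int:
        """Take a block from the free pool for a new page of KV states."""
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        """Another sequence starts referencing an existing block."""
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        """A sequence no longer needs the block; free it at refcount zero."""
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)

alloc = BlockAllocator(num_blocks=4)
b = alloc.allocate()        # sequence A gets a block
alloc.share(b)              # sequence B shares the same physical block
alloc.release(b)            # A finishes; B still holds the block
alloc.release(b)            # B finishes; block returns to the free pool
print(len(alloc.free))      # 4: all blocks free again
```

Note that the block is only returned to the pool on the second release, once no active sequence references it.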
Consider two requests: Request A: "System: You are a helpful assistant. User: What is the capital of France?" Request B: "System: You are a helpful assistant. User: What is the weather like today?"
Both requests share the prefix "System: You are a helpful assistant. User: ". PagedAttention can allocate a set of blocks for this shared prefix once and have both Request A and Request B logically point to the same physical blocks. When Request A appends tokens for "What is the capital of France?", it gets its own new blocks; likewise, Request B gets separate new blocks for "What is the weather like today?". This drastically reduces memory duplication compared with allocating a separate contiguous region per request.
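In terms of block tables, the two requests look something like this (the block ids and counts are made up for illustration):

```python
# Illustrative only: two sequences' logical block tables sharing prefix pages.
# Suppose the shared "System: ... User: " prefix fills physical blocks 0 and 1.
shared_prefix_blocks = [0, 1]

# Each request's logical table begins with the shared physical blocks,
# then continues with its own private blocks for the divergent suffix.
request_a_blocks = shared_prefix_blocks + [2]   # "...capital of France?"
request_b_blocks = shared_prefix_blocks + [3]   # "...weather like today?"

# Physical memory used: 4 blocks instead of 6 with full duplication.
physical_blocks_used = len(set(request_a_blocks) | set(request_b_blocks))
print(physical_blocks_used)  # 4
```

With many concurrent requests sharing the same system prompt, the savings scale with the number of requests.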
Memory utilization is near-optimal because PagedAttention eliminates external fragmentation (unusable gaps between allocated chunks) and bounds internal fragmentation (wasted space within an allocated chunk) to at most one partially filled page per sequence. Memory is allocated and freed in fixed-size pages, and those pages can live anywhere in physical memory. The system only needs to track which pages belong to which sequence and how many sequences reference each page.
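A back-of-envelope comparison makes the difference concrete. The numbers below are illustrative assumptions, not measured values:

```python
# Contiguous preallocation must reserve space for the maximum possible length;
# paged allocation wastes at most (block_size - 1) slots in the last page.
max_len = 2048            # slots reserved per request under contiguous allocation
actual_len = 300          # tokens the request actually produced
block_size = 16

contiguous_waste = max_len - actual_len                 # internal fragmentation
paged_blocks = -(-actual_len // block_size)             # ceiling division: 19
paged_waste = paged_blocks * block_size - actual_len    # at most block_size - 1

print(contiguous_waste)  # 1748 wasted KV slots
print(paged_waste)       # 4 wasted KV slots
```

The reclaimed slots are exactly what lets the server pack far more concurrent sequences into the same GPU memory.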
The core mechanism that enables this sharing and dynamic allocation is a data structure analogous to an OS page table. For each sequence, a logical page table maps logical page indices to physical block indices. A global "block table" then maps physical block indices to their actual memory locations and reference counts. When a new token is generated, PagedAttention looks up the sequence's last page and appends the new KV state to it. If that page is full, it allocates a new page from the free list, updates the sequence's page table, and increments the reference count for the new page.
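The append path just described can be sketched as follows. This is a simplified model under assumed names; the real implementation tracks KV tensors rather than fill counts:

```python
# Hypothetical sketch of the append path: use the last page if it has room,
# otherwise take a fresh physical block from the free list.
BLOCK_SIZE = 16  # tokens' worth of KV states per page (an assumed value)

def append_token(page_table: list, fill: list, free_list: list) -> None:
    """page_table: logical index -> physical block id; fill: tokens per page."""
    if page_table and fill[-1] < BLOCK_SIZE:
        fill[-1] += 1                       # room left in the last page
    else:
        page_table.append(free_list.pop())  # allocate a new physical block
        fill.append(1)                      # new page holds one token so far

table, fill, free = [], [], [10, 11, 12]
for _ in range(17):                         # generate 17 tokens
    append_token(table, fill, free)
print(table)  # [12, 11]: two physical blocks in use
print(fill)   # [16, 1]: first page full, second holds the 17th token
```

Only the last page of each sequence is ever partially filled, which is why per-sequence waste stays below one block.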
The real magic happens when sequences diverge or end. If a sequence finishes, all its associated blocks have their reference counts decremented. If a block’s reference count drops to zero, it’s returned to the free pool. This granular control over memory blocks, combined with efficient sharing of common prefixes, allows PagedAttention to serve significantly more concurrent requests with the same amount of GPU memory compared to previous methods.
The key insight is that the KV cache doesn’t need to be contiguous. By managing memory in fixed-size pages and using indirection (page tables), PagedAttention achieves the flexibility of dynamic allocation with the efficiency of contiguous memory for individual blocks.
This system’s ability to manage memory efficiently directly translates to higher throughput for LLM inference.
The next challenge you’ll likely encounter is understanding how PagedAttention handles different sequence lengths and batching strategies to maximize throughput.