The TensorRT Paged KV Cache is a memory management technique that dramatically improves LLM serving performance by managing the KV cache the way an operating system manages virtual memory: in small, on-demand blocks rather than one large contiguous allocation per sequence.

Let’s see it in action. Imagine serving a popular LLM with a batch of incoming requests. Without Paged KV Cache, each request would reserve a fixed, potentially large, contiguous chunk of memory for its KV cache upfront, even if the sequence turns out to be short. This leads to memory fragmentation and underutilization. Paged KV Cache instead breaks the cache into smaller, fixed-size "blocks" that are dynamically allocated and deallocated as needed, similar to how an operating system pages virtual memory.

Here’s a simplified view of the process:

  1. Token Generation: When a new token is generated, the LLM needs to update its KV cache.
  2. Block Allocation: Paged KV Cache requests a block of memory from its internal pool.
  3. KV Data Storage: The KV cache data for the new token is written into this allocated block.
  4. Block Management: As sequences grow and shrink, blocks are reused or released back to the pool. This avoids allocating memory for the maximum possible sequence length upfront and reduces fragmentation.
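The four steps above can be sketched as a toy allocator. This is an illustration only, not TensorRT-LLM's internal implementation; the `BlockPool` class and the 64-token block size are assumptions made for the example:

```python
BLOCK_SIZE = 64  # tokens per block (hypothetical; TensorRT-LLM chooses its own)

class BlockPool:
    """Toy model of a paged KV cache: a fixed pool of blocks handed out
    to sequences as they grow and returned when they finish."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # indices of free blocks
        self.seq_blocks = {}                 # sequence id -> list of block ids

    def append_token(self, seq_id, pos):
        """Reserve a new block only when a sequence crosses a block boundary."""
        if pos % BLOCK_SIZE == 0:            # all existing blocks are full
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.seq_blocks.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free.extend(self.seq_blocks.pop(seq_id, []))

pool = BlockPool(num_blocks=4)
for pos in range(100):              # sequence 0 generates 100 tokens
    pool.append_token(0, pos)
print(len(pool.seq_blocks[0]))      # 2 blocks cover 100 tokens at 64 tokens/block
pool.release(0)
print(len(pool.free))               # all 4 blocks are free again
```

Note that a 100-token sequence occupies only two 64-token blocks, and releasing it immediately makes that memory available to other sequences; this is the reuse described in step 4.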

The core problem this solves is the massive, often wasteful, memory footprint of KV caches in LLM inference. For a model with a large context window and many parallel requests, the KV cache can easily become the dominant memory consumer, leading to costly hardware and limited throughput. Paged KV Cache optimizes this by:

  • Granular Allocation: Instead of allocating memory for an entire sequence at once, it allocates memory in small, fixed-size blocks (e.g., 64 or 128 tokens).
  • Dynamic Management: A sophisticated allocator tracks free and used blocks. When a sequence needs more KV cache, it requests new blocks. When a sequence is finished, its blocks are returned to the free pool.
  • Reduced Fragmentation: By managing memory in these smaller units, it prevents large contiguous chunks from being wasted due to short sequences, significantly improving memory utilization.
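A back-of-the-envelope comparison shows why granular allocation helps. All numbers here are hypothetical: four live sequences of assumed lengths, a 2048-token maximum, and a 64-token block size:

```python
MAX_SEQ_LEN = 2048                   # hypothetical per-sequence maximum
BLOCK_SIZE = 64                      # tokens per block (hypothetical)
actual_lens = [100, 1500, 30, 700]   # token counts of four live sequences

# Contiguous allocation: reserve the worst case for every sequence upfront.
contiguous = len(actual_lens) * MAX_SEQ_LEN

# Paged allocation: round each sequence up to whole blocks only.
paged = sum(-(-n // BLOCK_SIZE) * BLOCK_SIZE for n in actual_lens)

print(contiguous)   # 8192 token slots reserved
print(paged)        # 2432 token slots reserved
```

With these made-up lengths, block-based allocation reserves less than a third of the memory, and the gap widens the more sequences finish early.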

To configure Paged KV Cache in TensorRT-LLM, you primarily interact with the engine build configuration during the engine building phase (shown as TensorRTLLMConfig in the snippet below; the exact class name can vary between TensorRT-LLM versions). The key parameters are:

  • max_batch_size: The maximum number of sequences that can be processed concurrently.
  • max_input_len: The maximum length of input sequences.
  • max_output_len: The maximum length of generated sequences.
  • max_num_tokens (crucial for Paged KV Cache): This is the total number of KV cache tokens across all sequences that the engine can hold in memory at any given time. This is the primary knob for controlling the Paged KV Cache’s capacity.

Here’s a snippet of how you might set this up in Python:

# NOTE: module and class paths below follow the pattern of older
# TensorRT-LLM releases and may differ in your version; check the
# tensorrt_llm API reference for the exact names.
from tensorrt_llm.builder import Builder
from tensorrt_llm.config import TensorRTLLMConfig
from tensorrt_llm.models import PretrainedConfig

# ... (model loading and other configurations) ...

config = PretrainedConfig(
    model_name="gpt2",
    # ... other model specific configs ...
)

# Configure for Paged KV Cache
# Example: allow a total of 1,000,000 KV cache tokens across all
# concurrent requests. Each sequence is still individually capped by
# max_input_len + max_output_len; max_num_tokens bounds how many
# tokens all live sequences may hold in the pool at once.
builder_config = TensorRTLLMConfig(
    max_batch_size=32,
    max_input_len=512,
    max_output_len=1024,
    max_num_tokens=1000000,  # Total KV cache tokens allowed
    # Other configs like TP size, PP size, etc.
)

# Build the engine
builder = Builder()
engine = builder.build(..., config=builder_config)

The max_num_tokens parameter is the direct control for the Paged KV Cache. It dictates the total capacity of your KV cache memory pool. Set it too low and you’ll hit out-of-memory errors or performance degradation as the allocator runs out of free blocks; set it too high and you may reserve GPU memory unnecessarily.

The actual size of each KV cache block is determined internally by TensorRT-LLM based on the model’s architecture (number of layers, hidden size, number of attention heads) and the max_batch_size, max_input_len, and max_output_len. You don’t directly set the block size, but max_num_tokens governs how many of these blocks can be active.
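As a rough illustration of that internal arithmetic, the per-token KV footprint of a transformer follows the standard formula 2 × layers × KV heads × head dimension × bytes per element (the factor of 2 covers the key and value vectors). The GPT-2-small-like dimensions and the 64-token block size below are assumptions for the example, not values TensorRT-LLM exposes directly:

```python
# Hypothetical model dimensions (roughly GPT-2 small, fp16 weights).
num_layers = 12
num_kv_heads = 12
head_dim = 64
dtype_bytes = 2  # fp16

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(bytes_per_token)               # 36864 bytes (36 KiB) per token

block_size = 64  # tokens per block (hypothetical)
print(bytes_per_token * block_size)  # 2359296 bytes (2.25 MiB) per block
```

Scaling this up, a 1,000,000-token pool for this hypothetical model would need roughly 36 GB of KV cache memory, which is why sizing max_num_tokens against actual GPU capacity matters.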

The most counterintuitive aspect of Paged KV Cache is that per-sequence limits and total memory are decoupled. max_output_len caps how many blocks a single sequence can ever hold, and thus the maximum size of that sequence’s KV cache, but the memory actually consumed at any moment is governed by max_num_tokens. With many short sequences you can collectively keep far more requests in flight than with a few very long ones, even though every sequence respects its individual max_output_len.
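A quick calculation makes this concrete. The 1,000,000-token budget and 1024-token cap match the earlier snippet; the 250-token average for short sequences is an assumed number:

```python
max_num_tokens = 1_000_000  # total KV cache token budget (from the snippet above)
max_output_len = 1024       # per-sequence cap (from the snippet above)

# Case A: sequences average 250 tokens each (assumed).
short_len = 250
print(max_num_tokens // short_len)       # 4000 such sequences fit in the pool

# Case B: every sequence runs to its individual 1024-token cap.
print(max_num_tokens // max_output_len)  # only 976 fit
```

Same budget, roughly four times as many concurrent short sequences: the pool limit, not the per-sequence cap, decides throughput.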

The next hurdle you’ll face is managing the KV cache’s impact on GPU memory bandwidth during high-throughput scenarios.

Want structured learning?

Take the full TensorRT course →