vLLM’s CPU offloading lets you run models that are too big for your GPU by moving part of the model’s weights, and under memory pressure parts of the KV cache, into CPU RAM.
Here’s a quick demo. Imagine a Llama 2 7B model whose float16 weights (roughly 13-14 GB) are too big for a system with only 10GB of VRAM.
from vllm import LLM, SamplingParams
# Llama 2 7B is roughly 13-14 GB of weights in float16
model_name = "meta-llama/Llama-2-7b-chat-hf"
# Set a small context length to make it easier to run
max_model_len = 512
# Try to load it without offloading (this will likely fail)
try:
    llm_no_offload = LLM(model=model_name, max_model_len=max_model_len, gpu_memory_utilization=0.9)
    print("Loaded without offloading (should not happen)")
except Exception as e:
    print(f"Failed to load without offloading: {e}")
# Now, load with CPU offloading enabled: keep using up to 90% of VRAM,
# push part of the weights to CPU RAM, and reserve CPU swap space for the KV cache
llm_offload = LLM(
    model=model_name,
    max_model_len=max_model_len,
    gpu_memory_utilization=0.9,
    cpu_offload_gb=6,  # offload ~6 GiB of model weights to CPU RAM
    swap_space=10,     # 10 GiB of CPU RAM for swapped-out KV cache blocks
)
print("Successfully loaded with CPU offloading!")
# Generate some text
prompts = [
"Hello, my name is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm_offload.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The core idea is that vLLM actually has two distinct CPU-offloading mechanisms. Weight offloading (cpu_offload_gb) keeps a slice of the model’s parameters in CPU RAM and copies them to the GPU on demand during each forward pass. KV-cache swapping (swap_space) builds on PagedAttention, vLLM’s block-based GPU memory manager: when the GPU runs out of blocks for a sequence’s KV cache, the scheduler can preempt that sequence and swap its blocks to a designated CPU memory region, bringing them back when the sequence is rescheduled. Together, these let you fit larger models and handle longer contexts than your GPU VRAM alone would permit.
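The weight side of this can be pictured with a toy simulation (plain Python, no CUDA, and not vLLM’s actual implementation): layers that don’t fit on the GPU stay in a host-side store and are fetched just before their forward step:

```python
import numpy as np

# Toy model: 8 "layers", each a 4x4 weight matrix. Pretend the GPU
# only has room for 5 resident layers; the other 3 live in CPU RAM.
rng = np.random.default_rng(0)
layers = {i: rng.standard_normal((4, 4)) for i in range(8)}

GPU_RESIDENT = {i: layers[i] for i in range(5)}      # stays on the "GPU"
CPU_OFFLOADED = {i: layers[i] for i in range(5, 8)}  # offloaded to "CPU RAM"

def forward(x):
    transfers = 0
    for i in range(8):
        if i in GPU_RESIDENT:
            w = GPU_RESIDENT[i]
        else:
            # In real offloading this would be a host-to-device copy;
            # here we just fetch the weight from the CPU-side store.
            w = CPU_OFFLOADED[i]
            transfers += 1
        x = np.tanh(x @ w)
    return x, transfers

out, n_copies = forward(np.ones(4))
print(n_copies)  # 3: each offloaded layer costs one copy per forward pass
```

The point of the sketch is the cost model: every offloaded layer adds one host-to-device transfer per forward pass, which is exactly the memory-for-latency trade offloading makes.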
The key parameters are cpu_offload_gb and swap_space. cpu_offload_gb tells vLLM how many GiB of model weights to hold in CPU RAM instead of VRAM; this is what lets an oversized model load at all, at the cost of extra host-to-device copies every forward pass. swap_space tells vLLM how much CPU RAM (in GiB, per GPU) to reserve for KV-cache blocks that get swapped out when sequences are preempted. Note that disable_sliding_window is not an offloading knob: it simply turns off sliding-window attention for models that have it (capping the context length to the window size), and Llama 2 doesn’t use a sliding window at all.
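To pick a value for cpu_offload_gb, a back-of-the-envelope estimate is enough: offload whatever portion of the weights won’t fit in the VRAM vLLM is allowed to use, after reserving some room for the KV cache. The helper below is illustrative, not part of vLLM, and the reserve figure is an assumption:

```python
def estimate_cpu_offload_gb(weight_gb, vram_gb, gpu_util=0.9, kv_reserve_gb=2.0):
    """Rough GiB of weights to push to CPU so the rest fits in VRAM.

    weight_gb:      total size of the model weights
    vram_gb:        physical VRAM on the card
    gpu_util:       fraction of VRAM vLLM is allowed to use
    kv_reserve_gb:  VRAM to keep free for KV cache and activations (assumed)
    """
    usable = vram_gb * gpu_util - kv_reserve_gb
    return max(0.0, weight_gb - usable)

# Llama 2 7B in float16 is roughly 13-14 GB of weights:
print(estimate_cpu_offload_gb(weight_gb=14, vram_gb=10))  # → 7.0
```

With 10 GB of VRAM at 90% utilization and 2 GiB held back for the KV cache, about half the weights have to live on the CPU.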
Internally, vLLM’s block manager maintains two pools of fixed-size KV-cache blocks, one in GPU memory and one in CPU memory. When a request comes in, the scheduler tries to allocate its blocks from the GPU pool. If the pool is exhausted, it preempts a running sequence and either drops its KV cache for later recomputation or swaps the sequence’s blocks to the CPU pool with asynchronous device-to-host copies, swapping them back before that sequence runs again. PagedAttention’s page-table-like block mapping is what makes this cheap: a sequence’s logical blocks can be remapped to different physical blocks without touching the attention computation itself.
A common misconception is that offloading is only for model weights. While weights can be offloaded, the real memory hog, especially with long contexts, is the KV cache. CPU offloading is particularly effective at managing large KV caches that would otherwise quickly exhaust GPU VRAM. This is why even if your model weights just fit, enabling offloading can still allow for much longer sequences.
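The arithmetic makes the point. Assuming Llama 2 7B’s architecture numbers (32 layers, 32 KV heads of dimension 128, float16), every token of context costs half a mebibyte of KV cache:

```python
# Llama 2 7B: 32 layers, 32 KV heads, head dim 128, 2 bytes per fp16 value
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

# Factor of 2 below = one K and one V vector per head per layer
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(bytes_per_token)                 # 524288 bytes = 0.5 MiB per token
print(4096 * bytes_per_token / 2**30)  # a full 4096-token context: 2.0 GiB
```

A single full-length sequence already costs 2 GiB, so a modest batch of long requests dwarfs the VRAM saved by quantizing the weights.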
When you first set up CPU offloading, make sure you have enough free physical RAM. vLLM allocates its swap space as pinned (page-locked) host memory, so if the system is already under memory pressure you can hit allocation errors even when swap_space is set well below your total RAM.
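A quick pre-flight check helps here. This Linux-only sketch reads available memory via os.sysconf; the 10 GiB figure mirrors the swap_space used above, and the 20% headroom is an assumed safety margin:

```python
import os

swap_space_gib = 10  # what we plan to pass to vLLM as swap_space

# Currently-available physical pages times page size (Linux sysconf names)
page = os.sysconf("SC_PAGE_SIZE")
avail_gib = os.sysconf("SC_AVPHYS_PAGES") * page / 2**30

if avail_gib < swap_space_gib * 1.2:  # leave ~20% headroom for everything else
    print(f"Only {avail_gib:.1f} GiB free; swap_space={swap_space_gib} may fail")
else:
    print(f"{avail_gib:.1f} GiB free; swap_space={swap_space_gib} should fit")
```

Remember that swap_space is per GPU, so multiply by your tensor-parallel degree when checking.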
The next hurdle you’ll likely face is performance. CPU offloading introduces latency because moving data between CPU and GPU is orders of magnitude slower than staying within the GPU. You’ll see a significant drop in tokens per second, and the system might even become unresponsive if the offloading is too aggressive for the CPU’s bandwidth.
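You can put a lower bound on that latency with the transfer size and the bus bandwidth. The 25 GB/s figure below is an assumption (roughly effective PCIe 4.0 x16 throughput); real swaps add copy-launch and scheduling overhead on top:

```python
def swap_time_ms(gib, pcie_gb_per_s=25):
    """Lower bound on a one-way GPU<->CPU transfer; overhead not included."""
    return gib * 2**30 / (pcie_gb_per_s * 1e9) * 1e3

# Swapping a ~2 GiB KV cache (one full-length Llama 2 7B sequence)
# out to CPU and later back in:
print(f"{2 * swap_time_ms(2.0):.0f} ms")  # → 172 ms
```

At typical decode speeds that round trip costs the equivalent of several dozen generated tokens, which is why aggressive swapping shows up directly in tokens-per-second.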