The vLLM `OutOfMemoryError` during KV cache allocation means the GPU ran out of VRAM to store the attention key-value states for the sequences being processed.
Here’s a breakdown of why this happens and how to fix it, from most to least common:
1. Batch Size Too High
- Diagnosis: Check your current batch size. If you’re running inference on multiple requests simultaneously, this is the most likely culprit.
  There's no specific command to run here; observe your inference script or API call. For example:

```shell
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-num-seqs 100 --gpu-memory-utilization 0.9
```

  The `--max-num-seqs` flag effectively controls how many sequences can be *active* concurrently.
- Fix: Reduce `max_num_seqs` (or the equivalent batch size parameter in your vLLM setup). For example, if you were using 100, try 64:

```shell
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-num-seqs 64 --gpu-memory-utilization 0.9
```

- Why it works: The KV cache size is directly proportional to batch size and sequence length. Reducing the number of concurrent sequences drastically cuts KV cache memory requirements.
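To get intuition for the scale involved, here's a rough back-of-the-envelope estimate. This is a sketch: the layer/head/dimension numbers below assume a Mistral-7B-like architecture (32 layers, 8 KV heads, head dimension 128, FP16 cache); check your model's `config.json` for the real values.

```python
def kv_cache_bytes(num_seqs, seq_len, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 covers both keys and values: each token stores one
    # (kv_heads x head_dim) K vector and one V vector per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * num_seqs

GIB = 1024 ** 3
print(kv_cache_bytes(100, 2048) / GIB)  # 25.0 GiB at max-num-seqs 100
print(kv_cache_bytes(64, 2048) / GIB)   # 16.0 GiB at max-num-seqs 64
```

Dropping `max_num_seqs` from 100 to 64 frees roughly 9 GiB in this configuration, and the saving scales linearly with batch size.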
2. Sequence Lengths Too Long
- Diagnosis: Observe the typical and maximum sequence lengths of the prompts and generated outputs. Longer sequences mean larger KV cache entries.
  Again, there's no direct command; infer lengths from your input data or API requests. If using the API server, check the `request_output_len` and `prompt_len` of incoming requests.
- Fix: Enforce a maximum sequence length for both input prompts and generated outputs. For instance, cap generated sequences at 1024 tokens and reject prompts longer than 2048 tokens.

```python
# Example within a custom inference script using vLLM's SamplingParams
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)  # limit output length

# When processing prompts, you might add a check:
if len(prompt_tokens) > 2048:
    raise ValueError("Prompt too long")
```

- Why it works: Each token in a sequence requires storage in the KV cache, so limiting sequence length directly reduces the memory footprint per sequence.
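If you serve through the API server rather than a custom script, the same cap can also be applied server-side at launch; a sketch using vLLM's `--max-model-len` flag, which bounds the total prompt-plus-output length per sequence (verify the flag against your installed vLLM version):

```shell
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 2048 --max-num-seqs 64
```

This bounds the per-sequence KV cache allocation at the server, so oversized requests are rejected up front instead of exhausting VRAM.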
3. Model Size and Architecture
- Diagnosis: The model’s parameters (weights) themselves consume VRAM, but more importantly for this error, the architecture dictates the KV cache dimensions. Larger models (e.g., Llama-2 70B vs. 7B) have larger hidden dimensions, leading to larger KV cache entries per token.
  Check the model size on its Hugging Face model card or repository. Example: you're using "meta-llama/Llama-2-70b-hf".
- Fix: If possible, switch to a smaller model, or a model with a more efficient architecture for your use case:

```shell
python -m vllm.entrypoints.api_server --model "meta-llama/Llama-2-7b-hf" --max-num-seqs 100 --gpu-memory-utilization 0.9
```

- Why it works: Smaller models have smaller hidden states, which are multiplied by the number of attention heads and layers to determine the size of the K and V vectors stored in the cache.
4. Quantization (or Lack Thereof)
- Diagnosis: Verify if you are using a quantized version of the model. Unquantized FP16 or BF16 models require more VRAM for weights and also for the KV cache (which stores FP16/BF16 values).
  Check your model loading command; if you don't specify a quantization flag, the weights are likely loaded in FP16/BF16. Example: loading 'mistralai/Mistral-7B-v0.1' without quantization flags.
- Fix: Use a quantized version of the model (e.g., AWQ, GPTQ, or bitsandbytes NF4). vLLM supports several quantization formats:

```shell
# Example CLI change for an AWQ-quantized model
python -m vllm.entrypoints.api_server --model "TheBloke/Mistral-7B-Instruct-v0.2-AWQ" --max-num-seqs 100 --gpu-memory-utilization 0.9
```

- Why it works: Weight-only schemes like AWQ and GPTQ store the model parameters in lower precision (e.g., 4-bit). The KV cache itself still holds FP16/BF16 values, but the several gigabytes freed from the weights leave far more room for it.
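The weight-memory savings alone are substantial. A rough estimate for a 7B-parameter model (a sketch; real quantized checkpoints carry some extra overhead for scales and zero-points):

```python
params = 7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per parameter
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per parameter
print(fp16_gb, int4_gb)  # 14.0 GB vs 3.5 GB
```

On a 24 GB card, that difference is roughly 10 GB of extra headroom for the KV cache.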
5. `gpu_memory_utilization` Too High
- Diagnosis: The `gpu_memory_utilization` parameter tells vLLM what fraction of the GPU's total VRAM it is allowed to use. If this is set too high, vLLM may attempt to allocate more memory than is actually available for the KV cache, even if other components (like the model weights) would fit. Observe the `--gpu-memory-utilization` flag in your vLLM launch command, e.g. `--gpu-memory-utilization 0.95`.
- Fix: Lower the `gpu_memory_utilization` value, for example from 0.95 to 0.85:

```shell
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-num-seqs 100 --gpu-memory-utilization 0.85
```

- Why it works: This parameter acts as an upper bound. Reducing it gives the system more breathing room, preventing aggressive allocation requests that trigger the OOM error, especially when memory outside vLLM's control (CUDA context, other processes) is already in use.
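To see why a high fraction can fail even when the arithmetic seems to fit, remember that some VRAM is already taken before vLLM reserves anything (CUDA context, display server, other processes). A toy illustration with hypothetical numbers (24 GB card, 1.8 GB already in use):

```python
total_gb, already_used_gb = 24.0, 1.8  # hypothetical: CUDA context, display, other processes
free_gb = total_gb - already_used_gb

for util in (0.95, 0.85):
    claim_gb = total_gb * util  # the fraction vLLM tries to reserve up front
    ok = claim_gb <= free_gb
    print(f"utilization={util}: reserve {claim_gb:.1f} GB, fits={ok}")
```

At 0.95 the reservation (22.8 GB) exceeds what is actually free (22.2 GB) and fails; at 0.85 it fits comfortably.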
6. Overlapping Requests or Context Fragmentation
- Diagnosis: If you have a very dynamic workload where requests frequently start, end, and new ones begin, vLLM’s memory allocator might struggle to reclaim and reuse memory efficiently, leading to fragmentation. This is harder to diagnose directly without deep profiling.
- Fix: Restarting the vLLM server periodically can help defragment memory. For a more robust solution, consider implementing a custom eviction strategy or reducing `max_num_seqs` even further to minimize the churn.
- Why it works: Restarting forces a complete re-initialization of the KV cache storage, eliminating fragmentation. A lower `max_num_seqs` reduces the number of active allocations and deallocations, making fragmentation less likely.
7. CUDA Toolkit/Driver Version Mismatch or Bugs
- Diagnosis: While less common, an outdated or incompatible CUDA toolkit or GPU driver can sometimes lead to unexpected memory management issues.
- Fix: Ensure your CUDA toolkit and NVIDIA driver versions are compatible with the vLLM version you are using. Consult vLLM’s documentation for recommended versions and update your drivers and toolkit accordingly.
- Why it works: Newer drivers and CUDA toolkits often include performance improvements and bug fixes related to memory allocation and management.
After addressing these, the next error you might encounter is a `RuntimeError: CUDA error: out of memory` during model loading if the model weights themselves are too large for your GPU, or performance degradation due to the reduced batch size or sequence length.