vLLM is designed for high-throughput, low-latency LLM inference, but achieving optimal performance requires understanding and tuning your specific setup.
Here’s how we’ll benchmark and optimize:
First, let’s get some sample data and a model. We’ll use vLLM’s `LLM` class, which loads models from the Hugging Face Hub by name, and a small, fast model like facebook/opt-125m for demonstration.
```python
from vllm import LLM, SamplingParams

# Load a small model for quick demonstration; vLLM pulls it
# from the Hugging Face Hub by name.
model_id = "facebook/opt-125m"
llm = LLM(model=model_id)

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

# Define prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
    "What is the meaning of life?",
]
```
Now, let’s run an initial inference to establish a baseline. This warm-up run also helps the system allocate necessary resources.
```python
# Initial inference to warm up the model
outputs = llm.generate(prompts, sampling_params)
print("Warm-up complete.")
```
To measure throughput, we send a large batch of requests and count how many tokens are generated per second. Here we simulate high load by repeating a single, longer prompt many times; for practical benchmarking, you’d use a larger and more varied dataset.
```python
import time

# Larger batch size for throughput testing
batch_size = 128
long_prompt = "Explain the concept of quantum entanglement in simple terms."
prompts_for_throughput = [long_prompt] * batch_size

start_time = time.time()
outputs_throughput = llm.generate(prompts_for_throughput, sampling_params)
end_time = time.time()

total_tokens_generated = sum(
    len(output.outputs[0].token_ids) for output in outputs_throughput
)
total_time = end_time - start_time
throughput_tokens_per_sec = total_tokens_generated / total_time

print(f"Batch Size: {batch_size}")
print(f"Total Tokens Generated: {total_tokens_generated}")
print(f"Total Time: {total_time:.2f} seconds")
print(f"Throughput: {throughput_tokens_per_sec:.2f} tokens/sec")
```
Latency is typically measured as the time from when a request is sent to when the first token is received (time to first token, TTFT) or when the entire response is received (end-to-end latency). For interactive applications, TTFT is often more critical.
Let’s measure the average time to first token and end-to-end latency for a set of requests.
```python
# Measure latency
num_latency_requests = 100
latency_prompts = ["What is your favorite color?"] * num_latency_requests

latencies_ttft = []
latencies_e2e = []

for prompt in latency_prompts:
    start_req_time = time.time()
    outputs_latency = llm.generate([prompt], sampling_params)
    end_req_time = time.time()

    # Per-request timing lives on RequestOutput.metrics (when the engine
    # populates it); TTFT is the gap between request arrival and the
    # first generated token.
    metrics = outputs_latency[0].metrics
    if metrics is not None and metrics.first_token_time is not None:
        latencies_ttft.append(metrics.first_token_time - metrics.arrival_time)
    latencies_e2e.append(end_req_time - start_req_time)

avg_ttft = sum(latencies_ttft) / len(latencies_ttft) if latencies_ttft else float("nan")
avg_e2e = sum(latencies_e2e) / len(latencies_e2e)
print(f"\nAverage Time to First Token (TTFT): {avg_ttft:.4f} seconds")
print(f"Average End-to-End Latency: {avg_e2e:.4f} seconds")
```
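Averages hide tail behaviour, and tail latency is often what users notice. As a sketch (plain Python, no vLLM APIs; the sample values below are made up), a nearest-rank percentile summary you could run over the `latencies_e2e` list collected above:

```python
# Sketch: nearest-rank percentiles expose the tail that a mean hides.
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    idx = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[idx]

# Stand-in latency samples; substitute your measured latencies_e2e.
latencies = [0.11, 0.12, 0.10, 0.45, 0.13, 0.12, 0.11, 0.90, 0.12, 0.13]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies, pct):.2f} s")
```

Reporting p50/p95/p99 alongside the averages makes it much easier to spot occasional slow requests caused by queueing or preemption.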
Key levers for optimization include:
- `gpu_memory_utilization`: Controls how much GPU memory vLLM is allowed to use. Higher values leave room for a larger PagedAttention KV cache, enabling longer sequences and larger batch sizes, thus improving throughput. A common starting point is 0.9 (90%).

  ```python
  llm_high_mem = LLM(model=model_id, gpu_memory_utilization=0.95)
  ```

- `max_num_seqs`: The maximum number of sequences (requests) that can be processed concurrently. Increasing this can improve throughput if you have many concurrent users, but it also consumes more memory. The default is often determined automatically but can be set manually.

  ```python
  llm_max_seqs = LLM(model=model_id, max_num_seqs=1024)
  ```

- `max_model_len`: The maximum sequence length the model can handle. This affects the size of the KV cache. If you expect very long prompts or generations, you’ll need to increase this.

  ```python
  llm_long_seq = LLM(model=model_id, max_model_len=2048)
  ```

- Batch size: vLLM batches requests dynamically (continuous batching); `llm.generate` has no static batch-size knob. The “batch” in the throughput example above is simply the length of the prompt list passed in a single call, and letting vLLM manage batching is generally best.
- Model choice: Larger, more complex models inherently have higher latency and lower throughput on the same hardware. Benchmarking is crucial to understand this trade-off.
- Quantization: Using quantized models (e.g., AWQ, GPTQ) can significantly reduce memory usage and increase inference speed, often with minimal impact on accuracy.
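As a concrete (hedged) illustration of the quantization lever: vLLM’s `LLM` constructor accepts a `quantization` argument for loading pre-quantized checkpoints. The model name below is a placeholder, not a recommendation — substitute a real AWQ checkpoint for your target model:

```python
from vllm import LLM

# Placeholder checkpoint name -- swap in a real AWQ-quantized model.
llm_awq = LLM(model="TheBloke/some-model-AWQ", quantization="awq")
```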
The most surprising thing about vLLM’s performance is how much impact PagedAttention has on memory efficiency: by allocating the KV cache in fixed-size blocks rather than contiguous per-sequence buffers, it avoids the fragmentation that limits traditional approaches, allowing much larger effective batch sizes and longer sequences. With more of the KV cache resident in GPU memory, fewer requests are preempted and recomputed, which translates directly into higher throughput under concurrent load.
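To make the memory argument concrete, here is a back-of-the-envelope sketch (plain Python). The dimensions are the standard OPT-125m configuration (12 layers, 12 attention heads, head size 64, fp16); the 20 GiB free-memory figure is an assumption for illustration:

```python
# Back-of-the-envelope KV-cache sizing for facebook/opt-125m.
num_layers = 12      # transformer layers
num_heads = 12       # attention heads per layer
head_dim = 64        # dimension per head
bytes_per_elem = 2   # fp16

# Each token stores a K and a V vector per head per layer.
kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

# Assumed free KV-cache budget: how many tokens fit?
free_bytes = 20 * 1024**3
print(f"Tokens that fit in 20 GiB: {free_bytes // kv_bytes_per_token:,}")
```

Every byte of that budget lost to fragmentation is a token’s worth of context you cannot cache, which is why PagedAttention’s block-based allocation matters so much at high concurrency.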
Consider the enable_prefix_caching option. When enabled, vLLM can reuse computations for common prefixes across multiple requests. This is particularly effective when you have many users sending similar initial prompts, such as in a chatbot scenario where greetings or common questions are frequent. This can lead to a significant reduction in redundant computations, boosting throughput for specific workloads.
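A minimal sketch of turning it on, assuming the same model as above (`enable_prefix_caching` is a constructor flag in recent vLLM releases; the system prefix and questions below are illustrative):

```python
from vllm import LLM, SamplingParams

# Reuse KV-cache blocks for prompt prefixes shared across requests.
llm_cached = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# Requests sharing this prefix can reuse its cached KV blocks.
system_prefix = "You are a helpful assistant. Answer concisely. "
questions = ["What is quantum entanglement?", "What is a KV cache?"]
outputs = llm_cached.generate(
    [system_prefix + q for q in questions],
    SamplingParams(temperature=0.8, max_tokens=50),
)
```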
To truly understand your setup, you need to benchmark with your target model, your expected prompt/generation lengths, and your anticipated concurrency.
The next step in optimizing is exploring different quantization methods and their impact on your specific hardware.