A100s and H100s are both NVIDIA GPUs, but the H100 is a generational leap in performance, especially for AI workloads.

Let’s see how vLLM, a popular LLM inference engine, handles requests on these different GPUs.

from vllm import LLM, SamplingParams
import time

# Load models (using a small model for quicker demonstration)
# For real benchmarks, use larger models like "meta-llama/Llama-2-7b-chat-hf"
model_name = "gpt2" # Replace with your desired model

# Initialize the LLM on the available GPU
# (vLLM runs on the visible CUDA device; set CUDA_VISIBLE_DEVICES to pick one)
llm = LLM(model=model_name)

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

# Define prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
    "What is the meaning of life?",
    "Tell me a joke",
]

# --- Latency Test ---
print("--- Latency Test ---")
start_time = time.time()
for prompt in prompts:
    outputs = llm.generate(prompt, sampling_params)
    # For demonstration, we'll just print the first output
    # print(f"Prompt: {prompt}\nOutput: {outputs[0].outputs[0].text}\n")
end_time = time.time()
latency = (end_time - start_time) / len(prompts)
print(f"Average latency per prompt: {latency:.4f} seconds")

# --- Throughput Test ---
print("\n--- Throughput Test ---")
num_requests = 100
# Submit all requests in a single call so vLLM's continuous batching can
# process them concurrently; looping one prompt at a time would measure
# latency, not throughput. In a real scenario, you'd use more diverse
# prompts and lengths.
batch_prompts = [prompts[i % len(prompts)] for i in range(num_requests)]
start_time = time.time()
outputs = llm.generate(batch_prompts, sampling_params)
end_time = time.time()
total_time = end_time - start_time
throughput = num_requests / total_time
print(f"Processed {num_requests} requests in {total_time:.2f} seconds.")
print(f"Throughput: {throughput:.2f} requests per second")

print("\n--- Benchmarking Notes ---")
print("1. Replace 'gpt2' with larger, more representative LLM models for meaningful benchmarks.")
print("2. Vary prompt lengths and `max_tokens` to simulate real-world request diversity.")
print("3. Ensure you have sufficient VRAM. An 80GB H100 matches an 80GB A100 in capacity, but delivers far higher memory bandwidth.")
print("4. This example uses a single GPU. Multi-GPU setups will have different scaling characteristics.")

The core problem vLLM solves is efficient LLM inference. Traditional inference methods, especially for large models, suffer from high memory usage and slow processing due to how they handle the KV cache (Key-Value cache) and token generation. vLLM’s innovations, particularly PagedAttention, allow it to manage this KV cache much more efficiently, leading to higher throughput and lower latency.
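To make the idea concrete, here is a toy sketch of the block-table scheme behind PagedAttention. This is not vLLM's actual implementation (which manages GPU tensors in CUDA); it only illustrates how carving the KV cache into fixed-size blocks avoids fragmentation and reclaims memory the moment a sequence finishes. All names here are illustrative.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class PagedKVCache:
    """Toy model of a paged KV cache: a pool of fixed-size physical blocks
    plus a per-sequence block table mapping logical positions to blocks."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # current block full; grab a new one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        # Physical location for this token's key/value vectors
        return table[-1], position % BLOCK_SIZE

    def free(self, seq_id):
        # Finished sequences return whole blocks to the pool immediately,
        # with no fragmentation or over-reserved contiguous buffers.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4)
for pos in range(20):          # 20 tokens -> 2 blocks (16 + 4 tokens)
    cache.append_token("seq0", pos)
print(len(cache.block_tables["seq0"]))  # blocks used by seq0
cache.free("seq0")
print(len(cache.free_blocks))           # all blocks back in the pool
```

Because memory is allocated block-by-block on demand rather than reserved up front for the maximum sequence length, far more concurrent sequences fit in the same VRAM.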

The H100, based on NVIDIA’s Hopper architecture, brings substantial improvements over the A100 (Ampere architecture). Key differences relevant to vLLM performance include:

  • More Compute Power: H100 has more SMs (Streaming Multiprocessors) and higher clock speeds, directly translating to faster matrix multiplications, which are the backbone of transformer models.
  • Transformer Engine: H100 features a specialized Transformer Engine that dynamically uses FP8 precision. This can significantly speed up computations and reduce memory bandwidth requirements compared to A100’s FP16/BF16, without a substantial loss in accuracy for most LLM tasks.
  • Higher Memory Bandwidth: H100 boasts much higher HBM3 memory bandwidth (up to 3.35 TB/s) compared to A100’s HBM2e (up to 2 TB/s). This is crucial for LLMs where memory access is often the bottleneck.
  • NVLink Improvements: Newer generations of NVLink on the H100 offer higher bandwidth for multi-GPU communication, beneficial for distributed inference.
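A back-of-the-envelope calculation shows why bandwidth dominates decode speed. At batch size 1, generating each token requires streaming essentially all model weights from HBM, so peak bandwidth puts an upper bound on tokens per second. The numbers below are published peak specs, not measured results, and real throughput lands well below these bounds.

```python
def tokens_per_sec(params_billion, bytes_per_param, bandwidth_tb_s):
    """Bandwidth-bound upper limit on single-stream decode speed:
    every generated token must read the full weight set from HBM."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# A 7B-parameter model in FP16 (2 bytes per parameter)
a100 = tokens_per_sec(7, 2, 2.0)    # A100 80GB HBM2e: ~2.0 TB/s peak
h100 = tokens_per_sec(7, 2, 3.35)   # H100 SXM HBM3:  ~3.35 TB/s peak
print(f"A100 decode upper bound: ~{a100:.0f} tok/s")
print(f"H100 decode upper bound: ~{h100:.0f} tok/s ({h100 / a100:.2f}x)")
```

The ratio of the bounds is just the bandwidth ratio (~1.7x); dropping to FP8 halves `bytes_per_param` and roughly doubles the bound again, which is why the Transformer Engine compounds the bandwidth advantage.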

When vLLM runs on an H100, it can leverage these architectural advantages. PagedAttention, vLLM’s core innovation, is designed to be memory-efficient. However, the raw speed of the H100 means that even with efficient memory management, the computational steps themselves are performed much faster. The FP8 support in the Transformer Engine is particularly impactful, allowing for faster calculations and reduced memory traffic for models that can utilize it. This combination of efficient memory handling (PagedAttention) and raw computational speed (Hopper architecture, Transformer Engine) is why H100s consistently outperform A100s in LLM inference benchmarks.
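On H100-class GPUs, recent vLLM versions can quantize weights to FP8 at load time via the `quantization` argument. A minimal sketch is below; the exact flag and supported modes vary across vLLM releases, so check the documentation for your installed version.

```python
from vllm import LLM

# Sketch: dynamic FP8 weight quantization on Hopper GPUs.
# Requires an H100-class GPU and a recent vLLM release; the
# "fp8" quantization mode may differ across versions.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    quantization="fp8",
    gpu_memory_utilization=0.9,  # fraction of VRAM vLLM may reserve
)
```

Halving the bytes per weight both speeds up the bandwidth-bound decode step and frees VRAM for more KV-cache blocks, so throughput gains from FP8 often exceed the raw compute speedup.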

The most surprising thing about vLLM’s performance on H100s versus A100s isn’t just raw speed, but how PagedAttention interacts with the hardware. While PagedAttention is designed to minimize memory fragmentation and waste, the H100’s higher memory bandwidth and the Transformer Engine’s FP8 capabilities mean that the rate at which PagedAttention can fetch and process KV cache blocks is dramatically higher. This isn’t just about having more memory or faster compute; it’s about how the memory-access patterns optimized by PagedAttention are executed with unprecedented speed on the Hopper architecture. The H100 can effectively "page" through the KV cache and perform computations on those pages far faster than the A100, making the entire inference process significantly more fluid and responsive, especially under heavy load.

The next step after understanding hardware differences is exploring how vLLM’s continuous batching and OpenAI-compatible server can further optimize throughput and latency by handling dynamic request arrivals.
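As a taste of that workflow, the snippet below queries a vLLM server from the official `openai` client. It assumes you have separately started the server (for example with `vllm serve meta-llama/Llama-2-7b-chat-hf` in recent releases) on localhost port 8000; the model name and port are placeholders for your own setup.

```python
from openai import OpenAI

# Point the OpenAI client at a locally running vLLM server; vLLM ignores
# the API key by default, so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # must match the served model
    prompt="The capital of France is",
    max_tokens=50,
)
print(resp.choices[0].text)
```

Because the server batches concurrently arriving requests continuously, throughput under many simultaneous clients is much closer to the hardware's potential than the sequential offline loop above.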
