vLLM can dynamically swap multiple LoRA adapters on a single model instance, letting you serve many fine-tuned variations without spinning up new GPUs.
Let’s see this in action. Imagine we have a base model, say meta-llama/Llama-2-7b-hf, and two LoRA adapters: adapter_A and adapter_B. We can load the base model once and then tell vLLM to use adapter_A for one request and adapter_B for another, all from the same running process.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
import os

# Ensure you have vLLM installed: pip install vllm
# (the PEFT training sketch below also needs: pip install peft transformers)
# Set an environment variable for your Hugging Face token if the base model is gated:
# os.environ["HUGGING_FACE_HUB_TOKEN"] = "YOUR_HF_TOKEN"

# Load the base model once. LoRA support must be enabled at construction time.
# Replace the model name with your actual base model.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, trust_remote_code=True)
# Define LoRA adapters as LoRARequest objects: a human-readable name, a unique
# integer ID, and the path to the saved LoRA weights. The path can be a local
# directory or a Hugging Face Hub repo ID.
lora_adapters = {
    "adapter_A": LoRARequest("adapter_A", 1, "path/to/your/adapter_A"),  # replace with actual path or HF repo ID
    "adapter_B": LoRARequest("adapter_B", 2, "path/to/your/adapter_B"),  # replace with actual path or HF repo ID
}
# You would typically train and save your LoRA adapters like this (example using PEFT):
# from peft import LoraConfig, get_peft_model
# from transformers import AutoModelForCausalLM
#
# base_model_name = "meta-llama/Llama-2-7b-hf"
# model = AutoModelForCausalLM.from_pretrained(base_model_name)
#
# config = LoraConfig(
# r=16,
# lora_alpha=32,
# target_modules=["q_proj", "v_proj"], # Example target modules
# lora_dropout=0.05,
# bias="none",
# task_type="CAUSAL_LM"
# )
#
# peft_model = get_peft_model(model, config)
# # ... train on task A, then save:
# peft_model.save_pretrained("path/to/your/adapter_A")
# # ... train a second model on task B, then save it as adapter_B
# --- Serving requests with dynamic adapter swapping ---
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)
# Request using adapter_A
prompt_A = "Write a short story about a brave knight."
print(f"Requesting with adapter_A for prompt: '{prompt_A}'")
# The `lora_request` parameter is the key here: it takes a LoRARequest
# identifying which adapter to apply. vLLM loads the adapter on first use
# and caches it for subsequent requests.
results_A = llm.generate(prompt_A, sampling_params, lora_request=lora_adapters["adapter_A"])
print(f"Response (adapter_A): {results_A[0].outputs[0].text}\n")
# Request using adapter_B
prompt_B = "Explain the concept of quantum entanglement in simple terms."
print(f"Requesting with adapter_B for prompt: '{prompt_B}'")
results_B = llm.generate(prompt_B, sampling_params, lora_request=lora_adapters["adapter_B"])
print(f"Response (adapter_B): {results_B[0].outputs[0].text}\n")
# Repeated requests for the same adapter hit vLLM's internal cache, so no
# reload from disk is needed while the adapter is still resident.
print(f"Re-requesting with adapter_A for prompt: '{prompt_A}'")
results_A_again = llm.generate(prompt_A, sampling_params, lora_request=lora_adapters["adapter_A"])
print(f"Response (adapter_A again): {results_A_again[0].outputs[0].text}\n")
# To demonstrate swapping back and forth:
print(f"Swapping back to adapter_B for prompt: '{prompt_B}'")
results_B_again = llm.generate(prompt_B, sampling_params, lora_request=lora_adapters["adapter_B"])
print(f"Response (adapter_B again): {results_B_again[0].outputs[0].text}\n")
The core mechanism is the lora_request parameter of the llm.generate method. When you pass a LoRARequest (an adapter name, a unique integer ID, and the path or Hugging Face repository ID of the adapter weights), vLLM checks whether that adapter is already loaded in its internal cache. If not, it loads the LoRA weights and applies them alongside the base model’s weights during the forward pass for that request; nothing is permanently merged. Subsequent requests for the same adapter hit the cache, making the swap nearly instantaneous.
This dynamic swapping is enabled by vLLM’s efficient memory management and its ability to load and unload LoRA adapters without requiring a full model reload. The base model’s weights remain constant in GPU memory, while only the LoRA adapter’s weights (or a combination thereof) are applied on the fly. This is crucial for scenarios where you have many fine-tuned versions of a single base model, each catering to a different task, domain, or user preference. Instead of maintaining separate GPU instances for each fine-tuned model, you can consolidate them onto a single vLLM instance, dramatically reducing infrastructure costs and simplifying deployment. The system handles the complexity of managing which adapter is active for which incoming request, making it seamless for the end-user.
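The same consolidation works when serving over HTTP with vLLM’s OpenAI-compatible server: adapters are registered at startup and each request selects one by name in its "model" field. A sketch of that deployment (the adapter paths are placeholders, and flag spellings may vary across vLLM versions):

```shell
# Start one server exposing the base model plus two named LoRA adapters.
vllm serve meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules adapter_A=path/to/your/adapter_A adapter_B=path/to/your/adapter_B

# A client then picks an adapter per request via the "model" field:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "adapter_A", "prompt": "Write a short story about a brave knight.", "max_tokens": 100}'
```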
A common point of confusion is how vLLM handles the "merging." It doesn’t perform a permanent disk-based merge for each request. Instead, it dynamically applies the LoRA delta weights to the base model’s weights during the forward pass. This means the base model weights on GPU are never altered permanently. When a new adapter is requested, the system efficiently swaps out the active delta weights and applies the new ones. This dynamic application is what makes the swapping so fast and memory-efficient.
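The arithmetic behind this is small: for a LoRA of rank r, the adapted weight is W + (alpha/r)·B·A, so the forward pass can compute W·x plus a low-rank correction without ever materializing a merged matrix. A toy sketch in plain Python (the matrices and values here are made up purely for illustration):

```python
# Toy illustration: applying a LoRA delta on the fly gives the same result
# as multiplying by a permanently merged matrix.

def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]

W = [[1.0, 2.0],
     [3.0, 4.0]]       # frozen base weight (d x d)
A = [[0.5, 0.5]]       # LoRA "down" projection (r x d, with r = 1)
B = [[2.0],
     [1.0]]            # LoRA "up" projection (d x r)
alpha, r = 4.0, 1
scaling = alpha / r
x = [1.0, 1.0]

# On-the-fly application: y = W x + scaling * B (A x)
Ax = matvec(A, x)                                  # cheap r-dim intermediate
delta = [scaling * v for v in matvec(B, Ax)]
y_dynamic = [wv + dv for wv, dv in zip(matvec(W, x), delta)]

# Equivalent permanent merge: y = (W + scaling * B A) x
BA = matmul(B, A)
W_merged = [[W[i][j] + scaling * BA[i][j] for j in range(2)] for i in range(2)]
y_merged = matvec(W_merged, x)

assert y_dynamic == y_merged   # identical results, no merge required
print(y_dynamic)
```

Because the dynamic path only ever stores A and B (plus the frozen W), swapping adapters means swapping two small matrices rather than rewriting the full weight.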
The next hurdle is understanding how to manage adapter loading and unloading when dealing with a very large number of adapters, especially if memory becomes a constraint and vLLM’s internal cache needs to be managed more explicitly.
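Conceptually, that cache behaves like an LRU: a bounded number of adapters stay resident, and the least recently used one is evicted when a new adapter is requested. A minimal sketch of that eviction policy (this class and its names are illustrative, not vLLM internals; vLLM exposes related knobs such as max_loras on the LLM constructor):

```python
from collections import OrderedDict

class AdapterCache:
    """Toy LRU cache mimicking how a serving system might keep a bounded
    number of LoRA adapters resident, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()  # adapter name -> (stand-in for) weights

    def get(self, name, loader):
        if name in self._cache:
            self._cache.move_to_end(name)   # cache hit: mark most recently used
            return self._cache[name]
        weights = loader(name)              # cache miss: load from disk/hub
        self._cache[name] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return weights

cache = AdapterCache(capacity=2)
load = lambda name: f"weights-of-{name}"    # stand-in for real weight loading
cache.get("adapter_A", load)
cache.get("adapter_B", load)
cache.get("adapter_A", load)                # hit: refreshes adapter_A
cache.get("adapter_C", load)                # miss: evicts adapter_B (the LRU)
print(list(cache._cache))                   # ['adapter_A', 'adapter_C']
```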