vLLM with Ray Serve: Production Deployment Pattern

The most surprising thing about deploying large language models (LLMs) in production is how much of the performance bottleneck isn’t the model inference itself, but the surrounding infrastructure and its ability to feed data to the GPU efficiently.

Let’s watch vLLM and Ray Serve handle a real-time request. Imagine a user types "Tell me a story about a dragon" into a web application. This request hits a Ray Serve deployment. Ray Serve, acting as a smart router and orchestrator, forwards this request to one or more vLLM inference servers.

# Example Ray Serve deployment configuration
import ray
from ray import serve
from ray.serve.config import AutoscalingConfig
from starlette.requests import Request

@serve.deployment(
    name="LLM_API",
    # num_replicas and autoscaling_config are mutually exclusive in Ray
    # Serve; with autoscaling enabled, the replica count is managed
    # automatically within the bounds below.
    autoscaling_config=AutoscalingConfig(
        min_replicas=1,
        max_replicas=5,
        # Called target_num_ongoing_requests_per_replica in older Ray releases.
        target_ongoing_requests=10,
    ),
)
class LLMDeployment:
    def __init__(self, model_id: str = "meta-llama/Llama-2-7b-chat-hf"):
        from vllm import LLM, SamplingParams
        self.llm = LLM(model=model_id)
        self.sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

    async def __call__(self, request: Request):
        payload = await request.json()
        texts = [payload["text"]]
        # vLLM batches the prompts in this call and runs inference efficiently.
        # Note: LLM.generate() is blocking; a fully async server would use
        # vLLM's AsyncLLMEngine instead.
        outputs = self.llm.generate(texts, self.sampling_params)
        return {"generated_text": outputs[0].outputs[0].text}

# Deploy the model
ray.init(address="auto")  # Connect to an existing Ray cluster
serve.run(LLMDeployment.bind(), route_prefix="/LLM_API")

# Simulate a request
import requests
response = requests.post("http://localhost:8000/LLM_API", json={"text": "Tell me a story about a dragon"})
print(response.json())

In this setup, Ray Serve manages the scaling of the LLM_API deployment. When the average number of ongoing requests per replica exceeds target_ongoing_requests, Ray Serve automatically spins up more instances of the LLMDeployment, up to max_replicas. Each instance runs a vLLM LLM object. When self.llm.generate() is called, vLLM takes over: it batches the prompts it is given, manages GPU memory using techniques like PagedAttention, and schedules model execution to maximize throughput. (The synchronous LLM class batches within each generate() call; batching across concurrently arriving requests is the job of vLLM’s async engine.) The SamplingParams control how the model generates text, influencing creativity and coherence.
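The autoscaling decision can be approximated by a simple proportional rule: scale replicas so each handles roughly the target number of ongoing requests, clamped to the configured bounds. This is a toy sketch of that behavior, not Ray Serve's actual autoscaler, which also smooths the metric over time:

```python
import math

def desired_replicas(total_ongoing: int,
                     target_per_replica: int = 10,
                     min_replicas: int = 1,
                     max_replicas: int = 5) -> int:
    """Toy version of Ray Serve's proportional autoscaling rule:
    aim for about `target_per_replica` in-flight requests per replica,
    clamped to the configured min/max bounds."""
    raw = math.ceil(total_ongoing / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(35))   # 35 in-flight requests -> 4 replicas
print(desired_replicas(200))  # demand beyond capacity clamps at max_replicas -> 5
print(desired_replicas(0))    # an idle cluster keeps min_replicas -> 1
```

The clamping is why max_replicas matters: past that point, extra demand shows up as queueing latency rather than more capacity.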

The core problem this pattern solves is making LLMs practical for high-volume, low-latency applications. Naively running an LLM on a single GPU, or behind a basic web framework, quickly becomes a bottleneck. LLMs are not only computationally intensive but memory-hungry: the KV cache (the per-token attention state) grows with every generated token. vLLM’s PagedAttention is crucial here; it treats GPU memory like virtual memory, allocating the KV cache in fixed-size blocks that can be granted and reclaimed on demand. This avoids the fragmentation and over-reservation of contiguous allocation, dramatically increasing the number of requests that can be processed concurrently on a single GPU. Ray Serve adds the crucial orchestration layer, handling request routing, load balancing, and autoscaling across potentially many such vLLM-powered inference servers, turning a single model into a robust, scalable service.
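To make the paging analogy concrete, here is a toy block allocator in plain Python. This is conceptual only: vLLM’s real implementation manages fixed-size blocks of GPU memory inside its scheduler, not Python lists.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class PagedKVCache:
    """Toy model of paged KV-cache allocation: blocks are granted on
    demand as a sequence grows and returned to the pool when it finishes,
    so no request has to reserve its maximum length up front."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # pool of free block ids
        self.block_tables = {}                       # request id -> block ids
        self.lengths = {}                            # request id -> token count

    def append_token(self, req_id: str) -> None:
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full: grab a new one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait")
            self.block_tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def finish(self, req_id: str) -> None:
        # Finished requests return all their blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):                       # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("req-A")
print(len(cache.block_tables["req-A"]))   # 2
cache.finish("req-A")
print(len(cache.free_blocks))             # all 4 blocks are free again -> 4
```

The key property is that memory is claimed per generated token, not per worst-case sequence length, which is what lets many concurrent requests share one GPU.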

What most people don’t realize is how much vLLM’s continuous batching overlaps with Ray Serve’s target_ongoing_requests setting. vLLM batches requests within a single GPU to maximize its utilization, dynamically adjusting batch sizes based on incoming requests and available KV cache memory. Ray Serve, on the other hand, works across replicas (and thus potentially multiple GPUs), deciding how many should be active. Both respond to the same signal, request pressure, but at different layers of the system, and they work in concert: vLLM keeps each GPU fed and busy, while Ray Serve ensures you have enough GPUs (and vLLM instances) to meet overall demand.
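A toy simulation makes the per-GPU layer concrete. With static batching, one long request holds the whole batch hostage until it finishes; continuous batching refills freed batch slots on every decode step. This is a deliberately simplified model, not vLLM’s actual scheduler:

```python
def total_steps(lengths, max_batch, continuous):
    """lengths[i] = decode steps needed by request i (all queued at step 0).
    Returns the number of decode steps until every request has finished."""
    queue = list(lengths)
    batch = []  # remaining decode steps for each in-flight request
    step = 0
    while queue or batch:
        # Continuous batching tops up freed slots every step; static
        # batching only admits new requests once the batch has drained.
        if continuous or not batch:
            while queue and len(batch) < max_batch:
                batch.append(queue.pop(0))
        step += 1
        batch = [r - 1 for r in batch if r > 1]  # drop finished requests
    return step

mixed = [8, 1, 1, 1, 1, 1]  # one long request plus several short ones
print(total_steps(mixed, max_batch=2, continuous=True))   # 8 steps
print(total_steps(mixed, max_batch=2, continuous=False))  # 10 steps
```

Even in this tiny example, continuous batching finishes the same workload in fewer steps because short requests slot in beside the long one instead of waiting behind it.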

The next challenge is managing model versions and A/B testing different LLM configurations within this scalable infrastructure.
