You can serve and query embedding models with vLLM, and the most surprising part is how little you need to change compared with serving a text-generation model.

Let’s have vLLM serve an embedding model. We’ll use sentence-transformers/all-MiniLM-L6-v2 for this example because it’s small and fast.

python -m vllm.entrypoints.openai.api_server \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --served-model-name embedding-model \
    --dtype auto \
    --port 8000

This command starts an OpenAI-compatible API server. The --model flag points to the Hugging Face model ID, --served-model-name sets the label clients use to refer to this model in API requests, --dtype auto lets vLLM pick the best precision for your hardware, and --port 8000 is where the server listens. (Depending on your vLLM version, you may also need a task flag such as --task embed so the model is loaded in pooling mode rather than generation mode.)

Now, let’s query it. We’ll send a POST request to the /v1/embeddings endpoint.

import requests
import json

url = "http://localhost:8000/v1/embeddings"
data = {
    "input": "Hello, world!",
    "model": "embedding-model"  # This must match --served-model-name
}

# json= serializes the body and sets the Content-Type header for us
response = requests.post(url, json=data)
response.raise_for_status()  # fail loudly on HTTP errors
result = response.json()

print(json.dumps(result, indent=2))

The output will look something like this:

{
  "object": "embedding",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        -0.006972252,
        0.03953067,
        -0.02428517,
        // ... many more numbers
      ],
      "index": 0
    }
  ],
  "model": "embedding-model",
  "usage": {
    "prompt_tokens": 2,
    "total_tokens": 2
  }
}

The embedding field contains the vector representation of your input text. The input field also accepts a list of strings; each one gets its own entry in data, with index marking its position in the list you sent.
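Once you have vectors back from the server, the usual operation on them is cosine similarity. As a dependency-free sketch (the commented lines assume a `result` from a batch request with two inputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With a batch response holding two embeddings:
# vec_a = result["data"][0]["embedding"]
# vec_b = result["data"][1]["embedding"]
# score = cosine_similarity(vec_a, vec_b)
```

Identical vectors score 1.0, orthogonal vectors 0.0; most sentence pairs land somewhere in between.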

The Mental Model: From Text Generation to Embeddings

vLLM’s core strength is its PagedAttention mechanism, which efficiently manages KV caches for transformer models. When serving an embedding model, vLLM runs what amounts to a prefill-only pass: there is no decode loop and no sampling. Instead of feeding the final hidden states into a language modeling head, vLLM applies a pooling operation to them (last-token, CLS, or mean pooling, depending on the model) and returns the pooled vector as the embedding.

Here’s a breakdown of how it works internally:

  1. Request Ingestion: The API server receives a request to /v1/embeddings. It parses the input text and the model name.
  2. Tokenization: The input text is tokenized using the model’s tokenizer.
  3. Forward Pass: The token IDs are fed into the transformer model. vLLM’s PagedAttention handles the KV cache efficiently, even though for embeddings, the sequence length is usually short and there’s no subsequent generation.
  4. Output Extraction: Instead of passing the last layer’s hidden states through a language modeling head to predict the next token, vLLM applies the model’s pooling strategy (e.g., last-token, CLS, or mean pooling) to the hidden states. The pooled result is the embedding vector.
  5. Response Formatting: The extracted vector is formatted into the OpenAI embedding object structure and returned.

The key is that the underlying transformer architecture (the encoder part, in essence) is the same. vLLM’s inference engine is agnostic to whether the final output is used for next-token prediction or directly as an embedding.
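As an illustration only (not vLLM’s actual code), here is what the pooling step in (4) looks like on toy hidden states, for the two most common strategies:

```python
# Toy final-layer hidden states: one row per token, one column per hidden dim.
hidden_states = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def last_token_pool(states):
    # Use the final token position's hidden state as the embedding.
    return states[-1]

def mean_pool(states):
    # Average each hidden dimension across all token positions.
    n = len(states)
    return [sum(row[i] for row in states) / n for i in range(len(states[0]))]
```

Which strategy applies is a property of the model: sentence-transformers models typically use mean pooling, while many decoder-based embedders use the last token.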

What You Control

When serving embedding models with vLLM, your primary levers are:

  • Model Choice: You can serve any embedding model that vLLM supports, such as sentence-transformers-style models or decoder-based embedders; check vLLM’s supported-models documentation, since not every Hugging Face model that produces hidden states will load.
  • Batching: vLLM automatically handles dynamic batching, grouping multiple incoming requests to maximize GPU utilization. This is crucial for throughput when serving many users.
  • Quantization/Dtype: Using --dtype auto or specifying --dtype float16 or --dtype bfloat16 can significantly reduce memory usage and speed up inference, with minimal impact on embedding quality for most models.
  • API Server Configuration: Parameters like --port, --host, --tensor-parallel-size, and --gpu-memory-utilization allow you to tailor the deployment to your infrastructure and load.

One aspect that often surprises users is that you don’t need to specify a separate "embedding head" or modify the model architecture. When vLLM recognizes a model as an embedding (pooling) model, it skips the language modeling head and the sampling step entirely: the model runner routes the transformer’s final hidden states through a pooler instead of computing logits. The same inference engine that serves generation thus serves embeddings, with only that change in how the output is handled.

The next step is often to integrate these embeddings into a vector database for similarity search.
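Before reaching for a dedicated vector database, a brute-force in-memory search over stored embeddings is often enough for small collections. A hypothetical sketch (the `corpus` structure and function names are made up for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def nearest(query_vec, corpus):
    # corpus: list of (text, embedding) pairs; returns the best-matching text.
    return max(corpus, key=lambda pair: cosine(query_vec, pair[1]))[0]
```

Once the corpus outgrows what a linear scan can handle, the same embeddings drop into any approximate-nearest-neighbor index or vector database unchanged.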
