Guided decoding in vLLM lets you force an LLM to produce output that strictly conforms to a predefined structure, like JSON.

This isn’t just about suggesting a format; it’s about enforcing it. Think of it like a highly disciplined stenographer for your LLM, ensuring every utterance fits the exact template you’ve given it. This capability is crucial for integrating LLMs into automated workflows where predictable, structured data is non-negotiable.

Let’s see it in action. Imagine we want to extract structured information from a user’s request about booking a flight. We’ll use vLLM with its guided decoding feature to ensure the output is valid JSON.

First, we need a model capable of following instructions, like Mistral-7B-Instruct-v0.2 or a similar fine-tuned model. The core of guided decoding is the guided_decoding parameter in vLLM’s SamplingParams. We’ll define a JSON schema that dictates the expected structure.

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
import json

# Define the JSON schema
json_schema = {
    "type": "object",
    "properties": {
        "departure_city": {"type": "string"},
        "arrival_city": {"type": "string"},
        "departure_date": {"type": "string", "format": "date"},
        "return_date": {"type": "string", "format": "date"},
        "passengers": {"type": "integer", "minimum": 1}
    },
    "required": ["departure_city", "arrival_city", "departure_date", "passengers"]
}

# Load the LLM
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Define the prompt
prompt = "Book a flight for 2 people from London to New York, departing on 2024-12-15 and returning on 2024-12-22."

# Set sampling parameters with guided decoding
sampling_params = SamplingParams(
    temperature=0.0,  # temperature 0 gives greedy, reproducible decoding
    max_tokens=150,
    guided_decoding=GuidedDecodingParams(json=json_schema)
)

# Generate the output
outputs = llm.generate(prompt, sampling_params)

# Print the generated output
generated_json_string = outputs[0].outputs[0].text
print(generated_json_string)

# Verify the output is valid JSON
try:
    parsed_json = json.loads(generated_json_string)
    print("\nJSON is valid!")
    print(parsed_json)
except json.JSONDecodeError as e:
    print(f"\nJSON is invalid: {e}")

When you run this, vLLM doesn’t merely ask the model for JSON; it constrains generation token by token so the output must conform to json_schema. Setting temperature=0.0 makes decoding greedy and reproducible, but note that the schema is enforced at any temperature: guided decoding restricts which tokens are even eligible for sampling.

The enforcement happens not in the model’s attention mechanism but at the sampling stage, where logits are modified before each token is chosen. A grammar derived from your schema tracks the current state of generation; if a candidate token would violate the schema (e.g., starting a string where an integer is expected, or failing to close a bracket), its logit is masked out, typically set to negative infinity, making it impossible to select. The model is thereby confined to paths that terminate in valid JSON. In the offline API, the guided_decoding field of SamplingParams accepts a GuidedDecodingParams object; passing your schema as its json argument selects JSON mode.
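The masking step itself is easy to sketch. The toy example below is not vLLM’s actual implementation — the five-token vocabulary and the allowed-token set are invented for illustration — but it shows why masking logits to negative infinity forces the sampler’s hand:

```python
import math

def mask_logits(logits, allowed_ids):
    """Set logits of tokens the grammar disallows to -inf so they can never win."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Greedy (temperature 0) sampling: take the argmax."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary: 0='{', 1='}', 2='"', 3='7', 4='cat'
logits = [0.1, 0.3, 0.2, 0.05, 0.9]  # the model prefers token 4 ('cat')

# Suppose the schema currently expects an integer: only token 3 ('7') is legal.
masked = mask_logits(logits, allowed_ids={3})
assert greedy_pick(logits) == 4   # unconstrained: the model's favorite wins
assert greedy_pick(masked) == 3   # constrained: only the schema-legal token survives
```

The same idea scales to a real vocabulary: the grammar engine computes the allowed set for the current parse state, and everything else is masked before sampling.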

The problem this solves is the unstructured, non-deterministic nature of raw LLM output. Without guided decoding, you’d get freeform text that might merely resemble JSON, forcing brittle post-processing and error handling. With it, the LLM acts as a robust data extraction engine: you define the contract (the schema), and vLLM enforces it during generation. One caveat: if max_tokens cuts generation off early, the output can still be a truncated JSON fragment, so keep the limit generous. This is critical for chaining LLMs, feeding data into databases, or controlling other software components.
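To see what you are spared, here is a hedged sketch of the defensive post-processing you would need without guided decoding — parse the text, then hand-check the contract the schema would have enforced (standard library only; field names match the schema above, the sample strings are invented):

```python
import json

REQUIRED = ["departure_city", "arrival_city", "departure_date", "passengers"]

def extract_booking(raw_text):
    """Brittle manual validation: parse, then check every field by hand."""
    data = json.loads(raw_text)  # raises json.JSONDecodeError on freeform text
    missing = [k for k in REQUIRED if k not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if not isinstance(data["passengers"], int) or data["passengers"] < 1:
        raise ValueError("passengers must be an integer >= 1")
    return data

good = '{"departure_city": "London", "arrival_city": "New York", "departure_date": "2024-12-15", "passengers": 2}'
bad = "Sure! Here is your booking: London to New York..."

print(extract_booking(good)["passengers"])  # 2
try:
    extract_booking(bad)
except json.JSONDecodeError:
    print("freeform output: parse failed")
```

Every branch of this function is a failure mode that schema-guided generation rules out by construction.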

A subtle but powerful aspect is how required fields and format specifiers are handled. Specifying "format": "date" can mean more than "expect a string": depending on the guided-decoding backend, it may be compiled into a pattern that only admits date-shaped strings (e.g., YYYY-MM-DD) — though support for format keywords varies, so verify what your vLLM version’s backend actually enforces. The required fields, by contrast, are enforced structurally: the grammar simply does not permit the object’s closing brace until every required key has been emitted, so the model cannot finish while omitting essential information.
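To make the date case concrete, here is the kind of pattern a backend might compile for "format": "date" (this exact regex is an illustration, not vLLM’s internal representation):

```python
import re

# A plausible compiled constraint for JSON Schema's "format": "date",
# i.e., a full-date string of the shape YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

assert DATE_RE.match("2024-12-15")            # accepted
assert not DATE_RE.match("Dec 15, 2024")      # rejected mid-generation
assert not DATE_RE.match("15/12/2024")        # rejected mid-generation
```

During guided decoding, a string that would diverge from the pattern is never completed in the first place — the offending tokens are masked at the step where they would appear.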

The next step in controlling LLM output is exploring custom logit processors or more complex schema validation scenarios, which often involve defining specific token constraints beyond standard JSON types.
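As a taste of that next step, here is a minimal custom logits processor. The (previous token IDs, logits) callable shape matches what vLLM’s SamplingParams accepts via its logits_processors list, though vLLM hands the processor a torch.Tensor rather than the plain Python list used here for illustration; the banned token IDs are made up:

```python
import math

BANNED_TOKEN_IDS = {7, 42}  # hypothetical token IDs we never want emitted

def ban_tokens(past_token_ids, logits):
    """Custom logits processor: veto specific tokens at every decoding step,
    independent of any schema, by masking their logits to -inf."""
    for tid in BANNED_TOKEN_IDS:
        logits[tid] = -math.inf
    return logits

# Simulate one decoding step over a 50-token vocabulary.
logits = [0.0] * 50
logits[42] = 5.0   # the model strongly prefers a banned token
logits[10] = 1.0   # a permitted alternative
processed = ban_tokens([], logits)
best = max(range(len(processed)), key=lambda i: processed[i])
print(best)  # the banned token 42 loses to token 10
```

Composing such processors with schema-based guided decoding lets you layer business rules (banned words, forced prefixes) on top of structural guarantees.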

Want structured learning?

Take the full vLLM course →