Speculative decoding lets a small, fast model generate tokens and then have a larger, more accurate model verify them, dramatically speeding up inference.

Let’s see it in action. Imagine we’re generating text with vLLM. We’ll use a small "draft" model (like a distilled version of our main model) and a larger "target" model.

from vllm import LLM, SamplingParams

# "draft-model" and "target-model" are placeholders; in practice these would
# be model paths or Hugging Face identifiers, with the draft typically a much
# smaller (e.g., distilled) version of the target.

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

# Input prompt
prompt = "The quick brown fox jumps over the lazy"

# vLLM integrates speculative decoding directly: a single engine pairs the
# target model with a draft model. Internally, the draft model proposes a
# short run of tokens, the target model verifies them in one batched forward
# pass (managing the KV caches of both models), accepts a prefix, and emits
# the next token itself before the cycle repeats. There is no need to load
# two separate LLM instances and shuttle draft tokens between them by hand.
#
# Note: the exact keyword arguments vary across vLLM versions; recent
# versions take a speculative_config dict instead of these flags.
llm = LLM(
    model="target-model",
    speculative_model="draft-model",
    num_speculative_tokens=5,
)

outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)

Speculative decoding addresses the bottleneck of autoregressive generation: each token must be predicted sequentially by the large, powerful model, so even with massive parallelization the generation speed is limited by the model’s per-token latency. The core idea is to use a smaller, faster "draft" model to propose a sequence of tokens, then use the larger "target" model to verify those proposals in parallel. If the draft model is good enough, it generates multiple tokens that the target model accepts with high probability. The target model then needs only one forward pass per round: it verifies the whole draft sequence at once and emits a single token of its own, either a correction where the draft diverges or a bonus token when the full draft is accepted. This dramatically reduces the number of expensive forward passes required of the large model, leading to significant speedups.
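
To make the loop concrete, here is a minimal, greedy sketch of the propose-verify cycle. Everything here is a toy assumption: both "models" are deterministic next-token functions over integer tokens, so verification reduces to checking each draft token against the target's own greedy choice. Real systems verify probabilistically and in one batched pass.

```python
def speculative_generate(draft, target, prompt, k, max_new):
    seq = list(prompt)
    target_calls = 0
    while len(seq) < len(prompt) + max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # Target scores the whole proposal; in a real system this is ONE
        # batched forward pass, so count it as a single target call.
        target_calls += 1
        accepted = []
        for tok in proposal:
            if target(seq + accepted) == tok:
                accepted.append(tok)  # draft token matches: accept it
            else:
                accepted.append(target(seq + accepted))  # correct and stop
                break
        else:
            accepted.append(target(seq + accepted))  # bonus token: all accepted
        seq += accepted
    return seq[: len(prompt) + max_new], target_calls

# Toy models: the target counts upward; the draft agrees except at every
# 4th position, where it skips ahead and gets corrected.
target = lambda s: s[-1] + 1
draft = lambda s: s[-1] + 1 if len(s) % 4 else s[-1] + 2

out, calls = speculative_generate(draft, target, [0], k=4, max_new=12)
print(out, calls)  # 12 new tokens produced with only 3 target calls
```

The point of the sketch is the accounting: twelve tokens of output cost three target-model rounds instead of twelve, because each round pays for one verification pass and banks several accepted draft tokens.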

The system works by orchestrating two models: a small, quick draft model and a large, accurate target model. During generation, the draft model predicts a sequence of k tokens, which are then fed to the target model. The target model computes the probability of each of these k tokens given the preceding tokens and uses those probabilities to decide how many of the draft tokens are accepted. This acceptance is typically probabilistic: tokens to which the target model assigns higher probability are more likely to be kept. Once the accepted prefix is determined (anywhere from 0 to k tokens), the target model generates the next token and the process repeats. The key is that the target model can score all k draft tokens in a single parallel forward pass, rather than paying k sequential passes to generate them one by one. vLLM’s implementation optimizes this by managing the KV cache for both models and efficiently batching the verification requests.
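
The probabilistic acceptance step can be sketched per token. This is a minimal illustration of the standard speculative-sampling rule from the literature (accept a draft token with probability min(1, p/q); on rejection, resample from the residual distribution max(p − q, 0)), not vLLM's actual implementation. Here p and q are hypothetical target and draft distributions over a three-token vocabulary, and u is the uniform draw made explicit for reproducibility.

```python
import random

def verify_token(token, p, q, u=None):
    """Accept the draft token with prob min(1, p[token]/q[token]); on
    rejection, resample from max(p - q, 0). This rule keeps the overall
    output distributed exactly as the target distribution p."""
    if u is None:
        u = random.random()
    if u < min(1.0, p[token] / q[token]):
        return token, True
    # Rejected: sample from the normalized residual distribution.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    r = random.random() * sum(residual)
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r < acc:
            return i, False
    return len(p) - 1, False  # numerical-edge fallback

# Hypothetical distributions: draft is overconfident in token 0,
# target prefers token 1.
p = [0.2, 0.7, 0.1]
q = [0.6, 0.3, 0.1]
print(verify_token(0, p, q, u=0.1))  # u < 0.2/0.6 -> (0, True): accepted
print(verify_token(0, p, q, u=0.9))  # rejected; residual puts all mass on 1
```

Note how rejection does not simply fall back to the target's argmax: resampling from the residual is what makes the combined procedure distribution-preserving.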

The primary levers you control are the choice of draft model and its size relative to the target model. A smaller draft model is faster but may have a lower acceptance rate, forcing the target model to do more of the work itself; a larger draft model may be accepted more often but is slower to run. The ideal balance depends on the specific models and hardware. You also influence performance through standard sampling parameters like temperature, top_p, and max_tokens, which affect output quality and the length of the sequences being drafted and verified. The num_speculative_tokens parameter in vLLM directly controls how many tokens the draft model proposes in each step.
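
A quick way to reason about choosing the speculation length k is the common back-of-envelope model that each draft token is accepted independently with probability alpha. Under that simplifying assumption, one verification round yields (1 − alpha^(k+1)) / (1 − alpha) tokens on average, including the bonus token the target emits itself. The numbers below are illustrative:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass, assuming each of
    the k drafted tokens is accepted independently with probability alpha
    and the target always contributes one token of its own."""
    if alpha == 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for k in (2, 4, 8):
    print(k, round(expected_tokens_per_pass(0.8, k), 2))
# -> 2 2.44
#    4 3.36
#    8 4.33
```

The diminishing returns are visible immediately: at alpha = 0.8, going from k = 4 to k = 8 adds less than one expected token per round while doubling the drafting work, which is why modest values of num_speculative_tokens are usually the sweet spot.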

The most surprising part of speculative decoding is that, with the standard rejection-sampling acceptance rule, the output distribution matches the target model’s exactly, at significantly higher speed, even when the draft model is substantially smaller. This isn’t magic; it’s a consequence of the fact that the target model doesn’t need to re-generate the accepted tokens from scratch. When the draft model proposes a token that the target model considers highly probable, the target can confirm it simply by reading off that token’s probability from a forward pass it is running anyway, which is far cheaper than generating the token itself. The efficiency comes from amortizing the cost of a single target forward pass over multiple draft tokens.
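
The amortization argument is easy to put into numbers. All figures below are hypothetical, chosen only to show the shape of the trade-off: a target forward pass costing 10 ms, a draft pass costing 1 ms, and a round of 4 drafted tokens yielding 3.4 tokens on average.

```python
def cost_per_token(t_target, t_draft, k, n_accepted):
    """Per-token cost of one speculative round: k draft passes plus one
    target verification pass, amortized over the tokens the round yields."""
    return (k * t_draft + t_target) / n_accepted

baseline = cost_per_token(10.0, 0.0, 0, 1.0)  # plain decoding: 10 ms/token
spec = cost_per_token(10.0, 1.0, 4, 3.4)      # draft 10x cheaper, ~3.4 tokens/round
print(baseline, round(spec, 2), round(baseline / spec, 2))
# -> 10.0 4.12 2.43
```

Even with a draft model that is only 10x cheaper and an acceptance yield of about 3.4 tokens per round, per-token cost drops by roughly 2.4x; the speedup evaporates, however, if the acceptance yield falls toward 1, since the draft passes then become pure overhead.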

The next challenge you’ll likely encounter is fine-tuning the draft model to maximize its acceptance rate without sacrificing too much of its speed advantage, or exploring alternative speculative decoding strategies like using multiple draft models of increasing size.
