vLLM’s Best-of-N sampling and beam search aren’t just about picking the "best" next token; they’re a sophisticated dance of exploration and exploitation that can drastically alter output quality and efficiency.

Let’s see it in action. Imagine we’re generating text from a simple prompt: "The quick brown fox jumps over the…"

Here’s a basic vLLM generation without any special sampling:

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m") # Using a small model for quick demo

# Default sampling (greedy)
sampling_params_greedy = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=20)
outputs_greedy = llm.generate("The quick brown fox jumps over the", sampling_params_greedy)
print("Greedy Output:", outputs_greedy[0].outputs[0].text)

# Best-of-N sampling
sampling_params_bon = SamplingParams(best_of=5, temperature=0.7, top_p=0.95, max_tokens=20)
outputs_bon = llm.generate("The quick brown fox jumps over the", sampling_params_bon)
print("Best-of-N Output:", outputs_bon[0].outputs[0].text)

# Beam search (vLLM's SamplingParams has no beam_width; best_of sets the beam
# width, and beam search requires temperature=0.0 and top_p=1.0)
sampling_params_beam = SamplingParams(use_beam_search=True, best_of=5, temperature=0.0, max_tokens=20)
outputs_beam = llm.generate("The quick brown fox jumps over the", sampling_params_beam)
print("Beam Search Output:", outputs_beam[0].outputs[0].text)

Running this might produce something like:

Greedy Output: lazy dog.
Best-of-N Output: lazy dog, and the lazy dog ran away.
Beam Search Output: lazy dog, and the lazy dog ran away.

Notice how the Best-of-N and Beam Search outputs are more elaborate and potentially more coherent than the greedy one. This hints at their underlying mechanisms.

The core problem these techniques solve is the limitation of greedy decoding. Greedy decoding always picks the single token with the highest probability at each step. This is fast but often leads to repetitive, bland, or nonsensical text because it never explores alternative paths that might be slightly less probable initially but lead to a better overall sequence.
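To make that limitation concrete, here's a toy two-step example (plain Python, not vLLM, with made-up probabilities): greedy commits to the highest-probability first token and ends up on a worse overall path.

```python
# Toy two-step decoding problem. Greedy always takes the argmax at each step,
# so it can miss a continuation whose first token is slightly less probable
# but whose full path is better overall.
step1 = {"lazy": 0.6, "sleepy": 0.4}
step2 = {  # continuation probabilities given the first choice
    "lazy":   {"dog.": 0.5, "cat.": 0.5},  # best full path: 0.6 * 0.5 = 0.30
    "sleepy": {"dog.": 0.9, "cat.": 0.1},  # best full path: 0.4 * 0.9 = 0.36
}

greedy_first = max(step1, key=step1.get)  # greedy picks "lazy"
greedy_path_prob = step1[greedy_first] * max(step2[greedy_first].values())

# Exhaustive search over both first tokens finds the better path.
best_path_prob = max(step1[t] * max(step2[t].values()) for t in step1)
```

Here the greedy path scores 0.30 while the best path (through the initially less likely "sleepy") scores 0.36, which is exactly the kind of miss the techniques below avoid.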

Best-of-N Sampling addresses this by generating N independent sequences using standard sampling (controlled by temperature, top_p, etc.) and then selecting the one with the highest cumulative log probability under the model. This means the system tries N different ways to continue the text and then picks the one that looks "best" after it’s generated, based on the model’s own likelihood assessment of each complete sequence.
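The selection logic can be sketched in a few lines of plain Python. This toy uses a fixed per-step distribution instead of a real model, but the shape is the same: sample N candidates independently, score each by its cumulative log probability, keep the winner.

```python
import math
import random

random.seed(0)

# Toy per-step distribution; a real model would condition on the prefix.
vocab = {"lazy": 0.5, "sleepy": 0.3, "happy": 0.2}

def sample_sequence(length=3):
    """Draw one candidate, accumulating its log probability as we go."""
    tokens, logprob = [], 0.0
    for _ in range(length):
        tok = random.choices(list(vocab), weights=list(vocab.values()))[0]
        tokens.append(tok)
        logprob += math.log(vocab[tok])
    return tokens, logprob

# Best-of-N: N independent samples, keep the one the model scores highest.
candidates = [sample_sequence() for _ in range(5)]
best_tokens, best_logprob = max(candidates, key=lambda c: c[1])
```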

Beam Search is a bit different. Instead of generating N full sequences and picking the best, it maintains k (the beam_width) "beams" or partial sequences at each decoding step. At each step, it considers all possible next tokens for each of the k current beams, calculates the probability of the resulting k * vocab_size new sequences, and then keeps only the top k most probable sequences to carry forward to the next step. This is an exploration strategy that actively prunes less promising paths early.
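That loop structure is also easy to sketch. This toy beam search (plain Python, fixed next-token distribution rather than a real model) expands every beam by every token, then keeps only the top-k partial sequences at each step.

```python
import math

# Toy next-token distribution, identical at every step; a real model
# conditions on the tokens generated so far.
step_probs = {"dog": 0.5, "cat": 0.3, "fox": 0.2}

def beam_search(beam_width=2, steps=3):
    beams = [([], 0.0)]  # (token list, cumulative log probability)
    for _ in range(steps):
        # Expand every beam by every possible next token...
        expanded = [
            (tokens + [tok], score + math.log(p))
            for tokens, score in beams
            for tok, p in step_probs.items()
        ]
        # ...then prune: keep only the top-k most probable partial sequences.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

beams = beam_search()
```

With a distribution this flat and repetitive the top beam is just "dog dog dog", but the pruning step is the point: low-scoring prefixes are discarded before they are ever completed.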

Here’s how you control them in vLLM:

  • best_of: An integer. If set to N > 1, vLLM will generate N candidate sequences using the specified sampling parameters (temperature, top_p, etc.) and then return the single sequence that has the highest overall probability. This is evaluated after all N sequences are generated.
  • use_beam_search: A boolean. If True, vLLM will use beam search instead of standard sampling. In this mode there is no separate beam_width parameter: best_of doubles as the beam width, specifying how many beams (candidate sequences) to maintain at each step, and vLLM requires temperature=0.0, top_p=1.0, and top_k=-1. A beam width of 1 is equivalent to greedy decoding. (Newer vLLM releases removed use_beam_search from SamplingParams in favor of a dedicated llm.beam_search API, so check your version.)

The interaction between temperature, top_p, and best_of is crucial. When best_of is used with standard sampling, the sampling parameters (temperature, top_p) define how each of the N candidates is generated: a higher temperature or top_p yields more diverse candidates, increasing the chance that one of them is remarkably good. Beam search is different: because vLLM requires temperature=0.0 and top_p=1.0 when use_beam_search is True, those parameters play no role there, and the search is deterministic.
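For intuition, here's a minimal sketch (plain Python, made-up logits) of how temperature and top_p reshape a next-token distribution before each best_of candidate is sampled: scale the logits, softmax, then keep only the smallest set of tokens whose cumulative probability reaches top_p.

```python
import math

def apply_temperature_top_p(logits, temperature=0.7, top_p=0.95):
    """Toy sketch: temperature-scale logits, softmax, then top-p truncate."""
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {t: math.exp(v) / z for t, v in scaled.items()}

    # Keep the most probable tokens until their mass reaches top_p.
    kept, cum = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break

    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}  # renormalize survivors

dist = apply_temperature_top_p({"dog": 2.0, "cat": 1.0, "fox": -1.0})
```

With these toy logits, the low-probability "fox" falls outside the 0.95 nucleus and is removed entirely, while the remaining mass is renormalized over "dog" and "cat".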

A common misconception is that best_of and beam search are interchangeable. They are not. best_of generates N independent samples and then selects one; beam search actively prunes search paths during generation. best_of can be computationally more expensive for large N because it generates N full sequences independently, whereas beam search prunes early. However, best_of can sometimes find better-quality results because it doesn’t discard potentially good but initially low-probability paths as aggressively as beam search might.

One subtle but important aspect of best_of is that it uses the model’s internal log probabilities to score the entire generated sequence. This means if you have best_of=5, vLLM generates 5 complete sequences, and then for each sequence, it calculates sum(log_prob(token_i | context_i)) for all tokens i in that sequence. The sequence with the highest sum of log probabilities is returned. This is different from just picking the sequence that looks best to a human, or the one that had the highest probability at the final token.
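A toy version of that scoring rule (illustrative probabilities, not model output): candidate B below ends on a very high-probability token, but candidate A wins because its summed log probabilities are higher.

```python
import math

# Per-token probabilities of two complete candidate sequences (toy values).
cand_a = [0.6, 0.5, 0.4]  # no standout token, but solid throughout
cand_b = [0.2, 0.3, 0.9]  # very confident final token, weak earlier ones

# best_of-style scoring: sum the log probability of every token.
score_a = sum(math.log(p) for p in cand_a)  # log(0.6 * 0.5 * 0.4) = log(0.12)
score_b = sum(math.log(p) for p in cand_b)  # log(0.2 * 0.3 * 0.9) = log(0.054)
best = "A" if score_a > score_b else "B"
```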

If you’re seeing repetitive or generic outputs even with these techniques, the next thing you’ll likely need to tune is the repetition_penalty parameter.
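As a preview, the usual repetition-penalty transform (the CTRL-style rule used by vLLM and Hugging Face Transformers) divides the positive logits of already-generated tokens by the penalty and multiplies the negative ones. A toy sketch with made-up logits:

```python
def penalize(logits, seen, penalty=1.2):
    """Toy CTRL-style repetition penalty: shrink the logits of seen tokens."""
    out = dict(logits)
    for tok in seen:
        if tok in out:
            # Positive logits get divided, negative logits get multiplied,
            # so a seen token always becomes less likely either way.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = {"dog": 2.0, "cat": 1.0, "the": -0.5}
adjusted = penalize(logits, seen={"dog", "the"})
```

Unseen tokens ("cat" here) are untouched, while both seen tokens move toward lower probability regardless of sign.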

Want structured learning?

Take the full vLLM course →