The most surprising thing about vLLM’s custom sampling parameters is that top_p and top_k aren’t just random filters, but rather a pair of mechanisms that together sculpt the probability distribution of the next token in a way that can dramatically alter the output’s coherence and creativity.
Let’s see this in action. Imagine we’re prompting a model to continue a story:
from vllm import LLM, SamplingParams
prompt = "The old clock tower stood silent, its hands frozen at midnight. Suddenly, a faint chime echoed through the deserted square. A lone figure emerged from the shadows, their cloak billowing like a storm cloud. They approached the tower, a single, ancient key clutched in their gloved hand. As they reached the heavy oak door, a gust of wind, unnaturally cold, swept through the square, carrying with it a whisper that seemed to coil around the figure's very soul. The whisper spoke of..."
Now, let’s try a few sampling strategies.
Baseline (temperature=0.7 with filtering disabled: top_p=1.0, top_k=-1; note that vLLM's own default temperature is 1.0):
sampling_params_default = SamplingParams(temperature=0.7, top_p=1.0, top_k=-1, max_tokens=100)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
outputs_default = llm.generate(prompt, sampling_params_default)
print("--- Default ---")
for output in outputs_default:
    print(output.outputs[0].text)
This will likely give a coherent, somewhat predictable continuation.
Higher Temperature (more randomness):
sampling_params_temp = SamplingParams(temperature=1.5, top_p=1.0, top_k=-1, max_tokens=100)
outputs_temp = llm.generate(prompt, sampling_params_temp)
print("\n--- High Temperature (1.5) ---")
for output in outputs_temp:
    print(output.outputs[0].text)
Here, we’d expect more surprising word choices, potentially leading to creative but possibly nonsensical output.
Top-P (Nucleus Sampling):
sampling_params_top_p = SamplingParams(temperature=0.7, top_p=0.9, top_k=-1, max_tokens=100)
outputs_top_p = llm.generate(prompt, sampling_params_top_p)
print("\n--- Top-P (0.9) ---")
for output in outputs_top_p:
    print(output.outputs[0].text)
This restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.9. It dynamically adjusts how many tokens are considered based on the shape of the distribution: a confident model may sample from only two or three candidates, while an uncertain one samples from many.
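To make the mechanism concrete, here is a minimal NumPy sketch of the nucleus filter applied to a hypothetical six-token distribution (the probabilities and the helper name `top_p_filter` are invented for illustration; vLLM performs the equivalent masking internally on the GPU):

```python
import numpy as np

# Hypothetical next-token distribution over a 6-token vocabulary.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize the survivors to sum to 1."""
    order = np.argsort(probs)[::-1]              # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # first prefix that reaches p
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

# With p=0.85 the top three tokens sum to only 0.80, so a fourth is needed:
print(top_p_filter(probs, 0.85))  # four nonzero entries, renormalized
```

Note that the two least likely tokens are zeroed out entirely, and the remaining mass is rescaled so the kept probabilities again sum to 1.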
Top-K:
sampling_params_top_k = SamplingParams(temperature=0.7, top_p=1.0, top_k=5, max_tokens=100)
outputs_top_k = llm.generate(prompt, sampling_params_top_k)
print("\n--- Top-K (5) ---")
for output in outputs_top_k:
    print(output.outputs[0].text)
This limits the selection to the top 5 most probable tokens.
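The same idea in NumPy form, again with a made-up distribution (`top_k_filter` is an illustrative helper, not a vLLM API):

```python
import numpy as np

# Hypothetical next-token distribution over a 6-token vocabulary.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

def top_k_filter(probs, k):
    """Zero out everything outside the k most probable tokens, renormalize."""
    kept = np.argsort(probs)[::-1][:k]           # ids of the k largest probs
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

print(top_k_filter(probs, 5))  # the sixth token's probability drops to zero
```

Unlike top-p, the cutoff here is fixed at k tokens regardless of how peaked or flat the distribution is.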
Combined Top-P and Top-K:
sampling_params_combined = SamplingParams(temperature=0.7, top_p=0.9, top_k=5, max_tokens=100)
outputs_combined = llm.generate(prompt, sampling_params_combined)
print("\n--- Combined Top-P (0.9) and Top-K (5) ---")
for output in outputs_combined:
    print(output.outputs[0].text)
This is where things get interesting. The model first considers the top_k tokens, and then applies top_p to that reduced set. If top_k is small, it might filter out tokens that top_p would have kept. Conversely, if top_p is very restrictive, it might eliminate tokens that top_k would have included.
The core problem these parameters solve is controlling the trade-off between coherence/predictability and creativity/novelty in text generation. With greedy decoding (temperature=0), the model deterministically picks the single most probable token at each step, which tends to produce bland, repetitive output, especially over longer generations.
Internally, at each step of generation, the LLM outputs a probability distribution over its entire vocabulary for the next token.
- Temperature (temperature): scales the logits (raw scores) before they are converted into probabilities by the softmax function. A higher temperature (e.g., 1.5) flattens the distribution, making less likely tokens more probable and increasing randomness. A lower temperature (e.g., 0.2) sharpens the distribution, favoring high-probability tokens and making the output more deterministic. temperature=1.0 uses the original probabilities.
- Top-K (top_k): a hard cutoff. The model considers only the k tokens with the highest probabilities; all other tokens are assigned a probability of zero. If k=5, only the 5 most likely next tokens are even candidates.
- Top-P (top_p): also known as nucleus sampling. Instead of a fixed number of tokens, it selects the smallest set of tokens whose cumulative probability reaches a threshold p, and the surviving probabilities are renormalized over this "nucleus". For example, if top_p=0.9 and the top tokens have probabilities 0.4, 0.3, 0.2, 0.05, and so on, the first three (summing to 0.9) form the nucleus and the fourth (0.05) is excluded.
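The temperature effect in particular is easy to verify numerically; the logits below are invented for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                  # shift by max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])    # hypothetical raw scores

for t in (0.2, 1.0, 1.5):
    # Dividing logits by the temperature before softmax sharpens (t < 1)
    # or flattens (t > 1) the resulting distribution.
    print(f"T={t}:", softmax(logits / t).round(3))
```

At T=0.2 nearly all the probability mass lands on the top token; at T=1.5 the tail tokens become live candidates. The argmax never changes, only how concentrated the distribution is around it.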
When top_k and top_p are both used, the process is sequential:
First, top_k is applied. The model considers only the k most probable tokens.
Then, top_p is applied to this reduced set of k tokens: the smallest subset of those k tokens whose cumulative probability reaches p forms the nucleus, and everything else is discarded.
The most counterintuitive aspect is how top_k and top_p interact when both are set. Many assume they are independent filters, but top_p operates on the output of top_k. If top_k=5 and top_p=0.95, the model first finds the 5 most likely tokens, then computes the cumulative probability over only those 5 and keeps the smallest subset that reaches 0.95. The combination can therefore be stricter than either filter alone: top_p may keep fewer than k tokens, and a small top_k can discard tokens that top_p by itself would have retained.
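The sequential order can be captured in one sketch. This is a simplified CPU model of the pipeline, not vLLM's implementation (which does the same thing with masked tensor operations on the GPU); the function name and logit values are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_next_token(logits, temperature=0.7, top_k=5, top_p=0.9, rng=None):
    """Apply temperature, then top_k, then top_p, then sample."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = softmax(np.asarray(logits, dtype=float) / temperature)

    # 1. top_k: keep only the k most probable tokens.
    order = np.argsort(probs)[::-1][:top_k]
    probs_k = probs[order] / probs[order].sum()   # renormalize survivors

    # 2. top_p: smallest prefix of the survivors whose cumulative
    #    (renormalized) probability reaches p.
    cutoff = np.searchsorted(np.cumsum(probs_k), top_p) + 1
    nucleus = order[:cutoff]

    # 3. sample a token id from the renormalized nucleus.
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

# A peaked distribution: top_p=0.9 keeps only 2 of the 5 top_k survivors.
print(sample_next_token([3.0, 2.0, 1.0, 0.0, -1.0, -2.0]))
```

With these particular logits, top_p is the binding constraint: after temperature scaling, the two most likely tokens already carry over 90% of the (top-k renormalized) mass, so the other three top_k survivors never get sampled.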
The next logical step is to explore how to fine-tune these parameters for specific downstream tasks, like summarization versus creative writing.