LLM serving is often a game of throughput, and TensorRT-LLM’s inflight batching is a surprisingly effective way to squeeze more requests through your GPU.
Imagine you’ve got a bunch of users asking your LLM questions simultaneously. Without inflight batching, each user’s request gets processed one by one, or maybe in small, fixed batches. This means your powerful GPU sits idle a lot of the time, waiting for the current request to finish before it can even start on the next one. Inflight batching changes this by dynamically grouping incoming requests together as they arrive.
Let’s see it in action. We’ll use a simple Python script to simulate incoming requests and then show how TensorRT-LLM’s InflightBatchingManager can group them.
```python
import time

import torch
import tensorrt_llm
from tensorrt_llm.runtime import InflightBatchingManager, SessionConfig
from tensorrt_llm.bindings import GenerationConfig

# Assume you have a TensorRT-LLM engine built for your model.
# For demonstration, we'll use dummy parameters.
model_path = "path/to/your/tensorrt_llm_engine"
max_batch_size = 128
max_input_len = 1024
max_output_len = 1024
max_num_tokens = max_input_len + max_output_len

# Load the engine and create a session config.
# Replace with your actual engine path and model configuration.
with open(model_path, "rb") as f:
    engine_bytes = bytearray(f.read())

session_config = SessionConfig(
    max_batch_size=max_batch_size,
    max_input_len=max_input_len,
    max_output_len=max_output_len,
    max_num_tokens=max_num_tokens,
    dtype=tensorrt_llm.str_dtype_to_dtype("float16"),  # Example dtype
    use_cuda_graph=True,         # Often beneficial for inflight batching
    enable_chunk_support=False,  # Example setting
)

# Initialize the inflight batching manager.
manager = InflightBatchingManager(
    session_config,
    engine_bytes,
    cuda_graph_batch_size=4,  # Number of CUDA graphs to pre-record for batching
)

def tokenize(prompt: str) -> list:
    # Placeholder: substitute your model's tokenizer here
    # (e.g. a Hugging Face AutoTokenizer's encode method).
    return [ord(c) for c in prompt]  # dummy token IDs for illustration

# Simulate incoming requests.
num_requests = 10
requests_to_add = []

print("Simulating incoming requests...")
for request_id in range(num_requests):
    prompt = f"User {request_id}: What is the capital of France?"
    input_token_ids = tokenize(prompt)
    input_lengths = [len(input_token_ids)]
    generation_config = GenerationConfig(
        max_new_tokens=20,
        # Other generation parameters...
    )
    requests_to_add.append(
        (request_id, input_token_ids, input_lengths, generation_config)
    )
    time.sleep(0.05)  # Simulate requests arriving over time

print(f"Adding {len(requests_to_add)} requests to the manager...")
for req_id, input_ids, input_lens, gen_config in requests_to_add:
    manager.enqueue_request(req_id, input_ids, input_lens, gen_config)

# The manager would continuously poll for completed requests and add new ones.
# For this demo, we'll just show how many batches are currently in flight.
print(f"Currently in-flight batches: {manager.num_in_flight_batches()}")

# In a real scenario, you'd run a loop here to:
# 1. Poll for completed requests using manager.fetch_completed_requests()
# 2. Process the output
# 3. Add new requests as they come in
# 4. Call manager.run_next_batch() to process the next set of requests
```
This code sets up the `InflightBatchingManager` and then simulates a stream of requests. The manager’s job is to intelligently group these requests into batches the TensorRT-LLM engine can process. It looks at requests that have arrived but haven’t started processing yet and fills a batch up to `max_batch_size`, or until a time limit is reached, or until a minimum number of requests has accumulated.
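The filling policy just described can be sketched in plain Python. This is a toy sketch, not TensorRT-LLM’s internals: the queue type, the `timeout_s` deadline, and the `min_requests` knob are all illustrative assumptions.

```python
import time
from collections import deque

def form_batch(queue: deque, max_batch_size: int, timeout_s: float,
               min_requests: int = 1) -> list:
    """Pull requests from the queue until the batch is full, the queue runs
    dry with at least min_requests collected, or the deadline expires."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        if queue:
            batch.append(queue.popleft())
        elif len(batch) >= min_requests or time.monotonic() >= deadline:
            break  # enough work accumulated, or we waited long enough
        else:
            time.sleep(0.001)  # wait briefly for more arrivals

    return batch

# Usage: ten queued requests, batch capacity of four.
q = deque(range(10))
print(form_batch(q, max_batch_size=4, timeout_s=0.01))  # [0, 1, 2, 3]
print(len(q))  # 6 requests remain queued for the next batch
```

The timeout matters at low traffic: without it, the batcher would either ship tiny batches immediately or stall waiting for arrivals that never come.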
The core problem this solves is GPU underutilization when dealing with variable request arrival times and lengths. Traditional batching requires all requests in a batch to have the same length or be padded to the maximum length, which is inefficient for LLMs where prompts and desired outputs vary wildly. Inflight batching, also known as dynamic batching or continuous batching, addresses this by:
- Request Queuing: Incoming requests are placed into a queue.
- Dynamic Batch Formation: The `InflightBatchingManager` monitors the queue and forms batches on the fly. It tries to fill a batch as much as possible, considering `max_batch_size` and potentially other constraints such as a timeout or a minimum number of requests, to avoid starting a batch with very few items.
- Continuous Processing: As soon as a batch is formed, it’s sent to the TensorRT-LLM engine for inference. Crucially, the manager keeps admitting new requests while the current batch is still running on the GPU. This is where the "inflight" aspect comes in: requests join and leave a batch that is already executing, rather than waiting for it to drain.
- Token-Level Scheduling: For generative tasks, once a batch starts processing, the manager doesn’t wait for the entire batch to finish generating all tokens. It manages requests at a token level, yielding control back to the scheduler after each generation step so finished requests can be evicted and waiting requests admitted. CUDA graphs are often used here to keep the per-step kernel-launch overhead low.
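To put a number on the padding waste that motivates all of this, here is a back-of-the-envelope comparison. The sequence lengths below are invented purely for illustration.

```python
# Static batching pads every sequence to the longest one in the batch;
# inflight batching only spends compute on real tokens.
seq_lens = [30, 120, 45, 512, 80, 256, 15, 400]  # hypothetical total lengths

padded_tokens = len(seq_lens) * max(seq_lens)  # every slot padded to 512
real_tokens = sum(seq_lens)                    # tokens that actually matter

print(f"padded: {padded_tokens}, real: {real_tokens}, "
      f"utilization: {real_tokens / padded_tokens:.0%}")
# → padded: 4096, real: 1458, utilization: 36%
```

Roughly two-thirds of the compute in the padded batch is wasted on filler tokens, which is exactly the gap inflight batching closes.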
The key levers you control are primarily within the `SessionConfig` and in how you manage the `InflightBatchingManager`:

- `max_batch_size`: The absolute upper limit on how many requests can be in a single batch. A higher value can increase throughput but also increases GPU memory usage, and raises latency for individual requests if the batch is consistently full.
- `max_input_len`, `max_output_len`, `max_num_tokens`: These define the maximum sequence lengths your engine can handle. The inflight batcher must respect these limits when forming batches.
- `cuda_graph_batch_size`: Determines how many different batch sizes have their CUDA graphs pre-recorded. A larger number can improve performance by reducing overhead for common batch sizes, but increases VRAM usage.
- Request arrival logic: How you enqueue requests and how frequently you poll for completions significantly impacts how well the batcher can utilize the GPU. Too few requests arriving, or polling too infrequently, will lead to underutilization.
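One way to build intuition for the `max_batch_size` lever is a toy cost model. The linear per-step cost below is an invented assumption; real step times depend on the model, sequence lengths, and hardware.

```python
# Hypothetical cost model: each decode step has a fixed base cost plus a
# small per-request cost, so throughput grows with batch size while the
# per-request step latency grows too.
def step_time_ms(batch_size: int, base_ms: float = 5.0,
                 per_req_ms: float = 0.4) -> float:
    return base_ms + per_req_ms * batch_size

for bs in (1, 8, 32, 128):
    t = step_time_ms(bs)
    # One token per request per step -> batch_size tokens per step.
    throughput = bs / t * 1000
    print(f"batch={bs:3d}  step={t:5.1f} ms  throughput={throughput:7.1f} tok/s")
```

Under this model, throughput rises superlinearly relative to the latency cost at small batch sizes and flattens as the per-request term dominates, which is why the sweet spot is workload-dependent.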
The magic of inflight batching lies in keeping the GPU busy by always having work ready. After each token-generation step, the manager can immediately schedule the next step for still-active requests and hand slots freed by completed requests to waiting ones. This is a far cry from static batching, where slots sit idle for many token-generation steps whenever one request in the batch finishes much earlier than the others.
What most people don’t realize is how granular the scheduling can get. For generative models, the InflightBatchingManager doesn’t just batch entire requests for an entire generation sequence. It can manage the process at a token-generation step level. When a batch is running, and one request in that batch generates its next token, the manager can immediately pick up the next token for that request, while simultaneously scheduling the next batch of requests that have been waiting. This allows for a near-continuous flow of computation on the GPU, minimizing idle time between token generations for any request.
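This granularity can be illustrated with a toy simulation. The request lengths and slot capacity below are invented, and a real scheduler also has to account for KV-cache memory, but the scheduling contrast is the same.

```python
# Toy simulation contrasting static and continuous (inflight) batching.
# Each request needs n decode steps; the GPU runs `capacity` requests per step.

def static_steps(lengths: list, capacity: int) -> int:
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])
    return steps

def continuous_steps(lengths: list, capacity: int) -> int:
    """Inflight batching: a finished request's slot is refilled immediately."""
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < capacity:
            active.append(pending.pop(0))      # admit waiting requests
        steps += 1                             # one decode step for the batch
        active = [n - 1 for n in active if n > 1]  # evict finished requests
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_steps(lengths, capacity=4))      # 200: short requests wait on long ones
print(continuous_steps(lengths, capacity=4))  # 110: freed slots are refilled at once
```

The total work (260 request-steps) is identical in both cases; inflight batching simply stops burning steps on slots whose request already finished.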
The next hurdle you’ll likely face is managing the latency-throughput trade-off, especially with very large batch sizes or complex models.