ADHDecode

vLLM Articles

50 articles

vLLM with Ray Serve: Production Deployment Pattern

The most surprising thing about deploying large language models (LLMs) in production is how much of the…

2 min read

vLLM Request Priority Scheduling: Preempt Low-Priority Jobs

vLLM is doing something pretty wild with your LLM requests, and it's not just a simple queue. It's actively kicking lower-priority requests out of the way…

3 min read

vLLM Speculative Decoding: Draft Model for Faster Generation

Speculative decoding lets a small, fast model generate tokens and then have a larger, more accurate model verify them, dramatically speeding up inference.

4 min read

vLLM Tensor Parallelism: Multi-GPU Setup

Tensor parallelism splits a single large model layer across multiple GPUs, allowing you to run models that wouldn't fit on a single GPU or to speed up inference.

3 min read

vLLM Throughput and Latency Benchmarking: Measure Your Setup

vLLM is designed for high-throughput, low-latency LLM inference, but achieving optimal performance requires understanding and tuning your specific setup.

3 min read

vLLM Tool Calling: Function Calling API Support

The most surprising thing about vLLM's tool calling support is that it's not a separate feature you enable, but rather a core capability that emerges from…

3 min read

vLLM Vision Language Models: Serve LLaVA and Qwen-VL

vLLM Vision Language Models: Serve LLaVA and Qwen-VL — practical guide covering vLLM setup, configuration, and troubleshooting with real-world examples.

4 min read

vLLM vs TensorRT-LLM: Compare Inference Frameworks

vLLM and TensorRT-LLM are both high-performance inference frameworks for large language models (LLMs), but they approach optimization with different philosophies.

3 min read

vLLM vs Triton: Choose the Right Serving Framework

vLLM and Triton are both powerful tools for serving large language models (LLMs), but they target different needs and excel in different areas.

3 min read

Fix vLLM CUDA Out of Memory During KV Cache Allocation

The vLLM OutOfMemoryError during KV cache allocation means the GPU ran out of VRAM to store the attention key-value states for the sequences being processed.

4 min read

vLLM A100 vs H100 Performance: Throughput and Latency

vLLM A100 vs H100 Performance: Throughput and Latency — practical guide covering vLLM setup, configuration, and troubleshooting with real-world examples.

3 min read

vLLM API Server Authentication and Security

The vLLM API server, by default, doesn't enforce any authentication, meaning anyone who can reach its network endpoint can send requests and potentially…

3 min read

vLLM Async Engine: Production Server Setup

The vLLM async engine lets you serve large language models with impressive throughput, but getting it into production feels like navigating a minefield.

5 min read

vLLM Auto-Scaling on Kubernetes: HPA Configuration

The Horizontal Pod Autoscaler (HPA) in Kubernetes, when configured for vLLM, doesn't actually measure vLLM's inference throughput directly; it relies on…

2 min read

vLLM Batch Inference: Offline Processing at Scale

vLLM can process requests much faster than traditional methods because it runs inference in batches, which is essentially a way of grouping multiple requests…

3 min read

vLLM Best-of-N Sampling and Beam Search Config

vLLM's Best-of-N sampling and beam search aren't just about picking the "best" next token; they're a sophisticated dance of exploration and exploitation.

3 min read

vLLM Chat Completions API: OpenAI-Compatible Integration

The most surprising thing about vLLM's OpenAI-compatible API is that it's not just compatible, it often outperforms OpenAI's own models in terms of raw…

3 min read

vLLM Chunked Prefill: Tune for Throughput and TTFT

vLLM Chunked Prefill: Tune for Throughput and TTFT — practical guide covering vLLM setup, configuration, and troubleshooting with real-world examples.

3 min read

vLLM Continuous Batching: Maximize GPU Throughput

Continuous batching allows vLLM to process requests much faster by not waiting for all requests in a batch to finish before starting new ones.

3 min read

vLLM Cost Per Token Optimization: Reduce Inference Cost

vLLM's memory management is the hidden driver of its cost-efficiency, specifically its PagedAttention mechanism, which treats GPU memory like virtual memory.

3 min read

vLLM CPU Offloading: Run Large Models with Less GPU RAM

vLLM's CPU offloading lets you run models that are too big for your GPU by cleverly moving parts of the model's weights to CPU RAM.

2 min read

vLLM CUDA Graph Optimization: Reduce Kernel Launch Overhead

vLLM's CUDA graph optimization is a technique to significantly reduce the overhead associated with launching kernels on the GPU, particularly for repetitive…

2 min read

vLLM Custom Model Integration: Add Your Own Architecture

vLLM Custom Model Integration: Add Your Own Architecture — practical guide covering vLLM setup, configuration, and troubleshooting with real-world examples.

4 min read

vLLM Custom Sampling Parameters: Temperature, Top-P, Top-K

The most surprising thing about vLLM's custom sampling parameters is that top_p and top_k aren't just random filters, but rather a pair of mechanisms that…

3 min read

vLLM Disaggregated Prefill and Decode: Separate Phases

The most surprising thing about vLLM's disaggregated prefill and decode is that they are fundamentally different computational problems, and treating them…

3 min read

vLLM Docker and Kubernetes Production Deployment

vLLM, the lightning-fast inference engine, can be a bit prickly when you try to wrangle it into production with Docker and Kubernetes.

4 min read

vLLM Embedding Models: Serve and Query Embeddings

You can serve and query embedding models with vLLM, and the most surprising thing is how little you need to change from serving a text-generation model.

3 min read

vLLM Fine-Tuned Adapter Serving: LoRA and Full Finetunes

The most surprising thing about serving fine-tuned LLMs with vLLM is that sometimes, the smaller adapter weights like LoRA can actually be slower to load…

4 min read

vLLM FP8 Quantization: H100 Native FP8 Inference

The H100's FP8 tensor cores unlock a new level of inference throughput, but getting them to hum involves understanding a few key constraints.

3 min read

vLLM GGUF Loading: Serve Quantized GGUF Models

Quantized GGUF models can actually load and run faster than their unquantized counterparts, even though they use less memory.

2 min read

vLLM GPU Memory Utilization: Configure KV Cache Size

vLLM GPU Memory Utilization: Configure KV Cache Size — practical guide covering vllm setup, configuration, and troubleshooting with real-world examples.

3 min read

vLLM gRPC Server: Low-Latency Inference Endpoint

vLLM's gRPC server doesn't just serve model outputs; it's a highly optimized pipeline designed to minimize the wall-clock time between a request arriving…

3 min read

vLLM Guided Decoding: Constrain Output to JSON Schema

Guided decoding in vLLM lets you force an LLM to produce output that strictly conforms to a predefined structure, like JSON.

2 min read

vLLM Health Checks and Prometheus Metrics

vLLM is a surprisingly efficient inference engine, but its internal workings can feel like a black box, especially when it comes to monitoring.

3 min read

vLLM Installation: GPU Quickstart Guide

vLLM can actually run faster than Hugging Face Transformers on a single GPU, even for tasks you might think would be CPU-bound.

3 min read

vLLM Llama, Mistral, Mixtral: Model Serving Guide

Serving large language models efficiently is a surprisingly complex dance between hardware, software, and the model's own architecture.

3 min read

vLLM Load Balancing: Route Across Multiple Replicas

The most surprising thing about vLLM load balancing is that it's not just about spreading requests evenly; it's about predicting which replica can finish…

3 min read

vLLM LoRA Serving: Dynamically Swap Multiple Adapters

vLLM can dynamically swap multiple LoRA adapters on a single model instance, letting you serve many fine-tuned variations without spinning up new GPUs.

3 min read

vLLM Marlin and AWQ: Fastest INT4 Quantization

Marlin and AWQ are two cutting-edge techniques that let you run large language models (LLMs) on less hardware by quantizing their weights to INT4, but the…

4 min read

vLLM max_model_len: Configure Context Length

vLLM's max_model_len isn't just about how much text a model can process; it's about how much context it can hold across multiple turns.

3 min read

vLLM Mixtral MoE: Expert Parallelism Setup

Expert parallelism in vLLM for Mixtral MoE models isn't just about distributing experts; it's about orchestrating a symphony of specialized computation.

3 min read

vLLM Model Warm-Up: Eliminate Cold Start Latency

The primary reason vLLM feels sluggish on its first request is that it's not just loading weights; it's meticulously preparing its internal memory structures.

3 min read

vLLM Monitoring: P99 Latency Dashboards

vLLM's P99 latency dashboards are often misunderstood because they don't just measure request processing time; they also include the time spent waiting.

2 min read

vLLM Multi-Node Distributed Serving: Scale Across Hosts

vLLM's multi-node serving is how you stop thinking about fitting your massive LLM into a single GPU's memory and start thinking about fitting it across multiple hosts.

2 min read

vLLM OpenAI-Compatible Server: Drop-In API Replacement

The vLLM OpenAI-compatible server is so good at being a drop-in replacement for OpenAI's API that you can often forget it's not actually OpenAI.

3 min read

vLLM PagedAttention: Memory Management Explained

PagedAttention is a memory management system for large language models that achieves near-optimal memory utilization by treating GPU memory like virtual memory.

3 min read

vLLM Pipeline Parallelism: Serve Models Too Big for One GPU

Pipeline parallelism in vLLM is a technique that lets you serve large language models that wouldn't fit into a single GPU's memory by splitting the model…

2 min read

vLLM Prefix Caching: Reuse KV Cache for Common Prefixes

The most surprising thing about vLLM's prefix caching is that it doesn't just save computation, it fundamentally changes the cost structure of serving…

2 min read

vLLM Production Checklist: Config, Security, Monitoring

A vLLM deployment isn't just about loading a model; it's a distributed system where the inference server, the model weights, and the client requests are…

4 min read

vLLM Quantization: AWQ, GPTQ, and INT4 Inference

Quantization isn't just about making models smaller; it's a sophisticated form of lossy compression that fundamentally alters the model's weights to allow…

3 min read
© 2026 ADHDecode. All content is free.

  • Home
  • Learn
  • Courses
Esc
Start typing to search all courses...
See all results →
↑↓ navigate Enter open Esc close