Expert parallelism in vLLM for Mixtral MoE models isn’t just about distributing experts; it’s about orchestrating a symphony of specialized computation where the network chooses which instruments play, and the system ensures they can all play their part simultaneously without stepping on each other’s toes.

Let’s see this in action. Imagine a Mixtral 8x7B model running on two A100 GPUs. In each MoE layer, each of the 8 experts is a full feed-forward network, and that layer’s router decides which 2 experts to activate for each token. With expert parallelism, we can split these experts across GPUs.

Here’s a simplified representation of a request being processed.

Request: {"prompt": "Explain expert parallelism in Mixtral.", "max_tokens": 50}

Scenario: Mixtral 8x7B, 2 GPUs (GPU0, GPU1).

vLLM Configuration (Conceptual; exact flag names vary by vLLM version):

{
  "model": "mistralai/Mixtral-8x7B-v0.1",
  "tensor_parallel_size": 1, // We're focusing on expert parallelism here
  "pipeline_parallel_size": 1,
  "worker_use_ray": true, // Assuming Ray for distributed workers
  "expert_parallel_size": 2 // This is the key!
}

In this setup, vLLM will attempt to distribute the 8 experts in each MoE layer across the 2 "expert parallel" workers: 4 experts will reside on the worker running on GPU0, and the other 4 on the worker running on GPU1.
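The placement just described can be sketched in a few lines of Python. The even block split and the helper name expert_to_worker are illustrative assumptions, not vLLM’s actual internals:

```python
# How 8 experts might be assigned to expert-parallel workers. The even
# block split and the helper name are illustrative, not vLLM internals.
NUM_EXPERTS = 8
EXPERT_PARALLEL_SIZE = 2

def expert_to_worker(expert_id, num_experts=NUM_EXPERTS, ep_size=EXPERT_PARALLEL_SIZE):
    """Map an expert index to the worker (GPU) holding its weights."""
    experts_per_worker = num_experts // ep_size
    return expert_id // experts_per_worker

placement = {w: [e for e in range(NUM_EXPERTS) if expert_to_worker(e) == w]
             for w in range(EXPERT_PARALLEL_SIZE)}
# placement == {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

Worker 0 (GPU0) ends up with experts 0-3 and worker 1 (GPU1) with experts 4-7, which is the layout the walkthrough below assumes.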

Internal Flow (Simplified):

  1. Tokenization & Input Processing: The prompt is tokenized.
  2. Router Network: The shared (non-expert) parts of the layer, such as attention, process the tokens. The layer’s router then scores all 8 experts and selects the top 2 to send each token to.
  3. Expert Dispatch:
    • If a token is routed to Expert 1 (on GPU0) and Expert 3 (on GPU1):
      • The token’s computation for Expert 1 is sent to the worker on GPU0.
      • The token’s computation for Expert 3 is sent to the worker on GPU1.
  4. Expert Computation:
    • GPU0 Worker: Computes Expert 1’s output for the relevant tokens.
    • GPU1 Worker: Computes Expert 3’s output for the relevant tokens.
  5. Gather & Combine: The outputs from the activated experts (Expert 1 and Expert 3 in this example) are gathered back. The router’s gating weights are applied, and the results are summed into a single output per token.
  6. Subsequent Layers: The combined output proceeds through the rest of the model layers, potentially involving more expert dispatches if there are subsequent MoE layers.

This allows for a single Mixtral model to utilize more experts than can fit on a single GPU, effectively scaling the model’s capacity by distributing its specialized components.

The core problem expert parallelism solves is memory. A full Mixtral 8x7B model has 8 independent feed-forward networks in every layer, each comparable in size to a standard 7B model’s FFN, for roughly 47B parameters in total: about 93 GB in fp16, more than a single 80 GB A100 can hold even before the KV cache. Expert parallelism splits these experts across GPUs, so each GPU only needs to hold a subset of the experts plus the shared weights. The system then intelligently routes tokens to the correct GPU for the activated experts.
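Rough arithmetic makes the memory pressure concrete. The layer shapes below are Mixtral 8x7B’s published dimensions; the shared-parameter count is a round-number assumption covering attention, embeddings, norms, and routers:

```python
# Approximate Mixtral 8x7B shapes; shared_params is a rough assumption.
HIDDEN, INTERMEDIATE, LAYERS, EXPERTS = 4096, 14336, 32, 8
BYTES_FP16 = 2

# Each expert is a SwiGLU FFN: gate, up, and down projections.
params_per_expert_layer = 3 * HIDDEN * INTERMEDIATE          # ~176M
expert_params = params_per_expert_layer * EXPERTS * LAYERS   # ~45.1B
shared_params = 1.6e9  # attention, embeddings, norms, routers (rough)

total_gb = (expert_params + shared_params) * BYTES_FP16 / 1e9
per_gpu_gb = (expert_params / 2 + shared_params) * BYTES_FP16 / 1e9  # EP=2

print(f"full model:    ~{total_gb:.0f} GB fp16")   # over one 80 GB A100
print(f"per GPU, EP=2: ~{per_gpu_gb:.0f} GB")      # fits, with KV-cache room
```

The split comes out to roughly 93 GB for the full model versus roughly 48 GB per GPU with expert parallelism of 2, which is exactly why the two-GPU setup works.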

Here’s how vLLM orchestrates this, focusing on the conceptual expert_parallel_size parameter. When expert_parallel_size > 1, vLLM spins up multiple "worker" processes, typically mapped to different GPUs. In this conceptual setup, the total number of workers is tensor_parallel_size * pipeline_parallel_size * expert_parallel_size. If tensor_parallel_size and pipeline_parallel_size are both 1, then expert_parallel_size directly dictates the number of workers, and thus the number of GPUs involved in expert computation.

Each worker is responsible for a slice of the model’s experts. For Mixtral 8x7B and expert_parallel_size=2, you’d have two workers, each holding 4 experts from every MoE layer. When a token’s router directs it to, say, Expert 0 and Expert 5, the computation for Expert 0 happens on Worker 0 (on GPU0) and Expert 5 on Worker 1 (on GPU1). vLLM’s communication layer ensures these results are synchronized and combined correctly before proceeding to the next layer.

The critical aspect is that the router still makes its decision based on the full model’s logic, but the execution of the chosen experts is distributed. This means you can run models with more experts than would fit on a single device, provided you have enough devices to distribute them.

To verify your setup, you can use nvidia-smi to see GPU utilization. If you have expert_parallel_size=2 and are processing tokens, you should see roughly balanced GPU utilization across the two GPUs involved in expert computation, with each GPU primarily handling its assigned subset of experts.
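For example, one quick way to watch both GPUs side by side uses nvidia-smi’s standard query fields:

```shell
# Poll both GPUs once a second; with expert_parallel_size=2 under steady
# traffic, utilization and memory usage should look roughly symmetric.
watch -n 1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```

A large, persistent imbalance between the two GPUs usually points at a misconfiguration rather than at routing skew, since the shared layers run on both devices.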

The "experts" themselves are the Feed-Forward Networks (FFNs) within each Transformer block. In Mixtral, the FFN position in a block is not one monolithic network but a collection of 8 distinct FFNs, and a "router" network selects which ones to use for a given token. Expert parallelism takes this architectural feature and maps it onto multiple devices. Instead of one GPU holding all 8 FFNs, GPU0 might hold FFNs 0-3, and GPU1 holds FFNs 4-7. When a token needs Experts 1 and 6, GPU0 computes Expert 1 and GPU1 computes Expert 6, and their results are then combined.
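Each Mixtral expert is a SwiGLU FFN with gate, up, and down projections. This toy version shrinks the dimensions and sums the two expert outputs without gating weights, purely to show the structure; the class name and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, INTERMEDIATE = 8, 16  # tiny stand-ins for Mixtral's 4096 / 14336

class ExpertFFN:
    """One Mixtral-style expert: a SwiGLU feed-forward network."""
    def __init__(self):
        self.w_gate = rng.standard_normal((HIDDEN, INTERMEDIATE)) * 0.1
        self.w_up   = rng.standard_normal((HIDDEN, INTERMEDIATE)) * 0.1
        self.w_down = rng.standard_normal((INTERMEDIATE, HIDDEN)) * 0.1

    def __call__(self, x):
        gate = x @ self.w_gate
        silu = gate / (1.0 + np.exp(-gate))  # SiLU(z) = z * sigmoid(z)
        return (silu * (x @ self.w_up)) @ self.w_down

experts = [ExpertFFN() for _ in range(8)]

def gpu_of(expert_id):
    # Conceptually: GPU0 holds FFNs 0-3, GPU1 holds FFNs 4-7.
    return 0 if expert_id < 4 else 1

x = rng.standard_normal(HIDDEN)
# Token routed to Experts 1 and 6: one partial result per GPU, combined
# at the end (router gating weights omitted here for brevity).
y = experts[1](x) + experts[6](x)
```

In the real model the two partial results would be scaled by the router’s gating weights before the sum, and the combine would involve a cross-GPU gather rather than a local addition.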

The most surprising thing is that the router’s logic doesn’t change; it still thinks it’s selecting from 8 experts, but the underlying system intercepts those selections and sends the computation to the correct GPU that holds the physical FFN for that expert.

The next step in understanding distributed inference for large models is often exploring how tensor parallelism and pipeline parallelism can be combined with expert parallelism to maximize throughput and minimize latency on even larger clusters.

Want structured learning?

Take the full vLLM course →