The most surprising thing about W&B Launch is that it’s not just a fancy scheduler; it’s a distributed system that fundamentally changes how you think about your training infrastructure.

Imagine you’ve got a killer new model and you want to run it on a fleet of machines, maybe even across different cloud providers. You’re tired of manually SSHing into servers, copying code, setting up environments, and launching wandb agent processes. W&B Launch aims to automate all of that.

Let’s see it in action.

First, you define your training job. This isn’t just your Python script; it’s a full recipe.

# my_training_job.yaml
job_type: train-model
# This is the Docker image that will run your code.
# W&B will pull this image onto the worker nodes.
image: wandb/deeplearning:latest
# This is the command that will be executed inside the container.
# It's your standard training script, but with W&B integration.
command:
  - python
  - train.py
  - --learning-rate
  - 0.001
  - --epochs
  - 100
# Define any environment variables needed by your training script.
env:
  # Better injected at launch time (e.g., from a secrets manager) than hard-coded here.
  MY_API_KEY: "your_secret_key"
# Specify resource requirements for the job.
resources:
  cpu: 2
  gpu: 1
  memory: "8Gi"

Then, you tell W&B Launch where to run these jobs. This is your "queue" configuration.

# my_queue.yaml
queue_name: gpu-training-queue
# This tells W&B where to provision compute.
# Here, we're using AWS EC2. You could also use Kubernetes, GCP, etc.
provider:
  type: aws
  region: us-east-1
  instance_type: g4dn.xlarge # A GPU instance type
  # You can specify a custom AMI if needed.
  # ami: ami-0abcdef1234567890
  # Number of machines to spin up for this queue.
  # W&B will manage scaling this up and down based on job demand.
  min_instances: 0
  max_instances: 5
# This is the Docker image W&B Launch uses to run the agent itself.
# It's a lightweight image that connects to the W&B API and pulls jobs.
agent_image: wandb/agent:latest

You then start your queue manager:

wandb launch --queue my_queue.yaml

This command spins up a small number of worker nodes (or connects to existing ones if you’ve configured it that way) and starts wandb agent processes on them. These agents constantly poll the W&B API for jobs associated with gpu-training-queue.
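The agent loop itself is conceptually simple. Here's a rough Python sketch of that polling behavior; the queue list and the run_job callback are illustrative stand-ins for the W&B API and the container-launch step, not real W&B interfaces:

```python
import time

def poll_for_jobs(queue, run_job, max_polls=100, interval=1.0):
    """Illustrative agent loop: repeatedly ask the queue for work.

    `queue` stands in for the W&B API; `run_job` stands in for the
    container-launch step. Neither name comes from the real agent.
    """
    completed = []
    for _ in range(max_polls):
        job = queue.pop(0) if queue else None
        if job is None:
            time.sleep(interval)  # nothing to do; back off before re-polling
            continue
        run_job(job)
        completed.append(job)
    return completed
```

The real agent runs this loop indefinitely; `max_polls` here just keeps the sketch finite.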

When you submit a job:

wandb launch --job-type train-model --queue gpu-training-queue --config my_training_job.yaml

W&B Launch finds an available agent on the gpu-training-queue. The agent then:

  1. Pulls the specified Docker image (wandb/deeplearning:latest).
  2. Runs the command (python train.py ...) inside a container on the worker node.
  3. Streams all your W&B logs, metrics, and artifacts back to your W&B project automatically.
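Steps 1 and 2 amount to turning the job config into a container invocation. A hedged sketch of that mapping (the real agent talks to the Docker daemon programmatically; docker_run_argv is our illustrative helper, not a W&B function):

```python
def docker_run_argv(image, command, env=None):
    """Build the `docker run` argv an agent would conceptually execute
    for a job: environment variables, then image, then the job command."""
    argv = ["docker", "run", "--rm"]
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]  # pass through job env vars
    argv.append(image)
    argv += list(command)  # the command runs inside the container
    return argv
```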

The magic is in the orchestration. W&B Launch handles the provisioning of compute resources (spinning up EC2 instances in our AWS example), containerization, and job dispatching. You just define what you want to run and where, and W&B makes it happen.

The job_type field in your job configuration is a crucial piece of the puzzle. It acts as a tag that the wandb agent processes running on your worker nodes look for. When you launch a job with wandb launch --job-type train-model --queue gpu-training-queue, W&B Launch ensures that only agents listening for train-model jobs on the gpu-training-queue will pick it up. This lets you run different queues for different workloads (e.g., a gpu-training-queue and a cpu-inference-queue) and define job types that target specific queues.
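That routing rule fits in a few lines. This is an illustrative predicate capturing the behavior described above, not W&B's actual matching logic:

```python
def agent_accepts(job, agent_queue, accepted_job_types=None):
    """Hypothetical matching rule: take a job only if it targets this
    agent's queue AND its job_type is accepted (None means accept any)."""
    if job["queue"] != agent_queue:
        return False
    return accepted_job_types is None or job["job_type"] in accepted_job_types
```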

Under the hood, W&B Launch leverages cloud provider and cluster APIs (EC2, GCP Compute Engine, the Kubernetes API) to dynamically scale your compute fleet. When jobs are submitted, it requests new instances; when they finish, it can scale back down. This means you only pay for the compute you’re actively using for your training runs. You define min_instances and max_instances in your queue configuration, and W&B Launch manages the autoscaling policies to stay within those bounds.
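The clamping part is the easy piece to reason about. A toy version of the scaling target, assuming one instance per active (queued or running) job:

```python
def desired_instances(active_jobs, min_instances, max_instances):
    """Toy autoscaling target: one instance per active job, clamped
    to the bounds from the queue configuration."""
    return max(min_instances, min(max_instances, active_jobs))
```

With min_instances: 0 and max_instances: 5 as in the queue config above, an idle queue scales to zero and a burst of submissions is capped at five machines.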

A common point of confusion is how the agent on the worker node knows which job to run. The wandb agent process, started by W&B Launch on each provisioned instance, is configured to poll a specific queue (--queue <queue_name>). When you submit a job using wandb launch --queue <queue_name> --job-type <job_type>, W&B Launch places that job definition into a queue. The agent on a worker node associated with that queue sees the new job, checks if its job_type matches (or if it’s configured to accept any job_type), and if so, pulls the necessary Docker image and executes the command.

If you’re running into issues where your jobs aren’t starting, double-check that the agent_image in your queue configuration is valid and accessible by your worker nodes. A common mistake is using a private Docker image without properly configuring the worker nodes to authenticate with the registry.
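One quick sanity check you can script on a worker node: inspect its Docker config and confirm it holds credentials for your private registry. This reads the standard ~/.docker/config.json layout (the auths and credHelpers keys); the function name is ours, not part of any W&B tooling:

```python
import json
from pathlib import Path

def registries_with_auth(config_path="~/.docker/config.json"):
    """List registries this node has stored credentials or a credential
    helper for. A private registry missing from this list is a likely
    cause of image-pull failures."""
    path = Path(config_path).expanduser()
    if not path.exists():
        return []
    cfg = json.loads(path.read_text())
    registries = set(cfg.get("auths", {}))        # registries with stored logins
    registries.update(cfg.get("credHelpers", {})) # registries using a helper
    return sorted(registries)
```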
