The most surprising thing about W&B Launch is that it’s not just a fancy scheduler; it’s a distributed system that fundamentally changes how you think about your training infrastructure.
Imagine you’ve got a killer new model and you want to run it on a fleet of machines, maybe even across different cloud providers. You’re tired of manually SSHing into servers, copying code, setting up environments, and launching wandb agent processes. W&B Launch aims to automate all of that.
Let’s see it in action.
First, you define your training job. This isn’t just your Python script; it’s a full recipe.
```yaml
# my_training_job.yaml
job_type: train-model

# This is the Docker image that will run your code.
# W&B will pull this image onto the worker nodes.
image: wandb/deeplearning:latest

# This is the command that will be executed inside the container.
# It's your standard training script, but with W&B integration.
command:
  - python
  - train.py
  - --learning-rate
  - 0.001
  - --epochs
  - 100

# Define any environment variables needed by your training script.
env:
  MY_API_KEY: "your_secret_key"

# Specify resource requirements for the job.
resources:
  cpu: 2
  gpu: 1
  memory: "8Gi"
```
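The `train.py` that the `command` list invokes is just an ordinary W&B-instrumented script. Here's a minimal sketch of what it might look like; the flag names mirror the `command` list above, while the project name `"my-project"` and the toy training loop are placeholders:

```python
# train.py -- a minimal sketch of the script the job config invokes.
# Flag names mirror the `command` list in my_training_job.yaml;
# "my-project" and the loss computation are placeholders.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Toy training entry point")
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=100)
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    import wandb  # imported here; the Launch container image has it installed
    run = wandb.init(project="my-project", config=vars(args))
    for epoch in range(args.epochs):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training step
        run.log({"epoch": epoch, "loss": loss})
    run.finish()

# Inside the container this runs as:
#   python train.py --learning-rate 0.001 --epochs 100
```

Because the hyperparameters arrive as plain CLI flags, the same script works unchanged whether you launch it locally or through a queue.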
Then, you tell W&B Launch where to run these jobs. This is your "queue" configuration.
```yaml
# my_queue.yaml
queue_name: gpu-training-queue

# This tells W&B where to provision compute.
# Here, we're using AWS EC2. You could also use Kubernetes, GCP, etc.
provider:
  type: aws
  region: us-east-1
  instance_type: g4dn.xlarge  # A GPU instance type
  # You can specify a custom AMI if needed.
  # ami: ami-0abcdef1234567890

# Number of machines to spin up for this queue.
# W&B will manage scaling this up and down based on job demand.
min_instances: 0
max_instances: 5

# This is the Docker image W&B Launch uses to run the agent itself.
# It's a lightweight image that connects to the W&B API and pulls jobs.
agent_image: wandb/agent:latest
```
You then start your queue manager:
```shell
wandb launch --queue my_queue.yaml
```
This command spins up a small number of worker nodes (or connects to existing ones if you’ve configured it that way) and starts wandb agent processes on them. These agents constantly poll the W&B API for jobs associated with gpu-training-queue.
When you submit a job:
```shell
wandb launch --job-type train-model --queue gpu-training-queue --config my_training_job.yaml
```
W&B Launch finds an available agent on the gpu-training-queue. The agent then:

- Pulls the specified Docker image (wandb/deeplearning:latest).
- Runs the command (python train.py ...) inside a container on the worker node.
- Streams all your W&B logs, metrics, and artifacts back to your W&B project automatically.
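Conceptually, the "pull and run" step boils down to translating the job definition into a `docker run` invocation. The helper below is our own illustration of that translation, not Launch's actual code:

```python
# Illustrative only: how a job definition could map onto a `docker run`
# argv. `build_docker_command` is a made-up helper, not part of W&B Launch.

def build_docker_command(image, command, env=None, gpus=0):
    argv = ["docker", "run", "--rm"]
    if gpus:
        argv += ["--gpus", str(gpus)]          # request GPU access
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]       # inject job env vars
    argv.append(image)                         # image comes before the command
    argv += command
    return argv

cmd = build_docker_command(
    image="wandb/deeplearning:latest",
    command=["python", "train.py", "--learning-rate", "0.001"],
    env={"MY_API_KEY": "your_secret_key"},
    gpus=1,
)
# The agent would then execute this argv via subprocess and stream its output.
```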
The magic is in the orchestration. W&B Launch handles the provisioning of compute resources (spinning up EC2 instances in our AWS example), containerization, and job dispatching. You just define what you want to run and where, and W&B makes it happen.
The job_type field in your job configuration is a crucial piece of the puzzle. It acts as a tag that the wandb agent processes running on your worker nodes filter on. When you launch a job with wandb launch --job-type train-model --queue gpu-training-queue, W&B Launch ensures that only agents listening for train-model jobs on the gpu-training-queue pick it up. This lets you maintain different queues for different kinds of workloads (e.g., a gpu-training-queue and a cpu-inference-queue) and different job types that target specific queues.
Under the hood, W&B Launch drives cloud provider and orchestrator APIs (AWS EC2, GCP Compute Engine, or the Kubernetes API) to dynamically scale your compute fleet. When jobs are submitted, it requests new instances; when they finish, it can scale back down, so you only pay for the compute you're actively using for your training runs. You define min_instances and max_instances in your queue configuration, and W&B Launch manages the autoscaling policy to stay within those bounds.
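The core scaling decision is easy to reason about: target roughly one instance per job that needs to run, clamped to the configured bounds. A hedged sketch of that policy (real autoscalers typically add cooldowns, batching, and spin-up latency handling on top):

```python
def desired_instances(pending_jobs, running_jobs, min_instances, max_instances):
    """Toy autoscaling target: one instance per job, clamped to [min, max].

    Illustrative only -- not W&B Launch's actual scaling policy.
    """
    wanted = pending_jobs + running_jobs
    return max(min_instances, min(wanted, max_instances))

# e.g. with min_instances=0, max_instances=5 (as in my_queue.yaml):
#   desired_instances(3, 1, 0, 5) targets 4 instances,
#   while 10 pending jobs would still be capped at 5.
```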
A common point of confusion is how the agent on the worker node knows which job to run. The wandb agent process, started by W&B Launch on each provisioned instance, is configured to poll a specific queue (--queue <queue_name>). When you submit a job using wandb launch --queue <queue_name> --job-type <job_type>, W&B Launch places that job definition into a queue. The agent on a worker node associated with that queue sees the new job, checks if its job_type matches (or if it’s configured to accept any job_type), and if so, pulls the necessary Docker image and executes the command.
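That poll-match-execute cycle can be sketched with an in-memory list standing in for the W&B API's queue. The `Job` and `Agent` classes here are illustrative stand-ins for the real agent internals, which poll over HTTP and run jobs in containers:

```python
# Toy model of the agent's poll loop. Job, Agent, and poll_once are
# illustrative stand-ins, not the wandb agent's real implementation.
from dataclasses import dataclass

@dataclass
class Job:
    job_type: str
    image: str
    command: list

class Agent:
    def __init__(self, queue, accepted_job_types=None):
        self.queue = queue               # list used as a FIFO stub for the W&B API
        self.accepted = accepted_job_types  # None means "accept any job_type"
        self.executed = []

    def accepts(self, job):
        return self.accepted is None or job.job_type in self.accepted

    def poll_once(self):
        """Take and 'run' the first queued job whose job_type matches."""
        for i, job in enumerate(self.queue):
            if self.accepts(job):
                self.queue.pop(i)
                # Real agent: docker pull job.image, then run job.command
                # inside a container and stream logs back to W&B.
                self.executed.append(job)
                return job
        return None  # nothing matching; the real agent sleeps and re-polls

queue = [
    Job("cpu-inference", "my-cpu-image", ["python", "infer.py"]),
    Job("train-model", "wandb/deeplearning:latest", ["python", "train.py"]),
]
agent = Agent(queue, accepted_job_types={"train-model"})
picked = agent.poll_once()  # skips the cpu-inference job, takes train-model
```

Note that the non-matching cpu-inference job stays in the queue for some other agent to claim, which is exactly why mismatched job_type/queue pairings manifest as jobs that sit queued forever.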
If you’re running into issues where your jobs aren’t starting, double-check that the agent_image in your queue configuration is valid and accessible by your worker nodes. A common mistake is using a private Docker image without properly configuring the worker nodes to authenticate with the registry.