The W&B Kubernetes Agent lets you run your machine learning training jobs as pods on your Kubernetes cluster, managed by Weights & Biases.

Let’s see it in action. Imagine you have a Python script train.py that logs metrics to W&B:

import wandb
import time
import random

# Log in to W&B (replace with your API key or use environment variables)
# wandb.login(key="YOUR_API_KEY")

# Start a W&B run
run = wandb.init(project="k8s-training-demo", job_type="training")

print("W&B run started:", run.url)

# Simulate training
for i in range(100):
    loss = 1.0 / (i + 1) + random.random() * 0.1
    accuracy = 1.0 - loss / 2.0 + random.random() * 0.05
    wandb.log({"loss": loss, "accuracy": accuracy, "step": i})
    time.sleep(0.5)

print("Training finished.")
run.finish()

To run this on Kubernetes using the W&B agent, you’ll first need to install the agent in your cluster. This typically involves applying a manifest file that deploys the agent as a Kubernetes Deployment. Once running, the agent watches for W&B-managed workloads and launches your training jobs.
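For orientation, the installed agent usually looks like an ordinary Deployment. The manifest below is a hypothetical sketch only: the resource name, image tag, and configuration are placeholders, not the official manifest, so consult the W&B installation docs for the real one.

```yaml
# Hypothetical sketch of an agent Deployment. The image name and
# resource names are placeholders, not the official W&B manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wandb-agent            # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wandb-agent
  template:
    metadata:
      labels:
        app: wandb-agent
    spec:
      containers:
      - name: agent
        image: wandb/agent:latest     # placeholder image tag
        env:
        - name: WANDB_API_KEY         # the agent authenticates with W&B too
          valueFrom:
            secretKeyRef:
              name: wandb-secret
              key: api-key
```

The agent itself reads the API key from the same kind of secret your training jobs use, so one `wandb-secret` can serve both.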

Here’s a simplified example of how you might trigger a training job from your local machine using kubectl and the W&B agent. First, ensure your kubectl context is set to your desired Kubernetes cluster.

You’ll define a Kubernetes Job resource that tells the agent what to do. This Job will specify your Docker image, the command to run, and any environment variables needed, like your W&B API key.

apiVersion: batch/v1
kind: Job
metadata:
  name: my-tensorflow-training
spec:
  template:
    spec:
      containers:
      - name: training-container
        image: your-dockerhub-username/your-ml-image:latest # Replace with your ML image
        command: ["python", "/app/train.py"] # Command to run your training script
        env:
        - name: WANDB_API_KEY
          valueFrom:
            secretKeyRef:
              name: wandb-secret # A Kubernetes secret containing your W&B API key
              key: api-key
        # Mount persistent volumes if your training needs to read data or save checkpoints
        # volumeMounts:
        # - name: data-volume
        #   mountPath: /data
      restartPolicy: Never # Required for Jobs (Never or OnFailure); retries are governed by backoffLimit below
      # volumes:
      # - name: data-volume
      #   persistentVolumeClaim:
      #     claimName: my-data-pvc
  backoffLimit: 4 # Number of retries before the Job is marked as failed

Before applying this, you need to create the Kubernetes secret wandb-secret with your W&B API key:

kubectl create secret generic wandb-secret --from-literal=api-key='YOUR_ACTUAL_WANDB_API_KEY'
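Under the hood, Kubernetes stores secret values base64-encoded, which is encoding, not encryption. A quick self-contained sketch of that round-trip, using a dummy key in place of a real one:

```python
import base64

# Kubernetes base64-encodes secret data; this is encoding, not
# encryption. A dummy key stands in for a real one here.
api_key = "YOUR_ACTUAL_WANDB_API_KEY"

# Encode the way Kubernetes stores it, then decode it back
encoded = base64.b64encode(api_key.encode()).decode()
decoded = base64.b64decode(encoded).decode()

print(encoded)            # the value `kubectl get secret wandb-secret -o yaml` would show
print(decoded == api_key)
```

This is why you should treat secrets as access-controlled rather than hidden: anyone who can read the secret can trivially decode the key.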

Now, you can create the Job resource:

kubectl apply -f job.yaml

When you apply the Job, the W&B Kubernetes Agent running in your cluster detects the new resource and recognizes it as intended for W&B-managed execution. The agent then orchestrates the creation of the Kubernetes Pod that runs your train.py script inside the specified Docker image. Because the pod receives WANDB_API_KEY from the secret, the script can authenticate with W&B.
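You can follow the Job's progress with standard kubectl commands. These run against your own cluster; the label selector and Job name below assume the `my-tensorflow-training` manifest shown earlier.

```shell
# Check the Job and the pod it created
kubectl get jobs
kubectl get pods -l job-name=my-tensorflow-training

# Stream the training script's stdout, including the W&B run URL it prints
kubectl logs -f job/my-tensorflow-training

# Inspect events if the pod fails to start (image pull errors, missing secret, etc.)
kubectl describe job my-tensorflow-training
```

The `job-name` label is added automatically by Kubernetes to every pod a Job creates, which makes it a reliable selector even when the pod name has a random suffix.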

As your train.py script executes, the wandb SDK running inside the pod streams its logs and metrics to your specified W&B project. You can then monitor the training progress live in your W&B dashboard.

The core idea is that the agent acts as a bridge: you define your training as a standard Kubernetes Job (or Pod, or CronJob), and the agent recognizes the definitions intended for W&B and ensures they execute correctly on the cluster while hooking into W&B’s logging and orchestration capabilities. This lets you leverage Kubernetes for scaling, resource management, and fault tolerance of your ML workloads, while keeping experiment tracking integrated with Weights & Biases.

The agent’s power comes from its ability to interpret specific annotations or resource types that signal W&B management. For instance, certain annotations on a Pod or Job might instruct the agent to pull a specific W&B run configuration or to link the pod to an existing W&B run. This declarative approach means you define what you want to run, and the agent figures out how to run it on Kubernetes, including setting up the necessary environment variables and potentially downloading code directly from W&B.
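As a sketch of that declarative pattern, such annotations might look like the following. Note that the annotation keys here are invented for illustration (hence the `example.` prefix); they stand in for whatever keys a given agent version actually recognizes.

```yaml
# Illustrative only: these annotation keys are hypothetical placeholders,
# not documented W&B annotation keys.
apiVersion: batch/v1
kind: Job
metadata:
  name: annotated-training
  annotations:
    example.wandb.ai/managed: "true"    # hypothetical: opt this Job in to agent management
    example.wandb.ai/run-id: "abc123"   # hypothetical: link the pod to an existing W&B run
```

The value of this style is that the Job spec stays a plain Kubernetes resource; metadata, not a custom resource type, signals W&B involvement.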

One of the most powerful, yet often overlooked, aspects of the W&B Kubernetes Agent is its ability to manage code synchronization. If you don’t bake your code into a Docker image, you can configure the agent to automatically git clone your repository at a specific commit or branch before starting your training script. This is achieved by adding specific environment variables to your Job definition that the agent recognizes, like WANDB_CODE_PATH and WANDB_GIT_BRANCH. The agent then handles the cloning process within the training pod, ensuring your script runs with the exact code version you intended, without requiring you to build and push new Docker images for every code change.
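Following that description, the container's env section would carry those variables. WANDB_CODE_PATH and WANDB_GIT_BRANCH are the names as given above, and the repository URL is a placeholder; treat the whole fragment as a sketch to verify against the agent version you deploy.

```yaml
# Sketch based on the variables named in this section; verify the exact
# variable names and semantics against your agent version.
env:
- name: WANDB_CODE_PATH
  value: "https://github.com/your-org/your-repo.git"  # placeholder repository URL
- name: WANDB_GIT_BRANCH
  value: "main"
```

Because the clone happens inside the training pod at startup, the Docker image only needs your dependencies installed, not your code.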

Once your jobs are running smoothly, you’ll start thinking about how to manage multiple experiments efficiently.

Want structured learning?

Take the full W&B course →