Weights & Biases (W&B) logging for SageMaker training jobs is designed to provide seamless experiment tracking and visualization directly from your SageMaker environment, offering insights into model performance, hyperparameters, and resource utilization.
Here’s a quick demo of W&B logging in action with a simple SageMaker training script.
First, let’s set up a basic Python script (train.py) that uses W&B.
import argparse
import wandb
import time
import random
def train(epochs, learning_rate):
    wandb.init(project="sagemaker-demo", config={
        "learning_rate": learning_rate,
        "epochs": epochs,
    })
    for epoch in range(epochs):
        # Simulated metrics: loss decays, accuracy climbs toward ~1.0
        loss = 1.0 / (epoch + 1) + random.random() * 0.1
        accuracy = 0.5 + 0.5 * epoch / epochs + random.random() * 0.05
        wandb.log({"epoch": epoch, "loss": loss, "accuracy": accuracy})
        time.sleep(0.5)  # Simulate a training step
    wandb.log({"final_accuracy": accuracy, "final_loss": loss})
    wandb.finish()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    args = parser.parse_args()
    train(args.epochs, args.learning_rate)
To integrate this with SageMaker, you’ll need to:
- Install W&B: Ensure the wandb library is available in your SageMaker environment. You can do this by including it in your requirements.txt file.
- Configure W&B API Key: Set your W&B API key as an environment variable. In SageMaker, this is typically done through the estimator's environment variable configuration for training jobs.
- Launch a SageMaker Training Job: Use the SageMaker Python SDK to define and launch your training job.
Here’s how you’d launch this from a SageMaker notebook or script:
import sagemaker
from sagemaker.tensorflow import TensorFlow # Or PyTorch, MXNet, etc.
# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
# Define training script and entry point
entry_point_script = 'train.py'
source_dir = '.' # Assuming train.py is in the current directory
# Define hyperparameters
hyperparameters = {
    "epochs": 20,
    "learning_rate": 0.005,
}
# Define W&B environment variables
# You can get your API key from https://wandb.ai/settings
wandb_api_key = "YOUR_WANDB_API_KEY"  # Replace with your actual API key (better: inject it from AWS Secrets Manager than hard-code it)
environment_variables = {
    "WANDB_API_KEY": wandb_api_key,
    "WANDB_PROJECT": "sagemaker-demo",    # Optional: used when the script does not pass a project to wandb.init()
    "WANDB_ENTITY": "your-wandb-entity",  # Optional: replace with your W&B entity/username
}
# Define dependencies (e.g., requirements.txt)
# Ensure wandb is listed in requirements.txt
# Example requirements.txt:
# wandb
# tensorflow # or torch, etc.
# Configure the estimator
# Using TensorFlow estimator as an example, adjust for your framework
estimator = TensorFlow(
    entry_point=entry_point_script,
    source_dir=source_dir,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',  # Choose an appropriate instance type
    framework_version='2.11',     # Specify your framework version
    py_version='py39',            # Specify your Python version
    hyperparameters=hyperparameters,
    environment=environment_variables,
    # If your requirements.txt is not in source_dir, specify it:
    # dependencies=['path/to/requirements.txt'],
)
# Launch the training job
estimator.fit({'training': 's3://your-bucket/your-data/'}) # Add S3 data path if needed
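As an aside on how the hyperparameters above reach the script: SageMaker script mode forwards each entry of the hyperparameters dict to train.py as a command-line flag, which is why train.py parses them with argparse. A rough sketch of that conversion (illustrative only; the SDK's actual serialization is more involved):

```python
def hyperparameters_to_args(hyperparameters):
    """Turn a SageMaker-style hyperparameter dict into argv-style flags.

    Illustrative sketch, not the SageMaker SDK's real implementation.
    """
    args = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args

# The estimator's hyperparameters dict effectively becomes:
# python train.py --epochs 20 --learning_rate 0.005
print(hyperparameters_to_args({"epochs": 20, "learning_rate": 0.005}))
```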
Once the training job starts, W&B will automatically begin logging metrics and system information. You can view these in your W&B project dashboard at https://wandb.ai/your-wandb-entity/sagemaker-demo.
The most surprising thing about this integration is how little code modification your training script actually requires. You call wandb.init() and wandb.log(), and with WANDB_API_KEY set in the container environment, the W&B SDK takes care of the rest. This includes automatically capturing system metrics like CPU utilization, GPU utilization, memory usage, and network traffic without any explicit instrumentation in your code.
Internally, the W&B SDK reads the WANDB_API_KEY environment variable that SageMaker injects into the training container from the estimator's environment configuration. When wandb.init() runs, the SDK starts a background process that periodically samples system metrics and streams them to the W&B servers alongside your custom training metrics. The WANDB_PROJECT and WANDB_ENTITY variables direct this data to the correct project and account on W&B.
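One detail worth knowing is W&B's documented settings precedence: a value passed directly to wandb.init() generally wins over the corresponding environment variable, which wins over the default. A simplified sketch of that rule (the helper name resolve_project is illustrative and not part of the W&B SDK):

```python
import os

def resolve_project(project_from_code=None, default="uncategorized"):
    """Sketch of W&B's documented settings precedence: a project passed
    directly to wandb.init() wins over the WANDB_PROJECT environment
    variable, which wins over the default.

    Illustrative helper only -- not part of the W&B SDK.
    """
    return project_from_code or os.environ.get("WANDB_PROJECT") or default
```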
The key levers you control are the hyperparameters passed to your script (which are logged by W&B), the W&B project/entity configuration via environment variables, and the standard SageMaker estimator configurations (instance type, count, framework version). The automatic system metric collection is a powerful, "set it and forget it" feature.
One aspect that often goes unnoticed is how W&B associates runs with specific SageMaker training jobs. SageMaker injects identifiers such as TRAINING_JOB_NAME and TRAINING_JOB_ARN into the training container as environment variables; W&B's integration detects this environment and records the identifiers as run metadata, making it easy to trace a W&B run back to the corresponding SageMaker job in the AWS console.
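The idea can be approximated with a few lines. The helper name sagemaker_run_metadata is hypothetical; W&B's real detection lives inside the SDK:

```python
import os

def sagemaker_run_metadata():
    """Collect SageMaker-injected job identifiers, if present, so they
    can be attached to a W&B run (for example via wandb.init(config=...)).

    Hypothetical helper -- W&B's actual detection lives inside the SDK.
    """
    keys = ("TRAINING_JOB_NAME", "TRAINING_JOB_ARN")
    return {key: os.environ[key] for key in keys if key in os.environ}
```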
The next step after getting basic logging working is to explore W&B’s model artifact management and hyperparameter optimization capabilities within SageMaker.
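As a taste of the hyperparameter-optimization side, a W&B sweep is configured with a plain dictionary and driven by wandb.sweep() and wandb.agent(). The search ranges below are illustrative values for the demo script, not recommendations:

```python
# Illustrative W&B sweep configuration for the train.py script above.
# The ranges are made up for the demo; tune them for a real model.
sweep_config = {
    "method": "bayes",  # Bayesian optimization over the search space
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 0.0001, "max": 0.1},
        "epochs": {"values": [10, 20, 30]},
    },
}

# A sweep would then be registered and run with something like
# (train_entry is a hypothetical zero-argument wrapper around train):
# sweep_id = wandb.sweep(sweep_config, project="sagemaker-demo")
# wandb.agent(sweep_id, function=train_entry)
```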