You can track your fine-tuning runs with W&B, and it’s surprisingly powerful for understanding how your model learns.
Here’s a typical fine-tuning setup:
```python
import wandb
from transformers import Trainer, TrainingArguments

# Initialize W&B
wandb.init(project="llm-finetuning-example", job_type="full-finetune")

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=50,
    evaluation_strategy="steps",
    eval_steps=50,
    report_to="wandb",  # This is key!
)

# Load your dataset and model (example: using Hugging Face)
# dataset = load_dataset(...)
# model = AutoModelForCausalLM.from_pretrained(...)
# tokenizer = AutoTokenizer.from_pretrained(...)

# Create a Trainer instance
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=dataset["train"],
#     eval_dataset=dataset["validation"],
#     tokenizer=tokenizer,
#     # compute_metrics=compute_metrics_function,
# )

# Start training
# trainer.train()

# Finish the W&B run
# wandb.finish()
```
When `report_to="wandb"` is set in `TrainingArguments`, the `Trainer` automatically logs metrics like loss, learning rate, and evaluation scores to your W&B project. You'll see them populate in real time in your W&B dashboard.
This setup is designed to solve the problem of observability during fine-tuning. Without it, you’re essentially flying blind, hoping the hyperparameters you chose are effective. W&B gives you a window into the training process, allowing you to:
- Monitor Loss: See if your training loss is decreasing and if your validation loss is also decreasing (or starting to increase, indicating overfitting).
- Track Learning Rate: Observe how the learning rate changes over time, especially if you’re using a scheduler.
- Evaluate Performance: See how your model performs on a held-out validation set at regular intervals, using custom metrics you define.
- Compare Runs: Easily compare different fine-tuning experiments side-by-side, changing only one hyperparameter at a time (e.g., learning rate, batch size, number of epochs) to see its impact.
- Reproduce Results: W&B logs all configurations, code versions, and hyperparameters, making your experiments reproducible.
The core W&B integration happens through the `wandb.init()` call and the `report_to="wandb"` argument in `TrainingArguments`. For the Hugging Face Trainer and Accelerate, this is often all you need. For custom training loops, you'd log metrics yourself with `wandb.log({"metric_name": metric_value}, step=current_step)` — note that the step is passed as the `step` keyword argument to `wandb.log`, not as a key inside the metrics dictionary (the latter would just log a metric named "step").
Beyond the automatic logging, you can log custom artifacts:
```python
# After training, log the fine-tuned model as an artifact
model_artifact = wandb.Artifact(
    name="fine-tuned-model",
    type="model",
    description="My fine-tuned LLM",
)

# Assuming 'model' is your trained model object and you save it locally
model.save_pretrained("./my_finetuned_model")
model_artifact.add_dir("./my_finetuned_model")
wandb.log_artifact(model_artifact)

# Log evaluation results
results = trainer.evaluate()
wandb.log(results)
```
This allows you to version your models and datasets directly within W&B, creating a clear lineage from data to trained model.
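In a later run (say, a standalone evaluation job), you can retrieve that artifact by name. A hedged sketch, where `:latest` is W&B's alias for the newest version (the calls that contact the W&B backend are commented out):

```python
# Sketch of consuming the model artifact in a downstream run.
artifact_ref = "fine-tuned-model:latest"  # name logged above + version alias

# import wandb
# run = wandb.init(project="llm-finetuning-example", job_type="evaluation")
# artifact = run.use_artifact(artifact_ref, type="model")
# model_dir = artifact.download()  # local directory with the saved files
# model = AutoModelForCausalLM.from_pretrained(model_dir)
```

Because `use_artifact` records the dependency, the W&B UI can then draw the lineage graph from dataset to model to evaluation run.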
A subtle but powerful aspect of W&B logging during fine-tuning is the capture of gradient norms and parameter updates. This takes a bit of manual work or a specific integration (for example, `wandb.watch(model, log="all")` asks W&B to record gradient and parameter histograms for you), but it provides deep insight into model stability and convergence. For instance, you can log gradient histograms manually:
```python
if wandb.run.config.get("log_gradients", False):  # example config flag
    for name, param in model.named_parameters():
        if param.grad is not None:
            wandb.log({f"gradients/{name}": wandb.Histogram(param.grad.detach().cpu().numpy())})
```
This allows you to visualize the distribution of gradients for each layer. If you see exploding gradients (very large values) or vanishing gradients (values close to zero) across many layers, it’s a strong indicator of training instability or that your learning rate is too high or too low. This level of detail is crucial for diagnosing convergence issues that simple loss curves might not reveal.
The next step after basic fine-tuning tracking is to set up hyperparameter sweeps to systematically explore the hyperparameter space.
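As a preview, a sweep is defined by a configuration like the following sketch. The metric name and parameter ranges are illustrative, and the `wandb.sweep`/`wandb.agent` calls are commented out because they contact the W&B backend:

```python
# Minimal sweep configuration sketch; names and ranges are illustrative.
sweep_config = {
    "method": "bayes",  # also "grid" or "random"
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-4},
        "per_device_train_batch_size": {"values": [2, 4, 8]},
    },
}

# import wandb
# sweep_id = wandb.sweep(sweep_config, project="llm-finetuning-example")
# wandb.agent(sweep_id, function=train_fn, count=10)  # train_fn: your training entry point
```

Each agent invocation runs your training function with hyperparameters sampled from this space, and every run lands in the same project for side-by-side comparison.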