Logging custom scalar metrics in Weights & Biases (W&B) lets you track exactly what matters for your specific machine learning project beyond the standard loss and accuracy.

Let’s see it in action. Imagine you’re training a recommendation system and want to track the "diversity" of your recommendations, a metric not built into W&B.

```python
import wandb
import random

# Initialize a W&B run
wandb.init(project="custom-metrics-demo")

# Simulate a training loop
for epoch in range(10):
    # Simulate calculating some custom metrics
    recommendation_diversity = random.random() * 10
    user_engagement_score = random.random() * 5

    # Log the custom metrics
    wandb.log({
        "epoch": epoch,
        "recommendation_diversity": recommendation_diversity,
        "user_engagement_score": user_engagement_score
    })
    print(f"Epoch {epoch}: Diversity={recommendation_diversity:.2f}, Engagement={user_engagement_score:.2f}")

# Finish the run
wandb.finish()
```

When you run this code, W&B will create a new run in your project. In the W&B UI, you’ll see a dashboard with charts for recommendation_diversity and user_engagement_score plotting their values across epochs. You can add these custom metrics to any existing W&B dashboard or create a new one dedicated to them.

The core problem W&B custom metrics solve is bridging the gap between your model’s performance and the specific business or research goals you have. Standard metrics like loss are great for optimization, but they might not tell you if your model is actually useful. For instance, a low loss doesn’t guarantee diverse recommendations if your model always suggests the same popular items. Custom metrics allow you to quantify these nuanced aspects.

Internally, W&B treats every logged key-value pair as a data point associated with a specific step (usually the training step or epoch). When you call wandb.log({"my_metric": value}), W&B stores value and associates it with the current step. These logged values are then available for visualization through W&B’s plotting tools. You can log numbers, strings, and W&B media types (images, histograms, tables, and so on), but scalars (integers, floats) are the most common and translate directly to line charts.
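To build intuition for this step association, here is a toy mock of the idea in plain Python. This is not W&B's actual implementation; `history` and `log` are hypothetical stand-ins for illustration only.

```python
# Hypothetical stand-in for W&B's run history: maps step -> metrics dict.
history = {}

def log(metrics, step):
    """Associate each key-value pair with a step, like wandb.log does."""
    history.setdefault(step, {}).update(metrics)

for epoch in range(3):
    # Multiple log calls at the same step merge into one row.
    log({"recommendation_diversity": epoch * 1.5}, step=epoch)
    log({"user_engagement_score": epoch * 0.5}, step=epoch)

# Each step now holds every metric logged at that step, which is
# conceptually what W&B's line charts read from when plotting.
print(history[2])
```

Because all metrics logged at the same step land in the same row, W&B can plot any metric against any other, not just against the step counter.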

The primary levers you control are the metric names and the values you log. You can log as many distinct scalar metrics as you need, each under its own unique key. Logging frequency also matters: logging every step provides high granularity, while logging once per epoch gives a broader overview. For metrics calculated on validation sets, a common convention is a name prefix, e.g. wandb.log({"val/my_metric": value}). Note that wandb.log itself does not synchronize across workers, so in distributed training you should log from a single process (typically rank 0) or aggregate values across processes before logging.
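The prefix convention can be wrapped in a small helper. This is a minimal sketch; `make_val_payload` is a hypothetical function name, and it relies on the fact that W&B groups metric keys sharing a slash prefix into the same chart section.

```python
import random

def make_val_payload(metrics):
    """Prefix validation metrics with 'val/' so W&B groups them
    into their own chart section (slash prefixes create sections)."""
    return {f"val/{name}": value for name, value in metrics.items()}

random.seed(0)  # reproducible placeholder values
payload = make_val_payload({
    "recommendation_diversity": random.random() * 10,
    "user_engagement_score": random.random() * 5,
})
print(sorted(payload))  # ['val/recommendation_diversity', 'val/user_engagement_score']
# In a real run you would pass this to wandb.log(payload)
```

The same helper works for "train/", "test/", or any other split, keeping chart sections tidy as the number of metrics grows.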

A common pattern is to log metrics derived from specific parts of your data or model. For example, if you’re working with imbalanced datasets, you might log precision and recall for each class separately, or log the F1-score for the minority class. This gives you granular insight into how your model is performing on critical subsets of your data, which can be easily missed by looking at overall accuracy alone.
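Here is a minimal sketch of that pattern on a toy imbalanced dataset. The precision/recall computation is written out by hand for clarity (in practice you would likely use scikit-learn), and `per_class_precision_recall` is a hypothetical helper name.

```python
def per_class_precision_recall(y_true, y_pred, cls):
    """Compute precision and recall for a single class from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy imbalanced labels: class 1 is the minority class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

metrics = {}
for cls in (0, 1):
    p, r = per_class_precision_recall(y_true, y_pred, cls)
    metrics[f"precision/class_{cls}"] = p
    metrics[f"recall/class_{cls}"] = r

# wandb.log(metrics)  # each key becomes its own chart in the W&B UI
print(metrics)
```

Overall accuracy here is 6/8 = 75%, which looks fine, yet precision and recall on the minority class are both only 0.5; logging per-class keys surfaces exactly this kind of gap.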

The next concept you’ll likely explore is logging more complex data types like histograms, images, or even audio, each offering unique ways to inspect your model’s behavior.

Want structured learning?

Take the full Wandb course →