The most surprising thing about W&B evaluation metrics is that they aren’t just for displaying results; they’re a powerful, programmable way to guide your training process itself.
Let’s see this in action. Imagine you’re training a model to detect rare anomalies in manufacturing. Your standard accuracy metric might be high, but it’s useless if you miss the few critical anomalies. You need a metric that heavily penalizes false negatives.
Here’s how you’d define a custom scoring function in W&B. This function will calculate a weighted F1-score, giving higher importance to the "anomaly" class.
import wandb
from sklearn.metrics import f1_score
import numpy as np
# Assume you have true labels (y_true) and predicted labels (y_pred)
# For demonstration, let's create some dummy data
y_true = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0]) # 0: normal, 1: anomaly
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
# Define your custom scoring function
def weighted_f1_anomaly(y_true, y_pred):
    # Assign a higher weight to the anomaly class (class 1).
    # This is a simple example; weights can be more sophisticated.
    class_weights = {0: 1, 1: 5}
    # Calculate the F1 score for each class
    f1_per_class = f1_score(y_true, y_pred, average=None)
    # Combine into a single weighted F1 score
    weighted_f1 = sum(
        f1_per_class[i] * class_weights.get(i, 1) for i in range(len(f1_per_class))
    ) / sum(class_weights.values())
    return weighted_f1
# Log the metric
wandb.init(project="custom-metrics-demo")
wandb.log({"weighted_anomaly_f1": weighted_f1_anomaly(y_true, y_pred)})
wandb.finish()
When you run this, W&B captures the weighted_anomaly_f1 value. But the real power comes when you compute and log this metric inside your training loop, once per epoch or evaluation step, and W&B can even use it as the optimization goal for hyperparameter sweeps.
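A minimal sketch of that per-epoch pattern, reusing the weighted_f1_anomaly idea from above. The validation labels and predictions here are placeholders standing in for your model's real output, and the wandb.log call is shown commented so the loop structure is clear on its own:

```python
import numpy as np
from sklearn.metrics import f1_score

def weighted_f1_anomaly(y_true, y_pred, class_weights=None):
    """Weighted average of per-class F1, up-weighting the anomaly class."""
    class_weights = class_weights or {0: 1, 1: 5}
    f1_per_class = f1_score(y_true, y_pred, average=None)
    return sum(f1_per_class[i] * class_weights.get(i, 1)
               for i in range(len(f1_per_class))) / sum(class_weights.values())

for epoch in range(3):
    # ... train one epoch, then predict on the validation set ...
    y_val_true = np.array([0, 1, 0, 0, 1])   # placeholder labels
    y_val_pred = np.array([0, 1, 0, 1, 1])   # placeholder predictions
    score = weighted_f1_anomaly(y_val_true, y_val_pred)
    # wandb.log({"epoch": epoch, "weighted_anomaly_f1": score})
```

Logging the metric every epoch (rather than once at the end) gives you a full learning curve for your custom definition of success, which is what sweeps and comparisons operate on.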
The mental model here is that W&B evaluation metrics are not a passive reporting tool. They are an active component of your ML workflow. You define what "good" looks like, not just in terms of loss, but in terms of impactful business outcomes.
Internally, W&B treats these custom metrics like any other logged metric. They are associated with a specific run, appear in the UI, and can be used for filtering, comparison, and optimization. The wandb.log function is the bridge; it takes your Python function’s output and attaches it to the current W&B run.
The levers you control are the Python code of your scoring function and how you pass its output to wandb.log. You can use any Python library (NumPy, SciPy, scikit-learn, etc.) to compute arbitrary metrics. You can create metrics that consider class imbalance, specific error types, or even business-specific KPIs.
The key to using custom metrics effectively is understanding that they can be calculated on any set of predictions and ground truth. This means you can log them not just on your validation set, but also on specific subsets of your data that are particularly important (e.g., a subset representing high-value customers or critical failure modes). You can also use them to compare different model architectures or pre-processing steps side-by-side within the same experiment.
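To make the subset idea concrete, here is a small sketch using the dummy data from earlier. The `critical` mask is a hypothetical flag marking samples from an important slice of your data; slicing both arrays with the same mask lets you log the same metric for the whole set and for the slice side by side:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
# Hypothetical boolean mask marking samples from a critical production line
critical = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0], dtype=bool)

overall_f1 = f1_score(y_true, y_pred)                       # anomaly-class F1, all samples
critical_f1 = f1_score(y_true[critical], y_pred[critical])  # same metric, critical slice only
# wandb.log({"anomaly_f1/all": overall_f1, "anomaly_f1/critical": critical_f1})
```

Logging both under a shared prefix (e.g. `anomaly_f1/...`) keeps the slice metrics grouped together in the W&B UI.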
Most people think of W&B metrics as just plotting final results. The nuance is that you can also use them to signal to W&B’s hyperparameter optimization (sweeps) system which runs are better. By naming your custom metric in the metric field of a sweep configuration, with a goal of maximize or minimize, you tell W&B to actively search for configurations that optimize your specific definition of success, rather than just a generic loss.
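A sketch of what such a sweep configuration could look like. The search method, parameter names, and ranges here are illustrative assumptions, not from the original; the key point is the metric entry, whose name must match what you pass to wandb.log:

```python
# Illustrative sweep configuration; parameter names and ranges are assumptions.
sweep_config = {
    "method": "bayes",
    "metric": {
        "name": "weighted_anomaly_f1",  # must match the key logged via wandb.log
        "goal": "maximize",
    },
    "parameters": {
        "learning_rate": {"min": 0.0001, "max": 0.1},
        "anomaly_class_weight": {"values": [2, 5, 10]},
    },
}
# sweep_id = wandb.sweep(sweep_config, project="custom-metrics-demo")
# wandb.agent(sweep_id, function=train)  # `train` is your training function
```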
Once you’ve mastered custom scoring functions, the next step is to explore how these metrics can be used to trigger automated actions, like early stopping based on a custom metric plateauing or even triggering model retraining when a specific performance threshold is breached.
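As a taste of that, here is one possible plateau check over a custom metric’s history; this is a hand-rolled sketch (the threshold and patience values are arbitrary), separate from any early-termination features W&B sweeps provide:

```python
def plateaued(history, patience=3, min_delta=1e-3):
    """True if the best score in the last `patience` epochs failed to
    beat the previous best by at least `min_delta`."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    return max(history[-patience:]) < best_before + min_delta

# Hypothetical per-epoch values of the custom metric
scores = [0.60, 0.72, 0.74, 0.7401, 0.7402, 0.7400]
print(plateaued(scores))  # the last three epochs improve by less than min_delta
```

In a real loop you would append each epoch’s logged metric to `history` and break out of training once `plateaued` returns True.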