Weights & Biases (W&B) doesn’t just track your training runs; it can keep an eye on your production models too, flagging when they start to drift or perform poorly.

Let’s watch a model’s predictions change over time. Imagine we have a model predicting housing prices. We’ve trained it and deployed it. Now, we want to see how its predictions hold up against actual sales prices as new data comes in.

Here’s a simple Python script using wandb.log to send production predictions and actuals. We’ll simulate receiving new data over a few "days."

import wandb
import random
import time

# Initialize a W&B run (in practice this would be a long-lived process or service)
run = wandb.init(project="production-monitoring-demo", job_type="production-inference")

print("Logging production inference data...")

for day in range(3):
    print(f"--- Day {day + 1} ---")
    for i in range(10):  # Simulate 10 inferences per day
        # Simulate a new data point
        sq_ft = random.randint(800, 3500)
        num_bedrooms = random.randint(1, 5)

        # Simulate model inference (this would be your actual model)
        # Note: at $150/sq ft vs. the market's $160, this model systematically under-predicts
        predicted_price = (sq_ft * 150 + num_bedrooms * 10000) * random.uniform(0.9, 1.1)

        # Simulate the actual sale price, with noise and a small upward drift each day
        actual_price = (sq_ft * (160 + day * 5) + num_bedrooms * 12000) * random.uniform(0.95, 1.05)

        # Log the prediction and the actual value to W&B
        run.log({
            "inference_id": f"infer_{day}_{i}",
            "features/sq_ft": sq_ft,
            "features/num_bedrooms": num_bedrooms,
            "prediction/price": predicted_price,
            "actual/price": actual_price,
            "inference_latency_ms": random.randint(10, 100)  # Example operational metric
        })
        time.sleep(0.1)  # Simulate inference time

    # Simulate a day passing before new data arrives
    time.sleep(1)

print("Finished logging production data. Check your W&B project dashboard.")
run.finish()

This script initializes a W&B run dedicated to production inference. Inside a loop, it simulates receiving new data points, generates a prediction with placeholder model logic, and logs both the prediction and the eventual actual value (the ground truth) to W&B. Crucially, each record carries an inference_id alongside the features, the prediction, and the actual/price, which lets you join predictions with their corresponding outcomes later.
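In a real system, ground truth usually arrives long after the prediction, so that join happens later. As a minimal illustration of why the inference_id matters, here is the pairing step in pure Python (no wandb calls; the helper name is hypothetical):

```python
def join_predictions_with_actuals(predictions, actuals):
    """Match later-arriving ground truth to predictions by inference_id.

    predictions: list of dicts with "inference_id" and "prediction/price"
    actuals:     dict mapping inference_id -> actual sale price
    Returns only the rows that have both values, with the signed error.
    """
    joined = []
    for p in predictions:
        actual = actuals.get(p["inference_id"])
        if actual is None:
            continue  # ground truth not available yet; revisit on the next pass
        joined.append({
            "inference_id": p["inference_id"],
            "prediction/price": p["prediction/price"],
            "actual/price": actual,
            "error": p["prediction/price"] - actual,
        })
    return joined


predictions = [
    {"inference_id": "infer_0_0", "prediction/price": 310_000.0},
    {"inference_id": "infer_0_1", "prediction/price": 455_000.0},
]
actuals = {"infer_0_0": 320_000.0}  # infer_0_1's sale hasn't closed yet

rows = join_predictions_with_actuals(predictions, actuals)
print(rows)  # one joined row, with error -10000.0
```

Without a stable per-inference key, predictions and outcomes arriving on different schedules cannot be reconciled, which is why it is logged alongside every record.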

The core problem W&B production monitoring solves is detecting model degradation in live systems. Models trained on historical data can become stale as the real-world data distribution shifts. This is often called "data drift" or "concept drift." Without active monitoring, you might not realize your model’s performance is silently decaying until it causes significant issues. W&B provides tools to visualize and quantify this drift by comparing production inference data against a baseline or against itself over time.
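To make "the data distribution shifts" concrete, one standard drift statistic you can compute from logged feature values is the Population Stability Index (PSI), which compares binned frequencies between a baseline sample and a production window. This is a generic sketch, not a built-in W&B function:

```python
import math

def psi(baseline, production, bins=10):
    """Population Stability Index between two samples of a numeric feature.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    lo = min(min(baseline), min(production))
    hi = max(max(baseline), max(production))
    width = (hi - lo) / bins or 1.0  # guard against zero-width range

    def frequencies(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) for empty bins
        return [max(c / len(sample), 1e-6) for c in counts]

    b, p = frequencies(baseline), frequencies(production)
    return sum((pi - bi) * math.log(pi / bi) for bi, pi in zip(b, p))


baseline = [1000 + 10 * i for i in range(200)]  # sq_ft seen at training time
shifted = [1500 + 10 * i for i in range(200)]   # production skews toward larger homes
print(f"PSI (no shift): {psi(baseline, baseline):.4f}")
print(f"PSI (shifted):  {psi(baseline, shifted):.4f}")
```

Computing a statistic like this per feature, per time window, over the logged production data is one way to turn "silent decay" into a number you can chart and threshold.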

Under the hood, each call to wandb.log appends a row to the run's history, a time-series store you can chart and query in the UI. For richer tabular analysis, you log a wandb.Table explicitly: each row holds the input features, the model's prediction for those features, and, once known, the true outcome. You can then filter and analyze these tables directly in the W&B UI or programmatically through the API. Because every row carries an inference_id, you can log ground truth later in a separate table and join it back to past predictions, effectively backfilling the actual/price for inferences made earlier.

The key levers you control are:

  1. What you log: You must log enough information to identify individual inferences (inference_id), the input features, the model’s prediction, and importantly, the ground truth when it becomes available.
  2. Logging frequency: How often do you send new inference data? Real-time logging provides immediate insights, while batch logging might be more efficient.
  3. Baseline definition: You can log a "golden" dataset (e.g., validation set from training) as a baseline. W&B then compares production data against this baseline to detect distribution shifts.
  4. Metric definition: You define the performance metrics (e.g., MAE, RMSE, accuracy, F1-score) you want to track. W&B can compute these on demand using the logged predictions and ground truth.
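For that last lever, the common regression metrics are straightforward to compute once predictions and actuals have been paired. A self-contained sketch:

```python
import math

def mae(preds, actuals):
    """Mean absolute error over paired predictions and ground truth."""
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(preds)

def rmse(preds, actuals):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds))


preds = [300_000.0, 450_000.0, 275_000.0]
actuals = [320_000.0, 440_000.0, 275_000.0]
print(f"MAE:  {mae(preds, actuals):,.0f}")   # 10,000
print(f"RMSE: {rmse(preds, actuals):,.0f}")
```

Logging these as their own metrics (e.g., run.log({"prod/mae": ...}) per window) gives you a performance time series alongside the raw predictions.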

One of the most powerful, yet often overlooked, aspects of W&B’s production monitoring is its ability to correlate performance metrics with specific feature values or data segments. Instead of just seeing that overall accuracy dropped, you can drill down. For instance, you might discover that your model’s performance plummets only for houses with more than 4 bedrooms in a specific zip code, or when the predicted price is above $1 million. This granular insight is critical for debugging and targeted retraining. You achieve this by logging rich feature data and then using W&B’s filtering and segmentation capabilities on the logged Tables to analyze performance slices.
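The slicing idea can be sketched in plain Python: group the joined rows by a feature value and compute the error metric per segment (the helper name is illustrative):

```python
from collections import defaultdict

def mae_by_segment(rows, segment_key):
    """Mean absolute error per data segment (e.g., per bedroom count)."""
    errors = defaultdict(list)
    for r in rows:
        errors[r[segment_key]].append(abs(r["predicted"] - r["actual"]))
    return {seg: sum(errs) / len(errs) for seg, errs in errors.items()}


rows = [
    {"num_bedrooms": 2, "predicted": 250_000.0, "actual": 255_000.0},
    {"num_bedrooms": 2, "predicted": 260_000.0, "actual": 257_000.0},
    {"num_bedrooms": 5, "predicted": 600_000.0, "actual": 700_000.0},  # large miss
]
print(mae_by_segment(rows, "num_bedrooms"))
```

An aggregate MAE would average the 5-bedroom miss away; the per-segment view surfaces exactly which slice needs targeted retraining.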

The next step after setting up basic performance monitoring is to configure automated alerts for significant performance drops or data drift.
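The client-side trigger for such an alert can be as simple as comparing a rolling production metric against the training-time baseline. A sketch under assumed names and thresholds; the commented call shows where W&B's run.alert API would fit:

```python
def should_alert(recent_mae, baseline_mae, tolerance=0.25):
    """Flag when rolling production MAE exceeds the baseline by the tolerance."""
    return recent_mae > baseline_mae * (1 + tolerance)


baseline_mae = 12_000.0  # MAE measured on the validation set at training time
recent_mae = 16_500.0    # rolling MAE over the latest window of joined rows

if should_alert(recent_mae, baseline_mae):
    # In a live run, you could notify your team through W&B's alert API, e.g.:
    # run.alert(title="Production MAE regression",
    #           text=f"Rolling MAE {recent_mae:,.0f} vs baseline {baseline_mae:,.0f}")
    print("ALERT: rolling MAE exceeds baseline tolerance")
```

The 25% tolerance here is an arbitrary starting point; tune it against how noisy your rolling window is to balance sensitivity against alert fatigue.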

Want structured learning?

Take the full Wandb course →