The most surprising thing about logging reward model runs in W&B is that you’re not just logging a single number; you’re logging a complex decision-making process that influences the entire downstream fine-tuning.

Imagine you’re training a chatbot to be helpful. You feed it a bunch of prompts, and for each prompt, you have a human judge rate the chatbot’s responses. The reward model (RM) learns to predict these human ratings. When you log this RM run in W&B, you’re not just seeing if the RM’s predictions get closer to the human ratings; you’re seeing how well the RM will eventually guide your chatbot to produce better responses.

Here’s a simulated W&B log for a reward model training run. We’ll use a simplified scenario where we’re training an RM to predict human preferences between two chatbot responses (response_A and response_B) to a given prompt.

import wandb
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Initialize W&B
wandb.init(project="rlhf-reward-model-logging", job_type="reward-model-training")

# --- Simulate Reward Model Training ---

# Load a pre-trained model for sequence classification
# In a real scenario, this would be a model specifically fine-tuned for RM tasks
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1) # Output a single score

# Simulate training data: (prompt, response_a, response_b, preferred_response_label)
# label 1 means response_a is preferred, label 0 means response_b is preferred
training_data = [
    ("What is the capital of France?", "Paris is the capital.", "The capital is London.", 1),
    ("Tell me a joke.", "Why did the scarecrow win an award? Because he was outstanding in his field!", "Jokes are hard. I don't know.", 1),
    ("Explain quantum physics simply.", "Quantum physics is complex. It deals with the smallest particles.", "It's about tiny things that do weird stuff.", 0),
    ("How to bake a cake?", "Preheat oven to 350F. Mix ingredients.", "Just put flour and eggs in a pan.", 1),
    ("Write a poem about the sea.", "The ocean blue, a vast expanse,\nWaves crash and roar, a rhythmic dance.", "Sea is wet. It has water.", 1),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Loss: a pairwise margin loss, computed inline in the training loop below.

# Simulate training loop
num_epochs = 3
for epoch in range(num_epochs):
    total_loss = 0
    correct_predictions = 0
    total_samples = 0

    for prompt, response_a, response_b, preferred_label in training_data:
        # The RM takes a (prompt, response) pair and outputs a single scalar
        # score; we score each candidate response and then compare the scores.

        # Simulate scoring response_a
        inputs_a = tokenizer(f"{prompt} {response_a}", return_tensors="pt", truncation=True, padding=True)
        outputs_a = model(**inputs_a)
        score_a = outputs_a.logits.squeeze()

        # Simulate scoring response_b
        inputs_b = tokenizer(f"{prompt} {response_b}", return_tensors="pt", truncation=True, padding=True)
        outputs_b = model(**inputs_b)
        score_b = outputs_b.logits.squeeze()

        # Determine which response is predicted as preferred based on scores
        predicted_preferred_label = 1 if score_a > score_b else 0

        # Pairwise margin loss: penalize the model unless the preferred
        # response's score exceeds the other response's score by a margin of 0.5.
        # If label is 1 (response_a preferred), we want score_a - score_b >= 0.5;
        # if label is 0 (response_b preferred), we want score_b - score_a >= 0.5.
        if preferred_label == 1:  # response_a is preferred
            loss = torch.relu(0.5 - (score_a - score_b))
        else:  # response_b is preferred
            loss = torch.relu(0.5 - (score_b - score_a))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        # Track accuracy for logging
        if predicted_preferred_label == preferred_label:
            correct_predictions += 1
        total_samples += 1

    avg_loss = total_loss / len(training_data)
    accuracy = correct_predictions / total_samples

    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}")

    # Log metrics to W&B
    wandb.log({
        "epoch": epoch,
        "train_loss": avg_loss,
        "train_accuracy": accuracy,
        "learning_rate": optimizer.param_groups[0]['lr']
    })

    # --- Simulate logging specific RM run details ---
    # This is crucial for understanding *why* the RM is behaving a certain way.
    # You might log:
    # 1. A sample of predictions vs. ground truth
    # 2. The distribution of predicted scores for preferred vs. non-preferred responses
    # 3. Specific examples of prompts where the RM gets it wrong

    # Log a sample of predictions (no gradient tracking needed for evaluation)
    sample_predictions = []
    with torch.no_grad():
        for prompt, response_a, response_b, preferred_label in training_data[:5]:  # Log first 5
            inputs_a = tokenizer(f"{prompt} {response_a}", return_tensors="pt", truncation=True, padding=True)
            score_a = model(**inputs_a).logits.squeeze().item()
            inputs_b = tokenizer(f"{prompt} {response_b}", return_tensors="pt", truncation=True, padding=True)
            score_b = model(**inputs_b).logits.squeeze().item()

            predicted_preferred_label = 1 if score_a > score_b else 0
            sample_predictions.append({
                "prompt": prompt,
                "response_a": response_a,
                "response_b": response_b,
                "ground_truth_preferred": "A" if preferred_label == 1 else "B",
                "predicted_preferred": "A" if predicted_preferred_label == 1 else "B",
                "score_a": score_a,
                "score_b": score_b,
                "correct": predicted_preferred_label == preferred_label
            })
    wandb.log({"sample_predictions": wandb.Table(dataframe=pd.DataFrame(sample_predictions))})

    # Log score distributions (simplified - in reality, you'd run on a held-out set)
    all_scores_a = []
    all_scores_b = []
    with torch.no_grad():
        for prompt, response_a, response_b, preferred_label in training_data:
            inputs_a = tokenizer(f"{prompt} {response_a}", return_tensors="pt", truncation=True, padding=True)
            all_scores_a.append(model(**inputs_a).logits.squeeze().item())
            inputs_b = tokenizer(f"{prompt} {response_b}", return_tensors="pt", truncation=True, padding=True)
            all_scores_b.append(model(**inputs_b).logits.squeeze().item())

    # Plotting score distributions requires matplotlib and wandb.Image
    # For simplicity, let's just log the mean scores
    wandb.log({
        "mean_score_response_a": sum(all_scores_a) / len(all_scores_a),
        "mean_score_response_b": sum(all_scores_b) / len(all_scores_b),
    })

wandb.finish()

What’s happening here is that the model is learning to assign a higher score to the response that a human (or our simulated preference label) deemed better. The pairwise margin loss is the engine of this learning: it penalizes the model whenever the preferred response’s score fails to beat the other score by the margin, and the optimizer adjusts the model’s weights to close that gap.
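To make the margin loss concrete, here is a tiny standalone sketch (plain Python, no model needed) of how the penalty behaves for different score gaps, using the same 0.5 margin as the training loop above:

```python
def margin_loss(score_pref, score_other, margin=0.5):
    """Hinge-style pairwise loss: zero once the preferred response's
    score exceeds the other response's score by at least `margin`."""
    return max(0.0, margin - (score_pref - score_other))

# Preferred response scored well below the other: large penalty
print(margin_loss(0.0, 1.0))  # 1.5
# Scores tied: penalty equals the margin
print(margin_loss(0.5, 0.5))  # 0.5
# Gap exactly at the margin: penalty reaches zero
print(margin_loss(1.0, 0.5))  # 0.0
# Gap beyond the margin: no gradient pressure at all
print(margin_loss(2.0, 0.5))  # 0.0
```

Note the flat region past the margin: once the gap is "good enough," this loss stops pushing, which is exactly the behavior the margin hyperparameter controls.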

Crucially, in W&B, you’re not just looking at train_loss and train_accuracy. You’re examining sample_predictions to see where the model is making mistakes. Are there certain prompts it struggles with? Does it consistently prefer shorter answers, or longer ones? The mean_score_response_a and mean_score_response_b (which would ideally be calculated on a validation set) give you a sense of the overall score distribution the RM is learning to produce.
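Those questions can also be asked programmatically. A minimal sketch, assuming rows shaped like the sample_predictions dicts above (the data here is fabricated for illustration):

```python
import pandas as pd

# Hypothetical rows shaped like the sample_predictions entries logged above
df = pd.DataFrame([
    {"prompt": "Tell me a joke.", "response_a": "A long, detailed joke...",
     "response_b": "No.", "score_a": 1.2, "score_b": -0.3, "correct": True},
    {"prompt": "Explain quantum physics simply.", "response_a": "Quantum physics is complex.",
     "response_b": "It's weird.", "score_a": 0.9, "score_b": 0.4, "correct": False},
])

# Which prompts does the RM get wrong?
mistakes = df[~df["correct"]]
print(mistakes["prompt"].tolist())  # → ['Explain quantum physics simply.']

# Crude length-bias check: does the higher-scored response tend to be longer?
df["picked_longer"] = (
    (df["score_a"] > df["score_b"]) == (df["response_a"].str.len() > df["response_b"].str.len())
)
print(df["picked_longer"].mean())  # fraction of pairs where the RM favored the longer answer
```

A `picked_longer` rate near 1.0 on a larger sample would be a red flag that the RM is rewarding verbosity rather than quality.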

The real power comes when you analyze these W&B logs during training. You might see the accuracy plateau. By looking at the sample_predictions table, you might notice the RM is confused by nuanced comparisons or sarcastic responses. This insight tells you that your training data might need more examples of these tricky cases, or that your RM architecture needs to be more sophisticated.
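Spotting that plateau doesn’t have to be eyeballing charts. Here is a rough sketch of a plateau check you could run on an accuracy series pulled from a run’s history (for example via `wandb.Api().run(...).history()`; the series below is fabricated):

```python
def has_plateaued(values, window=3, min_delta=0.01):
    """Return True if the metric improved by less than `min_delta`
    over the last `window` logged steps."""
    if len(values) < window + 1:
        return False
    return values[-1] - values[-1 - window] < min_delta

# Hypothetical per-epoch train_accuracy values
accuracies = [0.52, 0.61, 0.70, 0.74, 0.745, 0.746, 0.747]
print(has_plateaued(accuracies))      # True: ~0.007 gain over the last 3 steps
print(has_plateaued(accuracies[:4]))  # False: still climbing
```

The `window` and `min_delta` values are arbitrary illustrations; tune them to your logging frequency and the noise level of your metric.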

The mental model you build is one of a sensitive judge. The RM’s job is to perfectly mimic human judgment, and its scores are its "opinions." When you log, you’re essentially asking the RM to show you its homework: "Show me the responses you ranked, how you ranked them, and why you think that." This allows you to debug not just the training process, but the very reasoning the RM is developing.

The levers you control are primarily:

  • Data Quality & Diversity: The RM is only as good as the preference data it’s trained on. W&B helps you see if the RM is learning biases from this data.
  • Model Architecture: Different RM architectures (e.g., using different backbone models, or specific heads for pairwise comparison) will learn different decision boundaries.
  • Hyperparameters: Learning rate, batch size, and the specific loss function (e.g., margin loss vs. cross-entropy on score differences) directly impact how the RM learns to rank responses.
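On the loss-function lever specifically: the margin loss used above is only one option. A common alternative is the Bradley–Terry style cross-entropy on the score difference, `-log(sigmoid(score_pref - score_other))`, which never fully saturates and so keeps nudging the score gap wider. A minimal plain-Python sketch for comparison:

```python
import math

def bt_loss(score_pref, score_other):
    """Bradley-Terry pairwise loss: -log P(preferred wins),
    where P = sigmoid(score_pref - score_other)."""
    diff = score_pref - score_other
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

def margin_loss(score_pref, score_other, margin=0.5):
    """The hinge-style margin loss used in the training loop above."""
    return max(0.0, margin - (score_pref - score_other))

# With a score gap of 1.0 the margin loss is already zero, but the
# Bradley-Terry loss still provides a small gradient signal.
print(margin_loss(1.5, 0.5))          # 0.0
print(round(bt_loss(1.5, 0.5), 4))    # 0.3133
```

Which behavior you want (stop pushing past a margin vs. keep widening the gap) is a genuine design choice, and it changes the score distributions you’ll see in your W&B logs.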

The one thing most people don’t know is that the magnitude of the scores produced by the reward model, not just their relative order, can be surprisingly important. While the RM is trained to predict which response is better, the absolute difference in scores can sometimes reveal how "confident" the RM is in its judgment. A large score difference might indicate strong confidence, while a small difference suggests the RM found both responses similarly good or bad. This confidence score can be leveraged in more advanced RLHF techniques, like using it to filter out low-confidence preferences or to adjust the exploration/exploitation balance during the policy training phase.
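One hedged sketch of how that confidence signal could be used: map the score gap to a win probability with a sigmoid, then drop preference pairs where the RM is close to indifferent. The 0.6 threshold below is an arbitrary illustration, not a recommendation:

```python
import math

def preference_confidence(score_pref, score_other):
    """Map the RM score gap to P(preferred wins) via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(score_pref - score_other)))

# Hypothetical (score_pref, score_other) pairs from an RM
pairs = [(2.1, -0.4), (0.3, 0.2), (1.0, 0.9), (3.0, 0.5)]

CONF_THRESHOLD = 0.6  # arbitrary cutoff for this illustration
confident = [p for p in pairs if preference_confidence(*p) >= CONF_THRESHOLD]
print(len(confident))  # 2 -- the near-tied pairs (0.3 vs 0.2, 1.0 vs 0.9) are dropped
```

Logging this confidence distribution alongside accuracy gives you a second axis for debugging: high accuracy with uniformly low confidence tells a very different story than high accuracy with a sharply bimodal score gap.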

After you’ve successfully logged and iterated on your reward model training, the next major hurdle is effectively using that reward model to fine-tune your policy model (e.g., your chatbot).

Want structured learning?

Take the full Wandb course →