Prompts in Weights & Biases aren’t just about collecting text inputs; they’re a fundamental mechanism for understanding and debugging the decision-making process of Large Language Models.

Let’s see this in action. Imagine you’re fine-tuning a model to summarize news articles. You’d want to track not just the final summary, but also the specific instructions you gave the model, any intermediate thoughts it might have generated, and the final output.

import wandb

# Initialize a W&B run
run = wandb.init(project="llm-evaluation-demo")

# Define your prompt structure
prompt_template = """
Analyze the following news article and provide a concise summary.

Article:
{article_text}

Instructions:
- Focus on the main event and key actors.
- Keep the summary under 50 words.
- Avoid jargon.

Summary:
"""

# Your article text
article = """
The global semiconductor shortage, which began in late 2020, has continued to impact various industries, most notably the automotive and consumer electronics sectors. Supply chain disruptions, coupled with a surge in demand for digital devices during the pandemic, created a perfect storm. Major chip manufacturers are investing billions in expanding production capacity, but these efforts will take several years to fully materialize. Analysts predict that the scarcity could persist into 2023, affecting product availability and prices.
"""

# Construct the prompt with the article
full_prompt = prompt_template.format(article_text=article)

# Simulate LLM inference with a placeholder response.
# In a real scenario, this would be your model.generate(full_prompt) call.
model_response = {
    "generated_text": "The global semiconductor shortage, starting late 2020, is impacting auto and electronics industries due to supply chain issues and pandemic-driven demand. Chip manufacturers are expanding production, but relief may not come until 2023, affecting product availability and prices."
}

# Log the prompt and the response using wandb.log
wandb.log({
    "prompt": wandb.Html(f"<pre>{full_prompt}</pre>"), # Log prompt as HTML for better display
    "response": wandb.Html(f"<pre>{model_response['generated_text']}</pre>")
})

# If your LLM provides intermediate steps or reasoning, you can log them too.
# For example, if your model has a "thought" process:
intermediate_thought = "Extracted key entities: semiconductor shortage, automotive, consumer electronics, chip manufacturers. Identified key impacts: supply chain, demand surge, production expansion, price/availability. Timeframe: late 2020, into 2023. Constraint: under 50 words. Drafted summary focusing on these points."

wandb.log({
    "intermediate_thought": wandb.Html(f"<pre>{intermediate_thought}</pre>")
})

# End the W&B run
run.finish()

This code snippet demonstrates how you can log the exact prompt sent to an LLM, along with its response and any intermediate reasoning. The wandb.Html wrapper ensures that your prompts and responses are rendered nicely within the W&B UI, preserving formatting and making them easy to read.

The core problem W&B Prompts solve is the "black box" nature of LLMs. When a model produces an undesirable output – perhaps an inaccurate summary, a nonsensical answer, or a biased statement – you need to understand why. Was it the input data? Was it the way the prompt was phrased? Was it a flaw in the model’s internal reasoning process? Prompts, when logged comprehensively, provide that crucial traceability.

Internally, W&B treats logged prompts and their associated outputs as distinct data points within a run. You can log various types of prompts, not just the final input. This includes:

  • System Prompts: The overarching instructions that set the LLM’s persona or task.
  • User Prompts: The specific query or data provided by the user.
  • Few-Shot Examples: Any examples included in the prompt to guide the model.
  • Intermediate Outputs: "Chain-of-thought" steps, scratchpad work, or any generated text before the final answer.
  • Final Response: The ultimate output from the LLM.
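
The components above can be gathered into a single record whose keys mirror those categories. As a sketch (the field names and sample strings are our own, not a W&B schema), logging each component under its own key keeps every piece filterable and comparable on its own:

```python
# One record per inference, with each prompt component under its own key
# so it can be inspected and compared independently in the W&B UI.
prompt_record = {
    "system_prompt": "You are a concise news summarizer.",
    "user_prompt": "Summarize the article below in under 50 words.",
    "few_shot_examples": ["Article: ... -> Summary: ..."],
    "intermediate_output": "Key entities: shortage, automotive, electronics.",
    "final_response": "Chip shortage hits autos and electronics; relief expected by 2023.",
}

# Inside an active run, this could be a single call, e.g.:
# wandb.log({k: wandb.Html(f"<pre>{v}</pre>")
#            for k, v in prompt_record.items() if isinstance(v, str)})
```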

By logging these components separately, you build a detailed audit trail. You can then use W&B’s features to:

  • Compare Prompts: See how slight variations in prompt wording affect model performance across different runs or datasets.
  • Analyze Intermediate Steps: Identify where the model goes off track during its reasoning process.
  • Debug Specific Failures: Reconstruct the exact prompt and context that led to a particular bad output.
  • Evaluate Against Metrics: Correlate prompt characteristics with downstream evaluation metrics (e.g., ROUGE scores for summarization, accuracy for classification).
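
To make prompt comparison across runs concrete, one lightweight convention (a sketch; the `prompt_variant` config key and variant names are our own, not a W&B requirement) is to record which wording variant a run used so runs can later be grouped and filtered by it:

```python
# Two wording variants of the same instruction. Recording which one a run
# used makes cross-run comparison straightforward in the W&B UI.
PROMPT_VARIANTS = {
    "v1_strict": "Keep the summary under 50 words. Avoid jargon.",
    "v2_loose": "Summarize briefly in plain language.",
}

variant = "v1_strict"
# run = wandb.init(project="llm-evaluation-demo",
#                  config={"prompt_variant": variant})
# ...run inference with PROMPT_VARIANTS[variant], log metrics, run.finish()
```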

The real power comes when you integrate this with W&B’s evaluation tools. You can define custom metrics that analyze the logged prompts and responses. For instance, you might create a metric that checks if the summary length constraint was met, or if specific keywords from the article were included in the summary.
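
As a sketch of those two checks (the helper names here are our own, not a W&B API), both can be computed in plain Python per example and logged as metrics next to the prompt/response pair:

```python
def within_word_limit(summary: str, max_words: int = 50) -> bool:
    """Check the 'under 50 words' instruction from the prompt."""
    return len(summary.split()) < max_words

def keyword_coverage(summary: str, keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the summary."""
    text = summary.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0

summary = ("The global semiconductor shortage, starting late 2020, is impacting "
           "auto and electronics industries due to supply chain issues and "
           "pandemic-driven demand.")
keywords = ["semiconductor", "supply chain", "2020"]

length_ok = within_word_limit(summary)
coverage = keyword_coverage(summary, keywords)

# Inside a run, these would be logged alongside the prompt and response:
# wandb.log({"length_ok": length_ok, "keyword_coverage": coverage})
```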

One aspect often overlooked is the ability to log prompts and responses not just as plain text, but as structured data or rich media. For example, if your LLM is processing code, you can log the code snippets with syntax highlighting. If it’s generating tables, you can log them as actual HTML tables. This makes the logged data far more interpretable and actionable than simple text dumps.
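
For instance, a generated table can be rendered as an HTML string before being wrapped in wandb.Html. This is a minimal standard-library sketch (the helper name is our own):

```python
import html

def rows_to_html_table(headers, rows):
    """Render header and data rows as an HTML table, escaping cell contents."""
    head = "".join(f"<th>{html.escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

table_html = rows_to_html_table(
    ["metric", "value"],
    [("length_ok", True), ("keyword_coverage", 1.0)],
)
# wandb.log({"summary_metrics": wandb.Html(table_html)})
```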

The next step after mastering prompt logging is to explore how to programmatically generate and test prompt variations to find optimal configurations.
