Weave isn’t just another logging tool; it’s a system designed to give you deep visibility into how your LLM applications actually behave in production, and critically, how to systematically improve them.

Let’s see it in action. Imagine you’ve got a simple RAG (Retrieval Augmented Generation) pipeline. You query it, and it fetches documents, then synthesizes an answer.

Here’s a snippet of how you might instrument it with Weave:

import weave
from datasets import load_dataset
from rag_pipeline import RAGPipeline  # Your RAG implementation

# Initialize Weave
weave.init("rag-observability-demo")

# Load your RAG pipeline
pipeline = RAGPipeline()

# Load a dataset for evaluation
dataset = load_dataset("squad", split="validation[:10]")

# Decorate the entry point so Weave traces each call.
# Any nested @weave.op functions inside the pipeline (retrieval,
# prompt construction, generation) show up as child spans.
@weave.op()
def answer_question(question: str) -> str:
    return pipeline.run(question)

# Run the pipeline - Weave records inputs, outputs, and timing automatically
for example in dataset:
    answer = answer_question(example["question"])

When this runs, Weave captures a detailed trace of each pipeline.run() call. It’s not just logging the final answer. It’s recording:

  • The exact question passed in.
  • The sequence of calls within pipeline.run().
  • The documents fetched by the retriever.
  • The prompts sent to the LLM.
  • The LLM’s raw response.
  • The final synthesized answer.

This gives you a visual, step-by-step replay of what happened for every single request. You can see the retriever’s output, the prompt construction, and the LLM’s response, all tied to the original input.

The core problem Weave solves is the "black box" nature of LLM applications. You deploy a model, it gives an answer, but why? Was the retrieved context poor? Was the prompt ambiguous? Did the LLM hallucinate based on the context? Weave provides the telemetry to answer these questions. It turns LLM inference into a traceable, debuggable process.

Internally, Weave uses a distributed tracing mechanism, similar to what you might see in microservice observability. When a traced function is called (for example, one decorated with @weave.op()), Weave starts a new span. Any nested traced calls within that span become child spans. Each span records its start time, end time, inputs, outputs, and any errors. This hierarchical structure lets you reconstruct the entire execution flow.
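To make the span model concrete, here is a minimal, self-contained sketch of hierarchical tracing using Python's contextvars. The `Span` class and its fields are purely illustrative, not Weave's actual internals:

```python
import contextvars
import time

# Illustrative only: a toy span tree, not Weave's real implementation.
_current_span = contextvars.ContextVar("current_span", default=None)

class Span:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.inputs = None
        self.outputs = None
        self.start = None
        self.end = None

    def __enter__(self):
        self.start = time.time()
        parent = _current_span.get()
        if parent is not None:
            parent.children.append(self)  # nested spans become children
        self._token = _current_span.set(self)
        return self

    def __exit__(self, *exc):
        self.end = time.time()
        _current_span.reset(self._token)

# Nested `with` blocks reconstruct the execution hierarchy:
with Span("pipeline.run") as root:
    with Span("retriever") as retrieve:
        retrieve.outputs = ["doc1", "doc2"]
    with Span("llm_call") as llm:
        llm.inputs = {"prompt": "..."}

print([child.name for child in root.children])  # → ['retriever', 'llm_call']
```

The same idea scales up: in Weave, the decorator plays the role of the `with` block, so every nested op call lands in the right place in the tree without manual bookkeeping.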

The key levers you control are:

  1. Instrumentation: Deciding what to trace. You can trace entire functions, specific code blocks, or even individual LLM calls. The @weave.op() decorator is your primary tool here.
  2. Input/Output Logging: Weave captures each traced function’s arguments and return value automatically, making these values visible in the trace. This is crucial for understanding what went in and what came out at each stage.
  3. Intermediate Logging: Decorating intermediate steps as their own ops captures the nitty-gritty details such as retrieved documents, intermediate prompt states, and embeddings. This is invaluable for debugging complex pipelines.
  4. Evaluation & Analysis: Weave integrates with W&B’s experiment tracking and evaluation tools. You can define metrics (e.g., ROUGE scores, custom relevance checks) and run them across your traces to identify regressions or compare model versions.
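As a sketch of that last lever, here is how a simple custom metric could be run over captured trace records. The record format and the `exact_match` function are assumptions for illustration; in practice you would pull records from Weave and plug in real metrics like ROUGE or an LLM judge:

```python
# Hypothetical trace records; in practice these would come from Weave.
traces = [
    {"question": "Capital of France?", "answer": "Paris", "expected": "Paris"},
    {"question": "Capital of Spain?", "answer": "Barcelona", "expected": "Madrid"},
]

def exact_match(trace):
    # A deliberately simple metric; swap in ROUGE or a relevance check as needed.
    return float(trace["answer"].strip().lower() == trace["expected"].strip().lower())

scores = [exact_match(t) for t in traces]
accuracy = sum(scores) / len(scores)
print(f"exact-match accuracy: {accuracy:.2f}")  # → 0.50
```

Because every trace already carries its inputs and outputs, adding a new metric is a pure function over existing data; no re-running the pipeline required.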

Most people understand Weave as a debugger. But its real power lies in its ability to turn LLM inference into a reproducible, auditable event. This means you can not only debug a single bad response but also build systems that guarantee certain qualities in responses. For instance, you can create a trace that asserts the retrieved documents are relevant to the question, or that the final answer is factually consistent with the provided context. If any of those assertions fail during inference, the trace is flagged. This shifts LLM development from reactive debugging to proactive quality assurance.

The next step is to explore how to define and run custom evaluation metrics directly on these captured traces.

Want structured learning?

Take the full Wandb course →