W&B Artifacts are not just fancy file storage; they are a system for tracking the lineage of everything that goes into making a machine learning model, from the raw data to the trained weights.
Let’s see how this plays out in a real workflow. Imagine we’re training a sentiment analysis model.
import wandb
import pandas
# Start a new W&B run
run = wandb.init(project="sentiment-analysis", job_type="data-ingestion")
# Load our raw dataset
data = pandas.read_csv("raw_reviews.csv")
# Perform some preprocessing
processed_data = data.dropna().sample(frac=0.8) # Remove NaNs, take 80% sample
# Log the processed data as an artifact
# This creates a new version of the "processed-dataset" artifact
processed_data_artifact = wandb.Artifact(
    name="processed-dataset",
    type="dataset",
    description="Cleaned and sampled dataset for sentiment analysis training",
)
processed_data_artifact.add(
    wandb.Table(dataframe=processed_data),
    name="processed_data.table.json"  # the name under which the table is stored inside the artifact
)
run.log_artifact(processed_data_artifact)
# Finish the run
run.finish()
In this snippet, wandb.Artifact is the core object. We’re creating an artifact named processed-dataset of type dataset. When run.log_artifact() is called, W&B takes a snapshot of the processed_data table, serializes it, and stores it. Crucially, it assigns a unique version (like processed-dataset:v0) to this specific state of the data. If we run this code again after the data has changed, we get processed-dataset:v1, and so on; if the contents are byte-for-byte identical, W&B detects this via checksums and reuses the existing version rather than creating a duplicate.
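The version-reuse behavior can be sketched with a content hash. This is a deliberately simplified model of the idea, not W&B's implementation (which hashes each file in a manifest), but it shows why identical contents map back to an existing version:

```python
import hashlib

def artifact_version(store, name, content):
    """Return a version label for `content`, reusing an existing
    version when the bytes are identical (content-addressed dedupe)."""
    digest = hashlib.sha256(content).hexdigest()
    versions = store.setdefault(name, [])  # list of digests; index = version number
    if digest in versions:
        # Identical content: point back at the existing version
        return f"{name}:v{versions.index(digest)}"
    # New content: register a new version
    versions.append(digest)
    return f"{name}:v{len(versions) - 1}"

store = {}
v0 = artifact_version(store, "processed-dataset", b"rows-2024-01")
v0_again = artifact_version(store, "processed-dataset", b"rows-2024-01")  # same bytes
v1 = artifact_version(store, "processed-dataset", b"rows-2024-02")        # changed bytes
```

Here `v0` and `v0_again` resolve to the same label, while the changed content produces `v1`.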
Now, let’s say we use this artifact for training:
import wandb
import pandas
# Start a new W&B run for training
run = wandb.init(project="sentiment-analysis", job_type="training")
# Get the latest version of our processed dataset artifact
# By default, this pulls the latest available version
try:
    dataset_artifact = run.use_artifact(
        "processed-dataset:latest",  # Or specify a version like "processed-dataset:v0"
        type="dataset",
    )
except wandb.errors.CommError as e:
    print(f"Could not find artifact: {e}")
    raise SystemExit(1)
# Load the data from the artifact
dataset_table = dataset_artifact.get("processed_data.table.json")
processed_data = dataset_table.get_dataframe()
# (Model training code here using processed_data)
# ... Let's assume we get trained_model_weights
# Log the trained model as another artifact
model_artifact = wandb.Artifact(
    name="sentiment-model",
    type="model",
    description="A sentiment analysis model trained on processed data",
)
# Save the weights to a temporary file before logging
# For real training, this would be your actual model checkpoint file
with open("model_weights.pth", "w") as f:
    f.write("dummy_weights_content")
model_artifact.add_file("model_weights.pth")
# Log the model artifact. Because this run consumed the dataset artifact via
# use_artifact() and now logs the model via log_artifact(), W&B automatically
# records the dependency between the two. This establishes lineage!
run.log_artifact(model_artifact)
run.finish()
Here, run.use_artifact("processed-dataset:latest", type="dataset") tells W&B to download the artifact named processed-dataset and make it available locally. If we specified processed-dataset:v0, it would fetch that exact version. The lineage link is where the magic happens: because the same run both consumed the dataset artifact and produced the model artifact, W&B records that this version of sentiment-model was created using that specific version of processed-dataset.
This creates a directed acyclic graph (DAG) of your ML experiments. You can see exactly which version of data produced which version of a model, and which hyperparameter settings were used to train it (if you log those too). This is invaluable for reproducibility, debugging, and auditing.
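To make the DAG idea concrete, here is a minimal, hypothetical in-memory sketch of such a lineage graph. The node names and structure are illustrative only, not the W&B API; W&B maintains this graph for you on its backend:

```python
# Each node is "name:version"; edges point from an artifact
# to the artifacts it was produced from.
lineage = {
    "processed-dataset:v0": ["raw-reviews:v0"],
    "sentiment-model:v0": ["processed-dataset:v0"],
    "sentiment-model:v1": ["processed-dataset:v1"],
}

def upstream(graph, node):
    """Walk the DAG and return every artifact `node` depends on,
    direct and transitive."""
    deps = []
    for parent in graph.get(node, []):
        deps.append(parent)
        deps.extend(upstream(graph, parent))
    return deps
```

Asking `upstream(lineage, "sentiment-model:v0")` traces the model back through `processed-dataset:v0` to `raw-reviews:v0`, which is exactly the question you ask when debugging a regression between model versions.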
The surprising thing about W&B Artifacts is that they don’t just store files; they are a formal, versioned, and auditable record of your computational process. Each artifact version is an immutable snapshot: not just the files themselves, but the state they represented at a specific point in time. This means you can point to a specific model version and know precisely which data and code produced it, even months later.
This lineage link is more than documentation; it is a formal dependency recorded in W&B’s artifact graph. Given a model artifact, you can ask which run produced it and which artifacts that run consumed (the public API exposes this through methods such as Artifact.logged_by() and Run.used_artifacts()), and then fetch the exact dataset version the model was trained on. This is how you achieve true reproducibility: from a model artifact you can recover the exact data that trained it and, if you log it, the exact code that performed the training.
The next concept you’ll want to explore is how to manage and version your code using Artifacts, creating a complete reproducible pipeline from code to data to model.