The W&B Embedding Projector is a powerful tool for visualizing high-dimensional data, like the embeddings generated by machine learning models. It lets you see patterns and relationships in your data that would otherwise be very hard to discern.
Let’s look at it in action. Imagine you’ve trained a text embedding model, perhaps using something like Sentence-BERT, and you’ve logged the resulting embeddings to Weights & Biases. Here’s how you might set that up in your Python script:
import numpy as np
import torch
import wandb
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Assume 'model' is your trained embedding model (e.g., a SentenceTransformer)
# and 'texts' is a list of strings.

# Generate embeddings
embeddings = []
with torch.no_grad():
    for text in texts:
        # Simplified; actual embedding generation depends on your model
        embeddings.append(model.encode(text))
embeddings_array = np.stack(embeddings)

wandb.init(project="embedding-visualization-demo")

# Reduce to 2D and log each projection as a table, attaching the
# original text to every point so it can be inspected in the UI.
pca_2d = PCA(n_components=2).fit_transform(embeddings_array)
tsne_2d = TSNE(n_components=2, random_state=42).fit_transform(embeddings_array)
wandb.log({
    "my_embeddings_pca": wandb.Table(
        data=[[t, x, y] for t, (x, y) in zip(texts, pca_2d)],
        columns=["text", "PCA1", "PCA2"],
    ),
    "my_embeddings_tsne": wandb.Table(
        data=[[t, x, y] for t, (x, y) in zip(texts, tsne_2d)],
        columns=["text", "TSNE1", "TSNE2"],
    ),
})

# wandb.Object3D expects 3D points, so reduce to three components
# before logging a point cloud for direct 3D viewing.
pca_3d = PCA(n_components=3).fit_transform(embeddings_array)
wandb.log({"my_embeddings_3d": wandb.Object3D(pca_3d)})

wandb.finish()
In this example, we first generate embeddings for a list of texts, reduce them to two dimensions with PCA and t-SNE, and log each projection as a wandb.Table. Crucially, every table row also carries the original text, so you can see what each point in the embedding space actually represents. For direct 3D viewing, we additionally reduce the embeddings to three components and log them as a wandb.Object3D point cloud; Object3D expects points with exactly three coordinates, which is why raw high-dimensional embeddings can’t be passed to it directly.
Once this code runs, you’ll see a new panel in your W&B run. Clicking on the "Media" tab, you’ll find the "Embedding Projector" visualization. Here, you can interact with your data in a 3D scatter plot. You can rotate, zoom, and pan to explore the relationships between data points. The real power comes from coloring and filtering. If you’ve logged labels or other metadata associated with your embeddings, you can use them to color the points, revealing clusterings or anomalies. For instance, if you colored by sentiment (positive/negative) or by document topic, you’d expect to see points of the same color group together.
The Embedding Projector addresses the fundamental challenge of understanding high-dimensional data. Machine learning models often learn representations of data (like text, images, or user behavior) in spaces with hundreds or even thousands of dimensions. Humans can’t directly perceive patterns in such spaces. Techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are used to reduce these dimensions to 2 or 3, making them visualizable. PCA finds orthogonal axes that capture the most variance, while t-SNE focuses on preserving local neighborhoods, often revealing more nuanced clusters. The projector leverages these techniques, allowing you to explore these reduced-dimension representations interactively.
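To make the PCA step above concrete, here is a minimal NumPy-only sketch of what "project onto the directions of most variance" means. It uses synthetic 50-dimensional data with two offset clusters (a hypothetical stand-in for real embeddings); in practice you would use a library implementation such as sklearn’s PCA, as in the example earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic "clusters" in 50 dimensions, offset along a random direction
direction = rng.normal(size=50)
cluster_a = rng.normal(size=(100, 50)) + 4 * direction
cluster_b = rng.normal(size=(100, 50)) - 4 * direction
X = np.vstack([cluster_a, cluster_b])

# PCA via SVD: center the data, then project onto the top-2 right
# singular vectors (the two orthogonal axes of maximum variance)
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
coords_2d = X_centered @ Vt[:2].T

print(coords_2d.shape)  # (200, 2)
```

Because the between-cluster offset dominates the variance, the first principal component aligns with it, and the two clusters land on opposite sides of the origin along that axis, exactly the kind of separation you hope to see in the projector.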
The levers you control are the dimensionality reduction methods and the metadata you log. By default, the projector might use PCA or t-SNE if you’ve logged data in a table format suitable for it. You can also explicitly tell W&B how to project your embeddings. For example, if you have a tensor of shape (N, D) where N is the number of samples and D is the dimensionality, W&B can often infer how to visualize it. If you’re logging multiple embedding sets, you can switch between them. Crucially, the quality of your visualization is directly tied to the quality of your embeddings and the metadata you use for coloring and filtering. Good embeddings will show meaningful clusters when visualized, and well-chosen metadata will highlight these structures.
A common pitfall is not logging enough context with your embeddings. Just logging the raw embedding vectors is like looking at a cloud of dots without any labels. To make sense of it, you need to associate each dot with its original data point. This means logging the corresponding text, image file path, or categorical label alongside the embedding vector. Without this, the visualization is just an abstract arrangement of points.
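As a sketch of how to avoid this pitfall, the rows below pair each projected point with its source text and a label. This is the shape of data you would hand to wandb.Table(data=rows, columns=columns); the texts, labels, and coordinates here are hypothetical toy values.

```python
# Toy stand-ins for real data: source texts, sentiment labels,
# and their 2D projected coordinates
texts = ["great product", "terrible service", "love it", "never again"]
labels = ["positive", "negative", "positive", "negative"]
coords_2d = [(0.9, 0.1), (-0.8, 0.2), (1.1, -0.1), (-0.9, 0.0)]

# Each row carries enough context to interpret its point in the projector:
# the text itself plus a label you can color or filter by.
columns = ["text", "label", "x", "y"]
rows = [[t, lab, x, y] for t, lab, (x, y) in zip(texts, labels, coords_2d)]

print(rows[0])  # ['great product', 'positive', 0.9, 0.1]
```

With a structure like this, coloring by the "label" column in the UI immediately shows whether same-label points cluster together.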
The next step after exploring your embeddings is often to investigate specific clusters or outliers identified in the projector.