Logging sensitive information like personally identifiable information (PII) to Weights & Biases (W&B) can expose it in your project’s UI, which is accessible to anyone with project permissions.

Let’s see W&B in action, but with a focus on what not to log. Imagine you’re training a model on user data, and you’ve got a dataset that looks something like this:

import pandas as pd

data = {
    'user_id': [101, 102, 103, 104, 105],
    'username': ['alice_w', 'bob_k', 'charlie_p', 'diana_m', 'ethan_s'],
    'email': ['alice.w@example.com', 'bob.k@example.com', 'charlie.p@example.com', 'diana.m@example.com', 'ethan.s@example.com'],
    'purchase_amount': [55.20, 120.00, 30.50, 75.00, 90.25]
}
df = pd.DataFrame(data)

If you were to log this df directly to W&B using wandb.log({"user_data": wandb.Table(dataframe=df)}), the user_id, username, and email columns would become visible in the W&B UI. Anyone with access to your W&B project could then see this PII.

The problem W&B privacy masking solves is precisely this: preventing sensitive data from appearing in the W&B UI while still allowing you to log other useful metadata or even anonymized versions of your data. It’s about maintaining data security without sacrificing the benefits of experiment tracking.
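Applied to the DataFrame above, the masking can happen column-wise before anything is logged. The sketch below uses a hypothetical helper, `mask_columns` (not a wandb API), that replaces each sensitive column with a truncated SHA-256 pseudonym:

```python
import hashlib

import pandas as pd

def mask_columns(df, columns):
    """Return a copy of df with each listed column replaced by a
    truncated SHA-256 pseudonym. Illustrative helper, not a wandb API."""
    masked = df.copy()
    for col in columns:
        masked[col] = masked[col].map(
            lambda v: "user_" + hashlib.sha256(str(v).encode()).hexdigest()[:8]
        )
    return masked

df = pd.DataFrame({
    'user_id': [101, 102],
    'email': ['alice.w@example.com', 'bob.k@example.com'],
    'purchase_amount': [55.20, 120.00],
})

# purchase_amount stays usable for analysis; the identifiers are pseudonymized
masked_df = mask_columns(df, ['user_id', 'email'])
```

The resulting `masked_df` can then be logged safely with `wandb.log({"user_data": wandb.Table(dataframe=masked_df)})`, since the raw identifiers never leave your machine.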

Here’s how you’d typically set up W&B logging for a model training run:

import wandb
import random

# Initialize a W&B run
run = wandb.init(project="my-sensitive-data-project", job_type="data-analysis")

# Simulate some data that might contain PII
sample_data = [
    {"user_id": 101, "username": "alice_w", "email": "alice.w@example.com", "value": random.random()},
    {"user_id": 102, "username": "bob_k", "email": "bob.k@example.com", "value": random.random()},
    {"user_id": 103, "username": "charlie_p", "email": "charlie.p@example.com", "value": random.random()},
]

# Log a table directly (DO NOT DO THIS WITH REAL PII)
# wandb.log({"raw_data_table": wandb.Table(data=sample_data)})

# --- The correct way: Mask PII ---

import hashlib

def mask_pii(data_row):
    masked_row = data_row.copy()
    # SHA-256 gives stable, opaque pseudonyms. Built-in hash() is randomized
    # per process, and something like 101 % 1000 leaves small IDs unchanged.
    if 'user_id' in masked_row:
        digest = hashlib.sha256(str(masked_row['user_id']).encode()).hexdigest()
        masked_row['user_id'] = f"user_{digest[:8]}"
    if 'username' in masked_row:
        digest = hashlib.sha256(masked_row['username'].encode()).hexdigest()
        masked_row['username'] = f"user_{digest[:8]}"
    if 'email' in masked_row:
        digest = hashlib.sha256(masked_row['email'].encode()).hexdigest()
        masked_row['email'] = f"user_{digest[:8]}@masked.com"
    return masked_row

masked_data = [mask_pii(row) for row in sample_data]

# Log the masked table
run.log({"masked_data_table": wandb.Table(data=masked_data)})

# Log model metrics or other non-PII data
for i in range(10):
    run.log({"metric": random.random() * (i + 1)})

run.finish()

In this example, mask_pii takes a dictionary representing a data row and returns a new dictionary with the PII fields replaced by short hash-based pseudonyms. We then build a wandb.Table from masked_data and log it. The W&B UI displays values like user_a1b2c3d4 and masked emails in place of the real identifiers. One caveat: a plain hash of a guessable value like an email address can still be reversed by brute force, so for stronger guarantees use a keyed (salted) hash.

The core mechanism is client-side data transformation: the data is changed before it ever reaches the W&B servers. You have complete control over what gets masked and how. This could be as simple as replacing specific fields with placeholders, or as sophisticated as applying cryptographic hashing or differential privacy techniques. The key is that the transformation happens in your own code, before the data is ever handed to the W&B SDK.
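As a concrete example of the cryptographic-hashing option, a keyed (HMAC) hash makes pseudonyms infeasible to brute-force even for guessable inputs like emails. The `SECRET_SALT` name below is an assumption for illustration; in practice it should come from a secrets manager or environment variable, never from source code:

```python
import hashlib
import hmac

# Assumed to be loaded from a secrets manager, not hard-coded (illustration only)
SECRET_SALT = b"replace-with-a-real-secret"

def pseudonymize(value: str) -> str:
    """Keyed pseudonym: the same input always maps to the same output
    under one salt, but it cannot be reversed without the salt."""
    digest = hmac.new(SECRET_SALT, value.encode(), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}"

# Determinism means joins across logged tables still work
assert pseudonymize("alice.w@example.com") == pseudonymize("alice.w@example.com")
```

Because the mapping is deterministic within one salt, you can still correlate rows across runs and tables while keeping the raw identifiers out of W&B entirely.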

The most surprising truth about W&B’s privacy features is that they don’t rely on any special server-side W&B configuration for PII masking. The responsibility and the tooling for anonymization are entirely within your code. W&B itself doesn’t "know" what PII is; it just logs whatever data you send it. Therefore, any privacy guarantees are achieved by pre-processing your data before it’s passed to wandb.log(). This gives you ultimate flexibility but also means you must be diligent about identifying and transforming sensitive fields yourself.
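One way to stay diligent is to funnel all logging through a small wrapper that masks known-sensitive keys before anything reaches the logger. Everything here (`safe_log`, `PII_KEYS`, the masking rule) is a hypothetical pattern of your own code, not a wandb feature; `log_fn` would be `run.log` in a live run:

```python
import hashlib

# Keys you have identified as sensitive in your own schema (assumption)
PII_KEYS = {"user_id", "username", "email"}

def _mask(value):
    return "masked_" + hashlib.sha256(str(value).encode()).hexdigest()[:8]

def safe_log(payload, log_fn):
    """Recursively mask PII keys in payload, then hand the scrubbed
    result to log_fn (e.g. run.log). Hypothetical wrapper, not a wandb API."""
    def scrub(obj):
        if isinstance(obj, dict):
            return {k: _mask(v) if k in PII_KEYS else scrub(v)
                    for k, v in obj.items()}
        if isinstance(obj, list):
            return [scrub(item) for item in obj]
        return obj
    log_fn(scrub(payload))

# Demo with a plain capture function standing in for a live wandb run
captured = []
safe_log({"email": "alice.w@example.com", "loss": 0.12}, captured.append)
```

Centralizing the scrub step this way means a new teammate cannot accidentally log a raw field just by calling the wrong function.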

The next step is understanding how to integrate this masking logic into your data pipelines, especially when dealing with large datasets or streaming data.
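For streaming or batched pipelines, the same per-row transform can be applied lazily as records flow through, so unmasked rows never accumulate in memory. The generator below is a sketch, assuming rows arrive as dicts:

```python
import hashlib

def mask_stream(rows, pii_keys=("user_id", "username", "email")):
    """Lazily yield masked copies of each incoming row; raw rows are
    never collected. Sketch of a pipeline stage, not a wandb API."""
    for row in rows:
        masked = dict(row)
        for key in pii_keys:
            if key in masked:
                digest = hashlib.sha256(str(masked[key]).encode()).hexdigest()
                masked[key] = f"user_{digest[:8]}"
        yield masked

# Each masked batch can then be logged as its own wandb.Table as it arrives
stream = iter([
    {"user_id": 101, "email": "alice.w@example.com", "value": 0.5},
    {"user_id": 102, "email": "bob.k@example.com", "value": 0.7},
])
masked_rows = list(mask_stream(stream))
```

Because the generator yields one row at a time, it composes naturally with chunked reads (for example, `pandas.read_csv(..., chunksize=...)`) on datasets too large to hold in memory.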
