Artifacts in Weights & Biases are how you save and version your data, models, and other outputs. Uploading large artifacts can be slow and expensive, but both the time and the cost can be significantly reduced.

Here’s a demonstration of a baseline upload of a large artifact from a script, followed by the faster, more idiomatic route using wandb artifact put, which parallelizes the transfer.

First, let’s simulate a large artifact.

import wandb
import os
import shutil

# Create a dummy large artifact
artifact_name = "large-dataset-sim"
artifact_dir = "simulated_data"
os.makedirs(artifact_dir, exist_ok=True)

# Create a few large dummy files
file_size_gb = 2
num_files = 5
total_size_gb = file_size_gb * num_files

print(f"Creating {num_files} dummy files, each {file_size_gb}GB, for a total of {total_size_gb}GB...")

for i in range(num_files):
    file_path = os.path.join(artifact_dir, f"data_part_{i}.bin")
    # Seek to the last byte and write a single null byte: on most filesystems
    # this creates a sparse file that reports the full size without actually
    # consuming that much disk space.
    with open(file_path, "wb") as f:
        f.seek((file_size_gb * 1024**3) - 1)
        f.write(b"\0")
    print(f"Created {file_path}")

print(f"Dummy artifact directory '{artifact_dir}' created.")

# --- Initializing W&B ---
# This assumes you are logged in: wandb login
run = wandb.init(project="artifact-optimization-demo", job_type="upload-test")

# --- Method 1: Logging the artifact from a script (baseline) ---
# A common first instinct is to reach for `wandb sync`, but that command
# replays *offline run directories* (e.g. `wandb sync wandb/offline-run-*`)
# back to the W&B server; it is not an artifact-upload tool and has no
# artifact-targeting options. The straightforward baseline for uploading
# data as an artifact is to build an Artifact object and log it:

print("\n--- Starting baseline upload with run.log_artifact ---")
artifact_baseline = wandb.Artifact(
    artifact_name,
    type="dataset",
    description="Simulated large dataset, baseline upload",
)
artifact_baseline.add_dir(artifact_dir)
run.log_artifact(artifact_baseline)
print(f"Logged artifact '{artifact_name}'.")
print("For direct uploads from the terminal, 'wandb artifact put' is the idiomatic route.")


# --- Method 2: Using wandb artifact put (optimized for artifacts) ---
# This command is designed specifically for uploading artifacts and supports parallelization.

print("\n--- Starting upload with wandb artifact put ---")
# This is the command you'd run in your terminal:
# wandb artifact put --name large-dataset-optimized --type dataset --description "Simulated large dataset, optimized upload" simulated_data
# The 'wandb artifact put' command is generally more efficient for this task.

# To simulate this within the script, we'll use the programmatic API.
# The underlying implementation of artifact put uses parallel uploads.

artifact_optimized = wandb.Artifact("large-dataset-optimized", type="dataset", description="Simulated large dataset, optimized upload")
print(f"Adding directory '{artifact_dir}' to artifact '{artifact_optimized.name}'...")
artifact_optimized.add_dir(artifact_dir)

# When you log this artifact, W&B handles the optimized upload (including parallelization).
print(f"Logging artifact '{artifact_optimized.name}'...")
run.log_artifact(artifact_optimized)

print("\n--- Uploads complete ---")
run.finish()

# --- Cleanup ---
print(f"Cleaning up dummy artifact directory: {artifact_dir}")
shutil.rmtree(artifact_dir)

The surprising truth about Weights & Biases artifact uploads, especially for large files, is that the wandb artifact put command is not just a wrapper around simple file transfers; it leverages parallel, multi-part uploads managed by W&B’s backend. This means that even if you’re uploading a single large file or a directory, the upload process breaks it down into smaller chunks and sends them concurrently to W&B’s cloud storage, significantly reducing transfer time compared to sequential uploads.

Let’s break down how this works and how you can control it.

When you execute wandb artifact put --name my-artifact --type dataset my_data_directory/, W&B performs several steps:

  1. Hashing and Indexing: W&B first scans the provided directory (my_data_directory/). It computes a content hash for each file (recorded as digests in the artifact’s manifest) plus a digest for the artifact as a whole, giving it a unique fingerprint. If you upload the same content again, W&B detects the match and avoids re-uploading unchanged files.
  2. Chunking: Large individual files are automatically broken down into smaller, manageable chunks by the W&B client.
  3. Parallel Upload: These chunks, along with individual smaller files, are uploaded to W&B’s cloud storage concurrently. The W&B client manages a pool of connections to maximize throughput.
  4. Metadata Storage: Once all chunks and files are uploaded, W&B stores the artifact’s metadata—the manifest (list of files, their hashes, and sizes), the directory structure, and any associated tags or descriptions—linking it all back to your W&B run.
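The four steps above can be sketched in miniature. This is a conceptual illustration, not W&B’s actual client code: the 4 MB chunk size, the MD5 hash choice, and the in-memory dict standing in for remote storage are all assumptions made for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MB chunks

def chunk_file(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file's bytes into fixed-size chunks, keyed by index."""
    return [(i // chunk_size, data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def upload_parallel(data: bytes, store: dict, max_workers: int = 8) -> str:
    """Hash the whole file, then upload its chunks concurrently."""
    digest = hashlib.md5(data).hexdigest()  # fingerprint for deduplication
    if digest in store:                     # unchanged content: skip re-upload
        return digest
    chunks = {}
    def put(item):
        idx, payload = item
        chunks[idx] = payload               # stand-in for one chunk PUT request
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(put, chunk_file(data)))
    store[digest] = chunks                  # commit the manifest entry
    return digest

remote = {}
blob = bytes(10 * 1024 * 1024)              # 10 MB of zeros
d1 = upload_parallel(blob, remote)
d2 = upload_parallel(blob, remote)          # second call is deduplicated
print(d1 == d2, len(remote[d1]))            # → True 3
```

The key design point is that hashing happens before any bytes move: the fingerprint decides whether an upload is needed at all, and only then does the thread pool fan the chunks out.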

By default, wandb artifact put parallelizes uploads aggressively. The client manages a pool of worker threads and concurrent connections for you, and it is rarely necessary to tune this down: for most users the defaults give the best balance of throughput and client-side resource usage.

Consider a scenario where you have a 100GB model file.

# This command will automatically chunk the large file and upload parts in parallel.
wandb artifact put --name large-model --type model --description "My large PyTorch model" path/to/your/100GB_model.pth

The W&B client will break 100GB_model.pth into smaller pieces (e.g., 10MB or 64MB chunks, depending on internal heuristics and file size) and upload them simultaneously. This is why it’s often much faster than a simple gsutil cp or aws s3 cp if those tools aren’t configured for parallel multipart uploads of a single large object.
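Some back-of-the-envelope numbers for the example above. The 64 MiB chunk size and the per-stream bandwidth are assumptions for illustration; W&B’s real chunk size is an internal heuristic.

```python
import math

file_size = 100 * 1024**3   # 100 GiB model file
chunk_size = 64 * 1024**2   # assumed 64 MiB chunks
num_chunks = math.ceil(file_size / chunk_size)
print(num_chunks)           # → 1600

# With, say, 16 concurrent streams each sustaining 20 MB/s, aggregate
# throughput is ~320 MB/s instead of a single stream's 20 MB/s.
sequential_s = file_size / (20 * 1024**2)
parallel_s = file_size / (16 * 20 * 1024**2)
print(round(sequential_s / parallel_s))  # → 16
```

Under these assumptions the parallel transfer finishes 16x sooner, which is the whole argument for multipart uploads in one line of arithmetic.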

The primary levers you control are:

  • --name and --type: essential for identifying and organizing your artifacts.
  • --description: a human-readable explanation.
  • --alias: attach an alias to this version (e.g., production, latest); repeat the flag for multiple aliases.

Custom key-value metadata (e.g., {"framework": "pytorch", "version": "1.12"}) is attached through the Python API, via wandb.Artifact(..., metadata={...}). Note that there is no overwrite step to manage: artifacts are versioned automatically, so re-uploading identical content is deduplicated into the existing version, while changed content produces a new version.

The system automatically retries transient network errors during chunk uploads. Cost is driven primarily by how much data you transfer and store. By skipping re-uploads of unchanged data (thanks to hashing) and transferring the rest in parallel, wandb artifact put minimizes both upload time and storage/transfer costs, including costs on your own bucket if W&B is configured to use your own S3/GCS storage.

What most people don’t realize is how W&B manages the state of large uploads. If an upload is interrupted, W&B doesn’t necessarily restart the entire transfer from scratch. It can resume uploads by identifying which chunks have already been successfully transferred and only uploading the missing ones. This resilience is built into the multipart upload protocol it uses, making it robust for large, potentially unstable network environments.
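The resume behavior described above can be sketched as follows. The `resumable_upload` helper and its bookkeeping are illustrative assumptions, not W&B’s implementation: the point is that a record of completed chunk indices lets a retry send only what is missing.

```python
def resumable_upload(chunks, remote, completed, fail_after=None):
    """Upload chunks by index; optionally simulate a crash after N sends."""
    sent = 0
    for idx, payload in enumerate(chunks):
        if idx in completed:          # already transferred: skip it
            continue
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated network drop")
        remote[idx] = payload         # stand-in for one chunk PUT request
        completed.add(idx)
        sent += 1
    return sent

chunks = [b"a" * 10] * 6
remote, completed = {}, set()
try:
    resumable_upload(chunks, remote, completed, fail_after=4)
except ConnectionError:
    pass
print(len(completed))                 # → 4 (chunks that made it before the drop)
resumed = resumable_upload(chunks, remote, completed)
print(resumed, len(remote))           # → 2 6 (only the missing two were re-sent)
```

In a real multipart protocol the `completed` set lives server-side (the client lists already-received parts before resuming), but the logic is the same.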

The next step in optimizing artifact workflows is understanding how to efficiently consume these large artifacts in downstream training runs, leveraging W&B’s caching and partial download capabilities.
