The most surprising thing about multi-label classification is that the "labels" aren’t mutually exclusive; they’re independent binary decisions.

Let’s see this in action with a simplified TensorFlow pipeline. Imagine we have images of animals, and each image can contain a dog, a cat, or both.

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import numpy as np

# --- Data Generation (Simulated) ---
def generate_data(num_samples=1000):
    images = np.random.rand(num_samples, 64, 64, 3).astype(np.float32)
    # Each column represents a label: [has_dog, has_cat]
    labels = np.random.randint(0, 2, size=(num_samples, 2)).astype(np.float32)
    return images, labels

images, labels = generate_data()

# --- Model Definition ---
input_img = Input(shape=(64, 64, 3))
x = Conv2D(32, (3, 3), activation='relu')(input_img)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
# Output layer: 2 neurons, one for each label.
# Sigmoid activation allows independent probabilities for each label.
output = Dense(2, activation='sigmoid')(x)

model = Model(inputs=input_img, outputs=output)

# --- Compilation ---
# Use binary_crossentropy for each output neuron, as each is a binary decision.
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy']) # With this loss, 'accuracy' resolves to binary accuracy: per-label correctness averaged over labels and samples

model.summary()

# --- Training ---
print("Training model...")
history = model.fit(images, labels, epochs=5, batch_size=32, validation_split=0.2)

# --- Prediction ---
print("\nMaking predictions on new data...")
sample_image = np.random.rand(1, 64, 64, 3).astype(np.float32)
predictions = model.predict(sample_image)

print(f"Raw predictions (probabilities): {predictions}")
# Interpret predictions: if probability > 0.5, the label is considered present.
predicted_labels = (predictions > 0.5).astype(int)
print(f"Interpreted labels: {predicted_labels}")

In this example, the Dense layer at the end has two output neurons, one for "dog" and one for "cat." Crucially, it uses activation='sigmoid'. This is the key difference from multi-class classification (where you’d use softmax and have mutually exclusive classes). Sigmoid squashes each output neuron’s value independently into the range (0, 1), representing the probability of that specific label being present, regardless of the others.
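You can see the difference directly with a quick NumPy sketch (the logits here are made-up values chosen so that both labels have strong evidence):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical logits where the evidence for BOTH labels is strong.
logits = np.array([2.0, 1.5])

softmax_probs = softmax(logits)  # forced to sum to 1: the labels compete
sigmoid_probs = sigmoid(logits)  # independent: both can be near 1 at once

print("softmax:", softmax_probs)
print("sigmoid:", sigmoid_probs)
```

Softmax splits one unit of probability mass between the labels, so it can never say "both are present with high confidence"; sigmoid scores each label on its own.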

The loss='binary_crossentropy' is also vital. TensorFlow applies this loss function to each output neuron separately. It treats each neuron’s output as an independent binary classification problem. The model is trained to minimize the combined binary cross-entropy across all output neurons.

When you make predictions, model.predict() will output two probabilities. For instance, [[0.85, 0.15]] means there’s an 85% chance of a dog and a 15% chance of a cat. You then apply a threshold (commonly 0.5) to these probabilities to get your final multi-label prediction: [[1, 0]], indicating "dog present, cat absent."
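In practice you usually want the thresholded output as human-readable tags. A minimal sketch, assuming the label names below match the order of the output neurons:

```python
import numpy as np

label_names = ["dog", "cat"]      # assumed to match the output-neuron order
probs = np.array([[0.85, 0.15]])  # one prediction row per image

present = probs > 0.5             # boolean mask, one flag per label
tags = [[name for name, flag in zip(label_names, row) if flag]
        for row in present]

print(tags)
```

Note that the 0.5 threshold is a default, not a law; it can be tuned per label (for example on a validation set) when the costs of false positives and false negatives differ.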

The training accuracy reported by model.fit is an average of the per-label accuracies. For a more nuanced evaluation, you’d typically use metrics like F1-score, precision, and recall, calculated per label or averaged across labels, often with macro or micro averaging depending on your needs.
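The macro/micro distinction is easiest to see by computing both by hand on a tiny made-up batch of thresholded predictions:

```python
import numpy as np

# Rows are samples, columns are labels [has_dog, has_cat] (illustrative data).
y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
y_pred = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])

tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)  # per-label true positives
fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)  # per-label false positives
fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)  # per-label false negatives

# Macro-F1: compute F1 per label, then average the label scores.
per_label_f1 = 2 * tp / (2 * tp + fp + fn)
f1_macro = per_label_f1.mean()

# Micro-F1: pool all label decisions into one confusion matrix first.
f1_micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

print("per-label F1:", per_label_f1)
print("macro:", f1_macro, "micro:", f1_micro)
```

Macro averaging weights every label equally (so rare labels matter as much as common ones), while micro averaging weights every individual decision equally; the same numbers come out of scikit-learn's `f1_score` with `average='macro'` or `average='micro'`.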

The most common pitfall is confusing multi-label with multi-class. If you use softmax on the output layer and categorical_crossentropy as the loss, your model will force predictions into mutually exclusive categories, which is incorrect for multi-label problems where an instance can belong to multiple classes simultaneously.

The next concept you’ll likely grapple with is handling imbalanced datasets in multi-label scenarios, where some labels might be far more frequent than others, requiring specialized weighting or sampling techniques during training.
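As a small preview, one common remedy is to up-weight the positive examples of rare labels. A minimal sketch, using a hypothetical label matrix where "cat" is much rarer than "dog":

```python
import numpy as np

# Hypothetical label matrix: "dog" appears in 90 of 100 images, "cat" in 20.
labels = np.array([[1, 0]] * 80 + [[0, 1]] * 10 + [[1, 1]] * 10,
                  dtype=np.float32)

pos = labels.sum(axis=0)                 # positive count per label
neg = len(labels) - pos                  # negative count per label
pos_weight = neg / np.maximum(pos, 1.0)  # up-weight rare positive labels

print("pos_weight per label:", pos_weight)
```

Weights computed this way can then be fed into a custom loss, for example via `tf.nn.weighted_cross_entropy_with_logits`, so that missing a rare label costs the model more than missing a common one.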
