Regularization in Keras isn’t just a switch you flip to prevent overfitting; it works by actively encouraging the model to learn more robust, generalizable features, making it harder to simply memorize the training data.

Let’s see how this plays out with dropout and L1/L2 regularization. Imagine we have a simple Keras model for classifying images:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Assume X_train, y_train, X_test, y_test are already defined

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])  # assumes one-hot labels
# model.fit(X_train, y_train, epochs=10, validation_split=0.2)

This is a standard convolutional neural network. Without regularization, if our training set is small or has quirks, the model might learn to rely too heavily on specific pixels or patterns that are only present in the training data, leading to poor performance on unseen data (overfitting).

Dropout: Randomly Forgetting

Dropout is like making random members of a team sit out during practice. During each training step, a certain fraction of neurons (and their connections) in a layer are randomly "dropped out", i.e. ignored. This forces the remaining neurons to learn more robust features, because no neuron can rely on any other specific neuron being present.

To add dropout, we insert layers.Dropout() layers:

model_dropout = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),  # <--- Dropout layer added here
    layers.Dense(10, activation="softmax"),
])

model_dropout.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# model_dropout.fit(X_train, y_train, epochs=10, validation_split=0.2)

Here, layers.Dropout(0.5) zeroes each output of the preceding Dense layer independently with probability 0.5 on every training step. Keras uses "inverted dropout": the activations that survive are scaled up by 1 / (1 - rate) during training, so the expected magnitude of the layer’s output stays the same. During inference (evaluation or prediction) the layer does nothing at all: no units are dropped and no scaling is applied, so the model’s output magnitude is consistent between training and inference.
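The training-versus-inference behavior can be sketched in a few lines of NumPy (a minimal sketch of inverted dropout, not the actual Keras implementation):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
rate = 0.5
activations = np.ones(10_000)  # pretend these came from the Dense layer

# Training: zero each unit with probability `rate`,
# then scale the survivors up by 1 / (1 - rate)
mask = rng.random(activations.shape) >= rate
train_out = activations * mask / (1.0 - rate)

# Inference: the layer is an identity; nothing is dropped or scaled
infer_out = activations

# On average the two match, so the output magnitude is consistent
print(train_out.mean())  # close to 1.0
print(infer_out.mean())  # exactly 1.0
```

Because the scaling happens at training time, the trained weights need no adjustment before deployment.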

L1/L2 Regularization: Penalizing Large Weights

L1 and L2 regularization work by adding a penalty term to the loss function based on the magnitude of the model’s weights. This penalty discourages the model from assigning excessively large weights to any single feature, pushing it towards a simpler solution.

  • L1 Regularization (Lasso): Adds a penalty proportional to the absolute value of the weights (sum(|w|) * lambda). This can lead to sparse weight matrices, effectively performing feature selection by driving some weights to exactly zero.
  • L2 Regularization (Ridge): Adds a penalty proportional to the square of the weights (sum(w^2) * lambda). This discourages large weights but doesn’t typically drive them to zero, resulting in smaller, more distributed weights.
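Written out as code, the two penalties from the list above look like this (a small NumPy sketch; the weight values are made up for illustration):

```python
import numpy as np

w = np.array([0.5, -0.25, 0.0, 1.0])  # hypothetical layer weights
lam = 0.001                           # regularization strength (lambda)

l1_penalty = lam * np.sum(np.abs(w))  # 0.001 * 1.75   = 0.00175
l2_penalty = lam * np.sum(w ** 2)     # 0.001 * 1.3125 = 0.0013125
```

Note how the zero weight contributes nothing to either penalty, and how L1 penalizes the small weight -0.25 proportionally more than L2 does (0.25 versus 0.0625 before scaling), which is why L1 tends to push small weights all the way to zero.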

You can apply L1 or L2 regularization to the kernel (weights) or the bias of a layer. This is done by passing kernel_regularizer or bias_regularizer arguments to the layer.

from tensorflow.keras import regularizers

model_l1_l2 = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu",
                  kernel_regularizer=regularizers.l1(0.001)), # L1 on kernel
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(0.001)), # L2 on kernel
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.001)), # Both L1 and L2
    layers.Dense(10, activation="softmax"),
])

model_l1_l2.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# model_l1_l2.fit(X_train, y_train, epochs=10, validation_split=0.2)

In this example:

  • The first Conv2D layer has L1 regularization with a factor of 0.001 applied to its kernel weights.
  • The second Conv2D layer has L2 regularization with a factor of 0.001 applied to its kernel weights.
  • The Dense layer has both L1 (factor 0.001) and L2 (factor 0.001) regularization applied to its kernel weights.

The l1, l2, and l1_l2 functions in tf.keras.regularizers are factories that create the appropriate regularization objects. The values 0.001 are the regularization strengths; you’ll often need to tune these hyperparameters. A higher value means a stronger penalty.

The magic of L1/L2 regularization is that the penalty term is automatically added to the loss function during training. When the optimizer (like Adam) tries to minimize the total loss, it has to balance minimizing the primary loss (e.g., cross-entropy) with minimizing the regularization penalty, which effectively means keeping the weights small.
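That balance can be sketched with made-up numbers (the helper functions here are illustrative; in Keras this bookkeeping happens automatically via the layer’s regularizer):

```python
import numpy as np

def cross_entropy(probs, label):
    # Primary loss: negative log-probability of the true class
    return -np.log(probs[label])

def l2_penalty(weights, lam):
    # Regularization term: lambda * sum of squared weights, over all layers
    return lam * sum(np.sum(w ** 2) for w in weights)

probs = np.array([0.7, 0.2, 0.1])                   # hypothetical softmax output
weights = [np.array([1.0, -2.0]), np.array([0.5])]  # hypothetical kernels
lam = 0.001

primary = cross_entropy(probs, label=0)  # ~0.357
penalty = l2_penalty(weights, lam)       # 0.001 * 5.25 = 0.00525
total = primary + penalty                # what the optimizer actually minimizes
```

Shrinking a weight lowers the penalty but may raise the primary loss; the optimizer settles wherever the combined total is smallest.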

One subtle point often overlooked is that the regularization penalty influences only the loss, and therefore the gradients during training; it never changes the forward pass itself. Predictions are computed from the learned weights alone, so inference is not artificially degraded by the penalty. The regularization’s effect is already baked into the learned weights themselves. (Note that the loss value reported by model.evaluate() in Keras does still include the penalty term, which is one reason the evaluation loss and metrics like accuracy can tell slightly different stories.)

One practical caveat: with adaptive optimizers like Adam, plain L2 regularization is not equivalent to true weight decay, because the penalty’s gradient gets rescaled per parameter along with everything else. If an L2-regularized model trained with Adam behaves oddly, decoupled weight decay (the AdamW variant, available as keras.optimizers.AdamW in recent TensorFlow releases) is worth trying.
