Transfer learning lets you reuse large, pre-trained models for your own tasks, saving substantial training time and labeled data.

Let’s see it in action. Imagine you want to classify images of cats and dogs. Instead of training a convolutional neural network (CNN) from scratch, which would require millions of images and days of training, you can use a model already trained on ImageNet, a dataset with over 14 million labeled images.

Here’s a simplified Python example using TensorFlow and Keras:

import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Load a pre-trained model (e.g., VGG16) without the top classification layer
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the layers of the base model so they are not updated during training
for layer in base_model.layers:
    layer.trainable = False

# Add new classification layers on top of the base model
x = base_model.output
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x) # For binary classification (cat vs dog)

# Create the final model
model = Model(inputs=base_model.input, outputs=predictions)

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.0001), loss='binary_crossentropy', metrics=['accuracy'])

# Now you would load your cat and dog images and train the model
# model.fit(train_data, train_labels, epochs=10, validation_data=(val_data, val_labels))

The core idea is to take a model that has already learned general features like edges, textures, and shapes from a massive dataset (like ImageNet) and adapt it to your specific problem. You achieve this by removing the original classification layer of the pre-trained model and adding your own, tailored to your number of classes. Then, you "freeze" the weights of the pre-trained layers, meaning they won’t be updated during training. This prevents the model from "forgetting" the valuable general features it has already learned.

The problem transfer learning solves is the data and computational cost of training deep neural networks from scratch. Training a state-of-the-art image classifier can require hundreds of thousands, if not millions, of labeled images and weeks of GPU time. Transfer learning allows you to achieve high accuracy with significantly less data and training time by standing on the shoulders of giants. The pre-trained model acts as a powerful feature extractor. Its early layers learn low-level features (edges, corners), and its later layers learn more complex patterns (textures, object parts). By freezing these layers, you’re essentially using the pre-trained model as a sophisticated feature engineering pipeline.
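To make the feature-extractor view concrete, here is a minimal sketch of running images through the frozen convolutional base on their own. The batch of random images is a hypothetical stand-in for real preprocessed photos, and `weights=None` is used only so the sketch runs without downloading the ImageNet weights; in practice you would keep `weights='imagenet'`.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# weights=None keeps this sketch offline; use weights='imagenet' in practice
base_model = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))

# A hypothetical batch of 8 preprocessed images
images = np.random.rand(8, 224, 224, 3).astype('float32')

# One forward pass through the base turns each image into a fixed
# feature map of shape (7, 7, 512)
features = base_model.predict(images, verbose=0)
print(features.shape)  # (8, 7, 7, 512)
```

Because the base is frozen, these feature maps never change, so you can even compute them once, cache them to disk, and train the small classifier head directly on the cached features instead of re-running the base every epoch.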

When you add your new layers, you’re training a small, task-specific classifier on top of these rich, pre-computed features. This is much more efficient than learning both feature extraction and classification from scratch. The include_top=False argument in Keras is crucial here; it tells the model to load the convolutional base without the fully connected classification layers that were specific to the original ImageNet task. The layer.trainable = False loop is the mechanism for freezing.

The learning rate for the Adam optimizer is set to 0.0001. A conservative learning rate is common practice here: it keeps the new classifier head from training erratically on top of fixed features, and it becomes essential if you later unfreeze some pre-trained layers, where large updates would destroy the features they have already learned. The sigmoid activation in the final Dense layer is for binary classification. For multi-class classification, you’d use softmax and set the number of units to match your class count.
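As a sketch of the multi-class variant, the only changes are the final layer and the loss. `NUM_CLASSES = 5` is a hypothetical class count, and `weights=None` is used here only to keep the sketch self-contained without the ImageNet download; in practice you would keep `weights='imagenet'`.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # hypothetical class count for illustration

# weights=None keeps this sketch offline; use weights='imagenet' in practice
base_model = tf.keras.applications.VGG16(weights=None, include_top=False,
                                         input_shape=(224, 224, 3))
base_model.trainable = False

x = layers.Flatten()(base_model.output)
x = layers.Dense(256, activation='relu')(x)
# softmax over NUM_CLASSES units replaces the single sigmoid unit
predictions = layers.Dense(NUM_CLASSES, activation='softmax')(x)

model = models.Model(inputs=base_model.input, outputs=predictions)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              # use 'sparse_categorical_crossentropy' if labels are integers
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```

With one-hot labels you pair softmax with `categorical_crossentropy`; with plain integer labels, `sparse_categorical_crossentropy` avoids the one-hot conversion.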

A common strategy after initially training the new layers is to "unfreeze" some of the later layers of the pre-trained model and continue training with an even smaller learning rate. This allows the model to subtly adapt the more complex learned features to your specific dataset, further improving performance. For instance, you might unfreeze the last few convolutional blocks of VGG16.
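A sketch of that unfreezing step might look like the following. It rebuilds the earlier setup so the snippet stands alone, again with `weights=None` in place of `weights='imagenet'` to avoid the download; in a real workflow you would already have the trained head from the first stage.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Rebuild the setup from the main example (weights=None keeps this sketch
# offline; in practice you'd use weights='imagenet' and a head you have
# already trained)
base_model = tf.keras.applications.VGG16(weights=None, include_top=False,
                                         input_shape=(224, 224, 3))
x = layers.Flatten()(base_model.output)
x = layers.Dense(256, activation='relu')(x)
predictions = layers.Dense(1, activation='sigmoid')(x)
model = models.Model(inputs=base_model.input, outputs=predictions)

# Unfreeze only the last convolutional block ('block5' in VGG16's layer names)
base_model.trainable = True
for layer in base_model.layers:
    layer.trainable = layer.name.startswith('block5')

# Recompile with a much smaller learning rate before continuing training;
# trainable changes only take effect after compile()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='binary_crossentropy', metrics=['accuracy'])
```

Note the learning rate drops another order of magnitude for this stage, since even small gradient steps can now rewrite the pre-trained block5 weights.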

When you freeze the base model’s layers, you’re not just preventing weight updates; you’re also skipping the computation of gradients with respect to those layers’ weights. Because every frozen layer here sits below the trainable head, backpropagation effectively stops at the new layers, which can significantly speed up training for very deep base models.
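You can see the effect of freezing by counting trainable versus frozen parameters on the model from the main example. This sketch again uses `weights=None` purely to avoid the ImageNet download; the parameter counts are the same either way.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Rebuild the model from the main example (weights=None avoids the download)
base_model = tf.keras.applications.VGG16(weights=None, include_top=False,
                                         input_shape=(224, 224, 3))
for layer in base_model.layers:
    layer.trainable = False

x = layers.Flatten()(base_model.output)
x = layers.Dense(256, activation='relu')(x)
predictions = layers.Dense(1, activation='sigmoid')(x)
model = models.Model(inputs=base_model.input, outputs=predictions)

# Only the new head's weights receive gradient updates
trainable = int(sum(np.prod(list(w.shape)) for w in model.trainable_weights))
frozen = int(sum(np.prod(list(w.shape)) for w in model.non_trainable_weights))
print(f"trainable: {trainable:,}  frozen: {frozen:,}")
```

The roughly 14.7 million convolutional weights of VGG16 stay fixed, and the optimizer only touches the few million weights of the new head, which is where the training-time savings come from.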

The next concept you’ll explore is fine-tuning strategies, specifically deciding which layers to unfreeze and how to set the learning rate for optimal performance on your specific dataset.

Want structured learning?

Take the full TensorFlow course →