The most surprising thing about TensorFlow’s embedding layers is that they don’t just look up pre-computed vectors; they hold representations that are learned during training and come to capture relationships between discrete items.

Let’s see this in action. Imagine we’re building a movie recommendation system. We have a catalog of movies and a history of user ratings. Our goal is to predict how a user will rate a movie they haven’t seen.

First, we need to represent our movies and users as numerical IDs.

import tensorflow as tf
from tensorflow.keras import layers

# Example data (in reality, this would come from your dataset)
movie_titles = ["The Shawshank Redemption", "The Godfather", "The Dark Knight", "Pulp Fiction", "Schindler's List", "Fight Club", "Forrest Gump", "Inception", "The Matrix", "Goodfellas"]
user_ids = ["user_1", "user_2", "user_3", "user_4", "user_5"]

# Map titles and user IDs to integer indices
movie_to_index = {title: i for i, title in enumerate(movie_titles)}
index_to_movie = {i: title for title, i in movie_to_index.items()}
user_to_index = {uid: i for i, uid in enumerate(user_ids)}
index_to_user = {i: uid for uid, i in user_to_index.items()}

num_movies = len(movie_titles)
num_users = len(user_ids)
embedding_dim = 16 # The size of our learned vectors

Now, we define our model. An embedding layer takes integer indices and outputs dense vectors. We’ll have one for movies and one for users.

class RecommendationModel(tf.keras.Model):
    def __init__(self, num_users, num_movies, embedding_dim):
        super().__init__()
        self.user_embedding = layers.Embedding(
            input_dim=num_users,
            output_dim=embedding_dim,
            name="user_embedding"
        )
        self.movie_embedding = layers.Embedding(
            input_dim=num_movies,
            output_dim=embedding_dim,
            name="movie_embedding"
        )
        # The dot product of the user and movie vectors is the predicted rating
        self.dot_product = layers.Dot(axes=1)

    def call(self, inputs):
        user_id_input, movie_id_input = inputs
        user_vec = self.user_embedding(user_id_input)
        movie_vec = self.movie_embedding(movie_id_input)
        return self.dot_product([user_vec, movie_vec])

model = RecommendationModel(num_users, num_movies, embedding_dim)

When we compile and train this model, the user_embedding and movie_embedding layers are initialized with random weights. During training, backpropagation adjusts these weights. The goal is to make the dot product of a user’s embedding vector and a movie’s embedding vector predict the actual rating. Over time, the embeddings learn to group similar items and users. Movies that are often liked by the same users will end up with similar embedding vectors. Similarly, users who like similar movies will have embedding vectors that are "close" to the embeddings of those movies.
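To build intuition for what those gradient updates do, here is a stripped-down numpy sketch of a single user–movie pair. The learning rate, loop, and starting values are illustrative stand-ins, not Keras internals; the point is only that minimizing squared error on the dot product pulls the two vectors into alignment:

```python
import numpy as np

# One user vector and one movie vector, randomly initialized (as in Keras)
rng = np.random.default_rng(0)
user_vec = rng.normal(scale=0.1, size=4)
movie_vec = rng.normal(scale=0.1, size=4)
target_rating, lr = 5.0, 0.01

for _ in range(500):
    pred = user_vec @ movie_vec      # the model's prediction
    err = pred - target_rating       # derivative of squared error (up to a factor of 2)
    user_grad = 2 * err * movie_vec  # chain rule through the dot product
    movie_grad = 2 * err * user_vec
    user_vec -= lr * user_grad
    movie_vec -= lr * movie_grad

print(user_vec @ movie_vec)  # approaches target_rating
```

The same mechanics play out in the full model, just with one row of each embedding matrix updated per training example.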

Let’s prepare some dummy training data. We’ll create pairs of (user_id, movie_id) and a corresponding rating.

import numpy as np

# Dummy training data: (user_idx, movie_idx, rating)
# In a real scenario, you'd have many more ratings per user and movie.
training_data = [
    (user_to_index["user_1"], movie_to_index["The Shawshank Redemption"], 5),
    (user_to_index["user_1"], movie_to_index["The Godfather"], 4),
    (user_to_index["user_2"], movie_to_index["The Dark Knight"], 5),
    (user_to_index["user_2"], movie_to_index["Inception"], 4),
    (user_to_index["user_3"], movie_to_index["Pulp Fiction"], 5),
    (user_to_index["user_3"], movie_to_index["Fight Club"], 4),
    (user_to_index["user_4"], movie_to_index["The Matrix"], 5),
    (user_to_index["user_4"], movie_to_index["Goodfellas"], 3),
    (user_to_index["user_5"], movie_to_index["Forrest Gump"], 4),
    (user_to_index["user_5"], movie_to_index["Schindler's List"], 3),
    # Add some cross-interactions to help learning
    (user_to_index["user_1"], movie_to_index["The Dark Knight"], 3),
    (user_to_index["user_2"], movie_to_index["The Shawshank Redemption"], 4),
    (user_to_index["user_3"], movie_to_index["The Matrix"], 4),
    (user_to_index["user_4"], movie_to_index["Pulp Fiction"], 3),
    (user_to_index["user_5"], movie_to_index["Inception"], 3),
]

user_ids_train = np.array([x[0] for x in training_data])
movie_ids_train = np.array([x[1] for x in training_data])
ratings_train = np.array([x[2] for x in training_data])

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss='mse')

# Train the model (a few epochs for demonstration)
print("Training the model...")
model.fit(
    [user_ids_train, movie_ids_train],
    ratings_train,
    epochs=50,
    verbose=0 # Set to 1 to see progress
)
print("Training complete.")

After training, we can inspect the learned embeddings. Calling get_weights() on a layer returns its learned matrices.

# Get the learned embedding weights
user_embedding_weights = model.get_layer("user_embedding").get_weights()[0]
movie_embedding_weights = model.get_layer("movie_embedding").get_weights()[0]

print("\nLearned User Embeddings (first 5 users, first 4 dimensions):")
print(user_embedding_weights[:5, :4])

print("\nLearned Movie Embeddings (first 5 movies, first 4 dimensions):")
print(movie_embedding_weights[:5, :4])

The key insight here is that the embedding layer transforms sparse, high-dimensional categorical features (like a movie title or user ID) into dense, low-dimensional vectors. These vectors are not arbitrary; they are optimized through training to capture semantic relationships. For example, if "The Dark Knight" and "Inception" are frequently liked by the same users, their embedding vectors will likely become similar in the embedding_dim-dimensional space. This similarity allows the model to generalize: if a user likes "The Dark Knight," the model can infer they might also like "Inception" because their embeddings are close.
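To make the "close embeddings, close predictions" point concrete, here is a toy numpy sketch with hand-picked vectors (illustrative values, not from a real training run):

```python
import numpy as np

# Hypothetical learned vectors, chosen so the two movies are nearly identical
user = np.array([1.0, 0.5, -0.2])
dark_knight = np.array([0.9, 0.6, -0.1])
inception = np.array([0.85, 0.55, -0.15])  # close to dark_knight

pred_dk = user @ dark_knight   # approx 1.22
pred_inc = user @ inception    # approx 1.16
# Because the movie vectors are close, the predicted scores are close too.
```

A high score for one movie therefore implies a high score for its neighbor, which is exactly the generalization described above.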

The "magic" of embedding layers lies in their ability to learn these latent features automatically. Instead of manually engineering features (e.g., "is this a superhero movie?", "is this directed by Nolan?"), the embedding layer discovers these underlying characteristics from the data itself. The dimensionality (embedding_dim) is a crucial hyperparameter, balancing the expressiveness of the learned representations against the risk of overfitting and computational cost.

When you have a very large number of unique items (millions of products, users, words), creating an embedding layer is the standard way to handle them. The alternative would be one-hot encoding, which results in astronomically large and sparse vectors that are computationally infeasible and don’t capture relationships. An embedding layer effectively learns a compressed, meaningful representation.
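The equivalence is easy to verify: an embedding lookup computes exactly what a one-hot vector multiplied by the embedding matrix would, without ever materializing the sparse vector. A small numpy sketch (sizes are made up):

```python
import numpy as np

num_movies, embedding_dim = 10, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(num_movies, embedding_dim))  # stand-in embedding matrix

movie_idx = 3
one_hot = np.zeros(num_movies)
one_hot[movie_idx] = 1.0

# The dense matmul and the plain row lookup produce the same vector
assert np.allclose(one_hot @ W, W[movie_idx])
```

With millions of items, the one-hot route would multiply by a vector that is almost entirely zeros; the lookup skips all of that work.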

The dot product interaction in our simple model is just one way to combine user and item embeddings. More complex models might concatenate these embeddings and pass them through a series of dense layers, or use more sophisticated interaction functions, but the core idea of learning dense vector representations for discrete entities remains the same.
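As one sketch of that concatenation-based variant (the class name and layer sizes here are illustrative choices, not a prescribed architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers

class DeepRecommendationModel(tf.keras.Model):
    def __init__(self, num_users, num_movies, embedding_dim):
        super().__init__()
        self.user_embedding = layers.Embedding(num_users, embedding_dim)
        self.movie_embedding = layers.Embedding(num_movies, embedding_dim)
        self.hidden = layers.Dense(32, activation="relu")
        self.output_layer = layers.Dense(1)  # predicted rating

    def call(self, inputs):
        user_id_input, movie_id_input = inputs
        # Concatenate the two embeddings and let dense layers learn the interaction
        x = tf.concat(
            [self.user_embedding(user_id_input),
             self.movie_embedding(movie_id_input)],
            axis=-1,
        )
        return self.output_layer(self.hidden(x))

model = DeepRecommendationModel(num_users=5, num_movies=10, embedding_dim=16)
preds = model([tf.constant([0, 1]), tf.constant([2, 3])])
print(preds.shape)  # → (2, 1)
```

Unlike the fixed dot product, the dense layers can learn non-linear interactions between user and movie features, at the cost of more parameters.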

One subtle point about embedding layers is their initialization. While random initialization is common, for some tasks, pre-training embeddings (e.g., using word embeddings like Word2Vec for text, or using prior knowledge) can significantly speed up convergence and improve performance. However, for most recommendation tasks, training from scratch is sufficient and often preferred as it tailors the embeddings specifically to your dataset’s user-item interaction patterns.
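A minimal sketch of seeding an Embedding layer with pre-computed vectors, using set_weights (the pretrained matrix here is random stand-in data; in practice it might come from Word2Vec or an earlier model):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

num_movies, embedding_dim = 10, 16
# Stand-in for real pretrained vectors
pretrained_vectors = np.random.default_rng(42).normal(
    size=(num_movies, embedding_dim)).astype("float32")

movie_embedding = layers.Embedding(num_movies, embedding_dim)
_ = movie_embedding(tf.constant([0]))           # first call builds the weights
movie_embedding.set_weights([pretrained_vectors])

vec = movie_embedding(tf.constant([3]))
assert np.allclose(vec.numpy()[0], pretrained_vectors[3])
```

Setting trainable=False on the layer would freeze the pretrained vectors entirely; leaving it trainable lets fine-tuning adapt them to your data.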

Once your embeddings are learned, you can use them for various tasks beyond just prediction. You can find similar items by looking for movie embeddings that are close in the embedding space (e.g., using cosine similarity), or find users with similar tastes by comparing their user embeddings.
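For example, a small helper (the function name is my own) that ranks rows of an embedding matrix by cosine similarity:

```python
import numpy as np

def most_similar(weights, idx, top_k=3):
    """Return the indices of the rows most similar to row idx (cosine similarity)."""
    normed = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    sims = normed @ normed[idx]          # cosine similarity to every row
    order = np.argsort(-sims)            # highest similarity first
    return [i for i in order if i != idx][:top_k]

# Tiny stand-in matrix; in practice pass movie_embedding_weights from above
weights = np.array([
    [1.0, 0.0],
    [0.9, 0.1],   # nearly parallel to row 0
    [0.0, 1.0],
])
print(most_similar(weights, 0, top_k=1))  # → [1]
```

Applied to the trained movie_embedding_weights, this would surface the catalog's nearest neighbors for any given movie.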

The next step in building a more robust recommendation system would be to explore more sophisticated model architectures that leverage these embeddings, such as combining them with other features (user demographics, item metadata) or using attention mechanisms to weigh the importance of different user interactions.
