Matryoshka embeddings let you trade accuracy for retrieval speed by using a single embedding vector that can be truncated to different lengths.

Let’s see this in action. Imagine you have a dataset of product descriptions, and you want to find similar items.

Here’s a simplified Python example using a hypothetical MatryoshkaEmbeddingModel:

from typing import List

class MatryoshkaEmbeddingModel:
    def __init__(self, base_model_name: str = "example-model", max_dim: int = 768):
        self.max_dim = max_dim
        print(f"Initializing Matryoshka model based on {base_model_name} with max dimension {max_dim}")
        # In a real scenario, this would load a pre-trained model
        # and potentially a mechanism to extract embeddings at various dimensions.

    def encode(self, texts: List[str]) -> List[List[float]]:
        print(f"Encoding {len(texts)} texts...")
        # Simulate generating full-dimension embeddings
        full_embeddings = []
        for i, text in enumerate(texts):
            # Simulate a full embedding vector of exactly max_dim values,
            # cycling over the characters so short texts still fill every slot
            full_embedding = [
                (ord(text[j % len(text)]) + i + j) / 1000.0
                for j in range(self.max_dim)
            ]
            full_embeddings.append(full_embedding)
        return full_embeddings

    def get_truncated_embedding(self, full_embedding: List[float], dimension: int) -> List[float]:
        if dimension > len(full_embedding) or dimension <= 0:
            raise ValueError(f"Dimension {dimension} is out of bounds for embedding of length {len(full_embedding)}")
        return full_embedding[:dimension]

# --- Usage Example ---
model = MatryoshkaEmbeddingModel(max_dim=512)
texts = ["This is a red t-shirt.", "A comfortable cotton blend shirt.", "Blue jeans for casual wear."]

# Get full embeddings
full_embeddings = model.encode(texts)
print(f"Generated full embeddings of dimension: {len(full_embeddings[0])}")

# Truncate embeddings for different retrieval needs
embedding_dim_128 = [model.get_truncated_embedding(emb, 128) for emb in full_embeddings]
embedding_dim_32 = [model.get_truncated_embedding(emb, 32) for emb in full_embeddings]

print(f"Dimension 128 embedding length: {len(embedding_dim_128[0])}")
print(f"Dimension 32 embedding length: {len(embedding_dim_32[0])}")

# In a real system, you would store these truncated embeddings and use
# approximate nearest neighbor (ANN) search libraries (like Faiss, Annoy)
# which are significantly faster with smaller dimensions.
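Once the vectors are truncated, similarity search works the same way at any prefix length. Here is a minimal brute-force sketch of cosine-similarity ranking over truncated embeddings; the vectors and the helper names (cosine_similarity, rank_by_similarity) are made up for illustration:

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query: List[float], corpus: List[List[float]], dim: int) -> List[int]:
    # Truncate the query and every corpus vector to the same prefix length,
    # then rank corpus indices by descending cosine similarity
    q = query[:dim]
    scores = [cosine_similarity(q, doc[:dim]) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

corpus = [[0.9, 0.1, 0.0, 0.3], [0.1, 0.8, 0.2, 0.4], [0.85, 0.15, 0.05, 0.2]]
query = [0.88, 0.12, 0.02, 0.25]
print(rank_by_similarity(query, corpus, dim=2))  # → [0, 2, 1]
```

In production you would hand the truncated vectors to an ANN index rather than scanning the corpus linearly, but the truncate-then-compare pattern is the same.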

The core problem Matryoshka embeddings solve is the rigid trade-off between embedding dimensionality, retrieval accuracy, and search speed. Traditionally, if you wanted faster search, you’d need to use a smaller model or a dimensionality reduction technique like PCA, which often meant a significant drop in semantic understanding and, consequently, retrieval accuracy. Matryoshka embeddings give you multiple effective dimensionalities from a single, high-dimensional embedding: truncate the vector when speed matters, keep it whole when accuracy does.

Internally, a Matryoshka embedding model is trained so that every prefix of a higher-dimensional embedding retains as much semantic information as possible: the training objective is applied not only to the full vector but also to its nested prefixes, so errors in the initial dimensions are penalized at every prefix length. This means that if you have a 768-dimensional embedding, the first 128 dimensions are highly representative of the original text’s meaning, and the first 32 are still quite representative, though less so. The model learns to compress information such that embedding[:d1] is a good approximation of embedding[:d2] whenever d1 < d2.

You control the trade-off by choosing the dimension parameter when you query for an embedding. For high-stakes, precise searches, you might use dimension=512. For quick, "good enough" searches where speed is paramount, you might use dimension=64 or even dimension=32. This is especially powerful in vector databases where indexing and searching are computationally intensive. A smaller dimension means a smaller index, less memory usage, and faster query times.
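The storage side of that trade-off is easy to quantify. A back-of-the-envelope sketch, assuming an uncompressed flat index of float32 vectors (the function name is illustrative):

```python
def index_memory_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    # Flat (uncompressed) float32 index: vectors × dimensions × 4 bytes each
    return num_vectors * dim * bytes_per_float

million = 1_000_000
for dim in (768, 512, 128, 32):
    mb = index_memory_bytes(million, dim) / 1e6
    print(f"dim={dim}: {mb:.0f} MB for 1M vectors")
```

Dropping from 768 to 32 dimensions shrinks a million-vector index from roughly 3 GB to about 128 MB, before any ANN-specific compression.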

The surprising aspect is how much semantic information can be retained in the initial dimensions of a vector, even when the full vector is quite large. It’s not just about discarding less important dimensions; it’s a specific training objective that prioritizes the representational power of the earliest components. In Matryoshka Representation Learning this is typically done by evaluating the task loss at several nested prefix lengths and summing the results; some implementations instead (or additionally) penalize divergence between each prefix embedding and the full embedding, using cosine similarity or L2 distance. Either way, the model is forced to pack the most crucial semantic signals into the earliest components of the embedding vector.
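One common form of this objective sums a task loss over nested prefix lengths. This toy sketch uses a squared-L2 loss for concreteness; the names matryoshka_loss and nested_dims, and the choice of prefix lengths, are illustrative:

```python
from typing import Callable, List, Sequence

def matryoshka_loss(
    embedding: List[float],
    target: List[float],
    task_loss: Callable[[List[float], List[float]], float],
    nested_dims: Sequence[int] = (32, 64, 128, 256),
) -> float:
    # Apply the same task loss to each nested prefix of the embedding,
    # so the optimizer is rewarded for packing signal into early dimensions
    total = 0.0
    for d in nested_dims:
        total += task_loss(embedding[:d], target[:d])
    return total

l2_squared = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
print(matryoshka_loss([1.0, 2.0], [0.0, 0.0], l2_squared, nested_dims=(1, 2)))  # → 6.0
```

Because a mistake in dimension 0 is counted at every prefix length while a mistake in dimension 255 is counted only once, gradients push the most important information toward the front of the vector.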

The next logical step is to explore how these dynamically sized embeddings integrate with vector databases and the specific indexing strategies that maximize performance for Matryoshka retrieval.
