The magic of multi-vector document retrieval isn’t just that it can find documents based on meaning; it can find the specific sentences or paragraphs within those documents, even when the query doesn’t use those exact words.
Let’s see this in action. Imagine we have a document about "The History of the Internet."
# Document: The History of the Internet
## Chunk 1: Early Concepts
The seeds of the internet were sown in the early 1960s with concepts like packet switching, independently developed by Paul Baran and Donald Davies. This laid the groundwork for decentralized networks.
## Chunk 2: ARPANET
In 1969, the Advanced Research Projects Agency Network (ARPANET) was established by the U.S. Department of Defense. It connected four university computers and was the first operational packet-switching network.
## Chunk 3: TCP/IP and Expansion
The development of the Transmission Control Protocol/Internet Protocol (TCP/IP) in the 1970s by Vint Cerf and Bob Kahn provided a standardized way for different networks to communicate. This was a pivotal moment, enabling the internet's exponential growth.
## Chunk 4: The World Wide Web
Tim Berners-Lee invented the World Wide Web in 1989 while working at CERN. He developed HTML, HTTP, and the first web browser, making information easily accessible and navigable.
Now, we’ll generate embeddings for each of these chunks. Let’s say our embedding model produces these vectors (simplified for illustration):
- Chunk 1 Embed: [0.8, 0.1, 0.2]
- Chunk 2 Embed: [0.7, 0.3, 0.1]
- Chunk 3 Embed: [0.2, 0.9, 0.4]
- Chunk 4 Embed: [0.1, 0.2, 0.9]
If a user queries "How did computers start talking to each other?", we generate an embedding for this query. Let’s say it’s [0.75, 0.2, 0.15].
When we compare this query embedding to our chunk embeddings using cosine similarity:
- Query vs. Chunk 1: cosine_similarity([0.75, 0.2, 0.15], [0.8, 0.1, 0.2]) -> High Similarity (around 0.99)
- Query vs. Chunk 2: cosine_similarity([0.75, 0.2, 0.15], [0.7, 0.3, 0.1]) -> High Similarity (around 0.99)
- Query vs. Chunk 3: cosine_similarity([0.75, 0.2, 0.15], [0.2, 0.9, 0.4]) -> Low Similarity (around 0.49)
- Query vs. Chunk 4: cosine_similarity([0.75, 0.2, 0.15], [0.1, 0.2, 0.9]) -> Low Similarity (around 0.34)
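These numbers are easy to verify yourself. Here is a minimal, stdlib-only sketch (the `cosine_similarity` helper is written out by hand for illustration; in practice you’d use a library implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the
    product of their magnitudes. Ranges from -1 to 1; higher = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.75, 0.2, 0.15]
chunks = {
    "Chunk 1": [0.8, 0.1, 0.2],
    "Chunk 2": [0.7, 0.3, 0.1],
    "Chunk 3": [0.2, 0.9, 0.4],
    "Chunk 4": [0.1, 0.2, 0.9],
}

# Score every chunk against the query and print them best-first.
scores = {name: cosine_similarity(query, vec) for name, vec in chunks.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Running this ranks Chunk 1 and Chunk 2 at the top, matching the walkthrough above.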
The system would return Chunk 1 and Chunk 2, because their embeddings are most similar to the query embedding, effectively answering "how computers started talking to each other" by pointing to the early concepts and ARPANET.
The core problem this solves is information overload and the limitations of keyword search. Traditional search engines match words. If you search for "internet origins," you might miss documents that discuss "packet switching" or "ARPANET" without using the word "internet." Multi-vector retrieval, by embedding the meaning of text chunks, can bridge this semantic gap.
Internally, the process involves:
- Chunking: Breaking down large documents into smaller, semantically coherent units (e.g., paragraphs, sections, or even fixed-size blocks). The size of these chunks is a critical tuning parameter – too small and they lose context, too large and they become too general.
- Embedding: Using a pre-trained language model (like BERT, Sentence-BERT, or OpenAI’s Ada) to convert each text chunk into a high-dimensional numerical vector. This vector captures the semantic essence of the text.
- Indexing: Storing these embeddings in a specialized vector database (e.g., Pinecone, Weaviate, FAISS) that is optimized for fast similarity searches.
- Querying: When a user submits a query, it’s also embedded into a vector. The vector database then efficiently finds the chunk embeddings that are closest (most similar) to the query embedding, typically using algorithms like Approximate Nearest Neighbor (ANN).
- Retrieval: The system returns the original text of the top-k most similar chunks.
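The indexing, querying, and retrieval steps above can be sketched end to end under two simplifying assumptions: the hard-coded vectors stand in for a real embedding model, and the `BruteForceIndex` class is a toy stand-in for a vector database (a real one would use an ANN index rather than a full scan):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class BruteForceIndex:
    """Toy vector store: keeps (vector, text) pairs and scans all of
    them at query time. Real vector databases use ANN structures to
    avoid this linear scan."""
    def __init__(self):
        self.entries = []

    def add(self, vector, text):
        self.entries.append((vector, text))

    def top_k(self, query_vector, k=2):
        # Rank every stored chunk by similarity to the query, best first.
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query_vector, e[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]

# Indexing: store one embedding per chunk (toy values from the example).
index = BruteForceIndex()
index.add([0.8, 0.1, 0.2], "Early Concepts")
index.add([0.7, 0.3, 0.1], "ARPANET")
index.add([0.2, 0.9, 0.4], "TCP/IP and Expansion")
index.add([0.1, 0.2, 0.9], "The World Wide Web")

# Querying + retrieval: embed the query (toy vector) and fetch top-k chunks.
results = index.top_k([0.75, 0.2, 0.15], k=2)
print(results)  # -> ['Early Concepts', 'ARPANET']
```

Swapping the brute-force scan for an ANN index, and the hard-coded vectors for a real embedding model, changes the performance characteristics but not the shape of this pipeline.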
The exact levers you control are primarily in the chunking strategy and the choice of embedding model. For chunking, you might experiment with fixed sizes (e.g., 200 tokens), sentence splitting, or even recursive character splitting to find the optimal balance of granularity and context. The embedding model dictates the quality of the semantic representation; models trained on broader datasets or specifically for sentence similarity tasks will generally yield better results.
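As one concrete illustration of the chunking lever, a fixed-size splitter with overlap might look like this (the `fixed_size_chunks` helper and its character-based parameters are hypothetical, chosen for the sketch; token-based splitting works analogously):

```python
def fixed_size_chunks(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap, so
    content straddling a boundary appears in both neighboring chunks."""
    chunks = []
    step = chunk_size - overlap  # how far each window advances
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Placeholder document just to exercise the splitter.
doc = " ".join(f"sentence {i}." for i in range(100))
chunks = fixed_size_chunks(doc, chunk_size=200, overlap=50)
```

The overlap is a common trick to soften the context-loss problem mentioned above: a sentence cut in half at a chunk boundary still appears whole in at least one chunk.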
When you retrieve chunks based on embedding similarity, the system is effectively performing a nearest-neighbor search in a high-dimensional semantic space. The "distance" metric (most commonly cosine similarity) quantifies how semantically aligned the query is with each document chunk. A key insight is that even if two pieces of text use entirely different words, their embeddings can be very close if they convey a similar meaning. This allows for retrieval of conceptually related information that keyword-based systems would miss entirely.
The next challenge you’ll likely encounter is re-ranking the retrieved chunks. While the initial retrieval is fast and finds semantically relevant pieces, the top-k results might not always be in the most logical order for a human reader.