The most surprising thing about hybrid search is that it can often achieve lower recall than its individual components, despite combining their strengths.
Let’s see it in action. Imagine we have a dataset of product descriptions and we want to find products similar to "a comfortable blue running shoe for women."
First, let’s consider traditional keyword search (lexical). We’d look for documents containing "comfortable," "blue," "running," and "shoe."
```
// Example Query (Lexical)
"comfortable blue running shoe women"
```
This is fast and good for exact matches, but it struggles with synonyms or related concepts. "Athletic footwear" wouldn’t match "running shoe" directly.
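To make that failure mode concrete, here is a toy lexical scorer: plain term overlap standing in for TF-IDF or BM25. The function name and example documents are illustrative, not a production ranker.

```python
import re

def tokens(text: str) -> set[str]:
    # Lowercase and strip punctuation so "shoes," matches "shoes".
    return set(re.findall(r"\w+", text.lower()))

def lexical_score(query: str, doc: str) -> float:
    # Fraction of query terms that literally appear in the document.
    q = tokens(query)
    return len(q & tokens(doc)) / len(q)

query = "comfortable blue running shoe women"
# 4 of 5 terms match ("shoe" vs "shoes" already slips through without stemming):
print(lexical_score(query, "Blue running shoes, very comfortable for women"))
# Zero overlap, even though the meaning is nearly identical:
print(lexical_score(query, "Athletic footwear for female runners"))
```

Note that even the "good" match loses a term to the shoe/shoes mismatch; real lexical engines add stemming or lemmatization on top of this, but the synonym gap remains.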
Now, let’s look at vector search (semantic). We embed our query and documents into a multi-dimensional space where similarity in meaning translates to proximity.
```python
# Example Embedding (Conceptual)
query_vector = model.encode("a comfortable blue running shoe for women")
# We'd then find vectors in our index closest to query_vector
```
This excels at understanding nuances like "athletic footwear" being similar to "running shoe," but it can miss specific keywords if the embedding doesn’t perfectly capture them. What if the embedding for "blue" is slightly off, or the model doesn’t strongly associate "comfortable" with the specific embedding of the product description?
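The "proximity" being measured is usually cosine similarity. A minimal sketch, using hand-made 3-dimensional vectors in place of real model embeddings (which typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the vectors divided by the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for illustration:
query_vec    = [0.9, 0.1, 0.4]
running_shoe = [0.8, 0.2, 0.5]   # semantically close to the query
toaster      = [0.1, 0.9, 0.0]   # unrelated product

print(cosine_similarity(query_vec, running_shoe))  # close to 1.0
print(cosine_similarity(query_vec, toaster))       # much lower
```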
Hybrid search aims to get the best of both worlds. It runs both a lexical and a semantic search, then combines the results. The key is how you combine them. A common approach is to use a weighted sum of the scores from each search type.
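A weighted sum can be sketched like this. The `alpha` parameter and the document IDs are illustrative, and note the big assumption baked into the function: that both score sets already live on comparable scales.

```python
def weighted_fusion(lexical: dict[str, float], vector: dict[str, float],
                    alpha: float = 0.5) -> dict[str, float]:
    """Blend two score dicts: alpha weights the lexical side.
    Assumes the scores are on comparable scales -- a big assumption."""
    docs = set(lexical) | set(vector)
    return {d: alpha * lexical.get(d, 0.0) + (1 - alpha) * vector.get(d, 0.0)
            for d in docs}

# Illustrative scores; a document missing from one list scores 0.0 there.
lexical_scores = {"A": 0.9, "B": 0.7, "C": 0.6}
vector_scores  = {"D": 0.95, "A": 0.92, "E": 0.88}
fused = weighted_fusion(lexical_scores, vector_scores, alpha=0.5)
print(sorted(fused.items(), key=lambda kv: -kv[1]))
```

Treating a missing document as scoring 0.0 is itself a design choice: it heavily penalizes anything that only one retriever found, which may or may not be what you want.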
Let’s say our lexical search returns these top 3 results with scores:
- Product A: "Blue running shoes, very comfortable for women." (Score: 0.9)
- Product B: "Comfortable athletic shoes, blue, for women." (Score: 0.7)
- Product C: "Women’s running sneakers, comfortable fit." (Score: 0.6)
And our vector search returns these top 3 results with scores:
- Product D: "Lightweight athletic trainers for female runners." (Score: 0.95)
- Product A: "Blue running shoes, very comfortable for women." (Score: 0.92)
- Product E: "Comfortable walking shoes, blue." (Score: 0.88)
A simple, but often naive, way to combine them is to take the union of the top-k results and re-rank. A more sophisticated method uses a fusion algorithm, like Reciprocal Rank Fusion (RRF), which combines ranked lists without needing to normalize scores directly. RRF assigns a score to each document based on its rank in each list:
RRF_score(doc) = sum(1 / (k + rank(doc, list)))
where k is a constant (often 60) and rank(doc, list) is the document’s position in that specific ranked list.
Let’s apply this formula with k = 1 to keep the arithmetic easy (real systems typically stick with k = 60).
- Product A: Rank 1 in Lexical, Rank 2 in Vector. RRF-like score: (1/2) + (1/3) = 0.83
- Product B: Rank 2 in Lexical, Not in Top-3 Vector. RRF-like score: (1/3) = 0.33
- Product C: Rank 3 in Lexical, Not in Top-3 Vector. RRF-like score: (1/4) = 0.25
- Product D: Not in Top-3 Lexical, Rank 1 in Vector. RRF-like score: (1/2) = 0.5
- Product E: Not in Top-3 Lexical, Rank 3 in Vector. RRF-like score: (1/4) = 0.25
With this simplified fusion, Product A still looks strong.
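The same fusion takes only a few lines of Python. This sketch uses single-letter IDs as shorthand for Products A–E; with k = 1 it reproduces the hand-computed numbers above.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores

lexical_top3 = ["A", "B", "C"]
vector_top3  = ["D", "A", "E"]

fused = rrf([lexical_top3, vector_top3], k=1)
for doc, score in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(doc, round(score, 2))
```

Product A wins because it is the only document ranked by both retrievers; RRF rewards that agreement without ever looking at the raw scores.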
The problem with simple additive or rank-based fusion is that the weights are often implicitly or explicitly tied to the original scoring mechanisms. If your lexical scorer is highly sensitive to common words and your vector scorer is highly sensitive to nuanced meaning, a simple average or rank fusion might amplify the weaknesses of one while diluting the strengths of the other. For instance, if the vector search missed the concept of "running shoe" entirely but got "comfortable" and "blue" right, and the lexical search found "running shoe" but missed "comfortable," a simple fusion might produce a nonsensical result.

The "better recall" promised by hybrid search comes from finding documents that are either semantically close or lexically precise, and then intelligently merging those lists. The real power lies in tuning the fusion algorithm and the parameters of each individual search type.
The most common pitfall is treating the scores from lexical and vector search as directly comparable, or assuming a simple linear combination is optimal. A BM25 score and a cosine similarity live on different scales with different distributions. Without careful tuning of the fusion algorithm and the individual search parameters (the top-N cutoff for each retriever, the constant in RRF, or the weights in a linear blend), you can end up with a hybrid system that performs worse than either pure lexical or pure vector search, because the fusion over-emphasizes a weak signal or under-emphasizes a strong one.
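If you do want a linear combination rather than rank fusion, a common (though not universal) pre-step is to rescale each score set to a shared range first. A minimal sketch using min-max normalization, with the same illustrative scores as before:

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale a score dict to [0, 1] so different retrievers' scores
    become roughly comparable before a weighted combination."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: no spread to normalize over.
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

lexical = {"A": 0.9, "B": 0.7, "C": 0.6}    # BM25-style scores, unbounded scale
vector  = {"D": 0.95, "A": 0.92, "E": 0.88}  # cosine similarities in [-1, 1]
print(min_max_normalize(lexical))  # A -> 1.0, C -> 0.0
```

Min-max scaling is sensitive to outliers and to how many results each retriever returns, so it is a starting point for tuning, not a fix on its own.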
The next step is understanding how to tune these fusion algorithms, like Reciprocal Rank Fusion, for your specific dataset and query types.