A vector database’s "accuracy" is less about hitting an exact match and more about surfacing the most relevant information, even if it’s not a perfect lexical overlap.

Let’s see what this looks like in practice. Imagine we have a simple RAG system where a user asks: "What are the symptoms of a common cold?" Our vector database is supposed to return documents about cold symptoms.

Here’s a simplified look at the data we might have indexed:

  • Document 1: "Common cold symptoms include a runny nose, sore throat, cough, and sneezing. These typically appear one to three days after exposure to a virus."
  • Document 2: "Influenza, or the flu, shares some symptoms with the common cold, such as fever and body aches. However, flu symptoms are generally more severe."
  • Document 3: "Allergies can manifest as sneezing and a runny nose, but usually don’t involve fever or body aches."
  • Document 4: "Children often experience different cold symptoms, like irritability and difficulty sleeping."

When the user asks "What are the symptoms of a common cold?", an ideal retrieval system would rank Document 1 first. But what if our vector embeddings aren’t perfectly aligned with the query’s intent?

Here’s how a typical retrieval might play out, based on semantic similarity scores (higher is better):

  1. Document 1: Score: 0.92 (Runny nose, sore throat, cough, sneezing)
  2. Document 4: Score: 0.78 (Children’s cold symptoms: irritability, sleep issues)
  3. Document 3: Score: 0.65 (Allergies: sneezing, runny nose)
  4. Document 2: Score: 0.55 (Flu: fever, body aches)

The vector database uses an embedding model to represent these documents and the query as vectors in a high-dimensional space. The "similarity" is the cosine similarity (or another metric, such as dot product or Euclidean distance) between the query vector and each document vector. Documents whose vectors lie closer to the query vector in this space are considered more relevant.
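As a sketch of what "closer" means, here is cosine similarity computed by hand on toy three-dimensional vectors. The numbers are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (invented for illustration).
query = np.array([0.9, 0.1, 0.2])
doc1 = np.array([0.8, 0.2, 0.1])   # points in nearly the same direction
doc2 = np.array([0.1, 0.9, 0.7])   # points in a very different direction

print(cosine_similarity(query, doc1))  # high score, close to 1
print(cosine_similarity(query, doc2))  # much lower score
```

Note that cosine similarity ignores vector magnitude and compares direction only, which is why it is a common default for normalized embeddings.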

The problem is, "accuracy" here isn’t a binary true/false. It’s a spectrum of relevance. We care about:

  • Recall: Did we get all the relevant documents? (e.g., if there was another document detailing cold symptoms, did we miss it?)
  • Precision: Of the documents we returned, how many were actually relevant? (e.g., if we returned a document about allergies with a high score, that’s a precision issue).
  • Ranking: Are the most relevant documents ranked highest?

To evaluate this, we don’t just look at the top result. We look at the entire ranked list.

Metrics to Measure Retrieval Performance

  1. Precision@K: This measures the proportion of relevant documents within the top K retrieved documents.

    • How to calculate: For our example, let’s say we consider Documents 1 and 4 as "relevant" to the query. If we set K=3, we retrieve Documents 1, 4, and 3.
      • Relevant documents in top 3: Document 1, Document 4.
      • Total documents in top 3: 3.
      • Precision@3 = 2 / 3 = 0.67.
    • Why it matters: It tells you how much "noise" you have in your top results, which is what the LLM will primarily see.
  2. Recall@K: This measures the proportion of all relevant documents that were found within the top K retrieved documents.

    • How to calculate: Again, assume Documents 1 and 4 are relevant. If we consider all documents in our dataset (say, 10 documents total), and we retrieve Documents 1, 4, and 3 (K=3):
      • Relevant documents found in top 3: Document 1, Document 4 (2 documents).
      • Total actual relevant documents in the dataset: 2 (Documents 1 and 4).
      • Recall@3 = 2 / 2 = 1.0.
    • Why it matters: It tells you if you’re missing important pieces of information.
  3. Mean Reciprocal Rank (MRR): This metric focuses on the rank of the first relevant document, averaged across a set of queries (the "mean"). It’s useful when you only care about getting one good answer quickly.

    • How to calculate: For each query, take 1 divided by the rank of the first relevant document, then average over all queries. For our query, Document 1 is the first relevant document, and it’s at rank 1, so its reciprocal rank is 1/1 = 1.
      • With this single query, MRR = 1.0.
    • Why it matters: If your context window is small and the LLM only sees a few documents, MRR is a good indicator of how quickly it will encounter a useful piece of context.
  4. Normalized Discounted Cumulative Gain (NDCG@K): This is a more sophisticated metric that accounts for both the relevance of documents and their position in the ranked list. It gives higher scores to highly relevant documents that appear earlier in the ranking.

    • How to calculate: This requires a "graded relevance" score for each document (e.g., 0 = irrelevant, 1 = somewhat relevant, 2 = highly relevant; fractional grades work too). Let’s say Document 1 is highly relevant (2), Document 4 is relevant (1), and Document 3 is marginally relevant (0.5).
      • Actual DCG: Sum of (relevance / log2(rank+1)) over the retrieved list. For our top 3: (2/log2(1+1)) + (1/log2(2+1)) + (0.5/log2(3+1)) = (2/1) + (1/1.58) + (0.5/2) = 2 + 0.63 + 0.25 = 2.88.
      • Ideal DCG: The DCG of the same documents re-sorted into the best possible order, i.e., by descending relevance. Here the retrieved order (2, 1, 0.5) already is the ideal order, so the ideal DCG is also 2.88.
      • NDCG@K: Actual DCG / Ideal DCG = 2.88 / 2.88 = 1.0 in this example. Any mis-ranking, such as Document 3 appearing above Document 1, would push the score below 1.
    • Why it matters: It’s the most comprehensive, as it penalizes irrelevant items appearing high up and rewards highly relevant items appearing high up.
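All four metrics can be implemented in a few lines of plain Python, with no external dependencies. The sketch below mirrors the worked example: Documents 1 and 4 form the relevant set, and the graded relevances are 2, 1, and 0.5 for Documents 1, 4, and 3 (the document IDs and grades come from this lesson’s example, not from any library’s API):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document (0 if none was retrieved)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

def dcg(relevances):
    """Discounted cumulative gain for a list of graded relevances."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(retrieved, grades, k):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    actual = dcg([grades.get(d, 0.0) for d in retrieved[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Worked example from the text: retrieval order Doc1, Doc4, Doc3, Doc2.
retrieved = ["doc1", "doc4", "doc3", "doc2"]
relevant = {"doc1", "doc4"}
grades = {"doc1": 2.0, "doc4": 1.0, "doc3": 0.5}

print(precision_at_k(retrieved, relevant, 3))  # 0.666...
print(recall_at_k(retrieved, relevant, 3))     # 1.0
print(reciprocal_rank(retrieved, relevant))    # 1.0
print(ndcg_at_k(retrieved, grades, 3))         # 1.0 (ranking is already ideal)
```

The outputs match the hand calculations above; in particular, NDCG@3 is 1.0 because the retrieved order already matches the ideal order by relevance grade.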

When evaluating your RAG system, you’ll typically run a set of queries against your vector database, compare the retrieved results against a "ground truth" of what should have been returned, and then calculate these metrics. Evaluation tooling in frameworks like LangChain, or specialized information-retrieval libraries, can help automate this.
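As a minimal sketch of such an evaluation loop, here is MRR computed over a small query set. The queries, rankings, and ground-truth labels below are invented for illustration; in practice the rankings would come from your vector database and the labels from human annotation.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

# Hypothetical evaluation set: query -> (ranked retrieval, ground-truth IDs).
eval_set = {
    "cold symptoms":    (["doc1", "doc4", "doc3"], {"doc1", "doc4"}),
    "flu vs cold":      (["doc3", "doc2", "doc1"], {"doc2"}),
    "allergy triggers": (["doc2", "doc4", "doc3"], {"doc3"}),
}

# Mean over queries: (1/1 + 1/2 + 1/3) / 3
mrr = sum(reciprocal_rank(ranked, truth)
          for ranked, truth in eval_set.values()) / len(eval_set)
print(f"MRR over {len(eval_set)} queries: {mrr:.3f}")
```

The same loop structure works for any of the metrics in this lesson: swap in Precision@K, Recall@K, or NDCG@K and average over the query set.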

The next hurdle is understanding how the chunking strategy of your documents dramatically impacts these retrieval metrics, even with the same vector embeddings.

Want structured learning?

Take the full Vector-databases course →