The most counterintuitive thing about vector distance metrics is that for many common use cases, you don’t want the "closest" vectors in the Euclidean sense.
Imagine you have a bunch of documents, each represented as a vector where each dimension corresponds to a word’s frequency. You want to find documents similar to a query document.
Here’s a simplified Python example using numpy to represent our vectors:
import numpy as np
# Document 1: "the quick brown fox"
doc1 = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# Document 2: "the lazy dog"
doc2 = np.array([1, 0, 0, 0, 1, 1, 0, 0])
# Document 3: "the quick fox jumps"
doc3 = np.array([1, 1, 0, 1, 0, 0, 1, 0])
# Query: "quick brown dog"
query = np.array([0, 1, 1, 0, 0, 1, 0, 0])
# Let's see these in action
print("Doc 1:", doc1)
print("Doc 2:", doc2)
print("Doc 3:", doc3)
print("Query:", query)
Now, let’s calculate the distances between the query and each document using different metrics.
Euclidean Distance
This is the straight-line distance between two points in space. It’s what you’d typically think of as "distance."
from scipy.spatial.distance import euclidean
dist_euclidean_1 = euclidean(query, doc1)
dist_euclidean_2 = euclidean(query, doc2)
dist_euclidean_3 = euclidean(query, doc3)
print("\nEuclidean Distances:")
print("Query vs Doc 1:", dist_euclidean_1)
print("Query vs Doc 2:", dist_euclidean_2)
print("Query vs Doc 3:", dist_euclidean_3)
In this output, the smallest Euclidean distance indicates the "closest" document: doc1 wins at √3 ≈ 1.73, with doc2 at 2.0 and doc3 at √5 ≈ 2.24.
Dot Product
The dot product measures the degree to which two vectors point in the same direction, scaled by their magnitudes. A higher dot product means they are more aligned.
dot_product_1 = np.dot(query, doc1)
dot_product_2 = np.dot(query, doc2)
dot_product_3 = np.dot(query, doc3)
print("\nDot Products:")
print("Query vs Doc 1:", dot_product_1)
print("Query vs Doc 2:", dot_product_2)
print("Query vs Doc 3:", dot_product_3)
Here, the largest dot product indicates the most similar document. doc1 scores 2 because it shares two words with the query ("quick" and "brown"), while doc2 and doc3 each share only one word and score 1. Because the dot product sums raw co-occurring counts, a long document with large counts can score highly against a short query simply by virtue of its magnitude.
Cosine Similarity
Cosine similarity is the cosine of the angle between two vectors. It’s essentially the dot product normalized by the magnitudes of the vectors. This makes it insensitive to the length of the vectors, focusing purely on orientation. A value of 1 means perfect alignment, 0 means orthogonal, and -1 means opposite directions.
from scipy.spatial.distance import cosine
# Note: scipy.spatial.distance.cosine returns 1 - cosine_similarity
# So, a smaller value from this function means higher similarity.
# For clarity, we'll calculate it directly.
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
cos_sim_1 = cosine_similarity(query, doc1)
cos_sim_2 = cosine_similarity(query, doc2)
cos_sim_3 = cosine_similarity(query, doc3)
print("\nCosine Similarities:")
print("Query vs Doc 1:", cos_sim_1)
print("Query vs Doc 2:", cos_sim_2)
print("Query vs Doc 3:", cos_sim_3)
Again, the largest cosine similarity indicates the most similar document. This is often the preferred metric for text similarity because it compares the relative mix of words rather than their absolute counts. A long document that mentions "quick" and "dog" many times can sit far from a short query in Euclidean terms, but if the two use those words in similar proportions, cosine similarity will still rate them as close.
The Mental Model: Why Cosine Wins for Text
When you represent text as vectors (like TF-IDF or word embeddings), the length of the vector often corresponds to the length of the document or the magnitude of the word counts. A very long document might have high values in many dimensions, leading to a large Euclidean distance even if its topic or word proportions are very similar to a shorter document.
Euclidean distance is sensitive to magnitude. If doc1 has 100 occurrences of "the" and your query has 1, Euclidean distance will penalize this difference heavily, even if "the" is a common stop word and shouldn’t contribute much to similarity.
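To make that concrete, here is a toy sketch (the word counts are invented for illustration) showing how a single high-count stop word dominates the Euclidean distance:

```python
import numpy as np

# Toy 3-dimensional count vectors: [count("the"), count("quick"), count("fox")]
query = np.array([1, 1, 1])    # "the quick fox"
doc   = np.array([100, 1, 1])  # same words, but "the" appears 100 times

# Euclidean distance is dominated entirely by the stop-word dimension
euclid = np.linalg.norm(query - doc)
print(euclid)  # 99.0

# Cosine similarity is distorted less, though the stop word still drags
# the vectors apart -- one reason stop-word removal and TF-IDF weighting exist
cos = np.dot(query, doc) / (np.linalg.norm(query) * np.linalg.norm(doc))
print(round(cos, 3))  # 0.589
```

Note that even cosine similarity is pulled down here, because the repeated stop word changes the vector's direction, not just its length; downweighting common words fixes this at the representation level rather than the metric level.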
The dot product is also sensitive to magnitude, but in a different way: rather than penalizing magnitude differences, it rewards large vectors. Two vectors pointing in the same direction get a higher dot product when they are longer, so a very long document that mentions many words frequently can score highly against a query simply because of its length.
Cosine similarity, by normalizing for vector length, focuses solely on the angle between vectors. This means it captures the similarity in proportions of the dimensions. For text, this translates to capturing the similarity in the mix of words or topics, regardless of how many times those words appear. This is why it’s so prevalent in information retrieval and recommendation systems.
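A quick sketch of that scale invariance, using made-up count vectors: multiplying a document vector by a constant, as if the document were simply repeated ten times, changes its Euclidean distance and dot product against a query but leaves its cosine similarity untouched.

```python
import numpy as np

def cos_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

query = np.array([1, 1, 1, 1])
short_doc = np.array([1, 2, 1, 0])
long_doc = 10 * short_doc  # same word proportions, ten times the counts

print(np.linalg.norm(query - short_doc))  # ~1.41
print(np.linalg.norm(query - long_doc))   # ~22.9 -- length change dominates
print(np.dot(query, short_doc), np.dot(query, long_doc))  # 4 vs 40
print(cos_sim(query, short_doc), cos_sim(query, long_doc))  # identical
```

Scaling a vector scales its norm by the same factor, so the factor cancels in the cosine formula; only the direction, i.e. the mix of words, survives.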
The one detail that trips people up with cosine similarity is its relationship with the dot product: when vectors are L2-normalized (their magnitude is 1), the dot product is the cosine similarity. Many vector databases and libraries therefore L2-normalize vectors up front when you select the cosine metric, so that every similarity comparison reduces to a cheap dot product.
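A minimal check of that identity, with arbitrary example vectors: after L2-normalizing both inputs, the plain dot product reproduces cosine similarity exactly.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

# Cosine similarity computed directly from the raw vectors
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize both vectors to unit length, then take a plain dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

print(cos, dot_of_units)  # both 0.6
```

This is why normalizing embeddings once at index time is a common optimization: every cosine query afterwards reduces to one dot product per stored vector.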
The next common problem you’ll encounter is dealing with high-dimensional sparsity and the resulting "curse of dimensionality."