A vector database doesn’t store your data as raw text or scalar fields; it stores it as points in a high-dimensional space, where proximity implies semantic similarity.
Let’s say you have a collection of product descriptions. You want to find products similar to "a durable, waterproof backpack for hiking." First, you need to convert these descriptions into numerical vectors (embeddings). This is typically done using a pre-trained machine learning model.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
descriptions = [
"A rugged, water-resistant backpack designed for outdoor adventures.",
"Lightweight and breathable running shoes with excellent cushioning.",
"A stylish leather handbag perfect for everyday use.",
"This heavy-duty pack is built to withstand the elements on any trail.",
"Comfortable hiking boots with ankle support and waterproof lining."
]
embeddings = model.encode(descriptions)
print(embeddings.shape)
# Output: (5, 384) - 5 descriptions, each represented by a 384-dimensional vector
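"Proximity" here is defined by a distance or similarity metric; cosine similarity is a common choice. A toy sketch with hand-picked 3-dimensional vectors (purely illustrative; the real embeddings above are 384-dimensional, and the variable names are made up for this example):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-D "embeddings" (hand-picked; real embedding models produce these)
backpack = np.array([0.9, 0.1, 0.2])
trail_pack = np.array([0.8, 0.2, 0.3])  # semantically close to 'backpack'
handbag = np.array([0.1, 0.9, 0.4])     # semantically distant

print(cosine_similarity(backpack, trail_pack) > cosine_similarity(backpack, handbag))
# prints True
```

A real embedding model arranges vectors so that this kind of comparison reflects meaning: the two backpack descriptions end up closer to each other than either is to the handbag.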
Now you have your embeddings. The next step is to store these vectors in a vector database and, crucially, index them. Without an index, searching for similar vectors would require comparing your query vector to every single vector in the database – a process that scales poorly as your dataset grows.
Vector databases use specialized indexing algorithms to speed up similarity searches. The most common type is Approximate Nearest Neighbor (ANN) search. Instead of guaranteeing the absolute closest vectors, ANN algorithms find vectors that are very likely to be among the closest, sacrificing perfect accuracy for massive performance gains.
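To make the baseline concrete, here is a minimal sketch of the exact (brute-force) search that ANN indexes avoid. The `exact_knn` helper is illustrative, and the data is random stand-in vectors rather than real embeddings:

```python
import numpy as np

def exact_knn(query, vectors, k):
    # Brute force: compare the query against every stored vector (O(N))
    dists = np.sum((vectors - query) ** 2, axis=1)  # squared Euclidean distance
    idx = np.argsort(dists)[:k]                     # k smallest, nearest-first
    return idx, dists[idx]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 384))              # stand-in for embeddings
query = vectors[42] + 0.01 * rng.normal(size=384)   # slightly perturbed copy

idx, dists = exact_knn(query, vectors, k=3)
print(idx[0])  # prints 42: the vector we perturbed is the nearest neighbor
```

This is exact and simple, but every query touches all N vectors; ANN indexes exist to avoid exactly that linear scan.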
Popular ANN index types include:
- HNSW (Hierarchical Navigable Small Worlds): This is a graph-based approach. It builds a multi-layered graph where each layer is a graph of nodes (vectors). Searching starts at the top layer (a coarse graph) and navigates down through finer layers, progressively getting closer to the target.

Configuration example (using hnswlib):

import hnswlib
import numpy as np

# Assume 'embeddings' is your numpy array of vectors (e.g., shape (5, 384))
dim = embeddings.shape[1]
num_elements = embeddings.shape[0]

# Initialize the HNSW index ('l2' = Euclidean distance)
p = hnswlib.Index(space='l2', dim=dim)
p.init_index(max_elements=num_elements, ef_construction=200, M=16)
# ef_construction: build-time search depth; M: number of neighbors per node

# Add vectors with integer labels
labels = np.arange(num_elements)
p.add_items(embeddings, labels)

# Set the search-time parameter
p.set_ef(50)  # ef: search-time search depth; higher = more accurate, slower

Why it works: HNSW creates shortcuts in the graph, allowing the search algorithm to "jump" over large parts of the index. The hierarchical structure means it can quickly narrow down the search space.
- IVF (Inverted File Index): This method partitions the vector space into a set of "centroids" (clusters). When you add a vector, it’s assigned to the nearest centroid. During a query, you first find the centroids closest to your query vector, then search only within those clusters.

Configuration example (using faiss):

import faiss

# Assume 'embeddings' is your numpy array of vectors (e.g., shape (5, 384))
dimension = embeddings.shape[1]

# The quantizer assigns vectors to cells; IndexFlatL2 uses Euclidean distance
quantizer = faiss.IndexFlatL2(dimension)

# nlist: number of clusters/cells. More clusters means finer partitions.
nlist = 100
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)

# Train the index (to find the cluster centroids), then add the vectors
index.train(embeddings)
index.add(embeddings)

# nprobe: number of cells to probe per query
index.nprobe = 10  # Example: search the 10 closest clusters

Why it works: IVF drastically reduces the number of vectors that need to be compared by pre-grouping similar vectors. Searching only within the relevant groups is much faster than scanning the entire dataset.
- ScaNN (Scalable Nearest Neighbors): Developed by Google, ScaNN uses anisotropic quantization to compress vectors and improve search accuracy and speed. It quantizes vectors in a way that accounts for the varying densities of data in different regions of the vector space.
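ScaNN's anisotropic scheme is beyond a short example, but the underlying idea of quantization (replacing each vector with the index of a nearby representative from a small codebook) can be sketched in plain NumPy. The codebook here is chosen randomly for brevity; real systems learn it, typically with k-means:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(500, 8))

# Small codebook of 16 representatives (randomly sampled here for brevity;
# production systems learn the codebook, e.g., with k-means)
codebook = vectors[rng.choice(500, size=16, replace=False)]

# Quantize: replace each vector with the index of its nearest codeword
dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
codes = np.argmin(dists, axis=1)  # one small integer per vector, not 8 floats

print(codes.shape)  # prints (500,)
```

Storing one code per vector instead of the full float array is what makes quantized indexes compact; the accuracy cost depends on how well the codebook fits the data, which is exactly what ScaNN's anisotropic loss optimizes.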
Once indexed, querying is straightforward. You embed your query, and the database uses the index to efficiently find the k nearest neighbors.
# Assume 'p' is your HNSW index and 'model' is your SentenceTransformer
query_text = "a durable, waterproof backpack for hiking"
query_embedding = model.encode(query_text)
# Search for the 2 nearest neighbors
# k: number of neighbors to return
# The result is a tuple: (distances, labels)
distances, labels = p.knn_query(query_embedding, k=2)
print("Distances:", distances)
print("Labels:", labels)
# Example Output (values will vary):
# Distances: [[0.1234, 0.4567]]
# Labels: [[0, 3]]
# You would then use these labels to retrieve the original descriptions
# For example, if label 0 corresponds to the first description:
# print(descriptions[labels[0][0]])
The core trade-off in vector indexing is between accuracy, build time, and query speed. Parameters like ef_construction and ef (for HNSW) or nlist and nprobe (for IVF) directly influence this. Higher values generally lead to more accurate results but slower indexing and/or querying.
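The accuracy side of this trade-off is usually measured as recall: the fraction of the true k nearest neighbors that the approximate search actually returns. A minimal sketch (the result lists here are hypothetical, standing in for the output of a brute-force scan and an ANN query):

```python
def recall_at_k(approx_labels, exact_labels):
    # Fraction of the true nearest neighbors the ANN search recovered
    hits = len(set(approx_labels) & set(exact_labels))
    return hits / len(exact_labels)

# Hypothetical results for one query with k=5
exact = [12, 7, 33, 2, 91]    # ground truth from a brute-force scan
approx = [12, 7, 33, 91, 58]  # what the ANN index returned

print(recall_at_k(approx, exact))  # prints 0.8 (4 of the 5 true neighbors)
```

Tuning ef or nprobe upward typically pushes recall toward 1.0 at the cost of query latency, so recall measured against an exact scan is the standard way to pick those values.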
The next concept you’ll encounter is managing these indexes at scale, including strategies for real-time updates, sharding across multiple machines, and handling data deletion.