The surprising truth about vector databases is that storing vectors is the easy part; they pair each embedding with document metadata and references, and their core job is to efficiently find the stored embeddings closest to a query vector.
Let’s see this in action. Imagine we have a collection of documents about different dog breeds. We want to be able to ask questions like "What are some good guard dogs?" and get relevant dog breeds back, even if the exact phrase "guard dogs" isn’t in the document.
First, we need to embed our documents. This means converting the text into numerical representations (vectors) that capture their semantic meaning. Libraries like sentence-transformers do this.
from sentence_transformers import SentenceTransformer
from langchain_community.embeddings import SentenceTransformerEmbeddings
# Load a pre-trained model
model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)
# Example document
document = "The German Shepherd is a highly intelligent and versatile breed, often used as a police and military dog due to its loyalty and protective nature."
# Get the embedding
embedding = embedding_model.encode(document)
print(f"Embedding shape: {embedding.shape}")
print(f"First 5 dimensions: {embedding[:5]}")
This code will output something like:
Embedding shape: (384,)
First 5 dimensions: [-0.02345678 0.01234567 -0.00987654 0.04567890 -0.01112233]
Now, we need a vector database to store these embeddings and their associated document metadata. For this example, we’ll use ChromaDB, a popular in-memory or file-based vector store.
from langchain_chroma import Chroma
from langchain_core.documents import Document
# Create some sample documents
documents = [
    Document(page_content="The German Shepherd is a highly intelligent and versatile breed, often used as a police and military dog due to its loyalty and protective nature.", metadata={"breed": "German Shepherd"}),
    Document(page_content="The Golden Retriever is known for its friendly and gentle demeanor, making it an excellent family pet and a popular choice for assistance roles.", metadata={"breed": "Golden Retriever"}),
    Document(page_content="The Rottweiler is a powerful and robust breed with a calm and confident disposition, often employed as a guard dog.", metadata={"breed": "Rottweiler"}),
    Document(page_content="The Poodle is exceptionally smart and trainable, excelling in obedience and agility competitions.", metadata={"breed": "Poodle"}),
    Document(page_content="The Doberman Pinscher is a sleek and energetic dog, recognized for its alertness and suitability as a guard dog.", metadata={"breed": "Doberman Pinscher"}),
]
# Initialize the embedding function for LangChain
langchain_embedding_function = SentenceTransformerEmbeddings(model_name=model_name)
# Create a Chroma vector store
vectorstore = Chroma.from_documents(
    documents,
    langchain_embedding_function,
    persist_directory="./chroma_db",  # Optional: to save the database
)
print("Vector store created and documents added.")
This sets up our vector store. Chroma.from_documents takes our Document objects and the langchain_embedding_function. For each document, it generates an embedding and stores it along with the document’s page_content and metadata in Chroma. The persist_directory argument tells Chroma to save the index and data to disk, so it’s not lost when the script ends.
Now, the core functionality: querying. We want to find documents semantically similar to a query.
# Define a query
query = "Which dogs are good for protecting a home?"
# Perform a similarity search
results = vectorstore.similarity_search(query, k=2) # k is the number of results to return
print(f"\nQuery: '{query}'")
print("\nSearch Results:")
for doc in results:
    print(f"- Breed: {doc.metadata.get('breed')}, Content: {doc.page_content[:50]}...")
This query will likely return the German Shepherd and Rottweiler, even though the query doesn’t explicitly mention them. The similarity_search function takes our query, generates its embedding using the same langchain_embedding_function, and then queries the Chroma index to find the embeddings (and thus, the documents) that are closest in vector space to the query embedding. The k=2 parameter specifies that we want the top 2 most similar results.
The mental model here is that of a multi-dimensional space. Each document and query is a point in this space, represented by its vector. Documents with similar meanings are located near each other. The vector database is essentially a highly optimized index for finding nearest neighbors in this space. It doesn’t compute a similarity score against every stored vector for every query; it uses sophisticated indexing structures (like HNSW or IVF) to quickly prune the search space and identify candidates.
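To make the nearest-neighbor idea concrete, here is a minimal brute-force sketch using NumPy and made-up 3-dimensional toy vectors (real embeddings have hundreds of dimensions, and a real index like HNSW exists precisely to avoid this exhaustive scan):

```python
import numpy as np

# Toy document embeddings (3 dimensions instead of 384, purely illustrative)
doc_vectors = np.array([
    [0.9, 0.1, 0.0],   # a "guard dog"-flavored document
    [0.1, 0.9, 0.0],   # a "family pet"-flavored document
    [0.8, 0.2, 0.1],   # another "guard dog"-flavored document
])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vector = np.array([0.85, 0.15, 0.05])  # stands in for an embedded query

# Brute force: score every stored vector against the query...
scores = [cosine_similarity(query_vector, v) for v in doc_vectors]

# ...and keep the top-k indices, highest similarity first
k = 2
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
print(top_k)  # the two "guard dog"-flavored vectors win
```

An ANN index answers the same top-k question, but by walking a graph or inverted lists instead of scoring every vector, trading a small amount of recall for a large speedup.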
The metadata fields are crucial. While the vector embedding captures the semantic meaning, the metadata provides structured, searchable information. This allows for hybrid search approaches where you can filter results based on metadata before or after the vector similarity search, or combine both. For example, you might search for dogs that are good guard dogs and have a metadata tag of "medium size".
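The interplay between metadata and similarity is easy to sketch without a real database. The following toy example (hypothetical records and vectors, brute-force scoring) pre-filters candidates on a metadata field and then ranks the survivors by vector similarity, which is conceptually what a filtered query does inside the store:

```python
import numpy as np

# Hypothetical records: each pairs a toy 2-D embedding with metadata
records = [
    {"breed": "Rottweiler", "size": "large", "vector": np.array([0.9, 0.1])},
    {"breed": "Standard Schnauzer", "size": "medium", "vector": np.array([0.85, 0.15])},
    {"breed": "Poodle", "size": "medium", "vector": np.array([0.7, 0.3])},
    {"breed": "Golden Retriever", "size": "large", "vector": np.array([0.2, 0.8])},
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.1])  # stands in for the embedded "guard dog" query

# Step 1: pre-filter on structured metadata
candidates = [r for r in records if r["size"] == "medium"]

# Step 2: rank only the survivors by vector similarity
ranked = sorted(candidates, key=lambda r: cosine(query, r["vector"]), reverse=True)
print([r["breed"] for r in ranked])
```

With Chroma via LangChain, the same idea is exposed as the `filter` argument to `similarity_search` (e.g. `filter={"size": "medium"}`), assuming a `size` field was stored in each document’s metadata.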
What most people don’t realize is that the choice of embedding model is paramount and directly dictates the quality of your search results. A model trained on a broad corpus might do well generally, but a model fine-tuned on a specific domain (e.g., legal texts, medical abstracts) will yield significantly better semantic understanding and retrieval for queries within that domain. The vector database is just the efficient retrieval mechanism; the semantic intelligence comes from the embeddings.
The next step is often integrating this into a larger retrieval-augmented generation (RAG) pipeline, where the retrieved documents are fed to a Large Language Model (LLM) to generate a coherent answer.
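As a rough sketch of that hand-off, the snippet below assembles a typical RAG prompt from retrieved text. The retrieved snippets are stand-ins for what `similarity_search` would return, and the template is just one common pattern, not a prescribed format:

```python
# Stand-ins for documents returned by a vector store similarity search
retrieved = [
    "The Rottweiler is a powerful and robust breed, often employed as a guard dog.",
    "The Doberman Pinscher is recognized for its alertness and suitability as a guard dog.",
]

question = "Which dogs are good for protecting a home?"

# A typical RAG prompt: retrieved context first, then the user's question
context = "\n\n".join(f"[{i + 1}] {snippet}" for i, snippet in enumerate(retrieved))
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\n"
    "Answer:"
)
print(prompt)
# From here, the assembled prompt would be sent to an LLM via whatever
# chat/completion API the pipeline uses.
```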