Setting up Chroma locally is surprisingly easy, but the real trick is understanding how it manages its data persistence and retrieval, which often trips people up during development.
Let’s get Chroma running right now.
First, install the Python client:
pip install chromadb
Now, let’s spin up a basic Chroma client and add some data. This is where you’ll see the core interaction.
import chromadb
# This is the simplest way to start a client.
# It defaults to an in-memory database.
client = chromadb.Client()
# Create a collection. Think of this like a table in a relational database.
collection = client.create_collection("my_documents")
# Add some documents and their embeddings.
# In a real app, you'd generate embeddings using a model.
# For this example, we'll use dummy embeddings.
collection.add(
    documents=["This is the first document.", "This is the second document."],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["doc1", "doc2"]
)
print("Documents added successfully.")
# Perform a query
results = collection.query(
    query_texts=["This is a query document."],
    n_results=1
)
print("Query results:", results)
If you run this, you’ll see the documents added and a query result. This is the in-memory mode. It’s fast for testing, but the data vanishes when the script ends.
For local development where you want to keep your data between runs, you need persistence. Chroma offers a simple way to do this by specifying a directory.
Modify the client initialization like this:
import chromadb
import os
# Define a directory for persistent storage.
# Chroma will create it if it doesn't already exist.
persistent_path = "./chroma_data"
os.makedirs(persistent_path, exist_ok=True)
# Initialize the client with the persistent path.
client = chromadb.PersistentClient(path=persistent_path)
# You can then get or create collections as usual.
# If the collection exists in the persistent storage, it will be loaded.
# get_or_create_collection loads the collection if it already exists
# in the persistent store, and creates it otherwise. This avoids a
# bare try/except around get_collection.
collection = client.get_or_create_collection("my_documents")

# Only seed the collection on the first run. Re-adding the same ids on
# a later run is at best redundant and may error, depending on your
# Chroma version.
if collection.count() == 0:
    collection.add(
        documents=["This is the first document.", "This is the second document."],
        metadatas=[{"source": "doc1"}, {"source": "doc2"}],
        ids=["doc1", "doc2"]
    )
    print("Initial data added.")
else:
    print("Collection 'my_documents' loaded from persistence.")
# Perform a query
results = collection.query(
    query_texts=["What is the content?"],
    n_results=1
)
print("Query results:", results)
Now, when you run this script, the data is written to the ./chroma_data directory. If you stop the script and run it again, Chroma loads the existing data from that directory instead of starting fresh. You’ll see a chroma.sqlite3 file appear there, alongside UUID-named subdirectories containing the vector index files.
The PersistentClient manages a SQLite database (chroma.sqlite3) that holds collection definitions, metadata, documents, and embeddings. The vector index itself is an HNSW index (built on hnswlib) that Chroma persists as binary files in those UUID-named subdirectories. (Earlier Chroma releases used DuckDB with Parquet files for storage; since the 0.4 rewrite, SQLite plus an on-disk HNSW index is the layout you’ll actually see.) This architecture keeps data loading and querying fast, even as your collections grow.
What most developers miss is that Chroma doesn’t just store raw text. It stores your original documents, their associated metadata, and their vector embeddings. When you query, Chroma first uses the vector embeddings to find semantically similar items. Then, it retrieves the original documents and metadata for those top results. The ids you provide are crucial for uniquely identifying each piece of data and ensuring you can retrieve it later.
When you use client.get_or_create_collection("my_collection") with a PersistentClient, Chroma checks the specified path for a collection with that name. If it finds the necessary files (like the SQLite entry and associated data directories for that collection), it loads the existing collection. If not, it creates a new one and initializes its storage within the path. This is how your data survives restarts.
The real power comes from how Chroma handles large-scale vector similarity search efficiently. It doesn’t brute-force compare your query vector against every single vector in your database. Instead, it uses an approximate nearest-neighbour index (HNSW, which you can configure) to quickly narrow down the search space. This means that even with millions of embeddings, your queries remain performant, at the cost of a small, tunable trade-off between speed and recall.
The next hurdle you’ll likely face is understanding how to configure and tune the vector index for optimal performance and recall.