A vector database stores vectors together with metadata and, crucially, builds an index over those vectors; it's that index, not a brute-force scan of every vector, that answers your queries.
Let’s look at a real-world scenario. Imagine you’re building a recommendation engine for a large e-commerce site. You have millions of products, each represented by a vector embedding derived from its description, images, and user interaction data. You also have millions of users, each with their own vector embedding representing their preferences. You want to find products similar to what a user likes, or find users who like similar products.
Here’s how you might set up collections and namespaces in a vector database like Pinecone, which is a common choice for this kind of work. (A terminology note: Pinecone itself calls the live, queryable container an index and reserves “collection” for a static snapshot of one; “collection” is used in the generic sense here.)
First, the Collection. Think of a collection as a top-level container for a specific type of data. It’s where you define the fundamental characteristics of the vectors you’ll be storing.
{
  "name": "product-embeddings",
  "dimension": 1536,
  "metric": "cosine",
  "pods": {
    "replicas": 1,
    "shards": 2,
    "environment": "us-west1-gcp",
    "index_type": "regular"
  }
}
In this example:
- name: This is straightforward; product-embeddings tells us what this collection holds.
- dimension: This is crucial. It must match the dimensionality of your vector embeddings. If your embedding model outputs 1536-dimensional vectors (like many OpenAI models), you must set this to 1536. A mismatch here is a common source of errors.
- metric: This defines how similarity is calculated. cosine is popular for text and image embeddings because it measures the angle between vectors, indicating similarity in direction regardless of magnitude. Other options like dotproduct or euclidean might be used for different types of embeddings.
- pods, replicas, shards, environment, index_type: These are operational settings related to performance, availability, and where your data lives. Replicas ensure high availability; shards help distribute the data and query load for scalability.
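To make the metric choice concrete, here is a small, dependency-free sketch (illustrative only, not tied to any client library) of how the three metrics behave on a pair of vectors that point in the same direction but differ in magnitude:

```python
import math

def cosine(a, b):
    # Angle-based similarity: magnitude-invariant, range [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dot_product(a, b):
    # Grows with magnitude as well as alignment.
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # A distance, not a similarity: 0 means identical, larger means farther apart.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, double the magnitude
print(cosine(a, b))       # 1.0 -- identical direction, so maximal similarity
print(dot_product(a, b))  # 10.0 -- inflated by b's larger magnitude
print(euclidean(a, b))    # ~2.236 -- nonzero, because the endpoints differ
```

This is why cosine is the usual default for text embeddings: two documents about the same topic should score as similar even if one embedding happens to have a larger norm.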
Now, Namespaces. Within a collection, namespaces provide a way to logically partition your data. This is incredibly useful for separating different types of data or different versions of your data without needing separate collections, which can be more costly and complex to manage.
Let’s say you want to store embeddings for both products and users within the same collection to enable cross-referencing, but you want to keep them logically distinct. Or perhaps you have different categories of products.
To add data to namespaces, you’d use an upsert operation:
index.upsert(
    vectors=[
        ("product_id_123", [0.1, 0.2, ..., 0.9], {"category": "electronics"}),
        ("product_id_456", [0.3, 0.4, ..., 0.7], {"category": "apparel"}),
    ],
    namespace="products"
)

index.upsert(
    vectors=[
        ("user_id_abc", [0.5, 0.6, ..., 0.1], {"segment": "new_customer"}),
    ],
    namespace="users"
)
Here, the namespace argument tells the database which logical partition to put the vectors into. products and users are distinct namespaces within the product-embeddings collection.
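To see the partitioning model itself, here is a hypothetical in-memory toy. Everything in it (the ToyIndex class and its methods) is invented for illustration; it mimics the logical shape of namespace partitioning, not Pinecone's implementation:

```python
from collections import defaultdict

class ToyIndex:
    """Toy model of one index holding several logical partitions (namespaces)."""

    def __init__(self):
        # namespace -> {vector_id: (values, metadata)}
        self._namespaces = defaultdict(dict)

    def upsert(self, vectors, namespace=""):
        # Each vector lands only in the named partition.
        for vec_id, values, metadata in vectors:
            self._namespaces[namespace][vec_id] = (values, metadata)

    def fetch(self, ids, namespace=""):
        # Reads are scoped: ids that live in other namespaces are invisible.
        ns = self._namespaces[namespace]
        return {i: ns[i] for i in ids if i in ns}

index = ToyIndex()
index.upsert([("product_id_123", [0.1, 0.2], {"category": "electronics"})],
             namespace="products")
index.upsert([("user_id_abc", [0.5, 0.6], {"segment": "new_customer"})],
             namespace="users")

# A fetch scoped to "products" never sees the user vector:
print(index.fetch(["product_id_123", "user_id_abc"], namespace="products"))
```

The key property is that the same id-lookup request returns different results depending on the namespace it is scoped to, which is exactly the isolation the real database provides.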
Why is this powerful?
- Organization: Keeps related data together but segmented. You can query only within the products namespace to find similar products, or query across all namespaces if your use case demands it (though this is less common and can be slower).
- Isolation: Queries can be scoped to a namespace. When you search for products similar to a user’s preference vector, you’d typically query the products namespace. This avoids accidentally matching user embeddings against other user embeddings, as could happen if everything lived in the same namespace.
- Cost/Resource Management: In some systems, namespaces can influence resource allocation or access control, though their primary role is logical partitioning.
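The isolation point can be made concrete with a toy scoped search: brute-force cosine similarity over a single partition's vectors. This is only a sketch of what a namespace-scoped query does logically; real engines answer such queries server-side with approximate-nearest-neighbor indexes, not a linear scan:

```python
import math

def top_k_in_namespace(namespaces, namespace, query_vec, k=2):
    # Score only the vectors inside the requested partition.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    scored = [(vec_id, cos(vec, query_vec))
              for vec_id, vec in namespaces[namespace].items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

namespaces = {
    "products": {"p1": [1.0, 0.0], "p2": [0.0, 1.0]},
    "users":    {"u1": [1.0, 0.0]},  # points the same way as p1, but is never scanned
}

# Even though u1 would score just as well as p1, a query scoped to
# "products" can only ever return product ids:
print(top_k_in_namespace(namespaces, "products", [1.0, 0.1], k=1))
```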
Consider a scenario where you’re updating product embeddings for a new version of your recommendation model. You could create a new namespace, say products_v2, upload the new embeddings there, test your recommendations against this new namespace, and then, once confident, switch your application to query products_v2 and potentially delete the old products namespace.
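One way to wire that cutover is to route every read through a single configuration value. The sketch below assumes a client whose query call accepts vector, top_k, and namespace keyword arguments; treat the exact signature as an assumption about your client library, not a guarantee:

```python
# Hypothetical blue/green rollout: the active namespace is configuration,
# so switching models is a one-line config change, not a data migration.
ACTIVE_PRODUCT_NAMESPACE = "products"  # flip to "products_v2" once validated

def recommend(index, user_vector, top_k=10):
    # All reads go through the config value; at cutover, nothing else changes.
    return index.query(vector=user_vector, top_k=top_k,
                       namespace=ACTIVE_PRODUCT_NAMESPACE)
```

Because old and new embeddings coexist in separate namespaces, you can run the two side by side, compare recommendation quality offline, and roll back by flipping the constant again.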
The most surprising thing about this setup is how much flexibility namespaces offer for managing data evolution and different data types without the overhead of creating and managing entirely separate collections for every minor variation. You can have dozens of namespaces within a single collection, each holding distinct subsets of your data, all sharing the same underlying index configuration and infrastructure. This makes it incredibly efficient for managing complex, multi-faceted data landscapes.
The next logical step after mastering schema and namespaces is understanding how to efficiently query these partitioned datasets, particularly when performing similarity searches across different namespaces or filtering results based on metadata.