Pinecone indexes are not just storage containers; they are active, queryable entities that continuously rebalance their data to maintain optimal query performance.
Let’s see what this looks like in practice. Imagine you’re building a recommendation engine for an e-commerce site. You’ve got millions of product embeddings, and you want to find similar products for a given item.
First, you need an index. This is where your vectors will live.
from pinecone import Pinecone, ServerlessSpec
# Initialize the Pinecone client
pc = Pinecone(api_key="YOUR_API_KEY")
# Define index name
index_name = "product-recommender"
# Check if index exists (names() is a method on the returned list)
if index_name not in pc.list_indexes().names():
    # Create a new index if it doesn't exist
    pc.create_index(
        name=index_name,
        dimension=1536,  # Example dimension for OpenAI embeddings
        metric="cosine",  # Cosine similarity is common for embeddings
        spec=ServerlessSpec(cloud="aws", region="us-west-2"),
    )
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")
# Connect to the index
index = pc.Index(index_name)
print(index.describe_index_stats())
This code snippet initializes Pinecone, checks for an existing index named product-recommender, creates it if absent with a specified dimension and metric, and then connects to it. The describe_index_stats() call will show you the initial state – likely empty.
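To make the metric="cosine" choice concrete, here is a minimal pure-Python sketch of cosine similarity. The helper name cosine_similarity is ours, chosen for illustration; it is not part of the Pinecone client, which computes scores server-side.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction, different magnitude
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the score depends only on direction, two vectors that differ only in scale are treated as identical under this metric.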
Now, you’ll be inserting data. This is where namespaces become crucial. Namespaces allow you to logically partition your data within a single index. Think of them as sub-databases or categories.
# Example data for two different product categories.
# Placeholder vectors only: constant vectors all point in the same direction,
# so with the cosine metric every score comes out near 1.0. Use real embeddings in practice.
products_electronics = [
    {"id": "elec-001", "values": [0.1]*1536, "metadata": {"name": "Laptop X", "category": "electronics"}},
    {"id": "elec-002", "values": [0.2]*1536, "metadata": {"name": "Smartphone Y", "category": "electronics"}}
]
products_apparel = [
    {"id": "app-001", "values": [0.3]*1536, "metadata": {"name": "T-Shirt Z", "category": "apparel"}},
    {"id": "app-002", "values": [0.4]*1536, "metadata": {"name": "Jeans W", "category": "apparel"}}
]
# Upsert data into different namespaces
index.upsert(vectors=products_electronics, namespace="electronics")
index.upsert(vectors=products_apparel, namespace="apparel")
print("Data upserted into namespaces.")
print(index.describe_index_stats())
Here, we’re upserting (inserting or updating) two sets of product vectors. Notice the namespace argument: electronics go into the "electronics" namespace and apparel into the "apparel" namespace, which keeps them separate for more targeted queries. Once the writes have been indexed (upserts are eventually consistent, so this can take a moment), the describe_index_stats() output shows a vector count for each namespace.
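The upsert and namespace semantics can be pictured with a toy in-memory model. The ToyIndex class below is purely hypothetical and written for illustration. It is not how Pinecone stores data, and it mimics only two behaviours: writes are keyed by ID within a namespace, and counts are reported per namespace.

```python
from collections import defaultdict

class ToyIndex:
    """Hypothetical stand-in for an index; not the Pinecone client."""

    def __init__(self):
        # namespace name -> {vector id -> record}
        self._namespaces = defaultdict(dict)

    def upsert(self, vectors, namespace=""):
        # Insert new IDs and overwrite existing ones (update + insert)
        for record in vectors:
            self._namespaces[namespace][record["id"]] = record
        return len(vectors)

    def namespace_counts(self):
        return {name: len(records) for name, records in self._namespaces.items()}

toy = ToyIndex()
toy.upsert([{"id": "elec-001", "values": [0.1, 0.2]}], namespace="electronics")
toy.upsert([{"id": "app-001", "values": [0.3, 0.4]}], namespace="apparel")
# Re-upserting an existing ID replaces the record instead of adding a second one
toy.upsert([{"id": "elec-001", "values": [0.9, 0.8]}], namespace="electronics")
counts = toy.namespace_counts()
print(counts)
```

The same ID could exist in both namespaces without conflict, since each namespace keeps its own ID space in this model.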
When you query, you can also specify a namespace to narrow down your search.
# Example query vector for a laptop
query_vector = [0.15]*1536
# Query for similar electronics products
results_electronics = index.query(
    vector=query_vector,
    top_k=2,
    namespace="electronics",  # Only search within the 'electronics' namespace
    include_metadata=True
)
print("\nSimilar electronics products:")
for match in results_electronics.matches:
    print(f"ID: {match.id}, Score: {match.score}, Name: {match.metadata['name']}")
# Query the apparel namespace with the same vector
# (matches come only from 'apparel', however weak the similarity)
results_apparel = index.query(
    vector=query_vector,
    top_k=2,
    namespace="apparel",  # Only search within the 'apparel' namespace
    include_metadata=True
)
print("\nSimilar apparel products (from electronics query vector):")
for match in results_apparel.matches:
    print(f"ID: {match.id}, Score: {match.score}, Name: {match.metadata['name']}")
In these queries, we’re asking for the top 2 most similar items. Crucially, by specifying namespace="electronics", we ensure the search only considers vectors within that namespace. A query always targets exactly one namespace: if we omit the namespace parameter, Pinecone searches only the default namespace, not every namespace in the index. To search several namespaces, issue one query per namespace and merge the results yourself.
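As a rough sketch of what scoping a query to a namespace means, the snippet below runs a brute-force top-k search over a plain dictionary. Everything here is hypothetical (the store layout, the scoped_query helper, the two-dimensional vectors), and Pinecone itself uses approximate nearest-neighbour search rather than an exhaustive scan.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def scoped_query(store, vector, top_k, namespace):
    """Brute-force top_k search restricted to one namespace of `store`."""
    candidates = store.get(namespace, {})
    scored = [(vec_id, cosine(vector, values)) for vec_id, values in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical two-dimensional "embeddings", grouped by namespace
store = {
    "electronics": {"elec-001": [1.0, 0.0], "elec-002": [0.8, 0.6]},
    "apparel": {"app-001": [0.0, 1.0], "app-002": [0.6, 0.8]},
}

hits = scoped_query(store, vector=[1.0, 0.1], top_k=2, namespace="electronics")
for vec_id, score in hits:
    print(f"ID: {vec_id}, Score: {score:.3f}")
```

Vectors in "apparel" are never scored for this query, so they cannot appear in the results no matter how close they are.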
The power of namespaces is in managing complexity and improving query relevance. Without them, if you had a massive index with diverse data types (images, text, products), a query for a "red dress" might incorrectly return similar-looking car parts if their embeddings happened to align in vector space. Namespaces prevent this cross-contamination. You can also delete entire namespaces, which is much faster than deleting individual vectors across the whole index.
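Dropping a whole namespace can be sketched as a single call. The wrapper name drop_namespace is hypothetical. It assumes that index is a Pinecone index handle like the one created earlier, and that its delete method accepts delete_all=True together with a namespace. The FakeIndex stub exists only so the sketch runs without credentials.

```python
def drop_namespace(index, namespace):
    """Delete every vector in `namespace`, leaving other namespaces untouched."""
    if not namespace:
        raise ValueError("refusing to delete without an explicit namespace")
    return index.delete(delete_all=True, namespace=namespace)

class FakeIndex:
    """Stub that records calls; used only to exercise the sketch offline."""

    def __init__(self):
        self.calls = []

    def delete(self, **kwargs):
        self.calls.append(kwargs)
        return {}

fake = FakeIndex()
drop_namespace(fake, "apparel")
print(fake.calls)
```

With a real client, the call would be drop_namespace(index, "apparel"). The guard against an empty name is a design choice of this sketch, so the default namespace is never wiped by accident.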
What most people don’t realize is that namespaces are partitions inside a single index, not separate indexes: they all share the index’s dimension and metric. How much one namespace can affect another depends on the index type. On pod-based indexes, all namespaces share the same pods, so loading a massive amount of data into one namespace can still indirectly slow queries in the others. On serverless indexes like the one created above, a query generally reads only the data for its target namespace, so this kind of interference is much less of a concern.
If you don’t specify a namespace during an upsert, the vectors land in the index’s default namespace. This is convenient for simple use cases but quickly becomes unmanageable for complex applications.
The next step is to explore index configuration for performance tuning, such as replica counts and pod types on pod-based indexes.