pgvector is a PostgreSQL extension that lets you store and search high-dimensional vectors, which are the core of modern AI applications like recommendation engines and semantic search.
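Before you can declare vector columns, the extension has to be enabled in the database (pgvector must already be installed on the server):

```sql
-- Enable the pgvector extension (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;
```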
Here’s a basic users table with a profile_vector column that stores embeddings (vectors) for each user:
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT,
profile_vector vector(3) -- Storing 3-dimensional vectors
);
Now, let’s insert some user data with their corresponding vectors:
INSERT INTO users (name, profile_vector) VALUES
('Alice', '[1,2,3]'),
('Bob', '[4,5,6]'),
('Charlie', '[7,8,9]');
To find users whose profile vectors are most similar to a query vector, say [1.5, 2.5, 3.5], we use the <=> operator, which computes cosine distance. (pgvector also provides <-> for Euclidean/L2 distance and <#> for negative inner product.) This operator calculates the distance between two vectors, and a smaller value means they are more similar.
SELECT id, name, profile_vector <=> '[1.5, 2.5, 3.5]' AS distance
FROM users
ORDER BY distance
LIMIT 5;
This query will return Alice, Bob, and Charlie, ordered by how close their profile_vector is to [1.5, 2.5, 3.5]. Alice will be first because her vector [1,2,3] points in nearly the same direction as the query vector, giving the smallest cosine distance.
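Cosine distance is only one of the metrics pgvector ships with. As a sketch, the same query can be written with Euclidean (L2) distance by swapping the operator:

```sql
-- Euclidean (L2) distance instead of cosine distance
SELECT id, name, profile_vector <-> '[1.5, 2.5, 3.5]' AS l2_distance
FROM users
ORDER BY l2_distance
LIMIT 5;
```

Note that <#> returns the *negative* inner product, so that an ascending ORDER BY still puts the best match first.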
The problem pgvector solves is the inefficiency of performing similarity searches on large datasets of vectors using traditional database methods. Imagine trying to calculate the distance between your query vector and every single vector in a table with millions of rows using standard SQL. It’s computationally prohibitive. pgvector introduces specialized index types, like ivfflat and hnsw, that drastically speed up these nearest neighbor searches. These indexes don’t guarantee finding the absolute nearest neighbor but provide a very high probability of finding it, making them suitable for most real-world applications where near-perfect recall is acceptable in exchange for a massive performance gain.
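Here is a sketch of how each index type is created. The operator class must match the distance operator you query with (vector_cosine_ops here, to pair with <=>); the parameter values shown are illustrative:

```sql
-- IVFFlat: partitions the vectors into 100 lists (clusters)
CREATE INDEX ON users USING ivfflat (profile_vector vector_cosine_ops)
WITH (lists = 100);

-- HNSW: builds a proximity graph; m and ef_construction tune
-- graph density and build-time search effort
CREATE INDEX ON users USING hnsw (profile_vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

In practice you would create only one of these per column; ivfflat builds faster and uses less memory, while hnsw typically gives better query speed/recall trade-offs.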
Internally, pgvector doesn’t alter your table; it builds a separate index structure over the vector column, implemented as a standard PostgreSQL index access method. When you create an index, pgvector analyzes the vectors and groups them based on certain criteria. For ivfflat, it partitions the vector space into clusters. For hnsw (Hierarchical Navigable Small World), it builds a graph where nodes are vectors and edges represent proximity. When you query, instead of scanning the entire dataset, the index directs the search to the most relevant partitions or graph neighborhoods, significantly reducing the number of distance calculations required.
The core idea is to trade a small amount of accuracy for a monumental leap in search speed. You control this trade-off through index parameters. For ivfflat, the lists parameter dictates how many partitions (lists) are created. More lists mean finer-grained partitions and potentially better accuracy but also more overhead. The probes parameter in a query controls how many of these lists are examined. Increasing probes improves accuracy but slows down the query. For hnsw, parameters like m (maximum number of neighbors for each node in the graph) and ef_construction (size of the dynamic list for traversing the graph during construction) influence the index’s build time, memory usage, and search performance/accuracy.
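These query-time knobs are set per session (or per transaction) with SET; the values below are illustrative:

```sql
-- IVFFlat: examine the 10 most promising lists per query (default is 1)
SET ivfflat.probes = 10;

-- HNSW: size of the dynamic candidate list at search time (default is 40)
SET hnsw.ef_search = 100;
```

Raising either value improves recall at the cost of query latency, which is the accuracy/speed trade-off described above.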
The most surprising true thing about pgvector is that it doesn’t just store vectors; it integrates them so deeply into PostgreSQL that you can combine vector search with all your existing relational data and SQL queries. You can filter by other table columns before or during the vector search, or use the results of a vector search to join with other tables. This eliminates the need to maintain separate vector databases and synchronize data, simplifying your architecture immensely.
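For example, an ordinary SQL predicate and a nearest-neighbor ordering can live in the same statement, with no second system involved:

```sql
-- Combine a relational filter with vector similarity ordering
SELECT id, name
FROM users
WHERE name <> 'Charlie'  -- ordinary SQL predicate
ORDER BY profile_vector <=> '[1.5, 2.5, 3.5]'
LIMIT 3;
```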
Most people understand that indexes speed up searches by reducing the number of rows scanned. What’s less obvious is how the ivfflat index’s probes parameter works in conjunction with the lists parameter during a query. When you specify probes, pgvector doesn’t pick probes lists at random; it first compares the query vector against each list’s centroid and then searches the probes lists whose centroids are closest to the query. This means even with a small probes value, you’re often getting a very good approximation of the true nearest neighbors, because the selection of lists is informed, not random.
The next concept you’ll likely explore is optimizing index performance for very large datasets and understanding the nuances of different distance metrics beyond cosine.