A vector database doesn’t store your data as rows and columns, but as points in a high-dimensional space.

Let’s say you have a collection of images, and you want to find all images that are visually similar to a given query image. You wouldn’t compare pixel by pixel. Instead, you’d convert each image into a numerical representation called a "vector" using a machine learning model (like a convolutional neural network). These vectors capture the semantic meaning of the images. Images with similar content will have vectors that are close to each other in this high-dimensional space. A vector database is optimized to store and query these vectors efficiently, finding the "nearest neighbors" to your query vector.
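The "nearest neighbor" idea can be sketched in plain NumPy before any database is involved. The vectors below are made up for readability (real embeddings typically have hundreds of dimensions):

```python
import numpy as np

# Pretend these are embeddings produced by a model for four images
embeddings = np.array([
    [0.9, 0.1, 0.0],   # image 0
    [0.8, 0.2, 0.1],   # image 1 (similar to image 0)
    [0.0, 0.1, 0.9],   # image 2
    [0.1, 0.0, 0.8],   # image 3 (similar to image 2)
])

query = np.array([0.88, 0.12, 0.02])  # embedding of the query image

# Euclidean (L2) distance from the query to every stored vector
distances = np.linalg.norm(embeddings - query, axis=1)

# Indices of the two nearest neighbors, closest first
top2 = np.argsort(distances)[:2]
print(top2)  # prints [0 1]
```

A vector database does conceptually the same thing, but over millions of vectors and without scanning every one of them.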

Here’s a Python client example using the pymilvus library to interact with a Milvus vector database. First, ensure you have Milvus installed and running, and then install the client:

pip install pymilvus

Now, let’s connect to the database and perform some operations.

from pymilvus import connections, CollectionSchema, FieldSchema, DataType, Collection

# 1. Connect to Milvus
# Assuming Milvus is running on localhost:19530
connections.connect("default", host="localhost", port="19530")
print("Connected to Milvus!")

# 2. Define the schema for your collection
# We'll store a unique ID, a vector, and some metadata
id_field = FieldSchema(
    name="id",
    dtype=DataType.INT64,
    is_primary=True,
    auto_id=True,
    description="Primary key for the entity"
)
vector_field = FieldSchema(
    name="embedding",
    dtype=DataType.FLOAT_VECTOR,
    dim=8,  # The dimension of your vectors
    description="Vector embedding of the entity"
)
text_field = FieldSchema(
    name="text_data",
    dtype=DataType.VARCHAR,
    max_length=256,
    description="Associated text data"
)

schema = CollectionSchema(
    fields=[id_field, vector_field, text_field],
    description="Collection for storing text embeddings"
)

# 3. Create the collection if it doesn't exist yet
from pymilvus import utility  # utility.has_collection checks for existence

collection_name = "my_text_embeddings"
if not utility.has_collection(collection_name):
    collection = Collection(
        name=collection_name,
        schema=schema,
        using='default'
    )
    print(f"Collection '{collection_name}' created.")
else:
    collection = Collection(collection_name)
    print(f"Collection '{collection_name}' already exists.")

# 4. Prepare data to insert
# Each element is a dictionary representing an entity
# The 'embedding' key must match the name of your vector field
data_to_insert = [
    {"embedding": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], "text_data": "first document"},
    {"embedding": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2], "text_data": "second document"},
    {"embedding": [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85], "text_data": "third document"},
]

# 5. Insert data into the collection
mr = collection.insert(data_to_insert)
print(f"Inserted {mr.insert_count} entities.")

# Inserted data is buffered in memory first; flush() seals the current
# segment and persists the inserted entities to storage.
collection.flush()
print("Collection flushed.")

# 6. Create an index for efficient searching
# You need an index on your vector field to perform similarity searches.
# Common index types include 'FLAT', 'IVF_FLAT', and 'HNSW'.
# 'metric_type' specifies how distance is calculated (e.g., 'L2' for Euclidean, 'IP' for Inner Product).
index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024} # Parameters specific to the index type
}
index_name = "embedding_index"
if not collection.has_index(index_name=index_name):
    collection.create_index(field_name="embedding", index_params=index_params, index_name=index_name)
    print(f"Index '{index_name}' created on 'embedding' field.")
else:
    print(f"Index '{index_name}' already exists.")

# 7. Load the collection into memory for searching
# Collections must be loaded before they can be searched.
collection.load()
print("Collection loaded into memory.")

# 8. Perform a similarity search
# Define your query vector. This would typically come from an embedding model.
query_vector = [[0.12, 0.22, 0.32, 0.42, 0.52, 0.62, 0.72, 0.82]] # Similar to the first and third documents

# Define search parameters
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10} # Number of clusters to search (for IVF_FLAT)
}

# Specify how many results you want (top_k)
top_k = 2

# Perform the search
results = collection.search(
    data=query_vector,
    anns_field="embedding",  # The vector field to search against
    param=search_params,
    limit=top_k,
    expr=None,  # Optional: filter expression, e.g., "text_data == 'some_value'"
    output_fields=["text_data"] # Fields to return along with the search results
)

# Print the search results
print("\nSearch Results:")
for hits in results:
    for hit in hits:
        print(f"  ID: {hit.id}, Distance: {hit.distance:.4f}, Text: {hit.entity.get('text_data')}")

# 9. Clean up (optional)
# collection.drop()
# print(f"Collection '{collection_name}' dropped.")
# connections.disconnect("default")
# print("Disconnected from Milvus.")

The core idea behind vector databases is to make similarity search, which is computationally expensive with traditional methods, fast and scalable. They achieve this by:

  1. Vector Embeddings: Converting your data (text, images, audio, etc.) into dense numerical vectors using machine learning models. These vectors capture the semantic meaning.
  2. High-Dimensional Indexing: Employing specialized data structures and algorithms (like Hierarchical Navigable Small Worlds (HNSW) or Inverted File Index (IVF)) to organize these high-dimensional vectors. These indexes allow for approximate nearest neighbor (ANN) searches, which are much faster than exact searches, with a very small trade-off in accuracy.
  3. Optimized Storage and Retrieval: Designed to handle massive amounts of vector data and retrieve similar vectors with low latency.
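The IVF idea from point 2 — partition the vectors into clusters up front, then scan only the few clusters nearest the query — can be illustrated with a toy NumPy sketch. The centroid selection and the `ivf_search` helper are simplifications invented for this sketch; a real IVF index trains its `nlist` centroids with k-means:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8)).astype(np.float32)

# "Training": pick 4 crude centroids (a real index would run k-means;
# we just grab every 250th vector to keep the sketch short)
centroids = vectors[::250]

# Assign every vector to its nearest centroid (the "inverted lists")
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

def ivf_search(query, nprobe=2, top_k=3):
    # Probe only the nprobe clusters whose centroids are closest to the query
    cluster_order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidate_ids = np.where(np.isin(assignments, cluster_order))[0]
    # Exact distances, but only over the candidates, not all 1000 vectors
    dists = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
    return candidate_ids[np.argsort(dists)[:top_k]]

query = vectors[42]            # reuse a stored vector as the query
print(ivf_search(query))       # vector 42 comes back as its own nearest neighbor
```

This is why the search is *approximate*: a true nearest neighbor sitting in an unprobed cluster is simply never examined, and raising `nprobe` trades speed back for recall.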

The pymilvus client abstracts away much of the complexity. You define your data schema, including the FLOAT_VECTOR type for your embeddings and their dimension (dim). Creating a Collection is like creating a table in a relational database. Inserting data involves providing a list of dictionaries where each dictionary represents an entity with its vector and any associated metadata.

Crucially, before you can search, you must create an index on your vector field. Without an index, searches would be a brute-force scan, defeating the purpose of a vector database. The index_params dictionary specifies the metric_type (how similarity is measured – L2 distance, Inner Product, etc.) and the index_type (the algorithm used for indexing). After creating the index, you need to load() the collection into memory. This makes it ready for querying.

The collection.search() method is where the magic happens. You provide your query_vector, specify the anns_field (the vector field to search against), param (search-specific parameters, often related to the index type, like nprobe for IVF), and limit (how many nearest neighbors you want). You can also use expr for filtering based on scalar fields (like text_data in our example) and output_fields to retrieve specific metadata along with the search results.
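What expr does can be mimicked in plain Python: filter candidates on a scalar field first, then rank the survivors by distance. The toy in-memory entities below are invented for this sketch; this is not the pymilvus API:

```python
import numpy as np

# Toy in-memory "collection": each entity has a vector and a scalar field
entities = [
    {"id": 0, "embedding": np.array([0.1, 0.9]), "category": "news"},
    {"id": 1, "embedding": np.array([0.2, 0.8]), "category": "blog"},
    {"id": 2, "embedding": np.array([0.9, 0.1]), "category": "news"},
]
query = np.array([0.15, 0.85])

# Rough equivalent of expr='category == "news"': keep matching entities...
candidates = [e for e in entities if e["category"] == "news"]
# ...then rank the survivors by L2 distance to the query
candidates.sort(key=lambda e: float(np.linalg.norm(e["embedding"] - query)))
print([e["id"] for e in candidates])  # prints [0, 2]
```

In Milvus the filtering happens inside the engine alongside the ANN search, which is far more efficient than filtering after the fact in application code.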

The most counterintuitive aspect of vector databases is that "similarity" is not an inherent property of the data itself, but a construct defined by the embedding model and the chosen distance metric. Two documents might be semantically similar to a human yet have vectors that are far apart if the embedding model wasn’t trained to capture that particular nuance, or if the metric doesn’t match the model: raw Inner Product on unnormalized vectors, for example, rewards magnitude as well as direction, so it can rank results very differently from L2 distance or cosine similarity.
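A tiny NumPy example makes the metric's role concrete. With unnormalized vectors, Inner Product and L2 distance can disagree about which stored vector is "most similar" to the same query; the vectors here are contrived to force that disagreement:

```python
import numpy as np

query = np.array([1.0, 0.0])
a = np.array([10.0, 0.0])   # same direction as the query, but much longer
b = np.array([1.0, 0.1])    # nearly identical to the query

# Inner product ("IP") rewards magnitude: the long vector wins
ip_scores = [query @ a, query @ b]
# L2 distance rewards closeness in space: the nearby vector wins
l2_dists = [np.linalg.norm(query - a), np.linalg.norm(query - b)]

print(np.argmax(ip_scores))  # 0 -> vector a is "most similar" under IP
print(np.argmin(l2_dists))   # 1 -> vector b is "most similar" under L2
```

Neither answer is wrong; they are answers to different questions, which is why the metric_type must match how your embedding model was trained.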

The next step in mastering vector databases involves exploring different indexing strategies and understanding how their parameters (like nlist and nprobe for IVF, or M and ef for HNSW) directly impact search speed and accuracy, and how to tune them for your specific use case.
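For reference, an HNSW index in Milvus is configured with the build-time parameters M and efConstruction and the search-time parameter ef. The values below are illustrative starting points only, not tuned recommendations:

```python
# Illustrative HNSW parameters (starting points, not recommendations)
hnsw_index_params = {
    "metric_type": "L2",
    "index_type": "HNSW",
    "params": {
        "M": 16,                # max connections per graph node; higher = better recall, more memory
        "efConstruction": 200,  # candidate-list size while building; higher = better graph, slower build
    },
}
hnsw_search_params = {
    "metric_type": "L2",
    "params": {"ef": 64},       # candidate-list size at query time; higher = better recall, slower search
}
```

These dictionaries drop into the same create_index() and search() calls shown earlier in place of the IVF_FLAT parameters.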
