The most surprising thing about vector database incremental updates is that "adding" and "replacing" often boil down to the same underlying operation: a delete followed by an insert.

Let’s see this in action. Imagine we have a simple in-memory vector store. We’ll use Python for illustration, but the principles apply to any vector database.

from collections import defaultdict

class SimpleVectorStore:
    def __init__(self):
        self._data = {} # Stores {vector_id: (vector, metadata)}
        self._index = defaultdict(set) # Stores {field: {vector_id}} for filtering

    def add_vector(self, vector_id, vector, metadata=None):
        if vector_id in self._data:
            print(f"Warning: Vector ID '{vector_id}' already exists. Overwriting.")
            # Remove the old entry first so stale metadata index entries
            # don't linger after the overwrite
            self.delete_vector(vector_id)
        self._data[vector_id] = (vector, metadata or {})
        if metadata:
            for key in metadata:
                self._index[key].add(vector_id)

    def get_vector(self, vector_id):
        return self._data.get(vector_id)

    def delete_vector(self, vector_id):
        if vector_id not in self._data:
            return False
        
        _, metadata = self._data.pop(vector_id)
        for key in metadata:
            self._index[key].discard(vector_id)
            if not self._index[key]: # Clean up empty sets
                del self._index[key]
        return True

    def update_vector(self, vector_id, new_vector, new_metadata=None):
        if vector_id not in self._data:
            print(f"Error: Vector ID '{vector_id}' not found for update.")
            return False
        
        # The core logic: delete old, add new
        self.delete_vector(vector_id)
        self.add_vector(vector_id, new_vector, new_metadata)
        return True

    def search(self, query_vector, k=5, filter_metadata=None):
        # Simplified search logic for demonstration
        if filter_metadata:
            # Find candidate IDs based on metadata filter
            candidate_ids = set(self._data.keys())
            for key, value in filter_metadata.items():
                if key in self._index:
                    candidate_ids.intersection_update(self._index[key])
                else:
                    return [] # No vectors match this filter

            # Filter out vectors that don't match ALL metadata conditions
            matching_ids = []
            for vid in candidate_ids:
                _, meta = self._data[vid]
                matches = True
                for key, value in filter_metadata.items():
                    if meta.get(key) != value:
                        matches = False
                        break
                if matches:
                    matching_ids.append(vid)
        else:
            matching_ids = list(self._data.keys())

        # Calculate distances and sort (placeholder for actual vector similarity)
        results = []
        for vid in matching_ids:
            vec, meta = self._data[vid]
            # Squared L2 (Euclidean) distance; a real system would use an
            # optimized cosine or L2 kernel, typically over an ANN index
            distance = sum((q - v) ** 2 for q, v in zip(query_vector, vec))
            results.append((vid, distance, meta))
        
        results.sort(key=lambda x: x[1])
        return results[:k]

# Example Usage
store = SimpleVectorStore()
store.add_vector("doc1", [0.1, 0.2, 0.3], {"genre": "fiction"})
store.add_vector("doc2", [0.4, 0.5, 0.6], {"genre": "non-fiction"})
store.add_vector("doc3", [0.15, 0.25, 0.35], {"genre": "fiction"})

print("Initial search for 'fiction':")
print(store.search([0.1, 0.2, 0.3], filter_metadata={"genre": "fiction"}))

print("\nUpdating doc1:")
store.update_vector("doc1", [0.9, 0.8, 0.7], {"genre": "science-fiction"})

print("Search for 'fiction' after update:")
print(store.search([0.1, 0.2, 0.3], filter_metadata={"genre": "fiction"})) # doc1 should be gone

print("Search for 'science-fiction' after update:")
print(store.search([0.9, 0.8, 0.7], filter_metadata={"genre": "science-fiction"})) # doc1 should appear

print("\nAttempting to update non-existent doc4:")
store.update_vector("doc4", [1.0, 1.0, 1.0])

The update_vector method in our SimpleVectorStore shows the pattern plainly: delete_vector followed by add_vector. This is fundamental because most vector databases, especially those built on approximate nearest neighbor (ANN) indexes like HNSW or IVF, manage their data in segments or shards. When you "replace" a vector, the old entry may be woven deep into the index structure: graph neighbors in HNSW, cell assignments in IVF. Modifying it in place would corrupt the index. The safest and most common approach is to invalidate the old entry (delete) and then insert a completely new one, which the database indexes from scratch.
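To see why in-place modification is unsafe, here is a minimal sketch of an IVF-style layout, where each vector is filed under its nearest centroid's cell at insert time. All names here (centroids, cells, nearest_cell) are illustrative, not any real database's API. Because the cell assignment depends on the vector's value, mutating the stored vector in place would leave it filed under a stale cell; an update must re-run the assignment via delete + insert.

```python
# Hypothetical IVF-style sketch: vectors are filed under the nearest
# centroid's cell when inserted. An "update" must delete and re-insert
# so the cell assignment is recomputed for the new value.
centroids = {"cell_a": [0.0, 0.0], "cell_b": [1.0, 1.0]}
cells = {"cell_a": {}, "cell_b": {}}

def nearest_cell(vector):
    # Pick the centroid with the smallest squared L2 distance
    return min(
        centroids,
        key=lambda c: sum((x - y) ** 2 for x, y in zip(centroids[c], vector)),
    )

def insert(vector_id, vector):
    cells[nearest_cell(vector)][vector_id] = vector

def delete(vector_id):
    for cell in cells.values():
        cell.pop(vector_id, None)

insert("doc1", [0.1, 0.1])
print("doc1" in cells["cell_a"])  # True — filed near the first centroid

# Updating doc1 to a vector near the other centroid requires re-filing:
delete("doc1")
insert("doc1", [0.9, 0.9])
print("doc1" in cells["cell_a"], "doc1" in cells["cell_b"])  # False True
```

If the code had simply overwritten the vector's values where they sat, doc1 would still live in cell_a while its contents belonged in cell_b, and searches routed by centroid would miss it.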

A "delete" in a vector database typically means marking the vector as deleted or removing it from the active index structures. It is often not physically removed right away, since rewriting large index segments is expensive; a garbage-collection or background merge process eventually reclaims the space. The "add" operation, conversely, inserts the new vector and updates the index structures that make it searchable. The metadata used for filtering is replaced along with the vector.
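The mark-then-reclaim behavior can be sketched with a hypothetical Segment class that records tombstones and defers physical removal to a compact() pass (the class and method names are illustrative, not a real engine's API):

```python
# Sketch of tombstone-based deletion: delete() only marks the entry,
# searches skip tombstoned IDs, and compact() reclaims storage later.
class Segment:
    def __init__(self):
        self._vectors = {}        # {vector_id: vector}
        self._tombstones = set()  # IDs marked deleted, not yet reclaimed

    def insert(self, vector_id, vector):
        self._vectors[vector_id] = vector
        self._tombstones.discard(vector_id)

    def delete(self, vector_id):
        # Soft delete: record a tombstone; storage is untouched
        if vector_id in self._vectors:
            self._tombstones.add(vector_id)

    def live_ids(self):
        # Searches only consider non-tombstoned entries
        return [vid for vid in self._vectors if vid not in self._tombstones]

    def compact(self):
        # Background merge/GC: physically drop tombstoned vectors
        for vid in self._tombstones:
            del self._vectors[vid]
        self._tombstones.clear()

seg = Segment()
seg.insert("a", [0.1])
seg.insert("b", [0.2])
seg.delete("a")
print(seg.live_ids())      # ['b'] — "a" is hidden from searches
print(len(seg._vectors))   # 2 — but its storage is not yet reclaimed
seg.compact()
print(len(seg._vectors))   # 1 — reclaimed only after compaction
```

The key design point is that correctness (the vector disappears from results) and space reclamation (the bytes are freed) happen at different times, which is why deletes in real systems are cheap but disk usage can lag behind.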

One of the most subtle aspects of this delete-then-insert model, particularly in distributed vector databases, is the potential for a brief window where both the old and new versions of a vector might momentarily coexist if the update operation is not atomic across all components. While the update_vector function might appear atomic from the client’s perspective, the underlying database might involve multiple steps: first marking the old vector for deletion, then adding the new one, and finally, asynchronously cleaning up the old data. This can lead to scenarios where a search might, in rare cases, return the old vector if it hits a replica or segment that hasn’t yet processed the deletion but has processed the insertion.
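That window can be sketched with two hypothetical Replica objects receiving the same operation log (the Replica class and versioned-entry scheme are illustrative assumptions, not how any particular database is implemented). The lagging replica applies the insert of the new version before it has processed the delete of the old one, so a search against it would briefly see both versions:

```python
# Sketch of the non-atomic update window across replicas: an update is
# delete(old version) + insert(new version), and replicas may apply
# those operations at different times.
class Replica:
    def __init__(self):
        self.entries = {}  # {(vector_id, version): vector}

    def apply(self, op):
        kind, vector_id, version, vector = op
        if kind == "insert":
            self.entries[(vector_id, version)] = vector
        elif kind == "delete":
            self.entries.pop((vector_id, version), None)

    def visible_versions(self, vector_id):
        # What a search against this replica could return
        return sorted(v for (vid, v) in self.entries if vid == vector_id)

primary, lagging = Replica(), Replica()
for r in (primary, lagging):
    r.apply(("insert", "doc1", 1, [0.1, 0.2]))

# Update doc1: delete v1, insert v2. The lagging replica has applied the
# insert but not yet the delete.
primary.apply(("delete", "doc1", 1, None))
primary.apply(("insert", "doc1", 2, [0.9, 0.8]))
lagging.apply(("insert", "doc1", 2, [0.9, 0.8]))

print(primary.visible_versions("doc1"))  # [2]
print(lagging.visible_versions("doc1"))  # [1, 2] — both versions coexist
lagging.apply(("delete", "doc1", 1, None))  # the delete eventually arrives
print(lagging.visible_versions("doc1"))  # [2] — converged
```

Systems close this window in different ways, for example by versioning entries and filtering duplicates at query time, or by routing reads to replicas that have acknowledged the full update.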

The next concept you’ll likely encounter is how vector databases handle bulk updates and deletions, and the trade-offs between immediate consistency and performance.
