The single most surprising thing about achieving millions of vector events per second isn’t raw CPU power; it’s how efficiently you can avoid doing work.

Let’s see it in action. Imagine we’re indexing a stream of vector embeddings from a real-time image recognition service. Each embedding is a 128-dimensional float vector. We want to ingest these at a blistering pace into a vector database for similarity search.

// Example incoming event (simplified)
{
  "id": "image_abc123",
  "embedding": [0.123, -0.456, ..., 0.789], // 128 floats
  "metadata": {"camera_id": "cam_05", "timestamp": 1678886400}
}

Our target is to process 2 million such events per second. This isn’t just about throwing more machines at it; it’s about orchestrating the data flow with surgical precision.
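A quick back-of-envelope calculation makes the target concrete. Assuming 4-byte floats and a rough 100-byte allowance for the id and metadata (an illustrative figure, not measured), 2 million events per second implies over a gigabyte per second of sustained ingest bandwidth:

```python
# Back-of-envelope: what does 2M events/sec mean in raw bandwidth?
# The 100-byte metadata allowance is a hypothetical assumption.
DIM = 128
BYTES_PER_FLOAT = 4
METADATA_OVERHEAD = 100  # assumed per-event id + metadata cost

bytes_per_event = DIM * BYTES_PER_FLOAT + METADATA_OVERHEAD  # 512 + 100 = 612
events_per_sec = 2_000_000

gb_per_sec = bytes_per_event * events_per_sec / 1e9
print(f"{bytes_per_event} B/event -> {gb_per_sec:.2f} GB/s sustained")
```

That number is before any serialization overhead, which is exactly why the orchestration details below matter so much.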

The core problem we’re solving is transforming a high-volume, high-velocity stream of incoming data into a format that can be efficiently queried for nearest neighbors in a multi-dimensional space. This involves several stages:

  1. Ingestion & Buffering: Receiving events from producers and holding them temporarily.
  2. Preprocessing/Enrichment: Adding or transforming data (e.g., hashing metadata, normalizing vectors).
  3. Indexing: Adding the vector and its associated ID/metadata to the vector database’s internal structure (e.g., HNSW, IVFPQ).
  4. Persistence: Ensuring data is durably stored.

Internally, a high-throughput vector system often looks like a pipeline. Producers send data to a message queue (like Kafka or Pulsar). Consumers (application servers or dedicated indexing services) read from the queue, perform transformations, and push data into the vector database. The vector database itself has internal indexing and persistence mechanisms.
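The shape of that pipeline can be sketched in-process. In this toy version, `queue.Queue` stands in for the message broker and `index_batch` stands in for the vector database's bulk-insert call; both names are placeholders, not a real client API:

```python
# Minimal stand-in for the broker -> consumer -> DB pipeline.
import queue
import threading

BATCH_SIZE = 4
events_q = queue.Queue()   # plays the role of Kafka/Pulsar
indexed = []               # plays the role of the vector DB
lock = threading.Lock()

def index_batch(batch):
    with lock:
        indexed.extend(batch)  # stand-in for a bulk-insert RPC

def consumer():
    batch = []
    while True:
        item = events_q.get()
        if item is None:        # sentinel: flush partial batch and exit
            if batch:
                index_batch(batch)
            return
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            index_batch(batch)
            batch = []

workers = [threading.Thread(target=consumer) for _ in range(2)]
for w in workers:
    w.start()
for i in range(10):             # producer side
    events_q.put({"id": f"image_{i}"})
for _ in workers:               # one sentinel per worker
    events_q.put(None)
for w in workers:
    w.join()
print(len(indexed))             # all 10 events indexed
```

The real system distributes each role across machines, but the core mechanics (pull, batch, flush, backpressure via the queue) are the same.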

The levers we control are primarily around:

  • Batching: How many events are processed and sent to the database at once.
  • Concurrency: How many parallel threads or processes are ingesting data.
  • Serialization/Deserialization: The efficiency of converting data between network and internal formats.
  • Vector Database Configuration: Parameters like index build strategy, memory usage, and disk I/O.
  • Network Throughput: The bandwidth between components.
  • CPU Utilization: For preprocessing and vector operations.

Consider this configuration snippet for a hypothetical vector database ingestion service. We’re not just setting a number; we’re tuning the behavior of the system.

# Ingestion Service Configuration
ingestion:
  kafka_consumer_group: "vector-indexer-group"
  num_workers: 128 # Number of parallel ingestion threads
  batch_size: 5000 # Events per batch sent to DB
  max_in_flight_batches: 100 # Outstanding batches allowed before the client applies backpressure
  timeout_ms: 30000 # How long to wait for an operation
  vector_db:
    host: "vector-db-01.cluster.local"
    port: 19530
    index_name: "image_embeddings"
    # HNSW build parameters (higher values improve recall but slow indexing)
    ef_construction: 400
    M: 64
    # PQ settings if applicable
    num_sub_vectors: 32
    num_codebooks: 4

The num_workers and batch_size settings are critical. If batch_size is too small, per-batch overhead dominates and you can’t saturate the network or the database’s write capacity. If it’s too large, you increase latency and memory pressure. max_in_flight_batches caps how many batches the client keeps outstanding at once; it provides backpressure so the database is never pushed past the point where it starts rejecting new requests or stalling its internal processes. The sweet spot is where producers are always feeding batches and the database is always processing them.

The most counterintuitive part of high-throughput vector ingestion is that the bottleneck is rarely the mathematical complexity of the vector operations themselves, but the orchestration and serialization overhead. If your serialization format (Protobuf, Thrift, or similar) isn’t matched to your data structure and frequently re-serializes small objects, you can burn an enormous amount of CPU just packaging and unpacking data. A highly tuned system might even use custom binary formats that are less human-readable but vastly more efficient for machine-to-machine communication, avoiding intermediate object creation and string manipulation for every single event. The goal is to keep the vector data, in its rawest efficient form, moving from producer to the database’s internal index structures with minimal detours.
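The size difference alone is striking. Packing the 128-float embedding as raw little-endian bytes with Python's struct module yields a fixed 512-byte payload, while the equivalent JSON text is several times larger and far more expensive to parse:

```python
# Raw little-endian float32 bytes vs JSON text for one embedding.
import json
import struct

embedding = [float(i) / 128 for i in range(128)]

binary = struct.pack("<128f", *embedding)      # exactly 128 * 4 = 512 bytes
text = json.dumps(embedding).encode("utf-8")   # several times larger

print(len(binary), len(text))

# Round-trip check: unpacking restores the vector (to float32 precision)
restored = struct.unpack("<128f", binary)
assert all(abs(a - b) < 1e-6 for a, b in zip(restored, embedding))
```

Multiply that per-event difference by 2 million events per second and the case for a compact binary wire format makes itself.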

Once you’ve mastered ingestion throughput, the next challenge is optimizing for low-latency, high-QPS similarity search.
