Spark Continuous Processing can push end-to-end latency down to roughly one millisecond, but it’s not the default, it comes with real restrictions, and it requires careful tuning.

Here’s how it works and what you need to know to get there.

Let’s see it in action with a simple streaming example that splits socket input into words.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("ContinuousWordStream")
  .master("local[*]") // For local testing; use your cluster manager in production
  .getOrCreate()

import spark.implicits._ // Needed for the $"column" syntax
import org.apache.spark.sql.functions._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Continuous mode only supports map-like operations (select, filter, map,
// flatMap), so this query emits individual words; aggregations such as
// groupBy().count() are rejected in this mode.
val words = lines.select(explode(split($"value", " ")).as("word"))

// This is the key for low latency: Trigger.Continuous. The interval is the
// checkpoint interval (how often progress is recorded), not a batching window.
val query = words.writeStream
  .trigger(Trigger.Continuous("1 second"))
  .outputMode("append")
  .option("checkpointLocation", "/tmp/spark-checkpoints") // Required: continuous queries must checkpoint
  .format("console") // Output to console for demonstration
  .start()

query.awaitTermination()

To run this, you’ll need to have spark-submit available and a way to send data to localhost:9999. You can use netcat for this:

nc -lk 9999

When you run the Spark application and then type lines such as "hello world hello spark" into the netcat terminal, you’ll see the words appear in the Spark console output almost instantaneously, with end-to-end latency in the low milliseconds rather than the seconds typical of micro-batch mode.

The core problem Spark’s continuous processing solves is the latency floor imposed by micro-batching. Traditional Spark Structured Streaming operates on micro-batches: the engine schedules a job at each trigger interval (e.g., every second) and processes whatever data arrived since the last one. This introduces latency because a record must wait for the next trigger before it can be processed, plus per-batch scheduling overhead. Continuous processing, on the other hand, processes records as they arrive, with minimal buffering.
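To make the contrast concrete, here is a sketch of the same console sink configured both ways; only the trigger changes (the streaming DataFrame named events and the checkpoint paths are assumptions for illustration):

```scala
import org.apache.spark.sql.streaming.Trigger

// Micro-batch: a new batch job is scheduled every second; records wait
// for the next trigger before they are processed.
val microBatch = events.writeStream
  .trigger(Trigger.ProcessingTime("1 second"))
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt-microbatch")
  .start()

// Continuous: long-running tasks process records as they arrive; the
// interval only controls how often progress is checkpointed.
val continuous = events.writeStream
  .trigger(Trigger.Continuous("1 second"))
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt-continuous")
  .start()
```

Note that the two queries need distinct checkpoint locations, since each one records its own progress there.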

Internally, continuous processing uses a different execution model. Instead of launching a new job for each micro-batch, Spark starts long-running tasks that occupy executor cores for the lifetime of the query, continuously pulling records from the source, pushing them through the operators, and writing to the sink. The Trigger.Continuous(interval) setting is neither a batching window nor a wait time: it is the checkpoint interval, i.e. how often the engine asynchronously records the offsets it has processed (via epoch markers injected into the stream, in the style of the Chandy-Lamport algorithm) so the query can resume after a failure.

The key levers you control are the trigger and the checkpoint configuration. Trigger.Continuous(interval) enables the mode, and the interval (a duration such as "1 second") dictates how often progress is committed: smaller values shrink the amount of work replayed after a failure but add commit overhead, while larger values do the opposite. A checkpoint location is mandatory because recovery depends on those committed offsets. Be aware of the mode’s restrictions, too: only map-like operations (select, filter, map, flatMap, and non-aggregate SQL functions) are supported, and only a limited set of sources and sinks work in this mode (e.g. Kafka, plus the rate and socket sources and the console and memory sinks for testing). Because aggregations are unsupported, you will typically run in append mode.
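For experimenting with these levers without a socket, the built-in rate source (which works in continuous mode) is convenient. This is a sketch assuming an existing SparkSession named spark; the checkpoint path is an illustrative choice:

```scala
import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream
  .format("rate")               // Generates (timestamp, value) rows at a fixed rate
  .option("rowsPerSecond", 10)
  .load()

val query = stream.writeStream
  .trigger(Trigger.Continuous("1 second"))        // Commit progress once per second
  .outputMode("append")
  .option("checkpointLocation", "/tmp/ckpt-rate") // Mandatory for continuous queries
  .format("console")
  .start()
```

Watching the console output while varying rowsPerSecond and the trigger interval is a cheap way to build intuition about commit frequency versus overhead.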

The one thing most people don’t realize is how sensitive continuous processing is to the cost of your operations. The engine doesn’t magically speed up individual operators; records flow through the same long-running tasks, so if any step in your pipeline is expensive (a slow UDF, a blocking lookup, a sink that commits slowly), every record pays that cost, epochs take longer to complete, and checkpoints fall behind. Likewise, if the source can’t push data fast enough, the dedicated cores simply sit idle. Low latency in this mode is a property of a cheap, map-like pipeline, not something the trigger setting can enforce.
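As an illustration of that sensitivity, consider a deliberately slow per-record operation (slowOp is a hypothetical stand-in, not part of the original example; it assumes the words DataFrame from above):

```scala
import org.apache.spark.sql.functions.{col, udf}

// ~5 ms per record means a single core can sustain at most ~200 records/second,
// regardless of what Trigger.Continuous interval you configure.
val slowOp = udf { (s: String) =>
  Thread.sleep(5) // Stand-in for an expensive lookup or computation
  s.toUpperCase
}

// Every record flows through this operator inside the long-running task, so the
// delay backs up everything behind it -- the trigger cannot compensate.
val slowWords = words.withColumn("word", slowOp(col("word")))
```

Profiling the per-record cost of each operator before switching to continuous mode is usually more productive than tuning the trigger interval.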

The next step you’ll likely encounter is managing delivery guarantees in this low-latency environment: continuous processing currently provides at-least-once semantics rather than the exactly-once guarantees of micro-batch mode, which often leads to deduplicating at the sink and exploring more advanced fault-tolerance mechanisms.

Want structured learning?

Take the full Spark-streaming course →