The interval at which Spark Structured Streaming processes data isn’t just a knob to tweak for speed; it fundamentally dictates the trade-off between latency and throughput, and more importantly, the granularity of your data’s journey through the system.

Let’s see this in action. Imagine a simple streaming job that reads from Kafka, performs a count, and writes to a console.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("TriggerIntervalDemo")
  .master("local[*]") // For demonstration purposes
  .getOrCreate()

import spark.implicits._

// Assume you have a Kafka topic named 'input-topic'.
// Running this requires the spark-sql-kafka-0-10 connector on the classpath,
// e.g. --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<your-spark-version>
// Replace with your Kafka bootstrap servers.
val kafkaBrokers = "localhost:9092"

val streamingDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", "input-topic")
  .load()

// Simple value extraction and casting to string for counting
val processedDF = streamingDF.selectExpr("CAST(value AS STRING)")
  .as[String]
  .groupBy($"value")
  .count()

// --- Trigger Interval Examples ---

// Example 1: Processing as soon as possible (default, effectively)
// val query1 = processedDF.writeStream
//   .outputMode("complete")
//   .format("console")
//   .trigger(Trigger.ProcessingTime("0 seconds")) // Or omit trigger for default
//   .start()
//   .awaitTermination()

// Example 2: Triggering every 10 seconds
// val query2 = processedDF.writeStream
//   .outputMode("complete")
//   .format("console")
//   .trigger(Trigger.ProcessingTime("10 seconds"))
//   .start()
//   .awaitTermination()

// Example 3: Triggering every 1 minute
// val query3 = processedDF.writeStream
//   .outputMode("complete")
//   .format("console")
//   .trigger(Trigger.ProcessingTime("1 minute"))
//   .start()
//   .awaitTermination()

// To run one of these, uncomment it and comment out the others.
// For this demo, let's assume we're running Example 2 with "10 seconds".
val query = processedDF.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

// In a real application, you'd have your SparkSession running
// and potentially other logic. For this demo, we'll just let it run
// for a bit and then stop.
Thread.sleep(30000) // Let it run for 30 seconds
query.stop()
spark.stop()

When you run this with Trigger.ProcessingTime("10 seconds"), Spark schedules micro-batches on a fixed 10-second cadence measured from the start of the previous micro-batch, not from its completion. If a micro-batch takes 5 seconds to run, Spark waits the remaining 5 seconds before starting the next one, so batches still begin 10 seconds apart. But if a micro-batch takes 15 seconds, it has missed the interval boundary, and the next one starts immediately after it finishes. This is the core of "micro-batching."

The Trigger.ProcessingTime(interval) setting is your primary lever. Setting it to "0 seconds" (or omitting the trigger clause altogether, which behaves the same way) means Spark will start a new micro-batch as soon as the previous one finishes. This minimizes latency but can lead to high CPU utilization and many very small micro-batches, which is inefficient when the per-batch overhead of scheduling and planning is large relative to the actual work in each batch.

Increasing the interval, say to "1 minute", means micro-batches start at most once per minute (measured start to start). This increases latency, since data may sit for up to a minute before it is processed, but it allows Spark to accumulate more data into each micro-batch. That is often more efficient, because it amortizes the overhead of task scheduling and query planning over a larger dataset, and it smooths out processing so the system isn't churning through a constant stream of tiny updates.

There’s also Trigger.Once(), which processes all the data available at start-up in a single micro-batch and then stops. This is useful for batch-style jobs that periodically drain a stream; on Spark 3.3 and later, Trigger.AvailableNow() is generally preferred, as it drains the same backlog across multiple, rate-limited micro-batches.
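Assuming the same processedDF and console sink as above, a run-once job is a one-line change (Trigger.AvailableNow() requires Spark 3.3+):

```scala
import org.apache.spark.sql.streaming.Trigger

// Drain everything currently available, then stop.
// Trigger.Once() does it in one micro-batch; on Spark 3.3+,
// Trigger.AvailableNow() drains the same backlog in several
// rate-limited batches, which behaves better on large backlogs.
val onceQuery = processedDF.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.Once())
  .start()

onceQuery.awaitTermination() // returns once the backlog is processed
```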

Trigger.Continuous(interval) switches the query to Spark’s experimental continuous processing mode, which processes records as they arrive rather than in micro-batches; the interval here is a checkpoint interval, not a batch cadence. It can achieve millisecond-level latency, but with strict limitations: only stateless operations (selections and projections, no aggregations), a small set of supported sources and sinks (notably Kafka), and at-least-once rather than exactly-once guarantees.
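Because continuous mode rejects aggregations, the grouped count above would not run under it. A minimal stateless sketch, assuming a hypothetical output topic and a local checkpoint path (both placeholders):

```scala
import org.apache.spark.sql.streaming.Trigger

// Stateless pass-through: continuous mode supports selections and
// projections only, so no groupBy/count here.
val continuousQuery = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("topic", "output-topic")                       // hypothetical topic
  .option("checkpointLocation", "/tmp/continuous-ckpt")  // required by the Kafka sink
  .trigger(Trigger.Continuous("1 second"))               // checkpoint interval
  .start()
```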

The "micro-batch" itself is a logical batch of data that Spark processes. When you set a trigger interval, you’re essentially telling Spark how often to initiate the creation and processing of these micro-batches. The actual duration of a micro-batch depends on the amount of data available and the complexity of your transformations.

The trigger interval does not by itself change how much data a micro-batch contains; it only controls how often a new micro-batch starts. Each trigger processes whatever data has arrived since the previous batch, so a longer interval generally means larger batches. If you need to bound batch size, do it at the source; with Kafka, for example, the maxOffsetsPerTrigger option caps how many records each micro-batch reads.
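For the Kafka source specifically, maxOffsetsPerTrigger is the usual way to keep individual micro-batches bounded; a sketch:

```scala
// Cap each micro-batch at roughly 10,000 Kafka records, so that a long
// trigger interval (or a backlog after downtime) cannot produce one
// enormous, slow batch. The cap is spread across the topic's partitions.
val boundedDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", "input-topic")
  .option("maxOffsetsPerTrigger", "10000")
  .load()
```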

Understanding the interplay between your trigger interval, the time it takes to process a micro-batch, and the rate of incoming data is key to optimizing performance. A common pitfall is setting the trigger interval lower than the typical batch processing time: micro-batches never overlap, so the query simply falls further and further behind the incoming data. Conversely, setting the interval too high can lead to unacceptable data staleness.
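One way to see whether a query is keeping up is to compare per-batch duration against the trigger interval using StreamingQuery.lastProgress; a sketch against the query started above:

```scala
// durationMs("triggerExecution") is the wall-clock time the whole
// micro-batch took. If it regularly exceeds the trigger interval,
// the query is falling behind its input.
val progress = query.lastProgress
if (progress != null) {
  val batchMs = progress.durationMs.get("triggerExecution")
  println(s"Batch ${progress.batchId}: ${progress.numInputRows} rows in ${batchMs} ms")
}
```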

One last point: Structured Streaming has no built-in idle timeout. A query keeps running until it is stopped explicitly or fails, and triggers that find no new data are effectively no-ops, so an idle query is cheap but never stops on its own. If you want a query to shut down after a quiet period, you have to implement that yourself in driver code.
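If you do want a query to stop after a quiet spell, one approach is a small watchdog loop in the driver; a hedged sketch (the 5-minute threshold is an arbitrary illustrative choice):

```scala
// Stop the query once no batch has processed any rows for ~5 minutes.
val idleLimitMs = 5 * 60 * 1000L
var lastActivity = System.currentTimeMillis()

while (query.isActive) {
  val p = query.lastProgress
  if (p != null && p.numInputRows > 0) {
    lastActivity = System.currentTimeMillis()
  }
  if (System.currentTimeMillis() - lastActivity > idleLimitMs) {
    query.stop() // no built-in idle timeout, so we enforce one ourselves
  }
  Thread.sleep(10000) // poll every 10 seconds
}
```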
