Structured Streaming’s output modes are a bit of a Rorschach test for how you think about state.

Let’s see what these modes actually do with some live data. Imagine we have a simple word count stream. Our input is a sequence of lines, and we want to count words.

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.{SparkSession, DataFrame}

val spark = SparkSession.builder()
  .appName("StreamingOutputModes")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

// --- Append Mode ---
// Append mode on a streaming aggregation requires a watermark; without one,
// start() throws an AnalysisException. Re-read the socket with timestamps
// and use a windowed, watermarked count so rows can be finalized.
println("--- Append Mode ---")
import org.apache.spark.sql.functions.window
val timedWords = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .option("includeTimestamp", true).load()
  .as[(String, java.sql.Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")
val appendQuery = timedWords
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "10 seconds"), $"word")
  .count()
  .writeStream
  .outputMode(OutputMode.Append())
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start() // appendQuery.awaitTermination() would block; don't call it here

// --- Update Mode ---
println("--- Update Mode ---")
val updateQuery = wordCounts.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

// --- Complete Mode ---
println("--- Complete Mode ---")
val completeQuery = wordCounts.writeStream
  .outputMode(OutputMode.Complete())
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

// To see the output, run a query and then send data from another terminal:
// nc -lk 9999
// Then type: hello world hello spark
// Note: nc typically serves one connection at a time, so in practice test
// one query per session; each mode prints a different shape of output.

When you run this, each query writes to the console on every trigger. The key difference is what gets printed.

Append Mode is the most restrictive and often the most performant. It only outputs rows that have been added to the result table since the last trigger, and it promises never to emit a row that could change later. For an aggregation like our word count, that promise can only be kept with a watermark: Spark holds each count in state and emits it exactly once, after the watermark passes the end of its window and the count is final. Without a watermark, the query won’t even start; you get an AnalysisException, because every count could keep growing forever. This is great for sinks that can only append, like files, many message queues, or databases where you’re inserting new records.
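To make the finalization rule concrete, here is a plain-Scala sketch of it (no Spark; the 10-second windows and watermark arithmetic are simplified assumptions, not the engine’s exact logic):

```scala
object AppendModeSim {
  // (word, eventTimeSeconds) pairs arrive in micro-batches. Windows are
  // 10s wide; the watermark trails the max event time seen by 10s.
  // A window's counts are emitted once the watermark passes its end.
  def run(batches: Seq[Seq[(String, Long)]]): Seq[Seq[((Long, String), Long)]] = {
    var state = Map.empty[(Long, String), Long] // (windowStart, word) -> count
    var maxEventTime = 0L
    batches.map { batch =>
      batch.foreach { case (word, t) =>
        maxEventTime = math.max(maxEventTime, t)
        val key = (t / 10 * 10, word) // assign the event to its 10s window
        state = state.updated(key, state.getOrElse(key, 0L) + 1L)
      }
      val watermark = maxEventTime - 10L
      // Finalized: windows whose end (start + 10) the watermark has passed.
      val (finalized, open) = state.partition { case ((start, _), _) => start + 10 <= watermark }
      state = open // finalized windows are emitted once, then forgotten
      finalized.toSeq.sortBy(_._1)
    }
  }

  def main(args: Array[String]): Unit = {
    val out = run(Seq(
      Seq(("hello", 1L), ("world", 3L)), // both land in window [0, 10)
      Seq(("hello", 25L))                // advances the watermark to 15
    ))
    out.zipWithIndex.foreach { case (rows, i) => println(s"Batch $i appended: $rows") }
  }
}
```

The first batch emits nothing because the watermark hasn’t passed any window’s end; once a later event advances the watermark beyond a window, that window’s counts are appended once and dropped from state.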

Complete Mode is the most straightforward conceptually. It outputs the entire aggregated result table every time the trigger fires. For our word count, every time Spark processes new data, it emits the counts for all words seen so far and prints the whole table. This is useful for sinks that can easily replace their entire contents, like a single file or a database table you can TRUNCATE and INSERT into. It guarantees that the output always reflects the current state of the aggregation, at the cost of keeping the full result table in state, so it only scales when the number of distinct keys stays bounded.

Update Mode is a middle ground. It outputs only the rows in the result table that have changed since the last trigger. For our word count, if hello’s count goes from 3 to 4, the row (hello, 4) is output; if it later goes from 4 to 5, (hello, 5) is output. Importantly, rows that haven’t changed, like (world, 1), are not output. This is more efficient than Complete mode for sinks that can handle row-level upserts, like most relational databases, because it avoids rewriting rows that haven’t changed.

The mental model to hold onto here is what Spark’s streaming query engine considers the "result table." In Complete mode, the output is the entire, up-to-the-minute aggregated table. In Append mode, it’s the delta of rows newly added to that conceptual table, rows that are guaranteed never to change again. In Update mode, it’s the delta of rows that have changed (either new or modified) in that conceptual table. The choice of mode is tightly coupled with the capabilities of your downstream sink.
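The result-table mental model can be simulated without a cluster. This sketch (plain Scala, no Spark) replays two micro-batches through a running word count and shows what Complete and Update mode would each emit per trigger:

```scala
object OutputModeSim {
  // Replays micro-batches through a running word count and returns,
  // for each batch, what Complete mode and Update mode would emit.
  def run(batches: Seq[Seq[String]]): Seq[(Map[String, Long], Map[String, Long])] = {
    var table = Map.empty[String, Long] // the conceptual result table
    batches.map { batch =>
      val prev = table
      table = batch.foldLeft(table) { (t, w) =>
        t.updated(w, t.getOrElse(w, 0L) + 1L)
      }
      // Complete: the whole table. Update: only rows new or changed this trigger.
      // Append would emit nothing here until a watermark finalized a row.
      (table, table.filter { case (w, c) => !prev.get(w).contains(c) })
    }
  }

  def main(args: Array[String]): Unit = {
    val out = run(Seq(Seq("hello", "world", "hello"), Seq("hello", "spark")))
    out.zipWithIndex.foreach { case ((complete, update), i) =>
      println(s"Batch $i complete: $complete, update: $update")
    }
  }
}
```

On the second batch, Complete re-emits all three words while Update emits only (hello, 3) and (spark, 1), since (world, 1) is unchanged.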

A common pitfall is trying to use Append mode on an aggregation without a watermark (Spark refuses to start the query), or expecting Append mode to re-emit a row whose count later changes; by design, it won’t. Append mode is brilliant for stateless operations or sinks that only care about new events, but for aggregations where you need the full, current picture, Complete or Update are necessary. You’ll often find yourself reaching for Update mode to get Complete mode’s accuracy with row-level writes and far less output per trigger.
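As a rule of thumb, the mode checks boil down to something like this simplified sketch (my own boiled-down version, not Spark’s actual UnsupportedOperationChecker, which also handles joins, flatMapGroupsWithState, and more):

```scala
sealed trait Mode
case object Append extends Mode
case object Update extends Mode
case object Complete extends Mode

object ModeCheck {
  // Does this combination of (has aggregation?, has watermark?) support
  // the requested output mode? A simplification of the engine's rules.
  def supported(hasAgg: Boolean, hasWatermark: Boolean, mode: Mode): Boolean =
    mode match {
      case Complete => hasAgg                  // Complete needs an aggregation to snapshot
      case Update   => true                    // works with or without aggregation
      case Append   => !hasAgg || hasWatermark // an aggregation needs a watermark
    }
}
```

So a plain projection or filter supports Append and Update but not Complete, while our unwatermarked word count supports Update and Complete but not Append.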

The next thing you’ll run into is handling stateful operations across triggers, especially when dealing with watermarks.
