Spark Structured Streaming’s event-time processing can feel like a magic trick, but the real magic is how it handles out-of-order events without you needing to write a single if statement for it.
Let’s watch it in action. Imagine a simple Structured Streaming job that counts word occurrences. Normally, Spark processes data in the order it arrives. But with event time, we tell Spark to care about the timestamp embedded within the data itself, not when Spark sees the data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
val spark = SparkSession.builder.appName("EventTimeProcessing").getOrCreate()
import spark.implicits._
// Simulate incoming data with event timestamps
val rawData = spark.readStream
  .format("rate") // A built-in source for generating data with timestamps
  .option("rowsPerSecond", 1)
  .load()

// Extract the timestamp from the generated data and create a "word" column
val processedData = rawData
  .withColumn("event_timestamp", $"timestamp") // Use the source's timestamp as our event time
  .withColumn("word", concat(lit("word_"), ($"value" % 3).cast("string"))) // Create some sample words

// Group by word and count, using event time for windowing
val wordCounts = processedData
  .withWatermark("event_timestamp", "10 seconds") // Allow for 10 seconds of late data
  .groupBy(
    window($"event_timestamp", "5 seconds", "2 seconds"), // Window of 5 seconds, sliding every 2 seconds
    $"word"
  )
  .count()

// Start the streaming query, writing to the console
val query = wordCounts.writeStream
  .outputMode("update") // Output only rows that have changed
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds")) // Process new data every 5 seconds
  .start()

query.awaitTermination()
In this example, rawData is our simulated stream. We extract the timestamp field (which is the generation time for the rate source, but we’ll treat it as our event time) and create a word column. The crucial part is withWatermark("event_timestamp", "10 seconds"). This tells Spark: "Track the maximum event_timestamp you have seen so far, and treat anything more than 10 seconds older than that maximum as late, potentially discarding it." Then, we group by a sliding window (window($"event_timestamp", "5 seconds", "2 seconds")) that is 5 seconds long and slides every 2 seconds, so each event can contribute to several overlapping windows. In update mode, each trigger emits the counts that changed for any window an event touched; a window’s result is only finalized once the watermark moves past the window’s end.
The mental model here is about watermarks and windows. Think of a watermark as a clock hand that moves forward based on the event time of the data arriving: concretely, it trails the maximum event time seen so far by the allowed lateness. When the watermark passes the end of a window, Spark knows that no more data for that window will arrive (within the allowed lateness), and it can safely emit the results for that window and clean up its state. The trigger defines how often Spark checks for new data and advances its internal state, but the event time processing itself is governed by the watermarks.
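To make that mental model concrete, here is a tiny plain-Scala sketch of the watermark rule. The names WatermarkTracker, observe, and canClose are illustrative, not Spark APIs; this is only a sketch of the arithmetic, not Spark's internals.

```scala
// Minimal sketch of the watermark rule, assuming event times in whole seconds:
// watermark = (maximum event time seen so far) - (allowed delay)
case class WatermarkTracker(delay: Long, maxEventTime: Long = Long.MinValue) {
  // Advance the tracker with a new event; out-of-order events never move it backwards
  def observe(eventTime: Long): WatermarkTracker =
    copy(maxEventTime = math.max(maxEventTime, eventTime))

  def watermark: Long =
    if (maxEventTime == Long.MinValue) Long.MinValue else maxEventTime - delay

  // A window [start, end) can be finalized once the watermark reaches its end
  // (Spark's exact boundary handling may differ slightly)
  def canClose(windowEnd: Long): Boolean = watermark >= windowEnd
}
```

Observing events at seconds 20 and then 12 (out of order) with a 10-second delay leaves the watermark at 10: the late event at 12 does not pull the clock hand backwards.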
The "10 seconds" in withWatermark("event_timestamp", "10 seconds") is the critical lever. It defines the maximum lateness Spark will tolerate. If you set it too low, you might drop legitimate late data. If you set it too high, your state might grow excessively large, and results will be delayed. The window definition ("5 seconds", "2 seconds") dictates the granularity of your aggregations. A 5-second window means you’re interested in counts within any 5-second span of event time. A 2-second slide means you get these 5-second counts every 2 seconds, creating overlapping windows.
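The overlap is easy to see with a little arithmetic. This hypothetical helper (windowsFor is not a Spark API) lists the sliding windows an event time falls into, using integer seconds for simplicity:

```scala
// Which [start, start + length) windows cover an event at integer second t,
// given that window starts are aligned to multiples of `slide`?
def windowsFor(t: Long, length: Long, slide: Long): Seq[(Long, Long)] =
  ((t - length + 1) to t)                  // any covering window starts in this range
    .filter(s => s >= 0 && s % slide == 0) // keep only slide-aligned starts
    .map(s => (s, s + length))
```

With a 5-second window and 2-second slide, an event at second 6 falls into the three overlapping windows [2, 7), [4, 9), and [6, 11), which is why each event can update more than one count.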
The counterintuitive part is how Spark manages the state for these windows. It doesn’t just keep a simple running count for each window as it opens. Instead, it maintains keyed internal state that can absorb events arriving out of order. When an event arrives, Spark determines which window(s) it belongs to based on its event_timestamp. If the watermark hasn’t passed the end of a window yet, Spark updates the relevant aggregation (e.g., increments the count). If the watermark has already passed the end of every window the event falls into, the event is considered too late and is dropped; the watermark delay itself is the grace period. This state management is what allows Spark to produce accurate results even when the incoming data is jumbled.
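That behavior can be sketched in a few lines of plain Scala. WindowedCounter, ingest, and countFor are hypothetical names, and this is a deliberate simplification of Spark's state store, but it shows the core decision: update every still-open window the event belongs to, or drop the event once the watermark has closed them all.

```scala
import scala.collection.mutable

// Sketch of out-of-order state management (not Spark's internal code):
// one count per (windowStart, word), updated only while the watermark allows.
class WindowedCounter(length: Long, slide: Long, delay: Long) {
  private val counts = mutable.Map.empty[(Long, String), Long].withDefaultValue(0L)
  private var maxEventTime = Long.MinValue

  def watermark: Long =
    if (maxEventTime == Long.MinValue) Long.MinValue else maxEventTime - delay

  // Returns true if the event updated state, false if it was dropped as too late
  def ingest(eventTime: Long, word: String): Boolean = {
    maxEventTime = math.max(maxEventTime, eventTime)
    val starts = ((eventTime - length + 1) to eventTime)
      .filter(s => s >= 0 && s % slide == 0)     // slide-aligned window starts
    val live = starts.filter(s => s + length > watermark) // windows not yet finalized
    live.foreach(s => counts((s, word)) += 1)
    live.nonEmpty
  }

  def countFor(windowStart: Long, word: String): Long = counts((windowStart, word))
}
```

An event at second 6 is accepted while the watermark is still low; replay the same event after one at second 30 has pushed the watermark to 20, and it is dropped because all of its windows ended at or before second 11.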
The next concept you’ll grapple with is exactly-once semantics and how watermarking interacts with fault tolerance.