Spark Streaming’s executor memory tuning, specifically around Garbage Collection (GC) and spilling, is often misunderstood because the system appears to just "slow down" or "drop data" without a clear indication of why.
The core issue is that Spark Streaming processes an unbounded stream, so its executors must manage a constantly growing amount of data in memory. When this memory pressure becomes too high, either the Java Virtual Machine (JVM) on an executor spends too much time doing garbage collection, or it has to offload data to disk (spill). Both scenarios cripple throughput and introduce latency.
## Common Causes and Fixes
- **Insufficient Executor Memory**: This is the most frequent culprit. Executors simply don't have enough RAM to hold the data they're processing and the intermediate state.
  - Diagnosis: Check the `spark.executor.memory` setting in your Spark configuration, and observe the Storage Memory column in the Spark UI's "Executors" tab. If storage memory is consistently near its limit, or execution memory is being pushed out, you're likely memory-starved. Check executor logs for `OutOfMemoryError` or frequent, long GC pauses.
  - Fix: Increase `spark.executor.memory`. For example, if you were using `4g`, try `8g`. You might also need to raise `spark.driver.memory` if the driver is under pressure, though executor memory is usually the bottleneck for streaming.
  - Why it works: More RAM means the JVM can hold more live data, reducing the need to constantly reclaim memory through GC. It also provides more headroom for execution tasks before they resort to spilling.
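As a concrete starting point, the relevant settings can go in `spark-defaults.conf` (the sizes below are illustrative only; the right values depend on your cluster and workload):

```properties
# spark-defaults.conf -- illustrative sizes, tune for your cluster
spark.executor.memory          8g
spark.executor.memoryOverhead  1g
spark.driver.memory            4g
```

Note that `spark.executor.memoryOverhead` covers off-heap allocations (shuffle buffers, native memory); if YARN or Kubernetes is killing executors for exceeding container limits, raising it can matter as much as raising the heap itself.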
- **Storage Squeezing Execution Memory**: If you're caching a lot of RDDs/DataFrames (e.g., for lookups), cached blocks can crowd out execution memory within Spark's unified memory pool.
  - Diagnosis: In the Spark UI's "Executors" tab, compare Storage Memory usage against the total available. If cached data is consistently occupying most of the unified pool and execution tasks are frequently spilling, this is the problem.
  - Fix: Lower `spark.memory.storageFraction` (default `0.5`), which sets the slice of the unified pool that cached blocks hold without being evicted; try `0.3`. If overall heap pressure is high, you can also lower `spark.memory.fraction` (default `0.6`) to leave more heap for user data structures and JVM overhead.
  - Why it works: `spark.memory.fraction` caps the proportion of heap that Spark's unified memory manager uses for execution and storage combined, while `spark.memory.storageFraction` determines how much of that pool is protected from eviction by execution. Shrinking the protected storage slice lets execution tasks reclaim memory from cached blocks instead of spilling.
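A minimal config fragment for this adjustment might look like the following (values are a reasonable starting point, not a recommendation for every workload):

```properties
# Keep the overall unified pool at its default size...
spark.memory.fraction         0.6
# ...but shrink the eviction-protected storage slice from 0.5 to 0.3,
# so execution can evict cached blocks under pressure instead of spilling.
spark.memory.storageFraction  0.3
```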
- **Inefficient Serialization (Kryo vs. Java)**: If your data objects are large and complex, the default Java serializer can be slow and generate large intermediate objects, exacerbating memory pressure.
  - Diagnosis: Check your Spark configuration for `spark.serializer`. If it's `org.apache.spark.serializer.JavaSerializer` and your data contains custom objects or large collections, this is a prime suspect. Monitor GC times in the Spark UI's "Executors" tab; long GC pauses often correlate with serialization overhead.
  - Fix: Switch to Kryo serialization and register your custom classes. Set `spark.serializer=org.apache.spark.serializer.KryoSerializer` and `spark.kryo.registrator=com.yourcompany.YourKryoRegistrator`. In `YourKryoRegistrator`, call `kryo.register(...)` for each custom class.
  - Why it works: Kryo is generally faster and produces smaller serialized output than Java serialization, reducing the amount of data that needs to be moved around and held in memory, thus easing GC pressure and reducing spill potential.
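A minimal registrator sketch in Scala (the package name and `SensorReading` class are placeholders; substitute your own types):

```scala
package com.yourcompany

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Placeholder domain class -- replace with your actual data types.
case class SensorReading(id: String, value: Double, ts: Long)

class YourKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register every custom class that crosses a shuffle or is cached;
    // unregistered classes force Kryo to write full class names per record.
    kryo.register(classOf[SensorReading])
    kryo.register(classOf[Array[SensorReading]])
  }
}
```

Setting `spark.kryo.registrationRequired=true` in development is a useful way to surface classes you forgot to register, since Spark will fail fast instead of silently falling back to writing class names.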
- **Too Many Partitions for the Data Size**: A very high number of small partitions means more task overhead, more objects to manage, and potentially more GC work per executor.
  - Diagnosis: Examine the "Stages" tab in the Spark UI. Look for stages with a very large number of tasks (thousands) each processing relatively small amounts of data. Also check `spark.sql.shuffle.partitions` (for DataFrame operations) or `spark.default.parallelism` (for RDD operations).
  - Fix: Increase the number of records per partition by reducing the partition count. For DataFrame operations, set `spark.sql.shuffle.partitions` to a lower number, e.g., `200` if you previously had it at `2000`. For RDDs, use `rdd.repartition(numPartitions)` or, when only shrinking the count, the cheaper `rdd.coalesce(numPartitions)`.
  - Why it works: Fewer, larger partitions reduce the overhead of task scheduling and management. Each task also processes more data, leading to better locality and more efficient memory usage per unit of data processed.
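In code, the two RDD options differ in cost; a sketch, assuming `spark` is an active `SparkSession` and `rdd` an existing RDD:

```scala
// Lower the shuffle partition count for subsequent DataFrame operations
// (200 is illustrative -- size it so each partition holds ~100-200 MB).
spark.conf.set("spark.sql.shuffle.partitions", "200")

// coalesce merges existing partitions without a full shuffle -- cheap,
// but only valid for reducing the partition count.
val merged = rdd.coalesce(200)

// repartition triggers a full shuffle but redistributes data evenly,
// which is worth the cost if the existing partitions are badly skewed.
val balanced = rdd.repartition(200)
```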
- **Excessive State in Streaming Aggregations**: If your streaming job performs aggregations that maintain large amounts of state (e.g., `groupByKey`/`reduceByKey` over long windows with high-cardinality keys), this state can grow too large for executor memory.
  - Diagnosis: Monitor the "Streaming" tab in the Spark UI. Look at the stateful operators and their state size. If this size is growing uncontrollably or is very large, it's a strong indicator of state-related memory issues. Executors may also show high memory usage and GC activity.
  - Fix: Prune state actively. Use `mapGroupsWithState` or `flatMapGroupsWithState` with appropriate timeouts (`GroupStateTimeout`) to discard old state. For `updateStateByKey` (older DStream API), ensure your state update logic returns `None` for keys whose state should be dropped. If possible, reduce the window duration or key cardinality.
  - Why it works: By actively discarding old or irrelevant state, you prevent the state store from consuming all available executor memory, directly reducing GC pressure and the likelihood of spilling the state itself to disk.
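A sketch of timeout-based state pruning with Structured Streaming's `mapGroupsWithState`; the `Event` and `KeyState` types are hypothetical, and `events` is assumed to be a `Dataset[Event]` from a streaming source (with `spark.implicits._` in scope for the encoders):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical types for a per-key running count.
case class Event(key: String, ts: Timestamp)
case class KeyState(count: Long)

val counts = events
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout) {
    (key: String, iter: Iterator[Event], state: GroupState[KeyState]) =>
      if (state.hasTimedOut) {
        val last = state.getOption.map(_.count).getOrElse(0L)
        state.remove()                         // drop stale state to cap memory
        (key, last)
      } else {
        val updated = KeyState(state.getOption.map(_.count).getOrElse(0L) + iter.size)
        state.update(updated)
        state.setTimeoutDuration("30 minutes") // expire keys idle for 30 min
        (key, updated.count)
      }
  }
```

The essential part is the `state.remove()` call on timeout: without it, every key ever seen keeps a state entry forever, and the state store grows without bound.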
- **Large Shuffle Outputs / Wide Transformations**: Operations like `groupByKey`, `reduceByKey`, `join`, and `repartition` trigger shuffles. If the data being shuffled is large, or if many tasks are writing to a small number of executors, it can overwhelm executor memory and disk.
  - Diagnosis: The Spark UI's "Stages" tab shows shuffle read/write bytes per stage. Executor metrics may report disk spilling (`Shuffle Spill (Disk)`). Look for stages with a high number of tasks concentrated on a small number of executors.
  - Fix: Reduce the data that crosses the shuffle. Prefer operators that aggregate map-side before shuffling (`reduceByKey`, `aggregateByKey`, `treeAggregate`) over `groupByKey`. Increase `spark.sql.shuffle.partitions` if the partition count is too low for the data volume, producing overly large chunks per partition. Ensure `spark.shuffle.file.buffer` is appropriately sized (e.g., `32k` or `64k`).
  - Why it works: Reducing shuffle data volume means less data needs to be written to disk and read back. A larger number of shuffle partitions distributes the shuffle load more evenly, preventing individual executors from becoming overwhelmed.
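The map-side aggregation point can be seen in a small RDD sketch (assuming `pairs` is an `RDD[(String, Long)]`):

```scala
// groupByKey ships every individual value across the network before
// any aggregation happens -- maximal shuffle volume.
val perKeySlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values within each map-side partition first,
// so only one partial sum per key per partition crosses the shuffle.
val perKeyFast = pairs.reduceByKey(_ + _)
```

For a key appearing a million times per partition, the first version shuffles a million records per partition while the second shuffles one, which is why replacing `groupByKey` with a combining operator is usually the single biggest shuffle win.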
Once these are fixed, the next bottleneck you'll likely encounter is network throughput or, if you're still pushing boundaries, data loss from unrecoverable task failures.