Spark Streaming’s receiver parallelism is the bottleneck you didn’t know you had, and increasing it doesn’t just boost throughput; it fundamentally changes how Spark handles incoming data.

Let’s see it in action. Imagine you’re ingesting from Kafka. Your Spark Streaming job is configured with spark.streaming.kafka.maxRatePerPartition.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val sparkConf = new SparkConf().setAppName("KafkaReceiverParallelism").setMaster("local[*]")

// This is the key setting for receiver parallelism.
// It caps how much data can be pulled from each Kafka partition:
// records per SECOND, per partition.
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "1000")

val streamingContext = new StreamingContext(sparkConf, Seconds(5)) // Batch interval of 5 seconds

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka-broker-1:9092,kafka-broker-2:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-spark-streaming-group",
  "auto.offset.reset" -> "earliest"
)

val topics = List("my-topic")

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)

// Your processing logic here
stream.map(record => record.value()).foreachRDD(rdd => rdd.foreach(println))

streamingContext.start()
streamingContext.awaitTermination()

Here, spark.streaming.kafka.maxRatePerPartition is the critical knob. It doesn’t directly set the number of consumer tasks; it caps how many records per second can be read from each Kafka partition. The per-batch ceiling is therefore partitions × rate × batch interval: with 10 Kafka partitions, maxRatePerPartition set to 1000, and a 5-second batch interval, Spark will pull at most 10 × 1000 × 5 = 50,000 records per batch for that topic.
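Because maxRatePerPartition is expressed in records per second per partition, the batch interval enters the per-batch ceiling. A minimal sketch of the arithmetic (the helper maxRecordsPerBatch is ours, not a Spark API):

```scala
// Per-batch ingest ceiling imposed by spark.streaming.kafka.maxRatePerPartition.
// The rate is records per SECOND per Kafka partition, so the batch interval
// multiplies into the total.
def maxRecordsPerBatch(kafkaPartitions: Int,
                       maxRatePerPartition: Int,
                       batchIntervalSeconds: Int): Long =
  kafkaPartitions.toLong * maxRatePerPartition * batchIntervalSeconds

// 10 partitions, 1000 records/sec/partition, 5-second batches:
maxRecordsPerBatch(10, 1000, 5) // 50000
```

Shrinking the batch interval lowers the per-batch cap but not the sustained rate; the rate limit itself is per second, not per batch.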

The real magic happens when you understand how Spark maps these Kafka partitions to its own internal RDD partitions. With createDirectStream, Spark doesn’t spin up long-running receiver threads at all. Instead, each batch schedules short-lived consumer tasks, one per Kafka partition of the topics you subscribe to, and those tasks read their assigned offset ranges in parallel. The number of these tasks is determined by how many Kafka partitions your topic has.

The spark.streaming.kafka.maxRatePerPartition setting is your primary lever. It tells Spark, "For each Kafka partition I’m reading from, don’t fetch more than X records per second." If your Kafka topic has N partitions, your batch interval is T seconds, and you set maxRatePerPartition to M, Spark will attempt to pull up to N × M × T records per batch. Ingest parallelism scales with N, because Spark reads all N partitions concurrently; M only bounds how much each of those concurrent reads can fetch.
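In practice you rarely want a single hard-coded rate. Spark’s backpressure mechanism can adjust the per-partition rate dynamically based on observed processing times, with maxRatePerPartition acting as the hard ceiling. A sketch (the property names are real Spark settings; the values are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RateLimitedIngest")
  // PID-based rate estimator adapts the fetch rate each batch to keep
  // processing time under the batch interval
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard ceiling: never exceed 1000 records/sec/partition, even when
  // the estimator would allow more
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // The first batch has no rate estimate yet; bound it separately
  .set("spark.streaming.backpressure.initialRate", "500")
```

With backpressure enabled, maxRatePerPartition becomes a safety valve rather than the steady-state rate, which is usually what you want for bursty topics.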

Consider the scenario where you have a heavily loaded Kafka topic with many partitions. If your maxRatePerPartition is too low, each receiver task will be starved, even if your Spark executors have plenty of capacity. Conversely, if it’s too high, you might overwhelm your Kafka brokers or experience uneven processing if some partitions are significantly larger than others. The optimal value depends on your Kafka throughput, network bandwidth, and the processing power of your Spark executors.

The key insight is that Spark Streaming’s receiver parallelism isn’t a static configuration like spark.default.parallelism. It emerges from the interplay between the number of Kafka partitions and the maxRatePerPartition setting. With the direct stream, each Kafka partition is consumed by its own task within a batch: 20 Kafka partitions for a topic means up to 20 concurrent consumer tasks for that topic, each respecting the maxRatePerPartition limit. Raise maxRatePerPartition and each of those tasks can pull more data, increasing the overall ingest rate.
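You can observe this one-task-per-partition mapping directly: each batch’s RDD carries the Kafka offset ranges it covers. A sketch using the HasOffsetRanges API from spark-streaming-kafka-0-10 (the stream value is assumed from the setup above):

```scala
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // One OffsetRange per Kafka partition == one consumer task per partition
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { or =>
    // untilOffset - fromOffset is what this batch pulled from this partition,
    // capped by maxRatePerPartition × batch interval
    println(s"topic=${or.topic} partition=${or.partition} " +
            s"records=${or.untilOffset - or.fromOffset}")
  }
}
```

Logging these ranges per batch is also the standard way to verify whether the rate limit, rather than the topic’s actual traffic, is what bounds your ingest.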

What most people miss is the mapping from Kafka partitions to RDD partitions: with the direct stream, the batch RDD has exactly one partition per Kafka partition, regardless of maxRatePerPartition. If you have 10 Kafka partitions, each batch RDD has 10 partitions; maxRatePerPartition caps how many records land in each one, not how many partitions exist. If you then run .repartition(50) on that RDD, you’re increasing the processing parallelism after the data has been ingested. The receiver parallelism is about how much data hits the Spark cluster initially.
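A sketch of keeping the two levels of parallelism separate (again assuming the stream value from the setup above):

```scala
stream.foreachRDD { rdd =>
  // Ingest parallelism: one RDD partition per Kafka partition
  println(s"ingest partitions = ${rdd.getNumPartitions}")

  // Processing parallelism: widen AFTER ingest when per-record work is heavy.
  // repartition() triggers a shuffle, so it only pays off when the
  // processing cost outweighs the shuffle cost.
  val widened = rdd.repartition(50)
  widened.foreach { record =>
    // expensive per-record work here
  }
}
```

If the per-record work is cheap, skip the repartition and let the 10 ingest partitions flow straight through; the shuffle would cost more than it saves.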

The next concept you’ll wrestle with is how to ensure your downstream processing parallelism matches your ingest parallelism.

Want structured learning?

Take the full Spark Streaming course →