Running multiple Structured Streaming queries simultaneously isn’t just about launching more jobs; it’s about orchestrating resource sharing to avoid contention and ensure smooth operation.

Let’s see this in action. Imagine we have two distinct streaming applications, one processing clickstream data from Kafka and another aggregating sensor readings, also from Kafka.

// Imports needed when not running in spark-shell
import org.apache.spark.sql.functions.count
import spark.implicits._ // enables the $"col" column syntax

// App 1: Clickstream Processing
val clickstreamKafkaParams = Map[String, String](
  "kafka.bootstrap.servers" -> "kafka-broker-1:9092,kafka-broker-2:9092",
  "subscribe" -> "clickstream_topic"
)
val clickstreamStream = spark.readStream
  .format("kafka")
  .options(clickstreamKafkaParams)
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as click_data")
  .writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode("append")
  .queryName("clickstream_processor")
  .start()

// App 2: Sensor Data Aggregation
val sensorKafkaParams = Map[String, String](
  "kafka.bootstrap.servers" -> "kafka-broker-1:9092,kafka-broker-2:9092",
  "subscribe" -> "sensor_topic"
)
val sensorStream = spark.readStream
  .format("kafka")
  .options(sensorKafkaParams)
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS DOUBLE) as sensor_reading")
  .groupBy($"key")
  .agg(count("*").as("reading_count"))
  .writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode("update")
  .queryName("sensor_aggregator")
  .start()

// Keep the application running
spark.streams.awaitAnyTermination()

Here, each spark.readStream call defines an independent Kafka source, and each writeStream ... start() call launches a separate streaming query. Each query carries its own Kafka connection parameters and subscribes to a different topic. The queryName is crucial for identification, both in the Spark UI and through the spark.streams query manager. The awaitAnyTermination() call blocks the driver until any one of the queries terminates or fails, keeping the application alive while both streams process data concurrently.

The core problem multiple streaming queries solve is the need to handle distinct data sources or processing pipelines within a single Spark application without them interfering with each other. Structured Streaming’s micro-batch (or continuous) processing model can be applied to multiple independent sources, allowing parallel ingestion and processing. Each readStream call sets up a new, independent source reader and query plan.

Internally, Spark manages these queries within the same SparkSession. The SparkSession provides the execution environment, including the SparkContext for distributed computation. When you start multiple streams, Spark’s scheduler distributes the processing tasks for each micro-batch (or continuous chunk) across the available executors. The key is that each stream’s data source, transformations, and sink are defined independently, so Spark treats them as separate logical jobs within the same physical application.

The resource allocation is where things get interesting. By default, Spark will try to use its available CPU cores and memory to process all active streams. If you have a single Spark application with multiple streams, the executors will be shared. If one stream is particularly CPU-intensive or generates very large micro-batches, it can starve other streams of resources, leading to increased latency or even failures.
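One commonly used mitigation is Spark’s FAIR scheduler with a pool per query, so a heavy stream cannot monopolize the executors. A minimal sketch follows; clickstreamDf and aggregatedSensorDf stand in for the streaming DataFrames built earlier, and the pool names are illustrative assumptions. The pool is set as a thread-local property before start(), and the query’s micro-batch jobs inherit it.

```scala
import org.apache.spark.sql.SparkSession

// Enable fair scheduling when building the session (the default is FIFO)
val spark = SparkSession.builder()
  .appName("multi-stream-app")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// Set the pool on the thread that starts each query; the query's
// micro-batch jobs run in that pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "clickstream_pool")
val clickQuery = clickstreamDf.writeStream
  .format("console")
  .queryName("clickstream_processor")
  .start()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "sensor_pool")
val sensorQuery = aggregatedSensorDf.writeStream
  .format("console")
  .outputMode("update")
  .queryName("sensor_aggregator")
  .start()
```

Per-pool weights and minimum shares can additionally be defined in an XML allocation file referenced via spark.scheduler.allocation.file, if simple even sharing isn’t enough.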

When multiple Kafka sources point to the same Kafka cluster and broker list, Spark tracks offsets for each query independently: each query records the offsets it has processed in its own offset log rather than relying on Kafka consumer-group commits. However, if multiple streams are writing to the same downstream sink (e.g., a single database table or a shared file directory), you must ensure that the sink can handle concurrent writes or implement a strategy to serialize access. For the console sink shown above, both queries actually write to the same driver stdout; their micro-batch outputs are simply interleaved, which is harmless for debugging but a reminder that it is the sink, not Spark, that determines whether concurrent writers are safe.

The most common pitfall when running multiple streaming queries is forgetting to configure separate checkpointLocations for each query if they are performing stateful operations (like groupBy, agg, join with a streaming dataset, or mapGroupsWithState). If multiple stateful queries share the same checkpointLocation, Spark’s state management will become corrupted, leading to unexpected behavior or job failures. Each query needs its own dedicated checkpoint directory to maintain its independent processing state.
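Concretely, each writeStream gets its own checkpointLocation. The directory paths below are placeholders; in production they would point at durable shared storage such as HDFS or S3, and clickstreamDf / aggregatedSensorDf again stand in for the streaming DataFrames built earlier.

```scala
// Each query checkpoints to its own directory; sharing one directory
// between queries corrupts offset tracking and state.
val clickQuery = clickstreamDf.writeStream
  .format("console")
  .option("checkpointLocation", "/checkpoints/clickstream_processor")
  .queryName("clickstream_processor")
  .start()

val sensorQuery = aggregatedSensorDf.writeStream
  .format("console")
  .outputMode("update")
  .option("checkpointLocation", "/checkpoints/sensor_aggregator")
  .queryName("sensor_aggregator")
  .start()
```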

The next challenge you’ll likely encounter is managing the lifecycle and monitoring of these individual streaming queries within your application.
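As a preview, the StreamingQuery handles and the spark.streams manager already expose the basics. A sketch, assuming the two queries above are running:

```scala
// Enumerate all active queries in this SparkSession
spark.streams.active.foreach { q =>
  println(s"${q.name}: isActive=${q.isActive}")
  println(q.status)       // what the query is doing in the current trigger
  println(q.lastProgress) // metrics from the most recent micro-batch
                          // (null until the first batch completes)
}

// Stop one query without taking down the others
spark.streams.active
  .find(_.name == "sensor_aggregator")
  .foreach(_.stop())
```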

Want structured learning?

Take the full Spark Streaming course →