Spark Streaming’s metrics system is more than a set of counters: it is built on an internal event bus that lets you tap into the running state of your streaming jobs.

Here’s what that looks like in practice. Imagine a Spark Streaming job processing an input stream and writing to an output sink.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("StreamingMetricsDemo").getOrCreate()

val inputDF = spark.readStream
  .format("rate")
  .load()

val outputDF = inputDF.withColumn("processed_time", current_timestamp())

outputDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

spark.streams.awaitAnyTermination()

When this job runs, Spark’s internal metrics system is constantly emitting events about its progress: how many records are read, how long each batch takes to process, how many records are written, the latency between records arriving and being processed, and so on. These events are published to various "sinks" that can consume them. The Prometheus sink is one such consumer, designed to translate these events into metrics that Prometheus can scrape.
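You can subscribe to this same event stream yourself with Spark’s `StreamingQueryListener`, which receives the progress events that metric sinks consume. A minimal sketch, building on the session created above:

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Subscribe to start/progress/termination events for all queries
// on this SparkSession.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query ${event.id} started")

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    // Each progress event describes one completed micro-batch.
    println(s"batch ${event.progress.batchId}: " +
      s"${event.progress.numInputRows} rows in " +
      s"${event.progress.durationMs.get("triggerExecution")} ms")

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query ${event.id} terminated")
})
```

This is the same information the Prometheus sink exports, so a listener is a quick way to see what the exported metrics will contain before wiring up scraping.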

The core problem Spark Streaming solves is processing unbounded data streams with micro-batching. The metrics system is how you observe and understand the performance and health of this micro-batching process. It breaks down the streaming pipeline into discrete batches, and for each batch, it generates a wealth of data points. The key levers you control are the micro-batch interval (via Trigger.ProcessingTime), the number of records processed per batch (implicitly via input rate and Spark’s configuration), and the complexity of your transformations. The metrics system provides the visibility into how these choices impact throughput, latency, and resource utilization.
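You can inspect how these levers play out per batch without any external sink by polling the query’s most recent progress report. A sketch, assuming `query` is the `StreamingQuery` returned by `.start()` in the example above:

```scala
// Poll the latest per-batch progress from a running query.
// lastProgress is null until the first batch completes.
val p = query.lastProgress
if (p != null) {
  println(s"batch ${p.batchId}: ${p.numInputRows} rows, " +
    s"${p.inputRowsPerSecond} rows/s in, " +
    s"${p.processedRowsPerSecond} rows/s out")
  // durationMs breaks the trigger down into phases such as "addBatch"
  println(s"trigger took ${p.durationMs.get("triggerExecution")} ms")
}
```

Comparing `inputRowsPerSecond` against `processedRowsPerSecond` over several batches is the quickest way to tell whether your trigger interval and transformation complexity are keeping up with the input rate.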

When you configure Spark to use the Prometheus metrics sink, it essentially registers a small HTTP server within your Spark application. This server exposes an endpoint (typically /metrics) that Prometheus can poll at regular intervals. Prometheus then scrapes these metrics, storing them in its time-series database. You can then query this data using PromQL and visualize it using tools like Grafana.

A frequently misunderstood point about Spark Streaming metrics is that many of the "latency" figures you see are end-to-end latencies, measured from when a record is generated by the source to when it’s processed by a Spark batch. That isn’t just how long Spark takes to process a batch; it includes the time the record spends waiting to be picked up from the source system.

Here’s a typical setup in spark-defaults.conf or via --conf arguments:

spark.metrics.conf=metrics.properties

And in metrics.properties. Note that Apache Spark itself (3.0+) ships a servlet-based sink, org.apache.spark.metrics.sink.PrometheusServlet, which exposes metrics on the driver UI port; standalone PrometheusSink implementations that listen on a dedicated port (or push to a Pushgateway) come from third-party packages:

*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource

This tells Spark to register the Prometheus servlet and expose JVM metrics alongside the streaming ones. Prometheus would then be configured to scrape http://<spark-driver-ip>:4040/metrics/prometheus (4040 is the default UI port).
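On the Prometheus side, a minimal scrape job might look like the following. The job name, host, port, and metrics_path are placeholders; adjust them to wherever your configured sink actually exposes metrics:

```yaml
# prometheus.yml — minimal scrape job (host/port/path are placeholders).
scrape_configs:
  - job_name: "spark-streaming"
    # Point this at your sink's endpoint; for Spark's built-in
    # PrometheusServlet that is the driver UI port (4040 by default)
    # with metrics_path /metrics/prometheus.
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["spark-driver:9091"]
```

One target per JVM is required if you also want executor-level metrics, since each process exposes its own endpoint.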

When you look at a dashboard for a Spark Streaming job, you’ll typically see graphs for:

  • spark_streaming_job_num_records_received_total: The cumulative number of records received by the streaming job. This is a counter that only goes up.
  • spark_streaming_job_num_records_processed_total: The cumulative number of records processed across all batches.
  • spark_streaming_job_batch_processing_time_ms: The duration each batch took to process. This is a histogram, so you’ll see cumulative counts per latency bucket.
  • spark_streaming_job_scheduling_delay_ms: The delay between when a batch was ready to be scheduled and when it actually started processing. This is crucial for identifying bottlenecks in the Spark scheduler itself.
  • spark_streaming_job_total_batches_processed_total: A counter for the total number of batches completed.

The Prometheus sink transforms Spark’s internal event bus messages into Prometheus-compatible metrics. For example, a batch-completion event carrying batchId=123, processingTime=500ms, numRecords=1000 would be translated into metrics like spark_streaming_job_batch_processing_time_ms_bucket{le="500.0"} 1 and spark_streaming_job_num_records_processed_total 1000 (a counter incremented by the batch’s record count). The _bucket suffix indicates that batch_processing_time_ms is a histogram.

A common pitfall is expecting scheduling_delay_ms to always be zero. It will naturally be non-zero when your processing time per batch exceeds your trigger interval, or when the cluster is overloaded and cannot schedule the next batch immediately. If scheduling_delay_ms is consistently high, it signals that Spark is struggling to keep up with the incoming data rate or that there are insufficient resources. The batch_processing_time_ms histogram will show you if individual batches are taking too long due to complex transformations or data skew.
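These two diagnostics translate directly into PromQL. Using the illustrative metric names from the list above (actual names depend on your sink and namespace configuration):

```promql
# Alert-worthy condition: average scheduling delay over 5m is growing,
# i.e. Spark cannot start batches as fast as the trigger fires.
rate(spark_streaming_job_scheduling_delay_ms_sum[5m])
  / rate(spark_streaming_job_scheduling_delay_ms_count[5m])

# Approximate p95 batch processing time from the histogram buckets;
# compare this against your trigger interval.
histogram_quantile(0.95,
  rate(spark_streaming_job_batch_processing_time_ms_bucket[5m]))
```

If the p95 processing time approaches or exceeds the trigger interval, rising scheduling delay is the expected consequence rather than a separate problem.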

The single most overlooked aspect of Spark Streaming metrics is how they relate to the source and sink operations within a micro-batch. Metrics like batch_processing_time_ms capture the time Spark spends in its own execution engine, but they don’t inherently include the time the source takes to fetch data for the batch or the time the output sink takes to commit the processed data. To get a true end-to-end latency, you often need to instrument your source and sink operations separately, or rely on a metric that explicitly measures it, such as spark_streaming_job_wall_clock_processing_time_ms, which attempts to capture the total time from data arrival at the source to completion of processing for that batch, including any source-specific delays.

The next concept you’ll grapple with is how to correlate these Spark Streaming metrics with external metrics from your data sources (like Kafka consumer lag) and sinks to get a complete picture of your pipeline’s health.
