The Spark Streaming Thrift Server exposes Spark Streaming queries as SQL tables, allowing standard SQL clients to query live data streams.

Let’s see it in action. Imagine you have a Kafka topic named iot_data with JSON messages like this:

{
  "deviceId": "sensor-123",
  "timestamp": 1678886400000,
  "temperature": 25.5,
  "humidity": 60.2
}
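Before writing any SQL, it helps to confirm the shape of this payload. Here is a quick plain-Python sanity check (no Spark required; the sample string below is just the message above on one line) that parses the payload and converts the millisecond epoch timestamp:

```python
import json
from datetime import datetime, timezone

# Sample payload matching the iot_data topic's message shape.
raw = '{"deviceId": "sensor-123", "timestamp": 1678886400000, "temperature": 25.5, "humidity": 60.2}'

event = json.loads(raw)

# The timestamp field is epoch milliseconds; convert to an aware datetime.
event_time = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)

print(event["deviceId"], event_time.isoformat(), event["temperature"], event["humidity"])
```

These four fields (deviceId, timestamp, temperature, humidity) are what you would later extract from the raw Kafka value in SQL.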

You want to query this stream using SQL. First, you need to start the Spark Streaming Thrift Server. On your Spark cluster, navigate to $SPARK_HOME/sbin and run:

./start-thriftserver.sh --master yarn --deploy-mode client --conf spark.sql.streaming.streamingTableEnabled=true

This command starts the Thrift server, connects it to your YARN cluster in client mode, and, crucially, enables the spark.sql.streaming.streamingTableEnabled configuration. This setting is what allows Spark SQL to recognize streaming DataFrames and treat them as queryable tables.

Now, you can connect to this Thrift server using any SQL client that speaks the Hive Thrift (HiveServer2) protocol, such as Beeline or DBeaver. (The spark-sql shell, by contrast, runs its own embedded session rather than connecting over Thrift, so it is not a Thrift client.) Let’s use Beeline, which ships with Spark:

./bin/beeline -u jdbc:hive2://localhost:10000

Once connected, you can define a streaming query. This involves creating a DataFrame from your Kafka topic and then registering it as a temporary view or table.

-- Create a DataFrame from the Kafka topic
CREATE TEMPORARY VIEW iot_stream
USING kafka
OPTIONS (
  kafka.bootstrap.servers = 'your_kafka_broker1:9092,your_kafka_broker2:9092',
  subscribe = 'iot_data',
  startingOffsets = 'earliest'
);

-- Define a streaming query on the view
CREATE TEMPORARY STREAMING VIEW processed_iot_data
AS SELECT
  CAST(value AS STRING) AS raw_data
FROM iot_stream;

-- Now you can query the streaming view like a regular table
SELECT * FROM processed_iot_data LIMIT 10;

When you run SELECT * FROM processed_iot_data LIMIT 10;, Spark SQL doesn’t just fetch 10 static rows. Instead, it starts a micro-batch and returns up to 10 new records that have arrived in the iot_data Kafka topic since the last query (or since the stream started, if it’s the first query). The LIMIT clause behaves differently here than in batch SQL: it caps the number of new records returned by that specific micro-batch execution, not the total number of records ever seen.
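To make the per-batch semantics concrete, here is a small plain-Python simulation (the class and method names are invented for illustration, not Spark APIs): each select() call stands in for one execution of SELECT ... LIMIT n, which only sees records that arrived since the previous call.

```python
from collections import deque

class MicroBatchView:
    """Toy model of a streaming view: each query drains only new records."""

    def __init__(self):
        self._pending = deque()  # records arrived but not yet queried

    def append(self, *records):
        self._pending.extend(records)

    def select(self, limit=None):
        # One micro-batch: take everything that arrived since the last query...
        batch = list(self._pending)
        self._pending.clear()
        # ...then apply LIMIT to this batch only, not to the total ever seen.
        return batch[:limit] if limit is not None else batch

view = MicroBatchView()
view.append("r1", "r2", "r3")
print(view.select(limit=2))   # first query: at most 2 of the 3 new records
view.append("r4")
print(view.select(limit=2))   # second query: only records since the last one
```

Note how the second query never re-returns r1 or r2: LIMIT constrains each micro-batch, not a cumulative result set.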

The core problem Spark Streaming Thrift Server solves is bridging the gap between the real-time, event-driven world of streaming data and the declarative, batch-oriented world of SQL. Traditionally, accessing streaming data involved writing complex imperative code in Spark Streaming APIs (DStreams or Structured Streaming DataFrame/Dataset API). This made it difficult for analysts or BI tools that were accustomed to SQL to interact with live data. The Thrift server exposes these streaming computations as standard SQL tables, allowing for ad-hoc querying, integration with BI tools, and simplified data exploration of live data feeds.

Internally, when spark.sql.streaming.streamingTableEnabled is true, Spark SQL’s catalog is extended to understand streaming sources. When you issue a SELECT statement against a temporary streaming view, Spark doesn’t execute a static query plan. Instead, it creates and starts a continuous streaming query in the background. Each time the SQL client requests data (e.g., by re-executing the SELECT statement or through a mechanism that polls the Thrift server for updates), Spark triggers a new micro-batch for the underlying streaming query, processes the new data, and returns the results of that micro-batch. The LIMIT clause or other conditions are applied to the results of each micro-batch.
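The flow described above can be sketched as a toy registry in plain Python (all names here are invented and purely illustrative, not Spark internals): registering a streaming view stores only its definition, and the first SELECT lazily starts a background query whose results are then polled one micro-batch at a time.

```python
class ToyStreamingCatalog:
    """Illustrative model of a catalog extended with streaming views."""

    def __init__(self):
        self._definitions = {}   # view name -> source list of records
        self._running = {}       # view name -> started query state (a cursor)

    def create_streaming_view(self, name, source):
        # CREATE TEMPORARY STREAMING VIEW: only the definition is stored.
        self._definitions[name] = source

    def select(self, name, limit=None):
        # The first SELECT lazily starts the continuous query for this view.
        if name not in self._running:
            self._running[name] = 0  # cursor: offset into the source
        cursor = self._running[name]
        # One micro-batch: process only records beyond the cursor.
        batch = self._definitions[name][cursor:]
        self._running[name] = cursor + len(batch)
        return batch[:limit] if limit is not None else batch

source = ["e1", "e2", "e3"]
catalog = ToyStreamingCatalog()
catalog.create_streaming_view("processed_iot_data", source)
print(catalog.select("processed_iot_data"))  # first batch: all existing records
source.append("e4")
print(catalog.select("processed_iot_data"))  # next batch: only the new record
```

The cursor plays the role of streaming state: each poll advances it, so re-executing the same SELECT yields only data that arrived in between.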

The exact levers you control are primarily in the CREATE TEMPORARY STREAMING VIEW statement. You define the schema, transformations, and windowing operations here, just as you would with the standard Spark Structured Streaming DataFrame API. The Thrift server then simply acts as the execution engine and access layer for these defined streaming computations. You can also control the batch interval and other streaming-specific configurations via Spark configuration properties passed to the start-thriftserver.sh script or within the SQL session itself.

What many people don’t realize is that the startingOffsets and endingOffsets options in the CREATE TEMPORARY VIEW for Kafka are crucial for controlling how the stream is read. If you omit startingOffsets, Spark defaults to latest, meaning you’ll only ever see data that arrives after the stream is started, potentially missing historical data if your goal is to process a backlog. Setting it to earliest ensures you process all available data from the beginning of the topic. These options are passed directly to the underlying Kafka source implementation.
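A toy model makes the difference easy to see (plain Python, invented names, not the Kafka client API): a topic that already holds a backlog, read once from each starting position.

```python
def read_stream(topic, starting_offsets="latest"):
    """Toy reader: returns a poll() function for one starting position.

    'earliest' replays the whole backlog; 'latest' skips it and only sees
    records appended after the stream starts.
    """
    start = 0 if starting_offsets == "earliest" else len(topic)

    def poll():
        nonlocal start
        batch = topic[start:]
        start = len(topic)
        return batch

    return poll

topic = ["old-1", "old-2"]          # backlog already in the topic

from_earliest = read_stream(topic, "earliest")
from_latest = read_stream(topic, "latest")

print(from_earliest())  # replays the backlog
print(from_latest())    # empty: the backlog is skipped

topic.append("new-1")               # a record arrives after start
print(from_earliest())  # sees the new record
print(from_latest())    # also sees it
```

After the first poll the two readers converge: both only see newly arriving records, which is why the choice of startingOffsets matters only for the backlog.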

The next concept you’ll likely encounter is managing stateful operations within your streaming SQL queries, such as aggregations over time windows, and how those states are persisted and accessed.

Want structured learning?

Take the full Spark Streaming course →