Kinesis Data Streams acts as a highly scalable, durable, and ordered streaming data service, while Kinesis Data Firehose is a fully managed service for delivering streaming data to destinations like S3, Redshift, and Splunk. Because Firehose buffers data before each delivery, it is near-real-time rather than strictly real-time. When you use Kinesis as a source for Firehose, you’re essentially setting up a pipeline where data lands in Kinesis and is then efficiently delivered to your chosen destination by Firehose.
Let’s see Kinesis and Firehose in action, processing a simple log stream.
Imagine we have a web server generating access logs. Instead of writing these logs directly to files, which can be cumbersome to manage and analyze in real-time, we can send them to Kinesis Data Streams.
1. Kinesis Data Stream Setup:
First, we create a Kinesis Data Stream. We’ll need to decide on the number of shards, which determines the throughput: each shard supports up to 1 MB/s (or 1,000 records/s) of writes and 2 MB/s of reads. For this example, let’s start with 2 shards.
aws kinesis create-stream --stream-name my-web-logs-stream --shard-count 2
Now, our my-web-logs-stream is ready to receive data.
2. Simulating Log Ingestion into Kinesis:
We can simulate logs arriving by putting records into the stream. Each record is a piece of data, typically JSON or binary.
aws kinesis put-record \
    --stream-name my-web-logs-stream \
    --partition-key "192.168.1.10" \
    --cli-binary-format raw-in-base64-out \
    --data '{"timestamp": "2023-10-27T10:00:00Z", "ip": "192.168.1.10", "method": "GET", "path": "/index.html", "status": 200}'
Here the client IP serves as the partition key, so each client’s events land on the same shard. Note that with AWS CLI v2, the --cli-binary-format raw-in-base64-out flag lets you pass --data as raw text; without it, the CLI expects a base64-encoded payload.
The --partition-key is crucial for distributing data across shards: all records with the same partition key map to the same shard, where they are stored in arrival order.
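Under the hood, Kinesis routes a record by taking the MD5 hash of its partition key as a 128-bit integer and sending it to the shard whose hash-key range contains that value. A minimal sketch of that routing rule, assuming the shards split the keyspace evenly (which is what create-stream produces by default):

```python
import hashlib

def shard_for_partition_key(partition_key: str, shard_count: int) -> int:
    # MD5-hash the partition key into a 128-bit integer, then find the
    # shard whose hash-key range contains it (even keyspace split assumed).
    hash_int = int.from_bytes(
        hashlib.md5(partition_key.encode("utf-8")).digest(), "big"
    )
    range_size = 2 ** 128 // shard_count
    return min(hash_int // range_size, shard_count - 1)

# The same key always routes to the same shard, which is exactly
# what preserves per-key ordering.
shard_a = shard_for_partition_key("192.168.1.10", 2)
shard_b = shard_for_partition_key("192.168.1.10", 2)
assert shard_a == shard_b
assert 0 <= shard_a < 2
```

This is why a partition key with low cardinality (a constant, say) would funnel all traffic into a single shard and leave the other shards idle.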
3. Firehose Delivery Stream Setup:
Next, we create a Firehose delivery stream that will read from our Kinesis Data Stream and deliver to an S3 bucket.
aws firehose create-delivery-stream \
    --delivery-stream-name my-web-logs-firehose \
    --delivery-stream-type KinesisStreamAsSource \
    --kinesis-stream-source-configuration \
    '{"KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-web-logs-stream", "RoleARN": "arn:aws:iam::123456789012:role/FirehoseKinesisRole"}' \
    --s3-destination-configuration \
    '{"RoleARN": "arn:aws:iam::123456789012:role/FirehoseKinesisRole", "BucketARN": "arn:aws:s3:::my-log-bucket-for-firehose", "Prefix": "raw-logs/", "ErrorOutputPrefix": "errors/", "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60}, "CompressionFormat": "GZIP"}'
- --delivery-stream-type KinesisStreamAsSource: Tells Firehose to consume from a Kinesis stream rather than accept direct PUTs.
- --kinesis-stream-source-configuration: Links Firehose to our Kinesis stream. The RoleARN must allow Firehose to read from the stream.
- --s3-destination-configuration: Specifies S3 as the destination. Its RoleARN must allow Firehose to write to the bucket.
- BucketARN: The S3 bucket where logs will be delivered.
- Prefix: A prefix for the S3 objects (e.g., raw-logs/). By default, Firehose automatically appends a UTC YYYY/MM/DD/HH/ path after this prefix, creating date-partitioned data. (Custom !{timestamp:...} prefix expressions are available with the extended S3 configuration.)
- ErrorOutputPrefix: Where records that fail delivery are written.
- BufferingHints: Firehose buffers data before writing to S3. Here, it buffers up to 1 MB or for 60 seconds, whichever comes first.
- CompressionFormat: We’re compressing logs with GZIP.
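The buffering rule, size or interval, whichever comes first, can be modeled in a few lines. A toy sketch (not the real implementation) of how a flush decision works:

```python
class BufferModel:
    # Toy model of Firehose BufferingHints: flush when either the
    # buffered byte count or the elapsed time crosses its threshold.
    def __init__(self, size_limit_bytes=1 * 1024 * 1024, interval_seconds=60):
        self.size_limit_bytes = size_limit_bytes
        self.interval_seconds = interval_seconds
        self.buffered_bytes = 0
        self.buffer_age = 0.0

    def add(self, record_bytes: int, elapsed_seconds: float) -> bool:
        """Account for one incoming record; return True if it triggers a flush."""
        self.buffered_bytes += record_bytes
        self.buffer_age += elapsed_seconds
        if (self.buffered_bytes >= self.size_limit_bytes
                or self.buffer_age >= self.interval_seconds):
            self.buffered_bytes = 0
            self.buffer_age = 0.0
            return True
        return False

buf = BufferModel(size_limit_bytes=1024, interval_seconds=60)
assert buf.add(512, 1.0) is False   # neither limit reached yet
assert buf.add(512, 1.0) is True    # size limit hit -> flush
assert buf.add(10, 61.0) is True    # interval limit hit -> flush
```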
4. Data Flow:
Once configured, Firehose automatically starts consuming records from my-web-logs-stream. As logs (like the one we simulated) arrive in Kinesis, Firehose picks them up. It buffers them according to the BufferingHints, compresses them, and then writes them as objects to s3://my-log-bucket-for-firehose/raw-logs/YYYY/MM/DD/HH/firehose-stream-name-YYYY-MM-DD-HH-MM-SS-random-string.gz.
The Mental Model:
At its core, this pattern decouples data producers from data consumers. Your web servers (or any other application) simply push data to Kinesis, a highly available and durable buffer. They don’t need to know about S3, Redshift, or any other downstream system. Kinesis handles the immediate concerns of buffering and durability. Firehose then takes over the responsibility of reliably delivering that data to your chosen destinations, transforming it (e.g., compression, minor format changes) and handling batching and retries. This is particularly powerful for handling spiky traffic, as Kinesis can absorb large bursts of data that might overwhelm a direct connection to a destination.
The BufferingHints in Firehose are critical for balancing latency and cost. Smaller buffer sizes mean data arrives in your destination faster but results in more, smaller files in S3, which can increase S3 request costs and make downstream processing slightly less efficient. Larger buffer sizes reduce costs and S3 operations but increase latency.
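A quick back-of-envelope calculation makes the tradeoff concrete (the throughput figure below is invented for illustration):

```python
def objects_per_hour(throughput_mb_per_s: float, size_in_mbs: float,
                     interval_in_seconds: float) -> float:
    # Time to fill the buffer at this throughput, capped by the interval:
    seconds_per_flush = min(size_in_mbs / throughput_mb_per_s,
                            interval_in_seconds)
    return 3600 / seconds_per_flush

# At a steady 0.1 MB/s of logs:
assert round(objects_per_hour(0.1, 1, 60)) == 360   # 1 MB buffer fills every ~10 s
assert round(objects_per_hour(0.1, 5, 300)) == 72   # 5 MB buffer fills every ~50 s
```

Five times the buffer size means five times fewer S3 objects (and PUT requests) per hour, at the cost of roughly 40 extra seconds of delivery latency per batch.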
The Prefix in the S3 destination configuration is not just for organization; it determines how Firehose lays out your data in S3. The UTC date/time path Firehose appends after the prefix (or custom !{timestamp:...} prefix expressions, available with the extended S3 configuration) is a common and effective way to partition data for efficient querying with services like Athena or Redshift Spectrum.
When Firehose writes data to S3, it uses a combination of the stream name, date/time, and a unique ID to create distinct object names. This ensures that even if multiple Firehose streams write to the same prefix, their objects won’t collide. The actual object name might look something like raw-logs/2023/10/27/10/my-web-logs-firehose-2023-10-27-10-05-30-abcdef123456.gz.
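A sketch of how such an object key can be assembled. The exact suffix format is Firehose-internal; the random ID here is a stand-in:

```python
import uuid
from datetime import datetime, timezone

def s3_object_key(prefix: str, stream_name: str, now: datetime) -> str:
    # Default Firehose layout: prefix, UTC YYYY/MM/DD/HH path, then
    # stream name + timestamp + a unique suffix to avoid collisions.
    date_path = now.strftime("%Y/%m/%d/%H")
    ts = now.strftime("%Y-%m-%d-%H-%M-%S")
    unique = uuid.uuid4().hex[:12]  # stand-in for Firehose's internal ID
    return f"{prefix}{date_path}/{stream_name}-{ts}-{unique}.gz"

key = s3_object_key("raw-logs/", "my-web-logs-firehose",
                    datetime(2023, 10, 27, 10, 5, 30, tzinfo=timezone.utc))
assert key.startswith(
    "raw-logs/2023/10/27/10/my-web-logs-firehose-2023-10-27-10-05-30-")
assert key.endswith(".gz")
```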
The primary purpose of using Kinesis Data Streams as a source for Firehose is to add an intermediate buffering and durability layer. This allows producers to send data at their own pace without worrying about the availability or throughput of the final destination. Firehose then acts as a reliable conveyor belt, ensuring data eventually reaches its destination even if there are temporary issues with the destination service.
If you need to process data before it’s delivered to S3 by Firehose (e.g., complex transformations, enriching data), you’d typically use a Lambda function as a transformation step within the Firehose configuration.
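A minimal sketch of such a transformation Lambda, following the record contract Firehose uses for transformation functions (base64-encoded data in; recordId, result, and re-encoded data back out). The `processed` field is a made-up enrichment for illustration:

```python
import base64
import json

def handler(event, context):
    # Firehose invokes the function with a batch of base64-encoded records
    # and expects each one returned with a result of "Ok", "Dropped",
    # or "ProcessingFailed".
    output = []
    for record in event["records"]:
        log = json.loads(base64.b64decode(record["data"]))
        log["processed"] = True  # hypothetical enrichment step
        payload = (json.dumps(log) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}

# Local smoke test with a fake Firehose event:
fake_event = {"records": [{
    "recordId": "1",
    "data": base64.b64encode(b'{"status": 200}').decode("utf-8"),
}]}
result = handler(fake_event, None)
assert json.loads(base64.b64decode(result["records"][0]["data"]))["processed"] is True
```

Returning "Dropped" lets the function filter records out entirely, while "ProcessingFailed" routes them to the ErrorOutputPrefix in S3 for later inspection.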