The core innovation of event streaming isn’t just moving data; it’s making data a persistent, replayable log of immutable events that anything can subscribe to.
Let’s look at a common pattern: Log Aggregation. Imagine you have dozens, hundreds, or even thousands of application servers, each spitting out logs. Instead of each server pushing logs to a central place (which is brittle and scales poorly), we can have them publish their logs as events to a Kafka topic.
Here’s a simplified producer code snippet in Java for Kafka:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

String topic = "application-logs";
String key = "user_service_instance_1"; // Or a unique identifier for the log source
String logMessage = "2023-10-27 10:00:01 INFO user_service: User 'alice' logged in successfully.";

// try-with-resources ensures buffered records are flushed and the producer is closed
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>(topic, key, logMessage));
}
On the other side, consumers can subscribe to this application-logs topic. A log processing service might consume these events, parse them, and store them in a database or search index like Elasticsearch.
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092");
props.put("group.id", "log-processor-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("auto.offset.reset", "earliest"); // Read from the start of the log if no committed offset exists

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("application-logs"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("Received log: offset = %d, key = %s, value = %s%n",
                record.offset(), record.key(), record.value());
        // Process the log message here (e.g., parse, index into Elasticsearch)
    }
}
This pattern decouples log producers from log consumers. If your log processing service goes down, the Kafka brokers simply retain the incoming log events (subject to the topic’s retention policy). When the service restarts, it picks up exactly where it left off, thanks to consumer group offsets. If you need to add a new system to analyze logs (e.g., a real-time anomaly detection system), it can simply subscribe to the existing application-logs topic without impacting any other part of the system.
Kinesis offers a similar capability with its Data Streams service. The producer would use the Kinesis Producer Library (KPL) or SDK to put records into a stream.
// Example using AWS SDK for Java v2
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

KinesisClient kinesisClient = KinesisClient.builder()
        .region(Region.US_EAST_1)
        .build();

PutRecordRequest putRecordRequest = PutRecordRequest.builder()
        .streamName("application-logs-stream")
        .partitionKey("user_service_instance_1") // Similar to Kafka's key
        .data(SdkBytes.fromUtf8String("2023-10-27 10:00:01 INFO user_service: User 'alice' logged in successfully."))
        .build();

kinesisClient.putRecord(putRecordRequest);
Consumers in Kinesis would typically use the Kinesis Client Library (KCL) to read records from shards within the stream. The KCL handles shard iteration, fault tolerance, and checkpointing (analogous to Kafka’s consumer offsets).
The fundamental mental model shift is from "pushing data to a destination" to "emitting events to a central, immutable log." This log becomes the source of truth, enabling multiple, independent consumers to react to changes in real time or process historical data. It enables architectural patterns like Event Sourcing, CQRS, and stream processing without needing to build complex custom infrastructure for each.
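To make the "central, immutable log with independent consumers" model concrete, here is a toy in-memory sketch. This is not a real broker and every name here is invented for illustration; it only demonstrates that producers append, the log never changes, and each consumer tracks its own position:

```java
import java.util.ArrayList;
import java.util.List;

// Toy append-only log (illustrative only, not a real broker).
class EventLog {
    private final List<String> events = new ArrayList<>();

    void append(String event) { events.add(event); }  // producers only ever append

    // Existing entries are never modified; readers fetch by position.
    String read(int offset) {
        return offset < events.size() ? events.get(offset) : null;
    }
}

// Each consumer keeps its own offset, analogous to a committed consumer-group offset.
class LogConsumer {
    private final EventLog log;
    private int offset = 0;

    LogConsumer(EventLog log) { this.log = log; }

    String poll() {
        String event = log.read(offset);
        if (event != null) offset++;  // advance only this consumer's position
        return event;
    }
}
```

Because offsets live with the consumer rather than the log, a new consumer created at any time starts at offset 0 and replays the full history, while existing consumers are unaffected. That independence is exactly what lets you bolt a new analytics system onto an existing topic.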
What’s often misunderstood is how the partitionKey (in Kinesis) or key (in Kafka) directly influences throughput and ordering guarantees. For logs, distributing records across partitions based on the source server’s identifier ensures that all logs from a single server are processed in order by a single consumer instance within a consumer group. However, if you use a key that results in uneven distribution (e.g., a static key for all messages), you’ll create a bottleneck on a single partition and its associated consumer.
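The skew effect is easy to demonstrate with plain hashing. The sketch below is illustrative only: Kafka’s default partitioner actually uses a murmur2 hash of the key bytes, but any deterministic hash-modulo-partitions scheme shows the same behavior, namely that many distinct keys spread records across partitions while a single static key pins every record to one partition:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionSkewDemo {
    // Illustrative stand-in for a key-based partitioner (Kafka really uses murmur2).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);  // floorMod avoids negative results
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        Map<Integer, Integer> counts = new HashMap<>();

        // Per-server keys: records spread across many partitions.
        for (int i = 0; i < 1000; i++) {
            String key = "server-" + (i % 50);
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        System.out.println("Distinct keys, partitions used: " + counts.size());

        // Static key: every record lands on the same partition, creating a hot spot.
        counts.clear();
        for (int i = 0; i < 1000; i++) {
            counts.merge(partitionFor("all-logs", numPartitions), 1, Integer::sum);
        }
        System.out.println("Static key, partitions used: " + counts.size());  // always 1
    }
}
```

With fifty distinct per-server keys, records typically land on all six partitions; with the static key, one partition (and the single consumer instance assigned to it) absorbs the entire load.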
The next logical step in this event-driven journey is often implementing Command Query Responsibility Segregation (CQRS), where different event streams or topics might represent commands and their resulting state changes.
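As a teaser for that next step, here is a minimal CQRS sketch. It is a toy with no broker, and all names (the event format, the `loginCounts` read model) are invented for illustration; it only shows the shape of the pattern: the command side validates and appends events, and a separate query side rebuilds its state by consuming the event stream:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal CQRS sketch (illustrative only): commands append events to a log;
// a separate read model consumes the log to answer queries.
public class LoginCqrsSketch {
    static final List<String> eventLog = new ArrayList<>();          // stand-in for an events topic
    static final Map<String, Integer> loginCounts = new HashMap<>(); // query-side read model

    // Command side: validate the command, then record the resulting event.
    static void handleLoginCommand(String user) {
        if (user == null || user.isEmpty()) return;  // reject invalid commands, no event emitted
        eventLog.add("UserLoggedIn:" + user);
    }

    // Query side: rebuild state by replaying the event stream from the start.
    static void project() {
        loginCounts.clear();
        for (String event : eventLog) {
            if (event.startsWith("UserLoggedIn:")) {
                String user = event.substring("UserLoggedIn:".length());
                loginCounts.merge(user, 1, Integer::sum);
            }
        }
    }
}
```

Because the read model is derived entirely from the log, it can be dropped and rebuilt at any time by replaying events, the same replayability the log aggregation pattern above relies on.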