SQS and Kafka are often lumped together as "message queues," but they solve fundamentally different problems, and choosing the wrong one can lead to significant architectural headaches.
Let’s see them in action. Imagine we have a web application that needs to process user sign-ups. When a new user registers, we want to send a welcome email, update a recommendation engine, and log the event.
SQS Scenario: Simple Fan-out with Decoupling
In an SQS setup, we’d likely have multiple SQS queues, one for each downstream service.
- SQS Queue for Emails:
  aws sqs create-queue --queue-name welcome-email-queue
- SQS Queue for Recommendations:
  aws sqs create-queue --queue-name recommendation-update-queue
- SQS Queue for Logging:
  aws sqs create-queue --queue-name user-log-queue
When a user signs up, our application code would publish messages to each of these queues:
# Sending welcome email message
aws sqs send-message --queue-url $(aws sqs get-queue-url --queue-name welcome-email-queue --output text --query 'QueueUrl') --message-body '{"userId": "user123", "email": "user@example.com"}'
# Sending recommendation update message
aws sqs send-message --queue-url $(aws sqs get-queue-url --queue-name recommendation-update-queue --output text --query 'QueueUrl') --message-body '{"userId": "user123", "action": "signup"}'
# Sending logging message
aws sqs send-message --queue-url $(aws sqs get-queue-url --queue-name user-log-queue --output text --query 'QueueUrl') --message-body '{"userId": "user123", "event": "signup_completed"}'
Each downstream service (email sender, recommendation engine, logger) would have its own worker process polling its respective SQS queue:
# Email worker polling
aws sqs receive-message --queue-url $(aws sqs get-queue-url --queue-name welcome-email-queue --output text --query 'QueueUrl') --max-number-of-messages 10 --wait-time-seconds 20
# Recommendation worker polling
aws sqs receive-message --queue-url $(aws sqs get-queue-url --queue-name recommendation-update-queue --output text --query 'QueueUrl') --max-number-of-messages 10 --wait-time-seconds 20
SQS excels at decoupling producers from consumers. The producer doesn’t need to know how many consumers there are, or even if they’re running. It just drops a message, and SQS guarantees at-least-once delivery to the consumers of that specific queue. If a consumer receives a message but fails to delete it, the message becomes visible again once its visibility timeout expires and can be picked up by another worker.
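That delete step is explicit in SQS: receiving a message only hides it for the visibility timeout; the consumer must delete it after successful processing. A sketch of the full receive-process-delete loop, building on the commands above (the `--query` extraction of the receipt handle is illustrative):

```shell
# Look up the queue URL once
QUEUE_URL=$(aws sqs get-queue-url --queue-name welcome-email-queue --output text --query 'QueueUrl')

# Receive one message and capture its receipt handle
RECEIPT=$(aws sqs receive-message --queue-url "$QUEUE_URL" \
  --max-number-of-messages 1 --wait-time-seconds 20 \
  --query 'Messages[0].ReceiptHandle' --output text)

# ... process the message (send the welcome email) ...

# Delete it so it is not redelivered after the visibility timeout expires
aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"
```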
Kafka Scenario: High-Throughput, Ordered, Replayable Event Streams
Kafka is designed for streaming data, where order and replayability are paramount. Instead of individual queues, we have topics and partitions.
Let’s imagine a single Kafka topic for user events: user-events.
- Kafka Topic: user-events (with, say, 4 partitions)
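Creating that topic might look like this (a sketch assuming a local single-broker cluster; in production you’d use a higher replication factor):

```shell
# Create the user-events topic with 4 partitions
kafka-topics --create \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --partitions 4 \
  --replication-factor 1
```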
Our application publishes all user-related events to this single topic:
# Using kafka-console-producer (example)
kafka-console-producer --bootstrap-server localhost:9092 --topic user-events
{"userId": "user123", "eventType": "signup", "timestamp": 1678886400}
{"userId": "user123", "eventType": "welcome_email_sent", "timestamp": 1678886405}
{"userId": "user123", "eventType": "recommendation_updated", "timestamp": 1678886410}
Here, the eventType and timestamp are crucial. Consumers read from partitions in the order messages are written to that partition.
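One caveat: ordering is only guaranteed within a partition, and the console producer above sends unkeyed messages, so a single user’s events may land on different partitions. Keying each message by userId pins a user’s events to one partition. A sketch using the console producer’s standard parse.key and key.separator properties:

```shell
# Keyed producer: everything before ':' on an input line is the key,
# the rest is the value. Same key -> same partition -> preserved order.
kafka-console-producer --bootstrap-server localhost:9092 \
  --topic user-events \
  --property parse.key=true \
  --property key.separator=:
# Example input line:
# user123:{"userId": "user123", "eventType": "signup", "timestamp": 1678886400}
```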
- Email Service Consumer: Subscribes to user-events, filters for eventType: "signup", and sends the email.
- Recommendation Service Consumer: Subscribes to user-events, filters for eventType: "signup", and updates recommendations.
- Logging Service Consumer: Subscribes to user-events, filters for eventType: "signup", and logs the event.
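Each of these services can be sketched with the console consumer, each running under its own consumer group so it reads the full stream independently of the others (the group name email-service is illustrative):

```shell
# Email service: consumes the whole stream under its own group,
# tracking its own offsets independently of the other services
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic user-events \
  --group email-service \
  --from-beginning
```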
Crucially, multiple consumer groups can independently read the same partition, but within a single consumer group each partition is assigned to exactly one consumer, so each message is processed once per group. If we wanted to re-process all sign-ups from yesterday, we could simply reset a consumer group’s offsets and re-read the topic from an earlier point.
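That replay is just an offset reset on the consumer group. A sketch with the standard kafka-consumer-groups tool (group name and datetime are illustrative; the group must have no active members when you run this):

```shell
# Rewind the email-service group to the start of a given day and replay from there
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group email-service \
  --topic user-events \
  --reset-offsets --to-datetime 2023-03-14T00:00:00.000 \
  --execute
```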
The core problem Kafka solves is managing a high-volume, ordered stream of events where multiple distinct consumers might need to react to the same event, or where historical data needs to be replayed. SQS, on the other hand, is about distributing tasks to independent workers where order within a single task queue isn’t critical, and each message is meant for one specific type of worker.
The key difference lies in the semantics of message delivery and consumption. SQS offers a queue model where consumers poll for messages, each message is delivered to at least one consumer of that specific queue, and the message is deleted once processed. Kafka exposes a partitioned, segmented log: consumers also pull, but they actively track their own position (offset) in each partition, and messages are read in order within a partition by the consumers of a consumer group. This allows multiple, independent applications to consume the same stream of data, each tracking its own progress.
Kafka’s durability and replayability stem from its design as a distributed commit log. Messages are appended to immutable, ordered partitions, and retained for a configurable period. Consumers are responsible for committing their offset, indicating how far they’ve processed. This log-centric approach is fundamentally different from SQS’s queue-centric, "fire-and-forget-to-a-specific-destination" model.
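That retention period is a per-topic setting. For example, keeping user-events for 7 days (604,800,000 ms) via the standard kafka-configs tool:

```shell
# Retain messages on user-events for 7 days before they become eligible for deletion
kafka-configs --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name user-events \
  --add-config retention.ms=604800000
```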
Most people assume Kafka is just "SQS but faster and more scalable." This is a dangerous simplification. Kafka’s primary strength isn’t just raw throughput; it’s the ability to treat event streams as a durable, replayable source of truth. This enables use cases like stream processing (e.g., with Flink or Spark Streaming), real-time analytics, and event sourcing that are either impossible or extremely cumbersome with SQS. If your primary need is simple task decoupling where each task is distinct and doesn’t need to be replayed, SQS is often the simpler, more cost-effective choice. If you need a centralized, ordered, replayable stream of events that multiple applications can independently process, Kafka is the tool.
If you pick Kafka for simple task decoupling and don’t manage consumer group offsets correctly, for example by failing to commit offsets before a rebalance, you may find the same messages re-processed as partitions are reassigned among consumers in the group.