SQS’s content-based deduplication, a FIFO queue feature, uses a SHA-256 hash of the message body to identify duplicates, but it’s not as simple as just hashing the text.
Let’s see it in action. Imagine you have a FIFO queue my-dedup-queue.fifo configured with content-based deduplication enabled (FIFO queue names must end in .fifo, and every send requires a --message-group-id). You send two identical messages:
aws sqs send-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dedup-queue.fifo --message-group-id orders --message-body "{\"order_id\": \"12345\", \"item\": \"widget\"}"
aws sqs send-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dedup-queue.fifo --message-group-id orders --message-body "{\"order_id\": \"12345\", \"item\": \"widget\"}"
SQS calculates a SHA-256 hash for "{\"order_id\": \"12345\", \"item\": \"widget\"}". Because the second message produces the same hash and arrives within the 5-minute deduplication interval, SQS treats it as a duplicate: the second SendMessage call still succeeds, but the message is not enqueued again, and receiving from the queue yields only one copy.
The core problem content-based deduplication solves is ensuring that an event, represented by a message, is processed only once, even if the producer accidentally sends it multiple times. This matters most when your consumers are not idempotent, i.e., when processing the same message twice has real side effects: without deduplication, you might double-charge a customer or process a refund twice.
Internally, when content-based deduplication is enabled on a FIFO queue, SQS computes a SHA-256 hash of the message body and uses it as the MessageDeduplicationId for that message. SQS maintains a deduplication interval of 5 minutes; this interval is fixed, not configurable. If a message with the same MessageDeduplicationId arrives within this interval, SQS discards the new message. By default the deduplication scope is the entire queue; only in high-throughput mode can it be narrowed to individual message groups. So for any 5-minute window, identical messages result in only the first one being stored and delivered.
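That derivation can be sketched in a few lines, assuming (per the documented behavior) a SHA-256 over the raw message body; the function name here is illustrative:

```python
import hashlib

def content_based_dedup_id(body: str) -> str:
    """Mimic SQS's derivation: SHA-256 over the raw message body bytes."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

body = '{"order_id": "12345", "item": "widget"}'
first = content_based_dedup_id(body)
second = content_based_dedup_id(body)
print(first == second)  # True: identical bodies yield identical deduplication ids
```

Because the hash is deterministic, the second send lands inside the deduplication interval with the same id and is dropped.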
The primary lever you control is enabling the ContentBasedDeduplication flag when creating or updating the queue. You can do this via the AWS Management Console, the AWS CLI, or an SDK.
aws sqs create-queue --queue-name my-dedup-queue.fifo --attributes FifoQueue=true,ContentBasedDeduplication=true
This attribute tells SQS to automatically generate the MessageDeduplicationId from the message body. If you don’t enable it, every SendMessage call to the FIFO queue must include an explicit MessageDeduplicationId, which shifts the deduplication logic to your producers.
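If you do manage the id yourself, one common approach is to derive it from a stable business key rather than the full body. A sketch of building the SendMessage parameters (the helper, group id, and queue URL are illustrative; with boto3 you would pass these as sqs_client.send_message(**args)):

```python
import hashlib

def build_send_args(queue_url: str, body: str, order_id: str) -> dict:
    """Build SendMessage parameters with an explicit, stable deduplication id.

    The id is derived from the business key (order_id), so retries of the
    same order deduplicate even if the serialized body differs slightly.
    """
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": "orders",  # required on every send to a FIFO queue
        "MessageDeduplicationId": hashlib.sha256(order_id.encode("utf-8")).hexdigest(),
    }

args = build_send_args(
    "https://sqs.us-east-1.amazonaws.com/123456789012/my-dedup-queue.fifo",
    '{"order_id": "12345", "item": "widget"}',
    order_id="12345",
)
```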
However, the "content" SQS hashes is precisely the string you send as the message-body. This means that even tiny, seemingly insignificant differences in the message body will result in different hashes and thus, different MessageDeduplicationIds, bypassing deduplication. For example, trailing whitespace, the order of keys in a JSON object, or different formatting of numbers can all lead to unique hashes.
Consider this:
{"key1": "value1", "key2": "value2"}
vs.
{"key2": "value2", "key1": "value1"}
These are functionally identical JSON objects, but as strings, they are different. SQS will hash them independently. To ensure effective deduplication with JSON, you should always serialize your JSON objects with consistent key ordering before sending them to SQS. Libraries in most programming languages offer options for sorted serialization.
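A quick demonstration of the difference, using Python’s json module (sort_keys plus fixed separators gives a canonical form):

```python
import hashlib
import json

a = {"key1": "value1", "key2": "value2"}
b = {"key2": "value2", "key1": "value1"}

# Naive serialization preserves insertion order: different strings, different hashes.
naive_a, naive_b = json.dumps(a), json.dumps(b)

# Canonical serialization: sorted keys and fixed separators yield identical strings.
def canon(obj):
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

print(naive_a == naive_b)        # False
print(canon(a) == canon(b))      # True
print(hashlib.sha256(canon(a).encode()).hexdigest()
      == hashlib.sha256(canon(b).encode()).hexdigest())  # True: same dedup id
```

Send the canonical string as the message body and SQS will hash the same bytes for functionally identical payloads.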
Another common pitfall is the opposite problem: volatile fields in the message body defeat deduplication. If your body includes a timestamp like "timestamp": "2023-10-27T10:30:00Z", two messages representing the same event will hash differently whenever the timestamp differs, so SQS treats them as distinct even within the deduplication interval. Anything that varies between retries of the same event (timestamps, request IDs, random nonces) must either be removed from the body or excluded by supplying an explicit MessageDeduplicationId derived only from the stable parts of the event.
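One way to get a stable id is to drop the volatile fields before hashing and pass the result as an explicit MessageDeduplicationId. A sketch, with illustrative field names and helper:

```python
import hashlib
import json

def stable_dedup_id(event: dict, volatile: tuple = ("timestamp",)) -> str:
    """Hash only the stable parts of the event: drop volatile fields,
    then serialize canonically (sorted keys) before hashing."""
    stable = {k: v for k, v in event.items() if k not in volatile}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

first_try = {"order_id": "12345", "item": "widget",
             "timestamp": "2023-10-27T10:30:00Z"}
retry     = {"order_id": "12345", "item": "widget",
             "timestamp": "2023-10-27T10:30:02Z"}  # retried two seconds later

print(stable_dedup_id(first_try) == stable_dedup_id(retry))  # True: same event
```

Both sends now carry the same deduplication id, so the retry is discarded even though the raw bodies differ.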
The next concept you’ll likely grapple with is how to handle message ordering alongside deduplication, which FIFO queues manage through message groups.