SQS dead-letter queues (DLQs) are a mechanism for handling messages that fail to be processed successfully after a certain number of receive attempts. Without a DLQ, a failing message is redelivered again and again until its retention period expires and SQS deletes it. With a DLQ, SQS moves such messages to a designated queue for later inspection and potential reprocessing. This prevents silent message loss and provides an auditing and debugging capability.

Let’s see how this works with a simple example. Imagine an application that processes orders from an SQS queue. If an order message fails processing (e.g., due to a database error or invalid data), it will be retried. After a configured number of retries, if it still fails, it will be sent to a DLQ.

Here’s a typical setup:

Source Queue: my-order-processing-queue
DLQ: my-order-processing-dlq

When a message is sent to my-order-processing-queue and fails processing repeatedly, it will eventually land in my-order-processing-dlq. You can then inspect the messages in the DLQ to understand why they failed.

How it Works Internally

SQS employs a simple yet effective mechanism for DLQs. You configure a redrive policy on your source queue. This policy specifies:

  1. maxReceiveCount: The number of times a message can be received (and therefore attempted to process) from the source queue before it’s considered "failed." Once this count is exceeded, SQS automatically moves the message.
  2. deadLetterTargetArn: The Amazon Resource Name (ARN) of the DLQ. This is where the failed messages will be sent.

Each time a consumer receives a message from the source queue, SQS increments that message’s ApproximateReceiveCount attribute. If the consumer deletes the message before its visibility timeout expires, the message is removed from the queue for good. If the timeout expires first (the consumer did not delete it, which implies a failure), the message becomes visible again and can be received once more, which increments the count again. Once the count would exceed maxReceiveCount, SQS moves the message to the DLQ instead of delivering it again.
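The counting rule can be sketched with a toy in-memory model. This is only an illustration, not how SQS is implemented: the class and function names are hypothetical, visibility timeouts are not modelled, and every receive that is not followed by a delete is treated as a failed attempt.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Message:
    body: str
    receive_count: int = 0  # stands in for ApproximateReceiveCount


@dataclass
class SimulatedQueue:
    """Toy model of a source queue with a redrive policy (hypothetical)."""
    max_receive_count: int
    messages: List[Message] = field(default_factory=list)
    dead_letters: List[Message] = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(Message(body))

    def receive(self) -> Optional[Message]:
        """Return the next message, dead-lettering any that used up their receives."""
        while self.messages:
            msg = self.messages[0]
            if msg.receive_count >= self.max_receive_count:
                # One more receive would exceed maxReceiveCount: move to the DLQ.
                self.dead_letters.append(self.messages.pop(0))
                continue
            msg.receive_count += 1
            return msg
        return None

    def delete(self, msg: Message) -> None:
        """A successful consumer deletes the message; it is gone, not reset."""
        self.messages.remove(msg)


def simulate_poison_message(max_receive_count: int) -> Tuple[int, int]:
    """Receive one always-failing message until the queue stops returning it."""
    queue = SimulatedQueue(max_receive_count=max_receive_count)
    queue.send("bad-order")
    deliveries = 0
    while queue.receive() is not None:
        deliveries += 1  # the consumer "fails" and never deletes the message
    return deliveries, len(queue.dead_letters)


if __name__ == "__main__":
    print(simulate_poison_message(3))  # (3, 1)
```

With maxReceiveCount set to 3, the failing message is handed out three times and then lands in the simulated DLQ on the fourth receive attempt.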

Key Components and Configuration

  • Source Queue: The primary queue where messages are initially sent.
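  • Dead Letter Queue (DLQ): A separate SQS queue configured to receive messages from the source queue. It must be the same type as the source queue (a FIFO source needs a FIFO DLQ) and live in the same AWS account and Region.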
  • Redrive Policy: A configuration setting on the source queue that defines the DLQ and the maxReceiveCount.

You can configure the redrive policy using the AWS Management Console, AWS CLI, or AWS SDKs.

Using AWS CLI to set a redrive policy:

First, create your DLQ if it doesn’t exist:

aws sqs create-queue --queue-name my-order-processing-dlq

Then, set the redrive policy on your source queue. Let’s assume your source queue is named my-order-processing-queue and its ARN is arn:aws:sqs:us-east-1:123456789012:my-order-processing-queue. The DLQ’s ARN would be arn:aws:sqs:us-east-1:123456789012:my-order-processing-dlq.

aws sqs set-queue-attributes \
    --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-order-processing-queue \
    --attributes '{"RedrivePolicy": "{\"maxReceiveCount\":\"5\",\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-order-processing-dlq\"}"}'

In this example, once a message has been received 5 times from my-order-processing-queue without being deleted, SQS moves it to my-order-processing-dlq instead of delivering it a sixth time.
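The RedrivePolicy value is a JSON document stored inside a JSON string, which is why the CLI form needs escaped quotes. A small sketch that serializes it programmatically avoids hand-escaping. The helper name build_redrive_attributes is hypothetical; the attribute names come from the policy described above.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Return an Attributes mapping suitable for SetQueueAttributes.

    RedrivePolicy is itself JSON stored as a string value, so it is
    serialized once here rather than escaped by hand.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}


if __name__ == "__main__":
    attrs = build_redrive_attributes(
        "arn:aws:sqs:us-east-1:123456789012:my-order-processing-dlq", 5
    )
    # Printing the mapping as JSON yields a string usable with --attributes.
    print(json.dumps(attrs))
```

Assuming boto3 is installed and credentials are configured, the same mapping could be passed as `boto3.client("sqs").set_queue_attributes(QueueUrl=source_queue_url, Attributes=attrs)`, where source_queue_url is the URL of my-order-processing-queue.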

Why Use a DLQ?

  1. Prevent Message Loss: Messages aren’t lost if consumers fail. They are preserved for analysis.
  2. Debugging and Error Analysis: You can examine the messages in the DLQ to understand the root cause of processing failures. This might involve inspecting message content, checking application logs for corresponding errors, or identifying malformed messages.
  3. Auditing: DLQs provide a historical record of messages that couldn’t be processed, which can be important for compliance or operational auditing.
  4. Graceful Degradation: Allows your primary processing system to continue processing healthy messages while problematic ones are isolated.

What to do with Messages in the DLQ

Once messages land in the DLQ, you have several options:

  • Analyze and Fix: Inspect the messages to understand the failure. Once the underlying issue is resolved (e.g., a bug fix, data correction), you can potentially move the messages back to the source queue for reprocessing.
  • Manual Reprocessing: Develop a separate process or script to re-ingest messages from the DLQ into the source queue after fixing the issue.
  • Discard: If the messages are irrelevant or unrecoverable, you can simply delete them from the DLQ.

To move messages back (example using AWS CLI):

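First, receive messages from the DLQ. Use a visibility timeout long enough to finish the send and delete steps that follow; with a timeout of 0 the message becomes visible again immediately and another consumer could pick it up in the middle of the move:

aws sqs receive-message \
    --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-order-processing-dlq \
    --max-number-of-messages 10 \
    --visibility-timeout 30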

Then, send each message back to the source queue (one send-message call per message):

aws sqs send-message \
    --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-order-processing-queue \
    --message-body "..." # The body of the message from the DLQ

Finally, delete the message from the DLQ:

aws sqs delete-message \
    --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-order-processing-dlq \
    --receipt-handle "..." # The receipt handle of the message from the DLQ
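The three CLI steps above can be combined into one loop. The following is a minimal sketch, assuming standard (non-FIFO) queues and an object `sqs` that behaves like a boto3 SQS client; the function name redrive_batch and its parameters are hypothetical.

```python
from typing import Any


def redrive_batch(sqs: Any, dlq_url: str, source_url: str,
                  visibility_timeout: int = 30) -> int:
    """Move up to 10 messages from the DLQ back to the source queue.

    `sqs` is assumed to expose receive_message, send_message and
    delete_message like a boto3 SQS client. Returns the number moved.
    """
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=10,
        VisibilityTimeout=visibility_timeout,
        MessageAttributeNames=["All"],
    )
    moved = 0
    for message in response.get("Messages", []):
        params = {"QueueUrl": source_url, "MessageBody": message["Body"]}
        if message.get("MessageAttributes"):
            # Carry custom message attributes over to the re-sent copy.
            params["MessageAttributes"] = message["MessageAttributes"]
        # Send first, delete second: a crash in between leaves a duplicate
        # in the source queue rather than losing the message.
        sqs.send_message(**params)
        sqs.delete_message(QueueUrl=dlq_url,
                           ReceiptHandle=message["ReceiptHandle"])
        moved += 1
    return moved
```

Calling it repeatedly until it returns 0 drains the DLQ, for example with `sqs = boto3.client("sqs")` if boto3 is available. Because a failure between the send and the delete can produce a duplicate, the consumer of the source queue should tolerate seeing the same order twice.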

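A common pattern is to set a message retention period on the DLQ itself, so messages are automatically deleted after a certain time if they are not manually handled. For standard queues, a message’s expiration is based on its original enqueue timestamp, which does not change when the message moves to the DLQ, so give the DLQ a longer retention period than the source queue.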

The maxReceiveCount parameter is crucial; setting it too high might delay the detection of persistent issues, while setting it too low could lead to legitimate transient failures being prematurely sent to the DLQ. It’s a balance between responsiveness to errors and tolerance for temporary glitches.
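To make the trade-off concrete, a rough back-of-the-envelope estimate helps. The sketch below assumes a consumer that polls continuously, never deletes the failing message, and never changes its visibility timeout, so each failed attempt holds the message for one full timeout. The function name is hypothetical and the result is an approximation, not a guarantee.

```python
def approx_seconds_to_dlq(max_receive_count: int, visibility_timeout_s: int) -> int:
    """Approximate time a never-deleted message spends in the source queue.

    Each receive hides the message for one visibility timeout; after
    max_receive_count such attempts the next receive moves it to the DLQ.
    """
    return max_receive_count * visibility_timeout_s


if __name__ == "__main__":
    print(approx_seconds_to_dlq(5, 30))    # 150
    print(approx_seconds_to_dlq(100, 30))  # 3000
```

With the default 30-second visibility timeout, a maxReceiveCount of 5 surfaces a persistent failure after roughly two and a half minutes, while a value of 100 would hide it for about 50 minutes.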

The next step after implementing DLQs is often setting up monitoring and alerting on the DLQ itself, so you’re proactively notified when messages start accumulating.
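One possible way to do this is a CloudWatch alarm that fires as soon as the DLQ is non-empty. The sketch below only builds the alarm parameters; the helper name, alarm name, period, and threshold are assumptions to adapt, and the SNS topic ARN in the example is a placeholder.

```python
from typing import Any, Dict, List


def dlq_alarm_params(dlq_name: str, alarm_actions: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for a CloudWatch alarm on a non-empty DLQ."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,            # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": alarm_actions,
    }


if __name__ == "__main__":
    params = dlq_alarm_params(
        "my-order-processing-dlq",
        ["arn:aws:sns:us-east-1:123456789012:example-alerts-topic"],  # placeholder
    )
    print(params["AlarmName"])  # my-order-processing-dlq-not-empty
```

Assuming boto3 is installed, the parameters could be applied with `boto3.client("cloudwatch").put_metric_alarm(**params)`.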
