SQS DLQ redrive is the process of sending messages that failed processing and ended up in a Dead-Letter Queue (DLQ) back to their original source queue for reprocessing.
Here’s a breakdown of how it works and why it’s a powerful tool for handling message processing failures:
Imagine you have a system where messages are sent to an SQS queue for processing by a worker. Sometimes, these messages can fail to process correctly. This could be due to a bug in the worker, a temporary downstream service outage, or malformed data in the message itself. If these failures aren’t handled, the messages can be lost or perpetually fail, blocking the queue.
This is where Dead-Letter Queues (DLQs) come in. A DLQ is a secondary SQS queue that receives messages from a primary queue once they have been received more times than a configured limit (maxReceiveCount) without being deleted. It acts as a holding pen for problematic messages, preventing them from blocking the main queue and allowing for later inspection and remediation.
However, simply having messages in a DLQ isn’t a solution. You need a way to actually fix the underlying issue and get those messages processed. That’s the purpose of redrive. SQS provides a built-in "redrive" functionality that allows you to take messages from a DLQ and send them back to their original source queue.
Let’s see this in action.
First, you need a source queue and a DLQ configured.
Source Queue (my-source-queue):
- Visibility Timeout: 30 seconds
- Redrive Policy:
  - Max Receive Count: 3
  - Dead-Letter Queue ARN: arn:aws:sqs:us-east-1:123456789012:my-dlq
Dead-Letter Queue (my-dlq):
- This is a standard SQS queue.
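The same configuration can be applied from code. The following is a minimal Python sketch, assuming a boto3 SQS client is created by the caller and passed in as `sqs`; the helper names `build_source_queue_attributes` and `attach_dlq` are hypothetical and exist only for this illustration.

```python
import json


def build_source_queue_attributes(dlq_arn: str,
                                  max_receive_count: int = 3,
                                  visibility_timeout_s: int = 30) -> dict:
    """Build the Attributes map for the source queue (all values are strings)."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {
        "VisibilityTimeout": str(visibility_timeout_s),
        # SQS expects the redrive policy as a JSON-encoded string.
        "RedrivePolicy": json.dumps(redrive_policy),
    }


def attach_dlq(sqs, source_queue_url: str, dlq_arn: str) -> None:
    """Apply the attributes using a boto3 SQS client supplied by the caller."""
    sqs.set_queue_attributes(
        QueueUrl=source_queue_url,
        Attributes=build_source_queue_attributes(dlq_arn),
    )
```

A caller would typically create the client with `boto3.client("sqs")` and pass the URL of my-source-queue together with the ARN of my-dlq.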
Now, let’s say a message fails processing three times in my-source-queue: it is received three times and never deleted. The next time it would be received, SQS moves it to my-dlq instead of delivering it to the worker.
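The receive-count rule can be illustrated with a toy in-memory model. This is only a sketch of the behaviour described here, not how SQS is implemented; the `Message` class and `drain` function are hypothetical names, and visibility timeouts are ignored for simplicity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    receive_count: int = 0


def drain(source: deque, dlq: list, max_receive_count: int, handler) -> list:
    """Receive messages until the source queue is empty.

    Returns the bodies that were processed successfully.
    """
    processed = []
    while source:
        msg = source.popleft()
        # The check happens on receive: a message that has already been
        # received max_receive_count times goes to the DLQ instead.
        if msg.receive_count >= max_receive_count:
            dlq.append(msg)
            continue
        msg.receive_count += 1
        try:
            handler(msg.body)
            processed.append(msg.body)  # success: the consumer deletes it
        except Exception:
            source.append(msg)          # failure: it becomes visible again
    return processed
```

With a limit of 3, a message whose handler always raises is handed to the worker three times and lands in the DLQ on what would have been the fourth receive.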
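You can view messages in your DLQ using the AWS Management Console or the AWS CLI. Note that receive-message returns at most one message by default (up to 10 with --max-number-of-messages) and hides each returned message from other consumers until its visibility timeout expires.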
AWS CLI Example to List Messages in DLQ:
aws sqs receive-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq
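For inspecting a batch from code, a small helper can reduce each message to the fields that matter during triage. This is a sketch under stated assumptions: `sqs` is a boto3 SQS client supplied by the caller, and `summarize_dlq_batch` and `peek_dlq` are hypothetical helper names. Passing `VisibilityTimeout=0` is intended to leave the messages visible to other consumers right after the peek.

```python
def summarize_dlq_batch(response: dict) -> list:
    """Turn a ReceiveMessage response into (id, receive count, body preview) rows."""
    rows = []
    # The "Messages" key is absent when the queue returned nothing.
    for msg in response.get("Messages", []):
        attrs = msg.get("Attributes", {})
        rows.append((
            msg["MessageId"],
            int(attrs.get("ApproximateReceiveCount", "0")),
            msg["Body"][:80],
        ))
    return rows


def peek_dlq(sqs, dlq_url: str) -> list:
    """Fetch up to 10 messages from the DLQ without hiding them afterwards."""
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=10,
        AttributeNames=["ApproximateReceiveCount"],
        VisibilityTimeout=0,
        WaitTimeSeconds=1,
    )
    return summarize_dlq_batch(response)
```

Because SQS samples its servers on each call, one peek may not return every message in the DLQ; repeated calls give a fuller picture.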
Once you’ve identified the problematic messages in my-dlq and, crucially, fixed the underlying issue (e.g., deployed a bug fix to your worker, resolved a downstream dependency, or corrected malformed data), you can initiate a redrive.
Initiating a Redrive via AWS Management Console:
- Navigate to the SQS service in the AWS Console.
- Select your my-dlq.
- Choose "Start DLQ redrive."
- For the message destination, choose to redrive to the source queue, which sends the messages back to my-source-queue. (Redriving to a custom destination queue is also possible.)
- Configure the velocity control settings:
  - "System optimized": SQS moves the messages as quickly as it can.
  - "Custom max velocity": you cap the rate in messages per second (up to 500), which helps protect the worker or a downstream service from a sudden burst.
- Start the redrive.
Note that a redrive has no receive-count setting of its own. Redriven messages are subject to the Max Receive Count in the source queue’s redrive policy, so to give the fixed worker more attempts, raise that value on my-source-queue before starting the redrive.
Initiating a Redrive via AWS CLI:
# The rate flag is optional (maximum 500); omit it to let SQS choose the rate.
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789012:my-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789012:my-source-queue \
  --max-number-of-messages-per-second 50
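The same task can be started from Python. The sketch below assumes a boto3 SQS client is passed in as `sqs`; `start_redrive` is a hypothetical wrapper name. It sends only the parameters the StartMessageMoveTask API defines: the source ARN (the DLQ), an optional destination ARN, and an optional maximum rate in messages per second.

```python
def start_redrive(sqs, dlq_arn: str, destination_arn=None, max_per_second=None) -> str:
    """Start a message move task and return its task handle."""
    kwargs = {"SourceArn": dlq_arn}
    if destination_arn is not None:
        # When omitted, messages go back to the queue they came from.
        kwargs["DestinationArn"] = destination_arn
    if max_per_second is not None:
        if not 1 <= max_per_second <= 500:
            raise ValueError("max_per_second must be between 1 and 500")
        kwargs["MaxNumberOfMessagesPerSecond"] = max_per_second
    response = sqs.start_message_move_task(**kwargs)
    return response["TaskHandle"]
```

The returned handle is what a later cancel call would need, so it is worth logging or storing.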
When the redrive task starts, SQS begins moving messages from my-dlq back to my-source-queue. The messages appear in my-source-queue as new messages, with a new message ID and a receive count that starts over. Your worker will then pick them up and attempt to process them again.
The key to redrive is that you don’t have to build the move yourself. SQS handles the mechanics of reading from the DLQ, sending to the source queue, and removing each message from the DLQ once it has been moved. As the task runs, the ApproximateNumberOfMessagesVisible metric of the source queue increases and the same metric of the DLQ decreases.
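Progress of a running task can also be read from the API rather than inferred from queue metrics. The following sketch assumes a boto3 SQS client passed in as `sqs`; `latest_task_progress` is a hypothetical helper name. It picks the most recently started task reported for the DLQ and returns its status with approximate counts.

```python
def latest_task_progress(sqs, dlq_arn: str):
    """Return (status, moved, to_move, fraction) for the newest move task, or None."""
    response = sqs.list_message_move_tasks(SourceArn=dlq_arn, MaxResults=10)
    results = response.get("Results", [])
    if not results:
        return None
    # Pick the newest task explicitly instead of relying on result order.
    task = max(results, key=lambda r: r.get("StartedTimestamp", 0))
    moved = task.get("ApproximateNumberOfMessagesMoved", 0)
    to_move = task.get("ApproximateNumberOfMessagesToMove", 0)
    # The counts are approximate, so treat the fraction as a rough indicator.
    fraction = moved / to_move if to_move else None
    return task["Status"], moved, to_move, fraction
```

Polling this every few seconds until the status is no longer RUNNING is usually enough for a one-off redrive.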
What most people don’t realize is that a redrive task does not carry its own receive limit. The maxReceiveCount setting in the source queue’s redrive policy determines when a message is moved to the DLQ initially, and the same setting applies again to redriven messages, whose receive count starts over. The only parameters of a redrive task are the source (the DLQ), an optional destination, and an optional maximum rate in messages per second. To allow more retries after a fix has been implemented, raise maxReceiveCount on the source queue before starting the redrive. If a redriven message still fails processing and exceeds that limit, it will be sent back to the DLQ.
Once the redrive is complete and every redriven message has been processed successfully and deleted from the source queue, the DLQ stays empty. Any message that fails again will reappear in the DLQ, which signals that the fix did not cover it.
The next step after successfully redriving and processing messages is often to implement more robust error handling and monitoring to prevent future DLQ accumulation.