An SQS Dead Letter Queue (DLQ) alarm fires when messages end up in the DLQ, signaling that a consumer failed to process messages from the main queue. This isn’t just a notification; it’s a critical alert that a processing bottleneck or failure has occurred downstream, potentially leading to data loss or service degradation.

Here’s how to set it up and what it means:

The Problem: Unprocessed Messages

When a message is sent to an SQS queue, a consumer application is supposed to pick it up, process it, and then delete it. If the consumer fails to process the message (e.g., due to an error in the application logic, a downstream dependency failure, or the consumer crashing), the message eventually reappears in the queue for another processing attempt. The source queue's redrive policy has a maxReceiveCount parameter for this; once a message has been received maxReceiveCount times without being deleted, SQS moves it to a designated Dead Letter Queue (DLQ).

The DLQ is a separate SQS queue where these "failed" messages are sent. It’s a safety net, preventing infinite retry loops and data loss. However, a message in the DLQ means something is broken in your processing pipeline. The alarm is there to tell you immediately when this happens.

Setting Up the Alarm

You’ll need two SQS queues:

  1. Source Queue: The main queue where messages are sent.
  2. Dead Letter Queue (DLQ): The queue where failed messages are sent.

First, configure the source queue to send messages to the DLQ. Note that the DLQ must already exist before you can select it in the redrive policy, so create it first (step 2 below) if you haven't.

1. Configure Source Queue’s Redrive Policy:

In the AWS console, navigate to your source SQS queue.

  • Go to the "Dead-letter queue" tab.
  • Click "Edit".
  • Under "Dead-letter queue settings", select "Enable".
  • Choose "By message count" for the "Redrive policy".
  • Set "Maximum receives" to a value that makes sense for your application. A common value is 5 or 10. This means if a message is received 5 or 10 times without being deleted, it will be sent to the DLQ.
  • Under "Dead-letter queue", select your pre-created DLQ.
  • Click "Save".

2. Create the DLQ:

If you haven’t already, create a new SQS queue.

  • Give it a name, e.g., my-application-dlq.
  • Keep the default settings for now, unless you have specific requirements.
  • Make sure this queue is not configured to send messages to another DLQ (to avoid infinite loops).
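As a sketch, the equivalent CreateQueue parameters might look like this. The longer retention period is an optional but common choice, so failed messages survive until you can investigate; 14 days is the SQS maximum:

```python
def dlq_create_params(name: str = "my-application-dlq") -> dict:
    """Parameters for SQS CreateQueue for the DLQ.

    A retention period longer than the source queue's gives you more
    time to investigate before failed messages expire."""
    return {
        "QueueName": name,
        "Attributes": {
            # 14 days in seconds, the SQS maximum retention period.
            "MessageRetentionPeriod": str(14 * 24 * 3600),
        },
    }

# With boto3 (not run here):
# boto3.client("sqs").create_queue(**dlq_create_params())
```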

3. Create a CloudWatch Alarm on the DLQ:

Now, create an alarm that triggers when messages appear in the DLQ.

  • Navigate to CloudWatch in the AWS console.
  • Go to "Alarms" and click "Create alarm".
  • Click "Select metric".
  • Find "SQS" under "AWS namespaces".
  • Select "Queue Metrics".
  • Find your DLQ by its name (e.g., my-application-dlq).
  • Choose the metric ApproximateNumberOfMessagesVisible. This metric counts the number of messages available for retrieval in the queue.
  • Click "Select metric".
  • Configure the alarm:
    • Statistic: Sum (we want to know if any messages arrive).
    • Period: 1 minute (or 5 minutes, depending on how quickly you need to be alerted).
    • Threshold type: Static.
    • Whenever ApproximateNumberOfMessagesVisible is: Greater than 0.
  • Click "Next".
  • Configure actions:
    • Notification: Choose an existing SNS topic or create a new one to send notifications (e.g., via email, Slack, PagerDuty).
    • Auto Scaling/EC2 Actions/EventBridge: Not typically needed for a DLQ alarm.
  • Click "Next".
  • Give your alarm a name, e.g., SQS-MyApplication-DLQ-High.
  • Add a description.
  • Click "Next".
  • Review and click "Create alarm".
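The console steps above map to a single CloudWatch PutMetricAlarm call. A sketch of the parameters (the SNS topic ARN is a placeholder; with boto3 you would pass this dict to put_metric_alarm):

```python
def dlq_alarm_params(dlq_name: str, sns_topic_arn: str) -> dict:
    """Parameters for CloudWatch PutMetricAlarm mirroring the console
    steps: alarm as soon as any message is visible in the DLQ."""
    return {
        "AlarmName": f"SQS-{dlq_name}-High",
        "AlarmDescription": f"Messages have arrived in {dlq_name}",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Sum",
        "Period": 60,             # seconds; use 300 for a 5-minute period
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # placeholder SNS topic ARN
    }

# With boto3 (not run here):
# boto3.client("cloudwatch").put_metric_alarm(
#     **dlq_alarm_params("my-application-dlq", sns_topic_arn))
```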

What Happens When the Alarm Fires

When the ApproximateNumberOfMessagesVisible metric for your DLQ exceeds 0 for the specified period, the alarm will enter the ALARM state. This will trigger the configured SNS notification. You will then need to investigate why messages are failing to be processed by your consumers.

Common Causes and How to Fix Them

1. Consumer Application Crashes/Uncaught Exceptions:

  • Diagnosis: Review application logs for your consumer. Look for stack traces, error messages, or indications of unexpected shutdowns. If your consumers are running on EC2 or ECS, check their system logs.
  • Fix: Identify the root cause of the exceptions (e.g., bugs, configuration issues, external service failures) and deploy a corrected version of your consumer application.
  • Why it works: Fixing the bug or error in the consumer allows it to process messages successfully, preventing them from being sent to the DLQ.

2. Downstream Service Unavailability/Errors:

  • Diagnosis: If your consumer relies on other services (databases, APIs, other microservices), check the health and logs of those dependencies. Are they responding with errors (e.g., 5xx HTTP status codes, database connection errors)?
  • Fix: Address the issues with the downstream service. This might involve restarting it, scaling it up, fixing its own bugs, or implementing retry mechanisms within the consumer for transient downstream errors.
  • Why it works: Once the downstream dependency is healthy and responsive, the consumer can complete its work and delete the message.

3. Message Format Issues/Invalid Data:

  • Diagnosis: Inspect the messages in the DLQ. You can use the SQS console to "Poll for messages". Examine the message content to see if it’s malformed, missing expected fields, or contains data that violates your consumer’s assumptions.
  • Fix:
    • Option A (Fixing the Producer): If the producer is sending bad data, fix the producer application to send valid messages.
    • Option B (Handling Bad Data in Consumer): Modify the consumer to gracefully handle malformed messages. This might involve logging the bad message and continuing, or attempting to parse it with more flexible logic.
    • Option C (Manual Remediation): For a one-off issue, you can re-process a batch of known good messages by moving them back to the source queue. SQS does not support editing a message in place, so if the content itself needs fixing, receive the message, correct it, and send the corrected copy to the source queue.
  • Why it works: Ensuring messages conform to the expected format allows the consumer to parse and process them without errors.
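Option B can be sketched as a minimal consumer handler that quarantines malformed bodies instead of letting them retry their way into the DLQ. The order_id field and process_order function are hypothetical stand-ins for your own schema and business logic:

```python
import json
import logging

logger = logging.getLogger("consumer")

def process_order(order_id: str) -> None:
    """Placeholder for the real business logic."""

def handle_message(body: str) -> str:
    """Classify and process one message body.

    Returns "processed" on success and "quarantined" for malformed
    input, so bad data is logged and set aside rather than cycling
    through maxReceiveCount retries into the DLQ."""
    try:
        payload = json.loads(body)
        order_id = payload["order_id"]  # hypothetical required field
    except (json.JSONDecodeError, KeyError) as exc:
        logger.warning("Malformed message, quarantining: %s", exc)
        # In production you might also copy the body to a quarantine
        # queue here before acknowledging it.
        return "quarantined"
    process_order(order_id)
    return "processed"
```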

4. Insufficient Consumer Capacity (Overload):

  • Diagnosis: If your consumer is running but messages are still accumulating in the DLQ, your consumers might not be able to keep up with the message volume. Check the ApproximateAgeOfOldestMessage metric for your source queue. If it’s steadily increasing, your consumers are falling behind. Also, check CPU/memory utilization of your consumer instances/containers.
  • Fix: Increase the number of consumer instances or scale up their resources (CPU, memory). If using auto-scaling, adjust the scaling policies.
  • Why it works: More consumer instances or more powerful instances can process messages faster, reducing the backlog and preventing messages from exceeding maxReceiveCount.

5. Timeouts (Consumer Logic, Network, Dependencies):

  • Diagnosis: Look for logs indicating timeouts. This could be an application-level timeout waiting for a response from a dependency, a network timeout, or even a Lambda function timeout if your consumer is Lambda-based.
  • Fix:
    • Increase timeouts: If the processing genuinely takes longer than expected, increase the timeout values in your consumer logic or the timeout settings of the service invoking it (e.g., Lambda timeout).
    • Optimize processing: If the processing is taking too long, optimize the consumer’s logic, make downstream calls more efficient, or process messages in smaller batches.
  • Why it works: Allowing sufficient time for processing prevents premature termination of the consumer’s attempt to handle a message.
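For transient downstream timeouts, retrying inside the consumer with backoff and a bounded attempt count often resolves the message before it burns through maxReceiveCount. A minimal sketch; TransientError is a hypothetical marker for whatever your code treats as retryable:

```python
import time

class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 5xx)."""

def call_with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry a downstream call on transient errors with exponential
    backoff, so one dependency blip does not consume a whole SQS
    delivery attempt. `sleep` is injectable for testing."""
    for i in range(attempts):
        try:
            return fn()
        except TransientError:
            if i == attempts - 1:
                raise  # out of attempts: let the message retry via SQS
            sleep(base_delay * (2 ** i))
```

Keep the total retry time well under the queue's visibility timeout, or the message will become visible again mid-processing.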

6. Incorrect maxReceiveCount Configuration:

  • Diagnosis: If messages are appearing in the DLQ very quickly, and your consumer appears to be working fine, your maxReceiveCount might be set too low. This is less common but possible.
  • Fix: Increase the maxReceiveCount on the source queue’s redrive policy. For example, change it from 3 to 10.
  • Why it works: A higher maxReceiveCount gives consumers more opportunities to successfully process a message before it’s considered "failed" and sent to the DLQ.

7. Consumer Deleting Messages Prematurely:

  • Diagnosis: This one is subtle. If your consumer has a bug where it deletes a message from the source queue before it has completed all necessary processing (e.g., committing to a database, sending a notification) and then crashes, the message is lost forever; it never reaches the DLQ. If, instead, the delete operation itself fails, the message returns to the queue and can end up in the DLQ after maxReceiveCount attempts. Check your consumer's delete logic carefully.
  • Fix: Ensure the DeleteMessage API call is only made after all critical processing steps for that message are confirmed as successful. Implement transactional logic if possible.
  • Why it works: Ensures that messages are only removed from the queue once their processing is definitively complete, preventing partial processing and subsequent loss or DLQing.
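The fix can be sketched as a receive-process-delete loop body in which DeleteMessage only runs after processing succeeds. Here sqs_client is anything exposing the boto3 SQS client's receive_message/delete_message shape, so the logic is testable with a stub:

```python
def consume_one(sqs_client, queue_url: str, process) -> bool:
    """Receive one message, process it fully, and only then delete it.
    Returns True if a message was handled, False if the queue was empty."""
    resp = sqs_client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = resp.get("Messages", [])
    if not messages:
        return False
    msg = messages[0]
    # May raise: on failure the message is NOT deleted, so it reappears
    # after the visibility timeout and, after maxReceiveCount failed
    # attempts, lands in the DLQ instead of being silently lost.
    process(msg["Body"])
    sqs_client.delete_message(QueueUrl=queue_url,
                              ReceiptHandle=msg["ReceiptHandle"])
    return True
```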

The Next Problem: Handling DLQ Messages

Once your alarm is resolved and you’ve fixed the underlying issue, you’ll likely face the problem of what to do with the messages already in your DLQ. You’ll need a strategy for replaying them or discarding them.
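A common replay strategy is a small redrive script: receive from the DLQ, re-send to the source queue, then delete from the DLQ. This sketch moves message bodies only (message attributes are dropped); note that SQS also offers a built-in DLQ redrive in the console and API, which is usually preferable when available. As above, sqs_client is anything with the boto3 SQS client's API shape:

```python
def redrive(sqs_client, dlq_url: str, source_url: str, limit: int = 100) -> int:
    """Move up to `limit` messages from the DLQ back to the source
    queue. Returns the number of messages moved."""
    moved = 0
    while moved < limit:
        resp = sqs_client.receive_message(QueueUrl=dlq_url,
                                          MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break  # DLQ drained
        for msg in messages:
            if moved >= limit:
                break
            # Re-send first, delete second: a crash in between yields a
            # duplicate (safe for idempotent consumers), never a loss.
            sqs_client.send_message(QueueUrl=source_url,
                                    MessageBody=msg["Body"])
            sqs_client.delete_message(QueueUrl=dlq_url,
                                      ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
    return moved
```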

Want structured learning?

Take the full SQS course →