You’ve hit a snag with SQS Dead-Letter Queues (DLQs) and need to configure retry limits. This isn’t just about what happens when messages fail; it’s about how SQS decides to move a message to the DLQ in the first place. SQS has no standalone "retry limit" setting. Instead, retry behavior comes from two settings on the source queue: its visibility timeout and its redrive policy, which names the DLQ and sets maxReceiveCount.

Here’s what’s actually breaking: the redrive policy on your source queue is missing, or its maxReceiveCount is too low for the retry behavior you intend. When a consumer receives a message, the message becomes invisible for the period defined by VisibilityTimeout. If the consumer fails to delete the message within this timeout (e.g., due to an error), SQS makes it visible again. Every receive increments the message’s receive count. Once that count exceeds maxReceiveCount, SQS moves the message to the DLQ instead of delivering it again.
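The lifecycle described above can be sketched as a small simulation. This is a simplified model, not SQS itself: the function name and the succeeds_on_attempt parameter are made up for illustration, and timing is ignored entirely.

```python
def simulate_deliveries(max_receive_count, succeeds_on_attempt=None):
    """Return (receive_count, outcome) for a single message.

    Simplified model: every receive increments the count. The consumer deletes
    the message only on the attempt numbered `succeeds_on_attempt` (None means
    it never succeeds). Once the allowed receives are used up, the message is
    moved to the DLQ instead of being delivered again.
    """
    receive_count = 0
    while receive_count < max_receive_count:
        receive_count += 1  # counted on receive, not on failure
        if receive_count == succeeds_on_attempt:
            return receive_count, "deleted"
        # Otherwise the visibility timeout expires and the message reappears.
    return receive_count, "moved_to_dlq"


print(simulate_deliveries(5, succeeds_on_attempt=3))  # (3, 'deleted')
print(simulate_deliveries(5))                         # (5, 'moved_to_dlq')
```

With maxReceiveCount set to 5, a consumer that never succeeds sees the message five times before it lands in the DLQ.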

Common Causes and Fixes

  1. Source Queue’s maxReceiveCount is Too Low:

    • Diagnosis: Check the maxReceiveCount in your source queue’s RedrivePolicy. It is not a standalone queue attribute, so request RedrivePolicy and read the value from its JSON.
      aws sqs get-queue-attributes --queue-url YOUR_SOURCE_QUEUE_URL --attribute-names RedrivePolicy
      
      Look for a value that is too small for the transient failures your application can expect.
    • Fix: Increase maxReceiveCount to a reasonable number by setting the RedrivePolicy, either when creating the queue or by updating its attributes. For example, to set it to 5:
      aws sqs set-queue-attributes --queue-url YOUR_SOURCE_QUEUE_URL --attributes '{ "RedrivePolicy": "{\"deadLetterTargetArn\": \"YOUR_DLQ_ARN\", \"maxReceiveCount\": \"5\"}" }'
      
    • Why it works: A higher maxReceiveCount lets SQS redeliver the message to consumers more times before treating it as "failed" and sending it to the DLQ. This provides more opportunities for transient issues to resolve or for a consumer to successfully process and delete the message.
  2. DLQ Redrive Policy Not Configured on Source Queue:

    • Diagnosis: Verify if the DLQ redrive policy is attached to your source queue.
      aws sqs get-queue-attributes --queue-url YOUR_SOURCE_QUEUE_URL --attribute-names RedrivePolicy
      
      If the output is empty or doesn’t contain the expected deadLetterTargetArn and maxReceiveCount, it’s not configured.
    • Fix: Configure the redrive policy on the source queue, specifying the DLQ ARN and the maxReceiveCount that triggers the redrive.
      aws sqs set-queue-attributes --queue-url YOUR_SOURCE_QUEUE_URL --attributes '{ "RedrivePolicy": "{\"deadLetterTargetArn\": \"YOUR_DLQ_ARN\", \"maxReceiveCount\": \"5\"}" }'
      
    • Why it works: This policy tells SQS which DLQ to use and sets the maxReceiveCount threshold that triggers the move. maxReceiveCount is not a standalone attribute of the source queue; it exists only inside the RedrivePolicy, which is the mechanism that links the source queue to the DLQ.
  3. Incorrect maxReceiveCount in Redrive Policy:

    • Diagnosis: If you’ve confirmed the redrive policy is present, check the maxReceiveCount value within the RedrivePolicy JSON.
      aws sqs get-queue-attributes --queue-url YOUR_SOURCE_QUEUE_URL --attribute-names RedrivePolicy
      # Example output: "RedrivePolicy": "{\"deadLetterTargetArn\": \"YOUR_DLQ_ARN\", \"maxReceiveCount\": \"3\"}"
      
      This value might be lower than you expect, for example because it was left at whatever was chosen when the queue was first created.
    • Fix: Update the RedrivePolicy with the desired maxReceiveCount.
      aws sqs set-queue-attributes --queue-url YOUR_SOURCE_QUEUE_URL --attributes '{ "RedrivePolicy": "{\"deadLetterTargetArn\": \"YOUR_DLQ_ARN\", \"maxReceiveCount\": \"10\"}" }'
      
    • Why it works: The maxReceiveCount within the RedrivePolicy is the only threshold SQS consults. Raising it gives each message more delivery attempts before SQS moves it to the DLQ.
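The RedrivePolicy checks and fixes above could also be scripted. The following is a minimal sketch, assuming you pass in a boto3 SQS client as sqs; the helper names are made up for illustration, and the queue URL and DLQ ARN are placeholders.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count):
    """Return the Attributes mapping to pass to set_queue_attributes."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # RedrivePolicy is a JSON document stored as a string attribute.
    return {"RedrivePolicy": json.dumps(policy)}


def read_max_receive_count(sqs, queue_url):
    """Return the configured maxReceiveCount, or None if no redrive policy exists."""
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["RedrivePolicy"]
    )
    raw = response.get("Attributes", {}).get("RedrivePolicy")
    return int(json.loads(raw)["maxReceiveCount"]) if raw else None


def set_max_receive_count(sqs, queue_url, dlq_arn, max_receive_count):
    """Attach or update the redrive policy on the source queue."""
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=build_redrive_attributes(dlq_arn, max_receive_count),
    )


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   sqs = boto3.client("sqs")
#   if read_max_receive_count(sqs, "YOUR_SOURCE_QUEUE_URL") is None:
#       set_max_receive_count(sqs, "YOUR_SOURCE_QUEUE_URL", "YOUR_DLQ_ARN", 5)
```

Reading the value back after setting it is a cheap way to confirm that the policy landed on the source queue rather than on the DLQ.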
  4. DLQ’s Own VisibilityTimeout is Too Short:

    • Diagnosis: Check the VisibilityTimeout of your DLQ.
      aws sqs get-queue-attributes --queue-url YOUR_DLQ_URL --attribute-names VisibilityTimeout
      
      If this is very short (e.g., 5 seconds) and a consumer or tool reads the DLQ to inspect or reprocess messages, a message can become visible again in the DLQ before that work finishes and then be picked up a second time.
    • Fix: Increase the VisibilityTimeout of the DLQ to a value that allows sufficient time for any immediate inspection or automated redrive attempts. For example, set it to 300 seconds (5 minutes):
      aws sqs set-queue-attributes --queue-url YOUR_DLQ_URL --attributes '{ "VisibilityTimeout": "300" }'
      
    • Why it works: A longer visibility timeout on the DLQ keeps a received message hidden until the DLQ’s consumer (or an automated redrive mechanism) has finished with it, which avoids handling the same failed message twice. A message in the DLQ is not moved anywhere else unless the DLQ has a redrive policy of its own.
  5. Consumer Not Deleting Messages:

    • Diagnosis: This is more of an application-level issue, but it shows up as messages hitting the DLQ. Observe your source queue’s CloudWatch metrics. If ApproximateNumberOfMessagesVisible is consistently high and NumberOfMessagesDeleted is low, or if ApproximateAgeOfOldestMessage keeps climbing, your consumers might not be deleting messages after processing them.
    • Fix: Ensure your consumer application code explicitly calls sqs.deleteMessage() after successfully processing a message. Handle exceptions gracefully, and don’t delete messages if processing fails, allowing them to become visible again and potentially be retried (up to maxReceiveCount).
    • Why it works: Deleting the message is the only way to stop SQS from redelivering it. Each redelivery increments the receive count, so a message that is processed but never deleted eventually exceeds maxReceiveCount and is moved to the DLQ.
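A consumer that follows the delete-only-on-success rule might look like the sketch below. It assumes sqs is a boto3 SQS client; drain_once and handle are made-up names, and catching every exception is a simplification you would likely narrow in real code.

```python
def drain_once(sqs, queue_url, handle, max_messages=10, wait_seconds=20):
    """Receive one batch and delete only the messages that `handle` processes.

    `handle` is your own callable; it should raise on failure.
    Returns a (deleted, failed) tuple of counts.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=wait_seconds,  # long polling
    )
    deleted = failed = 0
    for message in response.get("Messages", []):
        try:
            handle(message["Body"])
        except Exception:
            # Do not delete: the message reappears after the visibility
            # timeout and is retried until maxReceiveCount is exceeded.
            failed += 1
            continue
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        deleted += 1
    return deleted, failed


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   sqs = boto3.client("sqs")
#   drain_once(sqs, "YOUR_SOURCE_QUEUE_URL", handle=print)
```

Placing the delete call after the try block means a failure in the delete itself is not mistaken for a processing failure.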
  6. Message Processing Time Exceeds VisibilityTimeout:

    • Diagnosis: Compare your typical message processing time to the VisibilityTimeout of your source queue. If processing consistently takes longer than the timeout, messages will become visible again and be redelivered, eventually hitting maxReceiveCount.
    • Fix: Increase the VisibilityTimeout of your source queue to be longer than your longest expected message processing time. For instance, if processing can take up to 60 seconds, set it to 120 seconds:
      aws sqs set-queue-attributes --queue-url YOUR_SOURCE_QUEUE_URL --attributes '{ "VisibilityTimeout": "120" }'
      
    • Why it works: A longer VisibilityTimeout gives your consumers more time to process a message before it’s automatically made visible again. This reduces the chance of messages being redelivered due to processing taking too long, thus preventing premature hits to maxReceiveCount.
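To pick a VisibilityTimeout for the last case, you could derive it from measured processing times. The helper below is a rough sketch: the function name and the default safety factor of 2 are assumptions chosen to mirror the 60-to-120-second example above, not an AWS recommendation. The 12-hour cap reflects the maximum visibility timeout SQS allows.

```python
import math

MAX_VISIBILITY_TIMEOUT = 43200  # seconds; SQS allows at most 12 hours


def suggest_visibility_timeout(processing_seconds, safety_factor=2.0):
    """Suggest a VisibilityTimeout from observed processing durations.

    Takes the longest observed duration, multiplies it by `safety_factor`,
    rounds up to a whole second, and clamps the result to the SQS maximum.
    """
    if not processing_seconds:
        raise ValueError("need at least one observed duration")
    suggested = math.ceil(max(processing_seconds) * safety_factor)
    return min(max(suggested, 1), MAX_VISIBILITY_TIMEOUT)


print(suggest_visibility_timeout([12, 45, 60]))  # 120
```

If the suggestion hits the cap, a fixed timeout is probably the wrong tool and the processing itself needs to be split into smaller units.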

Once your redrive policy and source queue attributes are correctly configured, the next problem you are likely to hit is on the DLQ side: messages accumulating in the DLQ without being properly handled.
