SQS message retention is the only configurable limit on how long a message can sit in a queue, and most people get it wrong.

Let’s see SQS in action with a simple fan-out pattern. Imagine a single "OrderCreated" event that two downstream services both need to react to:

{
  "order_id": "ORD12345",
  "customer_id": "CUST9876",
  "items": [
    {"sku": "SKU001", "quantity": 2},
    {"sku": "SKU005", "quantity": 1}
  ]
}
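Before any consumer sees this event, a producer has to put it on a queue. A minimal sketch of that side (the queue URL is illustrative, and `build_order_event` / `publish_order` are helpers invented here for clarity):

```python
import json

def build_order_event(order_id, customer_id, items):
    """Serialize an OrderCreated event into an SQS message body."""
    return json.dumps({
        "order_id": order_id,
        "customer_id": customer_id,
        "items": items,
    })

def publish_order(queue_url, order_id, customer_id, items):
    """Send the event; the producer neither knows nor cares who consumes it."""
    import boto3  # local import keeps build_order_event usable without boto3
    sqs = boto3.client("sqs")
    return sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=build_order_event(order_id, customer_id, items),
    )
```

In a fan-out setup the producer would call publish_order once per consumer queue, or publish a single message to an SNS topic that every queue subscribes to.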

This event is consumed by two separate Lambda functions: InventoryUpdater and NotificationSender. One caveat up front: a single SQS queue delivers each message to only one consumer, so a fan-out like this needs a queue per Lambda, typically both fed from one SNS topic (or by the producer sending to each queue).

InventoryUpdater Lambda:

import json
import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/InventoryQueue'

def lambda_handler(event, context):
    for record in event['Records']:
        message_body = json.loads(record['body'])
        order_id = message_body['order_id']
        print(f"Updating inventory for order: {order_id}")
        # ... actual inventory update logic ...
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=record['receiptHandle']
        )
    return {
        'statusCode': 200,
        'body': json.dumps('Inventory updated successfully!')
    }

NotificationSender Lambda:

import json
import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/NotificationQueue'

def lambda_handler(event, context):
    for record in event['Records']:
        message_body = json.loads(record['body'])
        order_id = message_body['order_id']
        customer_id = message_body['customer_id']
        print(f"Sending notification for order: {order_id} to customer: {customer_id}")
        # ... actual notification logic (e.g., SES, SNS) ...
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=record['receiptHandle']
        )
    return {
        'statusCode': 200,
        'body': json.dumps('Notification sent successfully!')
    }

When an OrderCreated message arrives, each Lambda processes its copy independently. Crucially, the message must be deleted once it’s handled. With an SQS event source mapping, Lambda deletes a batch automatically when the function returns successfully; the explicit sqs.delete_message calls above additionally protect records that were already processed if a later record in the same batch throws. If a message is never deleted (e.g., the handler keeps erroring), SQS makes it visible again after the visibility timeout expires and redelivers it.
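A related consequence of the visibility timeout: if processing legitimately runs long, the consumer should extend the timeout rather than let the message reappear mid-work. A sketch of that, as a small wrapper around the real boto3 call (the wrapper name is invented here):

```python
def extend_visibility(sqs_client, queue_url, receipt_handle, seconds):
    """Restart the message's invisibility clock so the current consumer
    gets more time before SQS redelivers it to someone else."""
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=seconds,  # new timeout, counted from now
    )
```

A consumer would call this periodically (a "heartbeat") while chewing on a slow record, keeping in mind that a message's total invisibility is capped at 12 hours.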

The core problem SQS solves here is decoupling. The service that produces the "OrderCreated" event doesn’t need to know how or when inventory is updated or notifications are sent. It just fires the event into the queue. This allows the downstream services (Lambdas in this case) to operate at their own pace, scale independently, and even fail without impacting the producer.

The "Message Retention Period" is the ultimate safety net. It’s the maximum time a message can remain in the queue before SQS automatically deletes it. By default, this is set to 4 days (96 hours).

The key levers you control are:

  • Visibility Timeout: How long a message is hidden from other consumers after it’s received. If not deleted within this time, it reappears for another consumer.
  • Message Retention Period: How long a message stays in the queue total, regardless of visibility timeouts. After this period, it’s gone forever.
  • Dead-Letter Queue (DLQ): A separate queue where messages are sent if they fail processing a certain number of times (max receives). This is not the same as retention, but it’s how you handle persistent failures without losing messages to the retention period.
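All three levers are just queue attributes. A sketch of creating a queue with each one set explicitly, assuming you've already created the DLQ and read its QueueArn (the queue names, timeout values, and maxReceiveCount of 5 are illustrative):

```python
import json

def queue_attributes(visibility_timeout_s, retention_s, dlq_arn, max_receives):
    """Build the attribute map for CreateQueue / SetQueueAttributes.
    SQS expects every attribute value as a string; the RedrivePolicy
    value is itself a JSON document."""
    return {
        "VisibilityTimeout": str(visibility_timeout_s),
        "MessageRetentionPeriod": str(retention_s),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receives,
        }),
    }

def create_order_queue(dlq_arn):
    """Create the queue with all three levers set up front."""
    import boto3  # local import keeps queue_attributes usable without boto3
    sqs = boto3.client("sqs")
    return sqs.create_queue(
        QueueName="OrderQueue",
        Attributes=queue_attributes(
            visibility_timeout_s=60,    # hide received messages for 1 minute
            retention_s=4 * 24 * 3600,  # keep messages at most 4 days
            dlq_arn=dlq_arn,            # the DLQ's QueueArn, created beforehand
            max_receives=5,             # after 5 failed receives, move to the DLQ
        ),
    )
```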

You configure the message retention period when you create or update an SQS queue. For both Standard and FIFO queues, the allowed range is 1 minute to 14 days.

Here’s how you’d set it to 7 days using the AWS CLI:

aws sqs create-queue \
    --queue-name MyLongRetentionQueue \
    --attributes '{"MessageRetentionPeriod": "604800"}'

The value 604800 is the number of seconds (7 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute).
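The same update can be scripted for an existing queue. A sketch using boto3 (the `retention_seconds` helper is invented here to guard the allowed range; the queue URL would be whatever get_queue_url returns for your queue):

```python
def retention_seconds(days):
    """Convert days to the seconds value SQS expects, enforcing
    the allowed range of 60 seconds to 14 days."""
    seconds = int(days * 24 * 60 * 60)
    if not 60 <= seconds <= 14 * 24 * 60 * 60:
        raise ValueError("retention must be between 60 seconds and 14 days")
    return seconds

def set_retention(queue_url, days):
    """Update MessageRetentionPeriod on an existing queue."""
    import boto3  # local import keeps retention_seconds usable without boto3
    boto3.client("sqs").set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"MessageRetentionPeriod": str(retention_seconds(days))},
    )
```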

The surprising thing about message retention is that setting it too high can mask failures and make debugging harder. SQS doesn’t charge for storage time, but a message your consumers keep failing on will cycle through receive/timeout/redeliver loops, burning API requests and compute, until the retention period finally expires and deletes it, unprocessed and unrecoverable. Conversely, setting retention too low means messages can expire before a slow consumer ever gets to them, especially when a backlog builds faster than your consumers drain it.

The primary reason you’d want a long retention period is for archival or audit purposes, or if your consumers are expected to be offline for extended periods and you need to guarantee eventual processing. Most applications, however, should aim for a retention period that’s just long enough to account for occasional processing hiccups and retries, plus a buffer. If a message fails more than a few times, it’s usually a sign of a bug that needs fixing, and it should go to a DLQ. If your consumers are consistently slow, you should scale them up, not rely on a long retention period to catch up. The default of 4 days is often too long for typical microservice architectures.
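One way to sanity-check whether you’re leaning on retention instead of scaling: estimate how much of the retention window the current backlog would consume at your consumers’ throughput. These estimator helpers are invented for illustration, and the throughput figure you feed in is your own measurement:

```python
def drain_time_seconds(backlog, messages_per_second):
    """How long the current backlog takes to drain at a given throughput."""
    if messages_per_second <= 0:
        raise ValueError("throughput must be positive")
    return backlog / messages_per_second

def retention_headroom(backlog, messages_per_second, retention_s):
    """Fraction of the retention period the backlog would consume.
    Values near 1.0 mean messages risk expiring unprocessed."""
    return drain_time_seconds(backlog, messages_per_second) / retention_s

def current_backlog(queue_url):
    """Read the approximate queue depth from SQS."""
    import boto3  # local import keeps the estimators usable without boto3
    resp = boto3.client("sqs").get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(resp["Attributes"]["ApproximateNumberOfMessages"])
```

If retention_headroom keeps creeping toward 1.0, the fix is more consumer capacity, not a longer retention period.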

The next problem you’ll encounter is managing the complexity of message ordering when you start using FIFO queues.

Want structured learning?

Take the full SQS course →