SQS backpressure means your consumers can’t keep up with the rate messages are being produced.

Let’s watch a simple SQS queue and consumer in action. Imagine we have a high-volume-processor service that needs to process messages from an SQS queue named order-processing-queue.

First, we’ll create the queue:

aws sqs create-queue --queue-name order-processing-queue --attributes VisibilityTimeout=30

This creates a standard SQS queue with a 30-second visibility timeout. Messages that are received but not deleted within 30 seconds will become visible again.

Now, let’s simulate message production. We can use a simple script to send messages to our queue:

import boto3
import json
import time

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = sqs.get_queue_url(QueueName='order-processing-queue')['QueueUrl']

for i in range(1000):
    message_body = json.dumps({"order_id": i, "item": "widget", "quantity": 1})
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=message_body
    )
    print(f"Sent message {i}")
    time.sleep(0.05)  # Simulate messages arriving at ~20/sec

This script sends 1000 messages, each representing an order, with a small delay between them.
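If you want fewer API calls on the producer side, SQS also supports send_message_batch with up to 10 messages per request. A minimal sketch of how the script above could group its orders into batches (the chunk_into_batches helper is illustrative, not part of the original script):

```python
import json

def chunk_into_batches(orders, batch_size=10):
    """Group order dicts into SQS SendMessageBatch entries (max 10 per call)."""
    batches = []
    for start in range(0, len(orders), batch_size):
        entries = [
            {"Id": str(order["order_id"]), "MessageBody": json.dumps(order)}
            for order in orders[start:start + batch_size]
        ]
        batches.append(entries)
    return batches

orders = [{"order_id": i, "item": "widget", "quantity": 1} for i in range(1000)]
batches = chunk_into_batches(orders)
print(len(batches))  # 100 API calls instead of 1000
# Each batch would then be sent with:
#   sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```

Note that the Id field only needs to be unique within a single batch; SQS uses it to correlate per-message results in the response.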

Next, our high-volume-processor service will poll and process these messages. Here’s a simplified consumer:

import boto3
import time

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = sqs.get_queue_url(QueueName='order-processing-queue')['QueueUrl']

def process_message(message_body):
    # In a real scenario, this would involve complex business logic
    print(f"Processing: {message_body}")
    time.sleep(0.5) # Simulate processing taking 0.5 seconds

while True:
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20 # Long polling
    )

    if 'Messages' in response:
        for message in response['Messages']:
            process_message(message['Body'])
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle']
            )
        print(f"Processed {len(response['Messages'])} messages.")
    else:
        print("No messages received.")

This consumer uses long polling (WaitTimeSeconds=20) to reduce empty responses and API cost. It fetches up to 10 messages per call, but processes them sequentially at 0.5 seconds each, capping its throughput at 2 messages per second.

Let’s analyze the throughput. The producer sends messages at approximately 20 per second (1 / 0.05). The consumer, however, can only process messages at a rate of 2 per second (1 / 0.5). This mismatch is where backpressure begins.
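The mismatch can be quantified directly. A back-of-the-envelope sketch, using the rates from the example scripts:

```python
# Back-of-the-envelope backlog math for the producer/consumer rates above.
produce_rate = 1 / 0.05   # 20 messages/sec
consume_rate = 1 / 0.5    # 2 messages/sec (sequential, 0.5 s each)
net_growth = produce_rate - consume_rate   # backlog grows at 18 messages/sec

total_messages = 1000
produce_seconds = total_messages / produce_rate   # 50 s to send everything
backlog_at_end = net_growth * produce_seconds     # messages queued when producer stops
drain_seconds = backlog_at_end / consume_rate     # time for the consumer to catch up

print(f"Backlog when producer finishes: {backlog_at_end:.0f} messages")
print(f"Time to drain after producer stops: {drain_seconds:.0f} s")
```

So by the time the producer finishes its 50-second run, 900 messages are sitting in the queue, and the consumer needs another 7.5 minutes to clear them even though no new messages are arriving.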

When the producer outpaces the consumer, messages accumulate in the SQS queue. SQS publishes CloudWatch metrics that let you observe this; the most critical one for backpressure is ApproximateNumberOfMessagesVisible.

If ApproximateNumberOfMessagesVisible starts to grow steadily and exceeds a predefined threshold (e.g., 1000 messages), it indicates that the queue is filling up faster than it’s being emptied. This is the "backpressure alarm." Your consumers are falling behind.
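One way to turn that threshold into an actual alarm is CloudWatch. The sketch below only builds the parameter dict; passing it to boto3's cloudwatch.put_metric_alarm would create the alarm (the alarm name is a placeholder, and the SNS topic for notifications is left out):

```python
# Parameters for a CloudWatch alarm on queue depth. To create the alarm:
#   boto3.client('cloudwatch').put_metric_alarm(**alarm_params)
alarm_params = {
    "AlarmName": "order-processing-queue-backlog",   # hypothetical name
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateNumberOfMessagesVisible",
    "Dimensions": [{"Name": "QueueName", "Value": "order-processing-queue"}],
    "Statistic": "Average",
    "Period": 60,              # evaluate one-minute data points
    "EvaluationPeriods": 3,    # require 3 consecutive breaches to avoid flapping
    "Threshold": 1000,         # the backlog threshold from the text
    "ComparisonOperator": "GreaterThanThreshold",
    # "AlarmActions": [...],   # e.g. an SNS topic ARN for paging
}
print(alarm_params["MetricName"])
```

Requiring several consecutive evaluation periods keeps a brief traffic spike from paging anyone; only a sustained backlog trips the alarm.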

The core problem is a mismatch between the rate of incoming messages and the rate at which messages can be processed. This can stem from various factors:

  • Under-provisioned Consumers: The most common cause. Your consumer instances don’t have enough CPU, memory, or network bandwidth to handle the load. They might also be too few in number.
    • Diagnosis: Monitor CPU/memory utilization of your consumer instances. Check the number of consumer instances running.
    • Fix: Scale up your consumer instances (more CPU/RAM) or scale out (more instances). For example, if using EC2 Auto Scaling, increase the desired capacity or adjust scaling policies. If using Kubernetes, increase replica count or resource requests/limits.
    • Why it works: More compute resources or more instances allow the consumers to process messages in parallel or faster individually, catching up to the producer rate.
  • Inefficient Consumer Logic: The code within your process_message function is too slow. This could be due to complex computations, inefficient database queries, or slow external API calls.
    • Diagnosis: Profile your process_message function to identify bottlenecks. Use tools like Python’s cProfile or APM solutions.
    • Fix: Optimize the processing logic. Cache expensive lookups, batch operations where possible, or refactor slow algorithms. For instance, if a database call takes too long, consider indexing the relevant tables or using a more efficient query.
    • Why it works: Faster individual message processing directly increases the consumer’s throughput.
  • Network Latency/Bandwidth Issues: The consumers might be experiencing high latency or low bandwidth when communicating with SQS, or with downstream services they depend on.
    • Diagnosis: Use tools like ping, traceroute, or network monitoring tools to check latency and bandwidth between your consumers and AWS services, or between your consumers and their dependencies.
    • Fix: Ensure your consumers are deployed in the same AWS region and Availability Zone as the SQS queue. If using on-premises or hybrid setups, ensure sufficient network connectivity. Consider using AWS Direct Connect for predictable bandwidth.
    • Why it works: Reduced network delays mean consumers can poll for messages and send delete requests more quickly, and interact with dependencies without waiting.
  • Downstream Service Saturation: Your consumer might be fast, but it relies on another service (e.g., a database, another API) that is now the bottleneck.
    • Diagnosis: Monitor the performance metrics of any downstream services your consumers interact with. Look for increased latency, error rates, or resource utilization on those services.
    • Fix: Scale up or optimize the bottlenecked downstream service. This could mean adding read replicas to a database, increasing capacity on an API gateway, or optimizing queries in the downstream system.
    • Why it works: By alleviating the bottleneck in the downstream system, the consumer can complete its work faster and free up its own resources to process more SQS messages.
  • SQS Visibility Timeout Too Short: If the visibility timeout is too short and processing takes longer than that duration, messages can be re-delivered to other consumers (or the same one) before they are deleted, creating duplicate work and artificially inflating queue depth.
    • Diagnosis: Compare your VisibilityTimeout setting with the average/p99 processing time of your messages.
    • Fix: Increase the VisibilityTimeout to be longer than your longest expected processing time. For example, if processing can take up to 2 minutes, set VisibilityTimeout=120.
    • Why it works: A longer visibility timeout keeps a message invisible to other consumers while it’s being processed, preventing premature re-deliveries. (Standard queues are still at-least-once delivery, so duplicates can occur for other reasons; processing should remain idempotent.)
  • SQS Throughput Limits Reached: While less common for standard queues unless at extremely high scale, it’s possible the SQS API itself is throttling your receive_message or delete_message calls.
    • Diagnosis: Watch for RequestThrottled / throttling errors returned by the SQS API in your application logs, or check API-level throttling in CloudWatch usage metrics.
    • Fix: Standard queues support a nearly unlimited number of API calls per second, so throttling there usually points to an application- or account-level issue. FIFO queues do have per-queue throughput limits (300 API calls per second per action, or 3,000 messages per second with batching); use batching, spread load across more message groups, or enable high-throughput mode for FIFO queues.
    • Why it works: Ensuring your application isn’t being throttled by SQS means your requests are being processed without delay.
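To make the "scale out" fix for under-provisioned consumers concrete: even within a single process, the sequential consumer above can be parallelized with a worker pool when process_message is I/O-bound. A sketch (the thread count of 10 is an arbitrary starting point, not a recommendation):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_message(body):
    time.sleep(0.5)   # stand-in for I/O-bound work (DB calls, external APIs)
    return body

# Sequential: 10 messages * 0.5 s = ~5 s. With 10 workers: ~0.5 s wall time.
messages = [f"order-{i}" for i in range(10)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(process_message, messages))
elapsed = time.monotonic() - start

print(f"Processed {len(results)} messages in {elapsed:.1f} s")
# In the real consumer, delete_message would still be called per message,
# and only after its process_message call succeeds.
```

This lifts the single-process ceiling from 2 messages/second toward 20, matching the producer rate in the example; CPU-bound workloads would need processes or more instances instead of threads.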
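For the visibility-timeout pitfall, an alternative to one very long static timeout is a "heartbeat" that extends visibility only while work is actually in progress, using change_message_visibility. A sketch (the extend_while_processing wrapper is illustrative, not a boto3 feature):

```python
import threading

def extend_while_processing(sqs, queue_url, receipt_handle,
                            interval=20, extension=30):
    """Periodically extend a message's visibility until stop is set."""
    stop = threading.Event()

    def heartbeat():
        # wait() returns False on timeout, True once stop is set
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extension,  # reset the clock to `extension` s
            )

    threading.Thread(target=heartbeat, daemon=True).start()
    return stop   # caller calls stop.set() after processing + delete succeed
```

The interval must be comfortably shorter than the extension, so the message never becomes visible between heartbeats; if the worker crashes, the heartbeats stop and SQS re-delivers the message, which is exactly the safety net you want.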

Once a fix lands, the metric to watch is ApproximateAgeOfOldestMessage: a steadily falling value confirms the backlog is being drained faster than new messages arrive.
