SQS consumer lag isn’t just about how many messages are in the queue; it’s about how fast they’re being processed.

Let’s see what a busy SQS queue and its consumers look like in action. Imagine you have a service that processes image uploads. When a user uploads a photo, your application sends a message to an SQS queue like image-processing-queue. A separate fleet of worker instances polls this queue, downloads the image, resizes it, and stores it in S3.

Here’s a snapshot of what your CloudWatch metrics might show:

  • ApproximateNumberOfMessagesVisible: This is the most common metric, showing messages waiting to be processed. If this number is steadily increasing, you have lag.
  • ApproximateNumberOfMessagesNotVisible: Messages that have been received by a consumer but not yet deleted. A high number here might indicate consumers are struggling to complete their work and delete messages before the visibility timeout expires.
  • NumberOfMessagesSent: The rate at which messages are being added to the queue.
  • NumberOfMessagesReceived: The rate at which messages are being polled from the queue.
  • NumberOfMessagesDeleted: The rate at which messages are successfully processed and deleted.

The core idea is to compare NumberOfMessagesSent (or NumberOfMessagesReceived) with NumberOfMessagesDeleted. If NumberOfMessagesDeleted consistently lags behind the rate of incoming messages, your consumers are falling behind.
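As a sketch of that comparison, both rates can be pulled from CloudWatch and checked against each other. This is a minimal example assuming boto3 credentials are configured; the 15-minute window and the 0.9 tolerance are illustrative choices, not fixed values:

```python
from datetime import datetime, timedelta, timezone

def metric_sum(cloudwatch, queue_name, metric_name, minutes=15):
    """Sum an AWS/SQS metric for one queue over the last `minutes` minutes."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])

def is_falling_behind(sent, deleted, tolerance=0.9):
    """Consumers are lagging if deletions cover less than `tolerance`
    of the messages sent over the same window."""
    return sent > 0 and (deleted / sent) < tolerance
```

With a real client this would be wired up as `cw = boto3.client("cloudwatch")`, then comparing `metric_sum(cw, "image-processing-queue", "NumberOfMessagesSent")` against `metric_sum(cw, "image-processing-queue", "NumberOfMessagesDeleted")`.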

The problem SQS consumer lag monitoring solves is simple: your system is taking longer to process incoming work than it is to receive it. This can lead to a backlog that grows indefinitely, increasing latency for users and potentially overwhelming your worker fleet.

The mental model for SQS consumer lag is a race between producers (sending messages) and consumers (processing messages).

  • Producers: Your application logic that puts messages onto the SQS queue.
  • Consumers: Your worker instances (EC2, Lambda, ECS tasks, etc.) that poll SQS, process messages, and delete them.
  • Queue: The buffer between producers and consumers. It’s designed to absorb bursts, not a sustained arrival rate above consumer capacity.
  • Lag: Occurs when the rate of messages arriving into the queue exceeds the rate at which they are being successfully processed and deleted.
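The consumer side of that race can be sketched as a simple poll-process-delete loop. Assuming `sqs` is a boto3 SQS client and `handler` is your own processing function (both are placeholders here), one drain pass looks like:

```python
def drain_once(sqs, queue_url, handler):
    """Receive one batch, process each message, and delete it on success.
    Returns the number of messages handled."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # batch up to 10 messages per poll
        WaitTimeSeconds=20,      # long polling cuts down empty receives
    )
    handled = 0
    for msg in resp.get("Messages", []):
        handler(msg["Body"])     # e.g. download, resize, store the image
        # Delete only after success; otherwise the message reappears
        # once the visibility timeout expires.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
        handled += 1
    return handled
```

A worker would run this in a loop, e.g. `while True: drain_once(sqs, queue_url, process_image)`, reusing a single `boto3.client("sqs")` across iterations.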

The exact levers you control are:

  1. Consumer Throughput: How many messages your workers can process per unit of time. This is determined by the processing logic, instance/container size, concurrency, and network.
  2. Visibility Timeout: The duration a message is hidden from other consumers after being received. If a consumer fails to delete the message within this timeout, it reappears in the queue, potentially leading to duplicate processing if not handled idempotently.
  3. Queue Depth (ApproximateNumberOfMessagesVisible): A lagging indicator, but critical. A consistently rising queue depth signals that your processing rate is insufficient.
  4. Number of Consumers: Scaling up the number of worker instances or tasks directly increases your processing capacity.
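Lever 4 is the one most teams automate, for example behind a target-tracking scaling policy on ApproximateNumberOfMessagesVisible. One way to reason about the target is to size the fleet from the backlog and a measured per-worker rate. This is an illustrative sketch; the rates and fleet bounds are assumptions, not recommendations:

```python
import math

def desired_workers(backlog, rate_per_worker, drain_minutes,
                    min_workers=1, max_workers=50):
    """Workers needed to drain `backlog` visible messages within
    `drain_minutes`, if each worker processes `rate_per_worker`
    messages per minute. Clamped to the fleet's min/max bounds."""
    needed = math.ceil(backlog / (rate_per_worker * drain_minutes))
    return max(min_workers, min(max_workers, needed))
```

For example, a backlog of 1,200 messages, workers that each handle 10 messages per minute, and a 10-minute drain target call for 12 workers.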

The most surprising true thing about SQS consumer lag is that a steadily growing ApproximateNumberOfMessagesVisible is often less alarming than a subtler pattern: a stable but high ApproximateNumberOfMessagesNotVisible, with NumberOfMessagesSent running only slightly ahead of NumberOfMessagesDeleted. The former is a clear backlog you can see and scale for. The latter means your consumers are barely keeping up — struggling with individual message processing times or pinned at maximum concurrency — which leaves the system brittle: any hiccup will cause ApproximateNumberOfMessagesVisible to spike dramatically.

The next concept you’ll run into is handling message processing failures gracefully and idempotently, especially when dealing with the visibility timeout.
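As a preview, idempotent handling usually means recording processed message IDs in a durable store, so that a redelivery (after a visibility timeout expires mid-processing) becomes a no-op. A toy sketch, with a plain set standing in for that store:

```python
def handle_idempotently(seen_ids, message_id, process):
    """Run `process` only if this message hasn't been handled before.
    `seen_ids` stands in for a durable store (e.g. a database table);
    a redelivered duplicate is skipped instead of processed twice."""
    if message_id in seen_ids:
        return False              # duplicate delivery: skip
    process()
    seen_ids.add(message_id)      # record only after success
    return True
```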

Want structured learning?

Take the full SQS course →