SQS CloudWatch Alarms: Alert on Queue Depth Thresholds

CloudWatch alarms on SQS queue depth are surprisingly ineffective if you don’t account for the inherent variability of message arrival and processing.

Let’s see it in action. Imagine you have a standard SQS queue processing tasks. Messages arrive, get processed, and disappear.

{
  "MessageId": "a1b2c3d4-e5f6-7890-1234-abcdef123456",
  "ReceiptHandle": "...",
  "MD5OfBody": "...",
  "Body": "{\"task_id\": \"task-1001\", \"data\": \"some_payload\"}"
}

If you set a CloudWatch alarm on ApproximateNumberOfMessagesVisible to trigger when it exceeds 100, you’ll get a lot of false positives. Why? Because SQS is a distributed system, and ApproximateNumberOfMessagesVisible is an estimate. It’s not a precise, real-time count. It can lag, and it’s aggregated across multiple SQS servers. A temporary spike in incoming messages, even if your consumers are keeping up, can push this number over your threshold for a few minutes before it settles back down.

The real problem is that a static threshold doesn’t understand the rate at which your queue is filling or emptying. It only sees a snapshot.
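To see why a snapshot is ambiguous, compare two queues with the same current depth but opposite trends (the numbers are illustrative, not from any real queue):

```python
# Two hypothetical queues, ApproximateNumberOfMessagesVisible sampled once per minute.
draining = [300, 250, 200, 150]   # backlog shrinking: consumers are keeping up
filling  = [50, 80, 110, 150]     # backlog growing: consumers are falling behind

# A static threshold of 100 on the latest snapshot treats them identically...
threshold = 100
print(draining[-1] > threshold, filling[-1] > threshold)  # True True

# ...but the trend over the window tells the real story.
def trend(samples):
    """Net change in queue depth over the sampled window."""
    return samples[-1] - samples[0]

print(trend(draining))  # -150: healthy, backlog draining
print(trend(filling))   # +100: unhealthy, backlog growing
```

Both queues would page you with a naive threshold alarm; only the second one deserves it.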

Here’s how you should approach it.

1. Understanding the Metrics:

  • ApproximateNumberOfMessagesVisible: The number of messages that are visible in the queue. These are messages that have not yet been deleted and are available for consumers to receive.
  • ApproximateNumberOfMessagesNotVisible: The number of messages that are in flight (received by a consumer but not yet deleted).
  • ApproximateNumberOfMessagesDelayed: The number of messages that are scheduled to become visible later.

For queue depth alarms, ApproximateNumberOfMessagesVisible is your primary metric.

2. The Wrong Way (and why it fails):

A common, but flawed, approach is to set an alarm like this:

  • Metric: ApproximateNumberOfMessagesVisible
  • Statistic: Average
  • Period: 5 minutes
  • Threshold: 100 (or some other static number)
  • Condition: GreaterThan

This will fire when the average number of visible messages over 5 minutes is greater than 100. As explained, this leads to false alarms due to SQS’s eventual consistency and temporary processing lags.
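For concreteness, here is roughly what that flawed alarm looks like as `put_metric_alarm` parameters (a sketch; the alarm and queue names are hypothetical):

```python
# Hypothetical names; the parameters mirror the flawed static-threshold alarm above.
flawed_alarm = {
    "AlarmName": "sqs-queue-depth-naive",                             # hypothetical
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateNumberOfMessagesVisible",
    "Dimensions": [{"Name": "QueueName", "Value": "my-task-queue"}],  # hypothetical
    "Statistic": "Average",
    "Period": 300,                       # 5 minutes, in seconds
    "EvaluationPeriods": 1,              # a single period: one spike is enough to fire
    "Threshold": 100,
    "ComparisonOperator": "GreaterThanThreshold",
}
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**flawed_alarm)
```

Note `EvaluationPeriods: 1` — a single noisy 5-minute average crossing 100 is enough to page someone.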

3. The Right Way: Rate-Based Alerting & Consumer Throughput:

The real indicator of a problem isn’t a high message count in isolation, but a backlog that isn’t shrinking over time, or is steadily growing. That means your consumers can’t keep up with the inflow.

You need to compare the rate of messages entering the queue to the rate of messages leaving it.

  • Metric 1: ApproximateNumberOfMessagesVisible
  • Metric 2: NumberOfMessagesReceived (This is a counter, not a gauge. You need to look at its rate of change.)
  • Metric 3: NumberOfMessagesDeleted (Also a counter.)

Diagnosis Command: To check the current state and recent history of your queue, use the AWS CLI:

aws sqs get-queue-attributes --queue-url YOUR_QUEUE_URL --attribute-names ApproximateNumberOfMessagesVisible ApproximateNumberOfMessagesNotVisible
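The command returns JSON with string values. A small sketch of reading it (the response below is a made-up example, not real output):

```python
import json

# Made-up example of a get-queue-attributes response; real values will differ.
response = json.loads("""
{
  "Attributes": {
    "ApproximateNumberOfMessagesVisible": "342",
    "ApproximateNumberOfMessagesNotVisible": "58"
  }
}
""")

attrs = response["Attributes"]
visible = int(attrs["ApproximateNumberOfMessagesVisible"])       # waiting for consumers
in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])  # received, not yet deleted
print(f"backlog: {visible} visible + {in_flight} in flight = {visible + in_flight} total")
```

Visible plus in-flight gives you the total outstanding work, which is often more useful for diagnosis than either number alone.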

Fix Strategy:

Instead of a static threshold on ApproximateNumberOfMessagesVisible, create alarms that monitor the trend or a combination of metrics.

Option A: Alert on a High ApproximateNumberOfMessagesVisible with a Low Deletion Rate.

This is a more robust approach. You’re not just saying "too many messages," but "too many messages, and not enough are being processed."

  • Alarm 1: High Queue Depth

    • Metric: ApproximateNumberOfMessagesVisible
    • Statistic: Average
    • Period: 15 minutes
    • Threshold: 500 (Adjust this based on your acceptable backlog)
    • Condition: GreaterThan
    • Reasoning: This is your "danger zone" indicator. A sustained high number of visible messages is concerning.
  • Alarm 2: Low Consumer Throughput

    • Metric: NumberOfMessagesDeleted (or NumberOfMessagesReceived, if you want to detect consumers that have stopped receiving messages entirely, not just stopped deleting them)
    • Statistic: Sum
    • Period: 5 minutes
    • Threshold: 50 (an average of 10 messages deleted per minute over the 5-minute period. Adjust this based on your expected consumer throughput; it should sit well below your normal rate.)
    • Condition: LessThan
    • Reasoning: This alarm fires if your consumers are deleting fewer messages than expected over a period. This indicates a processing bottleneck.
  • Composite Alarm: Create a composite alarm that triggers only when BOTH Alarm 1 AND Alarm 2 are in ALARM state for 10 minutes. Requiring both conditions filters out the two benign cases: a traffic spike your consumers absorb (depth high, but deletions keeping pace) and a quiet queue (deletions low, but nearly nothing in the queue).
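Option A could be sketched as boto3-style parameter dicts plus a composite alarm rule (alarm and queue names are hypothetical; thresholds follow the values above):

```python
# Hypothetical queue/alarm names; thresholds follow Option A above.
queue_dim = [{"Name": "QueueName", "Value": "my-task-queue"}]

high_depth = {
    "AlarmName": "HighQueueDepth",
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateNumberOfMessagesVisible",
    "Dimensions": queue_dim,
    "Statistic": "Average",
    "Period": 900,                  # 15 minutes
    "EvaluationPeriods": 1,
    "Threshold": 500,
    "ComparisonOperator": "GreaterThanThreshold",
}

low_throughput = {
    "AlarmName": "LowConsumerThroughput",
    "Namespace": "AWS/SQS",
    "MetricName": "NumberOfMessagesDeleted",
    "Dimensions": queue_dim,
    "Statistic": "Sum",
    "Period": 300,                  # 5 minutes
    "EvaluationPeriods": 2,         # must stay low for 10 minutes, not one bad period
    "Threshold": 50,
    "ComparisonOperator": "LessThanThreshold",
}

# Composite alarm rule: both child alarms must be in ALARM state.
composite_rule = 'ALARM("HighQueueDepth") AND ALARM("LowConsumerThroughput")'
# With boto3: put_composite_alarm(AlarmName=..., AlarmRule=composite_rule, ...)
```

Setting `EvaluationPeriods: 2` on the throughput alarm is one way to express "for 10 minutes" with a 5-minute period.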

Option B: Alert on ApproximateNumberOfMessagesVisible relative to Consumer Processing Rate.

This is more advanced and requires understanding your typical consumer processing rate.

  1. Identify your typical consumer processing rate. For example, if you have 10 consumers, and each can process 10 messages per minute, your expected deletion rate is 100 messages/minute.

  2. Set an alarm on ApproximateNumberOfMessagesVisible that triggers if it’s high and NumberOfMessagesDeleted is low.

    • Metric 1: ApproximateNumberOfMessagesVisible
      • Statistic: Average
      • Period: 10 minutes
      • Threshold: 200
      • Condition: GreaterThan
    • Metric 2: NumberOfMessagesDeleted
      • Statistic: Sum
      • Period: 10 minutes
      • Threshold: 500 (This means less than 50 messages deleted per minute on average over 10 minutes, assuming a target of 100/min. Adjust this threshold to be significantly below your expected throughput, e.g., 50% of your target.)
      • Condition: LessThan
    • Composite Alarm: Trigger if Metric 1 is GreaterThan 200 AND Metric 2 is LessThan 500 for 10 minutes.

Fix with Real Values (Example for Option B):

  • Alarm Name: SQS_Queue_Depth_Warning_High_Messages_Low_Throughput
  • Composite Alarm Logic: (ALARM: HighQueueDepth) AND (ALARM: LowDeletionRate)
  • Alarm 1 (HighQueueDepth):
    • Metric: AWS/SQS, ApproximateNumberOfMessagesVisible
    • Statistic: Average
    • Period: 10 minutes
    • Threshold: 200
    • Condition: GreaterThanThreshold
    • Treat Missing Data: ignore
  • Alarm 2 (LowDeletionRate):
    • Metric: AWS/SQS, NumberOfMessagesDeleted
    • Statistic: Sum
    • Period: 10 minutes
    • Threshold: 500 (This means less than 50 messages deleted per minute on average. If your consumers should be deleting 100/min, this is a problem.)
    • Condition: LessThanThreshold
    • Treat Missing Data: breaching (If no messages are deleted, we want to know.)
  • Action: Send to SNS topic arn:aws:sns:us-east-1:123456789012:my-ops-alerts
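Pulled together, the configuration above could be created roughly like this (a sketch; the queue name is hypothetical, the SNS ARN is the example one from above):

```python
# Sketch of the real-value configuration above as parameter dicts.
queue_dim = [{"Name": "QueueName", "Value": "my-task-queue"}]  # hypothetical queue
sns_topic = "arn:aws:sns:us-east-1:123456789012:my-ops-alerts"

high_queue_depth = {
    "AlarmName": "HighQueueDepth",
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateNumberOfMessagesVisible",
    "Dimensions": queue_dim,
    "Statistic": "Average",
    "Period": 600,                       # 10 minutes
    "EvaluationPeriods": 1,
    "Threshold": 200,
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "ignore",
}

low_deletion_rate = {
    "AlarmName": "LowDeletionRate",
    "Namespace": "AWS/SQS",
    "MetricName": "NumberOfMessagesDeleted",
    "Dimensions": queue_dim,
    "Statistic": "Sum",
    "Period": 600,
    "EvaluationPeriods": 1,
    "Threshold": 500,
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "breaching",     # no deletions at all should count against us
}

composite = {
    "AlarmName": "SQS_Queue_Depth_Warning_High_Messages_Low_Throughput",
    "AlarmRule": 'ALARM("HighQueueDepth") AND ALARM("LowDeletionRate")',
    "AlarmActions": [sns_topic],
}
# With boto3: cw = boto3.client("cloudwatch")
#   cw.put_metric_alarm(**high_queue_depth)
#   cw.put_metric_alarm(**low_deletion_rate)
#   cw.put_composite_alarm(**composite)
```

The two metric alarms must exist before the composite alarm that references them can be created.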

Why it works mechanically: By combining a gauge (visible messages) with a rate counter (deleted messages), you create an alarm that is sensitive to both a sudden influx of messages and a gradual failure of your consumers to keep up. A temporary spike in ApproximateNumberOfMessagesVisible that quickly resolves won’t trigger the alarm if NumberOfMessagesDeleted is keeping pace. Conversely, a slow but steady increase in visible messages, even if never reaching a "critical" static number, will eventually trigger the alarm if the deletion rate is too low.
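Mechanically, the combined condition behaves like this on a few illustrative samples:

```python
def composite_fires(avg_visible, deleted_sum, depth_threshold=200, deletion_floor=500):
    """AND of the two conditions from the example above (illustrative values)."""
    return avg_visible > depth_threshold and deleted_sum < deletion_floor

# Spike that consumers absorb: depth high, but deletions keeping pace -> no alarm.
print(composite_fires(avg_visible=250, deleted_sum=900))   # False

# Quiet queue: nothing to delete, but no backlog either -> no alarm.
print(composite_fires(avg_visible=5, deleted_sum=0))       # False

# Real problem: sustained backlog and stalled consumers -> alarm.
print(composite_fires(avg_visible=250, deleted_sum=120))   # True
```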

The next error you’ll hit is likely related to message processing failures or consumer starvation if the queue depth continues to grow unabated.

Want structured learning?

Take the full SQS course →