SQS CloudWatch metrics are often treated as simple counters, but read together they reveal how your message queues behave from minute to minute, frequently before your application notices a problem.

Let’s watch an SQS queue in action. Imagine we have a simple producer-consumer setup. The producer sends messages, and the consumer polls for them, processes them, and deletes them.

# Producer sending a message
aws sqs send-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --message-body "{\"order_id\": 12345}"

# Consumer polling for messages
aws sqs receive-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --max-number-of-messages 1 --visibility-timeout 30

# Consumer processing and deleting the message
# (assuming message receipt handle is ABCDEFG...)
aws sqs delete-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --receipt-handle ABCDEFG

Now, how do CloudWatch metrics track this?

The most fundamental metric is ApproximateNumberOfMessagesVisible. This is the number of messages in the queue that are available for retrieval. When a producer sends a message, this number goes up. When a consumer receives a message, this number goes down, because the message becomes invisible for the duration of its visibility timeout; deleting the message then removes it from the queue for good. It’s your primary indicator of queue "fullness."

ApproximateNumberOfMessagesNotVisible is crucial for understanding consumer lag. It counts messages that are in flight: received by a consumer but not yet deleted, and not yet past their visibility timeout. If this metric steadily increases, messages are being received faster than they are deleted, which signals that consumers are struggling to finish their work.

ApproximateNumberOfMessagesDelayed tells you about messages that are not yet available because of a delay, set either per message with the DelaySeconds parameter of SendMessage or as the queue’s default delay. A sustained non-zero value here means message availability is being slowed down intentionally, perhaps for batching or rate-limiting purposes.
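Because a stored message is visible, in flight, or delayed, the three gauges above can be added to estimate how many messages the queue currently holds. A minimal Python sketch, assuming the latest datapoint for each metric has already been fetched; the `QueueDepth` class and its field names are hypothetical and not part of any AWS SDK:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueDepth:
    """Latest datapoints for the three approximate SQS gauges (hypothetical container)."""

    visible: int      # ApproximateNumberOfMessagesVisible
    not_visible: int  # ApproximateNumberOfMessagesNotVisible (in flight)
    delayed: int      # ApproximateNumberOfMessagesDelayed

    @property
    def total(self) -> int:
        # Rough number of messages stored in the queue; every input is approximate.
        return self.visible + self.not_visible + self.delayed

    @property
    def in_flight_share(self) -> float:
        # Fraction of stored messages currently held by consumers.
        return self.not_visible / self.total if self.total else 0.0


# Made-up sample values, for illustration only.
depth = QueueDepth(visible=120, not_visible=30, delayed=50)
print(depth.total, depth.in_flight_share)
```

The values are estimates by design, so the sum is best used for trends rather than exact accounting.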

NumberOfMessagesSent is a simple counter of all messages sent to the queue. It’s useful for correlating with producer activity and understanding overall throughput.

NumberOfMessagesReceived counts messages returned by ReceiveMessage calls, so a message that is delivered more than once is counted each time. Comparing this to NumberOfMessagesSent can highlight retrieval issues, though ApproximateNumberOfMessagesNotVisible is usually a more direct indicator of processing delays.

NumberOfMessagesDeleted tracks successful deletions. A consistent gap between NumberOfMessagesReceived and NumberOfMessagesDeleted (especially when ApproximateNumberOfMessagesNotVisible is growing) is a strong signal that consumers are failing to delete messages after processing. This could be due to processing errors, network issues, or incorrect deletion logic.
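The received-versus-deleted comparison can be reduced to two small numbers. This is a sketch only, assuming `received` and `deleted` are Sum statistics for NumberOfMessagesReceived and NumberOfMessagesDeleted over the same window; the function names are hypothetical:

```python
def delete_gap(received: int, deleted: int) -> int:
    """Messages handed to consumers but not deleted within the same window."""
    # Clamp at zero: deletes for messages received in an earlier window
    # can make `deleted` exceed `received` for a short time.
    return max(received - deleted, 0)


def delete_ratio(received: int, deleted: int) -> float:
    """Share of received messages that were also deleted (1.0 means no gap)."""
    return deleted / received if received else 1.0


# Made-up sample sums for one window.
print(delete_gap(500, 420), delete_ratio(500, 420))
```

A delete can land in a later window than its receive, so a small gap in a single window is normal; it is the persistent gap that deserves attention.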

NumberOfEmptyReceives is a simple count of poll requests that returned no messages. A high number of empty receives, especially when ApproximateNumberOfMessagesVisible is low, can indicate an inefficient polling strategy, typically consumers short polling a queue that is naturally empty. Long polling (setting WaitTimeSeconds on ReceiveMessage, up to 20 seconds) usually brings this number down.
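One rough way to quantify polling efficiency is to compare empty receives with messages actually received. CloudWatch does not publish the number of ReceiveMessage calls, so the sketch below uses messages received as a stand-in; that substitution, and the function name, are assumptions made for illustration:

```python
def empty_receives_per_message(empty_receives: int, messages_received: int) -> float:
    """Empty polls per received message over one window (higher means more wasted polls)."""
    if messages_received == 0:
        # Nothing was received: any empty poll at all makes the ratio unbounded.
        return float("inf") if empty_receives else 0.0
    return empty_receives / messages_received


# Made-up sample: 900 empty polls for 100 messages received.
print(empty_receives_per_message(900, 100))
```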

The true power comes from combining these. For instance, if ApproximateNumberOfMessagesVisible stays low and ApproximateNumberOfMessagesNotVisible is low and stable, your consumers are likely keeping pace. If ApproximateNumberOfMessagesVisible is growing and ApproximateNumberOfMessagesNotVisible is also growing, your consumers are falling behind. If ApproximateNumberOfMessagesVisible is low but ApproximateNumberOfMessagesNotVisible is high, messages are being retrieved but not processed and deleted, indicating a bottleneck in your consumer logic or processing environment.
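The combinations above can be written as a small classifier. This is a sketch under stated assumptions: it treats a low, flat visible count as the healthy case, the inputs are recent datapoints ordered oldest to newest, and the `low`, `high`, and `min_increase` cutoffs are arbitrary placeholders to tune per queue.

```python
from enum import Enum
from typing import Sequence


class QueueState(Enum):
    KEEPING_PACE = "keeping pace"
    FALLING_BEHIND = "falling behind"
    STUCK_IN_FLIGHT = "received but not deleted"
    UNCLEAR = "unclear"


def is_growing(series: Sequence[int], min_increase: int) -> bool:
    # Crude trend check: compare the newest datapoint with the oldest.
    return len(series) >= 2 and series[-1] - series[0] >= min_increase


def classify(visible: Sequence[int], not_visible: Sequence[int],
             low: int = 100, high: int = 1000, min_increase: int = 50) -> QueueState:
    """Map recent visible / not-visible datapoints to one of the cases in the text."""
    v_now, nv_now = visible[-1], not_visible[-1]
    v_up = is_growing(visible, min_increase)
    nv_up = is_growing(not_visible, min_increase)
    if v_up and nv_up:
        return QueueState.FALLING_BEHIND
    if v_now <= low and nv_now >= high:
        return QueueState.STUCK_IN_FLIGHT
    if v_now <= low and nv_now <= low and not v_up and not nv_up:
        return QueueState.KEEPING_PACE
    return QueueState.UNCLEAR


# Made-up datapoints: both gauges are climbing.
print(classify([100, 400, 900], [50, 120, 300]))
```

Anything that does not match a named case falls through to `UNCLEAR` rather than being forced into a verdict.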

A common pattern is to set up alarms on these metrics. For example, an alarm on ApproximateNumberOfMessagesVisible exceeding a threshold (e.g., 1000) can alert you to a growing backlog. An alarm on ApproximateNumberOfMessagesNotVisible exceeding a certain percentage of ApproximateNumberOfMessagesVisible (e.g., 80%) can indicate consumer processing delays; because it compares two metrics, it has to be defined with a CloudWatch metric math expression rather than a single-metric threshold.
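Both alarms can be described as keyword arguments in the shape that boto3's `put_metric_alarm` accepts. The sketch below only builds the dictionaries, so it runs without AWS credentials; the alarm names, the 60-second period, the five evaluation periods, and the `notBreaching` choice are illustrative assumptions rather than recommendations.

```python
def _dimensions(queue_name: str) -> list:
    return [{"Name": "QueueName", "Value": queue_name}]


def backlog_alarm(queue_name: str, threshold: int = 1000) -> dict:
    """Single-metric alarm: visible messages above a fixed threshold."""
    return {
        "AlarmName": f"{queue_name}-backlog",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": _dimensions(queue_name),
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def in_flight_ratio_alarm(queue_name: str, ratio: float = 0.8) -> dict:
    """Metric math alarm: not-visible divided by visible above `ratio`."""
    def stat(metric_id: str, metric_name: str) -> dict:
        return {
            "Id": metric_id,
            "ReturnData": False,  # input to the expression, not the alarmed series
            "MetricStat": {
                "Metric": {"Namespace": "AWS/SQS", "MetricName": metric_name,
                           "Dimensions": _dimensions(queue_name)},
                "Period": 60,
                "Stat": "Maximum",
            },
        }

    return {
        "AlarmName": f"{queue_name}-in-flight-ratio",  # hypothetical naming scheme
        "Metrics": [
            stat("visible", "ApproximateNumberOfMessagesVisible"),
            stat("in_flight", "ApproximateNumberOfMessagesNotVisible"),
            {"Id": "ratio", "Expression": "in_flight / visible",
             "Label": "In-flight to visible ratio", "ReturnData": True},
        ],
        "EvaluationPeriods": 5,
        "Threshold": ratio,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# With credentials configured, the dictionaries could be passed on like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**backlog_alarm("my-queue"))
print(backlog_alarm("my-queue")["AlarmName"])
```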

One subtle point most people miss is the interplay between ApproximateNumberOfMessagesNotVisible and the visibility timeout. If your consumers consistently take longer to process a message than the visibility timeout allows, the message becomes visible again when the timeout expires, even though the original consumer is still working on it. Another consumer can then receive it, which leads to duplicate processing if not handled carefully and shows up as ApproximateNumberOfMessagesNotVisible not decreasing as expected, or even fluctuating wildly. It’s often better to increase the visibility timeout or optimize consumer processing than to rely solely on alarms.
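A back-of-the-envelope model shows how quickly redeliveries add up. It is deliberately simplified and rests on assumptions that real queues will not meet exactly: another consumer picks the message up the instant it reappears, every attempt takes the same time, and the timeout is never extended. The function name is hypothetical.

```python
import math


def expected_deliveries(processing_seconds: float, visibility_timeout: float) -> int:
    """Deliveries of one message before the first consumer finishes (simplified model)."""
    if processing_seconds <= 0 or visibility_timeout <= 0:
        raise ValueError("both durations must be positive")
    # The message is delivered at t = 0 and again each time the timeout
    # expires before processing completes.
    return math.ceil(processing_seconds / visibility_timeout)


# Made-up durations, with the 30-second timeout used in the receive-message example.
for seconds in (20, 45, 90):
    print(seconds, expected_deliveries(seconds, 30))
```

In this model, any processing time above the timeout produces at least one duplicate delivery, which is why the timeout should comfortably exceed typical processing time.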

The next logical step is understanding how to use these metrics for automated scaling.
