SQS’s 120,000 in-flight message limit is a hard cap, not a soft suggestion, and hitting it means your consumers effectively stop receiving new messages.

Let’s see this in action. Imagine a queue `my-processing-queue` that’s been steadily receiving messages. If it hits the 120,000 in-flight limit, new `ReceiveMessage` calls made with short polling fail with an `OverLimit` error; with long polling, SQS simply returns no messages and no error. (`SendMessage` is unaffected, because the limit applies to in-flight messages, not queue depth. The 120,000 figure is for standard queues; FIFO queues cap in-flight messages at 20,000.)
This limit isn’t about the total number of messages in the queue (which can be virtually unlimited), but specifically about messages that a consumer has received but not yet deleted, and whose visibility timeout has not yet expired. These are the messages that SQS is tracking as "in flight" on behalf of consumers.
The primary reason this limit exists is to prevent resource exhaustion and maintain queue stability. If a queue could hold an unbounded number of in-flight messages, a runaway producer or a stuck consumer could overwhelm SQS’s internal state management for that queue, impacting performance for all users of that queue and potentially other queues on the same SQS endpoint.
Here’s what’s happening under the hood: for every message in flight, SQS maintains metadata. This includes visibility timeout information, message IDs, and pointers to the actual message data. When the volume of this metadata reaches a critical threshold for a single queue, SQS applies the cap to prevent further strain.
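You can watch how close a queue is to that cap yourself: SQS exposes the in-flight count as the `ApproximateNumberOfMessagesNotVisible` queue attribute. Here is a minimal sketch; the helper names are this example’s own, and the `__main__` demo uses made-up sample numbers rather than a live call:

```python
# Sketch: gauge how close a queue is to the 120K in-flight cap.
# Function names and the sample numbers below are hypothetical.

def in_flight_headroom(attributes, cap=120_000):
    """Return (in_flight, fraction_of_cap) from SQS queue attributes.

    SQS reports the in-flight count via the
    ApproximateNumberOfMessagesNotVisible attribute.
    """
    in_flight = int(attributes["ApproximateNumberOfMessagesNotVisible"])
    return in_flight, in_flight / cap


def fetch_attributes(queue_url):
    import boto3  # imported here so the pure helper stays dependency-free

    sqs = boto3.client("sqs")
    resp = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessagesNotVisible"],
    )
    return resp["Attributes"]


if __name__ == "__main__":
    # Offline demo with sample data instead of a live call:
    count, frac = in_flight_headroom(
        {"ApproximateNumberOfMessagesNotVisible": "90000"}
    )
    print(f"{count} messages in flight ({frac:.0%} of the cap)")
```

Alerting when `in_flight_headroom(fetch_attributes(queue_url))` crosses, say, 80% of the cap gives you time to react before receives start failing.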
## Common Causes and How to Fix Them
1. **Stuck Consumers:** A consumer has received messages but failed to process and delete them before their visibility timeout expires. The messages then become visible again, and if the consumer is still stuck, they re-enter the in-flight count on the next receive.

   - **Diagnosis:** Check your consumer application logs for errors. Look for repeated processing of the same `MessageId`, or messages that are consistently received but never deleted. SQS metrics in CloudWatch can show `ApproximateAgeOfOldestMessage` increasing and `NumberOfMessagesSent` exceeding `NumberOfMessagesDeleted`.
   - **Fix:** Implement robust error handling and dead-letter queues (DLQs). Ensure your consumers call `DeleteMessage` promptly after successful processing. If a consumer fails, it should not hold onto the message indefinitely. For persistent issues, configure a DLQ to offload problematic messages for later analysis.

     ```shell
     aws sqs create-queue --queue-name my-processing-queue-dlq \
       --attributes '{"VisibilityTimeout": "30"}'
     aws sqs set-queue-attributes --queue-url <YOUR_QUEUE_URL> \
       --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\": \"<YOUR_DLQ_ARN>\", \"maxReceiveCount\": 10}"}'
     ```

   - **Why it works:** The DLQ acts as a safety net. After a message has been received a set number of times (here, 10) without being deleted, SQS automatically moves it to the DLQ. This frees the original queue from repeatedly retrying a message that’s likely causing issues, and it keeps those retries from artificially inflating the in-flight count.
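The delete-only-on-success pattern can be sketched as a small consumer loop. This is illustrative, not a library API: `drain` and its parameters are names invented for this example, and the handler is whatever your application does per message.

```python
# Sketch: delete only on success; failures become visible again after the
# visibility timeout and, after maxReceiveCount receives, redrive to the DLQ.

def drain(sqs, queue_url, handler, max_batches=1):
    """Receive up to `max_batches` batches and delete each message
    only after `handler` processes it without raising."""
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            try:
                handler(msg["Body"])
            except Exception:
                # Do NOT delete: the message becomes visible again after the
                # visibility timeout, and after maxReceiveCount attempts SQS
                # moves it to the DLQ instead of retrying forever.
                continue
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```

In production, `sqs` would be a Boto3 SQS client and the loop would run until shutdown rather than for a fixed batch count.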
2. **Visibility Timeout Too Short:** Messages are processed successfully, but the visibility timeout expires before the consumer can finish processing and delete them.

   - **Diagnosis:** Monitor `ApproximateAgeOfOldestMessage` in CloudWatch. If this metric is consistently high and close to your `VisibilityTimeout` value, and you see `NumberOfMessagesSent` significantly outpacing `NumberOfMessagesDeleted` without obvious consumer errors, this is a likely culprit.
   - **Fix:** Increase the `VisibilityTimeout` for your queue to a value longer than your longest expected processing time. A common starting point is 5-10 minutes (300-600 seconds).

     ```shell
     aws sqs set-queue-attributes --queue-url <YOUR_QUEUE_URL> \
       --attributes '{"VisibilityTimeout": "600"}'
     ```

   - **Why it works:** A longer timeout gives your consumers more time to work on a message without it becoming visible again. This reduces the chance of a message being re-processed, or contributing twice to the in-flight count due to premature visibility.
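Rather than guessing at 5-10 minutes, you can derive the timeout from observed processing times. A sketch, where the 2x safety factor and the helper names are this example’s assumptions, not AWS recommendations (SQS caps the timeout at 12 hours, i.e. 43,200 seconds):

```python
import math

# Sketch: size VisibilityTimeout from the slowest observed processing time.
# The 2x safety factor is an assumption for illustration.

def suggest_visibility_timeout(p99_processing_seconds, safety_factor=2.0):
    """Return a timeout comfortably above the slowest expected processing,
    capped at SQS's 12-hour maximum (43,200 seconds)."""
    return min(43_200, math.ceil(p99_processing_seconds * safety_factor))


def apply_visibility_timeout(queue_url, seconds):
    import boto3

    boto3.client("sqs").set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"VisibilityTimeout": str(seconds)},
    )
```

For a worst-case processing time of 300 seconds, this suggests 600 seconds, matching the CLI example above.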
3. **High-Throughput Producers:** You’re sending messages at a rate that exceeds your consumers’ ability to process and delete them.

   - **Diagnosis:** Compare the `NumberOfMessagesSent` and `NumberOfMessagesReceived` metrics in CloudWatch. If `NumberOfMessagesSent` is consistently much higher than `NumberOfMessagesReceived` (or `ApproximateNumberOfMessagesVisible` is growing rapidly), your producer is outpacing your consumers.
   - **Fix:** Scale up your consumers (add more instances or increase their processing capacity) to match or exceed the production rate. Alternatively, consider adding a buffer queue or slowing down your producer if possible.

     ```shell
     # Example: scale up EC2 instances running your consumers
     aws autoscaling set-desired-capacity \
       --auto-scaling-group-name my-consumer-asg --desired-capacity 10
     ```

   - **Why it works:** By increasing the number of consumers or their processing power, you raise the rate at which messages are received and deleted, bringing the in-flight message count down.
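The metric comparison is easy to automate: pull the `Sum` of each metric over a window from CloudWatch and check the ratio. A sketch, where the 15-minute window and the 1.2 tolerance ratio are illustrative assumptions:

```python
# Sketch: automate the sent-vs-deleted comparison for one queue.
# Window size and tolerance ratio are arbitrary illustrative choices.

def consumers_falling_behind(sent, deleted, tolerance=1.2):
    """True when messages arrive materially faster than they are deleted."""
    if deleted == 0:
        return sent > 0
    return sent / deleted > tolerance


def metric_sum(queue_name, metric_name, minutes=15):
    import datetime

    import boto3

    cw = boto3.client("cloudwatch")
    end = datetime.datetime.utcnow()
    resp = cw.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=end - datetime.timedelta(minutes=minutes),
        EndTime=end,
        Period=minutes * 60,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])
```

Wiring it up: `consumers_falling_behind(metric_sum("my-processing-queue", "NumberOfMessagesSent"), metric_sum("my-processing-queue", "NumberOfMessagesDeleted"))` makes a reasonable trigger for a scale-up action.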
4. **Long-Running Batch Operations:** Consumers process messages in batches but take too long to complete the entire batch, causing the visibility timeout to expire for messages within that batch before they are all deleted.

   - **Diagnosis:** Examine consumer logic. If a consumer receives a batch of messages (`ReceiveMessage` with `MaxNumberOfMessages` > 1) and performs a single, long operation covering the whole batch, this can lead to timeouts. Check `ApproximateAgeOfOldestMessage` and `NumberOfMessagesSent` vs. `NumberOfMessagesDeleted`.
   - **Fix:** Process messages individually within the consumer, or ensure the batch operation finishes well within the `VisibilityTimeout` for every message. If processing a batch takes a long time, use the `ChangeMessageVisibility` API to extend the visibility timeout for messages as they are being processed.

     ```python
     # Example in Python (Boto3)
     for message in messages:
         # Process message
         # ...
         # Extend visibility if processing is long. Note that this sets the
         # timeout to 10 minutes from now; it does not add to the remainder.
         sqs_client.change_message_visibility(
             QueueUrl=queue_url,
             ReceiptHandle=message['ReceiptHandle'],
             VisibilityTimeout=600,
         )

     # Delete messages after all processing
     for message in messages:
         sqs_client.delete_message(
             QueueUrl=queue_url,
             ReceiptHandle=message['ReceiptHandle'],
         )
     ```

   - **Why it works:** Processing messages individually, or strategically extending visibility, keeps each message invisible until it is fully processed and deleted, preventing it from becoming visible again prematurely.
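The other fix mentioned above, processing and deleting each message individually, can be sketched like this (the function and parameter names are illustrative; `sqs` stands for a Boto3 SQS client):

```python
# Sketch: process and delete each message one at a time, so a slow batch
# never lets earlier messages' visibility timeouts lapse while they wait.

def process_batch_individually(sqs, queue_url, messages, handler):
    for msg in messages:
        handler(msg["Body"])  # per-message work; may be slow
        sqs.delete_message(   # delete immediately, not at the end of the batch
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
```

The trade-off versus batch-at-the-end deletion is one `DeleteMessage` call per message, which is usually a fair price for never losing completed work to a timeout.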
5. **Queue Configuration Issues:** An incorrect `VisibilityTimeout`, or no DLQ configured at all.

   - **Diagnosis:** Review your queue’s attributes using the AWS CLI or console.

     ```shell
     aws sqs get-queue-attributes --queue-url <YOUR_QUEUE_URL> \
       --attribute-names VisibilityTimeout RedrivePolicy
     ```

   - **Fix:** As described in points 1 and 2, ensure `VisibilityTimeout` is adequate and a DLQ is configured for robust error handling.
   - **Why it works:** Proper configuration ensures the queue behaves as expected under load and that errors are handled gracefully.
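To make that review repeatable, you can run the returned attributes through a small audit. A sketch; the 60-second floor and the finding strings are illustrative assumptions, not AWS defaults:

```python
import json

# Sketch: flag risky configuration from get-queue-attributes output.
# Thresholds here are illustrative, not AWS recommendations.

def audit_queue(attributes, min_visibility=60):
    findings = []
    if int(attributes.get("VisibilityTimeout", "30")) < min_visibility:
        findings.append("VisibilityTimeout may be too short")
    if "RedrivePolicy" not in attributes:
        findings.append("no DLQ configured (missing RedrivePolicy)")
    else:
        policy = json.loads(attributes["RedrivePolicy"])
        if int(policy.get("maxReceiveCount", 0)) < 2:
            findings.append("maxReceiveCount < 2 gives messages no retries")
    return findings
```

Feed it the `Attributes` dict from the CLI or Boto3 call above; an empty list means the two checks in this sketch passed.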
6. **Unacknowledged Messages Due to Application Crashes:** If a consumer application crashes after receiving messages but before deleting them, those messages remain in flight until their visibility timeout expires. If crashes are frequent, this can repeatedly inflate the in-flight count.

   - **Diagnosis:** Correlate SQS metrics with application health monitoring. Look for spikes in `NumberOfMessagesReceived` and `ApproximateNumberOfMessagesVisible` that coincide with application restarts or reported failures.
   - **Fix:** Improve application stability. Implement graceful shutdown procedures. Ensure critical state changes (like message deletion) are handled transactionally or are idempotent to minimize data loss or message duplication on restart.
   - **Why it works:** A more stable application reduces the number of messages abandoned mid-processing, leading to a more predictable in-flight message count.
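A graceful-shutdown sketch: trap termination signals, stop fetching new batches, but finish and delete the messages already in hand. The class and callback names are this example’s own:

```python
import signal

# Sketch: a consumer that drains its current batch on SIGTERM/SIGINT instead
# of abandoning received messages until their visibility timeout expires.

class GracefulConsumer:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self._stop)
        signal.signal(signal.SIGINT, self._stop)

    def _stop(self, signum, frame):
        # Stop *receiving* new batches; in-flight work below still completes.
        self.running = False

    def run(self, receive, handle, delete):
        while self.running:
            for msg in receive():
                handle(msg)
                delete(msg)  # delete before the loop re-checks self.running
```

In container environments, pair this with a termination grace period longer than your worst-case batch, so the orchestrator does not SIGKILL the process mid-drain.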
When you’ve resolved the issue, the next immediate problem is the backlog: producers kept sending while consumption was stalled, so once the in-flight count drops back below the 120K cap, your consumers face a large pile of visible messages and will need time to catch up.