SQS is surprisingly cheap, but the illusion of "free" can lead to massive, unexpected bills if you’re not careful about how you’re interacting with it.
Let’s see SQS in action. Imagine a simple producer-consumer pattern. A web server (producer) receives user requests and puts messages onto an SQS queue. A fleet of worker instances (consumers) poll that queue, pull messages, process them, and then delete them.
Here’s what that looks like from the producer’s side, using the AWS SDK for Python (Boto3):
```python
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-processing-queue'

def send_message(message_body):
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=message_body
    )
    print(f"Sent message ID: {response['MessageId']}")

# Example usage:
send_message("process_user_data:user_id=123")
```
And here’s the consumer side:
```python
import boto3
import time

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-processing-queue'

def process_messages():
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # Batching!
        WaitTimeSeconds=20,      # Long polling!
        VisibilityTimeout=300    # Standard visibility timeout
    )
    if 'Messages' in response:
        for message in response['Messages']:
            print(f"Received message ID: {message['MessageId']}")
            message_body = message['Body']
            # --- Process the message here ---
            print(f"Processing: {message_body}")
            # --- End processing ---
            # Delete the message after successful processing
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle']
            )
            print(f"Deleted message ID: {message['MessageId']}")
    else:
        print("No messages received.")

# Example usage:
while True:
    process_messages()
    # With WaitTimeSeconds > 0 this sleep isn't strictly necessary,
    # but it avoids a tight loop if long polling is ever disabled.
    time.sleep(1)
```
That's the core loop: `send_message` on the producer side, `receive_message` and `delete_message` on the consumer side. The cost isn't in the data transferred, but in the API calls made to SQS. Every SendMessage, ReceiveMessage, and DeleteMessage request incurs a small charge, and at thousands or millions of requests per day it adds up.
The key to optimization lies in minimizing these API calls. The two most impactful levers are batching and long polling.
Batching means sending or receiving multiple messages in a single API call.
- `SendMessage` sends one message per request; `SendMessageBatch` can send up to 10 messages in a single request.
- `ReceiveMessage` can retrieve up to 10 messages in one request (configured via `MaxNumberOfMessages`).
- `DeleteMessage` deletes a single message identified by its `ReceiptHandle`; `DeleteMessageBatch` can delete up to 10 different messages in one request by providing their `ReceiptHandle`s.
Long polling means that when you ReceiveMessage, SQS holds the request open for up to 20 seconds, waiting for messages to arrive. If messages arrive within that time, they are returned immediately. If not, the request times out and returns an empty response. This is crucial because it prevents "empty polls" – frequent ReceiveMessage calls that return nothing, which are pure API call cost with no work done.
Let’s break down the common cost drivers and how to combat them:
- **Excessive `ReceiveMessage` calls (short polling):** If your consumer calls `ReceiveMessage` every second without waiting, and the queue is often empty, you're burning API calls.
  - Diagnosis: Check your consumer code for `ReceiveMessage` calls in a tight loop, and examine the `NumberOfEmptyReceives` metric for the queue in CloudWatch. High numbers indicate short polling or an insufficient `WaitTimeSeconds`.
  - Fix: Implement long polling by setting `WaitTimeSeconds` to a value greater than 0 (e.g., `WaitTimeSeconds=20`). This tells SQS to hold the connection open for up to 20 seconds, so when the queue is quiet your consumer only makes a `ReceiveMessage` call roughly every 20 seconds.
  - Why it works: Instead of polling every second (60 calls/minute), you poll every 20 seconds (3 calls/minute). This dramatically cuts down `ReceiveMessage` API calls.
- **Not batching `SendMessage`:** Your producer sends messages one by one, even when it produces them quickly.
  - Diagnosis: Review your producer code. If `send_message` is called in a loop for each individual item, you're not batching.
  - Fix: Use `SendMessageBatch`. Collect up to 10 messages in memory, then send them in a single `SendMessageBatch` API call.
  - Why it works: Sending 10 messages individually costs 10 `SendMessage` API calls. Sending them via `SendMessageBatch` costs only 1 API call, a 10x reduction.
- **Not batching `DeleteMessage`:** Your consumer fetches 10 messages with `MaxNumberOfMessages=10` but then deletes them one by one.
  - Diagnosis: Look at your consumer's delete logic. If it iterates through `response['Messages']` and calls `sqs.delete_message` for each one individually, you're not batching.
  - Fix: Use `DeleteMessageBatch`. After processing a batch, construct a list of `Id`/`ReceiptHandle` pairs for all successfully processed messages and send them in a single `DeleteMessageBatch` call.
  - Why it works: Deleting 10 messages individually costs 10 `DeleteMessage` API calls. Using `DeleteMessageBatch` costs 1 API call, again a 10x reduction for that batch.
- **Re-polling immediately with `WaitTimeSeconds=0`:** Even when long polling is available, if your application framework or loop logic forces a new `ReceiveMessage` call immediately after the previous one returns (even if empty), call volume stays high.
  - Diagnosis: Consumer logic shaped like `while True: receive_message(); process_messages()` can poll rapidly if `WaitTimeSeconds` is 0 or if processing time is very short.
  - Fix: Ensure `WaitTimeSeconds` is set to a value greater than 0 (e.g., 20), so that empty receives block on the server side instead of returning instantly. A simple `time.sleep(1)` after an empty receive can help if `WaitTimeSeconds` must be 0, but using `WaitTimeSeconds` is preferred.
  - Why it works: `WaitTimeSeconds` is the primary mechanism for throttling `ReceiveMessage` calls when no messages are present.
- **Using FIFO queues for tasks that don't require strict ordering or exactly-once processing:** This isn't an API-call-count problem per se, but FIFO operations cost more per request and have tighter throughput limits.
  - Diagnosis: You're using FIFO queues, but your application logic can tolerate out-of-order or duplicate messages with minimal impact.
  - Fix: Use FIFO queues only when ordering or exactly-once processing is a strict requirement. For most background task processing, standard queues are sufficient and often cheaper, with higher throughput limits and simpler internal mechanics. If you must use FIFO, ensure your producer and consumer are optimized for batching.
  - Why it works: Standard queues offer higher throughput and a simpler, less costly internal model. FIFO queues have higher latency and cost per API operation because of the mechanisms required to maintain ordering and exactly-once delivery.
- **Too many queues:** Each queue has a small overhead, but more importantly, managing and interacting with many queues complicates application logic and can add API calls as you query queue attributes or resolve dynamic destinations.
  - Diagnosis: You have hundreds or thousands of SQS queues, each serving a very specific, small purpose.
  - Fix: Consolidate queues where possible. Use message attributes or a field within the message body to route messages to the correct processing logic within a single, larger queue.
  - Why it works: This reduces queue-management API calls (e.g., `GetQueueUrl`) and simplifies the system overall.
The most common and impactful optimizations are correctly implemented long polling (`WaitTimeSeconds > 0`) and batching (`SendMessageBatch`, `ReceiveMessage` with `MaxNumberOfMessages` > 1, `DeleteMessageBatch`). Implemented correctly, they can cut your API call count for a given workload by orders of magnitude.
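Rough arithmetic makes the point. Suppose one consumer short-polls every second around the clock and a producer moves 100,000 messages a day, all unbatched; the figures below are illustrative assumptions, not measurements:

```python
# Illustrative assumptions: one consumer polling around the clock,
# 100,000 messages produced and consumed per day.
SECONDS_PER_DAY = 86_400
MESSAGES_PER_DAY = 100_000

# Before: short polling once per second, plus one SendMessage and one
# DeleteMessage per message.
before = SECONDS_PER_DAY + MESSAGES_PER_DAY + MESSAGES_PER_DAY

# After: long polling (at most one ReceiveMessage per 20 s when idle),
# plus SendMessageBatch and DeleteMessageBatch in groups of 10.
after = SECONDS_PER_DAY // 20 + MESSAGES_PER_DAY // 10 + MESSAGES_PER_DAY // 10

print(f"before: {before:,} calls/day, after: {after:,} calls/day")
```

Under these assumptions that's roughly a 12x reduction for a single consumer; with a fleet of mostly idle consumers polling the same quiet queue, the gap grows much larger.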
After fixing these, you’ll likely encounter the next most common "cost" concern: data transfer costs if your SQS queues are in regions different from your consumers, or if you’re using SQS within a VPC without VPC endpoints.