SQS Lambda event sources are surprisingly difficult to tune for optimal throughput and error handling: the default settings interact with your queue’s visibility timeout and your function’s timeout in non-obvious ways, and a bad combination leads to duplicate processing, excessive retries, or messages landing in the dead-letter queue prematurely.
Let’s see it in action. Imagine you have an SQS queue receiving orders. A Lambda function processes these orders.
Here’s a simplified Python Lambda handler:
```python
import json

import boto3

sqs = boto3.client('sqs')

def lambda_handler(event, context):
    print(f"Received event: {json.dumps(event)}")
    failed = 0
    for record in event['Records']:
        message_body = record['body']
        receipt_handle = record['receiptHandle']
        print(f"Processing message: {message_body}")
        try:
            # Simulate order processing
            order_data = json.loads(message_body)
            if order_data.get('invalid_item'):
                raise ValueError("Invalid item in order")
            print(f"Successfully processed order: {order_data['order_id']}")
            # Delete the message as soon as it succeeds, so it is not
            # redelivered if a later message in the batch fails.
            sqs.delete_message(
                QueueUrl='YOUR_QUEUE_URL',
                ReceiptHandle=receipt_handle
            )
            print(f"Deleted message with receipt handle: {receipt_handle}")
        except Exception as e:
            failed += 1
            print(f"Error processing message: {message_body}. Error: {e}")
            # The undeleted message becomes visible again after the queue's
            # visibility timeout and is redelivered; after maxReceiveCount
            # failed receives it moves to the DLQ, if one is configured.
    if failed:
        # Fail the invocation so the remaining (undeleted) messages return
        # to the queue. Returning 200 here would make Lambda treat the whole
        # batch as successful and delete the failed messages too.
        raise RuntimeError(f"{failed} message(s) failed processing")
    return {
        'statusCode': 200,
        'body': json.dumps('Processing complete')
    }
```
When Lambda polls SQS, it doesn’t just fetch one message; it fetches a batch. The size of this batch and how many concurrent Lambdas can run are your primary levers.
The core problem Lambda event sources solve is bridging the gap between a long-running, potentially stateful process (your Lambda function) and a message queue that expects quick acknowledgments. Lambda achieves this by:
- Polling: It continuously polls your SQS queue in the background.
- Batching: It retrieves messages in batches, up to a configured BatchSize.
- Invoking Lambda: It invokes your Lambda function with an event payload containing the entire batch of messages.
- Concurrency: It can invoke multiple instances of your Lambda function concurrently.
- Acknowledgment: When your function returns successfully, Lambda deletes the entire batch from SQS for you. If the function raises an error or times out, the batch is not deleted (apart from any messages you deleted explicitly with delete_message, as in the handler above), and the remaining messages become visible again after the VisibilityTimeout expires and are redelivered.
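To make the batch payload concrete, here is a trimmed sketch of the event a two-message batch produces. Real events carry additional fields (messageAttributes, eventSourceARN, awsRegion, and so on), and the IDs and receipt handles below are made-up placeholders:

```python
import json

# Minimal sketch of an SQS batch event as delivered to the handler.
# Only the fields the handler above actually reads are shown.
sample_event = {
    "Records": [
        {
            "messageId": "059f36b4-87a3-44ab-83d2-661975830a7d",  # placeholder
            "receiptHandle": "AQEB...",  # opaque handle used by delete_message
            "body": json.dumps({"order_id": "1001"}),
        },
        {
            "messageId": "2e1424d4-f796-459a-8184-9c92662be6da",  # placeholder
            "receiptHandle": "AQEB...",
            "body": json.dumps({"order_id": "1002", "invalid_item": True}),
        },
    ]
}

def order_ids(event):
    """Extract the order ID from every record body in a batch."""
    return [json.loads(r["body"])["order_id"] for r in event["Records"]]
```

Note that each record’s body is itself a JSON string, so the handler has to json.loads it before touching any order fields.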
The key configuration parameters for an SQS event source mapping are:
- Batch Size: The maximum number of messages Lambda sends to your function in a single invocation. Defaults to 10; standard queues allow up to 10,000 if you also set a batching window, while FIFO queues are capped at 10.
- Maximum Batching Window: The maximum amount of time (in seconds) Lambda waits to gather messages for a batch before invoking your function, even if the BatchSize isn’t reached. Defaults to 0.
- Maximum Concurrency: A cap on the number of concurrent Lambda invocations the event source will use for the queue. For a standard queue, Lambda begins polling with five concurrent batches and scales up as the backlog grows; this setting (minimum 2) puts a ceiling on that scaling.
- Parallelization Factor: Despite the similar-sounding purpose, this knob belongs to Kinesis and DynamoDB stream event sources, not SQS: it controls how many concurrent batches Lambda processes per shard. Defaults to 1.
- Message Group Ordering: For FIFO queues, Lambda processes at most one batch at a time per Message Group ID, preserving order within each group. (Tumbling windows, by contrast, are a stream-aggregation feature of Kinesis and DynamoDB event sources, not SQS.)
The most surprising thing about FIFO concurrency is that no event source setting can raise it beyond the number of active Message Group IDs: because Lambda processes at most one batch at a time per group, a queue with three busy groups will never use more than three concurrent invocations, however high you set the concurrency cap. Concurrency settings can only limit scaling; they never force it. It’s a subtle but important distinction for FIFO queues, where you may want messages within a group processed quickly without overwhelming downstream systems with too many groups at once.
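As a concrete sketch, these parameters map onto the create_event_source_mapping call in boto3. The queue ARN and function name below are hypothetical placeholders, and the live AWS call is left commented out so the snippet runs without credentials:

```python
def sqs_mapping_config(queue_arn, function_name,
                       batch_size=10, batching_window=0, max_concurrency=None):
    """Build the kwargs for lambda.create_event_source_mapping for an SQS
    queue. max_concurrency (2-1000) caps concurrent invocations; None leaves
    Lambda's default scaling behaviour in place."""
    config = {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,  # batch sizes above 10 need a non-zero window
        "MaximumBatchingWindowInSeconds": batching_window,  # 0-300
    }
    if max_concurrency is not None:
        config["ScalingConfig"] = {"MaximumConcurrency": max_concurrency}
    return config

config = sqs_mapping_config(
    "arn:aws:sqs:us-east-1:123456789012:orders",  # hypothetical queue ARN
    "process-orders",                             # hypothetical function name
    batch_size=25,
    batching_window=5,
    max_concurrency=50,
)
# To apply it (requires AWS credentials):
# import boto3; boto3.client("lambda").create_event_source_mapping(**config)
```

Keeping the kwargs in one place like this also makes the tuning values easy to diff and review when you change them later.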
When you’re tuning, you’re balancing throughput (processing more messages faster) against reliability (avoiding retries and errors). A small BatchSize might mean more frequent Lambda invocations, each with less work, but also more overhead. A large BatchSize can lead to longer processing times per invocation, increasing the chance of timeout and message redelivery if one message in the batch fails.
The Maximum Batching Window is crucial for micro-batches. If you set it to, say, 5 seconds, Lambda will wait up to 5 seconds to fill a batch. This can reduce the number of invocations but might increase latency for individual messages if the queue is not very busy.
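A toy model (deliberately simpler than Lambda’s real poller, which also batches per ReceiveMessage call) shows the tradeoff: a longer window means fewer, fuller invocations at the cost of per-message latency:

```python
def batch_arrivals(arrival_times, batch_size, max_window):
    """Toy batching model: dispatch a batch when it holds batch_size
    messages, or when max_window seconds have elapsed since its oldest
    message. arrival_times must be sorted, in seconds."""
    batches, current = [], []
    for t in arrival_times:
        if current and (len(current) == batch_size or t - current[0] > max_window):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# One message per second for 10 seconds:
arrivals = list(range(10))
# A 5-second window groups them into 2 invocations...
few = batch_arrivals(arrivals, batch_size=10, max_window=5)
# ...instead of 10 tiny ones when every message is dispatched alone.
many = batch_arrivals(arrivals, batch_size=1, max_window=5)
```

Under this model the first message in a batch waits up to the full window before its batch is invoked, which is exactly the latency cost the text describes for quiet queues.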
If you’re seeing messages stuck in a retry loop or landing in a DLQ unexpectedly, it’s often because your function takes longer to process a batch than the SQS VisibilityTimeout allows, or because the BatchSize is large enough that a single failing message causes the entire batch to be redelivered. Conversely, if throughput is low, you may need to raise the BatchSize, the Maximum Batching Window, or the concurrency cap.
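The standard remedy for one bad message poisoning a whole batch is partial batch responses: set FunctionResponseTypes=["ReportBatchItemFailures"] on the event source mapping and return only the failed message IDs, and Lambda deletes the rest of the batch for you. A minimal sketch, reusing the hypothetical invalid_item check from earlier:

```python
import json

def lambda_handler(event, context):
    """Handler using partial batch responses: only the messageIds listed in
    batchItemFailures are made visible again and retried; the rest of the
    batch is deleted by Lambda. Requires FunctionResponseTypes set to
    ["ReportBatchItemFailures"] on the event source mapping."""
    failures = []
    for record in event["Records"]:
        try:
            order = json.loads(record["body"])
            if order.get("invalid_item"):
                raise ValueError("Invalid item in order")
            print(f"Processed order {order['order_id']}")
        except Exception as exc:
            print(f"Failed {record['messageId']}: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

With this shape there is no need to call delete_message yourself, and an empty batchItemFailures list acknowledges the whole batch.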
Understanding the interplay between the SQS VisibilityTimeout, the Lambda function timeout, and the event source mapping’s BatchSize and Maximum Batching Window is key to efficient processing. If your function takes 30 seconds to process a batch and the VisibilityTimeout is 60 seconds, you’re probably fine. But if a batch needs 30 seconds and your function times out at 15, the invocation fails mid-batch and every undeleted message is redelivered once the VisibilityTimeout expires, leading to duplicate processing unless your handler is idempotent.
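AWS’s documented rule of thumb is to set the queue’s visibility timeout to at least six times the function timeout, leaving room for retries before messages become visible again. A small sanity-check helper (a sketch, not an official API):

```python
def check_timeouts(function_timeout, visibility_timeout, safety_factor=6):
    """Return (ok, message) for the SQS/Lambda timeout rule of thumb:
    visibility_timeout >= safety_factor * function_timeout (both in seconds).
    The factor of 6 is AWS's documented recommendation for SQS event sources."""
    required = safety_factor * function_timeout
    if visibility_timeout >= required:
        return True, "ok"
    return False, (
        f"visibility timeout {visibility_timeout}s is below the recommended "
        f"{required}s for a {function_timeout}s function timeout; expect "
        f"redelivery and duplicate processing under load"
    )
```

Running a check like this in your deployment pipeline catches the mismatch before it shows up as mysterious duplicates in production.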
The next step is often diving into how to make your Lambda functions idempotent to safely handle retries and duplicate messages.