Cloud storage services are often treated as interchangeable, but their underlying architectures and pricing models create wildly different performance and cost profiles for common workloads.
Let’s see what happens when you try to store and retrieve a million tiny files, each only a few kilobytes, from each of these services.
import boto3
import google.cloud.storage
import azure.storage.blob
import time
# --- Configuration ---
AWS_BUCKET_NAME = "your-s3-bucket-name"
GCP_BUCKET_NAME = "your-gcs-bucket-name"
AZURE_CONTAINER_NAME = "your-azure-container-name"
AZURE_ACCOUNT_NAME = "your-azure-account-name"
AZURE_ACCOUNT_KEY = "your-azure-account-key"
NUM_FILES = 1_000_000  # a serial loop over this many requests takes hours; reduce for a dry run
FILE_SIZE_KB = 4
REGION = "us-east-1"  # example for S3; adjust for GCS/Azure if needed
# --- S3 ---
print("--- Testing S3 ---")
s3 = boto3.client("s3", region_name=REGION)
data = b"\0" * (FILE_SIZE_KB * 1024)  # constant payload; build it once, not per iteration
start_time = time.time()
for i in range(NUM_FILES):
    s3.put_object(Bucket=AWS_BUCKET_NAME, Key=f"s3_test/{i}.bin", Body=data)
upload_time_s3 = time.time() - start_time
print(f"S3 Upload Time: {upload_time_s3:.2f} seconds")
start_time = time.time()
for i in range(NUM_FILES):
    s3.get_object(Bucket=AWS_BUCKET_NAME, Key=f"s3_test/{i}.bin")
download_time_s3 = time.time() - start_time
print(f"S3 Download Time: {download_time_s3:.2f} seconds")
# --- GCS ---
print("\n--- Testing GCS ---")
gcs_client = google.cloud.storage.Client()
gcs_bucket = gcs_client.bucket(GCP_BUCKET_NAME)
start_time = time.time()
for i in range(NUM_FILES):
    blob = gcs_bucket.blob(f"gcs_test/{i}.bin")
    blob.upload_from_string(b"\0" * (FILE_SIZE_KB * 1024))
upload_time_gcs = time.time() - start_time
print(f"GCS Upload Time: {upload_time_gcs:.2f} seconds")
start_time = time.time()
for i in range(NUM_FILES):
    blob = gcs_bucket.blob(f"gcs_test/{i}.bin")
    blob.download_as_bytes()  # download_as_string() is deprecated
download_time_gcs = time.time() - start_time
print(f"GCS Download Time: {download_time_gcs:.2f} seconds")
# --- Azure Blob ---
print("\n--- Testing Azure Blob ---")
azure_blob_service = azure.storage.blob.BlobServiceClient(
    account_url=f"https://{AZURE_ACCOUNT_NAME}.blob.core.windows.net",
    credential=AZURE_ACCOUNT_KEY,
)
container_client = azure_blob_service.get_container_client(AZURE_CONTAINER_NAME)
start_time = time.time()
for i in range(NUM_FILES):
    blob_client = container_client.get_blob_client(f"azure_test/{i}.bin")
    blob_client.upload_blob(b"\0" * (FILE_SIZE_KB * 1024), overwrite=True)  # overwrite so reruns don't raise
upload_time_azure = time.time() - start_time
print(f"Azure Upload Time: {upload_time_azure:.2f} seconds")
start_time = time.time()
for i in range(NUM_FILES):
    blob_client = container_client.get_blob_client(f"azure_test/{i}.bin")
    blob_client.download_blob().readall()
download_time_azure = time.time() - start_time
print(f"Azure Download Time: {download_time_azure:.2f} seconds")
print("\n--- Summary (Times in Seconds) ---")
print(f"{'Service':<10} | {'Upload':<15} | {'Download':<15}")
print("-" * 45)
print(f"{'S3':<10} | {upload_time_s3:<15.2f} | {download_time_s3:<15.2f}")
print(f"{'GCS':<10} | {upload_time_gcs:<15.2f} | {download_time_gcs:<15.2f}")
print(f"{'Azure':<10} | {upload_time_azure:<15.2f} | {download_time_azure:<15.2f}")
The surprising truth is that for workloads involving many small objects, the latency and per-request cost of the API calls dominate, not the actual data transfer. S3, GCS, and Azure Blob Storage each handle these API requests differently, leading to significant performance differences.
S3 (Simple Storage Service): AWS’s offering is built around a highly distributed object store. For small objects, S3’s performance is often limited by per-object API call overhead: each put_object or get_object call involves network round trips and request processing on S3’s side, which adds up quickly across a million objects. Since late 2020, S3 has provided strong read-after-write consistency for all operations, including object listing, so a newly created object appears immediately in list_objects results.
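One way to soften that per-call overhead is to issue requests concurrently rather than serially. Below is a minimal sketch using a thread pool with boto3, whose clients are thread-safe; the `upload_batch` helper and the worker count are illustrative choices, not part of the benchmark script above:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_batch(s3_client, bucket, keys, data, max_workers=32):
    """Upload the same payload under many keys concurrently.

    boto3 clients are thread-safe, so one client can be shared across
    workers. Each put_object still pays a full round trip, but up to
    max_workers of those round trips now overlap instead of queueing.
    """
    def _put(key):
        s3_client.put_object(Bucket=bucket, Key=key, Body=data)
        return key

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_put, keys))
```

The overall wall-clock time then approaches (total requests × latency) ÷ workers, until you hit per-prefix request-rate limits on the service side.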
GCS (Google Cloud Storage): GCS is designed around a global namespace with a focus on high throughput and low latency, particularly for large objects. Small objects still incur per-object API call overhead, though GCS benefits from Google’s low-latency network backbone. It offers strong consistency for all operations, including object listing.
Azure Blob Storage: Azure Blob Storage offers a tiered approach, with Hot, Cool, and Archive tiers. For this kind of small-object, frequent-access workload, you’d typically use the Hot tier. Azure’s API design can feel somewhat more verbose, and performance for a massive number of small objects can be affected by per-transaction costs. Like S3 and GCS, it provides strong consistency for reads and listings.
The mental model for these services often revolves around "buckets" or "containers" as the top-level organizational unit, and then "objects" or "blobs" within them. You pay for storage space, data transfer out, and API requests. For large objects, the data transfer and storage cost are paramount. For millions of small objects, the API request cost and latency become the primary drivers of performance and expense.
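To see why request charges dominate for small objects, a quick back-of-the-envelope calculation helps. The prices below are illustrative placeholders in the shape of S3 Standard us-east-1 list prices; substitute your provider’s current rate sheet before drawing conclusions:

```python
# Assumed, illustrative prices -- check your provider's current price sheet.
PUT_PER_1000 = 0.005          # $ per 1,000 write requests
GET_PER_1000 = 0.0004         # $ per 1,000 read requests
STORAGE_PER_GB_MONTH = 0.023  # $ per GB-month

def request_vs_storage_cost(num_objects, object_kb, reads_per_object=1):
    """Split the bill into write requests, read requests, and storage."""
    put_cost = num_objects / 1000 * PUT_PER_1000
    get_cost = num_objects * reads_per_object / 1000 * GET_PER_1000
    storage_gb = num_objects * object_kb / (1024 * 1024)
    storage_cost = storage_gb * STORAGE_PER_GB_MONTH
    return put_cost, get_cost, storage_cost

put_c, get_c, stor_c = request_vs_storage_cost(1_000_000, 4)
# Under these assumed prices: about $5.00 in PUTs versus roughly
# $0.09/month to store all ~3.8 GB of data.
```

The exact figures vary by provider and region, but the shape is the theme of this article: a million 4 KB objects cost far more to write than to keep.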
The key levers you control are the object size, the number of objects, and the access pattern (read/write frequency). If you have many small files, you might consider techniques like archiving them into larger tarballs or using a different type of service altogether, like a managed database or a file system that’s optimized for this.
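The tarball idea can be sketched with nothing but the standard library: pack the small files into a single in-memory archive, then upload that one object with whichever client you prefer. The `pack_small_files` helper is hypothetical, not part of any SDK:

```python
import io
import tarfile

def pack_small_files(files):
    """Pack an iterable of (name, bytes) pairs into one tar archive.

    One archive means one PUT and one GET instead of thousands of
    per-object round trips; the trade-off is that reading a single
    member later requires fetching (or range-reading) the archive.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

A thousand such bundles of a thousand 4 KB files each turns a million requests into a thousand, at the cost of coarser-grained access.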
What most people don’t realize is how drastically the internal partitioning and metadata management strategies of each service affect the performance of operations that require listing or iterating through many objects. Services that have to perform more work on their control plane to satisfy a list request will inherently be slower and more expensive for that specific operation, even if the data transfer itself is cheap.
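Listing is a concrete example: S3’s list_objects_v2 returns at most 1,000 keys per response, so enumerating a million objects means roughly a thousand LIST round trips before any data moves. Here is a sketch of paginated counting; `count_objects` is an illustrative helper that takes an already-constructed boto3 S3 client (e.g. `boto3.client("s3")`):

```python
def count_objects(s3_client, bucket, prefix):
    """Count keys under a prefix, one paginated LIST call per ~1,000 keys."""
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        total += page.get("KeyCount", 0)
    return total
```

Each page is a separate billed request with its own round-trip latency, which is exactly the control-plane work described above.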
The next challenge you’ll likely encounter is dealing with the cost implications of millions of small object API requests.