Tiered storage is a way to manage data based on how often it’s accessed, saving money without sacrificing performance where it matters. The surprising truth is that the "warm" tier often ends up carrying the highest total bill: it is neither the fastest nor the cheapest per gigabyte, yet data accumulates there while still racking up access charges.
Let’s see how it works with an example. Imagine a popular e-commerce platform. New product listings and recent customer orders are "hot" data, accessed constantly.
{
  "timestamp": "2023-10-27T10:00:05Z",
  "user_id": "user123",
  "product_id": "prod456",
  "action": "view",
  "location": "us-east-1"
}
This data might live on NVMe SSDs in a high-performance object store, offering sub-millisecond latency. When a customer browses, the system immediately retrieves this information.
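A minimal sketch of that read path, with an in-memory map standing in for the hot store and a second lookup standing in for the warm tier (the store names and contents here are illustrative, not from a real system):

```python
# Illustrative stand-ins: a dict for the hot (NVMe-backed) store and a
# second dict for the warm tier. A real system would call an object
# store client here; these names are hypothetical.
HOT_STORE = {"user123:prod456": {"action": "view", "location": "us-east-1"}}
WARM_STORE = {"user789:prod789": {"action": "purchase", "price": 75.50}}

def fetch_event(key):
    """Return (record, tier), preferring the hot tier."""
    record = HOT_STORE.get(key)
    if record is not None:
        return record, "hot"
    # Fall back to the warm tier (tens of milliseconds in practice).
    record = WARM_STORE.get(key)
    if record is not None:
        return record, "warm"
    return None, None
```

The point of the split is that the common case (a browsing customer hitting recent data) never pays the warm tier's latency.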
After a few weeks, that specific product view might not be needed for immediate retrieval, but it’s still useful for trend analysis or personalized recommendations. This becomes "warm" data.
{
  "timestamp": "2023-10-27T09:30:15Z",
  "user_id": "user789",
  "product_id": "prod789",
  "action": "purchase",
  "price": 75.50,
  "location": "eu-west-2"
}
This warm data might reside on standard SSDs or even high-capacity HDDs, perhaps in a data lake. Access might take tens of milliseconds, but it’s still readily available for analytics queries that run daily or hourly. Here’s a typical configuration for a warm tier object store:
storage_class: STANDARD_IA
lifecycle_rules:
  - id: move_to_cold_after_30_days
    status: Enabled
    filter:
      prefix: logs/
    transition:
      days: 30
      storage_class: GLACIER
Eventually, this data is rarely queried. Historical sales data from years ago, for instance, is only needed for annual compliance audits or deep historical analysis. This is "cold" data.
{
  "timestamp": "2022-01-15T14:00:00Z",
  "user_id": "user001",
  "product_id": "prod001",
  "action": "purchase",
  "price": 120.00,
  "location": "ap-southeast-1"
}
Cold data is typically stored on cheaper, high-density magnetic disks or cloud-based archive storage. Retrieving it can take minutes to hours. A configuration for this might look like:
storage_class: GLACIER
lifecycle_rules:
  - id: archive_then_expire
    status: Enabled
    filter:
      prefix: archive/
    transition:
      days: 90              # Move to deep archive after 90 days
      storage_class: DEEP_ARCHIVE
    expiration:
      days: 3650            # Keep for 10 years, then delete
The "archive" tier is for data that is almost never accessed but must be retained indefinitely for legal or regulatory reasons. Retrieval from archive storage is the slowest and most expensive, often involving human intervention or batch processes that take hours or even days.
The system works by defining policies that automatically move data between these tiers based on age, access patterns, or other criteria. For instance, a policy might state: "Move all data in the 'hot' bucket that hasn’t been accessed in 7 days to the 'warm' bucket. Move all data in the 'warm' bucket that hasn’t been accessed in 90 days to the 'cold' bucket."
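That example policy can be sketched as a single decision function, using the article's own thresholds (7 idle days for hot, 90 for warm); these are example values, not universal defaults:

```python
def target_tier(current_tier, days_since_access):
    """Apply the example policy from the text: hot -> warm after 7 idle
    days, warm -> cold after 90 idle days. Thresholds are the article's
    illustrative values."""
    if current_tier == "hot" and days_since_access > 7:
        return "warm"
    if current_tier == "warm" and days_since_access > 90:
        return "cold"
    return current_tier
```

A background job would run this over object metadata on a schedule and issue the actual moves; in cloud object stores, the lifecycle rules shown earlier do this for you server-side.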
The key levers you control are the storage classes themselves (e.g., S3 Standard, S3 Standard-IA, S3 Glacier), the transition rules (how long data stays in a tier before moving), and expiration rules (when data is deleted). Understanding the access patterns of your data is paramount; if "warm" data is actually accessed frequently, it will incur higher access costs or require more frequent rehydration from a colder tier, negating savings.
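The "warm data that is actually hot" trap can be made concrete with a toy cost model. The per-GB prices below are hypothetical, loosely shaped like a standard vs. infrequent-access class; check your provider's current price sheet:

```python
# Hypothetical per-GB monthly prices: storage charge plus a per-GB
# retrieval fee. Real pricing varies by provider and region.
PRICES = {
    "standard":    {"storage": 0.023,  "retrieval": 0.00},
    "standard_ia": {"storage": 0.0125, "retrieval": 0.01},
}

def monthly_cost_per_gb(tier, reads_per_gb_per_month):
    """Storage cost plus retrieval fees for how often each GB is read."""
    p = PRICES[tier]
    return p["storage"] + p["retrieval"] * reads_per_gb_per_month

# Rarely read: the infrequent-access class wins.
# Read a few times a month: it costs more than standard storage,
# because every read pays the retrieval fee.
```

At these rates the break-even is roughly one read per GB per month; beyond that, "saving money" with the cheaper storage class actually loses it.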
The most impactful optimization most people miss is understanding the retrieval costs and time associated with moving data out of colder tiers. While Glacier storage itself is incredibly cheap per GB, retrieving a terabyte might cost hundreds of dollars and take 12-48 hours, making it unsuitable for anything but the most infrequent, planned access.
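A back-of-envelope calculation shows why. The fee below is a hypothetical placeholder; archive retrieval pricing varies widely by provider and by retrieval speed (expedited vs. bulk):

```python
# Hypothetical retrieval fee in $/GB; real archive pricing differs by
# provider and retrieval tier. The 12-48 hour window is from the text.
RETRIEVAL_FEE_PER_GB = 0.30
RESTORE_HOURS = (12, 48)

def restore_cost_usd(gigabytes, fee_per_gb=RETRIEVAL_FEE_PER_GB):
    """Estimated cost to pull data back out of the archive tier."""
    return gigabytes * fee_per_gb

# At this illustrative rate, restoring 1 TB (1024 GB) runs around $300,
# before any egress charges for moving it out of the cloud entirely.
```

Pennies per GB-month to store, but hundreds of dollars per restore: the arithmetic only works if restores are rare and planned.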
The next frontier is often figuring out how to analyze data that spans multiple tiers without exorbitant egress charges.