Distributed caching is fundamentally about trading consistency for availability and performance, but the devil is in the details of how you manage that trade-off.
Let’s see it in action. Imagine a simple web application with a user session cache, configured like this (the comments are annotations for the reader; strict JSON does not allow them):
{
"cache_name": "user_sessions",
"type": "redis",
"host": "redis.example.com",
"port": 6379,
"db": 0,
"ttl_seconds": 3600, // Session expires after 1 hour
"max_items": 100000, // Cache holds up to 100,000 sessions
"eviction_policy": "allkeys-lru" // Evict least recently used items when full
}
When a user logs in, their session data is stored in Redis:
import json
import redis

# StrictRedis is a deprecated alias; redis.Redis is the current name.
r = redis.Redis(host='redis.example.com', port=6379, db=0)

user_id = "user123"
session_data = {"username": "alice", "last_login": "2023-10-27T10:00:00Z"}

# Store the session with a 1-hour TTL, matching ttl_seconds above.
r.setex(f"session:{user_id}", 3600, json.dumps(session_data))
When the user makes a request, the application checks the cache first:
cached_session = r.get(f"session:{user_id}")
if cached_session:
    session_data = json.loads(cached_session)
    # Process the request using the cached session
else:
    # Cache miss: fetch session data from the primary datastore
    # (e.g., the database), then populate the cache:
    # r.setex(f"session:{user_id}", 3600, json.dumps(session_data))
    # ...then process the request
This setup gives us fast access to user sessions. But what happens when the cache gets full, or when session data in the primary datastore changes? That’s where consistency and eviction policies come into play.
Consistency in distributed caching refers to how well the cache reflects the state of the primary data source. Strong consistency means every read from the cache is guaranteed to be the latest data. This is often too slow for caching. Eventual consistency, more common in caching, means that if no new updates are made, eventually all reads will return the last updated value. For session data, eventual consistency is usually acceptable; a slight delay in seeing a profile update isn’t critical.
Eviction is the process of removing items from the cache when it reaches its capacity. The policy dictates which items are removed. Common policies include:
- LRU (Least Recently Used): Evicts the item that hasn’t been accessed for the longest time. This is great for data that follows a temporal locality pattern.
- LFU (Least Frequently Used): Evicts the item that has been accessed the fewest times. Good for data where access frequency is a better predictor of future use than recency.
- Random: Evicts a random item. Simple, but can be inefficient.
- TTL (Time To Live): Items are automatically removed after a set duration. This is a form of eviction, but based on time rather than usage.
The allkeys-lru policy in the example configuration means that when Redis hits its memory limit, it considers all keys as eviction candidates (as opposed to volatile-lru, which only considers keys that have a TTL set) and evicts the least recently used ones.
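To build intuition for how LRU eviction behaves, here is a minimal in-memory sketch of the policy itself (not Redis) using Python’s `collections.OrderedDict`; the class name `LRUCache` and the capacity of 2 are illustrative:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._data = OrderedDict()  # insertion order tracks recency

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(max_items=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # touching "a" makes "b" the least recently used
cache.set("c", 3)      # over capacity, so "b" is evicted
print(cache.get("b"))  # None: "b" was evicted
print(cache.get("a"))  # 1: still cached
```

Note that a read, not just a write, refreshes an item’s position; that is exactly the “temporal locality” assumption LRU relies on.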
Managing Consistency:
- Cache-Aside (Lazy Loading): The application checks the cache first. If data isn’t there, it fetches from the datastore and then populates the cache. This is what the example shows.
- Pros: Simple, cache only stores actively used data.
- Cons: Initial read is slower (cache miss), data can be stale if the datastore is updated without invalidating the cache.
- Write-Through: Writes go to the cache and the datastore simultaneously.
- Pros: Cache is always consistent with the datastore.
- Cons: Writes are slower, cache can fill with data that’s never read.
- Write-Behind (Write-Back): Writes go only to the cache. The cache asynchronously writes updates to the datastore.
- Pros: Very fast writes.
- Cons: High risk of data loss if the cache fails before writing to the datastore, complex to implement.
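The difference between cache-aside reads and write-through writes can be sketched with plain dictionaries standing in for the cache and the datastore; all names here are illustrative, and in production the two stores would be Redis and a database:

```python
# In-memory stand-ins for the cache and the primary datastore.
cache = {}
datastore = {}

def write_through(key, value):
    """Write-through: update the datastore and the cache together."""
    datastore[key] = value
    cache[key] = value

def read_cache_aside(key):
    """Cache-aside: check the cache first, fall back to the datastore."""
    if key in cache:
        return cache[key]
    value = datastore.get(key)  # cache miss: go to the datastore
    if value is not None:
        cache[key] = value      # populate the cache for next time
    return value

write_through("session:user123", {"username": "alice"})
print(read_cache_aside("session:user123"))  # served from the cache
```

Write-behind would instead update only `cache` and flush to `datastore` asynchronously, which is why it risks losing writes if the cache node dies first.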
For session data, Cache-Aside is usually sufficient. But if a user’s role changes while a cached session still reflects the old role, the application needs a strategy for handling the stale entry.
Handling Stale Data:
The biggest consistency challenge is when the primary datastore is updated directly or by another process, bypassing the cache update mechanism.
- Cache Invalidation: When data in the primary datastore changes, explicitly remove or update the corresponding entry in the cache. This is often done by sending a message to a pub/sub channel that cache clients listen to, or by having the datastore trigger an update. For example, if a user’s profile is updated in the database, a message like {"type": "user_updated", "user_id": "user123"} could be published, and cache clients would listen for it and delete session:user123.
- Short TTLs: Set a relatively short Time-To-Live (TTL) on cache entries. This ensures that stale data will eventually expire and be refreshed from the datastore on the next read. For session data, a TTL of 1 hour is common, but for frequently changing data it might be minutes or even seconds.
Choosing Eviction Policies:
The allkeys-lru policy is a good default. However, consider your access patterns. If certain items are accessed frequently overall but not necessarily recently (e.g., product catalog categories), LFU might be better. If you have a fixed, known set of data that should always be in cache and isn’t frequently updated, you might even consider a no-eviction policy (noeviction in Redis), though this makes writes fail with errors once the memory limit is reached.
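For contrast with LRU, here is a minimal LFU sketch that tracks an access count per key and evicts the least-counted one (ties broken arbitrarily). This is a simplification; production LFU implementations, including Redis’s, age counts over time so old popularity decays:

```python
class LFUCache:
    """Minimal LFU: evicts the key with the fewest accesses when full."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._data = {}
        self._counts = {}  # key -> access count

    def get(self, key):
        if key not in self._data:
            return None
        self._counts[key] += 1
        return self._data[key]

    def set(self, key, value):
        if key not in self._data and len(self._data) >= self.max_items:
            victim = min(self._counts, key=self._counts.get)
            del self._data[victim]
            del self._counts[victim]
        self._data[key] = value
        self._counts[key] = self._counts.get(key, 0) + 1

cache = LFUCache(max_items=2)
cache.set("hot", 1)
cache.set("cold", 2)
cache.get("hot")
cache.get("hot")          # "hot" now has the highest access count
cache.set("new", 3)       # evicts "cold", the least frequently used
print(cache.get("cold"))  # None
print(cache.get("hot"))   # 1
```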
When you configure max_items (or maxmemory in Redis), you’re setting a hard limit; the eviction policy is the mechanism for respecting it. For Redis specifically, maxmemory is the primary configuration setting, and maxmemory-policy defines how eviction happens once that limit is reached — allkeys-lru from the example configuration is one of its possible values.
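In Redis itself, that pair of settings lives in redis.conf (or can be applied at runtime with CONFIG SET); the 256mb limit here is just an example value:

```
maxmemory 256mb
maxmemory-policy allkeys-lru
```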
The real complexity arises when you have multiple cache nodes and need to ensure that writes to one node are eventually reflected, or at least acknowledged, by others, especially if you’re using sharding or replication. This is where concepts like read-your-writes consistency or distributed locks might become necessary, but that’s a deeper dive into distributed systems coordination.
The next hurdle you’ll face is understanding how to scale your cache horizontally and manage the data distribution across multiple cache nodes.