This is less about rate limiting and more about managing concurrency and resource access across multiple machines.

Imagine you’re running a popular API. You’ve got thousands of users hitting your endpoints, and if just one of them decides to hammer you with a million requests, they could bring your whole service down. Rate limiting is your defense: a mechanism to ensure no single user (or IP, or API key) can overwhelm your system.

Here’s a simple Python implementation of a token bucket rate limiter. It’s a common algorithm because it’s easy to understand and implement.

import time

class TokenBucket:
    def __init__(self, capacity, fill_rate):
        self.capacity = capacity  # Maximum tokens the bucket can hold
        self.fill_rate = fill_rate # Tokens added per second
        self.tokens = capacity    # Current number of tokens
        self.last_refill_time = time.monotonic() # When we last refilled

    def consume(self, tokens_to_consume):
        self.refill()
        if self.tokens >= tokens_to_consume:
            self.tokens -= tokens_to_consume
            return True
        return False

    def refill(self):
        now = time.monotonic()
        time_elapsed = now - self.last_refill_time
        tokens_to_add = time_elapsed * self.fill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill_time = now

# Example usage:
# Allow 10 requests per second, with a burst capacity of 20.
limiter = TokenBucket(capacity=20, fill_rate=10)

for i in range(25):
    if limiter.consume(1):
        print(f"Request {i+1}: Allowed")
    else:
        print(f"Request {i+1}: Denied (Rate Limited)")
    time.sleep(0.01)  # 100 requests/sec: faster than the fill rate, so the
                      # burst capacity drains and later requests are denied

This code shows how tokens are added to a "bucket" over time. When a request comes in, it tries to "consume" a token. If there’s a token available, the request is allowed. If not, it’s denied. The capacity acts as a buffer, allowing for bursts of requests, while the fill_rate dictates the sustained rate.

The real magic happens when you need to distribute this across multiple servers. If each server has its own independent TokenBucket, a single user can spread their requests across all your servers and receive roughly N times the intended limit, where N is the number of servers. This is where distributed state comes in.

To achieve true distributed rate limiting, all your servers need to agree on a shared state for each rate-limited entity (user, IP, etc.). This shared state typically includes:

  • The current number of tokens available for that entity.
  • The last time the tokens were refilled.
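
For concreteness, here’s a hypothetical sketch of what that per-entity state might look like (the key format and field names are illustrative, not prescribed by any library):

```python
# Hypothetical per-entity shared state, e.g. stored as a Redis hash
# under a key like "rate_limit:<user_id>". Redis hash values are
# strings, so numbers are serialized on write and parsed back on read.
state = {
    "tokens": "17.5",                    # current (possibly fractional) token count
    "last_refill_time": "1700000000.0",  # Unix timestamp of the last refill
}

# Parsing back into numbers on the application side:
tokens = float(state["tokens"])
last_refill_time = float(state["last_refill_time"])
```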

A common way to manage this shared state is by using an external, fast, and reliable data store. Redis is a popular choice because it’s an in-memory data structure store that offers high performance and atomic operations.

Here’s a conceptual outline of how you’d use Redis with a token bucket algorithm:

  1. Key Structure: For each user, you’d have a Redis key, e.g., rate_limit:user_id. This key would store a hash or a serialized object containing tokens and last_refill_time.
  2. Atomic Operations: When a request arrives, your application server would use Redis commands like HGETALL to fetch the current state, calculate the new state based on elapsed time and fill_rate, and then use HSET to write it back (HMSET is deprecated in modern Redis). Crucially, you’d wrap these operations in a Redis transaction (MULTI/EXEC with WATCH) or, better, a Lua script to ensure atomicity. This prevents race conditions where two servers read the same old state, compute independent updates, and then overwrite each other, leading to incorrect token counts.
  3. Leaky Bucket Variation: Another popular algorithm is the leaky bucket. Instead of tokens being added, requests are added to a queue. The queue "leaks" requests at a constant rate. If the queue is full, new requests are dropped. This algorithm is good for ensuring a smooth, constant output rate, but it doesn’t accommodate bursts the way the token bucket does.
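
Putting step 2 into practice, here’s a sketch of the Lua-script approach using the redis-py client. The key format, field names, and expiry are assumptions for illustration; what matters is that the script runs atomically inside Redis, so concurrent servers can’t interleave their read-modify-write cycles:

```python
import time

# Runs atomically inside Redis: read state, refill, try to consume, write back.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local fill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local state = redis.call('HMGET', key, 'tokens', 'last_refill_time')
local tokens = tonumber(state[1])
local last_refill = tonumber(state[2])
if tokens == nil then          -- first request for this entity
    tokens = capacity
    last_refill = now
end

tokens = math.min(capacity, tokens + (now - last_refill) * fill_rate)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'last_refill_time', now)
redis.call('EXPIRE', key, 3600)  -- drop idle entities after an hour
return allowed
"""

def consume(client, entity_id, capacity=20, fill_rate=10, cost=1):
    """client is a redis.Redis instance (pip install redis)."""
    script = client.register_script(TOKEN_BUCKET_LUA)
    result = script(keys=[f"rate_limit:{entity_id}"],
                    args=[capacity, fill_rate, time.time(), cost])
    return result == 1
```

One caveat with passing `now` from the application: server clocks can drift, so small inaccuracies are possible. Keeping all timekeeping on the Redis side avoids this, at the cost of tying the limiter more tightly to that instance.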

When designing distributed rate limiting, a counterintuitive insight is that the "state" of a rate limit isn’t really the current token count, but the time elapsed since the last state update and the rate at which refills occur. The actual number of tokens is a derived value, not a primary one. This means that even if a server goes down, when it comes back up it can accurately calculate the current token count from the last recorded refill time and the current system time, provided the shared state in Redis was correctly maintained.
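
To see why the token count is derived rather than primary, here’s a minimal sketch (the function name and the numbers are illustrative):

```python
def current_tokens(stored_tokens, last_refill_time, fill_rate, capacity, now):
    # All we need is the last snapshot plus elapsed time and the refill rate;
    # the "current" count is recomputed on demand.
    return min(capacity, stored_tokens + (now - last_refill_time) * fill_rate)

# The last persisted state said 2.0 tokens at t=100.0. A server that restarts
# and asks again at t=101.5 (fill_rate=10, capacity=20) gets the right answer:
print(current_tokens(2.0, 100.0, fill_rate=10, capacity=20, now=101.5))  # 17.0
```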

The next challenge you’ll face is ensuring your rate limiting logic is resilient to the failure of your distributed state store.

Want structured learning?

Take the full System Design course →