The most surprising thing about system design interviews is that the interviewer isn’t trying to trick you; they’re trying to understand your thought process, and that means you need to guide the conversation.
Let’s build a web service that shortens URLs, like bit.ly. Imagine a user enters a long URL, and our service returns a short, unique one. When someone accesses the short URL, we redirect them to the original long URL.
Here’s a simplified version of how it might work:
{
"short_url": "http://short.url/aBcDe",
"original_url": "https://www.verylongurlthatnobodywilleverremember.com/with/lots/of/parameters?and=stuff"
}
When the short_url is requested, the server looks up aBcDe in its database, finds the original_url, and returns an HTTP 301 (Moved Permanently) redirect to the user’s browser.
The core problem we’re solving is mapping a large space of long URLs to a much smaller space of short, unique identifiers. We need to:
- Generate a unique short ID for each new long URL.
- Store the mapping between the short ID and the long URL.
- Efficiently retrieve the long URL given a short ID.
Let’s think about the ID generation. A common approach is to use a base-62 encoding (0-9, a-z, A-Z). If we use a 6-character ID, we have $62^6$ possible IDs, which is over 56 billion. That’s a lot, but will it last? If we get 1 billion URLs per day, we’ll exhaust this in about 56 days. We probably need more characters or a different strategy.
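A quick sanity check on that back-of-envelope arithmetic:

```python
# Capacity of a base-62 ID space, per the estimate above.
ALPHABET_SIZE = 62  # digits 0-9, letters a-z and A-Z

ids_6 = ALPHABET_SIZE ** 6                    # 56_800_235_584, i.e. ~56.8 billion IDs
days_at_1b_per_day = ids_6 / 1_000_000_000    # ~56.8 days at 1 billion URLs/day

# Each extra character multiplies capacity by 62:
ids_7 = ALPHABET_SIZE ** 7                    # ~3.5 trillion IDs
years_at_1b_per_day = ids_7 / 1_000_000_000 / 365   # ~9.6 years at the same rate
```

So a seventh character buys roughly a decade at that (very aggressive) write rate.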
A more robust way to generate unique IDs is to use a distributed counter or a sequence generator. Imagine a service that hands out incrementing numbers; we then encode each number in base-62 to produce the short ID. For example, the counter value 1000000000 encodes to a six-character base-62 string (since $62^5 < 10^9 < 62^6$). This guarantees uniqueness and gives us a predictable way to scale.
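The counter-to-string step can be sketched as follows. The alphabet order (digits, then lowercase, then uppercase) is an assumption; any fixed 62-character alphabet works as long as encoding and decoding agree on it.

```python
import string

# Illustrative base-62 alphabet: 0-9, a-z, A-Z (62 characters total).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Encode a non-negative counter value as a base-62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode_base62(s: str) -> int:
    """Inverse of encode_base62: recover the counter value."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

With this scheme, encode_base62(1_000_000_000) yields a six-character ID, and the mapping is reversible, so no separate lookup is needed to validate an ID’s shape.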
The database needs to store the short_id -> long_url mapping. A NoSQL key-value store like Cassandra or DynamoDB is a good fit here: the short_id is the primary key, which already gives us fast indexed lookups, and the redirect path needs no secondary indexes.
Read vs. Write: Most URL shorteners see far more reads (users clicking links) than writes (users creating links). This means our database design should prioritize read performance. Caching is crucial. We can use Redis or Memcached to store frequently accessed short_id -> long_url mappings. When a request comes in for a short URL, we first check the cache. If it’s there, we return it immediately. If not, we query the database, populate the cache, and then return the URL.
Scaling: To handle millions of requests per second, we’ll need multiple application servers behind a load balancer. The distributed counter service for ID generation also needs to be highly available and scalable. If that service goes down, we can’t create new short URLs.
API Design:
- POST /shorten with {"url": "long_url"} in the request body. Returns {"short_url": "http://short.url/aBcDe"}.
- GET /{short_id} (e.g. GET /aBcDe), which performs the redirect.
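A framework-free sketch of those two endpoints, with plain functions standing in for route handlers. All names here (BASE, store, shorten, redirect) are illustrative, and the hex ID is a placeholder for the base-62 scheme; a real service would sit behind an actual web framework.

```python
# Minimal endpoint sketch: handlers return data instead of HTTP responses.
BASE = "http://short.url/"
store: dict[str, str] = {}
_counter = iter(range(1_000_000_000, 10**18))  # stand-in for the ID service

def shorten(body: dict) -> dict:
    """POST /shorten -- body: {"url": "<long_url>"}."""
    short_id = format(next(_counter), "x")  # hex for brevity; production would use base-62
    store[short_id] = body["url"]
    return {"short_url": BASE + short_id}

def redirect(short_id: str) -> tuple[int, dict]:
    """GET /{short_id} -- returns (status, headers) for the redirect."""
    long_url = store.get(short_id)
    if long_url is None:
        return 404, {}
    return 302, {"Location": long_url}  # 302 keeps the mapping updatable
```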
The system that generates the short IDs needs to be a separate, highly available service. If your main web service had to generate IDs itself, a single instance would become a bottleneck. By making it a dedicated service, you can scale it independently and ensure it’s fault-tolerant. Imagine it’s just a simple RPC service that returns the next available integer ID.
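One common refinement, sketched below, is block allocation: each app server leases a whole range of IDs at once, so it only contacts the central service when its block runs out, and the central service stops being a per-request bottleneck. IdService and IdClient are illustrative names, not a real library.

```python
import threading

class IdService:
    """Central authority handing out disjoint ID blocks.
    In production this would be replicated and persisted, not in-memory."""
    def __init__(self, block_size: int = 1000):
        self._next = 0
        self._block_size = block_size
        self._lock = threading.Lock()

    def lease_block(self) -> range:
        with self._lock:
            start = self._next
            self._next += self._block_size
            return range(start, start + self._block_size)

class IdClient:
    """Runs inside each app server; refills from the service when exhausted."""
    def __init__(self, service: IdService):
        self._service = service
        self._block = iter(())  # empty: forces a lease on first use

    def next_id(self) -> int:
        try:
            return next(self._block)
        except StopIteration:
            self._block = iter(self._service.lease_block())
            return next(self._block)
```

The trade-off: if an app server crashes, the unused remainder of its block is wasted, but IDs stay globally unique, and that’s usually an acceptable price.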
When designing the storage layer, consider that the long_url can be quite long (URLs can run to several kilobytes), while the lookup key (short_id) is always small. Across billions of entries, the value data dwarfs the keys, and this small-key/large-value asymmetry is a good fit for many key-value database technologies.
The most commonly overlooked aspect of URL shortening is the redirection strategy. While an HTTP 301 (Moved Permanently) seems intuitive, it tells browsers and search engines to cache the redirect. If you ever need to change the destination of a short URL, a 301 makes that impossible without also changing the short URL itself. A 302 (Found) is often the better choice for flexibility: clients keep re-requesting the short URL, so you can update the target on the fly without invalidating the short link.
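The two choices side by side, as a tiny illustrative helper (make_redirect is not a framework API):

```python
def make_redirect(long_url: str, permanent: bool) -> tuple[int, dict]:
    # 301 Moved Permanently: browsers and crawlers cache it aggressively,
    # so a later change to long_url won't reach clients holding the cached copy.
    # 302 Found: clients re-request the short URL on each click, so the
    # mapping stays updatable (at the cost of one extra hit per click).
    status = 301 if permanent else 302
    return status, {"Location": long_url}
```

That extra hit per click is also what makes click analytics possible: with a cached 301, repeat visits never reach your servers.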
The next challenge you’ll likely face is handling analytics: counting clicks, tracking referrer data, and providing usage statistics for each short URL.