tiny url generator is one of the most common system design challenges. it involves reasoning about the various constraints of an application: scalability, usage, and availability.
let us talk about the general functional requirements of a tiny url generator application:
- user should be able to generate a short url.
- user should be able to redirect to the long url when the associated short url is provided.
- user should be able to set an expiration date so that the short url becomes invalid afterwards.
- user should be able to set a custom alias for the short url, if it is available.
apart from these, other general functionalities include:
- user management. the user can look at all the short urls created in the account, a history you may say.
- analytics. the user can see click counts, geographical data and other analytical data w.r.t. their own short urls.
let us see the qualities the user expects in this system:
- the first and foremost requirement includes the uniqueness of the short urls. no two long urls should be mapped to the same short url.
- if you think about it, these systems need to be available almost all the time. a user may not create short urls often, but the created short urls will be accessed frequently. this demands high read scalability. another way to put it: the read-write ratio is heavily skewed towards reads.
- as the short urls are accessed frequently, the redirection latency should be extremely low, ideally < 100 ms.
apart from these generic requirements, we can estimate the number of short urls that will be generated over the lifetime of the application (or any arbitrary period) and also decide upon the user base and traffic. these metrics also help us decide various strategies w.r.t. caching, database, etc.
let us assume we have to support 1 billion short urls and 100 million daily active users. (a general estimate, taken from the hellointerview article; you can pick any numbers)
consider a general client-server architecture. we need two main kinds of API endpoints:
- shorten the url: a POST request. the user posts the long url (the one they want shortened), an expiration date (optional) and an alias (optional). the endpoint returns the shortened version of the given long url.
- redirect to the original url: a GET request. when the user requests a shortened url, they should be redirected to the corresponding original/long url.
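as a rough sketch, the two endpoints can be modeled as a small in-memory class. the names here (UrlShortener, shorten, redirect) are illustrative, not a real framework api, and the code generation is a placeholder; actual generation strategies are discussed later.

```python
import hashlib
import time

class UrlShortener:
    def __init__(self):
        self.short_to_long = {}  # code -> (long_url, expiration or None)

    def shorten(self, long_url, expiration=None, alias=None):
        """POST endpoint: returns the short code for a long url."""
        # placeholder generation: md5 prefix, or the user-supplied alias
        code = alias or hashlib.md5(long_url.encode()).hexdigest()[:6]
        existing = self.short_to_long.get(code)
        if existing is not None and existing[0] != long_url:
            raise ValueError("code/alias already taken")
        self.short_to_long[code] = (long_url, expiration)
        return code

    def redirect(self, code):
        """GET endpoint: returns the long url, or None if unknown/expired."""
        entry = self.short_to_long.get(code)
        if entry is None:
            return None
        long_url, expiration = entry
        if expiration is not None and time.time() > expiration:
            return None
        return long_url
```

in a real deployment these would sit behind HTTP routes and a persistent store; the dict just makes the contract concrete.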
the server takes the url and handles the shortening. but the long url to short url relation should be one-to-one, because we have to make sure that no two long urls end up with the same short url.
for redirection, we have two primary http status codes that we can set.
- 301 (permanent redirect): this means "moved permanently". browsers cache this response, hence subsequent requests bypass the server and the browser handles them directly, reducing the load on the server.
- 302 (temporary redirect): this means "found". browsers do not cache this response, hence every request goes through the server.
generally, 301 is preferred when we want to decrease the load on the server, but it would break some requirements of our tiny url generator system. 302 is often preferred because:
- every request goes through the server, giving the server the choice whether to return the original url or return an error. this allows us to incorporate the previously discussed expiration date.
- the server will have a choice to delete or update the short url (if needed).
- we can also maintain analytics as every request is handled by the server.
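to make the 302 path concrete, here is a hedged sketch of a redirect handler that checks expiry and records a click before answering. the store and clicks dicts stand in for the database and the analytics pipeline.

```python
import time

def handle_redirect(store, clicks, code, now=None):
    """Return (http_status, location) for a redirect request.

    store:  dict of code -> (long_url, expires_at or None), stand-in for the db
    clicks: dict of code -> count, stand-in for the analytics pipeline
    """
    now = now if now is not None else time.time()
    entry = store.get(code)
    if entry is None:
        return 404, None                    # unknown short url
    long_url, expires_at = entry
    if expires_at is not None and now > expires_at:
        return 410, None                    # expired: 410 Gone
    clicks[code] = clicks.get(code, 0) + 1  # analytics hook, runs every hit
    return 302, long_url                    # temporary redirect, not cached
```

with a 301 the expiry check and the click counter would only run on the first visit per browser, which is exactly why 302 is chosen here.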
now comes the part of figuring out how to generate a unique short url for a given long url. we have several methods.
the first metric to decide is the length of the short code. (the length of this part: https://shorturl/{this_part})
presuming we need 1 billion unique short urls:
- the code is an alphanumeric string, so each position can hold one of 62 characters (26 lowercase + 26 uppercase + 10 digits).
- with 62 choices per position, length 5 gives around ~916 million combinations and length 6 gives around ~56.8 billion. as length 5 won't meet our scaling needs, we should go with length 6.
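as a sanity check, the keyspace math can be verified in a few lines:

```python
# 62 alphanumeric characters per position: 26 lowercase + 26 uppercase + 10 digits
def keyspace(length: int) -> int:
    return 62 ** length

print(keyspace(5))  # 916132832      -> under 1 billion, too small
print(keyspace(6))  # 56800235584    -> comfortably over 1 billion
```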
hash functions:
- we can use hash functions in order to generate unique urls.
- there are several hash functions we can choose, like CRC32, whose output is 8 characters when represented in hexadecimal. we can also choose other hash functions that generate longer strings and take a prefix. but this has its own challenges:
- this method doesn't provide 100% uniqueness: the prefixes of two different hashes can be equal.
- so we usually employ a recursive process: check whether the generated code already exists in the database; if it does, append the long url to the input, hash again, and repeat.
- once it is unique, we assign it as the short url. there are several other ways too, but this is the gist of it.
- to check the existence of the generated hash quickly, we can employ bloom filters.
- one more way to make sure the length is as intended: base62-encode the generated hash and then take the first N characters.
- also, two users generating a short url for the same long url end up with the same short url, which is an advantage.
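the rehash-on-collision loop described above can be sketched like this. the exists callable stands in for the database or bloom-filter membership check, and sha256 is just one possible choice of hash function:

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    # encode a non-negative integer in base62
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def short_code(long_url, exists, length=6):
    """Hash the url, base62-encode, take a prefix; on collision,
    append the long url to the input and re-hash until unique."""
    candidate = long_url
    while True:
        digest = hashlib.sha256(candidate.encode()).hexdigest()
        code = base62(int(digest, 16))[:length]
        if not exists(code):
            return code
        candidate = candidate + long_url  # re-hash input on collision
```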
the process above becomes heavier as we keep generating. with the increased number of short urls, the probability of collisions grows because we are limited to a fixed length. this pushes the system to increase the length, which defeats the purpose of a short url; remaining at the short length introduces another layer of look-ups.
counter and base62 encoding:
- there are several unique-number-generation systems we can use to generate a unique ID for a url.
- we can then base62-encode the ID to generate our short url. this process gives higher throughput than using hash functions.
the two concerns with this method are:
- security, because subsequent short urls are predictable: it is just one big global counter.
- IDs up to ~56.8 billion (62^6) encode to strings of at most 6 characters; up to ~3.5 trillion (62^7), 7 characters. we have to mention that the length won't stay the same: it keeps growing with the number of short urls generated. anyway, this is pretty scalable.
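a minimal base62 encoder/decoder, assuming the id generator hands out plain non-negative integers (e.g. from a global counter):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode62(n: int) -> str:
    # repeated division by 62; digits come out least-significant first
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode62(s: str) -> int:
    # inverse: fold each character back into the integer
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

note how the encoded length tracks the counter value: the billionth id is still only 6 characters.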
moving on: the system at hand is a read-heavy system. the redirections need to be extremely fast.
using database indexing:
- modern relational databases directly provide indexing over the primary key.
- there are several index structures, such as B-trees, hashing, tries, etc.
- these help to look up the position of a desired value, pointing directly at what we want.
- B-trees give O(log n) look-ups while hashing gives O(1). tries are usually used over strings, so we could employ them here too, but hashing provides constant-time look-ups, which is better in this case.
caching:
- we can incorporate caching in the system (redis, memcached or any other) in order to speed up the look-ups.
- a cache trades extra memory for faster look-ups. it is extremely helpful in the case of frequent queries.
- platforms like redis provide built-in approaches for almost any requirement. we can also maintain the expiry date and other metadata in order to invalidate cache entries.
- although, cache invalidation is a bit complex, and we also need to address things such as cache eviction policies, etc.
- in a distributed environment this becomes even more complex.
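as an illustration (not a replacement for redis or memcached), a toy LRU cache with per-entry expiry might look like:

```python
import time
from collections import OrderedDict

class TtlLruCache:
    """Toy cache: least-recently-used eviction plus per-entry TTL."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = OrderedDict()  # code -> (long_url, expires_at or None)

    def put(self, code, long_url, ttl=None):
        expires = time.time() + ttl if ttl is not None else None
        self.data[code] = (long_url, expires)
        self.data.move_to_end(code)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least-recently used

    def get(self, code):
        entry = self.data.get(code)
        if entry is None:
            return None
        long_url, expires = entry
        if expires is not None and time.time() > expires:
            del self.data[code]  # lazily invalidate expired entry
            return None
        self.data.move_to_end(code)  # mark as recently used
        return long_url
```

redis gives you all of this (and the hard distributed parts) for free via EXPIRE and its eviction policies; the sketch just shows what the cache is doing conceptually.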
lookup tables:
- a cache helps us fetch information. but what if we had a way to check whether a key exists at all?
- if the key doesn't exist, both the cache and database queries are wasted. (cache queries aren't that expensive but, anyway.)
- so, we can use something called bloom filters in order to lookup very efficiently whether a key is available in the database or not.
- the major issue with data structures like bloom filters is false positives. and for a fixed filter size, the more data we add, the more frequent the false positives become.
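a minimal bloom filter sketch, deriving k bit positions per key from sha256 over an m-bit array. membership tests can return false positives but never false negatives:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # derive k independent positions by salting the key with an index
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

in this system the filter answers "does this short code exist?" before touching cache or database; a False skips both queries entirely.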
cdn and edge servers:
- for further scaling (when the system gets traction), a single distributed deployment at a limited set of locations won't be enough.
- we need to introduce CDNs (content delivery networks) in order to serve quicker responses from servers maintained near the user's physical location.
- this gets extremely fast: CDNs are like separate caching servers local to a particular region. they store the most frequent queries and help give faster responses.
- this dramatically reduces the load on the origin server.
- down the line, we can also process data at locations near the end-user by maintaining what are called edge servers.
- platforms like cloudflare and aws lambda@edge provide sophisticated functionalities in order to make all of this much easier.
- but all of this boils down to the complexity of the system, and the cost relative to the performance and traction of the entire application.
apart from this, the most important part of the system, the unique number generator, has its own challenges. uniqueness should be maintained across all the nodes in the distributed environment.
now that we are introducing data centers and edge servers, things need to be consistent in terms of the unique-number-generation systems as well. mostly, we do not need time-ordering in this system. of course we have to maintain a timestamp in order to handle expiration, but that can be stored separately (just a bit more space).
we can use global counters via redis or any other system (auto-increment), or twitter snowflake, so that we don't need to maintain a separate timestamp; snowflake IDs can also be directly base62-encoded. for larger-than-life systems, we can use uuid v4, which is 128 bits in size, 32 characters long, and would need to be trimmed down a lot. but the distributed nature of these schemes really gives an advantage in this application.
there is always a predictability trade-off to these things. if the system wants to be more debuggable, it can go for sequential ids, which are predictable. but if the system wants to avoid security concerns, it can go for opaque ids (or random, you could say).
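the snowflake layout mentioned above (41 bits of milliseconds since a custom epoch, 10 bits of machine id, 12 bits of per-millisecond sequence) can be sketched as follows; the epoch value here is arbitrary:

```python
import time
import threading

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (sep 2020)

class SnowflakeGenerator:
    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:
                    # exhausted this millisecond; spin into the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.sequence = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.sequence
```

because the machine id is baked into the id, every node can generate ids independently with no coordination, yet uniqueness holds across the cluster.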
one more concept to mention here is counter batching. this is useful in the case where we use global counters.
- as the application gets larger, although the latency of fetching the counter from redis-like systems is low (network calls are really fast), it may get expensive over time if we have millions of requests every minute.
- when this happens, a server can register a block of IDs for itself before-hand, say 1000 per call. the global counter increments by 1000 instead of 1, indicating that it has leased those 1000 IDs to that server.
- now the server need not make any calls for the next 1000 write requests.
- the only concern with this is that the IDs are not time-ordered anymore.
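counter batching can be sketched like this, with a plain object standing in for the redis INCRBY call:

```python
class GlobalCounter:
    """Stand-in for a redis counter; incr_by plays the role of INCRBY."""

    def __init__(self):
        self.value = 0

    def incr_by(self, n):
        self.value += n
        return self.value

class BatchedIdAllocator:
    """App-server-side allocator that leases blocks of ids at a time."""

    def __init__(self, counter, batch_size=1000):
        self.counter = counter
        self.batch_size = batch_size
        self.next_id = 0
        self.limit = 0  # exclusive upper bound of the leased block

    def allocate(self):
        if self.next_id >= self.limit:
            # one "network" round-trip leases batch_size fresh ids
            self.limit = self.counter.incr_by(self.batch_size)
            self.next_id = self.limit - self.batch_size
        out = self.next_id
        self.next_id += 1
        return out
```

two servers allocating concurrently get disjoint blocks, so uniqueness is preserved, but ids from different servers interleave out of time order, which is exactly the trade-off noted above.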
references:
- https://www.hellointerview.com/learn/system-design/problem-breakdowns/bitly
- https://bytebytego.com/courses/system-design-interview/design-a-url-shortener
- https://www.youtube.com/watch?v=xFeWVugaouk