WTH is consistent hashing?!

August 27, 2025

generally, while load balancing across several servers, we hash each client id or request id, and the hash decides which server the request should go to.

if we use a simple modulo-based hashing scheme, every key gets a hash, and if the number of servers is n, the request routes to the server at index hash(key) mod n.

now, if we add or remove servers, or if a server fails (n changes), we need to redistribute the keys so that they route to servers according to the current n. this can put heavy load on the system, because most keys map to a different server under the new n, and their data has to move from the old servers to the new ones.
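to see the problem concretely, here is a minimal sketch of mod-n routing (the server count and key names are made up for illustration); note how many keys change servers when just one server is added.

```python
# a minimal sketch of mod-n routing. server counts and key names are
# illustrative; hashlib.md5 is used only as a stable hash function.
import hashlib

def server_for(key: str, n: int) -> int:
    # stable hash (python's builtin hash() is salted per process), then mod n
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user-{i}" for i in range(1000)]
before = {k: server_for(k, 4) for k in keys}  # 4 servers
after = {k: server_for(k, 5) for k in keys}   # one server added, n = 5

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/1000 keys now route to a different server")
```

a key stays put only when its hash h satisfies h mod 4 == h mod 5, which holds for roughly 1 in 5 hashes, so about 80% of the keys migrate for a single added server.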

also, in the case of cache maintenance, if a server fails then requests route to the wrong servers and there will be many cache misses.


to avoid all of these, we use consistent hashing.

consistent hashing became famous when amazon's dynamo started using it. it is even used in Akamai's cache functionality.


the major point that defines consistent hashing is taking a large and constant hash space that is assumed to handle any future scaling. the usual range is [0, 2^128 - 1] -> basically maintaining all the unsigned 128-bit integers.

so in this huge hash space, both the storage nodes and objects map to their respective hash values.
but, how do we decide which storage node an object should be stored in?

we store an object in the storage node whose hash value is at the immediate right of the object's hash value.

this puts the below into perspective:

  1. hash function is independent of the number of storage nodes.
  2. associations are relative: an object belongs to whichever node sits immediately to its right, not to a fixed slot, so they are not driven by collisions.

for an object whose hash value lands at the extreme right of the hash space, there may be no storage node to its right. in that case, we just wrap around to index 0 and pick the first storage node from the start.
hence, consistent hashing can be regarded as a ring.

you can imagine everything explained below as a circle, where the right neighbor is the next one in the clock-wise direction.
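the clockwise walk above can be sketched like this (the node names and the 2^32 space are illustrative choices, smaller than the 2^128 space from earlier to keep the numbers readable):

```python
# a tiny sketch of the ring walk: scan the sorted node positions left to
# right and pick the first one at or past the object's hash; if none is
# found, wrap around to the first node.
import hashlib

HASH_SPACE = 2**32  # illustrative; the post assumes 2^128

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % HASH_SPACE

node_positions = sorted((h(name), name) for name in ["db1", "db2", "db3"])

def node_for(obj: str) -> str:
    oh = h(obj)
    for pos, name in node_positions:  # walk clockwise along the ring
        if pos >= oh:
            return name
    return node_positions[0][1]       # wrapped past the edge: first node

print(node_for("user-42"))
```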


two cons of the consistent hashing method are:

  1. it requires a lot of memory. hence, it is mainly feasible in larger distributed systems that actually need to scale.
  2. finding the association for an object by iterating to the right every time. in the worst case, this is O(hash space).

one way to avoid the complexity involved in finding associations is to maintain two arrays:
one array to store the storage node instances, and another to maintain the positions of these storage nodes in the hash space, kept sorted.
with the positions sorted, we can binary search for an object's storage node instead of iterating.
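a minimal sketch of that two-array idea, using binary search over the sorted positions (all names here are made up):

```python
# two parallel arrays: one for positions on the ring (kept sorted), one for
# the node instances at those positions. bisect finds the successor node in
# O(log n) instead of a linear scan.
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

pairs = sorted((h(name), name) for name in ["db1", "db2", "db3"])
positions = [p for p, _ in pairs]   # array 1: positions in the hash space
instances = [n for _, n in pairs]   # array 2: the storage node instances

def node_for(obj: str) -> str:
    i = bisect.bisect_left(positions, h(obj))  # first position >= hash(obj)
    return instances[i % len(positions)]       # modulo handles the wrap
    
print(node_for("user-42"))
```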


scaling up, adding new nodes:

  1. when a new node is added, we just insert it at its hash value on the ring.
  2. for data migration, only the objects that fall between the new node's hash value and its immediate left neighbor are affected.
  3. those objects, which previously lived on the node to the right, now move to the new node.
  4. objects to the right of the new node's hash value remain on the node to the right.
  5. all other objects stay unaffected.
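the steps above can be checked with a quick sketch (node and key names are invented): add a node and confirm that every migrated key lands on the new node.

```python
# sketch: build a ring, add db4, and verify that the only keys whose owner
# changed are the ones that moved onto db4 itself.
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

def build(nodes):
    pairs = sorted((h(n), n) for n in nodes)
    return [p for p, _ in pairs], [n for _, n in pairs]

def node_for(key, positions, instances):
    i = bisect.bisect_left(positions, h(key)) % len(positions)
    return instances[i]

keys = [f"obj-{i}" for i in range(1000)]
pos, inst = build(["db1", "db2", "db3"])
before = {k: node_for(k, pos, inst) for k in keys}

pos, inst = build(["db1", "db2", "db3", "db4"])  # scale up: add db4
after = {k: node_for(k, pos, inst) for k in keys}

moved = {k for k in keys if before[k] != after[k]}
assert all(after[k] == "db4" for k in moved)  # migrations only target db4
print(f"{len(moved)}/1000 keys moved, all of them to db4")
```

compare this with the mod-n sketch earlier, where most keys moved; here the only keys that move are the ones in the new node's arc.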

scaling down, removing a node:

  1. when a node is removed, we just move everything it holds into the node to its right.
  2. after that, we can remove the node from the hash space.

but, when scaling down, there is a possibility of overloading a particular node, because all the load from the removed node lands on the node to its right, on top of that node's own load.

the solution for this case is virtual nodes. for example, instead of keeping db1 at a single location like 0, we keep multiple locations associated with it, such as db1-vn1 at 15, db1-vn2 at 20, etc.

so now, if we remove db1, instead of putting all its load onto a single neighbor, we distribute it across multiple neighbors, since we take the right neighbor of each of db1's virtual nodes.
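here is a sketch of virtual nodes (the vnode count, naming scheme, and node names are assumptions for illustration): remove db1 and check which surviving nodes inherit its keys.

```python
# each physical node is hashed at several ring positions ("db1-vn0", ...),
# so when db1 is removed its keys spread over the right neighbors of all
# of its virtual nodes instead of landing on a single node.
import bisect
import hashlib

VNODES = 8  # virtual nodes per physical node; an assumed tuning knob

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

def build(nodes):
    pairs = sorted((h(f"{n}-vn{i}"), n) for n in nodes for i in range(VNODES))
    return [p for p, _ in pairs], [n for _, n in pairs]

def node_for(key, positions, owners):
    return owners[bisect.bisect_left(positions, h(key)) % len(positions)]

keys = [f"obj-{i}" for i in range(1000)]
pos, owners = build(["db1", "db2", "db3", "db4"])
pos2, owners2 = build(["db2", "db3", "db4"])  # db1 removed

# which surviving nodes received db1's keys?
receivers = {node_for(k, pos2, owners2)
             for k in keys if node_for(k, pos, owners) == "db1"}
print(receivers)  # typically more than one survivor shares the load
```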


consistent hashing is used in:

  1. apache cassandra: to distribute data across the ring.
  2. amazon's dynamo: consistent hashing is its base.
  3. cdns: to determine which edge server a request should hit.

References

  1. System Design Interview - I - Alex Xu
  2. https://arpitbhayani.me/blogs/consistent-hashing/
  3. https://www.hellointerview.com/learn/system-design/deep-dives/consistent-hashing