Premise:
let us consider a key-value store to dive deep into some of the important concepts of a distributed system.
a key-value store is nothing but a look-up table. each value is stored under a unique key, and we get quick access to the value through that key. for performance reasons, systems often prefer shorter keys.
in popular key-value stores like memcached, redis, amazon dynamodb, etc., a value is considered an opaque object: it can be anything, and the db system storing it doesn't care about the internal structure of the value. the db system does care about the key though, since the key is what it hashes and looks up.
single server key-value store:
- the entire data resides on a single server/node. a data store that lives purely in-memory also counts as a single-node key-value store, as does a hybrid one that keeps frequently used data in memory and everything else on disk.
- the major concern for a single-server key-value store is size: vertical scaling is only feasible up to a certain point.
- of course you can use data compression to fit more data, but that too only helps up to a certain point.
so, we need to move to a distributed key-value store. for a distributed system, there are a lot of requirements and trade-offs to consider.
CAP Theorem:
when a distributed system is discussed, we mainly consider the CAP theorem.
- consistency: every client should see the same data, no matter which node it connects to, at any point in time.
- availability: irrespective of node failures, a client should be able to connect to the system at any point in time, which means any client requesting data gets a response.
- partition tolerance: a communication break between any two nodes shouldn't bring down the overall system; the system needs to keep working as intended.
it is said that no system can achieve all three of the above simultaneously. hence the different kinds of trade-offs below.
a distributed system can be any of the following three:
- a CP system: a system that maintains consistency and partition tolerance but sacrifices availability. when a node fails, the system stops serving requests until the node gets back to normal. this is chosen when there is an extreme need for consistency across all the servers, usually in banking systems, where serving from an out-of-sync node may cause inconsistency across a user's accounts.
- an AP system: a system that maintains availability and partition tolerance but sacrifices consistency. even when there is a node failure, the system continues to operate, possibly with stale data: since the data on the failed node is unavailable, the other nodes may serve stale or inconsistent data to the clients. this is usually chosen in content platforms like netflix, facebook, etc., where high availability is required.
- a CA system: a system that maintains consistency and availability but sacrifices partition tolerance. this is practically impossible, because the nodes of a distributed system communicate over a network, and networks aren't 100% reliable.
in the real world, partitions cannot be avoided. when a partition occurs, we have to choose between consistency and availability.
Data Partition:
let us go through each concept in a distributed environment one by one, raising the situations we need to handle as we go.
for a distributed system, partitioning the data is essential. the major requirement is an even distribution of data; data partitioning takes the load off of a single server.
for even distribution of data as well as to minimize the data movement when the servers are added or removed, we can choose consistent hashing.
read more about consistent hashing here.
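for a rough picture of what that looks like, here is a minimal consistent-hashing sketch in python. the node names, the use of md5, and the number of virtual points per server are assumptions made purely for the example, not details of any particular store:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # map any string onto a fixed integer ring (md5 is just an example choice)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=3):
        # give each physical node a few virtual points on the ring for evenness
        self.ring = sorted((_hash(f"{node}#{i}"), node)
                           for node in nodes for i in range(vnodes))
        self.points = [point for point, _ in self.ring]

    def node_for(self, key: str) -> str:
        # walk clockwise: the first virtual point at or after the key owns it
        idx = bisect.bisect_left(self.points, _hash(key)) % len(self.points)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # the node responsible for this key
```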
Data Replication:
we know that we can't maintain both perfect consistency and availability at once. to keep the data available and durable despite node failures, the next concept to apply is data replication.
consistency has to be maintained among the replicas. and since there are multiple storage points for a single key, there is no single point of failure, so availability is maintained too. generally, replicas are spread across racks or data centers: there is no point in keeping all replicas on the same rack or data center, because one rack or data-center failure would take out every copy.
- the first thing to decide is N, the number of replicas to maintain for a key.
- let us call the first node to the immediate right of a key's position on the ring the coordinator (also known as the proxy).
- this coordinator can be called the head of the replicas that maintain that particular key.
- if we choose N replicas, then the key will be stored across the next N nodes, starting with the coordinator.
the coordinator/proxy acts as the middle ground between the client and the replicas: it routes each request to the necessary replicas based on the key.
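to make the coordinator-plus-replicas idea concrete, here is a small sketch along the same hypothetical lines as the ring above: it walks clockwise from a key's position and collects the first N distinct physical nodes, with the first node returned playing the role of the coordinator. N = 3, the virtual-point count, and all names are assumptions for illustration:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def preference_list(nodes, key, n=3, vnodes=3):
    # virtual points on the ring: (position, physical node)
    points = sorted((_hash(f"{node}#{i}"), node)
                    for node in nodes for i in range(vnodes))
    positions = [position for position, _ in points]
    start = bisect.bisect_left(positions, _hash(key)) % len(points)
    replicas = []
    # walk clockwise, skipping virtual points of nodes we already picked
    for step in range(len(points)):
        node = points[(start + step) % len(points)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == n:
            break
    return replicas   # replicas[0] acts as the coordinator for this key

print(preference_list(["node-a", "node-b", "node-c", "node-d"], "user:42"))
```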
no system can provide complete consistency without sacrificing some availability, so we have to tune how strong we want our consistency to be. it is always a trade-off between consistency and availability.
Consistency models:
we already have N, which tells us the number of replicas. apart from this:
- we maintain W as the write quorum: for a write operation to be considered successful, the coordinator must get W acks from the replicas, denoting that the value has been written to W replicas.
- we maintain R as the read quorum: for a read operation to return a value successfully, the coordinator must get R versions of the value from R replicas, so that it can return the most updated value associated with the input key.
our consistency models depend upon these three values.
- when W = R = N, we get 100% consistency. but when a node fails, the coordinator won't get a response from one of the replicas, and we lose availability.
- when W + R > N, it is considered strong consistency: the write set and the read set overlap in at least one replica, so every read reaches at least one replica that holds the latest acknowledged write.
  - W = R and W + R > N: equal importance to both read and write operations, with at least one overlapping replica (usually N = 3 and W = R = 2).
  - W = 1 and R = N: faster write operations, as a write needs only one ack from any of the replicas.
  - R = 1 and W = N: faster read operations, as a read needs only one response from any of the replicas.
- when W + R <= N, it is considered weak consistency: a read is not guaranteed to see the latest write. (a toy sketch of these quorum checks follows this list.)
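here is that toy sketch of how a coordinator might enforce W and R. the in-memory replica class, the use of a timestamp as the version, and the choice of N = 3, W = R = 2 are all assumptions for illustration, not how any particular store implements it:

```python
import time

class Replica:
    def __init__(self):
        self.store = {}                 # key -> (version, value)

    def write(self, key, version, value):
        self.store[key] = (version, value)
        return True                     # ack

    def read(self, key):
        return self.store.get(key)      # (version, value) or None

class Coordinator:
    def __init__(self, replicas, w, r):
        self.replicas, self.w, self.r = replicas, w, r

    def put(self, key, value):
        version = time.time_ns()        # stand-in for a real version / vector clock
        acks = sum(1 for rep in self.replicas if rep.write(key, version, value))
        return acks >= self.w           # successful only with at least W acks

    def get(self, key):
        # for simplicity, ask every replica and require at least R responses
        responses = [r for r in (rep.read(key) for rep in self.replicas) if r]
        if len(responses) < self.r:
            raise RuntimeError("read quorum not met")
        return max(responses)[1]        # the value carrying the highest version wins

replicas = [Replica(), Replica(), Replica()]      # N = 3
coordinator = Coordinator(replicas, w=2, r=2)     # W + R > N -> strong consistency
coordinator.put("user:42", "alice")
print(coordinator.get("user:42"))
```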
apart from these, we have one more important consistency model: eventual consistency. this is the most commonly recommended model, as it sits somewhere between weak consistency (low consistency, high availability) and strong consistency (high consistency, low availability).
- in the eventual consistency model, inconsistent values are allowed into the system, and we surface those inconsistent values to the client to resolve.
- as the client resolves them, the values are eventually reconciled and the system becomes consistent for subsequent calls.
we also have session consistency, causal consistency and sequential consistency models, which are comparatively rare in usage.
to achieve eventual consistency, the system needs to handle the inconsistent values that are allowed into it. the handling of these inconsistent values is called reconciliation.
reconciliation: the restoration of friendly relations.
inconsistency arises when two separate writes land on two different replicas, and those replicas end up associating different values with the same key. why does this happen? same reason as before: the quorum maintained is not strict, and the coordinator considering an operation successful doesn't mean the write has happened on all of the concerned replicas.
Versioning:
in most systems including dynamo db, this kind of inconsistency is resolved using versioning.
- for every write operation, we maintain a log. for example, if there is a data-item D and a write operation for that data-item is handled by the server Sx, then a version entry like D([Sx, 1]) will be stored in the logs.
- these entries are called vector clocks. using these vector clocks, we can easily combine the information from multiple independent write operations and decide the final value.
- if the updates are just ancestors of each other, then there are no conflicts: the descendant update is always the latest one, so we just keep the value from the latest write operation.
- for example, take two vector clocks: D([Sx, 1], [Sy, 1]) (Sx has handled its initial operation and Sy has handled its initial operation) and D([Sx, 1], [Sy, 2]) (Sx has handled its initial operation and Sy has handled its second operation).
- in this example, you can clearly see that there is no conflict. the second vector clock is clearly ahead of the first: it includes the version entry written by the server Sx together with the updated entry from the server Sy, reflecting Sy's second update to the data-item D.
- if the updates are siblings of each other, that means there is a conflict.
- for example, take two vector clocks: D([Sx, 1], [Sy, 2]) and D([Sx, 2], [Sy, 1]). looking at the server Sx alone there is no issue: its first version appears in the first clock and its second version in the second clock. but the versioning associated with the server Sy goes the other way, so neither clock contains everything the other one does.
- this indicates that the updates happened independently, without information from each other, which is a conflict. (a small comparison sketch follows below.)
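here is that small sketch of the ancestor vs. sibling check, representing a vector clock as a dict from server name to counter. the dict representation and function names are assumptions; the Sx/Sy values mirror the examples above:

```python
def dominates(a: dict, b: dict) -> bool:
    # a dominates b if a has seen at least everything b has seen
    return all(a.get(server, 0) >= counter for server, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(b, a):
        return "b is a descendant of a: keep b"
    if dominates(a, b):
        return "a is a descendant of b: keep a"
    return "siblings: conflict, reconcile at the application level"

# ancestor example: D([Sx, 1], [Sy, 1]) vs D([Sx, 1], [Sy, 2])
print(compare({"Sx": 1, "Sy": 1}, {"Sx": 1, "Sy": 2}))
# sibling example: D([Sx, 1], [Sy, 2]) vs D([Sx, 2], [Sy, 1])
print(compare({"Sx": 1, "Sy": 2}, {"Sx": 2, "Sy": 1}))
```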
how do we handle this?
- it depends on the application logic.
- in general, large-scale e-commerce applications merge both versions and show them at the application level.
- later, when the client acts on that information, the values are updated as required.
downsides:
- these vector clocks add extra overhead on the server side.
- they grow rapidly and become difficult to maintain.
- hence, most database systems set a threshold and delete the older entries once it is crossed (a tiny sketch of this truncation follows below). this can make reconciliation less precise but, according to dynamo, the problem has not yet been observed in production.
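the truncation sketch, assuming each clock entry also carries a timestamp of when that server last touched the item; the threshold of 10 is an arbitrary number for the example:

```python
def truncate(clock: dict, threshold: int = 10) -> dict:
    # clock: server -> (counter, last_updated_timestamp)
    if len(clock) <= threshold:
        return clock
    # keep only the entries from the servers that touched the item most recently
    recent = sorted(clock.items(), key=lambda item: item[1][1], reverse=True)
    return dict(recent[:threshold])

clock = {f"S{i}": (1, i) for i in range(15)}   # 15 servers have touched this item
print(len(truncate(clock)))                    # only the 10 most recent entries remain
```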
enough with handling data, how do we handle the replicas or nodes distributed across different data centers?
Failure detection:
- on a simple note, we can't possibly have a mesh-like structure in which every server checks on every other server: with a large number of servers, that is very inefficient to maintain.
- hence, we use something called gossip protocol.
- it takes its idea from eventual consistency. eventuality, that is the key.
- every server periodically pings a few random servers with its health status, telling them that it is alive.
- the random servers that receive this status update their records and pass these statuses on to some other random servers.
- this propagates, and eventually every server has some view of the status of every other server.
- if the status of any server does not get updated for a certain period, it is regarded as an inactive server, and the server that notices this reports the situation. (a reduced sketch of this gossip loop follows below.)
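here is that reduced sketch: each node keeps a table of heartbeat counters, bumps its own counter on every tick, gossips the table to a couple of random peers, and marks entries that stop advancing as suspected dead. the peer count, the tick-based driver, and the timeout are assumptions:

```python
import random
import time

class Node:
    def __init__(self, name):
        self.name = name
        self.peers = []                          # other Node objects, wired up below
        self.table = {name: (0, time.time())}    # node -> (heartbeat, last seen)

    def tick(self):
        # bump our own heartbeat and gossip our whole table to two random peers
        heartbeat, _ = self.table[self.name]
        self.table[self.name] = (heartbeat + 1, time.time())
        for peer in random.sample(self.peers, k=min(2, len(self.peers))):
            peer.merge(self.table)

    def merge(self, remote_table):
        # keep whichever entry carries the higher heartbeat counter
        for node, (heartbeat, _) in remote_table.items():
            if node not in self.table or heartbeat > self.table[node][0]:
                self.table[node] = (heartbeat, time.time())

    def suspected_dead(self, timeout=5.0):
        now = time.time()
        return [n for n, (_, seen) in self.table.items() if now - seen > timeout]

a, b, c = Node("a"), Node("b"), Node("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
for _ in range(3):
    a.tick(); b.tick(); c.tick()
print(sorted(a.table))   # eventually every node has some status for every other node
```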
Sloppy quorum and hinted hand-off:
- temporary failures are often handled using sloppy quorum.
- instead of choosing the next N nodes, the coordinator chooses the next N healthy nodes for the operations.
- so, when a node fails, the coordinator takes the ack from the next healthy node, which acts like a proxy for the failed node for some time.
- this stand-in node stores what are called hints for the failed node and hands them over later, when the failed node becomes active again. these hints are nothing but the operations carried out in the meantime. this process is called hinted hand-off.
- this isn't possible with a strict quorum, which is why this approach is called a sloppy quorum: the coordinator continues its operation even though some of the replicas are not active, as long as it still receives enough acks from the remaining healthy nodes. of course, there will still be limits on W and R in a sloppy quorum as well. (a toy hand-off sketch follows below.)
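here is the toy hand-off sketch: when one of the intended replicas is down, the write goes to a healthy stand-in that stashes a hint and replays it when the failed node returns. all names and data shapes are assumptions for illustration:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}     # key -> value
        self.hints = []     # (intended node name, key, value) held for failed peers

    def write(self, key, value):
        self.store[key] = value

    def hold_hint(self, intended_for, key, value):
        # temporarily accept a write on behalf of the failed node
        self.hints.append((intended_for, key, value))

    def hand_off(self, recovered):
        # replay the stored hints once the failed node is back
        for intended_for, key, value in self.hints:
            if intended_for == recovered.name:
                recovered.write(key, value)
        self.hints = [h for h in self.hints if h[0] != recovered.name]

def sloppy_write(replicas, key, value):
    # assumes at least one replica is healthy
    healthy = [r for r in replicas if r.alive]
    for target in replicas:
        if target.alive:
            target.write(key, value)
        else:
            # route the write to a healthy stand-in along with a hint
            healthy[0].hold_hint(target.name, key, value)

a, b, c = Replica("a"), Replica("b"), Replica("c")
b.alive = False
sloppy_write([a, b, c], "user:42", "alice")    # b misses the write, a keeps a hint
b.alive = True
a.hand_off(b)                                  # b catches up via hinted hand-off
print(b.store)
```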
Merkle trees:
- the above approach works for replicas that come back alive after some time; it catches up on stale data from a limited window.
- but for permanent failures, there can be a lot of stale data. when manual attention is required to bring a node back, it is considered a permanent failure, and the amount of stale data in that case is likely much higher.
- in this case, we can use merkel trees or hash trees.
- a merkle tree (or hash tree) divides the key space into buckets, hashes each bucket, and then combines those hashes level by level up the tree. two replicas compare hashes from the root downwards to find where their data differs.
- only the parts that differ get synchronized, minimizing the data migration, as sketched below.
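a rough sketch of that comparison: each replica buckets its keys, hashes each bucket, folds the hashes up into a root, and two replicas whose roots differ then compare bucket hashes to see which ranges need syncing. the bucket count, the hash choice, and the flat leaf comparison (instead of a full top-down walk) are simplifications for the example:

```python
import hashlib

def _h(data: str) -> str:
    return hashlib.md5(data.encode()).hexdigest()

def build_merkle(store: dict, num_buckets: int = 4) -> list:
    # level 0: divide the key space into buckets and hash each bucket's contents
    buckets = [[] for _ in range(num_buckets)]
    for key in sorted(store):
        buckets[int(_h(key), 16) % num_buckets].append(f"{key}={store[key]}")
    level = [_h("|".join(bucket)) for bucket in buckets]
    tree = [level]
    # combine hashes pairwise until a single root hash remains
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree   # tree[-1][0] is the root hash

def differing_buckets(tree_a, tree_b):
    # matching roots mean the replicas are in sync; otherwise compare leaf hashes
    if tree_a[-1][0] == tree_b[-1][0]:
        return []
    return [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]

replica_1 = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_2 = {"k1": "v1", "k2": "stale", "k3": "v3"}
print(differing_buckets(build_merkle(replica_1), build_merkle(replica_2)))
```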
i have introduced the important topics that come in handy in distributed systems and how people handle them in most cases. i will dive deep into the individual topics in later posts.
References
- https://bytebytego.com/courses/system-design-interview/design-a-key-value-store
- https://www.hellointerview.com/learn/system-design/deep-dives/cap-theorem
- many other sources which i don't remember.