Everything about Rate Limiters.

September 15, 2025

there are two kinds of rate limiters -> client-side rate limiters and server-side rate limiters. we need to decide on that first.

server-side rate limiters can be designed in two ways:

  1. a separate module aside from the micro-service architecture
  2. combined with the application code

the decision depends on the design motivation


several questions to have in mind:

  1. which kind of rate limiter is this? client-side or server-side
  2. which kind of rules to follow here? does it enforce only one particular criterion, or is it flexible enough to configure different rules?
  3. a separate module for the rate limiter? or the rate limiter comes with the application code itself?
  4. is this supposed to work in a distributed environment?

client-side rate limiter:

  1. often not recommended. here, the client itself limits the requests.
  2. from the client end, requests flow out at a certain rate.
  3. hard to implement when the client side is not under our control.

server-side rate limiter:

  1. when it comes to rate limiting, this is almost always what is meant.
  2. there can be two kinds: a rate limiter that comes with the application code, or a separate gateway acting as a middleman.
  3. if the language we are using has the capabilities to implement a rate limiter, we can build it ourselves. this gives high customization and full control; we can configure it as per our needs and it lives completely within the application code.
  4. if we don't want much complexity, we can use cloud services, which nowadays provide rate limiters as part of API gateways. these API gateways combine many features such as SSL termination, authentication, rate limiting, etc.

rate limiting is essentially limiting the number of requests in a time-frame. there are different algorithms we can follow:

  1. Token Bucket:
    1. There will be different buckets for each of the rules we define. for example, users can post only 100 posts per day. so, the time-frame here is 1 day. and the limit here is 100.
    2. each bucket will have a number of tokens. in every time-frame there is a refill, and the bucket is filled again with a certain number of tokens.
    3. when a request comes, if a token is available in the bucket, the token gets consumed and the request goes through to the server.
    4. if the bucket is empty, the request is rate limited. it is either dropped or goes into a queue to be processed later.
    5. so, we have two parameters: the bucket size, which is the maximum number of tokens a bucket can contain, and the refill rate, which is the number of tokens put into the bucket every time unit. (a small sketch of this follows the list below.)
  2. Leaky Bucket:
    1. a similar yet different algorithm from the above. here, there won't be any tokens.
    2. instead, there is a queue where the requests come and sit.
    3. if the queue is full, incoming requests don't enter the queue and are rate limited.
    4. here, the outflow rate is the most important factor. it is the rate at which the server takes in requests from the queue.
    5. so, we can expect some stability in this approach, rather than the bursts of requests that can come through in the previous approach.
    6. there are two parameters here too: one is the queue size and the other is the outflow rate.
  3. Fixed Window Counter:
    1. here, there is a window which is for a certain time.
    2. and for each window, we will only allow certain number of requests.
    3. now, if the window has already received its allowed number of requests, we discard any further incoming requests.
    4. there is a big disadvantage in this algorithm.
      1. for example, let us take the time for each window to be 1 minute, and each window only allows 5 requests.
      2. suppose 5 requests come in during the last 30 seconds of the 1st window, and another 5 requests come in during the first 30 seconds of the 2nd window.
      3. that implies 10 requests in the span of 1 minute around the window boundary, which violates the limit the algorithm is supposed to enforce.
  4. Sliding Window Log Counter:
    1. here, the problem in the above algorithm is solved.
    2. here, we keep track of the timestamp of each request, and these logs are maintained for a rolling time frame.
    3. there is a limit to the requests the server can receive in that time frame. if the log has reached its capacity, we no longer accept requests.
      1. for example, 2 requests per minute is our configuration. here, 2 requests is the limit for the time of 1 minute.
      2. if two requests came in at the 0th second and 30th second, we take in the requests and maintain these time stamp logs.
      3. then, if a 3rd request comes, say at the 50th second, we first discard all log entries older than (50th second - 60 seconds). there are no such entries, so the log still holds 2 requests and this request is rejected. however, its timestamp is still kept in the log in the cache.
      4. now, if a request comes at the 1 minute 31st second, the first two logs (0th second and 30th second) are discarded as they are older than one minute. the log size is now 1, which is less than 2, hence this request is allowed and its timestamp is logged.
    4. the disadvantage is the storage cost of keeping a timestamp per request.
    5. there is an issue i don't have full clarity on: if we also log the rejected requests, there can be a scenario where no request is allowed even though capacity still exists. the log fills up with rejected requests, which don't actually consume server time in the time frame, and every new request in that time frame also gets rejected.
    6. in most cases, we don't log the rejected requests, and then everything works as intended.
  5. Sliding Window Counter:
    1. There are two ways to implement this algorithm.
    2. similar to the sliding window log algorithm and the fixed window counter, it additionally considers the number of requests in the previous window, to a certain extent.
    3. the process goes like this:
      1. for example, we allow 7 requests per minute.
      2. now, there are 5 requests in the previous window and there are 3 requests already accepted in the current window.
      3. if a new request comes at 30% of the current window (i.e. 30% of the time unit we chose for each window), then the number of requests in the rolling window is calculated with the formula: requests in the current window + requests in the previous window * overlap percentage of the rolling window with the previous window.
      4. so, here the number of requests in the rolling window is 3 + 5 * 0.7 = 6.5 requests.
      5. which is less than the limit of 7, so the new request that came at 30% of the window is accepted.
    4. but this assumes the 5 requests in the previous window are evenly distributed, which doesn't have to be the case every time. hence, there are other approaches that take the request timestamps into account and only count the requests in the last 70% of the previous window (ref. the example above), or a sub-bucket approach where k buckets of small duration (time-frame / k) are maintained. (a sketch of this weighted formula also follows the list.)
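as a rough illustration of the token bucket parameters above, here is a minimal single-process sketch in Python (the class and parameter names are made up for illustration, not taken from any library):

```python
import time

class TokenBucket:
    """minimal token bucket sketch: bucket size + refill rate."""

    def __init__(self, bucket_size: int, refill_rate: float):
        self.bucket_size = bucket_size      # max tokens the bucket can hold
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(bucket_size)    # start with a full bucket
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        # refill based on how much time has passed since the last check
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.bucket_size, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # consume a token if one is available; otherwise the request is rate limited
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. allow bursts of up to 10 requests, refilling 2 tokens per second
limiter = TokenBucket(bucket_size=10, refill_rate=2.0)
print(limiter.allow_request())  # True while tokens remain
```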
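and a tiny sketch of the weighted count used by the sliding window counter, reproducing the 3 + 5 * 0.7 = 6.5 example (function and argument names are illustrative):

```python
def sliding_window_allow(prev_count: int, curr_count: int,
                         elapsed_fraction: float, limit: int) -> bool:
    # weighted count = requests in current window
    #                + requests in previous window * overlap with the rolling window
    overlap = 1.0 - elapsed_fraction
    weighted = curr_count + prev_count * overlap
    return weighted < limit

# 5 requests in the previous window, 3 in the current one, a new request
# arrives 30% into the current window, limit is 7 per window:
print(sliding_window_allow(prev_count=5, curr_count=3,
                           elapsed_fraction=0.3, limit=7))  # 3 + 5*0.7 = 6.5 < 7 -> True
```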

high level architecture:

  • so, essentially we have two generic things here for any algorithm: one is a counter and the other is a limit.
  • we need to maintain these at a high level in order to facilitate our rate limiter.
  • the counter increments when a request comes in and the counter is within the limit.
  • the counter expires when a certain time limit is reached.
  • for this, we mostly use Redis.

client request --> rate limiting middleware --> API servers

the rate limiting middleware has access to the Redis cache where we maintain the counter.
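a minimal sketch of that counter with redis-py, assuming a local redis instance and a fixed-window style rule (the key naming and limits are made up; the increment-and-expire pair is not atomic here, which ties into the race condition discussion further below):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local instance

def is_allowed(client_id: str, limit: int, window_seconds: int) -> bool:
    """increment the client's counter; let it expire when the window ends."""
    key = f"ratelimit:{client_id}"
    count = r.incr(key)                # atomic increment of the counter
    if count == 1:
        r.expire(key, window_seconds)  # counter expires at the end of the window
    return count <= limit

# e.g. allow at most 100 requests per client per minute
if not is_allowed("user:42", limit=100, window_seconds=60):
    print("429 Too Many Requests")
```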


There can be multiple rate limiting rules here. We have to maintain all these rules in a separate data structure.

  • we can have dedicated workers that will fetch rules into a cache
  • and the rate limiting middleware accesses these rules from this cache and acts accordingly on the counters that we maintain in Redis.
  • exceeding the rate limiter:
    • when a request is rate limited according to the defined rules, we either reject it and discard it completely or store/en-queue it somewhere in order to process it later.
    • most of the time, it is just "fail-fast": if the request is rejected, we just return the status to the client. this is chosen because, if we maintain a separate queue for rejected requests, it all becomes so congested that API response times become unpredictable. it is worse in the case of highly interactive applications.
    • generally, we have rate limit headers that let the server inform the client whether a request is rate limited.
    • when a user sends too many requests, they get a 429 status code with an X-Ratelimit-Retry-After header specifying the number of seconds to wait until they can make another request.
    • X-RateLimit-Limit specifies the overall request limit per time unit (often per second or per minute). X-RateLimit-Remaining specifies the number of requests that can still be accepted in the current window. X-RateLimit-Reset specifies the timestamp at which the request limit resets. (an illustrative response is shown after this list.)
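an illustrative rate-limited response might look like this (all values here are made up):

```
HTTP/1.1 429 Too Many Requests
X-Ratelimit-Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1757913600
```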

rate limiter in distributed environment:

no one said we need just a single rate limiter. when the request load increases, we need multiple rate limiters that stay in sync with each other and serve the whole application server base. when the number of clients increases, we even need to maintain multiple redis instances in order to support a huge number of requests.

  • we need to somehow distribute our requests from the API gateways or rate limiters. we can use consistent hashing here.
  • in case of a user, the userID decides the redis instance the request should reach. in case of a guest, the IP decides the instance. in case of a developer, the api_key decides the instance. (a simplified sketch of this key-to-shard mapping follows this list.)
  • usually, this can be maintained by redis cluster as long as you don't require any custom consistent hashing. redis cluster by default has 16,384 hash slots.
  • here, availability is more important than consistency. eventual consistency is more than sufficient.
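a simplified sketch of picking a shard from a client key (plain modulo hashing for illustration only, not true consistent hashing; real redis cluster maps CRC16(key) mod 16384 to hash slots, which avoids the mass remapping that plain modulo causes when shards are added or removed):

```python
import hashlib

REDIS_SHARDS = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]  # hypothetical shard addresses

def shard_for(key: str) -> str:
    """map a client key (userID, IP, or api_key) to one redis shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return REDIS_SHARDS[int(digest, 16) % len(REDIS_SHARDS)]

print(shard_for("user:42"))         # logged-in user -> userID decides the shard
print(shard_for("203.0.113.7"))     # guest -> IP decides the shard
print(shard_for("api_key:abc123"))  # developer -> api_key decides the shard
```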

  1. race condition:
    1. when two rate limiters are trying to access the same counter and are trying to modify the values.
    2. the popular way is to use locks, but that introduces extra overhead on the server and slows down the system.
    3. two more ways to handle this:
      1. lua script: https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff574d
      2. sorted sets data structure in redis: https://engineering.classdojo.com/blog/2015/02/06/rolling-rate-limiter/
  2. synchronization issue:
    1. when a client's requests land on different rate limiters interchangeably, a rate limiter may not have the data for that client because its earlier requests went to some other rate limiter.
    2. this can be solved using sticky sessions. this is easy but generally not scalable or flexible.
    3. better alternative is to use centralized data store like redis with multiple instances supported by consistent hashing.
  3. performance optimization:
    1. data centers in multiple locations to provide better latency for users from different regions.
    2. eventual consistency model. availability > consistency.

more on race condition:

  1. when two requests from the same client read the counter from the redis instance, they can each independently decide that the request is allowed, even though only one of them should be. they both then perform the write and decrement the counter, leading to an inconsistent state.
  2. this can be avoided by using lua scripts, which redis executes atomically. the entire read-decide-update sequence is performed in one go, which avoids the issue. (see the sketch after this list.)
  3. of course, we can apply a pessimistic lock but, as stated above, locks are resource heavy.
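a sketch of the earlier fixed-window counter with the read-decide-update folded into a single lua script registered via redis-py, so redis runs it atomically (the gist linked above has a fuller token-bucket version; key names and limits here are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local instance

# redis executes the whole script without interleaving other commands,
# so two gateways cannot both pass the check for the same last slot.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
    return 0
end
return 1
"""

allow = r.register_script(FIXED_WINDOW_LUA)

def is_allowed(client_id: str, limit: int, window_seconds: int) -> bool:
    return allow(keys=[f"ratelimit:{client_id}"], args=[limit, window_seconds]) == 1
```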

more on high availability:

  1. if a redis shard is down, all the client data residing in it is unavailable.
  2. there are two options for us here:
    1. fail-closed: reject all the requests related to the client data from the unavailable redis instance. but this will make our system unavailable for a certain group of clients.
    2. fail-open: accept all the requests related to the client data from the unavailable redis instance. there is no rate-limiting. this will make our system prone to heavy load.
  3. when we accept all requests without performing rate limiting, a viral event can send an overwhelming number of requests straight to our back-end servers, and the overload can spread to the services that depend on them. this is how cascading failures happen.
  4. hence, we usually avoid fail-open and prefer fail-closed. brief periods of rejected requests are better than a failure cascading over the entire system. fail-open also leaves us exposed to DDoS attacks and other malicious activity, since we are accepting all requests. (a small sketch of this choice follows.)
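a rough sketch of where that decision lives, reusing the redis-py counter from earlier (the fail_open flag is a made-up illustration):

```python
import redis

def is_allowed_with_fallback(r: redis.Redis, client_id: str,
                             limit: int, window_seconds: int,
                             fail_open: bool = False) -> bool:
    """counter check that decides what to do when the redis shard is unreachable."""
    key = f"ratelimit:{client_id}"
    try:
        count = r.incr(key)
        if count == 1:
            r.expire(key, window_seconds)
        return count <= limit
    except redis.exceptions.ConnectionError:
        # fail-open: accept everything (risk of overload / cascading failure)
        # fail-closed: reject while the shard is down (brief unavailability for these clients)
        return fail_open
```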

more on fault tolerance:

  1. the above happens only if there is some failure. to mitigate that, we can maintain replicas for each redis instance.
  2. every redis shard is now a master with a replica. redis cluster provides this built-in and manages promoting a replica to be the new master whenever the current master fails.
  3. redis performs replication asynchronously in the background, and it is usually quite fast.

more on scalability:

  1. when the requests are increasing, latency becomes an issue.
  2. network latency plays a role here because every rate-limiting operation requires a network round-trip to redis.
  3. to avoid this, we can maintain a pool of persistent connections. each TCP connection is reusable and the API gateways use these connections. (a sketch follows this list.)
  4. redis client libraries handle such connection pools as well.
  5. one more thing is geographic dependency. maintaining redis instances in data centers close to each API gateway will decrease latency by a good amount.
  6. but, again, consistency & synchronization across data centers are complex. anyway, eventual consistency is sufficient here, so this is not an issue.
  7. there are other things like caching in the API gateway, batching multiple requests from the same client, etc.
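a minimal sketch of a shared connection pool with redis-py (host name and pool size are made up):

```python
import redis

# one shared pool of persistent connections per gateway process; each command
# borrows a connection instead of paying a new TCP handshake per request.
pool = redis.ConnectionPool(host="redis-ratelimit", port=6379, max_connections=50)
r = redis.Redis(connection_pool=pool)

count = r.incr("ratelimit:user:42")  # reuses a pooled connection
```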

handling hot keys:

  1. in some systems, there can be clients that require high request throughput and overwhelm the API gateway with a lot of requests. these can be DDoS traffic or bots; they can also be legitimate users.
  2. some of the ways to handle this situation:
    1. here comes client-side rate limiting: every client in the application ecosystem should employ some kind of rate limiting on its end.
    2. the client can batch multiple requests (request coalescing), though that adds complexity. this burst of many simultaneous requests is called the thundering herd problem.
    3. switch them to a better tier: a separate redis shard for power users, etc.
  3. for abusive traffic, you can employ a black-list (maintained periodically), use Cloudflare for DDoS protection, etc.

modifying rules:

  1. rule modification is expensive.
  2. the best solution is to maintain a separate database and update the rules there. API gateways periodically sync with it and update their rules, but there will be some update delay. this is poll-based.
  3. in cases where this delay must be avoided, we maintain a dedicated configuration management system like ZooKeeper. this system pushes the configuration whenever there are changes, like a pub/sub model; it is also doable with redis and a custom configuration system. this is push-based. (a small pub/sub sketch follows.)
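a rough sketch of the push-based option using redis pub/sub (channel name and rule format are made up; ZooKeeper or a custom config service would play the same role):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# publisher side (hypothetical admin/config service): push the new rule set
r.publish("ratelimit:rules", json.dumps({"posts_per_day": 100}))

# subscriber side (each API gateway): apply rule changes as they arrive
rules = {}
pubsub = r.pubsub()
pubsub.subscribe("ratelimit:rules")
for message in pubsub.listen():
    if message["type"] == "message":
        rules.update(json.loads(message["data"]))
```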

hard rate limiting: requests cannot exceed the threshold no matter what. soft rate limiting: requests can exceed the threshold for a short amount of time.


References

  1. https://bytebytego.com/courses/system-design-interview/design-a-rate-limiter
  2. https://www.hellointerview.com/learn/system-design/problem-breakdowns/distributed-rate-limiter