Everything about Rate Limiters.

September 15, 2025

there are two kinds of rate limiters -> client-side rate limiters and server-side rate limiters. we need to decide on that first.

server-side rate limiters can be designed in two ways:

  1. a separate module aside from the micro-service architecture
  2. combined with the application code

the decision depends on the design motivation


several questions to have in mind:

  1. which kind of rate limiter is this? client-side or server-side
  2. which kind of rules to follow here? does it enforce only one particular criterion, or is it flexible enough to configure different rules?
  3. a separate module for the rate limiter? or the rate limiter comes with the application code itself?
  4. is this supposed to work in a distributed environment?

client-side rate limiter:

  1. often not recommended. here, the client itself limits the requests.
  2. from the client end, requests flow out at a certain rate.
  3. hard to implement when the client side is not under our control.

server-side rate limiter:

  1. when it comes to rate limiting, this is almost always what is meant.
  2. there can be two kinds: a rate limiter that comes with the application code, or a separate gateway acting as a middleman.
  3. if the language we are using has the capabilities to implement a rate limiter, we can build it ourselves. this gives high customization and full control; we can configure it as per our needs and it lives completely within the application code.
  4. if we don't want much complexity, we can use cloud services, which nowadays provide rate limiters as part of API gateways. these API gateways combine many features such as SSL termination, authentication, rate limiting, etc.

rate limiting is essentially limiting the number of requests in a time-frame. there are different algorithms we can follow:

  1. Token Bucket:
    1. There will be different buckets for each of the rules we define. for example, users can post only 100 posts per day. so, the time-frame here is 1 day. and the limit here is 100.
    2. each bucket will have a number of tokens. in every time-frame there is a refill, and the bucket is filled again with a certain number of tokens.
    3. when a request comes, if a token is available in the bucket, the token gets consumed and the request goes through to the server.
    4. if the bucket is empty, the request is rate limited. it is either dropped or goes into a queue to be processed later.
    5. so, we have two parameters: the bucket size, which is the maximum number of tokens a bucket can contain, and the refill rate, which is the number of tokens put into the bucket every time unit. (a small sketch of this follows the list below.)
  2. Leaky Bucket:
    1. a similar yet different algorithm from the above. here, there won't be any tokens.
    2. instead, there is a queue where the requests come and sit.
    3. if the queue is full, incoming requests don't enter the queue and are rate limited.
    4. here, the outflow rate is the most important factor. it is the rate at which the server takes in requests from the queue.
    5. so, we can expect some stability in this approach, rather than the bursts of requests that can come through in the previous approach.
    6. there are two parameters here too: one is the queue size and the other is the outflow rate.
  3. Fixed Window Counter:
    1. here, there is a window which is for a certain time.
    2. and for each window, we will only allow certain number of requests.
    3. now, if the window has already received its allowed number of requests, we discard any further incoming requests.
    4. there is a big disadvantage in this algorithm.
      1. for example, let us take the time for each window to be 1 minute, and each window only allows 5 requests.
      2. suppose 5 requests come in during the last 30 seconds of the 1st window, and another 5 requests come in during the first 30 seconds of the 2nd window.
      3. that implies 10 requests in the span of 1 minute around the window boundary, which violates the limit the algorithm is supposed to enforce.
  4. Sliding Window Log Counter:
    1. here, the problem in the above algorithm is solved.
    2. here, we keep track of the timestamp of each request, and these logs are maintained for a rolling time frame.
    3. there is a limit to the requests the server can receive in that time frame. if the log has reached its capacity, we no longer accept requests.
      1. for example, 2 requests per minute is our configuration. here, 2 requests is the limit for the time of 1 minute.
      2. if two requests came in at the 0th second and 30th second, we take in the requests and maintain these time stamp logs.
      3. then, if a 3rd request comes, say at the 50th second, we first discard all log entries older than (50th second - 60 seconds). there are no such entries, so the log still holds 2 requests and this request is rejected. however, its timestamp is still kept in the log in the cache.
      4. now, if a request comes at the 1 minute 31st second, the first two logs (0th second and 30th second) are discarded as they are older than one minute. the log size is now 1, which is less than 2, hence this request is allowed and its timestamp is logged.
    4. the disadvantage is the storage cost of keeping a timestamp per request.
    5. there is an issue i don't have full clarity on: if we also log the rejected requests, there can be a scenario where no request is allowed even though capacity still exists. the log fills up with rejected requests, which don't actually consume server time in the time frame, and every new request in that time frame also gets rejected.
    6. in most cases, we don't log the rejected requests, and then everything works as intended.
  5. Sliding Window Counter:
    1. There are two ways to implement this algorithm.
    2. similar to the sliding window log algorithm and the fixed window counter, it additionally considers the number of requests in the previous window, to a certain extent.
    3. the process goes like this:
      1. for example, we allow 7 requests per minute.
      2. now, there are 5 requests in the previous window and there are 3 requests already accepted in the current window.
      3. if a new request comes at 30% of the current window (i.e. 30% of the time unit we chose for each window), then the number of requests in the rolling window is calculated with the formula: requests in the current window + requests in the previous window * overlap percentage of the rolling window with the previous window.
      4. so, here the number of requests in the rolling window is 3 + 5 * 0.7 = 6.5 requests.
      5. which is less than the limit of 7, so the new request that came at 30% of the window is accepted.
    4. but this assumes the 5 requests in the previous window are evenly distributed, which doesn't have to be the case every time. hence, there are other approaches that take the request timestamps into account and only count the requests in the last 70% of the previous window (ref. the example above), or a sub-bucket approach where k buckets of small duration (time-frame / k) are maintained. (a sketch of this weighted formula also follows the list.)
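as a rough illustration of the token bucket parameters above, here is a minimal single-process sketch in Python (the class and parameter names are made up for illustration, not taken from any library):

```python
import time

class TokenBucket:
    """minimal token bucket sketch: bucket size + refill rate."""

    def __init__(self, bucket_size: int, refill_rate: float):
        self.bucket_size = bucket_size      # max tokens the bucket can hold
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(bucket_size)    # start with a full bucket
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        # refill based on how much time has passed since the last check
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.bucket_size, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # consume a token if one is available; otherwise the request is rate limited
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. allow bursts of up to 10 requests, refilling 2 tokens per second
limiter = TokenBucket(bucket_size=10, refill_rate=2.0)
print(limiter.allow_request())  # True while tokens remain
```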
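and a tiny sketch of the weighted count used by the sliding window counter, reproducing the 3 + 5 * 0.7 = 6.5 example (function and argument names are illustrative):

```python
def sliding_window_allow(prev_count: int, curr_count: int,
                         elapsed_fraction: float, limit: int) -> bool:
    # weighted count = requests in current window
    #                + requests in previous window * overlap with the rolling window
    overlap = 1.0 - elapsed_fraction
    weighted = curr_count + prev_count * overlap
    return weighted < limit

# 5 requests in the previous window, 3 in the current one, a new request
# arrives 30% into the current window, limit is 7 per window:
print(sliding_window_allow(prev_count=5, curr_count=3,
                           elapsed_fraction=0.3, limit=7))  # 3 + 5*0.7 = 6.5 < 7 -> True
```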

high level architecture:

  • so, essentially we have two generic things here for any algorithm: one is a counter and the other is a limit.
  • we need to maintain these at a high level in order to facilitate our rate limiter.
  • the counter increments when a request comes in and the counter is within the limit.
  • the counter expires when a certain time limit is reached.
  • for this, we mostly use Redis.

client request --> rate limiting middleware --> API servers

the rate limiting middleware has access to the Redis cache where we maintain the counter.
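a minimal sketch of that counter with redis-py, assuming a local redis instance and a fixed-window style rule (the key naming and limits are made up; the increment-and-expire pair is not atomic here, which ties into the race condition discussion further below):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local instance

def is_allowed(client_id: str, limit: int, window_seconds: int) -> bool:
    """increment the client's counter; let it expire when the window ends."""
    key = f"ratelimit:{client_id}"
    count = r.incr(key)                # atomic increment of the counter
    if count == 1:
        r.expire(key, window_seconds)  # counter expires at the end of the window
    return count <= limit

# e.g. allow at most 100 requests per client per minute
if not is_allowed("user:42", limit=100, window_seconds=60):
    print("429 Too Many Requests")
```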


There can be multiple rate limiting rules here. We have to maintain all these rules in a separate data structure.

  • we can have dedicated workers that will fetch rules into a cache
  • and the rate limiting middleware accesses these rules from this cache and acts accordingly on the counters that we maintain in Redis.
  • exceeding the rate limiter:
    • when a request is rate limited according to the defined rules, we either reject it and discard it completely or store/en-queue it somewhere in order to process it later.
    • most of the time, it is just "fail-fast": if the request is rejected, we just return the status to the client. this is chosen because, if we maintain a separate queue for rejected requests, it all becomes so congested that API response times become unpredictable. it is worse in the case of highly interactive applications.
    • generally, we have rate limit headers that let the server inform the client whether a request is rate limited.
    • when a user sends too many requests, they get a 429 status code with an X-Ratelimit-Retry-After header specifying the number of seconds to wait until they can make another request.
    • X-RateLimit-Limit specifies the overall request limit per time unit (often per second or per minute). X-RateLimit-Remaining specifies the number of requests that can still be accepted in the current window. X-RateLimit-Reset specifies the timestamp at which the request limit resets. (an illustrative response is shown after this list.)
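an illustrative rate-limited response might look like this (all values here are made up):

```
HTTP/1.1 429 Too Many Requests
X-Ratelimit-Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1757913600
```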

rate limiter in distributed environment:

no one said we need just a single rate limiter. when the request load increases, we need multiple rate limiters that stay in sync with each other and serve the whole application server base. when the number of clients increases, we even need to maintain multiple redis instances in order to support a huge number of requests.

  • we need to somehow distribute our requests from the API gateways or rate limiters. we can use consistent hashing here.
  • in case of a user, the userID decides the redis instance the request should reach. in case of a guest, the IP decides the instance. in case of a developer, the api_key decides the instance. (a simplified sketch of this key-to-shard mapping follows this list.)
  • usually, this can be maintained by redis cluster as long as you don't require any custom consistent hashing. redis cluster by default has 16,384 hash slots.
  • here, availability is more important than consistency. eventual consistency is more than sufficient.
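a simplified sketch of picking a shard from a client key (plain modulo hashing for illustration only, not true consistent hashing; real redis cluster maps CRC16(key) mod 16384 to hash slots, which avoids the mass remapping that plain modulo causes when shards are added or removed):

```python
import hashlib

REDIS_SHARDS = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]  # hypothetical shard addresses

def shard_for(key: str) -> str:
    """map a client key (userID, IP, or api_key) to one redis shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return REDIS_SHARDS[int(digest, 16) % len(REDIS_SHARDS)]

print(shard_for("user:42"))         # logged-in user -> userID decides the shard
print(shard_for("203.0.113.7"))     # guest -> IP decides the shard
print(shard_for("api_key:abc123"))  # developer -> api_key decides the shard
```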

  1. race condition:
    1. when two rate limiters are trying to access the same counter and are trying to modify the values.
    2. the popular way is to use locks, but that introduces extra overhead on the server and slows down the system.
    3. two more ways to handle this:
      1. lua script: https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff574d
      2. sorted sets data structure in redis: https://engineering.classdojo.com/blog/2015/02/06/rolling-rate-limiter/
  2. synchronization issue:
    1. when a client's requests land on different rate limiters interchangeably, a rate limiter may not have the data for that client because its earlier requests went to some other rate limiter.
    2. this can be solved using sticky sessions. this is easy but generally not scalable or flexible.
    3. better alternative is to use centralized data store like redis with multiple instances supported by consistent hashing.
  3. performance optimization:
    1. data centers in multiple locations to provide better latency for users from different regions.
    2. eventual consistency model. availability > consistency.

more on race condition:

  1. when two requests from the same client read the counter from the redis instance, they can each independently decide that the request is allowed, even though only one of them should be. they both then perform the write and decrement the counter, leading to an inconsistent state.
  2. this can be avoided by using lua scripts, which redis executes atomically. the entire read-decide-update sequence is performed in one go, which avoids the issue. (see the sketch after this list.)
  3. of course, we can apply a pessimistic lock but, as stated above, locks are resource heavy.
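a sketch of the earlier fixed-window counter with the read-decide-update folded into a single lua script registered via redis-py, so redis runs it atomically (the gist linked above has a fuller token-bucket version; key names and limits here are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local instance

# redis executes the whole script without interleaving other commands,
# so two gateways cannot both pass the check for the same last slot.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
    return 0
end
return 1
"""

allow = r.register_script(FIXED_WINDOW_LUA)

def is_allowed(client_id: str, limit: int, window_seconds: int) -> bool:
    return allow(keys=[f"ratelimit:{client_id}"], args=[limit, window_seconds]) == 1
```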

more on high availability:

  1. if a redis shard is down, all the client data residing in it is unavailable.
  2. there are two options for us here:
    1. fail-closed: reject all the requests related to the client data from the unavailable redis instance. but this will make our system unavailable for a certain group of clients.
    2. fail-open: accept all the requests related to the client data from the unavailable redis instance. there is no rate-limiting. this will make our system prone to heavy load.
  3. when we accept all requests without performing rate limiting, a viral event can send an overwhelming number of requests straight to our back-end servers, and the overload can spread to the services that depend on them. this is how cascading failures happen.
  4. hence, we usually avoid fail-open and prefer fail-closed. brief periods of rejected requests are better than a failure cascading over the entire system. fail-open also leaves us exposed to DDoS attacks and other malicious activity, since we are accepting all requests. (a small sketch of this choice follows.)
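a rough sketch of where that decision lives, reusing the redis-py counter from earlier (the fail_open flag is a made-up illustration):

```python
import redis

def is_allowed_with_fallback(r: redis.Redis, client_id: str,
                             limit: int, window_seconds: int,
                             fail_open: bool = False) -> bool:
    """counter check that decides what to do when the redis shard is unreachable."""
    key = f"ratelimit:{client_id}"
    try:
        count = r.incr(key)
        if count == 1:
            r.expire(key, window_seconds)
        return count <= limit
    except redis.exceptions.ConnectionError:
        # fail-open: accept everything (risk of overload / cascading failure)
        # fail-closed: reject while the shard is down (brief unavailability for these clients)
        return fail_open
```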

more on fault tolerance:

  1. the above happens only if there is some failure. to mitigate that, we can maintain replicas for each redis instance.
  2. every redis shard is now a master with a replica. redis cluster provides this built-in and manages promoting a replica to be the new master whenever the current master fails.
  3. redis performs replication asynchronously in the background, and it is usually quite fast.

more on scalability:

  1. when the requests are increasing, latency becomes an issue.
  2. network latency plays a role here because every rate-limiting operation requires a network round-trip to redis.
  3. to avoid this, we can maintain a pool of persistent connections. each TCP connection is reusable and the API gateways use these connections. (a sketch follows this list.)
  4. redis client libraries handle such connection pools as well.
  5. one more thing is geographic dependency. maintaining redis instances in data centers close to each API gateway will decrease latency by a good amount.
  6. but, again, consistency & synchronization across data centers are complex. anyway, eventual consistency is sufficient here, so this is not an issue.
  7. there are other things like caching in the API gateway, batching multiple requests from the same client, etc.
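a minimal sketch of a shared connection pool with redis-py (host name and pool size are made up):

```python
import redis

# one shared pool of persistent connections per gateway process; each command
# borrows a connection instead of paying a new TCP handshake per request.
pool = redis.ConnectionPool(host="redis-ratelimit", port=6379, max_connections=50)
r = redis.Redis(connection_pool=pool)

count = r.incr("ratelimit:user:42")  # reuses a pooled connection
```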

handling hot keys:

  1. in some systems, there can be clients that require high request throughput and overwhelm the API gateway with a lot of requests. these can be DDoS traffic or bots; they can also be legitimate users.
  2. some of the ways to handle this situation:
    1. here comes client-side rate limiting: every client in the application ecosystem should employ some kind of rate limiting on its end.
    2. the client can batch multiple requests (request coalescing), though that adds complexity. this burst of many simultaneous requests is called the thundering herd problem.
    3. switch them to a better tier: a separate redis shard for power users, etc.
  3. for abusive traffic, you can employ a black-list (maintained periodically), use Cloudflare for DDoS protection, etc.

modifying rules:

  1. rule modification is expensive.
  2. the best solution is to maintain a separate database and update the rules there. API gateways periodically sync with it and update their rules, but there will be some update delay. this is poll-based.
  3. in cases where this delay must be avoided, we maintain a dedicated configuration management system like ZooKeeper. this system pushes the configuration whenever there are changes, like a pub/sub model; it is also doable with redis and a custom configuration system. this is push-based. (a small pub/sub sketch follows.)
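a rough sketch of the push-based option using redis pub/sub (channel name and rule format are made up; ZooKeeper or a custom config service would play the same role):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# publisher side (hypothetical admin/config service): push the new rule set
r.publish("ratelimit:rules", json.dumps({"posts_per_day": 100}))

# subscriber side (each API gateway): apply rule changes as they arrive
rules = {}
pubsub = r.pubsub()
pubsub.subscribe("ratelimit:rules")
for message in pubsub.listen():
    if message["type"] == "message":
        rules.update(json.loads(message["data"]))
```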

hard rate limiting: requests cannot exceed the threshold no matter what. soft rate limiting: requests can exceed the threshold for a short amount of time.


References

  1. https://bytebytego.com/courses/system-design-interview/design-a-rate-limiter
  2. https://www.hellointerview.com/learn/system-design/problem-breakdowns/distributed-rate-limiter