Design a Rate Limiter (System Design)
Imagine you run a public API, and you want one simple rule:
- Every user gets to make at most 100 requests per minute, no matter which of your servers handles them.
- Go over that, and the next request gets politely turned away until the minute resets.
- And this has to hold even when you have ten servers running behind a load balancer.
Sounds easy, right? Just keep a count. But the moment you have more than one server, that little counter turns into a proper system design problem. So let’s design a real distributed rate limiter together, step by step. If you haven’t seen the basics yet, the Rate Limiting Explained page is a gentle warm-up. We’ll build on it here.
🎯 What We’re Building
Let’s name the thing plainly first.
- A rate limiter is a gatekeeper that caps how many requests a single client can make in a given stretch of time.
- A client is whoever is calling you. You usually spot them by their user account, their API key, or their IP address. (An IP address is just the network address of a machine, like a return address on a letter.)
- When a client goes over the cap, the limiter stops the request and sends back a rejection, so your real service never even sees it.
Our specific job for this case study:
- Allow each user up to 100 requests per minute across the whole API.
- Reject anything over that with the right signal.
- Make it work the same whether one server or fifty servers are handling traffic.
That last point is the whole game. One server is trivial. Many servers sharing one count is the interesting part.
📋 Requirements
Before drawing any boxes, a good engineer asks: what must this thing actually do? We split that into two buckets.
- A functional requirement is something the system must do, a feature you can point at.
- A non-functional requirement is about how well it does those things, like how fast or how reliable it is.
Here’s what our limiter must do. These are the functional ones:
- Cap requests per client per time window, like 100 per minute.
- Return a clear rejection when a client goes over the limit.
- Let us configure the limits, so different rules are easy to set up later.
And here’s how well it should do them. These are the non-functional ones:
- It must be fast. The limiter sits in front of every request, so it can only add a tiny bit of delay. A slow limiter slows down your whole API.
- It must be accurate. If the cap is 100, letting through 200 defeats the point.
- It must work across many servers. The count has to be the same no matter which server a request lands on.
Always ask before you design
In a real interview, pin down the limits and the scale first. Is it per user or per IP? Per minute or per second? How many requests per second overall? Nailing the requirements before drawing boxes is half the score.
🧮 Pick an Algorithm
“Keep a count” sounds simple, but there are a few ways to actually do the counting. Each one behaves a little differently. Here’s the quick recap.
| Algorithm | The idea in one line |
|---|---|
| Token Bucket | A bucket refills tokens over time; each request spends one, so short bursts are allowed. |
| Fixed Window | Count requests per fixed block of time; reset the count when the block ends. |
| Sliding Window | Count over a window that moves with the clock, which avoids spikes at the window edges. |
The Rate Limiting Explained page goes deep on each, so here we’ll just pick one and say why.
- Token bucket is the friendly one. Tokens drip into a bucket at a steady rate, and each request grabs one. If you’ve been quiet, your bucket is full, so you can fire off a quick burst. It’s great when occasional spikes are normal, and it’s cheap to store, since you only keep a token count and a timestamp.
- Sliding window is the strict one. It always looks back over the last sixty seconds from right now, so a client can’t sneak a double batch around the edge of a minute. It’s a little more work to store, but it gives a smooth, steady cap.
For most APIs, token bucket is the sweet spot. It tolerates the friendly bursts real users create, it’s simple to reason about, and it stores tiny state. So that’s what we’ll design around. If your rule must be razor-strict with no bursts at all, swap in sliding window. The rest of our design barely changes.
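To make the token bucket concrete, here is a minimal single-process sketch in Python. The class name `TokenBucket`, the helper `full_bucket`, and the choice to pass the clock in as `now` are all assumptions made for illustration; a real deployment would keep this state in the shared store rather than in one server's memory.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket state: just a token count and the time it was last topped up."""

    capacity: float        # most tokens the bucket can hold (the burst size)
    refill_per_sec: float  # steady drip rate
    tokens: float          # tokens available right now
    last_refill: float     # clock reading when `tokens` was last updated

    def allow(self, now: float) -> bool:
        # Top up for the time that has passed, never beyond capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        # Spend one token if there is one; otherwise reject.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def full_bucket(now: float, per_minute: int = 100) -> TokenBucket:
    """Hypothetical helper: a bucket that starts full, as for a quiet client."""
    return TokenBucket(
        capacity=float(per_minute),
        refill_per_sec=per_minute / 60.0,
        tokens=float(per_minute),
        last_refill=now,
    )


bucket = full_bucket(now=0.0)
# A quiet client fires 101 requests at once: 100 pass, the last is rejected.
burst = [bucket.allow(now=0.0) for _ in range(101)]
```

One second later about 1.67 tokens have dripped back in, so one more request passes and the one after it is rejected again.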
📍 Where It Lives
A rate limiter shouldn’t be buried deep inside your business code. You want to catch bad traffic as early as possible, before it wastes any real work.
- The popular spot is the API gateway. A gateway is the single front door that every request passes through before reaching your services.
- Putting the limiter here means a rejected request never touches your databases or your app logic. It gets stopped right at the entrance.
- It can also live as a small piece of middleware in front of your services. Middleware is code that runs on every request before your real handler does, like a checkpoint on the way in.
So picture a guard standing at the one door everyone walks through. They check your tally, and either wave you in or hold you back. That’s the gateway doing rate limiting.
🧩 The Core Challenge: Shared State
Here’s the part that makes this a real design problem, the bit interviewers love poking at.
- You don’t run one server. You run many, sitting behind a load balancer that spreads requests across all of them.
- Alex’s first request might hit server 1, the next hits server 4, the next hits server 2. The load balancer just sends each request wherever there’s room.
- If each server keeps Alex’s count in its own memory, none of them sees the full picture. Server 1 thinks Alex made 30 requests, server 4 thinks 25, server 2 thinks 28. Nobody knows the real total is 83.
So the count cannot live inside any single server’s memory. It has to live somewhere all the servers can reach.
- The fix is a shared store: one fast place that every server reads and writes the same counter in.
- The usual pick is Redis, a super-fast in-memory data store that’s brilliant at simple jobs like counting. “In-memory” means it keeps data in fast RAM instead of slow disk, so Redis itself answers in well under a millisecond. The network hop from your server to Redis is usually the bigger part of the delay.
- Every server checks Alex’s counter in Redis, not in its own head. So no matter which server a request lands on, they all agree on his true total.
All three servers point at the one Redis box. That single shared counter is what turns “many servers” back into “one source of truth”.
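The Alex example can be replayed in a few lines of Python. This is only a toy simulation: the `routing` list stands in for the load balancer's choices, and a plain integer stands in for the shared Redis counter.

```python
from collections import Counter

# Which server the load balancer picked for each of Alex's 83 requests.
routing = ["server-1"] * 30 + ["server-4"] * 25 + ["server-2"] * 28

# Local counting: each server only bumps its own in-memory counter.
local_counts = Counter(routing)

# Shared counting: every server bumps the same counter (Redis in the article).
shared_count = 0
for _server in routing:
    shared_count += 1

print(dict(local_counts))  # each server sees only its own slice
print(shared_count)        # the shared counter sees every request
```

No local counter ever gets past 30, while the shared counter holds the real total of 83.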
⚙️ How It Works
Let’s walk a single request through the limiter from start to finish.
- A request comes in and the gateway figures out who the client is, usually from their API key or IP address.
- It builds a key for that client and looks up their counter in Redis.
- If they’re under the limit, it adds one to the count and lets the request flow through to your service.
- If they’re over the limit, it stops the request right there and sends back a rejection. Your real service stays untouched.
That rejection has an official signal. When a client is over the limit, the limiter replies with the HTTP status code 429 Too Many Requests. It’s the web’s standard way of saying “you’ve sent too many, ease off”. A good 429 also carries a Retry-After header telling the client how long to wait, so well-behaved clients don’t just hammer you again.
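As a rough sketch of that rejection, the function below builds a 429 reply with a `Retry-After` header. The `Response` class and the `too_many_requests` name are made up for this example and do not come from any particular web framework.

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Response:
    """Hypothetical minimal HTTP response, just enough for the example."""

    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""


def too_many_requests(seconds_until_reset: float) -> Response:
    """Build a 429 reply that tells the client how long to wait."""
    # Retry-After takes whole seconds, so round up; never suggest waiting 0.
    retry_after = max(1, math.ceil(seconds_until_reset))
    return Response(
        status=429,
        headers={"Retry-After": str(retry_after)},
        body="Too Many Requests",
    )


reply = too_many_requests(seconds_until_reset=12.3)
```

Rounding up is a deliberate choice here: a client that waits the advertised time should find its allowance already reset.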
Now two small details make this correct, and they matter a lot.
- The read-and-increment must be atomic, meaning it happens as one indivisible step. Otherwise two requests could both read “99”, both think they’re fine, and both go through, pushing Alex to 101. Redis gives us this with a single `INCR` command that reads and bumps the count in one shot and hands back the new total to compare against the limit.
- The counter needs a TTL, which stands for Time To Live, a timer after which Redis deletes the key on its own. We set the TTL to the window length, say 60 seconds, once, when the key is first created. (Refreshing it on every request would keep pushing the reset further away.) When it expires, the count vanishes and the client gets a fresh allowance. No cleanup job needed.
Why atomic matters
Without an atomic increment, you have a race: two requests read the same old count at the same time and both decide they’re under the limit. Each one then writes back, and you’ve quietly let an extra request slip past. Redis doing the read-and-add as one operation closes that gap.
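The race can be replayed deterministically by writing out the bad interleaving by hand. In this sketch a plain dict plays the store for the unsafe version, and a small `AtomicCounter` class (a stand-in for Redis `INCR`, built with a lock) plays the safe one; both are illustrative assumptions, not Redis itself.

```python
import threading

LIMIT = 100

# Unsafe: read, decide, then write back as three separate steps.
store = {"user:123": 99}
read_a = store["user:123"]      # request A reads 99
read_b = store["user:123"]      # request B reads 99 before A has written
allowed_a = read_a < LIMIT      # A thinks it is under the limit
allowed_b = read_b < LIMIT      # B thinks so too
store["user:123"] = read_a + 1  # A writes 100
store["user:123"] = read_b + 1  # B also writes 100: one request went uncounted
unsafe_allowed = [allowed_a, allowed_b]


class AtomicCounter:
    """Stand-in for INCR: add one and return the new value as a single step."""

    def __init__(self, start: int = 0) -> None:
        self._value = start
        self._lock = threading.Lock()

    def incr(self) -> int:
        with self._lock:
            self._value += 1
            return self._value


# Safe: each request gets its own distinct result back and compares that.
counter = AtomicCounter(start=99)
safe_allowed = [counter.incr() <= LIMIT for _ in range(2)]
```

In the unsafe version both requests pass and the stored count still says 100. In the atomic version the first request sees 100 and passes, and the second sees 101 and is rejected.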
🗄️ Data Model
The thing we store is wonderfully small. For each client and window, we just need a count that expires on its own. Here’s the shape.
| Key | Value | TTL |
|---|---|---|
| user:123 | 83 (requests so far this window) | 60 seconds, then auto-deleted |
A few things to notice about this tiny model:
- The key identifies who we’re counting. `user:123` keeps Alex’s count separate from everyone else’s, so one busy user can’t eat into anyone else’s allowance.
- The value is just a number that goes up by one on each allowed request.
- The TTL does the resetting for us. When the 60 seconds is up, Redis deletes the key, and the next request starts a brand new count from one.
For a token bucket you’d store a touch more, like the token count plus the timestamp it was last refilled, but the spirit is the same: one tiny key per client, living in the shared store.
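Here is one way the key, value, and TTL could behave together, sketched with an in-memory dict instead of Redis. The `ExpiringCounters` class and its `incr` signature are invented for this example, and the clock is passed in as `now` so the expiry is easy to follow.

```python
from typing import Dict, Tuple


class ExpiringCounters:
    """Tiny in-memory stand-in for keys that hold a count plus a TTL."""

    def __init__(self) -> None:
        # key -> (count, expires_at)
        self._data: Dict[str, Tuple[int, float]] = {}

    def incr(self, key: str, now: float, ttl: float = 60.0) -> int:
        count, expires_at = self._data.get(key, (0, 0.0))
        if count == 0 or now >= expires_at:
            # Missing or expired key: start a fresh window and set the TTL once.
            count, expires_at = 0, now + ttl
        count += 1
        self._data[key] = (count, expires_at)
        return count


counters = ExpiringCounters()
# Three requests inside the first minute, then one right as the TTL runs out.
seen = [counters.incr("user:123", now=t) for t in (0.0, 10.0, 59.9, 60.0)]
print(seen)  # → [1, 2, 3, 1]
```

The expiry time is set only when a window starts and is left alone on later increments, which is what lets the count reset on schedule.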
🏗️ High-Level Design
Let’s zoom out and put the boxes together. The whole system is just a short chain.
Trace one request through it:
- The client hits the API gateway, which has the rate limiter built in.
- The limiter reads and increments the client’s counter in Redis, atomically.
- If the client is over the limit, the gateway returns `429` right away and stops.
- If they’re under, the request flows on to your backend services as normal.
Notice that Redis sits to the side as the shared brain, and the backend services only ever see traffic that already passed the check. That’s exactly what we wanted.
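The chain above can be sketched as a small piece of middleware that wraps a backend handler. Everything named here (`with_rate_limit`, `make_counting_limiter`, the dict-shaped request with a `user_id` field) is a hypothetical shape chosen for the example, and the limiter counts in local memory only to keep the sketch runnable.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]
Reply = Tuple[int, str]
Handler = Callable[[Request], Reply]


def with_rate_limit(handler: Handler, is_allowed: Callable[[str], bool]) -> Handler:
    """Wrap a backend handler so the limit check runs first."""

    def guarded(request: Request) -> Reply:
        client_key = "user:" + request["user_id"]
        if not is_allowed(client_key):
            return 429, "Too Many Requests"  # the backend is never called
        return handler(request)

    return guarded


def make_counting_limiter(limit: int) -> Callable[[str], bool]:
    """Stand-in for the shared-store check: counts in local memory only."""
    counts: Dict[str, int] = {}

    def is_allowed(key: str) -> bool:
        counts[key] = counts.get(key, 0) + 1
        return counts[key] <= limit

    return is_allowed


backend_calls: List[str] = []


def backend(request: Request) -> Reply:
    backend_calls.append(request["path"])
    return 200, "ok"


guarded = with_rate_limit(backend, make_counting_limiter(limit=2))
statuses = [guarded({"user_id": "123", "path": "/search"})[0] for _ in range(3)]
```

With a limit of 2, the third request gets a 429 and the backend records only two calls.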
📈 Scaling and Edge Cases
Now let’s stress-test the design and handle the tricky bits.
- Redis as the shared counter. Redis is fast enough to be the single source of truth even at heavy load, since each check is one quick in-memory operation. If one Redis instance isn’t enough, you can shard it by client key, so different clients’ counters live on different Redis nodes.
- What if Redis goes down? This is the big one, and there’s a real trade-off. Fail-open means if you can’t reach Redis, you allow the request anyway. Your API stays up, but the limit isn’t enforced for that moment. Fail-closed means you reject when Redis is down, which keeps the limit strict but can take your API offline. Most public APIs lean fail-open, because a brief unmetered burst hurts less than a full outage. A payment or login endpoint might fail-closed instead.
- Per-user vs per-IP. Keying on the user account is precise, but it only works once you know who the user is. For traffic that isn’t logged in yet, you fall back to the IP address. The catch: many users can share one IP, like a whole office or a phone network, so a pure per-IP limit can punish innocent people.
- Window edges. With a fixed window, a client can fire a full batch at the very end of one minute and another at the very start of the next, getting almost double the limit in a tiny stretch. That’s the boundary spike. Sliding window removes the edge entirely. Token bucket has no clock-aligned edge to exploit, and its bursts are capped by the bucket size you choose, which is another reason we leaned toward token bucket.
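Sharding by client key, from the first bullet above, can be as simple as hashing the key and taking it modulo the number of nodes. The function name `pick_shard` and the node names are placeholders for this sketch.

```python
import zlib
from typing import List


def pick_shard(client_key: str, nodes: List[str]) -> str:
    """Map a client key to one node; the same key always lands on the same node."""
    # zlib.crc32 is stable across processes. Python's built-in hash() is salted
    # per process for strings, so different servers would disagree on the node.
    index = zlib.crc32(client_key.encode("utf-8")) % len(nodes)
    return nodes[index]


nodes = ["redis-a", "redis-b", "redis-c"]  # placeholder node names
shard_for_alex = pick_shard("user:123", nodes)
shards_used = {pick_shard("user:" + str(i), nodes) for i in range(100)}
```

Plain modulo is the simplest possible scheme. It remaps most keys whenever the node count changes, so treat it as a starting point rather than a finished sharding strategy.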
Decide fail-open vs fail-closed on purpose
Don’t leave this to chance. If your Redis call quietly throws an error and your code happens to crash the request, you’ve accidentally chosen fail-closed for your whole API. Pick the behavior deliberately and wrap the Redis call so a failure does exactly what you intend.
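One way to make that choice explicit is to wrap the store call and name the policy in the signature. In this sketch `StoreUnavailable` is a hypothetical exception; a real Redis client raises its own error types, which you would catch instead.

```python
from typing import Callable


class StoreUnavailable(Exception):
    """Hypothetical error for when the shared store cannot be reached."""


def is_allowed(check: Callable[[str], bool], client_key: str, fail_open: bool) -> bool:
    """Run the limit check; on a store failure, apply the chosen policy."""
    try:
        return check(client_key)
    except StoreUnavailable:
        # Explicit policy: True lets the request through (fail-open),
        # False rejects it (fail-closed).
        return fail_open


def broken_check(client_key: str) -> bool:
    raise StoreUnavailable("store is down")


def healthy_check(client_key: str) -> bool:
    return True


public_api = is_allowed(broken_check, "user:123", fail_open=True)
login_endpoint = is_allowed(broken_check, "user:123", fail_open=False)
```

Only the store failure is caught here. Any other exception still surfaces, so a bug in the check is not silently treated as an outage.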
🧰 Tech Choices
Part of system design is not just naming pieces, it’s saying why you picked each one. Here are the main technology decisions for this system and the reason behind each.
| Decision | Choice | Why |
|---|---|---|
| Limiting algorithm | Token bucket | Allows short bursts while holding a steady average rate. |
| Shared count across servers | Redis | Fast, atomic counters all servers can share. |
| Where it runs | At the API gateway | Blocks abuse at the front door before it reaches services. |
| Safe updates | Atomic increment | Counting stays correct even under many requests at once. |
⚠️ Common Mistakes and Misconceptions
A few things trip people up here. Let’s clear them out.
- “I’ll just keep counts in app memory.” That falls apart the instant you have more than one server, since each server only sees its own slice. The whole point of this design is moving the count into a shared store like Redis.
- “Don’t worry about Redis being down.” You must. Redis sits in the path of every request, so its failure is your failure. Decide fail-open or fail-closed up front, on purpose.
- “Just limit by IP, it’s simpler.” Risky. Lots of real users share one IP, so a per-IP cap can lock out an entire office. Prefer per-user when you know the user, and use IP only as a fallback for anonymous traffic.
- “Skip the atomic increment, a normal read then write is fine.” Not under load. Two requests can read the same count and both slip through. Use Redis’s atomic `INCR` so the read and bump happen as one step.
- “A 429 means my server is broken.” Not at all. A `429` is the limiter working correctly and saying “slow down”. A `500` is the broken one.
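The per-user versus per-IP advice from the list above fits in one small function. The name `client_key` and the `ip:` prefix are assumptions for this sketch; the `user:` prefix follows the `user:123` key used earlier.

```python
from typing import Optional


def client_key(user_id: Optional[str], ip_address: str) -> str:
    """Prefer the user account; fall back to the IP for anonymous traffic."""
    if user_id:
        return "user:" + user_id
    return "ip:" + ip_address


logged_in = client_key("123", "203.0.113.7")
anonymous = client_key(None, "203.0.113.7")
```

Separate prefixes keep the two kinds of counter apart, so a logged-in user is never counted against the IP they share with others.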
🛠️ Design Challenge
Try extending the design yourself. Think each one through on your own first.
Different limits per endpoint. Say /search allows 100 per minute but /upload only 10. How would you change the key so each endpoint counts separately?
Burst handling. A client is quiet for ten minutes, then legitimately needs 30 requests in two seconds. Which algorithm allows that, and what state do you store?
🧩 What You’ve Learned
You can now design a distributed rate limiter from scratch and talk through it clearly. Here’s what you picked up.
- ✅ The core job: cap requests per client per window, and reject the overflow with `429 Too Many Requests`.
- ✅ Functional vs non-functional requirements, and why you gather them first.
- ✅ Picking an algorithm, with token bucket as the friendly default and sliding window for strict caps.
- ✅ The limiter lives at the API gateway, the single front door, so bad traffic is stopped early.
- ✅ The core challenge is shared state: many servers must agree on one count, so the counter lives in Redis, not server memory.
- ✅ Correctness leans on an atomic increment and a TTL that resets the window on its own.
- ✅ A tiny data model: a key like `user:123` mapped to a count, with a TTL for the window.
- ✅ Scaling and edge cases: sharding Redis, fail-open vs fail-closed, per-user vs per-IP, and window-edge spikes.
Check Your Knowledge
Test what you learned.
1. Why can't each server keep the client's count in its own memory?
   Answer: Because the load balancer spreads requests, the count must live in a shared store like Redis that every server reads and writes.
2. Why must the read-and-increment of the counter be atomic?
   Answer: An atomic operation like Redis `INCR` reads and bumps the count in one step, so two requests can't both think they're under the limit at once.
3. How does the counter reset at the end of each time window?
   Answer: Setting a TTL equal to the window length means Redis removes the key when it expires, so the next request starts a fresh count.
4. What does 'fail-open' mean if Redis becomes unreachable?
   Answer: Fail-open lets requests pass when Redis is down, so the API stays up even though the limit isn't enforced for that brief time.
🚀 What’s Next?
This case study leans on two ideas worth going deeper on.
- Rate Limiting Explained walks through the algorithms in detail, the trade-offs between them, and exactly when each one shines.
- Introduction to Redis shows you the fast in-memory store that powers the shared counter at the heart of this design.
Get comfortable with those, then come back and try the design challenge again. The whole system will click into place.