Rate Limiting Explained


Picture this. You built a nice little API, and one morning you check your dashboard:

One single client is overloading your server with thousands of requests every second.
Your database is sweating, your cloud bill is climbing, and everyone else’s requests are now crawling.
Maybe it’s a buggy script. Maybe it’s someone trying to break in. Maybe it’s an actual attack.

Whatever it is, you need a way to say “that’s enough, slow down”. That’s exactly what rate limiting does, and we’ll build up the idea one piece at a time. By the end you’ll be able to explain it cleanly, even to an interviewer.

🎯 The Problem

Here’s the pain, plain and simple:

A server can only handle so many requests at once. There’s a limit to its CPU, its memory, and its database connections.
Now imagine one client sends way more than its fair share. It floods the server with requests back to back.
That one greedy client eats up resources that everyone else needed. So your other users see slow responses or errors, even though they did nothing wrong.

And it gets worse than just slowness:

Every request usually costs you money. More database reads, more compute, more bandwidth, all of it adds up on your bill.
A big enough flood can take the whole service down for everyone. When attackers do this on purpose with tons of machines, it’s called a DDoS attack (Distributed Denial of Service, basically many computers flooding you at once to knock you offline).
Without a cap, you’re trusting every client to behave. And on the open internet, that’s a bet you will lose.

So we need a gatekeeper that watches how much each client is asking for, and steps in when someone goes overboard.

🚪 Real-World Analogy

Think of a popular nightclub with a bouncer at the door. Here’s how the bouncer keeps things sane:

The club has a limit on how many people it can safely hold. So the bouncer doesn’t just let everyone rush in at once.
He lets people in at a steady pace, like only so many per minute. If the crowd shows up faster than that, the rest wait in line outside.
If someone keeps trying to push in over and over, he just says “not right now” and turns them away.

A rate limiter is that bouncer for your API:

The club is your server, with its safe capacity.
Each person is an incoming request.
The bouncer letting in only so many per minute is the limit you set.
And the “not right now” is the rejection your server sends back when a client goes over.

Keep this bouncer in your head. Everything below maps back to him.

🧮 What is Rate Limiting

Let’s define it clearly. Rate limiting is capping how many requests a single client can make in a given time window. Let’s unpack that one piece at a time:

A client is whoever is calling your API. You usually identify them by something like their user account, their API key, or their IP address. (An IP address is just the network address of a machine, like a return address on a letter.)
A time window is a stretch of time you measure over, like one minute or one hour.
The limit is the number of requests you allow inside that window. Something like “100 requests per minute per client”.

So if Alex is allowed 100 requests per minute and sends 100, fine. The 101st request inside that same minute gets turned away. Once the minute rolls over, Alex gets a fresh allowance again.

Rate limiting vs throttling

You’ll hear the word throttling too. They’re closely related. Rate limiting is the rule itself, “100 per minute”. Throttling is the act of slowing a client down or holding their requests once they hit that rule. In everyday talk people use them almost interchangeably, so don’t lose sleep over the difference.

⚙️ How It Works

The core idea is simpler than you’d think. The rate limiter just keeps a count for each client:

For every incoming request, it figures out who the client is, usually from their API key or IP address.
It looks up how many requests that client has made in the current window.
If they’re under the limit, it adds one to their count and lets the request through to your server.
If they’re over the limit, it stops the request right there and sends back a rejection. Your real server never even sees it.
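Those four steps can be sketched in a few lines of Python. This is only a minimal sketch, assuming a fixed one-minute window and a plain in-memory dictionary. The names allow_request, LIMIT, and WINDOW_SECONDS are made up for illustration, and a real limiter would also clean up old counts.

```python
import time
from collections import defaultdict

LIMIT = 100          # allowed requests per window (illustrative number)
WINDOW_SECONDS = 60  # window length in seconds (illustrative number)

# Maps (client_id, window_number) -> requests counted in that window.
# Old windows are never removed here; a real limiter would expire them.
counts = defaultdict(int)


def allow_request(client_id, now=None):
    """Return True if the request may pass, False if it should be rejected."""
    if now is None:
        now = time.time()
    window = int(now // WINDOW_SECONDS)  # which block of time we are in
    key = (client_id, window)
    if counts[key] >= LIMIT:
        return False                     # over the limit: stop it here
    counts[key] += 1                     # under the limit: count it and allow
    return True
```

The client_id can be whatever you key on, such as an API key or an IP address. The optional now argument is only there so the function is easy to try out with made-up timestamps.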

That rejection has an official signal. When a client is over the limit, the reply carries the HTTP status code 429 Too Many Requests. It’s the web’s standard way of saying “you’ve sent too many, ease off”. (An HTTP status code is just a short number that tells the client how the request went, like 200 for success or 404 for not found.)

Here’s the whole decision in one picture.

Often the response also tells the client when they can try again, using a header called Retry-After. (A header is just a small piece of extra info attached to a request or response.) That’s the polite version of the bouncer saying “come back in thirty seconds”.

Be kind with your 429s

A good 429 response says more than just “no”. Include a Retry-After header so well-behaved clients know exactly how long to wait. It turns a frustrating wall into a clear instruction, and it stops clients from retrying in a tight loop that only makes things worse.
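Here’s one way such a response could be put together. Treat it as a sketch: it assumes a fixed window aligned to whole minutes, and the build_429 function and the JSON body shape are invented for this example. The Retry-After value is given as a whole number of seconds, which is one of the forms the header accepts.

```python
import json
import math

WINDOW_SECONDS = 60  # illustrative window length


def build_429(now, window_seconds=WINDOW_SECONDS):
    """Build a (status, headers, body) triple for a client that is over the limit.

    Assumes fixed windows that start at multiples of window_seconds.
    """
    window_end = (now // window_seconds + 1) * window_seconds
    retry_after = max(1, math.ceil(window_end - now))  # whole seconds, at least 1
    headers = {
        "Retry-After": str(retry_after),
        "Content-Type": "application/json",
    }
    # Hypothetical body shape; use whatever error format your API already has.
    body = json.dumps({"error": "rate_limited", "retry_after_seconds": retry_after})
    return 429, headers, body
```

A client that hits the limit halfway through the minute would get Retry-After: 30, so it knows to wait instead of hammering the endpoint.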

🔁 Rate Limiting Algorithms

Now, “keep a count” sounds easy, but there are a few different ways to actually do the counting. Each one has its own personality. Here are the common ones you’ll meet.

Algorithm	The idea in one line
Token Bucket	A bucket refills tokens over time; each request spends one, so short bursts are allowed.
Leaky Bucket	Requests queue up and drain out at a steady rate, smoothing traffic into an even flow.
Fixed Window	Count requests per fixed block of time (each minute); reset the count when the block ends.
Sliding Window	Count over a window that moves with the clock, which avoids spikes at the window edges.

Let’s add a little colour to a few of these, because the differences matter:

Token bucket is the friendly one. It hands out tokens at a steady rate into a bucket, and each request grabs one. If you’ve been quiet for a bit, your bucket is full, so you can fire off a quick burst. That makes it great when occasional spikes are normal.
Fixed window is the simplest to build, but it has a sneaky flaw. Because the count resets exactly when each block ends, a client can send a full batch at the very end of one minute and another full batch at the very start of the next. That’s a boundary spike: up to double the limit in a tiny stretch of time.
Sliding window fixes that. Instead of resetting on a hard clock tick, it always looks back over the last sixty seconds from right now. So that edge-of-the-minute trick doesn’t work anymore, and traffic stays smooth.
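To see the boundary spike in action, here’s a small simulation. It’s a sketch under a few assumptions: a limit of 100 per minute, timestamps in milliseconds, and the simplest form of sliding window (a log of recent request times). The function names are made up for the example.

```python
from collections import deque

LIMIT = 100
WINDOW_MS = 60_000  # one minute, in milliseconds


def fixed_window_allowed(timestamps):
    """Count how many requests a fixed-window limiter lets through."""
    counts = {}
    allowed = 0
    for t in timestamps:
        window = t // WINDOW_MS            # which minute this request falls in
        if counts.get(window, 0) < LIMIT:
            counts[window] = counts.get(window, 0) + 1
            allowed += 1
    return allowed


def sliding_window_allowed(timestamps):
    """Same question for a sliding window that looks back WINDOW_MS from each request."""
    log = deque()                          # times of recently allowed requests
    allowed = 0
    for t in timestamps:
        while log and log[0] <= t - WINDOW_MS:
            log.popleft()                  # forget requests older than the window
        if len(log) < LIMIT:
            log.append(t)
            allowed += 1
    return allowed


# 100 requests in the last second of minute one, then 100 in the first second of minute two.
burst = [59_000 + i * 10 for i in range(100)] + [60_000 + i * 10 for i in range(100)]
print(fixed_window_allowed(burst))    # 200
print(sliding_window_allowed(burst))  # 100
```

The fixed window lets all 200 through in about two seconds, because the second batch lands in a fresh block. The sliding window still remembers the first batch, so it holds the line at 100.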

Which one should you pick?

For most APIs, token bucket or sliding window is the sweet spot. Token bucket if you want to tolerate friendly bursts, sliding window if you want a strict and steady cap without boundary spikes. Fixed window is fine when you just need something quick and rough.
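If you go with token bucket, the core logic is small. The sketch below assumes one bucket per client, a bucket that starts full, and tokens that refill continuously based on elapsed time. The class name TokenBucket and its parameters are illustrative, not taken from any particular library.

```python
class TokenBucket:
    """A single client's bucket. Keep one of these per client."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity                  # most tokens the bucket can hold
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)             # start full, so an early burst is fine
        self.last = now                           # when we last topped up

    def allow(self, now):
        """Top up for the time that has passed, then try to spend one token."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = max(self.last, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=5, refill_per_second=1.0)
print([bucket.allow(0.0) for _ in range(6)])  # [True, True, True, True, True, False]
print(bucket.allow(1.0))                      # True
```

The capacity sets how big a burst you tolerate, and the refill rate sets the long-run average. Here a client can fire five requests at once, then settles to one per second.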

📍 Where It Runs

A rate limiter isn’t tied to one spot. It can live at a few different layers, and big systems often use more than one:

At the API gateway. A gateway is the single front door that all requests pass through before reaching your services. Putting the limiter here is popular, because you catch bad traffic early, before it touches anything important.
At the load balancer. A load balancer spreads incoming requests across your servers. Some can enforce simple limits right there at the edge.
In the application. Your own code can check limits too, which is handy for rules that depend on business logic, like “free users get fewer requests than paid users”.
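A business rule like that one can be as simple as a lookup. This is a toy sketch: the plan names, the numbers, and the shape of the user dictionary are all made up for illustration.

```python
# Hypothetical plans and limits, for illustration only.
LIMITS_PER_MINUTE = {"free": 60, "paid": 600}
DEFAULT_LIMIT = 30  # used when the client has no recognised plan


def limit_for(user):
    """Pick the per-minute limit from the user's plan."""
    return LIMITS_PER_MINUTE.get(user.get("plan"), DEFAULT_LIMIT)
```

The limiter then compares each client’s count against limit_for(user) instead of one hard-coded number.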

But there’s a catch once you have more than one server. Each server only knows about the requests it personally handled. So how do they agree on a single shared count?

The usual answer is a fast shared store, most commonly Redis. (Redis is a super-fast in-memory data store, great at simple jobs like counting.)
Every server reads and updates the same counter in Redis. So no matter which server a request lands on, they all see the true total for that client.
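Here’s roughly what that shared counter could look like. It’s a sketch, not a production recipe: it assumes a fixed window, uses a made-up key format, and only relies on a store offering Redis-style incr and expire calls (the redis-py client has both). A tiny in-memory stand-in is included so the example runs without a Redis server.

```python
import time

LIMIT = 100
WINDOW_SECONDS = 60


def allow_request(store, client_id, now=None):
    """Check a shared per-client counter.

    `store` is anything with Redis-style incr(key) and expire(key, seconds),
    for example a redis.Redis client from the redis-py package.
    """
    if now is None:
        now = time.time()
    window = int(now // WINDOW_SECONDS)
    key = f"ratelimit:{client_id}:{window}"    # hypothetical key naming scheme
    count = store.incr(key)                    # creates the key at 1 if it is missing
    if count == 1:
        store.expire(key, WINDOW_SECONDS * 2)  # let old windows clean themselves up
    return count <= LIMIT


class InMemoryStore:
    """Stand-in for local experiments. Unlike Redis, it is not shared across processes."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, seconds):
        return True  # expiry is ignored in this stand-in
```

Because the increment happens inside the store, two servers handling the same client at the same moment still end up with one consistent total.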

⚡ Benefits

So why is this worth adding? A good rate limiter quietly pays for itself:

It stops abuse. Brute-force login attempts, scrapers, and floods all get capped before they do damage.
It keeps things fair. No single client can hog the server. Everyone gets their slice, so the experience stays steady for all users.
It protects you from spikes. A sudden surge, whether it’s an attack or just a viral moment, gets smoothed out instead of crashing you.
It controls cost. Fewer wasted requests means a smaller cloud bill. You’re not paying to serve traffic that’s just abuse.

A classic real example is login: many sites allow only a handful of failed login attempts per minute from one IP. That single rule makes it painfully slow for an attacker to guess passwords, while a normal user who fumbles their password once or twice barely notices.

⚠️ Common Mistakes and Misconceptions

A few things trip people up here. Let’s clear them out:

“Rate limiting blocks a client forever.” No. It only blocks requests over the limit, and only inside the current window. Once the window passes, the client is welcome again.
“One global counter is enough.” Usually not. You want the count per client, otherwise one busy user could use up the whole limit and lock everyone else out.
“A 429 means the server is broken.” Not at all. A 429 is the server working correctly and politely saying “you’ve sent too many, slow down”. A 500 is the broken one.
“Fixed window is good enough everywhere.” Watch out for that boundary spike. If precise limits matter, reach for sliding window instead.
“I can just count in memory on each server.” That falls apart the moment you have more than one server, since they won’t share the count. Use a shared store like Redis.

🛠️ Design Challenge

Try this one on your own to test yourself.

You’re protecting a login endpoint. Real users sometimes mistype their password a couple of times, but attackers try thousands of guesses. Design a rate limit for it.

What would you key the limit on, the IP address, the username, or both? Think about the trade-offs.


What numbers feel right, like how many attempts per minute before you start saying 429?


Should a blocked client be told when to try again? What would that response look like?


🧩 What You’ve Learned

You can now explain rate limiting from end to end. Here’s what you’ve picked up:

✅ Rate limiting caps how many requests a client can make in a time window, to stop abuse and protect the service.
✅ Requests under the limit pass through; requests over it get rejected with 429 Too Many Requests.
✅ The main algorithms are token bucket, leaky bucket, fixed window, and sliding window, each with its own trade-offs.
✅ Token bucket allows bursts, fixed window can spike at boundaries, and sliding window smooths that out.
✅ It usually runs at the API gateway, and across many servers it shares a counter in a fast store like Redis.


🚀 What’s Next?

You’ve got the pattern down. Next, go a little deeper into the pieces it leans on.

Introduction to Redis shows you the fast in-memory store that powers shared counters behind real rate limiters.
Authentication vs Authorization explains how you identify and check the very clients you’re rate limiting.

Get comfortable with those, and you’ll have a solid grip on the building blocks every system design interview keeps coming back to.
