Distributed Locks Explained
Picture this. A customer named Alex places one order, and your system happens to be running on two servers at the same time:
- Both servers get told “hey, go charge this order.”
- Server A starts charging Alex’s card. At almost the same instant, Server B also starts charging it.
- Neither one knows the other is doing it. So Alex gets charged twice for the same order.
That’s not a rare freak accident, by the way. The moment you run on more than one machine, this kind of double-action becomes a real, everyday risk. A distributed lock is the tool we use to stop it.
🎯 The Problem
Here’s the pain, in plain words:
- When you run on many machines, more than one of them might try to do the same critical action at the same time.
- A “critical action” is something that must happen exactly once, like charging a card, shipping an item, or handing out the last seat on a flight.
- If two machines do it together, you get a conflict. Alex gets charged twice, or two people get sold the same single seat.
The name for this mess is a race condition. A race condition is when two or more things run at the same time and the final result depends on who happens to win the “race.” And when the result depends on luck like that, sooner or later luck goes the wrong way.
So the question becomes: how do we make sure that across all our machines, only one of them does the critical action at a time?
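To make the race concrete, here’s a minimal sketch that replays the bad timing step by step. Everything in it (the `orders` dictionary, `check`, `charge`) is a hypothetical stand-in for a real database and payment call, and the two “servers” are simulated in one script so the interleaving is predictable.

```python
# Hypothetical in-memory "database" that both servers read and write.
orders = {"order-123": {"charged": False}}
charges = []  # every charge made against Alex's card


def check(order_id):
    """Step 1: has this order been charged yet?"""
    return orders[order_id]["charged"]


def charge(server, order_id):
    """Step 2: charge the card and mark the order as charged."""
    charges.append((server, order_id))
    orders[order_id]["charged"] = True


# The unlucky interleaving: both servers run step 1 before either runs step 2.
a_sees_charged = check("order-123")  # Server A sees False
b_sees_charged = check("order-123")  # Server B also sees False

if not a_sees_charged:
    charge("A", "order-123")
if not b_sees_charged:
    charge("B", "order-123")

print(len(charges))  # 2: Alex was charged twice
```

Each server did a sensible check first. The problem is the gap between checking and acting, which is exactly what a lock closes.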
🔒 What is a Distributed Lock
Let’s define it clearly first:
- A distributed lock is a lock that works across multiple machines, so only one node can hold it at a time, and only the holder is allowed to do the protected action.
- A “node” here just means one running machine or process in your system.
- While one node holds the lock, every other node has to wait its turn. Nobody else gets to touch the protected action.
The idea it gives you is called mutual exclusion. Mutual exclusion means only one node can be inside the critical action at any moment, and everyone else is excluded until it’s done. Think of it like a single bathroom key in an office. There’s one key, only the person holding it can go in, and everyone else waits outside until the key comes back.
Why not just use a normal lock?
A normal lock from your programming language only works inside one process on one machine. The other server has no idea that lock exists. A distributed lock lives outside any single machine, in a shared service all the nodes can see, so it can coordinate across all of them.
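Here’s a small sketch of why that fails. The `Server` class is a hypothetical stand-in for one machine, and both “machines” live in one script only so the example can run anywhere. The point is that each one has its own lock object, just like two real servers would.

```python
import threading


class Server:
    """Hypothetical stand-in for one machine with its own in-process lock."""

    def __init__(self, name):
        self.name = name
        self.local_lock = threading.Lock()  # exists only inside this "machine"


server_a = Server("A")
server_b = Server("B")

# Each server grabs its own lock without blocking. Neither can see the other's.
a_got_it = server_a.local_lock.acquire(blocking=False)
b_got_it = server_b.local_lock.acquire(blocking=False)

print(a_got_it, b_got_it)  # True True: both think they are the only one

server_a.local_lock.release()
server_b.local_lock.release()
```

Both acquisitions succeed, so nothing stops both servers from charging the card.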
⚙️ How It Works
The trick is that the lock doesn’t live inside any one node. It lives in a shared lock service that every node can talk to. A lock service is just a separate system whose job is to hand out the lock to one node at a time.
Here’s the flow when a node wants to do the critical action:
- The node asks the lock service: “can I have the lock for order 123?”
- If the lock is free, the service grants it. Now this node is the holder.
- The node does its protected work, like charging the card. Just once, safely.
- When it’s done, it releases the lock so someone else can have it.
- Any other node that asked in the meantime has to wait, then it gets its turn after the lock is released.
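So even though two nodes both wanted to act, only one gets the lock and acts. The other one waits, and when its turn comes it first checks whether the order has already been charged. If it has, there’s nothing left to do.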
The whole point: the lock service is the single referee. Both servers ask the same referee, so exactly one of them wins and the other holds back.
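The flow above can be sketched with a toy lock service. This is an in-memory model, not a real service: `LockService`, `charge_once`, and the `charged_orders` record are all hypothetical names, and the sketch assumes the node re-checks the order state once it holds the lock.

```python
class LockService:
    """Hypothetical in-memory lock service; a real one is a separate system."""

    def __init__(self):
        self._holders = {}  # lock name -> node currently holding it

    def acquire(self, name, node):
        """Grant the lock if it is free. Returns True only to the new holder."""
        if name in self._holders:
            return False
        self._holders[name] = node
        return True

    def release(self, name, node):
        """Free the lock, but only if the caller is the current holder."""
        if self._holders.get(name) == node:
            del self._holders[name]


service = LockService()
charged_orders = set()  # hypothetical shared record of completed charges
charges = []


def charge_once(node, order_id):
    """Charge the order only if nobody has charged it already."""
    if order_id not in charged_orders:
        charges.append((node, order_id))
        charged_orders.add(order_id)


lock = "order:123"
a_granted = service.acquire(lock, "A")  # lock is free, so A becomes the holder
b_granted = service.acquire(lock, "B")  # refused: A still holds the lock

charge_once("A", "123")                 # A does the protected work
service.release(lock, "A")              # A is done and hands the lock back

b_retry = service.acquire(lock, "B")    # now B gets its turn
charge_once("B", "123")                 # B sees the order is charged, does nothing
service.release(lock, "B")

print(a_granted, b_granted, b_retry, charges)
# True False True [('A', '123')]
```

Notice that `release` checks who is asking. A node should never be able to free a lock it doesn’t hold.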
⏳ Locks Need a Timeout
Now here’s a scary question. What if the node holding the lock crashes before it releases it? Without a safety net, that lock stays held forever, and every other node waits forever too. The whole system gets stuck.
So real distributed locks come with a built-in expiry:
- A lease is a lock you only get to hold for a limited time. When the time runs out, the lock auto-releases, even if you never let go of it.
- The length of that time is called the TTL, which stands for Time To Live. It’s just how long the lock stays valid before it expires on its own.
- So if a node grabs the lock and then crashes, the TTL runs out, the lock frees itself, and another node can pick up the work. No permanent freeze.
Think of it like a hotel room key card that stops working at checkout time. You don’t have to hand it back for the room to free up. It just expires.
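Here’s a minimal sketch of a lease, assuming a toy in-memory service. `LeaseLockService` and its methods are hypothetical names, and the clock is a fake one that the script moves forward by hand so you can see the expiry without waiting.

```python
class LeaseLockService:
    """Hypothetical in-memory lock service where every grant carries a TTL."""

    def __init__(self, clock):
        self._clock = clock  # function returning the current time in seconds
        self._leases = {}    # lock name -> (holder, expiry time)

    def acquire(self, name, node, ttl_seconds):
        """Grant the lock if it is free or the previous lease has expired."""
        now = self._clock()
        lease = self._leases.get(name)
        if lease is not None and lease[1] > now:
            return False  # someone still holds an unexpired lease
        self._leases[name] = (node, now + ttl_seconds)
        return True

    def holder(self, name):
        """Return the current holder, or None if the lease has run out."""
        lease = self._leases.get(name)
        if lease is None or lease[1] <= self._clock():
            return None
        return lease[0]


# Fake clock so the example is repeatable: now[0] is "seconds since start".
now = [0.0]
service = LeaseLockService(clock=lambda: now[0])

a_granted = service.acquire("order:123", "A", ttl_seconds=30)
# Server A crashes here and never calls release.

now[0] = 10.0
b_early = service.acquire("order:123", "B", ttl_seconds=30)  # lease still valid

now[0] = 31.0
b_late = service.acquire("order:123", "B", ttl_seconds=30)   # lease expired

print(a_granted, b_early, b_late)  # True False True
```

Only the service’s own clock decides when the lease ends here, which keeps the toy simple. Real setups also have to worry about what the holder believes the time is.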
The TTL is a tradeoff
Pick a TTL too short and the lock might expire while the holder is still doing real work, so a second node jumps in. Pick it too long and a crashed holder blocks everyone for ages. There’s no perfect number. You tune it to how long the job usually takes.
🛠️ How They’re Built
You usually don’t build a lock service from scratch. People reach for a battle-tested tool:
- Redis is a fast in-memory data store, and it’s a common way to do locks. The well-known recipe for doing it across several Redis servers is called Redlock.
- ZooKeeper is a coordination service made exactly for this kind of job, and it’s very good at agreeing on “who holds what” across machines.
- etcd is a similar coordination store, often used in Kubernetes setups, and it handles locks and leases too.
The common thread: all of them give you a shared, reliable place to keep the lock that every node can see and trust.
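To give a feel for what this looks like with Redis, here’s a hedged sketch of the common single-server pattern: set a key only if it doesn’t exist, with an expiry, and delete it only if it still holds your token. This is not Redlock, which spreads the same idea over several servers. The `set(..., nx=True, px=...)` and `eval(script, numkeys, ...)` calls follow the redis-py client. The `acquire` and `release` helpers and the `FakeRedis` class are hypothetical. The fake exists only so the sketch runs without a server, and it ignores the expiry.

```python
import uuid

# Lua script run by Redis: delete the key only if it still holds our token.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def acquire(client, key, ttl_ms):
    """Try to take the lock. Returns a random token on success, else None."""
    token = uuid.uuid4().hex
    # nx=True: set only if the key does not exist. px: expiry in milliseconds.
    if client.set(key, token, nx=True, px=ttl_ms):
        return token
    return None


def release(client, key, token):
    """Release the lock only if we still own it. Returns True if deleted."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1


class FakeRedis:
    """Hypothetical stand-in with just the two calls used above. No expiry."""

    def __init__(self):
        self.data = {}

    def set(self, name, value, nx=False, px=None):
        if nx and name in self.data:
            return None
        self.data[name] = value
        return True

    def eval(self, script, numkeys, *keys_and_args):
        key, token = keys_and_args  # mimics the compare-and-delete script
        if self.data.get(key) == token:
            del self.data[key]
            return 1
        return 0


client = FakeRedis()  # with a real server you would pass a redis-py client
token_a = acquire(client, "lock:order:123", ttl_ms=30_000)
token_b = acquire(client, "lock:order:123", ttl_ms=30_000)   # None: A holds it
wrong = release(client, "lock:order:123", "not-the-token")   # False
released = release(client, "lock:order:123", token_a)        # True
```

The random token matters. Without it, a slow node could delete a lock that had already expired and been handed to someone else.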
⚠️ The Hard Parts
I’ll be honest with you here. Distributed locks are genuinely tricky, and a lot of smart people have gotten them wrong. Here’s why they’re hard:
- A node can crash while holding the lock. The TTL saves you from a permanent freeze, but it also opens the door to two holders if the timing goes bad.
- Clocks drift between machines. Different servers don’t agree perfectly on the time. So one machine might think the lease is still valid while another thinks it already expired.
- Network delays sneak in. A node might think it still holds the lock, but its release or renewal message got stuck in the network. Meanwhile the lock expired and someone else grabbed it. Now two nodes both believe they’re the holder, and that’s exactly the double-charge we were trying to avoid.
A common safety trick for this is a fencing token, which is just a number that increases every time the lock is handed out. The holder sends its token along with each request, and the protected resource rejects any token older than the newest one it has seen, so a confused old holder gets turned away. You don’t need to master it now, just know that real-world locks have extra guards like this for a reason.
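Here’s a minimal sketch of the fencing idea, assuming the protected resource can remember the highest token it has seen. `LockService`, `PaymentStore`, and their methods are hypothetical names made up for this example.

```python
class LockService:
    """Hypothetical lock service that issues a bigger token on every grant."""

    def __init__(self):
        self._next_token = 1

    def grant(self):
        token = self._next_token
        self._next_token += 1
        return token


class PaymentStore:
    """Hypothetical protected resource that remembers the highest token seen."""

    def __init__(self):
        self.highest_token = 0
        self.writes = []

    def write(self, token, payload):
        """Accept the write unless it comes from an older lock holder."""
        if token < self.highest_token:
            return False  # stale holder: its lease was replaced by a newer one
        self.highest_token = token
        self.writes.append(payload)
        return True


locks = LockService()
store = PaymentStore()

token_a = locks.grant()  # 1: A gets the lock, then stalls until its lease expires
token_b = locks.grant()  # 2: B gets the lock after the expiry

b_ok = store.write(token_b, "charge by B")  # accepted
a_ok = store.write(token_a, "charge by A")  # rejected: A woke up too late

print(b_ok, a_ok, store.writes)  # True False ['charge by B']
```

The check lives in the resource, not in the lock service. That’s the part that still works when a node is wrong about holding the lock.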
✅ When to Use One
Distributed locks are powerful, but they add cost and complexity. So use them with care:
- Use one when you have a truly critical action that must happen one-at-a-time, like charging a card, releasing the last item in stock, or running a job that must not run twice at once.
- Often it’s better to design so you don’t need a lock at all. For example, make the action idempotent, meaning running it twice has the same effect as running it once. Then a double-run does no harm and you can skip the lock entirely.
- Other times you can route all work for one order to the same node, so there’s naturally only one actor and no contest.
The rule of thumb: a lock is a real tool for real one-at-a-time needs, but the best lock is often the one you designed your way out of needing.
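Here’s a small sketch of the idempotent approach mentioned above, assuming the payment side remembers requests by an idempotency key. `PaymentGateway` and its fields are hypothetical, and a real system would need the look-up-and-record step to be atomic, for example through a unique constraint in a database.

```python
class PaymentGateway:
    """Hypothetical payment service that deduplicates by idempotency key."""

    def __init__(self):
        self._results = {}     # idempotency key -> result of the first charge
        self.charges_made = 0  # how many times a card was really charged

    def charge(self, idempotency_key, amount_cents):
        """Charge once per key. Repeats get the original result back."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


gateway = PaymentGateway()

# Server A and Server B both send the same request for the same order.
first = gateway.charge("order-123", 4999)
second = gateway.charge("order-123", 4999)

print(gateway.charges_made, first == second)  # 1 True
```

Both servers get a successful answer, but the card is only charged once, and no lock was needed.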
⚠️ Common Mistakes and Misconceptions
A few things trip people up. Let’s clear them out:
- “A normal lock works across servers.” No. A regular in-process lock only covers one machine. The other servers can’t see it, so it does nothing to stop cross-machine conflicts. You need a shared lock service.
- “Locks never expire.” They should. A lock with no TTL is a trap, because one crashed holder freezes the whole system forever. Practical distributed locks come with some form of expiry, such as a TTL, a lease, or a session timeout.
- “Distributed locks are perfectly safe.” They’re not. Crashes, clock drift, and network delays can all lead to two holders at once. Locks reduce the risk a lot, but you still design for the case where two nodes briefly act, often with fencing tokens or idempotency.
🛠️ Design Challenge
Try this one on your own to test yourself.
You’re building a flash sale. There’s exactly one limited-edition item left, and ten thousand people click “buy” in the same second across many servers. Walk through how you’d make sure only one person actually gets it. Think about:
- Where the lock lives, and which step actually needs protecting.
- What TTL you’d pick, and what happens if the holder crashes mid-purchase.
- Whether you could avoid the lock entirely by making the “decrement stock” step idempotent or by routing all buys for that item to one node.
See how many failure cases you can name and handle. That’s exactly the reasoning a system design interview is looking for.
Here’s the difference a lock makes, side by side.
| Situation | Without a lock | With a distributed lock |
|---|---|---|
| Two servers charge one order | Both charge it, customer pays twice | Only the lock holder charges, just once |
| Last item in stock | Sold to two people, oversold | One buyer wins, the rest are told it’s gone |
| A scheduled job fires twice | Job runs twice at once, duplicate work | Second run waits or skips, runs once |
| Holder crashes mid-task | No lock, so nothing to recover | TTL expires, another node picks it up |
🧩 What You’ve Learned
You can now explain what a distributed lock is and why it’s hard. Here’s what you’ve picked up.
- ✅ Running on many machines creates race conditions, where two nodes do the same critical action at once.
- ✅ A distributed lock gives mutual exclusion across machines, so only one node holds it and acts at a time.
- ✅ It works through a shared lock service that acts as the single referee, granting the lock to one node while others wait.
- ✅ Locks need a TTL or lease so a crashed holder doesn’t freeze the system forever.
- ✅ They’re usually built on Redis (Redlock), ZooKeeper, or etcd.
- ✅ They’re genuinely tricky thanks to crashes, clock drift, and network delays, so design to avoid needing them when you can.
Check Your Knowledge
Test what you learned.
- 1. What is a distributed lock?
Why: A distributed lock lives in a shared service all nodes can see, giving mutual exclusion across the whole system.
- 2. Why can’t a normal in-process lock stop two servers from acting at once?
Why: A regular lock covers one machine only; the other servers have no idea it exists, so it cannot coordinate across machines.
- 3. Why do distributed locks need a TTL or lease?
Why: If the holder crashes without releasing the lock, the TTL runs out and frees it so another node can continue.
- 4. How does a fencing token guard against two holders?
Why: A confused old holder carries an older token, which the resource rejects because it only accepts the newest one.
🚀 What’s Next?
This lesson showed you one of the trickiest coordination problems in distributed systems. Next, go deeper into the surrounding ideas.
- Distributed System Challenges covers the broader set of problems that make many-machine systems hard.
- Idempotency shows how to make actions safe to repeat, which is often the cleaner way to dodge needing a lock at all.
Get these two together and you’ll reason about correctness across machines like a pro.