Fault Tolerance
Think about a big passenger plane for a second:
- It usually has more than one engine, right? Like two, sometimes four.
- That's not just so it can go faster. It's so that if one engine fails mid-air, the plane keeps flying on the others.
- The plane is built to keep working even when a part breaks.
Your software system needs the exact same idea. Stuff will break. A server dies, a network link drops, a disk goes bad. The question isn't "how do we make sure nothing ever fails?" The question is "when something fails, does the whole thing crash, or does it keep going?" That ability to keep going is called fault tolerance, and that's what we'll learn today.
Why Faults Are Unavoidable
Let's get one thing straight first, because a lot of beginners get this wrong:
- A fault is just something going wrong in a part of your system. A server crashing, a hard disk dying, a network cable getting unplugged, a bug eating up all the memory.
- When you run one app on one laptop, faults feel rare. Maybe it crashes once a month.
- But real systems don't run on one laptop. They run on hundreds or thousands of machines, talking to each other over networks.
Here's the thing about scale. Even if each single machine fails only once in a long while, when you have thousands of them, something is failing somewhere almost all the time:
- Hardware breaks. Disks and memory chips wear out and die. That's just physics.
- Networks drop. Cables get cut, routers get overloaded, packets get lost.
- Software has bugs. A bad deploy, a memory leak, a crash under heavy load.
So at scale, failure isn't a rare accident you can ignore. It's the normal weather. You plan for it the way you'd pack an umbrella because you know it'll rain sometimes.
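To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes machines fail independently and uses a made-up figure of one failure per machine every 1,000 days; neither assumption comes from a real fleet, and the function name is only illustrative.

```python
def prob_any_failure(machines: int, daily_failure_prob: float) -> float:
    """Chance that at least one of `machines` fails on a given day.

    Assumes machines fail independently, which real fleets only approximate.
    """
    return 1.0 - (1.0 - daily_failure_prob) ** machines


# Hypothetical figure: each machine fails about once every 1,000 days.
P_DAILY = 1 / 1000

for fleet_size in (1, 100, 1_000, 10_000):
    chance = prob_any_failure(fleet_size, P_DAILY)
    print(f"{fleet_size:>6} machines: {chance:.1%} chance of a failure today")
```

Under these assumptions a single machine almost never fails on a given day, a fleet of 1,000 sees a failure on roughly two days out of three, and a fleet of 10,000 sees one practically every day.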
What is Fault Tolerance
Now that we know faults will happen, here's the big idea:
- Fault tolerance means the system keeps working even when some of its parts fail.
- It doesn't mean nothing ever breaks. Parts still break. It means the user mostly doesn't notice, because the system handles the break for them.
- Sometimes the system runs a little weaker for a while, and that's okay. Working a bit slower or with one feature missing is still way better than being completely down.
Let's tie it back to the plane:
- One engine failing is the fault.
- The plane staying in the air on the other engines is the fault tolerance.
- Maybe it flies a bit slower or can't climb as high for now. That's the "running a little weaker" part. But everyone still lands safely.
So the goal is simple to say: design the system so that one broken part doesn't take everything down with it.
How We Get It
So how do we actually build a system that shrugs off failures? There are a handful of go-to techniques, and they all share one common idea: don't depend on just one of anything. Here they are at a glance.
| Technique | How it helps |
|---|---|
| Redundancy | Keep spare copies of parts, so a backup is ready if one fails |
| Replication | Keep copies of your data in more than one place, so losing one machine doesn't lose the data |
| Failover | Automatically switch to a backup the moment the main one dies |
| Graceful degradation | Drop one feature instead of crashing the whole app |
Let's walk through each one in plain words:
- Redundancy means keeping spare copies of the important parts. Instead of one server, you run two or three doing the same job. If one crashes, the others are already there. It's the second engine on the plane.
- Replication is redundancy but for your data. You keep copies of the data on more than one machine. So if the machine holding your data dies, another machine still has the same data and nothing is lost.
- Failover is the automatic switch. When the main thing dies, traffic moves over to a backup on its own, without a human waking up at 3 AM to flip a switch. We'll walk through this one step by step next.
- Graceful degradation means losing a feature instead of the whole app. If one piece breaks, you turn off just that piece and keep the rest running. We'll dig into this one too.
Spot the common thread
Notice that every technique boils down to "have more than one". More than one server, more than one copy of the data, more than one path traffic can take. The moment something exists in only one place, it becomes a weak point.
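As a rough illustration of "more than one copy of the data", here is a toy in-memory sketch of replication. The `Replica` and `ReplicatedStore` classes are hypothetical names made up for this example, not a real database API, and the `alive` flag simply simulates a machine being up or down.

```python
class Replica:
    """One machine holding a full copy of the data (an in-memory stand-in)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Writes go to every live replica; reads come from any live replica."""

    def __init__(self, replicas: list) -> None:
        self.replicas = replicas

    def put(self, key: str, value: str) -> None:
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replicas to write to")
        for replica in live:
            replica.data[key] = value

    def get(self, key: str) -> str:
        for replica in self.replicas:
            if replica.alive and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


store = ReplicatedStore([Replica("a"), Replica("b"), Replica("c")])
store.put("order:42", "2 pizzas")

store.replicas[0].alive = False   # simulate one machine dying
print(store.get("order:42"))      # still served by a surviving replica
```

This sketch deliberately skips the hard parts of real replication, such as bringing a recovered replica back in sync. It only shows why a single dead machine does not have to mean lost data.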
Failover in Action
Let's make failover concrete, because it's the one that feels like magic the first time you see it. Picture a website with a database:
- Normally, all the traffic goes to the main database. We call that the primary.
- Sitting quietly next to it is a backup database, kept in sync, ready to step in. We call that the standby.
- Something keeps watching the primary, like a health check that pings it every second to ask âare you still alive?â
Now the primary crashes. Hereâs what failover does:
- The health check notices the primary stopped answering.
- It promotes the standby to be the new primary.
- It points all the traffic at the new primary instead.
And the users? Most of them never notice. Maybe one request was a little slow during the switch. That's it.
That automatic switch is the whole point. Without it, someone has to notice the crash, log in, and fix the routing by hand, and your site is down the entire time they're doing that.
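The detect, promote, and reroute steps above can be sketched in a few lines. This is a toy model, not a production failover system: `Database` and `FailoverRouter` are hypothetical classes, and the `max_missed` threshold (wait for a few missed pings before switching, so one slow reply doesn't trigger a failover) is an added assumption rather than something the steps above require.

```python
class Database:
    """Stand-in for a database server; `alive` simulates whether it responds."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def ping(self) -> bool:
        return self.alive

    def query(self, sql: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} ran: {sql}"


class FailoverRouter:
    """Sends traffic to the primary and promotes the standby if it stops answering."""

    def __init__(self, primary: Database, standby: Database, max_missed: int = 3) -> None:
        self.primary = primary
        self.standby = standby
        self.max_missed = max_missed  # missed pings tolerated before failing over
        self.missed = 0

    def health_check(self) -> None:
        """Meant to be called on a timer, for example once per second."""
        if self.primary.ping():
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.max_missed and self.standby.ping():
            # Promote the standby and point all traffic at it.
            self.primary, self.standby = self.standby, self.primary
            self.missed = 0

    def query(self, sql: str) -> str:
        return self.primary.query(sql)


router = FailoverRouter(Database("db-primary"), Database("db-standby"))
print(router.query("SELECT 1"))      # served by db-primary

router.primary.alive = False         # the primary crashes
for _ in range(3):                   # three health checks in a row fail
    router.health_check()

print(router.query("SELECT 1"))      # now served by db-standby
```

Requests that arrive between the crash and the promotion still fail in this sketch, which matches the "one request was a little slow" window described above.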
A backup you never test isn't a backup
A standby that's never been tried is just a hope. Teams sometimes find out during a real outage that their failover doesn't actually work because the standby was out of sync or misconfigured. So good teams practice failures on purpose to make sure the switch really happens.
Graceful Degradation
Failover keeps a core part alive. But sometimes a part dies and there's no backup ready, or the broken part just isn't essential. That's where graceful degradation comes in:
- Graceful degradation means you lose a feature, not the whole app.
- When one piece breaks, you switch off just that piece and keep everything else working.
- The user gets a slightly weaker experience for a bit, instead of a blank error page.
Here's a real example you've probably seen without realizing it. Imagine a shopping site:
- It has a "Recommended for you" section that suggests products you might like.
- That recommendation service is its own separate thing, and one day it goes down.
- Without graceful degradation, the whole product page might crash, because it was waiting on recommendations that never came.
- With graceful degradation, the page just hides the recommendations and shows everything else. You can still search, browse, add to cart, and check out.
That's a huge win. You lost a nice-to-have feature for a while, but the part that makes money, people buying things, kept working perfectly. Losing a small feature beats losing the whole store every single time.
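In code, the shopping-site example often comes down to wrapping the optional call and falling back to an empty result. The sketch below uses hypothetical function names and made-up product data, and it simulates the recommendation service being down by always raising an error.

```python
def fetch_product(product_id: str) -> dict:
    """Core data the page cannot work without (hypothetical stand-in)."""
    return {"id": product_id, "name": "Espresso machine", "price": 199.0}


def fetch_recommendations(product_id: str) -> list:
    """Nice-to-have call to a separate service; here it is simulated as down."""
    raise TimeoutError("recommendation service did not respond")


def render_product_page(product_id: str) -> dict:
    page = {"product": fetch_product(product_id), "can_checkout": True}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except (TimeoutError, ConnectionError):
        # Degrade gracefully: hide the section instead of failing the page.
        page["recommendations"] = []
    return page


page = render_product_page("sku-123")
print(page["product"]["name"], "| recommendations:", page["recommendations"])
```

In a real service the optional call would also need a timeout. Otherwise the page ends up waiting on recommendations that never come, which is exactly the failure described in the list above.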
Fault Tolerance vs High Availability
These two come up together a lot, and people mix them up, so let's keep them straight:
- High availability (HA) is the goal. It means your system is up and ready to use almost all the time, with very little downtime.
- Fault tolerance is how you get there. It's the set of techniques (redundancy, replication, failover, graceful degradation) that let the system survive failures.
The easy way to remember it:
- High availability is the "what we want", staying up.
- Fault tolerance is the "how we do it", surviving the failures that would otherwise take us down.
So they're closely related, almost two sides of the same coin. You build fault tolerance into your system precisely because you want high availability out of it.
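A quick back-of-the-envelope calculation shows how redundancy turns into availability. The sketch below assumes a made-up figure of 99% availability per server, fully independent failures, and instant failover. All three are idealized, so treat the output as an upper bound rather than a prediction.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant servers where any one can serve traffic.

    Assumes servers fail independently and failover is instant, both idealized.
    """
    return 1.0 - (1.0 - single) ** copies


HOURS_PER_YEAR = 24 * 365

for copies in (1, 2, 3):
    availability = combined_availability(0.99, copies)
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{copies} server(s): {availability:.4%} available, "
          f"about {downtime_hours:.2f} hours down per year")
```

Under these assumptions, going from one server to two cuts expected downtime from roughly 88 hours a year to under one hour. Real systems gain less, because failures are often correlated (a shared power supply, the same bad deploy).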
Common Mistakes and Misconceptions
A few ideas trip people up when they're new to this. Let's clear them out:
- "One reliable server is enough." No single server is reliable enough. Even the best machine dies eventually, and when it does, you have zero. One of anything is always a risk.
- "We'll add backups later." Bolting on backups after the fact is painful and often gets skipped until the first big outage forces it. Plan for failure from the start.
- "There's no single point of failure here." A single point of failure (SPOF) is any one part that takes down the whole system if it breaks. People forget the less obvious ones, like one load balancer, one network link, or one database that everything depends on. Hunt them down.
- "Failures won't happen to us." They will. Assuming otherwise just means you'll be unprepared when they do. Hope is not a strategy.
- "Just keep retrying until it works." Retrying a failed request can help, but retrying forever with no limit makes things worse. A flood of retries can pile onto an already struggling server and knock it over completely. Always cap your retries and back off between them.
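The advice in the last point, cap your retries and back off between them, might look like the sketch below. The function name, default values, and the random jitter are illustrative assumptions; jitter is a common addition that spreads out retries from many clients, but it goes beyond what the text above asks for.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Run `operation`, retrying on ConnectionError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up: the cap prevents retrying forever
            # The upper bound doubles each time (0.1s, 0.2s, 0.4s, ...), capped
            # at max_delay. The actual wait is a random amount below that bound
            # (jitter), so many clients do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


attempts = {"count": 0}


def flaky_operation():
    """Hypothetical call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("server busy")
    return "ok"


# The demo passes a no-op sleep so it finishes instantly.
result = call_with_retries(flaky_operation, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

When every attempt fails, the last error is re-raised to the caller instead of being retried again, so a struggling server sees at most `max_attempts` requests from each call.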
Design Challenge
Try this on your own to test yourself.
Imagine you're designing a simple online food-ordering app. It has a web server, a database, and a separate service that calculates delivery time estimates. Walk through it and answer:
- Where are the single points of failure? What breaks the whole app if it dies?
- How would you add redundancy and failover so the database survives a crash?
- The delivery-estimate service is flaky. How could graceful degradation help, so the app still takes orders even when estimates are unavailable?
Write down your answers. Thinking through where each part could fail, and what you'd do about it, is exactly how you reason in a real system design interview.
What You've Learned
You can now explain how systems survive failure. Here's what you've picked up.
- Faults are unavoidable at scale, so you design for failure instead of pretending it won't happen.
- Fault tolerance means the system keeps working, maybe a bit degraded, even when parts fail.
- Redundancy keeps spare copies of parts, and replication keeps spare copies of data.
- Failover automatically switches traffic to a backup when the main part dies.
- Graceful degradation drops a single feature instead of crashing the whole app.
- Fault tolerance is how you get high availability, and removing single points of failure is at the heart of it.
Check Your Knowledge
Test what you learned. Try to answer each question before reading the explanation under it.
- What does fault tolerance mean?
  Answer: Fault tolerance means a broken part does not bring the whole system down.
- What is failover?
  Answer: Failover is the automatic switch to a backup when the main part fails.
- What is graceful degradation?
  Answer: Graceful degradation drops a single feature, like recommendations, while the rest keeps working.
- How are fault tolerance and high availability related?
  Answer: High availability is what you want; fault tolerance is the set of techniques that get you there.
What's Next?
This lesson gave you the core idea of surviving failure. Next, we'll zoom into the related goals and go deeper.
- "What is High Availability?" shows how teams measure uptime and aim for those famous "nines".
- "Reliability in Distributed Systems" digs into keeping systems correct and dependable when they're spread across many machines.
Once you've got those, you'll have a solid grip on the resilience fundamentals every system design interview leans on.