Fault Tolerance

Think about a big passenger plane for a second:

It usually has more than one engine, right? Like two, sometimes four.
That’s not just so it can go faster. It’s so that if one engine fails mid-air, the plane keeps flying on the others.
The plane is built to keep working even when a part breaks.

Your software system needs the exact same idea. Stuff will break. A server dies, a network link drops, a disk goes bad. The question isn’t “how do we make sure nothing ever fails?” The question is “when something fails, does the whole thing crash, or does it keep going?” That ability to keep going is called fault tolerance, and that’s what we’ll learn today.

🎯 Why Faults Are Unavoidable

Let’s get one thing straight first, because a lot of beginners get this wrong:

A fault is just something going wrong in one part of your system: a server crashing, a hard disk dying, a network cable getting unplugged, a bug eating up all the memory.
When you run one app on one laptop, faults feel rare. Maybe it crashes once a month.
But real systems don’t run on one laptop. They run on hundreds or thousands of machines, talking to each other over networks.

Here’s the thing about scale. Even if each single machine fails only once in a long while, when you have thousands of them, something is failing somewhere almost all the time:

Hardware breaks. Disks and memory chips wear out and die. That’s just physics.
Networks drop. Cables get cut, routers get overloaded, packets get lost.
Software has bugs. A bad deploy, a memory leak, a crash under heavy load.

So at scale, failure isn’t a rare accident you can ignore. It’s the normal weather. You plan for it the way you’d pack an umbrella because you know it’ll rain sometimes.
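
To put rough numbers on that, here's a minimal sketch. The failure rate (each machine failing about once every three years) is a made-up figure for illustration, and the math assumes machines fail independently of each other, which real fleets don't strictly do.

```python
def expected_failures_per_day(machines: int, mean_days_between_failures: float) -> float:
    """Average number of machine failures per day across the whole fleet."""
    return machines / mean_days_between_failures


def chance_of_any_failure_today(machines: int, mean_days_between_failures: float) -> float:
    """Probability that at least one machine fails today (independent failures assumed)."""
    p_one_machine_fails = 1 / mean_days_between_failures
    return 1 - (1 - p_one_machine_fails) ** machines


# Assumed for illustration only: each machine fails about once every 3 years.
THREE_YEARS_IN_DAYS = 3 * 365

for fleet_size in (1, 100, 10_000):
    per_day = expected_failures_per_day(fleet_size, THREE_YEARS_IN_DAYS)
    any_today = chance_of_any_failure_today(fleet_size, THREE_YEARS_IN_DAYS)
    print(f"{fleet_size} machines: {per_day:.3f} failures/day, "
          f"{any_today:.1%} chance of at least one failure today")
```

With these assumed numbers, a single machine almost never fails on a given day, but a fleet of 10,000 averages roughly nine failures a day.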

🛡️ What is Fault Tolerance

Now that we know faults will happen, here’s the big idea:

Fault tolerance means the system keeps working even when some of its parts fail.
It doesn’t mean nothing ever breaks. Parts still break. It means the user mostly doesn’t notice, because the system handles the break for them.
Sometimes the system runs a little weaker for a while, and that’s okay. Working a bit slower or with one feature missing is still way better than being completely down.

Let’s tie it back to the plane:

One engine failing is the fault.
The plane staying in the air on the other engines is the fault tolerance.
Maybe it flies a bit slower or can’t climb as high for now. That’s the “running a little weaker” part. But everyone still lands safely.

So the goal is simple to say: design the system so that one broken part doesn’t take everything down with it.

🧩 How We Get It

So how do we actually build a system that shrugs off failures? There are a handful of go-to techniques, and they all share one common idea: don’t depend on just one of anything. Here they are at a glance.

Technique	How it helps
Redundancy	Keep spare copies of parts, so a backup is ready if one fails
Replication	Keep copies of your data in more than one place, so losing one machine doesn’t lose the data
Failover	Automatically switch to a backup the moment the main one dies
Graceful degradation	Drop one feature instead of crashing the whole app

Let’s walk through each one in plain words:

Redundancy means keeping spare copies of the important parts. Instead of one server, you run two or three doing the same job. If one crashes, the others are already there. It’s the second engine on the plane.
Replication is redundancy but for your data. You keep copies of the data on more than one machine. So if the machine holding your data dies, another machine still has a copy, and little or nothing is lost (how much depends on how up to date that copy was).
Failover is the automatic switch. When the main thing dies, traffic moves over to a backup on its own, without a human waking up at 3 AM to flip a switch. We’ll walk through this one step by step next.
Graceful degradation means losing a feature instead of the whole app. If one piece breaks, you turn off just that piece and keep the rest running. We’ll dig into this one too.
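
Here's a rough sketch of redundancy and replication working together. Everything in it (the `Replica` and `ReplicatedStore` classes, the in-memory dictionaries) is hypothetical and only meant to show the idea. It writes to every live copy and reads from whichever copy is still alive. It deliberately ignores the hard part of real replication, which is bringing a recovered machine back in sync.

```python
class Replica:
    """One machine holding a copy of the data (hypothetical, in-memory)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.data: dict[str, str] = {}


class ReplicatedStore:
    """Writes go to every live replica; reads come from any live replica."""

    def __init__(self, replicas: list[Replica]) -> None:
        self.replicas = replicas

    def put(self, key: str, value: str) -> None:
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replicas to write to")
        for replica in live:
            replica.data[key] = value

    def get(self, key: str) -> str:
        # Try each replica in turn and skip the dead ones.
        for replica in self.replicas:
            if replica.alive and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


replicas = [Replica("a"), Replica("b"), Replica("c")]
store = ReplicatedStore(replicas)
store.put("order:1", "2 pizzas")

replicas[0].alive = False        # the first machine dies
print(store.get("order:1"))      # still served from a surviving replica
```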

Spot the common thread

Notice that every technique boils down to “have more than one”. More than one server, more than one copy of the data, more than one path traffic can take. The moment something exists in only one place, it becomes a weak point.

🔁 Failover in Action

Let’s make failover concrete, because it’s the one that feels like magic the first time you see it. Picture a website with a database:

Normally, all the traffic goes to the main database. We call that the primary.
Sitting quietly next to it is a backup database, kept in sync, ready to step in. We call that the standby.
Something keeps watching the primary, like a health check that pings it every second to ask “are you still alive?”

Now the primary crashes. Here’s what failover does:

The health check notices the primary stopped answering.
It promotes the standby to be the new primary.
It points all the traffic at the new primary instead.

And the users? Most of them never notice. Maybe a few requests were slow or had to be retried during the switch. That’s it.

That automatic switch is the whole point. Without it, someone has to notice the crash, log in, and fix the routing by hand, and your site is down the entire time they’re doing that.
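
The three failover steps can be sketched in a few lines. This is a toy model, not how any particular database does it: `Database`, `FailoverRouter`, and the `max_missed_pings` threshold are all made up for illustration. Waiting for a few missed pings in a row before switching is an assumption added here, so that one lost packet doesn't trigger a failover.

```python
from typing import Optional


class Database:
    """Hypothetical stand-in for a database server."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def ping(self) -> bool:
        return self.alive


class FailoverRouter:
    """Sends traffic to the primary and promotes the standby if the primary stops answering."""

    def __init__(self, primary: Database, standby: Database, max_missed_pings: int = 3) -> None:
        self.primary = primary
        self.standby: Optional[Database] = standby
        self.max_missed_pings = max_missed_pings
        self.missed_pings = 0

    def health_check(self) -> None:
        """Meant to run once per check interval (for example, every second)."""
        if self.primary.ping():
            self.missed_pings = 0
            return
        # Step 1: notice the primary stopped answering.
        self.missed_pings += 1
        if self.missed_pings >= self.max_missed_pings and self.standby is not None:
            # Steps 2 and 3: promote the standby and point traffic at it.
            self.primary, self.standby = self.standby, None
            self.missed_pings = 0

    def route(self, query: str) -> str:
        return f"{self.primary.name} handled: {query}"


router = FailoverRouter(Database("db-1"), Database("db-2"))
print(router.route("SELECT 1"))    # served by db-1

router.primary.alive = False       # the primary crashes
for _ in range(3):                 # three health checks in a row go unanswered
    router.health_check()
print(router.route("SELECT 1"))    # served by db-2 after promotion
```

Notice that after the promotion there is no standby left. In a real setup you'd want to bring up a new one quickly, otherwise the new primary is a single point of failure.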

A backup you never test isn't a backup

A standby that’s never been tried is just a hope. Teams sometimes find out during a real outage that their failover doesn’t actually work because the standby was out of sync or misconfigured. So good teams practice failures on purpose to make sure the switch really happens.

🌗 Graceful Degradation

Failover keeps a core part alive. But sometimes a part dies and there’s no backup ready, or the broken part just isn’t essential. That’s where graceful degradation comes in:

Graceful degradation means you lose a feature, not the whole app.
When one piece breaks, you switch off just that piece and keep everything else working.
The user gets a slightly weaker experience for a bit, instead of a blank error page.

Here’s a real example you’ve probably seen without realizing it. Imagine a shopping site:

It has a “Recommended for you” section that suggests products you might like.
That recommendation service is its own separate thing, and one day it goes down.
Without graceful degradation, the whole product page might crash, because it was waiting on recommendations that never came.
With graceful degradation, the page just hides the recommendations and shows everything else. You can still search, browse, add to cart, and check out.

That’s a huge win. You lost a nice-to-have feature for a while, but the part that makes money, people buying things, kept working perfectly. Losing a small feature beats losing the whole store every single time.
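
In code, the shopping-site example might look something like this. The function names and the product data are invented for the sketch, and the recommendation call is hard-coded to fail so you can see the fallback happen.

```python
def fetch_recommendations(user_id: str) -> list[str]:
    """Hypothetical call to the recommendation service; here it is always down."""
    raise ConnectionError("recommendation service unavailable")


def fetch_product(product_id: str) -> dict:
    """Hypothetical call to the core product catalog, which is healthy."""
    return {"id": product_id, "name": "Desk lamp", "price": 24.99}


def build_product_page(user_id: str, product_id: str) -> dict:
    # The product itself is essential: if this fails, the page really can't render.
    page = {"product": fetch_product(product_id), "recommendations": []}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except (ConnectionError, TimeoutError):
        # Nice-to-have feature failed: hide the section and keep the page.
        pass
    return page


print(build_product_page("user-1", "sku-42"))
```

The key design choice is deciding up front which calls are essential and which are optional. In a real app you'd probably also put a short timeout on the optional call, so the page isn't stuck waiting on a service that never answers.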

⚖️ Fault Tolerance vs High Availability

These two come up together a lot, and people mix them up, so let’s keep them straight:

High availability (HA) is the goal. It means your system is up and ready to use almost all the time, with very little downtime.
Fault tolerance is how you get there. It’s the set of techniques (redundancy, replication, failover, graceful degradation) that let the system survive failures.

The easy way to remember it:

High availability is the “what we want”, staying up.
Fault tolerance is the “how we do it”, surviving the failures that would otherwise take us down.

So they’re closely related, almost two sides of the same coin. You build fault tolerance into your system precisely because you want high availability out of it.
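
A bit of back-of-the-envelope math shows why redundancy pays off in availability. This sketch assumes the copies fail independently and that one healthy copy is enough to serve traffic. Both are simplifications, and the 99% figure is just an example.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant servers.

    Assumes failures are independent and any one healthy copy can serve traffic.
    """
    chance_all_down = (1 - single) ** copies
    return 1 - chance_all_down


for copies in (1, 2, 3):
    print(f"{copies} copies: {combined_availability(0.99, copies):.4%} available")
```

Under those assumptions, one server that is up 99% of the time becomes roughly 99.99% with two copies. Real numbers come out lower, because failures are often linked (same data center, same bad deploy), which is why hunting for shared weak points matters.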

⚠️ Common Mistakes and Misconceptions

A few ideas trip people up when they’re new to this. Let’s clear them out:

“One reliable server is enough.” No single server is reliable enough. Even the best machine dies eventually, and when it does, you have zero. One of anything is always a risk.
“We’ll add backups later.” Bolting on backups after the fact is painful and often gets skipped until the first big outage forces it. Plan for failure from the start.
“There’s no single point of failure here.” A single point of failure (SPOF) is any one part that takes down the whole system if it breaks. People forget the less obvious ones, like one load balancer, one network link, or one database that everything depends on. Hunt them down.
“Failures won’t happen to us.” They will. Assuming otherwise just means you’ll be unprepared when they do. Hope is not a strategy.
“Just keep retrying until it works.” Retrying a failed request can help, but retrying forever with no limit makes things worse. A flood of retries can pile onto an already struggling server and knock it over completely. Always cap your retries and back off between them.
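
For that last point, here's one way capped retries with backoff could look. The function name, the default limits, and the choice of `ConnectionError` as the retryable error are all assumptions for this sketch. The random jitter is an extra touch not covered above: it spreads retries out so many clients don't all hit the server at the same instant.

```python
import random
import time


def call_with_retries(operation, max_attempts: int = 4, base_delay: float = 0.1,
                      max_delay: float = 2.0, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Cap reached: give up instead of hammering the server.
            # Double the wait each time, but never wait longer than max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


attempts = 0


def flaky():
    """Hypothetical operation that fails twice, then succeeds."""
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("server busy")
    return "ok"


print(call_with_retries(flaky))  # succeeds on the third attempt
```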

🛠️ Design Challenge

Try these yourself. Think each one through, then compare your answer with the techniques covered above.

You’re designing a simple online food-ordering app. It has a web server, a database, and a separate service that calculates delivery time estimates.

Find the single points of failure. Which parts, if they die, take the whole app down?

Redundancy and failover for the database. How would you make the database survive a crash?

Graceful degradation for delivery estimates. The estimate service is flaky. How can the app still take orders when estimates are unavailable?

🧩 What You’ve Learned

You can now explain how systems survive failure. Here’s what you’ve picked up.

✅ Faults are unavoidable at scale, so you design for failure instead of pretending it won’t happen.
✅ Fault tolerance means the system keeps working, maybe a bit degraded, even when parts fail.
✅ Redundancy keeps spare copies of parts, and replication keeps spare copies of data.
✅ Failover automatically switches traffic to a backup when the main part dies.
✅ Graceful degradation drops a single feature instead of crashing the whole app.
✅ Fault tolerance is how you get high availability, and removing single points of failure is at the heart of it.

🚀 What’s Next?

This lesson gave you the core idea of surviving failure. Next, we’ll zoom into the related goals and go deeper.

What is High Availability? shows how teams measure uptime and aim for those famous “nines”.
Reliability in Distributed Systems digs into keeping systems correct and dependable when they’re spread across many machines.

Once you’ve got those, you’ll have a solid grip on the resilience fundamentals every system design interview leans on.
