Health Checks Explained

Picture this. You’ve got a few copies of your app running, and a load balancer in front of them sending users to each one. Now one of those copies quietly crashes. Here’s the scary part:

  • The load balancer doesn’t magically know the server is dead.
  • So it keeps cheerfully sending users to it.
  • And those users just get errors, while a perfectly healthy server sits right next to it doing nothing.

The fix for this is a small, simple idea called a health check. By the end of this lesson you’ll know exactly what it is, how load balancers and Kubernetes use it, and why it’s the thing that lets systems quietly heal themselves.

🎯 The Problem

The whole reason health checks exist is that servers are not as reliable as we’d like. Things go wrong in a few common ways:

  • A server crashes. The process just dies, and now it can’t answer anything.
  • A server hangs. The process is technically still running, but it’s stuck and not actually serving requests.
  • A server isn’t ready yet. It just started up and is still loading config or connecting to the database, so it can’t take traffic for the first few seconds.

In all of these cases, the danger is the same. Traffic gets sent to a server that can’t handle it, and your users see errors. So we need some way for the system to know, at any moment, which servers are actually okay to use.

🩺 What is a Health Check

A health check is a small endpoint, usually something like /health, that a service exposes just to answer one question: “are you okay?” Let’s unpack that:

  • An endpoint is just a URL on your service that you can hit, like /login or /health.
  • This particular endpoint isn’t for users. It’s there so other systems can ping it on a schedule and get a quick yes-or-no on whether the service is healthy.
  • “Ping” here just means send it a small request and see what comes back.

When the service is fine, the endpoint replies with a success status and maybe a tiny bit of info. The reply is meant to be lightweight, so it can be called every few seconds without slowing anything down. A typical healthy response looks like this.

{
  "status": "ok",
  "uptime": 13452
}

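That’s it. A 200 OK status with a small body saying “I’m alive.” If the service is in trouble, the same endpoint instead returns an error status, such as 503 Service Unavailable. And if the process has crashed or hung, nothing comes back at all, so whoever is pinging treats a timeout or a refused connection as a failure too. Either way, that’s the signal everyone else watches for.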
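To make this concrete, here is a minimal sketch of such an endpoint using only Python's standard library. The 503 status for the unhealthy case, the "unavailable" body, the uptime being measured in seconds, and the names health_response and make_server are all assumptions for illustration, not something the lesson prescribes.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.monotonic()


def health_response(healthy: bool) -> tuple[int, bytes]:
    """Build the (status code, JSON body) pair for a /health reply."""
    if healthy:
        # Assumption: uptime is reported in whole seconds since startup.
        body = {"status": "ok", "uptime": int(time.monotonic() - START_TIME)}
        return 200, json.dumps(body).encode()
    # Assumption: 503 Service Unavailable signals "not healthy".
    return 503, json.dumps({"status": "unavailable"}).encode()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        # This sketch always reports healthy; a real service would decide here.
        status, body = health_response(healthy=True)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8080) -> HTTPServer:
    """Create (but do not start) a server; the port is an arbitrary choice."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling make_server().serve_forever() would start it, and a GET to /health on that port should then come back with 200 and the small JSON body.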

Why a dedicated endpoint?

You could try to guess a server’s health from real user traffic, but that’s messy and slow to react. A dedicated /health endpoint gives one clean, predictable place to ask the question, and the answer comes back fast.

🔁 How It’s Used

So who’s actually pinging this endpoint? Mostly two kinds of systems, and they both do roughly the same thing with the answer:

  • A load balancer sits in front of your servers and spreads incoming users across them. It pings each server’s /health regularly so it only routes users to the healthy ones.
  • An orchestrator is a system that runs and manages your containers for you, deciding what runs where and restarting things when they fail. Kubernetes is the famous one. It also pings health checks to decide what to keep, restart, or replace.

Here’s the loop they both follow. They ping /health on a schedule, and they act on the answer:

Load balancer pings /health → Healthy?

  • Yes: send users to it.
  • No: stop sending traffic, then restart or replace it.

So the health check is the feedback signal. When a server says “I’m fine,” it keeps getting traffic. When it stops saying that, traffic is pulled away from it, and the orchestrator can start a fresh copy to take its place. Nobody has to wake up at 3am to do this by hand.
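The loop described here can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular load balancer is built: the function names, the 2-second timeout, and the round-robin choice are all assumptions.

```python
import http.client
import urllib.request
from typing import Callable, Iterable


def http_probe(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET <base_url>/health answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and HTTP error statuses all count as unhealthy.
        return False


def healthy_backends(backends: Iterable[str], probe: Callable[[str], bool]) -> list[str]:
    """One round of the loop: probe every backend and keep the ones that pass."""
    return [backend for backend in backends if probe(backend)]


def pick_backend(healthy: list[str], request_number: int) -> str:
    """Round-robin over the healthy backends only."""
    if not healthy:
        raise RuntimeError("no healthy backends available")
    return healthy[request_number % len(healthy)]
```

In a running system, something would call healthy_backends(servers, http_probe) every few seconds and route each request with pick_backend. Passing the probe in as a parameter keeps the routing logic easy to test without a network.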

❤️ Liveness vs Readiness

Here’s a distinction that trips a lot of people up, and it’s a favorite in interviews. There are really two different questions you might be asking, and they need different answers:

  • Liveness asks: “Is the app alive at all?” If the answer is no, the app is broken or stuck, and the right move is to restart it. A failing liveness check means kill it and bring up a fresh one.
  • Readiness asks: “Is the app ready to take traffic right now?” The app might be perfectly alive but still warming up, like finishing startup or waiting on a database connection. A failing readiness check means don’t restart it, just hold off sending users until it’s ready.

The key difference is what happens when each one fails. Mixing them up causes real damage, so it’s worth keeping straight.

Aspect            | Liveness                | Readiness
------------------|-------------------------|----------------------------------------------
Question it asks  | Is the app alive?       | Is the app ready for traffic?
Fails when        | App is crashed or stuck | App is still starting or a dependency is down
Action on failure | Restart the app         | Stop sending traffic, but don’t restart
Typical endpoint  | /health/live            | /health/ready

Don't restart on a readiness failure

If you restart an app just because it’s not ready yet, you can get stuck in a loop. It keeps restarting before it ever finishes warming up. Readiness failures should pause traffic, not trigger a restart.
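One way to keep the two signals apart in code is to track them separately, roughly like this. It is a sketch under assumptions: the AppState class, the status codes, and the action names "restart" and "stop_traffic" are hypothetical, chosen only to mirror the distinction above.

```python
import threading


class AppState:
    """Tracks two separate signals: alive vs. ready for traffic."""

    def __init__(self) -> None:
        # Not ready until startup work (config, DB connection) has finished.
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    def mark_not_ready(self) -> None:
        self._ready.clear()

    def liveness_status(self) -> int:
        # If this code runs at all, the process is up and able to respond.
        return 200

    def readiness_status(self) -> int:
        return 200 if self._ready.is_set() else 503


def action_on_failure(probe: str) -> str:
    """What the system watching the probe does when it fails."""
    actions = {"liveness": "restart", "readiness": "stop_traffic"}
    return actions[probe]
```

An app would serve liveness_status() at /health/live and readiness_status() at /health/ready, calling mark_ready() once startup completes. During warm-up, liveness passes while readiness fails, so traffic is held back but nothing gets restarted.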

🧪 What a Good Health Check Tests

A health check is only useful if it tells the truth. So what should it actually check? You want it to confirm the things that really matter for serving a request:

  • The app itself responds. The process is up and able to answer, not hung.
  • The key dependencies are reachable. If your app can’t work without its database, then a health check that ignores the database can lie to you. So it should quietly confirm it can reach the things it truly needs.

But there’s a balance to strike here. The check runs over and over, every few seconds, on every instance. So keep it fast and light:

  • Don’t run heavy queries or slow computations inside it.
  • Don’t check things your app doesn’t actually depend on.
  • Aim for a quick, cheap “yes everything I need is reachable,” then return.
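One common way to keep a check cheap is to reuse a recent result for a short time instead of hitting the dependency on every probe. The sketch below shows that idea; the CachedCheck name, the 5-second default, and the injectable clock are assumptions for illustration.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Wraps a dependency check so frequent probes reuse a recent result."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_result: Optional[bool] = None
        self._last_time = 0.0

    def __call__(self) -> bool:
        now = self._clock()
        expired = now - self._last_time >= self._ttl
        if self._last_result is None or expired:
            try:
                self._last_result = bool(self._check())
            except Exception:
                # A check that raises is treated as a failed check.
                self._last_result = False
            self._last_time = now
        return self._last_result
```

The trade-off is freshness: with a 5-second window, a dependency failure can go unnoticed for up to 5 seconds, so the window should stay short relative to how quickly you need to react.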

🧩 Shallow vs Deep Checks

Health checks come in two flavors, and the difference is just how much they test:

  • A shallow check only confirms the process is up. It basically says “I’m running” and returns. It’s super fast and cheap, but it can miss problems, like the app being up while its database is unreachable.
  • A deep check also tests the dependencies, like pinging the database or a cache. It gives a more honest picture of whether the app can really do its job, but it costs more and takes longer.

So which do you use? It’s a trade-off:

  • Shallow checks are great for liveness, where you just want to know the process hasn’t died.
  • Deep checks fit readiness, where you care whether the app can actually serve a real request right now.
  • The risk with deep checks is that one slow dependency can make every instance look unhealthy at once, so use them carefully.
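Side by side, the two flavors might look something like this. It is a rough sketch: the function names, the "checks" field in the body, and the idea of passing dependencies as plain callables are assumptions, not a standard.

```python
from typing import Callable, Mapping


def shallow_check() -> tuple[int, dict]:
    """Only proves the process can answer; touches no dependencies."""
    return 200, {"status": "ok"}


def deep_check(dependencies: Mapping[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Runs each dependency check and reports 503 if any of them fails."""
    results = {}
    for name, check in dependencies.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            # An exception from a dependency check counts as a failure.
            results[name] = "failed"
    healthy = all(result == "ok" for result in results.values())
    status = 200 if healthy else 503
    body = {"status": "ok" if healthy else "unavailable", "checks": results}
    return status, body
```

A liveness endpoint could return shallow_check(), while a readiness endpoint could return something like deep_check({"database": ping_db}), where ping_db is a hypothetical function that does a quick connectivity test. Listing each dependency in the body makes it easier to see which one caused the failure.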

⚡ Why It Enables Self-Healing

This is where it all pays off. Self-healing means the system fixes itself without a human stepping in, and it leans entirely on accurate health checks. Here’s how the pieces connect:

  • When a server fails its health check, the load balancer performs failover. Failover just means traffic is automatically shifted away from the broken server onto the healthy ones.
  • The orchestrator then restarts or replaces the bad instance, so capacity comes back on its own.
  • All of this happens in seconds, automatically, because the health check gave a trustworthy signal.

This is the foundation under high availability, which is the goal of keeping a system up even when individual parts fail. But notice the catch. If your health check lies, self-healing breaks. A check that always says “ok” means dead servers keep getting traffic, and a check that’s too strict means healthy servers get killed for no reason.
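A common guard against the "too strict" failure mode is to require several failed checks in a row before declaring an instance unhealthy, and one or more passes before trusting it again. The sketch below illustrates that idea; the HealthTracker class and its default thresholds are assumptions, though many load balancers and orchestrators expose similar settings.

```python
class HealthTracker:
    """Turns a stream of probe results into a stable healthy/unhealthy state."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1) -> None:
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self._failures = 0   # consecutive failed probes
        self._successes = 0  # consecutive passed probes

    def record(self, passed: bool) -> bool:
        """Record one probe result and return the current healthy state."""
        if passed:
            self._failures = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._failures += 1
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

With a failure threshold of 3, a single slow response doesn’t pull a server out of rotation, but three misses in a row do. Higher thresholds mean fewer false alarms and slower reaction to real failures, so the numbers are a trade-off rather than a fixed rule.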

⚠️ Common Mistakes and Misconceptions

A few traps catch people again and again. Let’s clear them out:

  • “We don’t need a health check, our app rarely crashes.” Even rare crashes send users to dead servers, and “rarely” still means “sometimes” at 3am. The whole point is to handle the bad moments automatically.
  • “Heavier is better, check everything.” A health check that’s too heavy or slow becomes a problem itself. It eats resources and can time out, making a fine server look broken.
  • “Liveness and readiness are the same thing.” They’re not. One restarts the app, the other just pauses traffic. Treating them as one leads to restart loops or traffic going to apps that aren’t ready.
  • “A health check that always returns OK is fine.” That’s the worst kind. It hides real failures, so dead instances keep receiving traffic and self-healing never kicks in. A health check has to be willing to say “no.”

🛠️ Design Challenge

Try this one on your own to test yourself.

Imagine Alex is running an online store with several app servers behind a load balancer, and each server talks to a shared database. Design the health checks:

  • What would the liveness check test, and what should happen when it fails?
  • What would the readiness check test, and why might it include the database while liveness doesn’t?
  • The database has a slow moment and every server’s deep check starts failing at once. What goes wrong, and how would you avoid taking the whole site down over one slow dependency?

Sketch your answer in a few lines. This is exactly the kind of reasoning a system design interviewer is looking for.

🧩 What You’ve Learned

You can now explain how systems know which servers are safe to use. Here’s what you’ve picked up:

  • ✅ A health check is a small endpoint like /health that reports whether a service is okay.
  • ✅ Load balancers and orchestrators ping it to decide where to route traffic and what to restart.
  • ✅ Liveness asks if the app is alive (restart on failure), readiness asks if it’s ready for traffic (pause traffic on failure).
  • ✅ Shallow checks confirm the process is up, deep checks also test dependencies, and each has trade-offs.
  • ✅ Accurate health checks are what make self-healing and failover possible.
  • ✅ A good check stays fast and honest, never too heavy and never a check that always says OK.

Check Your Knowledge

Test what you learned. Answer each question in your own words, then compare with the explanation.

  1. What is a health check?

     Why: A health check is a small endpoint other systems can ping to get a quick yes-or-no on the service’s health.

  2. What should happen when a liveness check fails?

     Why: A failing liveness check means the app is broken or stuck, so the right move is to restart it.

  3. What does a readiness check failure mean you should do?

     Why: Readiness asks if the app is ready for traffic, so failing it should pause traffic without triggering a restart.

  4. Why should a health check stay fast and lightweight?

     Why: The check runs over and over on every instance, so a slow or heavy one wastes resources and can falsely mark a server as broken.

🚀 What’s Next?

Health checks are one piece of a bigger goal: keeping your system up no matter what fails.
