Alerting Systems Explained

It’s 3am. Everyone on the team is asleep. And right now, something is quietly going wrong:

  • The error rate on your checkout service just shot up.
  • Users are trying to pay, and half of them are seeing a spinning wheel that never stops.
  • Nobody is looking at a dashboard at 3am, right? So the real question is, who finds out, and how?

That’s the whole job of an alerting system. It’s the thing that wakes up the right person and says “hey, this is broken, go look”. Let’s see how that works.

🎯 What is Alerting

Let’s start with a plain definition:

  • Alerting means automatically notifying a human when a metric or a check shows that something is wrong.
  • A metric is just a number your system reports over time, like error rate, response time, or how much memory is free.
  • So instead of a person staring at numbers all day, the system watches them for you and pings someone only when it actually matters.

Now, alerting doesn’t work alone. It sits right on top of monitoring:

  • Monitoring is the part that collects and stores all those metrics. (We cover that in Monitoring Basics.)
  • Alerting is the part that watches those same metrics and reacts. Monitoring gives you the numbers, alerting decides when a number is bad enough to bother a human.

So think of monitoring as the eyes, and alerting as the voice that shouts when the eyes see trouble.

⚙️ How It Works

The basic idea is simple. You write down a rule, and the system checks it for you again and again. Here are the pieces:

  • You set a threshold on a metric. A threshold is just a line in the sand, like “error rate should stay under 5%”.
  • The system keeps comparing the live metric against that line.
  • The moment the metric crosses the line, an alert fires. “Fires” just means the alert triggers and turns into a notification.
  • That notification goes to the on-call person. On-call means whoever is responsible right now for responding when things break. Teams usually take turns, so one person carries the pager for a week, then it rotates.
  • That person gets pinged, looks at what’s wrong, and starts fixing it.

Here’s the whole flow in one picture.

Metric crosses threshold
  ↓
Alert fires
  ↓
Notify on-call person
  ↓
Human investigates
  ↓
Human fixes the problem

So the metric goes bad, the alert fires, the on-call person gets a notification, and a human steps in. That’s the loop, every single time.
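
To make that loop concrete, here is a minimal sketch in Python. It is not how any particular alerting tool works. The `ThresholdRule` class, the `evaluate` function, and the sample readings are all made up for illustration, and the "pager" is just a list that collects messages.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ThresholdRule:
    """A hypothetical alert rule: fire when the metric goes above a limit."""
    name: str
    threshold: float  # the "line in the sand", e.g. 0.05 for a 5% error rate

    def is_breached(self, value: float) -> bool:
        return value > self.threshold


def evaluate(rule: ThresholdRule, value: float, notify: Callable[[str], None]) -> bool:
    """Compare one live reading against the rule and notify if it is breached."""
    if rule.is_breached(value):
        notify(f"ALERT {rule.name}: value {value:.1%} is above {rule.threshold:.1%}")
        return True
    return False


# Stand-in for the on-call pager: we just collect the messages in a list.
sent: List[str] = []
rule = ThresholdRule(name="checkout_error_rate", threshold=0.05)

for reading in [0.01, 0.02, 0.12]:  # made-up error-rate samples
    evaluate(rule, reading, sent.append)

print(sent)
```

Only the last reading crosses the 5% line, so only one notification is sent. Notice the code stops at "notify": the investigating and fixing are still a human's job.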

An alert is a question, not a fix

An alert never fixes anything by itself. All it does is get a human’s attention. The actual fixing is still up to a person, so the goal of a good alert is to bring in the right person at the right time with enough info to act.
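
What does "enough info to act" look like? Here is one possible sketch. The `Alert` class and its fields are assumptions for illustration, not a standard format, and the runbook text is a placeholder rather than a real link.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str   # where it is broken
    summary: str   # what is broken, in plain words
    value: str     # the reading that triggered the alert
    runbook: str   # where to start; hypothetical placeholder, not a real URL

    def render(self) -> str:
        """Build a short message a half-asleep person can read quickly."""
        return (
            f"[{self.service}] {self.summary}\n"
            f"Current value: {self.value}\n"
            f"First step: {self.runbook}"
        )


alert = Alert(
    service="checkout",
    summary="More than 5% of payments are failing",
    value="error rate 12%",
    runbook="see the team's checkout runbook",
)
print(alert.render())
```

The message answers three things in three lines: what is wrong, how bad it is, and where to start.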

📣 Notification Channels

Okay, so an alert fires. Where does it actually show up? That depends on the channel you pick. A channel is just the path the notification travels on. The common ones:

  • Email. Fine for low-urgency stuff you’ll read later, but easy to miss at night.
  • Slack or Teams. Great for team-visible warnings, since everyone can see and chat about it.
  • SMS or a phone call. Loud and hard to ignore, good for the serious stuff.
  • Paging tools like PagerDuty or Opsgenie. These are built for on-call. They actually wake people up, escalate to the next person if nobody answers, and track who’s responsible right now.

The rule of thumb is to match the loudness of the channel to how urgent the problem is.

🎚️ Severity Levels

Not every problem deserves a 3am phone call, right? So we tag each alert with a severity level, which is basically how urgent it is. That tells the system how loudly to shout.

Severity | What it means | What happens
Critical | Users are hurting right now | Wake someone up, page them immediately
Warning | Trouble is building, not on fire yet | Look at it soon, during work hours
Info | Worth knowing, nothing’s broken | Log it, no need to ping anyone

So a critical alert is something like “the site is down”, and you page someone now. A warning is something like “disk space is getting low”, and you look at it tomorrow. And info is just a note for the record. Same system, very different loudness.
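
Here is a small sketch of how severity could decide the channel. The `Severity` enum and the `ROUTES` mapping are assumptions for illustration. Real teams pick their own channels, and paging tools have their own configuration for this.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Assumed mapping from severity to channel; real teams choose their own.
ROUTES = {
    Severity.CRITICAL: "page",  # wake someone up now
    Severity.WARNING: "chat",   # team channel, handled in work hours
    Severity.INFO: "log",       # recorded, nobody is pinged
}


def route(severity: Severity) -> str:
    """Return the channel name an alert of this severity should use."""
    return ROUTES[severity]


print(route(Severity.CRITICAL))  # page
print(route(Severity.INFO))      # log
```

Keeping the mapping in one place means changing how loud a severity is takes a one-line edit, instead of hunting through every alert rule.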

😵 Alert Fatigue

Here’s the trap that bites almost every team. It’s called alert fatigue:

  • Alert fatigue is when there are so many alerts, or so many noisy useless ones, that people start tuning them out.
  • And once you’re in the habit of ignoring alerts, you ignore the real ones too. That’s the danger.
  • Picture a phone that buzzes forty times a day for nothing. By buzz forty-one, you don’t even glance at it, even if that one was the real fire.

This is one of the most common ways alerting goes wrong. The usual problem isn’t too few alerts, it’s too many. So fewer, sharper alerts beat a flood of noisy ones.

Every noisy alert costs you trust

Each alert that fires for no real reason teaches the on-call person to trust the system a little less. Do that enough times and they’ll start swiping away alerts on reflex. Treat a false alarm as a bug to fix, not background noise to live with.

✅ What Makes a Good Alert

So if a flood of alerts is bad, what does a good one look like? A few simple tests it should pass:

  • It’s actionable. When it fires, there’s something the person can actually do. An alert nobody can act on is just noise.
  • It’s tied to user impact. The best alerts fire when real users are feeling pain, not when some internal number wobbles harmlessly.
  • It’s clear. The message says what’s wrong and where, so the half-asleep on-call person gets it in five seconds.
  • It’s not too sensitive. A tiny one-second blip shouldn’t page anyone. Give it a little room so normal wobble doesn’t trigger it.
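
One common way to give an alert "a little room" is to require the metric to stay bad for several checks in a row before firing. Here is a minimal sketch of that idea. The function name, the sample data, and the choice of three consecutive samples are all assumptions for illustration.

```python
from typing import Iterable, List


def sustained_breach(samples: Iterable[float], threshold: float, min_consecutive: int) -> bool:
    """Return True once `min_consecutive` samples in a row exceed the threshold."""
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # any healthy sample resets the count
    return False


blip: List[float] = [0.01, 0.30, 0.01, 0.02, 0.01]    # one bad sample
outage: List[float] = [0.01, 0.08, 0.09, 0.12, 0.10]  # stays bad

print(sustained_breach(blip, threshold=0.05, min_consecutive=3))    # False
print(sustained_breach(outage, threshold=0.05, min_consecutive=3))  # True
```

The single spike to 30% never pages anyone, while the error rate that stays above 5% does. The trade-off is a short delay before a real problem fires.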

And here’s the big mental shift. Alert on symptoms, not on every cause:

  • A symptom is what the user feels, like “checkout is failing” or “the page is slow”.
  • A cause is some internal detail, like “CPU on server number seven is at 90%”.
  • High CPU might be totally fine if users are still happy. So alert on “users can’t check out”, and let the human dig into the why. One symptom alert beats ten cause alerts.
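
As a rough sketch, a symptom alert can be built straight from what users experienced. Below, the paging decision looks only at checkout outcomes and ignores the CPU reading on purpose. The data, the 5% limit, and the `failure_rate` helper are made up for illustration.

```python
from typing import Sequence


def failure_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of checkout attempts that failed; True means success."""
    if not outcomes:
        return 0.0  # no traffic, so nothing to report
    failures = sum(1 for ok in outcomes if not ok)
    return failures / len(outcomes)


# Made-up data: 20 recent checkout attempts, 3 of them failed.
recent = [True] * 17 + [False] * 3
cpu_percent = 90.0  # a cause-style reading; deliberately not used to page

rate = failure_rate(recent)
should_page = rate > 0.05  # symptom: users cannot check out
print(f"{rate:.0%} of checkouts failed, page={should_page}")
```

The CPU number is still worth keeping on a dashboard. It helps the human find the why once the symptom alert has brought them in.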

📐 Alert on SLOs

There’s a clean, modern way to decide what’s worth alerting on, and it’s built around the SLO:

  • An SLO, short for Service Level Objective, is a target you promise to hit, like “the site is up 99.9% of the time”.
  • That little gap, the 0.1% you’re allowed to miss, is your error budget. It’s how much breakage you can afford before you’ve broken the promise.
  • So instead of alerting on every tiny hiccup, you alert when you’re at risk of burning through that budget and breaking the SLO.

This keeps alerts honest. A few errors that don’t threaten the 99.9% promise? Stay quiet. Errors piling up fast enough to break it? Now that’s worth waking someone up for. (More on this in Health Checks and your monitoring setup.)
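
Here is the arithmetic behind that, as a small sketch. The 30-day window and the 1% observed error ratio are assumptions for illustration. "Burn rate" here means how many times faster than "exactly on budget" you are using up the error budget.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed downtime in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_ratio / (1.0 - slo)


budget = error_budget_minutes(slo=0.999, window_days=30)
print(f"{budget:.1f} minutes of downtime allowed per 30 days")

rate = burn_rate(observed_error_ratio=0.01, slo=0.999)
print(f"burn rate: {rate:.1f}x")
```

A 99.9% target over 30 days leaves about 43.2 minutes of budget. If 1% of requests are failing, that is a burn rate of 10, which would use up the whole 30-day budget in about 3 days. A burn rate near 1 means you are on pace, and a high one is the "worth waking someone up" signal.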

⚠️ Common Mistakes and Misconceptions

A few things trip teams up over and over. Let’s clear them out:

  • “Alert on everything.” No. This is the fast road to alert fatigue. More alerts feels safer but actually makes you slower, because the real one gets buried.
  • “No severity levels.” If a disk-space warning pages you as loudly as a full outage, you can’t tell what’s urgent. Tag severity so the loud stuff stays loud.
  • “Alerts no one can act on.” An alert that fires but has no fix attached just trains people to ignore it. If you can’t act on it, it shouldn’t page you.
  • “Alert fatigue isn’t a real problem.” It’s a common reason outages get missed. Ignoring it is how the 3am alert gets swiped away by a tired human on autopilot.

🛠️ Design Challenge

Try this on your own to test yourself.

Alex runs a small online store and wants alerts for the checkout service. For each idea below, decide if it’s a good alert, and if not, why:

  • Page someone whenever CPU goes above 80% for one second.
  • Page someone when more than 5% of checkouts fail for five minutes straight.
  • Send an email every time a single request is slow.
  • Page someone when the error budget is burning fast enough to break the 99.9% uptime SLO.

Think about which ones are actionable, which are tied to user impact, and which would just cause alert fatigue. That’s exactly the reasoning an interviewer wants to see.

🧩 What You’ve Learned

You can now explain how alerting turns bad metrics into action. Here’s what you’ve picked up.

  • ✅ Alerting automatically notifies a human when a metric or check shows something is wrong, sitting right on top of monitoring.
  • ✅ You set a threshold on a metric, and when it’s crossed an alert fires and notifies the on-call person.
  • ✅ Notifications travel over channels like email, Slack, SMS, and paging tools like PagerDuty.
  • ✅ Severity levels (critical, warning, info) decide how loudly the system shouts.
  • ✅ Alert fatigue, from too many noisy alerts, is the biggest trap and makes people ignore real alerts.
  • ✅ Good alerts are actionable, tied to user impact, and fire on symptoms, ideally driven by SLOs.

Check Your Knowledge

Test what you learned. Try answering each question yourself before reading the explanation.

  1. What does an alerting system do?

    Why: Alerting watches metrics and pings the right person when a value crosses a threshold; the human still does the fixing.

  2. How are monitoring and alerting different?

    Why: Monitoring is the eyes that collect the numbers, and alerting is the voice that shouts when a number goes bad.

  3. What is alert fatigue?

    Why: A flood of noisy alerts trains people to tune them out, so the real one gets missed.

  4. What does it mean to alert on SLOs?

    Why: An SLO is a reliability target, so you alert when the error budget is burning fast enough to break it, which keeps alerts meaningful.

🚀 What’s Next?

You’ve got the core of alerting down. Next, go deeper on the pieces around it.

  • Monitoring Basics shows how the metrics behind every alert get collected and stored.
  • Health Checks covers the simple up-or-down probes that many alerts are built on.

Once you’ve got those, you’ll see how monitoring, health checks, and alerting fit together into a full observability setup.
