Alerting Systems Explained
It's 3am. Everyone on the team is asleep. And right now, something is quietly going wrong:
- The error rate on your checkout service just shot up.
- Users are trying to pay, and half of them are seeing a spinning wheel that never stops.
- Nobody is looking at a dashboard at 3am, right? So the real question is, who finds out, and how?
That's the whole job of an alerting system. It's the thing that wakes up the right person and says "hey, this is broken, go look". Let's see how that works.
What is Alerting
Let's start with a plain definition:
- Alerting means automatically notifying a human when a metric or a check shows that something is wrong.
- A metric is just a number your system reports over time, like error rate, response time, or how much memory is free.
- So instead of a person staring at numbers all day, the system watches them for you and pings someone only when it actually matters.
Now, alerting doesn't work alone. It sits right on top of monitoring:
- Monitoring is the part that collects and stores all those metrics. (We cover that in Monitoring Basics.)
- Alerting is the part that watches those same metrics and reacts. Monitoring gives you the numbers, alerting decides when a number is bad enough to bother a human.
So think of monitoring as the eyes, and alerting as the voice that shouts when the eyes see trouble.
How It Works
The basic idea is simple. You write down a rule, and the system checks it for you again and again. Here are the pieces:
- You set a threshold on a metric. A threshold is just a line in the sand, like "error rate should stay under 5%".
- The system keeps comparing the live metric against that line.
- The moment the metric crosses the line, an alert fires. "Fires" just means the alert triggers and turns into a notification.
- That notification goes to the on-call person. On-call means whoever is responsible right now for responding when things break. Teams usually take turns, so one person carries the pager for a week, then it rotates.
- That person gets pinged, looks at what's wrong, and starts fixing it.
Here's the whole flow in one line: the metric goes bad, the alert fires, the on-call person gets a notification, and a human steps in. That's the loop, every single time.
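The loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular alerting tool works: the `Rule` class, the `evaluate` function, the metric name `checkout_error_rate`, and the on-call name are all made up for the example, and the 5% threshold is the one used in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """A hypothetical alert rule: fire when the metric goes above the threshold."""
    metric: str
    threshold: float


def evaluate(rule: Rule, value: float, on_call: str) -> Optional[str]:
    """Compare one live reading against the rule.

    Returns a notification message if the alert fires, otherwise None.
    """
    if value > rule.threshold:
        return f"To {on_call}: {rule.metric} is {value:.1%}, above {rule.threshold:.1%}"
    return None


rule = Rule(metric="checkout_error_rate", threshold=0.05)

# The system repeats this check for every new reading of the metric.
for reading in [0.01, 0.02, 0.12]:
    message = evaluate(rule, reading, on_call="alex")
    if message:
        print(message)
```

Only the last reading crosses the line, so only one notification goes out. Notice that `evaluate` does nothing except produce a message. The fixing is still left to the human who receives it.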
An alert is a question, not a fix
An alert never fixes anything by itself. All it does is get a human's attention. The actual fixing is still up to a person, so the goal of a good alert is to bring in the right person at the right time with enough info to act.
Notification Channels
Okay, so an alert fires. Where does it actually show up? That depends on the channel you pick. A channel is just the path the notification travels on. The common ones:
- Email. Fine for low-urgency stuff you'll read later, but easy to miss at night.
- Slack or Teams. Great for team-visible warnings, since everyone can see and chat about it.
- SMS or a phone call. Loud and hard to ignore, good for the serious stuff.
- Paging tools like PagerDuty or Opsgenie. These are built for on-call. They actually wake people up, escalate to the next person if nobody answers, and track who's responsible right now.
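To make "escalate to the next person if nobody answers" concrete, here is a rough sketch of the idea. It does not use the PagerDuty or Opsgenie APIs. The `who_to_page` function, the names in the chain, and the five-minute timeout are all assumptions chosen for illustration.

```python
from typing import Sequence


def who_to_page(
    chain: Sequence[str],
    minutes_unacknowledged: float,
    timeout_minutes: float = 5.0,  # assumed timeout, not a tool default
) -> str:
    """Pick who should be paged, given how long the alert has gone unanswered.

    Each person in the chain gets `timeout_minutes` to acknowledge before
    the alert moves on to the next one. The last person stays paged.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    step = int(minutes_unacknowledged // timeout_minutes)
    return chain[min(step, len(chain) - 1)]


# Hypothetical escalation chain, in the order people get paged.
chain = ["primary on-call", "secondary on-call", "engineering manager"]
for minutes in (0, 6, 30):
    print(minutes, "min ->", who_to_page(chain, minutes))
```

The point of the chain is that a missed page does not turn into a missed outage. If the first person sleeps through it, someone else still hears about it.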
The rule of thumb is to match the loudness of the channel to how urgent the problem is.
Severity Levels
Not every problem deserves a 3am phone call, right? So we tag each alert with a severity level, which is basically how urgent it is. That tells the system how loudly to shout.
| Severity | What it means | What happens |
|---|---|---|
| Critical | Users are hurting right now | Wake someone up, page them immediately |
| Warning | Trouble is building, not on fire yet | Look at it soon, during work hours |
| Info | Worth knowing, nothing's broken | Log it, no need to ping anyone |
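The table can be read as a small routing map from severity to channel. The sketch below assumes one possible mapping (pager for critical, Slack for warning, log only for info). The `Severity` enum, the `ROUTES` table, and the `route` function are hypothetical names, and a real setup would call each channel's own API instead of returning strings.

```python
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical routing table: which channels each severity is sent to.
ROUTES: Dict[Severity, List[str]] = {
    Severity.CRITICAL: ["pager"],  # wake someone up
    Severity.WARNING: ["slack"],   # visible to the team, read during work hours
    Severity.INFO: [],             # logged only, nobody is pinged
}


def route(severity: Severity, message: str) -> List[str]:
    """Log the alert, then return one line per channel it would be sent to."""
    print(f"log: [{severity.value}] {message}")
    return [f"{channel}: [{severity.value}] {message}" for channel in ROUTES[severity]]


print(route(Severity.CRITICAL, "checkout is down"))
print(route(Severity.INFO, "nightly backup finished"))
```

Keeping the mapping in one table makes it easy to see at a glance that only critical alerts can wake anyone up.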
So a critical alert might mean the site is down, and you page someone now. A warning might mean disk space is getting low, so look at it tomorrow. And info is just a note for the record. Same system, very different loudness.
Alert Fatigue
Here's the trap that bites almost every team. It's called alert fatigue:
- Alert fatigue is when there are so many alerts, or so many noisy useless ones, that people start tuning them out.
- And once you're in the habit of ignoring alerts, you ignore the real ones too. That's the danger.
- Picture a phone that buzzes forty times a day for nothing. By buzz forty-one, you don't even glance at it, even if that one was the real fire.
This is the single biggest reason alerting goes wrong. It's not too few alerts that hurts you, it's too many. So fewer, sharper alerts beat a flood of noisy ones every time.
Every noisy alert costs you trust
Each alert that fires for no real reason teaches the on-call person to trust the system a little less. Do that enough times and they'll start swiping away alerts on reflex. Treat a false alarm as a bug to fix, not background noise to live with.
What Makes a Good Alert
So if a flood of alerts is bad, what does a good one look like? A few simple tests it should pass:
- It's actionable. When it fires, there's something the person can actually do. An alert nobody can act on is just noise.
- It's tied to user impact. The best alerts fire when real users are feeling pain, not when some internal number wobbles harmlessly.
- It's clear. The message says what's wrong and where, so the half-asleep on-call person gets it in five seconds.
- It's not too sensitive. A tiny one-second blip shouldn't page anyone. Give it a little room so normal wobble doesn't trigger it.
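One common way to give an alert "a little room" is to require the condition to hold for several readings in a row before paging. Here is a minimal sketch of that idea, assuming one sample per minute. The `should_page` function and the sample values are invented for the example.

```python
from typing import Sequence


def should_page(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """True only if the most recent `min_consecutive` samples all exceed the threshold."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough history yet to be sure
    return all(value > threshold for value in samples[-min_consecutive:])


# Assumed: one error-rate sample per minute, 5% threshold, five minutes in a row.
blip = [0.01, 0.09, 0.01, 0.02, 0.01]             # one bad minute, then recovery
sustained = [0.01, 0.06, 0.07, 0.08, 0.09, 0.10]  # bad for five minutes straight

print(should_page(blip, threshold=0.05, min_consecutive=5))
print(should_page(sustained, threshold=0.05, min_consecutive=5))
```

The trade-off is delay. The longer the required streak, the fewer false alarms, but the later the page arrives when the problem is real.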
And here's the big mental shift. Alert on symptoms, not on every cause:
- A symptom is what the user feels, like "checkout is failing" or "the page is slow".
- A cause is some internal detail, like "CPU on server number seven is at 90%".
- High CPU might be totally fine if users are still happy. So alert on "users can't check out", and let the human dig into the why. One symptom alert beats ten cause alerts.
Alert on SLOs
There's a clean, modern way to decide what's worth alerting on, and it's built around the SLO:
- An SLO, short for Service Level Objective, is a target you promise to hit, like "the site is up 99.9% of the time".
- That little gap, the 0.1% you're allowed to miss, is your error budget. It's how much breakage you can afford before you've broken the promise.
- So instead of alerting on every tiny hiccup, you alert when you're at risk of burning through that budget and breaking the SLO.
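The arithmetic behind an error budget is short enough to write out. The sketch below assumes a 30-day window and an arbitrary paging cutoff of 10 times the sustainable burn rate. Both numbers are illustrative assumptions, not values from any standard, and the function names are made up.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests (or time) allowed to fail, about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How fast the budget is being spent. 1.0 means exactly on budget."""
    return observed_error_rate / error_budget(slo_target)


def should_alert(observed_error_rate: float, slo_target: float, max_burn_rate: float) -> bool:
    """Alert only when the budget is burning faster than the chosen cutoff."""
    return burn_rate(observed_error_rate, slo_target) > max_burn_rate


SLO = 0.999                    # the 99.9% target from the text
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day window
PAGE_BURN_RATE = 10.0          # assumed cutoff, picked only for this example

print(f"allowed downtime: {error_budget(SLO) * WINDOW_MINUTES:.1f} minutes")
print(should_alert(0.0005, SLO, PAGE_BURN_RATE))  # a few errors, well within budget
print(should_alert(0.02, SLO, PAGE_BURN_RATE))    # 2% errors, burning about 20x too fast
```

With these assumptions, 0.1% of a 30-day window works out to about 43 minutes of allowed downtime. A burn rate above 1.0 means the budget would run out before the window ends if the error rate stayed where it is.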
This keeps alerts honest. A few errors that don't threaten the 99.9% promise? Stay quiet. Errors piling up fast enough to break it? Now that's worth waking someone up for. (More on this in Health Checks and your monitoring setup.)
Common Mistakes and Misconceptions
A few things trip teams up over and over. Let's clear them out:
- "Alert on everything." No. This is the fast road to alert fatigue. More alerts feels safer but actually makes you slower, because the real one gets buried.
- "No severity levels." If a disk-space warning pages you as loudly as a full outage, you can't tell what's urgent. Tag severity so the loud stuff stays loud.
- "Alerts no one can act on." An alert that fires but has no fix attached just trains people to ignore it. If you can't act on it, it shouldn't page you.
- "Alert fatigue isn't a real problem." It's the number one reason outages get missed. Ignoring it is how the 3am alert gets swiped away by a tired human on autopilot.
Design Challenge
Try this on your own to test yourself.
Alex runs a small online store and wants alerts for the checkout service. For each idea below, decide if it's a good alert, and if not, why:
- Page someone whenever CPU goes above 80% for one second.
- Page someone when more than 5% of checkouts fail for five minutes straight.
- Send an email every time a single request is slow.
- Page someone when the error budget is burning fast enough to break the 99.9% uptime SLO.
Think about which ones are actionable, which are tied to user impact, and which would just cause alert fatigue. That's exactly the reasoning an interviewer wants to see.
What You've Learned
You can now explain how alerting turns bad metrics into action. Here's what you've picked up.
- Alerting automatically notifies a human when a metric or check shows something is wrong, sitting right on top of monitoring.
- You set a threshold on a metric, and when it's crossed an alert fires and notifies the on-call person.
- Notifications travel over channels like email, Slack, SMS, and paging tools like PagerDuty.
- Severity levels (critical, warning, info) decide how loudly the system shouts.
- Alert fatigue, from too many noisy alerts, is the biggest trap and makes people ignore real alerts.
- Good alerts are actionable, tied to user impact, and fire on symptoms, ideally driven by SLOs.
Check Your Knowledge
Test what you learned. Try answering each question yourself before reading the explanation.
1. What does an alerting system do?
   Why: Alerting watches metrics and pings the right person when a value crosses a threshold; the human still does the fixing.
2. How are monitoring and alerting different?
   Why: Monitoring is the eyes that collect the numbers, and alerting is the voice that shouts when a number goes bad.
3. What is alert fatigue?
   Why: A flood of noisy alerts trains people to tune them out, so the real one gets missed.
4. What does it mean to alert on SLOs?
   Why: An SLO is a reliability target, so you alert when the error budget is burning fast enough to break it, which keeps alerts meaningful.
What's Next?
You've got the core of alerting down. Next, go deeper on the pieces around it.
- Monitoring Basics shows how the metrics behind every alert get collected and stored.
- Health Checks covers the simple up-or-down probes that many alerts are built on.
Once you've got those, you'll see how monitoring, health checks, and alerting fit together into a full observability setup.