Monitoring Basics
Let me ask you something. Your website goes down at 2 in the morning. How would you rather find out?
- Option one: a dashboard lights up red, you get a ping, and you fix it before most people even notice.
- Option two: you wake up to a pile of angry tweets and a flood of “site not working??” emails from users.
Nobody wants option two, right? The thing that gives you option one is called monitoring. So let’s understand what it really is, what you watch, and the tools that make it happen.
🎯 What is Monitoring
Okay so at its simplest, here’s the idea:
- Monitoring is continuously watching your system’s health and behavior using numbers, so you catch problems early.
- Those numbers have a name. They’re called metrics. A metric is just a measurement taken over time, like “how many requests came in this minute” or “how long did the server take to reply”.
- The whole point is to know how your system is doing right now, without waiting for a user to tell you.
Think of it like the dashboard in your car. You don’t open the hood every five minutes to check the engine, right? You just glance at the speed, the fuel, the temperature. Monitoring is that dashboard, but for your software.
The one-line version
Monitoring answers a simple question on repeat: “Is my system healthy right now, and if not, where does it hurt?”
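To make "a measurement taken over time" concrete, here is a minimal sketch that turns raw request timestamps into a requests-per-minute metric. The timestamps and the `requests_per_minute` helper are made up for illustration; a real monitoring system does this bucketing for you.

```python
from collections import Counter
from datetime import datetime

# Hypothetical request timestamps (illustrative data, not from a real server).
request_times = [
    datetime(2024, 1, 1, 10, 42, 5),
    datetime(2024, 1, 1, 10, 42, 40),
    datetime(2024, 1, 1, 10, 43, 10),
    datetime(2024, 1, 1, 10, 43, 15),
    datetime(2024, 1, 1, 10, 43, 59),
]


def requests_per_minute(timestamps):
    """Count how many requests fall into each one-minute bucket."""
    buckets = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(buckets.items()))


series = requests_per_minute(request_times)
for minute, count in series.items():
    print(minute.strftime("%H:%M"), count)
```

Each (minute, count) pair is one data point. String enough of them together and you have the line you would later draw on a chart.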
📊 Logging vs Monitoring
People mix these two up all the time, so let’s clear it up fast. They’re related but they answer different questions.
- Logs are records of individual events. Each log line is one thing that happened, like “user Alex logged in at 10:42” or “payment failed for order 123”. Logs are the detailed story, event by event.
- Monitoring is about trends and health over time. Instead of single events, it tracks the big-picture numbers, like “errors per minute” or “average response time”, and shows how they move.
Here’s the easy way to remember it. When something breaks, monitoring tells you that it broke and roughly where. Logs help you dig in and find out exactly why. You usually want both.
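Here is a rough sketch of that split, using a handful of made-up log events. The event fields (`minute`, `level`, `message`) and both helper functions are assumptions for illustration only. The first function gives the monitoring view (how many errors per minute), the second gives the logging view (what exactly went wrong in a given minute).

```python
from collections import Counter

# Hypothetical log events: each dict is one thing that happened.
events = [
    {"minute": "10:42", "level": "INFO", "message": "user Alex logged in"},
    {"minute": "10:42", "level": "ERROR", "message": "payment failed for order 123"},
    {"minute": "10:43", "level": "ERROR", "message": "payment failed for order 124"},
    {"minute": "10:43", "level": "ERROR", "message": "payment failed for order 125"},
]


def errors_per_minute(log_events):
    """Monitoring view: collapse individual events into a trend."""
    counts = Counter(e["minute"] for e in log_events if e["level"] == "ERROR")
    return dict(sorted(counts.items()))


def error_messages(log_events, minute):
    """Logging view: pull the detailed error events for one minute."""
    return [
        e["message"]
        for e in log_events
        if e["level"] == "ERROR" and e["minute"] == minute
    ]


print(errors_per_minute(events))        # the "that it broke" signal
print(error_messages(events, "10:43"))  # the "why" details
```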
If you want to go deeper on each side, check out the logging lesson for the event side, and metrics for the numbers that feed monitoring.
📈 What You Monitor
Now you might be thinking, there are a thousand things I could measure, where do I even start? Good news. There’s a famous shortlist called the golden signals. These four numbers tell you most of what you need to know about a service.
Here’s each one in plain words:
- Latency is how long a request takes to get a response. High latency means users are sitting there waiting.
- Traffic is how much demand is coming in, like requests per second. It tells you how busy you are.
- Errors is how many requests are failing. A spike here usually means something just broke.
- Saturation is how full your resources are, like CPU, memory, or disk. When this gets close to the limit, requests start to slow down or fail.
| Golden Signal | What it measures | What a bad number means |
|---|---|---|
| Latency | How long a request takes to respond | The site feels slow to users |
| Traffic | How much demand is hitting your system | A sudden spike or drop you didn’t expect |
| Errors | How many requests are failing | Something is broken right now |
| Saturation | How full your resources are (CPU, memory, disk) | You’re about to run out of capacity |
So if you only watched these four, you’d already catch most real problems. Start here, then add more as you learn what your system needs.
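If you want to see how the four signals fall out of raw data, here is a small sketch. It assumes each request is recorded with a duration and an HTTP status, treats status 500 and above as a failure, and takes CPU usage as a number handed in from elsewhere. The `Request` class and `golden_signals` function are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_used_fraction):
    """Summarize one time window of requests into the four golden signals."""
    total = len(requests)
    durations = [r.duration_ms for r in requests]
    # Assumption: a 5xx status counts as a failed request.
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_avg_ms": sum(durations) / total if total else 0.0,
        "latency_max_ms": max(durations) if total else 0.0,
        "traffic_rps": total / window_seconds,
        "error_rate": failed / total if total else 0.0,
        "saturation": cpu_used_fraction,
    }


# One hypothetical 60-second window with four requests.
window = [Request(240, 200), Request(250, 200), Request(245, 200), Request(900, 500)]
signals = golden_signals(window, window_seconds=60, cpu_used_fraction=0.85)
for name, value in signals.items():
    print(name, round(value, 3))
```

Real systems usually track latency percentiles rather than a plain average, since one slow request can hide inside an average. The average and max here just keep the sketch short.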
🖥️ Dashboards
Raw numbers in a table are hard to read, right? Stare at latency readings like “240, 250, 245, 900” (in milliseconds) and the jump to 900 is easy to miss. That’s where dashboards come in.
- A dashboard is a set of visual charts of your metrics over time, so you can see your system’s health at a glance.
- Each chart shows one metric, like a line going up and down. When the line suddenly shoots up or drops, your eye catches it instantly.
- One of the most popular tools for this is Grafana. It pulls in your metrics and draws them as nice graphs on one screen.
Picture a big screen on the wall with a few line charts. Latency is steady, errors are flat near zero, traffic is its usual wave. Then one line spikes red. You don’t need to read a single number. You just see that something’s off, and that’s the whole power of a dashboard.
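You can feel this effect even in a terminal. The sketch below is a toy, not how Grafana works: it maps each reading to one of eight block characters so a spike stands out visually. The `sparkline` function and the sample latency values are made up for illustration.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight heights, lowest to highest


def sparkline(values):
    """Draw a list of numbers as a one-line text chart."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no shape to show, so draw it at the lowest height.
        return BARS[0] * len(values)
    scale = (len(BARS) - 1) / (hi - lo)
    return "".join(BARS[round((v - lo) * scale)] for v in values)


latency_ms = [240, 250, 245, 900, 260, 250]  # hypothetical readings
print(sparkline(latency_ms))
```

The fourth reading renders as the tallest bar while the rest sit at the bottom, which is the same "one line spikes" effect a real chart gives you.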
🚨 From Monitoring to Alerts
Here’s the catch with dashboards. Somebody has to be looking at them. And nobody’s staring at a screen at 3 in the morning, right? So we need the system to poke us when something goes wrong.
- That’s what alerting does. You set a threshold, which is just a limit, on a metric.
- When the metric crosses that limit, the system fires an alert. For example, “if errors go above 5% for two minutes, page the on-call engineer.”
- The alert reaches a human through email, Slack, SMS, or a phone call, depending on how serious it is.
So monitoring is the watching, and alerting is the tap on the shoulder when the watching spots trouble. They work as a pair. You can learn how to set good thresholds and route alerts in the alerting lesson.
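The "above 5% for two minutes" rule can be sketched in a few lines. This assumes you get one error-rate reading per minute, so "two minutes" becomes "the last two readings". The `should_alert` function and its parameter names are hypothetical; real alerting tools have their own rule syntax.

```python
def should_alert(error_rates, threshold=0.05, sustained_samples=2):
    """Return True only if the latest readings ALL exceed the threshold.

    error_rates: one reading per minute, oldest first (assumed format).
    sustained_samples: how many consecutive recent readings must be bad.
    """
    if len(error_rates) < sustained_samples:
        return False  # not enough history to decide yet
    recent = error_rates[-sustained_samples:]
    return all(rate > threshold for rate in recent)


# A single blip that recovers: no page.
print(should_alert([0.01, 0.08, 0.02]))
# Two bad minutes in a row: page the on-call engineer.
print(should_alert([0.01, 0.06, 0.09]))
```

Requiring the condition to hold for a stretch of time, instead of firing on a single bad reading, is one simple way to keep short blips from waking anyone up.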
Don't alert on everything
If every little blip sends you a page, you’ll start ignoring all of them. That’s called alert fatigue. Alert only on things that actually need a human to act, right now.
🧩 The Tools
You don’t have to build any of this from scratch. There’s a well-worn stack people reach for.
- Prometheus collects and stores your metrics. Your services expose their numbers, and Prometheus regularly grabs them and keeps the history.
- Grafana shows those metrics as dashboards. Prometheus holds the data, Grafana makes it pretty and readable.
- If you’re on the cloud, there are managed options too. Cloud providers ship their own, like AWS CloudWatch and Google Cloud Monitoring, and third-party services like Datadog work across providers. They do collecting, dashboards, and alerting all in one place.
Here’s how the pieces fit together. Your services expose metrics, the monitoring system gathers them, and then it feeds both your dashboards and your alerts.
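So what does "exposing your numbers" look like? Prometheus reads metrics as plain text, typically from an HTTP endpoint on your service. The sketch below hand-renders that text format just to show its shape. The metric names, help strings, and values are made up, and in a real service you would normally use an official Prometheus client library instead of building the text yourself.

```python
def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format.

    metrics: {name: (metric_type, help_text, value)} (hypothetical structure).
    """
    lines = []
    for name, (metric_type, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only.
metrics = {
    "http_requests_total": ("counter", "Total HTTP requests handled.", 1027),
    "http_request_errors_total": ("counter", "Total failed HTTP requests.", 3),
    "cpu_usage_ratio": ("gauge", "Fraction of CPU currently in use.", 0.85),
}

print(render_metrics(metrics))
```

A counter only ever goes up (total requests so far), while a gauge can go up and down (CPU in use right now). Prometheus scrapes this text on a schedule and stores each value with a timestamp, which is where the history comes from.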
⚡ Why It Matters
So why bother setting all this up? A few solid reasons.
- You catch issues before users do. That’s the big one. A red dashboard at 2am beats angry tweets at 8am every time.
- You understand trends. Watching metrics over weeks shows you slow creep, like latency quietly getting worse as you grow.
- You can plan capacity. If saturation keeps climbing month over month, you know you’ll need more servers before you hit the wall.
- You can prove your promises. If you told customers “we’ll be up 99.9% of the time”, monitoring is how you measure whether you actually kept that. (That promise is called an SLA, short for Service Level Agreement.)
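It helps to see what "99.9%" means in actual minutes. Here is a quick worked sketch. It assumes a 30-day month and one health check per minute; the function names and the check counts are made up for illustration.

```python
def allowed_downtime_minutes(target, period_days=30):
    """How many minutes of downtime a target like 0.999 permits per period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target)


def availability(up_checks, total_checks):
    """Fraction of health checks that found the service up."""
    return up_checks / total_checks


# 99.9% over 30 days leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(0.999), 1))

# Hypothetical month: 43,200 one-minute checks, 30 of them failed.
measured = availability(up_checks=43_170, total_checks=43_200)
print(round(measured * 100, 3), "% available")
print("SLA met" if measured >= 0.999 else "SLA missed")
```

With 30 failed checks the measured availability lands just above 99.9%, so the promise holds for that month. Around 44 failed minutes would break it.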
⚠️ Common Mistakes and Misconceptions
A few things trip people up when they start out. Let’s clear them.
- “I’ll add monitoring after something breaks.” That’s backwards. The whole point is to know before it breaks. Add it early, while things are calm.
- “More metrics is always better.” Not really. Watch every single metric and you drown in noise, and the important signal gets lost. Start with the golden signals and grow from there.
- “We collect metrics, so we’re covered.” Collecting without dashboards or alerts is half a system. If nobody’s looking and nothing’s alerting, those metrics aren’t helping anyone.
- “Logging and monitoring are the same thing.” They’re not. Logs are individual events. Monitoring is trends and health from metrics. You usually want both, doing their own jobs.
🛠️ Design Challenge
Try this one on your own to test yourself.
Imagine you run a small online store. Sketch out a simple monitoring setup:
- Pick which golden signals you’d watch, and say why each one matters for a store.
- Decide one alert threshold for each, like “errors above what percent should page someone?”
- Name what you’d put on the main dashboard so a glance tells you if checkout is healthy.
Think about what would actually wake you up versus what can wait till morning. That’s exactly the judgment real teams use every day.
🧩 What You’ve Learned
You can now explain what monitoring is and how a basic setup hangs together. Here’s what you picked up.
- ✅ Monitoring is continuously watching system health using metrics, so you catch problems early.
- ✅ Logs are individual events, while monitoring tracks trends and health over time.
- ✅ The golden signals are latency, traffic, errors, and saturation.
- ✅ Dashboards turn metrics into visual charts you can read at a glance, often with Grafana.
- ✅ When a metric crosses a threshold, an alert fires to a human.
- ✅ Prometheus collects metrics and Grafana shows them, with cloud options available too.
Check Your Knowledge
Test what you learned.
1. What does monitoring mainly do?
   Why: Monitoring keeps watching the system's health over time using metrics, so issues show up before users notice.
2. What are the four golden signals?
   Why: Latency, traffic, errors, and saturation cover most of what you need to know about a service's health.
3. How does alerting relate to monitoring?
   Why: Monitoring is the watching, and alerting fires a notification when a metric crosses a set limit.
4. Which pair of tools is commonly used to collect metrics and show dashboards?
   Why: Prometheus collects and stores metrics while Grafana draws them as readable dashboards.
🚀 What’s Next?
You’ve got the big picture. Next, zoom into the pieces that make monitoring work.
- Metrics Explained goes deeper on the numbers that feed your monitoring.
- Alerting Systems shows how to turn those metrics into smart notifications that reach the right person.
Get these two down and you’ll have a real grip on observability, the skill every backend and system design interview keeps coming back to.