Monitoring Basics

Let me ask you something. Your website goes down at 2 in the morning. How would you rather find out?

Option one: a dashboard lights up red, you get a ping, and you fix it before most people even notice.
Option two: you wake up to a pile of angry tweets and a flood of “site not working??” emails from users.

Nobody wants option two, right? The thing that gives you option one is called monitoring. So let’s understand what it really is, what you watch, and the tools that make it happen.

🎯 What is Monitoring

Okay, so at its simplest, here’s the idea:

Monitoring is continuously watching your system’s health and behavior using numbers, so you catch problems early.
Those numbers have a name. They’re called metrics. A metric is just a measurement taken over time, like “how many requests came in this minute” or “how long did the server take to reply”.
The whole point is to know how your system is doing right now, without waiting for a user to tell you.
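
To make "a measurement taken over time" concrete, here is a minimal sketch of a metric as a list of timestamped samples. The `Sample` class, the helper names, and the readings are all made up for illustration; real monitoring systems store this kind of data for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: int  # seconds since the start of the observation window
    value: float    # the measurement taken at that moment


# Hypothetical readings: requests counted in each one-minute window.
requests_per_minute = [
    Sample(0, 120.0),
    Sample(60, 131.0),
    Sample(120, 125.0),
    Sample(180, 410.0),
]


def latest(series):
    """Return the most recent sample, i.e. the 'right now' reading."""
    return max(series, key=lambda s: s.timestamp)


def average(series):
    """Return the mean value across all samples in the series."""
    return sum(s.value for s in series) / len(series)


print(latest(requests_per_minute).value)  # 410.0
print(average(requests_per_minute))       # 196.5
```

The latest reading sits far above the average, and that gap between "now" and "normal" is the kind of thing monitoring exists to surface.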

Think of it like the dashboard in your car. You don’t open the hood every five minutes to check the engine, right? You just glance at the speed, the fuel, the temperature. Monitoring is that dashboard, but for your software.

The one-line version

Monitoring answers a simple question on repeat: “Is my system healthy right now, and if not, where does it hurt?”

📊 Logging vs Monitoring

People mix these two up all the time, so let’s clear it up fast. They’re related but they answer different questions.

Logs are records of individual events. Each log line is one thing that happened, like “user Alex logged in at 10:42” or “payment failed for order 123”. Logs are the detailed story, event by event.
Monitoring is about trends and health over time. Instead of single events, it tracks the big-picture numbers, like “errors per minute” or “average response time”, and shows how they move.

Here’s the easy way to remember it. When something breaks, monitoring tells you that it broke and roughly where. Logs help you dig in and find out exactly why. You usually want both.
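
One way to see the relationship: a metric like "errors per minute" can be computed by counting log events. The sketch below assumes an invented "HH:MM:SS LEVEL message" log layout; your real logs will almost certainly look different.

```python
from collections import Counter

# Hypothetical log lines in an invented "HH:MM:SS LEVEL message" layout.
LOG_LINES = [
    "10:42:03 INFO user Alex logged in",
    "10:42:17 ERROR payment failed for order 123",
    "10:42:51 ERROR payment failed for order 124",
    "10:43:08 INFO user Sam logged in",
    "10:43:40 ERROR payment failed for order 125",
]


def errors_per_minute(lines):
    """Turn individual log events into a per-minute error count."""
    counts = Counter()
    for line in lines:
        timestamp, level, _message = line.split(" ", 2)
        if level == "ERROR":
            counts[timestamp[:5]] += 1  # keep only "HH:MM" as the bucket
    return dict(counts)


print(errors_per_minute(LOG_LINES))  # {'10:42': 2, '10:43': 1}
```

The metric tells you that 10:42 was a bad minute. The log lines behind it tell you which orders failed.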

If you want to go deeper on each side, check out the logging lesson for the event side, and metrics for the numbers that feed monitoring.

📈 What You Monitor

Now you might be thinking, there are a thousand things I could measure, where do I even start? Good news. There’s a famous shortlist called the golden signals. These four numbers tell you most of what you need to know about a service.

Here’s each one in plain words:

Latency is how long a request takes to get a response. High latency means users are sitting there waiting.
Traffic is how much demand is coming in, like requests per second. It tells you how busy you are.
Errors is how many requests are failing. A spike here usually means something just broke.
Saturation is how full your resources are, like CPU, memory, or disk. When this gets close to the limit, requests start to queue up, slow down, or fail.

Golden Signal	What it measures	What a bad number means
Latency	How long a request takes to respond	The site feels slow to users
Traffic	How much demand is hitting your system	A sudden spike or drop you didn’t expect
Errors	How many requests are failing	Something is broken right now
Saturation	How full your resources are (CPU, memory, disk)	You’re about to run out of capacity

So if you only watched these four, you’d already catch most real problems. Start here, then add more as you learn what your system needs.
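
Here is a rough sketch of how the four signals could be computed from one window of request records. The `Request` class, the field names, and the choice of the 95th percentile for latency are assumptions made for this example, not a standard API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def golden_signals(requests, window_seconds, cpu_used, cpu_total):
    """Summarize one window of traffic as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * len(durations) + 99) // 100
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation": cpu_used / cpu_total,
    }


# Hypothetical 10-second window: 18 fast requests, one slower, one failed.
window = [Request(100.0, True)] * 18 + [Request(200.0, True), Request(900.0, False)]
signals = golden_signals(window, window_seconds=10, cpu_used=3.0, cpu_total=4.0)
print(signals)
# {'latency_p95_ms': 200.0, 'traffic_rps': 2.0, 'error_rate': 0.05, 'saturation': 0.75}
```

Saturation here uses CPU only to keep the sketch short. In practice you would track whichever resource your service runs out of first.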

🖥️ Dashboards

Raw numbers in a table are hard to read, right? Stare at “latency: 240, 250, 245, 900” (say, in milliseconds) and the spike doesn’t exactly jump out at you. That’s where dashboards come in.

A dashboard is a set of visual charts of your metrics over time, so you can see your system’s health at a glance.
Each chart shows one metric, like a line going up and down. When the line suddenly shoots up or drops, your eye catches it instantly.
The most popular tool for this is Grafana. It pulls in your metrics and draws them as nice graphs on one screen.

Picture a big screen on the wall with a few line charts. Latency is steady, errors are flat near zero, traffic is its usual wave. Then one line spikes red. You don’t need to read a single number. You just see that something’s off, and that’s the whole power of a dashboard.

🚨 From Monitoring to Alerts

Here’s the catch with dashboards. Somebody has to be looking at them. And nobody’s staring at a screen at 3 in the morning, right? So we need the system to poke us when something goes wrong.

That’s what alerting does. You set a threshold, which is just a limit, on a metric.
When the metric crosses that limit, the system fires an alert. For example, “if errors go above 5% for two minutes, page the on-call engineer.”
The alert reaches a human through email, Slack, SMS, or a phone call, depending on how serious it is.

So monitoring is the watching, and alerting is the tap on the shoulder when the watching spots trouble. They work as a pair. You can learn how to set good thresholds and route alerts in the alerting lesson.
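
The example rule "if errors go above 5% for two minutes" can be sketched as a small check. This assumes one error-rate reading per minute, and the function and parameter names are hypothetical. Real alerting tools let you declare rules like this in configuration rather than code.

```python
def should_page(error_rates, threshold=0.05, sustained_samples=2):
    """Decide whether to page, given per-minute error rates, oldest first.

    Fires only if the most recent `sustained_samples` readings are all
    above `threshold`, so a single one-minute blip does not page anyone.
    """
    if len(error_rates) < sustained_samples:
        return False  # not enough history yet to call it sustained
    recent = error_rates[-sustained_samples:]
    return all(rate > threshold for rate in recent)


print(should_page([0.01, 0.09, 0.02]))  # False: the spike did not last
print(should_page([0.01, 0.06, 0.08]))  # True: above 5% for two minutes
```

Requiring the condition to hold for a while is one simple way to keep short blips from paging anyone.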

Don't alert on everything

If every little blip sends you a page, you’ll start ignoring all of them. That’s called alert fatigue. Alert only on things that actually need a human to act, right now.

🧩 The Tools

You don’t have to build any of this from scratch. There’s a well-worn stack people reach for.

Prometheus collects and stores your metrics. Your services expose their numbers, and Prometheus regularly grabs them and keeps the history.
Grafana shows those metrics as dashboards. Prometheus holds the data, Grafana makes it pretty and readable.
If you’re on the cloud, there are built-in options too, like AWS CloudWatch or Google Cloud Monitoring, plus hosted third-party services like Datadog. They do collecting, dashboards, and alerting all in one place.

Here’s how the pieces fit together. Your services send out metrics, the monitoring system gathers them, and then it feeds both your dashboards and your alerts.
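
With Prometheus, "sending out metrics" usually means the service exposes a plain-text page that Prometheus fetches on a schedule. The sketch below builds that text by hand using only the standard library, to show roughly what the page looks like. The metric name and values are made up, and label values are not escaped here. A real service would normally use an official Prometheus client library instead.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric in Prometheus-style text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    assumed to need no escaping, which keeps this sketch short.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter: requests handled so far, split by status code.
page = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests handled.",
    [({"status": "200"}, 1027), ({"status": "500"}, 3)],
)
print(page)
```

Prometheus stores each fetched value with a timestamp, which is how a plain counter on a page turns into history that a dashboard or an alert rule can read.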

⚡ Why It Matters

So why bother setting all this up? A few solid reasons.

You catch issues before users do. That’s the big one. A red dashboard at 2am beats angry tweets at 8am every time.
You understand trends. Watching metrics over weeks shows you slow creep, like latency quietly getting worse as you grow.
You can plan capacity. If saturation keeps climbing month over month, you know you’ll need more servers before you hit the wall.
You can prove your promises. If you told customers “we’ll be up 99.9% of the time”, monitoring is how you measure whether you actually kept that. (That promise is called an SLA, short for Service Level Agreement.)
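
To see what a "99.9%" promise means in practice, here is a small worked calculation. It assumes a 30-day period and a hypothetical health check that runs once a minute. Both function names are invented for the example.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime you can have and still meet the target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def measured_availability(total_checks, failed_checks):
    """Availability as a percentage, from periodic health checks."""
    return 100 * (total_checks - failed_checks) / total_checks


# A 99.9% target over 30 days leaves about 43.2 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))

# Hypothetical month: one check per minute (43,200 checks), 30 of them failed.
print(measured_availability(43_200, 30) >= 99.9)  # True: the promise was kept
```

Thirty failed minutes fits inside the roughly 43-minute budget. Fifty would not, and monitoring is how you would know.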

⚠️ Common Mistakes and Misconceptions

A few things trip people up when they start out. Let’s clear them up.

“I’ll add monitoring after something breaks.” That’s backwards. The whole point is to know before it breaks. Add it early, while things are calm.
“More metrics is always better.” Not really. Watch every single metric and you drown in noise, and the important signal gets lost. Start with the golden signals and grow from there.
“We collect metrics, so we’re covered.” Collecting without dashboards or alerts is half a system. If nobody’s looking and nothing’s alerting, those metrics aren’t helping anyone.
“Logging and monitoring are the same thing.” They’re not. Logs are individual events. Monitoring is trends and health from metrics. You usually want both, doing their own jobs.

🛠️ Design Challenge

Imagine you run a small online store. Sketch out a simple monitoring setup by working through the questions below.

Which golden signals would you watch, and why does each one matter for a store?

What single alert threshold would you set for each signal?

What would you put on the main dashboard so a glance tells you if checkout is healthy?

🧩 What You’ve Learned

You can now explain what monitoring is and how a basic setup hangs together. Here’s what you picked up.

✅ Monitoring is continuously watching system health using metrics, so you catch problems early.
✅ Logs are individual events, while monitoring tracks trends and health over time.
✅ The golden signals are latency, traffic, errors, and saturation.
✅ Dashboards turn metrics into visual charts you can read at a glance, often with Grafana.
✅ When a metric crosses a threshold, an alert fires to a human.
✅ Prometheus collects metrics and Grafana shows them, with cloud options available too.

🚀 What’s Next?

You’ve got the big picture. Next, zoom into the pieces that make monitoring work.

Metrics Explained goes deeper on the numbers that feed your monitoring.
Alerting Systems shows how to turn those metrics into smart notifications that reach the right person.

Get these two down and you’ll have a real grip on observability, the skill every backend and system design interview keeps coming back to.
