Metrics Explained
Here’s a question that sounds simple but trips up a lot of people:
- Your app is running fine today. But is it getting slower over the week?
- How would you even know? You can’t just stare at it all day, right?
- And “it feels a bit slow lately” is not something you can put in front of your team.
What you need is a way to put a number on it. Something you can look at and say “yesterday a page took half a second, today it takes two seconds.” That number, tracked over time, is a metric. By the end of this lesson you’ll know exactly what metrics are, the main types, and why they’re the first thing teams reach for when they want to understand a running system.
🎯 What is a Metric
Let’s start with the plain definition:
- A metric is a numeric measurement tracked over time. That’s it. A number, measured again and again, with each measurement saved.
- Think things like requests per second, error rate, or how much memory your app is using right now.
- The key word is “over time”. One number on its own is not very useful. But the same number, recorded every few seconds for a week, tells you a story.
Here’s the thing that makes metrics special. Each measurement gets stamped with the exact time it was taken:
- A list of numbers, each one tagged with the moment it was recorded, is called a time series. Like “10 requests at 9:00, 14 requests at 9:01, 9 requests at 9:02”, and so on.
- Because every point has a timestamp, you can line them up and draw a graph. And a graph is where trends jump out at you.
- So when someone asks “is the app getting slower?”, you don’t guess. You pull up the latency time series for the last week and look at the line.
Why time matters so much
A single number tells you the present moment. A time series tells you the direction. Going up? Going down? Spiky at 9am every day? That direction is usually what you actually care about, and it only shows up once you track the number over time.
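To make the idea concrete, here is a minimal sketch of a time series in Python. The dates and latency values are made up for illustration, and `trend` is a hypothetical helper, not part of any monitoring tool.

```python
from datetime import datetime, timedelta

# Hypothetical latency samples in milliseconds, one per day for a week.
start = datetime(2024, 1, 1, 9, 0)
latency_series = [
    (start + timedelta(days=i), value)
    for i, value in enumerate([500, 520, 610, 700, 950, 1400, 2000])
]


def trend(series):
    """Return the change between the first and last value in the series."""
    first_value = series[0][1]
    last_value = series[-1][1]
    return last_value - first_value


change_ms = trend(latency_series)
print(f"Latency changed by {change_ms} ms over {len(latency_series)} days")
```

Any single point in that list looks harmless on its own. The 1500 ms climb only shows up because every value carries a timestamp and sits next to its neighbours.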
🔢 Metric Types
Not all numbers behave the same way. Some only ever climb, some bounce around, and some describe a whole spread of values. Knowing which type you’re dealing with changes how you read it. There are three you’ll meet most:
- A counter is a number that only goes up. It counts how many times something happened since the app started. Like total requests served, or total errors. It never goes down (unless the app restarts and resets to zero).
- A gauge is a number that goes up and down. It’s a snapshot of “right now”. Like memory in use, or how many users are connected this second. Check it again later and it might be higher or lower.
- A histogram records the distribution of many values, not just one number. Like the duration of every request, bucketed into ranges so you can ask “how many requests were fast, and how many were slow?”
Here they are side by side so it sticks:
| Type | How it behaves | Example |
|---|---|---|
| Counter | Only goes up, resets on restart | Total requests served, total errors |
| Gauge | Goes up and down, a “right now” snapshot | Memory used, active connections, queue length |
| Histogram | Records a spread of many values | Request durations, response sizes |
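If it helps to see the three types as code, here is a rough sketch. The class names and bucket limits are assumptions for illustration. Real client libraries offer much richer versions, and some (Prometheus, for example) store histogram buckets cumulatively rather than per bucket as done here.

```python
from bisect import bisect_left


class Counter:
    """Only goes up. Starts again from zero when the process restarts."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A right-now snapshot that can move in either direction."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, so the spread of values is kept."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        index = bisect_left(self.upper_bounds, value)
        self.counts[index] += 1


requests_total = Counter()
requests_total.inc()
requests_total.inc(2)

memory_mb = Gauge()
memory_mb.set(512)
memory_mb.set(480)  # a gauge is allowed to drop

# Hypothetical bucket limits in seconds.
request_seconds = Histogram([0.1, 0.5, 1.0])
for duration in (0.05, 0.3, 0.3, 2.0):
    request_seconds.observe(duration)
```

After this runs, the counter holds 3, the gauge holds 480, and the histogram holds one fast request, two medium ones, and one that overflowed the largest bucket.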
A counter that only goes up sounds useless, but it isn’t
You almost never look at the raw counter value. What you do is look at how fast it’s climbing. If “total requests” jumps by 600 in a minute, that’s 10 requests per second. The rate of a counter is the useful part, and your monitoring tools work that out for you.
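Here is a small sketch of that rate calculation, assuming you have two readings of the same counter taken a known number of seconds apart. `rate_per_second` is a hypothetical helper, and the reset handling is a simplified take on what monitoring tools typically do for you.

```python
def rate_per_second(previous, current, interval_seconds):
    """Per-second rate of a counter between two readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if current < previous:
        # The counter went backwards, so assume the app restarted
        # and the counter began again from zero.
        increase = current
    else:
        increase = current - previous
    return increase / interval_seconds


# "Total requests" jumped by 600 in one minute.
print(rate_per_second(12_000, 12_600, 60))  # 10.0
```

The readings 12,000 and 12,600 are invented. Only the difference between them matters, which is why the raw counter value is rarely interesting on its own.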
📊 Common Metrics to Track
You could measure a thousand things, but a handful matter way more than the rest. A well-known starting set is called the golden signals (classically listed as latency, traffic, errors, and saturation), the metrics that tell you most about whether your service is healthy. In everyday terms, that means:
- Request rate. How many requests per second your app is handling. This is your traffic. A sudden drop to zero, or a huge spike, both mean something’s up.
- Error rate. How many of those requests are failing, usually as a percentage. If 1 in 100 requests was erroring yesterday and it’s 1 in 5 today, you have a problem.
- Latency. How long requests take to come back. This is the “is it slow?” number from the start of the lesson.
- CPU and memory. How hard the machine is working, and how much memory your app is eating. These tell you if you’re about to run out of room.
A quick note on where these fit:
- Request rate and error rate are usually worked out from counters (total requests and total errors). CPU and memory are usually gauges. Latency is usually a histogram, because you care about the whole spread, not one average.
- Watching these few signals gives you a surprisingly complete picture without drowning in data. We go deeper on choosing and watching them in Monitoring Basics.
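As a quick worked example, here is how an error rate percentage might be computed from two counts taken over the same time window. `error_rate_percent` is a hypothetical helper, and the numbers mirror the "1 in 100 yesterday, 1 in 5 today" case above.

```python
def error_rate_percent(total_requests, failed_requests):
    """Share of requests that failed, as a percentage."""
    if total_requests == 0:
        # No traffic means nothing failed. Treat it as 0% here.
        return 0.0
    return 100.0 * failed_requests / total_requests


yesterday = error_rate_percent(total_requests=100, failed_requests=1)
today = error_rate_percent(total_requests=100, failed_requests=20)
print(f"yesterday: {yesterday}%  today: {today}%")
```

Returning 0% for zero traffic is a design choice made for this sketch. A drop to zero requests is itself a signal worth watching, as the request rate bullet points out.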
📈 Percentiles Beat Averages
Here’s where a lot of beginners get fooled, so slow down on this one. Say you want to know how slow your app feels. The obvious move is to take the average response time. But the average lies to you. Let me show you why:
- Imagine 100 requests. 99 of them are nice and fast, around 100 milliseconds. But one request takes a painful 5 seconds.
- The average comes out to around 150 milliseconds: (99 × 100 + 5,000) / 100 = 149. Looks great! But one of your users just sat there for 5 whole seconds.
- The average hides the slow outliers. It smooths the pain away into a number that looks fine.
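You can check that arithmetic yourself with a few lines of Python. The list below is just the 100 requests from the example, written out in milliseconds.

```python
# 99 fast requests at 100 ms and one painful request at 5000 ms.
durations_ms = [100] * 99 + [5000]

average_ms = sum(durations_ms) / len(durations_ms)
slowest_ms = max(durations_ms)

print(f"average: {average_ms} ms, slowest: {slowest_ms} ms")
```

The average lands at 149 ms, which says nothing about the request that took 5000 ms.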
So instead of averages, good teams use percentiles. A percentile answers “how slow is the experience for most people, including the unlucky ones?”
- p50 (also called the median) means 50% of requests are faster than this. So half your users see this speed or better. It’s the typical case.
- p95 means 95% of requests are faster than this. Only the slowest 5% are worse. This catches the experience of your less-lucky users.
- p99 means 99% of requests are faster than this. Only the slowest 1% are worse. This is where you spot the real pain.
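Here is one simple way to compute these by hand, using the nearest-rank method: sort the values and pick the one at the matching position. The sample data is hypothetical (90 fast requests, 6 slower ones, 4 very slow ones), and real monitoring tools may use a different percentile method, so treat this as a sketch.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with p% of values at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample of 100 request durations in milliseconds.
sample_ms = [100] * 90 + [400] * 6 + [5000] * 4

p50 = percentile(sample_ms, 50)
p95 = percentile(sample_ms, 95)
p99 = percentile(sample_ms, 99)
mean_ms = sum(sample_ms) / len(sample_ms)

print(f"mean={mean_ms} p50={p50} p95={p95} p99={p99}")
```

For this sample the mean is 314 ms, p50 is 100 ms, p95 is 400 ms, and p99 is 5000 ms. The three percentiles together describe the typical case and the tail, which the mean blends into one misleading number.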
The point to remember:
- p99 latency tells you about your worst-served users, the ones who’d actually complain or leave. An average never shows them to you.
- That’s exactly why you store latency as a histogram. The histogram keeps the spread as counts per bucket, so the tool can estimate p50, p95, and p99 for you.
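A histogram doesn’t keep every individual duration, only how many values fell into each bucket, so percentiles read from it are estimates. The sketch below shows the simplest version of that idea: walk the buckets until you have passed the target share of requests, then report that bucket’s upper limit. The bucket limits and counts are hypothetical, and real tools typically refine the answer, for example by interpolating inside the bucket.

```python
def estimate_percentile(upper_bounds, counts, p):
    """Estimate a percentile as the upper limit of the bucket that contains it.

    upper_bounds: sorted upper limit of each bucket.
    counts: number of observations in each bucket (not cumulative).
    """
    if len(upper_bounds) != len(counts):
        raise ValueError("upper_bounds and counts must have the same length")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = p * total / 100
    seen = 0
    for bound, count in zip(upper_bounds, counts):
        seen += count
        if seen >= target:
            return bound
    # Not reached for valid input, since seen ends up equal to total.
    return upper_bounds[-1]


# Hypothetical latency histogram: bucket limits in ms and counts per bucket.
bounds_ms = [100, 500, 1000, 5000]
bucket_counts = [90, 6, 0, 4]

for p in (50, 95, 99):
    print(f"p{p} is at most {estimate_percentile(bounds_ms, bucket_counts, p)} ms")
```

The trade-off is precision for cost: a handful of bucket counts is far cheaper to store than every single duration, and the estimate is only as fine-grained as the buckets you choose.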
Don't report a single average and call it a day
“Average response time is 150ms” can be true while a chunk of your users are having a terrible time. Always ask for the p95 and p99 alongside it. The gap between the average and the p99 is often where the real story hides.
🆚 Metrics vs Logs
People mix these two up, so let’s draw a clean line. Both help you understand your system, but they do different jobs:
- Metrics are cheap numeric trends. Just numbers over time, like “errors per second”. They’re small, fast to store, and great for spotting that something changed.
- Logs are detailed event records. Each log line describes one specific thing that happened, with full detail, like “request from user 42 to /checkout failed with a database timeout at 9:03:11”.
- So metrics tell you that something is wrong and roughly when. Logs help you dig into what exactly went wrong and why.
You don’t pick one. You use both together. The metric graph spikes, you notice it, then you open the logs around that time to find the actual cause.
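Here is a tiny sketch of the same failure recorded both ways. The metric name `checkout_errors_total`, the `record_failure` helper, and the log fields are all made up for illustration.

```python
import json

# The metric: one number that only says "how many".
metrics = {"checkout_errors_total": 0}


def record_failure(user_id, path, reason, timestamp):
    """Count the failure as a metric and describe it as a log line."""
    metrics["checkout_errors_total"] += 1
    # The log: one detailed record of this specific event.
    return json.dumps(
        {"time": timestamp, "user_id": user_id, "path": path, "error": reason}
    )


log_line = record_failure(42, "/checkout", "database timeout", "09:03:11")
print(metrics)
print(log_line)
```

The metric costs one integer no matter how many failures happen. The log grows by one line per failure, which is exactly why it can hold the detail the metric can’t.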
⚙️ How Metrics Are Collected
So where do these numbers come from, and how do they end up on a graph? The flow is pretty consistent across most setups:
- Your app exposes its metrics. It keeps counters and gauges in memory and publishes them, usually on a little internal web page like `/metrics` that just lists the current numbers.
- A collection system scrapes them. A tool like Prometheus visits that page every few seconds and grabs the numbers. “Scrape” just means it pulls the values on a schedule.
- The collector stores them in a time-series database, each value stamped with the time it was collected.
- A dashboard tool then graphs them over time, so you and your team can actually see the lines.
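Here’s that whole path in one line: your app exposes `/metrics` → a collector such as Prometheus scrapes it on a schedule → a time-series database stores each value with its timestamp → a dashboard graphs the result.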
Prometheus in one line
Prometheus is a very common open-source tool that scrapes metrics from your apps and stores them as time series. You don’t need to know it in depth yet, just know that “something pulls the numbers on a schedule and keeps them” is the normal pattern.
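To make the “expose” step less abstract, here is a rough sketch of what an app might print on its `/metrics` page, written in the style of the Prometheus text format. The metric names and values are invented, and `render_metrics` is a hypothetical helper. In practice a client library builds this page for you.

```python
def render_metrics(samples):
    """Render current metric values as plain text, one metric per block.

    samples maps a metric name to a (type, value) pair.
    """
    lines = []
    for name, (metric_type, value) in sorted(samples.items()):
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical current values held in the app's memory.
current = {
    "http_requests_total": ("counter", 1027),
    "memory_bytes": ("gauge", 52428800),
}

page = render_metrics(current)
print(page)
```

Notice there are no timestamps on the page. The app only reports “the numbers right now”, and the collector stamps each value with the time it scraped it.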
⚡ Why Metrics Matter
Okay, so why go to all this trouble? Because metrics turn vague worries into things you can act on:
- Spot trends. That opening question, “is the app getting slower over the week?”, is now a five-second glance at a graph instead of a gut feeling.
- Set alerts. You can tell the system “ping me if the error rate goes above 2%”. Now you find out before your users do, not after they complain.
- Find slowdowns. When something feels off, the graphs show you which metric moved and when, so you know where to start looking.
- Plan capacity. If memory usage has been creeping up week after week, you can see it coming and add resources before you run out.
In short, metrics are how you go from “I think it’s fine” to “I know it’s fine, and here’s the graph.”
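The alert idea boils down to comparing a time series against a threshold. Here is a minimal sketch using the 2% error rate example from above. The samples are invented and `breaches` is a hypothetical helper. Real alerting systems add things like “only fire if this lasts a few minutes” so one bad sample doesn’t wake anyone up.

```python
def breaches(series, threshold):
    """Return the timestamps where the value went above the threshold."""
    return [timestamp for timestamp, value in series if value > threshold]


# Hypothetical error rate samples, in percent, one per minute.
error_rate_series = [
    ("09:00", 0.8),
    ("09:01", 1.5),
    ("09:02", 2.4),
    ("09:03", 3.1),
]

alert_times = breaches(error_rate_series, threshold=2.0)
if alert_times:
    print(f"Error rate above 2% at: {', '.join(alert_times)}")
```

With these samples the check flags 09:02 and 09:03, which also tells you where in the logs to start reading.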
⚠️ Common Mistakes and Misconceptions
A few things trip people up early. Let’s clear them out:
- “The average tells the whole story.” It doesn’t. The average hides slow outliers. Use p95 and p99 to see what your unluckiest users actually experience.
- “Track everything, just in case.” Tempting, but it buries the signals that matter under noise, and it costs storage. Start with the golden signals (rate, errors, latency, resources) and add more only when you have a real question.
- “A counter and a gauge are basically the same.” No. A counter only climbs and you read its rate. A gauge goes up and down and you read its current value. Mixing them up gives you nonsense graphs.
- “Metrics will tell me exactly what broke.” Not on their own. Metrics tell you that something changed and when. To find the actual cause, you go to your logs.
🛠️ Design Challenge
Try this on your own to test yourself.
You’re running a checkout service and the team says “checkout feels slow sometimes.” Decide which metrics you’d track to investigate. For example:
- Request rate on `/checkout`, to see traffic patterns.
- Error rate, to check if failures are creeping up.
- Latency as a histogram, so you can look at p50, p95, and p99 rather than just the average.
Then ask yourself: if the average latency looks fine but users still complain, which number would you check next, and why? This is exactly the kind of reasoning interviewers love to see.
🧩 What You’ve Learned
You can now explain what metrics are and how teams use them. Here’s what you’ve picked up.
- ✅ A metric is a numeric measurement tracked over time, stored as a time series with timestamps.
- ✅ The three main types are counters (only go up), gauges (go up and down), and histograms (record a distribution).
- ✅ The golden signals to watch are request rate, error rate, latency, and CPU/memory.
- ✅ Percentiles like p95 and p99 beat averages, because averages hide slow outliers.
- ✅ Metrics give cheap numeric trends, while logs give detailed events, and you use both.
- ✅ Apps expose metrics, a tool like Prometheus scrapes and stores them, and dashboards graph them.
Check Your Knowledge
Test what you learned. Answer each question yourself, then compare with the explanation.
1. What is a metric?
   Why: A metric is a number measured again and again over time, with each value stamped with the moment it was taken.
2. How does a counter differ from a gauge?
   Why: A counter only climbs and you read its rate, while a gauge moves up and down as a right-now snapshot.
3. Why are percentiles better than averages for latency?
   Why: An average can look fine while a few users wait seconds, but p95 and p99 reveal that slow tail.
4. What does p99 latency mean?
   Why: p99 means only the slowest 1% of requests are above that value, which catches the painful tail.
🚀 What’s Next?
Now that you know what to measure, the next step is learning how to watch it and how to dig deeper when something breaks.
- Monitoring Basics shows how to turn these metrics into dashboards and alerts that catch problems early.
- Distributed Tracing follows a single request across many services, so you can see exactly where the time went.
Get these down and you’ll have the core of observability: knowing your system is healthy, and knowing where to look when it isn’t.