Metrics Explained

Here’s a question that sounds simple but trips up a lot of people:

Your app is running fine today. But is it getting slower over the week?
How would you even know? You can’t just stare at it all day, right?
And “it feels a bit slow lately” is not something you can put in front of your team.

What you need is a way to put a number on it. Something you can look at and say “yesterday a page took half a second, today it takes two seconds.” That number, tracked over time, is a metric. By the end of this lesson you’ll know exactly what metrics are, the main types, and why they’re the first thing teams reach for when they want to understand a running system.

🎯 What is a Metric

Let’s start with the plain definition:

A metric is a numeric measurement tracked over time. That’s it. A number, measured again and again, with each measurement saved.
Think things like requests per second, error rate, or how much memory your app is using right now.
The key word is “over time”. One number on its own is not very useful. But the same number, recorded every few seconds for a week, tells you a story.

Here’s the thing that makes metrics special. Each measurement gets stamped with the exact time it was taken:

A list of numbers, each one tagged with the moment it was recorded, is called a time series. Like “10 requests at 9:00, 14 requests at 9:01, 9 requests at 9:02”, and so on.
Because every point has a timestamp, you can line them up and draw a graph. And a graph is where trends jump out at you.
So when someone asks “is the app getting slower?”, you don’t guess. You pull up the latency time series for the last week and look at the line.
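
To make the idea concrete, here is a minimal sketch of a time series in Python, using the three readings from the example above. The date and the `trend` helper are made up for illustration. Real tools do much more than compare the first and last point.

```python
from datetime import datetime, timedelta

# Hypothetical date; the counts are the ones from the example above.
start = datetime(2024, 1, 1, 9, 0)
counts = [10, 14, 9]

# A time series is just a list of (timestamp, value) pairs.
series = [(start + timedelta(minutes=i), c) for i, c in enumerate(counts)]


def trend(points):
    """Return 'up', 'down', or 'flat' by comparing the first and last value."""
    first, last = points[0][1], points[-1][1]
    if last > first:
        return "up"
    if last < first:
        return "down"
    return "flat"


for ts, value in series:
    print(ts.strftime("%H:%M"), value)
print(trend(series))
```

Because every value carries its timestamp, sorting, graphing, and comparing "then vs now" all become simple list operations.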

Why time matters so much

A single number tells you the present moment. A time series tells you the direction. Going up? Going down? Spiky at 9am every day? That direction is usually what you actually care about, and it only shows up once you track the number over time.

🔢 Metric Types

Not all numbers behave the same way. Some only ever climb, some bounce around, and some describe a whole spread of values. Knowing which type you’re dealing with changes how you read it. There are three you’ll meet most:

A counter is a number that only goes up. It counts how many times something happened since the app started. Like total requests served, or total errors. It never goes down (unless the app restarts and resets to zero).
A gauge is a number that goes up and down. It’s a snapshot of “right now”. Like memory in use, or how many users are connected this second. Check it again later and it might be higher or lower.
A histogram records the distribution of many values, not just one number. Like the duration of every request, bucketed into ranges so you can ask “how many requests were fast, and how many were slow?”

Here they are side by side so it sticks:

Type	How it behaves	Example
Counter	Only goes up, resets on restart	Total requests served, total errors
Gauge	Goes up and down, a “right now” snapshot	Memory used, active connections, queue length
Histogram	Records a spread of many values	Request durations, response sizes
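
If it helps to see the three types as code, here is a rough sketch. The class names, method names, and bucket bounds are assumptions for illustration, not the API of any particular metrics library.

```python
import bisect


class Counter:
    """Only goes up; it starts again from zero when the process restarts."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A 'right now' snapshot that can move in either direction."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets defined by their upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1


requests_total = Counter()
memory_bytes = Gauge()
latency_seconds = Histogram([0.1, 0.5, 1.0])  # hypothetical bucket bounds

for duration in [0.05, 0.08, 0.3, 2.0]:
    requests_total.inc()
    latency_seconds.observe(duration)

memory_bytes.set(512_000_000)
memory_bytes.set(480_000_000)  # a gauge is allowed to go down

print(requests_total.value, memory_bytes.value, latency_seconds.counts)
```

Notice the histogram never stores the individual durations. It only remembers how many landed in each range, which is what keeps it cheap.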

A counter that goes up sounds useless, but it isn't

You almost never look at the raw counter value. What you do is look at how fast it’s climbing. If “total requests” jumps by 600 in a minute, that’s 10 requests per second. The rate of a counter is the useful part, and your monitoring tools work that out for you.
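
Here is roughly how a rate can be worked out from raw counter readings. This is a simplified sketch, not how any specific monitoring tool does it. The sample values are made up, and a drop in the counter is assumed to mean a restart.

```python
def rate_per_second(samples):
    """Per-second rate from a list of (seconds, counter_value) samples.

    A drop in value is treated as a restart: the counter began again from
    zero, so the new value itself is the increase since the reset.
    """
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("need at least two samples spread over time")
    increase = 0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed


# The counter climbs by 600 over 60 seconds.
steady = [(0, 1000), (30, 1300), (60, 1600)]
print(rate_per_second(steady))

# The app restarted between the second and third reading.
restarted = [(0, 1000), (30, 1300), (60, 200)]
print(rate_per_second(restarted))
```

The first series gives 10 requests per second, matching the 600-in-a-minute example. The second shows why a reset shouldn't be read as a negative rate.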

📊 Common Metrics to Track

You could measure a thousand things, but a handful matter way more than the rest. A well-known starting set is called the golden signals, the metrics that tell you most about whether your service is healthy:

Request rate. How many requests per second your app is handling. This is your traffic. A sudden drop to zero, or a huge spike, both mean something’s up.
Error rate. How many of those requests are failing, usually as a percentage. If 1 in 100 requests was erroring yesterday and it’s 1 in 5 today, you have a problem.
Latency. How long requests take to come back. This is the “is it slow?” number from the start of the lesson.
CPU and memory. How hard the machine is working, and how much memory your app is eating. This is often called saturation, and it tells you if you’re about to run out of room.

A quick note on where these fit:

Request rate and error rate are usually worked out from counters (total requests and total errors). CPU and memory are usually gauges. Latency is usually a histogram, because you care about the whole spread, not one average.
Watching these few signals gives you a surprisingly complete picture without drowning in data. We go deeper on choosing and watching them in Monitoring Basics.
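
As a small worked example, error rate is just failed requests divided by all requests over the same window. The function name and the request counts below are made up to match the "1 in 100" and "1 in 5" cases above.

```python
def error_rate_percent(requests_increase, errors_increase):
    """Share of requests that failed in a window, as a percentage."""
    if requests_increase == 0:
        return 0.0  # no traffic, so nothing failed
    return 100.0 * errors_increase / requests_increase


# Hypothetical windows of 6000 requests each.
yesterday = error_rate_percent(requests_increase=6000, errors_increase=60)
today = error_rate_percent(requests_increase=6000, errors_increase=1200)
print(yesterday, today)
```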

📈 Percentiles Beat Averages

Here’s where a lot of beginners get fooled, so slow down on this one. Say you want to know how slow your app feels. The obvious move is to take the average response time. But the average lies to you. Let me show you why:

Imagine 100 requests. 99 of them are nice and fast, around 100 milliseconds. But one request takes a painful 5 seconds.
The average comes out to around 150 milliseconds. Looks great! But one of your users just sat there for 5 whole seconds.
The average hides the slow outliers. It smooths the pain away into a number that looks fine.
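
You can check that arithmetic yourself in a couple of lines, using the same 100 requests:

```python
# 99 fast requests at 100 ms and one slow request at 5000 ms.
durations_ms = [100] * 99 + [5000]

average_ms = sum(durations_ms) / len(durations_ms)
slowest_ms = max(durations_ms)

print(average_ms)  # the number on the dashboard
print(slowest_ms)  # what one user actually waited
```

The average lands at 149 ms, while the slowest request is more than 30 times that.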

So instead of averages, good teams use percentiles. A percentile answers “how slow is the experience for most people, including the unlucky ones?”

p50 (also called the median) means 50% of requests are faster than this. So half your users see this speed or better. It’s the typical case.
p95 means 95% of requests are faster than this. Only the slowest 5% are worse. This catches the experience of your less-lucky users.
p99 means 99% of requests are faster than this. Only the slowest 1% are worse. This is where you spot the real pain.
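
Here is a minimal sketch of how those numbers can be computed from raw durations. It uses the simple "nearest rank" method, which is one of several ways to define a percentile. The data set is hypothetical: 1000 requests where most are fast and a few are very slow.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    all values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical traffic: 940 fast, 40 slow-ish, 20 painful.
durations_ms = [100] * 940 + [800] * 40 + [5000] * 20

average_ms = sum(durations_ms) / len(durations_ms)
print("average", average_ms)
for p in (50, 95, 99):
    print(f"p{p}", percentile(durations_ms, p))
```

On this data the average is 226 ms, p50 is 100 ms, p95 is 800 ms, and p99 is 5000 ms. The average sits close to the typical case and says nothing about the users stuck at 5 seconds.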

The point to remember:

p99 latency tells you about your worst-served users, the ones who’d actually complain or leave. An average never shows them to you.
That’s exactly why you store latency as a histogram. The histogram keeps the spread as counts per bucket, so the tool can estimate p50, p95, and p99 for you. How close the estimate is depends on how fine the buckets are.
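
To see how a percentile can come out of bucket counts alone, here is a rough sketch. It simply returns the upper bound of the bucket the percentile falls into. Real tools often interpolate inside the bucket instead, so treat this as an illustration of the idea. The bucket bounds and counts are made up.

```python
import math


def estimate_percentile(upper_bounds, counts, p):
    """Return the upper bound of the bucket that contains the p-th percentile.

    `counts[i]` is how many observations fell into the bucket ending at
    `upper_bounds[i]` (per-bucket counts, not running totals).
    """
    total = sum(counts)
    rank = max(1, math.ceil(p * total / 100))
    cumulative = 0
    for bound, count in zip(upper_bounds, counts):
        cumulative += count
        if cumulative >= rank:
            return bound
    return math.inf  # slower than the largest bucket


# Hypothetical latency buckets in milliseconds and how many requests hit each.
upper_bounds_ms = [100, 250, 500, 1000, 5000]
counts = [940, 0, 0, 40, 20]

for p in (50, 95, 99):
    print(f"p{p} <= {estimate_percentile(upper_bounds_ms, counts, p)} ms")
```

The answer can only be as precise as the buckets. Here p95 is reported as "at most 1000 ms" even if every request in that bucket actually took 800 ms.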

Don't report a single average and call it a day

“Average response time is 150ms” can be true while a chunk of your users are having a terrible time. Always ask for the p95 and p99 alongside it. The gap between the average and the p99 is often where the real story hides.

🆚 Metrics vs Logs

People mix these two up, so let’s draw a clean line. Both help you understand your system, but they do different jobs:

Metrics are cheap numeric trends. Just numbers over time, like “errors per second”. They’re small, fast to store, and great for spotting that something changed.
Logs are detailed event records. Each log line describes one specific thing that happened, with full detail, like “request from user 42 to /checkout failed with a database timeout at 9:03:11”.
So metrics tell you that something is wrong and roughly when. Logs help you dig into what exactly went wrong and why.

You don’t pick one. You use both together. The metric graph spikes, you notice it, then you open the logs around that time to find the actual cause.

⚙️ How Metrics Are Collected

So where do these numbers come from, and how do they end up on a graph? The flow is pretty consistent across most setups:

Your app exposes its metrics. It keeps counters and gauges in memory and publishes them, usually on a little internal web page like /metrics that just lists the current numbers.
A collection system scrapes them. A tool like Prometheus visits that page on a fixed interval, commonly every 15 to 60 seconds, and grabs the numbers. “Scrape” just means it pulls the values on a schedule.
The collector stores them in a time-series database, each value stamped with the time it was collected.
A dashboard tool then graphs them over time, so you and your team can actually see the lines.

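Here’s that whole path in one line: app exposes /metrics -> collector scrapes it on a schedule -> time-series database stores each value with its timestamp -> dashboard graphs it.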

Prometheus in one line

Prometheus is a very common open-source tool that scrapes metrics from your apps and stores them as time series. You don’t need to know it in depth yet, just know that “something pulls the numbers on a schedule and keeps them” is the normal pattern.
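
The expose-and-scrape flow can be sketched in a few lines. This is a toy version under some loud assumptions: the "page" is a plain "name value" text format (a simplified subset of what real tools use), `fetch` stands in for an HTTP request to /metrics, and the store is just a dictionary in memory. All function and metric names are hypothetical.

```python
import time


def render_metrics(metrics):
    """App side: list the current numbers, one 'name value' pair per line."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"


def parse_metrics(text):
    """Collector side: turn the page back into a {name: value} dict."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def scrape(fetch, store, now=time.time):
    """Pull the page once and append each value to its time series."""
    timestamp = now()
    for name, value in parse_metrics(fetch()).items():
        store.setdefault(name, []).append((timestamp, value))


# Hypothetical app state and an in-memory stand-in for the database.
app_metrics = {"http_requests_total": 1000, "memory_bytes": 512000000}
store = {}

scrape(lambda: render_metrics(app_metrics), store, now=lambda: 0)
app_metrics["http_requests_total"] = 1600  # the app keeps serving traffic
scrape(lambda: render_metrics(app_metrics), store, now=lambda: 60)

print(store["http_requests_total"])
```

After two scrapes the store holds two timestamped points per metric, which is exactly the time series a dashboard would draw.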

⚡ Why Metrics Matter

Okay, so why go to all this trouble? Because metrics turn vague worries into things you can act on:

Spot trends. That opening question, “is the app getting slower over the week?”, is now a five-second glance at a graph instead of a gut feeling.
Set alerts. You can tell the system “ping me if the error rate goes above 2%”. Now you find out before your users do, not after they complain.
Find slowdowns. When something feels off, the graphs show you which metric moved and when, so you know where to start looking.
Plan capacity. If memory usage has been creeping up week after week, you can see it coming and add resources before you run out.
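
An alert rule like "ping me if the error rate goes above 2%" boils down to a small check over recent readings. Here is one possible sketch. Requiring several readings in a row is an assumption added here to avoid alerting on a single blip, and the function name and sample values are made up.

```python
def should_alert(error_rates_percent, threshold=2.0, consecutive=3):
    """True when the last `consecutive` readings are all above the threshold."""
    recent = error_rates_percent[-consecutive:]
    return len(recent) == consecutive and all(r > threshold for r in recent)


sustained = [0.5, 2.5, 3.0, 4.1]  # stays above 2% for three readings
one_blip = [0.5, 3.0, 0.4, 2.5]   # crosses 2% but recovers in between

print(should_alert(sustained))
print(should_alert(one_blip))
```

The first series triggers the alert and the second does not, even though both cross 2% at some point.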

In short, metrics are how you go from “I think it’s fine” to “I know it’s fine, and here’s the graph.”

⚠️ Common Mistakes and Misconceptions

A few things trip people up early. Let’s clear them out:

“The average tells the whole story.” It doesn’t. The average hides slow outliers. Use p95 and p99 to see what your unluckiest users actually experience.
“Track everything, just in case.” Tempting, but it buries the signals that matter under noise, and it costs storage. Start with the golden signals (rate, errors, latency, resources) and add more only when you have a real question.
“A counter and a gauge are basically the same.” No. A counter only climbs and you read its rate. A gauge goes up and down and you read its current value. Mixing them up gives you nonsense graphs.
“Metrics will tell me exactly what broke.” Not on their own. Metrics tell you that something changed and when. To find the actual cause, you go to your logs.

🛠️ Design Challenge

You’re running a checkout service and the team says “checkout feels slow sometimes.” Work through the questions below.

Which metrics would you track to investigate the slowness?

Why store latency as a histogram instead of a single average?

If the average latency looks fine but users still complain, which number would you check next, and why?

🧩 What You’ve Learned

You can now explain what metrics are and how teams use them. Here’s what you’ve picked up.

✅ A metric is a numeric measurement tracked over time, stored as a time series with timestamps.
✅ The three main types are counters (only go up), gauges (go up and down), and histograms (record a distribution).
✅ The golden signals to watch are request rate, error rate, latency, and CPU/memory.
✅ Percentiles like p95 and p99 beat averages, because averages hide slow outliers.
✅ Metrics give cheap numeric trends, while logs give detailed events, and you use both.
✅ Apps expose metrics, a tool like Prometheus scrapes and stores them, and dashboards graph them.

🚀 What’s Next?

Now that you know what to measure, the next step is learning how to watch it and how to dig deeper when something breaks.

Monitoring Basics shows how to turn these metrics into dashboards and alerts that catch problems early.
Distributed Tracing follows a single request across many services, so you can see exactly where the time went.

Get these down and you’ll have the core of observability: knowing your system is healthy, and knowing where to look when it isn’t.
