Distributed Tracing Explained
Picture this. A user opens your app, taps a button, and the screen just sits there spinning. The request finally comes back after four whole seconds. Four seconds is forever, right? So you go to fix it.
But here’s the catch:
- That one request didn’t touch just one server. It bounced through eight different services on its way.
- It hit the API gateway, then the auth service, then the cart service, then payments, then a couple of databases, and so on.
- Somewhere in that chain, something was slow. But which one?
That’s the question we’re going to answer in this lesson. By the end, you’ll know how to follow a single request through all eight services and point straight at the slow one.
🎯 The Problem
In a small app, this is easy. One server does everything, so if it’s slow, you look at that one server. Done.
But modern systems are usually built as microservices. A microservice is just one small service that does one job, and a big app is made of many of them talking to each other. (We cover the headaches of this setup in Challenges of Microservices.)
Now here’s the pain this creates:
- One user request doesn’t stay in one place. It hops from service to service, like a baton in a relay race.
- Each service writes its own logs in its own corner. A log is just a line of text a service prints to say what it did.
- So when something is slow or broken, you’re staring at eight separate log piles, trying to mentally stitch them together.
- And you can’t even tell which log lines belong to your request, because thousands of other requests are writing logs at the same time.
Logs alone just don’t show you the full journey. You need something that follows one request from start to finish.
🧭 What is Distributed Tracing
So let’s define it plainly. Distributed tracing is a way of following a single request as it travels through all the services it touches, so you can see the whole path and how long each step took.
Think of it like tracking a parcel:
- When you order something online, the parcel gets a tracking number.
- That number stays with the parcel the whole way: warehouse, truck, sorting center, your doorstep.
- At every stop, the system records when the parcel arrived and when it left.
- So at the end, you can see the full route and spot where it sat around for two days.
Distributed tracing does exactly that, but for a request moving through your services. The request gets a tracking number, every service records its part, and you get the full timeline at the end.
🆔 Traces and Spans
These are the two words you’ll hear most, so let’s nail them down right now.
- A trace is the whole journey of one request, end to end. It’s the complete story, from the moment the request arrives to the moment the answer goes back.
- A span is one single step inside that journey. It’s one operation in one service, like “auth service checked the password” or “database ran the query.”
So a trace is made up of many spans. One trace, many spans. And here’s the glue that holds them together:
- Every span in the same journey shares one trace id. A trace id is just a unique tracking number for that one request.
- Each span also has its own span id, so you can tell the steps apart.
- Because they all carry the same trace id, the tracing system can gather them up and know they belong to the same request.
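Picture one request flowing through three services: the gateway, the auth service, and the database. Each service records its own span with its own span id, but every span carries the same trace id.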
This table sums up the difference so you don’t mix them up.
| Trace | Span |
|---|---|
| The whole journey of one request | One step within that journey |
| Made up of many spans | Belongs to exactly one trace |
| Has one trace id | Has its own span id, plus the shared trace id |
| Shows the full path and total time | Shows what one service did and how long it took |
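To make the trace and span split concrete, here's a minimal sketch of what a span record could look like. It's not any real library's format: the `Span` class, the `new_id` helper, the field names, and the timings are all made up for illustration. The `parent_id` field is an extra assumption, added so you can see which step triggered which.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str              # same for every span in one request
    span_id: str               # unique to this one step
    parent_id: Optional[str]   # assumption: the span that triggered this step
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def new_id(num_bytes: int) -> str:
    """Return a random hex id (hypothetical helper)."""
    return secrets.token_hex(num_bytes)


trace_id = new_id(16)  # one tracking number for the whole request
gateway = Span(trace_id, new_id(8), None, "gateway", "handle request", 0, 120)
auth = Span(trace_id, new_id(8), gateway.span_id, "auth", "check password", 5, 40)
database = Span(trace_id, new_id(8), auth.span_id, "database", "run query", 10, 35)

# A trace is simply all the spans that share one trace id.
trace = [gateway, auth, database]
for span in trace:
    print(span.trace_id[:8], span.span_id, span.service, span.duration_ms, "ms")
```

Run it and every printed line starts with the same trace id prefix, while the span id differs on each line. That's the "one trace, many spans" idea in data form.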
⚙️ How It Works
Okay, so how does this actually happen in the background? It’s simpler than you’d think.
- When the request first arrives at the edge of your system, a trace id gets created for it. The edge is usually the first service it hits, like the API gateway.
- As the request moves to the next service, that trace id gets passed along with it. This passing-along is called context propagation, which just means carrying the tracking info from one service to the next.
- Each service records its own span: when its work started, when it finished, and which trace id it belongs to.
- All these spans get sent off to a tracing system, a separate tool whose job is to collect them.
- The tracing system sees they all share the same trace id, so it stitches them together into one timeline you can look at.
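The steps above can be sketched in a few lines of Python. This is a toy simulation, not a real tracing library: services are plain function calls, the `x-trace-id` header name is hypothetical, and the `collected_spans` list stands in for the tracing system. Real setups usually rely on a standard header such as the W3C Trace Context `traceparent` header instead.

```python
import secrets
import time
from collections import defaultdict

TRACE_HEADER = "x-trace-id"  # hypothetical header name, for illustration only
collected_spans = []         # stands in for the tracing system's storage


def handle(chain, headers):
    """Do the work of chain[0], then call the rest of the chain."""
    service, rest = chain[0], chain[1:]
    # Reuse the incoming trace id, or create one if this is the edge.
    trace_id = headers.get(TRACE_HEADER) or secrets.token_hex(16)
    start = time.perf_counter()
    if rest:
        # Context propagation: send the trace id along with the next call.
        handle(rest, {TRACE_HEADER: trace_id})
    end = time.perf_counter()
    # Each service records its own span and ships it to the collector.
    collected_spans.append(
        {"trace_id": trace_id, "service": service, "start": start, "end": end}
    )


def stitch(spans):
    """Group spans by trace id and order each group by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    for group in traces.values():
        group.sort(key=lambda s: s["start"])
    return dict(traces)


# The request arrives at the gateway with no trace id yet.
handle(["gateway", "auth", "cart", "payments"], headers={})
traces = stitch(collected_spans)
for trace_id, timeline in traces.items():
    print(trace_id, [span["service"] for span in timeline])
```

Only the gateway generates an id. Every other service finds one in the incoming headers and reuses it, which is why `stitch` ends up with a single group of four spans.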
If the trace id isn't passed along, the trace breaks
The whole thing depends on every service passing the trace id forward. If one service forgets to propagate it, the next service starts a brand new trace. Now your journey is split in two, and the picture has a hole in it right where you need it.
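Here's a small sketch of that failure, assuming each service reuses whatever trace id it receives and creates a fresh one when it receives nothing. The `run_request` function and its `forgetful` parameter are hypothetical, written only to show the split.

```python
import secrets


def run_request(services, forgetful=None):
    """Return (service, trace_id) pairs for one request passing through
    the services in order. `forgetful` names a service that does not
    pass the trace id forward."""
    spans = []
    incoming_trace_id = None  # nothing arrives with the request at the edge
    for service in services:
        # With no incoming id, a service assumes it is starting a new trace.
        trace_id = incoming_trace_id or secrets.token_hex(16)
        spans.append((service, trace_id))
        # A forgetful service sends nothing downstream.
        incoming_trace_id = None if service == forgetful else trace_id
    return spans


def count_traces(spans):
    return len({trace_id for _, trace_id in spans})


services = ["gateway", "auth", "cart", "payments"]
healthy = run_request(services)
broken = run_request(services, forgetful="auth")

print("healthy:", count_traces(healthy), "trace")
print("broken: ", count_traces(broken), "traces")
for service, trace_id in broken:
    print(f"  {service:<10}{trace_id[:8]}")
```

In the broken run, the gateway and auth spans share one id, while cart and payments share a different one. The tracing system has no way to know those two halves belong to the same request.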
⏱️ What It Shows You
Once those spans are stitched together, you get a timeline view of the request. And this is where the magic happens.
- You see the full path: which services the request touched, and in what order.
- You see how long each service took. Each span has a start and end time, so the slow steps stand out as long bars.
- You see where errors happened, because a span can be marked as failed.
- And from all that, you spot the bottleneck: the one step eating most of the time. A bottleneck is just the slowest part that holds everything else up.
So back to our four-second request. With a trace, you’d open the timeline and instantly see that seven services finished in milliseconds, but the payments service took 3.8 seconds. There’s your culprit. No more guessing across eight log piles.
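If you had the spans as plain data, finding that culprit would be a one-liner. The sketch below uses invented numbers that match the story: eight services, seven of them quick, and payments taking 3.8 seconds. The service names beyond gateway, auth, cart, and payments are made up, and each span is assumed to cover only that service's own work, one after the other.

```python
# Hypothetical spans for the four-second request (times in milliseconds).
spans = [
    {"service": "gateway",       "start_ms": 0,    "end_ms": 15},
    {"service": "auth",          "start_ms": 15,   "end_ms": 40},
    {"service": "cart",          "start_ms": 40,   "end_ms": 70},
    {"service": "inventory",     "start_ms": 70,   "end_ms": 95},
    {"service": "pricing",       "start_ms": 95,   "end_ms": 120},
    {"service": "payments",      "start_ms": 120,  "end_ms": 3920},
    {"service": "orders-db",     "start_ms": 3920, "end_ms": 3950},
    {"service": "notifications", "start_ms": 3950, "end_ms": 3970},
]


def duration_ms(span):
    return span["end_ms"] - span["start_ms"]


# Total time is measured from the first start to the last end.
total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)

# The bottleneck is the span with the longest duration.
bottleneck = max(spans, key=duration_ms)
share = duration_ms(bottleneck) / total_ms

# A rough text version of the timeline view: one '#' per 100 ms.
for span in sorted(spans, key=lambda s: s["start_ms"]):
    bar = "#" * max(1, round(duration_ms(span) / 100))
    print(f"{span['service']:<14}{duration_ms(span):>6} ms  {bar}")

print(f"Bottleneck: {bottleneck['service']} ({share:.0%} of {total_ms} ms)")
```

The long bar next to payments is the text equivalent of what a tracing UI shows you at a glance.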
🧩 The Tools
You don’t build all this by hand. There’s a standard toolkit for it.
- OpenTelemetry is the toolkit you use to instrument your services. Instrumenting just means adding the code that creates spans and passes the trace id along. It’s an open, vendor-neutral standard with libraries for most popular languages, so the concepts stay the same no matter which language a service is written in.
- Jaeger and Zipkin are tools that collect those spans and show you the timeline view in a nice UI. You pick one to actually look at your traces.
So the usual combo is: OpenTelemetry to produce the traces, and Jaeger or Zipkin to view them.
🔗 Logs vs Metrics vs Traces
You’ll often hear logs, metrics, and traces called the three pillars of observability. Observability just means how well you can understand what’s going on inside your system from the outside. Each pillar answers a different question.
| Pillar | What it is | What it answers |
|---|---|---|
| Logs | Text records of individual events | What exactly happened at this moment? |
| Metrics | Numbers measured over time, like request count or error rate | How is the system doing overall? |
| Traces | The path of one request across services | Where did this one request spend its time? |
The thing is, they work best together. Metrics tell you something’s wrong, traces tell you where it’s wrong, and logs tell you exactly what went wrong at that spot.
⚠️ Common Mistakes and Misconceptions
A few things trip people up. Let’s clear them out.
- “Logs are enough for microservices.” Not really. Logs show single events, but they don’t connect them into one request’s journey across services. Tracing is what links the dots.
- “Just trace every request in full detail.” That sounds nice, but it produces a lot of data, which costs money to store and adds overhead to every request. So most systems use sampling, which means keeping only a slice of traces (say one in a hundred) instead of all of them.
- “The trace id gets passed automatically.” It doesn’t, unless your services are set up to propagate it. If one service forgets to pass it on, the trace breaks right there and your timeline has a gap.
- “A span and a trace are the same thing.” No. A span is one step. A trace is the whole journey made of many spans sharing one trace id.
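The sampling idea from the list above can be sketched like this. One common approach, assumed here, is to base the keep-or-drop decision on the trace id itself, so every service reaches the same verdict for the same request without talking to the others. The `should_keep` function and the modulo rule are illustrative, not any specific tool's algorithm.

```python
import secrets

SAMPLE_ONE_IN = 100  # "one in a hundred", as in the example above


def should_keep(trace_id: str, one_in: int = SAMPLE_ONE_IN) -> bool:
    """Decide from the trace id alone whether to keep this trace.

    Because the answer depends only on the id, every service that sees
    the same trace id makes the same decision.
    """
    return int(trace_id, 16) % one_in == 0


# Simulate a burst of requests, each with a random 128-bit hex trace id.
trace_ids = [secrets.token_hex(16) for _ in range(100_000)]
kept = [trace_id for trace_id in trace_ids if should_keep(trace_id)]

print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

The kept count lands near one percent of the total. It varies a little from run to run because the ids are random.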
🛠️ Design Challenge
Try this one yourself to lock it in.
Imagine a checkout request that flows through a gateway, an auth service, a cart service, and a payment service. The request feels slow, but you don’t know why. Sketch out:
- Where the trace id gets created.
- How it gets passed from one service to the next.
- What each service’s span would record.
- How you’d use the final timeline to find the slow step.
Then think about this twist: if the cart service forgot to pass the trace id forward, what would your timeline look like, and why would that make debugging harder?
🧩 What You’ve Learned
You can now explain how to follow a request through a whole microservices system. Here’s what you’ve picked up.
- ✅ Distributed tracing follows one request as it travels across many services.
- ✅ A trace is the whole journey; a span is one step inside it.
- ✅ Every span in a trace shares one trace id, which is created at the entry point and passed along to each service.
- ✅ The tracing system stitches spans into a timeline that shows the path, the timing, and where the errors are.
- ✅ That timeline lets you find the bottleneck, the one slow step holding everything up.
- ✅ OpenTelemetry instruments your services, and Jaeger or Zipkin show you the traces.
- ✅ Logs, metrics, and traces are the three pillars of observability, and they work best together.
Check Your Knowledge
Test what you learned. Try to answer each question yourself before reading the explanation.
1. What is distributed tracing used for?
Why: Tracing follows a single request across all its services so you can see the full path and how long each step took.
2. What is the difference between a trace and a span?
Why: A trace is the full end-to-end journey, made up of many spans, where each span is one step in one service.
3. How do all the spans get linked to the same request?
Why: Every span carries the same trace id, so the tracing system groups them into one journey.
4. What are the three pillars of observability?
Why: Logs record events, metrics track numbers over time, and traces follow one request across services.
🚀 What’s Next?
You’ve got the big picture of tracing now. To round out your observability knowledge, go deeper next.
- Logging Basics covers how individual services record what they do, the pillar that pairs with tracing.
- Challenges of Microservices explains why systems get this complicated in the first place, and the problems tracing helps you tame.
Once you’ve got those, you’ll be able to reason about a slow request from end to end, just like you would in a real interview.