Distributed Tracing Explained

Picture this. A user opens your app, taps a button, and the screen just sits there spinning. The request finally comes back after four whole seconds. Four seconds is forever, right? So you go to fix it.

But here’s the catch:

That one request didn’t touch just one server. It bounced through eight different services on its way.
It hit the API gateway, then the auth service, then the cart service, then payments, then a couple of databases, and so on.
Somewhere in that chain, something was slow. But which one?

That’s the question we’re going to answer in this lesson. By the end, you’ll know how to follow a single request through all eight services and point straight at the slow one.

🎯 The Problem

In a small app, this is easy. One server does everything, so if it’s slow, you look at that one server. Done.

But modern systems are usually built as microservices. A microservice is just one small service that does one job, and a big app is made of many of them talking to each other. (We cover the headaches of this setup in Challenges of Microservices.)

Now here’s the pain this creates:

One user request doesn’t stay in one place. It hops from service to service, like a baton in a relay race.
Each service writes its own logs in its own corner. A log is just a line of text a service prints to say what it did.
So when something is slow or broken, you’re staring at eight separate log piles, trying to mentally stitch them together.
And you can’t even tell which log lines belong to your request, because thousands of other requests are writing logs at the same time.

Logs alone just don’t show you the full journey. You need something that follows one request from start to finish.

🧭 What is Distributed Tracing

So let’s define it plainly. Distributed tracing is a way of following a single request as it travels through all the services it touches, so you can see the whole path and how long each step took.

Think of it like tracking a parcel:

When you order something online, the parcel gets a tracking number.
That number stays with the parcel the whole way: warehouse, truck, sorting center, your doorstep.
At every stop, the system records when the parcel arrived and when it left.
So at the end, you can see the full route and spot where it sat around for two days.

Distributed tracing does exactly that, but for a request moving through your services. The request gets a tracking number, every service records its part, and you get the full timeline at the end.

🆔 Traces and Spans

These are the two words you’ll hear most, so let’s nail them down right now.

A trace is the whole journey of one request, end to end. It’s the complete story, from the moment the request arrives to the moment the answer goes back.
A span is one single step inside that journey. It’s one operation in one service, like “auth service checked the password” or “database ran the query.”

So a trace is made up of many spans. One trace, many spans. And here’s the glue that holds them together:

Every span in the same journey shares one trace id. A trace id is just a unique tracking number for that one request.
Each span also has its own span id, so you can tell the steps apart.
Because they all carry the same trace id, the tracing system can gather them up and know they belong to the same request.

Picture one request flowing through three services. Each service records its own span with its own span id, but all three spans carry the same trace id.
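To make that concrete, here is a minimal sketch in Python of three spans that belong to one trace. The `Span` class, its field names, and the service names are made up for illustration; real tracing libraries define their own span types with more fields.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str   # shared by every span in the same request
    span_id: str    # unique to this one step
    service: str    # which service did the work
    operation: str  # what the step was


def new_trace_id() -> str:
    # 16 random bytes, written as 32 hex characters
    return secrets.token_hex(16)


def new_span_id() -> str:
    # 8 random bytes, written as 16 hex characters
    return secrets.token_hex(8)


# One request arrives, so one trace id is created for it.
trace_id = new_trace_id()

# Each step gets its own span id but reuses the same trace id.
spans = [
    Span(trace_id, new_span_id(), "gateway", "route request"),
    Span(trace_id, new_span_id(), "auth", "check password"),
    Span(trace_id, new_span_id(), "database", "run query"),
]

for span in spans:
    print(span.trace_id, span.span_id, span.service, span.operation)
```

Every printed line starts with the same trace id, followed by a different span id. That shared first column is what lets a tracing system group the steps back into one request.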

This table sums up the difference so you don’t mix them up.

Trace	Span
The whole journey of one request	One step within that journey
Made up of many spans	Belongs to exactly one trace
Has one trace id	Has its own span id, plus the shared trace id
Shows the full path and total time	Shows what one service did and how long it took

⚙️ How It Works

Okay, so how does this actually happen in the background? It’s simpler than you’d think.

When the request first arrives at the edge of your system, a trace id gets created for it. The edge is usually the first service it hits, like the API gateway.
As the request moves to the next service, that trace id gets passed along with it. This passing-along is called context propagation, which just means carrying the tracking info from one service to the next.
Each service records its own span: when its work started, when it finished, and which trace id it belongs to.
All these spans get sent off to a tracing system, a separate tool whose job is to collect them.
The tracing system sees they all share the same trace id, so it stitches them together into one timeline you can look at.
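The steps above can be sketched in a few lines of Python. This is a toy simulation, not a real tracer: the services are plain function calls, the `collected` list stands in for the tracing system, and `inject`, `extract`, and `handle` are hypothetical helper names. The header layout is a simplified reading of the W3C Trace Context `traceparent` header (version, trace id, parent span id, flags). Recording a parent span id is an extra detail this sketch adds so you can see who called whom.

```python
import secrets
import time
from collections import defaultdict

HEADER = "traceparent"  # header name used by the W3C Trace Context format
collected = []          # stand-in for the tracing system that receives spans


def inject(headers, trace_id, span_id):
    # Write the tracking info into the outgoing request headers.
    headers[HEADER] = f"00-{trace_id}-{span_id}-01"


def extract(headers):
    # Read the tracking info from incoming headers, if there is any.
    value = headers.get(HEADER)
    if value is None:
        return None
    _version, trace_id, parent_id, _flags = value.split("-")
    return trace_id, parent_id


def handle(service, headers, next_services):
    """Simulate one service handling the request, then calling the next one."""
    context = extract(headers)
    if context is None:
        # Nothing arrived with the request, so this is the edge: start a trace.
        trace_id, parent_id = secrets.token_hex(16), None
    else:
        trace_id, parent_id = context
    span_id = secrets.token_hex(8)

    start = time.perf_counter()
    if next_services:
        outgoing = {}
        inject(outgoing, trace_id, span_id)  # context propagation
        handle(next_services[0], outgoing, next_services[1:])
    end = time.perf_counter()

    # Each service records its own span and sends it to the tracing system.
    collected.append({
        "trace_id": trace_id, "span_id": span_id, "parent_id": parent_id,
        "service": service, "start": start, "end": end,
    })


handle("gateway", {}, ["auth", "cart"])

# The tracing system groups spans that share a trace id into one timeline.
by_trace = defaultdict(list)
for span in collected:
    by_trace[span["trace_id"]].append(span)

for trace_id, spans in by_trace.items():
    for span in sorted(spans, key=lambda s: s["start"]):
        print(trace_id, span["service"], "parent:", span["parent_id"])
```

All three spans land in the same group because the trace id created at the gateway was carried forward in the header on every hop.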

If the trace id isn't passed along, the trace breaks

The whole thing depends on every service passing the trace id forward. If one service forgets to propagate it, the next service starts a brand new trace. Now your journey is split in two, and the picture has a hole in it right where you need it.
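Here is a small sketch of that failure, assuming a simple chain where each service calls the next. The `run_request` function, the `trace-id` header name, and the `forgetful` parameter are all invented for this example.

```python
import secrets


def run_request(services, forgetful=frozenset()):
    """Return (service, trace_id) pairs for one request passing down a chain."""
    spans = []
    headers = {}
    for service in services:
        # Reuse the incoming trace id, or start a new trace if none arrived.
        trace_id = headers.get("trace-id") or secrets.token_hex(16)
        spans.append((service, trace_id))
        # A forgetful service sends the next call without the trace id.
        headers = {} if service in forgetful else {"trace-id": trace_id}
    return spans


chain = ["gateway", "auth", "cart", "payments"]

healthy = run_request(chain)
broken = run_request(chain, forgetful={"cart"})

print("trace ids when everyone propagates:", len({t for _, t in healthy}))
print("trace ids when cart forgets:", len({t for _, t in broken}))
for service, trace_id in broken:
    print(f"{service:<10}{trace_id}")
```

In the broken run, gateway, auth, and cart share one trace id, while payments shows up under a second, unrelated one. The tracing system has no way to know those two traces were the same request.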

⏱️ What It Shows You

Once those spans are stitched together, you get a timeline view of the request. And this is where the magic happens.

You see the full path: which services the request touched, and in what order.
You see how long each service took. Each span has a start and end time, so the slow steps stand out as long bars.
You see where errors happened, because a span can be marked as failed.
And from all that, you spot the bottleneck: the one step eating most of the time. A bottleneck is just the slowest part that holds everything else up.

So back to our four-second request. With a trace, you’d open the timeline and instantly see that seven services finished in milliseconds, but the payments service took 3.8 seconds. There’s your culprit. No more guessing across eight log piles.
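A rough sketch of that lookup in Python follows. The span timings are invented to match the story (eight services, four seconds in total, payments taking 3.8 seconds), the service names beyond the ones mentioned earlier are placeholders, and the spans are laid out one after another to keep things simple. In a real trace, a caller's span usually overlaps the spans of the services it calls.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Illustrative timings only, in milliseconds since the request arrived.
spans = [
    Span("gateway", 0, 20),
    Span("auth", 20, 60),
    Span("cart", 60, 100),
    Span("cart-db", 100, 130),
    Span("inventory", 130, 150),
    Span("payments", 150, 3950),
    Span("orders-db", 3950, 3980),
    Span("notifications", 3980, 4000),
]

total_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
bottleneck = max(spans, key=lambda s: s.duration_ms)

# Print a crude timeline: one '#' per 100 ms, at least one per span.
for span in sorted(spans, key=lambda s: s.start_ms):
    bar = "#" * max(1, span.duration_ms // 100)
    print(f"{span.service:<14}{span.duration_ms:>6} ms  {bar}")

print(f"bottleneck: {bottleneck.service} "
      f"({bottleneck.duration_ms} of {total_ms} ms)")
```

The payments row gets a bar of 38 marks while every other row gets one, which is the text version of the long bar you would see in a tracing UI.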

🧩 The Tools

You don’t build all this by hand. There’s a standard toolkit for it.

OpenTelemetry is the toolkit you use to instrument your services. Instrumenting just means adding the code that creates spans and passes the trace id along. It’s an open, vendor-neutral standard with libraries for most popular languages, so the same concepts apply whichever language a service is written in.
Jaeger and Zipkin are tools that collect those spans and show you the timeline view in a nice UI. You pick one to actually look at your traces.

So the usual combo is: OpenTelemetry to produce the traces, and Jaeger or Zipkin to view them.

🔗 Logs vs Metrics vs Traces

You’ll often hear these called the three pillars of observability. Observability just means how well you can understand what’s going on inside your system from the outside. Each pillar answers a different question.

Pillar	What it is	What it answers
Logs	Text records of individual events	What exactly happened at this moment?
Metrics	Numbers measured over time, like request count or error rate	How is the system doing overall?
Traces	The path of one request across services	Where did this one request spend its time?

The thing is, they work best together. Metrics tell you something’s wrong, traces tell you where it’s wrong, and logs tell you exactly what went wrong at that spot.

⚠️ Common Mistakes and Misconceptions

A few things trip people up. Let’s clear them out.

“Logs are enough for microservices.” Not really. Logs show single events, but they don’t connect them into one request’s journey across services. Tracing is what links the dots.
“Just trace every request in full detail.” That sounds nice, but recording every request produces a lot of data, which costs money to store and adds overhead to your services. So most systems use sampling, which means keeping only a slice of traces (say one in a hundred) instead of all of them.
“The trace id gets passed automatically.” It doesn’t, unless your services are set up to propagate it. If one service forgets to pass it on, the trace breaks right there and your timeline has a gap.
“A span and a trace are the same thing.” No. A span is one step. A trace is the whole journey made of many spans sharing one trace id.
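The sampling idea from the list above can be sketched like this. It is one simple approach, not how any particular tracer does it: the `keep_trace` function is hypothetical, and it decides from the trace id alone so that every service would make the same keep-or-drop choice for the same request.

```python
import secrets


def keep_trace(trace_id: str, one_in: int = 100) -> bool:
    """Keep roughly one trace in `one_in`, deciding from the trace id alone."""
    # Treat the hex trace id as a number; random ids spread evenly,
    # so about 1 in `one_in` of them divide cleanly.
    return int(trace_id, 16) % one_in == 0


trace_ids = [secrets.token_hex(16) for _ in range(100_000)]
kept = sum(keep_trace(t) for t in trace_ids)

print(f"kept {kept} of {len(trace_ids)} traces")
```

With 100,000 random trace ids, the kept count should land somewhere near 1,000. The exact number changes from run to run because the ids are random.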

🛠️ Design Challenge

Imagine a checkout request that flows through a gateway, an auth service, a cart service, and a payment service. The request feels slow, but you don’t know why. Work through the questions below.

Where does the trace id get created?

How does the trace id get passed from one service to the next?

What would each service’s span record?

How would you use the final timeline to find the slow step?

If the cart service forgot to pass the trace id forward, what would your timeline look like, and why would that make debugging harder?

🧩 What You’ve Learned

You can now explain how to follow a request through a whole microservices system. Here’s what you’ve picked up.

✅ Distributed tracing follows one request as it travels across many services.
✅ A trace is the whole journey; a span is one step inside it.
✅ Every span shares one trace id, which is created at the entry point and passed along to each service.
✅ The tracing system stitches spans into a timeline that shows the path, the timing, and where the errors are.
✅ That timeline lets you find the bottleneck, the one slow step holding everything up.
✅ OpenTelemetry instruments your services, and Jaeger or Zipkin show you the traces.
✅ Logs, metrics, and traces are the three pillars of observability, and they work best together.

🚀 What’s Next?

You’ve got the big picture of tracing now. To round out your observability knowledge, these lessons go deeper.

Logging Basics covers how individual services record what they do, the pillar that pairs with tracing.
Challenges of Microservices explains why systems get this complicated in the first place, and the problems tracing helps you tame.

Once you’ve got those, you’ll be able to reason about a slow request from end to end, just like you would in a real interview.
