Introduction to Apache Kafka
Think about a big app like YouTube or Uber for a second. Stuff is happening all the time, right?
- Every click, every page view, every video play, every ride request is a tiny piece of data.
- At big companies that’s millions of these little things every single second.
- And someone has to catch all of them, store them, and pass them along to whoever needs them.
So here are the questions we’ll answer today:
- How does one system swallow millions of events per second without falling over?
- How can many different teams read that same flood of data, each at their own pace?
The popular answer to both is a tool called Apache Kafka. Let’s see what it actually is.
🌊 What is Kafka
Let’s start with the one word that’s at the heart of all this, the event.
- An event is just a record that something happened. Like “user Alex clicked play” or “order #482 was placed”.
- It’s small, it has a timestamp, and once it happened, it happened. You don’t edit it later, you just record it.
- A constant flow of these events, one after another, is called a stream.
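Here’s one way to picture an event in code. Treat it as a sketch only: the Event class and its field names (name, subject, occurred_at) are made up for illustration and are not part of Kafka. The point is that an event is small, carries a timestamp, and can’t be edited once it’s recorded.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: fields cannot be changed after creation
class Event:
    """A record that something happened (hypothetical shape, not a Kafka type)."""
    name: str              # what happened, e.g. "play_clicked"
    subject: str           # who or what it happened to, e.g. "user:alex"
    occurred_at: datetime  # when it happened


def make_stream():
    """Return a tiny stream: events listed in the order they happened."""
    t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    return [
        Event("play_clicked", "user:alex", t0),
        Event("order_placed", "order:482", t0.replace(second=5)),
    ]


stream = make_stream()

try:
    stream[0].name = "pause_clicked"  # trying to rewrite history
except FrozenInstanceError:
    print("events are immutable")
```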
Now Kafka. Here’s the plain definition:
- Apache Kafka is a distributed event-streaming platform. That’s a mouthful, so let’s break it down.
- “Event-streaming” means its whole job is to carry those streams of events from the things that make them to the things that need them.
- “Distributed” means it doesn’t run on one machine. It runs across a cluster of machines working as a team, so it can handle huge load and keep going even if one machine dies. (A cluster is just a group of computers acting as one system.)
- And it stores those events durably, meaning it writes them to disk and keeps them around, so they don’t vanish the moment someone reads them.
So in one line: Kafka is a system that takes in streams of events, stores them safely, and lets many readers pull from the same stream whenever they want.
🧩 Core Concepts
Kafka has a handful of words you’ll hear over and over. Let’s define each one before we go further, because the rest of the tutorial leans on them.
- A topic is a named stream of events, basically a category. Like a topic called clicks or one called orders. Producers write into a topic and consumers read from it.
- A partition is one slice of a topic. One topic can be cut into many partitions, and each partition can live on a different machine. This is how Kafka scales out and how it keeps events in order within each partition (more on that soon).
- A producer is anything that writes events into a topic. Your web server, your mobile app’s backend, a sensor, whatever is making events.
- A consumer is anything that reads events out of a topic. And a consumer group is a team of consumers that share the work of reading one topic, so they can chew through it faster together.
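To see how a consumer group can share the work, here’s a toy sketch that deals a topic’s partitions out to the members of one group, round-robin. The function name assign_partitions and the worker names are invented for this example, and real Kafka uses its own, more involved assignment strategies. What carries over is the idea that, inside one group, each partition is read by just one member.

```python
def assign_partitions(partitions, consumers):
    """Deal partitions out to the consumers of one group, round-robin.

    Toy model only: every partition ends up with exactly one consumer.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# A topic with four partitions, read by a group of two consumers.
assignment = assign_partitions([0, 1, 2, 3], ["worker-a", "worker-b"])
print(assignment)  # {'worker-a': [0, 2], 'worker-b': [1, 3]}
```

With two workers each one handles half the partitions, which is why adding consumers to a group lets it chew through a topic faster.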
Here’s a quick reference table you can come back to.
| Concept | What it is | In plain words |
|---|---|---|
| Topic | A named stream of events | A labelled box where related events go, like orders |
| Partition | A slice of a topic | One piece of the topic, so it can scale out while each piece stays ordered |
| Producer | Writes events in | The thing that creates events and sends them to a topic |
| Consumer | Reads events out | Anything that reads events from a topic |
| Consumer group | A team of readers | Consumers sharing the work of reading one topic together |
Here’s how those pieces fit together. Producers push events into a topic, the topic is split into partitions, and consumer groups read from those partitions.
Notice the same topic is being read by two different groups at once. The billing team and the analytics team both get every event, independently. Hold on to that idea, it’s the whole magic of Kafka.
📜 The Log Idea
To really get Kafka, you have to get one simple idea: the log. Not the “error log” kind. A log here means an append-only record.
- Append-only means you can only add new entries to the end. You never go back and change or delete what’s already there.
- Picture a notebook where you only ever write on the next empty line. Line 1, line 2, line 3, and so on, forever.
- Each partition in Kafka is exactly this: a log. New events get added to the end, in order, and they stay put.
Now, how do consumers keep their place in this notebook? With an offset.
- An offset is just a position number, like a bookmark. “I’ve read up to line 57.”
- Each consumer group tracks its own offset for each partition it reads. So billing might be on line 57 while analytics is still on line 30.
- Because Kafka keeps the events and each reader keeps its own bookmark, two groups can read the very same data at totally different speeds, without stepping on each other.
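The notebook-and-bookmark picture fits in a few lines of code. This is a toy in-memory model, not the Kafka client API: PartitionLog, poll, and position are names chosen for this sketch. It shows one append-only list plus a separate bookmark per consumer group.

```python
class PartitionLog:
    """Toy append-only log with one bookmark (offset) per consumer group."""

    def __init__(self):
        self._events = []   # position in this list == offset
        self._offsets = {}  # group name -> next offset to read

    def append(self, event):
        """Add an event at the end and return the offset it was stored at."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group, max_events=10):
        """Return the next events for `group` and move only its bookmark."""
        start = self._offsets.get(group, 0)
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def position(self, group):
        """Return the offset `group` will read next."""
        return self._offsets.get(group, 0)


log = PartitionLog()
for n in range(5):
    log.append(f"event-{n}")

billing_batch = log.poll("billing", max_events=5)      # fast reader
analytics_batch = log.poll("analytics", max_events=2)  # slow reader

print(log.position("billing"), log.position("analytics"))  # 5 2
```

Both groups read from the same list, and neither one’s progress affects the other.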
And here’s the part people find surprising:
- When a consumer reads an event, Kafka does not throw it away. The event stays in the log.
- It stays for a set amount of time called the retention period. You might keep events for 7 days, or 30 days, or forever, your choice.
- So if a consumer crashes, or a new team shows up next month, they can rewind their offset and replay old events from the log. That’s a superpower most queues just don’t have.
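Here’s a small sketch of retention and replay together. Again it’s a toy: RetainedLog, expire, and read_from are hypothetical names, and time is passed in as plain seconds so the example is easy to follow. Real Kafka removes old data in larger chunks rather than one record at a time, so treat the expiry logic as illustration only.

```python
DAY = 24 * 60 * 60  # seconds in a day


class RetainedLog:
    """Toy log where only the retention period removes events, never reading."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = []     # (offset, timestamp, event) tuples
        self._next_offset = 0

    def append(self, event, now):
        """Store an event with the time it arrived."""
        self._records.append((self._next_offset, now, event))
        self._next_offset += 1

    def expire(self, now):
        """Drop events that are older than the retention period."""
        cutoff = now - self.retention_seconds
        self._records = [r for r in self._records if r[1] >= cutoff]

    def read_from(self, offset):
        """Replay every retained event at or after `offset`."""
        return [event for off, _, event in self._records if off >= offset]


log = RetainedLog(retention_seconds=7 * DAY)
log.append("order-1", now=0)
log.append("order-2", now=3 * DAY)

first_pass = log.read_from(0)  # a consumer reads everything
replay = log.read_from(0)      # rewinding to offset 0 returns it all again

log.expire(now=8 * DAY)        # order-1 is now older than 7 days
after_expiry = log.read_from(0)
print(first_pass, replay, after_expiry)
```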
The log is what makes Kafka, Kafka
Most message systems hand a message to a reader and then delete it. Kafka instead keeps a durable log and lets each reader track its own position. That single design choice is why Kafka can do high throughput, replay, and many independent readers all at once.
⚡ Why Kafka is Special
So why do giant companies reach for Kafka? It comes down to a few things that fall right out of that log design.
- Huge throughput. Throughput means how much data it can handle per second. Because writing to the end of a log is so simple, and because topics split into partitions across many machines, Kafka can take in millions of events per second.
- Durable storage. Events are written to disk and copied to more than one machine. So if a machine dies, your data is still safe on another one. (Keeping copies like this is called replication.)
- Replay. Since events stick around for the retention period, you can re-read the past. Great for fixing a bug, training a new model, or onboarding a new service.
- Many independent consumers. The same stream can feed billing, analytics, fraud detection, and search, all at once, each reading at its own pace.
- Built for pipelines. Kafka shines when data has to flow continuously from one place to many others, which is exactly what data pipelines and streaming systems need.
The thing is, these aren’t separate features bolted on. They all come from “store events as a durable log and let readers track their own offset.” One idea, lots of payoff.
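To make the durability point concrete, here’s a deliberately simplified sketch of replication. ReplicatedPartition and the broker names are invented for this example, and the model skips everything real Kafka does to coordinate copies. It only shows why a second copy means one machine dying doesn’t lose your events.

```python
class ReplicatedPartition:
    """Toy partition that keeps a full copy of its log on several brokers."""

    def __init__(self, brokers):
        self.replicas = {broker: [] for broker in brokers}

    def append(self, event):
        """Write the event to every live copy."""
        for copy in self.replicas.values():
            copy.append(event)

    def fail(self, broker):
        """Simulate a machine dying: its copy is gone."""
        del self.replicas[broker]

    def read_all(self):
        """Read the log from any surviving copy."""
        if not self.replicas:
            raise RuntimeError("all replicas lost")
        surviving_copy = next(iter(self.replicas.values()))
        return list(surviving_copy)


partition = ReplicatedPartition(["broker-1", "broker-2"])
partition.append("ride-requested")
partition.append("ride-accepted")

partition.fail("broker-1")       # one machine dies
survived = partition.read_all()  # the data is still on broker-2
print(survived)
```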
🆚 Kafka vs RabbitMQ
People often ask, “isn’t this just a message queue like RabbitMQ?” Close, but the goals are different. Let’s lay them side by side.
| Question | Apache Kafka | RabbitMQ |
|---|---|---|
| What is it really? | A durable event log for streaming | A flexible task queue and message broker |
| After a message is read? | Kept in the log for the retention period | Usually removed once a worker acknowledges it |
| Can you replay old data? | Yes, rewind the offset and re-read | Generally no, a message is gone once it is consumed and acknowledged |
| Best at | Very high volume, many readers, streaming | Routing tasks to workers, complex delivery rules |
Here’s the easy way to remember it:
- Reach for Kafka when you have a firehose of events that many teams need to read, and you might want to replay it later.
- Reach for RabbitMQ when you have tasks to hand out to workers, and you want fine-grained routing of who gets what.
- New to the other side? Start with Introduction to RabbitMQ to see the task-queue style up close.
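The difference between the two styles shows up even in a tiny in-memory sketch. Neither half uses a real Kafka or RabbitMQ client, and the names task_queue, event_log, and read_next are made up here. The first half hands out a task and removes it, while the second keeps every event and just moves a per-group bookmark.

```python
from collections import deque

# Task-queue style: handing a task to a worker removes it from the queue.
task_queue = deque(["resize-image-1", "resize-image-2"])
taken = task_queue.popleft()

# Log style: events stay put, each group only advances its own offset.
event_log = ["click-1", "click-2"]
offsets = {"billing": 0, "analytics": 0}


def read_next(group):
    """Return the next event for `group` without removing it from the log."""
    event = event_log[offsets[group]]
    offsets[group] += 1
    return event


billing_first = read_next("billing")
analytics_first = read_next("analytics")  # same event, read independently

print(len(task_queue), len(event_log))  # 1 2
```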
✅ When to Use It
Kafka isn’t the answer to everything, so let’s be clear about where it really fits.
- Event streaming. You’ve got a steady flow of events (clicks, page views, ride requests) and you want to move it around in real time.
- Log and metrics pipelines. Pulling logs and metrics from hundreds of servers into one place for monitoring.
- Analytics. Feeding raw events into dashboards, data warehouses, or machine-learning systems.
- Very high throughput. When the volume is so large that a normal queue would buckle.
- Multiple consumers of the same data. When billing, search, and analytics all need the exact same stream, independently.
If your case looks like one of those, Kafka is probably a good fit. If not, hold on, the next section covers when it’s the wrong tool.
⚠️ Common Mistakes and Misconceptions
A few myths trip up beginners. Let’s clear them out so you don’t repeat them in an interview.
- “Kafka is just a queue.” Not quite. A plain queue hands a message to one worker and forgets it. Kafka is a durable log that many groups can read independently and even replay. The log is the whole difference.
- “Kafka deletes messages once you read them.” No. Reading does not remove anything. Events stay for the retention period you set, which is exactly why replay works. Don’t confuse “I read it” with “it’s gone”.
- “Just use Kafka for everything.” Please don’t. Kafka is heavier to run and overkill for a small app or a simple background job. If you just need to send a handful of tasks to a worker, a lighter queue is simpler and cheaper.
🛠️ Design Challenge
Try this one on your own to test yourself.
Imagine you’re designing the backend for a ride-hailing app like Uber. Every time a driver’s location updates, that’s an event, and a few different teams need it: the live map, the pricing engine, and the analytics team.
- Which Kafka topic (or topics) would you create for location updates?
- How would consumer groups let the map team and the pricing team both read every update without slowing each other down?
- If the analytics team joins next month and wants yesterday’s data, how do the log and retention help them?
Sketch the producers, the topic with its partitions, and the consumer groups. This is exactly the kind of design reasoning interviewers love.
🧩 What You’ve Learned
You can now explain what Kafka is and why people use it. Here’s what you’ve picked up.
- ✅ Kafka is a distributed event-streaming platform that stores streams of events durably.
- ✅ An event is a record that something happened, and a topic is a named stream of those events.
- ✅ Topics split into partitions, which scale Kafka out and keep events in order within each partition.
- ✅ Each partition is an append-only log, and consumers track their position with an offset.
- ✅ Events stay for a retention period, so many consumers can read and replay the same data independently.
- ✅ Kafka is built for high throughput, durability, and streaming pipelines, while RabbitMQ fits flexible task queues.
Check Your Knowledge
Test what you learned. Each question is followed by the reasoning behind its answer.
- 1. What is Apache Kafka in one sentence?
  Why: Kafka is a distributed event-streaming platform that keeps streams of events as a durable, append-only log so many consumers can read or replay them independently.
- 2. What is a partition in Kafka?
  Why: A partition is a slice of a topic, so one topic can be split across many machines for scale, and each partition keeps its own events in order.
- 3. What is an offset, and who tracks it?
  Why: An offset is like a bookmark showing a consumer's position in a partition's log, and each consumer group tracks its own, so groups can read at different speeds.
- 4. Does Kafka delete a message after a consumer reads it?
  Why: Reading does not remove anything in Kafka, since events stay in the log for the retention period you set, which is exactly what lets consumers rewind and replay.
🚀 What’s Next?
You’ve got the foundation now. Next, see how Kafka fits into the bigger picture of message systems and event-driven design.
- Introduction to RabbitMQ shows the task-queue style of messaging, so you can compare it with Kafka’s log.
- Event-Driven Architecture zooms out to how whole systems are built around events flowing between services.
Once you’ve got those, you’ll be able to reason about messaging and streaming like a real system designer.