Design a Web Crawler
Search engines like Google know about billions of web pages. But how did they find all of them? Nobody typed them in by hand. A program went out and discovered them, page by page, link by link. That program is a web crawler, and designing one is a classic system design question. Let’s build it up step by step, the way you would in an interview.
🎯 What a Web Crawler Does
A web crawler is a program that visits web pages, reads them, and follows their links to find more pages. Then it repeats, forever.
The basic loop is simple:
- Start with a few known web addresses (URLs).
- Download each page.
- Find all the links on that page.
- Add those new links to the list of pages to visit.
- Repeat with the next page.
So from a handful of starting pages, the crawler keeps following links and discovers more and more of the web. The whole challenge is doing this at a huge scale, politely, without getting stuck.
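The loop above can be sketched in a few lines of Python. This is a minimal, single-machine sketch, not a production crawler: `FAKE_WEB`, `fake_fetch_links`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so the example runs without any network access.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://a.example/": ["https://a.example/about", "https://b.example/"],
    "https://a.example/about": ["https://a.example/"],
    "https://b.example/": ["https://a.example/about", "https://c.example/"],
    "https://c.example/": [],
}


def fake_fetch_links(url):
    """Stand-in for 'download the page and find its links'."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Follow links breadth-first; return URLs in the order they were visited."""
    frontier = deque(seeds)  # pages still to visit
    seen = set(seeds)        # every URL that was ever queued
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # only queue links we have not met before
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["https://a.example/"], fake_fetch_links)
print(order)
```

Starting from one seed, the sketch reaches all four pages and visits each exactly once, even though the pages link back to each other.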
📋 Requirements
Before designing, let’s pin down what we need. Splitting requirements into two kinds is a habit that impresses interviewers.
Functional (what it must do):
- Start from some seed URLs and discover new pages by following links.
- Download each page and store its content for later use (like search).
- Not get stuck visiting the same page over and over.
Non-functional (how well it must do it):
- Scale: handle billions of pages.
- Politeness: don’t overload any single website with too many requests.
- Fault tolerance: if it crashes, it can pick up where it left off.
Always start with requirements
In any design question, list the functional and non-functional requirements first. It shows you think before you build, and it guides every decision that follows.
🏗️ High-Level Design
Here’s the shape of the system. The heart of it is a to-do list of URLs, called the URL frontier, plus workers that process them.
Reading the flow:
- The URL frontier is the queue of pages still to visit.
- A downloader takes a URL and fetches the page.
- A parser reads the page and pulls out the content and the links.
- The content gets saved in a content store.
- New links are checked against the seen-URLs list, and the truly new ones go back into the frontier.
That loop, run by many workers at once, is the whole crawler.
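To make the parser step concrete, here is a rough sketch of link extraction using only Python's standard library. The names `LinkExtractor` and `extract_links` are made up for this example, and a real parser would also need to handle broken HTML, redirects, and other link-carrying tags.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones, then drop "#fragment".
                absolute = urljoin(self.base_url, value)
                absolute, _fragment = urldefrag(absolute)
                # Skip mailto:, javascript:, and other non-web schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample_html = (
    '<a href="/about#team">About</a>'
    '<a href="https://b.example/x">B</a>'
    '<a href="mailto:hi@a.example">Mail</a>'
)
links = extract_links("https://a.example/index.html", sample_html)
print(links)
```

Resolving relative links against the page's own URL matters here: without it, `/about` could not be fetched or compared against the seen-URLs list.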
🔁 Avoiding Duplicate Pages
Here’s the first big problem. The web is full of links pointing to the same page. If the crawler visits the same URL again and again, it wastes huge effort and can loop forever.
So we keep a set of seen URLs. Before adding a URL to the frontier, we check: have we seen this already? If yes, skip it. If no, add it and mark it seen.
- For billions of URLs, this set is huge, so it’s spread across many machines.
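- A common trick is to store a fixed-size hash of each URL instead of the full text, to save space. The shorter the hash, the higher the (still small) chance that two different URLs collide and one of them gets skipped.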
The rule is simple: never crawl the same URL twice in one pass. That one check saves an enormous amount of work.
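Here is one way the seen-URLs check might look on a single machine. It is a sketch under assumptions: the normalization rules (lowercase host, drop the `#fragment`) and the 8-byte hash length are illustrative choices, and `normalize`, `fingerprint`, and `SeenUrls` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def fingerprint(url):
    """First 8 bytes of SHA-256 over the normalized URL (assumed length)."""
    return hashlib.sha256(normalize(url).encode("utf-8")).digest()[:8]


class SeenUrls:
    """In-memory set of URL fingerprints."""

    def __init__(self):
        self._fingerprints = set()

    def add_if_new(self, url):
        """Return True (and record the URL) only the first time it is offered."""
        fp = fingerprint(url)
        if fp in self._fingerprints:
            return False
        self._fingerprints.add(fp)
        return True


seen = SeenUrls()
print(seen.add_if_new("https://Example.com/a#top"))  # first time
print(seen.add_if_new("https://example.com/a"))      # same page after normalizing
```

Normalizing before hashing is what lets the two spellings above count as one URL. Truncating the hash saves memory but means two different URLs could, rarely, share a fingerprint.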
🙏 Being Polite to Websites
Here’s a problem beginners miss. If our crawler hit one website with thousands of requests per second, it would effectively be attacking that site and could knock it offline. That’s rude and harmful.
So a good crawler is polite:
- It limits how often it hits any single website, like waiting a moment between requests to the same site.
- It checks a special file called robots.txt on each site, which tells crawlers which pages they’re allowed to visit.
Politeness is not optional
A crawler that ignores politeness can overload websites and get permanently blocked. Spreading requests out and respecting robots.txt is a core part of the design, not an afterthought.
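Both politeness rules can be sketched with the standard library. `urllib.robotparser.RobotFileParser` is Python's built-in robots.txt parser; everything else here (`PolitenessGate`, `load_rules`, the `ExampleBot` user agent, the 2-second delay) is a hypothetical choice for illustration. Time is passed in as a plain number so the example does not need to sleep.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self._next_allowed = {}  # host -> timestamp in seconds

    def wait_time(self, url, now):
        """Seconds to wait before fetching url; 0.0 means go ahead."""
        host = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Call right after a request so the next one is pushed back."""
        host = urlsplit(url).netloc
        self._next_allowed[host] = now + self.min_delay


def load_rules(robots_txt_text):
    """Parse robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


rules = load_rules("User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch("ExampleBot", "https://a.example/index.html"))
print(rules.can_fetch("ExampleBot", "https://a.example/private/x"))

gate = PolitenessGate(min_delay_seconds=2.0)
gate.record_fetch("https://a.example/index.html", now=100.0)
print(gate.wait_time("https://a.example/about", now=100.5))  # same host: must wait
print(gate.wait_time("https://b.example/", now=100.5))       # other host: no wait
```

Keying the delay on the host, not the full URL, is the point: two different pages on the same site still share one budget.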
📈 Scaling It Up
One machine can’t crawl billions of pages in any reasonable time. So we spread the work across many machines.
- The URL frontier is shared, and many downloader workers pull from it at once.
- We often split work by website, so all of one site’s pages go to the same worker. That makes politeness easy to enforce (one worker controls the pace for that site).
- If a worker crashes, the URLs it was working on are still in the frontier (they are only removed once a worker confirms they are done), so another worker picks them up. That’s our fault tolerance.
So the design scales by adding more workers, all sharing the frontier and the seen-URLs set.
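Splitting work by website can be as simple as hashing the hostname. The sketch below assumes a fixed number of workers; `worker_for` is a hypothetical helper, and it uses SHA-256 because Python's built-in `hash()` for strings changes between processes by default.

```python
import hashlib
from urllib.parse import urlsplit


def worker_for(url, num_workers):
    """Map a URL to a worker index so that one host always lands on one worker."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


first = worker_for("https://a.example/page1", 16)
second = worker_for("https://a.example/deep/page2?x=1", 16)
print(first == second)  # same site, same worker
```

One trade-off to mention in an interview: with a plain modulo, changing the number of workers moves most sites to a different worker.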
🧰 Tech Choices
System design is not just about naming the pieces. It’s also about saying why you picked each one. Here are the main technology decisions for this system and the reason behind each.
| Decision | Choice | Why |
|---|---|---|
| Store URLs to visit | A queue (the URL frontier) | Holds billions of URLs, lets many workers pull work, and survives crashes since URLs stay queued. |
| Store page content | Object storage | Cheap, huge-scale storage built for endless files. |
| “Seen this URL?” checks | Distributed hash set / key-value store | Very fast lookups, spread across machines for billions of URLs. |
| Split the work | Partition by domain | All of one site goes to one worker, which makes politeness (rate limits) easy. |
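The crash-survival claim in the first row of the table depends on how URLs leave the queue. One common approach, sketched here as an assumption about how such a frontier might work, is to lease a URL to a worker and only delete it once the worker acknowledges it. `LeasedFrontier` is a hypothetical in-memory class; a real system would keep this state in a durable queue.

```python
from collections import deque


class LeasedFrontier:
    """Queue where a URL is removed only after the worker acknowledges it."""

    def __init__(self, lease_seconds=30.0):
        self.lease_seconds = lease_seconds
        self._ready = deque()
        self._leased = {}  # url -> time at which the lease expires

    def add(self, url):
        self._ready.append(url)

    def lease(self, now):
        """Hand out one URL, or None if nothing is ready."""
        self._requeue_expired(now)
        if not self._ready:
            return None
        url = self._ready.popleft()
        self._leased[url] = now + self.lease_seconds
        return url

    def ack(self, url):
        """The worker finished this URL, so drop it for good."""
        self._leased.pop(url, None)

    def _requeue_expired(self, now):
        # A lease that ran out means the worker probably crashed: retry the URL.
        expired = [u for u, deadline in self._leased.items() if deadline <= now]
        for url in expired:
            del self._leased[url]
            self._ready.append(url)


frontier = LeasedFrontier(lease_seconds=10.0)
frontier.add("https://a.example/")
print(frontier.lease(now=0.0))   # worker 1 takes the URL, then crashes
print(frontier.lease(now=5.0))   # lease still active, nothing to hand out
print(frontier.lease(now=10.0))  # lease expired, worker 2 gets the same URL
```

The cost of this design is that a slow (not crashed) worker can cause a URL to be fetched twice, so downstream steps should tolerate repeats.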
⚠️ Common Mistakes and Misconceptions
A few things to keep straight:
- “Just download pages as fast as possible.” No. Without politeness limits, you overload websites and get blocked. Speed must be balanced with being a good citizen.
- “Skip the seen-URLs check, it’s just an optimization.” It’s essential. Without it, the crawler revisits pages forever and may never make progress.
- “One powerful server can crawl the whole web.” Not at web scale. You need many machines sharing the work, with a design that survives crashes.
🧩 What You’ve Learned
Nice work. Here’s the recap:
- ✅ A web crawler visits pages, saves their content, and follows links to discover more pages, in a loop.
- ✅ The core pieces are a URL frontier (to-do queue), downloaders, a parser, a content store, and a seen-URLs check.
- ✅ A seen-URLs set stops the crawler from visiting the same page twice.
- ✅ Politeness (rate limits and robots.txt) keeps it from overloading websites.
- ✅ It scales by spreading work across many workers that share the frontier, and survives crashes because URLs stay in the queue.
Check Your Knowledge
Test what you learned. Answer each question yourself, then read the explanation.
1. What is the URL frontier in a web crawler?
   Why: The URL frontier is the to-do queue of pages to crawl. Workers pull from it, and newly found links are added back to it.
2. Why does a crawler keep a set of seen URLs?
   Why: The seen-URLs set stops the crawler from re-visiting the same page, which would waste effort and could loop forever.
3. What does “politeness” mean for a crawler?
   Why: A polite crawler spreads out its requests so it doesn’t overload any website, and it follows each site’s robots.txt rules.
4. How does the crawler keep working if a worker crashes?
   Why: Because URLs live in the shared frontier, a crash doesn’t lose them. Another worker simply continues with those URLs.
🛠️ Design Challenge
Try extending the crawler yourself. Think each challenge through on your own before looking up an answer.
- Freshness: re-crawling pages. A news site changes every few minutes, but a help page barely changes in a year. How would you decide what to re-crawl, and how often?
- Duplicate content, not just duplicate URLs. Two different URLs can show the exact same page. How do you avoid storing the same content twice?
- Politeness across many workers. With hundreds of workers, how do you make sure they don’t collectively hammer one website?
🚀 What’s Next?
You’ve designed your first crawler. Let’s look at related designs.
- Design URL Shortener is another classic, focused on storing and looking up links.
- Design a Distributed Cache shows how to share data fast across many machines, like the seen-URLs set.
Get these down and you’ll start to see the same building blocks reused across many systems.