HomeBlog › Design a Web Crawler

Design a Web Crawler

A search-engine crawler: the URL frontier, politeness and robots.txt, the fetch/parse pipeline, deduplication, and crawler-trap detection.

"Design a web crawler" (like Googlebot) tests pipeline thinking, politeness, and deduplication at scale. The heart of the design is the URL frontier — the queue that decides what to crawl next, and how to do it without hammering any single website.

Here's the full walkthrough with a pipeline diagram, including robots.txt politeness and how to avoid crawler traps.

Web crawler pipeline: seed URLs, URL frontier, fetchers, parser, dedup, content store, politeness control
Web crawler pipeline: seed URLs, URL frontier, fetchers, parser, dedup, content store, politeness control

1. Clarify the requirements

Functional requirements

Non-functional requirements

Back-of-the-envelope scale: Assume a target of 1B pages/month → ~400 pages/sec sustained. With average pages of ~100KB, that's ~100TB/month of raw content before deduplication.

2. API design

A crawler is an internal pipeline rather than a public API. Its core loop, conceptually:

loop:
  url   = frontier.next()            # politeness-aware pick
  if not allowed(url): continue       # robots.txt + dedup check
  html  = fetch(url)                  # with per-host rate limit
  links = parse(html)                 # extract URLs + text
  store(url, content)                 # to content store
  for link in links:
    if not seen(link): frontier.add(link)

3. High-level architecture

The URL frontier is the centerpiece: a set of priority queues that selects the next URL while enforcing politeness — never sending two requests to the same host too close together. It balances priority (important pages first) with per-host rate limits.

Pools of fetcher workers download pages (using a cached DNS resolver to avoid repeated lookups), the parser extracts links and text, and new links flow back into the frontier — a closed loop. A dedup stage drops URLs already seen and pages whose content hash matches something stored.

Politeness is enforced by honoring robots.txt (cached per host), applying any crawl-delay, and rate-limiting per domain. This is the detail that separates a real crawler design from a naive BFS.

4. Data model & storage

Seen-URL set: billions of URLs, checked on every discovered link. A bloom filter gives a compact, fast "probably seen" test, backed by an authoritative store for confirmation.

Content store: raw and parsed page content in object storage, keyed by URL hash, with a content hash for near-duplicate detection. Metadata (last-crawled, status, freshness) lives in a separate index.

5. Scaling and bottlenecks

Key trade-offs the interviewer probes

Framework reminder: every system design answer follows the same arc — requirements → estimates → API → high-level design → data model → scale → trade-offs. Keep the system design cheat sheet in mind and narrate which stage you're in.

Design data pipelines with live AI support

CoPilot Interview surfaces a structured design skeleton — requirements, API, data model, and scaling — in about 4 seconds during real Zoom and Teams calls. Free for Windows and macOS, invisible on screen-share.

Download free

FAQ

What is the URL frontier in a web crawler?

The frontier is the queue (really a set of priority queues) that decides which URL to crawl next while enforcing politeness - never hitting the same host too frequently. It balances page priority against per-host rate limits and is the centerpiece of the design.

How does a crawler avoid crawling the same page twice?

It maintains a 'seen URL' set, checked on every discovered link. At billions of URLs a bloom filter gives a compact, fast 'probably seen' test, backed by an authoritative store for confirmation. Content hashing additionally catches near-duplicate pages at different URLs.

How do you make a crawler polite?

Honor robots.txt (cached per host), respect any crawl-delay directive, and rate-limit requests per domain so you never overload a site. Distributing the frontier by host keeps politeness enforcement local to each crawler node.

What are crawler traps and how do you handle them?

Crawler traps are pages that generate effectively infinite URLs - endless calendars, session-ID links, or deep recursive paths. Defend with URL-depth limits, a per-host page cap, and pattern detection that spots and skips trap-like URL structures.

How do you scale a crawler to billions of pages?

Distribute the frontier by host hash so each node owns a set of domains, run many parallel fetcher workers (crawling is I/O-bound), use a bloom filter for fast dedup, and store content in object storage. Schedule re-crawls by how often each page changes.