Design a Web Crawler — System Design

A web crawler is a graph-traversal engine over the entire internet: start from a set of seed URLs, download each page, extract the links it contains, and recursively follow them — forever. The data structure is conceptually trivial (a breadth-first search over a directed graph) but the engineering is brutal at scale. You must fetch billions of pages without overwhelming any single web server, resolve hundreds of millions of hostnames through a DNS layer that is orders of magnitude slower than your fetch loop, deduplicate URLs and content so you do not crawl the same page a thousand times, honor every site's robots.txt rules, survive deliberately hostile crawl traps that generate infinite URLs, distribute the work across a fleet of machines while keeping politeness state coherent, and recrawl pages often enough that your index stays fresh without re-downloading the static 90% of the web every day. Each of these is a real system-design problem, and they interact in subtle ways.

⚡ Quick Takeaways

The URL frontier is the heart — Mercator's two-tier design: front queues rank by priority (importance + freshness), back queues are per-host with a crawl-delay gate so one slow site never blocks the fleet.
Bloom filter seen-set — 1B+ URLs fit in ~1.2 GB at 1% false-positive rate; O(1) membership keeps dedup off disk. False positives skip a few new URLs; there are never false negatives.
DNS is the hidden bottleneck — a synchronous resolver call per fetch caps throughput; resolve once per host and cache aggressively.
robots.txt is mandatory — fetch and cache per host, honor Disallow and crawl-delay; ignoring it gets you IP-banned and is ethically wrong.
Content dedup via hashing + SimHash — exact duplicates by content hash; near-duplicates (mirrors, boilerplate) by SimHash Hamming distance.
Crawl traps need hard bounds — max depth, max pages per host, URL-length caps, and pattern detection for calendars and session IDs.
Partition workers by hostname — hash(host) → worker shard keeps all of a host's politeness state local, with no cross-worker coordination on every fetch.
Adaptive recrawl for freshness — recrawl rate ∝ observed change frequency × page importance; news hourly, static pages monthly.

tldr

A web crawler is a distributed BFS over the web. The core component is the URL frontier — a two-tier priority + politeness queue (Mercator). Resolve DNS through an aggressive cache, honor robots.txt per host, deduplicate URLs with a Bloom filter and content with SimHash, bound crawl traps with depth and per-host caps, partition workers by hostname so politeness is local, store raw HTML in object storage with metadata in a wide-column store, and recrawl adaptively based on each page's change rate and importance.

Step 1 — Clarify Requirements

"Design a web crawler" hides an enormous range of scope. A focused crawler for one domain is a weekend project; a general-purpose crawler that feeds a search engine is a multi-year effort. Pin down the scope before drawing a single box.

Functional requirements

Given a set of seed URLs, download the referenced pages and recursively follow their links.
Extract and store page content (raw HTML) and the link graph (which page links to which).
Respect robots.txt directives and any crawl-delay a site declares.
Deduplicate: never store the same page content twice, and detect near-duplicate / mirror content.
Recrawl pages periodically so stored content stays fresh.
Filter by content type — typically HTML, optionally PDFs, images, or other media.

Non-functional requirements

Scalability — crawl billions of pages; scale horizontally to thousands of concurrent fetcher threads across many machines.
Politeness — never overload a single host; honor crawl-delay and keep at most one open connection per host at a time.
Robustness — survive malformed HTML, redirect loops, slow or dead servers, and deliberately hostile crawl traps without crashing or wedging.
Freshness — recrawl high-change pages frequently while not wasting bandwidth on static pages.
Extensibility — pluggable parsers and protocol handlers so new content types can be added later.

interview tip

Clarify two things up front, because they dominate the design. First: is this a general (Googlebot-scale) crawler, a focused crawler for one site/topic, or a search-index feeder? Second: must you render JavaScript? Fetching static HTML is cheap; running a headless browser to render client-side SPAs is 10–100× more expensive in CPU and memory, and changes your whole capacity model. State your assumption and move on.

Step 2 — Capacity Estimation

Back-of-the-envelope numbers anchor the bandwidth, storage, and memory decisions. Assume a target of 1 billion pages per month — a realistic mid-size crawl.

Throughput

1B pages / month ÷ (30 × 86,400 s) ≈ 386 pages/sec sustained; with a 2× peak factor, design for ~800 pages/sec.
At an average HTML page of ~100 KB, download bandwidth ≈ 400 pages/sec × 100 KB ≈ 40 MB/sec ≈ 320 Mbps sustained ingress — the crawler is fundamentally bandwidth-bound, not CPU-bound (unless you render JS).
Each page fetch involves: a DNS lookup (cached), a TCP + TLS handshake, the HTTP GET, parsing, and link extraction. To hit hundreds of pages/sec you need thousands of concurrent connections (async I/O or a large thread pool), because per-fetch latency is dominated by network round-trips, not local work.

Storage

Raw HTML: 1B pages × 100 KB = ~100 TB/month uncompressed; gzip on HTML compresses ~4:1, so ~25 TB/month stored. Over a year of crawling that is hundreds of terabytes — object storage territory, not a SQL database.
URL metadata (URL, status, last-crawled time, content hash, ETag, next-recrawl time): ~500 bytes/URL × 1B ≈ ~500 GB, growing with the crawl frontier.
Seen-URL set: a Bloom filter at 1% false-positive rate needs ~9.6 bits/element ≈ 1.2 bytes/URL → ~1.2 GB for 1B URLs, ~12 GB for 10B. Fits in memory, sharded across a few nodes.

note

The dominant resource is network bandwidth, and the dominant constraint is politeness — you cannot simply open 10,000 connections to one popular host. Real crawl throughput is gated by how many distinct hosts you can fetch from in parallel, which is why the frontier's per-host queueing (Step 4) is the make-or-break component, not raw machine count.

Step 3 — High-Level Architecture

The crawler is a pipeline of cooperating components fed by a central work queue (the frontier). Each fetcher thread runs the same loop: pull a URL, resolve DNS, fetch, parse, extract links, dedup, store, and push new URLs back into the frontier.

flow

Seed URLs ──► URL Frontier (priority + politeness queues)
                  │  dequeue (host-ready URL)
                  ▼
            DNS Resolver (cached) ──► IP
                  ▼
            robots.txt cache ──► allowed?
                  ▼ yes
            HTML Fetcher (thread pool, conditional GET)
                  ▼ raw HTML
            Content Dedup ──► content-hash + SimHash
                  │  new content?
                  ▼ yes
            Parser / Link Extractor
                  │  normalize + filter links
                  ▼
            URL Dedup (Bloom filter seen-set)
                  │  unseen?
                  ▼ yes              ▼
            push to Frontier    Storage: raw HTML (object store)
            (assign priority)            + metadata (wide-column)
                                         + link graph (adjacency lists)
                  ▲
            Recrawl Scheduler re-injects due URLs by change rate

The frontier, DNS resolver, dedup structures, and storage are all shared services consumed by every fetcher. The two hard problems hiding in this simple diagram are the frontier (how to order and pace fetches) and dedup (how to do it cheaply at billions of items). The rest is plumbing — important, but well understood.

Step 4 — The URL Frontier

The frontier is the data structure that holds URLs waiting to be crawled. It is not a simple FIFO queue — it must satisfy two competing goals simultaneously: priority (crawl important and time-sensitive pages first) and politeness (never hammer a single host). The canonical solution is the Mercator two-tier queue design.

Front queues — priority

A prioritizer assigns each incoming URL a priority from 1 to n based on signals like the page's importance (PageRank, inbound link count), its update frequency, and how stale the stored copy is. There is one FIFO front queue per priority level; the prioritizer enqueues each URL into the queue matching its priority. A higher-priority page jumps ahead of the static long tail.

Back queues — politeness

A back-queue router guarantees that every URL for a given host lands in exactly one back queue, and each back queue holds URLs for exactly one host. A worker thread is bound to a back queue and keeps at most one open connection per host. A min-heap keyed on each queue's next-fetch timestamp enforces the crawl-delay: after fetching from host X, that queue's next-fetch time is pushed forward by the delay, so the heap will not surface it again until the politeness gap has elapsed.

structure

# Front side: priority
prioritize(url) -> p in [1..n]      # PageRank, freshness, staleness
front_queues[p].push(url)           # one FIFO per priority level

# Back side: politeness (one host per back queue)
back_queues[i]            # FIFO of URLs, all same host
host_to_queue[host] = i   # routing table

# Heap orders back queues by when they may next be fetched
heap = min_heap_of (next_fetch_time, queue_id)

def worker_loop():
    next_fetch_time, qid = heap.pop()      # soonest-ready host
    sleep_until(next_fetch_time)
    url = back_queues[qid].pop()
    page = fetch(url)                       # one connection to this host
    delay = crawl_delay_for(url.host)       # robots.txt or default 1–2 s
    if back_queues[qid].empty():
        refill_from_front_queues(qid)       # pull next host's URLs
    else:
        heap.push(now() + delay, qid)        # re-arm politeness timer

This design decouples the two concerns elegantly: the front queues decide what is worth crawling next, the back queues decide when it is polite to crawl it, and the heap makes "pick the soonest-ready host" an O(log n) operation. At scale the frontier is sharded — see Step 10 — but the per-shard structure is exactly this. Because the frontier persists in-flight work, it is often backed by a durable queue such as Kafka or an on-disk structure so that a worker crash never loses URLs.

Step 5 — DNS Resolution

DNS is the single most underestimated bottleneck in a crawler. Every fetch needs the host's IP, and a cold DNS lookup can take anywhere from a few milliseconds to several seconds — far slower than the rest of the fetch loop. A naive crawler that calls getaddrinfo() synchronously before each fetch will spend most of its time blocked on DNS, and the OS resolver is typically single-threaded and serializes requests.

Two mitigations are essential. First, run a dedicated, multi-threaded DNS resolver pool so that DNS lookups happen in parallel and never block a fetcher thread. Second, maintain an aggressive DNS cache: resolve each hostname once and reuse the IP for all subsequent URLs on that host, honoring the record's TTL. Because the frontier already groups URLs by host (Step 4), the vast majority of fetches hit a warm cache entry. For very large crawls, teams often run a local caching DNS server (or a custom resolver) close to the fetchers, and the cache itself behaves like any other read-heavy cache — sized to hold the hot working set of active hosts.

gotcha

Do not cache DNS forever to "save lookups" — sites move IPs, and a stale entry sends your traffic to the wrong server or a dead host. Honor TTLs, but also bound the cache by host count and evict least-recently-used hosts. The right cache hit rate comes for free from host-grouped fetching, not from ignoring TTLs.

Step 6 — Fetching and robots.txt

Before fetching any URL on a host, the crawler must fetch and obey that host's robots.txt (the Robots Exclusion Protocol). The file is fetched once per host and cached with its own TTL (typically 24 hours); it declares Disallow path prefixes, optional Allow exceptions, an optional crawl-delay, and often a Sitemap pointer that seeds the frontier with the site's own URL list.

Honor it strictly — fetching disallowed paths gets your crawler IP-banned, blocklisted by anti-bot services, and is ethically and sometimes legally indefensible. The robots cache is consulted on the politeness path before every fetch.
Redirects — follow them but cap the chain (e.g., 5 hops) to avoid redirect loops; treat the final URL as the canonical one and dedup against it.
HTTP hygiene — send a descriptive User-Agent with a contact URL, set connect/read timeouts, request gzip, and cap the response body size to defend against multi-gigabyte "zip bomb" pages.
Status handling — 2xx store; 3xx follow; 404/410 mark gone; 429/503 back off and retry later with exponential delay; persistent 5xx demote the host's priority.

Step 7 — Deduplication: URLs and Content

Without dedup, a crawler drowns: the web is full of the same URL reached by many paths, and the same content served under many URLs. There are two distinct dedup problems.

URL dedup — the seen-set

Before adding a URL to the frontier, check whether it has already been seen. First canonicalize the URL — lowercase the host, remove the default port, strip the fragment (#…), sort query parameters, drop known tracking params, and resolve relative paths — so that trivially different strings map to one key. Then test membership in a Bloom filter seen-set:

Bloom filter says "not seen" (some bit unset) → the URL is definitely new — add it to the frontier and set its bits.
Bloom filter says "maybe seen" (all bits set) → almost certainly a duplicate; skip it, or if completeness is critical, confirm against a backing key-value store.

A Bloom filter holds a billion URLs in roughly a gigabyte and answers in O(1) without touching disk. Its only error mode is a false positive — wrongly skipping a small fraction of genuinely new URLs — which is an acceptable trade for the memory savings. It never produces false negatives, so it will never re-crawl a page it has already filtered.

Content dedup — exact and near-duplicate

Two different URLs can serve identical or nearly identical content (mirror sites, syndicated articles, printer-friendly versions). Catch exact duplicates with a content hash — MD5 or SHA-256 of the normalized HTML — and skip storing a page whose hash already exists. Catch near-duplicates with a similarity fingerprint: SimHash (or MinHash) reduces a document to a 64-bit signature such that similar documents have signatures within a small Hamming distance. Pages within ~3 bits are treated as duplicates. This is what stops a crawler from indexing ten thousand near-identical templated pages from a single content farm.

Step 8 — Parsing and URL Extraction

Once a page is fetched and confirmed new, the parser extracts outbound links and other content. The steps:

Parse the HTML into a DOM, tolerantly — real-world HTML is malformed constantly, so use a forgiving parser (libxml2, lxml, jsoup) that never throws on bad markup.
Extract anchors — collect href values from <a> tags, plus canonical-link and sitemap hints.
Resolve to absolute URLs — convert relative links against the page's base URL, then run the same canonicalization used by the seen-set.
Filter — keep only crawlable schemes (http/https), drop disallowed paths per robots, drop unwanted content types, and apply trap bounds (Step 9).
Enqueue — push surviving, unseen URLs back into the frontier with a freshly assigned priority.

If the target sites are JavaScript-heavy single-page apps, static HTML parsing misses most links because the DOM is built client-side. Then you need a headless browser (headless Chrome via Puppeteer/Playwright) to render the page before extraction — far more expensive, so most crawlers render selectively, only for hosts known to require it.

Step 9 — Crawl Traps and Politeness

A crawl trap is a region of a site that generates an unbounded number of URLs, either accidentally or maliciously. Left undefended, a single trap can consume your entire frontier and bandwidth. Common traps:

Infinite calendars — "next month" links that go forever (/calendar?date=2099-12 → 2100-01 → …).
Session IDs in URLs — every visit mints a new ?sid=…, so the same page looks like infinitely many URLs.
Faceted-search explosions — every combination of filter parameters is a distinct URL, producing a combinatorial blowup.
Infinitely deep paths — misconfigured servers that resolve /a/a/a/a/… to the same page at any depth.

Defenses are a layered set of hard bounds: cap URL depth (path segments and crawl distance from seed), cap pages per host, cap URL length and query-parameter count, and detect repeating path patterns. Crucially, content dedup (Step 7) is the backstop — a trap that generates near-identical pages gets filtered by SimHash after only a few hits even if URL bounds miss it.

politeness recap

Politeness is not just crawl-delay. Adapt to the host's health: back off when you see 429 (Too Many Requests) or 503, and slow down automatically if a host's response times rise — a sign you are stressing it. A good crawler treats every target server as a guest treats a host's home: one request at a time, spaced out, and gone the moment you are asked to leave.

Step 10 — Distributed Workers and Coordination

A single machine cannot crawl a billion pages a month. The work is spread across a fleet, and the partitioning scheme is the key decision because it determines how politeness is enforced.

Partition by hostname. Hash each URL's host and assign it to a crawler node: node = hash(host) % N. Every URL for example.com goes to the same node, so all of that host's politeness state — its back queue, its crawl-delay timer, its robots cache, its DNS entry — lives on one machine. No cross-node coordination is needed on the fetch hot path, which is what makes the design scale. The frontier and the Bloom filter seen-set are sharded by the same hostname key so a node owns a self-contained slice of the crawl.

Node membership and host-to-node assignment are managed by a coordinator (ZooKeeper or etcd) using consistent hashing, so that when a node joins or dies only its share of hosts is remapped rather than the whole keyspace. The frontier is backed by a durable, partitioned log so in-flight URLs survive a worker crash and can be replayed by whichever node inherits the host.

Step 11 — Storage

The crawler produces three kinds of data, each with a different access pattern and store:

Data	Store	Why
Raw HTML (page bodies)	Object storage (S3 / GCS / HDFS), gzip-compressed, keyed by content hash	Petabyte scale, write-once read-rarely, cheap per GB; content-hash key gives free exact dedup
URL metadata	Wide-column / key-value store (Bigtable, Cassandra) keyed by URL hash	Billions of small rows, point lookups for "have I crawled this and when?", high write throughput
Link graph	Adjacency lists in the same wide-column store, or a dedicated graph store	Powers PageRank and frontier prioritization; written as edges during extraction

Raw page bodies go to object storage because they are large, immutable, and read infrequently — exactly the workload object stores are cheapest at. Keying each blob by its content hash means identical content collapses to one object automatically. The metadata table holds the small, frequently-updated facts about each URL (last-crawled time, HTTP status, ETag, content hash, next-recrawl time) and is the table the scheduler scans to decide what to recrawl.

Step 12 — Freshness and Recrawl

Crawling once is not enough — the web changes. But pages change at wildly different rates: a news homepage updates every few minutes, an archived blog post never changes. Recrawling everything on a fixed schedule wastes most of your bandwidth re-downloading unchanged pages, so freshness must be adaptive.

Conditional GET — send If-Modified-Since and If-None-Match (with the stored ETag). If the page is unchanged the server returns 304 Not Modified with no body, saving nearly all the bandwidth while confirming freshness.
Estimate change frequency — model each page's updates (a Poisson process is the classic estimator) from the history of observed changes. A page that changed on 3 of the last 5 visits gets recrawled sooner than one that never changes.
Schedule by rate × importance — set the recrawl interval inversely proportional to estimated change frequency, weighted by page importance (PageRank). High-churn, high-value news pages recrawl hourly; static low-value pages recrawl monthly.
Re-inject via the frontier — the scheduler scans the metadata table for URLs whose next-recrawl time has arrived and pushes them back into the frontier at the appropriate priority. Recrawl is not a separate pipeline; it just feeds the same frontier.

Step 13 — Scaling and Fault Tolerance

At a billion pages a month, the crawler must keep running through machine failures, slow hosts, and traffic spikes without losing work or violating politeness.

Scaling

Fetcher threads are stateless and scale horizontally — add machines to add bandwidth and concurrency. The frontier is sharded by host, the Bloom filter is sharded by host, and the DNS cache is distributed, so no shared structure becomes a global bottleneck. The real ceiling is egress/ingress bandwidth and politeness: you cannot crawl faster than the number of distinct hosts × their allowed rate, regardless of machine count.

Fault tolerance

Worker crash — the coordinator reassigns the dead node's hosts to surviving nodes via consistent hashing; because the frontier is a durable partitioned log, in-flight URLs are replayed, not lost.
Idempotent fetches — re-fetching a page after a crash is harmless: content-hash dedup means reprocessing the same page produces no duplicate storage. This makes at-least-once frontier processing safe, so you never need expensive exactly-once semantics.
Seen-set durability — checkpoint the Bloom filter periodically; on restart, reload the checkpoint. A small amount of re-crawling between checkpoint and crash is acceptable thanks to content dedup.
Poison URLs — a URL that repeatedly crashes a worker (malformed gigabyte page, parser bomb) is detected by a retry counter and quarantined to a dead-letter queue rather than retried forever.

Step 14 — Key Tradeoffs

Decision	Choice	Trade-off accepted
Frontier design	Mercator priority + politeness	More moving parts vs. fair ordering and per-host politeness without starving throughput
URL dedup	Bloom filter seen-set	Tiny false-positive URL loss vs. fitting billions of URLs in ~1 GB of RAM
Content near-dup	SimHash fingerprint	Occasional wrong merge of similar-but-distinct pages vs. catching mirror/content farms cheaply
Rendering	Static HTML only (selective JS)	Misses links in JS-heavy SPAs vs. 10–100× cost of headless rendering everything
Partitioning	By hostname	Hot-host skew (one giant site) vs. fully local politeness with no per-fetch coordination
Storage	Object store + wide-column metadata	Eventual consistency vs. cheap petabyte scale and free content-hash dedup

takeaway

A web crawler looks like a simple BFS and turns out to be an exercise in pacing and deduplication at planetary scale. The frontier — priority for what to crawl, politeness for when — is the component that makes or breaks the design; everything else (DNS caching, robots.txt, Bloom-filter dedup, SimHash, hostname partitioning, adaptive recrawl) exists to feed that frontier efficiently and behave like a good citizen of the web. Know why politeness caps throughput, why a Bloom filter is the right seen-set, and how adaptive recrawl keeps the index fresh without re-downloading the static web — and you can defend every box on the diagram.

🎯 interview hot-takes

How does the URL frontier enforce politeness without starving throughput? Use the Mercator two-tier design: front queues rank URLs by priority (importance and freshness need), while back queues are per-host — one host maps to exactly one back queue with a single in-flight connection and a crawl-delay gap enforced by a min-heap keyed on next-fetch time. Politeness is per-host, so overall throughput stays high because thousands of distinct hosts are fetched in parallel.
Why use a Bloom filter for the seen-URL set, and what is the risk? A billion-plus URLs will not fit a hash set in memory; a Bloom filter holds 1B URLs in roughly 1.2 GB at a 1% false-positive rate with O(1) membership checks. The risk is false positives — a tiny fraction of genuinely new URLs are wrongly skipped as already-seen (there are never false negatives). If completeness matters, consult a backing key-value store on positives.
How do you detect duplicate and near-duplicate content? Catch exact duplicates with a content hash (MD5 or SHA-256) of the normalized HTML. Catch near-duplicates — mirror sites and boilerplate-heavy templates — with SimHash or MinHash: compute a fingerprint and compare Hamming distance, treating pages within a few bits as duplicates and skipping them.
How do you avoid crawler traps? Bound the crawl: cap URL depth, pages per host, URL length, and query-parameter count, and detect patterns for calendar and session-id loops. Combine these limits with content dedup so an infinite trap that generates near-identical pages is filtered out after only a few hits.
How do you keep content fresh without re-crawling everything? Recrawl adaptively. Track each page's change frequency using ETag and Last-Modified plus observed diffs, then schedule recrawl intervals proportional to change rate and page importance (PageRank). High-churn news pages recrawl hourly while static pages recrawl monthly — the frontier priority encodes this, and conditional GETs returning 304 save bandwidth.