I aim to build a web crawler that can download a billion pages in a week.
Below are some desirable properties any web crawler should have:
The web is enormous and continually growing. A crawler should scale
linearly with the number of agent-machines that are added to the
system. This allows us to add more agents as our needs increase.
Speed is a significant issue at this scale. For example, if we want to crawl 1
billion pages in a week (this is less than 1/1000th of the web), our system
will have to sustain roughly 1,653 downloads per second.
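The sustained rate is just the page count divided by the seconds in a week:

```python
# Pages we want to crawl, and the time budget in seconds.
pages = 1_000_000_000
seconds_per_week = 7 * 24 * 60 * 60  # 604,800

rate = pages / seconds_per_week
print(round(rate))  # prints 1653
```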
To achieve this speed we need to employ a number of techniques, such as
concurrent connections, data compression, DNS caching, and minimizing disk seeks.
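Of these techniques, DNS caching is the simplest to sketch: memoize lookups per hostname so repeated URLs on one host resolve only once. This is a minimal sketch (the class name is my own, and a real crawler would also honor record TTLs); the resolver is injectable so the caching logic can be exercised without touching the network:

```python
import socket


class DNSCache:
    """Cache hostname -> IP so repeated URLs on one host resolve once."""

    def __init__(self, resolve=socket.gethostbyname):
        self._resolve = resolve  # injectable for testing
        self._cache = {}

    def lookup(self, host):
        if host not in self._cache:
            self._cache[host] = self._resolve(host)
        return self._cache[host]
```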
While we need a high download rate, we must be polite and not
overload any one server. Najork proposes limiting requests to
a single server by waiting 10 times the time it took to download the
last page [1].
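The 10× rule is easy to track per host: after each download, record the earliest time that host may be contacted again. A minimal sketch (class and method names are my own, not from the paper):

```python
import time


class PolitenessPolicy:
    """Wait 10x the last download time before re-contacting a host."""

    MULTIPLIER = 10

    def __init__(self):
        self._next_allowed = {}  # host -> earliest next request time

    def record_download(self, host, download_seconds, now=None):
        now = time.monotonic() if now is None else now
        self._next_allowed[host] = now + self.MULTIPLIER * download_seconds

    def can_fetch(self, host, now=None):
        now = time.monotonic() if now is None else now
        return now >= self._next_allowed.get(host, 0.0)
```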
We aim to build a crawler that preferentially visits “high-quality” or “relevant”
pages, but there is no single well-known quality metric to rank them by.
However, Najork and Wiener found that a simple
breadth-first crawl tends to visit high-quality pages first [2].
We’ll eventually build a more intelligent page-selection mechanism, but for now
breadth-first will work.
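Breadth-first ordering falls out naturally from using a FIFO queue as the frontier. A toy version over an in-memory link graph (the graph and seed in the test are made up for illustration; a real frontier holds URLs discovered from downloaded pages):

```python
from collections import deque


def bfs_crawl(seed, links):
    """Visit pages breadth-first; `links` maps a URL to its out-links."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft()  # FIFO queue -> breadth-first order
        order.append(url)
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return order
```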
Agents as a Distributed System
Given that the crawling problem cannot be solved by a single machine,
we must structure our solution as a distributed
system. Distributed systems introduce more room for failures
and coordination errors. Therefore we define the following desirable
features for our distributed system:
Hardware failure is unavoidable. The failure of one node should not
prevent survivors from continuing to operate.
The URL frontier should be distributed across all agents so that the
work is evenly shared. Many crawlers use a hash function to distribute
the URLs among machines.
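Hash partitioning lets every agent compute a URL's owner locally, with no coordination. A sketch (the function name is my own): hashing the hostname rather than the full URL keeps all of a site's pages on one agent, which also makes politeness enforcement local. Python's built-in `hash` is salted per process, so a stable hash like SHA-1 is used instead:

```python
import hashlib
from urllib.parse import urlsplit


def owner_agent(url, num_agents):
    """Map a URL to the index of the agent responsible for crawling it."""
    host = urlsplit(url).netloc
    # Stable across processes and machines, unlike the built-in hash().
    digest = hashlib.sha1(host.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_agents
```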
Overlap is defined as (n − u) / n, where n is the total number of
crawled pages and u is the number of unique pages (sometimes u < n
because the same page has been erroneously fetched several times).
Optimally, we want an overlap of 0 [3].
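With that definition, computing overlap is a one-liner:

```python
def overlap(n, u):
    """Fraction of crawled pages that were duplicates: (n - u) / n."""
    return (n - u) / n
```

For example, fetching 1,000 pages of which only 950 are unique gives an overlap of 0.05.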
During the crawl we may want to add additional resources. The system
should support agents joining and leaving the group.
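Plain modulo hashing reassigns almost every host when an agent joins or leaves; consistent hashing limits reassignment to roughly 1/n of the keys. A minimal ring (my own sketch, with no virtual nodes, which a real deployment would add to even out load):

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring mapping hosts to agent names."""

    def __init__(self, agents=()):
        self._ring = []  # sorted list of (point, agent)
        for agent in agents:
            self.add(agent)

    @staticmethod
    def _point(key):
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def add(self, agent):
        bisect.insort(self._ring, (self._point(agent), agent))

    def remove(self, agent):
        self._ring.remove((self._point(agent), agent))

    def owner(self, host):
        # First agent clockwise from the host's point (wrapping to the start).
        i = bisect.bisect(self._ring, (self._point(host), ""))
        return self._ring[i % len(self._ring)][1]
```

When an agent is removed, only the hosts it owned move to a neighbor; every other host keeps its assignment.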
There are five major parts to a web crawler:
- The URL frontier
- IP address lookup
- Page download
- Page processing
- Tracking URLs encountered
Over the next few articles we will design each of the five
components. The list of desirable features gives us guidelines
that help shape the decisions about each component.
1. Najork M. High-Performance Web Crawling. Systems Research. 2001.
2. Najork M, Wiener JL. Breadth-First Search Crawling Yields High-Quality Pages. Systems Research. 2001:114-118.
3. Boldi P, Codenotti B, Santini M, Vigna S. UbiCrawler: a scalable fully distributed Web crawler. Software: Practice and Experience. 2004;34(8):711-726.