Desirable Properties for a Web Crawler

I aim to build a web crawler that can download a billion pages in a week.
Below are some desirable properties any web crawler should have:

Scalability

The web is enormous and continually growing. A crawler's throughput should
scale roughly linearly with the number of agent machines added to the
system, so that we can simply add more agents as our needs grow.

Speed

Speed is a significant issue at this scale. For example, if we want to crawl 1
billion pages in a week (this is less than 1/1000th of the web), our system
will have to sustain a rate of roughly 1,653 downloads per second.
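
A quick sanity check on that number:

    # Back-of-the-envelope rate for 10^9 pages in one week.
    pages = 1_000_000_000
    seconds_per_week = 7 * 24 * 60 * 60        # 604,800 seconds
    print(round(pages / seconds_per_week))     # 1653 pages per second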

To achieve this speed we need to employ a number of techniques, such as
concurrent connections, data compression, DNS caching, and minimizing disk
seeks.
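
To make a couple of those techniques concrete, here is a rough sketch (not our
eventual design) that combines concurrent connections with a cached DNS
resolver, using only the Python standard library; the seed URLs are
placeholders, and compression is left out.

    import socket
    from concurrent.futures import ThreadPoolExecutor
    from functools import lru_cache
    from urllib.request import urlopen

    # Cache the resolver so repeated lookups for the same host are served from
    # memory. Monkey-patching socket.getaddrinfo affects every connection this
    # process makes; a sketch only -- it ignores DNS TTLs.
    _original_getaddrinfo = socket.getaddrinfo

    @lru_cache(maxsize=100_000)
    def _cached_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
        return _original_getaddrinfo(host, port, family, type, proto, flags)

    socket.getaddrinfo = _cached_getaddrinfo

    def fetch(url):
        """Download one page; the thread pool provides the concurrent connections."""
        with urlopen(url, timeout=10) as response:
            return url, response.read()

    # A small pool for illustration; a production crawler keeps hundreds of
    # connections in flight.
    seeds = ["http://example.com/", "http://example.org/"]
    with ThreadPoolExecutor(max_workers=16) as pool:
        for url, body in pool.map(fetch, seeds):
            print(url, len(body), "bytes")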

Politeness

While we need a high download rate, we must be polite and not overload any
one server. Najork et al. propose limiting the rate of requests to a single
server by waiting 10 times the time it took to download the last page before
contacting that server again [1].
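
A minimal sketch of that heuristic, assuming a caller-supplied download
function and one shared table of per-host back-off times:

    import time

    # Earliest monotonic time at which each host may be contacted again.
    next_allowed = {}

    def polite_fetch(host, download):
        """Fetch from a host, then bar it for 10x the time the download took."""
        delay = next_allowed.get(host, 0.0) - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        start = time.monotonic()
        page = download(host)                 # caller-supplied download callable
        elapsed = time.monotonic() - start
        next_allowed[host] = time.monotonic() + 10 * elapsed
        return page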

Quality

We aim to build a crawler that visits “high-quality” or “relevant” pages
early in the crawl, but there is no well-known quality metric to rank pages
by. However, Najork and Wiener found that a simple breadth-first crawl tends
to visit high-quality pages first [2]. We’ll eventually build a more
intelligent page-selection mechanism, but for now breadth-first will work.
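
In code, breadth-first is just a FIFO frontier plus a set of URLs already
seen; the fetch and extract_links functions below stand in for the download
and page-processing components:

    from collections import deque

    def breadth_first_crawl(seeds, fetch, extract_links, limit=1_000_000):
        """Plain breadth-first crawl: FIFO frontier, never revisit a URL.
        Stops once `limit` URLs have been discovered."""
        frontier = deque(seeds)
        seen = set(seeds)
        while frontier and len(seen) < limit:
            url = frontier.popleft()
            page = fetch(url)
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)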

Agents as a Distributed System

Given that the crawling problem cannot be solved by a single machine,
we have to structure our solution as a distributed system. Distributed
systems introduce more room for failures and for errors in coordination.
Therefore we define the following desirable features for our distributed
system:

Fault Tolerance

Hardware failure is unavoidable. The failure of one node should not
prevent survivors from continuing to operate.

Even Partitioning

The URL frontier should be partitioned evenly across all agents so that the
work is shared evenly. Many crawlers use a hash function to distribute the
URLs among machines.
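
The simplest version of that scheme hashes the hostname, so every URL from a
given site lands on the same agent (a sketch; it assumes a fixed number of
agents):

    import hashlib
    from urllib.parse import urlparse

    def assign_agent(url, num_agents):
        """Map a URL to an agent by hashing its hostname (stable across processes)."""
        host = urlparse(url).hostname or ""
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % num_agents

    # Both pages hash to the same agent because they share a hostname.
    assert assign_agent("http://example.com/a", 8) == assign_agent("http://example.com/b", 8)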

Minimize Overlap

Overlap is defined as (n - u)/u, where n is the total number of pages crawled
and u is the number of unique pages among them (u < n whenever the same page
has been erroneously fetched more than once). Ideally, we want an overlap
of 0 [3].
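
The metric itself is trivial to compute from a log of fetched URLs:

    def overlap(fetched_urls):
        """Overlap = (n - u) / u: n total fetches, u unique pages among them."""
        n = len(fetched_urls)
        u = len(set(fetched_urls))
        return (n - u) / u

    overlap(["a", "b", "a", "c"])   # 4 fetches, 3 unique pages -> overlap of 1/3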

Agent Churn

During the crawl we may want to add additional resources. The system
should support agents joining and leaving the group mid-crawl.
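
One standard way to support churn is consistent hashing, which UbiCrawler
uses for its assignment function [3]. A minimal sketch (the agent names are
hypothetical):

    import bisect
    import hashlib

    def _point(key):
        """Hash a string to a position on the ring."""
        return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

    class HashRing:
        """Consistent hashing: when an agent joins or leaves, only a small
        fraction of hosts are reassigned instead of reshuffling everything."""

        def __init__(self, agents, replicas=100):
            self.replicas = replicas
            self.ring = []                       # sorted list of (point, agent)
            for agent in agents:
                self.add(agent)

        def add(self, agent):
            for i in range(self.replicas):
                bisect.insort(self.ring, (_point(f"{agent}#{i}"), agent))

        def remove(self, agent):
            self.ring = [(p, a) for p, a in self.ring if a != agent]

        def agent_for(self, host):
            """Walk clockwise from the host's point to the next agent on the ring."""
            i = bisect.bisect(self.ring, (_point(host),)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["agent-1", "agent-2", "agent-3"])
    owner = ring.agent_for("example.com")
    ring.remove("agent-3")                       # most hosts keep their owner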

Next Steps

There are five major parts to a web crawler:

  • The URL frontier
  • IP address lookup
  • Page download
  • Page processing
  • Tracking URLs encountered

Over the next few articles we will design each of these five
components. The list of desirable features gives us guidelines
that help shape the decisions we make about each component.

  1. Najork M. High-Performance Web Crawling. Systems Research. 2001.
  2. Najork M, Wiener JL. Breadth-First Search Crawling Yields High-Quality Pages. Systems Research. 2001:114-118.
  3. Boldi P, Codenotti B, Santini M, Vigna S. UbiCrawler: a scalable fully distributed Web crawler. Software: Practice and Experience. 2004;34(8):711-726.