How long would it take to crawl a billion pages using wget?
We’re on a quest to write a scalable web crawler: one that downloads a billion pages a week. We’ve calculated that to download a billion pages in a week we need to sustain a rate of 1653 pages per second.
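That rate is just a billion divided by the number of seconds in a week:

    # pages per second needed to fetch a billion pages in one week
    echo $(( 1000000000 / (7 * 24 * 60 * 60) ))   # => 1653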
The problem with these kinds of numbers is that, unless you are familiar with web-crawling, it is not obvious how fast that really is. How fast can a simple crawler go? 10 pages per second? A thousand?
We set out to benchmark the simplest thing that could possibly work: wget driven by xargs.
wget is a popular tool for downloading files from the web. It has a flexible set of options and built-in support for crawling.
xargs is used to run a command repeatedly over a given set of inputs. In our case, the input is a fixed list of URLs, and xargs serves as our “thread-pool” (it’s actually a “process-pool”). Using the -P <numprocs> option, xargs works through the input file of URLs; each wget process takes a URL off the stack and runs until it finishes the crawl for that domain. The number of concurrent processes is limited by -P.
Before we actually run our jobs, let’s try to predict the kind of results we’ll get. I’m running the jobs on my MacBook Pro (Intel Core 2 Duo, 4GB RAM) on my home network, where I have AT&T U-verse with an advertised download speed of 18Mbps (megabits per second).
wget measures rate limiting in kilobytes rather than kilobits, so we’ll use bytes rather than bits:
18Mbps = 2.25 megabytes per second =~ 2300 kilobytes/s
We’re just doing rough calculations at this point, so let’s guess that the mean size of each page is 10KB. At that page size, the absolute best number we can expect is around 230 pages/second before we saturate my connection.
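The same back-of-the-envelope in the shell, using the rounded numbers above:

    # best case: ~2300 KB/s of bandwidth spread over ~10KB pages
    echo $(( 2300 / 10 ))   # => 230 pages/sec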
While we want our crawler as a system to go as fast as possible, we don’t want to hit any one server too many times. Not only might we get banned, it isn’t kind to the site owners. Many smaller servers can’t handle the load of a crawler running against them at full speed.
So if we want to get to 200+ pages/sec we’re going to have to have many concurrent connections.
wget has a number of options that we can set to be nicer to each individual server. So our strategy will be to crawl many servers concurrently, but only hit a particular server lightly.
Here are a few of the relevant wget options we will set:
- --random-wait: wait a random amount of time between requests, averaging 2 seconds. The waiting time is for the server’s benefit, but the randomness is for ours: given that we are going to be running a large number of processes in parallel, we’d rather have them be out of step with each other.
- --tries=5: only retry 5 times.
- --timestamping: if the file already exists on disk, send the server a HEAD request and check the Last-Modified header. If the file on disk has a timestamp greater than or equal to the Last-Modified date, we don’t request the whole page. This extra HEAD request doesn’t really slow us down because wget will only make it if the file already exists on disk. This is just a little extra protection in case our separate processes start to visit the same sites.
DMOZ Sample Set
For multiple runs of our test we don’t want to hit one particular server repeatedly, so we’re going to use DMOZ as a source and pull out a random sample of URLs with a few commands:
    mkdir -p data/dmoz
    curl -0 http://rdf.dmoz.org/rdf/content.rdf.u8.gz > data/dmoz/dmoz-content.rdf.u8.gz
    cd data/dmoz
    gunzip dmoz-content.rdf.u8.gz
    cat dmoz-content.rdf.u8 | grep http | grep r:resource | \
      grep -o '<link r:resource=['"'"'"][^"'"'"']*['"'"'"]' | \
      sed -e 's/^<link r:resource=["'"'"']//' -e 's/["'"'"']$//' \
      > urls.txt
    ruby random-lines.rb urls.txt 300 > random-urls.txt
(You can get the random-lines.rb script from the github repository linked below.)
The DMOZ file is around 300MB, so this will take a few minutes. The DMOZ RDF file is well formed, so we’re just using sed to extract the URLs.
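The random-lines.rb script just grabs N random lines from a file; if you have GNU coreutils handy you could get the same effect straight from the shell (a stand-in, not what the repository uses):

    # pick 300 random URLs (requires GNU shuf)
    shuf -n 300 urls.txt > random-urls.txt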
The full wget command is below. You can see we aren’t trying very hard to access a page that doesn’t respond quickly (the various timeout options). Also, we’re only going 5 pages deep per URL, and we are not visiting any “parent” pages, that is, we’re not crawling up any directories. We don’t want any images or binary files (the --reject options) and we don’t care about invalid SSL certificates (--no-check-certificate):
    wget \
      --tries=5 \
      --dns-timeout=30 \
      --connect-timeout=5 \
      --read-timeout=5 \
      --timestamping \
      --directory-prefix=data/pages \
      --wait=2 \
      --random-wait \
      --recursive \
      --level=5 \
      --no-parent \
      --no-verbose \
      --reject *.jpg --reject *.gif \
      --reject *.png --reject *.css \
      --reject *.pdf --reject *.bz2 \
      --reject *.gz --reject *.zip \
      --reject *.mov --reject *.fla \
      --reject *.xml \
      --no-check-certificate
A nice thing about our setup is that each wget process is assigned to one domain, and wget caches the DNS lookup, so we only need to make one DNS request per process. A problem with this setup is that wget resolves hostnames through gethostbyname (or getaddrinfo, depending on your platform). A quick check of man gethostbyname shows that on my BSD-based Mac, gethostbyname is thread-safe, i.e. it is synchronized. The result is that there is going to be some resource starvation when we have hundreds of wget processes all trying to call BIND at the same time.
We set a DNS timeout of 30 seconds here, but in practice I found that it didn’t matter much. All of the processes race to grab the DNS lookup lock at the beginning and a large number time out waiting for the lock, but the requests even out over a couple of minutes.
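This is also why “run your own non-locking DNS server” shows up in the lessons at the end of this post. A minimal sketch of one way to get there, using a local caching resolver (this assumes dnsmasq is installed, and 8.8.8.8 is just a stand-in for whatever upstream resolver you prefer):

    # run a small caching DNS server locally, in the foreground
    sudo dnsmasq -d --listen-address=127.0.0.1 --cache-size=10000 \
         --no-resolv --server=8.8.8.8

Point /etc/resolv.conf at 127.0.0.1 and the lookups at least come back fast, even though the per-process resolver lock is still there.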
xargs acts as our thread-pool or, more specifically, our process-pool. -P specifies the number of processes to use. -I <sub> is a substitution parameter: it means “for each line of input, run CMD, substituting the current line for <sub>”. So below we substitute _URL_ with the actual URL from each line of the URLs file:
    cat $URLS_FILE | xargs -P $CRAWLERS -I _URL_ $WGET_CMD _URL_
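If the -P/-I mechanics are unfamiliar, a throwaway example (nothing to do with the crawl itself) makes the substitution obvious:

    # two workers, three lines of input; each line is dropped in where _X_ appears
    printf 'a\nb\nc\n' | xargs -P 2 -I _X_ echo "fetched _X_"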
Our crawler script looks like this:
    #!/bin/bash
    # a basic crawler in bash
    # usage: crawl.sh urlfile.txt <num-procs>

    URLS_FILE=$1
    CRAWLERS=$2

    mkdir -p data/pages

    WGET_CMD="wget \
     --tries=5 \
     --dns-timeout=30 \
     --connect-timeout=5 \
     --read-timeout=5 \
     --timestamping \
     --directory-prefix=data/pages \
     --wait=2 \
     --random-wait \
     --recursive \
     --level=5 \
     --no-parent \
     --no-verbose \
     --reject *.jpg --reject *.gif \
     --reject *.png --reject *.css \
     --reject *.pdf --reject *.bz2 \
     --reject *.gz --reject *.zip \
     --reject *.mov --reject *.fla \
     --reject *.xml \
     --no-check-certificate"

    cat $URLS_FILE | xargs -P $CRAWLERS -I _URL_ $WGET_CMD _URL_
I’ve put this code on github with a Rakefile so you can follow along:
    git clone git://github.com/jashmenn/bashpider.git
    cd bashpider
    rake data:get_urls   # downloads and parses DMOZ, will take a while
    rake crawl:restart   # this will run a crawl
If you want to monitor the downloads per second, keep an eye on the downloaded-file count from another window.
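For example, here’s a crude monitor that counts the files under data/pages once a second (assuming the --directory-prefix used in the script above); the difference between successive counts is roughly your pages per second:

    # print a running count of downloaded files once a second
    while true; do find data/pages -type f | wc -l; sleep 1; done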
When you feel you’ve gathered enough data, hit CTRL-C in both windows and work out the rate from what landed on disk.
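One simple way to turn a run into a pages-per-second figure, again assuming everything landed under data/pages:

    # total pages fetched; divide by the number of seconds the crawl ran
    find data/pages -type f | wc -l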
Results at Home
As you can see from the numbers below, on my home connection through U-verse we max out at about 150 agents and 27 pages/sec, far below our original estimate of 200 pages/sec.
    procs    pages/sec
       10        3.9
       25        8.9
       50       15.8
       75       19.7
      100       25.8
      125       26.1
      150       27.3
      175       22.3
      200        6.0
First of all, our initial estimate of 10KB per page was too low; in reality we observed a mean page size of around 37KB. This means that on our 2300KB/s connection we can only expect a best-case download rate of 62 pages/sec. Still, 27 pages/sec is not even half that.
The other problem could be DNS requests. My home router also serves as my DNS server. It’s good enough for home use, but I’m pretty sure it’s not up to this task. Come to think of it, I’m not even sure how fast this CAT5 cable is.
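A quick way to put a number on that suspicion is to ask the router directly and look at the reported query time (192.168.1.254 is just a placeholder; substitute your router’s address):

    # time a lookup against the home router's DNS
    dig @192.168.1.254 example.com | grep "Query time"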
I think it’s time to try out this setup in a better environment.
In the Data Center
For this experiment we loaded our script onto a beefy 8-core machine with a fat bandwidth connection.
The results were much better:
    procs    pages/sec
      150       54
      200       71
      300      107
      400      141
      500      178
      600      214
      700      244
      800      386
      900      327
     1000      203
     1100      222
     1200      392
     1300      202
     1500      249
     1600      485
     2000      577
     3000      679
     3500      459
     4000      336
Take these numbers as rough estimates; I only let each of these runs go for a few minutes.
When I started getting into the thousands of processes, I wondered if I would hit the user process limit. Jay Donnell pointed out to me that ulimit will show the process limit:
    ulimit -a
    max user processes              (-u) 268287
So with 260k+ processes available, we have no problem there.
Every page each wget process downloads gets its own file, uncompressed, so we’ve got a lot of disk IO going on. We’d probably save a good amount of time if each process just opened one file and appended content to it. We’d also save the file system from creating hundreds of thousands of inodes.
Also, the decline we see around 3000 agents may be due, in part, to the max number of open files on our system:
    $ ulimit -a
    open files                      (-n) 1024
Each crawler waits an average of 2 seconds before making the next request, at which point it makes the request and downloads the file. So each process creates roughly one file every 2+ seconds, which means the number of wget processes we can run should be at least twice the max number of open files, or around 2048.
My theory is this: at around 3000 concurrent agents, the time spent actually downloading content spreads the writes out enough that we usually have file descriptors available when we need them. Once we reach 4000 concurrent agents, though, the probability that too many agents need to write a file at the same moment is much higher, and we see a significant performance drop.
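One obvious experiment, echoing the open question at the end of this post, would be to raise the open file limit for the shell that launches the crawl and see whether the drop-off moves:

    # raise the per-process open file limit for this shell and its children
    # (going past the hard limit needs root or an /etc/security/limits.conf entry)
    ulimit -n 8192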
I think we’re going to need to look at the filesystem format. Currently we’re using ext3, but I’m not sure if we should switch to xfs. While monitoring the file count using find, I kept getting the following error:
    find: WARNING: Hard link count is wrong for <some file>: this may be a bug
    in your filesystem driver. Automatically turning on find's -noleaf option.
    Earlier results may have failed to include directories that should have
    been searched.
kjournald seemed to be working very hard to keep up with all the file writes. I’m not sure if this is unavoidable or not. I’m going to leave this problem for future work.
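If you want to watch this happening during a run, a couple of stock tools will do it (iostat comes from the sysstat package on most Linux distributions):

    # per-device IO utilization, refreshed every second
    iostat -x 1
    # or a one-shot look at how busy kjournald is
    top -b -n 1 | grep kjournald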
This crawler is just a baseline to see what kind of performance is possible with basic unix utilities. Obviously this approach used a static list of URLs; in a “real” crawler you would probably want a mechanism for communicating and prioritizing URLs throughout the system.
That said, if you already know the list of URLs you want to download, you could fetch tens of millions of pages over a 24-hour window. For instance, at a sustained rate of 600 pages per second you could download 51.8 million pages in 24 hours.
So how long would it take to download a billion pages with wget? If you had the list of URLs beforehand, according to these numbers it would take 19 days.
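The arithmetic, for the skeptical:

    # seconds, then days, to fetch a billion pages at a sustained 600 pages/sec
    echo $(( 1000000000 / 600 ))           # ~1666666 seconds
    echo $(( 1000000000 / 600 / 86400 ))   # ~19 days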
To download a billion pages in a week we’re going to need to figure out a way to download at least 1000 more pages per second.
What we’ve learned:
- watch out for file limits
- append to a single file rather than creating thousands of tiny files
- run your own non-locking DNS server
- unix tools are handy and powerful
What do you think? Any suggestions for cranking more performance out of wget? Should I try increasing my open file limit and see what happens? Think these numbers are ridiculous? Leave a comment below!