a crawler using wget and xargs

How long would it take to crawl a billion pages using wget and xargs?

We’re on a quest to write a scalable web crawler: one that can download a billion pages a week. We’ve calculated that to download a billion pages in a week we need to sustain a rate of about 1,653 pages per second.

The problem with these kinds of numbers is that, unless you are familiar with web-crawling, it is not obvious how fast that really is. How fast can a simple crawler go? 10 pages per second? A thousand?

We set out to benchmark the simplest thing that could possibly work: wget and xargs.

Our Tools

wget is a popular tool for downloading files from the web. It has a flexible set of options and built-in support for recursive crawling.

xargs runs a command repeatedly over a given set of inputs; in our case the input is a fixed list of URLs. We use xargs as our “thread-pool” (it’s actually a “process-pool”): with the -P <numprocs> option, xargs works through the input file of URLs, each wget process takes a URL off the list and runs until it finishes the crawl for that domain, and the number of concurrent processes is capped at <numprocs>.

Napkin Calculations

Before we actually run our jobs, let’s try to predict the kind of results we’ll get. I’m running the jobs on my MacBook Pro (Intel Core 2 Duo, 4GB RAM) on my home network, an AT&T U-verse connection with an advertised download speed of 18Mbps (megabits per second).

wget measures rate limiting in kilobytes rather than kilobits, so we’ll use bytes rather than bits:

18Mbps = 2.25 megabytes per second ≈ 2300 kilobytes/second

We’re just doing rough calculations at this point, so let’s guess that the mean size of each page is 10KB. At that page size, the absolute best number we can expect is around 230 pages/second before we saturate my connection.
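As a sanity check on the napkin math, the shell can do the arithmetic for us:

echo $(( 1000000000 / (7 * 24 * 3600) ))  # ~1653 pages/sec needed for a billion pages in a week
echo $(( 2300 / 10 ))                     # ~230 pages/sec ceiling at 10KB/page on a 2300KB/s link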

Politeness

While we want our crawler as a system to go as fast as possible, we don’t want to hit any one server too many times. Not only might we get banned, it isn’t kind to the site owners: many smaller servers can’t handle the load of a crawler running against them at full speed.

So if we want to get to 200+ pages/sec we’re going to need many concurrent connections. wget has a number of options that let us be nicer to each individual server, so our strategy will be to crawl many servers concurrently but hit each particular server only lightly.

Here are a few of the relevant wget options we will set:

  • --wait=2 and --random-wait – wait a random amount of time between requests, averaging 2 seconds. The waiting is for the server’s benefit, but the randomness is for ours: since we’ll be running a large number of processes in parallel, we’d rather have them be out of step with each other.
  • --tries=5 – retry each URL at most 5 times
  • --timestamping – If the file exists on disk, send the server a HEAD request and check the Last-Modified header. If the file on disk has a timestamp greater than or equal to the Last-Modified date, we don’t request the whole page. This extra HEAD request doesn’t really slow us down because wget will only request it if the file already exists on disk. This is just a little extra protection in case our separate processes start to visit the same sites.

DMOZ Sample Set

For multiple runs of our test we don’t want to hit one particular server repeatedly, so we’ll use DMOZ to get a random sample of URLs and a few shell commands to extract them:

mkdir -p data/dmoz
curl http://rdf.dmoz.org/rdf/content.rdf.u8.gz > data/dmoz/dmoz-content.rdf.u8.gz
cd data/dmoz
gunzip dmoz-content.rdf.u8.gz
cat dmoz-content.rdf.u8 | grep http | grep r:resource | \
    grep -o '<link r:resource=['"'"'"][^"'"'"']*['"'"'"]' | \
    sed -e 's/^<link r:resource=["'"'"']//' -e 's/["'"'"']$//' \
    > urls.txt
ruby random-lines.rb urls.txt 300 > random-urls.txt

(You can get the random-lines.rb script here).
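If you don’t have that script handy, shuf from GNU coreutils (assuming it’s installed) is a rough stand-in for grabbing 300 random lines:

shuf -n 300 urls.txt > random-urls.txt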

The DMOZ file is around 300MB so this will take a few minutes. The DMOZ RDF file is well formed, so we’re just using grep and sed to extract the URLs.

Shaping wget

Our wget command is below. You can see we aren’t trying very hard to access a page that doesn’t respond quickly (the various timeout options). Also, we’re only looking 5 pages deep per URL, and we never visit any “parent” pages, that is, we never crawl up any directories. We don’t want any images or binary files (the --reject options) and we don’t care about invalid SSL certificates (--no-check-certificate).

wget \
  --tries=5 \
  --dns-timeout=30 \
  --connect-timeout=5 \
  --read-timeout=5 \
  --timestamping \
  --directory-prefix=data/pages \
  --wait=2 \
  --random-wait \
  --recursive \
  --level=5 \
  --no-parent \
  --no-verbose \
  --reject *.jpg --reject *.gif \
  --reject *.png --reject *.css \
  --reject *.pdf --reject *.bz2 \
  --reject *.gz  --reject *.zip \
  --reject *.mov --reject *.fla \
  --reject *.xml \
  --no-check-certificate

DNS

A nice thing about our setup is that each wget process is assigned to one domain, and wget caches the DNS lookup, so we only need to make one DNS request per process. A problem with this setup is that wget uses gethostbyname (or getaddrinfo, depending on your platform). A quick check of man gethostbyname shows that on my BSD-based Mac gethostbyname is thread-safe, i.e. it is synchronized. The result is that there is going to be some resource starvation when we have hundreds of wget processes all trying to call BIND at the same time.

We set a DNS timeout of 30 seconds here, but in practice I found that it didn’t matter much. All of the processes race to grab the DNS lookup lock at the beginning, a large number time-out (waiting for the lock), but the requests even out over a couple of minutes.
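If you want to eyeball DNS latency yourself during a run, one rough check (assuming dig is installed, and adjusting the path to wherever your URL list lives) is to time lookups for a handful of the sample hosts:

# strip the scheme and path to get bare hostnames, then ask dig how long each lookup took
head -20 random-urls.txt | sed -e 's|.*://||' -e 's|[/:].*||' | while read -r host; do
  dig "$host" | grep 'Query time'
done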

xargs

xargs acts as our thread-pool or, more specifically, our process-pool.

-P specifies the number of processes to use. -I <sub> is a substitution parameter: it means “for each line of STDIN, run CMD, substituting the current line for <sub>”. So below we substitute _URL_ with the actual URL read from the URLS_FILE.

cat $URLS_FILE | xargs -P $CRAWLERS -I _URL_ $WGET_CMD _URL_
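To see the substitution and process-pool mechanics in isolation, here’s a toy run that swaps wget for echo (the example.com URLs are just placeholders):

printf 'http://a.example.com\nhttp://b.example.com\nhttp://c.example.com\n' | \
  xargs -P 2 -I _URL_ echo "would crawl" _URL_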

Code

Our crawler script looks like this:

#!/bin/bash
# a basic crawler in bash
# usage: crawl.sh urlfile.txt <num-procs>
URLS_FILE=$1
CRAWLERS=$2

mkdir -p data/pages

WGET_CMD="wget \
  --tries=5 \
  --dns-timeout=30 \
  --connect-timeout=5 \
  --read-timeout=5 \
  --timestamping \
  --directory-prefix=data/pages \
  --wait=2 \
  --random-wait \
  --recursive \
  --level=5 \
  --no-parent \
  --no-verbose \
  --reject *.jpg --reject *.gif \
  --reject *.png --reject *.css \
  --reject *.pdf --reject *.bz2 \
  --reject *.gz  --reject *.zip \
  --reject *.mov --reject *.fla \
  --reject *.xml \
  --no-check-certificate"

cat $URLS_FILE | xargs  -P $CRAWLERS -I _URL_ $WGET_CMD _URL_

I’ve put this code on github with a Rakefile so you can follow along.

git clone git://github.com/jashmenn/bashpider.git
cd bashpider
rake data:get_urls # downloads and parses DMOZ, will take a while 
rake crawl:restart # this will run a crawl

If you want to monitor the downloads per second, in another window type the following:

rake crawl:watch
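If you’d rather skip the Rakefile for monitoring, a rough stand-in (assuming watch is available) is to count downloaded files once a second; the pages-per-second rate is the difference between successive counts:

watch -n 1 'find data/pages -type f | wc -l'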

When you feel you’ve gathered enough data, CTRL-C to kill both windows and then type:

rake results:process 

Results at Home

As you can see from the chart, on my home computer through U-verse we max out at about 150 agents and 27 pages/sec, far below the 200+ pages/sec we were hoping for.

[Chart: wget pages per second vs. number of concurrent processes, home connection]

procs pages/sec
10     3.9
25     8.9
50    15.8
75    19.7
100   25.8
125   26.1
150   27.3
175   22.3
200    6.0

First of all, our initial estimate of 10KB per page was too low: in reality we observed a mean page size of around 37KB. This means that on our 2300KB/s connection we can only expect a best-case download rate of about 62 pages/sec. Still, 27 pages/sec is not even half that.
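Redoing the ceiling with the observed page size:

echo $(( 2300 / 37 ))  # ~62 pages/sec best case at 37KB/page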

The other problem could be DNS requests. My home router also serves as my DNS server. It’s good enough for home use, but I’m pretty sure it’s not up to this task. Come to think of it, I’m not even sure how fast this CAT5 cable is.

I think it’s time to try out this setup in a better environment.

In the Data Center

For this experiment we loaded our script onto a beefy 8-core machine with a fat bandwidth connection.

The results were much better:

[Chart: wget pages per second vs. number of concurrent processes, data center]

procs  pages/sec
  150   54
  200   71
  300  107
  400  141
  500  178
  600  214
  700  244
  800  386
  900  327
 1000  203
 1100  222 
 1200  392
 1300  202
 1500  249
 1600  485
 2000  577
 3000  679
 3500  459
 4000  336 

Take these numbers as rough estimates; I only let each of these runs go for a few minutes.

Processes

When I started getting into the thousands of processes, I wondered if I would hit the user process limit. Jay Donnell pointed out to me that ulimit will also give the process limit:

 ulimit -a
 max user processes              (-u) 268287

So with 260k+ processes available, we have no problem there.

Files

With wget, every downloaded page gets its own file, uncompressed, so we’ve got a lot of disk I/O going on. We’d probably save a good amount of time if each process just opened one file and appended content to it. We’d also save the file system from creating hundreds of thousands of inodes.
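Here’s a minimal sketch of that append-to-one-file idea. It isn’t what the script above does, and it gives up wget’s recursion in favor of a flat fetch of known URLs; the worker-$$ filename is just for illustration:

# each worker process appends every page it fetches to a single file
OUT="data/pages/worker-$$.dat"
while read -r url; do
  printf '==== %s ====\n' "$url" >> "$OUT"
  wget -q -O - --tries=5 --connect-timeout=5 --read-timeout=5 "$url" >> "$OUT"
done < "$URLS_FILE"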

Also, the decline we see around 3000 agents may be due, in part, to the max number of open files on our system:

$ ulimit -a
open files                      (-n) 1024

Each crawler waits an average of 2 seconds before making the next request, at which point it makes the request and downloads the file. So each process creates at most one file every 2+ seconds, which suggests we should be able to run at least twice as many wget processes as the open-file limit (around 2048) before running out of descriptors.

My theory is this: at around 3000 concurrent agents, the time spent actually downloading content keeps enough file descriptors free most of the time. Once we get to 4000 concurrent agents, however, the probability that too many agents need to write a file at the same moment is much higher, and we see a significant performance drop.
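One easy follow-up experiment would be to raise the per-shell descriptor limit before launching the crawl (assuming the hard limit allows it) and see whether the drop-off moves:

ulimit -n 8192
./crawl.sh random-urls.txt 3000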

I think we’re going to need to look at the file system format. Currently we’re using ext3, but I’m not sure if we should switch to xfs. While monitoring the file count using find I kept getting the following error:

find: WARNING: Hard link count is wrong for <some file>: this may be a bug
    in your filesystem driver.  Automatically turning on find's -noleaf option.  
    Earlier results may have failed to include directories that should have been searched.

Also kjournald seemed to be working very hard to keep up with all the file writes. I’m not sure if this is unavoidable or not. I’m going to leave this problem for future work.

Summary

This crawler is just a baseline to see what performance is possible with basic Unix utilities. Obviously, this approach used a static list of URLs; in a “real” crawler you probably want a mechanism for communicating and prioritizing URLs throughout the system.

That said, if you already know the list of URLs you want to download, you could fetch tens of millions of pages in a 24-hour window. For instance, at a sustained rate of 600 pages per second you could download 51.8 million pages in 24 hours.

So how long would it take to download a billion pages with xargs and wget?

If you had the list of URLs beforehand, then according to these numbers it would take about 19 days.
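That figure is just the arithmetic from the numbers above:

echo $(( 1000000000 / 600 / 86400 ))  # ~19 days at a sustained 600 pages/sec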

To download a billion pages in a week we’re going to need to figure out a way to download at least 1,000 more pages per second.

What we’ve learned:

  • watch out for file limits
  • append to a single file rather than creating thousands of tiny files
  • run your own non-locking DNS server
  • unix tools are handy and powerful

What do you think?

Any suggestions for cranking out more performance out of wget? Should I try increasing my open file limit and see what happens? Think these numbers are ridiculous? Leave a comment below!

Comments

  • Joel

    For each site does wget maintain a history of all those pages it has visited so that it only visits each page once? Is this history written to disk?

  • http://badcheese.com Steve Webb

    So, you’re just trying to max out your internet connection? To me, the difficult part of writing a crawler is extracting urls from the content, then pushing those urls into a database, de-duping them, pulling out ‘interesting’ urls from those, running those through the robots.txt file to make sure that I’m ok to pull those urls, and then sending those urls back to the crawler. I understand that wget gives you most of this for free, but then post-processing the data is the real question. If you do end up maxing out your internet connection, you still need to do something with the data, which may take much longer than just grabbing it all off the internet.

    Also, your WGET_CMD has a wait=2, so shouldn’t the pages per second be about 1/3 of the number of processes?

    What kind of memory did you have on the machine? Was your beefy machine struggling at all? When you reached 4000 processes, did disk I/O or memory become a performance bottleneck?

    Would a DNS cache on the machine make a second run of the crawl go faster?

    Would you have better performance if you used a multi-threaded app instead of using full-blown processes?

  • http://fishyfishyfishyfish.com A guy

    Ok, doing more of my own calculations, 1653 pages / sec, assuming that the average page size is 500kB, you’d have to sustain about 6.8 Gbit / sec over your internet connection.

    I wrote my own internet crawler a while back – all in C, multi-threaded, and it doesn’t care about robots.txt – it’s made for speed-only. I ran a test using Amazon’s EC2 small instance, writing to a 100GB ramdisk, 500k urls, 200 threads on linux, 10 second timeout. I chose to run against a single domain to take DNS out of the equation, and I wanted a site that’s pretty beefy and could stand up to my hammering without being affected too much. I decided that I’d target stumbleupon.com for my horrible use of network bandwidth.

    The whole test took about 80 minutes before stumbleupon.com blocked my IP (so I think that I was probably using more bandwidth and CPU than they liked me to use – sorry stumbleupon.com).

    My findings were: CPU is not an issue. A single, ‘small’ instance (one core) was enough to satisfy my crawler without any problems. Bandwidth, disk I/O (writes) and html page size were the bottlenecks. I was able to sustain between 500MB (bytes) and 1GB (bytes) per second reads from the network (Amazon must have 10Gbit to their servers, which I almost maxed out for the whole test).

    I completed 308k urls in 85 minutes (which is only 60/sec on average).

    As the above poster mentioned, the post-processing is an important step in any crawl too. I wrote my own url grabber which pulls urls from html as fast as possible (in C – much faster than some of the perl or sed scripts that you’d find on the internet) and just for grins, I thought that I’d time how long it would take to extract the urls from all of the content, sort, uniq, and total. That maxxed-out my ‘small’ instance’s CPU and took 20-ish minutes to complete.

    So, take into account your pipe. Take into account network overhead (DNS, connection processing) and post-processing (which is the real ‘meat’ of crawling – doing something with all of that data).

    Other stats: 45GB of data pulled total, cost on AWS: about $0.50.

  • http://www.xcombinator.com Nate Murray

    @Steve You’re right in that a web-information-extraction system needs to parse the html and extract information to be useful. This is a hard problem and something completely ignored in the above post.

    I hope it’s clear that I’m not actually proposing someone write a whole crawling infrastructure using nothing but wget. You should almost certainly see better performance if you used either a multi-threaded or evented app instead of firing up and shutting down wget processes. In fact, writing evented services (using something like epoll) is the subject of a series of upcoming articles. I’ve been playing with Netty recently and I am impressed by it.

    As far as DNS goes, yes, we definitely need to cache the DNS requests. In our experiment above each process was assigned one domain, and wget caches the DNS result so it’s only looked up once per process. I want to set up djbdns and use that, as it gets very good performance.

    I’m not 100% sure whether disk IO or memory was the problem. Actually I think I hit the limit on file descriptors. I haven’t had the time to up it and re-run the tests. This was simply to establish a baseline. Right now I’m working on a “real” crawler and 600 pages per second is a nice target to aim at.

  • http://www.xcombinator.com Nate Murray

    @Aguy My results showed an average page size of way less than 500 kilobytes. I don’t remember offhand, but it was something between 50 and 100 kilobytes.

    Thanks for sharing your results. Is your crawler open-source?

  • http://fishyfishyfishyfish.com A guy

    No, my crawler is compiled using gcc, so I guess it should be open-sourced, but I don’t want to open it up yet mainly because I feel that it’s not ready for prime-time and I don’t feel like getting publicly ridiculed for my design quite yet. :) I’ve been working on it as a side-task for years. I’ve written 2 crawlers in perl and one in C. The two perl crawlers are focused on a single domain and honor robots.txt, are pretty nice on the site that they’re crawling and keeps a database to track urls. The C crawler just hammers any list of urls as quickly as possible – mainly to max out a pipe. For the C crawler to be usable in a regular crawl, there needs to be some other logic upstream to create the url list in such a way that it doesn’t contain too many urls from the same domain.

    I’ve also used the archive.org crawler to do certain tasks in the past. They have a nice crawler, but getting the data out of their storage format is not trivial, and configuration is not the easiest thing in the world. httrack is ok, but died on large crawls. Nutch I’ve not had good results with. wget works well, but for large crawls, it can bloat in memory because it keeps url history in ram, not in a database where it should be.

    I actually found in my research that writing the “converting a relative url to an absolute url” code can be pretty tricky. I worked around the whole url de-duping by using a key/value database to track urls – with the key being the url and the value being the last crawled time stamp, return code and other meta-data, so when I get a bunch of urls, I run them by the database to make sure that I haven’t crawled that url already. If it makes it past the database check, I run it by robots.txt, sort, de-dupe the whole lot and send them back to the crawler. Then, I post-process the data to extract info from the html. My crawler doesn’t honor “last-modified” or “etag” headers yet.

    I’ve crawled Digg.com, StumbleUpon.com, Slashdot.org and a few other large-ish sites using the slower crawler and the crawls take months, but are fairly complete. Running a “catch-up” crawl can still take a while since these sites sometimes create pages faster than you can crawl them at a ‘nice’ rate.

    An interesting test that I’m considering writing is a discovery crawl to just find as many domains as I can. It’s a “wide” crawl. My first few sites would be sites that link to other sites, like del.icio.us, digg.com, stumbleupon.com, and then fan out from those sites. I would crawl every url that I came across (an ever-expanding database that would never narrow in scope), parse the html, then throw away the html after ripping urls and finding domains. Theoretically, it would take hundreds of years to crawl the whole internet even with a gigabit internet connection, but it would be an interesting exercise, I think.

    Bandwidth: I’m using my home internet connection to do most of my crawls. Comcast has already threatened to turn off my internet connection because I went over their 250GB cap last summer, so I had to get a separate connection just for crawling that doesn’t have a bandwidth cap. Using EC2 is not a bad solution for cheap hosting now – they’ve reduced their costs down to $0.02 per hour, which works out to around $14/mo not including bandwidth. The bandwidth costs for EC2 are $0.10 per 1M “GET”s (urls).

    I’m curious why you’re trying to get a billion urls per week … What’s the end goal of your project?

  • http://fishyfishyfishyfish.com A guy

    FYI: Amazon’s EC2 has a deal where they are not charging anything for inbound bandwidth until Oct 31st. Also, their ‘micro’ instances can be as little as $0.01 per hour, so you could really take advantage of their network and infrastructure for the next 30+ days and crawl like crazy using their machines if you wanted to. My crawl yesterday didn’t cost anything for inbound bandwidth and I grabbed 45GB of data at a fairly fast rate. You could fire up ten ‘micro’ instances and crawl like crazy for a very small amount of money.

  • Jake

    Great Work Nate

    I ran your script but xargs & wget stop working after about an hour

    I am running your script with Cygwin on Windows XP with Dual Core and 4GB Ram

    I run the script with ./crawl.sh urlfile.txt 50
    I am trying to trawl through 4GB of URL

    The Bash shell seems to stop after about an hour and the downloads stop even though its only 2% through the URL list

    Any ideas at what could be wrong?

    Many Thanks

  • Benjamin

    Please, what was the precise bandwidth you used, and what was the full configuration of the computer in the data center? Regards.