Challenges in Large-Scale Web Crawling

Simple web crawling is easy, but once you start crawling several hundred million pages you run into a number of difficult challenges.

Last Friday I gave a talk at Berkeley on how to overcome some of the challenges of large-scale web crawling. Below are the slides from that talk.

introduction to
WEB CRAWLING
& extraction by Nate Murray
ask for a show of hands: how many have written any code to download data automatically? downloaded megabytes / gigabytes / terabytes?
WHO AM I ?
Nate Murray
AT&T Interactive (Yellowpages.com)
TB-scale data since 2009
Various crawlers since 2005
what is
WEB CRAWLING?
definition:
web crawler
a program that browses the web.
crawler downloads pages and puts them in storage somewhere
definition:
web extraction
transforming unstructured web data into structured data
but the reality is, if your data is really unstructured then you’re quickly talking about NLP.
definition:
web extraction
transforming semistructured web data into structured data
motivation
- find internet friends (delicious bookmarks)
- extract hours of operation
- make a video player service (co-occurring commenters)
- power tools search engine (affiliate fees)
motivation: bookmark buddies
[example table columns: URL, Title, Users]
motivation: business hours
Day    Openness
Mon    Closed
Tue    11:30-14:30, 17:30-22:00
Wed    11:30-14:30, 17:30-22:00
Thur   11:30-14:30, 17:30-22:00
Fri    11:30-14:30, 17:30-22:00
Sat    12:00-14:30, 17:00-22:00
Sun    -, 17:00-21:00
motivation: recommend videos
[example table column: Users]
motivation: vertical search
[example table columns: Image, SKU, Name, Price, Rating]
DESIRED PROPERTIES
SPEED
speed: really you just want to download as many pages as you can, as fast as possible
CONSTRAINTS
• Politeness: it’s easy to burden small servers (for any significant crawl)
• Distributed
• Linear Scalability: n machines = n*m pages-per-second
• Even partitioning: every machine should perform equal work
• Minimum overlap: crawl each page exactly once
BASIC ALGORITHM
Initialize:
  UrlsDone = {}
  UrlFrontier = {'google.com/index.html', ...}
Repeat:
  url = UrlFrontier.getNext()
  ip = DNSlookup(url.getHostname())
  html = DownloadPage(ip, url.getPath())
  UrlsDone.insert(url)
  newUrls = parseForLinks(html)
  For each newUrl:
    If not UrlsDone.contains(newUrl)
      then UrlFrontier.insert(newUrl)
crawler downloads pages and puts them in storage somewhere
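To make that pseudocode concrete, here is a minimal single-threaded sketch of the same loop in Python, using only the standard library. It is an illustration, not the talk's crawler: helper names like LinkParser and crawl are mine, and it ignores politeness, robots.txt, error handling, and every scaling concern the rest of the talk is about.

# Minimal, single-threaded version of the basic algorithm above.
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects <a href> links, resolved against the page's own URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links = base_url, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links and drop #fragments.
                    self.links.append(urldefrag(urljoin(self.base_url, value))[0])

def crawl(seed_urls, max_pages=100):
    urls_done = set()                        # UrlsDone
    url_frontier = list(seed_urls)           # UrlFrontier
    while url_frontier and len(urls_done) < max_pages:
        url = url_frontier.pop(0)            # UrlFrontier.getNext()
        if url in urls_done:
            continue
        html = urlopen(url).read().decode("utf-8", errors="replace")
        urls_done.add(url)                   # UrlsDone.insert(url)
        parser = LinkParser(url)
        parser.feed(html)                    # parseForLinks(html)
        for new_url in parser.links:
            if new_url.startswith("http") and new_url not in urls_done:
                url_frontier.append(new_url) # UrlFrontier.insert(newUrl)
    return urls_done

# crawl(["http://www.google.com/index.html"], max_pages=10)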
architecture overview
[diagram: the CRAWL PLANNER feeds URLs into the URL QUEUE; the FETCHER pulls URLs from the queue, downloads Web Data from the INTERNET, and writes it to STORAGE]
CHALLENGES
challenges:
depends on your ambitions
as the old saying goes ‘everything is fast for small n’
challenges:
Google’s Index Size:
1998: 26 million
2005: 8 billion
2008: 1 trillion
http://www.nytimes.com/2005/08/15/technology/15search.html
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

challenges:
< 10MM
small crawls are easy
and not that interesting. i hate to draw too hard of a line, but i also want to give you an intuition about the numbers we’re talking about here. i’d say a small crawl is roughly less than 10MM pages
challenges:
large crawls are interesting
challenges:
DNS Lookup
URLs Crawled
Politeness
URL Frontier
Queueing URLs
Extracting URLs
challenges:
DNS LOOKUP
Initialize:
  UrlsDone = {}
  UrlFrontier = {'google.com/index.html', ...}
Repeat:
  url = UrlFrontier.getNext()
  ip = DNSlookup(url.getHostname())      <- the step in question
  html = DownloadPage(ip, url.getPath())
  UrlsDone.insert(url)
  newUrls = parseForLinks(html)
  For each newUrl:
    If not UrlsDone.contains(newUrl)
      then UrlFrontier.insert(newUrl)
challenges:
DNS LOOKUP
can easily be a bottleneck
challenges:
DNS LOOKUP

consider running your own DNS servers
• djbdns
• PowerDNS
• etc.
challenges:
DNS LOOKUP

be aware of software limitations:
• gethostbyaddr is synchronized
• same with many “default” DNS clients
challenges:
DNS LOOKUP
You’ll know when you need it
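For a sense of what those software limitations mean in practice, here is a small Python sketch (my illustration, not from the talk) that works around a blocking resolver by caching lookups and overlapping them on a thread pool. The real fix at scale, as the slides say, is to run your own DNS servers.

# Overlap and cache blocking DNS lookups instead of resolving one at a time.
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=1_000_000)
def resolve(hostname):
    # socket.gethostbyname blocks; the cache keeps us from paying that
    # cost more than once per host.
    return socket.gethostbyname(hostname)

def resolve_all(hostnames, workers=64):
    # Many blocking lookups in flight at once; failed lookups will raise,
    # and a real fetcher would catch and retry them.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hostnames, pool.map(resolve, hostnames)))

# resolve_all(["google.com", "apache.org", "berkeley.edu"])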
challenges:
URLs CRAWLED
Initialize:
  UrlsDone = {}
  UrlFrontier = {'google.com/index.html', ...}
Repeat:
  url = UrlFrontier.getNext()
  ip = DNSlookup(url.getHostname())
  html = DownloadPage(ip, url.getPath())
  UrlsDone.insert(url)                   <- this structure
  newUrls = parseForLinks(html)
  For each newUrl:
    If not UrlsDone.contains(newUrl)     <- and this check
      then UrlFrontier.insert(newUrl)
challenges:
URLs CRAWLED
1 machine, store in memory
NAPKIN CALCULATION
~50 bytes per URL, e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations
+8 bytes for time-last-crawled as a long, e.g. System.currentTimeMillis() -> 1314392455712
x 100 million =~ 5.4 gigabytes
doable if you have a lot of memory, but turn that 100MM into 1 billion and you’re really in trouble.
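Checking that arithmetic (reading “gigabytes” here as GiB):

# Back-of-the-napkin check of the numbers above.
bytes_per_url = 50 + 8                        # ~50 bytes of URL text + an 8-byte long timestamp
urls = 100_000_000
print(round(bytes_per_url * urls / 2**30, 1)) # -> 5.4 (GiB)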
can we do better?
BLOOM FILTERS
BLOOM FILTERS
answers the question:
is this item in the set?
BLOOM FILTERS
Have we crawled: http://www.xcombinator.com?
answers either:
• yes, probably
• definitely not
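Here is a tiny Bloom filter sketch to make “yes, probably / definitely not” concrete. It is a toy implementation for illustration only; the class, sizes, and hash scheme are mine, and a real crawler would use a tuned library with much larger parameters.

# Toy Bloom filter: k positions per item, set on add, all-checked on lookup.
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions by hashing "<i>:<item>" with SHA-1.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means definitely not present; True means probably present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

crawled = BloomFilter(num_bits=10_000_000, num_hashes=7)
crawled.add("http://www.xcombinator.com")
print("http://www.xcombinator.com" in crawled)      # True  (yes, probably)
print("http://example.org/not-crawled" in crawled)  # False (definitely not)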
challenges:
URLs CRAWLED
1 machine, bloom filter
NAPKIN CALCULATION
• 100 million URLs
• 1 in 100 million chance of false positive
• =~ 457 megabytes (see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8)
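That 457 MB figure falls out of the standard Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2 bits for n items at false-positive rate p, which the calculator linked above implements:

import math
n, p = 100_000_000, 1e-8                 # URLs, target false-positive rate
bits = -n * math.log(p) / math.log(2) ** 2
print(round(bits / 8 / 2**20))           # -> 457 (MiB), matching the slide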
roughly 1/10th the memory of storing the URLs themselves, and an acceptable error rate
BLOOM FILTER
drawbacks (and solutions):
• probabilistic, occasional errors
• must estimate # of items ahead of time: not hard, see Dynamic BFs
• can’t delete: pick a granularity (days) and cascade the filters
BLOOM FILTERS
references:

http://en.wikipedia.org/wiki/Bloom_filter

http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/
challenges:
POLITENESS
obey robots.txt
rule of thumb: wait 2 seconds between requests (per IP)
this won’t always work, especially for large sites
centralized politeness: SPOF, contention (see zookeeper)
challenges:
POLITENESS

Options:
• central database
• distributed locks (paxos / sigma / zookeeper)
• controlled URL distribution

http://en.wikipedia.org/wiki/Paxos_(computer_science)

http://zookeeper.apache.org/
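On a single machine, the robots.txt and two-second pieces look roughly like the sketch below (standard library only; the per-IP bookkeeping and function names are my illustration and say nothing about the cross-machine coordination that the options above exist for).

# Single-machine politeness: obey robots.txt, wait ~2s between hits to an IP.
import time
import urllib.robotparser
from urllib.parse import urlparse

ROBOTS_CACHE = {}        # host -> RobotFileParser
LAST_FETCH_BY_IP = {}    # ip -> timestamp of last request

def allowed_by_robots(url, user_agent="*"):
    host = urlparse(url).netloc
    if host not in ROBOTS_CACHE:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        rp.read()        # network fetch; a real crawler would handle failures
        ROBOTS_CACHE[host] = rp
    return ROBOTS_CACHE[host].can_fetch(user_agent, url)

def wait_for_politeness(ip, min_delay=2.0):
    # Rule of thumb from the slides: wait 2 seconds per IP.
    elapsed = time.time() - LAST_FETCH_BY_IP.get(ip, 0.0)
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)
    LAST_FETCH_BY_IP[ip] = time.time()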

challenges:
URL FRONTIER
url frontier
idea:
consistently distribute URLs based on IP
modulo
IP                SHA-1        bucket (mod 5)
174.132.225.106   4dd14b0b…    2
74.125.224.115    cf4b7594…    1
157.166.255.19    0ac4d141…    4
69.22.138.129     6c1584fa…    4
98.139.50.166     327252c5…    3
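As code, the modulo scheme in that table is just a hash and a mod. The sketch below is my illustration; the exact buckets depend on how you reduce the SHA-1 digest to an integer, so it won’t necessarily reproduce the values shown above.

# Hash the IP, mod by the number of crawl machines, and that machine owns it.
import hashlib

def bucket_for_ip(ip, num_machines=5):
    digest = hashlib.sha1(ip.encode("ascii")).digest()
    return int.from_bytes(digest, "big") % num_machines

# bucket_for_ip("174.132.225.106")  # some value in 0..4, stable across machines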
benefits:
• same IP always goes to the same machine
• simple
drawbacks:
• susceptible to skew
• can’t add or remove nodes without pain: you rehash the whole keyspace, and every URL goes to a new machine
consistent hashing
[diagrams of a consistent hash ring; source: http://michaelnielsen.org/blog/consistent-hashing/]
benefits:
• only ~1/(n+1) of URLs move on add/remove
• virtual nodes help with skew
• robust (no SPOF)
drawbacks:
naive solution won’t work for large sites
further reading:
Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al.
Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007
Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al.
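For reference, a small consistent-hash ring with virtual nodes might look like the sketch below. Node names and the replica count are made up for illustration; real systems like those in the papers above add replication, failure handling, and careful hash choices.

# Consistent-hash ring with virtual nodes: each node appears at many points
# on the ring, and a key is owned by the next virtual node clockwise.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []            # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{node}:{i}"), node))

    def remove_node(self, node):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def node_for(self, key):
        # Walk clockwise from the key's position to the next virtual node.
        pos = self._hash(key)
        idx = bisect.bisect(self.ring, (pos, chr(0x10FFFF)))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
# ring.node_for("174.132.225.106") moves only when its segment's owner changes.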
challenges:
QUEUEING URLS
situation:
• URL not recently crawled
• allowed by robots.txt
• polite
you’ve got urls…
how do you order them?
(within a single machine)
threads
hash URLs into lanes (e.g. lanes 1, 2, 3), one queue per thread:
http://yachtmaintenanceco.com/
http://www.amsterdamports.nl/
http://www.4s-dawn.com/
http://www.embassysuiteslittlerock.com/
http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
http://mdgroover.iweb.bsu.edu
http://music.imbc.com/
http://www.robertjbradshaw.com
http://www.kerkattenhoven.be
http://www.escolania.org/
http://www.musiciansdfw.org/
http://www.ariana.org/
ERLANG
look up: Erlang B / C / Engset
Agner Erlang: telephone lines; used in call centers to calculate the number of agents you need relative to call volume
as many threads as possible
(the number-of-agents math doesn’t really translate directly to crawling)
don’t sort input URLs
sorted, every URL hits the same host, so it’s fetch, wait, fetch, wait:
http://abcnews.go.com/
http://abcnews.go.com/2020/ABCNEWSSpecial/
http://abcnews.go.com/2020/story?id=207269&page=1
http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
http://abcnews.go.com/International/News/story?id=203089&page=1
http://abcnews.go.com/International/Pope/
http://abcnews.go.com/International/story?id=81417&page=1
hashed across hosts, there’s no waiting:
http://yachtmaintenanceco.com/
http://www.amsterdamports.nl/
http://www.4s-dawn.com/
http://www.embassysuiteslittlerock.com/
http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
http://mdgroover.iweb.bsu.edu
http://music.imbc.com/
http://www.robertjbradshaw.com
http://www.kerkattenhoven.be
http://www.escolania.org/
http://www.musiciansdfw.org/
http://www.ariana.org/
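A sketch of the lane idea from above: hash each URL’s host so that any one host always lands in the same thread’s queue, which keeps a single slow or politeness-limited site from stalling every thread. The lane count and helper names here are mine.

# Hash hostnames into per-thread lanes; each lane is drained by one fetcher.
import hashlib
from collections import deque
from urllib.parse import urlparse

def lane_for(url, num_lanes=3):
    host = urlparse(url).netloc
    return int(hashlib.sha1(host.encode()).hexdigest(), 16) % num_lanes

def fill_lanes(urls, num_lanes=3):
    lanes = [deque() for _ in range(num_lanes)]
    for url in urls:
        lanes[lane_for(url, num_lanes)].append(url)
    return lanes   # each deque is consumed by its own fetcher thread

# lanes = fill_lanes(["http://abcnews.go.com/", "http://music.imbc.com/", ...])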
challenges:
EXTRACTING URLS
challenges:
EXTRACTING URLS
the internet is full of garbage
challenges:
EXTRACTING URLS
enormous pages, terrible markup, ridiculous URLs
☃.net/ “unicode snowman dot net”
challenges:
EXTRACTING URLS
be prepared:
• use a streaming XML parser
• use a library that handles bad markup
• be aware that URLs aren’t always ASCII
• use a URL normalizer
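Putting those points together, a sketch of a streaming link extractor with a small URL normalizer might look like this (standard library only; the normalize helper is my simplified illustration, and a real crawler would lean on hardened parsing and normalization libraries):

from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit, quote

def normalize(base_url, href):
    # Resolve relative links, IDNA-encode non-ASCII hostnames, percent-encode
    # non-ASCII paths, and drop #fragments. (Ports are ignored for brevity.)
    parts = urlsplit(urljoin(base_url, href.strip()))
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    path = quote(parts.path or "/", safe="/%:@")
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

class StreamingLinkExtractor(HTMLParser):
    # html.parser is event-driven, so enormous or badly-formed pages never
    # need to be held as a full DOM; feed() can be called chunk by chunk.
    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url, self.urls = base_url, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(normalize(self.base_url, value))

# extractor = StreamingLinkExtractor("http://xn--n3h.net/")  # snowman domain, IDNA-encoded
# extractor.feed(chunk_of_html)  # call repeatedly as data streams in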
SOFTWARE
software advice:
• goals determine scale
• someone else has already done it
- web crawling and scraping isn’t new
2 second crawler:
function wgetspider() {
  wget --html-extension --convert-links --mirror \
       --page-requisites --progress=bar --level=5 \
       --no-parent --no-verbose \
       --no-check-certificate "$@";
}
$ wgetspider http://www.ischool.berkeley.edu/
java crawlers:
• Heritrix (Internet Archive): http://crawler.archive.org/
• Nutch (Lucene): http://nutch.apache.org/
• Bixo (Hadoop / Cascading): http://bixo.101tec.com/
extraction packages:
• mechanize: http://wwwsearch.sourceforge.net/mechanize/
• BeautifulSoup & urllib2: http://www.crummy.com/software/BeautifulSoup/
• Scrapy: http://scrapy.org/
wrapper induction(ish)
• Ariel: http://ariel.rubyforge.org/index.html
• RoadRunner: http://www.dia.uniroma3.it/db/roadRunner/
• TemplateMaker: http://code.google.com/p/templatemaker/
• scrubyt: http://scrubyt.rubyforge.org/files/README.html
QUESTIONS?
FEEDBACK:
nate@xcombinator.com
www.xcombinator.com @xcombinator
