Category Archives: crawling

Challenges in Large-Scale Web Crawling

Simple web crawling is easy, but once you start crawling several hundred million pages, a number of difficult challenges arise. Last Friday, I gave a talk at Berkeley on how to overcome some of the challenges of large-scale web crawling. Below are the slides from that talk. [...]
Also posted in big-data | 1 Comment

Custom Hive UDFs in Clojure

Introduction: We process all of our web-crawl data in Hadoop. If I’m writing jobs that will only be run by my team, then Cascalog is my tool of choice. Unfortunately, not everyone is going to learn Cascalog (much less Cascading or Clojure). However, many people know a little SQL and the best tool for [...]
Also posted in big-data | Leave a comment

URL Normalization in Clojure

Bandwidth is often one of the first bottlenecks you’ll hit when web crawling, so it’s in your best interest to crawl each page only once (ignoring recrawls). To know that you’ve already crawled a page, you need to keep an identifier for each page you’ve crawled. The naive solution to this is [...]
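The idea of keeping one identifier per crawled page can be sketched as follows. This is a minimal illustration in Python, not the post's Clojure code, and the excerpt is cut off before the author's own approach, so nothing here should be read as that approach. The names `normalize_url` and `SeenUrls` are hypothetical, and the normalization rules (lowercase scheme and host, drop default ports and fragments, treat an empty path as `/`) are assumed simplifications.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource (assumed rule set).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` so equivalent URLs compare equal.

    Simplifications: userinfo is discarded, IPv6 hosts are not handled,
    and query parameters are left in their original order.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenUrls:
    """In-memory record of pages already crawled, keyed by normalized URL."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def mark_if_new(self, url: str) -> bool:
        """Record `url`; return True if it was not seen before."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = SeenUrls()
for candidate in ["http://example.com", "HTTP://EXAMPLE.com:80/#top"]:
    if seen.mark_if_new(candidate):
        print("crawl", candidate)
    else:
        print("skip", candidate)
```

An in-memory set like this only works while the identifiers fit in memory; at the scale of hundreds of millions of pages a more compact or distributed structure would likely be needed.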
Posted in crawling | 1 Comment