Bandwidth is often one of the first bottlenecks you’ll hit when web crawling, so it’s in your best interest to crawl each page only once (ignoring recrawls). To know that you’ve already crawled a page, you need to keep an identifier for each page you’ve crawled.
The naive solution to this is to just use the URL as the key. But it’s easy to see that this will cause duplicate pages to be downloaded because:
- URLs for a given page aren’t consistent, even within a single site (e.g. a page is often linked both with and without a trailing slash).
- Many pages have links to anchor tags that are all on the same page (e.g. /page#intro and /page#contact both fetch /page).
Pop Quiz: What’s the “normal” form of each of these URLs?

http://www.foo.com:80/foo
http://www.foo.com/foo/../foo#bam
http://:@www.FOO.com/foo/../foo
We need a URL normalizer that will return a consistent URL for all URLs that point to a given page. Again, note that a single page has many URLs.
If the URL http://:@www.FOO.com/foo/../foo seems a bit contrived, let me tell you: it isn’t. As soon as you start crawling, you learn that the web is full of hideous markup, including non-intuitive (and nonsensical) URLs.
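To get a feel for what a normalizer has to do, here is a naive sketch built on `java.net.URI`. It covers only a handful of the rules (resolving dot segments, lowercasing the host, dropping the default port, stripping the fragment and userinfo); `naive-normalize` is a hypothetical helper for illustration, not part of any library.

```clojure
(require '[clojure.string :as str])

(defn naive-normalize [s]
  (let [u      (.normalize (java.net.URI. s)) ; resolves /foo/../foo -> /foo
        scheme (.getScheme u)
        host   (str/lower-case (.getHost u))  ; hosts are case-insensitive
        port   (.getPort u)                   ; -1 when no explicit port
        path   (let [p (.getPath u)] (if (str/blank? p) "/" p))]
    (str scheme "://" host
         ;; keep the port only when it isn't the scheme default
         (when-not (or (= port -1)
                       (and (= scheme "http") (= port 80)))
           (str ":" port))
         ;; the fragment and userinfo are simply never re-emitted
         path)))

(naive-normalize "http://www.FOO.com:80/foo/../foo#bam")
;; -> "http://www.foo.com/foo"
```

Even this much misses plenty of cases (percent-encoding, query strings, IDNs, non-http schemes), which is exactly where the hairy details live.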
URL normalization is one of those problems that seems simple but, in fact, the details get pretty hairy. So Jay Donnell and I have been working on a small URL normalizer that makes it easy. It’s still young but already passes a large number of tests, including most of the Pace URL Normalization Tests.
```clojure
(ns my.namespace
  (:use [url-normalizer.core]))

(canonicalize-url "http://www.example.com:80/foo#bar")
;; -> "http://www.example.com/foo"
```
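And the answer to the pop quiz above: all three URLs name the same resource. Assuming the same `canonicalize-url` entry point, checking them looks like this; per the RFC 3986 normalization rules, each should reduce to http://www.foo.com/foo (drop the default port and the empty userinfo, lowercase the host, resolve the dot segments, strip the fragment).

```clojure
(canonicalize-url "http://www.foo.com:80/foo")
(canonicalize-url "http://www.foo.com/foo/../foo#bam")
(canonicalize-url "http://:@www.FOO.com/foo/../foo")
```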
The inspiration for this library comes from Sam Ruby’s
Interested in URL Normalization? Want to write a large-scale web-crawler in Clojure? We’re hiring. Send me an email.