URL Normalization in Clojure

Bandwidth is often one of the first bottlenecks you’ll hit when web crawling, so it’s in your best interest to crawl each page only once (ignoring recrawls). To know that you’ve already crawled a page, you need to keep an identifier for each page you’ve crawled.

The naive solution is to use the URL itself as the key. But it’s easy to see that this will cause pages to be downloaded more than once, because:

  • URLs for a given page aren’t consistent, even within a single site (e.g. http://www.foo.com/ and http://www.foo.com/index.html).
  • Many pages link to anchors (fragments) that all live on the same page (e.g. http://www.foo.com/index.html and http://www.foo.com/index.html#locations).
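
To make the problem concrete, here’s a minimal sketch in plain Clojure (no library involved, and the URLs are just the examples above): a set of raw URL strings has no idea that two of them point at the same page.

 ;; Naive dedup: track crawled pages by their raw URL string.
 (def crawled #{"http://www.foo.com/"})

 ;; Same page, different string -- so it would be downloaded again.
 (contains? crawled "http://www.foo.com/index.html")
 -> false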

Pop Quiz: What’s the “normal” form of each of these URLs?

http://www.foo.com:80/foo

http://www.foo.com/foo/../foo#bam

http://:@www.FOO.com/foo/../foo

The answer, for all three: http://www.foo.com/foo.

We need a URL normalizer that will return a consistent URL for all URLs that point to a given page. Again, note that a single page has many URLs.

If the URL http://:@www.FOO.com/foo/../foo seems a bit contrived, let me tell you: it isn’t. As soon as you start crawling, you learn that the web is full of hideous markup, including non-intuitive (and nonsensical) URLs.
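
To give a feel for what normalization involves, here’s a hand-rolled sketch built on java.net.URI (naive-normalize is just an illustrative helper, not the library’s implementation). It resolves “.” and “..” path segments, lowercases the host, drops the default port, drops empty user info, and strips the fragment:

 ;; A toy normalizer covering only a handful of rules.
 (defn naive-normalize [url]
   (let [uri  (.normalize (java.net.URI. url))   ; /foo/../foo -> /foo
         port (.getPort uri)
         port (if (= port 80) -1 port)           ; -1 means no explicit port
         host (when-let [h (.getHost uri)] (.toLowerCase h))
         path (let [p (.getPath uri)] (if (empty? p) "/" p))]
     (str (java.net.URI. (.getScheme uri) nil host port path (.getQuery uri) nil))))

 (naive-normalize "http://:@www.FOO.com/foo/../foo")
 -> "http://www.foo.com/foo"

Even this toy version juggles half a dozen rules, and it still ignores percent-encoding, duplicate slashes, trailing slashes, and plenty more.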

URL normalization is one of those problems that seems simple but, in fact, the details get pretty hairy. So Jay Donnell and I have been working on a small URL normalizer that makes it easy. It’s still young but already passes a large number of tests, including most of the Pace URL Normalization Tests.

Usage

 (ns my.namespace
   (:use [url-normalizer.core]))

 (canonicalize-url "http://www.example.com:80/foo#bar")
 -> "http://www.example.com/foo"

The code is on GitHub and the jar is on Clojars.

The inspiration for this library comes from Sam Ruby’s urlnorm.py.

Interested in URL Normalization? Want to write a large-scale web-crawler in Clojure? We’re hiring. Send me an email.
