Tag Archives: clojure

Clojure’s keyword can fill up your PermGen space

We’ve been working on a custom web crawler for a few months now. Recently we hit a problem where the JVM would run out of PermGen space after a few minutes. If you’re not familiar with PermGen space, it is a portion of memory reserved for the JVM itself. It is used for storing information [...]
Posted in programming | 2 Comments

URL Normalization in Clojure

Bandwidth is often one of the first bottlenecks you’ll hit when web crawling, so it’s in your best interest to crawl each page only once (ignoring recrawls). To know that you’ve already crawled a page, you need to keep an identifier for each page you’ve crawled. The naive solution to this is [...]
Posted in crawling

Extract Text from an HTML Document in Clojure

There are many Java HTML parsers, and it can be tricky to figure out which one to use. If you need to quickly extract just the text of a document, I’d recommend the Jericho HTML Parser. Here’s a quick example of how to use it: ;; lein dependency: [net.htmlparser.jericho/jericho-html "3.1"] (ns foo.preprocess (:import [java.io [...]
Posted in programming | 5 Comments