Extract Text from an HTML Document in Clojure

There are many Java HTML parsers, and it can be tricky to figure out which one to use. If you need to quickly extract just the text of a document, I’d recommend the Jericho HTML Parser.

Here’s a quick example of how to use it:

;; lein dependency: [net.htmlparser.jericho/jericho-html "3.1"]
(ns foo.preprocess
  (:import 
   [java.io File BufferedInputStream FileInputStream]
   [net.htmlparser.jericho Source TextExtractor]))
 
(defn extract-text
  "Given a File, returns a String of the extracted text."
  [f]
  ;; with-open closes the file stream once extraction is done
  (with-open [in (BufferedInputStream. (FileInputStream. f))]
    (.toString (TextExtractor. (Source. in)))))
 
(def filename "data/some-index.html")
(extract-text (java.io.File. filename))

TextExtractor has sensible defaults: it ignores CSS and JavaScript content without any extra configuration. See the TextExtractor class documentation for more details.
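If the defaults don't fit, the extractor can be tuned before calling toString. The following is only a rough sketch: the namespace and function name are made up for illustration, and it assumes the setIncludeAttributes and setConvertNonBreakingSpaces setters described in the Jericho TextExtractor javadoc, so check them against the version you depend on.

```clojure
;; lein dependency: [net.htmlparser.jericho/jericho-html "3.1"]
(ns foo.preprocess-options   ; hypothetical namespace, for illustration only
  (:import
   [net.htmlparser.jericho Source TextExtractor]))

(defn extract-text-with-options
  "Given an HTML String, returns the extracted text, including the values
   of descriptive attributes such as title and alt."
  [^String html]
  (let [extractor (doto (TextExtractor. (Source. html))
                    ;; assumption: also emit values of attributes like title/alt
                    (.setIncludeAttributes true)
                    ;; assumption: turn non-breaking spaces into plain spaces
                    (.setConvertNonBreakingSpaces true))]
    (.toString extractor)))

(extract-text-with-options
 "<p>Hello <img src=\"a.png\" alt=\"a picture\"> world</p>")
```

Using doto keeps the configuration calls next to the constructor, so the extractor is fully set up before any text is produced.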

  • Jürgen Hötzel

    Nice Library. clojure.java.io offers IO utility functions:


    ;; needs (:require [clojure.java.io :refer [input-stream]]) in the ns form
    (defn extract-text
      "given input f returns a String of the extracted text"
      [f]
      (str (->> (input-stream f) Source. TextExtractor.)))

  • http://www.xcombinator.com Nate Murray

    Nice, Jürgen, that's clean and much nicer.

  • Vincent Murphy

    Broken link: Java HTML parsers
    http://java-source.net/open-source/html-parsers

  • http://www.xcombinator.com Nate Murray

    Link fixed. Thanks, Vincent.

  • Chan

    Another alternative:

    
    (use 'jsoup.soup)
    
    ($ (get! "http://eigenjoy.com/2010/12/01/extract-text-from-a-html-document-in-clojure/") (.text))
    

    from clojure-soup