Clojure’s keyword can fill up your PermGen space

We’ve been working on a custom web-crawler for a few months now. Recently we were having a problem where after a few minutes the JVM would run out of PermGen space.

If you’re not familiar with PermGen space, it is a portion of memory reserved for the JVM itself. It stores class metadata and interned Strings.

When you intern a String, the JVM stores a single copy of that String in the PermGen space. This can save RAM because only one copy of the String will exist in the system. It can also speed up == comparisons between two interned Strings because you only have to compare the references, not the characters.
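A minimal sketch of the reference-comparison point, written with Java interop from Clojure. The two strings are built with the `String` constructor so that they start out as distinct objects; `identical?` is Clojure's reference comparison, the counterpart of Java's `==`. The names `a` and `b` are only for illustration.

```clojure
;; Two equal strings built at runtime are distinct objects.
(def a (String. "Disallow"))
(def b (String. "Disallow"))

(= a b)                               ; true: same characters
(identical? a b)                      ; false: two separate objects

;; After interning, both resolve to the single canonical copy.
(identical? (.intern a) (.intern b))  ; true: same reference
```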

The problem is that the PermGen space is typically very small (64 MB is a common default). So if you load many classes or intern a lot of Strings, you can easily exhaust it.

This problem was showing up in our crawler and we traced it to how we were parsing robots.txt.

robots.txt is a convention that website owners can use to instruct crawlers how to behave while on their site. All polite crawlers honor it. For example:

Disallow: /no-crawl/
Allow: /
Sitemap: http://www.foo.com/sitemap.xml

In our crawler, we’ve written a custom robots.txt parsing library: clj-robots (github).

In clj-robots, there was one section of the code where we took the left-hand side of each robots.txt line and converted it to a keyword. This made for cleaner code than comparing Strings. Since there are only a fixed number of robots.txt directives, this should be safe, right?

It turns out it isn’t safe. First, you don’t know what people are actually going to put in their robots.txt. Second, we forgot that many sites don’t have a robots.txt at all, and instead of returning an empty 404 they often return their custom 404 HTML page. We were parsing that HTML page as a robots.txt and interning everything that looked like a robots.txt directive.
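To make the failure mode concrete, here is a rough sketch of that kind of parsing. It is not the actual clj-robots code; `naive-directive` is a hypothetical helper that treats anything before the first colon as a directive name.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not taken from clj-robots: turn the text before the
;; first colon into a keyword, whatever that text happens to be.
(defn naive-directive [line]
  (let [[lhs rhs] (str/split line #":" 2)]
    (when rhs
      [(keyword (str/lower-case (str/trim lhs))) (str/trim rhs)])))

;; A real directive behaves as intended.
(naive-directive "Disallow: /no-crawl/")
;; => [:disallow "/no-crawl/"]

;; A line from a custom 404 page also contains a colon, so the text before
;; it is turned into a brand-new keyword as well.
(naive-directive "<a href=\"http://www.foo.com/\">Home</a>")
```

Every distinct left-hand side produces a distinct keyword, so a crawl across many sites with custom 404 pages keeps creating new ones.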

The result was a spectacular “java.lang.OutOfMemoryError: PermGen space” after just a few minutes. The general principle here is that you should never allow user-generated input to become an interned String.
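One way to follow that principle, sketched under the assumption that only a handful of directives matter to the crawler: look the left-hand side up as a plain String in a fixed map, so that the only keywords ever created are literals in your own code. The names `known-directives` and `safe-directive` are hypothetical, and the directive list is illustrative rather than complete.

```clojure
(require '[clojure.string :as str])

;; The directives we care about. These keywords come from literals in our
;; own source, never from crawled input. (Illustrative list, not exhaustive.)
(def known-directives
  {"user-agent"  :user-agent
   "allow"       :allow
   "disallow"    :disallow
   "sitemap"     :sitemap
   "crawl-delay" :crawl-delay})

;; Hypothetical helper: look the left-hand side up as a plain String.
;; Unknown input yields nil and never reaches `keyword`.
(defn safe-directive [line]
  (let [[lhs rhs] (str/split line #":" 2)]
    (when-let [k (and rhs (known-directives (str/lower-case (str/trim lhs))))]
      [k (str/trim rhs)])))

(safe-directive "Disallow: /no-crawl/")
;; => [:disallow "/no-crawl/"]

(safe-directive "<a href=\"http://www.foo.com/\">Home</a>")
;; => nil
```

The rest of the code can still dispatch on keywords, but lines from an HTML page are simply dropped.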

Lessons learned:

  • PermGen stores Classes and interned Strings
  • Clojure’s keyword interns a String
  • Don’t call keyword on user-generated input
  • A profiling tool (e.g. JProfiler) can be your best friend in these situations

  • http://hugoduncan.org/ Hugo Duncan

    I wonder if this is related to CLJ-746

  • http://clojure.com Alan Dipert

    As of this commit, this no longer happens.

    To verify, you can run the JVM and Clojure with:

    java -cp clojure.jar -XX:PermSize=16M -XX:MaxPermSize=16M clojure.main
    

    Then, run this snippet:

    (loop [n 0]
      (println n)
      (keyword (gensym))
      (recur (inc n)))
    

    For me, with Clojure 1.2.0, 35860 is the last number printed before the JVM bails with an OutOfMemoryError. With 1.3.0/master, the loop runs indefinitely.

    Rich’s fix was to use WeakReferences instead of SoftReferences for the values in Clojure’s keyword map. Effectively, keywords are now more aggressively garbage collected than they used to be.

    So, while you’re still probably right that it’s not good practice to call keyword on user input, it’s not as technically disastrous to do so as your post might lead folks to believe :)

    Alan