We’ve been working on a custom web crawler for a few months now. Recently we hit a problem: after a few minutes of running, the JVM would run out of PermGen space.
If you’re not familiar with PermGen space, it is a portion of memory reserved for the JVM itself. It is used for storing information about Classes and interned Strings.
When you intern a String, the JVM stores a single copy of that String in the PermGen space. This can save RAM because only one copy of the String exists in the system. It can also speed up equality checks between two interned Strings, because == only has to compare the references, not the characters.
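To make the reference-versus-characters distinction concrete, here is a minimal Java sketch. The class and method names (`InternDemo`, `sameInterned`) are made up for illustration; only `String.intern()` is a real JVM API.

```java
public class InternDemo {
    /** Returns true when both arguments resolve to the same interned copy. */
    static boolean sameInterned(String a, String b) {
        // intern() returns the canonical instance for this character sequence,
        // so a reference comparison is enough here.
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        // Build two equal strings at runtime so they are distinct objects.
        String a = new String("Disallow");
        String b = new String("Disallow");

        System.out.println(a == b);             // false: two separate objects
        System.out.println(a.equals(b));        // true: same characters
        System.out.println(sameInterned(a, b)); // true: one canonical copy
    }
}
```

The catch is that every distinct String you intern adds a new entry to the JVM's intern pool, which is what makes the technique risky for unbounded input.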
The problem is, the PermGen space is typically very small (64m is a common default). So if you have many classes or a lot of interned Strings, you can easily blow out the PermGen space.
This problem was showing up in our crawler and we traced it to how we were parsing robots.txt.
robots.txt is a convention that website owners can use to tell crawlers how to behave while they are on the site. All polite crawlers respect it. For example:

    Disallow: /no-crawl/
    Allow: /
    Sitemap: http://www.foo.com/sitemap.xml
In our crawler, we’ve written a custom robots.txt parsing library: clj-robots (github).
In clj-robots, there was one section of the code where we took the left-hand side of each robots.txt line (the directive name) and converted it to a keyword. This made for cleaner code than comparing Strings. Since there is only a fixed set of robots.txt directives, this should be safe, right?
It turns out it isn’t safe. First, you don’t know what people are actually going to put in their robots.txt. Second, we forgot that many sites don’t have a robots.txt, and instead of returning an empty 404 response they often return their custom 404 HTML page. We ended up parsing an HTML page as a robots.txt and interning everything that looked like a robots.txt directive.
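A rough Java sketch of why this goes wrong: if "directive" just means "whatever comes before the first colon on a line", an HTML page yields names nobody planned for. This is not the clj-robots code; `NaiveDirectives`, `leftHandSides`, and the sample 404 page are all hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class NaiveDirectives {
    /** Collects the text before the first ':' on each line, lower-cased. */
    static Set<String> leftHandSides(String body) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : body.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                names.add(line.substring(0, colon).trim().toLowerCase(Locale.ROOT));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // A made-up custom 404 page served in place of robots.txt.
        String html404 = "<html>\n"
            + "<a href=\"http://www.foo.com/\">Home</a>\n"
            + "<p style=\"color: red\">Error: page not found</p>\n"
            + "</html>";
        // Every line containing a colon contributes a bogus "directive" name.
        System.out.println(leftHandSides(html404));
    }
}
```

Across thousands of sites, each with its own 404 markup, the set of such names is effectively unbounded, and each one would have been interned.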
The result was a spectacular “java.lang.OutOfMemoryError: PermGen space” after just a few minutes. The general principle here is that you should never allow user-generated input to become an interned String.
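One way to follow that principle, sketched in Java, is to map the directive name onto a fixed set of known values and drop everything else, so no untrusted text is ever interned. This is an illustration under assumptions, not necessarily how clj-robots was fixed; the `SafeDirectives` class, the `Directive` enum, and the particular list of directives are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;

public class SafeDirectives {
    /** The closed set of directive names this sketch chooses to recognise. */
    enum Directive { USER_AGENT, ALLOW, DISALLOW, SITEMAP, CRAWL_DELAY }

    /** Returns the directive named on this line, or empty if it is unknown. */
    static Optional<Directive> parseName(String line) {
        int colon = line.indexOf(':');
        if (colon <= 0) {
            return Optional.empty();
        }
        String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        // Compare against fixed literals; the untrusted name is never interned.
        switch (name) {
            case "user-agent":  return Optional.of(Directive.USER_AGENT);
            case "allow":       return Optional.of(Directive.ALLOW);
            case "disallow":    return Optional.of(Directive.DISALLOW);
            case "sitemap":     return Optional.of(Directive.SITEMAP);
            case "crawl-delay": return Optional.of(Directive.CRAWL_DELAY);
            default:            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseName("Disallow: /no-crawl/"));
        System.out.println(parseName("<a href=\"http://www.foo.com/\">Home</a>"));
    }
}
```

The Clojure analogue would be to look the String up in a fixed map of known directives instead of calling keyword on it directly.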
- PermGen stores Classes and interned Strings
- Clojure’s keyword interns a String
- Don’t call keyword on user-generated input
- A profiling tool (e.g. JProfiler) can be your best friend in these situations