[jdom-interest] JDOM and memory
jdom at tuis.net
Sat Jan 28 11:42:02 PST 2012
On 28/01/2012 1:37 PM, Michael Kay wrote:
>> Finally, I have in the past had some success with the concept of
>> 'reusing' String values. XML Parsers (like SAX, etc.) typically create
>> a new String instance for all the variables they pass. For example,
>> the Element names, prefixes, etc. are all new instances of String.
>> Thus, if you have hundreds of Elements called 'car' in your input XML,
>> you will get hundreds of different String Element names with the value
>> 'car'. I have built a class that does something similar to
>> String.intern() in order to rationalize the hundreds of
>> different-but-equals() values that are passed in by the parsers.
> Have you measured how your optimization compares with the effect of
> setting the http://xml.org/sax/features/string-interning property on the
> SAX parser?
> Are you doing the interning in a way that guarantees strings can be
> compared using "==", and if so, are you taking advantage of this when
> doing the comparisons? The big win comes with XPath searches such as
> //x. Does the interning introduce any synchronization? (This is the big
> disadvantage with Saxon's NamePool - it speeds up XPath searching
> substantially, but the contention in a highly concurrent workload can
> become quite significant.)
> Are you pooling the QName as a whole, or the local name, prefix and URI
> Michael Kay
In answer to your questions...
No, I have not compared against the string-interning feature; I was not
aware of it. But, reading the documentation, it says: "All element
names, prefixes, attribute names, Namespace URIs, and local names are
internalized using java.lang.String.intern."
This is *not* a good thing. On HotSpot JVMs up to Java 6, String.intern()
uses PermGen space to intern the value (as if the value were a String
constant in the code). PermGen space is typically limited to a hundred
or so megabytes. I have, in the past, run into significant problems
where you get an OutOfMemoryError when String.intern() is used
liberally... and changing -Xmx makes no difference, because -Xmx sizes
the regular heap and not PermGen... very confusing the first time you
run into that.
So, I have not compared against the SAX parser's string interning. And I
would not recommend that people use it unless they know what they are
doing, and what sort of data they have.
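
For anyone who does want to run that comparison, here is a rough sketch
of switching the feature on through the standard SAX API and counting
how many distinct String instances come back for the element names. The
class and method names (InterningProbe, elementNames, distinctInstances)
are made up for illustration, and whether the feature can be changed at
all depends on the parser, so a refusal is only reported, not treated as
fatal.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JDOM or SAX.
class InterningProbe {
    static final String FEATURE = "http://xml.org/sax/features/string-interning";

    /** Parses xml and returns the local names exactly as the parser passed them. */
    static List<String> elementNames(String xml, boolean requestInterning) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            if (requestInterning) {
                try {
                    reader.setFeature(FEATURE, true);
                } catch (SAXException e) {
                    // Not recognized or not supported by this parser: carry on without it.
                    System.err.println("string-interning not available: " + e);
                }
            }
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    names.add(localName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return names;
    }

    /** Counts String instances by identity (==), not by equals(). */
    static int distinctInstances(List<String> names) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String n : names) {
            seen.put(n, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> names = elementNames("<cars><car/><car/><car/></cars>", true);
        System.out.println(names.size() + " names, "
                + distinctInstances(names) + " distinct instances");
    }
}
```

The instance count is the interesting number: the closer it is to the
number of distinct names, the less there is left for a separate cache to
save.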
The mechanism I do use is based on previous experience with this sort of
problem, and it works by using a memory-efficient hash table to store
unique instances of String. Subsequent lookups into the hash table
return the previously stored String value, if any. Because the
hash table is not a global hash table, and because it is not linked
into any core Java structures, you cannot guarantee == based comparisons,
but, in many cases, String.equals() returns immediately because you
are in fact comparing identical instances and the first line of
String.equals() does the == comparison.
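
A minimal sketch of this kind of cache, assuming a plain
java.util.HashMap in place of the memory-efficient table described
above. The class name StringCache and its methods are made up for
illustration; this is not the JDOM code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-threaded cache: NOT safe for concurrent use.
class StringCache {
    // Maps each distinct value to the first instance seen with that value.
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the previously stored equal instance, or stores and returns value. */
    String reuse(String value) {
        if (value == null) {
            return null;
        }
        String cached = pool.get(value);
        if (cached != null) {
            return cached;
        }
        pool.put(value, value);
        return value;
    }

    /** Drops every cached instance so the cache holds no memory. */
    void clear() {
        pool.clear();
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        StringCache cache = new StringCache();
        String a = cache.reuse(new String("car"));
        String b = cache.reuse(new String("car"));
        System.out.println(a == b); // true: the second call returns the first instance

        // A different cache hands back a different instance, which is why
        // == cannot be relied on across caches; equals() still works.
        String other = new StringCache().reuse(new String("car"));
        System.out.println(a == other);      // false
        System.out.println(a.equals(other)); // true
    }
}
```

Within one cache the identical-instance case takes the == shortcut in
String.equals(); across caches the comparison falls back to the normal
character-by-character check.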
My method does not use any synchronization, and I expect each JDOM
builder to have its own cache, possibly for the duration of a single
parse only. It makes a difference on small-scale items only. I have in
the past built a thread-safe and 'global' type cache using similar
principles, and it is a good concept, but it would be overkill here.
With JDOM in particular you do not want large memory structures hanging
around... and limiting this cache to a single builder is about the right
sort of compromise. Further, because I have implemented it in a new
JDOMFactory implementation, it is easy for the JDOM user to manage how
long it lives for, and they can call SlimJDOMFactory.clearCache() to
remove any previously cached String instances. In other words, the JDOM
user can use it as much or as little as they want (but not concurrently).
In my testing the Jaxen-based XPath expressions are in fact faster with
the 'cached' string values ... about 1ms faster on a 30ms process... not
very significant (not significant enough to be purely attributable to
So, it is a single-threaded cache that reuses previously cached values.
It can be applied to a single, or consecutive processes, and the cache
itself is available outside the SlimJDOMFactory if people want to borrow
that code in their own way.
In my experience, the benefit of this sort of caching is most obvious in
a GC-monitored environment, where the GC times can be substantially
shortened... but that is not easily measured.
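
One rough way to put a number on it, sketched here on the assumption
that the JVM's standard GarbageCollectorMXBean counters are good enough
for a before/after comparison. The class name GcTimer and its methods
are made up for illustration, and the reported times are coarse, so this
is only meaningful on a workload large enough to trigger collections.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing GC time with and without a String cache.
class GcTimer {
    /** Sums the accumulated collection time, in ms, over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Runs work and returns how much GC time accumulated while it ran. */
    static long gcMillisDuring(Runnable work) {
        long before = totalGcMillis();
        work.run();
        return totalGcMillis() - before;
    }

    public static void main(String[] args) {
        long millis = gcMillisDuring(() -> {
            // Stand-in workload: many equal-but-distinct Strings, as a parser might create.
            List<String> names = new ArrayList<>();
            for (int i = 0; i < 2_000_000; i++) {
                names.add(new String("car"));
            }
        });
        System.out.println("GC time during workload: " + millis + " ms");
    }
}
```

Running the same parse once with and once without the cache, and
comparing the two figures, gives at least an indication of the effect.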