[jdom-interest] JDOM and memory

Sat Jan 28 11:42:02 PST 2012

On 28/01/2012 1:37 PM, Michael Kay wrote:
>
>>
>>
>> Finally, I have in the past had some success with the concept of
>> 'reusing' String values. XML Parsers (like SAX, etc.) typically create
>> a new String instance for all the variables they pass. For example,
>> the Element names, prefixes, etc. are all new instances of String.
>> Thus, if you have hundreds of Elements called 'car' in your input XML,
>> you will get hundreds of different String Element names with the value
>> 'car'. I have built a class that does something similar to
>> String.intern() in order to rationalize the hundreds of
>> different-but-equals() values that are passed in by the parsers.
> Have you measured how your optimization compares with the effect of
> setting the http://xml.org/sax/features/string-interning property on the
> SAX parser?
>
> Are you doing the interning in a way that guarantees strings can be
> compared using "==", and if so, are you taking advantage of this when
> doing the comparisons? The big win comes with XPath searches such as
> //x. Does the interning introduce any synchronization? (This is the big
> disadvantage with Saxon's NamePool - it speeds up XPath searching
> substantially, but the contention in a highly concurrent workload can
> become quite significant.)
>
> Are you pooling the QName as a whole, or the local name, prefix and URI
> separately?
>
> Michael Kay
> Saxonica

Hi Michael,

In answer to your questions...

No, I have not compared against the string-interning feature. I was not
aware of it. But, reading the documentation, it says: "All element
names, prefixes, attribute names, Namespace URIs, and local names are
internalized using java.lang.String.intern."

This is *not* a good thing. String.intern() uses PermGen space to intern
the value (as if the value were a String constant in the code). PermGen
space is typically limited to a hundred or so megabytes. I have, in the
past, run into significant problems where you get an OutOfMemoryError
when String.intern() is used liberally... and changing -Xmx makes no
difference, because -Xmx sizes the heap, not PermGen. Very confusing the
first time you run into that.

So, I have not compared against the string-interning of the SAX parser.
And I would not recommend that people use it unless they know what they
are doing, and what sort of data they have.
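
For anyone who wants to check what their own parser does, here is a
minimal sketch that asks the JAXP default SAX parser for the feature
Michael mentions. The class and method names are made up for this
example; only the feature URI comes from the thread. Parsers differ, so
the code treats an unrecognised or unsupported feature as "unknown"
rather than assuming an answer.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, for illustration only.
class InterningFeatureCheck {

    static final String STRING_INTERNING =
            "http://xml.org/sax/features/string-interning";

    /**
     * Asks the default SAX parser to intern names, then reports the state
     * it claims. Returns null if the parser does not recognise or support
     * the feature (both cases are subclasses of SAXException).
     */
    static Boolean requestStringInterning() {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(STRING_INTERNING, true);
            return Boolean.valueOf(reader.getFeature(STRING_INTERNING));
        } catch (ParserConfigurationException | SAXException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("string-interning: " + requestStringInterning());
    }
}
```

A null result only means this particular parser would not answer; it
says nothing about whether the parser interns names internally.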

The mechanism I do use is based on previous experience with this sort of
problem, and it works by using a memory-efficient hash table to store
unique instances of String. Subsequent lookups into the hash table
return the previously stored String value, if any. Because the hash
table is not a global hash table, and because it is not linked into any
core Java structures, you cannot guarantee == based comparisons. But in
many cases String.equals() returns immediately, because you are in fact
comparing identical instances and the first line of String.equals() does
the == comparison.
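
To make the idea concrete, here is a rough sketch of that kind of cache.
It is not the code in JDOM: the class name StringCache and its methods
are invented for this example, and it uses a plain java.util.HashMap
where the real thing uses a more memory-efficient table. It is
deliberately unsynchronized, so it is only safe from a single thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-threaded String cache, for illustration only.
final class StringCache {

    // Maps each distinct value to the first instance seen with that value.
    private final Map<String, String> pool = new HashMap<String, String>();

    /**
     * Returns the previously stored instance that equals(value), or stores
     * and returns value itself if it has not been seen. null passes through.
     */
    String reuse(String value) {
        if (value == null) {
            return null;
        }
        String cached = pool.get(value);
        if (cached == null) {
            pool.put(value, value);
            return value;
        }
        return cached;
    }

    /** Number of distinct values currently held. */
    int size() {
        return pool.size();
    }

    /** Drops every cached instance so it can be garbage collected. */
    void clear() {
        pool.clear();
    }

    public static void main(String[] args) {
        StringCache cache = new StringCache();
        // Simulate a parser handing over two separate 'car' instances.
        String first = cache.reuse(new String("car"));
        String second = cache.reuse(new String("car"));
        System.out.println(first == second); // same instance from the cache
    }
}
```

Values that pass through reuse() on the same cache are identical
instances, so equals() between them takes the == shortcut. A String that
never went through the cache still needs the full equals() comparison,
which is why == alone cannot be relied on.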

My method does not use any synchronization, and I expect each JDOM
builder to have its own cache, possibly for the duration of a single
parse only. It makes a difference on small-scale items only. I have in
the past built a thread-safe and 'global' type cache using similar
principles, and it is a good concept, but it would be overkill here.
With JDOM in particular you do not want large memory structures hanging
around... and limiting this cache to a single builder is about the right
sort of compromise. Further, because I have implemented it as a new
JDOMFactory implementation, it is easy for the JDOM user to manage how
long it lives, and they can call SlimJDOMFactory.clearCache() to remove
any previously cached String instances. In other words, the JDOM user
can use it as much or as little as they want (but not concurrently).

In my testing, the Jaxen-based XPath expressions are in fact faster with
the 'cached' String values... about 1ms faster on a 30ms process. That
is not very significant (and not large enough to be attributed purely to
the caching).

So, it is a single-threaded cache that reuses previously cached values.
It can be applied to a single process or to consecutive ones, and the
cache itself is available outside the SlimJDOMFactory if people want to
borrow that code for their own use.

In my experience, the benefit of this sort of caching is most obvious in
a GC-monitored environment, where the GC times can be substantially
shortened... but that is not easily measured.

Rolf