[jdom-interest] JDOM parser reuse memory problem

Rolf jdom at tuis.net
Sat Nov 19 20:12:49 PST 2011


Hi all.

I would like to run some ideas past the group. I see a number of 
problems with SAXBuilder as it currently stands. They are somewhat hard 
to describe all at once, but the bottom line is that I think its API 
should change in a small way. The change will affect people who use a 
custom SAXHandler, or who hard-code a SAX parser Driver classname in the 
SAXBuilder constructor. I believe the vast majority of people use the 
default constructor and do not subclass SAXHandler, so this change will 
affect only a small subset of JDOM users.

So, here are the problems I see, in addition to the bug related to 
long-living memory references.


Problem 1: SAXParser creation

JDOM uses 3 mechanisms to create a SAX parser:
1. if the user specifies a SAX 'Driver' classname, it uses that
2. else it falls back to JAXP
3. else it falls back to a 'default' SAX Driver (xerces)

I believe that the 'default' fall-back should be removed, because if 
JAXP fails there is nothing left to fall back to. At minimum, JAXP will 
find the parser embedded in the Java runtime, so the 'default' fall-back 
will never happen. Put another way, if JAXP fails, there is no reason to 
expect that the 'default' "org.apache.xerces.parsers.SAXParser" will 
work (because if you have org.apache.xerces.parsers.SAXParser then you 
also have a working JAXP parser).
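As a rough sketch of what 'JAXP or nothing' could look like (the class 
and method names here are made up for illustration, they are not JDOM 
API), the creation step reduces to a single JAXP call with no hard-coded 
driver behind it:

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of JDOM: obtains an XMLReader through JAXP only.
class JaxpOnlyReaderSource {

    // Creates a namespace-aware XMLReader from whatever parser JAXP locates.
    // Any failure propagates to the caller; there is deliberately no
    // fall-back to a named driver class.
    static XMLReader createReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = createReader();
        System.out.println("JAXP supplied: " + reader.getClass().getName());
    }
}
```

On a stock Java runtime this should report the parser bundled with the 
JRE, which is the reason the third mechanism looks unreachable.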

I also believe the user-specified 'driver' mechanism should be replaced 
with a straight XMLReaderFactory instance. This makes the JDOM user 
responsible for creating the factory. It also lets the user keep a 
single factory instance, instead of having JDOM create a new instance 
each time a new SAXBuilder is created. This gives the user an 
opportunity to improve performance that JDOM itself does not have. 
XMLReaderFactory is part of SAX 2.0 and has been in Java since at least 
Java 1.4. It is the 'correct' way to get an XMLReader instance. Also, 
new JDOM users will not be confused by a string value, whereas 
XMLReaderFactory is a real, standard, and well documented entity.

Further, there should be no fallback mechanism: if the user manually 
provides an XMLReaderFactory and it fails, then it should all fail. If 
the user uses JAXP (the default) and JAXP fails, then we fail. In the 
Java5+ world JDOM should not need to be 'molly-coddling' the JAXP 
process. Also, we should not be using such outdated mechanisms as direct 
SAX driver classes.

This change would 'neaten' up the API for creating SAXBuilders:
1. you either use the 'normal' JAXP process, or...
2. you use the standard non-JAXP mechanism XMLReaderFactory


Problem 2: Parser reuse.

XMLReader reuse is much more efficient than creating a new parser for 
each JDOM build. There have been a few attempts to improve parser reuse 
in JDOM, but it could be taken even further by only re-configuring the 
XMLReader when the SAXBuilder configuration changes. In typical use, 
where the configuration is unchanged between consecutive JDOM builds, 
there does not need to be any reconfiguration at all.
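The idea can be sketched with a simple 'dirty' flag. This is an 
illustration only, not the current SAXBuilder code: ReusingParser, its 
single example setting, and the configureCount counter are all 
assumptions made for the demo.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: one cached XMLReader, reconfigured only on change.
class ReusingParser {
    private static final String NS_PREFIXES =
            "http://xml.org/sax/features/namespace-prefixes";

    private XMLReader reader;          // created lazily, then reused
    private boolean namespacePrefixes; // stand-in for the builder's settings
    private boolean dirty = true;      // true until the reader matches the settings
    int configureCount;                // instrumentation for the demo only

    void setNamespacePrefixes(boolean value) {
        if (value != namespacePrefixes) {
            namespacePrefixes = value;
            dirty = true;              // only a real change forces reconfiguration
        }
    }

    void parse(String xml) throws IOException, SAXException, ParserConfigurationException {
        if (reader == null) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            reader = factory.newSAXParser().getXMLReader();
        }
        if (dirty) {
            reader.setFeature(NS_PREFIXES, namespacePrefixes);
            reader.setContentHandler(new DefaultHandler());
            configureCount++;
            dirty = false;
        }
        reader.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        ReusingParser parser = new ReusingParser();
        parser.parse("<a/>");
        parser.parse("<b/>");              // same settings: no reconfiguration
        parser.setNamespacePrefixes(true);
        parser.parse("<c/>");              // settings changed: configured again
        System.out.println("configured " + parser.configureCount + " times for 3 parses");
    }
}
```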


Problem 3: The long-linked memory

The fix for this is probably going to need a 'reset' method on the 
SAXHandler that de-references the Document that was last parsed. This in 
turn will require an API change on SAXHandler.
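A minimal sketch of such a 'reset', assuming a handler that simply 
collects element names as a stand-in for the Document being built 
(ResettableHandler and ResetDemo are hypothetical names, not the JDOM 
classes):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: 'result' plays the role of the parsed Document.
class ResettableHandler extends DefaultHandler {
    private List<String> result = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        result.add(qName);
    }

    List<String> getResult() {
        return result;
    }

    // Drops the handler's reference to the last result, so the result can
    // be garbage collected even while the handler itself stays alive.
    void reset() {
        result = new ArrayList<>();
    }
}

class ResetDemo {

    // Parses, hands the result to the caller, and always resets the handler.
    static List<String> build(ResettableHandler handler, String xml)
            throws IOException, SAXException, ParserConfigurationException {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getResult();
        } finally {
            handler.reset();
        }
    }

    public static void main(String[] args) throws Exception {
        ResettableHandler handler = new ResettableHandler();
        List<String> names = build(handler, "<root><child/></root>");
        System.out.println(names);
        System.out.println("handler still holds result: " + (handler.getResult() == names));
    }
}
```

After build() returns, only the caller holds the result, so a long-lived 
builder that keeps its handler no longer pins the last parsed document.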

Problem 4: SAXHandler sub-classing

SAXHandler subclassing allows for custom event handling, but, in order 
to use a custom SAXHandler you also have to subclass SAXBuilder and 
override the createContentHandler() method. This is a cumbersome (and 
not well documented) mechanism.



With these (at least) 4 issues in SAXBuilder, it makes sense to change 
the API slightly to accommodate the 'new' way of doing things. 
This will impact the way that subclassing is done, and will impact those 
who use a non-JAXP SAX parser.

If these changes (or others like them) need to happen (and I think they 
do), then it makes sense to do it right, and comprehensively.

I am going to play with the code a little to get an idea of what can be 
done, but I am looking for any ideas, suggestions, criticisms.

I have already made some changes affecting the JDOM2 API but I think 
this could be one of those changes that makes a real difference (for the 
better).

Rolf


On 18/11/2011 7:32 PM, Rolf wrote:
> I have updated the issue with some performance numbers for some different
> conditions.
>
> Have a look at: https://github.com/hunterhacker/jdom/issues/52
>
> It seems to indicate that fixing the 'back to raw JAXP for each loop'
> will only save a little time, but parser reuse saves a lot.
>
> Need to implement both options, I think, implement SAXFactory caching as
> well as better memory management on Parser reuse.
>
> Out of interest, I thought the default setting for parser reuse was
> 'false', but it is true. XMLReaders will be reused unless you explicitly
> setReuseParser(false);
>
> This in turn means that my comments about 'normal' process should be
> reversed, the normal case for this bug condition is that we keep a
> reference from the SAXBuilder to the Document for as long as the
> SAXBuilder is active, and not used to rebuild another document.
>
> Thanks
>
> Rolf


