[jdom-interest] JDOM parser reuse memory problem
jdom at tuis.net
Sat Nov 19 20:12:49 PST 2011
I would like to run some ideas past the group. I see a number of
problems with the SAXBuilder as it currently stands. They are somewhat
hard to describe in full, but the bottom line is that I think its API
should be changed in a smallish way that will affect people who use a
custom SAXHandler, or who hard-code a SAX parser 'Driver' classname
in the SAXBuilder constructor. I believe the vast majority of people use
the default constructor and do not subclass the SAXHandler, so this
change will affect only a small subset of JDOM users.
So, here are the problems I see, in addition to the bug related to
long-living memory references.
Problem 1: SAXParser creation
JDOM uses 3 mechanisms to create a SAX parser:
1. if the user specifies a SAX 'Driver' classname, it uses that
2. else it falls back to JAXP
3. else it falls back to a 'default' SAX Driver (Xerces)
I believe that the 'default' fallback should be removed, because if JAXP
fails there is nothing left to fall back to. At minimum, JAXP will find
the parser embedded in the Java runtime, so the 'default' fallback will
never happen. Put another way, if JAXP fails, there is no reason to expect
that the 'default' "org.apache.xerces.parsers.SAXParser" will work
(because if you have org.apache.xerces.parsers.SAXParser then you also
have a working JAXP parser).
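To illustrate the point, here is a minimal sketch of a JAXP-only lookup with no fallback. The class and method names are made up for the example, and it assumes a standard Java runtime with no extra parser jars on the classpath:

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

public class JaxpReaderDemo {
    /**
     * Obtains an XMLReader through JAXP only. There is deliberately no
     * fallback: if JAXP cannot supply a parser, the exception propagates.
     */
    public static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // Prints the implementation class JAXP selected on this runtime.
        System.out.println(newReader().getClass().getName());
    }
}
```

Running this on a bare JRE should show which parser JAXP picks up by itself.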
I also believe the user-specified 'driver' mechanism should be replaced
with a straight XMLReaderFactory instance. This makes the JDOM user
responsible for creating the factory. It also lets the user keep just a
single Factory instance instead of having JDOM create a new instance
each time a new SAXBuilder is created. This gives the user an
opportunity to improve performance that JDOM itself does not have.
XMLReaderFactory is part of SAX2.0 and has been in Java since at least
Java 1.4. It is the 'correct' way to get an XMLReader instance. Also,
a driver classname string can confuse new JDOM users, whereas
XMLReaderFactory is a real, standard, and well documented entity.
Further, there should be no fallback mechanism: if the user manually
provides an XMLReaderFactory and it fails then it should all fail. If the
user uses JAXP (the default), and JAXP fails then we fail. In the Java5+
world JDOM should not need to be 'molly-coddling' the JAXP process.
Also, we should not be using such outdated mechanisms as direct SAX
This change would 'neaten' up the API for creating SAXBuilders:
1. you either use the 'normal' JAXP process, or...
2. you use the standard non-JAXP mechanism XMLReaderFactory
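The two options could be sketched roughly as follows. This is only an illustration: the ReaderSource interface and the class name are hypothetical and are not part of JDOM or SAX. The only real API calls are SAXParserFactory and the static XMLReaderFactory.createXMLReader() helper, which is marked deprecated in recent JDKs but is still present:

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class ReaderSourceDemo {
    /** Hypothetical abstraction: the builder asks this for its XMLReader. */
    public interface ReaderSource {
        XMLReader createXMLReader() throws Exception;
    }

    /** Option 1: the 'normal' JAXP process. The factory is created once and kept. */
    public static ReaderSource jaxp() {
        final SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Assumes single-threaded use of the shared factory.
        return () -> factory.newSAXParser().getXMLReader();
    }

    /** Option 2: the non-JAXP SAX2 helper. No fallback if it fails. */
    @SuppressWarnings("deprecation")
    public static ReaderSource sax2() {
        return () -> XMLReaderFactory.createXMLReader();
    }

    public static void main(String[] args) throws Exception {
        ReaderSource shared = jaxp();            // one factory, owned by the user
        XMLReader first = shared.createXMLReader();
        XMLReader second = shared.createXMLReader();
        System.out.println(first.getClass().getName());
        System.out.println(second.getClass().getName());
        System.out.println(sax2().createXMLReader().getClass().getName());
    }
}
```

In this sketch the user owns the factory and can share it between as many builders as they like, which is the performance opportunity mentioned above.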
Problem 2: Parser reuse.
XMLReader reuse is much more efficient than creating a new parser for
each JDOM build. There have been a few attempts to improve the parser
reuse in JDOM, but it could be taken even further by only re-configuring
the XMLReader when the SAXBuilder configuration changes. In a typical
use where the configuration is unchanged between consecutive JDOM builds
then there does not need to be any reconfiguration at all.
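A rough sketch of the 'only reconfigure when something changed' idea, using plain SAX rather than the real SAXBuilder. The class, the dirty flag and the counter are all hypothetical names for the example; the validation feature URI is the standard SAX one:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ReusableParser {
    private static final String VALIDATION = "http://xml.org/sax/features/validation";

    private final XMLReader reader;
    private final DefaultHandler handler = new DefaultHandler();
    private boolean validating = false;
    private boolean dirty = true;     // true when the reader must be (re)configured
    private int configureCount = 0;   // only here to make the behaviour visible

    public ReusableParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        reader = factory.newSAXParser().getXMLReader();
    }

    /** Changing a setting only marks the reader as needing reconfiguration. */
    public void setValidating(boolean value) {
        if (value != validating) {
            validating = value;
            dirty = true;
        }
    }

    /** Reuses the same XMLReader; configuration is applied only when dirty. */
    public void parse(String xml) throws Exception {
        if (dirty) {
            reader.setFeature(VALIDATION, validating);
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            configureCount++;
            dirty = false;
        }
        reader.parse(new InputSource(new StringReader(xml)));
    }

    public int getConfigureCount() {
        return configureCount;
    }
}
```

With this shape, consecutive builds with an unchanged configuration skip the setup work entirely.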
Problem 3: The long-linked memory
The fix for this is probably going to need a 'reset' method on the
SAXHandler that de-references the Document that was last parsed. This in
turn will require an API change on SAXHandler.
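Something along these lines, perhaps. This is a sketch with plain SAX classes only: CollectingHandler stands in for SAXHandler, a list of element names stands in for the Document, and the reset() method is the hypothetical addition:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ResettableHandlerDemo {
    /** Stand-in for SAXHandler: collects element names instead of a Document. */
    public static class CollectingHandler extends DefaultHandler {
        private List<String> result = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            result.add(qName);
        }

        public List<String> getResult() {
            return result;
        }

        /** Hypothetical reset(): drops the reference to the last parse result. */
        public void reset() {
            result = new ArrayList<>();
        }
    }

    public static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    /** Parses with a long-lived reader and handler, then resets the handler. */
    public static List<String> build(XMLReader reader, CollectingHandler handler, String xml)
            throws Exception {
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.getResult();
        } finally {
            // After this, only the caller holds the result, so it can be
            // garbage collected even while reader and handler stay alive.
            handler.reset();
        }
    }
}
```

The point is that the reused handler no longer pins the previous result in memory between builds.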
Problem 4: SAXHandler sub-classing
SAXHandler subclassing allows for custom event handling, but, in order
to use a custom SAXHandler you also have to subclass SAXBuilder and
override the createContentHandler() method. This is a cumbersome (and
not well documented) mechanism.
With these (at least) 4 issues with SAXBuilder, it makes sense to
change the API slightly to accommodate the 'new' way of doing things.
This will impact the way that subclassing is done, and will impact those
who use a non-JAXP SAX parser.
If these changes (or others like them) need to happen (and I think they
do), then it makes sense to do it right, and comprehensively.
I am going to play with the code a little to get an idea of what can be
done, but I am looking for any ideas, suggestions, criticisms.
I have already made some changes affecting the JDOM2 API but I think
this could be one of those changes that makes a real difference (for the
On 18/11/2011 7:32 PM, Rolf wrote:
> I have updated the issue with some performance numbers for some different
> Have a look at: https://github.com/hunterhacker/jdom/issues/52
> It seems to indicate that fixing the 'back to raw JAXP for each loop'
> will only save a little time, but parser reuse saves a lot.
> Need to implement both options, I think, implement SAXFactory caching as
> well as better memory management on Parser reuse.
> Out of interest, I thought the default setting for parser reuse was
> 'false', but it is true. XMLReaders will be reused unless you explicitly
> This in turn means that my comments about 'normal' process should be
> reversed, the normal case for this bug condition is that we keep a
> reference from the SAXBuilder to the Document for as long as the
> SAXBuilder is active, and not used to rebuild another document.