[jdom-interest] Proposal: JDOM event based processing

Mon Nov 6 04:17:48 PST 2000

<background>
When processing XML documents in Java there are two primary methods of
parsing.

1. event based processing - such as implementing a SAX ContentHandler (in
SAX2)
2. tree based processing - such as processing a JDOM Document.

SAX based processing is very low level. Anyone who has parsed a non-trivial
document has probably found that using SAX is quite hard and can take a lot
of effort (maintaining stacks, state and whatnot) to process the document.
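
To illustrate the bookkeeping involved, here is a minimal sketch of a SAX
handler that pulls a title and author out of each book element in a small
document. It uses only the JDK's bundled SAX classes; the names
BookSaxHandler and SaxStateDemo are made up for this example and are not part
of SAX or JDOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "title by author" strings from <book> elements.
class BookSaxHandler extends DefaultHandler {
    final List<String> results = new ArrayList<>();
    private final StringBuilder text = new StringBuilder(); // text seen since the last start tag
    private boolean inBook;   // state flag: are we inside a <book>?
    private String title;
    private String author;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // forget text belonging to the previous element
        if (qName.equals("book")) {
            inBook = true;
            title = null;
            author = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate them.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!inBook) return;
        if (qName.equals("title")) {
            title = text.toString().trim();
        } else if (qName.equals("author")) {
            author = text.toString().trim();
        } else if (qName.equals("book")) {
            results.add(title + " by " + author);
            inBook = false;
        }
    }
}

// Hypothetical driver class for the sketch above.
class SaxStateDemo {
    static List<String> parse(String xml) throws Exception {
        BookSaxHandler handler = new BookSaxHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book id=\"1\"><title>XML Bible</title>"
                   + "<author>Elliotte Rusty Harold</author></book></books>";
        System.out.println(parse(xml));
    }
}
```

Even for two fields the handler needs a flag, a text buffer and per-book
variables, and that bookkeeping grows with every extra level of nesting.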

Tree based processing is much simpler and more straightforward. Its only
drawback is memory consumption: the whole document must be read into memory
before it can be processed.

I've often used event based processing in the past through necessity due to
memory consumption. It is quite common for XML documents to represent
database centric data or to move data around a network, and for these
documents to get quite big. Any XML document over 1-10 MB can often cause
memory problems in tree based parsers, and 100-1000 MB documents will
certainly fail on most JVMs these days.
</background>

What I'd like to be able to do is event based processing of an XML document
via JDOM, avoiding the need to use SAX directly.
Consider the following interface proposal:

    package org.jdom;

    // Proposed callback: receives one fully built element subtree at a time.
    public interface ElementHandler {

        public void handle( Element element ) throws JDOMException;
    }

The above handler interface represents a JDOM Element handler which is quite
similar in essence to a SAX ContentHandler. It allows a single element tree
to be handled at once, which is a higher level than SAX, where there are
separate callbacks for the start and end of each element, character blocks
and so on.

Let's consider a sample books.xml document which is very large:

<books>
    <book id="1">
        <title>XML Bible</title>
        <author>Elliotte Rusty Harold</author>
    </book>
        ...

    <book id="10000000">
        ...
    </book>
</books>

Now we might write some Java to take a book element subtree and process it
in some way.

    public class BookHandler implements ElementHandler {
        public void handle( Element book ) throws JDOMException {
            Element title = book.getChild( "title" );
            Element author = book.getChild( "author" );
            ...
        }
    }

Now we may wish to parse the example books.xml document and use the
BookHandler to process each individual book subtree (shall we call it a
'branch'?). Note that the BookHandler is quite reusable and could be used in
other XML documents with a slightly different overall tree structure,
provided they have the necessary book branch in them at some point.

So in this example we have a very large books.xml file, but only a single
book subtree needs to be constructed as a JDOM tree in memory at any one time.

I have found the example above to be an extremely common use case in parsing
XML documents. Using database terminology, the XML document is often
equivalent to a table of rows and you want to process a row at a time. Using
object terminology, a document is often a collection of component trees and
you want to process one component tree at a time.

However, we have an all or nothing choice right now: we either read the
whole document into memory via JDOM, or we use SAX directly without JDOM,
which is too low level.

What I'd like is for JDOM to provide a custom SAXBuilder that can read one
sub-tree or branch at a time without reading the whole tree into memory at
once. If this were available I would rarely need to use SAX again
;-)

Let us consider how we might use this 'event based parsing' in JDOM. What
should the implementation look like?

Let's assume we have a new kind of SAXBuilder, which for now I'll call
SAXProcessor. What would be really cool would be if we could use XPath
expressions to denote a nodeset of the XML document, where each node in
the nodeset is constructed as an Element (with children) and passed on to an
ElementHandler.

For example, here's some code to process the big books.xml file, a book at a
time:

    SAXProcessor processor = new SAXProcessor();
    processor.addProcessor( "//book", new BookHandler() );
    processor.process( "books.xml" );

The above code would work with a books.xml file of any size, provided that
each individual book tree is not itself huge.
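
As a rough illustration of how such a processor might work internally, here
is a sketch built only on the JDK's SAX and W3C DOM classes so that it runs
without JDOM. It is not the proposed API: BranchHandler and BranchProcessor
are hypothetical stand-ins for ElementHandler and SAXProcessor, branches are
org.w3c.dom.Element rather than org.jdom.Element, and matching is by plain
element name rather than by XPath expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the proposed ElementHandler (W3C DOM, not JDOM).
interface BranchHandler {
    void handle(Element element) throws Exception;
}

// Hypothetical stand-in for the proposed SAXProcessor: builds one branch at a time.
class BranchProcessor extends DefaultHandler {
    private final Map<String, BranchHandler> handlers = new HashMap<>();
    private final Document nodeFactory; // used only to create nodes
    private Node current;               // node being built, or null when outside a branch

    BranchProcessor() throws Exception {
        nodeFactory = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    // Assumption: match on element name only; the proposal suggests XPath instead.
    void addHandler(String elementName, BranchHandler handler) {
        handlers.put(elementName, handler);
    }

    void process(InputSource source) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(source, this);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Outside a branch, ignore everything except elements that have a handler.
        if (current == null && !handlers.containsKey(qName)) return;
        Element e = nodeFactory.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (current != null) current.appendChild(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(nodeFactory.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (current == null) return;
        Node parent = current.getParentNode();
        if (parent == null) {
            // The branch root has just closed: hand it off. The processor
            // keeps no reference to the branch after this call.
            try {
                handlers.get(qName).handle((Element) current);
            } catch (Exception ex) {
                throw new SAXException(ex);
            }
        }
        current = parent;
    }
}

// Hypothetical driver: collects the title of every book, one branch at a time.
class BranchDemo {
    static List<String> titles(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        BranchProcessor processor = new BranchProcessor();
        processor.addHandler("book", book ->
            out.add(book.getElementsByTagName("title").item(0).getTextContent()));
        processor.process(new InputSource(new StringReader(xml)));
        return out;
    }
}
```

In this sketch only the outermost matching element starts a branch; a
matching element nested inside one is simply built as part of its parent's
branch. A real implementation would also have to decide how to handle
namespaces and how to evaluate XPath expressions against a streaming parse.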

So should we introduce 'event based parsing' into JDOM? I hope so. It would
only involve a custom SAXBuilder implementation and 1 new interface,
ElementHandler.

I see JDOM's current requirement that whole documents be parsed into a full
tree as a temporary limitation only, and one that should be lifted ASAP. I
also like the idea of introducing event based processing to JDOM (where an
event is a branch or subtree) to promote reusability of handlers.

What are people's thoughts on this?

J.

James Strachan
=============
email: james at metastuff.com
web: http://www.metastuff.com
