[jdom-interest] Re: Manipulating a very large XML file

Mon Mar 14 21:47:19 PST 2005

Jason Robbins wrote:

> As others pointed out, if you have 5GBs of data, you probably should
> not be keeping it in one XML file.  Did you mean 5MB?

Working at Mark Logic, I know of many customers with XML files in the
5 GB range and XML content sets exceeding 5 terabytes.  I think we'll
see larger XML files and XML content sets as people get familiar with
the advanced tools capable of handling them.  As a comparison, probably
no one has a 100 million row Excel table, but a 100 million row
relational database is common.

JDOM of course isn't an "advanced tool" capable of handling these large
document sizes.  In that world I see JDOM as the mechanism to interface
with such tools from Java.  For example, you can issue an XQuery via XQJ
and get the results -- pulled perhaps from the 5 terabyte XML
contentbase -- as a series of JDOM objects.

> Actually, this brings up a question for the group:
> Would people be interested in a memory-efficient DOM or JDOM 
> implementation?

It's not a high priority from my perspective.  I think the right
solution for managing large XML datasets isn't to write a
memory-efficient data model.  That's just a stopgap solution, akin to
making Excel support 256k rows instead of 64k.  The right solution is to
use (or write if needed) an XML contentbase that indexes the XML and
makes it queryable.  That gets you both larger data sizes and a more
effective way of interacting with the content.
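
To make the "index it, then query the index" idea concrete, here is a
toy sketch in plain Java.  It is not how eXist or Mark Logic work
internally; the class and method names (ElementIndexSketch,
indexElementCounts) are made up for illustration.  It uses the StAX
streaming parser to read the XML once without building a tree, and
records how often each element name occurs.  A real contentbase would
index far more (values, paths, positions) and keep the index on disk,
but the principle is the same: memory tracks the index, not the
document.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only: a one-pass element-name index.
class ElementIndexSketch {

    // Streams through the XML once and counts occurrences of each
    // element's local name.  No document tree is ever built, so memory
    // grows with the number of distinct names, not the document size.
    static Map<String, Integer> indexElementCounts(Reader xml)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        Map<String, Integer> index = new TreeMap<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    index.merge(reader.getLocalName(), 1, Integer::sum);
                }
            }
        } finally {
            reader.close();
        }
        return index;
    }

    public static void main(String[] args) throws XMLStreamException {
        String sample =
            "<book><chapter><para/><para/></chapter><chapter/></book>";
        Map<String, Integer> index =
            indexElementCounts(new StringReader(sample));
        // Queries now hit the small index instead of re-reading the XML.
        System.out.println(index); // prints {book=1, chapter=2, para=2}
    }
}
```

For a multi-gigabyte file you would pass a FileReader (or an
InputStream variant) instead of the StringReader; the loop itself does
not change.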

If you have open source energy to devote to the problem, look at eXist 
(http://exist.sourceforge.net/) and help give it some slick indexing 
capabilities.

Or if you want a commercial-grade solution, look at Mark Logic.  You can
get a 30-day trial that supports data sets up to 1 GB
(http://xqzone.marklogic.com).  The official product goes four orders of
magnitude larger than that.  It's really fun.

Here's a screencast I did with Jon Udell showing off XQuery to
manipulate some O'Reilly books in DocBook format:
   http://weblog.infoworld.com/udell/2005/02/15.html#a1177

It's the kind of stuff that wouldn't be nearly as effective using
something like JDOM or even XSLT when your data sets get large.  As I
said above, JDOM still has its place with large data sets as the way to
interact with the contentbase storing the XML.  But for interacting with
large quantities of XML, you need more.  Jon, for example, started using
Mark Logic to back his blog and query it along with the RSS feeds of others.

-jh-