[jdom-interest] Re: Manipulating a very large XML file
jhunter at xquery.com
Mon Mar 14 21:47:19 PST 2005
Jason Robbins wrote:
> As others pointed out, if you have 5GBs of data, you probably should
> not be keeping it in one XML file. Did you mean 5MB?
Working at Mark Logic I know many customers have XML files in the 5 gig
size range and XML content sets exceeding 5 terabytes. I think we'll
see larger XML files and XML content sets as people get familiar with
the advanced tools capable of handling them. As a comparison, probably
no one has a 100 million row Excel table, but a 100 million row
relational database is common.
JDOM of course isn't an "advanced tool" capable of handling these large
document sizes. In that world I see JDOM as the mechanism to interface
with such tools from Java. For example, you can issue an XQuery via XQJ
and get the results -- pulled perhaps from the 5 terabyte XML contentbase
-- as a series of JDOM objects.
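To make that pattern concrete, here is a minimal sketch of "issue a query, get back only the matching pieces as tree objects." It is a stand-in, not the setup described above: it assumes only what ships with the JDK, so it uses XPath in place of XQuery/XQJ, org.w3c.dom elements in place of JDOM objects, and a small in-memory string in place of a contentbase. The class name QuerySketch and the sample data are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration: XPath stands in for XQuery, DOM for JDOM.
class QuerySketch {

    // Evaluate the expression and return only the matching elements.
    // Here the whole input is parsed in memory; a real contentbase would
    // resolve the query on the server side and ship back just the results.
    public static List<Element> query(String xml, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<Element> results = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            results.add((Element) hits.item(i));
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data.
        String xml = "<library>"
                + "<book year='2001'><title>Old</title></book>"
                + "<book year='2005'><title>New</title></book>"
                + "</library>";
        for (Element title : query(xml, "/library/book[@year > 2003]/title")) {
            System.out.println(title.getTextContent());
        }
    }
}
```

The calling code only ever touches the handful of elements the query selected, which is the role JDOM plays in the arrangement described above.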
> Actually, this brings up a question for the group:
> Would people be interested in a memory-efficient DOM or JDOM
It's not a high priority from my perspective. I think the right
solution for managing large XML datasets isn't to write a memory
efficient data model. That's just a stopgap solution, akin to making
Excel support 256k rows instead of 64k. The right solution is to use
(or write if needed) an XML contentbase that indexes the XML and makes
it queryable. That gets you both larger data sizes and a more effective
way of interacting with the content.
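As a toy illustration of why an index changes the picture, the sketch below makes one streaming pass over the XML with the JDK's StAX parser and builds a word-to-title lookup, so later queries never rescan the document. This is only an assumption-laden sketch: it says nothing about how eXist or Mark Logic actually index, the class name TitleIndexSketch is invented, and it assumes each title element contains plain text only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical toy index; real contentbases index far more than this.
class TitleIndexSketch {

    // Maps each lower-cased word to the ordinals (0-based, document order)
    // of the <title> elements that contain it.
    public static Map<String, List<Integer>> buildIndex(String xml) throws Exception {
        Map<String, List<Integer>> index = new TreeMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int titleNo = -1;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("title")) {
                titleNo++;
                // getElementText() assumes text-only content and leaves the
                // reader on the matching end tag.
                for (String word : reader.getElementText().toLowerCase().split("\\W+")) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    List<Integer> postings =
                            index.computeIfAbsent(word, k -> new ArrayList<>());
                    // Record each title at most once per word.
                    if (postings.isEmpty() || postings.get(postings.size() - 1) != titleNo) {
                        postings.add(titleNo);
                    }
                }
            }
        }
        reader.close();
        return index;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data.
        String xml = "<books>"
                + "<book><title>Java and XML</title></book>"
                + "<book><title>XQuery Basics</title></book>"
                + "<book><title>XML in a Nutshell</title></book>"
                + "</books>";
        Map<String, List<Integer>> index = buildIndex(xml);
        System.out.println("Titles mentioning xml: " + index.get("xml"));
    }
}
```

The pass that builds the index is the only time the full document is read; a lookup afterwards is a map access, which is the difference between a queryable store and a bigger in-memory model.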
If you have open source energy to devote to the problem, look at eXist
(http://exist.sourceforge.net/) and help give it some slick indexing
Or if you want a commercial grade solution, look at Mark Logic. You can
get a 30 day trial that supports data sets up to 1 GB
(http://xqzone.marklogic.com). The official product goes four orders of
magnitude larger than that. It's really fun.
Here's a screencast I did with Jon Udell showing off XQuery to
manipulate some O'Reilly books in DocBook format:
It's the kind of stuff that wouldn't be nearly as effective using
something like JDOM or even XSLT when your data sets get large. Like I
said above, JDOM still has its place with large data sets as the way to
interact with the contentbase storing the XML. But for interacting with
large quantities of XML, you need more. Jon, for example, started using
Mark Logic to back his blog and query it along with the RSS feeds of others.