[jdom-interest] Re: Manipulating a very large XML file

Tue Mar 15 08:40:56 PST 2005

Hi Jason and everyone,

> > Actually, this brings up a question for the group:  
> > Would people be interested in a memory-efficient DOM or JDOM 
> > implementation?
> 
> It's not a high priority from my perspective.  I think the right 
> solution for managing large XML datasets isn't to write a memory 
> efficient data model.  That's just a stopgap solution, akin to making 
> Excel support 256k rows instead of 64k.  The right solution is to use 
> (or write if needed) an XML contentbase that indexes the XML and makes 
> it queryable.  That gets you both larger data sizes and a more effective 
> way of interacting with the content.

Oh, I agree.  As a computer scientist, I know that 8*N and N/2
are both O(N), so from that point of view, it really doesn't matter.
In the long run, as N continues to grow, people absolutely need
to switch to a database approach.

But, I'm also a practical system builder, and I suspect that a lot
of other system builders have invested in DOM and need to get it
to grow a little bit, even though it cannot grow indefinitely.

By analogy, imagine that a family with N kids is shopping for a 
car.  If N=1 or 2, any sedan or small SUV will do.  If N grows
to 3 or 4, a larger SUV or mini-van is better.  With N=5 or 6,
maybe a huge SUV or full sized van is justified.  With N=7 or
8, you need an extended full-size van.  Eventually, with larger
N, the family would need a small bus, a full-sized bus, or even
some kind of train or something.  Does that mean that there
is no market for SUVs and mini-vans?  And if the family has
only N=2 but thinks it might have N=3 or 4 later, it might
even get a larger vehicle now, just in case.

Back to DOM, I agree that there is a definite need for small
N and for huge N.  My question is, is there a practical need for
a larger "small" N?

> Or if you want a commercial grade solution, look at Mark Logic.  You can 
> get a 30 day trial that supports data sets up to 1G 
> (http://xqzone.marklogic.com).  The official product goes four orders of 
> magnitude larger than that.  It's really fun.

Cool.  Hardcore!

If a dataset contains gigabytes, doesn't that make it more likely
that the results of a given query could be tens of megabytes?

In a relational database, the RDBMS can return a large rowset
as a stream, and the application goes through it row-by-row.
If an XML query results in a big nodelist, that could certainly
be streamed.  But, if it results in a big sub-tree, doesn't
that need to be represented in RAM in an efficient way?
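
To make the nodelist case concrete, here is a minimal sketch of
going through a result one element at a time, much like a rowset.
It assumes the result arrives as serialized XML and uses the StAX
API (javax.xml.stream) from the JDK rather than JDOM.  The class
name, method name, and sample data are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of JDOM or any product.
class StreamingNodeList {

    // Walks the XML event by event and counts elements with the given
    // local name. No tree is built, so memory use does not grow with
    // the number of matching elements.
    static int countElements(String xml, String name) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && name.equals(reader.getLocalName())) {
                    count++; // a real application would process the "row" here
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a query result.
        String xml = "<results><row id='1'/><row id='2'/><row id='3'/></results>";
        System.out.println(countElements(xml, "row"));
    }
}
```

This only covers the flat, nodelist-style case.  It does not help
when the application needs the whole sub-tree at once, which is
the case I am asking about.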

> Here's a screencast I did with Jon Udell showing off XQuery to 
> manipulate some O'Reilly books in docbook format:
>    http://weblog.infoworld.com/udell/2005/02/15.html#a1177

Very cool.  I definitely need to learn more about XQuery.

Thanks,
jason!

-- 
P.S. You might also be interested in my latest project, ReadySET Pro.
http://www.readysetpro.com/