[jdom-interest] Re: Manipulating a very large XML file
jrobbins at tigris.org
Tue Mar 15 08:40:56 PST 2005
Hi Jason and everyone,
> > Actually, this brings up a question for the group:
> > Would people be interested in a memory-efficient DOM or JDOM
> > implementation?
> It's not a high priority from my perspective. I think the right
> solution for managing large XML datasets isn't to write a memory
> efficient data model. That's just a stopgap solution, akin to making
> Excel support 256k rows instead of 64k. The right solution is to use
> (or write if needed) an XML contentbase that indexes the XML and makes
> it queryable. That gets you both larger data sizes and a more effective
> way of interacting with the content.
Oh, I agree. As a computer scientist, I know that 8*N and N/2
are both O(N), so from that point of view, it really doesn't matter.
In the long run, as N continues to grow, people absolutely need
to switch to a database approach.
But, I'm also a practical system builder, and I suspect that a lot
of other system builders have invested in DOM and need to get it
to grow a little bit, even though it cannot grow indefinitely.
By analogy, imagine that a family with N kids is shopping for a
car. If N=1 or 2, any sedan or small SUV will do. If N grows
to 3 or 4, a larger SUV or mini-van is better. With N=5 or 6,
maybe a huge SUV or full sized van is justified. With N=7 or
8, you need an extended full-size van. Eventually, with larger
N, the family would need a small bus, a full-sized bus, or even
some kind of train. Does that mean that there
is no market for SUVs and mini-vans? And if the family
only has N=2, but they think that they might have N=3 or 4 later,
they might even get a larger vehicle now, just in case.
Back to DOM, I agree that there is a definite need for small
N and for huge N. My question is, is there a practical need for
a larger "small" N?
> Or if you want a commercial grade solution, look at Mark Logic. You can
> get a 30 day trial that supports data sets up to 1G
> (http://xqzone.marklogic.com). The official product goes four orders of
> magnitude larger than that. It's really fun.
If a dataset contains gigabytes, doesn't that make it more likely
that the results of a given query could be tens of megabytes?
In a relational database, the RDBMS can return a large rowset
as a stream, and the application goes through it row-by-row.
If an XML query results in a big nodelist, that could certainly
be streamed. But, if it results in a big sub-tree, doesn't
that need to be represented in RAM in an efficient way?
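To make the streamed case concrete, here is a minimal sketch, assuming the
standard SAX parser from javax.xml.parsers. The class and method names
(StreamingCount, countElements) are hypothetical, made up for illustration.
It walks the document element-by-element, much like going through a rowset
row-by-row, and never builds a tree:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts matching elements in a single streaming pass.
class StreamingCount {
    static int countElements(String xml, final String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Each element is seen once and then discarded by the parser;
                // only the running counter is kept in memory.
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }
}
```

This only covers the nodelist-style case. Once the application needs to
navigate or modify a whole sub-tree, it needs a tree in RAM, which is where
the memory-efficiency question comes in.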
> Here's a screencast I did with Jon Udell showing off XQuery to
> manipulate some O'Reilly books in docbook format:
Very cool. I definitely need to learn more about XQuery.
P.S. You might also be interested in my latest project, ReadySET Pro.