[jdom-interest] Maximizing Efficiency of XPath calls

Jason Hunter jhunter at xquery.com
Fri Sep 9 13:59:57 PDT 2005


Paul Libbrecht wrote:

> I'd be interested to know how much performance one can expect of such  
> engines... we keep processing XML but end up storing and retrieving  
> with Lucene, which has really good performance (around 1500 queries/second  
> over about 7000 items totalling 50 MB on a single PC, spread across  
> about 20 typical queries).
> 
> The XML databases I tried were much slower (by a factor of 10 or 100), but  
> I never went as far as building a real index.
> 
> What does a commercial product such as Mark Logic achieve?

If your XML database is written in Java, you can pretty much assume it's 
not going to be fast.  I love Java, made my career off Java, and wrote 
JDOM explicitly *for* Java, but you don't write fast databases in Java. 
Fast databases need too much low-level control -- of the filesystem, 
threading, and memory management.  I'll bet whatever you tried was 
written in Java.

Mark Logic's engine is in C++.  Thank God some people still remember 
C++.  :)  It's designed with a search engine style architecture (similar 
to Lucene) but with indexes that understand the structure of the 
documents.  To get that across, it helps to understand that in MarkLogic 
Server a query of //foo is a wee bit faster than /a/b/foo.  That's 
because to satisfy the first query it only needs to know where <foo> 
instances are, which it knows easily from its index.  For the second it 
needs to find the <foo> instances but also use other indexes to make sure 
each one is under a <b> which is under an <a>.  Both queries are very fast 
(the join between indexes for the second query is fast), so both start 
streaming answers essentially instantaneously, because they can be fully 
answered from indexes.
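
Purely to make that concrete, here's a toy sketch in Java (this being the 
JDOM list) of how an element-name index could answer //foo with a single 
posting-list lookup while /a/b/foo joins the same list against parent 
links.  Everything in it is invented for illustration; it says nothing 
about how MarkLogic actually lays out its indexes.

  import java.util.*;

  // Toy element-name index: each element occurrence records its own node id,
  // its parent's node id (-1 for the document root), and its name.
  class ToyElementIndex {

      static class Occurrence {
          final int nodeId;
          final int parentId;   // -1 means the document root
          final String name;
          Occurrence(int nodeId, int parentId, String name) {
              this.nodeId = nodeId;
              this.parentId = parentId;
              this.name = name;
          }
      }

      private final Map<String, List<Occurrence>> byName = new HashMap<>();
      private final Map<Integer, Occurrence> byNodeId = new HashMap<>();

      void add(int nodeId, int parentId, String name) {
          Occurrence occ = new Occurrence(nodeId, parentId, name);
          byName.computeIfAbsent(name, k -> new ArrayList<>()).add(occ);
          byNodeId.put(nodeId, occ);
      }

      // //foo : one posting-list lookup, nothing further to check.
      List<Occurrence> allNamed(String name) {
          List<Occurrence> hits = byName.get(name);
          return hits == null ? Collections.<Occurrence>emptyList() : hits;
      }

      // /a/b/foo : start from the same <foo> posting list, then join against
      // the parent links to confirm <b> under <a> at the document root.
      List<Occurrence> pathABFoo() {
          List<Occurrence> out = new ArrayList<>();
          for (Occurrence foo : allNamed("foo")) {
              Occurrence b = byNodeId.get(foo.parentId);
              if (b == null || !b.name.equals("b")) continue;
              Occurrence a = byNodeId.get(b.parentId);
              if (a == null || !a.name.equals("a") || a.parentId != -1) continue;
              out.add(foo);
          }
          return out;
      }
  }

Both queries touch only index structures; the second just does a little 
extra joining per hit, which is the "wee bit" of difference.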

I say "start to stream" above because actually receiving all the <foo> 
elements in a 5 terabyte data set can itself take a while.

Other queries like //foo[title = "x"] are also fast for large data 
because that query too can be fully resolved with indexes.  So too with 
//foo[cts:contains(title, "foo")] which looks for titles containing the 
token "foo" (token-based like Lucene, so it correctly doesn't match 
"food").  The cts: namespace is a Mark Logic extension; standard XQuery 
doesn't yet have a standardized token-based text search.
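
To see why token-based matching behaves that way, here's a minimal sketch 
(again in Java, and again invented purely for illustration -- not how 
MarkLogic or Lucene is actually implemented): titles are split into tokens 
before indexing, so looking up the token "foo" finds "Foo Fighters" but 
not "Fast food", where a substring test would match both.

  import java.util.*;

  // Minimal token index over titles: split each title into tokens, then
  // look titles up by exact token rather than by substring.
  class ToyTokenIndex {

      private final Map<String, Set<Integer>> postings = new HashMap<>();

      void addTitle(int docId, String title) {
          // Lowercase and split on non-letters -- a crude stand-in for a
          // real tokenizer.
          for (String token : title.toLowerCase().split("[^\\p{L}]+")) {
              if (!token.isEmpty()) {
                  postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
              }
          }
      }

      Set<Integer> titlesWithToken(String token) {
          Set<Integer> hits = postings.get(token.toLowerCase());
          return hits == null ? Collections.<Integer>emptySet() : hits;
      }

      public static void main(String[] args) {
          ToyTokenIndex index = new ToyTokenIndex();
          index.addTitle(1, "Foo Fighters");
          index.addTitle(2, "Fast food");
          // Prints [1]: the token "foo" matches doc 1 but not the "food" in
          // doc 2, while title.contains("foo") would have matched both.
          System.out.println(index.titlesWithToken("foo"));
      }
  }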

You can go beyond XPath in XQuery to do fancy things like

for $hit in
   cts:search(//chapter[cts:contains(title, "servlet")]/sect1/para,
              cts:word-query("apply style", "stemmed"))[1 to 10]
return
   <span>
     <book>{ $hit/ancestor::book/title/text() }</book>
     { htmllib:render($hit) }
   </span>

This says to search all para elements under sect1 elements of all 
chapters which have "servlet" in their titles (case insensitive) for the 
phrase "apply style", with stemming enabled so "applying styles" would 
also be a legal match, then return the top 10 most relevant hits.  For 
each $hit item return a <span> containing the title of the book that 
contains the paragraph, followed by the paragraph itself rendered nicely 
as HTML (using a user-defined function).  O'Reilly is doing stuff like 
this with its content using Mark Logic, against many gigabytes of book 
and article content.  I showed some demos of this at FOO Camp that were fun.

If you want to see all the text search stuff, I wrote a paper titled 
"XQuery Search and Update" for XTech 2005 available at 
http://idealliance.org/proceedings/xtech05/papers/02-04-01/.

Anyway, as you can tell by my return address, I've been enjoying XQuery 
quite a lot.  :)

-jh-


