<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi Cliff.<br>
<br>
JDOM cannot generate an XSD from a document. It is an interesting
idea, but a very complicated one: how would it set maxOccurs, for
example? In your use case that would be significant.<br>
<br>
The best I can suggest is that you do a 'deep inspection' of the
XML, create your own sort of 'fingerprint' for each document, and
then group and count the documents by that fingerprint.<br>
<br>
JDOM could be useful here because it makes the inspection part a
whole lot easier than building a SAX ContentHandler, etc. (but at
the price of some speed and some memory). Once you have built the
JDOM document you can run all sorts of functions on the data to
create the 'fingerprint'.<br>
<br>
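A minimal sketch of one possible fingerprint, not a definitive implementation: it collects the set of distinct element paths, so repeated nodes and repeated groups collapse while a missing element changes the result. To stay runnable without extra jars it uses the JDK's built-in DOM parser rather than JDOM; the names StructureFingerprint and fingerprint are made up for illustration, and treating the set of paths as 'the structure' is an assumption about your rules.<br>
<br>

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: the class and method names are illustrative only.
class StructureFingerprint {

    /** Returns the sorted, de-duplicated element paths, joined by newlines. */
    static String fingerprint(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // keep namespaces in the fingerprint
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Set<String> paths = new TreeSet<>();
        collect(doc.getDocumentElement(), "", paths);
        return String.join("\n", paths);
    }

    /** Adds the path of this element and, recursively, of all child elements. */
    private static void collect(Element element, String parentPath, Set<String> paths) {
        String ns = element.getNamespaceURI();
        String name = (ns == null ? "" : "{" + ns + "}") + element.getLocalName();
        String path = parentPath + "/" + name;
        paths.add(path); // a Set, so repeated siblings add nothing new
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                collect((Element) child, path, paths);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String[] docs = {
            "<a> <b> <c> </c> </b> </a>",
            "<a> <b> <c> </c> <c> </c> </b> </a>",
            "<a> <b> <c> </c> </b> <b> <c> </c> <c> </c> </b> </a>",
            "<a> <b> </b> </a>"
        };
        // Count how many documents share each fingerprint.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String xml : docs) {
            counts.merge(fingerprint(xml), 1, Integer::sum);
        }
        counts.forEach((fp, n) -> System.out.println(n + " document(s):\n" + fp + "\n"));
    }
}
```

Attributes, text content and element order are ignored in this sketch; if they matter for your classification they would have to be added to the path strings.<br>
<br>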
Again, this could potentially be done inside the database to be more
efficient.<br>
<br>
Unfortunately (for you), I do not think there is an easy or
preexisting solution for this (nothing comes to mind).<br>
<br>
Also, as Michael says, you need to build up your 'taxonomical' (nice
word, Michael) rules, and in a 'real world' instance you should be
namespace-aware, etc. Again, JDOM can help with that, but only as
part of a bigger solution.<br>
<br>
Rolf<br>
<br>
<br>
<br>
On 04/01/2012 4:00 PM, cliff palmer wrote:
<blockquote
cite="mid:CABhr9SvABYbapTjaRMpQQVVv6Yowkb0XyBmvWCiesFu7aFBVzg@mail.gmail.com"
type="cite">Rolf it's a little more involved than reading the XSI
refs. I need to look at the nodes.<br>
<br>
&lt;a&gt; &lt;b&gt; &lt;c&gt; &lt;/c&gt; &lt;/b&gt; &lt;/a&gt; and
&lt;a&gt; &lt;b&gt; &lt;c&gt; &lt;/c&gt; &lt;c&gt; &lt;/c&gt;
&lt;/b&gt; &lt;/a&gt; are considered the same structure for what I
am doing because repeating nodes aren't considered a difference.<br>
&lt;a&gt; &lt;b&gt; &lt;c&gt; &lt;/c&gt; &lt;/b&gt; &lt;/a&gt; and
&lt;a&gt; &lt;b&gt; &lt;c&gt; &lt;/c&gt; &lt;/b&gt; &lt;b&gt;
&lt;c&gt; &lt;/c&gt; &lt;c&gt; &lt;/c&gt; &lt;/b&gt; &lt;/a&gt;
are considered the same structure for what I am doing because
repeating groups aren't considered a difference.<br>
&lt;a&gt; &lt;b&gt; &lt;c&gt; &lt;/c&gt; &lt;/b&gt; &lt;/a&gt; and
&lt;a&gt; &lt;b&gt; &lt;/b&gt; &lt;/a&gt; are not the same because
&lt;c&gt; is missing in the 2nd case so it does not contain all
the elements as the 1st case.<br>
<br>
I had hoped (apparently it's just a hope) that JDOM could generate
an XSD from an XML DOM object.<br>
<br>
More ideas are welcome :)<br>
<br>
Cliff<br>
<br>
<br>
<div class="gmail_quote">On Wed, Jan 4, 2012 at 2:53 PM, Rolf Lear
<span dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:jdom@tuis.net">jdom@tuis.net</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> Hi Cliff.<br>
<br>
I can't think of any magic 'short cut'.... and certainly, I
do not think JDOM will be the fastest/best way to 'classify'
each document.<br>
<br>
Things you should consider though:<br>
- Using a plain SAX parser (XMLReader) with a clever
EntityResolver may help you quickly find out which external
URLs (probably XML Schemas) are needed to resolve the document
(although there is no concept of an 'order' of schemas).
This could help 'identify' the document.<br>
- Cutting short the parser (throw a SAX exception) would
speed things up once you have entered the main part of the
document (startElement()) because you probably do not need
to parse the whole document, just the xsi schema-location
references.<br>
- Finally, depending on your database, you may already have
a JRE available in the database server (the 'big-brand'
databases such as DB2, Oracle and Sybase mostly do), in which
case you can build a 'clever' Java function that evaluates
the document *inside* the database, and avoid creating a lot
of external traffic. For example, you may be able to
create a custom Java-backed function 'xmlschemas()' which
returns the list of schemas in use in a document, and then
you can do something like:<br>
<br>
select xmlschemas(xmldatacol) as schemas, count(*) from
table group by schemas<br>
<br>
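The 'cut the parser short' idea above could look roughly like this. It is only a sketch under assumptions: it presumes the xsi:schemaLocation / xsi:noNamespaceSchemaLocation hints sit on the root element, and the names SchemaLocationSniffer and schemaLocations are invented for illustration. The handler reads the root element's attributes and then deliberately throws a SAXException so the rest of the document is never parsed.<br>
<br>

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the class and method names are illustrative only.
class SchemaLocationSniffer {

    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    /** Records the root element's schema hints, then aborts the parse. */
    static final class RootHandler extends DefaultHandler {
        boolean sawRoot = false;
        String locations = "";

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            sawRoot = true;
            String withNs = atts.getValue(XSI_NS, "schemaLocation");
            String noNs = atts.getValue(XSI_NS, "noNamespaceSchemaLocation");
            StringBuilder sb = new StringBuilder();
            if (withNs != null) {
                sb.append(withNs.trim());
            }
            if (noNs != null) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(noNs.trim());
            }
            locations = sb.toString();
            // Deliberate abort: nothing after the root start tag is needed.
            throw new SAXException("stop: root element already inspected");
        }
    }

    /** Returns the schema hints found on the root element, or "" if there are none. */
    static String schemaLocations(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look attributes up by namespace
        SAXParser parser = factory.newSAXParser();
        RootHandler handler = new RootHandler();
        try {
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Swallow only our own early abort; real parse errors still propagate.
            if (!handler.sawRoot) {
                throw e;
            }
        }
        return handler.locations;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
                + " xsi:schemaLocation='urn:example:order order.xsd'><item/></order>";
        System.out.println(schemaLocations(xml));
    }
}
```

A function like this is the kind of thing that could sit behind the 'xmlschemas()' idea, although how it would be registered depends entirely on the database.<br>
<br>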
Rolf
<div>
<div class="h5"><br>
<br>
On 04/01/2012 2:11 PM, cliff palmer wrote: </div>
</div>
<blockquote type="cite">
<div>
<div class="h5">I need to examine XML documents
contained in multiple columns in a database table with
over a million rows and identify each of the different
structures used for the XML data, producing a count of
the number of instances that use each structure.<br>
<br>
I thought of using the SAXParser then creating a list
of the XML headers in the order used and storing each
unique list and accumulating a count based on matching
an already encountered list object, but I am hoping
there is a less cumbersome approach.<br>
<br>
I would appreciate any and all suggestions.<br>
<br>
Thanks!<br>
Cliff<br>
<br>
</div>
</div>
<pre>_______________________________________________
To control your jdom-interest membership:
<a moz-do-not-send="true" href="http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com" target="_blank">http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com</a></pre>
</blockquote>
<br>
</div>
<br>
</blockquote>
</div>
<br>
</blockquote>
<br>
</body>
</html>