[jdom-interest] Still more Verification

Jason Hunter jhunter at collab.net
Tue Aug 22 20:28:47 PDT 2000


I'm curious what people think about this approach.  What Elliotte's code
does is ensure that you absolutely cannot create a non-well-formed XML
document using JDOM.  That's a cool feature!

My concern is that every change to the JDOM document is going to be
checked char by char by char, resulting in a noticeable performance
decrease.  Elliotte says he saw a 20% slowdown (I'm not sure on which
test).  The cost is probably worst for documents that are mostly text.

We could perhaps find a way for SAXBuilder to avoid the slowdown by
using some special constructor.  Problem there is that since builders
are and should be in a different package than the core (because people
should have the ability to write their own builders), we're going to
have to expose those special constructors to the public at large, and
that eliminates the ability to say you cannot create a non-well-formed
JDOM document, because with those constructors you can.
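
A minimal sketch of that tradeoff, under loud assumptions: TextNode and
its two constructors are hypothetical names used only for illustration,
not anything in the JDOM tree, and the character check is deliberately
simplified.

```java
// Hypothetical sketch: TextNode and its constructors are illustrative
// names, not part of the JDOM API.
final class TextNode {
    private final String value;

    // Normal path: always verifies, so callers cannot store illegal text.
    public TextNode(String value) {
        this(value, true);
    }

    // "Special" path for a builder whose input a parser has already checked.
    // It must be public for a builder in another package to call it, which
    // means any caller can pass verify=false and skip the check.
    public TextNode(String value, boolean verify) {
        if (verify) {
            for (int i = 0; i < value.length(); i++) {
                char c = value.charAt(i);
                // Simplified check: reject C0 controls other than tab, LF, CR.
                if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                    throw new IllegalArgumentException(
                        "Illegal character 0x" + Integer.toHexString(c));
                }
            }
        }
        this.value = value;
    }

    public String getValue() {
        return value;
    }
}
```

With this shape, new TextNode("a\0b") throws, but new TextNode("a\0b",
false) quietly stores the NUL, which is exactly the loophole described
above.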

Is it worth a 20% performance hit on all element construction to sanity
check the text content?  The answer is probably sometimes yes, sometimes
no.  But how would one differentiate between the two?

We have a similar issue already for checking tag names, PI content, and
so on.  If the content has already passed through a parser like Xerces,
checking again only wastes CPU cycles.  We haven't worried about it for
things like checking tag names because it's relatively fast, but when
you have a document that could have large amounts of text, do you really
want to check every character one at a time against a matrix of legal
characters?
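
To make the per-character cost concrete, here is a minimal sketch of
such a check.  It is not Elliotte's Verifier source: the class name
CharDataCheck and the method bodies are assumptions, written against the
Char production in the XML 1.0 spec, and only the convention of
returning null for "legal" and a reason string otherwise is taken from
the setValue() example quoted below.

```java
// Hypothetical sketch; not the actual JDOM Verifier source.
final class CharDataCheck {

    // True if the code point matches the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXMLCharacter(int c) {
        if (c == 0x9 || c == 0xA || c == 0xD) return true;
        if (c < 0x20) return false;      // remaining C0 controls (NUL, form feed, ...)
        if (c <= 0xD7FF) return true;
        if (c < 0xE000) return false;    // unpaired surrogates
        if (c <= 0xFFFD) return true;
        return c >= 0x10000 && c <= 0x10FFFF;
    }

    // Returns null if every character is legal, otherwise a reason string.
    static String checkCharacterData(String text) {
        if (text == null) {
            return "A null is not a legal XML value";
        }
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            if (!isXMLCharacter(c)) {
                return "0x" + Integer.toHexString(c)
                        + " is not a legal XML character";
            }
            i += Character.charCount(c);  // step over surrogate pairs as one unit
        }
        return null;
    }
}
```

Every setter call pays for one pass over the whole string, so the work
grows linearly with the amount of text, which is why text-heavy
documents would feel it most.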

-jh-


Elliotte Rusty Harold wrote:
> 
> I've updated my versions of the
> 
> Verifier
> Element
> Attribute
> ProcessingInstruction
> Comment
> 
> classes to support the latest CVS build. These are now ready to be
> merged into the main tree. As well as being based on the most current
> tree, they are a little cleaner than the versions I posted yesterday.
> I removed some redundant code and was more consistent overall.
> 
> All setter and add methods now check the contents of these items, not
> just the names. In particular, they check every character of the
> contents to make sure it's legal parsed character data and is not,
> for example, a C0 control character like NUL or form feed. They also
> check, for example, to see that the List passed to setMixedContent
> doesn't contain anything funky like a Frame or a Document instead of
> just the expected types. For example, here's the new Attribute
> setValue() method:
> 
>      public void setValue(String value) {
>          String reason = Verifier.checkCharacterData(value);
>          if (reason != null) {
>              throw new IllegalDataException(value, reason);
>          }
>          this.value = value;
>      }
> 
> Everything's at http://metalab.unc.edu/xml/jdom/
> You can find my changes by grepping for  //^^
> I put such a comment in front of every method I changed.
> 
> I'm going to take a stab at figuring out how to maintain namespace
> prefixes next.
> 
> +-----------------------+------------------------+-------------------+
> | Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
> +-----------------------+------------------------+-------------------+
> |                  The XML Bible (IDG Books, 1999)                   |
> |              http://metalab.unc.edu/xml/books/bible/               |
> |   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
> +----------------------------------+---------------------------------+
> |  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
> |  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/     |
> +----------------------------------+---------------------------------+