[jdom-interest] Verifier

Jason Hunter jhunter at acm.org
Mon Apr 1 17:18:14 PST 2002

> I have been taking a look at the Verifier code (as Jason tricked me into
> promising at JavaOne) with an eye towards making it faster without removing
> the checks.  I found a few interesting things:

Actually, Harry, I was trying to get you to look at the general speed of
the build, because right now I don't think Verifier is the bottleneck.
It seems to vary a lot depending on JDK and JVM, but on JDK 1.2 in
OptimizeIt the bulk of the time was being spent in StringBuffer.copy()
as a result of the setLength(0) calls.  There's a "shared" flag that
determines whether the buffer's internal char[] needs to be copied, and
on JDK 1.2 at least it behaves in a way that forces the copy.  On JDK
1.4 it looks like Java itself found an optimization which avoids that
problem, but I can't run OptimizeIt on JDK 1.4 so it's hard to know.
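For reference, the pattern in question looks roughly like this (a
minimal sketch with hypothetical method names, not the actual builder
code):

```java
// Sketch of the buffer-reuse pattern (hypothetical; not the actual
// JDOM builder code). On JDK 1.2, StringBuffer.toString() marks the
// internal char[] as "shared", and the next mutating call such as
// setLength(0) must then copy the array -- that copy is where the
// profiler showed the time going.
public class BufferReuse {

    // Collects tokens by reusing one buffer via setLength(0).
    static String joinReused(String[] tokens) {
        StringBuffer buf = new StringBuffer();
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < tokens.length; i++) {
            buf.setLength(0);           // may force a copy if "shared"
            buf.append(tokens[i]);
            out.append(buf.toString()); // toString() sets the shared flag
        }
        return out.toString();
    }

    // Same result with a fresh buffer per token: a small allocation
    // in exchange for never triggering the shared-flag copy.
    static String joinFresh(String[] tokens) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < tokens.length; i++) {
            out.append(new StringBuffer().append(tokens[i]).toString());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] t = { "one", "two", "three" };
        System.out.println(joinReused(t).equals(joinFresh(t)));
    }
}
```

Whether the fresh-allocation variant actually wins depends on the JVM's
allocation cost versus its array-copy cost, which is presumably why the
behavior differs between JDK 1.2 and 1.4.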

> 1.  Unless there is some reason anyone can see against it, I think most of
> the methods in Verifier, such as isXMLLetter, isXMLDigit, and
> isXMLCombiningChar, should be using the Character.Subset interface defined
> in java.lang, as this is the standard way to define ranges of characters for
> Java.  This won't help performance (shouldn't really hurt it either), but it
> will make it a bit more standard.

I'd be interested in hearing Elliotte's thoughts on this.  Lacking his
commentary, I personally think it's better to match the XML spec
explicitly than to rely on Java's built-in behaviors.

> 2.  Segmenting the searches (with a few greater than checks) would make the
> performance over the entire character range faster.  This change would hurt
> the common case with a single additional check, but would help all the
> checks that currently fall towards the end of the cascading if statements.
> For example, in isXMLLetter, if the character being checked is between
> 0x4E00 and 0x9FA5, it must cascade through over 400 if statements to be
> properly checked.  Breaking the group of if statements into segments, and
> nesting these segments in if statements, would allow the groups later in the
> checks to be accessed more quickly.

Yep, I looked at this one a long time ago.  For ASCII and Latin-1 we're
already about as fast as we can be, and that common case is what
concerns me most.  I looked at using an array to do the lookups, maybe
even an array of shorts to have 8 flags per location, but then the cost
is how to build the array.  Do we set the values at compile time?  Is it
worth consuming the memory?  I tried a few things and let it sit, since
overall the time spent in this area didn't justify heavy tweaking, and
I still think getting a feature-complete 1.0 out is more important than
speeding up checks for high-value Unicode languages.
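The flag-array idea can be sketched like so (hypothetical class and
method names; the bit assignments and the Character.isLetter stand-in
are assumptions, since a real table would be populated from the XML 1.0
productions):

```java
// Sketch of a per-character flag table (hypothetical; JDOM's Verifier
// does not actually do this). One short per code point below 0x100,
// one bit per property, built once at class-load time.
public class CharFlags {
    static final int LETTER = 1 << 0;
    static final int DIGIT  = 1 << 1;
    static final short[] FLAGS = new short[0x100];
    static {
        for (int i = 0; i < 0x100; i++) {
            char c = (char) i;
            short f = 0;
            // Population here would follow the XML 1.0 productions;
            // Character.isLetter/isDigit are stand-ins for the sketch.
            if (Character.isLetter(c)) f |= LETTER;
            if (Character.isDigit(c))  f |= DIGIT;
            FLAGS[i] = f;
        }
    }

    static boolean isXMLLetter(char c) {
        if (c < 0x100) {
            return (FLAGS[c] & LETTER) != 0; // common case: one array read
        }
        return slowIsXMLLetter(c);           // fall back to range checks
    }

    // Placeholder for the existing cascade of range comparisons.
    static boolean slowIsXMLLetter(char c) {
        return Character.isLetter(c);
    }
}
```

The memory question from above is concrete here: 256 shorts is only
512 bytes, but extending the table across the whole BMP costs 128K.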

> 3.  The only common case fix I see without a lot more work would be checking
> the values against a common case range (like 0x00FF) and indexing into a
> boolean array for whether it is valid or not.  All other cases would fall
> through the common case into a (possibly segmented) implementation that
> resembles the current code.  This would speed up the ascii case, but
> penalize those not in the standard ascii range.

I was thinking that too at the time.

> 4.  Another possible solution is to precheck the ideographic cases  (since
> the ranges are so large), and otherwise do check for char presence in a
> hashset against the remaining possible values.  This sounds large, but in
> the example of isXMLLetter, there are only 2237 values represented until the
> last two checks.  The downside to this approach is wrapping the char in a
> Character object, and the memory overhead of a static map for each of the
> types checked.  It would definitely be a faster approach, though.

I was always thinking you'd get the int value of the char and look it up
in an array.  That'd be pretty fast, but again I opted not to pursue it
since it was little bang for the buck.  Right now StringBuffer.copy() is
far more important to figure out.
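The two lookup styles can be contrasted in a sketch (hypothetical names;
the tables use Character.isLetter as a stand-in for the XML spec
ranges, and neither is what Verifier actually ships):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch contrasting the HashSet approach with a direct int-indexed
// array (hypothetical). The set must box each char in a Character
// object per lookup; the array indexes on the char's int value with
// no allocation at all.
public class LookupStyles {
    static final Set<Character> LETTER_SET = new HashSet<Character>();
    static final boolean[] LETTER_ARRAY = new boolean[0x10000];
    static {
        // Stand-in population; a real table would follow the XML spec.
        for (int i = 0; i < 0x10000; i++) {
            char c = (char) i;
            if (Character.isLetter(c)) {
                LETTER_SET.add(Character.valueOf(c));
                LETTER_ARRAY[i] = true;
            }
        }
    }

    static boolean viaSet(char c) {
        return LETTER_SET.contains(Character.valueOf(c)); // boxes per call
    }

    static boolean viaArray(char c) {
        return LETTER_ARRAY[c]; // char widens to int; no allocation
    }
}
```

The array here is 64K of booleans for the full BMP, which is the
memory-versus-speed trade-off mentioned above in concrete terms.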
