[jdom-interest] Verifier

Mon Apr 1 15:51:02 PST 2002

I have been taking a look at the Verifier code (as Jason tricked me into
promising at JavaOne) with an eye towards making it faster without removing
the checks.  I found a few interesting things:

1.  Unless anyone can see a reason against it, I think most of
the methods in Verifier, such as isXMLLetter, isXMLDigit, and
isXMLCombiningChar, should be using the Character.Subset class defined
in java.lang, as this is the standard way to represent subsets of
characters in Java.  This won't help performance (and shouldn't really
hurt it either), but it will make the code a bit more standard.

2.  Segmenting the searches (with a few greater-than checks) would make the
performance over the entire character range faster.  This change would cost
the common case a single additional check, but would help all the
checks that currently fall towards the end of the cascading if statements.
For example, in isXMLLetter, if the character being checked is between
0x4E00 and 0x9FA5, it must cascade through over 400 if statements before
it is matched.  Breaking the group of if statements into segments, and
nesting these segments in outer if statements, would allow the groups later
in the checks to be reached more quickly.
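
To make the segmenting idea concrete, here is a rough sketch.  It is not the
actual Verifier code: the class and method names are made up, the segment
boundaries (0x0100 and 0x3000) are just ones I picked for illustration, and
only a handful of the XML 1.0 letter ranges are filled in.  The middle
segment, where most of the real ranges would live, is left empty.

```java
// Hypothetical sketch of a segmented range check; NOT the real Verifier code.
// Only a few XML 1.0 Letter ranges are included, so this is incomplete.
final class SegmentedLetterCheck {

    static boolean isLetterSketch(char c) {
        if (c < 0x0100) {
            // Segment 1: the Latin ranges at the start of the cascade.
            if (c >= 0x0041 && c <= 0x005A) return true;
            if (c >= 0x0061 && c <= 0x007A) return true;
            if (c >= 0x00C0 && c <= 0x00D6) return true;
            if (c >= 0x00D8 && c <= 0x00F6) return true;
            if (c >= 0x00F8) return true; // upper bound implied by segment
            return false;
        }
        if (c < 0x3000) {
            // Segment 2: the bulk of the remaining ranges would go here.
            // Left out of this sketch, so everything here reports false.
            return false;
        }
        // Segment 3: reached after two comparisons instead of hundreds.
        if (c == 0x3007) return true;
        if (c >= 0x3021 && c <= 0x3029) return true;
        if (c >= 0x4E00 && c <= 0x9FA5) return true;
        if (c >= 0xAC00 && c <= 0xD7A3) return true;
        return false;
    }
}
```

With this layout a character like 0x4E00 costs two segment comparisons plus a
few range checks, while 'A' pays one extra comparison over the current code.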

3.  The only common-case fix I see without a lot more work would be checking
the value against a common-case range (like 0x00FF) and indexing into a
boolean array to see whether it is valid or not.  All other cases would fall
through the common case into a (possibly segmented) implementation that
resembles the current code.  This would speed up the ASCII case, but
penalize characters outside that range with one extra check.
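
Roughly what I have in mind for the lookup table, as a sketch only.  The
class, field, and method names are invented, and slowPath is a stand-in for
the existing cascade (here it only knows about the big ideographic range).
The table is filled from the XML 1.0 letter ranges that fall below 0x0100.

```java
// Hypothetical sketch of a boolean-array fast path; NOT the real Verifier.
final class TableLetterCheck {

    // One entry per char value in 0x0000..0x00FF.
    private static final boolean[] LOW = new boolean[0x100];

    static {
        mark(0x0041, 0x005A);
        mark(0x0061, 0x007A);
        mark(0x00C0, 0x00D6);
        mark(0x00D8, 0x00F6);
        mark(0x00F8, 0x00FF);
    }

    // Flags every value in [lo, hi] as a letter.
    private static void mark(int lo, int hi) {
        for (int i = lo; i <= hi; i++) {
            LOW[i] = true;
        }
    }

    static boolean isLetterSketch(char c) {
        if (c <= 0x00FF) {
            return LOW[c]; // common case: one compare, one array read
        }
        return slowPath(c); // everything else falls through
    }

    // Stand-in for the current (possibly segmented) if cascade.
    // Assumption: only the 0x4E00-0x9FA5 range is handled in this sketch.
    private static boolean slowPath(char c) {
        return c >= 0x4E00 && c <= 0x9FA5;
    }
}
```

The table costs 256 booleans per check method, which seems cheap next to the
cascade it replaces for the common case.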

4.  Another possible solution is to precheck the ideographic cases (since
the ranges are so large), and otherwise check for the char's presence in a
HashSet holding the remaining possible values.  This sounds large, but in
the example of isXMLLetter, there are only 2237 values represented until the
last two checks.  The downside to this approach is wrapping the char in a
Character object, and the memory overhead of a static set for each of the
types checked.  It would definitely be a faster approach, though.
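
A sketch of the HashSet variant, again with invented names and only a few of
the letter ranges loaded into the set.  It is written with generics and
autoboxing for brevity, so the Character wrapping mentioned above happens
implicitly in the add and contains calls.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of precheck-plus-HashSet; NOT the real Verifier code.
final class HashSetLetterCheck {

    // Holds the non-ideographic letters. Only a small illustrative
    // subset of the XML 1.0 ranges is loaded in this sketch.
    private static final Set<Character> LETTERS = new HashSet<>();

    static {
        addRange(0x0041, 0x005A);
        addRange(0x0061, 0x007A);
        addRange(0x00C0, 0x00D6);
        addRange(0x00D8, 0x00F6);
        addRange(0x00F8, 0x00FF);
    }

    // Adds every char in [lo, hi] to the set (each one is boxed).
    private static void addRange(int lo, int hi) {
        for (int i = lo; i <= hi; i++) {
            LETTERS.add((char) i);
        }
    }

    static boolean isLetterSketch(char c) {
        // Precheck the large ideographic range so it never enters the set.
        if (c >= 0x4E00 && c <= 0x9FA5) {
            return true;
        }
        // Boxes c into a Character for the hash lookup.
        return LETTERS.contains(c);
    }
}
```

Every character outside the prechecked range pays for a box and a hash
lookup, so whether this beats the array approach for plain ASCII input would
need measuring.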

Does anyone have thoughts on any or all of the above cases?

Harry Evans