[jdom-interest] Character escaping

Sun Mar 16 12:49:54 PST 2003

I was bored this afternoon, so I started looking at the output-escaping
problem mentioned in the TODO list. 

The problem is how to determine which characters need to be escaped (by
character references like "&#xABCD;") for a particular encoding. In JDK
1.4, java.nio.charset.CharsetEncoder can do this for us, but we still
need to be able to run on pre-1.4 systems. My idea is to have an
interface called EscapeStrategy (see below). We can have an
NioEscapeStrategy implementation class that uses CharsetEncoder.
XMLOutputter will attempt to soft-load that implementation (using
Class.forName). If that fails, it will fall back to another
implementation that works on all systems. This BasicEscapeStrategy will
allow the following characters to pass through unescaped:

All characters, for UTF-8 and UTF-16
8-bit characters, for ISO-8859-1 (Latin-1)
7-bit characters, for all other encodings

All other characters are escaped. So we'll never output a bad character
(one that can't be encoded in the current charset), but we might encode
more characters than we need to, if you're using a pre-1.4 Java and
you're not using UTF-8 or UTF-16 or Latin 1. Oh well.
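
To make the fallback rules concrete, here is a minimal sketch of what
BasicEscapeStrategy might look like. It is only an illustration: the
encoding aliases it recognizes ("UTF8", "LATIN1") are assumptions on my
part, and the interface is repeated (without javadoc) so the snippet
stands alone.

```java
// Sketch only: mirrors the EscapeStrategy interface proposed in this message.
interface EscapeStrategy {
    void setEncoding(String encoding);
    boolean shouldEscape(char ch);
}

// Fallback that needs nothing newer than JDK 1.1.
class BasicEscapeStrategy implements EscapeStrategy {
    // Highest char value allowed through unescaped; 7-bit until told otherwise.
    private int maxUnescaped = 0x7F;

    public void setEncoding(String encoding) {
        String name = (encoding == null)
                ? "" : encoding.toUpperCase(java.util.Locale.ENGLISH);
        // The alias spellings below are assumptions, not an exhaustive list.
        if (name.equals("UTF-8") || name.equals("UTF8")
                || name.startsWith("UTF-16")) {
            maxUnescaped = 0xFFFF;   // every 16-bit char passes through
        } else if (name.equals("ISO-8859-1") || name.equals("LATIN1")) {
            maxUnescaped = 0xFF;     // 8-bit chars pass through
        } else {
            maxUnescaped = 0x7F;     // unknown encoding: 7-bit only
        }
    }

    public boolean shouldEscape(char ch) {
        return ch > maxUnescaped;
    }
}
```

With this, an unrecognized encoding such as Shift_JIS escapes everything
above 0x7F, which is wasteful but never wrong.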

The first issue is that, while we'll still be able to run on pre-1.4
systems, we won't be able to compile on them unless whoever is doing
the build manually deletes the NioEscapeStrategy.java file first.
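
The soft-loading is what keeps java.nio out of XMLOutputter itself: only
NioEscapeStrategy.java refers to it directly. A rough sketch of how that
could work follows. EscapeStrategyLoader and SevenBitEscapeStrategy are
hypothetical names used just for this illustration (the latter is a
simplified stand-in for the fallback), and the interface is repeated so
the snippet stands alone.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

interface EscapeStrategy {
    void setEncoding(String encoding);
    boolean shouldEscape(char ch);
}

// Needs JDK 1.4+ both to compile and to load.
class NioEscapeStrategy implements EscapeStrategy {
    private CharsetEncoder encoder;

    public void setEncoding(String encoding) {
        encoder = Charset.forName(encoding).newEncoder();
    }

    public boolean shouldEscape(char ch) {
        // canEncode(char) is false for anything the charset can't represent.
        return !encoder.canEncode(ch);
    }
}

// Hypothetical, simplified stand-in for the fallback: 7-bit chars only.
class SevenBitEscapeStrategy implements EscapeStrategy {
    public void setEncoding(String encoding) { }
    public boolean shouldEscape(char ch) { return ch > 0x7F; }
}

// Hypothetical helper; in practice this logic would live in XMLOutputter.
class EscapeStrategyLoader {
    static EscapeStrategy load(String className, String encoding) {
        EscapeStrategy strategy;
        try {
            // Reflection only, so the caller compiles without java.nio.
            // (Newer JDKs prefer getDeclaredConstructor().newInstance().)
            strategy = (EscapeStrategy) Class.forName(className).newInstance();
            strategy.setEncoding(encoding);
        } catch (Throwable t) {
            // Class missing, java.nio missing, or charset unknown: fall back.
            strategy = new SevenBitEscapeStrategy();
            strategy.setEncoding(encoding);
        }
        return strategy;
    }
}
```

Catching Throwable rather than Exception is deliberate here, since a
pre-1.4 VM may report the missing java.nio classes as a
NoClassDefFoundError.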

The second issue is with characters > 16 bits, which I understand only
partially. (Elliotte, you'll have to help me out here.)  It seems that
Java doesn't fully support this now, since there's a JSR to add support
for them in JDK 1.5. Presumably this support will use surrogate pairs,
where it takes two Java chars to represent these new Unicode characters.
But CharsetEncoder in 1.4 seems to take this into account; it talks
about surrogate pairs. I guess this API was written with the future in
mind, for when Java does fully support them?

Anyway, assuming that I've got all that right... Exposing surrogate
pairs in the EscapeStrategy interface would complicate it, and probably
make it much less efficient (since CharsetEncoder deals with surrogate
pairs only when you pass in a CharSequence). Instead, I think we can
decide to always encode characters > 16 bits. This doesn't seem like
much of a limitation, since the output will still be correct - it just
might be inefficient if you're using UTF-8 and your document contains
lots of musical symbols or Old Italic. So XMLOutputter would check for
surrogate pairs (by checking for chars between D800 and DFFF), and would
go ahead and encode them, rather than asking the EscapeStrategy. Sound
OK?
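
Roughly, the loop in XMLOutputter could look like the sketch below.
SurrogateAwareEscaper is a hypothetical name, the XML-mandated escapes
(ampersands and so on) are left out, and throwing on an unpaired
surrogate is just one possible choice, not something I've settled on.
The code point is computed by hand since this has to work before 1.5.

```java
interface EscapeStrategy {
    void setEncoding(String encoding);
    boolean shouldEscape(char ch);
}

// Hypothetical helper showing only the encoding-related escaping.
class SurrogateAwareEscaper {
    static String escape(String text, EscapeStrategy strategy) {
        StringBuffer out = new StringBuffer(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch >= 0xD800 && ch <= 0xDFFF) {
                // Surrogates never reach the strategy; always escape the pair.
                boolean isHigh = ch <= 0xDBFF;
                if (!isHigh || i + 1 >= text.length()) {
                    throw new IllegalArgumentException(
                            "Unpaired surrogate at index " + i);
                }
                char low = text.charAt(i + 1);
                if (low < 0xDC00 || low > 0xDFFF) {
                    throw new IllegalArgumentException(
                            "Unpaired surrogate at index " + i);
                }
                // Combine the high and low halves into one code point.
                int codePoint = 0x10000 + ((ch - 0xD800) << 10) + (low - 0xDC00);
                appendReference(out, codePoint);
                i++; // skip the low surrogate we just consumed
            } else if (strategy.shouldEscape(ch)) {
                appendReference(out, ch);
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    private static void appendReference(StringBuffer out, int codePoint) {
        out.append("&#x")
           .append(Integer.toHexString(codePoint).toUpperCase())
           .append(';');
    }
}
```

So a G clef (U+1D11E, stored as the pair D834 DD1E) always comes out as
"&#x1D11E;", whatever the encoding.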

Alex Rosen
Novell, Inc.

P.S. Hmm, I just noticed that TODO says that Brad has a suggested
solution. Didn't mean to step on anyone's toes. Was this on the mailing
list somewhere?

/**
 *  This interface tells XMLOutputter if a particular character should
 *  be "escaped", by outputting a character reference (e.g. "&#x58a1;")
 *  instead of the actual character (e.g. the char whose value is
 *  0x58A1). Commonly, characters that can't be expressed natively in
 *  the specified encoding will be escaped.
 */
public interface EscapeStrategy
{
    /**
     *  Called to inform us of the encoding that is being used.
     */
    public void setEncoding(String encoding);

    /**
     *  Return true if this character should be escaped. Note that the
     *  following types of characters are automatically escaped, and are
     *  not passed in to this method:
     *  <ul>
     *  <li>Characters that must be escaped because of the XML rules
     *  (e.g. ampersands, or quotes in attributes)
     *  <li>Characters larger than 16 bits (represented in Java by
     *  surrogate pairs)
     *  </ul>
     */
    public boolean shouldEscape(char ch);
}