[jdom-interest] XMLOutputter Entity escaping again

Alex Chaffee / Purple Technology guru at stinky.com
Fri Jun 6 06:42:04 PDT 2003


I've been off-list for a while, but I notice the latest CVS for
XMLOutputter has a mechanism for escaping entities and character
references.  This is great!

However, it only does half the job.  It allows me to specify, for a
given character, whether it should be escaped; but it does not say
*what* to escape it as.  There are several choices:

 - a named entity (as declared in the DTD) (&nbsp;)
 - a numeric char reference in decimal format (&#160;)
 - a numeric char reference in hex format (&#xA0;)

The present system only allows for decimal references.  This is
unfortunate, since many people use XML documents with entities as
defined in their DTDs, and it would be nice not to lose these in a
round-trip through JDOM.
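
To make the three forms concrete, here is a minimal sketch that renders one
character each way. The class name EscapeForms and its small name table are
hypothetical, made up for illustration; they are not part of JDOM or Jakarta
Commons, and a real table would come from the document's DTD.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not JDOM or Commons API.
final class EscapeForms {
    // Assumed sample mapping; in practice these names come from a DTD.
    static final Map<Character, String> NAMES =
            Map.of('\u00A0', "nbsp", '\u00E9', "eacute");

    // Named entity reference, or null if the character has no known name.
    static String named(char c) {
        String name = NAMES.get(c);
        return name == null ? null : "&" + name + ";";
    }

    // Numeric character reference in decimal.
    static String decimal(char c) {
        return "&#" + (int) c + ";";
    }

    // Numeric character reference in hexadecimal.
    static String hex(char c) {
        return "&#x" + Integer.toHexString(c).toUpperCase() + ";";
    }

    public static void main(String[] args) {
        char nbsp = '\u00A0';
        System.out.println(named(nbsp));   // &nbsp;
        System.out.println(decimal(nbsp)); // &#160;
        System.out.println(hex(nbsp));     // &#xA0;
    }
}
```

All three outputs denote the same character to an XML parser, but only the
first preserves the name the document author originally used.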

I've written and contributed to Jakarta Commons a class called
Entities that does escaping and unescaping.  With the use of this
class I was able to accomplish the escaping I wanted with the
following override:

        XMLOutputter xmlOutputter = new XMLOutputter()
        {
            public String escapeElementEntities(String text)
            {
                return Entities.HTML40.escape(text);
            }
        };

Essentially an Entities is a bidirectional map from character to
entity name, with escape() and unescape() methods.  I've been
optimizing it lately and it's pretty efficient.  You can fill it with
whatever set of value-name pairs you like.
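
As a rough illustration of the idea (not the actual Commons source), a
bidirectional map with escape() and unescape() could look like the sketch
below. The class name SimpleEntities and the add() method are assumptions
made for this example, and no attempt is made at the optimizations
mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; the real Entities class may differ
// in its names, signatures, and behavior.
final class SimpleEntities {
    private final Map<Character, String> nameByChar = new HashMap<>();
    private final Map<String, Character> charByName = new HashMap<>();

    // Register one value-name pair in both directions.
    void add(String name, char value) {
        nameByChar.put(value, name);
        charByName.put(name, value);
    }

    // Replace every character that has a name with &name;.
    String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = nameByChar.get(c);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Replace every known &name; with its character; leave unknown ones alone.
    String unescape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            Character value =
                    (semi > 0) ? charByName.get(text.substring(i + 1, semi)) : null;
            if (value != null) {
                out.append(value.charValue());
                i = semi + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Filled with only lt, gt, amp, quot and apos, such an object escapes just the
XML-predefined characters; filled with the entries of a DTD, it round-trips
that DTD's names.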

I'd be happy to donate it to JDOM as well if you think you can use it
(obviously, I think you can!)  I'm thinking that you can tell an
XMLOutputter (or a Format) to use a user-defined Entities class, which
the user can fill with entries from his own DTD; there would also be a
default one which only escapes the 5 XML-predefined chars (& < > " '),
and leaves everything else alone.

Note that it's not *quite* complete yet -- it still needs a pluggable
Strategy for when to numeric-escape (presently, for characters with no
named entry, it just escapes any character code above 127, i.e. anything
outside ASCII) and whether to use hex or decimal.
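
One possible shape for such a Strategy is sketched below. The interface
NumericEscapeStrategy and the class NumericStrategies are hypothetical names
invented for this example, not existing JDOM or Commons API. The "above 127"
rule is taken from the description above; pairing it with decimal output in
the first strategy is an assumption.

```java
// Hypothetical interface; nothing here is existing JDOM or Commons API.
interface NumericEscapeStrategy {
    // Decide whether a character with no named entity should be escaped.
    boolean shouldEscape(char c);

    // Produce the numeric character reference for the character.
    String format(char c);
}

final class NumericStrategies {
    // Escape anything above 127, written in decimal (assumed default).
    static final NumericEscapeStrategy DECIMAL_ABOVE_ASCII =
            new NumericEscapeStrategy() {
                public boolean shouldEscape(char c) { return c > 127; }
                public String format(char c) { return "&#" + (int) c + ";"; }
            };

    // Same rule for when to escape, but written in hexadecimal.
    static final NumericEscapeStrategy HEX_ABOVE_ASCII =
            new NumericEscapeStrategy() {
                public boolean shouldEscape(char c) { return c > 127; }
                public String format(char c) {
                    return "&#x" + Integer.toHexString(c).toUpperCase() + ";";
                }
            };

    // Apply a strategy to each UTF-16 char; surrogate pairs are not handled.
    static String apply(String text, NumericEscapeStrategy strategy) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (strategy.shouldEscape(c)) {
                out.append(strategy.format(c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Keeping "when" and "how" in one small interface lets a user swap in, say, a
strategy that escapes only characters the output encoding cannot represent.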

Also, it would be totally cool if someone, maybe from JDOM, could
write a routine that parses a DTD file and extracts the <!ENTITY>
references and creates a corresponding Entities object...
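
A very rough starting point for that routine is sketched here, under heavy
assumptions: it uses a regular expression rather than a real DTD parser, it
only recognizes XML-style declarations whose value is a single numeric
character reference (for example <!ENTITY nbsp "&#160;">), and it returns a
plain name-to-character map instead of an Entities object. The class name
DtdEntityScanner is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; a regex scan, not a conforming DTD parser.
final class DtdEntityScanner {
    // Matches <!ENTITY name "&#NNN;"> or <!ENTITY name '&#xHH;'>.
    // Group 1: name, group 2: quote, group 3: hex digits, group 4: decimal digits.
    private static final Pattern DECL = Pattern.compile(
            "<!ENTITY\\s+([A-Za-z_][\\w.-]*)\\s+([\"'])"
            + "&#(?:x([0-9A-Fa-f]{1,6})|([0-9]{1,7}));\\2\\s*>");

    // Collect name -> character pairs, in declaration order.
    static Map<String, Character> scan(String dtd) {
        Map<String, Character> result = new LinkedHashMap<>();
        Matcher m = DECL.matcher(dtd);
        while (m.find()) {
            int code = (m.group(3) != null)
                    ? Integer.parseInt(m.group(3), 16)
                    : Integer.parseInt(m.group(4), 10);
            // Skip code points that do not fit in a single Java char.
            if (code <= Character.MAX_VALUE) {
                result.put(m.group(1), (char) code);
            }
        }
        return result;
    }
}
```

Parameter entities, entities with text values, external entities and
SGML-style declarations are silently skipped, so this is only useful for
simple character-entity sets.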

Code is in Jakarta Commons CVS if you want to take a look.  I'm
offline as I write this so I can't find an exact URL but I hope it's
pretty obvious.

Later -

 - Alex

-- 
Alex Chaffee                               mailto:alex at jguru.com
Purple Technology - Code and Consulting    http://www.purpletech.com/
jGuru - Java News and FAQs                 http://www.jguru.com/alex/
Gamelan - the Original Java site           http://www.gamelan.com/
Stinky - Art and Angst                     http://www.stinky.com/


