[jdom-interest] Dealing with binary characters in-memory -> outputter

Fri Sep 21 00:52:30 PDT 2001

Hello,

I'm aware that when creating XML files
you must escape invalid binary characters
with & escape sequences (character
references such as &#xA9;).

But I'm having a slightly different problem.

I'm building a JDOM tree in memory by creating
Elements and calling addContent( some string )
on them.  I get the bits of text from outside
sources, often as web content.

Sometimes these bits of text have invalid XML
characters (invalid HTML as well, but web
browsers allow it).  As an example, I might
get the Microsoft non-standard single-byte
copyright symbol, 0xA9 I believe.

.addContent() doesn't complain about this.

Later I output this JDOM tree to an XML file
using the standard JDOM output methods,
with no complaint.

Then later on I try to read it back in.
BAM!  That's when I have the problem.
When JDOM output the file in the previous
step, it didn't convert the single byte
into an & hex escape sequence.  It DOES
escape < into &lt;, > into &gt;, etc.

So I would have thought it would just
turn the single byte 0xA9 into the 6 byte
sequence " &#xA9; " in the resulting file,
just the way an XML author would write it.
Then it would unescape it just fine on the
way back in.
I believe this would be the correct behavior
but it doesn't seem to work this way?
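
To make the expected behavior concrete, here is a minimal sketch of
the escaping described above, written against the plain Java library
only.  CharRefEscaper is a hypothetical helper, not a JDOM class, and
it assumes the text is already a Java String (so the 0xA9 byte has
become the character U+00A9).  It does not deal with control
characters that XML forbids outright.

```java
// Hypothetical helper for illustration; not part of JDOM.
final class CharRefEscaper {
    private CharRefEscaper() {}

    /**
     * Escapes the markup characters &, < and >, and writes every
     * non-ASCII code point as a hex character reference (&#xHEX;).
     */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // handles surrogate pairs
            i += Character.charCount(cp);
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:
                    if (cp < 0x80) {
                        out.append((char) cp); // plain ASCII passes through
                    } else {
                        out.append("&#x")
                           .append(Integer.toHexString(cp).toUpperCase())
                           .append(';');
                    }
            }
        }
        return out.toString();
    }
}
```

Output produced this way contains only ASCII bytes, so it should read
back the same regardless of which encoding the parser assumes.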

This seems like a really simple thing to do -
output XML and then read it
back in - but nothing I've tried or thought
about seems to work.

What I've tried / considered:

* Just clean up the strings I get in - dump
  invalid characters.

  I really don't want to do this.  The
  "invalid" character was in the web page
  I scanned, I may need it later, I don't
  want to just delete it.

* Set a particular encoding on the
  way out and on the way in.

  This doesn't seem to make any difference.
  I think because 0xA9 is not part of any
  standard code page, it is ignored.

* Scan the inbound text and apply the escape
  sequence to it before sending the string
  to .addContent().

  I don't think this will work.  If I change
  0xA9 into &#xA9; then write it to disk I
  believe the & will get escaped into &amp;
  and I'll wind up with &amp;#xA9; on disk,
  which won't turn back into the single byte
  0xA9 when it comes back in.

* Try wrapping the text into a CDATA node.

  Oddly this doesn't work.  The outputter does
  include the CDATA tags, but when I read it
  back in I still get an error.

  Apparently even CDATA sections are checked
  for binary characters?  I thought CDATA was
  supposed to get around this?

* Don't use any of the outputter stuff?

  This seems really extreme to me.  I suppose it's
  possible, but it seems overly complex.

* Some type of strange mime encoding or something?

  Again, this seems like overkill.

* Maybe subclass just Element and CDATA
  and override just a few of the methods?
  I'm a little shaky on overriding JDOM components.
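
For the encoding idea in the list above, one way to check what is
going on is to test whether the raw bytes decode cleanly in a given
charset.  This is only a sketch using the standard java.nio classes.
EncodingCheck is a hypothetical name, and the assumption is that the
file holds a lone 0xA9 byte while the parser reads it as UTF-8 (the
XML default when no encoding is declared).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration; not part of JDOM.
final class EncodingCheck {
    private EncodingCheck() {}

    /**
     * Returns true if the bytes decode in the given charset without
     * any malformed or unmappable input (strict, no replacement).
     */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A single 0xA9 byte passes this check for ISO-8859-1 but fails it for
UTF-8, where the same character takes the two bytes 0xC2 0xA9.  If
that is the situation here, the encoding used by the writer and the
encoding named in the XML declaration would both have to agree.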

It's not just this one character.  I'm getting
fairly random bits of single byte content from
the outside world.  If I do anything strange
to catch this I'll have to do it to virtually
every bit of data I ever encounter.
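
Since the same check would have to run on every incoming string, a
single scan against the character ranges that XML 1.0 allows might be
a starting point.  The sketch below follows the Char production from
the XML 1.0 specification.  XmlCharScan is a hypothetical helper, not
a JDOM class, and it only reports a problem without deciding what to
do about it.

```java
// Hypothetical helper for illustration; not part of JDOM.
final class XmlCharScan {
    private XmlCharScan() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index of the first char that is not legal in XML 1.0,
     * or -1 if the whole string is legal.  An unpaired surrogate is
     * reported as illegal.
     */
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

By this test U+00A9 itself is a legal XML character, while most
control characters below 0x20 are not, so a scan like this may help
separate encoding problems from genuinely forbidden characters.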

I'd really appreciate any thoughts on this?

Thanks,
Mark