[jdom-interest] Dealing with binary characters in-memory -> outputter
mbennett at ideaeng.com
Fri Sep 21 02:14:51 PDT 2001
Thanks for your suggestion.
I had tried UTF-8, but the outputter seemed to ignore it.
I agree, if authoring XML in an ASCII editor, that would
be a fine way to do it.
And I hear what you're saying about the different encodings
having different characters.
But how about this, for a given encoder:
* Is this character in my map?
  If so, output it as it is mapped.
  If not, use the generic escape sequence &#xNN;
So instead of tracking rules for every character, it would
simply need to know that the character wasn't in its map, so it should
therefore use the generic escaping.
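
A minimal sketch of that "in the map, or else escape" rule, assuming the java.nio.charset.CharsetEncoder class from JDK 1.4 is available. EncoderAwareEscaper and its escape method are hypothetical names for illustration, not part of JDOM or XMLOutputter, and the sketch leaves markup characters such as & and < alone, since those would still need their usual escaping.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper for illustration only; not part of JDOM.
class EncoderAwareEscaper {

    // Returns text with every character the target encoding cannot
    // represent replaced by a numeric character reference (&#xNN;).
    static String escape(String text, String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            String chunk = text.substring(i, i + length);
            if (encoder.canEncode(chunk)) {
                // The character is in the encoder's map: write it as is.
                out.append(chunk);
            } else {
                // Not in the map: fall back to the generic escape.
                out.append("&#x")
                   .append(Integer.toHexString(codePoint).toUpperCase())
                   .append(';');
            }
            i += length;
        }
        return out.toString();
    }
}
```

With this, escape("\u00A9 2001", "US-ASCII") would produce the escaped form, while the same call with "ISO-8859-1" or "UTF-8" would leave the copyright sign untouched, because those encodings can represent it.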
From: Attila Szegedi [mailto:szegedia at freemail.hu]
Sent: Friday, September 21, 2001 1:45 AM
To: jdom-interest at jdom.org
Cc: Mark Bennett
Subject: Re: [jdom-interest] Dealing with binary characters in-memory ->
Just a wild guess, but U+00A9 should be valid in Unicode. The XML spec's
"Char" production also does not exclude it. If you use UTF-16 or UTF-8 as
your output encoding, that should work...
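
To make the encoding point concrete, here is a small sketch of how U+00A9 looks as bytes. It assumes a modern JDK (StandardCharsets needs Java 7), and CopyrightBytes is a hypothetical name. The single byte 0xA9 is the ISO-8859-1 form; UTF-8 needs two bytes, and a lone 0xA9 is not valid UTF-8, which would explain a file that cannot be read back when writer and reader disagree on the encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class CopyrightBytes {

    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // True if the bytes decode cleanly as UTF-8, false if malformed.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Here latin1Bytes("\u00A9") is the one byte 0xA9 and fails isValidUtf8, while utf8Bytes("\u00A9") is the pair 0xC2 0xA9 and passes.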
The XMLOutputter authors do a pretty good job of &# escaping "common
renegade" characters, so maybe the ultimate solution is to add this one to
the set... The problem is that for every encoding, the set of chars that
must be escaped is different, and solving this problem on a per-encoding
basis would be too expensive, either in memory or in time terms. Using the
newly-introduced CharsetEncoder class in java.nio.charset of JDK 1.4 should help, but
it'll take time until it gets mainstream...
----- Original Message -----
From: "Mark Bennett" <mbennett at ideaeng.com>
To: <jdom-interest at jdom.org>
Cc: "Mark Bennett" <mbennett at ideaeng.com>
Sent: September 21, 2001 9:52
Subject: [jdom-interest] Dealing with binary characters in-memory ->
> I'm aware that when creating XML files
> you must escape invalid binary characters
> with the & sequences.
> But I'm having a slightly different problem.
> I'm building a JDOM tree in memory, by creating
> Elements and using the setContent( some string)
> methods. I get the bits of text from outside
> sources, often as web content.
> Sometimes these bits of text have invalid XML
> characters (invalid HTML as well, but web
> browsers allow it). As an example, I might
> get the Microsoft non-standard single-byte
> copyright symbol, 0xA9 I believe.
> .addContent() doesn't complain about this.
> Later I output this JDOM tree to an XML file
> using the standard JDOM output methods,
> with no complaint.
> Then later on I try to read it back in.
> BAM! That's when I have the problem.
> When JDOM output the file in the previous
> step, it didn't convert the single byte
> into an & hex escape sequence. It DOES
> escape < into &lt;, > into &gt;, etc.
> So I would have thought it would just
> turn the single byte 0xA9 into the 6 byte
> sequence "&#xA9;" in the resulting file,
> just the way an XML author would write it.
> Then it would unescape it just fine on the
> way back in.
> I believe this would be the correct behavior
> but it doesn't seem to work this way?
> This seems like a really simple thing to do -
> output XML and then read it
> back in - but nothing I've tried or thought
> about seems to work.
> What I've tried / considered:
> * Just clean up the strings I get in - dump
> invalid characters.
> I really don't want to do this. The
> "invalid" character was in the web page
> I scanned, I may need it later, I don't
> want to just delete it.
> * Set a particular encoding on the
> way out and on the way in.
> This doesn't seem to make any difference.
> I think because 0xA9 is not part of any
> standard code page, it is ignored.
> * Scan the inbound text and apply the escape
> sequence to it before sending the string
> to .addContent().
> I don't think this will work. If I change
> 0xA9 into &#xA9; then write it to disk I
> believe the & will get escaped into &amp;
> and I'll wind up with &amp;#xA9; on disk,
> which won't turn back into the single byte
> 0xA9 when it comes back in.
> * Try wrapping the text into a CDATA node.
> Oddly this doesn't work. The outputter does
> include the CDATA tags, but when I read it
> back in I still get an error.
> Apparently even CDATA sections are checked
> for binary characters? I thought CDATA was
> supposed to get around this?
> * Don't use any of the outputter stuff?
> This seems really extreme to me. I suppose it's
> possible but this seems overly complex.
> * Some type of strange mime encoding or something?
> Again this seems like way overkill?
> * Maybe override just the Element and CDATA methods
> and change just a few of the methods?
> I'm a little shaky on overriding JDOM components.
> It's not just this one character. I'm getting
> fairly random bits of single byte content from
> the outside world. If I do anything strange
> to catch this I'll have to do it to virtually
> every bit of data I ever encounter.
> I'd really appreciate any thoughts on this?
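
The round-trip symptoms described in the quoted message can be reproduced without JDOM. The sketch below is an assumption-laden illustration, not the original setup: it uses the JDK's built-in JAXP parser rather than JDOM's SAXBuilder, and RoundTripCheck is a hypothetical name. It suggests the failure happens when the parser decodes bytes, before it ever looks at markup, which would be why a CDATA section does not help while a character reference or a matching encoding declaration does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration only; uses JAXP, not JDOM.
class RoundTripCheck {

    // Encodes the string one byte per character, so U+00A9 becomes 0xA9.
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Returns the root element's text, or null if the bytes do not parse.
    static String parseRootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Covers both malformed XML and undecodable byte sequences.
            return null;
        }
    }
}
```

With no encoding declaration the parser assumes UTF-8, so latin1("<r>\u00A9</r>") and the same text wrapped in CDATA both fail to parse, whereas latin1("<r>&#xA9;</r>") and a document that declares encoding="ISO-8859-1" both come back as the copyright sign.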