[jdom-interest] non-ascii characters in xml document

Thu Nov 29 16:43:51 PST 2001

To get a simpler test case, I commented out my code that saves the XML in gzip format
and just wrote plain UTF-8 XML to a file and read it back. The "curly" single and
double quote characters give me exceptions like this:

     [java] org.jdom.JDOMException: Error on line 1 of document
file:/C:/Development/Projects/HierarchicalPIM/default.xml: Character
conversion error: "Unconvertible UTF-8 character beginning with
0x92" (line number may be too low).
     [java]     at org.jdom.input.SAXBuilder.build(SAXBuilder.java:296)
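
For reference, this error is what a strict UTF-8 decoder would be expected to report
if the file contains a raw 0x92 byte: in UTF-8, bytes in the range 0x80-0xBF can only
continue a multi-byte sequence and can never begin a character. The sketch below uses
only the JDK, not JDOM or the parser that raised the exception; the class and method
names are made up for illustration, and the sample bytes are an assumption about what
the file might contain.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical file content: "it's" with a raw 0x92 byte as the apostrophe.
        byte[] rawByteQuote = { 'i', 't', (byte) 0x92, 's' };
        // The same text with the curly quote (U+2019) properly encoded as UTF-8.
        byte[] utf8Quote = "it\u2019s".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(rawByteQuote)); // false
        System.out.println(isValidUtf8(utf8Quote));    // true
    }
}
```

If a check like this fails on the saved file, the bytes were probably written with
some encoding other than UTF-8, whatever the XML declaration says.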

The parser sees the single and double quote chars as bytes 0x92 and 0x93, respectively.
Maybe these characters aren't Unicode. Could they be Windows-specific character codes,
since the text is being pasted from a Windows application into a Java app?
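
One way to test that guess, sketched with plain JDK calls: in the windows-1252 code
page, 0x92 and 0x93 are the right single and left double curly quotes, which
correspond to the Unicode characters U+2019 and U+201C. The class and helper names
below are hypothetical, and the example assumes the JDK in use ships the
"windows-1252" charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Quotes {
    // Assumption: the "windows-1252" charset is available in this JDK.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Interpret raw bytes as windows-1252 text.
    public static String fromCp1252(byte[] bytes) {
        return new String(bytes, CP1252);
    }

    // Format bytes as space-separated upper-case hex, e.g. "E2 80 99".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String quotes = fromCp1252(new byte[] { (byte) 0x92, (byte) 0x93 });

        // Print the Unicode code points behind the two bytes.
        for (char c : quotes.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);   // U+2019, then U+201C
        }

        // Properly encoded as UTF-8, each quote takes three bytes.
        System.out.println(hex(quotes.getBytes(StandardCharsets.UTF_8)));
        // E2 80 99 E2 80 9C

        // ISO-8859-1 has no curly quotes, so each one is replaced with '?'.
        System.out.println(hex(quotes.getBytes(StandardCharsets.ISO_8859_1)));
        // 3F 3F
    }
}
```

So a raw 0x92 or 0x93 in a file labelled UTF-8 would suggest the text was written
with the platform default encoding rather than UTF-8, and the 3F bytes would match
the question marks seen after switching to ISO-8859-1.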

Jason Hunter wrote:

> UTF-8 is a charset that contains all Unicode characters, so on output it
> should be properly encoded.  The char will be encoded in 2 bytes though,
> so readers that don't know the byte stream is in UTF-8 format will be
> confused.
>
> -jh-
>
> Dave Neuendorf wrote:
> >
> > The problem originated when I was using the default UTF-8. The quote
> > characters were not properly represented in the xml file, which is why I
> > decided to try ISO-8859-1 encoding (which didn't help). The JTextArea into
> > which the text had been pasted was properly displaying the quote characters
> > until the text was read back in from xml, at which point the bogus characters
> > from the xml were displayed.
> >
> > Jason Hunter wrote:
> >
> > > > I'm working on an application, in which the user is allowed to paste
> > > > text into a JTextArea. The text can include "curly" single and double
> > > > quotes, and presumably other non-ascii characters. When the text is
> > > > written to an xml file from a jdom Document, each such character is
> > > > replaced in the file with some other non-ascii character. I tried
> > > > changing the encoding from the default UTF-8 to ISO-8859-1, but the
> > > > result is that now the replacement character is always a question mark.
> > >
> > > If you're using UTF-8, all Unicode characters can and will be
> > > represented and you'll have them nicely encoded in UTF-8 format.  If it
> > > shows up as a ? for you, it's probably because your viewer isn't
> > > recognizing that the characters are encoded as UTF-8, or it doesn't have
> > > the glyph necessary to display the chars.
> > >
> > > -jh-