No subject


Fri Aug 6 17:04:17 PDT 2004


<record>
    États-unis
</record>

The above character É is properly encoded in UTF-8 as the bytes C3 89, which
decode to C9 (201 decimal), which is indeed the Unicode value for that character.
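As a quick sanity check of that arithmetic, here is a minimal sketch that decodes the two bytes with the JDK alone. The class and method names (Utf8Check, decodeFirstCodePoint) are made up for illustration and are not part of JDOM.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Check {

    // Decode the bytes as UTF-8 and return the Unicode value of the first character.
    static int decodeFirstCodePoint(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        // The two bytes that encode the accented capital E in UTF-8.
        byte[] bytes = { (byte) 0xC3, (byte) 0x89 };
        int codePoint = decodeFirstCodePoint(bytes);
        // Print the value in decimal and in U+XXXX notation.
        System.out.println(codePoint + " " + String.format("U+%04X", codePoint));
    }
}
```

Naming the charset explicitly keeps the result independent of the platform default.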

However, when I parse the document and then get the text from the element,
it appears that the 201 has turned into 131, which happens to be the code for
É in the Mac OS Roman charset.

So it looks like the element data is converted to the platform charset
rather than kept as Unicode.
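One way to test that suspicion is to compare the character's value inside the String with the byte it becomes when the String is encoded in a given charset. The sketch below uses only the JDK; the names CharsetProbe, unicodeValue and firstByteIn are hypothetical, and it assumes the JVM ships the "x-MacRoman" charset (the helper returns -1 if it does not). It does not show what JDOM does internally. It only shows where a 131 could come from if the text is encoded with the Mac charset at some point, for instance by a getBytes() call with no argument, which uses the platform default.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, for illustration only.
final class CharsetProbe {

    // Unicode value of the first character; no charset is involved here.
    static int unicodeValue(String s) {
        return s.codePointAt(0);
    }

    // First byte (as an unsigned value) of s encoded in the named charset,
    // or -1 if this JVM does not support that charset.
    static int firstByteIn(String s, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        return s.getBytes(Charset.forName(charsetName))[0] & 0xFF;
    }

    public static void main(String[] args) {
        String text = "\u00C9tats-unis"; // same text as in the record element
        System.out.println("Unicode value:   " + unicodeValue(text));
        System.out.println("x-MacRoman byte: " + firstByteIn(text, "x-MacRoman"));
        System.out.println("ISO-8859-1 byte: " + firstByteIn(text, "ISO-8859-1"));
        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```

If the value read directly from the String is 201 and the 131 only shows up after encoding, the parsed text is intact and the conversion happens at the point where the String is turned back into bytes or printed.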

I hope I'm simply missing something. Here is how I parse the data:

byte[] data; // comes from elsewhere (at this point the bytes are C3 89)
InputStream is = new ByteArrayInputStream(data);
SAXBuilder sax = new SAXBuilder();
sax.setIgnoringElementContentWhitespace(true);
Document received = sax.build(is);
// when I get there the character is 131 instead of 201

Any advice will be appreciated,

Eric





More information about the jdom-interest mailing list