[jdom-interest] UTF8 charset issues...

Fri Oct 10 11:08:17 PDT 2003

Alex,

Well, I am pretty sure it is not working, because if I save my XML
document and then try to read it back in my Java app, I get the
following exception:

java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
         at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
         at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
         at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
....

The scenario to get this exception is:

1 - Create a JDOM Document and call element.setText("Æ") to set an
element's text value

2 - Save this Document (i.e. create a local XML file) as test.xml

3 - Read this XML document back, which leads to the above exception.
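
One possible explanation for the exception, offered as an assumption
since the saving code is not shown: the file may have been written
through a Writer that uses the platform default encoding rather than
UTF-8, so the byte written for Æ is not valid UTF-8 even though the XML
declaration says UTF-8. The sketch below uses only the Java standard
library (and APIs newer than Java 1.4). The class name Utf8Check and the
use of ISO-8859-1 as a stand-in for the platform default (on Mac OS X it
was likely MacRoman) are illustrative choices, not taken from the
original code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+00C6 is LATIN CAPITAL LETTER AE, the character from the scenario.
        String xml = "<TEXT>\u00C6</TEXT>";

        // Stand-in for "written with the platform default encoding" (assumption).
        byte[] asLatin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        // What a parser expects when the declaration says encoding="UTF-8".
        byte[] asUtf8 = xml.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(asLatin1)); // false
        System.out.println(isValidUtf8(asUtf8));   // true
    }
}
```

If this is what happens, a strict UTF-8 reader such as the one in the
stack trace would reject the file at the byte written for Æ.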

Note: If I use an XML-aware tool like oXygen to look at test.xml, the
'Æ' character shows up as '�'.
However, if I set the text like this before saving my document:
String text = "Æ";
byte[] bytes = text.getBytes("UTF8");
text = new String(bytes);
setText(text);

In that case my document is properly saved and I am able to read it
back in my Java app.
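
A hedged reading of why this workaround appears to help: getBytes("UTF8")
produces the two UTF-8 bytes for Æ, and new String(bytes) then decodes
them with the platform default encoding into two unrelated characters.
If the file is later written with that same default encoding (an
assumption, since the saving code is not shown), those two characters
turn back into the original UTF-8 bytes, so the two mismatches cancel
out on disk while the in-memory text is wrong. The class and method
names below are hypothetical, and ISO-8859-1 stands in for the platform
default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class WorkaroundDemo {
    // Stand-in for the platform default encoding (assumption).
    static final Charset ASSUMED_DEFAULT = StandardCharsets.ISO_8859_1;

    // Mirrors: new String(text.getBytes("UTF8")) with an explicit default.
    public static String workaround(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, ASSUMED_DEFAULT);
    }

    // Mirrors a Writer that encodes with the platform default.
    public static byte[] writeWithDefault(String text) {
        return text.getBytes(ASSUMED_DEFAULT);
    }

    public static void main(String[] args) {
        String original = "\u00C6"; // U+00C6, LATIN CAPITAL LETTER AE
        String mangled = workaround(original);
        byte[] onDisk = writeWithDefault(mangled);

        // The in-memory string is no longer one character.
        System.out.println(mangled.length()); // 2
        // Yet the bytes on disk match the real UTF-8 encoding.
        System.out.println(Arrays.equals(
                onDisk, original.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

Under these assumptions the workaround hides the problem rather than
fixing it: any code that inspects the element text before saving would
see two characters instead of Æ.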

I am using Java 1.4.1 on Mac OS X.
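
Because the platform matters here, the following sketch prints the JVM's
default charset and shows a round trip in which both the writer and the
reader name UTF-8 explicitly, so the platform default plays no part. It
uses the standard library only and APIs newer than Java 1.4; the class
name and the temporary file are hypothetical. With JDOM, the analogous
step would presumably be to hand XMLOutputter.output an OutputStream, or
a Writer constructed with an explicit UTF-8 encoding, instead of a
Writer that relies on the default.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitUtf8RoundTrip {
    // Writes text to a temporary file as UTF-8 and reads it back as UTF-8.
    public static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("test", ".xml");
            try {
                try (Writer out = new OutputStreamWriter(
                        Files.newOutputStream(file), StandardCharsets.UTF_8)) {
                    out.write(text);
                }
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The value printed here depends on the machine and JVM settings.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<TEXT>\u00C6</TEXT>"; // U+00C6, LATIN CAPITAL LETTER AE
        System.out.println(xml.equals(roundTrip(xml))); // true
    }
}
```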

Thanks again

Patrick

On 10 Oct 2003, at 6:34 PM, Alex Rosen wrote:

> "just calling Element.setText("Æ") does not generate a correct UTF-8  
> encoded document."
>
> How did you determine this? I.e. what tool did you use to look at the  
> document? What I'm getting at is, I think that the document was right,  
> but the tool you used to look at it made it look "wrong". Realize that  
> the *bytes* of the UTF-8 encoding of Æ are going to look like garbage  
> characters. If you view the file using a tool that uses any encoding  
> other than UTF-8, it'll look mangled, even though it's not. The viewer  
> you used (e.g. maybe Notepad or another text editor) probably read it  
> using your machine's default encoding (such as Latin 1), so it looked  
> garbled even though it was really OK (i.e. if your viewer used UTF-8  
> to show it to you, it would be fine.)
>
> Encoding issues are really confusing, unfortunately.
>
> Alex
>
>>>> Patrick JUSSEAU <patrick at openbase.com> 10/10/2003 8:35:20 AM >>>
> Hi all,
>
> I am trying to understand how JDOM handles character encodings. Here is
> what I am doing:
>
> I have a Java app which reads data from an XML file (UTF-8 encoded). I
> am able to get text just fine using
> String str = anElement.getText();
>
> The resulting str string (Unicode encoded) contains exactly what was
> defined in my XML file. The charset translation is transparent to me
> here. For example, if my XML document is:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE DOCUMENT SYSTEM "annonce.dtd">
> <DOCUMENT>
>      <TEXT>Æ</TEXT>
> </DOCUMENT>
>
> I get Æ in my str string.
>
>
> However, when I try to generate an XML document with this exact
> same Æ value, just calling Element.setText("Æ") does not generate a
> correct UTF-8 encoded document. I first have to do this manually in my
> code:
> 		String text = "Æ";
> 		try{
> 			byte[] bytes = text.getBytes("UTF8");
> 			String newText = new String(bytes);
> 			setText(newText);
> 		}catch(UnsupportedEncodingException uee){
> 			uee.printStackTrace();
> 		}
>
> Why do I have to do this for the XML generation to work? Why isn't JDOM
> taking care of the charset translation for me, since the resulting file
> has UTF-8 encoding specified in it?
>
> Thanks for any help
>
> Patrick
>