[jdom-interest] UTF8 charset issues...

Alex Rosen arosen at novell.com
Fri Oct 10 09:34:24 PDT 2003


"just calling Element.setText("Æ") does not generate a correct UTF-8 encoded document."

How did you determine this? That is, what tool did you use to look at the document? What I'm getting at is that I think the document was right, but the tool you used to look at it made it look "wrong". Keep in mind that the *bytes* of the UTF-8 encoding of Æ (0xC3 0x86) will look like garbage characters in any tool that reads the file with an encoding other than UTF-8, even though the file is fine. The viewer you used (maybe Notepad or another text editor) probably read the file using your machine's default encoding (such as Latin-1), so it looked garbled even though it was really OK. If your viewer had used UTF-8 to display it, it would have looked fine.
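
To make the point about bytes concrete, here is a minimal sketch using only the standard Java library (no JDOM). The class and method names (Utf8BytesDemo, utf8Hex, view) are made up for illustration, and the "viewer" is simulated by decoding the bytes with a chosen charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Encode the text as UTF-8 and return the bytes as space-separated hex.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulate a viewer: take the UTF-8 bytes of the text and decode them
    // with whatever charset the (hypothetical) viewer assumes.
    static String view(String text, Charset viewerCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), viewerCharset);
    }

    public static void main(String[] args) {
        String ae = "\u00C6"; // the character Æ, written as an escape

        // The single character becomes two bytes in UTF-8.
        System.out.println(utf8Hex(ae)); // C3 86

        // A viewer that assumes UTF-8 recovers the original character.
        System.out.println(view(ae, StandardCharsets.UTF_8).equals(ae)); // true

        // A viewer that assumes Latin-1 shows two characters instead of one.
        System.out.println(view(ae, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

The file on disk is identical in both cases; only the charset the viewer assumes differs.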

Encoding issues are really confusing, unfortunately.
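
As a rough illustration of what the workaround quoted below does to the string, here is a sketch that repeats its two steps with the decoding charset made explicit. The original code calls new String(bytes), which uses the platform default charset; Latin-1 is assumed here purely for the sake of the example, and the names WorkaroundDemo, workaround and write are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class WorkaroundDemo {
    // Same two steps as the quoted workaround, except that the charset used
    // by new String(bytes) is passed in instead of being the platform default.
    static String workaround(String text, Charset assumedDefault) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, assumedDefault);
    }

    // Stand-in for writing text to a file with a given charset.
    static byte[] write(String text, Charset outputCharset) {
        return text.getBytes(outputCharset);
    }

    public static void main(String[] args) {
        String ae = "\u00C6"; // the character Æ
        Charset latin1 = StandardCharsets.ISO_8859_1; // assumed default

        // The workaround turns one character into two different characters.
        String mangled = workaround(ae, latin1);
        System.out.println(mangled.length()); // 2

        // Written out as Latin-1, the mangled string yields the same bytes
        // as the original string written out as UTF-8.
        System.out.println(Arrays.equals(
                write(mangled, latin1),
                write(ae, StandardCharsets.UTF_8))); // true

        // Written out as UTF-8, the mangled string is encoded twice and
        // takes four bytes instead of two.
        System.out.println(write(mangled, StandardCharsets.UTF_8).length); // 4
    }
}
```

So whether the workaround appears to help depends on which charset is used when the document is actually written and viewed, and the thread does not say which that was.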

Alex

>>> Patrick JUSSEAU <patrick at openbase.com> 10/10/2003 8:35:20 AM >>>
Hi all,

I am trying to understand how JDOM handles character encodings. Here is
what I am doing:

I have a Java app which reads data from an XML file (UTF-8 encoded). I
am able to get text just fine using
String str = anElement.getText();

The resulting str string (Unicode encoded) contains exactly what was
defined in my XML file. The charset translation here is transparent to
me. For example, if my XML document is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DOCUMENT SYSTEM "annonce.dtd">
<DOCUMENT>
     <TEXT>Æ</TEXT>
</DOCUMENT>

I get Æ in my str string.


However, when I try to generate an XML document with this exact
same Æ value, just calling Element.setText("Æ") does not generate a
correct UTF-8 encoded document. I first have to manually do this in my
code:
    String text = "Æ";
    try {
        byte[] bytes = text.getBytes("UTF8");
        String newText = new String(bytes);
        setText(newText);
    } catch (UnsupportedEncodingException uee) {
        uee.printStackTrace();
    }

Why do I have to do this for the XML generation to work? Why isn't JDOM
taking care of the charset translation for me, since the resulting file
has UTF-8 encoding specified in it?

Thanks for any help

Patrick
