[jdom-interest] Re: Getting original Encoding and changing the default UTF-8
matthew.young at rfv.sfa.se
Fri Sep 10 01:11:00 PDT 2004
Exactly. We have included child documents that should declare a certain
ISO encoding but don't. Then the default takes over and the Swedish characters
bomb the parser.
The simplest thing is to demand that our projects deliver documents with the
correct encoding.
I find it odd that when transforming with XSLT (say Xalan), the encoding of
the style sheet overrides that of all the input XML documents. It seems XML
parsers should apply the same principle to child documents "included" in a
parent XML. If the main XML says the encoding should be XYZ, then regardless of
what is stated in the headers of the subdocuments, the document gets translated
with XYZ encoding.
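
One way to force an encoding onto a child document that fails to declare one, sketched here with the JDK's own DOM parser rather than JDOM: decode the bytes yourself and hand the parser a Reader, so it never falls back to UTF-8. The names ForcedEncodingParse and parseAs are made up for illustration. JDOM's SAXBuilder also has a build(Reader) overload that should behave similarly, though that is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JDOM or any library.
final class ForcedEncodingParse {

    /** Parses XML bytes, decoding them with the given charset regardless of any declaration. */
    static Document parseAs(byte[] xmlBytes, Charset charset) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset);
        // With a character stream on the InputSource, the parser reads already-decoded
        // characters and does not do its own encoding detection.
        InputSource source = new InputSource(reader);
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }

    public static void main(String[] args) throws Exception {
        // A child document in ISO-8859-1 with no XML declaration at all.
        byte[] child = "<namn>Malm\u00f6</namn>".getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parseAs(child, StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

This silently overrides whatever the child document declares, so it only makes sense when the declaration is known to be missing or wrong.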
Jason Hunter (2004-09-10 09:53):
Young Matthew wrote:
> Regarding the default encoding, I'm thinking more about the front end and not
> printing. In other words, before parsing a document it would be cool if I could
> shift the encoding to something other than UTF-8 to handle Swedish characters.
XML files generally have their encoding listed in the declaration if
they're not UTF-8, so the parser can automatically determine the proper
encoding to use. Getting the data in correctly isn't an issue; the
issue arises if you want to encode the document the same way on output
instead of using the universal UTF-8 encoding. SAX doesn't report what
the original encoding was; it just returns the already-decoded characters.
Another builder, like an XNI builder, could report the encoding. The
Document class doesn't currently have an encoding property but we could
add one if we had a parser that reported it. That is, assuming it's a
document-level notion. The story's less clear when pulling together
elements from multiple documents. If the original Document node was
Latin-1 but you included an Element from a Shift_JIS document, you can't
reliably assume Latin-1 for the new document.
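
A rough sketch of keeping the original encoding on output, using the JDK's DOM parser, which exposes the declared encoding through Document.getXmlEncoding(). This is not JDOM API: the names PreserveEncoding and roundTrip are invented for illustration, and falling back to UTF-8 when nothing is declared is an assumption of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper, not part of JDOM or any library.
final class PreserveEncoding {

    /** Parses a document and serializes it again in the encoding its declaration named. */
    static byte[] roundTrip(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));

        // getXmlEncoding() returns the encoding from the XML declaration, or null if none.
        String declared = doc.getXmlEncoding();
        String encoding = (declared != null) ? declared : "UTF-8"; // assumed fallback

        // An identity transform, used here only as a serializer.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, encoding);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><namn>Malm\u00f6</namn>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = roundTrip(latin1);
        System.out.println(new String(out, StandardCharsets.ISO_8859_1));
    }
}
```

This only covers the single-document case. When elements from differently encoded documents are merged, no one original encoding is guaranteed to cover every character, so UTF-8 is the safe output choice there.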