[jdom-interest] SaxBuilder.build(url) and encoding

Jason Hunter jhunter at acm.org
Fri Dec 13 22:25:28 PST 2002

Elliotte Rusty Harold wrote:
> At 9:34 PM -0800 12/11/02, Jason Hunter wrote:
> >When you use a URL the underlying parser determines the encoding,
> >typically by looking at the declaration.
> Not necessarily. In an HTTP environment, the encoding specified by
> the MIME type takes precedence over the encoding specified by the XML
> document (though not all parsers get this right). If the HTTP header
> says the document is UTF-8 and the encoding declaration says ISO
> 8859-1, then the parser uses UTF-8. I have to double check this, but
> I also think that if the HTTP header says the document is text/xml
> without any encoding, then the parser picks US-ASCII regardless of
> what the encoding declaration says. Again, only some parsers
> correctly implement the spec here.

Rusty, I'd be interested in citations for that.  I find this in XML spec
section 4.3.3:

In the absence of information provided by an external transport protocol
(e.g. HTTP or MIME), it is an error for an entity including an encoding
declaration to be presented to the XML processor in an encoding other
than that named in the declaration, or for an entity which begins with
neither a Byte Order Mark nor an encoding declaration to use an encoding
other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary
ASCII entities do not strictly need an encoding declaration.

I note this doesn't say it's an error to use the encoding in the decl,
just that it's an error not to use it unless you have a reason, such as
the external transport giving you other indications.  The spec lets
parsers go with the encoding the external transport reports,
presumably to save time.  Parsers that don't do this optimization aren't
necessarily non-conformant.  If the encodings conflict, it seems this
paragraph allows two legitimate outcomes.
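
To make the two outcomes concrete, here is a minimal Java sketch.  It is
not how JDOM or any particular parser behaves; EncodingChoice, choose()
and the preferTransport flag are hypothetical names made up for
illustration.  It assumes the Content-Type header value and the first
bytes of the document are already available as strings, and it ignores
Byte Order Marks and any default charset implied by the media type.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of JDOM or any parser.
class EncodingChoice {
    // Matches the charset parameter of a Content-Type value.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Matches the encoding pseudo-attribute of a leading XML declaration.
    private static final Pattern DECL = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Returns the charset named by the transport, or null if none is given.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Returns the encoding named in the XML declaration, or null if absent.
    static String encodingFromDeclaration(String head) {
        if (head == null) return null;
        Matcher m = DECL.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // preferTransport = true:  the transport charset wins when both are present.
    // preferTransport = false: the declaration wins when both are present.
    // With neither source available, fall back to UTF-8 (BOM handling omitted).
    static String choose(String contentType, String head, boolean preferTransport) {
        String transport = charsetFromContentType(contentType);
        String declared = encodingFromDeclaration(head);
        String first = preferTransport ? transport : declared;
        String second = preferTransport ? declared : transport;
        if (first != null) return first;
        if (second != null) return second;
        return "UTF-8";
    }

    public static void main(String[] args) {
        String header = "text/xml; charset=UTF-8";
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>";
        System.out.println(choose(header, doc, true));   // prints UTF-8
        System.out.println(choose(header, doc, false));  // prints ISO-8859-1
    }
}
```

With the conflicting header and declaration from Rusty's example, the
first call reports UTF-8 and the second ISO-8859-1, which are the two
outcomes the paragraph seems to allow.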

