[jdom-interest] Parsing files starting with UTF-8 Byte Order Mark

Tue Jul 1 02:09:37 PDT 2003

Hi Peter, 

The UTF-8 byte order mark is optional, but unfortunately there is a known bug in Sun JVMs: the UTF-8 decoder does not ignore it. So if it's present, you'll see it as the first character (0xFEFF) when you read the decoded stream (Sun JVM bug #4508058, http://developer.java.sun.com/developer/bugParade/bugs/4508058.html). 

The typical workaround is to do the check yourself when reading the input stream, for example: 

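	// Needs java.io.InputStream, java.io.InputStreamReader and java.io.Reader
	InputStream stream = ...
	// Decode as UTF-8 so the BOM bytes (EF BB BF) arrive as the single char 0xFEFF;
	// a raw InputStream.read() returns one byte and could never equal 0xFEFF
	Reader in = new InputStreamReader(stream, "UTF-8");
	StringBuffer buf = new StringBuffer();
	int first = in.read();
	// Keep the first char unless the stream is empty or the char is the BOM
	if ((first != -1) && (first != 0xFEFF))
		buf.append((char) first);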

	... Read the rest of the stream ...

I haven't needed to use this with JDOM, but I expect you could get round the problem by using a java.io.PushbackReader. This wraps another Reader and allows you to read the first char, and if it is anything other than 0xFEFF, "push it back" into the Reader before passing the PushbackReader to SAXBuilder().build(). There may be more elegant ways round the problem too. 
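As a rough, untested-with-JDOM sketch of the PushbackReader idea: the class name BomSkipper and both method names below are made up for illustration, and the code assumes the file really is UTF-8. It uses only java.io, reads the first char, and pushes it back unless it is 0xFEFF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of JDOM or the JDK.
final class BomSkipper {
    private static final int BOM = 0xFEFF;

    /** Returns a Reader over the stream, positioned after a leading BOM if one is present. */
    static Reader skipBom(InputStream in) throws IOException {
        // Assumption: the content is UTF-8, so EF BB BF decodes to the single char 0xFEFF.
        PushbackReader reader = new PushbackReader(new InputStreamReader(in, "UTF-8"), 1);
        int first = reader.read();
        if (first != -1 && first != BOM) {
            reader.unread(first); // not a BOM: put the char back for the parser
        }
        return reader;
    }

    /** Demo only: decodes a byte array to a String with any leading BOM dropped. */
    static String decodeWithoutBom(byte[] bytes) {
        try {
            Reader reader = skipBom(new ByteArrayInputStream(bytes));
            StringBuilder text = new StringBuilder();
            for (int c = reader.read(); c != -1; c = reader.read()) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The Reader returned by skipBom(...) is what you would then hand to SAXBuilder().build() in place of the raw stream. A pushback buffer of one char is enough here because only the first char is ever inspected.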

Al.

> -----Original Message-----
> From: jdom-interest-admin at jdom.org 
> [mailto:jdom-interest-admin at jdom.org] On Behalf Of Peter Eriksson
> Sent: 01 July 2003 06:46
> To: jdom-interest at jdom.org
> Subject: [jdom-interest] Parsing files starting with UTF-8 
> Byte Order Mark
> 
> 
> Hello Everybody,
> 
> I have a problem with parsing some XML files generated from 
> .Net. It seems that the file starts with the Byte Order Mark 
> for UTF-8 (EF BB BF). If I try to load the file using jdom-b8 
> I get an exception. Is there some way that I can load files 
> with or without this Byte Order Mark transparently, i.e. 
> without an exception being thrown?
> 
> Anybody have a solution to the problem?
> 
> /Peter
> 