<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2655.35">
<TITLE>RE: [jdom-interest] B9-rc1: inputstreams, or readers: Invalid encoding name "KSC5601"</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2>My point is that the data passes SAXBuilder if it is processed as a Reader, but fails if it is processed as an InputStream.</FONT>
</P>
<P><FONT SIZE=2>In other words, the encoding is handled "just fine" when the data is supplied through a Reader InputSource, and is rejected only when it is supplied as an InputStream.</FONT>
</P>
<P><FONT SIZE=2>As I say, I am unsure of where the bug is, or even IF this is a bug, but it certainly is suspicious.</FONT>
</P>
<P><FONT SIZE=2>Attached is the zipped XML document, which fails the well-formedness check as a byte stream but passes as a Reader.</FONT>
</P>
<P><FONT SIZE=2>Here is my test code:</FONT>
</P>
<P><FONT SIZE=2>==============================</FONT>
<BR><FONT SIZE=2>import java.io.FileInputStream;</FONT>
<BR><FONT SIZE=2>import java.io.FileReader;</FONT>
</P>
<P><FONT SIZE=2>import org.jdom.input.SAXBuilder;</FONT>
</P>
<P><FONT SIZE=2>public class MainParse {</FONT>
</P>
<P><FONT SIZE=2> public static void main(String[] args) {</FONT>
<BR><FONT SIZE=2> try {</FONT>
<BR><FONT SIZE=2> new SAXBuilder().build(new FileInputStream(args[0]));</FONT>
<BR><FONT SIZE=2> System.out.println("PASSED: Processed file as an input stream.");</FONT>
<BR><FONT SIZE=2> } catch (Exception e) {</FONT>
<BR><FONT SIZE=2> System.out.println("FAILED: Processed file as an input stream.");</FONT>
<BR><FONT SIZE=2> e.printStackTrace();</FONT>
<BR><FONT SIZE=2> }</FONT>
<BR><FONT SIZE=2> try {</FONT>
<BR><FONT SIZE=2> new SAXBuilder().build(new FileReader(args[0]));</FONT>
<BR><FONT SIZE=2> System.out.println("PASSED: Processed file as a Reader.");</FONT>
<BR><FONT SIZE=2> } catch (Exception e) {</FONT>
<BR><FONT SIZE=2> System.out.println("FAILED: Processed file as a Reader.");</FONT>
<BR><FONT SIZE=2> e.printStackTrace();</FONT>
<BR><FONT SIZE=2> }</FONT>
<BR><FONT SIZE=2> }</FONT>
<BR><FONT SIZE=2>}</FONT>
<BR><FONT SIZE=2>==================================</FONT>
</P>
<P><FONT SIZE=2>and this is my output from the command:</FONT>
<BR><FONT SIZE=2>java -cp .:./lib/jaxen-jdom.jar:./lib/jdom.jar:./lib/xerces.jar MainParse mydoc_raw.xml</FONT>
</P>
<BR>
<P><FONT SIZE=2>FAILED: Processed file as an input stream.</FONT>
<BR><FONT SIZE=2>org.jdom.input.JDOMParseException: Error on line 1: Invalid encoding name "KSC5601".</FONT>
<BR><FONT SIZE=2> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:381)</FONT>
<BR><FONT SIZE=2> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:684)</FONT>
<BR><FONT SIZE=2> at MainParse.main(MainParse.java:23)</FONT>
<BR><FONT SIZE=2>Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".</FONT>
<BR><FONT SIZE=2> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)</FONT>
<BR><FONT SIZE=2> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)</FONT>
<BR><FONT SIZE=2> ... 2 more</FONT>
<BR><FONT SIZE=2>Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".</FONT>
<BR><FONT SIZE=2> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)</FONT>
<BR><FONT SIZE=2> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)</FONT>
<BR><FONT SIZE=2> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:684)</FONT>
<BR><FONT SIZE=2> at MainParse.main(MainParse.java:23)</FONT>
<BR><FONT SIZE=2>Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".</FONT>
<BR><FONT SIZE=2> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)</FONT>
<BR><FONT SIZE=2> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)</FONT>
<BR><FONT SIZE=2> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:684)</FONT>
<BR><FONT SIZE=2> at MainParse.main(MainParse.java:23)</FONT>
<BR><FONT SIZE=2>PASSED: Processed file as a Reader.</FONT>
</P>
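<P><FONT SIZE=2>One possible reading of this result, offered as an assumption rather than a confirmed diagnosis: the SAX InputSource documentation says that a parser given a character stream reads it directly and disregards any encoding declaration in it. If that is what happens here, the Reader case never has to resolve "KSC5601", while the byte stream case does. Note also that FileReader decodes with the platform default charset, so a "pass" through the Reader does not show that the Korean text was decoded correctly. The sketch below uses the JAXP parser bundled with the JDK instead of JDOM so that it is self-contained; the class name ReaderVersusStream and its helper methods are hypothetical.</FONT>
</P>
<PRE>
```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderVersusStream {

    // \u003c is the escape for the less-than sign, so this listing stays intact inside HTML.
    // The document is ASCII-only apart from its declared encoding name.
    static final String XML =
            "\u003c?xml version=\"1.0\" encoding=\"KSC5601\"?>\u003cdoc/>";

    /** Returns true if the source parses as well-formed XML, false on any parser error. */
    static boolean parses(InputSource source) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler());
            return true;
        } catch (Exception e) {
            System.out.println("parse failed: " + e.getMessage());
            return false;
        }
    }

    /** Character stream: the text is already decoded before the parser sees it. */
    static boolean parsesAsReader() {
        return parses(new InputSource(new StringReader(XML)));
    }

    /** Byte stream: the parser must pick a decoder from the encoding declaration. */
    static boolean parsesAsStream() {
        byte[] bytes = XML.getBytes(StandardCharsets.US_ASCII);
        return parses(new InputSource(new ByteArrayInputStream(bytes)));
    }

    public static void main(String[] args) {
        System.out.println("as Reader:      " + parsesAsReader());
        System.out.println("as InputStream: " + parsesAsStream());
    }
}
```
</PRE>
<P><FONT SIZE=2>The result of the InputStream case depends on which encoding names the parser on the classpath accepts, so it is printed rather than assumed.</FONT>
</P>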
<P><FONT SIZE=2>Rolf</FONT>
</P>
<BR>
<BR>
<P><FONT SIZE=2>-----Original Message-----</FONT>
<BR><FONT SIZE=2>From: Jason Hunter [<A HREF="mailto:jhunter@acm.org">mailto:jhunter@acm.org</A>]</FONT>
<BR><FONT SIZE=2>Sent: Wednesday, April 16, 2003 6:48 PM</FONT>
<BR><FONT SIZE=2>To: Rolf Lear</FONT>
<BR><FONT SIZE=2>Cc: Jdom-Interest (E-mail)</FONT>
<BR><FONT SIZE=2>Subject: Re: [jdom-interest] B9-rc1: inputstreams, or readers: Invalid</FONT>
<BR><FONT SIZE=2>encoding name "KSC5601"</FONT>
</P>
<BR>
<P><FONT SIZE=2>It may be that the encoding name isn't known to XML but may be known to</FONT>
<BR><FONT SIZE=2>Java. There's a Xerces feature to tell it to respect Java names for</FONT>
<BR><FONT SIZE=2>encodings. Try that.</FONT>
</P>
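<P><FONT SIZE=2>A minimal sketch of that suggestion, not tested against the attached file. The feature URI below is the Xerces feature for accepting Java encoding names. The class name AllowJavaEncodings and the helper parses are hypothetical, and the sample uses the JAXP parser bundled with the JDK rather than JDOM so that it is self-contained. With JDOM, the equivalent would presumably be a call to SAXBuilder.setFeature with the same URI before build, assuming the JDOM version in use provides setFeature and the underlying parser is Xerces.</FONT>
</P>
<PRE>
```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.helpers.DefaultHandler;

public class AllowJavaEncodings {

    // Xerces-specific feature URI; other parsers may not recognize it.
    static final String ALLOW_JAVA_ENCODINGS =
            "http://apache.org/xml/features/allow-java-encodings";

    /** Returns true if the bytes parse as well-formed XML, false on any parser error. */
    static boolean parses(byte[] xml, boolean allowJavaEncodings) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            if (allowJavaEncodings) {
                // Ask the parser to accept Java charset names in the XML declaration.
                factory.setFeature(ALLOW_JAVA_ENCODINGS, true);
            }
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler());
            return true;
        } catch (Exception e) {
            System.out.println("parse failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // \u003c is the escape for the less-than sign, so this listing stays intact inside HTML.
        // ASCII-only sample document that declares the encoding name from the error message.
        byte[] xml = "\u003c?xml version=\"1.0\" encoding=\"KSC5601\"?>\u003cdoc/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("feature off: " + parses(xml, false));
        System.out.println("feature on:  " + parses(xml, true));
    }
}
```
</PRE>
<P><FONT SIZE=2>Whether the second call succeeds also depends on the Java runtime recognizing "KSC5601" as a charset name, so no output is assumed here.</FONT>
</P>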
<P><FONT SIZE=2>-jh-</FONT>
</P>
<P><FONT SIZE=2>> Rolf Lear wrote:</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> I have been trying to find/fix performance issues in JDOM, and was</FONT>
<BR><FONT SIZE=2>> playing around with the Verifier.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> To test the effect of changes to the Verifier, I first load an XML</FONT>
<BR><FONT SIZE=2>> Document into memory, then parse it using SAXBuilder.build.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> To test weird XML, I found this:</FONT>
<BR><FONT SIZE=2>> <A HREF="http://ropas.kaist.ac.kr/viewcvs/viewcvs.cgi/*checkout*/n/nXml/testdata/document/mydoc_raw.xml?rev=HEAD&content-type=text/xml" TARGET="_blank">http://ropas.kaist.ac.kr/viewcvs/viewcvs.cgi/*checkout*/n/nXml/testdata/document/mydoc_raw.xml?rev=HEAD&content-type=text/xml</A></FONT></P>
<P><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> which is partially Korean.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> First, remove the Doctype declaration in the document.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> My program does the following (see the code at the end).</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> It loads the file up as an array of bytes.</FONT>
<BR><FONT SIZE=2>> It loads the file up as an array of chars.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> It parses each through SAXBuilder.build using an inputstream on the</FONT>
<BR><FONT SIZE=2>> bytes, and a reader on the chars.</FONT>
<BR><FONT SIZE=2>> InputSource source = new InputSource(new</FONT>
<BR><FONT SIZE=2>> ByteArrayInputStream(bytedata));</FONT>
<BR><FONT SIZE=2>> and</FONT>
<BR><FONT SIZE=2>> InputSource source = new InputSource(new CharArrayReader(chardata));</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Now, parsing the Reader passes, and the InputStream fails with:</FONT>
<BR><FONT SIZE=2>> Invalid encoding name "KSC5601" (in Xerces).</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> org.jdom.input.JDOMParseException: Error on line 1: Invalid encoding name "KSC5601".</FONT>
<BR><FONT SIZE=2>> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:381)</FONT>
<BR><FONT SIZE=2>> at MainTest.main(MainTest.java:77)</FONT>
<BR><FONT SIZE=2>> Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".</FONT>
<BR><FONT SIZE=2>> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)</FONT>
<BR><FONT SIZE=2>> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)</FONT>
<BR><FONT SIZE=2>> ... 1 more</FONT>
<BR><FONT SIZE=2>> Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".</FONT>
<BR><FONT SIZE=2>> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)</FONT>
<BR><FONT SIZE=2>> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)</FONT>
<BR><FONT SIZE=2>> at MainTest.main(MainTest.java:77)</FONT>
<BR><FONT SIZE=2>> Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".</FONT>
<BR><FONT SIZE=2>> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)</FONT>
<BR><FONT SIZE=2>> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)</FONT>
<BR><FONT SIZE=2>> at MainTest.main(MainTest.java:77)</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Now I am the first to admit that my Unicode/charset knowledge is</FONT>
<BR><FONT SIZE=2>> really flaky, so any suggestions as to whether this is a bug in my</FONT>
<BR><FONT SIZE=2>> code, JDOM, or Xerces are welcome.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Rolf</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> ======================================================</FONT>
<BR><FONT SIZE=2>> /*package default.*/</FONT>
<BR><FONT SIZE=2>> import java.io.ByteArrayInputStream;</FONT>
<BR><FONT SIZE=2>> import java.io.CharArrayReader;</FONT>
<BR><FONT SIZE=2>> import java.io.File;</FONT>
<BR><FONT SIZE=2>> import java.io.FileInputStream;</FONT>
<BR><FONT SIZE=2>> import java.io.FileReader;</FONT>
<BR><FONT SIZE=2>> import java.io.IOException;</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> import org.jdom.JDOMException;</FONT>
<BR><FONT SIZE=2>> import org.jdom.input.SAXBuilder;</FONT>
<BR><FONT SIZE=2>> import org.xml.sax.InputSource;</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> public class MainTest {</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> private static byte[] loadedFileBytes(String filename) throws</FONT>
<BR><FONT SIZE=2>> IOException {</FONT>
<BR><FONT SIZE=2>> File file = new File(filename);</FONT>
<BR><FONT SIZE=2>> byte[] buffer = new byte[(int)file.length()];</FONT>
<BR><FONT SIZE=2>> FileInputStream fis = new FileInputStream(file);</FONT>
<BR><FONT SIZE=2>> int got = 0;</FONT>
<BR><FONT SIZE=2>> int size = buffer.length;</FONT>
<BR><FONT SIZE=2>> for (got = 0; got < size; ) {</FONT>
<BR><FONT SIZE=2>> int read = fis.read(buffer, got, size - got);</FONT>
<BR><FONT SIZE=2>> if (read >= 0) {</FONT>
<BR><FONT SIZE=2>> got += read;</FONT>
<BR><FONT SIZE=2>> } else {</FONT>
<BR><FONT SIZE=2>> throw new IOException("do not expect end of file before "</FONT>
<BR><FONT SIZE=2>> + size + " bytes, but got it at " + got + " bytes.");</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> if (fis.read() != -1) {</FONT>
<BR><FONT SIZE=2>> throw new IOException("Thought we read to end of file, "</FONT>
<BR><FONT SIZE=2>> + "but there is still more.....");</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> return buffer;</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> private static char[] loadedFileChars(String filename) throws</FONT>
<BR><FONT SIZE=2>> IOException {</FONT>
<BR><FONT SIZE=2>> File file = new File(filename);</FONT>
<BR><FONT SIZE=2>> FileReader fr = new FileReader(file);</FONT>
<BR><FONT SIZE=2>> StringBuffer sb = new StringBuffer();</FONT>
<BR><FONT SIZE=2>> int read = 0;</FONT>
<BR><FONT SIZE=2>> char[] buffer = new char[1024*4];</FONT>
<BR><FONT SIZE=2>> while ((read = fr.read(buffer)) >= 0) {</FONT>
<BR><FONT SIZE=2>> sb.append(buffer, 0, read);</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> return sb.toString().toCharArray();</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> public static void main(String[] args) throws</FONT>
<BR><FONT SIZE=2>> ClassNotFoundException, IOException {</FONT>
<BR><FONT SIZE=2>> long start = System.currentTimeMillis();</FONT>
<BR><FONT SIZE=2>> Class.forName("org.jdom.Verifier").getDeclaredMethods();</FONT>
<BR><FONT SIZE=2>> long load = System.currentTimeMillis() - start;</FONT>
<BR><FONT SIZE=2>> System.out.println("Loaded Verifier Class: " + load + "ms.");</FONT>
<BR><FONT SIZE=2>> int iterations = Integer.parseInt(args[0]);</FONT>
<BR><FONT SIZE=2>> SAXBuilder builder = new SAXBuilder(false);</FONT>
<BR><FONT SIZE=2>> for (int i = 1; i < args.length; i++) {</FONT>
<BR><FONT SIZE=2>> start = System.currentTimeMillis();</FONT>
<BR><FONT SIZE=2>> byte[] bytedata = loadedFileBytes(args[i]);</FONT>
<BR><FONT SIZE=2>> char[] chardata = loadedFileChars(args[i]);</FONT>
<BR><FONT SIZE=2>> load = System.currentTimeMillis() - start;</FONT>
<BR><FONT SIZE=2>> System.out.println("Loaded Data in File '" + args[i] + "' in "</FONT>
<BR><FONT SIZE=2>> + load + "ms. " + (bytedata.length / 1024) + "KB. "</FONT>
<BR><FONT SIZE=2>> + (chardata.length / 1024) + " KChars About to SAXBuild");</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> try {</FONT>
<BR><FONT SIZE=2>> for (int j = 0; j < iterations; j++) {</FONT>
<BR><FONT SIZE=2>> InputSource source = new InputSource(new</FONT>
<BR><FONT SIZE=2>> ByteArrayInputStream(bytedata));</FONT>
<BR><FONT SIZE=2>> start = System.currentTimeMillis();</FONT>
<BR><FONT SIZE=2>> builder.build(source);</FONT>
<BR><FONT SIZE=2>> load = System.currentTimeMillis() - start;</FONT>
<BR><FONT SIZE=2>> System.out.println("SAXBuilder built document '" + args[i]</FONT>
<BR><FONT SIZE=2>> + "' (BYTES) iteration " + j + " in " + load + "ms.");</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> } catch (JDOMException e) {</FONT>
<BR><FONT SIZE=2>> e.printStackTrace();</FONT>
<BR><FONT SIZE=2>> } catch (IOException ioe) {</FONT>
<BR><FONT SIZE=2>> ioe.printStackTrace();</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> try {</FONT>
<BR><FONT SIZE=2>> for (int j = 0; j < iterations; j++) {</FONT>
<BR><FONT SIZE=2>> InputSource source = new InputSource(new</FONT>
<BR><FONT SIZE=2>> CharArrayReader(chardata));</FONT>
<BR><FONT SIZE=2>> start = System.currentTimeMillis();</FONT>
<BR><FONT SIZE=2>> builder.build(source);</FONT>
<BR><FONT SIZE=2>> load = System.currentTimeMillis() - start;</FONT>
<BR><FONT SIZE=2>> System.out.println("SAXBuilder built document '" + args[i]</FONT>
<BR><FONT SIZE=2>> + "' (CHARS) iteration " + j + " in " + load + "ms.");</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> } catch (JDOMException e) {</FONT>
<BR><FONT SIZE=2>> e.printStackTrace();</FONT>
<BR><FONT SIZE=2>> } catch (IOException ioe) {</FONT>
<BR><FONT SIZE=2>> ioe.printStackTrace();</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> }</FONT>
<BR><FONT SIZE=2>> ===================================================================================</FONT>
</P>
</BODY>
</HTML>