[jdom-interest] Verbose XHTML 1.1 Doctype

David Dorward david at dorward.me.uk
Wed Mar 24 10:47:47 PST 2004


I have a number of XHTML 1.1 documents, all conforming to the same
template, which I want to extract some data from and then insert that
data into different XHTML 1.1 documents.

As a first step I am trying to read in a document and then print it out
again without any modification. I've run into two issues:

1. It appears to be downloading the DTD from the W3C website every time
the document is parsed, which costs time and bandwidth.

2. It seems to be expanding the Doctype line (example below).

Is there any way to stop this? I'd like to leave the Doctype alone and
save time on reading the DTD (I don't care about validation - that is
handled elsewhere). I couldn't find anything looking at the docs, but I
suspect this is due to not knowing what to look for.

My code:

import org.jdom.Document;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import java.io.IOException;

public class Parse {

  public static void main(String[] args) {

    String file = "/path/to/about.xhtml";
    SAXBuilder builder = new SAXBuilder();
    XMLOutputter outputter = new XMLOutputter();

    try {
      // Parse the file into a JDOM tree (no validation requested).
      Document doc = builder.build(file);
      System.out.println(file + " is well formed.");
      try {
        // Write the tree straight back out, unmodified.
        outputter.output(doc, System.out);
      } catch (IOException e) {
        System.err.println(e);
      }
    } catch (JDOMException e) {
      // indicates a well-formedness or other error
      System.out.println(file + " is not well formed: " + e.getMessage());
    } catch (IOException e) {
      System.out.println("Could not check " + file);
      System.out.println(" because " + e.getMessage());
    }
  }
}



Examples: 
For input of:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>About</title>
etc

It outputs:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" [
  <!NOTATION w3c-xml PUBLIC "ISO 8879//NOTATION Extensible Markup
Language (XML) 1.0//EN">
  <!NOTATION cdata PUBLIC "-//W3C//NOTATION XML 1.0: CDATA//EN">
  <!NOTATION fpi PUBLIC "ISO 8879:1986//NOTATION Formal Public
Identifier//EN">
  <!NOTATION length PUBLIC "-//W3C//NOTATION XHTML Datatype:
Length//EN">
  <!NOTATION linkTypes PUBLIC "-//W3C//NOTATION XHTML Datatype:
LinkTypes//EN">
  <!NOTATION mediaDesc PUBLIC "-//W3C//NOTATION XHTML Datatype:
MediaDesc//EN">
  <!NOTATION multiLength PUBLIC "-//W3C//NOTATION XHTML Datatype:
MultiLength//EN">
  <!NOTATION number PUBLIC "-//W3C//NOTATION XHTML Datatype:
Number//EN">
  <!NOTATION pixels PUBLIC "-//W3C//NOTATION XHTML Datatype:
Pixels//EN">
  <!NOTATION script PUBLIC "-//W3C//NOTATION XHTML Datatype:
Script//EN">
  <!NOTATION text PUBLIC "-//W3C//NOTATION XHTML Datatype: Text//EN">
  <!NOTATION character PUBLIC "-//W3C//NOTATION XHTML Datatype:
Character//EN">
  <!NOTATION charset PUBLIC "-//W3C//NOTATION XHTML Datatype:
Charset//EN">
  <!NOTATION charsets PUBLIC "-//W3C//NOTATION XHTML Datatype:
Charsets//EN">
  <!NOTATION contentType PUBLIC "-//W3C//NOTATION XHTML Datatype:
ContentType//EN">
  <!NOTATION contentTypes PUBLIC "-//W3C//NOTATION XHTML Datatype:
ContentTypes//EN">
  <!NOTATION datetime PUBLIC "-//W3C//NOTATION XHTML Datatype:
Datetime//EN">
  <!NOTATION languageCode PUBLIC "-//W3C//NOTATION XHTML Datatype:
LanguageCode//EN">
  <!NOTATION uri PUBLIC "-//W3C//NOTATION XHTML Datatype: URI//EN">
  <!NOTATION uris PUBLIC "-//W3C//NOTATION XHTML Datatype: URIs//EN">
]>
<?doc type="doctype" role="title" { XHTML 1.1 } ?><html
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" version="-//W3C//DTD
XHTML 1.1//EN">
<head profile="">
<title>About</title>

etc
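
One direction that might be worth trying is a SAX EntityResolver that
answers every external-entity request with an empty stream, so the parser
never goes to the network for the DTD. The sketch below uses only the
JDK's built-in SAX parser so that it stands alone; the class and method
names (SkipDtd, EmptyResolver, parseWithoutDtd) are made up for
illustration. It is an assumption on my part, to be checked against the
JDOM docs, that the same resolver can be handed to the builder through
SAXBuilder.setEntityResolver, and that with an empty DTD there would be
nothing left to expand into the Doctype or to add as default attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SkipDtd {

    // Hypothetical helper: records each external entity the parser asks for
    // and answers the request with an empty stream instead of fetching it.
    static class EmptyResolver implements EntityResolver {
        final List<String> requested = new ArrayList<>();

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            requested.add(systemId);
            return new InputSource(new StringReader(""));
        }
    }

    // Parses the XML text with the JDK's default SAX parser and returns the
    // system IDs that were requested (and suppressed) along the way.
    static List<String> parseWithoutDtd(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        EmptyResolver resolver = new EmptyResolver();
        reader.setEntityResolver(resolver);
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xml)));
        return resolver.requested;
    }

    public static void main(String[] args) throws Exception {
        // A cut-down stand-in for the real document, not the actual file.
        String xml =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\"\n"
            + "\"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">"
            + "<head><title>About</title></head><body/></html>";

        System.out.println("Suppressed: " + parseWithoutDtd(xml));
    }
}
```

One caveat with this approach: because the DTD is replaced by nothing,
a document that relies on named entities declared there (such as &nbsp;)
would probably no longer parse, so it only suits documents that stick to
numeric character references or the five predefined XML entities.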

-- 
David Dorward                                 <http://dorward.me.uk/>


