[jdom-interest] How to exclude DTD and namespace during Sax (TagSoup) parsing in JDOM

Jack Bush netbeansfan at yahoo.com.au
Fri Nov 7 04:32:55 PST 2008

Hi All,
I am having difficulty parsing a namespaced HTML document using Saxon and the TagSoup parser. The relevant content of this document is as follows:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <div id="container">
        <div id="content">
            <table class="sresults">
                        <a href="http://www.abc.com/areas" title=" Hollywood , CA "> hollywood </a>
                        <a href="http://www.abc.com/areas" title=" San Jose , CA "> san jose </a>
                        <a href="http://www.abc.com/areas" title=" San Francisco , CA "> san francisco </a>
                        <a href="http://www.abc.com/areas" title=" San Diego , CA "> San diego </a>
Below is the relevant code snippet, which illustrates how I have attempted to retrieve the contents (the value of each <a>):
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.*;
    import org.jdom.*;
    import org.jdom.input.SAXBuilder;
    import org.jdom.xpath.*;
    import org.saxpath.*;
    import org.ccil.cowan.tagsoup.Parser;

    // Read the HTML file and build a JDOM document, using TagSoup as the SAX parser.
    FileReader frInHtml = new FileReader("C:\\Tmp\\ABC.html");
    BufferedReader brInHtml = new BufferedReader(frInHtml);
    SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
    org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);

    // Select every <a> in the results table; the "ns" prefix is bound to the XHTML namespace.
    XPath xpath = XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
    xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");
    java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));

    // Print the text value of each selected node.
    Iterator iterator = list.iterator();
    while (iterator.hasNext())
    {
        Object object = iterator.next();
//      if (object instanceof Element)
//          System.out.println(((Element) object).getTextNormalize());
        if (object instanceof Content)
            System.out.println(((Content) object).getValue());
    }
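
The XPath above has to bind the ns prefix because the elements of the document are in the XHTML namespace. One way to sidestep the prefix binding is to match on local-name() instead. The following is only a minimal sketch of that idea: it uses the JDK's own javax.xml.xpath API rather than JDOM so that it runs without extra jars, and the class name, method name and sample document are made up for illustration. The expression is plain XPath 1.0, so it should be usable with JDOM's XPath.newInstance as well, though that is an assumption and not tested here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LocalNameXPathSketch {
    // Hypothetical, simplified stand-in for the document in the post (no DOCTYPE).
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div id=\"container\"><div id=\"content\"><table class=\"sresults\"><tr>"
        + "<td><a href=\"http://www.abc.com/areas\"> hollywood </a></td>"
        + "<td><a href=\"http://www.abc.com/areas\"> san jose </a></td>"
        + "</tr></table></div></div></body></html>";

    // Returns the trimmed text of every <a> inside the "sresults" table,
    // matching elements by local name so no namespace prefix is needed.
    static List<String> linkTexts(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//*[local-name()='table' and @class='sresults']//*[local-name()='a']",
                doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        for (String text : linkTexts(SAMPLE)) {
            System.out.println(text);
        }
    }
}
```

The trade-off is that local-name() matching is more verbose and ignores which namespace an element actually belongs to, which is acceptable only when the document uses a single vocabulary.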
I would like to achieve the following objectives if possible:
(i) Exclude the DTD and the namespace in order to simplify parsing. How can this be done?
(ii) Failing that, how can I change the lookup of the PUBLIC DTD to a local SYSTEM one and include a local DTD for reference?
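
For objective (ii), the usual SAX mechanism is an org.xml.sax.EntityResolver, which intercepts the lookup of the PUBLIC identifier and returns a local source instead. The sketch below is a hedged illustration only: it uses the JDK's DocumentBuilder so that it runs without extra jars, it substitutes an empty DTD rather than a real local copy, and the class name, method names and sample document are invented for the example. JDOM's SAXBuilder has a setEntityResolver method that accepts the same kind of object. TagSoup itself may not fetch the DTD at all, in which case this matters mainly when a conventional XML parser is used.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class LocalDtdResolverSketch {
    static final String XHTML_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";

    // Hypothetical sample with the same DOCTYPE as the document in the post.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"" + XHTML_PUBLIC_ID + "\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>ok</p></body></html>";

    // Answers the XHTML public ID with an empty DTD so nothing is fetched from w3.org.
    static EntityResolver emptyDtdResolver() {
        return new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId)
                    throws SAXException, IOException {
                if (XHTML_PUBLIC_ID.equals(publicId)) {
                    // To use a real local copy instead, return an InputSource
                    // pointing at it, e.g. new InputSource("file:///C:/Tmp/xhtml1-transitional.dtd")
                    // (hypothetical path).
                    return new InputSource(new StringReader(""));
                }
                return null; // fall back to default resolution for anything else
            }
        };
    }

    // Parses the given XML text with the resolver installed.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(emptyDtdResolver());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Note that with an empty DTD, named entities such as &nbsp; are no longer declared, so a document that uses them would need a real local copy of the DTD and the entity files it references.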
I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon 6.5.5 and TagSoup 1.2 on Windows XP.
 Any assistance would be appreciated.
 Thanks in advance,