[jdom-interest] Parsing Microsoft Word Documents

Fri Dec 24 11:00:47 PST 2004

Here is the code for future reference:

	public void run() throws FitException {
		fixtureDocumentProccessor = new FixtureDocumentProcessor();
		Document fixtureDocument = null;
		try {
			// NekoHTML copes with the unquoted attributes and unbalanced tags in Word's HTML.
			SAXBuilder builder = new SAXBuilder("org.cyberneko.html.parsers.SAXParser");
			builder.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", false);
			builder.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
			builder.setFeature("http://cyberneko.org/html/features/override-doctype", false);
			URL fileURL = inputFile.toURL();
			fixtureDocument = builder.build(fileURL);
		} catch (IOException e) {
			e.printStackTrace();
		} catch (JDOMException e) {
			e.printStackTrace();
		}
		if (fixtureDocument == null) {
			// Parsing failed above (stack trace already printed); nothing to process.
			return;
		}
		Document parsedFixtureDocument = fixtureDocumentProccessor.parse(fixtureDocument);
		this.outputFitResults(parsedFixtureDocument);
	}

	private void outputFitResults(Document fitTestResult) {
		FileOutputStream fileOutputStream = null;
		try {
			fileOutputStream = new FileOutputStream(this.outputFile);
			// Convert the JDOM document to a W3C DOM document so that
			// org.apache.xml.serialize.HTMLSerializer can write it out.
			DOMOutputter converter = new DOMOutputter();
			org.w3c.dom.Document domDocument = converter.output(fitTestResult);
			OutputFormat format = new OutputFormat(domDocument);
			HTMLSerializer html = new HTMLSerializer(fileOutputStream, format);
			html.serialize(domDocument);
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} catch (JDOMException e) {
			e.printStackTrace();
		} finally {
			// Always release the file handle, even if serialization failed.
			if (fileOutputStream != null) {
				try {
					fileOutputStream.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
	}
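
A side note on the "funny characters" mentioned further down the thread. The cause was never confirmed there, only suspected to be the Windows character set, so treat this as a minimal sketch under that assumption. It uses only the JDK; the class name, the helper and the sample bytes are made up for illustration. It shows how the curly quotes Word likes to emit turn into control characters when windows-1252 bytes are decoded as ISO-8859-1:

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {
    // Decode raw bytes using the named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the left and right curly double quotes in windows-1252.
        byte[] wordBytes = { (byte) 0x93, 'H', 'i', (byte) 0x94 };

        String right = decode(wordBytes, "windows-1252");
        String wrong = decode(wordBytes, "ISO-8859-1");

        // Decoded with the right charset we get real curly quotes (U+201C, U+201D).
        System.out.println(right.equals("\u201CHi\u201D")); // prints "true"
        // Decoded as ISO-8859-1 the same bytes become C1 control characters.
        System.out.println(wrong.equals("\u201CHi\u201D")); // prints "false"
    }
}
```

If the charset really is the culprit, this would explain why the result changes depending on which charset the parser and the serializer each assume.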

On Fri, 24 Dec 2004 13:54:37 -0500, Hugo Garcia <hugo.a.garcia at gmail.com> wrote:
> AHA!
> 
> Using the org.apache.xml.serialize.HTMLSerializer worked perfectly. No
> more funny characters in the output. This works. I will explore the
> JTidy option later when I finish the alpha I am trying to build.
> 
> thanks and have a good holiday
> 
> -H
> 
> 
> On Fri, 24 Dec 2004 10:37:13 +0000, Paul Reeves <p_a_reeves at hotmail.com> wrote:
> > Hugo
> >
> > There hasn't been an official JTidy release for donkey's years, but that doesn't
> > mean it doesn't work! It is more than up to the task. I wouldn't hold your
> > breath for a new release in the next few months...
> >
> > If you are using NekoHTML, I find that if you output the document by
> > converting it back from a JDOM document to a DOM document and use an
> > org.apache.xml.serialize.HTMLSerializer to write it, it usually comes out
> > looking OK.
> >
> > merry chrimbo
> >
> > Paul
> >
> > >From: Hugo Garcia <hugo.a.garcia at gmail.com>
> > >Reply-To: Hugo Garcia <hugo.a.garcia at gmail.com>
> > >To: jdom-interest at jdom.org
> > >Subject: Re: [jdom-interest] Parsing Microsoft Word Documents
> > >Date: Thu, 23 Dec 2004 14:56:13 -0500
> > >
> > >I didn't try JTidy since the release is so old. I'd rather wait for the
> > >new release. TagSoup didn't work because it doesn't support
> > >namespaces, which I need in order to use XPath.
> > >
> > >NekoHTML parses the document correctly, yet when I view the result in
> > >Firefox (Linux) the document looks funny. I suspect it might be the
> > >character set, which is specified as windows, but I am not sure. I
> > >am using XPath to modify a clone of the input document.
> > >
> > >Any input from your experience parsing the HTML generated by Microsoft
> > >Word is welcome.
> > >
> > >
> > >This is the initial code that sets things in motion:
> > >
> > >       public void run() throws FitException {
> > >               fixtureDocumentProccessor = new FixtureDocumentProcessor();
> > >               Document fixtureDocument = null;
> > >               try {
> > >                       SAXBuilder builder = new SAXBuilder("org.cyberneko.html.parsers.SAXParser");
> > >                       builder.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
> > >                       builder.setFeature("http://cyberneko.org/html/features/override-doctype", false);
> > >                       URL fileURL = inputFile.toURL();
> > >                       fixtureDocument = builder.build(fileURL);
> > >               } catch (IOException e) {
> > >                       e.printStackTrace();
> > >               } catch (JDOMException e) {
> > >                       e.printStackTrace();
> > >               }
> > >               this.outputFitResults(fixtureDocumentProccessor.parse(fixtureDocument));
> > >       }
> > >
> > >
> > >-------------
> > >-H
> > >
> > >
> > >On Sat, 18 Dec 2004 11:14:11 +0000, Paul Reeves <p_a_reeves at hotmail.com> wrote:
> > > > This isn't technically a JDOM question...
> > > >
> > > > Get hold of JTidy http://sourceforge.net/projects/jtidy or, even better,
> > > > NekoHTML http://www.apache.org/~andyc/neko/doc/html/
> > > >
> > > > Both will fix your unquoted attribute problem and also attempt to
> > > > correct unbalanced tags. JTidy also has a "clean word" facility, which
> > > > is rather useful.
> > > >
> > > > Paul
> > > >
> > > > >From: Hugo Garcia <hugo.a.garcia at gmail.com>
> > > > >Reply-To: Hugo Garcia <hugo.a.garcia at gmail.com>
> > > > >To: jdom-interest at jdom.org
> > > > >Subject: [jdom-interest] Parsing Microsoft Word Documents
> > > > >Date: Fri, 17 Dec 2004 11:56:57 -0500
> > > > >
> > > > >Hi
> > > > >
> > > > >I am trying to parse a Microsoft Word document with the SAXBuilder but
> > > > >I get an error that attributes must be quoted. When I look at the
> > > > >document I see that indeed some attributes, especially in various meta
> > > > >tags, are not quoted. I wonder if anyone has run into this problem and,
> > > > >if so, whether you have a workaround or solution.
> > > > >
> > > > >thanks
> > > > >
> > > > >-H
> > > > >_______________________________________________
> > > > >To control your jdom-interest membership:
> > > > >http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com
> > > >
> > > >
> >
> >
>
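
For reference, the "attributes must be quoted" error from the first message in this thread can be reproduced without Word at all. What follows is only a minimal sketch using the JDK's built-in XML parser rather than JDOM's SAXBuilder. The class name, the helper and the sample markup are made up for illustration; the markup merely imitates the kind of meta tag described above:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class UnquotedAttributeDemo {
    // Returns true if the JDK's default XML parser accepts the markup as well-formed.
    public static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations such as unquoted attributes end up here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: an unquoted attribute value, as in Word's meta tags.
        String wordStyle = "<html><head><meta name=Generator content=\"Microsoft Word\"></head></html>";
        // The same tag written as well-formed XML.
        String wellFormed = "<html><head><meta name=\"Generator\" content=\"Microsoft Word\"/></head></html>";

        System.out.println(parsesAsXml(wordStyle));  // prints "false"
        System.out.println(parsesAsXml(wellFormed)); // prints "true"
    }
}
```

A strict XML parser has to reject the first string, which is why a tolerant HTML parser such as NekoHTML or JTidy is needed in front of JDOM for Word's output.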