I think I'd use XSLT in that case.  It will handle XML -> text transformations easily and scriptably (if that's a word).<div>  (*Chris*)<br><br><div class="gmail_quote">On Thu, Mar 29, 2012 at 9:54 AM, Oliver Ruebenacker <span dir="ltr"><<a href="mailto:curoli@gmail.com">curoli@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">     Hello,<br>

<br>

  Thanks for all the advice, but it seems I did not make myself<br>

sufficiently clear.<br>

<br>

  My situation is this: someone else already parsed the XHTML and gave me<br>

the JDOM element that represents a fragment of it.<br>

<br>

  Let us say the original fragment looks something like one of these:<br>

<br>

  "<p><b>&copy; 2012</b> by <em>Dewey, Cheetham &amp; Howe</em></p>"<br>

  "<p><b>&#169; 2012</b> by <em>Dewey, Cheetham &#38; Howe</em></p>"<br>

  "<p><b>&#x00a9; 2012</b> by <em>Dewey, Cheetham &#x26; Howe</em></p>"<br>

<br>

  I never get to see that fragment; instead I get an object of type<br>

Element. What I want to get is a String that looks roughly like this:<br>

<br>

  "© 2012 by Dewey, Cheetham & Howe"<br>

<br>

  A simple lightweight solution that is roughly acceptable in most<br>

simple cases is fine for my purpose.<br>

<br>

  So I am trying a recursive method that iterates over<br>

Element.getContent(), and I am wondering what to do if the content<br>

happens to be an EntityRef.<br>

<br>

package cbit.vcell.model.summaries;<br>

<br>

import org.jdom.Comment;<br>

import org.jdom.DocType;<br>

import org.jdom.Element;<br>

import org.jdom.EntityRef;<br>

import org.jdom.ProcessingInstruction;<br>

import org.jdom.Text;<br>

<br>

public class XHTMLToPlainTextConverter {<br>

<br>

        public static String convert(Element element) {<br>

                String text = "";<br>

                for(Object content : element.getContent()) {<br>

                        if(content instanceof Comment) {<br>

                                // ignore<br>

                        } else if(content instanceof DocType) {<br>

                                // ignore<br>

                        } else if(content instanceof Element) {<br>

                                Element childElement = (Element) content;<br>

                                text = text + convert(childElement);<br>

                        } else if(content instanceof EntityRef) {<br>

                                EntityRef ref = (EntityRef) content;<br>

                                text = text + ref; // ???<br>

                        } else if(content instanceof ProcessingInstruction) {<br>

                                // ignore<br>

                        } else if(content instanceof Text) {<br>

                                Text childText = (Text) content;<br>

                                text = text + childText.getText();<br>

                        } else {<br>

                                // ignore, should not happen<br>

                        }<br>

                }<br>

                return text;<br>

        }<br>

<br>

}<br>
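<br>
A minimal sketch of one way to fill in the EntityRef branch above, assuming the reference carries a name such as "copy" (readable via EntityRef.getName()).<br>
Numeric character references such as &amp;#169; and the five predefined XML entities are expanded by the parser itself, so they normally arrive as Text rather than as EntityRef.<br>
The class name EntityNameSketch, the contents of the lookup table and the fallback behaviour are all assumptions made for illustration, not part of JDOM.<br>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps a handful of common entity names to their characters.
final class EntityNameSketch {

    // Small, deliberately incomplete table; extend it as needed.
    private static final Map<String, String> KNOWN = new HashMap<>();

    static {
        KNOWN.put("amp", "&");
        KNOWN.put("lt", "<");
        KNOWN.put("gt", ">");
        KNOWN.put("quot", "\"");
        KNOWN.put("apos", "'");
        KNOWN.put("nbsp", "\u00a0");
        KNOWN.put("copy", "\u00a9");
        KNOWN.put("reg", "\u00ae");
    }

    private EntityNameSketch() {
    }

    // Returns the character(s) for a known entity name; for an unknown name
    // the reference is written back out unchanged as "&name;".
    static String resolve(String entityName) {
        String replacement = KNOWN.get(entityName);
        return replacement != null ? replacement : "&" + entityName + ";";
    }
}
```

With this helper, the EntityRef branch could read: text = text + EntityNameSketch.resolve(ref.getName());<br>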

<br>

  Thanks!<br>

<br>

     Take care<br>

<span class="HOEnZb"><font color="#888888">     Oliver<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Thu, Mar 29, 2012 at 12:19 PM, Chris Pratt <<a href="mailto:thechrispratt@gmail.com">thechrispratt@gmail.com</a>> wrote:<br>

> Another option I've used in the past is swapping the underlying SAX parser<br>

> that JDOM uses for TagSoup (<a href="http://ccil.org/~cowan/XML/tagsoup/" target="_blank">http://ccil.org/~cowan/XML/tagsoup/</a>). Its<br>

> parser is tuned for parsing HTML that is not fully XML-compliant.<br>

><br>

>   (*Chris*)<br>

><br>

> On Thu, Mar 29, 2012 at 8:47 AM, Olivier Jaquemet<br>

> <<a href="mailto:olivier.jaquemet@jalios.com">olivier.jaquemet@jalios.com</a>> wrote:<br>

>><br>

>> Hi Oliver,<br>

>><br>

>> JDOM is a great tool for parsing XML...<br>

>><br>

>> ... but for an XHTML fragment (which may not be completely XHTML-compliant?)<br>

>> and especially for text extraction, I would strongly suggest jsoup:<br>

>> <a href="http://jsoup.org/" target="_blank">http://jsoup.org/</a><br>

>><br>

>>  String text = org.jsoup.Jsoup.parse(html).text();<br>

>><br>

>> Whatever your HTML looks like, it will work like a charm (even if it is an ugly<br>

>> copy-paste from a WYSIWYG editor such as Word, or an ugly HTML export from some website).<br>

>><br>

>> Olivier<br>

>><br>

>><br>

>> On 29/03/2012 15:23, Oliver Ruebenacker wrote:<br>

>>><br>

>>>      Hello,<br>

>>><br>

>>>   I need a simple way to convert some XHTML fragments, provided as a<br>

>>> JDOM Element, into plain text. I am willing to ignore most HTML tags<br>

>>> and consider only the most commonly used predefined entities.<br>

>>><br>

>>>   In JDOM, an entity reference has a name, a public id and a system<br>

>>> id. I think I know what the name means, for named entities. But what<br>

>>> about numeric entities, how do I get the code point? And what are<br>

>>> public id and system id?<br>

>>><br>

>>>   Thanks!<br>

>>><br>

>>>      Take care<br>

>>>      Oliver<br>

>>><br>

>><br>

>> --<br>

>> Olivier Jaquemet<<a href="mailto:olivier.jaquemet@jalios.com">olivier.jaquemet@jalios.com</a>><br>

>> Ingénieur R&D Jalios S.A. - <a href="http://www.jalios.com/" target="_blank">http://www.jalios.com/</a><br>

>> @OlivierJaquemet <a href="tel:%2B33970461480" value="+33970461480">+33970461480</a><br>

>><br>

>><br>

>><br>

>> _______________________________________________<br>

>> To control your jdom-interest membership:<br>

>> <a href="http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com" target="_blank">http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com</a><br>

><br>

><br>

><br>


<br>

<br>

<br>

</div></div><div class="HOEnZb"><div class="h5">--<br>

Oliver Ruebenacker, Computational Cell Biologist<br>

Virtual Cell (<a href="http://vcell.org" target="_blank">http://vcell.org</a>)<br>

SBPAX: Turning Bio Knowledge into Math Models (<a href="http://www.sbpax.org" target="_blank">http://www.sbpax.org</a>)<br>

<a href="http://www.oliver.curiousworld.org" target="_blank">http://www.oliver.curiousworld.org</a><br>

</div></div></blockquote></div><br></div>
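<br>
A minimal sketch of the XSLT route suggested at the top of this message, using only the JDK's javax.xml.transform API.<br>
It relies on XSLT's built-in template rules: a stylesheet that declares nothing but text output copies every text node and drops the markup.<br>
The names XsltTextSketch and toPlainText are hypothetical, and the input is read from a String so the example is self-contained. For a JDOM Element, org.jdom.transform.JDOMSource would presumably take the place of the StreamSource; that wiring is not shown and is untested here.<br>

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: strips markup from well-formed XML via an XSLT transform.
final class XsltTextSketch {

    // No templates of our own: the built-in rules emit all text nodes,
    // and method="text" writes them without escaping or tags.
    private static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "</xsl:stylesheet>";

    private XsltTextSketch() {
    }

    // Returns the concatenated text content of the given XML fragment.
    static String toPlainText(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers need not handle a checked exception.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

The sample input should use numeric references (for example &amp;#169;) or the predefined entities, since a named entity such as &amp;copy; is only defined when a DTD declares it.<br>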