I think I'd use XSLT in that case. It will handle XML-to-text transformations easily and scriptably (if that's a word).<div> (*Chris*)<br><br><div class="gmail_quote">On Thu, Mar 29, 2012 at 9:54 AM, Oliver Ruebenacker <span dir="ltr"><<a href="mailto:curoli@gmail.com">curoli@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Hello,<br>
<br>
Thanks for all the advice, but it seems I did not make myself<br>
sufficiently clear.<br>
<br>
My situation is this: someone else already parsed XHTML and gave me<br>
the JDOM element that represents a fragment of it.<br>
<br>
Let us say the original fragment looks something like this:<br>
<br>
"<p><b>© 2012</b> by <em>Dewey, Cheetham & Howe</em></p>"<br>
"<p><b>© 2012</b> by <em>Dewey, Cheetham & Howe</em></p>"<br>
"<p><b>© 2012</b> by <em>Dewey, Cheetham  Howe</em></p>"<br>
<br>
I never get to see that fragment, but instead an object of type<br>
Element. What I want to get is a String that looks roughly like this:<br>
<br>
"© 2012 by Dewey, Cheetham & Howe"<br>
<br>
A simple lightweight solution that is roughly acceptable in most<br>
simple cases is fine for my purpose.<br>
<br>
So I am trying a recursive method that iterates over<br>
Element.getContent() and then I am wondering what to do if the content<br>
happens to be EntityRef?<br>
<br>
package cbit.vcell.model.summaries;<br>
<br>
import org.jdom.Comment;<br>
import org.jdom.DocType;<br>
import org.jdom.Element;<br>
import org.jdom.EntityRef;<br>
import org.jdom.ProcessingInstruction;<br>
import org.jdom.Text;<br>
<br>
public class XHTMLToPlainTextConverter {<br>
<br>
    public static String convert(Element element) {<br>
        String text = "";<br>
        for(Object content : element.getContent()) {<br>
            if(content instanceof Comment) {<br>
                // ignore<br>
            } else if(content instanceof DocType) {<br>
                // ignore<br>
            } else if(content instanceof Element) {<br>
                Element childElement = (Element) content;<br>
                text = text + convert(childElement);<br>
            } else if(content instanceof EntityRef) {<br>
                EntityRef ref = (EntityRef) content;<br>
                text = text + ref; // ???<br>
            } else if(content instanceof ProcessingInstruction) {<br>
                // ignore<br>
            } else if(content instanceof Text) {<br>
                Text childText = (Text) content;<br>
                text = text + childText.getText();<br>
            } else {<br>
                // ignore, should not happen<br>
            }<br>
        }<br>
        return text;<br>
    }<br>
<br>
}<br>
<br>
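One way the EntityRef branch above might be handled, as a rough sketch only: look up the entity name in a small table of common predefined entities and fall back to the literal reference when the name is unknown. The class EntityNames, its resolve method and the choice of entities are hypothetical, not part of JDOM; in the converter the call would presumably be EntityNames.resolve(ref.getName()). Numeric character references such as &amp;#169; are normally expanded by the XML parser before JDOM sees them, so they should arrive as ordinary Text rather than as EntityRef.<br>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps a few common entity names to their replacement text.
final class EntityNames {

    // Illustrative subset only; extend as needed.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("copy", "\u00A9");
        NAMED.put("reg", "\u00AE");
    }

    private EntityNames() {
    }

    // Returns the replacement text for a known entity name,
    // or the unresolved reference "&name;" when the name is not in the table.
    static String resolve(String name) {
        String known = NAMED.get(name);
        return known != null ? known : "&" + name + ";";
    }
}
```

Falling back to the literal reference keeps unknown entities visible in the output instead of dropping them silently, which seems acceptable for a lightweight converter.<br>
<br>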
Thanks!<br>
<br>
Take care<br>
<span class="HOEnZb"><font color="#888888"> Oliver<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On Thu, Mar 29, 2012 at 12:19 PM, Chris Pratt <<a href="mailto:thechrispratt@gmail.com">thechrispratt@gmail.com</a>> wrote:<br>
> Another option I've used in the past is changing the underlying SAX parser<br>
> that JDOM uses to TagSoup ( <a href="http://ccil.org/~cowan/XML/tagsoup/" target="_blank">http://ccil.org/~cowan/XML/tagsoup/</a>). Its<br>
> parser is tuned for parsing HTML that is not fully XML-compliant.<br>
><br>
> (*Chris*)<br>
><br>
> On Thu, Mar 29, 2012 at 8:47 AM, Olivier Jaquemet<br>
> <<a href="mailto:olivier.jaquemet@jalios.com">olivier.jaquemet@jalios.com</a>> wrote:<br>
>><br>
>> Hi Oliver,<br>
>><br>
>> JDOM is a great tool for parsing XML...<br>
>><br>
>> ... but for an XHTML fragment (which may not be completely XHTML-compliant<br>
>> ... ?)<br>
>> and especially for text extraction, I would strongly suggest jsoup<br>
>> <a href="http://jsoup.org/" target="_blank">http://jsoup.org/</a><br>
>><br>
>> String text = org.jsoup.Jsoup.parse(html).text();<br>
>><br>
>> Whatever your HTML looks like, it will work like a charm (even if it is an ugly<br>
>> WYSIWYG copy-paste from Word or an ugly HTML export from some website).<br>
>><br>
>> Olivier<br>
>><br>
>><br>
>> On 29/03/2012 15:23, Oliver Ruebenacker wrote:<br>
>>><br>
>>> Hello,<br>
>>><br>
>>> I need a simple way to convert some XHTML fragments, provided as a<br>
>>> JDOM Element, into plain text. I am willing to ignore most HTML tags<br>
>>> and consider only the most commonly used predefined entities.<br>
>>><br>
>>> In JDOM, an entity reference has a name, a public id and a system<br>
>>> id. I think I know what the name means, for named entities. But what<br>
>>> about numeric entities, how do I get the code point? And what are<br>
>>> public id and system id?<br>
>>><br>
>>> Thanks!<br>
>>><br>
>>> Take care<br>
>>> Oliver<br>
>>><br>
>><br>
>> --<br>
>> Olivier Jaquemet<<a href="mailto:olivier.jaquemet@jalios.com">olivier.jaquemet@jalios.com</a>><br>
>> Ingénieur R&D Jalios S.A. - <a href="http://www.jalios.com/" target="_blank">http://www.jalios.com/</a><br>
>> @OlivierJaquemet <a href="tel:%2B33970461480" value="+33970461480">+33970461480</a><br>
>><br>
>><br>
>><br>
>> _______________________________________________<br>
>> To control your jdom-interest membership:<br>
>> <a href="http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com" target="_blank">http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com</a><br>
><br>
><br>
><br>
<br>
<br>
<br>
</div></div><div class="HOEnZb"><div class="h5">--<br>
Oliver Ruebenacker, Computational Cell Biologist<br>
Virtual Cell (<a href="http://vcell.org" target="_blank">http://vcell.org</a>)<br>
SBPAX: Turning Bio Knowledge into Math Models (<a href="http://www.sbpax.org" target="_blank">http://www.sbpax.org</a>)<br>
<a href="http://www.oliver.curiousworld.org" target="_blank">http://www.oliver.curiousworld.org</a><br>
</div></div></blockquote></div><br></div>