[jdom-interest] Feature Request
cowan at ccil.org
Sat Feb 21 10:52:30 PST 2004
Dennis Sosnoski scripsit:
> I wanted HTML to be parsed with a SAX API (and without namespaces, so that
> I could easily use XPath on the constructed model).
Since there's demand for this, I'll make sure the SAX namespace and
namespace-prefix features work correctly in the next release.
Currently they can be set and cleared, but doing so has no effect.
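
For reference, a minimal sketch of how a SAX client toggles those two
features. It uses the JDK's built-in XMLReader as a stand-in, since TagSoup
does not honor the features yet; the class and method names (SaxFeatureDemo,
startTags) are made up for illustration. Only the two feature URIs come from
the SAX2 specification.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of TagSoup or JDOM.
final class SaxFeatureDemo {
    // Standard SAX2 feature identifiers.
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    /** Parses xml and returns "uri|localName|qName" for every start tag. */
    static List<String> startTags(String xml, boolean namespaces) throws Exception {
        // Stand-in parser: the JDK's default XMLReader, not TagSoup.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setFeature(NAMESPACES, namespaces);
        // Keep the two features opposite so one naming style is always reported.
        reader.setFeature(NAMESPACE_PREFIXES, !namespaces);

        final List<String> seen = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                seen.add(uri + "|" + localName + "|" + qName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html xmlns='http://www.w3.org/1999/xhtml'><body/></html>";
        System.out.println(startTags(xml, true));
        System.out.println(startTags(xml, false));
    }
}
```

With namespace processing off, the handler can rely only on the qName, which
is what makes plain XPath expressions such as //body usable on the resulting
model.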
> schema.elementType("span", Schema.M_ANY, Schema.M_ANY, 0);
> schema.elementType("div", Schema.M_ANY, Schema.M_ANY, 0);
> schema.elementType("table", Schema.M_ANY, Schema.M_ANY, 0);
> schema.elementType("br", Schema.M_EMPTY, Schema.M_ANY, 0);
I'd be interested in knowing why these particular ones were important.
I understand the issue with script and style.
> John, I should also mention that I ran into cases where the parser was not
> clearing itself properly when starting a new parse, I think because the
> Parser.theSaved field was not being set to null.
Thanks; I'll add that to the to-do list for 0.9.2. (I forgot to test
for parser reusability; my tests always instantiate a new one.)
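
A reusability test could look roughly like the sketch below: parse one
document, parse a second one with the same reader, and compare against a
fresh reader parsing only the second document. The JDK's XMLReader stands in
for the TagSoup parser here, and the names ParserReuseCheck, elementNames
and reuseMatchesFresh are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical test helper; not part of TagSoup.
final class ParserReuseCheck {
    // Stand-in for "new parser instance"; a TagSoup parser would be created here.
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Runs one parse on the given reader and returns the element names seen. */
    static List<String> elementNames(XMLReader reader, String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    /** True if a reused reader reports the same events as a fresh one. */
    static boolean reuseMatchesFresh(String first, String second) throws Exception {
        XMLReader reused = newReader();
        elementNames(reused, first);                       // first parse may leave state behind
        List<String> again = elementNames(reused, second); // second parse on the same instance
        List<String> fresh = elementNames(newReader(), second);
        return again.equals(fresh);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(reuseMatchesFresh("<a><b/></a>", "<c><d/></c>"));
    }
}
```

Leftover state from the first parse, such as a saved element that was never
reset, would show up as a mismatch between the two event lists.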
> >>The only downside I've noticed is that the handling it uses to
> >>turn HTML into XHTML can go berserk in some cases of real-world HTML,
> >>such as <script> and <style> elements within the <body> (it properly
> >>tries to force them into a <head> element, so you end up with multiple
> >><head>s and <body>s).
TagSoup's content models are implicitly of the form (A|B|C|...)*, so
it thinks the content model of the html element is (head|body)*, which
allows any number of heads and bodies in any order.
I may do some special-casery to fix this, but probably not for 0.9.2
unless I see a very easy way to do it.
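
To illustrate the difference, here is a toy sketch that encodes a sequence
of child elements as a string and checks it against two regular expressions:
one for the loose (head|body)* model and one for a strict "head, then body"
model. This is not how TagSoup represents content models internally; the
class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Toy model only; TagSoup's real schema representation is different.
final class ContentModelSketch {
    // Each child name is followed by one space in the encoded form.
    static final Pattern LOOSE = Pattern.compile("(head |body )*"); // (head|body)*
    static final Pattern STRICT = Pattern.compile("head body ");    // head, then body

    /** Encodes a child sequence as names separated by trailing spaces. */
    static String encode(List<String> children) {
        StringBuilder sb = new StringBuilder();
        for (String child : children) {
            sb.append(child).append(' ');
        }
        return sb.toString();
    }

    static boolean allowedLoose(List<String> children) {
        return LOOSE.matcher(encode(children)).matches();
    }

    static boolean allowedStrict(List<String> children) {
        return STRICT.matcher(encode(children)).matches();
    }

    public static void main(String[] args) {
        List<String> doubled = Arrays.asList("head", "body", "head", "body");
        System.out.println("loose:  " + allowedLoose(doubled));
        System.out.println("strict: " + allowedStrict(doubled));
    }
}
```

Under the loose model the doubled sequence is accepted, which is why a
script element found inside the body can open a second head and then a
second body.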
John Cowan www.reutershealth.com www.ccil.org/~cowan jcowan at reutershealth.com
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say "Gosh!"
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.