[jdom-interest] Parsing a MODS-document with validation fails
mike at saxonica.com
Wed Aug 10 02:03:37 PDT 2011
> You are right, it is extra work to maintain these structures for a
> case that no one hit before. One can find arguments for one or the
> other solution. The amount of additional memory should be negligible but
> my patch introduced a bit of work while parsing every document while
> your suggested changes seem to produce more work in the rare case that...
Forgive me butting in to a thread that I've only been skim-reading until
now. But I thought I would look at what Saxon does about this problem.
Firstly, Saxon states in its documentation that it expects the stream of
ContentHandler events to correspond to those that come from a parser
that has been configured with namespaces="true" and
namespace-prefixes="false". It has no way of checking this in general
(though it does so on paths where it has access to the XMLReader).
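
For readers who want to see what that configuration looks like on the parser side, here is a minimal sketch using the standard JAXP and SAX APIs. The class and method names are hypothetical and this is not Saxon's code; only the two feature URIs come from the SAX specification.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical helper: builds an XMLReader with the feature settings described above.
final class NamespaceAwareReader {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES = "http://xml.org/sax/features/namespace-prefixes";

    static XMLReader create() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // namespaces="true": report namespace URIs and local names.
        reader.setFeature(NAMESPACES, true);
        // namespace-prefixes="false": do not report xmlns declarations as attributes.
        reader.setFeature(NAMESPACE_PREFIXES, false);
        return reader;
    }

    // The kind of check that is only possible when the XMLReader itself is accessible.
    static boolean isConfiguredAsExpected(XMLReader reader) throws Exception {
        return reader.getFeature(NAMESPACES) && !reader.getFeature(NAMESPACE_PREFIXES);
    }
}
```

A consumer that is handed only a ContentHandler event stream has no XMLReader to query, which is why a check like this cannot be made in general.
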
Saxon does a few checks on the consistency of the event stream where
these can be done cheaply. For example, it checks for the attribute
names "xmlns" and "xmlns:*" and ignores them if they appear, even though
they shouldn't appear in theory.
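
As an illustration of that kind of cheap check (a sketch only, not the code Saxon actually uses), a SAX filter could drop namespace-declaration attributes before they reach the tree builder. The class name NamespaceDeclarationStripper is made up for this example.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: removes "xmlns" and "xmlns:*" attributes from the event stream.
final class NamespaceDeclarationStripper extends XMLFilterImpl {

    static boolean isNamespaceDeclaration(String qName) {
        return qName != null && (qName.equals("xmlns") || qName.startsWith("xmlns:"));
    }

    // Copies every attribute except namespace declarations.
    static Attributes strip(Attributes atts) {
        AttributesImpl kept = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            if (!isNamespaceDeclaration(atts.getQName(i))) {
                kept.addAttribute(atts.getURI(i), atts.getLocalName(i),
                        atts.getQName(i), atts.getType(i), atts.getValue(i));
            }
        }
        return kept;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, qName, strip(atts));
    }
}
```
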
But there's one area where Saxon relies on something that isn't guaranteed by
the SAX spec, namely it assumes that the QName will be present and
correct, even though it is optional when namespace-prefixes="false". I
made this decision because all known XML parsers supply the QName, and
because coping with its absence would incur significant cost on a
performance-critical path. I've reasoned in the past that if someone
needs to work with a source of SAX events that doesn't supply the QName,
a filter could be added to the pipeline to make good the deficiency.
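
A rough sketch of what such a filter might look like follows. It assumes the source still reports startPrefixMapping events, tracks them with the standard NamespaceSupport helper, and rebuilds any QName that is null or empty. The class name and the fallback to the bare local name are assumptions for illustration, not anything Saxon or JDOM provides.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.NamespaceSupport;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: supplies QNames that the upstream event source left out.
final class QNameRestoringFilter extends XMLFilterImpl {
    private final NamespaceSupport namespaces = new NamespaceSupport();
    private boolean needNewContext = true;

    static boolean isMissing(String qName) {
        return qName == null || qName.isEmpty();
    }

    // Builds a QName from the in-scope bindings. Attributes never use the default namespace.
    static String qNameFor(NamespaceSupport ns, String uri, String localName, boolean isAttribute) {
        if (uri == null || uri.isEmpty()) return localName;
        if (!isAttribute && uri.equals(ns.getURI(""))) return localName;
        String prefix = ns.getPrefix(uri);
        // Assumption: fall back to the local name when no prefix is bound.
        return prefix == null ? localName : prefix + ":" + localName;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        // Mappings arrive before the startElement they belong to.
        if (needNewContext) {
            namespaces.pushContext();
            needNewContext = false;
        }
        namespaces.declarePrefix(prefix, uri);
        super.startPrefixMapping(prefix, uri);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (needNewContext) namespaces.pushContext();
        needNewContext = true;
        String elementQName = isMissing(qName) ? qNameFor(namespaces, uri, localName, false) : qName;
        AttributesImpl fixed = new AttributesImpl(atts);
        for (int i = 0; i < fixed.getLength(); i++) {
            if (isMissing(fixed.getQName(i))) {
                fixed.setQName(i, qNameFor(namespaces, fixed.getURI(i), fixed.getLocalName(i), true));
            }
        }
        super.startElement(uri, localName, elementQName, fixed);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String elementQName = isMissing(qName) ? qNameFor(namespaces, uri, localName, false) : qName;
        super.endElement(uri, localName, elementQName);
        namespaces.popContext();
    }
}
```

The point of putting this in a separate filter is that the bookkeeping cost is paid only by pipelines that need it, not by every document that goes through the normal build path.
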
In this particular case, if I've understood the thread correctly, the
QName is present but doesn't contain a legitimate prefix. In Saxon
(where I'm sure the same sequence of SAX events might be received) I
think I would have similar problems in dealing with this input. My
response to a bug report on this would be that the input is invalid
according to the SAX spec and should be corrected by inserting a filter:
there is an implicit constraint that the stream of SAX events represents
a well-formed XML document, and in a well-formed XML document, if an
attribute is in a namespace then it must have a prefix. I wouldn't be
prepared to add a performance penalty into the mainstream document
building path in order to detect or repair this rare anomaly.
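
To make the anomaly concrete, here is a small sketch of a filter that detects it and rejects the stream. This is only an illustration with a hypothetical class name; a repairing filter would instead need to find or invent a prefix for the attribute.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: rejects attributes that are in a namespace but whose QName has no prefix.
final class UnprefixedNamespacedAttributeCheck extends XMLFilterImpl {

    // True when the attribute has a namespace URI but its QName carries no prefix.
    // A null or empty QName combined with a namespace URI is also reported.
    static boolean isAnomalous(String uri, String qName) {
        boolean inNamespace = uri != null && !uri.isEmpty();
        boolean hasPrefix = qName != null && qName.indexOf(':') > 0;
        return inNamespace && !hasPrefix;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (int i = 0; i < atts.getLength(); i++) {
            if (isAnomalous(atts.getURI(i), atts.getQName(i))) {
                throw new SAXException("Attribute '" + atts.getLocalName(i)
                        + "' is in namespace '" + atts.getURI(i)
                        + "' but its QName '" + atts.getQName(i) + "' has no prefix");
            }
        }
        super.startElement(uri, localName, qName, atts);
    }
}
```
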