[jdom-interest] Parsing a MODS-document with validation fails

Michael Kay mike at saxonica.com
Wed Aug 10 02:03:37 PDT 2011


> You are right, it is extra work to maintain these structures for a 
> case that no one hit before. One can find arguments for one or the 
> other solution. The amount of additional memory should be negligible, but 
> my patch introduced a bit of work while parsing every document, while 
> your suggested changes seem to produce more work in the rare case that...


Forgive me butting in to a thread that I've only been skim-reading until 
now. But I thought I would look at what Saxon does about this problem.

Firstly, Saxon states in its documentation that it expects the stream of 
ContentHandler events to correspond to those that come from a parser 
that has been configured with namespaces="true" and 
namespace-prefixes="false". It has no way of checking this in general 
(though it does so on paths where it has access to the XMLReader).
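
For reference, a minimal sketch of setting up a JAXP XMLReader with the
two feature settings named above, plus the kind of check that is only
possible when the XMLReader itself is accessible. The class and method
names are hypothetical; only the two feature URIs and the JAXP/SAX calls
are standard. This is not Saxon's code.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper, for illustration only.
final class NamespaceAwareReaders {
    // Standard SAX2 feature identifiers.
    static final String NAMESPACES =
            "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
            "http://xml.org/sax/features/namespace-prefixes";

    // Creates a reader with namespaces="true", namespace-prefixes="false".
    static XMLReader newReader()
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setFeature(NAMESPACES, true);
        reader.setFeature(NAMESPACE_PREFIXES, false);
        return reader;
    }

    // Reports whether a reader is configured the way the consumer expects.
    // A bare ContentHandler event stream offers no equivalent check.
    static boolean hasExpectedConfiguration(XMLReader reader)
            throws SAXException {
        return reader.getFeature(NAMESPACES)
                && !reader.getFeature(NAMESPACE_PREFIXES);
    }
}
```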

Saxon does a few checks on the consistency of the event stream where 
these can be done cheaply. For example, it checks for the attribute 
names "xmlns" and "xmlns:*" and ignores them if they appear, even though 
they shouldn't appear in theory.
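
A cheap check of this kind might look roughly like the following. It is
a sketch only, with hypothetical names, and not taken from Saxon: it
copies the attribute list and drops anything whose QName is "xmlns" or
starts with "xmlns:".

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, for illustration only.
final class NamespaceDeclarationCheck {
    // True for "xmlns" and for any "xmlns:*" name.
    static boolean isNamespaceDeclaration(String qName) {
        return qName != null
                && (qName.equals("xmlns") || qName.startsWith("xmlns:"));
    }

    // Returns a copy of the attributes without namespace declarations.
    static Attributes withoutNamespaceDeclarations(Attributes atts) {
        AttributesImpl kept = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            if (!isNamespaceDeclaration(atts.getQName(i))) {
                kept.addAttribute(atts.getURI(i), atts.getLocalName(i),
                        atts.getQName(i), atts.getType(i), atts.getValue(i));
            }
        }
        return kept;
    }
}
```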

But there's one area where Saxon relies on something that isn't 
guaranteed by the SAX spec: it assumes that the QName will be present 
and correct, even though it is optional when namespace-prefixes="false". 
I made this decision because all known XML parsers supply the QName, and 
because coping with its absence would incur significant cost on a 
performance-critical path. I've reasoned in the past that if someone 
needs to work with a source of SAX events that doesn't supply the QName, 
a filter could be added to the pipeline to make good the deficiency.
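
Such a filter could be sketched along these lines. This is an
illustration under simplifying assumptions, not code from Saxon or JDOM:
the class name is hypothetical, only element QNames are filled in
(attribute QNames are passed through untouched), and prefixes are looked
up with the standard NamespaceSupport helper, which can return a prefix
that has since been rebound in an inner scope.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.NamespaceSupport;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: supplies an element QName when the source omits it.
class QNameSupplyingFilter extends XMLFilterImpl {
    private final NamespaceSupport namespaces = new NamespaceSupport();
    // True once a context has been opened for the upcoming element.
    private boolean contextPushed = false;

    @Override
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        if (!contextPushed) {
            namespaces.pushContext();
            contextPushed = true;
        }
        namespaces.declarePrefix(prefix, uri);
        super.startPrefixMapping(prefix, uri);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (!contextPushed) {
            namespaces.pushContext();
        }
        contextPushed = false;
        super.startElement(uri, localName,
                qNameFor(uri, localName, qName), atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName, qNameFor(uri, localName, qName));
        namespaces.popContext();
    }

    // Keeps a supplied QName; otherwise rebuilds one from the bindings in scope.
    private String qNameFor(String uri, String localName, String qName)
            throws SAXException {
        if (qName != null && !qName.isEmpty()) {
            return qName;
        }
        if (uri == null || uri.isEmpty()
                || uri.equals(namespaces.getURI(""))) {
            return localName; // no namespace, or the default namespace
        }
        String prefix = namespaces.getPrefix(uri);
        if (prefix == null) {
            throw new SAXException("No prefix in scope for namespace " + uri);
        }
        return prefix + ":" + localName;
    }
}
```

The filter would sit between the event source and the consumer, for
example via setParent() on the XMLReader and setContentHandler() on the
downstream handler, so the cost is paid only by pipelines that need it.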

In this particular case, if I've understood the thread correctly, the 
QName is present but doesn't contain a legitimate prefix. In Saxon 
(where I'm sure the same sequence of SAX events might be received) I 
think I would have similar problems in dealing with this input. My 
response to a bug report on this would be that the input is invalid 
according to the SAX spec and should be corrected by inserting a filter: 
there is an implicit constraint that the stream of SAX events represents 
a well-formed XML document, and in a well-formed XML document, if an 
attribute is in a namespace then it must have a prefix. I wouldn't be 
prepared to add a performance penalty into the mainstream document 
building path in order to detect or repair this rare anomaly.
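
To make the constraint concrete, here is a sketch of a filter that only
detects the anomaly: an attribute with a non-empty namespace URI whose
QName carries no prefix. The class name and error message are
hypothetical, and this is not how Saxon or JDOM handles the case; a
repairing filter would additionally need to find or invent a prefix.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: rejects namespaced attributes that lack a prefix.
class AttributePrefixCheckFilter extends XMLFilterImpl {
    // Returns the index of the first offending attribute, or -1 if none.
    static int firstOffender(Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            String uri = atts.getURI(i);
            String qName = atts.getQName(i);
            boolean inNamespace = uri != null && !uri.isEmpty();
            boolean hasPrefix = qName != null && qName.indexOf(':') > 0;
            if (inNamespace && !hasPrefix) {
                return i;
            }
        }
        return -1;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        int bad = firstOffender(atts);
        if (bad >= 0) {
            throw new SAXException("Attribute '" + atts.getLocalName(bad)
                    + "' is in namespace '" + atts.getURI(bad)
                    + "' but its QName has no prefix");
        }
        super.startElement(uri, localName, qName, atts);
    }
}
```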

Michael Kay
Saxonica


