[jdom-interest] JDOM Issue #5 - DTD-aware Attribute output

Fri Mar 23 06:40:39 PDT 2012

Rolf,

I think your assumption is wrong. I remember Michael Kay had a long FAQ entry justifying why a DTD was read even though validation was not activated (for Saxon Aelfred, which we have used extensively). And indeed it is my experience that any parser, Xerces included, parses the DTD completely (including the entities it pulls in, as is the case here) and injects all default values of attributes (including namespaces) without being a validating parser.

Validating implies breaking somehow after an error (the first or the last?).

To summarize I see the following modes:
- ignore the DTD completely (no parser does this unless explicitly told to)
- use DTD (and inclusions) for all default values
- use DTD and report all errors but keep going
- use DTD and break at first error
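
As a small illustration of the second mode, here is a sketch that uses only the JDK's built-in JAXP/SAX parser (not JDOM). The class name DtdDefaultsDemo and the tiny inline document are made up for the example; the competency/scheme/"PISA" names simply mirror the document discussed in this thread. The DTD sits in the internal subset to keep the example self-contained. How an external DTD is treated without validation can differ between parsers and configurations, so this is an assumption-laden sketch rather than proof for every parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DtdDefaultsDemo {
    // Hypothetical mini-document: <competency/> carries no attribute itself,
    // but the DTD declares a default value for 'scheme'.
    static final String XML =
          "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (competency)>\n"
        + "  <!ELEMENT competency EMPTY>\n"
        + "  <!ATTLIST competency scheme CDATA \"PISA\">\n"
        + "]>\n"
        + "<root><competency/></root>";

    /** Parses XML with validation switched off and returns the 'scheme' value reported on competency. */
    static String schemeSeenWithoutValidation() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // explicit, although it is already the JAXP default
        final String[] seen = new String[1];
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(XML)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if ("competency".equals(qName)) {
                        seen[0] = atts.getValue("scheme"); // null if the parser did not inject a default
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(schemeSeenWithoutValidation());
    }
}
```

With the JDK's default parser this should print PISA although validation is switched off.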

My understanding is that my SAXBuilder.build was throwing an exception if I activated DTD validation (so the last two possibilities), thus making it impossible to obtain a good JDOM Document object from a slightly invalid document.
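
For the third mode (report errors but keep going), a possible sketch at the plain SAX level follows. It uses only the JDK's JAXP parser, not JDOM; the class name LenientValidationDemo and the deliberately invalid mini-document are hypothetical. The idea is an ErrorHandler that records validity errors instead of throwing, plus the SAX2 extension interface Attributes2 to tell specified attributes from defaulted ones, where the parser provides it. JDOM's SAXBuilder has a setErrorHandler method, so a similar handler could presumably be plugged in there, but that is untested against the document mentioned in this thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

public class LenientValidationDemo {
    // Hypothetical, slightly invalid document: the DTD allows exactly one
    // competency child, but the instance contains two.
    static final String XML =
          "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (competency)>\n"
        + "  <!ELEMENT competency EMPTY>\n"
        + "  <!ATTLIST competency scheme CDATA \"PISA\">\n"
        + "]>\n"
        + "<root><competency/><competency scheme=\"other\"/></root>";

    /** Messages of the validity errors reported during the last parse. */
    static final List<String> errors = new ArrayList<>();
    /** One entry per 'scheme' attribute seen, e.g. "PISA (defaulted)". */
    static final List<String> schemes = new ArrayList<>();

    static void parseLeniently() {
        errors.clear();
        schemes.clear();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // DTD validation on
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(XML)), new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Record the validity error instead of throwing, so parsing continues.
                    errors.add(e.getMessage());
                }

                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    int i = atts.getIndex("scheme");
                    if (i < 0) {
                        return; // element carries no 'scheme' attribute at all
                    }
                    String origin = "unknown"; // parser does not expose Attributes2
                    if (atts instanceof Attributes2) {
                        origin = ((Attributes2) atts).isSpecified(i) ? "specified" : "defaulted";
                    }
                    schemes.add(atts.getValue(i) + " (" + origin + ")");
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        parseLeniently();
        System.out.println("validity errors: " + errors);
        System.out.println("scheme attributes: " + schemes);
    }
}
```

Only recoverable validity errors are absorbed this way; a document that is not well-formed still ends the parse with a fatal error.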

paul

PS: sorry for the mailing fuss. I thought I had sent it to the list, only realizing a bit later that jdom at tuis.net was not... the list...

On 23 March 2012 at 14:21, Rolf Lear wrote:

> Hi Paul.
> 
> If you were wondering why no-one on the list has commented, it may be because you never sent it to the list, just to me ... ;-), so I have CC'd the list for you...
> 
> Anyway, I have been looking into things, and I think the problem is that you have missed a detail in the way the data is processed.
> 
> Using your example document:
> http://svn.activemath.org/LeAM-calculus/LeAM_calculus/oqmath/contin.oqmath
> 
> This document (apart from being 'big'), refers to a single DTD, which, in the case of this document, only really defaults one attribute: 'scheme' on the 'competency' element (which defaults to "PISA").
> 
> Now, as far as I know, there are only the following ways to reference content of the DTD:
> 
> If you are doing no DTD validation, the DTD will still be accessed to resolve entity references. But, that is the *only* thing that will be pulled from the DTD.
> 
> If you do validation, then the entire DTD is read, and the validation is done, and any attributes defaulted in the DTD will be created in the XML 'Model'.
> 
> So, it is my understanding that it is impossible to have 'all the defaulted attributes' without also having done the full DTD Validation.
> 
> As it happens, I often use the tool 'xmllint' (available on most unix systems, including linux) to check my understanding, and, I may be wrong on this because xmllint has the argument --dtdattr which appears to do a partial thing of loading the defaulted attrs, but not a full validation...
> 
> Anyway, the point is that, using JDOM, and standard SAX parsing, the only time you could have had 'all the defaulted attrs' was when you were doing full validation anyway... and that full validation fails.
> 
> So, if you do not do validating, you will not get the 'scheme' attributes, and you will not output the scheme attributes (you do not have them to output...).
> 
> If you do validating, then you have the scheme attributes, and then you can now choose to ignore them on the output with the new Format setting.
> 
> Your particular problem is confusing to me, and there must be something I am missing.... I can't figure out why you think you are getting all the defaulted attributes when it is clear you are not validating...
> 
> So, that is my first issue, and I think it means that you are confused too ;-)
> 
> 
> The second issue with the namespace declarations is also confusing to me. In your example document, every single namespace declaration is essential.... not a single one is 'redundant'.
> 
> Is it possible that it is just a bad example?
> 
> Anyway, at the worst possible case, I have a hack that would probably make you happy, but makes me cringe.... I would rather understand your problem properly before I suggest it.
> 
> Thanks
> 
> Rolf
> 
> 
> On 22/03/2012 4:27 PM, Paul Libbrecht wrote:
>> 
>> Hello list,
>> 
>> Rolf has been so kind to show me how JDOM issue #5 can be run.
>> 
>> So I ran the following snippet:
>> 
>>         SAXBuilder builder = new SAXBuilder(XMLReaders.DTDVALIDATING);
>>         Document doc = builder.build(new URL(args[0]));
>>         Format speconly = Format.getRawFormat();
>>         speconly.setSpecifiedAttributesOnly(true);
>>         XMLOutputter xout = new XMLOutputter(speconly);
>>         xout.output(doc, System.out);
>> 
>> which allows me to parse an XML source with JDOM, make modifications (typically: refactorings), then output it with almost no difference.
>> 
>> The big advantage to that is that the attributes that were not there... are simply not injected from the DTD.
>> This is enormous in some XML editing tradition which uses implied values a lot.
>> 
>> There are two BUTs:
>> 
>> 1) This currently fails if the validation fails and this is a big problem to me.
>> An example file would be the following:
>>   http://svn.activemath.org/LeAM-calculus/LeAM_calculus/oqmath/contin.oqmath
>> which references a DTD nearby. This is a manually edited file.
>> 
>> Removing the validation sadly disables the passing of attribute presence info, it seems.
>> Rolf, is there a way that the attribute presence info is passed but the validation is not stopped?
>> 
>> 
>> 2) namespace declarations, which are kind of attributes, still resurface. They should be avoided if not present ideally. Doable?
>> 
>> Rolf's approach is better than the one I had: mine simply checked in the DTD whether the attribute was provided by it and, if so, suppressed its output, while in Rolf's approach an attribute is output if... it was there, simply!
>> 
>> Thanks for comments.
>> 
>> paul
>