Thanks Rolf. That actually does clarify a few things. Especially:<div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

a big difference between XML 1.0 and 1.1 is that the Char dataset for 1.1 is larger than 1.0 (it includes [#x1-#xD7FF] instead of 'just' #x9 | #xA | #xD | [#x20-#xD7FF] )</blockquote><div><br></div><div>I am slowly becoming 'one' with the xml  :)</div>

<div><div><br></div><div>Wilf</div></div></div><br><div class="gmail_quote">On Fri, Sep 7, 2012 at 4:29 PM, Rolf Lear <span dir="ltr"><<a href="mailto:jdom@tuis.net" target="_blank">jdom@tuis.net</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">So, I have been studying up on the Chars and RestrictedChars in the XML 1.1 spec.<br>

<br>

My personal feeling is that the RestrictedChars mechanism for specifying the document format is somewhat complicated, but I now believe I have 'grokked' it. It all boils down to these four constraints:<br>

<br>

1. There are two sets of Characters defined for XML:<br>

<br>

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]<br>

RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]<br>

<br>

RestrictedChar is a subset of Char<br>
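
<br>

A minimal Java sketch of the two productions above, assuming code points are passed as int values. The class and method names are made up for illustration and are not part of JDOM or any other library. The main method simply brute-forces the subset claim.<br>

```java
// Sketch of the two XML 1.1 character classes quoted above.
// Xml11Chars, isChar and isRestrictedChar are hypothetical names.
final class Xml11Chars {

    // Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isChar(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
    static boolean isRestrictedChar(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    public static void main(String[] args) {
        // Every RestrictedChar must also be a Char (the subset claim).
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (isRestrictedChar(cp) && !isChar(cp)) {
                throw new AssertionError("not a subset at U+" + Integer.toHexString(cp));
            }
        }
        System.out.println("RestrictedChar is a subset of Char");
    }
}
```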

<br>

2. A well-formed XML 1.1 *unparsed* document is defined as:<br>

<br>

document ::= ( prolog element Misc* ) - ( Char* RestrictedChar Char* )<br>

<br>

3. prolog, element, and Misc are all (indirectly) constrained to 'Char' based characters.<br>

<br>

4. Character and entity references must resolve to data from the 'Char' set: <a href="http://www.w3.org/TR/xml11/#sec-references" target="_blank">http://www.w3.org/TR/xml11/#sec-references</a><br>

<br>

Based on the four statements above it is apparent that a valid document consists of a prolog (which may be empty), an element (which must exist), followed by optional comments, PIs, and whitespace. Further, no restricted chars are allowed anywhere in the *unparsed* document.<br>


<br>

But, a big difference between XML 1.0 and 1.1 is that the Char dataset for 1.1 is larger than 1.0 (it includes [#x1-#xD7FF] instead of 'just' #x9 | #xA | #xD | [#x20-#xD7FF] )<br>

<br>

So, XML 1.1 includes all the low-value control characters except #x0, but it *restricts* them (other than #x9, #xA, and #xD) from appearing *raw* in the unparsed document. It goes even further and also restricts the following chars in the *unparsed* document: [#x7F-#x84] | [#x86-#x9F].<br>
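
<br>

To make the 1.0 versus 1.1 difference concrete, here is a rough Java sketch that answers "may this code point appear literally in the document text?" for each version. The 1.0 Char production is taken from the XML 1.0 spec; all class and method names are hypothetical.<br>

```java
// Sketch: which code points may appear literally (unescaped) in a document.
// RawCharRules and its methods are hypothetical names for this example.
final class RawCharRules {

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isChar10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isChar11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1 RestrictedChar: [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
    static boolean isRestricted11(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    // In 1.0 every Char may appear raw.
    static boolean rawAllowed10(int cp) {
        return isChar10(cp);
    }

    // In 1.1 a Char may appear raw only if it is not a RestrictedChar.
    static boolean rawAllowed11(int cp) {
        return isChar11(cp) && !isRestricted11(cp);
    }

    public static void main(String[] args) {
        int[] samples = {0x1, 0x9, 0x7F, 0x85, 0x9F};
        for (int cp : samples) {
            System.out.printf("U+%04X raw in 1.0: %b, raw in 1.1: %b, Char in 1.1: %b%n",
                cp, rawAllowed10(cp), rawAllowed11(cp), isChar11(cp));
        }
    }
}
```

Under these definitions U+007F may appear raw in a 1.0 document, but in a 1.1 document it can only be written as a character reference.<br>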


<br>

In XML 1.1 though, you can use a character reference such as &amp;#x1; to include these restricted chars.<br>
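
<br>

As an illustration only, a serializer targeting XML 1.1 could write restricted chars as hexadecimal character references along these lines. This is a sketch with hypothetical names, not how JDOM or any parser actually does it, and it ignores details such as line-ending normalization.<br>

```java
// Sketch: escape element text for an XML 1.1 document.
// Xml11Escaper and escapeText are hypothetical names for this example.
final class Xml11Escaper {

    // XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isChar11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1 RestrictedChar: [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
    static boolean isRestricted11(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!isChar11(cp)) {
                // 0x0, unpaired surrogates, 0xFFFE and 0xFFFF cannot be expressed at all.
                throw new IllegalArgumentException(
                    "U+" + Integer.toHexString(cp).toUpperCase() + " is not an XML 1.1 Char");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (isRestricted11(cp)) {
                // Restricted chars must not appear raw, so write a character reference.
                out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("bell:\u0007 del:\u007F a<b"));
    }
}
```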

<br>

Unfortunately for you, Wilf, XML 1.1 still makes the following Java char values illegal as XML characters: 0x0000, unpaired surrogates in 0xD800-0xDFFF, 0xFFFE, and 0xFFFF<br>
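
<br>

For what it is worth, a pre-check over a Java String against the XML 1.1 Char production quoted earlier could look roughly like the following. The names are hypothetical; this is a standalone sketch and not JDOM's own verification code.<br>

```java
// Sketch: locate the first Java char that cannot be represented in XML 1.1.
// XmlCharScan and firstIllegalIndex are hypothetical names for this example.
final class XmlCharScan {

    // XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isChar11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first char that starts an illegal code point, or -1.
    static int firstIllegalIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            // codePointAt combines a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as its own value and fails isChar11.
            int cp = s.codePointAt(i);
            if (!isChar11(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalIndex("plain text"));
        System.out.println(firstIllegalIndex("ab\u0000cd"));
    }
}
```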

<br>

<br>

JDOM 2.x follows JDOM 1.x and allows the set of characters defined for XML 1.0.<br>

<br>

This is likely a problem. Unfortunately, JDOM cannot easily infer whether it is working with an XML 1.0 or 1.1 document.<br>

<br>

Perhaps this needs some thought.<br>

<br>

Rolf<br>

<br>

<br>

<br>

<br>

On 07/09/2012 2:48 PM, Rolf Lear wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Hi Wilf.<br>

<br>

You are getting your wires crossed..... In your mail you referenced parsed<br>

and external entities. These have nothing to do with PCDATA (parsed<br>

character data - regular XML text), and CDATA (unparsed character data -<br>

&lt;![CDATA[ ... ]]&gt;)<br>

<br>

Michael was answering your question based on the 'entities', whereas you<br>

want the details on the 'PCDATA' and the 'CDATA'.<br>

<br>

So, forget about the 'entity' references, and focus on the valid character<br>

data for XML.<br>

<br>

The only difference between CDATA (character blocks between &lt;![CDATA[ and<br>

]]&gt;) and PCDATA (element 'text') is that the XML parser will look for<br>

'&lt;' and '&amp;' characters in PCDATA, but not in CDATA.<br>

<br>

With the correct escaping, all CDATA content can be expressed as PCDATA<br>

content.<br>
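
<br>

A small sketch of that escaping in Java, with hypothetical names: the three markup-significant characters are replaced by the predefined entity references, so the result can be placed in ordinary element text. The greater-than sign is escaped as well so that the sequence that closes a CDATA section never appears literally.<br>

```java
// Sketch: rewrite the content of a CDATA section as escaped PCDATA.
// CdataToPcdata and toPcdata are hypothetical names for this example.
final class CdataToPcdata {

    static String toPcdata(String cdataContent) {
        StringBuilder out = new StringBuilder(cdataContent.length() + 16);
        for (int i = 0; i < cdataContent.length(); i++) {
            char c = cdataContent.charAt(i);
            switch (c) {
                case '<':
                    out.append("&lt;");
                    break;
                case '&':
                    out.append("&amp;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                default:
                    // Everything else is copied through unchanged.
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPcdata("if (a < b && c) { x = 1; }"));
    }
}
```

Note that this only deals with markup characters; it does nothing for characters that are illegal in XML altogether.<br>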

<br>

This does not help you though, because not all Java 'char' values are<br>

valid XML characters, and thus not all chars are valid as either CDATA<br>

or PCDATA.<br>

<br>

In XML 1.0 this distinction was clear.<br>

<br>

In XML 1.1 I am not certain how to interpret the difference between<br>

'Chars' and 'RestrictedChars': <a href="http://www.w3.org/TR/xml11/#charsets" target="_blank">http://www.w3.org/TR/xml11/#charsets</a><br>

<br>

JDOM takes a 1.0 perspective on Characters... which may be a problem, but<br>

it is not going to solve your issues even if it supports 1.1 chars.<br>

<br>

Rolf<br>

<br>

<br>

<br>

<br>

On Fri, 7 Sep 2012 08:45:33 -0700, Canadian Wilf <<a href="mailto:canwilf@gmail.com" target="_blank">canwilf@gmail.com</a>><br>

wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Then what is the proper mode:<br>

<br>

Element e = new Element("foo");<br>

<br>

Should I do this:<br>

<br>

e.setText(string_of_sanitized_data_with_illegal_characters_escaped);<br>

<br>

or<br>

<br>

e.setText(any_text);<br>

<br>

<br>

Wilf<br>

<br>

<br>

On Fri, Sep 7, 2012 at 6:05 AM, Michael Kay <<a href="mailto:mike@saxonica.com" target="_blank">mike@saxonica.com</a>> wrote:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  No, that's all wrong. The contents of an unparsed entity are always an<br>

external resource; they are never part of a text or attribute node. Parsed<br>

entities do become part of the content, but they must always use the XML<br>

character set.<br>

<br>

Michael Kay<br>

Saxonica<br>

<br>

On 07/09/2012 13:10, Canadian Wilf wrote:<br>

<br>

According to the xml 1.1 spec:<br>

<br>

  4 Physical Structures ...<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

[Definition: An *unparsed entity* is a resource whose contents may or<br>

may not be text &lt;<a href="http://www.w3.org/TR/xml11/#dt-text" target="_blank">http://www.w3.org/TR/xml11/#dt-text</a>&gt;, and if text, may<br>

be other than XML. Each unparsed entity has an associated<br>

notation &lt;<a href="http://www.w3.org/TR/xml11/#dt-notation" target="_blank">http://www.w3.org/TR/xml11/#dt-notation</a>&gt;,<br>

identified by name. Beyond a requirement that an XML processor make the<br>

identifiers for the entity and notation available to the application, XML<br>

places no constraints on the contents of unparsed entities.]<br>

</blockquote>

<br>

<br>

<br>

  AND<br>

<br>

  Entities may be either parsed or unparsed.<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

[Definition: The contents of a *parsed entity* are referred to as its replacement<br>

text &lt;<a href="http://www.w3.org/TR/xml11/#dt-repltext" target="_blank">http://www.w3.org/TR/xml11/#dt-repltext</a>&gt;;<br>

this text &lt;<a href="http://www.w3.org/TR/xml11/#dt-text" target="_blank">http://www.w3.org/TR/xml11/#dt-text</a>&gt; is considered an<br>

integral part of the document.]<br>

</blockquote>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

[Definition: An *unparsed entity* is a resource whose contents may or may<br>

not be text &lt;<a href="http://www.w3.org/TR/xml11/#dt-text" target="_blank">http://www.w3.org/TR/xml11/#dt-text</a>&gt;, and if text, may be<br>

other than XML. Each unparsed entity has an associated<br>

notation &lt;<a href="http://www.w3.org/TR/xml11/#dt-notation" target="_blank">http://www.w3.org/TR/xml11/#dt-notation</a>&gt;,<br>

identified by name. Beyond a requirement that an XML processor make the<br>

identifiers for the entity and notation available to the application, XML<br>

places no constraints on the contents of unparsed entities.]<br>

Parsed entities are invoked by name using entity references; unparsed<br>

entities by name, given in the value of *ENTITY* or *ENTITIES*<br>

  attributes.<br>

</blockquote>

<br>

<br>

<br>

  In the current JDOM version, the Element method setText(string) and also<br>

addContent(CDATA) refuse text that contains illegal characters. It is<br>

treating the data provided as 'parsed' when, by the spec, it should be<br>

treating it as free content.<br>

<br>

  I understand:<br>

<br>

   1) The xml 1.1 spec defines a parsed entity as its 'replacement text'.<br>

<br>

  2) 'Replacement text' would refer to the actual textual makeup of a<br>

serialized Element, not the data an Element holds in a Text content<br>

element<br>

<br>

<br>

  Then, if the above is true, the current implementation is actually wrong<br>

to verify data.<br>

<br>

  I propose that JDOM stop verifying data set as Element text and CDATA<br>

and leave it to the xerces (or whatever) to make sure the document is<br>

proper 1.1.<br>

<br>

  Am I understanding everything correctly?<br>

<br>

  Thoughts?<br>

<br>

  ---------- Forwarded message ----------<br>

From: Canadian Wilf <<a href="mailto:canwilf@gmail.com" target="_blank">canwilf@gmail.com</a>><br>

Date: Thu, Sep 6, 2012 at 9:52 PM<br>

Subject: XML 1.1 -- Please stab me with a dull knife and trample my dead body<br>

To: <a href="mailto:jdom-interest@jdom.org" target="_blank">jdom-interest@jdom.org</a><br>

<br>

<br>

Hi All,<br>

<br>

  I just learned that in order to safely use JDOM2, I will need to<br>

sanitize my Element .setText(string) so that the parsed data does not<br>

contain verboten characters under the XML 1.1 spec.<br>

<br>

  I have an ascii processor and it needs to be able to use xml as a<br>

document format. Unfortunately, not all ascii is allowed in an Element<br>

text.<br>

<br>

  Stab me with a dull knife and trample my dead body. But ... please<br>

please please don't make me sanitize all my data before putting it into XML<br>

Elements.<br>

<br>

  1) It makes my programming task much more cumbersome because I must<br>

ensure not to feed any of the new verboten and doomed ascii/UTF-8<br>

characters to store as xml text.<br>

<br>

2) No one uses xml 1.1, do they?<br>

<br>

  3) It slows down the parsing (a very small amount) with all the element<br>

text checking.<br>

<br>

  Now that JDOM2 is xml 1.1 compatible, is there any turning back? Can<br>

this be undone?<br>

<br>

  Does everyone understand that their software will bust if data provided<br>

as text is not adhering to the new standard?<br>

<br>

  What about you? How do you deal with it when using the libraries?<br>

<br>

  Wilf<br>

<br>

<br>

<br>

_______________________________________________<br>

To control your jdom-interest membership:<br>

<a href="http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com" target="_blank">http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com</a><br>

<br>

<br>

<br>


<br>

</blockquote></blockquote>


<br>

</blockquote>

<br>

</blockquote></div><br>