<div>Let's focus on valid character data for XML. How would I do this:</div><div><br></div><div><font face="courier new, monospace">String s = someRandomBytesNowAsString();</font></div><div><br></div><div><font face="courier new, monospace">Element e = new Element("random");</font></div>

<div><font face="courier new, monospace">e.setText(s) or e.addContent(new CDATA(s))</font></div><div><br></div><div>Currently this fails, which seems wrong, because I should be able to send whatever data I want as text in XML content.<br>

<br></div><div>What use is XML (1.0 or 1.1) if I cannot represent arbitrary data? Is the solution to write a custom escaper for my data?</div><div><br></div><div><div><font face="courier new, monospace">e.setText(encodeSpecial(s)) and decodeSpecial(e.getText())</font></div>
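<div><br></div><div>A minimal sketch of what such a pair could look like, assuming Base64 is an acceptable encoding. The class name BinaryText is hypothetical, and the methods take the raw bytes rather than a String, since turning random bytes into a String can already lose data:</div>

```java
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical helper: carries arbitrary bytes as XML-safe Base64 text. */
final class BinaryText {
    private BinaryText() {}

    /** Encodes raw bytes using only A-Z, a-z, 0-9, '+', '/' and '='. */
    static String encodeSpecial(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Reverses encodeSpecial and returns the original bytes. */
    static byte[] decodeSpecial(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Includes NUL, a control character and a non-ASCII byte.
        byte[] raw = {0x00, 0x01, 0x1B, 'A', (byte) 0xFF};
        String safe = encodeSpecial(raw);
        System.out.println(safe);
        System.out.println(Arrays.equals(raw, decodeSpecial(safe)));
    }
}
```

<div>The encoded text contains only letters, digits, '+', '/' and '=', all of which are legal XML characters, so it can be passed to setText. The cost is text that is about a third longer and no longer human-readable.</div>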

<div><br></div><div>Crazy!</div></div><div><br></div><div>Wilf</div><div><br></div><div><br><div class="gmail_quote">On Fri, Sep 7, 2012 at 11:48 AM, Rolf Lear <span dir="ltr"><<a href="mailto:jdom@tuis.net" target="_blank">jdom@tuis.net</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

Hi Wilf.<br>

<br>

You are getting your wires crossed. In your mail you referenced parsed<br>

and external entities. These have nothing to do with PCDATA (parsed<br>

character data - regular XML text) or CDATA (unparsed character data -<br>

<![CDATA[ ... ]]> ).<br>

<br>

Michael was answering your question based on the 'entities', whereas you<br>

want the details on the 'PCDATA' and the 'CDATA'.<br>

<br>

So, forget about the 'entity' references, and focus on the valid character<br>

data for XML.<br>

<br>

The only difference between CDATA (character blocks between <![CDATA[ and<br>

]]>) and PCDATA (element 'text') is that the XML parser treats '<' and '&'<br>

as markup in PCDATA, but not in CDATA.<br>

<br>

With the correct escaping, all CDATA content can be expressed as PCDATA<br>

content.<br>
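<br>

As a rough illustration of "the correct escaping", here is a sketch using<br>

only the JDK. The class and method names are made up for this example;<br>

when JDOM serializes a Text node its outputter handles this for you.<br>

```java
/** Hypothetical helper showing how CDATA-style content can be written as PCDATA. */
final class PcdataEscape {
    private PcdataEscape() {}

    /** Replaces the characters a parser would treat as markup in PCDATA. */
    static String escapePcdata(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '&': out.append("&amp;"); break;
                // Strictly only needed for "]]>", but escaping every '>' is simpler.
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapePcdata("a < b && c")); // prints: a &lt; b &amp;&amp; c
    }
}
```

Note that this only deals with markup characters. It does nothing for<br>

characters that are not legal in XML at all.<br>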

<br>

This does not help you though, because not all Java 'char' values are<br>

valid XML characters (for example, U+0000 to U+0008 and unpaired<br>

surrogates are not), and thus not all chars are valid as either CDATA<br>

or PCDATA.<br>
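<br>

To make that concrete, here is a sketch of the XML 1.0 'Char' production<br>

written as a plain Java check. The class and method names are invented for<br>

this example; it is an illustration, not JDOM's actual Verifier code.<br>

```java
/** Hypothetical helper mirroring the XML 1.0 Char production. */
final class Xml10Chars {
    private Xml10Chars() {}

    /** True if the code point may appear in XML 1.0 character data. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first illegal char, or -1 if the string is clean. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // An unpaired surrogate comes back as itself and fails the range check.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalIndex("plain text"));     // -1: every character is legal
        System.out.println(firstIllegalIndex("bell\u0007here")); // 4: U+0007 is not an XML 1.0 Char
    }
}
```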

<br>

In XML 1.0 this distinction was clear.<br>

<br>

In XML 1.1 I am not certain how to interpret the difference between<br>

'Chars' and 'RestrictedChars': <a href="http://www.w3.org/TR/xml11/#charsets" target="_blank">http://www.w3.org/TR/xml11/#charsets</a><br>
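<br>

For what it is worth, one reading of those two productions is sketched<br>

below: a RestrictedChar is still a Char, but it may only appear as a<br>

character reference such as &#x1; and never literally, while U+0000 is<br>

not allowed in any form. The class, enum and method names are invented<br>

for this example, and the interpretation is an assumption, not a ruling.<br>

```java
/** Hypothetical classifier based on the XML 1.1 Char and RestrictedChar productions. */
final class Xml11Chars {
    private Xml11Chars() {}

    enum Kind { LITERAL, REFERENCE_ONLY, FORBIDDEN }

    /** XML 1.1 Char: everything except U+0000, surrogates, U+FFFE and U+FFFF. */
    static boolean isChar(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** XML 1.1 RestrictedChar: control characters other than tab, LF, CR and NEL. */
    static boolean isRestrictedChar(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    /** Says how, if at all, a code point may be written in an XML 1.1 document. */
    static Kind classify(int cp) {
        if (!isChar(cp)) {
            return Kind.FORBIDDEN;
        }
        return isRestrictedChar(cp) ? Kind.REFERENCE_ONLY : Kind.LITERAL;
    }

    public static void main(String[] args) {
        System.out.println(classify(0x0));  // FORBIDDEN
        System.out.println(classify(0x1B)); // REFERENCE_ONLY
        System.out.println(classify('A'));  // LITERAL
    }
}
```

Either way, a raw NUL byte or an unpaired surrogate cannot be stored as<br>

text under 1.0 or 1.1.<br>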

<br>

JDOM takes an XML 1.0 view of characters, which may be a problem, but<br>

supporting 1.1 characters would not solve your issue either.<br>

<br>

Rolf<br>

<br>

<br>

<br>

<br>

On Fri, 7 Sep 2012 08:45:33 -0700, Canadian Wilf <<a href="mailto:canwilf@gmail.com">canwilf@gmail.com</a>><br>

wrote:<br>

> Then what is the proper mode:<br>

><br>

> Element e = new Element("foo");<br>

><br>

> Should I do this:<br>

><br>

> e.setText(string_of_sanitized_data_with_illegal_characters_escaped);<br>

><br>

> or<br>

><br>

> e.setText(any_text);<br>

><br>

><br>

> Wilf<br>

><br>

><br>

> On Fri, Sep 7, 2012 at 6:05 AM, Michael Kay <<a href="mailto:mike@saxonica.com">mike@saxonica.com</a>> wrote:<br>

><br>

>>  No, that's all wrong. The contents of an unparsed entity are always an<br>

>> external resource, they are never part of a text or attribute node.<br>

>> Parsed entities do become part of the content, but they must always use<br>

>> the XML character set.<br>

>><br>

>> Michael Kay<br>

>> Saxonica<br>

>><br>

>> On 07/09/2012 13:10, Canadian Wilf wrote:<br>

>><br>

>> According to the xml 1.1 spec:<br>

>><br>

>>  4 Physical Structures ...<br>

>>> [Definition: An *unparsed entity* is a resource whose contents may or<br>

>>> may not be text <<a href="http://www.w3.org/TR/xml11/#dt-text" target="_blank">http://www.w3.org/TR/xml11/#dt-text</a>>, and if text, may<br>

>>> be other than XML. Each unparsed entity has an associated<br>

>>> notation <<a href="http://www.w3.org/TR/xml11/#dt-notation" target="_blank">http://www.w3.org/TR/xml11/#dt-notation</a>>,<br>

>>> identified by name. Beyond a requirement that an XML processor make the<br>

>>> identifiers for the entity and notation available to the application,<br>

>>> XML places no constraints on the contents of unparsed entities.]<br>

>><br>

>><br>

>><br>

>>  AND<br>

>><br>

>>  Entities may be either parsed or unparsed. [Definition: The contents of<br>

>>> a *parsed entity* are referred to as its replacement<br>

>>> text <<a href="http://www.w3.org/TR/xml11/#dt-repltext" target="_blank">http://www.w3.org/TR/xml11/#dt-repltext</a>>;<br>

>>> this text <<a href="http://www.w3.org/TR/xml11/#dt-text" target="_blank">http://www.w3.org/TR/xml11/#dt-text</a>> is considered an<br>

>>> integral part of the document.]<br>

>><br>

>> [Definition: An *unparsed entity* is a resource whose contents may or may<br>

>>> not be text <<a href="http://www.w3.org/TR/xml11/#dt-text" target="_blank">http://www.w3.org/TR/xml11/#dt-text</a>>, and if text, may be<br>

>>> other than XML. Each unparsed entity has an associated<br>

>>> notation <<a href="http://www.w3.org/TR/xml11/#dt-notation" target="_blank">http://www.w3.org/TR/xml11/#dt-notation</a>>,<br>

>>> identified by name. Beyond a requirement that an XML processor make the<br>

>>> identifiers for the entity and notation available to the application,<br>

>>> XML places no constraints on the contents of unparsed entities.]<br>

>>> Parsed entities are invoked by name using entity references; unparsed<br>

>>> entities by name, given in the value of *ENTITY* or *ENTITIES*<br>

>>>  attributes.<br>

>><br>

>><br>

>><br>

>>  In the current JDOM version, the Element methods setText(String) and<br>

>> addContent(CDATA) refuse text that contains illegal characters. JDOM is<br>

>> treating the data provided as 'parsed' when by the spec it should be<br>

>> treating it as free content.<br>

>><br>

>>  I understand:<br>

>><br>

>>   1) The XML 1.1 spec defines a parsed entity as its 'replacement text'.<br>

>><br>

>>  2) 'Replacement text' would refer to the actual textual makeup of a<br>

>> serialized Element, not the data an Element holds in a Text content<br>

>> element.<br>

>><br>

>><br>

>>  Then, if the above is true, the current implementation is actually wrong<br>

>> to verify data.<br>

>><br>

>>  I propose that JDOM stop verifying data set as Element text and CDATA<br>

>> and leave it to the xerces (or whatever) to make sure the document is<br>

>> proper 1.1.<br>

>><br>

>>  Am I understanding everything correctly?<br>

>><br>

>>  Thoughts?<br>

>><br>

>>  ---------- Forwarded message ----------<br>

>> From: Canadian Wilf <<a href="mailto:canwilf@gmail.com">canwilf@gmail.com</a>><br>

>> Date: Thu, Sep 6, 2012 at 9:52 PM<br>

>> Subject: XML 1.1 -- Please stab me with a dull knife and trample my dead<br>

>> body<br>

>> To: <a href="mailto:jdom-interest@jdom.org">jdom-interest@jdom.org</a><br>

>><br>

>><br>

>> Hi All,<br>

>><br>

>>  I just learned that in order to safely use JDOM2, I will need to<br>

>> sanitize my Element .setText(string) so that the parsed data does not<br>

>> contain verboten characters under the XML 1.1 spec.<br>

>><br>

>>  I have an ascii processor and it needs to be able to use xml as a<br>

>> document format. Unfortunately, not all ascii is allowed in an Element<br>

>> text.<br>

>><br>

>>  Stab me with a dull knife and trample my dead body. But ..... please<br>

>> please please don't make me sanitize all my data before putting it into<br>

>> XML<br>

>> Elements.<br>

>><br>

>>  1) It makes my programming task much more cumbersome because I must<br>

>> ensure not to feed any of the new verboten and doomed ascii/UTF-8<br>

>> characters to store as xml text.<br>

>><br>

>> 2) No one uses xml 1.1, do they?<br>

>><br>

>>  3) It slows down the parsing (a very small amount) with all the element<br>

>> text checking.<br>

>><br>

>>  Now that JDOM2 is XML 1.1 compatible, is there any turning back? Can<br>

>> this be undone?<br>

>><br>

>>  Does everyone understand that their software will bust if data provided<br>

>> as text is not adhering to the new standard?<br>

>><br>

>>  What about you? How do you deal with it when using the libraries?<br>

>><br>

>>  Wilf<br>

>><br>

>><br>

>><br>

>> _______________________________________________<br>

>> To control your jdom-interest membership:<br>

>> <a href="http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com" target="_blank">http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com</a><br>

>><br>

>><br>

>><br>


>><br>

</blockquote></div><br></div>