[jdom-interest] setIgnoringElementContentWhitespace inoperant ?

Thu Dec 9 10:27:47 PST 2004

On Thu, 2004-12-09 at 11:59, Bradley S. Huffman wrote:

> Ken Roberts writes:
> 
> > On Thu, 2004-12-09 at 06:38, Elliotte Harold wrote:
> > 
> > > setIgnoringAllWhitespace()  is the wrong name for this functionality. Do 
> > > you really want to throw away all white space? 
> > > Eveninrecordlikedocumentsthiscouldbeveryhardtoread. I think what you 
> > > really want to do is throw away all text nodes that consist of white 
> > > space exclusively, but retain all white space in text nodes that contain 
> > >   any non-whitespace characters. The correct name for this method would 
> > > be setIgnoringBoundaryWhitespace(). The functionality proposed is fine. 
> > > I just want to make sure we get the name right.
> > 
> > 
> > What something like this should do is convert an infinite amount of
> > whitespace in a single instance into a single space.  Not sure about
> > "middle" text, but an equivalent of String.trim() would probably be OK
> > anywhere if you choose this option. Keep in mind that it's an OPTION
> > rather than a change in default behavior.
> 
> You have to be careful when trimming whitespace or something like
> 
>     <p>This is a 
>               <i>   test</i>
>        sentence.   </p>
> 
> could end up as
> 
>     <p>This is a<i>test</i>sentence.</p>
> 
> which may or may not be what is really desired.
> 
> Brad
> 

That's true.  I'm not sure how the parsing works in JDOM, but if I were
writing a C or Java parser, the tokenizer would still separate all the
tokens correctly even with the shortened string.
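
To make Brad's caution concrete, here is a minimal sketch in plain Java
(not JDOM code; the class and method names are made up for
illustration). It treats the three text nodes of his example as plain
strings and applies String.trim() to each one separately, which is what
a naive "trim everywhere" option would amount to.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it only models text nodes as strings.
class TrimDemo {

    // Trim every text node on its own, then glue the results together,
    // i.e. the text that is left once the markup is stripped.
    static String trimPerNode(List<String> nodes) {
        StringBuilder sb = new StringBuilder();
        for (String node : nodes) {
            sb.append(node.trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three text nodes of <p>This is a <i>   test</i> sentence.   </p>
        List<String> nodes = Arrays.asList(
                "This is a \n              ",
                "   test",
                "\n       sentence.   ");
        System.out.println(trimPerNode(nodes)); // prints: This is atestsentence.
    }
}
```

The words run together because each node is trimmed without any
knowledge of its neighbours.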

What I was getting at is that if I were to choose a method named as the
one being discussed, my intent would be to minimize whitespace.  In
other words, I would care that there was whitespace between two tokens,
just not how much.

One could convert each sequence of whitespace into a single space, but
then when you parse your example above you would get:

<p>This is a <i> test</i> sentence. </p>

Once the italics markup is taken care of, there would still be two
spaces between "a" and "test".  If one were converting to HTML this
would not matter in the least, but if you're parsing a document and
expect every stretch of whitespace to be exactly one space, you would
not get the correct result.
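
A small sketch of that difference, again in plain Java rather than JDOM
(the class and method names are hypothetical). It collapses whitespace
runs with the regex \s+ in two ways: node by node, and over the
concatenated text. Only the second sees the whitespace on both sides of
the <i> boundary as one run.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; it only models text nodes as strings.
class WhitespaceDemo {

    private static final Pattern WS_RUN = Pattern.compile("\\s+");

    // Replace every run of whitespace with a single space.
    static String collapse(String text) {
        return WS_RUN.matcher(text).replaceAll(" ");
    }

    // Collapse each text node separately, then concatenate.
    static String collapsePerNode(List<String> nodes) {
        StringBuilder sb = new StringBuilder();
        for (String node : nodes) {
            sb.append(collapse(node));
        }
        return sb.toString();
    }

    // Concatenate first, then collapse across node boundaries.
    static String collapseAcrossNodes(List<String> nodes) {
        return collapse(String.join("", nodes));
    }

    public static void main(String[] args) {
        // The three text nodes of <p>This is a <i>   test</i> sentence.   </p>
        List<String> nodes = Arrays.asList(
                "This is a \n              ",
                "   test",
                "\n       sentence.   ");
        // Brackets make leading and trailing spaces visible.
        System.out.println("[" + collapsePerNode(nodes) + "]");     // [This is a  test sentence. ]
        System.out.println("[" + collapseAcrossNodes(nodes) + "]"); // [This is a test sentence. ]
    }
}
```

The per-node version leaves two spaces between "a" and "test" (one from
each side of the tag), plus a trailing space since nothing is trimmed.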
