[jdom-interest] StAXTextModifier.java (removing indentation white
space with StAXBuilder)
cowtowncoder at yahoo.com
Mon Dec 13 21:39:02 PST 2004
[related to my earlier email regarding new
functionality of StAXBuilder class; ignore if not
interested in the implementation details]
After thinking about what is really needed to
heuristically remove white space used for indentation
purposes, I realized there are really 2 pieces needed:
(a) Context, at least preceding and following events
that surround the (all white space) text segment, and
preferably some information about nesting.
(b) The text segment itself; it needs to be all white space,
and probably also contain (or start with) a linefeed.
Having this information allows for enough granularity;
for example, knowing that a text segment is inside
<pre> tags in (X)HTML allows leaving such white space
untouched. Or, knowing that a non-empty text segment
that starts and ends with white space is surrounded
by start and end tags allows trimming that
indentation, even if the text itself is not all white
space.
The problem in passing such information from streaming
parsers (SAX, StAX) is that only one event at a time
is really accessible. To overcome this problem, it's
important to be able to do (limited) lookahead. Good
thing is, it's quite easy to do, at least with StAX...
and here's the API I created to allow passing the
information with 3 separate (but related) methods:
allowModificationsAfter is called for all start
and end tag events; it sets the modification mode, so
that if the call returns false, no modifications will be
done; if it returns true, modifications are allowed.
Thus, a modifier object could return false when
encountering a <pre> tag, and true otherwise (or better,
keep a stack if more elaborate logic is needed).
When modifications are allowed, possiblyModifyText is
called for all text segments (including CDATA). At
this point the modifier can check the text event and
figure out if it wants to modify it; the method gets
information about the type of the immediately preceding
event (but not the following one, as that's not yet known).
And finally, if the above method returns true,
textToIncludeBetween is called (with information about
the event that follows the text segment, which the
stream points to at this point), to allow for the actual
modification, up to and including removal of the text.
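
The three callbacks described above could be sketched roughly like this.
It is only an illustration, not the actual StAXTextModifier source: the
class names (TextModifierSketch, PreAwareModifier, ModifierDriverSketch),
the parameter lists, and the convention that returning null means "remove
the text" are all assumptions. The driver shows the one-event lookahead:
it holds a text segment, advances the stream, and only then asks the
modifier what to include.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical shape of the three callbacks; real signatures may differ. */
abstract class TextModifierSketch {
    /** Called at every start/end tag; the result becomes the new mode. */
    abstract boolean allowModificationsAfter(XMLStreamReader r, int eventType);
    /** Called at a text event; only the preceding event type is known. */
    abstract boolean possiblyModifyText(XMLStreamReader r, int prevEvent);
    /** Called once the following event is known; null removes the text. */
    abstract String textToIncludeBetween(String text, int prevEvent, int nextEvent);
}

/** Leaves text inside <pre> alone; elsewhere drops all-white-space text. */
class PreAwareModifier extends TextModifierSketch {
    private int preDepth = 0; // a counter stands in for a full element stack

    @Override
    boolean allowModificationsAfter(XMLStreamReader r, int eventType) {
        if ("pre".equals(r.getLocalName())) {
            preDepth += (eventType == XMLStreamConstants.START_ELEMENT) ? 1 : -1;
        }
        return preDepth == 0;
    }

    @Override
    boolean possiblyModifyText(XMLStreamReader r, int prevEvent) {
        return r.getText().trim().isEmpty();
    }

    @Override
    String textToIncludeBetween(String text, int prevEvent, int nextEvent) {
        return null; // assumed convention: null means "remove"
    }
}

class ModifierDriverSketch {
    /** Returns the text segments that survive the given modifier. */
    static List<String> keptText(String xml, TextModifierSketch mod) {
        List<String> kept = new ArrayList<>();
        try {
            XMLInputFactory f = XMLInputFactory.newInstance();
            f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
            boolean allow = true;
            int prev = XMLStreamConstants.START_DOCUMENT;
            int type = r.next();
            while (true) {
                if (type == XMLStreamConstants.START_ELEMENT
                        || type == XMLStreamConstants.END_ELEMENT) {
                    allow = mod.allowModificationsAfter(r, type);
                } else if (type == XMLStreamConstants.CHARACTERS
                        || type == XMLStreamConstants.CDATA
                        || type == XMLStreamConstants.SPACE) {
                    String text = r.getText();
                    boolean candidate = allow && mod.possiblyModifyText(r, prev);
                    int next = r.next(); // lookahead: text is never the last event
                    if (candidate) {
                        text = mod.textToIncludeBetween(text, prev, next);
                    }
                    if (text != null) {
                        kept.add(text);
                    }
                    prev = type;
                    type = next; // the looked-ahead event is handled next round
                    continue;
                }
                if (!r.hasNext()) {
                    break;
                }
                prev = type;
                type = r.next();
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return kept;
    }
}
```

Calling ModifierDriverSketch.keptText(xml, new PreAwareModifier()) on a
small XHTML fragment should keep the white space inside <pre> and drop
the indentation elsewhere. A real builder would create nodes instead of
collecting strings.
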
A simple example of an implementation of this abstract
class can be found in
StAXBuilder.IndentationRemover; it's a trivially simple
class that only removes all-white space text segments
that start with a linefeed, between any other nodes
(not just around start/end element pairs).
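
The rule that IndentationRemover is described as applying can be written
down in a few lines. This is a standalone sketch of that rule only, not
the class from StAXBuilder: the name IndentationRuleSketch, its two
methods, and the use of null for "remove" are assumptions, and it treats
only '\n' as a linefeed on the assumption that the parser has already
normalized line endings.

```java
class IndentationRuleSketch {
    /** True if text is all XML white space and starts with a linefeed. */
    static boolean isIndentation(String text) {
        if (text.isEmpty() || text.charAt(0) != '\n') {
            return false;
        }
        for (int i = 1; i < text.length(); i++) {
            char c = text.charAt(i);
            // XML white space is limited to space, tab, CR and LF
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                return false;
            }
        }
        return true;
    }

    /** Returns null (assumed to mean "remove") for indentation, else the text. */
    static String filter(String text) {
        return isIndentation(text) ? null : text;
    }
}
```

Because the rule ignores the surrounding events, it applies between any
two nodes, which matches the "not just around start/end element pairs"
behaviour described above.
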
If anyone finds this approach interesting, I'd like to
hear about other interesting use cases... right now I
think this could be useful, but don't really have
specific additional use cases myself. :-)
I guess it (or a similar approach) could be used for
simple text filtering, for feeding something like the
Lucene text indexer.
-+ Tatu +-