[jdom-interest] special characters and JDOM

Joseph Bowbeer jozart at csi.com
Sun Jul 22 00:39:05 PDT 2001


Andrew Freeman writes:

> I am trying to use JDOM to parse an XML file that contains
> the following character:  '–'.  However, I am getting a parsing
> error indicating that that Unicode character is invalid.

What you have there is an "en dash".  In CP1252 (the Windows Latin-1 code
page) it is the single byte 150 (0x96).  In Unicode it is code point 8211
(U+2013).
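To see both numbers side by side, here is a minimal sketch using only the JDK.  The class and method names (EnDashCodes, byteValues) are made up for illustration, and it assumes your JVM ships the windows-1252 charset, which the standard JDKs do as far as I know.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EnDashCodes {
    // Returns the unsigned byte values of s when encoded in the given charset.
    static int[] byteValues(String s, Charset cs) {
        byte[] raw = s.getBytes(cs);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // Java bytes are signed; mask to 0..255
        }
        return out;
    }

    public static void main(String[] args) {
        String enDash = "\u2013"; // en dash, written as an escape to avoid source-encoding issues
        System.out.println((int) enDash.charAt(0)); // 8211
        System.out.println(Arrays.toString(
                byteValues(enDash, Charset.forName("windows-1252")))); // [150]
        System.out.println(Arrays.toString(
                byteValues(enDash, StandardCharsets.UTF_8))); // [226, 128, 147]
    }
}
```

The same character is one byte in CP1252 and three bytes in UTF-8, which is why the encoding a file was saved in matters to the parser.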

The root of your troubles is that an ASCII "-" (45) is used to represent a
minus sign *and* several different flavors of dashes and hyphens.  In
Unicode each of these is given a unique code.
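As a rough illustration of "a unique code" for each flavor, the sketch below prints a few of the dash-like characters with their Unicode values.  The class name DashFamily and the helper describe are hypothetical, and the list of characters is my own selection, not an exhaustive one.

```java
class DashFamily {
    // Formats a code point as "U+XXXX (decimal)".
    static String describe(int codePoint) {
        return String.format("U+%04X (%d)", codePoint, codePoint);
    }

    public static void main(String[] args) {
        int[] dashes = {
            0x002D, // the ASCII "-" that does duty for everything
            0x2010, // hyphen
            0x2013, // en dash
            0x2014, // em dash
            0x2212  // minus sign
        };
        for (int cp : dashes) {
            // Character.getName returns the Unicode character name for the code point.
            System.out.println(describe(cp) + "  " + Character.getName(cp));
        }
    }
}
```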

If you really want one of those narrow dashes in your document, I recommend
writing it as the numeric character reference "&#8211;"
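A quick way to convince yourself that the reference survives parsing is the sketch below.  It uses the JDK's built-in JAXP DOM parser rather than JDOM so that it runs without extra jars; I'd expect JDOM's SAXBuilder to behave the same way since it sits on top of a SAX parser, but that part is an assumption.  CharRefDemo and rootText are illustrative names only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class CharRefDemo {
    // Parses a pure-ASCII XML string and returns the text of its root element.
    static String rootText(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.US_ASCII);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // The file itself contains only ASCII; the parser expands the reference.
        String text = rootText("<range>1&#8211;2</range>");
        System.out.println((int) text.charAt(1)); // 8211
    }
}
```

Because the document stays plain ASCII, it parses the same way whatever encoding the editor or the parser assumes.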

You could try, instead, declaring your document's encoding as cp1252 in the
XML declaration.  (Does this work?)
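One way to test that idea is the sketch below, again with the JDK's bundled JAXP parser rather than JDOM.  It builds a tiny document whose only content is the raw byte 150, once with an encoding declaration and once without.  Parsers are only required to support UTF-8 and UTF-16, so whether a given parser accepts "windows-1252" is something to verify, not a guarantee.  EncodingDeclDemo, doc, and parse are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class EncodingDeclDemo {
    // Builds <d>X</d> where X is the single raw byte 0x96 (150),
    // optionally preceded by an XML declaration naming windows-1252.
    static byte[] doc(boolean declareEncoding) {
        String head = declareEncoding
                ? "<?xml version=\"1.0\" encoding=\"windows-1252\"?>"
                : "";
        byte[] open = (head + "<d>").getBytes(StandardCharsets.US_ASCII);
        byte[] close = "</d>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(open, 0, open.length);
        out.write(0x96);
        out.write(close, 0, close.length);
        return out.toByteArray();
    }

    // Returns the root element's text, or null if the parser rejects the input.
    static String parse(byte[] bytes) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String declared = parse(doc(true));
        System.out.println(declared == null ? "rejected" : "code " + (int) declared.charAt(0));
        // Without a declaration the parser assumes UTF-8, where a lone 0x96 is malformed.
        System.out.println(parse(doc(false)) == null ? "rejected" : "accepted");
    }
}
```

The undeclared case is probably what produces the "invalid character" error in the first place: the file was saved as CP1252 but is being read as UTF-8.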

You should be safe if you "&entify;" (that is, write as a numeric character
reference) anything with a Unicode value greater than 127, though this isn't
always the most user-friendly thing to do.
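If you want to do that mechanically, something like the following sketch would do it.  The name entify is just borrowed from the joke above; this is not a JDOM API.  Note that it only handles the "greater than 127" rule and does not escape markup characters such as "&" or "<".

```java
class Entify {
    // Replaces every character above 127 with a decimal numeric character reference.
    static String entify(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // handles surrogate pairs as one code point
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(entify("pages 10\u201320")); // pages 10&#8211;20
    }
}
```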

References:

  http://www.pemberley.com/janeinfo/latin1.html
  http://www.robelle.com/library/smugbook/unicode.html


--- original message ---
From: Andrew Freeman aefreeman at earthlink.net
Date: Fri, 20 Jul 2001 20:17:46 -0400

I am trying to use JDOM to parse an XML file that contains the following
character:  '–'.  However, I am getting a parsing error indicating that that
Unicode character is invalid.

When I print it out in Java:

System.out.println("" + (int) '–');

I get 8211.

If I print out its ASCII character in another editor I get 150.

Does the XML file need a specific encoding in order to parse the file?  Do I
need to have the character escaped with &#8211; prior to parsing the file? If
I need to escape the character, is there a rule that tells me what I have to
escape and what I don't?  Also, what is special about this character that
it has such a funky int value when I print it out?

Thanks,
Andy




