[jdom-interest] XMLOutputter problems with Unicode

Jason Hunter jhunter at servlets.com
Tue Jul 2 12:06:15 PDT 2002


Your solution is one approach.  However, if you simply leave the
outputter's encoding as UTF-8 (the default) and pass in an output stream
or a writer designed for UTF-8, then characters are encoded correctly
without needing to be escaped.  That should be faster than your
solution.  If you don't see that happening, you probably passed in an
improper writer or changed the encoding.  
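
A minimal sketch of the writer side, using only the JDK and no JDOM
classes. The class and method names (Utf8WriterDemo, toUtf8Bytes) are made
up for illustration. The point is that the charset is named explicitly
when the writer is built; a writer constructed this way is the kind of
object you would hand to XMLOutputter.output(doc, writer).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JDOM.
class Utf8WriterDemo {

    /** Writes text through a Writer bound explicitly to UTF-8 and returns the bytes produced. */
    static byte[] toUtf8Bytes(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Name the charset explicitly. A writer created without one may fall
        // back to the platform default encoding, which is often not UTF-8.
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // An in-memory stream should not fail; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // U+201C (decimal 8220), the left double quotation mark.
        for (byte b : toUtf8Bytes("\u201C")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Running main prints E2 80 9C, the three-byte UTF-8 form of U+201C. A
single byte 93 (hex) for that character is what Windows-1252 would
produce, which is why a parser expecting UTF-8 rejects the file.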

-jh-

> Mad Einstein wrote:
> 
> 
> The current XMLOutputter class (Version 8) doesn't support Unicode
> characters with a character code of 128 or above.
> 
> I was trying to save the character \u201C (decimal 8220) to XML using
> XMLOutputter, and as a result the file contained one byte (93 hex)
> instead of two bytes. I then couldn't parse this file with SAXBuilder,
> nor could I open it in Internet Explorer.
> 
> I have been reading about different algorithms that convert Unicode
> for XML and HTML, and I think this one is the best:
> 
> ----------------------------------------------------------------------
> http://czyborra.com/utf/#UTF-8
> 
> HTML's Numerical Character References
> 
> A somewhat more standardized encoding option is specified by HTML. RFC
> 2070 allows us to reference just any Unicode character within any HTML
> document of any charset by using the decimal numeric character
> reference &#12345; as in:
> 
> putwchar(c)
> {
>   if (c < 0x80 && c != '&' && c != '<') putchar(c);
>   else printf ("&#%d;", c);
> }
> 
> Decimal numbers for Unicode characters are also used in Windows NT's
> Alt-12345 input method but are still of so little mnemonic value that
> a hexadecimal alternative &#x1bc; is being supported by the newer
> standards HTML 4.0 and XML 1.0. Apart from that, hexadecimal numbers
> aren't that easy to memorize either. SGML has long allowed symbolic
> character entities for some character references like &eacute; for é
> and &euro; for the € but the table of supported entities differs
> from browser to browser.
> 
> ----------------------------------------------------------------------
> 
> I wrote the method below for the conversion.
> 
> It converts the three characters &, < and >, as well as all characters
> with a code of 128 or above, to numeric character references in the
> format &#1234;. The output now works with any parser supporting XML 1.0.
> 
> /**
>  * Converts a Unicode character to an HTML decimal character reference.
>  * Characters with a code below 128 (decimal), apart from '>', '<' and
>  * '&', need no escaping, so null is returned for them. All others are
>  * converted to the decimal reference &#{char_code};
>  * Example:
>  * <br> U+00E9  --> &#233;
>  * @param value Unicode character
>  * @return the decimal character reference, or null if the character
>  *         needs no escaping.
>  */
>   public String convertTEXTtoHTML(char value)
>   {
>      // A char's numeric value is its character code; no hashCode() needed.
>      int code = value;
> 
>      if (code < 128 && value != '&' && value != '<' && value != '>')
>      {
>        // Plain ASCII: null signals that no entity is needed.
>        return null;
>      }
>      return "&#" + code + ";";
>   }
> 
> and I changed the default case of the
> XMLOutputter.escapeElementEntities(String str) method:
> 
>    default :
>        entity = convertTEXTtoHTML(ch);
>        break;
> 
> Maybe there is a different solution for this problem, but this one
> works fine.
> 
> Mad Einstein
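
For comparison, here is a hedged sketch of the escaping approach described
in the quoted message, written as a standalone helper and not as a patch
to XMLOutputter. The names NumericRefEscaper and escape are hypothetical.
It walks the string by code point so that a surrogate pair becomes one
reference instead of two; that relies on String.codePointAt, which
assumes Java 5 or later.

```java
// Hypothetical helper, not part of JDOM.
class NumericRefEscaper {

    /** Escapes &, <, > and every non-ASCII code point as a decimal character reference. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt reads a surrogate pair as a single code point.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 128 && cp != '&' && cp != '<' && cp != '>') {
                out.append((char) cp);                    // plain ASCII passes through
            } else {
                out.append("&#").append(cp).append(';');  // decimal reference
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b \u201Cq\u201D"));
    }
}
```

Running main prints a&#60;b &#8220;q&#8221;. The result is pure ASCII, so
it survives any ASCII-compatible output encoding, at the cost of larger
output than writing the characters directly as UTF-8.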


