[jdom-interest] [PATCH] Provide surrogate pair support to jdom

Dave Byrne dave-lists at intelligentendeavors.com
Tue Aug 24 09:58:19 PDT 2004


Below is a patch to provide decoding of surrogate pairs in
Verifier.checkCharacterData. Currently, if a document contains a surrogate
pair, each half of the pair is passed separately to Verifier.isXMLCharacter,
which rejects it, so the data fails verification with an
IllegalDataException. This patch combines each surrogate pair into a single
code point, which then passes the tests in Verifier.isXMLCharacter().
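
For anyone who wants to check the arithmetic outside of JDOM, here is a
small standalone sketch of the same decoding. The class and method names
(SurrogateDemo, combine, checkSurrogates) are made up for illustration and
are not part of JDOM or of the patch. It also differs from the patch in two
ways: it uses a bounds check instead of catching IndexOutOfBoundsException,
and it rejects a stray low surrogate itself because it does not call
isXMLCharacter.

```java
// Sketch only: SurrogateDemo and its methods are hypothetical names,
// not part of JDOM or of the patch in this message.
final class SurrogateDemo {

    /** Combines a high and a low surrogate into one code point,
     *  using the same arithmetic as the patch. */
    static int combine(char high, char low) {
        return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
    }

    /**
     * Returns null if every surrogate in text is correctly paired,
     * otherwise a short reason string.
     */
    static String checkSurrogates(String text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch >= 0xD800 && ch <= 0xDBFF) {
                // High surrogate: a low surrogate must follow.
                if (i + 1 >= text.length()) {
                    return "Surrogate Pair Truncated";
                }
                char low = text.charAt(i + 1);
                if (low < 0xDC00 || low > 0xDFFF) {
                    return "Illegal Surrogate Pair";
                }
                i++; // skip the low half, it has been consumed
            } else if (ch >= 0xDC00 && ch <= 0xDFFF) {
                // Assumption of this sketch: a low surrogate with no
                // preceding high surrogate is reported here directly.
                return "Illegal Surrogate Pair";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // U+1D11E is encoded in UTF-16 as D834 DD1E.
        int cp = combine('\uD834', '\uDD1E');
        System.out.println("0x" + Integer.toHexString(cp)); // prints 0x1d11e
    }
}
```

Working it through by hand: 0xD834 - 0xD800 = 0x34, times 0x400 gives
0xD000; 0xDD1E - 0xDC00 = 0x11E; adding 0x10000 gives 0x1D11E.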

The patch is against CVS from this morning.

Thanks
Dave Byrne


--- Verifier.old	Fri Feb  6 01:28:30 2004
+++ Verifier.java	Tue Aug 24 09:55:39 2004
@@ -137,7 +137,6 @@
      * characters allowed by the XML 1.0 specification. The C0 controls
      * (e.g. null, vertical tab, formfeed, etc.) are specifically excluded
      * except for carriage return, linefeed, and the horizontal tab.
-     * Surrogates are also excluded. 
      * <p>
      * This method is useful for checking element content and attribute
      * values. Note that characters
@@ -155,15 +154,41 @@
             return "A null is not a legal XML value";
         }
 
-        // do check
-        for (int i = 0, len = text.length(); i<len; i++) {
-            if (!isXMLCharacter(text.charAt(i))) {
-                // Likely this character can't be easily displayed
-                // because it's a control so we use it'd hexadecimal 
-                // representation in the reason.
-                return ("0x" + Integer.toHexString(text.charAt(i)) 
-                 + " is not a legal XML character");    
-            }       
+       	
+        for(int i = 0; i < text.length(); i++) {
+        	
+        	int ch = text.charAt(i);
+        	       	
+        	if (ch >= 0xD800 && ch <= 0xDBFF) {
+        		//encountered the first part of a surrogate pair
+        		//make sure that the next char is the low-surrogate
+        		char low;
+        		
+        		try {
+        			low = text.charAt(i + 1);
+        		} catch(IndexOutOfBoundsException ex) {
+        			return "Surrogate Pair Truncated";
+        		}
+        		
+        		if (low < 0xDC00 || low > 0xDFFF) {
+        			//the low surrogate is not present
+					return "Illegal Surrogate Pair";
+        		}
+        		else {
+        			//it's a good pair, calculate the true value of
+        			//the character to then pass to isXMLCharacter()
+        			ch = 0x10000 + (ch - 0xD800) * 0x400 + (low - 0xDC00);
+        			i++;
+           		}
+        	}
+        	
+        	if (!isXMLCharacter(ch)) {
+			// Likely this character can't be easily displayed
+			// because it's a control so we use its hexadecimal
+			// representation in the reason.
+			return ("0x" + Integer.toHexString(ch) 
+				  + " is not a legal XML character");    
+        	}
         }
 
         // If we got here, everything is OK
@@ -715,11 +740,11 @@
      * character is a character according to production 2 of the 
      * XML 1.0 specification.
      *
-     * @param c <code>char</code> to check for XML compliance
+     * @param c <code>int</code> to check for XML compliance
      * @return <code>boolean</code> true if it's a character, 
      *                                false otherwise
      */
-    private static boolean isXMLCharacter(char c) {
+    private static boolean isXMLCharacter(int c) {
     
         if (c == '\n') return true;
         if (c == '\r') return true;



