
Java, SAX, XML: About special characters encoding

Contrary to what most people think, character entities such as &eacute; are not part of the XML specification! And SAX follows this spec rigorously…

A short history: in HTML, characters outside standard ASCII, such as the French ones (‘é’, ‘ë’, ‘è’, ‘à’, ‘ç’, etc.), were written as named entities: &eacute;, &euml;, etc. From there, most people assumed (and so did I) that these entities were as standard in XML as in HTML.

But when parsing an XML file that contains an &eacute; entity, the SAX parser raises the following exception:

[Fatal Error] toto.xml:3:59: The entity "eacute" was referenced, but not declared.
org.xml.sax.SAXParseException: The entity "eacute" was referenced, but not declared.
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

(By the way: the Eclipse XML plugin shows a warning for such an XML file.)
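The behaviour described above can be reproduced with the SAX parser bundled in the JDK. This is only a minimal sketch: the class name `EntityDemo`, the helper `parseError` and the sample documents are hypothetical, and the exact wording of the message may vary between parser implementations.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class EntityDemo {

    /** Returns null if the document parses, otherwise the parser's error message. */
    static String parseError(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal errors as SAXParseException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Named HTML entity, not declared in XML: rejected by the parser.
        System.out.println(parseError("<name>Ren&eacute;</name>"));
        // Decimal character reference for the same letter: accepted.
        System.out.println(parseError("<name>Ren&#233;</name>"));
    }
}
```

The first call should report the undeclared `eacute` entity, while the second document parses without error, so `parseError` returns `null`.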

Indeed, the only named entities predefined by the XML standard are the following five:

  • &amp; for &
  • &lt; for <
  • &gt; for >
  • &quot; for "
  • &apos; for '
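Since these five are the only named entities an XML parser knows out of the box, escaping them takes a few lines. The helper below is an illustrative sketch (the names `XmlPredefined` and `escape` are made up for this example), not something provided by SAX itself.

```java
final class XmlPredefined {

    /** Replaces the five markup characters with their predefined XML entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

For instance, `XmlPredefined.escape("a<b & c")` yields `a&lt;b &amp; c`.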

To escape French, Swedish or Chinese characters, or even a simple equivalent of &nbsp; (non-breaking space, &#160;), you have to use numeric character references, that is, their Unicode code point in decimal (&#233;) or hexadecimal (&#xE9;). I use the following HashMap for the most common characters in French:

    import java.util.HashMap;
    import java.util.Map;

    // Maps each character to its XML decimal character reference (Unicode code point).
    public static final Map<Character, String> XML_DECIMAL_ENCODING = new HashMap<Character, String>();
    static
    {
        XML_DECIMAL_ENCODING.put('œ', "&#339;"); // U+0153
        XML_DECIMAL_ENCODING.put('À', "&#192;");
        XML_DECIMAL_ENCODING.put('Ä', "&#196;");
        XML_DECIMAL_ENCODING.put('Æ', "&#198;");
        XML_DECIMAL_ENCODING.put('Ç', "&#199;");
        XML_DECIMAL_ENCODING.put('È', "&#200;");
        XML_DECIMAL_ENCODING.put('É', "&#201;");
        XML_DECIMAL_ENCODING.put('Ë', "&#203;");
        XML_DECIMAL_ENCODING.put('Ï', "&#207;");
        XML_DECIMAL_ENCODING.put('Ñ', "&#209;"); // U+00D1
        XML_DECIMAL_ENCODING.put('Ö', "&#214;");
        XML_DECIMAL_ENCODING.put('Ü', "&#220;");
        XML_DECIMAL_ENCODING.put('à', "&#224;");
        XML_DECIMAL_ENCODING.put('â', "&#226;");
        XML_DECIMAL_ENCODING.put('ä', "&#228;");
        XML_DECIMAL_ENCODING.put('æ', "&#230;");
        XML_DECIMAL_ENCODING.put('ç', "&#231;");
        XML_DECIMAL_ENCODING.put('è', "&#232;");
        XML_DECIMAL_ENCODING.put('é', "&#233;");
        XML_DECIMAL_ENCODING.put('ê', "&#234;");
        XML_DECIMAL_ENCODING.put('ë', "&#235;");
        XML_DECIMAL_ENCODING.put('î', "&#238;");
        XML_DECIMAL_ENCODING.put('ï', "&#239;");
        XML_DECIMAL_ENCODING.put('ñ', "&#241;");
        XML_DECIMAL_ENCODING.put('ô', "&#244;");
        XML_DECIMAL_ENCODING.put('ö', "&#246;");
        XML_DECIMAL_ENCODING.put('ù', "&#249;");
        XML_DECIMAL_ENCODING.put('û', "&#251;");
        XML_DECIMAL_ENCODING.put('ü', "&#252;");
    }
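Because a decimal character reference is simply the Unicode code point of the character, the same encoding can also be computed instead of looked up. The sketch below is an alternative to the table, not the approach used in this post; `XmlNumericEncoder` and `encode` are hypothetical names, and it assumes Java 8 or later for `codePoints()`.

```java
final class XmlNumericEncoder {

    /** Rewrites every non-ASCII character as a decimal character reference. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate over code points (not chars) so characters outside the
        // Basic Multilingual Plane are emitted as a single reference.
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);               // plain ASCII: keep as-is
            } else {
                out.append("&#").append(cp).append(';'); // e.g. U+00E9 -> &#233;
            }
        });
        return out.toString();
    }
}
```

With this helper, "Hélène" becomes `H&#233;l&#232;ne`. Note that it leaves the five markup characters (&, <, >, ", ') untouched, so they still have to be escaped separately.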
