Java, SAX, XML: About special characters encoding
Contrary to what most people think, entities such as &eacute; are not part of the XML specification! And SAX follows this spec rigorously…
A short history: in HTML, non-ASCII characters, such as the French ones (‘é’, ‘ë’, ‘è’, ‘à’, ‘ç’, etc.), were written as &eacute;, &euml;, etc. From there, most people assumed (as I did) that these entities were as standard in XML as in HTML.
But when parsing an XML file containing &eacute; entities, the SAX parser raises the following exception:
[Fatal Error] toto.xml:3:59: The entity "eacute" was referenced, but not declared.
org.xml.sax.SAXParseException: The entity "eacute" was referenced, but not declared.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
(By the way: the Eclipse XML plugin shows a warning for such XML files.)
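The error is easy to reproduce without a file on disk. The following is a minimal sketch, assuming the JDK's default JAXP SAX parser; the class name UndeclaredEntityDemo and the helper parseError are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of SAX, only an illustration.
class UndeclaredEntityDemo {

    /** Returns null if the document parses, otherwise the parser's error message. */
    static String parseError(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal errors as SAXParseException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style named entity: rejected, the message names the entity "eacute".
        System.out.println(parseError("<a>caf&eacute;</a>"));
        // Decimal character reference: accepted, prints null.
        System.out.println(parseError("<a>caf&#233;</a>"));
    }
}
```

The second call shows that the same character written as a numeric reference parses without complaint.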
Indeed, the only named entities predefined by the XML standard are the following:
&amp; (&)
&lt; (<)
&gt; (>)
&quot; (")
&apos; (')
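As an illustration, here is a small sketch of how text could be escaped with these five predefined entities before being written into an XML document. The class XmlEscape and its escape method are hypothetical names, not part of SAX or the JDK.

```java
// Hypothetical helper: escapes only the five entities predefined by XML.
class XmlEscape {

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Tom &amp; &quot;Jerry&quot; &lt;3
        System.out.println(escape("Tom & \"Jerry\" <3"));
    }
}
```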
To write French, Swedish or Chinese characters as entities, or even a simple equivalent to &nbsp; (non-breaking space), you have to use their decimal or hexadecimal character references (for example &#233; or &#xE9; for é, and &#160; for the non-breaking space). I use the following HashMap for the most common French characters:
// Requires: import java.util.HashMap;
// Maps each accented character to its decimal XML character reference.
public static HashMap<Character, String> XML_DECIMAL_ENCODING = new HashMap<Character, String>();
static {
    XML_DECIMAL_ENCODING.put('œ', "&#339;");
    XML_DECIMAL_ENCODING.put('À', "&#192;");
    XML_DECIMAL_ENCODING.put('Ä', "&#196;");
    XML_DECIMAL_ENCODING.put('Æ', "&#198;");
    XML_DECIMAL_ENCODING.put('Ç', "&#199;");
    XML_DECIMAL_ENCODING.put('È', "&#200;");
    XML_DECIMAL_ENCODING.put('É', "&#201;");
    XML_DECIMAL_ENCODING.put('Ë', "&#203;");
    XML_DECIMAL_ENCODING.put('Ï', "&#207;");
    XML_DECIMAL_ENCODING.put('Ñ', "&#209;"); // key was 'Æ', which overwrote the Æ entry
    XML_DECIMAL_ENCODING.put('Ö', "&#214;");
    XML_DECIMAL_ENCODING.put('Ü', "&#220;");
    XML_DECIMAL_ENCODING.put('à', "&#224;");
    XML_DECIMAL_ENCODING.put('â', "&#226;");
    XML_DECIMAL_ENCODING.put('ä', "&#228;");
    XML_DECIMAL_ENCODING.put('æ', "&#230;");
    XML_DECIMAL_ENCODING.put('ç', "&#231;");
    XML_DECIMAL_ENCODING.put('è', "&#232;");
    XML_DECIMAL_ENCODING.put('é', "&#233;");
    XML_DECIMAL_ENCODING.put('ê', "&#234;");
    XML_DECIMAL_ENCODING.put('ë', "&#235;");
    XML_DECIMAL_ENCODING.put('î', "&#238;");
    XML_DECIMAL_ENCODING.put('ï', "&#239;");
    XML_DECIMAL_ENCODING.put('ô', "&#244;");
    XML_DECIMAL_ENCODING.put('ö', "&#246;");
    XML_DECIMAL_ENCODING.put('ù', "&#249;");
    XML_DECIMAL_ENCODING.put('ñ', "&#241;");
    XML_DECIMAL_ENCODING.put('ü', "&#252;");
    XML_DECIMAL_ENCODING.put('û', "&#251;");
}
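Since a decimal reference is simply the character's Unicode code point, the same result can also be computed instead of looked up. The sketch below is one possible alternative to the table, not the approach used above; the class NumericEntityEncoder and its encode method are hypothetical names. It assumes that every character outside 7-bit ASCII should be replaced, and it does not touch &, < or >.

```java
// Hypothetical helper: replaces every non-ASCII character with its
// decimal XML character reference (&#NNN;).
class NumericEntityEncoder {

    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on code points so characters outside the BMP stay whole.
            int cp = text.codePointAt(i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "café œuvre", written with Unicode escapes to stay independent
        // of the source file encoding.
        // prints: caf&#233; &#339;uvre
        System.out.println(encode("caf\u00e9 \u0153uvre"));
    }
}
```

The lookup table remains useful when only a known set of characters should be converted; the computed version covers any character, including Swedish or Chinese ones.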