XML Developer Tip
XML and Internationalization
In past tips, I've touched on issues related to use of alternate character sets and Unicode (strongly related to ISO 10646) in XML. As with other such discussions, nothing beats good examples and illustrations. In that context, I've recently been reading and enjoying XML guru and fellow InformIT Guide editor Nicholas Chase's work on Internationalization: he not only does a good job of talking and walking through the terminology and techniques, but also peppers his work with lots of markup examples that show how to employ what he's describing in your own XML documents.
To that end, readers interested in using alternate character sets will find Chase's discussion of Internationalization both useful and informative. So far, he has posted a single piece on this topic, dated 12/11/2003 and entitled "Internationalization." This section of his XML Guide introduces the topics and terms involved in declaring character encodings in XML documents, with examples and discussions of ISO-8859-1, UTF-8, and UTF-16. In future postings on this topic, he'll describe and show how to invoke other ISO-8859 character sets, as well as specific ranges of UTF-8 and UTF-16 character codes. For authors working in various foreign languages (especially those with character sets not already included in some ISO-8859-X character set) this should prove pretty darn helpful.
In the meantime, those interested in learning more about what's available in the ISO-8859 character encodings will find the following markup and information of great interest.
<?xml version="1.0" encoding="ISO-8859-1"?>
This markup is the XML declaration (often loosely called a processing instruction, or PI). It tells the XML parser what version of XML is in use, along with the character encoding in which the document is stored. Those wishing to explore other parts of the ISO-8859 standard should examine this table:
|ISO-8859-1||Latin-1||ASCII plus most Western European languages, including Albanian, Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, Flemish, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, and Swedish. Omits certain Dutch, French, and German characters.|
|ISO-8859-2||Latin-2||ASCII plus most Central European languages, including Czech, English, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovene, and Serbian.|
|ISO-8859-3||Latin-3||ASCII plus characters required for English, Esperanto, German, Maltese, and Galician.|
|ISO-8859-4||Latin-4||ASCII plus characters for most Baltic-region languages, including Latvian, Lithuanian, German, Greenlandic, and Sami (Lappish); now largely superseded by ISO-8859-10 (Latin-6).|
|ISO-8859-5||(none)||ASCII plus Cyrillic characters for Slavic languages, including Byelorussian, Bulgarian, Macedonian, Russian, Serbian, and Ukrainian.|
|ISO-8859-6||(none)||ASCII plus Arabic characters.|
|ISO-8859-7||(none)||ASCII plus Greek characters.|
|ISO-8859-8||(none)||ASCII plus Hebrew.|
|ISO-8859-9||Latin-5||Latin-1 except that some Turkish symbols replace Icelandic ones.|
|ISO-8859-10||Latin-6||ASCII plus most Nordic languages, including Latvian, Lithuanian, Inuit, non-Skolt Sami, and Icelandic.|
|ISO-8859-11||(none)||ASCII plus Thai.|
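|ISO-8859-12||(none)||Not assigned; no part with this number was ever published.|
|ISO-8859-13||Latin-7||ASCII plus the Baltic Rim characters.|
|ISO-8859-14||Latin-8||ASCII plus characters for Celtic languages, including Irish Gaelic, Scottish Gaelic, and Welsh.|
|ISO-8859-15||Latin-9||Variation on Latin-1 that includes the Euro currency sign, plus extra accented Finnish and French characters.|
|ISO-8859-16||Latin-10||ASCII plus characters for South-Eastern European languages, including Romanian; also includes the Euro currency sign.|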
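To see what the encoding declaration changes in practice, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser is available. The `price` element and the `parse_price` helper are hypothetical names made up for illustration. The same bytes are parsed twice, and only the declared encoding differs.

```python
import xml.etree.ElementTree as ET


def parse_price(encoding: str) -> str:
    """Parse one fixed byte sequence under the given declared encoding."""
    # The declaration itself contains only ASCII characters.
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'.encode("ascii")
    # Byte 0xA4 is the character under test; the digits after it are ASCII.
    body = b"<price>\xa4100</price>"
    return ET.fromstring(declaration + body).text


latin1 = parse_price("ISO-8859-1")
latin9 = parse_price("ISO-8859-15")
print(repr(latin1))
print(repr(latin9))
```

Byte 0xA4 is the generic currency sign in ISO-8859-1 and the Euro sign in ISO-8859-15, so the two results should differ in their first character even though the bytes on disk are identical. That is why the declaration has to match the encoding the document was actually saved in.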
Some of you may be more familiar with the script names for these character sets (Latin-1, Latin-2, and so on), but you should use the ISO-8859-n notation in your XML declarations to make sure you invoke the correct character set. You may also find ISO-8859-15 interesting, because it represents the most useful alternative to ISO-8859-1 for European applications, given its support for most relevant languages and the Euro symbol. For more information on any given character set, visit your favorite search engine and search on its ISO standard number. Also be aware that the XML specification requires processors to support only UTF-8 and UTF-16, so various XML processors and parsers may or may not support ISO-8859 character sets, particularly those numbered higher than 1; experimentation is urged to check compliance.
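One way to run that experiment is sketched below in Python, again assuming the standard library's xml.etree.ElementTree parser. The `parser_accepts` helper and the `doc` element are hypothetical names chosen for illustration. The sketch feeds the parser a tiny document declared in each ISO-8859-n encoding and records whether the parse succeeds.

```python
import xml.etree.ElementTree as ET


def parser_accepts(encoding: str) -> bool:
    """Return True if the parser reads a small document declared in `encoding`."""
    # The document uses only ASCII characters, so its bytes are the same
    # in every ISO-8859-n character set.
    document = (
        f'<?xml version="1.0" encoding="{encoding}"?><doc>ok</doc>'
    ).encode("ascii")
    try:
        return ET.fromstring(document).text == "ok"
    except Exception:
        # Any failure (parse error, unknown encoding) counts as "rejected".
        return False


# Try the ISO-8859 numbers 1 through 15.
support = {f"ISO-8859-{n}": parser_accepts(f"ISO-8859-{n}") for n in range(1, 16)}

for name, accepted in support.items():
    print(f"{name}: {'accepted' if accepted else 'rejected'}")
```

This only checks that the parser recognizes the encoding name. It does not prove that every non-ASCII character is mapped correctly, so a fuller test would include sample text in the target language. Results will vary with the parser you test, which is the point of the exercise.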
About the Author
Ed Tittel is a VP of Content Development & Delivery at CapStar LLC, an e-learning company based in Princeton, NJ. Ed runs a small team of content developers and project managers in Austin, TX, and writes regularly on XML and related vocabularies and applications. E-mail Ed at firstname.lastname@example.org.