XML Developer Tip
Untangling Unicode encoding in XML
My last tip dealt with including the Euro currency symbol in XML documents, using various forms of Unicode character entity references. It caused an unexpected blizzard of e-mail asking for help with the details of working with the many different forms that Unicode can take. This led me on an expedition to locate good references and tutorials on the subject, which in turn led me to the subject of this week's tip: a profound bow of gratitude toward Mike J. Brown's excellent Web resource entitled "The skew.org XML Tutorial". This tutorial concentrates not just on XML in general, but also on XML encoding strategies. It also covers the differences between Unicode (a term that is bandied about, as I've done here, to describe a mammoth collection and codification of character codes, alphabets, and other typographical marks) and the standard that actually governs XML character encoding, namely ISO/IEC 10646-1. Brown cuts through these matters by referring to it simply as the Universal Character Set, or UCS.
The biggest practical difference between the two standards is that the Unicode Standard is available online at www.unicode.org and is well and affordably documented in the Unicode Consortium's excellent publications, of which the most current is The Unicode Standard, Version 3.0 (Addison-Wesley, 2000). The official ISO/IEC 10646-1 documentation, by contrast, comes in numerous pieces (as many as six, in fact) and costs hundreds of dollars or more for electronic, CD, or paper copies, which are available only from the Web site at www.iso.org. Brown also recommends Tony Graham's Unicode: A Primer (Wiley, 2000) as another valuable resource on the topic, one that explains the differences between Unicode and ISO 10646 more thoroughly than his own tutorial does.
Brown's tutorial does numerous wonderful things to help XML content and tool developers fit their minds around the many minutiae of getting Unicode/10646 encoding right in XML documents and in the tools that deal with such documents, including:
- The best introduction of specific terminology and its specific meanings in the character encoding context (this turns out to be far more important than you might guess).
- Mappings between various important character sets, including ASCII, the various ISO Latin character sets (denoted ISO/IEC 8859-X, where X runs between 1 and 15 at last check), the Windows Glyph List version 4 (WGL4) that Microsoft defined with Agfa Monotype and implements in most Windows fonts, and the Adobe Glyph List (AGL), itself a superset of WGL4.
- An explanation of how the UCS code space is divided into 17 planes, each of which accommodates up to 65,536 values (a 16-bit encoding space, in other words), and how general character encoding works.
- The process whereby characters are mapped from abstract representations like Զ to specific numeric codes that some device can recognize and render.
- Documentation of common encoding schemes such as UTF-8 and UTF-16, how these work in XML, and how to declare them in XML documents.
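To make the last three items a little more concrete, here is a minimal Python sketch (my own illustration, not something taken from Brown's tutorial). It resolves an XML numeric character reference to its UCS code point, works out which of the 17 planes that code point falls in, and shows the bytes that UTF-8 and big-endian UTF-16 use for it. The function names resolve_reference and describe are hypothetical, and the sketch assumes the input is a single well-formed numeric reference; it does not check for surrogates or other code points that XML disallows.

```python
import re

# An XML numeric character reference: decimal (&#8364;) or hexadecimal (&#x20AC;).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def resolve_reference(ref):
    """Return the UCS code point named by a numeric character reference."""
    match = _CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a numeric character reference: {ref!r}")
    hex_digits, dec_digits = match.groups()
    code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
    if code_point >= PLANE_SIZE * PLANE_COUNT:
        raise ValueError(f"outside the UCS code space: {ref!r}")
    return code_point


def _hex_bytes(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


def describe(ref):
    """Summarize the code point, plane, and encoded bytes for a reference."""
    code_point = resolve_reference(ref)
    char = chr(code_point)
    return {
        "char": char,
        "code_point": f"U+{code_point:04X}",
        "plane": code_point // PLANE_SIZE,
        "utf8": _hex_bytes(char.encode("utf-8")),
        "utf16": _hex_bytes(char.encode("utf-16-be")),
    }


# The Euro sign, an Armenian letter, and a character outside plane 0.
for reference in ("&#x20AC;", "&#1334;", "&#x1D11E;"):
    print(reference, describe(reference))
```

Notice that the same abstract character takes a different number of bytes in each scheme: the Euro sign needs three bytes in UTF-8 but two in UTF-16, while a character beyond plane 0 needs four bytes in both (UTF-16 spends two 16-bit units on it).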
By working your way through this excellent collection of materials, you should be much better equipped to understand and use UCS encodings in your XML documents. Having worked around the topic for nearly five years now, I nevertheless learned a lot about UCS encodings from this resource myself; hopefully, you will have the same experience.
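As a small taste of what using these encodings in a document looks like, the sketch below builds the same one-element document three ways and parses each with Python's standard xml.etree.ElementTree module. It is only an illustration under my own assumptions: the helper name euro_document and the price element are made up for the example. Because ISO-8859-1 has no Euro sign, that version falls back on a numeric character reference, while the UTF-8 and UTF-16 versions carry the character directly.

```python
import xml.etree.ElementTree as ET


def euro_document(encoding, use_reference=False):
    """Build a tiny XML document as bytes, declared and encoded as `encoding`.

    Hypothetical helper for illustration: the encoding named in the XML
    declaration is also the one used to turn the text into bytes.
    """
    euro = "&#x20AC;" if use_reference else "\u20ac"
    text = (
        f'<?xml version="1.0" encoding="{encoding}"?>'
        f"<price>{euro}10</price>"
    )
    return text.encode(encoding)


documents = {
    "UTF-8": euro_document("UTF-8"),
    "UTF-16": euro_document("UTF-16"),  # Python prepends a byte order mark
    "ISO-8859-1": euro_document("ISO-8859-1", use_reference=True),
}

for name, raw in documents.items():
    # The parser reads the declaration (and any byte order mark) to decode
    # the bytes, so all three documents yield the same element text.
    print(name, len(raw), "bytes ->", ET.fromstring(raw).text)
```

The byte counts differ from one version to the next, but the parsed text does not, which is the whole point of declaring the encoding: the bytes on disk are an implementation detail, and the characters are what the document means.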
About the Author
Ed Tittel is a principal at LANWrights, Inc., a network-oriented writing, training, and consulting firm based in Austin, Texas. He is the creator of the Exam Cram series and has worked on over 30 certification-related books on Microsoft, Novell, and Sun-related topics. Ed teaches in the Certified Webmaster Program at Austin Community College and consults. He is a member of the NetWorld + Interop faculty, where he specializes in Windows 2000-related courses and presentations.