XML Developer Tip
Working with XML and MS Word
In recent tips, I've made the case that MS Office 2003 is worth a second look because of its improved support for XML. In a recent article at XML.com, XML guru John Simpson puts some substance behind those contentions and describes a tool that does a good job of converting Word documents (.doc) and their Rich Text Format (.rtf) versions into pretty reasonable XML.
Along the way he makes an excellent point: just as the HTML that Word creates when you choose "Web page (.html)" in the "Save As…" menu includes what some markup mavens call garbage, so does the output of the "XML (.xml)" option. In fact, Simpson calls the resulting output "XML of a spectacularly hideous form," a description that stings but is accurate enough to be hilarious. He also points to sample output in a recent DevX article by A. Russell Jones, "Export Customized XML from Microsoft Word with VB.NET," that shows why Simpson is sadly correct in his assessment of the XML that MS Word produces.
In the same story, Simpson explores an alternative for moving between Word .doc files and more reasonable forms of XML: a conversion tool called upCast, from the German software company infinity-loop GmbH. As a Java-based program, upCast is multi-platform and runs on various versions of Windows, Unix/Linux, and Macintosh OSes. Its real limitation comes from its sourcing requirements: .doc files to be converted must have been created on Windows machines (running Windows 95, 98, NT, or 2000) using MS Word 97 or a newer version. Otherwise, .doc files must be saved as .rtf on the source machine before being turned over to upCast for conversion. Also, Mac and Unix/Linux users can convert only .rtf files on their machines, not native .doc files.
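The sourcing rules above boil down to a small decision, which can be sketched in a few lines of Python. This is only an illustration of the rules as described in this tip, not anything that ships with upCast: the function name, its parameters, and the returned labels are all hypothetical, and treating every other file extension as "unsupported" is an assumption on my part.

```python
from pathlib import Path


def upcast_input_advice(filename, host_is_windows, made_on_windows_with_word97_plus):
    """Return a hypothetical label describing how to feed a file to upCast.

    Encodes only the rules described in this tip; it does not call upCast.
    """
    ext = Path(filename).suffix.lower()
    if ext == ".rtf":
        # .rtf input is described as usable on every platform.
        return "convert"
    if ext == ".doc":
        # Native .doc input needs a Windows host and a Windows/Word 97+ origin.
        if host_is_windows and made_on_windows_with_word97_plus:
            return "convert"
        return "save-as-rtf-first"
    # Assumption: anything else is out of scope for this sketch.
    return "unsupported"
```

For example, a Mac user holding report.doc would call upcast_input_advice("report.doc", False, True) and be told to save the file as .rtf on the source machine first.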
All this said, driving the software is remarkably easy. Working with visual menus, users select what to import (the source file) and how to export (the output handling options). upCast also does a good job of converting Word formatting styles into CSS, which it attaches via xml-stylesheet processing instructions (PIs). Namespace handling is equally adept, and the conversion tool does a good job of recognizing and formatting hyperlinks and other active content.
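If you haven't worked with xml-stylesheet PIs before, here is a minimal Python sketch of what one looks like and how to read it back with the standard library's xml.dom.minidom. The sample document, its element names, and the styles.css file name are made up for illustration; they are not actual upCast output, and stylesheet_links is a hypothetical helper.

```python
import re
from xml.dom import minidom, Node

# Hypothetical converted document; not actual upCast output.
SAMPLE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="styles.css"?>
<document>
  <para class="Heading1">Working with XML and MS Word</para>
</document>
"""


def stylesheet_links(xml_text):
    """Return the pseudo-attributes of each xml-stylesheet PI as a dict."""
    doc = minidom.parseString(xml_text)
    links = []
    # PIs that precede the root element are children of the document node.
    for node in doc.childNodes:
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # PI data is plain text, so the name="value" pairs are parsed by hand.
            links.append(dict(re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', node.data)))
    return links


print(stylesheet_links(SAMPLE))
```

The regular expression only handles double-quoted pseudo-attributes, which is enough for a quick look at which stylesheet a converted file points to.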
infinity-loop also has an XML-to-Word conversion tool to complement upCast; naturally, it's named downCast. Be sure to visit the vendor's Web site and check out these interesting tools. Simpson's complete story is worth a visit as well!
About the Author
Ed Tittel is a VP of Content Development & Delivery at CapStar LLC, an e-learning company based in Princeton, NJ. Ed runs a small team of content developers and project managers in Austin, TX, and writes regularly on XML and related vocabularies and applications. E-mail Ed at firstname.lastname@example.org.