In my recent article on BPEL, I suggested that simpler types of business transformation of XML data might be better handled by pipeline-style processing. When I last looked into XML pipeline processing, the W3C had just started a working group to create a standard. The working group contributors had been developing their own implementations of the concepts, both as open-source and commercial applications. Other XML-related standards at the W3C, such as XSLT, XPath, XQuery and XInclude, were evolving at the same time, so it has taken a while, but we now (March 9, 2010) have a Proposed W3C Recommendation. This article is a quick introduction to the capabilities that have been defined.
XProc is a declarative language using XML for describing sequences of operations on XML documents. The basic computational unit of a pipeline is called a "step." A step might be something as simple as running an XSLT process to create a new document or a more complex operation which might otherwise require custom programming.
A pipeline step takes zero or more inputs and produces zero or more outputs. Data that flows between outputs and inputs must be in the form of well-formed XML documents or sequences of XML documents, but steps can obtain data from any non-pipeline source and write to any non-pipeline output in any format. XProc has provisions for importing non-XML data and translating or encapsulating it to create well-formed XML.
XProc implementations make heavy use of XSLT and XPath concepts for selecting items or collections of items from an XML document. Because the XML world is still in transition between standards, XProc implementations are allowed to use either XPath 1.0 (1999) or XPath 2.0 (2007) expressions. XPath 2.0 is preferred because it shares its data model with XSLT 2.0. What XProc calls an "XSLTMatchPattern" is used in certain operations which select items in an XML document. This term does not appear to be in common usage outside of XProc; it is used here to denote the "match" pattern expressions that XSLT 1.0 or XSLT 2.0 use to locate nodes. As I understand it, every XSLTMatchPattern is a valid XPath expression but the converse is not true, so XProc uses both kinds of expression to define steps.
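The difference between the two kinds of expression can be sketched with two standard steps. This is a minimal illustration, assuming XProc 1.0 syntax; the element names draft, book and chapter are hypothetical and stand in for whatever the input document contains.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:input port="source"/>
  <!-- p:filter may emit several documents, so the output allows a sequence -->
  <p:output port="result" sequence="true"/>

  <!-- "match" takes an XSLTMatchPattern: every draft element,
       wherever it occurs in the document, is removed -->
  <p:delete match="draft"/>

  <!-- "select" takes a full XPath expression: each selected chapter
       (hypothetical element name) becomes a document of its own -->
  <p:filter select="/book/chapter"/>
</p:declare-step>
```

The pattern is tested against each node in turn, while the XPath expression is evaluated once against the whole document.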
Inputs and Outputs
Inputs can be defined with content "inline" as part of the xpl document, as in the following example from the XML Calabash distribution. The input port named source receives all the content of the p:inline element. Note that the content has start and end tags (arbitrarily named "doc") so that it forms the root element of a well-formed XML document.
<p:input port="source">
  <p:inline><doc>Congratulations! You've run your first pipeline! </doc></p:inline>
</p:input>
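For context, a complete pipeline around this input might look like the following minimal sketch. It assumes the XProc 1.0 namespace and uses the standard p:identity step; the output port name result is a conventional choice rather than a requirement.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- Default content for the "source" port, supplied inline -->
  <p:input port="source">
    <p:inline><doc>Congratulations! You've run your first pipeline!</doc></p:inline>
  </p:input>

  <!-- The pipeline's output is the output of its last step -->
  <p:output port="result"/>

  <!-- p:identity copies its input to its output unchanged -->
  <p:identity/>
</p:declare-step>
```

Running this pipeline with no arguments should simply echo the doc element.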
More typically, a step input will be a reference to a document or sequence of documents, given either as a URI or as a reference to a result "port" of a named earlier step. XProc includes some provision for creating input XML items from arbitrary data by applying encoding rules and wrapping the result in root elements. An HTTP request can be used to acquire data from a Web service.
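The three kinds of binding mentioned here can be sketched in one small pipeline. This is an illustration only, assuming XProc 1.0 syntax; the URL http://example.com/orders.xml, the stylesheet name orders-to-html.xsl and the step name fetch are hypothetical.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                version="1.0">
  <p:output port="result"/>

  <!-- Acquire XML from a (hypothetical) Web service -->
  <p:http-request name="fetch">
    <p:input port="source">
      <p:inline>
        <c:request method="GET" href="http://example.com/orders.xml"/>
      </p:inline>
    </p:input>
  </p:http-request>

  <p:xslt>
    <!-- Reference to the result port of the named earlier step -->
    <p:input port="source">
      <p:pipe step="fetch" port="result"/>
    </p:input>
    <!-- Reference to a document by URI (hypothetical file name) -->
    <p:input port="stylesheet">
      <p:document href="orders-to-html.xsl"/>
    </p:input>
    <!-- No stylesheet parameters are passed in this sketch -->
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>
</p:declare-step>
```

The explicit p:pipe is shown for clarity; when steps simply follow one another, the primary output of one step is connected to the primary input of the next by default.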
User Defined Variables
Extensive provision is made for creating variables in one step for use in later steps. A variable's value is computed by an XPath expression, so it can be as simple as a literal text string or be derived from items selected from a document.
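As a rough sketch of how a variable might be declared and used, assuming XProc 1.0 syntax: the variable name order-count and the orders/order element names are hypothetical.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Computed once from the document arriving on the "source" port -->
  <p:variable name="order-count" select="count(/orders/order)"/>

  <!-- Record the count as an attribute on the root element -->
  <p:add-attribute match="/orders" attribute-name="count">
    <p:with-option name="attribute-value" select="$order-count"/>
  </p:add-attribute>
</p:declare-step>
```

The variable is referenced with the usual XPath $name syntax inside the select expression of a later step's option.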
Some Typical Operations
Here are a few of the 31 standard steps required in an XProc implementation. The XProc standard also defines 10 optional steps and we may expect individual implementations to come up with even more.
- Validate a document using a schema.
- Add an attribute to elements matching an XSLTMatchPattern.
- Rename attributes or elements selected by an XSLTMatchPattern.
- Apply an XSLT script to a document.
- Delete items in an input document identified by an XSLTMatchPattern, outputting the modified document.
- Select from an input document only items located by an XSLTMatchPattern, creating a new document.
- Delete entire nodes selected by an XPath expression.
- Create a new document containing nodes selected by an XPath expression.
- Compare two XML documents for equality as defined by XPath.
- Insert one document into another at a point located by an XSLTMatchPattern.
- Flatten a document hierarchy by replacing matched elements with their child nodes and the reverse by wrapping matching nodes with new elements.
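Several of the operations listed above can be chained in a single pipeline, as in the following sketch. It assumes XProc 1.0 syntax and the standard steps p:add-attribute, p:rename, p:delete, p:unwrap and p:wrap; all element and attribute names (order, item, comment, group, line, entry, status) are hypothetical.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Add status="new" to every order element -->
  <p:add-attribute match="order" attribute-name="status" attribute-value="new"/>

  <!-- Rename every item element to line -->
  <p:rename match="item" new-name="line"/>

  <!-- Delete every comment element -->
  <p:delete match="comment"/>

  <!-- Flatten: replace each group element with its children -->
  <p:unwrap match="group"/>

  <!-- The reverse: wrap each line element in a new entry element -->
  <p:wrap match="line" wrapper="entry"/>
</p:declare-step>
```

Because no connections are given explicitly, each step reads the result of the step before it, which is what gives the pipeline its name.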
The XProc processing model is necessarily complex, with over one thousand tests in the test suite of documents and required results. The test suite is composed of both required and optional features. Two implementations have achieved very high compliance with the required tests.
- Calabash: The XML Calabash implementation is a Java-based open-source application maintained by Norman Walsh, one of the editors of the XProc standard. The current version of Calabash, 0.9.19, requires Java 1.5, the Saxon XSLT toolkit, and some additional libraries.
- Calumet: The EMC Documentum XProc Engine, by Jeroen van Rotterdam, is a commercial product with a free download for development purposes. The approach taken is to parse the XProc document and compile a Java object representation which is used by a "pipeline runner" to operate on input data. Now in version 1.0.10, it requires Java 1.6 and comes with extensive documentation. Plug-ins allow integration with well-known XML tools such as the Apache FOP processor for creation of PDF documents.
You might think of XProc as providing a vastly expanded capability for applying standard tools such as XSLT and XPath to more complex problem sets. Since XProc steps require a DOM (Document Object Model) in memory, XProc has a basic limitation on total document size. Note that since a DOM in any language involves many objects in addition to the memory required for text storage, really large documents may have to be processed in pieces. Due to the complexity of creating an XProc XML Pipeline, I suspect that many potential users will be looking for XProc editor plugins for their favorite IDE.
The XML Processing Group at the W3C handles specifications for XML Pipeline and XInclude.