The Roots of SGML -- A Personal Recollection
By Dr. Charles F. Goldfarb
The wonderful SGML '96 conference and its theme have awakened all sorts of interesting memories and ideas, going back over the nearly thirty years that I've been involved with generalized markup. I'd like to share some of them with you. But as this memoir will be short, idiosyncratic, focused on events in which most important figures in the SGML community did not participate, and taken from memory rather than research, I feel a disclaimer is in order.
PARALEGAL DISCLAIMER AND APOLOGY
All persons, organizations, and activities (hereinafter "Entities") mentioned in this brief reminiscence are described only as I've been able to recall them and such descriptions are not necessarily how those Entities would describe themselves. I may have missed a detail here or there in an Entity description, but the overall result is pretty accurate. The inclusion or omission of an Entity should not be interpreted as an indication of the importance of said Entity to SGML, the author, or the world at large, although the omission of really important Entities, as always, could cause a parsing error.Before Generalized Markup Language (GML)
In 1966 I was an attorney practicing in Boston, MA, two years out of Harvard Law school. I knew nothing about computers, but I knew there had to be a better way to produce documents than dictating them, reviewing a draft, marking up the draft with corrections, reviewing the retyped draft, and then, in frustration, seeing that the typist had introduced more errors while making the corrections.
My hobby was being a "rallymaster", a person who created the route instructions for sports car rallies. These normally read something like:
26. Left at light onto Jones Rd.
27. Right onto Smith St.
Mine looked like:
26. Left at light onto Jones Rd.
27. (Repeat instructions 20 - 26, substituting "left" for "right".)
28. Second right.
I also did things like hand out road maps of Yugoslavia and expect contestants to turn right in response to the instruction "go towards Sarajevo".
Eventually a friend told me that my rally instructions looked like computer programs. I said "Really? What's a computer program?" Shortly thereafter, in November, 1967, I joined IBM to find out what kind of business would pay people for writing rally instructions. (The idea had also crossed my mind that the experience might equip me to attract legal clients from Boston's burgeoning high tech scene.)
My job was to design and install accounting systems for small businesses using not-so-state of the art punched card tabulating machines and occasionally, for the wealthier customers, a small computer. One assignment was different, though, and it eventually changed my career: installing a typesetting system for a local newspaper.
The system was an IBM 1130 computer, a machine the size of a desk with 8KB (sic!) of main memory, a 512KB disk drive, a Teletype CX paper tape reader and BRPE paper tape punch, and a Photon 713 photomechanical typesetter. The assignment was my first experience with managing a machine-readable document database: I learned to roll the punched paper tape carefully so that it could be stored neatly in cylindrical waste paper baskets.
In the meantime, though I didn't know about it, the roots of generalized markup were being planted. Historically, electronic manuscripts contained control codes or macros that caused the document to be formatted in a particular way ("specific coding"). In contrast, generic coding, which began in the late 1960s, uses descriptive tags (for example, "heading", rather than "format-17").
Many credit the start of the generic coding movement to a presentation made by William Tunnicliffe, chairman of the Graphic Communications Association (GCA) Composition Committee, during a meeting at the Canadian Government Printing Office in September 1967: his topic -- the separation of information content of documents from their format.
Bill went on teaching the world about "generic coding" under the auspices of Norm Scharpf and the GCA, then as now (and for all the years in between) unflagging believers, contributors, and promoters of the cause. At the same time, a New York book designer named Stanley Rice was publishing articles about "Standardized Editorial Structures", parameterized style macros based on the structural elements of publications.
Integrated Text Processing and GML
In early 1969 I had had my fill of wiring tab machines and I was ready to resume my accustomed place before the bar. Instead, IBM convinced me to join its Cambridge Scientific Center and figure out how to apply computers to legal practice. That project required integrating a text editing application with an information retrieval system and a page composition program. The documents had to be kept in a repository from which they could be selected by queries. The selected documents could be revised with the text editor and returned to the data base, or rendered for presentation by the composition program.
Standard stuff for SGML systems today, perhaps, but far from the way most people thought about document processing in 1969. So far, in fact, that the applications we needed to integrate were not only not designed to work together, they couldn't even run on the same operating system. Fortunately, we had access to CP-67, an early hypervisor that allowed multiple concurrent operating systems to run on the same computer and share files. The problem was that, even when Ed Mosher, Ted Peterson, and I finally got the programs to talk to one another, we found they each required different procedural markup in the document files.
I remember discussing this first attempt at integration with a senior IBM Industry Marketing Manager named Steve Furth, whom IBM people thought of as the father of document information retrieval. (He'd written a book on the subject in the days when a data base was as likely to use cardboard media as magnetic.) I mentioned that I thought it best to remove the procedural markup. He said something about that being wrong because the markup could have other uses. I said something like "you mean figuring out that some text is a caption because it is centered." He said "something like that" and referred me to Stan Rice's work. The rest, as they say, is history (or pre-history).
Later in 1969, together with Ed Mosher and Ray Lorie, I invented Generalized Markup Language (GML) to solve the data representation problem. GML was not merely an alternative to procedural markup, but the logical representation that motivated all processing. Ed recalls:
We called it Text Description Language at first, because I think that's what we thought it was. We certainly very early intended to use it as a common and general markup to be "translated" into Script [formatting] controls, ATMS & TERMTEXT & MTSC [formatting] codes, STAIRS [information retrieval descriptor] paragraph codes, as well as using an un-filled-in outline of tags as a prompter from which to create a new document.
IBM decided that our work had value beyond the law office application, and the focus of our project shifted to text processing in general. The project was given a name, "Integrated Text Processing" and the first prototype was dubbed "Integrated Textual Information Management Experiment" (InTIME). Our manager, Andy Symonds, gave us permission to report on the work in the Proceedings of the 1970 Annual Meeting of the American Society for Information Science. But we were not allowed to write about TDL/GML because IBM had decided that it had serious product potential. We could only hint at the need for codes "to identify the structure and purpose of the parts of text. ... The composition program would identify the codes as calls to stored formats; the retrieval program would use them for classification."
(There's more about InTIME in the 25th Anniversary Edition of the Journal of the American Society for Information Science, in the form of an annotated version of the original 1970 paper.)
Ed Mosher's technical notebook indicates that by 1971 we had succeeded with tag interpretation and multiple use (which Ed had implemented using Script set-symbols) and moved along into thinking about models and finite state machines. Ed that year developed the first production quality DTD, designed for the manuals for IBM's "Telecommunications Access Method" (TCAM). He was aided by TCAM publications manager, Joe Groppuso, whom I remember being particularly impressed that all the headings of a given head-level were formatted identically. That was a level of consistency they had not been able to achieve by their normal methods.
In 1971 the GCA Annual Meeting was held in Boston and Norm Scharpf, a former IBM Marketing Manager, had inquired as to whether our lab had anything interesting to show off on a site tour. I agreed to demonstrate the InTime prototype (without going into technical details), and to give a paper on "context editing". (That was heady stuff in 1971: you could actually navigate a file by searching for text strings instead of specifying line numbers!)
Norm invited me to a meeting of the "System X" committee, where I met Bill Tunnicliffe for the first time. There were 8 or 10 of us crowded into a hotel room in Boston, with steak dinners perched on our knees, discussing markup codes. I'm not sure about the technical results of the meeting, but I can say one thing for certain, having benefited from Norm's generosity in nurturing SGML and HyTime standards activities over the decades since: He's never fed another committee quite as well.
The GCA continued to work independently of our efforts in Cambridge. System X evolved into the "GenCode(R) concept", which recognized that different generic codes were needed for different kinds of documents, and that smaller documents could be incorporated as elements of larger ones. GCA and I eventually joined forces in 1978, when development of the SGML standard began.
(Bill Tunnicliffe became the first chairman of WG8, the ISO committee that developed and maintains the SGML family of standards. I mention it, although it is outside the period of this memoir, because Bill passed away on September 12 of this year, at the age of 74. We had a chance to honor him for his contributions in person at SGML '92. We won't have that chance again, so I want to thank him here.)
Later in 1971, when product development was imminent, I gave GML its present name so that our initials would always prove where it had originated. One of the ugly truths of technology transfer is that developers tend to be grateful for research work when first received, and virtually oblivious to it by the end of a lengthy development cycle, which in those days could take years and years. (Actually, it still takes that long today; they just bring the software to market much earlier in the development cycle.)
GML finally saw the light of day under its own name in 1973, shortly before the release of its first (relatively primitive) implementation in the "Advanced Text Management System" (ATMS). Here is that first public appearance, from my paper, "Design Considerations for Integrated Text Processing Systems", IBM Cambridge Scientific Center Technical Report G320-2094, May 1973 (but written in 1971):
This analysis of the markup process suggests that it should be possible to design a generalized markup language so that markup would be useful for more than one application or computer system. Such a language would restrict markup within the document to identification of the document's structure and other attributes. This could be done, for example, with mnemonic "tags". The designation of a component as being of a particular type would mean only that it will be processed identically to other components of that type. The actual processing commands, however, would not be included in the text, since these could vary from one application to another, and from one processing system to another.
After the completion of GML, I continued my research on document structures, creating additional concepts, such as short references, link processes, and concurrent document types, that were not part of GML. By far the most important of these was the concept of a validating parser that could read a document type definition and check the accuracy of markup, without going to the expense of actually processing a document. At that point SGML was born -- although it still had a lot of growing up to do.
Copyright 2001, Dr. Charles F. Goldfarb
FOR MORE INFORMATION