Journal Publishing Series "We send Science around the world" Allen Press, Inc. September/October 1995

Publishing with SGML

This newsletter is the latest in a continuing series devoted to the electronic publishing issues facing scientific and scholarly publishers. One of the most important challenges facing publishers in this time of rapid technological change is to find hardware and software standards that will survive. We understand that hype in the software and on-line service marketplace makes this difficult to do.

There are a number of coding standards now being marketed as the answer to the electronic publishing question. Some of them, particularly the ones that represent commercial, proprietary interests, will raise publishing costs and limit possibilities. Some represent quick fixes, bound by the current limitations of on-line protocols, and will lead to dead ends as technology advances. As we stated in our newsletter about electronic publishing a year ago, Allen Press is committed to SGML, Standard Generalized Markup Language, as "the electronic information format with the greatest future publishing potential" (The Allen Press Newsletter, August 1994, p. 6).

In the last year we have strengthened our commitment to SGML. Our purpose in this newsletter is to explain SGML's benefits as a publishing standard and describe how we have used SGML at Allen Press in the past year to ensure that our clients are prepared to take advantage of the developing on-line protocols for electronic publishing.

If you would like more information on how Allen Press can help you put your publication into SGML format, please feel free to contact us.

—Rand Allen
CEO
Allen Press, Inc.

Publishing with SGML: You Can Take It with You

by Ted Freeman, Electronic Publishing Coordinator

The degree of urgency in choosing the right publishing format for science and scholarship has been raised by the recent explosion of Internet-related technologies, particularly those of the World-Wide Web. It is no longer news that widespread Internet access is driving the future of science and scholarship toward on-line access. Though on-line publishing is still largely in the experimental stages, publishers of science and scholarship wanting to take advantage of the new electronic publishing opportunities should start by making sure that their published information begins in the right format. This means moving away from the old publishing model in which information is processed as if it were bound to the printing process—or even to the concept of the page—and toward a format that promotes new product flexibility, custom publishing, and intelligent electronic delivery.

The Benefits of SGML—Setting a High Standard

Your data may be used a year or two from now in ways you haven't anticipated. The portability of that data—being able to use it in a variety of ways on a variety of platforms without having to worry about what it costs to convert it—should, therefore, be of primary importance to you. Portability requires an accepted standard. Consider the advantages of being able to buy camera film in ISO film speeds.

As a standard, SGML is the product of two decades of scientific study regarding text processing and is endorsed by over three dozen standards organizations, including ISO (ISO 8879:1986 of the International Organization for Standardization) and the AAP (Association of American Publishers). For this reason SGML has achieved widespread adoption by the military (through the CALS initiative) and by the semiconductor, aerospace, automotive, and many other industries. An increasing number of publishers are also adopting SGML as their format for transmission and interchange. Humanities publishers, for instance, in the Text Encoding Initiative (TEI), have created a comprehensive SGML system for data exchange in literary research.

ASCII and Ye Shall Receive

Part of SGML's strength is that it is a standard based on the ASCII character set (ISO 646-1983), understood by the vast majority of computers in the world, and this is important. It means that text and all of the information SGML adds to the text—its elements, their properties, and their architecture—are tagged with ASCII characters, with nothing hidden. SGML is a universal markup language describing any document to any computer that reads ASCII characters. Unlike, say, the WordPerfect code for a hang indent—to indicate a citation format, perhaps—which is specific to that word processor and is lost when the text is transferred out of WordPerfect and into another format, SGML tags are simply other ASCII letters and characters typed as part of the text. This plain ASCII-character makeup of SGML makes it non-platform, non-application specific: there is no need to "translate" for any machine-dependencies, hardware architectures, operating systems or application-specific formats.

The SGML document does not depend on computers and tools the way proprietary standards like PostScript® do. Adobe decides whether your software is current and when you have to buy new equipment to take advantage of the latest PostScript® incarnation. Computers and software will come and go; the SGML document will be as usable any time in the future as it is today.

Structure vs. Appearance

The difference between SGML and what is usually thought of as "markup" is the difference between structure and appearance or what we generally call "formatting." SGML does not describe mere formatting or a collection of typeset pages as some other coding systems used for electronic publishing do. When used most effectively SGML is rarely concerned with the way something looks. It is neither a page description language (like PostScript®) nor a format-specifying language (like TeX or nroff), but a "generic markup" language used to identify the structure and content of a document, component by component.

Consider the following SGML-tagged "citation":

<citation id=entc-15-01-b08><author><fname>M.L.</fname><surname>Brusseau</surname></author><date>1993</date><title>Using QSAR to evaluate phenomenological models for sorption of organic compounds by soil.</title><sertitle>Environ. Toxicol. Chem.</sertitle><vol>12</vol><fpage>1835</fpage><lpage>1846</lpage></citation>

The SGML tags (delimited by "<", "</", and ">") indicate content and a nested structure, including a link: the identifier ("entc-15-01-b08") that marks the citation for cross-referencing.

Why is primacy of content and structure in SGML good? Because it is more important to know that a portion of text in a document is a citation, uniquely identified and made up of certain nested elements, than to know only that it is one among many hang-indented paragraphs. Structure makes the tagged data or information more useful, particularly in light of the Internet, where hyperdocuments require recognizable, consistent, well-defined, intelligently structured data. SGML's structure is based on interrelated components and makes text into a kind of database that may be selectively searched, sorted, mass updated, and recombined, not to mention formatted in any number of different ways, depending on the application used and the purpose to be served.
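To make the "text as database" idea concrete, here is a minimal sketch in Python. It reads the sample citation into a record and then prints it two different ways. The function names and the two output styles are our own illustrations, not part of any SGML tool, and the small pattern-matching tokenizer stands in for a real SGML parser; it assumes balanced tags and no tag minimization.

```python
import re

# The sample citation from the text, with its tags balanced.
CITATION = (
    "<citation id=entc-15-01-b08><author><fname>M.L.</fname>"
    "<surname>Brusseau</surname></author><date>1993</date>"
    "<title>Using QSAR to evaluate phenomenological models for "
    "sorption of organic compounds by soil.</title>"
    "<sertitle>Environ. Toxicol. Chem.</sertitle><vol>12</vol>"
    "<fpage>1835</fpage><lpage>1846</lpage></citation>"
)

# Matches either a start/end tag or a run of text between tags.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^>]*)>|([^<]+)")


def parse_citation(sgml):
    """Return a flat dict mapping each leaf element name to its text."""
    record, stack = {}, []
    for match in TOKEN.finditer(sgml):
        closing, name, attrs, text = match.groups()
        if text is not None:
            # Text belongs to the innermost open element.
            if stack and text.strip():
                record[stack[-1]] = text.strip()
        elif closing:
            if not stack or stack.pop() != name:
                raise ValueError(f"unbalanced end tag: {name}")
        else:
            stack.append(name)
            id_match = re.search(r'\bid="?([\w.-]+)"?', attrs)
            if id_match:
                record["id"] = id_match.group(1)
    return record


def as_reference_list(rec):
    """Hypothetical output style 1: a full reference-list entry."""
    return (f"{rec['surname']}, {rec['fname']} {rec['date']}. {rec['title']} "
            f"{rec['sertitle']} {rec['vol']}:{rec['fpage']}-{rec['lpage']}.")


def as_index_entry(rec):
    """Hypothetical output style 2: a short entry keyed by the citation id."""
    return f"{rec['surname']} ({rec['date']}) [{rec['id']}]"


record = parse_citation(CITATION)
print(as_reference_list(record))
print(as_index_entry(record))
```

The same tagged text yields both outputs without being rekeyed. A hang-indented paragraph, by contrast, would have to be picked apart by guessing at punctuation.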

The way an SGML document looks will be determined by a formatting application of some kind, preferably one that is SGML-compliant, such as DynaText™. Even non-SGML formatting or paging programs are able to convert SGML into proprietary coding more easily because of the consistent, logical structure of an SGML document. Because of this consistency, the formatting applied to an SGML document also applies automatically to every other SGML document in the same class. SGML-compliant formatters range from Mosaic and Netscape on the low end to systems that include a database server, full-text search engines, and intuitive client software, such as OCLC's Guidon™. The point is that the SGML document serves as the core document containing all the information necessary to process it—whether for printing or electronic delivery.

DTDs: The Benevolent Tyranny

SGML contains a provision for assuring that a document adheres to a defined structure: the DTD or Document Type Definition. The DTD is an ASCII file, separate from the text document, that defines the elements and the rules (whose strictness is user-defined) governing any document that identifies itself as "doctype" of that DTD. This is the part of SGML that is often overlooked or misunderstood, yet it is the source of SGML's power. SGML documents can be "validated" to ensure that they follow the structure defined in the DTD. This important checking step does not exist in other kinds of markup systems and is what guarantees some level of consistency in SGML documents.

In checking the conformance of a document to the rules or "grammar" described in the DTD, a parser is used to ensure, for instance, that article titles are followed immediately by at least one author's name, that paragraphs occur only inside sections and sections only inside chapters, and that every citation is uniquely identified and connected to any reference to it. Validating SGML editors (specialized word processors used to create an SGML document, such as SoftQuad's Author/Editor™) use the DTD to enforce the rules as the document is being tagged, so that the user is constrained to follow them.
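The kind of checking a validating parser performs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names and rules below are hypothetical, the rules are written as ordinary regular expressions rather than in real DTD syntax, and the document is given as nested tuples instead of being parsed from SGML text.

```python
import re

# Hypothetical rules, loosely in the spirit of a DTD content model.
# Each element maps to a regular expression over the names of its
# children, each name followed by a single space.
CONTENT_MODEL = {
    "article": r"title (author )+(section )+",  # title, then 1+ authors, then 1+ sections
    "section": r"(para )+",                     # paragraphs live inside sections
    "title": r"",                               # leaf elements: no child elements
    "author": r"",
    "para": r"",
}


def validate(element, errors=None, path=""):
    """Collect content-model violations for an element and its descendants.

    An element is a (name, children) tuple; children is a list of elements.
    """
    name, children = element
    errors = [] if errors is None else errors
    here = f"{path}/{name}"
    model = CONTENT_MODEL.get(name)
    if model is None:
        errors.append(f"{here}: element not declared")
        return errors
    child_names = "".join(child[0] + " " for child in children)
    if not re.fullmatch(model, child_names):
        errors.append(
            f"{here}: children [{child_names.strip()}] violate the content model"
        )
    for child in children:
        validate(child, errors, here)
    return errors


good = ("article", [("title", []), ("author", []),
                    ("section", [("para", [])])])

# No author after the title, and a paragraph outside any section.
bad = ("article", [("title", []), ("section", [("para", [])]),
                   ("para", [])])

print(validate(good))
print(validate(bad))
```

The first document passes with no errors; the second is reported at the article level because its children do not match the declared order. A validating editor applies the same kind of rule while the document is being tagged rather than afterward.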

Because DTD writing can be difficult and expensive, experts will tell you not to write your own unless you have to. In the book and periodical field, many have chosen to follow ISO 12083, which includes book, article, and serial DTDs developed under the auspices of the Association of American Publishers. Another reason for using this standard DTD is that it makes document interchange easier. Many SGML editing and formatting applications already include ISO 12083 as a built-in option or soon will. Because it is difficult to make a DTD that is everything to all people, we, like many others, use a slightly modified version of the 12083 article DTD. By using 12083, we feel that we are enhancing both the ease of document interchange and the custom formatting of the SGML documents we produce.

Getting to SGML

Converting from Proprietary Formats

At Allen Press, we began by converting our proprietary typeset files into SGML at the end of our production process. We abandoned this effort as error-prone, time-consuming, and expensive. Even though we were using sophisticated conversion software in an attempt to automate this conversion, the difference between typesetting codes and SGML tag structure, particularly when that structure is complicated, makes conversion an adventure I don't recommend. Unless the data you are converting to SGML has implicit structure and, more important, that structure is consistent in all the documents in the group to be converted, you're better off creating SGML manually using an SGML editor, such as Author/Editor™. This is not as difficult as you might think, particularly now that major word-processing applications such as Microsoft Word™ and Novell's WordPerfect™ offer SGML support for importing to and exporting from their proprietary formats.

SGML as Core Document

We are now manually creating SGML documents, with some automation afforded by our SGML editor, at the beginning of the typesetting process. Our typesetting vendor has revamped its batch pagination program to accept SGML-coded material with little or no conversion necessary. Any SGML structures which have nothing to do with format are ignored, while any additional coding required for typesetting is automatically stripped from the files during export. Because of the structure imposed by SGML, we have found that this system has greatly improved the consistency and accuracy of the files we typeset.

SGML and Electronic Publication

SGML is suited to virtually any kind of electronic publishing, from CD-ROM to the Internet. As an open standard it allows the incorporation of other ISO standards, such as DSSSL (Document Style and Semantics Specification Language) for style information and HyTime for the "hypermedia" of audio, video, and computer animation. It is only in an electronic environment, of course, that one can utilize SGML's hyperdocument potential.

Models of electronic publishing on-line vary widely, as do the formats being used. For simple material there are plain ASCII text files with little or no formatting that are viewable and printable from gopher systems and downloadable from ftp sites. For more complex material there are page-based or formatting file formats—PostScript® and/or PDF—that are a combination of bitmapped page images and unfielded (unstructured) ASCII text files. There are TeX files available by ftp for those who can reproduce TeX on their computer screen and/or printer.

There are no doubt many kinds of documents that are served well by plain text, bitmapped page images, or both, just as some document collections may not require more than the simple organizational structure of a gopher system. Scientific research, however, with its equations, tables, figures, and references to external entities, requires an electronic model that in principle is much like the World-Wide Web.

The Little DTD That Could

Hypertext Markup Language (HTML), the SGML-derived tagging system of the Web, is leading the way in establishing SGML as the dominant document structuring system for on-line publishing. It is often noted that HTML is a relatively simple SGML DTD. Part of its overwhelming success on the Web is, in fact, its simplicity, which allows web browsers to contain standard formats applying to all documents tagged in conformance with its DTD. The simplicity of HTML is also its biggest weakness. While it is good for the creation of Home Pages, tables of contents with pointers, and short, simple documents, it is not designed to support the complex documents of science.

There are several HTML-based scientific publishing efforts now on the Web. As impressive as some are (e.g., Journal of Biological Chemistry, http://www-jbc.stanford.edu/jbc/), the constraints they are under are obvious. These include lack of formatting possibilities due to HTML's limited tag set, particularly for tables and equations, as well as lack of support for symbol and special character sets. There are some very elaborate systems set up to convert SGML from a database on the fly to HTML, but the constraints of HTML are still there and much of the benefit of SGML is lost in the process. HTML is not, as we say, robust enough to be anything more than a gateway to the much fuller possibilities of SGML on the Web.
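What is lost in such a conversion can be shown with a short Python sketch. The tag mapping below is hypothetical, chosen only for illustration, and the converter is a simple pattern substitution rather than a description of any production system; it does not escape special characters or handle entities.

```python
import re

# Hypothetical mapping from structural SGML tags to presentational HTML tags.
# Elements without an entry lose their tags entirely; only their text survives.
TAG_MAP = {"citation": "p", "title": "i", "sertitle": "i", "vol": "b"}

SAMPLE = (
    "<citation id=c1><surname>Brusseau</surname> <date>1993</date> "
    "<title>Using QSAR.</title> <sertitle>Environ. Toxicol. Chem.</sertitle> "
    "<vol>12</vol></citation>"
)


def to_html(sgml):
    """Replace each SGML tag with its mapped HTML tag, dropping attributes."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        html = TAG_MAP.get(name)
        return f"<{slash}{html}>" if html else ""
    return re.sub(r"<(/?)([A-Za-z][\w.-]*)[^>]*>", swap, sgml)


print(to_html(SAMPLE))
# prints: <p>Brusseau 1993 <i>Using QSAR.</i> <i>Environ. Toxicol. Chem.</i> <b>12</b></p>
```

In the output the article title and the journal title are both merely italic, the author and date are undifferentiated text, and the citation's identifier is gone. The page looks right, but a program can no longer tell which piece is which.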

SGML on the Web

In our view, SGML Web browsers will play a significant part in the future of scholarly and scientific publishing on the Internet. An early entry into the SGML Web browser business is SoftQuad's Panorama™, launched from NCSA's Mosaic™. Both come in free versions and are available via ftp (http://www.oclc.org:5046/oclc/research/panorama/panorama.html). Panorama™ allows delivery of SGML files straight to the client browser. Because the DTD and custom stylesheets accompany the SGML files for viewing or printing (or are already resident in the client browser), the full potential of SGML is possible: special character support, DTD conformance, superior navigation, and hypermedia.

Graphics, videos, sounds, programming codes, and database queries can be linked to an SGML document as external files. These referenced files may be retrieved and processed in their native formats without the need for conversion into some other format. Add to the mix ISO standards like HyTime and DSSSL and make the web browsers that read SGML as widely available as Mosaic or Netscape, and you have all the ingredients for affordable electronic publishing of science on-line. Those who have spent a great deal of money on proprietary solutions to on-line publishing will be undone to an extent by the kind of solution represented by Panorama™, only one of several freeware SGML browsers likely to appear in the next year.

SGML and Electronic Publishing Services at Allen Press

If you are a regular reader of the Allen Press newsletters, you know that, in addition to journal production and association management and marketing, we offer several electronic publishing services at Allen Press, from simple disk conversion to Home Page creation and maintenance (see the July/August 1995 Allen Press Newsletter). We have recently added Web access to article titles and abstracts in a searchable HTML format (http://www.allenpress.com). SGML is our latest and, we think, potentially most important electronic publishing service to be offered—for the reasons stated in the preceding paragraphs.

We are now able to return to you or archive your post-print data as a collection of SGML documents based on the DTD of your choice or ours—ISO 12083. We offer advice about creating SGML documents to our clients who want to do their own tagging. We will also convert titles and abstracts from SGML into HTML and post them in searchable form on the Web as a service to your subscribers. Because the issues surrounding delivery of SGML on the Web are now being aggressively addressed by SGML vendors and standards committees, we hope in the not-too-distant future to be able to publish full-text articles, including equations, tables and figures, on the Web in SGML form accessible by SGML-aware Web browsers.

In the meantime, our belief is that, if you take care to develop an SGML-driven system, the rest of the publishing formula will take care of itself. It will not be Internet access or software that will ultimately determine successful publishing in an electronic environment, but good data, properly prepared for electronic distribution. The idea of SGML is to provide choices about how information comes in and how it goes out, choices now and in the future. It is our intent to continue to take the steps that will put us and our clients in a position where we can both benefit from the publishing changes in the future.

For the Really Interested

SGML is a complex set of rules for creating documents, much like a programming language, and defies short and easy definitions. For a more detailed and complete description of the standard, there are several solid and readable introductions: two of the best are TEI's "A Gentle Introduction to SGML" (http://words.hh.lib.umich.edu/bin/tei-tocs?div=DIV1&id=SG) and SoftQuad's "The SGML Primer" (http://www.sq.com/sgmlinfo/primintr.html). Also helpful is "A Brief History of SGML" by the standard's chief architect, Charles Goldfarb (http://www.sil.org/sgml/sgmlhist0.html). For the really interested, there are two widely read books:

Bryan, Martin. SGML: An Author's Guide to the Standard Generalized Markup Language. Wokingham/Reading/New York: Addison-Wesley, 1988. 380 pages. ISBN: 0-201-17535-5 (pbk); LC CALL NO: QA76.73.S44 B79 1988.

Herwijnen, Eric van. Practical SGML. 2nd edition. Boston/Dordrecht/London: Kluwer Academic Publishers, 1994. xx + 288 pages. ISBN: 0-7923-9434-8.

For more information about SGML, the place to start is the SGML Web Page at http://www.sil.org/sgml.html or contact Ted Freeman, Allen Press, Inc., 1041 New Hampshire Street, P.O. Box 368, Lawrence, KS 66044 U.S.A. Phone: 913-843-1234 or 800-627-0326, Fax: 913-843-1244 or E-mail: tfreeman@allenpress.com.