Hasse Haitto
President
Synex Information AB
Stora Nygatan 20
S-111 27 Stockholm
SWEDEN
+46 8 791 88 81
+46 8 791 88 89
haitto @ synex.se
http://www.synex.se
Hasse Haitto, President of Synex Information AB, co-founded the company in 1993. Synex Information specializes in SGML browsing software and was spun off after years of research and development of SGML technology invented at the Royal Institute of Technology of Stockholm (RIT). The company's flagship is Synex ViewPort, the market-leading SGML/HyTime browser engine. Hasse holds an M.Sc. in Engineering Physics from RIT and has worked with SGML since 1988.
This paper was presented at SGML Europe'97 in Barcelona and is part of the GCA conference proceedings. Copyright by the author and the Graphic Communications Association.
SGML has celebrated 10 years as a standard, and although the standard is only now being revised, the use of SGML has evolved over time. This paper explores some of the features that have made SGML successful, the importance of adopted conventions, and speculates on future applications as SGML transitions into the next century.
As an international standard, SGML is subject to orderly, voted-upon change. Already a decade in adoption, it is due to be revised. In many ways, the standard was farsighted in its design, a fact confirmed by its being applied well beyond its original publishing intentions, and by its becoming the foundation of promising standards such as ISO 10744 HyTime. Even the long delay in completing the companion standard ISO 10179 DSSSL has not significantly slowed SGML's rise to prominence.
However, even if the standard itself has yet to change, the use of SGML has made a number of transitions.
In DTD design, a significant trend has been moving away from structural towards content-oriented, functional design. As an example, one might formerly have tagged an assembly description in a maintenance manual as a bulleted list, but would these days rather tag it as a series of assembly steps, with hyperlinks to required tools, even if its rendition is a bulleted list.
The benefits of content-oriented tagging are increasingly in the re-use of information elements, in the coupling of SGML with databases, and in connection with queries which use the underlying markup as search criteria.
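The difference between tagging for content and tagging for appearance can be illustrated with a minimal sketch. It uses XML syntax and Python's standard library purely for convenience; the element and attribute names (`assembly`, `step`, `tool`) are invented for this illustration and do not come from any real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-oriented markup: the names below are assumptions
# made for this example, not taken from an actual maintenance-manual DTD.
PROCEDURE = """
<assembly>
  <step tool="torque-wrench">Tighten the four flange bolts.</step>
  <step tool="feeler-gauge">Check the gap at the bearing.</step>
  <step>Refit the cover.</step>
</assembly>
"""

def render_as_bullets(root):
    """Rendition: the steps can still be presented as a bulleted list."""
    return ["* " + step.text.strip() for step in root.iter("step")]

def tools_required(root):
    """Query: the markup itself serves as the search criterion."""
    return sorted({step.get("tool") for step in root.iter("step") if step.get("tool")})

root = ET.fromstring(PROCEDURE)
bullets = render_as_bullets(root)
tools = tools_required(root)
print("\n".join(bullets))
print(tools)
```

The same source yields both the printed list and an answer to "which tools does this procedure need?", which a purely structural bulleted list could not provide.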
The different ways SGML is used reflect a growing awareness of dealing with information rather than documents.
Early use of SGML reflected a one-size-fits-all approach of generic DTDs. This view was gradually replaced by that of domain-specific DTDs. The DocBook DTD is perhaps one of the most commonly known later efforts; various industries (aerospace, automobile, semiconductors, etc.) have implemented their own DTDs as well.
At the far end of the application spectrum, complex DTDs for IETMs (Interactive Electronic Technical Manuals) may need to include structures for conditional presentation behavior of input data and embed interactive elements such as clickable warning dialogs that are controlled by traversal rules.
DTDs have also become increasingly modular. A simple example is that of common elements that are included when necessary: say, a table model, or elements for mathematical formulae.
A very disciplined modular design has been adopted by the TEI (Text Encoding Initiative) in its set of DTDs (see Guidelines for Electronic Text Encoding and Interchange, edited by C.M. Sperberg-McQueen and Lou Burnard, http://www-tei.uic.edu/orgs/tei/). In these DTDs, you toggle the inclusion of DTD fragments as required, and the content models have provisions for being extended or replaced in a clean, extensible fashion. (These DTDs are highly recommended for study and use!)
Along with content-oriented tagging, the application of reusable information elements is becoming widespread, differentiating between tagging for storage and retrieval vs. generating data for some publishing-oriented DTD (say, HTML) from a storage-oriented markup.
Content-oriented storage DTDs are ideal for SGML document databases that support SGML querying capabilities.
As DTDs evolve, one may need to maintain and restore earlier versions of both DTDs and related document instances. There will be a growing need for tools that address DTD evolution, and that optimize queries based on SGML structures.
For IETMs, expect improvements in editing tools, to validate application semantics in addition to the primary SGML and SGML-related functionality.
For implementing re-usable information elements, the SGML subdoc feature may become a key player.
From the outset, portability and system independence were paramount for SGML, and were highly touted as selling points of the SGML approach. In practice, current tools still influence, to a certain degree, how you will use SGML, but at least SGML minimizes the application exposure. The tools are also improving.
The concept of entity as a virtual storage system insulates SGML from any particular file system convention, and thereby prevents the standard from being locked into any particular operating system. Although a simple idea, it is also one of SGML's strongest points, and one which has grown in importance over time.
With entities, the SGML standard can simply refer to an abstract entity manager to retrieve and deliver corresponding document contents, without worrying about how this will be done or from where the data is fetched. The entity mechanism is scalable, used for simple things like inserting a foreign character, all the way up to entire documents and referring to non-SGML notation data such as images or video.
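As a rough sketch of this idea, the following Python fragment separates the expansion of entity references from the component that actually fetches the data. The `EntityManager` interface, the `InMemoryEntities` class, and the `expand` helper are hypothetical names introduced here for illustration; real SGML entity handling (parameter entities, nested references, notations) is considerably richer.

```python
import re
from abc import ABC, abstractmethod

class EntityManager(ABC):
    """Hypothetical interface: deliver the content behind an entity name."""

    @abstractmethod
    def fetch(self, name: str) -> str:
        ...

class InMemoryEntities(EntityManager):
    """One possible storage back end; a file- or network-based one would
    implement the same interface without the caller noticing."""

    def __init__(self, table):
        self._table = dict(table)

    def fetch(self, name):
        return self._table[name]

def expand(text, manager):
    """Replace each &name; reference with whatever the manager delivers.
    Simplification: references are expanded once, not recursively."""
    return re.sub(r"&(\w+);", lambda m: manager.fetch(m.group(1)), text)

manager = InMemoryEntities({"eacute": "é", "legal": "All rights reserved."})
result = expand("Caf&eacute; menu. &legal;", manager)
print(result)
```

The caller of `expand` never learns where the data came from, which is the point of the abstraction: the storage arrangement can change without touching the documents.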
External entities are declared using system or public identifiers. The latter form is mapped to the former when resolved. As the name indicates, a system identifier is system-dependent. Initially, a large part of the SGML community's efforts addressed issues of converting legacy data, authoring, validation, DTD design, and of course processing SGML. Not much attention was paid to making entities available across SGML applications. This has changed in latter-day SGML use.
As SGML rose to prominence, more and more SGML tools appeared, and with a choice of tools, it became gradually clear that some form of harmonization was required to reach application independence, to isolate and neutralize the use of system identifiers. The SGML Open consortium thus agreed on an entity catalog, which defines a format for SGML systems to share common entities in a well-defined manner.
In its simplest form, the SGML Open catalog is a mapping between public and system identifiers. (It is actually more complex, and is currently being extended even further.) Many companies support the catalog format.
Once you start using public identifiers and the catalog scheme, you note that they offer an advantage over using system identifiers directly. You can reorganize the storage of your documents and update only a single spot: the affected catalog.
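The simplest form of the catalog, a table from public to system identifiers, can be sketched as follows. This is an illustration only: it reads just `PUBLIC` entries with double-quoted literals, the identifiers and paths are made up, and the rule that the catalog takes precedence over a declared system identifier is a simplifying assumption rather than a statement of what the SGML Open resolution prescribes.

```python
import re

# Entries in the style of an SGML Open catalog. The public identifiers and
# file paths are invented for this example.
CATALOG_TEXT = '''
PUBLIC "-//Example//DTD Manual//EN"       "dtds/manual.dtd"
PUBLIC "-//Example//ENTITIES Symbols//EN" "entities/symbols.ent"
'''

ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')

def load_catalog(text):
    """Return a dict mapping public identifiers to system identifiers."""
    return {pubid: sysid for pubid, sysid in ENTRY.findall(text)}

def resolve(catalog, public_id, system_id=None):
    """Prefer the catalog mapping; otherwise fall back to the system
    identifier given in the declaration (assumed precedence)."""
    return catalog.get(public_id, system_id)

catalog = load_catalog(CATALOG_TEXT)
print(resolve(catalog, "-//Example//DTD Manual//EN"))

# Moving the DTD to another location means editing one catalog line;
# the documents that cite the public identifier stay untouched.
relocated = load_catalog(CATALOG_TEXT.replace("dtds/manual.dtd", "/srv/sgml/manual.dtd"))
print(resolve(relocated, "-//Example//DTD Manual//EN"))
```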
With a customizable entity manager, you can further handle the processing of entities freely, to build on this powerful paradigm. A couple examples are:
The dynamic processing of SGML is a recent application, as applications have moved from static, pre-compiled proprietary data generated out of SGML source to working with SGML directly.
Being able to address indirection will become increasingly important, and is required in the design of complex documentation systems that address redundancy and distributed server-based information bases. Avoiding hard-wired system identifiers will become more important, except perhaps for on-demand online publishing, where SGML documents may be generated as a transient, temporary piece of information which is read or processed, and then dispensed with.
Around the same time as SGML companies began to adopt the SGML Open entity catalog, HTML made its sweep across the world, and it became an interesting proposition to access SGML on the Web as well. It therefore became necessary to use SGML dynamically.
Two years of experience attest to the fact that SGML is harder to serve efficiently on the Web, compared to HTML. HTML, described in SGML terms, is essentially a fixed DTD with few elements, all of which are tied to a pre-defined layout. This allows HTML browsers to do a number of optimizations because of known pre-conditions. In contrast, an SGML browser needs a DTD, possibly included DTD fragments and entity sets, and support files such as style sheets. To do this kind of processing efficiently, the browser should support both local and remote catalogs, so that only data which cannot be found locally is transmitted across the Web.
However, SGML documents tend to be complex, lengthy, and highly structured. In particular, by the very definition of the standard, the topmost element will encapsulate the entire document, and you thus have to read all of the document before you are done with it. All of these factors have a bearing on web publishing: SGML is a bit cumbersome to use as is. Outside of intranets, transmitting SGML data becomes a time-consuming proposition because of current bandwidth restrictions. Two ways of addressing this problem have emerged.
XML (eXtensible Markup Language), currently being designed, simplifies SGML extensively and, though designed primarily for web publishing, is general enough to be useful far beyond this use.
Thanks to HTML, the point was realized that DTDs may not always be necessary, and that people will gladly tag their documents as long as it is easy enough to do so. In consequence, XML does not require you to have a DTD, which means that XML documents need not be valid (but they can be); it is sufficient that they are well-formed.
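The distinction can be made concrete with a small check. The sketch below relies on Python's `xml.etree.ElementTree`, a non-validating parser: it accepts any well-formed document without consulting a DTD, and rejects text whose tags do not nest properly. The helper name `is_well_formed` is an assumption of this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text parses as XML. No DTD is consulted, so this says
    nothing about validity against a document type."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested tags: well-formed, although no DTD declares these elements.
print(is_well_formed("<note><to>Reader</to></note>"))

# Overlapping tags: not well-formed, so not XML at all.
print(is_well_formed("<note><to>Reader</note></to>"))
```

Checking validity would additionally require a DTD and a validating parser, which this sketch deliberately leaves out.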
The XML Editorial Review Board has adopted a minimalist approach to keep the specification light-weight and easily implementable.
In order to transmit SGML more efficiently on the web, SGML Open has defined a technical resolution to permit SGML documents to be served in chunks, with just enough context information about where the corresponding document fragment belongs. It appears likely that this effort will be eclipsed by the emergence of XML.
However, the fragment approach brings up the question of addressing, that is, of describing locations in SGML documents. This is covered in the next section.
Online SGML publishing will initially be successful in intranets as they have the bandwidth to support it, but will eventually migrate to the Web as well. XML will pave the way for this evolution; both SGML and XML will co-exist with HTML as HTML addresses different requirements than those which are solved by an SGML-purist approach.
Conventions, through organizations like SGML Open, the TEI, and the W3C, complement the standardization process. It is likely that this trend will continue and grow.
SGML was originally designed for the name space of a single document, so one could not mark up (in a standard-defined manner) links to other documents. This shortcoming will be fixed in the upcoming revision of the standard.
In the meantime, excellent SGML-based approaches have been designed and implemented in the last few years.
The TEI Guidelines define an SGML-based method for describing links and spans in documents. Although not an international standard, the TEI extended pointer mechanism is an influential and elegant addressing method, which permits structural links (such as addressing children or an enumerated element occurrence). The TEI links are also notationally compact and, as they can be described in a single string, are suitable for parameter passing.
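The flavor of structural addressing, reaching an element by walking to the n-th child of a given type, can be sketched in a few lines. The pointer notation used here, `child(n,name)` steps separated by spaces, is a deliberately simplified stand-in invented for this example; it is not the TEI extended pointer syntax, and the sample document is likewise made up.

```python
import re
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<manual>"
    "<section><para>Intro</para></section>"
    "<section><para>First</para><para>Second</para></section>"
    "</manual>"
)

# One step of the hypothetical notation: child(<occurrence>,<element name>)
STEP = re.compile(r"child\((\d+),(\w+)\)")

def locate(root, pointer):
    """Walk from the root, taking at each step the n-th child of the
    named element type. Occurrences are counted from 1."""
    node = root
    for n, name in STEP.findall(pointer):
        matches = [child for child in node if child.tag == name]
        node = matches[int(n) - 1]
    return node

target = locate(DOC, "child(2,section) child(2,para)")
print(target.text)
```

Because the whole location fits in one string, it can be passed around as a parameter, for example in a query or a link target, which is the property the text above highlights.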
Note also that the TEI extended pointers are being incorporated into the XML specification.
HyTime is the ISO standard for hypertext and multimedia, and is itself an application of SGML.
Currently, subsets of the HyTime hyperlinking features have been most widely implemented, to support addressing, which is used for bookmarks, annotations, and similar user-defined (meta)data coupled to SGML documents; more complex use of HyTime can be found in the field of IETMs.
Note that one can support portions of HyTime selectively to great benefit. As an example of how the use of SGML transitions, consider the content-oriented tagging which, together with HyTime functionality in the Topic Map processing tool, has enabled the automatic creation of the electronic equivalent of printed indexes, glossaries, and thesauri in these proceedings.
As SGML serves as a foundation for HyTime, HyTime in turn is applied in upcoming standards for Topic Navigation Mapping (ISO/IEC CD 13250) and the Standard Music Description Language (ISO/IEC CD 10743).
Joan Smith has written the following about HyTime: "This is the application of SGML that is destined to take information processing into the next millennium." This is certainly true, especially as the standard is so multi-faceted and complex that it will be several years before we see large-scale deployment of any comprehensive implementations of the standard. And just as with SGML, HyTime will be put to uses its designers did not foresee.
You ain't seen nothing yet! We are only scratching at the surface of these novel uses.
The SGML standard has a number of optional features, several of which are seldom implemented (and rightly so!). However, features such as subdoc and link can be put to good use. The TEI community has also found a need for concur.
In SGML, a sub-document is an SGML document that assumes the current SGML declaration but has its own DTD (so the instance is a self-contained name space). This is therefore a natural unit for information re-use.
The link feature (which has nothing to do with hyperlinks) lets you associate new attributes with a resulting SGML document when processing a source SGML document. This transformation process can be used for a number of things, such as associating style information or support data for the visually impaired. As the link definition is part of the DTD, all corresponding document instances are affected.
The link feature also complements ISO 10179 DSSSL, Document Style Semantics and Specification Language.
The foundation for the future of SGML has been laid in the alignment of DSSSL and HyTime, which brought about property sets, groves, and a common query language.
DSSSL has a style language, which standardizes the formatting description of SGML documents, and a transformation language, to process instances. Its query language SDQL replaces HyTime's HyQ query language.
Groves are an application-independent abstraction of the result of parsing, which can therefore be unambiguously understood between applications. Effectively, there is now a model for different tools to share any piece of SGML information.
Property sets define object classes and their properties; the SGML property set (published in the DSSSL standard) is used by DSSSL and HyTime, and will become part of the revised SGML standard. The output of an SGML parser can thus be described in these terms.
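To give a feel for describing parser output as nodes with a class and named properties, here is a much-reduced sketch. It is not the SGML property set: the class names `element` and `data` and the property names `gi`, `attributes`, `content`, and `text` are chosen for illustration, and an XML parser from Python's standard library stands in for an SGML parser.

```python
import xml.etree.ElementTree as ET

def to_nodes(element):
    """Describe a parsed element as a node: a class name plus named
    properties. Whitespace-only text is dropped to keep the example small."""
    content = []
    if element.text and element.text.strip():
        content.append({"class": "data", "text": element.text.strip()})
    for child in element:
        content.append(to_nodes(child))
        if child.tail and child.tail.strip():
            content.append({"class": "data", "text": child.tail.strip()})
    return {
        "class": "element",
        "gi": element.tag,  # generic identifier, i.e. the element type name
        "attributes": dict(element.attrib),
        "content": content,
    }

def count_class(node, cls):
    """A tool that knows only the node model, not the parser behind it."""
    own = 1 if node["class"] == cls else 0
    return own + sum(count_class(child, cls) for child in node.get("content", []))

source = '<step tool="wrench">Tighten <part>bolt</part> firmly.</step>'
grove = to_nodes(ET.fromstring(source))
print(grove["gi"], grove["attributes"])
print(count_class(grove, "element"), count_class(grove, "data"))
```

The second function works on the node description alone, which hints at how separate tools could share parsed information once they agree on the classes and properties involved.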
It is exhilarating that these state-of-the-art advances are already being matched by tools (the SGML community is greatly indebted to James Clark).
Grove-based tools will radically change the way applications work with SGML as the formalism of groves goes hand in hand with the trend of content-oriented, dynamic, re-usable information elements. In particular, SGML documents can now become truly application independent, and tools can be devised for specific subtasks.
Life is change. SGML and its evolutionary use reflect the new requirements that develop as a continuous process in the transition towards our common future.
In the first decade of SGML, we have witnessed a remarkable change in the world's perception of tagged data. On a global scale, documents are coming online, and it might appear that we are reaching a critical information mass. However, arsenals of tools (search engines, agents, browsers, processors, formatters, etc.) for HTML, XML, DSSSL, HyTime, and SGML are being deployed as well, at a rate that indicates that the age of information will not be a threat, but a promise; a promise of information-centered, content-driven applications that speak the universal language of SGML.