Communications of the ACM

Tracing the Roots of Markup Languages


There is an interesting story behind the invention of markup languages. It is said that lawyers were wading through hundreds of previous cases to find the specific examples they needed, and there was no way to merge new files with those that had already been found and processed. A bright young lawyer discovered a way to automate this process and save attorneys a considerable amount of time and money. Charles Goldfarb, a research team leader at IBM, along with Ed Mosher and Ray Lorie, produced the Generalized Markup Language (GML), which automated the law office. Goldfarb later went on to create the Standard Generalized Markup Language (SGML) [2], an international standard for the definition of device-independent, system-independent methods of representing texts in electronic form. SGML is based on a descriptive markup system, which provides codes to categorize parts of a document. SGML introduces the notion of a document type, and hence a document type definition (DTD), which formally defines a type of document by its underlying structure and constituent parts. Historically, markup refers to the process of marking manuscript copy with typesetting directions for type, fonts, sizes, spacing, and indentation [1]. In computerese, "markup" refers to the tags that describe the formatting of a document.

Traditionally, output devices were limited in their capability to display text. Dumb terminals could display text in only one font, size, and color. Today, devices are capable of displaying information in various fonts, sizes, and colors, along with graphics. The technology behind these documents is markup languages, which control the display behavior of the output devices. These are different from programming languages, which process input data through calculations and produce results that a user can utilize. Markup languages are static: they do not process information and can do nothing by themselves. However, programs can be written to take advantage of the knowledge encapsulated in the document structure so they can behave in a more intelligent manner. HyperText Markup Language (HTML) is one of the simplest and most popular markup languages derived from SGML.

HTML. In 1989, Tim Berners-Lee was looking for a means for scientists to share information from any location when he developed, from Goldfarb's SGML, a hypertext document format that could be linked to and read anywhere. The method is relatively simple to learn and easy to implement (see footnote 1). An HTML file is a text file containing small markup tags that instruct the Web browser how to display the page. HTML documents are made up of HTML elements, which are in turn defined using a predefined set of tags such as <HTML>, <Head>, <Body>, <Input>, <Applet>, <Font>, and so on. These tags are interpreted by applications, but HTML is inflexible for sharing documents. It cannot efficiently handle the complex client/server communication of modern applications, and we cannot define new tags in HTML to suit our needs. The solution to this problem is a metalanguage such as the eXtensible Markup Language (XML), a subset of SGML. A metalanguage is nothing but a formal means of describing a language.
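To make the point about tags being interpreted by applications concrete, here is a minimal sketch in Python using the standard library's html.parser module. The sample page and the names PAGE, TagLister, and scan are illustrative assumptions, not anything from the article; the sketch only shows that a program sees an HTML file as a stream of predefined tags and text.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for illustration.
PAGE = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello, <b>Web</b>!</p></body></html>"
)


class TagLister(HTMLParser):
    """Records every start tag and all character data it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, in document order.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text found between tags.
        self.text.append(data)


def scan(page):
    """Return (list of start tags, concatenated text) for an HTML string."""
    parser = TagLister()
    parser.feed(page)
    parser.close()
    return parser.tags, "".join(parser.text)


tags, text = scan(PAGE)
```

A browser does far more than this, but the division of labor is the same: the markup carries no behavior of its own, and the application decides what each tag means.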

XML is a restricted form of SGML (ISO 8879). HTML documents were not designed for interaction between a server and a client, whereas developers envisioned a more powerful Web. XML describes a class of data objects called XML documents, which are stored on computers, and partially describes the behavior of programs that process these objects. XML has precise and deep structures, and it is extensible, which means authors can create their own elements. Documents can be customized according to the kind of information that needs processing. Hence, if the discipline is chemistry, a document type might be created with elements such as <ATOMS>, <MOLECULE>, <BONDS>, <FORMULA>, and so on. Table 1 shows some common XML notations.
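The chemistry example can be sketched in a few lines of Python with the standard library's xml.etree.ElementTree. The element names MOLECULE, FORMULA, ATOMS, and BONDS come from the text above; the child elements ATOM and BOND, the attribute names, and the build_molecule helper are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET


def build_molecule(name, formula, atoms, bonds):
    """Build a MOLECULE element from plain Python data.

    atoms: list of (id, symbol) pairs; bonds: list of (from_id, to_id) pairs.
    ATOM/BOND and all attribute names are hypothetical, not a real schema.
    """
    molecule = ET.Element("MOLECULE", {"name": name})
    ET.SubElement(molecule, "FORMULA").text = formula
    atoms_el = ET.SubElement(molecule, "ATOMS")
    for atom_id, symbol in atoms:
        ET.SubElement(atoms_el, "ATOM", {"id": atom_id, "symbol": symbol})
    bonds_el = ET.SubElement(molecule, "BONDS")
    for start, end in bonds:
        ET.SubElement(bonds_el, "BOND", {"from": start, "to": end})
    return molecule


water = build_molecule(
    "water",
    "H2O",
    atoms=[("a1", "O"), ("a2", "H"), ("a3", "H")],
    bonds=[("a1", "a2"), ("a1", "a3")],
)

# Serialize to text, then parse it back as any receiving application would.
xml_text = ET.tostring(water, encoding="unicode")
parsed = ET.fromstring(xml_text)
symbols = [atom.get("symbol") for atom in parsed.iter("ATOM")]
```

Nothing in the parser knows about chemistry; the vocabulary is defined entirely by the document author, which is what extensibility means here.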

XML helps computers communicate better [4] by separating presentation from content and by enabling information to be readily passed between applications. An added bonus is that humans can easily maintain the code. XML can be viewed in many different ways. One view is that XML is a document format. Others view XML as a hierarchical store of data that is passed across a network from one application or agent to another. Each of these viewpoints is valid, and the variety of answers is due to the sheer versatility of XML technology. If we view an XML document as a serialized hierarchical structure, a program can access that structure through the Simple API for XML (SAX) or the Document Object Model (DOM). Both provide interfaces for manipulating XML: SAX is an event-based API, while DOM is a tree-based API. XML's simple syntax is used to support a wide variety of applications. Figure 1 illustrates this idea in a simplistic way.
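The difference between the event-based and tree-based styles is easiest to see side by side. The following sketch uses Python's standard xml.sax and xml.dom.minidom modules to extract the same data both ways. The student document and the names DOC, NameCollector, names_via_sax, and names_via_dom are hypothetical and are not taken from the article's figures.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document, invented for illustration.
DOC = (
    "<students>"
    "<student id='s1'><name>Ada</name></student>"
    "<student id='s2'><name>Alan</name></student>"
    "</students>"
)


class NameCollector(xml.sax.ContentHandler):
    """SAX style: react to parse events as they stream past."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False


def names_via_sax(doc):
    handler = NameCollector()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.names


def names_via_dom(doc):
    """DOM style: load the whole tree, then navigate it."""
    tree = minidom.parseString(doc)
    return [node.firstChild.data for node in tree.getElementsByTagName("name")]
```

SAX never holds the whole document in memory, which suits large inputs, while DOM trades memory for the convenience of random access to any node.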

Figure 2 shows a real-world example of an XML application. Figure 2(a) shows an XML document with student information; the corresponding DTD is depicted in 2(b); and 2(c) briefly outlines how an application can interpret this XML document. The number of XML applications is growing rapidly, and the growth pattern is likely to continue, particularly in fields such as health care, the Internal Revenue Service, government, and finance. The use of XML leads to a simple way of representing and organizing data, so that problems of data incompatibility and tedious manual rekeying could become manageable. An important XML application for wireless and handheld devices is the Wireless Markup Language (WML). Table 3 summarizes some other current XML applications.

WML is a markup language whose specification is developed and maintained by an industrywide consortium called the WAP Forum. The specification defines the syntax, variables, and elements used in a valid WML file, and it includes the WML DTD. A valid WML document must conform to this DTD or it cannot be processed. If a phone or other communications device is said to be WAP-capable, it means it has a piece of software (a microbrowser) that fully understands how to handle all entities in the WML DTD.

WML was designed for low-bandwidth, small-display devices, using a deck of cards as its central concept. A single WML document is known as a deck; a single interaction between an agent and a user is known as a card. The beauty of this design is that multiple screens can be downloaded to the client in a single retrieval. Using WML Script, user selections or entries can be handled and routed to already loaded cards, thereby eliminating excessive transactions with remote servers. Of course, limited client capabilities bring a tradeoff. Depending on the client's memory restrictions, it may be necessary to divide cards among multiple decks to prevent a single deck from becoming too large. WML predefines a set of elements that can be combined to create a WML document.
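The deck-and-card structure described above can be sketched as follows. The two-card deck, its ids and titles, and the list_cards helper are invented for illustration; the wml, card, p, and a elements and the DOCTYPE line follow WML 1.1 as published by the WAP Forum. Python's standard ElementTree parser reads the document but does not validate it against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-card deck. Both cards arrive in a single retrieval, and
# the link in the first card jumps to the second without a server round trip.
DECK = """<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
  "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
  <card id="menu" title="Menu">
    <p><a href="#result">Show result</a></p>
  </card>
  <card id="result" title="Result">
    <p>Done.</p>
  </card>
</wml>"""


def list_cards(deck_text):
    """Return (id, title) for each card in a WML deck, in document order."""
    deck = ET.fromstring(deck_text)
    return [(card.get("id"), card.get("title")) for card in deck.findall("card")]


cards = list_cards(DECK)
```

Because a deck is the unit of download, keeping each deck small is a memory decision as much as a design one, which is the tradeoff noted above.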

WML Script Language. WML Script syntax is based on the ECMAScript programming language, standardized by ECMA (an association for standardizing information and communication systems). Unlike ECMAScript, however, the WML Script specification also defines a byte code and an interpreter reference architecture for optimal utilization of current narrowband communications channels and handheld device memory. Some basic syntactical features of the language are as follows:

  • The smallest unit of execution in WML Script is a statement, which must end with a semicolon.
  • WML Script is case-sensitive.
  • Comments can either be single-line (beginning with //) or multi-line (between /* and */).
  • A string literal is defined as a sequence of zero or more characters enclosed within quotes.
  • Boolean literal values correspond to true and false.
  • New variables are declared using the var keyword.

WML Script is a weakly typed language. No type checking is done at compile time or run time, and no variable types are explicitly declared. The programmer does not need to specify the type of any variable; WML Script automatically converts between types as needed. WML Script is not object-oriented, so it is not possible to create user-defined data types programmatically. Internally, WML Script supports the Boolean, Integer, Floating Point, String, and Invalid data types.

WML Script has a variety of operators that support value assignment operations, arithmetic operations, logical operations, string operations, comparison operations, and array operations. The operators and expressions supported by WML Script are virtually identical to those of the JavaScript programming language. WML Script also supports a number of control statements for handling branching within programs. These include if-else, for loop, while loop, break, and continue statements.

Functions and Standard Libraries

A sample WML Script function, which takes distance and speed as input and calculates the time taken, looks like this:

  • function Runtime(distance, speed) { var time = distance / speed; return time; };

The return keyword is used to return the value of time. WML Script libraries provide several standard functions. The library name must be included with the function call. For example, to call the String library's length() function, the following syntax could be used:

  • var a = String.length("1234567890");

While WML Script does not support the creation of new objects, as is common in object-oriented programming, it does provide six prebuilt libraries to handle common tasks. A brief summary of the libraries is given in Table 2 [3].

Conclusion

Here, we traced the roots of markup languages, beginning with HTML as a derivative of SGML. We then looked at XML, a subset of SGML that is definitely more popular in industry today. WML is one of the important applications of XML in the field of wireless devices.

As the amount of information on the Web will only increase, so too will the difficulty of finding what we need. Markup languages are a way to represent and store information and thus help applications add meaning and value to it.

Separating structure from presentation is the essence of (semantically) meaningful, maintainable, accessible, and evolvable markup documents for the Web. It provides optimal interoperability with full Web integration, is used in a growing number of applications in all areas of human activity, and can be handled by an ever-increasing set of (mostly) free tools. We see increased use of markup languages, and their future revolves around making Internet applications easier and faster to use while keeping them versatile enough to provide flexibility and reuse.

References

1. Chicago Manual of Style. University of Chicago, IL.

2. Floyd, M. A conversation with Charles F. Goldfarb. WEB Techniques 3, 11 (Nov. 1998), 38–41.

3. Goldfarb, C.F. and Prescod, P. The XML Handbook. Prentice-Hall, Upper Saddle River, NJ. 1998.

4. Ladd, E., O'Donnel, J., Morgan, M., and Watt, A.H. Platinum Edition Using XHTML, XML and Java 2. Que Corp., 2001, 296–297.

Authors

Rishi Toshniwal is a graduate student in the Center for Distributed and Mobile Computing, ECECS, University of Cincinnati, OH.

Dharma P. Agrawal is director of the Center for Distributed and Mobile Computing, ECECS, University of Cincinnati, OH.

Footnotes

1. www.w3.org/People/Berners-Lee/1996/ppf.html

Figures

Figure 1. XML as a base for a number of applications.

Figure 2. Real-world example using XML.

Tables

Table 1. Common XML notations.

Table 2. Prebuilt libraries in WML.

Table 3. Important XML applications.


©2004 ACM  0002-0782/04/0500  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2004 ACM, Inc.


 
