Drudgery and Deep Thought

By Gregory Crane, Robert F. Chavez, Anne Mahoney, Thomas L. Milbank, Jeffrey A. Rydberg-Cox, David A. Smith, Clifford E. Wulfman
Communications of the ACM, May 2001, Vol. 44 No. 5, Pages 34-40
10.1145/374308.374333

The Perseus Digital Library (www.perseus.tufts.edu) has had to confront the problems faced by any general cross-disciplinary digital library for the humanities [11]. Support from the Digital Library Initiative, funded by the National Science Foundation, the National Endowment for the Humanities, and other agencies to support research on the next generation of digital libraries, has allowed Perseus to capitalize on a decade of developing a digital library on Greco-Roman antiquity and explore the problems raised by other domains within the humanities.

Besides studying the design and use of the existing Greco-Roman collection, we collaborate with partners like the Museum of Fine Arts, Boston, and the Max Planck Institute for the History of Science, Berlin, to develop collections for subjects ranging from ancient Egypt to early twentieth-century world history. One of our research areas is how best to design and structure documents to enable them to interact with other objects in a digital library. But there are few conventions for organizing radically new document forms (such as 3D walkthroughs of virtual spaces) so they can be maintained, with minimal support, over time as part of larger digital libraries. Even venerable document classes (such as dictionaries) need to be rethought as we compare models based on semantic networks (such as WordNet, a lexical database for the English language developed and maintained at Princeton University) with traditional dictionary entries. Ultimately, we hope our work will help anyone designing a digital library better understand the needs of humanists. We also hope the collections to which we contribute will help anyone producing content in the humanities better understand how their work might evolve to exploit the possibilities of these new, intensively interlinked environments.

Challenges for the Humanities

All academics writing scholarly texts, whether humanists or scientists, face similar challenges [7]. The technology of publication through which text is given form and disseminated to an audience imposes fundamental constraints on all discourse, especially on subjects that depend on nonverbal representation. The changes resulting from the Internet only intensify a transformation that started long ago. While the printing press radically enhanced our ability to disseminate words, its greatest effect was arguably on our ability to publish nonverbal objects accurately and economically, including lists of numerical data, diagrams, maps, and images [5]. Even before humanists began to use the Internet widely, the ability to distribute hundreds of megabytes of data on cheap plastic disks was changing scholarly practice in fundamental ways.

The first Perseus CD-ROM (published in 1992 after the project was named for the Greek hero who explored the limits of the known world) included far more visual information about art objects than had ever been economically feasible to publish in print; we were able to include hundreds of images illustrating individual objects, allowing art historians to see more detail, and thus pose more refined questions about them. Today, an individual project can construct and publish amazingly rich data sets. Yet, while digital libraries are libraries insofar as they automatically aggregate these data sets, so the whole is more valuable than the sum of its parts, these systems work only with the data sets put into them. A fundamental challenge for scholars in all disciplines is therefore how to structure their publications so they may be integrated into larger units.

Humanists face some challenges that scientists do not. Unlike many scientific publications, whose useful life is frequently less than a decade or so, documents in the humanities often remain useful for centuries. Humanists are therefore especially vulnerable to shifting standards and methods for preservation and publication. Archaeological excavations, for example, destroy the original sites; excavators' field notebooks, pictures, drawings, and other documents provide the only access that later researchers will ever have to the full archaeological context. These documents must be preserved indefinitely. Critical editions, dictionaries, museum catalogs, and similar document classes represent the foundation for much humanistic research; they, too, must remain accessible for many years, possibly for centuries.

The second challenge facing humanists, then, is related to the first. Not only must they structure new publications for digital libraries, they must also find ways to transform old documents (often quite complex in structure) so they, too, may be integrated into the new repositories of human knowledge.

Finally, humanists have a special relationship with human language. Machine translation and cross-language information retrieval are perhaps even more important to humanists than to their colleagues in the sciences. The historical record extends only roughly 4,000 years into the past; we are closer in time to Julius Caesar than Julius Caesar was to the earliest extant cuneiform and hieroglyphic texts. Written sources for most of human history are thus stored in languages very different from those now in wide use. Only limited research in machine translation or cross-language information retrieval focuses on Akkadian, Arabic, classical Chinese, Egyptian hieroglyphics, classical Greek, Latin, Sanskrit, Sumerian, or other historically crucial languages. Modern computational linguistics offers new techniques that, when implemented for these older languages, might revolutionize teaching and research alike; for the first time, if this were possible, humans would be able to work directly with the many languages needed to study, say, the cultural continuum extending from Greece through the Middle East and beyond.

But even if automated translation matched the fidelity of human translators and provided modern-language translations faithfully reflecting the content of an original source document, translation is a limited medium. Literature, in particular, is notoriously untranslatable; serious students of the historical record know that culture and language are inextricably intertwined. We thus need systems that let us work more effectively with documents in the original source languages and digital libraries that automatically link as many complementary tools as possible.

The tools we've developed within Perseus for Greek and Latin allow us to see the benefits of digital grammars, lexica, and other similar linguistic tools designed to work together [8]. The challenge for the next generation of scholars is to develop radically new linguistic reference tools drawing on machine translation, cross-language information retrieval, and similar techniques.

Case Studies for General Problems

In addition to our core work on Greek and Roman antiquity, the Perseus team is pursuing a number of projects and collaborations to explore and address problems confronting any digital library for the humanities, including:

Archimedes. Collecting sources for the history of mechanics from antiquity through the early modern period (with the Max Planck Institute for the History of Science).
Egypt. Publishing the Museum of Fine Arts/Harvard excavations of ancient Egyptian sites near the Pyramids at Giza (with the Museum of Fine Arts).
Publication. Developing standards for new scholarly publications in digital libraries (with the Stoa Publishing Consortium, www.stoa.org, which is based in the classics department of the University of Kentucky, and works on new models for refereed scholarly publishing and collaboration online).
Shakespeare. Adapting the New Variorum Shakespeare Series, the most elaborate series of critical editions in the humanities, to the electronic environment (with the Modern Language Association, a professional society for literary scholars, based in New York).
London. Creating a digital library on the history and topography of London (with the Edwin C. Bolles Collection in the Tufts University Archives, Medford, MA).

Each of these projects complements our existing work while giving us a new vantage point on general problems. Work with Giza, London, and the Stoa involves the problems of integrating texts, geospatial data, and 3D reconstructions of vanished spaces. The history of mechanics allows us to leverage our strengths in Greek and Latin, two crucial source languages, while studying the applicability of our methods to new languages (Arabic and Italian) and new periods (medieval and Renaissance Europe).

For the past four years, we've been testing the generality of methods developed for Greco-Roman texts by developing an electronic edition of the works of Christopher Marlowe (1564-1593), an English dramatist. Our partnership with the Modern Language Association has allowed us to extend this work into the highly mature world of Shakespearean studies. Other projects are under development; for example, we have experimented with materials from the U.S. Library of Congress American Memory Project (memory.loc.gov) and plan collections on such topics as the American Civil War and Korean culture.

Space and Time

Although many literary works describe fictional worlds and times, the majority of the resources used by humanities scholars refer to real time and real geographic space. Because time and space are universally relevant, they can be used to classify and visualize the content of even the most heterogeneous collections; subject-specific concept hierarchies cannot always be generalized beyond their most obvious applications. We have therefore developed general tools (timeline, mapping, and charting) for identifying the places and dates our texts refer to. Developed for visualizing the geographic and temporal references in the Perseus Greco-Roman and London collections, these tools have proved just as useful for the Library of Congress American Memory documents on the European settlement of California and on the recent history of the Upper Midwest.

Every English-language document in the Perseus Digital Library is scanned for references to dates and proper names. Although most dates are easily recognized, place names have to be distinguished from other proper names and different places with the same name disambiguated. Our tools recognize over 95% of all toponyms in unrestricted text, but once recognized, over 67% of the toponyms in most documents are still ambiguous. For example, does "Memphis" refer to a city in Egypt or a city in Tennessee? The answer depends on the context; the mapping tool decides among possible toponym identifications by comparing the distance of each candidate from the centroid of the surrounding sites and their relative importance.
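The centroid heuristic described above can be illustrated with a minimal sketch. It is not the Perseus implementation: the coordinates are approximate, and the importance weights, the scoring rule (distance to the centroid divided by importance), and the names Candidate, centroid, and disambiguate are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    lat: float
    lon: float
    importance: float  # hypothetical weight; larger means more prominent


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def centroid(points):
    """Arithmetic mean of (lat, lon) pairs; a simplification that fails near the date line."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return sum(lats) / len(lats), sum(lons) / len(lons)


def disambiguate(candidates, context_points):
    """Pick the candidate whose importance-discounted distance to the centroid is smallest."""
    c_lat, c_lon = centroid(context_points)

    def score(c):
        return haversine_km(c.lat, c.lon, c_lat, c_lon) / c.importance

    return min(candidates, key=score)


MEMPHIS = [
    Candidate("Memphis, Egypt", 29.85, 31.25, importance=1.0),
    Candidate("Memphis, Tennessee", 35.15, -90.05, importance=1.0),
]

# Places already identified in the surrounding text (approximate coordinates)
egyptian_context = [(30.04, 31.24), (25.69, 32.64), (31.20, 29.92)]     # Cairo, Luxor, Alexandria
american_context = [(36.16, -86.78), (38.63, -90.20), (29.95, -90.07)]  # Nashville, St. Louis, New Orleans

print(disambiguate(MEMPHIS, egyptian_context).name)
print(disambiguate(MEMPHIS, american_context).name)
```

With equal importance weights, as here, the choice reduces to the candidate nearest the centroid; unequal weights would let a prominent but slightly more distant site win.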

Once dates and places are identified, they can be plotted. A graph like the one in Figure 1 shows the temporal scope of a single text or group of texts; Figure 2 shows the places in the American Memory California collection; and the maps in Figure 3 show the same street in London at a series of points in history.

Managing the Texts

Digital libraries rely on a powerful, structured document format. Perseus documents are encoded in XML using several different document type definitions (DTDs). Even documents that share a DTD may represent the same structure very differently; for example, one may mark a chapter as <div1 type="chapter"> and another as <chapter>. Our configurable XML management system maps diverse tagging schemes onto abstract structures (such as chapter, page, stanza, or speech), providing an easily extensible citation scheme for documents from many different sources [10].
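The idea of mapping diverse tagging schemes onto abstract structures can be sketched in a few lines. This is an illustration under stated assumptions, not the system described in [10]: the TAG_MAP table, the function names, and the two sample documents are hypothetical, and only Python's standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: (tag, type attribute or None) -> abstract structure
TAG_MAP = {
    ("div1", "chapter"): "chapter",
    ("chapter", None): "chapter",
    ("sp", None): "speech",
    ("lg", "stanza"): "stanza",
}


def abstract_structure(element):
    """Return the abstract structure name for an element, or None if unmapped."""
    key = (element.tag, element.get("type"))
    if key in TAG_MAP:
        return TAG_MAP[key]
    # Fall back to a rule that ignores the type attribute
    return TAG_MAP.get((element.tag, None))


def find_units(xml_text, structure):
    """Collect every element that maps onto the requested abstract structure."""
    root = ET.fromstring(xml_text)
    return [el for el in root.iter() if abstract_structure(el) == structure]


doc_a = (
    '<text><div1 type="chapter" n="1"><p>First.</p></div1>'
    '<div1 type="chapter" n="2"><p>Second.</p></div1></text>'
)
doc_b = '<book><chapter n="1"><p>First.</p></chapter></book>'

print([el.get("n") for el in find_units(doc_a, "chapter")])  # ['1', '2']
print([el.get("n") for el in find_units(doc_b, "chapter")])  # ['1']
```

Because a citation such as "chapter 2" is resolved against the abstract structure rather than a literal tag name, supporting a new tagging scheme means adding rows to the table, not changing the lookup code.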

When a reader requests a page, the Perseus text system finds it within the document, extracts it as an XML fragment, and passes it on to modules that extract citations, disambiguate proper names, and map inflected foreign-language terms onto dictionary entries. At the same time, Perseus generates the timelines and maps discussed earlier. Finally, the display module configures the information the reader actually sees, adding links, from proper names to maps or encyclopedias, choosing transliteration schemes, and adding cross-references from other works (see Figure 4). The overall system gracefully manages many different types of XML document and allows collection editors to configure the appearance of documents for different audiences.
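The request flow just described (locate the page, extract a fragment, run analysis modules, then configure the display) might be sketched as a chain of three functions. Everything here is a hypothetical stand-in: the page element, the KNOWN_PLACES set, the citation pattern, and the bracketed link markers are assumptions chosen to keep the example short, not features of the Perseus text system.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for lookup data a real system would keep in a database
KNOWN_PLACES = {"Athens", "Sparta"}
CITATION = re.compile(r"\b[A-Z][a-z]+\.\s\d+\.\d+\b")  # matches e.g. "Thuc. 1.23"


def extract_page(document_xml, page_id):
    """Find the requested page and return it as an XML fragment (an Element)."""
    root = ET.fromstring(document_xml)
    for page in root.iter("page"):
        if page.get("n") == page_id:
            return page
    raise KeyError(page_id)


def annotate(fragment):
    """Run the analysis modules over the fragment's text and gather their findings."""
    text = "".join(fragment.itertext())
    capitalized = set(re.findall(r"[A-Z][a-z]+", text))
    return {
        "text": text,
        "citations": CITATION.findall(text),
        "places": sorted(capitalized & KNOWN_PLACES),
    }


def display(annotations, link_places=True):
    """Apply a reader preference: optionally wrap place names in link markers."""
    text = annotations["text"]
    if link_places:
        for place in annotations["places"]:
            text = text.replace(place, f"[{place}](map)")
    return text


doc = '<doc><page n="1">Athens fought Sparta (Thuc. 1.23).</page><page n="2">Later.</page></doc>'
page = annotate(extract_page(doc, "1"))
print(page["citations"])  # ['Thuc. 1.23']
print(display(page))      # [Athens](map) fought [Sparta](map) (Thuc. 1.23).
```

Keeping analysis separate from display is what would let the same annotations be rendered differently for different audiences.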

Language Tools (Not Just English)

The most basic tools for learning a new language are a dictionary and a grammar. Perseus includes scholarly dictionaries of Latin and Greek, grammatical reference works, and a morphological analysis tool that determines the meaning of any form in these highly inflected languages. Every Greek or Latin word is hyperlinked to the Perseus "word study tool," providing a short dictionary definition, an explanation of the grammatical form, and frequency statistics.

We have enhanced the online dictionaries and lexica using standard techniques from corpus linguistics. The texts in Greek or Latin are scanned for word pairs that co-occur regularly [9]. When a reader looks up a word in the lexicon, a table showing the five most common collocates of that word appears at the head of each dictionary entry. This table includes links to the dictionary entries for each collocate as well as links to more extensive lists showing the mutual information score for every Greek or Latin word pair appearing five or more times in the digital library (see Figure 5).

Perseus has also calculated mutual information scores for several subcorpora of texts representing different styles and genres, including rhetoric, prose, tragedy, and poetry. If a user looks up a word while reading a text in one of these subcorpora, the table appearing in the dictionary entry shows the five most common collocates for each applicable subcorpus, in addition to the collocates for the complete collection of Perseus Greek or Latin texts. Integration of these scores into the electronic lexicon allows readers to explore texts in a way not possible outside the digital library; they quickly obtain a broad sense of the "company the words keep," along with a rough guide to possible idioms and common phrases.
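A mutual information score for word pairs can be computed along the following lines. This sketch uses pointwise mutual information, log2 of P(x, y) / (P(x) P(y)), with probabilities estimated as the fraction of sentences containing a word or a pair. The sentence-level co-occurrence window, the ranking of collocates by score, and the toy corpus are assumptions; the article does not specify how the Perseus scores are estimated.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=5):
    """Pointwise mutual information for unordered word pairs sharing a sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence))  # count each word once per sentence
        word_counts.update(words)
        for i, x in enumerate(words):
            for y in words[i + 1:]:
                pair_counts[(x, y)] += 1
    n = len(sentences)
    scores = {}
    for (x, y), c in pair_counts.items():
        if c < min_count:
            continue  # skip rare pairs, whose estimates are unreliable
        p_xy = c / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def top_collocates(scores, word, k=5):
    """The k highest-scoring partners of a word, best first."""
    partners = []
    for (x, y), score in scores.items():
        if word == x:
            partners.append((y, score))
        elif word == y:
            partners.append((x, score))
    return sorted(partners, key=lambda item: item[1], reverse=True)[:k]


# Toy corpus; the threshold is lowered to 2 only because the sample is tiny
corpus = [
    ["res", "publica"],
    ["res", "publica"],
    ["res", "gestae"],
    ["senatus", "populusque"],
]
scores = pmi_scores(corpus, min_count=2)
print(top_collocates(scores, "res"))
```

Running the same function over a subcorpus (for example, only the tragedies) would yield the genre-specific tables described above.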

General Principles

Perseus developers have been able to contribute to our collaborators' projects precisely to the extent that our experiences with Greco-Roman Perseus can be applied to other domains. Our experience with the densely interlinked network of classical texts, encyclopedias, and commentaries helped us recognize what could be done with editions and commentaries on Shakespeare. Our work on the Greek and Latin languages allowed our colleagues in the history of science, working with original Arabic, Greek, Italian, and Latin texts, to see how a digital library of source texts could augment their research. Finally, the Bolles London collection capitalized on our experience with automatic link generation and with the integration of disparate resources [2]. The nineteenth-century materials in the London collection are, however, filled with specific dates and locations in space, while the Greco-Roman materials have much coarser geographic data and almost no dates.

Modern materials thus raise problems of too much data, while the ancient materials force us to confront the problem of data sparsity. Different domains, in other words, pose complementary problems.

Different approaches to digital library design also pose complementary problems. Many established digital libraries are depth-first probes aggressively developing a single dimension of source materials. For example, the Women Writers Project at Brown University (www.wwp.brown.edu) collects texts, and the Art Museum Image Consortium (www.amico.org) collects images. Others aim for breadth; for example, the Thesaurus Linguae Graecae at the University of California, Irvine (www.tlg.uci.edu) has been digitizing all ancient and Byzantine Greek source texts since 1972.

From its earliest plans in 1985, Perseus was designed to balance breadth-first and depth-first approaches, constructing a broad spectrum of data types, including texts, images, and reference materials, each populated to a depth sufficient to make the data useful to serious scholars. The original Perseus team decided to provide a critical mass of coherent materials reflecting as many different categories of data as possible. This balanced approach has allowed us to model the behavior of a mature system and explore the problems of integrating heterogeneous data.

Meanwhile, several general principles have emerged from our experience with Perseus:

Identify the audience. Digital library designers must identify the audience, whether researchers, non-specialists (undergraduates or the general public), or both. Perseus has tried to reach them all. On the one hand, Perseus developers have sought to create some tools and resources that would enhance professional research; while we never set out to produce an exhaustive set of texts, the texts in Perseus can be searched and studied in unique ways that allow researchers to pose new questions. On the other hand, we have tried to create resources that would expand the audience for classical studies, attracting students and specialists from adjacent fields. Professional historians of science not trained as classicists are attracted to tools that help them make better use of texts in Greek and Latin.

Our emphasis on outreach has serious implications for every aspect of the overall design of the Perseus Digital Library; for example, we could not assume that our users all had access to the kind of expensive resources available primarily at well-off research institutions. Nevertheless, our experience suggests that simultaneously keeping specialists and general users in mind has made Perseus more useful to both groups than would have been the case if we had focused on just one or the other.

Provide well-structured documents. Well-structured documents work together to produce a richer, more informative text. Encyclopedias, gazetteers, biographical dictionaries, and even the indices to maps and books are useful sources. These documents often have complex but fairly predictable structures, making it possible for moderate amounts of programming and post-processing to extract core information. Given access to quality knowledge resources, it is possible to generate automatic links between various parts of a digital library and to develop new ways of visualizing its aggregate contents.

The simple links leading from English texts in Perseus to other resources are surprisingly popular. Automatically extracted dates and toponyms, as described earlier, generate maps and timelines to visualize the content of documents and collections [4, 6]. Likewise, well-known techniques for document analysis have allowed us to begin linking documents with one another [1, 12]. Extracting similar information from non-textual objects, such as video, sound files, and still images, is a major, promising area for development.

Develop language tools. Language (especially languages not widely spoken or studied today) is the most difficult and probably most important part of any comprehensive cultural digital library. Processing a lexicon and developing a morphological analyzer for an inflected language are expensive, messy, labor-intensive tasks, but the results add value to everything else that subsequently finds its way into a digital library. When we began work on Roman Perseus in 1992, we therefore chose to invest our entire data-entry budget in a lexicon. In addition, although heavily inflected languages like ancient Greek and Latin do not readily lend themselves to some of the linguistic strategies that work well for English, we have been able to adapt work originally done on English to create tools like a search engine that can recognize Latin fecerunt, "they did," as a form of facio, "do," or a spell checker for Greek.
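The lemma-aware search mentioned in the last principle can be illustrated with a small sketch. The FORM_TO_LEMMA table stands in for the output of a morphological analyzer and is purely hypothetical, as are the function names and the three sample sentences; a real analyzer would also have to handle forms that belong to more than one headword.

```python
from collections import defaultdict

# Hypothetical fragment of analyzer output: inflected form -> dictionary headword
FORM_TO_LEMMA = {
    "facio": "facio",
    "fecit": "facio",
    "fecerunt": "facio",
    "factum": "facio",
    "dixit": "dico",
    "dixerunt": "dico",
}


def build_index(documents):
    """Map each lemma to the set of document ids containing any of its forms."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            lemma = FORM_TO_LEMMA.get(token.strip(".,;:!?"))
            if lemma is not None:
                index[lemma].add(doc_id)
    return index


def search(index, query):
    """Resolve the query form to its lemma, then return matching document ids."""
    lemma = FORM_TO_LEMMA.get(query.lower())
    if lemma is None:
        return []
    return sorted(index.get(lemma, set()))


docs = {
    "a": "Haec fecerunt milites.",
    "b": "Caesar pontem fecit.",
    "c": "Ita dixit.",
}
index = build_index(docs)
print(search(index, "facio"))     # ['a', 'b']
print(search(index, "fecerunt"))  # ['a', 'b']
```

Indexing by headword rather than by surface form is what lets a query for any one form retrieve passages containing all the others.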

Conclusion

Creating and managing any digital library is laborious and difficult. A mature one draws on many technologies, and the ideal specialist in digital libraries would have expertise in a daunting range of subjects. Moreover, many of the fundamental decisions about the design and development of core digital resources require competence in both technology and content, as well as an instinct for how the discipline might evolve over time, adapting its tactical goals (if not its strategic purpose) in light of possibilities provided by emerging technology.

Ideally, designers of scholarly digital libraries should understand the unspoken and often unarticulated goals of their discipline, an understanding that focus groups and external observation can only partially provide. These ideas may seem obvious to scientists and to librarians, but humanists are only beginning to recognize the need to study system and document design. Few graduate programs prepare humanists to bridge the gap between emerging technology and the needs of their disciplines [3]. Digital library development in the humanities includes both drudgery and deep thought that differ from the drudgery and deep thought for which humanists have traditionally been trained.

Nevertheless, the aggregation of many well-structured documents in intelligent digital library systems raises the prospect of a golden age of the humanities in which both researchers and the general public are able to explore our common cultural heritage in extraordinary new ways.

References

1. Boguraev, B. and Pustejovsky, J., Eds. Corpus Processing for Lexical Acquisition. MIT Press, Cambridge, MA, 1996.

2. Crane, G. Designing documents to enhance the performance of digital libraries: Time, space, people, and a digital library of London. D-Lib Mag. 6, 7/8 (July/Aug. 2000); see www.dlib.org/dlib/july00/crane/07crane.html.

3. Crane, G. and Rydberg-Cox, J. A. New technology and new roles: The need for "corpus editors." In Proceedings of the Fifth ACM Conference on Digital Libraries (San Antonio, TX, June 4-6). ACM Press, New York, 2000, 252-253.

4. Derthick, M. and Roth, S. Data exploration across temporal contexts. In Proceedings of the 2000 International Conference on Intelligent User Interfaces (New Orleans, Jan. 9-12). ACM Press, New York, 2000, 60-67.

5. Eisenstein, E. The Printing Revolution in Early Modern Europe. Cambridge University Press, Cambridge, UK, 1983.

6. Kumar, V. and Furuta, R. Visualization of relationships. In Proceedings of the 10th ACM Conference on Hypertext and Hypermedia: Returning to Our Diverse Roots (Darmstadt, Germany, Feb. 21-25). ACM Press, New York, 1999, 137-138.

7. Marchionini, G. Evaluating digital libraries: A longitudinal and multifaceted view. Library Trends, in press.

8. Mueller, M. Reading Homer electronically with the TLG, Perseus, and the Chicago Homer. Ariadne 25 (2000); see www.ariadne.ac.uk/issue25/mueller.

9. Rydberg-Cox, J. Word co-occurrence and lexical acquisition in ancient Greek texts. Lit. Ling. Comput. 15, 2 (2000), 121-129.

10. Rydberg-Cox, J., Chavez, R., Mahoney, A., Smith, D., and Crane, G. Knowledge management in the Perseus Digital Library. Ariadne 25 (2000); see www.ariadne.ac.uk/issue25/rydberg-cox.

11. Smith, D., Rydberg-Cox, J., and Crane, G. The Perseus project: A digital library for the humanities. Lit. Ling. Comput. 15, 1 (2000), 15-25.

12. Wilks, Y. and Catizone, R. Can we make information extraction more adaptive? In Information Extraction: Towards Scalable, Adaptable Systems, Lecture Notes in Artificial Intelligence 1714, Pazienza, M., Ed. Springer, Berlin, 1999, 1-16.

Authors

Gregory Crane is a professor of classics and Winnick Family Professor of Technology and Entrepreneurship at Tufts University and director of the Perseus Digital Library, Medford, MA.

Robert F. Chavez is editor for cartography of the Perseus Digital Library, Medford, MA.

Anne Mahoney is a programmer for the Perseus Digital Library, Medford, MA, and co-editor of the Stoa Consortium.

Thomas L. Milbank is art and archaeology editor of the Perseus Digital Library, Medford, MA.

Jeffrey A. Rydberg-Cox is an assistant professor of English at the University of Missouri, Kansas City.

David A. Smith is the lead programmer of the Perseus Digital Library, Medford, MA.

Clifford E. Wulfman is editor for English literature of the Perseus Digital Library, Medford, MA.

Figures

Figure 1. Temporal references in the Perseus Greco-Roman corpus, with concentrations in classical Greece (fifth and fourth centuries B.C.) and the late Roman Republic (first century B.C.). Each red dot represents a reference to a date, and each yellow bar a reference to a date range. The dots and bars are links to the passages in the texts.

Figure 2. Geographic references in American Memory books on California (sites in the U.S.). Note the scattering of sites across the country from migration narratives.

Figure 3. King William Street, London. Historical maps, aligned with a modern geospatial database, permit users to see how the city plan has changed over the past 250 years.

Figure 4. XML documents in Perseus use any of several document type definitions, since the system is concerned primarily with their underlying structures. We extract proper names, citations of other texts, and dates from these documents into a database. When a reader requests a page of text, Perseus uses that database, along with the reader's preferences and the style sheets defined by the collection editor, to construct a rich, hyperlinked display.

Figure 5. The dictionary entry is augmented with frequency information for the headword, words with similar definitions, and words commonly used with the given word.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
