
Communications of the ACM

Greenstone: Open-Source DL Software


Greenstone is a comprehensive system for constructing and presenting collections of thousands or millions of documents, including text, images, audio, and video. Greenstone libraries contain many collections, individually organized, though they bear a strong family resemblance. Easily maintained, collections can be augmented and rebuilt automatically.

Greenstone constructs full-text indexes from the document text and from metadata elements such as title and author. Indexes can be searched for particular words, Boolean combinations, or phrases, and results are ranked by relevance or sorted by a metadata element.
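The kinds of query described above can be illustrated with a small positional inverted index. This is a minimal sketch, not Greenstone's own indexing code: the function names (`build_index`, `search_all`, `search_phrase`, `rank`) are hypothetical, and ranking by raw word count is a stand-in for whatever relevance measure a real system uses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search_all(index, query):
    """Boolean AND: ids of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result


def search_phrase(index, phrase):
    """Ids of documents in which the words occur consecutively."""
    words = tokenize(phrase)
    hits = set()
    for doc_id in search_all(index, phrase):
        starts = index[words[0]][doc_id]
        if any(all(start + i in index[w][doc_id] for i, w in enumerate(words))
               for start in starts):
            hits.add(doc_id)
    return hits


def rank(index, query, doc_ids):
    """Order documents by total query-word count (a crude relevance proxy)."""
    words = tokenize(query)

    def score(doc_id):
        return sum(len(index.get(w, {}).get(doc_id, [])) for w in words)

    return sorted(doc_ids, key=lambda d: (-score(d), d))


# Illustrative documents only; they are not taken from any real collection.
docs = {
    "a": "Digital library software",
    "b": "The library of digital documents and digital images",
    "c": "Open source software",
}
index = build_index(docs)
```

Here `search_all(index, "digital library")` matches documents `a` and `b`, while `search_phrase` keeps only `a`, where the two words are adjacent. The same structure could index a metadata field such as title or author instead of the document text.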

Browsing involves hierarchical lists the user can examine interactively. Metadata is the raw material for browsing, and must be provided explicitly or be derivable automatically from the source documents. Different collections offer different searching and browsing facilities. Indexes for both are constructed during a building process, according to information in a collection configuration file.
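To make the role of metadata in browsing concrete, the sketch below builds a hierarchical list from a slash-separated subject field. The field names (`subject`, `title`), the separator, and both function names are assumptions made for illustration; they do not reflect Greenstone's metadata conventions or its configuration file format.

```python
def build_hierarchy(docs, field="subject", sep="/"):
    """Nest document titles under the path given by a metadata field."""
    root = {"children": {}, "docs": []}
    for doc in docs:
        node = root
        for part in doc[field].split(sep):
            node = node["children"].setdefault(
                part, {"children": {}, "docs": []})
        node["docs"].append(doc["title"])
    return root


def render(node, depth=0):
    """Return the hierarchy as indented lines, sorted at every level."""
    lines = []
    for name in sorted(node["children"]):
        child = node["children"][name]
        lines.append("  " * depth + name)
        lines.extend("  " * (depth + 1) + "- " + title
                     for title in sorted(child["docs"]))
        lines.extend(render(child, depth + 1))
    return lines


# Illustrative metadata records only.
records = [
    {"title": "Soil care", "subject": "Agriculture/Soil"},
    {"title": "Clean water", "subject": "Health/Water"},
    {"title": "Crop rotation", "subject": "Agriculture"},
]
tree = build_hierarchy(records)
```

Calling `"\n".join(render(tree))` yields an indented outline that a user interface could present level by level. Because the tree is derived entirely from metadata, it can be regenerated whenever documents are added.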

Greenstone creates all searching and browsing structures automatically from the documents themselves: nothing is done manually. If new documents in the same format become available, they can be merged into the collection automatically. Indeed, for many collections this is done by processes that awake regularly, scout for new material, and rebuild the indexes—all without manual intervention.
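A process that wakes regularly and rebuilds only when new material has arrived might look roughly like the following. This is a sketch under stated assumptions: detecting change by file modification time is one simple approach, not necessarily the one Greenstone uses, and `scan`, `changed_since`, and `refresh` are hypothetical names.

```python
import os


def scan(root):
    """Return {path: modification time} for every file under root."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            found[path] = os.path.getmtime(path)
    return found


def changed_since(previous, current):
    """Paths that are new, or whose modification time has changed."""
    return sorted(path for path, mtime in current.items()
                  if previous.get(path) != mtime)


def refresh(root, state, rebuild):
    """Call rebuild(changed_paths) if anything changed; return new state.

    `state` is the dictionary returned by the previous call
    (an empty dictionary on the first run).
    """
    current = scan(root)
    changed = changed_since(state, current)
    if changed:
        rebuild(changed)
    return current
```

A scheduler such as cron could call `refresh` periodically, saving the returned state between runs. The sketch ignores deleted files, which a complete version would also need to treat as a reason to rebuild.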

Source documents come in a variety of formats, and are converted by plugins into a standard form for indexing. Plugins distributed with Greenstone process HTML, Word, and PDF documents, as well as Usenet and email messages; new ones can be written for other document types. An analogous scheme of classifiers builds browsing structures from metadata. Classifiers create browsing indexes of various kinds: scrollable lists, alphabetic selectors, dates, and arbitrary hierarchies.
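The plugin and classifier idea can be sketched as a table of converters keyed by file type, each producing the same standard form, followed by a classifier that groups documents by metadata. Everything here is illustrative: the `PLUGINS` table, the `.html` and `.eml` extensions, the dictionary layout, and the function names are assumptions, not Greenstone's actual plugin interface.

```python
import os
from email import message_from_string
from html.parser import HTMLParser


class _HtmlText(HTMLParser):
    """Collects the <title> text and all other text separately."""

    def __init__(self):
        super().__init__()
        self.title, self.body = [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        (self.title if self._in_title else self.body).append(data)


def html_plugin(raw):
    """Convert an HTML page to the standard form."""
    parser = _HtmlText()
    parser.feed(raw)
    parser.close()
    return {"text": " ".join(" ".join(parser.body).split()),
            "metadata": {"title": "".join(parser.title).strip()}}


def email_plugin(raw):
    """Convert a single-part email message to the standard form."""
    msg = message_from_string(raw)
    return {"text": msg.get_payload().strip(),
            "metadata": {"title": msg.get("Subject", ""),
                         "author": msg.get("From", "")}}


# Hypothetical registry: file extension -> converter.
PLUGINS = {".html": html_plugin, ".eml": email_plugin}


def convert(filename, raw):
    """Pick a plugin by file extension and return the standard form."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in PLUGINS:
        raise ValueError("no plugin for " + filename)
    return PLUGINS[ext](raw)


def az_classifier(docs, field="title"):
    """Alphabetic selector: group metadata values by first letter."""
    buckets = {}
    values = sorted((d["metadata"].get(field, "") for d in docs),
                    key=str.lower)
    for value in values:
        key = value[:1].upper() if value[:1].isalpha() else "#"
        buckets.setdefault(key, []).append(value)
    return buckets
```

Supporting a new document type then means adding one converter to the table, and a new browsing structure means adding one classifier; neither change touches the indexing code.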

Collections can contain text, pictures, audio, and video. Nontextual material is currently either linked into the textual documents or accompanied by textual descriptions (such as figure captions) to allow full-text searching and browsing. The architecture, however, permits implementation of plugins and classifiers for nontextual data.

Unicode is used throughout, allowing any language to be processed and displayed in a consistent manner. Collections have been built containing Arabic, Chinese, English, French, Māori, and Spanish. Multilingual collections embody automatic language recognition, and the interface is available in all these languages, among others.

Collections are accessed over the Internet or published, in precisely the same form, on a self-installing Windows CD-ROM. Compression is used to compact the text and indexes. A CORBA protocol supports distributed collections and graphical query interfaces.

The New Zealand Digital Library (nzdl.org) provides many example collections, including historical documents, humanitarian and development information, technical reports and bibliographies, literary works, and magazines. Other examples appear in Apperley et al. and Witten, Loots et al. in this special section.

Being open source, Greenstone is readily extensible, and benefits from the inclusion of GNU-licensed modules for full-text retrieval, database management, text extraction from proprietary document formats, and Z39.50 protocol support. Only through international cooperative efforts will digital library software become sufficiently comprehensive to meet the world's needs with the richness and flexibility that users deserve.

Authors

Ian H. Witten ([email protected]), David Bainbridge and Stefan Boddie are members of the digital library research project in the computer science department at Waikato University, NZ.


©2001 ACM  0002-0782/01/0500  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2001 ACM, Inc.


 
