Most information is now "born digital" and much is disseminated only in digital form. However, little of this is provided in forms that ensure its perpetual intelligibility or that include evidence that it can be trusted for sensitive applications.
Many articles about digital preservation come from the cultural heritage community, which is unfortunate because the IT community is largely uninvolved. The NDIIPP (National Digital Information Infrastructure and Preservation Program) [6] expresses urgency for preserving authentic digital works. However, since the 1995 appearance of Preserving Digital Information [2], little progress has been made toward technology for reliable preservation of substantial collections [7, 11].
Most of the preservation literature draws its examples from scholars' and artists' interests. We anticipate that the needs expressed will expand to those of businesses wanting safeguards against diverse frauds, attorneys arguing cases based on the probative value of digital documents, and our own dependencies on personal medical records.
This article deals exclusively with challenges created by technological obsolescence and the demise of information providers. Thibodeau summarized preservation know-how in 2002, observing that proven methods for preserving and providing sustained access to electronic records were limited to the simplest forms of digital objects, and that even in those areas, proven methods could not be scaled for the expected growth of electronic records. Furthermore, archival science had not responded to the challenge of electronic records sufficiently to provide a sound intellectual foundation for articulating archival policies, strategies, and standards for electronic records [10]. Here we describe a design that addresses all technical issues reported in the preservation literature.1
What might someone a century from now want of information stored today? Figure 1 suggests users' perspectives and helps illuminate preservation reliability questions. In addition to what content management offerings2 and published metadata schema3 already provide, a complete solution would:
Viable solutions will allow repositories and their clients to use deployed content management software without disruption.
Information in physical books, on other paper media, and in other analog forms cannot be copied without error and always contains accidental information that digital representations can avoid. Perfect digital copying is possible, and contributes both to the challenge of preserving digital content and to its solution. Preservation can be viewed as a special case of information interchange: special because information consumers can no longer obtain information producers' responses about missing information or puzzling aspects.
Pervasive Focus on Repositories. Much preservation literature focuses on so-called "trusted digital repositories." Recent articles [9] amplify prior calls for criteria to be used in audits that might lead to public certification that an institution has correctly executed sound preservation practices. However, executing partly human procedures faithfully over decades would be difficult and expensive. Repository-centric proposals betray problems that call the direction into question. Fundamentally, they depend on an unexpressed premise: that exposing an archive's procedures can persuade its clients that its content deliveries will be authentic. Such procedures have not yet been described, much less justified as achieving what their proponents apparently assume. In addition, audits of a digital archive, no matter how frequent, cannot prove that its contents have not been improperly altered by employees or hackers many years before a sensitive record is accessed. Another problem is that the new code needed for digital preservation is likely to be mostly workstation software, not server software, so people focusing on repositories will find it difficult to design solutions.
The topical literature is replete with epistemological weaknesses. For instance, many of its references to trust are unmodified (unconstrained). Young children trust unconditionally; anyone else who does so is commonly considered childish. The mature formulation has the pattern, "X trusts Y to accomplish some action Z, or to refrain from some action or behavior W." If the authors of trusted digital repositories articles adopted this pattern and considered the consequences of each Z and each W, they would materially advance their professed agendas.
As an objective, 'trusted' is misleading. Instead, one should focus on encapsulating information so that it is trustworthy.
In casual conversation, we often say that the copy of a recording is authentic if it closely resembles the original. But consider, for example, an orchestral performance, with sound reflected from walls entering imperfect microphones, signal changes in electronic recording circuits, and so on, until we finally hear a television rendering. Which of many different signal versions is the original?
Difficulties with 'original' and 'authentic' are conceptual. Nobody creates an artifact in an indivisible act. What people consider to be an original or a valuable derivative version is someone's subjective choice, or an objective choice guided by subjective social rules. We can, however, describe any version objectively with provenance metadata that expresses everything important about its creation history.
Conventional definitions, such as "authentic: of undisputed origin; genuine," do not help operationally. For signals, for material artifacts, and even for natural entities, the definition shown in the sidebar here captures what people mean when they say 'authentic'.
Each Tk represents a transformation that is part of a Figure 1 transmission step. To preserve authenticity, the metadata accompanying the input in each transmission step should be extended by including a Tk description. This metadata might identify the author of each Tk choice and other circumstances important to consumers' judgments of authenticity. Each eventual consumer will decide for himself whether the available evidence is sufficient for his particular purposes.
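As a concrete illustration, here is a minimal sketch of such provenance extension, assuming a simple JSON-like record per transmission step; the field names and schema are hypothetical, not part of any published TDO specification:

```python
import json
from datetime import datetime, timezone

def extend_provenance(metadata: dict, tk_description: str, author: str) -> dict:
    """Append a description of transformation Tk to the provenance metadata
    accompanying a transmission step (hypothetical record layout)."""
    record = {
        "transformation": tk_description,  # what Tk did to the input
        "author": author,                  # who chose and applied Tk
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    metadata.setdefault("provenance", []).append(record)
    return metadata

# Each step extends, never rewrites, the history; an eventual consumer can
# inspect the whole chain and judge authenticity for his own purposes.
meta = {"work_id": "example-001", "provenance": []}
extend_provenance(meta, "analog master -> 44.1 kHz PCM sampling", "studio engineer")
extend_provenance(meta, "PCM -> FLAC lossless encoding", "archive editor")
print(json.dumps(meta, indent=2))
```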
Preserving Dynamic Behavior. A prominent collaborative archivists' project suggests conceptual difficulty with preserving "dynamic objects" (representations of artistic and other performances) digitally [1]. We see no new or difficult technical problem; what differs for different object types is merely the ease of changing them.
A repeat R(t) of an earlier performance P(t) would be called authentic if it were a faithful copy except for a constant time shift from some start time t_start, that is, if R(t) = P(t - t_start). This seems simple enough and capable of describing any kind of performance. Its meaning is simpler for digital records than for analog recordings because digital records already reflect the sampling errors of recording performances that are continuous in time. The archivists expressing difficulty with dynamic digital objects do not express similar uncertainty about analog recordings of music.
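For digital records this test is mechanical. A toy check, assuming both performances were sampled at the same rate and that t_start is expressed in samples:

```python
def is_authentic_repeat(P: list[float], R: list[float], shift: int) -> bool:
    """True when R(t) = P(t - t_start) holds for sampled signals:
    `shift` is t_start in samples, so R[n] must equal P[n - shift]."""
    if shift < 0:
        return False
    return all(
        R[n] == P[n - shift]
        for n in range(shift, min(len(R), len(P) + shift))
    )

P = [0.0, 0.5, 1.0, 0.5, 0.0]   # original performance, sampled
R = [0.0, 0.0] + P              # repeat delayed by two samples
assert is_authentic_repeat(P, R, shift=2)
```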
The TDO proposal focuses on methods for making the authenticity of preserved digital objects reliably testable and for assuring that eventual users will be able to render or otherwise use their contents. The objectives suggest solution components that can be nearly independently addressed:
To prepare the TDO that represents a work (see Figure 2), an editor converts each content bit-string into a durably intelligible representation and collects the results, together with standardized metadata, to become the TDO payload. In addition to its payload, each TDO has a protection block into which a human editor loads metadata and records relationships among its parts, and between it and other objects. The final construction step, executed at a human agent's command, is to seal all these pieces within a single bit-string with a message authentication code. In a valid TDO representing some version of an object, the bit-string set that represents the version is XML-packaged with registered schema; these bit-strings and metadata are encoded to be platform-independent and durably intelligible. TDO metadata includes identifiers for the version and for the set of versions of the work, and the package includes, or links reliably to, all metadata needed for interpretation and as evidence. All these contents are packaged as a single bit-string sealed using cryptographic certificates based on public-key message authentication; each certificate is in turn authenticated by a recursive certificate chain.
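To illustrate the sealing step, here is a minimal sketch. It substitutes a symmetric HMAC for the public-key message authentication and recursive certificate chains the TDO design actually specifies, and its packaging details are invented for illustration:

```python
import hashlib
import hmac
import json

def seal_tdo(payload: bytes, protection_block: dict, key: bytes) -> bytes:
    """Bind payload and protection metadata into one bit-string and seal it.
    The real design uses public-key certificates, each authenticated by a
    recursive certificate chain; an HMAC stands in here for brevity."""
    package = json.dumps({
        "payload": payload.hex(),
        "protection": protection_block,
    }, sort_keys=True).encode()
    seal = hmac.new(key, package, hashlib.sha256).hexdigest().encode()
    return package + b"|" + seal

def verify_tdo(sealed: bytes, key: bytes) -> bool:
    """Recompute the seal; any alteration of payload or metadata fails."""
    package, _, seal = sealed.rpartition(b"|")
    expected = hmac.new(key, package, hashlib.sha256).hexdigest().encode()
    return hmac.compare_digest(expected, seal)

key = b"demo-key-not-for-production"
tdo = seal_tdo(b"<content/>", {"version": "v1", "work": "example-001"}, key)
assert verify_tdo(tdo, key)
```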
In the past, wax seals impressed with signet rings were affixed to documents as evidence of their authenticity. A contemporary digital counterpart is a message authentication code firmly bound to each important document. The structure and use of each TDO, emphasizing the metadata portions suggested by Figure 2, is described in [3]. The design includes the following features:
Content represented with relatively simple and widely known data formats can be saved more or less "as is." For other data formats, [5] teaches how to encode any kind of content bit-string suggested by Figure 2 to be durably intelligible or useful. Its features include:
A producer typically tries to encode information so that each consumer can read or otherwise use the content. In the ideal scenario depicted in Figure 1, perfection would mean the consumer understanding exactly what the producer intended to communicate. However, in addition to the consequences of the human imperfections of authors and editors, the 0→1 and 9→10 steps suffer from unavoidable language limitations. (Jargon, expectations, world views, and ontologies are at best imperfectly shared. For example, I cannot tell you what I mean; I cannot know how you interpret what I say.)
Such difficulties originate in the theoretical limits of what machines can do. How we might mitigate them will be discussed in future articles. Philosophical arguments that TDO methodology accomplishes as much as any mechanical method can toward preserving digital information, and that it attempts no more, are presented in [4]. A second work in progress examines what information producers can do to minimize eventual consumers' misinterpretations, given that communication invariably confounds intentional with accidental information.
Premature digital preservation deployment would risk that flaws might not be discovered before large expenditures are made to create archival holdings of uncertain quality. Errors might distort meanings (for texts) or behaviors (for programs). The questions reach into epistemology, the philosophical theory of what can be objectively known and reliably communicated, in contrast to what must forever remain subjective questions of belief or taste. We are therefore reluctant to implement pilot installations until we have considered the applicable philosophy thoroughly and until experts have had the opportunity to criticize the TDO design.
What's Missing from the U.S. Digital Preservation Plan? Engineers want questions that can be answered objectively. They expect plans to be clear enough so that every participant and every qualified observer can understand what work is committed and can judge whether progress is being achieved.
We expect a plan to articulate concisely each objective, the resources needed to meet it, commitments to specific actions, a schedule for each delivery, and a prescription for measuring outcomes and quality. If the plan is for a large project, we expect it to be expressed in sections that separate teams can address relatively independently. If the resources currently available are inadequate, we expect the plan to identify each shortfall. Finally, if a team has already worked on the topic, we expect its plan to list its prior achievements.
NDIIPP funding is commensurate with that for all foreign preservation work combined. Unfortunately, the technical portions of [6] contain little more than vague generalities and decade-old ideas. It identifies few technical specifics, no target dates, and few objective success measures. Engineers will find little to work with. Later publications do not repair its weaknesses. This is troubling for an initiative launched six years ago.
Competitive Evaluation. Firm assertions of TDO packaging advantages over alternatives would be premature before we have deployed a complete pilot. Ideally, we would compare our design to alternatives; however, none has yet been designed. Notwithstanding such uncertainties, we believe that, in addition to satisfying our starting objectives, TDO support infrastructure will exhibit the following desirable characteristics:
What will make implementations easy to tailor is that good tools exist for XML. What will make them scalable is that TDO structure is recursive and uses links extensively.
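For instance, recursive structure is easy to express with ordinary XML tooling. A sketch using Python's standard library, with hypothetical element and attribute names:

```python
import xml.etree.ElementTree as ET

def make_tdo(object_id: str, children=(), links=()):
    """Build a (hypothetical) TDO element; TDOs may contain TDOs, so
    arbitrarily large collections recurse rather than grow flat."""
    tdo = ET.Element("tdo", id=object_id)
    for child in children:
        tdo.append(child)                        # aggregation by containment
    for target in links:
        ET.SubElement(tdo, "link", href=target)  # aggregation by reference
    return tdo

leaf = make_tdo("page-001")
root = make_tdo("book-001", children=[leaf], links=["urn:tdo:shared-schema"])
print(ET.tostring(root, encoding="unicode"))
```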
Most preservation literature emphasizes the perspectives of archiving institutions. This article and supporting TDO reports focus on end users' needs because these take precedence over repository needs. Principles for a TDO design have been articulated here to address every technical problem and requirement identified in the literature. The central elements are an encapsulation scheme for digital preservation objects and encoding using extended Turing-complete virtual machines. Correct TDO implementations will allow preservation of any type of digital information and will be as efficient as any competing solution.
Critical examination of this work by readers is encouraged and public discussion is called for because "getting it right" is too important for anything short of complete transparency.
1. Duranti, L. The long-term preservation of the dynamic and interactive records of the arts, sciences and e-government. Document Numérique 8, 1 (2004), 1–14.
2. Garrett, J. et al. Preserving Digital Information: Report of the Task Force on Archiving of Digital Information. Commission on Preservation and Access and The Research Libraries Group, 1995.
3. Gladney, H.M. Trustworthy 100-year digital objects: Evidence after every witness is dead. ACM Trans. Info. Sys. 22, 3 (July 2004), 406–436.
4. Gladney, H.M. Trustworthy 100-year digital objects: Syntax and semantics – tension between facts and values; eprints.erpanet.org/archive/00000051/.
5. Gladney, H.M. and Lorie, R. Trustworthy 100-year digital objects: Durable encoding for when it's too late to ask. ACM Trans. Info. Sys. 23, 3 (July 2005), 299–324.
6. Library of Congress. Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure and Preservation Program, 2003; www.digitalpreservation.gov/repor/ndiipp_plan.pdf.
7. Marcum, D.B. Research questions for the digital era library. Library Trends 51, 4 (Spring 2003), 636–651.
8. Reich, V. and Rosenthal, D. LOCKSS: A permanent Web publishing and access system. D-Lib Magazine 7, 6 (June 2001).
9. Ross, S. and McHugh, A. Audit and certification of digital repositories; and Dale, R. Making certification real: Developing methodology for evaluating repository trustworthiness. Both in RLG DigiNews 9, 5 (Oct. 2005).
10. Thibodeau, K. Knowledge and action for digital preservation: Progress in the U.S. Government. In Proceedings of DLM-Forum 2002 (2002), 175–179.
11. Waters, D. Good archives make good scholars: Reflections on recent steps toward the archiving of digital information. Council on Library and Information Resources, pub. 107 (2002).
1. Designs cited here have been published in ACM Transactions on Information Systems.
2. Content management is not discussed in this article because archival needs can be satisfied by available software with at most modest and obvious extensions.
3. These include general schema proposed for standardization, such as METS sponsored by the Library of Congress, and many topic- or discipline-specific extensions. An October 2005 Web search for material with "metadata schema" in their titles yielded over 300 hits.
Figure 1. Documentary information interchange and repositories (the object numbering is taken from [