Increasingly, the source materials on which historians of computing work are either born digital or have been digitized. How these digital treasures are preserved, and how continuing access to them may be secured over the long term, are therefore matters of concern to us. One of the key determinants of whether this material will remain accessible is the file format in which the source materials are stored. Choosing the "wrong" format has significant implications for the extent to which files can be supported by the systems, automated tools, and workflows associated with the digital content life cycle in libraries, archives, and other digital repositories. For this reason, digital preservationists have always taken a keen interest in file formats, and are understandably motivated to maintain up-to-date information on the characteristics of different formats and the effect these have on their preservability over time.
A considerable amount of digital material is now subject to mandatory deposit with national libraries, meaning even data creators who have no interest in preservation or long-term access need to develop some understanding of which formats repositories prefer, or deem acceptable, and why.
The principal annual gathering of digital preservationists takes place at the iPRES conference, the most recent of which was held in Melbourne, Australia last year; the next conference will be held this November in Chapel Hill, N.C.a The iPRES conference always provides an opportunity to gauge the zeitgeist of the field, and last year it was difficult not to be struck by the amount of attention being paid to the difficulties posed by file format issues, and to the question of whether a viable format registry could be created that both responds to the needs of the preservation community and involves practitioners meaningfully going forward. Format was prominent both in the papers presented and in the break-time conversations, some of which were quite animated.

The problems caused by format obsolescence and interoperability affect all of us, whether we are scientists wanting to return to data generated by a specialized scientific instrument that no longer sits in the lab, or animators working on Hollywood blockbusters whose work must be redone from scratch many times in a single movie because the studio wants to make use of features in the latest piece of bleeding-edge software. At a more mundane level, files created in version 1.0 of our favorite word processor are seldom fully compatible when we upgrade to later versions. Format obsolescence has the capacity to cut us off from the fruits of research, whether publicly funded or developed by companies. Files, even when preserved perfectly, are useless if we do not have software that can make sense of them. The pace at which technology drives forward, and the lure of the new, mean it is ever more difficult to keep legacy files fully accessible.
One of the complaints that came up was that "format," despite being a term in common use, is in fact not well understood or agreed upon even within the "format" community, and that a range of specific problems arise as the result of particular "formats."
These and similar considerations give rise to a certain degree of consternation, and a general lack of confidence concerning the likelihood of successfully "imposing" much by way of a single "understanding" of format in general, or even of a given individual format.
A common approach to identifying files whose format is unknown is to examine the individual bytes (the bitstream), looking for characteristic patterns or "signatures" associated with known formats. Tools such as DROID (Digital Record and Object Identification) have been developed to make this process easier, and form an important part of the digital forensics toolkit. Perhaps, therefore, we might avoid confusion over "formats" by concentrating our attention on the identifying characteristics of bitstreams. Indeed, there appeared to be no dissent from the view that file formats can be thought of as (usually nested) interpretations (or encoding schemes) of bitstreams. However tempting such linguistic legerdemain may be, there are at least two grounds for thinking this is not the right approach.
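To make the signature approach concrete, here is a minimal sketch of magic-number matching. It is not DROID itself, which draws on the much larger and regularly updated PRONOM signature file; the patterns below are simply well-known published signatures, and the file name in the final line is hypothetical.

```python
# A minimal sketch of signature-based format identification.
# The magic numbers below are well-known published signatures;
# real tools such as DROID consult a far richer signature file.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"PK\x03\x04": "ZIP container (also DOCX, EPUB, JAR, ...)",
    b"GIF87a": "GIF image (1987)",
    b"GIF89a": "GIF image (1989)",
    b"\xff\xd8\xff": "JPEG image",
}

def identify(path: str) -> str:
    """Return a best-guess format name from the file's leading bytes."""
    with open(path, "rb") as f:
        header = f.read(16)  # long enough for every signature above
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown (no matching signature)"

print(identify("example.png"))  # hypothetical file name
```

Note that the ZIP entry already illustrates the "usually nested" point: a DOCX file is, at one level of interpretation, a ZIP bitstream.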
Abstraction is the process of establishing the level of complexity at which people interact with systems. For example, when we are writing code involving numerical operations we normally ignore the way numbers are represented in the underlying computer hardware (for example, 16-bit or 32-bit), concentrating entirely on "number." On other occasions, these implementation (or lower-level) details are critical and we pay close attention to each "bit."
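A small illustration, assuming Python's standard struct module: the same abstract number has different concrete representations at 16 and 32 bits, and a value that is unremarkable at one width simply does not fit at the other.

```python
import struct

n = 1000
print(struct.pack("<h", n).hex())  # 16-bit little-endian: 'e803'
print(struct.pack("<i", n).hex())  # 32-bit little-endian: 'e8030000'

# At the abstract level, 70000 is just a number; at the 16-bit
# level it cannot be represented at all.
struct.pack("<i", 70000)           # fits in 32 bits
try:
    struct.pack("<h", 70000)
except struct.error as e:
    print("does not fit in 16 bits:", e)
```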
Moving down a level of abstraction (for example, from "format" to "bitstream") is certainly a recognized way of getting around talking about difficult, disputed, or otherwise problematic concepts. Similarly, moving up a level of abstraction is often very helpful in introducing illuminating concepts or organizational principles that are not merely inapparent at a lower level but simply do not apply there. The restricted applicability of concepts (they are tied to the level of abstraction, and in some sense define it) is often easier for us to notice in one direction of travel than the other.
For example, when speaking of "color depth" we are talking about how finely levels of color can be expressed in a given "format" (encoding scheme). It is not possible to talk meaningfully about color depth at the level of individual bits, or at the bitstream level. Color depth is a "higher"-level concept than can be expressed at the level of bits; it requires an abstraction to a level at which encoding schemes exist. This does not preclude us talking at the higher level about bits, of course, and indeed that is how color depth is normally discussed: how many bits a given encoding scheme devotes to representing the color of a particular pixel. "Color," of course, is a higher-level concept again. There are no red bits or green bitstreams; talk of "color" does not belong there. It is important to have a clear sense of which concepts apply at each level of description if we are to avoid making "category mistakes."b It is difficult to see how, if we were to restrict ourselves to discussing bitstreams alone, we would continue to have at our disposal the set of concepts, such as "color depth," that we routinely require.
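The arithmetic behind color depth is simple: a depth of d bits per pixel yields 2^d expressible colors. The sketch below also shows where such a format-level fact lives in one concrete encoding, the bits-per-pixel field of a BMP file, assuming the common BITMAPINFOHEADER layout; at the level of raw bits that field is just two more bytes.

```python
import struct

for depth in (1, 8, 16, 24):
    print(f"{depth}-bit depth -> {2**depth:,} expressible colors")

def bmp_color_depth(path: str) -> int:
    """Read bits-per-pixel from a BMP file, assuming the common
    BITMAPINFOHEADER layout (biBitCount at byte offset 28)."""
    with open(path, "rb") as f:
        header = f.read(30)
    if header[:2] != b"BM":
        raise ValueError("not a BMP file")
    return struct.unpack_from("<H", header, 28)[0]
```

The point stands: "color depth" is only meaningful once we have agreed on the encoding scheme that makes offset 28 the place to look.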
A second set of difficulties flows from the disturbingly large problem space that results from understanding formats to be identifiable ways of encoding computer files. A file of bit length L, where each position may take one of V values, may potentially be encoded in V^L ways. So a binary file 8 bits long is capable of being represented in 2^8 = 256 distinguishable ways and could therefore give rise to 256 different formats. A 64-bit file can support 2^64 = 18,446,744,073,709,551,616 different representations. Even if we exclude the notion of files that are all format and no "payload," the problem space is not significantly reduced. The smallest "payload" the scheme just described can support is a single "payload" bit, the remainder of the file being taken up with encoding the format. Thus a binary file 8 bits long (with a 1-bit "payload") can be represented in 2^7 = 128 distinguishable ways and could therefore give rise to 128 different formats, with exactly two different "payloads" representable for each possible format. Our 64-bit file (with a 1-bit "payload") can similarly support 2^63 = 9,223,372,036,854,775,808 different representations/formats, each of which may carry one of two payloads. Suffice it to say the numbers involved are so large that if we insist on trying to wrestle with the whole theoretical problem space, expressed purely in terms of bitstreams, we are not likely to make much progress.
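The arithmetic is trivial to check; what matters is how quickly it explodes. Python integers are arbitrary precision, so the exact values are easy to print:

```python
def encodings(v: int, length: int) -> int:
    """Distinguishable patterns for a file of `length` positions,
    each taking one of `v` values: V ** L."""
    return v ** length

print(encodings(2, 8))       # 256
print(encodings(2, 64))      # 18446744073709551616
# With a 1-bit "payload," the remaining bits encode the format:
print(encodings(2, 8 - 1))   # 128 possible formats, 2 payloads each
print(encodings(2, 64 - 1))  # 9223372036854775808
```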
Some other limitations of bitstreams are also apparent. Files having exactly the same bit sequences (syntax) need not have the same format (semantics). Formats are encodings, and different vendors can impose different "meanings" on the same patterns; syntax is not the same as semantics. It is simply not possible, in the abstract, to distinguish between a given 6-bit pattern followed by 2 bits of "payload" and a 7-bit pattern (coincidentally sharing the same initial 6 bits) followed by a 1-bit "payload." How we interpret (encode) these 8 bits is a matter of choice, which is apt to change over time, place, and circumstance.
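A hypothetical eight-bit example makes the ambiguity concrete. The same byte parses perfectly well under two invented schemes, one with a 6-bit tag and 2-bit payload, the other with a 7-bit tag and 1-bit payload; nothing in the bits themselves arbitrates between them.

```python
byte = 0b10110101  # one fixed bit pattern

# Invented scheme A: high 6 bits are the format tag, low 2 bits the payload.
tag_a, payload_a = byte >> 2, byte & 0b11
# Invented scheme B: high 7 bits are the format tag, low 1 bit the payload.
tag_b, payload_b = byte >> 1, byte & 0b1

print(f"A: tag={tag_a:06b}, payload={payload_a:02b}")  # tag=101101, payload=01
print(f"B: tag={tag_b:07b}, payload={payload_b:01b}")  # tag=1011010, payload=1
```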
It is clear that if, for example, we aim to have a comprehensive mapping of possible formats for files (of any reasonable length), basing our work on bitstreams is not going to succeed. Abandoning talk of formats in favor of bitstream language, for all its intellectual/mathematical attractiveness, will not solve the problems with which we are faced or with which we should be attempting to grapple. I am therefore disinclined to go down the bitstream route.
There are a number of format-related problems we need to be addressing, and every reason to be hopeful that substantial progress can be made against each of them.
The biological sciences, which represent a much more complex and multifaceted domain than computer file formats, have faced broadly similar kinds of problems around the naming and classification of the entities they study.
These problems have been substantially solved and a movement now exists within the biodiversity informatics community to provide globally unique identifiers in the form of Life Science Identifiers (LSID) for all biological names. Three large nomenclatural databases have already begun this process: Index Fungorum, International Plant Names Index (IPNI), and ZooBank. Other databases, which publish taxonomic rather than nomenclatural data, have also started using LSIDs to identify taxa.
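An LSID is a URN of the form urn:lsid:&lt;authority&gt;:&lt;namespace&gt;:&lt;identifier&gt;[:&lt;revision&gt;]. The sketch below splits one into its parts; the identifier shown is made up purely for illustration.

```python
def parse_lsid(lsid: str) -> dict:
    """Split an LSID URN into its components.
    Form: urn:lsid:<authority>:<namespace>:<identifier>[:<revision>]"""
    parts = lsid.split(":")
    if parts[:2] != ["urn", "lsid"] or len(parts) < 5:
        raise ValueError("not a well-formed LSID")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "identifier": parts[4],
        "revision": parts[5] if len(parts) > 5 else None,
    }

# A made-up identifier, purely for illustration:
print(parse_lsid("urn:lsid:example.org:names:12345"))
```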
The solution did not, in the biological sciences, involve abandoning talk of genus or species in favor of DNA sequences (for example) but was built on a broadly Linnaean taxonomic approach. This is the sort of approach that might be employed in our more modest and easily tractable domain.
Having decided which attributes of files are of interest (there will be many, of course, and the list will change over time), we can begin to group these into different categories. This is broadly analogous to the Life, Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species taxonomy familiar in biology. The biological approach has been perfectly able to withstand heated scientific dispute over the correct classification of individual life-forms, and it is by no means uncommon to see, for example, an insect classified differently over time as scientific understanding and technique have developed. It is not necessary for us to get classification schemes right the first time, or for them to be frozen for all time, in order for them to gain widespread acceptance and to confer significant benefits on the community.
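As a purely hypothetical sketch of what a Linnaean-style format taxonomy might look like, one could record for each format a lineage of progressively narrower categories, query the registry by rank, and re-classify a format locally without disturbing the rest of the scheme. The rank names and lineages below are invented for illustration, not a proposed standard.

```python
from dataclasses import dataclass

RANKS = ("class", "family", "genus", "species")  # hypothetical format ranks

@dataclass
class FormatRecord:
    name: str
    lineage: tuple  # one value per rank, broadest first

registry = [
    FormatRecord("PNG",  ("raster image", "lossless", "deflate-based", "png")),
    FormatRecord("TIFF", ("raster image", "lossless", "tag-based", "tiff")),
    FormatRecord("PDF",  ("page description", "container", "postscript-derived", "pdf")),
]

def members(rank: str, value: str) -> list:
    """All registered formats whose lineage has `value` at `rank`."""
    i = RANKS.index(rank)
    return [r.name for r in registry if r.lineage[i] == value]

print(members("class", "raster image"))  # ['PNG', 'TIFF']
# Re-classifying a format is a local change to its lineage, just as a
# species can be moved between genera without rebuilding the whole tree.
```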
By declining to speak of formats, in favor of speaking of bitstreams, we will not only fail to improve matters but may actually make intractable the format problems we need to tackle. Insofar as we do want to employ multiple levels of abstraction in our discussions of the overall problem space, we must be careful which concepts we deploy where, or we are likely to make category mistakes.
a. See http://www.digitalmeetsculture.net/article/international-conference-on-digital-preservation-ipres-2015/.