Forty years ago this June, an article appeared in these pages that would shape the long-term direction of information technology like few other ideas in computer science. The opening sentence of the article, "A Relational Model of Data for Large Shared Data Banks," summed it up in a way as simple and elegant as the model itself: "Future users of large data banks must be protected from having to know how the data is organized in the machine," wrote Edgar F. Codd, a researcher at IBM.
And protect them it did. Programmers and users at the time dealt mostly with crude homegrown database systems or commercial products like IBM's Information Management System (IMS), which was based on a low-level, hierarchical data model. "These databases were very rigid, and they were hard to understand," recalls Ronald Fagin, a Codd protégé and now a computer scientist at IBM Almaden Research Center. The hierarchical "trees" in IMS were brittle. Adding a single data element (a common occurrence), or even making tuning changes, could involve major reprogramming. In addition, the programming language used with IMS was a low-level one, akin to assembly language.
But Codd's relational model stored data by rows and columns in simple tables, which were accessed via a high-level data manipulation language (DML). The model raised the level of abstraction so that users specified what they wanted, but not how to get it. And when their needs changed, reprogramming was usually unnecessary. It was similar to the transition 10 years earlier from assembler languages to Fortran and COBOL, which also raised the level of abstraction so that programmers no longer had to know and keep track of details like memory addresses.
"People were stunned to learn that complex, page-long [IMS] queries could be done in a few lines of a relational language," says Raghu Ramakrishnan, chief scientist for audience and cloud computing at Yahoo!
Codd's model came to dominate a multibillion-dollar database market, but it was hardly an overnight success. The model was just too simple to work, some said. And even if it did work, it would never run as efficiently as a finely tuned IMS program, others said. And although Codd's relational concepts were simple and elegant, his mathematically rigorous languages, relational calculus and relational algebra, could be intimidating.
In 1969, an ad hoc consortium called CODASYL proposed a hierarchical database model built on the concepts behind IMS. CODASYL claimed that its approach was more flexible than IMS, but it still required programmers to keep track of far more details than the relational model did. It became the basis for a number of commercial products, including the Integrated Database Management System (IDMS) from the company that would become Cullinet.
Contentious debates raged over the models in the CS community through much of the 1970s, with relational enthusiasts arrayed against CODASYL advocates while IMS users coasted along on waves of legacy software.
As brilliant and elegant as the relational model was, it might have remained confined to computer science curricula had it not been for two projects aimed at real-world implementation of the relational database management system (RDBMS). In the mid-1970s, IBM's System R project and the University of California at Berkeley's Ingres project set out to translate the relational concepts into workable, maintainable, and efficient computer code. Support for multiple users, locking, logging, error recovery, and more was developed.
System R went after the lucrative mainframe market with what would become DB2. In particular, System R produced the Structured Query Language (SQL), which became the de facto standard language for relational databases. Meanwhile Ingres was aimed at UNIX machines and Digital Equipment Corp. (DEC) minicomputers.
Then, in 1979, another watershed paper appeared. "Access Path Selection in a Relational Database Management System," by IBM System R researcher Patricia Selinger and coauthors, described an algorithm by which a relational system, presented with a user query, could pick the best path to a solution from multiple alternatives. It did that by considering the total cost of the various approaches in terms of CPU time, required disk space, and more.
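The following toy sketch, in Python, is not Selinger's algorithm but conveys the core idea her paper introduced: enumerate alternative access plans for a query, estimate each plan's total cost from weighted I/O and CPU terms, and keep the cheapest. The plan names, statistics, and cost weights are invented for illustration.

    # A toy sketch of cost-based access path selection (illustrative only,
    # not the System R algorithm). Each candidate plan carries rough estimates;
    # the cheapest plan under a simple weighted cost model wins.
    from dataclasses import dataclass

    @dataclass
    class Plan:
        name: str
        pages_read: int      # estimated disk pages fetched
        rows_processed: int  # estimated rows examined

    def estimated_cost(plan: Plan, io_weight: float = 1.0, cpu_weight: float = 0.01) -> float:
        # Total cost combines an I/O term and a CPU term.
        return io_weight * plan.pages_read + cpu_weight * plan.rows_processed

    candidates = [
        Plan("full table scan", pages_read=10_000, rows_processed=1_000_000),
        Plan("index scan on dept_id", pages_read=120, rows_processed=5_000),
        Plan("index-only scan", pages_read=40, rows_processed=5_000),
    ]

    best = min(candidates, key=estimated_cost)
    print(f"chosen plan: {best.name} (cost {estimated_cost(best):.1f})")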
"Selinger's paper was really the piece of work that made relational database systems possible," says David DeWitt, director of Microsoft's Jim Gray Systems Laboratory at the University of Wisconsin-Madison. "It was a complete home run." The paper led to her election to the National Academy of Engineering in 1999, won her a slew of awards (including the SIGMOD Edgar F. Codd Innovations Award in 2002), and remains the seminal work on query optimization in relational systems.
Propelled by Selinger's new ideas, System R, Ingres, and their commercial progeny proved that relational systems could provide excellent performance. IBM's DB2 edged out IMS and IDMS on mainframes, while Ingres and its derivatives had the rapidly growing DEC Vax market to themselves. Soon, the database wars were largely over.
During the 1980s, DeWitt found another way to speed up queries against relational databases. His Gamma Database Machine Project showed it was possible to achieve nearly linear speed-ups by using the multiple CPUs and disks in a cluster of commodity minicomputers. His pioneering ideas about data partitioning and parallel query execution found their way into nearly all commercial parallel database systems.
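A minimal sketch of those two ideas, partitioning and parallel scan, might look like the Python fragment below. It only simulates a shared-nothing cluster with threads; the node count, data, and filter predicate are assumptions made for illustration.

    # Illustrative only: hash-partition a table's rows across "nodes," run the
    # same scan on every partition concurrently, and merge the partial results.
    from concurrent.futures import ThreadPoolExecutor

    NUM_NODES = 4
    rows = [{"id": i, "amount": i * 7 % 100} for i in range(1_000)]

    # Declustering: each row is routed to a node by hashing its key.
    partitions = [[] for _ in range(NUM_NODES)]
    for row in rows:
        partitions[hash(row["id"]) % NUM_NODES].append(row)

    def scan_partition(partition):
        # Each node filters only its own slice of the data.
        return [r for r in partition if r["amount"] > 90]

    # Threads stand in for separate machines; in a real shared-nothing cluster,
    # where each node has its own CPU and disks, balanced partitions yield
    # near-linear speedups on scan-dominated queries.
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        partial_results = pool.map(scan_partition, partitions)

    result = [r for part in partial_results for r in part]
    print(len(result), "matching rows")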
"If the database community had not switched from CODASYL to relational, the whole parallel database industry would not have been possible," DeWitt says. The declarative, not imperative, programming model of SQL greatly facilitated his work, he says.
The simplicity of the relational model held obvious advantages for users, but it had a more subtle benefit as well, IBM's Fagin says. "For theorists like me, it was much easier to develop theory for it. And we could find ways to make the model perform better, and ways to build functions into the model. The relational model made collaboration between theorists and practitioners much easier."
Indeed, theorists and practitioners worked together to a remarkable degree, and operational techniques and applications flowed from their work. Their collaboration resulted in, for example, the concept of locking, which prevents concurrent users from updating the same record at the same time.
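The intent of record-level locking can be shown in a few lines; the sketch below is a loose analogy in Python (not how a DBMS actually implements it), with the account record and withdrawal amounts invented for illustration.

    # Two concurrent "users" must acquire the record's lock before a
    # read-modify-write, so their updates cannot interleave and be lost.
    import threading

    record = {"balance": 100}
    record_lock = threading.Lock()

    def withdraw(amount: int) -> None:
        with record_lock:  # only one updater holds the record at a time
            current = record["balance"]
            record["balance"] = current - amount

    threads = [threading.Thread(target=withdraw, args=(10,)) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(record["balance"])  # always 50; without the lock, an update could be lost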
The hegemony of the relational model has not gone without challenge. For example, a rival appeared in the late 1980s in the form of object-oriented databases (OODBs), but they never caught on. There weren't that many applications for which an OODB was the best choice, and it turned out to be easier to add the key features of OODBs to the relational model than to start from scratch with a new paradigm.
More recently, some have suggested that the MapReduce software framework, patented by Google this year, will supplant the relational model for very large distributed data sets. [See "More Debate, Please!" by Moshe Y. Vardi on p. 5 of the January 2010 issue of Communications.] Clearly, each approach has its advantages, and the jury is still out.
As RDBMSs continue to evolve, scientists are exploring new avenues of inquiry. Fagin's key research right now is the integration of heterogeneous data. "A special case that is still really hard is schema mapping—converting data from one format to another," he says. "It sounds straightforward, but it's very subtle." DeWitt is interested in how researchers will approach the "unsolved problem" of querying geographically distributed databases, especially when the databases are created by different organizations and are almost but not quite alike. And Ramakrishnan of Yahoo! is investigating how to maintain databases in the cloud, where service vendors could host the databases of many clients. "So 'scale' now becomes not just data volume, it's the number of clients, the variety of applications, the number of locations, and so on," he says. "Manageability is one of the key challenges in this space."
Further Reading
Codd, E.F.
A relational model of data for large shared data banks. Comm. of the ACM 13, 6, June 1970.
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.
Access path selection in a relational database management system. Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, 1979.
Ullman, J.D.
Principles of Database and Knowledge-Base Systems: Volume II: The New Technologies, W.H. Freeman & Co., New York, NY, 1990.
Hellerstein, J. M. and Stonebraker, M.
Readings in Database Systems (4th ed.), MIT Press, Cambridge, MA, 2005.
Agrawal, R., et al.
The Claremont Report on Database Research, University of California at Berkeley, Sept. 2008. http://db.cs.berkeley.edu/claremont/claremontreport08.pdf
The following letter was published as a Letter to the Editor in the October 2010 CACM (http://cacm.acm.org/magazines/2010/10/99482).
--CACM Administrator
While we were pleased Communications celebrated E.F. Codd's seminal article "A Relational Model of Data for Large Shared Data Banks" (June 1970) in "Happy Birthday, RDBMS!" by Gary Anthes (May 2010), we were also dismayed by its inaccuracies and misrepresentations, including about more than just pre-RDBMS history.
For example, saying "Codd's relational model stored data in rows and columns..." is completely at odds with Codd's goal that "Future users of large data banks must be protected from having to know how data is organized in the machine." Rows and columns are the canonical representation of Codd's relations, not a constraint on physical data structures. Getting this wrong completely undermines Codd's contribution. Moreover, no viable commercial RDBMS has stored data purely in rows and columns, nor has any vendor completely implemented the logical and physical data independence his theory made possible.
Other inaccuracies and misleading statements abound:
DB2 did not "edge out IMS and IDMS." It took a long time for the transaction rates of any commercial RDBMS to compete with those of IMS, which remains an important commercial DBMS;
Ingres and its derivatives did not have the "DEC VAX market to themselves." Interbase, Oracle, and Rdb/VMS were early players (1980s), and Ingres was initially available on VAX/VMS but, like many RDBMS products that preceded the IBM products, was introduced on Unix;
The "database wars" raged for almost two decades. Relational repeatedly had to prove itself against network, hierarchical, and object-oriented DBMSs, continuing with XML and Hadoop contenders;
Map/Reduce is a non-declarative programmer's distributed query template, and the Hadoop Distributed File System is a storage model. Neither rises to the level of data model or programming language;
As for its being "easier to add the key features of OODBs to the relational model than start from scratch with a new paradigm," that never happened. At best, features were added to SQL and SQL-based products, but these misguided additions did violence to the relational model's way of achieving desired capabilities, namely extensible domain support;
"Querying geographically distributed relational databases" is not unsolved. Implementing the relational model's physical data independence solved it;
Since 1980, numerous RDBMS products have provided partial implementation of physical data independence and been widely used in industry. Perhaps David DeWitt [cited by Anthes and director of Microsoft's Jim Gray Systems Laboratory at the University of Wisconsin-Madison] was referring to the problems of querying heterogeneous, distributed data with inadequate metadata, since he was quoted as describing databases "created by different organizations" that are "almost but not quite alike"; and
Database scalability has always been about numbers of concurrent users and locations, user variety, and manageability, not just data volumes. One of us (McGoveran) published (late 1980s, 1990s) studies evaluating scalability of commercial products along these lines.
David McGoveran
Boulder Creek, CA
C.J. Date
Healdsburg, CA
---------------------------------------------
AUTHOR'S RESPONSE
E.F. Codd's model let users "see" their data as if it were stored in ordinary tables, rows, and columns. This was easier for them to understand than the pointers and hierarchical trees used in other models. Such simplification was one reason the RDBMS model edged out IMS and IDMS, though IMS is still used in a few narrow (but important) niches. Alas, vendors did not implement Codd's rules in their purest form, as McGoveran and Date point out.
Gary Anthes
Arlington, VA
The following letter was published in the Letters to the Editor in the July 2010 CACM (http://cacm.acm.org/magazines/2010/7/95049).
--CACM Administrator
Gary Anthes offered good reporting but also some serious errors concerning pre-RDBMS history in his news article "Happy Birthday, RDBMS!" (May 2010), saying "In 1969, an ad hoc consortium called CODASYL proposed a hierarchical database model built on the concepts behind IMS. CODASYL claimed that its approach was more flexible than IMS, but it still required programmers to keep track of far more details than the relational model did."
Please compare with the following basic facts as reported in Wikipedia: "In 1965 CODASYL formed a List Processing Task Force. This group was chartered to develop COBOL language extensions for processing collections of records; the name arose because Charles Bachman's IDS system (which was the main technical input to the project) managed relationships between records using chains of pointers. In 1967 the group renamed itself the Data Base Task Group and in October 1969 published its first language specifications for the network database model, which became generally known as the CODASYL Data Model."
The Integrated Data Store (IDS) has been in continuous productive use since 1964, running first on GE 200 computers. In 1966, it began supporting a nationwide, 24/7, order-entry system (OLTP). And in 1969, running on the GE 600, it began supporting a shared-access (OLTP) database, complete with locks, deadlock detection, and automatic recovery and restart.
IBM did not release its IMS/360 (Information Management System) based on the hierarchical data model until September 1969 when future relational databases were still just a gleam in Ted Codd's eye.
B.F. Goodrich received the IDS source code from GE in 1964, renaming it the Integrated Database Management System, or IDMS, when rewritten for the IBM 360 (1969-1971). IDMS was acquired (1973) and marketed worldwide by Cullinane (later Cullinet). IDMS was acquired (1989) by CA (formerly Computer Associates), which still actively supports it worldwide on more than 4,000 IBM mainframes. British Telecom and the Brazilian government are the best-known IDMS users, rated, in 2005, the second- and third-largest OLTP systems in the world.
For more, please see the refereed papers on IDS, IMS, IDMS, and other DBMS products in IEEE Annals of the History of Computing (Oct.-Dec. 2009) special issue on "Mainframe Software: Database Management Systems." A future issue is planned to cover more recent RDBMS history.
Charles W. (Charlie) Bachman
ACM A.M. Turing Award 1973
Lexington, MA