Relational database management systems (DBMSs) have been remarkably successful in capturing the DBMS marketplace. To a first approximation they are “the only game in town,” and the major vendors (IBM, Oracle, and Microsoft) enjoy an overwhelming market share. They are selling “one size fits all”; i.e., a single relational engine appropriate for all DBMS needs. Moreover, the code line from all of the major vendors is quite elderly, in all cases dating from the 1980s. Hence, the major vendors sell software that is a quarter century old, and has been extended and morphed to meet today’s needs. In my opinion, these legacy systems are at the end of their useful life. They deserve to be sent to the “home for tired software.”
Here’s why.
If we examine the nontrivial-sized DBMS markets, it turns out that current relational DBMSs can be beaten by approximately a factor of 50 in most any market I can think of. What follows are a few examples.
In the data warehouse market, a column store beats a row store by approximately a factor of 50 on typical business intelligence queries. The reason is that column stores read only the columns of interest to the query and not all of them. In addition, compression is more effective in a column store. Since the legacy systems are all row stores, they are vulnerable to competition from the newer column stores. The interested reader can start with “C-Store: A Column-oriented DBMS” to explore this topic further.
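To make the "read only the columns of interest" point concrete, here is a minimal sketch that counts how many field values each layout has to touch to answer a single-column aggregate. The table, the column names, and the helper functions are all hypothetical and chosen for illustration; a real engine works with disk pages and compressed blocks, not Python lists.

```python
# Toy table; names and values are illustrative assumptions, not from any real system.
ROWS = [
    {"order_id": 1, "customer": "acme", "region": "east", "amount": 120.0},
    {"order_id": 2, "customer": "bolt", "region": "west", "amount": 75.5},
    {"order_id": 3, "customer": "acme", "region": "east", "amount": 30.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict mapping column name -> list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_amount_row_store(rows):
    """Row layout: each whole row is fetched, so every field counts as touched."""
    total = 0.0
    fields_touched = 0
    for row in rows:
        fields_touched += len(row)  # the full row is read to reach 'amount'
        total += row["amount"]
    return total, fields_touched


def sum_amount_column_store(columns):
    """Column layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


if __name__ == "__main__":
    print(sum_amount_row_store(ROWS))
    print(sum_amount_column_store(to_columns(ROWS)))
```

With four columns the row layout touches four times as many values as the column layout for this query; warehouse tables often have far more columns, which is where the gap widens.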
In the online transaction processing (OLTP) market, a lightweight main memory DBMS beats a row store by a factor of 50. Leveraging main memory, together with the fact that no DBMS application will send a message to a human user in the middle of a transaction, allows an OLTP DBMS to run transactions to completion with no resource contention or locking overhead. The interested reader can start with “The End of an Architectural Era (It’s Time for a Complete Rewrite)” to explore this topic further.
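The run-to-completion idea can be sketched in a few lines. This is a toy under stated assumptions, not the design of any particular system: all data lives in one in-memory dict, a single thread executes queued transactions one at a time, and each transaction buffers its writes so that an abort leaves the data untouched. The names SerialExecutor, TxnView, and transfer are hypothetical.

```python
from collections import deque


class TxnView:
    """Per-transaction view: reads see buffered writes first, then the shared data."""

    def __init__(self, data):
        self._data = data
        self.writes = {}

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        return self._data[key]

    def write(self, key, value):
        self.writes[key] = value


class SerialExecutor:
    """Runs each transaction to completion on one thread, so no locks are needed."""

    def __init__(self, initial):
        self.data = dict(initial)
        self.queue = deque()

    def submit(self, txn):
        self.queue.append(txn)

    def run(self):
        outcomes = []
        while self.queue:
            txn = self.queue.popleft()
            view = TxnView(self.data)
            try:
                txn(view)
            except ValueError:
                outcomes.append("aborted")  # buffered writes are simply dropped
            else:
                self.data.update(view.writes)  # commit: apply writes in one step
                outcomes.append("committed")
        return outcomes


def transfer(src, dst, amount):
    """Build a transaction that moves `amount` from `src` to `dst`."""
    def txn(view):
        if view.read(src) < amount:
            raise ValueError("insufficient funds")
        view.write(src, view.read(src) - amount)
        view.write(dst, view.read(dst) + amount)
    return txn


if __name__ == "__main__":
    executor = SerialExecutor({"a": 100, "b": 0})
    executor.submit(transfer("a", "b", 60))
    executor.submit(transfer("a", "b", 60))
    print(executor.run(), executor.data)
```

Because no transaction ever waits on a user or on disk in this model, nothing interleaves, and the lock manager and latching of a conventional engine have no work to do.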
In the science DBMS market, users have never liked relational DBMSs and want a non-relational model and query facility. This was the topic of my last ACM blog, "DBMSs for Science Applications: A Possible Solution."
If you are storing Resource Description Framework (RDF) data, which is popular in the bio community and elsewhere, then “Scalable Semantic Web Data Management Using Vertical Partitioning” points out that column stores are very good at certain RDF workloads. In addition, other ideas, such as “RDF-3X: A Risc-style engine for RDF,” will beat conventional DBMSs in other situations. Lastly, native RDF engines (e.g., Virtuoso, Sesame, and Jena) may well gain traction. The point is that something else will beat conventional row stores in this market.
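As a rough illustration of vertical partitioning, the sketch below splits a list of RDF triples into one two-column (subject, object) table per predicate, so that a query on one predicate scans only that predicate's table. The sample triples and function names are made up for the example, and a real system would store each table in a column store rather than in Python lists.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("gene1", "type", "Gene"),
    ("gene1", "encodes", "proteinA"),
    ("gene2", "type", "Gene"),
    ("proteinA", "type", "Protein"),
]


def vertically_partition(triples):
    """Return {predicate: sorted list of (subject, object) pairs}."""
    tables = defaultdict(list)
    for subject, predicate, obj in triples:
        tables[predicate].append((subject, obj))
    for pairs in tables.values():
        pairs.sort()  # keeping each table sorted by subject helps merge joins
    return dict(tables)


def subjects_with(tables, predicate, obj):
    """Find subjects having the given predicate/object, scanning one table only."""
    return [s for s, o in tables.get(predicate, []) if o == obj]


if __name__ == "__main__":
    tables = vertically_partition(TRIPLES)
    print(subjects_with(tables, "type", "Gene"))
```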
Text applications have never used relational DBMSs. This was pointed out to me most clearly by Eric Brewer nearly 15 years ago in the early days of Inktomi. He wanted to use a relational DBMS to store the results of Web crawling, but found it to be two orders of magnitude slower than a home-brew system. All the major Web-search engines use home-brew text software to serve us search results. None use relational DBMSs.
Even in XML, where the current major vendors have spent a great deal of energy extending their engines, it is claimed that specialized engines, such as Mark Logic or Tamino, run circles around the major vendors, according to a private communication by Dave Kellogg.
In summary, one can leverage at least the following ideas to get superior performance:
A non-relational data model. If the user’s data is naturally something other than tables and if simulating his natural data model on top of tables is awkward, then chances are that a native implementation of the natural data model will significantly outperform a conventional RDBMS. This is certainly true in scientific data.
A different implementation of tables. If something other than a row store accelerates the user’s queries, then a direct implementation of the relational model using non-row store technology will run circles around a conventional RDBMS. This is true in the data warehouse marketplace.
A different implementation of transactions. Current row stores give you a “one size fits all” implementation of transactions. This can be radically beaten if a user has lesser requirements or if the system can take advantage of workload specific features. This is true in the OLTP marketplace.
One of these characteristics is true in every market I can think of. Hence, in my opinion, the days of a “one size fits all” monolithic DBMS are at an end. The replacement will be a collection of vertical market specific engines, with much higher performance.
You might ask, “What if I don’t care about performance?” The answer: Run one of the open source relational DBMSs. They are mature, reliable, and, best of all, they are free.
You might also ask, “I am dug in deep with my current vendor(s). What do I do?” The answer: Take some portion of your DBMS budget and allocate it to new solutions. Over time, you will move onto better technology.
References
Michael Stonebraker et al., “C-Store: A Column-oriented DBMS,” Proc. 2005 VLDB Conference, Trondheim, Norway, Sept. 2005.
Michael Stonebraker et al., “The End of an Architectural Era (It’s Time for a Complete Rewrite),” Proc. 2007 VLDB Conference, Vienna, Austria, Sept. 2007.
Dan Abadi et al., “Scalable Semantic Web Data Management Using Vertical Partitioning,” Proc. 2007 VLDB Conference, Vienna, Austria, Sept. 2007.
Thomas Neumann et al., “RDF-3X: A Risc-style engine for RDF,” Proc. VLDB Endowment, 1(1): 647-659 (2008).
Disclosure: Michael Stonebraker is associated with four startups that are either producers or consumers of database technology. Hence, his opinions should be considered in this light.
This article nicely sums up a lot of the arguments about why the RDBMS is no longer a viable solution for people whose data needs truly must scale. I have not personally looked into RDF much, but I agree that your data storage needs to reflect the nature of the data ... not just a bunch of tables and rows.
I've written an article on it as well that I encourage you to check out, entitled "Social Media Kills the Database", which is about the Swiss-Army RDBMS and its impending end. You can check it out at http://www.roadtofailure.com
Absolutely, we somehow allowed ourselves to go down the path of monopolizing a single technology for data management and largely monopolizing a handful of vendors. At the same time alternatives to the RDBMS were largely discredited throughout the years and never really gained acceptance, even when there clearly was a disconnect between the RDBMS and requirements (and despite some poorly bolted on extensions by the RDBMS vendors in an attempt to retrofit their platforms).
This is clearly changing now. Some difficult but pressing challenges haven't been easily solved by the RDBMS in its traditional form (massive scalability, for example), and this has opened the door to new approaches and ideas. The traditional RDBMS will of course live on, but in an ecosystem of alternative data management strategies.
Yes. It is very true that RDBMSs have been overhyped for years for not-so-valid reasons. Current trends also show that there are viable alternatives to the RDBMS that can beat it at its own game. Also, the emergence of distributed key-value stores such as Cassandra and Voldemort demonstrates the efficiency and cost-effectiveness of their approaches.
Also, the recently concluded "NoSQL" conference discussed at length how distributed, non-relational databases work, along with an overview of the emerging alternatives in this space.
One of the chief benefits cited for DBMSs was improved programmer productivity, due to insulating application programmers from the significant burden of learning specialized files/systems and their internals. However valuable this benefit actually turned out to be, it would be entirely lost when replacing DBMSs with specialized systems.
Michael Stonebraker is fairly entitled to express his opinions, but as a fellow member of ACM, I would like to express some counterpoints based on my 25 years of successful business deployments on 8 or 9 (maybe more) commercial relational database products.
It is perfectly legitimate for Stonebraker to differ with me on the practicalities of relational storage for text, and how we interpret it when it is "claimed" that specialized XML engines outperform RDBMSes.
Similarly, he might claim a column store is different from a tuple store, and I might claim the only difference is data modeling choices.
It is also perfectly fair for him to emphasize performance benefits while downplaying or redefining the notion of transactional integrity that is now in widespread use.
But when Stonebraker misrepresents the relational model by claiming a user's data is "naturally something other than tables" he fundamentally misrepresents relational data analysis. The relational data model is not called "tabular" and the abstraction of a relation is not a simple data-entry form. One needn't read any further than Codd and Date to understand this. Their whole supplier-part-warehouse example shows how you have to deconstruct crude tabular representations in order to get good relational data models. Codd won the Turing Award because the relational abstraction is capable of representing any structured data.
And of all the valid criticisms of a model or a technology, "elderly" and "tired" are worse than useless. Do we believe that technology builds on prior discoveries, or that new technology throws older discoveries away? By such a standard, we would stop teaching Boolean logic, Turing machines, and all the other things that predate us.
Computer science has given us two fantastic tools for analyzing and managing data complexity: the relational model, and language theory. Compiler technology has been bulletproof for decades because of terrific underlying abstractions like the context-free grammar. The longevity of the relational model is due to its similar foundation in a powerful abstraction. I would claim that we will see relational DBMSes for at least as long as we see compilers.
In the interest of full disclosure, I am a proud employee of Oracle Corporation, but I do not speak for Oracle in any official or unofficial capacity.
It is disappointing, conversely, that neither Stonebraker nor ACM has informed our fellow ACM members that he is the CTO of Vertica Systems, a "column-store" database vendor that positions itself as a technology superior to relational technology.
With all respect,
Andrew D. Wolfe, Jr.
Mr. Wolfe appears to be making three main points in his posting:
1) the relational model is the best approach to data modeling
2) column stores are no different than row stores
3) elderly software is not bad.
I would like to briefly respond to each point.
Mr. Wolfe uses examples from business data processing in his posting. It is widely recognized that the relational model is probably the best fit for most business data processing data. In fact, all of the early examples used by Ted Codd, Chris Date, and others (including me) come from this domain (e.g., suppliers, parts, employees, departments, etc.). In the 1970s and 80s this was the only database market of consequence. However, one of the points that I was trying to make is that there are now other sizeable markets with different requirements.
In the science domain, tables are rarely the natural data model, and arrays would be a better choice. Popular science packages (e.g., MATLAB and S+) use arrays, not tables, as their user model. Once one leaves business data processing, the naturalness of the relational model must be questioned.
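A small sketch of why simulating arrays on top of tables is awkward: the same rectangular window is extracted once from a native two-dimensional array and once from a table of (row, column, value) tuples. The grid and the function names are hypothetical; real array workloads are far larger and would use a dedicated array library or engine.

```python
# Hypothetical 3x3 measurement grid.
GRID = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]


def as_table(grid):
    """Flatten the array into relational-style (i, j, value) tuples."""
    return [(i, j, v) for i, row in enumerate(grid) for j, v in enumerate(row)]


def window_from_array(grid, r0, r1, c0, c1):
    """Native array: a window is just a pair of slices."""
    return [row[c0:c1] for row in grid[r0:r1]]


def window_from_table(table, r0, r1, c0, c1):
    """Table simulation: filter every tuple, then rebuild the 2-D shape."""
    cells = {
        (i, j): v
        for i, j, v in table
        if r0 <= i < r1 and c0 <= j < c1
    }
    return [[cells[(i, j)] for j in range(c0, c1)] for i in range(r0, r1)]


if __name__ == "__main__":
    print(window_from_array(GRID, 1, 3, 1, 3))
    print(window_from_table(as_table(GRID), 1, 3, 1, 3))
```

Both calls return the same window, but the table version scans all tuples and has to reconstruct the array's shape, which the native representation gets for free.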
Column stores are a different implementation of the relational model than the row stores used by the major commercial vendors. Because they make different architectural choices than row stores in the areas of query processing, compression, and storage formats, they have a different performance envelope than row stores. In typical data warehouse workloads, column stores (which were designed specifically for this market) are vastly superior to row stores. See [1, 2, 3] for some detailed remarks in this area. Or just have your favorite Web browser search for column stores versus row stores to access the abundance of literature on this topic.
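One reason compression works well on columns is that a sorted column often contains long runs of the same value. The sketch below uses run-length encoding as one illustrative scheme (real column stores choose among several encodings) and shows that a simple predicate can be answered directly on the compressed form. The column contents and function names are assumptions made for the example.

```python
def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    values = []
    for value, count in runs:
        values.extend([value] * count)
    return values


def count_rows(runs, target):
    """Count rows equal to `target` without decompressing the column."""
    return sum(count for value, count in runs if value == target)


if __name__ == "__main__":
    # Hypothetical 'region' column, sorted so equal values are adjacent.
    region = ["east"] * 4 + ["north"] + ["west"] * 3
    runs = rle_encode(region)
    print(runs, count_rows(runs, "west"))
```

In a row store the same values are interleaved with every other field of each row, so long runs like these rarely appear in the stored byte stream.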
Third, I am always reminded of the Airline Control Program (ACP), renamed TPF by IBM. Written in IBM assembler a long time ago, it used very small disk blocks, an architectural decision made more than 30 years ago to optimize processing on a then-current (but now obviously long gone) IBM disk drive. Only fairly recently was this architectural decision changed. Hence, the problem with legacy code is that some things are just hard to change and linger in elderly code lines.
Two additional examples come to mind. A major database vendor wanted to change his replication system from active-passive to active-active. However, he didn't do so because it was just too much work. Another DBMS vendor has a shared-disk architecture because implementing a shared-nothing architecture was simply too hard.
Besides technical problems, there are also political and business issues to cope with. Any technologist would be well advised to read Clayton Christensen's book on this topic [4].
--Michael Stonebraker, Sept. 4, 2009
[1] Mike Stonebraker et al., "C-Store: A Column-oriented DBMS," Proc. 2005 VLDB Conference, Trondheim, Norway, Sept. 2005.
[2] en.wikipedia.org/wiki/Column-oriented_DBMS.
[3] Dan Abadi et al., "Column-Stores vs. Row-Stores: How Different Are They Really?" Proc. 2008 SIGMOD Conference, Vancouver, Canada, June 2008.
[4] Clayton M. Christensen, "The Innovator's Dilemma," Collins Business Essentials, 1997.