Communications of the ACM

Research highlights

Technical Perspective: Getting Consensus for Data Replication


Data in a cloud computing system should be highly available. That is, whenever you connect to the system, the data you stored there should be ready to use. The standard mechanism to accomplish this—data replication—involves maintaining multiple copies of each user's data.

By itself, replicating data does not solve the problem, because the copy of data that is available might be stale. To see why, consider a system that maintains three copies of file x: x1, x2, and x3. Suppose x1 and x2 are currently available and x3 is down. If a program updates x, the system writes the update to x1 and x2, but not to x3 because it is down. No problem, since the system still has two correct copies. Next, suppose x1 and x2 fail, and then x3 recovers. Now there is a problem. If a user reads x, the best the system can do is return the value of x3. But that copy is stale—it does not have the latest update.

An early solution to this problem is called majority consensus [2]. To process a write operation on x, the system assigns a monotonically increasing timestamp to the write and writes x's value and timestamp to a majority of the copies of x. To process a read operation on x, it reads a majority of the copies of x and returns the one with the largest timestamp. Since any two majorities intersect in at least one copy, every read is guaranteed to see the result of the most recent write on the same data.
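
As a concrete illustration, here is a minimal in-memory sketch of the protocol in Python; the Replica class and the write/read functions are invented for exposition, not taken from either paper.

    import itertools

    class Replica:
        """One copy of x: its current value, the timestamp of the
        last write it received, and whether it is currently up."""
        def __init__(self):
            self.value, self.ts, self.up = None, 0, True

    _next_ts = itertools.count(1)             # monotonically increasing timestamps

    def _majority(replicas):
        return len(replicas) // 2 + 1

    def write(replicas, value):
        live = [r for r in replicas if r.up]
        if len(live) < _majority(replicas):
            raise RuntimeError("write cannot reach a majority of copies")
        ts = next(_next_ts)
        for r in live:                        # any majority suffices; we use all live copies
            r.value, r.ts = value, ts

    def read(replicas):
        live = [r for r in replicas if r.up]
        if len(live) < _majority(replicas):
            raise RuntimeError("read cannot reach a majority of copies")
        quorum = live[:_majority(replicas)]   # any majority subset will do
        return max(quorum, key=lambda r: r.ts).value  # copy with the largest timestamp

Note that with only one copy up, read raises rather than returning possibly stale data.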

In our example, the read operation reads only one copy, which is not a majority of three. Therefore, it might not return the latest state of x. Using majority consensus, the system would have to wait until a second copy became available, so it could read two copies. Any two of the three copies include at least one that received the last write, so returning the copy with the larger timestamp is guaranteed to yield the latest state.

The notion of majority was extended in Gifford [1] to a weighted majority, which is called a quorum. Each copy is given a weight. A read must access a read quorum of copies, that is, a set of copies whose total weight is at least R. And a write must update a write quorum of copies, a set whose total weight is at least W. By requiring that R+W exceed the total weight N of all copies, we ensure that every read quorum intersects every write quorum, so each read sees a copy written by the last write.


For example, suppose the system stores four copies of x, namely x1-x4. We might assign a weight of 1 to x1 and x2, but a weight of 2 to x3 and x4 if they are stored on more reliable servers, so N=6. If reads are more frequent than writes, we might set R=3 and W=4, so that reads have less work to do than writes. Since R+W = 7 > 6 = N, each read operation sees the result of the most recent write on the same data.
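
The claim is easy to check by brute force. The following Python lines, using the weights assumed above, enumerate every read quorum and write quorum and confirm that each pair overlaps.

    from itertools import combinations

    weights = {"x1": 1, "x2": 1, "x3": 2, "x4": 2}
    N = sum(weights.values())          # total weight: 6
    R, W = 3, 4
    assert R + W > N                   # 3 + 4 = 7 > 6

    def quorums(threshold):
        """All sets of copies whose total weight is at least the threshold."""
        return [set(q)
                for k in range(1, len(weights) + 1)
                for q in combinations(weights, k)
                if sum(weights[c] for c in q) >= threshold]

    # Every read quorum shares at least one copy with every write quorum,
    # so a read is guaranteed to see the copy written by the last write.
    assert all(rq & wq for rq in quorums(R) for wq in quorums(W))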

It stands to reason that each write has to update all available copies of the data. But it is annoying that readers have to read multiple copies too, since it adds overhead and delay. It is especially annoying if writes usually execute quickly, since in this case each read is very likely to see the latest copy even if it does not read a quorum. If the probability of reading stale data is sufficiently small, then it might be acceptable to let readers read fewer than a quorum of copies. In fact, some cloud applications do exactly that.

Until recently, this compromise was made by guesswork, without a bound on the staleness of the data that is read or on the expected latency improvement from risking some staleness. The following paper is a breakthrough that removes the guesswork and replaces it with a principled analysis, called probabilistically bounded staleness. Given R, W, and N, the authors offer a formula to calculate the probability of reading data that was not written by one of the K most recent writes. To calculate the probability of reading data that was not written in the last Δ time units, they use a Monte Carlo simulation based on time parameters that can be measured in a running system. Moreover, they made the analysis practical by implementing it in open-source record managers, so that users can make their own judgment on trading staleness for latency. What was formerly a guess is now a goal you can engineer for.
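
The flavor of the K-staleness analysis is easy to convey with a toy Monte Carlo estimate. The model below is an assumption of mine, not the paper's: each write lands on a uniformly random set of W of the N copies, and a read probes a uniformly random set of R copies and returns the freshest one it sees.

    import random

    def stale_probability(n, r, w, k, trials=100_000):
        """Estimate the probability that a read of r of the n copies
        returns data older than all of the k most recent writes."""
        copies = list(range(n))
        stale = 0
        for _ in range(trials):
            version = [0] * n                           # version held by each copy
            for v in range(1, k + 1):                   # replay the k most recent writes,
                for c in random.sample(copies, w):      # each to a random set of w copies
                    version[c] = v
            read_set = random.sample(copies, r)
            if max(version[c] for c in read_set) == 0:  # read missed all k writes
                stale += 1
        return stale / trials

    # Sanity check: with N=3, R=W=1, K=1, this model predicts 2/3.
    print(stale_probability(3, 1, 1, 1))

The paper's Δ-based analysis goes further, replacing these uniform-choice assumptions with latency distributions measured in a running system.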


References

1. Gifford, D.K. Weighted voting for replicated data. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP, 1979), 150–162.

2. Thomas, R.H. A majority consensus approach to concurrency control for multiple-copy databases. ACM Trans. Database Syst. 4, 2 (1979), 180–209.


Author

Philip A. Bernstein (http://research.microsoft.com/~philbe) is a Distinguished Scientist at Microsoft Research, Redmond, WA.


Footnotes

To view the accompanying paper, visit doi.acm.org/10.1145/2632792


Copyright held by author.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.


 
