Leave it to Google to make business data processing—among the stodgiest topics in the rather buttoned-up world of database systems—seem cool. The application here involves producing reports over Google's ads infrastructure: Google executives want to see how many ads each Google property is serving, and how profitable they are, and Google's customers want to see how many users are clicking on their ads, how much they are paying, and so on.
At a small scale, solving this problem is straightforward—new ad click and sales data are appended to a database file as they are sent from the processing system, and computing the answer to a particular query over the data involves reading the contents of ("scanning") the data file to compute a running total of the records in the groups the user is interested in. Making this perform at the scale of Google Ads, where billions of clicks happen per day, is the challenge addressed by the Mesa system described in the following paper.
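To make the small-scale approach concrete, here is a minimal sketch of append-and-scan processing. The file format, record fields, and function names are illustrative assumptions for exposition, not anything from Mesa itself:

```python
import csv
from collections import defaultdict

DATA_FILE = "ad_events.csv"  # hypothetical append-only data file

def append_event(campaign_id, clicks, cost):
    """Append a new ad-click record as it arrives from the processing system."""
    with open(DATA_FILE, "a", newline="") as f:
        csv.writer(f).writerow([campaign_id, clicks, cost])

def query_totals(campaign_ids):
    """Answer a query by scanning the whole file, keeping running totals
    for just the groups (campaigns) the user is interested in."""
    totals = defaultdict(lambda: [0, 0.0])  # campaign_id -> [clicks, cost]
    with open(DATA_FILE, newline="") as f:
        for campaign_id, clicks, cost in csv.reader(f):
            if campaign_id in campaign_ids:
                totals[campaign_id][0] += int(clicks)
                totals[campaign_id][1] += float(cost)
    return dict(totals)

# Example: total clicks and spend for two campaigns.
append_event("c1", 1, 0.25)
append_event("c2", 1, 0.10)
print(query_totals({"c1", "c2"}))
```

The sketch works fine for megabytes of data; the scan becomes the bottleneck long before reaching billions of records per day.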
Fundamentally, the key technique is to employ massive parallelism, both when adding new data and when looking up specific records. The techniques used are largely a collection of best practices developed in the distributed systems and database communities over the last decade, with some clever new ideas thrown in; the paper describes the highlights of this work in detail.
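The parallelism idea can be illustrated with a toy example: split the data into shards, scan the shards concurrently, and merge the partial results. The shard layout and names below are assumptions for illustration, not Mesa's actual design:

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

# Hypothetical shards: at scale, the data is split across many machines,
# and each shard can be scanned independently and in parallel.
SHARDS = [
    [("c1", 3), ("c2", 1)],
    [("c1", 2), ("c3", 5)],
    [("c2", 4), ("c3", 1)],
]

def scan_shard(shard):
    """Compute partial per-campaign click totals for one shard."""
    partial = Counter()
    for campaign_id, clicks in shard:
        partial[campaign_id] += clicks
    return partial

def parallel_totals(shards):
    """Scan all shards in parallel, then merge the partial totals."""
    totals = Counter()
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(scan_shard, shards):
            totals.update(partial)  # Counter.update adds counts
    return totals

if __name__ == "__main__":
    print(parallel_totals(SHARDS))  # Counter({'c3': 6, 'c1': 5, 'c2': 5})
```

Because per-campaign totals are associative, the partial results can be merged in any order, which is what makes this fan-out/fan-in pattern scale.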
A natural question is how Mesa compares to existing parallel transactional database systems. Such systems are optimized for high throughput, but they lack several features that Mesa requires. First, Mesa fits neatly into the elegant modular (layered) software architecture stack Google has built: it runs on top of Colossus (Google's distributed file system) and provides a substrate on which advanced query processing systems (like their F1 system) can be built. Layering software this way allows different engineering teams to maintain different parts of the code, and allows each layer to serve multiple clients. Many existing data processing systems are much more monolithic and would be difficult to integrate into the Google software ecosystem. Second, conventional databases were not built to replicate data across multiple datacenters. Traditional systems typically use a single-master approach for fault tolerance, replicating to a (read-only) standby that can take over when the master fails. Such a design will not work well if datacenter failures or network partitions are frequent.
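The layering can be sketched schematically. The class and method names below are hypothetical, chosen only to illustrate the structure; they are not the actual Colossus, Mesa, or F1 APIs:

```python
# Schematic sketch of the layered stack described above. Every name and
# signature here is an assumption made for exposition.

class DistributedFileSystem:
    """Storage layer (the role Colossus plays at Google)."""
    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...

class DataWarehouse:
    """Mesa-like layer: batched updates and queries over stored data."""
    def __init__(self, fs: DistributedFileSystem) -> None:
        self.fs = fs
    def apply_update_batch(self, batch: list) -> None: ...
    def query(self, group_keys: set) -> dict: ...

class QueryEngine:
    """F1-like layer: advanced query processing built on the warehouse."""
    def __init__(self, warehouse: DataWarehouse) -> None:
        self.warehouse = warehouse
    def execute(self, query: str) -> list: ...

# Each layer depends only on the interface of the layer beneath it, so one
# team can evolve a layer without touching the others, and clients other
# than the query engine can build directly on the warehouse.
```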