The database and systems communities have made great progress in developing database systems that allow us to store and query huge amounts of data. My first computer cost about $1,000; it was a Commodore 64 with a 170KB floppy disk drive. Today (September 2009), I can configure a 1.7TB file server for the same price. Companies are responding to this explosion in storage availability by building bigger and bigger data warehouses, where every digital trail we leave is stored for later analysis. Data arrives 24/7, and real-time, "on-the-fly" analysis, where answers are always available, is becoming mandatory. Here is where data stream processing comes to the rescue.
In data stream processing scenarios, data arrives at high speed and must be analyzed in the order it is received, using a limited amount of memory. The area has two main directions: a systems side and an algorithmic side. On the systems side, researchers have developed data stream processing systems that work like database systems turned upside down: long-running queries are registered with the system, and data is streamed through them. Startups now sell such systems for applications such as fraud detection, algorithmic trading, and network monitoring, often delivering at least an order-of-magnitude performance improvement over traditional database systems.
On the algorithmic side, there has been much research on novel one-pass algorithms. These algorithms have no need for secondary index structures, and they do not require expensive sorting operations. They are online: an answer to the query over the current prefix of the stream is available at any time. These so-called data stream algorithms achieve these properties by trading exact query answers for approximate ones, but the approximations come with provable quality guarantees.
The following paper by Graham Cormode and Marios Hadjieleftheriou gives an overview of recent progress on the important primitive of finding frequent items in a data stream. Informally, an item is frequent in a prefix of the stream if its relative frequency exceeds a user-defined threshold. Another formulation of the problem simply looks for the most frequently occurring items. The authors present an algorithmic framework that encompasses previous work, and they show the results of a thorough experimental comparison of the different approaches.
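To make the flavor of these one-pass, counter-based methods concrete, here is a minimal Python sketch of the classic Misra-Gries algorithm, one of the algorithms the survey covers; the function name and parameter choices are mine, for illustration only.

```python
def misra_gries(stream, k):
    """One pass over `stream` using at most k counters.

    Guarantee: any item whose true frequency exceeds n/(k+1)
    (with n the stream length) survives in the summary, and each
    stored count undercounts the true count by at most n/(k+1).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement all, evicting any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# With k = 2, any item making up more than a third of the stream
# is guaranteed to appear in the output.
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, k=2))  # {'a': 3}; the true count of 'a' is 5
```

The summary can answer the threshold query at any point in the stream, using memory independent of the stream length, which is exactly the online, bounded-memory behavior described above.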
This paper is especially timely, since some of these algorithms are already in use (and those that are in use are not necessarily the best, according to the authors). For example, inside Google's analysis infrastructure, in the MapReduce framework, there exist several prepackaged "aggregators" that collect statistics over a huge dataset in one pass. The "quantile" aggregate, which collects a value at each quantile of the data, uses a previously developed algorithm that is covered in the paper, and the "top" aggregate estimates the most popular values in a dataset, again using an algorithm captured by the framework in the paper.
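As an illustration of the sketch-based techniques the paper's framework also captures, and one way a "top"-style aggregate can be approximated, here is a minimal Count-Min sketch in Python. This is not Google's aggregator code; the class, the per-row salting of Python's built-in hash, and all parameters are assumptions standing in for proper pairwise-independent hash functions.

```python
import random

class CountMinSketch:
    """A depth x width grid of counters. With width about e/eps and
    depth about ln(1/delta), each estimate overcounts the true
    frequency by at most eps * n with probability at least 1 - delta."""

    def __init__(self, width, depth, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]

    def add(self, item, count=1):
        for row, salt in enumerate(self.salts):
            self.table[row][hash((salt, item)) % self.width] += count

    def estimate(self, item):
        # Take the least-contaminated counter across the rows.
        return min(self.table[row][hash((salt, item)) % self.width]
                   for row, salt in enumerate(self.salts))

cms = CountMinSketch(width=272, depth=5)
for value in ["x"] * 100 + ["y"] * 10 + ["z"]:
    cms.add(value)
print(cms.estimate("x"))  # never below 100, and usually very close to it
```

Paired with a small heap of the current best estimates, such a sketch yields an approximate top-k aggregate in a single pass.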
Within AT&T (the home institution of the authors), a variety of streaming algorithms are deployed today for network monitoring, based on real-time analysis of packet header data. For example, quantile aggregates are used to track the distribution of round-trip delays between different points in the network over time. Similarly, heavy-hitter aggregates are used to find sources that send the most traffic to a given destination over time.
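For heavy-hitter aggregates like these, the SpaceSaving algorithm, one of the counter-based methods evaluated in the paper, is a natural fit. The sketch below is a minimal illustration under names of my choosing, not AT&T's deployed code.

```python
def space_saving(stream, k):
    """One-pass SpaceSaving summary with k counters.

    Guarantee: every item with true count above n/k is retained,
    and each stored count overestimates the truth by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the current minimum; the newcomer inherits its
            # count plus one, which is where the overestimate comes from.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

# E.g., tracking which sources send the most packets to a destination:
sources = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]
print(space_saving(sources, k=2))  # {'10.0.0.1': 3, '10.0.0.3': 2}
```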
Although the paper surveys roughly 30 years of research, the area is still making rapid progress. Moreover, frequent items are just one statistic. In practice, much more sophisticated queries, such as finding frequent combinations of items, mining clusters, or fitting other statistical models, require data stream algorithms with quality guarantees of their own. For example, recent work from Yahoo! on content optimization shows how to use time series models that are built online to predict the click-through rate of an article based on the stream of user clicks.
I think we have only scratched the surface, both in applications and in novel algorithms, and I am looking forward to another 30 years of innovation. I recommend this paper to learn about the types of techniques that have been developed over the years and to see how ideas from algorithms, statistics, and databases have come together on this problem.