Are data synopsessuch as the hash-based sketches discussed by Li and König in the following paperstill needed for querying massive datasets? After all, disk and memory are cheap, and both modern multicore processors and data-processing platforms such as Hadoop are rendering Web scale datasets ever more amenable to parallel and distributed processing techniques. The answer to this question is a firm "yes!" Many important data-processing tasks, such as detecting duplicate Web pages, are not embarrassingly parallel and involve a huge number of operations. Even massively parallelizable tasks face formidable performance challenges: it has long been observed that data volumes are growing at faster rates than computing power, and this trend continues. Moreover, methods that rely solely on parallelism can be expensive. Indeed, under evolving "platform as a service" models for cloud computing, users will pay costs that directly reflect the computing resources they use. In this setting, use of synopsis techniques can lead to significant cost savings, as well as to energy savings and greener computing. The need to process streaming data exacerbates the pressures on data management systems and makes synopsis techniques even more appealing. So synopses such as samples, histograms, wavelets, and sketches will continue to play a key role in information management, data mining, and knowledge discovery. Reducing synopses' time and space requirements thus remains an important challenge.
The paper highlights the fascinating interplay between two synopses techniques: hashing and sampling. Random samples have long served as useful synopses because of their flexibility: a given sample can be used to estimate many different characteristics of the original dataset. This flexibility often comes at the price of limited accuracy, however. In particular, sampling does poorly at finding "needles in haystacks" and so fails at, for example, estimating the number of distinct values in a "skewed" dataset containing a few values that each occur millions of times along with a few hundred values that each occur once. Hash-based sketches, on the other hand, are typically applicable to only one particular task, but perform that task extremely well; a famous example is the Flajolet-Martin sketch for the number of distinct values, based on a single pass through the dataset. In general, hash-based sketches have proven to be extremely well suited for streaming applications, and for this reason have received much attention in recent years.
Although sampling and hashing might seem complementary, they are closely related. Indeed, suppose we apply a given hash function to each of the elements of a set and then select the element having the smallest hashed value. We can reasonably view this process as being equivalent to selecting an element of the set at random. Repeating this process with several different hash functions yields a random sample of set elements. (Strictly speaking, we need to view the hash functions as being randomly and uniformly selected from an appropriate family of hash functions, in order to put hashing into a probabilistic setting comparable to that of sampling.) The crucial property of this approach is that the same hash function can be used to sample from multiple sets, and this coordinated sampling leads to powerful methods for rapidly estimating the similarity between pairs of sets. Indeed, it has long been recognized in the statistics community that inducing correlations between samplesvia the technique of common random numberscan greatly reduce variability when estimating similarities between different populations or systems; hash-based sampling is another manifestation of this idea.
The minwise-hashing (minHash) algorithm originally developed by Andrei Broder and colleagues exploits hash-based sampling to estimate the "Jaccard" similarity between two sets in a simple and elegant way, by computing the fraction of hash functions that result in the same element being selected from both sets, or equivalently, the fraction of hash functions for which the minimum hashed values over each of the two sets coincide. Thus, the sketch for a given dataset is the collection of minimum hashed values for the different hash functions. This idea has been applied to problems ranging from the identification of duplicate Web pages or documents to applications in clustering, caching, online advertising, file-system management, and association-rule mining. The striking result in this paper is that, if the goal is to identify sets of high similarity, it suffices to retain only a few low-order bits of each hashed value in the original minHash sketch. The resulting modification to the original minHash algorithm is minor, but can reduce the space and estimation-time requirements by well over an order of magnitude. The challenge is to find the right estimator for the Jaccard similarity and then to theoretically verify both the correctness and superior properties of the resulting algorithm. An elegant use of probabilistic approximations accomplishes this goal. This work is an ideal blend of theory and application.
©2011 ACM 0001-0782/11/0800 $10.00
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.
No entries found