Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing. This paper describes b-bit minwise hashing, which in practice can provide an order-of-magnitude improvement in storage requirements and computational overhead over the original scheme.
We give both theoretical characterizations of the performance of the new algorithm as well as a practical evaluation on large real-life datasets and show that these match very closely. Moreover, we provide a detailed comparison with other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.
With the advent of the Internet, many applications are faced with very large and inherently high-dimensional datasets. A common task on these is similarity search, that is, given a high-dimensional data point, the retrieval of data points that are close under a given distance function. In many scenarios, the storage and computational requirements for computing exact distances between all data points are prohibitive, making data representations that allow compact storage and efficient approximate distance computation necessary.
In this paper, we describe b-bit minwise hashing, which leverages properties common to many application scenarios to obtain order-of-magnitude improvements in the storage space and computational overhead required for a given level of accuracy over existing techniques. Moreover, while the theoretical analysis of these gains is technically challenging, the resulting algorithm is simple and easy to implement.
To describe our approach, we first consider the concrete task of Web page duplicate detection, which is of critical importance in the context of Web search and was one of the motivations for the development of the original minwise hashing algorithm by Broder et al.2, 4 Here, the task is to identify pairs of pages that are textually very similar. For this purpose, Web pages are modeled as "a set of shingles," where a shingle corresponds to a string of w contiguous words occurring on the page. Now, given two such sets S1, S2 ⊆ Ω = {0, 1, ..., D − 1}, the normalized similarity known as resemblance or Jaccard similarity, denoted by R, is

R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a), where f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|.   (1)
Duplicate detection now becomes the task of detecting pairs of pages for which R exceeds a threshold value. Here, w is a tuning parameter and was set to w = 5 in several studies.2, 4, 7 Clearly, the total number of possible shingles is huge. Considering 10^5 unique English words, the total number of possible 5-shingles should be D = (10^5)^5 = O(10^25). A prior study7 used D = 2^64 and even earlier studies2, 4 used D = 2^40. Due to the size of D and the number of pages crawled as part of Web search, computing the exact similarities for all pairs of pages may require prohibitive storage and computational overhead, leading to approximate techniques based on more compact data structures.
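To make the setup concrete, the following minimal Python sketch builds w-shingles and computes the exact resemblance; the function names and the toy strings are ours, and production systems would hash each shingle to a compact integer rather than store word tuples.

```python
import re

def shingles(text, w=5):
    """Return the set of w-word shingles (as word tuples) of a document."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(s1, s2):
    """Exact resemblance (Jaccard similarity) R = |S1 ∩ S2| / |S1 ∪ S2|."""
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

# Toy example (w reduced to 2 so the short strings produce several shingles):
a = shingles("the quick brown fox jumps over the lazy dog", w=2)
b = shingles("the quick brown fox leaps over the lazy dog", w=2)
print(resemblance(a, b))
```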
1.1. Minwise hashing
To address this issue, Broder and his colleagues developed minwise hashing in their seminal work.2, 4 Here, we give a brief introduction to this algorithm. Suppose a random permutation Π is performed on Ω, that is,

Π: Ω → Ω, where Ω = {0, 1, ..., D − 1}.
An elementary probability argument shows that

Pr(min(Π(S1)) = min(Π(S2))) = |S1 ∩ S2| / |S1 ∪ S2| = R.   (2)
After k minwise independent permutations, Π1, Π2, ..., Πk, one can estimate R without bias, as a binomial probability:

R̂M = (1/k) Σ_{j=1}^{k} 1{min(Πj(S1)) = min(Πj(S2))},
Var(R̂M) = R(1 − R)/k.   (3)
We will frequently use the terms "sample" and "sample size" (i.e., k). For minwise hashing, a sample is a hashed value, min(Πj(Si)), which may require, for example, 64 bits.7
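The following Python sketch illustrates the estimator R̂M in (3). As a simplification (our assumption, not part of the original scheme), salted integer hash functions stand in for the minwise independent permutations Πj, which is how the permutations are commonly approximated in practice.

```python
import random

def min_hashes(s, k, seed=0):
    """k minwise samples of set s: for each of k salted hash functions
    (stand-ins for the permutations Π_j), keep the minimum hashed value."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) & 0xFFFFFFFFFFFFFFFF for x in s) for salt in salts]

def estimate_resemblance_minwise(h1, h2):
    """R̂_M: fraction of the k permutations whose minima collide."""
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)

s1, s2 = {"a", "b", "c", "d"}, {"a", "b", "c", "e"}   # true R = 3/5
k = 500
print(estimate_resemblance_minwise(min_hashes(s1, k), min_hashes(s2, k)))
```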
Since the original minwise hashing work,2, 4 there have been considerable theoretical and methodological developments.3, 5, 12, 14, 16, 17, 22
Applications: As a general technique for estimating set similarity, minwise hashing has been applied to a wide range of applications, for example, content matching for online advertising,23 detection of redundancy in enterprise file systems,8 syntactic similarity algorithms for enterprise information management,21 Web spam,25 etc.
Many of the applications of minwise hashing are targeted at detecting duplicates or pairs of somewhat high similarity. By proposing an estimator that is particularly accurate for these scenarios, we can reduce the required storage and computational overhead dramatically. Here, the computational savings are a function of how the min-wise hashes are used. For any technique that does compute the pairwise similarity for (a large subset of) all pairs, the computation is typically bound by the speed at which the samples can be brought into memory (as the computation itself is simple); hence, the space reduction our technique offers directly translates into order-of-magnitude speedup as well.
However, even with the data-size reduction, computing all pairwise similarities is prohibitively expensive in many scenarios. This has led to a number of approaches that avoid this computation by grouping (subsets of) the samples into buckets and only computing the pairwise similarities for items within the same (set of) buckets. This approach avoids the quadratic number of comparisons, at the cost of some loss in accuracy. Examples of such approaches are the supershingles4 or techniques based on locality-sensitive hashing (LSH)1, 5, 13 (also see Chapter 3 of Rajaraman and Ullman24 for an excellent detailed explanation of LSH and see Cohen et al.6 for nice applications of LSH ideas in mining associations).
1.2. b-Bit minwise hashing
In this paper, we establish a unified theoretical framework for b-bit minwise hashing. In our scheme, a sample consists of b bits only, as opposed to, for example, b = 64 bits7 in the original minwise hashing. Intuitively, using fewer bits per sample will increase the estimation variance, compared to (3), at the same sample size k. Thus, we will have to increase k to maintain the same accuracy. Interestingly, our theoretical results will demonstrate that, when resemblance is not too small (which is the case in many applications, e.g., consider R≥0.5, the threshold used in Broder et al.2, 4), we do not have to increase k much. This means our proposed b-bit minwise hashing can be used to improve estimation accuracy and significantly reduce storage requirements at the same time.
For example, when b = 1 and R = 0.5, the estimation variance will increase at most by a factor of 3. Thus, in order not to lose accuracy, we have to increase the sample size by a factor of 3. If we originally stored each hashed value using 64 bits, the improvement from using b = 1 will be 64/3 ≈ 21.3-fold.
Algorithm 1 illustrates the procedure of b-bit minwise hashing, based on the theoretical results in Section 2.
1.3. Related work
Locality sensitive hashing (LSH)1, 5, 13 is a set of techniques for performing approximate search in high dimensions. In the context of estimating set intersections, there exist LSH families for estimating the resemblance, the arccosine, and the hamming distance. Our b-bit minwise hashing proposes a new construction of an LSH family (Section 7.4).
Algorithm 1 The b-bit minwise hashing algorithm, applied to estimating pairwise resemblances in a collection of N sets.
Input: Sets Sn ⊆ Ω = {0, 1, ..., D − 1}, n = 1 to N.
Preprocessing
(1): Generate k random permutations Πj: Ω → Ω, j = 1 to k.
(2): For each set Sn and each permutation Πj, store the lowest b bits of min(Πj(Sn)), denoted by en,i,Πj, i = 1 to b.
Estimation: (Use two sets S1 and S2 as an example)
(1): Compute P̂b = (1/k) Σ_{j=1}^{k} ∏_{i=1}^{b} 1{e1,i,Πj = e2,i,Πj}.
(2): Estimate the resemblance by R̂b = (P̂b − C1,b) / (1 − C2,b), where C1,b and C2,b are from Theorem 1 in Section 2.
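A compact Python sketch of Algorithm 1 is given below. It stores only the lowest b bits of each minwise hash and applies the estimator of step (2); the constants C1,b and C2,b follow the closed forms stated in Theorem 1 (Section 2). As in the previous sketch, salted hash functions stand in for true random permutations (our simplification), and r1 = |S1|/D, r2 = |S2|/D are assumed to be known.

```python
import random

def bbit_minhash(s, k, b, seed=0):
    """Preprocessing: for each of k permutations, keep only the lowest b bits
    of the minimum hashed value (Algorithm 1, step 2)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    mask = (1 << b) - 1
    return [min(hash((salt, x)) & 0xFFFFFFFFFFFFFFFF for x in s) & mask
            for salt in salts]

def estimate_resemblance_bbit(h1, h2, b, r1, r2):
    """Estimation: R̂_b = (P̂_b - C1,b) / (1 - C2,b), with C1,b, C2,b from Theorem 1."""
    A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
    A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
    C1 = (A1 * r2 + A2 * r1) / (r1 + r2)
    C2 = (A1 * r1 + A2 * r2) / (r1 + r2)
    P_hat = sum(u == v for u, v in zip(h1, h2)) / len(h1)  # empirical collision rate
    return (P_hat - C1) / (1 - C2)
```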
In Charikar5 and Gionis et al.,10 the authors describe hashing schemes that map objects to {0, 1}. The algorithms for the construction, however, are problem specific. Three known 1-bit schemes are (i) simhash,5 based on sign random projection,11, 18 (ii) the hamming distance algorithm based on simple random sampling,13 and (iii) the hamming distance algorithm based on a variant of random projection.15
Section 4 will compare our method with two hamming distance algorithms.13, 15 We also wrote a report (http://www.stat.cornell.edu/~li/b-bit-hashing/RP_minwise.pdf), which demonstrated that, unless the similarity is very low, b-bit minwise hashing outperforms sign random projections.
A related approach is conditional random sampling (CRS),16, 17 which uses only a single permutation and, instead of a single minimum, retains a set of the smallest hashed values. CRS provides more accurate (in some scenarios substantially so) estimators for binary data and naturally extends to real-valued data and dynamic streaming data; moreover, the same set of hashed values can be used to estimate a variety of summary statistics including histograms, l_p distances (for any p), number of distinct values, χ2 distances, entropies, etc. However, we have not developed a b-bit scheme for CRS, which appears to be a challenging task.
Consider two sets S1, S2 ⊆ Ω = {0, 1, ..., D − 1}, with f1 = |S1|, f2 = |S2|, and a = |S1 ∩ S2|. Apply a random permutation Π on S1 and S2, where Π: Ω → Ω. Define the minimum values under Π to be z1 and z2:

z1 = min(Π(S1)),  z2 = min(Π(S2)).
Define e1,i = ith lowest bit of z1, and e2,i = ith lowest bit of z2. Theorem 1 derives the main probability formula. Its proof assumes that D is large, which is virtually always satisfied in practice. This result is a good example of approaching a difficult problem by reasonable approximations.
THEOREM 1. Assume D is large. Then

Pb = Pr(∏_{i=1}^{b} 1{e1,i = e2,i} = 1) = C1,b + (1 − C2,b) R,   (5)

where r1 = f1/D, r2 = f2/D, and

A1,b = r1(1 − r1)^(2^b − 1) / (1 − (1 − r1)^(2^b)),  A2,b = r2(1 − r2)^(2^b − 1) / (1 − (1 − r2)^(2^b)),
C1,b = A1,b r2/(r1 + r2) + A2,b r1/(r1 + r2),  C2,b = A1,b r1/(r1 + r2) + A2,b r2/(r1 + r2).
The intuition for the difference between (5) and the equivalent equation for minwise hashing (1) is that even when R = 0, the collision probability Pb (i.e., the probability that two minima agree on their last b bits) is not zero, but rather C1,b. Having to account for this type of "false positives" makes the derivation more difficult, resulting in the additional terms in (5). Of course, as expected, if R = 1, then Pb = 1 (because in this case r1 = r2 and C1,b = C2,b).
Note that the only assumption needed in the proof of Theorem 1 is that D is large, which is virtually always satisfied in practice. Interestingly, (5) is remarkably accurate even for very small D. Figure 1 shows that when D = 20 (D = 500), the absolute error caused by using (5) is <0.01 (<0.0004).
2.1. The unbiased estimator
Theorem 1 suggests an unbiased estimator R̂b for R:

R̂b = (P̂b − C1,b) / (1 − C2,b),   (9)
P̂b = (1/k) Σ_{j=1}^{k} ∏_{i=1}^{b} 1{e1,i,Πj = e2,i,Πj},   (10)

where e1,i,Πj (e2,i,Πj) denotes the ith lowest bit of z1 (z2) under the permutation Πj. The variance is

Var(R̂b) = (1/k) × Pb(1 − Pb) / (1 − C2,b)^2.   (11)
For large b, Var(R̂b) converges to the variance of R̂M, the estimator for the original minwise hashing:

Var(R̂M) = R(1 − R)/k.

In fact, when b = 64, Var(R̂b) and Var(R̂M) are numerically indistinguishable for practical purposes.
2.2. The variance-space trade-off
As we decrease b, the space needed for storing each "sample" will be smaller; the estimation variance (11) at the same sample size k, however, will increase. This variance-space trade-off can be precisely quantified by B(b; R, r1, r2):

B(b; R, r1, r2) = b × k × Var(R̂b) = b × Pb(1 − Pb) / (1 − C2,b)^2.   (12)

Lower B(b) is better. The ratio B(b1)/B(b2) measures the improvement of using b = b2 (e.g., b2 = 1) over using b = b1 (e.g., b1 = 64). Some algebra yields the following Lemma.
LEMMA 1. If r1 = r2 and b1 > b2, then

B(b1)/B(b2) = (b1/b2) × (1 − A1,b2)[A1,b1 + (1 − A1,b1)R] / ((1 − A1,b1)[A1,b2 + (1 − A1,b2)R]),

which is a monotonically increasing function of R ∈ [0, 1].

If R → 1 and, in addition, r1, r2 → 1, then

B(b1)/B(b2) → b1/b2.

If r1 = r2, b2 = 1, b1 = 64 (hence we treat A1,b1 = 0), then

B(64)/B(1) = 64 R (1 − A1,1) / (A1,1 + (1 − A1,1)R).
Suppose the original minwise hashing used b = 64; then the maximum improvement of b-bit minwise hashing would be 64-fold, attained when r1 = r2 = 1 and R = 1. In the least favorable situation, that is, r1, r2 → 0, the improvement will still be 64R/(1 + R), that is, 21.3-fold when R = 0.5.
Figure 2 plots B(64)/B(b) to directly visualize the relative improvements, which are consistent with what Lemma 1 predicts. The plots show that, when R is very large (which is the case in many practical applications), it is always good to use b = 1. However, when R is small, using larger b may be better. The cut-off point depends on r1, r2, and R. For example, when r1 = r2 and both are small, it would be better to use b = 2 than b = 1 if R < 0.4, as shown in Figure 2.
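The trade-off is easy to explore numerically. The short script below (our own illustration, not taken from the paper) evaluates B(b) from (12) and prints the relative improvement B(64)/B(b) for a few values of R, using r1 = r2 = 0.1 as an example setting.

```python
def A(r, b):
    return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

def B(b, R, r1, r2):
    """Variance-space factor B(b; R, r1, r2) = b * k * Var(R̂_b), Equation (12)."""
    A1, A2 = A(r1, b), A(r2, b)
    C1 = (A1 * r2 + A2 * r1) / (r1 + r2)
    C2 = (A1 * r1 + A2 * r2) / (r1 + r2)
    P = C1 + (1 - C2) * R
    return b * P * (1 - P) / (1 - C2) ** 2

# Relative storage improvement of b bits over 64 bits (Figure-2 style numbers):
for R in (0.2, 0.5, 0.8):
    ratios = [round(B(64, R, 0.1, 0.1) / B(b, R, 0.1, 0.1), 1) for b in (1, 2, 3, 4)]
    print(f"R = {R}: B(64)/B(b) for b = 1..4 -> {ratios}")
```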
In the following, we evaluate the accuracy of the theoretical derivation and the practical performance of our approach using two sets of experiments. Experiment 1 is a sanity check, to verify: (i) our proposed estimator in (9) is unbiased and (ii) its variance follows the prediction by our formula in (11). Experiment 2 is a duplicate detection task using a Microsoft proprietary collection of 1,000,000 news articles.
3.1. Experiment 1
The data, extracted from Microsoft Web crawls, consist of six pairs of sets. Each set consists of the IDs of the documents that contain a given word at least once. We use b-bit minwise hashing to estimate the similarities of these sets (i.e., we estimate the strength of the word associations).
Table 1 summarizes the data and provides the theoretical improvements B(64)/B(b). The words were selected to include highly frequent pairs (e.g., "OF-AND"), highly rare pairs (e.g., "GAMBIA-KIRIBATI"), highly unbalanced pairs (e.g., "A-TEST"), highly similar pairs (e.g., "KONG-HONG"), as well as pairs that are not very similar (e.g., "LOW-PAY").
We estimate the resemblance using the original minwise hashing estimator R̂M and the b-bit estimator R̂b (b = 1, 2, 3).
Figure 3 plots the empirical mean square errors (MSE = variance + bias^2) in solid lines and the theoretical variances (11) in dashed lines for all word pairs. All dashed lines are invisible because they overlap with the corresponding solid curves. Thus, this experiment validates that the variance formula (11) is accurate and that R̂b is indeed unbiased (otherwise, the MSE would differ from the variance).
3.2. Experiment 2
To illustrate the improvements from the use of b-bit minwise hashing in a real-life application, we conducted a duplicate detection experiment using a corpus of 10^6 news documents. The dataset was crawled as part of the BLEWS project at Microsoft.9 We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R0. We estimate the resemblances using R̂b with b = 1, 2, 4 bits and using the original minwise hashing estimator R̂M. Figure 4 presents the precision and recall curves. The recall values (bottom two panels in Figure 4) are all very high and do not differentiate the estimators.
The precision curves for R̂b=4 (using 4 bits per sample) and R̂M (assuming 64 bits per sample) are almost indistinguishable, suggesting a 16-fold improvement in space by using b = 4.
When using b = 1 or 2, the space improvements are normally around 20- to 40-fold compared to R̂M (assuming 64 bits per sample), especially for achieving high precision.
Closely related to the resemblance, the hamming distance H is another important similarity measure. In the context of the hamming distance, a set Si is mapped to a D-dimensional binary vector Yi: Yi,t = 1 if t ∈ Si and Yi,t = 0 otherwise. The hamming distance between Y1 and Y2 is

H = Σ_{t=0}^{D−1} 1{Y1,t ≠ Y2,t} = f1 + f2 − 2a.
Thus, one can apply b-bit minwise hashing to estimate H, by converting the estimated resemblance R̂b (9) to Ĥb:

Ĥb = (f1 + f2)(1 − R̂b) / (1 + R̂b).   (16)
The variance of Ĥb can be computed from Var(R̂b) (11) by the "delta method" (i.e., Var(g(R̂b)) ≈ [g′(R)]^2 Var(R̂b)):

Var(Ĥb) ≈ 4(f1 + f2)^2 / (1 + R)^4 × Var(R̂b).   (17)
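In code, the conversion is a one-liner; the helper below (name ours) plugs a b-bit resemblance estimate into (16).

```python
def hamming_from_resemblance(R_hat, f1, f2):
    """Ĥ_b = (f1 + f2)(1 - R̂_b)/(1 + R̂_b), obtained from H = f1 + f2 - 2a
    and a = R (f1 + f2) / (1 + R)."""
    return (f1 + f2) * (1 - R_hat) / (1 + R_hat)
```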
We will first compare with an algorithm based on simple random sampling13 and then with another algorithm based on a variant of random projection.15
4.1. Simple random sampling algorithm
To reduce the storage, we can randomly sample k coordinates from the original data Y1 and Y2 in D dimensions. The samples, denoted by h1 and h2, are k-dimensional bit vectors, from which we can estimate H:

Ĥs = (D/k) Σ_{t=1}^{k} 1{h1,t ≠ h2,t},   (18)
whose variance would be (assuming k ≪ D)

Var(Ĥs) = (D^2/k) × (H/D)(1 − H/D).   (19)
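For reference, a minimal sketch of this baseline (sets are represented by their nonzero coordinates; the D/k rescaling is the natural unbiased correction in (18)):

```python
import random

def srs_hamming(S1, S2, D, k, seed=0):
    """Estimate H by sampling k of the D coordinates uniformly (without
    replacement) and rescaling the observed number of disagreements."""
    rng = random.Random(seed)
    coords = rng.sample(range(D), k)
    disagreements = sum((t in S1) != (t in S2) for t in coords)
    return D * disagreements / k
```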
Comparing the two variances, (17) and (19), we find that the variance of simple random sampling, that is, Var(Ĥs), is substantially larger than the variance of b-bit minwise hashing, that is, Var(Ĥb), especially when the data are sparse. In practice, one would most likely implement the random sampling algorithm by storing only the original locations (coordinates) of the nonzeros in the samples. If we do so, a sample of k coordinates from a set with density r = f/D requires about k r log2 D bits on average (per set). This motivates us to define a ratio, Gs,b (20), to compare the storage costs of the two estimators at a given accuracy. Recall each sample of b-bit minwise hashing requires b bits (i.e., bk bits per set). The following Lemma may help characterize the improvement:
LEMMA 2. If r1, r2 → 0, then Gs,b, as defined in (20), converges to a closed-form limit.
In other words, for small r1 and r2, the improvement no longer depends on the individual densities, and it remains substantial both for R ≈ 0 and for R ≈ 1. Figure 5 plots Gs,b=1, verifying the substantial improvement of b-bit minwise hashing over simple random sampling (often 10- to 30-fold).
4.2. Random projection + modular arithmetic
An interesting 1-bit scheme was developed in Kushilevitz et al.15 using random projection followed by modular arithmetic. A random matrix U ∈ {0, 1}^{D×k} is generated with i.i.d. Bernoulli entries uij: uij = 1 with probability β and uij = 0 with probability 1 − β. Let ν1 = Y1 × U (mod 2) and ν2 = Y2 × U (mod 2). Kushilevitz et al.15 showed that

Eβ = Pr(ν1,j ≠ ν2,j) = (1 − (1 − 2β)^H) / 2,   (22)

which allows us to estimate the hamming distance H by

Ĥrp = log(1 − 2Êβ) / log(1 − 2β), where Êβ = (1/k) Σ_{j=1}^{k} 1{ν1,j ≠ ν2,j}.   (23)
We calculate the variance of Ĥrp to be

Var(Ĥrp) = ((1 − 2β)^{−2H} − 1) / (k [ln(1 − 2β)]^2),   (24)
which suggests that the performance of this 1-bit scheme might be sensitive to β, which must be predetermined for all sets at processing time (i.e., it cannot be modified in the estimation phase for a particular pair). Figure 6 provides the "optimal" β values (denoted by β*), as a function of H, obtained by numerically minimizing the variance (24).
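A small sketch of the scheme follows (our own implementation of the construction described above; the per-coordinate lazy sampling of U simply avoids materializing a D × k matrix).

```python
import math
import random

def parity_sketch(S, k, beta, seed=0):
    """ν = Y × U (mod 2): bit j is the parity of the number of elements of S
    falling into the j-th random subset U_j, where each coordinate joins U_j
    independently with probability β."""
    def in_Uj(j, t):
        return random.Random(hash((seed, j, t))).random() < beta
    return [sum(in_Uj(j, t) for t in S) % 2 for j in range(k)]

def estimate_hamming_rp(v1, v2, beta):
    """Invert (22): Ĥ_rp = log(1 - 2 Ê_β) / log(1 - 2β), per (23)."""
    E_hat = sum(a != b for a, b in zip(v1, v2)) / len(v1)
    if E_hat >= 0.5:            # (23) is undefined; β was too large for this pair
        return float("inf")
    return math.log(1 - 2 * E_hat) / math.log(1 - 2 * beta)
```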
It is interesting to compare this random projection-based 1-bit scheme with our b-bit minwise hashing using the ratio of their variances, Grp,b,β, defined in (25).
Figure 7 shows that if it is possible to choose the optimal β* for random projection, one can achieve good performance, similar to (or even better than) b-bit minwise hashing.
The problem is that we must choose the same β for all sets. Figure 8 presents a typical example, which uses H*/D = 10^−4 to compute the "optimal" β for a wide range of (r1, r2, s) values. The bottom left panel illustrates that, when r1 = 10^−4, this particular choice of β results in fairly good performance compared to b-bit minwise hashing. (Recall H/D = r1 + r2 − 2s.) As soon as the true H deviates substantially from the guessed H*, the performance of random projection degrades dramatically.
There is one more issue. At the optimal β*(H), our calculations show that the probability (22) is Eβ* ≈ 0.2746. However, if the chosen β > β*(H), then Eβ may approach 1/2. As Êβ is random, it is then likely that the observed Êβ > 1/2, in which case log(1 − 2Êβ) becomes undefined in (23). Thus, it is safer to "overestimate" H when choosing β. When we have a large collection of sets, this basically means that the chosen β will be far from its optimal value for most pairs.
Finally, Figure 9 provides an empirical sanity check showing that the variance formula (24) is indeed accurate and that, if the guessed H used for selecting β deviates from the true H, the random projection estimator Ĥrp exhibits much larger errors than the b-bit hashing estimator Ĥb.
Our theoretical and empirical results have confirmed that, when the resemblance R is reasonably high, even a single bit per sample may contain sufficient information for accurately estimating the similarity. This naturally leads to the conjecture that, when R is close to 1, one might further improve the performance by looking at a combination of multiple bits (i.e., "b < 1"). One simple approach is to combine two bits from two permutations using XOR (⊕) operations.
Recall e1,1,Π denotes the lowest bit of the hashed value under Π. Theorem 1 has proved that

Pr(e1,1,Π = e2,1,Π) = P1 = C1,1 + (1 − C2,1) R.
Consider two permutations Π1 and Π2. We store

x1 = e1,1,Π1 ⊕ e1,1,Π2,  x2 = e2,1,Π1 ⊕ e2,1,Π2.
Then x1 = x2 either when e1,1,Π1 = e2,1,Π1 and e1,1,Π2 = e2,1,Π2, or when e1,1,Π1 ≠ e2,1,Π1 and e1,1,Π2 ≠ e2,1,Π2. Thus,

Pr(x1 = x2) = P1^2 + (1 − P1)^2,   (26)
which is a quadratic equation in P1 with solution

P1 = (1 + sqrt(2 Pr(x1 = x2) − 1)) / 2,  R̂1/2 = (P1 − C1,1) / (1 − C2,1),   (27)

where in practice Pr(x1 = x2) is replaced by its empirical frequency over the k samples, and the larger root is taken (appropriate when R is not small).
This estimator is slightly biased at small sample sizes k. We use the notation R̂1/2 to indicate that two bits are combined into one (but each sample is still stored using 1 bit). The asymptotic variance of R̂1/2, given in (28), can be derived by another application of the delta method.
Interestingly, as R → 1, R̂1/2 does twice as well as R̂b=1:

lim_{R→1} Var(R̂1/2) / Var(R̂b=1) = 1/2.
On the other hand, R̂1/2 may not be good when R is not too large. For example, one can numerically show that R̂1/2 has larger variance than R̂b=1 unless R is fairly close to 1. Figure 10 plots the empirical MSEs for two word pairs in Experiment 1, for R̂1/2 and the b-bit estimators. For the highly similar pair "KONG-HONG," R̂1/2 exhibits superior performance compared to R̂b=1. For "UNITED-STATES," whose R = 0.591, R̂1/2 performs similarly to R̂b=1.
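A sketch of the XOR-combined estimator is given below, under the same assumptions as the earlier sketches; in particular, clamping the empirical frequency at 1/2 before taking the square root is our choice for handling the boundary case, not something prescribed by the derivation above.

```python
import math

def estimate_R_half(x1, x2, C11, C21):
    """R̂_1/2 from the stored XOR-combined bits x1, x2 (one bit per pair of
    permutations): solve Pr(x1 = x2) = P1^2 + (1 - P1)^2 for P1 (larger root),
    then invert P1 = C1,1 + (1 - C2,1) R."""
    T_hat = sum(a == b for a, b in zip(x1, x2)) / len(x1)
    T_hat = max(T_hat, 0.5)                 # guard: 2*T - 1 must be nonnegative
    P1 = 0.5 * (1.0 + math.sqrt(2.0 * T_hat - 1.0))
    return (P1 - C11) / (1.0 - C21)
```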
In summary, for applications which care about very high similarities, combining bits can reduce storage even further.
When computing set similarity for large collections of samples, the key operation is determining the number of identical b-bit samples. While samples whose size is a multiple of 16 bits can easily be compared using a single machine instruction, efficiently computing the overlap between b-bit samples for small b is less straightforward. In the following, we describe techniques for computing the number of identical b-bit samples when these are packed into arrays Al, l = 1, 2, of w-bit words. To compute the number of identical b-bit samples, we iterate through the arrays; for each offset h, we first compute v = A1[h] ⊕ A2[h]. The number of b-bit blocks in v that contain only 0s then corresponds to the number of identical b-bit samples.
The case of b = 1 corresponds to the problem of counting the number of 0-bits in a word. We tested a number of different methods and found the fastest approach to be precomputing an array bits[1, ..., 2^16] such that bits[t] corresponds to the number of 0-bits in the binary representation of t, and using lookups into this array. This approach extends to b > 1 as well.
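The sketch below illustrates the lookup-table idea for any b that divides 16 (the variable names are ours): within each XORed word it counts the b-bit blocks that are entirely zero, i.e., the positions where the two samples agree.

```python
def build_zero_block_table(b):
    """table[t] = number of b-bit blocks of the 16-bit word t that are all zero."""
    blocks_per_word = 16 // b
    mask = (1 << b) - 1
    table = [0] * (1 << 16)
    for t in range(1 << 16):
        table[t] = sum(((t >> (b * i)) & mask) == 0 for i in range(blocks_per_word))
    return table

def count_identical_samples(A1, A2, table):
    """A1, A2: equal-length lists of 32-bit words holding the packed b-bit samples."""
    matches = 0
    for w1, w2 in zip(A1, A2):
        v = w1 ^ w2                         # agreeing samples become all-zero blocks
        matches += table[v & 0xFFFF] + table[(v >> 16) & 0xFFFF]
    return matches
```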
To evaluate this approach we timed a tight loop computing the number of identical samples in two arrays of b-bit hashes covering a total of 1.8 billion 32-bit words (using a 64-bit Intel 6600 Processor). Here, the 1-bit hashing requires 1.67× the time that the 32-bit minwise hashing requires (1.73× when comparing to 64-bit minwise hashing). The results were essentially identical for b = 2, 4, 8. Given that, when R > 0.5, we can gain a storage reduction of 21.3-fold, we expect the resulting improvement in computational efficiency to be 21.3/1.67 = 12.8-fold in the above setup.
7.1. Three-way resemblance
Many applications in data mining or data cleaning require estimates not only of two-way but also of multi-way similarities. The original minwise hashing naturally extends to multi-way resemblance. In Li et al.,19 we extended b-bit minwise hashing to estimate three-way resemblance. We developed a highly accurate, but complicated, estimator, as well as a much simplified estimator suitable for sparse data. Interestingly, at least b = 2 bits are needed in order to estimate three-way resemblance. Similar to the two-way case, b-bit minwise hashing can result in an order-of-magnitude reduction in the storage space required for a given estimation accuracy when testing for moderate to high similarity.
7.2. Large-scale machine learning
A different category of applications for b-bit minwise hashing is machine learning on very large datasets. For example, one of our projects20 focuses on linear support vector machines (SVM). We were able to show that the resemblance matrix, the minwise hashing matrix, and the b-bit minwise hashing matrix are all positive definite matrices (kernels), and we integrated b-bit minwise hashing with linear SVM. This allows us to significantly speed up training and testing times with almost no loss in classification accuracy for many practical scenarios. In addition, this provides an elegant solution to the problem of SVM training in scenarios where the training data cannot fit in memory.
Interestingly, the technique we used for linear SVM essentially provides a universal strategy for integrating b-bit minwise hashing with many other learning algorithms, for example, logistic regression.
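As a rough illustration of that integration (a sketch based on our reading of Li et al.,20 with hypothetical helper names), each b-bit sample can be expanded into a 2^b-dimensional indicator, so that a set becomes a sparse binary vector of length k × 2^b which can be fed to any off-the-shelf linear learner.

```python
def bbit_hashes_to_features(samples, b):
    """Expand k b-bit samples into sparse one-hot features: sample j with value v
    activates coordinate j * 2**b + v.  The inner product of two such vectors
    counts matching samples, i.e., it is an (affine transform of the) b-bit
    resemblance estimate, which is why linear learners work well on it."""
    width = 1 << b
    return {j * width + int(v): 1.0 for j, v in enumerate(samples)}
```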
7.3. Improving estimates by maximum likelihood estimators
While b-bit minwise hashing is particularly effective in applications which mainly concern sets of high similarity (e.g., R > 0.5), there are other important applications in which not only the highly similar pairs matter. For example, many learning algorithms require all pairwise similarities, and it is expected that only a small fraction of the pairs are similar. Furthermore, many applications care more about containment (e.g., which fraction of one set is contained in another set) than about resemblance. In a recent technical report (http://www.stat.cornell.edu/~li/b-bit-hashing/AccurateHashing.pdf), we showed that the estimators for minwise hashing and b-bit minwise hashing used in current practice can be systematically improved, and that the improvements are most significant for set pairs of low resemblance and high containment.
For minwise hashing, instead of only using Pr(z1 = z2), where z1 and z2 are two hashed values, we can combine it with Pr(z1 < z2) and Pr(z1 > z2) to form a three-cell multinomial estimation problem, whose maximum likelihood estimator (MLE) is the solution to a cubic equation. For b-bit minwise hashing, we formulate a 2b × 2b-cell multinomial problem, whose MLE requires a simple numerical procedure.
7.4. The new LSH family
Applications such as near neighbor search, similarity clustering, and data mining will significantly benefit from b-bit minwise hashing. It is clear that b-bit minwise hashing will significantly improve the efficiency of simple linear algorithms (for near neighbor search) or simple quadratic algorithms (for similarity clustering), when the key bottleneck is main-memory throughput.
Techniques based on LSH1, 5, 13 have been successfully used to achieve sub-linear (for near neighbor search) or sub-quadratic (for similarity clustering) performance. Interestingly, b-bit minwise hashing yields a new family of LSH functions; hence, in this section, we provide more theoretical properties in the context of LSH and approximate near neighbor search.
Consider a set S1. Suppose there exists another set S2 whose resemblance distance (1 − R) from S1 is at most d0, that is, 1 − R ≤ d0. The goal of c-approximate d0-near neighbor algorithms is to return sets (with high probability) whose resemblance distances from S1 are at most c × d0, with c > 1.
Recall z1 and z2 denote the minwise hashed values for sets S1 and S2, respectively. The performance of the LSH algorithm depends on the difference (gap) between the following P(1) and P(2) (respectively corresponding to d0 and cd0):

P(1) = Pr(z1 = z2) ≥ 1 − d0, when 1 − R ≤ d0,
P(2) = Pr(z1 = z2) ≤ 1 − c d0, when 1 − R ≥ c d0.
A larger gap between P(1) and P(2) implies a more efficient LSH algorithm. The following "ρ" value (ρM for minwise hashing) characterizes the gap:

ρM = log(1 − d0) / log(1 − c d0).   (30)
A smaller ρ (i.e., a larger difference between P(1) and P(2)) leads to a more efficient LSH algorithm and is particularly desirable.1, 13 The general LSH theoretical result tells us that the query time for c-approximate d0-near neighbor search is dominated by O(N^ρ) distance evaluations, where N is the total number of sets in the collection.
Recall Pb, as defined in (5), denotes the collision probability for b-bit minwise hashing. The ρb value for c-approximate d0-near neighbor search can be computed as follows:

ρb = log(C1,b + (1 − C2,b)(1 − d0)) / log(C1,b + (1 − C2,b)(1 − c d0)).   (31)
Figure 11 suggests that b-bit minwise hashing can potentially achieve very similar ρ values compared to the original minwise hashing, when the applications care mostly about highly similar sets (e.g., d0 = 0.1, the top panels of Figure 11), even using merely b = 1. If the applications concern sets that are not necessarily highly similar (e.g., d0 = 0.5, the bottom panels), using b = 3 or 4 will still have similar ρ values as using the original minwise hashing.
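The two ρ values are straightforward to compute; the small helper below (ours) reproduces Figure-11-style numbers from (30) and (31).

```python
import math

def rho_minwise(d0, c):
    """ρ_M = log(1 - d0) / log(1 - c*d0), Equation (30)."""
    return math.log(1 - d0) / math.log(1 - c * d0)

def rho_bbit(d0, c, b, r1, r2):
    """ρ_b from (31): evaluate P_b = C1,b + (1 - C2,b) R at R = 1 - d0 and R = 1 - c*d0."""
    def A(r):
        return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))
    A1, A2 = A(r1), A(r2)
    C1 = (A1 * r2 + A2 * r1) / (r1 + r2)
    C2 = (A1 * r1 + A2 * r2) / (r1 + r2)
    P = lambda R: C1 + (1 - C2) * R
    return math.log(P(1 - d0)) / math.log(P(1 - c * d0))

print(rho_minwise(0.1, 2), rho_bbit(0.1, 2, 1, 0.1, 0.1))
```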
We expect that these theoretical properties regarding the ρ values will potentially be useful in future work. We are currently developing new variants of LSH algorithms for near neighbor search based on b-bit minwise hashing.
Subsequent documents will be made available at www.stat.cornell.edu/~li/b-bit-hashing, which is a repository for maintaining the papers and technical reports related to b-bit minwise hashing.
Minwise hashing is a standard technique for efficiently estimating set similarity in massive datasets. In this paper, we gave an overview of b-bit minwise hashing, which modifies the original scheme by storing the lowest b bits of each hashed value. We proved that, when the similarity is reasonably high (e.g., resemblance ≥ 0.5), using b = 1 bit per hashed value can, even in the worst case, gain a 21.3-fold improvement in storage space (at similar estimation accuracy), compared to storing each hashed value using 64 bits. As many applications are primarily interested in identifying duplicates of reasonably similar sets, these improvements can result in substantial reduction in storage (and consequently computational) overhead in practice.
We also compared our scheme to other approaches that map the hashed objects to single bits, both in theory as well as experimentally.
Our proposed method is simple and requires only minimal modification to the original minwise hashing algorithm. It can be used in the context of a number of different applications, such as duplicate detection, clustering, similarity search, and machine learning, and we expect that it will be adopted in practice.
Acknowledgment
This work is supported by NSF (DMS-0808864), ONR (YIP-N000140910911), and a grant from Microsoft. We thank the Board Members for suggesting a direct comparison with Kushilevitz et al.15
1. Andoni, A., Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51 (2008), 117–122.
2. Broder, A.Z. On the resemblance and containment of documents. In The Compression and Complexity of Sequences (Positano, Italy, 1997), 21–29.
3. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M. Min-wise independent permutations. J. Comput. Syst. Sci. 60, 3 (2000), 630–659.
4. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G. Syntactic clustering of the web. In WWW (Santa Clara, CA, 1997), 1157–1166.
5. Charikar, M.S. Similarity estimation techniques from rounding algorithms. In STOC (Montreal, Quebec, Canada, 2002), 380–388.
6. Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J.D., Yang, C. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng. 13, 1 (2001), 64–78.
7. Fetterly, D., Manasse, M., Najork, M., Wiener, J.L. A large-scale study of the evolution of web pages. In WWW (Budapest, Hungary, 2003), 669–678.
8. Forman, G., Eshghi, K., Suermondt, J. Efficient detection of large-scale redundancy in enterprise file systems. SIGOPS Oper. Syst. Rev. 43, 1 (2009), 84–91.
9. Gamon, M., Basu, S., Belenko, D., Fisher, D., Hurst, M., König, A.C. BLEWS: Using blogs to provide context for news articles. In AAAI Conference on Weblogs and Social Media (Redmond, WA, 2008).
10. Gionis, A., Gunopulos, D., Koudas, N. Efficient and tunable similar set retrieval. In SIGMOD (Santa Barbara, CA, 2001), 247–258.
11. Goemans, M.X., Williamson, D.P. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42, 6 (1995), 1115–1145.
12. Indyk, P. A small approximately min-wise independent family of hash functions. J. Algorithms 38, 1 (2001), 84–90.
13. Indyk, P., Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC (Dallas, TX, 1998), 604–613.
14. Itoh, T., Takei, Y., Tarui, J. On the sample size of k-restricted min-wise independent permutations and other k-wise distributions. In STOC (San Diego, CA, 2003), 710–718.
15. Kushilevitz, E., Ostrovsky, R., Rabani, Y. Efficient search for approximate nearest neighbor in high dimensional spaces. In STOC (Dallas, TX, 1998), 614–623.
16. Li, P., Church, K.W. A sketch algorithm for estimating two-way and multi-way associations. Comput. Linguist. 33, 3 (2007), 305–354. (Preliminary results appeared in HLT/EMNLP 2005.)
17. Li, P., Church, K.W., Hastie, T.J. One sketch for all: Theory and applications of conditional random sampling. In NIPS (Vancouver, British Columbia, Canada, 2008). (Preliminary results appeared in NIPS 2006.)
18. Li, P., Hastie, T.J., Church, K.W. Improving random projections using marginal information. In COLT (Pittsburgh, PA, 2006), 635–649.
19. Li, P., König, A.C., Gui, W. b-Bit minwise hashing for estimating three-way similarities. In NIPS (Vancouver, British Columbia, Canada, 2010).
20. Li, P., Moore, J., König, A.C. b-Bit minwise hashing for large-scale linear SVM. Technical report, 2011. http://www.stat.cornell.edu/~li/b-bit-hashing/HashingSVM.pdf
21. Cherkasova, L., Eshghi, K., Morrey III, C.B., Tucek, J., Veitch, A. Applying syntactic similarity algorithms for enterprise information management. In KDD (Paris, France, 2009), 1087–1096.
22. Manasse, M., McSherry, F., Talwar, K. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
23. Pandey, S., Broder, A., Chierichetti, F., Josifovski, V., Kumar, R., Vassilvitskii, S. Nearest-neighbor caching for content-match applications. In WWW (Madrid, Spain, 2009), 441–450.
24. Rajaraman, A., Ullman, J. Mining of Massive Datasets. http://i.stanford.edu/ullman/mmds.html
25. Urvoy, T., Chauveau, E., Filoche, P., Lavergne, T. Tracking web spam with HTML style similarities. ACM Trans. Web 2, 1 (2008), 1–28.
The original version of this paper, entitled "b-Bit Minwise Hashing," was published in Proceedings of the 19th International World Wide Web Conference (WWW 2010), Raleigh, NC.
Figure 1. The absolute errors (approximate − exact) from using (5) are very small, even for D = 20 (left panels) or D = 500 (right panels). The exact probability can be numerically computed for small D (from a probability matrix of size D × D). For each D, we selected three f1 values. We always let f2 = 2, 3, ..., f1 and a = 0, 1, 2, ..., f2.
Figure 2. B(64)/B(b), the relative storage improvement of using b = 1, 2, 3, 4 bits, compared to using 64 bits. B(b) is defined in (12).
Figure 3. Mean square errors (MSEs). "M" denotes the original minwise hashing. "Theor." denotes the theoretical variances Var(R̂b) (11) and Var(R̂M) (3). The dashed curves, however, are invisible because the empirical MSEs overlap the theoretical variances. At the same k, Var(R̂b) is larger than Var(R̂M). However, R̂b only requires 1/2/3 bits per sample, while R̂M may require 64 bits.
Figure 4. The task is to retrieve news article pairs with resemblance R ≥ R0. The recall curves cannot differentiate the estimators. The precision curves are more interesting. When R0 = 0.4, to achieve a precision of 0.80, the estimators R̂M, R̂b=4, R̂b=2, and R̂b=1 require k = 50, 50, 75, and 145, respectively, indicating that R̂b=4, R̂b=2, and R̂b=1 improve over R̂M (assuming 64 bits per sample) by 16-, 21.4-, and 22-fold. The improvement becomes larger as R0 increases.
Figure 5. Gs,b=1, as defined in (20), illustrating the improvement of b-bit minwise hashing over simple random sampling. We consider r1 = 10^−4, 0.1, 0.5, 0.9, with r2 ranging from 0.1 r1 to 0.9 r1 and s from 0 to r2. Note that r1 + r2 − s ≤ 1 has to be satisfied.
Figure 6. Left panel: the β* values at which the smallest variances (24) are attained. Right panel: the corresponding optimal variances.
Figure 7. Grp,b=1,β*, as defined in (25), for comparing Ĥrp with Ĥb=1. For each combination of r1, r2, s, we computed the optimal β* and used it in the variance (24).
Figure 8. Grp,b=1,β (25), computed using a fixed β, namely the optimal β when H = H* = 10^−4 D.
Figure 9. The exact H/D for this pair of sets is 0.0316. We use the optimal β at H*/D = 0.1 (left panel) and H*/D = 0.5 (right panel). Compared to Ĥb=1, the MSEs of Ĥrp (labeled "rp") are substantially larger. The theoretical variances (dashed lines), (24) and (17), essentially overlap the empirical MSEs.
Figure 10. MSEs for comparing R̂1/2 (27) with the b-bit estimators. Due to the bias of R̂1/2, the theoretical variances Var(R̂1/2), that is, (28), deviate from the empirical MSEs when k is small.
Figure 11. ρM and ρb defined in (30) and (31) for measuring the potential performance of LSH algorithms. "M" denotes the original minwise hashing.