nips nips2013 nips2013-169 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Leonid Boytsov, Bilegsaikhan Naidan
Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-toprune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. 1
Reference: text
sentIndex sentText sentNum sentScore
1 no Abstract Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. [sent-4, score-0.51]
2 Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. [sent-6, score-0.209]
3 The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. [sent-8, score-0.226]
4 In our work we use the Euclidean distance (L2 ), the KL-divergence ( xi log xi /yi ), and the Itakura-Saito distance ( xi /yi − log xi /yi − 1). [sent-12, score-0.196]
5 Both KL-divergence and the Itakura-Saito distance belong to a class of distances called Bregman divergences. [sent-14, score-0.177]
6 Our interest is in the nearest neighbor (NN) search, i. [sent-15, score-0.224]
7 In the case of the hierarchical space decomposition, a retrieval process is a recursion that employs an “oracle” procedure. [sent-22, score-0.171]
8 The oracle allows one to prune partitions without directly comparing the query against data points in these partitions. [sent-24, score-0.322]
9 To this end, the oracle assesses the query and estimates which partitions may contain an answer and, therefore, should be recursively analyzed. [sent-25, score-0.211]
10 In metric spaces, one can use the classifier based on the triangle inequality. [sent-27, score-0.171]
11 There are numerous data structures that speedup the NN-search by creating hierarchies of partitions at index time, most notably the VP-tree [28, 31] and the KD-tree [4]. [sent-29, score-0.192]
12 , [21]) and, thus, trading index size for retrieval time. [sent-36, score-0.228]
13 The approximate NN-queries are less affected by the curse of the dimensionality, because it is possible to reduce retrieval time at the cost of missing some relevant answers [18, 9, 25]. [sent-37, score-0.214]
14 In metric spaces, it was proposed to compute the intrinsic dimensionality as the half of the squared signal to noise ratio (for the distance distribution) [10]. [sent-41, score-0.319]
15 The LSH employs several hash functions: It is likely that close objects have same hash values and distant objects have different hash values. [sent-45, score-0.273]
16 In the classic LSH index, the probability of finding an element in one hash table is small and, consequently, many hash tables are to be created during indexing. [sent-46, score-0.203]
17 proposed a multi-probe version of the LSH, which can query multiple buckets of the same hash table [22]. [sent-48, score-0.259]
18 For approximate searching it was demonstrated that an early termination strategy could rely on information about distances from typical queries to their respective nearest neighbors [33, 1]. [sent-50, score-0.435]
19 [1] showed that density estimates can be used to approximate a pruning function in metric spaces. [sent-52, score-0.214]
20 Ch´ vez and Navarro [9] proposed to relax triangle-inequality a based lower bounds for distances to potential nearest neighbors. [sent-54, score-0.344]
21 Vermorel [29] applied VP-trees to searching in undisclosed non-metric spaces without trying to learn a pruning function. [sent-66, score-0.181]
22 [1], he proposed to visit partitions in the order defined by density estimates and employed the same early termination method as Zezula et al. [sent-68, score-0.238]
23 Cayton [6] proposed a Bregman ball tree (bbtree), which is an exact search method for Bregman divergences. [sent-70, score-0.163]
24 The bbtree divides data into two clusters (each covered by a Bregman ball) and recursively repeats this procedure for each cluster until the number of data points in a cluster falls below a threshold (a bucket size). [sent-71, score-0.525]
25 At search time, the method relies on properties of Bregman divergences to compute the shortest distances to covering balls. [sent-72, score-0.218]
26 [34] proposed an exact search method based on estimating the maximum distance to a bounding rectangle, but it works with left queries only. [sent-77, score-0.262]
27 The most efficient variant of this method relies on an optimization technique applicable only to certain decomposable Bregman divergences (a decomposable distance is a sum of values computed separately for each coordinate). [sent-78, score-0.156]
28 , Spearman’s) is computed between the permutation of the query and permutations of data points. [sent-86, score-0.348]
29 Candidate points, which have sufficiently small correlation values, are then compared directly with the query (by computing the original distance function). [sent-87, score-0.228]
30 One can sequentially scan the list of permutations and compute the rank correlation between the 2 permutation of the query and the permutation of every data point [8]. [sent-88, score-0.563]
31 We compare the resulting method, which can be applied to both metric and non-metric spaces, with the following state-of-the-art methods: the multi-probe LSH, permutation methods, and the bbtree. [sent-92, score-0.298]
32 1 Proposed Method Classic VP-tree In the VP-tree (also known as a ball tree) the space is partitioned with respect to a (usually randomly) chosen pivot π [28, 31]. [sent-94, score-0.184]
33 Assume that we have computed distances from all points to the pivot π and R is a median of these distances. [sent-95, score-0.301]
34 Points inside the pivotcentered sphere are placed into the left subtree, while points outside the pivot-centered sphere are placed into the right subtree (points on the border may be placed arbitrarily). [sent-97, score-0.193]
35 When the number of data points is below a certain threshold (the bucket size), these data points are stored as a single bucket. [sent-99, score-0.251]
36 Each internal node stores the pivot π and the radius R. [sent-105, score-0.187]
37 In a metric space with the distance d(x, y), we use the triangle inequality to prune the search space. [sent-106, score-0.421]
38 We visit: • only the left subtree if d(π, q) < R − r; • only the right subtree if d(π, q) > R + r; R π Figure 1: Three types of query balls in the VP-tree. [sent-107, score-0.218]
39 The black circle (centered at the pivot π) is the sphere that divides the space. [sent-108, score-0.236]
40 If r < Dπ,R (d(π, q)), we visit only the partition containing the query point. [sent-115, score-0.219]
41 The only argument of this function is a distance between the pivot and the query, i. [sent-118, score-0.253]
42 The function value is equal to the maximum radius of the query ball that fits inside the partition containing the query (see the red and the blue sample balls in Fig. [sent-121, score-0.345]
43 2 Approximating Dπ,R (x) with a Piece-wise Linear Function In Section 2 of the supplemental materials, we describe a straightforward sampling algorithm to learn the decision function Dπ,R (x) for every pivot π. [sent-124, score-0.219]
44 1 q q q q q q q q q q q q q q q q q q q q qq q q q qq q q q q qq qq q qq q qqq q qq q q q qq q qq q q q qqqq qq q qq q q qqqqqq qq q q qq qqq q q q qq q qq q q q q 0. [sent-138, score-7.576]
45 75 distance to pivot (a) Colors, L2 q q q q q q q q q 0. [sent-141, score-0.253]
46 1 q q q q q q q q q q q q q q qq q q q q q q qq q q q q q q q q q qq q q qq q q qq q q q qq q q qqq q qq qqq q q qq q q q qq q q q qqq q q qqq qq q 0. [sent-142, score-5.642]
47 50 q q max distance to query q q max distance to query max distance to query q q q q q 0. [sent-144, score-0.684]
48 0 6 distance to pivot (b) RCV-8, KL-divergence q q q q q q q q q qq qq q q qq q q q q q q q q q q q qq q q q qq q q q q q q q q q qqqqqqqqqq qq qq q q qq q qqqq q q qq q q qq qq q qqq q qq qq q q qq 0 2 4 6 distance to pivot (c) RCV-16, gen. [sent-150, score-7.984]
49 The red dashed line denotes a median distance R from data set points to the pivot π. [sent-154, score-0.32]
50 This is similar to stretching of the triangle inequality proposed by Ch´ vez and Navarro [9]. [sent-157, score-0.242]
51 To this end, we index a small subset of the data points and seek to obtain parameters that give the shortest retrieval time at a specified recall threshold. [sent-159, score-0.322]
52 Then, recall values and retrieval times for all αlef t = aρi/m−0. [sent-161, score-0.171]
53 If the obtained recall values are considerably larger than a specified threshold, the procedure repeats the grid search using larger values of a and b. [sent-169, score-0.171]
54 Similarly, if the recall is not sufficient, the values of a and b are decreased and the grid search is repeated. [sent-170, score-0.171]
55 The Property 1, which is true for all metric spaces due to the triangle inequality, holds in the case of the KL-divergence and data points u sampled randomly and uniformly from the simplex {xi |xi ≥ 0, xi = 1}. [sent-179, score-0.3]
56 Additionally, we need to compute logarithms of query vector elements, but this is done only once per query. [sent-203, score-0.198]
57 Improvement in efficiency due to indexing is measured as a reduction in retrieval time compared to a sequential, i. [sent-209, score-0.205]
58 It is equal to the number of NN-points closer to the query than the nearest point returned by the search method. [sent-213, score-0.449]
59 • The permutation prefix index, where permutation profiles are stored in a prefix tree of limited depth [13]. [sent-222, score-0.428]
60 • The bbtree [6], which is designed for Bregman divergences, and, thus, it was not used with L2 . [sent-224, score-0.329]
61 index vp-tree permutation 103 Uniform (L2) RCV-128 (L2) Colors (L2) 103 multi-probe LSH pref. [sent-236, score-0.271]
62 index vp-tree permutation 3 10 multi-probe LSH pref. [sent-237, score-0.271]
63 index vp-tree permutation 102 102 102 101 101 101 10−2 10−1 100 101 102 103 Number of points closer (log. [sent-238, score-0.419]
64 scale) 10−2 10−1 100 101 102 103 104 Number of points closer (log. [sent-240, score-0.148]
65 index vp-tree permutation 102 102 102 101 multi-probe LSH pref. [sent-244, score-0.271]
66 index vp-tree permutation 100 101 101 100 10−2 10−1 100 101 102 103 Number of points closer (log. [sent-245, score-0.419]
67 index bbtree vp-tree permutation RCV-256 (KL-div) 103 2 10 102 pref. [sent-250, score-0.6]
68 index bbtree vp-tree permutation 101 102 101 101 100 100 10−2 10−1 100 101 102 Number of points closer (log. [sent-251, score-0.748]
69 index bbtree vp-tree permutation 2 103 101 102 101 100 pref. [sent-258, score-0.6]
70 index bbtree vp-tree permutation 100 101 102 103 Number of points closer (log. [sent-259, score-0.748]
71 scale) Figure 4: Performance of NN-search for the KL-divergence and Itakura-Saito distance For the bbtree and the VP-tree, vectors in the same bucket were stored in contiguous chunks of memory (allowing for about 1. [sent-262, score-0.544]
72 It is typically more efficient to search elements of a small bucket sequentially, rather than using an index. [sent-264, score-0.224]
73 The same optimization approach was also used for both permutation methods. [sent-266, score-0.187]
74 They included: the minimal number/percentage of candidates in permutation methods, the desired 6 Table 2: Improvement in efficiency and retrieval time (ms) for the bbtree without early termination Data set RCV-16 RCV-32 RCV-128 RCV-256 SIFT sign. [sent-268, score-0.752]
75 Even though a representational dimensionality of the Uniform data set is only 64, it has the highest intrinsic dimensionality among all sets in Table 1 (according to the definition of Ch´ vez et al. [sent-289, score-0.306]
76 For instance, for the VP-tree, a 160x speedup was only possible, when a retrieved object was a 40-th nearest neighbor (on average) instead of the first one. [sent-292, score-0.251]
77 For example, the SIFT signatures have the representational dimensionality of 1111, but the intrinsic dimensionality is only four. [sent-296, score-0.252]
78 The bbtree is never substantially faster than the VP-tree, while being up to an order of magnitude slower for RCV-16 and RCV-256 in the case of Itakura-Saito distance. [sent-301, score-0.385]
79 Yet, for the SIFT signatures data set and the Itakura-Saito distance, permutation methods can be twice as fast. [sent-303, score-0.268]
80 When the VP-tree misses the nearest neighbor, it often returns the second nearest or the third nearest neighbor instead. [sent-305, score-0.484]
81 However, when other examined methods miss the nearest neighbor, they frequently return elements that are far from the true result. [sent-306, score-0.157]
82 For example, the multi-probe LSH may return a true nearest neighbor 50% of the time, and 50% of the time it would return the 100-th nearest neighbor. [sent-307, score-0.354]
83 The second row shows improvements in efficiency for the case of the faster KL-divergence and these improvements are substantially smaller than those reported in the first row, despite the fact that using the faster KL-divergence greatly reduces retrieval times. [sent-313, score-0.198]
84 The reason is that the pruning algorithm of the bbtree is quite expensive. [sent-314, score-0.401]
85 In most cases, this method provided better trade-offs between rank approximation quality and retrieval speed. [sent-317, score-0.172]
86 We also showed that a simple trick of pre-computing logarithms at index time substantially improved performance of existing methods (e. [sent-320, score-0.152]
87 5 Acknowledgements We thank Lawrence Cayton for providing the data sets, the bbtree code, and answering our questions; Anna Belova for checking the proof of Property 1 (supplemental materials) and editing the paper. [sent-325, score-0.329]
88 Region proximity in metric spaces and its use for approximate similarity search. [sent-331, score-0.363]
89 Approximate similarity search in metric spaces using inverted files. [sent-340, score-0.401]
90 Probabilistic proximity search: Fighting the curse of dimensionality a in metric spaces. [sent-386, score-0.276]
91 Indexing inexact proximity search with distance regression in pivot space. [sent-409, score-0.426]
92 Use of permutation prefixes for efficient and scalable approximate similarity search. [sent-414, score-0.312]
93 Properties of embedding methods for similarity searching in metric spaces. [sent-436, score-0.252]
94 Approximate nearest neighbors: towards removing the curse of dimensionality. [sent-449, score-0.169]
95 Classification with nonmetric distances: Image retrieval and class representation. [sent-456, score-0.178]
96 Efficient search for approximate nearest neighbor in high dimensional spaces. [sent-462, score-0.363]
97 NV-Tree: An efficient disk-based o index for approximate search in very large high-dimensional collections. [sent-470, score-0.223]
98 Rank-approximate nearest neighbor search: Retaining meaning and speed in high dimensions. [sent-505, score-0.224]
99 Near neighbor search in metric and nonmetric space, 2005. [sent-518, score-0.347]
100 Data structures and algorithms for nearest neighbor search in general metric spaces. [sent-534, score-0.443]
wordName wordTfidf (topN-words)
[('qq', 0.525), ('bbtree', 0.329), ('lsh', 0.311), ('permutation', 0.187), ('pivot', 0.155), ('cayton', 0.154), ('retrieval', 0.144), ('lef', 0.135), ('vez', 0.135), ('query', 0.13), ('nearest', 0.13), ('amato', 0.12), ('metric', 0.111), ('search', 0.108), ('qqq', 0.098), ('distance', 0.098), ('zezula', 0.097), ('neighbor', 0.094), ('similarity', 0.094), ('bregman', 0.09), ('bucket', 0.089), ('ch', 0.085), ('index', 0.084), ('hash', 0.082), ('partitions', 0.081), ('closer', 0.081), ('signatures', 0.081), ('distances', 0.079), ('sift', 0.076), ('pruning', 0.072), ('logarithms', 0.068), ('points', 0.067), ('proximity', 0.065), ('visit', 0.065), ('spaces', 0.062), ('indexing', 0.061), ('dimensionality', 0.061), ('termination', 0.061), ('pre', 0.061), ('triangle', 0.06), ('ciency', 0.06), ('savino', 0.058), ('sisap', 0.058), ('queries', 0.056), ('pivots', 0.051), ('navarro', 0.051), ('intrinsic', 0.049), ('stretching', 0.047), ('buckets', 0.047), ('usa', 0.047), ('searching', 0.047), ('prune', 0.044), ('subtree', 0.044), ('vldb', 0.042), ('materials', 0.041), ('sphere', 0.041), ('scale', 0.041), ('divides', 0.04), ('curse', 0.039), ('colors', 0.039), ('classic', 0.039), ('boytsov', 0.039), ('figueroa', 0.039), ('icst', 0.039), ('josephson', 0.039), ('pruner', 0.039), ('ny', 0.038), ('supplemental', 0.036), ('grid', 0.036), ('nonmetric', 0.034), ('simd', 0.034), ('radius', 0.032), ('permutations', 0.031), ('approximate', 0.031), ('early', 0.031), ('divergences', 0.031), ('qqqq', 0.03), ('slower', 0.029), ('york', 0.029), ('ball', 0.029), ('stored', 0.028), ('decision', 0.028), ('euclidean', 0.028), ('rank', 0.028), ('improvement', 0.028), ('employs', 0.027), ('moderate', 0.027), ('jacobs', 0.027), ('lv', 0.027), ('applicable', 0.027), ('recall', 0.027), ('faster', 0.027), ('speedup', 0.027), ('elements', 0.027), ('tree', 0.026), ('inverted', 0.026), ('charikar', 0.026), ('belgium', 0.026), ('searched', 0.025), ('partition', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces
Author: Leonid Boytsov, Bilegsaikhan Naidan
Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-toprune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. 1
2 0.47977567 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on the principal component regression and its application to high dimension non-Gaussian data. The major contributions are two folds. First, in low dimensions and under the Gaussian model, by borrowing the strength from recent development in minimax optimal principal component estimation, we first time sharply characterize the potential advantage of classical principal component regression over least square estimation. Secondly, we propose and analyze a new robust sparse principal component regression on high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients in the optimal parametric rate and therefore is a good alternative to the Gaussian based methods. Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method. 1
3 0.24138488 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
Author: Anshumali Shrivastava, Ping Li
Abstract: We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R3way = |S1 ∩S2 ∩S3 | , S1 , S2 , S3 ∈ C, where C is a |S1 ∪S2 ∪S3 | size n collection of sets (or binary vectors). We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R3way resemblance over the pairwise case in improving retrieval quality. 1 Introduction and Motivation Similarity search (near neighbor search) is one of the fundamental problems in Computer Science. The task is to identify a small set of data points which are “most similar” to a given input query. Similarity search algorithms have been one of the basic building blocks in numerous applications including search, databases, learning, recommendation systems, computer vision, etc. One widely used notion of similarity on sets is the Jaccard similarity or resemblance [5, 10, 18, 20]. Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, the resemblance R2way between S1 and S2 is defined as: R2way = |S1 ∩S2 | . Existing notions of similarity in search problems mainly work with |S1 ∪S2 | pairwise similarity functions. In this paper, we go beyond this notion and look at the problem of k-way similarity search, where the similarity function of interest involves k sets (k ≥ 2). Our work exploits the fact that resemblance can be naturally extended to k-way resemblance similarity [18, 21], defined over k sets {S1 , S2 , ..., Sk } as Rk−way = |S1 ∩S2 ∩...∩Sk | . |S1 ∪S2 ∪...∪Sk | Binary high-dimensional data The current web datasets are typically binary, sparse, and extremely high-dimensional, largely due to the wide adoption of the “Bag of Words” (BoW) representations for documents and images. It is often the case, in BoW representations, that just the presence or absence (0/1) of specific feature words captures sufficient information [7, 16, 20], especially with (e.g.,) 3-grams or higher-order models. And so, the web can be imagined as a giant storehouse of ultra high-dimensional sparse binary vectors. Of course, binary vectors can also be equivalently viewed as sets (containing locations of the nonzero features). We list four practical scenarios where k-way resemblance search would be a natural choice. (i) Google Sets: (http://googlesystem.blogspot.com/2012/11/google-sets-still-available.html) Google Sets is among the earliest google projects, which allows users to generate list of similar words by typing only few related keywords. For example, if the user types “mazda” and “honda” the application will automatically generate related words like “bmw”, “ford”, “toyota”, etc. This application is currently available in google spreadsheet. If we assume the term document binary representation of each word w in the database, then given query w1 and w2 , we show that |w1 ∩w2 ∩w| |w1 ∪w2 ∪w| turns out to be a very good similarity measure for this application (see Section 7.1). 1 (ii) Joint recommendations: Users A and B would like to watch a movie together. The profile of each person can be represented as a sparse vector over a giant universe of attributes. For example, a user profile may be the set of actors, actresses, genres, directors, etc, which she/he likes. On the other hand, we can represent a movie M in the database over the same universe based on attributes associated with the movie. If we have to recommend movie M, jointly to users A and B, then a natural measure to maximize is |A∩B∩M | . The problem of group recommendation [3] is applicable |A∪B∪M | in many more settings such as recommending people to join circles, etc. (iii) Improving retrieval quality: We are interested in finding images of a particular type of object, and we have two or three (possibly noisy) representative images. In such a scenario, a natural expectation is that retrieving images simultaneously similar to all the representative images should be more refined than just retrieving images similar to any one of them. In Section 7.2, we demonstrate that in cases where we have more than one element to search for, we can refine our search quality using k-way resemblance search. In a dynamic feedback environment [4], we can improve subsequent search quality by using k-way similarity search on the pages already clicked by the user. (iv) Beyond pairwise clustering: While machine learning algorithms often utilize the data through pairwise similarities (e.g., inner product or resemblance), there are natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher [2, 30]. The computational cost, of course, will increase exponentially if we go beyond pairwise similarity. Efficiency is crucial With the data explosion in modern applications, the brute force way of scanning all the data for searching is prohibitively expensive, specially in user-facing applications like search. The need for k-way similarity search can only be fulfilled if it admits efficient algorithms. This paper fulfills this requirement for k-way resemblance and its derived similarities. In particular, we show fast algorithms with provable query time guarantees for approximate k-way resemblance search. Our algorithms and analysis naturally provide a framework to extend classical LSH framework [14, 13] to handle higher-order similarities, which could be of independent theoretical interest. Organization In Section 2, we review approximate near neighbor search and classical Locality Sensitive Hashing (LSH). In Section 3, we formulate the 3-way similarity search problems. Sections 4, 5, and 6 describe provable fast algorithms for several search problems. Section 7 demonstrates the applicability of 3-way resemblance search in real applications. 2 Classical c-NN and Locality Sensitive Hashing (LSH) Initial attempts of finding efficient (sub-linear time) algorithms for exact near neighbor search, based on space partitioning, turned out to be a disappointment with the massive dimensionality of current datasets [11, 28]. Approximate versions of the problem were proposed [14, 13] to break the linear query time bottleneck. One widely adopted formalism is the c-approximate near neighbor (c-NN). Definition 1 (c-Approximate Near Neighbor or c-NN). Consider a set of points, denoted by P, in a D-dimensional space RD , and parameters R0 > 0, δ > 0. The task is to construct a data structure which, given any query point q, if there exist an R0 -near neighbor of q in P, it reports some cR0 -near neighbor of q in P with probability 1 − δ. The usual notion of c-NN is for distance. Since we deal with similarities, we define R0 -near neighbor of point q as a point p with Sim(q, p) ≥ R0 , where Sim is the similarity function of interest. Locality sensitive hashing (LSH) [14, 13] is a popular framework for c-NN problems. LSH is a family of functions, with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping RD to some set S Definition 2 (Locality Sensitive Hashing (LSH)). A family H is called (R0 , cR0 , p1 , p2 )-sensitive if for any two points x, y ∈ RD and h chosen uniformly from H satisfies the following: • if Sim(x, y) ≥ R0 then P rH (h(x) = h(y)) ≥ p1 • if Sim(x, y) ≤ cR0 then P rH (h(x) = h(y)) ≤ p2 For approximate nearest neighbor search typically, p1 > p2 and c < 1 is needed. Note, c < 1 as we are defining neighbors in terms of similarity. Basically, LSH trades off query time with extra preprocessing time and space which can be accomplished off-line. 2 Fact 1 Given a family of (R0 , cR0 , p1 , p2 ) -sensitive hash functions, one can construct a data structure for c-NN with O(nρ log1/p2 n) query time and space O(n1+ρ ), where ρ = log 1/p1 . log 1/p2 Minwise Hashing for Pairwise Resemblance One popular choice of LSH family of functions associated with resemblance similarity is, Minwise Hashing family [5, 6, 13]. Minwise Hashing family applies an independent random permutation π : Ω → Ω, on the given set S ⊆ Ω, and looks at the minimum element under π, i.e. min(π(S)). Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, it can be shown by elementary probability argument that P r (min(π(S1 )) = min(π(S2 ))) = |S1 ∩ S2 | = R2way . |S1 ∪ S2 | (1) The recent work on b-bit minwise hashing [20, 23] provides an improvement by storing only the lowest b bits of the hashed values: min(π(S1 )), min(π(S2 )). [26] implemented the idea of building hash tables for near neighbor search, by directly using the bits from b-bit minwise hashing. 3 3-way Similarity Search Formulation Our focus will remain on binary vectors which can also be viewed as sets. We illustrate our method |S1 ∩S2 ∩S3 | using 3-way resemblance similarity function Sim(S1 , S2 , S3 ) = |S1 ∪S2 ∪S3 | . The algorithm and guarantees naturally extend to k-way resemblance. Given a size n collection C ⊆ 2Ω of sets (or binary vectors), we are particularly interested in the following three problems: 1. Given two query sets S1 and S2 , find S3 ∈ C that maximizes Sim(S1 , S2 , S3 ). 2. Given a query set S1 , find two sets S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). 3. Find three sets S1 , S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). The brute force way of enumerating all possibilities leads to the worst case query time of O(n), O(n2 ) and O(n3 ) for problem 1, 2 and 3, respectively. In a hope to break this barrier, just like the case of pairwise near neighbor search, we define the c-approximate (c < 1) versions of the above three problems. As in the case of c-NN, we are given two parameters R0 > 0 and δ > 0. For each of the following three problems, the guarantee is with probability at least 1 − δ: 1. (3-way c-Near Neighbor or 3-way c-NN) Given two query sets S1 and S2 , if there ′ exists S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report some S3 ∈ C so that ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 2. (3-way c-Close Pair or 3-way c-CP) Given a query set S1 , if there exists a pair of ′ ′ set S2 , S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S2 , S3 ∈ C so that ′ ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 3. (3-way c-Best Cluster or 3-way c-BC) If there exist sets S1 , S2 , S3 ∈ C with ′ ′ ′ ′ ′ ′ Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S1 , S2 , S3 ∈ C so that Sim(S1 , S2 , S3 ) ≥ cR0 . 4 Sub-linear Algorithm for 3-way c-NN The basic philosophy behind sub-linear search is bucketing, which allows us to preprocess dataset in a fashion so that we can filter many bad candidates without scanning all of them. LSH-based techniques rely on randomized hash functions to create buckets that probabilistically filter bad candidates. This philosophy is not restricted for binary similarity functions and is much more general. Here, we first focus on 3-way c-NN problem for binary data. Theorem 1 For R3way c-NN one can construct a data structure with O(nρ log1/cR0 n) query time and O(n1+ρ ) space, where ρ = 1 − log 1/c log 1/c+log 1/R0 . The argument for 2-way resemblance can be naturally extended to k-way resemblance. Specifically, given three sets S1 , S2 , S3 ⊆ Ω and an independent random permutation π : Ω → Ω, we have: P r (min(π(S1 )) = min(π(S2 )) = min(π(S3 ))) = R3way . (2) Eq.( 2) shows that minwise hashing, although it operates on sets individually, preserves all 3-way (in fact k-way) similarity structure of the data. The existence of such a hash function is the key requirement behind the existence of efficient approximate search. For the pairwise case, the probability event was a simple hash collision, and the min-hash itself serves as the bucket index. In case 3 of 3-way (and higher) c-NN problem, we have to take care of a more complicated event to create an indexing scheme. In particular, during preprocessing we need to create buckets for each individual S3 , and while querying we need to associate the query sets S1 and S2 to the appropriate bucket. We need extra mechanisms to manipulate these minwise hashes to obtain a bucketing scheme. Proof of Theorem 1: We use two additional functions: f1 : Ω → N for manipulating min(π(S3 )) and f2 : Ω × Ω → N for manipulating both min(π(S1 )) and min(π(S2 )). Let a ∈ N+ such that |Ω| = D < 10a . We define f1 (x) = (10a + 1) × x and f2 (x, y) = 10a x + y. This choice ensures that given query S1 and S2 , for any S3 ∈ C, f1 (min(π(S3 ))) = f2 (min(π(S1 )), min(π(S2 ))) holds if and only if min(π(S1 )) = min(π(S2 )) = min(π(S2 )), and thus we get a bucketing scheme. To complete the proof, we introduce two integer parameters K and L. Define a new hash function by concatenating K events. To be more precise, while preprocessing, for every element S3 ∈ C create buckets g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hK (S3 ))] where hi is chosen uniformly from minwise hashing family. For given query points S1 and S2 , retrieve only points in the bucket g2 (S1 , S2 ) = [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hK (S1 ), hK (S2 ))]. Repeat this process L times independently. For any K S3 ∈ C, with Sim(S1 , S2 , S3 ) ≥ R0 , is retrieved with probability at least 1 − (1 − R0 )L . Using log 1/c log K = ⌈ log n ⌉ and L = ⌈nρ log( 1 )⌉, where ρ = 1 − log 1/c+log 1/R0 , the proof can be obtained 1 δ cR0 using standard concentration arguments used to prove Fact 1, see [14, 13]. It is worth noting that the probability guarantee parameter δ gets absorbed in the constants as log( 1 ). Note, the process is δ stopped as soon as we find some element with R3way ≥ cR0 . Theorem 1 can be easily extended to k-way resemblance with same query time and space guarantees. Note that k-way c-NN is at least as hard as k ∗ -way c-NN for any k ∗ ≤ k, because we can always choose (k −k ∗ +1) identical query sets in k-way c-NN, and it reduces to k ∗ -way c-NN problem. So, any improvements in R3way c-NN implies improvement in the classical min-hash LSH for Jaccard similarity. The proposed analysis is thus tight in this sense. The above observation makes it possible to also perform the traditional pairwise c-NN search using the same hash tables deployed for 3-way c-NN. In the query phase we have an option, if we have two different queries S1 , S2 , then we retrieve from bucket g2 (S1 , S2 ) and that is usual 3-way c-NN search. If we are just interested in pairwise near neighbor search given one query S1 , then we will look into bucket g2 (S1 , S1 ), and we know that the 3-way resemblance between S1 , S1 , S3 boils down to the pairwise resemblance between S1 and S3 . So, the same hash tables can be used for both the purposes. This property generalizes, and hash tables created for k-way c-NN can be used for any k ∗ -way similarity search so long as k ∗ ≤ k. The approximation guarantees still holds. This flexibility makes k-way c-NN bucketing scheme more advantageous over the pairwise scheme. ρ 1 One of the peculiarity of LSH based techniques is that the query complexity exponent ρ < 1 is dependent on the choice R0=0.01 0.8 of the threshold R0 we are interested in and the value of c 0.05 0.1 0.3 0.6 which is the approximation ratio that we will tolerate. Figure 1 0.2 0.4 0.8 log 1/c 0.5 plots ρ = 1− log 1/c+log 1/R0 with respect to c, for selected R0 0.4 0.6 0.9 0.7 values from 0.01 to 0.99. For instance, if we are interested in 0.2 0.95 highly similar pairs, i.e. R0 ≈ 1, then we are looking at near R =0.99 0 O(log n) query complexity for c-NN problem as ρ ≈ 0. On 0 0 0.2 0.4 0.6 0.8 1 the other hand, for very lower threshold R0 , there is no much c log 1/c of hope of time-saving because ρ is close to 1. Figure 1: ρ = 1 − log 1/c+log 1/R0 . 5 Other Efficient k-way Similarities We refer to the k-way similarities for which there exist sub-linear algorithms for c-NN search with query and space complexity exactly as given in Theorem 1 as efficient . We have demonstrated existence of one such example of efficient similarities, which is the k-way resemblance. This leads to a natural question: “Are there more of them?”. [9] analyzed all the transformations on similarities that preserve existence of efficient LSH search. In particular, they showed that if S is a similarity for which there exists an LSH family, then there also exists an LSH family for any similarity which is a probability generating function (PGF) transfor∑∞ mation on S. PGF transformation on S is defined as P GF (S) = i=1 pi S i , where S ∈ [0, 1] and ∑∞ pi ≥ 0 satisfies i=1 pi = 1. Similar theorem can also be shown in the case of 3-way resemblance. 4 Theorem 2 Any PGF transformation on 3-way resemblance R3way is efficient. Recall in the proof of Theorem 1, we created hash assignments f1 (min(π(S3 ))) and f2 (min(π(S1 )), min(π(S2 ))), which lead to a bucketing scheme for the 3-way resemblance search, where the collision event E = {f1 (min(π(S3 )) = f2 (min(π(S1 )), min(π(S2 )))} happens with probability P r(E) = R3way . To prove the above Theorem 2, we will need to create hash events ∑∞ i having probability P GF (R3way ) = i=1 pi (R3way ) . Note that 0 ≤ P GF (R3way ) ≤ 1. We will make use of the following simple lemma. Lemma 1 (R3way )n is efficient for all n ∈ N. n n Proof: Define new hash assignments g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hn (S3 ))] and g2 (S1 , S2 ) = n n [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hn (S1 ), hn (S2 ))]. The collision event g1 (S3 ) = g2 (S1 , S2 ) has n n probability (R3way )n . We now use the pair < g1 , g2 > instead of < f1 , f2 > and obtain same 3way n guarantees, as in Theorem 1, for (R ) as well. i i Proof of Theorem 2: From Lemma 1, let < g1 , g2 > be the hash pair corresponding to (R3way )i i i as used in above lemma. We sample one hash pair from the set {< g1 , g2 >: i ∈ N}, where i i the probability of sampling < g1 , g2 > is proportional to pi . Note that pi ≥ 0, and satisfies ∑∞ is i=1 pi = 1, and so the above sampling ∑ valid. It is not difficult to see that the collision of the ∞ sampled hash pair has probability exactly i=1 pi (R3way )i . Theorem 2 can be naturally extended to k-way similarity for any k ≥ 2. Thus, we now have infinitely many k-way similarity functions admitting efficient sub-linear search. One, that might be interesting, because of its radial basis kernel like nature, is shown in the following corollary. Corollary 1 eR k−way −1 is efficient. Proof: Use the expansion of eR k−way normalized by e to see that eR k−way −1 is a PGF on Rk−way . 6 Fast Algorithms for 3-way c-CP and 3-way c-BC Problems For 3-way c-CP and 3-way c-BC problems, using bucketing scheme with minwise hashing family will save even more computations. Theorem 3 For R3way c-Close Pair Problem (or c-CP) one can construct a data structure with log 1/c O(n2ρ log1/cR0 n) query time and O(n1+2ρ ) space, where ρ = 1 − log 1/c+log 1/R0 . Note that we can switch the role of f1 and f2 in the proof of Theorem 1. We are thus left with a c-NN problem with search space O(n2 ) (all pairs) instead of n. A bit of analysis, similar to Theorem 1, will show that this procedure achieves the required query time O(n2ρ log1/cR0 n), but uses a lot more space, O(n2(1+ρ )), than shown in the above theorem. It turns out that there is a better way of doing c-CP that saves us space. Proof of Theorem 3: We again start with constructing hash tables. For every element Sc ∈ C, we create a hash-table and store Sc in bucket B(Sc ) = [h1 (Sc ); h2 (Sc ); ...; hK (Sc )], where hi is chosen uniformly from minwise independent family of hash functions H. We create L such hash-tables. For a query element Sq we look for all pairs in bucket B(Sq ) = [h1 (Sq ); h2 (Sq ); ...; hK (Sq )] and repeat this for each of the L tables. Note, we do not form pairs of elements retrieved from different tables as they do not satisfy Eq. (2). If there exists a pair S1 , S2 ∈ C with Sim(Sq , S1 , S2 ) ≥ R0 , using K Eq. (2), we can see that we will find that pair in bucket B(Sq ) with probability 1 − (1 − R0 )L . Here, we cannot use traditional choice of K and L, similar to what we did in Theorem 1, as there 2 log are O(n2 ) instead of O(n) possible pairs. We instead use K = ⌈ log 1n ⌉ and L = ⌈n2ρ log( 1 )⌉, δ cR0 log 1/c with ρ = 1 − log 1/c+log 1/R0 . With this choice of K and L, the result follows. Note, the process is stopped as soon as we find pairs S1 and S2 with Sim(Sq , S1 , S2 ) ≥ cR0 . The key argument that saves space from O(n2(1+ρ) ) to O(n1+2ρ ) is that we hash n points individually. Eq. (2) makes it clear that hashing all possible pairs is not needed when every point can be processed individually, and pairs formed within each bucket itself filter out most of the unnecessary combinations. 5 Theorem 4 For R3way c-Best Cluster Problem (or c-BC) there exist an algorithm with running time log 1/c O(n1+2ρ log1/cR0 n), where ρ = 1 − log 1/c+log 1/R0 . The argument similar to one used in proof of Theorem 3 leads to the running time of O(n1+3ρ log1/cR0 n) as we need L = O(n3ρ ), and we have to processes all points at least once. Proof of Theorem 4: Repeat c-CP problem n times for every element in collection C acting as query once. We use the same set of hash tables and hash functions every time. The preprocessing time is O(n1+2ρ log1/cR0 n) evaluations of hash functions and the total querying time is O(n × n2ρ log1/cR0 n), which makes the total running time O(n1+2ρ log1/cR0 n). For k-way c-BC Problem, we can achieve O(n1+(k−1)ρ log1/cR0 n) running time. If we are interested in very high similarity cluster, with R0 ≈ 1, then ρ ≈ 0, and the running time is around O(n log n). This is a huge saving over the brute force O(nk ). In most practical cases, specially in big data regime where we have enormous amount of data, we can expect the k-way similarity of good clusters to be high and finding them should be efficient. We can see that with increasing k, hashing techniques save more computations. 7 Experiments In this section, we demonstrate the usability of 3-way and higher-order similarity search using (i) Google Sets, and (ii) Improving retrieval quality. 7.1 Google Sets: Generating Semantically Similar Words Here, the task is to retrieve words which are “semantically” similar to the given set of query words. We collected 1.2 million random documents from Wikipedia and created a standard term-doc binary vector representation of each term present in the collected documents after removing standard stop words and punctuation marks. More specifically, every word is represented as a 1.2 million dimension binary vector indicating its presence or absence in the corresponding document. The total number of terms (or words) was around 60,000 in this experiment. Since there is no standard benchmark available for this task, we show qualitative evaluations. For querying, we used the following four pairs of semantically related words: (i) “jaguar” and “tiger”; (ii) “artificial” and “intelligence”; (iii) “milky” and “way” ; (iv) “finger” and “lakes”. Given the query words w1 and w2 , we compare the results obtained by the following four methods. • Google Sets: We use Google’s algorithm and report 5 words from Google spreadsheets [1]. This is Google’s algorithm which uses its own data. • 3-way Resemblance (3-way): We use 3-way resemblance |w1 ∩w2 ∩w| to rank every word |w1 ∪w2 ∪w| w and report top 5 words based on this ranking. • Sum Resemblance (SR): Another intuitive method is to use the sum of pairwise resem|w2 ∩w| blance |w1 ∩w| + |w2 ∪w| and report top 5 words based on this ranking. |w1 ∪w| • Pairwise Intersection (PI): We first retrieve top 100 words based on pairwise resemblance for each w1 and w2 independently. We then report the words common in both. If there is no word in common we do not report anything. The results in Table 1 demonstrate that using 3-way resemblance retrieves reasonable candidates for these four queries. An interesting query is “finger” and “lakes”. Finger Lakes is a region in upstate New York. Google could only relate it to New York, while 3-way resemblance could even retrieve the names of cities and lakes in the region. Also, for query “milky” and “way”, we can see some (perhaps) unrelated words like “dance” returned by Google. We do not see such random behavior with 3-way resemblance. Although we are not aware of the algorithm and the dataset used by Google, we can see that 3-way resemblance appears to be a right measure for this application. The above results also illustrate the problem with using the sum of pairwise similarity method. The similarity value with one of the words dominates the sum and hence we see for queries “artificial” and “intelligence” that all the retrieved words are mostly related to the word “intelligence”. Same is the case with query “finger” and “lakes” as well as “jaguar” and “tiger”. Note that “jaguar” is also a car brand. In addition, for all 4 queries, there was no common word in the top 100 words similar to the each query word individually and so PI method never returns anything. 6 Table 1: Top five words retrieved using various methods for different queries. “JAGUAR” AND “ TIGER” G OOGLE 3- WAY SR LION LEOPARD CHEETAH CAT DOG LEOPARD CHEETAH LION PANTHER CAT CAT LEOPARD LITRE BMW CHASIS “MILKY” AND “ WAY” G OOGLE 3- WAY SR DANCE STARS SPACE THE UNIVERSE GALAXY STARS EARTH LIGHT SPACE EVEN ANOTHER STILL BACK TIME PI — — — — — “ARTIFICIAL” AND “INTELLIGENCE” G OOGLE 3- WAY SR PI COMPUTER COMPUTER SECURITY — PROGRAMMING SCIENCE WEAPONS — INTELLIGENT SECRET — SCIENCE ROBOT HUMAN ATTACKS — ROBOTICS TECHNOLOGY HUMAN — PI — — — — — G OOGLE NEW YORK NY PARK CITY “FINGER” AND “LAKES” 3- WAY SR SENECA CAYUGA ERIE ROCHESTER IROQUOIS RIVERS FRESHWATER FISH STREAMS FORESTED PI — — — — — We should note the importance of the denominator term in 3-way resemblance, without which frequent words will be blindly favored. The exciting contribution of this paper is that 3-way resemblance similarity search admits provable sub-linear guarantees, making it an ideal choice. On the other hand, no such provable guarantees are known for SR and other heuristic based search methods. 7.2 Improving Retrieval Quality in Similarity Search We also demonstrate how the retrieval quality of traditional similarity search can be boosted by utilizing more query candidates instead of just one. For the evaluations we choose two public datasets: MNIST and WEBSPAM, which were used in a recent related paper [26] for near neighbor search with binary data using b-bit minwise hashing [20, 23]. The two datasets reflect diversity both in terms of task and scale that is encountered in practice. The MNIST dataset consists of handwritten digit samples. Each sample is an image of 28 × 28 pixel yielding a 784 dimension vector with the associated class label (digit 0 − 9). We binarize the data by settings all non zeros to be 1. We used the standard partition of MNIST, which consists of 10,000 samples in one set and 60,000 in the other. The WEBSPAM dataset, with 16,609,143 features, consists of sparse vector representation of emails labeled as spam or not. We randomly sample 70,000 data points and partitioned them into two independent sets of size 35,000 each. Table 2: Percentage of top candidates with the same labels as that of query retrieved using various similarity criteria. More indicates better retrieval quality (Best marked in bold). T OP Pairwise 3-way NNbor 4-way NNbor 1 94.20 96.90 97.70 MNIST 10 20 92.33 91.10 96.13 95.36 96.89 96.28 50 89.06 93.78 95.10 1 98.45 99.75 99.90 WEBSPAM 10 20 96.94 96.46 98.68 97.80 98.87 98.15 50 95.12 96.11 96.45 For evaluation, we need to generate potential similar search query candidates for k-way search. It makes no sense in trying to search for object simultaneously similar to two very different objects. To generate such query candidates, we took one independent set of the data and partition it according to the class labels. We then run a cheap k-mean clustering on each class, and randomly sample triplets < x1 , x2 , x3 > from each cluster for evaluating 2-way, 3-way and 4-way similarity search. For MNIST dataset, the standard 10,000 test set was partitioned according to the labels into 10 sets, each partition was then clustered into 10 clusters, and we choose 10 triplets randomly from each cluster. In all we had 100 such triplets for each class, and thus 1000 overall query triplets. For WEBSPAM, which consists only of 2 classes, we choose one of the independent set and performed the same procedure. We selected 100 triplets from each cluster. We thus have 1000 triplets from each class making the total number of 2000 query candidates. The above procedures ensure that the elements in each triplets < x1 , x2 , x3 > are not very far from each other and are of the same class label. For each triplet < x1 , x2 , x3 >, we sort all the points x in the other independent set based on the following: • Pairwise: We only use the information in x1 and rank x based on resemblance 7 |x1 ∩x| |x1 ∪x| . • 3-way NN: We rank x based on 3-way resemblance • 4-way NN: We rank x based on 4-way resemblance |x1 ∩x2 ∩x| |x1 ∪x2 ∪x| . |x1 ∩x2 ∩x3 ∩x| |x1 ∪x2 ∪x3 ∪x| . We look at the top 1, 10, 20 and 50 points based on orderings described above. Since, all the query triplets are of the same label, The percentage of top retrieved candidates having same label as that of the query items is a natural metric to evaluate the retrieval quality. This percentage values accumulated over all the triplets are summarized in Table 2. We can see that top candidates retrieved by 3-way resemblance similarity, using 2 query points, are of better quality than vanilla pairwise similarity search. Also 4-way resemblance, with 3 query points, further improves the results compared to 3-way resemblance similarity search. This clearly demonstrates that multi-way resemblance similarity search is more desirable whenever we have more than one representative query in mind. Note that, for MNIST, which contains 10 classes, the boost compared to pairwise retrieval is substantial. The results follow a consistent trend. 8 Future Work While the work presented in this paper is promising for efficient 3-way and k-way similarity search in binary high-dimensional data, there are numerous interesting and practical research problems we can study as future work. In this section, we mention a few such examples. One-permutation hashing. Traditionally, building hash tables for near neighbor search required many (e.g., 1000) independent hashes. This is both time- and energy-consuming, not only for building tables but also for processing un-seen queries which have not been processed. One permutation hashing [22] provides the hope of reducing many permutations to merely one. The version in [22], however, was not applicable to near neighbor search due to the existence of many empty bins (which offer no indexing capability). The most recent work [27] is able to fill the empty bins and works well for pairwise near neighbor search. It will be interesting to extend [27] to k-way search. Non-binary sparse data. This paper focuses on minwise hashing for binary data. Various extensions to real-valued data are possible. For example, our results naturally apply to consistent weighted sampling [25, 15], which is one way to handle non-binary sparse data. The problem, however, is not solved if we are interested in similarities such as (normalized) k-way inner products, although the line of work on Conditional Random Sampling (CRS) [19, 18] may be promising. CRS works on non-binary sparse data by storing a bottom subset of nonzero entries after applying one permutation to (real-valued) sparse data matrix. CRS performs very well for certain applications but it does not work in our context because the bottom (nonzero) subsets are not properly aligned. Building hash tables by directly using bits from minwise hashing. This will be a different approach from the way how the hash tables are constructed in this paper. For example, [26] directly used the bits from b-bit minwise hashing [20, 23] to build hash tables and demonstrated the significant advantages compared to sim-hash [8, 12] and spectral hashing [29]. It would be interesting to see the performance of this approach in k-way similarity search. k-Way sign random projections. It would be very useful to develop theory for k-way sign random projections. For usual (real-valued) random projections, it is known that the volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might be also a (monotonic) function of the determinant. 9 Conclusions We formulate a new framework for k-way similarity search and obtain fast algorithms in the case of k-way resemblance with provable worst-case approximation guarantees. We show some applications of k-way resemblance search in practice and demonstrate the advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go. Acknowledgement The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research. 8 References [1] http://www.howtogeek.com/howto/15799/how-to-use-autofill-on-a-google-docs-spreadsheet-quick-tips/. [2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005. [3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. Group recommendation: semantics and efficiency. Proc. VLDB Endow., 2(1):754–765, 2009. [4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011. [5] Andrei Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997. [6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998. [7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999. [8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002. [9] Flavio Chierichetti and Ravi Kumar. LSH-preserving functions and their applications. In SODA, 2012. [10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003. [11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975. [12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995. [13] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012. [14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998. [15] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, 2010. [16] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007. [17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Technical report, arXiv:1207.6083, 2013. [18] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (Preliminary results appeared in HLT/EMNLP 2005), 33(3):305–354, 2007. [19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, Canada, 2006. [20] Ping Li and Arnd Christian K¨ nig. b-bit minwise hashing. In Proceedings of the 19th International o Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010. [21] Ping Li, Arnd Christian K¨ nig, and Wenhao Gui. b-bit minwise hashing for estimating three-way simio larities. In NIPS, Vancouver, Canada, 2010. [22] Ping Li, Art B Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012. [23] Ping Li, Anshumali Shrivastava, and Arnd Christian K¨ nig. b-bit minwise hashing in practice. In Intero netware, Changsha, China, 2013. [24] Avner Magen and Anastasios Zouzias. Near optimal dimensionality reductions that preserve volumes. In APPROX / RANDOM, pages 523–534, 2008. [25] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010. [26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012. [27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014. [28] Roger Weber, Hans-J¨ rg Schek, and Stephen Blott. A quantitative analysis and performance study for o similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998. [29] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, Vancouver, Canada, 2008. [30] D. Zhou, J. Huang, and B. Sch¨ lkopf. Beyond pairwise classification and clustering using hypergraphs. o In NIPS, Vancouver, Canada, 2006. 9
4 0.14021252 355 nips-2013-Which Space Partitioning Tree to Use for Search?
Author: Parikshit Ram, Alexander Gray
Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . (a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by:  γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi   VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH based search schemes. 8 References [1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions in Mathematical Software, 1977. [2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009. [3] R.F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991. [4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. [5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-NearestNeighbors. IEEE Transactions on Computing, 1975. [6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. [7] S. Dasgupta and Y. Freund. Random Projection trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008. [8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012. [9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005. [10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998. [11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Proceedings Systems, 2005. [12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013. [13] J. He, S. Kumar and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012. [14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007. [15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002. [16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008. [17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. [18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007. [19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances of Neural Information Processing Systems, 2008. [20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011. [22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. 9
5 0.12557924 326 nips-2013-The Power of Asymmetry in Binary Hashing
Author: Behnam Neyshabur, Nati Srebro, Ruslan Salakhutdinov, Yury Makarychev, Payman Yadollahpour
Abstract: When approximating binary similarity using the hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps. I.e. by approximating the similarity between x and x as the hamming distance between f (x) and g(x ), for two distinct binary codes f, g, rather than as the hamming distance between f (x) and f (x ). 1
6 0.11937173 47 nips-2013-Bayesian Hierarchical Community Discovery
7 0.10099876 161 nips-2013-Learning Stochastic Inverses
8 0.097466275 300 nips-2013-Solving the multi-way matching problem by permutation synchronization
9 0.089961462 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
10 0.076258153 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
11 0.075679407 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
12 0.072360784 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics
13 0.058685817 269 nips-2013-Regression-tree Tuning in a Streaming Setting
14 0.05573215 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
15 0.054965023 73 nips-2013-Convex Relaxations for Permutation Problems
16 0.054126371 63 nips-2013-Cluster Trees on Manifolds
17 0.054081727 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies
18 0.05322776 309 nips-2013-Statistical Active Learning Algorithms
19 0.053031642 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport
20 0.052722003 294 nips-2013-Similarity Component Analysis
topicId topicWeight
[(0, 0.139), (1, 0.054), (2, -0.026), (3, 0.007), (4, 0.076), (5, 0.05), (6, 0.024), (7, -0.033), (8, -0.09), (9, 0.114), (10, -0.06), (11, -0.013), (12, 0.105), (13, 0.063), (14, 0.013), (15, 0.018), (16, -0.062), (17, 0.063), (18, -0.007), (19, 0.227), (20, 0.166), (21, -0.023), (22, -0.288), (23, 0.307), (24, 0.136), (25, -0.296), (26, -0.177), (27, -0.063), (28, 0.139), (29, 0.085), (30, 0.217), (31, 0.034), (32, -0.113), (33, -0.001), (34, -0.032), (35, -0.007), (36, 0.013), (37, -0.09), (38, -0.003), (39, 0.148), (40, 0.041), (41, 0.059), (42, -0.027), (43, -0.034), (44, -0.014), (45, -0.067), (46, 0.025), (47, -0.045), (48, -0.011), (49, -0.087)]
simIndex simValue paperId paperTitle
same-paper 1 0.94628584 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces
Author: Leonid Boytsov, Bilegsaikhan Naidan
Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-toprune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. 1
2 0.81050473 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on the principal component regression and its application to high dimension non-Gaussian data. The major contributions are two folds. First, in low dimensions and under the Gaussian model, by borrowing the strength from recent development in minimax optimal principal component estimation, we first time sharply characterize the potential advantage of classical principal component regression over least square estimation. Secondly, we propose and analyze a new robust sparse principal component regression on high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients in the optimal parametric rate and therefore is a good alternative to the Gaussian based methods. Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method. 1
3 0.59002614 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
Author: Anshumali Shrivastava, Ping Li
Abstract: We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R3way = |S1 ∩S2 ∩S3 | , S1 , S2 , S3 ∈ C, where C is a |S1 ∪S2 ∪S3 | size n collection of sets (or binary vectors). We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R3way resemblance over the pairwise case in improving retrieval quality. 1 Introduction and Motivation Similarity search (near neighbor search) is one of the fundamental problems in Computer Science. The task is to identify a small set of data points which are “most similar” to a given input query. Similarity search algorithms have been one of the basic building blocks in numerous applications including search, databases, learning, recommendation systems, computer vision, etc. One widely used notion of similarity on sets is the Jaccard similarity or resemblance [5, 10, 18, 20]. Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, the resemblance R2way between S1 and S2 is defined as: R2way = |S1 ∩S2 | . Existing notions of similarity in search problems mainly work with |S1 ∪S2 | pairwise similarity functions. In this paper, we go beyond this notion and look at the problem of k-way similarity search, where the similarity function of interest involves k sets (k ≥ 2). Our work exploits the fact that resemblance can be naturally extended to k-way resemblance similarity [18, 21], defined over k sets {S1 , S2 , ..., Sk } as Rk−way = |S1 ∩S2 ∩...∩Sk | . |S1 ∪S2 ∪...∪Sk | Binary high-dimensional data The current web datasets are typically binary, sparse, and extremely high-dimensional, largely due to the wide adoption of the “Bag of Words” (BoW) representations for documents and images. It is often the case, in BoW representations, that just the presence or absence (0/1) of specific feature words captures sufficient information [7, 16, 20], especially with (e.g.,) 3-grams or higher-order models. And so, the web can be imagined as a giant storehouse of ultra high-dimensional sparse binary vectors. Of course, binary vectors can also be equivalently viewed as sets (containing locations of the nonzero features). We list four practical scenarios where k-way resemblance search would be a natural choice. (i) Google Sets: (http://googlesystem.blogspot.com/2012/11/google-sets-still-available.html) Google Sets is among the earliest google projects, which allows users to generate list of similar words by typing only few related keywords. For example, if the user types “mazda” and “honda” the application will automatically generate related words like “bmw”, “ford”, “toyota”, etc. This application is currently available in google spreadsheet. If we assume the term document binary representation of each word w in the database, then given query w1 and w2 , we show that |w1 ∩w2 ∩w| |w1 ∪w2 ∪w| turns out to be a very good similarity measure for this application (see Section 7.1). 1 (ii) Joint recommendations: Users A and B would like to watch a movie together. The profile of each person can be represented as a sparse vector over a giant universe of attributes. For example, a user profile may be the set of actors, actresses, genres, directors, etc, which she/he likes. On the other hand, we can represent a movie M in the database over the same universe based on attributes associated with the movie. If we have to recommend movie M, jointly to users A and B, then a natural measure to maximize is |A∩B∩M | . The problem of group recommendation [3] is applicable |A∪B∪M | in many more settings such as recommending people to join circles, etc. (iii) Improving retrieval quality: We are interested in finding images of a particular type of object, and we have two or three (possibly noisy) representative images. In such a scenario, a natural expectation is that retrieving images simultaneously similar to all the representative images should be more refined than just retrieving images similar to any one of them. In Section 7.2, we demonstrate that in cases where we have more than one element to search for, we can refine our search quality using k-way resemblance search. In a dynamic feedback environment [4], we can improve subsequent search quality by using k-way similarity search on the pages already clicked by the user. (iv) Beyond pairwise clustering: While machine learning algorithms often utilize the data through pairwise similarities (e.g., inner product or resemblance), there are natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher [2, 30]. The computational cost, of course, will increase exponentially if we go beyond pairwise similarity. Efficiency is crucial With the data explosion in modern applications, the brute force way of scanning all the data for searching is prohibitively expensive, specially in user-facing applications like search. The need for k-way similarity search can only be fulfilled if it admits efficient algorithms. This paper fulfills this requirement for k-way resemblance and its derived similarities. In particular, we show fast algorithms with provable query time guarantees for approximate k-way resemblance search. Our algorithms and analysis naturally provide a framework to extend classical LSH framework [14, 13] to handle higher-order similarities, which could be of independent theoretical interest. Organization In Section 2, we review approximate near neighbor search and classical Locality Sensitive Hashing (LSH). In Section 3, we formulate the 3-way similarity search problems. Sections 4, 5, and 6 describe provable fast algorithms for several search problems. Section 7 demonstrates the applicability of 3-way resemblance search in real applications. 2 Classical c-NN and Locality Sensitive Hashing (LSH) Initial attempts of finding efficient (sub-linear time) algorithms for exact near neighbor search, based on space partitioning, turned out to be a disappointment with the massive dimensionality of current datasets [11, 28]. Approximate versions of the problem were proposed [14, 13] to break the linear query time bottleneck. One widely adopted formalism is the c-approximate near neighbor (c-NN). Definition 1 (c-Approximate Near Neighbor or c-NN). Consider a set of points, denoted by P, in a D-dimensional space RD , and parameters R0 > 0, δ > 0. The task is to construct a data structure which, given any query point q, if there exist an R0 -near neighbor of q in P, it reports some cR0 -near neighbor of q in P with probability 1 − δ. The usual notion of c-NN is for distance. Since we deal with similarities, we define R0 -near neighbor of point q as a point p with Sim(q, p) ≥ R0 , where Sim is the similarity function of interest. Locality sensitive hashing (LSH) [14, 13] is a popular framework for c-NN problems. LSH is a family of functions, with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping RD to some set S Definition 2 (Locality Sensitive Hashing (LSH)). A family H is called (R0 , cR0 , p1 , p2 )-sensitive if for any two points x, y ∈ RD and h chosen uniformly from H satisfies the following: • if Sim(x, y) ≥ R0 then P rH (h(x) = h(y)) ≥ p1 • if Sim(x, y) ≤ cR0 then P rH (h(x) = h(y)) ≤ p2 For approximate nearest neighbor search typically, p1 > p2 and c < 1 is needed. Note, c < 1 as we are defining neighbors in terms of similarity. Basically, LSH trades off query time with extra preprocessing time and space which can be accomplished off-line. 2 Fact 1 Given a family of (R0 , cR0 , p1 , p2 ) -sensitive hash functions, one can construct a data structure for c-NN with O(nρ log1/p2 n) query time and space O(n1+ρ ), where ρ = log 1/p1 . log 1/p2 Minwise Hashing for Pairwise Resemblance One popular choice of LSH family of functions associated with resemblance similarity is, Minwise Hashing family [5, 6, 13]. Minwise Hashing family applies an independent random permutation π : Ω → Ω, on the given set S ⊆ Ω, and looks at the minimum element under π, i.e. min(π(S)). Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, it can be shown by elementary probability argument that P r (min(π(S1 )) = min(π(S2 ))) = |S1 ∩ S2 | = R2way . |S1 ∪ S2 | (1) The recent work on b-bit minwise hashing [20, 23] provides an improvement by storing only the lowest b bits of the hashed values: min(π(S1 )), min(π(S2 )). [26] implemented the idea of building hash tables for near neighbor search, by directly using the bits from b-bit minwise hashing. 3 3-way Similarity Search Formulation Our focus will remain on binary vectors which can also be viewed as sets. We illustrate our method |S1 ∩S2 ∩S3 | using 3-way resemblance similarity function Sim(S1 , S2 , S3 ) = |S1 ∪S2 ∪S3 | . The algorithm and guarantees naturally extend to k-way resemblance. Given a size n collection C ⊆ 2Ω of sets (or binary vectors), we are particularly interested in the following three problems: 1. Given two query sets S1 and S2 , find S3 ∈ C that maximizes Sim(S1 , S2 , S3 ). 2. Given a query set S1 , find two sets S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). 3. Find three sets S1 , S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). The brute force way of enumerating all possibilities leads to the worst case query time of O(n), O(n2 ) and O(n3 ) for problem 1, 2 and 3, respectively. In a hope to break this barrier, just like the case of pairwise near neighbor search, we define the c-approximate (c < 1) versions of the above three problems. As in the case of c-NN, we are given two parameters R0 > 0 and δ > 0. For each of the following three problems, the guarantee is with probability at least 1 − δ: 1. (3-way c-Near Neighbor or 3-way c-NN) Given two query sets S1 and S2 , if there ′ exists S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report some S3 ∈ C so that ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 2. (3-way c-Close Pair or 3-way c-CP) Given a query set S1 , if there exists a pair of ′ ′ set S2 , S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S2 , S3 ∈ C so that ′ ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 3. (3-way c-Best Cluster or 3-way c-BC) If there exist sets S1 , S2 , S3 ∈ C with ′ ′ ′ ′ ′ ′ Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S1 , S2 , S3 ∈ C so that Sim(S1 , S2 , S3 ) ≥ cR0 . 4 Sub-linear Algorithm for 3-way c-NN The basic philosophy behind sub-linear search is bucketing, which allows us to preprocess dataset in a fashion so that we can filter many bad candidates without scanning all of them. LSH-based techniques rely on randomized hash functions to create buckets that probabilistically filter bad candidates. This philosophy is not restricted for binary similarity functions and is much more general. Here, we first focus on 3-way c-NN problem for binary data. Theorem 1 For R3way c-NN one can construct a data structure with O(nρ log1/cR0 n) query time and O(n1+ρ ) space, where ρ = 1 − log 1/c log 1/c+log 1/R0 . The argument for 2-way resemblance can be naturally extended to k-way resemblance. Specifically, given three sets S1 , S2 , S3 ⊆ Ω and an independent random permutation π : Ω → Ω, we have: P r (min(π(S1 )) = min(π(S2 )) = min(π(S3 ))) = R3way . (2) Eq.( 2) shows that minwise hashing, although it operates on sets individually, preserves all 3-way (in fact k-way) similarity structure of the data. The existence of such a hash function is the key requirement behind the existence of efficient approximate search. For the pairwise case, the probability event was a simple hash collision, and the min-hash itself serves as the bucket index. In case 3 of 3-way (and higher) c-NN problem, we have to take care of a more complicated event to create an indexing scheme. In particular, during preprocessing we need to create buckets for each individual S3 , and while querying we need to associate the query sets S1 and S2 to the appropriate bucket. We need extra mechanisms to manipulate these minwise hashes to obtain a bucketing scheme. Proof of Theorem 1: We use two additional functions: f1 : Ω → N for manipulating min(π(S3 )) and f2 : Ω × Ω → N for manipulating both min(π(S1 )) and min(π(S2 )). Let a ∈ N+ such that |Ω| = D < 10a . We define f1 (x) = (10a + 1) × x and f2 (x, y) = 10a x + y. This choice ensures that given query S1 and S2 , for any S3 ∈ C, f1 (min(π(S3 ))) = f2 (min(π(S1 )), min(π(S2 ))) holds if and only if min(π(S1 )) = min(π(S2 )) = min(π(S2 )), and thus we get a bucketing scheme. To complete the proof, we introduce two integer parameters K and L. Define a new hash function by concatenating K events. To be more precise, while preprocessing, for every element S3 ∈ C create buckets g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hK (S3 ))] where hi is chosen uniformly from minwise hashing family. For given query points S1 and S2 , retrieve only points in the bucket g2 (S1 , S2 ) = [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hK (S1 ), hK (S2 ))]. Repeat this process L times independently. For any K S3 ∈ C, with Sim(S1 , S2 , S3 ) ≥ R0 , is retrieved with probability at least 1 − (1 − R0 )L . Using log 1/c log K = ⌈ log n ⌉ and L = ⌈nρ log( 1 )⌉, where ρ = 1 − log 1/c+log 1/R0 , the proof can be obtained 1 δ cR0 using standard concentration arguments used to prove Fact 1, see [14, 13]. It is worth noting that the probability guarantee parameter δ gets absorbed in the constants as log( 1 ). Note, the process is δ stopped as soon as we find some element with R3way ≥ cR0 . Theorem 1 can be easily extended to k-way resemblance with same query time and space guarantees. Note that k-way c-NN is at least as hard as k ∗ -way c-NN for any k ∗ ≤ k, because we can always choose (k −k ∗ +1) identical query sets in k-way c-NN, and it reduces to k ∗ -way c-NN problem. So, any improvements in R3way c-NN implies improvement in the classical min-hash LSH for Jaccard similarity. The proposed analysis is thus tight in this sense. The above observation makes it possible to also perform the traditional pairwise c-NN search using the same hash tables deployed for 3-way c-NN. In the query phase we have an option, if we have two different queries S1 , S2 , then we retrieve from bucket g2 (S1 , S2 ) and that is usual 3-way c-NN search. If we are just interested in pairwise near neighbor search given one query S1 , then we will look into bucket g2 (S1 , S1 ), and we know that the 3-way resemblance between S1 , S1 , S3 boils down to the pairwise resemblance between S1 and S3 . So, the same hash tables can be used for both the purposes. This property generalizes, and hash tables created for k-way c-NN can be used for any k ∗ -way similarity search so long as k ∗ ≤ k. The approximation guarantees still holds. This flexibility makes k-way c-NN bucketing scheme more advantageous over the pairwise scheme. ρ 1 One of the peculiarity of LSH based techniques is that the query complexity exponent ρ < 1 is dependent on the choice R0=0.01 0.8 of the threshold R0 we are interested in and the value of c 0.05 0.1 0.3 0.6 which is the approximation ratio that we will tolerate. Figure 1 0.2 0.4 0.8 log 1/c 0.5 plots ρ = 1− log 1/c+log 1/R0 with respect to c, for selected R0 0.4 0.6 0.9 0.7 values from 0.01 to 0.99. For instance, if we are interested in 0.2 0.95 highly similar pairs, i.e. R0 ≈ 1, then we are looking at near R =0.99 0 O(log n) query complexity for c-NN problem as ρ ≈ 0. On 0 0 0.2 0.4 0.6 0.8 1 the other hand, for very lower threshold R0 , there is no much c log 1/c of hope of time-saving because ρ is close to 1. Figure 1: ρ = 1 − log 1/c+log 1/R0 . 5 Other Efficient k-way Similarities We refer to the k-way similarities for which there exist sub-linear algorithms for c-NN search with query and space complexity exactly as given in Theorem 1 as efficient . We have demonstrated existence of one such example of efficient similarities, which is the k-way resemblance. This leads to a natural question: “Are there more of them?”. [9] analyzed all the transformations on similarities that preserve existence of efficient LSH search. In particular, they showed that if S is a similarity for which there exists an LSH family, then there also exists an LSH family for any similarity which is a probability generating function (PGF) transfor∑∞ mation on S. PGF transformation on S is defined as P GF (S) = i=1 pi S i , where S ∈ [0, 1] and ∑∞ pi ≥ 0 satisfies i=1 pi = 1. Similar theorem can also be shown in the case of 3-way resemblance. 4 Theorem 2 Any PGF transformation on 3-way resemblance R3way is efficient. Recall in the proof of Theorem 1, we created hash assignments f1 (min(π(S3 ))) and f2 (min(π(S1 )), min(π(S2 ))), which lead to a bucketing scheme for the 3-way resemblance search, where the collision event E = {f1 (min(π(S3 )) = f2 (min(π(S1 )), min(π(S2 )))} happens with probability P r(E) = R3way . To prove the above Theorem 2, we will need to create hash events ∑∞ i having probability P GF (R3way ) = i=1 pi (R3way ) . Note that 0 ≤ P GF (R3way ) ≤ 1. We will make use of the following simple lemma. Lemma 1 (R3way )n is efficient for all n ∈ N. n n Proof: Define new hash assignments g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hn (S3 ))] and g2 (S1 , S2 ) = n n [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hn (S1 ), hn (S2 ))]. The collision event g1 (S3 ) = g2 (S1 , S2 ) has n n probability (R3way )n . We now use the pair < g1 , g2 > instead of < f1 , f2 > and obtain same 3way n guarantees, as in Theorem 1, for (R ) as well. i i Proof of Theorem 2: From Lemma 1, let < g1 , g2 > be the hash pair corresponding to (R3way )i i i as used in above lemma. We sample one hash pair from the set {< g1 , g2 >: i ∈ N}, where i i the probability of sampling < g1 , g2 > is proportional to pi . Note that pi ≥ 0, and satisfies ∑∞ is i=1 pi = 1, and so the above sampling ∑ valid. It is not difficult to see that the collision of the ∞ sampled hash pair has probability exactly i=1 pi (R3way )i . Theorem 2 can be naturally extended to k-way similarity for any k ≥ 2. Thus, we now have infinitely many k-way similarity functions admitting efficient sub-linear search. One, that might be interesting, because of its radial basis kernel like nature, is shown in the following corollary. Corollary 1 eR k−way −1 is efficient. Proof: Use the expansion of eR k−way normalized by e to see that eR k−way −1 is a PGF on Rk−way . 6 Fast Algorithms for 3-way c-CP and 3-way c-BC Problems For 3-way c-CP and 3-way c-BC problems, using bucketing scheme with minwise hashing family will save even more computations. Theorem 3 For R3way c-Close Pair Problem (or c-CP) one can construct a data structure with log 1/c O(n2ρ log1/cR0 n) query time and O(n1+2ρ ) space, where ρ = 1 − log 1/c+log 1/R0 . Note that we can switch the role of f1 and f2 in the proof of Theorem 1. We are thus left with a c-NN problem with search space O(n2 ) (all pairs) instead of n. A bit of analysis, similar to Theorem 1, will show that this procedure achieves the required query time O(n2ρ log1/cR0 n), but uses a lot more space, O(n2(1+ρ )), than shown in the above theorem. It turns out that there is a better way of doing c-CP that saves us space. Proof of Theorem 3: We again start with constructing hash tables. For every element Sc ∈ C, we create a hash-table and store Sc in bucket B(Sc ) = [h1 (Sc ); h2 (Sc ); ...; hK (Sc )], where hi is chosen uniformly from minwise independent family of hash functions H. We create L such hash-tables. For a query element Sq we look for all pairs in bucket B(Sq ) = [h1 (Sq ); h2 (Sq ); ...; hK (Sq )] and repeat this for each of the L tables. Note, we do not form pairs of elements retrieved from different tables as they do not satisfy Eq. (2). If there exists a pair S1 , S2 ∈ C with Sim(Sq , S1 , S2 ) ≥ R0 , using K Eq. (2), we can see that we will find that pair in bucket B(Sq ) with probability 1 − (1 − R0 )L . Here, we cannot use traditional choice of K and L, similar to what we did in Theorem 1, as there 2 log are O(n2 ) instead of O(n) possible pairs. We instead use K = ⌈ log 1n ⌉ and L = ⌈n2ρ log( 1 )⌉, δ cR0 log 1/c with ρ = 1 − log 1/c+log 1/R0 . With this choice of K and L, the result follows. Note, the process is stopped as soon as we find pairs S1 and S2 with Sim(Sq , S1 , S2 ) ≥ cR0 . The key argument that saves space from O(n2(1+ρ) ) to O(n1+2ρ ) is that we hash n points individually. Eq. (2) makes it clear that hashing all possible pairs is not needed when every point can be processed individually, and pairs formed within each bucket itself filter out most of the unnecessary combinations. 5 Theorem 4 For R3way c-Best Cluster Problem (or c-BC) there exist an algorithm with running time log 1/c O(n1+2ρ log1/cR0 n), where ρ = 1 − log 1/c+log 1/R0 . The argument similar to one used in proof of Theorem 3 leads to the running time of O(n1+3ρ log1/cR0 n) as we need L = O(n3ρ ), and we have to processes all points at least once. Proof of Theorem 4: Repeat c-CP problem n times for every element in collection C acting as query once. We use the same set of hash tables and hash functions every time. The preprocessing time is O(n1+2ρ log1/cR0 n) evaluations of hash functions and the total querying time is O(n × n2ρ log1/cR0 n), which makes the total running time O(n1+2ρ log1/cR0 n). For k-way c-BC Problem, we can achieve O(n1+(k−1)ρ log1/cR0 n) running time. If we are interested in very high similarity cluster, with R0 ≈ 1, then ρ ≈ 0, and the running time is around O(n log n). This is a huge saving over the brute force O(nk ). In most practical cases, specially in big data regime where we have enormous amount of data, we can expect the k-way similarity of good clusters to be high and finding them should be efficient. We can see that with increasing k, hashing techniques save more computations. 7 Experiments In this section, we demonstrate the usability of 3-way and higher-order similarity search using (i) Google Sets, and (ii) Improving retrieval quality. 7.1 Google Sets: Generating Semantically Similar Words Here, the task is to retrieve words which are “semantically” similar to the given set of query words. We collected 1.2 million random documents from Wikipedia and created a standard term-doc binary vector representation of each term present in the collected documents after removing standard stop words and punctuation marks. More specifically, every word is represented as a 1.2 million dimension binary vector indicating its presence or absence in the corresponding document. The total number of terms (or words) was around 60,000 in this experiment. Since there is no standard benchmark available for this task, we show qualitative evaluations. For querying, we used the following four pairs of semantically related words: (i) “jaguar” and “tiger”; (ii) “artificial” and “intelligence”; (iii) “milky” and “way” ; (iv) “finger” and “lakes”. Given the query words w1 and w2 , we compare the results obtained by the following four methods. • Google Sets: We use Google’s algorithm and report 5 words from Google spreadsheets [1]. This is Google’s algorithm which uses its own data. • 3-way Resemblance (3-way): We use 3-way resemblance |w1 ∩w2 ∩w| to rank every word |w1 ∪w2 ∪w| w and report top 5 words based on this ranking. • Sum Resemblance (SR): Another intuitive method is to use the sum of pairwise resem|w2 ∩w| blance |w1 ∩w| + |w2 ∪w| and report top 5 words based on this ranking. |w1 ∪w| • Pairwise Intersection (PI): We first retrieve top 100 words based on pairwise resemblance for each w1 and w2 independently. We then report the words common in both. If there is no word in common we do not report anything. The results in Table 1 demonstrate that using 3-way resemblance retrieves reasonable candidates for these four queries. An interesting query is “finger” and “lakes”. Finger Lakes is a region in upstate New York. Google could only relate it to New York, while 3-way resemblance could even retrieve the names of cities and lakes in the region. Also, for query “milky” and “way”, we can see some (perhaps) unrelated words like “dance” returned by Google. We do not see such random behavior with 3-way resemblance. Although we are not aware of the algorithm and the dataset used by Google, we can see that 3-way resemblance appears to be a right measure for this application. The above results also illustrate the problem with using the sum of pairwise similarity method. The similarity value with one of the words dominates the sum and hence we see for queries “artificial” and “intelligence” that all the retrieved words are mostly related to the word “intelligence”. Same is the case with query “finger” and “lakes” as well as “jaguar” and “tiger”. Note that “jaguar” is also a car brand. In addition, for all 4 queries, there was no common word in the top 100 words similar to the each query word individually and so PI method never returns anything. 6 Table 1: Top five words retrieved using various methods for different queries. “JAGUAR” AND “ TIGER” G OOGLE 3- WAY SR LION LEOPARD CHEETAH CAT DOG LEOPARD CHEETAH LION PANTHER CAT CAT LEOPARD LITRE BMW CHASIS “MILKY” AND “ WAY” G OOGLE 3- WAY SR DANCE STARS SPACE THE UNIVERSE GALAXY STARS EARTH LIGHT SPACE EVEN ANOTHER STILL BACK TIME PI — — — — — “ARTIFICIAL” AND “INTELLIGENCE” G OOGLE 3- WAY SR PI COMPUTER COMPUTER SECURITY — PROGRAMMING SCIENCE WEAPONS — INTELLIGENT SECRET — SCIENCE ROBOT HUMAN ATTACKS — ROBOTICS TECHNOLOGY HUMAN — PI — — — — — G OOGLE NEW YORK NY PARK CITY “FINGER” AND “LAKES” 3- WAY SR SENECA CAYUGA ERIE ROCHESTER IROQUOIS RIVERS FRESHWATER FISH STREAMS FORESTED PI — — — — — We should note the importance of the denominator term in 3-way resemblance, without which frequent words will be blindly favored. The exciting contribution of this paper is that 3-way resemblance similarity search admits provable sub-linear guarantees, making it an ideal choice. On the other hand, no such provable guarantees are known for SR and other heuristic based search methods. 7.2 Improving Retrieval Quality in Similarity Search We also demonstrate how the retrieval quality of traditional similarity search can be boosted by utilizing more query candidates instead of just one. For the evaluations we choose two public datasets: MNIST and WEBSPAM, which were used in a recent related paper [26] for near neighbor search with binary data using b-bit minwise hashing [20, 23]. The two datasets reflect diversity both in terms of task and scale that is encountered in practice. The MNIST dataset consists of handwritten digit samples. Each sample is an image of 28 × 28 pixel yielding a 784 dimension vector with the associated class label (digit 0 − 9). We binarize the data by settings all non zeros to be 1. We used the standard partition of MNIST, which consists of 10,000 samples in one set and 60,000 in the other. The WEBSPAM dataset, with 16,609,143 features, consists of sparse vector representation of emails labeled as spam or not. We randomly sample 70,000 data points and partitioned them into two independent sets of size 35,000 each. Table 2: Percentage of top candidates with the same labels as that of query retrieved using various similarity criteria. More indicates better retrieval quality (Best marked in bold). T OP Pairwise 3-way NNbor 4-way NNbor 1 94.20 96.90 97.70 MNIST 10 20 92.33 91.10 96.13 95.36 96.89 96.28 50 89.06 93.78 95.10 1 98.45 99.75 99.90 WEBSPAM 10 20 96.94 96.46 98.68 97.80 98.87 98.15 50 95.12 96.11 96.45 For evaluation, we need to generate potential similar search query candidates for k-way search. It makes no sense in trying to search for object simultaneously similar to two very different objects. To generate such query candidates, we took one independent set of the data and partition it according to the class labels. We then run a cheap k-mean clustering on each class, and randomly sample triplets < x1 , x2 , x3 > from each cluster for evaluating 2-way, 3-way and 4-way similarity search. For MNIST dataset, the standard 10,000 test set was partitioned according to the labels into 10 sets, each partition was then clustered into 10 clusters, and we choose 10 triplets randomly from each cluster. In all we had 100 such triplets for each class, and thus 1000 overall query triplets. For WEBSPAM, which consists only of 2 classes, we choose one of the independent set and performed the same procedure. We selected 100 triplets from each cluster. We thus have 1000 triplets from each class making the total number of 2000 query candidates. The above procedures ensure that the elements in each triplets < x1 , x2 , x3 > are not very far from each other and are of the same class label. For each triplet < x1 , x2 , x3 >, we sort all the points x in the other independent set based on the following: • Pairwise: We only use the information in x1 and rank x based on resemblance 7 |x1 ∩x| |x1 ∪x| . • 3-way NN: We rank x based on 3-way resemblance • 4-way NN: We rank x based on 4-way resemblance |x1 ∩x2 ∩x| |x1 ∪x2 ∪x| . |x1 ∩x2 ∩x3 ∩x| |x1 ∪x2 ∪x3 ∪x| . We look at the top 1, 10, 20 and 50 points based on orderings described above. Since, all the query triplets are of the same label, The percentage of top retrieved candidates having same label as that of the query items is a natural metric to evaluate the retrieval quality. This percentage values accumulated over all the triplets are summarized in Table 2. We can see that top candidates retrieved by 3-way resemblance similarity, using 2 query points, are of better quality than vanilla pairwise similarity search. Also 4-way resemblance, with 3 query points, further improves the results compared to 3-way resemblance similarity search. This clearly demonstrates that multi-way resemblance similarity search is more desirable whenever we have more than one representative query in mind. Note that, for MNIST, which contains 10 classes, the boost compared to pairwise retrieval is substantial. The results follow a consistent trend. 8 Future Work While the work presented in this paper is promising for efficient 3-way and k-way similarity search in binary high-dimensional data, there are numerous interesting and practical research problems we can study as future work. In this section, we mention a few such examples. One-permutation hashing. Traditionally, building hash tables for near neighbor search required many (e.g., 1000) independent hashes. This is both time- and energy-consuming, not only for building tables but also for processing un-seen queries which have not been processed. One permutation hashing [22] provides the hope of reducing many permutations to merely one. The version in [22], however, was not applicable to near neighbor search due to the existence of many empty bins (which offer no indexing capability). The most recent work [27] is able to fill the empty bins and works well for pairwise near neighbor search. It will be interesting to extend [27] to k-way search. Non-binary sparse data. This paper focuses on minwise hashing for binary data. Various extensions to real-valued data are possible. For example, our results naturally apply to consistent weighted sampling [25, 15], which is one way to handle non-binary sparse data. The problem, however, is not solved if we are interested in similarities such as (normalized) k-way inner products, although the line of work on Conditional Random Sampling (CRS) [19, 18] may be promising. CRS works on non-binary sparse data by storing a bottom subset of nonzero entries after applying one permutation to (real-valued) sparse data matrix. CRS performs very well for certain applications but it does not work in our context because the bottom (nonzero) subsets are not properly aligned. Building hash tables by directly using bits from minwise hashing. This will be a different approach from the way how the hash tables are constructed in this paper. For example, [26] directly used the bits from b-bit minwise hashing [20, 23] to build hash tables and demonstrated the significant advantages compared to sim-hash [8, 12] and spectral hashing [29]. It would be interesting to see the performance of this approach in k-way similarity search. k-Way sign random projections. It would be very useful to develop theory for k-way sign random projections. For usual (real-valued) random projections, it is known that the volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might be also a (monotonic) function of the determinant. 9 Conclusions We formulate a new framework for k-way similarity search and obtain fast algorithms in the case of k-way resemblance with provable worst-case approximation guarantees. We show some applications of k-way resemblance search in practice and demonstrate the advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go. Acknowledgement The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research. 8 References [1] http://www.howtogeek.com/howto/15799/how-to-use-autofill-on-a-google-docs-spreadsheet-quick-tips/. [2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005. [3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. Group recommendation: semantics and efficiency. Proc. VLDB Endow., 2(1):754–765, 2009. [4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011. [5] Andrei Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997. [6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998. [7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999. [8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002. [9] Flavio Chierichetti and Ravi Kumar. LSH-preserving functions and their applications. In SODA, 2012. [10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003. [11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975. [12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995. [13] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012. [14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998. [15] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, 2010. [16] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007. [17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Technical report, arXiv:1207.6083, 2013. [18] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (Preliminary results appeared in HLT/EMNLP 2005), 33(3):305–354, 2007. [19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, Canada, 2006. [20] Ping Li and Arnd Christian K¨ nig. b-bit minwise hashing. In Proceedings of the 19th International o Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010. [21] Ping Li, Arnd Christian K¨ nig, and Wenhao Gui. b-bit minwise hashing for estimating three-way simio larities. In NIPS, Vancouver, Canada, 2010. [22] Ping Li, Art B Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012. [23] Ping Li, Anshumali Shrivastava, and Arnd Christian K¨ nig. b-bit minwise hashing in practice. In Intero netware, Changsha, China, 2013. [24] Avner Magen and Anastasios Zouzias. Near optimal dimensionality reductions that preserve volumes. In APPROX / RANDOM, pages 523–534, 2008. [25] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010. [26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012. [27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014. [28] Roger Weber, Hans-J¨ rg Schek, and Stephen Blott. A quantitative analysis and performance study for o similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998. [29] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, Vancouver, Canada, 2008. [30] D. Zhou, J. Huang, and B. Sch¨ lkopf. Beyond pairwise classification and clustering using hypergraphs. o In NIPS, Vancouver, Canada, 2006. 9
4 0.51172721 326 nips-2013-The Power of Asymmetry in Binary Hashing
Author: Behnam Neyshabur, Nati Srebro, Ruslan Salakhutdinov, Yury Makarychev, Payman Yadollahpour
Abstract: When approximating binary similarity using the hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps. I.e. by approximating the similarity between x and x as the hamming distance between f (x) and g(x ), for two distinct binary codes f, g, rather than as the hamming distance between f (x) and f (x ). 1
5 0.46780992 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics
Author: Özgür1 Şimşek
Abstract: Several attempts to understand the success of simple decision heuristics have examined heuristics as an approximation to a linear decision rule. This research has identified three environmental structures that aid heuristics: dominance, cumulative dominance, and noncompensatoriness. This paper develops these ideas further and examines their empirical relevance in 51 natural environments. The results show that all three structures are prevalent, making it possible for simple rules to reach, and occasionally exceed, the accuracy of the linear decision rule, using less information and less computation. 1
6 0.44949204 47 nips-2013-Bayesian Hierarchical Community Discovery
7 0.44845563 355 nips-2013-Which Space Partitioning Tree to Use for Search?
8 0.36177707 294 nips-2013-Similarity Component Analysis
9 0.32575282 71 nips-2013-Convergence of Monte Carlo Tree Search in Simultaneous Move Games
10 0.31913489 161 nips-2013-Learning Stochastic Inverses
11 0.31267491 300 nips-2013-Solving the multi-way matching problem by permutation synchronization
12 0.30431178 225 nips-2013-One-shot learning and big data with n=2
13 0.2993567 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel
14 0.27197316 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
15 0.25378007 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
16 0.25209427 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
17 0.24522127 73 nips-2013-Convex Relaxations for Permutation Problems
18 0.24462682 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport
19 0.22438207 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
20 0.21764421 245 nips-2013-Pass-efficient unsupervised feature selection
topicId topicWeight
[(16, 0.05), (21, 0.011), (33, 0.128), (34, 0.084), (36, 0.01), (41, 0.044), (49, 0.018), (56, 0.113), (70, 0.03), (85, 0.023), (89, 0.036), (93, 0.032), (95, 0.032), (99, 0.318)]
simIndex simValue paperId paperTitle
1 0.81911969 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain networks which have been previously found by stochastic search. 1
2 0.81307471 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation
Author: Hossein Azari Soufiani, William Chen, David C. Parkes, Lirong Xia
Abstract: In this paper we propose a class of efficient Generalized Method-of-Moments (GMM) algorithms for computing parameters of the Plackett-Luce model, where the data consists of full rankings over alternatives. Our technique is based on breaking the full rankings into pairwise comparisons, and then computing parameters that satisfy a set of generalized moment conditions. We identify conditions for the output of GMM to be unique, and identify a general class of consistent and inconsistent breakings. We then show by theory and experiments that our algorithms run significantly faster than the classical Minorize-Maximization (MM) algorithm, while achieving competitive statistical efficiency. 1
same-paper 3 0.75934452 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces
Author: Leonid Boytsov, Bilegsaikhan Naidan
Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-toprune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. 1
4 0.72165716 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on the principal component regression and its application to high dimension non-Gaussian data. The major contributions are two folds. First, in low dimensions and under the Gaussian model, by borrowing the strength from recent development in minimax optimal principal component estimation, we first time sharply characterize the potential advantage of classical principal component regression over least square estimation. Secondly, we propose and analyze a new robust sparse principal component regression on high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients in the optimal parametric rate and therefore is a good alternative to the Gaussian based methods. Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method. 1
5 0.7117846 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin
Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1
6 0.69391143 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
7 0.58165479 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
8 0.55883265 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
9 0.54746479 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
10 0.54270595 249 nips-2013-Polar Operators for Structured Sparse Estimation
11 0.54032773 318 nips-2013-Structured Learning via Logistic Regression
12 0.53978115 355 nips-2013-Which Space Partitioning Tree to Use for Search?
13 0.53954542 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
14 0.53909868 11 nips-2013-A New Convex Relaxation for Tensor Completion
15 0.538598 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
16 0.53806508 233 nips-2013-Online Robust PCA via Stochastic Optimization
17 0.53742898 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
18 0.53718644 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
19 0.53704548 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
20 0.53676224 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization