nips nips2013 nips2013-293 knowledge-graph by maker-knowledge-mining

293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

Source: pdf

Author: Ping Li, Gennady Samorodnitsky, John Hopcroft

Abstract: The method of stable random projections is useful for efficiently approximating the $l_\alpha$ distance ($0 < \alpha \le 2$) in high dimension and it is naturally suitable for data streams. In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i.e., when the two signs differ). Interestingly, when $\alpha = 1$ (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square ($\chi^2$) similarity. In text and vision applications, the $\chi^2$ similarity is a popular measure when the features are generated from histograms (which are a typical example of data streams). Experiments confirm that the proposed method is promising for large-scale learning applications. The full paper is available at arXiv:1308.1009. There are many future research problems. For example, when $\alpha \to 0$, the collision probability is a function of the resemblance (of the binary-quantized data). This provides an effective mechanism for resemblance estimation in data streams.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The method of stable random projections is useful for efficiently approximating the $l_\alpha$ distance ($0 < \alpha \le 2$) in high dimension and it is naturally suitable for data streams. [sent-8, score-0.432]

2 In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i. [sent-9, score-0.767]

3 , Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square (χ2 ) similarity. [sent-14, score-0.613]

4 In text and vision applications, the χ2 similarity is a popular measure when the features are generated from histograms (which are a typical example of data streams). [sent-15, score-0.207]

5 For example, when α → 0, the collision probability is a function of the resemblance (of the binary-quantized data). [sent-20, score-0.659]

6 , α-stable projections with α = 2) has become popular in machine learning (e. [sent-37, score-0.263]

7 More generally, the method of stable random projections [11, 17] provides an efﬁcient algorithm to compute the lα distances (0 < α ≤ 2). [sent-40, score-0.429]

8 In this paper, we propose to use only the signs of the projected data after applying stable projections. [sent-41, score-0.323]

9 The basic idea of stable random projections is to multiply u and v by a random matrix $R \in \mathbb{R}^{D \times k}$: $x = uR \in \mathbb{R}^k$, $y = vR \in \mathbb{R}^k$, where entries of R are i. [sent-44, score-0.403]

10 By properties of stable distributions, $x_j - y_j$ follows a symmetric α-stable distribution with scale $d_\alpha$. [sent-48, score-0.216]

11 In this paper, we propose to store only the signs of projected data and we study the probability of collision: $P_\alpha = \Pr(\mathrm{sign}(x_j) \ne \mathrm{sign}(y_j))$ (4) Using only the signs (i. [sent-53, score-0.274]
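As a rough illustration of the collision probability in (4) for α = 1, the following sketch projects two vectors with shared Cauchy columns, keeps only the signs, and counts how often the signs differ. The function names, the seed handling, and the convention of mapping a projected value of exactly zero to a negative sign are assumptions made for this example, not details taken from the paper.

```python
import math
import random

def cauchy(rng):
    # Standard Cauchy sample via the inverse CDF: tan(pi * (U - 1/2)).
    return math.tan(math.pi * (rng.random() - 0.5))

def sign_cauchy_projections(u, v, k, seed=0):
    """Project u and v with the same k Cauchy columns and keep only the signs."""
    rng = random.Random(seed)
    signs_x, signs_y = [], []
    for _ in range(k):
        r = [cauchy(rng) for _ in range(len(u))]  # one column of R
        x = sum(ui * ri for ui, ri in zip(u, r))
        y = sum(vi * ri for vi, ri in zip(v, r))
        # Assumption: a projected value of exactly 0 is treated as negative.
        signs_x.append(1 if x > 0 else -1)
        signs_y.append(1 if y > 0 else -1)
    return signs_x, signs_y

def empirical_collision(u, v, k, seed=0):
    """Fraction of the k projections whose signs differ (the paper's 'collision')."""
    sx, sy = sign_cauchy_projections(u, v, k, seed)
    return sum(a != b for a, b in zip(sx, sy)) / k
```

For identical vectors the fraction is zero, and for vectors with disjoint support the two projections are independent, so the fraction should be close to 1/2.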

12 For α < 2, the collision probability is an open problem. [sent-57, score-0.613]

13 Furthermore, for α = 1 and nonnegative data, we have the interesting observation that the probability P1 can be well approximated as functions of the χ2 similarity ρχ2 . [sent-60, score-0.166]

14 For example, a negative sign can be coded as “01” and a positive sign as “10” (i. [sent-70, score-0.82]

15 We can code a negative sign by “0” and positive sign by “1” and concatenate k such bits to form a hash table of $2^k$ buckets. [sent-77, score-0.916]
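The bit-concatenation idea can be sketched as follows. The helper names and the dictionary-based table are illustrative assumptions; the only thing taken from the text is the coding of a negative sign as 0 and a positive sign as 1.

```python
def bucket_index(signs):
    """Concatenate k sign bits (negative -> 0, positive -> 1) into an integer in [0, 2**k)."""
    index = 0
    for s in signs:
        index = (index << 1) | (1 if s > 0 else 0)
    return index

def build_hash_table(sign_vectors):
    """Map each bucket index to the list of item ids whose sign codes fall in that bucket."""
    table = {}
    for item_id, signs in enumerate(sign_vectors):
        table.setdefault(bucket_index(signs), []).append(item_id)
    return table
```

Items whose k signs all agree land in the same bucket, so a query only needs to inspect one bucket per table.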

16 3 Data Stream Computations Stable random projections are naturally suitable for data streams. [sent-80, score-0.263]

17 In the standard turnstile model [22], a data stream can be viewed as a high-dimensional vector with the entry values changing over time. [sent-82, score-0.141]

18 Here, we denote a stream at time t by $u_i^{(t)}$, i = 1 to D. [sent-83, score-0.253]

19 At time t, a stream element $(i_t, I_t)$ arrives and updates the $i_t$-th coordinate as $u_{i_t}^{(t)} = u_{i_t}^{(t-1)} + I_t$. [sent-84, score-0.132]

20 Clearly, the turnstile data stream model is particularly suitable for describing histograms and it is also a standard model for network trafﬁc summarization and monitoring [31]. [sent-85, score-0.176]

21 Because this stream model is linear, methods based on linear projections (i. [sent-86, score-0.308]

22 , $r_{i_t, j}$, j = 1 to k, are (re)generated and the projected data are updated as $x_j^{(t)} = x_j^{(t-1)} + I_t \times r_{i_t j}$. [sent-92, score-0.189]
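A minimal sketch of this streaming update, assuming that row $i_t$ of R is regenerated on demand from a seed rather than stored. The seeding scheme (`seed * 1_000_003 + i`) and all function names are hypothetical choices for illustration only.

```python
import math
import random

def cauchy_row(seed, i, k):
    """Regenerate row i of R (k standard Cauchy entries) from a seed, so R is never stored."""
    rng = random.Random(seed * 1_000_003 + i)  # hypothetical per-row seeding scheme
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(k)]

def stream_update(x, i_t, I_t, seed):
    """Apply x_j <- x_j + I_t * r_{i_t, j} for j = 1..k, in place."""
    row = cauchy_row(seed, i_t, len(x))
    for j in range(len(x)):
        x[j] += I_t * row[j]

def project_direct(u, k, seed):
    """Project a fully materialized vector u, for comparison with the streaming result."""
    x = [0.0] * k
    for i, ui in enumerate(u):
        if ui != 0:
            row = cauchy_row(seed, i, k)
            for j in range(k):
                x[j] += ui * row[j]
    return x
```

Because the projection is linear, applying the updates one stream element at a time gives (up to floating-point rounding) the same result as projecting the final vector.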

23 Thus we can still use, without loss of generality, the sum-to-one assumption, even in the streaming environment. [sent-95, score-0.228]

24 This fact was recently exploited by another data stream algorithm named Compressed Counting (CC) [18] for estimating the Shannon entropy of streams. [sent-96, score-0.103]

25 Because the use of the χ2 similarity is popular in (e. [sent-97, score-0.12]

26 ,) 5 ∼ 10 × D dimensions through a nonlinear transformation and then applying normal random projections on the expanded data. [sent-102, score-0.261]

27 To compare these two types of χ2 kernels with “linear” kernel, we also test the same data using LIBLINEAR [6] after normalizing the data to have unit Euclidean norm, i. [sent-110, score-0.104]

28 , ρ2 ) kernel and LIBSVM “precomputed kernel” for two types of χ2 kernels (“χ2 kernel” and “acos-χ2 -kernel”). [sent-118, score-0.13]

29 3 Sign Stable Random Projections and the Collision Probability Bound We apply stable random projections on two vectors $u, v \in \mathbb{R}^D$: $x = \sum_{i=1}^D u_i r_i$, $y = \sum_{i=1}^D v_i r_i$, $r_i \sim S(\alpha, 1)$, i. [sent-124, score-0.769]

30 By properties of stable distributions, we know $x - y \sim S\left(\alpha, \sum_{i=1}^D |u_i - v_i|^\alpha\right)$. [sent-128, score-0.356]

31 Applications including linear learning and near neighbor search will beneﬁt from sign α-stable random projections. [sent-129, score-0.458]

32 normal), the collision probability Pr (sign(x) ̸= sign(y)) is known [5, 9]. [sent-132, score-0.613]

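33 , $u_i \ge 0$, $v_i \ge 0$, we have $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y)) \le \frac{1}{\pi} \cos^{-1} \rho_\alpha$, where $\rho_\alpha = \left( \frac{\sum_{i=1}^D u_i^{\alpha/2} v_i^{\alpha/2}}{\sqrt{\sum_{i=1}^D u_i^\alpha \sum_{i=1}^D v_i^\alpha}} \right)^{2/\alpha}$ (6) For α = 2, this bound is exact [5, 9]. [sent-140, score-0.96]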
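The bound in (6) is straightforward to evaluate numerically. The sketch below assumes nonnegative inputs and $0 < \alpha \le 2$; the function names and the clamping of $\rho_\alpha$ to at most 1 (to absorb floating-point rounding) are choices made for this example.

```python
import math

def rho_alpha(u, v, alpha):
    """rho_alpha from (6) for nonnegative vectors u, v and 0 < alpha <= 2."""
    num = sum((ui ** (alpha / 2)) * (vi ** (alpha / 2)) for ui, vi in zip(u, v))
    den = math.sqrt(sum(ui ** alpha for ui in u) * sum(vi ** alpha for vi in v))
    return (num / den) ** (2 / alpha)

def collision_bound(u, v, alpha):
    """Upper bound (1/pi) * acos(rho_alpha) on Pr(sign(x) != sign(y))."""
    rho = min(1.0, rho_alpha(u, v, alpha))  # clamp against rounding slightly above 1
    return math.acos(rho) / math.pi
```

For α = 2, $\rho_2$ reduces to the cosine similarity of u and v, which matches the case in which the text states the bound is exact.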

34 Then we fix the data as our original data (like u and v), apply sign stable random projections, and report the empirical collision probabilities (after $10^5$ repetitions). [sent-148, score-1.253]

35 Figure 2 presents the simulated collision probability $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y))$ for D = 100 and α ∈ {1. [sent-149, score-0.645]

36 In each panel, the dashed curve is the theoretical upper bound $\frac{1}{\pi} \cos^{-1} \rho_\alpha$, and the solid curve is the simulated collision probability. [sent-154, score-0.832]

37 Simulated collision probability $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y))$ for sign stable random projections. [sent-193, score-1.192]

38 In each panel, the dashed curve is the upper bound $\frac{1}{\pi} \cos^{-1} \rho_\alpha$. [sent-194, score-0.145]

39 Also, the curves of the empirical collision probabilities are not smooth (in terms of ρα ). [sent-199, score-0.669]

40 To verify the theoretical upper bound of the collision probability on sparse data, we also simulate sparse data by randomly making 50% of the generated data as used in Figure 2 be zero. [sent-201, score-0.844]

41 With sparse data, it is even more obvious that the theoretical upper bound $\frac{1}{\pi} \cos^{-1} \rho_\alpha$ is not sharp when α ≤ 1, as shown in Figure 3. [sent-202, score-0.144]

42 Simulated collision probability Pr (sign(x) ̸= sign(y)) for sign stable random projection. [sent-243, score-1.192]

43 The upper bound is not tight especially when α ≤ 1. [sent-244, score-0.103]

44 In summary, the collision probability bound $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y)) \le \frac{1}{\pi} \cos^{-1} \rho_\alpha$ is fairly sharp when α is close to 2 (e. [sent-245, score-0.68]

45 4 α = 1 and Chi-Square (χ2 ) Similarity In this section, we focus on nonnegative data (ui ≥ 0, vi ≥ 0) and α = 1. [sent-250, score-0.258]

46 For example, we can view the data (ui , vi ) as empirical probabilities, which are common when data are generated from histograms (as popular in NLP and vision) [4, 10, 13, 2, 28, 27, 26]. [sent-252, score-0.332]

47 Theorem 1 implies $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y)) \le \frac{1}{\pi} \cos^{-1} \rho_1$, where $\rho_1 = \left( \sum_{i=1}^D u_i^{1/2} v_i^{1/2} \right)^2$ (7) While the bound is not tight, interestingly, the collision probability can be related to the $\chi^2$ similarity. [sent-256, score-1.23]

48 Recall the definitions of the chi-square distance $d_{\chi^2} = \sum_{i=1}^D \frac{(u_i - v_i)^2}{u_i + v_i}$ and the chi-square similarity $\rho_{\chi^2} = 1 - \frac{1}{2} d_{\chi^2} = \sum_{i=1}^D \frac{2 u_i v_i}{u_i + v_i}$. [sent-257, score-0.27]

49 Lemma 2 Assume $u_i \ge 0$, $v_i \ge 0$, $\sum_{i=1}^D u_i = 1$, $\sum_{i=1}^D v_i = 1$. [sent-259, score-1.106]

50 Then $\rho_{\chi^2} = \sum_{i=1}^D \frac{2 u_i v_i}{u_i + v_i} \ge \rho_1 = \left( \sum_{i=1}^D u_i^{1/2} v_i^{1/2} \right)^2$ (8) It is known that the $\chi^2$-kernel is PD [10]. [sent-260, score-0.187]
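The chi-square quantities and the inequality of Lemma 2 can be checked numerically on small histograms. This is only a sketch: the function names are made up for the example, and coordinates where $u_i + v_i = 0$ are skipped by assumption (they contribute nothing to either sum).

```python
import math

def chi2_distance(u, v):
    """d_chi2 = sum_i (u_i - v_i)^2 / (u_i + v_i), skipping coordinates where both are 0."""
    return sum((ui - vi) ** 2 / (ui + vi) for ui, vi in zip(u, v) if ui + vi > 0)

def chi2_similarity(u, v):
    """rho_chi2 = sum_i 2 u_i v_i / (u_i + v_i), skipping coordinates where both are 0."""
    return sum(2 * ui * vi / (ui + vi) for ui, vi in zip(u, v) if ui + vi > 0)

def rho_1(u, v):
    """rho_1 = (sum_i sqrt(u_i v_i))^2, the quantity appearing in the alpha = 1 bound."""
    return sum(math.sqrt(ui * vi) for ui, vi in zip(u, v)) ** 2
```

With `u = [0.5, 0.5, 0.0]` and `v = [0.25, 0.25, 0.5]` (both summing to one), the similarity equals one minus half the distance, and it is at least `rho_1(u, v)`, as Lemma 2 states.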

51 The remaining question is how to connect Cauchy random projections with the χ2 similarity. [sent-263, score-0.234]

52 5 Two Approximations of Collision Probability for Sign Cauchy Projections It is a difﬁcult problem to derive the collision probability of sign Cauchy projections if we would like to express the probability only in terms of certain summary statistics (e. [sent-264, score-1.29]

53 Our first observation is that the collision probability can be well approximated using the $\chi^2$ similarity: $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y)) \approx P_{\chi^2(1)} = \frac{1}{\pi} \cos^{-1}\left(\rho_{\chi^2}\right)$ (9) Figure 4 shows this approximation is better than $\frac{1}{\pi} \cos^{-1}(\rho_1)$. [sent-267, score-0.647]

54 Particularly, in sparse data, the approximation $\frac{1}{\pi} \cos^{-1}\left(\rho_{\chi^2}\right)$ is very accurate (except when $\rho_{\chi^2}$ is close to 1), while the bound $\frac{1}{\pi} \cos^{-1}(\rho_1)$ is not sharp (and the curve is not smooth in $\rho_1$). [sent-268, score-0.437]

55 In each panel, the two solid curves are the empirical collision probabilities in terms of $\rho_1$ (labeled by “1”) or $\rho_{\chi^2}$ (labeled by “χ2”). [sent-287, score-0.703]

56 It is clear that the proposed approximation $\frac{1}{\pi} \cos^{-1} \rho_{\chi^2}$ in (9) is tighter than the upper bound $\frac{1}{\pi} \cos^{-1} \rho_1$, especially so in sparse data. [sent-288, score-0.18]

57 Our second (and less obvious) approximation is the following integral: $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y)) \approx P_{\chi^2(2)} = \frac{1}{2} - \frac{2}{\pi^2} \int_0^{\pi/2} \tan^{-1}\left( \frac{\rho_{\chi^2}}{2 - 2\rho_{\chi^2}} \tan t \right) dt$ (10) Figure 5 illustrates that, for dense data, the second approximation (10) is more accurate than the first (9). [sent-289, score-0.405]
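Both approximations are easy to evaluate. The sketch below computes (9) directly and (10) with a simple midpoint rule; the function names, the number of quadrature points, and the special-casing of $\rho_{\chi^2} = 1$ (where the ratio inside the arctangent diverges and the integrand tends to π/2) are assumptions for this example.

```python
import math

def p_chi2_1(rho):
    """Approximation (9): (1/pi) * acos(rho_chi2), with rho clamped to [0, 1]."""
    return math.acos(min(1.0, max(0.0, rho))) / math.pi

def p_chi2_2(rho, n=2000):
    """Approximation (10), integrated over (0, pi/2) with an n-point midpoint rule."""
    if rho >= 1.0:
        return 0.0  # limiting value: the integrand tends to pi/2 everywhere
    ratio = rho / (2.0 - 2.0 * rho)
    h = (math.pi / 2) / n
    integral = sum(math.atan(ratio * math.tan((m + 0.5) * h)) for m in range(n)) * h
    return 0.5 - (2.0 / math.pi ** 2) * integral
```

A convenient sanity check is $\rho_{\chi^2} = 2/3$, where the ratio equals 1, the integrand is simply t, and (10) evaluates to 1/4.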

58 The second approximation (10) is also accurate for sparse data. [sent-290, score-0.108]

59 In practice, we often do not need the ρχ2 values explicitly because it often sufﬁces if the collision probability is a monotone function of the similarity. [sent-292, score-0.613]

60 1 Binary Data Interestingly, when the data are binary (before normalization), we can compute the collision probability exactly, which allows us to analytically assess the accuracy of the approximations. [sent-294, score-0.666]

61 For convenience, we define $a = |I_a|$, $b = |I_b|$, $c = |I_c|$, where $I_a = \{i \mid u_i > 0, v_i = 0\}$, $I_b = \{i \mid v_i > 0, u_i = 0\}$, $I_c = \{i \mid u_i > 0, v_i > 0\}$ (11) Assume binary data (before normalization, i. [sent-296, score-0.582]

62 That is, $u_i = \frac{1}{|I_a| + |I_c|} = \frac{1}{a + c}$, $\forall i \in I_a \cup I_c$, $v_i = \frac{1}{|I_b| + |I_c|} = \frac{1}{b + c}$, $\forall i \in I_b \cup I_c$ (12) The chi-square similarity $\rho_{\chi^2}$ becomes $\rho_{\chi^2} = \sum_{i=1}^D \frac{2 u_i v_i}{u_i + v_i} = \frac{2c}{a + b + 2c}$ and hence $\frac{\rho_{\chi^2}}{2 - 2\rho_{\chi^2}} = \frac{c}{a + b}$. [sent-299, score-0.636]

63 The solid curves (empirical probabilities expressed in terms of ρχ2 ) are the same solid curves labeled “χ2 ” in Figure 4. [sent-318, score-0.21]

64 The left panel shows that the second approximation (10) is more accurate in dense data. [sent-319, score-0.13]

65 The right panel illustrates that both approximations are accurate in sparse data. [sent-320, score-0.183]

66 When α = 1, the exact collision probability is $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y)) = \frac{1}{2} - \frac{2}{\pi^2} E\left\{ \tan^{-1}\left( \frac{c}{a} |R| \right) \tan^{-1}\left( \frac{c}{b} |R| \right) \right\}$ (13) where R is a standard Cauchy random variable. [sent-323, score-0.613]

67 This observation inspires us to propose the approximation (10): $P_{\chi^2(2)} = \frac{1}{2} - \frac{1}{\pi} E\left\{ \tan^{-1}\left( \frac{c}{a + b} |R| \right) \right\} = \frac{1}{2} - \frac{2}{\pi^2} \int_0^{\pi/2} \tan^{-1}\left( \frac{c}{a + b} \tan t \right) dt$ To validate this approximation for binary data, we study the difference between (13) and (10), i. [sent-325, score-0.559]
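For binary data, the exact expression (13) and the approximation (10) can be compared numerically. The sketch below uses the fact that, for a standard Cauchy R, |R| has the same distribution as tan(T) with T uniform on (0, π/2), so each expectation becomes an integral over t. The helper names, the midpoint rule, and the handling of the degenerate counts (c = 0, a = 0 or b = 0) are assumptions for this example.

```python
import math

def _atan_scaled(c, d, x):
    """atan((c / d) * x) for x > 0, with the limits 0 (c = 0) and pi/2 (d = 0, c > 0)."""
    if c == 0:
        return 0.0
    if d == 0:
        return math.pi / 2
    return math.atan((c / d) * x)

def _expect_abs_cauchy(f, n=4000):
    """E[f(|R|)] for standard Cauchy R, via |R| = tan(T), T ~ Uniform(0, pi/2); midpoint rule."""
    h = (math.pi / 2) / n
    return sum(f(math.tan((m + 0.5) * h)) for m in range(n)) * h * (2 / math.pi)

def exact_binary(a, b, c):
    """Exact collision probability (13) for binary data with counts a, b, c."""
    e = _expect_abs_cauchy(lambda r: _atan_scaled(c, a, r) * _atan_scaled(c, b, r))
    return 0.5 - (2 / math.pi ** 2) * e

def approx_binary(a, b, c):
    """Approximation (10) written in terms of c / (a + b)."""
    e = _expect_abs_cauchy(lambda r: _atan_scaled(c, a + b, r))
    return 0.5 - e / math.pi
```

When a = 0 or b = 0, one arctangent in (13) equals π/2 and the two expressions coincide, which follows directly from the two formulas; for other counts, calling both functions gives the approximation error.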

68 Interestingly, the errors of the collision probabilities based on two χ2 approximations are still very small. [sent-354, score-0.685]

69 To report the results, we apply sign Cauchy random projections $10^7$ times to evaluate the approximation errors of (9) and (10). [sent-355, score-0.703]

70 The results, as presented in Figure 7, again confirm that the upper bound $\frac{1}{\pi} \cos^{-1} \rho_1$ is not tight and both $\chi^2$ approximations, $P_{\chi^2(1)}$ and $P_{\chi^2(2)}$, are accurate. [sent-356, score-0.103]

71 Figure 7: Empirical collision probabilities for 3. [sent-376, score-0.616]

72 In the left panel, we plot the empirical collision probabilities against $\rho_1$ (lower, green if color is available) and $\rho_{\chi^2}$ (higher, red). [sent-378, score-0.616]

73 The curves confirm that the bound $\frac{1}{\pi} \cos^{-1} \rho_1$ is not tight (and the curve is not smooth). [sent-379, score-0.174]

74 We plot the two χ2 approximations as dashed curves which largely match the empirical probabilities plotted against ρχ2 , conﬁrming that the χ2 approximations are good. [sent-380, score-0.217]

75 For each (high-dimensional) data vector, using k sign Cauchy projections, we encode a negative sign as “01” and a positive as “10” (i. [sent-385, score-0.849]

76 Interestingly, this linear classifier approximates a nonlinear kernel classifier based on acos-$\chi^2$-kernel: $K(u, v) = 1 - \frac{1}{\pi} \cos^{-1} \rho_{\chi^2}$. [sent-391, score-0.111]
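To see why a linear classifier on the coded signs relates to this kernel, note that with the two-bit coding (negative as “01”, positive as “10”) the inner product of two coded vectors counts the projections whose signs agree. A small sketch, with illustrative function names:

```python
def expand_signs(signs):
    """Code each sign with two bits: negative -> (0, 1), positive -> (1, 0)."""
    bits = []
    for s in signs:
        bits.extend((1, 0) if s > 0 else (0, 1))
    return bits

def sign_kernel(signs_x, signs_y):
    """Normalized inner product of the coded vectors = fraction of agreeing signs."""
    fx, fy = expand_signs(signs_x), expand_signs(signs_y)
    return sum(p * q for p, q in zip(fx, fy)) / len(signs_x)
```

The returned value is one minus the empirical collision fraction, so under approximation (9) it is an estimate of $1 - \frac{1}{\pi} \cos^{-1} \rho_{\chi^2}$.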

77 The solid (black) curves are the accuracies using k sign Cauchy projections and LIBLINEAR. [sent-394, score-0.731]

78 The results conﬁrm that the linear kernel from sign Cauchy projections can approximate the nonlinear acos-χ2 -kernel. [sent-395, score-0.755]

79 Although our method does not directly approximate $\rho_{\chi^2}$, we can still estimate $\rho_{\chi^2}$ by assuming the collision probability is exactly $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y)) = \frac{1}{\pi} \cos^{-1} \rho_{\chi^2}$ and then we can feed the estimated $\rho_{\chi^2}$ values into LIBSVM “precomputed kernel” for classification. [sent-397, score-0.652]
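Inverting the assumed relation gives a simple plug-in estimate, $\hat{\rho}_{\chi^2} = \cos(\pi \hat{P})$, where $\hat{P}$ is the empirical fraction of differing signs. The sketch below is one way to write it; clamping $\hat{P}$ to [0, 0.5], so that the estimate stays in [0, 1] for nonnegative data, is an assumption of this example rather than something the text specifies.

```python
import math

def estimate_rho_chi2(collision_fraction):
    """Plug-in estimate of rho_chi2, assuming Pr(collision) = (1/pi) * acos(rho_chi2)."""
    p = min(0.5, max(0.0, collision_fraction))  # assumed clamp: keeps the estimate in [0, 1]
    return math.cos(math.pi * p)
```

The resulting pairwise estimates could then be assembled into a kernel matrix for a solver that accepts precomputed kernels.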

80 The dashed curves are the classiﬁcation results obtained using χ2 kernel and LIBSVM “precomputed kernel” functionality. [sent-400, score-0.177]

81 We apply k sign Cauchy projections and estimate $\rho_{\chi^2}$ assuming the collision probability is exactly $\frac{1}{\pi} \cos^{-1} \rho_{\chi^2}$ and then feed the estimated $\rho_{\chi^2}$ into LIBSVM again using the “precomputed kernel” functionality. [sent-401, score-1.296]

82 7 Conclusion The use of χ2 similarity is widespread in machine learning, especially when features are generated from histograms, as common in natural language processing and computer vision. [sent-402, score-0.14]

83 Computing all pairwise $\chi^2$ similarities can be time-consuming and in fact we usually cannot materialize an all-pairwise similarity matrix even if there are merely $10^6$ data points. [sent-406, score-0.12]

84 When data are generated in a streaming fashion, computing χ2 similarities without storing the original data will be even more challenging. [sent-410, score-0.156]

85 The method of α-stable random projections (0 < α ≤ 2) [11, 17] is popular for efﬁciently computing the lα distances in massive (streaming) data. [sent-411, score-0.289]

86 We propose sign stable random projections by storing only the signs (i. [sent-412, score-0.926]

87 For example, we can build hash tables using the bits to achieve sublinear time near neighbor search (although this paper does not focus on near neighbor search). [sent-417, score-0.158]

88 A crucial task in analyzing sign stable random projections is to study the probability of collision (i. [sent-419, score-1.426]

89 We derive a theoretical bound of the collision probability which is exact when α = 2. [sent-422, score-0.654]

90 Experiments on real and simulated data conﬁrm that our proposed χ2 approximations are very accurate. [sent-429, score-0.105]

91 We are enthusiastic about the practicality of sign stable projections in learning and search applications. [sent-430, score-0.813]

92 The previous idea of using the signs from normal random projections has been widely adopted in practice, for approximating correlations. [sent-431, score-0.321]

93 Given the widespread use of the χ2 similarity and the simplicity of our method, we expect the proposed method will be adopted by practitioners. [sent-432, score-0.117]

94 (i) The processing cost of conducting stable random projections can be dramatically reduced by very sparse stable random projections [16]. [sent-434, score-0.849]

95 (iii) Another interesting research would be to study the use of sign stable projections for sparse signal recovery (Compressed Sensing) with stable distributions [21]. [sent-439, score-1.025]

96 (iv) When α → 0, the collision probability becomes $\Pr(\mathrm{sign}(x) \ne \mathrm{sign}(y)) = \frac{1}{2} - \frac{1}{2}\,\mathrm{Resemblance}$, which provides an elegant mechanism for computing resemblance (of the binary-quantized data) in sparse data streams. [sent-440, score-0.731]
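The limiting relation can be illustrated by computing the resemblance of the binary-quantized data (the sets of nonzero coordinates) and plugging it into the formula. The function names and the convention for two all-zero vectors are assumptions for this example.

```python
def resemblance(u, v):
    """Resemblance (Jaccard similarity) of the nonzero coordinate sets of u and v."""
    nz_u = {i for i, x in enumerate(u) if x != 0}
    nz_v = {i for i, x in enumerate(v) if x != 0}
    union = nz_u | nz_v
    # Assumption: two all-zero vectors are treated as identical (resemblance 1).
    return len(nz_u & nz_v) / len(union) if union else 1.0

def limit_collision_probability(u, v):
    """Collision probability in the limit alpha -> 0: 1/2 - (1/2) * resemblance."""
    return 0.5 - 0.5 * resemblance(u, v)
```

Identical supports give a limiting collision probability of 0, and disjoint supports give 1/2.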

97 Stable distributions, pseudorandom generators, embeddings, and data stream computation. [sent-484, score-0.141]

98 A linear approximation to the χ2 kernel with geometric convergence. [sent-496, score-0.118]

99 Very sparse stable random projections for dimension reduction in lα (0 < α ≤ 2) norm. [sent-500, score-0.446]

100 Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections. [sent-503, score-0.169]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('collision', 0.58), ('sign', 0.41), ('projections', 0.234), ('cos', 0.21), ('vi', 0.187), ('ui', 0.179), ('stable', 0.169), ('cauchy', 0.158), ('tan', 0.153), ('ping', 0.138), ('pr', 0.1), ('similarity', 0.091), ('signs', 0.087), ('acc', 0.086), ('kernel', 0.084), ('stream', 0.074), ('precomputed', 0.073), ('ic', 0.071), ('classification', 0.069), ('liblinear', 0.068), ('libsvm', 0.067), ('panel', 0.065), ('dept', 0.061), ('ia', 0.058), ('ib', 0.058), ('gennady', 0.057), ('pems', 0.057), ('samorodnitsky', 0.057), ('curves', 0.053), ('streaming', 0.049), ('streams', 0.049), ('kernels', 0.046), ('resemblance', 0.046), ('compressed', 0.045), ('approximations', 0.044), ('sparse', 0.043), ('pd', 0.043), ('nonnegative', 0.042), ('classi', 0.041), ('vii', 0.041), ('curve', 0.041), ('bound', 0.041), ('dashed', 0.04), ('feed', 0.039), ('mnist', 0.039), ('bits', 0.039), ('tight', 0.039), ('projected', 0.038), ('acos', 0.038), ('pseudorandom', 0.038), ('rit', 0.038), ('turnstile', 0.038), ('sharp', 0.037), ('probabilities', 0.036), ('histograms', 0.035), ('concatenate', 0.034), ('approximation', 0.034), ('solid', 0.034), ('probability', 0.033), ('svm', 0.033), ('validate', 0.032), ('simulated', 0.032), ('li', 0.032), ('ithaca', 0.031), ('piotr', 0.031), ('hilbertian', 0.031), ('generators', 0.031), ('accurate', 0.031), ('fairly', 0.03), ('uit', 0.029), ('functionality', 0.029), ('popular', 0.029), ('data', 0.029), ('stoc', 0.028), ('nonlinear', 0.027), ('vedaldi', 0.027), ('san', 0.027), ('widespread', 0.026), ('francisco', 0.026), ('ca', 0.026), ('neighbor', 0.026), ('storing', 0.026), ('distances', 0.026), ('saving', 0.025), ('cornell', 0.025), ('nlp', 0.025), ('errors', 0.025), ('interestingly', 0.025), ('word', 0.025), ('jun', 0.024), ('analytically', 0.024), ('yj', 0.024), ('storage', 0.024), ('upper', 0.023), ('xj', 0.023), ('hash', 0.023), ('hashing', 0.023), ('generated', 0.023), ('near', 0.022), ('english', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

Author: Ping Li, Gennady Samorodnitsky, John Hopcroft

2 0.16757545 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

Author: Anshumali Shrivastava, Ping Li

Abstract: We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: $R^{3way} = \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|}$, $S_1, S_2, S_3 \in C$, where C is a size n collection of sets (or binary vectors). We show that approximate $R^{3way}$ similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of $R^{3way}$ search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of $R^{3way}$ resemblance over the pairwise case in improving retrieval quality.

1 Introduction and Motivation

Similarity search (near neighbor search) is one of the fundamental problems in Computer Science. The task is to identify a small set of data points which are “most similar” to a given input query. Similarity search algorithms have been one of the basic building blocks in numerous applications including search, databases, learning, recommendation systems, computer vision, etc. One widely used notion of similarity on sets is the Jaccard similarity or resemblance [5, 10, 18, 20]. Given two sets $S_1, S_2 \subseteq \Omega = \{0, 1, 2, ..., D - 1\}$, the resemblance $R^{2way}$ between $S_1$ and $S_2$ is defined as: $R^{2way} = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$. Existing notions of similarity in search problems mainly work with pairwise similarity functions. In this paper, we go beyond this notion and look at the problem of k-way similarity search, where the similarity function of interest involves k sets (k ≥ 2). Our work exploits the fact that resemblance can be naturally extended to k-way resemblance similarity [18, 21], defined over k sets $\{S_1, S_2, ..., S_k\}$ as $R^{k\text{-}way} = \frac{|S_1 \cap S_2 \cap ... \cap S_k|}{|S_1 \cup S_2 \cup ... \cup S_k|}$.

Binary high-dimensional data. The current web datasets are typically binary, sparse, and extremely high-dimensional, largely due to the wide adoption of the “Bag of Words” (BoW) representations for documents and images. It is often the case, in BoW representations, that just the presence or absence (0/1) of specific feature words captures sufficient information [7, 16, 20], especially with (e.g.,) 3-grams or higher-order models. And so, the web can be imagined as a giant storehouse of ultra high-dimensional sparse binary vectors. Of course, binary vectors can also be equivalently viewed as sets (containing locations of the nonzero features). We list four practical scenarios where k-way resemblance search would be a natural choice.

(i) Google Sets: (http://googlesystem.blogspot.com/2012/11/google-sets-still-available.html) Google Sets is among the earliest google projects, which allows users to generate list of similar words by typing only few related keywords. For example, if the user types “mazda” and “honda” the application will automatically generate related words like “bmw”, “ford”, “toyota”, etc. This application is currently available in google spreadsheet. If we assume the term document binary representation of each word w in the database, then given query $w_1$ and $w_2$, we show that $\frac{|w_1 \cap w_2 \cap w|}{|w_1 \cup w_2 \cup w|}$ turns out to be a very good similarity measure for this application (see Section 7.1).

(ii) Joint recommendations: Users A and B would like to watch a movie together. The profile of each person can be represented as a sparse vector over a giant universe of attributes. For example, a user profile may be the set of actors, actresses, genres, directors, etc, which she/he likes. On the other hand, we can represent a movie M in the database over the same universe based on attributes associated with the movie. If we have to recommend movie M, jointly to users A and B, then a natural measure to maximize is $\frac{|A \cap B \cap M|}{|A \cup B \cup M|}$. The problem of group recommendation [3] is applicable in many more settings such as recommending people to join circles, etc.

(iii) Improving retrieval quality: We are interested in finding images of a particular type of object, and we have two or three (possibly noisy) representative images. In such a scenario, a natural expectation is that retrieving images simultaneously similar to all the representative images should be more refined than just retrieving images similar to any one of them. In Section 7.2, we demonstrate that in cases where we have more than one element to search for, we can refine our search quality using k-way resemblance search. In a dynamic feedback environment [4], we can improve subsequent search quality by using k-way similarity search on the pages already clicked by the user.

(iv) Beyond pairwise clustering: While machine learning algorithms often utilize the data through pairwise similarities (e.g., inner product or resemblance), there are natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher [2, 30]. The computational cost, of course, will increase exponentially if we go beyond pairwise similarity.

Efficiency is crucial. With the data explosion in modern applications, the brute force way of scanning all the data for searching is prohibitively expensive, especially in user-facing applications like search. The need for k-way similarity search can only be fulfilled if it admits efficient algorithms. This paper fulfills this requirement for k-way resemblance and its derived similarities. In particular, we show fast algorithms with provable query time guarantees for approximate k-way resemblance search. Our algorithms and analysis naturally provide a framework to extend classical LSH framework [14, 13] to handle higher-order similarities, which could be of independent theoretical interest.
Organization. In Section 2, we review approximate near neighbor search and classical Locality Sensitive Hashing (LSH). In Section 3, we formulate the 3-way similarity search problems. Sections 4, 5, and 6 describe provable fast algorithms for several search problems. Section 7 demonstrates the applicability of 3-way resemblance search in real applications.

2 Classical c-NN and Locality Sensitive Hashing (LSH)

Initial attempts of finding efficient (sub-linear time) algorithms for exact near neighbor search, based on space partitioning, turned out to be a disappointment with the massive dimensionality of current datasets [11, 28]. Approximate versions of the problem were proposed [14, 13] to break the linear query time bottleneck. One widely adopted formalism is the c-approximate near neighbor (c-NN).

Definition 1 (c-Approximate Near Neighbor or c-NN). Consider a set of points, denoted by P, in a D-dimensional space $\mathbb{R}^D$, and parameters $R_0 > 0$, $\delta > 0$. The task is to construct a data structure which, given any query point q, if there exists an $R_0$-near neighbor of q in P, reports some $cR_0$-near neighbor of q in P with probability $1 - \delta$.

The usual notion of c-NN is for distance. Since we deal with similarities, we define $R_0$-near neighbor of point q as a point p with $Sim(q, p) \ge R_0$, where Sim is the similarity function of interest. Locality sensitive hashing (LSH) [14, 13] is a popular framework for c-NN problems. LSH is a family of functions, with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping $\mathbb{R}^D$ to some set S.

Definition 2 (Locality Sensitive Hashing (LSH)). A family H is called $(R_0, cR_0, p_1, p_2)$-sensitive if for any two points $x, y \in \mathbb{R}^D$, h chosen uniformly from H satisfies the following:
• if $Sim(x, y) \ge R_0$ then $\Pr_H(h(x) = h(y)) \ge p_1$
• if $Sim(x, y) \le cR_0$ then $\Pr_H(h(x) = h(y)) \le p_2$

For approximate nearest neighbor search typically, $p_1 > p_2$ and $c < 1$ is needed. Note, $c < 1$ as we are defining neighbors in terms of similarity. Basically, LSH trades off query time with extra preprocessing time and space which can be accomplished off-line.

Fact 1 Given a family of $(R_0, cR_0, p_1, p_2)$-sensitive hash functions, one can construct a data structure for c-NN with $O(n^\rho \log_{1/p_2} n)$ query time and space $O(n^{1+\rho})$, where $\rho = \frac{\log 1/p_1}{\log 1/p_2}$.

Minwise Hashing for Pairwise Resemblance. One popular choice of LSH family of functions associated with resemblance similarity is the Minwise Hashing family [5, 6, 13]. Minwise Hashing family applies an independent random permutation $\pi : \Omega \to \Omega$ on the given set $S \subseteq \Omega$, and looks at the minimum element under π, i.e. $\min(\pi(S))$. Given two sets $S_1, S_2 \subseteq \Omega = \{0, 1, 2, ..., D - 1\}$, it can be shown by elementary probability argument that

$\Pr(\min(\pi(S_1)) = \min(\pi(S_2))) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R^{2way}$. (1)

The recent work on b-bit minwise hashing [20, 23] provides an improvement by storing only the lowest b bits of the hashed values: $\min(\pi(S_1))$, $\min(\pi(S_2))$. [26] implemented the idea of building hash tables for near neighbor search, by directly using the bits from b-bit minwise hashing.

3 3-way Similarity Search Formulation

Our focus will remain on binary vectors which can also be viewed as sets. We illustrate our method using 3-way resemblance similarity function $Sim(S_1, S_2, S_3) = \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|}$. The algorithm and guarantees naturally extend to k-way resemblance. Given a size n collection $C \subseteq 2^\Omega$ of sets (or binary vectors), we are particularly interested in the following three problems:
1. Given two query sets $S_1$ and $S_2$, find $S_3 \in C$ that maximizes $Sim(S_1, S_2, S_3)$.
2. Given a query set $S_1$, find two sets $S_2, S_3 \in C$ maximizing $Sim(S_1, S_2, S_3)$.
3. Find three sets $S_1, S_2, S_3 \in C$ maximizing $Sim(S_1, S_2, S_3)$.

The brute force way of enumerating all possibilities leads to the worst case query time of $O(n)$, $O(n^2)$ and $O(n^3)$ for problem 1, 2 and 3, respectively. In a hope to break this barrier, just like the case of pairwise near neighbor search, we define the c-approximate ($c < 1$) versions of the above three problems. As in the case of c-NN, we are given two parameters $R_0 > 0$ and $\delta > 0$. For each of the following three problems, the guarantee is with probability at least $1 - \delta$:
1. (3-way c-Near Neighbor or 3-way c-NN) Given two query sets $S_1$ and $S_2$, if there exists $S_3 \in C$ with $Sim(S_1, S_2, S_3) \ge R_0$, then we report some $S_3' \in C$ so that $Sim(S_1, S_2, S_3') \ge cR_0$.
2. (3-way c-Close Pair or 3-way c-CP) Given a query set $S_1$, if there exists a pair of sets $S_2, S_3 \in C$ with $Sim(S_1, S_2, S_3) \ge R_0$, then we report sets $S_2', S_3' \in C$ so that $Sim(S_1, S_2', S_3') \ge cR_0$.
3. (3-way c-Best Cluster or 3-way c-BC) If there exist sets $S_1, S_2, S_3 \in C$ with $Sim(S_1, S_2, S_3) \ge R_0$, then we report sets $S_1', S_2', S_3' \in C$ so that $Sim(S_1', S_2', S_3') \ge cR_0$.

4 Sub-linear Algorithm for 3-way c-NN

The basic philosophy behind sub-linear search is bucketing, which allows us to preprocess dataset in a fashion so that we can filter many bad candidates without scanning all of them. LSH-based techniques rely on randomized hash functions to create buckets that probabilistically filter bad candidates. This philosophy is not restricted for binary similarity functions and is much more general. Here, we first focus on 3-way c-NN problem for binary data.

Theorem 1 For $R^{3way}$ c-NN one can construct a data structure with $O(n^\rho \log_{1/cR_0} n)$ query time and $O(n^{1+\rho})$ space, where $\rho = 1 - \frac{\log 1/c}{\log 1/c + \log 1/R_0}$.

The argument for 2-way resemblance can be naturally extended to k-way resemblance.
Specifically, given three sets $S_1, S_2, S_3 \subseteq \Omega$ and an independent random permutation $\pi : \Omega \to \Omega$, we have:

$\Pr(\min(\pi(S_1)) = \min(\pi(S_2)) = \min(\pi(S_3))) = R^{3way}$. (2)

Eq. (2) shows that minwise hashing, although it operates on sets individually, preserves all 3-way (in fact k-way) similarity structure of the data. The existence of such a hash function is the key requirement behind the existence of efficient approximate search. For the pairwise case, the probability event was a simple hash collision, and the min-hash itself serves as the bucket index. In case of 3-way (and higher) c-NN problem, we have to take care of a more complicated event to create an indexing scheme. In particular, during preprocessing we need to create buckets for each individual $S_3$, and while querying we need to associate the query sets $S_1$ and $S_2$ to the appropriate bucket. We need extra mechanisms to manipulate these minwise hashes to obtain a bucketing scheme.

Proof of Theorem 1: We use two additional functions: $f_1 : \Omega \to \mathbb{N}$ for manipulating $\min(\pi(S_3))$ and $f_2 : \Omega \times \Omega \to \mathbb{N}$ for manipulating both $\min(\pi(S_1))$ and $\min(\pi(S_2))$. Let $a \in \mathbb{N}^+$ such that $|\Omega| = D < 10^a$. We define $f_1(x) = (10^a + 1) \times x$ and $f_2(x, y) = 10^a x + y$. This choice ensures that given query $S_1$ and $S_2$, for any $S_3 \in C$, $f_1(\min(\pi(S_3))) = f_2(\min(\pi(S_1)), \min(\pi(S_2)))$ holds if and only if $\min(\pi(S_1)) = \min(\pi(S_2)) = \min(\pi(S_3))$, and thus we get a bucketing scheme. To complete the proof, we introduce two integer parameters K and L. Define a new hash function by concatenating K events. To be more precise, while preprocessing, for every element $S_3 \in C$ create buckets $g_1(S_3) = [f_1(h_1(S_3)); ...; f_1(h_K(S_3))]$ where $h_i$ is chosen uniformly from minwise hashing family. For given query points $S_1$ and $S_2$, retrieve only points in the bucket $g_2(S_1, S_2) = [f_2(h_1(S_1), h_1(S_2)); ...; f_2(h_K(S_1), h_K(S_2))]$. Repeat this process L times independently.
Any $S_3 \in C$ with $Sim(S_1, S_2, S_3) \ge R_0$ is retrieved with probability at least $1 - (1 - R_0^K)^L$. Using $K = \lceil \frac{\log n}{\log \frac{1}{cR_0}} \rceil$ and $L = \lceil n^{\rho} \log(\frac{1}{\delta}) \rceil$, where $\rho = 1 - \frac{\log 1/c}{\log 1/c + \log 1/R_0}$, the proof can be obtained using the standard concentration arguments used to prove Fact 1, see [14, 13]. It is worth noting that the probability guarantee parameter $\delta$ gets absorbed in the constants as $\log(\frac{1}{\delta})$. Note that the process is stopped as soon as we find some element with $R^{3way} \ge cR_0$.

Theorem 1 can be easily extended to k-way resemblance with the same query time and space guarantees. Note that k-way c-NN is at least as hard as $k^*$-way c-NN for any $k^* \le k$, because we can always choose $(k - k^* + 1)$ identical query sets in k-way c-NN, and it reduces to the $k^*$-way c-NN problem. So, any improvement in $R^{3way}$ c-NN implies an improvement in the classical min-hash LSH for Jaccard similarity. The proposed analysis is thus tight in this sense.

The above observation also makes it possible to perform the traditional pairwise c-NN search using the same hash tables deployed for 3-way c-NN. In the query phase we have an option: if we have two different queries $S_1, S_2$, then we retrieve from bucket $g_2(S_1, S_2)$, and that is the usual 3-way c-NN search. If we are just interested in pairwise near neighbor search given one query $S_1$, then we look into bucket $g_2(S_1, S_1)$, and we know that the 3-way resemblance between $S_1, S_1, S_3$ boils down to the pairwise resemblance between $S_1$ and $S_3$. So, the same hash tables can be used for both purposes. This property generalizes, and hash tables created for k-way c-NN can be used for any $k^*$-way similarity search so long as $k^* \le k$. The approximation guarantees still hold. This flexibility makes the k-way c-NN bucketing scheme more advantageous than the pairwise scheme.
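The bucketing scheme from the proof of Theorem 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: explicit random permutations stand in for the minwise hashing family, set elements are assumed to be integers in `[0, universe_size)`, the constant `BASE` plays the role of $10^a$, natural logarithms are assumed for $\log(1/\delta)$, and the names `ThreeWayIndex`, `choose_k_l`, `rho` and `r3way` are hypothetical.

```python
import math
import random
from collections import defaultdict

BASE = 10 ** 6  # stands in for 10^a; every min-hash value must be < BASE


def rho(c, r0):
    """Exponent rho = 1 - log(1/c) / (log(1/c) + log(1/R0))."""
    return 1.0 - math.log(1 / c) / (math.log(1 / c) + math.log(1 / r0))


def choose_k_l(n, c, r0, delta):
    """K and L as in the proof of Theorem 1 (natural log assumed for delta)."""
    k = math.ceil(math.log(n) / math.log(1 / (c * r0)))
    l = math.ceil(n ** rho(c, r0) * math.log(1 / delta))
    return k, l


def f1(x):
    # Applied to min(pi(S3)) at preprocessing time.
    return (BASE + 1) * x


def f2(x, y):
    # Applied to (min(pi(S1)), min(pi(S2))) at query time.
    return BASE * x + y


def r3way(s1, s2, s3):
    """Exact 3-way resemblance |S1 n S2 n S3| / |S1 u S2 u S3|."""
    return len(s1 & s2 & s3) / len(s1 | s2 | s3)


class ThreeWayIndex:
    """L hash tables, each keyed by K concatenated f1 values."""

    def __init__(self, universe_size, k, l, seed=0):
        if universe_size >= BASE:
            raise ValueError("universe_size must be smaller than BASE")
        rng = random.Random(seed)
        self.perms = []
        for _ in range(l):
            table_perms = []
            for _ in range(k):
                perm = list(range(universe_size))
                rng.shuffle(perm)
                table_perms.append(perm)
            self.perms.append(table_perms)
        self.tables = [defaultdict(list) for _ in range(l)]
        self.sets = []

    @staticmethod
    def _minhash(perm, s):
        return min(perm[x] for x in s)

    def add(self, s):
        idx = len(self.sets)
        self.sets.append(frozenset(s))
        for table, perms in zip(self.tables, self.perms):
            key = tuple(f1(self._minhash(p, s)) for p in perms)
            table[key].append(idx)

    def query(self, s1, s2, threshold):
        """Return the first stored set whose 3-way resemblance reaches threshold."""
        for table, perms in zip(self.tables, self.perms):
            key = tuple(
                f2(self._minhash(p, s1), self._minhash(p, s2)) for p in perms
            )
            for idx in table.get(key, ()):
                if r3way(s1, s2, self.sets[idx]) >= threshold:
                    return self.sets[idx]
        return None
```

Calling `query(s1, s1, threshold)` with the same set twice exercises the pairwise fallback described above, since the key then reduces to a plain min-hash match. In practice `threshold` would be set to $cR_0$ and `k`, `l` taken from `choose_k_l`.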
One peculiarity of LSH-based techniques is that the query complexity exponent $\rho < 1$ depends on the choice of the threshold $R_0$ we are interested in and on the value of c, which is the approximation ratio that we will tolerate. Figure 1 plots $\rho = 1 - \frac{\log 1/c}{\log 1/c + \log 1/R_0}$ with respect to c, for selected $R_0$ values from 0.01 to 0.99. For instance, if we are interested in highly similar pairs, i.e. $R_0 \approx 1$, then we are looking at near $O(\log n)$ query complexity for the c-NN problem as $\rho \approx 0$. On the other hand, for a very low threshold $R_0$, there is not much hope of time-saving because $\rho$ is close to 1.

Figure 1: $\rho = 1 - \frac{\log 1/c}{\log 1/c + \log 1/R_0}$ (vertical axis) against c (horizontal axis), one curve per $R_0$ from 0.01 to 0.99.

5 Other Efficient k-way Similarities

We refer to the k-way similarities for which there exist sub-linear algorithms for c-NN search with query and space complexity exactly as given in Theorem 1 as efficient. We have demonstrated the existence of one such efficient similarity, the k-way resemblance. This leads to a natural question: “Are there more of them?”. [9] analyzed all the transformations on similarities that preserve the existence of efficient LSH search. In particular, they showed that if S is a similarity for which there exists an LSH family, then there also exists an LSH family for any similarity which is a probability generating function (PGF) transformation on S. A PGF transformation on S is defined as $PGF(S) = \sum_{i=1}^{\infty} p_i S^i$, where $S \in [0, 1]$ and $p_i \ge 0$ satisfies $\sum_{i=1}^{\infty} p_i = 1$. A similar theorem can also be shown in the case of 3-way resemblance.

Theorem 2 Any PGF transformation on 3-way resemblance $R^{3way}$ is efficient.
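The statement of Theorem 2 can be checked numerically on a toy example. The sketch below assumes a PGF with finite support (weights `probs` for $i = 1, 2, 3$), uses explicit random permutations as the minwise hash functions, and samples the number of concatenated hashes $i$ with probability $p_i$; the function names are hypothetical and the simulation is only a sanity check, not part of the paper's proof.

```python
import random


def r3way(s1, s2, s3):
    """Exact 3-way resemblance of three sets."""
    return len(s1 & s2 & s3) / len(s1 | s2 | s3)


def pgf(r, probs):
    """PGF(r) = sum_i p_i * r^i, where probs[0] holds p_1."""
    return sum(p * r ** (i + 1) for i, p in enumerate(probs))


def sampled_collision(s1, s2, s3, universe, probs, rng):
    """Draw i with probability p_i, then require i independent 3-way min-hash matches."""
    i = rng.choices(range(1, len(probs) + 1), weights=probs)[0]
    for _ in range(i):
        order = list(universe)
        rng.shuffle(order)
        rank = {x: pos for pos, x in enumerate(order)}
        mins = {min(rank[x] for x in s) for s in (s1, s2, s3)}
        if len(mins) != 1:  # the three min-hashes are not all equal
            return False
    return True


def estimate_collision(s1, s2, s3, universe, probs, trials=20000, seed=0):
    """Monte Carlo estimate of the collision probability of the sampled hash pair."""
    rng = random.Random(seed)
    hits = sum(
        sampled_collision(s1, s2, s3, universe, probs, rng) for _ in range(trials)
    )
    return hits / trials
```

For example, with `s1 = {0, 1, 2, 3}`, `s2 = {1, 2, 3, 4}`, `s3 = {2, 3, 4, 5}` the 3-way resemblance is 1/3, and the estimate should land close to `pgf(1/3, probs)` for any valid weight vector.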
Recall that in the proof of Theorem 1, we created hash assignments $f_1(\min(\pi(S_3)))$ and $f_2(\min(\pi(S_1)), \min(\pi(S_2)))$, which lead to a bucketing scheme for the 3-way resemblance search, where the collision event $E = \{f_1(\min(\pi(S_3))) = f_2(\min(\pi(S_1)), \min(\pi(S_2)))\}$ happens with probability $Pr(E) = R^{3way}$. To prove Theorem 2, we need to create hash events having probability $PGF(R^{3way}) = \sum_{i=1}^{\infty} p_i (R^{3way})^i$. Note that $0 \le PGF(R^{3way}) \le 1$. We will make use of the following simple lemma.

Lemma 1 $(R^{3way})^n$ is efficient for all $n \in \mathbb{N}$.

Proof: Define new hash assignments $g_1^n(S_3) = [f_1(h_1(S_3)); ...; f_1(h_n(S_3))]$ and $g_2^n(S_1, S_2) = [f_2(h_1(S_1), h_1(S_2)); ...; f_2(h_n(S_1), h_n(S_2))]$. The collision event $g_1^n(S_3) = g_2^n(S_1, S_2)$ has probability $(R^{3way})^n$. We now use the pair $<g_1^n, g_2^n>$ instead of $<f_1, f_2>$ and obtain the same guarantees, as in Theorem 1, for $(R^{3way})^n$ as well.

Proof of Theorem 2: From Lemma 1, let $<g_1^i, g_2^i>$ be the hash pair corresponding to $(R^{3way})^i$ as used in the above lemma. We sample one hash pair from the set $\{<g_1^i, g_2^i> : i \in \mathbb{N}\}$, where the probability of sampling $<g_1^i, g_2^i>$ is proportional to $p_i$. Note that $p_i \ge 0$ and satisfies $\sum_{i=1}^{\infty} p_i = 1$, and so the above sampling is valid. It is not difficult to see that the collision of the sampled hash pair has probability exactly $\sum_{i=1}^{\infty} p_i (R^{3way})^i$.

Theorem 2 can be naturally extended to k-way similarity for any $k \ge 2$. Thus, we now have infinitely many k-way similarity functions admitting efficient sub-linear search. One that might be interesting, because of its radial-basis-kernel-like nature, is shown in the following corollary.

Corollary 1 $e^{R^{k\text{-way}} - 1}$ is efficient.

Proof: Use the expansion of $e^{R^{k\text{-way}}}$ normalized by e, that is, $e^{R^{k\text{-way}} - 1} = \sum_{i \ge 0} \frac{1}{e \, i!} (R^{k\text{-way}})^i$, whose coefficients are nonnegative and sum to 1, to see that $e^{R^{k\text{-way}} - 1}$ is a PGF on $R^{k\text{-way}}$.

6 Fast Algorithms for 3-way c-CP and 3-way c-BC Problems

For the 3-way c-CP and 3-way c-BC problems, using the bucketing scheme with the minwise hashing family will save even more computations.
Theorem 3 For the $R^{3way}$ c-Close Pair problem (or c-CP) one can construct a data structure with $O(n^{2\rho} \log_{1/(cR_0)} n)$ query time and $O(n^{1+2\rho})$ space, where $\rho = 1 - \frac{\log 1/c}{\log 1/c + \log 1/R_0}$.

Note that we can switch the roles of $f_1$ and $f_2$ in the proof of Theorem 1. We are thus left with a c-NN problem with search space $O(n^2)$ (all pairs) instead of n. A bit of analysis, similar to Theorem 1, shows that this procedure achieves the required query time $O(n^{2\rho} \log_{1/(cR_0)} n)$, but uses a lot more space, $O(n^{2(1+\rho)})$, than stated in the above theorem. It turns out that there is a better way of doing c-CP that saves space.

Proof of Theorem 3: We again start by constructing hash tables. For every element $S_c \in C$, we create a hash table and store $S_c$ in bucket $B(S_c) = [h_1(S_c); h_2(S_c); ...; h_K(S_c)]$, where $h_i$ is chosen uniformly from the minwise independent family of hash functions H. We create L such hash tables. For a query element $S_q$ we look for all pairs in bucket $B(S_q) = [h_1(S_q); h_2(S_q); ...; h_K(S_q)]$ and repeat this for each of the L tables. Note that we do not form pairs of elements retrieved from different tables, as they do not satisfy Eq. (2). If there exists a pair $S_1, S_2 \in C$ with $Sim(S_q, S_1, S_2) \ge R_0$, then using Eq. (2) we can see that we will find that pair in bucket $B(S_q)$ with probability $1 - (1 - R_0^K)^L$. Here, we cannot use the traditional choice of K and L, as we did in Theorem 1, because there are $O(n^2)$ instead of $O(n)$ possible pairs. We instead use $K = \lceil \frac{2 \log n}{\log \frac{1}{cR_0}} \rceil$ and $L = \lceil n^{2\rho} \log(\frac{1}{\delta}) \rceil$, with $\rho = 1 - \frac{\log 1/c}{\log 1/c + \log 1/R_0}$. With this choice of K and L, the result follows. Note that the process is stopped as soon as we find a pair $S_1$ and $S_2$ with $Sim(S_q, S_1, S_2) \ge cR_0$. The key argument that reduces the space from $O(n^{2(1+\rho)})$ to $O(n^{1+2\rho})$ is that we hash n points individually. Eq.
(2) makes it clear that hashing all possible pairs is not needed when every point can be processed individually, and the pairs formed within each bucket themselves filter out most of the unnecessary combinations.

Theorem 4 For the $R^{3way}$ c-Best Cluster problem (or c-BC) there exists an algorithm with running time $O(n^{1+2\rho} \log_{1/(cR_0)} n)$, where $\rho = 1 - \frac{\log 1/c}{\log 1/c + \log 1/R_0}$.

An argument similar to the one used in the proof of Theorem 3 leads to a running time of $O(n^{1+3\rho} \log_{1/(cR_0)} n)$, as we need $L = O(n^{3\rho})$ and we have to process all points at least once.

Proof of Theorem 4: Repeat the c-CP problem n times, with every element in the collection C acting as the query once. We use the same set of hash tables and hash functions every time. The preprocessing time is $O(n^{1+2\rho} \log_{1/(cR_0)} n)$ evaluations of hash functions and the total querying time is $O(n \times n^{2\rho} \log_{1/(cR_0)} n)$, which makes the total running time $O(n^{1+2\rho} \log_{1/(cR_0)} n)$.

For the k-way c-BC problem, we can achieve $O(n^{1+(k-1)\rho} \log_{1/(cR_0)} n)$ running time. If we are interested in very high similarity clusters, with $R_0 \approx 1$, then $\rho \approx 0$, and the running time is around $O(n \log n)$. This is a huge saving over the brute-force $O(n^k)$. In most practical cases, especially in the big-data regime where we have enormous amounts of data, we can expect the k-way similarity of good clusters to be high, and finding them should be efficient. We can see that with increasing k, hashing techniques save more computations.

7 Experiments

In this section, we demonstrate the usability of 3-way and higher-order similarity search using (i) Google Sets, and (ii) improving retrieval quality.

7.1 Google Sets: Generating Semantically Similar Words

Here, the task is to retrieve words which are “semantically” similar to the given set of query words. We collected 1.2 million random documents from Wikipedia and created a standard term-doc binary vector representation of each term present in the collected documents after removing standard stop words and punctuation marks.
More specifically, every word is represented as a 1.2 million dimensional binary vector indicating its presence or absence in the corresponding document. The total number of terms (or words) was around 60,000 in this experiment. Since there is no standard benchmark available for this task, we show qualitative evaluations. For querying, we used the following four pairs of semantically related words: (i) “jaguar” and “tiger”; (ii) “artificial” and “intelligence”; (iii) “milky” and “way”; (iv) “finger” and “lakes”. Given the query words $w_1$ and $w_2$, we compare the results obtained by the following four methods.

• Google Sets: We use Google’s algorithm and report 5 words from Google spreadsheets [1]. This is Google’s algorithm, which uses its own data.
• 3-way Resemblance (3-way): We use the 3-way resemblance $\frac{|w_1 \cap w_2 \cap w|}{|w_1 \cup w_2 \cup w|}$ to rank every word w and report the top 5 words based on this ranking.
• Sum Resemblance (SR): Another intuitive method is to use the sum of pairwise resemblances $\frac{|w_1 \cap w|}{|w_1 \cup w|} + \frac{|w_2 \cap w|}{|w_2 \cup w|}$ and report the top 5 words based on this ranking.
• Pairwise Intersection (PI): We first retrieve the top 100 words based on pairwise resemblance for each of $w_1$ and $w_2$ independently. We then report the words common to both. If there is no word in common, we do not report anything.

The results in Table 1 demonstrate that using 3-way resemblance retrieves reasonable candidates for these four queries. An interesting query is “finger” and “lakes”. Finger Lakes is a region in upstate New York. Google could only relate it to New York, while 3-way resemblance could even retrieve the names of cities and lakes in the region. Also, for the query “milky” and “way”, we can see some (perhaps) unrelated words like “dance” returned by Google. We do not see such random behavior with 3-way resemblance. Although we are not aware of the algorithm and the dataset used by Google, we can see that 3-way resemblance appears to be a suitable measure for this application.
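The difference between the 3-way and SR rankings described above can be illustrated on toy data. In this sketch each word is represented by the set of document ids it occurs in; the vocabulary, the word names and the helper functions are made up for illustration and are not the data or code used in the experiment.

```python
def resemblance(*sets):
    """k-way resemblance: size of the intersection over size of the union."""
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set(sets[0]).intersection(*sets[1:])
    return len(common) / len(union)


def rank_three_way(w1, w2, vocab, top=5):
    """Rank candidate words by 3-way resemblance with the two query words."""
    scores = {word: resemblance(w1, w2, docs) for word, docs in vocab.items()}
    return sorted(scores, key=lambda word: (-scores[word], word))[:top]


def rank_sum_resemblance(w1, w2, vocab, top=5):
    """Rank candidate words by the sum of the two pairwise resemblances (SR)."""
    scores = {
        word: resemblance(w1, docs) + resemblance(w2, docs)
        for word, docs in vocab.items()
    }
    return sorted(scores, key=lambda word: (-scores[word], word))[:top]


# Hypothetical document-id sets for two query words and two candidates.
query_1 = {1, 2, 3, 4, 5, 6}
query_2 = {5, 6, 7, 8}
toy_vocab = {
    "word_a": {5, 6, 9, 10},  # co-occurs with both query words
    "word_b": {1, 2, 3, 4},   # co-occurs only with the first query word
}
```

Here `word_b` wins under SR because its strong overlap with one query word dominates the sum, while the 3-way score of `word_b` is zero and `word_a` is ranked first. This mirrors the domination effect that the SR method is prone to.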
The above results also illustrate the problem with using the sum of pairwise similarity method. The similarity value with one of the words dominates the sum and hence we see for queries “artiﬁcial” and “intelligence” that all the retrieved words are mostly related to the word “intelligence”. Same is the case with query “ﬁnger” and “lakes” as well as “jaguar” and “tiger”. Note that “jaguar” is also a car brand. In addition, for all 4 queries, there was no common word in the top 100 words similar to the each query word individually and so PI method never returns anything. 6 Table 1: Top ﬁve words retrieved using various methods for different queries. “JAGUAR” AND “ TIGER” G OOGLE 3- WAY SR LION LEOPARD CHEETAH CAT DOG LEOPARD CHEETAH LION PANTHER CAT CAT LEOPARD LITRE BMW CHASIS “MILKY” AND “ WAY” G OOGLE 3- WAY SR DANCE STARS SPACE THE UNIVERSE GALAXY STARS EARTH LIGHT SPACE EVEN ANOTHER STILL BACK TIME PI — — — — — “ARTIFICIAL” AND “INTELLIGENCE” G OOGLE 3- WAY SR PI COMPUTER COMPUTER SECURITY — PROGRAMMING SCIENCE WEAPONS — INTELLIGENT SECRET — SCIENCE ROBOT HUMAN ATTACKS — ROBOTICS TECHNOLOGY HUMAN — PI — — — — — G OOGLE NEW YORK NY PARK CITY “FINGER” AND “LAKES” 3- WAY SR SENECA CAYUGA ERIE ROCHESTER IROQUOIS RIVERS FRESHWATER FISH STREAMS FORESTED PI — — — — — We should note the importance of the denominator term in 3-way resemblance, without which frequent words will be blindly favored. The exciting contribution of this paper is that 3-way resemblance similarity search admits provable sub-linear guarantees, making it an ideal choice. On the other hand, no such provable guarantees are known for SR and other heuristic based search methods. 7.2 Improving Retrieval Quality in Similarity Search We also demonstrate how the retrieval quality of traditional similarity search can be boosted by utilizing more query candidates instead of just one. 
For the evaluations we chose two public datasets, MNIST and WEBSPAM, which were used in a recent related paper [26] for near neighbor search with binary data using b-bit minwise hashing [20, 23]. The two datasets reflect the diversity, both in terms of task and scale, that is encountered in practice. The MNIST dataset consists of handwritten digit samples. Each sample is an image of 28 × 28 pixels, yielding a 784-dimensional vector with the associated class label (digit 0 to 9). We binarize the data by setting all nonzero values to 1. We used the standard partition of MNIST, which consists of 10,000 samples in one set and 60,000 in the other. The WEBSPAM dataset, with 16,609,143 features, consists of sparse vector representations of emails labeled as spam or not. We randomly sampled 70,000 data points and partitioned them into two independent sets of size 35,000 each.

Table 2: Percentage of top candidates with the same label as the query, retrieved using various similarity criteria. Higher values indicate better retrieval quality; the best value in every column is in the 4-way row.

               MNIST                          WEBSPAM
TOP            1      10     20     50        1      10     20     50
Pairwise       94.20  92.33  91.10  89.06     98.45  96.94  96.46  95.12
3-way NNbor    96.90  96.13  95.36  93.78     99.75  98.68  97.80  96.11
4-way NNbor    97.70  96.89  96.28  95.10     99.90  98.87  98.15  96.45

For evaluation, we need to generate potential similarity search query candidates for k-way search. It makes no sense to search for an object simultaneously similar to two very different objects. To generate such query candidates, we took one independent set of the data and partitioned it according to the class labels. We then ran a cheap k-means clustering on each class, and randomly sampled triplets $<x_1, x_2, x_3>$ from each cluster for evaluating 2-way, 3-way and 4-way similarity search. For the MNIST dataset, the standard 10,000-sample test set was partitioned according to the labels into 10 sets, each partition was then clustered into 10 clusters, and we chose 10 triplets randomly from each cluster.
In all we had 100 such triplets for each class, and thus 1000 query triplets overall. For WEBSPAM, which consists of only 2 classes, we chose one of the independent sets and performed the same procedure. We selected 100 triplets from each cluster. We thus have 1000 triplets from each class, for a total of 2000 query candidates. The above procedures ensure that the elements in each triplet $<x_1, x_2, x_3>$ are not very far from each other and have the same class label. For each triplet $<x_1, x_2, x_3>$, we sort all the points x in the other independent set based on the following:

• Pairwise: We only use the information in $x_1$ and rank x based on the resemblance $\frac{|x_1 \cap x|}{|x_1 \cup x|}$.
• 3-way NN: We rank x based on the 3-way resemblance $\frac{|x_1 \cap x_2 \cap x|}{|x_1 \cup x_2 \cup x|}$.
• 4-way NN: We rank x based on the 4-way resemblance $\frac{|x_1 \cap x_2 \cap x_3 \cap x|}{|x_1 \cup x_2 \cup x_3 \cup x|}$.

We look at the top 1, 10, 20 and 50 points based on the orderings described above. Since all the elements of a query triplet have the same label, the percentage of top retrieved candidates having the same label as the query items is a natural metric for evaluating the retrieval quality. These percentages, accumulated over all the triplets, are summarized in Table 2. We can see that the top candidates retrieved by 3-way resemblance, using 2 query points, are of better quality than those from vanilla pairwise similarity search. Also, 4-way resemblance, with 3 query points, further improves the results compared to 3-way resemblance. This demonstrates that multi-way resemblance similarity search is more desirable whenever we have more than one representative query in mind. Note that, for MNIST, which contains 10 classes, the boost compared to pairwise retrieval is substantial. The results follow a consistent trend.
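The evaluation metric above can be sketched as follows. The sketch assumes binary data given as Python sets of nonzero feature indices and a list of `(features, label)` candidates; the function names and the toy data are hypothetical, and an exhaustive sort is used in place of any hashing-based index.

```python
def kway_resemblance(query_sets, x):
    """Resemblance of the query sets together with one candidate set x."""
    all_sets = list(query_sets) + [x]
    union = set().union(*all_sets)
    if not union:
        return 0.0
    common = set(all_sets[0]).intersection(*all_sets[1:])
    return len(common) / len(union)


def top_t_label_agreement(query_sets, query_label, candidates, t):
    """Percentage of the top-t ranked candidates sharing the query label.

    Passing one, two or three query sets gives the Pairwise, 3-way NN and
    4-way NN orderings, respectively.
    """
    ranked = sorted(
        candidates,
        key=lambda item: kway_resemblance(query_sets, item[0]),
        reverse=True,
    )
    top = ranked[:t]
    matches = sum(1 for _, label in top if label == query_label)
    return 100.0 * matches / len(top)


# Toy candidates: (set of nonzero feature indices, class label).
toy_candidates = [
    ({2, 3}, 0),
    ({2, 3, 5}, 0),
    ({7, 8}, 1),
    ({1, 9}, 1),
]
toy_queries = [{1, 2, 3}, {2, 3, 4}]  # two query points, i.e. 3-way search
```

Averaging `top_t_label_agreement` over all query triplets for t in {1, 10, 20, 50} would give numbers of the kind reported in Table 2, under the assumptions stated here.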
8 Future Work

While the work presented in this paper is promising for efficient 3-way and k-way similarity search in binary high-dimensional data, there are numerous interesting and practical research problems we can study as future work. In this section, we mention a few such examples.

One-permutation hashing. Traditionally, building hash tables for near neighbor search required many (e.g., 1000) independent hashes. This is both time- and energy-consuming, not only for building tables but also for processing unseen queries. One permutation hashing [22] provides the hope of reducing many permutations to merely one. The version in [22], however, was not applicable to near neighbor search due to the existence of many empty bins (which offer no indexing capability). The most recent work [27] is able to fill the empty bins and works well for pairwise near neighbor search. It will be interesting to extend [27] to k-way search.

Non-binary sparse data. This paper focuses on minwise hashing for binary data. Various extensions to real-valued data are possible. For example, our results naturally apply to consistent weighted sampling [25, 15], which is one way to handle non-binary sparse data. The problem, however, is not solved if we are interested in similarities such as (normalized) k-way inner products, although the line of work on Conditional Random Sampling (CRS) [19, 18] may be promising. CRS works on non-binary sparse data by storing a bottom subset of nonzero entries after applying one permutation to the (real-valued) sparse data matrix. CRS performs very well for certain applications, but it does not work in our context because the bottom (nonzero) subsets are not properly aligned.

Building hash tables by directly using bits from minwise hashing. This will be a different approach from the way the hash tables are constructed in this paper.
For example, [26] directly used the bits from b-bit minwise hashing [20, 23] to build hash tables and demonstrated significant advantages compared to sim-hash [8, 12] and spectral hashing [29]. It would be interesting to see the performance of this approach in k-way similarity search.

k-Way sign random projections. It would be very useful to develop theory for k-way sign random projections. For usual (real-valued) random projections, it is known that the volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might also be a (monotonic) function of the determinant.

9 Conclusions

We formulate a new framework for k-way similarity search and obtain fast algorithms in the case of k-way resemblance with provable worst-case approximation guarantees. We show some applications of k-way resemblance search in practice and demonstrate the advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go.

Acknowledgement

The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research.

References

[1] http://www.howtogeek.com/howto/15799/how-to-use-autofill-on-a-google-docs-spreadsheet-quick-tips/.
[2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005.
[3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawla, Gautam Das, and Cong Yu. Group recommendation: semantics and efficiency. Proc. VLDB Endow., 2(1):754–765, 2009.
[4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011.
[5] Andrei Z. Broder.
On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.
[6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998.
[7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999.
[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[9] Flavio Chierichetti and Ravi Kumar. LSH-preserving functions and their applications. In SODA, 2012.
[10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.
[11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.
[12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995.
[13] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012.
[14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.
[15] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, 2010.
[16] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007.
[17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Technical report, arXiv:1207.6083, 2013.
[18] Ping Li and Kenneth W. Church.
A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (Preliminary results appeared in HLT/EMNLP 2005), 33(3):305–354, 2007.
[19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, Canada, 2006.
[20] Ping Li and Arnd Christian König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010.
[21] Ping Li, Arnd Christian König, and Wenhao Gui. b-bit minwise hashing for estimating three-way similarities. In NIPS, Vancouver, Canada, 2010.
[22] Ping Li, Art B Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.
[23] Ping Li, Anshumali Shrivastava, and Arnd Christian König. b-bit minwise hashing in practice. In Internetware, Changsha, China, 2013.
[24] Avner Magen and Anastasios Zouzias. Near optimal dimensionality reductions that preserve volumes. In APPROX / RANDOM, pages 523–534, 2008.
[25] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
[26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012.
[27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.
[28] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.
[29] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, Vancouver, Canada, 2008.
[30] D. Zhou, J. Huang, and B. Schölkopf. Beyond pairwise classification and clustering using hypergraphs. In NIPS, Vancouver, Canada, 2006.

3 0.12511095 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

Author: Martin Azizyan, Aarti Singh, Larry Wasserman

Abstract: While several papers have investigated computationally and statistically efﬁcient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efﬁcient procedure. Our results provide the ﬁrst step of a theoretical basis for recent methods that combine feature selection and clustering. 1

4 0.11032597 326 nips-2013-The Power of Asymmetry in Binary Hashing

Author: Behnam Neyshabur, Nati Srebro, Ruslan Salakhutdinov, Yury Makarychev, Payman Yadollahpour

Abstract: When approximating binary similarity using the Hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps, i.e., by approximating the similarity between x and x' as the Hamming distance between f(x) and g(x'), for two distinct binary codes f, g, rather than as the Hamming distance between f(x) and f(x').

5 0.095757149 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

Author: Pablo Sprechmann, Roee Litman, Tal Ben Yakar, Alexander M. Bronstein, Guillermo Sapiro

Abstract: In this paper, we propose a new computationally efficient framework for learning sparse models. We formulate a unified approach that contains as particular cases models promoting sparse synthesis and analysis type of priors, and mixtures thereof. The supervised training of the proposed model is formulated as a bilevel optimization problem, in which the operators are optimized to achieve the best possible performance on a specific task, e.g., reconstruction or classification. By restricting the operators to be shift invariant, our approach can be thought of as a way of learning sparsity-promoting convolutional operators. Leveraging recent ideas on fast trainable regressors designed to approximate exact sparse codes, we propose a way of constructing feed-forward networks capable of approximating the learned models at a fraction of the computational cost of exact solvers. In the shift-invariant case, this leads to a principled way of constructing a form of task-specific convolutional networks. We illustrate the proposed models on several experiments in music analysis and image processing applications.

6 0.09517128 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning

7 0.085022837 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

8 0.081579968 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition

9 0.073076695 173 nips-2013-Least Informative Dimensions

10 0.066778921 171 nips-2013-Learning with Noisy Labels

11 0.064058729 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions

12 0.060969759 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

13 0.05908991 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

14 0.05727642 245 nips-2013-Pass-efficient unsupervised feature selection

15 0.057276361 185 nips-2013-Matrix Completion From any Given Set of Observations

16 0.056859616 289 nips-2013-Scalable kernels for graphs with continuous attributes

17 0.05476461 137 nips-2013-High-Dimensional Gaussian Process Bandits

18 0.053409744 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

19 0.05338024 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

20 0.053071599 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.147), (1, 0.064), (2, 0.005), (3, 0.012), (4, 0.032), (5, 0.027), (6, -0.024), (7, -0.018), (8, -0.081), (9, 0.069), (10, -0.029), (11, -0.021), (12, -0.024), (13, 0.008), (14, 0.045), (15, 0.059), (16, -0.004), (17, 0.067), (18, 0.006), (19, 0.096), (20, -0.039), (21, -0.068), (22, -0.005), (23, 0.135), (24, 0.004), (25, -0.101), (26, -0.033), (27, -0.033), (28, 0.063), (29, -0.099), (30, -0.065), (31, 0.008), (32, 0.051), (33, 0.007), (34, -0.025), (35, -0.054), (36, -0.038), (37, 0.089), (38, -0.021), (39, -0.147), (40, 0.063), (41, 0.099), (42, 0.052), (43, 0.028), (44, 0.105), (45, 0.082), (46, 0.035), (47, 0.111), (48, -0.053), (49, -0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9445501 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

Author: Ping Li, Gennady Samorodnitsk, John Hopcroft

2 0.72606176 326 nips-2013-The Power of Asymmetry in Binary Hashing

Author: Behnam Neyshabur, Nati Srebro, Ruslan Salakhutdinov, Yury Makarychev, Payman Yadollahpour

3 0.71947056 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

Author: Anshumali Shrivastava, Ping Li

4 0.5552054 245 nips-2013-Pass-efficient unsupervised feature selection

Author: Crystal Maung, Haim Schweitzer

Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modiﬁcation of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the ﬁnal selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efﬁciency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1

5 0.49130401 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman

Abstract: We consider the problem of sampling from a probability distribution deﬁned over a high-dimensional discrete set, speciﬁed for instance by a graphical model. We propose a sampling algorithm, called PAWS, based on embedding the set into a higher-dimensional space which is then randomly projected using universal hash functions to a lower-dimensional subspace and explored using combinatorial search methods. Our scheme can leverage fast combinatorial optimization tools as a blackbox and, unlike MCMC methods, samples produced are guaranteed to be within an (arbitrarily small) constant factor of the true probability distribution. We demonstrate that by using state-of-the-art combinatorial search tools, PAWS can efﬁciently sample from Ising grids with strong interactions and from software veriﬁcation instances, while MCMC and variational methods fail in both cases. 1

6 0.46318781 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks

7 0.45790458 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

8 0.450748 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

9 0.42965329 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

10 0.4149605 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces

11 0.40619424 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation

12 0.40523216 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

13 0.40402943 355 nips-2013-Which Space Partitioning Tree to Use for Search?

14 0.39343429 327 nips-2013-The Randomized Dependence Coefficient

15 0.39225507 173 nips-2013-Least Informative Dimensions

16 0.39007136 65 nips-2013-Compressive Feature Learning

17 0.38822809 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

18 0.38756752 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

19 0.38235793 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

20 0.38227329 247 nips-2013-Phase Retrieval using Alternating Minimization

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.013), (16, 0.042), (19, 0.011), (21, 0.017), (33, 0.137), (34, 0.143), (41, 0.025), (49, 0.026), (56, 0.121), (70, 0.029), (74, 0.204), (85, 0.045), (89, 0.023), (93, 0.049), (95, 0.023), (99, 0.016)]
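
The topic weights above form a sparse vector indexed by topicId. The page does not say how the simValue scores in the list below are derived from such vectors, so the following is only a minimal sketch of one common choice, cosine similarity. The function name `cosine_similarity` and the weights in `other_paper` are assumptions made up for illustration; they are not taken from this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {topicId: weight}."""
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Topic weights listed above for this paper.
this_paper = dict([
    (2, 0.013), (16, 0.042), (19, 0.011), (21, 0.017), (33, 0.137),
    (34, 0.143), (41, 0.025), (49, 0.026), (56, 0.121), (70, 0.029),
    (74, 0.204), (85, 0.045), (89, 0.023), (93, 0.049), (95, 0.023),
    (99, 0.016),
])

# Hypothetical topic weights for another paper (invented for this example).
other_paper = {5: 0.30, 33: 0.20, 74: 0.15}

print(cosine_similarity(this_paper, other_paper))
```

Under plain cosine similarity a paper compared with itself scores 1. The same-paper row in the list below has a lower simValue, so the listed scores are presumably computed in some other way.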

similar papers list:

simIndex simValue paperId paperTitle

1 0.89097029 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering

Author: Shinichi Nakajima, Akiko Takeda, S. Derin Babacan, Masashi Sugiyama, Ichiro Takeuchi

Abstract: When a probabilistic model and its prior are given, Bayesian learning offers inference with automatic parameter tuning. However, Bayesian learning is often obstructed by computational difficulty: the rigorous Bayesian learning is intractable in many models, and its variational Bayesian (VB) approximation is prone to suffer from local minima. In this paper, we overcome this difficulty for low-rank subspace clustering (LRSC) by providing an exact global solver and its efficient approximation. LRSC extracts a low-dimensional structure of data by embedding samples into the union of low-dimensional subspaces, and its variational Bayesian variant has shown good performance. We first prove a key property that the VB-LRSC model is highly redundant. Thanks to this property, the optimization problem of VB-LRSC can be separated into small subproblems, each of which has only a small number of unknown variables. Our exact global solver relies on another key property that the stationary condition of each subproblem consists of a set of polynomial equations, which is solvable with the homotopy method. For further computational efficiency, we also propose an efficient approximate variant, of which the stationary condition can be written as a polynomial equation with a single variable. Experimental results show the usefulness of our approach.

2 0.86729831 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data

Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko

Abstract: We consider the problem of embedding entities and relationships of multi-relational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples.

same-paper 3 0.85001314 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

Author: Ping Li, Gennady Samorodnitsky, John Hopcroft

4 0.84365028 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

Author: Cho-Jui Hsieh, Matyas A. Sustik, Inderjit Dhillon, Pradeep Ravikumar, Russell Poldrack

Abstract: The ℓ1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BIGQUIC, which can solve 1 million dimensional ℓ1-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BIGQUIC can achieve super-linear or even quadratic convergence rates.

5 0.84359306 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

Author: Michalis Titsias, Miguel Lazaro-Gredilla

Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs.

6 0.83948678 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

7 0.75956666 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

8 0.75568908 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

9 0.75534725 201 nips-2013-Multi-Task Bayesian Optimization

10 0.7547226 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

11 0.75437242 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

12 0.75289822 173 nips-2013-Least Informative Dimensions

13 0.75253403 5 nips-2013-A Deep Architecture for Matching Short Texts

14 0.75235713 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

15 0.75223249 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

16 0.75178862 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

17 0.75036454 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

18 0.74983388 348 nips-2013-Variational Policy Search via Trajectory Optimization

19 0.74928981 318 nips-2013-Structured Learning via Logistic Regression

20 0.74791634 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models