nips nips2013 nips2013-294 knowledge-graph by maker-knowledge-mining

294 nips-2013-Similarity Component Analysis


Source: pdf

Author: Soravit Changpinyo, Kuan Liu, Fei Sha

Abstract: Measuring similarity is crucial to many learning tasks. To this end, metric learning has been the dominant paradigm. However, similarity is a richer and broader notion than what metrics entail. For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. We derive an EM-based algorithm for fitting the model parameters with similarity-annotated data from pairwise comparisons. We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. We also apply SCA to a multiway classification task and a link prediction task. For both tasks, SCA attains significantly better prediction accuracies than competing methods. Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components’ local similarity values. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Measuring similarity is crucial to many learning tasks. [sent-10, score-0.232]

2 However, similarity is a richer and broader notion than what metrics entail. [sent-12, score-0.362]

3 For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. [sent-13, score-0.555]

4 In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. [sent-14, score-0.231]

5 In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. [sent-15, score-0.447]

6 The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. [sent-16, score-0.496]

7 We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. [sent-18, score-0.179]

8 We also apply SCA to a multiway classification task and a link prediction task. [sent-19, score-0.145]

9 For both tasks, SCA attains significantly better prediction accuracies than competing methods. [sent-20, score-0.126]

10 Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components’ local similarity values. [sent-21, score-0.385]

11 1 Introduction Learning how to measure similarity (or dissimilarity) is a fundamental problem in machine learning. [sent-22, score-0.232]

12 If we parameterize the desired dissimilarity measure in the form of a metric function, the resulting learning problem is often referred to as metric learning. [sent-24, score-0.218]

13 Representing objects as points in high-dimensional feature spaces, most metric learning algorithms assume that the same set of features contributes indistinguishably to assessing similarity. [sent-33, score-0.128]

14 SCA has K latent components which give rise to local similarity values sk conditioned on a pair of data points xm and xn. [sent-44, score-0.661]

15 In contrast, similarity can arise from a complex aggregation of comparing data instances on multiple subsets of features, to which we refer as latent components. [sent-49, score-0.372]

16 For an arbitrary pair of songs, we can rate the similarity between them based on one of the many components or an arbitrary subset of components, while ignoring the rest. [sent-51, score-0.325]

17 Multi-component based similarity exists also in other types of data. [sent-53, score-0.232]

18 In this paper, we propose Similarity Component Analysis (SCA) to model the richer similarity relationships beyond what current metric learning algorithms can offer. [sent-59, score-0.349]

19 The similarity (node s) is modeled as a probabilistic combination of multiple latent components. [sent-62, score-0.391]

20 Each latent component (sk) assigns a local similarity value indicating whether or not two objects are similar, inferring from only an unknown subset of features. [sent-63, score-0.447]

21 The (local) similarity values of those latent components are aggregated with a (noisy-) OR model. [sent-64, score-0.428]

22 Two objects are likely to be dissimilar if none of the components voices up. [sent-66, score-0.137]

23 We derive an EM-based algorithm for fitting the model with data annotated with similarity relationships. [sent-67, score-0.232]

24 The algorithm infers the intermediate similarity values of latent components and identifies the parameters for the (noisy-)OR model, as well as each latent component’s conditional distribution, by maximizing the likelihood of the training data. [sent-68, score-0.549]

25 On synthetic data where ground-truth is available, we confirm SCA’s ability in discovering latent components and their corresponding subsets of features. [sent-70, score-0.221]

26 On a multiway classification task, we contrast SCA to state-of-the-art metric learning algorithms and demonstrate SCA’s superior performance in classifying data samples. [sent-71, score-0.133]

27 Finally, we use SCA to model the network link structures among research articles published at NIPS proceedings. [sent-72, score-0.167]

28 We also conduct extensive analysis on how learned latent components effectively represent link structures. [sent-74, score-0.274]

29 2 Approach We start by describing in detail Similarity Component Analysis (SCA), a Bayesian network for modeling similarity between two objects. [sent-78, score-0.277]

30 2.1 Probabilistic model of similarity In what follows, let (u, v, s) denote a pair of D-dimensional data points u, v ∈ R^D and their associated similarity value s ∈ {DISSIMILAR, SIMILAR}, or {0, 1} accordingly. [sent-81, score-0.497]

31 In SCA, we assume that p(s|u, v) is a mixture of multiple latent components' local similarity values. [sent-85, score-0.404]

32 Each latent component evaluates its similarity value independently, using only a subset of the D features. [sent-86, score-0.415]

33 Latent components Formally, let u[k] denote the subset of features from u corresponding to the k-th latent component, where [k] ⊂ {1, 2, ..., D}.

34 The similarity assessment sk of this component alone is determined by the distance between u[k] and v[k]: $d_k = (\mathbf{u} - \mathbf{v})^\top \mathbf{M}_k (\mathbf{u} - \mathbf{v})$ (1), where $\mathbf{M}_k \succeq 0$ is a D × D positive semidefinite matrix, used to measure the distance more flexibly than the standard Euclidean metric. [sent-92, score-0.515]

35 The distance dk is transformed into the probability of the Bernoulli variable sk according to $P(s_k = 1 \mid \mathbf{u}, \mathbf{v}) = (1 + e^{-b_k})\,[1 - \sigma(d_k - b_k)]$ (2), where $\sigma(t) = (1 + e^{-t})^{-1}$ is the sigmoid function and $b_k$ is a bias term. [sent-96, score-0.324]

36 Intuitively, when the (biased) distance (dk − bk) is large, sk is less likely to be 1 and the two data points are regarded as less similar. [sent-97, score-0.244]
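
As a rough illustration of Eqs. (1)-(2), the short Python sketch below computes one component's local similarity probability; the function name, the toy data, and the way the PSD metric is constructed are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def local_similarity_prob(u, v, M_k, b_k):
    """P(s_k = 1 | u, v) for one latent component, following Eqs. (1)-(2).

    M_k : (D, D) positive semidefinite metric of the k-th component
    b_k : scalar bias term
    """
    diff = u - v
    d_k = diff @ M_k @ diff                          # Eq. (1): Mahalanobis-like distance
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    # Eq. (2): a large biased distance (d_k - b_k) gives a small probability of similarity
    return (1.0 + np.exp(-b_k)) * (1.0 - sigmoid(d_k - b_k))

# toy usage with a random PSD metric (assumed data, for illustration only)
rng = np.random.default_rng(0)
D = 30
A = rng.normal(size=(D, 6))
M_k = A @ A.T                                        # PSD by construction
u, v = rng.normal(size=D), rng.normal(size=D)
print(local_similarity_prob(u, v, M_k, b_k=1.0))
```

Note that the prefactor (1 + e^{-b_k}) normalizes the probability to 1 when the distance d_k is zero.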

37 Combining local similarities Assume that there are K latent components. [sent-99, score-0.192]

38 How can we combine all the local similarity assessments? [sent-100, score-0.264]

39 Namely, $P(s = 1 \mid s_1, s_2, \cdots, s_K) = 1 - \prod_{k=1}^{K} \mathbb{I}[s_k = 0]$ (3). Thus, the two data points are similar (s = 1) if at least one of the aspects deems so, corresponding to sk = 1 for a particular k. [sent-102, score-0.15]

40 The noisy-OR model captures precisely this notion: $P(s = 1 \mid s_1, s_2, \cdots, s_K) = 1 - \prod_{k=1}^{K} \theta_k^{\mathbb{I}[s_k = 1]}$ (5), where the more components with sk = 1, the lower the false-negative rate after combination. [sent-106, score-0.135]
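
The following sketch shows one way to implement the OR gate of Eq. (3) and the noisy-OR gate of Eq. (5). Marginalizing over independent Bernoulli local votes (rather than conditioning on observed s_k) is an assumption made here so that the functions are self-contained predictors.

```python
import numpy as np

def or_combination(p_local):
    """Deterministic OR gate (Eq. 3), marginalized over independent Bernoulli s_k:
    P(s = 1) = 1 - prod_k (1 - p_k)."""
    p_local = np.asarray(p_local)
    return 1.0 - np.prod(1.0 - p_local)

def noisy_or_combination(p_local, theta):
    """Noisy-OR gate (Eq. 5), marginalized over the latent s_k.

    theta[k] is the false-negative rate of component k: even when s_k = 1,
    that component's 'similar' vote is lost with probability theta[k].
    """
    p_local, theta = np.asarray(p_local), np.asarray(theta)
    return 1.0 - np.prod(1.0 - (1.0 - theta) * p_local)

p = [0.1, 0.8, 0.05]
print(or_combination(p))                   # high: one component strongly votes 'similar'
print(noisy_or_combination(p, [0.3, 0.3, 0.3]))
```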

41 2.2 Inference and learning Given an annotated training dataset D = {(xm, xn, smn)}, we learn the parameters, which include all the positive semidefinite matrices Mk, the biases bk, and the false-negative rates θk (if noisy-OR is used), by maximizing the likelihood of D. [sent-114, score-0.133]
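
Below is a minimal sketch of the marginal log-likelihood of an annotated pair that such a fitting procedure would maximize. It assumes the noisy-OR marginalization shown above, and the parameter containers (lists Ms, bs, thetas) are hypothetical stand-ins, not the paper's EM code.

```python
import numpy as np

def pair_log_likelihood(u, v, s, Ms, bs, thetas, eps=1e-12):
    """log P(s | u, v) for one annotated pair under a noisy-OR SCA model.

    Ms     : list of K (D, D) PSD metrics
    bs     : length-K list of biases
    thetas : length-K list of false-negative rates
    """
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    diff = u - v
    p_dissimilar = 1.0
    for M_k, b_k, th_k in zip(Ms, bs, thetas):
        d_k = diff @ M_k @ diff
        p_k = (1.0 + np.exp(-b_k)) * (1.0 - sigmoid(d_k - b_k))   # Eq. (2)
        p_dissimilar *= 1.0 - (1.0 - th_k) * p_k                  # noisy-OR factor
    p = 1.0 - p_dissimilar if s == 1 else p_dissimilar
    return np.log(max(p, eps))

def dataset_log_likelihood(pairs, Ms, bs, thetas):
    """Objective over D = {(x_m, x_n, s_mn)}; an EM or gradient routine would maximize this."""
    return sum(pair_log_likelihood(u, v, s, Ms, bs, thetas) for u, v, s in pairs)
```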

42 2.3 Extensions Variants to local similarity models The choice of using logistic-like functions eq. [sent-138, score-0.264]

43 (2) for modeling local similarity of the latent components is orthogonal to how those similarities are combined in eq. [sent-139, score-0.514]

44 Disjoint components We could also explicitly express our desideratum that latent components focus on non-overlapping features. [sent-147, score-0.271]

45 To this end, we penalize the likelihood of the data with the following regularizer to promote disjoint components: $R(\{\mathbf{M}_k\}) = \sum_{k \neq k'} \mathrm{diag}(\mathbf{M}_k)^\top \mathrm{diag}(\mathbf{M}_{k'})$ (9), where diag(·) extracts the diagonal elements of the matrix. [sent-148, score-0.14]

46 As the metrics are constrained to be positive semidefinite, the inner product attains its minimum of zero when the diagonal elements, which are nonnegative, are orthogonal to each other. [sent-149, score-0.16]

47 Thus, metrics that have orthogonal diagonal vectors will use non-overlapping subsets of features. [sent-151, score-0.134]
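As a sketch of the regularizer in Eq. (9), the function below sums the inner products between the (nonnegative) diagonals of distinct metrics; summing over ordered pairs k ≠ k' is an assumption consistent with the description above (an unordered-pair sum would differ by a factor of 2).

```python
import numpy as np

def disjointness_penalty(Ms):
    """R({M_k}) = sum over k != k' of diag(M_k)^T diag(M_k'), as in Eq. (9).

    With PSD metrics the diagonals are nonnegative, so the penalty is zero
    exactly when the components use non-overlapping feature subsets.
    """
    diags = np.stack([np.diag(M) for M in Ms])   # (K, D) matrix of diagonals
    gram = diags @ diags.T                       # pairwise inner products
    return gram.sum() - np.trace(gram)           # drop the k = k' terms
```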

48 3 Experimental results We validate the effectiveness of SCA in modeling similarity relationships on three tasks. [sent-155, score-0.264]

49 In Section 3.2, we apply SCA to a multiway classification task to recognize images of handwritten digits, where similarity is equated to having the same class label. [sent-159, score-0.269]

50 In Section 3.3, we apply SCA to a link prediction problem for a network of scientific articles. [sent-162, score-0.138]

51 Our baseline algorithms for modeling similarity are information-theoretic metric learning (ITML) [5] and large margin nearest neighbor (LMNN) [18]. [sent-164, score-0.41]

52 Both methods are discriminative approaches where a metric is optimized to reduce the distances between data points from the same label class (or similar data instances) and increase the distances between data points from different classes (or dissimilar data instances). [sent-165, score-0.201]

53 When possible, we also contrast to multiple-metric LMNN (MM-LMNN) [18], a variant of LMNN where multiple metrics are learned from data. [sent-166, score-0.243]

54 Specifically, our feature dimensionality is D = 30 and the number of latent components is K = 5. [sent-170, score-0.213]

55 For each component k, the corresponding metric Mk is a D × D sparse positive semidefinite matrix where only elements in a 6 × 6 matrix block on the diagonal are nonzero. [sent-171, score-0.183]

56 In short, these metrics mimic the setup where each component focuses on its own 1/K-th of the total features, disjoint from the others. [sent-173, score-0.194]
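
The ground-truth metrics of this synthetic setup could be generated roughly as in the sketch below; the sampling distribution, seed, and function name are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

def make_block_metrics(D=30, K=5, block=6, seed=0):
    """Illustrative ground-truth metrics: each M_k is D x D, PSD, and nonzero only
    on one 6 x 6 diagonal block, so the K components use disjoint feature subsets."""
    rng = np.random.default_rng(seed)
    metrics = []
    for k in range(K):
        A = rng.normal(size=(block, block))
        B = A @ A.T                                  # PSD block
        M_k = np.zeros((D, D))
        lo = k * block
        M_k[lo:lo + block, lo:lo + block] = B
        metrics.append(M_k)
    return metrics
```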

57 We select a random pair and compute their similarity according to eq. [sent-179, score-0.25]

58 We evaluate the results of SCA on two aspects: how well we can recover the ground-truth metrics (and biases) and how well we can use the parameters to predict similarities on the test set. [sent-185, score-0.148]

59 Fig. 2(a) contrasts the learned metrics to the ground-truth (the first row). [sent-187, score-0.125]

60 However, when the number of latent components increases, SCA outperforms other approaches by a large margin. [sent-242, score-0.196]

61 Also note that when the number of latent components exceeds the ground-truth K = 5, SCA's performance reaches a plateau before eventually overfitting. [sent-243, score-0.196]

62 In real-world data, “true metrics” may overlap, that is, it is possible that different components of similarity rely on overlapping sets of features. [sent-244, score-0.327]

63 To examine SCA’s effectiveness in this scenario, we create another synthetic dataset where the true metrics heavily overlap, illustrated in the first row of Fig. [sent-245, score-0.151]

64 For testing, the label y of x is determined by $y = \arg\max_c s_c$ with $s_c = \sum_{\mathbf{x}' \in B_c(\mathbf{x})} P(s = 1 \mid \mathbf{x}, \mathbf{x}')$ (10), where sc is the similarity score to the c-th class, computed as the sum of the 5 largest similarity values over the set Bc(x) of samples in that class. [sent-256, score-0.51]
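
A hedged sketch of the decision rule in Eq. (10): score each class by the sum of its 5 largest similarity values to the test point and take the arg max. The `similarity_fn` callback stands in for the fitted SCA model's P(s = 1 | x, x') and is an assumption of this sketch.

```python
import numpy as np

def classify(x, train_X, train_y, similarity_fn, top=5):
    """Eq. (10): for each class c, sum the `top` largest P(s=1 | x, x') over
    training samples x' of that class, then return the arg-max class."""
    classes = np.unique(train_y)
    scores = []
    for c in classes:
        sims = np.array([similarity_fn(x, x2) for x2 in train_X[train_y == c]])
        scores.append(np.sort(sims)[-top:].sum())    # sum of the `top` largest values
    return classes[int(np.argmax(scores))]
```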

65 3.3 Link prediction We evaluate SCA on the task of link prediction in a “social” network of scientific articles. [sent-261, score-0.168]

66 In particular, we are interested in not only link prediction accuracies, but also the insights about data that we gain from analyzing the identified latent components. [sent-263, score-0.229]

67 (2) Topic (ToP) uses the documents’ topic vectors (mixture weights of topics) after fitting the corpus with LDA. Table 3 reports link prediction accuracies and their standard errors (%) on a network of scientific papers, comparing the BoW, ToW, and ToP feature types for SCA-DIAG against the SVM, ITML, and LMNN baselines. [sent-275, score-0.153]

68 For BoW- and ToW-represented data, we compare SCA with diagonal metrics (SCA-DIAG, cf. [sent-304, score-0.134]

69 To apply SVM/LOGIT, we treat link prediction as a binary classification problem where the input is the absolute difference in feature values between the two data points. [sent-307, score-0.125]
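
A minimal sketch of this baseline setup, using scikit-learn's logistic regression as a stand-in for LOGIT; the toy data shapes, the random labels, and the classifier settings are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def link_features(X, pairs):
    """Binary-classification features for link prediction: the element-wise
    absolute difference between the two endpoints of each candidate edge."""
    return np.abs(X[pairs[:, 0]] - X[pairs[:, 1]])

# toy usage: X is (n_docs, D), pairs is (n_pairs, 2), labels in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
pairs = rng.integers(0, 100, size=(200, 2))
labels = rng.integers(0, 2, size=200)
clf = LogisticRegression(max_iter=1000).fit(link_features(X, pairs), labels)
```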

70 For 50-dimensional ToP-represented data, we compare SCA and SCA-DIAG to SVM/LOGIT, information-theoretic metric learning (ITML), and large margin nearest neighbor (LMNN). [sent-308, score-0.163]

71 Note that while LMNN was originally designed for nearest-neighbor based classification, it can be adapted to use similarity information to learn a global metric to compute the distance between any pair of data points. [sent-309, score-0.369]

72 We learn such a metric and threshold on the distance to render a decision on whether two data points are similar or not (i. [sent-310, score-0.134]

73 On the other hand, multiple-metric LMNN, while often having better classification performance, cannot be used for similarity and link prediction, as it does not provide a principled way of computing distances between two arbitrary data points when there are multiple (local) metrics. [sent-313, score-0.392]

74 For both SCA and SCA-DIAG, we report results when a single component is used as well as when the optimal number of components is used (under columns K∗). [sent-317, score-0.137]

75 Both SCA-DIAG and SCA outperform the other methods by a significant margin, especially when the number of latent components is greater than 1 (K∗ ranges from 3 to 13, depending on the method and the feature type). [sent-318, score-0.213]

76 The only exception is SCA-DIAG with one component (K = 1), which is an overly restrictive model, as the diagonal metrics constrain features to be combined additively. [sent-319, score-0.196]

77 Edge component analysis Why does learning latent components in SCA achieve superior link prediction accuracies? [sent-321, score-0.366]

78 The (noisy-)OR model used by SCA is naturally inclined to favor “positive” opinions: a pair of samples is regarded as similar as long as there is one latent component strongly believing so. [sent-322, score-0.201]

79 This implies that a latent component can be tuned to a specific group of samples if those samples rely on common feature characteristics to be similar. [sent-323, score-0.2]

80 The plot displays, in relative strength (darker being stronger), how much each latent component believes a pair of articles from the same section should be similar. [sent-326, score-0.239]

81 Concretely, after fitting a 9-component SCA (from documents in ToP features), we consider edges connecting articles in the same section and compute the average local similarity values assigned by each component. [sent-327, score-0.328]
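
The per-section signatures described in the previous sentence could be computed roughly as follows; the array layout (one row of inferred local similarity values per within-section edge) is an assumed representation, not the authors' data format.

```python
import numpy as np

def section_signatures(local_sims, edge_sections, n_sections):
    """Average local similarity values per (section, component).

    local_sims    : (n_edges, K) matrix of P(s_k=1 | u, v) for within-section edges
    edge_sections : (n_edges,) integer section index of each edge
    Returns an (n_sections, K) matrix whose rows are the per-section signatures.
    """
    K = local_sims.shape[1]
    out = np.zeros((n_sections, K))
    for s in range(n_sections):
        mask = edge_sections == s
        if mask.any():
            out[s] = local_sims[mask].mean(axis=0)
    return out
```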

82 We observe two interesting sparse patterns: for each section, there is a dominant latent component that strongly supports the fact that the articles from that section should be similar (e. [sent-328, score-0.241]

83 Moreover, each latent component often strongly “voices up” for one section; the exception is the second component, which seems to support both sections 3 and 4. [sent-331, score-0.183]

84 Nonetheless, the general picture is that each section has a signature in terms of how similarity values are distributed across latent components. [sent-332, score-0.377]

85 While Fig. 3(a) depicts the averaged signature for each section, the scatterplot displays 2D embeddings, computed with the t-SNE algorithm, of each individual edge’s signature: the 9-dimensional similarity values inferred with the 9 latent components. [sent-336, score-0.401]
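
A small sketch of the 2D embedding step using scikit-learn's t-SNE; the placeholder signatures are random and only stand in for the inferred 9-dimensional local similarity values of each edge.

```python
import numpy as np
from sklearn.manifold import TSNE

# signatures: (n_edges, 9) matrix of inferred local similarity values, one row per edge
rng = np.random.default_rng(0)
signatures = rng.random(size=(500, 9))       # placeholder for the inferred values
embedding = TSNE(n_components=2, init="random", random_state=0).fit_transform(signatures)
# `embedding` is (n_edges, 2); plotting it colored by NIPS section would reproduce
# the kind of scatterplot described in the text.
```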

86 Representing network links with local similarity values reveals interesting structures, such as nearly one-to-one correspondence between latent components and sections, as well as clusters. [sent-339, score-0.526]

87 However, representing articles in LDA’s topics does not reveal useful clustering structures such that links can be inferred. [sent-340, score-0.153]

88 In contrast, embedding documents using their topic representations does not reveal clear clustering structures such that network links can be inferred. [sent-343, score-0.164]

89 We observe that while topics themselves do not reveal intrinsic (network) structures, latent components are able to do so by applying highly specialized metrics to measure local similarities and yield characteristic signatures. [sent-346, score-0.434]

90 We also study whether or not the lack of an edge between a pair of dissimilar documents from different sections can give rise to characteristic signatures from the latent components. [sent-347, score-0.246]

91 4 Related Work Our model learns multiple metrics, one for each latent component. [sent-351, score-0.14]

92 However, the similarity (or associated dissimilarity) from our model is definitely non-metric due to the complex combination. [sent-352, score-0.232]

93 [12] gives an information-theoretic definition of (non-metric) similarity as long as there is a probabilistic model for the data. [sent-354, score-0.251]

94 The key difference is that those works model vertices with a mixture of latent components (communities), whereas we model the interactions between vertices with a mixture of latent components. [sent-358, score-0.317]

95 [2] studies a social network whose edge set is the union of multiple edge sets in hidden similarity spaces. [sent-359, score-0.36]

96 Our work explicitly models the probabilistic process of combining latent components with a (noisy-)OR gate. [sent-360, score-0.215]

97 5 Conclusion We propose Similarity Component Analysis (SCA) for probabilistic modeling of similarity relationships between pairs of data instances. [sent-361, score-0.266]

98 The key ingredient of SCA is to model similarity as a complex combination of multiple latent components, each giving rise to a local similarity value. [sent-362, score-0.636]

99 SCA attains significantly better accuracies than existing methods on both classification and link prediction tasks. [sent-363, score-0.183]

100 Acknowledgements We thank reviewers for extensive discussion and references on the topics of similarity and learning similarity. [sent-364, score-0.273]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sca', 0.851), ('similarity', 0.232), ('lmnn', 0.173), ('mk', 0.145), ('sk', 0.135), ('latent', 0.121), ('metrics', 0.109), ('metric', 0.096), ('itml', 0.08), ('link', 0.078), ('components', 0.075), ('bk', 0.071), ('component', 0.062), ('centaur', 0.053), ('accuracies', 0.049), ('semide', 0.045), ('topics', 0.041), ('multiplex', 0.04), ('tow', 0.04), ('dissimilar', 0.039), ('similarities', 0.039), ('articles', 0.038), ('diag', 0.037), ('multiway', 0.037), ('jk', 0.036), ('links', 0.036), ('bow', 0.036), ('pk', 0.035), ('man', 0.034), ('southern', 0.032), ('local', 0.032), ('social', 0.031), ('prediction', 0.03), ('xm', 0.03), ('network', 0.03), ('logit', 0.028), ('nearest', 0.027), ('smn', 0.027), ('lk', 0.027), ('papers', 0.026), ('documents', 0.026), ('dissimilarity', 0.026), ('lda', 0.026), ('posteriors', 0.026), ('attains', 0.026), ('synthetic', 0.025), ('diagonal', 0.025), ('dk', 0.024), ('signature', 0.024), ('edge', 0.024), ('maxc', 0.023), ('voices', 0.023), ('disjoint', 0.023), ('distance', 0.023), ('angeles', 0.023), ('texts', 0.022), ('classi', 0.022), ('los', 0.022), ('qk', 0.022), ('songs', 0.022), ('margin', 0.022), ('structures', 0.021), ('mm', 0.021), ('richer', 0.021), ('competing', 0.021), ('dis', 0.02), ('dominant', 0.02), ('overlapping', 0.02), ('probabilistic', 0.019), ('format', 0.019), ('iarpa', 0.019), ('multiple', 0.019), ('signatures', 0.018), ('ca', 0.018), ('neighbor', 0.018), ('svm', 0.018), ('pair', 0.018), ('xn', 0.018), ('distances', 0.018), ('tting', 0.018), ('bc', 0.018), ('topic', 0.018), ('biases', 0.017), ('feature', 0.017), ('illustrated', 0.017), ('validate', 0.017), ('reveal', 0.017), ('baselines', 0.017), ('nonetheless', 0.017), ('fienberg', 0.017), ('horse', 0.017), ('promote', 0.017), ('kulis', 0.017), ('contrasts', 0.016), ('embedding', 0.016), ('discovers', 0.016), ('assessment', 0.016), ('nips', 0.015), ('points', 0.015), ('modeling', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 294 nips-2013-Similarity Component Analysis

Author: Soravit Changpinyo, Kuan Liu, Fei Sha

Abstract: Measuring similarity is crucial to many learning tasks. To this end, metric learning has been the dominant paradigm. However, similarity is a richer and broader notion than what metrics entail. For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. We derive an EM-based algorithm for fitting the model parameters with similarity-annotated data from pairwise comparisons. We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. We also apply SCA to a multiway classification task and a link prediction task. For both tasks, SCA attains significantly better prediction accuracies than competing methods. Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components’ local similarity values. 1

2 0.08319889 75 nips-2013-Convex Two-Layer Modeling

Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans

Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1

3 0.07493832 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

Author: Anshumali Shrivastava, Ping Li

Abstract: We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R3way = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|, S1, S2, S3 ∈ C, where C is a size-n collection of sets (or binary vectors). We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend the traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R3way resemblance over the pairwise case in improving retrieval quality. 1
The profile of each person can be represented as a sparse vector over a giant universe of attributes. For example, a user profile may be the set of actors, actresses, genres, directors, etc, which she/he likes. On the other hand, we can represent a movie M in the database over the same universe based on attributes associated with the movie. If we have to recommend movie M, jointly to users A and B, then a natural measure to maximize is |A∩B∩M | . The problem of group recommendation [3] is applicable |A∪B∪M | in many more settings such as recommending people to join circles, etc. (iii) Improving retrieval quality: We are interested in finding images of a particular type of object, and we have two or three (possibly noisy) representative images. In such a scenario, a natural expectation is that retrieving images simultaneously similar to all the representative images should be more refined than just retrieving images similar to any one of them. In Section 7.2, we demonstrate that in cases where we have more than one element to search for, we can refine our search quality using k-way resemblance search. In a dynamic feedback environment [4], we can improve subsequent search quality by using k-way similarity search on the pages already clicked by the user. (iv) Beyond pairwise clustering: While machine learning algorithms often utilize the data through pairwise similarities (e.g., inner product or resemblance), there are natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher [2, 30]. The computational cost, of course, will increase exponentially if we go beyond pairwise similarity. Efficiency is crucial With the data explosion in modern applications, the brute force way of scanning all the data for searching is prohibitively expensive, specially in user-facing applications like search. The need for k-way similarity search can only be fulfilled if it admits efficient algorithms. This paper fulfills this requirement for k-way resemblance and its derived similarities. In particular, we show fast algorithms with provable query time guarantees for approximate k-way resemblance search. Our algorithms and analysis naturally provide a framework to extend classical LSH framework [14, 13] to handle higher-order similarities, which could be of independent theoretical interest. Organization In Section 2, we review approximate near neighbor search and classical Locality Sensitive Hashing (LSH). In Section 3, we formulate the 3-way similarity search problems. Sections 4, 5, and 6 describe provable fast algorithms for several search problems. Section 7 demonstrates the applicability of 3-way resemblance search in real applications. 2 Classical c-NN and Locality Sensitive Hashing (LSH) Initial attempts of finding efficient (sub-linear time) algorithms for exact near neighbor search, based on space partitioning, turned out to be a disappointment with the massive dimensionality of current datasets [11, 28]. Approximate versions of the problem were proposed [14, 13] to break the linear query time bottleneck. One widely adopted formalism is the c-approximate near neighbor (c-NN). Definition 1 (c-Approximate Near Neighbor or c-NN). Consider a set of points, denoted by P, in a D-dimensional space RD , and parameters R0 > 0, δ > 0. The task is to construct a data structure which, given any query point q, if there exist an R0 -near neighbor of q in P, it reports some cR0 -near neighbor of q in P with probability 1 − δ. The usual notion of c-NN is for distance. 
Since we deal with similarities, we define R0 -near neighbor of point q as a point p with Sim(q, p) ≥ R0 , where Sim is the similarity function of interest. Locality sensitive hashing (LSH) [14, 13] is a popular framework for c-NN problems. LSH is a family of functions, with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping RD to some set S Definition 2 (Locality Sensitive Hashing (LSH)). A family H is called (R0 , cR0 , p1 , p2 )-sensitive if for any two points x, y ∈ RD and h chosen uniformly from H satisfies the following: • if Sim(x, y) ≥ R0 then P rH (h(x) = h(y)) ≥ p1 • if Sim(x, y) ≤ cR0 then P rH (h(x) = h(y)) ≤ p2 For approximate nearest neighbor search typically, p1 > p2 and c < 1 is needed. Note, c < 1 as we are defining neighbors in terms of similarity. Basically, LSH trades off query time with extra preprocessing time and space which can be accomplished off-line. 2 Fact 1 Given a family of (R0 , cR0 , p1 , p2 ) -sensitive hash functions, one can construct a data structure for c-NN with O(nρ log1/p2 n) query time and space O(n1+ρ ), where ρ = log 1/p1 . log 1/p2 Minwise Hashing for Pairwise Resemblance One popular choice of LSH family of functions associated with resemblance similarity is, Minwise Hashing family [5, 6, 13]. Minwise Hashing family applies an independent random permutation π : Ω → Ω, on the given set S ⊆ Ω, and looks at the minimum element under π, i.e. min(π(S)). Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, it can be shown by elementary probability argument that P r (min(π(S1 )) = min(π(S2 ))) = |S1 ∩ S2 | = R2way . |S1 ∪ S2 | (1) The recent work on b-bit minwise hashing [20, 23] provides an improvement by storing only the lowest b bits of the hashed values: min(π(S1 )), min(π(S2 )). [26] implemented the idea of building hash tables for near neighbor search, by directly using the bits from b-bit minwise hashing. 3 3-way Similarity Search Formulation Our focus will remain on binary vectors which can also be viewed as sets. We illustrate our method |S1 ∩S2 ∩S3 | using 3-way resemblance similarity function Sim(S1 , S2 , S3 ) = |S1 ∪S2 ∪S3 | . The algorithm and guarantees naturally extend to k-way resemblance. Given a size n collection C ⊆ 2Ω of sets (or binary vectors), we are particularly interested in the following three problems: 1. Given two query sets S1 and S2 , find S3 ∈ C that maximizes Sim(S1 , S2 , S3 ). 2. Given a query set S1 , find two sets S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). 3. Find three sets S1 , S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). The brute force way of enumerating all possibilities leads to the worst case query time of O(n), O(n2 ) and O(n3 ) for problem 1, 2 and 3, respectively. In a hope to break this barrier, just like the case of pairwise near neighbor search, we define the c-approximate (c < 1) versions of the above three problems. As in the case of c-NN, we are given two parameters R0 > 0 and δ > 0. For each of the following three problems, the guarantee is with probability at least 1 − δ: 1. (3-way c-Near Neighbor or 3-way c-NN) Given two query sets S1 and S2 , if there ′ exists S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report some S3 ∈ C so that ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 2. 
(3-way c-Close Pair or 3-way c-CP) Given a query set S1 , if there exists a pair of ′ ′ set S2 , S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S2 , S3 ∈ C so that ′ ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 3. (3-way c-Best Cluster or 3-way c-BC) If there exist sets S1 , S2 , S3 ∈ C with ′ ′ ′ ′ ′ ′ Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S1 , S2 , S3 ∈ C so that Sim(S1 , S2 , S3 ) ≥ cR0 . 4 Sub-linear Algorithm for 3-way c-NN The basic philosophy behind sub-linear search is bucketing, which allows us to preprocess dataset in a fashion so that we can filter many bad candidates without scanning all of them. LSH-based techniques rely on randomized hash functions to create buckets that probabilistically filter bad candidates. This philosophy is not restricted for binary similarity functions and is much more general. Here, we first focus on 3-way c-NN problem for binary data. Theorem 1 For R3way c-NN one can construct a data structure with O(nρ log1/cR0 n) query time and O(n1+ρ ) space, where ρ = 1 − log 1/c log 1/c+log 1/R0 . The argument for 2-way resemblance can be naturally extended to k-way resemblance. Specifically, given three sets S1 , S2 , S3 ⊆ Ω and an independent random permutation π : Ω → Ω, we have: P r (min(π(S1 )) = min(π(S2 )) = min(π(S3 ))) = R3way . (2) Eq.( 2) shows that minwise hashing, although it operates on sets individually, preserves all 3-way (in fact k-way) similarity structure of the data. The existence of such a hash function is the key requirement behind the existence of efficient approximate search. For the pairwise case, the probability event was a simple hash collision, and the min-hash itself serves as the bucket index. In case 3 of 3-way (and higher) c-NN problem, we have to take care of a more complicated event to create an indexing scheme. In particular, during preprocessing we need to create buckets for each individual S3 , and while querying we need to associate the query sets S1 and S2 to the appropriate bucket. We need extra mechanisms to manipulate these minwise hashes to obtain a bucketing scheme. Proof of Theorem 1: We use two additional functions: f1 : Ω → N for manipulating min(π(S3 )) and f2 : Ω × Ω → N for manipulating both min(π(S1 )) and min(π(S2 )). Let a ∈ N+ such that |Ω| = D < 10a . We define f1 (x) = (10a + 1) × x and f2 (x, y) = 10a x + y. This choice ensures that given query S1 and S2 , for any S3 ∈ C, f1 (min(π(S3 ))) = f2 (min(π(S1 )), min(π(S2 ))) holds if and only if min(π(S1 )) = min(π(S2 )) = min(π(S2 )), and thus we get a bucketing scheme. To complete the proof, we introduce two integer parameters K and L. Define a new hash function by concatenating K events. To be more precise, while preprocessing, for every element S3 ∈ C create buckets g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hK (S3 ))] where hi is chosen uniformly from minwise hashing family. For given query points S1 and S2 , retrieve only points in the bucket g2 (S1 , S2 ) = [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hK (S1 ), hK (S2 ))]. Repeat this process L times independently. For any K S3 ∈ C, with Sim(S1 , S2 , S3 ) ≥ R0 , is retrieved with probability at least 1 − (1 − R0 )L . Using log 1/c log K = ⌈ log n ⌉ and L = ⌈nρ log( 1 )⌉, where ρ = 1 − log 1/c+log 1/R0 , the proof can be obtained 1 δ cR0 using standard concentration arguments used to prove Fact 1, see [14, 13]. It is worth noting that the probability guarantee parameter δ gets absorbed in the constants as log( 1 ). Note, the process is δ stopped as soon as we find some element with R3way ≥ cR0 . 
Theorem 1 can be easily extended to k-way resemblance with same query time and space guarantees. Note that k-way c-NN is at least as hard as k ∗ -way c-NN for any k ∗ ≤ k, because we can always choose (k −k ∗ +1) identical query sets in k-way c-NN, and it reduces to k ∗ -way c-NN problem. So, any improvements in R3way c-NN implies improvement in the classical min-hash LSH for Jaccard similarity. The proposed analysis is thus tight in this sense. The above observation makes it possible to also perform the traditional pairwise c-NN search using the same hash tables deployed for 3-way c-NN. In the query phase we have an option, if we have two different queries S1 , S2 , then we retrieve from bucket g2 (S1 , S2 ) and that is usual 3-way c-NN search. If we are just interested in pairwise near neighbor search given one query S1 , then we will look into bucket g2 (S1 , S1 ), and we know that the 3-way resemblance between S1 , S1 , S3 boils down to the pairwise resemblance between S1 and S3 . So, the same hash tables can be used for both the purposes. This property generalizes, and hash tables created for k-way c-NN can be used for any k ∗ -way similarity search so long as k ∗ ≤ k. The approximation guarantees still holds. This flexibility makes k-way c-NN bucketing scheme more advantageous over the pairwise scheme. ρ 1 One of the peculiarity of LSH based techniques is that the query complexity exponent ρ < 1 is dependent on the choice R0=0.01 0.8 of the threshold R0 we are interested in and the value of c 0.05 0.1 0.3 0.6 which is the approximation ratio that we will tolerate. Figure 1 0.2 0.4 0.8 log 1/c 0.5 plots ρ = 1− log 1/c+log 1/R0 with respect to c, for selected R0 0.4 0.6 0.9 0.7 values from 0.01 to 0.99. For instance, if we are interested in 0.2 0.95 highly similar pairs, i.e. R0 ≈ 1, then we are looking at near R =0.99 0 O(log n) query complexity for c-NN problem as ρ ≈ 0. On 0 0 0.2 0.4 0.6 0.8 1 the other hand, for very lower threshold R0 , there is no much c log 1/c of hope of time-saving because ρ is close to 1. Figure 1: ρ = 1 − log 1/c+log 1/R0 . 5 Other Efficient k-way Similarities We refer to the k-way similarities for which there exist sub-linear algorithms for c-NN search with query and space complexity exactly as given in Theorem 1 as efficient . We have demonstrated existence of one such example of efficient similarities, which is the k-way resemblance. This leads to a natural question: “Are there more of them?”. [9] analyzed all the transformations on similarities that preserve existence of efficient LSH search. In particular, they showed that if S is a similarity for which there exists an LSH family, then there also exists an LSH family for any similarity which is a probability generating function (PGF) transfor∑∞ mation on S. PGF transformation on S is defined as P GF (S) = i=1 pi S i , where S ∈ [0, 1] and ∑∞ pi ≥ 0 satisfies i=1 pi = 1. Similar theorem can also be shown in the case of 3-way resemblance. 4 Theorem 2 Any PGF transformation on 3-way resemblance R3way is efficient. Recall in the proof of Theorem 1, we created hash assignments f1 (min(π(S3 ))) and f2 (min(π(S1 )), min(π(S2 ))), which lead to a bucketing scheme for the 3-way resemblance search, where the collision event E = {f1 (min(π(S3 )) = f2 (min(π(S1 )), min(π(S2 )))} happens with probability P r(E) = R3way . To prove the above Theorem 2, we will need to create hash events ∑∞ i having probability P GF (R3way ) = i=1 pi (R3way ) . Note that 0 ≤ P GF (R3way ) ≤ 1. 
We will make use of the following simple lemma. Lemma 1 (R3way )n is efficient for all n ∈ N. n n Proof: Define new hash assignments g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hn (S3 ))] and g2 (S1 , S2 ) = n n [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hn (S1 ), hn (S2 ))]. The collision event g1 (S3 ) = g2 (S1 , S2 ) has n n probability (R3way )n . We now use the pair < g1 , g2 > instead of < f1 , f2 > and obtain same 3way n guarantees, as in Theorem 1, for (R ) as well. i i Proof of Theorem 2: From Lemma 1, let < g1 , g2 > be the hash pair corresponding to (R3way )i i i as used in above lemma. We sample one hash pair from the set {< g1 , g2 >: i ∈ N}, where i i the probability of sampling < g1 , g2 > is proportional to pi . Note that pi ≥ 0, and satisfies ∑∞ is i=1 pi = 1, and so the above sampling ∑ valid. It is not difficult to see that the collision of the ∞ sampled hash pair has probability exactly i=1 pi (R3way )i . Theorem 2 can be naturally extended to k-way similarity for any k ≥ 2. Thus, we now have infinitely many k-way similarity functions admitting efficient sub-linear search. One, that might be interesting, because of its radial basis kernel like nature, is shown in the following corollary. Corollary 1 eR k−way −1 is efficient. Proof: Use the expansion of eR k−way normalized by e to see that eR k−way −1 is a PGF on Rk−way . 6 Fast Algorithms for 3-way c-CP and 3-way c-BC Problems For 3-way c-CP and 3-way c-BC problems, using bucketing scheme with minwise hashing family will save even more computations. Theorem 3 For R3way c-Close Pair Problem (or c-CP) one can construct a data structure with log 1/c O(n2ρ log1/cR0 n) query time and O(n1+2ρ ) space, where ρ = 1 − log 1/c+log 1/R0 . Note that we can switch the role of f1 and f2 in the proof of Theorem 1. We are thus left with a c-NN problem with search space O(n2 ) (all pairs) instead of n. A bit of analysis, similar to Theorem 1, will show that this procedure achieves the required query time O(n2ρ log1/cR0 n), but uses a lot more space, O(n2(1+ρ )), than shown in the above theorem. It turns out that there is a better way of doing c-CP that saves us space. Proof of Theorem 3: We again start with constructing hash tables. For every element Sc ∈ C, we create a hash-table and store Sc in bucket B(Sc ) = [h1 (Sc ); h2 (Sc ); ...; hK (Sc )], where hi is chosen uniformly from minwise independent family of hash functions H. We create L such hash-tables. For a query element Sq we look for all pairs in bucket B(Sq ) = [h1 (Sq ); h2 (Sq ); ...; hK (Sq )] and repeat this for each of the L tables. Note, we do not form pairs of elements retrieved from different tables as they do not satisfy Eq. (2). If there exists a pair S1 , S2 ∈ C with Sim(Sq , S1 , S2 ) ≥ R0 , using K Eq. (2), we can see that we will find that pair in bucket B(Sq ) with probability 1 − (1 − R0 )L . Here, we cannot use traditional choice of K and L, similar to what we did in Theorem 1, as there 2 log are O(n2 ) instead of O(n) possible pairs. We instead use K = ⌈ log 1n ⌉ and L = ⌈n2ρ log( 1 )⌉, δ cR0 log 1/c with ρ = 1 − log 1/c+log 1/R0 . With this choice of K and L, the result follows. Note, the process is stopped as soon as we find pairs S1 and S2 with Sim(Sq , S1 , S2 ) ≥ cR0 . The key argument that saves space from O(n2(1+ρ) ) to O(n1+2ρ ) is that we hash n points individually. Eq. 
(2) makes it clear that hashing all possible pairs is not needed when every point can be processed individually, and pairs formed within each bucket itself filter out most of the unnecessary combinations. 5 Theorem 4 For R3way c-Best Cluster Problem (or c-BC) there exist an algorithm with running time log 1/c O(n1+2ρ log1/cR0 n), where ρ = 1 − log 1/c+log 1/R0 . The argument similar to one used in proof of Theorem 3 leads to the running time of O(n1+3ρ log1/cR0 n) as we need L = O(n3ρ ), and we have to processes all points at least once. Proof of Theorem 4: Repeat c-CP problem n times for every element in collection C acting as query once. We use the same set of hash tables and hash functions every time. The preprocessing time is O(n1+2ρ log1/cR0 n) evaluations of hash functions and the total querying time is O(n × n2ρ log1/cR0 n), which makes the total running time O(n1+2ρ log1/cR0 n). For k-way c-BC Problem, we can achieve O(n1+(k−1)ρ log1/cR0 n) running time. If we are interested in very high similarity cluster, with R0 ≈ 1, then ρ ≈ 0, and the running time is around O(n log n). This is a huge saving over the brute force O(nk ). In most practical cases, specially in big data regime where we have enormous amount of data, we can expect the k-way similarity of good clusters to be high and finding them should be efficient. We can see that with increasing k, hashing techniques save more computations. 7 Experiments In this section, we demonstrate the usability of 3-way and higher-order similarity search using (i) Google Sets, and (ii) Improving retrieval quality. 7.1 Google Sets: Generating Semantically Similar Words Here, the task is to retrieve words which are “semantically” similar to the given set of query words. We collected 1.2 million random documents from Wikipedia and created a standard term-doc binary vector representation of each term present in the collected documents after removing standard stop words and punctuation marks. More specifically, every word is represented as a 1.2 million dimension binary vector indicating its presence or absence in the corresponding document. The total number of terms (or words) was around 60,000 in this experiment. Since there is no standard benchmark available for this task, we show qualitative evaluations. For querying, we used the following four pairs of semantically related words: (i) “jaguar” and “tiger”; (ii) “artificial” and “intelligence”; (iii) “milky” and “way” ; (iv) “finger” and “lakes”. Given the query words w1 and w2 , we compare the results obtained by the following four methods. • Google Sets: We use Google’s algorithm and report 5 words from Google spreadsheets [1]. This is Google’s algorithm which uses its own data. • 3-way Resemblance (3-way): We use 3-way resemblance |w1 ∩w2 ∩w| to rank every word |w1 ∪w2 ∪w| w and report top 5 words based on this ranking. • Sum Resemblance (SR): Another intuitive method is to use the sum of pairwise resem|w2 ∩w| blance |w1 ∩w| + |w2 ∪w| and report top 5 words based on this ranking. |w1 ∪w| • Pairwise Intersection (PI): We first retrieve top 100 words based on pairwise resemblance for each w1 and w2 independently. We then report the words common in both. If there is no word in common we do not report anything. The results in Table 1 demonstrate that using 3-way resemblance retrieves reasonable candidates for these four queries. An interesting query is “finger” and “lakes”. Finger Lakes is a region in upstate New York. 
Google could only relate it to New York, while 3-way resemblance could even retrieve the names of cities and lakes in the region. Also, for query “milky” and “way”, we can see some (perhaps) unrelated words like “dance” returned by Google. We do not see such random behavior with 3-way resemblance. Although we are not aware of the algorithm and the dataset used by Google, we can see that 3-way resemblance appears to be a right measure for this application. The above results also illustrate the problem with using the sum of pairwise similarity method. The similarity value with one of the words dominates the sum and hence we see for queries “artificial” and “intelligence” that all the retrieved words are mostly related to the word “intelligence”. Same is the case with query “finger” and “lakes” as well as “jaguar” and “tiger”. Note that “jaguar” is also a car brand. In addition, for all 4 queries, there was no common word in the top 100 words similar to the each query word individually and so PI method never returns anything. 6 Table 1: Top five words retrieved using various methods for different queries. “JAGUAR” AND “ TIGER” G OOGLE 3- WAY SR LION LEOPARD CHEETAH CAT DOG LEOPARD CHEETAH LION PANTHER CAT CAT LEOPARD LITRE BMW CHASIS “MILKY” AND “ WAY” G OOGLE 3- WAY SR DANCE STARS SPACE THE UNIVERSE GALAXY STARS EARTH LIGHT SPACE EVEN ANOTHER STILL BACK TIME PI — — — — — “ARTIFICIAL” AND “INTELLIGENCE” G OOGLE 3- WAY SR PI COMPUTER COMPUTER SECURITY — PROGRAMMING SCIENCE WEAPONS — INTELLIGENT SECRET — SCIENCE ROBOT HUMAN ATTACKS — ROBOTICS TECHNOLOGY HUMAN — PI — — — — — G OOGLE NEW YORK NY PARK CITY “FINGER” AND “LAKES” 3- WAY SR SENECA CAYUGA ERIE ROCHESTER IROQUOIS RIVERS FRESHWATER FISH STREAMS FORESTED PI — — — — — We should note the importance of the denominator term in 3-way resemblance, without which frequent words will be blindly favored. The exciting contribution of this paper is that 3-way resemblance similarity search admits provable sub-linear guarantees, making it an ideal choice. On the other hand, no such provable guarantees are known for SR and other heuristic based search methods. 7.2 Improving Retrieval Quality in Similarity Search We also demonstrate how the retrieval quality of traditional similarity search can be boosted by utilizing more query candidates instead of just one. For the evaluations we choose two public datasets: MNIST and WEBSPAM, which were used in a recent related paper [26] for near neighbor search with binary data using b-bit minwise hashing [20, 23]. The two datasets reflect diversity both in terms of task and scale that is encountered in practice. The MNIST dataset consists of handwritten digit samples. Each sample is an image of 28 × 28 pixel yielding a 784 dimension vector with the associated class label (digit 0 − 9). We binarize the data by settings all non zeros to be 1. We used the standard partition of MNIST, which consists of 10,000 samples in one set and 60,000 in the other. The WEBSPAM dataset, with 16,609,143 features, consists of sparse vector representation of emails labeled as spam or not. We randomly sample 70,000 data points and partitioned them into two independent sets of size 35,000 each. Table 2: Percentage of top candidates with the same labels as that of query retrieved using various similarity criteria. More indicates better retrieval quality (Best marked in bold). 
T OP Pairwise 3-way NNbor 4-way NNbor 1 94.20 96.90 97.70 MNIST 10 20 92.33 91.10 96.13 95.36 96.89 96.28 50 89.06 93.78 95.10 1 98.45 99.75 99.90 WEBSPAM 10 20 96.94 96.46 98.68 97.80 98.87 98.15 50 95.12 96.11 96.45 For evaluation, we need to generate potential similar search query candidates for k-way search. It makes no sense in trying to search for object simultaneously similar to two very different objects. To generate such query candidates, we took one independent set of the data and partition it according to the class labels. We then run a cheap k-mean clustering on each class, and randomly sample triplets < x1 , x2 , x3 > from each cluster for evaluating 2-way, 3-way and 4-way similarity search. For MNIST dataset, the standard 10,000 test set was partitioned according to the labels into 10 sets, each partition was then clustered into 10 clusters, and we choose 10 triplets randomly from each cluster. In all we had 100 such triplets for each class, and thus 1000 overall query triplets. For WEBSPAM, which consists only of 2 classes, we choose one of the independent set and performed the same procedure. We selected 100 triplets from each cluster. We thus have 1000 triplets from each class making the total number of 2000 query candidates. The above procedures ensure that the elements in each triplets < x1 , x2 , x3 > are not very far from each other and are of the same class label. For each triplet < x1 , x2 , x3 >, we sort all the points x in the other independent set based on the following: • Pairwise: We only use the information in x1 and rank x based on resemblance 7 |x1 ∩x| |x1 ∪x| . • 3-way NN: We rank x based on 3-way resemblance • 4-way NN: We rank x based on 4-way resemblance |x1 ∩x2 ∩x| |x1 ∪x2 ∪x| . |x1 ∩x2 ∩x3 ∩x| |x1 ∪x2 ∪x3 ∪x| . We look at the top 1, 10, 20 and 50 points based on orderings described above. Since, all the query triplets are of the same label, The percentage of top retrieved candidates having same label as that of the query items is a natural metric to evaluate the retrieval quality. This percentage values accumulated over all the triplets are summarized in Table 2. We can see that top candidates retrieved by 3-way resemblance similarity, using 2 query points, are of better quality than vanilla pairwise similarity search. Also 4-way resemblance, with 3 query points, further improves the results compared to 3-way resemblance similarity search. This clearly demonstrates that multi-way resemblance similarity search is more desirable whenever we have more than one representative query in mind. Note that, for MNIST, which contains 10 classes, the boost compared to pairwise retrieval is substantial. The results follow a consistent trend. 8 Future Work While the work presented in this paper is promising for efficient 3-way and k-way similarity search in binary high-dimensional data, there are numerous interesting and practical research problems we can study as future work. In this section, we mention a few such examples. One-permutation hashing. Traditionally, building hash tables for near neighbor search required many (e.g., 1000) independent hashes. This is both time- and energy-consuming, not only for building tables but also for processing un-seen queries which have not been processed. One permutation hashing [22] provides the hope of reducing many permutations to merely one. The version in [22], however, was not applicable to near neighbor search due to the existence of many empty bins (which offer no indexing capability). 
The most recent work [27] is able to fill the empty bins and works well for pairwise near neighbor search. It will be interesting to extend [27] to k-way search. Non-binary sparse data. This paper focuses on minwise hashing for binary data. Various extensions to real-valued data are possible. For example, our results naturally apply to consistent weighted sampling [25, 15], which is one way to handle non-binary sparse data. The problem, however, is not solved if we are interested in similarities such as (normalized) k-way inner products, although the line of work on Conditional Random Sampling (CRS) [19, 18] may be promising. CRS works on non-binary sparse data by storing a bottom subset of nonzero entries after applying one permutation to (real-valued) sparse data matrix. CRS performs very well for certain applications but it does not work in our context because the bottom (nonzero) subsets are not properly aligned. Building hash tables by directly using bits from minwise hashing. This will be a different approach from the way how the hash tables are constructed in this paper. For example, [26] directly used the bits from b-bit minwise hashing [20, 23] to build hash tables and demonstrated the significant advantages compared to sim-hash [8, 12] and spectral hashing [29]. It would be interesting to see the performance of this approach in k-way similarity search. k-Way sign random projections. It would be very useful to develop theory for k-way sign random projections. For usual (real-valued) random projections, it is known that the volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might be also a (monotonic) function of the determinant. 9 Conclusions We formulate a new framework for k-way similarity search and obtain fast algorithms in the case of k-way resemblance with provable worst-case approximation guarantees. We show some applications of k-way resemblance search in practice and demonstrate the advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go. Acknowledgement The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research. 8 References [1] http://www.howtogeek.com/howto/15799/how-to-use-autofill-on-a-google-docs-spreadsheet-quick-tips/. [2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005. [3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. Group recommendation: semantics and efficiency. Proc. VLDB Endow., 2(1):754–765, 2009. [4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011. [5] Andrei Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997. [6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998. [7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. 
k-Way sign random projections. It would be very useful to develop a theory for k-way sign random projections. For usual (real-valued) random projections, it is known that the volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might also be a (monotonic) function of the determinant.

9 Conclusions

We formulate a new framework for k-way similarity search and obtain fast algorithms in the case of k-way resemblance with provable worst-case approximation guarantees. We show some applications of k-way resemblance search in practice and demonstrate the advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go.

Acknowledgement

The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research.

References

[1] http://www.howtogeek.com/howto/15799/how-to-use-autofill-on-a-google-docs-spreadsheet-quick-tips/.
[2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005.
[3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. Group recommendation: semantics and efficiency. Proc. VLDB Endow., 2(1):754–765, 2009.
[4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011.
[5] Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.
[6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998.
[7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999.
[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[9] Flavio Chierichetti and Ravi Kumar. LSH-preserving functions and their applications. In SODA, 2012.
[10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.
[11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.
[12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115–1145, 1995.
[13] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012.
[14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.
[15] Sergey Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, 2010.
[16] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007.
[17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Technical report, arXiv:1207.6083, 2013.
[18] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (preliminary results appeared in HLT/EMNLP 2005), 33(3):305–354, 2007.
[19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, Canada, 2006.
[20] Ping Li and Arnd Christian König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010.
[21] Ping Li, Arnd Christian König, and Wenhao Gui. b-bit minwise hashing for estimating three-way similarities. In NIPS, Vancouver, Canada, 2010.
[22] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.
[23] Ping Li, Anshumali Shrivastava, and Arnd Christian König. b-bit minwise hashing in practice. In Internetware, Changsha, China, 2013.
[24] Avner Magen and Anastasios Zouzias. Near optimal dimensionality reductions that preserve volumes. In APPROX/RANDOM, pages 523–534, 2008.
[25] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
[26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012.
[27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.
[28] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.
[29] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, Vancouver, Canada, 2008.
[30] D. Zhou, J. Huang, and B. Schölkopf. Beyond pairwise classification and clustering using hypergraphs. In NIPS, Vancouver, Canada, 2006.

4 0.069060236 301 nips-2013-Sparse Additive Text Models with Low Rank Background

Author: Lei Shi

Abstract: The sparse additive model for text modeling involves sum-of-exp computation, whose cost becomes prohibitive at large scale. Moreover, the assumption of an equal background across all classes/topics may be too strong. This paper extends it to propose the sparse additive model with low-rank background (SAM-LRB) and obtains simple yet efficient estimation. In particular, employing a double majorization bound, we approximate the log-likelihood by a quadratic lower bound without the log-sum-exp terms. The constraints of low rank and sparsity are then simply embodied by nuclear-norm and ℓ1-norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to unsupervised and multifaceted scenarios. Experiments on three real datasets demonstrate the effectiveness and efficiency of SAM-LRB compared with several state-of-the-art models. 1

5 0.062315315 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

Author: Boqing Gong, Kristen Grauman, Fei Sha

Abstract: In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. However, image data is difficult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.). We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric formulation and efficient optimization procedure that can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks. 1

6 0.062001191 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks

7 0.054631054 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

8 0.053808842 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

9 0.053200532 148 nips-2013-Latent Maximum Margin Clustering

10 0.052722003 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces

11 0.051469784 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

12 0.051194191 277 nips-2013-Restricting exchangeable nonparametric distributions

13 0.04976343 196 nips-2013-Modeling Overlapping Communities with Node Popularities

14 0.046999402 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

15 0.046571396 158 nips-2013-Learning Multiple Models via Regularized Weighting

16 0.045800418 85 nips-2013-Deep content-based music recommendation

17 0.045753215 326 nips-2013-The Power of Asymmetry in Binary Hashing

18 0.044692606 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

19 0.043473836 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests

20 0.043429356 335 nips-2013-Transfer Learning in a Transductive Setting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.116), (1, 0.053), (2, -0.036), (3, -0.008), (4, 0.048), (5, 0.004), (6, 0.036), (7, 0.011), (8, 0.018), (9, 0.006), (10, -0.03), (11, 0.021), (12, 0.048), (13, 0.012), (14, 0.017), (15, -0.03), (16, 0.052), (17, -0.007), (18, 0.054), (19, -0.039), (20, 0.005), (21, -0.03), (22, 0.008), (23, 0.076), (24, 0.013), (25, -0.136), (26, -0.033), (27, -0.061), (28, 0.009), (29, 0.008), (30, 0.002), (31, -0.027), (32, 0.06), (33, 0.005), (34, -0.053), (35, -0.004), (36, -0.084), (37, 0.044), (38, 0.046), (39, 0.106), (40, -0.006), (41, -0.005), (42, 0.023), (43, -0.029), (44, 0.059), (45, -0.091), (46, 0.082), (47, 0.056), (48, 0.027), (49, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92446727 294 nips-2013-Similarity Component Analysis

Author: Soravit Changpinyo, Kuan Liu, Fei Sha

Abstract: Measuring similarity is crucial to many learning tasks. To this end, metric learning has been the dominant paradigm. However, similarity is a richer and broader notion than what metrics entail. For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. We derive an EM-based algorithm for fitting the model parameters with similarity-annotated data from pairwise comparisons. We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. We also apply SCA to a multiway classification task and a link prediction task. For both tasks, SCA attains significantly better prediction accuracies than competing methods. Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components’ local similarity values. 1

2 0.62038356 148 nips-2013-Latent Maximum Margin Clustering

Author: Guang-Tong Zhou, Tian Lan, Arash Vahdat, Greg Mori

Abstract: We present a maximum margin framework that clusters data using latent variables. Using latent representations enables our framework to model unobserved information embedded in the data. We implement our idea by large margin learning, and develop an alternating descent algorithm to effectively solve the resultant non-convex optimization problem. We instantiate our latent maximum margin clustering framework with tag-based video clustering tasks, where each video is represented by a latent tag model describing the presence or absence of video tags. Experimental results obtained on three standard datasets show that the proposed method outperforms non-latent maximum margin clustering as well as conventional clustering approaches. 1

3 0.56331891 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests

Author: Yacine Jernite, Yonatan Halpern, David Sontag

Abstract: We give a polynomial-time algorithm for provably learning the structure and parameters of bipartite noisy-or Bayesian networks of binary variables where the top layer is completely hidden. Unsupervised learning of these models is a form of discrete factor analysis, enabling the discovery of hidden variables and their causal relationships with observed data. We obtain an efficient learning algorithm for a family of Bayesian networks that we call quartet-learnable. For each latent variable, the existence of a singly-coupled quartet allows us to uniquely identify and learn all parameters involving that latent variable. We give a proof of the polynomial sample complexity of our learning algorithm, and experimentally compare it to variational EM. 1

4 0.56273228 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models

Author: Kohei Hayashi, Ryohei Fujimaki

Abstract: This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models (LFMs). FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hessian matrix of a complete loglikelihood, which is required to derive a “factorized information criterion” (FIC). Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency. 1

5 0.54079127 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan

Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1

6 0.53576487 75 nips-2013-Convex Two-Layer Modeling

7 0.53425813 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

8 0.52217704 301 nips-2013-Sparse Additive Text Models with Low Rank Background

9 0.52079988 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

10 0.5205639 326 nips-2013-The Power of Asymmetry in Binary Hashing

11 0.51781261 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

12 0.49626043 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

13 0.49520794 85 nips-2013-Deep content-based music recommendation

14 0.49094796 277 nips-2013-Restricting exchangeable nonparametric distributions

15 0.48009065 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

16 0.4758355 158 nips-2013-Learning Multiple Models via Regularized Weighting

17 0.46019682 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces

18 0.45912141 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks

19 0.45690745 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

20 0.44867116 357 nips-2013-k-Prototype Learning for 3D Rigid Structures


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.02), (10, 0.023), (13, 0.176), (16, 0.029), (33, 0.165), (34, 0.12), (41, 0.042), (49, 0.07), (56, 0.087), (70, 0.03), (85, 0.047), (89, 0.024), (93, 0.051), (95, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85233802 294 nips-2013-Similarity Component Analysis

Author: Soravit Changpinyo, Kuan Liu, Fei Sha

Abstract: Measuring similarity is crucial to many learning tasks. To this end, metric learning has been the dominant paradigm. However, similarity is a richer and broader notion than what metrics entail. For example, similarity can arise from the process of aggregating the decisions of multiple latent components, where each latent component compares data in its own way by focusing on a different subset of features. In this paper, we propose Similarity Component Analysis (SCA), a probabilistic graphical model that discovers those latent components from data. In SCA, a latent component generates a local similarity value, computed with its own metric, independently of other components. The final similarity measure is then obtained by combining the local similarity values with a (noisy-)OR gate. We derive an EM-based algorithm for fitting the model parameters with similarity-annotated data from pairwise comparisons. We validate the SCA model on synthetic datasets where SCA discovers the ground-truth about the latent components. We also apply SCA to a multiway classification task and a link prediction task. For both tasks, SCA attains significantly better prediction accuracies than competing methods. Moreover, we show how SCA can be instrumental in exploratory analysis of data, where we gain insights about the data by examining patterns hidden in its latent components’ local similarity values. 1

2 0.85117847 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh

Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1

3 0.82405227 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini

Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1

4 0.81965983 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions

Author: Rishabh K. Iyer, Stefanie Jegelka, Jeff A. Bilmes

Abstract: We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [28]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the “curvature” of the submodular function, and provide lower and upper bounds that refine and improve previous results [2, 6, 8, 27]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [3, 29], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results. 1

5 0.79416084 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

Author: David Carlson, Vinayak Rao, Joshua T. Vogelstein, Lawrence Carin

Abstract: With simultaneous measurements from ever increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-theart. Via exploratory data analysis—using data with partial ground truth as well as two novel data sets—we find several features of our model collectively contribute to our improved performance including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain. 1

6 0.7901122 173 nips-2013-Least Informative Dimensions

7 0.78991431 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis

8 0.78959042 301 nips-2013-Sparse Additive Text Models with Low Rank Background

9 0.7895261 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

10 0.78912032 64 nips-2013-Compete to Compute

11 0.78833491 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

12 0.78596717 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

13 0.78584665 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

14 0.78538322 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

15 0.78523976 201 nips-2013-Multi-Task Bayesian Optimization

16 0.78474188 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

17 0.78276539 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

18 0.78219396 5 nips-2013-A Deep Architecture for Matching Short Texts

19 0.78115988 183 nips-2013-Mapping paradigm ontologies to and from the brain

20 0.78067422 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation