nips nips2013 nips2013-344 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jason Lee, Ran Gilad-Bachrach, Rich Caruana
Abstract: In the mixture models problem it is assumed that there are K distributions θ1 , . . . , θK and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Using multiple samples to learn mixture models Jason Lee∗ Stanford University Stanford, USA jdl17@stanford. [sent-1, score-0.231]
2 com Abstract In the mixture models problem it is assumed that there are K distributions θ1 , . [sent-4, score-0.209]
3 , θK and one gets to observe a sample from a mixture of these distributions with unknown coefficients. [sent-7, score-0.24]
4 In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. [sent-9, score-0.166]
5 Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. [sent-11, score-0.114]
6 We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. [sent-12, score-0.202]
7 1 Introduction The mixture model has been studied extensively from several directions. [sent-14, score-0.156]
8 A line of studies on clustering theory, starting from [5] has proposed to address this problem by finding a projection to a low dimensional space and solving the problem in this space. [sent-16, score-0.265]
9 The goal of this projection is to reduce the dimension while preserving the distances, as much as possible, between the means of the underlying distributions. [sent-17, score-0.208]
10 On the other end of the spectrum, Topic modeling (TM), [9, 3], assumes multiple samples (documents) that are mixtures, with different weights of the underlying distributions (topics) over words. [sent-19, score-0.191]
11 In TM, each document is a mixture of the topics, but with different mixture weights. [sent-24, score-0.312]
12 However, we assume that points (words) are presented as feature vectors and the hidden distributions may be continuous. [sent-35, score-0.116]
13 1 Definitions and Notations We assume a mixture model in which there are K mixture components θ1 , . [sent-38, score-0.358]
14 We assume that there are M mixture models (samples), each drawn with different mixture weights Φ1 , . [sent-43, score-0.336]
15 In the second part of this work we will assume that the mixture components have disjoint supports. [sent-65, score-0.256]
16 , SM , we will provide an algorithm that finds the supports of the underlying distributions, and thus clusters the samples. [sent-70, score-0.198]
17 In the classical clustering setting (MM), one would take a sample of patients and try to divide them based on some similarity criteria into groups. [sent-76, score-0.133]
18 However, if patients in two hospitals acquired the same sub-type of a disease, parts of their medical records will be similar. [sent-80, score-0.117]
19 These patches may have different distributions based on the object in that part of the image. [sent-83, score-0.124]
20 Therefore, patches from images taken at different locations will have different representation of the underlying distributions. [sent-84, score-0.11]
21 Moreover, patches from the center of the frame are more likely to contain parts of faces than patches from the perimeter of the picture. [sent-85, score-0.168]
22 At the same time, patches from the bottom of the picture are more likely to be of grass than patches from the top of the picture. [sent-86, score-0.142]
23 In the first part of this work we discuss the problem of identifying the mixture component from multiple samples when the means of the different components differ and variances are bounded. [sent-87, score-0.316]
24 We focus on the problem of finding a low dimensional embedding of the data that preserves the distances between the means since the problem of finding the mixtures in a low dimensional space has already been address (see, for example [10]). [sent-88, score-0.368]
25 The Disjoint Overlapping method of moments [6, 8, 1] allows the reclusters clusters covery of the model but requires exponenHigh DSC, MSP MSP tial running time and sample sizes. [sent-96, score-0.126]
26 In the DSC dimension first stage, the data is projected to a low dimensional space and in the second stage Table 1: Summary of the scenarios the MSP the association of points to clusters is recov(Multi Sample Projection) algorithm and the ered. [sent-98, score-0.301]
27 Most of the results with this approach DSC (Double Sample Clustering) algorithm assume that the mixture components are are designed to address. [sent-99, score-0.202]
28 He used random projections to project the points√ a space of a lower dimension. [sent-102, score-0.106]
29 This to work assumes that separation is at least Ω(σmax n). [sent-103, score-0.108]
30 Arora and Kannan [10] presented algorithms for finding the mixture components which are, in most cases, polynomial in n and K. [sent-105, score-0.255]
31 Vempala and Wang [11] used PCA to reduce the required separation to Ω σmax K 1/4 log /4 n/φmin . [sent-106, score-0.108]
32 Kanan, Salmasian and Vempala [7] used similar spectral methods but 2/3 were able to improve the results to require separation of only cσmax K /φ2 . [sent-108, score-0.108]
33 They require separation of Ω σmax K log(Kσmax log n/φmin ) , however they assume that the Gaussians are axis aligned and that the distance between the centers of the Gaussians is spread across Ω (Kσmax log n/φmin ) coordinates. [sent-110, score-0.19]
34 We present a method to project the problem into a space of dimension d∗ which is the dimension of the affine space spanned by the means of the distributions. [sent-111, score-0.285]
35 We can find this projection and maintain the distances between the means to within a factor of 1 − . [sent-112, score-0.158]
36 For example, combining with [5] yields an algorithm that requires a separation of only √ √ Ω σmax d∗ ≤ Ω σmax K . [sent-115, score-0.108]
37 However, using [11] will result in separation requirement 1 of Ω σmax K log (Kσmax log d∗ /φmin ) . [sent-116, score-0.108]
38 There is also an improvement in terms of the value of σmax since we need only to control the variance in the affine space spanned by the means of the Gaussians and do not need to restrict the variance in orthogonal directions, as long as it is finite. [sent-117, score-0.221]
39 Later we also show that we can work in a more generic setting where the distributions are not restricted to be Gaussians as long as the supports of the distributions are disjoint. [sent-118, score-0.17]
40 For example, even if the required separation is σmax K 1/2 then if we look at the Voronoi tessellation around the centers of the −1 Gaussians, each cell will contain at least 1 − (2π) K 3/4 exp (−K/2) of the mass of the Gaussian. [sent-120, score-0.192]
41 2 Projection for overlapping components In this section we present a method to use multiple samples to project high dimensional mixtures to a low dimensional space while keeping the means of the mixture components 3 Algorithm 1 Multi Sample Projection (MSP) Inputs: Samples S1 , . [sent-122, score-0.677]
42 Let µi be the mean of the i’th component θi and let Ej be the mean of the j’th mixture Dj . [sent-145, score-0.156]
43 , µK and hence in the affine space spanned by them; this is demonstrated in Figure 1. [sent-149, score-0.096]
44 Under mild assumptions, if we have sufficiently many mixtures, their means will span the affine space spanned by µ1 , . [sent-150, score-0.168]
45 The reason for selecting this sub-space is that by projecting on this space we maintain the distance between the means while reducing the dimension to at most K − 1. [sent-155, score-0.107]
46 We will assume that X = Rn , the first two 2 moments of θj are finite, and σmax denotes maximal variance of any of the components in any direction. [sent-158, score-0.142]
47 The separation of the mixture components is minj=j µj − µj . [sent-159, score-0.31]
48 Finally, we will denote by d∗ the dimension of the affine space spanned by the µj ’s. [sent-160, score-0.137]
49 Let µi be the projection of µi on the space spanned by v1 , . [sent-170, score-0.185]
50 Let αj be such that µi = j αj vj and let A = maxi αj 2 nσ 1 then with probability of at least 1 − max 2 j Nj Pr max | µi − µi i,i − µi − µi | > ¯ ¯ ≤ 2 4nσmax A2 2 j 1 . [sent-174, score-0.125]
51 Nj The MSP analysis theorem shows that with large enough samples, the projection will maintain the separation between the centers of the distributions. [sent-175, score-0.255]
52 If the mixing coefficients are very different in the different samples then A will be small. [sent-178, score-0.127]
53 Nevertheless, the size of the sample needed is polynomial in Figure 1: The mean of the mixture compothe parameters of the problem. [sent-180, score-0.211]
54 a good projection will be found, even with 4 large variances, high dimensions and close centroids. [sent-182, score-0.119]
55 Once a projection to a low dimensional space has been found, it is possible to find the clusters using approaches presented in section 1. [sent-184, score-0.314]
56 , Em will be almost co-planar in the sense that there will ∗ be an affine space of dimension d that is very close to all these points and we can project onto this space. [sent-198, score-0.143]
57 3 Disjoint supports and the Double Sample Clustering (DSC) algorithm In this section we discuss the case where the underlying distributions have disjoint supports. [sent-199, score-0.21]
58 However, as in the mixture of Gaussians case some sort of separation between the distributions is needed, this is the role of the disjoint supports. [sent-202, score-0.371]
59 We will show that given two samples from mixtures with different mixture coefficients, it is possible to find the supports of the underlying distributions (clusters) by building a tree of classifiers such that each leaf represents a cluster. [sent-203, score-0.654]
60 First we take the two samples, from the two distributions, and reweigh the examples such that the two samples will have the same cumulative weight. [sent-205, score-0.119]
61 It also splits each of the samples into two sets. [sent-208, score-0.105]
62 To understand why this algorithm works it is easier to look first at the case where the mixture distributions are known. [sent-211, score-0.209]
63 Therefore, any inner node in the tree splits the region without breaking clusters. [sent-214, score-0.125]
64 This process proceeds until all the points associated with a leaf are from the same cluster in which case, no classifier can distinguish between the classes. [sent-215, score-0.122]
65 S1 S2 However, this estimate is almost surely going to be 1 if the underlying distributions are absolutely continuous. [sent-219, score-0.092]
66 We claim that asymptotically, as the sizes of the samples increase, one can increase the complexity of the class until the clusters can be separated. [sent-221, score-0.17]
67 5 Algorithm 2 Double Sample Clustering (DSC) Inputs: • Samples S1 , S2 • A binary learning algorithm L that given samples S1 , S2 with weights w1 , w2 finds a classifier h and an estimator e of the error of h. [sent-227, score-0.099]
68 Furthermore, A contains all the clusters for which φ1 > φ2 and does not contain all the i i clusters for which φ1 < φ2 . [sent-240, score-0.216]
69 Hence, if we i i ¯ split the space to A and A we have few clusters in each side. [sent-241, score-0.122]
70 By applying the same trick recursively in each side we keep on bisecting the space according to cluster boundaries until subspaces that contain only a single cluster remain. [sent-242, score-0.133]
71 i i Conditioned on these two regions, the mixture D2 (A∗ ) = i max φ1 − φ2 , 0 . [sent-258, score-0.2]
72 We conclude from Lemma 1 that if D1 and D2 were explicitly known and one could have found a classifier that best separates between the distributions, that classifier would not break clusters as long as the mixing coefficients 6 are not identical. [sent-263, score-0.191]
73 In order for this to hold when the separation is applied recursively in the DSC algorithm it suffices to have that for every I ⊆ [1, . [sent-264, score-0.108]
74 , K] if |I| > 1 and i ∈ I then φ1 i 1 i ∈I φi = φ2 i i ∈I φ2 i to guarantee that at any stage of the algorithm clusters will not be split by the classifier (but may be sections of measure zero). [sent-267, score-0.125]
75 We show that with samples large enough, clusters are only minimally broken. [sent-271, score-0.17]
76 For this to hold we require that the learning algorithm L separates the clusters according to this definition: Definition 1. [sent-272, score-0.139]
77 Before we introduce the main statement, we define what it means for a tree to cluster the mixture components: Definition 2. [sent-283, score-0.33]
78 A clustering tree is a tree in which in each internal node is a classifier and the points that end in a certain leaf are considered a cluster. [sent-284, score-0.347]
79 A clustering tree -clusters the mixture coefficient θ1 , . [sent-285, score-0.326]
80 , K there exists a leaf in the tree such that the cluster L ⊆ X associated with this leaf is such that θi (L) ≥ 1 − and θi (L) < for every i = i. [sent-291, score-0.231]
81 To be able to find a clustering tree, the two mixtures have to be different. [sent-292, score-0.199]
82 For every ∗ , δ ∗ > 0 there exists N = N ( ∗ , δ ∗ , g, b, K) such that given two random samples of sizes N < n1 , n2 from the two mixtures, with probability of at least 1 − δ ∗ the DSC algorithm will return a clustering tree which ∗ -clusters θ1 , . [sent-306, score-0.245]
83 4 Empirical evidence We conducted several experiments with synthetic data to compare different methods when clustering in high dimensional spaces. [sent-310, score-0.124]
84 two samples with 80 examples each from the two mixing coefficients. [sent-317, score-0.127]
85 The DSC and MSP algorithm received these two samples as inputs while the reference algorithms, which are not designed to use multiple samples, received the combined set of 160 points as input. [sent-318, score-0.109]
86 Second, following [11] we used PCA to project on the maximal variance subspace. [sent-326, score-0.137]
87 In all projection algorithm we first projected on a one dimensional space and then applied K-means to find the clusters. [sent-328, score-0.165]
88 5 Conclusions The mixture problem examined here is closely related to the problem of clustering. [sent-345, score-0.156]
89 Most clustering data can be viewed as points generated from multiple underlying distributions or generating functions, and clustering can be seen as the process of recovering the structure of or assignments to these distributions. [sent-346, score-0.276]
90 We presented two algorithms for the mixture problem that can be viewed as clustering algorithms. [sent-347, score-0.26]
91 The MSP algorithm uses multiple samples to find a low dimensional space to project the data to. [sent-348, score-0.217]
92 The DSC algorithm builds a clustering tree assuming that the clusters are disjoint. [sent-349, score-0.265]
93 The key message in this work is that when multiple samples are available, often it is best not to pool the data into one large sample, but that the structure in the different samples can be leveraged to improve clustering power. [sent-351, score-0.225]
94 [4] Kamalika Chaudhuri and Satish Rao, Learning mixtures of product distributions using correlations and independence, Proc. [sent-356, score-0.177]
95 [5] Sanjoy Dasgupta, Learning mixtures of gaussians, Foundations of Computer Science, 1999. [sent-358, score-0.124]
96 [6] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant, Efficiently learning mixtures of two gaussians, Proceedings of the 42nd ACM symposium on Theory of computing, ACM, 2010, pp. [sent-361, score-0.17]
97 [7] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala, The spectral method for general mixture models, Learning Theory, Springer, 2005, pp. [sent-363, score-0.156]
98 [8] Ankur Moitra and Gregory Valiant, Settling the polynomial learnability of mixtures of gaussians, Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, IEEE, 2010, pp. [sent-365, score-0.148]
99 [10] Arora Sanjeev and Ravi Kannan, Learning mixtures of arbitrary gaussians, Proceedings of the thirty-third annual ACM symposium on Theory of computing, ACM, 2001, pp. [sent-369, score-0.194]
100 [11] Santosh Vempala and Grant Wang, A spectral algorithm for learning mixtures of distributions, Foundations of Computer Science, 2002. [sent-371, score-0.124]
wordName wordTfidf (topN-words)
[('dsc', 0.444), ('erent', 0.419), ('msp', 0.411), ('di', 0.292), ('coe', 0.176), ('mixture', 0.156), ('gaussians', 0.141), ('ej', 0.124), ('mixtures', 0.124), ('separation', 0.108), ('tm', 0.108), ('clusters', 0.095), ('tree', 0.095), ('mm', 0.094), ('projection', 0.089), ('er', 0.086), ('clustering', 0.075), ('samples', 0.075), ('vempala', 0.072), ('patches', 0.071), ('spanned', 0.069), ('hospitals', 0.067), ('supports', 0.064), ('orange', 0.059), ('centers', 0.058), ('nj', 0.056), ('sj', 0.055), ('disjoint', 0.054), ('distributions', 0.053), ('maximal', 0.053), ('mixing', 0.052), ('ci', 0.05), ('dj', 0.05), ('dimensional', 0.049), ('classi', 0.049), ('leaf', 0.048), ('su', 0.048), ('santosh', 0.046), ('components', 0.046), ('symposium', 0.046), ('kannan', 0.045), ('reweigh', 0.044), ('max', 0.044), ('separates', 0.044), ('multi', 0.043), ('variance', 0.043), ('project', 0.041), ('dimension', 0.041), ('cluster', 0.04), ('vm', 0.039), ('supa', 0.039), ('valiant', 0.039), ('salmasian', 0.039), ('erences', 0.039), ('underlying', 0.039), ('means', 0.039), ('sm', 0.038), ('projections', 0.038), ('cients', 0.038), ('vj', 0.037), ('disease', 0.037), ('dm', 0.035), ('double', 0.035), ('cj', 0.035), ('points', 0.034), ('redmond', 0.034), ('span', 0.033), ('ankur', 0.032), ('milder', 0.031), ('moitra', 0.031), ('sample', 0.031), ('stage', 0.03), ('dimensions', 0.03), ('distances', 0.03), ('splits', 0.03), ('word', 0.029), ('presented', 0.029), ('pr', 0.028), ('nds', 0.028), ('space', 0.027), ('supremum', 0.027), ('skewed', 0.027), ('patients', 0.027), ('contain', 0.026), ('topics', 0.026), ('gregory', 0.026), ('separated', 0.025), ('microsoft', 0.025), ('gap', 0.025), ('subtree', 0.025), ('low', 0.025), ('acm', 0.025), ('foundations', 0.024), ('art', 0.024), ('weights', 0.024), ('annual', 0.024), ('polynomial', 0.024), ('focs', 0.024), ('aligned', 0.024), ('records', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 344 nips-2013-Using multiple samples to learn mixture models
Author: Jason Lee, Ran Gilad-Bachrach, Rich Caruana
Abstract: In the mixture models problem it is assumed that there are K distributions θ1 , . . . , θK and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data. 1
2 0.16994061 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
Author: Martin Azizyan, Aarti Singh, Larry Wasserman
Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1
3 0.10415907 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
4 0.090396717 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
5 0.082460642 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O (log n) at ˜ any time step n while achieving a nearly-optimal regression rate of O n−2/(2+d) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting. 1
6 0.077756599 347 nips-2013-Variational Planning for Graph-based MDPs
7 0.076971039 241 nips-2013-Optimizing Instructional Policies
8 0.074011512 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
9 0.072571814 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
10 0.071528621 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
11 0.071364053 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
12 0.070432521 158 nips-2013-Learning Multiple Models via Regularized Weighting
13 0.069785543 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
14 0.069493189 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
15 0.064316832 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
16 0.06426432 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
17 0.063885599 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
18 0.062610306 355 nips-2013-Which Space Partitioning Tree to Use for Search?
19 0.062211107 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
20 0.062063348 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
topicId topicWeight
[(0, 0.161), (1, 0.047), (2, -0.034), (3, 0.009), (4, 0.055), (5, 0.052), (6, 0.056), (7, 0.006), (8, -0.028), (9, 0.031), (10, -0.011), (11, 0.047), (12, 0.039), (13, 0.079), (14, 0.055), (15, -0.004), (16, 0.097), (17, -0.019), (18, 0.09), (19, 0.028), (20, 0.052), (21, 0.095), (22, 0.006), (23, -0.015), (24, -0.043), (25, 0.001), (26, -0.086), (27, -0.017), (28, -0.117), (29, -0.095), (30, -0.011), (31, -0.025), (32, 0.124), (33, 0.05), (34, 0.045), (35, 0.067), (36, 0.18), (37, 0.06), (38, -0.007), (39, 0.083), (40, -0.064), (41, 0.043), (42, -0.057), (43, 0.032), (44, -0.005), (45, 0.096), (46, -0.104), (47, -0.052), (48, -0.056), (49, -0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.93183142 344 nips-2013-Using multiple samples to learn mixture models
Author: Jason Lee, Ran Gilad-Bachrach, Rich Caruana
Abstract: In the mixture models problem it is assumed that there are K distributions θ1 , . . . , θK and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data. 1
2 0.69936252 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
3 0.66500765 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
Author: Martin Azizyan, Aarti Singh, Larry Wasserman
Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1
4 0.60037857 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
Author: Jeffrey W. Miller, Matthew T. Harrison
Abstract: For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of clusters — that is, the posterior on the number of components represented in the observed data. However, it turns out that this posterior is not consistent — it does not concentrate at the true number of components. In this note, we give an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a “mixture” with one standard normal component. Further, we show that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster converges (in probability) to 0. 1
5 0.54145914 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan
Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1
6 0.50178492 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
7 0.4931995 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
8 0.48963967 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
9 0.47807965 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
10 0.47017574 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
11 0.46336609 252 nips-2013-Predictive PAC Learning and Process Decompositions
12 0.44554797 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
13 0.44360986 269 nips-2013-Regression-tree Tuning in a Streaming Setting
14 0.43804824 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
15 0.43346938 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
16 0.433173 158 nips-2013-Learning Multiple Models via Regularized Weighting
17 0.43277186 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
18 0.43165156 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
19 0.42461482 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
20 0.42058769 355 nips-2013-Which Space Partitioning Tree to Use for Search?
topicId topicWeight
[(2, 0.017), (16, 0.04), (21, 0.164), (33, 0.16), (34, 0.109), (41, 0.033), (49, 0.071), (56, 0.117), (70, 0.023), (79, 0.037), (85, 0.052), (89, 0.039), (93, 0.042), (95, 0.017)]
simIndex simValue paperId paperTitle
1 0.90176803 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
Author: Anshumali Shrivastava, Ping Li
Abstract: We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R3way = |S1 ∩S2 ∩S3 | , S1 , S2 , S3 ∈ C, where C is a |S1 ∪S2 ∪S3 | size n collection of sets (or binary vectors). We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R3way resemblance over the pairwise case in improving retrieval quality. 1 Introduction and Motivation Similarity search (near neighbor search) is one of the fundamental problems in Computer Science. The task is to identify a small set of data points which are “most similar” to a given input query. Similarity search algorithms have been one of the basic building blocks in numerous applications including search, databases, learning, recommendation systems, computer vision, etc. One widely used notion of similarity on sets is the Jaccard similarity or resemblance [5, 10, 18, 20]. Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, the resemblance R2way between S1 and S2 is defined as: R2way = |S1 ∩S2 | . Existing notions of similarity in search problems mainly work with |S1 ∪S2 | pairwise similarity functions. In this paper, we go beyond this notion and look at the problem of k-way similarity search, where the similarity function of interest involves k sets (k ≥ 2). Our work exploits the fact that resemblance can be naturally extended to k-way resemblance similarity [18, 21], defined over k sets {S1 , S2 , ..., Sk } as Rk−way = |S1 ∩S2 ∩...∩Sk | . |S1 ∪S2 ∪...∪Sk | Binary high-dimensional data The current web datasets are typically binary, sparse, and extremely high-dimensional, largely due to the wide adoption of the “Bag of Words” (BoW) representations for documents and images. It is often the case, in BoW representations, that just the presence or absence (0/1) of specific feature words captures sufficient information [7, 16, 20], especially with (e.g.,) 3-grams or higher-order models. And so, the web can be imagined as a giant storehouse of ultra high-dimensional sparse binary vectors. Of course, binary vectors can also be equivalently viewed as sets (containing locations of the nonzero features). We list four practical scenarios where k-way resemblance search would be a natural choice. (i) Google Sets: (http://googlesystem.blogspot.com/2012/11/google-sets-still-available.html) Google Sets is among the earliest google projects, which allows users to generate list of similar words by typing only few related keywords. For example, if the user types “mazda” and “honda” the application will automatically generate related words like “bmw”, “ford”, “toyota”, etc. This application is currently available in google spreadsheet. If we assume the term document binary representation of each word w in the database, then given query w1 and w2 , we show that |w1 ∩w2 ∩w| |w1 ∪w2 ∪w| turns out to be a very good similarity measure for this application (see Section 7.1). 1 (ii) Joint recommendations: Users A and B would like to watch a movie together. The profile of each person can be represented as a sparse vector over a giant universe of attributes. For example, a user profile may be the set of actors, actresses, genres, directors, etc, which she/he likes. On the other hand, we can represent a movie M in the database over the same universe based on attributes associated with the movie. If we have to recommend movie M, jointly to users A and B, then a natural measure to maximize is |A∩B∩M | . The problem of group recommendation [3] is applicable |A∪B∪M | in many more settings such as recommending people to join circles, etc. (iii) Improving retrieval quality: We are interested in finding images of a particular type of object, and we have two or three (possibly noisy) representative images. In such a scenario, a natural expectation is that retrieving images simultaneously similar to all the representative images should be more refined than just retrieving images similar to any one of them. In Section 7.2, we demonstrate that in cases where we have more than one element to search for, we can refine our search quality using k-way resemblance search. In a dynamic feedback environment [4], we can improve subsequent search quality by using k-way similarity search on the pages already clicked by the user. (iv) Beyond pairwise clustering: While machine learning algorithms often utilize the data through pairwise similarities (e.g., inner product or resemblance), there are natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher [2, 30]. The computational cost, of course, will increase exponentially if we go beyond pairwise similarity. Efficiency is crucial With the data explosion in modern applications, the brute force way of scanning all the data for searching is prohibitively expensive, specially in user-facing applications like search. The need for k-way similarity search can only be fulfilled if it admits efficient algorithms. This paper fulfills this requirement for k-way resemblance and its derived similarities. In particular, we show fast algorithms with provable query time guarantees for approximate k-way resemblance search. Our algorithms and analysis naturally provide a framework to extend classical LSH framework [14, 13] to handle higher-order similarities, which could be of independent theoretical interest. Organization In Section 2, we review approximate near neighbor search and classical Locality Sensitive Hashing (LSH). In Section 3, we formulate the 3-way similarity search problems. Sections 4, 5, and 6 describe provable fast algorithms for several search problems. Section 7 demonstrates the applicability of 3-way resemblance search in real applications. 2 Classical c-NN and Locality Sensitive Hashing (LSH) Initial attempts of finding efficient (sub-linear time) algorithms for exact near neighbor search, based on space partitioning, turned out to be a disappointment with the massive dimensionality of current datasets [11, 28]. Approximate versions of the problem were proposed [14, 13] to break the linear query time bottleneck. One widely adopted formalism is the c-approximate near neighbor (c-NN). Definition 1 (c-Approximate Near Neighbor or c-NN). Consider a set of points, denoted by P, in a D-dimensional space RD , and parameters R0 > 0, δ > 0. The task is to construct a data structure which, given any query point q, if there exist an R0 -near neighbor of q in P, it reports some cR0 -near neighbor of q in P with probability 1 − δ. The usual notion of c-NN is for distance. Since we deal with similarities, we define R0 -near neighbor of point q as a point p with Sim(q, p) ≥ R0 , where Sim is the similarity function of interest. Locality sensitive hashing (LSH) [14, 13] is a popular framework for c-NN problems. LSH is a family of functions, with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping RD to some set S Definition 2 (Locality Sensitive Hashing (LSH)). A family H is called (R0 , cR0 , p1 , p2 )-sensitive if for any two points x, y ∈ RD and h chosen uniformly from H satisfies the following: • if Sim(x, y) ≥ R0 then P rH (h(x) = h(y)) ≥ p1 • if Sim(x, y) ≤ cR0 then P rH (h(x) = h(y)) ≤ p2 For approximate nearest neighbor search typically, p1 > p2 and c < 1 is needed. Note, c < 1 as we are defining neighbors in terms of similarity. Basically, LSH trades off query time with extra preprocessing time and space which can be accomplished off-line. 2 Fact 1 Given a family of (R0 , cR0 , p1 , p2 ) -sensitive hash functions, one can construct a data structure for c-NN with O(nρ log1/p2 n) query time and space O(n1+ρ ), where ρ = log 1/p1 . log 1/p2 Minwise Hashing for Pairwise Resemblance One popular choice of LSH family of functions associated with resemblance similarity is, Minwise Hashing family [5, 6, 13]. Minwise Hashing family applies an independent random permutation π : Ω → Ω, on the given set S ⊆ Ω, and looks at the minimum element under π, i.e. min(π(S)). Given two sets S1 , S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, it can be shown by elementary probability argument that P r (min(π(S1 )) = min(π(S2 ))) = |S1 ∩ S2 | = R2way . |S1 ∪ S2 | (1) The recent work on b-bit minwise hashing [20, 23] provides an improvement by storing only the lowest b bits of the hashed values: min(π(S1 )), min(π(S2 )). [26] implemented the idea of building hash tables for near neighbor search, by directly using the bits from b-bit minwise hashing. 3 3-way Similarity Search Formulation Our focus will remain on binary vectors which can also be viewed as sets. We illustrate our method |S1 ∩S2 ∩S3 | using 3-way resemblance similarity function Sim(S1 , S2 , S3 ) = |S1 ∪S2 ∪S3 | . The algorithm and guarantees naturally extend to k-way resemblance. Given a size n collection C ⊆ 2Ω of sets (or binary vectors), we are particularly interested in the following three problems: 1. Given two query sets S1 and S2 , find S3 ∈ C that maximizes Sim(S1 , S2 , S3 ). 2. Given a query set S1 , find two sets S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). 3. Find three sets S1 , S2 , S3 ∈ C maximizing Sim(S1 , S2 , S3 ). The brute force way of enumerating all possibilities leads to the worst case query time of O(n), O(n2 ) and O(n3 ) for problem 1, 2 and 3, respectively. In a hope to break this barrier, just like the case of pairwise near neighbor search, we define the c-approximate (c < 1) versions of the above three problems. As in the case of c-NN, we are given two parameters R0 > 0 and δ > 0. For each of the following three problems, the guarantee is with probability at least 1 − δ: 1. (3-way c-Near Neighbor or 3-way c-NN) Given two query sets S1 and S2 , if there ′ exists S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report some S3 ∈ C so that ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 2. (3-way c-Close Pair or 3-way c-CP) Given a query set S1 , if there exists a pair of ′ ′ set S2 , S3 ∈ C with Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S2 , S3 ∈ C so that ′ ′ Sim(S1 , S2 , S3 ) ≥ cR0 . 3. (3-way c-Best Cluster or 3-way c-BC) If there exist sets S1 , S2 , S3 ∈ C with ′ ′ ′ ′ ′ ′ Sim(S1 , S2 , S3 ) ≥ R0 , then we report sets S1 , S2 , S3 ∈ C so that Sim(S1 , S2 , S3 ) ≥ cR0 . 4 Sub-linear Algorithm for 3-way c-NN The basic philosophy behind sub-linear search is bucketing, which allows us to preprocess dataset in a fashion so that we can filter many bad candidates without scanning all of them. LSH-based techniques rely on randomized hash functions to create buckets that probabilistically filter bad candidates. This philosophy is not restricted for binary similarity functions and is much more general. Here, we first focus on 3-way c-NN problem for binary data. Theorem 1 For R3way c-NN one can construct a data structure with O(nρ log1/cR0 n) query time and O(n1+ρ ) space, where ρ = 1 − log 1/c log 1/c+log 1/R0 . The argument for 2-way resemblance can be naturally extended to k-way resemblance. Specifically, given three sets S1 , S2 , S3 ⊆ Ω and an independent random permutation π : Ω → Ω, we have: P r (min(π(S1 )) = min(π(S2 )) = min(π(S3 ))) = R3way . (2) Eq.( 2) shows that minwise hashing, although it operates on sets individually, preserves all 3-way (in fact k-way) similarity structure of the data. The existence of such a hash function is the key requirement behind the existence of efficient approximate search. For the pairwise case, the probability event was a simple hash collision, and the min-hash itself serves as the bucket index. In case 3 of 3-way (and higher) c-NN problem, we have to take care of a more complicated event to create an indexing scheme. In particular, during preprocessing we need to create buckets for each individual S3 , and while querying we need to associate the query sets S1 and S2 to the appropriate bucket. We need extra mechanisms to manipulate these minwise hashes to obtain a bucketing scheme. Proof of Theorem 1: We use two additional functions: f1 : Ω → N for manipulating min(π(S3 )) and f2 : Ω × Ω → N for manipulating both min(π(S1 )) and min(π(S2 )). Let a ∈ N+ such that |Ω| = D < 10a . We define f1 (x) = (10a + 1) × x and f2 (x, y) = 10a x + y. This choice ensures that given query S1 and S2 , for any S3 ∈ C, f1 (min(π(S3 ))) = f2 (min(π(S1 )), min(π(S2 ))) holds if and only if min(π(S1 )) = min(π(S2 )) = min(π(S2 )), and thus we get a bucketing scheme. To complete the proof, we introduce two integer parameters K and L. Define a new hash function by concatenating K events. To be more precise, while preprocessing, for every element S3 ∈ C create buckets g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hK (S3 ))] where hi is chosen uniformly from minwise hashing family. For given query points S1 and S2 , retrieve only points in the bucket g2 (S1 , S2 ) = [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hK (S1 ), hK (S2 ))]. Repeat this process L times independently. For any K S3 ∈ C, with Sim(S1 , S2 , S3 ) ≥ R0 , is retrieved with probability at least 1 − (1 − R0 )L . Using log 1/c log K = ⌈ log n ⌉ and L = ⌈nρ log( 1 )⌉, where ρ = 1 − log 1/c+log 1/R0 , the proof can be obtained 1 δ cR0 using standard concentration arguments used to prove Fact 1, see [14, 13]. It is worth noting that the probability guarantee parameter δ gets absorbed in the constants as log( 1 ). Note, the process is δ stopped as soon as we find some element with R3way ≥ cR0 . Theorem 1 can be easily extended to k-way resemblance with same query time and space guarantees. Note that k-way c-NN is at least as hard as k ∗ -way c-NN for any k ∗ ≤ k, because we can always choose (k −k ∗ +1) identical query sets in k-way c-NN, and it reduces to k ∗ -way c-NN problem. So, any improvements in R3way c-NN implies improvement in the classical min-hash LSH for Jaccard similarity. The proposed analysis is thus tight in this sense. The above observation makes it possible to also perform the traditional pairwise c-NN search using the same hash tables deployed for 3-way c-NN. In the query phase we have an option, if we have two different queries S1 , S2 , then we retrieve from bucket g2 (S1 , S2 ) and that is usual 3-way c-NN search. If we are just interested in pairwise near neighbor search given one query S1 , then we will look into bucket g2 (S1 , S1 ), and we know that the 3-way resemblance between S1 , S1 , S3 boils down to the pairwise resemblance between S1 and S3 . So, the same hash tables can be used for both the purposes. This property generalizes, and hash tables created for k-way c-NN can be used for any k ∗ -way similarity search so long as k ∗ ≤ k. The approximation guarantees still holds. This flexibility makes k-way c-NN bucketing scheme more advantageous over the pairwise scheme. ρ 1 One of the peculiarity of LSH based techniques is that the query complexity exponent ρ < 1 is dependent on the choice R0=0.01 0.8 of the threshold R0 we are interested in and the value of c 0.05 0.1 0.3 0.6 which is the approximation ratio that we will tolerate. Figure 1 0.2 0.4 0.8 log 1/c 0.5 plots ρ = 1− log 1/c+log 1/R0 with respect to c, for selected R0 0.4 0.6 0.9 0.7 values from 0.01 to 0.99. For instance, if we are interested in 0.2 0.95 highly similar pairs, i.e. R0 ≈ 1, then we are looking at near R =0.99 0 O(log n) query complexity for c-NN problem as ρ ≈ 0. On 0 0 0.2 0.4 0.6 0.8 1 the other hand, for very lower threshold R0 , there is no much c log 1/c of hope of time-saving because ρ is close to 1. Figure 1: ρ = 1 − log 1/c+log 1/R0 . 5 Other Efficient k-way Similarities We refer to the k-way similarities for which there exist sub-linear algorithms for c-NN search with query and space complexity exactly as given in Theorem 1 as efficient . We have demonstrated existence of one such example of efficient similarities, which is the k-way resemblance. This leads to a natural question: “Are there more of them?”. [9] analyzed all the transformations on similarities that preserve existence of efficient LSH search. In particular, they showed that if S is a similarity for which there exists an LSH family, then there also exists an LSH family for any similarity which is a probability generating function (PGF) transfor∑∞ mation on S. PGF transformation on S is defined as P GF (S) = i=1 pi S i , where S ∈ [0, 1] and ∑∞ pi ≥ 0 satisfies i=1 pi = 1. Similar theorem can also be shown in the case of 3-way resemblance. 4 Theorem 2 Any PGF transformation on 3-way resemblance R3way is efficient. Recall in the proof of Theorem 1, we created hash assignments f1 (min(π(S3 ))) and f2 (min(π(S1 )), min(π(S2 ))), which lead to a bucketing scheme for the 3-way resemblance search, where the collision event E = {f1 (min(π(S3 )) = f2 (min(π(S1 )), min(π(S2 )))} happens with probability P r(E) = R3way . To prove the above Theorem 2, we will need to create hash events ∑∞ i having probability P GF (R3way ) = i=1 pi (R3way ) . Note that 0 ≤ P GF (R3way ) ≤ 1. We will make use of the following simple lemma. Lemma 1 (R3way )n is efficient for all n ∈ N. n n Proof: Define new hash assignments g1 (S3 ) = [f1 (h1 (S3 )); ...; f1 (hn (S3 ))] and g2 (S1 , S2 ) = n n [f2 (h1 (S1 ), h1 (S2 )); ...; f2 (hn (S1 ), hn (S2 ))]. The collision event g1 (S3 ) = g2 (S1 , S2 ) has n n probability (R3way )n . We now use the pair < g1 , g2 > instead of < f1 , f2 > and obtain same 3way n guarantees, as in Theorem 1, for (R ) as well. i i Proof of Theorem 2: From Lemma 1, let < g1 , g2 > be the hash pair corresponding to (R3way )i i i as used in above lemma. We sample one hash pair from the set {< g1 , g2 >: i ∈ N}, where i i the probability of sampling < g1 , g2 > is proportional to pi . Note that pi ≥ 0, and satisfies ∑∞ is i=1 pi = 1, and so the above sampling ∑ valid. It is not difficult to see that the collision of the ∞ sampled hash pair has probability exactly i=1 pi (R3way )i . Theorem 2 can be naturally extended to k-way similarity for any k ≥ 2. Thus, we now have infinitely many k-way similarity functions admitting efficient sub-linear search. One, that might be interesting, because of its radial basis kernel like nature, is shown in the following corollary. Corollary 1 eR k−way −1 is efficient. Proof: Use the expansion of eR k−way normalized by e to see that eR k−way −1 is a PGF on Rk−way . 6 Fast Algorithms for 3-way c-CP and 3-way c-BC Problems For 3-way c-CP and 3-way c-BC problems, using bucketing scheme with minwise hashing family will save even more computations. Theorem 3 For R3way c-Close Pair Problem (or c-CP) one can construct a data structure with log 1/c O(n2ρ log1/cR0 n) query time and O(n1+2ρ ) space, where ρ = 1 − log 1/c+log 1/R0 . Note that we can switch the role of f1 and f2 in the proof of Theorem 1. We are thus left with a c-NN problem with search space O(n2 ) (all pairs) instead of n. A bit of analysis, similar to Theorem 1, will show that this procedure achieves the required query time O(n2ρ log1/cR0 n), but uses a lot more space, O(n2(1+ρ )), than shown in the above theorem. It turns out that there is a better way of doing c-CP that saves us space. Proof of Theorem 3: We again start with constructing hash tables. For every element Sc ∈ C, we create a hash-table and store Sc in bucket B(Sc ) = [h1 (Sc ); h2 (Sc ); ...; hK (Sc )], where hi is chosen uniformly from minwise independent family of hash functions H. We create L such hash-tables. For a query element Sq we look for all pairs in bucket B(Sq ) = [h1 (Sq ); h2 (Sq ); ...; hK (Sq )] and repeat this for each of the L tables. Note, we do not form pairs of elements retrieved from different tables as they do not satisfy Eq. (2). If there exists a pair S1 , S2 ∈ C with Sim(Sq , S1 , S2 ) ≥ R0 , using K Eq. (2), we can see that we will find that pair in bucket B(Sq ) with probability 1 − (1 − R0 )L . Here, we cannot use traditional choice of K and L, similar to what we did in Theorem 1, as there 2 log are O(n2 ) instead of O(n) possible pairs. We instead use K = ⌈ log 1n ⌉ and L = ⌈n2ρ log( 1 )⌉, δ cR0 log 1/c with ρ = 1 − log 1/c+log 1/R0 . With this choice of K and L, the result follows. Note, the process is stopped as soon as we find pairs S1 and S2 with Sim(Sq , S1 , S2 ) ≥ cR0 . The key argument that saves space from O(n2(1+ρ) ) to O(n1+2ρ ) is that we hash n points individually. Eq. (2) makes it clear that hashing all possible pairs is not needed when every point can be processed individually, and pairs formed within each bucket itself filter out most of the unnecessary combinations. 5 Theorem 4 For R3way c-Best Cluster Problem (or c-BC) there exist an algorithm with running time log 1/c O(n1+2ρ log1/cR0 n), where ρ = 1 − log 1/c+log 1/R0 . The argument similar to one used in proof of Theorem 3 leads to the running time of O(n1+3ρ log1/cR0 n) as we need L = O(n3ρ ), and we have to processes all points at least once. Proof of Theorem 4: Repeat c-CP problem n times for every element in collection C acting as query once. We use the same set of hash tables and hash functions every time. The preprocessing time is O(n1+2ρ log1/cR0 n) evaluations of hash functions and the total querying time is O(n × n2ρ log1/cR0 n), which makes the total running time O(n1+2ρ log1/cR0 n). For k-way c-BC Problem, we can achieve O(n1+(k−1)ρ log1/cR0 n) running time. If we are interested in very high similarity cluster, with R0 ≈ 1, then ρ ≈ 0, and the running time is around O(n log n). This is a huge saving over the brute force O(nk ). In most practical cases, specially in big data regime where we have enormous amount of data, we can expect the k-way similarity of good clusters to be high and finding them should be efficient. We can see that with increasing k, hashing techniques save more computations. 7 Experiments In this section, we demonstrate the usability of 3-way and higher-order similarity search using (i) Google Sets, and (ii) Improving retrieval quality. 7.1 Google Sets: Generating Semantically Similar Words Here, the task is to retrieve words which are “semantically” similar to the given set of query words. We collected 1.2 million random documents from Wikipedia and created a standard term-doc binary vector representation of each term present in the collected documents after removing standard stop words and punctuation marks. More specifically, every word is represented as a 1.2 million dimension binary vector indicating its presence or absence in the corresponding document. The total number of terms (or words) was around 60,000 in this experiment. Since there is no standard benchmark available for this task, we show qualitative evaluations. For querying, we used the following four pairs of semantically related words: (i) “jaguar” and “tiger”; (ii) “artificial” and “intelligence”; (iii) “milky” and “way” ; (iv) “finger” and “lakes”. Given the query words w1 and w2 , we compare the results obtained by the following four methods. • Google Sets: We use Google’s algorithm and report 5 words from Google spreadsheets [1]. This is Google’s algorithm which uses its own data. • 3-way Resemblance (3-way): We use 3-way resemblance |w1 ∩w2 ∩w| to rank every word |w1 ∪w2 ∪w| w and report top 5 words based on this ranking. • Sum Resemblance (SR): Another intuitive method is to use the sum of pairwise resem|w2 ∩w| blance |w1 ∩w| + |w2 ∪w| and report top 5 words based on this ranking. |w1 ∪w| • Pairwise Intersection (PI): We first retrieve top 100 words based on pairwise resemblance for each w1 and w2 independently. We then report the words common in both. If there is no word in common we do not report anything. The results in Table 1 demonstrate that using 3-way resemblance retrieves reasonable candidates for these four queries. An interesting query is “finger” and “lakes”. Finger Lakes is a region in upstate New York. Google could only relate it to New York, while 3-way resemblance could even retrieve the names of cities and lakes in the region. Also, for query “milky” and “way”, we can see some (perhaps) unrelated words like “dance” returned by Google. We do not see such random behavior with 3-way resemblance. Although we are not aware of the algorithm and the dataset used by Google, we can see that 3-way resemblance appears to be a right measure for this application. The above results also illustrate the problem with using the sum of pairwise similarity method. The similarity value with one of the words dominates the sum and hence we see for queries “artificial” and “intelligence” that all the retrieved words are mostly related to the word “intelligence”. Same is the case with query “finger” and “lakes” as well as “jaguar” and “tiger”. Note that “jaguar” is also a car brand. In addition, for all 4 queries, there was no common word in the top 100 words similar to the each query word individually and so PI method never returns anything. 6 Table 1: Top five words retrieved using various methods for different queries. “JAGUAR” AND “ TIGER” G OOGLE 3- WAY SR LION LEOPARD CHEETAH CAT DOG LEOPARD CHEETAH LION PANTHER CAT CAT LEOPARD LITRE BMW CHASIS “MILKY” AND “ WAY” G OOGLE 3- WAY SR DANCE STARS SPACE THE UNIVERSE GALAXY STARS EARTH LIGHT SPACE EVEN ANOTHER STILL BACK TIME PI — — — — — “ARTIFICIAL” AND “INTELLIGENCE” G OOGLE 3- WAY SR PI COMPUTER COMPUTER SECURITY — PROGRAMMING SCIENCE WEAPONS — INTELLIGENT SECRET — SCIENCE ROBOT HUMAN ATTACKS — ROBOTICS TECHNOLOGY HUMAN — PI — — — — — G OOGLE NEW YORK NY PARK CITY “FINGER” AND “LAKES” 3- WAY SR SENECA CAYUGA ERIE ROCHESTER IROQUOIS RIVERS FRESHWATER FISH STREAMS FORESTED PI — — — — — We should note the importance of the denominator term in 3-way resemblance, without which frequent words will be blindly favored. The exciting contribution of this paper is that 3-way resemblance similarity search admits provable sub-linear guarantees, making it an ideal choice. On the other hand, no such provable guarantees are known for SR and other heuristic based search methods. 7.2 Improving Retrieval Quality in Similarity Search We also demonstrate how the retrieval quality of traditional similarity search can be boosted by utilizing more query candidates instead of just one. For the evaluations we choose two public datasets: MNIST and WEBSPAM, which were used in a recent related paper [26] for near neighbor search with binary data using b-bit minwise hashing [20, 23]. The two datasets reflect diversity both in terms of task and scale that is encountered in practice. The MNIST dataset consists of handwritten digit samples. Each sample is an image of 28 × 28 pixel yielding a 784 dimension vector with the associated class label (digit 0 − 9). We binarize the data by settings all non zeros to be 1. We used the standard partition of MNIST, which consists of 10,000 samples in one set and 60,000 in the other. The WEBSPAM dataset, with 16,609,143 features, consists of sparse vector representation of emails labeled as spam or not. We randomly sample 70,000 data points and partitioned them into two independent sets of size 35,000 each. Table 2: Percentage of top candidates with the same labels as that of query retrieved using various similarity criteria. More indicates better retrieval quality (Best marked in bold). T OP Pairwise 3-way NNbor 4-way NNbor 1 94.20 96.90 97.70 MNIST 10 20 92.33 91.10 96.13 95.36 96.89 96.28 50 89.06 93.78 95.10 1 98.45 99.75 99.90 WEBSPAM 10 20 96.94 96.46 98.68 97.80 98.87 98.15 50 95.12 96.11 96.45 For evaluation, we need to generate potential similar search query candidates for k-way search. It makes no sense in trying to search for object simultaneously similar to two very different objects. To generate such query candidates, we took one independent set of the data and partition it according to the class labels. We then run a cheap k-mean clustering on each class, and randomly sample triplets < x1 , x2 , x3 > from each cluster for evaluating 2-way, 3-way and 4-way similarity search. For MNIST dataset, the standard 10,000 test set was partitioned according to the labels into 10 sets, each partition was then clustered into 10 clusters, and we choose 10 triplets randomly from each cluster. In all we had 100 such triplets for each class, and thus 1000 overall query triplets. For WEBSPAM, which consists only of 2 classes, we choose one of the independent set and performed the same procedure. We selected 100 triplets from each cluster. We thus have 1000 triplets from each class making the total number of 2000 query candidates. The above procedures ensure that the elements in each triplets < x1 , x2 , x3 > are not very far from each other and are of the same class label. For each triplet < x1 , x2 , x3 >, we sort all the points x in the other independent set based on the following: • Pairwise: We only use the information in x1 and rank x based on resemblance 7 |x1 ∩x| |x1 ∪x| . • 3-way NN: We rank x based on 3-way resemblance • 4-way NN: We rank x based on 4-way resemblance |x1 ∩x2 ∩x| |x1 ∪x2 ∪x| . |x1 ∩x2 ∩x3 ∩x| |x1 ∪x2 ∪x3 ∪x| . We look at the top 1, 10, 20 and 50 points based on orderings described above. Since, all the query triplets are of the same label, The percentage of top retrieved candidates having same label as that of the query items is a natural metric to evaluate the retrieval quality. This percentage values accumulated over all the triplets are summarized in Table 2. We can see that top candidates retrieved by 3-way resemblance similarity, using 2 query points, are of better quality than vanilla pairwise similarity search. Also 4-way resemblance, with 3 query points, further improves the results compared to 3-way resemblance similarity search. This clearly demonstrates that multi-way resemblance similarity search is more desirable whenever we have more than one representative query in mind. Note that, for MNIST, which contains 10 classes, the boost compared to pairwise retrieval is substantial. The results follow a consistent trend. 8 Future Work While the work presented in this paper is promising for efficient 3-way and k-way similarity search in binary high-dimensional data, there are numerous interesting and practical research problems we can study as future work. In this section, we mention a few such examples. One-permutation hashing. Traditionally, building hash tables for near neighbor search required many (e.g., 1000) independent hashes. This is both time- and energy-consuming, not only for building tables but also for processing un-seen queries which have not been processed. One permutation hashing [22] provides the hope of reducing many permutations to merely one. The version in [22], however, was not applicable to near neighbor search due to the existence of many empty bins (which offer no indexing capability). The most recent work [27] is able to fill the empty bins and works well for pairwise near neighbor search. It will be interesting to extend [27] to k-way search. Non-binary sparse data. This paper focuses on minwise hashing for binary data. Various extensions to real-valued data are possible. For example, our results naturally apply to consistent weighted sampling [25, 15], which is one way to handle non-binary sparse data. The problem, however, is not solved if we are interested in similarities such as (normalized) k-way inner products, although the line of work on Conditional Random Sampling (CRS) [19, 18] may be promising. CRS works on non-binary sparse data by storing a bottom subset of nonzero entries after applying one permutation to (real-valued) sparse data matrix. CRS performs very well for certain applications but it does not work in our context because the bottom (nonzero) subsets are not properly aligned. Building hash tables by directly using bits from minwise hashing. This will be a different approach from the way how the hash tables are constructed in this paper. For example, [26] directly used the bits from b-bit minwise hashing [20, 23] to build hash tables and demonstrated the significant advantages compared to sim-hash [8, 12] and spectral hashing [29]. It would be interesting to see the performance of this approach in k-way similarity search. k-Way sign random projections. It would be very useful to develop theory for k-way sign random projections. For usual (real-valued) random projections, it is known that the volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might be also a (monotonic) function of the determinant. 9 Conclusions We formulate a new framework for k-way similarity search and obtain fast algorithms in the case of k-way resemblance with provable worst-case approximation guarantees. We show some applications of k-way resemblance search in practice and demonstrate the advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go. Acknowledgement The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research. 8 References [1] http://www.howtogeek.com/howto/15799/how-to-use-autofill-on-a-google-docs-spreadsheet-quick-tips/. [2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005. [3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. Group recommendation: semantics and efficiency. Proc. VLDB Endow., 2(1):754–765, 2009. [4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011. [5] Andrei Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997. [6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998. [7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999. [8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002. [9] Flavio Chierichetti and Ravi Kumar. LSH-preserving functions and their applications. In SODA, 2012. [10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003. [11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975. [12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995. [13] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012. [14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998. [15] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, 2010. [16] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007. [17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Technical report, arXiv:1207.6083, 2013. [18] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (Preliminary results appeared in HLT/EMNLP 2005), 33(3):305–354, 2007. [19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, Canada, 2006. [20] Ping Li and Arnd Christian K¨ nig. b-bit minwise hashing. In Proceedings of the 19th International o Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010. [21] Ping Li, Arnd Christian K¨ nig, and Wenhao Gui. b-bit minwise hashing for estimating three-way simio larities. In NIPS, Vancouver, Canada, 2010. [22] Ping Li, Art B Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012. [23] Ping Li, Anshumali Shrivastava, and Arnd Christian K¨ nig. b-bit minwise hashing in practice. In Intero netware, Changsha, China, 2013. [24] Avner Magen and Anastasios Zouzias. Near optimal dimensionality reductions that preserve volumes. In APPROX / RANDOM, pages 523–534, 2008. [25] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010. [26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012. [27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014. [28] Roger Weber, Hans-J¨ rg Schek, and Stephen Blott. A quantitative analysis and performance study for o similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998. [29] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, Vancouver, Canada, 2008. [30] D. Zhou, J. Huang, and B. Sch¨ lkopf. Beyond pairwise classification and clustering using hypergraphs. o In NIPS, Vancouver, Canada, 2006. 9
2 0.88715971 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
Author: Asrar Ahmed, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet
Abstract: In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focussed on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.
same-paper 3 0.85704851 344 nips-2013-Using multiple samples to learn mixture models
Author: Jason Lee, Ran Gilad-Bachrach, Rich Caruana
Abstract: In the mixture models problem it is assumed that there are K distributions θ1 , . . . , θK and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data. 1
4 0.81918448 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
Author: David Carlson, Vinayak Rao, Joshua T. Vogelstein, Lawrence Carin
Abstract: With simultaneous measurements from ever increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-theart. Via exploratory data analysis—using data with partial ground truth as well as two novel data sets—we find several features of our model collectively contribute to our improved performance including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain. 1
5 0.81598753 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
Author: Nikhil Rao, Christopher Cox, Rob Nowak, Timothy T. Rogers
Abstract: Multitask learning can be effective when features useful in one task are also useful for other tasks, and the group lasso is a standard method for selecting a common subset of features. In this paper, we are interested in a less restrictive form of multitask learning, wherein (1) the available features can be organized into subsets according to a notion of similarity and (2) features useful in one task are similar, but not necessarily identical, to the features best suited for other tasks. The main contribution of this paper is a new procedure called Sparse Overlapping Sets (SOS) lasso, a convex optimization that automatically selects similar features for related learning tasks. Error bounds are derived for SOSlasso and its consistency is established for squared error loss. In particular, SOSlasso is motivated by multisubject fMRI studies in which functional activity is classified using brain voxels as features. Experiments with real and synthetic data demonstrate the advantages of SOSlasso compared to the lasso and group lasso. 1
6 0.81369996 326 nips-2013-The Power of Asymmetry in Binary Hashing
7 0.81143576 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
9 0.80991548 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
10 0.8092494 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
11 0.80639464 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
12 0.80631858 294 nips-2013-Similarity Component Analysis
13 0.80524111 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
14 0.80513978 173 nips-2013-Least Informative Dimensions
15 0.80461216 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
16 0.80352932 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
17 0.80198079 301 nips-2013-Sparse Additive Text Models with Low Rank Background
18 0.80156672 350 nips-2013-Wavelets on Graphs via Deep Learning
19 0.80111986 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
20 0.80017501 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints