nips nips2012 nips2012-277 knowledge-graph by maker-knowledge-mining

277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Source: pdf

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into lowdimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we ﬁrst develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the ﬁrst method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract In this paper, we consider the problem of clustering data points into lowdimensional subspaces in the presence of outliers. [sent-8, score-0.45]

2 In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. [sent-11, score-0.209]

3 While the ﬁrst method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. [sent-12, score-0.179]

4 Both methods are extended to handle sparse outliers for robustness and can handle missing values. [sent-13, score-0.319]

5 Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers. [sent-14, score-0.496]

6 A classical approach is principal component analysis (PCA), which implicitly models data to live in a single low-dimensional subspace within the high-dimensional ambient space. [sent-16, score-0.466]

7 This modeling leads to the more challenging problem of subspace clustering, which attempts to simultaneously cluster data points into multiple subspaces and ﬁnd the basis of the corresponding subspace for each cluster. [sent-18, score-0.933]

8 Mathematically, subspace clustering can be deﬁned as follows: Let Y be the M × N data matrix consisting of N vectors {yi ∈ RM }N , which are assumed be drawn from a union of K linear (or i=1 afﬁne) subspaces Sk of unknown dimensions dk = dim(Sk ) with 0 < dk < M . [sent-19, score-0.886]

9 The subspace clustering problem is to ﬁnd the number of subspaces K, their dimensions {dk }K , the subspace k=1 bases, and the clustering of vectors yi into these subspaces. [sent-20, score-1.37]

10 Subspace clustering is widely investigated problem due to its application in a large number of ﬁelds, including computer vision [6, 12, 23], machine learning [11, 22] and system identiﬁcation [31] (see [22, 28] for comprehensive reviews). [sent-21, score-0.17]

11 Some of the common approaches include algebraicgeometric approaches such as generalized PCA (GPCA) [19, 29], spectral clustering [18], and mixture models [9, 26]. [sent-22, score-0.17]

12 The general approach in these methods is to ﬁrst ﬁnd a sparse/low-rank representation X of the data and then apply a spectral clustering method on X. [sent-24, score-0.17]

13 • Sparse Subspace Clustering (SSC) [7, 25]: This approach is based on representing data points yi as sparse linear combinations of other data points. [sent-27, score-0.182]

14 In these formulations, D is a clean dictionary and data Y is assumed to be the noisy version of D possibly with outliers. [sent-31, score-0.185]

15 When β → ∞, Y = D, and thus the data itself is used as the dictionary [7,15,25]. [sent-32, score-0.124]

16 If the subspaces are disjoint or independent1 , the solution X in both formulations is shown to be such that Xik = 0 only if data points yi and yk belong to the same subspace [7, 14, 15, 25]. [sent-33, score-0.714]

17 That is, the sparsest/lowest rank solution is obtained when each point yi is represented as a linear combination of points in its own subspace. [sent-34, score-0.179]

18 The estimated X is used to deﬁne an afﬁnity matrix [18] such as |X| + |XT | and a spectral clustering algorithm, such as normalized cuts [24], is applied on this afﬁnity to cluster the data vectors. [sent-35, score-0.289]

19 The subspace bases can then be obtained in a straightforward manner using this clustering. [sent-36, score-0.365]

20 We derive and analyze its global solution, and show that it is related to closed-form solution of the rankminimization formulation (2) in [8]. [sent-41, score-0.136]

21 The second one is based on a matrix-factorization formulation, and exploits the recent results on Bayesian matrix factorization [20] to achieve fast estimation that is less prone to errors due to alternating optimization. [sent-43, score-0.243]

22 Compared to deterministic methods, proposed Bayesian methods have the advantages of automatically estimating the dimensionality and the algorithmic parameters. [sent-45, score-0.091]

23 This is crucial in unsupervised clustering as the parameters can have a drastic effect on the solution, especially in the presence of heavy noise and outliers. [sent-46, score-0.206]

24 We also assume that each subspace is sufﬁciently sampled, that is, for each Si of dimension di , there exist at least di data vectors yi in Y that span Si . [sent-49, score-0.774]

25 We formulate the latent variable model as yi = di + nY , di = DAbi + nD , i = 1, . [sent-52, score-0.457]

26 , N K k=1 1 The subspaces Sk are called independent if dim( SK ) = The subspaces are disjoint if they only intersect at the origin. [sent-55, score-0.418]

28 (7) N We model the components as independent such that p(Y|D) = i=1 p(yi |di ), p(D|A, B) = N N i=1 p(di |D, A, bi ), and p(B) = i=1 p(bi ). [sent-62, score-0.097]

29 This model has the generative interpretation where latent vectors bi are drawn from an isotropic Gaussian distribution, shaped by A to obtain Abi , which then chooses a sample of points from the dictionary D to generate the ith dictionary element di . [sent-63, score-0.646]

30 In this sense, matrix DA has a role similar to principal subspace matrix in probabilistic principal component analysis (PPCA) [26]. [sent-64, score-0.653]

31 However, notice that in contrast to this and related approaches such as mixture of PPCAs [9, 26], the principal subspaces are deﬁned using the data itself in (6). [sent-65, score-0.316]

32 In (5), the observations yi are modeled as corrupted versions of dictionary elements di with iid Gaussian noise. [sent-66, score-0.464]

33 1 An Expectation-Maximization (EM) Algorithm In (5) - (7), latent variables bi can be regarded as missing data and D, A as parameters, and an EM algorithm can be devised for their joint estimation. [sent-70, score-0.14]

34 The complete log-likelihood is given by N LC = log p(yi , bi ) (8) i=1 with p(yi , bi ) = p(yi |di ) p(di |D, A, bi ) p(bi ). [sent-71, score-0.291]

35 (11) In summary, the maximum likelihood solution is obtained by an alternating iterative procedure where ﬁrst the statistics of B are calculated using (9), followed by the M-step updates for D, A, σd , and σy in (10) and (11), respectively. [sent-88, score-0.111]

36 This is a reasonable assumption if each subspace is sufﬁciently sampled and the dictionary element di belongs to one of them (i. [sent-92, score-0.607]

37 Here, D = UΛVT is the singular value decomposition (SVD) of D, and Vq contains its q right singular vectors that √ correspond to singular values that are larger than or equal to N σd . [sent-97, score-0.562]

38 Hence, the solution (12) is T related to the rank-q shape interaction matrix (SIM) Vq Vq [6], while in addition it involves scaling of the singular vectors via thresholded singular values of D. [sent-98, score-0.47]

39 Hence, the optimal D is obtained by applying the thresholding operation (13) on the singular values of Y, where the shrinkage amount is small for large singular values so that they are preserved, whereas small singular values are shrank by down-scaling. [sent-100, score-0.569]

40 As shown in [8], the nuclear norm formulation (2) leads to a similar closed-form solution, but it requires the solution of a quartic equation. [sent-102, score-0.09]

41 2 Finally, at the stationary points, the noise variance σd is found as 2 σd = 1 N −q N λ2 , q (14) q =q+1 that is, the average of the squared discarded singular values of D when computing DA B . [sent-103, score-0.176]

42 2 2 In summary, if σy and σd are given, the optimal D and A B are found by taking the SVD of Y and applying shrinkage/thresholding operations on the singular values of Y. [sent-105, score-0.176]

43 , σy = 0), an alternative method is to choose q, the total number of independent dimensions to be retained in 2 2 DA B , calculate σd from (14), and ﬁnally use (12) to obtain A B . [sent-109, score-0.083]

44 These issues can be overcome by employing a Bayesian estimation to automatically determine the effective dimensionality of D and AB. [sent-114, score-0.124]

45 For a M × N matrix X, the matrix-variate normal distribution is given by [10] N (X|M, Σ, Ω) = (2π) NM 2 N M 1 T |Σ|− 2 |Ω|− 2 exp − tr Σ−1 (X − M) Ω−1 (X − M) 2 (15) where M is the mean, and Σ, Ω are M × M row and N × N column covariances, respectively. [sent-119, score-0.129]

46 4 For inference, we employ the variational Bayesian (VB) method [4] which leads to a fast algorithm. [sent-126, score-0.095]

47 The variational free energy is given by the following functional 2 2 2 2 F = log q(D, A, B, CA , CB , σd , σy ) − log p(Y, D, A, B, CA , CB , σd , σy ) . [sent-128, score-0.129]

48 The convergence can be monitored during iterations using the variational free energy F. [sent-135, score-0.129]

49 2 σd = Similarly to the matrix factorization approaches [2, 3, 13], automatic dimensionality selection is invoked via hyperparameters cA,i and cB,i , which enforce sparsity in the columns and rows of A and B, respectively. [sent-138, score-0.229]

50 Speciﬁcally, when a particular set of variances cA,i , cB,i assume very small values, the posteriors of the ith column of A and ith row of B will be concentrated around zero, such that the effective number of principal directions in AB will be reduced. [sent-139, score-0.171]

51 Thus, one can apply a matrix factorization method to D, and relate this factorization to DAB to 1 ﬁnd AB. [sent-145, score-0.184]

52 It has been shown in [20] that when variational Bayesian inference is applied to this model, the global solution is found analytically and given by DL DR = UΛF VT , (24) 3 The optimal distribution q(A) does not have a matrix-variate normal form. [sent-147, score-0.217]

53 5 where U, V contain the singular vectors of D, and ΛF is a diagonal matrix, obtained by applying a speciﬁc shrinkage method to the singular values of D [20]. [sent-149, score-0.386]

54 The number of retained singular values are therefore automatically determined. [sent-150, score-0.273]

55 Then, setting DL DR equal to DAB, we obtain the solution T AB = Vf Λ−1 ΛF Vf , where the subscript f denotes the retained singular value and vectors. [sent-151, score-0.268]

56 f The only modiﬁcation to the method in the previous section is to replace the estimation of A and T B in (18)-(21) with the global solution Vf Λ−1 ΛF Vf . [sent-152, score-0.123]

57 Although the probability model is slightly different than the one described in the previous section, we anticipate that its global solution to be related to the factorization-based solution. [sent-154, score-0.09]

58 5 Robustness to Outliers Depending on the application, the outliers might be in various forms. [sent-155, score-0.204]

59 For instance in motion tracking applications, an entire data point might become an outlier if the tracker fails at that instance. [sent-156, score-0.211]

60 The only required change in the model is in the conditional distribution of the observations as 2 p(Y|D) = N (Y|D + E, σy ) , (25) where E is the sparse outlier matrix for which we introduce the prior p(E) = N (E|0, CC , CR ) = N (vec(E)|0, CC ⊗ CR ) . [sent-159, score-0.227]

61 In the ﬁrst case, the VB E,i E E estimation rule becomes q(ei ) = N ( ei , I, Σei ) with ei = Σei 1 (yi − di ) Σei = diag 2 σy with the hyperparameter update cR = ei E,i models can be derived in a similar manner. [sent-166, score-0.661]

62 T 1 1 + R 2 σy cE,i −1 , (27) ei +tr (Σei ). [sent-167, score-0.133]

63 The estimation rules for other outlier In the presence of outlier data points, there is an inherent unidentiﬁability between AB and E which can prevent the detection of outliers and hence reduce the performance of subspace clustering. [sent-168, score-0.879]

64 Speciﬁcally, an outlier yi can be included in the sparse component as ei = yi or included in the dictionary D with its own subspace, which leads to (AB)ii ≈ 1. [sent-169, score-0.739]

65 During iterations, data points yi with (AB)ii larger than a threshold (e. [sent-171, score-0.135]

66 As this might initially increase the variational energy F, we monitor its progress over a few iterations and reject this “birth” of the sparse component if F does not decrease below its original state. [sent-175, score-0.209]

67 This method is observed to be very effective in identifying outliers and alleviating the effect of the initialization. [sent-176, score-0.204]

68 Finally, missing values in Y can also be handled by modifying the distribution of the observations 2 in (5) to p(yi |di ) = k∈Zi N yik | dik , σy , where Zi is the set containing the indices of the observed entries in vector yi . [sent-177, score-0.138]

69 We also include comparisons with deterministic subspace clustering and mixture of PPCA (MPPCA) methods. [sent-184, score-0.496]

70 In all experiments, the estimated AB matrix is used to ﬁnd the afﬁnity matrix and the normalized cuts algorithm [24] is applied to ﬁnd the clustering and hence the subspaces. [sent-185, score-0.292]

71 16) VBLR VBLR−Fac MPPCA Clustering accuracy (%) 100 90 80 70 60 50 40 (a) (b) 30 0 Figure 1: Clustering 1D subspaces (points in the same cluster are in the same color) (a) MPPCA [3] result, (b) the result of the EM algorithm (global solution). [sent-188, score-0.246]

72 20 40 60 80 Percentage of Outliers (%) 100 Figure 2: Accuracy of clustering 5 independent subspaces of dimension 5 for different percentage of outliers. [sent-190, score-0.379]

73 1, where each contains 800 points slightly corrupted by iid Gaussian noise of variance 0. [sent-193, score-0.118]

74 Each line can be considered as a separate 1D subspace, and the subspaces are disjoint but not independent. [sent-195, score-0.209]

75 On the other hand, the EM method accurately clusters the lines into different subspaces (Fig. [sent-199, score-0.209]

76 Both Bayesian methods VBLR and VBLR-Fac gave similar results and accurately estimated the subspace dimensions, while the VB-variant of MPPCA [9] gave results similar to Fig. [sent-201, score-0.428]

77 Next, similarly to the setup in [15], we construct 5 independent subspaces {Si } ⊂ R50 of dimension 5 with bases Ui generated as follows: We ﬁrst generate a random 50 × 5 orthogonal matrix U1 , and then rotate it with random orthonormal matrices Ri to obtain Ui = Ri U1 , i = 2, 3, 4. [sent-203, score-0.288]

78 Dictionary D is obtained by sampling 25 points from each subspace using Di = Ui Vi where Vi are 5 × 25 matrices with elements drawn from N (0, 1). [sent-204, score-0.361]

79 Finally, Y is obtained by corrupting D with outliers sampled from N (0, 1) and normalized to lie on the unit sphere. [sent-205, score-0.204]

80 Table 1: Clustering errors (%) on the Hopkins155 motion database Method Mean Max Std GPCA [19] 30. [sent-212, score-0.137]

81 The Hopkins155 motion database [27] is frequently used to test subspace clustering methods. [sent-231, score-0.602]

82 Each motion corresponds to a subspace and each sequence is regarded as a separate clustering task. [sent-233, score-0.567]

83 While most existing methods use a pre-processing stage that generally involves dimensionality reduction using PCA, we do not employ pre-processing and apply our Bayesian methods directly (the EM method cannot handle outliers and thus is not included in the experiments). [sent-234, score-0.311]

84 The mean and maximum clustering errors and the standard deviation in the whole set are shown in Table 1. [sent-235, score-0.201]

85 Example recovered clean dictionary and sparse outlier components are shown in Fig. [sent-250, score-0.372]

86 Original images are shown in the left column (denoted by Y), the clean dictionary elements obtained by VBLR and VBLR-Fac are shown in columns denoted by DAB, and columns denoted by E show corruption captured by the sparse element. [sent-258, score-0.344]

87 7 Conclusion In this work we presented a probabilistic treatment of low dimensional subspace clustering. [sent-259, score-0.326]

88 Using a latent variable formulation, we developed an expectation-maximization method and derived its global solution. [sent-260, score-0.089]

89 We further proposed two effective Bayesian methods both based on the automatic relevance determination principle and variational Bayesian approximation for inference. [sent-261, score-0.136]

90 While the ﬁrst one, VBLR, relies completely on alternating optimization, the second one, VBLR-Fac, makes use of the global solution of VB matrix factorization to eliminate one alternating step and leads to faster convergence. [sent-262, score-0.336]

91 Both methods have been extended to handle sparse large corruptions in the data for robustness. [sent-263, score-0.135]

92 These methods are advantageous over deterministic methods as they are able to automatically determine the total number of principal dimensions and all required algorithmic parameters. [sent-264, score-0.191]

93 A closed form solution to robust subspace estimation and clustering. [sent-326, score-0.403]

94 Acquiring linear subspaces for face recognition under variable lighting. [sent-351, score-0.209]

95 Exact subspace segmentation and outlier detection by low-rank representation. [sent-384, score-0.545]

96 Latent low-rank representation for subspace segmentation and feature extraction. [sent-389, score-0.405]

97 Estimation of subspace arrangements with applications in modeling and segmenting mixed data,. [sent-400, score-0.326]

98 Clustering high-dimensional data: a survey on subspace slustering, pattern-based clustering, and correlation clustering. [sent-416, score-0.326]

99 Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories. [sent-424, score-0.198]

100 A benchmark for the comparison of 3-d motion segmentation algorithms. [sent-454, score-0.15]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('subspace', 0.326), ('vblr', 0.297), ('lrr', 0.216), ('subspaces', 0.209), ('outliers', 0.204), ('singular', 0.176), ('clustering', 0.17), ('dab', 0.162), ('ab', 0.162), ('di', 0.157), ('cb', 0.157), ('cr', 0.152), ('outlier', 0.14), ('bbt', 0.135), ('mppca', 0.135), ('ei', 0.133), ('dictionary', 0.124), ('principal', 0.107), ('cc', 0.105), ('em', 0.104), ('bayesian', 0.104), ('yi', 0.1), ('bi', 0.097), ('dl', 0.096), ('ssc', 0.095), ('variational', 0.095), ('da', 0.095), ('vq', 0.094), ('dr', 0.089), ('vf', 0.088), ('corrupted', 0.083), ('vb', 0.08), ('segmentation', 0.079), ('ca', 0.072), ('diag', 0.072), ('factorization', 0.072), ('gpca', 0.071), ('motion', 0.071), ('alternating', 0.067), ('ppca', 0.062), ('vidal', 0.062), ('clean', 0.061), ('tr', 0.057), ('abi', 0.054), ('corruptions', 0.054), ('dabi', 0.054), ('lsa', 0.054), ('dim', 0.051), ('gave', 0.051), ('sk', 0.049), ('automatically', 0.049), ('retained', 0.048), ('mn', 0.048), ('tron', 0.048), ('unidenti', 0.048), ('babacan', 0.048), ('sparse', 0.047), ('formulation', 0.046), ('global', 0.046), ('solution', 0.044), ('corruption', 0.044), ('nakajima', 0.044), ('urbana', 0.044), ('latent', 0.043), ('dimensionality', 0.042), ('cuts', 0.042), ('thresholding', 0.041), ('automatic', 0.041), ('matrix', 0.04), ('birth', 0.039), ('dt', 0.039), ('bases', 0.039), ('priors', 0.038), ('handled', 0.038), ('yale', 0.038), ('cluster', 0.037), ('corr', 0.037), ('pca', 0.037), ('af', 0.037), ('presence', 0.036), ('aug', 0.036), ('dk', 0.036), ('database', 0.035), ('cl', 0.035), ('dimensions', 0.035), ('points', 0.035), ('columns', 0.034), ('energy', 0.034), ('bb', 0.034), ('illinois', 0.034), ('handle', 0.034), ('vectors', 0.034), ('component', 0.033), ('estimation', 0.033), ('il', 0.032), ('normal', 0.032), ('ith', 0.032), ('develop', 0.031), ('errors', 0.031), ('included', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

2 0.13492443 16 nips-2012-A Polynomial-time Form of Robust Regression

Author: Ozlem Aslan, Dale Schuurmans, Yao-liang Yu

Abstract: Despite the variety of robust regression methods that have been developed, current regression formulations are either NP-hard, or allow unbounded response to even a single leverage point. We present a general formulation for robust regression—Variational M-estimation—that uniﬁes a number of robust regression methods while allowing a tractable approximation strategy. We develop an estimator that requires only polynomial-time, while achieving certain robustness and consistency guarantees. An experimental evaluation demonstrates the effectiveness of the new estimation approach compared to standard methods. 1

3 0.13420397 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

Author: Thanh Ngo, Yousef Saad

Abstract: This paper describes gradient methods based on a scaled metric on the Grassmann manifold for low-rank matrix completion. The proposed methods signiﬁcantly improve canonical gradient methods, especially on ill-conditioned matrices, while maintaining established global convegence and exact recovery guarantees. A connection between a form of subspace iteration for matrix completion and the scaled gradient descent procedure is also established. The proposed conjugate gradient method based on the scaled gradient outperforms several existing algorithms for matrix completion and is competitive with recently proposed methods. 1

4 0.12648317 254 nips-2012-On the Sample Complexity of Robust PCA

Author: Matthew Coudron, Gilad Lerman

Abstract: We estimate the rate of convergence and sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d. sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d. sample of size N is of order O(N −0.5+ ) for arbitrarily small > 0 (affecting the probabilistic estimate); this rate of convergence is close to the one of direct covariance estimation, i.e., O(N −0.5 ). Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is O(D2+δ ) for arbitrarily small δ > 0 (whereas the sample complexity of direct covariance estimation with Frobenius norm is O(D2 )). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm. 1

5 0.11790158 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

Author: Shinichi Nakajima, Ryota Tomioka, Masashi Sugiyama, S. D. Babacan

Abstract: The variational Bayesian (VB) approach is one of the best tractable approximations to the Bayesian estimation, and it was demonstrated to perform well in many applications. However, its good performance was not fully understood theoretically. For example, VB sometimes produces a sparse solution, which is regarded as a practical advantage of VB, but such sparsity is hardly observed in the rigorous Bayesian estimation. In this paper, we focus on probabilistic PCA and give more theoretical insight into the empirical success of VB. More speciﬁcally, for the situation where the noise variance is unknown, we derive a sufﬁcient condition for perfect recovery of the true PCA dimensionality in the large-scale limit when the size of an observed matrix goes to inﬁnity. In our analysis, we obtain bounds for a noise variance estimator and simple closed-form solutions for other parameters, which themselves are actually very useful for better implementation of VB-PCA. 1

6 0.11692661 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

7 0.11152562 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

8 0.10827419 69 nips-2012-Clustering Sparse Graphs

9 0.10629049 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

10 0.10538452 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

11 0.10065141 86 nips-2012-Convex Multi-view Subspace Learning

12 0.099516079 237 nips-2012-Near-optimal Differentially Private Principal Components

13 0.097229078 279 nips-2012-Projection Retrieval for Classification

14 0.096670099 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

15 0.09416686 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

16 0.093149319 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

17 0.091427952 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

18 0.088889778 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

19 0.086265482 126 nips-2012-FastEx: Hash Clustering with Exponential Families

20 0.085638821 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.243), (1, 0.09), (2, 0.007), (3, -0.063), (4, -0.028), (5, 0.008), (6, -0.011), (7, -0.013), (8, 0.042), (9, -0.059), (10, 0.053), (11, -0.098), (12, -0.074), (13, -0.041), (14, 0.094), (15, 0.098), (16, 0.037), (17, -0.002), (18, 0.033), (19, -0.083), (20, -0.166), (21, 0.049), (22, -0.09), (23, 0.107), (24, -0.058), (25, 0.023), (26, 0.034), (27, -0.068), (28, -0.02), (29, 0.142), (30, -0.052), (31, 0.093), (32, 0.024), (33, 0.03), (34, -0.1), (35, 0.007), (36, -0.028), (37, 0.008), (38, 0.096), (39, 0.09), (40, 0.054), (41, -0.083), (42, 0.009), (43, -0.079), (44, 0.046), (45, 0.032), (46, -0.018), (47, -0.127), (48, -0.088), (49, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95857555 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

2 0.75343901 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of Nystr¨ m approximao tion to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate signiﬁcant improvement over the existing relative-error algorithms. 1

3 0.70334816 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke

Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identiﬁcation in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identiﬁcation algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These beneﬁts are shown to extend to real neural data.

4 0.69717932 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

Author: Shinichi Nakajima, Ryota Tomioka, Masashi Sugiyama, S. D. Babacan

5 0.67056072 125 nips-2012-Factoring nonnegative matrices with linear programs

Author: Ben Recht, Christopher Re, Joel Tropp, Victor Bittorf

Abstract: This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identiﬁes a matrix C that satisﬁes X ≈ CX and some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to ﬁnd a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efﬁcient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes. 1

6 0.65803325 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

7 0.63988537 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

8 0.63832986 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

9 0.62654799 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

10 0.61954445 86 nips-2012-Convex Multi-view Subspace Learning

11 0.61237407 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery

12 0.58798891 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

13 0.57613629 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

14 0.56467044 254 nips-2012-On the Sample Complexity of Robust PCA

15 0.56265581 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

16 0.55143225 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

17 0.54603237 26 nips-2012-A nonparametric variable clustering model

18 0.53030509 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

19 0.52559561 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

20 0.52454776 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.056), (21, 0.028), (38, 0.168), (39, 0.019), (42, 0.032), (54, 0.02), (55, 0.021), (61, 0.191), (74, 0.054), (76, 0.147), (80, 0.141), (92, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9359315 2 nips-2012-3D Social Saliency from Head-mounted Cameras

Author: Hyun S. Park, Eakta Jain, Yaser Sheikh

Abstract: A gaze concurrence is a point in 3D where the gaze directions of two or more people intersect. It is a strong indicator of social saliency because the attention of the participating group is focused on that point. In scenes occupied by large groups of people, multiple concurrences may occur and transition over time. In this paper, we present a method to construct a 3D social saliency ďŹ eld and locate multiple gaze concurrences that occur in a social scene from videos taken by head-mounted cameras. We model the gaze as a cone-shaped distribution emanating from the center of the eyes, capturing the variation of eye-in-head motion. We calibrate the parameters of this distribution by exploiting the ďŹ xed relationship between the primary gaze ray and the head-mounted camera pose. The resulting gaze model enables us to build a social saliency ďŹ eld in 3D. We estimate the number and 3D locations of the gaze concurrences via provably convergent modeseeking in the social saliency ďŹ eld. Our algorithm is applied to reconstruct multiple gaze concurrences in several real world scenes and evaluated quantitatively against motion-captured ground truth. 1

2 0.93133748 190 nips-2012-Learning optimal spike-based representations

Author: Ralph Bourdoukan, David Barrett, Sophie Deneve, Christian K. Machens

Abstract: How can neural networks learn to represent information optimally? We answer this question by deriving spiking dynamics and learning dynamics directly from a measure of network performance. We ﬁnd that a network of integrate-and-ﬁre neurons undergoing Hebbian plasticity can learn an optimal spike-based representation for a linear decoder. The learning rule acts to minimise the membrane potential magnitude, which can be interpreted as a representation error after learning. In this way, learning reduces the representation error and drives the network into a robust, balanced regime. The network becomes balanced because small representation errors correspond to small membrane potentials, which in turn results from a balance of excitation and inhibition. The representation is robust because neurons become self-correcting, only spiking if the representation error exceeds a threshold. Altogether, these results suggest that several observed features of cortical dynamics, such as excitatory-inhibitory balance, integrate-and-ﬁre dynamics and Hebbian plasticity, are signatures of a robust, optimal spike-based code. A central question in neuroscience is to understand how populations of neurons represent information and how they learn to do so. Usually, learning and information representation are treated as two different functions. From the outset, this separation seems like a good idea, as it reduces the problem into two smaller, more manageable chunks. Our approach, however, is to study these together. This allows us to treat learning and information representation as two sides of a single mechanism, operating at two different timescales. Experimental work has given us several clues about the regime in which real networks operate in the brain. Some of the most prominent observations are: (a) high trial-to-trial variability—a neuron responds differently to repeated, identical inputs [1, 2]; (b) asynchronous ﬁring at the network level—spike trains of different neurons are at most very weakly correlated [3, 4, 5]; (c) tight balance of excitation and inhibition—every excitatory input is met by an inhibitory input of equal or greater size [6, 7, 8] and (4) spike-timing-dependent plasticity (STDP)—the strength of synapses change as a function of presynaptic and postsynaptic spike times [9]. Previously, it has been shown that observations (a)–(c) can be understood as signatures of an optimal, spike-based code [10, 11]. The essential idea is to derive spiking dynamics from the assumption that neurons only ﬁre if their spike improves information representation. Information in a network may ∗ Authors contributed equally 1 originate from several possible sources: external sensory input, external neural network input, or alternatively, it may originate within the network itself as a memory, or as a computation. Whatever the source, this initial assumption leads directly to the conclusion that a network of integrate-and-ﬁre neurons can optimally represent a signal while exhibiting properties (a)–(c). A major problem with this framework is that network connectivity must be completely speciﬁed a priori, and requires the tuning of N 2 parameters, where N is the number of neurons in the network. Although this is feasible mathematically, it is unclear how a real network could tune itself into this optimal regime. In this work, we solve this problem using a simple synaptic learning rule. The key insight is that the plasticity rule can be derived from the same basic principle as the spiking rule in the earlier work—namely, that any change should improve information representation. Surprisingly, this can be achieved with a local, Hebbian learning rule, where synaptic plasticity is proportional to the product of presynaptic ﬁring rates with post-synaptic membrane potentials. Spiking and synaptic plasticity then work hand in hand towards the same goal: the spiking of a neuron decreases the representation error on a fast time scale, thereby giving rise to the actual population representation; synaptic plasticity decreases the representation error on a slower time scale, thereby improving or maintaining the population representation. For a large set of initial connectivities and spiking dynamics, neural networks are driven into a balanced regime, where excitation and inhibition cancel each other and where spike trains are asynchronous and irregular. Furthermore, the learning rule that we derive reproduces the main features of STDP (property (d) above). In this way, a network can learn to represent information optimally, with synaptic, neural and network dynamics consistent with those observed experimentally. 1 Derivation of the learning rule for a single neuron We begin by deriving a learning rule for a single neuron with an autapse (a self-connection) (Fig. 1A). Our approach is to derive synaptic dynamics for the autapse and spiking dynamics for the neuron such that the neuron learns to optimally represent a time-varying input signal. We will derive a learning rule for networks of neurons later, after we have developed the fundamental concepts for the single neuron case. Our ﬁrst step is to derive optimal spiking dynamics for the neuron, so that we have a target for our learning rule. We do this by making two simple assumptions [11]. First, we assume that the neuron can provide an estimate or read-out x(t) of a time-dependent signal x(t) by ﬁltering its spike train ˆ o(t) as follows: ˙ x(t) = −ˆ(t) + Γo(t), ˆ x (1) where Γ is a ﬁxed read-out weight, which we will refer to as the neuron’s “output kernel” and the spike train can be written as o(t) = i δ(t − ti ), where {ti } are the spike times. Next, we assume that the neuron only produces a spike if that spike improves the read-out, where we measure the read-out performance through a simple squared-error loss function: 2 L(t) = x(t) − x(t) . ˆ (2) With these two assumptions, we can now derive optimal spiking dynamics. First, we observe that if the neuron produces an additional spike at time t, the read-out increases by Γ, and the loss function becomes L(t|spike) = (x(t) − (x(t) + Γ))2 . This allows us to restate our spiking rule as follows: ˆ the neuron should only produce a spike if L(t|no spike) > L(t|spike), or (x(t) − x(t))2 > (x(t) − ˆ (x(t) + Γ))2 . Now, squaring both sides of this inequality, deﬁning V (t) ≡ Γ(x(t) − x(t)) and ˆ ˆ deﬁning T ≡ Γ2 /2 we ﬁnd that the neuron should only spike if: V (t) > T. (3) We interpret V (t) to be the membrane potential of the neuron, and we interpret T as the spike threshold. This interpretation allows us to understand the membrane potential functionally: the voltage is proportional to a prediction error—the difference between the read-out x(t) and the actual ˆ signal x(t). A spike is an error reduction mechanism—the neuron only spikes if the error exceeds the spike threshold. This is a greedy minimisation, in that the neuron ﬁres a spike whenever that action decreases L(t) without considering the future impact of that spike. Importantly, the neuron does not require direct access to the loss function L(t). 2 To determine the membrane potential dynamics, we take the derivative of the voltage, which gives ˙ ˙ us V = Γ(x − x). (Here, and in the following, we will drop the time index for notational brevity.) ˙ ˆ ˙ Now, using Eqn. (1) we obtain V = Γx − Γ(−x + Γo) = −Γ(x − x) + Γ(x + x) − Γ2 o, so that: ˙ ˆ ˆ ˙ ˙ V = −V + Γc − Γ2 o, (4) where c = x + x is the neural input. This corresponds exactly to the dynamics of a leaky integrate˙ and-ﬁre neuron with an inhibitory autapse1 of strength Γ2 , and a feedforward connection strength Γ. The dynamics and connectivity guarantee that a neuron spikes at just the right times to optimise the loss function (Fig. 1B). In addition, it is especially robust to noise of different forms, because of its error-correcting nature. If x is constant in time, the voltage will rise up to the threshold T at which point a spike is ﬁred, adding a delta function to the spike train o at time t, thereby producing a read-out x that is closer to x and causing an instantaneous drop in the voltage through the autapse, ˆ by an amount Γ2 = 2T , effectively resetting the voltage to V = −T . We now have a target for learning—we know the connection strength that a neuron must have at the end of learning if it is to represent information optimally, for a linear read-out. We can use this target to derive synaptic dynamics that can learn an optimal representation from experience. Speciﬁcally, we consider an integrate-and-ﬁre neuron with some arbitrary autapse strength ω. The dynamics of this neuron are given by ˙ V = −V + Γc − ωo. (5) This neuron will not produce the correct spike train for representing x through a linear read-out (Eqn. (1)) unless ω = Γ2 . Our goal is to derive a dynamical equation for the synapse ω so that the spike train becomes optimal. We do this by quantifying the loss that we are incurring by using the suboptimal strength, and then deriving a learning rule that minimises this loss with respect to ω. The loss function underlying the spiking dynamics determined by Eqn. (5) can be found by reversing the previous membrane potential analysis. First, we integrate the differential equation for V , assuming that ω changes on time scales much slower than the membrane potential. We obtain the following (formal) solution: V = Γx − ω¯, o (6) ˙ where o is determined by o = −¯ + o. The solution to this latter equation is o = h ∗ o, a convolution ¯ ¯ o ¯ of the spike train with the exponential kernel h(τ ) = θ(τ ) exp(−τ ). As such, it is analogous to the instantaneous ﬁring rate of the neuron. Now, using Eqn. (6), and rewriting the read-out as x = Γ¯, we obtain the loss incurred by the ˆ o sub-optimal neuron, L = (x − x)2 = ˆ 1 V 2 + 2(ω − Γ2 )¯ + (ω − Γ2 )2 o2 . o ¯ Γ2 (7) We observe that the last two terms of Eqn. (7) will vanish whenever ω = Γ2 , i.e., when the optimal reset has been found. We can therefore simplify the problem by deﬁning an alternative loss function, 1 2 V , (8) 2 which has the same minimum as the original loss (V = 0 or x = x, compare Eqn. (2)), but yields a ˆ simpler learning algorithm. We can now calculate how changes to ω affect LV : LV = ∂LV ∂V ∂o ¯ =V = −V o − V ω ¯ . (9) ∂ω ∂ω ∂ω We can ignore the last term in this equation (as we will show below). Finally, using simple gradient descent, we obtain a simple Hebbian-like synaptic plasticity rule: τω = − ˙ ∂LV = V o, ¯ ∂ω (10) where τ is the learning time constant. 1 This contribution of the autapse can also be interpreted as the reset of an integrate-and-ﬁre neuron. Later, when we generalise to networks of neurons, we shall employ this interpretation. 3 This synaptic learning rule is capable of learning the synaptic weight ω that minimises the difference between x and x (Fig. 1B). During learning, the synaptic weight changes in proportion to the postˆ synaptic voltage V and the pre-synaptic ﬁring rate o (Fig. 1C). As such, this is a Hebbian learning ¯ rule. Of course, in this single neuron case, the pre-synaptic neuron and post-synaptic neuron are the same neuron. The synaptic weight gradually approaches its optimal value Γ2 . However, it never completely stabilises, because learning never stops as long as neurons are spiking. Instead, the synapse oscillates closely about the optimal value (Fig. 1D). This is also a “greedy” learning rule, similar to the spiking rule, in that it seeks to minimise the error at each instant in time, without regard for the future impact of those changes. To demonstrate that the second term in Eqn. (5) can be neglected we note that the equations for V , o, and ω deﬁne a system ¯ of coupled differential equations that can be solved analytically by integrating between spikes. This results in a simple recurrence relation for changes in ω from the ith to the (i + 1)th spike, ωi+1 = ωi + ωi (ωi − 2T ) . τ (T − Γc − ωi ) (11) This iterative equation has a single stable ﬁxed point at ω = 2T = Γ2 , proving that the neuron’s autaptic weight or reset will approach the optimal solution. 2 Learning in a homogeneous network We now generalise our learning rule derivation to a network of N identical, homogeneously connected neurons. This generalisation is reasonably straightforward because many characteristics of the single neuron case are shared by a network of identical neurons. We will return to the more general case of heterogeneously connected neurons in the next section. We begin by deriving optimal spiking dynamics, as in the single neuron case. This provides a target for learning, which we can then use to derive synaptic dynamics. As before, we want our network to produce spikes that optimally represent a variable x for a linear read-out. We assume that the read-out x is provided by summing and ﬁltering the spike trains of all the neurons in the network: ˆ ˙ x = −ˆ + Γo, ˆ x (12) 2 where the row vector Γ = (Γ, . . . , Γ) contains the read-out weights of the neurons and the column vector o = (o1 , . . . , oN ) their spike trains. Here, we have used identical read-out weights for each neuron, because this indirectly leads to homogeneous connectivity, as we will demonstrate. Next, we assume that a neuron only spikes if that spike reduces a loss-function. This spiking rule is similar to the single neuron spiking rule except that this time there is some ambiguity about which neuron should spike to represent a signal. Indeed, there are many different spike patterns that provide exactly the same estimate x. For example, one neuron could ﬁre regularly at a high rate (exactly like ˆ our previous single neuron example) while all others are silent. To avoid this ﬁring rate ambiguity, we use a modiﬁed loss function, that selects amongst all equivalent solutions, those with the smallest neural ﬁring rates. We do this by adding a ‘metabolic cost’ term to our loss function, so that high ﬁring rates are penalised: ¯ L = (x − x)2 + µ o 2 , ˆ (13) where µ is a small positive constant that controls the cost-accuracy trade-off, akin to a regularisation parameter. Each neuron in the optimal network will seek to reduce this loss function by ﬁring a spike. Speciﬁcally, the ith neuron will spike whenever L(no spike in i) > L(spike in i). This leads to the following spiking rule for the ith neuron: Vi > Ti (14) where Vi ≡ Γ(x − x) − µoi and Ti ≡ Γ2 /2 + µ/2. We can naturally interpret Vi as the membrane ˆ potential of the ith neuron and Ti as the spiking threshold of that neuron. As before, we can now derive membrane potential dynamics: ˙ V = −V + ΓT c − (ΓT Γ + µI)o, 2 (15) The read-out weights must scale as Γ ∼ 1/N so that ﬁring rates are not unrealistically small in large networks. We can see this by calculating the average ﬁring rate N oi /N ≈ x/(ΓN ) ∼ O(N/N ) ∼ O(1). i=1 ¯ 4 where I is the identity matrix and ΓT Γ + µI is the network connectivity. We can interpret the selfconnection terms {Γ2 +µ} as voltage resets that decrease the voltage of any neuron that spikes. This optimal network is equivalent to a network of identical integrate-and-ﬁre neurons with homogeneous inhibitory connectivity. The network has some interesting dynamical properties. The voltages of all the neurons are largely synchronous, all increasing to the spiking threshold at about the same time3 (Fig. 1F). Nonetheless, neural spiking is asynchronous. The ﬁrst neuron to spike will reset itself by Γ2 + µ, and it will inhibit all the other neurons in the network by Γ2 . This mechanism prevents neurons from spik- x 3 The ﬁrst neuron to spike will be random if there is some membrane potential noise. V (A) (B) x x ˆ x 10 1 0.1 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400 1 D 0.5 V V 0 ˆ x V ˆ x (C) 1 0 1 2 0 0.625 25 25.625 (D) start of learning 1 V 50 200.625 400 400.625 1 2.4 O 1.78 ω 1.77 25 neuron$ 0 1 2 !me$ 3 4 25 1 5 V 400.625 !me$ (F) 25 1 2.35 1.05 1.049 400 25.625 !me$ (E) neuron$ 100.625 200 end of learning 1.4 1.35 ω 100 !me$ 1 V 1 O 50.625 0 1 2 !me$ 3 4 5 V !me$ !me$ Figure 1: Learning in a single neuron and a homogeneous network. (A) A single neuron represents an input signal x by producing an output x. (B) During learning, the single neuron output x (solid red ˆ ˆ line, top panel) converges towards the input x (blue). Similarly, for a homogeneous network the output x (dashed red line, top panel) converges towards x. Connectivity also converges towards optimal ˆ connectivity in both the single neuron case (solid black line, middle panel) and the homogeneous net2 2 work case (dashed black line, middle panel), as quantiﬁed by D = maxi,j ( Ωij − Ωopt / Ωopt ) ij ij at each point in time. Consequently, the membrane potential reset (bottom panel) converges towards the optimal reset (green line, bottom panel). Spikes are indicated by blue vertical marks, and are produced when the membrane potential reaches threshold (bottom panel). Here, we have rescaled time, as indicated, for clarity. (C) Our learning rule dictates that the autapse ω in our single neuron (bottom panel) changes in proportion to the membrane potential (top panel) and the ﬁring rate (middle panel). (D) At the end of learning, the reset ω ﬂuctuates weakly about the optimal value. (E) For a homogeneous network, neurons spike regularly at the start of learning, as shown in this raster plot. Membrane potentials of different neurons are weakly correlated. (F) At the end of learning, spiking is very irregular and membrane potentials become more synchronous. 5 ing synchronously. The population as a whole acts similarly to the single neuron in our previous example. Each neuron ﬁres regularly, even if a different neuron ﬁres in every integration cycle. The design of this optimal network requires the tuning of N (N − 1) synaptic parameters. How can an arbitrary network of integrate-and-ﬁre neurons learn this optimum? As before, we address this question by using the optimal network as a target for learning. We start with an arbitrarily connected network of integrate-and-ﬁre neurons: ˙ V = −V + ΓT c − Ωo, (16) where Ω is a matrix of connectivity weights, which includes the resets of the individual neurons. Assuming that learning occurs on a slow time scale, we can rewrite this equation as V = ΓT x − Ω¯ . o (17) Now, repeating the arguments from the single neuron derivation, we modify the loss function to obtain an online learning rule. Speciﬁcally, we set LV = V 2 /2, and calculate the gradient: ∂LV = ∂Ωij Vk k ∂Vk =− ∂Ωij Vk δki oj − ¯ k Vk Ωkl kl ∂ ol ¯ . ∂Ωij (18) We can simplify this equation considerably by observing that the contribution of the second summation is largely averaged out under a wide variety of realistic conditions4 . Therefore, it can be neglected, and we obtain the following local learning rule: ∂LV ˙ = V i oj . ¯ τ Ωij = − ∂Ωij (19) This is a Hebbian plasticity rule, whereby connectivity changes in proportion to the presynaptic ﬁring rate oj and post-synaptic membrane potential Vi . We assume that the neural thresholds are set ¯ to a constant T and that the neural resets are set to their optimal values −T . In the previous section we demonstrated that these resets can be obtained by a Hebbian plasticity rule (Eqn. (10)). This learning rule minimises the difference between the read-out and the signal, by approaching the optimal recurrent connection strengths for the network (Fig. 1B). As in the single neuron case, learning does not stop, so the connection strengths ﬂuctuate close to their optimal value. During learning, network activity becomes progressively more asynchronous as it progresses towards optimal connectivity (Fig. 1E, F). 3 Learning in the general case Now that we have developed the fundamental concepts underlying our learning rule, we can derive a learning rule for the more general case of a network of N arbitrarily connected leaky integrateand-ﬁre neurons. Our goal is to understand how such networks can learn to optimally represent a ˙ J-dimensional signal x = (x1 , . . . , xJ ), using the read-out equation x = −x + Γo. We consider a network with the following membrane potential dynamics: ˙ V = −V + ΓT c − Ωo, (20) where c is a J-dimensional input. We assume that this input is related to the signal according to ˙ c = x + x. This assumption can be relaxed by treating the input as the control for an arbitrary linear dynamical system, in which case the signal represented by the network is the output of such a computation [11]. However, this further generalisation is beyond the scope of this work. As before, we need to identify the optimal recurrent connectivity so that we have a target for learning. Most generally, the optimal recurrent connectivity is Ωopt ≡ ΓT Γ + µI. The output kernels of the individual neurons, Γi , are given by the rows of Γ, and their spiking thresholds by Ti ≡ Γi 2 /2 + 4 From the deﬁnition of the membrane potential we can see that Vk ∼ O(1/N ) because Γ ∼ 1/N . Therefore, the size of the ﬁrst term in Eqn. (18) is k Vk δki oj = Vi oj ∼ O(1/N ). Therefore, the second term can ¯ ¯ be ignored if kl Vk Ωkl ∂ ol /∂Ωij ¯ O(1/N ). This happens if Ωkl O(1/N 2 ) as at the start of learning. It also happens towards the end of learning if the terms {Ωkl ∂ ol /∂Ωij } are weakly correlated with zero mean, ¯ or if the membrane potentials {Vi } are weakly correlated with zero mean. 6 µ/2. With these connections and thresholds, we ﬁnd that a network of integrate-and-ﬁre neurons ˆ ¯ will produce spike trains in such a way that the loss function L = x − x 2 + µ o 2 is minimised, ˆ where the read-out is given by x = Γ¯ . We can show this by prescribing a greedy5 spike rule: o a spike is ﬁred by neuron i whenever L(no spike in i) > L(spike in i) [11]. The resulting spike generation rule is Vi > Ti , (21) ˆ where Vi ≡ ΓT (x − x) − µ¯i is interpreted as the membrane potential. o i 5 Despite being greedy, this spiking rule can generate ﬁring rates that are practically identical to the optimal solutions: we checked this numerically in a large ensemble of networks with randomly chosen kernels. (A) x1 … x … 1 1 (B) xJJ x 10 L 10 T T 10 4 6 8 1 Viii V D ˆˆ ˆˆ x11 xJJ x x F 0.5 0 0.4 … … 0.2 0 0 2000 4000 !me (C) x V V 1 x 10 x 3 ˆ x 8 0 x 10 1 2 3 !me 4 5 4 0 1 4 0 1 8 V (F) Ρ(Δt) E-‐I input 0.4 ˆ x 0 3 0 1 x 10 1.3 0.95 x 10 ˆ x 4 V (E) 1 x 0 end of learning 50 neuron neuron 50 !me 2 0 ˆ x 0 0.5 ISI Δt 1 2 !me 4 5 4 1.5 1.32 3 2 0.1 Ρ(Δt) x E-‐I input (D) start of learning 0 2 !me 0 0 0.5 ISI Δt 1 Figure 2: Learning in a heterogeneous network. (A) A network of neurons represents an input ˆ signal x by producing an output x. (B) During learning, the loss L decreases (top panel). The difference between the connection strengths and the optimal strengths also decreases (middle panel), as 2 2 quantiﬁed by the mean difference (solid line), given by D = Ω − Ωopt / Ωopt and the maxi2 2 mum difference (dashed line), given by maxi,j ( Ωij − Ωopt / Ωopt ). The mean population ﬁring ij ij rate (solid line, bottom panel) also converges towards the optimal ﬁring rate (dashed line, bottom panel). (C, E) Before learning, a raster plot of population spiking shows that neurons produce bursts ˆ of spikes (upper panel). The network output x (red line, middle panel) fails to represent x (blue line, middle panel). The excitatory input (red, bottom left panel) and inhibitory input (green, bottom left panel) to a randomly selected neuron is not tightly balanced. Furthermore, a histogram of interspike intervals shows that spiking activity is not Poisson, as indicated by the red line that represents a best-ﬁt exponential distribution. (D, F) At the end of learning, spiking activity is irregular and ˆ Poisson-like, excitatory and inhibitory input is tightly balanced and x matches x. 7 How can we learn this optimal connection matrix? As before, we can derive a learning rule by minimising the cost function LV = V 2 /2. This leads to a Hebbian learning rule with the same form as before: ˙ τ Ωij = Vi oj . ¯ (22) Again, we assume that the neural resets are given by −Ti . Furthermore, in order for this learning rule to work, we must assume that the network input explores all possible directions in the J-dimensional input space (since the kernels Γi can point in any of these directions). The learning performance does not critically depend on how the input variable space is sampled as long as the exploration is extensive. In our simulations, we randomly sample the input c from a Gaussian white noise distribution at every time step for the entire duration of the learning. We ﬁnd that this learning rule decreases the loss function L, thereby approaching optimal network connectivity and producing optimal ﬁring rates for our linear decoder (Fig. 2B). In this example, we have chosen connectivity that is initially much too weak at the start of learning. Consequently, the initial network behaviour is similar to a collection of unconnected single neurons that ignore each other. Spike trains are not Poisson-like, ﬁring rates are excessively large, excitatory and inhibitory ˆ input is unbalanced and the decoded variable x is highly unreliable (Fig. 2C, E). As a result of learning, the network becomes tightly balanced and the spike trains become asynchronous, irregular and Poisson-like with much lower rates (Fig. 2D, F). However, despite this apparent variability, the population representation is extremely precise, only limited by the the metabolic cost and the discrete nature of a spike. This learnt representation is far more precise than a rate code with independent Poisson spike trains [11]. In particular, shufﬂing the spike trains in response to identical inputs drastically degrades this precision. 4 Conclusions and Discussion In population coding, large trial-to-trial spike train variability is usually interpreted as noise [2]. We show here that a deterministic network of leaky integrate-and-ﬁre neurons with a simple Hebbian plasticity rule can self-organise into a regime where information is represented far more precisely than in noisy rate codes, while appearing to have noisy Poisson-like spiking dynamics. Our learning rule (Eqn. (22)) has the basic properties of STDP. Speciﬁcally, a presynaptic spike occurring immediately before a post-synaptic spike will potentiate a synapse, because membrane potentials are positive immediately before a postsynaptic spike. Furthermore, a presynaptic spike occurring immediately after a post-synaptic spike will depress a synapse, because membrane potentials are always negative immediately after a postsynaptic spike. This is similar in spirit to the STDP rule proposed in [12], but different to classical STDP, which depends on post-synaptic spike times [9]. This learning rule can also be understood as a mechanism for generating a tight balance between excitatory and inhibitory input. We can see this by observing that membrane potentials after learning can be interpreted as representation errors (projected onto the read-out kernels). Therefore, learning acts to minimise the magnitude of membrane potentials. Excitatory and inhibitory input must be balanced if membrane potentials are small, so we can equate balance with optimal information representation. Previous work has shown that the balanced regime produces (quasi-)chaotic network dynamics, thereby accounting for much observed cortical spike train variability [13, 14, 4]. Moreover, the STDP rule has been known to produce a balanced regime [16, 17]. Additionally, recent theoretical studies have suggested that the balanced regime plays an integral role in network computation [15, 13]. In this work, we have connected these mechanisms and functions, to conclude that learning this balance is equivalent to the development of an optimal spike-based population code, and that this learning can be achieved using a simple Hebbian learning rule. Acknowledgements We are grateful for generous funding from the Emmy-Noether grant of the Deutsche Forschungsgemeinschaft (CKM) and the Chaire d’excellence of the Agence National de la Recherche (CKM, DB), as well as a James Mcdonnell Foundation Award (SD) and EU grants BACS FP6-IST-027140, BIND MECT-CT-20095-024831, and ERC FP7-PREDSPIKE (SD). 8 References [1] Tolhurst D, Movshon J, and Dean A (1982) The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res 23: 775–785. [2] Shadlen MN, Newsome WT (1998) The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci 18(10): 3870–3896. [3] Zohary E, Newsome WT (1994) Correlated neuronal discharge rate and its implication for psychophysical performance. Nature 370: 140–143. [4] Renart A, de la Rocha J, Bartho P, Hollender L, Parga N, Reyes A, & Harris, KD (2010) The asynchronous state in cortical circuits. Science 327, 587–590. [5] Ecker AS, Berens P, Keliris GA, Bethge M, Logothetis NK, Tolias AS (2010) Decorrelated neuronal ﬁring in cortical microcircuits. Science 327: 584–587. [6] Okun M, Lampl I (2008) Instantaneous correlation of excitation and inhibition during ongoing and sensory-evoked activities. Nat Neurosci 11, 535–537. [7] Shu Y, Hasenstaub A, McCormick DA (2003) Turning on and off recurrent balanced cortical activity. Nature 423, 288–293. [8] Gentet LJ, Avermann M, Matyas F, Staiger JF, Petersen CCH (2010) Membrane potential dynamics of GABAergic neurons in the barrel cortex of behaving mice. Neuron 65: 422–435. [9] Caporale N, Dan Y (2008) Spike-timing-dependent plasticity: a Hebbian learning rule. Annu Rev Neurosci 31: 25–46. [10] Boerlin M, Deneve S (2011) Spike-based population coding and working memory. PLoS Comput Biol 7, e1001080. [11] Boerlin M, Machens CK, Deneve S (2012) Predictive coding of dynamic variables in balanced spiking networks. under review. [12] Clopath C, B¨ sing L, Vasilaki E, Gerstner W (2010) Connectivity reﬂects coding: a model of u voltage-based STDP with homeostasis. Nat Neurosci 13(3): 344–352. [13] van Vreeswijk C, Sompolinsky H (1998) Chaotic balanced state in a model of cortical circuits. Neural Comput 10(6): 1321–1371. [14] Brunel N (2000) Dynamics of sparsely connected networks of excitatory and inhibitory neurons. J Comput Neurosci 8, 183–208. [15] Vogels TP, Rajan K, Abbott LF (2005) Neural network dynamics. Annu Rev Neurosci 28: 357–376. [16] Vogels TP, Sprekeler H, Zenke F, Clopath C, Gerstner W. (2011) Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks. Science 334(6062):1569– 73. [17] Song S, Miller KD, Abbott LF (2000) Competitive Hebbian learning through spike-timingdependent synaptic plasticity. Nat Neurosci 3(9): 919–926. 9

3 0.91253674 22 nips-2012-A latent factor model for highly multi-relational data

Author: Rodolphe Jenatton, Nicolas L. Roux, Antoine Bordes, Guillaume R. Obozinski

Abstract: Many data such as social networks, movie preferences or knowledge bases are multi-relational, in that they describe multiple relations between entities. While there is a large body of work focused on modeling these data, modeling these multiple types of relations jointly remains challenging. Further, existing approaches tend to breakdown when the number of these types grows. In this paper, we propose a method for modeling large multi-relational datasets, with possibly thousands of relations. Our model is based on a bilinear structure, which captures various orders of interaction of the data, and also shares sparse latent factors across different relations. We illustrate the performance of our approach on standard tensor-factorization datasets where we attain, or outperform, state-of-the-art results. Finally, a NLP application demonstrates our scalability and the ability of our model to learn efﬁcient and semantically meaningful verb representations. 1

same-paper 4 0.86697948 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

5 0.85738134 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

Author: Guillermo Canas, Lorenzo Rosasco

Abstract: We study the problem of estimating, in the sense of optimal transport metrics, a measure which is assumed supported on a manifold embedded in a Hilbert space. By establishing a precise connection between optimal transport metrics, optimal quantization, and learning theory, we derive new probabilistic bounds for the performance of a classic algorithm in unsupervised learning (k-means), when used to produce a probability measure derived from the data. In the course of the analysis, we arrive at new lower bounds, as well as probabilistic upper bounds on the convergence rate of empirical to population measures, which, unlike existing bounds, are applicable to a wide class of measures. 1 Introduction and Motivation In this paper we study the problem of learning from random samples a probability distribution supported on a manifold, when the learning error is measured using transportation metrics. The problem of learning a probability distribution is classic in statistics, and is typically analyzed for distributions in X = Rd that have a density with respect to the Lebesgue measure, with total variation, and L2 among the common distances used to measure closeness of two densities (see for instance [10, 32] and references therein.) The setting in which the data distribution is supported on a low dimensional manifold embedded in a high dimensional space has only been considered more recently. In particular, kernel density estimators on manifolds have been described in [36], and their pointwise consistency, as well as convergence rates, have been studied in [25, 23, 18]. A discussion on several topics related to statistics on a Riemannian manifold can be found in [26]. Interestingly, the problem of approximating measures with respect to transportation distances has deep connections with the ﬁelds of optimal quantization [14, 16], optimal transport [35] and, as we point out in this work, with unsupervised learning (see Sec. 4.) In fact, as described in the sequel, some of the most widely-used algorithms for unsupervised learning, such as k-means (but also others such as PCA and k-ﬂats), can be shown to be performing exactly the task of estimating the data-generating measure in the sense of the 2-Wasserstein distance. This close relation between learning theory, and optimal transport and quantization seems novel and of interest in its own right. Indeed, in this work, techniques from the above three ﬁelds are used to derive the new probabilistic bounds described below. Our technical contribution can be summarized as follows: (a) we prove uniform lower bounds for the distance between a measure and estimates based on discrete sets (such as the empirical measure or measures derived from algorithms such as kmeans); (b) we provide new probabilistic bounds for the rate of convergence of empirical to population measures which, unlike existing probabilistic bounds, hold for a very large class of measures; 1 (c) we provide probabilistic bounds for the rate of convergence of measures derived from k-means to the data measure. The structure of the paper is described at the end of Section 2, where we discuss the exact formulation of the problem as well as related previous works. 2 Setup and Previous work Consider the problem of learning a probability measure ρ supported on a space M, from an i.i.d. sample Xn = (x1 , . . . , xn ) ∼ ρn of size n. We assume M to be a compact, smooth d-dimensional manifold of bounded curvature, with C 1 metric and volume measure λM , embedded in the unit ball of a separable Hilbert space X with inner product ·, · , induced norm · , and distance d (for d instance M = B2 (1) the unit ball in X = Rd .) Following [35, p. 94], let Pp (M) denote the Wasserstein space of order 1 ≤ p < ∞: Pp (M) := x p dρ(x) < ∞ ρ ∈ P (M) : M of probability measures P (M) supported on M, with ﬁnite p-th moment. The p-Wasserstein distance 1/p Wp (ρ, µ) = inf [E X − Y p ] : Law(X) = ρ, Law(Y ) = µ (1) X,Y where the random variables X and Y are distributed according to ρ and µ respectively, is the optimal expected cost of transporting points generated from ρ to those generated from µ, and is guaranteed to be ﬁnite in Pp (M) [35, p. 95]. The space Pp (M) with the Wp metric is itself a complete separable metric space [35]. We consider here the problem of learning probability measures ρ ∈ P2 (M), where the performance is measured by the distance W2 . There are many possible choices of distances between probability measures [13]. Among them, Wp metrizes weak convergence (see [35] theorem 6.9), that is, in Pp (M), a sequence (µi )i∈N of measures converges weakly to µ iff Wp (µi , µ) → 0 and their p-th order moments converge to that of µ. There are other distances, such as the L´ vy-Prokhorov, or the weak-* distance, that also metrize e weak convergence. However, as pointed out by Villani in his excellent monograph [35, p. 98], 1. “Wasserstein distances are rather strong, [...]a deﬁnite advantage over the weak-* distance”. 2. “It is not so difﬁcult to combine information on convergence in Wasserstein distance with some smoothness bound, in order to get convergence in stronger distances.” Wasserstein distances have been used to study the mixing and convergence of Markov chains [22], as well as concentration of measure phenomena [20]. To this list we would add the important fact that existing and widely-used algorithms for unsupervised learning can be easily extended (see Sec. 4) to compute a measure ρ that minimizes the distance W2 (ˆn , ρ ) to the empirical measure ρ n ρn := ˆ 1 δx , n i=1 i a fact that will allow us to prove, in Sec. 5, bounds on the convergence of a measure induced by k-means to the population measure ρ. The most useful versions of Wasserstein distance are p = 1, 2, with p = 1 being the weaker of the two (by H¨ lder’s inequality, p ≤ q ⇒ Wp ≤ Wq .) In particular, “results in W2 distance are usually o stronger, and more difﬁcult to establish than results in W1 distance” [35, p. 95]. A discussion of p = ∞ would take us out of topic, since its behavior is markedly different. 2.1 Closeness of Empirical and Population Measures By the strong law of large numbers, the empirical measure converges almost surely to the population measure: ρn → ρ in the sense of the weak topology [34]. Since weak convergence and convergence ˆ in Wp plus convergence of p-th moments are equivalent in Pp (M), this means that, in the Wp sense, the empirical measure ρn converges to ρ, as n → ∞. A fundamental question is therefore how fast ˆ the rate of convergence of ρn → ρ is. ˆ 2 2.1.1 Convergence in expectation The rate of convergence of ρn → ρ in expectation has been widely studied in the past, resultˆ ing in upper bounds of order EW2 (ρ, ρn ) = O(n−1/(d+2) ) [19, 8], and lower bounds of order ˆ EW2 (ρ, ρn ) = Ω(n−1/d ) [29] (both assuming that the absolutely continuous part of ρ is ρA = 0, ˆ with possibly better rates otherwise). More recently, an upper bound of order EWp (ρ, ρn ) = O(n−1/d ) has been proposed [2] by proving ˆ a bound for the Optimal Bipartite Matching (OBM) problem [1], and relating this problem to the expected distance EWp (ρ, ρn ). In particular, given two independent samples Xn , Yn , the OBM ˆ problem is that of ﬁnding a permutation σ that minimizes the matching cost n−1 xi −yσ(i) p [24, p ˆ ˆ ˆ 30]. It is not hard to show that the optimal matching cost is Wp (ˆXn , ρYn ) , where ρXn , ρYn are ρ the empirical measures associated to Xn , Yn . By Jensen’s inequality, the triangle inequality, and (a + b)p ≤ 2p−1 (ap + bp ), it holds EWp (ρ, ρn )p ≤ EWp (ˆXn , ρYn )p ≤ 2p−1 EWp (ρ, ρn )p , ˆ ρ ˆ ˆ and therefore a bound of order O(n−p/d ) for the OBM problem [2] implies a bound EWp (ρ, ρn ) = ˆ O(n−1/d ). The matching lower bound is only known for a special case: ρA constant over a bounded set of non-null measure [2] (e.g. ρA uniform.) Similar results, with matching lower bounds are found for W1 in [11]. 2.1.2 Convergence in probability Results for convergence in probability, one of the main results of this work, appear to be considerably harder to obtain. One fruitful avenue of analysis has been the use of so-called transportation, or Talagrand inequalities Tp , which can be used to prove concentration inequalities on Wp [20]. In particular, we say that ρ satisﬁes a Tp (C) inequality with C > 0 iff Wp (ρ, µ)2 ≤ CH(µ|ρ), ∀µ ∈ Pp (M), where H(·|·) is the relative entropy [20]. As shown in [6, 5], it is possible to obtain probabilistic upper bounds on Wp (ρ, ρn ), with p = 1, 2, if ρ is known to satisfy a Tp inequality ˆ of the same order, thereby reducing the problem of bounding Wp (ρ, ρn ) to that of obtaining a Tp ˆ inequality. Note that, by Jensen’s inequality, and as expected from the behavior of Wp , the inequality T2 is stronger than T1 [20]. While it has been shown that ρ satisﬁes a T1 inequality iff it has a ﬁnite square-exponential moment 2 (E[eα x ] ﬁnite for some α > 0) [4, 7], no such general conditions have been found for T2 . As an example, consider that, if M is compact with diameter D then, by theorem 6.15 of [35], and the celebrated Csisz´ r-Kullback-Pinsker inequality [27], for all ρ, µ ∈ Pp (M), it is a Wp (ρ, µ)2p ≤ (2D)2p ρ − µ where · does not. TV 2 TV ≤ 22p−1 D2p H(µ|ρ), is the total variation norm. Clearly, this implies a Tp=1 inequality, but for p ≥ 2 it The T2 inequality has been shown by Talagrand to be satisﬁed by the Gaussian distribution [31], and then slightly more generally by strictly log-concave measures (see [20, p. 123], and [3].) However, as noted in [6], “contrary to the T1 case, there is no hope to obtain T2 inequalities from just integrability or decay estimates.” Structure of this paper. In this work we obtain bounds in probability (learning rates) for the problem of learning a probability measure in the sense of W2 . We begin by establishing (lower) bounds for the convergence of empirical to population measures, which serve to set up the problem and introduce the connection between quantization and measure learning (sec. 3.) We then describe how existing unsupervised learning algorithms that compute a set (k-means, k-ﬂats, PCA,. . . ) can be easily extended to produce a measure (sec. 4.) Due to its simplicity and widespread use, we focus here on k-means. Since the two measure estimates that we consider are the empirical measure, and the measure induced by k-means, we next set out to prove upper bounds on their convergence to the data-generating measure (sec. 5.) We arrive at these bounds by means of intermediate measures, which are related to the problem of optimal quantization. The bounds apply in a very broad setting (unlike existing bounds based on transportation inequalities, they are not restricted to log-concave measures [20, 3].) 3 3 Learning probability measures, optimal transport and quantization We address the problem of learning a probability measure ρ when the only observations we have at our disposal are n i.i.d. samples Xn = (x1 , . . . , xn ). We begin by establishing some notation and useful intermediate results. Given a closed set S ⊆ X , let {Vq : q ∈ S} be a Borel Voronoi partition of X composed of sets Vq closest to each q ∈ S, that is, such that each Vq ⊆ {x ∈ X : x − q = minr∈S x − r } is measurable (see for instance [15].) Consider the projection function πS : X → S mapping each x ∈ Vq to q. By virtue of {Vq }q∈S being a Borel Voronoi partition, the map πS is measurable [15], and it is d (x, πS (x)) = minq∈S x − q for all x ∈ X . For any ρ ∈ Pp (M), let πS ρ be the pushforward, or image measure of ρ under the mapping πS , −1 which is deﬁned to be (πS ρ)(A) := ρ(πS (A)) for all Borel measurable sets A. From its deﬁnition, it is clear that πS ρ is supported on S. We now establish a connection between the expected distance to a set S, and the distance between ρ and the set’s induced pushforward measure. Notice that, for discrete sets S, the expected Lp distance to S is exactly the expected quantization error Ep,ρ (S) := Ex∼ρ d(x, S)p = Ex∼ρ x − πS (x) p incurred when encoding points x drawn from ρ by their closest point πS (x) in S [14]. This close connection between optimal quantization and Wasserstein distance has been pointed out in the past in the statistics [28], optimal quantization [14, p. 33], and approximation theory [16] literatures. The following two lemmas are key tools in the reminder of the paper. The ﬁrst highlights the close link between quantization and optimal transport. Lemma 3.1. For closed S ⊆ X , ρ ∈ Pp (M), 1 ≤ p < ∞, it holds Ex∼ρ d(x, S)p = Wp (ρ, πS ρ)p . Note that the key element in the above lemma is that the two measures in the expression Wp (ρ, πS ρ) must match. When there is a mismatch, the distance can only increase. That is, Wp (ρ, πS µ) ≥ Wp (ρ, πS ρ) for all µ ∈ Pp (M). In fact, the following lemma shows that, among all the measures with support in S, πS ρ is closest to ρ. Lemma 3.2. For closed S ⊆ X , and all µ ∈ Pp (M) with supp(µ) ⊆ S, 1 ≤ p < ∞, it holds Wp (ρ, µ) ≥ Wp (ρ, πS ρ). When combined, lemmas 3.1 and 3.2 indicate that the behavior of the measure learning problem is limited by the performance of the optimal quantization problem. For instance, Wp (ρ, ρn ) can only ˆ be, in the best-case, as low as the optimal quantization cost with codebook of size n. The following section makes this claim precise. 3.1 Lower bounds Consider the situation depicted in ﬁg. 1, in which a sample X4 = {x1 , x2 , x3 , x4 } is drawn from a distribution ρ which we assume here to be absolutely continuous on its support. As shown, the projection map πX4 sends points x to their closest point in X4 . The resulting Voronoi decomposition of supp(ρ) is drawn in shades of blue. By lemma 5.2 of [9], the pairwise intersections of Voronoi regions have null ambient measure, and since ρ is absolutely continuous, the pushforward measure 4 can be written in this case as πX4 ρ = j=1 ρ(Vxj )δxj , where Vxj is the Voronoi region of xj . Note that, even for ﬁnite sets S, this particular decomposition is not always possible if the {Vq }q∈S form a Borel Voronoi tiling, instead of a Borel Voronoi partition. If, for instance, ρ has an atom falling on two Voronoi regions in a tiling, then both regions would count the atom as theirs, and double-counting would imply q ρ(Vq ) > 1. The technicalities required to correctly deﬁne a Borel Voronoi partition are such that, in general, it is simpler to write πS ρ, even though (if S is discrete) this measure can clearly be written as a sum of deltas with appropriate masses. By lemma 3.1, the distance Wp (ρ, πX4 ρ)p is the (expected) quantization cost of ρ when using X4 as codebook. Clearly, this cost can never be lower than the optimal quantization cost of size 4. This reasoning leads to the following lower bound between empirical and population measures. 4 Theorem 3.3. For ρ ∈ Pp (M) with absolutely continuous part ρA = 0, and 1 ≤ p < ∞, it holds Wp (ρ, ρn ) = Ω(n−1/d ) uniformly over ρn , where the constants depend on d and ρA only. ˆ ˆ Proof: Let Vn,p (ρ) := inf S⊂M,|S|=n Ex∼ρ d(x, S)p be the optimal quantization cost of ρ of order p with n centers. Since ρA = 0, and since ρ has a ﬁnite (p + δ)-th order moment, for some δ > 0 (since it is supported on the unit ball), then it is Vn,p (ρ) = Θ(n−p/d ), with constants depending on d and ρA (see [14, p. 78] and [16].) Since supp(ˆn ) = Xn , it follows that ρ Wp (ρ, ρn )p ˆ ≥ lemma 3.2 Wp (ρ, πXn ρ)p = lemma 3.1 Ex∼ρ d(x, Xn )p ≥ Vn,p (ρ) = Θ(n−p/d ) Note that the bound of theorem 3.3 holds for ρn derived from any sample Xn , and is therefore ˆ stronger than the existing lower bounds on the convergence rates of EWp (ρ, ρn ) → 0. In particular, ˆ it trivially induces the known lower bound Ω(n−1/d ) on the rate of convergence in expectation. 4 Unsupervised learning algorithms for learning a probability measure As described in [21], several of the most widely used unsupervised learning algorithms can be ˆ interpreted to take as input a sample Xn and output a set Sk , where k is typically a free parameter of the algorithm, such as the number of means in k-means1 , the dimension of afﬁne spaces in PCA, n ˆ etc. Performance is measured by the empirical quantity n−1 i=1 d(xi , Sk )2 , which is minimized among all sets in some class (e.g. sets of size k, afﬁne spaces of dimension k,. . . ) This formulation is general enough to encompass k-means and PCA, but also k-ﬂats, non-negative matrix factorization, and sparse coding (see [21] and references therein.) Using the discussion of Sec. 3, we can establish a clear connection between unsupervised learning and the problem of learning probability measures with respect to W2 . Consider as a running example the k-means problem, though the argument is general. Given an input Xn , the k-means problem is ˆ ˆ to ﬁnd a set |Sk | = k minimizing its average distance from points in Xn . By associating to Sk the pushforward measure πSk ρn , we ﬁnd that ˆ ˆ 1 n n ˆ ˆ d(xi , Sk )2 = Ex∼ρn d(x, Sk )2 ˆ i=1 = lemma 3.1 W2 (ˆn , πSk ρn )2 . ρ ˆ ˆ (2) Since k-means minimizes equation 2, it also ﬁnds the measure that is closest to ρn , among those ˆ with support of size k. This connection between k-means and W2 measure approximation was, to the best of the authors’ knowledge, ﬁrst suggested by Pollard [28] though, as mentioned earlier, the argument carries over to many other unsupervised learning algorithms. Unsupervised measure learning algorithms. We brieﬂy clarify the steps involved in using an existing unsupervised learning algorithm for probability measure learning. Let Uk be a parametrized algorithm (e.g. k-means) that takes a sample Xn and outputs a set Uk (Xn ). The measure learning algorithm Ak : Mn → Pp (M) corresponding to Uk is deﬁned as follows: ˆ 1. Ak takes a sample Xn and outputs the measure πSk ρn , supported on Sk = Uk (Xn ); ˆ ˆ 2. since ρn is discrete, then so must πSk ρn be, and thus Ak (Xn ) = ˆ ˆ ˆ 1 n n ˆ i=1 δπSk (xi ) ; 3. in practice, we can simply store an n-vector πSk (x1 ), . . . , πSk (xn ) , from which Ak (Xn ) ˆ ˆ can be reconstructed by placing atoms of mass 1/n at each point. In the case that Uk is the k-means algorithm, only k points and k masses need to be stored. Note that any algorithm A that attempts to output a measure A (Xn ) close to ρn can be cast in the ˆ above framework. Indeed, if S is the support of A (Xn ) then, by lemma 3.2, πS ρn is the measure ˆ closest to ρn with support in S . This effectively reduces the problem of learning a measure to that of ˆ 1 In a slight abuse of notation, we refer to the k-means algorithm here as an ideal algorithm that solves the k-means problem, even though in practice an approximation algorithm may be used. 5 ﬁnding a set, and is akin to how the fact that every optimal quantizer is a nearest-neighbor quantizer (see [15], [12, p. 350], and [14, p. 37–38]) reduces the problem of ﬁnding an optimal quantizer to that of ﬁnding an optimal quantizing set. Clearly, the minimum of equation 2 over sets of size k (the output of k-means) is monotonically ˆ ˆ non-increasing with k. In particular, since Sn = Xn and πSn ρn = ρn , it is Ex∼ρn d(x, Sn )2 = ˆ ˆ ˆ ˆ 2 W2 (ˆn , πSn ρn ) = 0. That is, we can always make the learned measure arbitrarily close to ρn ρ ˆ ˆ ˆ by increasing k. However, as pointed out in Sec. 2, the problem of measure learning is concerned with minimizing the 2-Wasserstein distance W2 (ρ, πSk ρn ) to the data-generating measure. The ˆ ˆ actual performance of k-means is thus not necessarily guaranteed to behave in the same way as the empirical one, and the question of characterizing its behavior as a function of k and n naturally arises. ˆ Finally, we note that, while it is Ex∼ρn d(x, Sk )2 = W2 (ˆn , πSk ρn )2 (the empirical performances ρ ˆ ˆ ˆ are the same in the optimal quantization, and measure learning problem formulations), the actual performances satisfy ˆ Ex∼ρ d(x, Sk )2 = W2 (ρ, π ˆ ρ)2 ≤ W2 (ρ, π ˆ ρn )2 , 1 ≤ k ≤ n. ˆ lemma 3.1 Sk lemma 3.2 Sk Consequently, with the identiﬁcation between sets S and measures πS ρn , the measure learning ˆ problem is, in general, harder than the set-approximation problem (for example, if M = Rd and ρ is absolutely continuous over a set of non-null volume, it is not hard to show that the inequality is ˆ almost surely strict: Ex∼ρ d(x, Sk )2 < W2 (ρ, πSk ρn )2 for 1 < k < n.) ˆ ˆ In the remainder, we characterize the performance of k-means on the measure learning problem, for varying k, n. Although other unsupervised learning algorithms could have been chosen as basis for our analysis, k-means is one of the oldest and most widely used, and the one for which the deep connection between optimal quantization and measure approximation is most clearly manifested. Note that, by setting k = n, our analysis includes the problem of characterizing the behavior of the distance W2 (ρ, ρn ) between empirical and population measures which, as indicated in Sec. 2.1, ˆ is a fundamental question in statistics (i.e. the speed of convergence of empirical to population measures.) 5 Learning rates In order to analyze the performance of k-means as a measure learning algorithm, and the convergence of empirical to population measures, we propose the decomposition shown in ﬁg. 2. The diagram includes all the measures considered in the paper, and shows the two decompositions used to prove upper bounds. The upper arrow (green), illustrates the decomposition used to bound the distance W2 (ρ, ρn ). This decomposition uses the measures πSk ρ and πSk ρn as intermediates to arrive ˆ ˆ at ρn , where Sk is a k-point optimal quantizer of ρ, that is, a set Sk minimizing Ex∼ρ d(x, S)2 over ˆ all sets of size |S| = k. The lower arrow (blue) corresponds to the decomposition of W2 (ρ, πSk ρn ) ˆ ˆ (the performance of k-means), whereas the labelled black arrows correspond to individual terms in the bounds. We begin with the (slightly) simpler of the two results. 5.1 Convergence rates for the empirical to population measures Let Sk be the optimal k-point quantizer of ρ of order two [14, p. 31]. By the triangle inequality and the identity (a + b + c)2 ≤ 3(a2 + b2 + c2 ), it follows that W2 (ρ, ρn )2 ≤ 3 W2 (ρ, πSk ρ)2 + W2 (πSk ρ, πSk ρn )2 + W2 (πSk ρn , ρn )2 . ˆ ˆ ˆ ˆ (3) This is the decomposition depicted in the upper arrow of ﬁg. 2. By lemma 3.1, the ﬁrst term in the sum of equation 3 is the optimal k-point quantization error of ρ over a d-manifold M which, using recent techniques from [16] (see also [17, p. 491]), is shown in the proof of theorem 5.1 (part a) to be of order Θ(k −2/d ). The remaining terms, b) and c), are slightly more technical and are bounded in the proof of theorem 5.1. Since equation 3 holds for all 1 ≤ k ≤ n, the best bound on W2 (ρ, ρn ) can be obtained by optimizˆ ing the right-hand side over all possible values of k, resulting in the following probabilistic bound for the rate of convergence of the empirical to population measures. 6 x2 x W2 (ρ, ρn ) ˆ supp ρ x1 π{x1 ,x2 ,x3 ,x4 } ρ a) x3 πSk ρ b) πSk ρn ˆ c) d) ρn ˆ πSk ρn ˆ ˆ W2 (ρ, πSk ρn ) ˆ ˆ x4 Figure 1: A sample {x1 , x2 , x3 , x4 } is drawn from a distribution ρ with support in supp ρ. The projection map π{x1 ,x2 ,x3 ,x4 } sends points x to their closest one in the sample. The induced Voronoi tiling is shown in shades of blue. Figure 2: The measures considered in this paper are linked by arrows for which upper bounds for their distance are derived. Bounds for the quantities of interest W2 (ρ, ρn )2 , and W2 (ρ, πSk ρn )2 , ˆ ˆ ˆ are decomposed by following the top and bottom colored arrows. Theorem 5.1. Given ρ ∈ Pp (M) with absolutely continuous part ρA = 0, sufﬁciently large n, and τ > 0, it holds W2 (ρ, ρn ) ≤ C · m(ρA ) · n−1/(2d+4) · τ, ˆ where m(ρA ) := 5.2 M 2 with probability 1 − e−τ . ρA (x)d/(d+2) dλM (x), and C depends only on d. Learning rates of k-means The key element in the proof of theorem 5.1 is that the distance between population and empirical measures can be bounded by choosing an intermediate optimal quantizing measure of an appropriate size k. In the analysis, the best bounds are obtained for k smaller than n. If the output of k-means is close to an optimal quantizer (for instance if sufﬁcient data is available), then we would similarly expect that the best bounds for k-means correspond to a choice of k < n. The decomposition of the bottom (blue) arrow in ﬁgure 2 leads to the following bound in probability. Theorem 5.2. Given ρ ∈ Pp (M) with absolutely continuous part ρA = 0, and τ > 0, then for all sufﬁciently large n, and letting k = C · m(ρA ) · nd/(2d+4) , it holds W2 (ρ, πSk ρn ) ≤ C · m(ρA ) · n−1/(2d+4) · τ, ˆ ˆ where m(ρA ) := M 2 with probability 1 − e−τ . ρA (x)d/(d+2) dλM (x), and C depends only on d. Note that the upper bounds in theorem 5.1 and 5.2 are exactly the same. Although this may appear ˆ surprising, it stems from the following fact. Since S = Sk is a minimizer of W2 (πS ρn , ρn )2 , the ˆ ˆ bound d) of ﬁgure 2 satisﬁes: W2 (πSk ρn , ρn )2 ≤ W2 (πSk ρn , ρn )2 ˆ ˆ ˆ ˆ ˆ and therefore (by the deﬁnition of c), the term d) is of the same order as c). It follows then that adding term d) to the bound only affects the constants, but otherwise leaves it unchanged. Since d) is the term that takes the output measure of k-means to the empirical measure, this implies that the rate of convergence of k-means (for suitably chosen k) cannot be worse than that of ρn → ρ. ˆ Conversely, bounds for ρn → ρ are obtained from best rates of convergence of optimal quantizers, ˆ whose convergence to ρ cannot be slower than that of k-means (since the quantizers that k-means produces are suboptimal.) 7 Since the bounds obtained for the convergence of ρn → ρ are the same as those for k-means with ˆ k of order k = Θ(nd/(2d+4) ), this suggests that estimates of ρ that are as accurate as those derived from an n point-mass measure ρn can be derived from k point-mass measures with k ˆ n. Finally, we note that the introduced bounds are currently limited by the statistical bound sup |W2 (πS ρn , ρn )2 − W2 (πS ρ, ρ)2 | ˆ ˆ |S|=k = sup |Ex∼ρn d(x, S)2 − Ex∼ρ d(x, S)2 | ˆ lemma 3.1 |S|=k (4) (see for instance [21]), for which non-matching lower bounds are known. This means that, if better upper bounds can be obtained for equation 4, then both bounds in theorems 5.1 and 5.2 would automatically improve (would become closer to the lower bound.) References [1] M. Ajtai, J. Komls, and G. Tusndy. On optimal matchings. Combinatorica, 4:259–264, 1984. [2] Franck Barthe and Charles Bordenave. Combinatorial optimization over two random point sets. Technical Report arXiv:1103.2734, Mar 2011. [3] Gordon Blower. The Gaussian isoperimetric inequality and transportation. Positivity, 7:203–224, 2003. [4] S. G. Bobkov and F. G¨ tze. Exponential integrability and transportation cost related to logarithmic o Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28, April 1999. [5] Emmanuel Boissard. Simple bounds for the convergence of empirical and occupation measures in 1wasserstein distance. Electron. J. Probab., 16(83):2296–2333, 2011. [6] F. Bolley, A. Guillin, and C. Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3):541–593, 2007. [7] F. Bolley and C. Villani. Weighted Csisz´ r-Kullback-Pinsker inequalities and applications to transportaa tion inequalities. Annales de la Faculte des Sciences de Toulouse, 14(3):331–352, 2005. [8] Claire Caillerie, Fr´ d´ ric Chazal, J´ rˆ me Dedecker, and Bertrand Michel. Deconvolution for the Wassere e eo stein metric and geometric inference. Rapport de recherche RR-7678, INRIA, July 2011. [9] Kenneth L. Clarkson. Building triangulations using -nets. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC ’06, pages 326–335, New York, NY, USA, 2006. ACM. [10] Luc Devroye and G´ bor Lugosi. Combinatorial methods in density estimation. Springer Series in Statisa tics. Springer-Verlag, New York, 2001. [11] V. Dobri and J. Yukich. Asymptotics for transportation cost in high dimensions. Journal of Theoretical Probability, 8:97–118, 1995. [12] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1992. [13] Alison L. Gibbs and Francis E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002. [14] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. SpringerVerlag New York, Inc., Secaucus, NJ, USA, 2000. [15] Siegfried Graf, Harald Luschgy, and Gilles Page`. Distortion mismatch in the quantization of probability s measures. Esaim: Probability and Statistics, 12:127–153, 2008. [16] Peter M. Gruber. Optimum quantization and its applications. Adv. Math, 186:2004, 2002. [17] P.M. Gruber. Convex and discrete geometry. Grundlehren der mathematischen Wissenschaften. Springer, 2007. [18] Guillermo Henry and Daniela Rodriguez. Kernel density estimation on riemannian manifolds: Asymptotic results. J. Math. Imaging Vis., 34(3):235–239, July 2009. [19] Joseph Horowitz and Rajeeva L. Karandikar. Mean rates of convergence of empirical measures in the Wasserstein metric. J. Comput. Appl. Math., 55(3):261–273, November 1994. [20] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001. [21] A. Maurer and M. Pontil. K–dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839 –5846, nov. 2010. [22] Yann Ollivier. Ricci curvature of markov chains on metric spaces. J. Funct. Anal., 256(3):810–864, 2009. 8 [23] Arkadas Ozakin and Alexander Gray. Submanifold density estimation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1375–1382. 2009. [24] C. Papadimitriou. The probabilistic analysis of matching heuristics. In Proc. of the 15th Allerton Conf. on Communication, Control and Computing, pages 368–378, 1978. [25] Bruno Pelletier. Kernel density estimation on Riemannian manifolds. Statist. Probab. Lett., 73(3):297– 304, 2005. [26] Xavier Pennec. Intrinsic statistics on riemannian manifolds: Basic tools for geometric measurements. J. Math. Imaging Vis., 25(1):127–154, July 2006. [27] M. S. Pinsker. Information and information stability of random variables and processes. San Francisco: Holden-Day, 1964. [28] David Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–204, 1982. [29] S.T. Rachev. Probability metrics and the stability of stochastic models. Wiley series in probability and mathematical statistics: Applied probability and statistics. Wiley, 1991. [30] J.M. Steele. Probability Theory and Combinatorial Optimization. Cbms-Nsf Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1997. [31] M. Talagrand. Transportation cost for Gaussian and other product measures. Geometric And Functional Analysis, 6:587–600, 1996. [32] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats. [33] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, 1996. [34] V. S. Varadarajan. On the convergence of sample probability distributions. Sankhy¯ : The Indian Journal a of Statistics, 19(1/2):23–26, Feb. 1958. [35] C. Villani. Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften. Springer, 2009. [36] P. Vincent and Y. Bengio. Manifold Parzen Windows. In Advances in Neural Information Processing Systems 22, pages 849–856. 2003. 9

6 0.81869787 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

7 0.81804419 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

8 0.81597471 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

9 0.81517565 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

10 0.81481671 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

11 0.81343544 200 nips-2012-Local Supervised Learning through Space Partitioning

12 0.81292832 65 nips-2012-Cardinality Restricted Boltzmann Machines

13 0.81215262 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

14 0.8119151 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

15 0.81181538 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

16 0.81041366 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

17 0.81003356 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

18 0.80962729 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

19 0.80839825 292 nips-2012-Regularized Off-Policy TD-Learning

20 0.80632758 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes