nips nips2012 nips2012-179 knowledge-graph by maker-knowledge-mining

179 nips-2012-Learning Manifolds with K-Means and K-Flats

Source: pdf

Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco

Abstract: We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-ﬂats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-ﬂats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-ﬂats, both the results and the mathematical tools are new. 1

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We study the problem of estimating a manifold from random samples. [sent-7, score-0.212]

2 In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-ﬂats, and analyze their performance. [sent-8, score-0.162]

3 First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-ﬂats), for which no known results were previously available. [sent-10, score-0.731]

4 One such assumption is that the data distribution lies on, or is close to, a low-dimensional set embedded in a high dimensional space, for instance a low dimensional manifold. [sent-15, score-0.177]

5 Starting from [23, 34, 5], this set of ideas, broadly referred to as manifold learning, has been applied to a variety of problems from supervised [35] and semisupervised learning [6], to clustering [37] and dimensionality reduction [5], to name a few. [sent-17, score-0.309]

6 Interestingly, the problem of learning the manifold itself has received less attention: given samples from a d-manifold M embedded in some ambient space X , the problem is to learn a set that approximates M in a suitable sense. [sent-18, score-0.346]

7 This problem has been considered in computational geometry, but in a setting in which typically the manifold is a hyper-surface in a low-dimensional space (e. [sent-19, score-0.239]

8 The problem of learning a manifold is also related to that of estimating the support of a distribution (see [13, 14] for recent surveys). [sent-22, score-0.212]

9 In this context, some of the distances considered to measure approximation quality are the Hausdorff distance and the so-called excess mass distance. [sent-23, score-0.102]

10 The reconstruction framework that we consider is related to the work of [1, 32], as well as to the framework proposed in [30], in which a manifold is approximated by a set, with performance measured by an expected distance to this set. [sent-24, score-0.552]

11 This setting is similar to the problem of dictionary learning (see for instance [29], and extensive references therein), in which a dictionary is found by minimizing a similar reconstruction error, perhaps with additional constraints on an associated encoding of the data. [sent-25, score-0.445]

12 Crucially, while the dictionary is learned on the empirical data, the quantity of interest is the expected reconstruction error, which is the focus of this work. [sent-26, score-0.468]

13 We analyze this problem by focusing on two important and widely used algorithms, namely k-means and k-flats. [sent-27, score-0.056]

14 The k-means algorithm can be seen to deﬁne a piecewise constant approximation of M. [sent-28, score-0.152]

15 Indeed, it induces a Voronoi decomposition on M, in which each Voronoi region is effectively approximated by a ﬁxed mean. [sent-29, score-0.112]

16 Given this, a natural extension is to consider higher order approximations, such as those induced by discrete collections of k d-dimensional affine spaces (k-flats), with possibly better resulting performance. [sent-30, score-0.105]

17 Since M is a d-manifold, the k-ﬂats approximation naturally resembles the way in which a manifold is locally approximated by its tangent bundle. [sent-31, score-0.48]

18 We note that the k-means algorithm has been widely studied, and thus much of our analysis in this case involves the combination of known facts to obtain novel results. [sent-33, score-0.045]

19 We begin our analysis by discussing the reconstruction properties of k-means in section 3. [sent-37, score-0.257]

20 The approximation (learning error) is measured by the expected reconstruction error $E_\rho(S_n) := \int_M d\rho(x)\, d_X^2(x, S_n)$ (1), where the distance to a set $S \subseteq X$ is $d_X^2(x, S) = \inf_{x' \in S} d_X^2(x, x')$, with $d_X(x, x') = \|x - x'\|$. [sent-45, score-0.416]

21 This is the same reconstruction measure that has been the recent focus of [30, 4, 32]. [sent-46, score-0.257]

22 ) In other words, the above error measure does not introduce an explicit penalty on the “size” of Sn : enlarging any given Sn can never increase the learning error. [sent-48, score-0.05]

23 Finally, note that the risk of Equation 1 is non-negative and, if the hypothesis space is sufﬁciently rich, the risk of an unsupervised algorithm may converge to zero under suitable conditions. [sent-52, score-0.119]

24 Although typically discussed in the Euclidean space case, their deﬁnition can be easily extended to a Hilbert space setting. [sent-55, score-0.054]

25 The study of manifolds embedded in a Hilbert space is of special interest when considering non-linear (kernel) versions of the algorithms [15]. [sent-56, score-0.211]

26 Naturally, the more classical setting of an absolutely continuous distribution over d-dimensional Euclidean space is simply a particular case, in which X = Rd , and M is a domain with positive Lebesgue measure. [sent-58, score-0.08]

27 Given a training set $X_n$ and a choice of k, k-means is defined by the minimization over $S \in \mathcal{S}_k$ of the empirical reconstruction error $E_n(S) := \frac{1}{n} \sum_{i=1}^{n} d_X^2(x_i, S)$ (2), [sent-61, score-0.339]

28 where, for any fixed set S, $E_n(S)$ is an unbiased empirical estimate of $E_\rho(S)$, so that k-means can be seen to be performing a kind of empirical risk minimization [10, 7, 30, 8, 31]. [sent-62, score-0.11]
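The empirical reconstruction error of Equation 2 can be sketched in a few lines. This is a minimal illustration, assuming a finite set S in Euclidean space; the function names `sq_dist_to_set` and `empirical_error` and the toy data are hypothetical and not taken from the paper.

```python
import random

def sq_dist_to_set(x, S):
    """Squared Euclidean distance from x to the nearest element of the finite set S."""
    return min(sum((a - b) ** 2 for a, b in zip(x, s)) for s in S)

def empirical_error(X, S):
    """Empirical reconstruction error E_n(S): mean squared distance from the samples to S."""
    return sum(sq_dist_to_set(x, S) for x in X) / len(X)

# Toy data (an assumption for illustration): 200 Gaussian samples in the plane.
random.seed(0)
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]

S_small = [(0.0, 0.0)]
S_large = S_small + [(1.0, 1.0), (-1.0, -1.0)]  # a superset of S_small

# Enlarging the set can never increase the error.
print(empirical_error(X, S_small), empirical_error(X, S_large))
```

Because the error is a mean of minima over S, adding points to S can only lower each term, which matches the remark that the error measure carries no explicit penalty on the size of the set.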

29 A minimizer of Equation 2 on $\mathcal{S}_k$ is a discrete set of k means $S_{n,k} = \{m_1, \ldots, m_k\}$, [sent-63, score-0.096]

30 which induces a Dirichlet-Voronoi tiling of X: a collection of k regions, each closest to a common mean [3] (in our notation, the subscript n denotes the dependence of $S_{n,k}$ on the sample, while k refers to its size). [sent-66, score-0.06]

31 These two facts imply that it is possible to compute a local minimum of the empirical risk by using a greedy coordinate-descent relaxation, namely Lloyd’s algorithm [27]. [sent-68, score-0.123]
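The alternating scheme referred to above can be sketched as follows. This is a minimal pure-Python version of Lloyd's algorithm under assumptions of my own: random initialization from the sample, a fixed iteration count, and empty cells keeping their previous mean. The function `lloyd` and its parameters are hypothetical names.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def lloyd(X, k, iters=50, seed=0):
    """Alternate between Voronoi assignment and mean update; return means and error history."""
    rng = random.Random(seed)
    means = rng.sample(X, k)  # initialize with k distinct sample indices
    history = []
    for _ in range(iters):
        # Assignment step: each sample goes to the cell of its closest mean.
        cells = [[] for _ in range(k)]
        for x in X:
            j = min(range(k), key=lambda i: sq_dist(x, means[i]))
            cells[j].append(x)
        # Update step: each mean becomes the centroid of its cell (unchanged if the cell is empty).
        means = [mean(c) if c else m for c, m in zip(cells, means)]
        # Empirical reconstruction error of the updated means.
        history.append(sum(min(sq_dist(x, m) for m in means) for x in X) / len(X))
    return means, history

# Toy data (an assumption): two well-separated Gaussian blobs in the plane.
rng = random.Random(1)
X = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(100)]
X += [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(100)]
means, history = lloyd(X, k=2)
print(means, history[-1])
```

Each of the two steps can only lower the empirical risk for the current partition or the current means, so the recorded error is non-increasing; the procedure reaches a local minimum only, not necessarily a global one.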

32 Let H = Fk be the class of collections of k ﬂats (afﬁne spaces) of dimension d. [sent-72, score-0.068]

33 For any value of k, k-flats, analogously to k-means, aims at finding the set $F_k \in \mathcal{F}_k$ that minimizes the empirical reconstruction error (2) over $\mathcal{F}_k$. [sent-73, score-0.321]

34 By an argument similar to the one used for k-means, a global minimizer must be attainable, and a Lloyd-type relaxation converges to a local minimum. [sent-74, score-0.059]

35 Note that, in this case, given a Voronoi partition of M into regions closest to each d-ﬂat, new optimizing ﬂats for that partition can be computed by a d-truncated PCA solution on the samples falling in each region. [sent-75, score-0.065]
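The Lloyd-type relaxation for k-flats can be sketched in the same style. To stay dependency-free, this sketch makes two simplifying assumptions that are mine, not the paper's: the flats are lines (d = 1), and the 1-truncated PCA step is approximated by power iteration. All names (`k_lines`, `top_direction`, and so on) are hypothetical.

```python
import math
import random

def centroid(P):
    return [sum(c) / len(P) for c in zip(*P)]

def top_direction(P, c, rng, iters=100):
    """Leading principal direction of P around c, by power iteration on R^T R."""
    R = [[a - b for a, b in zip(p, c)] for p in P]
    v = [rng.gauss(0.0, 1.0) for _ in c]
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(r, v)) for r in R]
        w = [sum(t * r[i] for t, r in zip(proj, R)) for i in range(len(c))]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:  # degenerate cell: all points coincide with the centroid
            break
        v = [a / norm for a in w]
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def sq_dist_to_line(x, c, v):
    """Squared distance from x to the line through c with unit direction v."""
    r = [a - b for a, b in zip(x, c)]
    t = sum(a * b for a, b in zip(r, v))
    return max(0.0, sum(a * a for a in r) - t * t)

def k_lines(X, k, iters=20, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in X]  # random initial partition
    flats = []
    for _ in range(iters):
        flats = []
        for j in range(k):
            # Refit each flat on its cell; fall back to all data if the cell is empty.
            cell = [x for x, l in zip(X, labels) if l == j] or X
            c = centroid(cell)
            flats.append((c, top_direction(cell, c, rng)))
        # Reassign each sample to its closest flat.
        labels = [min(range(k), key=lambda j: sq_dist_to_line(x, *flats[j])) for x in X]
    return flats

def flats_error(X, flats):
    return sum(min(sq_dist_to_line(x, c, v) for c, v in flats) for x in X) / len(X)

# Points lying exactly on the line y = 2x are reconstructed by a single flat.
X = [(float(t), 2.0 * t) for t in range(-5, 6)]
flats = k_lines(X, k=1)
print(flats_error(X, flats))
```

For general d, the update step would instead keep the top d principal directions of each cell, for example from an eigendecomposition of the cell covariance.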

36 Learning a Manifold with K-means and K-flats. In practice, k-means is often interpreted as a clustering algorithm, with clusters defined by the Voronoi diagram of the set of means $S_{n,k}$. [sent-77, score-0.066]

37 For instance, this point of view is considered in [11], where k-means is studied from an information-theoretic perspective. [sent-79, score-0.115]

38 K-means can also be interpreted to be performing vector quantization, where the goal is to minimize the encoding error associated to a nearest-neighbor quantizer [17]. [sent-80, score-0.096]

39 Interestingly, in the limit of increasing sample size, this problem coincides, in a precise sense [33], with the problem of optimal quantization of probability distributions (see for instance the excellent monograph of [18]). [sent-81, score-0.12]

40 When the data-generating distribution is supported on a manifold M, k-means can be seen to be approximating points on the manifold by a discrete set of means. [sent-82, score-0.455]

41 Analogously to the Euclidean setting, this induces a Voronoi decomposition of M, in which each Voronoi region is effectively approximated by a fixed mean (in this sense k-means produces a piecewise constant approximation of M). [sent-83, score-0.264]

42 As in the Euclidean setting, the limit of this problem with increasing sample size is precisely the problem of optimal quantization of distributions on manifolds, which is the subject of significant recent work in the field of optimal quantization [20, 21]. [sent-84, score-0.24]

43 In this paper, we take the above view of k-means as deﬁning a (piecewise constant) approximation of the manifold M supporting the data distribution. [sent-85, score-0.283]

44 In particular, we are interested in the behavior of the expected reconstruction error Eρ (Sn,k ), for varying k and n. [sent-86, score-0.398]

45 This perspective has an interesting relation with dictionary learning, in which one is interested in finding a dictionary, and an associated representation, that allows one to approximately reconstruct a finite set of data-points/signals. [sent-87, score-0.094]

46 In this interpretation, the set of means can be seen as a dictionary of size k that produces a maximally sparse representation (the k-means encoding), see for example [29] and references therein. [sent-88, score-0.129]

47 Crucially, while the dictionary is learned on the available empirical data, the quantity of interest is the expected reconstruction error, and the question of characterizing the performance with respect to this latter quantity naturally arises. [sent-89, score-0.543]

48 Since k-means produces a piecewise constant approximation of the data, a natural idea is to consider higher orders of approximation, such as approximation by discrete collections of k d-dimensional afﬁne spaces (k-ﬂats), with possibly better performance. [sent-90, score-0.328]

49 Since M is a d-manifold, the approximation induced by k-ﬂats may more naturally resemble the way in which a manifold is locally approximated by its tangent bundle. [sent-91, score-0.48]

50 Reconstruction Properties of k-Means. Since we are interested in the behavior of the expected reconstruction error (1) of k-means and k-flats for varying k and n, we first consider what is currently known about this problem from previous work. [sent-95, score-0.348]

51 While k-ﬂats is a relatively new algorithm whose behavior is not yet well understood, several properties of k-means are currently known. [sent-96, score-0.053]

52 [Figure 1 panels: "Expected reconstruction error: k-means, d=20" and "Test set reconstruction error: k-means, MNIST"; curves for n=100, n=200, n=500, n=1000.] [sent-99, score-0.514]

53 Figure 1: We consider the behavior of k-means for data sets obtained by uniformly sampling a 19-dimensional sphere embedded in $R^{20}$ (left). [sent-125, score-0.162]

54 The reconstruction performance on a (large) hold-out set is reported as a function of k. [sent-127, score-0.257]

55 The results for four different training set cardinalities are reported: for small number of points, the reconstruction error decreases sharply for small k and then increases, while it is simply decreasing for larger data sets. [sent-128, score-0.307]

56 For example [22] report an average intrinsic dimension d for each digit to be between 10 and 13. [sent-133, score-0.077]

57 Recall that k-means finds a discrete set $S_{n,k}$ of size k that best approximates the samples in the sense of (2). [sent-134, score-0.063]

58 Clearly, as k increases, the empirical reconstruction error En (Sn,k ) cannot increase, and typically decreases. [sent-135, score-0.339]

59 However, we are ultimately interested in the expected reconstruction error, and therefore would like to understand the behavior of Eρ (Sn,k ) with varying k, n. [sent-136, score-0.348]

60 In the context of optimal quantization, the behavior of the expected reconstruction error Eρ has been considered for an approximating set Sk obtained by minimizing the expected reconstruction error itself over the hypothesis space H = Sk . [sent-137, score-0.77]

61 In machine learning, the properties of k-means have been studied, for ﬁxed k, by considering the excess reconstruction error Eρ (Sn,k ) − Eρ (Sk ). [sent-141, score-0.338]

62 In particular, this quantity has been studied for $X = R^d$, and shown to be, with high probability, of order $\sqrt{kd/n}$, up to logarithmic factors [31]. [sent-142, score-0.112]

63 The more general setting where X is a metric space has been studied in [7]. [sent-144, score-0.059]

64 The above inequality suggests a somewhat surprising effect: the expected reconstruction properties of k-means may be described by a trade-off between a statistical error (of order $\sqrt{kd/n}$) and a geometric approximation error (of order $k^{-2/d}$). [sent-146, score-0.559]

65 For instance, in the k-means problem, it is intuitive that, as more means are inserted, the expected distance from a random sample to the means should decrease, and one might expect a similar behavior for the expected reconstruction error. [sent-148, score-0.108]

66 [Figure 2, panels (a) $E_\rho(S_{k=1})$ and (b) $E_\rho(S_{k=2})$.] [sent-152, score-0.147]

67 Indeed, it is not clear how close 1− 4 the distortion redundancy Eρ (Sn,k ) − Eρ (Sk ) is to its known lower bound of order d k n d (in expectation) [4]. [sent-156, score-0.085]
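The trade-off between a statistical term that grows with k and an approximation term of order $k^{-2/d}$ can be made concrete with a short calculation. The following is a sketch under an assumed generic form of the bound: the constants a, b and the exponents α, β are hypothetical placeholders, not values taken from the paper.

```latex
% Assumed generic bound: statistical term a k^alpha, approximation term b k^{-2/d}.
\begin{align*}
f(k) &= a\,k^{\alpha} + b\,k^{-2/d}, \qquad a, b, \alpha > 0,\\
f'(k) &= a\,\alpha\,k^{\alpha-1} - \tfrac{2b}{d}\,k^{-2/d-1} = 0
\;\Longrightarrow\; k^{\alpha + 2/d} = \frac{2b}{d\,\alpha\,a},\\
k^{*} &= \left(\frac{2b}{d\,\alpha\,a}\right)^{\frac{d}{\alpha d + 2}}.
\end{align*}
% If a is proportional to n^{-beta}, then
% k^* scales as n^{beta d / (alpha d + 2)} and f(k^*) scales as n^{-2 beta / (alpha d + 2)}.
```

Since f diverges both as k tends to 0 and as k tends to infinity, this single critical point is the minimizer; it grows with n whenever the statistical coefficient a decreases with n, which is the sense in which the bound suggests an optimal, sample-dependent choice of k.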

68 Indeed, as pointed out in [4], “The exact dependence of the minimax distortion redundancy on k and d is still a challenging open problem”. [sent-158, score-0.085]

69 Finally, we note that, whenever a trade-off can be shown to hold, it may be used to justify a heuristic for choosing k empirically as the value that minimizes the reconstruction error in a hold-out set. [sent-159, score-0.307]
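The hold-out heuristic mentioned above can be sketched as follows. This is an illustration under my own assumptions: a plain Lloyd-style k-means, data sampled uniformly on the unit circle, and a hypothetical candidate grid; none of these choices come from the paper.

```python
import math
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(X, k, iters=30, seed=0):
    """Plain Lloyd iterations from a random initialization (local minimum only)."""
    rng = random.Random(seed)
    means = rng.sample(X, k)
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for x in X:
            cells[min(range(k), key=lambda i: sq_dist(x, means[i]))].append(x)
        means = [tuple(sum(c) / len(cell) for c in zip(*cell)) if cell else m
                 for cell, m in zip(cells, means)]
    return means

def reconstruction_error(X, S):
    return sum(min(sq_dist(x, s) for s in S) for x in X) / len(X)

def choose_k(train, holdout, candidates):
    """Fit on the training set for each k, score on the hold-out set, keep the best k."""
    scores = {k: reconstruction_error(holdout, kmeans(train, k)) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores

def sample_circle(n, rng):
    pts = []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        pts.append((math.cos(t), math.sin(t)))
    return pts

rng = random.Random(1)
train = sample_circle(40, rng)      # small training set
holdout = sample_circle(2000, rng)  # large hold-out set approximating the expected error
candidates = [1, 2, 5, 10, 20, 40]
best_k, scores = choose_k(train, holdout, candidates)
print(best_k, scores)
```

The hold-out error stands in for the expected reconstruction error, so the selected k is only as reliable as the hold-out set is large.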

70 Because $d \gg n$, with high probability the samples are nearly orthogonal: $\langle x_1, x_2 \rangle_X \approx 0$, while a third sample x drawn uniformly on $S^{100}$ will also very likely be nearly orthogonal to both $x_1$, $x_2$ [25]. [sent-164, score-0.04]
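The near-orthogonality of random points on a high-dimensional sphere is easy to check numerically. The sketch below assumes that $S^{100}$ denotes the unit sphere in $R^{101}$ and samples it by normalizing Gaussian vectors; the helper names and the number of pairs are hypothetical.

```python
import math
import random

def sample_sphere(dim, rng):
    """Uniform sample on the unit sphere in R^dim, obtained by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in g))
    return [a / norm for a in g]

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

rng = random.Random(0)
dim = 101  # assumption: S^100 is the unit sphere in R^101
pairs = [(sample_sphere(dim, rng), sample_sphere(dim, rng)) for _ in range(200)]

# Average absolute inner product between independent samples.
mean_abs_inner = sum(abs(inner(x, y)) for x, y in pairs) / len(pairs)
print(mean_abs_inner)
```

The inner product of two independent uniform samples has standard deviation $1/\sqrt{\text{dim}}$, so its typical magnitude is small compared with the unit norm of each sample.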

71 Our work extends previous results in two different directions: (a) We provide an analysis of k-means for the case in which the data-generating distribution is supported on a manifold embedded in a Hilbert space. [sent-171, score-0.287]

72 In particular, in this setting: 1) we derive new results on the approximation error, and 2) new sample complexity results (learning rates) arising from the choice of k by optimizing the resulting bound. [sent-172, score-0.071]

73 We analyze the case in which a solution is obtained from an approximation algorithm, such as k-means++ [2], to include this computational error in the bounds. [sent-173, score-0.121]

74 (b) We generalize the above results from k-means to k-flats, deriving learning rates obtained from new bounds on both the statistical and the approximation errors. [sent-174, score-0.138]

75 We note that the k-means algorithm has been widely studied in the past, and much of our analysis in this case involves the combination of known facts to obtain novel results. [sent-176, score-0.077]

76 The probability measure ρ is absolutely continuous with respect to µI , with density p. [sent-180, score-0.053]

77 The bound of Equation (6) holds only when the absolutely continuous part of ρ over M is non-vanishing. [sent-187, score-0.053]

78 Note that according to the above theorems, choosing k requires knowledge of properties of the distribution ρ underlying the data, such as the intrinsic dimension of the support. [sent-197, score-0.077]

79 3-5, it is easy to prove that choosing k to minimize the reconstruction error on a hold-out set allows one to achieve the same learning rates (up to a logarithmic factor), adaptively, in the sense that knowledge of properties of ρ is not needed. [sent-199, score-0.37]

80 Assume the manifold M to have metric of class $C^3$, and finite second fundamental form II [16]. [sent-202, score-0.247]

81 We begin by providing a result for k-ﬂats on hypersurfaces (codimension one), and next extend it to manifolds in more general spaces. [sent-204, score-0.109]

82 In the more general case of a d-manifold M (with metric in $C^3$) embedded in a separable Hilbert space X, we cannot make any assumption on the codimension of M (the dimension of the orthogonal complement to the tangent space at each point). [sent-208, score-0.446]

83 Crucially, in this case, we may no longer assume the dimension of the orthogonal complement $(T_x M)^\perp$ to be finite. [sent-211, score-0.111]

84 Discussion. In all the results, the final performance does not depend on the dimensionality of the embedding space (which in fact can be infinite), but only on the intrinsic dimension of the space on which the data-generating distribution is defined. [sent-217, score-0.169]

85 The key to these results is an approximation construction in which the Voronoi regions on the manifold (points closest to a given mean or ﬂat) are guaranteed to have vanishing diameter in the limit of k going to inﬁnity. [sent-218, score-0.41]

86 Under our construction, a hypersurface is approximated efﬁciently by tracking the variation of its tangent spaces by using the second fundamental form. [sent-219, score-0.206]

87 Where this form vanishes, the Voronoi regions of an approximation will not be ensured to have vanishing diameter with k going to inﬁnity, unless certain care is taken in the analysis. [sent-220, score-0.197]

88 Note that these types of quantities have been linked to provably tight approximations in certain cases, such as for convex manifolds [19, 12], in contrast with worst-case methods that place a constraint on a maximum curvature, or minimum injectivity radius (for instance [1, 32]). [sent-222, score-0.148]

89 Intuitively, it is easy to see that a constraint on an average quantity may be arbitrarily less restrictive than one on its maximum. [sent-223, score-0.047]

90 of very high curvature) may cause the bounds of the latter to substantially degrade, while the results presented here would not be adversely affected so long as the region is small. [sent-226, score-0.072]

91 Additionally, care has been taken throughout to analyze the behavior of the constants. [sent-227, score-0.08]

92 Multiscale geometric methods for data sets ii: Geometric multi-resolution analysis. [sent-231, score-0.061]

93 In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’07, pages 1027–1035, Philadelphia, PA, USA, 2007. [sent-235, score-0.043]

94 Voronoi diagrams: A survey of a fundamental geometric data structure. [sent-238, score-0.096]

95 Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. [sent-253, score-0.061]

96 A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. [sent-260, score-0.102]

97 Empirical risk approximation: An induction principle for unsupervised learning. [sent-277, score-0.046]

98 In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC ’06, pages 326–335, New York, NY, USA, 2006. [sent-288, score-0.043]

99 Asymptotic estimates for best and stepwise approximation of convex bodies i. [sent-327, score-0.071]

100 In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, SGP ’04, pages 11–21, New York, NY, USA, 2004. [sent-354, score-0.043]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ats', 0.617), ('reconstruction', 0.257), ('sk', 0.247), ('voronoi', 0.225), ('manifold', 0.212), ('iix', 0.14), ('en', 0.137), ('quantization', 0.12), ('sn', 0.112), ('manifolds', 0.109), ('hilbert', 0.102), ('dictionary', 0.094), ('tangent', 0.088), ('codimension', 0.084), ('curvature', 0.083), ('kn', 0.081), ('piecewise', 0.081), ('ln', 0.079), ('embedded', 0.075), ('approximation', 0.071), ('geometric', 0.061), ('tx', 0.058), ('kmeans', 0.056), ('cuevas', 0.056), ('sublinearly', 0.056), ('fn', 0.054), ('fk', 0.054), ('absolutely', 0.053), ('behavior', 0.053), ('error', 0.05), ('quantity', 0.047), ('crucially', 0.047), ('risk', 0.046), ('quantizer', 0.046), ('intrinsic', 0.045), ('approximated', 0.045), ('facts', 0.045), ('redundancy', 0.044), ('symposium', 0.043), ('maurer', 0.043), ('joachim', 0.043), ('usa', 0.042), ('rd', 0.042), ('mnist', 0.041), ('hein', 0.041), ('guillermo', 0.041), ('distortion', 0.041), ('orthogonal', 0.04), ('complement', 0.039), ('lloyd', 0.039), ('tight', 0.039), ('spaces', 0.038), ('constants', 0.038), ('expected', 0.038), ('dimensionality', 0.038), ('regions', 0.037), ('bounds', 0.037), ('locally', 0.036), ('collections', 0.036), ('fig', 0.035), ('region', 0.035), ('fundamental', 0.035), ('means', 0.035), ('dimensional', 0.034), ('assumption', 0.034), ('geometry', 0.033), ('logarithmic', 0.033), ('riemannian', 0.033), ('ii', 0.033), ('lebesgue', 0.033), ('matthias', 0.033), ('studied', 0.032), ('xn', 0.032), ('empirical', 0.032), ('approximates', 0.032), ('vanishing', 0.032), ('kd', 0.032), ('analogously', 0.032), ('belkin', 0.032), ('dimension', 0.032), ('induces', 0.032), ('york', 0.032), ('euclidean', 0.031), ('excess', 0.031), ('discrete', 0.031), ('clustering', 0.031), ('minimizer', 0.03), ('rates', 0.03), ('diameter', 0.03), ('af', 0.029), ('global', 0.029), ('mathematical', 0.028), ('naturally', 0.028), ('broadly', 0.028), ('closest', 0.028), ('remark', 0.028), ('ny', 0.027), ('theoretic', 0.027), ('care', 0.027), 
('space', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 179 nips-2012-Learning Manifolds with K-Means and K-Flats

Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco

2 0.30726495 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

Author: Guillermo Canas, Lorenzo Rosasco

Abstract: We study the problem of estimating, in the sense of optimal transport metrics, a measure which is assumed supported on a manifold embedded in a Hilbert space. By establishing a precise connection between optimal transport metrics, optimal quantization, and learning theory, we derive new probabilistic bounds for the performance of a classic algorithm in unsupervised learning (k-means), when used to produce a probability measure derived from the data. In the course of the analysis, we arrive at new lower bounds, as well as probabilistic upper bounds on the convergence rate of empirical to population measures, which, unlike existing bounds, are applicable to a wide class of measures. 1 Introduction and Motivation In this paper we study the problem of learning from random samples a probability distribution supported on a manifold, when the learning error is measured using transportation metrics. The problem of learning a probability distribution is classic in statistics, and is typically analyzed for distributions in X = Rd that have a density with respect to the Lebesgue measure, with total variation, and L2 among the common distances used to measure closeness of two densities (see for instance [10, 32] and references therein.) The setting in which the data distribution is supported on a low dimensional manifold embedded in a high dimensional space has only been considered more recently. In particular, kernel density estimators on manifolds have been described in [36], and their pointwise consistency, as well as convergence rates, have been studied in [25, 23, 18]. A discussion on several topics related to statistics on a Riemannian manifold can be found in [26]. Interestingly, the problem of approximating measures with respect to transportation distances has deep connections with the ﬁelds of optimal quantization [14, 16], optimal transport [35] and, as we point out in this work, with unsupervised learning (see Sec. 4.) 
In fact, as described in the sequel, some of the most widely used algorithms for unsupervised learning, such as k-means (but also others such as PCA and k-flats), can be shown to be performing exactly the task of estimating the data-generating measure in the sense of the 2-Wasserstein distance. This close relation between learning theory, and optimal transport and quantization seems novel and of interest in its own right. Indeed, in this work, techniques from the above three fields are used to derive the new probabilistic bounds described below. Our technical contribution can be summarized as follows: (a) we prove uniform lower bounds for the distance between a measure and estimates based on discrete sets (such as the empirical measure or measures derived from algorithms such as k-means); (b) we provide new probabilistic bounds for the rate of convergence of empirical to population measures which, unlike existing probabilistic bounds, hold for a very large class of measures; (c) we provide probabilistic bounds for the rate of convergence of measures derived from k-means to the data measure. The structure of the paper is described at the end of Section 2, where we discuss the exact formulation of the problem as well as related previous works. 2 Setup and Previous work Consider the problem of learning a probability measure ρ supported on a space M, from an i.i.d. sample $X_n = (x_1, \ldots, x_n) \sim \rho^n$ of size n. We assume M to be a compact, smooth d-dimensional manifold of bounded curvature, with $C^1$ metric and volume measure $\lambda_M$, embedded in the unit ball of a separable Hilbert space X with inner product $\langle \cdot, \cdot \rangle$, induced norm $\|\cdot\|$, and distance d (for instance $M = B_2^d(1)$, the unit ball in $X = R^d$). Following [35, p. 94], let $P_p(M)$ denote the Wasserstein space of order $1 \le p < \infty$: $P_p(M) := \{ \rho \in P(M) : \int_M \|x\|^p \, d\rho(x) < \infty \}$ of probability measures P(M) supported on M, with finite p-th moment.
The p-Wasserstein distance $W_p(\rho, \mu) = \inf_{X,Y} \{ [E \|X - Y\|^p]^{1/p} : \mathrm{Law}(X) = \rho, \mathrm{Law}(Y) = \mu \}$ (1), where the random variables X and Y are distributed according to ρ and µ respectively, is the optimal expected cost of transporting points generated from ρ to those generated from µ, and is guaranteed to be finite in $P_p(M)$ [35, p. 95]. The space $P_p(M)$ with the $W_p$ metric is itself a complete separable metric space [35]. We consider here the problem of learning probability measures $\rho \in P_2(M)$, where the performance is measured by the distance $W_2$. There are many possible choices of distances between probability measures [13]. Among them, $W_p$ metrizes weak convergence (see [35] theorem 6.9), that is, in $P_p(M)$, a sequence $(\mu_i)_{i \in N}$ of measures converges weakly to µ iff $W_p(\mu_i, \mu) \to 0$ and their p-th order moments converge to that of µ. There are other distances, such as the Lévy-Prokhorov, or the weak-* distance, that also metrize weak convergence. However, as pointed out by Villani in his excellent monograph [35, p. 98], 1. “Wasserstein distances are rather strong, [...] a definite advantage over the weak-* distance”. 2. “It is not so difficult to combine information on convergence in Wasserstein distance with some smoothness bound, in order to get convergence in stronger distances.” Wasserstein distances have been used to study the mixing and convergence of Markov chains [22], as well as concentration of measure phenomena [20]. To this list we would add the important fact that existing and widely used algorithms for unsupervised learning can be easily extended (see Sec. 4) to compute a measure $\rho'$ that minimizes the distance $W_2(\hat\rho_n, \rho')$ to the empirical measure $\hat\rho_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$, a fact that will allow us to prove, in Sec. 5, bounds on the convergence of a measure induced by k-means to the population measure ρ. The most useful versions of Wasserstein distance are p = 1, 2, with p = 1 being the weaker of the two (by Hölder's inequality, $p \le q \Rightarrow W_p \le W_q$).
In particular, “results in W2 distance are usually stronger, and more difficult to establish than results in W1 distance” [35, p. 95]. A discussion of p = ∞ would take us out of topic, since its behavior is markedly different. 2.1 Closeness of Empirical and Population Measures By the strong law of large numbers, the empirical measure converges almost surely to the population measure: $\hat\rho_n \to \rho$ in the sense of the weak topology [34]. Since weak convergence and convergence in $W_p$ plus convergence of p-th moments are equivalent in $P_p(M)$, this means that, in the $W_p$ sense, the empirical measure $\hat\rho_n$ converges to ρ, as $n \to \infty$. A fundamental question is therefore how fast the rate of convergence of $\hat\rho_n \to \rho$ is. 2.1.1 Convergence in expectation The rate of convergence of $\hat\rho_n \to \rho$ in expectation has been widely studied in the past, resulting in upper bounds of order $E W_2(\rho, \hat\rho_n) = O(n^{-1/(d+2)})$ [19, 8], and lower bounds of order $E W_2(\rho, \hat\rho_n) = \Omega(n^{-1/d})$ [29] (both assuming that the absolutely continuous part of ρ is $\rho_A \neq 0$, with possibly better rates otherwise). More recently, an upper bound of order $E W_p(\rho, \hat\rho_n) = O(n^{-1/d})$ has been proposed [2] by proving a bound for the Optimal Bipartite Matching (OBM) problem [1], and relating this problem to the expected distance $E W_p(\rho, \hat\rho_n)$. In particular, given two independent samples $X_n$, $Y_n$, the OBM problem is that of finding a permutation σ that minimizes the matching cost $n^{-1} \sum_{i} \|x_i - y_{\sigma(i)}\|^p$ [24, 30]. It is not hard to show that the optimal matching cost is $W_p(\hat\rho_{X_n}, \hat\rho_{Y_n})^p$, where $\hat\rho_{X_n}$, $\hat\rho_{Y_n}$ are the empirical measures associated to $X_n$, $Y_n$. By Jensen's inequality, the triangle inequality, and $(a + b)^p \le 2^{p-1}(a^p + b^p)$, it holds $E W_p(\rho, \hat\rho_n)^p \le E W_p(\hat\rho_{X_n}, \hat\rho_{Y_n})^p \le 2^{p-1} E W_p(\rho, \hat\rho_n)^p$, and therefore a bound of order $O(n^{-p/d})$ for the OBM problem [2] implies a bound $E W_p(\rho, \hat\rho_n) = O(n^{-1/d})$.
The matching lower bound is only known for a special case: $\rho_A$ constant over a bounded set of non-null measure [2] (e.g. $\rho_A$ uniform). Similar results, with matching lower bounds, are found for $W_1$ in [11]. 2.1.2 Convergence in probability Results for convergence in probability, one of the main results of this work, appear to be considerably harder to obtain. One fruitful avenue of analysis has been the use of so-called transportation, or Talagrand inequalities $T_p$, which can be used to prove concentration inequalities on $W_p$ [20]. In particular, we say that ρ satisfies a $T_p(C)$ inequality with C > 0 iff $W_p(\rho, \mu)^2 \le C H(\mu|\rho)$, $\forall \mu \in P_p(M)$, where $H(\cdot|\cdot)$ is the relative entropy [20]. As shown in [6, 5], it is possible to obtain probabilistic upper bounds on $W_p(\rho, \hat\rho_n)$, with p = 1, 2, if ρ is known to satisfy a $T_p$ inequality of the same order, thereby reducing the problem of bounding $W_p(\rho, \hat\rho_n)$ to that of obtaining a $T_p$ inequality. Note that, by Jensen's inequality, and as expected from the behavior of $W_p$, the inequality $T_2$ is stronger than $T_1$ [20]. While it has been shown that ρ satisfies a $T_1$ inequality iff it has a finite square-exponential moment ($E[e^{\alpha \|x\|^2}]$ finite for some α > 0) [4, 7], no such general conditions have been found for $T_2$. As an example, consider that, if M is compact with diameter D then, by theorem 6.15 of [35], and the celebrated Csiszár-Kullback-Pinsker inequality [27], for all $\rho, \mu \in P_p(M)$, it is $W_p(\rho, \mu)^{2p} \le (2D)^{2p} \|\rho - \mu\|_{TV}^2 \le 2^{2p-1} D^{2p} H(\mu|\rho)$, where $\|\cdot\|_{TV}$ is the total variation norm. Clearly, this implies a $T_{p=1}$ inequality, but for p ≥ 2 it does not. The $T_2$ inequality has been shown by Talagrand to be satisfied by the Gaussian distribution [31], and then slightly more generally by strictly log-concave measures (see [20, p. 123], and [3]). However, as noted in [6], “contrary to the T1 case, there is no hope to obtain T2 inequalities from just integrability or decay estimates.” Structure of this paper.
In this work we obtain bounds in probability (learning rates) for the problem of learning a probability measure in the sense of $W_2$. We begin by establishing (lower) bounds for the convergence of empirical to population measures, which serve to set up the problem and introduce the connection between quantization and measure learning (sec. 3). We then describe how existing unsupervised learning algorithms that compute a set (k-means, k-flats, PCA, ...) can be easily extended to produce a measure (sec. 4). Due to its simplicity and widespread use, we focus here on k-means. Since the two measure estimates that we consider are the empirical measure, and the measure induced by k-means, we next set out to prove upper bounds on their convergence to the data-generating measure (sec. 5). We arrive at these bounds by means of intermediate measures, which are related to the problem of optimal quantization. The bounds apply in a very broad setting (unlike existing bounds based on transportation inequalities, they are not restricted to log-concave measures [20, 3]).

3 Learning probability measures, optimal transport and quantization

We address the problem of learning a probability measure $\rho$ when the only observations we have at our disposal are $n$ i.i.d. samples $X_n = (x_1, \ldots, x_n)$. We begin by establishing some notation and useful intermediate results.

Given a closed set $S \subseteq \mathcal{X}$, let $\{V_q : q \in S\}$ be a Borel Voronoi partition of $\mathcal{X}$ composed of sets $V_q$ closest to each $q \in S$, that is, such that each $V_q \subseteq \{x \in \mathcal{X} : \|x - q\| = \min_{r \in S} \|x - r\|\}$ is measurable (see for instance [15]). Consider the projection function $\pi_S : \mathcal{X} \to S$ mapping each $x \in V_q$ to $q$. By virtue of $\{V_q\}_{q \in S}$ being a Borel Voronoi partition, the map $\pi_S$ is measurable [15], and it is $d(x, \pi_S(x)) = \min_{q \in S} \|x - q\|$ for all $x \in \mathcal{X}$.

For any $\rho \in \mathcal{P}_p(\mathcal{M})$, let $\pi_S \rho$ be the pushforward, or image measure, of $\rho$ under the mapping $\pi_S$, which is defined to be $(\pi_S \rho)(A) := \rho(\pi_S^{-1}(A))$ for all Borel measurable sets $A$.
From its definition, it is clear that $\pi_S \rho$ is supported on $S$.

We now establish a connection between the expected distance to a set $S$, and the distance between $\rho$ and the set’s induced pushforward measure. Notice that, for discrete sets $S$, the expected $L_p$ distance to $S$ is exactly the expected quantization error

$$E_{p,\rho}(S) := E_{x \sim \rho}\, d(x, S)^p = E_{x \sim \rho}\, \|x - \pi_S(x)\|^p$$

incurred when encoding points $x$ drawn from $\rho$ by their closest point $\pi_S(x)$ in $S$ [14]. This close connection between optimal quantization and Wasserstein distance has been pointed out in the past in the statistics [28], optimal quantization [14, p. 33], and approximation theory [16] literatures.

The following two lemmas are key tools in the remainder of the paper. The first highlights the close link between quantization and optimal transport.

Lemma 3.1. For closed $S \subseteq \mathcal{X}$, $\rho \in \mathcal{P}_p(\mathcal{M})$, $1 \le p < \infty$, it holds $E_{x \sim \rho}\, d(x, S)^p = W_p(\rho, \pi_S \rho)^p$.

Note that the key element in the above lemma is that the two measures in the expression $W_p(\rho, \pi_S \rho)$ must match. When there is a mismatch, the distance can only increase. That is, $W_p(\rho, \pi_S \mu) \ge W_p(\rho, \pi_S \rho)$ for all $\mu \in \mathcal{P}_p(\mathcal{M})$. In fact, the following lemma shows that, among all the measures with support in $S$, $\pi_S \rho$ is closest to $\rho$.

Lemma 3.2. For closed $S \subseteq \mathcal{X}$, and all $\mu \in \mathcal{P}_p(\mathcal{M})$ with $\mathrm{supp}(\mu) \subseteq S$, $1 \le p < \infty$, it holds $W_p(\rho, \mu) \ge W_p(\rho, \pi_S \rho)$.

When combined, lemmas 3.1 and 3.2 indicate that the behavior of the measure learning problem is limited by the performance of the optimal quantization problem. For instance, $W_p(\rho, \hat\rho_n)$ can only be, in the best case, as low as the optimal quantization cost with a codebook of size $n$. The following section makes this claim precise.

3.1 Lower bounds

Consider the situation depicted in fig. 1, in which a sample $X_4 = \{x_1, x_2, x_3, x_4\}$ is drawn from a distribution $\rho$ which we assume here to be absolutely continuous on its support. As shown, the projection map $\pi_{X_4}$ sends points $x$ to their closest point in $X_4$.
The resulting Voronoi decomposition of $\mathrm{supp}(\rho)$ is drawn in shades of blue. By lemma 5.2 of [9], the pairwise intersections of Voronoi regions have null ambient measure, and since $\rho$ is absolutely continuous, the pushforward measure can be written in this case as $\pi_{X_4} \rho = \sum_{j=1}^{4} \rho(V_{x_j})\, \delta_{x_j}$, where $V_{x_j}$ is the Voronoi region of $x_j$.

Note that, even for finite sets $S$, this particular decomposition is not always possible if the $\{V_q\}_{q \in S}$ form a Borel Voronoi tiling, instead of a Borel Voronoi partition. If, for instance, $\rho$ has an atom falling on two Voronoi regions in a tiling, then both regions would count the atom as theirs, and double-counting would imply $\sum_q \rho(V_q) > 1$. The technicalities required to correctly define a Borel Voronoi partition are such that, in general, it is simpler to write $\pi_S \rho$, even though (if $S$ is discrete) this measure can clearly be written as a sum of deltas with appropriate masses.

By lemma 3.1, the distance $W_p(\rho, \pi_{X_4} \rho)^p$ is the (expected) quantization cost of $\rho$ when using $X_4$ as codebook. Clearly, this cost can never be lower than the optimal quantization cost of size 4. This reasoning leads to the following lower bound between empirical and population measures.

Theorem 3.3. For $\rho \in \mathcal{P}_p(\mathcal{M})$ with absolutely continuous part $\rho_A \neq 0$, and $1 \le p < \infty$, it holds $W_p(\rho, \hat\rho_n) = \Omega(n^{-1/d})$ uniformly over $\hat\rho_n$, where the constants depend on $d$ and $\rho_A$ only.

Proof: Let $V_{n,p}(\rho) := \inf_{S \subset \mathcal{M}, |S| = n} E_{x \sim \rho}\, d(x, S)^p$ be the optimal quantization cost of $\rho$ of order $p$ with $n$ centers. Since $\rho_A \neq 0$, and since $\rho$ has a finite $(p + \delta)$-th order moment, for some $\delta > 0$ (since it is supported on the unit ball), then it is $V_{n,p}(\rho) = \Theta(n^{-p/d})$, with constants depending on $d$ and $\rho_A$ (see [14, p. 78] and [16]).
Since $\mathrm{supp}(\hat\rho_n) = X_n$, it follows that

$$W_p(\rho, \hat\rho_n)^p \overset{\text{lemma 3.2}}{\ge} W_p(\rho, \pi_{X_n} \rho)^p \overset{\text{lemma 3.1}}{=} E_{x \sim \rho}\, d(x, X_n)^p \ge V_{n,p}(\rho) = \Theta(n^{-p/d}).$$

Note that the bound of theorem 3.3 holds for $\hat\rho_n$ derived from any sample $X_n$, and is therefore stronger than the existing lower bounds on the convergence rates of $E\,W_p(\rho, \hat\rho_n) \to 0$. In particular, it trivially induces the known lower bound $\Omega(n^{-1/d})$ on the rate of convergence in expectation.

4 Unsupervised learning algorithms for learning a probability measure

As described in [21], several of the most widely used unsupervised learning algorithms can be interpreted to take as input a sample $X_n$ and output a set $\hat S_k$, where $k$ is typically a free parameter of the algorithm, such as the number of means in k-means (see footnote 1), the dimension of affine spaces in PCA, etc. Performance is measured by the empirical quantity $n^{-1} \sum_{i=1}^{n} d(x_i, \hat S_k)^2$, which is minimized among all sets in some class (e.g. sets of size $k$, affine spaces of dimension $k$, ...). This formulation is general enough to encompass k-means and PCA, but also k-flats, non-negative matrix factorization, and sparse coding (see [21] and references therein).

Using the discussion of Sec. 3, we can establish a clear connection between unsupervised learning and the problem of learning probability measures with respect to $W_2$. Consider as a running example the k-means problem, though the argument is general. Given an input $X_n$, the k-means problem is to find a set $\hat S_k$ of size $|\hat S_k| = k$ minimizing its average squared distance from points in $X_n$. By associating to $\hat S_k$ the pushforward measure $\pi_{\hat S_k} \hat\rho_n$, we find that

$$\frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat S_k)^2 = E_{x \sim \hat\rho_n}\, d(x, \hat S_k)^2 \overset{\text{lemma 3.1}}{=} W_2(\hat\rho_n, \pi_{\hat S_k} \hat\rho_n)^2. \quad (2)$$

Since k-means minimizes equation 2, it also finds the measure that is closest to $\hat\rho_n$, among those with support of size $k$.
This connection between k-means and $W_2$ measure approximation was, to the best of the authors’ knowledge, first suggested by Pollard [28] though, as mentioned earlier, the argument carries over to many other unsupervised learning algorithms.

Unsupervised measure learning algorithms. We briefly clarify the steps involved in using an existing unsupervised learning algorithm for probability measure learning. Let $U_k$ be a parametrized algorithm (e.g. k-means) that takes a sample $X_n$ and outputs a set $U_k(X_n)$. The measure learning algorithm $\mathcal{A}_k : \mathcal{M}^n \to \mathcal{P}_p(\mathcal{M})$ corresponding to $U_k$ is defined as follows:

1. $\mathcal{A}_k$ takes a sample $X_n$ and outputs the measure $\pi_{\hat S_k} \hat\rho_n$, supported on $\hat S_k = U_k(X_n)$;
2. since $\hat\rho_n$ is discrete, then so must $\pi_{\hat S_k} \hat\rho_n$ be, and thus $\mathcal{A}_k(X_n) = \frac{1}{n} \sum_{i=1}^{n} \delta_{\pi_{\hat S_k}(x_i)}$;
3. in practice, we can simply store an $n$-vector $(\pi_{\hat S_k}(x_1), \ldots, \pi_{\hat S_k}(x_n))$, from which $\mathcal{A}_k(X_n)$ can be reconstructed by placing atoms of mass $1/n$ at each point.

In the case that $U_k$ is the k-means algorithm, only $k$ points and $k$ masses need to be stored.

Note that any algorithm $\mathcal{A}'$ that attempts to output a measure $\mathcal{A}'(X_n)$ close to $\hat\rho_n$ can be cast in the above framework. Indeed, if $S'$ is the support of $\mathcal{A}'(X_n)$ then, by lemma 3.2, $\pi_{S'} \hat\rho_n$ is the measure closest to $\hat\rho_n$ with support in $S'$. This effectively reduces the problem of learning a measure to that of finding a set, and is akin to how the fact that every optimal quantizer is a nearest-neighbor quantizer (see [15], [12, p. 350], and [14, p. 37–38]) reduces the problem of finding an optimal quantizer to that of finding an optimal quantizing set.

Footnote 1: In a slight abuse of notation, we refer to the k-means algorithm here as an ideal algorithm that solves the k-means problem, even though in practice an approximation algorithm may be used.

Clearly, the minimum of equation 2 over sets of size $k$ (the output of k-means) is monotonically non-increasing with $k$.
In particular, since $\hat S_n = X_n$ and $\pi_{\hat S_n} \hat\rho_n = \hat\rho_n$, it is $E_{x \sim \hat\rho_n}\, d(x, \hat S_n)^2 = W_2(\hat\rho_n, \pi_{\hat S_n} \hat\rho_n)^2 = 0$. That is, we can always make the learned measure arbitrarily close to $\hat\rho_n$ by increasing $k$. However, as pointed out in Sec. 2, the problem of measure learning is concerned with minimizing the 2-Wasserstein distance $W_2(\rho, \pi_{\hat S_k} \hat\rho_n)$ to the data-generating measure. The actual performance of k-means is thus not necessarily guaranteed to behave in the same way as the empirical one, and the question of characterizing its behavior as a function of $k$ and $n$ naturally arises.

Finally, we note that, while it is $E_{x \sim \hat\rho_n}\, d(x, \hat S_k)^2 = W_2(\hat\rho_n, \pi_{\hat S_k} \hat\rho_n)^2$ (the empirical performances are the same in the optimal quantization and measure learning problem formulations), the actual performances satisfy

$$E_{x \sim \rho}\, d(x, \hat S_k)^2 \overset{\text{lemma 3.1}}{=} W_2(\rho, \pi_{\hat S_k} \rho)^2 \overset{\text{lemma 3.2}}{\le} W_2(\rho, \pi_{\hat S_k} \hat\rho_n)^2, \quad 1 \le k \le n.$$

Consequently, with the identification between sets $S$ and measures $\pi_S \hat\rho_n$, the measure learning problem is, in general, harder than the set-approximation problem (for example, if $\mathcal{M} = \mathbb{R}^d$ and $\rho$ is absolutely continuous over a set of non-null volume, it is not hard to show that the inequality is almost surely strict: $E_{x \sim \rho}\, d(x, \hat S_k)^2 < W_2(\rho, \pi_{\hat S_k} \hat\rho_n)^2$ for $1 < k < n$).

In the remainder, we characterize the performance of k-means on the measure learning problem, for varying $k, n$. Although other unsupervised learning algorithms could have been chosen as the basis for our analysis, k-means is one of the oldest and most widely used, and the one for which the deep connection between optimal quantization and measure approximation is most clearly manifested. Note that, by setting $k = n$, our analysis includes the problem of characterizing the behavior of the distance $W_2(\rho, \hat\rho_n)$ between empirical and population measures which, as indicated in Sec. 2.1, is a fundamental question in statistics (i.e. the speed of convergence of empirical to population measures).
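The construction of this section (a set-valued algorithm turned into a measure by pushing the empirical measure through the nearest-point projection) can be sketched in a few lines. This is a minimal illustration under stated assumptions: plain Lloyd iterations stand in for the ideal k-means solver of footnote 1, the data are synthetic, and the function names `project`, `lloyd`, and `induced_measure` are hypothetical.

```python
import numpy as np


def project(X, S):
    """Nearest-point projection onto the finite set S.

    Returns, for each row of X, the index of its closest point in S
    (ties go to the lowest index) and the squared distance to it.
    """
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)


def lloyd(X, S0, iters=50):
    """Plain Lloyd iterations; a heuristic stand-in for exact k-means."""
    S = S0.astype(float).copy()
    for _ in range(iters):
        idx, _ = project(X, S)
        for j in range(len(S)):
            if np.any(idx == j):          # leave empty cells untouched
                S[j] = X[idx == j].mean(axis=0)
    return S


def induced_measure(X, S):
    """Pushforward of the empirical measure through the projection onto S.

    Returns the k masses placed on the points of S and the empirical
    cost n^{-1} sum_i d(x_i, S)^2 appearing in equation 2.
    """
    idx, d2 = project(X, S)
    masses = np.bincount(idx, minlength=len(S)) / len(X)
    return masses, float(d2.mean())


# Synthetic sample: two well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])

S2 = lloyd(X, X[[0, 20]])                 # k = 2, one seed per cluster
masses, cost_k2 = induced_measure(X, S2)
_, cost_kn = induced_measure(X, X)        # k = n: every sample is a center
```

Only the $k$ points in `S2` and the $k$ entries of `masses` need to be stored, and taking the sample itself as the set drives the empirical cost to zero, in line with the remark that the empirical performance alone does not determine the choice of $k$.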
5 Learning rates

In order to analyze the performance of k-means as a measure learning algorithm, and the convergence of empirical to population measures, we propose the decomposition shown in fig. 2. The diagram includes all the measures considered in the paper, and shows the two decompositions used to prove upper bounds. The upper arrow (green) illustrates the decomposition used to bound the distance $W_2(\rho, \hat\rho_n)$. This decomposition uses the measures $\pi_{S_k} \rho$ and $\pi_{S_k} \hat\rho_n$ as intermediates to arrive at $\hat\rho_n$, where $S_k$ is a $k$-point optimal quantizer of $\rho$, that is, a set $S_k$ minimizing $E_{x \sim \rho}\, d(x, S)^2$ over all sets of size $|S| = k$. The lower arrow (blue) corresponds to the decomposition of $W_2(\rho, \pi_{\hat S_k} \hat\rho_n)$ (the performance of k-means), whereas the labelled black arrows correspond to individual terms in the bounds. We begin with the (slightly) simpler of the two results.

5.1 Convergence rates for the empirical to population measures

Let $S_k$ be the optimal $k$-point quantizer of $\rho$ of order two [14, p. 31]. By the triangle inequality and the inequality $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$, it follows that

$$W_2(\rho, \hat\rho_n)^2 \le 3 \left[ W_2(\rho, \pi_{S_k} \rho)^2 + W_2(\pi_{S_k} \rho, \pi_{S_k} \hat\rho_n)^2 + W_2(\pi_{S_k} \hat\rho_n, \hat\rho_n)^2 \right]. \quad (3)$$

This is the decomposition depicted in the upper arrow of fig. 2.

By lemma 3.1, the first term in the sum of equation 3 is the optimal $k$-point quantization error of $\rho$ over a $d$-manifold $\mathcal{M}$ which, using recent techniques from [16] (see also [17, p. 491]), is shown in the proof of theorem 5.1 (part a) to be of order $\Theta(k^{-2/d})$. The remaining terms, b) and c), are slightly more technical and are bounded in the proof of theorem 5.1.

Since equation 3 holds for all $1 \le k \le n$, the best bound on $W_2(\rho, \hat\rho_n)$ can be obtained by optimizing the right-hand side over all possible values of $k$, resulting in the following probabilistic bound for the rate of convergence of the empirical to population measures.
Theorem 5.1. Given $\rho \in \mathcal{P}_p(\mathcal{M})$ with absolutely continuous part $\rho_A \neq 0$, sufficiently large $n$, and $\tau > 0$, it holds

$$W_2(\rho, \hat\rho_n) \le C \cdot m(\rho_A) \cdot n^{-1/(2d+4)} \cdot \tau, \quad \text{with probability } 1 - e^{-\tau^2},$$

where $m(\rho_A) := \int_{\mathcal{M}} \rho_A(x)^{d/(d+2)}\, d\lambda_{\mathcal{M}}(x)$, and $C$ depends only on $d$.

Figure 1: A sample $\{x_1, x_2, x_3, x_4\}$ is drawn from a distribution $\rho$ with support in $\mathrm{supp}\,\rho$. The projection map $\pi_{\{x_1, x_2, x_3, x_4\}}$ sends points $x$ to their closest one in the sample. The induced Voronoi tiling is shown in shades of blue.

Figure 2: The measures considered in this paper are linked by arrows for which upper bounds for their distance are derived. Bounds for the quantities of interest $W_2(\rho, \hat\rho_n)^2$ and $W_2(\rho, \pi_{\hat S_k} \hat\rho_n)^2$ are decomposed by following the top and bottom colored arrows.

5.2 Learning rates of k-means

The key element in the proof of theorem 5.1 is that the distance between population and empirical measures can be bounded by choosing an intermediate optimal quantizing measure of an appropriate size $k$. In the analysis, the best bounds are obtained for $k$ smaller than $n$. If the output of k-means is close to an optimal quantizer (for instance if sufficient data is available), then we would similarly expect that the best bounds for k-means correspond to a choice of $k < n$.

The decomposition of the bottom (blue) arrow in figure 2 leads to the following bound in probability.

Theorem 5.2. Given $\rho \in \mathcal{P}_p(\mathcal{M})$ with absolutely continuous part $\rho_A \neq 0$, and $\tau > 0$, then for all sufficiently large $n$, and letting $k = C \cdot m(\rho_A) \cdot n^{d/(2d+4)}$, it holds

$$W_2(\rho, \pi_{\hat S_k} \hat\rho_n) \le C \cdot m(\rho_A) \cdot n^{-1/(2d+4)} \cdot \tau, \quad \text{with probability } 1 - e^{-\tau^2},$$

where $m(\rho_A) := \int_{\mathcal{M}} \rho_A(x)^{d/(d+2)}\, d\lambda_{\mathcal{M}}(x)$, and $C$ depends only on $d$.

Note that the upper bounds in theorems 5.1 and 5.2 are exactly the same. Although this may appear surprising, it stems from the following fact.
Since $S = \hat S_k$ is a minimizer of $W_2(\pi_S \hat\rho_n, \hat\rho_n)^2$, the bound d) of figure 2 satisfies

$$W_2(\pi_{\hat S_k} \hat\rho_n, \hat\rho_n)^2 \le W_2(\pi_{S_k} \hat\rho_n, \hat\rho_n)^2,$$

and therefore (by the definition of c), the term d) is of the same order as c). It follows then that adding term d) to the bound only affects the constants, but otherwise leaves it unchanged. Since d) is the term that takes the output measure of k-means to the empirical measure, this implies that the rate of convergence of k-means (for suitably chosen $k$) cannot be worse than that of $\hat\rho_n \to \rho$. Conversely, bounds for $\hat\rho_n \to \rho$ are obtained from best rates of convergence of optimal quantizers, whose convergence to $\rho$ cannot be slower than that of k-means (since the quantizers that k-means produces are suboptimal).

Since the bounds obtained for the convergence of $\hat\rho_n \to \rho$ are the same as those for k-means with $k$ of order $k = \Theta(n^{d/(2d+4)})$, this suggests that estimates of $\rho$ that are as accurate as those derived from an $n$ point-mass measure $\hat\rho_n$ can be derived from $k$ point-mass measures with $k \ll n$.

Finally, we note that the introduced bounds are currently limited by the statistical bound

$$\sup_{|S| = k} \left| W_2(\pi_S \hat\rho_n, \hat\rho_n)^2 - W_2(\pi_S \rho, \rho)^2 \right| \overset{\text{lemma 3.1}}{=} \sup_{|S| = k} \left| E_{x \sim \hat\rho_n}\, d(x, S)^2 - E_{x \sim \rho}\, d(x, S)^2 \right| \quad (4)$$

(see for instance [21]), for which non-matching lower bounds are known. This means that, if better upper bounds can be obtained for equation 4, then both bounds in theorems 5.1 and 5.2 would automatically improve (would become closer to the lower bound).

References

[1] M. Ajtai, J. Komlós, and G. Tusnády. On optimal matchings. Combinatorica, 4:259–264, 1984.
[2] Franck Barthe and Charles Bordenave. Combinatorial optimization over two random point sets. Technical Report arXiv:1103.2734, Mar 2011.
[3] Gordon Blower. The Gaussian isoperimetric inequality and transportation. Positivity, 7:203–224, 2003.
[4] S. G. Bobkov and F. Götze. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities.
Journal of Functional Analysis, 163(1):1–28, April 1999.
[5] Emmanuel Boissard. Simple bounds for the convergence of empirical and occupation measures in 1-Wasserstein distance. Electron. J. Probab., 16(83):2296–2333, 2011.
[6] F. Bolley, A. Guillin, and C. Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3):541–593, 2007.
[7] F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculte des Sciences de Toulouse, 14(3):331–352, 2005.
[8] Claire Caillerie, Frédéric Chazal, Jérôme Dedecker, and Bertrand Michel. Deconvolution for the Wasserstein metric and geometric inference. Rapport de recherche RR-7678, INRIA, July 2011.
[9] Kenneth L. Clarkson. Building triangulations using ε-nets. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC ’06, pages 326–335, New York, NY, USA, 2006. ACM.
[10] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001.
[11] V. Dobrić and J. Yukich. Asymptotics for transportation cost in high dimensions. Journal of Theoretical Probability, 8:97–118, 1995.
[12] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1992.
[13] Alison L. Gibbs and Francis E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002.
[14] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2000.
[15] Siegfried Graf, Harald Luschgy, and Gilles Pagès. Distortion mismatch in the quantization of probability measures. ESAIM: Probability and Statistics, 12:127–153, 2008.
[16] Peter M. Gruber. Optimum quantization and its applications. Adv. Math, 186:2004, 2002.
[17] P.M. Gruber. Convex and discrete geometry. Grundlehren der mathematischen Wissenschaften. Springer, 2007.
[18] Guillermo Henry and Daniela Rodriguez. Kernel density estimation on Riemannian manifolds: Asymptotic results. J. Math. Imaging Vis., 34(3):235–239, July 2009.
[19] Joseph Horowitz and Rajeeva L. Karandikar. Mean rates of convergence of empirical measures in the Wasserstein metric. J. Comput. Appl. Math., 55(3):261–273, November 1994.
[20] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001.
[21] A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, Nov. 2010.
[22] Yann Ollivier. Ricci curvature of Markov chains on metric spaces. J. Funct. Anal., 256(3):810–864, 2009.
[23] Arkadas Ozakin and Alexander Gray. Submanifold density estimation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1375–1382. 2009.
[24] C. Papadimitriou. The probabilistic analysis of matching heuristics. In Proc. of the 15th Allerton Conf. on Communication, Control and Computing, pages 368–378, 1978.
[25] Bruno Pelletier. Kernel density estimation on Riemannian manifolds. Statist. Probab. Lett., 73(3):297–304, 2005.
[26] Xavier Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. J. Math. Imaging Vis., 25(1):127–154, July 2006.
[27] M. S. Pinsker. Information and information stability of random variables and processes. San Francisco: Holden-Day, 1964.
[28] David Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–204, 1982.
[29] S.T. Rachev. Probability metrics and the stability of stochastic models. Wiley series in probability and mathematical statistics: Applied probability and statistics. Wiley, 1991.
[30] J.M. Steele. Probability Theory and Combinatorial Optimization. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1997.
[31] M. Talagrand. Transportation cost for Gaussian and other product measures. Geometric And Functional Analysis, 6:587–600, 1996.
[32] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.
[33] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, 1996.
[34] V. S. Varadarajan. On the convergence of sample probability distributions. Sankhyā: The Indian Journal of Statistics, 19(1/2):23–26, Feb. 1958.
[35] C. Villani. Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften. Springer, 2009.
[36] P. Vincent and Y. Bengio. Manifold Parzen Windows. In Advances in Neural Information Processing Systems 22, pages 849–856. 2003.

3 0.14357777 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

Author: Ben Calderhead, Mátyás A. Sustik

Abstract: One of the enduring challenges in Markov chain Monte Carlo methodology is the development of proposal mechanisms to make moves distant from the current point, that are accepted with high probability and at low computational cost. The recent introduction of locally adaptive MCMC methods based on the natural underlying Riemannian geometry of such models goes some way to alleviating these problems for certain classes of models for which the metric tensor is analytically tractable; however, computational efficiency is not assured due to the necessity of potentially high-dimensional matrix operations at each iteration. In this paper we firstly investigate a sampling-based approach for approximating the metric tensor and suggest a valid MCMC algorithm that extends the applicability of Riemannian Manifold MCMC methods to statistical models that do not admit an analytically computable metric tensor. Secondly, we show how the approximation scheme we consider naturally motivates the use of ℓ1 regularisation to improve estimates and obtain a sparse approximate inverse of the metric, which enables stable and sparse approximations of the local geometry to be made. We demonstrate the application of this algorithm for inferring the parameters of a realistic system of ordinary differential equations using a biologically motivated robust Student-t error model, for which the Expected Fisher Information is analytically intractable.

4 0.1209268 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classifiers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the first practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data.

1 Learning and Computing Distances

Statistics relies on measuring distances. When the Euclidean metric is insufficient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the simplest scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classification tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach.
The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Specifically, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed.

In this paper, we consider the scenario where the metric tensor at a given point in feature space is defined as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e.

$$\mathrm{dist}(x, y) = 0 \Leftrightarrow x = y, \quad \mathrm{dist}(x, y) = \mathrm{dist}(y, x) \text{ (symmetry)}, \quad \mathrm{dist}(x, z) \le \mathrm{dist}(x, y) + \mathrm{dist}(y, z) \text{ (triangle inequality)}. \quad (1)$$

To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold.

Figure 1: Illustration of Principal Geodesic Analysis. Panels: (a) Local Metrics & Geodesics, (b) Tangent Space Representation, (c) First Principal Geodesic. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the first principal component is computed. (c) The principal component is mapped back to the feature space.
This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold.

Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our final model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results.

2 Background and Related Work

Single-metric learning learns a metric tensor, $M$, such that distances are measured as

$$\mathrm{dist}^2(x_i, x_j) = \|x_i - x_j\|_M^2 \equiv (x_i - x_j)^T M (x_i - x_j), \quad (2)$$

where $M$ is a symmetric and positive definite $D \times D$ matrix.

Classic approaches for finding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is $M = S_W^{-1} S_B S_W^{-1}$, with $S_W$ and $S_B$ being the within class scatter and the between class scatter respectively [9].

A more recent approach tries to learn a metric tensor from triplets of data points $(x_i, x_j, x_k)$, where the metric should obey the constraint that $\mathrm{dist}(x_i, x_j) < \mathrm{dist}(x_i, x_k)$. Here the constraints are often chosen such that $x_i$ and $x_j$ belong to the same class, while $x_i$ and $x_k$ do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-definite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classifier [5], which finds a linear transformation that satisfies local distance constraints, making the approach suitable for multi-modal classes.
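As a small illustration of equation (2), the sketch below evaluates the squared distance for a fixed metric tensor and builds the inverse-covariance tensor mentioned for PCA. The function names are hypothetical and the inputs are toy values chosen for this example; no learning of $M$ is attempted.

```python
import numpy as np


def sq_dist(xi, xj, M):
    """Squared distance (x_i - x_j)^T M (x_i - x_j) for a fixed tensor M."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(diff @ M @ diff)


def inverse_covariance_metric(X):
    """Metric tensor given by the inverse covariance of the rows of X."""
    return np.linalg.inv(np.cov(X, rowvar=False))


a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
d_euclid = sq_dist(a, b, np.eye(2))          # identity tensor: Euclidean case
d_scaled = sq_dist(a, b, 2.0 * np.eye(2))    # M = 2I doubles squared distances
```

With the identity tensor the expression reduces to the squared Euclidean distance, which makes a convenient sanity check before plugging in a learned $M$.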
For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classifier. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] find a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small.

Once a set of metric tensors $\{M_1, \ldots, M_R\}$ has been learned, the distance $\mathrm{dist}(a, b)$ is measured according to (2) where “the nearest” metric tensor is used, i.e.

$$M(x) = \sum_{r=1}^{R} \frac{\tilde w_r(x)}{\sum_j \tilde w_j(x)} M_r, \quad \text{where } \tilde w_r(x) = \begin{cases} 1 & \|x - x_r\|_{M_r}^2 \le \|x - x_j\|_{M_j}^2, \ \forall j \\ 0 & \text{otherwise,} \end{cases} \quad (3)$$

where $x$ is either $a$ or $b$ depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. To derive this equation, it is necessary to assume that 1) geodesics form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c.

Figure 2: Panels (a)–(c) show, respectively, the assumed geodesics, the actual geodesics, and the Riemannian geodesics, together with the location of the metric tensors and the test points. (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short ($M_1$), and dark grey corresponds to regions where they are long ($M_2$). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same as used in (a) and (b). (d) An illustration of the exponential and logarithmic maps.

Assume we are given two metric tensors $M_1 = 2I$ and $M_2 = I$ positioned at $x_1 = (2, 2)^T$ and $x_2 = (4, 4)^T$ respectively. This gives rise to two regions in feature space in which $x_1$ is nearest in the first and $x_2$ is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same figure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines go through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function.

Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor $M_1$ as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines.

In practice, good results have been reported using (3) [1, 3, 5], so it seems obvious to ask: is metricity required?
For kNN classifiers metricity does not appear to be required, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential.

In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This imposes a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work.

2.1 Geodesics and Riemannian Geometry

We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]:

Definition. A Riemannian metric $M$ on a manifold $\mathcal{M}$ is a smoothly varying inner product $\langle a, b \rangle_x = a^T M(x) b$ in the tangent space $T_x\mathcal{M}$ of each point $x \in \mathcal{M}$.

Often Riemannian manifolds are represented by a chart, i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric.

On a Riemannian manifold $\mathcal{M}$, the length of a smooth curve $c : [0, 1] \to \mathcal{M}$ is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve:

$$\mathrm{Length}(c) = \int_0^1 \|c'(\lambda)\|_{M(c(\lambda))} \, d\lambda = \int_0^1 \sqrt{c'(\lambda)^T M(c(\lambda)) \, c'(\lambda)} \, d\lambda, \qquad (4)$$

where $c'$ denotes the derivative of $c$ and $M(c(\lambda))$ is the metric tensor at $c(\lambda)$. A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e.

$$c_{\mathrm{geo}} = \arg\min_{c} \mathrm{Length}(c) \quad \text{with} \quad c(0) = x \ \text{and} \ c(1) = y. \qquad (5)$$

The distance between x and y is defined as the length of the geodesic. Given a tangent vector $v \in T_x\mathcal{M}$, there exists a unique geodesic $c_v(t)$ with initial velocity v at x. The Riemannian exponential map, $\mathrm{Exp}_x$, maps v to a point on the manifold along the geodesic $c_v$ at t = 1. This mapping preserves distances such that $\mathrm{dist}(c_v(0), c_v(1)) = \|v\|$. The inverse of the exponential map is the Riemannian logarithmic map, denoted $\mathrm{Log}_x$. Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied, and the results are mapped back to the manifold.

3 A Metric Feature Space

With the preliminaries settled we define the new model. Let $C = \mathbb{R}^D$ denote the feature space. We endow C with a metric tensor in every point x, which we define akin to (3),

$$M(x) = \sum_{r=1}^{R} w_r(x) M_r, \quad \text{where} \quad w_r(x) = \frac{\tilde{w}_r(x)}{\sum_{j=1}^{R} \tilde{w}_j(x)}, \qquad (6)$$

with $w_r > 0$. The only difference from (3) is that we shall not restrict ourselves to binary weight functions $\tilde{w}_r$. We assume the metric tensors $M_r$ have already been learned; Sec. 4 contains examples where they have been learned using LMNN [5] and LDA [9].

From the definition of a Riemannian metric, we trivially have the following result:

Lemma 1. The space $C = \mathbb{R}^D$ endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights $w_r(x)$ change smoothly with x.

Hence, by only considering smooth weight functions $\tilde{w}_r$ we get a well-studied geometric structure on the feature space, which ensures that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c.
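The weighted tensor in (6) and the length integral in (4) can be sketched numerically for the same two-tensor example. This is an illustration under stated assumptions, not the authors' implementation: the weights take the squared-exponential form given later in Theorem 3, the smoothing parameter ρ = 2 is an arbitrary choice, the integral is approximated with a midpoint rule on a polyline, and the names `smooth_metric`, `polyline` and `curve_length` are hypothetical.

```python
import math

ANCHORS = [(2.0, 2.0), (4.0, 4.0)]        # x1, x2 from the example
METRICS = [[[2.0, 0.0], [0.0, 2.0]],      # M1 = 2I
           [[1.0, 0.0], [0.0, 1.0]]]      # M2 = I
RHO = 2.0                                 # assumption: arbitrary smoothing parameter


def quad_form(d, M):
    """Return d^T M d for a 2-vector d and a 2x2 matrix M."""
    return sum(d[i] * M[i][j] * d[j] for i in range(2) for j in range(2))


def smooth_metric(x):
    """Metric tensor of (6) with squared-exponential weights (assumed form)."""
    w = [math.exp(-0.5 * RHO * quad_form([x[0] - a[0], x[1] - a[1]], M))
         for a, M in zip(ANCHORS, METRICS)]
    total = sum(w)
    return [[sum(wr * M[i][j] for wr, M in zip(w, METRICS)) / total
             for j in range(2)] for i in range(2)]


def polyline(waypoints, n=100):
    """Subdivide each leg between consecutive waypoints into n straight segments."""
    pts = [tuple(waypoints[0])]
    for p, q in zip(waypoints, waypoints[1:]):
        pts += [(p[0] + (q[0] - p[0]) * k / n, p[1] + (q[1] - p[1]) * k / n)
                for k in range(1, n + 1)]
    return pts


def curve_length(points):
    """Midpoint-rule approximation of the length integral (4)."""
    total = 0.0
    for p, q in zip(points, points[1:]):
        mid = [(p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0]
        step = [q[0] - p[0], q[1] - p[1]]
        total += math.sqrt(quad_form(step, smooth_metric(mid)))
    return total


x, y = (-3.5, 3.5), (3.5, 3.5)
straight = curve_length(polyline([x, y]))
detour = curve_length(polyline([x, (0.0, 5.0), y]))
print(straight, detour)
```

Under these assumed parameters the detour via (0, 5) is shorter than the straight segment in the learned metric, although it is longer in Euclidean terms. This mirrors the situation sketched in Fig. 2b.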
As Fig. 2c shows, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics).

It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight-line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric.

Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and the Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap.

In the next section we will derive a system of ordinary differential equations (ODEs) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques.

3.1 Computing Geodesics, Maps and Statistics

At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e.

$$\frac{\partial L}{\partial c} = \frac{d}{d\lambda} \frac{\partial L}{\partial c'}, \quad \text{where} \quad L(\lambda, c, c') = c'(\lambda)^T M(c(\lambda)) \, c'(\lambda). \qquad (7)$$

As we have an explicit expression for the metric tensor we can compute (7) in closed form:

Theorem 2. Geodesic curves in C satisfy the following system of 2nd order ODEs

$$M(c(\lambda)) \, c''(\lambda) = -\frac{1}{2} \left[ \frac{\partial \mathrm{vec}[M(c(\lambda))]}{\partial c(\lambda)} \right]^T \left( c'(\lambda) \otimes c'(\lambda) \right), \qquad (8)$$

where $\otimes$ denotes the Kronecker product and $\mathrm{vec}[\cdot]$ stacks the columns of a matrix into a vector [13].

Proof. See supplementary material.
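To make the role of a second-order geodesic system concrete, the following sketch integrates one from an initial point and velocity. It is illustrative only and is not the authors' Matlab implementation. It uses the textbook Euler-Lagrange form of the geodesic equation for the energy $c'^T M(c) c'$ rather than a transcription of (8), approximates the derivative of M by central finite differences, and takes the squared-exponential weights from Theorem 3 with an arbitrary ρ = 2. The names `smooth_metric`, `acceleration` and `exp_map`, the step count and the classical RK4 integrator are assumptions of the example.

```python
import math

ANCHORS = [(2.0, 2.0), (4.0, 4.0)]        # x1, x2 from the example
METRICS = [[[2.0, 0.0], [0.0, 2.0]],      # M1 = 2I
           [[1.0, 0.0], [0.0, 1.0]]]      # M2 = I
RHO = 2.0                                 # assumption: arbitrary smoothing parameter


def quad_form(d, M):
    """Return d^T M d for a 2-vector d and a 2x2 matrix M."""
    return sum(d[i] * M[i][j] * d[j] for i in range(2) for j in range(2))


def smooth_metric(x):
    """Metric tensor of (6) with squared-exponential weights (assumed form)."""
    w = [math.exp(-0.5 * RHO * quad_form([x[0] - a[0], x[1] - a[1]], M))
         for a, M in zip(ANCHORS, METRICS)]
    total = sum(w)
    return [[sum(wr * M[i][j] for wr, M in zip(w, METRICS)) / total
             for j in range(2)] for i in range(2)]


def acceleration(c, v, metric_fn, h=1e-5):
    """Solve M c'' = 0.5 * g - A v with g_k = v^T (dM/dc_k) v, A = sum_l v_l dM/dc_l."""
    dM = []
    for k in range(2):  # central finite difference of M along coordinate k
        cp, cm = list(c), list(c)
        cp[k] += h
        cm[k] -= h
        Mp, Mm = metric_fn(cp), metric_fn(cm)
        dM.append([[(Mp[i][j] - Mm[i][j]) / (2 * h) for j in range(2)]
                   for i in range(2)])
    rhs = []
    for k in range(2):
        g_k = quad_form(v, dM[k])
        a_k = sum(v[l] * dM[l][k][j] * v[j] for l in range(2) for j in range(2))
        rhs.append(0.5 * g_k - a_k)
    M = metric_fn(c)
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * rhs[0] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / det]


def exp_map(x, v, metric_fn=smooth_metric, steps=400):
    """Integrate from c(0) = x, c'(0) = v to t = 1 with classical RK4."""
    def deriv(state):
        c, u = state[:2], state[2:]
        return u + acceleration(c, u, metric_fn)  # list concatenation: (c', c'')

    state = list(x) + list(v)
    h = 1.0 / steps
    for _ in range(steps):
        k1 = deriv(state)
        k2 = deriv([s + 0.5 * h * k for s, k in zip(state, k1)])
        k3 = deriv([s + 0.5 * h * k for s, k in zip(state, k2)])
        k4 = deriv([s + h * k for s, k in zip(state, k3)])
        state = [s + h / 6.0 * (ka + 2 * kb + 2 * kc + kd)
                 for s, ka, kb, kc, kd in zip(state, k1, k2, k3, k4)]
    return state[:2], state[2:]


x, v = (-5.0, 3.0), (4.0, 0.0)
end, vel = exp_map(x, v)
speed0 = math.sqrt(quad_form(v, smooth_metric(x)))
speed1 = math.sqrt(quad_form(vel, smooth_metric(end)))
print(end, speed0, speed1)  # the two speeds should agree closely
```

A useful sanity check on any such integrator is that the speed measured in the local metric stays constant along the curve, and that a constant metric gives back the straight line x + v.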
Theorem 2 holds for any smooth weight functions $\tilde{w}_r$. We, however, still need to compute $\partial \mathrm{vec}[M] / \partial c$, which depends on the specific choice of $\tilde{w}_r$. Any smooth weighting scheme is applicable, but we restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result:

Theorem 3. For $\tilde{w}_r(x) = \exp\left( -\frac{\rho}{2} \|x - x_r\|^2_{M_r} \right)$ the derivative of the metric tensor from (6) is

$$\frac{\partial \mathrm{vec}[M(c)]}{\partial c} = \frac{\rho}{\left( \sum_{j=1}^{R} \tilde{w}_j \right)^2} \sum_{r=1}^{R} \tilde{w}_r \, \mathrm{vec}[M_r] \sum_{j=1}^{R} \tilde{w}_j \left( (c - x_j)^T M_j - (c - x_r)^T M_r \right). \qquad (9)$$

Proof. See supplementary material.

Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints

$$c(0) = x \quad \text{and} \quad c(1) = y. \qquad (10)$$

This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate $C^1$-continuous solution [14].

Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional, the approach remains tractable in high-dimensional feature spaces.

Computing Logarithmic Maps. Once a geodesic c is found, it follows from the definition of the logarithmic map, $\mathrm{Log}_x(y)$, that it can be computed as

$$v = \mathrm{Log}_x(y) = \frac{c'(0)}{\|c'(0)\|} \, \mathrm{Length}(c). \qquad (11)$$

In practice, we solve (8) by rewriting it as a system of first order ODEs, such that we compute both $c$ and $c'$ simultaneously (see supplementary material for details).

Computing Exponential Maps. Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, $\mathrm{Exp}_x(v)$, finds the unique geodesic starting at x with initial velocity v.
As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODEs with the initial conditions

$$c(0) = x \quad \text{and} \quad c'(0) = v. \qquad (12)$$

This initial value problem has a unique solution, which we find numerically using a standard Runge-Kutta scheme [15].

3.1.1 Generalizing PCA and Regression

At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent space and mapping the results back.

The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material.

The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16],

$$\gamma_{v_d}(t) = \mathrm{Exp}_\mu(t v_d), \qquad (13)$$

where $v_d$ denotes the d-th eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve $\gamma_{v_d}$ is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material.

Linear regression has been generalized in a similar way [17] by performing regression in the tangent space at the mean and mapping the resulting line back to the manifold using the exponential map.
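The idea of discretizing the curve rather than the feature space, and of reading the logarithmic map off the resulting geodesic via (11), can be sketched as follows. This is a simplified stand-in and not the Lobatto IIIa collocation solver used in the paper: it runs plain gradient descent on the discrete energy of a polyline with fixed end points. The squared-exponential weights with ρ = 2, the number of segments, the step size, the use of the $M(x)$-norm in (11), and the names `geodesic` and `log_map` are all assumptions of the example.

```python
import math

ANCHORS = [(2.0, 2.0), (4.0, 4.0)]        # x1, x2 from the example
METRICS = [[[2.0, 0.0], [0.0, 2.0]],      # M1 = 2I
           [[1.0, 0.0], [0.0, 1.0]]]      # M2 = I
RHO = 2.0                                 # assumption: arbitrary smoothing parameter


def quad_form(d, M):
    """Return d^T M d for a 2-vector d and a 2x2 matrix M."""
    return sum(d[i] * M[i][j] * d[j] for i in range(2) for j in range(2))


def smooth_metric(x):
    """Metric tensor of (6) with squared-exponential weights (assumed form)."""
    w = [math.exp(-0.5 * RHO * quad_form([x[0] - a[0], x[1] - a[1]], M))
         for a, M in zip(ANCHORS, METRICS)]
    total = sum(w)
    return [[sum(wr * M[i][j] for wr, M in zip(w, METRICS)) / total
             for j in range(2)] for i in range(2)]


def segment_energy(p, q):
    """Squared length of one segment, with the metric taken at its midpoint."""
    mid = [(p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0]
    return quad_form([q[0] - p[0], q[1] - p[1]], smooth_metric(mid))


def curve_length(pts):
    """Discrete counterpart of the length integral (4)."""
    return sum(math.sqrt(segment_energy(p, q)) for p, q in zip(pts, pts[1:]))


def geodesic(x, y, n=12, iters=600, step=0.04, h=1e-6):
    """Gradient descent on the discrete curve energy; end points stay fixed."""
    pts = [[x[0] + (y[0] - x[0]) * k / n, x[1] + (y[1] - x[1]) * k / n]
           for k in range(n + 1)]
    for _ in range(iters):
        grads = []
        for k in range(1, n):
            g = []
            for i in range(2):  # central difference of the two adjacent segments
                plus, minus = list(pts[k]), list(pts[k])
                plus[i] += h
                minus[i] -= h
                e_plus = segment_energy(pts[k - 1], plus) + segment_energy(plus, pts[k + 1])
                e_minus = segment_energy(pts[k - 1], minus) + segment_energy(minus, pts[k + 1])
                g.append((e_plus - e_minus) / (2 * h))
            grads.append(g)
        for k in range(1, n):
            pts[k] = [pts[k][i] - step * grads[k - 1][i] for i in range(2)]
    return pts


def log_map(x, y):
    """Approximate Log_x(y) as in (11): initial direction scaled to the curve length."""
    path = geodesic(x, y)
    d = [path[1][0] - path[0][0], path[1][1] - path[0][1]]  # direction of c'(0)
    norm = math.sqrt(quad_form(d, smooth_metric(x)))        # assumed M(x)-norm
    length = curve_length(path)
    return [length * d[0] / norm, length * d[1] / norm], path


x, y = (-3.5, 3.5), (3.5, 3.5)
v, path = log_map(x, y)
print(curve_length(path), v)
```

Only the one-dimensional curve is discretized, so the cost grows with the number of curve points and not with a grid over the feature space. By construction the returned tangent vector has $M(x)$-norm equal to the length of the discretized curve.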
The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation as the logarithmic map is only guaranteed to preserve distances to the origin of the tangent space and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable.

4 Experiments

To illustrate the framework (footnote 1), we consider an example in human body analysis, and then we analyze the scalability of the approach. But first, to build intuition, Fig. 3a shows synthetically generated data samples from two classes. We sample random points $x_r$ and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6), and Fig. 3b shows the data in the tangent space of the resulting manifold. As can be seen, the two classes are now globally further apart, which shows the effect of local metrics.

4.1 Human Body Shape

We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material.

To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques.

To learn the local metric we do the following.
First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn an LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space.

Footnote 1: Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning

Figure 3: Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality. [Axes: Average Prediction Error vs. Dimensionality; Running Time (sec.) vs. Dimensionality. Legend: Euclidean Model, Riemannian Model.]

Figure 4: Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared. [Rows: Shoulder breadth, Shoulder to wrist distance. Columns: Euclidean PCA, Tangent Space PCA (PGA), Regression Error (Prediction Error vs. Dimensionality). Legend: Euclidean Model, Riemannian Model.]

As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen, the learned metrics provide principal geodesics which are more strongly related to the measurements than those of the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation. To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model.

To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics.

To provide visual insights into the behavior of the learned metrics, we uniformly sample body shape along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder to wrist distance is related to overall body size, while shoulder breadth is related to body weight.
Figure 5: Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material. [Rows: Shoulder to wrist distance, Shoulder breadth.]

4.2 Scalability

The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3], where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which capture 95% of the variance in the original data).

We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec.

In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material.

5 Discussion

This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers.
The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks.

Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing. For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which essentially is a linear approach. Another approach was suggested by Hong et al. [23], who simply compute the principal components according to each metric separately, such that one low dimensional model is learned per metric.

The suggested approach is, however, not difficulty-free in its current implementation. Currently, we are using off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers.

In the experiments, we learned local metrics using techniques specialized for classification tasks as this is all the current literature provides. We expect improvements by learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem.

Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311).

References

[1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press.
[2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011.
[4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, pages 94–101. ACM, 2004.
[5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
[6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012.
[8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004.
[9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996.
[10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2:175–211, 2002.
[11] Manfredo Perdigao do Carmo. Riemannian Geometry. Birkhäuser Boston, January 1992.
[12] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[13] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007.
[14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001.
[15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980.
[16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal Geodesic Analysis for the study of Nonlinear Statistics of Shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004.
[17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987.
[18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999.
[19] Stefan Sommer, François Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010.
[20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999.
[21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005.
[22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In A. Fitzgibbon et al., editors, European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, October 2012.
[23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011.

5 0.099623948 142 nips-2012-Generalization Bounds for Domain Adaptation

Author: Chao Zhang, Lei Zhang, Jieping Ye

Abstract: In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results.

6 0.09338966 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

7 0.091881007 27 nips-2012-A quasi-Newton proximal splitting method

8 0.091182142 36 nips-2012-Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions

9 0.086395189 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

10 0.081362993 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

11 0.078516632 145 nips-2012-Gradient Weights help Nonparametric Regressors

12 0.077235535 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

13 0.075766653 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

14 0.075160436 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

15 0.072063364 254 nips-2012-On the Sample Complexity of Robust PCA

16 0.071132362 225 nips-2012-Multi-task Vector Field Learning

17 0.07107114 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

18 0.071065493 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

19 0.066878572 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

20 0.066165894 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, 0.036), (2, 0.064), (3, -0.076), (4, 0.068), (5, 0.066), (6, -0.003), (7, 0.066), (8, 0.071), (9, -0.029), (10, 0.065), (11, -0.093), (12, -0.024), (13, -0.111), (14, -0.116), (15, -0.051), (16, 0.003), (17, 0.04), (18, 0.105), (19, 0.049), (20, 0.023), (21, 0.032), (22, 0.025), (23, 0.023), (24, -0.001), (25, 0.143), (26, 0.019), (27, 0.207), (28, -0.088), (29, 0.057), (30, -0.014), (31, 0.116), (32, -0.109), (33, -0.011), (34, -0.089), (35, -0.016), (36, -0.016), (37, 0.005), (38, 0.044), (39, -0.065), (40, 0.089), (41, 0.078), (42, 0.046), (43, 0.034), (44, -0.079), (45, 0.064), (46, -0.019), (47, -0.107), (48, 0.083), (49, 0.186)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93159735 179 nips-2012-Learning Manifolds with K-Means and K-Flats

Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco

2 0.88608736 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

Author: Guillermo Canas, Lorenzo Rosasco

3 0.70479643 36 nips-2012-Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions

Author: Alexandra Carpentier, Rémi Munos

Abstract: We consider the problem of adaptive stratified sampling for Monte Carlo integration of a differentiable function given a finite number of evaluations of the function. We construct a sampling scheme that samples more often in regions where the function oscillates more, while allocating the samples such that they are well spread on the domain (this notion shares similarities with low discrepancy). We prove that the estimate returned by the algorithm is almost as accurate as the estimate that an optimal oracle strategy (that would know the variations of the function everywhere) would return, and provide a finite-sample analysis.

4 0.55616277 142 nips-2012-Generalization Bounds for Domain Adaptation

Author: Chao Zhang, Lei Zhang, Jieping Ye

5 0.55074763 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then, assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown.

6 0.52043378 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

7 0.47652957 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability

8 0.47628978 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

9 0.46588188 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

10 0.45821029 145 nips-2012-Gradient Weights help Nonparametric Regressors

11 0.45499453 9 nips-2012-A Geometric take on Metric Learning

12 0.4532254 338 nips-2012-The Perturbed Variation

13 0.45241016 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

14 0.42686501 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

15 0.41569021 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

16 0.41301516 34 nips-2012-Active Learning of Multi-Index Function Models

17 0.40565422 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

18 0.40097684 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

19 0.3969107 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

20 0.39587411 254 nips-2012-On the Sample Complexity of Robust PCA

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.036), (21, 0.024), (38, 0.185), (42, 0.047), (54, 0.027), (55, 0.035), (61, 0.019), (71, 0.162), (74, 0.066), (76, 0.143), (80, 0.106), (92, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93900549 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

Author: Takayuki Osogami

Abstract: We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function.

2 0.88846374 198 nips-2012-Learning with Target Prior

Author: Zuoguan Wang, Siwei Lyu, Gerwin Schalk, Qiang Ji

Abstract: In the conventional approaches for supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method learning with target priors (LTP). Specifically, LTP learning seeks the parameter θ that maximizes the log likelihood of fθ(X) on an uncorresponded training set with regard to p(y). Compared to the conventional (semi)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes or reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video.

same-paper 3 0.88597685 179 nips-2012-Learning Manifolds with K-Means and K-Flats

Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco

4 0.87881374 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

Author: Emanuele Coviello, Gert R. Lanckriet, Antoni B. Chan

Abstract: In this paper, we derive a novel algorithm to cluster hidden Markov models (HMMs) according to their probability distributions. We propose a variational hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We illustrate the benefits of the proposed algorithm on hierarchical clustering of motion capture sequences as well as on automatic music tagging.

5 0.85107863 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

Author: Vasiliy Karasev, Alessandro Chiuso, Stefano Soatto

Abstract: We describe the tradeoff between the performance in a visual recognition problem and the control authority that the agent can exercise on the sensing process. We focus on the problem of “visual search” of an object in an otherwise known and static scene, propose a measure of control authority, and relate it to the expected risk and its proxy (conditional entropy of the posterior density). We show this analytically, as well as empirically by simulation using the simplest known model that captures the phenomenology of image formation, including scaling and occlusions. We show that a “passive” agent given a training set can provide no guarantees on performance beyond what is afforded by the priors, and that an “omnipotent” agent, capable of infinite control authority, can achieve arbitrarily good performance (asymptotically). In between these limiting cases, the tradeoff can be characterized empirically.

6 0.84608072 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

7 0.84568751 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

8 0.84497797 227 nips-2012-Multiclass Learning with Simplex Coding

9 0.84495831 348 nips-2012-Tractable Objectives for Robust Policy Optimization

10 0.84420645 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

11 0.84378713 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

12 0.84269649 292 nips-2012-Regularized Off-Policy TD-Learning

13 0.84145647 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

14 0.84138173 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

15 0.84104759 65 nips-2012-Cardinality Restricted Boltzmann Machines

16 0.84103471 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

17 0.84054577 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.83993244 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

19 0.83992368 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

20 0.83972263 358 nips-2012-Value Pursuit Iteration