nips nips2012 nips2012-25 knowledge-graph by maker-knowledge-mining

25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means


Source: pdf

Author: Suvrit Sra

Abstract: Symmetric positive definite (spd) matrices pervade numerous scientific disciplines, including machine learning and optimization. We consider the key task of measuring distances between two spd matrices; a task that is often nontrivial whenever the distance function must respect the non-Euclidean geometry of spd matrices. Typical non-Euclidean distance measures, such as the Riemannian metric δR(X, Y) = ‖log(Y^{−1/2} X Y^{−1/2})‖_F, are computationally demanding and also complicated to use. To allay some of these difficulties, we introduce a new metric on spd matrices, which not only respects non-Euclidean geometry but also offers faster computation than δR while being less complicated to use. We support our claims theoretically by listing a set of theorems that relate our metric to δR(X, Y), and experimentally by studying the nonconvex problem of computing matrix geometric means based on squared distances. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A new metric on the manifold of kernel matrices with application to matrix geometric means Suvrit Sra Max Planck Institute for Intelligent Systems 72076 Tübingen, Germany suvrit@tuebingen. [sent-1, score-0.626]

2 Abstract Symmetric positive definite (spd) matrices pervade numerous scientific disciplines, including machine learning and optimization. [sent-3, score-0.313]

3 We consider the key task of measuring distances between two spd matrices; a task that is often nontrivial whenever the distance function must respect the non-Euclidean geometry of spd matrices. [sent-4, score-0.903]

4 Typical non-Euclidean distance measures, such as the Riemannian metric δR(X, Y) = ‖log(Y^{−1/2} X Y^{−1/2})‖_F, are computationally demanding and also complicated to use. [sent-5, score-0.195]

5 To allay some of these difficulties, we introduce a new metric on spd matrices, which not only respects non-Euclidean geometry but also offers faster computation than δR while being less complicated to use. [sent-6, score-0.675]

6 We support our claims theoretically by listing a set of theorems that relate our metric to δR (X, Y ), and experimentally by studying the nonconvex problem of computing matrix geometric means based on squared distances. [sent-7, score-0.485]

7 1 Introduction Symmetric positive definite (spd) matrices1 are remarkably pervasive in a multitude of areas, especially machine learning and optimization. [sent-8, score-0.09]

8 Several applications in these areas require an answer to the fundamental question: how to measure a distance between two spd matrices? [sent-9, score-0.389]

9 This question arises, for instance, when optimizing over the set of spd matrices. [sent-10, score-0.389]

10 To judge convergence of an optimization procedure or in the design of algorithms we may need to compute distances between spd matrices [1–3]. [sent-11, score-0.6]

11 As a more concrete example, suppose we wish to retrieve from a large database of spd matrices the “closest” spd matrix to an input query. [sent-12, score-1.003]

12 Another familiar setting is that of computing statistical metrics for multivariate Gaussian distributions [6], or more recently, quantum statistics [7]. [sent-14, score-0.106]

13 Several other applications depend on being able to effectively measure distances between spd matrices–see e. [sent-15, score-0.426]

14 In many of these domains, viewing spd matrices as members of a Euclidean vector space is insufficient, and the non-Euclidean geometry conferred by a suitable metric is of great importance. [sent-18, score-0.814]

15 Indeed, the set of (strict) spd matrices forms a differentiable Riemannian manifold [11, 10] that is perhaps the most studied example of a manifold of nonpositive curvature [12; Ch. [sent-19, score-0.739]

16 These matrices also form a convex cone, and the set of spd matrices in fact serves as a canonical higher-rank symmetric space [13]. [sent-21, score-0.846]

17 The conic view is of great importance in convex optimization [14–16], symmetric spaces are important in algebra and analysis [13, 17], and in optimization [14, 18], while the manifold and other views are also widely important—see e. [sent-22, score-0.317]

18 1 The starting point for this paper is the manifold view. [sent-27, score-0.088]

19 Let Sn denote the set of n × n real symmetric matrices. [sent-31, score-0.109]

20 A matrix A ∈ Sn is called positive (we drop the word “definite” for brevity) if ⟨x, Ax⟩ > 0 for all x ≠ 0; this is also denoted as A > 0. (1) [sent-32, score-0.141]

21 We denote the set of n × n positive matrices by Pn . [sent-33, score-0.264]

22 If only the non-strict inequality ⟨x, Ax⟩ ≥ 0 holds (for all x ∈ Rn ) we say A is positive semidefinite; this is also denoted as A ≥ 0. [sent-34, score-0.146]

23 For two matrices A, B ∈ Sn , the operator inequality A ≥ B means that the difference A − B ≥ 0. [sent-35, score-0.29]

24 The Frobenius norm of a matrix X ∈ Rm×n is defined as ‖X‖_F = √(tr(X^T X)), while ‖X‖ denotes the standard operator norm. [sent-36, score-0.081]

25 Let λ(X) denote the vector of eigenvalues of X (in any order) and Eig(X) a diagonal matrix that has λ(X) as its diagonal. [sent-38, score-0.082]

26 The set Pn is a canonical higher-rank symmetric space that is actually an open set within Sn , and thereby a differentiable manifold of dimension n(n + 1)/2. [sent-42, score-0.197]

27 At the point A this metric is induced by the differential form ds^2 = ‖A^{−1/2} dA A^{−1/2}‖_F^2 = tr(A^{−1} dA A^{−1} dA). (2) [sent-45, score-0.163]

28 For A, B ∈ Pn , it is known that there is a unique geodesic joining them, given by [11; Thm. [sent-46, score-0.079]

29 6]: γ(t) := A ♯_t B := A^{1/2}(A^{−1/2} B A^{−1/2})^t A^{1/2}, 0 ≤ t ≤ 1, (3) and its midpoint γ(1/2) is the geometric mean of A and B. [sent-49, score-0.169]

30 The associated Riemannian metric is δR(A, B) := ‖log(A^{−1/2} B A^{−1/2})‖_F , for A, B > 0. [sent-50, score-0.163]
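
As an illustration of how δR is typically evaluated in practice, here is a minimal NumPy/SciPy sketch (not from the paper; the generalized-eigenvalue identity is a standard trick that avoids explicit matrix square roots):

```python
import numpy as np
from scipy.linalg import eigh

def delta_R(A, B):
    """Riemannian distance delta_R(A, B) = ||log(A^{-1/2} B A^{-1/2})||_F.

    Uses the fact that the eigenvalues of A^{-1/2} B A^{-1/2} coincide with the
    generalized eigenvalues of the pencil (B, A), so no matrix square root is needed.
    """
    w = eigh(B, A, eigvals_only=True)  # all positive when A, B are spd
    return np.sqrt(np.sum(np.log(w) ** 2))
```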

31 For an application that must repeatedly compute distances between numerous pairs of matrices this computational burden can be excessive [4]. [sent-53, score-0.26]

32 [4] introduced a symmetrized “log-det” based matrix divergence: J(A, B) = log det((A + B)/2) − (1/2) log det(AB), for A, B > 0. [sent-55, score-0.282]
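
For comparison with the eigenvalue-based δR above, a minimal sketch (again assumed NumPy code, not the authors' implementation) shows why √J is cheap to evaluate: three Cholesky factorizations suffice.

```python
import numpy as np

def logdet_spd(X):
    """log det of an spd matrix via its Cholesky factor."""
    L = np.linalg.cholesky(X)
    return 2.0 * np.sum(np.log(np.diag(L)))

def delta_d(A, B):
    """delta_d(A, B) = sqrt(J(A, B)), with J(A, B) = log det((A+B)/2) - (1/2) log det(AB)."""
    J = logdet_spd((A + B) / 2.0) - 0.5 * (logdet_spd(A) + logdet_spd(B))
    return np.sqrt(max(J, 0.0))  # guard against tiny negative values from round-off
```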

33 Independently, Chebbi and Moakher [2] also introduced a slightly generalized version of (5) and studied some of its properties, especially computation of “centroids” of positive matrices using their matrix divergence. [sent-58, score-0.315]

34 We resolve this uncertainty and prove that √J(A, B) is indeed a metric, albeit not one that embeds isometrically into a Hilbert space. [sent-61, score-0.197]

35 Due to space constraints, we only summarily mention several of the properties that this metric satisfies, primarily to help develop intuition that motivates √J as a good proxy for the Riemannian metric δR . [sent-62, score-0.387]

36 We apply these insights to study “matrix geometric means” of a set of positive matrices: a problem also studied in [4, 2]. [sent-63, score-0.21]

37 Both cited papers have some gaps in their claims, which we fill by proving that even though computing the geometric mean is a nonconvex problem, we can still compute it efficiently and optimally. [sent-64, score-0.238]

38 2 The δd metric. The main result of this paper is Theorem 1. [sent-65, score-0.163]

39 The first crucial result is that for positive scalars, δ d is indeed a metric. [sent-75, score-0.09]

40 To prove this, recall the notion of negative definite functions (Def. [sent-76, score-0.087]

41 A function ψ : X × X → R is said to be negative definite if for all x, y ∈ X it is symmetric (ψ(x, y) = ψ(y, x)) and satisfies the inequality ∑_{i,j=1}^n c_i c_j ψ(x_i, x_j) ≤ 0 (6) for all integers n ≥ 2 and all subsets {x_i}_{i=1}^n ⊆ X, {c_i}_{i=1}^n ⊆ R with ∑_{i=1}^n c_i = 0. [sent-83, score-0.319]

42 Define δs(x, y) := log[(x + y)/(2√(xy))] for scalars x, y > 0. [sent-92, score-0.08]

43 3, 20] we may equivalently show that e^{−βψ(x,y)} = ((x + y)/2)^{−β} is a positive definite function for all β > 0 and all x, y > 0. [sent-101, score-0.09]

44 To that end, it suffices to show that the matrix H = [h_ij] = [(x_i + x_j)^{−β}], 1 ≤ i, j ≤ n, is positive definite for every integer n ≥ 1 and all positive numbers {x_i}_{i=1}^n . [sent-102, score-0.284]

45 Now, observe that h_ij = 1/(x_i + x_j)^β = (1/Γ(β)) ∫_0^∞ e^{−t(x_i + x_j)} t^{β−1} dt, (9) where Γ(β) = ∫_0^∞ e^{−t} t^{β−1} dt is the well-known Gamma function. [sent-103, score-0.13]

46 Thus, with f_i(t) = e^{−t x_i} t^{(β−1)/2} ∈ L^2([0, ∞)), we see that [h_ij] equals, up to the positive factor 1/Γ(β), the Gram matrix [⟨f_i, f_j⟩], whereby H > 0. [sent-104, score-0.3]
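
This Gram-matrix argument is easy to probe numerically; the following sketch (hypothetical values, not from the paper) draws random positive scalars and inspects the smallest eigenvalue of H for one choice of β:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 8, 0.7                              # any beta > 0 should work
x = rng.uniform(0.1, 5.0, size=n)             # positive scalars x_1, ..., x_n
H = 1.0 / (x[:, None] + x[None, :]) ** beta   # H_ij = (x_i + x_j)^{-beta}
print(np.linalg.eigvalsh(H).min())            # smallest eigenvalue; positive up to round-off
```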

47 Then, δd(Eig↓(A), Eig↓(B)) ≤ δd(A, B) ≤ δd(Eig↓(A), Eig↑(B)). The final result that we need is a well-known fact from linear algebra (our own proof is in [19]). [sent-120, score-0.082]

48 (13) With all these theorems and lemmas in hand, we are now finally ready to prove Thm. [sent-127, score-0.085]

49 We must prove that δd is symmetric, nonnegative, definite, and that it satisfies the triangle inequality. [sent-131, score-0.1]

50 Nonnegativity and definiteness follow from the strict log-concavity (on Pn ) of the determinant, whereby det((X + Y)/2) ≥ det(X)^{1/2} det(Y)^{1/2} with equality iff X = Y, which in turn implies that δd(X, Y) ≥ 0 with equality iff X = Y. [sent-133, score-0.376]

51 The only hard part is to prove the triangle inequality, a result that has eluded previous attempts [4, 2]. [sent-134, score-0.1]

52 Since Z > 0 is arbitrary, and congruence preserves positive definiteness, we may write just Z instead of P ∗ ZP . [sent-137, score-0.09]

53 2), proving the triangle inequality reduces to showing that δd(I, D) ≤ δd(I, Z) + δd(D, Z). (14) [sent-139, score-0.144]

54 Consider now the diagonal matrices D↓ and Eig↓(Z). [sent-140, score-0.205]

55 Although the metric space (Pn , δ d ) has numerous fascinating properties, due to space concerns, we do not discuss it further. [sent-146, score-0.212]
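
Before moving on, a quick Monte Carlo sanity check of the triangle inequality is instructive; the sketch below reuses delta_d from the earlier sketch, and rand_spd is an assumed helper rather than anything from the paper:

```python
import numpy as np

def rand_spd(n, rng):
    """Random spd matrix: M M^T plus a small ridge for strict positivity."""
    M = rng.standard_normal((n, n))
    return M @ M.T + 1e-3 * np.eye(n)

rng = np.random.default_rng(1)
violations = 0
for _ in range(1000):
    X, Y, Z = (rand_spd(5, rng) for _ in range(3))
    if delta_d(X, Y) > delta_d(X, Z) + delta_d(Z, Y) + 1e-10:
        violations += 1
print(violations)  # expected: 0, consistent with the triangle inequality for delta_d
```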

56 2.1 Hilbert space embedding of δd. Theorem 1 shows that δd is a metric, and Theorem 5 shows that, for positive scalars, the metric space (R++ , δs ) embeds isometrically into a Hilbert space. [sent-153, score-0.567]

57 Theorem 4 says that such a kernel exists if and only if δd^2 is negative definite; equivalently, iff e^{−β δd^2(X,Y)} = det(XY)^{β/2} / det((X + Y)/2)^β (16) is a positive definite kernel for all β > 0. [sent-155, score-0.171]

58 To verify this, it suffices to check whether the matrix Hβ = [h_ij] := [1/det(X_i + X_j)^β], 1 ≤ i, j ≤ m, (17) is positive for every integer m ≥ 1 and arbitrary positive matrices X_1, . . ., X_m. [sent-156, score-0.405]

59 This implies that (Pd , δ d ) cannot embed isometrically into a Hilbert space. [sent-161, score-0.111]

60 Define the function f_i(x) := π^{−n/4} e^{−x^T X_i x} (for 1 ≤ i ≤ m). [sent-174, score-0.089]

61 Then, f_i ∈ L^2(R^n), where the inner product is given by the Gaussian integral ⟨f_i, f_j⟩ := (1/π^{n/2}) ∫_{R^n} e^{−x^T(X_i + X_j)x} dx = 1/det(X_i + X_j)^{1/2}. [sent-175, score-0.217]

62 Since the Schur (elementwise) product of two positive matrices is again positive, it follows that Hβ > 0 whenever β is an integer multiple of 1/2. [sent-177, score-0.264]
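
The same kind of numerical probe applies to Hβ; the sketch below (reusing rand_spd from the earlier sketch, with β = 3/2 chosen as an integer multiple of 1/2 in line with the Schur-product argument) builds Hβ for random spd matrices and inspects its smallest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, beta = 4, 10, 1.5   # beta = 3/2, an integer multiple of 1/2
Xs = [rand_spd(n, rng) for _ in range(m)]
H = np.array([[np.linalg.det(Xi + Xj) ** (-beta) for Xj in Xs] for Xi in Xs])
print(np.linalg.eigvalsh(H).min())  # positive (up to round-off) for such beta
```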

63 Define for each i the function f_i(A) := c e^{−tr(A X_i)} (c > 0 is a constant). [sent-181, score-0.089]

64 Then, f_i ∈ L^2(Pn ), which we equip with the inner product ⟨f_i, f_j⟩ := c^2 ∫_{Pn} e^{−tr(A(X_i + X_j))} det(A)^{β−(n+1)/2} dA = det(X_i + X_j)^{−β}, which exists whenever β > (n − 1)/2. [sent-182, score-0.217]

65 Consequently, Hβ is positive for all β defined by (18). [sent-183, score-0.09]

66 The “only if” part follows from deeper results in the rich theory of symmetric spaces. [sent-184, score-0.109]

67 2 Specifically, since Pn is a symmetric cone, and 1/ det(X) is a decreasing function on this cone, (i. [sent-185, score-0.109]

68 Let X be a set of positive matrices that commute with each other. [sent-194, score-0.264]

69 Then, (X , δ d ) can be isometrically embedded into some Hilbert space. [sent-195, score-0.111]

70 The proof follows because a commuting set of matrices can be simultaneously diagonalized, and for diagonal matrices δd^2(X, Y) = ∑_i δs(X_ii, Y_ii), which is a nonnegative sum of negative definite kernels and is therefore itself negative definite. [sent-197, score-0.287]

71 3 Connections between δd and δR. After showing that δd is a metric and studying its relation to kernel functions, let us now return to our original motivation: introducing δd as a reasonable alternative to the widely used Riemannian metric δR . [sent-198, score-0.326]

72 Due to lack of space, we present only a summary of our results in Table 1, and cite the corresponding theorems in the longer article [19] for proofs. [sent-202, score-0.07]

73 While the actual proofs are valuable and instructive, the key message worth noting is: both δR and δ d express the (negatively curved) non-Euclidean geometry of their respective metric spaces by displaying similar properties. [sent-203, score-0.351]

74 4 Application: computing geometric means. In this section we turn our attention to an object that perhaps connects δR and δd most intimately: the operator geometric mean (GM), which is given by the midpoint of the geodesic (3) and denoted A ♯ B := γ(1/2) = A^{1/2}(A^{−1/2} B A^{−1/2})^{1/2} A^{1/2}. (21) [sent-204, score-0.386]
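
A direct way to evaluate the GM in (21) is via symmetric square roots; the sketch below (assumed NumPy code, reusing rand_spd from an earlier sketch) also checks the Riccati characterization X A^{-1} X = B discussed later in the text:

```python
import numpy as np

def spd_sqrt(A):
    """Symmetric square root of an spd matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

def geometric_mean(A, B):
    """Matrix geometric mean A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    As = spd_sqrt(A)
    Ais = np.linalg.inv(As)
    G = As @ spd_sqrt(Ais @ B @ Ais) @ As
    return (G + G.T) / 2.0  # symmetrize away round-off

rng = np.random.default_rng(3)
A, B = rand_spd(5, rng), rand_spd(5, rng)
X = geometric_mean(A, B)
print(np.allclose(X @ np.linalg.solve(A, X), B))  # X solves the Riccati equation X A^{-1} X = B
```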

75 2 Specifically, the set (18) is identical to the Wallach set, which is important in the study of Hilbert spaces of holomorphic functions over symmetric domains [26; Ch. [sent-205, score-0.147]

76 The scalars t, s, u satisfy 0 < t ≤ 1, 1 ≤ s ≤ u < ∞. [sent-239, score-0.08]

77 (22) especially because it generalizes the matrix geometric mean to more than two matrices. [sent-241, score-0.171]

78 [4] only proved their solution to be a stationary point of φ(X); they did not prove either global or local optimality. [sent-256, score-0.115]

79 Although Chebbi and Moakher [2] showed that (24) has a unique solution, like [4] they too proved only stationarity, not global or local optimality. [sent-257, score-0.075]

80 We show that the unique positive solution to (24) is globally optimal; this result is particularly interesting because φ(X) is nonconvex. [sent-260, score-0.132]

81 (26) The latter equation is a Riccati equation that is known to have a unique, positive definite solution given by the matrix GM (21) (see [11; Prop 1. [sent-273, score-0.141]

82 So A ♯ B is a strict local minimum of (8), which is actually a global minimum because it is the unique positive solution to ∇φ(X) = 0. [sent-278, score-0.198]

83 The nonlinear equation (27) has a unique positive solution. [sent-291, score-0.132]
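
To make this concrete, here is one natural fixed-point scheme suggested by the stationarity condition ∑_i (X + A_i)^{−1} = (m/2) X^{−1} obtained by setting ∇φ(X) = 0; it is a sketch only, not necessarily the exact iteration analyzed in [2, 4] or in the longer article [19]:

```python
import numpy as np

def gm_delta_d(As, iters=500, tol=1e-12):
    """Fixed-point iteration for the stationary point of phi(X) = sum_i delta_d^2(X, A_i).

    Stationarity gives sum_i (X + A_i)^{-1} = (m/2) X^{-1}, suggesting the update
    X <- (m/2) [ sum_i (X + A_i)^{-1} ]^{-1}.  Sketch only; not the authors' exact algorithm.
    """
    m = len(As)
    X = sum(As) / m  # arithmetic mean as a starting point
    for _ in range(iters):
        S = sum(np.linalg.inv(X + A) for A in As)
        X_new = (m / 2.0) * np.linalg.inv(S)
        X_new = (X_new + X_new.T) / 2.0  # keep symmetry under round-off
        if np.linalg.norm(X_new - X, 'fro') <= tol * np.linalg.norm(X, 'fro'):
            return X_new
        X = X_new
    return X
```

For m = 2 the computed fixed point should coincide with the operator GM A ♯ B from (21), as the surrounding discussion of the Riccati equation indicates; this gives a convenient consistency check.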

84 Using the above results, we can finally prove the main theorem of this section. [sent-292, score-0.104]

85 The objective function φ(X) (24) has only one positive stationary point, which follows from Lemma 17. [sent-297, score-0.126]

86 We show that X is actually a local minimum; global optimality is immediate from uniqueness of X. [sent-299, score-0.145]

87 To show local optimality, we prove that the Hessian ∇^2 φ(X) > 0; positivity of the Hessian reduces to proving that (m/2) X^{−1} ⊗ X^{−1} − ∑_{i=1}^m (X + A_i)^{−1} ⊗ (X + A_i)^{−1} > 0. [sent-300, score-0.08]

88 It is worth noting that Theorem 18 establishes that solving (27) yields the global minimum of a nonconvex optimization problem. [sent-306, score-0.143]

89 This result is even more remarkable because unlike CAT(0)-metrics such as δR , the metric δ d is not geodesically convex. [sent-307, score-0.163]

90 1 indicate that δ d can be around 5 times faster than δR2 and up to 50 times faster than δR1 . [sent-317, score-0.07]
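
The quoted speedups come from the paper's experiments; as a rough, hardware- and BLAS-dependent illustration only (reusing delta_R, delta_d and rand_spd from the sketches above), the two distances can be timed like this:

```python
import time
import numpy as np

rng = np.random.default_rng(4)
pairs = [(rand_spd(100, rng), rand_spd(100, rng)) for _ in range(50)]

for name, dist in [("delta_R", delta_R), ("delta_d", delta_d)]:
    t0 = time.perf_counter()
    vals = [dist(A, B) for A, B in pairs]
    print(name, f"{time.perf_counter() - t0:.3f} s")
```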

91 5 Conclusions and future work. We presented a new metric on the manifold of positive definite matrices, and related it to the classical Riemannian metric on this manifold. [sent-321, score-0.504]

92 Empirically, our new metric was shown to lead to large computational gains, while theoretically, a series of theorems demonstrated how it expresses the negatively curved non-Euclidean geometry in a manner analogous to the Riemannian metric. [sent-322, score-0.364]

93 In the plot, δR1 refers to the implementation of δR in the matrix means toolbox [33], while δR2 is our own implementation. [sent-325, score-0.112]

94 Means of hermitian positive-definite matrices based on the log-determinant α-divergence function. [sent-338, score-0.214]

95 Riemannian metrics on positive definite matrices related to means. [sent-387, score-0.309]

96 On the riemannian geometry defined for self-concordant barriers and interior point methods. [sent-417, score-0.302]

97 Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100 of GTM. [sent-444, score-0.09]

98 Concavity of certain maps on positive definite matrices and applications to hadamard products. [sent-499, score-0.264]

99 A differential geometric approach to the geometric mean of symmetric positive-definite matrices. [sent-511, score-0.349]

100 Computing the Karcher mean of symmetric positive definite matrices. [sent-520, score-0.199]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('spd', 0.389), ('eig', 0.318), ('pn', 0.273), ('gm', 0.244), ('det', 0.231), ('riemannian', 0.214), ('cherian', 0.194), ('matrices', 0.174), ('metric', 0.163), ('chebbi', 0.139), ('ai', 0.126), ('geometric', 0.12), ('isometrically', 0.111), ('karcher', 0.111), ('moahker', 0.111), ('symmetric', 0.109), ('positive', 0.09), ('fi', 0.089), ('geometry', 0.088), ('manifold', 0.088), ('algebra', 0.082), ('scalars', 0.08), ('hilbert', 0.078), ('hij', 0.077), ('ax', 0.068), ('mx', 0.068), ('sn', 0.066), ('argminx', 0.064), ('nite', 0.062), ('niteness', 0.061), ('quantum', 0.061), ('theorem', 0.058), ('bx', 0.056), ('inequality', 0.056), ('bini', 0.056), ('daa', 0.056), ('finsler', 0.056), ('gmld', 0.056), ('triangle', 0.054), ('xj', 0.053), ('sra', 0.051), ('xy', 0.051), ('matrix', 0.051), ('numerous', 0.049), ('midpoint', 0.049), ('nonconvex', 0.048), ('prove', 0.046), ('lemma', 0.046), ('logdet', 0.045), ('semigroup', 0.045), ('suvrit', 0.045), ('uniqueness', 0.045), ('metrics', 0.045), ('ba', 0.043), ('gl', 0.042), ('unique', 0.042), ('da', 0.042), ('tr', 0.042), ('negative', 0.041), ('embeds', 0.04), ('hermitian', 0.04), ('wishart', 0.04), ('xi', 0.04), ('iff', 0.04), ('hessian', 0.04), ('theorems', 0.039), ('fj', 0.039), ('curved', 0.039), ('cone', 0.038), ('spaces', 0.038), ('geodesic', 0.037), ('distances', 0.037), ('optimality', 0.037), ('siam', 0.037), ('stationary', 0.036), ('gaps', 0.036), ('au', 0.036), ('negatively', 0.035), ('faster', 0.035), ('proving', 0.034), ('claims', 0.034), ('strict', 0.033), ('corollary', 0.033), ('global', 0.033), ('whereby', 0.032), ('demanding', 0.032), ('worth', 0.032), ('toolbox', 0.031), ('diagonal', 0.031), ('mention', 0.031), ('nesterov', 0.031), ('article', 0.031), ('xp', 0.031), ('means', 0.03), ('noting', 0.03), ('proxy', 0.03), ('operator', 0.03), ('immediate', 0.03), ('ci', 0.03), ('brevity', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

Author: Suvrit Sra

Abstract: Symmetric positive definite (spd) matrices pervade numerous scientific disciplines, including machine learning and optimization. We consider the key task of measuring distances between two spd matrices; a task that is often nontrivial whenever the distance function must respect the non-Euclidean geometry of spd matrices. Typical non-Euclidean distance measures, such as the Riemannian metric δR(X, Y) = ‖log(Y^{−1/2} X Y^{−1/2})‖_F, are computationally demanding and also complicated to use. To allay some of these difficulties, we introduce a new metric on spd matrices, which not only respects non-Euclidean geometry but also offers faster computation than δR while being less complicated to use. We support our claims theoretically by listing a set of theorems that relate our metric to δR(X, Y), and experimentally by studying the nonconvex problem of computing matrix geometric means based on squared distances. 1

2 0.21003848 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

Author: Ben Calderhead, Mátyás A. Sustik

Abstract: One of the enduring challenges in Markov chain Monte Carlo methodology is the development of proposal mechanisms to make moves distant from the current point, that are accepted with high probability and at low computational cost. The recent introduction of locally adaptive MCMC methods based on the natural underlying Riemannian geometry of such models goes some way to alleviating these problems for certain classes of models for which the metric tensor is analytically tractable, however computational efficiency is not assured due to the necessity of potentially high-dimensional matrix operations at each iteration. In this paper we firstly investigate a sampling-based approach for approximating the metric tensor and suggest a valid MCMC algorithm that extends the applicability of Riemannian Manifold MCMC methods to statistical models that do not admit an analytically computable metric tensor. Secondly, we show how the approximation scheme we consider naturally motivates the use of 1 regularisation to improve estimates and obtain a sparse approximate inverse of the metric, which enables stable and sparse approximations of the local geometry to be made. We demonstrate the application of this algorithm for inferring the parameters of a realistic system of ordinary differential equations using a biologically motivated robust Student-t error model, for which the Expected Fisher Information is analytically intractable. 1

3 0.20395622 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classifiers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the first practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufficient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classification tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Specifically, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is defined as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the first principal component is computed. 
(c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our final model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive definite D × D matrix. Classic approaches for finding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-definite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classifier [5], which finds a linear transformation that satisfies local distance constraints, making the approach suitable for multi-modal classes. For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classifier. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] find a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. 
To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the first and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same figure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? For kNN classifiers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This impose a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. 
However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work. 2.1 Geodesics and Riemannian Geometry We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]: Definition A Riemannian metric M on a manifold M is a smoothly varying inner product < a, b >x = aT M(x)b in the tangent space Tx M of each point x ∈ M . 3 Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold M, the length of a smooth curve c : [0, 1] → M is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve: 1 Length(c) = 1 c (λ) M(c(λ)) dλ c (λ)T M(c(λ))c (λ)dλ , = (4) 0 0 where c denotes the derivative of c and M(c(λ)) is the metric tensor at c(λ). A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e. (5) cgeo = arg min Length(c) with c(0) = x and c(1) = y . c The distance between x and y is defined as the length of the geodesic. Given a tangent vector v ∈ Tx M, there exists a unique geodesic cv (t) with initial velocity v at x. The Riemannian exponential map, Expx , maps v to a point on the manifold along the geodesic cv at t = 1. This mapping preserves distances such that dist(cv (0), cv (1)) = v . The inverse of the exponential map is the Riemannian logarithmic map denoted Logx . Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied and the results are mapped back to the manifold. 3 A Metric Feature Space With the preliminaries settled we define the new model. Let C = RD denote the feature space. We endow C with a metric tensor in every point x, which we define akin to (3), R M(x) = wr (x)Mr , where wr (x) = r=1 wr (x) ˜ R ˜ j=1 wj (x) , (6) with wr > 0. The only difference from (3) is that we shall not restrict ourselves to binary weight ˜ functions wr . We assume the metric tensors Mr have already been learned; Sec. 4 contain examples ˜ where they have been learned using LMNN [5] and LDA [9]. From the definition of a Riemannian metric, we trivially have the following result: Lemma 1 The space C = RD endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights wr (x) change smoothly with x. Hence, by only considering smooth weight functions wr we get a well-studied geometric structure ˜ on the feature space, which ensures us that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c. As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. 
They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric. Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODE’s) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques. 4 3.1 Computing Geodesics, Maps and Statistics At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e. ∂L d ∂L , where L(λ, c, c ) = c (λ)T M(c(λ))c (λ) . = ∂c dλ ∂c As we have an explicit expression for the metric tensor we can compute (7) in closed form: (7) Theorem 2 Geodesic curves in C satisfy the following system of 2nd order ODE’s M(c(λ))c (λ) = − 1 ∂vec [M(c(λ))] 2 ∂c(λ) T (c (λ) ⊗ c (λ)) , (8) where ⊗ denotes the Kronecker product and vec [·] stacks the columns of a matrix into a vector [13]. Proof See supplementary material. This result holds for any smooth weight functions wr . We, however, still need to compute ∂vec[M] , ˜ ∂c which depends on the specific choice of wr . Any smooth weighting scheme is applicable, but we ˜ restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result Theorem 3 For wr (x) = exp − ρ x − xr ˜ 2 ∂vec [M(c)] = ∂c the derivative of the metric tensor from (6) is R ρ R j=1 2 Mr R 2 wj ˜ T r=1 T wj (c − xj ) Mj − (c − xr ) Mr ˜ wr vec [Mr ] ˜ . (9) j=1 Proof See supplementary material. Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints c(0) = x and c(1) = y . (10) This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate C 1 –continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces. Computing Logarithmic Maps. Once a geodesic c is found, it follows from the definition of the logarithmic map, Logx (y), that it can be computed as v = Logx (y) = c (0) Length(c) . c (0) (11) In practice, we solve (8) by rewriting it as a system of first order ODE’s, such that we compute both c and c simultaneously (see supplementary material for details). Computing Exponential Maps. 
Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, Expx (v), finds the unique geodesic starting at x with initial velocity v. As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODE’s with the initial conditions c(0) = x and c (0) = v . (12) This initial value problem has a unique solution, which we find numerically using a standard RungeKutta scheme [15]. 5 3.1.1 Generalizing PCA and Regression At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent and mapping the results back. The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material. The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16], γvd (t) = Expµ (tvd ) , (13) th where vd denotes the d eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve γvd is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material. Linear regression has been generalized in a similar way [17] by performing regression in the tangent of the mean and mapping the resulting line back to the manifold using the exponential map. The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation as the logarithmic map is only guarantied to preserve distances to the origin of the tangent and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable. 4 Experiments To illustrate the framework1 we consider an example in human body analysis, and then we analyze the scalability of the approach. But first, to build intuition, Fig. 3a show synthetically generated data samples from two classes. We sample random points xr and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6) and Fig. 3b show the data in the tangent space of the resulting manifold. As can be seen the two classes are now globally further apart, which shows the effect of local metrics. 4.1 Human Body Shape We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. 
We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material. To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques. To learn the local metric we do the following. First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn a LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), 1 Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning 6 30 Euclidean Model Riemannian Model 24 20 18 16 20 15 10 5 14 12 0 (a) 25 22 Running Time (sec.) Average Prediction Error 26 10 (b) 20 Dimensionality 0 0 30 50 (c) 100 Dimensionality 150 (d) 4 3 3 2 2 1 1 0 −1 −2 −3 −4 −4 −3 −2 −1 0 1 2 3 4 Shoulder breadth 20 −2 −3 Euclidean Model Riemannian Model 0 −1 25 Prediction Error 4 15 10 0 −4 −5 0 4 10 15 20 Dimensionality 16 25 30 35 17 3 3 5 5 Euclidean Model Riemannian Model 2 15 2 1 1 Prediction Error Shoulder to wrist distance Figure 3: Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality. 0 0 −1 −2 −1 −3 14 13 12 11 −2 −4 −3 −4 −4 10 −5 −3 −2 −1 0 1 Euclidean PCA 2 3 −6 −4 9 0 −2 0 2 4 Tangent Space PCA (PGA) 6 5 10 15 20 Dimensionality 25 30 35 Regression Error Figure 4: Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared. compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space. As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen the learned metrics provide principal geodesics, which are more strongly related with the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation. 
To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model. To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics. To provide visual insights into the behavior of the learned metrics, we uniformly sample body shape along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder to wrist distance is related to overall body size, while shoulder breadth is related to body weight. 7 Shoulder to wrist distance Shoulder breadth Figure 5: Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material. 4.2 Scalability The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3] where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which captures 95% of the variance in the original data). We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material. 5 Discussion This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers. The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing. 
For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which essentially is a linear approach. Another approach was suggested by Hong et al. [23] who simply compute the principal components according to each metric separately, such that one low dimensional model is learned per metric. The suggested approach is, however, not difficulty-free in its current implementation. Currently, we are using off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classification tasks as this is all the current literature provides. We expect improvements by learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem. Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311). 8 References [1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Sch¨ lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing o Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press. [2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007. [3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011. [4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the twenty-first international conference on Machine learning, ICML ’04, pages 94–101. ACM, 2004. [5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009. [6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008. [7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012. [8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004. [9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996. [10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilaritybased classification. Journal of Machine Learning Research, 2:175–211, 2002. [11] Manfredo Perdigao do Carmo. Riemannian Geometry. Birkh¨ user Boston, January 1992. a [12] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. [13] Jan R. Magnus and Heinz Neudecker. 
Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007. [14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001. [15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980. [16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal Geodesic Analysis for the study of Nonlinear Statistics of Shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004. [17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987. [18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999. [19] Stefan Sommer, Francois Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact ¸ principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010. [20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999. [21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005. [22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In A. Fitzgibbon et al. (Eds.), editor, European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, oct 2012. [23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011. 9

4 0.11504588 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

Author: Jun Wang, Alexandros Kalousis, Adam Woznica

Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this ”independence” approach delivers an increased flexibility its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several largescale classification problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner. 1

5 0.11416852 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation

Author: Alexandra Carpentier, Odalric-ambrym Maillard

Abstract: In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X ⊂ Rd. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of α-Hölder functions. 1 Setting and Previous work Let us consider some space X ⊂ Rd and Y ⊂ R. We call X the input space or sampling space and Y the output space or value space. We consider the problem of estimating with uniform precision the function f : X ⊂ Rd → Y ⊂ R. We assume that we can query the function f n times, anywhere in the domain, and observe noisy samples of this function. These samples are collected sequentially, and our aim is to design an adaptive procedure that selects wisely where on the domain to query the function, according to the information provided by the previous samples. More formally: Observed process. We consider an unknown Y-valued process defined on X, written ν : X → M+(Y), where M+(Y) refers to the set of all probability measures on Y, such that for all x ∈ X the random variable Y(x) ∼ ν(x) has mean f(x) := E[Y(x)|x] ∈ R. We write the model for convenience in the following way: Y(x) = f(x) + noise(x), where noise(x) := Y(x) − E[Y(x)|x] is the centered random variable corresponding to the noise, with unknown variance σ²(x). We assume throughout this paper that f is α-Hölder. Partition. We consider that we can define a partition P of the input space X, with finitely many P regions {Rp}1≤p≤P that are assumed to be convex and non-degenerate, i.e. such that the interior of each region Rp has positive Lebesgue volume vp. Moreover, with each region Rp is associated a sampling distribution in that region, written µp ∈ M+(Rp). Thus, when we decide to sample in region Rp, a new sample X ∈ Rp is generated according to X ∼ µp. Allocation. We consider that we have a finite budget of n ∈ N samples that we can use in order to allocate samples as we wish among the regions {Rp}1≤p≤P. For illustration, let us assume that we deterministically allocate Tp,n ∈ N samples in region Rp, with the constraint that the allocation {Tp,n}1≤p≤P must sum to n. In region Rp we thus sample points {Xp,i}1≤i≤Tp,n at random according to the sampling distribution µp, and then get the corresponding values {Yp,i}1≤i≤Tp,n, where Yp,i ∼ ν(Xp,i). In the sequel, the distribution µp is assumed to be the uniform distribution over region Rp, i.e. the density of µp with respect to the Lebesgue measure λ is dµp/dλ(x) = 1{x ∈ Rp}/λ(Rp). Note that this is not restrictive since we are in an active, not passive, setting. Piecewise constant mean-approximation.
We use the collected samples in order to build a pieceˆ wise constant approximation fn of the mean f , and measure the accuracy of approximation on a region Rp with the expected quadratic norm of the approximation error, namely � � � � � ˆ (x))2 λ(dx) = Eµ ,ν (f (X) − mp,n )2 , ˆ (f (x) − fn E p λ(Rp ) Rp ˆ where mp,n is the constant value that takes fn on the region Rp . A natural choice for the estimator ˆ mp,n is to use the empirical mean that is unbiased and asymptotically optimal for this criterion. ˆ Thus we consider the following estimate (histogram) ˆ fn (x) = P � p=1 mp,n I{x ∈ Rp } where mp,n = ˆ ˆ Tp,n 1 � Tp,n Yp,i . i=1 Pseudo loss Note that, since the Tp,n are deterministic, the expected quadratic norm of the approximation error of this estimator can be written in the following form � � � � � � ˆ Eµp ,ν (f (X) − mp,n )2 ˆ = Eµp ,ν (f (X) − Eµp [f (X)])2 + Eµp ,ν (Eµp [f (X)] − mp,n )2 � � � � = Vµp f (X) + Vµp ,ν mp,n ˆ � � � � 1 Vµp ,ν Y (X) . = Vµp f (X) + Tp,n Now, using the following immediate decomposition � � � � � Vµp ,ν Y (X) = Vµp f (X) + σ 2 (x)µp (dx) , Rp we deduce that the maximal expected quadratic norm of the approximation error over the regions def {Rp }1≤p≤P , that depends on the choice of the considered allocation strategy A = {Tp,n }1≤p≤P is thus given by the following so-called pseudo-loss � � � � � � Tp,n + 1 1 def 2 (1) Vµp f (X) + Eµ σ (X) . Ln (A) = max 1≤ p ≤P Tp,n Tp,n p Our goal is to minimize this pseudo-loss. Note that this is a local measure of performance, as opposed to a more usual yet less challenging global quadratic error. Eventually, as the number of �� �2 � ˆ cells tends to ∞, this local measure of performance approaches supx∈X Eν f (x) − fn (x) . At this point, let us also introduce, for convenience, the notation Qp (Tp,n ) that denotes the term inside the max, in order to emphasize the dependency on the quadratic error with the allocation. Previous work There is a huge literature on the topic of functional estimation in batch setting. Since it is a rather old and well studied question in statistics, many books have been written on this topic, such as Bosq and Lecoutre [1987], Rosenblatt [1991], Gy¨ rfi et al. [2002], where piecewise constant meano approximation are also called “partitioning estimate” or “regressogram” (first introduced by Tukey [1947]). The minimax-optimal rate of approximation on the class of α-H¨ lder functions is known o 2α to be in O(n− 2α+d ) (see e.g. Ibragimov and Hasminski [1981], Stone [1980], Gy¨ rfi et al. [2002]). o In such setting, a dataset {(Xi , Yi )}i≤n is given to the learner, and a typical question is thus to try to find the best possible histogram in order to minimize a approximation error. Thus the dataset is fixed and we typically resort to techniques such as model selection where each model corresponds to one histogram (see Arlot [2007] for an extensive study of such). However, we here ask a very different question, that is how to optimally sample in an online setting in order to minimize the approximation error of some histogram. Thus we choose the histogram 2 before we see any sample, then it is fixed and we need to decide which cell to sample from at each time step. Motivation for this setting comes naturally from some recent works in the setting of active learning for the multi-armed bandit problem Antos et al. [2010], Carpentier et al. [2011]. 
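To make the objects just defined concrete, here is a small numerical sketch of the piecewise constant (histogram) estimate and of a plug-in evaluation of the pseudo-loss (1) for a given deterministic allocation. The 1-D grid partition, the Monte-Carlo estimation of the per-cell moments V(f) and E[σ²], and the particular test functions are assumptions of this illustration, not code or choices from the paper.

```python
import numpy as np

def histogram_estimate(xs, ys, edges):
    """Piecewise constant estimate: the empirical mean of the ys in each cell."""
    cells = np.digitize(xs, edges) - 1
    return np.array([ys[cells == p].mean() for p in range(len(edges) - 1)])

def pseudo_loss(f, sigma2, edges, T, n_mc=10_000, seed=0):
    """Plug-in version of the pseudo-loss (1),
       max_p [ (T_p + 1)/T_p * V_{mu_p}(f) + (1/T_p) * E_{mu_p}[sigma^2] ],
    with the per-cell moments estimated by uniform Monte-Carlo sampling."""
    rng = np.random.default_rng(seed)
    losses = []
    for p, tp in enumerate(T):
        u = rng.uniform(edges[p], edges[p + 1], size=n_mc)   # uniform mu_p
        losses.append((tp + 1) / tp * f(u).var() + sigma2(u).mean() / tp)
    return max(losses)

if __name__ == "__main__":
    f = lambda x: np.sin(2 * np.pi * x)
    sigma2 = lambda x: 0.1 + 0.2 * x                  # heteroscedastic noise level
    edges = np.linspace(0.0, 1.0, 9)                  # 8 equal cells of [0, 1]
    T = np.full(8, 50)                                # uniform allocation of n = 400
    print("pseudo-loss:", pseudo_loss(f, sigma2, edges, T))
    rng = np.random.default_rng(1)
    xs = rng.uniform(0.0, 1.0, 400)
    ys = f(xs) + rng.normal(scale=np.sqrt(sigma2(xs)))
    print("cell means:", np.round(histogram_estimate(xs, ys, edges), 2))
```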
In these works, the objective is to estimate with equal precision the mean of a finite number of distributions (arms), which would correspond to the special case when X = {1, . . . , P } is a finite set in our setting. Intuitively, we reduce the problem to such bandit problem with finite set of arms (regions), and our setting answers the question whether it is possible to extend those results to the case when the arms do not correspond to a singleton, but rather to a continuous region. We show that the answer is positive, yet non trivial. This is non trivial due to the variance estimation in each region: points x in some region may have different means f(x), so that standard estimators for the variance are biased, contrary to the point-wise case and thus finite-arm techniques may yield disastrous results. (Estimating the variance of the distribution in a continuous region actually needs to take into account not only the point-wise noise but also the variation of the function f and the noise level σ 2 in that region.) We describe a way, inspired from quasi Monte-Carlo techniques, to correct this bias so that we can handle the additional error. Also, it is worth mentioning that this setting can be informally linked to a notion of curiosity-driven learning (see Schmidhuber [2010], Baranes and Oudeyer [2009]), since we want to decide in which region of the space to sample, without explicit reward but optimizing the goal to understand the unknown environment. Outline Section 2 provides more intuition about the pseudo-loss and a result about the optimal oracle strategy when the domain is partitioned in a minimax-optimal way on the class of α−H¨ lder o functions. Section 3 presents our assumptions, that are basically to have a sub-Gaussian noise and smooth mean and variance functions, then our estimator of the pseudo-loss together with its concentration properties, before introducing our sampling procedure, called OAHPA-pcma. Finally, the performance of this procedure is provided and discussed in Section 4. 2 The pseudo-loss: study and optimal strategies 2.1 More intuition on each term in the pseudo-loss It is natural to look at what happens to each of the two terms that appear in equation 1 when one makes Rp shrink towards a point. More precisely, let xp be the mean of X ∼ µp and let us look at the limit of Vµp (f (X)) when vp goes to 0. Assuming that f is differentiable, we get �2 � �� lim Vµp (f (X)) = lim Eµp f (X) − f (xp ) − E[f (X) − f (xp )] vp →0 vp →0 = = = lim Eµp �� �X − xp , ∇f (xp )� − E[�X − xp , ∇f (xp )�] vp →0 � � lim Eµp �X − xp , ∇f (xp )�2 vp →0 � � lim ∇f (xp )T Eµp (X − xp )(X − xp )T ∇f (xp ) . �2 � vp →0 Therefore, if we introduce Σp to be the covariance matrix of the random variable X ∼ µp , then we simply have lim Vµp (f (X)) = lim ||∇f (xp )||2 p . Σ vp →0 vp →0 Example with hyper-cubic regions An important example is when Rp is a hypercube with side 1/d length vp and µp is the uniform distribution over the region Rp . In that case (see Lemma 1), we dx have µp (dx) = , and 2/d vp vp . ||∇f (xp )||2 p = ||∇f (xp )||2 Σ 12 More generally, when f is α−differentiable, i.e. that ∀a ∈ X , ∃∇α f (a, ·) ∈ Sd (0, 1)R such that ∀x ∈ Sd (0, 1), limh→0 f (a+hx)−f (a) = ∇α f (a, x), then it is not too difficult to show that for such hα hyper-cubic regions, we have � � � 2α � Vµp f (X) = O vpd sup |∇α f (xp , u)|2 . S(0,1) � � On the other hand, by direct computation, the second term is such that limvp →0 Eµp σ 2 (X) = � � � � σ 2 (xp ). 
Thus, while Vµp f (X) vanishes, Eµp σ 2 (X) stays bounded away from 0 (unless ν is deterministic). 3 2.2 Oracle allocation and homogeneous partitioning for piecewise constant mean-approximation. We now assume that we are allowed to choose the partition P depending on n, thus P = Pn , amongst all homogeneous partitions of the space, i.e. partitions such that all cells have the same volume, and come from a regular grid of the space. Thus the only free parameter is the number of cells Pn of the partition. An exact yet not explicit oracle algorithm. The minimization of the pseudo-loss (1) does not yield to a closed-form solution in general. However, we can still derive the order of the optimal loss (see [Carpentier and Maillard, 2012, Lemma 2] in the full version of the paper for an example of minimax yet non adaptive oracle � algorithm given in closed-form solution): � � −β � � � −α� � � Lemma 1 In the case when Vµp f (X) = Ω Pn and Rp σ 2 (x)µp (dx) = Ω Pn , then an � optimal allocation and partitioning strategy An satisfies that� � � � Vµp f (X) + Eµp σ 2 (X) � � , L − Vµp f (X) � as soon as there exists, for such range of Pn , a constant L such that � � � � � Pn � Vµp f (X) + Eµp σ 2 (X) � � = n. L − Vµp f (X) p=1 1 � Pn = Ω(n max(1+α� −β� ,1) ) and def � Tp,n = The pseudo-loss of such an algorithm A� , optimal amongst the allocations strategies that use the n � partition Pn in Pn regions, is then given by � � � � def max(1 − β , 1 − α ) − 1. where γ = Ln (A� ) = Ω nγ n max(1 + α� − β � , 1) The condition involving the constant L is here to ensure that the partition is not degenerate. It is morally satisfied as soon as the variance of f and the noise are bounded and n is large enough. This Lemma applies to the important class W 1,2 (R) of functions that admit a weak derivative that o belongs to L2 (R). Indeed these functions are H¨ lder with coefficient α = 1/2, i.e. we have o W 1,2 (R) ⊂ C 0,1/2 (R). The standard Brownian motion is an example of function that is 1/2-H¨ lder. More generally, for k = d + α with α = 1/2 when d is odd and α = 1 when d is even, we have the 2 inclusion W k,2 (Rd ) ⊂ C 0,α (Rd ) , where W k,2 (Rd ) is the set of functions that admit a k th weak derivative belonging to L2 (Rd ). Thus the previous Lemma applies to sufficiently smooth functions with smoothness linearly increasing with the dimension d of the input space X . Important remark Note that this Lemma gives us a choice of the partition that is minimax-optimal, and an allocation strategy on that partition that is not only minimax-optimal but also adaptive to the function f itself. Thus it provides a way to decide in a minimax way what is the good number of regions, and then to provide the best oracle way to allocate the budget. We can deduce the following immediate corollary on the class of α−H¨ lder functions observed in a o non-negligible noise of bounded variance (i.e. in the setting β � = 0 and α� = 2α ). d Corollary 1 Consider that f is α−H¨ lder and the noise is of bounded variance. Then a minimaxo d � d+2α ) and an optimal allocation achieves the rate L (A� ) = optimal partition satisfies Pn = Ω(n n n � −2α � Ω n d+2α . Moreover, the strategy of Lemma 1 is optimal amongst the allocations strategies that � use the partition Pn in Pn regions. � −2α � The rate Ω n d+2α is minimax-optimal on the class of α−H¨ lder functions (see Gy¨ rfi et al. [2002], o o Ibragimov and Hasminski [1981], Stone [1980]), and it is thus interesting to consider an initial numd � � d+2α ). 
After having built the partition, if the quantities ber �� � 2 �� � � of�regions Pn that is of order Pn = Ω(n Vµp f p≤P and Eµp σ p≤P are known to the learner, it is optimal, in the aim of minimizing � the pseudo-loss, to allocate to each region the number of samples Tp,n provided in Lemma 1. Our objective in this paper is, after having chosen beforehand a minimax-optimal partition, to allocate 4 the samples properly in the regions, without having any access to those quantities. It is then �� � � necessary to balance between exploration, i.e. allocating the samples in order to estimate Vµp f p≤P � � �� and Eµp σ 2 p≤P , and exploitation, i.e. use the estimates to target the optimal allocation. 3 Online algorithms for allocation and homogeneous partitioning for piecewise constant mean-approximation In this section, we now turn to the design of algorithms that are fully online, with the goal to be competitive against the kind of oracle algorithms considered in Section 2.2. We now assume that the space X = [0, 1]d is divided in Pn hyper-cubic regions of same measure (the Lebesgue measure on 1 [0, 1]d ) vp = v = Pn . The goal of an algorithm is to minimize the quadratic error of approximation of f by a constant over each cell, in expectation, which we write as � � � � � � 2 λ(dx) ˆ (x))2 λ(dx) = max E , max E (f (x) − fn (f (x) − mp,n ) ˆ 1≤p≤Pn 1≤p≤Pn λ(Rp ) λ(Rp ) Rp Rp ˆ where fn is the histogram estimate of the function f on the partition P and mp,n is the empirical ˆ mean defined on region Rp with the samples (Xi , Yi ) such that Xi ∈ Rp . To do so, an algorithm is only allowed to specify at each time step t, the next point Xt where to sample, based on all the past samples {(Xs , Ys )}s < ∞ satisfies that λ2 σ 2 (x) , ∀λ ∈ R+ log E exp[λ noise(x)] ≤ 2 and we further assume that it satisfies the following slightly stronger second property (that is for instance exactly verified for a Gaussian variable, looking at the moment generating function): � � � � 1 λ2 σ 2 (x) ∀λ, γ ∈ R+ log E exp λnoise(x) + γnoise(x)2 ≤ − log 1 − 2γσ 2 (x) . 2(1 − 2γσ 2 (x)) 2 5 The function f is assumed to be (L, α)-H¨ lder, meaning that it satifies o � ∀x, x ∈ X f (x) − f (x� ) ≤ L||x − x� ||α . Similarly, the function σ 2 is assumed to be (M, β)-H¨ lder i.e. it satisfies o � 2 2 � ∀x, x ∈ X σ (x) − σ (x ) ≤ M ||x − x� ||β . We assume that Y is a convex and compact subset of R, thus w.l.g. that it is [0, 1], and that it is known that ||σ 2 ||∞ , which is thus finite, is bounded by the constant 1. 3.2 Empirical estimation of the quadratic approximation error on each cell We define the sampling distribution µp in the region Rp for each p ∈ {1, . . . , Pn } as a quasi-uniform ˜ sampling scheme using the uniform distribution over the sub-regions. More precisely at time t ≤ n, if we decide to sample in the region Rp according to µp , we sample uniformly in each sub-region ˜ one sample, resulting in a new batch of samples {(Xt,k , Yt,k )}1≤k≤K , where Xt,k ∼ µp,k . Note that due to this sampling process, the number of points Tp,t sampled in sub-region Rp at time t is always Tp,t a multiple of K and that moreover for all k, k � ∈ {1, . . . , K} we have that Tp,k,t = Tp,k� ,t = K . Now this specific sampling is used in order to be able to estimate the variances Vµp f and Eµp σ 2 , � so that the best proportions Tp,n can be computed as accurately as possible. Indeed, as explained in � � � � Lemma 1, we have that Vµp f (X) + Eµp σ 2 (X) � def � � . Tp,n = L − Vµp f (X) ˆ Variance estimation We now introduce two estimators. 
The first estimator is written Vp,t and is def ˆ built in the following way. First,let us introduce the empirical estimate fp,k,t of the mean fp,k = � � Eµp,k f (X) of f in sub-region Rp,k . Similarly, to avoid some cumbersome notations, we introduce � � � � � � def def def 2 fp = Eµp f (X) and vp,k = Vµp,k f (X) for the function f , and then σp,k = Eµp,k σ 2 (X) for the variance of the noise σ 2 . We now define the empirical variance estimator to be K 1 � ˆ ˆ (fp,k,t − mp,t )2 , ˆ Vp,t = K −1 k=1 that is a biased estimator. Indeed, for a deterministic Tp,t , it is not difficult to show that we have � K K � � � � � � �� � � � � 2 1 �� 1 � ˆ E Vp,t + Eµp,k f − Eµp f = Vµp,k f + Eµp,k σ 2 . K −1 Tp,t k=1 k=1 � � The leading term in this decomposition, that is given by the first sum, is closed to Vµp f since, by using the assumption that f is (L, α)−H¨ lder, we have the following inequality o � � K � �� �� � �1 � � � � 2 2L2 dα � Eµp,k f − Eµp f − Vµp f (X) � ≤ , � �K (KPn )2α/d k=1 where we also used that the diameter of a sub-region Rp,k is given by diam(Rp,k ) = d1/2 . (KPn )1/d ˆ Then, the second term also contributes to the bias, essentially due to the fact that V[fp,k,t ] = � � � � 2 def def 1 1 2 2 2 Tp,k,t (vp,k + σp,k ) and not Tp,t (vk + σk ) (with vp = Vµp f (X) and σp = Eµp σ (X) ). ˆ p,k,t In order to correct this term, we now introduce the second estimator σ 2 that estimates the variance � � � � � � of the outputs in a region Rp,k , i.e. Vµp,k ,ν Y (X) = Vµp,k f (X) + Eµp,k σ 2 . It is defined as �2 t t �� 1 1 � def ˆ p,k,t = Yi − Yj I{Xj ∈ Rp,k } I{Xi ∈ Rp,k } . σ2 Tp,k,t − 1 i=1 Tp,k,t j=1 Now, we combine the two previous estimators to form the following estimator K 1 �� 1 1 � 2 ˆ ˆ ˆ σ − . Qp,t = Vp,t − K Tp,k,t Tp,t p,k,t k=1 ˆ The following proposition provides a high-probability bound on the difference between Qp,t and the quantity we want to estimate. We report the detailed proof in [Carpentier and Maillard, 2012]. 6 ˆ Proposition 1 By the assumption that f is (L, α)-H¨ lder, the bias of the estimator Qp,t , and for o deterministic Tp,t , is given by � K � � � � � � � � � 2 1 � 2L2 dα ˆ − Vµp f (X) ≤ . Eµp,k f − Eµp f E Qp,t − Qp (Tp,t ) = K (KPn )2α/d k=1 Moreover, it satisfies that for all δ ∈ [0, 1], there exists an event of probability higher than 1 − δ such that on this event, we have � � � � � � K K � � � � 8 log(4/δ) � σ 2 �1 � � � � ˆ p,k,t 1 � 2 ˆ ˆ � Qp,t − E Qp,t � ≤ � √ +o σ p,k . � � (K − 1)2 T2 T K K k=1 p,k,t p,k,t k=1 We also state the following Lemma that we are going to use in the analysis, and that takes into account randomness of the stopping times Tp,k,t . Lemma 2 Let {Xp,k,u }p≤P, k≤K, u≤n be samples potentially sampled in region Rp,k . We introduce qp,u to be the�equivalent of Qp (Tp,t ) with explicitly fixed value of Tp,t = u. Let also qp,u be the ˆ � ˆ p,t but computed with the first u samples in estimate of E qp,u , that is to say the equivalent of Q each region Rp,k (i.e. Tp,t = u). Let us define the event � � � � � � � AK log(4nP/δ)V � � ˆp,t 2L2 dα � � ξn,P,K (δ) = + ω : � qp,u (ω) − E qp,u � ≤ ˆ , u K −1 (KPn )2α/d p≤P u≤n �K 1 ˆ ˆ ˆ p,k,t and where A ≤ 4 is a numerical constant. Then it where Vp,t = Vp (Tp,t ) = K−1 k=1 σ 2 holds that � � P ξn,P,K (δ) ≥ 1 − δ . Note that, with the notations of this Lemma, Proposition 1 above is thus about qp,u . 
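A numpy sketch of the bias-corrected error estimate Q̂_{p,t}, built from the two estimators just described: the spread of the sub-region means (V̂_{p,t}) and the per-sub-region output variances (σ̂²_{p,k,t}), combined as read from the construction above. The data layout, variable names, and the use of the unbiased sample variance are assumptions of this illustration; the paper's confidence terms and constants are not reproduced here.

```python
import numpy as np

def region_error_estimate(sub_samples):
    """Estimate of the quadratic approximation error in one region R_p.

    sub_samples : list of K 1-D arrays, the Y-values observed in each of the
                  K sub-regions R_{p,k} (equal sizes by the sampling design).

    Returns Q_hat = V_hat - (1/K) * sum_k (1/T_{p,k} - 1/T_p) * sigma2_hat_{p,k},
    where V_hat measures the spread of the sub-region means around the region mean."""
    K = len(sub_samples)
    T_pk = np.array([len(y) for y in sub_samples], dtype=float)
    T_p = T_pk.sum()
    sub_means = np.array([y.mean() for y in sub_samples])        # f_hat_{p,k,t}
    m_p = np.concatenate(sub_samples).mean()                     # m_hat_{p,t}
    V_hat = np.sum((sub_means - m_p) ** 2) / (K - 1)
    sigma2_hat = np.array([y.var(ddof=1) for y in sub_samples])  # per sub-region
    return V_hat - np.mean((1.0 / T_pk - 1.0 / T_p) * sigma2_hat)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, t = 4, 25
    # Synthetic region: a mild trend across sub-regions plus Gaussian noise.
    sub_samples = [0.1 * k + rng.normal(scale=0.3, size=t) for k in range(K)]
    print("Q_hat:", region_error_estimate(sub_samples))
```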
ˆ 3.3 The Online allocation and homogeneous partitioning algorithm for piecewise constant mean-approximation (OAHPA-pcma) We are now ready to state the algorithm that we propose for minimizing the quadratic error of approximation of f . The algorithm is described in Figure 1. Although it looks similar, this algorithm is ˆ quite different from a normal UCB algorithm since Qp,t decreases in expectation with Tp,t . Indeed, � � � � � �� �K � 1 its expectation is close to Vµp f + KTp,t k=1 Vµp,k f + Eµp,k σ 2 . Algorithm 1 OAHPA-pcma. 1: Input: A, L, α, Horizon n; Partition {Rp }p≤P , with sub-partitions {Rp,k }k≤K . 2: Initialization: Sample K points in every sub-region {Rp,k }p≤P,k≤K 3: for t = K 2 P + 1; t ≤ n; t = t + K do ˆ 4: Compute ∀p, Qp,t . � ˆ ˆ p,t + AK log(4nP/δ)Vp,t + 2L2 dα . 5: Compute ∀p, Bp,t = Q 2α/d Tp,t K−1 (KPn ) 6: Select the region pt = argmax1≤p≤Pn Bp,t where to sample. 7: Sample K samples in region Rpt one per sub-region Rpt ,k according to µpt ,k . 8: end for 4 Performance of the allocation strategy and discussion Here is the main result of the paper; see the full version [Carpentier and Maillard, 2012] for the proof. We remind that the objective is to minimize for an algorithm A the pseudo-loss Ln (A). Theorem 1 (Main result) Let γ = � maxp Tp,n � minp Tp,n be the distortion factor of the optimal allocation stratdef d d egy, and let � > 0. Then with the choice of the number of regions Pn = n 2α+d �2+ 2α , and of the 2d d def def 8L2 α number of sub-regions K = C 4α+d �−2− α , where C = Ad1−α then the pseudo-loss of the OAHPApcma algorithm satisfies, under the assumptions of Section 3.1 and on an event of probability higher than 1 − δ, � � � � � 2α 1 + �γC � log(1/δ) Ln (A� ) + o n− 2α+d , Ln (A) ≤ n for some numerical constant C � not depending on n, where A� is the oracle of Lemma 1. n 7 Minimax-optimal partitioning and �-adaptive performance Theorem 1 provides a high probability bound on the performance of the OAHPA-pcma allocation strategy. It shows that this performance is competitive with that of an optimal (i.e. adaptive to the function f , see Lemma 1) allocation d A� on a partition with a number of cells Pn chosen to be of minimax order n 2α+d for the class of 2α α-H¨ lder functions. In particular, since Ln (A� ) = O(n d+2α ) on that class, we recover the same o n minimax order as what is obtained in the batch learning setting, when using for instance wavelets, or Kernel estimates (see e.g. Stone [1980], Ibragimov and Hasminski [1981]). But moreover, due to the adaptivity of A� to the function itself, this procedure is also �-adaptive to the function and not n only minimax-optimal on the class, on that partition (see Section 2.2). Naturally, the performance of the method increases, in the same way than for any classical functional estimation method, when the smoothness of the function increases. Similarly, in agreement with the classical curse of dimension, the higher the dimension of the domain, the less efficient the method. Limitations In this work, we assume that the smoothness α of the function is available to the learner, which enables her to calibrate Pn properly. Now it makes sense to combine the OAHPApcma procedure with existing methods that enable to estimate this smoothness online (under a slightly stronger assumption than H¨ lder, such as H¨ lder functions that attain their exponents, o o see Gin´ and Nickl [2010]). 
It is thus interesting, when no preliminary knowledge on the smoothness e of f is available, to spend some of the initial budget in order to estimate α. We have seen that the OAHPA-pcma procedure, although very simple, manages to get minimax optimal results. Now the downside of the simplicity of the OAHPA-pcma strategy is two-fold. � The first limitation is that the factor (1 + �γC � log(1/δ)) = (1 + O(�)) appearing in the bound before Ln (A� ) is not 1, but higher than 1. Of course it is generally difficult to get a constant 1 in the batch setting (see Arlot [2007]), and similarly this is a difficult task in our online setting too: If � is chosen to be small, then the error with respect to the optimal allocation is small. However, since Pn is expressed as an increasing function of �, this implies that the minimax bound on the loss for partition P increases also with �. That said, in the view of the work on active learning multi-armed bandit that we extend, we would still prefer to get the optimal constant 1. The second limitation is more problematic: since K is chosen irrespective of the region Rp , this causes the presence of the factor γ. Thus the algorithm will essentially no longer enjoy near-optimal performance guarantees when the optimal allocation strategy is highly not homogeneous. Conclusion and future work In this paper, we considered online regression with histograms in an active setting (we select in which bean to sample), and when we can choose the histogram in a class of homogeneous histograms. Since the (unknown) noise is heteroscedastic and we compete not only with the minimax allocation oracle on α-H¨ lder functions but with the adaptive oracle o that uses a minimax optimal histogram and allocates samples adaptively to the target function, this is an extremely challenging (and very practical) setting. Our contribution can be seen as a non trivial extension of the setting of active learning for multi-armed bandits to the case when each arm corresponds to one continuous region of a sampling space, as opposed to a singleton, which can also be seen as a problem of non parametric function approximation. This new setting offers interesting challenges: We provided a simple procedure, based on the computation of upper confidence bounds of the estimation of the local quadratic error of approximation, and provided a performance analysis that shows that OAHPA-pcma is first order �-optimal with respect to the function, for a partition chosen to be minimax-optimal on the class of α-H¨ lder functions. However, this simplicity also o has a drawback if one is interested in building exactly first order optimal procedure, and going beyond these limitations is definitely not trivial: A more optimal but much more complex algorithm would indeed need to tune a different factor Kp in each cell in an online way, i.e. define some Kp,t that evolves with time, and redefine sub-regions accordingly. Now, the analysis of the OAHPA-pcma already makes use of powerful tools such as empirical-Bernstein bounds for variance estimation (and not only for mean estimation), which make it non trivial; in order to handle possibly evolving subregions and deal with the progressive refinement of the regions, we would need even more intricate analysis, due to the fact that we are online and active. This interesting next step is postponed to future work. 
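For readers who want to experiment with the overall allocation scheme, the following is a deliberately simplified, UCB-style loop in the spirit of Algorithm 1 (OAHPA-pcma): sample one batch of K points per region to initialize, then repeatedly sample in the region whose estimated local error plus a confidence bonus is largest. The bonus used below is a generic surrogate, not the paper's exact B_{p,t} (which involves the constants A, L, α and a partition-dependent bias term); everything else is likewise an assumption of this sketch.

```python
import numpy as np

def ucb_allocate(sample_region, P, K, budget, bonus_scale=1.0, seed=0):
    """Sequential allocation: keep sampling batches of K points in the region
    with the largest upper-confidence score on its local approximation error.

    sample_region(p, K, rng) must return K noisy observations from region p."""
    rng = np.random.default_rng(seed)
    obs = [[] for _ in range(P)]
    for p in range(P):                               # initialization round
        obs[p].extend(sample_region(p, K, rng))
    t = P * K
    while t + K <= budget:
        scores = []
        for p in range(P):
            y = np.asarray(obs[p])
            q_hat = y.var(ddof=1) / len(y)           # crude local error estimate
            bonus = bonus_scale * np.sqrt(np.log(budget) / len(y))
            scores.append(q_hat + bonus)
        p_star = int(np.argmax(scores))
        obs[p_star].extend(sample_region(p_star, K, rng))
        t += K
    return np.array([len(o) for o in obs])           # final allocation T_{p,n}

if __name__ == "__main__":
    noise = [0.05, 0.05, 0.5, 1.0]                   # heteroscedastic noise levels
    def sample_region(p, K, rng):
        return rng.normal(loc=0.0, scale=noise[p], size=K)
    print("samples per region:", ucb_allocate(sample_region, P=4, K=5, budget=400))
```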
Acknowledgements This research was partially supported by Nord-Pas-de-Calais Regional Council, French ANR EXPLO-RA (ANR-08-COSI-004), the European Communitys Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270327 (CompLACS) and no 216886 (PASCAL2). 8 References Andr` s Antos, Varun Grover, and Csaba Szepesv` ri. Active learning in heteroscedastic noise. Thea a oretical Computer Science, 411(29-30):2712–2728, 2010. Sylvain Arlot. R´ echantillonnage et S´ lection de mod` les. PhD thesis, Universit´ Paris Sud - Paris e´ e e e XI, 2007. A. Baranes and P.-Y. Oudeyer. R-IAC: Robust Intrinsically Motivated Exploration and Active Learning. IEEE Transactions on Autonomous Mental Development, 1(3):155–169, October 2009. D. Bosq and J.P. Lecoutre. Th´ orie de l’estimation fonctionnelle, volume 21. Economica, 1987. e Alexandra Carpentier and Odalric-Ambrym Maillard. Online allocation and homogeneous partitioning for piecewise constant mean-approximation. HAL, 2012. URL http://hal.archives-ouvertes.fr/hal-00742893. Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rmi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Jyrki Kivinen, Csaba Szepesv` ri, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory, a volume 6925 of Lecture Notes in Computer Science, pages 189–203. Springer Berlin / Heidelberg, 2011. E. Gin´ and R. Nickl. Confidence bands in density estimation. The Annals of Statistics, 38(2): e 1122–1170, 2010. L. Gy¨ rfi, M. Kohler, A. Krzy´ ak, and Walk H. A distribution-free theory of nonparametric regreso z sion. Springer Verlag, 2002. I. Ibragimov and R. Hasminski. Statistical estimation: Asymptotic theory. 1981. M. Rosenblatt. Stochastic curve estimation, volume 3. Inst of Mathematical Statistic, 1991. J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (19902010). Autonomous Mental Development, IEEE Transactions on, 2(3):230–247, 2010. C.J. Stone. Optimal rates of convergence for nonparametric estimators. The annals of Statistics, pages 1348–1360, 1980. J.W. Tukey. Non-parametric estimation ii. statistically equivalent blocks and tolerance regions–the continuous case. The Annals of Mathematical Statistics, 18(4):529–539, 1947. 9

6 0.10732564 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

7 0.096960805 34 nips-2012-Active Learning of Multi-Index Function Models

8 0.096665449 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

9 0.094401203 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

10 0.09338966 179 nips-2012-Learning Manifolds with K-Means and K-Flats

11 0.093357436 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

12 0.091445714 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

13 0.080395021 145 nips-2012-Gradient Weights help Nonparametric Regressors

14 0.07336387 282 nips-2012-Proximal Newton-type methods for convex optimization

15 0.070568442 242 nips-2012-Non-linear Metric Learning

16 0.066871807 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

17 0.065996602 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

18 0.062991984 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

19 0.062621474 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

20 0.061935026 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.188), (1, 0.053), (2, 0.076), (3, -0.111), (4, 0.048), (5, 0.058), (6, -0.027), (7, 0.03), (8, 0.048), (9, -0.043), (10, 0.017), (11, -0.158), (12, 0.032), (13, -0.073), (14, -0.075), (15, 0.069), (16, 0.011), (17, 0.081), (18, 0.214), (19, 0.055), (20, 0.132), (21, -0.057), (22, 0.058), (23, 0.029), (24, 0.101), (25, 0.074), (26, 0.047), (27, 0.025), (28, -0.044), (29, -0.008), (30, -0.023), (31, 0.031), (32, 0.057), (33, 0.086), (34, 0.048), (35, -0.005), (36, 0.044), (37, -0.033), (38, -0.074), (39, -0.048), (40, 0.065), (41, 0.075), (42, -0.045), (43, -0.044), (44, -0.052), (45, -0.094), (46, 0.043), (47, -0.062), (48, 0.021), (49, -0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94082558 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

Author: Suvrit Sra

Abstract: Symmetric positive definite (spd) matrices pervade numerous scientific disciplines, including machine learning and optimization. We consider the key task of measuring distances between two spd matrices; a task that is often nontrivial whenever the distance function must respect the non-Euclidean geometry of spd matrices. Typical non-Euclidean distance measures such as the Riemannian metric δR (X, Y ) = log(Y −1/2 XY −1/2 ) F , are computationally demanding and also complicated to use. To allay some of these difficulties, we introduce a new metric on spd matrices, which not only respects non-Euclidean geometry but also offers faster computation than δR while being less complicated to use. We support our claims theoretically by listing a set of theorems that relate our metric to δR (X, Y ), and experimentally by studying the nonconvex problem of computing matrix geometric means based on squared distances. 1
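For reference, a direct numpy/scipy sketch of the Riemannian distance quoted in this abstract, δ_R(X, Y) = ||log(Y^{-1/2} X Y^{-1/2})||_F. The implementation below is a literal transcription of that formula (an equivalent and often more stable route goes through the generalized eigenvalues of the pair (X, Y)); it is an illustration, not code from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm, logm

def delta_R(X, Y):
    """Riemannian distance between SPD matrices:
       delta_R(X, Y) = || log(Y^{-1/2} X Y^{-1/2}) ||_F."""
    Y_inv_sqrt = np.linalg.inv(sqrtm(Y))
    # real_if_close strips the negligible imaginary parts sqrtm/logm may return.
    M = np.real_if_close(Y_inv_sqrt @ X @ Y_inv_sqrt)
    return float(np.linalg.norm(np.real_if_close(logm(M)), ord='fro'))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4)); X = A @ A.T + 4 * np.eye(4)   # random SPD matrices
    B = rng.normal(size=(4, 4)); Y = B @ B.T + 4 * np.eye(4)
    print("delta_R(X, Y) =", delta_R(X, Y))
    print("delta_R(X, X) =", delta_R(X, X))                    # ~0 up to round-off
```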

2 0.81731147 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classifiers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the first practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufficient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classification tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Specifically, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is defined as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the first principal component is computed. 
(c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our final model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive definite D × D matrix. Classic approaches for finding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-definite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classifier [5], which finds a linear transformation that satisfies local distance constraints, making the approach suitable for multi-modal classes. For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classifier. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] find a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. 
To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the first and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same figure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? For kNN classifiers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This impose a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. 
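A tiny sketch of the two ingredients discussed above: the squared Mahalanobis distance of (2) and the binary "use the metric whose anchor is nearest" rule of (3), instantiated on the toy configuration of Fig. 2 (M_1 = 2I at x_1 = (2, 2), M_2 = I at x_2 = (4, 4)). The code is illustrative only; the test points a and b are invented, and the example also makes the asymmetry mentioned above explicit.

```python
import numpy as np

def sq_dist(x, y, M):
    """Squared Mahalanobis distance (2): (x - y)^T M (x - y)."""
    d = x - y
    return float(d @ M @ d)

def nearest_metric(x, anchors, metrics):
    """Binary weighting of (3): return the metric M_r whose anchor x_r is
    closest to x, where closeness to x_r is measured with M_r itself."""
    scores = [sq_dist(x, xr, Mr) for xr, Mr in zip(anchors, metrics)]
    return metrics[int(np.argmin(scores))]

if __name__ == "__main__":
    anchors = [np.array([2.0, 2.0]), np.array([4.0, 4.0])]   # x_1, x_2
    metrics = [2.0 * np.eye(2), np.eye(2)]                   # M_1 = 2I, M_2 = I
    a, b = np.array([1.0, 1.0]), np.array([5.0, 5.0])
    # Asymmetric: the metric is chosen at the first argument.
    print(sq_dist(a, b, nearest_metric(a, anchors, metrics)))  # measured with M_1
    print(sq_dist(b, a, nearest_metric(b, anchors, metrics)))  # measured with M_2
```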
However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work. 2.1 Geodesics and Riemannian Geometry We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]: Definition A Riemannian metric M on a manifold M is a smoothly varying inner product < a, b >x = aT M(x)b in the tangent space Tx M of each point x ∈ M . 3 Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold M, the length of a smooth curve c : [0, 1] → M is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve: 1 Length(c) = 1 c (λ) M(c(λ)) dλ c (λ)T M(c(λ))c (λ)dλ , = (4) 0 0 where c denotes the derivative of c and M(c(λ)) is the metric tensor at c(λ). A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e. (5) cgeo = arg min Length(c) with c(0) = x and c(1) = y . c The distance between x and y is defined as the length of the geodesic. Given a tangent vector v ∈ Tx M, there exists a unique geodesic cv (t) with initial velocity v at x. The Riemannian exponential map, Expx , maps v to a point on the manifold along the geodesic cv at t = 1. This mapping preserves distances such that dist(cv (0), cv (1)) = v . The inverse of the exponential map is the Riemannian logarithmic map denoted Logx . Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied and the results are mapped back to the manifold. 3 A Metric Feature Space With the preliminaries settled we define the new model. Let C = RD denote the feature space. We endow C with a metric tensor in every point x, which we define akin to (3), R M(x) = wr (x)Mr , where wr (x) = r=1 wr (x) ˜ R ˜ j=1 wj (x) , (6) with wr > 0. The only difference from (3) is that we shall not restrict ourselves to binary weight ˜ functions wr . We assume the metric tensors Mr have already been learned; Sec. 4 contain examples ˜ where they have been learned using LMNN [5] and LDA [9]. From the definition of a Riemannian metric, we trivially have the following result: Lemma 1 The space C = RD endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights wr (x) change smoothly with x. Hence, by only considering smooth weight functions wr we get a well-studied geometric structure ˜ on the feature space, which ensures us that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c. As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. 
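The construction in (6), with the squared-exponential weights used later in the paper, together with a discretized version of the curve-length integral (4), can be sketched as follows. The weight parameter ρ, the midpoint discretization of the integral, and the toy metrics reused from Fig. 2 are choices of this illustration, not the paper's implementation.

```python
import numpy as np

def metric_tensor(x, anchors, basis_metrics, rho=1.0):
    """Smooth metric field of eq. (6): M(x) = sum_r w_r(x) M_r, with
    squared-exponential weights w_r normalized to sum to one."""
    w = np.exp(-0.5 * rho * np.sum((anchors - x) ** 2, axis=1))
    w /= w.sum()
    return np.einsum('r,rij->ij', w, basis_metrics)

def curve_length(points, anchors, basis_metrics, rho=1.0):
    """Discretization of eq. (4): sum over segments of sqrt(c'^T M(c) c'),
    with the metric evaluated at each segment's midpoint.
    `points` is an (N, D) array of samples along the curve."""
    total = 0.0
    for p, q in zip(points[:-1], points[1:]):
        M = metric_tensor(0.5 * (p + q), anchors, basis_metrics, rho)
        d = q - p
        total += np.sqrt(float(d @ M @ d))
    return total

if __name__ == "__main__":
    anchors = np.array([[2.0, 2.0], [4.0, 4.0]])
    basis_metrics = np.stack([2.0 * np.eye(2), np.eye(2)])     # M_1 = 2I, M_2 = I
    straight = np.linspace([0.0, 0.0], [6.0, 6.0], 200)        # straight path
    print("length of the straight path:", curve_length(straight, anchors, basis_metrics))
```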
They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric. Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODE’s) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques. 4 3.1 Computing Geodesics, Maps and Statistics At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e. ∂L d ∂L , where L(λ, c, c ) = c (λ)T M(c(λ))c (λ) . = ∂c dλ ∂c As we have an explicit expression for the metric tensor we can compute (7) in closed form: (7) Theorem 2 Geodesic curves in C satisfy the following system of 2nd order ODE’s M(c(λ))c (λ) = − 1 ∂vec [M(c(λ))] 2 ∂c(λ) T (c (λ) ⊗ c (λ)) , (8) where ⊗ denotes the Kronecker product and vec [·] stacks the columns of a matrix into a vector [13]. Proof See supplementary material. This result holds for any smooth weight functions wr . We, however, still need to compute ∂vec[M] , ˜ ∂c which depends on the specific choice of wr . Any smooth weighting scheme is applicable, but we ˜ restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result Theorem 3 For wr (x) = exp − ρ x − xr ˜ 2 ∂vec [M(c)] = ∂c the derivative of the metric tensor from (6) is R ρ R j=1 2 Mr R 2 wj ˜ T r=1 T wj (c − xj ) Mj − (c − xr ) Mr ˜ wr vec [Mr ] ˜ . (9) j=1 Proof See supplementary material. Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints c(0) = x and c(1) = y . (10) This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate C 1 –continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces. Computing Logarithmic Maps. Once a geodesic c is found, it follows from the definition of the logarithmic map, Logx (y), that it can be computed as v = Logx (y) = c (0) Length(c) . c (0) (11) In practice, we solve (8) by rewriting it as a system of first order ODE’s, such that we compute both c and c simultaneously (see supplementary material for details). Computing Exponential Maps. 
Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, Expx (v), finds the unique geodesic starting at x with initial velocity v. As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODE’s with the initial conditions c(0) = x and c (0) = v . (12) This initial value problem has a unique solution, which we find numerically using a standard RungeKutta scheme [15]. 5 3.1.1 Generalizing PCA and Regression At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent and mapping the results back. The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material. The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16], γvd (t) = Expµ (tvd ) , (13) th where vd denotes the d eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve γvd is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material. Linear regression has been generalized in a similar way [17] by performing regression in the tangent of the mean and mapping the resulting line back to the manifold using the exponential map. The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation as the logarithmic map is only guarantied to preserve distances to the origin of the tangent and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable. 4 Experiments To illustrate the framework1 we consider an example in human body analysis, and then we analyze the scalability of the approach. But first, to build intuition, Fig. 3a show synthetically generated data samples from two classes. We sample random points xr and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6) and Fig. 3b show the data in the tangent space of the resulting manifold. As can be seen the two classes are now globally further apart, which shows the effect of local metrics. 4.1 Human Body Shape We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. 
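Before the experiments, here is a sketch of the exponential-map computation described earlier in this passage: integrate the geodesic equation as an initial value problem with c(0) = x and c'(0) = v. Rather than the vectorized form (8)–(9), the sketch uses the standard Christoffel-symbol form of the geodesic equation, with the derivatives of the metric taken by finite differences; the metric field, step sizes, and tolerances are assumptions of the illustration, not the paper's solver.

```python
import numpy as np
from scipy.integrate import solve_ivp

def christoffel(metric, x, eps=1e-5):
    """Christoffel symbols Gamma^k_{ij} of the metric field `metric` at x,
    using central finite differences for the partial derivatives of g."""
    D = len(x)
    g_inv = np.linalg.inv(metric(x))
    dg = np.zeros((D, D, D))                       # dg[l, i, j] = d g_{ij} / d x_l
    for l in range(D):
        e = np.zeros(D); e[l] = eps
        dg[l] = (metric(x + e) - metric(x - e)) / (2.0 * eps)
    # Gamma^k_{ij} = 1/2 g^{kl} (d_i g_{lj} + d_j g_{li} - d_l g_{ij})
    return 0.5 * (np.einsum('kl,ilj->kij', g_inv, dg)
                  + np.einsum('kl,jli->kij', g_inv, dg)
                  - np.einsum('kl,lij->kij', g_inv, dg))

def exp_map(metric, x, v):
    """Exponential map: follow the geodesic with initial point x and initial
    velocity v for unit time, by integrating c'' = -Gamma(c)(c', c')."""
    D = len(x)
    def rhs(_, state):
        c, cd = state[:D], state[D:]
        acc = -np.einsum('kij,i,j->k', christoffel(metric, c), cd, cd)
        return np.concatenate([cd, acc])
    sol = solve_ivp(rhs, (0.0, 1.0), np.concatenate([x, v]), rtol=1e-8, atol=1e-8)
    return sol.y[:D, -1]

if __name__ == "__main__":
    anchors = np.array([[2.0, 2.0], [4.0, 4.0]])
    basis = np.stack([2.0 * np.eye(2), np.eye(2)])
    def metric(x):                                  # smooth weighted metric field
        w = np.exp(-0.5 * np.sum((anchors - x) ** 2, axis=1)); w /= w.sum()
        return np.einsum('r,rij->ij', w, basis)
    x0, v0 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
    print("Exp_x(v) =", exp_map(metric, x0, v0))
```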
We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material. To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques. To learn the local metric we do the following. First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn a LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), 1 Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning 6 30 Euclidean Model Riemannian Model 24 20 18 16 20 15 10 5 14 12 0 (a) 25 22 Running Time (sec.) Average Prediction Error 26 10 (b) 20 Dimensionality 0 0 30 50 (c) 100 Dimensionality 150 (d) 4 3 3 2 2 1 1 0 −1 −2 −3 −4 −4 −3 −2 −1 0 1 2 3 4 Shoulder breadth 20 −2 −3 Euclidean Model Riemannian Model 0 −1 25 Prediction Error 4 15 10 0 −4 −5 0 4 10 15 20 Dimensionality 16 25 30 35 17 3 3 5 5 Euclidean Model Riemannian Model 2 15 2 1 1 Prediction Error Shoulder to wrist distance Figure 3: Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality. 0 0 −1 −2 −1 −3 14 13 12 11 −2 −4 −3 −4 −4 10 −5 −3 −2 −1 0 1 Euclidean PCA 2 3 −6 −4 9 0 −2 0 2 4 Tangent Space PCA (PGA) 6 5 10 15 20 Dimensionality 25 30 35 Regression Error Figure 4: Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared. compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space. As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen the learned metrics provide principal geodesics, which are more strongly related with the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation. 
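The "map to the tangent space, then run the Euclidean method" recipe used here can be written generically. The sketch below assumes the logarithmic-map coordinates v_i = Log_µ(x_i) have already been computed (by whatever geodesic solver is available) and are passed in as an array; it then performs ordinary PCA and least-squares regression on those coordinates, which is the Euclidean part of PGA and of the tangent-space regression described in the text. All names and the synthetic data are assumptions of this sketch.

```python
import numpy as np

def tangent_pca(V, n_components=2):
    """PCA of tangent-space coordinates V (n_samples x D), i.e. the Euclidean
    step of Principal Geodesic Analysis once Log_mu has been applied."""
    Vc = V - V.mean(axis=0)
    _, s, Wt = np.linalg.svd(Vc, full_matrices=False)
    return Wt[:n_components], (s ** 2) / (len(V) - 1)   # directions, variances

def tangent_regression(V, y):
    """Least-squares fit of a scalar target y from tangent coordinates V."""
    A = np.hstack([V, np.ones((len(V), 1))])             # affine design matrix
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = rng.normal(size=(200, 5))                        # stand-in Log_mu coordinates
    y = V @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
    directions, variances = tangent_pca(V, n_components=2)
    print("explained variances:", np.round(variances[:2], 2))
    print("regression coefficients:", np.round(tangent_regression(V, y), 2))
```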
To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model. To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics.

To provide visual insight into the behavior of the learned metrics, we uniformly sample body shapes along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material and are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements: e.g., shoulder-to-wrist distance is related to overall body size, while shoulder breadth is related to body weight.

Figure 5 (panels: shoulder-to-wrist distance; shoulder breadth): Shapes corresponding to the mean (center) and ±7 standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material.

4.2 Scalability

The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3], where the original 28×28 gray-scale images were deskewed and projected onto their leading 164 Euclidean principal components (which capture 95% of the variance in the original data).

We learn one diagonal LMNN metric per class and associate it with the mean of the class. From this we construct a Riemannian manifold according to (6), compute the mean value on the manifold, and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, averaged across different dimensionalities. Plots of the results can be found in the supplementary material.
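To make the PGA computations used in these experiments concrete, here is a minimal sketch under the same assumptions as before: exp_map(x, v) and log_map(x, y) are given callables for the learned manifold (for instance, the ODE-based exponential map sketched earlier with the metric bound in, and a boundary-value solver for the logarithmic map along the lines of [14]). All names are illustrative and this is not the authors' Matlab code. As noted above, the log maps from the mean to every data point dominate the cost.

```python
import numpy as np

def intrinsic_mean(data, exp_map, log_map, x0, steps=100, step_size=0.5, tol=1e-8):
    """Mean on the manifold by gradient descent (in the spirit of Pennec [18]):
    average the log-mapped data in the tangent space and shoot back with Exp."""
    mu = np.array(x0, dtype=float)
    for _ in range(steps):
        grad = np.mean([log_map(mu, x) for x in data], axis=0)
        if np.linalg.norm(grad) < tol:
            break
        mu = exp_map(mu, step_size * grad)
    return mu

def pga(data, log_map, mu, n_components=2):
    """Principal Geodesic Analysis: ordinary PCA in the tangent space at the
    mean; the principal geodesics are t -> Exp_mu(t * v_d)."""
    T = np.stack([log_map(mu, x) for x in data])       # (N, D) tangent vectors
    cov = np.cov(T, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1][:n_components]
    V = evecs[:, order]                                # principal directions v_d
    return V, T @ V                                    # directions and PGA scores
```

A nearest neighbor classifier in the tangent space, as used in the MNIST experiment above, can then simply operate on the returned scores.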
5 Discussion

This work shows that multi-metric learning techniques are applicable outside the realm of kNN classifiers. The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, and it is validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing. For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which is essentially a linear approach. Another approach was suggested by Hong et al. [23], who compute the principal components according to each metric separately, so that one low-dimensional model is learned per metric.

The suggested approach is, however, not free of difficulties in its current implementation. Currently, we use off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classification tasks, as this is all the current literature provides. We expect improvements from learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem.

Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311).

References

[1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press.
[2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011.
[4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 94–101. ACM, 2004.
[5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
[6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012.
[8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004.
[9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996.
[10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2:175–211, 2002.
[11] Manfredo Perdigão do Carmo. Riemannian Geometry. Birkhäuser Boston, January 1992.
[12] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[13] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007.
[14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001.
[15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980.
[16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004.
[17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987.
[18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999.
[19] Stefan Sommer, François Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010.
[20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999.
[21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005.
[22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, October 2012.
[23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011.

3 0.79332328 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

Author: Ben Calderhead, Mátyás A. Sustik

Abstract: One of the enduring challenges in Markov chain Monte Carlo methodology is the development of proposal mechanisms to make moves distant from the current point that are accepted with high probability and at low computational cost. The recent introduction of locally adaptive MCMC methods based on the natural underlying Riemannian geometry of such models goes some way to alleviating these problems for certain classes of models for which the metric tensor is analytically tractable; however, computational efficiency is not assured due to the necessity of potentially high-dimensional matrix operations at each iteration. In this paper we firstly investigate a sampling-based approach for approximating the metric tensor and suggest a valid MCMC algorithm that extends the applicability of Riemannian Manifold MCMC methods to statistical models that do not admit an analytically computable metric tensor. Secondly, we show how the approximation scheme we consider naturally motivates the use of ℓ1 regularisation to improve estimates and obtain a sparse approximate inverse of the metric, which enables stable and sparse approximations of the local geometry to be made. We demonstrate the application of this algorithm for inferring the parameters of a realistic system of ordinary differential equations using a biologically motivated robust Student-t error model, for which the Expected Fisher Information is analytically intractable. 1

4 0.64924884 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

Author: Jun Wang, Alexandros Kalousis, Adam Woznica

Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this "independence" approach delivers increased flexibility, its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function, we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing manifold regularization on the linear combinations, which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several large-scale classification problems, with tens of thousands of instances, and compared it with several state-of-the-art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner. 1

5 0.5137943 179 nips-2012-Learning Manifolds with K-Means and K-Flats

Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco

Abstract: We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-flats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-flats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-flats, both the results and the mathematical tools are new. 1

6 0.49599507 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

7 0.48887184 225 nips-2012-Multi-task Vector Field Learning

8 0.47923404 242 nips-2012-Non-linear Metric Learning

9 0.45188531 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

10 0.44524151 145 nips-2012-Gradient Weights help Nonparametric Regressors

11 0.43541929 34 nips-2012-Active Learning of Multi-Index Function Models

12 0.42736489 254 nips-2012-On the Sample Complexity of Robust PCA

13 0.42298055 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

14 0.41701758 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

15 0.40811193 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

16 0.40509605 304 nips-2012-Selecting Diverse Features via Spectral Regularization

17 0.40477371 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

18 0.40335089 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

19 0.40245801 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

20 0.39345005 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.062), (19, 0.216), (21, 0.026), (38, 0.129), (39, 0.014), (42, 0.032), (54, 0.049), (55, 0.025), (68, 0.013), (74, 0.059), (76, 0.171), (80, 0.08), (92, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86686152 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm

Author: Jingrui He, Hanghang Tong, Qiaozhu Mei, Boleslaw Szymanski

Abstract: Diversified ranking is a fundamental task in machine learning. It is broadly applicable in many real world problems, e.g., information retrieval, team assembling, product search, etc. In this paper, we consider a generic setting where we aim to diversify the top-k ranking list based on an arbitrary relevance function and an arbitrary similarity function among all the examples. We formulate it as an optimization problem and show that in general it is NP-hard. Then, we show that for a large volume of the parameter space, the proposed objective function enjoys the diminishing returns property, which enables us to design a scalable, greedy algorithm to find the (1 − 1/e) near-optimal solution. Experimental results on real data sets demonstrate the effectiveness of the proposed algorithm.

same-paper 2 0.82486022 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

Author: Suvrit Sra

Abstract: Symmetric positive definite (spd) matrices pervade numerous scientific disciplines, including machine learning and optimization. We consider the key task of measuring distances between two spd matrices; a task that is often nontrivial whenever the distance function must respect the non-Euclidean geometry of spd matrices. Typical non-Euclidean distance measures such as the Riemannian metric δR (X, Y ) = log(Y −1/2 XY −1/2 ) F , are computationally demanding and also complicated to use. To allay some of these difficulties, we introduce a new metric on spd matrices, which not only respects non-Euclidean geometry but also offers faster computation than δR while being less complicated to use. We support our claims theoretically by listing a set of theorems that relate our metric to δR (X, Y ), and experimentally by studying the nonconvex problem of computing matrix geometric means based on squared distances. 1

3 0.74517351 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

4 0.74469376 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

Author: Du Tran, Junsong Yuan

Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1

5 0.74280971 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

6 0.74069071 38 nips-2012-Algorithms for Learning Markov Field Policies

7 0.73989993 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

8 0.73920375 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

9 0.73843533 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

10 0.73823768 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

11 0.73778117 188 nips-2012-Learning from Distributions via Support Measure Machines

12 0.73775399 126 nips-2012-FastEx: Hash Clustering with Exponential Families

13 0.73756087 294 nips-2012-Repulsive Mixtures

14 0.73712218 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

15 0.73707682 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

16 0.7369715 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

17 0.7369566 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

18 0.73622376 232 nips-2012-Multiplicative Forests for Continuous-Time Processes

19 0.73619455 348 nips-2012-Tractable Objectives for Robust Policy Optimization

20 0.73514569 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models