nips nips2012 nips2012-242 knowledge-graph by maker-knowledge-mining

242 nips-2012-Non-linear Metric Learning


Source: pdf

Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger

Abstract: In this paper, we introduce two novel metric learning algorithms, χ2 -LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. The two approaches achieve this goal in fundamentally different ways: χ2 -LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear χ2 -distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach’s robustness, speed, parallelizability and insensitivity towards the single additional hyperparameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of χ2 -LMNN, obtain best results in 19 out of 20 learning settings. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract In this paper, we introduce two novel metric learning algorithms, χ2 -LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. [sent-21, score-0.17]

2 Guided by this motivation, a surge of recent research [10, 13, 15, 24, 31, 32] has focused on Mahalanobis metric learning. [sent-28, score-0.196]

3 The resulting methods greatly improve the performance of metric dependent algorithms, such as k-means clustering and kNN classification, and have gained popularity in many research areas and applications within and beyond machine learning. [sent-29, score-0.17]

4 One reason for this success is the out-of-the-box usability and robustness of several popular methods to learn these linear metrics. [sent-30, score-0.103]

5 So far, non-linear approaches [6, 18, 26, 30] to metric learning have not managed to replicate this success. [sent-31, score-0.17]

6 The two algorithms follow different approaches to achieve this goal: (i) Our first algorithm, χ2 -LMNN is specialized for histogram data. [sent-36, score-0.188]

7 It generalizes the non-linear χ2 -distance and learns a metric that strictly preserves the histogram properties of input data on a probability simplex. [sent-37, score-0.469]

8 It successfully combines the simplicity and elegance of the LMNN objective and the domain-specific expressiveness of the χ2 -distance. [sent-38, score-0.076]

9 (ii) Our second algorithm, gradient boosted LMNN (GB-LMNN), employs a non-linear mapping combined with a traditional Euclidean distance function. [sent-39, score-0.277]

10 It is a natural extension of LMNN from linear to non-linear mappings. [sent-40, score-0.061]

11 By training the non-linear transformation directly in function space with gradient-boosted regression trees (GBRT) [11], the resulting algorithm inherits the positive aspects of GBRT: its insensitivity to hyper-parameters, robustness against overfitting, speed, and natural parallelizability [28]. [sent-41, score-0.458]

12 We demonstrate the efficacy of both algorithms on several real-world data sets and observe two noticeable trends: i) GB-LMNN (with default settings) achieves state-of-the-art k-nearest neighbor classification errors with high consistency across all our data sets. [sent-43, score-0.075]

13 On more complex data sets it reliably improves over linear metrics and matches or outperforms previous work on non-linear metric learning. [sent-45, score-0.285]

14 ii) For data sampled from a simplex, χ2 -LMNN is strongly superior to alternative approaches that do not explicitly incorporate the histogram aspect of the data—in fact it obtains best results in 19/20 learning settings. [sent-46, score-0.161]

15 Large margin nearest neighbors (LMNN) [30, 31] is an algorithm that learns a Mahalanobis metric specifically to improve the error of k-nearest neighbor (kNN) classification [7]. [sent-54, score-0.583]

16 As the kNN rule relies heavily on the underlying metric (a test input is classified by a majority vote amongst its k nearest neighbors), it is a good indicator for the quality of the metric in use. [sent-55, score-0.444]

17 LMNN identifies two types of neighborhood relations between an input xi and other inputs in the data set: For each xi , as a first step, k dedicated target neighbors are identified prior to learning. [sent-59, score-0.639]

18 These are the inputs that should ideally be the actual nearest neighbors after applying the transformation (we use the notation j ⇝ i to indicate that xj is a target neighbor of xi). [sent-60, score-0.881]

19 A common heuristic for choosing target neighbors is picking the k closest inputs (according to the Euclidean distance) to a given xi within the same class. [sent-61, score-0.453]
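As a rough illustration of this heuristic (not the authors' implementation), the sketch below picks, for every input, the k nearest same-class points under the Euclidean distance; all function and variable names are hypothetical.

# Hypothetical sketch of the target-neighbor heuristic: for each input x_i,
# take the k closest same-class inputs under the Euclidean distance.
# Assumes every class contains at least k+1 points.
import numpy as np

def target_neighbors(X, y, k=3):
    """Return an (n, k) array of target-neighbor indices for each row of X."""
    n = X.shape[0]
    targets = np.zeros((n, k), dtype=int)
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]                       # exclude the point itself
        d = np.linalg.norm(X[same] - X[i], axis=1)   # Euclidean distances
        targets[i] = same[np.argsort(d)[:k]]         # k closest same-class inputs
    return targets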

20 These are inputs that should not be among the k-nearest neighbors — defined to be all inputs from a different class that are within the local neighborhood of xi . [sent-63, score-0.602]

21 The LMNN objective has two terms, one for each neighborhood objective: First, it reduces the distance between an instance and its target neighbors, thus pulling them closer and making the input’s local neighborhood smaller. [sent-65, score-0.294]

22 Second, it pushes impostors (i.e., differently labeled inputs) farther away, so that the distances to impostors exceed the distances to target neighbors by a large margin. [sent-68, score-0.383]

23 As an extension to the original LMNN formulation, [26, 30] show that with L ∈ Rr×d with r < d, LMNN learns a projection into a lower-dimensional space Rr that still represents domain specific similarities. [sent-74, score-0.079]

24 χ2 -LMNN: Non-linear Distance Functions on the Probability Simplex. The original LMNN algorithm learns a linear transformation L ∈ Rd×d that captures semantic similarity for kNN classification on data in some Euclidean vector space Rd. [sent-76, score-0.259]

25 In this section we extend this formulation to settings in which data are sampled from a probability simplex S d = {x ∈ Rd : x ≥ 0, x⊤1 = 1}, where 1 ∈ Rd denotes the vector of all-ones. [sent-77, score-0.095]

26 Each input xi ∈ S d can be interpreted as a histogram over d buckets. [sent-78, score-0.307]

27 Histogram distances. The abundance of such data has sparked the development of several specialized distance metrics designed to compare histograms. [sent-81, score-0.215]

28 Examples are the Quadratic-Form distance [16], Earth Mover's Distance [21], the Quadratic-Chi distance family [20] and the χ2 histogram distance [16]. [sent-82, score-0.095]

29 [Figure 1 caption: the mapping is constrained to preserve all inputs on the simplex S 3 (grey surface); the arrows indicate the push (red and yellow) and pull (blue) forces from the χ2 -LMNN objective.] [sent-83, score-0.195]

32 Transforming the inputs with a linear transformation learned with LMNN will almost certainly result in a loss of their histogram properties — and the ability to use such distances. [sent-87, score-0.544]

33 We focus on the χ2 histogram distance, whose origin is the χ2 statistical hypothesis test [19], and which has successfully been applied in many domains [8, 27, 29]. [sent-91, score-0.161]

34 The χ2 distance is a bin-to-bin distance measurement, which takes into account the size of the bins and their differences. [sent-92, score-0.212]

35 Formally, the χ2 distance is a well-defined metric χ2 : S d × S d → [0, 1], defined as [20]: χ2(xi, xj) = (1/2) Σ_{f=1}^{d} ([xi]_f − [xj]_f)^2 / ([xi]_f + [xj]_f),  (3) where [xi]_f denotes the f-th feature value of the vector xi. [sent-93, score-0.521]
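A minimal sketch of the χ2 histogram distance in (3), assuming xi and xj are non-negative histograms on the simplex; the small eps term is our addition to guard against empty bins and is not part of the original definition.

import numpy as np

def chi2_distance(xi, xj, eps=1e-12):
    """Chi-squared histogram distance of Eq. (3): 0.5 * sum_f (xi_f - xj_f)^2 / (xi_f + xj_f)."""
    return 0.5 * np.sum((xi - xj) ** 2 / (xi + xj + eps))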

36 First, analogous to the generalized Euclidean metric in (1), we generalize the χ2 distance with a linear transformation and introduce the pseudo-metric χ2_L(xi, xj), defined as χ2_L(xi, xj) = χ2(Lxi, Lxj).  (4) [sent-95, score-0.753]

37 The χ2 distance is only a well-defined metric within the simplex S d, and we therefore constrain L to map any x onto S d. [sent-96, score-0.371]

38 We define the set of such simplex-preserving linear transformations as P = {L ∈ Rd×d : ∀x ∈ S d , Lx ∈ S d }. [sent-97, score-0.077]

39 To optimize the transformation L with respect to the χ2 histogram distance directly, we replace the Mahalanobis distance DL in (2) with χ2_L and obtain the following: min_{L ∈ P} Σ_{i,j: j⇝i} ( χ2_L(xi, xj) + µ Σ_{k: yi ≠ yk} [ ℓ + χ2_L(xi, xj) − χ2_L(xi, xk) ]_+ ).  (5) [sent-99, score-0.895]

40 Besides the substituted distance function, there are two important changes in the optimization problem (5) compared to (2). [sent-100, score-0.134]
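The following sketch evaluates the χ2 -LMNN objective (5) for a fixed simplex-preserving L, given precomputed target-neighbor and impostor index lists; the margin ell, the weight mu, and all names are illustrative assumptions rather than the authors' code.

import numpy as np

def chi2_lmnn_objective(L, X, targets, impostors, ell=0.1, mu=1.0):
    """Hinge-loss objective of Eq. (5) under the transformed chi^2 distance (illustrative)."""
    def chi2_L(a, b):
        u, v = L @ a, L @ b
        return 0.5 * np.sum((u - v) ** 2 / (u + v + 1e-12))
    loss = 0.0
    for i in range(X.shape[0]):
        for j in targets[i]:                          # pull target neighbors closer
            d_ij = chi2_L(X[i], X[j])
            loss += d_ij
            for k in impostors[i]:                    # push impostors beyond the margin
                loss += mu * max(0.0, ell + d_ij - chi2_L(X[i], X[k]))
    return loss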

41 Second, because (4) is not linear in L, different values for the margin parameter lead to truly different solutions (which differ not just up to a scaling factor as before). [sent-102, score-0.073]

42 These constraints are linear with respect to L and we can optimize (5) efficiently with a projected sub-gradient method [2]. [sent-112, score-0.097]

43 As an even faster optimization method, we propose a simple change of variables to generate an unconstrained version of (5). [sent-113, score-0.078]

44 This allows us to minimize (5) with respect to A using unconstrained sub-gradient descent. [sent-118, score-0.076]
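One plausible form of the change of variables, shown below as an assumption rather than the paper's exact choice, maps an unconstrained matrix A to a column-stochastic L = f(A) via a column-wise softmax; such an L keeps any x on the simplex, and its entries reach zero only in the limit, which is consistent with the footnote quoted further down.

import numpy as np

def f(A):
    """Column-wise softmax: f(A) has non-negative columns summing to one, so f(A) @ x stays on S^d."""
    E = np.exp(A - A.max(axis=0, keepdims=True))   # numerically stable softmax per column
    return E / E.sum(axis=0, keepdims=True)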

45 In practice, A is initialized in terms of the identity matrix I so that f(A) ≈ I, approximating the non-transformed χ2 histogram distance after the change of variable. [sent-120, score-0.267]

46 In this case χ2 -LMNN learns a projection into a lower dimensional simplex L : S d → S r . [sent-123, score-0.144]

47 This extension can be very valuable for enabling faster nearest neighbor search [33], especially in time-sensitive applications. [sent-125, score-0.174]

48 GB-LMNN: Non-linear Transformations with Gradient Boosting. Whereas Section 3 focuses on the learning scenario where a linear transformation is too general, in this section we explore the opposite case where it is too restrictive. [sent-129, score-0.181]

49 Affine transformations preserve collinearity and ratios of distances along lines. [sent-130, score-0.189]

50 That is, inputs on a straight line remain on a straight line and their relative distances are preserved. [sent-132, score-0.318]

51 [6] pioneered non-linear metric learning, using convolutional neural networks to learn embeddings for face-verification tasks. [sent-137, score-0.24]

52 Inspired by their work, we propose to optimize the LMNN objective (2) directly in function space with gradient boosted CART trees [11]. [sent-138, score-0.268]

53 Combining the learned transformation φ(x) : Rd → Rd with a Euclidean distance function has the capability to capture highly non-linear similarity relations. [sent-139, score-0.326]

54 To generalize the LMNN objective (2) to a non-linear transformation φ(·), we denote the Euclidean distance after the transformation as Dφ(xi, xj) = ‖φ(xi) − φ(xj)‖^2,  (7) which satisfies all properties of a well-defined pseudo-metric in the original input space. [sent-142, score-0.635]

55 To optimize the LMNN objective directly with respect to Dφ , we follow the same steps as in Section 3 and substitute Dφ for DL in (2). [sent-143, score-0.098]

56 In its most general form, with an unspecified mapping φ, (8) unifies most of the existing variations of LMNN metric learning. [sent-145, score-0.232]

57 The original linear LMNN mapping [31] is a special case where φ(x) = Lx. [sent-146, score-0.093]

58 Kernelized versions [5, 12, 26] are captured by φ(x) = Lψ(x), producing the kernel K(xi, xj) = φ(xi)⊤φ(xj) = ψ(xi)⊤L⊤Lψ(xj). [sent-147, score-0.134]

59 The embedding of Globerson and Roweis [14] corresponds to the most expressive mapping function φ(xi ) = zi , where each input xi is transformed independently to a new location zi to satisfy similarity constraints — without out-of-sample extensions. [sent-148, score-0.298]

60 The previous examples vary widely in expressiveness, scalability, and generalization, largely as a consequence of the mapping function φ. [sent-150, score-0.062]

61 It is important to find the right non-linear form for φ, and we believe an elegant solution lies in gradient boosted regression trees. [sent-151, score-0.143]

62 The construction of the mapping, an ensemble of multivariate regression trees selected by gradient boosting [11], minimizes the general LMNN objective (8) directly in function space. [sent-153, score-0.263]

63 [Footnote 1: The set of all possible matrices f(A) is slightly more restricted than P, as it reaches zero entries only in the limit.] [sent-154, score-0.15]

64 [Figure 2 caption: the true gradient (top row) with respect to each input and its least-squares approximation (bottom row) by a multivariate regression tree of depth p = 4.] [sent-157, score-0.226]

65 Formally, the GB-LMNN transformation is an additive function φ = φ0 + α Σ_{t=1}^{T} ht, initialized by φ0 and constructed by iteratively adding regression trees ht of limited depth p [4], each weighted by a learning rate α. [sent-158, score-0.504]
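A minimal sketch of applying the additive GB-LMNN mapping φ(x) = φ0(x) + α Σt ht(x), assuming φ0(x) = Lx (the linear LMNN initialization mentioned below) and a list of already-fitted multivariate regression trees with a scikit-learn-style predict interface; names are illustrative.

import numpy as np

def gb_lmnn_transform(X, L, trees, alpha=0.01):
    """Apply the additive ensemble phi = phi_0 + alpha * sum_t h_t to the rows of X."""
    Phi = X @ L.T                       # linear initialization phi_0(x) = L x
    for tree in trees:                  # each tree outputs a vector of the same dimension as phi_0
        Phi += alpha * tree.predict(X)
    return Phi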

66 Individually, the trees are weak learners and are capable of learning only simple functions, but additively they form powerful ensembles with good generalization to out-of-sample data. [sent-159, score-0.169]

67 In iteration t, the tree ht is selected greedily to best minimize the objective upon its addition to the ensemble, φt(·) = φt−1(·) + α ht(·), where ht ≈ argmin_{h ∈ T^p} L(φt−1 + αh).  (9) [sent-160, score-0.455]

68 Here, T^p denotes the set of all regression trees of depth p. [sent-161, score-0.18]

69 The (approximately) optimal tree ht is found by a first-order Taylor approximation of L. [sent-162, score-0.21]

70 This makes the optimization akin to a steepest descent step in function space, where ht is selected to approximate the negative gradient gt of the objective L(φt−1 ) with respect to the transformation learned at the previous iteration φt−1 . [sent-163, score-0.548]

71 Since we learn an approximation of gt as a function of the training data, sub-gradients are computed with respect to each training input xi and approximated by the tree ht(·) in the least-squares sense: ht(·) = argmin_{h ∈ T^p} Σ_{i=1}^{n} (gt(xi) − h(xi))^2, where gt(xi) = ∂L(φt−1) / ∂φt−1(xi).  (10) [sent-164, score-0.856]

72 Intuitively, at each iteration, the tree ht(·) of depth p splits the input space into 2^p axis-aligned regions. [sent-165, score-0.33]
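The sketch below shows one boosting iteration under stated assumptions: the sub-gradient of the large-margin objective with respect to the current embedding of every training input is supplied by a placeholder grad_objective (not a real API), a depth-p multivariate regression tree is fit to its negation in the least-squares sense as in (10), and the tree is added to the ensemble; scikit-learn's multi-output DecisionTreeRegressor stands in for the paper's pGBRT trees.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_step(X, Phi, grad_objective, trees, alpha=0.01, depth=6):
    """One steepest-descent step in function space: fit a tree to the negative gradient."""
    G = grad_objective(Phi)                          # (n, d) gradient w.r.t. each phi(x_i)
    tree = DecisionTreeRegressor(max_depth=depth)    # splits the space into axis-aligned regions
    tree.fit(X, -G)                                  # least-squares fit of the negative gradient
    trees.append(tree)
    return Phi + alpha * tree.predict(X)             # phi_t = phi_{t-1} + alpha * h_t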

73 All inputs that fall into one region are translated by a constant vector — consequently, the inputs in different regions are shifted in different directions. [sent-166, score-0.322]

74 We learn the trees greedily with a modified version of the public-domain CART implementation pGBRT [28]. [sent-167, score-0.182]

75 Since (8) is non-convex with respect to φ, we initialize with the linear transformation learned by LMNN, φ0 = Lx, making our method a non-linear refinement of LMNN. [sent-169, score-0.248]

76 The only additional hyperparameter to the optimization is the maximum tree depth p to which the algorithm is not particularly sensitive (we set p = 6). [sent-170, score-0.135]

77 Figure 2 depicts a simple toy example with concentric circles of inputs from two different classes. [sent-171, score-0.248]

78 By design, the inputs are sampled such that the nearest neighbor for any given input is from the other class. [sent-172, score-0.34]

79 A linear transformation is incapable of separating the two classes. [sent-173, score-0.207]

80 However, GB-LMNN produces a mapping with the desired separation. [sent-174, score-0.062]

81 The limited-depth regression trees are unable to capture the gradient for all inputs in a single iteration. [sent-176, score-0.338]

82 But by greedily focusing on inputs with the largest gradients or groups of inputs with the most easily encoded gradients, the gradient boosting process additively constructs the transformation function. [sent-177, score-0.707]

83 At iteration 100, the gradients with respect to most inputs vanish, indicating that a local minimum of L(φ) is almost reached — the inputs from the two classes are separated by a large margin. [sent-178, score-0.376]

84 As with linear LMNN and χ2 -LMNN, it is possible to learn a non-linear transformation into a lower-dimensional space, φ : Rd → Rr with r ≤ d. [sent-182, score-0.225]

85 Training proceeds by learning trees with r- rather than d-dimensional outputs. [sent-184, score-0.087]

86 Related Work. There have been previous attempts to generalize the learning of linear distances to nonlinear metrics. [sent-185, score-0.174]

87 They partition the space into multiple regions, and jointly learn a separate metric for each region—however, these local metrics do not give rise to a global metric and distances between inputs within different regions are not well-defined. [sent-192, score-0.718]

88 Of particular relevance to our GB-LMNN work is the use of boosting ensembles to learn distances between bit-vectors [1, 23]. [sent-195, score-0.223]

89 Note that their goals are to preserve distances computed by locality sensitive hashing to enable fast search and retrieval. [sent-196, score-0.143]

90 Ours are very different: we alter the distances discriminatively to minimize classification error. [sent-197, score-0.089]

91 Our work on χ2 -LMNN echoes the recent interest in learning earth-mover-distance (EMD) which is also frequently used in measuring similarities between histogram-type data [9]. [sent-198, score-0.069]

92 Despite its name, EMD is not necessarily a metric [20]. [sent-199, score-0.17]

93 Experimental Results. We evaluate our non-linear metric learning algorithms against several competitive methods. [sent-201, score-0.17]

94 The effectiveness of learned metrics is assessed by kNN classification error. [sent-202, score-0.125]

95 GB-LMNN We compare the non-linear global metric learned by GB-LMNN to three linear metrics: the Euclidean metric and metrics learned by LMNN [31] and Information-Theoretic Metric Learning (ITML) [10]. [sent-208, score-0.537]

96 We also compare to the metrics learned by Multi-Metric LMNN (M 2 -LMNN) [30]. [sent-210, score-0.125]

97 M 2 -LMNN learns |C| linear metrics, one for each of the input labels. [sent-211, score-0.115]

98 We also apply GB-LMNN to four datasets with histogram data — setting the stage for an interesting comparison to χ2 -LMNN below. [sent-223, score-0.161]

99 Neither method evaluated so far is specifically adapted to histogram features. [sent-228, score-0.161]

100 Especially linear models, such as LMNN and ITML, are expected to fumble over the intricate similarities that such data exhibit. [Footnote 3: In the case of ISOLET, which consists of audio signals of spoken letters by different individuals, the holdout set consisted of one speaker.] [sent-229, score-0.132]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lmnn', 0.698), ('knn', 0.183), ('metric', 0.17), ('ht', 0.162), ('histogram', 0.161), ('inputs', 0.161), ('transformation', 0.15), ('xj', 0.134), ('neighbors', 0.129), ('xi', 0.111), ('dl', 0.107), ('distance', 0.106), ('lx', 0.098), ('simplex', 0.095), ('distances', 0.089), ('trees', 0.087), ('metrics', 0.084), ('rr', 0.081), ('rd', 0.077), ('mahalanobis', 0.075), ('neighbor', 0.075), ('nearest', 0.069), ('euclidean', 0.067), ('mapping', 0.062), ('concentric', 0.06), ('gbrt', 0.06), ('impostor', 0.06), ('insensitivity', 0.06), ('parallelizability', 0.06), ('depth', 0.059), ('gradient', 0.056), ('preserve', 0.054), ('boosting', 0.054), ('boosted', 0.053), ('isolet', 0.053), ('gt', 0.053), ('target', 0.052), ('greedily', 0.051), ('unconstrained', 0.05), ('itml', 0.049), ('learns', 0.049), ('tree', 0.048), ('transformations', 0.046), ('gert', 0.046), ('additively', 0.046), ('emd', 0.046), ('classi', 0.044), ('learn', 0.044), ('cart', 0.044), ('expressiveness', 0.044), ('pull', 0.044), ('similarities', 0.043), ('margin', 0.042), ('learned', 0.041), ('yk', 0.04), ('neighborhood', 0.04), ('optimize', 0.04), ('inherits', 0.039), ('xk', 0.038), ('usps', 0.037), ('push', 0.036), ('ensembles', 0.036), ('input', 0.035), ('letters', 0.034), ('straight', 0.034), ('regression', 0.034), ('objective', 0.032), ('linear', 0.031), ('transformed', 0.031), ('weinberger', 0.03), ('extension', 0.03), ('expressive', 0.03), ('similarity', 0.029), ('robustness', 0.028), ('mappings', 0.028), ('generalize', 0.028), ('optimization', 0.028), ('gradients', 0.028), ('specialized', 0.027), ('depicts', 0.027), ('plagued', 0.026), ('sparked', 0.026), ('echoes', 0.026), ('surge', 0.026), ('pioneered', 0.026), ('dor', 0.026), ('feisha', 0.026), ('incapable', 0.026), ('mo', 0.026), ('nonlinear', 0.026), ('splits', 0.026), ('respect', 0.026), ('histograms', 0.025), ('pulling', 0.024), ('codebooks', 0.024), ('font', 0.024), ('holdout', 0.024), ('mover', 0.024), ('impostors', 0.024), ('louis', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 242 nips-2012-Non-linear Metric Learning

Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger

Abstract: In this paper, we introduce two novel metric learning algorithms, χ2 -LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. The two approaches achieve this goal in fundamentally different ways: χ2 -LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear χ2 -distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach’s robustness, speed, parallelizability and insensitivity towards the single additional hyperparameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of χ2 -LMNN, obtain best results in 19 out of 20 learning settings. 1

2 0.26114827 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

Author: Jun Wang, Alexandros Kalousis, Adam Woznica

Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this ”independence” approach delivers an increased flexibility its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several largescale classification problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner. 1

3 0.24061938 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Author: Matthew Der, Lawrence K. Saul

Abstract: We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model’s continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model’s main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective. 1

4 0.17736621 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classifiers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the first practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufficient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classification tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Specifically, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is defined as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the first principal component is computed. 
(c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our final model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive definite D × D matrix. Classic approaches for finding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-definite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classifier [5], which finds a linear transformation that satisfies local distance constraints, making the approach suitable for multi-modal classes. For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classifier. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] find a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. 
To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the first and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same figure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? For kNN classifiers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This impose a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. 
However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work. 2.1 Geodesics and Riemannian Geometry We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]: Definition A Riemannian metric M on a manifold M is a smoothly varying inner product < a, b >x = aT M(x)b in the tangent space Tx M of each point x ∈ M . 3 Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold M, the length of a smooth curve c : [0, 1] → M is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve: 1 Length(c) = 1 c (λ) M(c(λ)) dλ c (λ)T M(c(λ))c (λ)dλ , = (4) 0 0 where c denotes the derivative of c and M(c(λ)) is the metric tensor at c(λ). A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e. (5) cgeo = arg min Length(c) with c(0) = x and c(1) = y . c The distance between x and y is defined as the length of the geodesic. Given a tangent vector v ∈ Tx M, there exists a unique geodesic cv (t) with initial velocity v at x. The Riemannian exponential map, Expx , maps v to a point on the manifold along the geodesic cv at t = 1. This mapping preserves distances such that dist(cv (0), cv (1)) = v . The inverse of the exponential map is the Riemannian logarithmic map denoted Logx . Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied and the results are mapped back to the manifold. 3 A Metric Feature Space With the preliminaries settled we define the new model. Let C = RD denote the feature space. We endow C with a metric tensor in every point x, which we define akin to (3), R M(x) = wr (x)Mr , where wr (x) = r=1 wr (x) ˜ R ˜ j=1 wj (x) , (6) with wr > 0. The only difference from (3) is that we shall not restrict ourselves to binary weight ˜ functions wr . We assume the metric tensors Mr have already been learned; Sec. 4 contain examples ˜ where they have been learned using LMNN [5] and LDA [9]. From the definition of a Riemannian metric, we trivially have the following result: Lemma 1 The space C = RD endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights wr (x) change smoothly with x. Hence, by only considering smooth weight functions wr we get a well-studied geometric structure ˜ on the feature space, which ensures us that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c. As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. 
They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric. Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODE’s) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques. 4 3.1 Computing Geodesics, Maps and Statistics At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e. ∂L d ∂L , where L(λ, c, c ) = c (λ)T M(c(λ))c (λ) . = ∂c dλ ∂c As we have an explicit expression for the metric tensor we can compute (7) in closed form: (7) Theorem 2 Geodesic curves in C satisfy the following system of 2nd order ODE’s M(c(λ))c (λ) = − 1 ∂vec [M(c(λ))] 2 ∂c(λ) T (c (λ) ⊗ c (λ)) , (8) where ⊗ denotes the Kronecker product and vec [·] stacks the columns of a matrix into a vector [13]. Proof See supplementary material. This result holds for any smooth weight functions wr . We, however, still need to compute ∂vec[M] , ˜ ∂c which depends on the specific choice of wr . Any smooth weighting scheme is applicable, but we ˜ restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result Theorem 3 For wr (x) = exp − ρ x − xr ˜ 2 ∂vec [M(c)] = ∂c the derivative of the metric tensor from (6) is R ρ R j=1 2 Mr R 2 wj ˜ T r=1 T wj (c − xj ) Mj − (c − xr ) Mr ˜ wr vec [Mr ] ˜ . (9) j=1 Proof See supplementary material. Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints c(0) = x and c(1) = y . (10) This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate C 1 –continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces. Computing Logarithmic Maps. Once a geodesic c is found, it follows from the definition of the logarithmic map, Logx (y), that it can be computed as v = Logx (y) = c (0) Length(c) . c (0) (11) In practice, we solve (8) by rewriting it as a system of first order ODE’s, such that we compute both c and c simultaneously (see supplementary material for details). Computing Exponential Maps. 
Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, Expx (v), finds the unique geodesic starting at x with initial velocity v. As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODE’s with the initial conditions c(0) = x and c (0) = v . (12) This initial value problem has a unique solution, which we find numerically using a standard RungeKutta scheme [15]. 5 3.1.1 Generalizing PCA and Regression At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent and mapping the results back. The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material. The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16], γvd (t) = Expµ (tvd ) , (13) th where vd denotes the d eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve γvd is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material. Linear regression has been generalized in a similar way [17] by performing regression in the tangent of the mean and mapping the resulting line back to the manifold using the exponential map. The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation as the logarithmic map is only guarantied to preserve distances to the origin of the tangent and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable. 4 Experiments To illustrate the framework1 we consider an example in human body analysis, and then we analyze the scalability of the approach. But first, to build intuition, Fig. 3a show synthetically generated data samples from two classes. We sample random points xr and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6) and Fig. 3b show the data in the tangent space of the resulting manifold. As can be seen the two classes are now globally further apart, which shows the effect of local metrics. 4.1 Human Body Shape We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. 
We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material. To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques. To learn the local metric we do the following. First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn a LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), 1 Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning 6 30 Euclidean Model Riemannian Model 24 20 18 16 20 15 10 5 14 12 0 (a) 25 22 Running Time (sec.) Average Prediction Error 26 10 (b) 20 Dimensionality 0 0 30 50 (c) 100 Dimensionality 150 (d) 4 3 3 2 2 1 1 0 −1 −2 −3 −4 −4 −3 −2 −1 0 1 2 3 4 Shoulder breadth 20 −2 −3 Euclidean Model Riemannian Model 0 −1 25 Prediction Error 4 15 10 0 −4 −5 0 4 10 15 20 Dimensionality 16 25 30 35 17 3 3 5 5 Euclidean Model Riemannian Model 2 15 2 1 1 Prediction Error Shoulder to wrist distance Figure 3: Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality. 0 0 −1 −2 −1 −3 14 13 12 11 −2 −4 −3 −4 −4 10 −5 −3 −2 −1 0 1 Euclidean PCA 2 3 −6 −4 9 0 −2 0 2 4 Tangent Space PCA (PGA) 6 5 10 15 20 Dimensionality 25 30 35 Regression Error Figure 4: Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared. compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space. As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen the learned metrics provide principal geodesics, which are more strongly related with the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation. 
To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model. To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics. To provide visual insights into the behavior of the learned metrics, we uniformly sample body shape along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder to wrist distance is related to overall body size, while shoulder breadth is related to body weight. 7 Shoulder to wrist distance Shoulder breadth Figure 5: Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material. 4.2 Scalability The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3] where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which captures 95% of the variance in the original data). We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material. 5 Discussion This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers. The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing. 
For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which essentially is a linear approach. Another approach was suggested by Hong et al. [23] who simply compute the principal components according to each metric separately, such that one low dimensional model is learned per metric. The suggested approach is, however, not difficulty-free in its current implementation. Currently, we are using off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classification tasks as this is all the current literature provides. We expect improvements by learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem. Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311). 8 References [1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Sch¨ lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing o Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press. [2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007. [3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011. [4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the twenty-first international conference on Machine learning, ICML ’04, pages 94–101. ACM, 2004. [5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009. [6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008. [7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012. [8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004. [9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996. [10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilaritybased classification. Journal of Machine Learning Research, 2:175–211, 2002. [11] Manfredo Perdigao do Carmo. Riemannian Geometry. Birkh¨ user Boston, January 1992. a [12] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. [13] Jan R. Magnus and Heinz Neudecker. 
Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007. [14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001. [15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980. [16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal Geodesic Analysis for the study of Nonlinear Statistics of Shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004. [17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987. [18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999. [19] Stefan Sommer, Francois Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact ¸ principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010. [20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999. [21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005. [22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In A. Fitzgibbon et al. (Eds.), editor, European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, oct 2012. [23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011. 9

5 0.17625082 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain

Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1
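The abstract above learns a similarity measure from noisy, partially observed crowd annotations via matrix completion. The sketch below shows only a generic low-rank imputation of such a similarity matrix (iterated truncated SVD); it is not the algorithm proposed in the paper, and the rank and iteration count are arbitrary assumptions.

import numpy as np

def complete_similarity(S_obs, observed, rank=5, n_iters=200):
    # S_obs: n x n matrix of (noisy) averaged crowd votes; arbitrary where unobserved.
    # observed: boolean n x n mask marking pairs that actually received annotations.
    S = np.where(observed, S_obs, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-k approximation of S
        S = np.where(observed, S_obs, low_rank)           # keep the observed votes fixed
    return low_rank

A completed matrix of this kind could then supervise a metric fitted on the low-level features, which is where the paper's actual contribution lies.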

6 0.15007159 148 nips-2012-Hamming Distance Metric Learning

7 0.10753005 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

8 0.096709445 200 nips-2012-Local Supervised Learning through Space Partitioning

9 0.09594398 195 nips-2012-Learning visual motion in recurrent neural networks

10 0.085665718 168 nips-2012-Kernel Latent SVM for Visual Recognition

11 0.085531034 227 nips-2012-Multiclass Learning with Simplex Coding

12 0.085477352 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

13 0.080976665 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

14 0.077791519 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

15 0.074049778 204 nips-2012-MAP Inference in Chains using Column Generation

16 0.070568442 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

17 0.063994408 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

18 0.063652612 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

19 0.062356111 279 nips-2012-Projection Retrieval for Classification

20 0.059899766 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, 0.05), (2, -0.03), (3, -0.074), (4, 0.092), (5, -0.032), (6, 0.0), (7, 0.108), (8, 0.051), (9, 0.013), (10, 0.03), (11, -0.149), (12, 0.123), (13, 0.003), (14, -0.117), (15, 0.053), (16, -0.089), (17, 0.125), (18, 0.156), (19, 0.197), (20, 0.017), (21, -0.101), (22, 0.036), (23, 0.073), (24, 0.087), (25, -0.05), (26, -0.044), (27, -0.093), (28, 0.151), (29, 0.046), (30, -0.066), (31, -0.052), (32, 0.041), (33, -0.02), (34, 0.027), (35, 0.1), (36, 0.113), (37, 0.045), (38, -0.047), (39, 0.054), (40, -0.127), (41, 0.058), (42, 0.011), (43, 0.06), (44, 0.029), (45, 0.143), (46, 0.044), (47, 0.023), (48, -0.076), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95331395 242 nips-2012-Non-linear Metric Learning

Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger

Abstract: In this paper, we introduce two novel metric learning algorithms, χ2 -LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. The two approaches achieve this goal in fundamentally different ways: χ2 -LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear χ2 -distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach’s robustness, speed, parallelizability and insensitivity towards the single additional hyperparameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of χ2 -LMNN, obtain best results in 19 out of 20 learning settings. 1
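Since χ2-LMNN is built around the χ2-distance between histograms, a small reference implementation of that distance may be helpful. This is the standard symmetric histogram χ2-distance; the linear mapping that χ2-LMNN learns and applies to the histograms beforehand is deliberately omitted, and the example values are made up.

import numpy as np

def chi2_distance(p, q, eps=1e-12):
    # Symmetric chi-squared distance between two histograms on the probability simplex.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

# Illustrative histograms only; real inputs would be normalised feature histograms.
h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.3, 0.3, 0.4])
print(chi2_distance(h1, h2))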

2 0.89525461 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

Author: Jun Wang, Alexandros Kalousis, Adam Woznica

Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of unrelated local metrics. While this "independence" approach delivers increased flexibility, its downside is a considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function, we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing manifold regularization on the linear combinations, which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several large-scale classification problems, with tens of thousands of instances, and compared it with several state-of-the-art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner. 1

3 0.72142524 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classifiers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the first practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufficient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classification tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Specifically, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is defined as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the first principal component is computed. 
(c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our final model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive definite D × D matrix. Classic approaches for finding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-definite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classifier [5], which finds a linear transformation that satisfies local distance constraints, making the approach suitable for multi-modal classes. For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classifier. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] find a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. 
To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the first and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same figure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? For kNN classifiers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This impose a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. 
However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work. 2.1 Geodesics and Riemannian Geometry We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]: Definition A Riemannian metric M on a manifold M is a smoothly varying inner product < a, b >x = aT M(x)b in the tangent space Tx M of each point x ∈ M . 3 Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold M, the length of a smooth curve c : [0, 1] → M is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve: 1 Length(c) = 1 c (λ) M(c(λ)) dλ c (λ)T M(c(λ))c (λ)dλ , = (4) 0 0 where c denotes the derivative of c and M(c(λ)) is the metric tensor at c(λ). A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e. (5) cgeo = arg min Length(c) with c(0) = x and c(1) = y . c The distance between x and y is defined as the length of the geodesic. Given a tangent vector v ∈ Tx M, there exists a unique geodesic cv (t) with initial velocity v at x. The Riemannian exponential map, Expx , maps v to a point on the manifold along the geodesic cv at t = 1. This mapping preserves distances such that dist(cv (0), cv (1)) = v . The inverse of the exponential map is the Riemannian logarithmic map denoted Logx . Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied and the results are mapped back to the manifold. 3 A Metric Feature Space With the preliminaries settled we define the new model. Let C = RD denote the feature space. We endow C with a metric tensor in every point x, which we define akin to (3), R M(x) = wr (x)Mr , where wr (x) = r=1 wr (x) ˜ R ˜ j=1 wj (x) , (6) with wr > 0. The only difference from (3) is that we shall not restrict ourselves to binary weight ˜ functions wr . We assume the metric tensors Mr have already been learned; Sec. 4 contain examples ˜ where they have been learned using LMNN [5] and LDA [9]. From the definition of a Riemannian metric, we trivially have the following result: Lemma 1 The space C = RD endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights wr (x) change smoothly with x. Hence, by only considering smooth weight functions wr we get a well-studied geometric structure ˜ on the feature space, which ensures us that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c. As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. 
They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric. Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODE’s) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques. 4 3.1 Computing Geodesics, Maps and Statistics At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e. ∂L d ∂L , where L(λ, c, c ) = c (λ)T M(c(λ))c (λ) . = ∂c dλ ∂c As we have an explicit expression for the metric tensor we can compute (7) in closed form: (7) Theorem 2 Geodesic curves in C satisfy the following system of 2nd order ODE’s M(c(λ))c (λ) = − 1 ∂vec [M(c(λ))] 2 ∂c(λ) T (c (λ) ⊗ c (λ)) , (8) where ⊗ denotes the Kronecker product and vec [·] stacks the columns of a matrix into a vector [13]. Proof See supplementary material. This result holds for any smooth weight functions wr . We, however, still need to compute ∂vec[M] , ˜ ∂c which depends on the specific choice of wr . Any smooth weighting scheme is applicable, but we ˜ restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result Theorem 3 For wr (x) = exp − ρ x − xr ˜ 2 ∂vec [M(c)] = ∂c the derivative of the metric tensor from (6) is R ρ R j=1 2 Mr R 2 wj ˜ T r=1 T wj (c − xj ) Mj − (c − xr ) Mr ˜ wr vec [Mr ] ˜ . (9) j=1 Proof See supplementary material. Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints c(0) = x and c(1) = y . (10) This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate C 1 –continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces. Computing Logarithmic Maps. Once a geodesic c is found, it follows from the definition of the logarithmic map, Logx (y), that it can be computed as v = Logx (y) = c (0) Length(c) . c (0) (11) In practice, we solve (8) by rewriting it as a system of first order ODE’s, such that we compute both c and c simultaneously (see supplementary material for details). Computing Exponential Maps. 
Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, Expx (v), finds the unique geodesic starting at x with initial velocity v. As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODE’s with the initial conditions c(0) = x and c (0) = v . (12) This initial value problem has a unique solution, which we find numerically using a standard RungeKutta scheme [15]. 5 3.1.1 Generalizing PCA and Regression At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent and mapping the results back. The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material. The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16], γvd (t) = Expµ (tvd ) , (13) th where vd denotes the d eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve γvd is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material. Linear regression has been generalized in a similar way [17] by performing regression in the tangent of the mean and mapping the resulting line back to the manifold using the exponential map. The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation as the logarithmic map is only guarantied to preserve distances to the origin of the tangent and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable. 4 Experiments To illustrate the framework1 we consider an example in human body analysis, and then we analyze the scalability of the approach. But first, to build intuition, Fig. 3a show synthetically generated data samples from two classes. We sample random points xr and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6) and Fig. 3b show the data in the tangent space of the resulting manifold. As can be seen the two classes are now globally further apart, which shows the effect of local metrics. 4.1 Human Body Shape We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. 
We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material. To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques. To learn the local metric we do the following. First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn a LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), 1 Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning 6 30 Euclidean Model Riemannian Model 24 20 18 16 20 15 10 5 14 12 0 (a) 25 22 Running Time (sec.) Average Prediction Error 26 10 (b) 20 Dimensionality 0 0 30 50 (c) 100 Dimensionality 150 (d) 4 3 3 2 2 1 1 0 −1 −2 −3 −4 −4 −3 −2 −1 0 1 2 3 4 Shoulder breadth 20 −2 −3 Euclidean Model Riemannian Model 0 −1 25 Prediction Error 4 15 10 0 −4 −5 0 4 10 15 20 Dimensionality 16 25 30 35 17 3 3 5 5 Euclidean Model Riemannian Model 2 15 2 1 1 Prediction Error Shoulder to wrist distance Figure 3: Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality. 0 0 −1 −2 −1 −3 14 13 12 11 −2 −4 −3 −4 −4 10 −5 −3 −2 −1 0 1 Euclidean PCA 2 3 −6 −4 9 0 −2 0 2 4 Tangent Space PCA (PGA) 6 5 10 15 20 Dimensionality 25 30 35 Regression Error Figure 4: Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared. compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space. As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen the learned metrics provide principal geodesics, which are more strongly related with the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation. 
To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model. To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics. To provide visual insights into the behavior of the learned metrics, we uniformly sample body shape along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder to wrist distance is related to overall body size, while shoulder breadth is related to body weight. 7 Shoulder to wrist distance Shoulder breadth Figure 5: Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material. 4.2 Scalability The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3] where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which captures 95% of the variance in the original data). We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material. 5 Discussion This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers. The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing. 
For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which is essentially a linear approach. Another approach was suggested by Hong et al. [23], who compute the principal components according to each metric separately, such that one low-dimensional model is learned per metric. The suggested approach is, however, not free of difficulties in its current implementation. Currently, we are using off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classification tasks, as this is all the current literature provides. We expect improvements by learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem. Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311). References [1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press. [2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007. [3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011. [4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the twenty-first international conference on Machine learning, ICML ’04, pages 94–101. ACM, 2004. [5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009. [6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008. [7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012. [8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004. [9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996. [10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2:175–211, 2002. [11] Manfredo Perdigão do Carmo. Riemannian Geometry. Birkhäuser Boston, January 1992. [12] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. [13] Jan R. Magnus and Heinz Neudecker.
Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007. [14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001. [15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980. [16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal Geodesic Analysis for the study of Nonlinear Statistics of Shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004. [17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987. [18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999. [19] Stefan Sommer, Francois Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact ¸ principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010. [20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999. [21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005. [22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In A. Fitzgibbon et al. (Eds.), editor, European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, oct 2012. [23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011. 9
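The entry above generalises PCA by mapping the data to the tangent space at the intrinsic mean, performing Euclidean PCA there, and mapping the result back with the exponential map. The sketch below illustrates that recipe on the unit sphere, where the exponential and logarithmic maps have closed forms; it is a toy stand-in for the learned Riemannian manifold of the paper, whose maps must be computed numerically, and all function names are hypothetical.

import numpy as np

def sphere_exp(x, v):
    # Exponential map on the unit sphere: follow the great circle from x with velocity v.
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * v / nv

def sphere_log(x, y):
    # Logarithmic map: the tangent vector at x pointing towards y, with length dist(x, y).
    d = y - np.dot(x, y) * x
    nd = np.linalg.norm(d)
    if nd < 1e-12:
        return np.zeros_like(x)
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)) * d / nd

def intrinsic_mean(X, n_iters=50):
    # Karcher mean by repeated averaging in the tangent space at the current estimate.
    mu = X[0]
    for _ in range(n_iters):
        mu = sphere_exp(mu, np.mean([sphere_log(mu, x) for x in X], axis=0))
    return mu

def tangent_pca(X, n_components=2):
    mu = intrinsic_mean(X)
    V = np.array([sphere_log(mu, x) for x in X])      # data mapped into the tangent space
    _, _, Wt = np.linalg.svd(V, full_matrices=False)  # ordinary Euclidean PCA there
    return mu, Wt[:n_components]

The d-th principal geodesic is then the curve obtained by mapping t to sphere_exp(mu, t * W[d]), mirroring the construction described in the entry.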

4 0.71988332 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Author: Matthew Der, Lawrence K. Saul

Abstract: We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model’s continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model’s main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective. 1

5 0.66017628 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain

Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1

6 0.55841738 148 nips-2012-Hamming Distance Metric Learning

7 0.55363154 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

8 0.50824326 338 nips-2012-The Perturbed Variation

9 0.50143301 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

10 0.49989459 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

11 0.49216706 279 nips-2012-Projection Retrieval for Classification

12 0.45042878 146 nips-2012-Graphical Gaussian Vector for Image Categorization

13 0.44727874 200 nips-2012-Local Supervised Learning through Space Partitioning

14 0.44577628 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

15 0.43348172 145 nips-2012-Gradient Weights help Nonparametric Regressors

16 0.41934636 204 nips-2012-MAP Inference in Chains using Column Generation

17 0.38582295 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

18 0.37393108 330 nips-2012-Supervised Learning with Similarity Functions

19 0.37342232 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

20 0.36928797 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (17, 0.015), (21, 0.038), (38, 0.102), (42, 0.395), (54, 0.017), (55, 0.016), (74, 0.066), (76, 0.13), (80, 0.073), (92, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8685382 289 nips-2012-Recognizing Activities by Attribute Dynamics

Author: Weixin Li, Nuno Vasconcelos

Abstract: In this work, we consider the problem of modeling the dynamic structure of human activities in the attributes space. A video sequence is first represented in a semantic feature space, where each feature encodes the probability of occurrence of an activity attribute at a given time. A generative model, denoted the binary dynamic system (BDS), is proposed to learn both the distribution and dynamics of different activities in this space. The BDS is a non-linear dynamic system, which extends both the binary principal component analysis (PCA) and classical linear dynamic systems (LDS), by combining binary observation variables with a hidden Gauss-Markov state process. In this way, it integrates the representation power of semantic modeling with the ability of dynamic systems to capture the temporal structure of time-varying processes. An algorithm for learning BDS parameters, inspired by a popular LDS learning method from dynamic textures, is proposed. A similarity measure between BDSs, which generalizes the Binet-Cauchy kernel for LDS, is then introduced and used to design activity classifiers. The proposed method is shown to outperform similar classifiers derived from the kernel dynamic system (KDS) and state-of-the-art approaches for dynamics-based or attribute-based action recognition. 1
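The learning algorithm above is described as being inspired by a popular LDS estimation method from dynamic textures. For orientation, the sketch below shows that classical baseline, a PCA-plus-least-squares fit of a linear dynamic system; the binary observation model and the actual BDS learning procedure of the paper are not reproduced here, and the interface is an assumption.

import numpy as np

def learn_lds(Y, n_states):
    # Y: d x T matrix of observations over time. Classic dynamic-texture style fit:
    # PCA gives the observation matrix C, least squares gives the transition matrix A.
    Y0 = Y - Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y0, full_matrices=False)
    C = U[:, :n_states]
    X = s[:n_states, None] * Vt[:n_states]            # latent state trajectory, k x T
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])          # x_{t+1} is approximately A x_t
    return A, C, X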

same-paper 2 0.84614229 242 nips-2012-Non-linear Metric Learning

Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger

Abstract: In this paper, we introduce two novel metric learning algorithms, χ2 -LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. The two approaches achieve this goal in fundamentally different ways: χ2 -LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear χ2 -distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach’s robustness, speed, parallelizability and insensitivity towards the single additional hyperparameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of χ2 -LMNN, obtain best results in 19 out of 20 learning settings. 1

3 0.84398025 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

Author: Charles Blundell, Jeff Beck, Katherine A. Heller

Abstract: We present a Bayesian nonparametric model that discovers implicit social structure from interaction time-series data. Social groups are often formed implicitly, through actions among members of groups. Yet many models of social networks use explicitly declared relationships to infer social structure. We consider a particular class of Hawkes processes, a doubly stochastic point process, that is able to model reciprocity between groups of individuals. We then extend the Infinite Relational Model by using these reciprocating Hawkes processes to parameterise its edges, making events associated with edges co-dependent through time. Our model outperforms general, unstructured Hawkes processes as well as structured Poisson process-based models at predicting verbal and email turn-taking, and military conflicts among nations. 1
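To make the model class concrete, the sketch below simulates a plain univariate Hawkes process with an exponential kernel via Ogata-style thinning. The paper's reciprocating, group-structured construction is considerably richer; the kernel choice and parameter names here are assumptions for illustration only.

import numpy as np

def hawkes_intensity(t, events, mu, alpha, beta):
    # lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))
    return mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events if ti < t)

def simulate_hawkes(mu, alpha, beta, horizon, rng=None):
    # Ogata-style thinning; the intensity just after the current time upper-bounds
    # the intensity until the next event because the exponential kernel only decays.
    rng = np.random.default_rng() if rng is None else rng
    events, t = [], 0.0
    while True:
        lam_bar = mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
        t += rng.exponential(1.0 / lam_bar)
        if t >= horizon:
            return events
        if rng.uniform() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)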

4 0.79708666 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1

5 0.78858232 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. 1
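As a hedged illustration of the kind of estimator the abstract refers to, the snippet below forms a gradient estimate from a single pair of noisy function values along a random direction. The exact estimator, smoothing distribution and step sizes analysed in the paper may differ; this is only the generic two-point scheme.

import numpy as np

def two_point_gradient(f, x, delta=1e-2, rng=None):
    # Two noisy function values along a random unit direction estimate the directional
    # derivative; scaling by the dimension d turns this into a full-gradient estimate,
    # since E[u u^T] = I / d for u uniform on the unit sphere.
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)
    return x.size * (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u

# One stochastic-gradient step using this estimate:
# x = x - step_size * two_point_gradient(f, x)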

6 0.61897147 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

7 0.59726351 275 nips-2012-Privacy Aware Learning

8 0.5955618 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

9 0.58601767 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

10 0.58349365 285 nips-2012-Query Complexity of Derivative-Free Optimization

11 0.57965946 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

12 0.57490808 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

13 0.5717234 9 nips-2012-A Geometric take on Metric Learning

14 0.56957144 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

15 0.56338549 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

16 0.56106782 358 nips-2012-Value Pursuit Iteration

17 0.56024933 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

18 0.55733365 292 nips-2012-Regularized Off-Policy TD-Learning

19 0.55725229 217 nips-2012-Mixability in Statistical Learning

20 0.55263072 148 nips-2012-Hamming Distance Metric Learning