nips nips2005 nips2005-57 knowledge-graph by maker-knowledge-mining

57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification


Source: pdf

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. [sent-5, score-0.566]

2 The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. [sent-6, score-0.504]

3 On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1. [sent-7, score-0.292]

4 As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. [sent-9, score-0.23]

5 1 Introduction The k-nearest neighbors (kNN) rule [3] is one of the oldest and simplest methods for pattern classification. [sent-11, score-0.156]

6 The kNN rule classifies each unlabeled example by the majority label among its k-nearest neighbors in the training set. [sent-13, score-0.287]

7 Its performance thus depends crucially on the distance metric used to identify nearest neighbors. [sent-14, score-0.583]

8 In the absence of prior knowledge, most kNN classifiers use simple Euclidean distances to measure the dissimilarities between examples represented as vector inputs. [sent-15, score-0.224]

9 Euclidean distance metrics, however, do not capitalize on any statistical regularities in the data that might be estimated from a large training set of labeled examples. [sent-16, score-0.411]

10 Ideally, the distance metric for kNN classification should be adapted to the particular problem being solved. [sent-17, score-0.403]

11 It can hardly be optimal, for example, to use the same distance metric for face recognition as for gender identification, even if in both tasks, distances are computed between the same fixed-size images. [sent-18, score-0.685]

12 In fact, as shown by many researchers [2, 6, 7, 8, 12, 13], kNN classification can be significantly improved by learning a distance metric from labeled examples. [sent-19, score-0.564]

13 In this paper, we show how to learn a Mahanalobis distance metric for kNN classification. [sent-22, score-0.435]

14 The metric is optimized with the goal that k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. [sent-23, score-0.504]

15 Our goal for metric learning differs in a crucial way from those of previous approaches that minimize the pairwise distances between all similarly labeled examples [12, 13, 17]. [sent-24, score-0.731]

16 This latter objective is far more difficult to achieve and does not leverage the full power of kNN classification, whose accuracy does not require that all similarly labeled inputs be tightly clustered. [sent-25, score-0.299]

17 Our approach is largely inspired by recent work on neighborhood component analysis [7] and metric learning by energy-based models [2]. [sent-26, score-0.307]

18 Our approach has several parallels to learning in support vector machines (SVMs)—most notably, the goal of margin maximization and a convex objective function based on the hinge loss. [sent-30, score-0.371]

19 In light of these parallels, we describe our approach as large margin nearest neighbor (LMNN) classification. [sent-31, score-0.437]

20 Extensions of SVMs to multiclass problems typically involve combining the results of many binary classifiers, or they require additional machinery that is elegant but nontrivial [4]. [sent-34, score-0.146]

21 2 Model Let {(xi , yi )}n denote a training set of n labeled examples with inputs xi ∈ Rd and disi=1 crete (but not necessarily binary) class labels yi . [sent-37, score-0.526]

22 We use the binary matrix yij ∈ {0, 1} to indicate whether or not the labels yi and yj match. [sent-38, score-0.103]

23 Our goal is to learn a linear transformation L : Rd → Rd , which we will use to compute squared distances as: D(xi , xj ) = L(xi − xj ) 2 . [sent-39, score-0.434]

24 (1) Specifically, we want to learn the linear transformation that optimizes kNN classification when distances are measured in this way. [sent-40, score-0.26]

25 Target neighbors In addition to the class label yi , for each input xi we also specify k “target” neighbors— that is, k other inputs with the same label yi that we wish to have minimal distance to xi , as computed by eq. [sent-42, score-0.8]

26 In the absence of prior knowledge, the target neighbors can simply be identified as the k nearest neighbors, determined by Euclidean distance, that share the same label yi . [sent-44, score-0.474]

27 ) We use ηij ∈ {0, 1} to indicate whether input xj is a target neighbor of input xi . [sent-46, score-0.431]

28 Cost function Our cost function over the distance metrics parameterized by eq. [sent-48, score-0.38]

29 The first term penalizes large distances between each input and its target neighbors, while the second term penalizes small distances between each input and all other inputs that do not share the same label. [sent-50, score-0.769]

30 Specifically, the cost function is given by: ηij L(xi −xj ) ε(L) = ij 2 ηij (1−yil ) 1 + L(xi −xj ) +c 2 − L(xi −xl ) 2 + , ijl (2) where in the second term [z]+ = max(z, 0) denotes the standard hinge loss and c > 0 is some positive constant (typically set by cross validation). [sent-51, score-0.501]

31 Note that the first term only penalizes large distances between inputs and target neighbors, not between all similarly labeled examples. [sent-52, score-0.625]

32 Large margin The second term in the cost function incorporates the idea of a margin. [sent-53, score-0.232]

33 In particular, for each input xi , the hinge loss is incurred by differently labeled inputs whose distances do not exceed, by one absolute unit of distance, the distance from input xi to any of its target neighbors. [sent-54, score-1.241]

34 The cost function thereby favors distance metrics in which differently labeled inputs maintain a large margin of distance and do not threaten to “invade” each other’s neighborhoods. [sent-55, score-1.058]

35 BEFORE margin AFTER local neighborhood margin xi xi Similarly labeled Differently labeled target neighbor Differently labeled Figure 1: Schematic illustration of one input’s neighborhood xi before training (left) versus after training (right). [sent-58, score-1.485]

36 The distance metric is optimized so that: (i) its k = 3 target neighbors lie within a smaller radius after training; (ii) differently labeled inputs lie outside this smaller radius, with a margin of at least one unit distance. [sent-59, score-1.12]

37 Arrows indicate the gradients on distances arising from the optimization of the cost function. [sent-60, score-0.283]

38 In both cost functions, one term penalizes the norm of the “parameter” vector (i. [sent-63, score-0.169]

39 , the weight vector of the maximum margin hyperplane, or the linear transformation in the distance metric), while the other incurs the hinge loss for examples that violate the condition of unit margin. [sent-65, score-0.608]

40 Finally, just as the hinge loss in SVMs is only triggered by examples near the decision boundary, the hinge loss in eq. [sent-66, score-0.472]

41 (2) is only triggered by differently labeled examples that invade each other’s neighborhoods. [sent-67, score-0.37]

42 (1) as: D(xi , xj ) = (xi − xj ) M(xi − xj ), (3) where the matrix M = L L, parameterizes the Mahalanobis distance metric induced by the linear transformation L. [sent-74, score-0.714]

43 (2) as an SDP in terms of M is straightforward, since the first term is already linear in M = L L and the hinge loss can be “mimicked” by introducing slack variables ξij for all pairs of differently labeled inputs (i. [sent-76, score-0.661]

44 The resulting SDP is given by: Minimize ij ηij (xi − xj ) M(xi − xj ) + c ij ηij (1 − yil )ξijl subject to: (1) (xi − xl ) M(xi − xl ) − (xi − xj ) M(xi − xj ) ≥ 1 − ξijl (2) ξijl ≥ 0 (3) M 0. [sent-79, score-0.702]

45 The slack variables {ξij } are sparse because most labeled inputs are well separated; thus, their resulting pairwise distances do not incur the hinge loss, and we obtain very few active constraints. [sent-83, score-0.675]

46 Both the number of target neighbors (k) and the weighting parameter (c) in eq. [sent-93, score-0.212]

47 (For the purpose of cross-validation, the training sets were further partitioned into training and validation sets. [sent-95, score-0.196]

48 To break ties among different classes, we repeatedly reduced the neighborhood size, ultimately classifying (if necessary) by just the k = 1 nearest neighbor. [sent-98, score-0.257]

49 Except on the smallest data set (where over-training appears to be an issue), the Mahalanobis distance metrics learned by semidefinite programming led to significant improvements in kNN classification, both in training and testing. [sent-101, score-0.449]

50 We also computed test error rates using a variant of kNN classification, inspired by previous work on energy-based models [2]. [sent-104, score-0.114]

51 Energy-based classification of a test example xt was done by finding the label that minimizes the cost function in eq. [sent-105, score-0.205]

52 Finally, the test example was classified by the hypothetical label that minimized the combination of these two terms: ηtj L(xt−xj ) 2 +c yt = argminyt j ηij (1−yil ) 1 + L(xi −xj ) 2 − L(xi −xl ) 2 + j,i=t∨l=t As shown in Fig. [sent-108, score-0.13]

53 Finally, we compared our results to those of multiclass SVMs [4]. [sent-110, score-0.116]

54 On each data set (except MNIST), we trained multiclass SVMs using linear and RBF kernels; Fig. [sent-111, score-0.116]

55 ) 1 A great speedup can be achieved by solving an SDP that only monitors a fraction of the margin conditions, then using the resulting solution as a starting point for the actual SDP of interest. [sent-115, score-0.126]

56 9 IRIS kNN Mahalanobis distance Energy based classification 14. [sent-156, score-0.173]

57 4 testing error rate (%) Figure 2: Training and test error rates for kNN classification using Euclidean versus Mahalanobis distances. [sent-165, score-0.151]

58 The latter yields lower test error rates on all but the smallest data set (presumably due to over-training). [sent-166, score-0.114]

59 Small data sets with few classes The wine, iris, and balance data sets are small data sets, with less than 500 training examples and just three classes, taken from the UCI Machine Learning Repository3 . [sent-169, score-0.249]

60 On data sets of this size, a distance metric can be learned in a matter of seconds. [sent-170, score-0.445]

61 Face recognition The AT&T; face recognition data set4 contains 400 grayscale images of 40 individuals in 10 different poses. [sent-176, score-0.223]

62 We downsampled the images from to 38 × 31 pixels and used PCA to obtain 30-dimensional eigenfaces [15]. [sent-177, score-0.092]

63 Training and test sets were created by randomly sampling 7 images of each person for training and 3 images for testing. [sent-178, score-0.245]

64 3 illustrates the improvements more graphically by showing how the k = 3 nearest neighbors change as a result of learning a Mahalanobis metric. [sent-183, score-0.375]

65 html Test Image: Among 3 nearest neighbors after but not before training: Among 3 nearest neighbors before but not after training: Figure 3: Images from the AT&T; face recognition data base. [sent-195, score-0.776]

66 Middle row: correct match among the k = 3 nearest neighbors according to Mahalanobis distance, but not Euclidean distance. [sent-197, score-0.336]

67 Bottom row: incorrect match among the k = 3 nearest neighbors according to Euclidean distance, but not Mahalanobis distance. [sent-198, score-0.336]

68 Spoken letter recognition The Isolet data set from UCI Machine Learning Repository has 6238 examples and 26 classes corresponding to letters of the alphabet. [sent-199, score-0.137]

69 On this data set, Dietterich and Bakiri report test error rates of 4. [sent-201, score-0.114]

70 2 were obtained by averaging over 10 runs with 70/30 splits for training and test data. [sent-212, score-0.166]

71 LMNN also performed comparably to our best multiclass SVM [4], which obtained a 12. [sent-215, score-0.116]

72 Handwritten digit recognition The MNIST data set of handwritten digits6 has been extensively benchmarked [9]. [sent-217, score-0.098]

73 4 shows some digits whose nearest neighbor changed as a result of learning, from a mismatch using Euclidean distance to a match using Mahanalobis distance. [sent-225, score-0.484]

74 4 Related Work Many researchers have attempted to learn distance metrics from labeled examples. [sent-226, score-0.496]

75 com/exdb/mnist/ Test Image: Nearest neighbor after training: Nearest neighbor before training: Figure 4: Top row: Examples of MNIST images whose nearest neighbor changes during training. [sent-233, score-0.614]

76 Middle row: nearest neighbor after training, using the Mahalanobis distance metric. [sent-234, score-0.484]

77 Bottom row: nearest neighbor before training, using the Euclidean distance metric. [sent-235, score-0.484]

78 Xing et al [17] used semidefinite programming to learn a Mahalanobis distance metric for clustering. [sent-236, score-0.52]

79 Their algorithm aims to minimize the sum of squared distances between similarly labeled inputs, while maintaining a lower bound on the sum of distances between differently labeled inputs. [sent-237, score-0.832]

80 Shalev-Shwartz et al [12] proposed an online learning algorithm for learning a Mahalanobis distance metric. [sent-239, score-0.228]

81 The metric is trained with the goal that all similarly labeled inputs have small pairwise distances (bounded from above), while all differently labeled inputs have large pairwise distances (bounded from below). [sent-240, score-1.332]

82 A margin is defined by the difference of these thresholds and induced by a hinge loss function. [sent-241, score-0.339]

83 Our work has a similar basis in its appeal to margins and hinge loss functions, but again differs in its focus on local neighborhoods for kNN classification. [sent-242, score-0.29]

84 In particular, we do not seek to minimize the distance between all similarly labeled inputs, only those that are specified as neighbors. [sent-243, score-0.371]

85 Goldberger et al [7] proposed neighborhood component analysis (NCA), a distance metric learning algorithm especially designed to improve kNN classification. [sent-244, score-0.535]

86 The algorithm minimizes the probability of error under stochastic neighborhood assignments using gradient descent. [sent-245, score-0.114]

87 Chopra et al [2] recently proposed a framework for similarity metric learning in which the metrics are parameterized by pairs of identical convolutional neural nets. [sent-247, score-0.444]

88 Their cost function penalizes large distances between similarly labeled inputs and small distances between differently labeled inputs, with penalties that incorporate the idea of a margin. [sent-248, score-1.073]

89 Our work is based on a similar cost function, but our metric is parameterized by a linear transformation instead of a convolutional neural net. [sent-249, score-0.386]

90 Relevant component analysis (RCA) constructs a Mahalanobis distance metric from a weighted sum of in-class covariance matrices [13]. [sent-251, score-0.403]

91 ] and Domeniconi et al [6] consider schemes for locally adaptive distance metrics that vary throughout the input space. [sent-254, score-0.422]

92 The latter work appeals to the goal of margin maximization but otherwise differs substantially from our approach. [sent-255, score-0.171]

93 In particular, Domeniconi et al [6] suggest to use the decision boundaries of SVMs to induce a locally adaptive distance metric for kNN classification. [sent-256, score-0.486]

94 By contrast, our approach (though similarly named) does not involve the training of SVMs. [sent-257, score-0.114]

95 5 Discussion In this paper, we have shown how to learn Mahalanobis distance metrics for kNN classification by semidefinite programming. [sent-258, score-0.335]

96 Finally, we are extending our framework to learn locally adaptive distance metrics [6, 8] that vary across the input space. [sent-264, score-0.399]

97 Such metrics should lead to even more flexible and powerful large margin classifiers. [sent-265, score-0.256]

98 Learning a similiarty metric discriminatively, with application to face verification. [sent-276, score-0.285]

99 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-286, score-0.116]

100 Distance metric learning, with application to clustering with side-information. [sent-390, score-0.23]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('knn', 0.439), ('mahalanobis', 0.253), ('metric', 0.23), ('lmnn', 0.182), ('nearest', 0.18), ('distances', 0.178), ('distance', 0.173), ('hinge', 0.161), ('labeled', 0.161), ('neighbors', 0.156), ('classi', 0.153), ('mnist', 0.136), ('neighbor', 0.131), ('semide', 0.131), ('metrics', 0.13), ('euclidean', 0.128), ('svms', 0.127), ('sdp', 0.127), ('margin', 0.126), ('differently', 0.117), ('multiclass', 0.116), ('inputs', 0.101), ('isolet', 0.099), ('ijl', 0.091), ('ij', 0.091), ('xj', 0.087), ('xi', 0.085), ('cost', 0.077), ('training', 0.077), ('neighborhood', 0.077), ('cation', 0.074), ('domeniconi', 0.068), ('mahanalobis', 0.068), ('yil', 0.068), ('penalizes', 0.063), ('iris', 0.06), ('nca', 0.06), ('target', 0.056), ('al', 0.055), ('face', 0.055), ('label', 0.054), ('wine', 0.054), ('loss', 0.052), ('xl', 0.052), ('eigenfaces', 0.051), ('transformation', 0.05), ('recognition', 0.049), ('handwritten', 0.049), ('examples', 0.046), ('bal', 0.046), ('blitzer', 0.046), ('chopra', 0.046), ('invade', 0.046), ('lmmn', 0.046), ('multiway', 0.046), ('rca', 0.046), ('yij', 0.045), ('differs', 0.045), ('splits', 0.045), ('test', 0.044), ('parallels', 0.043), ('classes', 0.042), ('sets', 0.042), ('convex', 0.041), ('images', 0.041), ('slack', 0.04), ('mil', 0.04), ('improvements', 0.039), ('error', 0.037), ('similarly', 0.037), ('nite', 0.036), ('simard', 0.036), ('input', 0.036), ('pca', 0.035), ('dietterich', 0.035), ('pairwise', 0.034), ('newsgroups', 0.034), ('goldberger', 0.034), ('rates', 0.033), ('learn', 0.032), ('solver', 0.032), ('neighborhoods', 0.032), ('hypothetical', 0.032), ('lecun', 0.032), ('http', 0.032), ('principal', 0.031), ('row', 0.031), ('binary', 0.03), ('rd', 0.03), ('separated', 0.03), ('programming', 0.03), ('xt', 0.03), ('term', 0.029), ('grayscale', 0.029), ('backpropagation', 0.029), ('articles', 0.029), ('convolutional', 0.029), ('adaptive', 0.028), ('yi', 0.028), ('optimization', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1

2 0.26965424 126 nips-2005-Metric Learning by Collapsing Classes

Author: Amir Globerson, Sam T. Roweis

Abstract: We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance. 1 Supervised Learning of Metrics The problem of learning a distance measure (metric) over an input space is of fundamental importance in machine learning [10, 9], both supervised and unsupervised. When such measures are learned directly from the available data, they can be used to improve learning algorithms which rely on distance computations such as nearest neighbour classification [5], supervised kernel machines (such as GPs or SVMs) and even unsupervised clustering algorithms [10]. Good similarity measures may also provide insight into the underlying structure of data (e.g. inter-protein distances), and may aid in building better data visualizations via embedding. In fact, there is a close link between distance learning and feature extraction since whenever we construct a feature ´Üµ for an input using a simple distance funcspace , we can measure distances between ܽ ܾ ¾ tion (e.g. Euclidean) ´Ü½ µ ´Ü¾ µ℄ in feature space. Thus by fixing , any feature extraction algorithm may be considered a metric learning method. Perhaps the simplest illustration of this approach is when the ´Üµ is a linear projection of Ü ¾ Ö so that ´Üµ Ï Ü. The Euclidean distance between ´Ü½ µ and ´Ü¾ µ is then the Mahalanobis distance ´Ü½ µ   ´Ü¾ µ ¾ ´Ü½   ܾ µÌ ´Ü½   ܾ µ, where Ï Ì Ï is a positive semidefinite matrix. Much of the recent work on metric learning has indeed focused on learning Mahalanobis distances, i.e. learning the matrix . This is also the goal of the current work. A common approach to learning metrics is to assume some knowledge in the form of equiv- alence relations, i.e. which points should be close and which should be far (without specifying their exact distances). In the classification setting there is a natural equivalence relation, namely whether two points are in the same class or not. One of the classical statistical methods which uses this idea for the Mahalanobis distance is Fisher’s Linear Discriminant Analysis (see e.g. [6]). Other more recent methods are [10, 9, 5] which seek to minimize various separation criteria between the classes under the new metric. In this work, we present a novel approach to learning such a metric. Our approach, the Maximally Collapsing Metric Learning algorithm (MCML), relies on the simple geometric intuition that if all points in the same class could be mapped into a single location in feature space and all points in other classes mapped to other locations, this would result in an ideal approximation of our equivalence relation. Our algorithm approximates this scenario via a stochastic selection rule, as in Neighborhood Component Analysis (NCA) [5]. However, unlike NCA, the optimization problem is convex and thus our method is completely specified by our objective function. Different initialization and optimization techniques may affect the speed of obtaining the solution but the final solution itself is unique. We also show that our method approximates the local covariance structure of the data, as opposed to Linear Discriminant Analysis methods which use only global covariance structure. 2 The Approach of Collapsing Classes Given a set of Ò labeled examples ´Ü Ý µ, where Ü Ý ¾ ½ , we seek a space. We focus on Mahalanobis form metrics similarity measure between two points in ´Ü where Ü ´Ü µ ¾   Ü µÌ Ö and ´Ü  Ü µ (1) is a positive semidefinite (PSD) matrix. Intuitively, what we want from a good metric is that it makes elements of in the same class look close whereas those in different classes appear far. Our approach starts with the ideal case when this is true in the most optimistic sense: same class points are at zero distance, and different class points are infinitely far. Alternatively this can be viewed as mapping Ü via a linear projection Ï Ü ( Ï Ì Ï ), such that all points in the same class are mapped into the same point. This intuition is related to the analysis of spectral clustering [8], where the ideal case analysis of the algorithm results in all same cluster points being mapped to a single point. To learn a metric which approximates the ideal geometric setup described above, we introduce, for each training point, a conditional distribution over other points (as in [5]). such that Specifically, for each Ü we define a conditional distribution over points Ô ´ µ ½ È       (2) If all points in the same class were mapped to a single point and infinitely far from points in different classes, we would have the ideal “bi-level” distribution: Ô¼ ´ µ » Ò ½ ¼ Ý Ý Ý Ý (3) Furthermore, under very mild conditions, any set of points which achieves the above distribution must have the desired geometry. In particular, assume there are at least Ö · ¾ points in each class, where Ö rank ℄ (note that Ö Ö). Then Ô ´ µ Ô¼ ´ µ ( ) implies that under all points in the same class will be mapped to a single point, infinitely far from other class points 1 . 1 Proof sketch: The infinite separation between points of different classes follows simply from Thus it is natural to seek a matrix such that Ô ´ µ is as close as possible to Ô¼ ´ Since we are trying to match distributions, we minimize the KL divergence ÃÄ Ô¼ Ô℄: ÑÒ ÃÄ Ô¼ ´ µ Ô ´ µ℄ ¾ ÈË ×Ø µ. (4) The crucial property of this optimization problem is that it is convex in the matrix . To see this, first note that any convex linear combination of feasible solutions « ¼ · ´½   «µ ½ s.t. ¼ « ½ is still a feasible solution, since the set of PSD matrices is convex. Next, we can show that ´ µ alway has a greater cost than either of the endpoints. To do this, we rewrite the objective function ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ in the form 2 : È   ´ µ ÐÓ Ý Ý Ô´ µ · Ý ÐÓ Ý where we assumed for simplicity that classes are equi-probable, yielding a multiplicative constant. To see why ´ µ is convex, first note that ´Ü   Ü µÌ ´Ü   Ü µ is linear is a ÐÓ ÜÔ function of affine functions of in , and thus convex. The function ÐÓ and is therefore also convex (see [4], page 74). È 2.1 Convex Duality Since our optimization problem is convex, it has an equivalent convex dual. Specifically, the convex dual of Eq. (4) is the following entropy maximization problem: À Ô´ Ñ Ü Ô´ µ where Ú Ü ×Ø µ℄ Ô¼ ´ Ú ÚÌ ℄   µ Ô´ µ Ú ÚÌ ℄   Ü , À ¡℄ is the entropy function and we require È Ô´ µ ¼ ½ (5) . To prove this duality we start with the proposed dual and obtain the original problem in Equation 4 as its dual. Write the Lagrangian for the above problem (where is PSD) 3 Ä´Ô   ¬µ À ´Ô´ µµ   Ì Ö´ ´ Ô¼ ´ Ú ÚÌ ℄   Ô Ú ÚÌ ℄µµµ   ¬ ´ Ô´ µ   ½µ The dual function is defined as ´ ¬ µ Ñ ÒÔ Ä´Ô ¬ µ. To derive it, we first solve for the minimizing Ô by setting the derivative of Ä´Ô ¬ µ w.r.t. Ô´ µ equal to zero. Ì ¼ ½ · ÐÓ Ô´ µ · Ì Ö ´ Ú Ú µ   ¬ µ Ô´ µ ¬  ½ Ì Ö´ Ú ÚÌ µ Plugging solution È Ô thisThe dual to Ä Ô is¬ towe get ¬ . problem maximize È  Ì Ö Ú Ú . ¬ , yielding   ¬ ´ ´ µ ´ µ ½ Now note that Ì Ö´ ÐÓ Ú ÚÌ µ ÚÌ Ú ´ µ   ´ ̵ ´ µ  Ì Ö´ ¬ µ. È Ô¼ Ú ÚÌ ℄µ · Ȭ  We can do this analytically w.r.t. , so we can write Ý Ý   ÐÓ   which is minus our original target function. Since ´ µ should be maximized, and we have the desired duality result (identifying with ). ¼ » µ ¼ when Ý Ý . For a given point Ü , all the points in its class satisfy Ô´ µ ½. Due to the structure of Ô´ µ in Equation 2, and because it is obeyed for all points in ܼ × class, this implies that all the points in that class are equidistant from each other. However, it is easy to show that the maximum number of different equidistant points (also known as the equilateral dimension [1]) in Ö dimensions is Ö · ½. Since by assumption we have at least Ö · ¾ points in the class of Ü , and maps points into Ö , it follows that all points are identical. 2 À Ô¼ ´ µ℄. Up to an additive constant 3 We consider the equivalent problem of minimizing minus entropy. Ô¼ ´  È 2.1.1 Relation to covariance based and embedding methods The convex dual derived above reveals an interesting relation to covariance based learning methods. The sufficient statistics used by the algorithm are a set of Ò “spread” matrices. Each matrix is of the form Ô¼ ´ µ Ú ÚÌ ℄. The algorithm tries to find a maximum entropy distribution which matches these matrices when averaged over the sample. This should be contrasted with the covariance matrices used in metric learning such as Fisher’s Discriminant Analysis. The latter uses the within and between class covariance matrices. The within covariance matrix is similar to the covariance matrix used here, but is calculated with respect to the class means, whereas here it is calculated separately for every point, and is centered on this point. This highlights the fact that MCML is not based on Gaussian assumptions where it is indeed sufficient to calculate a single class covariance. Our method can also be thought of as a supervised version of the Stochastic Neighbour Embedding algorithm [7] in which the “target” distribution is Ô¼ (determined by the class labels) and the embedding points are not completely free but are instead constrained to be of the form Ï Ü . 2.2 Optimizing the Convex Objective Since the optimization problem in Equation 4 is convex, it is guaranteed to have only a single minimum which is the globally optimal solution4 . It can be optimized using any appropriate numerical convex optimization machinery; all methods will yield the same solution although some may be faster than others. One standard approach is to use interior point Newton methods. However, these algorithms require the Hessian to be calculated, which would require Ç´ µ resources, and could be prohibitive in our case. Instead, we have experimented with using a first order gradient method, specifically the projected gradient approach as in [10]. At each iteration we take a small step in the direction of the negative gradient of the objective function5, followed by a projection back onto the PSD cone. This projection is performed simply by taking the eigen-decomposition of and removing the components with negative eigenvalues. The algorithm is summarized below: Ý µ, Ò Input: Set of labeled data points ´Ü Output: PSD metric which optimally collapses classes. ½ Initialization: Initialize ¼ to some PSD matrix (randomly or using some initialization heuristic). Iterate: È Ø   ¯   ÔØ where   Ü Ô Ü ¯ Calculate the eigen-decomposition of È È Ù ÙÌ , then set Ø Ø Ø ¯ Set Ø·½ ´ µ ´ ´ ¼´ µ µ ´ µµ´ µ´Ü   Ü µÌ ·½ ·½ ·½ Ñ Ü´ ¼µÙ ÙÌ Of course in principle it is possible to optimize over the dual instead of the primal but in our case, if the training data consists of Ò points in Ö-dimensional space then the primal has only Ç´Ö¾ ¾µ variables while the dual has Ç´Ò¾ µ so it will almost always be more efficient to operate on the primal directly. One exception to this case may be the kernel version (Section 4) where the primal is also of size Ç´Ò¾ µ. 4 When the data can be exactly collapsed into single class points, there will be multiple solutions at infinity. However, this is very unlikely to happen in real data. 5 In the experiments, we used an Armijo like step size rule, as described in [3]. 3 Low Dimensional Projections for Feature Extraction The Mahalanobis distance under a metric can be interpreted as a linear projection of the original inputs by the square root of , followed by Euclidean distance in the projected space. Matrices which have less than full rank correspond to Mahalanobis distances based on low dimensional projections. Such metrics and the induced distances can be advantageous for several reasons [5]. First, low dimensional projections can substantially reduce the storage and computational requirements of a supervised method since only the projections of the training points must be stored and the manipulations at test time all occur in the lower dimensional feature space. Second, low dimensional projections re-represent the inputs, allowing for a supervised embedding or visualization of the original data. If we consider matrices with rank at most Õ , we can always represent them in the form Ï Ì Ï for some projection matrix Ï of size Õ ¢ Ö. This corresponds to projecting the original data into a Õ -dimensional space specified by the rows of Ï . However, rank constraints on a matrix are not convex [4], and hence the rank constrained problem is not convex and is likely to have local minima which make the optimization difficult and illdefined since it becomes sensitive to initial conditions and choice of optimization method. Luckily, there is an alternative approach to obtaining low dimensional projections, which does specify a unique solution by sequentially solving two globally tractable problems. This is the approach we follow here. First we solve for a (potentially) full rank metric using the convex program outlined above, and then obtain a low rank projection from it via spectral decomposition. This is done by diagonalizing into the form Ö Ú ÚÌ where ½ and Ú are the corre¾ Ö are eigenvalues of ½ sponding eigenvectors. To obtain a low rank projection we constrain the sum above to Õ include only the Õ terms corresponding to the Õ largest eigenvalues: Õ Ú ÚÌ . ½ The resulting projection is uniquely defined (up to an irrelevant unitary transformation) as Ô Ì Ì Ï ´ ÚÕ ℄. ½ Õ µ Ú½ È È Ô In general, the projection returned by this approach is not guaranteed to be the same as the projection corresponding to minimizing our objective function subject to a rank constraint on unless the optimal metric is of rank less than or equal to Õ . However, as we show in the experimental results, it is often the case that for practical problems the optimal has an eigen-spectrum which is rapidly decaying, so that many of its eigenvalues are indeed very small, suggesting the low rank solution will be close to optimal. 4 Learning Metrics with Kernels It is interesting to consider the case where Ü are mapped into a high dimensional feature space ´Ü µ and a Mahalanobis distance is sought in this space. We focus on the case where dot products in the feature space may be expressed via a kernel function, such that ´Ü µ ¡ ´Ü µ ´Ü Ü µ for some kernel . We now show how our method can be changed to accommodate this setting, so that optimization depends only on dot products. Consider the regularized target function: Ê ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ · Ì Ö´ µ (6) where the regularizing factor is equivalent to the Frobenius norm of the projection matrix Ï since Ì Ö´ µ Ï ¾. Deriving w.r.t. Ï we obtain Ï Í , where Í is some matrix which specifies Ï as a linear combination of sample points, and the Ø row of the matrix Ì Í Ì Í . Defining the PSD matrix is Ü . Thus is given by Í Ì Í , we can recast our optimization as looking for a PSD matrix , where the Mahalanobis distance is ´Ü   Ü µÌ Ì ´Ü   Ü µ ´   µÌ ´   µ, where we define Ü. This is exactly our original distance, with Ü replaced by , which depends only on dot products in space. The regularization term also depends solely on the dot products since Ì Ö´ µ Ì Ö´ Ì µ Ì Ö´ Ì µ Ì Ö´Ã µ, where à is the kernel matrix given Ì . Note that the trace is a linear function of , keeping the problem convex. by à Thus, as long as dot products can be represented via kernels, the optimization can be carried out without explicitly using the high dimensional space. To obtain a low dimensional solution, we follow the approach in Section 3: obtain a decomposition Î Ì Î 6 , and take the projection matrix to be the first Õ rows of ¼ Î . Ì , and thus Ì Ì As a first step, we calculate a matrix such that . Since is a correlation matrix for the rows of it can be shown (as in Kernel PCA) that its (left) eigenvectors are linear combinations of the rows of . Denoting by Î « the eigenvector matrix, we obtain, after some algebra, that « Ã Ì «. We conclude that « is an eigenvector of the matrix Ã Ì . Denote by « the matrix whose rows are orthonormal eigenvectors of Ã Ì . Then Î can be shown to be orthonormal if we set  ¼ « . The final projection will then be ¼ Î Ü « . Low dimensional Î projections will be obtained by keeping only the first Õ components of this projection. 5 Experimental Results We compared our method to several metric learning algorithms on a supervised classification task. Training data was first used to learn a metric over the input space. Then this metric was used in a 1-nearest-neighbor algorithm to classify a test set. The datasets we investigated were taken from the UCI repository and have been used previously in evaluating supervised methods for metric learning [10, 5]. To these we added the USPS handwritten digits (downsampled to 8x8 pixels) and the YALE faces [2] (downsampled to 31x22). The algorithms used in the comparative evaluation were ¯ ¯ ¯ Fisher’s Linear Discriminant Analysis (LDA), which projects on the eigenvectors   of ËϽ Ë where ËÏ Ë are the within and between class covariance matrices. The method of Xing et al [10] which minimizes the mean within class distance, while keeping the mean between class distance larger than one. Principal Component Analysis (PCA). There are several possibilities for scaling the PCA projections. We tested several, and report results of the empirically superior one (PCAW), which scales the projection components so that the covariance matrix after projection is the identity. PCAW often performs poorly on high dimensions, but globally outperforms all other variants. We also evaluated the kernel version of MCML with an RBF kernel (denoted by KMCML)7 . Since all methods allow projections to lower dimensions we compared performance for different projection dimensions 8 . The out-of sample performance results (based on 40 random splits of the data taking 70% for training and 30% for testing9 ) are shown in Figure 1. It can be seen that when used in a simple nearest-neighbour classifier, the metric learned by MCML almost always performs as well as, or significantly better than those learned by all other methods, across most dimensions. Furthermore, the kernel version of MCML outperforms the linear one on most datasets. Where Î is orthonormal, and the eigenvalues in are sorted in decreasing order. The regularization parameter and the width of the RBF kernel were chosen using 5 fold crossvalidation. KMCML was only evaluated for datasets with less than 1000 training points. 8 To obtain low dimensional mappings we used the approach outlined in Section 3. 9 Except for the larger datasets where 1000 random samples were used for training. 6 7 Wine 0.2 0.1 2 4 6 8 10 Projection Dimension 0.5 Error Rate Error Rate Balance Ion MCML PCAW LDA XING KMCML 0.3 0.25 0.4 0.2 0.15 0.3 0.1 0.2 0.05 10 20 30 Projection Dimension Soybean−small 0.1 1 2 3 Protein 4 Spam 0.6 0.4 0.25 0.5 0.4 0 5 10 15 0.2 0.3 0.2 0.15 20 5 10 Yale7 15 20 0.1 10 20 30 40 50 40 50 Digits Housing 0.6 0.4 0.4 0.3 0.4 0.35 0.2 0.3 0.2 0.1 0.25 10 20 30 40 50 5 10 15 10 20 30 Figure 1: Classification error rate on several UCI datasets, USPS digits and YALE faces, for different projection dimensions. Algorithms are our Maximally Collapsing Metric Learning (MCML), Xing et.al.[10], PCA with whitening transformation (PCAW) and Fisher’s Discriminant Analysis (LDA). Standard errors of the means shown on curves. No results given for XING on YALE and KMCML on Digits and Spam due to the data size. 5.1 Comparison to non convex procedures The methods in the previous comparison are all well defined, in the sense that they are not susceptible to local minima in the optimization. They also have the added advantage of obtaining projections to all dimensions using one optimization run. Below, we also compare the MCML results to the results of two non-convex procedures. The first is the Non Convex variant of MCML (NMCML): The objective function of MCML can be optimized w.r.t the projection matrix Ï , where Ï Ì Ï . Although this is no longer a convex problem, it is not constrained and is thus easier to optimize. The second non convex method is Neighbourhood Components Analysis (NCA) [5], which attempts to directly minimize the error incurred by a nearest neighbor classifier. For both methods we optimized the matrix Ï by restarting the optimization separately for each size of Ï . Minimization was performed using a conjugate gradient algorithm, initialized by LDA or randomly. Figure 2 shows results on a subset of the UCI datasets. It can be seen that the performance of NMCML is similar to that of MCML, although it is less stable, possibly due to local minima, and both methods usually outperform NCA. The inset in each figure shows the spectrum of the MCML matrix , revealing that it often drops quickly after a few dimensions. This illustrates the effectiveness of our two stage optimization procedure, and suggests its low dimensional solutions are close to optimal. 6 Discussion and Extensions We have presented an algorithm for learning maximally collapsing metrics (MCML), based on the intuition of collapsing classes into single points. MCML assumes that each class Balance Error Rate Wine MCML NMCML NCA 0.2 Soybean 0.4 0.1 0.3 0.1 0.05 0.2 0 2 4 6 8 10 Projection Dimension 0.1 1 0 2 Protein 3 4 5 10 Ion 15 20 Housing 0.4 0.6 0.3 0.5 0.35 0.2 0.4 0.3 0.3 0.25 5 10 15 20 0.1 10 20 30 5 10 15 Figure 2: Classification error for non convex procedures, and the MCML method. Eigen-spectra for the MCML solution are shown in the inset. may be collapsed to a single point, at least approximately, and thus is only suitable for unimodal class distributions (or for simply connected sets if kernelization is used). However, if points belonging to a single class appear in several disconnected clusters in input (or feature) space, it is unlikely that MCML could collapse the class into a single point. It is possible that using a mixture of distributions, an EM-like algorithm can be constructed to accommodate this scenario. The method can also be used to learn low dimensional projections of the input space. We showed that it performs well, even across a range of projection dimensions, and consistently outperforms existing methods. Finally, we have shown how the method can be extended to projections in high dimensional feature spaces using the kernel trick. The resulting nonlinear method was shown to improve classification results over the linear version. References Ò [1] N. Alon and P. Pudlak. Equilateral sets in ÐÔ . Geom. Funct. Anal., 13(3), 2003. [2] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In ECCV (1), 1996. [3] D.P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transaction on Automatic Control, 21(2):174–184, 1976. [4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004. [5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004. [6] T. Hastie, R. Tibshirani, and J.H. Friedman. The elements of statistical learning: data mining, inference, and prediction. New York: Springer-Verlag, 2001. [7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), 2002. [8] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2001. [9] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proc. of ECCV, 2002. [10] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2004.

3 0.19544083 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil

Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1

4 0.17652509 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia

Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.

5 0.13144776 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio

Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1

6 0.12986307 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

7 0.11946099 50 nips-2005-Convex Neural Networks

8 0.11738419 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

9 0.11592957 138 nips-2005-Non-Local Manifold Parzen Windows

10 0.1144027 195 nips-2005-Transfer learning for text classification

11 0.11260203 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

12 0.1033155 131 nips-2005-Multiple Instance Boosting for Object Detection

13 0.097480856 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

14 0.095311105 105 nips-2005-Large-Scale Multiclass Transduction

15 0.093941122 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

16 0.093518578 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

17 0.08925233 151 nips-2005-Pattern Recognition from One Example by Chopping

18 0.08740934 114 nips-2005-Learning Rankings via Convex Hull Separation

19 0.081906289 189 nips-2005-Tensor Subspace Analysis

20 0.080159791 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.271), (1, 0.119), (2, -0.099), (3, -0.04), (4, 0.001), (5, 0.193), (6, 0.227), (7, 0.008), (8, 0.081), (9, -0.016), (10, 0.035), (11, -0.157), (12, -0.087), (13, 0.003), (14, 0.067), (15, -0.153), (16, -0.023), (17, -0.024), (18, -0.219), (19, 0.018), (20, -0.045), (21, 0.083), (22, 0.012), (23, 0.166), (24, -0.155), (25, 0.006), (26, 0.008), (27, 0.033), (28, -0.048), (29, -0.002), (30, -0.002), (31, 0.062), (32, 0.11), (33, 0.015), (34, 0.002), (35, -0.071), (36, -0.001), (37, 0.142), (38, -0.041), (39, -0.046), (40, 0.058), (41, 0.058), (42, -0.058), (43, 0.132), (44, -0.012), (45, 0.11), (46, -0.104), (47, 0.073), (48, 0.121), (49, 0.157)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95567816 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1

2 0.74116832 126 nips-2005-Metric Learning by Collapsing Classes

Author: Amir Globerson, Sam T. Roweis

Abstract: We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance. 1 Supervised Learning of Metrics The problem of learning a distance measure (metric) over an input space is of fundamental importance in machine learning [10, 9], both supervised and unsupervised. When such measures are learned directly from the available data, they can be used to improve learning algorithms which rely on distance computations such as nearest neighbour classification [5], supervised kernel machines (such as GPs or SVMs) and even unsupervised clustering algorithms [10]. Good similarity measures may also provide insight into the underlying structure of data (e.g. inter-protein distances), and may aid in building better data visualizations via embedding. In fact, there is a close link between distance learning and feature extraction since whenever we construct a feature ´Üµ for an input using a simple distance funcspace , we can measure distances between ܽ ܾ ¾ tion (e.g. Euclidean) ´Ü½ µ ´Ü¾ µ℄ in feature space. Thus by fixing , any feature extraction algorithm may be considered a metric learning method. Perhaps the simplest illustration of this approach is when the ´Üµ is a linear projection of Ü ¾ Ö so that ´Üµ Ï Ü. The Euclidean distance between ´Ü½ µ and ´Ü¾ µ is then the Mahalanobis distance ´Ü½ µ   ´Ü¾ µ ¾ ´Ü½   ܾ µÌ ´Ü½   ܾ µ, where Ï Ì Ï is a positive semidefinite matrix. Much of the recent work on metric learning has indeed focused on learning Mahalanobis distances, i.e. learning the matrix . This is also the goal of the current work. A common approach to learning metrics is to assume some knowledge in the form of equiv- alence relations, i.e. which points should be close and which should be far (without specifying their exact distances). In the classification setting there is a natural equivalence relation, namely whether two points are in the same class or not. One of the classical statistical methods which uses this idea for the Mahalanobis distance is Fisher’s Linear Discriminant Analysis (see e.g. [6]). Other more recent methods are [10, 9, 5] which seek to minimize various separation criteria between the classes under the new metric. In this work, we present a novel approach to learning such a metric. Our approach, the Maximally Collapsing Metric Learning algorithm (MCML), relies on the simple geometric intuition that if all points in the same class could be mapped into a single location in feature space and all points in other classes mapped to other locations, this would result in an ideal approximation of our equivalence relation. Our algorithm approximates this scenario via a stochastic selection rule, as in Neighborhood Component Analysis (NCA) [5]. However, unlike NCA, the optimization problem is convex and thus our method is completely specified by our objective function. Different initialization and optimization techniques may affect the speed of obtaining the solution but the final solution itself is unique. We also show that our method approximates the local covariance structure of the data, as opposed to Linear Discriminant Analysis methods which use only global covariance structure. 2 The Approach of Collapsing Classes Given a set of Ò labeled examples ´Ü Ý µ, where Ü Ý ¾ ½ , we seek a space. We focus on Mahalanobis form metrics similarity measure between two points in ´Ü where Ü ´Ü µ ¾   Ü µÌ Ö and ´Ü  Ü µ (1) is a positive semidefinite (PSD) matrix. Intuitively, what we want from a good metric is that it makes elements of in the same class look close whereas those in different classes appear far. Our approach starts with the ideal case when this is true in the most optimistic sense: same class points are at zero distance, and different class points are infinitely far. Alternatively this can be viewed as mapping Ü via a linear projection Ï Ü ( Ï Ì Ï ), such that all points in the same class are mapped into the same point. This intuition is related to the analysis of spectral clustering [8], where the ideal case analysis of the algorithm results in all same cluster points being mapped to a single point. To learn a metric which approximates the ideal geometric setup described above, we introduce, for each training point, a conditional distribution over other points (as in [5]). such that Specifically, for each Ü we define a conditional distribution over points Ô ´ µ ½ È       (2) If all points in the same class were mapped to a single point and infinitely far from points in different classes, we would have the ideal “bi-level” distribution: Ô¼ ´ µ » Ò ½ ¼ Ý Ý Ý Ý (3) Furthermore, under very mild conditions, any set of points which achieves the above distribution must have the desired geometry. In particular, assume there are at least Ö · ¾ points in each class, where Ö rank ℄ (note that Ö Ö). Then Ô ´ µ Ô¼ ´ µ ( ) implies that under all points in the same class will be mapped to a single point, infinitely far from other class points 1 . 1 Proof sketch: The infinite separation between points of different classes follows simply from Thus it is natural to seek a matrix such that Ô ´ µ is as close as possible to Ô¼ ´ Since we are trying to match distributions, we minimize the KL divergence ÃÄ Ô¼ Ô℄: ÑÒ ÃÄ Ô¼ ´ µ Ô ´ µ℄ ¾ ÈË ×Ø µ. (4) The crucial property of this optimization problem is that it is convex in the matrix . To see this, first note that any convex linear combination of feasible solutions « ¼ · ´½   «µ ½ s.t. ¼ « ½ is still a feasible solution, since the set of PSD matrices is convex. Next, we can show that ´ µ alway has a greater cost than either of the endpoints. To do this, we rewrite the objective function ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ in the form 2 : È   ´ µ ÐÓ Ý Ý Ô´ µ · Ý ÐÓ Ý where we assumed for simplicity that classes are equi-probable, yielding a multiplicative constant. To see why ´ µ is convex, first note that ´Ü   Ü µÌ ´Ü   Ü µ is linear is a ÐÓ ÜÔ function of affine functions of in , and thus convex. The function ÐÓ and is therefore also convex (see [4], page 74). È 2.1 Convex Duality Since our optimization problem is convex, it has an equivalent convex dual. Specifically, the convex dual of Eq. (4) is the following entropy maximization problem: À Ô´ Ñ Ü Ô´ µ where Ú Ü ×Ø µ℄ Ô¼ ´ Ú ÚÌ ℄   µ Ô´ µ Ú ÚÌ ℄   Ü , À ¡℄ is the entropy function and we require È Ô´ µ ¼ ½ (5) . To prove this duality we start with the proposed dual and obtain the original problem in Equation 4 as its dual. Write the Lagrangian for the above problem (where is PSD) 3 Ä´Ô   ¬µ À ´Ô´ µµ   Ì Ö´ ´ Ô¼ ´ Ú ÚÌ ℄   Ô Ú ÚÌ ℄µµµ   ¬ ´ Ô´ µ   ½µ The dual function is defined as ´ ¬ µ Ñ ÒÔ Ä´Ô ¬ µ. To derive it, we first solve for the minimizing Ô by setting the derivative of Ä´Ô ¬ µ w.r.t. Ô´ µ equal to zero. Ì ¼ ½ · ÐÓ Ô´ µ · Ì Ö ´ Ú Ú µ   ¬ µ Ô´ µ ¬  ½ Ì Ö´ Ú ÚÌ µ Plugging solution È Ô thisThe dual to Ä Ô is¬ towe get ¬ . problem maximize È  Ì Ö Ú Ú . ¬ , yielding   ¬ ´ ´ µ ´ µ ½ Now note that Ì Ö´ ÐÓ Ú ÚÌ µ ÚÌ Ú ´ µ   ´ ̵ ´ µ  Ì Ö´ ¬ µ. È Ô¼ Ú ÚÌ ℄µ · Ȭ  We can do this analytically w.r.t. , so we can write Ý Ý   ÐÓ   which is minus our original target function. Since ´ µ should be maximized, and we have the desired duality result (identifying with ). ¼ » µ ¼ when Ý Ý . For a given point Ü , all the points in its class satisfy Ô´ µ ½. Due to the structure of Ô´ µ in Equation 2, and because it is obeyed for all points in ܼ × class, this implies that all the points in that class are equidistant from each other. However, it is easy to show that the maximum number of different equidistant points (also known as the equilateral dimension [1]) in Ö dimensions is Ö · ½. Since by assumption we have at least Ö · ¾ points in the class of Ü , and maps points into Ö , it follows that all points are identical. 2 À Ô¼ ´ µ℄. Up to an additive constant 3 We consider the equivalent problem of minimizing minus entropy. Ô¼ ´  È 2.1.1 Relation to covariance based and embedding methods The convex dual derived above reveals an interesting relation to covariance based learning methods. The sufficient statistics used by the algorithm are a set of Ò “spread” matrices. Each matrix is of the form Ô¼ ´ µ Ú ÚÌ ℄. The algorithm tries to find a maximum entropy distribution which matches these matrices when averaged over the sample. This should be contrasted with the covariance matrices used in metric learning such as Fisher’s Discriminant Analysis. The latter uses the within and between class covariance matrices. The within covariance matrix is similar to the covariance matrix used here, but is calculated with respect to the class means, whereas here it is calculated separately for every point, and is centered on this point. This highlights the fact that MCML is not based on Gaussian assumptions where it is indeed sufficient to calculate a single class covariance. Our method can also be thought of as a supervised version of the Stochastic Neighbour Embedding algorithm [7] in which the “target” distribution is Ô¼ (determined by the class labels) and the embedding points are not completely free but are instead constrained to be of the form Ï Ü . 2.2 Optimizing the Convex Objective Since the optimization problem in Equation 4 is convex, it is guaranteed to have only a single minimum which is the globally optimal solution4 . It can be optimized using any appropriate numerical convex optimization machinery; all methods will yield the same solution although some may be faster than others. One standard approach is to use interior point Newton methods. However, these algorithms require the Hessian to be calculated, which would require Ç´ µ resources, and could be prohibitive in our case. Instead, we have experimented with using a first order gradient method, specifically the projected gradient approach as in [10]. At each iteration we take a small step in the direction of the negative gradient of the objective function5, followed by a projection back onto the PSD cone. This projection is performed simply by taking the eigen-decomposition of and removing the components with negative eigenvalues. The algorithm is summarized below: Ý µ, Ò Input: Set of labeled data points ´Ü Output: PSD metric which optimally collapses classes. ½ Initialization: Initialize ¼ to some PSD matrix (randomly or using some initialization heuristic). Iterate: È Ø   ¯   ÔØ where   Ü Ô Ü ¯ Calculate the eigen-decomposition of È È Ù ÙÌ , then set Ø Ø Ø ¯ Set Ø·½ ´ µ ´ ´ ¼´ µ µ ´ µµ´ µ´Ü   Ü µÌ ·½ ·½ ·½ Ñ Ü´ ¼µÙ ÙÌ Of course in principle it is possible to optimize over the dual instead of the primal but in our case, if the training data consists of Ò points in Ö-dimensional space then the primal has only Ç´Ö¾ ¾µ variables while the dual has Ç´Ò¾ µ so it will almost always be more efficient to operate on the primal directly. One exception to this case may be the kernel version (Section 4) where the primal is also of size Ç´Ò¾ µ. 4 When the data can be exactly collapsed into single class points, there will be multiple solutions at infinity. However, this is very unlikely to happen in real data. 5 In the experiments, we used an Armijo like step size rule, as described in [3]. 3 Low Dimensional Projections for Feature Extraction The Mahalanobis distance under a metric can be interpreted as a linear projection of the original inputs by the square root of , followed by Euclidean distance in the projected space. Matrices which have less than full rank correspond to Mahalanobis distances based on low dimensional projections. Such metrics and the induced distances can be advantageous for several reasons [5]. First, low dimensional projections can substantially reduce the storage and computational requirements of a supervised method since only the projections of the training points must be stored and the manipulations at test time all occur in the lower dimensional feature space. Second, low dimensional projections re-represent the inputs, allowing for a supervised embedding or visualization of the original data. If we consider matrices with rank at most Õ , we can always represent them in the form Ï Ì Ï for some projection matrix Ï of size Õ ¢ Ö. This corresponds to projecting the original data into a Õ -dimensional space specified by the rows of Ï . However, rank constraints on a matrix are not convex [4], and hence the rank constrained problem is not convex and is likely to have local minima which make the optimization difficult and illdefined since it becomes sensitive to initial conditions and choice of optimization method. Luckily, there is an alternative approach to obtaining low dimensional projections, which does specify a unique solution by sequentially solving two globally tractable problems. This is the approach we follow here. First we solve for a (potentially) full rank metric using the convex program outlined above, and then obtain a low rank projection from it via spectral decomposition. This is done by diagonalizing into the form Ö Ú ÚÌ where ½ and Ú are the corre¾ Ö are eigenvalues of ½ sponding eigenvectors. To obtain a low rank projection we constrain the sum above to Õ include only the Õ terms corresponding to the Õ largest eigenvalues: Õ Ú ÚÌ . ½ The resulting projection is uniquely defined (up to an irrelevant unitary transformation) as Ô Ì Ì Ï ´ ÚÕ ℄. ½ Õ µ Ú½ È È Ô In general, the projection returned by this approach is not guaranteed to be the same as the projection corresponding to minimizing our objective function subject to a rank constraint on unless the optimal metric is of rank less than or equal to Õ . However, as we show in the experimental results, it is often the case that for practical problems the optimal has an eigen-spectrum which is rapidly decaying, so that many of its eigenvalues are indeed very small, suggesting the low rank solution will be close to optimal. 4 Learning Metrics with Kernels It is interesting to consider the case where Ü are mapped into a high dimensional feature space ´Ü µ and a Mahalanobis distance is sought in this space. We focus on the case where dot products in the feature space may be expressed via a kernel function, such that ´Ü µ ¡ ´Ü µ ´Ü Ü µ for some kernel . We now show how our method can be changed to accommodate this setting, so that optimization depends only on dot products. Consider the regularized target function: Ê ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ · Ì Ö´ µ (6) where the regularizing factor is equivalent to the Frobenius norm of the projection matrix Ï since Ì Ö´ µ Ï ¾. Deriving w.r.t. Ï we obtain Ï Í , where Í is some matrix which specifies Ï as a linear combination of sample points, and the Ø row of the matrix Ì Í Ì Í . Defining the PSD matrix is Ü . Thus is given by Í Ì Í , we can recast our optimization as looking for a PSD matrix , where the Mahalanobis distance is ´Ü   Ü µÌ Ì ´Ü   Ü µ ´   µÌ ´   µ, where we define Ü. This is exactly our original distance, with Ü replaced by , which depends only on dot products in space. The regularization term also depends solely on the dot products since Ì Ö´ µ Ì Ö´ Ì µ Ì Ö´ Ì µ Ì Ö´Ã µ, where à is the kernel matrix given Ì . Note that the trace is a linear function of , keeping the problem convex. by à Thus, as long as dot products can be represented via kernels, the optimization can be carried out without explicitly using the high dimensional space. To obtain a low dimensional solution, we follow the approach in Section 3: obtain a decomposition Î Ì Î 6 , and take the projection matrix to be the first Õ rows of ¼ Î . Ì , and thus Ì Ì As a first step, we calculate a matrix such that . Since is a correlation matrix for the rows of it can be shown (as in Kernel PCA) that its (left) eigenvectors are linear combinations of the rows of . Denoting by Î « the eigenvector matrix, we obtain, after some algebra, that « Ã Ì «. We conclude that « is an eigenvector of the matrix Ã Ì . Denote by « the matrix whose rows are orthonormal eigenvectors of Ã Ì . Then Î can be shown to be orthonormal if we set  ¼ « . The final projection will then be ¼ Î Ü « . Low dimensional Î projections will be obtained by keeping only the first Õ components of this projection. 5 Experimental Results We compared our method to several metric learning algorithms on a supervised classification task. Training data was first used to learn a metric over the input space. Then this metric was used in a 1-nearest-neighbor algorithm to classify a test set. The datasets we investigated were taken from the UCI repository and have been used previously in evaluating supervised methods for metric learning [10, 5]. To these we added the USPS handwritten digits (downsampled to 8x8 pixels) and the YALE faces [2] (downsampled to 31x22). The algorithms used in the comparative evaluation were ¯ ¯ ¯ Fisher’s Linear Discriminant Analysis (LDA), which projects on the eigenvectors   of ËϽ Ë where ËÏ Ë are the within and between class covariance matrices. The method of Xing et al [10] which minimizes the mean within class distance, while keeping the mean between class distance larger than one. Principal Component Analysis (PCA). There are several possibilities for scaling the PCA projections. We tested several, and report results of the empirically superior one (PCAW), which scales the projection components so that the covariance matrix after projection is the identity. PCAW often performs poorly on high dimensions, but globally outperforms all other variants. We also evaluated the kernel version of MCML with an RBF kernel (denoted by KMCML)7 . Since all methods allow projections to lower dimensions we compared performance for different projection dimensions 8 . The out-of sample performance results (based on 40 random splits of the data taking 70% for training and 30% for testing9 ) are shown in Figure 1. It can be seen that when used in a simple nearest-neighbour classifier, the metric learned by MCML almost always performs as well as, or significantly better than those learned by all other methods, across most dimensions. Furthermore, the kernel version of MCML outperforms the linear one on most datasets. Where Î is orthonormal, and the eigenvalues in are sorted in decreasing order. The regularization parameter and the width of the RBF kernel were chosen using 5 fold crossvalidation. KMCML was only evaluated for datasets with less than 1000 training points. 8 To obtain low dimensional mappings we used the approach outlined in Section 3. 9 Except for the larger datasets where 1000 random samples were used for training. 6 7 Wine 0.2 0.1 2 4 6 8 10 Projection Dimension 0.5 Error Rate Error Rate Balance Ion MCML PCAW LDA XING KMCML 0.3 0.25 0.4 0.2 0.15 0.3 0.1 0.2 0.05 10 20 30 Projection Dimension Soybean−small 0.1 1 2 3 Protein 4 Spam 0.6 0.4 0.25 0.5 0.4 0 5 10 15 0.2 0.3 0.2 0.15 20 5 10 Yale7 15 20 0.1 10 20 30 40 50 40 50 Digits Housing 0.6 0.4 0.4 0.3 0.4 0.35 0.2 0.3 0.2 0.1 0.25 10 20 30 40 50 5 10 15 10 20 30 Figure 1: Classification error rate on several UCI datasets, USPS digits and YALE faces, for different projection dimensions. Algorithms are our Maximally Collapsing Metric Learning (MCML), Xing et.al.[10], PCA with whitening transformation (PCAW) and Fisher’s Discriminant Analysis (LDA). Standard errors of the means shown on curves. No results given for XING on YALE and KMCML on Digits and Spam due to the data size. 5.1 Comparison to non convex procedures The methods in the previous comparison are all well defined, in the sense that they are not susceptible to local minima in the optimization. They also have the added advantage of obtaining projections to all dimensions using one optimization run. Below, we also compare the MCML results to the results of two non-convex procedures. The first is the Non Convex variant of MCML (NMCML): The objective function of MCML can be optimized w.r.t the projection matrix Ï , where Ï Ì Ï . Although this is no longer a convex problem, it is not constrained and is thus easier to optimize. The second non convex method is Neighbourhood Components Analysis (NCA) [5], which attempts to directly minimize the error incurred by a nearest neighbor classifier. For both methods we optimized the matrix Ï by restarting the optimization separately for each size of Ï . Minimization was performed using a conjugate gradient algorithm, initialized by LDA or randomly. Figure 2 shows results on a subset of the UCI datasets. It can be seen that the performance of NMCML is similar to that of MCML, although it is less stable, possibly due to local minima, and both methods usually outperform NCA. The inset in each figure shows the spectrum of the MCML matrix , revealing that it often drops quickly after a few dimensions. This illustrates the effectiveness of our two stage optimization procedure, and suggests its low dimensional solutions are close to optimal. 6 Discussion and Extensions We have presented an algorithm for learning maximally collapsing metrics (MCML), based on the intuition of collapsing classes into single points. MCML assumes that each class Balance Error Rate Wine MCML NMCML NCA 0.2 Soybean 0.4 0.1 0.3 0.1 0.05 0.2 0 2 4 6 8 10 Projection Dimension 0.1 1 0 2 Protein 3 4 5 10 Ion 15 20 Housing 0.4 0.6 0.3 0.5 0.35 0.2 0.4 0.3 0.3 0.25 5 10 15 20 0.1 10 20 30 5 10 15 Figure 2: Classification error for non convex procedures, and the MCML method. Eigen-spectra for the MCML solution are shown in the inset. may be collapsed to a single point, at least approximately, and thus is only suitable for unimodal class distributions (or for simply connected sets if kernelization is used). However, if points belonging to a single class appear in several disconnected clusters in input (or feature) space, it is unlikely that MCML could collapse the class into a single point. It is possible that using a mixture of distributions, an EM-like algorithm can be constructed to accommodate this scenario. The method can also be used to learn low dimensional projections of the input space. We showed that it performs well, even across a range of projection dimensions, and consistently outperforms existing methods. Finally, we have shown how the method can be extended to projections in high dimensional feature spaces using the kernel trick. The resulting nonlinear method was shown to improve classification results over the linear version. References Ò [1] N. Alon and P. Pudlak. Equilateral sets in ÐÔ . Geom. Funct. Anal., 13(3), 2003. [2] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In ECCV (1), 1996. [3] D.P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transaction on Automatic Control, 21(2):174–184, 1976. [4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004. [5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004. [6] T. Hastie, R. Tibshirani, and J.H. Friedman. The elements of statistical learning: data mining, inference, and prediction. New York: Springer-Verlag, 2001. [7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), 2002. [8] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2001. [9] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proc. of ECCV, 2002. [10] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2004.

3 0.57904208 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil

Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1

4 0.55492264 195 nips-2005-Transfer learning for text classification

Author: Chuong B. Do, Andrew Y. Ng

Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form, function of training set statistics; we call this mapping mapping from statistics to parameters, the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks. 1

5 0.50234598 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia

Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.

6 0.48628953 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

7 0.45619628 131 nips-2005-Multiple Instance Boosting for Object Detection

8 0.45512509 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

9 0.45286912 189 nips-2005-Tensor Subspace Analysis

10 0.44651565 50 nips-2005-Convex Neural Networks

11 0.44301838 151 nips-2005-Pattern Recognition from One Example by Chopping

12 0.44185719 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

13 0.43582076 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

14 0.42366731 105 nips-2005-Large-Scale Multiclass Transduction

15 0.4223907 160 nips-2005-Query by Committee Made Real

16 0.41980603 114 nips-2005-Learning Rankings via Convex Hull Separation

17 0.41649279 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

18 0.40545917 138 nips-2005-Non-Local Manifold Parzen Windows

19 0.40366355 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

20 0.40033484 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.063), (10, 0.074), (17, 0.138), (27, 0.023), (31, 0.037), (34, 0.19), (39, 0.025), (41, 0.016), (50, 0.074), (55, 0.024), (65, 0.014), (69, 0.051), (73, 0.053), (88, 0.112), (91, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90828794 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1

2 0.85798657 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

Author: Y. Altun, D. McAllester, M. Belkin

Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1

3 0.85375106 126 nips-2005-Metric Learning by Collapsing Classes

Author: Amir Globerson, Sam T. Roweis

Abstract: We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance. 1 Supervised Learning of Metrics The problem of learning a distance measure (metric) over an input space is of fundamental importance in machine learning [10, 9], both supervised and unsupervised. When such measures are learned directly from the available data, they can be used to improve learning algorithms which rely on distance computations such as nearest neighbour classification [5], supervised kernel machines (such as GPs or SVMs) and even unsupervised clustering algorithms [10]. Good similarity measures may also provide insight into the underlying structure of data (e.g. inter-protein distances), and may aid in building better data visualizations via embedding. In fact, there is a close link between distance learning and feature extraction since whenever we construct a feature ´Üµ for an input using a simple distance funcspace , we can measure distances between ܽ ܾ ¾ tion (e.g. Euclidean) ´Ü½ µ ´Ü¾ µ℄ in feature space. Thus by fixing , any feature extraction algorithm may be considered a metric learning method. Perhaps the simplest illustration of this approach is when the ´Üµ is a linear projection of Ü ¾ Ö so that ´Üµ Ï Ü. The Euclidean distance between ´Ü½ µ and ´Ü¾ µ is then the Mahalanobis distance ´Ü½ µ   ´Ü¾ µ ¾ ´Ü½   ܾ µÌ ´Ü½   ܾ µ, where Ï Ì Ï is a positive semidefinite matrix. Much of the recent work on metric learning has indeed focused on learning Mahalanobis distances, i.e. learning the matrix . This is also the goal of the current work. A common approach to learning metrics is to assume some knowledge in the form of equiv- alence relations, i.e. which points should be close and which should be far (without specifying their exact distances). In the classification setting there is a natural equivalence relation, namely whether two points are in the same class or not. One of the classical statistical methods which uses this idea for the Mahalanobis distance is Fisher’s Linear Discriminant Analysis (see e.g. [6]). Other more recent methods are [10, 9, 5] which seek to minimize various separation criteria between the classes under the new metric. In this work, we present a novel approach to learning such a metric. Our approach, the Maximally Collapsing Metric Learning algorithm (MCML), relies on the simple geometric intuition that if all points in the same class could be mapped into a single location in feature space and all points in other classes mapped to other locations, this would result in an ideal approximation of our equivalence relation. Our algorithm approximates this scenario via a stochastic selection rule, as in Neighborhood Component Analysis (NCA) [5]. However, unlike NCA, the optimization problem is convex and thus our method is completely specified by our objective function. Different initialization and optimization techniques may affect the speed of obtaining the solution but the final solution itself is unique. We also show that our method approximates the local covariance structure of the data, as opposed to Linear Discriminant Analysis methods which use only global covariance structure. 2 The Approach of Collapsing Classes Given a set of Ò labeled examples ´Ü Ý µ, where Ü Ý ¾ ½ , we seek a space. We focus on Mahalanobis form metrics similarity measure between two points in ´Ü where Ü ´Ü µ ¾   Ü µÌ Ö and ´Ü  Ü µ (1) is a positive semidefinite (PSD) matrix. Intuitively, what we want from a good metric is that it makes elements of in the same class look close whereas those in different classes appear far. Our approach starts with the ideal case when this is true in the most optimistic sense: same class points are at zero distance, and different class points are infinitely far. Alternatively this can be viewed as mapping Ü via a linear projection Ï Ü ( Ï Ì Ï ), such that all points in the same class are mapped into the same point. This intuition is related to the analysis of spectral clustering [8], where the ideal case analysis of the algorithm results in all same cluster points being mapped to a single point. To learn a metric which approximates the ideal geometric setup described above, we introduce, for each training point, a conditional distribution over other points (as in [5]). such that Specifically, for each Ü we define a conditional distribution over points Ô ´ µ ½ È       (2) If all points in the same class were mapped to a single point and infinitely far from points in different classes, we would have the ideal “bi-level” distribution: Ô¼ ´ µ » Ò ½ ¼ Ý Ý Ý Ý (3) Furthermore, under very mild conditions, any set of points which achieves the above distribution must have the desired geometry. In particular, assume there are at least Ö · ¾ points in each class, where Ö rank ℄ (note that Ö Ö). Then Ô ´ µ Ô¼ ´ µ ( ) implies that under all points in the same class will be mapped to a single point, infinitely far from other class points 1 . 1 Proof sketch: The infinite separation between points of different classes follows simply from Thus it is natural to seek a matrix such that Ô ´ µ is as close as possible to Ô¼ ´ Since we are trying to match distributions, we minimize the KL divergence ÃÄ Ô¼ Ô℄: ÑÒ ÃÄ Ô¼ ´ µ Ô ´ µ℄ ¾ ÈË ×Ø µ. (4) The crucial property of this optimization problem is that it is convex in the matrix . To see this, first note that any convex linear combination of feasible solutions « ¼ · ´½   «µ ½ s.t. ¼ « ½ is still a feasible solution, since the set of PSD matrices is convex. Next, we can show that ´ µ alway has a greater cost than either of the endpoints. To do this, we rewrite the objective function ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ in the form 2 : È   ´ µ ÐÓ Ý Ý Ô´ µ · Ý ÐÓ Ý where we assumed for simplicity that classes are equi-probable, yielding a multiplicative constant. To see why ´ µ is convex, first note that ´Ü   Ü µÌ ´Ü   Ü µ is linear is a ÐÓ ÜÔ function of affine functions of in , and thus convex. The function ÐÓ and is therefore also convex (see [4], page 74). È 2.1 Convex Duality Since our optimization problem is convex, it has an equivalent convex dual. Specifically, the convex dual of Eq. (4) is the following entropy maximization problem: À Ô´ Ñ Ü Ô´ µ where Ú Ü ×Ø µ℄ Ô¼ ´ Ú ÚÌ ℄   µ Ô´ µ Ú ÚÌ ℄   Ü , À ¡℄ is the entropy function and we require È Ô´ µ ¼ ½ (5) . To prove this duality we start with the proposed dual and obtain the original problem in Equation 4 as its dual. Write the Lagrangian for the above problem (where is PSD) 3 Ä´Ô   ¬µ À ´Ô´ µµ   Ì Ö´ ´ Ô¼ ´ Ú ÚÌ ℄   Ô Ú ÚÌ ℄µµµ   ¬ ´ Ô´ µ   ½µ The dual function is defined as ´ ¬ µ Ñ ÒÔ Ä´Ô ¬ µ. To derive it, we first solve for the minimizing Ô by setting the derivative of Ä´Ô ¬ µ w.r.t. Ô´ µ equal to zero. Ì ¼ ½ · ÐÓ Ô´ µ · Ì Ö ´ Ú Ú µ   ¬ µ Ô´ µ ¬  ½ Ì Ö´ Ú ÚÌ µ Plugging solution È Ô thisThe dual to Ä Ô is¬ towe get ¬ . problem maximize È  Ì Ö Ú Ú . ¬ , yielding   ¬ ´ ´ µ ´ µ ½ Now note that Ì Ö´ ÐÓ Ú ÚÌ µ ÚÌ Ú ´ µ   ´ ̵ ´ µ  Ì Ö´ ¬ µ. È Ô¼ Ú ÚÌ ℄µ · Ȭ  We can do this analytically w.r.t. , so we can write Ý Ý   ÐÓ   which is minus our original target function. Since ´ µ should be maximized, and we have the desired duality result (identifying with ). ¼ » µ ¼ when Ý Ý . For a given point Ü , all the points in its class satisfy Ô´ µ ½. Due to the structure of Ô´ µ in Equation 2, and because it is obeyed for all points in ܼ × class, this implies that all the points in that class are equidistant from each other. However, it is easy to show that the maximum number of different equidistant points (also known as the equilateral dimension [1]) in Ö dimensions is Ö · ½. Since by assumption we have at least Ö · ¾ points in the class of Ü , and maps points into Ö , it follows that all points are identical. 2 À Ô¼ ´ µ℄. Up to an additive constant 3 We consider the equivalent problem of minimizing minus entropy. Ô¼ ´  È 2.1.1 Relation to covariance based and embedding methods The convex dual derived above reveals an interesting relation to covariance based learning methods. The sufficient statistics used by the algorithm are a set of Ò “spread” matrices. Each matrix is of the form Ô¼ ´ µ Ú ÚÌ ℄. The algorithm tries to find a maximum entropy distribution which matches these matrices when averaged over the sample. This should be contrasted with the covariance matrices used in metric learning such as Fisher’s Discriminant Analysis. The latter uses the within and between class covariance matrices. The within covariance matrix is similar to the covariance matrix used here, but is calculated with respect to the class means, whereas here it is calculated separately for every point, and is centered on this point. This highlights the fact that MCML is not based on Gaussian assumptions where it is indeed sufficient to calculate a single class covariance. Our method can also be thought of as a supervised version of the Stochastic Neighbour Embedding algorithm [7] in which the “target” distribution is Ô¼ (determined by the class labels) and the embedding points are not completely free but are instead constrained to be of the form Ï Ü . 2.2 Optimizing the Convex Objective Since the optimization problem in Equation 4 is convex, it is guaranteed to have only a single minimum which is the globally optimal solution4 . It can be optimized using any appropriate numerical convex optimization machinery; all methods will yield the same solution although some may be faster than others. One standard approach is to use interior point Newton methods. However, these algorithms require the Hessian to be calculated, which would require Ç´ µ resources, and could be prohibitive in our case. Instead, we have experimented with using a first order gradient method, specifically the projected gradient approach as in [10]. At each iteration we take a small step in the direction of the negative gradient of the objective function5, followed by a projection back onto the PSD cone. This projection is performed simply by taking the eigen-decomposition of and removing the components with negative eigenvalues. The algorithm is summarized below: Ý µ, Ò Input: Set of labeled data points ´Ü Output: PSD metric which optimally collapses classes. ½ Initialization: Initialize ¼ to some PSD matrix (randomly or using some initialization heuristic). Iterate: È Ø   ¯   ÔØ where   Ü Ô Ü ¯ Calculate the eigen-decomposition of È È Ù ÙÌ , then set Ø Ø Ø ¯ Set Ø·½ ´ µ ´ ´ ¼´ µ µ ´ µµ´ µ´Ü   Ü µÌ ·½ ·½ ·½ Ñ Ü´ ¼µÙ ÙÌ Of course in principle it is possible to optimize over the dual instead of the primal but in our case, if the training data consists of Ò points in Ö-dimensional space then the primal has only Ç´Ö¾ ¾µ variables while the dual has Ç´Ò¾ µ so it will almost always be more efficient to operate on the primal directly. One exception to this case may be the kernel version (Section 4) where the primal is also of size Ç´Ò¾ µ. 4 When the data can be exactly collapsed into single class points, there will be multiple solutions at infinity. However, this is very unlikely to happen in real data. 5 In the experiments, we used an Armijo like step size rule, as described in [3]. 3 Low Dimensional Projections for Feature Extraction The Mahalanobis distance under a metric can be interpreted as a linear projection of the original inputs by the square root of , followed by Euclidean distance in the projected space. Matrices which have less than full rank correspond to Mahalanobis distances based on low dimensional projections. Such metrics and the induced distances can be advantageous for several reasons [5]. First, low dimensional projections can substantially reduce the storage and computational requirements of a supervised method since only the projections of the training points must be stored and the manipulations at test time all occur in the lower dimensional feature space. Second, low dimensional projections re-represent the inputs, allowing for a supervised embedding or visualization of the original data. If we consider matrices with rank at most Õ , we can always represent them in the form Ï Ì Ï for some projection matrix Ï of size Õ ¢ Ö. This corresponds to projecting the original data into a Õ -dimensional space specified by the rows of Ï . However, rank constraints on a matrix are not convex [4], and hence the rank constrained problem is not convex and is likely to have local minima which make the optimization difficult and illdefined since it becomes sensitive to initial conditions and choice of optimization method. Luckily, there is an alternative approach to obtaining low dimensional projections, which does specify a unique solution by sequentially solving two globally tractable problems. This is the approach we follow here. First we solve for a (potentially) full rank metric using the convex program outlined above, and then obtain a low rank projection from it via spectral decomposition. This is done by diagonalizing into the form Ö Ú ÚÌ where ½ and Ú are the corre¾ Ö are eigenvalues of ½ sponding eigenvectors. To obtain a low rank projection we constrain the sum above to Õ include only the Õ terms corresponding to the Õ largest eigenvalues: Õ Ú ÚÌ . ½ The resulting projection is uniquely defined (up to an irrelevant unitary transformation) as Ô Ì Ì Ï ´ ÚÕ ℄. ½ Õ µ Ú½ È È Ô In general, the projection returned by this approach is not guaranteed to be the same as the projection corresponding to minimizing our objective function subject to a rank constraint on unless the optimal metric is of rank less than or equal to Õ . However, as we show in the experimental results, it is often the case that for practical problems the optimal has an eigen-spectrum which is rapidly decaying, so that many of its eigenvalues are indeed very small, suggesting the low rank solution will be close to optimal. 4 Learning Metrics with Kernels It is interesting to consider the case where Ü are mapped into a high dimensional feature space ´Ü µ and a Mahalanobis distance is sought in this space. We focus on the case where dot products in the feature space may be expressed via a kernel function, such that ´Ü µ ¡ ´Ü µ ´Ü Ü µ for some kernel . We now show how our method can be changed to accommodate this setting, so that optimization depends only on dot products. Consider the regularized target function: Ê ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ · Ì Ö´ µ (6) where the regularizing factor is equivalent to the Frobenius norm of the projection matrix Ï since Ì Ö´ µ Ï ¾. Deriving w.r.t. Ï we obtain Ï Í , where Í is some matrix which specifies Ï as a linear combination of sample points, and the Ø row of the matrix Ì Í Ì Í . Defining the PSD matrix is Ü . Thus is given by Í Ì Í , we can recast our optimization as looking for a PSD matrix , where the Mahalanobis distance is ´Ü   Ü µÌ Ì ´Ü   Ü µ ´   µÌ ´   µ, where we define Ü. This is exactly our original distance, with Ü replaced by , which depends only on dot products in space. The regularization term also depends solely on the dot products since Ì Ö´ µ Ì Ö´ Ì µ Ì Ö´ Ì µ Ì Ö´Ã µ, where à is the kernel matrix given Ì . Note that the trace is a linear function of , keeping the problem convex. by à Thus, as long as dot products can be represented via kernels, the optimization can be carried out without explicitly using the high dimensional space. To obtain a low dimensional solution, we follow the approach in Section 3: obtain a decomposition Î Ì Î 6 , and take the projection matrix to be the first Õ rows of ¼ Î . Ì , and thus Ì Ì As a first step, we calculate a matrix such that . Since is a correlation matrix for the rows of it can be shown (as in Kernel PCA) that its (left) eigenvectors are linear combinations of the rows of . Denoting by Î « the eigenvector matrix, we obtain, after some algebra, that « Ã Ì «. We conclude that « is an eigenvector of the matrix Ã Ì . Denote by « the matrix whose rows are orthonormal eigenvectors of Ã Ì . Then Î can be shown to be orthonormal if we set  ¼ « . The final projection will then be ¼ Î Ü « . Low dimensional Î projections will be obtained by keeping only the first Õ components of this projection. 5 Experimental Results We compared our method to several metric learning algorithms on a supervised classification task. Training data was first used to learn a metric over the input space. Then this metric was used in a 1-nearest-neighbor algorithm to classify a test set. The datasets we investigated were taken from the UCI repository and have been used previously in evaluating supervised methods for metric learning [10, 5]. To these we added the USPS handwritten digits (downsampled to 8x8 pixels) and the YALE faces [2] (downsampled to 31x22). The algorithms used in the comparative evaluation were ¯ ¯ ¯ Fisher’s Linear Discriminant Analysis (LDA), which projects on the eigenvectors   of ËϽ Ë where ËÏ Ë are the within and between class covariance matrices. The method of Xing et al [10] which minimizes the mean within class distance, while keeping the mean between class distance larger than one. Principal Component Analysis (PCA). There are several possibilities for scaling the PCA projections. We tested several, and report results of the empirically superior one (PCAW), which scales the projection components so that the covariance matrix after projection is the identity. PCAW often performs poorly on high dimensions, but globally outperforms all other variants. We also evaluated the kernel version of MCML with an RBF kernel (denoted by KMCML)7 . Since all methods allow projections to lower dimensions we compared performance for different projection dimensions 8 . The out-of sample performance results (based on 40 random splits of the data taking 70% for training and 30% for testing9 ) are shown in Figure 1. It can be seen that when used in a simple nearest-neighbour classifier, the metric learned by MCML almost always performs as well as, or significantly better than those learned by all other methods, across most dimensions. Furthermore, the kernel version of MCML outperforms the linear one on most datasets. Where Î is orthonormal, and the eigenvalues in are sorted in decreasing order. The regularization parameter and the width of the RBF kernel were chosen using 5 fold crossvalidation. KMCML was only evaluated for datasets with less than 1000 training points. 8 To obtain low dimensional mappings we used the approach outlined in Section 3. 9 Except for the larger datasets where 1000 random samples were used for training. 6 7 Wine 0.2 0.1 2 4 6 8 10 Projection Dimension 0.5 Error Rate Error Rate Balance Ion MCML PCAW LDA XING KMCML 0.3 0.25 0.4 0.2 0.15 0.3 0.1 0.2 0.05 10 20 30 Projection Dimension Soybean−small 0.1 1 2 3 Protein 4 Spam 0.6 0.4 0.25 0.5 0.4 0 5 10 15 0.2 0.3 0.2 0.15 20 5 10 Yale7 15 20 0.1 10 20 30 40 50 40 50 Digits Housing 0.6 0.4 0.4 0.3 0.4 0.35 0.2 0.3 0.2 0.1 0.25 10 20 30 40 50 5 10 15 10 20 30 Figure 1: Classification error rate on several UCI datasets, USPS digits and YALE faces, for different projection dimensions. Algorithms are our Maximally Collapsing Metric Learning (MCML), Xing et.al.[10], PCA with whitening transformation (PCAW) and Fisher’s Discriminant Analysis (LDA). Standard errors of the means shown on curves. No results given for XING on YALE and KMCML on Digits and Spam due to the data size. 5.1 Comparison to non convex procedures The methods in the previous comparison are all well defined, in the sense that they are not susceptible to local minima in the optimization. They also have the added advantage of obtaining projections to all dimensions using one optimization run. Below, we also compare the MCML results to the results of two non-convex procedures. The first is the Non Convex variant of MCML (NMCML): The objective function of MCML can be optimized w.r.t the projection matrix Ï , where Ï Ì Ï . Although this is no longer a convex problem, it is not constrained and is thus easier to optimize. The second non convex method is Neighbourhood Components Analysis (NCA) [5], which attempts to directly minimize the error incurred by a nearest neighbor classifier. For both methods we optimized the matrix Ï by restarting the optimization separately for each size of Ï . Minimization was performed using a conjugate gradient algorithm, initialized by LDA or randomly. Figure 2 shows results on a subset of the UCI datasets. It can be seen that the performance of NMCML is similar to that of MCML, although it is less stable, possibly due to local minima, and both methods usually outperform NCA. The inset in each figure shows the spectrum of the MCML matrix , revealing that it often drops quickly after a few dimensions. This illustrates the effectiveness of our two stage optimization procedure, and suggests its low dimensional solutions are close to optimal. 6 Discussion and Extensions We have presented an algorithm for learning maximally collapsing metrics (MCML), based on the intuition of collapsing classes into single points. MCML assumes that each class Balance Error Rate Wine MCML NMCML NCA 0.2 Soybean 0.4 0.1 0.3 0.1 0.05 0.2 0 2 4 6 8 10 Projection Dimension 0.1 1 0 2 Protein 3 4 5 10 Ion 15 20 Housing 0.4 0.6 0.3 0.5 0.35 0.2 0.4 0.3 0.3 0.25 5 10 15 20 0.1 10 20 30 5 10 15 Figure 2: Classification error for non convex procedures, and the MCML method. Eigen-spectra for the MCML solution are shown in the inset. may be collapsed to a single point, at least approximately, and thus is only suitable for unimodal class distributions (or for simply connected sets if kernelization is used). However, if points belonging to a single class appear in several disconnected clusters in input (or feature) space, it is unlikely that MCML could collapse the class into a single point. It is possible that using a mixture of distributions, an EM-like algorithm can be constructed to accommodate this scenario. The method can also be used to learn low dimensional projections of the input space. We showed that it performs well, even across a range of projection dimensions, and consistently outperforms existing methods. Finally, we have shown how the method can be extended to projections in high dimensional feature spaces using the kernel trick. The resulting nonlinear method was shown to improve classification results over the linear version. References Ò [1] N. Alon and P. Pudlak. Equilateral sets in ÐÔ . Geom. Funct. Anal., 13(3), 2003. [2] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In ECCV (1), 1996. [3] D.P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transaction on Automatic Control, 21(2):174–184, 1976. [4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004. [5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004. [6] T. Hastie, R. Tibshirani, and J.H. Friedman. The elements of statistical learning: data mining, inference, and prediction. New York: Springer-Verlag, 2001. [7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), 2002. [8] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2001. [9] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proc. of ECCV, 2002. [10] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2004.

4 0.83626312 153 nips-2005-Policy-Gradient Methods for Planning

Author: Douglas Aberdeen

Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1

5 0.82749999 184 nips-2005-Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1

6 0.82705903 114 nips-2005-Learning Rankings via Convex Hull Separation

7 0.81433618 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

8 0.81178248 50 nips-2005-Convex Neural Networks

9 0.8117733 177 nips-2005-Size Regularized Cut for Data Clustering

10 0.81056368 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

11 0.80809975 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

12 0.80301905 77 nips-2005-From Lasso regression to Feature vector machine

13 0.80237383 105 nips-2005-Large-Scale Multiclass Transduction

14 0.79237092 195 nips-2005-Transfer learning for text classification

15 0.78951472 23 nips-2005-An Application of Markov Random Fields to Range Sensing

16 0.78907669 160 nips-2005-Query by Committee Made Real

17 0.78906471 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining

18 0.78862727 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

19 0.7825861 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

20 0.78191084 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning