jmlr jmlr2012 jmlr2012-66 knowledge-graph by maker-knowledge-mining

66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation


Source: pdf

Author: Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon

Abstract: Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework—that of minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. Keywords: metric learning, kernel learning, linear transformation, matrix divergences, logdet divergence

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. [sent-12, score-0.573]

2 In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. [sent-13, score-0.603]

3 We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. [sent-15, score-0.367]

4 We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. [sent-17, score-0.287]

5 Keywords: metric learning, kernel learning, linear transformation, matrix divergences, logdet divergence. [sent-18, score-1.156]

6 One prominent approach has been to learn a distance metric between objects given additional side information such as pairwise similarity and dissimilarity constraints over the data. [sent-28, score-0.404]

7 A class of distance metrics that has shown excellent generalization properties is the learned Mahalanobis distance function (Davis et al. [sent-29, score-0.284]

8 The Mahalanobis distance can be viewed as a method in which data is subject to a linear transformation, and the goal of such metric learning methods is to learn the linear transformation for a given task. [sent-35, score-0.34]
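
To make the "linear transformation" view above concrete, here is a minimal NumPy sketch (not the authors' code; all names are illustrative): a Mahalanobis distance parameterized by W = L^T L is exactly the Euclidean distance after the linear map x → Lx.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
L = rng.standard_normal((3, d))   # linear transformation (may be low-rank)
W = L.T @ L                       # induced PSD Mahalanobis matrix

x, y = rng.standard_normal(d), rng.standard_normal(d)

# Mahalanobis distance under W ...
d_W = np.sqrt((x - y) @ W @ (x - y))
# ... equals the Euclidean distance after mapping both points through L.
d_euclid = np.linalg.norm(L @ x - L @ y)
assert np.isclose(d_W, d_euclid)
```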

9 To address the latter shortcoming, kernel learning algorithms typically attempt to learn a kernel matrix over the data. [sent-37, score-0.496]

10 However, many existing kernel learning methods are still limited in that the learned kernels do not generalize to new points (Kwok and Tsang, 2003; Kulis et al. [sent-39, score-0.422]

11 In this paper, we explore metric learning with linear transformations over arbitrarily high-dimensional spaces; as we will see, this is equivalent to learning a linear transformation kernel function φ(x)^T W φ(y) given an input kernel function φ(x)^T φ(y). [sent-47, score-0.662]

12 But perhaps most importantly, the loss function permits efficient kernelization, allowing efficient learning of a linear transformation in kernel space. [sent-52, score-0.319]

13 As a result, unlike transductive kernel learning methods, our method easily handles out-of-sample extensions, that is, it can be applied to unseen data. [sent-53, score-0.269]

14 We build upon our results of kernelization for the LogDet formulation to develop a general framework for learning linear transformation kernel functions and show that such kernels can be efficiently learned over a wide class of convex constraints and loss functions. [sent-54, score-0.61]

15 In our case, even though the matrix W may be infinite-dimensional, it can be fully represented in terms of the constrained data points, making it possible to compute the learned kernel function over arbitrary points. [sent-56, score-0.344]

16 We demonstrate the benefits of a generalized framework for inductive kernel learning by applying our techniques to the problem of inductive kernelized semi-supervised dimensionality reduction. [sent-57, score-0.369]

17 Finally, we apply our metric and kernel learning algorithms to a number of challenging learning problems, including ones from the domains of computer vision and text mining. [sent-59, score-0.397]

18 Unlike existing techniques, we can learn linear transformation-based distance or kernel functions over these domains, and we show that the resulting functions lead to improvements over state-of-the-art techniques for a variety of problems. [sent-60, score-0.346]

19 Related Work Most of the existing work in metric learning has been done in the Mahalanobis distance (or metric) learning paradigm, which has been found to be a sufficiently powerful class of metrics for a variety of data. [sent-62, score-0.302]

20 (2004) provided the first demonstration of Mahalanobis distance learning in kernel space. [sent-77, score-0.281]

21 The first category includes parametric approaches, where the learned kernel function is restricted to be of a specific form and then the relevant parameters are learned according to the provided data. [sent-84, score-0.407]

22 However, based on our choice of regularization and constraints, we show that our learned kernel matrix corresponds to a linear transformation kernel function parameterized by a PSD matrix. [sent-97, score-0.639]

23 In addition, our kernel learning method naturally provides kernelization for many existing metric learning methods. [sent-99, score-0.466]

24 (2010) showed kernelization for a class of metric learning algorithms including LMNN and NCA; as we will see, our result is more general and we can prove kernelization over a larger class of problems and can also reduce the number of parameters to be learned. [sent-101, score-0.294]

25 (2010), we showed the equivalence between a general class of kernel learning problems and metric learning problems. [sent-114, score-0.367]

26 Finally, we address limitations of the method when the amount of training data is large, and propose a modified algorithm to efficiently learn a kernel under such circumstances. [sent-120, score-0.249]

27 1 Mahalanobis Distances and Parameterized Kernels First we introduce the framework for metric and kernel learning that is employed in this paper. [sent-123, score-0.367]

28 , x_n], x_i ∈ R^{d_0} (when working in kernel space, the data matrix will be represented as Φ = [φ(x_1), . [sent-127, score-0.278]

29 As a result, we are interested in working in kernel space; that is, we will express the Mahalanobis distance in kernel space using an appropriate mapping φ from input to feature space: d_W(φ(x_i), φ(x_j)) = (φ(x_i) − φ(x_j))^T W (φ(x_i) − φ(x_j)). [sent-135, score-0.494]

30 As is standard with kernel-based algorithms, we require that this distance be computable given the ability to compute the kernel function κ_0(x, y) = φ(x)^T φ(y). [sent-137, score-0.281]

31 We can therefore equivalently pose the problem as learning a parameterized kernel function κ(x, y) = φ(x)^T W φ(y) given some input kernel function κ_0(x, y) = φ(x)^T φ(y). [sent-138, score-0.426]

32 In this paper, we assume that pairwise similarity and dissimilarity constraints are given over the data—that is, pairs of points that should be similar under the learned metric/kernel, and pairs of points that should be dissimilar under the learned metric/kernel. [sent-140, score-0.364]
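
Such pairwise constraints are usually derived from class labels: same-class pairs become similarity constraints and different-class pairs become dissimilarity constraints. The helper below is a hypothetical sketch of that common recipe, not the paper's exact sampling protocol.

```python
import numpy as np

def sample_pair_constraints(labels, num_pairs, seed=0):
    """Sample (i, j, +1/-1) triples: +1 for same-class (similar) pairs,
    -1 for different-class (dissimilar) pairs."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = len(labels)
    constraints = []
    while len(constraints) < num_pairs:
        i, j = rng.integers(0, n, size=2)
        if i == j:
            continue
        constraints.append((int(i), int(j), 1 if labels[i] == labels[j] else -1))
    return constraints

# Example: 6 points, 2 classes.
print(sample_pair_constraints([0, 0, 0, 1, 1, 1], num_pairs=4))
```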

33 Further, (2) considers simple similarity and dissimilarity constraints over the learned Mahalanobis distance, but other linear constraints are possible. [sent-166, score-0.3]

34 Note that the squared Euclidean distance in kernel space may be written as K(i, i) + K(j, j) − 2K(i, j), where K is the learned kernel matrix; equivalently, we may write the distance as tr(K(e_i − e_j)(e_i − e_j)^T), where e_i is the i-th canonical basis vector. [sent-173, score-0.756]
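
The identity quoted above is easy to check numerically; the snippet below is an illustrative verification on a random Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))
K = X @ X.T                                   # any kernel (Gram) matrix

i, j = 1, 4
e = np.zeros(len(K))
e[i], e[j] = 1.0, -1.0                        # e_i - e_j

lhs = K[i, i] + K[j, j] - 2 * K[i, j]
rhs = np.trace(K @ np.outer(e, e))
assert np.isclose(lhs, rhs)
```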

35 (3) This kernel learning problem was first proposed in the transductive setting in Kulis et al. [sent-177, score-0.269]

36 Note that problem (2) optimizes over a d × d matrix W , while the kernel learning problem (3) optimizes over an n × n matrix K. [sent-179, score-0.281]

37 The above theorem shows that the LogDet metric learning problem (2) can be solved implicitly by solving an equivalent kernel learning problem (3). [sent-187, score-0.367]

38 In fact, in Section 4 we show an equivalence between metric and kernel learning for a general class of regularization functions. [sent-188, score-0.367]

39 This equivalence implies that we can implicitly solve the metric learning problem by instead solving for the optimal kernel matrix K ∗ . [sent-210, score-0.401]

40 4 Generalizing to New Points In this section, we see how to generalize to new points using the learned kernel matrix K ∗ . [sent-215, score-0.385]

41 The distance between two points φ(x_i) and φ(x_j) that are in the training set can be computed directly from the learned kernel matrix as K(i, i) + K(j, j) − 2K(i, j). [sent-217, score-0.412]

42 (6) Thus, the expression above can be used to evaluate kernelized distances with respect to the learned kernel function between arbitrary data objects. [sent-224, score-0.375]

43 In summary, the connection between kernel learning and metric learning allows us to generalize our metrics to new points in kernel space. [sent-225, score-0.672]

44 This is performed by first solving the kernel learning problem for K, then using the learned kernel matrix and the input kernel function to compute learned distances using (6). [sent-226, score-0.896]

45 5 Kernel Learning Algorithm Given the connection between the Mahalanobis metric learning problem for the d × d matrix W and the kernel learning problem for the n × n kernel matrix K, we develop an algorithm for efficiently performing metric learning in kernel space. [sent-228, score-1.015]
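
The paper's Algorithm 1 is not reproduced in this summary, but rank-one updates of the form K ← K + β K(e_i − e_j)(e_i − e_j)^T K are typical of Bregman-projection methods for problem (3). The sketch below only illustrates that update form; β is a placeholder step size, not the exact projection parameter derived in the paper.

```python
import numpy as np

def rank_one_kernel_update(K, i, j, beta):
    """Apply K <- K + beta * K (e_i - e_j)(e_i - e_j)^T K.
    Here beta is a placeholder; in LogDet/Bregman-projection methods it is
    chosen so that a single distance constraint becomes tight."""
    e = np.zeros(K.shape[0])
    e[i], e[j] = 1.0, -1.0
    v = K @ e                                  # the update stays rank-one
    return K + beta * np.outer(v, v)

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))
K = X @ X.T + 1e-6 * np.eye(8)                 # input kernel K_0

K_new = rank_one_kernel_update(K, 0, 5, beta=-0.05)
dist_before = K[0, 0] + K[5, 5] - 2 * K[0, 5]
dist_after = K_new[0, 0] + K_new[5, 5] - 2 * K_new[0, 5]
print(dist_before, dist_after)                 # a negative beta shrinks this distance
```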

46 3, we proposed a LogDet divergence-based Mahalanobis metric learning problem (2) and an equivalent kernel learning problem (3). [sent-272, score-0.367]

47 In particular, we propose a method to efficiently learn an identity plus low-rank Mahalanobis distance matrix and its equivalent kernel function. [sent-280, score-0.351]

48 For learning the kernel function, the basis R = ΦJ can be selected by: 1) using a randomly sampled coefficient matrix J, 2) clustering Φ using kernel k-means or a spectral clustering method, 3) choosing a random subset of Φ, that is, the columns of J are random indicator vectors. [sent-310, score-0.615]
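
The third option (a random subset of Φ) amounts to choosing k training points as the basis, in which case the reduced kernel J^T K_0 J is just a k × k submatrix of K_0. A minimal sketch with hypothetical names:

```python
import numpy as np

def random_subset_basis(K0, k, seed=0):
    """Pick k random data points as the basis: the columns of J are indicator
    vectors, so the reduced kernel equals the corresponding k x k submatrix."""
    rng = np.random.default_rng(seed)
    n = K0.shape[0]
    idx = rng.choice(n, size=k, replace=False)
    J = np.zeros((n, k))
    J[idx, np.arange(k)] = 1.0
    return idx, J, J.T @ K0 @ J

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
K0 = X @ X.T
idx, J, K_basis = random_subset_basis(K0, k=8)
assert np.allclose(K_basis, K0[np.ix_(idx, idx)])
```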

49 A natural question is whether one can learn similar kernel functions with other loss functions, such as those considered previously in the literature for Mahalanobis metric learning. [sent-314, score-0.427]

50 In this section, we propose and analyze a general kernel matrix learning problem similar to (3) but using a more general class of loss functions. [sent-315, score-0.271]

51 As in the LogDet loss function case, we show that our kernel matrix learning problem is equivalent to learning a linear transformation (LT) kernel function with a specific loss function. [sent-316, score-0.59]

52 This implies that the learned LT kernel function can be naturally applied to new data. [sent-317, score-0.31]

53 Additionally, since a large class of metric learning methods can be seen as learning an LT kernel function, our result provides a constructive method for kernelizing these methods. [sent-318, score-0.405]

54 As in the LogDet formulation, we will first consider a transductive scenario, where we learn a kernel matrix K that is regularized against K_0 while satisfying the available side-information. [sent-328, score-0.339]

55 Recall that the LogDet divergence based loss function in the kernel matrix learning problem (3) is given by: D_ℓd(K, K_0) = tr(K K_0^{-1}) − log det(K K_0^{-1}) − n = tr(K_0^{-1/2} K K_0^{-1/2}) − log det(K_0^{-1/2} K K_0^{-1/2}) − n. [sent-329, score-0.334]
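
For reference, the expression above can be computed directly; the helper below is illustrative (not the paper's code) and uses a linear solve instead of an explicit inverse.

```python
import numpy as np

def logdet_divergence(K, K0):
    """D_ld(K, K0) = tr(K K0^{-1}) - log det(K K0^{-1}) - n,
    for symmetric positive-definite K and K0."""
    n = K.shape[0]
    M = np.linalg.solve(K0, K)        # K0^{-1} K; same trace and determinant as K K0^{-1}
    sign, logabsdet = np.linalg.slogdet(M)
    return np.trace(M) - logabsdet - n

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
K0 = A @ A.T + 5 * np.eye(5)
assert np.isclose(logdet_divergence(K0, K0), 0.0)   # the divergence of K0 to itself is zero
```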

56 We also generalize our constraints to include arbitrary constraints over the kernel matrix K rather than just the pairwise distance constraints in the above problem. [sent-334, score-0.527]

57 In general, such formulations are limited in that the learned kernel cannot readily be applied to new data points. [sent-342, score-0.31]

58 However, we will show that the above proposed problem is equivalent to learning linear transformation (LT) kernel functions. [sent-343, score-0.295]

59 Formally, an LT kernel function κ_W is a kernel function of the form κ(x, y) = φ(x)^T W φ(y), where W is a positive semi-definite (PSD) matrix. [sent-344, score-0.426]
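
One quick way to see that such a κ_W is a valid kernel: writing the PSD matrix as W = L^T L turns it into an ordinary inner product between transformed features Lφ(x). Below is an illustrative NumPy check on explicit features (names are illustrative, not the paper's code).

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
L = rng.standard_normal((2, d))
W = L.T @ L                                   # PSD parameterization

Phi = rng.standard_normal((10, d))            # rows play the role of phi(x_i)
K_W = Phi @ W @ Phi.T                         # LT kernel matrix kappa_W(x_i, x_j)

# Same matrix as plain inner products of the transformed features L phi(x).
assert np.allclose(K_W, (Phi @ L.T) @ (Phi @ L.T).T)
# An LT kernel matrix with PSD W is itself PSD (up to numerical error).
assert np.all(np.linalg.eigvalsh((K_W + K_W.T) / 2) > -1e-9)
```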

60 A natural way to learn an LT kernel function would be to learn the parameterization matrix W using the provided side-information. [sent-345, score-0.319]

61 g_i(Φ^T W Φ) ≤ b_i, 1 ≤ i ≤ m, (13) where, as before, the function f is the loss function and the functions g_i are the constraints that encode the side information. [sent-348, score-0.309]

62 The constraints g_i are assumed to be a function of the matrix Φ^T W Φ of learned kernel values over the training data. [sent-349, score-0.482]

63 Further, this result will yield insight into the type of kernel that is learned by the kernel learning problem. [sent-355, score-0.523]

64 Similarly, most of the existing metric learning formulations have a spectral function as their objective function. [sent-363, score-0.258]

65 Also, note that the constraints in (15) can be simplified to: g_i(αΦ^T Φ + Φ^T U L U^T Φ) ≤ b_i ≡ g_i(αK_0 + K_0^{1/2} L K_0^{1/2}) ≤ b_i. [sent-422, score-0.351]

66 Given that K = Φ^T W Φ, we can see that the learned kernel function is a linear transformation kernel; that is, κ(x_i, x_j) = φ(x_i)^T W φ(x_j). [sent-426, score-0.392]

67 Given a pair of new data points z_1 and z_2, we use the fact that the learned kernel is a linear transformation kernel, along with the first result of the theorem (W = αI_d + ΦSΦ^T), to compute the learned kernel as: φ(z_1)^T W φ(z_2) = α · κ_0(z_1, z_2) + k_1^T S k_2, where k_i = [κ_0(z_i, x_1), . [sent-427, score-0.702]
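
The identity above is what makes out-of-sample evaluation possible: only α, the matrix S, and input-kernel values against the training points are needed. Below is a small numerical check, with S a random symmetric stand-in rather than a learned matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 6, 15
Phi = rng.standard_normal((d, n))             # columns are training features phi(x_i)
alpha = 0.5
B = rng.standard_normal((n, n))
S = (B + B.T) / 2                             # stand-in for the learned S
W = alpha * np.eye(d) + Phi @ S @ Phi.T       # W = alpha*I_d + Phi S Phi^T

z1, z2 = rng.standard_normal(d), rng.standard_normal(d)
k1, k2 = Phi.T @ z1, Phi.T @ z2               # k_i = [kappa_0(z_i, x_1), ..., kappa_0(z_i, x_n)]

lhs = z1 @ W @ z2                             # phi(z1)^T W phi(z2)
rhs = alpha * (z1 @ z2) + k1 @ S @ k2         # alpha*kappa_0(z1, z2) + k1^T S k2
assert np.isclose(lhs, rhs)
```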

68 (18) Since LogDet divergence is also a spectral function, Theorem 4 is a generalization of Theorem 1 and implies kernelization for our metric learning formulation (2). [sent-431, score-0.387]

69 Then, we will explicitly enforce that the learned kernel is of this form. [sent-443, score-0.31]

70 Below, we show that for any spectral function f and linear constraints g_i(K) = tr(C_i K), (19) reduces to a problem that applies f and the g_i's to k × k matrices only, which provides significant scalability. [sent-450, score-0.294]

71 Note that (20) is over k × k matrices (after initial pre-processing) and is in fact similar to the kernel learning problem (12), but with a kernel K J of smaller size k × k, k ≪ n. [sent-461, score-0.426]

72 This enables us to naturally apply the above kernel learning problem in the inductive setting. [sent-463, score-0.251]

73 Then, (19) and (20) with g_i(K) = tr(C_i K) are equivalent to the following linear transformation kernel learning problem (analogous to the connection between (12) and (13)): min_{W ⪰ 0, L} f(W), s. [sent-466, score-0.399]

74 Note that, in contrast to (13), where the last constraint over W is achieved automatically, (21) requires this constraint on W to be satisfied during the optimization process, which leads to a reduced number of parameters for our kernel learning problem. [sent-474, score-0.293]

75 The above theorem shows that our reduced-parameter kernel learning method (19) also implicitly learns a linear transformation kernel function, hence we can generalize the learned kernel to unseen data points using an expression similar to (18). [sent-475, score-0.859]

76 Special Cases In the previous section, we proved a general result showing the connections between metric and kernel learning using a wide class of loss functions and constraints. [sent-477, score-0.391]

77 Now, the kernel learning problem (12) with loss function f_vN and linear constraints is: min_{K ⪰ 0} f_vN(K_0^{-1/2} K K_0^{-1/2}), s. [sent-488, score-0.317]

78 As f_vN is a spectral function, using Theorem 4, the above kernel learning problem is equivalent to the following metric learning problem: min_{W ⪰ 0} D_vN(W, I), s. [sent-491, score-0.465]
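
For reference, the von Neumann divergence D_vN is assumed here to be the standard quantum-relative-entropy form (the paper's own display is not reproduced in this summary):

```latex
% von Neumann matrix divergence (standard form):
D_{vN}(A, B) = \operatorname{tr}\bigl(A \log A - A \log B - A + B\bigr),
% so regularizing toward the identity gives
D_{vN}(W, I) = \operatorname{tr}\bigl(W \log W - W + I\bigr),
% where \log denotes the matrix logarithm, acting on the eigenvalues of W.
```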

79 In this case, f is once again a strictly convex spectral function, and its global minimum is α = 0, so we can use (12) to solve for the learned kernel K as min_{K ⪰ 0} ‖K K_0^{-1}‖_F^2, s. [sent-509, score-0.408]

80 g_i(K) ≤ b_i, 1 ≤ i ≤ m. The constraints g_i for this problem can be easily constructed by re-writing each of POLA's constraints as a function of Φ^T W Φ. [sent-511, score-0.342]

81 (23) Here we show that this problem can be efficiently solved for high dimensional data in its kernel space, hence kernelizing the metric learning methods introduced by Weinberger et al. [sent-520, score-0.405]

82 4 Trace-norm Based Inductive Semi-supervised Kernel Dimensionality Reduction (Trace-SSIKDR) Finally, we apply our framework to semi-supervised kernel dimensionality reduction, which provides a novel and practical application of our framework. [sent-538, score-0.257]

83 While there exists a variety of methods for kernel dimensionality reduction, most of these methods are unsupervised (e. [sent-539, score-0.257]

84 In contrast, we can use our kernel learning framework to implicitly learn a low-rank transformation of the feature vectors, which in turn provides a low-dimensional embedding of the data set. [sent-542, score-0.37]

85 However, using elementary linear algebra we can show that K and the learned kernel function can be computed efficiently without computing K_0^{1/2}, by maintaining S̃ = K_0^{-1/2} K K_0^{-1/2} from step to step. [sent-564, score-0.31]

86 Note also that the learned embedding x_i → K^{1/2} K_0^{-1/2} k_i, where k_i is a vector of input kernel function values between x_i and the training data, can be computed efficiently as x_i → Σ_k^{1/2} D_k V_k k_i, which does not require K_0^{1/2} explicitly. [sent-583, score-0.442]

87 6, we showed that the kernelized problem can be learned with respect to a reduced basis of size k, admitting a learned kernel parameterized by O(k^2) values. [sent-589, score-0.471]

88 A special case of our approach is the trace-norm based kernel function learning problem, which can be applied to the task of semi-supervised inductive kernel dimensionality reduction. [sent-592, score-0.508]

89 We evaluate the performance of our learned distance metrics or kernel functions in the context of a) classification accuracy for the k-nearest neighbor algorithm, and b) kernel dimensionality reduction. [sent-594, score-0.723]
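
A sketch of evaluation protocol a): k-nearest-neighbor classification under a learned Mahalanobis metric reduces to k-NN on linearly transformed features. The example below uses scikit-learn with a random stand-in for the learned transformation; the data set, neighbor count, and split are illustrative and not the paper's experimental setup.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stand-in for a learned Mahalanobis matrix W = L^T L; here L is random.
rng = np.random.default_rng(7)
L = rng.standard_normal((X.shape[1], X.shape[1]))

# k-NN under d_W is just k-NN on the transformed features x -> L x.
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_tr @ L.T, y_tr)
print("k-NN accuracy under the (stand-in) learned metric:", knn.score(X_te @ L.T, y_te))
```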

90 The upper and lower bounds for the similarity and dissimilarity constraints are determined empirically as the 1st and 99th percentiles of the distribution of distances computed using a baseline Mahalanobis distance parameterized by W0 . [sent-603, score-0.28]
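
A minimal sketch of that percentile recipe, assuming squared Mahalanobis distances under a baseline matrix W0 (the function and variable names are hypothetical):

```python
import numpy as np

def constraint_bounds(X, W0, n_pairs=2000, seed=0):
    """Empirical 1st/99th percentiles of baseline squared distances d_{W0}(x_i, x_j)
    over random pairs: the 1st percentile serves as the upper bound for similarity
    constraints, the 99th as the lower bound for dissimilarity constraints."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                              # drop accidental self-pairs
    diffs = X[i[keep]] - X[j[keep]]
    dists = np.einsum('nd,de,ne->n', diffs, W0, diffs)
    return np.percentile(dists, 1), np.percentile(dists, 99)

X = np.random.default_rng(9).standard_normal((500, 10))
sim_upper, dis_lower = constraint_bounds(X, W0=np.eye(10))
print(sim_upper, dis_lower)
```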

91 1 Low-Dimensional Data Sets First we evaluate our LogDet divergence based metric learning method (see Algorithm 1) on the standard UCI data sets in the low-dimensional (non-kernelized) setting, to directly compare with several existing metric learning methods. [sent-615, score-0.4]

92 In Figure 1 (a), we compare the LogDet Linear (K_0 equals the linear kernel) and LogDet Gaussian (K_0 equals a Gaussian kernel in kernel space) algorithms. [sent-616, score-0.426]

93 LogDet metric learning was run in input space (LogDet Linear) as well as in kernel space with a Gaussian kernel (LogDet Gaussian). [sent-636, score-0.58]

94 In general, for the Mpg321, Foxpro, and Iptables data sets, learned metrics yield only marginal gains over the baseline Euclidean distance measure. [sent-679, score-0.253]

95 We compute distances between images using learned kernels with three different base image kernels: 1) PMK: Grauman and Darrell's pyramid match kernel (Grauman and Darrell, 2007) applied to SIFT features, 2) CORR: the kernel designed by Zhang et al. [sent-768, score-0.525]

96 For the baseline kernel, we use the data-dependent kernel function proposed by Sindhwani et al. [sent-808, score-0.25]

97 Conclusions In this paper, we considered the general problem of learning a linear transformation of the input data and applied it to the problems of metric and kernel learning, with a focus on establishing connections between the two problems. [sent-815, score-0.449]

98 We also showed that our learned metric can be restricted to a small dimensional basis efficiently, thereby permitting scalability of our method to large data sets with high-dimensional feature spaces. [sent-817, score-0.279]

99 A key consequence of our analysis is that a number of existing approaches for Mahalanobis metric learning may be applied in kernel space using our kernel learning formulation. [sent-820, score-0.609]

100 Finally, we presented several experiments on benchmark data, high-dimensional vision, and text classification problems as well as a semi-supervised kernel dimensionality reduction problem, demonstrating our method compared to several existing state-of-the-art techniques. [sent-821, score-0.316]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('logdet', 0.692), ('kernel', 0.213), ('mahalanobis', 0.209), ('metric', 0.154), ('proccedings', 0.146), ('davis', 0.144), ('ulu', 0.144), ('tr', 0.132), ('jain', 0.127), ('hillon', 0.126), ('ransformation', 0.126), ('ulis', 0.126), ('kulis', 0.125), ('pmk', 0.117), ('sing', 0.097), ('learned', 0.097), ('etric', 0.089), ('transformation', 0.082), ('jlj', 0.081), ('gi', 0.081), ('dw', 0.078), ('spectral', 0.075), ('ernel', 0.075), ('kernelization', 0.07), ('ei', 0.069), ('distance', 0.068), ('bi', 0.066), ('divergence', 0.063), ('inear', 0.063), ('lmnn', 0.062), ('fs', 0.057), ('constraints', 0.057), ('transductive', 0.056), ('fvn', 0.054), ('latex', 0.054), ('dissimilarity', 0.054), ('metrics', 0.051), ('ci', 0.047), ('mcml', 0.046), ('grauman', 0.046), ('dk', 0.045), ('dimensionality', 0.044), ('weinberger', 0.044), ('lt', 0.043), ('kernels', 0.042), ('earning', 0.041), ('generalize', 0.041), ('embedding', 0.039), ('digits', 0.039), ('kernelizing', 0.038), ('corr', 0.038), ('inductive', 0.038), ('baseline', 0.037), ('neighbor', 0.037), ('learn', 0.036), ('vi', 0.036), ('pola', 0.036), ('zti', 0.036), ('usps', 0.036), ('kernelized', 0.036), ('similarity', 0.035), ('matrix', 0.034), ('neumann', 0.034), ('vk', 0.034), ('bregman', 0.031), ('xi', 0.031), ('von', 0.03), ('darrell', 0.03), ('dhillon', 0.03), ('text', 0.03), ('distances', 0.029), ('nn', 0.029), ('existing', 0.029), ('constraint', 0.029), ('basis', 0.028), ('slack', 0.028), ('pyramid', 0.028), ('euclidean', 0.028), ('dkvk', 0.027), ('foxpro', 0.027), ('iptables', 0.027), ('globerson', 0.027), ('clustering', 0.026), ('formulation', 0.025), ('semide', 0.025), ('berg', 0.024), ('nips', 0.024), ('loss', 0.024), ('dissimilar', 0.024), ('blur', 0.023), ('pami', 0.023), ('lsa', 0.023), ('spmk', 0.023), ('min', 0.023), ('uci', 0.023), ('nearest', 0.023), ('det', 0.023), ('goldberger', 0.022), ('optimization', 0.022), ('eigenvectors', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

Author: Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon

Abstract: Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework—that of minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. Keywords: divergence metric learning, kernel learning, linear transformation, matrix divergences, logdet

2 0.1751707 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel

Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor

3 0.17423247 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

Author: Yiming Ying, Peng Li

Abstract: The main theme of this paper is to develop a novel eigenvalue optimization framework for learning a Mahalanobis metric. Within this context, we introduce a novel metric learning approach called DML-eig which is shown to be equivalent to a well-known eigenvalue optimization problem called minimizing the maximal eigenvalue of a symmetric matrix (Overton, 1988; Lewis and Overton, 1996). Moreover, we formulate LMNN (Weinberger et al., 2005), one of the state-of-the-art metric learning methods, as a similar eigenvalue optimization problem. This novel framework not only provides new insights into metric learning but also opens new avenues to the design of efficient metric learning algorithms. Indeed, first-order algorithms are developed for DML-eig and LMNN which only need the computation of the largest eigenvector of a matrix per iteration. Their convergence characteristics are rigorously established. Various experiments on benchmark data sets show the competitive performance of our new approaches. In addition, we report an encouraging result on a difficult and challenging face verification data set called Labeled Faces in the Wild (LFW). Keywords: metric learning, convex optimization, semi-definite programming, first-order methods, eigenvalue optimization, matrix factorization, face verification

4 0.077980705 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis

Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir

Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines

5 0.070504621 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization

Author: Francesco Orabona, Luo Jie, Barbara Caputo

Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale

6 0.069025837 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

7 0.063681222 97 jmlr-2012-Regularization Techniques for Learning with Matrices

8 0.063365318 44 jmlr-2012-Feature Selection via Dependence Maximization

9 0.060943432 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment

10 0.059388459 73 jmlr-2012-Multi-task Regression using Minimal Penalties

11 0.058173608 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

12 0.056989126 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems

13 0.055491131 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

14 0.055180389 100 jmlr-2012-Robust Kernel Density Estimation

15 0.054996308 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

16 0.053449422 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

17 0.051769372 4 jmlr-2012-A Kernel Two-Sample Test

18 0.047250833 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

19 0.046982363 12 jmlr-2012-Active Clustering of Biological Sequences

20 0.044638887 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.222), (1, -0.029), (2, 0.175), (3, 0.211), (4, -0.179), (5, 0.009), (6, -0.208), (7, 0.082), (8, -0.04), (9, -0.088), (10, 0.032), (11, 0.015), (12, -0.171), (13, -0.028), (14, 0.073), (15, 0.046), (16, -0.01), (17, -0.035), (18, 0.083), (19, -0.007), (20, -0.055), (21, 0.129), (22, 0.052), (23, -0.147), (24, 0.066), (25, -0.124), (26, -0.03), (27, -0.084), (28, -0.087), (29, 0.029), (30, 0.017), (31, 0.025), (32, -0.084), (33, -0.021), (34, -0.006), (35, 0.026), (36, -0.054), (37, -0.057), (38, 0.021), (39, 0.05), (40, 0.061), (41, -0.098), (42, -0.083), (43, -0.178), (44, 0.041), (45, -0.029), (46, 0.031), (47, 0.055), (48, 0.068), (49, 0.1)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91768533 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

Author: Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon

Abstract: Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework—that of minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. Keywords: divergence metric learning, kernel learning, linear transformation, matrix divergences, logdet

2 0.77593625 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

Author: Yiming Ying, Peng Li

Abstract: The main theme of this paper is to develop a novel eigenvalue optimization framework for learning a Mahalanobis metric. Within this context, we introduce a novel metric learning approach called DML-eig which is shown to be equivalent to a well-known eigenvalue optimization problem called minimizing the maximal eigenvalue of a symmetric matrix (Overton, 1988; Lewis and Overton, 1996). Moreover, we formulate LMNN (Weinberger et al., 2005), one of the state-of-the-art metric learning methods, as a similar eigenvalue optimization problem. This novel framework not only provides new insights into metric learning but also opens new avenues to the design of efficient metric learning algorithms. Indeed, first-order algorithms are developed for DML-eig and LMNN which only need the computation of the largest eigenvector of a matrix per iteration. Their convergence characteristics are rigorously established. Various experiments on benchmark data sets show the competitive performance of our new approaches. In addition, we report an encouraging result on a difficult and challenging face verification data set called Labeled Faces in the Wild (LFW). Keywords: metric learning, convex optimization, semi-definite programming, first-order methods, eigenvalue optimization, matrix factorization, face verification

3 0.59381407 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel

Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor

4 0.50585616 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

Author: Neil D. Lawrence

Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.

5 0.41899267 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection

6 0.37987331 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization

7 0.37308168 44 jmlr-2012-Feature Selection via Dependence Maximization

8 0.37065369 12 jmlr-2012-Active Clustering of Biological Sequences

9 0.36156955 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis

10 0.35405466 73 jmlr-2012-Multi-task Regression using Minimal Penalties

11 0.34376153 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems

12 0.33702129 100 jmlr-2012-Robust Kernel Density Estimation

13 0.33603588 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

14 0.33575469 4 jmlr-2012-A Kernel Two-Sample Test

15 0.32198748 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices

16 0.32038286 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space

17 0.31688839 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization

18 0.31272414 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

19 0.30466974 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

20 0.28421217 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.02), (21, 0.026), (26, 0.466), (27, 0.011), (29, 0.02), (35, 0.024), (49, 0.014), (56, 0.017), (64, 0.032), (69, 0.02), (75, 0.076), (79, 0.011), (81, 0.014), (92, 0.061), (96, 0.099)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98331469 79 jmlr-2012-Oger: Modular Learning Architectures For Large-Scale Sequential Processing

Author: David Verstraeten, Benjamin Schrauwen, Sander Dieleman, Philemon Brakel, Pieter Buteneers, Dejan Pecevski

Abstract: Oger (OrGanic Environment for Reservoir computing) is a Python toolbox for building, training and evaluating modular learning architectures on large data sets. It builds on MDP for its modularity, and adds processing of sequential data sets, gradient descent training, several crossvalidation schemes and parallel parameter optimization methods. Additionally, several learning algorithms are implemented, such as different reservoir implementations (both sigmoid and spiking), ridge regression, conditional restricted Boltzmann machine (CRBM) and others, including GPU accelerated versions. Oger is released under the GNU LGPL, and is available from http: //organic.elis.ugent.be/oger. Keywords: Python, modular architectures, sequential processing

2 0.91275889 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods

Author: Zhiwei Qin, Donald Goldfarb

Abstract: We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty l1 /l2 -norm and the l1 /l∞ -norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions 1 of these algorithms require O( √ε ) iterations to obtain an ε-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian

same-paper 3 0.90462536 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation

Author: Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon

Abstract: Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework—that of minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. Keywords: divergence metric learning, kernel learning, linear transformation, matrix divergences, logdet

4 0.90331769 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

Author: Michael U. Gutmann, Aapo Hyvärinen

Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image

5 0.62742531 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel

Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor

6 0.60103047 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization

7 0.5920915 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

8 0.58247751 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

9 0.56245267 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

10 0.55513704 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression

11 0.55290556 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

12 0.54404986 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

13 0.53780353 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

14 0.53354985 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

15 0.53354394 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

16 0.52954805 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

17 0.52562106 103 jmlr-2012-Sampling Methods for the Nyström Method

18 0.51772463 54 jmlr-2012-Large-scale Linear Support Vector Regression

19 0.4966301 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

20 0.49500695 100 jmlr-2012-Robust Kernel Density Estimation