nips nips2009 nips2009-191 knowledge-graph by maker-knowledge-mining

191 nips-2009-Positive Semidefinite Metric Learning with Boosting


Source: pdf

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton Hengel

Abstract: The learning of appropriate distance metrics is a critical problem in image classification and retrieval. In this work, we propose a boosting-based technique, termed BoostMetric, for learning a Mahalanobis distance metric. One of the primary difficulties in learning such a metric is to ensure that the Mahalanobis matrix remains positive semidefinite. Semidefinite programming is sometimes used to enforce this constraint, but does not scale well. BoostMetric is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. BoostMetric thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting method is easy to implement, does not require tuning, and can accommodate various types of constraints. Experiments on various datasets show that the proposed algorithm compares favorably to state-of-the-art methods in terms of classification accuracy and running time. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this work, we propose a boosting-based technique, termed BoostMetric, for learning a Mahalanobis distance metric. [sent-2, score-0.056]

2 One of the primary difficulties in learning such a metric is to ensure that the Mahalanobis matrix remains positive semidefinite. [sent-3, score-0.138]

3 BoostMetric is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. [sent-5, score-0.069]
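As a concrete illustration of this observation (my own sketch, not code from the paper), the eigendecomposition of a symmetric p.s.d. matrix yields exactly such a combination: the weights are the positive eigenvalues and each Zj = vj vj⊤ is rank-one with unit trace.

```python
import numpy as np

def trace_one_rank_one_decomposition(X, tol=1e-10):
    """Decompose a symmetric p.s.d. matrix X as sum_j w_j * Z_j with
    w_j >= 0, rank(Z_j) = 1 and trace(Z_j) = 1, via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(X)        # X = V diag(lambda) V^T
    weights, bases = [], []
    for lam, v in zip(eigvals, eigvecs.T):
        if lam > tol:                           # keep strictly positive eigenvalues
            weights.append(lam)
            bases.append(np.outer(v, v))        # rank-one, trace = ||v||^2 = 1
    return weights, bases

# sanity check: the weighted sum reconstructs X
X = np.array([[2.0, 1.0], [1.0, 2.0]])
w, Z = trace_one_rank_one_decomposition(X)
assert np.allclose(X, sum(wj * Zj for wj, Zj in zip(w, Z)))
```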

4 BoostMetric thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. [sent-6, score-0.084]

5 Introduction: It has been an extensively sought-after goal to learn an appropriate distance metric in image classification and retrieval problems using simple and efficient algorithms [1–5]. [sent-9, score-0.201]

6 Such distance metrics are essential to the effectiveness of many critical algorithms such as k-nearest neighbor (kNN), k-means clustering, and kernel regression, for example. [sent-10, score-0.091]

7 We show in this work how a Mahalanobis metric is learned from proximity comparisons among triples of training data. [sent-11, score-0.111]

8 Therefore, methods for learning a Mahalanobis distance typically result in constrained semidefinite programs. [sent-19, score-0.056]

9 If we let ai, i = 1, 2, · · · , represent a set of points in RD, the training data consist of a set of constraints upon the relative distances between these points, S = {(ai, aj, ak) | distij < distik}, where distij measures the distance between ai and aj. [sent-24, score-0.709]

10 The Mahalanobis distance between two vectors takes the form: ‖ai − aj‖X = √((ai − aj)⊤X(ai − aj)), with X ⪰ 0 a p.s.d. matrix. [sent-26, score-0.534]
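A minimal sketch of evaluating this distance for a given p.s.d. matrix X (illustration only; with X = I it reduces to the ordinary Euclidean distance):

```python
import numpy as np

def mahalanobis_distance(a_i, a_j, X):
    """dist_X(a_i, a_j) = sqrt((a_i - a_j)^T X (a_i - a_j)), assuming X is p.s.d."""
    d = a_i - a_j
    return float(np.sqrt(d @ X @ d))

a, b = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(mahalanobis_distance(a, b, np.eye(2)))   # sqrt(5), the Euclidean case
```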

11 Constraints such as those above often arise when it is known that ai and aj belong to the same class of data points while ai , ak belong to different classes. [sent-31, score-0.395]

12 This requirement has led to the development of a number of methods for learning a Mahalanobis distance which rely upon constrained semidefinite programming. [sent-39, score-0.078]

13 matrix from a set of constraints upon pairwise-distance comparisons. [sent-43, score-0.064]

14 Xing et al. [4] first proposed to learn a Mahalanobis metric for clustering using convex optimization. [sent-45, score-0.142]

15 The algorithm maximizes the distance between points in the dissimilarity set under the constraint that the distance between points in the similarity set is upper-bounded. [sent-47, score-0.112]

16 Neighborhood component analysis (NCA) [6] and large margin nearest neighbor (LMNN) [7] learn a metric by maintaining consistency in the data’s neighborhood and keeping a large margin at the boundaries of different classes. [sent-48, score-0.22]

17 It has been shown in [7] that LMNN delivers state-of-the-art performance among distance metric learning algorithms. [sent-49, score-0.149]

18 Instead of the hinge loss used in LMNN and PSDBoost, we use the exponential loss function in order to derive an AdaBoost-like optimization procedure. [sent-51, score-0.092]

19 Reformulation-linearization is a typical technique in convex optimization to relax and convexify the problem [11]. [sent-55, score-0.081]

20 In metric learning, much existing work instead learns X = LL⊤ for seeking a global optimum, e.g. [sent-56, score-0.093]

21 PSDBoost [9] converts the particular semidefinite program in metric learning into a sequence of linear programs (LP’s). [sent-64, score-0.093]

22 Another problem is that PSDBoost needs to store all the weak learners (the rank-one matrices) during the optimization. [sent-68, score-0.079]

23 Based on the observation from [9] that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices, we propose BoostMetric for learning a p.s.d. matrix. [sent-71, score-0.069]

24 At each iteration, only the largest eigenvalue and its corresponding eigenvector are needed. [sent-81, score-0.05]

25 We demonstrate learning a Mahalanobis metric by proximity comparison constraints. [sent-83, score-0.093]

26 No sophisticated optimization techniques such as LP solvers are involved. [sent-88, score-0.054]

27 Throughout this paper, a matrix is denoted by a bold upper-case letter (X); a column vector is denoted by a bold lower-case letter (x). [sent-91, score-0.085]

28 We use X ⪰ 0 to indicate that matrix X is positive semidefinite. [sent-95, score-0.045]

29 This technique has been used widely in convex optimization and machine learning such as [12]. [sent-100, score-0.051]

30 If X is diagonal, the problem corresponds to learning a metric in which the different features are given different weights. [sent-102, score-0.093]

31 In the framework of large-margin learning, we want to maximize the distance between distij and distik . [sent-106, score-0.131]

32 To simplify notation, we rewrite the difference between dist²ij and dist²ik as dist²ij − dist²ik = ⟨Ar, X⟩, with Ar = (ai − ak)(ai − ak)⊤ − (ai − aj)(ai − aj)⊤, (2) r = 1, · · · , |S|, where |S| is the size of the set S. [sent-108, score-0.48]
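For concreteness, a small sketch of forming the matrices Ar from equation (2); the representation of the triplet set as index tuples into a data matrix is an assumption made here for illustration:

```python
import numpy as np

def build_A(data, triplets):
    """A_r = (a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T for every
    triplet (i, j, k) in S, where a_j is similar to a_i and a_k is dissimilar."""
    A = []
    for i, j, k in triplets:
        d_ik = data[i] - data[k]
        d_ij = data[i] - data[j]
        A.append(np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij))
    return A
```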

33 By employing the exponential loss, we want to optimize min log ∑_{r=1}^{|S|} exp(−ρr) + v Tr(X), s.t. [sent-121, score-0.047]

34 This transform does not change the original optimization problem of the sum of exponential losses because the logarithmic function is strictly monotonically increasing. [sent-125, score-0.072]

35 Without this regularization, one can always multiply X by an arbitrarily large factor to make the exponential loss approach zero when all the constraints are satisfied. [sent-127, score-0.062]
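Putting the exponential loss and the trace regularizer together, a hedged sketch of evaluating the primal cost for a candidate X (the function name and the stable log-sum-exp are my own choices; v is the regularization trade-off):

```python
import numpy as np

def primal_objective(A_list, X, v):
    """log sum_r exp(-rho_r) + v * trace(X), with rho_r = <A_r, X>."""
    rho = np.array([np.sum(A_r * X) for A_r in A_list])  # <A_r, X> for symmetric matrices
    m = np.max(-rho)                                     # stable log-sum-exp
    return m + np.log(np.sum(np.exp(-rho - m))) + v * np.trace(X)
```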

36 This regularization must be introduced for deriving a meaningful dual problem, as we show later. [sent-132, score-0.045]

37 We can decompose X into X = ∑_{j=1}^{J} wj Zj, with wj ≥ 0, rank(Zj) = 1 and Tr(Zj) = 1, ∀j, so that ρr = ⟨Ar, X⟩ = ⟨Ar, ∑_{j=1}^{J} wj Zj⟩ = ∑_{j=1}^{J} wj ⟨Ar, Zj⟩. [sent-133, score-0.693]
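The quantities Hrj and ρr in this decomposition are straightforward to compute; a sketch (assuming the Ar and Zj are stored as lists of symmetric matrices):

```python
import numpy as np

def inner(A, Z):
    """Matrix inner product <A, Z> = trace(A^T Z); here both matrices are symmetric."""
    return float(np.sum(A * Z))

def compute_margins(A_list, Z_list, w):
    """H[r, j] = <A_r, Z_j> and rho_r = <A_r, X> = sum_j w_j * H[r, j]."""
    H = np.array([[inner(A_r, Z_j) for Z_j in Z_list] for A_r in A_list])
    rho = H @ np.asarray(w)
    return H, rho
```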

38 Here Hrj is a shorthand for Hrj = ⟨Ar, Zj⟩. [sent-134, score-0.231]

39 The Lagrange Dual Problem: We now derive the Lagrange dual of the problem we are interested in. [sent-138, score-0.045]

40 In order to derive the dual of (P1), we write its Lagrangian L(w, ρ, u, p) = log ∑_{r=1}^{|S|} exp(−ρr) + v 1⊤w + ∑_{r=1}^{|S|} ur (ρr − Hr: w) − p⊤w, (4) with p ≥ 0. [sent-142, score-0.173]

41 The dual problem is obtained by finding the saddle point of L. [sent-144, score-0.045]

42 inf_{w,ρ} L = inf_ρ [ log ∑_{r=1}^{|S|} exp(−ρr) + u⊤ρ ] (L1) + inf_w ( v1⊤ − ∑_{r=1}^{|S|} ur Hr: − p⊤ ) w (L2) = −∑_{r=1}^{|S|} ur log ur. [sent-147, score-0.554]

43 The infimum of L1 is found by setting its first derivative to zero, and we have: inf_ρ L1 = −∑_r ur log ur if u ≥ 0 and 1⊤u = 1, and −∞ otherwise. [sent-148, score-0.323]

44 (5) The Lagrange dual problem of (P1) is an entropy maximization problem, which writes max_u −∑_{r=1}^{|S|} ur log ur, s.t. [sent-152, score-0.339]

45 Let us start from some basic knowledge of column generation because our coordinate descent strategy is inspired by column generation. [sent-160, score-0.09]

46 If all the weak learners Zj (j = 1, . . . , J) and hence the entire matrix H were known, then either the primal (P1) or the dual (D1) could be trivially solved (at least in theory) because both are convex optimization problems. [sent-164, score-0.159]

47 In particular, the primal problem is a convex minimization with simple nonnegativity constraints. [sent-166, score-0.062]

48 In convex optimization, column generation is a technique designed to overcome this difficulty. [sent-169, score-0.063]

49 Instead of directly solving the primal problem (P1), we find the most violated constraint in the dual (D1) iteratively for the current solution and add this constraint to the optimization problem. [sent-170, score-0.118]

50 For this purpose, we need to solve Ẑ = argmax_Z ∑_{r=1}^{|S|} ur ⟨Ar, Z⟩, s.t. [sent-171, score-0.147]

51 Now we move on to derive a coordinate descent optimization procedure. [sent-176, score-0.054]

52 4 Coordinate Descent Optimization We show how an AdaBoost-like optimization procedure can be derived for our metric learning problem. [sent-178, score-0.124]

53 As in AdaBoost, we need to solve for the primal variables wj given all the weak learners up to iteration j. [sent-179, score-0.353]

54 Optimizing for wj: Since we are interested in one-at-a-time coordinate-wise optimization, we keep w1, w2, . . . , wj−1 fixed. [sent-180, score-0.231]

55 The cost function of the primal problem is (in the following derivation we drop those terms irrelevant to the variable wj) Cp(wj) = log ∑_{r=1}^{|S|} exp(−ρr^{j−1}) · exp(−Hrj wj) + v wj. [sent-184, score-0.504]

56 Clearly, Cp is convex in wj and hence there is only one minimum that is also globally optimal. [sent-185, score-0.251]

57 The derivative of Cp with respect to wj vanishes at optimality, which results in ∑_{r=1}^{|S|} (Hrj − v) ur^{j−1} exp(−wj Hrj) = 0. [sent-189, score-0.231]

58 Input: an interval [wl, wu] known to contain the optimal value of wj and a convergence tolerance ε > 0. [sent-191, score-0.258]

59 We instead use bisection to search for the optimal wj . [sent-200, score-0.292]
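A sketch of that bisection step for the optimality condition stated above (assumptions: u holds the current dual weights, H_j is the j-th column of H, and the supplied interval [w_l, w_u] brackets the root):

```python
import numpy as np

def solve_wj_bisection(H_j, u, v, w_l, w_u, eps=1e-6):
    """Bisection search for w_j in [w_l, w_u] solving
    sum_r (H_rj - v) * u_r * exp(-w_j * H_rj) = 0."""
    def lhs(w):
        return float(np.sum((H_j - v) * u * np.exp(-w * H_j)))
    assert lhs(w_l) * lhs(w_u) <= 0, "the interval must bracket the root"
    while w_u - w_l > eps:
        w_m = 0.5 * (w_l + w_u)
        if lhs(w_m) * lhs(w_l) > 0:   # root lies in the upper half
            w_l = w_m
        else:
            w_u = w_m
    return 0.5 * (w_l + w_u)
```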

60 The bisection method is one of the root-finding algorithms. [sent-201, score-0.061]

61 Updating u: The rule for updating the dual variable u can be easily obtained from (6). [sent-215, score-0.045]

62 At iteration j, we have ur^j ∝ exp(−ρr^j) ∝ ur^{j−1} exp(−Hrj wj), and ∑_{r=1}^{|S|} ur^j = 1, derived from (6). [sent-216, score-0.481]

63 So once wj is calculated, we can update u as ur^j = ur^{j−1} exp(−Hrj wj) (up to normalization), r = 1, . . . [sent-217, score-0.519]
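A one-line sketch of this update, with explicit renormalization so that the weights sum to one as required by the dual feasibility condition above:

```python
import numpy as np

def update_dual_weights(u_prev, H_j, w_j):
    """u_r^j is proportional to u_r^{j-1} * exp(-H_rj * w_j), renormalized to sum to 1."""
    u = u_prev * np.exp(-H_j * w_j)
    return u / u.sum()
```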

64 We have ∑_{r=1}^{|S|} ur ⟨Ar, Z⟩ = ⟨∑_{r=1}^{|S|} ur Ar, Z⟩ = ξ⊤ (∑_{r=1}^{|S|} ur Ar) ξ. [sent-225, score-0.441]

65 By denoting Â = ∑_{r=1}^{|S|} ur Ar, (10) the base learning optimization equals: max_ξ ξ⊤Âξ, s.t. [sent-226, score-0.178]

66 It is clear that the largest eigenvalue of Â, λmax(Â), and its corresponding eigenvector ξ1 give the solution to the above problem, since Â is symmetric. [sent-229, score-0.05]
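A sketch of this base-learning step: form the weighted matrix Â and take the eigenvector of its largest eigenvalue; the new weak learner is then the rank-one, trace-one matrix ξ1ξ1⊤ (a sketch under the assumptions above, not the authors' code):

```python
import numpy as np

def best_rank_one_base(A_list, u):
    """Return lambda_max(A_hat) and Z_hat = xi xi^T, where xi is the leading
    eigenvector of A_hat = sum_r u_r * A_r (symmetric)."""
    A_hat = sum(u_r * A_r for u_r, A_r in zip(u, A_list))
    eigvals, eigvecs = np.linalg.eigh(A_hat)   # eigenvalues in ascending order
    xi = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    return eigvals[-1], np.outer(xi, xi)
```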

67 We use PCA to decrease the dimensionality before training on these datasets (datasets 2-6). [sent-238, score-0.059]

68 Input: training set triplets (ai, aj, ak) ∈ S; compute Ar, r = 1, 2, · · · , using (2). [sent-241, score-0.253]

69 We have used the same mechanism to generate training triplets as described in [7]. [sent-253, score-0.1]

70 Briefly, for each training point ai, k nearest neighbors that have the same label as yi (targets), as well as k nearest neighbors that have different labels from yi (imposers), are found. [sent-254, score-0.178]

71 We then construct triplets from ai and its corresponding targets and imposers. [sent-255, score-0.194]
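A brute-force sketch of this triplet-generation scheme (the choice k = 3 and the use of Euclidean distances to rank neighbors are assumptions for illustration; labels is a 1-D NumPy array):

```python
import numpy as np

def generate_triplets(data, labels, k=3):
    """For each point a_i, pair its k nearest same-label neighbors (targets) with its
    k nearest different-label neighbors (imposers) to form triplets (i, j, m)."""
    D = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)
    triplets = []
    for i in range(len(data)):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        diff = np.where(labels != labels[i])[0]
        targets = same[np.argsort(D[i, same])][:k]
        imposers = diff[np.argsort(D[i, diff])][:k]
        triplets.extend((i, j, m) for j in targets for m in imposers)
    return triplets
```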

72 From Table 1 we can conclude: (1) BoostMetric consistently improves kNN classification over the Euclidean distance on most datasets. [sent-264, score-0.056]

73 So learning a Mahalanobis metric based upon the large margin concept does lead to improvements in kNN classification. [sent-265, score-0.149]

74 The coordinate-wise gradient descent optimization strategy of AdaBoost leads to an ℓ1-norm regularized maximum margin classifier [17]. [sent-275, score-0.088]

75 Computational time: As we discussed, one major issue in learning a Mahalanobis distance is the heavy computational cost caused by the semidefiniteness constraint. [sent-282, score-0.056]

76 Table 1: Test classification error rates (%) of a 3-nearest neighbor classifier on benchmark datasets. [sent-294, score-0.053]

77 Results of NCA and Xing et al. [4] on large datasets are not available, either because the algorithm does not converge or because it runs out of memory. [sent-295, score-0.07]

78 We vary the input dimensions from 50 to 1000 and keep the number of triplets fixed at 250. [sent-479, score-0.102]

79 Instead of using standard interior-point SDP solvers that do not scale well, LMNN heuristically combines sub-gradient descent in both the matrices L and X. [sent-480, score-0.046]

80 BoostMetric is faster than LMNN for large input dimensions because at each iteration BoostMetric only needs to calculate the largest eigenvector, whereas LMNN needs a full eigen-decomposition. [sent-496, score-0.108]

81 The second figure shows the test error against the number of training triplets with a 100-word codebook. [sent-504, score-0.123]

82 5% with 8631 triplets for training, which is worse than BoostMetric. [sent-507, score-0.082]

83 To accumulate statistics, the images of two involved object classes are randomly split as 10 pairs of training/test subsets. [sent-513, score-0.05]

84 Restricted to the images in a training subset (those in a test subset are used only for testing), their local descriptors are clustered to form visual words using k-means clustering. [sent-514, score-0.096]

85 Each image is then represented by a histogram containing the number of occurrences of each visual word. [sent-515, score-0.048]
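A sketch of this bag-of-visual-words representation; scikit-learn's KMeans is one possible implementation choice (not specified by the paper), and the codebook sizes of 100 and 200 mentioned below would be passed as n_words:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, n_words=100, seed=0):
    """Cluster the local descriptors of the training images into visual words."""
    return KMeans(n_clusters=n_words, random_state=seed).fit(np.vstack(train_descriptors))

def image_histogram(codebook, descriptors):
    """Represent one image by the histogram of occurrence counts of its visual words."""
    words = codebook.predict(descriptors)
    return np.bincount(words, minlength=codebook.n_clusters)
```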

86 In each of the 10 pairs of training/test subsets, there are 959 training images and 639 test images. [sent-518, score-0.07]

87 Two visual codebooks of size 100 and 200 are used, respectively. [sent-519, score-0.075]

88 Fig. 2 (right) plots the test error of BoostMetric against the number of triplets used for training. [sent-537, score-0.105]

89 The general trend is that more triplets lead to smaller errors. [sent-538, score-0.082]

90 Background-Google: This experiment uses the two object classes as a retrieval problem. [sent-540, score-0.051]

91 BoostMetric is first learned from a training subset and retrieval is conducted on the corresponding test subset. [sent-543, score-0.071]

92 In each of the 10 training/test subsets, there are 573 training images and 382 test images. [sent-544, score-0.07]

93 Again, two visual codebooks of size 100 and 200 are used. [sent-545, score-0.075]

94 Each face image in a test subset is used as a query, and its distances from other test images are calculated by BoostMetric, LMNN and the Euclidean distance. [sent-546, score-0.097]

95 The retrieval precision for each query is averaged on this test subset and then averaged over all 10 test subsets. [sent-548, score-0.076]

96 Conclusion: We have presented a new algorithm, BoostMetric, to learn a positive semidefinite metric using boosting techniques. [sent-552, score-0.152]

97 Experiments show that it outperforms several state-of-the-art metric learning methods. [sent-555, score-0.093]

98 Distance metric learning, with application to clustering with side-information. [sent-595, score-0.093]

99 Distance metric learning for large margin nearest neighbor classification. [sent-631, score-0.186]

100 A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval. [sent-717, score-0.206]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('oost', 0.576), ('etric', 0.504), ('lmnn', 0.256), ('wj', 0.231), ('hrj', 0.152), ('ur', 0.147), ('psdboost', 0.136), ('semide', 0.128), ('aj', 0.122), ('mahalanobis', 0.121), ('ai', 0.112), ('ar', 0.112), ('adaboost', 0.098), ('metric', 0.093), ('nca', 0.085), ('triplets', 0.082), ('zj', 0.069), ('hr', 0.066), ('bisection', 0.061), ('uj', 0.057), ('wl', 0.057), ('distance', 0.056), ('airplanes', 0.053), ('evd', 0.053), ('ak', 0.049), ('codebooks', 0.049), ('motorbikes', 0.049), ('distij', 0.045), ('dual', 0.045), ('euclidean', 0.044), ('primal', 0.042), ('datasets', 0.041), ('lpboost', 0.04), ('xing', 0.038), ('tr', 0.037), ('neighbor', 0.035), ('boosting', 0.035), ('australia', 0.035), ('margin', 0.034), ('weak', 0.034), ('canberra', 0.032), ('lagrange', 0.031), ('optimization', 0.031), ('lp', 0.03), ('eigenvector', 0.03), ('boostmetric', 0.03), ('convexify', 0.03), ('distik', 0.03), ('oostm', 0.03), ('twin', 0.03), ('retrieval', 0.03), ('classi', 0.03), ('knn', 0.03), ('inf', 0.029), ('images', 0.029), ('al', 0.029), ('australian', 0.028), ('cp', 0.028), ('wu', 0.027), ('ll', 0.027), ('csdp', 0.027), ('adelaide', 0.027), ('visual', 0.026), ('learners', 0.026), ('exp', 0.026), ('nite', 0.025), ('diabetes', 0.024), ('bal', 0.024), ('column', 0.024), ('nearest', 0.024), ('positive', 0.024), ('descent', 0.023), ('solvers', 0.023), ('test', 0.023), ('upon', 0.022), ('image', 0.022), ('faces', 0.022), ('mum', 0.022), ('rca', 0.022), ('constraints', 0.021), ('matrix', 0.021), ('exponential', 0.021), ('ik', 0.021), ('nicta', 0.021), ('object', 0.021), ('ij', 0.02), ('dimensions', 0.02), ('eigenvalue', 0.02), ('iteration', 0.02), ('loss', 0.02), ('convex', 0.02), ('letter', 0.02), ('sdp', 0.02), ('library', 0.019), ('needs', 0.019), ('generation', 0.019), ('accommodate', 0.018), ('training', 0.018), ('rates', 0.018), ('peaks', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton Hengel

Abstract: The learning of appropriate distance metrics is a critical problem in image classification and retrieval. In this work, we propose a boosting-based technique, termed BoostMetric, for learning a Mahalanobis distance metric. One of the primary difficulties in learning such a metric is to ensure that the Mahalanobis matrix remains positive semidefinite. Semidefinite programming is sometimes used to enforce this constraint, but does not scale well. BoostMetric is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. BoostMetric thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting method is easy to implement, does not require tuning, and can accommodate various types of constraints. Experiments on various datasets show that the proposed algorithm compares favorably to state-of-the-art methods in terms of classification accuracy and running time. 1

2 0.16912982 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

Author: Rong Jin, Shijun Wang, Yang Zhou

Abstract: In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.

3 0.11940141 223 nips-2009-Sparse Metric Learning via Smooth Optimization

Author: Yiming Ying, Kaizhu Huang, Colin Campbell

Abstract: In this paper we study the problem of learning a low-rank (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a nondifferentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.

4 0.079433478 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

Author: Gal Chechik, Uri Shalit, Varun Sharma, Samy Bengio

Abstract: Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before. 1

5 0.072317682 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

Author: Novi Quadrianto, John Lim, Dale Schuurmans, Tibério S. Caetano

Abstract: We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data. 1

6 0.069803424 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

7 0.067899361 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

8 0.05945196 47 nips-2009-Boosting with Spatial Regularization

9 0.054162119 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

10 0.048899852 67 nips-2009-Directed Regression

11 0.047851499 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

12 0.04625994 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

13 0.044090364 234 nips-2009-Streaming k-means approximation

14 0.041155942 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

15 0.040297158 104 nips-2009-Group Sparse Coding

16 0.039860353 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

17 0.039420675 72 nips-2009-Distribution Matching for Transduction

18 0.039167836 33 nips-2009-Analysis of SVM with Indefinite Kernels

19 0.03884932 190 nips-2009-Polynomial Semantic Indexing

20 0.038088579 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.142), (1, 0.058), (2, -0.037), (3, 0.027), (4, -0.013), (5, 0.025), (6, -0.052), (7, 0.039), (8, 0.042), (9, 0.048), (10, 0.045), (11, -0.012), (12, 0.078), (13, 0.053), (14, -0.036), (15, -0.141), (16, 0.124), (17, -0.019), (18, -0.029), (19, 0.044), (20, -0.104), (21, -0.111), (22, -0.07), (23, -0.002), (24, 0.046), (25, 0.046), (26, -0.018), (27, -0.089), (28, -0.059), (29, 0.015), (30, 0.014), (31, 0.005), (32, 0.084), (33, -0.009), (34, -0.095), (35, 0.061), (36, 0.015), (37, 0.052), (38, -0.013), (39, 0.001), (40, -0.077), (41, -0.012), (42, -0.153), (43, -0.002), (44, -0.087), (45, -0.053), (46, 0.08), (47, -0.008), (48, 0.046), (49, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92414314 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton Hengel

Abstract: The learning of appropriate distance metrics is a critical problem in image classification and retrieval. In this work, we propose a boosting-based technique, termed BoostMetric, for learning a Mahalanobis distance metric. One of the primary difficulties in learning such a metric is to ensure that the Mahalanobis matrix remains positive semidefinite. Semidefinite programming is sometimes used to enforce this constraint, but does not scale well. BoostMetric is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. BoostMetric thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting method is easy to implement, does not require tuning, and can accommodate various types of constraints. Experiments on various datasets show that the proposed algorithm compares favorably to state-of-the-art methods in terms of classification accuracy and running time. 1

2 0.80681169 223 nips-2009-Sparse Metric Learning via Smooth Optimization

Author: Yiming Ying, Kaizhu Huang, Colin Campbell

Abstract: In this paper we study the problem of learning a low-rank (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a nondifferentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.

3 0.77531815 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

Author: Rong Jin, Shijun Wang, Yang Zhou

Abstract: In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.

4 0.59921986 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

Author: Gal Chechik, Uri Shalit, Varun Sharma, Samy Bengio

Abstract: Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before. 1

5 0.55752575 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

Author: Lei Wu, Rong Jin, Steven C. Hoi, Jianke Zhu, Nenghai Yu

Abstract: Learning distance functions with side information plays a key role in many machine learning and data mining applications. Conventional approaches often assume a Mahalanobis distance function. These approaches are limited in two aspects: (i) they are computationally expensive (even infeasible) for high dimensional data because the size of the metric is in the square of dimensionality; (ii) they assume a fixed metric for the entire input space and therefore are unable to handle heterogeneous data. In this paper, we propose a novel scheme that learns nonlinear Bregman distance functions from side information using a nonparametric approach that is similar to support vector machines. The proposed scheme avoids the assumption of fixed metric by implicitly deriving a local distance from the Hessian matrix of a convex function that is used to generate the Bregman distance function. We also present an efficient learning algorithm for the proposed scheme for distance function learning. The extensive experiments with semi-supervised clustering show the proposed technique (i) outperforms the state-of-the-art approaches for distance function learning, and (ii) is computationally efficient for high dimensional data. 1

6 0.4515923 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

7 0.45041806 33 nips-2009-Analysis of SVM with Indefinite Kernels

8 0.40438333 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

9 0.3801477 46 nips-2009-Bilinear classifiers for visual recognition

10 0.36981541 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

11 0.36019406 234 nips-2009-Streaming k-means approximation

12 0.35374328 7 nips-2009-A Data-Driven Approach to Modeling Choice

13 0.34012631 47 nips-2009-Boosting with Spatial Regularization

14 0.34007686 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

15 0.33561575 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions

16 0.33550245 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

17 0.33187562 90 nips-2009-Factor Modeling for Advertisement Targeting

18 0.32485241 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

19 0.31910282 182 nips-2009-Optimal Scoring for Unsupervised Learning

20 0.30732304 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.019), (21, 0.012), (24, 0.029), (25, 0.069), (35, 0.047), (36, 0.125), (39, 0.039), (44, 0.017), (58, 0.135), (61, 0.016), (71, 0.05), (86, 0.09), (91, 0.023), (92, 0.238)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87670743 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features

Author: Anibal Figueiras-vidal, Miguel Lázaro-gredilla

Abstract: We present a general inference framework for inter-domain Gaussian Processes (GPs) and focus on its usefulness to build sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n3 ) to O(m2 n), where m n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency. 1 Introduction and previous work Along the past decade there has been a growing interest in the application of Gaussian Processes (GPs) to machine learning tasks. GPs are probabilistic non-parametric Bayesian models that combine a number of attractive characteristics: They achieve state-of-the-art performance on supervised learning tasks, provide probabilistic predictions, have a simple and well-founded model selection scheme, present no overfitting (since parameters are integrated out), etc. Unfortunately, the direct application of GPs to regression problems (with which we will be concerned here) is limited due to their training time being O(n3 ). To overcome this limitation, several sparse approximations have been proposed [2, 3, 4, 5, 6]. In most of them, sparsity is achieved by projecting all available data onto a smaller subset of size m n (the active set), which is selected according to some specific criterion. This reduces computation time to O(m2 n). However, active set selection interferes with hyperparameter learning, due to its non-smooth nature (see [1, 3]). These proposals have been superseded by the Sparse Pseudo-inputs GP (SPGP) model, introduced in [1]. In this model, the constraint that the samples of the active set (which are called pseudoinputs) must be selected among training data is relaxed, allowing them to lie anywhere in the input space. This allows both pseudo-inputs and hyperparameters to be selected in a joint continuous optimisation and increases flexibility, resulting in much superior performance. In this work we introduce Inter-Domain GPs (IDGPs) as a general tool to perform inference across domains. This allows to remove the constraint that the pseudo-inputs must remain within the same domain as input data. This added flexibility results in an increased performance and allows to encode prior knowledge about other domains where data can be represented more compactly. 1 2 Review of GPs for regression We will briefly state here the main definitions and results for regression with GPs. See [7] for a comprehensive review. Assume we are given a training set with n samples D ≡ {xj , yj }n , where each D-dimensional j=1 input xj is associated to a scalar output yj . The regression task goal is, given a new input x∗ , predict the corresponding output y∗ based on D. 
The GP regression model assumes that the outputs can be expressed as some noiseless latent function plus independent noise, y = f (x)+ε, and then sets a zero-mean1 GP prior on f (x), with covariance k(x, x ), and a zero-mean Gaussian prior on ε, with variance σ 2 (the noise power hyperparameter). The covariance function encodes prior knowledge about the smoothness of f (x). The most common choice for it is the Automatic Relevance Determination Squared Exponential (ARD SE): 2 k(x, x ) = σ0 exp − 1 2 D d=1 (xd − xd )2 2 d , (1) 2 with hyperparameters σ0 (the latent function power) and { d }D (the length-scales, defining how d=1 rapidly the covariance decays along each dimension). It is referred to as ARD SE because, when coupled with a model selection method, non-informative input dimensions can be removed automatically by growing the corresponding length-scale. The set of hyperparameters that define the GP are 2 θ = {σ 2 , σ0 , { d }D }. We will omit the dependence on θ for the sake of clarity. d=1 If we evaluate the latent function at X = {xj }n , we obtain a set of latent variables following a j=1 joint Gaussian distribution p(f |X) = N (f |0, Kff ), where [Kff ]ij = k(xi , xj ). Using this model it is possible to express the joint distribution of training and test cases and then condition on the observed outputs to obtain the predictive distribution for any test case pGP (y∗ |x∗ , D) = N (y∗ |kf ∗ (Kff + σ 2 In )−1 y, σ 2 + k∗∗ − kf ∗ (Kff + σ 2 In )−1 kf ∗ ), (2) where y = [y1 , . . . , yn ] , kf ∗ = [k(x1 , x∗ ), . . . , k(xn , x∗ )] , and k∗∗ = k(x∗ , x∗ ). In is used to denote the identity matrix of size n. The O(n3 ) cost of these equations arises from the inversion of the n × n covariance matrix. Predictive distributions for additional test cases take O(n2 ) time each. These costs make standard GPs impractical for large data sets. To select hyperparameters θ, Type-II Maximum Likelihood (ML-II) is commonly used. This amounts to selecting the hyperparameters that correspond to a (possibly local) maximum of the log-marginal likelihood, also called log-evidence. 3 Inter-domain GPs In this section we will introduce Inter-Domain GPs (IDGPs) and show how they can be used as a framework for computationally efficient inference. Then we will use this framework to express two previous relevant models and develop two new ones. 3.1 Definition Consider a real-valued GP f (x) with x ∈ RD and some deterministic real function g(x, z), with z ∈ RH . We define the following transformation: u(z) = f (x)g(x, z)dx. (3) RD There are many examples of transformations that take on this form, the Fourier transform being one of the best known. We will discuss possible choices for g(x, z) in Section 3.3; for the moment we will deal with the general form. Since u(z) is obtained by a linear transformation of GP f (x), 1 We follow the common approach of subtracting the sample mean from the outputs and then assume a zero-mean model. 2 it is also a GP. This new GP may lie in a different domain of possibly different dimension. This transformation is not invertible in general, its properties being defined by g(x, z). IDGPs arise when we jointly consider f (x) and u(z) as a single, “extended” GP. The mean and covariance function of this extended GP are overloaded to accept arguments from both the input and transformed domains and treat them accordingly. We refer to each version of an overloaded function as an instance, which will accept a different type of arguments. 
If the distribution of the original GP is f (x) ∼ GP(m(x), k(x, x )), then it is possible to compute the remaining instances that define the distribution of the extended GP over both domains. The transformed-domain instance of the mean is m(z) = E[u(z)] = E[f (x)]g(x, z)dx = m(x)g(x, z)dx. RD RD The inter-domain and transformed-domain instances of the covariance function are: k(x, z ) = E[f (x)u(z )] = E f (x) f (x )g(x , z )dx = RD k(z, z ) = E[u(z)u(z )] = E f (x)g(x, z)dx RD = k(x, x )g(x , z )dx f (x )g(x , z )dx RD k(x, x )g(x, z)g(x , z )dxdx . RD (4) RD (5) RD Mean m(·) and covariance function k(·, ·) are therefore defined both by the values and domains of their arguments. This can be seen as if each argument had an additional domain indicator used to select the instance. Apart from that, they define a regular GP, and all standard properties hold. In particular k(a, b) = k(b, a). This approach is related to [8], but here the latent space is defined as a transformation of the input space, and not the other way around. This allows to pre-specify the desired input-domain covariance. The transformation is also more general: Any g(x, z) can be used. We can sample an IDGP at n input-domain points f = [f1 , f2 , . . . , fn ] (with fj = f (xj )) and m transformed-domain points u = [u1 , u2 , . . . , um ] (with ui = u(zi )). With the usual assumption of f (x) being a zero mean GP and defining Z = {zi }m , the joint distribution of these samples is: i=1 Kff Kfu f f 0, X, Z = N p , (6) u u Kfu Kuu with [Kff ]pq = k(xp , xq ), [Kfu ]pq = k(xp , zq ), [Kuu ]pq = k(zp , zq ), which allows to perform inference across domains. We will only be concerned with one input domain and one transformed domain, but IDGPs can be defined for any number of domains. 3.2 Sparse regression using inducing features In the standard regression setting, we are asked to perform inference about the latent function f (x) from a data set D lying in the input domain. Using IDGPs, we can use data from any domain to perform inference in the input domain. Some latent functions might be better defined by a set of data lying in some transformed space rather than in the input space. This idea is used for sparse inference. Following [1] we introduce a pseudo data set, but here we place it in the transformed domain: D = {Z, u}. The following derivation is analogous to that of SPGP. We will refer to Z as the inducing features and u as the inducing variables. The key approximation leading to sparsity is to set m n and assume that f (x) is well-described by the pseudo data set D, so that any two samples (either from the training or test set) fp and fq with p = q will be independent given xp , xq and D. With this simplifying assumption2 , the prior over f can be factorised as a product of marginals: n p(f |X, Z, u) ≈ p(fj |xj , Z, u). (7) j=1 2 Alternatively, (7) can be obtained by proposing a generic factorised form for the approximate conn ditional p(f |X, Z, u) ≈ q(f |X, Z, u) = q (f |xj , Z, u) and then choosing the set of funcj=1 j j tions {qj (·)}n so as to minimise the Kullback-Leibler (KL) divergence from the exact joint prior j=1 KL(p(f |X, Z, u)p(u|Z)||q(f |X, Z, u)p(u|Z)), as noted in [9], Section 2.3.6. 3 Marginals are in turn obtained from (6): p(fj |xj , Z, u) = N (fj |kj K−1 u, λj ), where kj is the j-th uu row of Kfu and λj is the j-th element of the diagonal of matrix Λf = diag(Kf f − Kfu K−1 Kuf ). uu Operator diag(·) sets all off-diagonal elements to zero, so that Λf is a diagonal matrix. 
Since p(u|Z) is readily available and also Gaussian, the inducing variables can be integrated out from (7), yielding a new, approximate prior over f (x): n p(f |X, Z) = p(fj |xj , Z, u)p(u|Z)du = N (f |0, Kfu K−1 Kuf + Λf ) uu p(f , u|X, Z)du ≈ j=1 Using this approximate prior, the posterior distribution for a test case is: pIDGP (y∗ |x∗ , D, Z) = N (y∗ |ku∗ Q−1 Kfu Λ−1 y, σ 2 + k∗∗ + ku∗ (Q−1 − K−1 )ku∗ ), y uu (8) Kfu Λ−1 Kfu y where we have defined Q = Kuu + and Λy = Λf + σ 2 In . The distribution (2) is approximated by (8) with the information available in the pseudo data set. After O(m2 n) time precomputations, predictive means and variances can be computed in O(m) and O(m2 ) time per test case, respectively. This model is, in general, non-stationary, even when it is approximating a stationary input-domain covariance and can be interpreted as a degenerate GP plus heteroscedastic white noise. The log-marginal likelihood (or log-evidence) of the model, explicitly including the conditioning on kernel hyperparameters θ can be expressed as 1 log p(y|X, Z, θ) = − [y Λ−1 y−y Λ−1 Kfu Q−1 Kfu Λ−1 y+log(|Q||Λy |/|Kuu |)+n log(2π)] y y y 2 which is also computable in O(m2 n) time. Model selection will be performed by jointly optimising the evidence with respect to the hyperparameters and the inducing features. If analytical derivatives of the covariance function are available, conjugate gradient optimisation can be used with O(m2 n) cost per step. 3.3 On the choice of g(x, z) The feature extraction function g(x, z) defines the transformed domain in which the pseudo data set lies. According to (3), the inducing variables can be seen as projections of the target function f (x) on the feature extraction function over the whole input space. Therefore, each of them summarises information about the behaviour of f (x) everywhere. The inducing features Z define the concrete set of functions over which the target function will be projected. It is desirable that this set captures the most significant characteristics of the function. This can be achieved either using prior knowledge about data to select {g(x, zi )}m or using a very general family of functions and letting model i=1 selection automatically choose the appropriate set. Another way to choose g(x, z) relies on the form of the posterior. The posterior mean of a GP is often thought of as a linear combination of “basis functions”. For full GPs and other approximations such as [1, 2, 3, 4, 5, 6], basis functions must have the form of the input-domain covariance function. When using IDGPs, basis functions have the form of the inter-domain instance of the covariance function, and can therefore be adjusted by choosing g(x, z), independently of the input-domain covariance function. If two feature extraction functions g(·, ·) and h(·, ·) can be related by g(x, z) = h(x, z)r(z) for any function r(·), then both yield the same sparse GP model. This property can be used to simplify the expressions of the instances of the covariance function. In this work we use the same functional form for every feature, i.e. our function set is {g(x, zi )}m , i=1 but it is also possible to use sets with different functional forms for each inducing feature, i.e. {gi (x, zi )}m where each zi may even have a different size (dimension). In the sections below i=1 we will discuss different possible choices for g(x, z). 
3.3.1 Relation with Sparse GPs using pseudo-inputs The sparse GP using pseudo-inputs (SPGP) was introduced in [1] and was later renamed to Fully Independent Training Conditional (FITC) model to fit in the systematic framework of [10]. Since 4 the sparse model introduced in Section 3.2 also uses a fully independent training conditional, we will stick to the first name to avoid possible confusion. IDGP innovation with respect to SPGP consists in letting the pseudo data set lie in a different domain. If we set gSPGP (x, z) ≡ δ(x − z) where δ(·) is a Dirac delta, we force the pseudo data set to lie in the input domain. Thus there is no longer a transformed space and the original SPGP model is retrieved. In this setting, the inducing features of IDGP play the role of SPGP’s pseudo-inputs. 3.3.2 Relation with Sparse Multiscale GPs Sparse Multiscale GPs (SMGPs) are presented in [11]. Seeking to generalise the SPGP model with ARD SE covariance function, they propose to use a different set of length-scales for each basis function. The resulting model presents a defective variance that is healed by adding heteroscedastic white noise. SMGPs, including the variance improvement, can be derived in a principled way as IDGPs: D 1 gSMGP (x, z) ≡ D d=1 2π(c2 d D kSMGP (x, z ) = exp − d=1 D kSMGP (z, z ) = exp − d=1 − 2) d exp − d=1 (xd − µd )2 2cd2 D µ c with z = 2 d cd2 d=1 (µd − µd )2 2(c2 + cd2 − 2 ) d d (xd − µd )2 2(c2 − 2 ) d d (9) (10) D c2 + d d=1 2 d cd2 − 2. d (11) With this approximation, each basis function has its own centre µ = [µ1 , µ2 , . . . , µd ] and its own length-scales c = [c1 , c2 , . . . , cd ] , whereas global length-scales { d }D are shared by all d=1 inducing features. Equations (10) and (11) are derived from (4) and (5) using (1) and (9). The integrals defining kSMGP (·, ·) converge if and only if c2 ≥ 2 , ∀d , which suggests that other values, d d even if permitted in [11], should be avoided for the model to remain well defined. 3.3.3 Frequency Inducing Features GP If the target function can be described more compactly in the frequency domain than in the input domain, it can be advantageous to let the pseudo data set lie in the former domain. We will pursue that possibility for the case where the input domain covariance is the ARD SE. We will call the resulting sparse model Frequency Inducing Features GP (FIFGP). Directly applying the Fourier transform is not possible because the target function is not square 2 integrable (it has constant power σ0 everywhere, so (5) does not converge). We will workaround this by windowing the target function in the region of interest. It is possible to use a square window, but this results in the covariance being defined in terms of the complex error function, which is very slow to evaluate. Instead, we will use a Gaussian window3 . Since multiplying by a Gaussian in the input domain is equivalent to convolving with a Gaussian in the frequency domain, we will be working with a blurred version of the frequency space. This model is defined by: gFIF (x, z) ≡ D 1 D d=1 2πc2 d D kFIF (x, z ) = exp − d=1 D kFIF (z, z ) = exp − d=1 exp − d=1 x2 d cos ω0 + 2c2 d x2 + c2 ωd2 d d cos ω0 + 2(c2 + 2 ) d d D d=1 exp − d=1 D + exp 3 c4 (ωd + ωd )2 d − 2(2c2 + 2 ) d d d=1 x d ωd with z = ω D 2 d d=1 c2 + d 2 d (13) c4 (ωd − ωd )2 d cos(ω0 − ω0 ) 2(2c2 + 2 ) d d D cos(ω0 + ω0 ) d=1 2 d 2c2 + d 2. d A mixture of m Gaussians could also be used as window without increasing the complexity order. 
5 (12) d=1 c2 ωd xd d c2 + 2 d d D 2 c2 (ωd + ωd2 ) d 2(2c2 + 2 ) d d D (14) The inducing features are ω = [ω0 , ω1 , . . . , ωd ] , where ω0 is the phase and the remaining components are frequencies along each dimension. In this model, both global length-scales { d }D and d=1 window length-scales {cd }D are shared, thus cd = cd . Instances (13) and (14) are induced by (12) d=1 using (4) and (5). 3.3.4 Time-Frequency Inducing Features GP Instead of using a single window to select the region of interest, it is possible to use a different window for each feature. We will use windows of the same size but different centres. The resulting model combines SPGP and FIFGP, so we will call it Time-Frequency Inducing Features GP (TFIFGP). It is defined by gTFIF (x, z) ≡ gFIF (x − µ, ω), with z = [µ ω ] . The implied inter-domain and transformed-domain instances of the covariance function are: D kTFIF (x, z ) = kFIF (x − µ , ω ) , kTFIF (z, z ) = kFIF (z, z ) exp − d=1 (µd − µd )2 2(2c2 + 2 ) d d FIFGP is trivially obtained by setting every centre to zero {µi = 0}m , whereas SPGP is obtained i=1 by setting window length-scales c, frequencies and phases {ω i }m to zero. If the window lengthi=1 scales were individually adjusted, SMGP would be obtained. While FIFGP has the modelling power of both FIFGP and SPGP, it might perform worse in practice due to it having roughly twice as many hyperparameters, thus making the optimisation problem harder. The same problem also exists in SMGP. A possible workaround is to initialise the hyperparameters using a simpler model, as done in [11] for SMGP, though we will not do this here. 4 Experiments In this section we will compare the proposed approximations FIFGP and TFIFGP with the current state of the art, SPGP on some large data sets, for the same number of inducing features/inputs and therefore, roughly equal computational cost. Additionally, we provide results using a full GP, which is expected to provide top performance (though requiring an impractically big amount of computation). In all cases, the (input-domain) covariance function is the ARD SE (1). We use four large data sets: Kin-40k, Pumadyn-32nm4 (describing the dynamics of a robot arm, used with SPGP in [1]), Elevators and Pole Telecomm5 (related to the control of the elevators of an F16 aircraft and a telecommunications problem, and used in [12, 13, 14]). Input dimensions that remained constant throughout the training set were removed. Input data was additionally centred for use with FIFGP (the remaining methods are translation invariant). Pole Telecomm outputs actually take discrete values in the 0-100 range, in multiples of 10. This was taken into account by using the corresponding quantization noise variance (102 /12) as lower bound for the noise hyperparameter6 . n 1 2 2 2 Hyperparameters are initialised as follows: σ0 = n j=1 yj , σ 2 = σ0 /4, { d }D to one half of d=1 the range spanned by training data along each dimension. For SPGP, pseudo-inputs are initialised to a random subset of the training data, for FIFGP window size c is initialised to the standard deviation of input data, frequencies are randomly chosen from a zero-mean −2 -variance Gaussian d distribution, and phases are obtained from a uniform distribution in [0 . . . 2π). TFIFGP uses the same initialisation as FIFGP, with window centres set to zero. Final values are selected by evidence maximisation. 
Denoting the output average over the training set as y and the predictive mean and variance for test sample y∗l as µ∗l and σ∗l respectively, we define the following quality measures: Normalized Mean Square Error (NMSE) (y∗l − µ∗l )2 / (y∗l − y)2 and Mean Negative Log-Probability (MNLP) 1 2 2 2 2 (y∗l − µ∗l ) /σ∗l + log σ∗l + log 2π , where · averages over the test set. 4 Kin-40k: 8 input dimensions, 10000/30000 samples for train/test, Pumadyn-32nm: 32 input dimensions, 7168/1024 samples for train/test, using exactly the same preprocessing and train/test splits as [1, 3]. Note that their error measure is actually one half of the Normalized Mean Square Error defined here. 5 Pole Telecomm: 26 non-constant input dimensions, 10000/5000 samples for train/test. Elevators: 17 non-constant input dimensions, 8752/7847 samples for train/test. Both have been downloaded from http://www.liaad.up.pt/∼ltorgo/Regression/datasets.html 6 If unconstrained, similar plots are obtained; in particular, no overfitting is observed. 6 For Kin-40k (Fig. 1, top), all three sparse methods perform similarly, though for high sparseness (the most useful case) FIFGP and TFIFGP are slightly superior. In Pumadyn-32nm (Fig. 1, bottom), only 4 out the 32 input dimensions are relevant to the regression task, so it can be used as an ARD capabilities test. We follow [1] and use a full GP on a small subset of the training data (1024 data points) to obtain the initial length-scales. This allows better minima to be found during optimisation. Though all methods are able to properly find a good solution, FIFGP and especially TFIFGP are better in the sparser regime. Roughly the same considerations can be made about Pole Telecomm and Elevators (Fig. 2), but in these data sets the superiority of FIFGP and TFIFGP is more dramatic. Though not shown here, we have additionally tested these models on smaller, overfitting-prone data sets, and have found no noticeable overfitting even using m > n, despite the relatively high number of parameters being adjusted. This is in line with the results and discussion of [1]. Mean Negative Log−Probability Normalized Mean Squared Error 2.5 0.5 0.1 0.05 0.01 0.005 SPGP FIFGP TFIFGP Full GP on 10000 data points 0.001 25 50 100 200 300 500 750 SPGP FIFGP TFIFGP Full GP on 10000 data points 2 1.5 1 0.5 0 −0.5 −1 −1.5 1250 25 50 100 200 300 500 750 1250 Inducing features / pseudo−inputs (b) Kin-40k MNLP (semilog plot) Inducing features / pseudo−inputs (a) Kin-40k NMSE (log-log plot) 0.05 0.04 10 25 50 75 Mean Negative Log−Probability Normalized Mean Squared Error 0.2 SPGP FIFGP TFIFGP Full GP on 7168 data points 0.1 0.1 0.05 0 −0.05 −0.1 −0.15 −0.2 100 Inducing features / pseudo−inputs (c) Pumadyn-32nm NMSE (log-log plot) SPGP FIFGP TFIFGP Full GP on 7168 data points 0.15 10 25 50 75 100 Inducing features / pseudo−inputs (d) Pumadyn-32nm MNLP (semilog plot) Figure 1: Performance of the compared methods on Kin-40k and Pumadyn-32nm. 5 Conclusions and extensions In this work we have introduced IDGPs, which are able combine representations of a GP in different domains, and have used them to extend SPGP to handle inducing features lying in a different domain. This provides a general framework for sparse models, which are defined by a feature extraction function. Using this framework, SMGPs can be reinterpreted as fully principled models using a transformed space of local features, without any need for post-hoc variance improvements. 
[Figure 2: Performance of the compared methods (SPGP, FIFGP, TFIFGP, and a full GP) on Elevators and Pole Telecomm. Panels: (a) Elevators NMSE (log-log plot), (b) Elevators MNLP (semilog plot), (c) Pole Telecomm NMSE (log-log plot), (d) Pole Telecomm MNLP (semilog plot); NMSE and MNLP are plotted against the number of inducing features / pseudo-inputs.]

5 Conclusions and extensions

In this work we have introduced IDGPs, which are able to combine representations of a GP in different domains, and have used them to extend SPGP to handle inducing features lying in a different domain. This provides a general framework for sparse models, which are defined by a feature extraction function. Using this framework, SMGPs can be reinterpreted as fully principled models using a transformed space of local features, without any need for post-hoc variance improvements. Furthermore, it is possible to develop new sparse models of practical use, such as the proposed FIFGP and TFIFGP, which are able to outperform the state-of-the-art SPGP on some large data sets, especially in high-sparsity regimes.

Choosing a transformed space for the inducing features makes it possible to use domains where the target function can be expressed more compactly, or where the evidence (which is a function of the features) is easier to optimise. This added flexibility translates into a decoupling of the functional form of the input-domain covariance from the set of basis functions used to express the posterior mean. IDGPs approximate full GPs optimally in the KL sense noted in Section 3.2, for a given set of inducing features. Using ML-II to select the inducing features means that models providing a good fit to the data are given preference over models that might approximate the full GP more closely. This might, though rarely, lead to harmful overfitting. To approximate the full GP more faithfully and avoid overfitting altogether, our proposal can be combined with the variational approach of [15], in which the inducing features would be regarded as variational parameters. This would result in more constrained models, which would be closer to the full GP but might show reduced performance.

We have explored the case of regression with Gaussian noise, which is analytically tractable, but it is straightforward to apply the same model to other tasks such as robust regression or classification, using approximate inference (see [16]). Also, IDGPs as a general tool can be used for other purposes, such as modelling noise in the frequency domain, aggregating data from different domains, or even imposing constraints on the target function.

Acknowledgments

We would like to thank the anonymous referees for helpful comments and suggestions. This work has been partly supported by the Spanish government under grant TEC2008-02473/TEC, and by the Madrid Community under grant S-505/TIC/0223.

References

[1] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pages 1259–1266. MIT Press, 2006.
[2] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619–625. MIT Press, 2001.
[3] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the 9th International Workshop on AI Stats, 2003.
[4] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.
[5] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641–669, 2002.
[6] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
[7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.
[8] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems 21, pages 57–64, 2009.
[9] E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, University of Cambridge, 2007.
[10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[11] C. Walder, K. I. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. In 25th International Conference on Machine Learning. ACM Press, New York, 2008.
[12] G. Potgieter and A. P. Engelbrecht. Evolving model trees for mining data sets with continuous-valued classes. Expert Systems with Applications, 35:1513–1532, 2007.
[13] L. Torgo and J. Pinto da Costa. Clustered partial linear regression. In Proceedings of the 11th European Conference on Machine Learning, pages 426–436. Springer, 2000.
[14] G. Potgieter and A. P. Engelbrecht. Pairwise classification as an ensemble technique. In Proceedings of the 13th European Conference on Machine Learning, pages 97–110. Springer-Verlag, 2002.
[15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Workshop on AI Stats, 2009.
[16] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pages 1057–1064. MIT Press, 2008.

same-paper 2 0.82011425 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton Hengel

Abstract: The learning of appropriate distance metrics is a critical problem in image classification and retrieval. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a Mahalanobis distance metric. One of the primary difficulties in learning such a metric is to ensure that the Mahalanobis matrix remains positive semidefinite. Semidefinite programming is sometimes used to enforce this constraint, but does not scale well. B OOST M ETRIC is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting method is easy to implement, does not require tuning, and can accommodate various types of constraints. Experiments on various datasets show that the proposed algorithm compares favorably to those state-of-the-art methods in terms of classification accuracy and running time. 1

3 0.7867704 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

Author: Sebastian Gerwinn, Philipp Berens, Matthias Bethge

Abstract: Second-order maximum-entropy models have recently gained much interest for describing the statistics of binary spike trains. Here, we extend this approach to take continuous stimuli into account as well. By constraining the joint secondorder statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension to the classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features which a neural population is sensitive to. Further, by calculating the posterior distribution of stimuli given an observed neural response, the model can be used to decode stimuli and yields a natural spike-train metric. Therefore, extending the framework of maximum-entropy models to continuous variables allows us to gain novel insights into the relationship between the firing patterns of neural ensembles and the stimuli they are processing. 1

4 0.74930882 97 nips-2009-Free energy score space

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminative and generative models.

5 0.67853147 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

Author: Novi Quadrianto, John Lim, Dale Schuurmans, Tibério S. Caetano

Abstract: We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data. 1

6 0.67800623 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

7 0.67752814 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

8 0.67697328 104 nips-2009-Group Sparse Coding

9 0.67488551 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

10 0.67326862 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

11 0.66715002 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

12 0.66579723 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

13 0.66546738 70 nips-2009-Discriminative Network Models of Schizophrenia

14 0.66509849 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

15 0.66369158 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

16 0.66349792 100 nips-2009-Gaussian process regression with Student-t likelihood

17 0.6625914 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

18 0.66217345 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

19 0.66166562 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

20 0.66116953 128 nips-2009-Learning Non-Linear Combinations of Kernels