nips nips2009 nips2009-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yiming Ying, Colin Campbell, Mark Girolami
Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.
Reference: text
sentIndex sentText sentNum sentScore
1 Building, G12 8QQ, United Kingdom Abstract The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). [sent-4, score-0.106]
2 This paper studies the properties of the objective function introduced there. [sent-5, score-0.098]
3 In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. [sent-6, score-0.305]
4 The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. [sent-8, score-0.463]
5 Our elementary result greatly facilitates the application of gradient-based algorithms. [sent-9, score-0.101]
6 Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. [sent-10, score-0.437]
7 1 Introduction Kernel methods [5, 24] such as Support Vector Machines (SVM) have recently attracted much attention due to their good generalization performance and appealing optimization approaches. [sent-12, score-0.102]
8 The basic idea of kernel methods is to map the data into a high dimensional (even infinite-dimensional) feature space through a kernel function. [sent-13, score-0.212]
9 The kernel function over samples forms a similarity kernel matrix which is usually required to be positive semi-definite (PSD). [sent-14, score-0.312]
10 The PSD property of the similarity matrix ensures that the SVM can be efficiently solved by a convex quadratic programming. [sent-15, score-0.168]
11 However, many potential kernel matrices could be non-positive semi-definite. [sent-16, score-0.106]
12 Such cases are quite common in applications such as the sigmoid kernel [14] for various values of the hyper-parameters, hyperbolic tangent kernels [25], and the protein sequence similarity measures derived from SmithWaterman and BLAST score [23]. [sent-17, score-0.268]
13 The problem of learning with a non-PSD similarity matrix (indefinite kernel) has recently attracted considerable attention [4, 8, 9, 14, 20, 21, 26]. [sent-18, score-0.105]
14 One widely used method is to convert the indefinite kernel matrix into a PSD one by using the spectrum transformation. [sent-19, score-0.18]
15 The denoise method neglects the negative eigenvalues [8, 21], flip [8] takes the absolute value of all eigenvalues, shift [22] shifts eigenvalues to be positive by adding a positive constant, and the diffusion method [11] takes the exponentials of eigenvalues. [sent-20, score-0.271]
16 In [9], the classification problem with indefinite kernels is regarded as the minimization of the distance between convex hulls in the pseudo-Euclidean space. [sent-23, score-0.127]
17 In [20], general Reproducing Kernel Kreˇn spaces (RKKS) with indefinite ı kernels are introduced which allows a general representer theorem and regularization formulations. [sent-24, score-0.163]
18 Training a SVM with an indefinite kernel was viewed as a learning the kernel 1 matrix problem [13] i. [sent-26, score-0.26]
19 learning a proxy PSD kernel matrix to approximate the indefinite one. [sent-28, score-0.227]
20 Without realizing that the objective function is differentiable, the authors quadratically smoothed the objective function, and then formulated two approximate algorithms including the projected gradient method and the analytic center cutting plane method. [sent-29, score-0.696]
21 In this paper we follow the formulation of SVM with indefinite kernels proposed in [15]. [sent-30, score-0.107]
22 We mainly establish the differentiability of the objective function (see its precise definition in equation (3)) and prove that it is, indeed, differentiable with Lipschitz continuous gradient. [sent-31, score-0.314]
23 This elementary result suggests there is no need to smooth the objective function which greatly facilitates the application of gradient-based algorithms. [sent-32, score-0.349]
24 The main idea behind our analysis is from its saddle (min-max) representation which involves a penalty term in the form of Frobenius norm of matrices, measuring the distance between the indefinite kernel matrix and the proxy PSD one. [sent-33, score-0.297]
25 This penalty term can be regarded as a Moreau-Yosida regularization term [12] to smooth out the objective function. [sent-34, score-0.277]
26 There, we first show that the objective function of interest is continuously differentiable and its gradient function can be explicitly computed. [sent-38, score-0.305]
27 Based on our analysis, in Section 4 we propose a simplified formulation of the projected gradient method presented in [15] and show that it has a convergence rate of O(1/k) where k is the iteration number. [sent-40, score-0.403]
28 We further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate of O(1/k 2 ) for smooth problems. [sent-41, score-0.437]
29 Suppose that K is a positive semi-definite kernel matrix (proxy kernel matrix) on inputs x. [sent-55, score-0.285]
30 Since we assume that K is positive semi-definite, the above problem is a standard convex quadratic program [2] and a global solution can be efficiently obtained by, e. [sent-60, score-0.118]
31 Suppose now we are only given an indefinite kernel matrix K0 ∈ S n . [sent-63, score-0.154]
32 Luss and d’Aspremont [15] proposed the following max-min approach to simultaneously learn a proxy PSD kernel matrix K for the indefinite matrix K0 and the SVM classification: 1 minK maxα α e − 2 α Y KY α + ρ K − K0 2 F (1) s. [sent-64, score-0.275]
33 F By the min-max theorem [2], problem (1) is equivalent to max minn L(α, K). [sent-68, score-0.2]
34 (2) α∈Q1 K∈S+ For simplicity, we refer to the following function defined by f (α) = minn L(α, K) K∈S+ (3) as the objective function. [sent-69, score-0.144]
35 We also call the associated function L(α, K) the saddle representation of the objective function f . [sent-71, score-0.168]
36 2 For fixed α ∈ Q1 , the optimization K(α) = arg minK 0 L(α, K) is equivalent to a projection to n the semi-definite cone S+ . [sent-72, score-0.125]
37 Indeed, it was shown in [15] that the optimal solution is given by K(α) = (K0 + Y αα Y /(4ρ))+ (4) n where, for any matrix A ∈ S , the notation A+ denotes the positive part of A by simply setting n its negative eigenvalues to zero. [sent-73, score-0.151]
38 The next lemma tells us that the optimal solution K ∗ belongs to a bounded domain in n S+ . [sent-80, score-0.118]
39 Problem (2) is equivalent to the formulation maxα∈Q1 minK∈Q2 L(α, K) and the objective function can be defined by f (α) = min L(α, K) K∈Q2 n where Q2 := K ∈ S+ : λmax (K) ≤ λmax (K0 ) + nC 2 4ρ (5) . [sent-82, score-0.135]
40 By the saddle theorem [2], we have L(α∗ , K ∗ ) = minK∈Q2 L(α∗ , K). [sent-84, score-0.134]
41 α Y KY α, in the objective function defined by (5), it can not be written as the above special form, and hence the theorem there can not be applied to our case. [sent-96, score-0.162]
42 3 Differentiability of the Objective Function The following lemma outlines a very useful characterization of differentiable properties of the optimal value function [3, Theorem 4. [sent-97, score-0.194]
43 Furthermore, if for α ∈ U , L(α, ·) has a unique minimizer x(α) over Q then f is differentiable at α and the gradient of f is given by f (α) = ∂α L(α, x(α)). [sent-104, score-0.202]
44 Applying the above lemma to the objective function f defined by equation (5), we have: Theorem 1. [sent-105, score-0.189]
45 The objective function f defined by (3) (equivalently by (5)) is differentiable and its gradient is given by f (α) = e − Y (K0 + Y αα Y /(4ρ))+ Y α. [sent-106, score-0.272]
46 Noting that ∂K L(α, K) = − 1 Y αα Y + 2ρ(K − K0 ) and adding the above two first2 order optimaility inequalities together, we have − K2 −K1 2 ≥ 0 which means that K1 = K2 , and F hence completes the proof of the uniqueness of K(α). [sent-113, score-0.127]
47 3 Indeed, we can go further to establish the Lipschitz continuity of f based on the strongly convex property of L(α, ·). [sent-118, score-0.143]
48 Consequently, (α2 α2 − α1 α1 ) F Y (α2 α2 − α1 α1 )Y F ≤ (7) 4ρ 4ρ where the last inequality follows from the fact that Y is an orthonormal matrix since yi ∈ {±1} and Y = diag(y1 , . [sent-132, score-0.131]
49 Putting this back into inequality (7) completes the proof of the lemma. [sent-137, score-0.134]
50 K(α1 ) − K(α2 ) F ≤ It is interesting to point out that the above lemma can be alternatively established by delicate techniques in matrix analysis. [sent-138, score-0.139]
51 The perturbation inequality in matrix analysis [1, Lemma VII. [sent-148, score-0.131]
52 F From the above lemma, we can establish the Lipschitz continuity of the gradient of the objective function. [sent-153, score-0.257]
53 The gradient of the objective function given by (6) is Lipschitz continuous with Lipschitz 2 constant L = λmax (K0 ) + nC i. [sent-155, score-0.234]
54 for any α1 , α2 ∈ Q1 the following inequality holds f (α1 ) − ρ 2 f (α2 ) ≤ λmax (K0 )) + nC /ρ α1 − α2 . [sent-157, score-0.108]
55 Table 1: Pseudo-code of projected gradient method 4 Smooth Optimization Algorithms This section is based on the theoretical analysis above, mainly Theorem 2. [sent-177, score-0.274]
56 We first outline a simplified version of the projected gradient method proposed in [15] and show it has a convergence rate of O(1/k) where k is the iteration number. [sent-178, score-0.366]
57 We can further develop a smooth optimization approach [17, 18] for indefinite SVM (5). [sent-179, score-0.222]
58 This scheme has an optimal convergence rate O(1/k 2 ) for smooth problems which has been applied to various problems, e. [sent-180, score-0.244]
59 1 Simplified Projected Gradient Method In [15], the objective function was smoothed by adding a quadratic term (see details in Section 3 there) and then they proposed a projected gradient algorithm to solve this approximation problem. [sent-184, score-0.44]
60 Using the explicit gradient representation in Theorem 1 we formulate its simplified version in Table 1 where the projection PQ1 : Rn → Q1 is defined, for any β ∈ Rn , by PQ1 (β) = arg min α − β α∈Q1 2 . [sent-185, score-0.126]
61 Let γ ≥ λmax (K0 ) + nC and {αk : k ∈ N} be given by the simplified projected ρ gradient method in Table 1. [sent-194, score-0.248]
62 For any α ∈ Q1 , the following inequality holds f (αk+1 ) ≥ f (α) + γ αk − αk+1 , α − αk + γ αk − αk+1 2 . [sent-195, score-0.108]
63 Let γ ≥ λmax (K0 ) + nC and the iteration sequence {αk : k ∈ N} be given by the ρ simplified projected gradient method in Table 1. [sent-207, score-0.328]
64 Then, we have that γ αk+1 − αk 2 , (12) f (αk+1 ) ≥ f (αk ) + 2 Moreover, γ max f (α) − f (αk ) ≤ α0 − α∗ 2 (13) α∈Q1 2k where α∗ is an optimal solution of problem maxα∈Q1 f (α). [sent-208, score-0.117]
65 Table 2: Pseudo-code of first-order Nesterov’s smooth optimization method Proof. [sent-222, score-0.248]
66 From the above theorem, the sequence {f (αk ) : k ∈ N} is monotonically increasing and the iteration complexity of SPGM is O(L/ε) for finding an ε-optimal solution. [sent-227, score-0.153]
67 2 Nesterov’s Smooth Optimization Method In [18, 17], Nesterov proposed an efficient smooth optimization method for solving convex programming problems of the form min g(x) x∈U where g is a convex function with Lipschitz continuous gradient, and U is a closed convex set in Rn . [sent-229, score-0.495]
68 The smooth optimization approach needs to introduce a proxy-function d(x) associated with the set U . [sent-231, score-0.274]
69 It is assumed to be continuous and strongly convex on U with convexity parameter σ > 0. [sent-232, score-0.119]
70 Then, a specific first-order smooth optimization scheme detailed in [18] can be then applied to the function g with convergence rate in O( L/ε). [sent-236, score-0.289]
71 Translating the first-order Nesterov’s scheme [18, Section 3] to our problem (5), we can get the smooth optimization algorithm for indefinite SVM, see its pseudo-code in Table 2. [sent-240, score-0.251]
72 The convergence of this optimal method was shown in [18]: ∗ 2 ∗ maxα∈Q1 f (α) − f (γk ) ≤ 4L α0 −α (k+1)(k+2) where α is one of the optimal solutions. [sent-245, score-0.118]
73 In [15], the objective function is smoothed by adding a quadratic term and then they further 6 proposed a projected gradient algorithm and analytic center cutting plane method (ACCPM)1 . [sent-252, score-0.573]
74 As proved in Theorem 3, the number of iterations of the projected gradient method is usually O(L/ε). [sent-253, score-0.248]
75 However, this method needs to use interior methods at each iteration which would be slow for large scale datasets. [sent-257, score-0.158]
76 Chen and Ye [4] reformulated indefinite SVM as an appealing semi-infinite quadratically constrained linear programming (SIQCLP) without applying extra smoothing techniques. [sent-258, score-0.209]
77 The iteration complexity of semi-infinite linear programming is usually O(1/ε3 ). [sent-260, score-0.164]
78 The main limitation of this approach is that one needs to save the subset of increasing quadratically constrained conditions indexed by n × n matrices and iteratively solve a quadratically constrained linear programming (QCLP). [sent-263, score-0.362]
79 This tends to make the algorithm inefficient during the iteration process, although pruning techniques were proposed to avoid too many quadratically constrained conditions. [sent-269, score-0.216]
80 Based on our theoretical results (Theorem 2), Nesterov’s smooth optimization method can be applied. [sent-270, score-0.248]
81 The complexity of this smooth optimization method (SMM) mainly relies on the eigenvalue decomposition on Step 2 listed in Table 2 which costs O(n3 ). [sent-271, score-0.32]
82 The first-order smooth optimization approach [17, 18] has iteration complexity O( L/ε) for finding an ε-optimal solution. [sent-273, score-0.348]
83 Hence, from theoretical comparison the complexity of smoothing optimization is better than the simplified projected gradient method (SPGM) and SIQCLP. [sent-275, score-0.366]
84 5 Experimental Validation We run our proposed smooth optimization approach and simplified projected gradient method on various datasets to validate our analysis. [sent-279, score-0.526]
85 In each data split, as in [4] we first generate a Gaussian kernel matrix K with the hyper-parameter determined by cross-validation on the training data using LIBSVM and then construct indefinite matrices by adding a small noisy matrix i. [sent-283, score-0.243]
86 SIQCLP needs much more time since, in each iteration, it needs to solves a quadratically constrained linear programming. [sent-295, score-0.24]
87 In Figure 1, we plot the objective values versus iteration on Sonar and Diabetes for SMM, SPGM, and ACCPM. [sent-296, score-0.178]
88 The SIQCLP approach is not included here since its objective value is not based on the iteration w. [sent-297, score-0.178]
89 the variable α which does not directly yield an increasing iteration sequence of objective values in contrast to those of the other three algorithms. [sent-300, score-0.178]
90 17s Table 3: Average test set accuracy (%) and CPU time in seconds (s) of different algorithms where λmax (λmin ) denotes the average maximum (minimum) eigenvalues of the indefinite kernel matrix over training samples. [sent-365, score-0.205]
91 However, ACCPM needs more time in each iteration than SMM and this observation becomes more apparent for the relatively large datasets shown in the time comparison of Table 3. [sent-374, score-0.159]
92 6 Conclusion In this paper we analyzed the regularization formulation for training SVM with indefinite kernels proposed by Luss and d’Aspremont [15]. [sent-375, score-0.136]
93 We show that the objective function of interest is continuously differentiable with Lipschitz continuous gradient. [sent-376, score-0.245]
94 Our elementary analysis greatly facilitates the application of gradient-based methods. [sent-377, score-0.101]
95 We formulated a simplified version of the projected gradient method presented in [15] and showed that it has a convergence rate of O(1/k). [sent-378, score-0.286]
96 We further developed Nesterov’s smooth optimization method [17, 18] for Indefinite SVM which has an optimal convergence rate of O(1/k 2 ) for smooth problems. [sent-379, score-0.463]
97 Experiments on various datasets validate our analysis and the efficiency of our proposed optimization approach. [sent-380, score-0.128]
98 We are also applying this method to real biological datasets such as protein sequence analysis using sequence alignment measures. [sent-382, score-0.124]
99 A study on sigmoid kernels for SVM and the training of non-psd kernels by smo-type methods. [sent-469, score-0.169]
100 An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. [sent-550, score-0.181]
wordName wordTfidf (topN-words)
[('inde', 0.557), ('smm', 0.301), ('spgm', 0.255), ('accpm', 0.232), ('lipschitz', 0.178), ('smooth', 0.15), ('nesterov', 0.145), ('svm', 0.142), ('siqclp', 0.139), ('nc', 0.131), ('projected', 0.124), ('psd', 0.122), ('mink', 0.122), ('luss', 0.116), ('nite', 0.113), ('kernel', 0.106), ('quadratically', 0.102), ('gradient', 0.098), ('objective', 0.098), ('lemma', 0.091), ('max', 0.09), ('ky', 0.087), ('inequality', 0.083), ('aspremont', 0.082), ('iteration', 0.08), ('differentiable', 0.076), ('diabetes', 0.074), ('proxy', 0.073), ('optimization', 0.072), ('saddle', 0.07), ('simpli', 0.07), ('kernels', 0.07), ('qclp', 0.07), ('theorem', 0.064), ('sonar', 0.063), ('convex', 0.057), ('needs', 0.052), ('completes', 0.051), ('eigenvalues', 0.051), ('matrix', 0.048), ('minn', 0.046), ('complexity', 0.046), ('smoothed', 0.043), ('adding', 0.041), ('mosek', 0.041), ('bristol', 0.041), ('differentiability', 0.041), ('worthy', 0.041), ('elementary', 0.04), ('analytic', 0.038), ('programming', 0.038), ('convergence', 0.038), ('continuous', 0.038), ('cutting', 0.037), ('formulation', 0.037), ('noting', 0.037), ('protein', 0.036), ('nn', 0.036), ('quadratic', 0.036), ('establish', 0.035), ('applying', 0.035), ('uniqueness', 0.035), ('diag', 0.034), ('facilitates', 0.034), ('constrained', 0.034), ('continuously', 0.033), ('plane', 0.032), ('kingdom', 0.031), ('attracted', 0.03), ('sigmoid', 0.029), ('validate', 0.029), ('scheme', 0.029), ('regularization', 0.029), ('arg', 0.028), ('ionosphere', 0.028), ('rn', 0.028), ('minimizer', 0.028), ('table', 0.028), ('monotonically', 0.027), ('usps', 0.027), ('indeed', 0.027), ('optimal', 0.027), ('similarity', 0.027), ('greatly', 0.027), ('datasets', 0.027), ('continuity', 0.026), ('method', 0.026), ('consequently', 0.026), ('mainly', 0.026), ('proximity', 0.026), ('cristianini', 0.026), ('accelerate', 0.026), ('diffusion', 0.026), ('optimality', 0.025), ('positive', 0.025), ('cone', 0.025), ('united', 0.025), ('go', 0.025), ('holds', 0.025), ('convexity', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 33 nips-2009-Analysis of SVM with Indefinite Kernels
Author: Yiming Ying, Colin Campbell, Mark Girolami
Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.
2 0.12988304 223 nips-2009-Sparse Metric Learning via Smooth Optimization
Author: Yiming Ying, Kaizhu Huang, Colin Campbell
Abstract: In this paper we study the problem of learning a low-rank (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a nondifferentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.
3 0.11084911 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
4 0.084482446 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar
Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1
5 0.080909893 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, Sören Sonnenburg
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, 1 -norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary p -norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply p -norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art. 1
6 0.079663701 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
7 0.07850717 147 nips-2009-Matrix Completion from Noisy Entries
8 0.07547304 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
9 0.07443437 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
10 0.074173242 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
11 0.071992472 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
12 0.070427746 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
13 0.068480834 119 nips-2009-Kernel Methods for Deep Learning
14 0.065315291 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
15 0.063547522 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
16 0.060785931 95 nips-2009-Fast subtree kernels on graphs
17 0.060762402 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors
18 0.060420349 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
19 0.05829227 72 nips-2009-Distribution Matching for Transduction
20 0.058154311 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
topicId topicWeight
[(0, -0.179), (1, 0.155), (2, -0.022), (3, 0.082), (4, -0.045), (5, -0.027), (6, 0.01), (7, 0.075), (8, -0.011), (9, 0.032), (10, 0.1), (11, -0.072), (12, 0.027), (13, 0.057), (14, -0.095), (15, -0.056), (16, -0.009), (17, -0.034), (18, -0.076), (19, -0.007), (20, -0.012), (21, -0.057), (22, 0.019), (23, 0.021), (24, -0.059), (25, -0.038), (26, -0.006), (27, -0.072), (28, 0.065), (29, 0.005), (30, 0.014), (31, -0.054), (32, -0.005), (33, 0.024), (34, -0.025), (35, 0.008), (36, -0.027), (37, -0.066), (38, -0.009), (39, 0.061), (40, -0.045), (41, -0.039), (42, -0.102), (43, -0.024), (44, -0.058), (45, 0.045), (46, -0.03), (47, -0.038), (48, -0.088), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.95195973 33 nips-2009-Analysis of SVM with Indefinite Kernels
Author: Yiming Ying, Colin Campbell, Mark Girolami
Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.
2 0.73103207 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
Author: Xiao-ming Wu, Anthony M. So, Zhenguo Li, Shuo-yen R. Li
Abstract: Kernel learning is a powerful framework for nonlinear data modeling. Using the kernel trick, a number of problems have been formulated as semidefinite programs (SDPs). These include Maximum Variance Unfolding (MVU) (Weinberger et al., 2004) in nonlinear dimensionality reduction, and Pairwise Constraint Propagation (PCP) (Li et al., 2008) in constrained clustering. Although in theory SDPs can be efficiently solved, the high computational complexity incurred in numerically processing the huge linear matrix inequality constraints has rendered the SDP approach unscalable. In this paper, we show that a large class of kernel learning problems can be reformulated as semidefinite-quadratic-linear programs (SQLPs), which only contain a simple positive semidefinite constraint, a second-order cone constraint and a number of linear constraints. These constraints are much easier to process numerically, and the gain in speedup over previous approaches is at least of the order m2.5 , where m is the matrix dimension. Experimental results are also presented to show the superb computational efficiency of our approach.
3 0.66526377 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors
Author: Liang Sun, Jun Liu, Jianhui Chen, Jieping Ye
Abstract: We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the (2, 1)-norm minimization, which is an extension of the well-known 1-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly much more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.
4 0.65311468 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
5 0.63895261 223 nips-2009-Sparse Metric Learning via Smooth Optimization
Author: Yiming Ying, Kaizhu Huang, Colin Campbell
Abstract: In this paper we study the problem of learning a low-rank (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a nondifferentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.
6 0.63606673 180 nips-2009-On the Convergence of the Concave-Convex Procedure
7 0.61094987 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
8 0.60147965 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization
9 0.57888532 191 nips-2009-Positive Semidefinite Metric Learning with Boosting
10 0.5742656 147 nips-2009-Matrix Completion from Noisy Entries
11 0.56507427 76 nips-2009-Efficient Learning using Forward-Backward Splitting
12 0.56449354 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning
13 0.5566209 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
14 0.54925048 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
15 0.53803521 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
16 0.52448618 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
17 0.5190649 72 nips-2009-Distribution Matching for Transduction
18 0.51818192 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
19 0.50717366 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization
20 0.50670981 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
topicId topicWeight
[(7, 0.011), (24, 0.062), (25, 0.062), (35, 0.044), (36, 0.098), (39, 0.032), (58, 0.112), (61, 0.369), (71, 0.024), (86, 0.1)]
simIndex simValue paperId paperTitle
1 0.90500981 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning
Author: Anne Hsu, Thomas L. Griffiths
Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1
2 0.87746239 242 nips-2009-The Infinite Partially Observable Markov Decision Process
Author: Finale Doshi-velez
Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1
3 0.87559336 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1
4 0.83717757 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
Author: Sylvain Arlot, Francis R. Bach
Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1
same-paper 5 0.82017618 33 nips-2009-Analysis of SVM with Indefinite Kernels
Author: Yiming Ying, Colin Campbell, Mark Girolami
Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.
6 0.70233601 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
7 0.65605277 134 nips-2009-Learning to Explore and Exploit in POMDPs
8 0.61980224 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
9 0.60842258 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference
10 0.60480559 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
11 0.60386741 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
12 0.60137254 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
13 0.59189188 113 nips-2009-Improving Existing Fault Recovery Policies
14 0.58467364 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
15 0.56781328 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning
16 0.56636465 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
17 0.56520659 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization
18 0.55869734 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing
19 0.5520969 48 nips-2009-Bootstrapping from Game Tree Search
20 0.54603797 91 nips-2009-Fast, smooth and adaptive regression in metric spaces