nips nips2009 nips2009-33 knowledge-graph by maker-knowledge-mining

33 nips-2009-Analysis of SVM with Indefinite Kernels


Source: pdf

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). [sent-4, score-0.106]

2 This paper studies the properties of the objective function introduced there. [sent-5, score-0.098]

3 In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. [sent-6, score-0.305]

4 The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. [sent-8, score-0.463]

5 Our elementary result greatly facilitates the application of gradient-based algorithms. [sent-9, score-0.101]

6 Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. [sent-10, score-0.437]

7 1 Introduction Kernel methods [5, 24] such as Support Vector Machines (SVM) have recently attracted much attention due to their good generalization performance and appealing optimization approaches. [sent-12, score-0.102]

8 The basic idea of kernel methods is to map the data into a high dimensional (even infinite-dimensional) feature space through a kernel function. [sent-13, score-0.212]

9 The kernel function over samples forms a similarity kernel matrix which is usually required to be positive semi-definite (PSD). [sent-14, score-0.312]

10 The PSD property of the similarity matrix ensures that the SVM can be efficiently solved as a convex quadratic program. [sent-15, score-0.168]

11 However, many potential kernel matrices could be non-positive semi-definite. [sent-16, score-0.106]

12 Such cases are quite common in applications such as the sigmoid kernel [14] for various values of the hyper-parameters, hyperbolic tangent kernels [25], and protein sequence similarity measures derived from Smith-Waterman and BLAST scores [23]. [sent-17, score-0.268]

13 The problem of learning with a non-PSD similarity matrix (indefinite kernel) has recently attracted considerable attention [4, 8, 9, 14, 20, 21, 26]. [sent-18, score-0.105]

14 One widely used method is to convert the indefinite kernel matrix into a PSD one by using a spectrum transformation. [sent-19, score-0.18]

15 The denoise method neglects the negative eigenvalues [8, 21], the flip method [8] takes the absolute value of all eigenvalues, the shift method [22] makes the eigenvalues positive by adding a positive constant, and the diffusion method [11] takes the exponentials of the eigenvalues. [sent-20, score-0.271]
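To make these spectrum transformations concrete, here is a minimal numpy sketch (not taken from any of the cited papers; the function name and arguments are illustrative) that applies each of them to a symmetric indefinite matrix via an eigendecomposition.

```python
# Hedged sketch of the four spectrum transformations described above.
import numpy as np

def spectrum_transform(K0, method="clip", shift_const=None):
    """Turn a symmetric indefinite matrix K0 into a PSD one."""
    K0 = (K0 + K0.T) / 2.0                 # symmetrize against numerical noise
    w, V = np.linalg.eigh(K0)              # eigenvalues w, eigenvectors V
    if method == "clip":                   # denoise: drop negative eigenvalues
        w = np.maximum(w, 0.0)
    elif method == "flip":                 # absolute value of all eigenvalues
        w = np.abs(w)
    elif method == "shift":                # add a constant so the spectrum is >= 0
        c = shift_const if shift_const is not None else max(0.0, -w.min())
        w = w + c
    elif method == "diffusion":            # exponentials of the eigenvalues
        w = np.exp(w)
    else:
        raise ValueError(method)
    return (V * w) @ V.T                   # V diag(w) V^T
```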

16 In [9], the classification problem with indefinite kernels is regarded as the minimization of the distance between convex hulls in the pseudo-Euclidean space. [sent-23, score-0.127]

17 In [20], general Reproducing Kernel Kreĭn Spaces (RKKS) with indefinite kernels are introduced, which allow a general representer theorem and regularization formulations. [sent-24, score-0.163]

18 Training an SVM with an indefinite kernel was viewed as a kernel matrix learning problem [13]. [sent-26, score-0.26]

19 learning a proxy PSD kernel matrix to approximate the indefinite one. [sent-28, score-0.227]

20 Without realizing that the objective function is differentiable, the authors quadratically smoothed the objective function, and then formulated two approximate algorithms including the projected gradient method and the analytic center cutting plane method. [sent-29, score-0.696]

21 In this paper we follow the formulation of SVM with indefinite kernels proposed in [15]. [sent-30, score-0.107]

22 We mainly establish that the objective function (see its precise definition in equation (3)) is differentiable and, moreover, that its gradient is Lipschitz continuous. [sent-31, score-0.314]

23 This elementary result suggests there is no need to smooth the objective function, which greatly facilitates the application of gradient-based algorithms. [sent-32, score-0.349]

24 The main idea behind our analysis comes from its saddle (min-max) representation, which involves a penalty term in the form of a Frobenius norm of matrices, measuring the distance between the indefinite kernel matrix and the proxy PSD one. [sent-33, score-0.297]

25 This penalty term can be regarded as a Moreau-Yosida regularization term [12] to smooth out the objective function. [sent-34, score-0.277]

26 There, we first show that the objective function of interest is continuously differentiable and its gradient function can be explicitly computed. [sent-38, score-0.305]

27 Based on our analysis, in Section 4 we propose a simplified formulation of the projected gradient method presented in [15] and show that it has a convergence rate of O(1/k) where k is the iteration number. [sent-40, score-0.403]

28 We further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate of O(1/k^2) for smooth problems. [sent-41, score-0.437]

29 Suppose that K is a positive semi-definite kernel matrix (proxy kernel matrix) on inputs x. [sent-55, score-0.285]

30 Since we assume that K is positive semi-definite, the above problem is a standard convex quadratic program [2] and a global solution can be obtained efficiently by standard quadratic programming solvers. [sent-60, score-0.118]

31 Suppose now we are only given an indefinite kernel matrix K0 ∈ S^n. [sent-63, score-0.154]

32 Luss and d’Aspremont [15] proposed the following max-min approach to simultaneously learn a proxy PSD kernel matrix K for the indefinite matrix K0 and the SVM classification: min_{K∈S^n_+} max_{α∈Q1} α^⊤e − (1/2) α^⊤ Y K Y α + ρ ‖K − K0‖_F^2. (1) [sent-64, score-0.275]

33 By the min-max theorem [2], problem (1) is equivalent to max_{α∈Q1} min_{K∈S^n_+} L(α, K). (2) [sent-68, score-0.2]

34 For simplicity, we refer to the function defined by f(α) = min_{K∈S^n_+} L(α, K) (3) as the objective function. [sent-69, score-0.144]

35 We also call the associated function L(α, K) the saddle representation of the objective function f. [sent-71, score-0.168]
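For concreteness, here is a minimal sketch (not the authors' code; names such as L_saddle, rho and y are illustrative) that evaluates the saddle function L(α, K) = α^⊤e − (1/2)α^⊤YKYα + ρ‖K − K0‖_F^2 appearing in problem (1).

```python
# Hedged sketch of the saddle function L(alpha, K) from problem (1).
import numpy as np

def L_saddle(alpha, K, K0, y, rho):
    """alpha^T e - 0.5 * alpha^T Y K Y alpha + rho * ||K - K0||_F^2."""
    Y = np.diag(y)                        # y_i in {+1, -1}
    e = np.ones_like(alpha)
    quad = alpha @ Y @ K @ Y @ alpha
    return alpha @ e - 0.5 * quad + rho * np.linalg.norm(K - K0, "fro") ** 2
```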

36 For fixed α ∈ Q1, the optimization K(α) = arg min_{K∈S^n_+} L(α, K) is equivalent to a projection onto the semi-definite cone S^n_+. [sent-72, score-0.125]

37 Indeed, it was shown in [15] that the optimal solution is given by K(α) = (K0 + Y αα^⊤ Y /(4ρ))_+, (4) where, for any matrix A ∈ S^n, the notation A_+ denotes the positive part of A, obtained by simply setting its negative eigenvalues to zero. [sent-73, score-0.151]
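The closed form (4) amounts to an eigendecomposition followed by zeroing of the negative eigenvalues; below is a hedged numpy sketch of this step (illustrative function names, assuming the formula above).

```python
# Sketch of the proxy kernel K(alpha) from (4): project
# K0 + Y alpha alpha^T Y / (4 rho) onto the PSD cone.
import numpy as np

def psd_part(A):
    A = (A + A.T) / 2.0                       # symmetrize
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T     # zero out negative eigenvalues

def K_of_alpha(alpha, K0, y, rho):
    Y = np.diag(y)
    return psd_part(K0 + Y @ np.outer(alpha, alpha) @ Y / (4.0 * rho))
```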

38 The next lemma tells us that the optimal solution K^* belongs to a bounded domain in S^n_+. [sent-80, score-0.118]

39 Problem (2) is equivalent to the formulation max_{α∈Q1} min_{K∈Q2} L(α, K), and the objective function can be defined by f(α) = min_{K∈Q2} L(α, K), (5) where Q2 := {K ∈ S^n_+ : λ_max(K) ≤ λ_max(K0) + nC^2/(4ρ)}. [sent-82, score-0.135]

40 By the saddle theorem [2], we have L(α^*, K^*) = min_{K∈Q2} L(α^*, K). [sent-84, score-0.134]

41 The term α^⊤ Y K Y α in the objective function defined by (5) cannot be written in the above special form, and hence the theorem there cannot be applied to our case. [sent-96, score-0.162]

42 3 Differentiability of the Objective Function The following lemma outlines a very useful characterization of the differentiability properties of the optimal value function [3, Theorem 4. [sent-97, score-0.194]

43 Furthermore, if for α ∈ U, L(α, ·) has a unique minimizer x(α) over Q, then f is differentiable at α and the gradient of f is given by ∇f(α) = ∂_α L(α, x(α)). [sent-104, score-0.202]

44 Applying the above lemma to the objective function f defined by equation (5), we have: Theorem 1. [sent-105, score-0.189]

45 The objective function f defined by (3) (equivalently by (5)) is differentiable and its gradient is given by ∇f(α) = e − Y (K0 + Y αα^⊤ Y /(4ρ))_+ Y α. (6) [sent-106, score-0.272]
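A hedged sketch of this explicit gradient formula follows; it recomputes the positive part (·)_+ directly via an eigendecomposition and is not the authors' implementation (names are illustrative).

```python
# Sketch of the gradient in Theorem 1: grad f(alpha) = e - Y (.)_+ Y alpha.
import numpy as np

def grad_f(alpha, K0, y, rho):
    Y = np.diag(y)
    A = K0 + Y @ np.outer(alpha, alpha) @ Y / (4.0 * rho)
    A = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A)
    A_plus = (V * np.maximum(w, 0.0)) @ V.T   # (.)_+ : zero negative eigenvalues
    return np.ones_like(alpha) - Y @ A_plus @ Y @ alpha
```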

46 Noting that ∂_K L(α, K) = −(1/2) Y αα^⊤ Y + 2ρ(K − K0) and adding the above two first-order optimality inequalities together, we have −‖K2 − K1‖_F^2 ≥ 0, which means that K1 = K2 and hence completes the proof of the uniqueness of K(α). [sent-113, score-0.127]

47 Indeed, we can go further and establish the Lipschitz continuity of ∇f based on the strong convexity of L(α, ·). [sent-118, score-0.143]

48 Consequently, ‖K(α1) − K(α2)‖_F ≤ ‖Y (α2 α2^⊤ − α1 α1^⊤) Y‖_F /(4ρ) ≤ ‖α2 α2^⊤ − α1 α1^⊤‖_F /(4ρ), (7) where the last inequality follows from the fact that Y is an orthonormal matrix since yi ∈ {±1} and Y = diag(y1, …, yn). [sent-132, score-0.131]

49 Putting this back into inequality (7) completes the proof of the lemma. [sent-137, score-0.134]

50 It is interesting to point out that the above lemma can alternatively be established by delicate techniques in matrix analysis. [sent-138, score-0.139]

51 The perturbation inequality in matrix analysis [1, Lemma VII. [sent-148, score-0.131]

52 From the above lemma, we can establish the Lipschitz continuity of the gradient of the objective function. [sent-153, score-0.257]

53 The gradient of the objective function given by (6) is Lipschitz continuous with Lipschitz constant L = λ_max(K0) + nC^2/ρ. [sent-155, score-0.234]

54 That is, for any α1, α2 ∈ Q1 the following inequality holds: ‖∇f(α1) − ∇f(α2)‖ ≤ (λ_max(K0) + nC^2/ρ) ‖α1 − α2‖. [sent-157, score-0.108]
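As a sanity aid, the sketch below computes this Lipschitz constant from K0, C and ρ; pairing it with the grad_f sketch above lets one spot-check the inequality numerically on random feasible points. The helper name is illustrative.

```python
# Hedged sketch: L = lambda_max(K0) + n * C^2 / rho (Theorem 2).
import numpy as np

def lipschitz_constant(K0, C, rho):
    n = K0.shape[0]
    lam_max = np.linalg.eigvalsh((K0 + K0.T) / 2.0).max()
    return lam_max + n * C ** 2 / rho

# Example check (assumes grad_f from the sketch above and feasible a1, a2):
#   lhs = np.linalg.norm(grad_f(a1, K0, y, rho) - grad_f(a2, K0, y, rho))
#   assert lhs <= lipschitz_constant(K0, C, rho) * np.linalg.norm(a1 - a2) + 1e-8
```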

55 Table 1: Pseudo-code of the projected gradient method. 4 Smooth Optimization Algorithms This section is based on the theoretical analysis above, mainly Theorem 2. [sent-177, score-0.274]

56 We first outline a simplified version of the projected gradient method proposed in [15] and show it has a convergence rate of O(1/k) where k is the iteration number. [sent-178, score-0.366]

57 We can further develop a smooth optimization approach [17, 18] for indefinite SVM (5). [sent-179, score-0.222]

58 This scheme has an optimal convergence rate of O(1/k^2) for smooth problems and has been applied to a variety of problems. [sent-180, score-0.244]

59 4.1 Simplified Projected Gradient Method In [15], the objective function was smoothed by adding a quadratic term (see details in Section 3 there), and a projected gradient algorithm was then proposed to solve this approximation problem. [sent-184, score-0.44]

60 Using the explicit gradient representation in Theorem 1, we formulate its simplified version in Table 1, where the projection P_{Q1}: R^n → Q1 is defined, for any β ∈ R^n, by P_{Q1}(β) = arg min_{α∈Q1} ‖α − β‖^2. [sent-185, score-0.126]
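One convenient way to realize this projection is to solve the small quadratic program directly. The sketch below uses SciPy's SLSQP and assumes the usual SVM dual feasible set Q1 = {α : 0 ≤ α ≤ C, α^⊤y = 0}, which is an assumption since Q1 is not spelled out in this summary; it is not the paper's implementation.

```python
# Hedged sketch of the projection P_{Q1}(beta) = argmin_{alpha in Q1} ||alpha - beta||^2.
import numpy as np
from scipy.optimize import minimize

def project_Q1(beta, y, C):
    n = beta.shape[0]
    obj = lambda a: 0.5 * np.sum((a - beta) ** 2)
    jac = lambda a: a - beta
    cons = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]  # alpha^T y = 0
    bounds = [(0.0, C)] * n                                              # 0 <= alpha_i <= C
    res = minimize(obj, x0=np.clip(beta, 0.0, C), jac=jac,
                   bounds=bounds, constraints=cons, method="SLSQP")
    return res.x
```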

61 Let γ ≥ λ_max(K0) + nC^2/ρ and let {αk : k ∈ N} be given by the simplified projected gradient method in Table 1. [sent-194, score-0.248]

62 For any α ∈ Q1, the following inequality holds: f(α_{k+1}) ≥ f(α) + γ⟨αk − α_{k+1}, α − αk⟩ + (γ/2)‖αk − α_{k+1}‖^2. [sent-195, score-0.108]

63 Let γ ≥ λ_max(K0) + nC^2/ρ and let the iteration sequence {αk : k ∈ N} be given by the simplified projected gradient method in Table 1. [sent-207, score-0.328]

64 Then, we have that f(α_{k+1}) ≥ f(αk) + (γ/2)‖α_{k+1} − αk‖^2. (12) Moreover, max_{α∈Q1} f(α) − f(αk) ≤ (γ/(2k)) ‖α0 − α^*‖^2, (13) where α^* is an optimal solution of the problem max_{α∈Q1} f(α). [sent-208, score-0.117]
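Table 1 itself is not reproduced in this summary, so the following is only a minimal sketch of the simplified projected gradient iteration described above (ascend with step 1/γ, then project onto Q1). It reuses the grad_f, project_Q1 and lipschitz_constant sketches given earlier; all names are illustrative.

```python
# Hedged sketch of the simplified projected gradient method (SPGM).
import numpy as np

def spgm(K0, y, C, rho, num_iters=200):
    n = K0.shape[0]
    gamma = lipschitz_constant(K0, C, rho)          # gamma >= L, as required
    alpha = project_Q1(np.zeros(n), y, C)           # feasible starting point
    for _ in range(num_iters):
        step = alpha + grad_f(alpha, K0, y, rho) / gamma
        alpha = project_Q1(step, y, C)              # project back onto Q1
    return alpha
```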

65 Table 2: Pseudo-code of the first-order Nesterov smooth optimization method. Proof. [sent-222, score-0.248]

66 From the above theorem, the sequence {f (αk ) : k ∈ N} is monotonically increasing and the iteration complexity of SPGM is O(L/ε) for finding an ε-optimal solution. [sent-227, score-0.153]

67 4.2 Nesterov’s Smooth Optimization Method In [18, 17], Nesterov proposed an efficient smooth optimization method for solving convex programming problems of the form min_{x∈U} g(x), where g is a convex function with Lipschitz continuous gradient and U is a closed convex set in R^n. [sent-229, score-0.495]

68 The smooth optimization approach introduces a proxy-function d(x) associated with the set U. [sent-231, score-0.274]

69 It is assumed to be continuous and strongly convex on U with convexity parameter σ > 0. [sent-232, score-0.119]

70 A specific first-order smooth optimization scheme detailed in [18] can then be applied to the function g, with an iteration complexity of O(√(L/ε)). [sent-236, score-0.289]

71 Translating Nesterov’s first-order scheme [18, Section 3] to our problem (5), we obtain the smooth optimization algorithm for indefinite SVM; see its pseudo-code in Table 2. [sent-240, score-0.251]

72 The convergence of this optimal method was shown in [18]: max_{α∈Q1} f(α) − f(γk) ≤ 4L‖α0 − α^*‖^2 / ((k+1)(k+2)), where α^* is one of the optimal solutions. [sent-245, score-0.118]
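Table 2 is likewise not reproduced here; the sketch below is a common FISTA-style variant of Nesterov's accelerated projected gradient with the O(1/k^2) rate, not necessarily the exact scheme of [18] used in the paper. It reuses the grad_f, project_Q1 and lipschitz_constant sketches above, and all names are illustrative.

```python
# Hedged sketch of an accelerated (Nesterov-style) projected gradient ascent
# for max_{alpha in Q1} f(alpha).
import numpy as np

def nesterov_smooth(K0, y, C, rho, num_iters=200):
    n = K0.shape[0]
    L = lipschitz_constant(K0, C, rho)
    alpha = project_Q1(np.zeros(n), y, C)
    z, t = alpha.copy(), 1.0
    for _ in range(num_iters):
        alpha_new = project_Q1(z + grad_f(z, K0, y, rho) / L, y, C)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = alpha_new + (t - 1.0) / t_new * (alpha_new - alpha)   # momentum step
        alpha, t = alpha_new, t_new
    return alpha
```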

73 In [15], the objective function is smoothed by adding a quadratic term, and a projected gradient algorithm and an analytic center cutting plane method (ACCPM) are then proposed. [sent-252, score-0.573]

74 As proved in Theorem 3, the number of iterations of the projected gradient method is usually O(L/ε). [sent-253, score-0.248]

75 However, this method needs to use interior-point methods at each iteration, which would be slow for large-scale datasets. [sent-257, score-0.158]

76 Chen and Ye [4] reformulated indefinite SVM as an appealing semi-infinite quadratically constrained linear program (SIQCLP) without applying extra smoothing techniques. [sent-258, score-0.209]

77 The iteration complexity of semi-infinite linear programming is usually O(1/ε^3). [sent-260, score-0.164]

78 The main limitation of this approach is that one needs to store a growing subset of quadratic constraints indexed by n × n matrices and iteratively solve a quadratically constrained linear program (QCLP). [sent-263, score-0.362]

79 This tends to make the algorithm inefficient during the iteration process, although pruning techniques were proposed to avoid accumulating too many quadratic constraints. [sent-269, score-0.216]

80 Based on our theoretical results (Theorem 2), Nesterov’s smooth optimization method can be applied. [sent-270, score-0.248]

81 The complexity of this smooth optimization method (SMM) mainly relies on the eigenvalue decomposition in Step 2 of Table 2, which costs O(n^3). [sent-271, score-0.32]

82 The first-order smooth optimization approach [17, 18] has iteration complexity O(√(L/ε)) for finding an ε-optimal solution. [sent-273, score-0.348]

83 Hence, from this theoretical comparison, the complexity of the smooth optimization method is better than that of the simplified projected gradient method (SPGM) and SIQCLP. [sent-275, score-0.366]

84 5 Experimental Validation We run our proposed smooth optimization approach and simplified projected gradient method on various datasets to validate our analysis. [sent-279, score-0.526]

85 In each data split, as in [4], we first generate a Gaussian kernel matrix K with the hyper-parameter determined by cross-validation on the training data using LIBSVM, and then construct indefinite matrices by adding a small noise matrix. [sent-283, score-0.243]
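A hedged sketch of this experimental construction follows: a Gaussian (RBF) kernel perturbed by a small symmetric noise matrix, which is typically indefinite. The bandwidth, noise level and function name are illustrative, not taken from the paper.

```python
# Sketch: build an indefinite kernel from an RBF kernel plus symmetric noise.
import numpy as np

def indefinite_kernel(X, sigma=1.0, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T     # squared distances
    D2 = np.maximum(D2, 0.0)                            # guard against round-off
    K = np.exp(-D2 / (2.0 * sigma ** 2))                # PSD Gaussian kernel
    N = rng.standard_normal(K.shape)
    return K + noise * (N + N.T) / 2.0                  # symmetric perturbation
```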

86 SIQCLP needs much more time since, in each iteration, it needs to solve a quadratically constrained linear program. [sent-295, score-0.24]

87 In Figure 1, we plot the objective values versus the iteration number on Sonar and Diabetes for SMM, SPGM, and ACCPM. [sent-296, score-0.178]

88 The SIQCLP approach is not included here since its objective value is not based on the iteration w.r.t. [sent-297, score-0.178]

89 the variable α, which does not directly yield an increasing sequence of objective values over iterations, in contrast to those of the other three algorithms. [sent-300, score-0.178]

90 Table 3: Average test set accuracy (%) and CPU time in seconds (s) of different algorithms, where λmax (λmin) denotes the average maximum (minimum) eigenvalue of the indefinite kernel matrix over training samples. [sent-365, score-0.205]

91 However, ACCPM needs more time in each iteration than SMM, and this observation becomes more apparent for the relatively large datasets, as shown in the time comparison of Table 3. [sent-374, score-0.159]

92 6 Conclusion In this paper we analyzed the regularization formulation for training SVM with indefinite kernels proposed by Luss and d’Aspremont [15]. [sent-375, score-0.136]

93 We showed that the objective function of interest is continuously differentiable with Lipschitz continuous gradient. [sent-376, score-0.245]

94 Our elementary analysis greatly facilitates the application of gradient-based methods. [sent-377, score-0.101]

95 We formulated a simplified version of the projected gradient method presented in [15] and showed that it has a convergence rate of O(1/k). [sent-378, score-0.286]

96 We further developed Nesterov’s smooth optimization method [17, 18] for indefinite SVM, which has an optimal convergence rate of O(1/k^2) for smooth problems. [sent-379, score-0.463]

97 Experiments on various datasets validate our analysis and the efficiency of our proposed optimization approach. [sent-380, score-0.128]

98 We are also applying this method to real biological datasets such as protein sequence analysis using sequence alignment measures. [sent-382, score-0.124]

99 A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. [sent-469, score-0.169]

100 An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. [sent-550, score-0.181]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('inde', 0.557), ('smm', 0.301), ('spgm', 0.255), ('accpm', 0.232), ('lipschitz', 0.178), ('smooth', 0.15), ('nesterov', 0.145), ('svm', 0.142), ('siqclp', 0.139), ('nc', 0.131), ('projected', 0.124), ('psd', 0.122), ('mink', 0.122), ('luss', 0.116), ('nite', 0.113), ('kernel', 0.106), ('quadratically', 0.102), ('gradient', 0.098), ('objective', 0.098), ('lemma', 0.091), ('max', 0.09), ('ky', 0.087), ('inequality', 0.083), ('aspremont', 0.082), ('iteration', 0.08), ('differentiable', 0.076), ('diabetes', 0.074), ('proxy', 0.073), ('optimization', 0.072), ('saddle', 0.07), ('simpli', 0.07), ('kernels', 0.07), ('qclp', 0.07), ('theorem', 0.064), ('sonar', 0.063), ('convex', 0.057), ('needs', 0.052), ('completes', 0.051), ('eigenvalues', 0.051), ('matrix', 0.048), ('minn', 0.046), ('complexity', 0.046), ('smoothed', 0.043), ('adding', 0.041), ('mosek', 0.041), ('bristol', 0.041), ('differentiability', 0.041), ('worthy', 0.041), ('elementary', 0.04), ('analytic', 0.038), ('programming', 0.038), ('convergence', 0.038), ('continuous', 0.038), ('cutting', 0.037), ('formulation', 0.037), ('noting', 0.037), ('protein', 0.036), ('nn', 0.036), ('quadratic', 0.036), ('establish', 0.035), ('applying', 0.035), ('uniqueness', 0.035), ('diag', 0.034), ('facilitates', 0.034), ('constrained', 0.034), ('continuously', 0.033), ('plane', 0.032), ('kingdom', 0.031), ('attracted', 0.03), ('sigmoid', 0.029), ('validate', 0.029), ('scheme', 0.029), ('regularization', 0.029), ('arg', 0.028), ('ionosphere', 0.028), ('rn', 0.028), ('minimizer', 0.028), ('table', 0.028), ('monotonically', 0.027), ('usps', 0.027), ('indeed', 0.027), ('optimal', 0.027), ('similarity', 0.027), ('greatly', 0.027), ('datasets', 0.027), ('continuity', 0.026), ('method', 0.026), ('consequently', 0.026), ('mainly', 0.026), ('proximity', 0.026), ('cristianini', 0.026), ('accelerate', 0.026), ('diffusion', 0.026), ('optimality', 0.025), ('positive', 0.025), ('cone', 0.025), ('united', 0.025), ('go', 0.025), ('holds', 0.025), ('convexity', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 33 nips-2009-Analysis of SVM with Indefinite Kernels

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.

2 0.12988304 223 nips-2009-Sparse Metric Learning via Smooth Optimization

Author: Yiming Ying, Kaizhu Huang, Colin Campbell

Abstract: In this paper we study the problem of learning a low-rank (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a nondifferentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.

3 0.11084911 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

4 0.084482446 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

5 0.080909893 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, Sören Sonnenburg

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, 1 -norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary p -norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply p -norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art. 1

6 0.079663701 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

7 0.07850717 147 nips-2009-Matrix Completion from Noisy Entries

8 0.07547304 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

9 0.07443437 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

10 0.074173242 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

11 0.071992472 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

12 0.070427746 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

13 0.068480834 119 nips-2009-Kernel Methods for Deep Learning

14 0.065315291 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

15 0.063547522 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

16 0.060785931 95 nips-2009-Fast subtree kernels on graphs

17 0.060762402 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

18 0.060420349 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

19 0.05829227 72 nips-2009-Distribution Matching for Transduction

20 0.058154311 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.179), (1, 0.155), (2, -0.022), (3, 0.082), (4, -0.045), (5, -0.027), (6, 0.01), (7, 0.075), (8, -0.011), (9, 0.032), (10, 0.1), (11, -0.072), (12, 0.027), (13, 0.057), (14, -0.095), (15, -0.056), (16, -0.009), (17, -0.034), (18, -0.076), (19, -0.007), (20, -0.012), (21, -0.057), (22, 0.019), (23, 0.021), (24, -0.059), (25, -0.038), (26, -0.006), (27, -0.072), (28, 0.065), (29, 0.005), (30, 0.014), (31, -0.054), (32, -0.005), (33, 0.024), (34, -0.025), (35, 0.008), (36, -0.027), (37, -0.066), (38, -0.009), (39, 0.061), (40, -0.045), (41, -0.039), (42, -0.102), (43, -0.024), (44, -0.058), (45, 0.045), (46, -0.03), (47, -0.038), (48, -0.088), (49, -0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95195973 33 nips-2009-Analysis of SVM with Indefinite Kernels

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.

2 0.73103207 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

Author: Xiao-ming Wu, Anthony M. So, Zhenguo Li, Shuo-yen R. Li

Abstract: Kernel learning is a powerful framework for nonlinear data modeling. Using the kernel trick, a number of problems have been formulated as semidefinite programs (SDPs). These include Maximum Variance Unfolding (MVU) (Weinberger et al., 2004) in nonlinear dimensionality reduction, and Pairwise Constraint Propagation (PCP) (Li et al., 2008) in constrained clustering. Although in theory SDPs can be efficiently solved, the high computational complexity incurred in numerically processing the huge linear matrix inequality constraints has rendered the SDP approach unscalable. In this paper, we show that a large class of kernel learning problems can be reformulated as semidefinite-quadratic-linear programs (SQLPs), which only contain a simple positive semidefinite constraint, a second-order cone constraint and a number of linear constraints. These constraints are much easier to process numerically, and the gain in speedup over previous approaches is at least of the order m2.5 , where m is the matrix dimension. Experimental results are also presented to show the superb computational efficiency of our approach.

3 0.66526377 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

Author: Liang Sun, Jun Liu, Jianhui Chen, Jieping Ye

Abstract: We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the (2, 1)-norm minimization, which is an extension of the well-known 1-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly much more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP) problem, which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.

4 0.65311468 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

5 0.63895261 223 nips-2009-Sparse Metric Learning via Smooth Optimization

Author: Yiming Ying, Kaizhu Huang, Colin Campbell

Abstract: In this paper we study the problem of learning a low-rank (sparse) distance matrix. We propose a novel metric learning model which can simultaneously conduct dimension reduction and learn a distance matrix. The sparse representation involves a mixed-norm regularization which is non-convex. We then show that it can be equivalently formulated as a convex saddle (min-max) problem. From this saddle representation, we develop an efficient smooth optimization approach [17] for sparse metric learning, although the learning model is based on a nondifferentiable loss function. Finally, we run experiments to validate the effectiveness and efficiency of our sparse metric learning model on various datasets.

6 0.63606673 180 nips-2009-On the Convergence of the Concave-Convex Procedure

7 0.61094987 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

8 0.60147965 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

9 0.57888532 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

10 0.5742656 147 nips-2009-Matrix Completion from Noisy Entries

11 0.56507427 76 nips-2009-Efficient Learning using Forward-Backward Splitting

12 0.56449354 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

13 0.5566209 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

14 0.54925048 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

15 0.53803521 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

16 0.52448618 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

17 0.5190649 72 nips-2009-Distribution Matching for Transduction

18 0.51818192 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

19 0.50717366 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

20 0.50670981 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.011), (24, 0.062), (25, 0.062), (35, 0.044), (36, 0.098), (39, 0.032), (58, 0.112), (61, 0.369), (71, 0.024), (86, 0.1)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90500981 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

Author: Anne Hsu, Thomas L. Griffiths

Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1

2 0.87746239 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

3 0.87559336 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1

4 0.83717757 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

same-paper 5 0.82017618 33 nips-2009-Analysis of SVM with Indefinite Kernels

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.

6 0.70233601 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

7 0.65605277 134 nips-2009-Learning to Explore and Exploit in POMDPs

8 0.61980224 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

9 0.60842258 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

10 0.60480559 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

11 0.60386741 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

12 0.60137254 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

13 0.59189188 113 nips-2009-Improving Existing Fault Recovery Policies

14 0.58467364 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

15 0.56781328 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

16 0.56636465 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

17 0.56520659 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

18 0.55869734 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

19 0.5520969 48 nips-2009-Bootstrapping from Game Tree Search

20 0.54603797 91 nips-2009-Fast, smooth and adaptive regression in metric spaces