
176 nips-2010-Multiple Kernel Learning and the SMO Algorithm


Source: pdf

Author: Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, S.v.n. Vishwanathan

Abstract: Our objective is to train p-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. The SMO algorithm is simple, easy to implement and adapt, and efficiently scales to large problems. As a result, it has gained widespread acceptance and SVMs are routinely trained using SMO in diverse real world applications. Training using SMO has been a long standing goal in MKL for the very same reasons. Unfortunately, the standard MKL dual is not differentiable, and therefore can not be optimised using SMO style co-ordinate ascent. In this paper, we demonstrate that linear MKL regularised with the p-norm squared, or with certain Bregman divergences, can indeed be trained using SMO. The resulting algorithm retains both simplicity and efficiency and is significantly faster than state-of-the-art specialised p-norm MKL solvers. We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our objective is to train p-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. [sent-11, score-0.067]

2 Unfortunately, the standard MKL dual is not differentiable, and therefore can not be optimised using SMO style co-ordinate ascent. [sent-15, score-0.147]

3 The resulting algorithm retains both simplicity and efficiency and is significantly faster than state-of-the-art specialised p-norm MKL solvers. [sent-17, score-0.105]

4 We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core. [sent-18, score-0.761]

5 Recent trends indicate that performance gains can be achieved by non-linear kernel combinations [7,18,21], learning over large kernel spaces [2] and by using general, or non-sparse, regularisation [6, 7, 12, 18]. [sent-21, score-0.449]

6 Simultaneously, efficient optimisation techniques need to be developed to scale MKL out of the lab and into the real world. [sent-22, score-0.196]

7 Such algorithms can help in investigating new application areas and different facets of the MKL problem including dealing with a very large number of kernels and data points. [sent-23, score-0.236]

8 Optimisation using decompositional algorithms such as Sequential Minimal Optimization (SMO) [15] has been a long standing goal in MKL [3] as the algorithms are simple, easy to implement and efficiently scale to large problems. [sent-24, score-0.122]

9 The hope is that they might do for MKL what SMO did for SVMs – allow people to play with MKL on their laptops, modify and adapt it for diverse real world applications and explore large scale settings in terms of number of kernels and data points. [sent-25, score-0.267]

10 Unfortunately, the standard MKL formulation, which learns a linear combination of base kernels subject to l1 regularisation, leads to a dual which is not differentiable. [sent-26, score-0.371]

11 SMO cannot be applied as a result, and [3] had to resort to expensive Moreau-Yosida regularisation to smooth the dual. [sent-27, score-0.167]

12 State-of-the-art algorithms today overcome this limitation by solving an intermediate saddle point problem rather than the dual itself [12, 16]. [sent-28, score-0.218]

13 We shift the emphasis firmly back towards solving the dual in such cases. [sent-31, score-0.128]

14 The lp MKL dual is shown to be differentiable and thereby amenable to co-ordinate ascent. [sent-32, score-0.245]

15 Placing the p-norm squared regulariser in the objective lets us efficiently solve the core reduced two variable optimisation problem analytically in some cases and algorithmically in others. [sent-33, score-0.437]

16 Using results from [4, 9], we can compute the lp -MKL Hessian, which brings into play second order variable selection methods which tremendously speed up the rate of convergence [8]. [sent-34, score-0.138]

17 The resulting optimisation algorithm, which we call SMO-MKL, is straightforward to implement and efficient. [sent-36, score-0.195]

18 We demonstrate that SMO-MKL can be significantly faster than the state-of-the-art specialised p-norm solvers [12]. [sent-37, score-0.153]

19 This implies that our algorithm is well suited for learning both sparse, and non-sparse, kernel combinations. [sent-39, score-0.121]

20 We show that we can efficiently combine a hundred thousand kernels in approximately seven minutes or train on fifty thousand points in less than half an hour using a single core on standard hardware where other solvers fail to produce results. [sent-41, score-0.832]

21 A framework for learning general non-linear kernel combinations subject to general regularisation was presented in [18]. [sent-45, score-0.294]

22 Very significant performance gains in terms of pure classification accuracy were reported in [21] by learning a different kernel combination per data point or cluster. [sent-47, score-0.178]

23 Similar trends were observed for regression while learning polynomial kernel combinations [7]. [sent-49, score-0.151]

24 Other promising directions which have resulted in performance gains are sticking to standard MKL but combining an exponentially large number of kernels [2] and linear MKL with p-norm regularisers [6, 12]. [sent-50, score-0.27]

25 The regularisation method of [3] was one of the first techniques that could efficiently tackle medium-scale problems. [sent-55, score-0.174]

26 This was superseded by the SILP technique of [17] which could, very impressively, train on a million point problem with twenty kernels using parallelism. [sent-56, score-0.279]

27 In response, many two-stage wrapper techniques emerged [2, 10, 12, 16, 18] which could be significantly faster when the number of training points was reasonable but the number of kernels was large. [sent-58, score-0.351]

28 In fact, the solution needed to be of high enough precision so that the kernel weight gradient computation was accurate and the algorithm converged. [sent-61, score-0.121]

29 In addition, Armijo rule based step size selection was also very expensive and could involve tens of inner SVM evaluations in a single line search. [sent-62, score-0.081]

30 This was particularly expensive since the kernel cache would be invalidated from one SVM evaluation to the next. [sent-63, score-0.19]

31 The one big advantage of such two-stage methods for l1 -MKL was that they could quickly identify, and discard, the kernels with zero weights and thus scaled well with the number of kernels. [sent-64, score-0.263]

32 Most recently, [12] have come up with specialised p-norm solvers which make substantial gains by not solving the inner SVM to optimality and working with a small active set to better utilise the kernel cache. [sent-65, score-0.358]

33 The lp-MKL Formulation: The objective in MKL is to jointly learn kernel and SVM parameters from training data {(xi, yi)}. [sent-66, score-0.256]

34 Given a set of base kernels {Kk} and corresponding feature maps {φk}, linear MKL aims to learn a linear combination of the base kernels as K = Σk dk Kk. [sent-67, score-0.663]
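
To make the combination K = Σk dk Kk concrete, here is a minimal NumPy sketch that forms the combined Gram matrix from a list of precomputed base kernels; the toy kernels and weights are illustrative placeholders, not anything from the paper.

```python
import numpy as np

def combine_kernels(base_kernels, d):
    """Form K = sum_k d_k * K_k from a list of precomputed Gram matrices."""
    K = np.zeros_like(base_kernels[0])
    for K_k, d_k in zip(base_kernels, d):
        K += d_k * K_k
    return K

# Toy usage: three base kernels (linear, quadratic, RBF) on five points.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
sq_dists = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
base_kernels = [X @ X.T, (X @ X.T + 1.0) ** 2, np.exp(-0.5 * sq_dists)]
d = np.array([0.5, 0.3, 0.2])  # non-negative kernel weights
K = combine_kernels(base_kernels, d)
```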

35 If the kernel weights are restricted to be non-negative, then the MKL task corresponds to learning a standard SVM in the feature space formed by concatenating the vectors √(dk) φk. [sent-68, score-0.254]

36 The primal is min_{w,b,ξ≥0,d≥0} (1/2) Σk wk^t wk + C Σi ξi + (λ/2) (Σk dk^p)^(2/p) s.t. yi (Σk √(dk) wk^t φk(xi) + b) ≥ 1 − ξi (1). The regularisation on the kernel weights is necessary to prevent them from shooting off to infinity. [sent-71, score-0.645]

37 Which regulariser one uses depends on the task at hand. [sent-72, score-0.142]

38 In this Section, we limit ourselves to the p-norm squared regulariser with p > 1. [sent-73, score-0.181]

39 If it is felt that certain kernels are noisy and should be discarded then a sparse solution can be obtained by letting p tend to unity from above. [sent-74, score-0.236]

40 Note that the primal above can be made convex by substituting wk for √(dk) wk to get (2). [sent-76, score-0.324]

41 min_{w,b,ξ≥0,d≥0} (1/2) Σk wk^t wk / dk + C Σi ξi + (λ/2) (Σk dk^p)^(2/p) s.t. yi (Σk wk^t φk(xi) + b) ≥ 1 − ξi (2). We first derive an intermediate saddle point optimisation problem obtained by minimising over only w, b and ξ. [sent-78, score-0.525]

42 Note that most MKL methods end up optimising either this, or a very similar, saddle point problem. [sent-80, score-0.159]

43 Also note that it has sometimes been observed that l2 regularisation can provide better results than l1 [6, 7, 12, 18]. [sent-85, score-0.143]

44 This was one of the primary motivations for choosing the p-norm squared regulariser and placing it in the primal objective (the other was to be consistent with other p-norm formulations [9, 11]). [sent-87, score-0.275]

45 Had we included the regulariser as a primal constraint then the dual would have the q-norm rather than the q-norm squared. [sent-88, score-0.307]

46 However, it would then no longer have been possible to solve the two variable reduced problem analytically for the 2-norm special case. [sent-91, score-0.068]
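
For reference, writing (Hk)ij = yi yj Kk(xi, xj) and 1/p + 1/q = 1, the dual being optimised (reconstructed here so that it is consistent with the reduced-variable step (11) and the derivatives (12)-(14) below, rather than quoted from the paper) is:

```latex
\max_{\alpha}\; D(\alpha) \;=\; \mathbf{1}^{t}\alpha \;-\; \frac{1}{8\lambda}\Big(\sum_{k}\big(\alpha^{t} H_{k}\,\alpha\big)^{q}\Big)^{2/q}
\qquad \text{s.t.}\quad \sum_{i} y_{i}\alpha_{i} = 0,\;\; 0 \le \alpha_{i} \le C .
```

The 1/(8λ) factor and the 2/q exponent in this expression are what reappear in the reduced two-variable problem (11).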

47 SMO-MKL Optimisation: We now develop the SMO-MKL algorithm for optimising the lp-MKL dual. [sent-92, score-0.19]

48 The algorithm has three main components: (a) reduced variable optimisation; (b) working set selection and (c) stopping criterion and kernel caching. [sent-93, score-0.271]
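
The outer loop tying these three components together looks roughly like the Python skeleton below; the callable names are illustrative placeholders, not the interface of the released code [20].

```python
def smo_mkl_loop(select_pair, solve_reduced, update_cache, kkt_violation,
                 alpha, tol=1e-3, max_iter=100_000):
    """Generic SMO-style outer loop: pick two dual variables, optimise them, repeat."""
    for _ in range(max_iter):
        i, j = select_pair(alpha)               # (b) working set selection
        step = solve_reduced(alpha, i, j)       # (a) reduced two-variable optimisation
        if step is not None:
            alpha = update_cache(alpha, i, j, step)  # keep the H_k @ alpha vectors current
        if kkt_violation(alpha) < tol:          # (c) stopping criterion
            break
    return alpha
```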

49 The Reduced Variable Optimisation: The SMO algorithm works by repeatedly choosing two variables (assumed to be α1 and α2 without loss of generality in this subsection) and optimising them while holding all other variables constant. [sent-96, score-0.087]

50 If α1 ← α1 + ∆ and α2 ← α2 + s∆, the dual simplifies to ∆* = argmax_{L≤∆≤U} (1 + s)∆ − (1/(8λ)) (Σk (ak ∆^2 + 2 bk ∆ + ck)^q)^(2/q) (11), where s = −y1 y2 and L, U are the box-constraint bounds on ∆ implied by 0 ≤ α ≤ C (their form depends on the sign of s). [sent-97, score-0.129]

51 Nevertheless, since this is a simple one dimensional concave optimisation problem, we can efficiently find the global optimum using a variety of methods. [sent-101, score-0.165]

52 We tried bisection search and Brent’s algorithm but the Newton-Raphson method worked best – partly because the one dimensional Hessian was already available from the working set selection step. [sent-102, score-0.112]
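
A hedged sketch of that one-dimensional solve: the function below maximises the concave objective of (11) over [L, U] by bisecting on its derivative (the paper prefers Newton-Raphson since the Hessian is already available; bisection is used here only to keep the example short). The arrays a, b, c and the scalars s, λ, q follow the notation of (11); the toy call at the end uses made-up numbers.

```python
import numpy as np

def solve_reduced_delta(a, b, c, s, lam, q, L, U, iters=100):
    """Maximise f(D) = (1+s)*D - (1/(8*lam)) * (sum_k g_k(D)**q)**(2/q) over [L, U],
    where g_k(D) = a_k*D**2 + 2*b_k*D + c_k is assumed non-negative (PSD kernels)."""
    def fprime(D):
        g = np.maximum(a * D**2 + 2.0 * b * D + c, 1e-12)
        S = np.sum(g**q)
        return (1.0 + s) - (S**((2.0 - q) / q) / (2.0 * lam)) * np.sum(g**(q - 1.0) * (a * D + b))

    if fprime(L) <= 0.0:          # f is concave, so f' is non-increasing on [L, U]
        return L
    if fprime(U) >= 0.0:
        return U
    lo, hi = L, U
    for _ in range(iters):        # bisection on the sign change of f'
        mid = 0.5 * (lo + hi)
        if fprime(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a = np.array([1.0, 0.5, 2.0]); b = np.array([0.1, -0.2, 0.0]); c = np.ones(3)
print(solve_reduced_delta(a, b, c, s=-1.0, lam=1.0, q=2.0, L=-1.0, U=1.0))
```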

53 Working Set Selection: The choice of which two variables to select for optimisation can have a big impact on training time. [sent-104, score-0.224]

54 First and second order working set selection techniques are more expensive per iteration but converge in far fewer iterations. [sent-106, score-0.136]

55 We implement the greedy second order working set selection strategy of [8]. [sent-107, score-0.119]

56 We do not give the variable selection equations due to lack of space but refer the interested reader to the WSS2 method of [8] and our source code [20]. [sent-108, score-0.07]
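
In lieu of those equations, the sketch below shows only the simpler first-order "maximal violating pair" rule on the cached gradient; the released implementation uses the second-order WSS2 rule of [8], which additionally scores each candidate pair by its estimated gain. The gradient convention (grad = ∇α D for the maximisation problem) and the function name are assumptions.

```python
import numpy as np

def select_pair_first_order(grad, y, alpha, C):
    """Pick the maximal violating pair for max_alpha D(alpha) s.t. y'alpha = 0, 0 <= alpha <= C."""
    # Standard SMO 'up' / 'down' candidate sets determined by the box constraints.
    up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))
    down = ((y > 0) & (alpha > 0)) | ((y < 0) & (alpha < C))
    yg = y * grad
    i = np.where(up)[0][np.argmax(yg[up])]
    j = np.where(down)[0][np.argmin(yg[down])]
    violation = yg[i] - yg[j]   # zero (up to tolerance) at the optimum; doubles as a stopping test
    return i, j, violation
```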

57 These are readily derived to be ∇α D = 1 − Σk dk Hk α = 1 − Hα (12) and ∇²α D = −H − (1/λ) Σk ∇θk f⁻¹(θ) (Hk α)(Hk α)^t (13), where ∇θk f⁻¹(θ) = (2 − q) θ^((2−2q)/q) θk^(2q−2) + (q − 1) θ^((2−q)/q) θk^(q−2) and θk = (1/(2λ)) α^t Hk α (14), where D has been overloaded to now refer to the dual objective. [sent-110, score-0.239]

58 The cache needs to be updated every time we change α in the reduced variable optimisation. [sent-112, score-0.08]

59 However, since only two variables are changed, Hk α can be updated by summing along just two columns of the kernel matrix. [sent-113, score-0.121]
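
A small sketch of that incremental update (assuming label-folded matrices (Hk)ij = yi yj Kk(xi, xj) and a cache of the vectors Hk α): after αi and αj change, only two columns of each Hk are touched, and the check at the end confirms the update against a full recomputation.

```python
import numpy as np

def update_Hk_alpha(cache, H_list, i, j, delta_i, delta_j):
    """Update cached vectors H_k @ alpha after alpha_i += delta_i and alpha_j += delta_j."""
    for k, H_k in enumerate(H_list):
        cache[k] += H_k[:, i] * delta_i + H_k[:, j] * delta_j
    return cache

rng = np.random.default_rng(1)
n, m = 6, 3
H_list = [(lambda A: A @ A.T)(rng.standard_normal((n, n))) for _ in range(m)]
alpha = rng.random(n)
cache = np.stack([H @ alpha for H in H_list])
alpha_new = alpha.copy(); alpha_new[0] += 0.3; alpha_new[4] -= 0.1
cache = update_Hk_alpha(cache, H_list, 0, 4, 0.3, -0.1)
assert np.allclose(cache, np.stack([H @ alpha_new for H in H_list]))
```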

60 The Hessian is too expensive to cache and is recomputed on demand. [sent-115, score-0.069]

61 Kernel caching strategies can have a big impact on performance since kernel computations can dominate everything else in some cases. [sent-118, score-0.229]

62 While a few different kernel caching techniques have been explored for SVMs, we stick to the standard one used in LibSVM [5]. [sent-119, score-0.202]

63 Each element in the queue is a pointer to a recently accessed (common) row of each of the individual kernel matrices. [sent-121, score-0.121]
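
A minimal sketch of such a row cache (illustrative only, not the LibSVM code): each entry holds the requested row of every base kernel at once, and the least recently used entry is evicted when the cache is full.

```python
from collections import OrderedDict

class KernelRowCache:
    """LRU cache mapping a row index to the corresponding row of each base kernel."""
    def __init__(self, kernel_row_fns, capacity):
        self.kernel_row_fns = kernel_row_fns   # one callable per base kernel: i -> row i of K_k
        self.capacity = capacity
        self.rows = OrderedDict()              # row index -> list of rows, most recent last

    def get(self, i):
        if i in self.rows:
            self.rows.move_to_end(i)           # cache hit: mark as most recently used
        else:
            if len(self.rows) >= self.capacity:
                self.rows.popitem(last=False)  # evict the least recently used entry
            self.rows[i] = [row_of(i) for row_of in self.kernel_row_fns]
        return self.rows[i]
```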

64 2-Norm MKL: As we noted earlier, 2-norm MKL has sometimes been found to outperform MKL trained with l1 regularisation [6, 7, 12, 18]. [sent-124, score-0.143]
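
For this q = 2 special case the reduced problem (11) indeed has a closed form: the objective becomes a concave quartic in ∆, so its stationary points solve a cubic and the maximiser over [L, U] can be written down analytically. In LaTeX (my specialisation of (11), not a quotation):

```latex
\Delta^{*} \;=\; \operatorname*{argmax}_{L \le \Delta \le U}\; (1+s)\,\Delta \;-\; \frac{1}{8\lambda}\sum_{k}\big(a_{k}\Delta^{2} + 2 b_{k}\Delta + c_{k}\big)^{2}.
```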

65 The Bregman Divergence as a Regulariser: The Bregman divergence generalises the squared p-norm. [sent-128, score-0.097]

66 Generalised KL Divergence: To take a concrete example, different from the p-norm squared used thus far, we investigate the use of the generalised KL divergence as a regulariser. [sent-136, score-0.14]
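
For concreteness, the generalised KL divergence between the kernel weights d and a reference point d⁰ is the standard expression below; the exact reference point and scaling used in the paper are not visible in this extract, so treat this as the textbook form only:

```latex
\mathrm{KL}_{\mathrm{gen}}\big(d \,\big\|\, d^{0}\big) \;=\; \sum_{k} \Big( d_{k}\,\log\frac{d_{k}}{d^{0}_{k}} \;-\; d_{k} \;+\; d^{0}_{k} \Big), \qquad d_{k} \ge 0 .
```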

67 We therefore stay focused on lp -MKL for the remainder of this paper. [sent-139, score-0.103]

68 Experiments: In this Section, we empirically compare the performance of our proposed SMO-MKL algorithm against the specialised lp-MKL solver of [12], which is referred to as Shogun. [sent-144, score-0.206]

69 Our focus in these experiments is purely on training time and speed of optimisation as the prediction accuracy improvements of lp -MKL have already been documented [12]. [sent-148, score-0.346]

70 We also carry out a few large scale experiments with kernels computed on the fly. [sent-152, score-0.267]

71 In this case, kernel caching can have an effect, but not a significant one as the two methods have very similar caching strategies. [sent-154, score-0.283]

72 For each UCI data set we generated kernels as recommended in [16]. [sent-155, score-0.236]

73 We generated RBF kernels with ten bandwidths for each individual dimension of the feature vector as well as the full feature vector itself. [sent-156, score-0.257]

74 Similarly, we also generated polynomial kernels of degrees 1, 2 and 3. [sent-157, score-0.236]

75 All kernel matrices were pre-computed and normalised to have unit trace. [sent-158, score-0.236]
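
A hedged sketch of that kernel generation recipe (the specific bandwidths and the offset in the polynomial kernel are illustrative placeholders, since the exact values from [16] are not listed here): RBF kernels per individual dimension and on the full feature vector for each of ten bandwidths, polynomial kernels of degrees 1-3, and every Gram matrix scaled to unit trace.

```python
import numpy as np

def rbf(X, gamma):
    sq = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    return np.exp(-gamma * sq)

def poly(X, degree):
    return (X @ X.T + 1.0) ** degree

def unit_trace(K):
    return K / np.trace(K)

def make_base_kernels(X, gammas=2.0 ** np.arange(-3, 7), degrees=(1, 2, 3)):
    """RBF kernels on each single feature and on the full vector, plus polynomial kernels."""
    kernels = []
    for g in gammas:                      # ten bandwidths by default
        kernels.append(rbf(X, g))         # full feature vector
        for d in range(X.shape[1]):       # each individual dimension
            kernels.append(rbf(X[:, [d]], g))
    for deg in degrees:
        kernels.append(poly(X, deg))
    return [unit_trace(K) for K in kernels]

kernels = make_base_kernels(np.random.default_rng(0).standard_normal((20, 4)))
```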

76 Therefore, it is hoped that SMO-MKL can be used to learn sparse kernel combinations as well as non-sparse ones. [sent-172, score-0.151]

77 Moving on to the large scale experiments with kernels computed on the fly, we first tried combining a hundred thousand RBF kernels on the Sonar data set with 208 points and 59 dimensional features. [sent-173, score-0.748]

78 Table 1: Training times on UCI data sets with N training points, D-dimensional features, M kernels and T test points. [sent-174, score-0.268]

79 Note that these kernels do not form any special hierarchy, so approaches such as [2] are not applicable. [sent-544, score-0.236]

80 As can be seen, SMO-MKL appears to be scaling linearly with the number of kernels and we converge in less than half an hour on all hundred thousand kernels for both p = 2 and p = 1. [sent-546, score-0.781]

81 If we were to run the same experiment using pre-computed kernels then we converge in approximately seven minutes (see Fig (1b)). [sent-548, score-0.324]

82 On the other hand, Shogun took six hundred seconds to combine just ten thousand kernels computed on the fly. [sent-549, score-0.484]

83 Figures 1(c) and 1(d) plot timing results on a log-log scale as the number of training points is varied on the Adult and Web data sets (please see [1] for data set details and downloads). [sent-551, score-0.089]

84 We used 50 kernels computed on the fly. [sent-552, score-0.236]

85 Figure 1: Large scale experiments varying the number of kernels and points (panels include Sonar, Adult and Web). [sent-577, score-0.267]

86 On Adult, till about six thousand points, SMO-MKL is roughly 1. [sent-580, score-0.128]

87 However, on reaching eleven thousand points, Shogun starts taking more and more time to converge and we could not get results for sixteen thousand points or more. [sent-583, score-0.305]

88 Training on all 49,749 points and 50 kernels took roughly 1574 seconds. [sent-589, score-0.262]

89 Conclusions: We developed the SMO-MKL algorithm for efficiently optimising the lp-MKL formulation. [sent-595, score-0.19]

90 We placed the emphasis firmly back on optimising the MKL dual rather than the intermediate saddle point problem on which all state-of-the-art MKL solvers are based. [sent-596, score-0.375]

91 We showed that the lp -MKL dual is differentiable and that placing the p-norm squared regulariser in the primal objective lets us analytically solve the reduced variable problem for p = 2. [sent-597, score-0.588]

92 A second-order working set selection algorithm was implemented to speed up convergence. [sent-599, score-0.089]

93 In terms of empirical performance, we compared the SMO-MKL algorithm to the specialised lp MKL solver of [12] referred to as Shogun. [sent-602, score-0.206]

94 SMO-MKL was also found to be relatively stable for various values of p and could therefore be used to learn both sparse, and non-sparse, kernel combinations. [sent-604, score-0.121]

95 We demonstrated that the algorithm could combine a hundred thousand kernels on Sonar in approximately seven minutes using precomputed kernels and in less than half an hour using kernels computed on the fly. [sent-605, score-1.059]

96 This is significant as many non-linear kernel combination forms, which lead to performance improvements but are non-convex, can be recast as convex linear MKL with a much larger set of base kernels. [sent-606, score-0.173]

97 The SMOMKL algorithm can now be used to tackle such problems as long as an appropriate regulariser can be found. [sent-607, score-0.142]

98 We were also able to train on the entire Web data set with nearly fifty thousand points and fifty kernels computed on the fly in less than half an hour. [sent-608, score-0.445]

99 Exploring large feature spaces with hierarchical multiple kernel learning. [sent-622, score-0.121]

100 Multiple kernel learning, conic duality, and the SMO algorithm. [sent-632, score-0.121]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mkl', 0.609), ('shogun', 0.346), ('smo', 0.277), ('kernels', 0.236), ('hk', 0.191), ('optimisation', 0.165), ('regularisation', 0.143), ('regulariser', 0.142), ('dk', 0.133), ('thousand', 0.128), ('kernel', 0.121), ('dual', 0.106), ('lp', 0.103), ('dp', 0.091), ('sonar', 0.09), ('optimising', 0.087), ('bregman', 0.082), ('caching', 0.081), ('specialised', 0.079), ('saddle', 0.072), ('hundred', 0.068), ('kk', 0.061), ('primal', 0.059), ('divergence', 0.058), ('fty', 0.058), ('hour', 0.055), ('working', 0.054), ('wk', 0.053), ('wt', 0.052), ('varma', 0.051), ('solvers', 0.048), ('regularised', 0.047), ('cache', 0.045), ('rf', 0.044), ('generalised', 0.043), ('adult', 0.043), ('optimised', 0.041), ('intermediate', 0.04), ('squared', 0.039), ('hessian', 0.038), ('differentiable', 0.036), ('decompositional', 0.036), ('manik', 0.036), ('smomkl', 0.036), ('placing', 0.035), ('liver', 0.035), ('selection', 0.035), ('code', 0.035), ('seven', 0.035), ('reduced', 0.035), ('half', 0.035), ('gains', 0.034), ('libsvm', 0.034), ('lagrangian', 0.034), ('analytically', 0.033), ('web', 0.033), ('svm', 0.032), ('training', 0.032), ('lb', 0.031), ('wrapper', 0.031), ('seconds', 0.031), ('scale', 0.031), ('combinations', 0.03), ('implement', 0.03), ('minutes', 0.03), ('carried', 0.03), ('base', 0.029), ('dr', 0.029), ('uci', 0.029), ('ib', 0.029), ('rmly', 0.029), ('rkl', 0.029), ('kloft', 0.029), ('jmlr', 0.028), ('sonnenburg', 0.027), ('big', 0.027), ('stopping', 0.026), ('points', 0.026), ('substituting', 0.026), ('mohri', 0.026), ('faster', 0.026), ('divergences', 0.025), ('standing', 0.025), ('ak', 0.024), ('solver', 0.024), ('expensive', 0.024), ('thing', 0.024), ('cortes', 0.024), ('improvements', 0.023), ('ck', 0.023), ('accuracy', 0.023), ('core', 0.023), ('twenty', 0.023), ('converge', 0.023), ('tried', 0.023), ('kl', 0.022), ('inner', 0.022), ('emphasis', 0.022), ('ten', 0.021), ('train', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm

Author: Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, S.v.n. Vishwanathan

Abstract: Our objective is to train p-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. The SMO algorithm is simple, easy to implement and adapt, and efficiently scales to large problems. As a result, it has gained widespread acceptance and SVMs are routinely trained using SMO in diverse real world applications. Training using SMO has been a long standing goal in MKL for the very same reasons. Unfortunately, the standard MKL dual is not differentiable, and therefore can not be optimised using SMO style co-ordinate ascent. In this paper, we demonstrate that linear MKL regularised with the p-norm squared, or with certain Bregman divergences, can indeed be trained using SMO. The resulting algorithm retains both simplicity and efficiency and is significantly faster than state-of-the-art specialised p-norm MKL solvers. We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core. 1

2 0.40142527 174 nips-2010-Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition

Author: Serhat Bucak, Rong Jin, Anil K. Jain

Abstract: Recent studies have shown that multiple kernel learning is very effective for object recognition, leading to the popularity of kernel learning in computer vision problems. In this work, we develop an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes, making them unscalable when the number of classes is large, a challenge frequently encountered in visual object recognition. We address this computational challenge by developing a framework for ML-MKL that combines the worst-case analysis with stochastic approximation. Our analysis √ shows that the complexity of our algorithm is O(m1/3 lnm), where m is the number of classes. Empirical studies with object recognition show that while achieving similar classification accuracy, the proposed method is significantly more efficient than the state-of-the-art algorithms for ML-MKL. 1

3 0.39646649 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls

Author: Kun Gai, Guangyun Chen, Chang-shui Zhang

Abstract: In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2 ) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches. 1

4 0.20547181 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions

Author: Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-tal

Abstract: In this paper we consider the problem of learning an n × n kernel matrix from m(≥ 1) similarity matrices under general convex loss. Past research have extensively studied the m = 1 case and have derived several algorithms which require sophisticated techniques like ACCP, SOCP, etc. The existing algorithms do not apply if one uses arbitrary losses and often can not handle m > 1 case. We present several provably convergent iterative algorithms, where each iteration requires either an SVM or a Multiple Kernel Learning (MKL) solver for m > 1 case. One of the major contributions of the paper is to extend the well known Mirror Descent(MD) framework to handle Cartesian product of psd matrices. This novel extension leads to an algorithm, called EMKL, which solves the problem in 2 O( m ǫlog n ) iterations; in each iteration one solves an MKL involving m kernels 2 and m eigen-decomposition of n × n matrices. By suitably defining a restriction on the objective function, a faster version of EMKL is proposed, called REKL, which avoids the eigen-decomposition. An alternative to both EMKL and REKL is also suggested which requires only an SVM solver. Experimental results on real world protein data set involving several similarity matrices illustrate the efficacy of the proposed algorithms.

5 0.1368438 279 nips-2010-Universal Kernels on Non-Standard Input Spaces

Author: Andreas Christmann, Ingo Steinwart

Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classiÄ?Ĺš cation and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1

6 0.11355127 133 nips-2010-Kernel Descriptors for Visual Recognition

7 0.086850308 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

8 0.084513038 124 nips-2010-Inductive Regularized Learning of Kernel Functions

9 0.07708849 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts

10 0.076074265 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

11 0.068893291 228 nips-2010-Reverse Multi-Label Learning

12 0.068419002 250 nips-2010-Spectral Regularization for Support Estimation

13 0.065367132 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction

14 0.061875176 280 nips-2010-Unsupervised Kernel Dimension Reduction

15 0.057296276 96 nips-2010-Fractionally Predictive Spiking Neurons

16 0.053708117 99 nips-2010-Gated Softmax Classification

17 0.052407864 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

18 0.049288407 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification

19 0.049117777 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes

20 0.048614543 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.156), (1, 0.063), (2, 0.055), (3, -0.095), (4, 0.249), (5, 0.077), (6, 0.366), (7, 0.003), (8, 0.18), (9, 0.041), (10, -0.119), (11, 0.031), (12, -0.023), (13, 0.1), (14, -0.017), (15, 0.038), (16, -0.212), (17, 0.098), (18, -0.051), (19, 0.036), (20, 0.063), (21, 0.138), (22, 0.123), (23, 0.046), (24, -0.069), (25, -0.03), (26, -0.046), (27, 0.06), (28, 0.158), (29, 0.05), (30, 0.023), (31, 0.033), (32, 0.01), (33, -0.035), (34, -0.03), (35, -0.064), (36, 0.115), (37, -0.038), (38, -0.09), (39, 0.063), (40, -0.001), (41, -0.082), (42, -0.009), (43, -0.008), (44, 0.019), (45, -0.031), (46, -0.039), (47, 0.065), (48, 0.017), (49, -0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9509449 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm

Author: Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, S.v.n. Vishwanathan

Abstract: Our objective is to train p-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. The SMO algorithm is simple, easy to implement and adapt, and efficiently scales to large problems. As a result, it has gained widespread acceptance and SVMs are routinely trained using SMO in diverse real world applications. Training using SMO has been a long standing goal in MKL for the very same reasons. Unfortunately, the standard MKL dual is not differentiable, and therefore can not be optimised using SMO style co-ordinate ascent. In this paper, we demonstrate that linear MKL regularised with the p-norm squared, or with certain Bregman divergences, can indeed be trained using SMO. The resulting algorithm retains both simplicity and efficiency and is significantly faster than state-of-the-art specialised p-norm MKL solvers. We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core. 1

2 0.85237497 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls

Author: Kun Gai, Guangyun Chen, Chang-shui Zhang

Abstract: In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2 ) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches. 1

3 0.83780444 174 nips-2010-Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition

Author: Serhat Bucak, Rong Jin, Anil K. Jain

Abstract: Recent studies have shown that multiple kernel learning is very effective for object recognition, leading to the popularity of kernel learning in computer vision problems. In this work, we develop an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes, making them unscalable when the number of classes is large, a challenge frequently encountered in visual object recognition. We address this computational challenge by developing a framework for ML-MKL that combines the worst-case analysis with stochastic approximation. Our analysis √ shows that the complexity of our algorithm is O(m1/3 lnm), where m is the number of classes. Empirical studies with object recognition show that while achieving similar classification accuracy, the proposed method is significantly more efficient than the state-of-the-art algorithms for ML-MKL. 1

4 0.76175994 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions

Author: Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-tal

Abstract: In this paper we consider the problem of learning an n × n kernel matrix from m(≥ 1) similarity matrices under general convex loss. Past research have extensively studied the m = 1 case and have derived several algorithms which require sophisticated techniques like ACCP, SOCP, etc. The existing algorithms do not apply if one uses arbitrary losses and often can not handle m > 1 case. We present several provably convergent iterative algorithms, where each iteration requires either an SVM or a Multiple Kernel Learning (MKL) solver for m > 1 case. One of the major contributions of the paper is to extend the well known Mirror Descent(MD) framework to handle Cartesian product of psd matrices. This novel extension leads to an algorithm, called EMKL, which solves the problem in 2 O( m ǫlog n ) iterations; in each iteration one solves an MKL involving m kernels 2 and m eigen-decomposition of n × n matrices. By suitably defining a restriction on the objective function, a faster version of EMKL is proposed, called REKL, which avoids the eigen-decomposition. An alternative to both EMKL and REKL is also suggested which requires only an SVM solver. Experimental results on real world protein data set involving several similarity matrices illustrate the efficacy of the proposed algorithms.

5 0.61627609 133 nips-2010-Kernel Descriptors for Visual Recognition

Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox

Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.

6 0.56943685 124 nips-2010-Inductive Regularized Learning of Kernel Functions

7 0.48131672 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

8 0.46458218 279 nips-2010-Universal Kernels on Non-Standard Input Spaces

9 0.40543312 280 nips-2010-Unsupervised Kernel Dimension Reduction

10 0.32552558 250 nips-2010-Spectral Regularization for Support Estimation

11 0.31958252 96 nips-2010-Fractionally Predictive Spiking Neurons

12 0.2835494 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable

13 0.27921948 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

14 0.27847695 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction

15 0.25319433 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

16 0.25299788 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

17 0.22010779 62 nips-2010-Discriminative Clustering by Regularized Information Maximization

18 0.21938781 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis

19 0.21917224 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem

20 0.21612284 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.034), (17, 0.023), (27, 0.036), (30, 0.063), (35, 0.032), (41, 0.251), (45, 0.216), (50, 0.047), (52, 0.02), (60, 0.065), (65, 0.015), (77, 0.043), (78, 0.036), (90, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83820766 71 nips-2010-Efficient Relational Learning with Hidden Variable Detection

Author: Ni Lao, Jun Zhu, Liu Xinwang, Yandong Liu, William W. Cohen

Abstract: Markov networks (MNs) can incorporate arbitrarily complex features in modeling relational data. However, this flexibility comes at a sharp price of training an exponentially complex model. To address this challenge, we propose a novel relational learning approach, which consists of a restricted class of relational MNs (RMNs) called relation tree-based RMN (treeRMN), and an efficient Hidden Variable Detection algorithm called Contrastive Variable Induction (CVI). On one hand, the restricted treeRMN only considers simple (e.g., unary and pairwise) features in relational data and thus achieves computational efficiency; and on the other hand, the CVI algorithm efficiently detects hidden variables which can capture long range dependencies. Therefore, the resultant approach is highly efficient yet does not sacrifice its expressive power. Empirical results on four real datasets show that the proposed relational learning method can achieve similar prediction quality as the state-of-the-art approaches, but is significantly more efficient in training; and the induced hidden variables are semantically meaningful and crucial to improve the training speed and prediction qualities of treeRMNs.

same-paper 2 0.80482894 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm

Author: Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, S.v.n. Vishwanathan

Abstract: Our objective is to train p-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. The SMO algorithm is simple, easy to implement and adapt, and efficiently scales to large problems. As a result, it has gained widespread acceptance and SVMs are routinely trained using SMO in diverse real world applications. Training using SMO has been a long standing goal in MKL for the very same reasons. Unfortunately, the standard MKL dual is not differentiable, and therefore can not be optimised using SMO style co-ordinate ascent. In this paper, we demonstrate that linear MKL regularised with the p-norm squared, or with certain Bregman divergences, can indeed be trained using SMO. The resulting algorithm retains both simplicity and efficiency and is significantly faster than state-of-the-art specialised p-norm MKL solvers. We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core. 1

3 0.76452214 138 nips-2010-Large Margin Multi-Task Metric Learning

Author: Shibin Parameswaran, Kilian Q. Weinberger

Abstract: Multi-task learning (MTL) improves the prediction performance on multiple, different but related, learning problems through shared parameters or representations. One of the most prominent multi-task learning algorithms is an extension to support vector machines (svm) by Evgeniou et al. [15]. Although very elegant, multi-task svm is inherently restricted by the fact that support vector machines require each class to be addressed explicitly with its own weight vector which, in a multi-task setting, requires the different learning tasks to share the same set of classes. This paper proposes an alternative formulation for multi-task learning by extending the recently published large margin nearest neighbor (lmnn) algorithm to the MTL paradigm. Instead of relying on separating hyperplanes, its decision function is based on the nearest neighbor rule which inherently extends to many classes and becomes a natural fit for multi-task learning. We evaluate the resulting multi-task lmnn on real-world insurance data and speech classification problems and show that it consistently outperforms single-task kNN under several metrics and state-of-the-art MTL classifiers. 1

4 0.69714063 243 nips-2010-Smoothness, Low Noise and Fast Rates

Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: √ ˜ We establish an excess risk bound of O HR2 + HL∗ Rn for ERM with an H-smooth loss n function and a hypothesis class with Rademacher complexity Rn , where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = R/n, this translates to ˜ ˜ a learning rate of O (RH/n) in the separable (L∗ = 0) case and O RH/n + L∗ RH/n more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1

5 0.69702852 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

Author: Rina Foygel, Mathias Drton

Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1

6 0.69640005 63 nips-2010-Distributed Dual Averaging In Networks

7 0.69394839 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction

8 0.69141561 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

9 0.69136214 282 nips-2010-Variable margin losses for classifier design

10 0.69036061 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups

11 0.69017571 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

12 0.68964249 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

13 0.68961501 174 nips-2010-Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition

14 0.68910658 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

15 0.68891978 290 nips-2010-t-logistic regression

16 0.68860507 287 nips-2010-Worst-Case Linear Discriminant Analysis

17 0.6884861 222 nips-2010-Random Walk Approach to Regret Minimization

18 0.68836439 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

19 0.6877628 27 nips-2010-Agnostic Active Learning Without Constraints

20 0.68738472 265 nips-2010-The LASSO risk: asymptotic results and real world examples