nips nips2010 nips2010-145 knowledge-graph by maker-knowledge-mining

145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls


Source: pdf

Author: Kun Gai, Guangyun Chen, Chang-shui Zhang

Abstract: In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2 ) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. [sent-7, score-1.017]

2 The reason is that the margin itself cannot adequately describe how good a kernel is, because it neglects scaling. [sent-8, score-0.458]

3 We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. [sent-9, score-1.056]

4 This formulation is invariant to scalings of learned kernels, and when learning a linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2) of norm constraints on combination coefficients. [sent-10, score-1.616]

5 We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. [sent-13, score-0.482]

6 Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-the-art MKL approaches. [sent-14, score-0.496]

7 1 Introduction In the past years, kernel methods, like support vector machines (SVM), have achieved great success in many learning problems, such as classification and regression. [sent-15, score-0.339]

8 A good kernel function, which implicitly characterizes a suitable transformation of input data, can greatly benefit the accuracy of the predictor. [sent-17, score-0.369]

9 Kernel learning has been developed to jointly learn both a kernel function and an SVM classifier. [sent-19, score-0.339]

10 [1] present several principles to tune parameters in kernel functions. [sent-21, score-0.339]

11 In particular, when the learned kernel is restricted to be a linear combination of multiple basis kernels, the problem of learning the combination coefficients as well as an SVM classifier is usually called multiple kernel learning (MKL). [sent-22, score-1.043]

12 Some subsequent work explores more general forms of multiple kernel learning by promoting non-sparse [7, 8] or group-sparse [9] combinations of basis kernels, or by using other forms of learned kernels, e.g., [sent-26, score-0.536]

13 a combination of an exponential number of kernels [10] or nonlinear combinations [11, 12, 13]. [sent-28, score-0.396]

14 With an acceptable empirical loss, they aim to find the kernel which leads to the largest margin of the SVM classifier. [sent-30, score-0.506]

15 However, despite the substantial progress in both the algorithmic design and the theoretical understanding of the MKL problem, none of the approaches seems to reliably outperform baseline methods, like SVM with the uniform combination of basis kernels [13]. [sent-31, score-0.503]

16 As will be shown in this paper, the large margin principle used in these methods causes the scaling problem and the initialization problem, which can strongly affect the final learned kernels as well as the resulting performance. [sent-32, score-0.653]

17 This implies that the large margin preference cannot reliably result in a good kernel, and thus the margin itself is not a suitable measure of the goodness of a kernel. [sent-33, score-0.415]

18 Their principle is sensitive to kernel scalings when a nonzero empirical loss is allowed, also causing the same problems as the margin-based formulations. [sent-37, score-0.653]

19 We prove that our formulation is invariant to scalings of learned kernels, and also invariant to initial scalings of basis kernels and to the types (e.g., [sent-38, score-1.142]

20 L1 or L2) of norm constraints on kernel parameters for the MKL problem. [sent-40, score-0.53]

21 Experiments show that our approach gives significant performance improvements both over SVM with the uniform combination of basis kernels and over other state-of-the-art kernel learning methods. [sent-42, score-0.813]

22 Our proposed kernel learning problem can be reformulated as a tri-level optimization problem. [sent-43, score-0.366]

23 This enables us to generally tackle the radius of the minimal enclosing ball, or other complicated optimal value functions, in the kernel learning framework by simple gradient-based methods. [sent-45, score-0.65]

24 In Section 3 we present a new kernel learning formulation and discuss its properties. [sent-49, score-0.462]

25 2 Measuring how good a kernel is: Let D = {(x1, y1), . . . , (xn, yn)} denote the training data. [sent-53, score-0.339]

26 Suppose we have a kernel family K = {k : X × X → R}, in which any kernel function k implicitly defines a transformation φ(·; k) from the input space X to a feature space by k(xc, xd) = ⟨φ(xc; k), φ(xd; k)⟩. [sent-57, score-0.702]

27 The task of kernel learning (for binary classification) is to learn both a kernel function k ∈ K and a classifier w and b. [sent-59, score-0.678]

28 To make the problem tractable, the learned kernel is usually restricted to a parametric form k^(θ)(·, ·), where θ = [θi]i is the kernel parameter. [sent-60, score-0.742]

29 Then the problem of learning a kernel reduces to the problem of learning a kernel parameter θ. [sent-61, score-0.71]

30 The most commonly used kernel form is a linear combination of multiple basis kernels, k^(θ)(·, ·) = Σ_{j=1}^m θj kj(·, ·), given as equation (2). [sent-62, score-0.678]
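
As a concrete illustration of equation (2), the sketch below builds a combined kernel matrix from precomputed basis kernel matrices. The helper name `combined_kernel` and the toy Gaussian basis kernels are our own illustrative choices, not code from the paper.

```python
import numpy as np

def combined_kernel(theta, basis_kernels):
    """Linear combination K(theta) = sum_j theta_j * K_j of precomputed
    basis kernel matrices, as in equation (2)."""
    K = np.zeros_like(basis_kernels[0], dtype=float)
    for t, Kj in zip(theta, basis_kernels):
        K += float(t) * Kj
    return K

# toy usage: two Gaussian basis kernels on random points
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
basis = [np.exp(-g * sq) for g in (0.1, 1.0)]
K = combined_kernel([0.3, 0.7], basis)
```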

31 2.1 Problems in multiple kernel learning: Most existing MKL approaches use the margin-based formulation given as (3)-(5). [sent-64, score-0.361]

32 For any kernel k, the optimal classifier (w, b) in this formulation is actually the SVM classifier with the kernel k. [sent-72, score-0.718]

33 Thus the term ‖w‖² makes formulation (3) prefer the kernel that results in an SVM classifier with a larger margin (as well as an acceptable empirical loss). [sent-75, score-0.603]

34 Here, a natural question is whether, for different kernels, the margins of the SVM classifiers can adequately measure the goodness of the kernels. [sent-76, score-0.396]

35 To answer this question, we consider what happens when a kernel k is enlarged by a scalar a: k_new = ak, where a > 1. [sent-77, score-0.401]

36 Then we obtain G(ak) = G(k_new) ≤ ½‖w/√a‖² + C Σ_i ξi < ½‖w‖² + C Σ_i ξi = G(k), where (w, b, ξ) is the optimal solution for k, which means the enlarged kernel gives a larger margin and a smaller objective value. [sent-81, score-0.511]
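
This scaling argument is easy to reproduce numerically. The sketch below trains a standard SVM on a precomputed kernel and evaluates the margin-based objective ½‖w‖² + C Σ_i ξi for the kernel scaled by increasing factors a; the value keeps shrinking, mirroring the problem described here. It is an illustrative check using scikit-learn on a toy two-moons dataset, not an experiment from the paper.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def margin_objective(K, y, C=1.0):
    """0.5*||w||^2 + C*sum(xi) for an SVM trained on the precomputed kernel K."""
    clf = SVC(kernel="precomputed", C=C).fit(K, y)
    sv, ay = clf.support_, clf.dual_coef_          # ay = y_i * alpha_i over support vectors
    w_norm_sq = (ay @ K[np.ix_(sv, sv)] @ ay.T).item()
    xi = np.maximum(0.0, 1.0 - y * clf.decision_function(K))
    return 0.5 * w_norm_sq + C * xi.sum()

X, y = make_moons(n_samples=80, noise=0.3, random_state=0)
y = 2 * y - 1                                      # labels in {-1, +1}
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

for a in (1.0, 10.0, 100.0):
    print(a, margin_objective(a * K, y))           # the objective keeps decreasing
```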

37 As a consequence, on one hand, the large margin preference guides the scaling of the learned kernel to be as large as possible. [sent-82, score-0.582]

38 In the linear combination case, the scaling problem means that the kernel parameter θ does not converge during the optimization. [sent-86, score-0.536]

39 Consider an L1 norm constraint and a learned kernel that is a combination of two basis kernels, k^(θ)(·, ·) = θ1 k1(·, ·) + θ2 k2(·, ·), with θ1, θ2 ≥ 0 and θ1 + θ2 = 1. [sent-91, score-0.758]

40 The example shows that the final solution can be greatly affected by the initial scalings of basis kernels, although a norm constraint is used. [sent-97, score-0.543]
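
A crude way to see this effect is to scan the margin-based objective over the two-kernel simplex before and after rescaling one basis kernel; the preferred combination coefficient typically shifts toward the rescaled kernel. The grid search, dataset, and kernel bandwidths below are illustrative stand-ins, not the paper's experiment.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def margin_objective(K, y, C=1.0):
    clf = SVC(kernel="precomputed", C=C).fit(K, y)
    sv, ay = clf.support_, clf.dual_coef_
    w2 = (ay @ K[np.ix_(sv, sv)] @ ay.T).item()
    xi = np.maximum(0.0, 1.0 - y * clf.decision_function(K))
    return 0.5 * w2 + C * xi.sum()

X, y = make_moons(n_samples=80, noise=0.3, random_state=1)
y = 2 * y - 1
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K1, K2 = np.exp(-0.1 * sq), np.exp(-2.0 * sq)       # two basis kernels

grid = np.linspace(0.0, 1.0, 21)                    # theta_1 on the L1 simplex
for scale in (1.0, 50.0):                           # rescale the first basis kernel
    objs = [margin_objective(t * scale * K1 + (1 - t) * K2, y) for t in grid]
    print(scale, grid[int(np.argmin(objs))])        # preferred theta_1 typically shifts
```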

41 When the MKL framework is extended from the linear combination cases to the nonlinear cases, the scaling problem becomes more serious, as even a finite scaling of the learned kernel may not be generally guaranteed by a simple norm constraint on kernel parameters for some kernel forms. [sent-99, score-1.502]

42 2.2 Measuring the goodness of kernels with the radiuses of MEB: Now we need to find a more reasonable way to measure the goodness of kernels. [sent-102, score-0.568]

43 Below we introduce the generalization error bounds for SVM and kernel learning, which inspire us to consider the minimum enclosing ball to learn a kernel. [sent-103, score-0.552]

44 For SVM with a fixed kernel, it is well known that the estimation error, which denotes the gap between the expected error and the empirical error, is bounded by O(R²γ⁻²/n), where R is the radius of the minimum enclosing ball (MEB) of data in the feature space endowed with the kernel used. [sent-104, score-0.77]

45 The scalar dφ denotes the pseudodimension [14] of the kernel family K. [sent-106, score-0.363]

46 For example, dφ of linear combination kernels is no larger than the number of basis kernels, and dφ of the Gaussian kernels of the form k^(θ)(xa, xb) = e^(−θ‖xa − xb‖²) is no larger than 1 (see [14] for more details). [sent-107, score-0.752]

47 The above results clearly state that the generalization error bounds for SVM with both fixed kernels and learned kernels depend on the ratio between the margin γ and the radius R of the minimum enclosing ball of data. [sent-108, score-1.095]

48 Although some new results of the generalization bounds for kernel learning, like [15], give different types of dependencies on dφ , they also rely on the margin-and-radius ratio. [sent-109, score-0.413]

49 However, in kernel learning, the radius R changes drastically from one kernel to another. (An example is given in the supplemental materials: when we uniformly combine p basis kernels by kunif = Σ_{j=1}^p (1/p) kj, the squared radius becomes only 1/p of the squared radius of each basis kernel.) [sent-111, score-1.765]

50 As a result, we use the ratio between the margin γ and the radius R to measure how good a kernel is for kernel learning. [sent-114, score-0.967]

51 Given any kernel k, the squared radius of the minimum enclosing ball, denoted R²(k), can be obtained by R²(k) = min_{y,c} y, s.t. ‖φ(xi; k) − c‖² ≤ y for all i. [sent-115, score-0.668]

52 Its dual has the constraints Σ_i βi = 1, βi ≥ 0 (see (8)), which shows a property of R²(k): for any kernel k and any scalar a > 0, we have R²(ak) = aR²(k). [sent-120, score-0.376]
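
R²(k) can be computed with a short Frank-Wolfe loop on the dual in (8); the sketch below is one minimal way to do this with NumPy (the paper does not prescribe a particular MEB solver), and it also checks the scaling property R²(ak) = aR²(k).

```python
import numpy as np

def meb_radius_sq(K, n_iter=2000):
    """Squared MEB radius via Frank-Wolfe on the dual:
    max_beta  sum_i beta_i K_ii - beta' K beta,  s.t. beta >= 0, sum(beta) = 1."""
    d = np.diag(K)
    beta = np.full(K.shape[0], 1.0 / K.shape[0])
    for t in range(n_iter):
        i = int(np.argmax(d - 2.0 * K @ beta))   # best vertex of the simplex
        gamma = 2.0 / (t + 2.0)
        beta *= 1.0 - gamma
        beta[i] += gamma
    return float(d @ beta - beta @ K @ beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(meb_radius_sq(K), meb_radius_sq(3.0 * K) / 3.0)   # approximately equal
```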

53 3 Learning kernels with the radiuses: Considering the ratio between the margin and the radius of the MEB, we propose a new formulation (9): min_{k,w,b,ξi} ½ R²(k)‖w‖² + C Σ_i ξi, s.t. yi(⟨w, φ(xi; k)⟩ + b) ≥ 1 − ξi, ξi ≥ 0. [sent-121, score-0.621]

54 This optimization problem is called the radius-based kernel learning problem, referred to as RKL. [sent-124, score-0.479]
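
One convenient consequence of the formulation is that, after the substitution w' = R(k)·w, the RKL objective G(k) equals the ordinary SVM objective computed on the radius-normalized kernel k/R²(k). The sketch below evaluates G(k) this way and checks the scale invariance G(ak) = G(k); this is our reading of formulation (9), offered as an illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def meb_radius_sq(K, n_iter=2000):
    # compact Frank-Wolfe MEB solver (see the sketch above)
    d, beta = np.diag(K), np.full(K.shape[0], 1.0 / K.shape[0])
    for t in range(n_iter):
        i = int(np.argmax(d - 2.0 * K @ beta))
        g = 2.0 / (t + 2.0)
        beta *= 1.0 - g
        beta[i] += g
    return float(d @ beta - beta @ K @ beta)

def rkl_objective(K, y, C=1.0):
    """G(k): the SVM objective on the radius-normalized kernel K / R^2(k)."""
    Kn = K / meb_radius_sq(K)
    clf = SVC(kernel="precomputed", C=C).fit(Kn, y)
    sv, ay = clf.support_, clf.dual_coef_
    w2 = (ay @ Kn[np.ix_(sv, sv)] @ ay.T).item()
    xi = np.maximum(0.0, 1.0 - y * clf.decision_function(Kn))
    return 0.5 * w2 + C * xi.sum()

X, y = make_moons(n_samples=80, noise=0.3, random_state=0)
y = 2 * y - 1
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(rkl_objective(K, y), rkl_objective(10.0 * K, y))   # equal up to solver tolerance
```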

55 [1] also utilize the radius of MEB to tune kernel parameters for hard margin SVM. [sent-126, score-0.598]

56 To give a soft-margin version, they modify the kernel matrix as K̃(θ) = K(θ) + (1/C) I, resulting in a formulation equivalent to: min_{θ,w,b,ξi} ½ R²(k^(θ))‖w‖² + C R²(k^(θ)) Σ_i ξi², subject to the SVM constraints. [sent-128, score-0.605]

57 Besides, when we reduce the scaling of a kernel by multiplying it with a small scalar a and substitute w̃ = w/√a for w to keep the same ξi, the objective function always decreases (due to the decrease of R² in the empirical loss term), still leading to scaling problems. [sent-132, score-0.656]

58 [16] recently propose to learn a linear kernel combination, as defined in (2), by minimizing over θ, wj, b and ξi an objective that pairs the block terms ½ Σ_j ‖wj‖²/θj with an empirical loss term scaled by C and by the weighted radiuses θj R²(kj) of the basis kernels. [sent-134, score-0.367]

59 If we initially adjust the scalings of basis kernels to make each R(kj ) be equal to each other, then their formulation is equivalent to the margin-based formulation (3). [sent-139, score-0.822]

60 Different from the above formulations, our formulation (9) is invariant to scalings of kernels. [sent-140, score-0.384]

61 3.1 Invariance to scalings of kernels: Now we discuss the properties of formulation (9). [sent-142, score-0.384]

62 The functional G(k) defined in (12)-(13) provides a measure of the goodness of kernel functions, trading off the margin-and-radius ratio against the empirical loss. [sent-146, score-0.511]

63 For any kernel k and any scalar a > 0, equation G(ak) = G(k) holds. [sent-149, score-0.376]

64 For a parametric kernel form k^(θ), the RKL problem reduces to minimizing the function g(θ) = G(k^(θ)). [sent-158, score-0.426]

65 Here we temporarily focus on the linear combination case defined by (2), and use glinear (θ) to denote g(θ) in such case. [sent-159, score-0.358]

66 Due to the scaling invariance, for any θ and any a > 0, we have glinear (aθ) = glinear (θ). [sent-160, score-0.627]

67 This makes the problem of minimizing glinear(θ) invariant to the types of norm constraints on θ, as stated in the following. [sent-161, score-0.606]

68 On one hand, if θa is the global optimal solution of (a), then for any θ that satisfies θi ≥ 0 and N(θ) ∈ S, we have glinear((c/N(θa)) θa) = glinear(θa) ≤ g(θ). [sent-174, score-0.654]

69 On the other hand, for any θ (θi ≥ 0), glinear((c/N(θ)) θ) = glinear(θ) due to the scaling invariance. [sent-176, score-0.627]

70 If θb is the global optimal solution of (b), then for any θ (θi ≥ 0), as (c/N(θ)) θ satisfies the constraint of (b), we have glinear(θb) ≤ glinear((c/N(θ)) θ), giving glinear(θb) ≤ glinear(θ). [sent-177, score-1.245]

71 As the problems of minimizing glinear (θ) under different types of norm constraints on θ are all equivalent to the same problem without any norm constraint, they are equivalent to each other. [sent-179, score-0.742]

72 Based on the above proposition, we can also draw another conclusion: in the linear combination case, the minimization problem (12) is also invariant to the initial scalings of basis kernels (see below). [sent-180, score-0.761]

73 Let kj denote basis kernels, and aj > 0 be initial scaling coefficients of basis kernels. [sent-182, score-0.471]

74 By Proposition 2, problem (b) is equivalent to the one without any norm constraint: minimizing G(Σ_j θ̃j aj kj) w.r.t. θ̃. [sent-198, score-0.402]

75 Note that in Proposition 3, the optimal solutions of problems (a) and (b), which use different initial scalings of basis kernels, actually result in the same kernel combinations up to scaling. [sent-213, score-0.8]

76 As shown in the above three propositions, our proposed formulation not only completely addresses scaling and initialization problems, but also is not sensitive to the types of norm constraints used. [sent-214, score-0.464]

77 Given a parametric kernel form k (θ) , for any parameter θ, to obtain the value of the objective function g(θ) = G(k (θ) ) in (12), we need to solve the SVM-like problem in (13), which is a convex minimization problem and can be solved by its dual problem. [sent-217, score-0.444]

78 Unlike in other kernel learning approaches, here the optimization of the SVM dual problem relies on another optimal value function r2 (θ), making the RKL problem more challenging. [sent-227, score-0.429]

79 Second, the kernel matrix K(θ) must be continuously differentiable with respect to θ. [sent-252, score-0.403]

80 Both conditions can be met in the linear combination case when each basis kernel matrix is strictly positive definite, and can also be easily satisfied in nonlinear cases, like in [11, 12]. [sent-253, score-0.559]

81 For example, for the linear combination kernel Ki,j(θ) = Σ_m θm K^m_i,j, we have ∂Ki,j(θ)/∂θm = K^m_i,j. [sent-257, score-0.427]

82 For the Gaussian kernel Ki,j(θ) = e^(−θ‖xi − xj‖²), we have dKi,j(θ)/dθ = −Ki,j(θ)‖xi − xj‖². [sent-258, score-0.395]
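
These two derivative formulas translate directly into code; the function names below are ours.

```python
import numpy as np

def linear_combination_grad(basis_kernels, m):
    """dK(theta)/dtheta_m for K(theta) = sum_j theta_j K_j: the m-th basis matrix."""
    return basis_kernels[m]

def gaussian_kernel_and_grad(X, theta):
    """K_ij = exp(-theta*||x_i - x_j||^2) and dK_ij/dtheta = -K_ij*||x_i - x_j||^2."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-theta * sq)
    return K, -K * sq
```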

83 To compare with the most popular kernel learning algorithm, SimpleMKL [5], in the experiments we employ the linear combination kernel form with nonnegative combination coefficients, as defined in (2). [sent-260, score-0.912]

84 In addition, we also consider three types of norm constraints on the kernel parameters (combination coefficients): L1, L2, and no norm constraint. [sent-261, score-0.713]
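
For the gradient projection algorithm, each of these constraint sets admits a cheap Euclidean projection. The routines below (a sort-based simplex projection in the style of Duchi et al., clipping plus rescaling for the L2 ball, and plain clipping) are standard constructions supplied for illustration; the paper does not spell out its projection code.

```python
import numpy as np

def project_simplex(v):
    """Projection onto {theta >= 0, sum(theta) = 1} (L1 constraint)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / j > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def project_l2_nonneg(v):
    """Projection onto {theta >= 0, ||theta||_2 <= 1} (L2 constraint)."""
    w = np.maximum(v, 0.0)
    n = np.linalg.norm(w)
    return w / n if n > 1.0 else w

def project_nonneg(v):
    """Projection onto {theta >= 0} (no norm constraint)."""
    return np.maximum(v, 0.0)
```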

85 This is because the optimal solutions of the two problems are continuous in the kernel parameter θ, according to Theorem 1. [sent-270, score-0.444]
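
Putting the pieces together, the overall RKL solver is a projected gradient loop over θ. The paper differentiates g(θ) analytically through the SVM and MEB dual solutions (which is where the continuity and differentiability results above are needed); the sketch below substitutes a finite-difference gradient so the loop stays short and self-evidently correct, and is meant only to show the structure of the algorithm, not the authors' exact update.

```python
import numpy as np

def rkl_gradient_projection(basis_kernels, y, objective, project,
                            step=0.5, n_iter=50, eps=1e-3):
    """Projected-gradient sketch for minimizing g(theta) = G(K(theta)).
    `objective(K, y)` evaluates G for a kernel matrix (e.g. rkl_objective above)
    and `project` is one of the projections above; finite differences stand in
    for the paper's analytic (Danskin-style) gradient."""
    m = len(basis_kernels)
    theta = project(np.full(m, 1.0 / m))
    combine = lambda th: sum(t * Kj for t, Kj in zip(th, basis_kernels))
    for _ in range(n_iter):
        g0 = objective(combine(theta), y)
        grad = np.empty(m)
        for j in range(m):
            e = np.zeros(m)
            e[j] = eps
            grad[j] = (objective(combine(theta + e), y) - g0) / eps
        theta = project(theta - step * grad)
    return theta

# usage sketch: rkl_gradient_projection(basis, y, rkl_objective, project_simplex)
```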

86 In linear combination cases, for any local optimal solution of the RKL problem, denoted by θ*, there exist C1 > 0 and C2 > 0 such that θ* is the global optimal solution of the following convex problem: min_{θ,wj,b,ξi} ½ Σ_j ‖wj‖² + C1 r²(θ) + C2 Σ_i ξi, subject to the corresponding SVM constraints. [sent-278, score-0.356]

87 Besides, such a method also lacks the ability to extend to nonlinear parametric kernel forms. [sent-284, score-0.366]

88 Then, in the experiments, we demonstrate that the gradient-based approach gives satisfactory performance, significantly better than that of SVM with the uniform combination of basis kernels and of other kernel learning approaches. [sent-285, score-0.839]

89 6 Experiments In this section, we illustrate the performances of our presented RKL approach, in comparison with SVM with the uniform combination of basis kernels (Unif), the margin-based MKL method using formulation (3) (MKL), and the kernel learning principle by Chapelle et al. [sent-286, score-0.936]

90 The used basis kernels are the same as in SimpleMKL [5]: 10 Gaussian kernels with bandwidths γG ∈ {0. [sent-290, score-0.664]

91 Note that although our RKL formulation is theoretically invariant to the initial scalings, the normalization is still applied in RKL to avoid numerical problems caused by large value kernel matrices in SVM and MEB solvers. [sent-293, score-0.527]
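
A sketch of building and normalizing such a basis kernel bank is given below. The bandwidth values are placeholders (the exact list is truncated in the extracted text above), and unit-trace rescaling is just one common normalization convention; treat both as assumptions rather than the paper's exact setup.

```python
import numpy as np

def gaussian_kernel(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def build_basis_kernels(X, gammas=(0.5, 1.0, 2.0, 5.0, 7.0, 10.0, 12.0, 15.0, 17.0, 20.0)):
    """Bank of Gaussian basis kernels, each rescaled to unit trace so that the
    SVM and MEB solvers do not see kernel matrices with very large entries."""
    kernels = []
    for g in gammas:
        K = gaussian_kernel(X, g)
        kernels.append(K / np.trace(K))
    return kernels
```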

92 The average accuracies with standard deviations and average numbers of selected basis kernels are reported in Table 1. [sent-306, score-0.412]

93 ) with standard deviations (in parentheses), and the average numbers of selected basis kernels (Nk). [sent-308, score-0.386]

94 We set the numbers of our method in bold if it outperforms both Unif and the other two kernel learning approaches under the same norm constraint. [sent-309, score-0.496]

95 (d) For MKL, the L1 norm constraint always results in sparse combinations, whereas the L2 norm constraint always gives non-sparse results (see Index 2 and 5). [sent-552, score-0.372]

96 As there usually exist redundancies in the basis kernels, the search for good kernels and a small empirical loss often directly leads to sparse solutions. [sent-554, score-0.434]

97 7 Conclusion In this paper, we show that the margin term used in previous MKL formulations is not a suitable measure of the goodness of kernels, resulting in scaling and initialization problems. [sent-557, score-0.419]

98 We prove that our formulation is invariant to kernel scalings, and also invariant to scalings of basis kernels and to the types of norm constraints for the MKL problem. [sent-559, score-1.417]

99 The experiments validate that our approach outperforms both SVM with the uniform combination of basis kernels and other state-of-the-art kernel learning methods. [sent-562, score-0.835]

100 Exploring large feature spaces with hierarchical multiple kernel learning. [sent-651, score-0.361]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mkl', 0.381), ('rkl', 0.378), ('kernel', 0.339), ('kernels', 0.278), ('glinear', 0.27), ('meb', 0.252), ('scalings', 0.218), ('svm', 0.155), ('radius', 0.14), ('norm', 0.135), ('enclosing', 0.131), ('kj', 0.121), ('margin', 0.119), ('goodness', 0.118), ('differentiability', 0.111), ('basis', 0.108), ('multilevel', 0.102), ('formulation', 0.097), ('combination', 0.088), ('scaling', 0.087), ('chapelle', 0.079), ('invariant', 0.069), ('nk', 0.069), ('ak', 0.066), ('constraints', 0.056), ('ball', 0.055), ('radiuses', 0.054), ('endowed', 0.054), ('proposition', 0.053), ('supplemental', 0.052), ('constraint', 0.051), ('dual', 0.05), ('types', 0.048), ('simplemkl', 0.047), ('aj', 0.047), ('yi', 0.047), ('global', 0.043), ('solutions', 0.043), ('initialization', 0.041), ('continuously', 0.04), ('index', 0.04), ('optimal', 0.04), ('unif', 0.039), ('learned', 0.037), ('scalar', 0.037), ('nonnegative', 0.036), ('splice', 0.036), ('calculation', 0.035), ('libsvm', 0.034), ('er', 0.034), ('danskin', 0.032), ('mink', 0.032), ('transfers', 0.032), ('percents', 0.032), ('projection', 0.032), ('denoted', 0.031), ('solution', 0.031), ('multiplying', 0.03), ('suitable', 0.03), ('combinations', 0.03), ('ratio', 0.03), ('lanckriet', 0.03), ('tsinghua', 0.029), ('reliably', 0.029), ('wj', 0.028), ('classi', 0.028), ('objective', 0.028), ('xi', 0.028), ('minimizing', 0.028), ('minimum', 0.027), ('sonnenburg', 0.027), ('xc', 0.027), ('reformulated', 0.027), ('parametric', 0.027), ('give', 0.026), ('principle', 0.026), ('mohri', 0.026), ('accuracies', 0.026), ('cients', 0.025), ('coef', 0.025), ('enlarged', 0.025), ('local', 0.024), ('differentiable', 0.024), ('formulations', 0.024), ('equivalent', 0.024), ('family', 0.024), ('loss', 0.024), ('acceptable', 0.024), ('met', 0.024), ('cortes', 0.024), ('liver', 0.024), ('empirical', 0.024), ('minimizer', 0.023), ('materials', 0.023), ('outperforms', 0.022), ('problems', 0.022), ('multiple', 0.022), ('china', 0.022), ('causes', 0.022), ('employ', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls

Author: Kun Gai, Guangyun Chen, Chang-shui Zhang

Abstract: In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2 ) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches. 1

2 0.39646649 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm

Author: Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, S.v.n. Vishwanathan

Abstract: Our objective is to train p-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. The SMO algorithm is simple, easy to implement and adapt, and efficiently scales to large problems. As a result, it has gained widespread acceptance and SVMs are routinely trained using SMO in diverse real world applications. Training using SMO has been a long standing goal in MKL for the very same reasons. Unfortunately, the standard MKL dual is not differentiable, and therefore can not be optimised using SMO style co-ordinate ascent. In this paper, we demonstrate that linear MKL regularised with the p-norm squared, or with certain Bregman divergences, can indeed be trained using SMO. The resulting algorithm retains both simplicity and efficiency and is significantly faster than state-of-the-art specialised p-norm MKL solvers. We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core. 1

3 0.37125409 174 nips-2010-Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition

Author: Serhat Bucak, Rong Jin, Anil K. Jain

Abstract: Recent studies have shown that multiple kernel learning is very effective for object recognition, leading to the popularity of kernel learning in computer vision problems. In this work, we develop an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes, making them unscalable when the number of classes is large, a challenge frequently encountered in visual object recognition. We address this computational challenge by developing a framework for ML-MKL that combines the worst-case analysis with stochastic approximation. Our analysis shows that the complexity of our algorithm is O(m^(1/3)√(ln m)), where m is the number of classes. Empirical studies with object recognition show that while achieving similar classification accuracy, the proposed method is significantly more efficient than the state-of-the-art algorithms for ML-MKL. 1

4 0.22202557 133 nips-2010-Kernel Descriptors for Visual Recognition

Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox

Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.

5 0.20731407 279 nips-2010-Universal Kernels on Non-Standard Input Spaces

Author: Andreas Christmann, Ingo Steinwart

Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1

6 0.1941916 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions

7 0.17710802 124 nips-2010-Inductive Regularized Learning of Kernel Functions

8 0.12114185 250 nips-2010-Spectral Regularization for Support Estimation

9 0.117366 280 nips-2010-Unsupervised Kernel Dimension Reduction

10 0.085635446 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

11 0.079047203 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis

12 0.07568524 96 nips-2010-Fractionally Predictive Spiking Neurons

13 0.071034834 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

14 0.069615431 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio

15 0.069161825 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

16 0.065204002 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

17 0.062725075 243 nips-2010-Smoothness, Low Noise and Fast Rates

18 0.06200562 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

19 0.061382357 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

20 0.061262935 233 nips-2010-Scrambled Objects for Least-Squares Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.201), (1, 0.081), (2, 0.083), (3, -0.096), (4, 0.335), (5, 0.081), (6, 0.405), (7, -0.027), (8, 0.162), (9, 0.057), (10, -0.099), (11, 0.03), (12, -0.065), (13, 0.075), (14, -0.015), (15, 0.048), (16, -0.186), (17, 0.098), (18, -0.047), (19, 0.008), (20, 0.04), (21, 0.058), (22, 0.094), (23, -0.006), (24, -0.036), (25, -0.024), (26, -0.019), (27, 0.037), (28, 0.124), (29, 0.065), (30, 0.055), (31, 0.024), (32, -0.0), (33, -0.038), (34, -0.011), (35, -0.05), (36, 0.09), (37, -0.037), (38, -0.043), (39, 0.047), (40, -0.035), (41, -0.021), (42, -0.011), (43, -0.025), (44, 0.0), (45, 0.025), (46, -0.019), (47, 0.031), (48, -0.033), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95567036 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls

Author: Kun Gai, Guangyun Chen, Chang-shui Zhang

Abstract: In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2 ) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches. 1

2 0.95500994 176 nips-2010-Multiple Kernel Learning and the SMO Algorithm

Author: Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, S.v.n. Vishwanathan

Abstract: Our objective is to train p-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. The SMO algorithm is simple, easy to implement and adapt, and efficiently scales to large problems. As a result, it has gained widespread acceptance and SVMs are routinely trained using SMO in diverse real world applications. Training using SMO has been a long standing goal in MKL for the very same reasons. Unfortunately, the standard MKL dual is not differentiable, and therefore can not be optimised using SMO style co-ordinate ascent. In this paper, we demonstrate that linear MKL regularised with the p-norm squared, or with certain Bregman divergences, can indeed be trained using SMO. The resulting algorithm retains both simplicity and efficiency and is significantly faster than state-of-the-art specialised p-norm MKL solvers. We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core. 1

3 0.8836177 174 nips-2010-Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition

Author: Serhat Bucak, Rong Jin, Anil K. Jain

Abstract: Recent studies have shown that multiple kernel learning is very effective for object recognition, leading to the popularity of kernel learning in computer vision problems. In this work, we develop an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes, making them unscalable when the number of classes is large, a challenge frequently encountered in visual object recognition. We address this computational challenge by developing a framework for ML-MKL that combines the worst-case analysis with stochastic approximation. Our analysis shows that the complexity of our algorithm is O(m^(1/3)√(ln m)), where m is the number of classes. Empirical studies with object recognition show that while achieving similar classification accuracy, the proposed method is significantly more efficient than the state-of-the-art algorithms for ML-MKL. 1

4 0.81256914 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions

Author: Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-tal

Abstract: In this paper we consider the problem of learning an n × n kernel matrix from m(≥ 1) similarity matrices under general convex loss. Past research have extensively studied the m = 1 case and have derived several algorithms which require sophisticated techniques like ACCP, SOCP, etc. The existing algorithms do not apply if one uses arbitrary losses and often can not handle m > 1 case. We present several provably convergent iterative algorithms, where each iteration requires either an SVM or a Multiple Kernel Learning (MKL) solver for m > 1 case. One of the major contributions of the paper is to extend the well known Mirror Descent(MD) framework to handle Cartesian product of psd matrices. This novel extension leads to an algorithm, called EMKL, which solves the problem in 2 O( m ǫlog n ) iterations; in each iteration one solves an MKL involving m kernels 2 and m eigen-decomposition of n × n matrices. By suitably defining a restriction on the objective function, a faster version of EMKL is proposed, called REKL, which avoids the eigen-decomposition. An alternative to both EMKL and REKL is also suggested which requires only an SVM solver. Experimental results on real world protein data set involving several similarity matrices illustrate the efficacy of the proposed algorithms.

5 0.72834712 133 nips-2010-Kernel Descriptors for Visual Recognition

Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox

Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.

6 0.71845907 124 nips-2010-Inductive Regularized Learning of Kernel Functions

7 0.65608388 279 nips-2010-Universal Kernels on Non-Standard Input Spaces

8 0.58409286 280 nips-2010-Unsupervised Kernel Dimension Reduction

9 0.53745341 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

10 0.53320646 250 nips-2010-Spectral Regularization for Support Estimation

11 0.44085601 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

12 0.37238368 96 nips-2010-Fractionally Predictive Spiking Neurons

13 0.35995257 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis

14 0.32072672 233 nips-2010-Scrambled Objects for Least-Squares Regression

15 0.30194533 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

16 0.28479743 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction

17 0.27924731 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage

18 0.27733254 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification

19 0.27609766 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable

20 0.2750065 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.042), (17, 0.015), (27, 0.054), (30, 0.062), (35, 0.044), (45, 0.21), (50, 0.039), (52, 0.05), (55, 0.218), (60, 0.028), (65, 0.011), (77, 0.063), (78, 0.022), (90, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82133883 269 nips-2010-Throttling Poisson Processes

Author: Uwe Dick, Peter Haider, Thomas Vanck, Michael Brückner, Tobias Scheffer

Abstract: We study a setting in which Poisson processes generate sequences of decisionmaking events. The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. This problem setting matches applications in which damage caused by an attacker grows as a function of the rate of unsuppressed hostile events. We report on experiments on abuse detection for an email service. 1

same-paper 2 0.81258571 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls

Author: Kun Gai, Guangyun Chen, Chang-shui Zhang

Abstract: In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2 ) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches. 1

3 0.74391747 117 nips-2010-Identifying graph-structured activation patterns in networks

Author: James Sharpnack, Aarti Singh

Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1

4 0.7419467 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.

5 0.74153888 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: Many statistical M -estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension d to grow with (and possibly exceed) the sample size n. This high-dimensional structure precludes the usual global assumptions— namely, strong convexity and smoothness conditions—that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov’s first-order method [12] has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter θ∗ and the optimal solution θ. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of M -estimators and statistical models, including sparse linear regression using Lasso ( 1 regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation. 1

6 0.73879915 148 nips-2010-Learning Networks of Stochastic Differential Equations

7 0.7383467 265 nips-2010-The LASSO risk: asymptotic results and real world examples

8 0.7381658 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

9 0.73780972 282 nips-2010-Variable margin losses for classifier design

10 0.73779374 63 nips-2010-Distributed Dual Averaging In Networks

11 0.73621356 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

12 0.73475879 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups

13 0.73468244 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework

14 0.73450774 290 nips-2010-t-logistic regression

15 0.73298663 5 nips-2010-A Dirty Model for Multi-task Learning

16 0.7326358 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

17 0.73256797 155 nips-2010-Learning the context of a category

18 0.73251671 243 nips-2010-Smoothness, Low Noise and Fast Rates

19 0.73238128 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

20 0.73156148 158 nips-2010-Learning via Gaussian Herding