jmlr jmlr2008 jmlr2008-11 knowledge-graph by maker-knowledge-mining

11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces


Source: pdf

Author: Sébastien Loustau

Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. [sent-6, score-0.647]

2 Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers 1. [sent-7, score-0.21]

3 Under this condition, Tsybakov (2004) gets minimax fast rates of convergence for classification with ERM estimators over a class F with controlled complexity (in terms of entropy). [sent-38, score-0.137]

4 These rates depend on two parameters: the margin parameter and the complexity of the class of candidates for f∗ (see also Massart and Nédélec, 2006). [sent-39, score-0.194]

5 The space HK is an RKHS with reproducing kernel K. [sent-58, score-0.14]

6 Recall that every positive definite kernel has an essentially unique RKHS (Aronszajn, 1950). [sent-60, score-0.107]

7 For a survey on this kernel method we refer to Cristianini and Shawe-Taylor (2000). [sent-68, score-0.107]

8 However, its good practical performances are not yet completely understood. [sent-70, score-0.161]

9 Wu and Zhou (2006) state slow rates (logarithmic in the sample size) for SVM using a Gaussian kernel with fixed width. [sent-77, score-0.206]

10 Steinwart and Scovel (2007) give, under a margin assumption, fast rates for SVM using a decreasing width (which depends on the sample size). [sent-79, score-0.194]

11 The goal of this work is to clarify both practical and theoretical performances of the algorithm using two different classes of kernels. [sent-82, score-0.161]

12 In a first theoretical part, we consider a family of kernels generating Sobolev spaces as RKHS. [sent-83, score-0.195]

13 Then under the margin assumption, we give learning rates of convergence for SVM using Sobolev spaces. [sent-87, score-0.194]

14 This choice strongly depends on the regularity assumption on the Bayes rule and on the margin assumption. [sent-89, score-0.163]

15 It uses a method called aggregation with exponential weights. [sent-93, score-0.21]

16 Finally, we show practical performances of this aggregate and compare it with a similar classifier using Gaussian kernels and results of Steinwart and Scovel (2007). [sent-94, score-0.458]

17 In Section 2, we give statistical performances of SVM using Sobolev spaces. [sent-96, score-0.161]

18 Section 3 presents the adaptive procedure of aggregation and shows the performances of the data-dependent aggregate. [sent-97, score-0.371]

19 For the estimation error, we will state an oracle-type inequality of the form: $\mathbb{E}R_l(\hat{f}_n, f^*) \le C \inf_{f \in H_K}\big( R_l(f, f^*) + \alpha_n \|f\|_K^2 \big) + \varepsilon_n$ (5), where $R_l(f, f^*) := \mathbb{E}_P\, l(Y, f(X)) - \mathbb{E}_P\, l(Y, f^*(X))$ is the excess l-risk of f. [sent-110, score-0.118]

20 Using this approach, Steinwart and Scovel (2007) study the statistical performances of SVM minimization (4) with the parametric family of Gaussian kernels. [sent-118, score-0.231]

21 For σ ∈ R, we define the Gaussian kernel $K_\sigma(x, y) = \exp(-\sigma^2 \|x - y\|^2)$ on the closed unit ball of $\mathbb{R}^d$ (denoted X). [sent-119, score-0.142]

22 In that paper, under a margin assumption and a geometric assumption over the distribution, they state fast learning rates for SVM. [sent-121, score-0.25]

23 These rates hold under some specific choices of tuning parameters recalled in Sect. [sent-122, score-0.142]

24 Following Lecué (2007a), we will use this result, and more precisely these choices of tuning parameters, to implement the aggregate using Gaussian kernels. [sent-124, score-0.251]

25 For r ∈ R+, a kernel $K_r$ will be called a Sobolev smooth kernel with exponent r > d if the associated RKHS $H_{K_r}$ is such that $H_{K_r} = W_2^{r/2}$, where $W_2^{r/2}$ is defined in (7). [sent-141, score-0.337]

26 We say that a kernel K is a translation invariant kernel (or RBF kernel) if, for all $x, y \in \mathbb{R}^d$, $K(x, y) = \Phi(x - y)$ (8) for a given $\Phi : \mathbb{R}^d \to \mathbb{C}$. [sent-144, score-0.303]

27 The most popular example of a translation invariant kernel is the Gaussian kernel $K_\sigma(x, y) = \exp(-\sigma^2 \|x - y\|^2)$. [sent-146, score-0.303]
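
The Gram matrix of such a kernel is what the SVM optimization actually consumes. Below is a minimal NumPy sketch of this Gaussian kernel as a Gram-matrix computation; the function name, bandwidth value and toy data are illustrative and not taken from the paper.

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    """Gram matrix of the translation invariant Gaussian kernel
    K_sigma(x, y) = exp(-sigma^2 * ||x - y||^2)."""
    # Pairwise squared Euclidean distances between rows of X and rows of Y.
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sigma**2 * np.maximum(sq_dists, 0.0))

# Example: 5 points in R^2; the Gram matrix is symmetric with unit diagonal.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = gaussian_gram(X, X, sigma=1.0)
print(K.shape, np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))
```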

28 This kernel is not a Sobolev smooth kernel (see below). [sent-147, score-0.297]

29 Under suitable assumptions on Φ, the following theorem gives a Fourier representation of an RKHS associated with a translation invariant kernel. [sent-148, score-0.162]

30 Theorem 1 Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{C}$ be a translation invariant kernel where, in (8), Φ belongs to $L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ and its Fourier transform $\hat{\Phi}$ is integrable. [sent-150, score-0.196]

31 Sufficient conditions to have a Sobolev smooth kernel are given next. Corollary 2 Let K satisfy the assumptions of Theorem 1. [sent-152, score-0.19]

32 Suppose that $\hat{\Phi}(\omega)$ is of the order of $(c + \|\omega\|^2)^{-s}$, as in condition (9). Then K is a Sobolev smooth kernel with exponent r = 2s > d. [sent-154, score-0.23]

33 In Section 5 we propose an example of a Sobolev smooth kernel and use it in the SVM procedure. [sent-155, score-0.19]

34 Remark 3 (Gaussian kernels are not Sobolev smooth) Theorem 1 can be used to define Gaussian kernels in terms of their Fourier transform. [sent-156, score-0.178]

35 Indeed, the Gaussian kernel defined above is a translation invariant kernel with RB function $\Phi(x) = \exp(-\sigma^2 \|x\|^2)$. [sent-157, score-0.303]

36 Theorem 4 Consider the approximation function a(αn ) defined in (6), with Sobolev smooth kernel Kr such that r > 2s > 0. [sent-174, score-0.19]

37 Under a geometric assumption over the distribution, they get $a(\alpha_n) \le C\,\alpha_n^{\gamma/(\gamma+1)}$, where γ is the geometric noise exponent. [sent-187, score-0.112]

38 Consider the SVM minimization (4) with Sobolev smooth kernel $K_r$, with r > 2s ∨ d, built on the i.i.d. sample. [sent-212, score-0.219]

39 In particular, for q = +∞, it corresponds to $s > \frac{rd}{r+d}$. [sent-224, score-0.409]

40 The presence of fast rates depends on the regularity of the Bayes classifier. [sent-225, score-0.167]

41 Remark 9 (Comparison with Steinwart and Scovel, 2007) This theorem gives performances of SVM using a fixed kernel. [sent-228, score-0.234]

42 Nevertheless, rates of convergence are fast for a sufficiently large geometric noise parameter. [sent-230, score-0.155]

43 If we consider a margin parameter q = +∞, we hence cannot reach the rate of convergence $n^{-\frac{r}{3r-1}}$, which corresponds to a regularity s = 1/2 in the Besov space. [sent-245, score-0.163]

44 Then the SVM using the Sobolev smooth kernel $H_{K_r}$ with r > 1 cannot learn with a fast rate in this simple case. [sent-246, score-0.19]

45 It holds under two ad-hoc assumptions: a margin assumption over the distribution and a regularity assumption over the Bayes rule. [sent-249, score-0.163]

46 Hence the choice of the smoothing parameter depends on two unknown parameters: the margin parameter q and the exponent s in the Besov space. [sent-250, score-0.135]

47 We use a technique called aggregation (Nemirovski, 1998; Yang, 2000). [sent-257, score-0.21]

48 The procedure of aggregation uses the second subsample $D_{n_2}^2$ to construct a convex combination with exponential weights. [sent-273, score-0.25]

49 Namely, the aggregate $\tilde{f}_n$ is defined by $\tilde{f}_n = \sum_{\alpha \in G(n_2)} \omega_\alpha^{(n)} \hat{f}_{n_1}^{\alpha}$, where $\omega_\alpha^{(n)} = \exp\big(\sum_{i=n_1+1}^{n} Y_i \hat{f}_{n_1}^{\alpha}(X_i)\big) \big/ \sum_{\alpha' \in G(n_2)} \exp\big(\sum_{i=n_1+1}^{n} Y_i \hat{f}_{n_1}^{\alpha'}(X_i)\big)$. [sent-274, score-0.243]
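
A minimal sketch of this exponential-weights step, assuming each candidate $\hat{f}_{n_1}^{\alpha}$ has already been evaluated on the second subsample and that labels take values in {−1, +1}; the array layout, names and toy data are illustrative.

```python
import numpy as np

def exponential_weights(F2, y2):
    """F2: array of shape (M, n2) with f_hat^alpha(X_i) for each candidate alpha
    evaluated on the second subsample; y2: labels in {-1, +1} of shape (n2,).
    Returns the convex weights omega_alpha^{(n)} of the formula above."""
    scores = F2 @ y2                   # cumulative margins sum_i Y_i f_hat^alpha(X_i)
    w = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return w / w.sum()

def aggregate_sign(Fnew, weights):
    """Classify new points with the sign of the weighted combination."""
    return np.sign(weights @ Fnew)

# Toy usage with random stand-ins for the SVM decision values.
rng = np.random.default_rng(1)
F2 = rng.normal(size=(8, 50))          # 8 candidates on n2 = 50 held-out points
y2 = rng.choice([-1, 1], size=50)
omega = exponential_weights(F2, y2)
print(omega.sum())                      # 1.0: the weights form a convex combination
```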

50 Suppose the margin assumption (10) holds with parameter q and that the marginal distribution satisfies $\frac{dP_X}{dx} < C_0$. Remark 13 Same rates as in Theorem 7 are attained. [sent-279, score-0.147]

51 In Section 5 we sum up practical performances of this aggregate. [sent-281, score-0.161]

52 Lecué (2007b) states an oracle inequality such as (22) without any restriction on the number of estimators to aggregate. [sent-283, score-0.114]

53 Practical Experiments: We now propose experiments illustrating the performances of the aggregate of Section 3. [sent-293, score-0.369]

54 With Corollary 2, for any fixed σ, the Laplacian kernel defined in (15) is a Sobolev smooth kernel with exponent r = d + 1. [sent-321, score-0.337]

55 To avoid this problem, in our aggregation step using this class of kernels we choose a constant σ = 5. [sent-326, score-0.299]

56 In the sequel the Laplacian kernel used is precisely $K(x, y) = \exp(-5\,\|x - y\|)$. [sent-327, score-0.143]

57 For each realization of the training set, we use the previous section to build: • the set of classifiers $(\hat{f}_{n_1}^{\alpha})$ for α belonging to $G(n_2)$; • the exponential weights $\omega_\alpha^{(n)}$ to deduce the aggregate $\tilde{f}_n$ (a sketch of this pipeline is given below). [sent-329, score-0.245]
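
The following Python sketch mirrors that two-step pipeline with scikit-learn SVMs and a precomputed Laplacian Gram matrix (σ = 5, Sobolev smooth with exponent r = d + 1 as above). The mapping from the smoothing parameter α to scikit-learn's C, the grid of α values and the toy data are assumptions for illustration, not the choices used in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVC

def laplacian_gram(X, Y, sigma=5.0):
    # K(x, y) = exp(-sigma * ||x - y||): Sobolev smooth kernel with exponent r = d + 1.
    return np.exp(-sigma * cdist(X, Y))

def fit_candidates_and_aggregate(X1, y1, X2, y2, alphas, sigma=5.0):
    """Train one SVM per smoothing parameter alpha on the first subsample,
    then compute exponential weights on the second subsample."""
    K11 = laplacian_gram(X1, X1, sigma)   # Gram matrix on the first subsample
    K21 = laplacian_gram(X2, X1, sigma)   # cross Gram matrix (second vs first)
    n1 = len(y1)
    models, scores = [], []
    for alpha in alphas:
        # Rough correspondence alpha <-> C = 1/(2 * n1 * alpha) (an assumption).
        clf = SVC(C=1.0 / (2.0 * n1 * alpha), kernel="precomputed")
        clf.fit(K11, y1)
        models.append(clf)
        # Cumulative margin sum_i Y_i * f_hat^alpha(X_i) on the held-out subsample.
        scores.append(np.sum(y2 * clf.decision_function(K21)))
    scores = np.array(scores)
    weights = np.exp(scores - scores.max())   # exponential weights, stabilized
    weights /= weights.sum()
    return models, weights

def aggregate_decision(models, weights, Xnew, X1, sigma=5.0):
    """Sign of the convex combination of the candidates' decision functions."""
    Knew = laplacian_gram(Xnew, X1, sigma)
    combined = sum(w * m.decision_function(Knew) for w, m in zip(weights, models))
    return np.sign(combined)

# Toy run on synthetic data; the alpha grid is purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)
X1, y1, X2, y2 = X[:100], y[:100], X[100:], y[100:]
models, w = fit_candidates_and_aggregate(X1, y1, X2, y2, alphas=np.logspace(-4, 0, 10))
print(aggregate_decision(models, w, X[:5], X1))
```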

58 Note that increasing b does not significantly improve the performances, whereas it adds computing time. [sent-338, score-0.161]

59 We report, in order, the performances of the worst estimator, the mean over the family and the best over the family. [sent-345, score-0.202]

60 Then the performances of the aggregate using exponential weights are given in the last column. [sent-347, score-0.369]

61 Table 2: Performances using the Laplacian kernel. Note that the spread within the family is not very large. [sent-420, score-0.148]

62 The test errors of the aggregate are located between the average over the family and the oracle of the family. [sent-424, score-0.325]

63 SVM Using Gaussian Kernels: Here we focus on the parametric class of Gaussian kernels $K_\sigma(x, y) = \exp(-\sigma^2 \|x - y\|^2)$, for σ ∈ R. [sent-430, score-0.124]

64 We build an aggregate made of a convex combination of Gaussian SVM classifiers. [sent-431, score-0.208]

65 According to Steinwart and Scovel (2007), suppose that the probability distribution P has a geometric noise exponent γ > 0 and that assumption (10) holds with margin parameter q > 0. [sent-437, score-0.151]

66 As a result, the parameter σ must be considered in the aggregation procedure, in the same way as the smoothing parameter α. [sent-441, score-0.21]

67 Thus we have more classifiers to aggregate and the procedure needs more time to run. [sent-449, score-0.208]
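
To make the extra cost concrete, here is a small sketch of how the candidate set grows once σ is added to the grid alongside α; the grid values are placeholders, not the Steinwart and Scovel (2007) choices used in the paper.

```python
import numpy as np
from itertools import product

# Placeholder grids: the paper derives the admissible (sigma, alpha) pairs from
# the tuning choices of Steinwart and Scovel (2007); these values are illustrative.
sigmas = np.logspace(-1, 2, 8)      # Gaussian widths sigma
alphas = np.logspace(-5, -1, 10)    # smoothing parameters alpha

# Each pair gives one candidate SVM; all of them are fed to the same
# exponential-weights aggregation step as in the Sobolev/Laplacian case.
candidates = list(product(sigmas, alphas))
print(len(candidates))  # 80 classifiers instead of 10 in the one-parameter family
```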

68 As in the Sobolev case, the number of classifiers to aggregate is mentioned in Table 3 for each data set. [sent-452, score-0.208]

69 Table 3 reports the generalization performances of the classifiers over the test samples. [sent-453, score-0.161]

70 We first give the performances of the family of Gaussian SVM (namely the worst, the mean and the oracle over the family). [sent-454, score-0.278]

71 The performances of the aggregate using exponential weights are given in the last column. [sent-455, score-0.369]

72 Table 3: Performances using Gaussian kernels. In this case the generalization errors within the family are more disparate. [sent-528, score-0.13]

73 The performances of the Gaussian aggregate, as above, lie between the average of the weak estimators and the best among the family. [sent-530, score-0.199]

74 Table 4 combines the performances of the aggregates using the Laplacian kernel and Gaussian kernels. [sent-533, score-0.35]

75 Gaussian kernels and the Laplacian kernel lead to similar performances. [sent-535, score-0.196]

76 We have tackled several important questions such as its statistical performances, the role of the kernel and the choice of the tuning parameters. [sent-615, score-0.15]

77 The first part of the paper focuses on the statistical performances of the method. [sent-616, score-0.161]

78 In this study, we consider Sobolev smooth kernels as an alternative to the Gaussian kernels. [sent-617, score-0.172]

79 Explicit rates of convergence have been given depending on the margin and the regularity (Theorem 7). [sent-619, score-0.262]

80 The aggregation method appeared suitable in this context to construct directly from the data a competitive decision rule: it has the same statistical performances as the non-adaptive classifier (Theorem 12). [sent-622, score-0.371]

81 For completeness, we have finally implemented the method and reported practical performances on real benchmark data sets. [sent-624, score-0.161]

82 However, it shows similar performances for SVM using Gaussian or non-Gaussian kernels. [sent-626, score-0.161]

83 Proof of Theorem 1 and Corollary 2: We consider a translation invariant kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{C}$ with RB function Φ satisfying the assumptions of Theorem 1. [sent-631, score-0.196]

84 Lemma 16 For any y ∈ Rd , consider the function ky : x → K(x, y) defined in Rd . [sent-633, score-0.138]

85 Gathering with the inverse Fourier transform of $\hat{g}_y \in L^2(\mathbb{R}^d)$, we have $\hat{k}_y(\omega) = \hat{g}_y(\omega) = e^{-i\omega\cdot y}\,\hat{\Phi}(\omega)$. [sent-651, score-0.264]

86 For a given $y \in \mathbb{R}^d$, from Lemma 16 it is clear that $\hat{k}_y(\omega) = 0$ for $\omega \in \mathbb{R}^d \setminus S$. [sent-655, score-0.138]

87 Then K is a Sobolev smooth kernel with exponent r = 2s. [sent-674, score-0.23]

88 Here we are interested in the case q = ∞ and the following geometric explanation of interpolation space (Smale and Zhou, 2003, Theorem 3. [sent-684, score-0.136]

89 Proof By the Lipschitz property of the hinge loss, since $\frac{dP_X}{dx} \le C_0$ we clearly have $a(\alpha_n) \le \inf_{f \in H_K}\big( \|f - f^*\|_{L^1(P_X)} + \alpha_n \|f\|_K^2 \big) \le \inf_{R>0}\big( C_0 \inf_{f \in B_{H_K}(R)} \|f - f^*\|_{L^2(\mathbb{R}^d)} + \alpha_n R^2 \big). [sent-691, score-0.276]

90 This is a large class of functional spaces, including in particular the Sobolev spaces defined in (7) ($W_2^s = B_{2,2}^s(\mathbb{R}^d)$ for any s > 0) and the Hölder spaces ($C^s = B_{\infty,\infty}^s(\mathbb{R}^d)$ for any s > 0). [sent-697, score-0.13]

91 Therefore, to control the excess risk of a classifier, it is sufficient to control the right-hand side of (19). [sent-731, score-0.151]

92 The method of aggregation consists in building a new classifier $\tilde{f}_n$ from $D_n$, called the aggregate, which mimics the best among $f_1, \ldots, f_M$. [sent-749, score-0.418]

93 Under the margin assumption (10), we have this oracle inequality: Theorem 20 (Lecué, 2005) Suppose (10) holds for some q ∈ (0, +∞). [sent-758, score-0.171]

94 Assume we have at least a polynomial number of classifiers to aggregate (i.e., M polynomial in n). [sent-759, score-0.208]

95 Then the aggregate defined in (21) satisfies, for all integers n ≥ 1, $\mathbb{E}R(\tilde{f}_n, f^*) \le (1 + 2\log^{-1/4} M)\, \min_{k \in \{1,\ldots,M\}} R(f_k, f^*)$ plus a remainder term of order $n^{-\frac{q+1}{q+2}}$, up to logarithmic factors in M. [sent-762, score-0.208]

96 Proof (of Theorem 12) Let (q0 , s0 ) ∈ K and consider 0 < qmin < qmax < +∞ and 0 < smin < smax < +∞ such that K ⊂ [qmin , qmax ] × [smin , smax ]. [sent-766, score-0.321]

97 Since $q \mapsto \Phi(q, s)$ continuously increases on $\mathbb{R}_+$, for n greater than a constant depending on b, r, d and K, there exists $\bar{q}_0 \in [q_{\min}, q_{\max}]$ such that $\bar{q}_0 \le q_0 \le \bar{q}_0\,(1 + k_0\,\Delta^{-1})$. [sent-772, score-0.13]

98 Since $\Delta = n^b$, putting $M = \frac{(2r-d)\,\Delta}{2d}$, we have the following oracle inequality: $\mathbb{E}_{P^{\otimes n_2}}\big[ R(\tilde{f}_n, f^*) \,\big|\, D_{n_1}^1 \big] \le (1 + 2\log^{-1/4} M)\, \min_{\alpha \in G(n_2)} R(\hat{f}_{n_1}^{\alpha}, f^*) + C_1\, n_2^{-\frac{q_0+1}{q_0+2}} \log^{7/4} M$, where $C_1$ depends on $c_0$, K and b. [sent-774, score-0.123]

99 Note that C does not depend on $q_0$ and $s_0$ since $(q_0, s_0) \in [q_{\min}, q_{\max}] \times [s_{\min}, s_{\max}]$. [sent-777, score-0.123]

100 Optimal rates of aggregation in classification under low noise assumption. [sent-888, score-0.309]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rd', 0.409), ('sobolev', 0.398), ('aggregation', 0.21), ('aggregate', 0.208), ('oustau', 0.196), ('performances', 0.161), ('besov', 0.152), ('bs', 0.148), ('hkr', 0.147), ('lecu', 0.147), ('steinwart', 0.144), ('svm', 0.14), ('ky', 0.138), ('hk', 0.134), ('kernel', 0.107), ('scovel', 0.106), ('rkhs', 0.104), ('rates', 0.099), ('margin', 0.095), ('kernels', 0.089), ('smooth', 0.083), ('dpx', 0.083), ('aggregates', 0.082), ('matache', 0.082), ('interpolation', 0.08), ('rl', 0.076), ('oracle', 0.076), ('inf', 0.076), ('theorem', 0.073), ('fourier', 0.073), ('remark', 0.069), ('erm', 0.069), ('sign', 0.068), ('regularity', 0.068), ('gaussian', 0.067), ('qmax', 0.065), ('qmin', 0.065), ('triebel', 0.065), ('spaces', 0.065), ('gy', 0.063), ('tsch', 0.062), ('tsybakov', 0.062), ('er', 0.061), ('translation', 0.059), ('classi', 0.059), ('banana', 0.057), ('thyroid', 0.057), ('titanic', 0.057), ('geometric', 0.056), ('diabetis', 0.055), ('tx', 0.053), ('laplacian', 0.053), ('hilbert', 0.053), ('bayes', 0.053), ('px', 0.051), ('heart', 0.051), ('bhk', 0.049), ('rb', 0.049), ('dx', 0.048), ('ers', 0.047), ('nb', 0.047), ('dn', 0.046), ('tuning', 0.043), ('wu', 0.043), ('waveform', 0.042), ('kr', 0.042), ('excess', 0.042), ('smax', 0.042), ('smin', 0.042), ('family', 0.041), ('exponent', 0.04), ('subsample', 0.04), ('ep', 0.039), ('estimators', 0.038), ('realization', 0.037), ('endowed', 0.037), ('risk', 0.037), ('clipped', 0.037), ('control', 0.036), ('sequel', 0.036), ('exp', 0.035), ('fn', 0.034), ('norm', 0.034), ('reproducing', 0.033), ('arora', 0.033), ('bastien', 0.033), ('covel', 0.033), ('karatzoglou', 0.033), ('loustau', 0.033), ('teinwart', 0.033), ('realizations', 0.032), ('kx', 0.032), ('annals', 0.031), ('ei', 0.031), ('corollary', 0.031), ('zhou', 0.03), ('smoothness', 0.03), ('invariant', 0.03), ('banach', 0.03), ('minimization', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

Author: Sébastien Loustau

Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers

2 0.15603915 92 jmlr-2008-Universal Multi-Task Kernels

Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying

Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces

3 0.14046755 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni

Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources

4 0.10923748 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park

Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform

5 0.10911611 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

Author: Peter L. Bartlett, Marten H. Wegkamp

Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines

6 0.10548657 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

7 0.10188253 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

8 0.096154295 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

9 0.086990803 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

10 0.081161402 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

11 0.077130452 86 jmlr-2008-SimpleMKL

12 0.071850114 52 jmlr-2008-Learning from Multiple Sources

13 0.068786368 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

14 0.064627565 58 jmlr-2008-Max-margin Classification of Data with Absent Features

15 0.05988355 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

16 0.059523698 63 jmlr-2008-Model Selection for Regression with Continuous Kernel Functions Using the Modulus of Continuity    (Special Topic on Model Selection)

17 0.058722902 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

18 0.057420198 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

19 0.057366785 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds

20 0.055847272 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.271), (1, -0.163), (2, 0.115), (3, -0.123), (4, 0.172), (5, -0.177), (6, 0.151), (7, 0.018), (8, 0.189), (9, 0.036), (10, 0.049), (11, -0.1), (12, 0.067), (13, -0.063), (14, -0.055), (15, -0.126), (16, 0.065), (17, -0.279), (18, 0.025), (19, -0.002), (20, -0.059), (21, 0.103), (22, -0.016), (23, -0.075), (24, 0.117), (25, -0.218), (26, 0.039), (27, 0.035), (28, -0.022), (29, 0.029), (30, -0.054), (31, -0.081), (32, 0.02), (33, -0.01), (34, -0.157), (35, 0.084), (36, -0.052), (37, -0.095), (38, -0.029), (39, -0.015), (40, -0.024), (41, 0.116), (42, -0.155), (43, -0.095), (44, 0.019), (45, 0.088), (46, 0.092), (47, 0.064), (48, -0.104), (49, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94677341 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

Author: Sébastien Loustau

Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers

2 0.56740695 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni

Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources

3 0.55975217 92 jmlr-2008-Universal Multi-Task Kernels

Author: Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying

Abstract: In this paper we are concerned with reproducing kernel Hilbert spaces HK of functions from an input space into a Hilbert space Y , an environment appropriate for multi-task learning. The reproducing kernel K associated to HK has its values as operators on Y . Our primary goal here is to derive conditions which ensure that the kernel K is universal. This means that on every compact subset of the input space, every continuous function with values in Y can be uniformly approximated by sections of the kernel. We provide various characterizations of universal kernels and highlight them with several concrete examples of some practical importance. Our analysis uses basic principles of functional analysis and especially the useful notion of vector measures which we describe in sufficient detail to clarify our results. Keywords: multi-task learning, multi-task kernels, universal approximation, vector-valued reproducing kernel Hilbert spaces

4 0.4677321 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

Author: Peter L. Bartlett, Marten H. Wegkamp

Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines

5 0.44641706 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park

Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform

6 0.44343901 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

7 0.36469617 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

8 0.36221898 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

9 0.35047919 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

10 0.34822589 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

11 0.30159464 86 jmlr-2008-SimpleMKL

12 0.29838982 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

13 0.28425947 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

14 0.27005991 52 jmlr-2008-Learning from Multiple Sources

15 0.26667783 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

16 0.26302424 57 jmlr-2008-Manifold Learning: The Price of Normalization

17 0.26038799 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

18 0.25931528 63 jmlr-2008-Model Selection for Regression with Continuous Kernel Functions Using the Modulus of Continuity    (Special Topic on Model Selection)

19 0.24800682 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

20 0.23704328 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.523), (31, 0.01), (40, 0.04), (54, 0.04), (58, 0.032), (66, 0.053), (76, 0.011), (78, 0.035), (88, 0.088), (92, 0.03), (94, 0.03), (99, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94237989 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models

Author: Mathias Drton, Thomas S. Richardson

Abstract: In graphical modelling, a bi-directed graph encodes marginal independences among random variables that are identified with the vertices of the graph. We show how to transform a bi-directed graph into a maximal ancestral graph that (i) represents the same independence structure as the original bi-directed graph, and (ii) minimizes the number of arrowheads among all ancestral graphs satisfying (i). Here the number of arrowheads of an ancestral graph is the number of directed edges plus twice the number of bi-directed edges. In Gaussian models, this construction can be used for more efficient iterative maximization of the likelihood function and to determine when maximum likelihood estimates are equal to empirical counterparts. Keywords: ancestral graph, covariance graph, graphical model, marginal independence, maximum likelihood estimation, multivariate normal distribution

same-paper 2 0.93333542 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

Author: Sébastien Loustau

Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers

3 0.89794666 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

Author: Gerda Claeskens, Christophe Croux, Johan Van Kerckhoven

Abstract: Support vector machines for classification have the advantage that the curse of dimensionality is circumvented. It has been shown that a reduction of the dimension of the input space leads to even better results. For this purpose, we propose two information criteria which can be computed directly from the definition of the support vector machine. We assess the predictive performance of the models selected by our new criteria and compare them to existing variable selection techniques in a simulation study. The simulation results show that the new criteria are competitive in terms of generalization error rate while being much easier to compute. We arrive at the same findings for comparison on some real-world benchmark data sets. Keywords: information criterion, supervised classification, support vector machine, variable selection

4 0.53297496 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

Author: Peter L. Bartlett, Marten H. Wegkamp

Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines

5 0.48445994 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression

Author: Andreas Christmann, Arnout Van Messem

Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines

6 0.48392111 20 jmlr-2008-Causal Reasoning with Ancestral Graphs    (Special Topic on Causality)

7 0.44974816 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

8 0.44936666 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

9 0.4143016 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

10 0.41210461 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs    (Special Topic on Causality)

11 0.40748268 80 jmlr-2008-Ranking Individuals by Group Comparisons

12 0.40490276 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning    (Special Topic on Causality)

13 0.404158 92 jmlr-2008-Universal Multi-Task Kernels

14 0.39756677 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

15 0.3891933 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning

16 0.3877317 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds

17 0.38382971 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

18 0.380472 57 jmlr-2008-Manifold Learning: The Price of Normalization

19 0.36675972 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

20 0.36441809 16 jmlr-2008-Approximations for Binary Gaussian Process Classification