jmlr jmlr2008 jmlr2008-3 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that the hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
Reference: text
sentIndex sentText sentNum sentScore
1 Recent research explores some extensions of this large margin based method to the multicategory case. [sent-9, score-0.267]
2 We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. [sent-10, score-0.421]
3 Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy 1. [sent-13, score-0.084]
4 In the binary case (m = 2) a classifier f : ℝ^d → ℝ can be obtained by minimizing the empirical hinge loss (1/n) ∑_{i=1}^n (1 − Y_i f(X_i))_+ (1) over a given class of candidate classifiers f ∈ F , where (1 − Y f (X))_+ := max(0, 1 − Y f (X)) with Y ∈ {±1}. [sent-21, score-0.259]
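As an illustration of the empirical hinge loss (1), the sketch below evaluates it for a linear classifier f(x) = ⟨w, x⟩. This is a minimal example written for this summary, not code from the paper; the function and variable names are illustrative.

```python
import numpy as np

def empirical_hinge_loss(w, X, y):
    """Empirical hinge loss (1): (1/n) * sum_i max(0, 1 - y_i * f(x_i)),
    with a linear classifier f(x) = <w, x> and labels y_i in {-1, +1}."""
    margins = y * (X @ w)                      # y_i * f(x_i) for each observation
    return np.mean(np.maximum(0.0, 1.0 - margins))

# toy usage with n = 4 observations in R^2
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2], [2.0, 1.0]])
y = np.array([1, -1, -1, 1])
w = np.array([0.5, -0.25])
print(empirical_hinge_loss(w, X, y))
```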
5 Hinge loss in combination with a reproducing kernel Hilbert space (RKHS) regularization penalty is called the support vector machine (SVM). [sent-22, score-0.075]
6 In this paper, we examine the generalization of (1) to the multicategory case (m > 2). [sent-24, score-0.192]
7 We show a moment bound for the excess multi-hinge risk based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. [sent-26, score-0.562]
8 There are two strategies to generalize the binary SVM to the multicategory SVM. [sent-29, score-0.232]
9 One strategy solves a series of binary problems; the other considers all of the categories at once. [sent-30, score-0.083]
10 A new example x is assigned to the category with the largest value of f_j(x). [sent-36, score-0.087]
11 The one-versus-one method constructs one binary SVM classifier for every pair of distinct categories, that is, altogether m(m − 1)/2 binary SVM classifiers are constructed. [sent-37, score-0.08]
12 The classifier f_{ij} is trained by taking the examples from category i as positive and the examples from category j as negative. [sent-38, score-0.174]
13 For a new example x, if f_{ij} classifies x into category i, then the vote for category i is increased by one. [sent-39, score-0.213]
14 Otherwise the vote for category j is increased by one. [sent-40, score-0.126]
15 After each of the m(m − 1)/2 classifiers makes its vote, x is assigned to the category with the largest number of votes. [sent-41, score-0.087]
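Sentences 11–15 describe the one-versus-one construction and its voting rule; the sketch below implements that voting step for given pairwise decision functions. It is illustrative only: the dictionary keyed by category pairs and the lowest-index tie-breaking are assumptions made here, not details taken from the paper.

```python
import numpy as np
from itertools import combinations

def one_vs_one_predict(x, pairwise_classifiers, m):
    """Majority vote over the m(m-1)/2 pairwise classifiers.
    pairwise_classifiers[(i, j)](x) > 0 is read as a vote for category i,
    otherwise the vote goes to category j."""
    votes = np.zeros(m, dtype=int)
    for i, j in combinations(range(m), 2):
        if pairwise_classifiers[(i, j)](x) > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    return int(np.argmax(votes))               # category with the largest number of votes

# toy usage with m = 3 linear pairwise classifiers
clfs = {(0, 1): lambda x: x[0] - x[1],
        (0, 2): lambda x: x[0] + x[1],
        (1, 2): lambda x: x[1]}
print(one_vs_one_predict(np.array([1.0, 0.2]), clfs, m=3))
```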
16 An all-at-once strategy for SVM loss has been proposed by some authors. [sent-43, score-0.048]
17 …, f_m(·)) ∈ Π_{j=1}^m H_K with the zero-sum constraint, the extension of the SVM methodology is to minimize (1/n) ∑_{i=1}^n ∑_{j=1, j≠Y_i}^m ( f_j(X_i) + 1/(m−1) )_+ + (λ/2) ∑_{j=1}^m ||h_j||²_{H_K} . [sent-64, score-0.053]
18 l(Y, f (X)) = ∑_{j=1, j≠Y}^m ( f_j(X) + 1/(m−1) )_+ . (4) The binary SVM loss (1) is a special case obtained by taking m = 2. [sent-66, score-0.088]
19 Thus, (4) is identical to the binary SVM loss (1 − Y f (X))_+ , where f_1 plays the same role as f . [sent-69, score-0.088]
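The multi-hinge loss (4) and its reduction to the binary hinge loss can be spelled out in a few lines. The following sketch is illustrative only; the 0-based category encoding and the names are assumptions made for this summary, not notation from the paper.

```python
import numpy as np

def multi_hinge_loss(f_values, y, m):
    """Multi-hinge loss (4): sum over j != y of (f_j(x) + 1/(m-1))_+ ,
    where f_values = (f_1(x), ..., f_m(x)) satisfies the zero-sum constraint
    and y in {0, ..., m-1} is the observed category."""
    f_values = np.asarray(f_values, dtype=float)
    mask = np.arange(m) != y
    return float(np.sum(np.maximum(0.0, f_values[mask] + 1.0 / (m - 1))))

# binary special case (m = 2): with f = (f1, -f1) and category 0 playing the role
# of Y = +1, the loss equals (1 - f1)_+ , that is, the binary hinge loss from (1).
print(multi_hinge_loss([0.4, -0.4], y=0, m=2))   # (-0.4 + 1)_+ = 0.6 = (1 - 0.4)_+
```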
20 Let p_j(x) denote the conditional probability of category j given x ∈ ℝ^d , j = 1, …, m. [sent-73, score-0.087]
21 The theoretical multi-hinge risk is the expectation of the empirical multi-hinge loss with respect to the measure P and is denoted by R( f ) := ∫ l(y, f (x)) dP(x, y) , (5) with l(Y, f (X)) defined as in (4). [sent-81, score-0.167]
22 That is, f ∗ minimizes multi-hinge risk (5) over all possible classifiers. [sent-87, score-0.079]
23 We write R∗ = R( f ∗ ), the smallest possible multi-hinge risk. [sent-88, score-0.081]
24 The Bayes classifier f ∗ minimizes the multi-hinge risk R( f ). [sent-91, score-0.079]
25 This lemma can be found in Lee, Lin, and Wahba (2004), Zhang (2004b,c), Tewari and Bartlett (2005) and Zou, Zhu, and Hastie (2006). [sent-92, score-0.123]
26 They establish the conditions needed to achieve consistency for a general family of multicategory loss functions extended from various large margin binary classifiers. [sent-94, score-0.355]
27 Tewari and Bartlett (2005) and Zhang (2004b,c) also show that the convergence to zero (in probability) of the excess multi-hinge risk R( f ) − R∗ implies the convergence to zero with the same rate (in probability) of the excess prediction error P(g( f (X)) ≠ Y ) − P(g( f ∗ (X)) ≠ Y ). [sent-96, score-0.355]
28 In this paper, we do not study RKHS regularization; instead, we minimize the empirical multi-hinge loss over a given class of candidate classifiers F satisfying a complexity constraint. [sent-99, score-0.124]
29 Let Pn be the empirical distribution of (X,Y ) based on the observations {(X_i ,Y_i)}_{i=1}^n and Qn the corresponding empirical distribution of X based on X_1 , …, X_n . [sent-110, score-0.08]
30 We impose a complexity constraint on the class Fo in terms of either the entropy with bracketing or the empirical entropy. [sent-115, score-0.356]
31 Then NB (ε, G , d) is called the ε-covering number with bracketing of G and HB (ε, G , d) is called the ε-entropy with bracketing of G (for the d-metric). [sent-139, score-0.426]
32 Let HB (ε, Fo , L2 (Q)) and H(ε, Fo , L2 (Qn )) denote the ε-entropy with bracketing and the empirical ε-entropy of the class Fo , respectively. [sent-140, score-0.253]
33 The complexity of a model class can be summarized in a complexity parameter ρ ∈ (0, 1). [sent-141, score-0.072]
34 Besides the model class complexity, the rate of convergence also depends on the so-called margin condition (see Condition A below) that quantifies the identifiability of the Bayes rule and is summarized in a margin parameter (or noise level) κ ≥ 1. [sent-148, score-0.259]
35 In Tarigan and van de Geer (2006), a probability inequality has been obtained for l1 -penalized excess hinge risk in the binary case that adapts to the unknown parameters. [sent-149, score-0.576]
36 In this paper, we show a moment bound for the excess multi-hinge risk R( f̂_n ) − R∗ of f̂_n over the model class F with rate of convergence n^{−κ/(2κ−1+ρ)} , which is faster than n^{−1/2} . [sent-150, score-0.304]
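As a quick numerical illustration of the claimed rate (written for this summary, not taken from the paper), the snippet below compares n^{−κ/(2κ−1+ρ)} with n^{−1/2} for a few values of the margin parameter κ ≥ 1 and the complexity parameter ρ ∈ (0, 1); the exponent always exceeds 1/2 in this range.

```python
import numpy as np

def moment_bound_rate(n, kappa, rho):
    """Rate n^{-kappa/(2*kappa - 1 + rho)} appearing in the moment bound."""
    return n ** (-kappa / (2.0 * kappa - 1.0 + rho))

n = 10_000
for kappa, rho in [(1.0, 0.1), (1.0, 0.9), (2.0, 0.5)]:
    print(f"kappa={kappa}, rho={rho}: rate={moment_bound_rate(n, kappa, rho):.2e}, "
          f"n^(-1/2)={n ** -0.5:.2e}")
```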
37 In Section 2 we present our main result based on the margin and complexity conditions. [sent-151, score-0.111]
38 A Moment Bound for Multi-hinge Classifiers We first state the margin and the complexity conditions. [sent-155, score-0.111]
39 The ε-entropy with bracketing satisfies the inequality HB (ε, Fo , L2 (Q)) ≤ A ε^{−2ρ} , for all ε > 0 . [sent-160, score-0.257]
40 The empirical ε-entropy, almost surely for all n ≥ 1, satisfies the inequality H(ε, Fo , L2 (Qn )) ≤ A ε^{−2ρ} , for all ε > 0 . [sent-163, score-0.084]
41 Condition A follows from the condition on the behaviour of the conditional probabilities p j . [sent-176, score-0.127]
42 The margin condition is also called the condition on the noise level, and it is summarized in a margin parameter κ. [sent-181, score-0.312]
43 2) discuss the noise condition and its equivalent variants, corresponding to the fast rates of convergence, in the binary case. [sent-183, score-0.121]
44 Thus, Condition AA is a natural extension of this noise condition to the multicategory case. [sent-184, score-0.192]
45 Let p_k(x) := max_{j∈{1,…,m}} p_j(x) and define τ(x) := min_{j≠k} { |p_j(x) − p_k(x)| , 1 − p_k(x) } , (8) where j and k take values in {1, 2, …, m}. [sent-191, score-0.372]
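The quantity τ(x) in (8) is easy to compute from the vector of conditional probabilities. The sketch below follows the reconstruction of the garbled display given above (with p_k(x) the largest conditional probability), so it should be read as illustrative rather than as the paper's exact definition.

```python
import numpy as np

def tau(p):
    """tau(x) = min over j != k of min{ |p_j(x) - p_k(x)| , 1 - p_k(x) },
    where k indexes the largest conditional probability p_k(x) = max_j p_j(x)."""
    p = np.asarray(p, dtype=float)
    k = int(np.argmax(p))
    gaps = np.abs(np.delete(p, k) - p[k])       # |p_j - p_k| for all j != k
    return float(min(gaps.min(), 1.0 - p[k]))

print(tau([0.6, 0.3, 0.1]))   # min(0.3, 0.5, 0.4) = 0.3
```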
46 We refer to the work of Bartlett and Wegkamp (2006, Section 4) and Tarigan and van de Geer (2006, Section 3. [sent-210, score-0.172]
47 See, for example, van der Vaart and Wellner (1996, Section 2. [sent-214, score-0.172]
48 In the situation when the approximation error inf_{f∈F} R( f ) − R∗ is zero (the model class F contains the Bayes classifier), Steinwart and Scovel (2005) obtain the same rate of convergence for the excess hinge risk under the margin condition A and the complexity condition B2. [sent-221, score-0.621]
49 Proof of Theorem 2 Let f^o := arg min_{f∈F} R( f ), the minimizer of the theoretical risk in the model class F . [sent-226, score-0.113]
50 As shorthand notation we write for the loss l_f = l_f(X,Y ) = l(Y, f (X)). [sent-227, score-0.086]
51 (9) We call inequality (9) a basic inequality, following van de Geer (2000). [sent-230, score-0.216]
52 This upper bound enables us to work with the increments of the empirical process {νn (l f ) − νn (l f o ) : l f ∈ L } indexed by the multi-hinge loss l f ∈ L , where L = {l f : f ∈ F }. [sent-231, score-0.088]
53 It gives an upper bound for the ε-entropy with bracketing of the model class L : HB (ε, L , L2 (P)) ≤ A_o ε^{−2ρ} , for all ε > 0, with A_o = A(m − 1)^{2+2ρ} . [sent-251, score-0.213]
54 14 in van de Geer (2000), presented below in Lemma 6, gives the desired exponential tail probability. [sent-253, score-0.172]
55 Lemma 4 gives an upper bound of the squared L2 (P)-metric of the excess loss in terms of the ‖ · ‖_{2,Q}-metric. [sent-261, score-0.158]
56 Definition of the loss gives ∆( f , f ∗ ) = ∑_{j=1}^m p_j [ ∑_{i≠j} ( ( f_i + 1/(m−1) )_+ − ( f_i∗ + 1/(m−1) )_+ ) ]² = ∑_{j=1}^m p_j [ ∑_{i∈I⁺(j)} ( f_i − f_i∗ ) + ∑_{i∈I⁻(j)} ( −1/(m−1) − f_i∗ ) ]² , where I⁺( j) = {i ≠ j : f_i ≥ −1/(m − 1), i = 1, …, m} [sent-270, score-0.375]
57 and I⁻( j) = {i ≠ j : f_i < −1/(m − 1), i = 1, …, m}. [sent-273, score-0.109]
58 Use the facts that (∑_{i=1}^n a_i)² ≤ n ∑_{i=1}^n a_i² for all n ∈ ℕ and a_i ∈ ℝ, and that max{|I⁺( j)|, |I⁻( j)|} ≤ m − 1, to obtain ∆( f , f ∗ ) ≤ (m − 1) ∑_{j=1}^m p_j [ ∑_{i∈I⁺(j)} ( f_i − f_i∗ )² + ∑_{i∈I⁻(j)} ( −1/(m−1) − f_i∗ )² ] . [sent-277, score-0.109]
59 Clearly, | −1/(m − 1) − f_i∗ | ≤ | f_i − f_i∗ | for all i ∈ I⁻( j). [sent-278, score-0.109]
60 Hence, ∆( f , f ∗ ) ≤ (m − 1) ∑_{j=1}^m p_j ( ∑_{i≠j} | f_i − f_i∗ |² ) = (m − 1) ∑_{j=1}^m (1 − p_j) | f_j − f_j∗ |² , where the last equality is obtained using ∑_{j=1}^m p_j = 1. [sent-279, score-0.109]
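The bound just derived, ∆(f, f∗) ≤ (m − 1) ∑_j (1 − p_j)|f_j − f_j∗|², can be sanity-checked numerically. The sketch below does so for random probability vectors p, Bayes-type targets f∗ (one coordinate equal to 1, the rest equal to −1/(m−1)) and arbitrary candidates f; it relies on the reconstruction of the garbled displays above and is only an illustrative check, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
c = 1.0 / (m - 1)

def delta(f, f_star, p):
    """Pointwise quantity sum_j p_j * ( sum_{i != j} [(f_i + c)_+ - (f*_i + c)_+] )^2."""
    plus_f, plus_fs = np.maximum(0.0, f + c), np.maximum(0.0, f_star + c)
    return sum(p[j] * np.sum(np.delete(plus_f - plus_fs, j)) ** 2 for j in range(m))

for _ in range(1000):
    p = rng.dirichlet(np.ones(m))
    k = rng.integers(m)
    f_star = np.full(m, -c); f_star[k] = 1.0       # Bayes-type target
    f = rng.normal(size=m)
    bound = (m - 1) * np.sum((1.0 - p) * (f - f_star) ** 2)
    assert delta(f, f_star, p) <= bound + 1e-12
print("Lemma 4 bound held on all random draws")
```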
61 The technical lemma below is an immediate consequence of Young’s inequality (see, for example, Hardy, Littlewood, and Pólya, 1988, Chapter 8. [sent-282, score-0.167]
62 For a probability measure Q, let H be a class of uniformly bounded functions h in L2 (Q), say sup_{h∈H} |h − h^o|_∞ < 1, where h^o is a fixed but arbitrary function in H . [sent-289, score-0.842]
63 Then for some positive constants c and n_o depending only on ρ and A_o , P( sup_{h∈H} |ν_n(h) − ν_n(h^o)| / ( ‖h − h^o‖ ∨ n^{−1/(2+2ρ)} )^{1−ρ} ≥ t ) ≤ c exp(−t/c²) , for all t > c and n > n_o . [sent-291, score-0.547]
64 For a probability measure Q on (Z , A ), let H be a class of uniformly bounded functions h in L2 (Q), say sup_{h∈H} |h − h^o|_∞ < 1, where h^o is a fixed but arbitrary element in H . [sent-293, score-0.842]
65 Then for some positive constants c and n_o depending on ρ and A_o , P( sup_{h∈H} |ν_n(h) − ν_n(h^o)| / ( ‖h − h^o‖ ∨ n^{−1/(2+2ρ)} )^{1−ρ} ≥ t ) ≤ c exp(−t/c²) , for all t > c and n > n_o . [sent-295, score-0.547]
66 To handle (12), we divide the class H into two disjoint classes where the empirical distance ‖h − h^o‖_n is smaller or larger than n^{−1/(2+2ρ)} . [sent-302, score-0.461]
67 1 in van de Geer (2000), stated below in Lemma 8, for some positive constant c_1 , P( sup_{h∈H_n} |ν_n^ε(h) − ν_n^ε(h^o)| / n^{−(1−ρ)/(2+2ρ)} ≥ √t/4 ) ≤ c_1 exp( − t n^{1/(1+ρ)} / (64 c_1²) ) . [sent-305, score-0.317]
68 We apply the peeling device on the set {h ∈ H : 2^{−j} ≤ ‖h − h^o‖_n ≤ 2^{−j+1}} , j = 1, …, J, [sent-307, score-0.505]
69 to obtain that, for all t > 1, P( sup_{h∈H_n^c} |ν_n^ε(h) − ν_n^ε(h^o)| / ‖h − h^o‖_n^{1−ρ} ≥ √t/4 | Z_1 , …, Z_n ) ≤ ∑_{j=1}^J P( sup_{h∈H : ‖h − h^o‖_n ≤ 2^{−j+1}} |ν_n^ε(h) − ν_n^ε(h^o)| ≥ (√t/4) 2^{−j(1−ρ)} | Z_1 , …, Z_n ) . [sent-310, score-0.611]
70 6 in van de Geer (2000), stated below in Lemma 9, where we take t such that (√t/4)^{1/(1−ρ)} ≥ 14u. [sent-318, score-0.172]
71 Let H(δ) := {h ∈ H : ‖h − h^o‖_{2,Q} ≤ δ} and δ̂_n := sup_{h∈H(δ)} ‖h − h^o‖_{2,Q_n} , where h^o is a fixed but arbitrary function in H and Q_n is the corresponding empirical distribution of Z based on {Z_i}_{i=1}^n . [sent-337, score-1.398]
72 For a ≥ 8C ( ∫_{a/(32√n)}^{δ̂_n} H^{1/2}(u, H , Q_n) du ∨ δ̂_n ) , where C is some positive constant, we have P( sup_{h∈H(δ)} |ν_n^ε(h) − ν_n^ε(h^o)| ≥ a | Z_1 , …, Z_n ) ≤ … [sent-338, score-0.095]
73 The following lemma is a modification of Lemma 5. [sent-342, score-0.123]
74 Then, for all n, P( sup_{h∈H} ‖h‖_{2,S_n} / ( ‖h‖_{2,S} ∨ n^{−1/(2+2ρ)} ) ≥ 14u ) ≤ 4 exp( −u² n^{ρ/(1+ρ)} ) , for all u ≥ 1. [sent-347, score-0.095]
75 We apply the randomization device in Pollard (1984, page 32), as follows. [sent-350, score-0.072]
76 Use a symmetrization lemma of Pollard (1984, Lemma II. [sent-373, score-0.177]
77 8), see Appendix, to obtain P( sup_{h∈H} ‖h‖_{2,S_n} / ( ‖h‖_{2,S} ∨ δ_n ) ≥ 14u ) ≤ 2 P( sup_{h∈H} | ‖h‖_{2,S_n} − ‖h‖_{2,S′_n} | / ( ‖h‖_{2,S} ∨ δ_n ) ≥ 12u ) . [sent-375, score-0.19]
78 h 2,S ∨ δn 2,Sn The peeling device on the set {h ∈ H : (2u) j−1 δn ≤ h 2,S ≤ (2u) j δn , j = 1, 2, . [sent-376, score-0.084]
79 } and the inequality in Pollard (1984, page 33) give sup | h − h 2,Sn | ≥ 12u Z1 , . [sent-379, score-0.17]
80 , Zn h 2,S ∨ δn 2,Sn h∈H ∞ ∑ ≤ h∈H j=1 h ∞ ≤ ∞ ≤ ∑ 2 exp j=1 | h sup 2,S ≤(2u) ∑ 2 exp j=1 jδ Sn − h Sn | ≥ 6(2u) j δn | Z1 , . [sent-382, score-0.195]
81 Definition (4) of the loss and the fact that ∑_{j=1}^m p_j = 1 give L( f ) = ∑_{j=1}^m p_j ( ∑_{k=1, k≠j}^m ( f_k + 1/(m−1) )_+ ) = ∑_{j=1}^m (1 − p_j)( f_j + 1/(m−1) )_+ . [sent-397, score-0.101]
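The two expressions for the pointwise risk L(f) in the display above can be checked against each other numerically; the short sketch below does this at a fixed x for a probability vector p. It is an illustrative check written for this summary, not code from the paper.

```python
import numpy as np

def pointwise_risk_two_ways(f, p):
    """L(f) at a fixed x, computed both ways:
    sum_j p_j * sum_{k != j} (f_k + 1/(m-1))_+  and  sum_j (1 - p_j) * (f_j + 1/(m-1))_+ ."""
    f, p = np.asarray(f, dtype=float), np.asarray(p, dtype=float)
    m = len(f)
    plus = np.maximum(0.0, f + 1.0 / (m - 1))
    lhs = sum(p[j] * (plus.sum() - plus[j]) for j in range(m))
    rhs = float(np.sum((1.0 - p) * plus))
    return lhs, rhs

print(pointwise_risk_two_ways([1.0, -0.5, -0.5], [0.5, 0.3, 0.2]))   # both equal 0.75
```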
82 Write ∆( f ) := L( f ) − L( f ∗ ) = ∑_{j≠k} (1 − p_j)( f_j + 1/(m−1) )_+ + (1 − p_k)( f_k + 1/(m−1) )_+ − (1 − p_k)( 1 + 1/(m−1) ). [sent-410, score-0.425]
83 Here, ∆( f ) = (1 − p_k)( f_k − 1 ) + ∑_{j≠k} (1 − p_j)( f_j + 1/(m−1) )_+ . [sent-412, score-0.239]
84 Divide the sum into the sets J⁺(k) and J⁻(k) to obtain ∆( f ) = ∑_{j∈J⁺(k)} (p_k − p_j)( f_j + 1/(m−1) ) + (1 − p_k) ∑_{j∈J⁻(k)} | f_j + 1/(m−1) | . [sent-414, score-0.186]
85 In both cases, L( f ) − L( f ∗ ) is clearly non-negative, since p_k − p_j is non-negative for all j ≠ k. [sent-416, score-0.186]
86 … ≥ (τ/2) ∑_{j=1}^m | f_j − f_j∗ | , where the second inequality is obtained from the fact that | f_k − f_k∗ | ≤ ∑_{j≠k} | f_j − f_j∗ |. [sent-433, score-0.097]
87 That is, the excess risk is lower bounded by (1/2) ∫ ∑_{j=1}^m τ | f_j − f_j∗ | dQ . [sent-434, score-0.189]
88 Then P( sup_t |Z(t)| > ε ) ≤ β^{−1} P( sup_t |Z(t) − Z′(t)| > ε − α ) . [sent-444, score-0.19]
89 Classification with a reject option using a hinge loss. [sent-448, score-0.131]
90 On the learnability and design of output codes for multiclass problems. [sent-462, score-0.065]
91 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-466, score-0.065]
92 Characterizing the solution path of multicategory support vector machines. [sent-497, score-0.219]
93 Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. [sent-500, score-0.102]
94 Support vector machines and the Bayes rule in classification. [sent-503, score-0.064]
95 Classifiers of support vector machine type with l1 complexity regularization. [sent-531, score-0.063]
96 On l1 -norm multiclass support vector machines: Methodology and theory. [sent-558, score-0.092]
97 Statistical behaviour and consistency of classification methods based on convex risk minimization. [sent-564, score-0.125]
98 Statistical analysis of some multi-category large margin classification methods. [sent-568, score-0.075]
99 Statistical analysis of some multi-category large margin classification methods. [sent-571, score-0.075]
100 The margin vector, admissible loss and multi-class margin-based classifiers. [sent-574, score-0.123]
wordName wordTfidf (topN-words)
[('ho', 0.421), ('tarigan', 0.235), ('fo', 0.227), ('bracketing', 0.213), ('zn', 0.211), ('geer', 0.21), ('hb', 0.193), ('multicategory', 0.192), ('pk', 0.186), ('van', 0.172), ('eer', 0.149), ('oment', 0.149), ('ao', 0.148), ('dq', 0.148), ('hinge', 0.131), ('ound', 0.127), ('ulti', 0.127), ('lemma', 0.123), ('qn', 0.117), ('excess', 0.11), ('fi', 0.109), ('sara', 0.107), ('lassifiers', 0.104), ('sup', 0.095), ('category', 0.087), ('sn', 0.081), ('condition', 0.081), ('risk', 0.079), ('margin', 0.075), ('cz', 0.072), ('gl', 0.072), ('multiclass', 0.065), ('bayes', 0.064), ('bernadetta', 0.064), ('gu', 0.059), ('pollard', 0.055), ('yoonkyung', 0.054), ('departement', 0.054), ('jon', 0.054), ('suph', 0.054), ('symmetrization', 0.054), ('fk', 0.053), ('fm', 0.053), ('classi', 0.051), ('svm', 0.05), ('lee', 0.05), ('exp', 0.05), ('aa', 0.049), ('tewari', 0.049), ('loss', 0.048), ('behaviour', 0.046), ('zi', 0.045), ('bartlett', 0.044), ('moment', 0.044), ('inequality', 0.044), ('categories', 0.043), ('barrio', 0.043), ('hardy', 0.043), ('littlewood', 0.043), ('multihinge', 0.043), ('peeling', 0.043), ('radiance', 0.043), ('crammer', 0.042), ('device', 0.041), ('empirical', 0.04), ('binary', 0.04), ('entropy', 0.04), ('vote', 0.039), ('write', 0.038), ('nb', 0.037), ('complexity', 0.036), ('auer', 0.036), ('ethz', 0.036), ('koby', 0.036), ('meir', 0.036), ('tong', 0.035), ('arg', 0.034), ('ers', 0.033), ('alexandre', 0.032), ('mammen', 0.032), ('satellite', 0.032), ('song', 0.032), ('tsybakov', 0.032), ('wellner', 0.032), ('weston', 0.032), ('er', 0.032), ('page', 0.031), ('constants', 0.031), ('lin', 0.031), ('del', 0.03), ('duan', 0.03), ('zurich', 0.03), ('convergence', 0.028), ('boucheron', 0.028), ('lectures', 0.028), ('vaart', 0.028), ('yoram', 0.028), ('peter', 0.028), ('constraint', 0.027), ('support', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that the hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
2 0.15329947 7 jmlr-2008-A Tutorial on Conformal Prediction
Author: Glenn Shafer, Vladimir Vovk
Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions
3 0.12522101 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
4 0.10889544 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
5 0.086990803 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
6 0.06677904 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
7 0.063071348 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
8 0.062582016 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
9 0.061697669 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.055150997 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
11 0.050147001 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
12 0.049157292 9 jmlr-2008-Active Learning by Spherical Subdivision
13 0.043433819 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
14 0.041071679 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
15 0.040103577 83 jmlr-2008-Robust Submodular Observation Selection
16 0.03973911 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
17 0.036546696 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
18 0.036348544 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
19 0.035626374 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
20 0.035158549 58 jmlr-2008-Max-margin Classification of Data with Absent Features
topicId topicWeight
[(0, 0.202), (1, -0.1), (2, -0.019), (3, -0.114), (4, 0.095), (5, -0.046), (6, 0.143), (7, -0.188), (8, 0.137), (9, -0.208), (10, 0.2), (11, -0.124), (12, -0.156), (13, -0.061), (14, 0.219), (15, -0.062), (16, -0.059), (17, 0.055), (18, 0.235), (19, -0.061), (20, -0.06), (21, 0.091), (22, 0.071), (23, 0.011), (24, -0.026), (25, -0.102), (26, -0.129), (27, -0.091), (28, 0.048), (29, 0.017), (30, 0.049), (31, 0.023), (32, -0.031), (33, -0.063), (34, -0.06), (35, -0.195), (36, -0.068), (37, -0.051), (38, 0.024), (39, 0.003), (40, -0.058), (41, -0.066), (42, 0.116), (43, 0.007), (44, 0.035), (45, -0.004), (46, 0.081), (47, -0.061), (48, 0.098), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.94073594 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that the hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
2 0.6823296 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
3 0.64497709 7 jmlr-2008-A Tutorial on Conformal Prediction
Author: Glenn Shafer, Vladimir Vovk
Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions
4 0.38750932 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
5 0.30370384 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
6 0.30274767 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.29385102 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
8 0.27699164 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
9 0.27244478 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
10 0.27068588 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
11 0.23948415 9 jmlr-2008-Active Learning by Spherical Subdivision
12 0.20729484 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
13 0.19540808 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
14 0.17993751 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
15 0.17426766 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
16 0.17003925 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.16541786 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
18 0.16167261 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
19 0.16117479 26 jmlr-2008-Consistency of Trace Norm Minimization
20 0.16058424 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
topicId topicWeight
[(0, 0.057), (39, 0.377), (40, 0.022), (54, 0.038), (58, 0.056), (66, 0.058), (76, 0.013), (78, 0.015), (88, 0.112), (92, 0.073), (94, 0.036), (99, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.7377488 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that the hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
2 0.38430429 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
3 0.38155636 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
4 0.37482905 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
5 0.37152594 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
6 0.37092125 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
7 0.36701947 96 jmlr-2008-Visualizing Data using t-SNE
8 0.36676899 9 jmlr-2008-Active Learning by Spherical Subdivision
9 0.36647868 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
10 0.36424276 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
11 0.35954228 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
12 0.35922906 56 jmlr-2008-Magic Moments for Structured Output Prediction
13 0.35710084 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
14 0.35540697 86 jmlr-2008-SimpleMKL
15 0.35396197 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
17 0.35321218 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
18 0.35193324 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
19 0.35172972 83 jmlr-2008-Robust Submodular Observation Selection
20 0.351385 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks