jmlr jmlr2008 jmlr2008-3 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that the hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
Reference: text
sentIndex sentText sentNum sentScore
1 Recent research explores some extensions of this large margin based method to the multicategory case. [sent-9, score-0.267]
2 We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. [sent-10, score-0.421]
3 Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy 1. [sent-13, score-0.084]
4 In the binary case (m = 2) a classifier f : ℝ^d → ℝ can be obtained by minimizing the empirical hinge loss (1/n) ∑_{i=1}^n (1 − Y_i f(X_i))_+ (1) over a given class of candidate classifiers f ∈ F , where (1 − Y f (X))_+ := max(0, 1 − Y f (X)) with Y ∈ {±1}. [sent-21, score-0.259]
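As an illustration of the empirical hinge loss (1), the sketch below evaluates it for a linear classifier f(x) = ⟨w, x⟩. This is a minimal example written for this summary, not code from the paper; the function and variable names are illustrative.

```python
import numpy as np

def empirical_hinge_loss(w, X, y):
    """Empirical hinge loss (1): (1/n) * sum_i max(0, 1 - y_i * f(x_i)),
    with a linear classifier f(x) = <w, x> and labels y_i in {-1, +1}."""
    margins = y * (X @ w)                      # y_i * f(x_i) for each observation
    return np.mean(np.maximum(0.0, 1.0 - margins))

# toy usage with n = 4 observations in R^2
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2], [2.0, 1.0]])
y = np.array([1, -1, -1, 1])
w = np.array([0.5, -0.25])
print(empirical_hinge_loss(w, X, y))
```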
5 Hinge loss in combination with a reproducing kernel Hilbert space (RKHS) regularization penalty is called the support vector machine (SVM). [sent-22, score-0.075]
6 In this paper, we examine the generalization of (1) to the multicategory case (m > 2). [sent-24, score-0.192]
7 We show a moment bound for the excess multi-hinge risk based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. [sent-26, score-0.562]
8 There are two strategies to generalize the binary SVM to the multicategory SVM. [sent-29, score-0.232]
9 One strategy solves a series of binary problems; the other considers all of the categories at once. [sent-30, score-0.083]
10 A new example x is assigned to the category with the largest value of f_j(x). [sent-36, score-0.087]
11 The one-versus-one method constructs one binary SVM classifier for every pair of distinct categories, that is, altogether m(m − 1)/2 binary SVM classifiers are constructed. [sent-37, score-0.08]
12 The classifier f_{ij} is trained by taking the examples from category i as positive and the examples from category j as negative. [sent-38, score-0.174]
13 For a new example x, if f_{ij} classifies x into category i, then the vote for category i is increased by one. [sent-39, score-0.213]
14 Otherwise the vote for category j is increased by one. [sent-40, score-0.126]
15 After each of the m(m − 1)/2 classifiers makes its vote, x is assigned to the category with the largest number of votes. [sent-41, score-0.087]
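Sentences 11–15 describe the one-versus-one construction and its voting rule; the sketch below implements that voting step for given pairwise decision functions. It is illustrative only: the dictionary keyed by category pairs and the lowest-index tie-breaking are assumptions made here, not details taken from the paper.

```python
import numpy as np
from itertools import combinations

def one_vs_one_predict(x, pairwise_classifiers, m):
    """Majority vote over the m(m-1)/2 pairwise classifiers.
    pairwise_classifiers[(i, j)](x) > 0 is read as a vote for category i,
    otherwise the vote goes to category j."""
    votes = np.zeros(m, dtype=int)
    for i, j in combinations(range(m), 2):
        if pairwise_classifiers[(i, j)](x) > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    return int(np.argmax(votes))               # category with the largest number of votes

# toy usage with m = 3 linear pairwise classifiers
clfs = {(0, 1): lambda x: x[0] - x[1],
        (0, 2): lambda x: x[0] + x[1],
        (1, 2): lambda x: x[1]}
print(one_vs_one_predict(np.array([1.0, 0.2]), clfs, m=3))
```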
16 An all-at-once strategy for SVM loss has been proposed by some authors. [sent-43, score-0.048]
17 …, f_m(·)) ∈ Π_{j=1}^m H_K with the zero-sum constraint, the extension of the SVM methodology is to minimize (1/n) ∑_{i=1}^n ∑_{j=1, j≠Y_i}^m ( f_j(X_i) + 1/(m−1) )_+ + (λ/2) ∑_{j=1}^m ||h_j||²_{H_K} . [sent-64, score-0.053]
18 l(Y, f (X)) = ∑_{j=1, j≠Y}^m ( f_j(X) + 1/(m−1) )_+ . (4) The binary SVM loss (1) is a special case obtained by taking m = 2. [sent-66, score-0.088]
19 Thus, (4) is identical to the binary SVM loss (1 − Y f (X))_+ , where f_1 plays the same role as f . [sent-69, score-0.088]
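The multi-hinge loss (4) and its reduction to the binary hinge loss can be spelled out in a few lines. The following sketch is illustrative only; the 0-based category encoding and the names are assumptions made for this summary, not notation from the paper.

```python
import numpy as np

def multi_hinge_loss(f_values, y, m):
    """Multi-hinge loss (4): sum over j != y of (f_j(x) + 1/(m-1))_+ ,
    where f_values = (f_1(x), ..., f_m(x)) satisfies the zero-sum constraint
    and y in {0, ..., m-1} is the observed category."""
    f_values = np.asarray(f_values, dtype=float)
    mask = np.arange(m) != y
    return float(np.sum(np.maximum(0.0, f_values[mask] + 1.0 / (m - 1))))

# binary special case (m = 2): with f = (f1, -f1) and category 0 playing the role
# of Y = +1, the loss equals (1 - f1)_+ , that is, the binary hinge loss from (1).
print(multi_hinge_loss([0.4, -0.4], y=0, m=2))   # (-0.4 + 1)_+ = 0.6 = (1 - 0.4)_+
```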
20 Let p_j(x) denote the conditional probability of category j given x ∈ ℝ^d , j = 1, …, m. [sent-73, score-0.087]
21 The theoretical multi-hinge risk is the expectation of the empirical multi-hinge loss with respect to the measure P and is denoted by R( f ) := ∫ l(y, f (x)) dP(x, y) , (5) with l(Y, f (X)) defined as in (4). [sent-81, score-0.167]
22 That is, f ∗ minimizes multi-hinge risk (5) over all possible classifiers. [sent-87, score-0.079]
23 We write R∗ = R( f ∗ ), the smallest possible multi-hinge risk. [sent-88, score-0.081]
24 The Bayes classifier f ∗ minimizes the multi-hinge risk R( f ). [sent-91, score-0.079]
25 This lemma can be found in Lee, Lin, and Wahba (2004), Zhang (2004b,c), Tewari and Bartlett (2005) and Zou, Zhu, and Hastie (2006). [sent-92, score-0.123]
26 They establish the conditions needed to achieve consistency for a general family of multicategory loss functions extended from various large margin binary classifiers. [sent-94, score-0.355]
27 Tewari and Bartlett (2005) and Zhang (2004b,c) also show that the convergence to zero (in probability) of the excess multi-hinge risk R( f ) − R∗ implies the convergence to zero with the same rate (in probability) of the excess prediction error P(g( f (X)) ≠ Y ) − P(g( f ∗ (X)) ≠ Y ). [sent-96, score-0.355]
28 In this paper, we do not study RKHS regularization; instead, we minimize the empirical multi-hinge loss over a given class of candidate classifiers F satisfying a complexity constraint. [sent-99, score-0.124]
29 Let Pn be the empirical distribution of (X,Y ) based on the observations {(X_i ,Y_i)}_{i=1}^n and Qn the corresponding empirical distribution of X based on X_1 , …, X_n . [sent-110, score-0.08]
30 We impose a complexity constraint on the class Fo in terms of either the entropy with bracketing or the empirical entropy. [sent-115, score-0.356]
31 Then NB (ε, G , d) is called the ε-covering number with bracketing of G and HB (ε, G , d) is called the ε-entropy with bracketing of G (for the d-metric). [sent-139, score-0.426]
32 Let HB (ε, Fo , L2 (Q)) and H(ε, Fo , L2 (Qn )) denote the ε-entropy with bracketing and the empirical ε-entropy of the class Fo , respectively. [sent-140, score-0.253]
33 The complexity of a model class can be summarized in a complexity parameter ρ ∈ (0, 1). [sent-141, score-0.072]
34 Besides the model class complexity, the rate of convergence also depends on the so-called margin condition (see Condition A below) that quantifies the identifiability of the Bayes rule and is summarized in a margin parameter (or noise level) κ ≥ 1. [sent-148, score-0.259]
35 In Tarigan and van de Geer (2006), a probability inequality has been obtained for l1 -penalized excess hinge risk in the binary case that adapts to the unknown parameters. [sent-149, score-0.576]
36 In this paper, we show a moment bound for the excess multi-hinge risk R( f̂_n ) − R∗ of f̂_n over the model class F with rate of convergence n^{−κ/(2κ−1+ρ)} , which is faster than n^{−1/2} . [sent-150, score-0.304]
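As a quick numerical illustration of the claimed rate (written for this summary, not taken from the paper), the snippet below compares n^{−κ/(2κ−1+ρ)} with n^{−1/2} for a few values of the margin parameter κ ≥ 1 and the complexity parameter ρ ∈ (0, 1); the exponent always exceeds 1/2 in this range.

```python
import numpy as np

def moment_bound_rate(n, kappa, rho):
    """Rate n^{-kappa/(2*kappa - 1 + rho)} appearing in the moment bound."""
    return n ** (-kappa / (2.0 * kappa - 1.0 + rho))

n = 10_000
for kappa, rho in [(1.0, 0.1), (1.0, 0.9), (2.0, 0.5)]:
    print(f"kappa={kappa}, rho={rho}: rate={moment_bound_rate(n, kappa, rho):.2e}, "
          f"n^(-1/2)={n ** -0.5:.2e}")
```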
37 In Section 2 we present our main result based on the margin and complexity conditions. [sent-151, score-0.111]
38 A Moment Bound for Multi-hinge Classifiers We first state the margin and the complexity conditions. [sent-155, score-0.111]
39 The ε-entropy with bracketing satisfies the inequality HB (ε, Fo , L2 (Q)) ≤ A ε^{−2ρ} , for all ε > 0 . [sent-160, score-0.257]
40 The empirical ε-entropy, almost surely for all n ≥ 1, satisfies the inequality H(ε, Fo , L2 (Qn )) ≤ A ε^{−2ρ} , for all ε > 0 . [sent-163, score-0.084]
41 Condition A follows from the condition on the behaviour of the conditional probabilities p j . [sent-176, score-0.127]
42 The margin condition is also called the condition on the noise level, and it is summarized in a margin parameter κ. [sent-181, score-0.312]
43 2) discuss the noise condition and its equivalent variants, corresponding to the fast rates of convergence, in the binary case. [sent-183, score-0.121]
44 Thus, Condition AA is a natural extension of this noise condition to the multicategory case. [sent-184, score-0.192]
45 Let p_k(x) := max_{j∈{1,…,m}} p_j(x) and define τ(x) := min_{j≠k} { |p_j(x) − p_k(x)| , 1 − p_k(x) } , (8) where j and k take values in {1, 2, …, m}. [sent-191, score-0.372]
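The quantity τ(x) in (8) is easy to compute from the vector of conditional probabilities. The sketch below follows the reconstruction of the garbled display given above (with p_k(x) the largest conditional probability), so it should be read as illustrative rather than as the paper's exact definition.

```python
import numpy as np

def tau(p):
    """tau(x) = min over j != k of min{ |p_j(x) - p_k(x)| , 1 - p_k(x) },
    where k indexes the largest conditional probability p_k(x) = max_j p_j(x)."""
    p = np.asarray(p, dtype=float)
    k = int(np.argmax(p))
    gaps = np.abs(np.delete(p, k) - p[k])       # |p_j - p_k| for all j != k
    return float(min(gaps.min(), 1.0 - p[k]))

print(tau([0.6, 0.3, 0.1]))   # min(0.3, 0.5, 0.4) = 0.3
```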
46 We refer to the work of Bartlett and Wegkamp (2006, Section 4) and Tarigan and van de Geer (2006, Section 3. [sent-210, score-0.172]
47 See, for example, van der Vaart and Wellner (1996, Section 2. [sent-214, score-0.172]
48 In the situation when the approximation error inf_{f∈F} R( f ) − R∗ is zero (the model class F contains the Bayes classifier), Steinwart and Scovel (2005) obtain the same rate of convergence for the excess hinge risk under the margin condition A and the complexity condition B2. [sent-221, score-0.621]
49 Proof of Theorem 2 Let f^o := arg min_{f∈F} R( f ), the minimizer of the theoretical risk in the model class F . [sent-226, score-0.113]
50 As shorthand notation we write for the loss l_f = l_f(X,Y ) = l(Y, f (X)). [sent-227, score-0.086]
51 (9) We call inequality (9) a basic inequality, following van de Geer (2000). [sent-230, score-0.216]
52 This upper bound enables us to work with the increments of the empirical process {νn (l f ) − νn (l f o ) : l f ∈ L } indexed by the multi-hinge loss l f ∈ L , where L = {l f : f ∈ F }. [sent-231, score-0.088]
53 It gives an upper bound for the ε-entropy with bracketing of the model class L : HB (ε, L , L2 (P)) ≤ A_o ε^{−2ρ} , for all ε > 0, with A_o = A(m − 1)^{2+2ρ} . [sent-251, score-0.213]
54 14 in van de Geer (2000), presented below in Lemma 6, gives the desired exponential tail probability. [sent-253, score-0.172]
55 Lemma 4 gives an upper bound of the squared L2 (P)-metric of the excess loss in terms of the ‖ · ‖_{2,Q}-metric. [sent-261, score-0.158]
56 Definition of the loss gives ∆( f , f ∗ ) = ∑_{j=1}^m p_j [ ∑_{i≠j} ( ( f_i + 1/(m−1) )_+ − ( f_i∗ + 1/(m−1) )_+ ) ]² = ∑_{j=1}^m p_j [ ∑_{i∈I⁺(j)} ( f_i − f_i∗ ) + ∑_{i∈I⁻(j)} ( −1/(m−1) − f_i∗ ) ]² , where I⁺( j) = {i ≠ j : f_i ≥ −1/(m − 1), i = 1, …, m} [sent-270, score-0.375]
57 and I⁻( j) = {i ≠ j : f_i < −1/(m − 1), i = 1, …, m}. [sent-273, score-0.109]
58 Use the facts that (∑_{i=1}^n a_i)² ≤ n ∑_{i=1}^n a_i² for all n ∈ ℕ and a_i ∈ ℝ, and that max{|I⁺( j)|, |I⁻( j)|} ≤ m − 1, to obtain ∆( f , f ∗ ) ≤ (m − 1) ∑_{j=1}^m p_j [ ∑_{i∈I⁺(j)} ( f_i − f_i∗ )² + ∑_{i∈I⁻(j)} ( −1/(m−1) − f_i∗ )² ] . [sent-277, score-0.109]
59 Clearly, | −1/(m − 1) − f_i∗ | ≤ | f_i − f_i∗ | for all i ∈ I⁻( j). [sent-278, score-0.109]
60 Hence, ∆( f , f ∗ ) ≤ (m − 1) ∑_{j=1}^m p_j ( ∑_{i≠j} | f_i − f_i∗ |² ) = (m − 1) ∑_{j=1}^m (1 − p_j) | f_j − f_j∗ |² , where the last equality is obtained using ∑_{j=1}^m p_j = 1. [sent-279, score-0.109]
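The bound just derived, ∆(f, f∗) ≤ (m − 1) ∑_j (1 − p_j)|f_j − f_j∗|², can be sanity-checked numerically. The sketch below does so for random probability vectors p, Bayes-type targets f∗ (one coordinate equal to 1, the rest equal to −1/(m−1)) and arbitrary candidates f; it relies on the reconstruction of the garbled displays above and is only an illustrative check, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
c = 1.0 / (m - 1)

def delta(f, f_star, p):
    """Pointwise quantity sum_j p_j * ( sum_{i != j} [(f_i + c)_+ - (f*_i + c)_+] )^2."""
    plus_f, plus_fs = np.maximum(0.0, f + c), np.maximum(0.0, f_star + c)
    return sum(p[j] * np.sum(np.delete(plus_f - plus_fs, j)) ** 2 for j in range(m))

for _ in range(1000):
    p = rng.dirichlet(np.ones(m))
    k = rng.integers(m)
    f_star = np.full(m, -c); f_star[k] = 1.0       # Bayes-type target
    f = rng.normal(size=m)
    bound = (m - 1) * np.sum((1.0 - p) * (f - f_star) ** 2)
    assert delta(f, f_star, p) <= bound + 1e-12
print("Lemma 4 bound held on all random draws")
```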
61 The technical lemma below is an immediate consequence of Young’s inequality (see, for example, Hardy, Littlewood, and Pólya, 1988, Chapter 8. [sent-282, score-0.167]
62 For a probability measure Q, let H be a class of uniformly bounded functions h in L2 (Q), say sup_{h∈H} |h − h^o|_∞ < 1, where h^o is a fixed but arbitrary function in H . [sent-289, score-0.842]
63 Then for some positive constants c and n_o depending only on ρ and A_o , P( sup_{h∈H} |ν_n(h) − ν_n(h^o)| / ( ‖h − h^o‖ ∨ n^{−1/(2+2ρ)} )^{1−ρ} ≥ t ) ≤ c exp(−t/c²) , for all t > c and n > n_o . [sent-291, score-0.547]
64 For a probability measure Q on (Z , A ), let H be a class of uniformly bounded functions h in L2 (Q), say sup_{h∈H} |h − h^o|_∞ < 1, where h^o is a fixed but arbitrary element in H . [sent-293, score-0.842]
65 Then for some positive constants c and n_o depending on ρ and A_o , P( sup_{h∈H} |ν_n(h) − ν_n(h^o)| / ( ‖h − h^o‖ ∨ n^{−1/(2+2ρ)} )^{1−ρ} ≥ t ) ≤ c exp(−t/c²) , for all t > c and n > n_o . [sent-295, score-0.547]
66 To handle (12), we divide the class H into two disjoint classes where the empirical distance ‖h − h^o‖_n is smaller or larger than n^{−1/(2+2ρ)} . [sent-302, score-0.461]
67 1 in van de Geer (2000), stated below in Lemma 8, for some positive constant c_1 , P( sup_{h∈H_n} |ν_n^ε(h) − ν_n^ε(h^o)| / n^{−(1−ρ)/(2+2ρ)} ≥ √t/4 ) ≤ c_1 exp( − t n^{1/(1+ρ)} / (64 c_1²) ) . [sent-305, score-0.317]
68 We apply the peeling device on the set {h ∈ H : 2^{−j} ≤ ‖h − h^o‖_n ≤ 2^{−j+1}} , j = 1, …, J, [sent-307, score-0.505]
69 to obtain that, for all t > 1, P( sup_{h∈H_n^c} |ν_n^ε(h) − ν_n^ε(h^o)| / ‖h − h^o‖_n^{1−ρ} ≥ √t/4 | Z_1 , …, Z_n ) ≤ ∑_{j=1}^J P( sup_{h∈H : ‖h − h^o‖_n ≤ 2^{−j+1}} |ν_n^ε(h) − ν_n^ε(h^o)| ≥ (√t/4) 2^{−j(1−ρ)} | Z_1 , …, Z_n ) . [sent-310, score-0.611]
70 6 in van de Geer (2000), stated below in Lemma 9, where we take t such that (√t/4)^{1/(1−ρ)} ≥ 14u. [sent-318, score-0.172]
71 Let H(δ) := {h ∈ H : ‖h − h^o‖_{2,Q} ≤ δ} and δ̂_n := sup_{h∈H(δ)} ‖h − h^o‖_{2,Q_n} , where h^o is a fixed but arbitrary function in H and Q_n is the corresponding empirical distribution of Z based on {Z_i}_{i=1}^n . [sent-337, score-1.398]
72 For a ≥ 8C ( ∫_{a/(32√n)}^{δ̂_n} H^{1/2}(u, H , Q_n) du ∨ δ̂_n ) , where C is some positive constant, we have P( sup_{h∈H(δ)} |ν_n^ε(h) − ν_n^ε(h^o)| ≥ a | Z_1 , …, Z_n ) ≤ … [sent-338, score-0.095]
73 The following lemma is a modification of Lemma 5. [sent-342, score-0.123]
74 Then, for all n, P( sup_{h∈H} ‖h‖_{2,S_n} / ( ‖h‖_{2,S} ∨ n^{−1/(2+2ρ)} ) ≥ 14u ) ≤ 4 exp( −u² n^{ρ/(1+ρ)} ) , for all u ≥ 1. [sent-347, score-0.095]
75 We apply the randomization device in Pollard (1984, page 32), as follows. [sent-350, score-0.072]
76 Use a symmetrization lemma of Pollard (1984, Lemma II. [sent-373, score-0.177]
77 8), see Appendix, to obtain P( sup_{h∈H} ‖h‖_{2,S_n} / ( ‖h‖_{2,S} ∨ δ_n ) ≥ 14u ) ≤ 2 P( sup_{h∈H} | ‖h‖_{2,S_n} − ‖h‖_{2,S′_n} | / ( ‖h‖_{2,S} ∨ δ_n ) ≥ 12u ) . [sent-375, score-0.19]
78 h 2,S ∨ δn 2,Sn The peeling device on the set {h ∈ H : (2u) j−1 δn ≤ h 2,S ≤ (2u) j δn , j = 1, 2, . [sent-376, score-0.084]
79 } and the inequality in Pollard (1984, page 33) give sup | h − h 2,Sn | ≥ 12u Z1 , . [sent-379, score-0.17]
80 , Zn h 2,S ∨ δn 2,Sn h∈H ∞ ∑ ≤ h∈H j=1 h ∞ ≤ ∞ ≤ ∑ 2 exp j=1 | h sup 2,S ≤(2u) ∑ 2 exp j=1 jδ Sn − h Sn | ≥ 6(2u) j δn | Z1 , . [sent-382, score-0.195]
81 Definition (4) of the loss and the fact that ∑_{j=1}^m p_j = 1 give L( f ) = ∑_{j=1}^m p_j ( ∑_{k=1, k≠j}^m ( f_k + 1/(m−1) )_+ ) = ∑_{j=1}^m (1 − p_j)( f_j + 1/(m−1) )_+ . [sent-397, score-0.101]
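The two expressions for the pointwise risk L(f) in the display above can be checked against each other numerically; the short sketch below does this at a fixed x for a probability vector p. It is an illustrative check written for this summary, not code from the paper.

```python
import numpy as np

def pointwise_risk_two_ways(f, p):
    """L(f) at a fixed x, computed both ways:
    sum_j p_j * sum_{k != j} (f_k + 1/(m-1))_+  and  sum_j (1 - p_j) * (f_j + 1/(m-1))_+ ."""
    f, p = np.asarray(f, dtype=float), np.asarray(p, dtype=float)
    m = len(f)
    plus = np.maximum(0.0, f + 1.0 / (m - 1))
    lhs = sum(p[j] * (plus.sum() - plus[j]) for j in range(m))
    rhs = float(np.sum((1.0 - p) * plus))
    return lhs, rhs

print(pointwise_risk_two_ways([1.0, -0.5, -0.5], [0.5, 0.3, 0.2]))   # both equal 0.75
```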
82 Write ∆( f ) := L( f ) − L( f ∗ ) = ∑_{j≠k} (1 − p_j)( f_j + 1/(m−1) )_+ + (1 − p_k)( f_k + 1/(m−1) )_+ − (1 − p_k)( 1 + 1/(m−1) ). [sent-410, score-0.425]
83 Here, ∆( f ) = (1 − p_k)( f_k − 1 ) + ∑_{j≠k} (1 − p_j)( f_j + 1/(m−1) )_+ . [sent-412, score-0.239]
84 Divide the sum into the sets J⁺(k) and J⁻(k) to obtain ∆( f ) = ∑_{j∈J⁺(k)} (p_k − p_j)( f_j + 1/(m−1) ) + (1 − p_k) ∑_{j∈J⁻(k)} | f_j + 1/(m−1) | . [sent-414, score-0.186]
85 In both cases, L( f ) − L( f ∗ ) is clearly non-negative, since p_k − p_j is non-negative for all j ≠ k. [sent-416, score-0.186]
86 … ≥ (τ/2) ∑_{j=1}^m | f_j − f_j∗ | , where the second inequality is obtained from the fact that | f_k − f_k∗ | ≤ ∑_{j≠k} | f_j − f_j∗ |. [sent-433, score-0.097]
87 That is, the excess risk is lower bounded by (1/2) ∫ ∑_{j=1}^m τ | f_j − f_j∗ | dQ . [sent-434, score-0.189]
88 Then P( sup_t |Z(t)| > ε ) ≤ β^{−1} P( sup_t |Z(t) − Z′(t)| > ε − α ) . [sent-444, score-0.19]
89 Classification with a reject option using a hinge loss. [sent-448, score-0.131]
90 On the learnability and design of output codes for multiclass problems. [sent-462, score-0.065]
91 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-466, score-0.065]
92 Characterizing the solution path of multicategory support vector machines. [sent-497, score-0.219]
93 Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. [sent-500, score-0.102]
94 Support vector machines and the Bayes rule in classification. [sent-503, score-0.064]
95 Classifiers of support vector machine type with l1 complexity regularization. [sent-531, score-0.063]
96 On l1 -norm multiclass support vector machines: Methodology and theory. [sent-558, score-0.092]
97 Statistical behaviour and consistency of classification methods based on convex risk minimization. [sent-564, score-0.125]
98 Statistical analysis of some multi-category large margin classification methods. [sent-568, score-0.075]
99 Statistical analysis of some multi-category large margin classification methods. [sent-571, score-0.075]
100 The margin vector, admissible loss and multi-class margin-based classifiers. [sent-574, score-0.123]
wordName wordTfidf (topN-words)
[('ho', 0.421), ('tarigan', 0.235), ('fo', 0.227), ('bracketing', 0.213), ('zn', 0.211), ('geer', 0.21), ('hb', 0.193), ('multicategory', 0.192), ('pk', 0.186), ('van', 0.172), ('eer', 0.149), ('oment', 0.149), ('ao', 0.148), ('dq', 0.148), ('hinge', 0.131), ('ound', 0.127), ('ulti', 0.127), ('lemma', 0.123), ('qn', 0.117), ('excess', 0.11), ('fi', 0.109), ('sara', 0.107), ('lassifiers', 0.104), ('sup', 0.095), ('category', 0.087), ('sn', 0.081), ('condition', 0.081), ('risk', 0.079), ('margin', 0.075), ('cz', 0.072), ('gl', 0.072), ('multiclass', 0.065), ('bayes', 0.064), ('bernadetta', 0.064), ('gu', 0.059), ('pollard', 0.055), ('yoonkyung', 0.054), ('departement', 0.054), ('jon', 0.054), ('suph', 0.054), ('symmetrization', 0.054), ('fk', 0.053), ('fm', 0.053), ('classi', 0.051), ('svm', 0.05), ('lee', 0.05), ('exp', 0.05), ('aa', 0.049), ('tewari', 0.049), ('loss', 0.048), ('behaviour', 0.046), ('zi', 0.045), ('bartlett', 0.044), ('moment', 0.044), ('inequality', 0.044), ('categories', 0.043), ('barrio', 0.043), ('hardy', 0.043), ('littlewood', 0.043), ('multihinge', 0.043), ('peeling', 0.043), ('radiance', 0.043), ('crammer', 0.042), ('device', 0.041), ('empirical', 0.04), ('binary', 0.04), ('entropy', 0.04), ('vote', 0.039), ('write', 0.038), ('nb', 0.037), ('complexity', 0.036), ('auer', 0.036), ('ethz', 0.036), ('koby', 0.036), ('meir', 0.036), ('tong', 0.035), ('arg', 0.034), ('ers', 0.033), ('alexandre', 0.032), ('mammen', 0.032), ('satellite', 0.032), ('song', 0.032), ('tsybakov', 0.032), ('wellner', 0.032), ('weston', 0.032), ('er', 0.032), ('page', 0.031), ('constants', 0.031), ('lin', 0.031), ('del', 0.03), ('duan', 0.03), ('zurich', 0.03), ('convergence', 0.028), ('boucheron', 0.028), ('lectures', 0.028), ('vaart', 0.028), ('yoram', 0.028), ('peter', 0.028), ('constraint', 0.027), ('support', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that the hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
2 0.15329947 7 jmlr-2008-A Tutorial on Conformal Prediction
Author: Glenn Shafer, Vladimir Vovk
Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions
3 0.12522101 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
4 0.10889544 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
5 0.086990803 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
6 0.06677904 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
7 0.063071348 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
8 0.062582016 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
9 0.061697669 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.055150997 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
11 0.050147001 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
12 0.049157292 9 jmlr-2008-Active Learning by Spherical Subdivision
13 0.043433819 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
14 0.041071679 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
15 0.040103577 83 jmlr-2008-Robust Submodular Observation Selection
16 0.03973911 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
17 0.036546696 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
18 0.036348544 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
19 0.035626374 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
20 0.035158549 58 jmlr-2008-Max-margin Classification of Data with Absent Features
topicId topicWeight
[(0, 0.202), (1, -0.1), (2, -0.019), (3, -0.114), (4, 0.095), (5, -0.046), (6, 0.143), (7, -0.188), (8, 0.137), (9, -0.208), (10, 0.2), (11, -0.124), (12, -0.156), (13, -0.061), (14, 0.219), (15, -0.062), (16, -0.059), (17, 0.055), (18, 0.235), (19, -0.061), (20, -0.06), (21, 0.091), (22, 0.071), (23, 0.011), (24, -0.026), (25, -0.102), (26, -0.129), (27, -0.091), (28, 0.048), (29, 0.017), (30, 0.049), (31, 0.023), (32, -0.031), (33, -0.063), (34, -0.06), (35, -0.195), (36, -0.068), (37, -0.051), (38, 0.024), (39, 0.003), (40, -0.058), (41, -0.066), (42, 0.116), (43, 0.007), (44, 0.035), (45, -0.004), (46, 0.081), (47, -0.061), (48, 0.098), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.94073594 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that the hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
2 0.6823296 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
3 0.64497709 7 jmlr-2008-A Tutorial on Conformal Prediction
Author: Glenn Shafer, Vladimir Vovk
Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions
4 0.38750932 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
5 0.30370384 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
6 0.30274767 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.29385102 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
8 0.27699164 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
9 0.27244478 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
10 0.27068588 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
11 0.23948415 9 jmlr-2008-Active Learning by Spherical Subdivision
12 0.20729484 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
13 0.19540808 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
14 0.17993751 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
15 0.17426766 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
16 0.17003925 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.16541786 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
18 0.16167261 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
19 0.16117479 26 jmlr-2008-Consistency of Trace Norm Minimization
20 0.16058424 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
topicId topicWeight
[(0, 0.057), (39, 0.377), (40, 0.022), (54, 0.038), (58, 0.056), (66, 0.058), (76, 0.013), (78, 0.015), (88, 0.112), (92, 0.073), (94, 0.036), (99, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.7377488 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that the hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the so-called multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
2 0.38430429 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
Author: Peter L. Bartlett, Marten H. Wegkamp
Abstract: We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss—the φ-risk—also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. Keywords: Bayes classifiers, classification, convex surrogate loss, empirical risk minimization, hinge loss, large margin classifiers, margin condition, reject option, support vector machines
3 0.38155636 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
4 0.37482905 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
5 0.37152594 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
6 0.37092125 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
7 0.36701947 96 jmlr-2008-Visualizing Data using t-SNE
8 0.36676899 9 jmlr-2008-Active Learning by Spherical Subdivision
9 0.36647868 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
10 0.36424276 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
11 0.35954228 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
12 0.35922906 56 jmlr-2008-Magic Moments for Structured Output Prediction
13 0.35710084 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
14 0.35540697 86 jmlr-2008-SimpleMKL
15 0.35396197 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
17 0.35321218 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
18 0.35193324 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
19 0.35172972 83 jmlr-2008-Robust Submodular Observation Selection
20 0.351385 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks