jmlr jmlr2009 jmlr2009-82 knowledge-graph by maker-knowledge-mining

82 jmlr-2009-Robustness and Regularization of Support Vector Machines


Source: pdf

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. [sent-8, score-0.221]

2 On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. [sent-10, score-0.53]

3 We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. [sent-11, score-0.592]

4 A (potentially adversarial) disturbance is then added to the samples we observe. [sent-32, score-0.265]

5 Moreover, there has not been an explicit connection to the regularized classifier, although at a high-level it is known that regularization and robust optimization are related (e. [sent-40, score-0.25]

6 The main contribution in this paper is solving the robust classification problem for a class of non-box-typed uncertainty sets, and providing a linkage between robust classification and the standard regularization scheme of SVMs. [sent-43, score-0.365]

7 • We show that the standard regularized SVM classifier is a special case of our robust classification, thus explicitly relating robustness and regularization. [sent-46, score-0.412]

8 We consider a chance-constrained classifier (that is, a classifier with probabilistic constraints on misclassification) and show that our robust formulation can approximate it far less conservatively than previous robust formulations could possibly do. [sent-49, score-0.229]

9 • We show that the robustness perspective, stemming from a non-i. [sent-51, score-0.246]

10 This result implies that generalization ability is a direct result of robustness to local disturbances; it therefore suggests a new justification for good performance, and consequently allows us to construct learning algorithms that generalize well by robustifying non-consistent algorithms. [sent-58, score-0.265]

11 1 Robustness and Regularization We comment here on the explicit equivalence of robustness and regularization. [sent-60, score-0.28]

12 The objective of these works was not to relate robustness and the standard regularization term that appears in the objective function. [sent-70, score-0.349]

13 Thus, although certain equivalence relationships between robustness and regularization have been established for problems other than classification (El Ghaoui and Lebret, 1997; Ben-Tal and Nemirovski, 1999; Bishop, 1995; Xu et al. [sent-77, score-0.364]

14 , 2009), the explicit equivalence between robustness and regularization in the SVM setup is novel. [sent-78, score-0.4]

15 The connection of robustness and regularization in the SVM context is important for the following reasons. [sent-79, score-0.33]

16 The robust view of regularization regards the testing samples as a perturbed copy of the training samples. [sent-82, score-0.328]

17 Since both noise and robustness are physical processes, a close investigation of the application and noise characteristics at hand can provide insights into how to properly robustify, and therefore regularize, the classifier. [sent-93, score-0.286]

18 From the robustness perspective, this has the interpretation that the noise is anisotropic (ellipsoidal) rather than spherical, and hence an appropriate robustification must be designed to fit this anisotropy. [sent-95, score-0.266]

19 This is helpful when the training samples and the testing samples are 1. [sent-99, score-0.198]

20 We need to point out that there are several different definitions of robustness in the literature. [sent-106, score-0.246]

21 In this paper, as well as the aforementioned robust classification papers, robustness is mainly understood from a Robust Optimization (RO) perspective, where a min-max optimization is performed over all possible disturbances. [sent-107, score-0.349]

22 An alternative interpretation of robustness stems from the rich literature on robust statistics (e. [sent-108, score-0.349]

23 In the machine learning literature, another widely used notion closely related to robustness is the stability, where an algorithm is required to be robust (in the sense that the output function does not change significantly) under a specific perturbation: deleting one sample from the training set. [sent-117, score-0.406]

24 One main difference between RO and other robustness notions is that the former is constructive rather than analytical. [sent-121, score-0.246]

25 That is, in contrast to robust statistics or the stability approach that measures the robustness of a given algorithm, RO can robustify an algorithm: it converts a given algorithm to a robust one. [sent-122, score-0.502]

26 In Section 2 we investigate the correlated disturbance case, and show the equivalence between the robust classification and the regularization process. [sent-127, score-0.429]

27 draws and provide a novel proof of consistency of SVM based on robustness analysis. [sent-133, score-0.283]

28 We use δ to denote disturbance affecting the samples. [sent-139, score-0.208]

29 min : r(w, b) + Σ_{i=1}^m ξ_i, s.t. ξ_i ≥ 1 − y_i(⟨w, x_i⟩ + b), ξ_i ≥ 0, where r(w, b) is a regularization term. [sent-147, score-0.216]

30 This is equivalent to min_{w,b} : r(w, b) + Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0). [sent-148, score-0.206]
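
As a concrete illustration (not taken from the paper), the unconstrained hinge-loss form above is easy to evaluate directly. The sketch below assumes, purely for illustration, a ridge-style regularizer r(w, b) = λ‖w‖² and hypothetical toy data:

```python
import numpy as np

def hinge_objective(w, b, X, y, lam=0.1):
    """Regularized empirical hinge loss:
       r(w, b) + sum_i max(1 - y_i(<w, x_i> + b), 0),
       with r(w, b) = lam * ||w||^2 chosen here only as an example."""
    margins = y * (X @ w + b)
    return lam * np.dot(w, w) + np.maximum(1.0 - margins, 0.0).sum()

# Hypothetical toy data: four points in R^2 with labels +/-1.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(hinge_objective(np.array([0.5, 0.5]), 0.0, X, y))
```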

31 ..., δ_m) and essentially solves the following min-max problem: min_{w,b} max_{δ∈N_box} [ r(w, b) + Σ_{i=1}^m max(1 − y_i(⟨w, x_i − δ_i⟩ + b), 0) ], (1) for a box-type uncertainty set N_box. [sent-154, score-0.337]

32 The goal of this paper is to obtain a robust formulation where the disturbances {δ_i} may be meaningfully taken to be correlated, that is, to solve for a non-box-type N: min_{w,b} max_{δ∈N} [ r(w, b) + Σ_{i=1}^m max(1 − y_i(⟨w, x_i − δ_i⟩ + b), 0) ]. [sent-157, score-0.405]

33 Adversaries (e.g., spam senders) will manipulate the testing samples to avoid being correctly classified, and the robustness toward such manipulation should be taken into consideration in the training process. [sent-166, score-0.387]

34 Or alternatively, the training samples and the testing samples can be obtained from different processes and hence the standard i. [sent-169, score-0.198]

35 We define explicitly the correlated disturbance (or uncertainty) which we study below. [sent-181, score-0.208]

36 Definition 1 A set N_0 ⊆ R^n is called an Atomic Uncertainty Set if (I) 0 ∈ N_0; (II) For any w_0 ∈ R^n: sup_{δ∈N_0}[w_0^⊤ δ] = sup_{δ'∈N_0}[−w_0^⊤ δ'] < +∞. [sent-182, score-0.248]

37 In particular, all norm balls and ellipsoids centered at the origin are atomic uncertainty sets, while an arbitrary polytope might not be an atomic uncertainty set. [sent-185, score-0.293]
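
As a quick sanity check (our own sketch, not from the paper), take the ℓ2 ball N_0 = {δ : ‖δ‖_2 ≤ c}: property (I) holds trivially, and for property (II) the supremum has the closed form sup_{δ∈N_0} w_0^⊤δ = c‖w_0‖_2 = sup_{δ'∈N_0}(−w_0^⊤δ'), which the Monte Carlo sketch below confirms approximately (the vector w_0 and radius c are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
c, w0 = 2.0, np.array([1.0, -3.0, 0.5])

# Random points on the boundary of the l2 ball of radius c
# (for a linear objective the supremum is attained on the boundary).
D = rng.normal(size=(100_000, 3))
D = c * D / np.linalg.norm(D, axis=1, keepdims=True)

print(np.max(D @ w0))          # approximately  c * ||w0||_2
print(np.max(-(D @ w0)))       # approximately the same, by symmetry of the ball
print(c * np.linalg.norm(w0))  # exact value of the supremum
```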

38 The following theorem is the main result of this section, which reveals that the standard norm-regularized SVM is the solution of a (non-regularized) robust optimization. [sent-193, score-0.204]

39 Then the following two optimization problems on (w, b) are equivalent: min_{w,b} : max_{(δ_1,···,δ_m)∈T} Σ_{i=1}^m max(1 − y_i(⟨w, x_i − δ_i⟩ + b), 0), and min_{w,b} : c‖w‖ + Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0). (3) [sent-202, score-0.468]

40 Proposition 4 Assume {x_i, y_i}_{i=1}^m are non-separable, r(·) : R^{n+1} → R is an arbitrary function, and N is a Sublinear Aggregated Uncertainty set with corresponding atomic uncertainty set N_0. [sent-203, score-0.221]
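
Since Theorem 3 says the robust and regularized problems share the same solutions, one can in practice solve the (simpler) regularized form and obtain the robust classifier for free. The subgradient-descent sketch below is our own illustration of that regularized form; the step size, iteration count, and data are hypothetical choices, not values from the paper:

```python
import numpy as np

def solve_regularized_svm(X, y, c=0.1, steps=2000, lr=0.01):
    """Subgradient descent on  c * ||w||_2 + sum_i max(1 - y_i(<w, x_i> + b), 0).
    By the robustness/regularization equivalence, a minimizer also solves the
    corresponding min-max robust classification problem."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        viol = margins < 1.0                         # samples with positive hinge loss
        gw = -(y[viol, None] * X[viol]).sum(axis=0)  # subgradient of the hinge sum
        gb = -y[viol].sum()
        wn = np.linalg.norm(w)
        if wn > 0:
            gw = gw + c * w / wn                     # subgradient of c * ||w||_2
        w, b = w - lr * gw, b - lr * gb
    return w, b

# Hypothetical toy data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1.0, (20, 2)), rng.normal(-1.5, 1.0, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
w, b = solve_regularized_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```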

41 Then the following min-max problem min_{w,b} sup_{(δ_1,···,δ_m)∈N} [ r(w, b) + Σ_{i=1}^m max(1 − y_i(⟨w, x_i − δ_i⟩ + b), 0) ] (4) is equivalent to the following optimization problem on (w, b, ξ): min : r(w, b) + sup_{δ∈N_0}(w^⊤ δ) + Σ_{i=1}^m ξ_i, s.t. ξ_i ≥ 1 − y_i(⟨w, x_i⟩ + b), ξ_i ≥ 0. [sent-204, score-0.472]

42 Proof Define: v(w, b) ≜ sup_{δ∈N_0}(w^⊤ δ) + Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0). [sent-214, score-0.312]

43 Hence, fixing any (ŵ, b̂) ∈ R^{n+1}, the following inequalities hold: sup_{(δ_1,···,δ_m)∈N⁻} Σ_{i=1}^m max(1 − y_i(⟨ŵ, x_i − δ_i⟩ + b̂), 0) ≤ sup_{(δ_1,···,δ_m)∈N} Σ_{i=1}^m max(1 − y_i(⟨ŵ, x_i − δ_i⟩ + b̂), 0) ≤ sup_{(δ_1,···,δ_m)∈N⁺} Σ_{i=1}^m max(1 − y_i(⟨ŵ, x_i − δ_i⟩ + b̂), 0). [sent-218, score-0.936]

44 Step 1: We prove that v(ŵ, b̂) ≤ sup_{(δ_1,···,δ_m)∈N⁻} Σ_{i=1}^m max(1 − y_i(⟨ŵ, x_i − δ_i⟩ + b̂), 0). [sent-220, score-0.312]

45 Since N_t ⊆ N ... Step 2: Next we prove that sup_{(δ_1,···,δ_m)∈N⁺} Σ_{i=1}^m max(1 − y_i(⟨ŵ, x_i − δ_i⟩ + b̂), 0) ≤ v(ŵ, b̂). [sent-225, score-0.312]

46 Now, for any i ∈ [1 : m], the following holds: sup_{δ̂_i∈N_0} max(1 − y_i(⟨ŵ, x_i − α_i δ̂_i⟩ + b̂), 0) = max(1 − y_i(⟨ŵ, x_i⟩ + b̂) + α_i sup_{δ̂_i∈N_0}(ŵ^⊤ δ̂_i), 0) ≤ max(1 − y_i(⟨ŵ, x_i⟩ + b̂), 0) + α_i sup_{δ̂_i∈N_0}(ŵ^⊤ δ̂_i). [sent-227, score-0.936]

47 Therefore, Equation (9) is upper bounded by Σ_{i=1}^m max(1 − y_i(⟨ŵ, x_i⟩ + b̂), 0) + sup_{Σ_{i=1}^m α_i = 1; α_i ≥ 0} Σ_{i=1}^m α_i sup_{δ̂_i∈N_0}(ŵ^⊤ δ̂_i) = sup_{δ∈N_0}(ŵ^⊤ δ) + Σ_{i=1}^m max(1 − y_i(⟨ŵ, x_i⟩ + b̂), 0) = v(ŵ, b̂), hence Inequality (8) holds. [sent-228, score-0.624]

48 Step 3: Combining the two steps and adding r(w, b) on both sides leads to: ∀(w, b) ∈ R^{n+1}, sup_{(δ_1,···,δ_m)∈N} Σ_{i=1}^m max(1 − y_i(⟨w, x_i − δ_i⟩ + b), 0) + r(w, b) = v(w, b) + r(w, b). [sent-229, score-0.312]

49 The above equivalence implies that standard regularization essentially assumes that the disturbance is spherical; if this is not true, robustness may yield a better regularization-like algorithm. [sent-237, score-0.572]
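
To make this concrete (our own illustration, not an example from the paper): if the disturbance is believed to lie in an ellipsoid N_0 = {δ : δ^⊤Σ^{-1}δ ≤ c²}, the term sup_{δ∈N_0}(w^⊤δ) in Proposition 4 evaluates to c·√(w^⊤Σw), so anisotropic noise induces a Mahalanobis-type regularizer in place of the usual c‖w‖_2:

```python
import numpy as np

def ellipsoidal_robust_objective(w, b, X, y, Sigma, c=0.1):
    """Robustified hinge loss for the ellipsoidal atomic uncertainty set
    N0 = {delta : delta^T Sigma^{-1} delta <= c^2}.  Here
    sup_{delta in N0} w^T delta = c * sqrt(w^T Sigma w), so the induced
    regularizer is a weighted norm rather than c * ||w||_2."""
    margins = y * (X @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0).sum()
    return c * np.sqrt(w @ Sigma @ w) + hinge

# Hypothetical noise covariance: disturbances mostly along the first coordinate.
Sigma = np.diag([4.0, 0.25])
X = np.array([[1.0, 0.5], [-1.0, -0.5]])
y = np.array([1.0, -1.0])
print(ellipsoidal_robust_objective(np.array([0.3, 0.7]), 0.0, X, y, Sigma))
```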

50 Suppose the disturbance (δ_1^r, ···, δ_m^r) follows a joint probability measure µ. [sent-254, score-0.208]

51 Then for any (w, b), with probability no less than 1 − η, the following holds: Σ_{i=1}^m max(1 − y_i(⟨w, x_i − δ_i^r⟩ + b), 0) ≤ max_{Σ_i ‖δ_i‖^* ≤ c*} Σ_{i=1}^m max(1 − y_i(⟨w, x_i − δ_i⟩ + b), 0). [sent-269, score-0.53]

52 This gives an additional probabilistic robustness property of the standard regularized classifier. [sent-271, score-0.309]

53 Suppose the total disturbance c^r ≜ Σ_{i=1}^m ‖δ_i^r‖^* follows a prior distribution ρ(·). [sent-274, score-0.208]

54 This can model for example the case that the training sample set is a mixture of several data sets where the disturbance magnitude of each set is known. [sent-275, score-0.265]

55 Such a setup leads to the following classifier, which minimizes the Bayesian (robust) error: min_{w,b} : ∫ max_{Σ_i ‖δ_i‖^* ≤ c} Σ_{i=1}^m max(1 − y_i(⟨w, x_i − δ_i⟩ + b), 0) dρ(c). (11) [sent-276, score-0.298]

56 By Theorem 3, the Bayes classifier (11) is equivalent to min_{w,b} : ∫ [ c‖w‖ + Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0) ] dρ(c), which can be further simplified as min_{w,b} : c̄‖w‖ + Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0), where c̄ ≜ ∫ c dρ(c). [sent-277, score-0.412]
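
As a small worked example of this simplification (with a hypothetical prior, not one from the paper): if ρ places probability 1/2 on c = 0.1 and 1/2 on c = 0.3, then c̄ = ∫ c dρ(c) = 0.2, and the Bayes-robust classifier above is exactly the standard regularized SVM with regularization coefficient 0.2.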

57 In particular, similar to the linear classification case, we give a new interpretation of the standard kernelized SVM as the min-max empirical hinge-loss solution, where the disturbance is assumed to lie in the feature space. [sent-282, score-0.279]

58 We then relate this to the (more intuitively appealing) setup where the disturbance lies in the sample space. [sent-283, score-0.286]

59 Theorem 5 Assume {Φ(x_i), y_i}_{i=1}^m are not linearly separable, r(·) : H × R → R is an arbitrary function, and N ⊆ H^m is a Sublinear Aggregated Uncertainty set with corresponding atomic uncertainty set N_0 ⊆ H. [sent-294, score-0.221]

60 Then the following min-max problem min_{w,b} sup_{(δ_1,···,δ_m)∈N} [ r(w, b) + Σ_{i=1}^m max(1 − y_i(⟨w, Φ(x_i) − δ_i⟩ + b), 0) ] is equivalent to min : r(w, b) + sup_{δ∈N_0}(⟨w, δ⟩) + Σ_{i=1}^m ξ_i, s.t. ξ_i ≥ 1 − y_i(⟨w, Φ(x_i)⟩ + b), ξ_i ≥ 0. [sent-295, score-0.424]

61 If {Φ(x_i), y_i}_{i=1}^m are non-separable, then the following two optimization problems on (w, b) are equivalent: min_{w,b} : max_{(δ_1,···,δ_m)∈T_H} Σ_{i=1}^m max(1 − y_i(⟨w, Φ(x_i) − δ_i⟩ + b), 0), and min_{w,b} : c‖w‖_H + Σ_{i=1}^m max(1 − y_i(⟨w, Φ(x_i)⟩ + b), 0). [sent-307, score-0.456]

62 Therefore, Corollary 6 essentially means that the standard kernelized SVM is implicitly a robust classifier (without regularization) with disturbance in the feature space, where the sum of the magnitudes of the disturbances is bounded. [sent-309, score-0.571]
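
A minimal sketch of the objective in Corollary 6, assuming the usual representer form w = Σ_j α_j Φ(x_j), so that ⟨w, Φ(x_i)⟩ = (Kα)_i and ‖w‖_H = √(α^⊤Kα); the RBF kernel, bandwidth, coefficients, and data below are hypothetical illustrations rather than anything prescribed by the paper:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_regularized_objective(alpha, b, K, y, c=0.1):
    """c * ||w||_H + sum_i max(1 - y_i(<w, Phi(x_i)> + b), 0)
    with w = sum_j alpha_j Phi(x_j):
    <w, Phi(x_i)> = (K alpha)_i  and  ||w||_H = sqrt(alpha^T K alpha)."""
    f = K @ alpha + b
    hinge = np.maximum(1.0 - y * f, 0.0).sum()
    return c * np.sqrt(alpha @ K @ alpha) + hinge

# Hypothetical toy data and coefficients.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = rbf_kernel(X, X)
alpha = np.array([0.5, 0.5, -0.5, -0.5])
print(kernel_regularized_objective(alpha, 0.0, K, y))
```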

63 Disturbance in the feature-space is less intuitive than disturbance in the sample space, and the next lemma relates these two different notions. [sent-310, score-0.231]

64 In the appendix, we prove a result that provides a tighter relationship between disturbance in the feature space and disturbance in the sample space, for RBF kernels. [sent-312, score-0.458]

65 Lemma 7 essentially says that under certain conditions, robustness in the feature space is a stronger requirement than robustness in the sample space. [sent-314, score-0.534]

66 Therefore, a classifier that achieves robustness in the feature space (the SVM for example) also achieves robustness in the sample space. [sent-315, score-0.534]

67 In the next section we consider a more foundational property of robustness in the sample space: we show that a classifier that is robust in the sample space is asymptotically consistent. [sent-318, score-0.395]

68 Consistency of Regularization In this section we explore a fundamental connection between learning and robustness, by using robustness properties to re-prove the statistical consistency of the linear classifier, and then the kernelized SVM. [sent-321, score-0.335]

69 Indeed, our proof mirrors the consistency proof found in Steinwart (2005), with the key difference that we replace the metric entropy, VC-dimension, and stability conditions used there with a robustness condition. [sent-322, score-0.314]

70 We now turn to the standard statistical learning setup, by assuming that all training samples and testing samples are generated i.i.d. [sent-324, score-0.198]

71 The next theorem shows that our robust classifier setup, and equivalently the regularized SVM, asymptotically minimizes an upper bound on the expected classification error and hinge loss. [sent-332, score-0.261]

72 the following bounds on the Bayes loss and the hinge loss hold uniformly for all (w, b): E_{(x,y)∼P}(1_{y ≠ sgn(⟨w, x⟩ + b)}) ≤ γ_{m,c} + c‖w‖_2 + (1/m) Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0); E_{(x,y)∼P} max(1 − y(⟨w, x⟩ + b), 0) ≤ γ_{m,c}(1 + K‖w‖_2 + |b|) + c‖w‖_2 + (1/m) Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0). [sent-336, score-0.416]

73 For testing samples that have “small” perturbations, c‖w‖_2 + (1/m) Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0) upper-bounds their total loss by Theorem 3. [sent-339, score-0.295]

74 We say a set of training samples and a set of testing samples form l pairings if there exist l sample pairs with no data reused. [sent-343, score-0.221]

75 Therefore, the average testing error is upper bounded by (1 − M_{m,c}/m) + c‖w‖_2 + (1/m) Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0), and the average hinge loss is upper bounded by (1 − M_{m,c}/m)(1 + K‖w‖_2 + |b|) + c‖w‖_2 + (1/m) Σ_{i=1}^m max(1 − y_i(⟨w, x_i⟩ + b), 0). [sent-366, score-0.466]

76 Notice that M_m ≥ m − #(unpaired samples in X′) − max{ #(training samples not in X′), #(testing samples not in X′) }. [sent-389, score-0.227]

77 the following bounds on the Bayes loss and the hinge loss hold uniformly for all (w, b) ∈ H × R: E_P(1_{y ≠ sgn(⟨w, Φ(x)⟩ + b)}) ≤ γ_{m,c} + c‖w‖_H + (1/m) Σ_{i=1}^m max(1 − y_i(⟨w, Φ(x_i)⟩ + b), 0); E_{(x,y)∼P} max(1 − y(⟨w, Φ(x)⟩ + b), 0) ≤ γ_{m,c}(1 + K‖w‖_H + |b|) + c‖w‖_H + (1/m) Σ_{i=1}^m max(1 − y_i(⟨w, Φ(x_i)⟩ + b), 0). [sent-401, score-0.32]

78 Now notice that by Lemma 7, if a testing sample x and a training sample x′ belong to a “brick” with length of each side min(ρ/√n, f⁻¹(c²)/√n) in the sample space (see the proof of Lemma 9), then ‖Φ(x) − Φ(x′)‖_H ≤ c. [sent-408, score-0.199]

79 Further notice that for M paired samples, the total testing error and hinge loss are both upper-bounded by cM‖w‖_H + Σ_{i=1}^M max(1 − y_i(⟨w, Φ(x_i)⟩ + b), 0). [sent-412, score-0.236]

80 This condition requires that the feature mapping is “smooth” and hence preserves “locality” of the disturbance; that is, a small disturbance in the sample space guarantees that the corresponding disturbance in the feature space is also small. [sent-416, score-0.477]

81 A standard RKHS regularized SVM using this kernel leads to a decision function sign(Σ_{i=1}^m α_i k(x, x_i) + b), which equals sign(b) and provides no meaningful prediction if the testing sample x is not one of the training samples. [sent-419, score-0.241]
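
The following small demonstration (our own, with hypothetical coefficients) makes the failure mode explicit: with the kernel k(x, x') = 1{x = x'}, the decision value at any point not seen in training collapses to b, so the prediction sign(b) carries no information about the test point:

```python
import numpy as np

def delta_kernel(x, xp):
    """k(x, x') = 1 if x == x', 0 otherwise -- a valid but "non-smooth" kernel."""
    return 1.0 if np.array_equal(x, xp) else 0.0

def decision(x, X_train, alpha, b):
    """sign( sum_i alpha_i k(x, x_i) + b )."""
    s = sum(a * delta_kernel(x, xi) for a, xi in zip(alpha, X_train)) + b
    return np.sign(s)

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
alpha, b = np.array([1.0, -1.0]), 0.2
x_new = np.array([0.5, 0.5])               # not a training point
print(decision(x_new, X_train, alpha, b))  # equals sign(b), regardless of x_new
```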

82 The key point of interest in our proof is the use of a robustness condition in place of a VC-dimension or stability condition used in Steinwart (2005). [sent-423, score-0.277]

83 Before concluding this section, we remark that although we focus in this paper on the hinge-loss function and the RKHS-norm regularizer, the robustness approach to establishing consistency can be generalized to other regularization schemes and loss functions. [sent-472, score-0.386]

84 This suggests using the robustness view to derive sharp sample complexity bounds for a broad class of algorithms (e. [sent-475, score-0.269]

85 In particular, we prove that the standard norm-regularized SVM classifier is in fact the solution to a robust classification setup, and thus known results about regularized classifiers extend to robust classifiers. [sent-480, score-0.269]

86 To the best of our knowledge, this is the first explicit such link between regularization and robustness in pattern classification. [sent-481, score-0.33]

87 The interpretation of this link is that norm-based regularization essentially builds in a robustness to sample noise whose probability level sets are symmetric unit balls with respect to the dual of the regularizing norm. [sent-482, score-0.373]

88 It would be interesting to understand the performance gains possible when the noise does not have such characteristics, and the robust setup is used in place of regularization with appropriately defined uncertainty set. [sent-483, score-0.318]

89 Based on the robustness interpretation of the regularization term, we re-proved the consistency of SVMs without direct appeal to notions of metric entropy, VC-dimension, or stability. [sent-484, score-0.367]

90 Our proof suggests that the ability to handle disturbance is crucial for an algorithm to achieve good generalization ability. [sent-485, score-0.208]

91 In particular, for “smooth” feature mappings, the robustness to disturbance in the observation space is guaranteed and hence SVMs achieve consistency. [sent-486, score-0.473]

92 On the other hand, certain “non-smooth” feature mappings fail to be consistent simply because for such kernels the robustness in the feature space (guaranteed by the regularization process) does not imply robustness in the observation space. [sent-487, score-0.595]

93 In this appendix we show that for RBF kernels, it is possible to relate robustness in the feature space and robustness in the sample space more directly. [sent-492, score-0.553]

94 Then we have, for any x ∈ R^n, w ∈ H and c > 0: sup_{‖δ‖≤c} ⟨w, Φ(x − δ)⟩ = sup_{‖δ_φ‖_H ≤ √(2f(0)−2f(c))} ⟨w, Φ(x) + δ_φ⟩. [sent-495, score-0.248]
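
The quantity √(2f(0) − 2f(c)) appears because, for an RBF kernel k(x, x') = f(‖x − x'‖), the feature-space distance satisfies ‖Φ(x) − Φ(x')‖²_H = k(x,x) + k(x',x') − 2k(x,x') = 2f(0) − 2f(‖x − x'‖). A small numerical check of this identity (our own sketch; the Gaussian profile f and the specific points are assumed for illustration):

```python
import numpy as np

def f(t, gamma=0.5):
    """Gaussian RBF profile: k(x, x') = f(||x - x'||) with f(t) = exp(-gamma * t^2)."""
    return np.exp(-gamma * t ** 2)

def feature_distance(x, xp, gamma=0.5):
    """||Phi(x) - Phi(x')||_H computed purely from kernel evaluations:
       ||Phi(x) - Phi(x')||^2 = 2 f(0) - 2 f(||x - x'||)."""
    t = np.linalg.norm(x - xp)
    return np.sqrt(2.0 * f(0.0, gamma) - 2.0 * f(t, gamma))

x = np.array([1.0, -2.0])
delta = np.array([0.3, 0.4])        # sample-space disturbance with ||delta|| = 0.5
c = np.linalg.norm(delta)
# A sample-space disturbance of size c maps to a feature-space disturbance of size
# exactly sqrt(2 f(0) - 2 f(c)), consistent with the lemma above.
print(feature_distance(x, x - delta), np.sqrt(2.0 * f(0.0) - 2.0 * f(c)))
```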

95 First we show sup_{‖δ‖≤c} ⟨w, Φ(x − δ)⟩ ≤ sup_{‖δ_φ‖_H ≤ √(2f(0)−2f(c))} ⟨w, Φ(x) − δ_φ⟩. [sent-497, score-0.248]

96 Next, we show the opposite inequality: sup_{‖δ‖≤c} ⟨w, Φ(x − δ)⟩ ≥ sup_{‖δ_φ‖_H ≤ √(2f(0)−2f(c))} ⟨w, Φ(x) − δ_φ⟩. [sent-500, score-0.248]

97 By (19) we have ⟨w, Φ(x′_i)⟩ → ⟨w, Φ(x) − δ′_φ⟩ > sup_{‖δ_φ‖_H ≤ √(2f(0)−2f(c))} ⟨w, Φ(x) − δ_φ⟩ − ε, which means sup_{‖δ‖≤c} ⟨w, Φ(x − δ)⟩ ≥ sup_{‖δ_φ‖_H ≤ √(2f(0)−2f(c))} ⟨w, Φ(x) − δ_φ⟩ − ε. [sent-506, score-0.372]

98 On robustness properties of convex risk minimization methods for pattern recognition. [sent-622, score-0.266]

99 Consistency and robustness of kernel based regression in convex risk minimization. [sent-627, score-0.289]

100 Bouligand derivatives and robustness of support vector machines for regression. [sent-632, score-0.246]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 0.539), ('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 0.539), ('xxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 0.284), ('robustness', 0.246), ('disturbance', 0.208), ('sup', 0.124), ('aramanis', 0.123), ('annor', 0.104), ('robust', 0.103), ('bhattacharyya', 0.095), ('rc', 0.087), ('egularization', 0.086), ('tc', 0.085), ('yi', 0.084), ('regularization', 0.084), ('steinwart', 0.084), ('uncertainty', 0.075), ('christmann', 0.072), ('regularized', 0.063), ('atomic', 0.062), ('svm', 0.06), ('shivaswamy', 0.057), ('samples', 0.057), ('max', 0.056), ('kernelized', 0.052), ('testing', 0.05), ('xi', 0.048), ('ro', 0.047), ('sublinear', 0.046), ('notice', 0.046), ('surely', 0.044), ('rkhs', 0.044), ('classi', 0.041), ('disturbances', 0.04), ('hinge', 0.04), ('aggregated', 0.038), ('hampel', 0.038), ('trafalis', 0.038), ('consistency', 0.037), ('setup', 0.036), ('rp', 0.036), ('perturbation', 0.035), ('training', 0.034), ('equivalence', 0.034), ('uncertain', 0.033), ('ghaoui', 0.033), ('caramanis', 0.032), ('bertsimas', 0.032), ('rn', 0.031), ('stability', 0.031), ('lim', 0.03), ('er', 0.029), ('lanckriet', 0.029), ('mcgill', 0.029), ('brick', 0.028), ('lebret', 0.028), ('nbox', 0.028), ('ntte', 0.028), ('robusti', 0.028), ('rousseeuw', 0.028), ('svms', 0.028), ('el', 0.026), ('cone', 0.025), ('gilbert', 0.024), ('perturbations', 0.023), ('kernel', 0.023), ('formulations', 0.023), ('sample', 0.023), ('yt', 0.022), ('iv', 0.022), ('nt', 0.022), ('bartlett', 0.022), ('measurable', 0.02), ('risk', 0.02), ('noise', 0.02), ('koltchinskii', 0.02), ('globerson', 0.02), ('shie', 0.02), ('vapnik', 0.019), ('inequality', 0.019), ('pac', 0.019), ('maronna', 0.019), ('messem', 0.019), ('multinomially', 0.019), ('nemirovski', 0.019), ('ntc', 0.019), ('nttr', 0.019), ('robustify', 0.019), ('robustifying', 0.019), ('senders', 0.019), ('theorem', 0.019), ('proposition', 0.019), ('xt', 0.019), ('relate', 0.019), ('feature', 0.019), ('norm', 0.019), ('xu', 0.018), ('almost', 0.018), ('min', 0.018), ('mm', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine

2 0.059014495 67 jmlr-2009-Online Learning with Sample Path Constraints

Author: Shie Mannor, John N. Tsitsiklis, Jia Yuan Yu

Abstract: We study online learning where a decision maker interacts with Nature with the objective of maximizing her long-term average reward subject to some sample path average constraints. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature’s choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. Using a calibrated forecasting rule, we provide an explicit strategy that attains this convex hull. We also measure the performance of heuristic methods based on non-calibrated forecasters in experiments involving a CPU power management problem. Keywords: online learning, calibration, regret minimization, approachability

3 0.051579192 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

Author: Jacob Abernethy, Francis Bach, Theodoros Evgeniou, Jean-Philippe Vert

Abstract: We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators mapping a set of “users” to a set of possibly desired “objects”. In particular, several recent low-rank type matrix-completion methods for CF are shown to be special cases of our proposed framework. Unlike existing regularization-based CF, our approach can be used to incorporate additional information such as attributes of the users/objects—a feature currently lacking in existing regularization-based CF approaches—using popular and well-known kernel methods. We provide novel representer theorems that we use to develop new estimation methods. We then provide learning algorithms based on low-rank decompositions and test them on a standard CF data set. The experiments indicate the advantages of generalizing the existing regularization-based CF methods to incorporate related information about users and objects. Finally, we show that certain multi-task learning methods can be also seen as special cases of our proposed approach. Keywords: collaborative filtering, matrix completion, kernel methods, spectral regularization

4 0.048269659 16 jmlr-2009-Classification with Gaussians and Convex Loss

Author: Dao-Hong Xiang, Ding-Xuan Zhou

Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation

5 0.0463873 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

6 0.04410512 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

7 0.041057371 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

8 0.040195659 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

9 0.039433617 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

10 0.036314253 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

11 0.035765857 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

12 0.033646904 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

13 0.033348389 29 jmlr-2009-Estimating Labels from Label Proportions

14 0.032983404 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

15 0.030452797 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

16 0.030340927 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

17 0.030240459 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

18 0.029836256 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

19 0.029768016 90 jmlr-2009-Structure Spaces

20 0.028809974 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.158), (1, 0.011), (2, 0.039), (3, -0.02), (4, -0.078), (5, 0.04), (6, 0.039), (7, 0.016), (8, 0.038), (9, 0.062), (10, 0.019), (11, -0.019), (12, -0.016), (13, -0.104), (14, 0.062), (15, -0.099), (16, 0.0), (17, 0.058), (18, 0.068), (19, -0.039), (20, -0.016), (21, 0.188), (22, -0.043), (23, 0.028), (24, 0.109), (25, 0.077), (26, 0.064), (27, 0.267), (28, 0.089), (29, 0.059), (30, -0.124), (31, -0.306), (32, 0.094), (33, 0.004), (34, -0.14), (35, -0.061), (36, 0.029), (37, 0.272), (38, 0.009), (39, 0.157), (40, -0.243), (41, -0.197), (42, -0.095), (43, 0.057), (44, 0.087), (45, -0.116), (46, -0.1), (47, 0.248), (48, 0.176), (49, 0.097)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92203742 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine

2 0.37601367 67 jmlr-2009-Online Learning with Sample Path Constraints

Author: Shie Mannor, John N. Tsitsiklis, Jia Yuan Yu

Abstract: We study online learning where a decision maker interacts with Nature with the objective of maximizing her long-term average reward subject to some sample path average constraints. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature’s choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. Using a calibrated forecasting rule, we provide an explicit strategy that attains this convex hull. We also measure the performance of heuristic methods based on non-calibrated forecasters in experiments involving a CPU power management problem. Keywords: online learning, calibration, regret minimization, approachability

3 0.30030203 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

Author: Jacob Abernethy, Francis Bach, Theodoros Evgeniou, Jean-Philippe Vert

Abstract: We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators mapping a set of “users” to a set of possibly desired “objects”. In particular, several recent low-rank type matrix-completion methods for CF are shown to be special cases of our proposed framework. Unlike existing regularization-based CF, our approach can be used to incorporate additional information such as attributes of the users/objects—a feature currently lacking in existing regularization-based CF approaches—using popular and well-known kernel methods. We provide novel representer theorems that we use to develop new estimation methods. We then provide learning algorithms based on low-rank decompositions and test them on a standard CF data set. The experiments indicate the advantages of generalizing the existing regularization-based CF methods to incorporate related information about users and objects. Finally, we show that certain multi-task learning methods can be also seen as special cases of our proposed approach. Keywords: collaborative filtering, matrix completion, kernel methods, spectral regularization

4 0.29193893 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti

Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels

5 0.27587947 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

6 0.26632905 16 jmlr-2009-Classification with Gaussians and Convex Loss

7 0.2515811 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

8 0.21919081 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

9 0.20675604 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

10 0.20140825 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

11 0.19822663 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

12 0.19636494 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

13 0.19285442 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

14 0.18853603 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

15 0.18368262 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

16 0.18211071 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

17 0.18139152 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

18 0.16558942 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

19 0.16465609 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

20 0.16450764 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.023), (11, 0.019), (15, 0.306), (19, 0.024), (21, 0.03), (26, 0.015), (38, 0.028), (45, 0.016), (47, 0.05), (52, 0.052), (55, 0.072), (58, 0.039), (66, 0.129), (68, 0.02), (90, 0.058), (96, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70284396 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine

2 0.47376344 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning

3 0.46788394 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

Author: Junning Li, Z. Jane Wang

Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton

4 0.46702552 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

5 0.46569118 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

Author: Tong Zhang

Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In compar√ ison, Lasso may require the coefficients to be larger than O( s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity

6 0.46262598 18 jmlr-2009-Consistency and Localizability

7 0.46218506 84 jmlr-2009-Scalable Collaborative Filtering Approaches for Large Recommender Systems    (Special Topic on Mining and Learning with Graphs and Relations)

8 0.46192893 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

9 0.46177998 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

10 0.46102214 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

11 0.46028185 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

12 0.4599576 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks

13 0.45950326 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

14 0.45923692 78 jmlr-2009-Refinement of Reproducing Kernels

15 0.45834675 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

16 0.45815149 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

17 0.45545411 38 jmlr-2009-Hash Kernels for Structured Data

18 0.45515016 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

19 0.45485929 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence

20 0.45483437 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions