nips nips2010 nips2010-243 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(HRn² + √(HL∗)·Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Toyota Technological Institute at Chicago Abstract We establish an excess risk bound of Õ(HRn² + √(HL∗)·Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. [sent-6, score-1.501]
2 For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. [sent-7, score-0.364]
3 We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. [sent-8, score-0.485]
4 1 Introduction Consider empirical risk minimization for a hypothesis class H = {h : X → R} w.r.t. a loss function φ(t, y). [sent-9, score-0.475]
5 That is, we would like to learn a predictor h with small risk L(h) = E[φ(h(X), Y)] by minimizing the empirical risk L̂(h) = (1/n) Σ_{i=1}^n φ(h(x_i), y_i) of an i.i.d. sample. [sent-13, score-0.688]
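As an illustration of the empirical risk minimization setup just described, here is a minimal sketch (not the paper's algorithm): projected gradient descent over a norm-bounded linear class with the squared loss. The function name, the synthetic data, and the step size are our own illustrative choices.

```python
import numpy as np

def erm_squared_loss(X, y, B, steps=500, lr=0.1):
    """Illustrative ERM for norm-bounded linear predictors h_w(x) = <w, x>
    with the squared loss, via projected gradient descent.
    This is only a sketch of the setting described above, not the paper's method."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        residual = X @ w - y                # h_w(x_i) - y_i for every sample
        grad = 2.0 * X.T @ residual / n     # gradient of the empirical risk
        w -= lr * grad
        norm = np.linalg.norm(w)
        if norm > B:                        # project back onto the ball ||w|| <= B
            w *= B / norm
    return w

# Example usage with synthetic data (all values are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) / np.sqrt(10)
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=200)
w_hat = erm_squared_loss(X, y, B=np.linalg.norm(w_true))
```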
6 Statistical guarantees on the excess risk are well understood for parametric (i.e., finite-dimensional) hypothesis classes. [sent-20, score-0.704]
7 For such classes, learning guarantees can be obtained for any bounded loss function. [sent-24, score-0.503]
8 For non-parametric classes, e.g. the class of low-norm linear predictors HB = {h_w : x ↦ ⟨w, x⟩ : ‖w‖ ≤ B}, guarantees can be obtained in terms of scale-sensitive measures of complexity such as fat-shattering dimensions [1], covering numbers [23] or Rademacher complexity [2]. [sent-33, score-0.59]
9 The classical statistical learning theory approach for obtaining learning guarantees for such scale-sensitive classes is to rely on the Lipschitz constant D of φ(t, y) w.r.t. t. [sent-34, score-0.221]
10 The excess risk can then be bounded as (in expectation over the sample): L(ĥ) ≤ L∗ + 2D·Rn(H) = L∗ + 2√(D²R/n)  (1), where ĥ = arg min_{h∈H} L̂(h) is the empirical risk minimizer (ERM), L∗ = inf_{h∈H} L(h) is the approximation error, and Rn(H) ≤ √(R/n). [sent-43, score-1.026]
11 First, the bound applies only to loss functions with bounded derivative, like the hinge loss and logistic loss popular for classification, or the absolute-value (ℓ1) loss for regression. [sent-48, score-1.06]
12 It is not directly applicable to the squared loss φ(t, y) = ½(t − y)², for which the second derivative is bounded, but not the first. [sent-49, score-0.419]
13 We could try to simply bound the derivative of the squared loss in terms of a bound on the magnitude of h(x), but, e.g., [sent-50, score-0.599]
14 for norm-bounded linear predictors HB this results in a very disappointing excess risk bound of the form O(√(B⁴(max ‖X‖)⁴/n)). [sent-52, score-0.808]
15 One aim of this paper is to provide clean bounds on the excess risk for smooth loss functions such as the squared loss with a bounded second, rather than first, derivative. [sent-53, score-1.304]
16 For parametric classes (e.g. d-dimensional linear predictors), in expectation over the sample [22]: L(ĥ) ≤ L∗ + O(dD log n / n + √(dD L∗ log n / n))  (2). The 1/√n term disappears in the separable case, and we get a graceful degradation between the 1/n rate in the separable case and the 1/√n rate in general. [sent-61, score-0.551]
17 For non-parametric classes, and non-smooth Lipschitz loss, such as the hinge-loss, the excess risk might scale as 1/√n and not 1/n, even in the separable case. [sent-64, score-0.631]
18 However, for H-smooth non-negative loss functions, where the second derivative of φ(t, y) w.r.t. [sent-65, score-0.302]
19 t is bounded by H, a 1/n separable rate is possible. [sent-68, score-0.261]
20 In Section 2 we obtain the following bound on the excess risk (up to logarithmic factors): L(ĥ) ≤ L∗ + Õ(HRn²(H) + √(HL∗)·Rn(H)) = L∗ + Õ(HR/n + √(HRL∗/n)) ≤ 2L∗ + Õ(HR/n)  (3). [sent-69, score-0.638]
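A short worked step (our own algebra, not text from the paper) showing why the middle expression in (3) collapses to the final 2L∗ + Õ(HR/n) form:

```latex
% By the AM-GM inequality, for any a, b \ge 0 we have \sqrt{ab} \le a + b/4, so
\sqrt{\tfrac{HRL^*}{n}} \;=\; \sqrt{L^* \cdot \tfrac{HR}{n}}
   \;\le\; L^* + \tfrac{HR}{4n},
\qquad\text{hence}\qquad
L^* + \tilde{O}\!\Big(\tfrac{HR}{n} + \sqrt{\tfrac{HRL^*}{n}}\Big)
   \;\le\; 2L^* + \tilde{O}\!\Big(\tfrac{HR}{n}\Big).
```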
21 In particular, for ℓ2-norm-bounded linear predictors HB with sup ‖X‖₂ ≤ 1, the excess risk is bounded by Õ(HB²/n + √(HB²L∗/n)). [sent-70, score-0.86]
22 Another interesting distinction between parametric and non-parametric classes is that, even for the squared loss, the bound (3) is tight and the non-separable rate of 1/√n is unavoidable. [sent-71, score-0.239]
23 The differences between parametric and scale-sensitive classes, and between non-smooth, smooth and strongly convex loss functions are discussed in Section 4 and summarized in Table 1. [sent-73, score-0.492]
24 The guarantees discussed thus far are general learning guarantees for the stochastic setting that rely only on the Rademacher complexity of the hypothesis class, and are phrased in terms of minimizing some scalar loss function. [sent-74, score-0.739]
25 In Section 3 we consider also the online setting, in addition to the stochastic setting, and present similar guarantees for online and stochastic convex optimization [32, 24]. [sent-75, score-0.578]
26 However, the online and stochastic convex optimization setting of Section 3 is also more restrictive, as we require the objective to be convex (in Section 2 we make no assumption about the convexity of the hypothesis class H or the loss function φ). [sent-78, score-0.787]
27 Specifically, for a non-negative H-smooth convex objective, over a domain bounded by B, we prove that the average online regret (and excess risk of stochastic optimization) is bounded by O(HB²/n + √(HB²L∗/n)). [sent-79, score-1.079]
28 Comparing with the bound of O(√(D²B²/n)) when the loss is D-Lipschitz rather than H-smooth [32, 21], we see the same relationship discussed above for ERM. [sent-80, score-0.32]
29 Unlike the bound (3) for the ERM, the convex optimization bound avoids polylogarithmic factors. [sent-81, score-0.452]
30 Studying the online and stochastic convex optimization setting (Section 3), in addition to ERM (Section 2), has several advantages. [sent-83, score-0.296]
31 First, it allows us to obtain a learning guarantee for efficient single-pass learning methods, namely stochastic gradient descent (or mirror descent), as well as for the non-stochastic regret. [sent-84, score-0.222]
32 Second, the bound we obtain in the convex optimization setting (Section 3) is actually better than the bound for the ERM (Section 2) as it avoids all polylogarithmic and large constant factors. [sent-85, score-0.491]
33 Third, the bound is applicable to other non-negative online or stochastic optimization problems beyond classification, including problems for which ERM is not applicable. [sent-86, score-0.37]
34 Our starting point is the learning bound (1) that applies to D-Lipschitz loss functions, i.e., loss functions with |φ′(t, y)| ≤ D. [sent-95, score-0.32]
35 What type of bound can we obtain if we instead bound the second derivative φ″(t, y)? [sent-101, score-0.309]
36 The central observation, which allows us to obtain guarantees for smooth loss functions, is that for a smooth loss, the derivative can be bounded in terms of the function value (Lemma 2.1). [sent-104, score-0.69]
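For completeness, a standard derivation of this self-bounding property is sketched below (our own sketch under the standard definition of H-smoothness; the constant stated in the paper's lemma may be looser, e.g. 4H in place of 2H):

```latex
% For a non-negative H-smooth f : R -> R, smoothness gives
% f(s) \le f(t) + f'(t)(s - t) + \tfrac{H}{2}(s - t)^2 for all s, t.
% Taking s = t - f'(t)/H and using f(s) \ge 0:
0 \;\le\; f\!\Big(t - \tfrac{f'(t)}{H}\Big)
  \;\le\; f(t) - \tfrac{f'(t)^2}{H} + \tfrac{f'(t)^2}{2H}
  \;=\; f(t) - \tfrac{f'(t)^2}{2H}
\quad\Longrightarrow\quad |f'(t)| \;\le\; \sqrt{2H\,f(t)}.
```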
37 Looking at the dependence of (1) on the derivative bound D, we are guided by the following heuristic intuition: since we should be concerned only with the behavior around the ERM, perhaps it is enough to bound the derivative φ′(ŵ, x) at the ERM ŵ. [sent-107, score-0.343]
38 What we would actually want is to bound each |φ′(ŵ, x)| separately, or at least have the absolute value inside the expectation—this is where the non-negativity of the loss plays an important role. [sent-110, score-0.359]
39 Note that only the “confidence” terms depended on b = sup |φ|, and this is typically not the dominant term—we believe it is possible to also obtain a bound that holds in expectation over the sample (rather than with high probability) and that avoids a direct dependence on sup |φ|. [sent-119, score-0.371]
40 To this end, consider the following empirically restricted loss class: Lφ(r) := {(x, y) ↦ φ(h(x), y) : h ∈ H, L̂(h) ≤ r}. Lemma 2. [sent-121, score-0.258]
41 The Lemma can be seen as a higher-order version of the Lipschitz Composition Lemma [2], which states that the Rademacher complexity of the unrestricted loss class is bounded by DRn (H). [sent-123, score-0.457]
42 For a non-negative H-smooth loss φ bounded by b and any function class H bounded by B: Rn(Lφ(r)) ≤ √(12Hr) · Rn(H) · (16 log^{3/2}(√n·B/Rn(H)) − 14 log^{3/2}(12HB·√n/b)). Applying Lemma 2. [sent-126, score-0.546]
43 3 The Finite Dimensional Case : Lee et al [16] showed faster rates for squared loss, exploiting the strong convexity of this loss function, even when L∗ > 0, but only with finite VC-subgraph-dimension. [sent-130, score-0.507]
44 Panchenko [22] provides fast rate results for general Lipschitz bounded loss functions, still in the finite VC-subgraph-dimension case. [sent-131, score-0.422]
45 Bousquet [6] provided similar guarantees for linear predictors in Hilbert spaces when the spectrum of the kernel matrix (covariance of X) is exponentially decaying, making the situation almost finite dimensional. [sent-132, score-0.267]
46 Aggregation : Tsybakov [29] studied learning rates for aggregation, where a predictor is chosen from the convex hull of a finite set of base predictors. [sent-137, score-0.408]
47 As with ℓ1-based analysis, since the bounds depend only logarithmically on the number of base predictors (i.e. [sent-139, score-0.288]
48 dimensionality), and rely on the scale of change of the loss function, they are of “scale sensitive” nature. [sent-141, score-0.247]
49 For such an aggregate classifier, Tsybakov obtained a rate of 1/n when zero (or small) risk is achieved by one of the base classifiers. [sent-142, score-0.364]
50 Using Tsybakov’s result, it is not enough for zero risk to be achieved by an aggregate (i.e., a convex combination of the base classifiers). [sent-143, score-0.269]
51 Tsybakov then used the approximation error of a small class of base predictors w.r.t. the large hypothesis class (e.g. [sent-147, score-0.26]
52 a covering) to obtain learning rates for the large hypothesis class by considering aggregation within the small class. [sent-152, score-0.382]
53 However, these results only imply fast learning rates for hypothesis classes with very low complexity. [sent-153, score-0.403]
54 Specifically, to get learning rates better than 1/√n using these results, the covering number of the hypothesis class at scale ε needs to behave as 1/ε^p for some p < 2. [sent-154, score-0.402]
55 But typical classes, including the class of linear predictors with bounded norm, have covering numbers that scale as 1/ε² and so these methods do not imply fast rates for such function classes. [sent-155, score-0.609]
56 In fact, getting rates of 1/n with these techniques, even when L∗ = 0, requires covering numbers that do not increase with ε at all, and so actually finite VC-subgraph-dimension. [sent-156, score-0.272]
57 Chesneau et al. [10] extend Tsybakov’s work to general losses, deriving similar results for Lipschitz loss functions. [sent-157, score-0.287]
58 The same caveats hold: even when L∗ = 0, rates faster than 1/√n require covering numbers that grow slower than 1/ε², and rates of 1/n essentially require finite VC-subgraph-dimension. [sent-158, score-0.374]
59 Our work, on the other hand, is applicable whenever the Rademacher complexity (equivalently covering numbers) can be controlled. [sent-159, score-0.222]
60 Local Rademacher Complexities : Bartlett et al [3] developed a general machinery for proving possible fast rates based on local Rademacher complexities. [sent-161, score-0.269]
61 For example, Steinwart [27] used Local Rademacher Complexity to provide a fast rate on the 0/1 loss of Support Vector Machines (SVMs) (ℓ2-regularized hinge-loss minimization) based on the so-called “geometric margin condition” and Tsybakov’s margin condition. [sent-163, score-0.489]
62 3 Online and Stochastic Optimization of Smooth Convex Objectives We now turn to online and stochastic convex optimization. [sent-167, score-0.296]
63 In these settings a learner chooses w ∈ W, where W is a closed convex set in a normed vector space, attempting to minimize an objective ℓ(w, z) on instances z ∈ Z, where ℓ : W × Z → R is an objective function which is convex in w. [sent-168, score-0.278]
64 This setting includes supervised learning with a convex loss function φ(t, y), where Z = X × Y and ℓ(w, (x, y)) = φ(⟨w, x⟩, y), and extends beyond supervised learning. [sent-172, score-0.322]
65 For the Euclidean norm we can use the squared Euclidean norm regularizer: F(w) = ½‖w‖². [sent-189, score-0.239]
66 1 Online Optimization Setting In the online convex optimization setting we consider an n-round game played between a learner and an adversary (Nature) where at each round i, the player chooses wi ∈ W and then the adversary picks zi ∈ Z. [sent-191, score-0.594]
67 The regret with respect to any w∗ ∈ W with F(w∗) ≤ B² and (1/n) Σ_{j=1}^n ℓ(w∗, z_j) ≤ L∗ is bounded by: (1/n) Σ_{i=1}^n ℓ(wi, zi) − (1/n) Σ_{i=1}^n ℓ(w∗, zi) ≤ 4HB²/n + 2√(HB²L∗/n). Note that the stepsize depends on the bound L∗ on the loss in hindsight. [sent-204, score-0.735]
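The sketch below illustrates the kind of online procedure this regret bound refers to: plain projected online gradient descent with a fixed step size that shrinks when the comparator loss L∗ is large. The step-size formula and all names are our own stand-ins, not the paper's exact mirror-descent algorithm or stepsize.

```python
import numpy as np

def online_gradient_descent(grads, B, H, L_star, d):
    """Illustrative online (projected) gradient descent on the domain {w : ||w|| <= B}.
    `grads` is a list of callables g_i(w) returning the gradient of the i-th smooth
    non-negative convex objective at w. The stepsize below is only a heuristic stand-in
    for the L*-dependent stepsize alluded to above, not the paper's exact choice."""
    n = len(grads)
    eta = B / np.sqrt(H * B**2 + n * H * L_star + 1e-12)
    w = np.zeros(d)
    iterates = []
    for g in grads:
        iterates.append(w.copy())
        w = w - eta * g(w)
        norm = np.linalg.norm(w)
        if norm > B:
            w *= B / norm          # project back onto the ball of radius B
    return iterates               # the online plays w_1, ..., w_n

# Example: n rounds of the squared objective l(w, z) = (<w, x> - y)^2 on arbitrary data.
rng = np.random.default_rng(0)
data = [(rng.normal(size=5) / np.sqrt(5), rng.normal()) for _ in range(100)]
grads = [lambda w, x=x, y=y: 2.0 * (w @ x - y) * x for (x, y) in data]
plays = online_gradient_descent(grads, B=1.0, H=2.0, L_star=1.0, d=5)
```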
68 In the stochastic optimization setting we receive an i.i.d. sample z1, . . . , zn from some unknown distribution (as in Section 2), and we would like to find w with low risk L(w) = E[ℓ(w, Z)]. [sent-215, score-0.269]
69 When z = (x, y) and ℓ(w, z) = φ(⟨w, x⟩, y) this agrees with the supervised learning risk discussed in the Introduction and analyzed in Section 2. [sent-216, score-0.269]
70 Standard arguments [8] allow us to convert the online regret bound of Theorem 2 to a bound on the excess risk (Corollary 3). [sent-218, score-0.671]
71 It is instructive to contrast this guarantee with similar-looking guarantees derived recently in the stochastic convex optimization literature [14]. [sent-220, score-0.292]
72 For a given regularization parameter λ > 0, define the regularized empirical loss as L̂λ(w) := L̂(w) + λF(w) and consider the regularized empirical risk minimizer ŵλ = arg min_{w∈W} L̂λ(w)  (6). The following theorem provides a bound on the excess risk similar to Corollary 3: Theorem 4. [sent-232, score-0.96]
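For the special case of linear predictors with the squared loss and the Euclidean regularizer F(w) = ½‖w‖², the regularized ERM (6) is ordinary ridge regression; the sketch below shows its closed form. This is an illustration of the estimator's shape, not of the paper's analysis, and the function name is our own.

```python
import numpy as np

def regularized_erm(X, y, lam):
    """Sketch of the regularized ERM (6) for linear predictors, squared loss, and
    F(w) = 0.5 * ||w||^2, i.e. ridge regression, solved in closed form."""
    n, d = X.shape
    # minimize (1/n) * ||Xw - y||^2 + lam * 0.5 * ||w||^2
    A = (2.0 / n) * (X.T @ X) + lam * np.eye(d)
    b = (2.0 / n) * (X.T @ y)
    return np.linalg.solve(A, b)
```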
73 4 Tightness In this section we return to the learning rates for the ERM for parametric and for scale-sensitive hypothesis classes. [sent-236, score-0.429]
74 We compare the guarantees on the learning rates in different situations, identify differences between the parametric and scale-sensitive cases and between the smooth and non-smooth cases, and argue that these differences are real by showing that the corresponding guarantees are tight. [sent-239, score-0.505]
75 Although we discuss the tightness of the learning guarantees for ERM in the stochastic setting, similar arguments can also be made for online learning. [sent-240, score-0.368]
76 Table 1 summarizes the bounds on the excess risk of the ERM implied by Theorem 1 as well as previous bounds for Lipschitz loss on finite-dimensional [22] and scale-sensitive [2] classes, and a bound for squared loss on finite-dimensional classes [9, Theorem 11]. [sent-241, score-1.087]
77 To do so, we will consider the class H = {x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1} of ℓ2-bounded linear predictors (all norms in this section are Euclidean), with different loss functions, and various specific distributions over X × Y, where X = {x ∈ Rd : ‖x‖ ≤ 1} and Y = [0, 1]. [sent-245, score-0.428]
78 Infinite dimensional, Lipschitz (non-smooth), separable: Consider the absolute difference loss φ(h(x), y) = |h(x) − y|, take d = 2n and consider the following distribution: X is uniformly distributed over the d standard basis vectors ei, and if X = ei then Y = (1/√n) ri, where r1, . . . , rd are arbitrary signs. [sent-247, score-0.517]
79 This means that for any learning algorithm, there exists a choice of ri’s such that on at least n of the remaining points not seen by the learner, he/she has to suffer a loss of at least 1/√n, yielding an overall risk of at least 1/√(4n). [sent-256, score-0.536]
80 6 Infinite dimensional, smooth, non-separable, even if strongly convex Consider the squared loss φ(h(x), y) = (h(x) − y)2 which is 2-smooth and 2-strongly convex. [sent-257, score-0.401]
81 The minimizer of the expected risk is w′ = Σ_{i=1}^d (ri/(2√d)) ei, with ‖w′‖ = 1/2 and L∗ = L(w′) = σ². [sent-259, score-0.444]
82 1 Implications Improved Margin Bounds: “Margin bounds” provide a bound on the expected zero-one loss of a classifier based on the margin 0/1 error on the training sample. [sent-280, score-0.407]
83 Koltchinskii and Panchenko [13] provide margin bounds for a generic class H based on the Rademacher complexity of the class. [sent-281, score-0.301]
84 This is done by using a non-smooth Lipschitz “ramp” loss that upper bounds the zero-one loss and is upper-bounded by the margin zero-one loss. [sent-282, score-0.584]
85 Following the same idea, we use the following smooth “ramp”: φ(t) = 1 for t ≤ 0, φ(t) = (1 + cos(πt/γ))/2 for 0 < t < γ, and φ(t) = 0 for t ≥ γ. This loss function is (π²/(4γ²))-smooth and is lower bounded by the zero-one loss and upper bounded by the γ-margin loss. [sent-284, score-0.815]
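A small sketch of this smooth ramp, together with a numerical check of the sandwich between the zero-one loss and the γ-margin loss; the grid and the value of γ are arbitrary choices of our own.

```python
import numpy as np

def smooth_ramp(t, gamma):
    """The smooth "ramp" surrogate described above: 1 for t <= 0,
    (1 + cos(pi * t / gamma)) / 2 for 0 < t < gamma, and 0 for t >= gamma.
    Applied to the margin t = y * h(x), it upper bounds the zero-one loss and is
    upper bounded by the gamma-margin loss."""
    t = np.asarray(t, dtype=float)
    mid = (1.0 + np.cos(np.pi * t / gamma)) / 2.0
    return np.where(t <= 0, 1.0, np.where(t >= gamma, 0.0, mid))

# Sanity check of the sandwich 1{t <= 0} <= phi(t) <= 1{t < gamma} on a grid.
gamma = 0.5
t = np.linspace(-1.0, 1.0, 201)
phi = smooth_ramp(t, gamma)
assert np.all(phi >= (t <= 0)) and np.all(phi <= (t < gamma))
```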
86 Using Theorem 1 we can now provide improved margin bounds for the zero-one loss of any classifier based on empirical margin error. [sent-285, score-0.497]
87 Denote err(h) = E[1{h(x) ≠ y}] the zero-one risk, and for any γ > 0 and sample (x1, y1), . . . [sent-286, score-0.311]
88 , (xn, yn) ∈ X × {±1} define the γ-margin empirical zero-one loss as errγ(h) := (1/n) Σ_{i=1}^n 1{yi h(xi) < γ}. [sent-289, score-0.248]
89 err(h) ≤ 1.01 errγ(h) + K((log³n/γ²) Rn²(H) + 2 log(log(4b)/δ)/n). Improved margin bounds of the above form have been previously shown specifically for linear prediction in a Hilbert space based on the PAC-Bayes theorem [19, 15]. [sent-294, score-0.236]
90 2 Interaction of Norm and Dimension: Consider the problem of learning a low-norm linear predictor with respect to the squared loss φ(t, z) = (t − z)², where X ∈ Rd, for finite but very large d, and where the expected norm of X is low. [sent-298, score-0.483]
91 However, for any fixed d and B, even if d ≫ B², asymptotically as the number of samples increases, the excess risk of norm-constrained or norm-regularized regression actually behaves as L(ŵ) − L∗ ≈ (d/n)σ², and depends (to first order) only on the dimensionality d and not on B [17]. [sent-303, score-0.659]
92 In this non-separable situation, parametric complexity controls can lead to a 1/n rate, ultimately dominating the 1/√n rate resulting from L∗ > 0 when considering the scale-sensitive, non-parametric complexity control B. [sent-305, score-0.314]
93 The first regime has excess risk of order B², which occurs until n = Θ(B²). [sent-307, score-0.588]
94 The second (“low-noise”) regime is one where the excess risk is dominated by the norm and behaves as B²/n, until n = Θ(B²/σ²) and L(ŵ) = Θ(L∗). [sent-308, score-0.725]
95 The third (“slow”) regime is where the excess risk is controlled by the norm and the approximation error and behaves as Bσ/√n, until n = Θ(d²σ²/B²) and L(ŵ) = L∗ + Θ(B²/d). [sent-309, score-0.666]
96 The fourth (“asymptotic”) regime is where excess risk behaves as d/n. [sent-310, score-0.645]
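A tiny numeric illustration of where these four regimes begin and end, ignoring the constants hidden inside Θ(·); the values of B, d and σ are arbitrary examples of our own.

```python
import numpy as np

# Numeric illustration of the four regimes described above, ignoring the constants
# hidden in the Theta(.) notation. B, d and sigma are arbitrary example values.
B, d, sigma = 10.0, 10**6, 0.1

boundaries = {
    "end of first regime (excess risk ~ B^2)":            B**2,
    "end of low-noise regime (excess risk ~ B^2/n)":      B**2 / sigma**2,
    "end of slow regime (excess risk ~ B*sigma/sqrt(n))": d**2 * sigma**2 / B**2,
}
for name, n in boundaries.items():
    print(f"{name}: n ~ {n:.3g}")

# Candidate rates at a given sample size; the smallest applicable one dominates.
n = 10**5
print("B^2/n =", B**2 / n, " B*sigma/sqrt(n) =", B * sigma / np.sqrt(n), " d/n =", d / n)
```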
97 3 Sparse Prediction: The use of the ℓ1 norm has become popular for learning sparse predictors in high dimensions, as in the LASSO. [sent-313, score-0.25]
98 The LASSO estimator [28] ŵ is obtained by considering the squared loss φ(z, y) = (z − y)² and minimizing L̂(w) subject to ‖w‖1 ≤ B. [sent-314, score-0.29]
99 Let us assume there is some (unknown) sparse reference predictor w0 that has low expected loss and sparsity (number of non-zeros) ‖w0‖0 = k, and that ‖x‖∞ ≤ 1, |y| ≤ 1. [sent-315, score-0.324]
100 Furthermore, we do not require that the optimal predictor is sparse or that the model is well specified: only that there exists a low risk predictor using a small number of fairly uncorrelated features. [sent-324, score-0.495]
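To make the sparse-prediction setting concrete, here is a minimal ℓ1 sketch. Note it solves the ℓ1-penalized (Lagrangian) form via ISTA rather than the norm-constrained form ‖w‖1 ≤ B described above, and it is only an illustration of our own, not the paper's estimator or analysis.

```python
import numpy as np

def lasso_ista(X, y, lam, steps=1000):
    """Minimal ISTA sketch for l1-penalized squared loss: a common Lagrangian
    stand-in for the norm-constrained LASSO discussed above."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size from the Lipschitz constant of the gradient of (1/n)||Xw - y||^2.
    L = 2.0 * np.linalg.norm(X, ord=2) ** 2 / n
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / n
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return w
```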
wordName wordTfidf (topN-words)
[('rademacher', 0.324), ('risk', 0.269), ('excess', 0.26), ('hb', 0.257), ('erm', 0.233), ('loss', 0.211), ('predictors', 0.17), ('rates', 0.141), ('err', 0.131), ('hypothesis', 0.122), ('tsybakov', 0.121), ('predictor', 0.113), ('convex', 0.111), ('bound', 0.109), ('rn', 0.108), ('bounded', 0.107), ('chesneau', 0.105), ('separable', 0.102), ('online', 0.101), ('smoothness', 0.1), ('lipschitz', 0.1), ('guarantees', 0.097), ('wi', 0.094), ('complexity', 0.092), ('smooth', 0.092), ('covering', 0.092), ('derivative', 0.091), ('mirror', 0.091), ('classes', 0.088), ('margin', 0.087), ('zi', 0.084), ('polylogarithmic', 0.084), ('stochastic', 0.084), ('norm', 0.08), ('squared', 0.079), ('hrl', 0.079), ('parametric', 0.078), ('al', 0.076), ('bounds', 0.075), ('ei', 0.074), ('theorem', 0.074), ('aggregation', 0.072), ('hr', 0.069), ('lemma', 0.064), ('hl', 0.063), ('srebro', 0.059), ('bousquet', 0.059), ('stability', 0.059), ('regime', 0.059), ('behaves', 0.057), ('stepsize', 0.056), ('learner', 0.056), ('ri', 0.056), ('sup', 0.054), ('steinwart', 0.054), ('fast', 0.052), ('ddl', 0.052), ('graceful', 0.052), ('rate', 0.052), ('arguments', 0.052), ('numeric', 0.05), ('adversary', 0.05), ('regularizer', 0.049), ('player', 0.048), ('class', 0.047), ('descent', 0.047), ('dimensional', 0.047), ('panchenko', 0.046), ('liang', 0.045), ('bartlett', 0.045), ('minimizer', 0.045), ('base', 0.043), ('sample', 0.042), ('complexities', 0.042), ('regret', 0.04), ('ambuj', 0.04), ('ramp', 0.04), ('nite', 0.039), ('actually', 0.039), ('expectation', 0.039), ('avoids', 0.039), ('ni', 0.039), ('euclidean', 0.038), ('applicable', 0.038), ('karthik', 0.038), ('sridharan', 0.038), ('nb', 0.038), ('empirical', 0.037), ('log', 0.037), ('ciencies', 0.036), ('dd', 0.036), ('degradation', 0.036), ('rely', 0.036), ('koltchinskii', 0.034), ('tightness', 0.034), ('signs', 0.034), ('dimensionality', 0.034), ('dependence', 0.034), ('annals', 0.033), ('phd', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(HRn² + √(HL∗)·Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
2 0.26882362 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
3 0.21646848 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
4 0.21360213 27 nips-2010-Agnostic Active Learning Without Constraints
Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu
Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification. 1
5 0.20512016 191 nips-2010-On the Theory of Learnining with Privileged Information
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with the privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes into account the privileged information in order to find a good function in the decision space. We outline the conditions on the correcting space that, if satisfied, allow Privileged ERM to have much faster learning rate in the decision space than the one of the regular empirical risk minimization. 1
6 0.1707509 282 nips-2010-Variable margin losses for classifier design
7 0.16568916 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
8 0.15972881 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
9 0.14712951 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
10 0.13946821 233 nips-2010-Scrambled Objects for Least-Squares Regression
11 0.13715233 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
12 0.13295713 202 nips-2010-Parallelized Stochastic Gradient Descent
13 0.12924749 222 nips-2010-Random Walk Approach to Regret Minimization
14 0.12464343 63 nips-2010-Distributed Dual Averaging In Networks
15 0.11808988 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
16 0.11307698 22 nips-2010-Active Estimation of F-Measures
17 0.10943149 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
18 0.10751467 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
19 0.10625445 290 nips-2010-t-logistic regression
20 0.096834727 265 nips-2010-The LASSO risk: asymptotic results and real world examples
topicId topicWeight
[(0, 0.283), (1, 0.024), (2, 0.278), (3, 0.019), (4, 0.113), (5, 0.143), (6, -0.16), (7, -0.071), (8, -0.094), (9, 0.019), (10, 0.066), (11, -0.137), (12, -0.128), (13, 0.023), (14, -0.181), (15, 0.095), (16, 0.122), (17, 0.051), (18, 0.035), (19, -0.081), (20, 0.117), (21, -0.041), (22, 0.042), (23, -0.076), (24, -0.003), (25, 0.016), (26, -0.058), (27, 0.067), (28, 0.093), (29, 0.069), (30, -0.015), (31, -0.049), (32, 0.045), (33, -0.059), (34, -0.027), (35, -0.056), (36, 0.065), (37, -0.065), (38, 0.032), (39, -0.01), (40, 0.02), (41, -0.039), (42, 0.061), (43, -0.027), (44, -0.012), (45, -0.016), (46, 0.014), (47, 0.144), (48, -0.071), (49, 0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.97325397 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(HRn² + √(HL∗)·Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
2 0.89709514 191 nips-2010-On the Theory of Learnining with Privileged Information
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with the privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes into account the privileged information in order to find a good function in the decision space. We outline the conditions on the correcting space that, if satisfied, allow Privileged ERM to have much faster learning rate in the decision space than the one of the regular empirical risk minimization. 1
3 0.76915061 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
4 0.73467624 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
Author: Hariharan Narayanan, Sanjoy Mitter
Abstract: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound of O(k/ε² + log(1/δ)/ε²) for the sample complexity of Empirical Risk minimization on k-means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997 [3]. Here ε is the desired bound on the error and δ is a bound on the probability of failure. We improve the best currently known upper bound [14] of O(k²/ε² + log(1/δ)/ε²) to O(k·min(k, log⁴(k/ε))/ε² + log(1/δ)/ε²). Based on these results, we devise a simple algorithm for k-means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension.
5 0.7011236 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
Author: Malik Magdon-Ismail
Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.
6 0.69754446 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
7 0.66670191 142 nips-2010-Learning Bounds for Importance Weighting
8 0.66065198 282 nips-2010-Variable margin losses for classifier design
9 0.64827418 202 nips-2010-Parallelized Stochastic Gradient Descent
10 0.63155043 27 nips-2010-Agnostic Active Learning Without Constraints
11 0.62538093 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
12 0.60971624 290 nips-2010-t-logistic regression
13 0.5939706 233 nips-2010-Scrambled Objects for Least-Squares Regression
14 0.59256101 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
15 0.57572758 269 nips-2010-Throttling Poisson Processes
16 0.57202137 22 nips-2010-Active Estimation of F-Measures
17 0.57060337 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
18 0.56030154 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
19 0.55923384 226 nips-2010-Repeated Games against Budgeted Adversaries
20 0.54833782 222 nips-2010-Random Walk Approach to Regret Minimization
topicId topicWeight
[(9, 0.05), (13, 0.059), (17, 0.016), (27, 0.04), (30, 0.097), (35, 0.016), (39, 0.02), (42, 0.012), (45, 0.23), (49, 0.019), (50, 0.052), (52, 0.04), (59, 0.01), (60, 0.06), (77, 0.058), (78, 0.052), (90, 0.085)]
simIndex simValue paperId paperTitle
1 0.95809901 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
2 0.95328093 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
3 0.95087165 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
same-paper 4 0.94990045 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(HRn² + √(HL∗)·Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
5 0.93793368 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
6 0.9359926 282 nips-2010-Variable margin losses for classifier design
7 0.93589747 290 nips-2010-t-logistic regression
8 0.93578297 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
9 0.93528503 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
10 0.93479067 117 nips-2010-Identifying graph-structured activation patterns in networks
11 0.93458295 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
12 0.93442106 265 nips-2010-The LASSO risk: asymptotic results and real world examples
13 0.93437827 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
14 0.93419307 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
15 0.93041736 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
16 0.92984641 148 nips-2010-Learning Networks of Stochastic Differential Equations
17 0.92870498 27 nips-2010-Agnostic Active Learning Without Constraints
18 0.9283815 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
19 0.92774916 182 nips-2010-New Adaptive Algorithms for Online Classification
20 0.92704332 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case