jmlr jmlr2013 jmlr2013-35 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
Reference: text
sentIndex sentText sentNum sentScore
1 The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. [sent-10, score-0.227]
2 These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. [sent-13, score-0.378]
3 Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. [sent-14, score-0.257]
4 Introduction In this paper we pursue a tight characterization of the sample complexity of learning a classifier, under a particular data distribution, and using a particular learning rule. [sent-16, score-0.196]
5 If the VC-dimension of the hypothesis class, when restricted to this subset, is smaller than d, then learning with respect to this distribution will require fewer examples than the upper bound predicts. [sent-24, score-0.249]
6 Of course, some sample complexity upper bounds are known to be tight or to have an almost-matching lower bound. [sent-25, score-0.335]
7 For instance, the VC-dimension upper bound is tight (Vapnik and Chervonenkis, 1974). [sent-26, score-0.203]
8 This means that there exists some data distribution in the class covered by the upper bound, for which this bound cannot be improved. [sent-27, score-0.205]
9 But it does not imply that the upper bound characterizes the true sample complexity for every specific distribution in the class. [sent-29, score-0.297]
10 This implies a sample complexity upper bound of O(d/ε²) using any MEM algorithm, where ε is the excess error relative to the optimal margin error. [sent-34, score-0.427]
11 We also have that the sample complexity of any MEM algorithm is at most O(B²/(γ²ε²)), where B² is the average squared norm of the data and γ is the size of the margin (Bartlett and Mendelson, 2002). [sent-35, score-0.252]
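For reference, the two classical upper bounds in sentences 10 and 11 can be written side by side (up to constants and logarithmic factors, with m(ε, γ, D) denoting the distribution-specific sample complexity used later in Equation (1)); this is only a restatement of the bounds above:

```latex
m(\epsilon,\gamma,D) \;\le\; O\!\left(\frac{d}{\epsilon^{2}}\right)
\qquad\text{and}\qquad
m(\epsilon,\gamma,D) \;\le\; O\!\left(\frac{B^{2}}{\gamma^{2}\epsilon^{2}}\right),
\qquad\text{where } B^{2}=\mathbb{E}_{X\sim D}\,\|X\|^{2}.
```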
12 However, the VC-dimension upper bound indicates, for instance, that if a distribution induces a large average norm but is supported by a low-dimensional sub-space, then the true number of examples required to reach a low error is much smaller. [sent-40, score-0.198]
13 Thus, neither of these upper bounds fully describes the sample complexity of MEM for a specific distribution. [sent-41, score-0.226]
14 We obtain a tight distribution-specific characterization of the sample complexity of large-margin learning for a rich class of distributions. [sent-42, score-0.233]
15 The upper bound is universal, and the lower bound holds for a rich class of distributions with independent features. [sent-44, score-0.342]
16 The margin-adapted dimension refines both the dimension and the average norm of the data distribution, and can be easily calculated from the covariance matrix and the mean of the distribution. [sent-45, score-0.2]
17 Our sample-complexity upper bound shows that Õ(kγ/ε²) examples suffice in order to learn any distribution with a margin-adapted dimension of kγ using a MEM algorithm with margin γ. [sent-47, score-0.298]
18 Our main result shows the following matching distribution-specific upper and lower bounds on the sample complexity of MEM: Ω(kγ(D)) ≤ m(ε, γ, D) ≤ Õ(kγ(D)/ε²). (1) [sent-53, score-0.268]
19 Our tight characterization, and in particular the distribution-specific lower bound on the sample complexity that we establish, can be used to compare large-margin (L2 regularized) learning to other learning rules. [sent-54, score-0.322]
20 We provide two such examples: we use our lower bound to rigorously establish a sample complexity gap between L1 and L2 regularization previously studied in Ng (2004), and to show a large gap between discriminative and generative learning on a Gaussian-mixture distribution. [sent-55, score-0.255]
21 The tight bounds can also be used for active learning algorithms in which sample-complexity bounds are used to decide on the next label to query. [sent-56, score-0.192]
22 • Providing a new lower bound for the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. [sent-63, score-0.192]
23 In Section 8 we show that any nontrivial sample-complexity lower bound for more general distributions must employ properties other than the covariance matrix of the distribution. [sent-71, score-0.265]
24 This type of lower bound does not, however, indicate much about the sample complexity of other distributions under the same set of assumptions. [sent-76, score-0.298]
25 The essential condition is that the metric entropy of the hypothesis class with respect to the distribution be sub-linear in the limit of an infinite sample size. [sent-79, score-0.206]
26 Benedek and Itai (1991) show that if the distribution is known to the learner, a specific hypothesis class is learnable if and only if there is a finite ε-cover of this hypothesis class with respect to the distribution. [sent-83, score-0.268]
27 (2008) consider a similar setting, and prove sample complexity lower bounds for learning with any data distribution, for some binary hypothesis classes on the real line. [sent-85, score-0.297]
28 Vayatis and Azencott (1999) provide distribution-specific sample complexity upper bounds for hypothesis classes with a limited VC-dimension, as a function of how balanced the hypotheses are with respect to the considered distributions. [sent-86, score-0.307]
29 As can be seen in Equation (1), we do not tightly characterize the dependence of the sample complexity on the desired error (as done, for example, in Steinwart and Scovel, 2007), thus our bounds are not tight for asymptotically small error levels. [sent-88, score-0.241]
30 The margin error of a classifier h with respect to a margin γ > 0 on D is ℓγ(h, D) ≜ P(X,Y)∼D[Y · h(X) ≤ γ]. [sent-96, score-0.186]
31 For a given hypothesis class H ⊆ {±1}^X, the best achievable margin error on D is ℓ∗γ(H, D) ≜ inf_{h∈H} ℓγ(h, D). [sent-97, score-0.211]
32 The distribution-specific sample complexity for MEM algorithms is the sample size required to guarantee low excess error for the given distribution. [sent-111, score-0.216]
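As a concrete companion to the margin-error definition in sentence 30, here is a minimal NumPy sketch (not from the paper; the function name and interface are illustrative) that computes the empirical γ-margin error of a linear classifier x ↦ ⟨w, x⟩ on a labeled sample.

```python
import numpy as np

def margin_error(w, X, y, gamma):
    """Empirical gamma-margin error of the linear classifier x -> <w, x>.

    w     : (d,) weight vector
    X     : (m, d) sample matrix, one example per row
    y     : (m,) labels in {-1, +1}
    gamma : margin parameter, gamma > 0

    Returns the fraction of examples with y_i * <w, x_i> <= gamma,
    mirroring the definition l_gamma(h, D) = P[Y * h(X) <= gamma].
    """
    scores = y * (X @ w)          # signed margins y_i * <w, x_i>
    return float(np.mean(scores <= gamma))

# The zero-one error is recovered at gamma -> 0 (ties counted as errors):
# margin_error(w, X, y, 0.0)
```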
33 Preliminaries As mentioned above, for the hypothesis class of linear classifiers W , one can derive a sample-complexity upper bound of the form O(B²/(γ²ε²)), where B² = E_{X∼D}[‖X‖²] and ε is the excess error relative to the γ-margin loss. [sent-136, score-0.323]
34 To get the desired upper bound for linear classifiers we use the ramp loss, which is defined as follows. [sent-149, score-0.612]
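The extracted sentences do not reproduce the paper's exact definition of RAMPγ; the sketch below uses the common parameterization of the γ-ramp loss, which sits between the zero-one loss and the γ-margin loss (consistent with sentence 91 below, which notes that the ramp loss is upper bounded by the margin loss). Treat the exact form as an assumption, not a quote of the paper's definition.

```python
import numpy as np

def ramp_loss(margins, gamma):
    """Common form of the gamma-ramp loss applied to signed margins z = y * h(x).

    ramp_gamma(z) = 1             if z <= 0
                  = 1 - z/gamma   if 0 < z < gamma
                  = 0             if z >= gamma

    so that  1[z <= 0]  <=  ramp_gamma(z)  <=  1[z <= gamma].
    """
    z = np.asarray(margins, dtype=float)
    return np.clip(1.0 - z / gamma, 0.0, 1.0)
```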
35 Combining this with Proposition 3 we can conclude a sample complexity upper bound of O(B²/(γ²ε²)). [sent-159, score-0.265]
36 The Margin-Adapted Dimension When considering learning of linear classifiers using MEM, the dimension-based upper bound and the norm-based upper bound are both tight in the worst-case sense, that is, they are the best bounds that rely only on the dimensionality or only on the norm respectively. [sent-175, score-0.414]
37 Nonetheless, neither is tight in a distribution-specific sense: If the average norm is unbounded while the dimension is small, then there can be an arbitrarily large gap between the true distribution-dependent sample complexity and the bound that depends on the average norm. [sent-176, score-0.347]
38 Trivially, this is an upper bound on the sample complexity as well. [sent-179, score-0.265]
39 We will show that in such situations the sample complexity is characterized not by the minimum of dimension and norm, but by the sum of the number of high-variance dimensions and the average squared norm in the other directions. [sent-181, score-0.196]
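As a rough illustration of this "high-variance dimensions plus residual norm" characterization, the NumPy sketch below computes, from the spectrum of the uncentered covariance matrix, the smallest k whose tail variance divided by γ² is at most k. This fixed-point style quantity matches the verbal description here and reproduces the k₁ = ⌈d/2⌉ value quoted in sentence 87, but it is only a sketch; the paper's Definition 6 is authoritative for the exact form of kγ.

```python
import numpy as np

def margin_adapted_dimension(Sigma, gamma):
    """Sketch of a margin-adapted-dimension style quantity.

    Sigma : (d, d) uncentered covariance matrix E[X X^T]
    gamma : margin parameter, gamma > 0

    Returns the smallest k such that the variance outside the top-k
    eigendirections, divided by gamma^2, is at most k.  This is only an
    illustration of "number of high-variance dimensions plus average
    squared norm in the other directions"; see Definition 6 in the paper
    for the exact definition.
    """
    lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]                    # eigenvalues, descending
    d = lam.shape[0]
    tail = np.concatenate(([lam.sum()], lam.sum() - np.cumsum(lam)))  # tail[k] = sum_{i>k} lam_i
    for k in range(d + 1):
        if k >= tail[k] / gamma ** 2:
            return k
    return d

# Example: d independent unit-variance coordinates and gamma = 1 give
# margin_adapted_dimension(np.eye(10), 1.0) == 5, i.e., ceil(d/2).
```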
40 The eigenvalues of the empirical covariance matrix were used to provide sample complexity bounds, for instance in Schölkopf et al. [sent-195, score-0.253]
41 A Distribution-Dependent Upper Bound In this section we prove an upper bound on the sample complexity of learning with MEM, using the margin-adapted dimension. [sent-202, score-0.265]
42 We do this by providing a tighter upper bound for the Rademacher complexity of RAMPγ . [sent-203, score-0.209]
43 (Figure caption) Left: the norm bound is tight; middle: the dimension bound is tight; right: neither bound is tight. [sent-215, score-0.319]
44 The first function class will be bounded because of the norm bound on the subspace V used in Definition 6, and the second function class will have a bounded pseudo-dimension. [sent-220, score-0.188]
45 One should note that a similar upper bound can be obtained much more easily under a uniform upper bound on the eigenvalues of the uncentered covariance matrix. [sent-304, score-0.4]
46 However, such an upper bound would not capture the fact that a finite dimension implies a finite sample complexity, regardless of the size of the covariance. [sent-305, score-0.229]
47 If one wants to estimate the sample complexity, then large covariance matrix eigenvalues imply that more examples are required to estimate the covariance matrix from a sample. [sent-306, score-0.276]
48 Moreover, estimating the covariance matrix is not necessary to achieve the sample complexity, since the upper bound holds for any margin-error minimization algorithm. [sent-308, score-0.288]
49 A Distribution-Dependent Lower Bound The new upper bound presented in Corollary 12 can be tighter than both the norm-only and the dimension-only upper bounds. [sent-310, score-0.188]
50 But does the margin-adapted dimension characterize the true sample complexity of the distribution, or is it just another upper bound? [sent-311, score-0.218]
51 1 relates fat-shattering with a lower bound on sample complexity. [sent-314, score-0.182]
52 2 we use this result to relate the smallest eigenvalue of a Gram-matrix to a lower bound on sample complexity. [sent-316, score-0.212]
53 In contrast, the average Rademacher complexity cannot be used to derive general lower bounds for MEM algorithms, since it is related to the rate of uniform convergence of the entire hypothesis class, while MEM algorithms choose low-error hypotheses (see, e. [sent-354, score-0.241]
54 However, any learning algorithm that returns a hypothesis from the hypothesis class will incur zero error on this distribution. [sent-369, score-0.199]
55 Thus 1 by Lemma 16 there exists a wy such that Ywy = y and wy ≤ wy ≤ 1. [sent-416, score-0.198]
56 Corollary 17 generalizes the requirement of linear independence for shattering with no margin: A set of vectors is shattered with no margin if the vectors are linearly independent, that is if λmin > 0. [sent-426, score-0.236]
57 3 Sub-Gaussian Distributions In order to derive a lower bound on distribution-specific sample complexity in terms of the covariance of X ∼ DX , we must assume that X is not too heavy-tailed. [sent-440, score-0.315]
58 In this work we further say that X is sub-Gaussian with relative moment ρ > 0 if X is sub-Gaussian with moment ρ√(E[X²]), that is, ∀t ∈ R, E[exp(tX)] ≤ exp(t²ρ²E[X²]/2). [sent-449, score-0.324]
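Written as a display, the definition in the sentence above amounts to the following (B denotes the sub-Gaussian moment; this is a restatement, not an additional assumption):

```latex
X \text{ is sub-Gaussian with moment } B:\quad
\forall t\in\mathbb{R},\ \ \mathbb{E}\!\left[e^{tX}\right]\le e^{t^{2}B^{2}/2};
\qquad
X \text{ has relative moment } \rho:\quad B=\rho\sqrt{\mathbb{E}[X^{2}]}.
```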
59 Note that a sub-Gaussian variable with moment B and relative moment ρ is also sub-Gaussian with moment B′ and relative moment ρ′ for any B′ ≥ B and ρ′ ≥ ρ. [sent-450, score-0.648]
60 Specifically, if X is a mean-zero Gaussian random variable, X ∼ N(0, σ2 ), then X is sub-Gaussian with relative moment 1 and the inequalities in the definition above hold with equality. [sent-452, score-0.181]
61 As another example, if X is a uniform random variable over {±b} for some b ≥ 0, then X is sub-Gaussian with relative moment 1, since E[exp(tX)] = ½(exp(tb) + exp(−tb)) ≤ exp(t²b²/2) = exp(t²E[X²]/2). [sent-453, score-0.181]
62 The following lemma provides a useful connection between the trace of the sub-Gaussian moment matrix and the moment-generating function of the squared norm of the random vector. [sent-456, score-0.37]
63 Definition 21 (Sub-Gaussian product distributions) A distribution DX over Rd is a sub-Gaussian product distribution with moment B and relative moment ρ if there exists some orthonormal basis a1, . . . , [sent-462, score-0.388]
64 ad ∈ Rd, such that for X ∼ DX, the inner products ⟨ai, X⟩ are independent sub-Gaussian random variables, each with moment B and relative moment ρ. [sent-465, score-0.324]
65 Note that a sub-Gaussian product distribution has mean zero, thus its covariance matrix is equal to its uncentered covariance matrix. [sent-466, score-0.308]
66 For any fixed ρ ≥ 0, we denote by Dρ^sg the family of all sub-Gaussian product distributions with relative moment ρ, in arbitrary dimension. [sent-467, score-0.224]
67 For instance, all multivariate Gaussian distributions and all uniform distributions on the corners of a centered hyper-rectangle are in D1^sg. [sent-468, score-0.246]
68 We will provide a lower bound for all distributions in Dρ^sg. [sent-471, score-0.249]
69 This lower bound is linear in the margin-adapted dimension of the distribution, thus it matches the upper bound provided in Corollary 12. [sent-472, score-0.299]
70 2, to obtain a sample complexity lower bound it suffices to have a lower bound on the value of the smallest eigenvalue of a random Gram matrix. [sent-476, score-0.411]
71 In this case, then, the sample complexity lower bound is indeed of the same order as kγ, which also controls the upper bound in Corollary 12. [sent-494, score-0.391]
72 For any DX ∈ Dρ^sg with covariance matrix Σ ≤ I, and for any m ≤ β · trace(Σ) − C, if X is the m × d matrix of a sample drawn from DX^m, then P[λmin(XXᵀ) ≥ m] ≥ δ. [sent-504, score-0.268]
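A quick numerical illustration of the flavor of this statement (not an experiment from the paper; the covariance, sample size, and constants below are arbitrary choices): for a Gaussian sample with Σ ⪯ I and m well below trace(Σ), the smallest eigenvalue of the m × m Gram matrix XXᵀ is typically well above m.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 500
Sigma = np.diag(np.linspace(0.2, 1.0, d))   # diagonal covariance with Sigma <= I
m = int(0.1 * np.trace(Sigma))              # sample size well below trace(Sigma)

# Draw an m x d sample from N(0, Sigma) and form the m x m Gram matrix.
X = rng.standard_normal((m, d)) * np.sqrt(np.diag(Sigma))
gram = X @ X.T

lam_min = np.linalg.eigvalsh(gram)[0]       # smallest eigenvalue of X X^T
print(f"m = {m}, smallest Gram eigenvalue = {lam_min:.1f}")
```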
73 Theorem 24 (Sample complexity lower bound for distributions in Dρ^sg) For any ρ > 0 there are constants β > 0, C ≥ 0 such that for any D with DX ∈ Dρ^sg, for any γ > 0 and for any ε < (1 − ℓ∗γ(D))/2, m(ε, γ, D, 1/4) ≥ βkγ(DX) − C. [sent-505, score-0.402]
74 By our assumptions on DX , for all i ∈ [d] the random variable X[i] is sub-Gaussian with relative moment ρ. [sent-535, score-0.181]
75 Z[i] is also sub-Gaussian with relative moment ρ, and E[Z[i]2 ] = 1. [sent-537, score-0.181]
76 Z[i] is sub-Gaussian with relative moment ρ and E[Z[i]2 ] ≤ 1. [sent-561, score-0.181]
77 On the Limitations of the Covariance Matrix We have shown matching upper and lower bounds for the sample complexity of learning with MEM, for any sub-Gaussian product distribution with a bounded relative moment. [sent-581, score-0.338]
78 Both DX and PX are sub-Gaussian random vectors, with a relative moment of 2 in all directions. [sent-586, score-0.181]
79 By Equation (9), PX, Da and Db are all sub-Gaussian product distributions with relative moment 1, thus also with relative moment √2 > 1. [sent-595, score-0.356]
80 Conclusions Corollary 12 and Theorem 24 together provide a tight characterization of the sample complexity of any sub-Gaussian product distribution with a bounded relative moment. [sent-604, score-0.266]
81 Ω(kγ(DX)) ≤ m(ε, γ, D) ≤ Õ(kγ(DX)/ε²). (10) The upper bound holds uniformly for all distributions, and the constants in the lower bound depend only on ρ. [sent-607, score-0.262]
82 An interesting conclusion can be drawn as to the influence of the conditional distribution of labels DY |X : Since Equation (10) holds for any DY |X , the effect of the direction of the best separator on the sample complexity is bounded, even for highly non-spherical distributions. [sent-609, score-0.206]
83 There are upper bounds that depend on the margin alone and on the dimension alone without logarithmic factors. [sent-611, score-0.227]
84 Equation (10) can be used to easily characterize the sample complexity behavior for interesting distributions, to compare L2 margin minimization to other learning methods, and to improve certain active learning strategies. [sent-614, score-0.257]
85 When ‖X‖∞ ≤ 1, upper bounds on learning with L1 regularization guarantee a sample complexity of O(ln(d)) for an L1-based learning rule (Zhang, 2002). [sent-617, score-0.226]
86 In order to compare this with the sample complexity of L2 regularized learning and establish a gap, one must use a lower bound on the L2 sample complexity. [sent-618, score-0.311]
87 For instance, if each coordinate is a bounded independent sub-Gaussian random variable with a bounded relative moment, we have k1 = ⌈d/2⌉ and Theorem 24 implies a lower bound of Ω(d) on the L2 sample complexity. [sent-621, score-0.22]
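Under the fixed-point style characterization sketched after sentence 39 (an assumption about the exact form of kγ; the paper's Definition 6 is authoritative), the k₁ = ⌈d/2⌉ value quoted here can be checked directly for d independent coordinates with unit second moment and γ = 1:

```latex
k_{1} \;=\; \min\Big\{k : k \ge \tfrac{1}{\gamma^{2}}\sum_{i>k}\lambda_{i}\Big\}
\;=\; \min\{k : k \ge d-k\}
\;=\; \lceil d/2 \rceil
\qquad(\lambda_{1}=\cdots=\lambda_{d}=1,\ \gamma=1).
```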
88 A popular approach to active learning involves estimating the current set of possible classifiers using sample complexity upper bounds (see, e. [sent-632, score-0.261]
89 Thus, our sample complexity upper bounds can be used to improve the active learner’s label complexity. [sent-640, score-0.261]
90 Moreover, the lower bound suggests that any further improvement of such active learning strategies would require information beyond the distribution’s covariance matrix. [sent-641, score-0.221]
91 The first inequality follows since the ramp loss is upper bounded by the margin loss. [sent-657, score-0.621]
92 Therefore for all y ∈ {±1}k there exists a zy ∈ Z such that ∀i ∈ [k], sign( f (xi ) + zy (xi ) − f (xi ) − r[i]) = y[i]. [sent-678, score-0.238]
93 If y[i] = 1 then f (xi ) + zy (xi ) − f (xi ) − r[i] > 0 It follows that f (xi ) + zy (xi ) > f (xi ) + r[i] > 0, thus f (xi ) + zy (xi ) > f (xi ) + r[i]. [sent-683, score-0.357]
94 It follows that f (xi ) + zy (xi ) < f (xi ) + r[i] < 1, thus f (xi ) + zy (xi ) < f (xi ) + r[i]. [sent-686, score-0.238]
95 For each y ∈ {±1}m , select ry ∈ Rm such that for all i ∈ [m], ry [i]y[i] ≥ γ. [sent-697, score-0.238]
96 Thus, for all y ∈ {±1} there exist αy, βy ≥ 0 such that ∑y∈Y+ αy = ∑y∈Y− βy = 1, and z̄ = ∑y∈Y+ αy ry = ∑y∈Y− βy ry. [sent-715, score-0.238]
97 y∈Y− y∈Y+ Let za = ∑y∈Y+ αy ry and zb = ∑y∈Y− βy ry We have that ∀y ∈ Y+ , ry [m] ≥ γ, and ∀y ∈ Y− , ry [m] ≤ −γ. [sent-716, score-0.571]
98 Since S is shattered, for any y ∈ {±1}m there is an ry ∈ L such that ∀i ∈ [m], ry [i]y[i] ≥ γ. [sent-726, score-0.238]
99 5 Proof of Lemma 20 Proof [of Lemma 20] It suffices to consider diagonal moment matrices: If B is not diagonal, let V ∈ Rd×d be an orthogonal matrix such that VBVT is diagonal, and let Y = VX. [sent-731, score-0.257]
100 To bound the second term in line (15), since the Yij are sub-Gaussian with moment ρ, E[Yij⁴] ≤ 5ρ⁴ (Buldygin and Kozachenko, 1998, Lemma 1. [sent-876, score-0.227]
wordName wordTfidf (topN-words)
[('ramp', 0.476), ('mem', 0.298), ('xxt', 0.254), ('ishby', 0.174), ('istribution', 0.174), ('rebro', 0.174), ('ln', 0.165), ('ependent', 0.149), ('argin', 0.149), ('abato', 0.149), ('omplexity', 0.149), ('dx', 0.145), ('moment', 0.143), ('ry', 0.119), ('zy', 0.119), ('arge', 0.109), ('ample', 0.109), ('exp', 0.109), ('trace', 0.099), ('yx', 0.099), ('rd', 0.098), ('margin', 0.093), ('shattering', 0.093), ('rx', 0.089), ('bound', 0.084), ('hypothesis', 0.081), ('sg', 0.08), ('ov', 0.079), ('cm', 0.079), ('sm', 0.075), ('conv', 0.074), ('complexity', 0.073), ('sx', 0.073), ('tight', 0.067), ('rademacher', 0.066), ('bd', 0.066), ('wy', 0.066), ('lemma', 0.062), ('ex', 0.061), ('earning', 0.06), ('covariance', 0.06), ('sabato', 0.06), ('rm', 0.059), ('yt', 0.058), ('rudelson', 0.058), ('sample', 0.056), ('upper', 0.052), ('shattered', 0.05), ('wb', 0.05), ('yyt', 0.049), ('zb', 0.049), ('gram', 0.049), ('dm', 0.048), ('argminh', 0.046), ('vbvt', 0.046), ('za', 0.046), ('bounds', 0.045), ('xt', 0.045), ('origin', 0.045), ('separator', 0.045), ('distributions', 0.043), ('lower', 0.042), ('bartlett', 0.041), ('hausdorff', 0.041), ('xi', 0.04), ('uncentered', 0.04), ('witness', 0.04), ('nm', 0.039), ('let', 0.039), ('psd', 0.039), ('relative', 0.038), ('vapnik', 0.038), ('ers', 0.038), ('dimension', 0.037), ('class', 0.037), ('theorem', 0.037), ('fg', 0.036), ('learnability', 0.036), ('db', 0.036), ('sivan', 0.036), ('matrix', 0.036), ('active', 0.035), ('ps', 0.035), ('buldygin', 0.035), ('yw', 0.035), ('mendelson', 0.034), ('invertible', 0.034), ('covering', 0.032), ('xm', 0.032), ('distribution', 0.032), ('excess', 0.031), ('px', 0.031), ('argmin', 0.03), ('norm', 0.03), ('eigenvalue', 0.03), ('min', 0.03), ('sst', 0.03), ('proof', 0.029), ('classi', 0.029), ('eigenvalues', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
2 0.13139461 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
3 0.12812057 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
4 0.10678474 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
5 0.085158519 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
6 0.078946285 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
7 0.071074739 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
8 0.061164327 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
9 0.058266461 68 jmlr-2013-Machine Learning with Operational Costs
10 0.057138912 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
11 0.053801466 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
12 0.052981269 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
13 0.051370848 76 jmlr-2013-Nonparametric Sparsity and Regularization
14 0.049396474 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
15 0.048604209 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
16 0.047048487 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
17 0.046492621 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
18 0.04588747 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
19 0.044266209 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
20 0.043854021 8 jmlr-2013-A Theory of Multiclass Boosting
topicId topicWeight
[(0, -0.236), (1, 0.127), (2, 0.084), (3, 0.096), (4, -0.043), (5, -0.179), (6, 0.114), (7, -0.081), (8, -0.039), (9, -0.047), (10, 0.06), (11, 0.08), (12, -0.019), (13, -0.019), (14, -0.017), (15, 0.08), (16, 0.003), (17, -0.086), (18, 0.039), (19, -0.025), (20, -0.011), (21, 0.089), (22, 0.027), (23, 0.158), (24, 0.021), (25, 0.055), (26, -0.118), (27, -0.26), (28, -0.028), (29, -0.024), (30, 0.178), (31, 0.022), (32, 0.071), (33, 0.141), (34, 0.119), (35, -0.026), (36, -0.06), (37, 0.145), (38, -0.13), (39, 0.084), (40, 0.064), (41, 0.036), (42, 0.03), (43, 0.136), (44, -0.08), (45, 0.07), (46, -0.068), (47, 0.079), (48, 0.009), (49, 0.11)]
simIndex simValue paperId paperTitle
same-paper 1 0.92630476 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
2 0.72586828 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
3 0.56338292 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
4 0.52947432 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
5 0.41735244 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
6 0.4078896 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
7 0.3940388 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
8 0.38829949 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
9 0.36602953 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
10 0.35232311 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
11 0.34960356 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
12 0.34071577 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
13 0.32319745 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
14 0.32211724 68 jmlr-2013-Machine Learning with Operational Costs
15 0.31379828 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
16 0.30772287 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
17 0.28718624 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
19 0.27714932 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
20 0.2769942 76 jmlr-2013-Nonparametric Sparsity and Regularization
topicId topicWeight
[(0, 0.012), (5, 0.119), (6, 0.021), (10, 0.05), (20, 0.01), (23, 0.021), (68, 0.012), (70, 0.034), (75, 0.03), (85, 0.558), (87, 0.019), (93, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.89353848 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
2 0.88839531 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
3 0.53578305 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
Author: Alon Gonen, Sivan Sabato, Shai Shalev-Shwartz
Abstract: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings. Keywords: active learning, linear classifiers, margin, adaptive sub-modularity
4 0.51346713 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
5 0.48720926 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
6 0.47407249 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
7 0.44484532 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
8 0.44309419 21 jmlr-2013-Classifier Selection using the Predicate Depth
9 0.43471611 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
10 0.42415702 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
11 0.4234243 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
12 0.41764316 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
13 0.41155946 32 jmlr-2013-Differential Privacy for Functions and Functional Data
14 0.40665522 114 jmlr-2013-The Rate of Convergence of AdaBoost
15 0.40532723 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
16 0.40118411 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
17 0.39707994 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
18 0.39616519 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
19 0.39310789 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
20 0.39177772 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion