nips nips2010 nips2010-270 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. [sent-6, score-0.675]
2 The bounds hold for a rich family of sub-Gaussian distributions. [sent-7, score-0.27]
3 1 Introduction In this paper we tackle the problem of obtaining a tight characterization of the sample complexity that a particular learning rule requires in order to learn a particular source distribution. [sent-8, score-0.797]
4 Specifically, we obtain a tight characterization of the sample complexity required for large (Euclidean) margin learning to obtain low error for a distribution D(X, Y), with X ∈ R^d and Y ∈ {±1}. [sent-9, score-0.937]
5 That is, we provide a bound m(D, ǫ) and prove that, when using some specific learning rule, if the sample size is at least m(D, ǫ), an excess error of at most ǫ (in expectation or with high probability) can be ensured. [sent-11, score-0.495]
6 For instance, for large-margin classification we know that if PD[‖X‖ ≤ B] = 1, then m(D, ǫ) can be set to O(B²/(γ²ǫ²)) to get true error of no more than ℓ*γ + ǫ, where ℓ*γ = min‖w‖≤1 PD(Y⟨w, X⟩ ≤ γ) is the optimal margin error at margin γ. [sent-12, score-0.318]
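Written out as a display (this is only a restatement of the preceding sentence), the classical norm-based guarantee and the optimal margin error are:

\[
  m(D,\epsilon) \;=\; O\!\Big(\tfrac{B^2}{\gamma^2\epsilon^2}\Big),
  \qquad
  \ell^{*}_{\gamma} \;=\; \min_{\|w\|\le 1} P_D\big( Y\langle w, X\rangle \le \gamma \big).
\]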
7 Such upper bounds can be useful for understanding positive aspects of a learning rule. [sent-13, score-0.284]
8 But it is difficult to understand deficiencies of a learning rule, or to compare between different rules, based on upper bounds alone. [sent-14, score-0.284]
9 Of course, some sample complexity upper bounds are known to be “tight” or to have an almost-matching lower bound. [sent-18, score-0.882]
10 This usually means that the bound is tight as a worst-case upper bound for a specific class of distributions (e.g., all those with bounded support). [sent-19, score-0.75]
11 That is, there exists some source distribution for which the bound is tight. [sent-22, score-0.274]
12 In other words, the bound concerns some quantity of the distribution (e.g., the radius of its support, or its average norm). [sent-23, score-0.302]
13 But this is not to say that for any specific distribution this quantity tightly characterizes the sample complexity. [sent-26, score-0.441]
14 For instance, we know that the sample complexity can be much smaller than suggested by the radius of the support of X, if the average norm E[‖X‖²] is small. [sent-27, score-0.518]
15 However, E[‖X‖²] is also not a precise characterization of the sample complexity, for instance in low dimensions. [sent-28, score-0.313]
16 The goal of this paper is to identify a simple quantity determined by the distribution that does precisely characterize the sample complexity. [sent-29, score-0.412]
17 That is, such that the actual sample complexity for the learning rule on this specific distribution is governed, up to polylogarithmic factors, by this quantity. [sent-30, score-0.571]
18 We show that for a rich family of “light-tailed” distributions (specifically, sub-Gaussian distributions with independent uncorrelated directions – see Section 2), the number of samples required for learning by minimizing the γ-margin violations is both lower-bounded and upper-bounded by Θ̃(kγ). [sent-33, score-0.304]
19 More precisely, we show that the sample complexity m(ǫ, γ, D) required for achieving excess error of no more than ǫ can be bounded from above and from below by: Ω(kγ(D)) ≤ m(ǫ, γ, D) ≤ Õ(kγ(D)/ǫ²). [sent-34, score-0.557]
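In display form, the two-sided characterization stated above is:

\[
  \Omega\big(k_\gamma(D)\big) \;\le\; m(\epsilon,\gamma,D) \;\le\; \tilde{O}\!\Big(\tfrac{k_\gamma(D)}{\epsilon^2}\Big).
\]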
20 As can be seen in this bound, we are not concerned with tightly characterizing the dependence of the sample complexity on the desired error [as done, e.g., in 1], [sent-35, score-0.633]
21 nor with obtaining tight bounds for very small error levels. [sent-37, score-0.426]
22 In fact, our results can be interpreted as studying the sample complexity needed to obtain error well below random, but bounded away from zero. [sent-38, score-0.557]
23 As was recently shown by Liang and Srebro [2], the quantities on which the sample complexity depends for very small ǫ (in the classical statistics asymptotic regime) can be very different from those for moderate error rates, which are more relevant for machine learning. [sent-40, score-0.558]
24 Our tight characterization, and in particular the distribution-specific lower bound on the sample complexity that we establish, can be used to compare large-margin (L2 regularized) learning to other learning rules. [sent-41, score-0.941]
25 In Section 7 we provide two such examples: we use our lower bound to rigorously establish a sample complexity gap between L1 and L2 regularization previously studied in [3], and to show a large gap between discriminative and generative learning on a Gaussian-mixture distribution. [sent-42, score-1.016]
26 But in order to obtain the distribution-specific lower bound, we develop novel tools that we believe can be useful for obtaining lower bounds for other learning rules as well. [sent-44, score-0.468]
27 Related work Most work on “sample complexity lower bounds” is directed at proving that under some set of assumptions, there exists a source distribution for which one needs at least a certain number of examples to learn with required error and confidence [4, 5, 6]. [sent-45, score-0.553]
28 This type of lower bound does not, however, indicate much about the sample complexity of other distributions under the same set of assumptions. [sent-46, score-0.857]
29 The essential condition is that the ǫ-entropy of the hypothesis class with respect to the distribution be sub-linear in the limit of an infinite sample size. [sent-49, score-0.381]
30 Benedek and Itai [8] show that if the distribution is known to the learner, a specific hypothesis class is learnable if and only if there is a finite ǫ-cover of this hypothesis class with respect to the distribution. [sent-53, score-0.282]
31 The authors of [9] consider a similar setting, and prove sample complexity lower bounds for learning with any data distribution, for some binary hypothesis classes on the real line. [sent-55, score-0.821]
32 In both of these works, the lower bounds hold for any algorithm, but only for a worst-case target hypothesis. [sent-56, score-0.302]
33 Vayatis and Azencott [10] provide distribution-specific sample complexity upper bounds for hypothesis classes with a limited VC-dimension, as a function of how balanced the hypotheses are with respect to the considered distributions. [sent-57, score-0.803]
34 These bounds are not tight for all distributions, and thus this work also does not provide a true distribution-specific sample complexity. [sent-58, score-0.577]
35 For a sample S = {(xi, yi)}_{i=1}^m such that (xi, yi) ∈ R^d × {±1}, the margin loss with respect to S is denoted by ℓ̂γ(w, S) ≜ (1/m)|{i | yi⟨xi, w⟩ ≤ γ}| and the misclassification error is ℓ̂(w, S) ≜ (1/m)|{i | yi⟨xi, w⟩ ≤ 0}|. [sent-65, score-0.613]
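As a concrete illustration of these empirical quantities, here is a minimal numpy sketch; the function names and the array-based representation of the sample are ours, not the paper's:

    import numpy as np

    def margin_loss(w, X, y, gamma):
        # Empirical gamma-margin loss: fraction of points with margin y_i * <x_i, w> <= gamma.
        margins = y * (X @ w)          # X: (m, d) sample matrix, y: (m,) labels in {-1, +1}
        return float(np.mean(margins <= gamma))

    def misclassification_error(w, X, y):
        # Empirical 0-1 error; equals the margin loss at gamma = 0.
        return margin_loss(w, X, y, 0.0)

For example, margin_loss(w, X, y, 0.1) returns the fraction of training points whose signed margin is at most 0.1.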
36 A margin-error minimization algorithm A is an algorithm whose input is a margin γ, a training sample S = {(xi, yi)}_{i=1}^m, and an unlabeled test sample S̃X = {x̃i}_{i=1}^m, and whose output is a classifier that minimizes the margin loss ℓ̂γ(w, S). [sent-70, score-0.694]
37 We will be concerned with the expected test loss of the algorithm given a random training sample and a random test sample, each of size m, and define ℓm(Aγ, D) ≜ E_{S,S̃∼D^m}[ℓ̂(A(S, S̃X), S̃)], where S, S̃ ∼ D^m independently. [sent-72, score-0.274]
38 For γ > 0, ǫ ∈ [0, 1], and a distribution D, we denote the distribution-specific sample complexity by m(ǫ, γ, D): this is the minimal sample size such that for any margin-error minimization algorithm A, and for any m ≥ m(ǫ, γ, D), ℓm(Aγ, D) − ℓ*γ(D) ≤ ǫ. [sent-73, score-0.803]
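Collected in display form, the two definitions above read:

\[
  \ell_m(\mathcal{A}_\gamma, D) \;\triangleq\; \mathbb{E}_{S,\tilde S \sim D^m}\Big[\hat{\ell}\big(\mathcal{A}(S, \tilde S_X), \tilde S\big)\Big],
\]
\[
  m(\epsilon,\gamma,D) \;\triangleq\; \min\Big\{ m_0 : \ \forall \mathcal{A},\ \forall m \ge m_0,\ \ \ell_m(\mathcal{A}_\gamma, D) - \ell^{*}_{\gamma}(D) \le \epsilon \Big\}.
\]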
39 Sub-Gaussian distributions We will characterize the distribution-specific sample complexity in terms of the covariance of X ∼ DX. [sent-74, score-0.667]
40 A distribution DX over X ∈ R^d is independently sub-Gaussian with relative moment ρ if there exists some orthonormal basis a1, . . . , ad in which the coordinates of X are independent sub-Gaussian random variables, each with relative moment at most ρ. [sent-91, score-0.285]
41 We will focus on the family Dρ^sg of all independently ρ-sub-Gaussian distributions in arbitrary dimension, for a small fixed constant ρ. [sent-95, score-0.337]
42 Our upper bounds and lower bounds will be tight up to quantities which depend on ρ, which we will regard as a constant, but the tightness will not depend on the dimensionality of the space or the variance of the distribution. [sent-97, score-0.834]
43 3 The γ-adapted-dimension As mentioned in the introduction, the sample complexity of margin-error minimization can be upper-bounded in terms of the average norm E[‖X‖²] by m(ǫ, γ, D) ≤ O(E[‖X‖²]/(γ²ǫ²)) [13]. [sent-98, score-0.558]
44 Thus, although both of these bounds are tight in the worst-case sense, i.e., each is tight for some specific distribution, neither is tight for all distributions. [sent-100, score-0.27]
45 Seeking a distribution-specific tight analysis, one simple option to try to tighten these bounds is to consider their minimum, min(d, E[‖X‖²]/γ²)/ǫ², which, trivially, is also an upper bound on the sample complexity. [sent-104, score-0.861]
46 We will show that in such situations the sample complexity is characterized not by the minimum of dimension and norm, but by the sum of the number of high-variance dimensions and the average norm in the other directions. [sent-106, score-0.601]
47 A distribution DX over R^d is (b, k)-limited if there exists a sub-space V ⊆ R^d of dimension d − k such that E_{X∼DX}[‖X′P‖²] ≤ b, with P an orthogonal projection onto V. [sent-113, score-0.312]
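The γ-adapted-dimension kγ is defined in terms of this notion; its exact defining sentence is not reproduced in the excerpt above. A natural reading, consistent with the description "the number of high-variance dimensions plus the average norm in the other directions", is that kγ is the smallest k for which DX is (γ²k, k)-limited, i.e. the smallest k such that the total variance outside the top-k eigendirections of the covariance matrix is at most γ²k. A sketch under that assumption:

    import numpy as np

    def gamma_adapted_dimension(cov, gamma):
        # Assumed reading of k_gamma: the smallest k such that the distribution is
        # (gamma^2 * k, k)-limited, i.e. the total variance outside the top-k
        # eigendirections of the covariance matrix is at most gamma^2 * k.
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, largest first
        for k in range(len(lam) + 1):
            if lam[k:].sum() <= gamma**2 * k:
                return k
        return len(lam)

Under this reading, an isotropic distribution in R^d with unit variances and γ = 1 gives kγ = ⌈d/2⌉, while a distribution whose variance is concentrated in a few directions gives a much smaller kγ.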
48 kγ is different in nature from some other quantities used for providing sample complexity bounds in terms of eigenvalues, as in [15], since it is defined based on the eigenvalues of the distribution and not of the sample. [sent-123, score-0.775]
49 In order to relate our upper and lower bounds, it will be useful to relate the γ-adapted-dimension for different margins. [sent-125, score-0.382]
50 We proceed to provide a sample complexity upper bound based on the γ-adapted-dimension. [sent-129, score-0.753]
51 4 A sample complexity upper bound using γ-adapted-dimension In order to establish an upper bound on the sample complexity, we will bound the fat-shattering dimension of the linear functions over a set in terms of the γ-adapted-dimension of the set. [sent-130, score-1.592]
52 Recall that the fat-shattering dimension is a classic quantity for proving sample complexity upper bounds: Definition 4. [sent-131, score-0.774]
53 The sample complexity of γ-loss minimization is bounded by Õ(dγ/8 / ǫ²), where dγ/8 is the γ/8-fat-shattering dimension of the function class [16, Theorem 13. [sent-142, score-0.671]
54 For any ǫ > 0 we can augment X with an additional column to form the matrix X̃ of dimensions m × (d + 1), such that for all y ∈ {−γ, +γ}^m, there is a wy ∈ R^{d+1} with ‖wy‖ ≤ 1 + ǫ such that X̃wy = y (the details can be found in the appendix). [sent-151, score-0.301]
55 For the shattered set, we analyze the projected rows of X̃, and then proceed similarly to the proof of the norm-only fat-shattering bound [17]. [sent-155, score-0.357]
56 We have, for all i ≤ m, ⟨zi, wy⟩ = ⟨ti, y⟩ = Σj≤m ti[j]y[j]. [sent-165, score-0.506]
57 Therefore ⟨Σi zi y[i], wy⟩ = Σi≤m Σj≤(l+k) ti[j]y[i]y[j]. [sent-166, score-0.419]
58 Since ‖wy‖ ≤ 1 + ǫ, ∀x ∈ R^{d+1}, (1 + ǫ)‖x‖ ≥ ‖x‖‖wy‖ ≥ ⟨x, wy⟩. [sent-167, score-0.756]
59 However, since sub-Gaussian variables have an exponentially decaying tail, we can use this corollary to provide a bound for independently sub-Gaussian distributions as well (see appendix for proof): Theorem 4. [sent-179, score-0.46]
60 m(ǫ, γ, D) = Õ(kγ(D)/ǫ²). This new upper bound is tighter than norm-only and dimension-only upper bounds. [sent-182, score-0.395]
61 But does the γ-adapted-dimension characterize the true sample complexity of the distribution, or is it just another upper bound? [sent-183, score-0.629]
62 To answer this question, we need to be able to derive sample complexity lower bounds as well. [sent-184, score-0.771]
63 5 Sample complexity lower bounds using Gram-matrix eigenvalues We wish to find a distribution-specific lower bound that depends on the γ-adapted-dimension, and matches our upper bound as closely as possible. [sent-186, score-1.196]
64 The ability to learn is closely related to the probability that a sample is shattered, as evident from Vapnik’s formulations of learnability as a function of the ǫ-entropy. [sent-188, score-0.335]
65 For the lower bound we use the converse fact, presented below in Theorem 5. [sent-190, score-0.341]
66 We then relate the fat-shattering of a sample to the minimal eigenvalue of its Gram matrix. [sent-192, score-0.391]
67 This allows us to present a lower-bound on the sample complexity using a lower bound on the smallest eigenvalue of the Gram-matrix of a sample drawn from the data distribution. [sent-193, score-1.182]
68 If the probability that a sample of size m drawn from DX is γ-shattered at the origin is at least η, then there is a margin-error minimization algorithm A such that ℓm/2(Aγ, D) ≥ η/2. [sent-200, score-0.361]
69 Let S = (X1, . . . , Xm) be a sample in R^d, and denote by X the m×d matrix whose rows are the elements of S. [sent-213, score-0.336]
70 We can now use the minimum eigenvalue of the Gram matrix to obtain a sufficient condition for fat-shattering, after which we present the theorem linking eigenvalues and learnability. [sent-217, score-0.336]
71 Let S be a sample of size m drawn from D, and denote by XS the m × d matrix whose rows are the points from S. [sent-236, score-0.381]
72 For unregularized (homogeneous) linear separation, a sample is shattered iff it is linearly independent, i.e., iff the smallest eigenvalue of its Gram matrix is positive. [sent-246, score-0.434]
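To make the role of the smallest Gram-matrix eigenvalue concrete, here is a small numpy sketch of a sufficient condition for γ-shattering at the origin. It uses the observation that if λmin(XX′) ≥ mγ², then for every labeling y ∈ {−γ, +γ}^m the minimum-norm solution wy of Xw = y satisfies ‖wy‖² = y′(XX′)⁻¹y ≤ ‖y‖²/λmin(XX′) ≤ 1. The constants in the paper's actual condition may differ; this is only meant to convey the mechanism.

    import numpy as np

    def gamma_shattered_at_origin(X, gamma):
        # Sufficient condition for the rows of X (an m x d sample matrix) to be
        # gamma-shattered at the origin by unit-norm linear predictors:
        # the smallest eigenvalue of the Gram matrix XX' is at least m * gamma^2.
        m = X.shape[0]
        lam_min = np.linalg.eigvalsh(X @ X.T)[0]
        return bool(lam_min >= m * gamma**2)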
73 Recall that our upper bound on the sample complexity from Section 4 was Õ(kγ). [sent-255, score-0.469]
74 The remaining question is whether we can relate mγ and kγ, to establish that our lower bound and upper bound tightly specify the sample complexity. [sent-256, score-1.034]
75 6 A lower bound for independently sub-Gaussian distributions As discussed in the previous section, to obtain a sample complexity lower bound we require a lower bound on the smallest eigenvalue of a random Gram matrix. [sent-257, score-1.512]
76 This asymptotic limit can be used to calculate mγ and thus provide a lower bound on the sample complexity: Let the coordinates of X ∈ R^d be i.i.d. [sent-267, score-0.668]
77 with variance σ² and consider a sample of size m. [sent-270, score-0.27]
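The asymptotic limit referred to above is not reproduced in the excerpt; for i.i.d. coordinates with variance σ² it is presumably a Bai–Yin-type statement, under which the smallest eigenvalue of the m×m Gram matrix XX′ concentrates around σ²(√d − √m)² when m and d grow proportionally with d > m. A quick numerical sanity check of that behaviour (Gaussian coordinates, fixed seed; the setup is ours):

    import numpy as np

    rng = np.random.default_rng(0)
    m, d, sigma = 500, 2000, 1.0

    X = sigma * rng.standard_normal((m, d))    # m points with i.i.d. coordinates
    lam_min = np.linalg.eigvalsh(X @ X.T)[0]   # smallest Gram-matrix eigenvalue

    predicted = sigma**2 * (np.sqrt(d) - np.sqrt(m))**2
    print(lam_min, predicted)                  # the two values should be of the same order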
78 In this case, then, we are indeed able to relate the sample complexity lower bound with kγ , the same quantity that controls our upper bound. [sent-274, score-1.022]
79 Since this asymptotic result holds for each distribution separately, we cannot deduce from it any finite-sample lower bounds for families of distributions. [sent-277, score-0.416]
80 For our analysis we require finite-sample bounds for the smallest eigenvalue of a random Gram matrix. [sent-278, score-0.305]
81 Rudelson and Vershynin [19, 20] provide such finite-sample lower bounds for distributions with identically distributed sub-Gaussian coordinates. [sent-279, score-0.388]
82 This leads to the theorem stated below, which constitutes our final sample complexity lower bound. [sent-284, score-0.598]
83 For any ρ > 0, there are a constant β > 0 and an integer L0 such that for any D such that DX ∈ Dρ^sg and kγ(DX) > L0, for any margin γ > 0 and any ǫ < 1/4 − ℓ*γ(D), m(ǫ, γ, D) ≥ βkγ(DX). [sent-291, score-0.266]
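Rewritten as a display (a restatement of the theorem statement above):

\[
  \forall \rho > 0 \;\; \exists \beta > 0,\ L_0 \in \mathbb{N}: \quad
  D_X \in \mathcal{D}^{\mathrm{sg}}_{\rho},\ \ k_\gamma(D_X) > L_0,\ \ \epsilon < \tfrac{1}{4} - \ell^{*}_{\gamma}(D)
  \;\Longrightarrow\;
  m(\epsilon,\gamma,D) \;\ge\; \beta\, k_\gamma(D_X).
\]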
84 We use this finite-sample eigenvalue bound to bound the smallest eigenvalue of XX′ with high probability, so that we can then apply Theorem 5. [sent-314, score-0.305]
85 The random matrix X√Σ1 is drawn from an independently sub-Gaussian distribution, such that each of its coordinates has sub-Gaussian moment ρ and covariance matrix Σ · Σ1 ≤ Id. [sent-329, score-0.479]
86 Then the random matrix X√Σ2 is drawn from an independently sub-Gaussian distribution with covariance matrix Σ · Σ2 ≤ Id, such that all its coordinates have sub-Gaussian moment ρ. [sent-347, score-0.539]
87 Together, these theorems provide an upper bound and a lower bound for the sample complexity of any distribution D whose data distribution is in Dρ^sg for some fixed ρ > 0. [sent-358, score-1.328]
88 This result shows that the true sample complexity of learning each of these distributions is characterized by the γ-adapted-dimension. [sent-361, score-0.555]
89 Since (2) holds for any DY|X, the effect of the direction of the best separator on the sample complexity is bounded, even for highly non-spherical distributions. [sent-363, score-0.604]
90 We can use (2) to easily characterize the sample complexity behavior for interesting distributions, and to compare L2 margin minimization to other learning methods. [sent-365, score-0.671]
91 When ‖X‖∞ ≤ 1, upper bounds on learning with L1 regularization guarantee a sample complexity of O(log(d)) for an L1-based learning rule [21]. [sent-368, score-0.829]
92 In order to compare this with the sample complexity of L2 regularized learning and establish a gap, one must use a lower bound on the L2 sample complexity. [sent-369, score-1.07]
93 However, using our results we can easily establish a lower bound of Ω(d) for many specific distributions with ‖X‖∞ ≤ 1 and Y = X[1] ∈ {±1}. [sent-371, score-0.453]
94 Indeed, we can calculate kγ = ⌈d/(1 + v⁴)⌉, and conclude that the sample complexity required is Θ̃(d/v²). [sent-378, score-0.528]
95 This establishes a rather large gap of Ω(v²) between the sample complexity of the discriminative approach and that of the generative one. [sent-382, score-0.559]
96 To summarize, we have shown that the true sample complexity of large-margin learning of a rich family of specific distributions is characterized by the γ-adapted-dimension. [sent-383, score-0.652]
97 The challenge of characterizing true sample complexity extends to any distribution and any learning algorithm. [sent-385, score-0.529]
98 A general lower bound on the number of examples needed for learning. [sent-418, score-0.302]
99 Improved lower bounds for learning from noisy examples: an information-theoretic approach. [sent-424, score-0.302]
100 Generalization bounds via eigenvalues of the Gram matrix. [sent-478, score-0.301]
wordName wordTfidf (topN-words)
[('dx', 0.262), ('wy', 0.252), ('complexity', 0.235), ('sample', 0.234), ('xx', 0.217), ('sx', 0.196), ('rd', 0.177), ('bounds', 0.173), ('bound', 0.173), ('tight', 0.17), ('sg', 0.153), ('moment', 0.136), ('shattered', 0.131), ('lower', 0.129), ('theorem', 0.128), ('margin', 0.113), ('rudelson', 0.112), ('upper', 0.111), ('learnability', 0.101), ('xs', 0.1), ('distributionspeci', 0.096), ('sivan', 0.096), ('trace', 0.095), ('coordinates', 0.089), ('ti', 0.087), ('distributions', 0.086), ('eigenvalue', 0.086), ('xm', 0.086), ('shattering', 0.084), ('bd', 0.084), ('dimension', 0.083), ('separator', 0.081), ('zi', 0.08), ('characterization', 0.079), ('pd', 0.078), ('tightly', 0.078), ('eigenvalues', 0.073), ('relate', 0.071), ('quantity', 0.069), ('unregularized', 0.069), ('dm', 0.068), ('establish', 0.065), ('benedek', 0.064), ('limi', 0.064), ('vershynin', 0.064), ('xwy', 0.064), ('covariance', 0.063), ('distribution', 0.06), ('conclude', 0.059), ('margins', 0.057), ('vayatis', 0.056), ('gap', 0.056), ('gram', 0.055), ('holds', 0.054), ('vapnik', 0.054), ('rows', 0.053), ('argminw', 0.052), ('orthogonal', 0.051), ('family', 0.05), ('hypothesis', 0.05), ('matrix', 0.049), ('characterize', 0.049), ('norm', 0.049), ('learnable', 0.048), ('independently', 0.048), ('rich', 0.047), ('id', 0.046), ('smallest', 0.046), ('spherical', 0.046), ('error', 0.046), ('drawn', 0.045), ('dy', 0.044), ('asymptotic', 0.043), ('rule', 0.042), ('proving', 0.042), ('origin', 0.042), ('dimensionality', 0.042), ('bounded', 0.042), ('coordinate', 0.041), ('exists', 0.041), ('rotated', 0.04), ('minimization', 0.04), ('concerned', 0.04), ('projection', 0.04), ('converse', 0.039), ('yi', 0.038), ('class', 0.037), ('onto', 0.037), ('gaps', 0.037), ('liang', 0.037), ('diag', 0.037), ('obtaining', 0.037), ('variance', 0.036), ('unlabeled', 0.035), ('directions', 0.035), ('xi', 0.034), ('chicago', 0.034), ('israel', 0.034), ('regularization', 0.034), ('generative', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999863 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
2 0.21646848 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(HRn² + √(HL∗)Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
3 0.18856619 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
Author: Hariharan Narayanan, Sanjoy Mitter
Abstract: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume, and curvature is unavoidable. Whether the known lower bound of O(k/ǫ² + log(1/δ)/ǫ²) for the sample complexity of Empirical Risk minimization on k−means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997 [3]. Here ǫ is the desired bound on the error and δ is a bound on the probability of failure. We improve the best currently known upper bound [14] of O(k²/ǫ² + log(1/δ)/ǫ²) to O((k/ǫ²) min(k, log⁴(k/ǫ)) + log(1/δ)/ǫ²). Based on these results, we devise a simple algorithm for k−means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension. 1
4 0.15110281 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
5 0.13289623 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
6 0.12384148 63 nips-2010-Distributed Dual Averaging In Networks
7 0.11730785 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
8 0.1167245 99 nips-2010-Gated Softmax Classification
9 0.11614726 191 nips-2010-On the Theory of Learnining with Privileged Information
10 0.11192702 142 nips-2010-Learning Bounds for Importance Weighting
11 0.11133515 148 nips-2010-Learning Networks of Stochastic Differential Equations
12 0.11120507 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
13 0.10970324 223 nips-2010-Rates of convergence for the cluster tree
14 0.10732189 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
15 0.10171003 27 nips-2010-Agnostic Active Learning Without Constraints
16 0.10120932 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points
17 0.099101856 265 nips-2010-The LASSO risk: asymptotic results and real world examples
18 0.098594971 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
19 0.095496155 216 nips-2010-Probabilistic Inference and Differential Privacy
20 0.092625193 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
topicId topicWeight
[(0, 0.268), (1, 0.03), (2, 0.222), (3, 0.055), (4, 0.076), (5, 0.084), (6, -0.07), (7, -0.041), (8, -0.066), (9, -0.006), (10, 0.006), (11, -0.167), (12, -0.103), (13, -0.052), (14, -0.12), (15, 0.008), (16, 0.159), (17, -0.064), (18, 0.06), (19, -0.051), (20, 0.09), (21, -0.04), (22, 0.001), (23, -0.009), (24, 0.026), (25, 0.028), (26, -0.09), (27, -0.032), (28, 0.045), (29, -0.066), (30, 0.048), (31, -0.01), (32, -0.015), (33, -0.014), (34, 0.021), (35, -0.014), (36, 0.057), (37, 0.068), (38, -0.038), (39, 0.036), (40, 0.014), (41, -0.064), (42, -0.015), (43, 0.003), (44, -0.106), (45, 0.13), (46, 0.042), (47, -0.012), (48, -0.087), (49, 0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.98615068 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
2 0.83362597 191 nips-2010-On the Theory of Learnining with Privileged Information
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with the privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes into account the privileged information in order to find a good function in the decision space. We outline the conditions on the correcting space that, if satisfied, allow Privileged ERM to have much faster learning rate in the decision space than the one of the regular empirical risk minimization. 1
3 0.82251555 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
Author: Hariharan Narayanan, Sanjoy Mitter
Abstract: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume, and curvature is unavoidable. Whether the known lower bound of O(k/ǫ² + log(1/δ)/ǫ²) for the sample complexity of Empirical Risk minimization on k−means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997 [3]. Here ǫ is the desired bound on the error and δ is a bound on the probability of failure. We improve the best currently known upper bound [14] of O(k²/ǫ² + log(1/δ)/ǫ²) to O((k/ǫ²) min(k, log⁴(k/ǫ)) + log(1/δ)/ǫ²). Based on these results, we devise a simple algorithm for k−means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension. 1
4 0.77626365 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(HRn² + √(HL∗)Rn) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity Rn, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
5 0.76399934 142 nips-2010-Learning Bounds for Importance Weighting
Author: Corinna Cortes, Yishay Mansour, Mehryar Mohri
Abstract: This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.
6 0.7471478 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
7 0.67695433 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
8 0.66380358 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
9 0.65500873 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
10 0.64259046 233 nips-2010-Scrambled Objects for Least-Squares Regression
11 0.63802928 220 nips-2010-Random Projection Trees Revisited
12 0.61199445 148 nips-2010-Learning Networks of Stochastic Differential Equations
13 0.60589564 27 nips-2010-Agnostic Active Learning Without Constraints
14 0.59660369 134 nips-2010-LSTD with Random Projections
15 0.57694477 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
16 0.57484692 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
17 0.54625016 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
18 0.54311371 223 nips-2010-Rates of convergence for the cluster tree
19 0.53359848 222 nips-2010-Random Walk Approach to Regret Minimization
20 0.53226024 202 nips-2010-Parallelized Stochastic Gradient Descent
topicId topicWeight
[(13, 0.048), (17, 0.013), (27, 0.032), (30, 0.152), (35, 0.024), (39, 0.015), (45, 0.157), (50, 0.076), (51, 0.019), (52, 0.043), (60, 0.084), (72, 0.13), (77, 0.053), (78, 0.037), (90, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.89871627 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
2 0.86454654 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points
Author: Meritxell Vinyals, Jes\'us Cerquides, Alessandro Farinelli, Juan A. Rodríguez-aguilar
Abstract: We study worst-case bounds on the quality of any fixed point assignment of the max-product algorithm for Markov Random Fields (MRF). We start providing a bound independent of the MRF structure and parameters. Afterwards, we show how this bound can be improved for MRFs with specific structures such as bipartite graphs or grids. Our results provide interesting insight into the behavior of max-product. For example, we prove that max-product provides very good results (at least 90% optimal) on MRFs with large variable-disjoint cycles1 . 1
3 0.84181064 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
4 0.83815521 129 nips-2010-Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
Author: Dirk Husmeier, Frank Dondelinger, Sophie Lebre
Abstract: Conventional dynamic Bayesian networks (DBNs) are based on the homogeneous Markov assumption, which is too restrictive in many practical applications. Various approaches to relax the homogeneity assumption have recently been proposed, allowing the network structure to change with time. However, unless time series are very long, this flexibility leads to the risk of overfitting and inflated inference uncertainty. In the present paper we investigate three regularization schemes based on inter-segment information sharing, choosing different prior distributions and different coupling schemes between nodes. We apply our method to gene expression time series obtained during the Drosophila life cycle, and compare the predicted segmentation with other state-of-the-art techniques. We conclude our evaluation with an application to synthetic biology, where the objective is to predict a known in vivo regulatory network of five genes in yeast. 1
5 0.82511699 220 nips-2010-Random Projection Trees Revisited
Author: Aman Dhesi, Purushottam Kar
Abstract: The Random Projection Tree (RPT REE) structures proposed in [1] are space partitioning data structures that automatically adapt to various notions of intrinsic dimensionality of data. We prove new results for both the RPT REE -M AX and the RPT REE -M EAN data structures. Our result for RPT REE -M AX gives a nearoptimal bound on the number of levels required by this data structure to reduce the size of its cells by a factor s ≥ 2. We also prove a packing lemma for this data structure. Our final result shows that low-dimensional manifolds have bounded Local Covariance Dimension. As a consequence we show that RPT REE -M EAN adapts to manifold dimension as well.
6 0.8216157 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
7 0.81727242 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
8 0.80233228 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
9 0.79827851 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
10 0.79252595 117 nips-2010-Identifying graph-structured activation patterns in networks
11 0.79085416 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
12 0.7833547 243 nips-2010-Smoothness, Low Noise and Fast Rates
13 0.78233564 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
14 0.78086352 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
15 0.77995372 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
16 0.7793293 148 nips-2010-Learning Networks of Stochastic Differential Equations
17 0.77884674 283 nips-2010-Variational Inference over Combinatorial Spaces
18 0.77883172 265 nips-2010-The LASSO risk: asymptotic results and real world examples
19 0.77615631 280 nips-2010-Unsupervised Kernel Dimension Reduction
20 0.77581769 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models