jmlr jmlr2010 jmlr2010-60 knowledge-graph by maker-knowledge-mining

60 jmlr-2010-Learnability, Stability and Uniform Convergence


Source: pdf

Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan

Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. [sent-10, score-0.755]

2 We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. [sent-12, score-0.421]

3 For supervised classification and regression problems, it is well known that a problem is learnable if and only if the empirical risks F_S(h) = (1/m) ∑_{i=1}^m f(h, z_i) for all h ∈ H converge uniformly to the population risk (Blumer et al. [sent-36, score-0.472]
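As an illustrative sketch (not from the paper), the empirical risk F_S(h) = (1/m) ∑_{i=1}^m f(h, z_i) is just an average of instance losses; here the squared loss and the scalar hypothesis are hypothetical choices for concreteness:

```python
# Sketch: empirical risk F_S(h) = (1/m) * sum_i f(h, z_i).
# The squared loss f and the scalar hypothesis h are illustrative
# assumptions, not definitions from the paper.

def f(h, z):
    """Instance loss: squared error between hypothesis h and instance z."""
    return (h - z) ** 2

def empirical_risk(h, S):
    """F_S(h): average loss of h over the sample S."""
    return sum(f(h, z) for z in S) / len(S)

S = [0.0, 1.0, 2.0, 3.0]
print(empirical_risk(1.5, S))  # average of (1.5 - z)^2 over S
```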

4 If uniform convergence holds, then the empirical risk minimizer (ERM) is consistent, that is, the population risk of the ERM converges to the optimal population risk, and the problem is learnable using the ERM. [sent-39, score-0.545]

5 • A complete understanding of how to learn: since learnability is equivalent to learnability by ERM, we can focus our attention solely on empirical risk minimizers. [sent-42, score-0.515]

6 [Figure: Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable] Other than uniform convergence, certain notions of stability have also been suggested as an explicit condition for learnability. [sent-44, score-0.337]

7 Therefore, stability was shown to characterize learnability only in situations where uniform convergence characterizes learnability anyway. [sent-50, score-0.783]

8 The equivalence of uniform convergence and learnability was formally established only in the supervised classification and regression setting. [sent-51, score-0.335]

9 As for the reverse implication, Vapnik showed that a notion of “non-trivial” or “strict” learnability with the ERM is indeed equivalent to uniform convergence of the empirical risks. [sent-53, score-0.355]

10 This notion was meant to exclude certain “trivial” learning problems, which are learnable without uniform convergence (see Section 3. [sent-54, score-0.338]

11 Thus, it would seem that in the General Learning Setting, as in supervised classification and regression, a problem is learnable if and only if it is learnable by empirical risk minimization. [sent-57, score-0.562]

12 1 we show an example of a learning problem in the General Learning Setting, which is learnable (using an online algorithm and an online-to-batch conversion), but which is not learnable using empirical risk minimization. [sent-60, score-0.533]

13 2 we show a modified example which is learnable using empirical risk minimization, but for which the empirical risks of the hypotheses do not converge uniformly to their expectations, not even locally for hypotheses very close to the true hypothesis. [sent-63, score-0.435]

14 We use this example to discuss how Vapnik’s notion of “strict” learnability with the ERM is too strict, and precludes cases which are far from trivial and in which learnability with empirical risk minimization is not equivalent to uniform convergence. [sent-66, score-0.626]

15 In particular, we show that learnable problems, even when they are not learnable with an ERM, are always learnable with some learning rule which is “asymptotically ERM” (AERM; see the precise definition in Section 2). [sent-68, score-0.766]

16 Namely, we have the following characterization of learnability in the General Learning Setting: [Exists Stable AERM ⇔ Learnable with AERM ⇔ Learnable] Note that this characterization holds even for learnable problems with no uniform convergence. [sent-70, score-0.491]

17 In this sense, stability emerges as a strictly more powerful notion than uniform convergence for characterizing learnability. [sent-71, score-0.379]

18 A learning rule A for which this holds is denoted as a universally consistent learning rule. [sent-136, score-0.36]

19 This definition of learnability, requiring a uniform rate for all distributions, is the relevant notion for studying learnability of a hypothesis class. [sent-137, score-0.394]

20 We say that a rule A is an AERM (Asymptotic Empirical Risk Minimizer) with rate ε_erm(m) under distribution D if E_{S∼D^m}[F_S(A(S)) − F_S(ĥ_S)] ≤ ε_erm(m). A learning rule is universally an AERM with rate ε_erm(m) if it is an AERM with rate ε_erm(m) under all distributions D over Z. [sent-152, score-0.499]
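The AERM condition can be sketched concretely: a rule whose empirical risk is within ε_erm(m) of the empirical risk minimizer's. The finite hypothesis class, the squared loss, and the 1/m slack below are all illustrative assumptions, not the paper's constructions:

```python
# Sketch of the AERM property: F_S(A(S)) - F_S(erm(S)) <= eps_erm(m),
# here with eps_erm(m) = 1/m on a toy finite class (all choices illustrative).

def f(h, z):
    return (h - z) ** 2

def empirical_risk(h, S):
    return sum(f(h, z) for z in S) / len(S)

H = [k / 10 for k in range(11)]          # finite hypothesis class in [0, 1]

def erm(S):
    """Exact empirical risk minimizer over H."""
    return min(H, key=lambda h: empirical_risk(h, S))

def aerm_rule(S):
    """'Almost'-ERM: returns a hypothesis whose empirical risk exceeds the
    minimum by at most 1/m, so the rule is an AERM with eps_erm(m) = 1/m."""
    m = len(S)
    best = empirical_risk(erm(S), S)
    feasible = [h for h in H if empirical_risk(h, S) <= best + 1.0 / m]
    return max(feasible)                 # any feasible choice would do

S = [0.2, 0.4, 0.6]
gap = empirical_risk(aerm_rule(S), S) - empirical_risk(erm(S), S)
assert 0 <= gap <= 1.0 / len(S)          # the AERM gap is within eps_erm(m)
```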

21 A rule universally generalizes with rate εgen (m) if it generalizes with rate εgen (m) under all distributions D over Z . [sent-155, score-0.435]

22 Formally, we say that uniform convergence holds for a learning problem if the empirical risks of hypotheses in the hypothesis class converge uniformly to their population risk, with a distribution-independent rate: sup_D E_{S∼D^m}[sup_{h∈H} |F(h) − F_S(h)|] → 0 as m → ∞. [sent-165, score-0.344]
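A rough Monte Carlo sketch (illustrative, not from the paper) of the uniform-convergence quantity E_{S∼D^m} sup_{h∈H} |F(h) − F_S(h)| for a small finite class, where the distribution, class, and absolute loss are all assumptions chosen so the population risk has a closed form:

```python
# Monte Carlo estimate of E_{S ~ D^m} sup_{h in H} |F(h) - F_S(h)|
# for an illustrative finite class H under D = Uniform[0, 1] and
# absolute loss f(h, z) = |h - z|.

import random

H = [0.0, 0.5, 1.0]

def f(h, z):
    return abs(h - z)

def F(h):
    # Population risk under Uniform[0, 1]: integral of |h - u| du on [0, 1].
    return h * h / 2 + (1 - h) * (1 - h) / 2

def sup_deviation(m, trials=200, seed=0):
    """Average (over resampled S) of the worst-case empirical/population gap."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        S = [rng.random() for _ in range(m)]
        emp = lambda h: sum(f(h, z) for z in S) / m
        total += max(abs(F(h) - emp(h)) for h in H)
    return total / trials

# The expected sup-deviation shrinks as the sample size m grows.
assert sup_deviation(500) < sup_deviation(5)
```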

23 To justify the necessity of uniform convergence even in the General Learning Setting, Vapnik attempted to show that in this setting, learnability with the ERM learning rule is equivalent to uniform convergence (Vapnik, 1998). [sent-175, score-0.503]

24 … f(h′; z) < f(h; z) uniformly for all z), and thus the problem is learnable without uniform convergence or any other property of H. [sent-191, score-0.334]

25 However, as we will see later on in our paper, uniform convergence is in fact not necessary for learning in the General Learning Setting, and stability plays a key role there which has nothing to do with uniform convergence. [sent-221, score-0.4]

26 Moreover, some of these problems are learnable with an ERM (again, without any uniform convergence), and some are not learnable with an ERM, but rather with a different mechanism. [sent-231, score-0.5]

27 Unfortunately, when we turn to infinite-dimensional hypothesis spaces, uniform convergence might not hold and the problem might not be learnable with empirical risk minimization. [sent-245, score-0.439]

28 This mechanism is fundamentally related to the idea of stability, and will be a good starting point for our more general treatment of stability and learnability in the next section of the paper. [sent-248, score-0.46]

29 More formally, Equation (6) implies that the ERM is uniform-RO stable (Definition 4) with rate ε_stable(m) = 4L²/(λm), and therefore Theorem 8 implies that the ERM is consistent with rate at most ε_stable(m), namely E_{S∼D^m}[F(ĥ_S)] − F∗ ≤ 4L²/(λm). [sent-372, score-0.392]
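The replace-one stability of a regularized ERM can be seen on a minimal 1-D problem. The following sketch uses an illustrative strongly convex objective with a closed-form minimizer (the 4L²/(λm) rate quoted above is the paper's; the bound checked here is just the closed-form analogue for this toy objective):

```python
# Sketch: replace-one (RO) stability of a regularized ERM in 1-D.
# Objective: (1/m) * sum_i (h - z_i)^2 + lam * h^2, whose minimizer is
# h_S = mean(S) / (1 + lam). Replacing one instance moves the minimizer
# by at most (range of Z) / (m * (1 + lam)) -- an O(1/(lam*m))-type effect.
# The concrete loss and closed form are illustrative assumptions.

def regularized_erm(S, lam):
    return sum(S) / (len(S) * (1 + lam))

lam = 0.5
S = [0.1, 0.9, 0.4, 0.6]                        # instances in [0, 1]
for i in range(len(S)):
    for z_new in (0.0, 1.0):                    # worst-case replacements
        S_i = S[:i] + [z_new] + S[i + 1:]
        shift = abs(regularized_erm(S_i, lam) - regularized_erm(S, lam))
        assert shift <= 1.0 / (len(S) * (1 + lam)) + 1e-12
```

The point of the sketch is the 1/m factor: the regularizer keeps the objective strongly convex, so swapping any single instance can move the minimizer only a little.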

30 At first glance, this might seem confusing in light of the examples presented above, where we have problems learnable with the ERM without uniform convergence whatsoever. [sent-440, score-0.33]

31 In this section, we study the connection between learnability and stability in greater depth, and show that stability can in fact characterize learnability. [sent-456, score-0.725]

32 We say that a rule is universally stable with rate εstable (m), if the stability property holds with rate εstable (m) for all distributions. [sent-482, score-0.774]

33 Claim 6 Uniform-RO stability with rate εstable (m) implies average-RO stability with rate εstable (m). [sent-483, score-0.562]

34 2 Characterizing Learnability: Main Results. Our overall goal is to characterize learnable problems (namely, problems for which there exists a universally consistent learning rule, as in Equation (2)). [sent-485, score-0.495]

35 In the uniform convergence setting, such a condition is the stability of the ERM (under any of several possible stability measures, including both variants of RO-stability defined above). [sent-487, score-0.605]

36 The most important result in this section is a condition which is necessary and sufficient for learnability in the General Learning Setting: Theorem 7 A learning problem is learnable if and only if there exists a uniform-RO stable, universally AERM learning rule. [sent-489, score-0.642]

37 In particular, if there exists an ε_cons(m)-universally consistent rule, then there exists a rule that is ε_stable(m)-uniform-RO stable and universally ε_erm(m)-AERM, where ε_erm(m) = 3ε_cons(m^(1/4)) + 8B/√m and ε_stable(m) = 2B/√m. [sent-490, score-0.523]

38 However, it turns out that if a universally consistent learning rule exist, then there is another learning rule for the same problem, which is generalizing (Lemma 20). [sent-516, score-0.503]

39 Combined with the results of Lemma 11, this completes the proof of Theorem 8 and the stability → consistency and generalization → consistency parts of Theorem 9. [sent-570, score-0.368]

40 However, as it will turn out, in order to establish that stability is also necessary for universal consistency, we must prove that universal consistency of an AERM implies universal generalization. [sent-574, score-0.451]

41 The assumption of universal consistency for the AERM is crucial here: mere consistency of an AERM with respect to a specific distribution does not imply generalization nor stability with respect to that distribution. [sent-575, score-0.405]

42 Intuitively, it states that if a problem is learnable at all, then although the ERM rule might fail, its empirical risk is a consistent estimator of the minimal achievable risk. [sent-587, score-0.493]

43 Lemma 16 (Main Converse Lemma) If a problem is learnable, namely there exists a universally consistent rule A with rate ε_cons(m), then under any distribution, E[F_S(ĥ_S)] − F∗ ≤ ε_emp(m), (12) where ε_emp(m) = 2ε_cons(m′) + 2B/√m + 2Bm′²/m for any m′ such that 2 ≤ m′ ≤ m/2. [sent-588, score-0.408]

44 Equipped with Lemma 16, we are now ready to show that universal consistency of an AERM implies universal generalization and that any universally consistent and generalizing rule must be an AERM. [sent-600, score-0.57]

45 • In the next subsection (Example 2) we demonstrate a universally consistent learning rule which is neither generalizing nor an AERM. [sent-615, score-0.4]

46 In contrast, for learnable supervised classification problems, it is not possible for a learning rule to be just universally consistent, without being an AERM and without generalization. [sent-617, score-0.547]

47 Nor is it possible for a learning rule to be a universal AERM for a learnable problem, without being generalizing and consistent. [sent-618, score-0.418]

48 We can conclude that there is no universally consistent learning rule for the problem, otherwise the corollary is violated. [sent-621, score-0.36]

49 Existence of a universally average-RO stable AERM is thus sufficient for learnability. [sent-625, score-0.357]

50 In order to prove that it is also necessary, it is enough to show that existence of a universally consistent learning rule implies existence of a universally consistent AERM. [sent-626, score-0.617]

51 We actually show how to transform a consistent rule to a consistent and generalizing rule (Lemma 20 below). [sent-628, score-0.372]

52 Lemma 20 For any rule A there exists a rule A′ , such that: 3B • A′ universally generalizes with rate √m . [sent-630, score-0.469]

53 As a final note, the following example shows that while learnability is equivalent to the existence of stable and consistent AERM’s (Theorem 7 and Theorem 9), there might still exist other learning rules, which are neither stable, nor generalize, nor AERM’s. [sent-642, score-0.453]

54 Randomization, Convexification, and a Generic Learning Algorithm. The strongest result we were able to obtain for characterizing learnability so far is Theorem 7, which stated that a problem is learnable if and only if there exists a universally uniform-RO stable AERM. [sent-666, score-0.804]

55 Compared to Theorem 7, we have replaced universal AERM by the stronger notion of an always AERM, and uniform-RO stability by strongly-uniform-RO stability. [sent-700, score-0.338]

56 • If A is uniform-RO stable with rate ε_stable(m), then A′ is strongly-uniform-RO stable with rate ε_stable(⌊√m⌋). [sent-704, score-0.392]

57 Since the learning rule A is universally consistent, it is in particular consistent with respect to the distribution U(S), and therefore the expectation in the expression above is at most ε_cons(⌊√m⌋). [sent-731, score-0.36]

58 Theorem 25 If a learning problem is learnable (namely, there exists a universally consistent learning rule with rate ε_cons(m)), the learning algorithm described above is universally consistent with rate 4ε_cons(⌊√m⌋) + 8B/√m. [sent-747, score-0.904]

59 In Theorem 9, we have seen that a universally average-RO stable AERM learning rule has to be universally consistent. [sent-752, score-0.36]

60 The inequalities above essentially say that A is in fact both strongly-uniform-RO stable (and in particular, universally average-RO stable) and an AERM, and thus is a universally consistent learning rule. [sent-753, score-0.614]

61 Minimizing a sum of both terms forces us to choose a learning rule which is an “almost”-ERM but also stable: a learning rule which must exist if the problem is learnable at all, as Theorem 23 proves. [sent-764, score-0.59]

62 However, in supervised classification, if we have learnability at all, then we have learnability at rates which are logarithmic in 1/δ. [sent-769, score-0.453]

63 This shows that both learnability and stability (under our definitions) of the ERM learning rule are not sufficient to ensure logarithmic dependence on 1/δ. [sent-774, score-0.563]

64 Theorem 26 Let A be a universally consistent learning rule with rate ε_cons(m), namely E_{S∼D^m}[F(A(S)) − F∗] ≤ ε_cons(m). [sent-776, score-0.408]

65 Example 3 There exists a learning problem where any ERM algorithm is universally consistent and average-RO stable with rates Θ(1/√m), but for any ERM algorithm, P[F(ĥ_S) − F∗ = 1] = Θ(1/√m). [sent-808, score-0.42]

66 Consistency refers to universal consistency and stability refers to universal uniform-RO stability. [sent-826, score-0.343]

67 Discussion and Conclusions In the familiar setting of supervised classification problems, the question of learnability is reduced to that of uniform convergence of empirical risks to their expectations. [sent-835, score-0.392]

68 On the other hand, we have seen that stability is both a sufficient and necessary condition for learning, even in the General Learning Setting where uniform convergence fails to characterize learnability. [sent-844, score-0.374]

69 This also allows us to frame stability as the core condition guaranteeing learnability, with uniform convergence only a sufficient, but not necessary, condition for stability (see Figure 2). [sent-848, score-0.62]

70 In studying the question of learnability and its relation to stability, we encountered several differences between this more general setting, and settings such as supervised classification where learnability is equivalent to uniform convergence. [sent-849, score-0.511]

71 In this paper we establish that if a problem is learnable, although it might not be learnable with an ERM, it must be learnable with some AERM. [sent-851, score-0.457]

72 • In supervised classification, if one AERM is universally consistent then all AERMs are universally consistent. [sent-853, score-0.48]

73 • In supervised classification, a universally consistent rule must also generalize and be AERM. [sent-855, score-0.389]

74 In the General Setting, a universally consistent rule need not generalize nor be an AERM, as example 2 demonstrates. [sent-856, score-0.36]

75 However, Theorem 10 establishes that, even in the General Setting, if a rule is universally consistent and generalizing then it must be an AERM. [sent-857, score-0.4]

76 In supervised classification, if a problem is learnable then generalization always holds (for any rule), and so universal consistency and AERM imply each other. [sent-861, score-0.366]

77 • In supervised classification, any learnable problem is learnable with an ERM, and the ERM “works” ˆ with high-confidence (namely, F(hS ) − F ∗ can be bounded with probability 1 − δ by an expression with logarithmic dependence on 1/δ). [sent-865, score-0.471]

78 We have begun exploring the issue of learnability in the General Setting, and uncovered important relationships between learnability and stability. [sent-867, score-0.424]

79 Yet another open question: We showed that existence of an uniform-RO stable AERM is necessary and sufficient for learnability (Theorem 7). [sent-880, score-0.375]

80 The second and third measures (uniform-RO stability and strongly-uniform-RO stability, respectively) essentially deal with the maximal possible change in the objective value with respect to a particular instance when a single instance in the training set is replaced. [sent-903, score-0.539]

81 However, the results for uniform stability mostly assume deterministic learning rules, while in this paper we have used stronglyuniform-RO stability solely in the context of randomized learning rules. [sent-912, score-0.581]

82 Moreover, we show in this paper that uniform-RO stable AERM’s characterize learnability, while it is well known that uniformly stable AERM’s are not necessary for learnability (see Kutin and Niyogi, 2002). [sent-914, score-0.574]

83 For the same reason, our notion of strongly-uniform-RO stability is apparently too strong to characterize learnability when we deal with deterministic learning rules, as opposed to randomized learning rules. [sent-915, score-0.527]

84 The first and last are similar to our notion of uniform-RO stability and average-RO stability respectively. [sent-936, score-0.519]

85 However, we emphasize that RO stability and LOO stability are in general incomparable notions, as we shall see later on. [sent-937, score-0.496]

86 Definition 27 A rule A is uniform-LOO stable with rate ε_stable(m) if for all samples S of m points and for all i: |f(A(S\i); z_i) − f(A(S); z_i)| ≤ ε_stable(m). [sent-943, score-0.461]

87 Definition 28 A rule A is all-i-LOO stable with rate ε_stable(m) under distribution D if for all i: E_{S∼D^m}[|f(A(S\i); z_i) − f(A(S); z_i)|] ≤ ε_stable(m). [sent-944, score-0.461]

88 Definition 29 A rule A is LOO stable with rate ε_stable(m) under distribution D if (1/m) ∑_{i=1}^m E_{S∼D^m}[|f(A(S\i); z_i) − f(A(S); z_i)|] ≤ ε_stable(m). [sent-945, score-0.461]

89 Definition 30 A rule A is on-average-LOO stable with rate ε_stable(m) under distribution D if |(1/m) ∑_{i=1}^m E_{S∼D^m}[f(A(S\i); z_i) − f(A(S); z_i)]| ≤ ε_stable(m). [sent-946, score-0.461]
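The inner quantities in Definitions 27–30 can be computed directly for a concrete rule. As an illustrative sketch (the empirical-mean rule and squared loss are assumptions, and expectations over S ∼ D^m are dropped in favor of one fixed sample), the three deterministic variants are ordered as |average| ≤ average of |·| ≤ max |·|:

```python
# Sketch: leave-one-out (LOO) stability quantities for the empirical-mean
# rule A(S) = mean(S) under squared loss f(h, z) = (h - z)^2, evaluated on
# a single fixed sample. (Definitions 28-30 above additionally take
# expectations over S ~ D^m; this sketch only shows the inner quantities.)

def A(S):
    return sum(S) / len(S)

def f(h, z):
    return (h - z) ** 2

S = [0.0, 0.2, 0.8, 1.0]
m = len(S)
deltas = []
for i, z in enumerate(S):
    S_minus_i = S[:i] + S[i + 1:]                # remove the i-th instance
    deltas.append(f(A(S_minus_i), z) - f(A(S), z))

uniform_loo = max(abs(d) for d in deltas)        # cf. Definition 27
loo = sum(abs(d) for d in deltas) / m            # cf. Definition 29
on_average_loo = abs(sum(deltas) / m)            # cf. Definition 30

# on-average-LOO is the weakest: |mean| <= mean(|.|) <= max(|.|).
assert on_average_loo <= loo <= uniform_loo
```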

90 Example 4 There exists a learning problem with a universally consistent and all-i-LOO stable learning rule, but there is no universally consistent and uniform LOO stable learning rule. [sent-951, score-0.898]

91 It is also universally all-i-LOO stable, because removing an instance can change the hypothesis only if the original sample had an equal number of 0’s and 1’s (plus or minus one), which happens with probability at most O(1/√m), where m is the sample size. [sent-955, score-0.342]

92 However, it is not hard to see that the only uniform LOO stable learning rule, at least for large enough sample sizes, is a constant rule which always returns the same hypothesis h regardless of the sample. [sent-956, score-0.433]

93 Therefore, our learning rule universally generalizes (with rate εgen (m) = log(4/δ)/2m), and by Theorem 9, this implies that the learning rule is also universally consistent and on-average-LOO stable. [sent-990, score-0.726]

94 Note that the proof implies that on-average-LOO stability cannot be replaced even by notions intermediate between on-average-LOO stability and LOO stability. [sent-998, score-0.513]

95 (2009b), we show a version of Theorem 7, which asserts that a problem is learnable if and only if there is an on-average-LOO stable AERM. [sent-1003, score-0.384]

96 However, on-average-LOO stability is qualitatively much weaker than the notion of uniform-RO stability used in Theorem 7 (see Definition 4). [sent-1004, score-0.519]

97 However, the proof of Theorem 7 does not work for these stability definitions (technically, this is because the proof relies on the sample size remaining constant, which is true for replacement stability, but not when we remove an instance as in LOO stability). [sent-1006, score-0.335]

98 In particular, the theorem implies that LOO stability is a necessary property for consistent ERM learning rules. [sent-1012, score-0.342]

99 Therefore, we can upper bound (1/m) ∑_{i=1}^m E[f(ĥ_{S\i}; z_i)] − E[F_S(ĥ_S)] + 2B/m, where (1/m) ∑_{i=1}^m E[f(ĥ_{S\i}; z_i)] − E[F_S(ĥ_S)] = (1/m) ∑_{i=1}^m E[f(ĥ_{S\i}; z_i) − f(ĥ_S; z_i)] ≤ ε_stable(m), using the assumption that the learning rule is ε_stable(m)-stable. [sent-1027, score-0.346]

100 Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. [sent-1103, score-0.401]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('aerm', 0.446), ('fs', 0.349), ('erm', 0.347), ('hs', 0.298), ('cons', 0.279), ('stability', 0.248), ('learnable', 0.221), ('learnability', 0.212), ('universally', 0.194), ('stable', 0.163), ('loo', 0.106), ('rule', 0.103), ('hamir', 0.093), ('hwartz', 0.093), ('rebro', 0.093), ('ridharan', 0.093), ('earnability', 0.088), ('niform', 0.088), ('zi', 0.081), ('gen', 0.075), ('tability', 0.072), ('hypothesis', 0.068), ('risk', 0.065), ('consistent', 0.063), ('uniform', 0.058), ('infh', 0.057), ('universal', 0.054), ('oag', 0.052), ('onvergence', 0.05), ('consistency', 0.041), ('minimizer', 0.04), ('generalizing', 0.04), ('zm', 0.039), ('vapnik', 0.037), ('lemma', 0.037), ('aerms', 0.036), ('convergence', 0.036), ('generalizes', 0.036), ('rate', 0.033), ('hypotheses', 0.032), ('dups', 0.031), ('theorem', 0.031), ('eh', 0.031), ('te', 0.029), ('rules', 0.029), ('supervised', 0.029), ('ro', 0.028), ('ez', 0.028), ('emp', 0.027), ('sample', 0.027), ('randomized', 0.027), ('empirical', 0.026), ('kutin', 0.026), ('instance', 0.026), ('convex', 0.025), ('mukherjee', 0.025), ('notion', 0.023), ('dence', 0.023), ('strict', 0.022), ('duplicates', 0.022), ('rakhlin', 0.022), ('generalization', 0.021), ('sridharan', 0.021), ('nitions', 0.021), ('uniformly', 0.019), ('stochastic', 0.017), ('converse', 0.017), ('objective', 0.017), ('proof', 0.017), ('generic', 0.017), ('setting', 0.017), ('characterize', 0.017), ('population', 0.017), ('notions', 0.016), ('conversion', 0.016), ('fsa', 0.016), ('karthik', 0.016), ('returned', 0.015), ('namely', 0.015), ('minimization', 0.015), ('might', 0.015), ('condition', 0.015), ('classi', 0.015), ('trivial', 0.015), ('returns', 0.014), ('inf', 0.014), ('sa', 0.014), ('randomization', 0.014), ('risks', 0.014), ('equation', 0.014), ('characterizing', 0.014), ('sup', 0.014), ('st', 0.014), ('shamir', 0.014), ('hilbert', 0.013), ('stronger', 0.013), ('srebro', 0.013), ('zinkevich', 0.013), ('regularization', 0.013), ('elisseeff', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999869 60 jmlr-2010-Learnability, Stability and Uniform Convergence

Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan


2 0.28214121 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory

3 0.080057576 82 jmlr-2010-On Learning with Integral Operators

Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito

Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods

4 0.078803286 22 jmlr-2010-Classification Using Geometric Level Sets

Author: Kush R. Varshney, Alan S. Willsky

Abstract: A variational level set method is developed for the supervised classification problem. Nonlinear classifier decision boundaries are obtained by minimizing an energy functional that is composed of an empirical risk term with a margin-based loss and a geometric regularization term new to machine learning: the surface area of the decision boundary. This geometric level set classifier is analyzed in terms of consistency and complexity through the calculation of its ε-entropy. For multicategory classification, an efficient scheme is developed using a logarithmic number of decision functions in the number of classes rather than the typical linear number of decision functions. Geometric level set classification yields performance results on benchmark data sets that are competitive with well-established methods. Keywords: level set methods, nonlinear classification, geometric regularization, consistency, complexity

5 0.050314654 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

6 0.04411044 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

7 0.041345611 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

8 0.040392444 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

9 0.039463662 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

10 0.038679991 102 jmlr-2010-Semi-Supervised Novelty Detection

11 0.037830304 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

12 0.03678973 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

13 0.036094606 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

14 0.035997946 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

15 0.035373718 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

16 0.030095672 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

17 0.029647442 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

18 0.027642047 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

19 0.027500115 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

20 0.026717637 25 jmlr-2010-Composite Binary Losses
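The tfidf scores above can be sketched as follows. This is a toy reconstruction, not the pipeline actually used to build this index: the corpus, whitespace tokenization, and raw-count tf with log(n/df) idf are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus standing in for paper abstracts; purely illustrative.
docs = [
    "learnability stability uniform convergence empirical risk",
    "stability bounds for stationary mixing processes",
    "classification using geometric level sets",
]

def tfidf_vectors(corpus):
    """Map each document to a sparse {term: tf-idf weight} dict."""
    n = len(corpus)
    tokenized = [doc.split() for doc in corpus]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]

def dot(a, b):
    """Unnormalized similarity between two sparse tf-idf vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

vecs = tfidf_vectors(docs)
# Document 1 shares the term "stability" with document 0, so it gets a
# positive score; document 2 shares no terms and scores zero.
scores = [dot(vecs[0], v) for v in vecs[1:]]
```

Normalizing each score by the vector norms (cosine similarity) would put the values on the same scale as the list above.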


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.134), (1, -0.058), (2, 0.057), (3, 0.009), (4, -0.129), (5, 0.009), (6, -0.12), (7, 0.302), (8, 0.177), (9, 0.084), (10, -0.166), (11, -0.06), (12, -0.279), (13, 0.31), (14, -0.096), (15, 0.078), (16, 0.228), (17, 0.115), (18, -0.005), (19, -0.071), (20, 0.093), (21, -0.228), (22, 0.035), (23, -0.067), (24, -0.065), (25, -0.055), (26, -0.06), (27, -0.041), (28, -0.001), (29, 0.024), (30, -0.046), (31, -0.049), (32, 0.075), (33, -0.048), (34, 0.053), (35, -0.01), (36, 0.031), (37, -0.013), (38, 0.057), (39, -0.008), (40, -0.017), (41, -0.102), (42, -0.003), (43, 0.005), (44, -0.032), (45, -0.039), (46, -0.027), (47, -0.017), (48, -0.053), (49, 0.025)]
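The `(topicId, topicWeight)` pairs above form the LSI vector for this paper, and a `simValue` between two papers is plausibly the cosine similarity between such vectors. A minimal sketch, where the first vector echoes the leading weights above and the second is invented for illustration:

```python
import math

# Illustrative LSI topic-weight vectors (truncated to six topics); the
# real vectors come from the index's LSI model.
paper_a = [-0.134, -0.058, 0.057, 0.009, -0.129, 0.009]
paper_b = [-0.120, -0.050, 0.060, 0.010, -0.130, 0.008]

def cosine(u, v):
    """Cosine similarity between two dense topic vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Under this scoring a paper compared with itself gives 1.0, which matches the pattern of the `same-paper` rows carrying the highest `simValue` in each list.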

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9549706 60 jmlr-2010-Learnability, Stability and Uniform Convergence

Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan

Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization

2 0.87943327 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory

3 0.33232746 82 jmlr-2010-On Learning with Integral Operators

Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito

Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods

4 0.30132312 22 jmlr-2010-Classification Using Geometric Level Sets

Author: Kush R. Varshney, Alan S. Willsky

Abstract: A variational level set method is developed for the supervised classification problem. Nonlinear classifier decision boundaries are obtained by minimizing an energy functional that is composed of an empirical risk term with a margin-based loss and a geometric regularization term new to machine learning: the surface area of the decision boundary. This geometric level set classifier is analyzed in terms of consistency and complexity through the calculation of its ε-entropy. For multicategory classification, an efficient scheme is developed using a logarithmic number of decision functions in the number of classes rather than the typical linear number of decision functions. Geometric level set classification yields performance results on benchmark data sets that are competitive with well-established methods. Keywords: level set methods, nonlinear classification, geometric regularization, consistency, complexity

5 0.2466785 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

Author: Gunnar Carlsson, Facundo Mémoli

Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance

6 0.22839536 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

7 0.21713778 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

8 0.20793806 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

9 0.20054543 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

10 0.18680683 102 jmlr-2010-Semi-Supervised Novelty Detection

11 0.15453106 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

12 0.1464625 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

13 0.14210354 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

14 0.14199495 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

15 0.1313332 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

16 0.12737371 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

17 0.12701474 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

18 0.12582929 25 jmlr-2010-Composite Binary Losses

19 0.11858156 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

20 0.11016954 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.025), (3, 0.05), (4, 0.019), (8, 0.021), (15, 0.019), (21, 0.02), (32, 0.079), (36, 0.029), (37, 0.049), (53, 0.414), (75, 0.089), (81, 0.013), (85, 0.047), (96, 0.01)]
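The LDA row above is a sparse document-topic distribution, with most mass on topic 53. Since LDA weights are (near-)probabilities, one natural way to compare two papers, distinct from the cosine scoring sketched for LSI, is the Hellinger distance. The distributions below are illustrative: the first mirrors a few pairs from the row above, the second is invented.

```python
import math

# Sparse {topicId: topicWeight} distributions; values are illustrative,
# not the index's actual LDA output.
doc_a = {1: 0.025, 3: 0.05, 32: 0.079, 53: 0.414, 75: 0.089}
doc_b = {3: 0.04, 32: 0.08, 53: 0.40, 89: 0.02}

def hellinger(p, q):
    """Hellinger distance between sparse topic distributions:
    0.0 for identical distributions, at most 1.0 for disjoint ones."""
    topics = set(p) | set(q)
    s = sum(
        (math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2
        for t in topics
    )
    return math.sqrt(s / 2.0)
```

Papers sharing the same dominant topic (here, topic 53) come out close under this distance, consistent with the clustering visible in the list below it would feed.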

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.68519533 60 jmlr-2010-Learnability, Stability and Uniform Convergence

Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan

Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization

2 0.31386915 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. 

3 0.31335407 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

4 0.31063142 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

Author: Vladimir Koltchinskii

Abstract: Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way. Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered. Keywords: active learning, excess risk, Rademacher complexities, capacity function, disagreement coefficient

5 0.30938396 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

Author: Ran El-Yaniv, Yair Wiener

Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off

6 0.30878112 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

7 0.30745822 102 jmlr-2010-Semi-Supervised Novelty Detection

8 0.30692345 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

9 0.30617446 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

10 0.30442503 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

11 0.30440846 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

12 0.30392689 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

13 0.30387661 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

14 0.30385673 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond

15 0.30299586 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

16 0.30294669 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

17 0.30212063 109 jmlr-2010-Stochastic Composite Likelihood

18 0.30084455 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

19 0.2999348 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

20 0.29695112 25 jmlr-2010-Composite Binary Losses