jmlr jmlr2005 jmlr2005-67 knowledge-graph by maker-knowledge-mining

67 jmlr-2005-Stability of Randomized Learning Algorithms


Source: pdf

Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms. Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. [sent-10, score-0.493]

2 We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms. [sent-12, score-1.049]

3 Despite certain difficulties with theories about stability, such as the lack so far of tight bounds as well as lower bounds (Bousquet and Elisseeff, 2002), the study of learning methods using notions of stability is promising although it is still in its infancy. [sent-17, score-0.399]

4 (2004) have shown conditions for the generalization of learning methods in terms of a stability notion that has possible implications for new insights on diverse learning problems. [sent-19, score-0.323]

5 We then prove, as an application of our results, new non-asymptotic bounds for bagging (Breiman, 1996a), a randomized learning method. [sent-29, score-0.338]

6 For completeness and comparison, we first replicate in Section 2 the key notions of stability and the generalization bounds derived in the literature for deterministic methods, which we extend. [sent-32, score-0.402]

7 Finally, in Section 4 we present an analysis of bagging within the stability theory framework. [sent-34, score-0.488]

8 Stability and Generalization for Deterministic Algorithms In this section we briefly review the results in (Devroye and Wagner, 1979; Kearns and Ron, 1999; Bousquet and Elisseeff, 2002) that show that stability is linked to generalization for deterministic learning methods. [sent-36, score-0.347]

9 In the next section, we will extend stability concepts to the case of randomized learning methods and remove this symmetry assumption. [sent-38, score-0.412]

10 We also define the empirical error as R_emp[f] = (1/m) ∑_{i=1}^m ℓ(f, z_i) and the leave-one-out error as R_loo[f] = (1/m) ∑_{i=1}^m ℓ(f^{\i}, z_i). [sent-56, score-0.5]
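The two error measures above translate directly into code. A minimal sketch; the constant-mean learner and the squared loss are hypothetical choices for illustration, not from the paper:

```python
import numpy as np

def empirical_error(train_fn, loss_fn, D):
    """R_emp[f] = (1/m) * sum_i loss(f, z_i), with f trained on all of D."""
    f = train_fn(D)
    return np.mean([loss_fn(f, z) for z in D])

def leave_one_out_error(train_fn, loss_fn, D):
    """R_loo[f] = (1/m) * sum_i loss(f_i, z_i), with f_i trained on D minus z_i."""
    m = len(D)
    losses = []
    for i in range(m):
        D_minus_i = D[:i] + D[i + 1:]
        f_i = train_fn(D_minus_i)
        losses.append(loss_fn(f_i, D[i]))
    return np.mean(losses)

# Hypothetical toy setup: predict the mean of the targets, squared loss.
train_mean = lambda D: np.mean([y for (_, y) in D])
sq_loss = lambda f, z: (f - z[1]) ** 2

D = [(0, 1.0), (0, 2.0), (0, 3.0), (0, 4.0)]
print(empirical_error(train_mean, sq_loss, D))      # 1.25
print(leave_one_out_error(train_mean, sq_loss, D))  # ≈ 2.22, larger than R_emp
```

As expected, the empirical error is optimistic relative to the leave-one-out error, which is the gap the stability bounds control.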

11 2 Hypothesis Stability The first notion of stability we consider has been stated in (Bousquet and Elisseeff, 2002) and is inspired by the work of Devroye and Wagner (1979). [sent-63, score-0.312]

12 It is very close to what Kearns and Ron (1999) defined as hypothesis stability: Definition 1 (Hypothesis Stability) An algorithm A has hypothesis stability βm w. [sent-64, score-0.467]

13 Theorem 2 holds for any loss function as long as stability can be proved w. [sent-78, score-0.344]

14 In the following, we will say that an algorithm is stable when its stability scales like 1/m, in which case the difference between the generalization and leave-one-out error is of the order O(1/√m). [sent-82, score-0.343]

15 Let vi be the neighborhood of zi such that the closest point in the training set to any point of vi is zi . [sent-90, score-0.529]

16 Now averaging over D we need to compute ED [P(vi )] which is the same for all i because the zi are drawn i. [sent-96, score-0.254]

17 Consequently, E_D[P(v_i)] = 1/m and the 1-NN has hypothesis stability bounded above by 1/m. [sent-103, score-0.396]
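The identity E_D[P(v_i)] = 1/m follows from the symmetry of i.i.d. draws, and it can be checked by simulation. A quick Monte Carlo sketch in a toy one-dimensional setup (all names and the uniform distribution are hypothetical illustration choices):

```python
import random

def nearest_index(x, train_x):
    # index of the 1-nearest neighbour of x in the training inputs
    return min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))

def prob_point_i_is_nn(m, i=0, trials=20000, seed=0):
    """Monte Carlo estimate of E_D[P(v_i)]: the probability that a fresh
    point's nearest neighbour in a size-m training set is z_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        train_x = [rng.random() for _ in range(m)]
        z = rng.random()
        if nearest_index(z, train_x) == i:
            hits += 1
    return hits / trials

m = 10
print(prob_point_i_is_nn(m))  # close to 1/m = 0.1
```

By symmetry the answer does not depend on i, so the 1/m bound on the hypothesis stability of 1-NN is tight in this sense.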

18 A bound similar to Equation (1) can be derived for the empirical error when a slightly different notion of stability is used (Bousquet and Elisseeff, 2002). [sent-104, score-0.344]

19 2 Definition 3 (Pointwise hypothesis stability) An algorithm A has pointwise hypothesis stability βm w. [sent-105, score-0.551]

20 , m}, E_{D,z} |ℓ(f_D, z_i) − ℓ(f_{D\i ∪ z}, z_i)| ≤ β_m. [sent-111, score-0.468]

21 Note that we adopted the same notation β_m for all notions of stability, since it should always be clear from the context which notion is meant. [sent-112, score-0.318]

22 To this end, we need a stronger notion of stability called uniform stability (Bousquet and Elisseeff, 2002). [sent-134, score-0.642]

23 Definition 5 (Uniform Stability) An algorithm A has uniform stability βm w. [sent-135, score-0.33]

24 It is easily seen that uniform stability is an upper bound on both hypothesis and pointwise hypothesis stability (Bousquet and Elisseeff, 2002). [sent-144, score-0.892]

25 Uniform stability can be used in the context of regression to get bounds as follows (Bousquet and Elisseeff, 2002): Theorem 6 Let fD be the outcome of an algorithm with uniform stability βm w. [sent-145, score-0.739]

26 (6) The dependence on δ is log(1/δ), which is better than the bounds given in terms of hypothesis and pointwise hypothesis stability. [sent-150, score-0.288]

27 Uniform stability can also be used for classification with margin classifiers to get similar bounds, but we do not pursue this here for simplicity. [sent-152, score-0.299]

28 In the next section, for simplicity we also consider random uniform stability only for regression. [sent-153, score-0.343]

29 3 It can be shown (Bousquet and Elisseeff, 2002) that for Lipschitz loss functions, the uniform stability of these regularization methods scales as 1/λ. [sent-159, score-0.386]

30 Hence, there is a trade-off between stability and deviation between generalization and empirical error that is illustrated here by the role of the regularization parameter λ. [sent-162, score-0.331]
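The trade-off governed by λ can be observed empirically. The sketch below uses one-dimensional ridge regression as a stand-in for the regularization methods discussed; measuring the largest leave-one-out change of the learned weight is an illustrative proxy for uniform stability, not the paper's definition:

```python
import numpy as np

def ridge_weight(X, y, lam):
    # 1-D ridge: w = argmin (1/m) sum_i (w*x_i - y_i)^2 + lam * w^2
    m = len(X)
    return X @ y / (X @ X + lam * m)

def max_loo_weight_change(X, y, lam):
    """Largest change in the learned weight when one training point is
    deleted -- a crude empirical proxy for uniform stability."""
    w_full = ridge_weight(X, y, lam)
    changes = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        changes.append(abs(ridge_weight(X[mask], y[mask], lam) - w_full))
    return max(changes)

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 2.0 * X + rng.normal(scale=0.3, size=200)
for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
    # the change tends to shrink as lam grows, at the price of more bias
    print(lam, max_loo_weight_change(X, y, lam))
```

Larger λ makes the solution less sensitive to any single point (better stability) while pulling the weight away from the least-squares fit (larger empirical error), which is exactly the trade-off described above.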

31 Finally, we note that the notion of uniform stability may appear a little restrictive since the inequality in Equation (4) has to hold over all training sets D . [sent-163, score-0.358]

32 A weaker notion of stability has been introduced by Kutin and Niyogi (2002) with related exponential bounds. [sent-164, score-0.312]

33 We consider the following issues for the definitions of stability for randomized algorithms below. [sent-189, score-0.412]

34 • We give stability definitions that reduce to deterministic stability concepts when there is no randomness, that is, R is reduced to one element with probability 1. [sent-190, score-0.635]

35 For example, this model can be used to model the randomization of bagging (Breiman, 1996a), where each rt corresponds to one random subsampling from the data, and the T subsamples are all drawn independently. [sent-199, score-0.387]
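This randomization model can be made concrete: each r_t indexes one bootstrap subsample of the data, the T subsamples are drawn independently, and the final predictor averages the base predictors. A minimal sketch; the constant base learner is a hypothetical placeholder:

```python
import random

def bagged_predictor(D, train_fn, T, seed=0):
    """Bagging: each r_t is an independent bootstrap resample of D (m draws
    with replacement); the final predictor averages the T base predictors."""
    rng = random.Random(seed)
    m = len(D)
    models = []
    for _ in range(T):
        r_t = [rng.randrange(m) for _ in range(m)]  # the random vector r_t
        models.append(train_fn([D[j] for j in r_t]))
    return lambda x: sum(f(x) for f in models) / T

# Hypothetical constant base learner: predict the mean target of its sample.
train_const = lambda S: (lambda x, c=sum(y for _, y in S) / len(S): c)

D = [(0, 1.0), (0, 2.0), (0, 3.0)]
F = bagged_predictor(D, train_const, T=50)
print(F(0))  # close to the overall mean 2.0
```

Conditioning on r = (r_1, …, r_T) makes the algorithm deterministic, which is what the random stability definitions below exploit.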

36 We need this assumption for the definitions of stability below to be well defined as well as for the leave-one-out error definition we use for randomized methods. [sent-208, score-0.423]

37 2 Random Hypothesis Stability The first definition we consider is inspired by the hypothesis stability for deterministic algorithms. [sent-229, score-0.42]

38 Definition 7 (Random Hypothesis Stability) A randomized algorithm A has random hypothesis stability βm w. [sent-230, score-0.509]

39 , m}, E_{D,r,z} |ℓ(f_{D,r}, z_i) − ℓ(f_{D\i ∪ z, r}, z_i)| ≤ β_m. [sent-262, score-0.468]

40 (11) Using the following lemma, proved in the appendix: Lemma 11 For any (non-symmetric) learning algorithm A and loss function ℓ such that 0 ≤ ℓ(f, z) ≤ M, we have for the empirical error E_D[(R_gen − R_emp)²] ≤ 2M²/m + (12M/m) ∑_{i=1}^m E_{D,z} |ℓ(f_D, z_i) − ℓ(f_{D\i ∪ z}, z_i)|, (12) [sent-263, score-0.535]

41 we can derive as before the theorem: Theorem 12 Let f_{D,r} be the outcome of a random algorithm with random pointwise hypothesis stability β_m w. [sent-264, score-0.567]

42 3 Random Uniform Stability The uniform stability definition (Definition 5) for deterministic algorithms can be extended as follows: Definition 13 (Uniform Stability of Randomized Algorithms) We say that a randomized learning algorithm has uniform stability βm w. [sent-273, score-0.81]

43 To link uniform stability to generalization, the following result by McDiarmid (1989), see also (Devroye et al. [sent-281, score-0.33]

44 Theorem 15 Let fD ,r be the outcome of a randomized learning algorithm satisfying Assumptions 1 and 2 with uniform stability βm w. [sent-314, score-0.517]

45 Furthermore, assuming that β_{m−1}, the random uniform stability for training sets of size m − 1, is greater than β_m, we can simplify Equation (18) to: R_gen(f_{D,r}) ≤ R_loo(f_{D,r}) + β_m + ((M + 4mβ_{m−1})/√(2m) + √(2T) ρ) √(log(2/δ)). [sent-335, score-0.44]

46 In the bagging example below we estimate a bound on ρ that depends only on T , the number of subsamples we do for the bagging process – it may or may not be possible to show that ρ depends on m, too, but this is an open question. [sent-339, score-0.426]

47 We will analyze these methods within the stability framework presented above. [sent-347, score-0.299]

48 To this end, we need to study how bagging and subbagging “affect” the stability of the base (underlying) learning algorithm. [sent-348, score-0.645]

49 In what follows, we compute an upper bound on the random hypothesis stability for bagging. [sent-369, score-0.407]

50 1 (Random hypothesis stability of bagging for regression) Assume that the loss is B-lipschitzian w. [sent-371, score-0.617]

51 Let FD ,r , r ∈ R T , be the outcome of a bagging algorithm whose base machine ( fD ) has (pointwise) hypothesis stability γm w. [sent-375, score-0.67]

52 Then the random (pointwise) hypothesis stability β_m of F_{D,r} with respect to ℓ is bounded by β_m ≤ B ∑_{k=1}^m (k γ_k / m) Pr[d(r) = k], where d(r), r ∈ R, is the number of distinct sampled points in one bootstrap iteration. [sent-379, score-0.44]
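The bound β_m ≤ B ∑_{k=1}^m (k γ_k / m) Pr[d(r) = k] can be evaluated numerically once the distribution of d(r), the number of distinct points in a bootstrap sample, is known. A sketch that estimates this distribution by Monte Carlo; the choice γ_k = C/k is an illustrative assumption (in that case every summand equals BC/m, so the bagged bound coincides with the bound C/m for a single predictor):

```python
import random
from collections import Counter

def bound_on_bagging_stability(m, gamma, B=1.0, trials=50000, seed=0):
    """Monte Carlo evaluation of B * sum_k (k * gamma(k) / m) * Pr[d(r) = k],
    where d(r) is the number of distinct points in one bootstrap sample."""
    rng = random.Random(seed)
    counts = Counter(
        len({rng.randrange(m) for _ in range(m)}) for _ in range(trials)
    )
    return B * sum(k * gamma(k) / m * c / trials for k, c in counts.items())

m = 50
C = 1.0
bagged = bound_on_bagging_stability(m, gamma=lambda k: C / k)
single = C / m  # stability bound of one predictor trained on all m points
print(bagged, single)  # both ≈ 0.02 when gamma_k = C/k
```

For base stabilities that decay faster than 1/k the bagged bound is strictly smaller, which is the sense in which bagging improves the stability bound.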

53 D and z: E_{D,z}[I(D, z)] ≤ B E_{r,D,x}[|∆(D(r), x)| 1_{i∈r}] = B E_r[E_{D,x}[|∆(D(r), x)|] 1_{i∈r}] ≤ B E_r[γ_{d(r)} 1_{i∈r}], (21) where the last step follows by noting that E_{D,x}[|∆(D(r), x)|] is bounded by the hypothesis stability γ_{d(r)} of a training set of size d(r). [sent-408, score-0.411]

54 The proof for pointwise stability is exactly the same except that in Equation (21) there is no expectation w. [sent-415, score-0.393]

55 For example, if B = 1 and γ_m scales appropriately with m, the bounds on the random (pointwise) hypothesis stability of the bagging predictor are better than those on the (pointwise) hypothesis stability of a single predictor trained on the whole training set. [sent-425, score-1.047]

56 Similar results can be shown for the random (pointwise) hypothesis stability for classification. [sent-429, score-0.396]

57 2 (Random hypothesis stability of bagging for classification) Let FD ,r be the outcome of a bagging algorithm whose base machine has (pointwise) hypothesis stability γm w. [sent-431, score-1.242]

58 Then, the (pointwise) random hypothesis stability βm of FD ,r w. [sent-435, score-0.396]

59 So that stability w. [sent-441, score-0.299]

60 the ℓ1 loss function can be replaced by stability w. [sent-444, score-0.355]

61 The next proposition establishes the link between the uniform stability of bagging and that of the base learning algorithm for regression. [sent-449, score-0.565]

62 3 (Random uniform stability of bagging for regression) Assume that the loss is B-lipschitzian with respect to its first variable. [sent-452, score-0.564]

63 Let FD ,r be the outcome of a bagging algorithm whose base machine has uniform stability γm w. [sent-453, score-0.617]

64 Then the random uniform stability β_m of F_{D,r} with respect to ℓ is bounded by β_m ≤ B ∑_{k=1}^m (k γ_k / m) Pr[d(r) = k]. (22) [sent-457, score-0.356]

65 Proof The random uniform stability of bagging is given by β_m = sup E_{r_1,. [sent-458, score-0.557]

66 Example 6 (SVM regression) We have seen in Example 2 that the uniform stability of a SVM w. [sent-470, score-0.33]

67 The uniform stability of bagging SVM is then roughly bounded by 0. [sent-474, score-0.532]

68 So that the bound on the random uniform stability of a bagged SVM is better than the bound on the uniform stability for a single SVM trained on the whole training set with the same λ. [sent-476, score-0.727]

69 The proofs above can then be used here directly which gives the following upper bounds on stability for subbagging: Proposition 4. [sent-483, score-0.335]

70 Let FD ,r be the outcome of a subbagging algorithm whose base machine is symmetric and has uniform (resp. [sent-488, score-0.272]

71 hypothesis or pointwise hypothesis) stability βm of FD ,r w. [sent-494, score-0.467]

72 For classification, we also have the following proposition, again only for hypothesis or pointwise hypothesis stability as in Section 2: Proposition 4. [sent-498, score-0.551]

73 (Hypothesis stability of subbagging for classification) Let F_{D,r} be the outcome of a subbagging algorithm whose base machine is symmetric and has hypothesis (resp. [sent-500, score-0.757]

74 pointwise hypothesis) stability γm with respect to classification loss, and subbagging is done by sampling p points without replacement. [sent-501, score-0.526]

75 pointwise hypothesis) stability β_m of F_{D,r} with respect to the ℓ1 loss function is bounded by β_m ≤ 2 p γ_p / m. [sent-503, score-0.441]
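Subbagging differs from bagging only in how the subsamples are drawn: each of the T subsamples takes p points without replacement instead of a size-m bootstrap resample. A minimal sketch (function names and the constant base learner are hypothetical):

```python
import random

def subbagged_predictor(D, train_fn, p, T, seed=0):
    """Subbagging: each of the T subsamples draws p points from D WITHOUT
    replacement; the final predictor averages the T base predictors."""
    rng = random.Random(seed)
    models = [train_fn(rng.sample(D, p)) for _ in range(T)]
    return lambda x: sum(f(x) for f in models) / T

# Hypothetical constant base learner: predict the mean target of its subsample.
train_const = lambda S: (lambda x, c=sum(y for _, y in S) / len(S): c)

D = [(0, 1.0), (0, 2.0), (0, 3.0), (0, 4.0)]
F = subbagged_predictor(D, train_const, p=2, T=40)
print(F(0))  # close to the overall mean 2.5
```

Because every subsample has exactly p distinct points, the stability bounds for subbagging involve the single quantity γ_p rather than an expectation over d(r).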

76 We present the following theorems for subbagging but the same statements hold true for bagging where, in the bounds below, p γ_p / m is replaced by ∑_{k=1}^m (k γ_k / m) Pr[d(r) = k], which is roughly equal to 0. [sent-506, score-0.369]
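The quantity ∑_k (k/m) Pr[d(r) = k] is E[d(r)]/m, the expected fraction of distinct training points in one bootstrap sample, which equals 1 − (1 − 1/m)^m and tends to 1 − e^{−1} ≈ 0.632 as m grows. A quick check by simulation:

```python
import math
import random

def mean_distinct_fraction(m, trials=20000, seed=0):
    """Average fraction of distinct training points in one bootstrap sample."""
    rng = random.Random(seed)
    total = sum(
        len({rng.randrange(m) for _ in range(m)}) for _ in range(trials)
    )
    return total / (trials * m)

m = 100
print(mean_distinct_fraction(m))   # ≈ 0.634 for m = 100
print(1 - (1 - 1 / m) ** m)        # exact expectation, ≈ 0.634
print(1 - math.exp(-1))            # limit as m grows, ≈ 0.632
```

This is the source of the familiar 0.632 constant in bootstrap-based analyses.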

77 Assume subbagging is done with T sets of size p subsampled without replacement from D and the base learning algorithm has hypothesis stability γ_m and pointwise hypothesis stability γ_m, both stabilities being w. [sent-514, score-1.051]

78 Assume subbagging is done with T sets of size p subsampled without replacement from D and the base learning algorithm has uniform stability γm w. [sent-530, score-0.517]

79 • Although we can derive bounds for bagging using our theory in Section 3 that were not possible to derive with the existing theory summarized in Section 2, our results for bagging do not show that bagging actually improves performance. [sent-551, score-0.603]

80 Developing tighter bounds or lower bounds within our analysis for bagging is needed for this purpose. [sent-555, score-0.261]

81 The latter condition is not a problem in practice, for example one could choose T = O(√m) to get convergence, but it indicates a weak point of the uniform stability analysis as opposed to the hypothesis stability analysis above. [sent-562, score-0.713]

82 Conclusions We presented a theory of random stability for randomized learning methods that we also applied to study the effects of bagging on the stability of a learning method. [sent-566, score-0.913]

83 This is an extension of the existing theory about the stability and generalization performance of deterministic (symmetric) learning methods (Bousquet and Elisseeff 2002). [sent-567, score-0.347]

84 The bounds we proved show formally the relation of the generalization error to the stability of the (random) algorithm. [sent-569, score-0.357]

85 , zm which represents the training set D where z_i and z_j have been replaced by z and z′. [sent-593, score-0.26]

86 According to these notations we have ℓ^{ij}(∅, z_j, z_i) = ℓ(f_{D\i}, z_i), that is, we replace z_i by the empty set when it is removed from the training set. [sent-603, score-0.732]

87 The idea is to transform a difference a − b into a sum a − b = ∑_{i=1}^k (a_i − a_{i+1}) (with a_1 = a and a_{k+1} = b) so that the quantities a_i − a_{i+1} in the sum can be bounded by terms of the form E_{D,z} |ℓ^{ij}(z, z_j, z_i) − ℓ(z_i)|, the latter being directly related to the notion of stability we defined. [sent-608, score-0.559]

88 Most of the time, this change will be done between z, z_i and z_j using the fact that z and the z_i's are independently and identically distributed so that averaging w. [sent-610, score-0.488]

89 (28) Proof We have E_{D,z}[ℓ(z) − ℓ^{ij}(z_j, z_i, z)] ≤ E_{D,z,z′}[ℓ(z) − ℓ^{ij}(z′, z_j, z)] + E_{D,z,z′}[ℓ^{ij}(z′, z_j, z) − ℓ^{ij}(z′, z_i, z)] + E_{D,z,z′}[ℓ^{ij}(z′, z_i, z) − ℓ^{ij}(z_j, z_i, z)] (29) Since the distribution over D is i. [sent-623, score-0.936]

90 z j or z , and we can swap the role of z and zi in the second term of the r. [sent-629, score-0.246]

91 of Equation (28) can be bounded by different quantities related to pointwise hypothesis stability or to hypothesis stability. [sent-638, score-0.564]

92 We have indeed E_{D,z}[ℓ(z) − ℓ^{ij}(z_j, z_i, z)] ≤ 3 (E_{D,z} |ℓ^{ij}(z, z_j, z_i) − ℓ(z_i)| + E_{D,z} |ℓ^{ij}(z_i, z, z_j) − ℓ(z_j)|), (31) which is related to the definition of pointwise hypothesis stability and will be used when the focus is on empirical error. [sent-639, score-0.945]

93 We have also E_{D,z}[ℓ(z) − ℓ^{ij}(z_j, z_i, z)] ≤ 3 (E_{D,z} |ℓ^{ij}(∅, z_j, z) − ℓ(z)| + E_{D,z} |ℓ^{ij}(z_i, ∅, z) − ℓ(z)|), which is related to bounds on the leave-one-out error. [sent-640, score-0.27]

94 + E_{D,z}[ℓ(z_i)(ℓ(z_i) − ℓ(z))] Now we rewrite I as I = E_{D,z,z′}[ℓ(z)ℓ(z′)] − E_{D,z}[ℓ(z)ℓ(z_i)] = E_{D,z,z′}[ℓ(z)ℓ(z′) − ℓ^{ij}(z′, z_j, z)ℓ^{ij}(z′, z_j, z′)], where we renamed z_i as z′ in the second term. [sent-650, score-0.262]

95 (32) Next we rewrite J as J = E_D[ℓ(z_i)ℓ(z_j)] − E_{D,z}[ℓ(z)ℓ(z_i)] = E_{D,z,z′}[ℓ^{ij}(z, z′, z)ℓ^{ij}(z, z′, z′) − ℓ(z)ℓ(z_i)], where we renamed z_j as z′ and z_i as z in the first term. [sent-653, score-0.262]

96 We have also J = E_{D,z,z′}[ℓ^{ij}(z, z′, z)ℓ^{ij}(z, z′, z′) − ℓ^{ij}(z′, z_i, z)ℓ^{ij}(z′, z_i, z′)], where we renamed z_i as z′ and z_j as z_i in the second term. [sent-654, score-0.964]

97 Using Equation (31), we have J ≤ E_{D,z,z′}[ℓ^{ij}(z, z′, z)ℓ^{ij}(z, z′, z′) − ℓ^{ij}(z_i, z′, z)ℓ^{ij}(z′, z_i, z′)] + 3M (E_{D,z} |ℓ^{ij}(z, z_j, z_i) − ℓ(z_i)| + E_{D,z} |ℓ^{ij}(z_i, z, z_j) − ℓ(z_j)|), where the first term is denoted J1. [sent-655, score-0.468]

98 For every r, s ∈ R^T and T ∈ N, we have |K(D, r) − K(D, s)| = |E_z[ℓ(f_{D,r}, z) − ℓ(f_{D,s}, z)] − (1/m) ∑_{i=1}^m (ℓ(f_{D,r}, z_i) − ℓ(f_{D,s}, z_i))| ≤ E_z |ℓ(f_{D,r}, z) − ℓ(f_{D,s}, z)| + (1/m) ∑_{i=1}^m |ℓ(f_{D,r}, z_i) − ℓ(f_{D,s}, z_i)|. [sent-678, score-0.936]

99 (1999), “Algorithmic stability and sanity-check bounds for leave-one-out cross-validation”, Neural Computation, 11(6):1427–1453. [sent-739, score-0.335]

100 (2002), “Almost-everywhere algorithmic stability and generalization error”, Uncertainty in Artificial Intelligence (UAI), August, 2002, Edmonton, Canada. [sent-742, score-0.31]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fd', 0.792), ('stability', 0.299), ('zi', 0.234), ('bagging', 0.189), ('rgen', 0.176), ('subbagging', 0.133), ('rt', 0.133), ('randomized', 0.113), ('elisseeff', 0.107), ('pointwise', 0.084), ('lisseeff', 0.084), ('hypothesis', 0.084), ('oo', 0.082), ('outcome', 0.074), ('vgeniou', 0.07), ('remp', 0.07), ('andomized', 0.07), ('bousquet', 0.068), ('tability', 0.062), ('ontil', 0.062), ('er', 0.053), ('pr', 0.049), ('loss', 0.045), ('randomness', 0.041), ('lgorithms', 0.041), ('deterministic', 0.037), ('bounds', 0.036), ('earning', 0.031), ('bootstrap', 0.031), ('uniform', 0.031), ('evgeniou', 0.031), ('ez', 0.029), ('devroye', 0.029), ('renamed', 0.028), ('subsamples', 0.028), ('breiman', 0.028), ('sup', 0.025), ('base', 0.024), ('ron', 0.024), ('wagner', 0.023), ('vi', 0.023), ('proposition', 0.022), ('poggio', 0.022), ('lipschitzian', 0.021), ('averaging', 0.02), ('notions', 0.019), ('replacement', 0.018), ('theodoros', 0.018), ('trained', 0.017), ('kutin', 0.016), ('pontil', 0.015), ('training', 0.015), ('ed', 0.015), ('draw', 0.015), ('empty', 0.015), ('niyogi', 0.014), ('chebychev', 0.014), ('permuting', 0.014), ('renaming', 0.014), ('stabilities', 0.014), ('zurich', 0.014), ('theorem', 0.014), ('ber', 0.013), ('notion', 0.013), ('random', 0.013), ('bounded', 0.013), ('ct', 0.012), ('lemma', 0.012), ('randomization', 0.012), ('svm', 0.012), ('kearns', 0.012), ('subsampling', 0.012), ('subsampled', 0.012), ('andonova', 0.012), ('swap', 0.012), ('insead', 0.012), ('andre', 0.012), ('generalization', 0.011), ('copies', 0.011), ('error', 0.011), ('stable', 0.011), ('scales', 0.011), ('appearing', 0.011), ('bound', 0.011), ('replaced', 0.011), ('pd', 0.01), ('bp', 0.01), ('fq', 0.01), ('massimiliano', 0.01), ('sampling', 0.01), ('empirical', 0.01), ('changing', 0.01), ('concerns', 0.01), ('symmetric', 0.01), ('proof', 0.01), ('bootstrapping', 0.009), ('theories', 0.009), ('equation', 0.009), ('open', 0.009), ('lemmas', 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 67 jmlr-2005-Stability of Randomized Learning Algorithms

Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.

2 0.10023837 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features

Author: Mario Marchand, Marina Sokolova

Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory

3 0.057317562 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning

Author: Damien Ernst, Pierre Geurts, Louis Wehenkel

Abstract: Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (xt , ut , rt , xt+1 ) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees. Keywords: batch mode reinforcement learning, regression trees, ensemble methods, supervised learning, fitted value iteration, optimal control

4 0.051920321 11 jmlr-2005-Algorithmic Stability and Meta-Learning

Author: Andreas Maurer

Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn

5 0.051365897 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

Author: Dmitry Rusakov, Dan Geiger

Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)

6 0.045728937 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

7 0.042407941 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

8 0.036139857 36 jmlr-2005-Gaussian Processes for Ordinal Regression

9 0.035022594 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

10 0.034508646 47 jmlr-2005-Learning from Examples as an Inverse Problem

11 0.031234957 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies

12 0.029218804 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

13 0.02760404 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets

14 0.02702966 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

15 0.024755374 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

16 0.023024846 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

17 0.022857636 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

18 0.022690957 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader

19 0.019565247 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

20 0.018215349 29 jmlr-2005-Efficient Margin Maximizing with Boosting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.127), (1, -0.018), (2, 0.071), (3, 0.176), (4, -0.104), (5, -0.064), (6, 0.137), (7, -0.032), (8, 0.157), (9, -0.162), (10, 0.014), (11, 0.145), (12, 0.049), (13, 0.034), (14, -0.23), (15, 0.331), (16, -0.253), (17, -0.206), (18, 0.076), (19, 0.163), (20, -0.101), (21, -0.004), (22, -0.093), (23, 0.088), (24, 0.019), (25, -0.101), (26, -0.013), (27, 0.104), (28, -0.018), (29, -0.021), (30, -0.038), (31, 0.013), (32, -0.001), (33, -0.029), (34, -0.062), (35, -0.118), (36, 0.081), (37, 0.171), (38, -0.004), (39, -0.142), (40, 0.116), (41, 0.095), (42, -0.129), (43, 0.183), (44, 0.169), (45, 0.182), (46, -0.099), (47, 0.142), (48, 0.249), (49, -0.097)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96983927 67 jmlr-2005-Stability of Randomized Learning Algorithms

Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.

2 0.39834496 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features

Author: Mario Marchand, Marina Sokolova

Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory

3 0.23768944 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning

Author: Damien Ernst, Pierre Geurts, Louis Wehenkel

Abstract: Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (xt , ut , rt , xt+1 ) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees. Keywords: batch mode reinforcement learning, regression trees, ensemble methods, supervised learning, fitted value iteration, optimal control

4 0.21249665 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

5 0.18824996 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

Author: Dmitry Rusakov, Dan Geiger

Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
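The standard BIC score that the abstract contrasts against is the familiar approximation log P(D | model) ≈ log L̂ − (d/2) log N, where L̂ is the maximized likelihood, d the number of parameters, and N the sample size. A minimal sketch of this standard score (the paper's point is precisely that it can be incorrect for stratified exponential families); the function name and example numbers are illustrative assumptions.

```python
import math

def bic_score(log_likelihood, n_params, n_samples):
    """Standard BIC approximation to the log marginal likelihood:
    log P(D | model) ~ log L_hat - (d / 2) * log N."""
    return log_likelihood - 0.5 * n_params * math.log(n_samples)

# Example: a better-fitting but larger model can still score lower,
# because the (d / 2) * log N penalty grows with the parameter count.
simple = bic_score(log_likelihood=-120.0, n_params=3, n_samples=100)
larger = bic_score(log_likelihood=-115.0, n_params=10, n_samples=100)
```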

6 0.17983045 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

7 0.15522186 11 jmlr-2005-Algorithmic Stability and Meta-Learning

8 0.14469148 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies

9 0.14064057 36 jmlr-2005-Gaussian Processes for Ordinal Regression

10 0.13239968 47 jmlr-2005-Learning from Examples as an Inverse Problem

11 0.11995485 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

12 0.11259225 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader

13 0.098981127 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

14 0.08374095 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)

15 0.078468956 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs

16 0.077994317 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

17 0.077731371 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

18 0.073934697 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

19 0.066702634 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks

20 0.065201677 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(17, 0.019), (19, 0.02), (21, 0.04), (36, 0.011), (37, 0.02), (42, 0.011), (43, 0.038), (47, 0.016), (52, 0.106), (59, 0.012), (70, 0.013), (74, 0.391), (88, 0.113), (90, 0.025), (94, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.78257668 54 jmlr-2005-Managing Diversity in Regression Ensembles

Author: Gavin Brown, Jeremy L. Wyatt, Peter Tiňo

Abstract: Ensembles are a widely used and effective technique in machine learning—their success is commonly attributed to the degree of disagreement, or ‘diversity’, within the ensemble. For ensembles where the individual estimators output crisp class labels, this ‘diversity’ is not well understood and remains an open research issue. For ensembles of regression estimators, the diversity can be exactly formulated in terms of the covariance between individual estimator outputs, and the optimum level is expressed in terms of a bias-variance-covariance trade-off. Despite this, most approaches to learning ensembles use heuristics to encourage the right degree of diversity. In this work we show how to explicitly control diversity through the error function. The first contribution of this paper is to show that by taking the combination mechanism for the ensemble into account we can derive an error function for each individual that balances ensemble diversity with individual accuracy. We show the relationship between this error function and an existing algorithm called negative correlation learning, which uses a heuristic penalty term added to the mean squared error function. It is demonstrated that these methods control the bias-variance-covariance trade-off systematically, and can be utilised with any estimator capable of minimising a quadratic error function, for example MLPs, or RBF networks. As a second contribution, we derive a strict upper bound on the coefficient of the penalty term, which holds for any estimator that can be cast in a generalised linear regression framework, with mild assumptions on the basis functions. Finally we present the results of an empirical study, showing significant improvements over simple ensemble learning, and finding that this technique is competitive with a variety of methods, including boosting, bagging, mixtures of experts, and Gaussian processes, on a number of tasks. 
Keywords: ensemble, diversity, regression estimators, neural networks, hessian
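The penalized error function the abstract relates to negative correlation learning can be written down directly: member i minimizes its own squared error plus λ times its correlation with the rest of the ensemble's deviations from the mean. A minimal sketch, assuming a uniformly averaged regression ensemble; the function name is an illustrative assumption, not the paper's notation.

```python
import numpy as np

def ncl_member_loss(preds, target, i, lam=0.5):
    """Negative correlation learning loss for ensemble member i:

        e_i = (f_i - y)^2 + lam * (f_i - f_bar) * sum_{j != i} (f_j - f_bar)

    Since the deviations from the ensemble mean sum to zero, the penalty
    equals -lam * (f_i - f_bar)^2: it rewards disagreement with the mean,
    trading individual accuracy against ensemble diversity.
    """
    f_bar = np.mean(preds)
    dev_i = preds[i] - f_bar
    penalty = dev_i * (np.sum(preds - f_bar) - dev_i)
    return (preds[i] - target) ** 2 + lam * penalty
```

With lam = 0 each member trains independently; the paper's contribution includes a strict upper bound on how large the penalty coefficient may safely be.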

same-paper 2 0.70557845 67 jmlr-2005-Stability of Randomized Learning Algorithms

Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms. Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.
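The quantity at stake — how much the loss of a randomized learner such as bagging changes when one training point is removed, in expectation over the algorithm's internal randomness — can be probed empirically. A toy sketch only, not the paper's formal definition of random stability: the bagged 1-nearest-neighbour learner, the function names, and all parameters are assumptions for illustration.

```python
import random
import statistics

def bagged_predict(train, x, n_bags=50, seed=0):
    """Toy bagged 1-nearest-neighbour regressor on 1-D data.
    Each bag is a bootstrap sample; the prediction averages the bags."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_bags):
        bag = [train[rng.randrange(len(train))] for _ in range(len(train))]
        nearest = min(bag, key=lambda p: abs(p[0] - x))
        preds.append(nearest[1])
    return statistics.mean(preds)

def empirical_stability(train, test_point, n_seeds=10):
    """Average absolute change in squared loss at a test point when one
    training example is removed, averaged over the bagging randomness
    (here represented by the random seeds)."""
    x, y = test_point
    diffs = []
    for s in range(n_seeds):
        full = bagged_predict(train, x, seed=s)
        loo = bagged_predict(train[1:], x, seed=s)  # drop the first point
        diffs.append(abs((full - y) ** 2 - (loo - y) ** 2))
    return statistics.mean(diffs)
```

Small values of this estimate correspond to the stable regime in which the paper's non-asymptotic generalization bounds are tight.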

3 0.40019503 11 jmlr-2005-Algorithmic Stability and Meta-Learning

Author: Andreas Maurer

Abstract: A mechanism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learner's performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for meta-algorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn

4 0.3891927 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
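The reduction the abstract describes — casting multi-task estimation as single-task learning with a kernel defined over (input, task) pairs — can be illustrated with one simple member of such a kernel family: a shared component that couples all tasks plus a task-specific component that fires only within a task. This is a hedged sketch of the general idea, not the paper's specific kernel construction; the function name and the mixing parameter mu are illustrative assumptions.

```python
def multitask_kernel(x, xp, t, tp, base, mu=0.5):
    """A simple multi-task kernel over (input, task) pairs:

        K((x, t), (x', t')) = mu * k(x, x') + 1{t == t'} * k(x, x')

    The mu term shares information across every pair of tasks, while the
    indicator term models structure specific to a single task.  mu -> 0
    recovers independent single-task learning; large mu pools all tasks.
    """
    shared = mu * base(x, xp)
    specific = base(x, xp) if t == tp else 0.0
    return shared + specific
```

Plugged into any standard kernel machine, this turns the multi-task problem into one single-task problem over the augmented examples ((x, t), y), which is the structural point of the paper.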

5 0.37576213 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

Author: Roni Khardon, Rocco A. Servedio

Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
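A kernel of the kind the abstract refers to has a simple closed form: a conjunction of literals is satisfied by both Boolean vectors exactly when each chosen literal agrees with both inputs, so the count over conjunctions of up to k literals depends only on the number of coordinates where the two vectors agree. A minimal sketch of one common such closed form; the function name is an illustrative assumption.

```python
from math import comb

def conjunction_kernel(x, y, k):
    """Count the conjunctions of up to k literals satisfied by both
    Boolean vectors x and y.  For each coordinate where x_j == y_j there
    is exactly one literal (x_j or its negation) satisfied by both, so:

        K(x, y) = sum_{i=1}^{k} C(s, i),  s = #{j : x_j == y_j}
    """
    s = sum(1 for a, b in zip(x, y) if a == b)
    return sum(comb(s, i) for i in range(1, k + 1))
```

This evaluates an exponentially large feature map in O(n + k) time, which is what makes the maximum margin question studied in the paper meaningful to ask at all.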

6 0.37386268 64 jmlr-2005-Semigroup Kernels on Measures

7 0.3719331 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

8 0.36625719 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)

9 0.36266071 3 jmlr-2005-A Classification Framework for Anomaly Detection

10 0.36245269 39 jmlr-2005-Information Bottleneck for Gaussian Variables

11 0.35917318 44 jmlr-2005-Learning Module Networks

12 0.35832494 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

13 0.35708004 36 jmlr-2005-Gaussian Processes for Ordinal Regression

14 0.35682803 48 jmlr-2005-Learning the Kernel Function via Regularization

15 0.35614979 71 jmlr-2005-Variational Message Passing

16 0.35611171 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints

17 0.35448742 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

18 0.35316572 20 jmlr-2005-Clustering with Bregman Divergences

19 0.35206717 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

20 0.35051444 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification