jmlr jmlr2005 jmlr2005-54 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gavin Brown, Jeremy L. Wyatt, Peter Tiňo
Abstract: Ensembles are a widely used and effective technique in machine learning—their success is commonly attributed to the degree of disagreement, or ‘diversity’, within the ensemble. For ensembles where the individual estimators output crisp class labels, this ‘diversity’ is not well understood and remains an open research issue. For ensembles of regression estimators, the diversity can be exactly formulated in terms of the covariance between individual estimator outputs, and the optimum level is expressed in terms of a bias-variance-covariance trade-off. Despite this, most approaches to learning ensembles use heuristics to encourage the right degree of diversity. In this work we show how to explicitly control diversity through the error function. The first contribution of this paper is to show that by taking the combination mechanism for the ensemble into account we can derive an error function for each individual that balances ensemble diversity with individual accuracy. We show the relationship between this error function and an existing algorithm called negative correlation learning, which uses a heuristic penalty term added to the mean squared error function. It is demonstrated that these methods control the bias-variance-covariance trade-off systematically, and can be utilised with any estimator capable of minimising a quadratic error function, for example MLPs, or RBF networks. As a second contribution, we derive a strict upper bound on the coefficient of the penalty term, which holds for any estimator that can be cast in a generalised linear regression framework, with mild assumptions on the basis functions. Finally we present the results of an empirical study, showing significant improvements over simple ensemble learning, and finding that this technique is competitive with a variety of methods, including boosting, bagging, mixtures of experts, and Gaussian processes, on a number of tasks. Keywords: ensemble, diversity, regression estimators, neural networks, hessia
Reference: text
sentIndex sentText sentNum sentScore
1 The first contribution of this paper is to show that by taking the combination mechanism for the ensemble into account we can derive an error function for each individual that balances ensemble diversity with individual accuracy. [sent-20, score-1.254]
2 Quite simply, whereas in a single regression estimator we have the well known bias-variance two-way trade-off, in an ensemble of regressors we have the bias-variance-covariance three-way trade-off. [sent-39, score-0.615]
3 In this article we focus on negative correlation (NC) learning, a successful neural network ensemble learning technique developed in the evolutionary computation literature (Liu (1998)). [sent-41, score-0.652]
4 We will describe how the ensemble error gradient can be broken into a number of individually understandable components, and that NC exploits this to blend smoothly between a group of independent learners and a single large learner, finding the optimal point between the two. [sent-45, score-0.643]
5 We begin in Section 2 with a summary of the underlying theory of regression ensemble learning, describing why there exists a trade-off between ensemble diversity and individual estimator accuracy. [sent-49, score-1.279]
6 (1992)), using it as a vehicle to introduce our notation; we then show how this decomposition naturally extends to a bias-variance-covariance decomposition (Ueda and Nakano (1996)) when using an ensemble of regression estimators. [sent-57, score-0.681]
7 We then train each individual fi separately, using Equation (2) as the error function; once this is accomplished, the outputs of the individuals are combined to give the ensemble output for any new datapoint x. [sent-79, score-1.076]
8 f̄(x) = (1/M) ∑_{i=1}^{M} fi(x). (4) The ensemble f̄ can obviously be seen as an estimator in its own right; it will therefore have a bias-variance decomposition; however, it transpires that, for this class of estimator, it can be extended to a bias-variance-covariance decomposition. [sent-84, score-0.593]
9 The Bias-Variance-Covariance Decomposition. Treating the ensemble as a single learning unit, its bias-variance decomposition can be formulated as E{(f̄ − t)²} = (E{f̄} − t)² + E{(f̄ − E{f̄})²} = bias(f̄)² + variance(f̄). (5) [sent-88, score-0.6]
10 We will now consider how the bias-variance decomposition for an ensemble can be extended (Ueda and Nakano (1996)). [sent-89, score-0.6]
11 The first concept is bias, the averaged bias of the ensemble members, bias = (1/M) ∑_i (E{fi} − t). (6) [sent-101, score-1.014]
12 The second is var, the averaged variance of the ensemble members, var = (1/M) ∑_i E{(fi − E{fi})²}. (7) [sent-102, score-1.405]
13 The third is covar, the averaged covariance of the ensemble members, covar = (1/(M(M−1))) ∑_i ∑_{j≠i} E{(fi − E{fi})(fj − E{fj})}. (8) [sent-103, score-1.399]
14 It illustrates that in addition to the bias and variance of the individual estimators, the generalisation error of an ensemble also depends on the covariance between the individuals. [sent-106, score-0.699]
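To make the decomposition concrete, the expected ensemble error E{(f̄ − t)²} can be estimated by simulation and compared against bias² + (1/M) var + (1 − 1/M) covar. The sketch below is purely illustrative and not code from the paper: the correlated-noise "estimators", the constants and all variable names are assumptions chosen for the demonstration.

```python
import numpy as np

# Monte Carlo check of E{(fbar - t)^2} = bias^2 + (1/M) var + (1 - 1/M) covar.
# The correlated-noise "estimators" below are a toy assumption, not the paper's models.
rng = np.random.default_rng(0)
M, trials, t = 5, 200_000, 1.0

common = rng.normal(0.0, 0.3, size=trials)            # shared noise -> positive covariance
private = rng.normal(0.0, 0.5, size=(M, trials))      # estimator-specific noise
offsets = np.linspace(-0.2, 0.2, M)[:, None]          # per-estimator bias
f = t + offsets + common + private                    # f[i, k]: output of estimator i on trial k
fbar = f.mean(axis=0)

bias = (f.mean(axis=1) - t).mean()                    # averaged bias of the members
var = f.var(axis=1).mean()                            # averaged variance of the members
cov = np.cov(f)                                       # M x M covariance matrix over trials
covar = (cov.sum() - np.trace(cov)) / (M * (M - 1))   # averaged pairwise covariance

lhs = np.mean((fbar - t) ** 2)
rhs = bias ** 2 + var / M + (1 - 1 / M) * covar
print(f"E{{(fbar - t)^2}} = {lhs:.5f}   bias^2 + var/M + (1 - 1/M)covar = {rhs:.5f}")
```

With enough trials the two printed values agree closely, which is exactly the dependence on the pairwise covariance described above.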
15 This raises the interesting issue of why we should ever train ensemble members separately; why shouldn’t we try to find some way to capture the effect of the covariance in the error function? [sent-107, score-0.605]
16 For a single regression estimator, generalisation error is determined by a two-way bias-variance trade-off; for an ensemble of regression estimators, the ‘diversity’ issue is simply a three-way bias-variance-covariance trade-off. [sent-113, score-0.646]
17 We will review this decomposition and its relation to the ones we have already considered, then show how we can use it to train an ensemble whilst controlling the bias-variance-covariance trade-off. [sent-117, score-0.6]
18 (f_ens − t)² = ∑_i ci (fi − t)² − ∑_i ci (fi − f_ens)², (10) where t is the target value of an arbitrary datapoint, ∑_i ci = 1, ci ≥ 0, and f_ens is the convex combination of the M component estimators, f_ens = ∑_{i=1}^{M} ci fi. [sent-120, score-0.598]
19 The second, ∑i ci ( fi − fens )2 is referred to as the Ambiguity, measuring the amount of variability among the ensemble member answers for this particular (x,t) pair. [sent-124, score-1.002]
20 Assuming a uniform weighting, we substitute the right hand side of equation (10) into the left hand side of equation (9), giving us E{ (1/M) ∑_i (fi − t)² − (1/M) ∑_i (fi − f̄)² } = bias² + (1/M) var + (1 − 1/M) covar. [sent-130, score-0.88]
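Since the uniformly weighted Ambiguity decomposition holds pointwise, it can be verified on arbitrary numbers before taking any expectations. The snippet below is a small illustrative check, not code from the paper; the random member outputs and target are assumed values.

```python
import numpy as np

# Pointwise check of (fbar - t)^2 = (1/M) sum_i (f_i - t)^2 - (1/M) sum_i (f_i - fbar)^2.
rng = np.random.default_rng(1)
f = rng.normal(size=7)          # arbitrary member outputs at one datapoint
t = rng.normal()                # arbitrary target
fbar = f.mean()

lhs = (fbar - t) ** 2
rhs = np.mean((f - t) ** 2) - np.mean((f - fbar) ** 2)
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```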
21 After some manipulations (see Appendix B for details) we can show E{ (1/M) ∑_i (fi − t)² } = bias² + Ω (12) and E{ (1/M) ∑_i (fi − f̄)² } = Ω − [ (1/M) var + (1 − 1/M) covar ]. (13) [sent-132, score-0.903]
22 Using the Decompositions to Optimise Diversity. In a simple ensemble, the norm is to train learners separately—the ith member of the ensemble would have the error function ei = (1/2)(fi − t)². (15) [sent-138, score-1.097]
23 e_ens = (1/2)(f̄ − t)² = (1/M) ∑_i (1/2)(fi − t)² − (1/M) ∑_i (1/2)(fi − f̄)². (16) [sent-141, score-0.927]
24 Given the relationship shown in Equation (12) and Equation (13), we could imagine a “diversity-encouraging” error function of the form e_i^div = (1/M) ∑_i (1/2)(fi − t)² − κ (1/M) ∑_i (1/2)(fi − f̄)². (17) [sent-145, score-0.867]
25 If we adopt a gradient descent procedure for training, we note ∂e_i^div/∂fi = (1/M) [ (fi − t) − κ(fi − f̄) ]. (18) [sent-147, score-0.871]
26 At the other extreme, when κ = 1, the fi terms in Equation (18) cancel out, and we have the gradient of the entire ensemble as a single unit: for κ = 0, ∂e_i^div/∂fi = (1/M)(fi − t) = (1/M) ∂ei/∂fi, (19) while for κ = 1, ∂e_i^div/∂fi = (1/M)(f̄ − t) = ∂e_ens/∂fi. (20) [sent-149, score-2.648]
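A quick numerical sanity check of the two extremes is sketched below; it is illustrative rather than the paper's code, with toy member outputs and an arbitrary target. At κ = 0 the gradient (1/M)[(fi − t) − κ(fi − f̄)] reduces to (1/M)(fi − t), and at κ = 1 it reduces to (1/M)(f̄ − t).

```python
import numpy as np

def grad_div(f, i, t, kappa):
    """Gradient of the diversity-encouraging error for member i:
    (1/M) * ((f_i - t) - kappa * (f_i - fbar))."""
    M, fbar = len(f), np.mean(f)
    return ((f[i] - t) - kappa * (f[i] - fbar)) / M

rng = np.random.default_rng(2)
f = rng.normal(size=5)                     # toy member outputs at one datapoint
t, i, M = 0.3, 2, 5

g0 = grad_div(f, i, t, kappa=0.0)          # expect (1/M)(f_i - t), i.e. (1/M) de_i/df_i
g1 = grad_div(f, i, t, kappa=1.0)          # expect (1/M)(fbar - t), i.e. de_ens/df_i
assert np.isclose(g0, (f[i] - t) / M)
assert np.isclose(g1, (f.mean() - t) / M)
print(g0, g1)
```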
27 Using this scaling parameter corresponds to explicitly varying our emphasis on minimising the covariance term within the ensemble MSE, balancing it against our emphasis on the bias and variance terms; hence we are explicitly managing the bias-variance-covariance trade-off. [sent-154, score-0.707]
28 Negative Correlation Learning. Negative correlation (NC) learning (Liu (1998)) is a neural network ensemble learning technique developed in the evolutionary computation literature. [sent-159, score-0.608]
29 The first reference in the literature to explicitly use this idea in a learning algorithm was Rosen (1996), who trained networks sequentially using a penalty and scaling coefficient λ added to the error term, ei = (1/2)(fi − t)² + λ pi, (21) with pi = (fi − t) ∑_{j=1}^{i−1} (fj − t). (22) [sent-167, score-1.009]
30 Attempting to extend this work, Liu and Yao (1997) trained the networks in parallel, and used a number of alternative penalty terms, including one where the t is replaced by f̄: pi = (fi − t) ∑_{j≠i} (fj − t) (23) and pi = (fi − f̄) ∑_{j≠i} (fj − f̄). (24) [sent-168, score-0.902]
31 A point to note here is that the authors calculate the gradient using the assumption “that the output of the ensemble f¯ has constant value with respect to fi ” (Liu, 1998, p. [sent-174, score-1.026]
32 Using this assumption (Equation (25)), and the penalty Equation (24), the following gradient was derived: ei = (1/2)(fi − t)² + λ(fi − f̄) ∑_{j≠i} (fj − f̄), (26) so that ∂ei/∂fi = (fi − t) + λ ∑_{j≠i} (fj − f̄). (27) [sent-179, score-2.225]
33 1 in our experiments), and: Δw = −α [ (fi(xn) − tn) − λ(fi(xn) − f̄(xn)) ] ∂fi(xn)/∂w [sent-189, score-1.221]
34 For any new testing pattern x, the ensemble output is given by f̄(x) = (1/M) ∑_i fi(x). Figure 1: Pseudocode for negative correlation learning. [sent-191, score-1.034]
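A minimal implementation of the Figure 1 procedure is sketched below for members that are linear in a fixed basis, fi(x) = wᵢ·φ(x), so that the update Δw = −α[(fi(xn) − tn) − λ(fi(xn) − f̄(xn))] ∂fi/∂w is just a scaled copy of φ(xn). Everything concrete here (the toy sine data, the random cosine basis, the learning rate α, the number of epochs and the choice λ = 0.5) is an assumption made for illustration, not a setting taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D regression data (assumed purely for illustration).
X = rng.uniform(-3, 3, size=200)
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# Fixed random cosine basis shared by all members: f_i(x) = w_i . phi(x).
K = 30
omega = rng.normal(size=K)
b = rng.uniform(0, 2 * np.pi, size=K)
Phi = np.cos(np.outer(X, omega) + b)                  # shape (n_points, K)

M, lam, alpha, epochs = 5, 0.5, 0.05, 200             # assumed settings, not the paper's
W = 0.1 * rng.normal(size=(M, K))                     # one output-weight vector per member

for _ in range(epochs):
    for n in rng.permutation(len(X)):                 # pattern-by-pattern updates, as in Figure 1
        p = Phi[n]
        f = W @ p                                     # member outputs f_i(x_n)
        fbar = f.mean()                               # ensemble output fbar(x_n)
        # NC update: delta w_i = -alpha * [(f_i - t_n) - lam * (f_i - fbar)] * phi(x_n)
        W -= alpha * ((f - y[n]) - lam * (f - fbar))[:, None] * p

ens_pred = (W @ Phi.T).mean(axis=0)
print("ensemble training MSE:", np.mean((ens_pred - y) ** 2))
```

Setting lam to 0 recovers the simple ensemble in which each member is trained on its own squared error.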
35 Re-deriving the gradient, we have ei = (1/2)(fi − t)² + γ(fi − f̄) ∑_{j≠i} (fj − f̄), so that ∂ei/∂fi = (fi − t) − 2γ(1 − 1/M)(fi − f̄). [sent-197, score-2.119]
36 Secondly, it allowed the appealing property that when λ = 1, the gradient in Equation (27) reduces to ∂ei/∂fi = (fi − t) + λ ∑_{j≠i} (fj − f̄) = (fi − t) − λ(fi − f̄) = (f̄ − t) = M · ∂e_ens/∂fi. (31) [sent-202, score-1.685]
37 Here it can be seen that the identity ∑_{j≠i} (fj − f̄) = −(fi − f̄) was used—the sum of deviations around a mean is equal to zero. [sent-203, score-0.814]
38 It emerges that by introducing the assumption, and subsequently violating it, the NC gradient becomes proportional to the gradient of the “diversity-encouraging” error function (18) suggested earlier, where we use λ in place of κ: ∂ei/∂fi = (fi − t) − λ(fi − f̄) = M · ∂e_i^div/∂fi. [sent-206, score-1.358]
39 By introducing the assumption (25), NC was inadvertently provided with the missing gradient components that correspond to the variance and covariance terms within the ensemble MSE. [sent-208, score-0.636]
40 It can therefore be concluded that NC succeeds because it trains the individual networks with error functions which more closely approximate the individual’s contribution to ensemble error than the error function used by simple ensemble learning does. [sent-209, score-1.17]
41 Using the penalty coefficient, we then balance the trade-off between those individual errors and the ensemble covariance. [sent-210, score-0.616]
42 The relationship to the Ambiguity decomposition is made even more apparent by noting that the penalty term can be rearranged, pi = (fi − f̄) ∑_{j≠i} (fj − f̄) = −(fi − f̄)². (33) [sent-211, score-0.922]
43 This leads us to a restatement of the NC error function, ei = (1/2)(fi − t)² − γ(fi − f̄)². (34) [sent-212, score-0.921]
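A useful check on this restatement is that the update NC actually applies, (fi − t) − λ(fi − f̄), coincides with the exact gradient of ei = (1/2)(fi − t)² − γ(fi − f̄)² once f̄ is allowed to move with fi, provided λ and γ are related by λ = 2γ(M − 1)/M (the correspondence read off by comparing the two gradient forms above, and consistent with the λ = 1, γ = M/(2(M−1)) equivalence used later). The finite-difference sketch below is illustrative only; the member outputs, target and γ are arbitrary assumed values.

```python
import numpy as np

rng = np.random.default_rng(4)
f = rng.normal(size=6)                 # toy member outputs at one datapoint
i, t, gamma, eps = 1, 0.2, 0.4, 1e-6
M, fbar = len(f), f.mean()

def e_nc(fi):
    """e_i = 0.5*(f_i - t)^2 - gamma*(f_i - fbar)^2, with fbar recomputed from f_i."""
    g = f.copy()
    g[i] = fi
    return 0.5 * (fi - t) ** 2 - gamma * (fi - g.mean()) ** 2

# Exact derivative by central finite difference (fbar is allowed to move with f_i).
exact = (e_nc(f[i] + eps) - e_nc(f[i] - eps)) / (2 * eps)

# Update NC actually applies, with lambda = 2*gamma*(M - 1)/M.
lam = 2 * gamma * (M - 1) / M
nc_used = (f[i] - t) - lam * (f[i] - fbar)

assert np.isclose(exact, nc_used, atol=1e-5)
print(exact, nc_used)
```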
44 It was stated that the bounds of λ should be [0, 1], based on the following calculation: ∂ei/∂fi = fi − t + λ ∑_{j≠i} (fj − f̄) = fi − t − λ(fi − f̄) = fi − t − λ(fi − f̄) + λt − λt = (1 − λ)(fi − t) + λ(f̄ − t). [sent-230, score-2.849]
45 0 < 1 − λ(1 − 1/M). (38) Since the effect of φqi cancels out, we find that this inequality is independent of all other network parameters, so it is a constant for any ensemble architecture using estimators of this form. [sent-249, score-0.668]
46 However, the utility of the bound depends entirely on how tight it is—using a neural network ensemble it would be pointless if the weights diverged significantly before the bound is reached. [sent-287, score-0.627]
48 The general rule here, supported by previous empirical work on NC, seems to be to use very simple networks—in this case we can see that an ensemble of 16 networks, each with 2 hidden nodes, has equalled the performance of a similarly sized ensemble, using 6 hidden nodes per network. [sent-341, score-0.753]
49 number of hidden nodes per network) of the individual ensemble members. [sent-344, score-0.711]
50 Regarding again Figures 3 to 6, these results indicate that NC is of most use when we have large ensembles of relatively low-complexity ensemble members. [sent-346, score-0.633]
51 This is emphasized further looking at Figure 10, where we see that an ensemble of 6 networks using 2 hidden nodes, and using NC, can equal the performance of the same ensemble using far more complex networks. [sent-347, score-1.189]
52 [Figure residue: Figure 5: LogP, γ = 0 versus optimal γ, 6 hidden nodes per network; Figure 6: LogP, γ = 0 versus optimal γ, 2 hidden nodes per network; x-axis: Number of Networks; legend: Simple ensemble, NC (Optimal Gamma).] [sent-353, score-0.917]
53 Figures 12, 14 and 16 show the behaviour with a fixed ensemble size, M = 6, as we vary the individual complexity between 2 and 12 hidden nodes. [sent-362, score-0.635]
55 [Figure 10: LogP, plotting testing MSE against Gamma for an ensemble of 6 networks.] λ = 1 (or equivalently γ = M/(2(M−1))) seems to be a useful heuristic bound. [sent-395, score-0.622]
56 [Figure legend residue: curves for ensemble members with 2, 4, 6, 8, 10 and 12 hidden nodes (“hids”) and for ensembles of 1 to 16 networks (“nets”), plotted against testing MSE.]
59 Table 2 shows that an ensemble of RBF networks each with 50 centres can outperform an MLP ensemble each with 50 sigmoidal hidden nodes, and applying NC to the RBF ensemble allows further gain. [sent-563, score-1.73]
60 Firstly, though an MLP ensemble cannot benefit, an RBF ensemble of the same size can benefit from NC. [sent-567, score-1.082]
61 Figure 23: LogP data: The effect of varying the NC γ parameter on an RBF and MLP ensemble of size M = 5, noting that our predicted upper bound γupper = 0. [sent-595, score-0.615]
62 We showed that using a penalty term and coefficient, NC explicitly includes the covariance portion of the ensemble MSE in its error function. [sent-601, score-0.634]
63 This article has served to illustrate that NC is not merely a heuristic technique, but fundamentally tied to the dynamics of training an ensemble system with the mean squared error function. [sent-602, score-0.598]
64 The NC framework can be applied to ensembles of any nonlinear regression estimator combined in an ensemble f̄ of the form f̄(x) = (1/M) ∑_{i=1}^{M} fi(x). (41) [sent-605, score-1.114]
65 In addition, an upper bound on the strength parameter was shown to apply when each estimator fi is of the form fi(x) = ∑_{k=1}^{K} wki φki(x). (42) [sent-606, score-0.929]
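For concreteness, one member of that class is an RBF network whose centres and width are fixed and whose output weights are the trainable parameters, so the output is linear in w. The fragment below is a hypothetical illustration of such an estimator (the Gaussian basis, the centres, the width and the class name are all assumptions), included only to show the form fi(x) = ∑_k wki φki(x).

```python
import numpy as np

class RBFEstimator:
    """f(x) = sum_k w_k * exp(-(x - c_k)^2 / (2 s^2)): linear in the output weights w."""

    def __init__(self, centres, width, rng):
        self.c = np.asarray(centres, dtype=float)       # fixed centres
        self.s = width                                  # fixed width
        self.w = 0.01 * rng.normal(size=len(self.c))    # trainable output weights

    def basis(self, x):
        return np.exp(-(x - self.c) ** 2 / (2 * self.s ** 2))

    def __call__(self, x):
        return self.w @ self.basis(x)

rng = np.random.default_rng(5)
net = RBFEstimator(centres=np.linspace(-3, 3, 10), width=0.7, rng=rng)
print(net(0.5))
```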
66 We therefore recommend a starting point as: ensemble size M ≥ 10, number of hidden nodes between 2 and 5, and a penalty strength parameter of γ = 0. [sent-613, score-0.73]
67 It is well known that overfitting of the individual estimators can be beneficial in an ensemble system (Sollich and Krogh (1996)), but obviously overfitting the entire ensemble as a unit is an undesirable prospect. [sent-621, score-1.175]
68 Making use of the product rule, we have ∂ei/∂wqi = (∂ei/∂fi)(∂fi/∂wqi) and ∂²ei/∂wqi² = [∂/∂wqi (∂ei/∂fi)] (∂fi/∂wqi) + (∂ei/∂fi) [∂/∂wqi (∂fi/∂wqi)]. (43) [sent-629, score-1.712]
69 Taking the first term on the right hand side, with ∂ei/∂fi = (fi − t) − λ(fi − f̄), we get ∂/∂wqi (∂ei/∂fi) = φqi − λ(φqi − (1/M)φqi) = [1 − λ(1 − 1/M)] φqi. (44) [sent-630, score-2.442]
70 M = φqi − λ(φqi − = (44) Now for the second term, remembering eq (42), we have ∂ fi ∂wqi ∂ ∂ fi ∂wqi ∂wqi = φqi (45) ∂2 f i = 0. [sent-631, score-0.814]
71 Combining these two terms gives ∂²ei/∂wqi² = [1 − λ(1 − 1/M)] φqi². (47, 48) It is interesting to observe that since we have ∂ei/∂fi = (fi − t) − λ(fi − f̄) (49) and ∂²ei/∂fi² = 1 − λ(1 − 1/M), [sent-633, score-1.712]
72 then we can see ∂²ei/∂wqi² = (∂²ei/∂fi²) φqi². (51) This demonstrates that, since φqi² is positive, the sign of the leading diagonal Hessian entry is decided by the sign of ∂²ei/∂fi². [sent-635, score-0.982]
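The sign argument can be checked numerically. For fi = wᵢ·φ, the leading-diagonal Hessian entry with respect to an output weight is [1 − λ(1 − 1/M)] φqi², which in terms of γ (using λ = 2γ(M − 1)/M) is [1 − 2γ(1 − 1/M)²] φqi². The finite-difference sketch below uses toy basis outputs, weights and a γ value chosen only for illustration; it is an assumed demonstration of the derivation above, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
M, K = 5, 8
Phi = rng.normal(size=(M, K))          # toy basis outputs phi_ki at one input
W = rng.normal(size=(M, K))            # toy output weights for each member
t, gamma, i, q, eps = 0.7, 0.6, 2, 3, 1e-3

def nc_error_i(W):
    f = np.sum(W * Phi, axis=1)        # member outputs f_j = w_j . phi_j
    fbar = f.mean()
    return 0.5 * (f[i] - t) ** 2 - gamma * (f[i] - fbar) ** 2

def diag_hessian_entry(W):
    """Central second difference of e_i with respect to w_qi."""
    Wp, Wm = W.copy(), W.copy()
    Wp[i, q] += eps
    Wm[i, q] -= eps
    return (nc_error_i(Wp) - 2 * nc_error_i(W) + nc_error_i(Wm)) / eps ** 2

analytic = (1 - 2 * gamma * (1 - 1 / M) ** 2) * Phi[i, q] ** 2
assert np.isclose(diag_hessian_entry(W), analytic, atol=1e-5)
# The bracket 1 - 2*gamma*(1 - 1/M)^2 changes sign at gamma = M^2 / (2 (M - 1)^2).
print(diag_hessian_entry(W), analytic, M ** 2 / (2 * (M - 1) ** 2))
```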
73 Now using the Ambiguity decomposition, we have the result E{ (1/M) ∑_i (fi − t)² − (1/M) ∑_i (fi − f̄)² } = bias² + (1/M) var + (1 − 1/M) covar. [sent-639, score-0.88]
74 We have e_ens = E{ (1/M) ∑_i [ (fi − t)² − α(fi − f̄)² ] } = (1/M) ∑_i E{ (fi − E{f̄} + E{f̄} − t)² − α(fi − E{f̄} + E{f̄} − f̄)² }. [sent-643, score-1.741]
75 Now multiply out the brackets, thus e_ens = (1/M) ∑_i E{ (fi − E{f̄})² + (E{f̄} − t)² + 2(fi − E{f̄})(E{f̄} − t) − α(fi − E{f̄})² − α(E{f̄} − f̄)² − 2α(fi − E{f̄})(E{f̄} − f̄) }. [sent-644, score-1.741]
76 and evaluate the expectation and summation, giving us e_ens = (1/M) ∑_i E{ (fi − E{f̄})² } + (E{f̄} − t)² − α (1/M) ∑_i E{ (fi − E{f̄})² } − α E{ (f̄ − E{f̄})² } − 2α E{ (f̄ − E{f̄})(E{f̄} − f̄) }. [sent-645, score-0.927]
77 M i and finally by rearranging the last term we obtain eens = 1 E ( fi − E{ f¯})2 + (E{ f¯} − t)2 M∑ i 1 − α ∑ E ( fi − E{ f¯})2 + αE ( f¯ − E{ f¯})2 . [sent-646, score-0.927]
78 Therefore, E{ (1/M) ∑_i (fi − f̄)² } = (1/M) ∑_i E{ (fi − E{f̄})² } − E{ (f̄ − E{f̄})² } = Ω − var(f̄). [sent-650, score-0.814]
79 And on the other side of the Ambiguity decomposition, the expected value of the average individual error is whatever parts do not contain α; this is E{ (1/M) ∑_i (fi − t)² } = (1/M) ∑_i E{ (fi − E{f̄})² } + (E{f̄} − t)² = Ω + bias(f̄)². [sent-651, score-0.863]
80 This interaction term, Ω, is present on both sides, and cancels out to allow the normal bias-variance decomposition of ensemble error. [sent-652, score-0.6]
81 If we examine it a little further, we see (1/M) ∑_i E{ (fi − E{f̄})² } = (1/M) ∑_i E{ (fi − E{fi} + E{fi} − E{f̄})² } = (1/M) ∑_i E{ (fi − E{fi})² } + (1/M) ∑_i (E{fi} − E{f̄})². [sent-654, score-2.849]
82 where we have used that E{ fi E{fi} } = E{fi}² and also E{ fi E{f̄} } = E{fi} E{f̄}. [sent-655, score-2.035]
83 This is an ensemble of three MLPs, with three inputs and three hidden nodes each, using a uniformly weighted combination as the ensemble output. [Figure 24: A typical ensemble architecture; labelled nodes include hq, wqi, fi and f̄.] [sent-661, score-2.357]
84 If we consider the ensemble as a single entity, then the error of this system at a single point is defined as e_ens = (1/2)(f̄ − t)². (54) [sent-663, score-0.677]
85 In this case, an update to the weight wqi would involve the gradient ∂e_ens/∂wqi = (∂e_ens/∂fi)(∂fi/∂wqi) = (1/M)(f̄ − t) · ∂fi/∂wqi. (55) [sent-664, score-1.467]
86 Note that we are assuming an ensemble of networks with linear output functions—if this is the case, the second term in the above error gradient, ∂fi/∂wqi, evaluates to simply the output of the relevant hidden node. [sent-665, score-0.713]
87 We calculated (55) in one simple step using the chain rule, treating the ensemble as a single unit—if we perform this instead starting from the decomposed form of the ensemble error, it highlights more interesting results. [sent-667, score-1.082]
88 If we do this we discover that the gradient of the ensemble error function is a sum of four distinct components, shown and described in Table 3. [sent-669, score-0.621]
89 Each of these components contributes to the gradient of the ensemble error in eq. [sent-670, score-0.621]
90 −(1/M)(fi − f̄): this is the component of the error gradient due to the difference between fi and f̄ (due to the fact that fi is changing). [sent-674, score-1.301]
91 This is the component of the error gradient due to the difference between fi and f̄ (due to the fact that f̄ changes, because fi is changing). [sent-675, score-0.894]
92 Furthermore, a simple rearrangement now shows that the error gradient for fi in an ensemble using NC is ∂/∂fi [ (1/2)(fi − t)² − γ(fi − f̄)² ] = (fi − t) − 2γ(1 − 1/M)(fi − f̄) = (fi − t) − 2γ [ (fi − f̄) − (1/M)(fi − f̄) ] = A − 2γ(B − C). [sent-678, score-3.877]
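The components A = (fi − t), B = (fi − f̄) and C = (1/M)(fi − f̄) can be computed directly and compared with a finite-difference derivative of the NC error, and the λ = 1 proportionality to the single-unit gradient can be confirmed at the same time. The snippet below is an illustrative check with arbitrary toy values, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)
f = rng.normal(size=5)                  # toy member outputs at one datapoint
i, t, gamma, eps = 0, 0.4, 0.3, 1e-6
M, fbar = len(f), f.mean()

A = f[i] - t                            # difference to the target
B = f[i] - fbar                         # difference to fbar, with f_i moving
C = (f[i] - fbar) / M                   # difference to fbar, through fbar moving with f_i

def e_nc(fi):
    g = f.copy()
    g[i] = fi
    return 0.5 * (fi - t) ** 2 - gamma * (fi - g.mean()) ** 2

nc_grad = A - 2 * gamma * (B - C)
numeric = (e_nc(f[i] + eps) - e_nc(f[i] - eps)) / (2 * eps)
assert np.isclose(nc_grad, numeric, atol=1e-5)

# With lambda = 1, i.e. gamma = M / (2 (M - 1)), the NC gradient equals (fbar - t),
# i.e. M times the gradient of the ensemble treated as a single unit, (1/M)(fbar - t).
gamma_eq = M / (2 * (M - 1))
assert np.isclose(A - 2 * gamma_eq * (B - C), fbar - t)
print(nc_grad, numeric)
```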
93 From all this we can understand, within a single framework, the relationships between minimising the simple ensemble error, the NC ensemble error, and the error of a single network, as described in Table 4. [sent-680, score-1.102]
94 If we set λ = 1, or equivalently γ = M/(2(M−1)), we see that the gradient of the individual error with NC is directly proportional to the gradient for the ensemble seen as a single entity, i.e. [sent-681, score-0.704]
95 We have an ensemble of M = 5 networks, but for clarity the outputs of the other ensemble members are not shown; the resulting ensemble output is f¯ = 3, too low, left of the target. [sent-690, score-1.664]
96 When updating the value of fi , a simple ensemble will use the gradient measurement ( fi −t) = 4, resulting in fi being shifted left, towards the target. [sent-691, score-1.819]
97 An ensemble using NC will include three gradient components, A − 2γ(B − C) = (fi − t) − 2γ [ (fi − f̄) − (1/M)(fi − f̄) ] (62) = 4 − 2γ(5 − (1/5)·5) = 4 − 8γ. [sent-693, score-1.819]
98 For γ < 0.5 the quantity 4 − 8γ is still a positive gradient for fi, meaning the ensemble output will still be moved away from the target. [sent-696, score-1.026]
99 For γ > 0.5 the gradient 4 − 8γ becomes negative, giving a pressure for the network fi to move away from the target, causing the ensemble output to move closer to the target. [sent-699, score-1.013]
100 The setting of the γ value provides a way of finding a trade-off between these gradient components that will cause the ensemble output f̄ to move toward the target value t. [sent-700, score-0.619]
wordName wordTfidf (topN-words)
[('ensemble', 0.541), ('nc', 0.471), ('fi', 0.407), ('wqi', 0.189), ('mse', 0.148), ('hids', 0.136), ('eens', 0.113), ('rown', 0.113), ('yatt', 0.113), ('anaging', 0.106), ('iversity', 0.106), ('nsembles', 0.106), ('diversity', 0.097), ('ensembles', 0.092), ('logp', 0.089), ('ambiguity', 0.087), ('ei', 0.084), ('egression', 0.071), ('nets', 0.071), ('gamma', 0.07), ('hidden', 0.068), ('estimators', 0.067), ('liu', 0.066), ('qi', 0.064), ('hessian', 0.06), ('decomposition', 0.059), ('gradient', 0.057), ('nodes', 0.054), ('estimator', 0.052), ('penalty', 0.049), ('brown', 0.047), ('bagging', 0.046), ('tino', 0.045), ('yao', 0.044), ('network', 0.044), ('testing', 0.042), ('boston', 0.041), ('mlp', 0.039), ('networks', 0.039), ('datapoint', 0.038), ('fens', 0.038), ('wyatt', 0.038), ('generalisation', 0.038), ('rbf', 0.037), ('friedman', 0.033), ('var', 0.033), ('bias', 0.033), ('ueda', 0.032), ('mlps', 0.032), ('decompositions', 0.031), ('ediv', 0.03), ('qth', 0.03), ('varying', 0.029), ('evolutionary', 0.028), ('individual', 0.026), ('krogh', 0.025), ('nakano', 0.025), ('upper', 0.024), ('correlation', 0.023), ('emphasis', 0.023), ('error', 0.023), ('bham', 0.023), ('covar', 0.023), ('mckay', 0.023), ('regression', 0.022), ('per', 0.022), ('learners', 0.022), ('output', 0.021), ('coef', 0.021), ('covariance', 0.021), ('bound', 0.021), ('ith', 0.02), ('epochs', 0.02), ('minimising', 0.02), ('individuals', 0.02), ('members', 0.02), ('understand', 0.019), ('lambda', 0.019), ('squared', 0.018), ('strength', 0.018), ('variance', 0.017), ('geman', 0.017), ('landscape', 0.017), ('birmingham', 0.017), ('article', 0.016), ('validation', 0.016), ('ci', 0.016), ('boosting', 0.016), ('divergence', 0.016), ('architecture', 0.016), ('gure', 0.015), ('encourage', 0.015), ('entity', 0.015), ('sensible', 0.015), ('optimise', 0.015), ('minimise', 0.015), ('portions', 0.015), ('cancel', 0.015), ('abbass', 0.015), ('combiners', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 54 jmlr-2005-Managing Diversity in Regression Ensembles
Author: Gavin Brown, Jeremy L. Wyatt, Peter Tiňo
Abstract: Ensembles are a widely used and effective technique in machine learning—their success is commonly attributed to the degree of disagreement, or ‘diversity’, within the ensemble. For ensembles where the individual estimators output crisp class labels, this ‘diversity’ is not well understood and remains an open research issue. For ensembles of regression estimators, the diversity can be exactly formulated in terms of the covariance between individual estimator outputs, and the optimum level is expressed in terms of a bias-variance-covariance trade-off. Despite this, most approaches to learning ensembles use heuristics to encourage the right degree of diversity. In this work we show how to explicitly control diversity through the error function. The first contribution of this paper is to show that by taking the combination mechanism for the ensemble into account we can derive an error function for each individual that balances ensemble diversity with individual accuracy. We show the relationship between this error function and an existing algorithm called negative correlation learning, which uses a heuristic penalty term added to the mean squared error function. It is demonstrated that these methods control the bias-variance-covariance trade-off systematically, and can be utilised with any estimator capable of minimising a quadratic error function, for example MLPs, or RBF networks. As a second contribution, we derive a strict upper bound on the coefficient of the penalty term, which holds for any estimator that can be cast in a generalised linear regression framework, with mild assumptions on the basis functions. Finally we present the results of an empirical study, showing significant improvements over simple ensemble learning, and finding that this technique is competitive with a variety of methods, including boosting, bagging, mixtures of experts, and Gaussian processes, on a number of tasks. Keywords: ensemble, diversity, regression estimators, neural networks, hessia
2 0.081165858 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
Author: Malte Kuss, Carl Edward Rasmussen
Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC
3 0.077455893 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
4 0.070932433 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
Author: Damien Ernst, Pierre Geurts, Louis Wehenkel
Abstract: Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (xt , ut , rt , xt+1 ) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees. Keywords: batch mode reinforcement learning, regression trees, ensemble methods, supervised learning, fitted value iteration, optimal control
5 0.059535 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon
Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension
6 0.045018587 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
7 0.033050381 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
8 0.030352022 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
9 0.029521951 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
10 0.028895512 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
11 0.028892465 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
12 0.028725881 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
13 0.02818173 64 jmlr-2005-Semigroup Kernels on Measures
14 0.026967239 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.02556799 71 jmlr-2005-Variational Message Passing
16 0.024839075 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
17 0.024561044 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
18 0.024031496 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
19 0.023594363 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
20 0.022828816 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
topicId topicWeight
[(0, 0.159), (1, -0.001), (2, -0.002), (3, -0.054), (4, 0.042), (5, 0.003), (6, 0.063), (7, 0.009), (8, 0.137), (9, 0.064), (10, 0.046), (11, 0.122), (12, 0.291), (13, 0.056), (14, 0.076), (15, 0.079), (16, 0.103), (17, -0.057), (18, -0.307), (19, -0.04), (20, -0.414), (21, 0.007), (22, -0.095), (23, 0.023), (24, 0.024), (25, 0.107), (26, 0.217), (27, -0.321), (28, 0.013), (29, 0.221), (30, -0.107), (31, -0.148), (32, 0.006), (33, -0.043), (34, 0.141), (35, -0.177), (36, -0.049), (37, -0.022), (38, -0.127), (39, -0.111), (40, 0.031), (41, -0.007), (42, 0.042), (43, -0.135), (44, 0.053), (45, -0.082), (46, 0.039), (47, -0.024), (48, -0.082), (49, 0.109)]
simIndex simValue paperId paperTitle
same-paper 1 0.98760462 54 jmlr-2005-Managing Diversity in Regression Ensembles
Author: Gavin Brown, Jeremy L. Wyatt, Peter Tiňo
Abstract: Ensembles are a widely used and effective technique in machine learning—their success is commonly attributed to the degree of disagreement, or ‘diversity’, within the ensemble. For ensembles where the individual estimators output crisp class labels, this ‘diversity’ is not well understood and remains an open research issue. For ensembles of regression estimators, the diversity can be exactly formulated in terms of the covariance between individual estimator outputs, and the optimum level is expressed in terms of a bias-variance-covariance trade-off. Despite this, most approaches to learning ensembles use heuristics to encourage the right degree of diversity. In this work we show how to explicitly control diversity through the error function. The first contribution of this paper is to show that by taking the combination mechanism for the ensemble into account we can derive an error function for each individual that balances ensemble diversity with individual accuracy. We show the relationship between this error function and an existing algorithm called negative correlation learning, which uses a heuristic penalty term added to the mean squared error function. It is demonstrated that these methods control the bias-variance-covariance trade-off systematically, and can be utilised with any estimator capable of minimising a quadratic error function, for example MLPs, or RBF networks. As a second contribution, we derive a strict upper bound on the coefficient of the penalty term, which holds for any estimator that can be cast in a generalised linear regression framework, with mild assumptions on the basis functions. Finally we present the results of an empirical study, showing significant improvements over simple ensemble learning, and finding that this technique is competitive with a variety of methods, including boosting, bagging, mixtures of experts, and Gaussian processes, on a number of tasks. Keywords: ensemble, diversity, regression estimators, neural networks, hessia
2 0.2869623 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
Author: Damien Ernst, Pierre Geurts, Louis Wehenkel
Abstract: Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (xt , ut , rt , xt+1 ) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees. Keywords: batch mode reinforcement learning, regression trees, ensemble methods, supervised learning, fitted value iteration, optimal control
3 0.21745294 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
Author: Malte Kuss, Carl Edward Rasmussen
Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
5 0.1969645 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon
Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension
6 0.17559828 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
7 0.15749645 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
8 0.11335239 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
9 0.098729335 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
10 0.09865085 71 jmlr-2005-Variational Message Passing
11 0.095143482 64 jmlr-2005-Semigroup Kernels on Measures
12 0.094072282 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
13 0.09393765 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
14 0.091095865 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
15 0.087564148 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
16 0.085027374 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
17 0.082491077 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
18 0.081978478 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
19 0.079907268 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
20 0.075182162 11 jmlr-2005-Algorithmic Stability and Meta-Learning
topicId topicWeight
[(13, 0.017), (17, 0.018), (19, 0.016), (36, 0.057), (37, 0.047), (42, 0.025), (43, 0.041), (47, 0.014), (52, 0.077), (59, 0.02), (70, 0.046), (74, 0.396), (88, 0.076), (90, 0.027), (94, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.76418495 54 jmlr-2005-Managing Diversity in Regression Ensembles
Author: Gavin Brown, Jeremy L. Wyatt, Peter Tiňo
Abstract: Ensembles are a widely used and effective technique in machine learning—their success is commonly attributed to the degree of disagreement, or ‘diversity’, within the ensemble. For ensembles where the individual estimators output crisp class labels, this ‘diversity’ is not well understood and remains an open research issue. For ensembles of regression estimators, the diversity can be exactly formulated in terms of the covariance between individual estimator outputs, and the optimum level is expressed in terms of a bias-variance-covariance trade-off. Despite this, most approaches to learning ensembles use heuristics to encourage the right degree of diversity. In this work we show how to explicitly control diversity through the error function. The first contribution of this paper is to show that by taking the combination mechanism for the ensemble into account we can derive an error function for each individual that balances ensemble diversity with individual accuracy. We show the relationship between this error function and an existing algorithm called negative correlation learning, which uses a heuristic penalty term added to the mean squared error function. It is demonstrated that these methods control the bias-variance-covariance trade-off systematically, and can be utilised with any estimator capable of minimising a quadratic error function, for example MLPs, or RBF networks. As a second contribution, we derive a strict upper bound on the coefficient of the penalty term, which holds for any estimator that can be cast in a generalised linear regression framework, with mild assumptions on the basis functions. Finally we present the results of an empirical study, showing significant improvements over simple ensemble learning, and finding that this technique is competitive with a variety of methods, including boosting, bagging, mixtures of experts, and Gaussian processes, on a number of tasks. Keywords: ensemble, diversity, regression estimators, neural networks, hessia
2 0.62223929 67 jmlr-2005-Stability of Randomized Learning Algorithms
Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil
Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.
3 0.32572889 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
4 0.32378283 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.32298231 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
6 0.32073316 71 jmlr-2005-Variational Message Passing
7 0.31807837 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
8 0.31588733 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
9 0.3149828 64 jmlr-2005-Semigroup Kernels on Measures
10 0.31253737 3 jmlr-2005-A Classification Framework for Anomaly Detection
11 0.31214854 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
12 0.31082681 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
13 0.31064257 39 jmlr-2005-Information Bottleneck for Gaussian Variables
14 0.31059805 44 jmlr-2005-Learning Module Networks
15 0.30685708 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
16 0.30501121 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
17 0.30241483 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
18 0.30056033 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
19 0.30034378 48 jmlr-2005-Learning the Kernel Function via Regularization
20 0.29892728 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels