jmlr jmlr2008 jmlr2008-33 knowledge-graph by maker-knowledge-mining

33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting


Source: pdf

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104-6340, USA. Editor: Yoav Freund. Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. [sent-5, score-0.271]

2 This broad statistical view of boosting is fairly mainstream in the statistics community. [sent-22, score-0.288]

3 In fact, the statistics community has taken to attaching the boosting label to any classification or regression algorithm that incorporates a stagewise optimization. [sent-23, score-0.305]

4 For instance, Freund and Schapire (2000) note that, “one of the main properties of boosting that has made it interesting to statisticians and others is its relative (but not complete) immunity to overfitting,” and write that the paper by Friedman, Hastie and Tibshirani “does not address this issue.” [sent-28, score-0.269]

5 Various arguments are given in response to the question of why boosting seems to not overfit. [sent-30, score-0.248]

6 Many statisticians simply argue that boosting does in fact overfit and construct examples to prove it (e. [sent-34, score-0.269]

7 While single examples certainly disprove claims that boosting never overfits, they do nothing to help us understand why boosting resists overfitting and performs very well for the large collection of examples that raised the question in the first place. [sent-37, score-0.496]

8 Others argue that boosting will eventually overfit in almost all cases if run for enough iterations, but that the number of iterations needed can be quite large since the overfitting is quite slow. [sent-38, score-0.368]

9 Some evidence supporting the argument that boosting will eventually overfit can be found in Grove and Schuurmans (1998) which has examples for which boosting overfits when run for a very large number of iterations. [sent-41, score-0.514]

10 Whatever the explanation for boosting’s resistance to overfitting in so many real and important examples, the statistical view of boosting as an optimization does little to account for this. [sent-53, score-0.288]

11 It is important to note that although this paper deals with “the statistical view of boosting”, it is an overgeneralization to imply there is only one single view of boosting in the statistical community. [sent-62, score-0.328]

12 Much of what we categorize as the statistical view of boosting can be found in that original paper, but other ideas, especially those in Sections 3. [sent-64, score-0.288]

13 AdaBoost (Freund and Schapire, 1996) is one of the first and the most popular boosting algorithms for classification. [sent-98, score-0.248]

14 • Replace the weights w_i with w_i ← w_i · exp(−α_m g_m(x_i) y_i) and then renormalize by replacing each w_i with w_i / (∑_j w_j). [sent-107, score-0.195]
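
A minimal sketch of the full reweighting loop described above, offered as an illustration rather than the authors' implementation; it assumes labels coded as +1/−1 and uses scikit-learn decision stumps as the base learner g_m:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, n_rounds=100):
        # Discrete AdaBoost sketch; y must be coded as +1/-1.
        n = len(y)
        w = np.full(n, 1.0 / n)                      # start with uniform weights
        learners, alphas = [], []
        for m in range(n_rounds):
            g = DecisionTreeClassifier(max_depth=1)  # a stump as the base learner g_m
            g.fit(X, y, sample_weight=w)
            pred = g.predict(X)
            err = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)    # alpha_m
            w = w * np.exp(-alpha * pred * y)        # the update quoted above
            w = w / w.sum()                          # renormalize
            learners.append(g)
            alphas.append(alpha)
        return learners, np.array(alphas)

    def adaboost_predict(learners, alphas, X):
        F = sum(a * g.predict(X) for g, a in zip(learners, alphas))
        return np.sign(F)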

15 Each experiment is meant to illustrate particular inconsistencies between that which is suggested by the statistical view of boosting and what is actually observed in practice. [sent-127, score-0.334]

16 For this reason it has been suggested that one should use stumps (2-node trees) if one believes the optimal Bayes rule is approximately additive, since stumps are trees which only involve single predictors and thus yield an additive model in the predictor space for AdaBoost. [sent-150, score-0.489]

17 The reason is that, in practice, boosting with larger trees often overfits less than boosting with smaller trees, since the larger trees are more orthogonal and a self-averaging process prevents overfitting. [sent-160, score-0.973]

18 While AdaBoost with stumps (thick, black curve) leads to overfitting very early on, AdaBoost with 8-node trees (thin, red curve) does not suffer from overfitting and leads to smaller misclassification error. [sent-166, score-0.411]
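
A rough sketch of how such a comparison could be run with scikit-learn's AdaBoostClassifier; the simulated data set below is a hypothetical stand-in rather than the paper's model, and the estimator= keyword assumes scikit-learn 1.2 or later (older releases call it base_estimator=):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def simulate(n, d=20, q=0.1):
        # Hypothetical noisy model: the Bayes rule depends on the sum of the first 5 coordinates.
        X = rng.uniform(size=(n, d))
        p = np.where(X[:, :5].sum(axis=1) > 2.5, 1 - q, q)   # P(Y = 1 | x)
        y = np.where(rng.uniform(size=n) < p, 1, -1)
        return X, y

    X_tr, y_tr = simulate(2000)
    X_te, y_te = simulate(10000)

    def test_error_curve(base_tree, n_rounds=1000):
        clf = AdaBoostClassifier(estimator=base_tree, n_estimators=n_rounds)
        clf.fit(X_tr, y_tr)
        return [np.mean(pred != y_te) for pred in clf.staged_predict(X_te)]

    err_stumps = test_error_curve(DecisionTreeClassifier(max_depth=1))       # stumps
    err_8node = test_error_curve(DecisionTreeClassifier(max_leaf_nodes=8))   # 8-node trees
    print(err_stumps[-1], err_8node[-1])   # hold-out error after 1000 iterations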

19 In fact, the misclassification error by 1000 iterations was smaller for the 8-node trees in 96 of the 100 simulations. [sent-167, score-0.336]

20 It is worth further commenting on the fact that in this simulation AdaBoost with stumps leads to overfitting while AdaBoost with the larger 8-node trees does not, at least by 1000 iterations. [sent-178, score-0.353]

21 This is of special interest since many of the examples other researchers provide to show AdaBoost can in fact overfit often use very small trees such as stumps as the base learner. [sent-179, score-0.277]

22 1 The belief is that if stumps overfit then so will larger trees since the larger trees are more complex. [sent-182, score-0.462]

23 Similar arguments to those in the previous section suggest that it is necessary to use smaller trees for AdaBoost when the Bayes error is larger. [sent-195, score-0.216]

24 The reasoning is that when the Bayes error is larger, the larger trees lead to a more complex model which is more susceptible to overfitting noise. [sent-196, score-0.216]

25 This counterintuitive result may be explained by the self-averaging which occurs during the boosting iterations as discussed by Krieger et al. [sent-199, score-0.368]

26 Conversely, the smaller trees often work well for lower Bayes error rates, provided they are rich enough to capture the complexity of the signal. [sent-201, score-0.216]

27 In order to prevent overfitting, one popular regularization technique is to stop boosting algorithms after a very small number of iterations, such as 10 or 100. [sent-265, score-0.305]

28 For example, the paper “Boosting with Early Stopping: Convergence and Consistency” by Zhang and Yu (2005) tells readers that “boosting forever can overfit the data” and that “therefore in order to achieve consistency, it is necessary to stop the boosting procedure early.” [sent-267, score-0.269]

29 Standard implementations of boosting such as the popular gbm package for R by Ridgeway (2005) implement data-derived early stopping rules. [sent-268, score-0.401]
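
The gbm stopping rules are an R facility; a rough scikit-learn analogue, offered only as an illustration on a synthetic stand-in data set, is the validation-based early stopping built into GradientBoostingClassifier:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Stand-in noisy data set (not the paper's simulation model).
    X, y = make_classification(n_samples=4000, n_features=20, flip_y=0.1, random_state=0)

    # Hold out 20% of the training data internally and stop once the validation
    # score has failed to improve for 10 consecutive iterations.
    gbt = GradientBoostingClassifier(n_estimators=1000,
                                     validation_fraction=0.2,
                                     n_iter_no_change=10,
                                     random_state=0)
    gbt.fit(X, y)
    print("data-derived stopping point:", gbt.n_estimators_, "trees")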

30 The reasoning behind early stopping is that once enough iterations have occurred for the complexity of the algorithm to match the complexity of the underlying true signal, any additional iterations will lead to overfitting and consequently to larger misclassification error. [sent-269, score-0.358]

31 Since the statistical view of boosting centers on the stagewise minimization of a certain loss function on the training data, a common suggestion is that regularization should be based on the behavior of that loss function on a hold out or cross-validation sample. [sent-302, score-0.487]

32 Indeed, if early stopping is to be used as regularization, the statistical view would suggest stopping when this exponential loss function begins to increase on a hold out sample. [sent-304, score-0.309]

33 Thus, early stopping regularization based on the loss function would suggest stopping after just one iteration, when in fact Figure 1 shows we do best to run the 8-node trees for the full 1000 iterations. [sent-310, score-0.408]
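
A sketch of monitoring both quantities on a hold-out sample and watching them move in opposite directions; it uses a stand-in data set and rebuilds the unnormalized score F_m(x) = ∑_{k≤m} α_k g_k(x) from the fitted trees (algorithm="SAMME" requests discrete AdaBoost and may draw a deprecation warning from recent scikit-learn releases):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y01 = make_classification(n_samples=4000, n_features=20, flip_y=0.1, random_state=0)
    y = 2 * y01 - 1                                          # code labels as +1/-1
    X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.5, random_state=0)

    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_leaf_nodes=8),
                             n_estimators=500, algorithm="SAMME")
    clf.fit(X_tr, y_tr)

    F = np.zeros(len(y_ho))                                  # running score on the hold-out set
    for m, (g, alpha) in enumerate(zip(clf.estimators_, clf.estimator_weights_), 1):
        F += alpha * g.predict(X_ho)                         # alpha_m * g_m(x), with g_m in {-1, +1}
        exp_loss = np.mean(np.exp(-y_ho * F))                # exponential loss on the hold-out set
        miscls = np.mean(np.sign(F) != y_ho)                 # misclassification on the hold-out set
        if m % 100 == 0:
            print(m, round(exp_loss, 3), round(miscls, 3))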

34 Another popular misconception about boosting is that one needs to restrict the class of trees in order to prevent overfitting. [sent-316, score-0.431]

35 It is unclear why this happens; however, we note that related tree ensemble algorithms such as PERT (Cutler and Zhao, 2001) have demonstrated success by growing the trees until only a single observation remains in each terminal node. [sent-327, score-0.205]

36 We revisit the simulation from Section 3.1 and compare the (unrestricted) 8-node trees used there to 8-node trees restricted to have at least 15 observations in each terminal node. [sent-329, score-0.318]

37 Figure 6 shows the results with the unrestricted 8-node trees given by the red (thin) curve and the restricted 8-node trees given by the purple (thick) curve. [sent-336, score-0.469]

38 AdaBoost with unrestricted 8-node trees gave a lower misclassification error in 67 of the 100 repetitions. [sent-342, score-0.247]
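
In scikit-learn terms the restriction corresponds to min_samples_leaf; a sketch of the comparison on a stand-in data set, mirroring the 15-observation limit quoted above (estimator= assumes scikit-learn 1.2 or later):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=4000, n_features=20, flip_y=0.1, random_state=4)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=4)

    def final_error(base_tree):
        clf = AdaBoostClassifier(estimator=base_tree, n_estimators=1000).fit(X_tr, y_tr)
        return np.mean(clf.predict(X_te) != y_te)

    unrestricted = final_error(DecisionTreeClassifier(max_leaf_nodes=8))
    restricted = final_error(DecisionTreeClassifier(max_leaf_nodes=8,
                                                    min_samples_leaf=15))  # at least 15 obs per node
    print(unrestricted, restricted)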

39 Shrinkage is yet another form of regularization that is often used for boosting algorithms. [sent-345, score-0.281]

40 In Figure 7 the red (thin) curve corresponds to the misclassification error for the 8-node trees just as in Section 3. [sent-360, score-0.295]

41 By 1000 iterations shrinkage gave a larger misclassification error in 95 of the 100 repetitions. [sent-364, score-0.26]
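
Shrinkage multiplies each tree's contribution by a factor ν < 1 and corresponds to the learning_rate parameter in common implementations; a sketch contrasting ν = 1 with ν = 0.1 on a stand-in data set:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=4000, n_features=20, flip_y=0.1, random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

    def error_curve(nu, n_rounds=1000):
        clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_leaf_nodes=8),
                                 n_estimators=n_rounds,
                                 learning_rate=nu)           # nu scales each tree's contribution
        clf.fit(X_tr, y_tr)
        return [np.mean(p != y_te) for p in clf.staged_predict(X_te)]

    plain = error_curve(1.0)      # no shrinkage
    shrunk = error_curve(0.1)     # shrinkage nu = 0.1
    print(plain[-1], shrunk[-1])  # hold-out error after 1000 iterations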

42 The idea that boosting produces probability estimates follows directly from the statistical view through the stagewise minimization of the loss function. [sent-371, score-0.4]

43 Standard implementations of boosting such as Dettling and Buhlmann’s LogitBoost code at http://stat. [sent-377, score-0.277]

44 Despite the belief that boosting is estimating probabilities, the estimator p̂_m(x) given above is often extremely overfit in many cases in which the classification rule from AdaBoost shows no signs of overfitting and performs quite well. [sent-381, score-0.311]
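
Under the statistical view the probability estimate is the logistic transform of the boosted score, p̂_m(x) = 1/(1 + e^(−2 F_m(x))). A sketch, on a stand-in data set and rebuilding the score from the fitted trees, of how these estimates pile up near 0 and 1 as the iterations accumulate:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y01 = make_classification(n_samples=4000, n_features=20, flip_y=0.2, random_state=2)
    y = 2 * y01 - 1
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=2)

    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_leaf_nodes=8),
                             n_estimators=1000, algorithm="SAMME")
    clf.fit(X_tr, y_tr)

    def prob_estimates(n_iter):
        F = np.zeros(len(y_te))
        for g, alpha in zip(clf.estimators_[:n_iter], clf.estimator_weights_[:n_iter]):
            F += alpha * g.predict(X_te)
        F = np.clip(F, -250, 250)                  # avoid overflow in the exponential
        return 1.0 / (1.0 + np.exp(-2.0 * F))      # p_m(x) from the statistical view

    for m in (10, 1000):
        p = prob_estimates(m)
        print(m, "iterations:", round(np.mean((p < 0.01) | (p > 0.99)), 3),
              "of estimates fall outside (0.01, 0.99)")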

45 In Figure 1 we saw that the classification rule using 8-node trees performed well and did not overfit even by 1000 iterations. [sent-384, score-0.236]

46 Other researchers have also noted this type of overfitting with boosting and have used it as an argument in favor of regularization techniques. [sent-407, score-0.299]

47 For instance, it is possible that using a regularization technique such as shrinkage or the restriction to stumps as the base learners in this situation could produce better probability estimates. [sent-408, score-0.234]

48 This characteristic is quite typical of AdaBoost and has led some researchers to draw parallels to the (one) nearest neighbor classifier, a classifier which necessarily also yields zero misclassification error on the training data. [sent-429, score-0.227]

49 The belief in a similarity between boosting and the nearest neighbor classifier was not expressed in the original paper of Friedman et al. [sent-431, score-0.424]

50 (2000a), but rather has been expressed more recently in the statistics literature on boosting by authors such as Wenxin Jiang in papers such as Jiang (2000), Jiang (2001) and Jiang (2002). [sent-432, score-0.248]

51 However, as we will see from the experiment in this section, the behavior of AdaBoost even in d = 2 dimensions is radically different from the nearest neighbor rule. [sent-435, score-0.201]

52 The left plot in Figure 9 shows the resulting classification of AdaBoost using 8-node trees after 1000 iterations and the right plot shows the rule for nearest neighbor. [sent-450, score-0.396]

53 The nearest neighbor classifier has 20% of the region colored as light blue (as expected), while AdaBoost has only 16%. [sent-454, score-0.224]

54 For the simulation from Section 3.4, the area (volume) of the light blue region would be essentially zero for AdaBoost (as evidenced by its misclassification error rate matching almost exactly that of the Bayes error), while for nearest neighbor it remains at 20% as expected. [sent-458, score-0.289]

55 Thus we see that the nearest neighbor classifier differs from the Bayes rule for 20% of the points in both the training data and the population while AdaBoost differs from the Bayes rule for 20% of the points in the training data but virtually none in the population. [sent-459, score-0.264]
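
A sketch contrasting the two rules on a small pure-noise example (a hypothetical setup, not the paper's): X is uniform on the unit square, the label is +1 with probability 0.8 regardless of x, so the Bayes rule always predicts +1 and the Bayes error is 0.2; we then measure how often each fitted classifier disagrees with the Bayes rule on fresh points.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(3)
    n, q = 2000, 0.2                                  # q = Bayes error of the pure-noise model

    X_tr = rng.uniform(size=(n, 2))
    y_tr = np.where(rng.uniform(size=n) < q, -1, 1)   # labels are noise; Bayes rule: always +1
    X_new = rng.uniform(size=(100000, 2))             # fresh points from the same distribution

    ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_leaf_nodes=8),
                             n_estimators=1000).fit(X_tr, y_tr)
    nn1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

    # Both fit the noisy training labels (1-NN exactly, AdaBoost nearly so), yet the
    # fraction of the input space on which they disagree with the Bayes rule differs;
    # increasing the dimension widens the gap further.
    print("AdaBoost disagrees with Bayes on", np.mean(ada.predict(X_new) == -1))
    print("1-NN disagrees with Bayes on", np.mean(nn1.predict(X_new) == -1))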

56 For the simulation in Section 3.1, the nearest neighbor classifier had an average misclassification error rate of 0. [sent-462, score-0.231]

57 Consequently, all work on the consistency of boosting deals with regularized techniques. [sent-497, score-0.248]

58 In Section 3.4 we observed that with a sample size of n = 5000, AdaBoost with 2^8-node trees achieved the Bayes error rate to three decimals on a 20% pure noise example despite fitting all the training data. [sent-501, score-0.33]

59 In this section we consider a simulation with this same sample size and again 2^8-node trees, but we now include a signal in addition to the noise. [sent-502, score-0.235]

60 As before, AdaBoost fits all the training data early on, but the misclassification error after 1000 iterations averages only 0. [sent-506, score-0.251]

61 Jiang (2002) admits this when he writes, “what about boosting forever with a higher dimensional random continuous predictor x with dim(x) > 1? [sent-522, score-0.269]

62 As in Section 3.1, AdaBoost with 8-node trees (thin, red curve) does not show any signs of overfitting, while AdaBoost with stumps (thick, black curve) leads to overfitting. [sent-564, score-0.357]

63 Furthermore, AdaBoost with 8-node trees outperforms AdaBoost with stumps throughout the entire 1000 iterations. [sent-570, score-0.277]

64 The misclassification error by 1000 iterations was smaller for the 8-node trees in 93 of the 100 simulations. [sent-571, score-0.336]

65 We see that the advantage of the 8-node trees has completely disappeared, and now the 8-node trees and stumps are indistinguishable. [sent-586, score-0.436]

66 This directly contradicts the conventional wisdom that boosting with larger trees is more likely to overfit on noisy data than boosting with smaller trees. [sent-590, score-0.655]

67 Furthermore, the misclassification error for AdaBoost after 1000 iterations is (slightly) lower than the minimum misclassification error achieved by LogitBoost. [sent-597, score-0.234]

68 We note that conventional early stopping rules here would be especially harmful since they would stop the algorithm after only a few iterations when the overfitting first takes place. [sent-626, score-0.238]

69 It should also be noted that the 2^8 = 256-node trees used here are much richer than needed to fit the simple one-dimensional Bayes decision rule for this simulation model. [sent-628, score-0.29]

70 Despite this, the misclassification error after 1000 iterations was lower than the misclassification error after the first iteration in all 100 of the repetitions. [sent-629, score-0.234]

71 Thus it is again the self-averaging property of boosting that improves the performance as more and more iterations are run. [sent-630, score-0.368]

72 As in Section 3.5, regularization techniques for boosting such as early stopping are often based on minimizing a loss function, such as the exponential loss in the case of AdaBoost. [sent-635, score-0.499]

73 5, the exponential loss increases exponentially as more iterations are run, while the misclassification error continues to decrease. [sent-641, score-0.243]

74 In Section 3.6 we saw that restricting the number of observations in the terminal nodes of the trees to be at least 15 degraded the performance of AdaBoost, despite the common belief that such restrictions should be beneficial. [sent-646, score-0.3]

75 AdaBoost with unrestricted 8-node trees gave a lower misclassification error at 1000 iterations in 65 of the 100 repetitions for this simulation model. [sent-660, score-0.473]

76 In Section 3.7 we saw that shrinkage actually caused AdaBoost to overfit in a situation where it otherwise would not have, in spite of the popular belief that shrinkage prevents overfitting. [sent-664, score-0.232]

77 We revisit the earlier simulation with 8-node trees, again using a shrinkage value of 0. [sent-666, score-0.242]

78 7, shrinkage has the beneficial effect of producing a lower misclassification error very early on in the process, despite causing the eventual overfitting. [sent-675, score-0.223]

79 This suggests that a stopping rule which could accurately estimate the optimal number of iterations combined with shrinkage may prove very effective for this particular simulation. [sent-676, score-0.304]

80 As a result of the good performance early on, the shrinkage actually gives a lower misclassification error at our chosen stopping point of 1000 iterations than without the shrinkage. [sent-677, score-0.378]

81 However, if we run for enough iterations (the plot shows 5000 iterations) the overfitting caused by the shrinkage eventually overwhelms this advantage. [sent-678, score-0.203]

82 By 5000 iterations the shrinkage leads to a larger misclassification error in 87 of the 100 repetitions. [sent-679, score-0.26]

83 The two histograms in Figure 18 show the resulting probability estimates for m = 10 iterations and m = 1000 iterations respectively using 8-node trees. [sent-691, score-0.261]

84 At 10 iterations the estimates have not yet diverged, but by 1000 iterations almost all of the probability estimates are greater than 0. [sent-693, score-0.282]

85 In Section 3.9 we saw that despite the fact that boosting agrees with the nearest neighbor classifier on all the training data, its performance elsewhere is quite different for d > 1 dimensions. [sent-715, score-0.487]

86 For AdaBoost, areas surrounding points in the training data for which the observed class differs from that of the Bayes rule are classified according to the Bayes rule more often than they would be using the nearest neighbor rule. [sent-716, score-0.244]

87 The plot on the left in Figure 19 shows the resulting classification rule of AdaBoost with 8-node trees at 1000 iterations for a single repetition using the new simulation model. [sent-720, score-0.419]

88 1 the misclassification error for the nearest neighbor classifier was 0. [sent-727, score-0.207]

89 As in Section 3.10, we use 2^8-node trees and a Bayes error rate of q = 0. [sent-764, score-0.24]

90 10, this is extremely close to the Bayes error rate and much less than the nearest neighbor bound of 2q(1 − q) = 0. [sent-772, score-0.231]

91 We encourage readers to rerun the simulation with larger n to make the misclassification error even closer to the Bayes error. [sent-774, score-0.197]

92 Concluding Remarks and Practical Suggestions By way of the simulations in Sections 3 and 4 we have seen that there are many problems with the statistical view of boosting and practical suggestions arising from that view. [sent-788, score-0.307]

93 Simply put, the goal of this paper has been to call into question this view of boosting that has come to dominate in the statistics community. [sent-790, score-0.288]

94 The statistical view of boosting focuses on only one aspect of the algorithm: the optimization. [sent-792, score-0.288]

95 A more comprehensive view of boosting should also consider the stagewise nature of the algorithm as well as the empirical variance reduction that can be observed on hold out samples as with the experiments in this paper. [sent-793, score-0.366]

96 This perspective is more in line with that of Breiman (e.g., Breiman, 2000, 2001), who subsequently abandoned interest in boosting and went on to work on his own classification technique known as Random Forests. [sent-796, score-0.248]

97 First of all, AdaBoost remains one of, if not the, most successful boosting algorithms. [sent-799, score-0.248]

98 One should not assume that newer, regularized and modified versions of boosting are necessarily better. [sent-800, score-0.248]

99 Secondly, if classification is your goal, the best way to judge the effectiveness of boosting is by monitoring the misclassification error on hold out (or cross-validation) samples. [sent-803, score-0.326]
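
A sketch of that suggestion in practice, on a stand-in data set: judge a boosting fit by its cross-validated misclassification error rather than by a surrogate loss.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=4000, n_features=20, flip_y=0.1, random_state=5)

    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_leaf_nodes=8),
                             n_estimators=500)
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print("cross-validated misclassification error:", round(1 - acc.mean(), 3))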

100 On weak base hypotheses and their implications for boosting regression and classification. [sent-902, score-0.248]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('adaboost', 0.76), ('boosting', 0.248), ('misclassi', 0.195), ('logitboost', 0.176), ('trees', 0.159), ('bayes', 0.121), ('iterations', 0.12), ('stumps', 0.118), ('thick', 0.113), ('misclassification', 0.105), ('yner', 0.091), ('jiang', 0.087), ('iew', 0.084), ('ontrary', 0.084), ('vidence', 0.084), ('tting', 0.084), ('shrinkage', 0.083), ('thin', 0.081), ('nearest', 0.08), ('simulation', 0.076), ('tatistical', 0.075), ('oosting', 0.075), ('neighbor', 0.07), ('stopping', 0.064), ('mease', 0.063), ('red', 0.059), ('error', 0.057), ('additive', 0.057), ('stagewise', 0.057), ('early', 0.054), ('fm', 0.053), ('friedman', 0.052), ('ease', 0.051), ('terminal', 0.046), ('gm', 0.045), ('buja', 0.041), ('purple', 0.041), ('saw', 0.04), ('view', 0.04), ('writes', 0.038), ('rule', 0.037), ('hastie', 0.036), ('gbm', 0.035), ('rerun', 0.035), ('colored', 0.035), ('ridgeway', 0.035), ('loss', 0.034), ('breiman', 0.034), ('dettling', 0.033), ('regularization', 0.033), ('exponential', 0.032), ('buhlmann', 0.031), ('unrestricted', 0.031), ('repetitions', 0.03), ('classi', 0.03), ('despite', 0.029), ('code', 0.029), ('experiment', 0.029), ('encourage', 0.029), ('repetition', 0.027), ('belief', 0.026), ('annals', 0.026), ('schapire', 0.026), ('iid', 0.025), ('wi', 0.025), ('freund', 0.024), ('prevent', 0.024), ('rate', 0.024), ('pure', 0.024), ('tibshirani', 0.024), ('cation', 0.024), ('logistic', 0.023), ('blue', 0.022), ('dimensions', 0.022), ('statisticians', 0.021), ('forever', 0.021), ('boosted', 0.021), ('revisit', 0.021), ('estimates', 0.021), ('black', 0.021), ('hold', 0.021), ('ts', 0.021), ('curve', 0.02), ('displays', 0.02), ('training', 0.02), ('krieger', 0.019), ('evidenced', 0.019), ('suggestions', 0.019), ('noted', 0.018), ('er', 0.018), ('evidence', 0.018), ('degradation', 0.017), ('light', 0.017), ('averaged', 0.017), ('cube', 0.017), ('decimals', 0.017), ('destroy', 0.017), ('experimenting', 0.017), ('inconsistencies', 0.017), ('latin', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

2 0.3562879 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting

Author: Kristin P. Bennett

Abstract: unkown-abstract

3 0.1663208 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

4 0.053338379 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter

Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values

5 0.046771079 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

Author: Elena Marchiori

Abstract: In supervised learning, a training set consisting of labeled instances is used by a learning algorithm for generating a model (classifier) that is subsequently employed for deciding the class label of new instances (for generalization). Characteristics of the training set, such as presence of noisy instances and size, influence the learning algorithm and affect generalization performance. This paper introduces a new network-based representation of a training set, called hit miss network (HMN), which provides a compact description of the nearest neighbor relation over pairs of instances from each pair of classes. We show that structural properties of HMN’s correspond to properties of training points related to the one nearest neighbor (1-NN) decision rule, such as being border or central point. This motivates us to use HMN’s for improving the performance of a 1-NN classifier by removing instances from the training set (instance selection). We introduce three new HMN-based algorithms for instance selection. HMN-C, which removes instances without affecting accuracy of 1-NN on the original training set, HMN-E, based on a more aggressive storage reduction, and HMN-EI, which applies iteratively HMN-E. Their performance is assessed on 22 data sets with different characteristics, such as input dimension, cardinality, class balance, number of classes, noise content, and presence of redundant variables. Results of experiments on these data sets show that accuracy of 1-NN classifier increases significantly when HMN-EI is applied. Comparison with state-of-the-art editing algorithms for instance selection on these data sets indicates best generalization performance of HMN-EI and no significant difference in storage requirements. In general, these results indicate that HMN’s provide a powerful graph-based representation of a training set, which can be successfully applied for performing noise and redundance reduction in instance-based learning. Keywords: graph-based training set repre

6 0.04671799 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

7 0.045719232 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

8 0.044491101 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees

9 0.036066838 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

10 0.034898244 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

11 0.031395759 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

12 0.028878611 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

13 0.026206851 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents

14 0.026094815 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

15 0.025448631 52 jmlr-2008-Learning from Multiple Sources

16 0.025226412 58 jmlr-2008-Max-margin Classification of Data with Absent Features

17 0.024090337 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

18 0.023851857 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

19 0.023824476 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

20 0.023761859 91 jmlr-2008-Trust Region Newton Method for Logistic Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.159), (1, -0.065), (2, 0.511), (3, -0.161), (4, -0.017), (5, 0.163), (6, -0.367), (7, -0.168), (8, -0.089), (9, 0.036), (10, 0.086), (11, 0.005), (12, -0.047), (13, -0.008), (14, 0.07), (15, 0.044), (16, -0.038), (17, -0.018), (18, -0.034), (19, 0.015), (20, 0.025), (21, 0.005), (22, 0.03), (23, 0.022), (24, -0.036), (25, 0.04), (26, -0.015), (27, -0.012), (28, 0.049), (29, -0.035), (30, 0.023), (31, 0.022), (32, -0.023), (33, -0.019), (34, -0.063), (35, 0.006), (36, -0.036), (37, -0.024), (38, -0.023), (39, 0.039), (40, -0.007), (41, -0.018), (42, -0.027), (43, -0.014), (44, 0.035), (45, 0.008), (46, 0.005), (47, -0.004), (48, -0.029), (49, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97212547 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting

Author: Kristin P. Bennett

Abstract: unkown-abstract

same-paper 2 0.96460474 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

3 0.51143324 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

4 0.20435813 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

Author: Elena Marchiori

Abstract: In supervised learning, a training set consisting of labeled instances is used by a learning algorithm for generating a model (classifier) that is subsequently employed for deciding the class label of new instances (for generalization). Characteristics of the training set, such as presence of noisy instances and size, influence the learning algorithm and affect generalization performance. This paper introduces a new network-based representation of a training set, called hit miss network (HMN), which provides a compact description of the nearest neighbor relation over pairs of instances from each pair of classes. We show that structural properties of HMN’s correspond to properties of training points related to the one nearest neighbor (1-NN) decision rule, such as being border or central point. This motivates us to use HMN’s for improving the performance of a 1-NN classifier by removing instances from the training set (instance selection). We introduce three new HMN-based algorithms for instance selection. HMN-C, which removes instances without affecting accuracy of 1-NN on the original training set, HMN-E, based on a more aggressive storage reduction, and HMN-EI, which applies iteratively HMN-E. Their performance is assessed on 22 data sets with different characteristics, such as input dimension, cardinality, class balance, number of classes, noise content, and presence of redundant variables. Results of experiments on these data sets show that accuracy of 1-NN classifier increases significantly when HMN-EI is applied. Comparison with state-of-the-art editing algorithms for instance selection on these data sets indicates best generalization performance of HMN-EI and no significant difference in storage requirements. In general, these results indicate that HMN’s provide a powerful graph-based representation of a training set, which can be successfully applied for performing noise and redundance reduction in instance-based learning. Keywords: graph-based training set repre

5 0.20316434 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees

Author: Amit Dhurandhar, Alin Dobra

Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees

6 0.19347633 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

7 0.18689981 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

8 0.14807039 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents

9 0.1321819 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

10 0.13171187 52 jmlr-2008-Learning from Multiple Sources

11 0.12624465 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

12 0.12135892 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

13 0.11446998 87 jmlr-2008-Stationary Features and Cat Detection

14 0.11262578 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

15 0.10757318 56 jmlr-2008-Magic Moments for Structured Output Prediction

16 0.10619292 45 jmlr-2008-JNCC2: The Java Implementation Of Naive Credal Classifier 2    (Machine Learning Open Source Software Paper)

17 0.10517643 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

18 0.1040369 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

19 0.10306612 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

20 0.10155126 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.012), (5, 0.013), (31, 0.013), (40, 0.024), (54, 0.424), (58, 0.031), (66, 0.204), (76, 0.013), (88, 0.072), (92, 0.033), (94, 0.037), (99, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95223248 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

2 0.93120986 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments

Author: Balázs Csanád Csáji, László Monostori

Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms

3 0.91293812 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

Author: Vojtěch Franc, Bogdan Savchynskyy

Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines

4 0.71604133 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

Author: Rémi Munos, Csaba Szepesvári

Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control

5 0.62787592 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction

6 0.60481453 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

7 0.56655765 83 jmlr-2008-Robust Submodular Observation Selection

8 0.55958164 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning

9 0.55804884 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

10 0.55708534 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

11 0.55627036 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

12 0.55118102 56 jmlr-2008-Magic Moments for Structured Output Prediction

13 0.54785234 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

14 0.54393309 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression

15 0.53939927 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting

16 0.53403616 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

17 0.53216672 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

18 0.53099138 58 jmlr-2008-Max-margin Classification of Data with Absent Features

19 0.52473527 63 jmlr-2008-Model Selection for Regression with Continuous Kernel Functions Using the Modulus of Continuity    (Special Topic on Model Selection)

20 0.5243389 86 jmlr-2008-SimpleMKL