jmlr jmlr2008 jmlr2008-82 knowledge-graph by maker-knowledge-mining

82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting


Source: pdf

Author: Kristin P. Bennett

Abstract: unknown-abstract

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We especially appreciate the fact that the authors have supplied R code for their examples, as this allows the reader to understand and assess their ideas. [sent-6, score-0.068]

2 The paper inspired us to re-visit many of these issues underlying boosting methods. [sent-7, score-0.293]

3 However in the end we do not believe that the examples provided in the paper contradict our statistical view, although other views may well prove informative. [sent-8, score-0.062]

4 , 2001) argue that boosting methods have three important properties that contribute to their success: [sent-12, score-0.325]

5 (1) they fit an additive model in a flexible set of basis functions; [sent-13, score-0.079]

6 (2) they use a suitable loss function for the fitting process; [sent-14, score-0.041]

7 (3) they regularize by forward stagewise fitting; with shrinkage this mimics an L1 (lasso) penalty on the weights (see the sketch below). [sent-15, score-0.507]
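As a concrete illustration of these three ingredients, here is a minimal Python sketch (written for this summary; it is not the R code supplied by the authors, and the function names stagewise_boost and predict_boost are ours): it fits an additive model of regression stumps by forward stagewise steps under squared-error loss, taking only a shrunken step of size nu at each iteration.

```python
# Minimal sketch of forward stagewise boosting with shrinkage (our illustration,
# not the authors' R code).  Base learners are depth-1 regression trees (stumps),
# the loss is squared error, and nu is the shrinkage factor; with small nu the
# coefficient paths roughly mimic an L1 (lasso) penalty on the basis weights.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stagewise_boost(X, y, n_trees=500, nu=0.1):
    """Fit F(x) = sum_m nu * f_m(x) by forward stagewise steps."""
    F = np.zeros(len(y))           # current additive fit
    trees = []
    for _ in range(n_trees):
        residual = y - F           # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        F += nu * tree.predict(X)  # shrunken update: only a fraction of the step
        trees.append(tree)
    return trees

def predict_boost(trees, X, nu=0.1):
    return nu * sum(tree.predict(X) for tree in trees)
```

The same skeleton, with a different loss and a real-valued margin fit, underlies gradient boosting; here it only serves to make the additive-model / loss / stagewise-with-shrinkage decomposition concrete.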

8 In many cases the paper ascribes to this statistical view consequences that do not in fact follow from it. [sent-16, score-0.062]

9 For example, it does not follow that smaller trees are necessarily better than larger ones for noisier problems (Sections 3. [sent-17, score-0.202]

10 2), that the basis should necessarily be restricted as described in Sections 3. [sent-19, score-0.034]

11 6, or that regularization should be based on the loss function used for fitting (Sections 3. [sent-21, score-0.081]

12 To the extent possible, model selection should be based on the ultimate loss associated with the application. [sent-24, score-0.109]

13 Also, there is no requirement that test error have a unique minimum as a function of the number of included terms (Sections 3. [sent-25, score-0.077]

14 However, to the extent that these are commonly held beliefs, the paper provides a valuable service by pointing out that they need not hold in all applications. [sent-28, score-0.11]

15 There is no direct relation between the application of shrinkage and overfitting (Sections 3. [sent-29, score-0.211]

16 Heavy shrinkage emulates L1 regularization, whereas its absence corresponds to stagewise fitting. [sent-32, score-0.357]
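To make the contrast explicit, the shrunken stagewise update can be written as follows (our notation; a sketch consistent with the discussion above, not a quotation from the paper):

```latex
% Shrunken forward stagewise update at step m, with base learner f_m and
% shrinkage factor nu:
\[
  F_m(x) \;=\; F_{m-1}(x) \;+\; \nu\, f_m(x), \qquad 0 < \nu \le 1 .
\]
% Heavy shrinkage (nu close to 0, with the number of steps grown accordingly)
% traces coefficient paths close to the L1 (lasso) regularization path, while
% nu = 1 is unshrunken stagewise fitting, closer in spirit to an L0-style
% subset-selection strategy.
```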

17 There is nothing in the statistical view that requires L1 to be superior to L0 in every application, although this is often the case. [sent-34, score-0.096]

18 The best regularizer depends on the problem: namely the nature of the true target function, the particular basis used, signal-to-noise ratio, and sample size. [sent-35, score-0.089]

19 Finally, there is nothing in our statistical interpretation suggesting that boosting is similar to one nearest neighbor classification (Sections 3. [sent-36, score-0.327]

20 Nonetheless, the paper does provide some interesting examples that appear to contradict the statistical interpretation. [sent-39, score-0.062]

21 However, these examples may have been carefully chosen, and the effects seem to vanish under various perturbations of the problem. [sent-40, score-0.083]

22 Figure 1 (legend: stump, 8-node; x-axis: Number of Trees): Average test misclassification error for 20 replications of Mease and Wyner’s example used in their Figure 1. [sent-55, score-0.474]

23 The left panel shows that 8-node trees outperform stumps. [sent-57, score-0.632]

24 The right panel shows that stumps with shrinkage win handily. [sent-58, score-0.875]

25 The left panel of Figure 1 shows a version of the paper’s Figure 1. [sent-59, score-0.407]

26 We see that boosting with 8-node trees seems to outperform stumps, despite the fact that the generative model is additive in the predictor variables. [sent-60, score-0.716]

27 However, the right panel shows what happens to both stumps and 8-node trees when shrinkage is applied. [sent-61, score-1.037]

28 Here shrinkage helps in both cases, and we see that stumps with shrinkage work the best of all. [sent-62, score-0.641]
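A hedged sketch of this Figure-1-style comparison can be run in scikit-learn (Python) rather than the paper's R code; the simulation below is a generic additive-in-the-predictors stand-in, not Mease and Wyner's exact generative model, so only the qualitative comparison is meaningful.

```python
# Hedged sketch: boosted stumps vs. larger (up to 8-terminal-node) trees, with
# and without shrinkage, on a generic additive-logit simulation (NOT Mease and
# Wyner's exact model) using scikit-learn's gradient boosting.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_train, n_test, p = 200, 2000, 10
X = rng.uniform(size=(n_train + n_test, p))
p_true = 1.0 / (1.0 + np.exp(-4.0 * (X[:, :5].sum(axis=1) - 2.5)))  # additive logit
y = rng.binomial(1, p_true)
Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

configs = {
    "stump":         dict(max_depth=1, learning_rate=1.0),  # no shrinkage
    "8-node":        dict(max_depth=3, learning_rate=1.0),  # up to 8 terminal nodes
    "stump shrunk":  dict(max_depth=1, learning_rate=0.1),
    "8-node shrunk": dict(max_depth=3, learning_rate=0.1),
}
for name, kw in configs.items():
    clf = GradientBoostingClassifier(loss="exponential",    # AdaBoost-style loss
                                     n_estimators=1000, **kw).fit(Xtr, ytr)
    err = np.mean(clf.predict(Xte) != yte)
    print(f"{name:13s}  test misclassification error: {err:.3f}")
```

Averaging such runs over replications would produce curves of the kind shown in Figure 1, although the ordering of the unshrunken fits need not reproduce the paper's simulation exactly.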

29 [Figure 2 plot labels: Test Mean Absolute Error; legend: stump, 8-node, stump shrunk.] [sent-71, score-1.005]

30 Figure 2 (panel title: Estimated Probabilities; x-axis: Number of Trees): Left panel: average absolute deviations of the fitted probabilities from the true probabilities for the same simulations as in Figure 1. [sent-78, score-0.221]

31 Right panel: average test misclassification error for the same simulations as in Figure 1, except using 2000 rather than 200 training examples. [sent-79, score-0.126]

32 We are not sure why unshrunken 8-node trees outperform unshrunken stumps in this example. [sent-80, score-0.882]

33 As in the paper, we speculate that the extra splits in the 8-node tree might act as a type of regularizer, and hence they help avoid the overfitting displayed by unshrunken stumps in this example. [sent-81, score-0.515]

34 However this explanation becomes less convincing and indeed the effect itself seems to fade when we look more deeply. [sent-83, score-0.073]

35 Figure 2 [left panel] shows the average absolute error in the estimated probabilities, while Figure 2 [right panel] shows what happens when we increase the sample size to 2000. [sent-84, score-0.106]

36 In Figure 3 [left panel] we use the Bernoulli loss rather than the exponential loss of AdaBoost, and Figure 3 [right panel] shows results for the regression version of this problem. [sent-85, score-0.041]
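The loss swaps just described have a direct analogue in the same scikit-learn sketch (again a hedged stand-in, not the paper's R code); the variables Xtr, ytr, Xte, yte, p_true and n_train are reused from the previous sketch.

```python
# Hedged sketch of the loss swaps discussed above, reusing the data from the
# previous sketch (not the paper's R code).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Bernoulli (binomial deviance) loss in place of AdaBoost's exponential loss.
bern = GradientBoostingClassifier(loss="log_loss",      # "deviance" in older scikit-learn
                                  max_depth=1, learning_rate=0.1,
                                  n_estimators=1000).fit(Xtr, ytr)

# Regression version of the problem: squared-error loss on a numeric response.
reg = GradientBoostingRegressor(loss="squared_error",   # "ls" in older scikit-learn
                                max_depth=1, learning_rate=0.1,
                                n_estimators=1000).fit(Xtr, ytr.astype(float))

# Average absolute deviation of the fitted probabilities from the truth
# (the quantity plotted in the left panel of Figure 2), on the test set.
p_hat = bern.predict_proba(Xte)[:, 1]
print("mean |p_hat - p_true|:", np.abs(p_hat - p_true[n_train:]).mean())
```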

37 In every case, the effect noted by the authors goes away and both the correct bases and shrinkage help performance. [sent-86, score-0.211]

38 Thus the effect illustrated by the authors is hard to explain, and seems to hold only for misclassification error. [sent-88, score-0.039]

39 It depends on a very carefully chosen set of circumstances. [sent-89, score-0.044]

40 Looking at the right panel of Figure 1, which method would anyone choose? [sent-91, score-0.452]

41 Clearly, shrunken stumps work best here, just as might be expected from the statistical view. [sent-92, score-0.342]

42 Figure 3 (panel title: Bernoulli Loss; x-axes: Number of Trees, and Number of Trees on a log scale): Left panel: test misclassification error when boosting with Bernoulli loss for the same simulations as in Figure 1. [sent-103, score-0.46]

43 Right panel: root mean-squared test error when boosting with squared-error loss for the same simulations as in Figure 1 (legend as in left panel). [sent-104, score-0.501]

44 Figure 4 shows the fitted probabilities over the 20 runs, separately for each class, when using 250 shrunken stumps. [sent-105, score-0.189]

45 This is an appropriate tradeoff curve if we are interested in probabilities; test deviance would also be fine. [sent-107, score-0.08]
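For concreteness, test deviance here refers to the binomial (Bernoulli) deviance evaluated on test data; with labels y_i in {0, 1} and fitted probabilities p̂(x_i), a standard form (our notation, not a formula quoted from the paper) is

```latex
\[
  \mathrm{Dev} \;=\; -2 \sum_{i=1}^{n}
  \Big[\, y_i \log \hat{p}(x_i) \;+\; (1 - y_i) \log\big(1 - \hat{p}(x_i)\big) \Big].
\]
```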

46 A similar argument can be made concerning the paper’s Figure 3. [sent-111, score-0.031]

47 But using the statistical view of boosting, we have moved on and developed better methods like gradient boosting (Friedman, 2001) that typically outperform both of these methods. [sent-113, score-0.419]

48 (2007) add further support to point (3) of the statistical interpretation of boosting: they show that the incremental forward stagewise procedure used in boosting (with shrinkage) optimizes a criterion similar to, but smoother than, the L1 penalized loss. [sent-115, score-0.515]
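In symbols (our notation; a sketch of the comparison, not a statement of that paper's theorem), the L1-penalized criterion over a dictionary of basis functions h_j, and the incremental stagewise procedure it is compared with, look as follows:

```latex
% L1 (lasso) penalized boosting criterion over basis functions h_j:
\[
  \min_{\beta}\; \sum_{i=1}^{n} L\!\Big( y_i,\; \sum_{j} \beta_j h_j(x_i) \Big)
  \;+\; \lambda \sum_{j} |\beta_j| .
\]
% Incremental forward stagewise instead takes tiny steps of size eps: at each
% iteration it picks the basis function most aligned with the current negative
% gradient and moves its coefficient by eps in that direction.  As eps -> 0 the
% resulting coefficient path is close to a smoothed, monotone version of the
% lasso path, which is the sense of "similar to but smoother than" above.
```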

49 Conclusion: No theory, at least initially, can fully explain every observed phenomenon. [sent-117, score-0.043]

50 There is still considerable ongoing research in the literature concerning the interplay between the target function, basis used, and regularization method. [sent-119, score-0.171]

51 Hopefully, some of the apparent anomalies illustrated in this paper will eventually be explained. [Figure 4: panel title Probability Estimates at 250 Trees; y-axis: Density; panel label: Y=0.] [sent-120, score-0.041]

52 Figure 4 (x-axis: Probability estimate): Fitted probabilities shown for the two classes, at 250 shrunken stumps. [sent-132, score-0.189]

53 The vertical black bars are the target probabilities for this problem, and the green bars are the median of the estimates in each class. [sent-133, score-0.14]

54 The paper provides a service in reminding us that there is still work remaining. [sent-135, score-0.073]

55 Although we would not begin to suggest that our statistical view of boosting has anywhere near the substance or importance of Darwin’s theory of evolution, the latter provides a useful analogy. [sent-136, score-0.392]

56 The proponents of Intelligent Design point out that the theory of evolution does not seem to explain certain observed biological phenomena. [sent-137, score-0.112]

57 And therefore they argue that evolution must be wrong despite the fact that it does explain an overwhelming majority of observed phenomena, and without offering an alternative testable theory. [sent-138, score-0.24]

58 We are sure that the authors will mount counter-arguments to our remarks, and due to the (standard format) of this discussion, they will have the last word. [sent-139, score-0.034]

59 We look forward to constructive counter-arguments and alternative explanations for the success of boosting methods that can be used to extend their application and produce methods that perform better in practice (as in the right panel of Figure 1). [sent-140, score-0.81]

60 Additive logistic regression: A statistical view of boosting (with discussion). [sent-149, score-0.355]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('stump', 0.368), ('panel', 0.366), ('boosting', 0.293), ('shrunk', 0.269), ('stumps', 0.219), ('shrinkage', 0.211), ('misclassification', 0.164), ('trees', 0.161), ('stagewise', 0.146), ('astie', 0.145), ('ibshirani', 0.145), ('riedman', 0.145), ('unshrunken', 0.145), ('stanford', 0.125), ('shrunken', 0.123), ('hastie', 0.121), ('node', 0.114), ('esponse', 0.096), ('adaboost', 0.088), ('iew', 0.082), ('mease', 0.082), ('ontrary', 0.082), ('vidence', 0.082), ('wyner', 0.082), ('yner', 0.082), ('friedman', 0.08), ('forward', 0.076), ('tatistical', 0.073), ('oosting', 0.073), ('service', 0.073), ('jerome', 0.073), ('misclassi', 0.073), ('evolution', 0.069), ('probabilities', 0.066), ('outperform', 0.064), ('contradict', 0.062), ('view', 0.062), ('tting', 0.062), ('bernoulli', 0.061), ('trevor', 0.059), ('electronic', 0.059), ('regularizer', 0.055), ('tibshirani', 0.051), ('simulations', 0.049), ('ease', 0.046), ('test', 0.046), ('additive', 0.045), ('right', 0.045), ('carefully', 0.044), ('tted', 0.044), ('explain', 0.043), ('robert', 0.042), ('loss', 0.041), ('left', 0.041), ('gbm', 0.041), ('fitted', 0.041), ('noisier', 0.041), ('anyone', 0.041), ('anomalies', 0.041), ('regularization', 0.04), ('absolute', 0.04), ('seems', 0.039), ('logitboost', 0.037), ('book', 0.037), ('mimics', 0.037), ('regularize', 0.037), ('interplay', 0.037), ('anywhere', 0.037), ('speculate', 0.037), ('extent', 0.037), ('bars', 0.037), ('sections', 0.036), ('happens', 0.035), ('basis', 0.034), ('sure', 0.034), ('nothing', 0.034), ('convincing', 0.034), ('testable', 0.034), ('deviance', 0.034), ('appreciate', 0.034), ('darwin', 0.034), ('win', 0.034), ('supplied', 0.034), ('hopefully', 0.034), ('argue', 0.032), ('error', 0.031), ('legend', 0.031), ('yes', 0.031), ('ultimate', 0.031), ('overwhelming', 0.031), ('beliefs', 0.031), ('concerning', 0.031), ('wrong', 0.031), ('figure', 0.031), ('success', 0.03), ('replications', 0.029), ('yoav', 0.029), ('verlag', 0.029), ('bagging', 0.029), ('ongoing', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting

Author: Kristin P. Bennett

Abstract: unknown-abstract

2 0.3562879 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

3 0.27920818 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

4 0.056443024 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter

Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values

5 0.035186525 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

Author: Suhrid Balakrishnan, David Madigan

Abstract: Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multipass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function. Keywords: Laplace approximation, expectation propagation, LASSO

6 0.033678304 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees

7 0.032893121 52 jmlr-2008-Learning from Multiple Sources

8 0.029793758 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

9 0.028427385 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

10 0.027700452 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

11 0.02706857 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

12 0.025296219 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

13 0.024625858 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction

14 0.024524508 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

15 0.024359953 58 jmlr-2008-Max-margin Classification of Data with Absent Features

16 0.022970131 56 jmlr-2008-Magic Moments for Structured Output Prediction

17 0.022482894 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

18 0.022025431 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

19 0.020844743 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

20 0.020391766 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.154), (1, -0.062), (2, 0.551), (3, -0.184), (4, 0.005), (5, 0.176), (6, -0.436), (7, -0.148), (8, -0.096), (9, 0.065), (10, 0.057), (11, -0.007), (12, -0.024), (13, -0.003), (14, 0.101), (15, 0.044), (16, -0.076), (17, 0.006), (18, -0.015), (19, -0.005), (20, 0.04), (21, -0.002), (22, 0.025), (23, -0.018), (24, -0.046), (25, 0.019), (26, -0.028), (27, -0.016), (28, 0.04), (29, -0.012), (30, -0.001), (31, 0.037), (32, -0.032), (33, 0.008), (34, -0.013), (35, 0.025), (36, -0.015), (37, -0.015), (38, 0.026), (39, 0.006), (40, -0.018), (41, 0.003), (42, -0.036), (43, -0.016), (44, 0.01), (45, 0.006), (46, -0.004), (47, -0.029), (48, -0.015), (49, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98091906 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting

Author: Kristin P. Bennett

Abstract: unknown-abstract

2 0.92248362 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

3 0.55243826 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

Author: Hsuan-Tien Lin, Ling Li

Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel

4 0.17337668 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter

Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values

5 0.150489 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees

Author: Amit Dhurandhar, Alin Dobra

Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees

6 0.12976773 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

7 0.12601537 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction

8 0.11164168 56 jmlr-2008-Magic Moments for Structured Output Prediction

9 0.1088521 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

10 0.10250665 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

11 0.099103749 52 jmlr-2008-Learning from Multiple Sources

12 0.097001903 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

13 0.093675576 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

14 0.091481067 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

15 0.09134008 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents

16 0.088512301 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

17 0.086762644 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks

18 0.083897904 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data

19 0.082345702 86 jmlr-2008-SimpleMKL

20 0.081952356 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(54, 0.054), (58, 0.018), (66, 0.747), (88, 0.017), (92, 0.023), (94, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99150175 82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting

Author: Kristin P. Bennett

Abstract: unkown-abstract

2 0.95076215 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont

Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added ℓ1-norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive ℓ1-norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data

3 0.92866838 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny

Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say β̂_1, . . . , β̂_L, nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem: minimize max_{ℓ=1,...,L} β̂_ℓ^⊤ (I − A) β̂_ℓ subject to A ∈ A_{m∗}, where A_{m∗} is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal to m∗, with m∗ being the true structural dimension. Under mild assumptions, √n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors β̂ for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace

4 0.60017824 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning

Author: Francis R. Bach

Abstract: We consider the least-square regression problem with regularization by a block ℓ1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators

5 0.54290599 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

6 0.49872553 26 jmlr-2008-Consistency of Trace Norm Minimization

7 0.45744625 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data

8 0.44004449 83 jmlr-2008-Robust Submodular Observation Selection

9 0.43424764 57 jmlr-2008-Manifold Learning: The Price of Normalization

10 0.43163922 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks

11 0.42276224 58 jmlr-2008-Max-margin Classification of Data with Absent Features

12 0.41969365 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

13 0.41957432 56 jmlr-2008-Magic Moments for Structured Output Prediction

14 0.41688994 63 jmlr-2008-Model Selection for Regression with Continuous Kernel Functions Using the Modulus of Continuity    (Special Topic on Model Selection)

15 0.41306719 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

16 0.41214734 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

17 0.40716097 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

18 0.39377478 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

19 0.39137104 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

20 0.39111331 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration