nips nips2001 nips2001-137 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We give a unified convergence analysis of ensemble learning methods including e. [sent-10, score-0.227]
2 These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. [sent-13, score-0.566]
3 We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. [sent-14, score-0.269]
4 Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning. [sent-15, score-0.201]
5 1 Introduction We show convergence rates of ensemble learning methods such as AdaBoost [10], Logistic Regression (LR) [11, 5] and the Least-Square (LS) regression algorithm called LS-Boost [12]. [sent-16, score-0.419]
6 These algorithms have in common that they iteratively call a base learning algorithm (also called weak learner) on a weighted training sample. [sent-17, score-0.51]
7 The base learner is expected to return, in each iteration, a hypothesis from some set of weak hypotheses that has a small weighted training error. [sent-18, score-1.402]
8 These hypotheses are then linearly combined to form the final hypothesis; in classification one uses the sign of the combined hypothesis for prediction. [sent-20, score-0.546]
9 The hypothesis coefficient is determined at the current iteration such that a certain objective is minimized or approximately minimized, and is then fixed for later iterations. [sent-21, score-0.417]
10 Here we will work out sufficient conditions on the base learning algorithm to achieve linear convergence to the minimum of an associated loss function . [sent-22, score-0.72]
11 Relation to Previous Work In the original work on AdaBoost it has been shown that the optimization objective (which is an upper bound on the training error) converges exponentially fast to zero, if the base learner is consistently better than random guessing, i. [sent-24, score-0.589]
12 In this case the convergence is known to be linear (i. [sent-27, score-0.177]
13 the weighted training error cannot be upper bounded by a constant smaller than 1/2; otherwise one could use AdaBoost to find a separation using the aforementioned convergence result. [sent-38, score-0.327]
14 For AdaBoost and Logistic Regression it has been shown [5] that they generate a combined hypothesis asymptotically minimizing a loss functional only depending on the output of the combined hypothesis . [sent-39, score-1.042]
15 This holds for the non-separable case; however, the assumed conditions in [5] on the performance of the base learner are rather strict and can usually not be satisfied in practice. [sent-40, score-0.522]
16 Although the analysis in [5] holds in principle for any strictly convex cost function of Legendre-type (e. [sent-41, score-0.24]
17 258, and [1]), one needs to show the existence of a so-called auxiliary function [7, 5] for each cost function other than the exponential or the logistic loss. [sent-44, score-0.249]
18 In an earlier attempt to show the convergence of such methods for arbitrary loss functions [17], one needed to assume that the hypothesis coefficients are upper bounded by a rather small constant. [sent-49, score-0.781]
19 For this case it has been shown that the algorithm asymptotically converges to a combined hypothesis minimizing . [sent-50, score-0.54]
20 In [9] it has been shown that for loss functions which are (essentially) exponentially decreasing (including the loss functions of AdaBoost and Logistic regression), the loss is in the first iterations and afterwards . [sent-52, score-0.706]
21 However, this only holds if the loss reaches zero, i. [sent-54, score-0.189]
22 In this work convergence is proven for the general non-separable case, however, only for the exponential loss, i. [sent-60, score-0.25]
23 The framework set up in this paper is more general and we are able to treat any strictly convex loss function. [sent-63, score-0.354]
24 In this paper we propose a family of algorithms that are able to generate a combined hypothesis converging to the minimum of (if it exists), which is a functional depending on the outputs of the function evaluated on the training set. [sent-64, score-0.559]
25 While assuming mild conditions on the base learning algorithm and the loss function , we can show linear convergence rates [15] (beginning in the first iteration) of the type for some fixed . [sent-66, score-0.75]
26 This means that the difference to the minimum loss converges exponentially fast to zero (in the number of iterations). [sent-67, score-0.282]
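To make this concrete, the display below sketches what a linear (geometric) convergence rate of this type amounts to; the symbols are our own notation rather than the paper's (G for the loss functional, α^t for the coefficients after t iterations, α* for a minimizer, η for the fixed constant):

    G(\alpha^{t}) - G(\alpha^{*}) \;\le\; \eta\,\bigl(G(\alpha^{t-1}) - G(\alpha^{*})\bigr),\quad 0<\eta<1
    \qquad\Longrightarrow\qquad
    G(\alpha^{t}) - G(\alpha^{*}) \;\le\; \eta^{t}\,\bigl(G(\alpha^{0}) - G(\alpha^{*})\bigr).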
27 A similar convergence has been proven for AdaBoost in the special case of separable data [10], although the constant shown in [10] can be considerably smaller [see also 9]. [sent-68, score-0.298]
28 To prove the convergence of leveraging, we exploit results of Luo & Tseng [16] for a variant of the Gauss-Southwell method known from numerical optimization. [sent-69, score-0.241]
29 Since in practice the hypothesis set can be quite large, ensemble learning algorithms without any regularization often suffer from overfitting [22, 12, 2, 19]. [sent-70, score-0.467]
30 Here, the complexity can only be controlled by the size of the base hypothesis set or by early stopping after a few iterations. [sent-71, score-0.584]
31 However, it has been shown that shrinkage regularization implied by penalizing some norm of the hypothesis coefficients is the favorable strategy [6, 12]. [sent-72, score-0.444]
32 We therefore extend our analysis to the case of ℓ1-norm regularized loss functions. [sent-73, score-0.275]
33 Relating these results to leveraging algorithms, we present an extension to regularized cost functions in Sec. [sent-82, score-0.634]
34 2 Leveraging algorithms revisited We first briefly review some of the most well known leveraging algorithms for classification and regression. [sent-84, score-0.573]
35 1 as a template for a generic leveraging algorithm, since these algorithms have the same algorithmic structure. [sent-89, score-0.528]
36 In each iteration they call a base learning algorithm on the training set (cf. [sent-92, score-0.406]
37 The base learner is expected to return a hypothesis from some hypothesis space that has a small weighted classification error [10, 11]. [sent-96, score-1.138]
38 For AdaBoost this is (1) and for Logistic Regression it is (2), where the combined hypothesis of the previous iteration is given by . [sent-100, score-0.435]
39 The weighting on the sample is updated based on the new combined hypothesis, for AdaBoost and Logistic Regression respectively. [sent-104, score-0.404]
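As an illustration of these two weighting updates, the following sketch (our own Python, not the paper's notation; y is assumed to hold the ±1 labels, f the outputs of the current combined hypothesis on the training sample, and normalization to a distribution is assumed) computes weights proportional to the derivatives of the exponential and logistic losses, which is the standard reading of the AdaBoost and Logistic Regression updates:

    import numpy as np

    def adaboost_weights(y, f):
        # exponential loss: weight of example n proportional to exp(-y_n f(x_n))
        d = np.exp(-y * f)
        return d / d.sum()

    def logistic_weights(y, f):
        # logistic loss: weight of example n proportional to 1 / (1 + exp(y_n f(x_n)))
        d = 1.0 / (1.0 + np.exp(y * f))
        return d / d.sum()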
40 It first selects a hypothesis solving (3) and then finds the hypothesis weight by minimizing the squared error of the new combined hypothesis (4). The "weighting" of the sample is computed as the residual of the current fit [12]. [sent-108, score-0.704]
41 In a second version of LS-Boost, the base hypothesis and its weight are found simultaneously by solving [12]: (5) Since in (5) one reaches a lower loss function value than with (3) and (4), it might be the favorable strategy. [sent-109, score-0.813]
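Both LS-Boost variants can be sketched as follows (our own Python; H is assumed to be an n-by-K matrix holding the K base hypotheses evaluated on the n training points, which the text never materializes explicitly). Variant (3)/(4) picks the hypothesis most correlated with the residual and then fits its weight by least squares; variant (5) scores each hypothesis by the squared error left after its optimal weight:

    import numpy as np

    def ls_boost_step(H, y, f, joint=False):
        r = y - f                               # residual of the current combined hypothesis
        corr = H.T @ r                          # <h_k, r> for every base hypothesis
        norms = np.einsum('ij,ij->j', H, H)     # <h_k, h_k>
        if joint:
            # minimizing ||r - a*h_k||^2 over a leaves error ||r||^2 - corr_k^2/||h_k||^2,
            # so the jointly best hypothesis maximizes corr_k^2 / ||h_k||^2
            k = int(np.argmax(corr ** 2 / norms))
        else:
            # first select by correlation with the residual ...
            k = int(np.argmax(np.abs(corr)))
        # ... then (in both cases) the least-squares weight of the chosen hypothesis
        alpha = corr[k] / norms[k]
        return k, alpha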
42 4 Notice that the hypothesis set always contains only a finite number of different hypotheses when evaluated on the training set and is effectively finite [2]. [sent-110, score-0.173]
43 Algorithm 1 – A Leveraging algorithm for the loss function . [sent-112, score-0.234]
44 Input: number of iterations and loss function; for each iteration: (a) train the classifier on the weighted sample and obtain a hypothesis, (b) set its coefficient, (c) update the weighting and the combined hypothesis. [sent-117, score-0.318]
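A minimal executable sketch of this generic template is given below (our own Python, not the authors' pseudocode; base_learner, loss and weights_of are assumed user-supplied callables, e.g. a decision-stump learner and the weight functions sketched earlier, and the returned hypothesis h is assumed callable on the sample):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def leveraging(base_learner, loss, weights_of, X, y, T):
        n = len(y)
        f = np.zeros(n)                 # outputs of the combined hypothesis on the sample
        d = np.full(n, 1.0 / n)         # initial uniform weighting
        ensemble = []
        for t in range(T):
            h = base_learner(X, y, d)   # (a) train on the weighted sample, get a hypothesis
            hx = h(X)
            # (b) choose the coefficient by a one-dimensional line search on the loss
            alpha = minimize_scalar(lambda a: loss(y, f + a * hx)).x
            ensemble.append((alpha, h))
            f = f + alpha * hx          # (c) update the combined hypothesis ...
            d = weights_of(y, f)        # ... and the sample weighting
        return ensemble                 # classify with sign(sum_t alpha_t h_t(x))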
45 To avoid confusion, note that hypotheses and coefficients generated during the iterative algorithm are marked by a hat. [sent-131, score-0.186]
46 In the algorithms discussed so far, the optimization takes place by employing the leveraging scheme outlined in Alg. [sent-132, score-0.624]
47 The output of such an algorithm is a sequence of pairs and a combined hypothesis . [sent-134, score-0.417]
48 Other Preliminaries Throughout the paper we assume the loss function is of an additive form; although this assumption is not necessary, it makes the presentation easier. [sent-136, score-0.223]
49 However, note that additive loss functions are commonly used if one considers i. [sent-138, score-0.224]
50 Furthermore, the function is assumed to be strictly convex for all . [sent-143, score-0.194]
51 For regression problems the hypothesis space might be infinite. [sent-149, score-0.465]
52 Then we show how the reviewed leveraging algorithms fit into this optimization framework. [sent-152, score-0.622]
53 1 Coordinate Descent The idea of coordinate descent is to iteratively select a coordinate, say the -th, and find such that some functional is minimized with respect to . [sent-154, score-0.378]
54 15]; however, we are particularly interested in the Gauss-Southwell-type (GS) selection scheme: it selects the coordinate that has the largest absolute value in the gradient vector, i. [sent-157, score-0.331]
55 Instead of doing steps in the direction of the negative gradient as in standard gradient descent methods, one only changes the variable that has the largest gradient component. [sent-160, score-0.28]
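The Gauss-Southwell selection scheme itself is easy to state in code; the sketch below (our own notation; grad and line_min are assumed callables returning the objective's gradient and the exact minimizer along one coordinate with the others held fixed) just repeats "pick the coordinate with the largest absolute gradient entry, then minimize along it":

    import numpy as np

    def gauss_southwell(grad, line_min, alpha0, T):
        alpha = np.asarray(alpha0, dtype=float).copy()
        for t in range(T):
            g = grad(alpha)
            j = int(np.argmax(np.abs(g)))   # Gauss-Southwell: largest gradient component
            alpha[j] = line_min(alpha, j)   # exact minimization along coordinate j only
        return alpha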
56 We start with the following general convergence result, which seems to have fallen into oblivion even in the optimization community. [sent-162, score-0.236]
57 It will be very useful in the analysis of leveraging algorithms. [sent-163, score-0.483]
58 Suppose is twice continuously differentiable and strictly convex on . [sent-166, score-0.193]
59 Let be the sequence generated by coordinate descent, where the coordinate selection satisfies (8) for some , where is the optimal value if it were selected, i. [sent-169, score-0.334]
60 Then a coordinate selection such that (10) or (11) holds satisfies (8); thus the approximate Gauss-Southwell method as described above converges. [sent-180, score-0.192]
61 To show the convergence of the second variant of LS-Boost (cf. [sent-181, score-0.208]
62 Then a coordinate selection satisfies (8) if there exists a fixed with (12). Thus the maximal-improvement scheme as above converges in the sense of Thm. [sent-184, score-0.332]
63 Finally we can also state a rate of convergence, which is surprisingly not worse than the rates for standard gradient descent methods: Theorem 4 (Rate of Convergence of Coordinate Descent, [16]). [sent-186, score-0.174]
64 Then we have (13), where is the estimate after the -th coordinate descent step. [sent-191, score-0.263]
65 Following [16] one can show that the constant is , where is the Lipschitz constant of and is a constant that depends on and therefore on the geometry of the hypothesis set (cf. [sent-193, score-0.411]
66 While the upper bound on can be rather large, making the convergence slow, it is important to note (i) that this is only a rough estimate of the true constant and (ii) that it still guarantees an exponential decrease in the error functional with the number of iterations. [sent-195, score-0.287]
67 2 Leveraging and Coordinate Descent We now return from the abstract convergence results in Sec. [sent-197, score-0.215]
68 Thus, the coordinate with maximal absolute gradient corresponds to the hypothesis with the largest absolute edge (see definition). [sent-206, score-0.598]
69 However, according to Propositions 2 and 3 we need to assume less about the base learner. [sent-207, score-0.3]
70 It either has to return a hypothesis that (approximately) maximizes the edge, or alternatively (approximately) minimizes the loss function. [sent-208, score-0.545]
71 A base learning algorithm is called -optimal if it always returns hypotheses that satisfy either condition (11) or (12) for some fixed . [sent-210, score-0.536]
72 Since we have assumed the hypothesis set is closed under complementation, there always exist two hypotheses having the same absolute gradient ( and ). [sent-211, score-0.274]
73 We therefore only need to consider the hypothesis with maximum edge as opposed to the maximum absolute edge. [sent-212, score-0.405]
74 For classification this means: if the base learner returns the hypothesis with approximately the smallest weighted training error, this condition is satisfied. [sent-213, score-0.898]
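The correspondence between edge and weighted training error used here can be spelled out as follows (a sketch in our own notation: d is a distribution over the sample summing to one, y the ±1 labels, hx the hypothesis outputs). For {-1,+1}-valued hypotheses the weighted error equals (1 - edge)/2, so approximately minimizing the weighted error is the same as approximately maximizing the edge:

    import numpy as np

    def edge(d, y, hx):
        # weighted correlation between labels and hypothesis outputs
        return float(np.sum(d * y * hx))

    def weighted_error(d, y, hx):
        # for hx in {-1,+1}^n and d summing to one this equals (1 - edge(d, y, hx)) / 2
        return float(np.sum(d * (np.sign(hx) != y)))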
75 The loss functions of AdaBoost, Logistic regression and LS-Boost are bounded, strongly convex and fulfill the conditions in Thm. [sent-217, score-0.517]
76 We can finally state the convergence result for leveraging algorithms: Theorem 7. [sent-219, score-0.66]
77 Let be a loss function satisfying the conditions in Thm. [sent-220, score-0.232]
78 Since , the base learner prefers hypotheses with small and could therefore stop improving the objective while not being optimal (see [20, Section 4. [sent-229, score-0.546]
79 It additionally allows for box constraints and a linear function in terms of the hypothesis coefficients. [sent-233, score-0.318]
80 Here, we are particularly interested in ℓ1-norm penalized loss functions of the type , which are frequently used in machine learning. [sent-234, score-0.254]
81 The LASSO algorithm for regression [26] and the PBVM algorithm for classification [25] are examples. [sent-235, score-0.237]
82 Since we assumed closedness under complementation of , we can assume without loss of generality that a solution satisfies . [sent-236, score-0.328]
83 Clearly, the regularization defines a structure of nested subsets of , where the hypothesis set is restricted to a smaller set for larger values of . [sent-238, score-0.372]
84 The constraint causes some minor complications with the assumptions on the base learning algorithm. [sent-239, score-0.266]
85 [21]), while not assuming more about the base learning algorithm. [sent-241, score-0.266]
86 Roughly speaking, this induces the problem that a hypothesis coefficient chosen too large in a previous iteration cannot be reduced again. [sent-244, score-0.318]
87 To solve this problem one can check for each coefficient of a previously selected hypothesis whether not selecting it would violate the -optimality condition (11) or (12). [sent-245, score-0.356]
88 If so, the algorithm selects such a coordinate for the next iteration instead of calling the base learning algorithm. [sent-246, score-0.320]
89 Algorithm 2 – A Leveraging algorithm for the ℓ1-norm regularized loss (algorithm box: in each iteration, (a) train the classifier on the weighted sample and obtain a hypothesis, (b) set its coefficient, (c) update, (d) apply the regularized correction step when needed). [sent-251, score-0.318]
90 Output: the combined hypothesis. [sent-255, score-0.551]
91 1, is strictly convex, , and the base learner satisfies (15) . [sent-260, score-0.467]
92 Then Alg. 2 converges linearly to a minimum of the regularized loss function. [sent-262, score-0.366]
93 This can also be shown for a maximum-improvement-like condition on the base learner, which we have to omit due to space limitations. [sent-263, score-0.333]
94 For this algorithm one can show order-one convergence (which is weaker than linear convergence), which also holds if the hypothesis set is infinite. [sent-265, score-0.585]
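To illustrate how an ℓ1 penalty enters a coordinate-wise update at all, here is a standard LASSO coordinate-descent sketch (our own Python over an explicit finite dictionary H with no all-zero columns and penalty lam; it uses cyclic rather than Gauss-Southwell-type selection and is not Algorithm 2, which works through a base learner and the optimality check described above):

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_coordinate_descent(H, y, lam, T):
        # minimizes 0.5 * ||y - H @ alpha||^2 + lam * ||alpha||_1
        n, p = H.shape
        alpha = np.zeros(p)
        norms = np.einsum('ij,ij->j', H, H)
        r = y - H @ alpha                    # residual, kept up to date
        for _ in range(T):
            for j in range(p):
                r = r + H[:, j] * alpha[j]   # remove coordinate j's contribution
                alpha[j] = soft_threshold(H[:, j] @ r, lam) / norms[j]
                r = r - H[:, j] * alpha[j]   # put the refitted contribution back
        return alpha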
95 5 Conclusion We gave a unifying convergence analysis for a fairly general family of leveraging methods. [sent-266, score-0.694]
96 These convergence results were obtained under rather mild assumptions on the base learner and, additionally, led to linear convergence rates. [sent-267, score-0.789]
97 This was achieved by relating leveraging algorithms to the Gauss-Southwell method known from numerical optimization. [sent-268, score-0.561]
98 Future investigations include the generalization to infinite hypothesis spaces and an improvement of the convergence rate . [sent-270, score-0.318]
99 On the convergence of the coordinate descent method for convex differentiable minimization. [sent-396, score-0.571]
100 Sparse regression ensembles in infinite and finite hypothesis spaces. [sent-436, score-0.465]
wordName wordTfidf (topN-words)
[('leveraging', 0.483), ('adaboost', 0.337), ('hypothesis', 0.318), ('base', 0.266), ('loss', 0.189), ('logistic', 0.184), ('convergence', 0.177), ('regression', 0.147), ('coordinate', 0.142), ('hypotheses', 0.141), ('learner', 0.139), ('tsch', 0.124), ('descent', 0.121), ('convex', 0.103), ('regularized', 0.086), ('satis', 0.083), ('complementation', 0.076), ('classi', 0.071), ('coef', 0.065), ('iteration', 0.063), ('strictly', 0.062), ('weighted', 0.059), ('optimization', 0.059), ('converges', 0.058), ('proposition', 0.056), ('combined', 0.054), ('regularization', 0.054), ('bregman', 0.053), ('gradient', 0.053), ('cation', 0.053), ('separable', 0.052), ('demiriz', 0.051), ('luo', 0.051), ('absolute', 0.051), ('mika', 0.05), ('selection', 0.05), ('nite', 0.05), ('ensemble', 0.05), ('xed', 0.047), ('returns', 0.046), ('algorithm', 0.045), ('algorithms', 0.045), ('holds', 0.045), ('exists', 0.045), ('leveraged', 0.044), ('lasso', 0.044), ('functional', 0.044), ('conditions', 0.043), ('nally', 0.042), ('boosting', 0.04), ('lr', 0.04), ('duffy', 0.04), ('gs', 0.04), ('favorable', 0.04), ('proven', 0.038), ('condition', 0.038), ('pages', 0.038), ('return', 0.038), ('della', 0.038), ('onoda', 0.038), ('pietra', 0.038), ('scheme', 0.037), ('minimized', 0.036), ('edge', 0.036), ('selects', 0.035), ('iteratively', 0.035), ('fraunhofer', 0.035), ('manfred', 0.035), ('reviewed', 0.035), ('october', 0.035), ('functions', 0.035), ('exponential', 0.035), ('exponentially', 0.035), ('iterations', 0.034), ('es', 0.034), ('family', 0.034), ('assume', 0.034), ('minimizing', 0.033), ('linearly', 0.033), ('numerical', 0.033), ('asymptotically', 0.032), ('weighting', 0.032), ('march', 0.032), ('converging', 0.032), ('shrinkage', 0.032), ('training', 0.032), ('constant', 0.031), ('variant', 0.031), ('cost', 0.03), ('theorem', 0.03), ('cients', 0.03), ('penalized', 0.03), ('mild', 0.03), ('omit', 0.029), ('brie', 0.029), ('assumed', 0.029), ('weak', 0.028), ('bounded', 0.028), ('differentiable', 0.028), ('ller', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.
2 0.26016915 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
3 0.21902612 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.
4 0.20442632 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
5 0.20257488 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
6 0.19932057 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
7 0.19734672 167 nips-2001-Semi-supervised MarginBoost
8 0.14977118 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
9 0.14562941 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
10 0.11038642 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
11 0.1000285 62 nips-2001-Duality, Geometry, and Support Vector Regression
12 0.098747432 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
13 0.098609798 31 nips-2001-Algorithmic Luckiness
14 0.097221881 22 nips-2001-A kernel method for multi-labelled classification
15 0.089609399 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
16 0.089386478 105 nips-2001-Kernel Machines and Boolean Functions
17 0.087572277 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
18 0.084035069 13 nips-2001-A Natural Policy Gradient
19 0.080949754 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
20 0.076745406 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning
topicId topicWeight
[(0, -0.26), (1, 0.126), (2, 0.048), (3, 0.31), (4, 0.13), (5, -0.133), (6, 0.032), (7, 0.057), (8, 0.063), (9, -0.133), (10, 0.031), (11, 0.121), (12, -0.002), (13, 0.026), (14, -0.052), (15, -0.126), (16, -0.044), (17, -0.036), (18, -0.045), (19, 0.196), (20, 0.18), (21, -0.026), (22, -0.074), (23, 0.107), (24, -0.012), (25, 0.062), (26, -0.011), (27, -0.091), (28, -0.077), (29, 0.086), (30, 0.023), (31, -0.075), (32, -0.034), (33, -0.042), (34, 0.009), (35, 0.032), (36, 0.008), (37, 0.019), (38, -0.008), (39, -0.07), (40, 0.112), (41, 0.032), (42, 0.005), (43, 0.007), (44, 0.011), (45, 0.047), (46, 0.119), (47, -0.056), (48, 0.133), (49, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.95715296 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.
2 0.79533511 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
3 0.6501677 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
4 0.64741594 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.
5 0.62327409 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile
Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.
6 0.57714486 167 nips-2001-Semi-supervised MarginBoost
7 0.5505814 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
8 0.51056778 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
10 0.46736005 31 nips-2001-Algorithmic Luckiness
11 0.43791789 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
12 0.38413954 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
13 0.36618641 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
14 0.36022535 22 nips-2001-A kernel method for multi-labelled classification
15 0.36000288 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
16 0.35437447 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
17 0.34121421 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
18 0.32542604 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
19 0.31427035 180 nips-2001-The Concave-Convex Procedure (CCCP)
20 0.31305194 105 nips-2001-Kernel Machines and Boolean Functions
topicId topicWeight
[(14, 0.109), (17, 0.034), (19, 0.053), (27, 0.18), (30, 0.063), (36, 0.014), (37, 0.166), (38, 0.011), (59, 0.024), (72, 0.06), (79, 0.049), (83, 0.037), (88, 0.022), (91, 0.1)]
simIndex simValue paperId paperTitle
same-paper 1 0.9082132 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.
2 0.80796832 142 nips-2001-Orientational and Geometric Determinants of Place and Head-direction
Author: Neil Burgess, Tom Hartley
Abstract: We present a model of the firing of place and head-direction cells in rat hippocampus. The model can predict the response of individual cells and populations to parametric manipulations of both geometric (e.g. O'Keefe & Burgess, 1996) and orientational (Fenton et al., 2000a) cues, extending a previous geometric model (Hartley et al., 2000). It provides a functional description of how these cells' spatial responses are derived from the rat's environment and makes easily testable quantitative predictions. Consideration of the phenomenon of remapping (Muller & Kubie, 1987; Bostock et al., 1991) indicates that the model may also be consistent with nonparametric changes in firing, and provides constraints for its future development. 1
3 0.80021489 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
4 0.79135293 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile
Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.
5 0.78965896 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
6 0.78918678 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
7 0.78789115 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
8 0.78778416 134 nips-2001-On Kernel-Target Alignment
9 0.78390437 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
10 0.78312445 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
11 0.77773798 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
12 0.77678823 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
13 0.77544016 139 nips-2001-Online Learning with Kernels
14 0.7714082 58 nips-2001-Covariance Kernels from Bayesian Generative Models
15 0.77126104 190 nips-2001-Thin Junction Trees
16 0.76593494 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
17 0.76287091 31 nips-2001-Algorithmic Luckiness
18 0.761262 114 nips-2001-Learning from Infinite Data in Finite Time
19 0.75871587 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
20 0.75800329 60 nips-2001-Discriminative Direction for Kernel Classifiers