nips nips2001 nips2001-8 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. [sent-5, score-0.424]
2 In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. [sent-6, score-0.772]
3 We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases. [sent-7, score-0.45]
4 1 Introduction The goal of machine learning is to obtain a certain input/output functional relationship from a set of training examples. [sent-8, score-0.19]
5 In order to do so, we need to start with a model of the functional relationship. [sent-9, score-0.125]
6 In practice, it is often desirable to find the simplest model that can explain the data. [sent-10, score-0.078]
7 In addition, the philosophy of Occam’s Razor implies that the simplest solution is likely to be the best solution among all possible solutions. In this paper, we are interested in composite models that can be expressed as linear combinations of basic models. [sent-12, score-0.505]
8 In this framework, it is natural to measure the simplicity of a composite model by the number of its basic model components. [sent-13, score-0.269]
9 Since a composite model in our framework corresponds to a linear weight over the basic model space, our measure of model simplicity corresponds to the sparsity of the linear weight representation. [sent-14, score-0.593]
10 In this paper, we are interested in achieving sparsity through a greedy optimization algorithm which we propose in the next section. [sent-15, score-0.56]
11 This algorithm is closely related to a number of previous works. [sent-16, score-0.195]
12 The basic idea originated in [5], where Jones observed that if a target vector in a Hilbert space is a convex combination of a library of basic vectors, then using greedy approximation one can achieve an error rate of O(1/k) with k basic library vectors. [sent-17, score-1.049]
13 The idea has been refined in [1] to analyze the approximation property of sigmoidal functions including neural networks. [sent-18, score-0.249]
14 The above methods can be regarded as greedy sparse algorithms for functional approximation, which is the noise-free case of regression problems. [sent-19, score-0.982]
15 A similar greedy algorithm can also be used to solve general regression problems under noisy conditions [6]. [sent-20, score-0.737]
16 In addition to regression, greedy approximation can also be applied to classification problems. [sent-21, score-0.512]
17 The resulting algorithm is closely related to boosting [2] under the additive model point of view [3]. [sent-22, score-0.568]
18 This paper shows how to generalize the method in [5, 1] for analyzing greedy algorithms (in their case, for functional approximation problems) and apply it to boosting. [sent-23, score-0.637]
19 Our method can also be used to obtain sparse kernel representations for regression problems. [sent-25, score-0.528]
20 Such a sparse representation is what support vector regression machines try to achieve. [sent-26, score-0.449]
21 In this regard, the method given in this paper complements some recently proposed greedy kernel methods for Gaussian processes such as [9, 10]. [sent-27, score-0.425]
22 The proposed greedy approximation method can also be applied to other prediction problems with different loss functions. [sent-28, score-0.739]
23 For example, in density estimation, the goal is to find a model that has the smallest negative log-likelihood. [sent-29, score-0.13]
24 Similar approximation bounds can be directly obtained under the general framework proposed in this paper. [sent-31, score-0.322]
25 Section 2 formalizes the general class of problems considered in this paper, and proposes a greedy algorithm to solve the formulation. [sent-33, score-0.546]
26 The convergence rate of the algorithm is investigated in Section 3. [sent-34, score-0.202]
27 2 General Algorithm In machine learning, our goal is often to predict an unobserved output value based on an observed input vector. [sent-37, score-0.078]
28 This requires us to estimate a functional relationship from a set of example input/output pairs. [sent-38, score-0.125]
29 Usually the quality of the predictor can be measured by a loss function that is problem dependent. [sent-39, score-0.286]
30 This family of models can be regarded as additive models in statistics [4]. [sent-41, score-0.266]
31 Formally, each basic model can be regarded as a vector in a linear functional space. [sent-42, score-0.358]
32 Our problem in its most general form can thus be described as finding a vector in the convex hull of the basic models that minimizes a functional measuring the quality of that vector. [sent-43, score-0.642]
33 This functional plays the role of the loss function for learning problems. [sent-44, score-0.311]
34 More formally, we consider a linear vector space containing a set of basic vectors, together with its convex hull, i.e., the set of convex combinations of basic vectors (non-negative coefficients summing to one); we use N to denote the set of positive integers. [sent-47, score-0.377]
35 We consider the following optimization problem on this convex hull; in this paper, we assume that the objective is a differentiable convex function on it. We propose the following algorithm to approximately solve (1). [sent-48, score-0.401]
36 Problem (1) is to minimize the convex objective over the convex hull of the basic vectors [displayed equation omitted]. Algorithm 2.1 proceeds greedily: at each step, find a basic vector and a step size that minimize the objective when mixed with the current iterate, and let the new iterate be that convex combination [pseudocode omitted]. For simplicity, we assume that the minimization in each step of Algorithm 2.1 is performed exactly. [sent-51, score-0.046]
37 However, due to the space limitation, we shall not consider this generalization. [sent-54, score-0.056]
38 For convenience, we introduce the following quantity [definition omitted]. [sent-55, score-0.046]
39 In the next section, we show that under appropriate regularity conditions the objective value of the iterate converges to the optimum over the convex hull as the number of steps grows, where the iterate is computed from Algorithm 2.1. [sent-57, score-0.115]
40 In addition, the convergence rate can be bounded as O(1/k), where k is the number of greedy steps. [sent-59, score-0.156]
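For concreteness, here is a minimal Python sketch of a greedy scheme of this kind. It is not the paper's exact pseudocode (whose displayed form is omitted above): it keeps an explicit convex combination of basic vectors and, at each step, searches a finite dictionary and a grid of step sizes for the pair that most decreases a differentiable convex objective. The finite dictionary, the grid line search, and the squared-loss demonstration at the end are illustrative assumptions.

```python
import numpy as np

def greedy_convex_min(loss, dictionary, n_steps, step_grid=np.linspace(0.0, 1.0, 51)):
    """Greedily minimize a convex loss over the convex hull of the rows of `dictionary`.

    loss       : callable mapping a vector to a scalar (assumed convex and differentiable)
    dictionary : (m, d) array whose rows are the basic vectors
    n_steps    : number of greedy steps; the result uses at most n_steps + 1 basic vectors
    step_grid  : candidate step sizes in [0, 1] (a grid line search, for simplicity)
    """
    m, _ = dictionary.shape
    v = dictionary[0].copy()            # start from an arbitrary basic vector
    weights = np.zeros(m)
    weights[0] = 1.0
    for _ in range(n_steps):
        best = None
        for j in range(m):              # search over basic vectors
            g = dictionary[j]
            for alpha in step_grid:     # search over step sizes
                val = loss((1.0 - alpha) * v + alpha * g)
                if best is None or val < best[0]:
                    best = (val, j, alpha)
        _, j, alpha = best
        v = (1.0 - alpha) * v + alpha * dictionary[j]
        weights *= (1.0 - alpha)        # keep the convex-combination weights explicit
        weights[j] += alpha
    return v, weights

# Illustrative use: approximate a target that lies in the convex hull, under squared loss.
rng = np.random.default_rng(0)
D = rng.normal(size=(50, 20))
target = D[:5].mean(axis=0)
squared_loss = lambda v: 0.5 * np.sum((v - target) ** 2)
v_k, w = greedy_convex_min(squared_loss, D, n_steps=30)
print(squared_loss(v_k), np.count_nonzero(w))
```

When the objective allows it, an exact one-dimensional line search or a gradient-based (Frank-Wolfe style) choice of the basic vector can replace the nested grid search without changing the overall scheme.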
41 3 Approximation bound Given any convex function, we have the following proposition, which is a direct consequence of the definition of convexity. [sent-61, score-0.343]
42 In convex analysis, the gradient can be replaced by the concept of a subgradient, which we do not consider in this paper for simplicity. [sent-62, score-0.267]
43 Proposition 3.1 Consider a convex function f; for any vectors u and v we have f(u) ≥ f(v) + ∇f(v) · (u − v). The following lemma is the main theoretical result of the paper, which bounds the performance of each greedy solution step in Algorithm 2.1. [sent-65, score-0.92]
44 For all vectors in the convex hull, we have the following inequality [equation omitted]. Proof. [sent-70, score-0.175]
45 [proof equations omitted] Now, by setting the step size appropriately in the above inequality, we obtain the lemma. [sent-74, score-0.115]
46 Using the above lemma, it is easy to obtain the following theorem by induction. [sent-77, score-0.446]
47 Theorem 3.1 Algorithm 2.1 approximately solves (1), and the rate of convergence is of order 1/k in the number of greedy steps; under an additional boundedness condition, a refined bound also holds [equations omitted]. [sent-82, score-0.164]
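Restated loosely (a sketch rather than the paper's exact statement; the constant M is an assumed stand-in for the curvature-dependent constant whose precise form is omitted above), the guarantee has the shape:

```latex
% v_k: the iterate after k steps of Algorithm 2.1; conv(S): the convex hull of the basic vectors.
% M: an assumed constant depending on the curvature of f over conv(S).
f(v_k) \;-\; \inf_{v \in \mathrm{conv}(S)} f(v) \;\le\; \frac{M}{k} \;=\; O\!\left(\tfrac{1}{k}\right).
```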
48 We show that the general formulation considered in this paper includes some previous formulations as special cases. [sent-85, score-0.276]
49 4.1 Regression In regression, we would like to approximate the output as a function of the input so that the expected loss of the predictor [sent-88, score-0.186]
50 is small, where we use the squared loss for simplicity (this choice is obviously not crucial in our framework). [sent-89, score-0.24]
51 The expectation is over the input and the output, which often corresponds to the empirical distribution of the training pairs. [sent-90, score-0.097]
52 Given a set of basis functions, we may consider the following regression formulation, slightly different from (1) [equation omitted]. [sent-93, score-0.257]
53 where a positive regularization parameter is used to control the size of the weight vector. [sent-98, score-0.086]
54 The above formulation can be readily converted into (1) by considering the set of basic vectors given in (2) [definition omitted]. [sent-99, score-0.185]
55 We may start Algorithm 2.1 from a simple initial point; the relevant quantity can then be bounded, which implies that the sparse solution produced by Algorithm 2.1, [sent-103, score-0.375]
56 represented as a weight vector over the basis functions, satisfies the following inequality [equation omitted]. [sent-104, score-0.047]
57 This leads to the original functional approximation results in [1, 5] and their generalization in [6]. [sent-107, score-0.265]
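The following hedged Python sketch specializes the earlier greedy scheme to the squared loss. The columns of Phi stand in for the basis functions, the scaling by a parameter B is an assumption standing in for the set of basic vectors in (2) (so the convex hull corresponds to weight vectors of l1-norm at most B), and the basic vector is selected through the gradient, which is a simplification of the joint search in Algorithm 2.1.

```python
import numpy as np

def greedy_squared_loss_regression(Phi, y, B=10.0, n_steps=50):
    """Greedy sparse regression: fit y with a weight vector of l1-norm at most B.

    Phi : (n, m) matrix of m basis functions evaluated at n training inputs
    y   : (n,) vector of training targets
    B   : assumed scale; the basic vectors are the columns of +/- B * Phi, so their
          convex hull corresponds to weight vectors with l1-norm bounded by B
    """
    n, m = Phi.shape
    w = np.zeros(m)                        # weight vector over the basis functions
    f = Phi @ w                            # current predictions (all zeros)
    for _ in range(n_steps):
        residual = f - y                   # gradient of 0.5 * ||f - y||^2 with respect to f
        corr = Phi.T @ residual
        j = int(np.argmax(np.abs(corr)))   # basis function most correlated with the residual
        s = -B * np.sign(corr[j])          # signed, scaled atom: +/- B * phi_j
        g = s * Phi[:, j]
        d = g - f                          # exact line search on the quadratic objective
        alpha = float(np.clip(-(residual @ d) / (d @ d + 1e-12), 0.0, 1.0))
        f = (1.0 - alpha) * f + alpha * g
        w *= (1.0 - alpha)
        w[j] += alpha * s
    return w

# Illustrative use on synthetic data with a sparse ground truth.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 100))
w_true = np.zeros(100)
w_true[:3] = [2.0, -1.5, 1.0]
y = Phi @ w_true + 0.1 * rng.normal(size=200)
w_hat = greedy_squared_loss_regression(Phi, y)
print(np.count_nonzero(np.abs(w_hat) > 1e-3), 0.5 * np.mean((Phi @ w_hat - y) ** 2))
```

With the exact line search used here, a decrease of the squared loss at the kind of O(1/k) rate discussed above is the behavior one would expect; the variable names and the synthetic data are only for illustration.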
58 The sparse regression algorithm studied in this section can also be applied to kernel methods. [sent-108, score-0.549]
59 In this case, the basis predictors are kernel functions centered at the training data points, [sent-109, score-0.058]
60 and this clearly corresponds to a special case of predictors of the form (2). [sent-110, score-0.242]
61 A sparse kernel representation can be obtained easily from Algorithm 2.1. [sent-111, score-0.272]
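As an assumed illustration of this special case, the hypothetical greedy_squared_loss_regression sketch above can be reused directly by taking the design matrix to be a kernel (Gram) matrix, so that each basis function is a kernel function centered at a training input and the learned weights form a sparse kernel expansion:

```python
import numpy as np

# Hypothetical kernel variant of the regression sketch above: the design matrix is a
# Gram matrix, so each "basis function" is an RBF kernel centered at one training input
# and the learned weight vector directly gives a sparse kernel expansion.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)
K = np.exp(-0.5 * (X - X.T) ** 2)              # (n, n) Gram matrix; unit bandwidth assumed
w_hat = greedy_squared_loss_regression(K, y, B=5.0, n_steps=40)
print(np.count_nonzero(np.abs(w_hat) > 1e-3))  # number of kernel centers actually used
```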
62 Our sparse kernel regression formulation is related to Gaussian processes, where greedy style algorithms have also been proposed [9, 10]. [sent-113, score-0.963]
63 The bound given here is comparable to the bound given in [10], where a sparse approximation rate of a similar form was obtained. [sent-114, score-0.573]
64 In fact, even in many other popular methods, such as logistic regression and support vector machines, some kind of convex formulation has to be employed. [sent-118, score-0.618]
65 Although it is possible for us to analyze their formulations, in this section we only consider the following form of loss, which is closely related to Adaboost [2]: the exponential loss given in (3) [equation omitted]. [sent-119, score-0.295]
66 Again, we consider a set of basis predictors, which are often called weak learners in the boosting literature. [sent-121, score-0.482]
67 We would like to find a strong learner as a convex combination of weak learners to approximately minimize the above loss [equation omitted]. [sent-122, score-0.47]
68 Equations (4) and (5) [omitted] give the resulting objective; this can be written as formulation (1) with a suitable choice of basic vectors, and using simple algebra it is easy to verify the condition needed to apply the approximation bound. [sent-127, score-0.115]
69 [displayed equations omitted] [sent-129, score-0.115]
70 Theorem 3.1 then implies that the sparse solution, represented as a weight vector over the weak learners, satisfies the following inequality (6) [equation omitted]. [sent-135, score-0.382]
71 Now we consider the special situation in which there exists a combination of weak learners satisfying the margin condition (7) for all data points [equation omitted]. [sent-137, score-0.084]
72 This condition will be satisfied in the large-margin linearly separable case, where there exist a weight vector and a positive margin such that the condition holds for all data points. Now, under (7), we obtain from (6) that [equation omitted] [sent-138, score-0.258]
73 Fixing any target accuracy, we can choose the number of greedy steps appropriately to obtain inequality (8) [equation omitted]. [sent-139, score-0.065]
74 This implies that the misclassification error rate decays exponentially. [sent-140, score-0.171]
75 The exponential decay of misclassification error is the original motivation of Adaboost [2]. [sent-141, score-0.174]
76 Boosting was later viewed as greedy approximation in the additive model framework [3]. [sent-142, score-0.715]
77 From the learning theory perspective, the good generalization ability of boosting is related to its tendency to improve the misclassification error under a positive margin [8]. [sent-143, score-0.41]
78 From this point of view, inequality (8) gives a much more explicit margin error bound (which decreases exponentially) than a related result in [8]. [sent-144, score-0.431]
79 In the framework of additive models, Adaboost corresponds to the exponential loss (3) analyzed in this section. [sent-145, score-0.513]
80 As pointed out in [3], other loss functions can also be used. [sent-146, score-0.186]
81 Using our analysis, we may also obtain sparse approximation bounds for these different loss functions. [sent-147, score-0.685]
82 However, it is also easy to observe that they will not lead to the exponential decay of classification error in the separable case. [sent-148, score-0.298]
83 Although the exponential loss in (3) is attractive for separable problems due to the exponential decay of margin error, it is very sensitive to outliers in the non-separable case. [sent-149, score-0.576]
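A hedged Python sketch of this exponential-loss case: at each round the greedy step picks a weak learner and a step size that reduce the average exponential loss of a convex combination of weak-learner outputs. The decision-stump dictionary, the grid line search, and the {-1, +1} label convention are illustrative assumptions; the sketch follows the additive-model view of boosting discussed above rather than AdaBoost's exact reweighting updates.

```python
import numpy as np

def greedy_exp_loss_boosting(weak_outputs, y, n_steps=30, step_grid=np.linspace(0.0, 1.0, 51)):
    """Greedily minimize the exponential loss over the convex hull of weak-learner outputs.

    weak_outputs : (m, n) array; row j holds the {-1, +1} predictions of weak learner j
    y            : (n,) array of {-1, +1} labels
    Returns the convex-combination weights and the resulting training error.
    """
    m, n = weak_outputs.shape
    weights = np.zeros(m)
    f = np.zeros(n)                                   # current combined prediction
    exp_loss = lambda h: np.mean(np.exp(-y * h))
    for _ in range(n_steps):
        best = None
        for j in range(m):                            # search over weak learners
            g = weak_outputs[j]
            for alpha in step_grid:                   # grid line search over the step size
                val = exp_loss((1.0 - alpha) * f + alpha * g)
                if best is None or val < best[0]:
                    best = (val, j, alpha)
        _, j, alpha = best
        f = (1.0 - alpha) * f + alpha * weak_outputs[j]
        weights *= (1.0 - alpha)
        weights[j] += alpha
    return weights, float(np.mean(np.sign(f) != y))

# Illustrative use: threshold "stumps" on one-dimensional data.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
y = np.where(x + 0.1 * rng.normal(size=300) > 0.0, 1, -1)
stumps = np.array([np.where(x > t, 1, -1) for t in np.linspace(-1.0, 1.0, 41)])
w, train_err = greedy_exp_loss_boosting(stumps, y)
print(train_err, np.count_nonzero(w))
```

Since the exponential loss upper-bounds the misclassification indicator, driving this loss toward zero in the separable case also drives the training misclassification error toward zero, which is the behavior described around inequality (8).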
84 We shall mention that an interesting aspect of boosting is the concept of adaptive resampling or sample reweighting. [sent-150, score-0.353]
85 Although this idea has dominated the interpretation of boosting algorithms, it has been argued in [3] that adaptive resampling is only a computational by-product. [sent-151, score-0.347]
86 The idea corresponds to a Newton-step approximation in the sparse greedy solution computed by Algorithm 2.1, [sent-152, score-0.888]
87 under the additive model framework which we consider here. [sent-153, score-0.203]
88 Our analysis further confirms that the greedy sparse solution of an additive model in (1), rather than reweighting itself, is the key component of boosting. [sent-154, score-0.825]
89 In our framework, it is also much easier to relate the idea of boosting to the greedy function approximation method outlined in [1, 5]. [sent-155, score-0.854]
90 4.3 Mixture density estimation In mixture density estimation, the output of the model is a probability density function over the input vector. [sent-157, score-0.415]
91 The following negative log-likelihood is commonly used as the loss function [equation omitted], where the model is a probability density function. [sent-158, score-0.316]
92 Again, we consider a set of basis predictors, which are often called mixture components. [sent-159, score-0.518]
93 We would like to find a mixture probability density model as a convex combination of mixture components that approximately minimizes the negative log-likelihood [equation omitted]. [sent-160, score-0.364]
94 The quantities needed by Algorithm 2.1 can be computed explicitly for this loss [equations omitted]. An approximation bound can now be directly obtained from Theorem 3.1. [sent-165, score-0.216]
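A hedged Python sketch of the mixture-density case, assuming a fixed dictionary of candidate component densities (here, unit-variance Gaussians with preset means) and the empirical negative log-likelihood as the objective; because the greedy step only mixes the current iterate with a candidate component, the mixing weights stay in the probability simplex and the iterate remains a valid density.

```python
import numpy as np

def greedy_mixture_density(component_pdfs, n_steps=50, step_grid=np.linspace(0.0, 1.0, 51)):
    """Greedily fit a mixture density by minimizing the empirical negative log-likelihood.

    component_pdfs : (m, n) array; row j holds the density of candidate component j
                     evaluated at the n data points
    Returns the mixing weights (a point in the probability simplex) and the final loss.
    """
    m, _ = component_pdfs.shape
    weights = np.zeros(m)
    weights[0] = 1.0
    p = component_pdfs[0].copy()              # current mixture density at the data points
    neg_ll = lambda q: -np.mean(np.log(q + 1e-300))
    for _ in range(n_steps):
        best = None
        for j in range(m):                    # search over candidate components
            for alpha in step_grid:           # and over step sizes
                val = neg_ll((1.0 - alpha) * p + alpha * component_pdfs[j])
                if best is None or val < best[0]:
                    best = (val, j, alpha)
        _, j, alpha = best
        p = (1.0 - alpha) * p + alpha * component_pdfs[j]
        weights *= (1.0 - alpha)
        weights[j] += alpha
    return weights, neg_ll(p)

# Illustrative use: data from two Gaussians, dictionary of unit-variance Gaussians.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 100)])
means = np.linspace(-5.0, 5.0, 21)
pdfs = np.array([np.exp(-0.5 * (data - mu) ** 2) / np.sqrt(2.0 * np.pi) for mu in means])
w, nll = greedy_mixture_density(pdfs)
print(nll, means[w > 1e-3])
```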
95 5 Conclusion This paper studies a formalization of a general class of prediction problems in machine learning, where the goal is to approximate the best model as a convex combination of a family of basic models. [sent-168, score-0.56]
96 The quality of the approximation can be measured by a loss function which we want to minimize. [sent-169, score-0.373]
97 We proposed a greedy algorithm to solve the problem, and we have shown that for a variety of loss functions, a convergence rate of O(1/k) can be achieved using a convex combination of k basic models. [sent-170, score-1.184]
98 We have illustrated the consequences of this general algorithm in regression, classification and density estimation, and related the resulting algorithms to previous methods. [sent-171, score-0.284]
99 Universal approximation bounds for superpositions of a sigmoidal function. [sent-175, score-0.274]
100 A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. [sent-201, score-0.914]
wordName wordTfidf (topN-words)
[('greedy', 0.372), ('convex', 0.267), ('boosting', 0.23), ('sparse', 0.219), ('regression', 0.191), ('loss', 0.186), ('inequality', 0.175), ('lemma', 0.157), ('additive', 0.143), ('predictors', 0.142), ('approximation', 0.14), ('functional', 0.125), ('basic', 0.119), ('hg', 0.115), ('cation', 0.1), ('library', 0.097), ('spc', 0.097), ('composite', 0.096), ('adaboost', 0.091), ('misclassi', 0.091), ('density', 0.089), ('algorithm', 0.086), ('classi', 0.085), ('wee', 0.084), ('proposition', 0.079), ('formulations', 0.077), ('margin', 0.076), ('bound', 0.076), ('bounds', 0.075), ('regarded', 0.075), ('separable', 0.075), ('hull', 0.071), ('learners', 0.071), ('mixture', 0.07), ('resampling', 0.067), ('sun', 0.067), ('implies', 0.067), ('formulation', 0.066), ('exponential', 0.066), ('decay', 0.066), ('obtain', 0.065), ('tong', 0.064), ('rate', 0.062), ('related', 0.062), ('pc', 0.061), ('nd', 0.061), ('framework', 0.06), ('satis', 0.059), ('sigmoidal', 0.059), ('corresponds', 0.058), ('ar', 0.056), ('shall', 0.056), ('sparsity', 0.054), ('convergence', 0.054), ('simplicity', 0.054), ('kernel', 0.053), ('hilbert', 0.053), ('predictor', 0.053), ('sq', 0.053), ('hastie', 0.051), ('idea', 0.05), ('limitation', 0.05), ('solution', 0.049), ('easy', 0.049), ('freund', 0.048), ('interested', 0.048), ('family', 0.048), ('approximately', 0.048), ('weight', 0.047), ('closely', 0.047), ('general', 0.047), ('quality', 0.047), ('minimize', 0.046), ('robert', 0.046), ('quantity', 0.046), ('theorem', 0.045), ('peter', 0.044), ('logistic', 0.044), ('includes', 0.044), ('lee', 0.043), ('exists', 0.042), ('special', 0.042), ('error', 0.042), ('newton', 0.042), ('reweighting', 0.042), ('subgradient', 0.042), ('tzhang', 0.042), ('yorktown', 0.042), ('problems', 0.041), ('negative', 0.041), ('bounded', 0.04), ('often', 0.039), ('simplest', 0.039), ('estimation', 0.039), ('vector', 0.039), ('combination', 0.038), ('eighteenth', 0.038), ('occam', 0.038), ('philosophy', 0.038), ('yoav', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
2 0.37801287 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
3 0.26016915 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.
4 0.2216938 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the ν-trick allows us to give a robust parameterization. Moreover, unlike in batch learning, where the ν-trick only applies to the ε-insensitive loss function, we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.
5 0.2044199 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
6 0.17969246 62 nips-2001-Duality, Geometry, and Support Vector Regression
7 0.17081137 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
8 0.16259389 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
9 0.15337233 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
10 0.15264194 167 nips-2001-Semi-supervised MarginBoost
11 0.15242842 143 nips-2001-PAC Generalization Bounds for Co-training
12 0.1490345 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
13 0.12694982 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
14 0.12661752 105 nips-2001-Kernel Machines and Boolean Functions
15 0.12660585 22 nips-2001-A kernel method for multi-labelled classification
16 0.11589137 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
17 0.11514585 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
18 0.10810811 13 nips-2001-A Natural Policy Gradient
19 0.10635613 1 nips-2001-(Not) Bounding the True Error
20 0.10567056 21 nips-2001-A Variational Approach to Learning Curves
topicId topicWeight
[(0, -0.343), (1, 0.179), (2, 0.062), (3, 0.324), (4, 0.158), (5, -0.144), (6, 0.071), (7, 0.031), (8, 0.03), (9, -0.16), (10, 0.003), (11, 0.182), (12, -0.004), (13, 0.003), (14, -0.054), (15, -0.003), (16, -0.045), (17, -0.048), (18, -0.085), (19, 0.152), (20, 0.144), (21, -0.01), (22, -0.025), (23, 0.031), (24, 0.03), (25, 0.057), (26, -0.013), (27, 0.053), (28, -0.023), (29, 0.015), (30, 0.073), (31, -0.041), (32, 0.007), (33, 0.069), (34, 0.048), (35, -0.013), (36, 0.071), (37, -0.066), (38, 0.009), (39, 0.021), (40, -0.074), (41, -0.077), (42, -0.069), (43, -0.102), (44, -0.034), (45, -0.027), (46, -0.127), (47, 0.084), (48, -0.107), (49, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.97451699 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
2 0.8415888 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
3 0.76725054 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.
4 0.69993228 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
5 0.66463727 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the ν-trick allows us to give a robust parameterization. Moreover, unlike in batch learning, where the ν-trick only applies to the ε-insensitive loss function, we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.
6 0.66165513 62 nips-2001-Duality, Geometry, and Support Vector Regression
7 0.60097903 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
8 0.59657699 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
9 0.56811816 143 nips-2001-PAC Generalization Bounds for Co-training
10 0.53814387 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
11 0.51401484 31 nips-2001-Algorithmic Luckiness
12 0.51385355 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
13 0.50258189 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
14 0.47957739 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
15 0.45392269 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
16 0.44558349 180 nips-2001-The Concave-Convex Procedure (CCCP)
17 0.43786469 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
18 0.42817891 22 nips-2001-A kernel method for multi-labelled classification
19 0.41610906 114 nips-2001-Learning from Infinite Data in Finite Time
20 0.40990236 167 nips-2001-Semi-supervised MarginBoost
topicId topicWeight
[(14, 0.085), (17, 0.018), (19, 0.038), (27, 0.17), (30, 0.077), (36, 0.095), (38, 0.022), (59, 0.026), (72, 0.109), (79, 0.052), (83, 0.023), (91, 0.116), (92, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.91700149 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
2 0.9149729 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
3 0.89468628 76 nips-2001-Fast Parameter Estimation Using Green's Functions
Author: K. Wong, F. Li
Abstract: We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion.
4 0.87842625 193 nips-2001-Unsupervised Learning of Human Motion Models
Author: Yang Song, Luis Goncalves, Pietro Perona
Abstract: This paper presents an unsupervised learning algorithm that can derive the probabilistic dependence structure of parts of an object (a moving human body in our examples) automatically from unlabeled data. The distinguished part of this work is that it is based on unlabeled data, i.e., the training features include both useful foreground parts and background clutter and the correspondence between the parts and detected features are unknown. We use decomposable triangulated graphs to depict the probabilistic independence of parts, but the unsupervised technique is not limited to this type of graph. In the new approach, labeling of the data (part assignments) is taken as hidden variables and the EM algorithm is applied. A greedy algorithm is developed to select parts and to search for the optimal structure based on the differential entropy of these variables. The success of our algorithm is demonstrated by applying it to generate models of human motion automatically from unlabeled real image sequences.
5 0.85292697 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
Author: Olivier Chapelle, Bernhard Schölkopf
Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance, it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice.
6 0.85208589 13 nips-2001-A Natural Policy Gradient
7 0.84851378 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.84671873 137 nips-2001-On the Convergence of Leveraging
9 0.84608042 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
10 0.84032381 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
11 0.83956534 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
12 0.83560896 58 nips-2001-Covariance Kernels from Bayesian Generative Models
13 0.83507895 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
14 0.83506942 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
15 0.83476663 143 nips-2001-PAC Generalization Bounds for Co-training
16 0.83381021 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
17 0.82975078 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
18 0.82965595 60 nips-2001-Discriminative Direction for Kernel Classifiers
19 0.82882106 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade