nips nips2001 nips2001-137 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We give a unified convergence analysis of ensemble learning methods including e. [sent-10, score-0.227]
2 These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. [sent-13, score-0.566]
3 We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. [sent-14, score-0.269]
4 Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning. [sent-15, score-0.201]
5 1 Introduction We show convergence rates of ensemble learning methods such as AdaBoost [10], Logistic Regression (LR) [11, 5] and the Least-Square (LS) regression algorithm called LS-Boost [12]. [sent-16, score-0.419]
6 These algorithms have in common that they iteratively call a base learning algorithm (also called weak learner) on a weighted training sample. [sent-17, score-0.51]
7 The base learner is expected to return, in each iteration, a hypothesis from some set of weak hypotheses that has a small weighted training error. [sent-18, score-1.402]
8 These hypotheses are then linearly combined to form the final hypothesis; in classification one uses the sign of the combined hypothesis for prediction. [sent-20, score-0.546]
9 The hypothesis coefficient is determined at the current iteration such that a certain objective is minimized or approximately minimized, and is then fixed for later iterations. [sent-21, score-0.417]
10 Here we will work out sufficient conditions on the base learning algorithm to achieve linear convergence to the minimum of an associated loss function . [sent-22, score-0.72]
11 Relation to Previous Work In the original work on AdaBoost it has been shown that the optimization objective (which is an upper bound on the training error) converges exponentially fast to zero, if the base learner is consistently better than random guessing, i. [sent-24, score-0.589]
12 In this case the convergence is known to be linear (i. [sent-27, score-0.177]
13 the weighted training error cannot be upper bounded by a constant smaller than 1/2; otherwise one could use AdaBoost to find a separation using the aforementioned convergence result. [sent-38, score-0.327]
14 For AdaBoost and Logistic Regression it has been shown [5] that they generate a combined hypothesis asymptotically minimizing a loss functional only depending on the output of the combined hypothesis . [sent-39, score-1.042]
15 This holds for the non-separable case; however, the assumed conditions in [5] on the performance of the base learner are rather strict and can usually not be satisfied in practice. [sent-40, score-0.522]
16 Although the analysis in [5] holds in principle for any strictly convex cost function of Legendre-type (e. [sent-41, score-0.24]
17 258, and [1]), one needs to show the existence of a so-called auxiliary function [7, 5] for each cost function other than the exponential or the logistic loss. [sent-44, score-0.249]
18 In an earlier attempt to show the convergence of such methods for arbitrary loss functions [17], one needed to assume that the hypothesis coefficients are upper bounded by a rather small constant. [sent-49, score-0.781]
19 For this case it has been shown that the algorithm asymptotically converges to a combined hypothesis minimizing . [sent-50, score-0.54]
20 In [9] it has been shown that for loss functions which are (essentially) exponentially decreasing (including the loss functions of AdaBoost and Logistic regression), the loss is in the first iterations and afterwards . [sent-52, score-0.706]
21 However, this only holds if the loss reaches zero, i. [sent-54, score-0.189]
22 In this work convergence is proven for the general non-separable case, however, only for the exponential loss, i. [sent-60, score-0.25]
23 The framework set up in this paper is more general and we are able to treat any strictly convex loss function. [sent-63, score-0.354]
24 In this paper we propose a family of algorithms that are able to generate a combined hypothesis converging to the minimum of (if it exists), which is a functional depending on the outputs of the function evaluated on the training set. [sent-64, score-0.559]
25 While assuming mild conditions on the base learning algorithm and the loss function , we can show linear convergence rates [15] (beginning in the first iteration) of the type for some fixed . [sent-66, score-0.75]
26 This means that the difference to the minimum loss converges exponentially fast to zero (in the number of iterations). [sent-67, score-0.282]
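To make this concrete, the display below sketches what a linear (geometric) convergence rate of this type amounts to; the symbols are our own notation rather than the paper's (G for the loss functional, α^t for the coefficients after t iterations, α* for a minimizer, η for the fixed constant):

    G(\alpha^{t}) - G(\alpha^{*}) \;\le\; \eta\,\bigl(G(\alpha^{t-1}) - G(\alpha^{*})\bigr),\quad 0<\eta<1
    \qquad\Longrightarrow\qquad
    G(\alpha^{t}) - G(\alpha^{*}) \;\le\; \eta^{t}\,\bigl(G(\alpha^{0}) - G(\alpha^{*})\bigr).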
27 A similar convergence has been proven for AdaBoost in the special case of separable data [10], although the constant shown in [10] can be considerably smaller [see also 9]. [sent-68, score-0.298]
28 To prove the convergence of leveraging, we exploit results of Luo & Tseng [16] for a variant of the Gauss-Southwell method known from numerical optimization. [sent-69, score-0.241]
29 Since in practice the hypothesis set can be quite large, ensemble learning algorithms without any regularization often suffer from overfitting [22, 12, 2, 19]. [sent-70, score-0.467]
30 Here, the complexity can only be controlled by the size of the base hypothesis set or by early stopping after a few iterations. [sent-71, score-0.584]
31 However, it has been shown that shrinkage regularization implied by penalizing some norm of the hypothesis coefficients is the favorable strategy [6, 12]. [sent-72, score-0.444]
32 We therefore extend our analysis to the case of ℓ1-norm regularized loss functions. [sent-73, score-0.275]
33 Relating these results to leveraging algorithms, we present an extension to regularized cost functions in Sec. [sent-82, score-0.634]
34 2 Leveraging algorithms revisited We first briefly review some of the most well known leveraging algorithms for classification and regression. [sent-84, score-0.573]
35 1 as a template for a generic leveraging algorithm, since these algorithms have the same algorithmic structure. [sent-89, score-0.528]
36 In each iteration they call a base learning algorithm on the training set (cf. [sent-92, score-0.406]
37 The base learner is expected to return a hypothesis from some hypothesis space that has a small weighted classification error [10, 11]. [sent-96, score-1.138]
38 For AdaBoost this is (1) and for Logistic Regression it is (2), where the combined hypothesis of the previous iteration is given by . [sent-100, score-0.435]
39 The weighting on the sample is updated based on the new combined hypothesis, for AdaBoost and Logistic Regression respectively. [sent-104, score-0.404]
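As an illustration of these two weighting updates, the following sketch (our own Python, not the paper's notation; y is assumed to hold the ±1 labels, f the outputs of the current combined hypothesis on the training sample, and normalization to a distribution is assumed) computes weights proportional to the derivatives of the exponential and logistic losses, which is the standard reading of the AdaBoost and Logistic Regression updates:

    import numpy as np

    def adaboost_weights(y, f):
        # exponential loss: weight of example n proportional to exp(-y_n f(x_n))
        d = np.exp(-y * f)
        return d / d.sum()

    def logistic_weights(y, f):
        # logistic loss: weight of example n proportional to 1 / (1 + exp(y_n f(x_n)))
        d = 1.0 / (1.0 + np.exp(y * f))
        return d / d.sum()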
40 It first selects a hypothesis solving (3) and then finds the hypothesis weight by minimizing the squared error of the new combined hypothesis (4). The "weighting" of the sample is computed as the residual of the current fit [12]. [sent-108, score-0.704]
41 In a second version of LS-Boost, the base hypothesis and its weight are found simultaneously by solving [12]: (5) Since in (5) one reaches a lower loss function value than with (3) and (4), it might be the favorable strategy. [sent-109, score-0.813]
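Both LS-Boost variants can be sketched as follows (our own Python; H is assumed to be an n-by-K matrix holding the K base hypotheses evaluated on the n training points, which the text never materializes explicitly). Variant (3)/(4) picks the hypothesis most correlated with the residual and then fits its weight by least squares; variant (5) scores each hypothesis by the squared error left after its optimal weight:

    import numpy as np

    def ls_boost_step(H, y, f, joint=False):
        r = y - f                               # residual of the current combined hypothesis
        corr = H.T @ r                          # <h_k, r> for every base hypothesis
        norms = np.einsum('ij,ij->j', H, H)     # <h_k, h_k>
        if joint:
            # minimizing ||r - a*h_k||^2 over a leaves error ||r||^2 - corr_k^2/||h_k||^2,
            # so the jointly best hypothesis maximizes corr_k^2 / ||h_k||^2
            k = int(np.argmax(corr ** 2 / norms))
        else:
            # first select by correlation with the residual ...
            k = int(np.argmax(np.abs(corr)))
        # ... then (in both cases) the least-squares weight of the chosen hypothesis
        alpha = corr[k] / norms[k]
        return k, alpha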
42 4 Notice that the hypothesis set always contains only a finite number of different hypotheses when evaluated on the training set and is effectively finite [2]. [sent-110, score-0.173]
43 Algorithm 1 – A Leveraging algorithm for the loss function . [sent-112, score-0.234]
44 Input: number of iterations and loss function; for each iteration: (a) train the classifier on the weighted sample and obtain a hypothesis, (b) set its coefficient, (c) update the weighting and the combined hypothesis. [sent-117, score-0.318]
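A minimal executable sketch of this generic template is given below (our own Python, not the authors' pseudocode; base_learner, loss and weights_of are assumed user-supplied callables, e.g. a decision-stump learner and the weight functions sketched earlier, and the returned hypothesis h is assumed callable on the sample):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def leveraging(base_learner, loss, weights_of, X, y, T):
        n = len(y)
        f = np.zeros(n)                 # outputs of the combined hypothesis on the sample
        d = np.full(n, 1.0 / n)         # initial uniform weighting
        ensemble = []
        for t in range(T):
            h = base_learner(X, y, d)   # (a) train on the weighted sample, get a hypothesis
            hx = h(X)
            # (b) choose the coefficient by a one-dimensional line search on the loss
            alpha = minimize_scalar(lambda a: loss(y, f + a * hx)).x
            ensemble.append((alpha, h))
            f = f + alpha * hx          # (c) update the combined hypothesis ...
            d = weights_of(y, f)        # ... and the sample weighting
        return ensemble                 # classify with sign(sum_t alpha_t h_t(x))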
45 To avoid confusion, note that hypotheses and coefficients generated during the iterative algorithm are marked by a hat. [sent-131, score-0.186]
46 In the algorithms discussed so far, the optimization takes place by employing the leveraging scheme outlined in Alg. [sent-132, score-0.624]
47 The output of such an algorithm is a sequence of pairs and a combined hypothesis . [sent-134, score-0.417]
48 Other Preliminaries Throughout the paper we assume the loss function is of an additive form; although this assumption is not necessary, it makes the presentation easier. [sent-136, score-0.223]
49 However, note that additive loss functions are commonly used if one considers i. [sent-138, score-0.224]
50 Furthermore, the function is assumed to be strictly convex for all . [sent-143, score-0.194]
51 For regression problems the hypothesis space might be infinite. [sent-149, score-0.465]
52 Then we show how the reviewed leveraging algorithms fit into this optimization framework. [sent-152, score-0.622]
53 1 Coordinate Descent The idea of coordinate descent is to iteratively select a coordinate, say the -th, and find such that some functional is minimized with respect to . [sent-154, score-0.378]
54 15]; however, we are particularly interested in the Gauss-Southwell-type (GS) selection scheme: it selects the coordinate that has the largest absolute value in the gradient vector, i. [sent-157, score-0.331]
55 Instead of doing steps in the direction of the negative gradient as in standard gradient descent methods, one only changes the variable that has the largest gradient component. [sent-160, score-0.28]
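The Gauss-Southwell selection scheme itself is easy to state in code; the sketch below (our own notation; grad and line_min are assumed callables returning the objective's gradient and the exact minimizer along one coordinate with the others held fixed) just repeats "pick the coordinate with the largest absolute gradient entry, then minimize along it":

    import numpy as np

    def gauss_southwell(grad, line_min, alpha0, T):
        alpha = np.asarray(alpha0, dtype=float).copy()
        for t in range(T):
            g = grad(alpha)
            j = int(np.argmax(np.abs(g)))   # Gauss-Southwell: largest gradient component
            alpha[j] = line_min(alpha, j)   # exact minimization along coordinate j only
        return alpha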
56 We start with the following general convergence result, which seems to have fallen into oblivion even in the optimization community. [sent-162, score-0.236]
57 It will be very useful in the analysis of leveraging algorithms. [sent-163, score-0.483]
58 Suppose is twice continuously differentiable and strictly convex on . [sent-166, score-0.193]
59 Let be the sequence generated by coordinate descent, where the coordinate selection satisfies (8) for some , where is the optimal value if it were selected, i. [sent-169, score-0.334]
60 Then a coordinate selection such that (10) or (11) holds satisfies (8); thus the approximate Gauss-Southwell method as described above converges. [sent-180, score-0.192]
61 To show the convergence of the second variant of LS-Boost (cf. [sent-181, score-0.208]
62 Then a coordinate selection satisfies (8) if there exists a fixed with (12). Thus the maximal-improvement scheme as above converges in the sense of Thm. [sent-184, score-0.332]
63 Finally we can also state a rate of convergence, which is surprisingly not worse than the rates for standard gradient descent methods: Theorem 4 (Rate of Convergence of Coordinate Descent, [16]). [sent-186, score-0.174]
64 Then we have (13), where is the estimate after the -th coordinate descent step. [sent-191, score-0.263]
65 Following [16] one can show that the constant is , where is the Lipschitz constant of and is a constant that depends on and therefore on the geometry of the hypothesis set (cf. [sent-193, score-0.411]
66 While the upper bound on can be rather large, making the convergence slow, it is important to note (i) that this is only a rough estimate of the true constant and (ii) that it still guarantees an exponential decrease in the error functional with the number of iterations. [sent-195, score-0.287]
67 2 Leveraging and Coordinate Descent We now return from the abstract convergence results in Sec. [sent-197, score-0.215]
68 Thus, the coordinate with maximal absolute gradient corresponds to the hypothesis with the largest absolute edge (see definition). [sent-206, score-0.598]
69 However, according to Propositions 2 and 3 we need to assume less about the base learner. [sent-207, score-0.3]
70 It either has to return a hypothesis that (approximately) maximizes the edge, or alternatively (approximately) minimizes the loss function. [sent-208, score-0.545]
71 A base learning algorithm is called -optimal if it always returns hypotheses that satisfy either condition (11) or (12) for some fixed . [sent-210, score-0.536]
72 Since we have assumed the hypothesis set is closed under complementation, there always exist two hypotheses having the same absolute gradient ( and ). [sent-211, score-0.274]
73 We therefore only need to consider the hypothesis with maximum edge as opposed to the maximum absolute edge. [sent-212, score-0.405]
74 For classification this means: if the base learner returns the hypothesis with approximately the smallest weighted training error, this condition is satisfied. [sent-213, score-0.898]
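The correspondence between edge and weighted training error used here can be spelled out as follows (a sketch in our own notation: d is a distribution over the sample summing to one, y the ±1 labels, hx the hypothesis outputs). For {-1,+1}-valued hypotheses the weighted error equals (1 - edge)/2, so approximately minimizing the weighted error is the same as approximately maximizing the edge:

    import numpy as np

    def edge(d, y, hx):
        # weighted correlation between labels and hypothesis outputs
        return float(np.sum(d * y * hx))

    def weighted_error(d, y, hx):
        # for hx in {-1,+1}^n and d summing to one this equals (1 - edge(d, y, hx)) / 2
        return float(np.sum(d * (np.sign(hx) != y)))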
75 The loss functions of AdaBoost, Logistic regression and LS-Boost are bounded, strongly convex and fulfill the conditions in Thm. [sent-217, score-0.517]
76 We can finally state the convergence result for leveraging algorithms: Theorem 7. [sent-219, score-0.66]
77 Let be a loss function satisfying the conditions in Thm. [sent-220, score-0.232]
78 Since , the base learner prefers hypotheses with small and could therefore stop improving the objective while not being optimal (see [20, Section 4. [sent-229, score-0.546]
79 It additionally allows for box constraints and a linear function in terms of the hypothesis coefficients. [sent-233, score-0.318]
80 Here, we are particularly interested in ℓ1-norm penalized loss functions of the type , which are frequently used in machine learning. [sent-234, score-0.254]
81 The LASSO algorithm for regression [26] and the PBVM algorithm for classification [25] are examples. [sent-235, score-0.237]
82 Since we assumed closedness under complementation of , we can assume without loss of generality that a solution satisfies . [sent-236, score-0.328]
83 Clearly, the regularization defines a structure of nested subsets of , where the hypothesis set is restricted to a smaller set for larger values of . [sent-238, score-0.372]
84 The constraint causes some minor complications with the assumptions on the base learning algorithm. [sent-239, score-0.266]
85 [21]), while not assuming more about the base learning algorithm. [sent-241, score-0.266]
86 Roughly speaking, this induces the problem that a hypothesis coefficient chosen too large in a previous iteration cannot be reduced again. [sent-244, score-0.318]
87 To solve this problem one can check for each coefficient of a previously selected hypothesis whether not selecting it would violate the -optimality condition (11) or (12). [sent-245, score-0.356]
88 If so, the algorithm selects such a coordinate for the next iteration instead of calling the base learning algorithm. [sent-246, score-0.320]
89 Algorithm 2 – A Leveraging algorithm for the ℓ1-norm regularized loss (algorithm box: in each iteration, (a) train the classifier on the weighted sample and obtain a hypothesis, (b) set its coefficient, (c) update, (d) apply the regularized correction step when needed). [sent-251, score-0.318]
90 Output: the combined hypothesis. [sent-255, score-0.551]
91 1, is strictly convex, , and the base learner satisfies (15) . [sent-260, score-0.467]
92 Then Alg. 2 converges linearly to a minimum of the regularized loss function. [sent-262, score-0.366]
93 This can also be shown for a maximum-improvement-like condition on the base learner, which we have to omit due to space limitations. [sent-263, score-0.333]
94 For this algorithm one can show order-one convergence (which is weaker than linear convergence), which also holds if the hypothesis set is infinite. [sent-265, score-0.585]
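To illustrate how an ℓ1 penalty enters a coordinate-wise update at all, here is a standard LASSO coordinate-descent sketch (our own Python over an explicit finite dictionary H with no all-zero columns and penalty lam; it uses cyclic rather than Gauss-Southwell-type selection and is not Algorithm 2, which works through a base learner and the optimality check described above):

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_coordinate_descent(H, y, lam, T):
        # minimizes 0.5 * ||y - H @ alpha||^2 + lam * ||alpha||_1
        n, p = H.shape
        alpha = np.zeros(p)
        norms = np.einsum('ij,ij->j', H, H)
        r = y - H @ alpha                    # residual, kept up to date
        for _ in range(T):
            for j in range(p):
                r = r + H[:, j] * alpha[j]   # remove coordinate j's contribution
                alpha[j] = soft_threshold(H[:, j] @ r, lam) / norms[j]
                r = r - H[:, j] * alpha[j]   # put the refitted contribution back
        return alpha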
95 5 Conclusion We gave a unifying convergence analysis for a fairly general family of leveraging methods. [sent-266, score-0.694]
96 These convergence results were obtained under rather mild assumptions on the base learner and, additionally, led to linear convergence rates. [sent-267, score-0.789]
97 This was achieved by relating leveraging algorithms to the Gauss-Southwell method known from numerical optimization. [sent-268, score-0.561]
98 Future investigations include the generalization to infinite hypothesis spaces and an improvement of the convergence rate . [sent-270, score-0.318]
99 On the convergence of the coordinate descent method for convex differentiable minimization. [sent-396, score-0.571]
100 Sparse regression ensembles in infinite and finite hypothesis spaces. [sent-436, score-0.465]
wordName wordTfidf (topN-words)
[('leveraging', 0.483), ('adaboost', 0.337), ('hypothesis', 0.318), ('base', 0.266), ('loss', 0.189), ('logistic', 0.184), ('convergence', 0.177), ('regression', 0.147), ('coordinate', 0.142), ('hypotheses', 0.141), ('learner', 0.139), ('tsch', 0.124), ('descent', 0.121), ('convex', 0.103), ('regularized', 0.086), ('satis', 0.083), ('complementation', 0.076), ('classi', 0.071), ('coef', 0.065), ('iteration', 0.063), ('strictly', 0.062), ('weighted', 0.059), ('optimization', 0.059), ('converges', 0.058), ('proposition', 0.056), ('combined', 0.054), ('regularization', 0.054), ('bregman', 0.053), ('gradient', 0.053), ('cation', 0.053), ('separable', 0.052), ('demiriz', 0.051), ('luo', 0.051), ('absolute', 0.051), ('mika', 0.05), ('selection', 0.05), ('nite', 0.05), ('ensemble', 0.05), ('xed', 0.047), ('returns', 0.046), ('algorithm', 0.045), ('algorithms', 0.045), ('holds', 0.045), ('exists', 0.045), ('leveraged', 0.044), ('lasso', 0.044), ('functional', 0.044), ('conditions', 0.043), ('nally', 0.042), ('boosting', 0.04), ('lr', 0.04), ('duffy', 0.04), ('gs', 0.04), ('favorable', 0.04), ('proven', 0.038), ('condition', 0.038), ('pages', 0.038), ('return', 0.038), ('della', 0.038), ('onoda', 0.038), ('pietra', 0.038), ('scheme', 0.037), ('minimized', 0.036), ('edge', 0.036), ('selects', 0.035), ('iteratively', 0.035), ('fraunhofer', 0.035), ('manfred', 0.035), ('reviewed', 0.035), ('october', 0.035), ('functions', 0.035), ('exponential', 0.035), ('exponentially', 0.035), ('iterations', 0.034), ('es', 0.034), ('family', 0.034), ('assume', 0.034), ('minimizing', 0.033), ('linearly', 0.033), ('numerical', 0.033), ('asymptotically', 0.032), ('weighting', 0.032), ('march', 0.032), ('converging', 0.032), ('shrinkage', 0.032), ('training', 0.032), ('constant', 0.031), ('variant', 0.031), ('cost', 0.03), ('theorem', 0.03), ('cients', 0.03), ('penalized', 0.03), ('mild', 0.03), ('omit', 0.029), ('brie', 0.029), ('assumed', 0.029), ('weak', 0.028), ('bounded', 0.028), ('differentiable', 0.028), ('ller', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.
2 0.26016915 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
3 0.21902612 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.
4 0.20442632 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
5 0.20257488 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
6 0.19932057 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
7 0.19734672 167 nips-2001-Semi-supervised MarginBoost
8 0.14977118 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
9 0.14562941 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
10 0.11038642 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
11 0.1000285 62 nips-2001-Duality, Geometry, and Support Vector Regression
12 0.098747432 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
13 0.098609798 31 nips-2001-Algorithmic Luckiness
14 0.097221881 22 nips-2001-A kernel method for multi-labelled classification
15 0.089609399 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
16 0.089386478 105 nips-2001-Kernel Machines and Boolean Functions
17 0.087572277 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
18 0.084035069 13 nips-2001-A Natural Policy Gradient
19 0.080949754 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
20 0.076745406 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning
topicId topicWeight
[(0, -0.26), (1, 0.126), (2, 0.048), (3, 0.31), (4, 0.13), (5, -0.133), (6, 0.032), (7, 0.057), (8, 0.063), (9, -0.133), (10, 0.031), (11, 0.121), (12, -0.002), (13, 0.026), (14, -0.052), (15, -0.126), (16, -0.044), (17, -0.036), (18, -0.045), (19, 0.196), (20, 0.18), (21, -0.026), (22, -0.074), (23, 0.107), (24, -0.012), (25, 0.062), (26, -0.011), (27, -0.091), (28, -0.077), (29, 0.086), (30, 0.023), (31, -0.075), (32, -0.034), (33, -0.042), (34, 0.009), (35, 0.032), (36, 0.008), (37, 0.019), (38, -0.008), (39, -0.07), (40, 0.112), (41, 0.032), (42, 0.005), (43, 0.007), (44, 0.011), (45, 0.047), (46, 0.119), (47, -0.056), (48, 0.133), (49, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.95715296 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.
2 0.79533511 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
3 0.6501677 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
4 0.64741594 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.
5 0.62327409 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile
Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.
6 0.57714486 167 nips-2001-Semi-supervised MarginBoost
7 0.5505814 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
8 0.51056778 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
10 0.46736005 31 nips-2001-Algorithmic Luckiness
11 0.43791789 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
12 0.38413954 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
13 0.36618641 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
14 0.36022535 22 nips-2001-A kernel method for multi-labelled classification
15 0.36000288 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
16 0.35437447 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
17 0.34121421 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
18 0.32542604 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
19 0.31427035 180 nips-2001-The Concave-Convex Procedure (CCCP)
20 0.31305194 105 nips-2001-Kernel Machines and Boolean Functions
topicId topicWeight
[(14, 0.109), (17, 0.034), (19, 0.053), (27, 0.18), (30, 0.063), (36, 0.014), (37, 0.166), (38, 0.011), (59, 0.024), (72, 0.06), (79, 0.049), (83, 0.037), (88, 0.022), (91, 0.1)]
simIndex simValue paperId paperTitle
same-paper 1 0.9082132 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give a unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotic convergence results for all these methods. Our analysis includes ℓ1-norm regularized cost functions, leading to a clean and general way to regularize ensemble learning.
2 0.80796832 142 nips-2001-Orientational and Geometric Determinants of Place and Head-direction
Author: Neil Burgess, Tom Hartley
Abstract: We present a model of the firing of place and head-direction cells in rat hippocampus. The model can predict the response of individual cells and populations to parametric manipulations of both geometric (e.g. O'Keefe & Burgess, 1996) and orientational (Fenton et al., 2000a) cues, extending a previous geometric model (Hartley et al., 2000). It provides a functional description of how these cells' spatial responses are derived from the rat's environment and makes easily testable quantitative predictions. Consideration of the phenomenon of remapping (Muller & Kubie, 1987; Bostock et al., 1991) indicates that the model may also be consistent with nonparametric changes in firing, and provides constraints for its future development. 1
3 0.80021489 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
4 0.79135293 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile
Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.
5 0.78965896 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
6 0.78918678 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
7 0.78789115 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
8 0.78778416 134 nips-2001-On Kernel-Target Alignment
9 0.78390437 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
10 0.78312445 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
11 0.77773798 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
12 0.77678823 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
13 0.77544016 139 nips-2001-Online Learning with Kernels
14 0.7714082 58 nips-2001-Covariance Kernels from Bayesian Generative Models
15 0.77126104 190 nips-2001-Thin Junction Trees
16 0.76593494 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
17 0.76287091 31 nips-2001-Algorithmic Luckiness
18 0.761262 114 nips-2001-Learning from Infinite Data in Finite Time
19 0.75871587 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
20 0.75800329 60 nips-2001-Discriminative Direction for Kernel Classifiers