nips nips2001 nips2001-138 knowledge-graph by maker-knowledge-mining

138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms


Source: pdf

Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile

Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good data-dependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good data-dependent tail bounds on their risk. [sent-7, score-0.702]

2 Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. [sent-8, score-0.112]

3 Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds. [sent-9, score-0.346]

4 First, obtaining tight uniform risk bounds in terms of meaningful empirical quantities is generally a difficult task. [sent-12, score-0.586]

5 Second, searching for the hypothesis minimizing a given empirical functional is often computationally expensive and, furthermore, the minimizing algorithm is seldom incremental (if new data is added to the training set then the algorithm needs to be run again from scratch). [sent-13, score-0.552]

6 On-line learning algorithms, such as the Perceptron algorithm [17], the Winnow algorithm [14], and their many variants [16, 6, 13, 10, 2, 9], are general methods for solving classification and regression problems that can be used in a fully incremental fashion. [sent-14, score-0.294]
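As an illustration of this fully incremental style (a minimal sketch of the classical Perceptron update, not the paper's own pseudocode; the list-based representation is our choice):

```python
# Minimal Perceptron sketch: examples are processed one at a time and
# the weight vector is updated only when a mistake is made, so adding
# new data never requires re-running from scratch.
def perceptron(examples, dim):
    w = [0.0] * dim
    mistakes = 0
    for x, y in examples:                  # y in {-1, +1}
        margin = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if margin >= 0 else -1
        if pred != y:                      # mistake: move w toward y * x
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return w, mistakes
```

Each new example costs one inner product and at most one additive update, which is why each new training example can be processed quickly.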

7 That is, they need (in most cases) a short time to process each new training example and adjust their current hypothesis. [sent-15, score-0.04]

8 While the behavior of these algorithms is well understood in the so-called mistake bound model [14], where no assumptions are made on the way the training sequence is generated, there are fewer results concerning how to use these algorithms to obtain hypotheses with small statistical risk. [sent-16, score-0.592]

9 Littlestone [15] proposed a method for obtaining small risk hypotheses from a run of an arbitrary on-line algorithm by using a cross validation set to test each one of the hypotheses generated during the run. [sent-17, score-1.142]

10 This method does not require any convergence property of the online algorithm and provides risk tail bounds that are sharper than those obtainable by choosing, for instance, the hypothesis in the run that survived the longest. [sent-18, score-1.054]

11 Helmbold, Warmuth, and others [11, 6, 8] showed that, without using any cross-validation sets, one can obtain expected risk bounds (as opposed to the more informative tail bounds) for a hypothesis randomly drawn among those generated during the run. [sent-19, score-0.863]

12 In this paper we prove, via refinements and extensions of the previous analyses, that online algorithms naturally lead to good data-dependent tail bounds without employing the complicated concentration-of-measure machinery needed by other frameworks [19]. [sent-20, score-0.386]

13 When applied to concrete algorithms, the loss bound translates into a function of meaningful data-dependent quantities. [sent-22, score-0.332]

14 For classification problems, the mistake bound for the p-norm Perceptron algorithm yields a tail risk bound in terms of the empirical distribution of the margins — see (4). [sent-23, score-0.862]

15 For regression problems, the square loss bound for ridge regression yields a tail risk bound in terms of the eigenvalues of the Gram matrix — see (5). [sent-24, score-1.308]

16 2 Preliminaries and notation. Let the instance and label domains be arbitrary sets. [sent-25, score-0.083]

17 An example is a pair consisting of an instance belonging to the instance domain and the label associated with it. [sent-26, score-0.07]

18 We let be the pair of random variables , where and take values in and , respectively. [sent-28, score-0.035]

19 Throughout the paper, we assume that data are generated i.i.d. [sent-29, score-0.032]

20 All probabilities and expectations will be understood with respect to this underlying distribution. [sent-33, score-0.032]

21 A hypothesis is any (measurable) mapping from instances to predictions, where the predictions take values in a given decision space. [sent-36, score-0.187]

22 The risk of is defined by , where is a nonnegative loss function. [sent-37, score-0.469]

23 The on-line algorithms we investigate are defined within a well-known mathematical model, which is a generalization of a learning model introduced by Littlestone [14] and Angluin [1]. [sent-39, score-0.068]

24 In this learning model, an on-line algorithm processes the examples one at a time in trials, generating a sequence of hypotheses. [sent-41, score-0.372]

25 At the beginning of each trial, the algorithm receives an instance and uses its current hypothesis to compute a prediction for the label associated with that instance. [sent-42, score-0.375]

26 Then, the true value of the label is disclosed and the algorithm suffers a loss, measuring how bad its prediction was for that label. [sent-43, score-0.319]

27 Before the next trial begins, the algorithm generates a new hypothesis which may or may not be equal to the current one. [sent-44, score-0.336]

28 We measure the algorithm’s performance by its cumulative loss. [sent-45, score-0.258]

29 In particular, throughout the paper will denote the (deterministic) initial hypothesis of an arbitrary on-line algorithm and, for each , will be a random variable denoting the -th hypothesis of the on-line algorithm and such that the value of does not change upon changes in the values of . [sent-47, score-0.641]

30 Our goal is to relate the risk of the hypotheses produced by an on-line algorithm running on an i.i.d. sequence [sent-48, score-0.643]

31 to the cumulative loss of the algorithm on that sequence. [sent-51, score-0.387]

32 The cumulative loss will be our key empirical (data-dependent) quantity. [sent-52, score-0.296]
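The trial-by-trial protocol and the cumulative loss can be sketched generically as follows; the function names and the representation of hypotheses are illustrative, not the paper's notation.

```python
# Generic on-line protocol: at each trial the current hypothesis
# predicts, a loss is suffered, and the next hypothesis is generated.
# Returns the full hypothesis sequence and the cumulative loss.
def run_online(h0, predict, update, loss, examples):
    hyps = [h0]          # h0 is the deterministic initial hypothesis
    cumulative = 0.0
    h = h0
    for x, y in examples:
        cumulative += loss(predict(h, x), y)  # loss suffered on this trial
        h = update(h, x, y)                   # next hypothesis
        hyps.append(h)
    return hyps, cumulative
```

For instance, with a constant predictor that is updated to the last observed label, the cumulative square loss on a short sequence can be computed directly.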

33 Via our analysis we will obtain bounds in which the risk is controlled by the cumulative loss plus a penalty, where the penalty is a specific function of the sequence of hypotheses produced by the algorithm together with a suitable positive constant. [sent-53, score-0.546]

34 We will see that for specific on-line algorithms the ratio can be further bounded in terms of meaningful empirical quantities. [sent-54, score-0.181]

35 Our method centers on the following simple concentration lemma about bounded losses. [sent-55, score-0.267]

36 Lemma 1 Let the loss be an arbitrary bounded loss satisfying the stated range condition, and let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses when run on the training sequence. [sent-56, score-0.727]

37 Furthermore, the conditional expectations involved are taken with respect to the σ-algebra generated by the previously observed examples. [sent-65, score-0.032]

38 A direct application of the Hoeffding-Azuma inequality [3] to the bounded random variables proves the lemma. [sent-66, score-0.13]
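As a numerical illustration of the kind of concentration such inequalities provide (a simulation for i.i.d. [0, 1]-valued losses with an illustrative Hoeffding-style radius, not the lemma's martingale argument):

```python
import math
import random

# Empirical check of Hoeffding-style concentration for bounded losses:
# the fraction of samples on which the empirical mean deviates from the
# true mean by more than the radius eps should stay below delta.
def failure_rate(p, n, delta, runs, seed=0):
    rng = random.Random(seed)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))  # deviation radius
    failures = 0
    for _ in range(runs):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) > eps:
            failures += 1
    return failures / runs
```

With the seed fixed, the observed failure rate comes out well below the nominal confidence parameter, as the inequality predicts.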

39 3 Concentration for convex losses. In this section we investigate the risk of the average hypothesis, the average of the hypotheses generated by some on-line algorithm run on the training examples. [sent-67, score-1.118]

40 The average hypothesis generates valid predictions whenever the decision space is convex. [sent-68, score-0.187]

41 Theorem 2 Let the decision space be convex and the loss be convex in its first argument. [sent-69, score-0.086]

42 Let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses when the algorithm is run on the training examples. [sent-70, score-0.671]

43 Then for any confidence parameter the following tail bound holds. [sent-71, score-0.034]

44 Notice that the last hypothesis is not used in this average. Proof. [sent-72, score-0.216]

45 Since the loss is convex in the first argument, by Jensen’s inequality the loss of the average hypothesis is at most the average of the individual losses; taking expectation then yields the claim. [sent-73, score-0.146]

46 Using the last inequality along with Lemma 1 yields the claim. [sent-74, score-0.103]

47 This theorem, which can be viewed as the tail bound version of the expected bound in [11], implies that the risk of the average hypothesis is close to the normalized cumulative loss for “most” samples. [sent-76, score-0.841]
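For linear predictors, the average hypothesis of Theorem 2 is simply the componentwise mean of the stored weight vectors; a minimal sketch (our own illustrative code, not the paper's notation):

```python
# Average hypothesis for linear predictors: when the decision space is
# convex, the mean of the weight vectors is itself a valid hypothesis,
# and by Jensen's inequality its loss is at most the average loss.
def average_hypothesis(weight_vectors):
    n = len(weight_vectors)
    dim = len(weight_vectors[0])
    return [sum(w[i] for w in weight_vectors) / n for i in range(dim)]
```

Following the remark above, the last hypothesis is excluded, so only the hypotheses generated before the final trial would be passed in.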

48 Here we just note that by applying this theorem to the Weighted Majority algorithm [16], we can prove a version of [5, Theorem 4] for the absolute loss without resorting to sophisticated concentration inequalities (details in the full paper). [sent-79, score-0.448]

49 4 Penalized risk estimation for general losses. If the loss function is nonconvex (such as the 0-1 loss) then the risk of the average hypothesis cannot be bounded in the way shown in the previous section. [sent-80, score-1.124]

50 However, the risk of the best hypothesis, among those generated by the on-line algorithm, cannot be higher than the average risk of the same hypotheses. [sent-81, score-0.648]

51 Hence, Lemma 1 immediately tells us that, under no conditions on the loss function other than boundedness, for most samples at least one of the hypotheses generated has risk close to the normalized cumulative loss. [sent-82, score-0.744]

52 In this section we give a technique (Lemma 4) that, using a penalized risk estimate, finds with high probability such a hypothesis. [sent-83, score-0.393]

53 Unlike Littlestone’s, our technique does not require a cross validation set. [sent-85, score-0.037]

54 Therefore we are able to obtain bounds on the risk whose main term is normalized by the size of the whole set of examples available to the learning algorithm. [sent-86, score-0.574]

55 Similar observations are made in [4], though the analysis there actually refers only to randomized hypotheses with 0-1 loss (namely, to absolute loss). [sent-89, score-0.404]

56 Let us define the penalized risk estimate of each hypothesis as its empirical loss on unseen data plus a penalty term. [sent-90, score-0.58]

57 Here the normalization is by the length of the suffix of the training sequence that the on-line algorithm had not seen yet when the hypothesis was generated, and the empirical term is the cumulative loss of the hypothesis on that suffix. [sent-93, score-0.427]

58 Our algorithm chooses the hypothesis minimizing the penalized risk estimate. [sent-94, score-0.279]
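The selection rule can be sketched as follows; the square-root penalty below is an illustrative stand-in, since the paper's exact penalty is a specific function of the unseen-suffix length and the confidence parameter, which we do not reproduce here.

```python
import math

# Penalized risk estimation sketch: hypothesis t is scored by its
# average loss on the suffix of examples it had not yet seen, plus a
# penalty that grows as that suffix shrinks. The penalty form is an
# illustrative stand-in, not the paper's exact expression.
def select_penalized(suffix_losses, suffix_lengths, delta):
    best, best_score = None, float("inf")
    for t, (loss, m) in enumerate(zip(suffix_losses, suffix_lengths)):
        if m == 0:
            continue  # the last hypothesis has no unseen suffix
        score = loss / m + math.sqrt(math.log(1.0 / delta) / (2.0 * m))
        if score < best_score:
            best, best_score = t, score
    return best
```

Early hypotheses are estimated on long suffixes (small penalty), late hypotheses on short ones (large penalty), which is exactly the trade-off the penalization resolves.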

59 For the sake of simplicity, we will restrict to losses with range [0, 1]. [sent-98, score-0.127]

60 However, it should be clear that losses taking values in an arbitrary bounded real interval can be handled using techniques similar to those shown in Section 3. [sent-99, score-0.243]

61 Theorem 3 Let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses when it is run on . [sent-101, score-0.579]

62 Then, for any confidence parameter, the hypothesis chosen using the penalized risk estimate satisfies a corresponding tail bound on its risk. [sent-102, score-0.626]

64 With suitable abbreviations set for brevity, we have: Lemma 4 Let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses when it is run on the training sequence. [sent-144, score-0.579]

65 Then for any confidence parameter the following holds; the proof of this theorem is based on the following two technical lemmas. [sent-145, score-0.068]

66 Lemma 5 Let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses when it is run on . [sent-146, score-0.579]

67 where the last inequality follows from the preceding bound. [sent-160, score-0.066]

68 The proof follows by combining Lemma 4 and Lemma 5, and by overapproximating the square root terms therein. [sent-164, score-0.031]

69 5 Applications. For the sake of concreteness we now sketch two generalization bounds which can be obtained through a direct application of our techniques. [sent-165, score-0.205]

70 The p-norm Perceptron algorithm [10, 9] is a linear threshold algorithm which maintains a weight vector from trial to trial. [sent-166, score-0.241]

71 On each instance, the algorithm predicts with the sign of a linear function of its current weights. [sent-167, score-0.129]

72 If the prediction is mistaken, then the algorithm performs the weight update. [sent-170, score-0.092]

73 On the other hand, a suitable choice of the norm yields an algorithm which performs like a multiplicative algorithm, such as the Normalized Winnow algorithm [10]. [sent-172, score-0.184]
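A generic multiplicative update in the spirit of Winnow can be sketched as follows; the learning rate, threshold, and normalization are illustrative choices and not the exact parameterization of the Normalized Winnow algorithm in [10].

```python
import math

# Generic multiplicative-update classifier: on each mistake the weights
# are scaled exponentially in y * x_i and renormalized to sum to one.
# eta (learning rate) and the 0.5 threshold are illustrative choices.
def multiplicative_update(examples, dim, eta=0.5):
    w = [1.0 / dim] * dim
    mistakes = 0
    for x, y in examples:                  # y in {-1, +1}, x_i in [0, 1]
        margin = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if margin >= 0.5 else -1
        if pred != y:
            w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
            z = sum(w)
            w = [wi / z for wi in w]       # renormalize
            mistakes += 1
    return w, mistakes
```

The multiplicative form of the update is what gives Winnow-style algorithms their mild dependence on the number of irrelevant features.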

74 Applying Theorem 3 to the bound on the number of mistakes for the p-norm Perceptron algorithm shown in [9], we immediately obtain that, with high probability with respect to the draw of the training sample, the risk of the penalized estimator is bounded by a margin-based quantity. [sent-173, score-0.712]

75 The margin-based quantity is called soft margin in [20] and accounts for the distribution of margin values achieved by the examples with respect to the hyperplane. [sent-182, score-0.13]

76 Bounds based on uniform convergence (e.g., [19]) are typically expressed in terms of the sample margin, i.e., [sent-185, score-0.101]

77 in terms of the fraction of training points whose margin is at most a given value. [sent-187, score-0.105]

78 We remark that bound (4) does not have the extra log factors appearing in the analyses based on uniform convergence. [sent-190, score-0.153]

79 Furthermore, it is significantly better than the bound in [20] whenever the soft margin quantity is constant, which typically occurs when the data sequence is not linearly separable. [sent-191, score-0.129]

80 As a second application, we consider the ridge regression algorithm [12] for the square loss. [sent-192, score-0.346]

81 At the beginning of each trial, this algorithm computes the vector which minimizes a regularized sum of squared errors over the examples seen so far. [sent-194, score-0.175]

82 On each instance the algorithm predicts with a clipped linear function, where the “clipping” function truncates values lying outside a fixed range to the nearest endpoint of that range. [sent-195, score-0.129]
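An on-line ridge regression sketch with clipped predictions, restricted to one-dimensional instances for brevity; the regularization constant and the clipping range are illustrative assumptions.

```python
# On-line ridge regression sketch with clipped predictions. At each
# trial the ridge solution on the examples seen so far gives a linear
# prediction, which is then clipped to the range [lo, hi]. For 1-d
# instances the minimizer of a*w^2 + sum (y_i - w*x_i)^2 has the
# closed form w = sxy / (a + sxx).
def clip(v, lo, hi):
    return max(lo, min(hi, v))

def online_ridge_1d(examples, a=1.0, lo=0.0, hi=1.0):
    sxx = sxy = 0.0                  # running sufficient statistics
    preds = []
    for x, y in examples:
        w = sxy / (a + sxx)          # ridge solution on past examples
        preds.append(clip(w * x, lo, hi))
        sxx += x * x
        sxy += x * y
    return preds
```

Because only the running sums are stored, each trial costs constant time, in keeping with the incremental character of on-line algorithms.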

83 for any confidence parameter, where the bound involves the determinant of a matrix built from the data, the identity matrix of matching dimension, and transposes of the data vectors. [sent-205, score-0.056]

84 Let us consider the matrix whose columns are the data vectors. [sent-206, score-0.028]

85 where the relevant quantities are the eigenvalues of this matrix. [sent-213, score-0.09]

86 These are the same as the nonzero eigenvalues of the Gram matrix; risk bounds in terms of the eigenvalues of the Gram matrix were also derived in [23], and we defer to the full paper a comparison between these results and ours. [sent-215, score-0.416]

87 Finally, our bound also applies to kernel ridge regression [18] by replacing the eigenvalues of the data matrix with the eigenvalues of the kernel Gram matrix, whose entries are given by the kernel being considered. [sent-216, score-0.523]

88 Relative loss bounds for on-line density estimation with the exponential family of distributions, Machine Learning, 43:211–246, 2001. [sent-230, score-0.306]

89 Using a slightly different linear regression algorithm, Forster and Warmuth [7] have proven a sharper bound on the expected relative loss. [sent-235, score-0.248]

90 In particular, they have exhibited an algorithm computing a hypothesis such that, in expectation over the sample, the relative risk is bounded. [sent-236, score-0.651]

91 Beating the hold-out: bounds for k-fold and progressive cross-validation. [sent-241, score-0.145]

92 Cristianini, On the generalization of soft margin algorithms, 2000. [sent-333, score-0.094]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ap', 0.328), ('risk', 0.308), ('qp', 0.248), ('hypotheses', 0.243), ('ffe', 0.233), ('gi', 0.208), ('littlestone', 0.203), ('hypothesis', 0.187), ('tail', 0.162), ('loss', 0.161), ('gd', 0.154), ('bounds', 0.145), ('ridge', 0.139), ('lemma', 0.135), ('fe', 0.124), ('cumulative', 0.097), ('losses', 0.096), ('aa', 0.093), ('algorithm', 0.092), ('sh', 0.092), ('bound', 0.092), ('eigenvalues', 0.09), ('vt', 0.09), ('perceptron', 0.09), ('gentile', 0.088), ('milan', 0.088), ('rgd', 0.088), ('penalized', 0.085), ('regression', 0.084), ('arbitrary', 0.083), ('colt', 0.077), ('run', 0.077), ('helmbold', 0.076), ('gram', 0.071), ('theorem', 0.068), ('concentration', 0.068), ('inequality', 0.066), ('margin', 0.065), ('bounded', 0.064), ('bramante', 0.058), ('conconi', 0.058), ('crema', 0.058), ('dti', 0.058), ('qap', 0.058), ('italy', 0.058), ('necessarily', 0.057), ('trial', 0.057), ('warmuth', 0.053), ('distinct', 0.052), ('ta', 0.051), ('forster', 0.051), ('angluin', 0.051), ('winnow', 0.051), ('baa', 0.051), ('jt', 0.046), ('ji', 0.045), ('convex', 0.043), ('sharper', 0.043), ('mistake', 0.041), ('training', 0.04), ('online', 0.04), ('meaningful', 0.04), ('algorithms', 0.039), ('concrete', 0.039), ('williamson', 0.039), ('empirical', 0.038), ('validation', 0.037), ('cb', 0.037), ('sequence', 0.037), ('classi', 0.037), ('instance', 0.037), ('yields', 0.037), ('ba', 0.036), ('sample', 0.036), ('nonzero', 0.035), ('let', 0.035), ('holds', 0.034), ('analyses', 0.033), ('label', 0.033), ('generated', 0.032), ('understood', 0.032), ('rd', 0.031), ('sake', 0.031), ('square', 0.031), ('cation', 0.03), ('draw', 0.03), ('prove', 0.03), ('freund', 0.029), ('proven', 0.029), ('inequalities', 0.029), ('obtain', 0.029), ('generalization', 0.029), ('matrix', 0.028), ('uniform', 0.028), ('fd', 0.028), ('majority', 0.028), ('obtaining', 0.027), ('output', 0.027), ('incremental', 0.026), ('beginning', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile

Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good data-dependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.

2 0.23972279 139 nips-2001-Online Learning with Kernels

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the ν-trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the ν-trick only applies to the ε-insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.

3 0.19932057 137 nips-2001-On the Convergence of Leveraging

Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth

Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.

4 0.17876296 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

Author: T. Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.

5 0.17081137 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

6 0.15991823 105 nips-2001-Kernel Machines and Boolean Functions

7 0.15672602 31 nips-2001-Algorithmic Luckiness

8 0.14455266 163 nips-2001-Risk Sensitive Particle Filters

9 0.12523869 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

10 0.11990483 1 nips-2001-(Not) Bounding the True Error

11 0.10743587 164 nips-2001-Sampling Techniques for Kernel Methods

12 0.09356717 136 nips-2001-On the Concentration of Spectral Properties

13 0.090986922 143 nips-2001-PAC Generalization Bounds for Co-training

14 0.089722671 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

15 0.082885541 119 nips-2001-Means, Correlations and Bounds

16 0.079295367 22 nips-2001-A kernel method for multi-labelled classification

17 0.078567147 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

18 0.076732062 114 nips-2001-Learning from Infinite Data in Finite Time

19 0.075034603 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

20 0.074401125 147 nips-2001-Pranking with Ranking


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.236), (1, 0.134), (2, 0.047), (3, 0.202), (4, 0.158), (5, -0.136), (6, 0.061), (7, 0.193), (8, -0.035), (9, -0.027), (10, -0.073), (11, 0.16), (12, 0.027), (13, 0.097), (14, -0.133), (15, 0.015), (16, 0.011), (17, 0.011), (18, -0.053), (19, 0.012), (20, 0.017), (21, -0.014), (22, 0.086), (23, 0.002), (24, -0.026), (25, -0.012), (26, -0.07), (27, 0.008), (28, 0.009), (29, -0.104), (30, 0.011), (31, 0.013), (32, 0.087), (33, 0.056), (34, -0.069), (35, 0.002), (36, -0.028), (37, 0.081), (38, 0.021), (39, -0.119), (40, 0.195), (41, 0.231), (42, 0.037), (43, 0.019), (44, -0.005), (45, 0.026), (46, 0.017), (47, -0.074), (48, -0.009), (49, 0.11)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96525043 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile

Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good data-dependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.

2 0.74880081 139 nips-2001-Online Learning with Kernels

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the ν-trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the ν-trick only applies to the ε-insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.

3 0.73002797 31 nips-2001-Algorithmic Luckiness

Author: Ralf Herbrich, Robert C. Williamson

Abstract: In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1

4 0.65168411 137 nips-2001-On the Convergence of Leveraging

Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth

Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.

5 0.62708467 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

Author: Roni Khardon, Dan Roth, Rocco A. Servedio

Abstract: We study online learning in Boolean domains using kernels which capture feature expansions equivalent to using conjunctions over basic features. We demonstrate a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the resulting classifier. We first describe several kernel functions which capture either limited forms of conjunctions or all conjunctions. We show that these kernels can be used to efficiently run the Perceptron algorithm over an exponential number of conjunctions; however we also prove that using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions. We also consider an analogous use of kernel functions to run the multiplicative-update Winnow algorithm over an expanded feature space of exponentially many conjunctions. While known upper bounds imply that Winnow can learn DNF formulae with a polynomial mistake bound in this setting, we prove that it is computationally hard to simulate Winnow’s behavior for learning DNF over such a feature set, and thus that such kernel functions for Winnow are not efficiently computable.

6 0.59635973 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

7 0.54826653 105 nips-2001-Kernel Machines and Boolean Functions

8 0.51845145 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks

9 0.51514578 8 nips-2001-A General Greedy Approximation Algorithm with Applications

10 0.48743692 114 nips-2001-Learning from Infinite Data in Finite Time

11 0.37143856 1 nips-2001-(Not) Bounding the True Error

12 0.37100953 164 nips-2001-Sampling Techniques for Kernel Methods

13 0.36925924 136 nips-2001-On the Concentration of Spectral Properties

14 0.35768482 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

15 0.3399581 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

16 0.329487 163 nips-2001-Risk Sensitive Particle Filters

17 0.32707936 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

18 0.32693195 143 nips-2001-PAC Generalization Bounds for Co-training

19 0.32309353 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

20 0.31948194 22 nips-2001-A kernel method for multi-labelled classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.237), (14, 0.088), (17, 0.042), (19, 0.045), (27, 0.124), (30, 0.072), (36, 0.016), (38, 0.021), (59, 0.028), (70, 0.018), (72, 0.065), (79, 0.032), (83, 0.022), (91, 0.109)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84533197 71 nips-2001-Estimating the Reliability of ICA Projections

Author: Frank C. Meinecke, Andreas Ziehe, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: When applying unsupervised learning techniques like ICA or temporal decorrelation, a key question is whether the discovered projections are reliable. In other words: can we give error bars or can we assess the quality of our separation? We use resampling methods to tackle these questions and show experimentally that our proposed variance estimations are strongly correlated to the separation error. We demonstrate that this reliability estimation can be used to choose the appropriate ICA-model, to enhance significantly the separation performance, and, most important, to mark the components that have a actual physical meaning. Application to 49-channel-data from an magneto encephalography (MEG) experiment underlines the usefulness of our approach. 1

same-paper 2 0.84422052 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile

Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good data-dependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.

3 0.66344559 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

4 0.6565069 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

5 0.65434849 134 nips-2001-On Kernel-Target Alignment

Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola

Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1

6 0.65294325 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

7 0.65048891 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

8 0.64973557 137 nips-2001-On the Convergence of Leveraging

9 0.64405334 155 nips-2001-Quantizing Density Estimators

10 0.64403248 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules

11 0.64204818 58 nips-2001-Covariance Kernels from Bayesian Generative Models

12 0.6399619 60 nips-2001-Discriminative Direction for Kernel Classifiers

13 0.63914037 56 nips-2001-Convolution Kernels for Natural Language

14 0.63806075 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

15 0.637725 31 nips-2001-Algorithmic Luckiness

16 0.63749278 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

17 0.63724625 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

18 0.6369592 74 nips-2001-Face Recognition Using Kernel Methods

19 0.63662595 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

20 0.63561666 22 nips-2001-A kernel method for multi-labelled classification