nips nips2001 nips2001-139 knowledge-graph by maker-knowledge-mining

139 nips-2001-Online Learning with Kernels


Source: pdf

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the ν-trick allows us to give a robust parameterization. Moreover, unlike in batch learning, where the ν-trick only applies to the ε-insensitive loss function, we are able to derive general trimmed-mean types of estimators such as for Huber's robust loss.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Williamson, Research School of Information Sciences and Engineering, Australian National University, Canberra, ACT 0200. Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. [sent-3, score-0.348]

2 Our method is computationally efficient and leads to simple algorithms. [sent-4, score-0.08]

3 In particular we derive update equations for classification, regression, and novelty detection. [sent-5, score-0.415]

4 The inclusion of the ν-trick allows us to give a robust parameterization. [sent-6, score-0.084]

5 Moreover, unlike in batch learning, where the ν-trick only applies to the ε-insensitive loss function, we are able to derive general trimmed-mean types of estimators such as for Huber's robust loss. [sent-7, score-0.637]

6 1 Introduction. While kernel methods have proven to be successful in many batch settings (Support Vector Machines, Gaussian Processes, Regularization Networks), the extension to online methods has proven to provide some unsolved challenges. [sent-8, score-0.815]

7 Firstly, the standard online settings for linear methods are in danger of overfitting, when applied to an estimator using a feature space method. [sent-9, score-0.449]

8 This calls for regularization (or prior probabilities in function space if the Gaussian Process view is taken). [sent-10, score-0.126]

9 Secondly, the functional representation of the estimator becomes more complex as the number of observations increases. [sent-11, score-0.192]

10 More specifically, the Representer Theorem [10] implies that the number of kernel functions can grow up to linearly with the number of observations. [sent-12, score-0.236]

11 Depending on the loss function used [15], this will happen in practice in most cases. [sent-13, score-0.315]

12 Thereby the complexity of the estimator used in prediction increases linearly over time (in some restricted situations this can be reduced to logarithmic cost [8]). [sent-14, score-0.195]

13 Finally, training time of batch and/or incremental update algorithms typically increases superlinearly with the number of observations. [sent-15, score-0.387]

14 Incremental update algorithms [2] attempt to overcome this problem but cannot guarantee a bound on the number of operations required per iteration. [sent-16, score-0.227]

15 Projection methods [3], on the other hand, will ensure a limited number of updates per iteration. [sent-17, score-0.055]

16 The size of the matrix is given by the number of kernel functions required at each step. [sent-19, score-0.13]

17 Recently several algorithms have been proposed [5, 8, 6, 12] performing perceptron-like updates for classification at each step. [sent-20, score-0.092]

18 Some algorithms work only in the noise-free case, others do not handle moving targets, and yet again others assume an upper bound on the complexity of the estimators. [sent-21, score-0.094]

19 In the present paper we present a simple method which allows the use of kernel estimators for classification, regression, and novelty detection and which copes with a large number of kernel functions efficiently. [sent-22, score-0.693]

20 This means that there exists a kernel k and a dot product such that 1) ⟨f, k(x,·)⟩ = f(x) (reproducing property); 2) H is the closure of the span of all k(x,·) with x ∈ X. [sent-24, score-0.13]

21 In other words, all f ∈ H are linear combinations of kernel functions. [sent-25, score-0.13]
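
To make this representation concrete, here is a minimal Python sketch (illustrative only; the Gaussian kernel and all names are assumptions, not taken from the paper) of evaluating a hypothesis stored as a kernel expansion f(x) = Σ_i α_i k(x_i, x):

    import numpy as np

    def gaussian_kernel(x1, x2, sigma=1.0):
        # k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)); any positive definite
        # kernel would do, the Gaussian is just an illustrative choice.
        x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
        return float(np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2)))

    def evaluate(alphas, centers, x, kernel=gaussian_kernel):
        # f(x) = sum_i alpha_i * k(x_i, x): every hypothesis the algorithm
        # produces is a linear combination of kernel functions centred at
        # previously seen patterns.
        return sum(a * kernel(c, x) for a, c in zip(alphas, centers))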

22 Typically the squared RKHS norm ½ ||f||² is used as a regularization functional. [sent-27, score-0.126]

23 To state our algorithm we need to compute derivatives of functionals defined on H. [sent-29, score-0.101]

24 For the evaluation functional f ↦ f(x) we compute the derivative by using the reproducing property of H and obtain ∂_f f(x) = k(x, ·). [sent-32, score-0.269]

25 Regularized Risk Functionals and Learning. In the standard learning setting we are supplied with pairs of observations (x, y) drawn according to some underlying distribution P. [sent-35, score-0.152]

26 Our aim is to predict the likely outcome y at location x. [sent-36, score-0.056]

27 Several variants are possible: (i) the distribution P may change over time, (ii) the training sample may be the next observation on which to predict, which leads to a true online setting, or (iii) we may want to find an algorithm which approximately minimizes a regularized risk functional on a given training set. [sent-37, score-0.709]

28 We assume that we want to minimize a loss function l(f(x), y) which penalizes the deviation between an observation y at location x and the prediction f(x), based on x. [sent-38, score-0.422]

29 Since P is unknown, a standard approach is to instead minimize the empirical risk over the observations, R_emp[f] := (1/m) Σ_{t=1}^{m} l(f(x_t), y_t) (1). [sent-39, score-0.353]

30 Or, in order to avoid overly complex hypotheses, minimize the empirical risk plus an additional regularization term λ Ω[f]. [sent-40, score-0.371]

31 This sum is known as the regularized risk, R_reg[f] := R_emp[f] + λ Ω[f] for λ > 0 (2). Common loss functions are the soft margin loss function [1] or the logistic loss for classification and novelty detection [14], the quadratic loss, absolute loss, Huber's robust loss [9], or the ε-insensitive loss [16] for regression. [sent-41, score-2.489]

32 In some cases the loss function depends on an additional parameter such as the width of the margin ρ or the size of the ε-insensitive zone. [sent-43, score-0.522]

33 One may make these variables themselves parameters of the optimization problem [15] in order to make the loss function adaptive to the amount or type of noise present in the data. [sent-44, score-0.359]

34 Stochastic Approximation. In order to find a good estimator we would like to minimize the regularized risk R_reg[f]. [sent-46, score-0.189]

35 This can be costly if the number of observations is large. [sent-47, score-0.057]

36 Recently several gradient descent algorithms for minimizing such functionals efficiently have been proposed [13, 7]. [sent-48, score-0.296]

37 Below we extend these methods to stochastic gradient descent by approximating the regularized risk by its instantaneous counterpart evaluated on the current example. [sent-49, score-0.158]

38 Consequently the gradient of the instantaneous regularized risk with respect to f is given by (4); the last equality there uses the reproducing property and holds whenever the loss admits a derivative at the current prediction. [sent-53, score-0.066]

39 The update equations are hence straightforward, f_{t+1} = f_t − η (l'(f_t(x_t), y_t) k(x_t, ·) + λ f_t) (5). Here η is the learning rate controlling the size of updates undertaken at each iteration. [sent-55, score-0.355]

40 For this purpose we have to express f as a kernel expansion f(x) = Σ_t α_t k(x_t, x) (7), where the x_t are (previously seen) training patterns. [sent-59, score-0.242]

41 Equation (8) means that at each iteration the kernel expansion may grow by one term. [sent-61, score-0.263]

42 Furthermore, the cost for training at each step is not larger than the prediction cost: once we have computed f(x_t), the new coefficient α_t is obtained from the value of the derivative of the loss at f(x_t). [sent-62, score-0.176]

43 This is particularly useful if the derivatives of the loss function only assume discrete values, as is the case when using soft-margin type loss functions (see Section 3). [sent-64, score-0.63]
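
As a rough sketch of the update scheme just described (an illustration under assumptions, not the paper's code: it reuses evaluate and gaussian_kernel from the sketch earlier and assumes the per-example objective is the loss plus (λ/2)||f||², so existing coefficients shrink by (1 − ηλ) and one new coefficient, given by the negative scaled loss derivative, is appended):

    def online_kernel_step(alphas, centers, x_t, y_t, loss_derivative,
                           eta=0.1, lam=0.01, kernel=gaussian_kernel):
        # One stochastic gradient step on l(f(x_t), y_t) + (lam/2)*||f||^2.
        # Predicting with the current expansion is the expensive part.
        f_xt = evaluate(alphas, centers, x_t, kernel)
        # The regularizer shrinks every existing coefficient.
        alphas = [(1.0 - eta * lam) * a for a in alphas]
        # Appending one term only needs the loss derivative at f(x_t), so the
        # per-step training cost is no larger than the prediction cost.
        alphas.append(-eta * loss_derivative(f_xt, y_t))
        centers.append(x_t)
        return alphas, centers, f_xt

With, for instance, loss_derivative = lambda f, y: f - y this yields the squared-loss update.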

44 Truncation. The problem with (8) and (10) is that, without any further measures, the number of basis functions will grow without bound. [sent-65, score-0.064]

45 This is not desirable since the size of the expansion determines the amount of computation needed for prediction. [sent-66, score-0.07]

46 Furthermore, the total truncation error incurred by dropping all terms which are at least τ steps old is bounded as in (11), since those coefficients have been shrunk geometrically by the factor (1 − ηλ) at every step. [sent-71, score-0.228]

47 The regularization parameter λ can thus be used to control the storage requirements for the expansion. [sent-75, score-0.166]

48 In addition, it naturally allows for distributions that change over time, in which case it is desirable to forget instances that are much older than the average time scale of the distribution change [11]. [sent-76, score-0.07]
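
A minimal sketch of such a truncation step (the window length tau and the names are assumptions made for illustration):

    def truncate(alphas, centers, tau=500):
        # Keep only the tau most recent expansion terms. Coefficients older
        # than tau steps have been shrunk by (1 - eta*lam) that many times,
        # so the error incurred by dropping them stays small and bounded.
        return alphas[-tau:], centers[-tau:]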

49 Classification. A typical loss function in SVMs is the soft margin loss, given by l(f(x), y) = max(0, ρ − y f(x)). [sent-82, score-0.367]

50 In this situation the update equations (12) become: α_t = η y_t if y_t f_t(x_t) < ρ and α_t = 0 otherwise, while all previous coefficients are scaled by (1 − ηλ). [sent-83, score-0.415]

51 In classification with the ν-trick we avoid having to fix the margin ρ by treating it as a variable [15]. [sent-86, score-0.211]

52 The value of ρ is found automatically by using the loss function l(f(x), y, ρ) = max(0, ρ − y f(x)) − ν ρ (13), where ν is another parameter. [sent-87, score-0.315]

53 The resulting update equations distinguish the same two cases: on a margin error (y_t f_t(x_t) < ρ) the coefficient is α_t = η y_t and ρ is decreased by η(1 − ν); otherwise α_t = 0 and ρ is increased by η ν. [sent-93, score-0.131]
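
A hedged sketch of the two classification cases just described (derived from the stated losses and plain gradient descent; the parameter values are illustrative assumptions):

    def soft_margin_derivative(f_x, y, rho=1.0):
        # d/df(x) of max(0, rho - y*f(x)): -y on a margin error, 0 otherwise,
        # so the new coefficient -eta * derivative equals eta*y or 0.
        return -y if y * f_x < rho else 0.0

    def nu_margin_update(f_x, y, rho, eta=0.1, nu=0.2):
        # nu-trick: rho is itself a variable of the loss
        # max(0, rho - y*f(x)) - nu*rho. Returns (d loss / d f(x), new rho).
        if y * f_x < rho:
            return -y, rho - eta * (1.0 - nu)
        return 0.0, rho + eta * nu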

54 From (15) we obtain the corresponding updates. Novelty Detection. The results for novelty detection [14] are similar in spirit. [sent-97, score-0.394]

55 The setting is most useful here, particularly where the estimator acts as a warning device (e. [sent-98, score-0.14]

56 The relevant loss function is max(0, ρ − f(x)) − ν ρ, and usually [14] one restricts the admissible functions in order to avoid trivial solutions. [sent-101, score-0.365]

57 The update equations are given in (16). [sent-102, score-0.191]

58 Considering the update of ρ we can see that on average only a fraction ν of the observations will be considered for updates. [sent-104, score-0.371]

59 Thus we only have to store a small fraction of the observations x_t. [sent-105, score-0.091]
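
A sketch of the corresponding unsupervised update (again derived from the stated loss rather than copied from the paper; nu and eta are illustrative):

    def novelty_update(f_x, rho, eta=0.1, nu=0.05):
        # Loss max(0, rho - f(x)) - nu*rho: only patterns falling below the
        # current threshold rho enter the kernel expansion, which happens for
        # roughly a fraction nu of the data. Returns (d loss/d f(x), new rho).
        if f_x < rho:
            return -1.0, rho - eta * (1.0 - nu)
        return 0.0, rho + eta * nu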

60 Regression. We consider the following four settings: squared loss, the ε-insensitive loss using the ν-trick, Huber's robust loss function, and trimmed mean estimators. [sent-106, score-0.714]

61 We begin with squared loss, where l(f(x), y) = ½ (y − f(x))². Consequently the update equation is (17). This means that we have to store every observation we make, or more precisely, the prediction error we made on the observation. [sent-109, score-0.556]

62 The ε-insensitive loss avoids this problem but introduces a new parameter in turn, the width ε of the insensitivity zone. [sent-110, score-0.557]

63 By making ε a variable of the optimization problem we have the loss max(0, |y − f(x)| − ε) + ν ε. The update equations now have to be stated in terms of α_t and ε, which is allowed to change during the optimization process. [sent-111, score-0.279]

64 (18) This means that every time the prediction error exceeds ε, we increase the insensitivity zone by η(1 − ν). [sent-113, score-0.263]

65 Likewise, if it is smaller than ε, the insensitivity zone is decreased by η ν. [sent-114, score-0.111]
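
A sketch of the adaptive-ε regression step implied by the loss above (the sign convention and the clamp at zero are assumptions made for illustration):

    def eps_insensitive_update(f_x, y, eps, eta=0.1, nu=0.2):
        # nu-trick regression with loss max(0, |y - f(x)| - eps) + nu*eps.
        # Returns (d loss / d f(x), new eps) after one gradient step on eps.
        residual = y - f_x
        if abs(residual) > eps:
            # The prediction error exceeds eps: widen the insensitivity zone.
            return (-1.0 if residual > 0 else 1.0), eps + eta * (1.0 - nu)
        # Otherwise narrow it (clamped at zero as a safeguard).
        return 0.0, max(0.0, eps - eta * nu)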

66 Next let us analyze the case of regression with Huber’s robust loss. [sent-115, score-0.151]

67 Huber's robust loss (19) is defined by a case distinction: it is quadratic in the residual y − f(x) for residuals up to size σ and linear beyond that. [sent-117, score-0.187]

68 [Equation (19), continued: the quadratic and linear branches of the loss.] [sent-118, score-0.131]

69 As before we obtain the update equations (20) by computing the derivative of the loss with respect to f, with one case for small residuals and one otherwise. [sent-119, score-0.299]
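
A sketch of the loss derivative behind (20), using one common parameterization of Huber's loss (the exact scaling in the paper may differ):

    def huber_derivative(f_x, y, sigma=1.0):
        # Quadratic branch (y - f(x))^2 / (2*sigma) for |y - f(x)| <= sigma,
        # linear branch |y - f(x)| - sigma/2 beyond that; the derivative with
        # respect to f(x) is the clipped, rescaled residual below.
        residual = y - f_x
        if abs(residual) <= sigma:
            return -residual / sigma
        return -1.0 if residual > 0 else 1.0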

70 Comparing (20) with (18) leads to the question whether σ might not also be adjusted adaptively. [sent-122, score-0.078]

71 This is a desirable goal since we may not know the amount of noise present in the data. [sent-123, score-0.07]

72 While the ν-setting allowed us to form such adaptive estimators for batch learning with the ε-insensitive loss, this goal has proven elusive for other estimators in the standard batch setting. [sent-124, score-0.499]

73 In the online situation, however, such an extension is quite natural (see also [4]). [sent-125, score-0.348]

74 [The corresponding update equations for the adaptive version of Huber's robust loss.] [sent-128, score-0.131]

75 4 Theoretical Analysis. Consider now the classification problem with the soft margin loss max(0, ρ − y f(x)); here ρ is a fixed margin parameter. [sent-129, score-0.689]

76 Let f_t denote the hypothesis of the online algorithm after seeing the first t observations. [sent-130, score-0.364]

77 Thus, at time t, the algorithm receives an input x_t, makes its prediction f_t(x_t), receives the correct outcome y_t, and updates its hypothesis into f_{t+1} according to (5). [sent-131, score-0.235]

78 Then it would be desirable for the online hypotheses f_t to converge towards g := argmin_f R_reg[f]. [sent-135, score-0.434]

79 If we can show that the cumulative risk of the online algorithm is asymptotically comparable to that of g, we see that at least in some sense f_t does converge to g. [sent-136, score-0.256]

80 Hence, as a first step in our convergence analysis, we obtain an upper bound for the cumulative risk. [sent-137, score-0.2]

81 Then for any g ∈ H we have the bound (22). Notice that the bound does not depend on any probabilistic assumptions. [sent-141, score-0.057]

82 If the example sequence is such that some fixed predictor has a small cumulative risk, then the cumulative risk of the online algorithm will also be small. [sent-142, score-0.702]

83 There is a slight catch here in that the learning rate must be chosen a priori, and the optimal setting depends on the number of examples. [sent-143, score-0.163]

84 The longer the sequence of examples, the smaller the learning rate we want. [sent-144, score-0.147]

85 We can avoid this by using a learning rate that starts from a fairly large value and decreases as learning progresses. [sent-145, score-0.2]

86 This leads to a bound similar to Theorem 1 but with somewhat worse constant coefficients. [sent-146, score-0.097]
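
A sketch of such a decreasing schedule (the 1/sqrt(t) form and the starting value are common choices, not the paper's exact constants):

    def decreasing_rate(t, eta0=0.5):
        # Learning rate for step t = 1, 2, ...: start from a fairly large
        # value and shrink as learning progresses.
        return eta0 / (t ** 0.5)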

87 Theorem 2. Let (x_t, y_t) be an example sequence such that the stated boundedness condition holds, and use at update t a decreasing learning rate η_t; then for any such g we have the bound (23). [sent-149, score-0.28]

88 Here f_t is the t-th online hypothesis based on an example sequence that is drawn i.i.d. [sent-155, score-0.402]

89 Then for any such g we have the bound (24). If we know in advance how many examples we are going to draw, we can use a fixed learning rate as in Theorem 1 and obtain somewhat better constants. [sent-160, score-0.151]

90 5 Experiments and Discussion. In our experiments we studied the performance of online ν-SVM algorithms in various settings. [sent-161, score-0.344]

91 Due to space constraints we only report the findings in novelty detection as given in Figure 1 (where the training algorithm was fed the patterns sans class labels). [sent-163, score-0.462]

92 Already after one pass through the USPS database (5000 training patterns, 2000 test patterns, each of them of size 16 × 16 pixels), which took less than 15s in MATLAB on a 433MHz Celeron, the results can be used for weeding out badly written digits. [sent-164, score-0.083]

93 The ν-setting was used; based on the theoretical analysis of Section 4 we used a decreasing learning rate. [sent-166, score-0.163]

94 "`  h # • m† m 3   Conclusion We have presented a range of simple online kernel-based algorithms for a variety of standard machine learning tasks. [sent-168, score-0.385]

95 The algorithms have constant memory requirements and are computationally cheap at each update step. [sent-169, score-0.25]

96 They allow the ready application of powerful kernel based methods such as novelty detection to online and time-varying problems. [sent-170, score-0.789]

97 The learning rate was adjusted as a decreasing function of t, where t is the number of iterations. [sent-173, score-0.147]

98 Top: the first 50 patterns which incurred a margin error; bottom left: the 50 worst patterns according to the estimator on the training set; bottom right: the 50 worst patterns on an unseen test set. [sent-174, score-0.564]

99 Figure 1: Online novelty detection on the USPS dataset. [sent-175, score-0.352]

100 Support vector method for function approximation, regression estimation, and signal processing. [sent-333, score-0.067]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('loss', 0.315), ('online', 0.307), ('novelty', 0.224), ('ec', 0.177), ('margin', 0.161), ('risk', 0.155), ('update', 0.133), ('xx', 0.131), ('kernel', 0.13), ('detection', 0.128), ('regularization', 0.126), ('huber', 0.125), ('truncation', 0.118), ('classi', 0.118), ('batch', 0.116), ('reproducing', 0.112), ('zone', 0.111), ('kg', 0.101), ('functionals', 0.101), ('cumulative', 0.101), ('fg', 0.099), ('smola', 0.098), ('usps', 0.094), ('descent', 0.092), ('estimator', 0.086), ('insensitivity', 0.085), ('robust', 0.084), ('estimators', 0.081), ('cation', 0.079), ('tresp', 0.073), ('regularized', 0.072), ('leen', 0.072), ('editors', 0.072), ('desirable', 0.07), ('theorem', 0.07), ('expansion', 0.069), ('rate', 0.068), ('prediction', 0.067), ('patterns', 0.067), ('regression', 0.067), ('dietterich', 0.066), ('derivative', 0.066), ('gradient', 0.066), ('fix', 0.064), ('grow', 0.064), ('proven', 0.064), ('dropping', 0.063), ('incurred', 0.063), ('vo', 0.063), ('vu', 0.063), ('lkopf', 0.062), ('coef', 0.062), ('sch', 0.059), ('bl', 0.059), ('kivinen', 0.059), ('equations', 0.058), ('incremental', 0.058), ('hypothesis', 0.057), ('bound', 0.057), ('observations', 0.057), ('settings', 0.056), ('outcome', 0.056), ('williamson', 0.056), ('dec', 0.056), ('hh', 0.056), ('pages', 0.055), ('updates', 0.055), ('setting', 0.054), ('ff', 0.054), ('xed', 0.052), ('soft', 0.052), ('consequently', 0.051), ('advances', 0.051), ('fraction', 0.05), ('avoid', 0.05), ('functional', 0.049), ('bartlett', 0.049), ('worst', 0.048), ('bounded', 0.047), ('hilbert', 0.046), ('width', 0.046), ('optimization', 0.044), ('training', 0.043), ('obtain', 0.042), ('linearly', 0.042), ('store', 0.041), ('learning', 0.041), ('mit', 0.041), ('extension', 0.041), ('minimize', 0.04), ('pass', 0.04), ('requirements', 0.04), ('computationally', 0.04), ('leads', 0.04), ('logistic', 0.038), ('sequence', 0.038), ('adjusted', 0.038), ('algorithms', 0.037), ('situation', 0.037), ('unsolved', 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 139 nips-2001-Online Learning with Kernels

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the ν-trick allows us to give a robust parameterization. Moreover, unlike in batch learning, where the ν-trick only applies to the ε-insensitive loss function, we are able to derive general trimmed-mean types of estimators such as for Huber's robust loss.

2 0.23972279 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile

Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.

3 0.22489342 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

Author: T. Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.

4 0.22177128 105 nips-2001-Kernel Machines and Boolean Functions

Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson

Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

5 0.2216938 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

6 0.21902612 137 nips-2001-On the Convergence of Leveraging

7 0.15652265 22 nips-2001-A kernel method for multi-labelled classification

8 0.15462719 164 nips-2001-Sampling Techniques for Kernel Methods

9 0.14347447 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

10 0.13765864 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

11 0.13203888 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

12 0.12287199 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

13 0.11996999 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

14 0.11806085 31 nips-2001-Algorithmic Luckiness

15 0.11713091 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

16 0.11549758 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

17 0.11188325 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

18 0.11165699 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models

19 0.10998922 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

20 0.10998768 144 nips-2001-Partially labeled classification with Markov random walks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.359), (1, 0.174), (2, 0.023), (3, 0.273), (4, 0.119), (5, -0.054), (6, 0.079), (7, 0.132), (8, 0.009), (9, -0.026), (10, -0.059), (11, 0.081), (12, -0.016), (13, 0.043), (14, -0.082), (15, 0.014), (16, 0.024), (17, 0.115), (18, 0.019), (19, 0.037), (20, 0.135), (21, -0.01), (22, 0.041), (23, -0.016), (24, -0.014), (25, 0.01), (26, -0.066), (27, -0.004), (28, 0.007), (29, -0.04), (30, -0.012), (31, -0.008), (32, -0.001), (33, 0.14), (34, -0.014), (35, -0.044), (36, 0.02), (37, 0.046), (38, -0.008), (39, -0.039), (40, 0.054), (41, 0.119), (42, 0.039), (43, 0.095), (44, 0.097), (45, -0.02), (46, 0.016), (47, -0.132), (48, 0.036), (49, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9725613 139 nips-2001-Online Learning with Kernels

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.     ¡

2 0.85251349 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile

Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.

3 0.78021336 137 nips-2001-On the Convergence of Leveraging

Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth

Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.

4 0.6741271 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

5 0.65895283 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

Author: T. Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.

6 0.64721072 105 nips-2001-Kernel Machines and Boolean Functions

7 0.63479316 31 nips-2001-Algorithmic Luckiness

8 0.63259333 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

9 0.61491209 22 nips-2001-A kernel method for multi-labelled classification

10 0.58026028 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

11 0.55622566 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks

12 0.54559529 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

13 0.52872241 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

14 0.51430899 164 nips-2001-Sampling Techniques for Kernel Methods

15 0.476895 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA

16 0.46699366 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

17 0.46401563 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

18 0.46377429 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

19 0.45903692 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

20 0.4575437 147 nips-2001-Pranking with Ranking


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.421), (17, 0.036), (19, 0.03), (27, 0.129), (30, 0.061), (38, 0.013), (59, 0.018), (72, 0.061), (79, 0.023), (83, 0.029), (91, 0.123)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98343581 49 nips-2001-Citcuits for VLSI Implementation of Temporally Asymmetric Hebbian Learning

Author: A. Bofill, D. P. Thompson, Alan F. Murray

Abstract: Experimental data has shown that synaptic strength modification in some types of biological neurons depends upon precise spike timing differences between presynaptic and postsynaptic spikes. Several temporally-asymmetric Hebbian learning rules motivated by this data have been proposed. We argue that such learning rules are suitable to analog VLSI implementation. We describe an easily tunable circuit to modify the weight of a silicon spiking neuron according to those learning rules. Test results from the fabrication of the circuit using a O.6J.lm CMOS process are given. 1

same-paper 2 0.928334 139 nips-2001-Online Learning with Kernels

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.     ¡

3 0.92519081 170 nips-2001-Spectral Kernel Methods for Clustering

Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola

Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1

4 0.8403424 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

Author: Manfred Opper, Robert Urbanczik

Abstract: Using methods of Statistical Physics, we investigate the rOle of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1

5 0.83680332 134 nips-2001-On Kernel-Target Alignment

Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola

Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1

6 0.67667985 137 nips-2001-On the Convergence of Leveraging

7 0.65152973 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

8 0.64104879 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

9 0.63716125 58 nips-2001-Covariance Kernels from Bayesian Generative Models

10 0.63703275 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules

11 0.63305235 112 nips-2001-Learning Spike-Based Correlations and Conditional Probabilities in Silicon

12 0.63099808 8 nips-2001-A General Greedy Approximation Algorithm with Applications

13 0.6183641 105 nips-2001-Kernel Machines and Boolean Functions

14 0.6173346 22 nips-2001-A kernel method for multi-labelled classification

15 0.61597323 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

16 0.61500502 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines

17 0.60933328 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

18 0.60618204 74 nips-2001-Face Recognition Using Kernel Methods

19 0.60465789 166 nips-2001-Self-regulation Mechanism of Temporally Asymmetric Hebbian Plasticity

20 0.5968309 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines