nips nips2001 nips2001-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ralf Herbrich, Robert C. Williamson
Abstract: In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1
Reference: text
sentIndex sentText sentNum sentScore
1 au Abstract In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. [sent-6, score-0.409]
2 Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. [sent-7, score-0.816]
3 The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. [sent-8, score-0.441]
4 We show how the resulting framework relates to the VC, luckiness and compression frameworks. [sent-9, score-0.832]
5 Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. [sent-10, score-0.777]
6 1 Introduction Statistical learning theory is mainly concerned with the study of uniform bounds on the expected error of hypotheses from a given hypothesis space [9, 1]. [sent-11, score-0.368]
7 Such bounds have the appealing feature that they provide performance guarantees for classifiers found by any learning algorithm. [sent-12, score-0.256]
8 However, it has been observed that these bounds tend to be overly pessimistic. [sent-13, score-0.071]
9 One explanation is that only in the case of learning algorithms which minimise the training error it has been proven that uniformity of the bounds is equivalent to studying the learning algorithm's generalisation performance directly. [sent-14, score-0.493]
10 In this paper we present a theoretical framework which aims at directly studying the generalisation error of a learning algorithm rather than taking the detour via the uniform convergence of training errors to expected errors in a given hypothesis space. [sent-15, score-0.689]
11 In addition, our new model of learning allows the exploitation of the fact that we serendipitously observe a training sample which is easy to learn by a given learning algorithm. [sent-16, score-0.272]
12 In that sense, our framework is a descendant of the luckiness framework of Shawe-Taylor et al. [sent-17, score-0.798]
13 In the present case, the luckiness is a function of a given learning algorithm and a given training sample and characterises the diversity of the algorithms solutions. [sent-19, score-0.892]
14 The notion of luckiness allows us to study given learning algorithms at many different perspectives. [sent-20, score-0.709]
15 For example, the maximum margin algorithm [9] can either been studied via the number of dimensions in feature space, the margin of the classifier learned or the sparsity of the resulting classifier. [sent-21, score-0.493]
16 Our main results are two generalisation error bounds for learning algorithms: one for the zero training error scenario and one agnostic bound (Section 2). [sent-22, score-0.505]
17 We shall demonstrate the usefulness of our new framework by studying its relation to the VC framework, the original luckiness framework and the compression framework of Littlestone and Warmuth [6] (Section 3). [sent-23, score-1.086]
18 Finally, we present an application of the new framework to the maximum margin algorithm for linear classifiers (Section 4). [sent-24, score-0.419]
19 ) over the random draw of its argument X and the indicator function, respectively. [sent-37, score-0.118]
20 The shorthand notation Z(oo) := U;;'=l zm denotes the union of all m- fold Cartesian products of the set Z. [sent-38, score-0.245]
21 , m }m as the set of all permutations of the numbers 1, . [sent-42, score-0.112]
22 Given a 2m- vector i E hm and a sample z E z2m we define Wi : {I, . [sent-52, score-0.107]
23 , Z7ri(2m))' 2 Algorithmic Luckiness Suppose we are given a training sample z = (x, y) E (X x y)m = zm of size mEN independently drawn (iid) from some unknown but fixed distribution P Xy = P z together with a learning algorithm A : Z( 00) -+ yX . [sent-61, score-0.481]
24 For a predefined loss l : y x y -+ [0,1] we would like to investigate the generalisation error Gl [A, z] := Rl [A (z)] - infhEYx Rl [h] of the algorithm where the expected error Rl [h] of his defined by Rl [h] := Exy [l (h (X) ,Y)] . [sent-62, score-0.39]
25 Since infhEYx Rl [h] (which is also known as the Bayes error) is independent of A it suffices to bound Rl [A (z)]. [sent-63, score-0.178]
26 Although we know that for any fixed hypothesis h the training error ~ 1 Rdh,z]:=~ L l(h(xi),Yi) (X i ,Yi) E z is with high probability (over the random draw of the training sample z E Z(oo)) close to Rl [h], this might no longer be true for the random hypothesis A (z). [sent-64, score-0.653]
27 Hence we would like to state that with only small probability (at most 8) , the expected error Rl [A (z)] is larger than the training error HI [A (z), z] plus some sample and algorithm dependent complexity c (A, z, 8), Pzm (Rl [A (Z)] > HI [A (Z), Z] + c (A, Z,8)) < 8. [sent-65, score-0.299]
28 (1) In order to derive such a bound we utilise a modified version of the basic lemma of Vapnik and Chervonenkis [10]. [sent-66, score-0.162]
29 If we now condition on the first m examples, A (Z[l:mj) is fixed and therefore by an application of Hoeffding's inequality (see, e. [sent-79, score-0.057]
30 [1]) and since m€2 > 2 the additional event Q has probability of at least ~ over the random draw of (Zm+1, . [sent-81, score-0.091]
31 0 = Use of Lemma 1 - which is similar to the approach of classical VC analysis reduces the original problem (1) to the problem of studying the deviation of the training errors on the first and second half of a double sample z E z2m of size 2m. [sent-85, score-0.454]
32 It is of utmost importance that the hypothesis A (Z[l:mj) is always learned from the first m examples. [sent-86, score-0.149]
33 Now, in order to fully exploit our assumptions of the mutual independence of the double sample Z E z2m we use a technique known as symmetrisation by permutation: since PZ2~ is a product measure, it has the property that PZ2»> (J (Z)) = PZ2~ (J (ITi (Z))) for any i E hm. [sent-87, score-0.302]
34 Hence, it suffices to bound the probability of permutations Jri such that J (ITi (z)) is true for a given and fixed double sample z. [sent-88, score-0.607]
35 As a consequence thereof, we only need to count the number of different hypotheses that can be learned by A from the first m examples when permuting the double sample. [sent-89, score-0.384]
36 Any function L that maps an algorithm A : Z( oo ) -+ yX and a training sample z E Z( oo ) to a real value is called an algorithmic luckiness. [sent-91, score-0.556]
37 For all mEN, for any z E z2m , the lucky set HA (L , z) ~ yX is the set of all hypotheses that are learned from the first m examples (Z7ri(1),···, Z7ri(m)) when permuting the whole sample z whilst not decreasing the luckiness, i. [sent-92, score-0.338]
38 (2) where Given a fixed loss function 1 : y x y -+ [0,1] the induced loss function set £1 (HA (L,z)) is defined by £1 (HA (L,z)) := {(x,y) r-+ 1(h(x) ,y) I h E HA (L,z)} . [sent-94, score-0.197]
39 For any luckiness function L and any learning algorithm A , the complexity of the double sample z is the minimal number N1 (T, £1 (HA (L, z)) ,z) of hypotheses h E yX needed to cover £1 (HA (L , z)) at some predefined scale T, i. [sent-95, score-1.159]
40 for any hypothesis hE HA (L, z) there exists a h E yX such that (4) To see this note that whenever J (ITi (z)) is true (over the random draw of permutations) then there exists a function h which has a difference in the training errors on the double sample of at least ~ + 2T. [sent-97, score-0.645]
41 By an application of the union bound we see that the number N 1 (T, £1 (HA (L , z)) , z) is of central importance. [sent-98, score-0.141]
42 Hence, if we are able to bound this number over the random draw of the double sample z only using the luckiness on the first m examples we can use this bound in place of the worst case complexity SUPzEZ2~ N1 (T, £1 (HA (L , z)) ,z) as usually done in the VC framework (see [9]). [sent-99, score-1.301]
43 Given an algorithm A : Z (00 ) -+ yX and a loss l : y x y -+ [a, 1] the algorithmic luckiness function Lis w- small at scale T E jR+ if for all mEN, all J E (a , 1] and all P z PZ2~ (Nl (T, £"1 (1iA (L, Z)), Z) > w (L (A, Z[l:ml) ,l, m, J,T)) < J. [sent-101, score-0.918]
44 , " v S(Z) Note that if the range of l is {a, I} then N 1 (2~ ' £"1 (1iA (L, z)) , z) equals the number of dichotomies on z incurred by £"1 (1iA (L , z)). [sent-102, score-0.044]
45 Suppos e we have a learning algorithm A : Z( oo ) -+ yX and an algorithmic luckiness L that is w-small at scale T for a loss function l : y X Y -+ [a, 1]. [sent-104, score-1.039]
46 For any probability measure P z , any dEN and any J E (a , 1], with probability at least 1 - J over the random draw of the training sample z E zm of size m, if w (L (A, z) ,l, m, J/4, T) :::; 2d then ! [sent-105, score-0.47]
47 (5) Furthermore, under the above conditions if the algorithmic luckiness L is wsmall at scale 2~ for a binary loss function l (". [sent-107, score-0.927]
48 We will only sketch the proof of equation (5) ; the proof of (6) is similar and can be found in [5]. [sent-109, score-0.131]
49 We now exploit the fact that PZ2~ (J (Z)) :Z2~ (J (Z) 1\ S (Z) ), +PZ2~ (J (Z) 1\ . [sent-111, score-0.042]
50 Following the above-mentioned argument it suffices to bound the probability of a random permutation III (z) that J (III (z)) 1\ . [sent-118, score-0.242]
51 Thus let us consider such a cover of size not more than 2. [sent-126, score-0.055]
52 ,S (IIi (z)) is true for a swapping i then there exists a hypothesis h E yX in the cover (III (z)) [(m+1) :2ml] - Rl (III (z)) [l:ml] > ~ + 2T. [sent-130, score-0.212]
53 Using the such that Rl union bound and Hoeffding's inequality for a particular choice of PI shows that PI (J (III (z)) 1\ . [sent-131, score-0.141]
54 Note that the usage of permutations in the definition of (2) is not only a technical matter; it fully exploits all the assumptions made for the training sample, namely the training sample is drawn iid. [sent-136, score-0.472]
55 3 Relationship to Other Learning Frameworks In this section we present the relationship of algorithmic luckiness to other learning frameworks (see [9, 8, 6] for further details of these frameworks). [sent-137, score-0.863]
56 VC Framework If we consider a binary loss function l (". [sent-138, score-0.07]
57 ) E {a, I} and assume that the algorithm A selects functions from a given hypothesis space H ~ yX then L (A, z) = - VCDim (H) is a w- smallluckiness function where w (Lo,l,m,8, 1) :S (2em) -Lo 2m -Lo . [sent-139, score-0.154]
58 (7) This can easily be seen by noticing that the latter term is an upper bound on maxz EZ 2 ", I{ (l (h (Xl) ,yI) , . [sent-140, score-0.137]
59 Note that this luckiness function neither exploits the particular training sample observed nor the learning algorithm used. [sent-144, score-0.936]
60 Luckiness Framework Firstly, the luckiness framework of Shawe-Taylor et al. [sent-145, score-0.708]
61 [8] only considered binary loss functions l and the zero training error case. [sent-146, score-0.178]
62 In this work, the luckiness £ is a function of hypothesis and training samples and is called wsmall if the probability over the random draw of a 2m sample z that there exists a hypothesis h with w(£(h, (Zl, . [sent-147, score-1.202]
63 Although similar in spirit, the classical luckiness framework does not allow exploitation of the learning algorithm used to the same extent as our new luckiness. [sent-151, score-0.846]
64 In fact, in this framework not only the covering number must be estimable but also the variation of the luckiness £ itself. [sent-152, score-0.775]
65 Compression Framework In the compression framework of Littlestone and Warmuth [6] one considers learning algorithms A which are compression schemes, i. [sent-154, score-0.397]
66 A (z) = :R (e (z)) where e (z) selects a subsample z ~ z and :R : Z(oo) -+ yX is a permutation invariant reconstruction function. [sent-156, score-0.093]
67 For this class of learning algorithms, the luckiness L(A,z) = -le(z)1 is w- small where w is given by (7). [sent-157, score-0.649]
68 In order to see this we note that (3) ensures that we only consider permutations 7ri where e (IIi (z)) :S Ie (z)l, i. [sent-158, score-0.112]
69 we use not more than -L training examples from z E z2m. [sent-160, score-0.102]
70 As there are exactly distinct choices of d training examples from 2m examples the result follows by application of Sauer's lemma [9]. [sent-161, score-0.197]
71 Disregarding constants, Theorem 1 gives exactly the same bound as in [6]. [sent-162, score-0.103]
72 e;;) 4 A New Margin Bound For Support Vector Machines In this section we study the maximum margin algorithm for linear classifiers, i. [sent-163, score-0.23]
73 Let us assume that l (h (x) ,y) = lO - l (h (x) ,y) := lIyh(x)::;o, Classical VC generalisation error bounds exploit the fact that VCDim (Hcp) = nand (7). [sent-166, score-0.305]
74 Now, the maximum margin algorithm finds the weight vector WMM that maximises "(z (w). [sent-170, score-0.23]
75 Our new margin bound follows from the following theorem together with (6). [sent-173, score-0.298]
76 Let fi (x) be the smallest 10 > 0 such that {cfJ (Xl) , . [sent-175, score-0.059]
77 14>(Xi) the maximum margin algorithm A , the luckiness function L(A ,Z ) =_ mIn {. [sent-184, score-0.848]
78 First we note that by a slight refinement of a theorem of Makovoz [7] we know that for any Z E zm there exists a weight vector w = 2:: 1 iiicfJ (Xi) such that (10) and a WA(z) E ]Rm has no more than - L (A, z) non-zero components. [sent-191, score-0.265]
79 Consider a fixed double sample Z E z2m and let ko := L (A, (Zl , . [sent-194, score-0.429]
80 By virtue of (3) and the aforementioned argument we only need to consider permutations tri such that there exists a weight vector w = 2:;:1 iijcfJ (Xj) with no more than ko non-zero iij. [sent-198, score-0.296]
81 , ko} training examples from the 2m examples Z there are no more than (2emlko)kO different subsamples to be used in w. [sent-202, score-0.138]
82 For each particular subsample z ~ Z the weight vector w is a member of the class of linear classifiers in a ko (or less) dimensional space. [sent-203, score-0.267]
83 Thus, from (7) it follows that for the given subsample z there are no more (2emlko)kO different dichotomies induced on the double sample Z E z2m. [sent-204, score-0.36]
84 As this holds for any D double sample, the theorem is proven. [sent-205, score-0.195]
85 There are several interesting features about this margin bound. [sent-206, score-0.153]
86 Firstly, observe that 2:;:1 IA (Z)j I is a measure of sparsity of the solution found by the maximum margin algorithm which, in the present case, is combined with margin. [sent-207, score-0.301]
87 Secondly, the quantity fi (x) can be considered as a measure of the distribution of the mapped data points in feature space. [sent-213, score-0.117]
88 Note that for all i E N, fi (x) :S 101 (x) :S maxjE{l ,. [sent-214, score-0.059]
89 An conditional probabilities PX 1 extension of this reasoning is useful in the multi-class case; binary maximum margin classifiers are often used to solve multi-class problems [9]. [sent-219, score-0.287]
90 There appears to be also a close relationship of fi (x) with the notion of kernel alignment recently introduced in [3]. [sent-220, score-0.12]
91 Finally, one can use standard entropy number techniques to bound fi (x) in terms of eigenvalues of the inner product matrix or its centred variants. [sent-221, score-0.162]
92 It is worth mentioning that although our aim was to study the maximum margin algorithm the above theorem actually holds for any algorithm whose solution can be represented as a linear combination of the data points. [sent-222, score-0.314]
93 5 Conclusions In this paper we have introduced a new theoretical framework to study the generalisation error of learning algorithms. [sent-223, score-0.313]
94 In contrast to previous approaches, we considered specific learning algorithms rather than specific hypothesis spaces. [sent-224, score-0.171]
95 We introduced the notion of algorithmic luckiness which allowed us to devise data dependent generalisation error bounds. [sent-225, score-1.003]
96 Thus we were able to relate the compression framework of Littlestone and Warmuth with the VC framework. [sent-226, score-0.214]
97 Furthermore, we presented a new bound for the maximum margin algorithm which not only exploits the margin but also the distribution of the actual training data in feature space. [sent-227, score-0.654]
98 Perhaps the most appealing feature of our margin based bound is that it naturally combines the three factors considered important for generalisation with linear classifiers: margin, sparsity and the distribution of the data. [sent-228, score-0.504]
99 Further research is concentrated on studying Bayesian algorithms and the relation of algorithmic luckiness to the recent findings for stable learning algorithms [2]. [sent-229, score-0.94]
100 A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. [sent-261, score-0.256]
wordName wordTfidf (topN-words)
[('luckiness', 0.618), ('rl', 0.178), ('zm', 0.178), ('yx', 0.169), ('algorithmic', 0.161), ('double', 0.153), ('margin', 0.153), ('generalisation', 0.15), ('ha', 0.133), ('iii', 0.129), ('compression', 0.124), ('ko', 0.112), ('permutations', 0.112), ('hypothesis', 0.112), ('sample', 0.107), ('bound', 0.103), ('hcp', 0.102), ('vc', 0.1), ('classifiers', 0.099), ('draw', 0.091), ('framework', 0.09), ('oo', 0.09), ('mj', 0.088), ('littlestone', 0.088), ('men', 0.088), ('hypotheses', 0.082), ('iti', 0.076), ('permuting', 0.076), ('vcdim', 0.076), ('suffices', 0.075), ('studying', 0.074), ('exploits', 0.072), ('bounds', 0.071), ('loss', 0.07), ('covering', 0.067), ('training', 0.066), ('ez', 0.06), ('fi', 0.059), ('lemma', 0.059), ('fixed', 0.057), ('subsample', 0.056), ('hi', 0.055), ('sketch', 0.055), ('cover', 0.055), ('frameworks', 0.053), ('wa', 0.052), ('xi', 0.051), ('cfj', 0.051), ('ilcfj', 0.051), ('infheyx', 0.051), ('jri', 0.051), ('pzm', 0.051), ('rz', 0.051), ('wmm', 0.051), ('wsmall', 0.051), ('herbrich', 0.05), ('williamson', 0.05), ('microsoft', 0.049), ('definition', 0.049), ('nl', 0.048), ('warmuth', 0.046), ('exists', 0.045), ('dichotomies', 0.044), ('predefined', 0.044), ('sparsity', 0.043), ('exploit', 0.042), ('algorithm', 0.042), ('error', 0.042), ('theorem', 0.042), ('bousquet', 0.04), ('proof', 0.038), ('union', 0.038), ('exploitation', 0.037), ('hoeffding', 0.037), ('learned', 0.037), ('permutation', 0.037), ('examples', 0.036), ('maximum', 0.035), ('australian', 0.034), ('noticing', 0.034), ('yi', 0.033), ('notion', 0.032), ('ia', 0.032), ('zl', 0.032), ('learning', 0.031), ('feature', 0.03), ('uniform', 0.03), ('wi', 0.03), ('firstly', 0.03), ('shorthand', 0.029), ('alignment', 0.029), ('measure', 0.028), ('algorithms', 0.028), ('classical', 0.028), ('scale', 0.027), ('argument', 0.027), ('xl', 0.027), ('errors', 0.026), ('ml', 0.025), ('appealing', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 31 nips-2001-Algorithmic Luckiness
Author: Ralf Herbrich, Robert C. Williamson
Abstract: In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1
2 0.15672602 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile
Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.
3 0.14266935 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
4 0.11806085 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss. ¡
5 0.10924946 1 nips-2001-(Not) Bounding the True Error
Author: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks with results of a order of magnitude improvement vs. the best deterministic neural net bounds. £ ¡ ¤¢
6 0.10029268 134 nips-2001-On Kernel-Target Alignment
7 0.098609798 137 nips-2001-On the Convergence of Leveraging
8 0.095235489 8 nips-2001-A General Greedy Approximation Algorithm with Applications
9 0.088640086 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
10 0.079086147 143 nips-2001-PAC Generalization Bounds for Co-training
11 0.078676052 119 nips-2001-Means, Correlations and Bounds
12 0.077905774 105 nips-2001-Kernel Machines and Boolean Functions
13 0.073894076 167 nips-2001-Semi-supervised MarginBoost
14 0.072721511 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
15 0.07239531 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
16 0.070626147 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
17 0.065963976 170 nips-2001-Spectral Kernel Methods for Clustering
18 0.065309107 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
19 0.061853796 136 nips-2001-On the Concentration of Spectral Properties
20 0.060319811 114 nips-2001-Learning from Infinite Data in Finite Time
topicId topicWeight
[(0, -0.189), (1, 0.091), (2, 0.035), (3, 0.092), (4, 0.136), (5, -0.066), (6, 0.049), (7, 0.073), (8, -0.073), (9, -0.018), (10, -0.049), (11, 0.166), (12, 0.006), (13, 0.059), (14, 0.002), (15, 0.112), (16, 0.01), (17, -0.043), (18, -0.027), (19, 0.011), (20, -0.111), (21, 0.063), (22, -0.128), (23, -0.062), (24, -0.047), (25, 0.001), (26, -0.075), (27, 0.0), (28, 0.003), (29, 0.019), (30, 0.055), (31, 0.049), (32, 0.003), (33, -0.067), (34, -0.054), (35, -0.011), (36, 0.044), (37, 0.053), (38, 0.135), (39, -0.018), (40, 0.138), (41, 0.178), (42, 0.054), (43, 0.023), (44, -0.026), (45, 0.126), (46, -0.057), (47, -0.072), (48, -0.025), (49, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.93846083 31 nips-2001-Algorithmic Luckiness
Author: Ralf Herbrich, Robert C. Williamson
Abstract: In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1
2 0.69606924 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile
Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.
3 0.53415883 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model Mii learned by the algorithm using ni examples in step i (ii = (nl , ... ,nm )) , and the model Moo that would be learned using infinite examples. Upper-bound the loss L(Mii' M oo ) between them as a function of ii, and then minimize the algorithm's time complexity f(ii) subject to the constraint that L(Moo , M ii ) be at most f with probability at most 8. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach. 1 An Approach to Large-Scale Learning Large data sets make it possible to reliably learn complex models. On the other hand , they require large computational resources to learn from. While in the past the factor limit ing the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is t he t ime required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. Consider the simplest possible learning problem: estimating the mean of a random variable x. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept an error of at most f with probability at most 8, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of x, only n = ~(R/f)2 1n (2/8) samples are needed, where R is x's range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps: 1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm. 2. Derive an upper bound on the time complexity of the learning algorithm , as a function of the number of samples used in each step. 3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss. In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2J and k-means clustering [3J. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8J. Many approaches to speeding it up have been proposed (see Thiesson et al. [6J for a survey) . Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5J: rather than minimize the total number of samples needed by the algorithm , we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves. 2 A Loss Bound for EM In a mixture of Gaussians model, each D-dimensional data point Xj is assumed to have been independently generated by the following process: 1) randomly choose a mixture component k; 2) randomly generate a point from it according to a Gaussian distribution with mean f-Lk and covariance matrix ~k. In this paper we will restrict ourselves to the case where the number K of mixture components and the probability of selection P(f-Lk) and covariance matrix for each component are known. Given a training set S = {Xl, ... , XN }, the learning goal is then to find the maximumlikelihood estimates of the means f-Lk. The EM algorithm [IJ accomplishes this by, starting from some set of initial means , alternating until convergence between estimating the probability p(f-Lk IXj) that each point was generated by each Gaussian (the Estep), and computing the ML estimates of the means ilk = 2::;':1 WjkXj / 2::f=l Wjk (the M step), where Wjk = p(f-Lklxj) from the previous E step. In the basic EM algorithm, all N examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only ni < N examples in the ith iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large N. Let Mii = (ill , . . . , ilK) be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using ni examples in iteration i), and let Moo = (f-L1, ... ,f-LK) be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between Mii and Moo . A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones: K L(Mii' Moo ) = L k=l K Ililk - f-Lkl12 = D LL lilkd - f-Lkdl 2 (1) k=ld=l where ilkd is the dth coordinate of il, and similarly for f-Lkd. After any given iteration of EM, lilkd - f-Lkdl has two components. One, which we call the sampling error, derives from the fact that ilkd is estimated from a finite sample, while J-Lkd is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that , due to sampling errors in previous iterations, the weights Wjk used to compute the two estimates may differ. Let J-Lkdi be the infinite-data estimate of the dth coordinate of the kth mean produced in iteration i, flkdi be the corresponding finite-data estimate, and flkdi be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration i is Iflkdi - J-Lkdi I, the weighting error is Iflkdi - flkdi I, and the total error is Iflkdi - J-Lkdi 1 :::; Iflkdi - flkdi 1 + Iflkdi - J-Lkdi I衯 Given bounds on the total error of each coordinate of each mean after iteration i-I, we can derive a bound on the weighting error after iteration i as follows. Bounds on J-Lkd ,i-l for each d imply bounds on p(XjlJ-Lki ) for each example Xj, obtained by substituting the maximum and minimum allowed distances between Xjd and J-Lkd ,i-l into the expression of the Gaussian distribution. Let P}ki be the upper bound on P(XjlJ-Lki) , and Pjki be the lower bound. Then the weight of example Xj in mean J-Lki can be bounded from below by by W}ki W (+) jki = min{p}kiP(J-Lk)/ wjki = PjkiP(J-Lk)/ ~~= l P}k'iP(J-LU, and from above ~~=l Pjk'iP(J-LU, I}. Let w;t: = W}ki if Xj ::::: 0 and (- - 1 > (- + . W jki) - W jki' f Xj _ 0 an d W jki) - W jki 0 th erWlse. ot h ' erWlse, an d 1et W- jki Then Iflkdi - flkdi , IJ-Lkdi 1 < max - ~7~1 Wjk i Xj I
4 0.52950454 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
5 0.50653011 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss. ¡
6 0.50057524 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
7 0.49616387 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
8 0.45647913 1 nips-2001-(Not) Bounding the True Error
9 0.44616273 137 nips-2001-On the Convergence of Leveraging
10 0.43833804 119 nips-2001-Means, Correlations and Bounds
11 0.42187643 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
12 0.40870023 167 nips-2001-Semi-supervised MarginBoost
13 0.39189234 8 nips-2001-A General Greedy Approximation Algorithm with Applications
14 0.38991788 136 nips-2001-On the Concentration of Spectral Properties
15 0.38606766 143 nips-2001-PAC Generalization Bounds for Co-training
16 0.37833172 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
17 0.37627265 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
18 0.33257592 105 nips-2001-Kernel Machines and Boolean Functions
19 0.33123338 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
20 0.31749007 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
topicId topicWeight
[(14, 0.087), (17, 0.047), (19, 0.044), (27, 0.129), (30, 0.028), (36, 0.034), (59, 0.026), (70, 0.308), (72, 0.055), (79, 0.032), (83, 0.028), (91, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.80139697 31 nips-2001-Algorithmic Luckiness
Author: Ralf Herbrich, Robert C. Williamson
Abstract: In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1
2 0.72866392 155 nips-2001-Quantizing Density Estimators
Author: Peter Meinicke, Helge Ritter
Abstract: We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.
3 0.63314658 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
Author: Michael Collins, S. Dasgupta, Robert E. Schapire
Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.
4 0.53628463 134 nips-2001-On Kernel-Target Alignment
Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola
Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1
5 0.53386909 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
6 0.53176725 137 nips-2001-On the Convergence of Leveraging
7 0.53076363 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
8 0.53000319 13 nips-2001-A Natural Policy Gradient
9 0.52485871 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
10 0.52361101 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
11 0.51873559 58 nips-2001-Covariance Kernels from Bayesian Generative Models
12 0.51637465 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
13 0.51586002 139 nips-2001-Online Learning with Kernels
14 0.51460236 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
15 0.51327062 170 nips-2001-Spectral Kernel Methods for Clustering
16 0.5121156 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
17 0.50667936 143 nips-2001-PAC Generalization Bounds for Co-training
18 0.50611567 114 nips-2001-Learning from Infinite Data in Finite Time
19 0.50545382 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
20 0.50487566 135 nips-2001-On Spectral Clustering: Analysis and an algorithm