nips nips2000 nips2000-145 knowledge-graph by maker-knowledge-mining

145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting


Source: pdf

Author: Shie Mannor, Ron Meir

Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The problem of constructing weak classifiers for boosting algorithms is studied. [sent-4, score-0.526]

2 We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. [sent-5, score-0.45]

3 While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. [sent-6, score-0.757]

4 Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. [sent-7, score-0.09]

5 Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. [sent-8, score-1.249]

6 1 Introduction The recently introduced boosting approach to classification (e. [sent-9, score-0.311]

7 Although a great deal of numerical evidence suggests that boosting works very well across a wide spectrum of tasks, it is not a panacea for solving classification problems. [sent-15, score-0.35]

8 In fact, many versions of boosting algorithms currently exist (e. [sent-16, score-0.257]

9 , [4],[9]), each possessing advantages and disadvantages in terms of classification accuracy, interpretability and ease of implementation. [sent-18, score-0.054]

10 The field of boosting provides two major theoretical results. [sent-19, score-0.3]

11 First, it is shown that in certain situations the training error of the classifier formed converges to zero (see (2)). [sent-20, score-0.417]

12 Moreover, under certain conditions, a positive margin can be guaranteed. [sent-21, score-0.161]

13 Second, bounds are provided for the generalization error of the classifier (see (1)). [sent-22, score-0.43]

14 First, we present a simple and efficient algorithm which is shown, for every distribution on the data, to yield a linear classifier with guaranteed error which is smaller than 1/2 - γ, where γ is strictly positive. [sent-24, score-0.425]

15 From the theory of boosting [10] it is known that such a condition suffices to guarantee that the training error converges to zero as the number of boosting iterations increases. [sent-26, score-0.641]

16 In fact, the empirical error with a finite margin is shown to converge to zero if γ is sufficiently large. [sent-27, score-0.28]
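
To make the connection concrete, a weak learner with a guaranteed edge γ can be dropped into a standard AdaBoost loop, whose training error then decays exponentially in the number of rounds. The Python sketch below is illustrative only and is not the authors' implementation; the callable weak_learn is a hypothetical stand-in for any learner returning a ±1-valued hypothesis with weighted error at most 1/2 - γ.

```python
import numpy as np

def adaboost(X, y, weak_learn, T=50):
    """Minimal AdaBoost sketch. `weak_learn(X, y, w)` is assumed to return a callable h
    whose weighted error sum(w[h(X) != y]) is at most 1/2 - gamma for some gamma > 0."""
    m = len(y)
    w = np.full(m, 1.0 / m)                    # distribution over the training points
    hyps, alphas = [], []
    for _ in range(T):
        h = weak_learn(X, y, w)
        pred = h(X)
        err = np.clip(np.sum(w[pred != y]), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-alpha * y * pred)      # up-weight the points this round got wrong
        w /= w.sum()
        hyps.append(h)
        alphas.append(alpha)
    # final classifier: sign of the weighted (convex, after normalization) combination
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hyps)))
```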

17 It is then clear that in order to construct useful weak learners, some assumptions need to be made about the data. [sent-29, score-0.293]

18 In this work we show that under certain natural conditions, a useful weak learner can be constructed for one-dimensional problems, in which case the linear hyper-plane degenerates to a point. [sent-30, score-0.416]

19 The second contribution of our work consists of establishing faster convergence rates for the generalized error bounds introduced recently by Mason et al. [sent-33, score-0.407]

20 These improved bounds show that faster convergence can be achieved if we allow for convergence to a slightly larger value than in previous bounds. [sent-35, score-0.463]

21 Given the guaranteed convergence of the empirical loss to zero (in the limited situations in which we have proved such a bound), such a result may yield a better trade-off between the terms appearing in the bound, offering a better model selection criterion (see Chapter 15 in [1]). [sent-36, score-0.428]

22 2 Construction of a Linear Weak Learner We recall the basic generalization bound for convex combinations of classifiers. [sent-37, score-0.278]

23 Let H be a class of binary classifiers of VC-dimension d_V, and denote by co(H) the convex hull of H. [sent-38, score-0.398]

24 . . . , (x_m, y_m)} ∈ (X × {-1, +1})^m of m examples drawn independently at random from a probability distribution D over X × {-1, +1}, Schapire et al. [sent-42, score-0.089]

25 [10] show that with probability at least 1 - δ, for every f ∈ co(H) and every θ > 0, $P_D[Yf(X) \le 0] \le P_S[Yf(X) \le \theta] + O\left(\frac{1}{\sqrt{m}}\left(\frac{d_V \log^2(m/d_V)}{\theta^2} + \log(1/\delta)\right)^{1/2}\right)$, (1) [sent-43, score-0.204]

26 where the margin-error $P_S[Yf(X) \le \theta]$ denotes the fraction of training points for which $y_i f(x_i) \le \theta$. [sent-45, score-0.066]

27 Clearly, if the first term can be made small for a large value of the margin θ, a tight bound can be established. [sent-46, score-0.225]

28 [10] also show that if each weak classifier can achieve an error smaller than 1/2 - γ, then $P_S[Yf(X) \le \theta] \le \left((1-2\gamma)^{1-\theta}(1+2\gamma)^{1+\theta}\right)^{T/2}$, (2) where T is the number of boosting iterations. [sent-48, score-0.745]

29 Note that if γ > θ, the bound decreases to zero exponentially fast. [sent-49, score-0.126]
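
For a rough numerical feel of (2), the right-hand side can be evaluated directly; the snippet below uses illustrative values of γ, θ and T that are not taken from the paper.

```python
def margin_error_bound(gamma, theta, T):
    """Right-hand side of (2): ((1 - 2g)**(1 - t) * (1 + 2g)**(1 + t))**(T / 2)."""
    base = (1.0 - 2.0 * gamma) ** (1.0 - theta) * (1.0 + 2.0 * gamma) ** (1.0 + theta)
    return base ** (T / 2.0)

# with gamma = 0.10 > theta = 0.05 the base is about 0.98, so the bound decays
# exponentially in T; with gamma = theta it barely moves below 1
print(margin_error_bound(0.10, 0.05, T=500))   # roughly 6e-3
print(margin_error_bound(0.05, 0.05, T=500))   # close to 1
```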

30 However, if γ (and thus θ) behaves like m^{-β} for some β > 0, the rate of convergence in the second term in (1) will deteriorate, leading to worse bounds than those available by using standard VC results [11]. [sent-51, score-0.263]

31 What is needed is a characterization of conditions under which the achievable () does not decrease rapidly with m. [sent-52, score-0.079]

32 In this section we present such conditions for one-dimensional problems, and mention recent work [7] that proves a similar result in higher dimensions. [sent-53, score-0.126]

33 We begin by demonstrating that for any distribution on m points, a linear classifier can achieve an error smaller than 1/2 - γ, where γ = O(1/m). [sent-54, score-0.277]

34 In view of our comments above, such a fast convergence of γ to zero may be useless for generalization bounds. [sent-55, score-0.294]

35 We then use our construction to show that, under certain regularity conditions, a value of γ, and thus of θ, which is independent of m can be established for one-dimensional problems. [sent-56, score-0.057]

36 A linear decision rule takes the form y(x) = sgn(a·x + b), where · is the standard inner product in R^d. [sent-66, score-0.069]

37 The weighted misclassification error for a classifier ŷ is P_e(a, b) = Σ_{i=1}^m p_i I(ŷ_i ≠ y_i). [sent-68, score-0.275]

38 Since there is at least one point x whose weight is not smaller than 1/m, we consider the four possible linear classifiers defined by h with boundaries near x (at both sides of it and with opposite sign), and show that one of these yields the desired result. [sent-72, score-0.153]

39 Thus, we can assume, without loss of generality, that |Σ_{i=1}^m p_i y_i| < 1/(2m). Next, note that there exists a direction u such that i ≠ j implies that u·x_i ≠ u·x_j. [sent-86, score-0.095]

40 Construct all one-dimensional lines containing two data points or more; clearly the number of such lines is at most m(m -1)/2. [sent-88, score-0.22]

41 It is then obvious that any line which is not perpendicular to any of these lines obeys the required condition. [sent-89, score-0.119]

42 Such an ε always exists since the points are assumed to be distinct. [sent-94, score-0.118]

43 . . . , m, let the classification be given by ŷ_j = sgn(u·x_j + b), where the bias b is given by b = -u·x_i + εy_i. [sent-98, score-0.113]

44 Otherwise, if |A + B| < 1/(2m), consider the classifier ŷ'_j = sgn(-u · [sent-106, score-0.25]

45 x_i + εy_i (note that ŷ'_i = y_i and ŷ'_j = -ŷ_j, j ≠ i). [sent-108, score-0.096]

46 Using (3) with δ_1 = 1/(2m) and δ_2 = 1/m, the claim follows. [sent-109, score-0.054]

47 We comment that the upper bound in Lemma 1 may be improved to 1/2 - 1/(4(m-1)), m ≥ 2, using a more refined argument. [sent-110, score-0.17]

48 Remark 1 Lemma 1 implies that an error of 1/2 - γ, where γ = O(1/m), can be guaranteed for any set of arbitrarily weighted points. [sent-111, score-0.195]

49 It is well known that the problem of finding a linear classifier with minimal classification error is NP-hard (in d) [5]. [sent-112, score-0.288]

50 Since the algorithm described in Lemma 1 is clearly polynomial (in m and d), there seems to be a transition as a function of γ between the class NP and P (assuming, as usual, that they are different). [sent-114, score-0.095]

51 While the result given in Lemma 1 is interesting, its generality precludes its usefulness for bounding generalization error. [sent-116, score-0.079]

52 This can be seen by observing that the theorem guarantees the given margin even in the case where the labels y_i are drawn uniformly at random from {±1}, in which case no generalization can be expected. [sent-117, score-0.409]

53 We do this by imposing constraints on the types of decision regions characterizing the data. [sent-119, score-0.122]

54 In order to generate complex, yet tractable, decision regions we consider a multi-linear mapping from R^d to {-1, 1}^k, generated by the k hyperplanes P_i = {x : w_i·x + w_{i0} = 0, x ∈ R^d}, i = 1, . [sent-120, score-0.164]

55 Such a mapping generates a partition of the input space R^d into M connected components, {R^d \ ∪_{i=1}^k P_i}, each characterized by a unique binary vector of length k. [sent-124, score-0.214]

56 An upper bound is given by 2(e(k - 1)/d)^d, m ≥ d. [sent-133, score-0.079]
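
As an aside on the size of M (not part of the paper's argument): for k affine hyperplanes in general position in R^d the classical count of connected regions is Σ_{i=0}^{d} C(k, i), which the snippet below compares with a 2(e(k-1)/d)^d-style upper bound.

```python
from math import comb, e

def regions_general_position(k, d):
    """Connected regions cut out of R^d by k affine hyperplanes in general position
    (classical arrangement count: sum of binomial(k, i) for i = 0..d)."""
    return sum(comb(k, i) for i in range(d + 1))

# compare the exact general-position count with a 2*(e*(k-1)/d)**d style upper bound
k, d = 20, 2
print(regions_general_position(k, d), 2 * (e * (k - 1) / d) ** d)
```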

57 In order to generate a binary classification problem, we observe that there exists a binary function from {-1, 1}^k to {-1, 1}, characterized by these M decision regions. [sent-135, score-0.421]

58 Proceeding by induction, we generate a binary classification problem composed of exactly M decision regions. [sent-139, score-0.267]

59 Thus, we have constructed a binary classification problem, characterized by at least 2(k-1 choose d) ≥ 2((k-1)/d)^d decision regions. [sent-140, score-0.277]

60 Clearly as k becomes arbitrarily large, very elaborate regions are formed. [sent-141, score-0.053]

61 Note that in this case the partition is composed of intervals. [sent-143, score-0.063]

62 Theorem 1 Let F be a class of functions from R to {±1} which partitions the real line into at most k intervals, k ≥ 2. [sent-144, score-0.051]

63 Associate with every interval a point in R: x_1 = l_1 - 1, x_2 = (l_1 + l_2)/2, . . . [sent-155, score-0.076]
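
In this one-dimensional setting a weak hypothesis can be found by exhaustively scanning threshold classifiers of the form sgn(s(x - b)). The sketch below is a generic implementation of that idea rather than the authors' exact construction; it returns the stump with the smallest weighted error, which can then serve as the weak learner inside a boosting round.

```python
import numpy as np

def best_stump_1d(x, y, w):
    """Exhaustive search over 1-D threshold classifiers sgn(s * (x - b)), s in {-1, +1}.
    x: (m,) points, y: (m,) labels in {-1, +1}, w: (m,) nonnegative weights summing to 1."""
    xs = np.sort(x)
    # candidate thresholds: midpoints between consecutive points, plus one beyond each end
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    best_s, best_b, best_err = 1.0, thresholds[0], 1.0
    for b in thresholds:
        for s in (1.0, -1.0):
            pred = np.where(s * (x - b) >= 0, 1.0, -1.0)
            err = np.sum(w[pred != y])          # weighted misclassification error
            if err < best_err:
                best_s, best_b, best_err = s, b, err
    stump = lambda xq: np.where(best_s * (xq - best_b) >= 0, 1.0, -1.0)
    return stump, best_err
```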

64 We conjecture that in d dimensions γ behaves like k^{-l(d)} for some function l, where k is a measure for the number of homogeneous convex regions defined by the data (a homogeneous region is one in which all points possess identical labels). [sent-176, score-0.44]

65 While we do not have a general proof at this stage, we have recently shown [7] that the conjecture holds under certain natural conditions on the data. [sent-177, score-0.333]

66 This result implies that, at least under appropriate conditions, boosting-like algorithms are expected to have excellent generalization performance. [sent-178, score-0.174]

67 For this simulation we used random lines to generate a partition of the unit square in R^2. [sent-180, score-0.207]

68 We then drew 1000 points at random from the unit square and assigned them labels according to the partition. [sent-181, score-0.197]

69 Finally, in order to have a non-trivial problem we made sure that the cumulative weights of each class are equal. [sent-182, score-0.093]

70 We then calculated the optimal linear classifier by exhaustive search. [sent-183, score-0.154]
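
A rough reconstruction of this kind of experiment is sketched below. It is illustrative only: the per-cell labeling rule, the equal-weight distribution, and the grid used to approximate the exhaustive search are assumptions rather than the authors' exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_partition_labels(points, k):
    """Hypothetical labeling rule: a random +-1 label per cell of an arrangement of
    k random lines (the paper's exact labeling of the partition is not reproduced)."""
    w = rng.normal(size=(k, 2))
    b = rng.uniform(-1.0, 1.0, size=k)
    codes = points @ w.T + b > 0                 # (m, k) boolean cell code per point
    cell_label, labels = {}, np.empty(len(points))
    for i, code in enumerate(map(tuple, codes)):
        if code not in cell_label:
            cell_label[code] = rng.choice([-1.0, 1.0])
        labels[i] = cell_label[code]
    return labels

def best_linear_gamma(points, y, w, n_dirs=100, n_bias=100):
    """Approximate the best linear classifier by a grid search over directions and
    biases; return gamma = 1/2 - (smallest weighted error found)."""
    best_err = 1.0
    for theta in np.linspace(0.0, np.pi, n_dirs, endpoint=False):
        proj = points @ np.array([np.cos(theta), np.sin(theta)])
        for b in np.linspace(proj.min(), proj.max(), n_bias):
            for s in (1.0, -1.0):
                pred = np.where(s * (proj - b) >= 0, 1.0, -1.0)
                best_err = min(best_err, np.sum(w[pred != y]))
    return 0.5 - best_err

m, k = 1000, 10
pts = rng.uniform(size=(m, 2))
labels = random_partition_labels(pts, k)
weights = np.full(m, 1.0 / m)
print("approximate gamma for this partition:", best_linear_gamma(pts, labels, weights))
```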

71 In Figure 1(b) we show a sample decision region with 93 regions. [sent-184, score-0.11]

72 Figure 1(a) shows the dependence of γ on the number of regions k. [sent-185, score-0.053]

73 As it turns out, there is a significant logarithmic dependence between γ and k, which leads us to conjecture that γ ~ C·k^{-l} + ε for some C, l and ε. [sent-186, score-0.111]

74 In the presented case, l = 3 turns out to fit our model well. [sent-187, score-0.078]
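
Fitting the conjectured form γ(k) ≈ C·k^(-l) + ε is a small exercise with a nonlinear least-squares routine; the sketch below uses synthetic placeholder data generated from the model itself (not the paper's measurements) purely to demonstrate the fitting step.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(k, C, l, eps):
    # conjectured form: gamma(k) ~ C * k**(-l) + eps
    return C * k ** (-l) + eps

# synthetic placeholder measurements generated from the model itself (not the
# paper's data), just to demonstrate the fitting step
rng = np.random.default_rng(1)
k_vals = np.array([5.0, 10.0, 20.0, 40.0, 80.0, 160.0])
gamma_vals = model(k_vals, C=2.0, l=3.0, eps=0.01) * (1.0 + 0.05 * rng.normal(size=k_vals.size))

params, _ = curve_fit(model, k_vals, gamma_vals, p0=(1.0, 2.0, 0.0))
print("fitted C, l, eps:", params)
```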

75 It is important to note, however, that the procedure described above only supports our claim in an average-case, rather than worst-case, setting as is needed. [sent-188, score-0.054]

76 Figure 1: (a) γ as a function of the number of regions. [sent-195, score-0.067]

77 (b) A typical complex partition of the unit square used in the simulations. [sent-196, score-0.063]

78 3 Improved Convergence Rates In Section 2 we proved that under certain conditions a weak learner exists with a sufficiently large margin, and thus the first term in (1) indeed converges to zero. [sent-197, score-0.507]

79 We now analyze the second term in (1) and show that it may be made to converge considerably faster, if the first term is made somewhat larger. [sent-198, score-0.084]

80 Remark 2 The most appealing feature of Lemma 2, as of other results for convex combinations, is the fact that the bound does not depend on the number of hypotheses from H defining f ∈ co(H), which may in fact be infinite. [sent-209, score-0.158]

81 [11]) would lead to useless bounds, since the VC dimension of these classes is often huge (possibly infinite). [sent-212, score-0.067]

82 Since recent work has demonstrated the effectiveness of using real-valued hypotheses, we consider the case where the weak classifiers may be confidence-rated, i. [sent-214, score-0.316]

83 Note that the variables Z_i in Definition 1 are no longer binary in this case. [sent-218, score-0.102]

84 Lemma 3 Let the conditions of Lemma 2 hold, except that H is a class of real-valued functions from X to [-1, +1] of pseudo-dimension d_p. [sent-219, score-0.13]

85 Assume further that w_N in Definition 1 obeys a Lipschitz condition of the form |w_N(x) - w_N(x')| ≤ L|x - x'| for every x, x' ∈ X. [sent-220, score-0.14]

86 Then with probability at least 1 - δ, $P_D[Yf(X) \le 0] \le E_S[C_N(Yf(X))] + \epsilon_N$, where $\epsilon_N = O\left(\left[(L B^2 N d_p \log m + \log(N/\delta))/m\right]^{1/2}\right)$. [sent-221, score-0.119]

87 Proof The proof is very similar to the proof of Theorem 2, and will be omitted for the sake of brevity. [sent-222, score-0.25]

88 In this situation, case (ii) provides better overall bounds, often leading to the optimal minimax rates for nonparametric problems (see a discussion of these issues in Sec. [sent-225, score-0.096]

89 We now establish a faster convergence rate to a slightly larger value than E_S[C_N(Yf(X))]. [sent-228, score-0.154]

90 In situations where the latter quantity approaches zero, the overall convergence rate may be improved, as discussed above. [sent-229, score-0.18]

91 Theorem 2 Let D be a distribution over X × {-1, +1}, and let S be a sample of m points chosen independently at random according to D. [sent-232, score-0.213]

92 Let d_p be the pseudo-dimension of the class H, and assume that C_N(α) obeys condition (5). [sent-233, score-0.115]

93 Then for sufficiently large m, with probability at least 1 - δ, every function f ∈ co(H) satisfies the following bound for every 0 < α < 1/(3N): $P_D[Yf(X) \le 0] \le (1 - \alpha N)^{-1}\left(E_S[C_N(Yf(X))] + O\left(\frac{d_p N \log m \,\log(1/\delta)}{m^{\alpha/(d_p + 2\alpha)}}\right)\right)$. [sent-234, score-0.283]

94 Proof The proof combines two ideas. [sent-235, score-0.125]

95 Due to lack of space, we defer the complete proof to the full version of the paper. [sent-238, score-0.125]

96 First, we have shown that, under reasonable conditions, an effective weak classifier exists for one-dimensional problems. [sent-240, score-0.417]

97 We conjectured, and supported our claim by numerical simulations, that such a result holds for multi-dimensional problems as well. [sent-241, score-0.093]

98 The non-trivial extension of the proof to multiple dimensions can be found in [7]. [sent-242, score-0.168]

99 Second, using recent advances in the theory of uniform convergence and boosting we have presented bounds on the generalization error, which may, under certain conditions, be significantly better than standard bounds, being particularly useful in the context of model selection. [sent-243, score-0.738]

100 On the existence of weak learners and applications to boosting. [sent-283, score-0.315]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cn', 0.313), ('boosting', 0.257), ('lemma', 0.227), ('weak', 0.211), ('sgn', 0.195), ('yf', 0.169), ('classifier', 0.154), ('co', 0.126), ('proof', 0.125), ('bounds', 0.117), ('es', 0.115), ('pi', 0.114), ('yi', 0.109), ('learner', 0.108), ('margin', 0.104), ('learners', 0.104), ('mason', 0.104), ('binary', 0.102), ('convergence', 0.101), ('lz', 0.1), ('yj', 0.096), ('bartlett', 0.094), ('improved', 0.091), ('rd', 0.09), ('xi', 0.089), ('vc', 0.086), ('labels', 0.084), ('error', 0.08), ('generalization', 0.079), ('conditions', 0.079), ('situations', 0.079), ('convex', 0.079), ('bound', 0.079), ('ird', 0.078), ('every', 0.076), ('conjecture', 0.072), ('guaranteed', 0.072), ('decision', 0.069), ('schapire', 0.068), ('axi', 0.067), ('gion', 0.067), ('hull', 0.067), ('ilz', 0.067), ('logm', 0.067), ('mannor', 0.067), ('piyil', 0.067), ('pjyjsgn', 0.067), ('sly', 0.067), ('useless', 0.067), ('points', 0.066), ('obeys', 0.064), ('partition', 0.063), ('let', 0.059), ('classifiers', 0.058), ('guessing', 0.057), ('iog', 0.057), ('certain', 0.057), ('rates', 0.056), ('en', 0.055), ('lines', 0.055), ('classification', 0.054), ('claim', 0.054), ('faster', 0.053), ('theorem', 0.053), ('regions', 0.053), ('exists', 0.052), ('lies', 0.052), ('remark', 0.052), ('pii', 0.052), ('least', 0.052), ('xj', 0.051), ('class', 0.051), ('empirical', 0.049), ('connected', 0.049), ('zero', 0.047), ('random', 0.047), ('recent', 0.047), ('behaves', 0.045), ('applies', 0.045), ('clearly', 0.044), ('dimensions', 0.043), ('smaller', 0.043), ('theoretical', 0.043), ('ez', 0.043), ('pe', 0.043), ('ik', 0.043), ('implies', 0.043), ('made', 0.042), ('generate', 0.042), ('drawn', 0.042), ('sample', 0.041), ('recall', 0.041), ('dv', 0.041), ('misclassification', 0.041), ('homogeneous', 0.041), ('useful', 0.04), ('better', 0.04), ('numerical', 0.039), ('turns', 0.039), ('family', 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

Author: Shie Mannor, Ron Meir

Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1

2 0.26746148 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

Author: Richard S. Zemel, Toniann Pitassi

Abstract: In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. Recently adaptive boosting methods for classification problems have been derived as gradient descent algorithms. This formulation justifies key elements and parameters in the methods, all chosen to optimize a single common objective function. We propose an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. We prove that this method reduces training error, and compare its performance to other regression methods. The aim of boosting algorithms is to

3 0.23673254 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano

Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allow us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of h -norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space Sand Y E {-I, I} is a label. Let 9 be a set of functions from S into JR. For 9 E g, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (Xl, Y l ), ... , (Xn, Y n ) that consists ofn i.i.d. copies of (X, Y). The goal ofleaming is to find a predictor 9 E 9 (based on the training data) whose generalization (classification) error JP'{Yg(X) :::; O} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (8, A, P) be a probability space and let F be a class of measurable functions from (8, A) into lR. Let {Xd be a sequence of i.i.d. random variables taking values in (8, A) with common distribution P. Let Pn be the empirical measure based on the sample (Xl,'

4 0.21818954 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

5 0.19705187 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

6 0.1805305 37 nips-2000-Convergence of Large Margin Separable Linear Classification

7 0.17415407 21 nips-2000-Algorithmic Stability and Generalization Performance

8 0.13477273 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

9 0.12119053 75 nips-2000-Large Scale Bayes Point Machines

10 0.11230455 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

11 0.11153577 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

12 0.10947324 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

13 0.10941444 74 nips-2000-Kernel Expansions with Unlabeled Examples

14 0.1065924 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

15 0.10215229 44 nips-2000-Efficient Learning of Linear Perceptrons

16 0.099471726 13 nips-2000-A Tighter Bound for Graphical Models

17 0.099140488 133 nips-2000-The Kernel Gibbs Sampler

18 0.098927163 54 nips-2000-Feature Selection for SVMs

19 0.09670192 94 nips-2000-On Reversing Jensen's Inequality

20 0.093324482 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.322), (1, 0.244), (2, -0.099), (3, -0.064), (4, 0.012), (5, -0.262), (6, 0.068), (7, -0.042), (8, -0.007), (9, 0.006), (10, -0.235), (11, 0.095), (12, 0.135), (13, 0.059), (14, 0.016), (15, 0.076), (16, 0.176), (17, 0.139), (18, 0.114), (19, -0.12), (20, -0.048), (21, 0.035), (22, -0.008), (23, -0.078), (24, 0.074), (25, 0.141), (26, -0.08), (27, 0.057), (28, 0.011), (29, 0.099), (30, -0.102), (31, -0.044), (32, -0.025), (33, 0.079), (34, -0.03), (35, -0.042), (36, 0.075), (37, -0.034), (38, -0.076), (39, 0.033), (40, 0.018), (41, 0.052), (42, -0.042), (43, -0.043), (44, 0.034), (45, -0.064), (46, 0.017), (47, 0.004), (48, 0.042), (49, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96449357 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

Author: Shie Mannor, Ron Meir

Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1

2 0.85114098 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

Author: Richard S. Zemel, Toniann Pitassi

Abstract: In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. Recently adaptive boosting methods for classification problems have been derived as gradient descent algorithms. This formulation justifies key elements and parameters in the methods, all chosen to optimize a single common objective function. We propose an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. We prove that this method reduces training error, and compare its performance to other regression methods. The aim of boosting algorithms is to

3 0.84815037 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano

Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allow us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of h -norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space Sand Y E {-I, I} is a label. Let 9 be a set of functions from S into JR. For 9 E g, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (Xl, Y l ), ... , (Xn, Y n ) that consists ofn i.i.d. copies of (X, Y). The goal ofleaming is to find a predictor 9 E 9 (based on the training data) whose generalization (classification) error JP'{Yg(X) :::; O} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (8, A, P) be a probability space and let F be a class of measurable functions from (8, A) into lR. Let {Xd be a sequence of i.i.d. random variables taking values in (8, A) with common distribution P. Let Pn be the empirical measure based on the sample (Xl,'

4 0.57918978 21 nips-2000-Algorithmic Stability and Generalization Performance

Author: Olivier Bousquet, Andr茅 Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

5 0.53909296 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

6 0.52690011 37 nips-2000-Convergence of Large Margin Separable Linear Classification

7 0.48301339 58 nips-2000-From Margin to Sparsity

8 0.46626547 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

9 0.45771554 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

10 0.42419741 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

11 0.41399115 44 nips-2000-Efficient Learning of Linear Perceptrons

12 0.40235969 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

13 0.36242819 75 nips-2000-Large Scale Bayes Point Machines

14 0.35396701 120 nips-2000-Sparse Greedy Gaussian Process Regression

15 0.34056145 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

16 0.33959621 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

17 0.32878259 54 nips-2000-Feature Selection for SVMs

18 0.32793635 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

19 0.32294348 111 nips-2000-Regularized Winnow Methods

20 0.32256871 74 nips-2000-Kernel Expansions with Unlabeled Examples


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.044), (17, 0.103), (32, 0.347), (33, 0.085), (48, 0.012), (55, 0.014), (62, 0.059), (65, 0.019), (67, 0.055), (75, 0.016), (76, 0.063), (81, 0.025), (90, 0.062), (97, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92685163 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach

Author: Frank Dellaert, Steven M. Seitz, Sebastian Thrun, Charles E. Thorpe

Abstract: When trying to recover 3D structure from a set of images, the most difficult problem is establishing the correspondence between the measurements. Most existing approaches assume that features can be tracked across frames, whereas methods that exploit rigidity constraints to facilitate matching do so only under restricted camera motion. In this paper we propose a Bayesian approach that avoids the brittleness associated with singling out one

same-paper 2 0.91153795 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

Author: Shie Mannor, Ron Meir

Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1

3 0.53061599 60 nips-2000-Gaussianization

Author: Scott Saobing Chen, Ramesh A. Gopinath

Abstract: High dimensional data modeling is difficult mainly because the so-called

4 0.52788919 79 nips-2000-Learning Segmentation by Random Walks

Author: Marina Meila, Jianbo Shi

Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1

5 0.52191573 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ~ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes 0 ((P~21;;2) corrections to separate the data with p-norm margin larger than (1 - 0:) ,,(, where,,( is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

6 0.51829565 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

7 0.51099551 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.50969696 133 nips-2000-The Kernel Gibbs Sampler

9 0.50962776 37 nips-2000-Convergence of Large Margin Separable Linear Classification

10 0.50668836 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

11 0.50516844 75 nips-2000-Large Scale Bayes Point Machines

12 0.49969813 122 nips-2000-Sparse Representation for Gaussian Process Models

13 0.49846068 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

14 0.49515674 22 nips-2000-Algorithms for Non-negative Matrix Factorization

15 0.49174878 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

16 0.48371994 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

17 0.48353636 111 nips-2000-Regularized Winnow Methods

18 0.48267582 52 nips-2000-Fast Training of Support Vector Classifiers

19 0.48005363 92 nips-2000-Occam's Razor

20 0.4791548 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates