nips nips2000 nips2000-3 knowledge-graph by maker-knowledge-mining

3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

Source: pdf

Author: Richard S. Zemel, Toniann Pitassi

Abstract: In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. Recently adaptive boosting methods for classification problems have been derived as gradient descent algorithms. This formulation justifies key elements and parameters in the methods, all chosen to optimize a single common objective function. We propose an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. We prove that this method reduces training error, and compare its performance to other regression methods. The aim of boosting algorithms is to

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Zemel Toniann Pitassi Department of Computer Science University of Toronto Abstract In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. [sent-2, score-0.372]

2 Recently adaptive boosting methods for classification problems have been derived as gradient descent algorithms. [sent-3, score-0.779]

3 This formulation justifies key elements and parameters in the methods, all chosen to optimize a single common objective function. [sent-4, score-0.208]

4 We propose an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. [sent-5, score-1.43]

5 We prove that this method reduces training error, and compare its performance to other regression methods. [sent-6, score-0.204]

6 The aim of boosting algorithms is to "boost" the small advantage that a hypothesis produced by a weak learner can achieve over random guessing, by using the weak learning procedure several times on a sequence of carefully constructed distributions. [sent-7, score-1.008]

7 Boosting methods, notably AdaBoost (Freund & Schapire, 1997), are simple yet powerful algorithms that are easy to implement and yield excellent results in practice. [sent-8, score-0.036]

8 Two crucial elements of boosting algorithms are the way in which a new distribution is constructed for the learning procedure to produce the next hypothesis in the sequence, and the way in which hypotheses are combined to produce a highly accurate output. [sent-9, score-1.123]

9 Recently boosting algorithms have been derived as gradient descent algorithms (Breiman, 1997; Schapire & Singer, 1998; Friedman et al. [sent-11, score-0.767]

10 These formulations justify the parameter values as all serving to optimize a single common objective function. [sent-14, score-0.236]

11 These optimization formulations of boosting originally developed for classification problems have recently been applied to regression problems. [sent-15, score-0.76]

12 However, key properties of these regression boosting methods deviate significantly from the classification boosting approach. [sent-16, score-1.269]

13 We propose a new boosting algorithm for regression problems, also derived from a central objective function, which retains these properties. [sent-17, score-0.948]

14 In this paper, we describe the original boosting algorithm and summarize boosting methods for regression. [sent-18, score-1.205]

15 We present our method and provide a simple proof that elucidates conditions under which convergence on training error can be guaranteed. [sent-19, score-0.157]

16 We propose a probabilistic framework that clarifies the relationship between various optimization-based boosting methods. [sent-20, score-0.54]

17 Finally, we summarize empirical comparisons between our method and others on some standard problems. [sent-21, score-0.055]

18 1 A Brief Summary of Boosting Methods Adaptive boosting methods are simple modular algorithms that operate as follows. [sent-22, score-0.607]

19 The algorithm uses a learning procedure, which has access to n training examples, {(Xl, Y1), . [sent-24, score-0.164]

20 , (xn, Yn)}, drawn randomly from X x Yaccording to distribution D; it outputs a hypothesis I : X -t Y, whose error is the expected value of a loss function on I(x) , g(x), where X is chosen according to D. [sent-27, score-0.269]

21 Given f, cl > 0 and access to random examples, a strong learning procedure outputs with probability 1 - cl a hypothesis with error at most f, with running time polynomial in 1/ f, 1/ cl and the number of examples. [sent-28, score-0.408]

22 A weak learning procedure satisfies the same conditions, but where f need only be better than random guessing. [sent-29, score-0.124]

23 Schapire (1990) showed that any weak learning procedure, denoted WeakLeam, can be efficiently transformed ("boosted") into a strong learning procedure. [sent-30, score-0.081]

24 The AdaBoost algorithm achieves this by calling WeakLeam multiple times, in a sequence of T stages, each time presenting it with a different distribution over a fixed training set and finally combining all of the hypotheses. [sent-31, score-0.167]

25 The algorithm maintains a weight for each training example i at stage i, and a distribution D t is computed by normalizing these weights. [sent-32, score-0.257]

26 At stagei, the distribution D t is given to WeakLeam, which generates a hypothesis It- The error rate ft of It w. [sent-34, score-0.538]

27 The new training distribution is obtained from the new weights: W;+l w: * (ft/ (l - ft))Hf,(x')-y'l After T stages, a test example X will be classified by a combined weighted-majority hypothesis: y = sgn(2::;=1 cdt (x)). [sent-38, score-0.21]

28 Each combination coefficient Ct = log( (1- fd/ fd takes into account the accuracy of hypothesis It with respect to its distribution. [sent-39, score-0.271]

29 The optimization approach derives these equations as all minimizing a common objective function J, the expected error of the combined hypotheses, estimated from the training set. [sent-40, score-0.408]

30 The new hypothesis is the step in function space in the direction of steepest descent of this objective. [sent-41, score-0.235]

31 Similarly, the training distribution is formed by normalizing updated weights: = * exp(-yicdt(x i )) = * exp(s~cdwhere s: = 1 if It (xi) i- yi, else s~ = -1. [sent-43, score-0.196]

32 Note that because the objective function J is multiplicative in the costs of the hypotheses, a key property follows: The objective for each hypothesis is formed simply by re-weighting the training distribution. [sent-44, score-0.732]

33 w:+1 w: w; This boosting algorithm applies to binary classification problems, but it does not readily generalize to regression problems. [sent-45, score-0.764]

34 Intuitively, regression problems present special difficulties because hypotheses may not just be right or wrong, but can be a little wrong or very wrong. [sent-46, score-0.37]

35 Recently a spate of clever optimization-based boosting methods have been proposed for regression (Duffy & Helmbold, 2000; Friedman, 1999; Karakoulas & Shawe-Taylor, 1999; R~itsch et al. [sent-47, score-0.748]

36 While these methods involve diverse objectives and optimization approaches, they are alike in that new hypotheses are formed not by simply changing the example weights, but instead by modifying the target values. [sent-49, score-0.401]

37 As such they can be viewed as forms of forward stage-wise additive models (Hastie & Tibshirani, 1990), which produce hypotheses sequentially to reduce residual error. [sent-50, score-0.301]

38 We study a simple example of this approach, in which hypothesis T is trained not to produce the target output yi on a given case i, but instead to fit the current residual, r~, where r~ = yi - L,;;11 ctft(x). [sent-51, score-0.636]

39 Note that this approach develops a series of hypotheses all based on optimizing a common objective, but it deviates from standard boosting in that the distribution of examples is not used to control the generation of hypotheses, and each hypothesis is not trained to learn the same function. [sent-52, score-1.034]

40 2 An Objective Function for Boosting Regression Problems We derive a boosting algorithm for regression from a different objective function. [sent-53, score-0.918]

41 In the standard AdaBoost algorithm, the combination coefficient Ct can be analytically determined by solving %I; = 0 for Ct. [sent-56, score-0.083]

42 Unfortunately, one cannot analytically determine the combination coefficient Ct in our algorithm, but a simple line search can be used to find value of Ct that minimizes the cost Jt . [sent-57, score-0.119]

43 Finally, optimizing J with respect to y produces a simple linear combination rule for the estimate: fj = L,t Ct It (x) / L,t Ct· We introduce a constant r as a threshold used to demarcate correct from incorrect responses. [sent-59, score-0.106]

44 This threshold is the single parameter of this algorithm that must be chosen in a problem-dependent manner. [sent-60, score-0.066]

45 It is used to judge when the performance of a new hypothesis warrants its inclusion: ft = L,i p~ exp[(ft(x i ) - yi )2 - r] < 1. [sent-61, score-0.614]

46 The algorithm can be summarized as follows: New Boosting Algorithm 1. [sent-62, score-0.066]

47 (X n , Yn ) with Y E ~; • WeakLeam: learning procedure produces a hypothesis h(x) whose accuracy on the training set is judged according to J 2. [sent-67, score-0.331]

48 Choose initial distribution P1(xi) = P~ = w~ = ~ 3. [sent-68, score-0.029]

49 Iterate: • • • • Call WeakLearn - minimize Jt with distribution Pt Accept iff E = ~i P~ exp[(ft(xi ) - yi )2 - r] < 1 t Set a ~ Ct ~ 1 to minimize J t (using line search) Update training distribution n w;+l / L W{+l j=l 4. [sent-69, score-0.428]

50 Estimate output y on input x: Y = L cdt (x)/ L Ct t t 3 Proof of Convergence Theorem: Assume that for all t ~ T , hypothesis t makes error Et on its distribution. [sent-70, score-0.365]

51 We show that the sum of the weights at stage T is bounded above by a constant times the product of the Et'S, while at the same time, for each input i that is incorrect, its corresponding weight w~ at stage T is significant. [sent-72, score-0.111]

52 We now compute the new weights: t L Ct(ft(x i ) - yi )2 = [L ct][Var(fi ) + (yi - yi )2] ~ [L Ct][(yi - yi )2] t t t i )/ ~t Ct and Var(fi ) = ~t ct (h(x i ) - yi )2/ ~t C . [sent-74, score-1.077]

53 Thus, where yi = ~t cth(x t T W~ +l = T (II C;1/2) exp(L Ct(ft (x i ) - T yi )2) ~ T (II C; 1/2) exp([L Ct][(yi _ yi )2]) t=l Now consider an example input k such that the final answer is an error. [sent-75, score-0.502]

54 4 Comparing the Objectives We can compare the objectives by adopting a probabilistic framework. [sent-81, score-0.083]

55 We associate a probability distribution with the output of each hypothesis on input x, and combine them to form a consensus model M by multiplying the distributions: g(y lx, M) == TIt pt(ylx, (1d,where (1t are parameters specific to hypothesis t. [sent-82, score-0.468]

56 An alternative objective can be derived by first normalizing g(y lx, M): TIt pt(ylx, (1d ( I M) = g(ylx , M) = pyx, - J,y,g(ylx,M) - J,y' TI tPt ( y'lx, (1t)dy' TIUs probability model underlies the product-of-experts model (Hinton, 2000) and the logarithmic opinion pool (Bordley, 1982). [sent-85, score-0.26]

57 The objective for this model is: JR=-logp(y* lx,M )=c[y*-f(x)f -~logc (2) TIUs objective corresponds to a type of residual-fitting algorithm. [sent-87, score-0.36]

58 If r(x) [y* - f (x) ] , and {Ct} for t < T are assumed frozen, then training J R is achieved by using r (x) as a target. [sent-88, score-0.072]

59 iT to minimize These objectives can be further compared w. [sent-89, score-0.139]

60 The main term in our objective can be re-expressed: L Ct [yO t ft(X)]2 =L t Ct [yO - f(x)] 2 + L Ct [ft(x) - f( x) ] 2 = bias+variance t Meanwhile, the main term of JR corresponds to the bias term. [sent-94, score-0.18]

61 Hence a new hypothesis can minimize JR by having low error (ft (x) = y*), or with a deviant (ambiguous) response (ft(x) -=F f(x) (Krogh & Vedelsby, 1995). [sent-95, score-0.296]

62 Thus our objective attempts to minimize the average error of the models, while the residual-fitting objective minimizes the error of the average model. [sent-96, score-0.52]

63 25 '------~---4 2 c---~---8 -----c ~ 10 Figure 1: Generalization results for our gradient-based boosting algorithm, compared to the residual-fitting and mixture-of-experts algorithms. [sent-107, score-0.54]

64 Normalized mean-squared error is plotted against the number of stages of boosting (or number of experts for the mixture-of-experts). [sent-109, score-0.711]

65 5 Empirical Tests of Algorithm We report results comparing the performance of our new algorithm with two other algorithms. [sent-110, score-0.066]

66 The first is a residual-fitting algorithm based on the J R objective (Eq. [sent-111, score-0.246]

67 The second algorithm is a version of the mixture-of-experts algorithm aacobs et al. [sent-113, score-0.177]

68 In the standard mixture-of-experts the combination coefficients depend on the input; to make this model comparable to the others, we allowed each expert one input-independent, adaptable coefficient. [sent-116, score-0.08]

69 This algorithm provides a good alternative to the greedy stage-wise methods, in that the experts are trained simultaneously to collectively fit the data. [sent-117, score-0.153]

70 The first is the nonlinear prediction problem F1 (Friedman, 1991), which has 10 independent input variables uniform in [0 , 1]: y = 10 sin( 7rX1X2) + 20(X3 - . [sent-119, score-0.031]

71 In this problem, only five input variables (Xl to X5) have predictive value. [sent-121, score-0.031]

72 We used 400 training examples, and 100 validation and test examples. [sent-123, score-0.112]

73 The second test problem is the standard Boston Housing problem Here there are 506 examples and twelve continuous input variables. [sent-124, score-0.069]

74 We scaled the input variables to be in [0,1], and the outputs to be in [0, 5] . [sent-125, score-0.031]

75 We used 400 of the examples for training, 50 for validation, and the remainder to test generalization. [sent-126, score-0.038]

76 We used neural networks as the hypotheses and back-propagation as the learning procedure to train them. [sent-127, score-0.219]

77 Each network had a layer of tanhO units between the input units and a single linear output. [sent-128, score-0.031]

78 For each algorithm, we used early stopping with a validation set in order to reduce over-fitting in the hypotheses. [sent-129, score-0.04]

79 One finding was that the other algorithms out-performed ours when the hypotheses were simple: when the weak learners had only one or two hidden nodes, the residual-fitting algorithm reduced test error. [sent-130, score-0.397]

80 With more hidden nodes the relative performance of our algorithm improved. [sent-131, score-0.066]

81 Figure 1 shows average results for threehidden-unit networks over 20 runs of each algorithm on the two problems, with examples randomly assigned to the three sets on each run. [sent-132, score-0.104]

82 Overall, the residual-fitting algorithm exhibited more over-fitting than our method. [sent-135, score-0.066]

83 Over-fitting in these approaches may be tempered: a regularization technique known as shrinkage, which scales combination coefficients by a fractional parameter, has been found to im- prove generalization in gradient boosting applications to classification (Friedman, 1999). [sent-136, score-0.679]

84 Finally, the mixture-of-experts algorithm generally out-performed the sequential training algorithm. [sent-137, score-0.138]

85 A drawback of this method is the need to specify the number of hypotheses in advance; however, given that number, simultaneous training is likely less prone to local minima than the sequential approaches. [sent-138, score-0.289]

86 6 Conclusion We have proposed a new boosting algorithm for regression problems. [sent-139, score-0.738]

87 Like several recent boosting methods for regression, the parameters and updates can be derived from a single common objective. [sent-140, score-0.629]

88 Unlike these methods, our algorithm forms new hypotheses by simply modifying the distribution over training examples. [sent-141, score-0.374]

89 Preliminary empirical comparisons have suggested that our method will not perform as well as a residual-fitting approach for simple hypotheses, but it works well for more complex ones, and it seems less prone to over-fitting. [sent-142, score-0.068]

90 The lack of over-fitting in our method can be traced to the inherent bias-variance tradeoff, as new hypotheses are forced to resemble existing ones if they cannot improve the combined estimate. [sent-143, score-0.223]

91 The combination coefficients can be input-dependent: a learner returns not only ft(x i ) but also kt(x i ) E [0 ,1], a measure of confidence in its prediction. [sent-145, score-0.119]

92 This elaboration makes the weak learning task harder, but may extend the applicability of the algorithm: letting each learner focus on a subset of its weighted training distribution permits a divide-and-conquer approach to function approximation. [sent-146, score-0.221]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('boosting', 0.54), ('ct', 0.449), ('ft', 0.269), ('hypothesis', 0.188), ('objective', 0.18), ('hypotheses', 0.176), ('yi', 0.157), ('regression', 0.132), ('adaboost', 0.12), ('schapire', 0.114), ('exp', 0.103), ('weakleam', 0.096), ('lx', 0.088), ('friedman', 0.084), ('objectives', 0.083), ('weak', 0.081), ('training', 0.072), ('tibshirani', 0.072), ('stages', 0.067), ('algorithm', 0.066), ('pt', 0.063), ('cdt', 0.062), ('ylx', 0.062), ('minimize', 0.056), ('error', 0.052), ('hastie', 0.052), ('jr', 0.052), ('experts', 0.052), ('normalizing', 0.05), ('combination', 0.049), ('jt', 0.049), ('tit', 0.049), ('bordley', 0.048), ('heskes', 0.048), ('karakoulas', 0.048), ('rounds', 0.048), ('tius', 0.048), ('vedelsby', 0.048), ('yo', 0.048), ('combined', 0.047), ('descent', 0.047), ('formed', 0.045), ('et', 0.045), ('procedure', 0.043), ('breiman', 0.041), ('duffy', 0.041), ('helmbold', 0.041), ('housing', 0.041), ('prone', 0.041), ('multiplicative', 0.04), ('validation', 0.04), ('stage', 0.04), ('freund', 0.039), ('learner', 0.039), ('adaptive', 0.038), ('examples', 0.038), ('boost', 0.038), ('colt', 0.038), ('krogh', 0.038), ('learners', 0.038), ('mason', 0.038), ('singer', 0.038), ('var', 0.038), ('cost', 0.036), ('algorithms', 0.036), ('boston', 0.035), ('target', 0.035), ('trained', 0.035), ('problems', 0.034), ('coefficient', 0.034), ('additive', 0.033), ('cl', 0.033), ('proof', 0.033), ('gradient', 0.033), ('geman', 0.033), ('produce', 0.032), ('output', 0.032), ('methods', 0.031), ('hinton', 0.031), ('modifying', 0.031), ('residual', 0.031), ('coefficients', 0.031), ('input', 0.031), ('derived', 0.03), ('minimizing', 0.029), ('distribution', 0.029), ('iff', 0.029), ('incorrect', 0.029), ('sequentially', 0.029), ('common', 0.028), ('formulations', 0.028), ('summarize', 0.028), ('wrong', 0.028), ('yk', 0.028), ('xi', 0.028), ('produces', 0.028), ('comparisons', 0.027), ('costs', 0.027), ('classification', 0.026), ('access', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999851 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

Author: Richard S. Zemel, Toniann Pitassi

2 0.26746148 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

Author: Shie Mannor, Ron Meir

Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1

3 0.14694262 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano

Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allow us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of h -norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space Sand Y E {-I, I} is a label. Let 9 be a set of functions from S into JR. For 9 E g, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (Xl, Y l ), ... , (Xn, Y n ) that consists ofn i.i.d. copies of (X, Y). The goal ofleaming is to find a predictor 9 E 9 (based on the training data) whose generalization (classification) error JP'{Yg(X) :::; O} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (8, A, P) be a probability space and let F be a class of measurable functions from (8, A) into lR. Let {Xd be a sequence of i.i.d. random variables taking values in (8, A) with common distribution P. Let Pn be the empirical measure based on the sample (Xl,'

4 0.092713274 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

5 0.08940132 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

Author: Dörthe Malzahn, Manfred Opper

Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models. 1

6 0.080570027 120 nips-2000-Sparse Greedy Gaussian Process Regression

7 0.077469796 21 nips-2000-Algorithmic Stability and Generalization Performance

8 0.075621977 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

9 0.072954074 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

10 0.066575378 75 nips-2000-Large Scale Bayes Point Machines

11 0.066193603 74 nips-2000-Kernel Expansions with Unlabeled Examples

12 0.065013342 82 nips-2000-Learning and Tracking Cyclic Human Motion

13 0.064494982 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

14 0.059549928 85 nips-2000-Mixtures of Gaussian Processes

15 0.059527025 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

16 0.05877696 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

17 0.057950702 111 nips-2000-Regularized Winnow Methods

18 0.05524252 122 nips-2000-Sparse Representation for Gaussian Process Models

19 0.054901987 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

20 0.052348327 92 nips-2000-Occam's Razor

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.205), (1, 0.105), (2, -0.008), (3, -0.039), (4, 0.033), (5, -0.148), (6, 0.032), (7, -0.008), (8, 0.02), (9, -0.026), (10, -0.193), (11, 0.073), (12, 0.207), (13, 0.014), (14, 0.043), (15, 0.094), (16, 0.11), (17, 0.199), (18, 0.183), (19, -0.196), (20, -0.066), (21, 0.006), (22, 0.058), (23, -0.09), (24, 0.022), (25, 0.132), (26, -0.167), (27, 0.12), (28, -0.039), (29, 0.109), (30, -0.081), (31, 0.003), (32, -0.079), (33, 0.145), (34, -0.067), (35, 0.006), (36, 0.065), (37, -0.071), (38, -0.143), (39, 0.114), (40, -0.025), (41, 0.112), (42, -0.06), (43, -0.056), (44, 0.08), (45, -0.077), (46, -0.038), (47, 0.14), (48, 0.136), (49, -0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95889807 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

Author: Richard S. Zemel, Toniann Pitassi

2 0.75204587 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

Author: Shie Mannor, Ron Meir

3 0.61414516 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano

4 0.39138359 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

Author: Dörthe Malzahn, Manfred Opper

5 0.36839581 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

Author: Igor V. Cadez, Padhraic Smyth

Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within firstorder as a function of model complexity. This general property of

6 0.36436707 85 nips-2000-Mixtures of Gaussian Processes

7 0.35943484 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

8 0.28320023 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

9 0.27850786 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

10 0.2681725 82 nips-2000-Learning and Tracking Cyclic Human Motion

11 0.26463592 120 nips-2000-Sparse Greedy Gaussian Process Regression

12 0.24722192 21 nips-2000-Algorithmic Stability and Generalization Performance

13 0.24631222 37 nips-2000-Convergence of Large Margin Separable Linear Classification

14 0.24232289 74 nips-2000-Kernel Expansions with Unlabeled Examples

15 0.24125373 44 nips-2000-Efficient Learning of Linear Perceptrons

16 0.23223178 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

17 0.22783983 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

18 0.2269779 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

19 0.2265617 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

20 0.21822095 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.029), (17, 0.077), (32, 0.039), (33, 0.034), (36, 0.012), (55, 0.033), (62, 0.034), (65, 0.019), (67, 0.046), (75, 0.014), (76, 0.052), (79, 0.014), (81, 0.015), (90, 0.474), (97, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96741331 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

Author: Richard S. Zemel, Toniann Pitassi

2 0.94555819 18 nips-2000-Active Support Vector Machine Classification

Author: Olvi L. Mangasarian, David R. Musicant

Abstract: An active set strategy is applied to the dual of a simple reformulation of the standard quadratic program of a linear support vector machine. This application generates a fast new dual algorithm that consists of solving a finite number of linear equations, with a typically large dimensionality equal to the number of points to be classified. However, by making novel use of the Sherman-MorrisonWoodbury formula , a much smaller matrix of the order of the original input space is inverted at each step. Thus, a problem with a 32-dimensional input space and 7 million points required inverting positive definite symmetric matrices of size 33 x 33 with a total running time of 96 minutes on a 400 MHz Pentium II. The algorithm requires no specialized quadratic or linear programming code, but merely a linear equation solver which is publicly available. 1

3 0.90447414 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra

Author: Paul M. Hayton, Bernhard Schölkopf, Lionel Tarassenko, Paul Anuzis

Abstract: A system has been developed to extract diagnostic information from jet engine carcass vibration data. Support Vector Machines applied to novelty detection provide a measure of how unusual the shape of a vibration signature is, by learning a representation of normality. We describe a novel method for Support Vector Machines of including information from a second class for novelty detection and give results from the application to Jet Engine vibration analysis.

4 0.83493024 81 nips-2000-Learning Winner-take-all Competition Between Groups of Neurons in Lateral Inhibitory Networks

Author: Xiaohui Xie, Richard H. R. Hahnloser, H. Sebastian Seung

Abstract: It has long been known that lateral inhibition in neural networks can lead to a winner-take-all competition, so that only a single neuron is active at a steady state. Here we show how to organize lateral inhibition so that groups of neurons compete to be active. Given a collection of potentially overlapping groups, the inhibitory connectivity is set by a formula that can be interpreted as arising from a simple learning rule. Our analysis demonstrates that such inhibition generally results in winner-take-all competition between the given groups, with the exception of some degenerate cases. In a broader context, the network serves as a particular illustration of the general distinction between permitted and forbidden sets, which was introduced recently. From this viewpoint, the computational function of our network is to store and retrieve memories as permitted sets of coactive neurons. In traditional winner-take-all networks, lateral inhibition is used to enforce a localized, or

5 0.69626385 52 nips-2000-Fast Training of Support Vector Classifiers

Author: Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Angel Navia-Vázquez, Antonio Artés-Rodríguez

Abstract: In this communication we present a new algorithm for solving Support Vector Classifiers (SVC) with large training data sets. The new algorithm is based on an Iterative Re-Weighted Least Squares procedure which is used to optimize the SVc. Moreover, a novel sample selection strategy for the working set is presented, which randomly chooses the working set among the training samples that do not fulfill the stopping criteria. The validity of both proposals, the optimization procedure and sample selection strategy, is shown by means of computer experiments using well-known data sets. 1 INTRODUCTION The Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition problems [13, 14] in such a way that the solution is completely described as a linear combination of several training samples, named the Support Vectors. The training procedure for solving the SVC is usually based on Quadratic Programming (QP) which presents some inherent limitations, mainly the computational complexity and memory requirements for large training data sets. This problem is typically avoided by dividing the QP problem into sets of smaller ones [6, 1, 7, 11], that are iteratively solved in order to reach the SVC solution for the whole set of training samples. These schemes rely on an optimizing engine, QP, and in the sample selection strategy for each sub-problem, in order to obtain a fast solution for the SVC. An Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as an alternative solver for the SVC [10] and the Support Vector Regressor [9], being computationally efficient in absolute terms. In this communication, we will show that the IRWLS algorithm can replace the QP one in any chunking scheme in order to find the SVC solution for large training data sets. Moreover, we consider that the strategy to decide which training samples must j oin the working set is critical to reduce the total number of iterations needed to attain the SVC solution, and the runtime complexity as a consequence. To aim for this issue, the computer program SV cradit have been developed so as to solve the SVC for large training data sets using IRWLS procedure and fixed-size working sets. The paper is organized as follows. In Section 2, we start by giving a summary of the IRWLS procedure for SVC and explain how it can be incorporated to a chunking scheme to obtain an overall implementation which efficiently deals with large training data sets. We present in Section 3 a novel strategy to make up the working set. Section 4 shows the capabilities of the new implementation and they are compared with the fastest available SVC implementation, SV Mlight [6]. We end with some concluding remarks. 2 IRWLS-SVC In order to solve classification problems, the SVC has to minimize Lp = ~llwI12+CLei- LJliei- LQi(Yi(¢(xifw+b)-l+ei) (1) i i i with respectto w, band ei and maximize it with respectto Qi and Jli, subject to Qi, Jli ~ 0, where ¢(.) is a nonlinear transformation (usually unknown) to a higher dimensional space and C is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker (KKT) conditions [2]. For further details on the SVC, one can refer to the tutorial survey by Burges [2] and to the work ofVapnik [13, 14]. In order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that the terms depending on ei can be removed because, at the solution C - Qi - Jli = 0 Vi (one of the KKT conditions [2]) must hold. Lp = 1 Qi(l- Yi(¢T(Xi)W + b)) 211wl12 + L i = (2) where The weighted least square nature of (2) can be understood if ei is defined as the error on each sample and ai as its associated weight, where! IIwl1 2 is a regularizing functional. The minimization of (2) cannot be accomplished in a single step because ai = ai(ei), and we need to apply an IRWLS procedure [4], summarized below in tree steps: 1. Considering the ai fixed, minimize (2). 2. Recalculate ai from the solution on step 1. 3. Repeat until convergence. In order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure does, we require that w = Ei (JiYi¢(Xi) and in order to obtain a non-zero b, that Ei {JiYi = O. Substituting them into (2), its minimum with respect to {Ji and b for a fixed set of ai is found by solving the following linear equation system l (3) IThe detailed description of the steps needed to obtain (3) from (2) can be found in [10]. where y = [Yl, Y2, ... Yn]T (4) 'r/i,j = 1, ... ,n 'r/i,j = 1, ... ,n (H)ij = YiYj¢T(Xi)¢(Xj) = YiyjK(Xi,Xj) (Da)ij = aio[i - j] 13 = [,81, ,82, ... (5) (6) (7) , ,8n]T and 0[·] is the discrete impulse function. Finally, the dependency of ai upon the Lagrange multipliers is eliminated using the KKT conditions, obtaining a, ai 2.1 ={~ ei Yi' eiYi < Yt.et. > - ° ° (8) IRWLS ALGORITHMIC IMPLEMENTATION The SVC solution with the IRWLS procedure can be simplified by dividing the training samples into three sets. The first set, SI, contains the training samples verifying < ,8i < C, which have to be determined by solving (3). The second one, S2, includes every training sample whose,8i = 0. And the last one, S3, is made up of the training samples whose ,8i = C. This division in sets is fully justified in [10]. The IRWLS-SVC algorithm is shown in Table 1. ° 0. Initialization: SI will contain every training sample, S2 = 0 and S3 = 0. Compute H. e_a = y, f3_a = 0, b_a = 0, G 13 = Gin, a = 1 and G b3 = G bi n . 1 Solve [ (H)Sb S1 + D(al S1 . =° = e-lt a, 3. ai = { ~ (13) S2 2. e ° 1[ (Y)Sl (f3)Sl ] (y ) ~1 b and (13) Ss = C DyH(f3 - f3_a) - (b - b_a)1 =[1- G 13 ] G b3 ' °. eiYi < e- _ > O'r/Z E SI U S2 U S3 tYt 4. Sets reordering: a. Move every sample in S3 with eiYi < to S2. b. Move every sample in SI with ,8i = C to S3. c. Move every sample in SI with ai = to S2 . d. Move every sample in S2 with ai :I to SI. 5. e_a = e, f3_a = 13, G 13 = (H)Sl,SS (f3)ss + (G in )Sl' b-lt = band Gb3 = -y~s (f3)ss + Gbin · 6. Go to step 1 and repeat until convergence. ei Yi ' ° ° ° Table 1: IRWLS-SVC algorithm. The IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunk:ing scheme as the one proposed in [8, 6], such that it can be directly applied in the one proposed in [1]. A chunking scheme is needed to solve the SVC whenever H is too large to fit into memory. In those cases, several SVC with a reduced set of training samples are iteratively solved until the solution for the whole set is found. The samples are divide into a working set, Sw, which is solved as a full SVC problem, and an inactive set, Sin. If there are support vectors in the inactive set, as it might be, the inactive set modifies the IRWLSSVC procedure, adding a contribution to the independent term in the linear equation system (3) . Those support vectors in S in can be seen as anchored samples in S3, because their ,8i is not zero and can not be modified by the IRWLS procedure. Then, such contribution (Gin and G bin ) will be calculated as G 13 and G b3 are (Table 1, 5th step), before calling the IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider Gin and G bin , which must be set to zero if the Hessian matrix, H, fits into memory for the whole set of training samples. The resolution of the SVC for large training data sets, employing as minimization engine the IRWLS procedure, is summarized in the following steps: 1. Select the samples that will form the working set. 2. Construct Gin = (H)Sw,Sin (f3)s.n and G bin = -yIin (f3)Sin 3. Solve the IRWLS-SVC procedure, following the steps in Table 1. 4. Compute the error of every training sample. 5. If the stopping conditions Yiei < C eiYi> -c leiYil < C 'Vii 'Vii 'Vii (Ji = 0 (Ji = C 0 < (Ji < C (9) (10) (11) are fulfilled, the SVC solution has been reached. The stopping conditions are the ones proposed in [6] and C must be a small value around 10 - 3 , a full discussion concerning this topic can be found in [6]. 3 SAMPLE SELECTION STRATEGY The selection of the training samples that will constitute the working set in each iteration is the most critical decision in any chunking scheme, because such decision is directly involved in the number of IRWLS-SVC (or QP-SVC) procedures to be called and in the number of reproducing kernel evaluations to be made, which are, by far, the two most time consuming operations in any chunking schemes. In order to solve the SVC efficiently, we first need to define a candidate set of training samples to form the working set in each iteration. The candidate set will be made up, as it could not be otherwise, with all the training samples that violate the stopping conditions (9)-(11); and we will also add all those training samples that satisfy condition (11) but a small variation on their error will make them violate such condition. The strategies to select the working set are as numerous as the number of problems to be solved, but one can think three different simple strategies: • Select those samples which do not fulfill the stopping criteria and present the largest Iei I values. • Select those samples which do not fulfill the stopping criteria and present the smallest Iei I values. • Select them randomly from the ones that do not fulfill the stopping conditions. The first strategy seems the more natural one and it was proposed in [6]. If the largest leil samples are selected we guanrantee that attained solution gives the greatest step towards the solution of (1). But if the step is too large, which usually happens, it will cause the solution in each iteration and the (Ji values to oscillate around its optimal value. The magnitude of this effect is directly proportional to the value of C and q (size of the working set), so in the case ofsmall C (C < 10) and low q (q < 20) it would be less noticeable. The second one is the most conservative strategy because we will be moving towards the solution of (1) with small steps. Its drawback is readily discerned if the starting point is inappropriate, needing too many iterations to reach the SVC solution. The last strategy, which has been implemented together with the IRWLS-SVC procedure, is a mid-point between the other two, but if the number of samples whose 0 < (3i < C increases above q there might be some iterations where we will make no progress (working set is only made up of the training samples that fulfill the stopping condition in (11)). This situation is easily avoided by introducing one sample that violates each one of the stopping conditions per class. Finally, if the cardinality of the candidate set is less than q the working set is completed with those samples that fulfil the stopping criteria conditions and present the least leil. In summary, the sample selection strategy proposed is 2 : 1. Construct the candidate set, Se with those samples that do not fulfill stopping conditions (9) and (10), and those samples whose (3 obeys 0 < (3i < C. 2. IfISel < ngot05. 3. Choose a sample per class that violates each one of the stopping conditions and move them from Se to the working set, SW. 4. Choose randomly n - ISw I samples from Se and move then to SW. Go to Step 6. 5. Move every sample form Se to Sw and then-ISwl samples that fulfill the stopping conditions (9) and (10) and present the lowest leil values are used to complete SW . 6. Go on, obtaining Gin and Gbin. 4 BENCHMARK FOR THE IRWLS-SVC We have prepared two different experiments to test both the IRWLS and the sample selection strategy for solving the SVc. The first one compares the IRWLS against QP and the second one compares the samples selection strategy, together with the IRWLS, against a complete solving procedure for SVC, the SV Mlight. In the first trial, we have replaced the LOQO interior point optimizer used by SV M1ig ht version 3.02 [5] by the IRWLS-SVC procedure in Table 1, to compare both optimizing engines with equal samples selection strategy. The comparison has been made over a Pentium ill-450MHz with 128Mb running on Window98 and the programs have been compiled using Microsoft Developer 6.0. In Table 2, we show the results for two data sets: the first q 20 40 70 Adult44781 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 21.25 20.70 0.61 0.39 20.60 19.22 1.01 0.17 21.15 18.72 2.30 0.46 Splice 2175 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 46.19 30.76 21.94 4.77 71.34 24.93 46.26 8.07 53.77 20.32 34.24 7.72 Table 2: CPU Time indicates the consume time in seconds for the whole procedure. The Optimize Time indicates the consume time in second for the LOQO or IRWLS procedure. one, containing 4781 training samples, needs most CPU resources to compute the RKHS and the second one, containing 2175 training samples, uses most CPU resources to solve the SVC for each Sw, where q indicates the size of the working set. The value of C has 2In what follows, I . I represents absolute value for numbers and cardinality for sets been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, where its parameter a has been set, respectively, to 10 and 70. As it can be seen, the SV M1ig ht with IRWLS is significantly faster than the LOQO procedure in all cases. The kernel cache size has been set to 64Mb for both data sets and for both procedures. The results in Table 2 validates the IRWLS procedure as the fastest SVC solver. For the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection in Section 3, we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from the SV Mlight [6] for our computer program. To test these two programs several data sets have been used. The Adult and Web data sets have been obtained from 1. Platt's web page http://research.microsoft.comr jplatt/smo.html/; the Gauss-M data set is a two dimensional classification problem proposed in [3] to test neural networks, which comprises a gaussian random variable for each class, which highly overlap. The Banana, Diabetes and Splice data sets have been obtained from Gunnar Ratsch web page http://svm.first.gmd.der raetschl. The selection of C and the RKHS has been done as indicated in [11] for Adult and Web data sets and in http://svm.first.gmd.derraetschl for Banana, Diabetes and Splice data sets. In Table 3, we show the runtime complexity for each data set, where the value of q has been elected as the one that reduces the runtime complexity. Database Dim Adult6 Adult9 Adult! Web 1 Web7 Gauss-M Gauss-M Banana Banana Diabetes Splice 123 123 123 300 300 2 2 2 2 8 69 N Sampl. 11221 32562 1605 2477 24693 4000 4000 400 4900 768 2175 C a SV 1 1 1000 5 5 1 100 316.2 316.2 10 1000 10 10 10 10 10 1 1 1 1 2 70 4477 12181 630 224 1444 1736 1516 80 1084 409 525 q CPU time radit light radit light 150 130 100 100 150 70 100 40 70 40 150 40 70 10 10 10 10 10 70 40 10 20 118.2 1093.29 25.98 2.42 158.13 12.69 61.68 0.33 22.46 2.41 14.06 124.46 1097.09 113.54 2.36 124.57 48.28 3053.20 0.77 1786.56 6.04 49.19 Table 3: Several data sets runtime complexity, when solved with the short, and SV Mlight, light for short. s v c radit , radit for One can appreciate that the svcradit is faster than the SV M1ig ht for most data sets. For the Web data set, which is the only data set the SV Mlight is sligthly faster, the value of C is low and most training samples end up as support vector with (3i < C. In such cases the best strategy is to take the largest step towards the solution in every iteration, as the SV Mlig ht does [6], because most training samples (3i will not be affected by the others training samples (3j value. But in those case the value of C increases the SV c radit samples selection strategy is a much more appropriate strategy than the one used in SV Mlight. 5 CONCLUSIONS In this communication a new algorithm for solving the SVC for large training data sets has been presented. Its two major contributions deal with the optimizing engine and the sample selection strategy. An IRWLS procedure is used to solve the SVC in each step, which is much faster that the usual QP procedure, and simpler to implement, because the most difficult step is the linear equation system solution that can be easily obtained by LU decomposition means [12]. The random working set selection from the samples not fulfilling the KKT conditions is the best option if the working is be large, because it reduces the number of chunks to be solved. This strategy benefits from the IRWLS procedure, which allows to work with large training data set. All these modifications have been concreted in the svcradit solving procedure, publicly available at http://svm.tsc.uc3m.es/. 6 ACKNOWLEDGEMENTS We are sincerely grateful to Thorsten Joachims who has allowed and encouraged us to use his SV Mlight to test our IRWLS procedure, comparisons which could not have been properly done otherwise. References [1] B. E. Boser, I. M . Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburg, U.S.A., 1992. [2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. [3] S. Haykin. Neural Networks: A comprehensivefoundation. Prentice-Hall, 1994. [4] P. W. Holland and R. E. Welch. Robust regression using iterative re-weighted least squares. Communications of Statistics Theory Methods, A6(9):813-27, 1977. [5] T. Joachims. http://www-ai.infonnatik.uni-dortmund.de/forschung/verfahren Isvmlight Isvmlight.eng.html. Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research Center on 'Complexity Reduction in Multivariate Data', 1998. [6] T. Joachims. Making Large Scale SVM Learning Practical, In Advances in Kernel Methods- Support Vector Learning, Editors SchOlkopf, B., Burges, C. 1. C. and Smola, A. 1., pages 169-184. M.I.T. Press, 1999. [7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal Processing, pages 276-285, Amelia Island, U.S.A, 1997. [8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In ICPR'98, Brisbane, Australia, August 1998. [9] F. Perez-Cruz, A. Navia-Vazquez

6 0.59775394 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

7 0.54998285 111 nips-2000-Regularized Winnow Methods

8 0.54704154 74 nips-2000-Kernel Expansions with Unlabeled Examples

9 0.54537994 21 nips-2000-Algorithmic Stability and Generalization Performance

10 0.54295439 4 nips-2000-A Linear Programming Approach to Novelty Detection

11 0.53491616 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

12 0.53207558 100 nips-2000-Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks

13 0.52834547 75 nips-2000-Large Scale Bayes Point Machines

14 0.5158267 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

15 0.50673604 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

16 0.4986982 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

17 0.48880276 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

18 0.48127174 133 nips-2000-The Kernel Gibbs Sampler

19 0.48065454 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

20 0.47878641 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition