jmlr jmlr2005 jmlr2005-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
Reference: text
sentIndex sentText sentNum sentScore
1 The second is that train set bounds can sometimes be used to directly motivate learning algorithms. [sent-6, score-0.215]
2 draw of some sample, the expected future error rate of a classifier is bounded by f (δ, error rate on sample)”. [sent-18, score-0.218]
3 This is an error-prone practice, and the test set bound in Section 3 implies a better method by nearly any metric. [sent-25, score-0.215]
4 After discussing the test set bound we cover the Occam’s Razor bound, the simplest train set bound, which explains (and quantifies) the common phenomenon of overfitting. [sent-27, score-0.348]
5 We show that the Occam’s Razor bound cannot be improved without incorporating extra information, and apply the bound to decision trees. [sent-29, score-0.268]
6 Next, we discuss two train set bounds, the PAC-Bayes bound and the sample compression bound, which have proved to give practical results for more general classifiers, such as support vector machines and neural networks. [sent-30, score-0.421]
7 The formal model and test set bound must be understood in order to appreciate all later results. [sent-50, score-0.215]
8 There is no particular dependency between the various train set bounds we present. [sent-51, score-0.188]
9 Sometimes, we decorate these objects with labels like S_train (a train set) or S_test (a test set). [sent-73, score-0.214]
10 The empirical error is sometimes called the “training error”, “test error”, or “observed error” depending on whether it is the error on a training set, test set, or a more general set. [sent-98, score-0.253]
11 Example 2 (continued) The classifier c which always predicts “not rain” might have an empirical error of 38 out of 100 examples and an unknown true error rate (which might in fact be 0. [sent-99, score-0.253]
12 The first subsection presents the basic upper bound on the true error rate, handy approximations, and a lower bound. [sent-116, score-0.221]
13 The second subsection discusses the implications of the test set bound on error reporting practice. [sent-118, score-0.335]
14 This is the familiar binomial distribution from statistics, so it should not be surprising that the bounds presented here depend fundamentally on the cumulative distribution of a binomial. [sent-136, score-0.275]
15 With these definitions, we can interpret the binomial tail as the probability of an empirical error less than or equal to k. [sent-145, score-0.293]
16 Since we are interested in calculating a bound on the true error given a confidence δ and an empirical error ĉ_S, it is handy to define the inversion of a binomial tail. [sent-146, score-0.426]
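To make the inversion concrete, here is a minimal sketch (my own illustration, not code from the paper; the function names are hypothetical and SciPy is assumed) that computes the binomial tail and inverts it by bisection to obtain the test set upper bound on the true error rate:

```python
from scipy.stats import binom

def binomial_tail(m, k, p):
    """Probability of observing at most k errors on m examples when the true error rate is p."""
    return binom.cdf(k, m, p)

def test_set_bound(m, k, delta):
    """Invert the binomial tail: the largest true error rate p for which
    observing at most k errors still has probability at least delta."""
    lo, hi = 0.0, 1.0
    for _ in range(60):                  # bisection; the tail is decreasing in p
        mid = (lo + hi) / 2.0
        if binomial_tail(m, k, mid) >= delta:
            lo = mid                     # this p is still consistent with the observation
        else:
            hi = mid
    return lo

# Hypothetical example: 38 errors on 100 test examples, confidence 1 - delta = 0.95.
print(test_set_bound(100, 38, 0.05))     # roughly 0.47
```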
17 [Figure residue: legend labels “empirical error” and “true error bound”.] [sent-174, score-0.399]
18 The third figure shows the binomials consistent with the tail cut and observed test error. [sent-189, score-0.235]
19 The worst case over all true error rates is the consistent binomial with the largest bias. [sent-190, score-0.203]
20 Since each example is picked independently, the distribution of the empirical error is a binomial distribution. [sent-195, score-0.205]
21 Assuming (correctly with probability 1 − δ) that the empirical error is not in the binomial tail, we can constrain (and therefore bound) the value of the true error cD . [sent-197, score-0.292]
22 The test set bound is, essentially, perfectly tight. [sent-198, score-0.215]
23 Approximations. There are several immediate corollaries of the test set bound (3. [sent-207, score-0.215]
24 Proof Specializing the test set bound (Theorem 3. [sent-212, score-0.215]
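For intuition, a minimal sketch of the zero-error (realizable) specialization that corollaries of this kind give, under the standard argument (my reconstruction, not the paper's exact statement):

```latex
% If the empirical test error is zero, the observation has probability (1-c_D)^m \le e^{-m c_D}.
% Requiring this probability to be at least \delta yields, with probability at least 1-\delta:
\hat{c}_S = 0 \;\Longrightarrow\; c_D \;\le\; \frac{\ln(1/\delta)}{m}.
```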
25 Approximations which hold for arbitrary (nonzero) error rates rely upon the Chernoff bound which we state next, for completeness. [sent-216, score-0.245]
26 Using this, we get
f(λ*) = (k/m) ln[ p(1 − k/m) / ((k/m)(1 − p)) ] + ln[ (1 − p)/(1 − k/m) ]
      = (k/m) ln( p/(k/m) ) + (1 − k/m) ln( (1 − p)/(1 − k/m) )
      = −KL( k/m ∥ p ). [sent-230, score-0.352]
27 Using the Chernoff bound, we can loosen the test set bound to achieve a more analytic form. [sent-231, score-0.215]
28 The agnostic test set bound can be further loosened by bounding the value of KL(q||p). [sent-239, score-0.281]
29 Pr_{S∼D^m}( c_D ≤ ĉ_S + sqrt( ln(1/δ) / (2m) ) ) ≥ 1 − δ. Proof: Use the approximation KL(k/m ∥ c_D) ≥ 2(c_D − k/m)^2 with the Chernoff bound and test set bounds to get the result. [sent-242, score-0.27]
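To see what the relaxation costs, a small sketch (my own; the closed-form square-root bound below is the loosening obtained from KL(q ∥ p) ≥ 2(p − q)²) comparing the relative-entropy form with the square-root form:

```python
import math

def bernoulli_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_inverse(q_hat, rhs):
    """Largest p >= q_hat with KL(q_hat || p) <= rhs (relative-entropy Chernoff form)."""
    lo, hi = q_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(q_hat, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

m, k, delta = 100, 5, 0.05
q_hat, rhs = k / m, math.log(1.0 / delta) / m
print(kl_upper_inverse(q_hat, rhs))     # relative-entropy bound
print(q_hat + math.sqrt(rhs / 2.0))     # looser sqrt form, from KL >= 2 (p - q)^2
```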
30 The differences between the agnostic and realizable case are fundamentally related to the decrease in the variance of a binomial as the bias (i.e., the true error rate) approaches 0. [sent-243, score-0.24]
31 Note that this implies using the exact binomial tail calculation can result in functional (rather than merely constant) improvements on the above corollary. [sent-246, score-0.233]
32 The State of the Art. Although the test set bound is very well understood, the same cannot be said of other testing methods. [sent-266, score-0.215]
33 When attempting to calculate a confidence interval on the true error rate given the test set, many people follow a standard statistical prescription: (1) calculate the empirical error rate ĉ_S, then report a Gaussian-approximation confidence interval around it. [sent-274, score-0.278]
34 The basic problem with this approach is that the binomial distribution is not similar to a Gaussian when the error rate is near 0. [sent-288, score-0.225]
35 The test set bound can satisfy this requirement (and, in fact, operates well for all true error regimes). [sent-290, score-0.302]
36 The test set bound based confidence interval always returns an upper and lower bound in [0, 1]. [sent-294, score-0.377]
37 It is now simple and easy to calculate a bound based upon the cumulative distribution of the binomial (see Langford). [sent-296, score-0.329]
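As an illustration of the reporting problem described above, a small sketch (my own; not code from the paper, and the function names are hypothetical) contrasting the textbook Gaussian-approximation interval with an exact binomial-tail interval when only one error is observed:

```python
import math
from scipy.stats import binom

def gaussian_interval(m, k, z=1.96):
    """Textbook normal-approximation interval around the empirical error rate."""
    q = k / m
    half = z * math.sqrt(q * (1.0 - q) / m)
    return q - half, q + half   # may fall outside [0, 1] and is over-optimistic near 0

def binomial_upper(m, k, delta):
    """Largest p such that observing <= k errors still has probability >= delta."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom.cdf(k, m, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

def binomial_lower(m, k, delta):
    """Smallest p such that observing >= k errors still has probability >= delta."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if 1.0 - binom.cdf(k - 1, m, mid) >= delta:
            hi = mid
        else:
            lo = mid
    return hi

m, k = 100, 1
print(gaussian_interval(m, k))                                   # lower end goes negative
print(binomial_lower(m, k, 0.025), binomial_upper(m, k, 0.025))  # confined to [0, 1]
```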
38 The test set bound can be thought of as a game where a “Learner” attempts to convince a reasonable “Verifier” of the amount of learning which has occurred. [sent-297, score-0.241]
39 Figure 4: This is a graph of the confidence intervals implied by the test set bound (theorem 3. [x-axis “Learning Problem” tick labels: adult, shroom, audio, balance, car, votes, krkp, lung, nursery, postop, shuttle, soybean, yellow] [sent-305, score-0.279]
40 The upper bounds of the test set bound have δ = 0. [sent-307, score-0.27]
41 The test set bound is better behaved as the confidence interval is confined to the interval [0, 1] and is never over-optimistic. [sent-309, score-0.271]
42 The only requirement for applying this bound is that the learner must commit to a classifier without knowledge of the test examples. [sent-311, score-0.242]
43 A similar diagram for train set bounds is presented later (and is somewhat more complicated). [sent-312, score-0.188]
44 The Occam’s Razor Bound. Given that the simple test set bound works well, why do we need to engage in further work? [sent-316, score-0.215]
45 Many learning algorithms implicitly assume that the train set accuracy “behaves like” the true error in choosing the hypothesis. [sent-322, score-0.22]
46 (Occam’s Razor Bound) For all D, for all “priors” P(c) over the classifiers c, for all δ ∈ (0, 1]: Pr_{S∼D^m}( ∀c : c_D ≤ Bin(m, ĉ_S, δP(c)) ) ≥ 1 − δ. The application of the Occam’s Razor bound is somewhat more complicated than the application of the test set bound. [sent-341, score-0.215]
47 The proof starts with the test set bound: ∀c, Pr_{S∼D^m}( c_D ≤ Bin(m, ĉ_S, δP(c)) ) ≥ 1 − δP(c). Negating this statement, we get: ∀c, Pr_{S∼D^m}( c_D > Bin(m, ĉ_S, δP(c)) ) < δP(c). Then, we apply the union bound in a nonuniform manner. [sent-346, score-0.215]
48 Then, the bound is calculated based upon the chosen classifier. [sent-349, score-0.189]
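A minimal sketch of this procedure (my own illustration; the classifier names, error counts, and prior below are hypothetical): fix a “prior” P(c) before seeing the data, then evaluate Bin(m, ĉ_S, δP(c)) for each candidate.

```python
from scipy.stats import binom

def bin_inverse(m, k, delta):
    """Largest true error p such that Pr(<= k training errors) >= delta."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom.cdf(k, m, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

def occam_bound(m, train_errors, prior_c, delta):
    """Occam's Razor bound for one classifier: the test set bound at confidence delta * P(c)."""
    return bin_inverse(m, train_errors, delta * prior_c)

# Hypothetical example: three classifiers with a prior chosen before seeing the data,
# e.g. P(c) proportional to 2**(-description length).
m, delta = 1000, 0.05
candidates = {          # name: (train errors, prior probability P(c))
    "short_tree":  (120, 0.5),
    "medium_tree": (100, 0.25),
    "deep_tree":   (90,  2**-20),
}
for name, (k, p_c) in candidates.items():
    print(name, occam_bound(m, k, p_c, delta))
# The bound holds simultaneously for all classifiers, so picking the minimizer is valid.
```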
49 [Figure residue: axis label “Probability”; legend labels “empirical error” and “true error bound”.] [sent-364, score-0.399]
50 Occam’s Razor Corollaries. Just as with the test set bound, we can relax the Occam’s Razor bound (Theorem 4. [sent-383, score-0.215]
51 (Chernoff Occam’s Razor Bound) For all D, for all “priors” P(c) over classifiers, for all δ ∈ (0, 1]: Pr_{S∼D^m}( ∀c : c_D ≤ ĉ_S + sqrt( (ln(1/P(c)) + ln(1/δ)) / (2m) ) ) ≥ 1 − δ. Proof: Approximate the binomial tail with the Chernoff Bound (lemma 3. [sent-386, score-0.38]
52 Occam’s Razor Lower Bound. Just as for the test set bound, a lower bound of the same form applies. [sent-397, score-0.215]
53 Example 4 (continued): Suppose that instead of having 100 test examples, we had 100 train examples. [sent-400, score-0.214]
54 Train and test bounds (Langford, 2002), which combine train set and test set bounds. [sent-419, score-0.35]
55 The Occam’s Razor Bound is Sometimes Tight. The question of tightness for train set bounds is important to address, as many of them have been extremely loose. [sent-430, score-0.245]
56 The simplest method to address this tightness is constructive: exhibit a learning problem/algorithm pair for which the bound is almost achieved. [sent-431, score-0.191]
57 This particular choice of true errors implies that if any classifier has a too-small train error, then the classifier with minimal train error must have a too-small train error. [sent-455, score-0.486]
58 The next lemma shows that a certain expectation is bounded by the Kullback-Leibler distance between two coin flips, just as for the relative entropy Chernoff bound (Lemma 3. [sent-457, score-0.202]
59 For all Q(c): (1/m) E_{c∼Q} ln[ 1 / Pr_{S′∼D^m}(ĉ_{S′} = ĉ_S) ] ≥ KL(Q̂_S ∥ Q_D). [sent-461, score-0.264]
60 The PAC-Bayes Bound is Sometimes Tight. Since the PAC-Bayes bound is (almost) a generalization of the Occam’s Razor bound, the tightness result for Occam’s Razor also applies to PAC-Bayes bounds. [sent-476, score-0.191]
61 Using the corollary, the true error bound Q̄(w, µ)_D satisfies the equation: KL( Q̂(w, µ)_S ∥ Q̄(w, µ)_D ) = ( µ²/2 + ln((m + 1)/δ) ) / m. [sent-511, score-0.309]
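A small numerical sketch of how an implicit bound of this form is used (my own; the numbers and function names below are hypothetical): given the empirical error of the averaged classifier and the right-hand side, solve the KL equation for the largest consistent true error rate.

```python
import math

def bernoulli_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_inverse(q_hat, rhs):
    """Largest p >= q_hat with KL(q_hat || p) <= rhs: the PAC-Bayes true error bound."""
    lo, hi = q_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(q_hat, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers: m examples, margin parameter mu, empirical stochastic error q_hat.
m, mu, delta, q_hat = 7000, 5.0, 0.05, 0.11
rhs = (mu ** 2 / 2.0 + math.log((m + 1) / delta)) / m
print(kl_upper_inverse(q_hat, rhs))   # upper bound on the averaged true error rate
```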
62 The computational time of the bound calculation is dominated by the calculation of the margins, which is O(m²), where m is the number of support vectors with a nonzero associated α. [sent-521, score-0.192]
63 It is important to note that the PAC-Bayes margin bound is not precisely a bound (or confidence interval) on the true error rate of the learned classifier. [sent-532, score-0.446]
64 [Figure residue: legend labels “test set errors”, “test set bound”, “margin bound”.] [sent-539, score-0.468]
65 Figure 10: This figure shows the results of applying SVMlight to 8 datasets with a Gaussian kernel and a 70/30 train/test split. [x-axis “problem” tick labels: liver, bostonpima, ion, sonar, dig1, dig2, adult] [sent-542, score-0.209]
66 On the test set, we calculate a binomial confidence interval (probability of bound failure equals 0. [sent-544, score-0.386]
67 [Figure residue: legend labels “test set errors”, “test set bound”, “margin bound”, “7/10 margin”.] [sent-548, score-0.506]
68 Figure 11: In addition to comparing with everything in Figure 10, we graph the margin bound when all of the data is used for the train set. [x-axis “problem” tick labels: liver, bostonpima, ion, sonar, dig1, dig2, adult] [sent-551, score-0.476]
69 Note that it improves somewhat on the margin bound calculated using the 70% train set (7/10 margin bound), but not enough to compete with the test set bound. [sent-552, score-0.424]
70 These bounds can be regarded as bounds for the original classifier only under an additional assumption: that picking a classifier according to the majority vote of this stochastic distribution does not worsen the true error rate. [sent-554, score-0.197]
71 It is of course unfair to compare the train set bound with the test set bound on a 70/30 train/test split because a very tight train set bound would imply that it is unnecessary to even have a test set. [sent-556, score-0.862]
72 In Figure 11 we compare the true error bounds on all of the data to the true error bounds generated from the 70/30 train/test split. [sent-557, score-0.284]
73 The results show that the PAC-Bayes margin bound is tight enough to give useful information, but still not competitive with the test set bounds. [sent-558, score-0.285]
74 The train set bound might easily become tighter for smaller sample sizes. [sent-564, score-0.267]
75 The train set bound might still have the right “shape” for choosing an optimal parameter setting, such as “C” in a support vector machine. [sent-567, score-0.267]
76 Sample Compression Bound. The sample compression bound (Littlestone and Warmuth; Floyd and Warmuth, 1995) is like the PAC-Bayes bound in that it applies to arbitrary-precision, continuous-valued classifiers. [sent-569, score-0.422]
77 Mainstream learning algorithms do not optimize the sample compression metric, so applications of the bound are somewhat rarer. [sent-571, score-0.288]
78 Nonetheless, there do exist some reasonably competitive learning algorithms for which the sample compression bound produces significant results. [sent-572, score-0.288]
79 A later subsection shows that the sample compression bound is nearly as tight as possible given the observations. [sent-578, score-0.32]
80 Another subsection discusses results from the application of the sample compression bound to support vector machines. [sent-581, score-0.288]
81 The Sample Compression Bound. The sample compression bound (Littlestone and Warmuth; Floyd and Warmuth, 1995) stated here differs from older results by generalization and simplification, but the bound behavior is qualitatively identical. [sent-583, score-0.422]
82 The sample compression bound is dependent on the errors, ĉ_{S−S′}, on the subset S − S′. [sent-585, score-0.288]
83 In general, the bound becomes tighter as the dependent subset S′ becomes smaller and as the number of errors on the nondependent subset S − S′ becomes smaller. [sent-587, score-0.19]
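A minimal sketch of this dependence (my own reconstruction, not the paper's exact statement or constants): split the confidence δ uniformly over subset sizes and over subsets of each size, then invert a binomial tail on the nondependent examples. The function names and the per-subset penalty δ/((m+1)·C(m,|S′|)) are assumptions for illustration.

```python
import math
from scipy.stats import binom

def log_comb(n, k):
    """log of the binomial coefficient C(n, k), via lgamma to avoid huge integers."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def bin_inverse_log(n, k, log_delta):
    """Largest p with log Pr(<= k errors in n trials) >= log_delta."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom.logcdf(k, n, mid) >= log_delta:
            lo = mid
        else:
            hi = mid
    return lo

def sample_compression_bound(m, n_dependent, errors_rest, delta):
    """Sketch of a sample compression bound (assumed penalty, see lead-in).

    m            -- training set size
    n_dependent  -- |S'|: number of examples the classifier depends on (e.g. support vectors)
    errors_rest  -- errors made on the remaining m - |S'| examples
    """
    # Per-subset confidence: delta split uniformly over subset sizes and subsets of that size.
    log_delta_prime = math.log(delta) - math.log(m + 1) - log_comb(m, n_dependent)
    return bin_inverse_log(m - n_dependent, errors_rest, log_delta_prime)

# Hypothetical SVM-style numbers: 1000 examples, 100 support vectors,
# 30 errors on the 900 remaining examples.
print(sample_compression_bound(1000, 100, 30, 0.05))
```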
84 Viewed as an interactive proof of learning (in Figure 12), the sample compression bound is unique amongst training set bounds because it does not require any initial commitment to a measure over the classifiers. [sent-588, score-0.415]
85 Viewed from this perspective, the sample compression bound is the Occam’s Razor bound, except for the minor detail that the set of evaluating examples varies. [sent-597, score-0.312]
86 The Sample Compression Bound is Sometimes Tight. We can construct a learning algorithm/learning problem pair such that the sample compression bound is provably near optimal, as a function of its observables. [sent-604, score-0.288]
87 Application of the Sample Compression Bound. One obvious application of the sample compression bound is to support vector machines, since the learned classifier is only dependent on the set of support vectors. [sent-625, score-0.288]
88 There are other less common learning algorithms for which the sample compression bound works well. [sent-628, score-0.288]
89 The set covering machine (Marchand and Shawe-Taylor, 2001) has an associated bound which is a variant of the sample compression bound. [sent-629, score-0.288]
90 [Figure residue: legend labels “test set errors”, “test set bound”, “compression bound”.] [sent-631, score-0.584]
91 Figure 13: The sample compression bound applied to the output of a support vector machine with a Gaussian kernel. [x-axis “problem” tick labels: liver, bostonpima, ion, sonar, dig1, dig2, adult] [sent-634, score-0.459]
92 Learning Algorithm Design. Every train set bound implies a learning algorithm: choose the classifier which minimizes the true error bound. [sent-639, score-0.354]
93 It is important to note that the form of a train set bound does not imply that this minimization is a good idea. [sent-642, score-0.267]
94 Choosing between two classifiers based upon their true error bound implies a better worst-case bound on the true error. [sent-643, score-0.441]
95 Another strong caveat is that, historically, train set bounds have simply not been tight enough on real datasets for a nonvacuous application. [sent-647, score-0.258]
96 The train set bound presented here essentially shows that a reasonable person will be convinced of learning success when a short-description classifier does well on train set data. [sent-654, score-0.4]
97 Conclusion. This introduction to prediction theory covered two styles of bound: the test set bound and the train set bound. [sent-659, score-0.348]
98 Test set bounds provide a better way to report error rates and confidence intervals on future error rates than some current methods. [sent-661, score-0.2]
99 It is important to note that the train set bound and test set bound techniques are not mutually exclusive. [sent-664, score-0.482]
100 Test set bounds are improved by the “free” information about the training error, and train set bounds become applicable, even when not always tight. [sent-666, score-0.299]
wordName wordTfidf (topN-words)
[('razor', 0.33), ('cd', 0.326), ('occam', 0.312), ('pr', 0.295), ('bin', 0.267), ('angford', 0.197), ('ractical', 0.184), ('dm', 0.177), ('compression', 0.154), ('rediction', 0.154), ('heory', 0.154), ('cs', 0.138), ('bound', 0.134), ('train', 0.133), ('chernoff', 0.124), ('binomial', 0.116), ('lassification', 0.115), ('langford', 0.106), ('prs', 0.105), ('rain', 0.105), ('ec', 0.098), ('ln', 0.088), ('tail', 0.088), ('er', 0.085), ('kl', 0.083), ('test', 0.081), ('coin', 0.068), ('dence', 0.067), ('agnostic', 0.066), ('binomials', 0.066), ('negating', 0.066), ('ers', 0.066), ('classi', 0.063), ('tightness', 0.057), ('error', 0.056), ('upon', 0.055), ('bounds', 0.055), ('subsection', 0.054), ('rate', 0.053), ('chervonenkis', 0.052), ('pictorially', 0.052), ('valiant', 0.049), ('yw', 0.044), ('bostonpima', 0.039), ('commitment', 0.039), ('depiction', 0.039), ('liver', 0.039), ('mtest', 0.039), ('verifier', 0.039), ('datasets', 0.038), ('pe', 0.038), ('margin', 0.038), ('heads', 0.035), ('perpendicular', 0.035), ('implications', 0.035), ('empirical', 0.033), ('intervals', 0.033), ('interactive', 0.033), ('floyd', 0.033), ('realizable', 0.033), ('ion', 0.033), ('ew', 0.033), ('tight', 0.032), ('warmuth', 0.032), ('true', 0.031), ('adult', 0.031), ('sonar', 0.029), ('mkl', 0.029), ('calculation', 0.029), ('reporting', 0.029), ('people', 0.029), ('interval', 0.028), ('sometimes', 0.027), ('failure', 0.027), ('learner', 0.027), ('quantitatively', 0.027), ('pressure', 0.027), ('tutorial', 0.026), ('addressable', 0.026), ('azor', 0.026), ('ccam', 0.026), ('clopper', 0.026), ('cloudy', 0.026), ('conversions', 0.026), ('convince', 0.026), ('mpr', 0.026), ('practices', 0.026), ('reinterpretations', 0.026), ('specialization', 0.026), ('stest', 0.026), ('tate', 0.026), ('yv', 0.026), ('draws', 0.026), ('con', 0.025), ('fundamentally', 0.025), ('pg', 0.025), ('pm', 0.025), ('kernelized', 0.025), ('examples', 0.024), ('cumulative', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
2 0.10601156 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length √ ˜ ω( n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
3 0.090010971 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
Author: Mario Marchand, Marina Sokolova
Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory
4 0.081176311 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
5 0.065204032 32 jmlr-2005-Expectation Consistent Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
7 0.057635605 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
8 0.055567257 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
9 0.046213564 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
10 0.041738335 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
11 0.041127883 22 jmlr-2005-Concentration Bounds for Unigram Language Models
12 0.040158655 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
13 0.039307859 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
14 0.03929954 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
15 0.038480289 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
16 0.038311083 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
17 0.037954852 71 jmlr-2005-Variational Message Passing
18 0.035874426 29 jmlr-2005-Efficient Margin Maximizing with Boosting
19 0.035072379 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
20 0.035022594 67 jmlr-2005-Stability of Randomized Learning Algorithms
topicId topicWeight
[(0, 0.215), (1, -0.126), (2, 0.074), (3, 0.179), (4, -0.098), (5, -0.161), (6, -0.013), (7, -0.08), (8, 0.126), (9, -0.207), (10, 0.041), (11, 0.064), (12, -0.026), (13, 0.015), (14, -0.18), (15, -0.19), (16, 0.043), (17, -0.013), (18, 0.251), (19, -0.24), (20, -0.081), (21, 0.03), (22, 0.132), (23, 0.034), (24, -0.161), (25, 0.079), (26, 0.167), (27, -0.055), (28, 0.138), (29, 0.051), (30, 0.041), (31, 0.05), (32, -0.086), (33, 0.004), (34, 0.138), (35, 0.151), (36, -0.017), (37, -0.081), (38, 0.012), (39, -0.096), (40, -0.139), (41, 0.14), (42, 0.143), (43, -0.169), (44, 0.096), (45, -0.041), (46, 0.135), (47, -0.123), (48, 0.178), (49, -0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.96607244 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
2 0.45730579 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length √ ˜ ω( n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
3 0.35661843 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
4 0.34309226 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
Author: Mario Marchand, Marina Sokolova
Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory
5 0.27992076 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
6 0.23795068 32 jmlr-2005-Expectation Consistent Approximate Inference
7 0.21857914 22 jmlr-2005-Concentration Bounds for Unigram Language Models
8 0.2116859 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
9 0.20753874 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
10 0.20205697 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
12 0.17885898 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
13 0.17295751 71 jmlr-2005-Variational Message Passing
14 0.16593598 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
15 0.16173466 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
16 0.16064017 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
17 0.15824105 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
18 0.15664887 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
19 0.15379944 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
20 0.14995842 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
topicId topicWeight
[(13, 0.014), (17, 0.02), (19, 0.02), (36, 0.017), (37, 0.04), (43, 0.04), (47, 0.016), (52, 0.103), (59, 0.02), (70, 0.045), (88, 0.094), (90, 0.021), (92, 0.441), (94, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.73084939 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
2 0.3432948 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length √ ˜ ω( n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
3 0.33035958 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
4 0.32565993 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
5 0.32348445 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
6 0.32184815 64 jmlr-2005-Semigroup Kernels on Measures
7 0.3211078 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
8 0.32032856 71 jmlr-2005-Variational Message Passing
9 0.31830063 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
10 0.31792718 3 jmlr-2005-A Classification Framework for Anomaly Detection
11 0.31631789 39 jmlr-2005-Information Bottleneck for Gaussian Variables
12 0.31626135 11 jmlr-2005-Algorithmic Stability and Meta-Learning
13 0.31486821 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
14 0.31335905 44 jmlr-2005-Learning Module Networks
15 0.30859673 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.30712998 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
17 0.30682358 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
18 0.30446869 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
19 0.30260664 20 jmlr-2005-Clustering with Bregman Divergences
20 0.30242226 48 jmlr-2005-Learning the Kernel Function via Regularization