jmlr jmlr2005 jmlr2005-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
Reference: text
sentIndex sentText sentNum sentScore
1 The second is that train set bounds can sometimes be used to directly motivate learning algorithms. [sent-6, score-0.215]
2 draw of some sample, the expected future error rate of a classifier is bounded by f (δ, error rate on sample)”. [sent-18, score-0.218]
3 This is an error-prone practice, and the test set bound in Section 3 implies a better method by nearly any metric. [sent-25, score-0.215]
4 After discussing the test set bound we cover the Occam’s Razor bound, the simplest train set bound, which explains (and quantifies) the common phenomenon of overfitting. [sent-27, score-0.348]
5 We show that the Occam’s Razor bound cannot be improved without incorporating extra information, and apply the bound to decision trees. [sent-29, score-0.268]
6 Next, we discuss two train set bounds, the PAC-Bayes bound and the sample compression bound, which have proved to give practical results for more general classifiers, such as support vector machines and neural networks. [sent-30, score-0.421]
7 The formal model and test set bound must be understood in order to appreciate all later results. [sent-50, score-0.215]
8 There is no particular dependency between the various train set bounds we present. [sent-51, score-0.188]
9 Sometimes, we decorate these objects with labels like S_train (a train set) or S_test (a test set). [sent-73, score-0.214]
10 The empirical error is sometimes called the “training error”, “test error”, or “observed error” depending on whether it is the error on a training set, test set, or a more general set. [sent-98, score-0.253]
11 Example 2 (continued) The classifier c which always predicts “not rain” might have an empirical error of 38 out of 100 examples and an unknown true error rate (which might in fact be 0. [sent-99, score-0.253]
12 The first subsection presents the basic upper bound on the true error rate, handy approximations, and a lower bound. [sent-116, score-0.221]
13 The second subsection discusses the implications of the test set bound on error reporting practice. [sent-118, score-0.335]
14 This is the familiar binomial distribution from statistics, so it should not be surprising that the bounds presented here depend fundamentally on the cumulative distribution of a binomial. [sent-136, score-0.275]
15 With these definitions, we can interpret the binomial tail as the probability of an empirical error less than or equal to k. [sent-145, score-0.293]
16 Since we are interested in calculating a bound on the true error given a confidence δ and an empirical error ĉ_S, it is handy to define the inversion of a binomial tail. [sent-146, score-0.426]
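To make the inversion concrete, here is a minimal sketch (my own illustration, not code from the paper; the function names are hypothetical and SciPy is assumed) that computes the binomial tail and inverts it by bisection to obtain the test set upper bound on the true error rate:

```python
from scipy.stats import binom

def binomial_tail(m, k, p):
    """Probability of observing at most k errors on m examples when the true error rate is p."""
    return binom.cdf(k, m, p)

def test_set_bound(m, k, delta):
    """Invert the binomial tail: the largest true error rate p for which
    observing at most k errors still has probability at least delta."""
    lo, hi = 0.0, 1.0
    for _ in range(60):                  # bisection; the tail is decreasing in p
        mid = (lo + hi) / 2.0
        if binomial_tail(m, k, mid) >= delta:
            lo = mid                     # this p is still consistent with the observation
        else:
            hi = mid
    return lo

# Hypothetical example: 38 errors on 100 test examples, confidence 1 - delta = 0.95.
print(test_set_bound(100, 38, 0.05))     # roughly 0.47
```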
17 [Figure residue: legend labels “empirical error” and “true error bound”.] [sent-174, score-0.399]
18 The third figure shows the binomials consistent with the tail cut and observed test error. [sent-189, score-0.235]
19 The worst case over all true error rates is the consistent binomial with the largest bias. [sent-190, score-0.203]
20 Since each example is picked independently, the distribution of the empirical error is a binomial distribution. [sent-195, score-0.205]
21 Assuming (correctly with probability 1 − δ) that the empirical error is not in the binomial tail, we can constrain (and therefore bound) the value of the true error cD . [sent-197, score-0.292]
22 The test set bound is, essentially, perfectly tight. [sent-198, score-0.215]
23 Approximations. There are several immediate corollaries of the test set bound (3. [sent-207, score-0.215]
24 Proof Specializing the test set bound (Theorem 3. [sent-212, score-0.215]
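For intuition, a minimal sketch of the zero-error (realizable) specialization that corollaries of this kind give, under the standard argument (my reconstruction, not the paper's exact statement):

```latex
% If the empirical test error is zero, the observation has probability (1-c_D)^m \le e^{-m c_D}.
% Requiring this probability to be at least \delta yields, with probability at least 1-\delta:
\hat{c}_S = 0 \;\Longrightarrow\; c_D \;\le\; \frac{\ln(1/\delta)}{m}.
```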
25 Approximations which hold for arbitrary (nonzero) error rates rely upon the Chernoff bound which we state next, for completeness. [sent-216, score-0.245]
26 Using this, we get
f(λ*) = (k/m) ln[ p(1 − k/m) / ((k/m)(1 − p)) ] + ln[ (1 − p)/(1 − k/m) ]
      = (k/m) ln( p/(k/m) ) + (1 − k/m) ln( (1 − p)/(1 − k/m) )
      = −KL( k/m ∥ p ). [sent-230, score-0.352]
27 Using the Chernoff bound, we can loosen the test set bound to achieve a more analytic form. [sent-231, score-0.215]
28 The agnostic test set bound can be further loosened by bounding the value of KL(q||p). [sent-239, score-0.281]
29 Pr_{S∼D^m}( c_D ≤ ĉ_S + sqrt( ln(1/δ) / (2m) ) ) ≥ 1 − δ. Proof: Use the approximation KL(k/m ∥ c_D) ≥ 2(c_D − k/m)^2 with the Chernoff bound and test set bounds to get the result. [sent-242, score-0.27]
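To see what the relaxation costs, a small sketch (my own; the closed-form square-root bound below is the loosening obtained from KL(q ∥ p) ≥ 2(p − q)²) comparing the relative-entropy form with the square-root form:

```python
import math

def bernoulli_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_inverse(q_hat, rhs):
    """Largest p >= q_hat with KL(q_hat || p) <= rhs (relative-entropy Chernoff form)."""
    lo, hi = q_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(q_hat, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

m, k, delta = 100, 5, 0.05
q_hat, rhs = k / m, math.log(1.0 / delta) / m
print(kl_upper_inverse(q_hat, rhs))     # relative-entropy bound
print(q_hat + math.sqrt(rhs / 2.0))     # looser sqrt form, from KL >= 2 (p - q)^2
```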
30 The differences between the agnostic and realizable case are fundamentally related to the decrease in the variance of a binomial as the bias (i.e., the true error rate) approaches 0. [sent-243, score-0.24]
31 Note that this implies using the exact binomial tail calculation can result in functional (rather than merely constant) improvements on the above corollary. [sent-246, score-0.233]
32 The State of the Art. Although the test set bound is very well understood, the same cannot be said of other testing methods. [sent-266, score-0.215]
33 When attempting to calculate a confidence interval on the true error rate given the test set, many people follow a standard statistical prescription: (1) calculate the empirical error rate ĉ_S, then report a Gaussian-approximation confidence interval around it. [sent-274, score-0.278]
34 The basic problem with this approach is that the binomial distribution is not similar to a Gaussian when the error rate is near 0. [sent-288, score-0.225]
35 The test set bound can satisfy this requirement (and, in fact, operates well for all true error regimes). [sent-290, score-0.302]
36 The test set bound based confidence interval always returns an upper and lower bound in [0, 1]. [sent-294, score-0.377]
37 It is now simple and easy to calculate a bound based upon the cumulative distribution of the binomial (see Langford). [sent-296, score-0.329]
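As an illustration of the reporting problem described above, a small sketch (my own; not code from the paper, and the function names are hypothetical) contrasting the textbook Gaussian-approximation interval with an exact binomial-tail interval when only one error is observed:

```python
import math
from scipy.stats import binom

def gaussian_interval(m, k, z=1.96):
    """Textbook normal-approximation interval around the empirical error rate."""
    q = k / m
    half = z * math.sqrt(q * (1.0 - q) / m)
    return q - half, q + half   # may fall outside [0, 1] and is over-optimistic near 0

def binomial_upper(m, k, delta):
    """Largest p such that observing <= k errors still has probability >= delta."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom.cdf(k, m, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

def binomial_lower(m, k, delta):
    """Smallest p such that observing >= k errors still has probability >= delta."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if 1.0 - binom.cdf(k - 1, m, mid) >= delta:
            hi = mid
        else:
            lo = mid
    return hi

m, k = 100, 1
print(gaussian_interval(m, k))                                   # lower end goes negative
print(binomial_lower(m, k, 0.025), binomial_upper(m, k, 0.025))  # confined to [0, 1]
```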
38 The test set bound can be thought of as a game where a “Learner” attempts to convince a reasonable “Verifier” of the amount of learning which has occurred. [sent-297, score-0.241]
39 Figure 4: This is a graph of the confidence intervals implied by the test set bound (theorem 3. [x-axis “Learning Problem” tick labels: adult, shroom, audio, balance, car, votes, krkp, lung, nursery, postop, shuttle, soybean, yellow] [sent-305, score-0.279]
40 The upper bounds of the test set bound have δ = 0. [sent-307, score-0.27]
41 The test set bound is better behaved as the confidence interval is confined to the interval [0, 1] and is never over-optimistic. [sent-309, score-0.271]
42 The only requirement for applying this bound is that the learner must commit to a classifier without knowledge of the test examples. [sent-311, score-0.242]
43 A similar diagram for train set bounds is presented later (and is somewhat more complicated). [sent-312, score-0.188]
44 The Occam’s Razor Bound. Given that the simple test set bound works well, why do we need to engage in further work? [sent-316, score-0.215]
45 Many learning algorithms implicitly assume that the train set accuracy “behaves like” the true error in choosing the hypothesis. [sent-322, score-0.22]
46 (Occam’s Razor Bound) For all D, for all “priors” P(c) over the classifiers c, for all δ ∈ (0, 1]: Pr_{S∼D^m}( ∀c : c_D ≤ Bin(m, ĉ_S, δP(c)) ) ≥ 1 − δ. The application of the Occam’s Razor bound is somewhat more complicated than the application of the test set bound. [sent-341, score-0.215]
47 The proof starts with the test set bound: ∀c, Pr_{S∼D^m}( c_D ≤ Bin(m, ĉ_S, δP(c)) ) ≥ 1 − δP(c). Negating this statement, we get: ∀c, Pr_{S∼D^m}( c_D > Bin(m, ĉ_S, δP(c)) ) < δP(c). Then, we apply the union bound in a nonuniform manner. [sent-346, score-0.215]
48 Then, the bound is calculated based upon the chosen classifier. [sent-349, score-0.189]
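A minimal sketch of this procedure (my own illustration; the classifier names, error counts, and prior below are hypothetical): fix a “prior” P(c) before seeing the data, then evaluate Bin(m, ĉ_S, δP(c)) for each candidate.

```python
from scipy.stats import binom

def bin_inverse(m, k, delta):
    """Largest true error p such that Pr(<= k training errors) >= delta."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom.cdf(k, m, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

def occam_bound(m, train_errors, prior_c, delta):
    """Occam's Razor bound for one classifier: the test set bound at confidence delta * P(c)."""
    return bin_inverse(m, train_errors, delta * prior_c)

# Hypothetical example: three classifiers with a prior chosen before seeing the data,
# e.g. P(c) proportional to 2**(-description length).
m, delta = 1000, 0.05
candidates = {          # name: (train errors, prior probability P(c))
    "short_tree":  (120, 0.5),
    "medium_tree": (100, 0.25),
    "deep_tree":   (90,  2**-20),
}
for name, (k, p_c) in candidates.items():
    print(name, occam_bound(m, k, p_c, delta))
# The bound holds simultaneously for all classifiers, so picking the minimizer is valid.
```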
49 [Figure residue: axis label “Probability”; legend labels “empirical error” and “true error bound”.] [sent-364, score-0.399]
50 Occam’s Razor Corollaries. Just as with the test set bound, we can relax the Occam’s Razor bound (Theorem 4. [sent-383, score-0.215]
51 (Chernoff Occam’s Razor Bound) For all D, for all “priors” P(c) over classifiers, for all δ ∈ (0, 1]: Pr_{S∼D^m}( ∀c : c_D ≤ ĉ_S + sqrt( (ln(1/P(c)) + ln(1/δ)) / (2m) ) ) ≥ 1 − δ. Proof: Approximate the binomial tail with the Chernoff Bound (lemma 3. [sent-386, score-0.38]
52 Occam’s Razor Lower Bound. Just as for the test set bound, a lower bound of the same form applies. [sent-397, score-0.215]
53 Example 4 (continued): Suppose that instead of having 100 test examples, we had 100 train examples. [sent-400, score-0.214]
54 Train and test bounds (Langford, 2002), which combine train set and test set bounds. [sent-419, score-0.35]
55 The Occam’s Razor Bound is Sometimes Tight. The question of tightness for train set bounds is important to address, as many of them have been extremely loose. [sent-430, score-0.245]
56 The simplest method to address this tightness is constructive: exhibit a learning problem/algorithm pair for which the bound is almost achieved. [sent-431, score-0.191]
57 This particular choice of true errors implies that if any classifier has a too-small train error, then the classifier with minimal train error must have a too-small train error. [sent-455, score-0.486]
58 The next lemma shows that a certain expectation is bounded by the Kullback-Leibler distance between two coin flips, just as for the relative entropy Chernoff bound (Lemma 3. [sent-457, score-0.202]
59 For all Q(c): (1/m) E_{c∼Q} ln[ 1 / Pr_{S′∼D^m}(ĉ_{S′} = ĉ_S) ] ≥ KL(Q̂_S ∥ Q_D). [sent-461, score-0.264]
60 The PAC-Bayes Bound is Sometimes Tight. Since the PAC-Bayes bound is (almost) a generalization of the Occam’s Razor bound, the tightness result for Occam’s Razor also applies to PAC-Bayes bounds. [sent-476, score-0.191]
61 Using the corollary, the true error bound Q̄(w, µ)_D satisfies the equation: KL( Q̂(w, µ)_S ∥ Q̄(w, µ)_D ) = ( µ²/2 + ln((m + 1)/δ) ) / m. [sent-511, score-0.309]
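A small numerical sketch of how an implicit bound of this form is used (my own; the numbers and function names below are hypothetical): given the empirical error of the averaged classifier and the right-hand side, solve the KL equation for the largest consistent true error rate.

```python
import math

def bernoulli_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_inverse(q_hat, rhs):
    """Largest p >= q_hat with KL(q_hat || p) <= rhs: the PAC-Bayes true error bound."""
    lo, hi = q_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(q_hat, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers: m examples, margin parameter mu, empirical stochastic error q_hat.
m, mu, delta, q_hat = 7000, 5.0, 0.05, 0.11
rhs = (mu ** 2 / 2.0 + math.log((m + 1) / delta)) / m
print(kl_upper_inverse(q_hat, rhs))   # upper bound on the averaged true error rate
```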
62 The computational time of the bound calculation is dominated by the calculation of the margins, which is O(m²), where m is the number of support vectors with a nonzero associated α. [sent-521, score-0.192]
63 It is important to note that the PAC-Bayes margin bound is not precisely a bound (or confidence interval) on the true error rate of the learned classifier. [sent-532, score-0.446]
64 [Figure residue: legend labels “test set errors”, “test set bound”, “margin bound”.] [sent-539, score-0.468]
65 Figure 10: This figure shows the results of applying SVMlight to 8 datasets with a Gaussian kernel and a 70/30 train/test split. [x-axis “problem” tick labels: liver, bostonpima, ion, sonar, dig1, dig2, adult] [sent-542, score-0.209]
66 On the test set, we calculate a binomial confidence interval (probability of bound failure equals 0. [sent-544, score-0.386]
67 [Figure residue: legend labels “test set errors”, “test set bound”, “margin bound”, “7/10 margin”.] [sent-548, score-0.506]
68 Figure 11: In addition to comparing with everything in Figure 10, we graph the margin bound when all of the data is used for the train set. [x-axis “problem” tick labels: liver, bostonpima, ion, sonar, dig1, dig2, adult] [sent-551, score-0.476]
69 Note that it improves somewhat on the margin bound calculated using the 70% train set (7/10 margin bound), but not enough to compete with the test set bound. [sent-552, score-0.424]
70 These bounds can be regarded as bounds for the original classifier only under an additional assumption: that picking a classifier according to the majority vote of this stochastic distribution does not worsen the true error rate. [sent-554, score-0.197]
71 It is of course unfair to compare the train set bound with the test set bound on a 70/30 train/test split because a very tight train set bound would imply that it is unnecessary to even have a test set. [sent-556, score-0.862]
72 In Figure 11 we compare the true error bounds on all of the data to the true error bounds generated from the 70/30 train/test split. [sent-557, score-0.284]
73 The results show that the PAC-Bayes margin bound is tight enough to give useful information, but still not competitive with the test set bounds. [sent-558, score-0.285]
74 The train set bound might easily become tighter for smaller sample sizes. [sent-564, score-0.267]
75 The train set bound might still have the right “shape” for choosing an optimal parameter setting, such as “C” in a support vector machine. [sent-567, score-0.267]
76 Sample Compression Bound. The sample compression bound (Littlestone and Warmuth; Floyd and Warmuth, 1995) is like the PAC-Bayes bound in that it applies to arbitrary-precision, continuous-valued classifiers. [sent-569, score-0.422]
77 Mainstream learning algorithms do not optimize the sample compression metric, so applications of the bound are somewhat rarer. [sent-571, score-0.288]
78 Nonetheless, there do exist some reasonably competitive learning algorithms for which the sample compression bound produces significant results. [sent-572, score-0.288]
79 A later subsection shows that the sample compression bound is nearly as tight as possible given the observations. [sent-578, score-0.32]
80 Another subsection discusses results from the application of the sample compression bound to support vector machines. [sent-581, score-0.288]
81 The Sample Compression Bound. The sample compression bound (Littlestone and Warmuth; Floyd and Warmuth, 1995) stated here differs from older results by generalization and simplification, but the bound behavior is qualitatively identical. [sent-583, score-0.422]
82 The sample compression bound is dependent on the errors, ĉ_{S−S′}, on the subset S − S′. [sent-585, score-0.288]
83 In general, the bound becomes tighter as the dependent subset S′ becomes smaller and as the number of errors on the nondependent subset S − S′ becomes smaller. [sent-587, score-0.19]
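A minimal sketch of this dependence (my own reconstruction, not the paper's exact statement or constants): split the confidence δ uniformly over subset sizes and over subsets of each size, then invert a binomial tail on the nondependent examples. The function names and the per-subset penalty δ/((m+1)·C(m,|S′|)) are assumptions for illustration.

```python
import math
from scipy.stats import binom

def log_comb(n, k):
    """log of the binomial coefficient C(n, k), via lgamma to avoid huge integers."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def bin_inverse_log(n, k, log_delta):
    """Largest p with log Pr(<= k errors in n trials) >= log_delta."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom.logcdf(k, n, mid) >= log_delta:
            lo = mid
        else:
            hi = mid
    return lo

def sample_compression_bound(m, n_dependent, errors_rest, delta):
    """Sketch of a sample compression bound (assumed penalty, see lead-in).

    m            -- training set size
    n_dependent  -- |S'|: number of examples the classifier depends on (e.g. support vectors)
    errors_rest  -- errors made on the remaining m - |S'| examples
    """
    # Per-subset confidence: delta split uniformly over subset sizes and subsets of that size.
    log_delta_prime = math.log(delta) - math.log(m + 1) - log_comb(m, n_dependent)
    return bin_inverse_log(m - n_dependent, errors_rest, log_delta_prime)

# Hypothetical SVM-style numbers: 1000 examples, 100 support vectors,
# 30 errors on the 900 remaining examples.
print(sample_compression_bound(1000, 100, 30, 0.05))
```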
84 Viewed as an interactive proof of learning (in Figure 12), the sample compression bound is unique amongst training set bounds because it does not require any initial commitment to a measure over the classifiers. [sent-588, score-0.415]
85 Viewed from this perspective, the sample compression bound is the Occam’s Razor bound, except for the minor detail that the set of evaluating examples varies. [sent-597, score-0.312]
86 The Sample Compression Bound is Sometimes Tight. We can construct a learning algorithm/learning problem pair such that the sample compression bound is provably near optimal, as a function of its observables. [sent-604, score-0.288]
87 Application of the Sample Compression Bound. One obvious application of the sample compression bound is to support vector machines, since the learned classifier is only dependent on the set of support vectors. [sent-625, score-0.288]
88 There are other less common learning algorithms for which the sample compression bound works well. [sent-628, score-0.288]
89 The set covering machine (Marchand and Shawe-Taylor, 2001) has an associated bound which is a variant of the sample compression bound. [sent-629, score-0.288]
90 [Figure residue: legend labels “test set errors”, “test set bound”, “compression bound”.] [sent-631, score-0.584]
91 Figure 13: The sample compression bound applied to the output of a support vector machine with a Gaussian kernel. [x-axis “problem” tick labels: liver, bostonpima, ion, sonar, dig1, dig2, adult] [sent-634, score-0.459]
92 Learning Algorithm Design. Every train set bound implies a learning algorithm: choose the classifier which minimizes the true error bound. [sent-639, score-0.354]
93 It is important to note that the form of a train set bound does not imply that this minimization is a good idea. [sent-642, score-0.267]
94 Choosing between two classifiers based upon their true error bound implies a better worst-case bound on the true error. [sent-643, score-0.441]
95 Another strong caveat is that, historically, train set bounds have simply not been tight enough on real datasets for a nonvacuous application. [sent-647, score-0.258]
96 The train set bound presented here essentially shows that a reasonable person will be convinced of learning success when a short-description classifier does well on train set data. [sent-654, score-0.4]
97 Conclusion. This introduction to prediction theory covered two styles of bound: the test set bound and the train set bound. [sent-659, score-0.348]
98 Test set bounds provide a better way to report error rates and confidence intervals on future error rates than some current methods. [sent-661, score-0.2]
99 It is important to note that the train set bound and test set bound techniques are not mutually exclusive. [sent-664, score-0.482]
100 Test set bounds are improved by the “free” information about the training error, and train set bounds become applicable, even when not always tight. [sent-666, score-0.299]
wordName wordTfidf (topN-words)
[('razor', 0.33), ('cd', 0.326), ('occam', 0.312), ('pr', 0.295), ('bin', 0.267), ('angford', 0.197), ('ractical', 0.184), ('dm', 0.177), ('compression', 0.154), ('rediction', 0.154), ('heory', 0.154), ('cs', 0.138), ('bound', 0.134), ('train', 0.133), ('chernoff', 0.124), ('binomial', 0.116), ('lassification', 0.115), ('langford', 0.106), ('prs', 0.105), ('rain', 0.105), ('ec', 0.098), ('ln', 0.088), ('tail', 0.088), ('er', 0.085), ('kl', 0.083), ('test', 0.081), ('coin', 0.068), ('dence', 0.067), ('agnostic', 0.066), ('binomials', 0.066), ('negating', 0.066), ('ers', 0.066), ('classi', 0.063), ('tightness', 0.057), ('error', 0.056), ('upon', 0.055), ('bounds', 0.055), ('subsection', 0.054), ('rate', 0.053), ('chervonenkis', 0.052), ('pictorially', 0.052), ('valiant', 0.049), ('yw', 0.044), ('bostonpima', 0.039), ('commitment', 0.039), ('depiction', 0.039), ('liver', 0.039), ('mtest', 0.039), ('verifier', 0.039), ('datasets', 0.038), ('pe', 0.038), ('margin', 0.038), ('heads', 0.035), ('perpendicular', 0.035), ('implications', 0.035), ('empirical', 0.033), ('intervals', 0.033), ('interactive', 0.033), ('floyd', 0.033), ('realizable', 0.033), ('ion', 0.033), ('ew', 0.033), ('tight', 0.032), ('warmuth', 0.032), ('true', 0.031), ('adult', 0.031), ('sonar', 0.029), ('mkl', 0.029), ('calculation', 0.029), ('reporting', 0.029), ('people', 0.029), ('interval', 0.028), ('sometimes', 0.027), ('failure', 0.027), ('learner', 0.027), ('quantitatively', 0.027), ('pressure', 0.027), ('tutorial', 0.026), ('addressable', 0.026), ('azor', 0.026), ('ccam', 0.026), ('clopper', 0.026), ('cloudy', 0.026), ('conversions', 0.026), ('convince', 0.026), ('mpr', 0.026), ('practices', 0.026), ('reinterpretations', 0.026), ('specialization', 0.026), ('stest', 0.026), ('tate', 0.026), ('yv', 0.026), ('draws', 0.026), ('con', 0.025), ('fundamentally', 0.025), ('pg', 0.025), ('pm', 0.025), ('kernelized', 0.025), ('examples', 0.024), ('cumulative', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
2 0.10601156 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length √ ˜ ω( n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
3 0.090010971 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
Author: Mario Marchand, Marina Sokolova
Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory
4 0.081176311 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
5 0.065204032 32 jmlr-2005-Expectation Consistent Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
7 0.057635605 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
8 0.055567257 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
9 0.046213564 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
10 0.041738335 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
11 0.041127883 22 jmlr-2005-Concentration Bounds for Unigram Language Models
12 0.040158655 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
13 0.039307859 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
14 0.03929954 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
15 0.038480289 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
16 0.038311083 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
17 0.037954852 71 jmlr-2005-Variational Message Passing
18 0.035874426 29 jmlr-2005-Efficient Margin Maximizing with Boosting
19 0.035072379 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
20 0.035022594 67 jmlr-2005-Stability of Randomized Learning Algorithms
topicId topicWeight
[(0, 0.215), (1, -0.126), (2, 0.074), (3, 0.179), (4, -0.098), (5, -0.161), (6, -0.013), (7, -0.08), (8, 0.126), (9, -0.207), (10, 0.041), (11, 0.064), (12, -0.026), (13, 0.015), (14, -0.18), (15, -0.19), (16, 0.043), (17, -0.013), (18, 0.251), (19, -0.24), (20, -0.081), (21, 0.03), (22, 0.132), (23, 0.034), (24, -0.161), (25, 0.079), (26, 0.167), (27, -0.055), (28, 0.138), (29, 0.051), (30, 0.041), (31, 0.05), (32, -0.086), (33, 0.004), (34, 0.138), (35, 0.151), (36, -0.017), (37, -0.081), (38, 0.012), (39, -0.096), (40, -0.139), (41, 0.14), (42, 0.143), (43, -0.169), (44, 0.096), (45, -0.041), (46, 0.135), (47, -0.123), (48, 0.178), (49, -0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.96607244 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
2 0.45730579 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length √ ˜ ω( n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
3 0.35661843 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
4 0.34309226 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
Author: Mario Marchand, Marina Sokolova
Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory
5 0.27992076 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
6 0.23795068 32 jmlr-2005-Expectation Consistent Approximate Inference
7 0.21857914 22 jmlr-2005-Concentration Bounds for Unigram Language Models
8 0.2116859 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
9 0.20753874 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
10 0.20205697 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
12 0.17885898 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
13 0.17295751 71 jmlr-2005-Variational Message Passing
14 0.16593598 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
15 0.16173466 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
16 0.16064017 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
17 0.15824105 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
18 0.15664887 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
19 0.15379944 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
20 0.14995842 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
topicId topicWeight
[(13, 0.014), (17, 0.02), (19, 0.02), (36, 0.017), (37, 0.04), (43, 0.04), (47, 0.016), (52, 0.103), (59, 0.02), (70, 0.045), (88, 0.094), (90, 0.021), (92, 0.441), (94, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.73084939 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
2 0.3432948 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length √ ˜ ω( n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
3 0.33035958 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
4 0.32565993 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
5 0.32348445 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
6 0.32184815 64 jmlr-2005-Semigroup Kernels on Measures
7 0.3211078 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
8 0.32032856 71 jmlr-2005-Variational Message Passing
9 0.31830063 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
10 0.31792718 3 jmlr-2005-A Classification Framework for Anomaly Detection
11 0.31631789 39 jmlr-2005-Information Bottleneck for Gaussian Variables
12 0.31626135 11 jmlr-2005-Algorithmic Stability and Meta-Learning
13 0.31486821 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
14 0.31335905 44 jmlr-2005-Learning Module Networks
15 0.30859673 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.30712998 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
17 0.30682358 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
18 0.30446869 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
19 0.30260664 20 jmlr-2005-Clustering with Bregman Divergences
20 0.30242226 48 jmlr-2005-Learning the Kernel Function via Regularization