jmlr jmlr2008 jmlr2008-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
Reference: text
sentIndex sentText sentNum sentScore
1 PO Box 60543, Pasadena, CA 91116-6543. Editor: Manfred Warmuth. Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). [sent-2, score-0.32]
2 The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. [sent-3, score-0.2]
3 The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. [sent-4, score-0.382]
4 Keywords: compression, error bound, support vector machine, nearly uniform 1. [sent-5, score-0.463]
5 Introduction The error bounds developed in this paper are based on the number of support vectors in an SVM. [sent-6, score-0.293]
6 Littlestone and Warmuth (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995) pioneered error bounds of this type. [sent-7, score-0.219]
7 Their method derives error bounds based on how few training examples are needed to represent a classifier that is consistent with all training examples. [sent-8, score-0.551]
8 Hence, bounds derived using their method are called compression-based bounds. [sent-9, score-0.154]
9 Compression-based bounds apply to SVMs because producing an SVM involves determining which training examples are “border” examples of each class and then ignoring “interior” examples. [sent-10, score-0.483]
10 The number of border examples can be a small fraction of the number of training examples. [sent-11, score-0.222]
11 Discarding the interior examples and training on the border examples alone produces the same SVM. [sent-12, score-0.408]
12 So SVM training itself is a method to reconstruct the classifier based on a subset of the training data. [sent-13, score-0.114]
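To make the compression property concrete, here is a minimal sketch (not from the paper) that assumes scikit-learn is available; the dataset, kernel, and parameters are illustrative only. It trains an SVM on all examples, retrains on the support vectors alone, and checks that the two classifiers agree, up to tie-breaking and numerical tolerance.

```python
# Illustrative sketch only: retraining on the support vectors alone should
# reproduce the SVM trained on the full sample (up to ties and numerics).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
full = SVC(kernel="linear", C=1.0).fit(X, y)

sv = full.support_                                  # indices of the "border" examples
reduced = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])

# Fraction of training inputs on which the two classifiers agree.
print(np.mean(full.predict(X) == reduced.predict(X)))
```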
13 For more details on applying compression-based bounds to SVMs, refer to Cristianini and Shawe-Taylor (2000) and von Luxburg et al. [sent-14, score-0.154]
14 For information on applying compression-based bounds to some other classifiers, refer to Littlestone and Warmuth (1986), Floyd and Warmuth (1995), Marchand and Shawe-Taylor (2001) and Marchand and Sokolova (2005). [sent-16, score-0.154]
15 Compression-based bounds are effective when a small subset of the available examples can represent a classifier that is consistent with all available examples. [sent-17, score-0.372]
16 Proofs of effectiveness for compression-based bounds use uniform validation over a set of classifiers that includes the consistent classifier. [sent-18, score-0.822]
17 The validation is uniform in the sense that no classifier in the set may be misvalidated. [sent-19, score-0.519]
18 The bounds introduced in this paper apply when multiple subsets of the available examples can represent the same consistent classifier. [sent-20, score-0.45]
19 Proofs of effectiveness for the new bounds use validation over a set of classifiers that includes several copies of the consistent classifier. [sent-22, score-0.831]
20 So the validation need not be strictly uniform over the set of classifiers; the proofs tolerate any number of misvalidated classifiers smaller than the number of copies of the classifier of interest and still validate that classifier. [sent-23, score-0.827]
21 Hence, the error bounds are said to be nearly uniform. [sent-24, score-0.44]
22 Nearly uniform error bounds are introduced in Bax (1997). [sent-25, score-0.374]
23 Section 3 gives an error bound for validation of a classifier. [sent-29, score-0.507]
24 Section 4 presents a bound on the probability of several simultaneous events, which is the basis for nearly uniform error bounds. [sent-30, score-0.571]
25 Section 6 applies nearly uniform error bounds to compression-based bounding. [sent-32, score-0.599]
26 Define the error of g: ED(g) = PD(g(X) ≠ Y), where the probability is over distribution D. [sent-46, score-0.097]
27 Define the empirical error of g on V: EV(g) = PV(g(X) ≠ Y), where the probability is uniform over the examples in V. [sent-48, score-0.355]
28 If a classifier has empirical error zero, then the classifier is said to be consistent with V. [sent-49, score-0.171]
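As a small illustration of these definitions (a sketch, not code from the paper; the function and variable names are hypothetical), the empirical error and the consistency check can be written as:

```python
# Hypothetical helper names; V is a sequence of (x, y) pairs and g is a callable.
def empirical_error(g, V):
    """EV(g): fraction of examples (x, y) in V with g(x) != y."""
    return sum(1 for x, y in V if g(x) != y) / len(V)

def is_consistent(g, V):
    """True if g has empirical error zero on V."""
    return empirical_error(g, V) == 0.0
```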
29 The goal is to use the examples in C to develop a classifier g* that is consistent with C and to produce a PAC (probably approximately correct) bound on the error. [sent-50, score-0.288]
30 This paper focuses on producing the error bound for training methods that can develop g* using subsets of the examples in C, called compression training algorithms. [sent-51, score-0.673]
31 These methods include training support vector machines (SVMs) and perceptrons. [sent-52, score-0.103]
32 Validation of a Consistent Classifier. Theorem 1 Let V be a sequence of examples drawn i.i.d. [sent-54, score-0.128]
33 from D, and let g be a classifier developed independently of the examples in V. [sent-57, score-0.205]
34 If the error of g is at least ε, then the probability of correctly classifying each example in V is at most 1 − ε, so the probability that g is consistent with V and yet has error at least ε is at most (1 − ε)^{|V|}. [sent-61, score-0.097]
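For reference, a minimal sketch of turning Theorem 1 into a usable bound: setting (1 − ε)^{|V|} equal to a confidence parameter δ and solving for ε gives ε = 1 − δ^{1/|V|}. The closed form is my algebra on the displayed inequality, not a formula quoted from the paper.

```python
# Error bound for a classifier consistent with n_validation held-out examples,
# obtained by solving (1 - eps) ** n_validation = delta for eps.
def theorem1_error_bound(n_validation, delta):
    return 1.0 - delta ** (1.0 / n_validation)

# Example: 1000 validation examples, failure probability delta = 0.05.
print(theorem1_error_bound(1000, 0.05))  # roughly 0.003
```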
35 The set V is called the set of validation examples. [sent-62, score-0.364]
36 Theorem 1 cannot be applied directly to g* with V = C to compute an error bound, because g* is developed using the examples in C. [sent-63, score-0.196]
37 To validate g*, we can use Theorem 1 indirectly, performing uniform validation over a set of classifiers that includes g*, with validation for each classifier based on examples not used to develop the classifier. [sent-64, score-1.149]
38 Since the set of classifiers includes g*, uniform validation over the set implies validation of g*. [sent-65, score-0.927]
39 In this paper, we use nearly uniform validation to validate g*. [sent-66, score-0.81]
40 We use a multi-set of classifiers that has several copies of g*, and we perform validation over the classifiers, allowing fewer failed validations than the number of copies of g*. [sent-67, score-0.692]
41 Probability of Several Simultaneous Events. Nearly uniform validation is based on a bound on the probability of several simultaneous events. [sent-70, score-0.673]
42 The bound states that the probability that at least k of the events A1, . . . , An occur simultaneously is at most (1/k)(P[A1] + · · · + P[An]); note that setting k = 1 gives the well-known sum bound on the probability of a union: P[A1 ∪ · · · ∪ An] ≤ P[A1] + · · · + P[An]. [sent-101, score-0.11]
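A quick numerical sanity check of the k-of-n bound stated above (a sketch; the simulation uses independent events purely for convenience, although the bound itself requires no independence):

```python
import random

# Estimate P[at least k of the events occur] by simulation and compare it with
# the bound (1/k) * sum_i P[A_i].
def at_least_k_probability(probs, k, trials=200_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        occurred = sum(rng.random() < p for p in probs)
        hits += occurred >= k
    return hits / trials

probs = [0.1] * 20
k = 3
print(at_least_k_probability(probs, k))  # empirical probability
print(sum(probs) / k)                    # bound from the k-of-n inequality
```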
43 Nearly Uniform Validation. Consider the probability that at least k classifiers from a set of n classifiers are consistent with their validation examples and yet all have error at least ε. [sent-109, score-0.646]
44 Theorem 3 Let g1, . . . , gn be classifiers and V1, . . . , Vn be validation sets, with each classifier gi developed independently of validation set Vi. [sent-117, score-1.058]
45 Then P[∃S ⊆ {1, . . . , n} : |S| = k ∧ ∀i ∈ S : (EVi(gi) = 0 ∧ ED(gi) ≥ ε)] ≤ (n/k)(1 − ε)^{|V|}, where |V| is the size of each validation set and the probability is over validation sets, with the examples within each validation set drawn i.i.d. [sent-122, score-0.888]
46 according to D, but without requiring any independence between validation sets. [sent-125, score-0.364]
47 For instance, with a set of examples, each classifier could be the result of training on a subset of the examples, and each validation set could be the examples not used to train the corresponding classifier. [sent-126, score-0.524]
48 Proof Define Ai = {(V1, . . . , Vn) | (EVi(gi) = 0 ∧ ED(gi) ≥ ε)}, that is, Ai is the set of validation set sequences for which gi is consistent with Vi and yet the error of gi is at least ε. [sent-134, score-1.027]
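A sketch of the error bound implied by Theorem 3: setting (n/k)(1 − ε)^{|V|} equal to δ and solving for ε gives ε = 1 − (kδ/n)^{1/|V|}. The closed form is my algebra on the displayed bound, and the numbers below are illustrative only.

```python
# n classifiers, at least k of which are copies of the classifier of interest,
# each validated on v examples; delta is the allowed probability of failure.
def nearly_uniform_error_bound(n, k, v, delta):
    return 1.0 - (k * delta / n) ** (1.0 / v)

# More copies (larger k) tighten the bound for fixed n, v, and delta.
print(nearly_uniform_error_bound(n=10_000, k=1, v=500, delta=0.05))
print(nearly_uniform_error_bound(n=10_000, k=100, v=500, delta=0.05))
```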
49 These are the compression bounds found in previous work. [sent-149, score-0.352]
50 The new bounds are based on nearly uniform validation. [sent-152, score-0.506]
51 Recall that C = Z1, . . . , Zm is the sequence of examples available for training. [sent-156, score-0.103]
52 For each subset T of {1, . . . , m}, define g(T) to be the classifier represented by the examples in C that are indexed by T, under some scheme for representing classifiers. [sent-160, score-0.182]
53 (An example scheme is to train a classifier on the examples used for representation.) [sent-161, score-0.103]
54 Define V(T) to be the subsequence of examples in C not indexed by T. [sent-162, score-0.182]
55 1 Review of Uniform Sample Compression Bounds. Define compression index set H to be a minimum-sized subset of {1, . . . , m} [sent-165, score-0.198]
56 such that the empirical error of g(H) on V(H) is zero, that is, g(H) is consistent with the examples in C not indexed by H. [sent-168, score-0.264]
57 Hence, the bounds developed here also apply under the condition that H indexes a minimum-sized subset of examples in C that represent a classifier that is consistent with C. [sent-170, score-0.427]
58 Then P[ED(g∗) ≥ ε ∧ |H| = h] ≤ (m choose h)(1 − ε)^{m−h}, where the probability is over random draws of C = Z1, . . . , Zm. [sent-177, score-0.115]
59 Since H depends on the examples in C, Theorem 3 does not apply directly. [sent-184, score-0.13]
60 So use uniform validation over the set of classifiers represented by size-h subsets of C to validate g(H) using Theorem 3. [sent-185, score-0.664]
61 (This set of classifiers is chosen independently of C, and it includes g(H).) [sent-186, score-0.088]
62 Let g1, . . . , gn be the classifiers represented by size-h subsets of C. [sent-190, score-0.182]
63 Since g(H) ∈ {g1, . . . , gn}, P[EV(H)(g(H)) = 0 ∧ ED(g(H)) ≥ ε] ≤ P[∃gi ∈ {g1, . . . , gn} : (EVi(gi) = 0 ∧ ED(gi) ≥ ε)]. [sent-194, score-0.131]
64 Set k = 1 in Theorem 3 to bound the probability of at least one misvalidation, and note that n = (m choose h). [sent-199, score-0.11]
65 This gives P[∃gi ∈ {g1, . . . , gn} : (EVi(gi) = 0 ∧ ED(gi) ≥ ε)] ≤ (m choose h)(1 − ε)^{m−h}. [sent-203, score-0.131]
66 The following theorem allows us to choose h based on C. [sent-205, score-0.132]
67 Then, with probability at least 1 − δ, ED(g∗) < ε(m, h, δ/m), where the probability is over random draws of C = Z1, . . . , Zm. [sent-211, score-0.147]
68 Using the sum bound on the probability of a union: P[ED(g∗) ≥ ε(m, h, δ/m)] ≤ P[∃h ∈ {1, . . . , m} : ED(g(H)) ≥ ε(m, h, δ/m) ∧ |H| = h] ≤ m(δ/m) = δ. [sent-219, score-0.11]
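A sketch of evaluating the uniform compression bound numerically, assuming ε(m, h, δ/m) denotes the value of ε at which (m choose h)(1 − ε)^{m−h} equals δ/m; the closed form is my algebra on that equation and the numbers are illustrative only.

```python
from math import comb

def compression_error_bound(m, h, delta):
    """eps(m, h, delta/m): solve comb(m, h) * (1 - eps)**(m - h) = delta / m."""
    return 1.0 - (delta / (m * comb(m, h))) ** (1.0 / (m - h))

# Example: 1000 training examples, a compression set of 50 examples.
print(compression_error_bound(m=1000, h=50, delta=0.05))
```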
69 2 Nearly Uniform Sample Compression Bounds for SVMs. Now consider a case where multiple subsets of the examples in C all represent the same consistent classifier. [sent-228, score-0.269]
70 Under this condition, we can use nearly uniform validation to derive new error bounds. [sent-229, score-0.781]
71 In other words, every superset of R represents the same classifier, g*, which is consistent with the examples in C not indexed by R. [sent-238, score-0.289]
72 For example, in support vector machine training, R can be the set of support vectors in a support vector machine produced by training on all examples in C. [sent-239, score-0.298]
73 (To ensure that the training algorithm produces the same SVM for different supersets of R, assume that the training algorithm breaks ties to determine which SVM to return in a nonrandom way that does not depend on which examples beyond R are in the training set. [sent-240, score-0.346]
74 For example, the algorithm could form a candidate set consisting of all SVMs with a minimum number of support vectors among those that minimize the algorithm’s training objective function. [sent-241, score-0.103]
75 Then P[ED(g∗) ≥ ε ∧ q ≥ r] ≤ (m − r choose q − r)^{−1}(m choose q)(1 − ε)^{m−q}, where the probability is over random draws of C = Z1, . . . , Zm. [sent-249, score-0.115]
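A sketch of evaluating this nearly uniform compression bound numerically: here k = (m − r choose q − r) counts the size-q supersets of the retained set R, that is, the copies of g* among the validated classifiers, and ε is obtained by setting the displayed bound equal to δ. The closed form is my algebra on the inequality; the numbers are illustrative, and for very large m the binomial ratio should be computed in log space.

```python
from math import comb

def nearly_uniform_compression_bound(m, r, q, delta):
    """Solve comb(m-r, q-r)**(-1) * comb(m, q) * (1 - eps)**(m - q) = delta."""
    k = comb(m - r, q - r)   # copies of g* among the size-q index sets
    n = comb(m, q)           # all size-q index sets
    return 1.0 - (delta * k / n) ** (1.0 / (m - q))

# Same setting as above, with r = 50 support vectors and a larger index size q.
print(nearly_uniform_compression_bound(m=1000, r=50, q=300, delta=0.05))
```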
76 (6) Since R depends on the examples in C, Theorem 3 does not apply directly. [sent-266, score-0.13]
77 So use nearly uniform validation over the set of classifiers represented by size-q subsets of C to validate g* using Theorem 3. [sent-267, score-0.861]
78 Let g1, . . . , gn be the classifiers represented by size-q subsets of C. [sent-275, score-0.182]
79 Since g1, . . . , gn contains at least k = (m − r choose q − r) instances of g* (one for each size-q superset of R), the RHS of (6) is ≤ P[∃S ⊆ {1, . . . , n} : |S| = k ∧ ∀i ∈ S : (EVi(gi) = 0 ∧ ED(gi) ≥ ε)], which Theorem 3 bounds by (n/k)(1 − ε)^{m−q}. [sent-279, score-0.131]
80 Also, if q < r, then the theorem does not produce an error bound. [sent-287, score-0.197]
81 The following theorem allows us to choose q based on r. [sent-288, score-0.132]
82 Use C to identify a retained set R and an associated classifier g*. [sent-295, score-0.091]
83 Let W be a set of w candidate values for q. Then, with probability at least 1 − δ, ED(g∗) < ε(m, r, q, δ/w) for every q ∈ W with q ≥ r, where the probability is over random draws of C = Z1, . . . , Zm. [sent-299, score-0.115]
84 Using the sum bound on the probability of a union: P[∃q ∈ W : ED(g∗) ≥ ε(m, r, q, δ/w) ∧ q ≥ r] ≤ δ. [sent-304, score-0.11]
85 Setting W = {1, . . . , m} in Theorem 7 gives the compression error bound from Theorem 5, which is the bound from the literature (Littlestone and Warmuth, 1986; Cristianini and Shawe-Taylor, 2000; Langford, 2005). [sent-309, score-0.419]
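A sketch of using Theorem 7 in practice: split δ over a candidate set W of values for q and keep the best resulting bound. The helper eps below is the closed form obtained by setting the Theorem 6 bound equal to its confidence budget, which is how I read ε(m, r, q, δ/w); the candidate set and numbers are illustrative only.

```python
from math import comb

def eps(m, r, q, budget):
    """Solve comb(m-r, q-r)**(-1) * comb(m, q) * (1 - e)**(m - q) = budget."""
    return 1.0 - (budget * comb(m - r, q - r) / comb(m, q)) ** (1.0 / (m - q))

def best_bound(m, r, W, delta):
    w = len(W)
    return min(eps(m, r, q, delta / w) for q in W if r <= q < m)

m, r, delta = 1000, 50, 0.05
W = [50, 100, 200, 300, 400, 500]
print(best_bound(m, r, W, delta))     # nearly uniform bound over candidates W
print(eps(m, r, r, delta / m))        # q = r with delta split over m choices
```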
86 Analysis. This section analyzes optimal choices of q and how strongly the error bound depends on different factors. [sent-312, score-0.275]
87 To determine optimal choices for q, we analyze how the probability of bound failure δ changes as q increases. [sent-313, score-0.11]
88 To compare the influence of different factors, we use some approximations for the bound ε. [sent-314, score-0.106]
89 Also, we compare choosing q to maximize the number of examples used for validation to choosing q to maximize the number of copies of g ∗ in the nearly uniform validation. [sent-315, score-0.983]
90 For some background, note that increasing q increases the fraction of classifiers in the nearly uniform validation that match g ∗ , but it decreases the number of validation examples for each classifier. [sent-318, score-1.183]
91 The minimum for q is r, which produces only one classifier that matches g* and leaves m-r examples for validation. [sent-319, score-0.148]
92 The maximum for q is m, making g∗ the only classifier involved in uniform validation, but leaving no validation examples. [sent-320, score-0.519]
93 Setting the RHS equal to one and solving for q produces qopt = r/ε − 1, making the optimal validation set size m − qopt = m − r/ε + 1. [sent-328, score-0.553]
94 ε For example, with SVM training, if 5% of the training examples are support vectors, and the error bound is ε = 10%, then the optimal choice for q is one less than half the number of training examples. [sent-329, score-0.406]
95 2 How Error Bound ε Depends on m, r, q, and δ. To explore how the error bound ε(m, r, q, δ/w) in Theorem 7 depends on m, r, q, δ, and w, we will use the following pair of approximations: (n choose k) ≈ (en/k)^k, which follows from Stirling’s approximation (Feller, 1968, p. [sent-331, score-0.143]
96 Apply these approximations to δ/w = (m − r choose q − r)^{−1}(m choose q)(1 − ε)^{m−q}, (9) producing δ/w ≈ (e(m − r)/(q − r))^{−(q−r)} (em/q)^{q} e^{−ε(m−q)}. [sent-333, score-0.113]
97 Solve for ε: ε(m, r, q, δ/w) ≈ (1/(m − q)) [ −(q − r) ln(e(m − r)/(q − r)) + q ln(em/q) + ln(w/δ) ]. (10) [sent-334, score-0.253]
98 The error bound is linear in the inverse of the number of validation examples m − q, approximately linear in q − r and in q, logarithmic in the number w of candidates for q, and logarithmic in the inverse of δ. [sent-335, score-0.707]
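A sketch comparing the approximation (10) with the exact value of ε obtained by solving (9) directly; the closed form for the exact value is my algebra, and the numbers are illustrative only.

```python
from math import comb, log, e

def eps_exact(m, r, q, budget):
    """Solve comb(m-r, q-r)**(-1) * comb(m, q) * (1 - x)**(m - q) = budget."""
    return 1.0 - (budget * comb(m - r, q - r) / comb(m, q)) ** (1.0 / (m - q))

def eps_approx(m, r, q, budget):
    """Approximation (10), with budget = delta / w."""
    return (-(q - r) * log(e * (m - r) / (q - r))
            + q * log(e * m / q)
            + log(1.0 / budget)) / (m - q)

m, r, q, budget = 1000, 50, 400, 0.05 / 10
print(eps_exact(m, r, q, budget), eps_approx(m, r, q, budget))
```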
99 The first combination counts the number of copies of g* in the set of classifiers to be validated. [sent-342, score-0.164]
100 Using the bounds for the central and near-central terms of the binomial distribution from Feller (1968, p. [sent-346, score-0.198]
wordName wordTfidf (topN-words)
[('validation', 0.364), ('ev', 0.305), ('gi', 0.258), ('bax', 0.215), ('compression', 0.198), ('nearly', 0.197), ('evi', 0.181), ('warmuth', 0.176), ('copies', 0.164), ('uniform', 0.155), ('bounds', 0.154), ('lhs', 0.153), ('mproves', 0.145), ('niform', 0.145), ('ompression', 0.145), ('rhs', 0.137), ('littlestone', 0.137), ('theorem', 0.132), ('gn', 0.131), ('er', 0.131), ('ers', 0.13), ('zm', 0.125), ('ounds', 0.123), ('classi', 0.122), ('rror', 0.11), ('examples', 0.103), ('validate', 0.094), ('svms', 0.09), ('draws', 0.083), ('consistent', 0.082), ('indexed', 0.079), ('bound', 0.078), ('feller', 0.072), ('floyd', 0.072), ('marchand', 0.072), ('qopt', 0.072), ('ln', 0.069), ('analyzes', 0.066), ('error', 0.065), ('retained', 0.062), ('border', 0.062), ('training', 0.057), ('ai', 0.056), ('develops', 0.055), ('svm', 0.054), ('subsets', 0.051), ('support', 0.046), ('em', 0.046), ('produces', 0.045), ('simultaneous', 0.044), ('includes', 0.044), ('binomial', 0.044), ('eric', 0.044), ('independently', 0.044), ('early', 0.043), ('cristianini', 0.043), ('producing', 0.039), ('interior', 0.038), ('union', 0.037), ('logarithmic', 0.033), ('represent', 0.033), ('probability', 0.032), ('baxhome', 0.031), ('po', 0.031), ('canceling', 0.031), ('earliest', 0.031), ('lookup', 0.031), ('oor', 0.031), ('candidates', 0.031), ('events', 0.031), ('integer', 0.03), ('let', 0.03), ('proof', 0.029), ('identify', 0.029), ('applies', 0.028), ('approximations', 0.028), ('developed', 0.028), ('supersets', 0.027), ('lexicographically', 0.027), ('ei', 0.027), ('apply', 0.027), ('drawn', 0.025), ('copy', 0.025), ('tolerate', 0.025), ('langford', 0.025), ('yahoo', 0.025), ('superset', 0.025), ('stirling', 0.025), ('proofs', 0.025), ('develop', 0.025), ('said', 0.024), ('pac', 0.023), ('pv', 0.023), ('pasadena', 0.023), ('validated', 0.023), ('augment', 0.023), ('pd', 0.023), ('effectiveness', 0.023), ('improves', 0.023), ('discarding', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
2 0.15334232 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
Author: Gérard Biau, Luc Devroye, Gábor Lugosi
Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent. Keywords: random forests, classification trees, consistency, bagging This paper is dedicated to the memory of Leo Breiman.
3 0.15057912 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
4 0.11658286 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2 ) per trial, where n is the dimension of the instances. Keywords: principal component analysis, online learning, density matrix, expert setting, quantum Bayes rule
5 0.10459213 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
Author: Bo Jiang, Xuegong Zhang, Tianxi Cai
Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine
6 0.092316508 52 jmlr-2008-Learning from Multiple Sources
7 0.07064908 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
8 0.057366785 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
9 0.053337272 9 jmlr-2008-Active Learning by Spherical Subdivision
10 0.049461171 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
11 0.046689197 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
12 0.043970611 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
13 0.042558808 54 jmlr-2008-Learning to Select Features using their Properties
14 0.040601034 58 jmlr-2008-Max-margin Classification of Data with Absent Features
15 0.039297499 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
16 0.038458932 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
17 0.036400463 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
18 0.035076961 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
19 0.033636991 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
20 0.032637343 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
topicId topicWeight
[(0, 0.206), (1, -0.074), (2, 0.019), (3, -0.104), (4, -0.002), (5, -0.016), (6, 0.159), (7, -0.173), (8, -0.063), (9, -0.257), (10, 0.077), (11, 0.03), (12, -0.021), (13, 0.036), (14, -0.343), (15, 0.176), (16, 0.354), (17, 0.131), (18, -0.166), (19, 0.073), (20, -0.026), (21, -0.017), (22, 0.091), (23, -0.072), (24, -0.013), (25, 0.017), (26, -0.101), (27, -0.033), (28, 0.001), (29, -0.008), (30, -0.036), (31, -0.009), (32, -0.044), (33, 0.167), (34, -0.02), (35, 0.022), (36, 0.031), (37, 0.001), (38, -0.053), (39, 0.053), (40, -0.06), (41, -0.017), (42, -0.003), (43, -0.003), (44, 0.008), (45, -0.036), (46, 0.038), (47, -0.024), (48, -0.09), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.9706229 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
2 0.7152012 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
Author: Gérard Biau, Luc Devroye, Gábor Lugosi
Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent. Keywords: random forests, classification trees, consistency, bagging This paper is dedicated to the memory of Leo Breiman.
3 0.66505086 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
4 0.40576413 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
Author: Bo Jiang, Xuegong Zhang, Tianxi Cai
Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine
5 0.38349894 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
6 0.34336877 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
7 0.30652729 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
8 0.30266708 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
9 0.29884455 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
10 0.2931948 9 jmlr-2008-Active Learning by Spherical Subdivision
11 0.26099163 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
12 0.23475359 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
13 0.22987147 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
14 0.22776259 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
15 0.18733335 45 jmlr-2008-JNCC2: The Java Implementation Of Naive Credal Classifier 2 (Machine Learning Open Source Software Paper)
16 0.18693347 54 jmlr-2008-Learning to Select Features using their Properties
17 0.17924131 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
18 0.17914155 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
19 0.1775468 58 jmlr-2008-Max-margin Classification of Data with Absent Features
20 0.17282724 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
topicId topicWeight
[(0, 0.037), (11, 0.418), (31, 0.058), (40, 0.015), (54, 0.014), (58, 0.022), (66, 0.031), (76, 0.048), (88, 0.111), (92, 0.079), (94, 0.038), (99, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.73394287 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
2 0.31489694 96 jmlr-2008-Visualizing Data using t-SNE
Author: Laurens van der Maaten, Geoffrey Hinton
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling
3 0.31349888 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
4 0.30811569 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
5 0.30600908 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
6 0.30123529 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
7 0.3001146 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
8 0.29943064 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
9 0.29633093 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
10 0.29606682 9 jmlr-2008-Active Learning by Spherical Subdivision
11 0.29427904 57 jmlr-2008-Manifold Learning: The Price of Normalization
12 0.29111171 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
13 0.29059342 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
14 0.2905823 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
15 0.28957251 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
16 0.28943765 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
17 0.28941661 56 jmlr-2008-Magic Moments for Structured Output Prediction
18 0.28929466 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
19 0.28923672 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
20 0.28911442 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration