jmlr jmlr2008 jmlr2008-13 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
Reference: text
sentIndex sentText sentNum sentScore
1 Wilson Avenue Apartment 3 Pasadena, CA 91106 Editor: Gábor Lugosi Abstract This paper introduces a new PAC transductive error bound for classification. [sent-5, score-0.236]
2 The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. [sent-6, score-0.91]
3 A likely assignment with maximum error determines the bound. [sent-7, score-0.266]
4 Introduction An error bound based on VC dimension (Vapnik and Chervonenkis, 1971; Vapnik, 1998) uses uniform bounds over the largest number of assignments possible from a class of classifiers, based on worst-case arrangements of training and working examples. [sent-10, score-0.744]
5 However, as the number of training examples grows, the probability that training error is a good approximation of working error is so great that the VC error bound succeeds in spite of using uniform bounds based on worst-case assumptions about examples. [sent-11, score-0.741]
6 However, VC error bounds have some drawbacks: they are ineffective for smaller data sets, and they do not apply to some classifiers, such as nearest neighbor classifiers. [sent-14, score-0.35]
7 Transductive inference (Vapnik, 1998) is a training method that uses information provided by inputs of working examples in addition to information provided by training examples. [sent-15, score-0.432]
8 The idea is to develop the best classifier for the inputs of the specific working examples at hand rather than develop a classifier that is good for general inputs and then apply it to the working examples. [sent-16, score-0.592]
9 Transductive inference improves on general VC bounds by using the actual working example inputs, instead of a worst-case arrangement of inputs, to find the number of different assignments that classifiers in each training class can produce. [sent-17, score-0.555]
10 The error bound presented here is designed to provide error bounds for data sets so small that other bounds are ineffective. [sent-19, score-0.34]
11 Like transductive inference, the error bound presented here uses information provided by the inputs of the working examples in addition to information provided by the training examples. [sent-20, score-0.523]
12 The error bound in this paper is based on the fact that if the training and working examples are generated independently and identically distributed (i.i.d.), [sent-31, score-0.451]
13 then each partition of the complete set of training and working examples into a training set and a working set is equally likely. [sent-34, score-0.61]
14 339-343), compression-based error bounds (Littlestone and Warmuth, 1986), and uniform error bounds based on constraints imposed by patterns of agreement and disagreement among classifiers over the working inputs (Bax, 1999). [sent-39, score-0.565]
15 Section 6 introduces filters based on virtual partitioning, which do not require explicit computation over multiple partitions of the data into different training and working sets. [sent-47, score-0.462]
16 Section 7 gives an efficient algorithm to compute an error bound for a 1-nearest neighbor classifier. [sent-48, score-0.248]
17 , Xt+w that is, inputs and outputs of t training examples and just the inputs of w working examples. [sent-60, score-0.455]
18 (We will use T to denote the set of training examples and W to denote the set of working examples. [sent-61, score-0.301]
19 The bounds in this paper consider different possible assignments to the unknown labels of the working set examples and use permutations of the complete sequence. [sent-86, score-0.756]
20 So we introduce some notation around assignments and permutations. [sent-87, score-0.249]
21 , t+w}, let C(a, σ) be the sequence that results from assigning labels to the working examples in C: ∀i ∈ {1, . [sent-91, score-0.247]
22 Let g(a, σ) be the classifier produced by applying the training procedure with T(a, σ) as the training set and W(a, σ) as the working set. [sent-97, score-0.299]
23 Since each permutation of a complete sequence is equally likely, we can replace the integral over complete sequences by an integral over sequences Q of t+w examples followed by an average over permutations σ of Q to form complete sets: = ∫ (1/(t+w)!) ∑_σ I( P_{σ'}[E(a*, σ') ≥ E(a*, σ)] ≤ δ | C = σQ ) p(Q) dQ. [sent-129, score-0.523]
24 Let g(a, S) be the classifier that results from training with T(a, S) as the training set and W(a, S) as the working set. [sent-148, score-0.299]
25 Hence, we can compute errors E(a, S) over size-t subsets S rather than over all permutations in order to compute the bound in Equation (2). [sent-184, score-0.411]
26 3 Algorithm Given an array C of t training examples followed by w working examples and a bound failure probability ceiling δ, Algorithm 3.1 computes the bound. [sent-186, score-0.439]
procedure bound(C, delta) // Variable bound stores the running max of errors for likely assignments. [sent-192, score-0.239]
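A minimal sketch, in Python, of the brute-force bound with the complete filter described here. The helper names error and fit_predict, and the data layout (inputs X, training labels y_train), are illustrative assumptions, not the authors' implementation; for each candidate assignment a to the working labels, the filter compares E(a, S*) with E(a, S) over all size-t subsets S and keeps a running maximum of E(a, S*) over accepted assignments.

    from itertools import combinations, product

    def error(a, S, X, y_train, t, w, fit_predict):
        # E(a, S): error on the working side of partition S when assignment a
        # supplies labels for the w working examples of C.
        labels = list(y_train) + list(a)          # labels for all t+w examples
        S = set(S)
        g = fit_predict([(X[i], labels[i]) for i in S])
        work = [i for i in range(t + w) if i not in S]
        return sum(g(X[i]) != labels[i] for i in work) / w

    def bound(X, y_train, t, w, delta, fit_predict):
        # Running max of E(a, S*) over likely assignments (complete filter).
        S_star = tuple(range(t))                  # the actual training/working split
        subsets = list(combinations(range(t + w), t))
        best = 0.0
        for a in product((0, 1), repeat=w):       # all 2^w working-label assignments
            e_star = error(a, S_star, X, y_train, t, w, fit_predict)
            frac = sum(error(a, S, X, y_train, t, w, fit_predict) >= e_star
                       for S in subsets) / len(subsets)
            if frac > delta:                      # a is likely: E(a, S*) not abnormally high
                best = max(best, e_star)
        return best

The sampled filter and random tie breaking described in the following sentences replace the inner loop over all subsets.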
28 Sampled Filters and Ranking with Random Tie Breaking The goal is to reject from L as many false assignments a as possible among those that have E(a, Id) > E(a*, Id), while only rejecting a* in a fraction δ or fewer of cases. [sent-219, score-0.316]
29 The bound process in the previous section rejects assignments a for which E(a, S*) is abnormally high among errors E(a, S) over subsets S of {1, …, t+w}, [sent-220, score-0.557]
30 that is, among partitions of the complete sequence into training and working sets. [sent-223, score-0.406]
31 For each assignment a, the process is equivalent to ranking all subsets S in order of E(a, S), finding the fraction of subsets that outrank S*, even with S* losing all ties, and rejecting a if the fraction is δ or less. [sent-224, score-0.483]
32 This section introduces alternative filters that do not require computation over all subsets and a random tie breaking process that ranks S* fairly among subsets S having the same error instead of having S* lose all ties. [sent-226, score-0.556]
33 Of course, this random filter is unlikely to produce a strong bound, because it does not preferentially reject assignments a that have high error E(a, S*). [sent-231, score-0.317]
34 The following sampled filter, based on errors E(a, S) over a random sample of subsets S, rejects assignments with high error E(a, S*), and it is less expensive to compute than the complete filter. [sent-232, score-0.644]
35 For each assignment a, generate a sample of n size-t subsets S of {1, …, t+w}. [sent-233, score-0.272]
36 Then use the sample in place of the set of all subsets S in the algorithm, that is, accept assignment a if the fraction of subsets S in the sample with E(a, S) at least E(a, S*) is greater than δ. [sent-245, score-0.451]
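A hedged sketch of the sampled filter; E is assumed to be a callable E(a, S) such as the error function in the earlier sketch, and the other names are illustrative.

    import random

    def sampled_filter_accepts(a, E, t, w, n, delta, rng=random):
        # Accept a if, among n random size-t subsets S, the fraction with
        # E(a, S) >= E(a, S*) is greater than delta.
        S_star = tuple(range(t))
        e_star = E(a, S_star)
        hits = sum(E(a, tuple(rng.sample(range(t + w), t))) >= e_star for _ in range(n))
        return hits / n > delta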
37 2 Random Tie Breaking Both the complete filter and the sampled filter accept an assignment if the fraction of a set of subsets S with E(a, S) at least E(a, S*) is greater than δ. [sent-314, score-0.442]
38 In essence, if other subsets S have the same error as S*, then this rule errs on the side of safety by treating those subsets S as having greater error than S*. [sent-315, score-0.356]
39 To close the gap, use random tie breaking to rank S* at random among the subsets S that have E(a, S) = E(a, S*). [sent-317, score-0.341]
40 Generate an integer uniformly at random in [1,k] to be the number of subsets S with the same error that rank at or above S* after random tie breaking. [sent-319, score-0.322]
41 If that number plus the number of subsets S with error greater than that of S* is a larger fraction of the partitions than δ, then accept the assignment. [sent-320, score-0.344]
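A sketch of this acceptance rule, assuming the errors E(a, S) over the subsets under consideration (with S* counted among them) are already available as a list; the names are illustrative.

    import random

    def accepts_with_tie_breaking(e_star, errors, delta, rng=random):
        # errors: E(a, S) over the subsets S considered, with S* among them.
        greater = sum(e > e_star for e in errors)
        k = sum(e == e_star for e in errors)        # subsets tied with S* (S* included)
        at_or_above = greater + rng.randint(1, k)   # S*'s tie-broken rank among the ties
        return at_or_above / len(errors) > delta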
42 Since the maximum error E(a, S) over assignments in L is obtained by maintaining a running maximum as assignments are added to L, there is no need to store L explicitly. [sent-323, score-0.566]
43 To reduce computation, evaluate assignments a for membership in L in decreasing order of E(a, S*). [sent-328, score-0.249]
44 When the first assignment a is accepted into L, return E(a, S*) as the error bound, and stop. [sent-329, score-0.23]
45 To order assignments by E(a, S*), train a classifier g, and apply it to each input in W to form the assignment with zero error. [sent-330, score-0.411]
46 Invert that assignment to form the assignment with maximum error. [sent-331, score-0.324]
47 Invert single elements of that assignment to produce assignments with the next greatest error rates. [sent-332, score-0.479]
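A sketch of this ordering heuristic for binary labels; g is assumed to be an already-trained classifier, and the generator below is illustrative, covering only the first two error levels as in the description above.

    def assignments_by_decreasing_error(g, W_inputs):
        # Yield working-label assignments roughly in decreasing order of E(a, S*).
        predicted = tuple(g(x) for x in W_inputs)   # the zero-error assignment
        worst = tuple(1 - y for y in predicted)     # its inversion: maximum error
        yield worst
        for i in range(len(worst)):                 # flip single elements to get the
            a = list(worst)                         # next-greatest error rates
            a[i] = 1 - a[i]
            yield tuple(a)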
48 (This technique reduces computation only by a small fraction when the bound is effective; the reduction is only about half for an error bound of 0. [sent-334, score-0.266]
49 a ∈ Lh If the filter PS [h(a, S) ≥ h(a, S∗ )] > δ can be computed for each assignment a without explicitly computing h(a, S) over some subsets S, then we call it a virtual partition filter. [sent-352, score-0.355]
50 ) A filter based on leave-one-out errors excludes assignments a that cause an improbably large fraction of the leave-one-out errors in C(a) to be in the working set. [sent-356, score-0.642]
51 The frequencies have a hypergeometric distribution: if there are m leave-one-out errors in C(a), then P_S[h(a, S) = j] = binom(t+w−m, w−j) · binom(m, j) / binom(t+w, w), where binom(n, r) denotes a binomial coefficient and the probability is uniform over size-t subsets S of {1, …, t+w}. [sent-358, score-0.308]
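The probability above follows directly from counting; a minimal sketch using the standard-library math.comb (the function name is an illustrative assumption).

    from math import comb

    def p_loo_errors_in_working(j, m, t, w):
        # P_S[h(a, S) = j]: probability that exactly j of the m leave-one-out
        # errors in C(a) land in the working set W(S) for a random size-t subset S.
        if j < 0 or j > w or j > m:
            return 0.0
        return comb(t + w - m, w - j) * comb(m, j) / comb(t + w, w)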
52 Let n(Z) be the nearest neighbor to Z in T ∪ W − {Z}. [sent-380, score-0.221]
53 Computing an error bound using virtual partitions requires O(2^w poly(t+w)) time because the scoring function is computed for each of the 2^w assignments. [sent-397, score-0.402]
54 The computation requires O(poly(t+w)) space since each assignment can be filtered without reference to others, and the maximum error of a likely assignment can be maintained using a single variable. [sent-399, score-0.428]
55 Joachims discovered a method to bound the number of leave-one-out errors based on the results of training a single SVM on all examples. [sent-403, score-0.22]
56 Efficient Computation of Error Bounds for 1-Nearest Neighbor Classifiers When using virtual partitions based on leave-one-out errors to produce an error bound for a 1-nearest neighbor classifier, there is a way to avoid iterating over all assignments to compute the bound. [sent-411, score-0.761]
57 Avoiding this iteration leads to an efficient method to compute an error bound for a 1-nearest neighbor classifier, that is, a method that requires time polynomial in the size of the problem. [sent-412, score-0.248]
58 This section ends with a note on how to extend the algorithm to improve the bounds by allowing random tie breaking for ranking. [sent-416, score-0.292]
59 1 Preliminaries and Concepts To begin, ensure that each example has a unique nearest neighbor by randomly perturbing the inputs of examples that tie to be nearest neighbors to any example. [sent-418, score-0.621]
60 Perturb by so little that no new nearest neighbor can be introduced. [sent-419, score-0.221]
61 1 This form of random tie breaking makes it impossible for a cycle of three or more examples to have each example in the cycle the nearest neighbor of the next. [sent-423, score-0.508]
62 Let n(Z) be the nearest neighbor of example Z in T ∪ W − {Z}. [sent-425, score-0.221]
63 Suppose F contained a cycle Z_1, …, Z_m, Z_1 with m > 2 and each example the nearest neighbor of the next, that is, ∀i ∈ {1, …, m}, n(Z_i) = Z_{i+1} (with Z_{m+1} = Z_1). [sent-429, score-0.221]
64 For cycles of length greater than two, the tie breaking makes equality impossible. [sent-437, score-0.231]
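A sketch of one way to build the nearest-neighbor structure F described here; it is illustrative only, assuming Euclidean inputs, a tiny random perturbation to break ties as above, and that one member of each mutual-nearest-neighbor pair is kept as a root so the links form a forest.

    import random

    def build_forest(points, eps=1e-9, rng=random):
        # nn[i] is the index of Z_i's nearest neighbor n(Z_i); parent[i] equals nn[i]
        # except that the kept root of each two-cycle gets None; children[k] plays B(k).
        pts = [[c + rng.uniform(-eps, eps) for c in p] for p in points]  # break ties
        def d2(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        nn = [min((j for j in range(len(pts)) if j != i), key=lambda j: d2(pts[i], pts[j]))
              for i in range(len(pts))]
        parent = list(nn)
        for i, j in enumerate(nn):
            if nn[j] == i and i < j:      # mutual nearest neighbors form a two-cycle;
                parent[j] = None          # keep the higher-index member as a root
        children = {k: [i for i, p in enumerate(parent) if p == k] for k in range(len(pts))}
        return nn, parent, children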
65 The algorithm to efficiently compute an error bound uses dynamic programming, starting at the leaves of F and working up to the root or roots. [sent-455, score-0.341]
66 Let A(i,j,k) be the subset of assignments in {0, 1}^w that have i leave-one-out errors among the training examples in F(k) and j leave-one-out errors among the working examples of F(k). [sent-458, score-0.774]
67 Let n(Z,T) be the nearest neighbor of example Z among the training examples. [sent-459, score-0.275]
68 If there are no such assignments, then define e_{ijky} = −1, to signify that the value is "undefined." [sent-464, score-0.374]
69 For a leaf example Z_k that is in T and has label y, e_{00ky} = 0, and, for all other combinations of i, j, and y, e_{ijky} = −1. [sent-466, score-0.454]
70 For a leaf example Z_k that is in W whose nearest training example has label y, e_{00ky} = 0 and e_{0,0,k,1−y} = 1, and, for all other combinations of i, j, and y, e_{ijky} = −1. [sent-467, score-0.454]
71 For each example Z_i, let Z_k be the parent of Z_i in F, and let y_k = Z_k.y. [sent-470, score-0.274]
72 Define c_T(i, y_i, k, y_k) = 1 if Z_i ∈ T and y_i ≠ y_k, and 0 otherwise (9), to count whether Z_k having label y_k causes example Z_i to be a leave-one-out error in T. [sent-472, score-0.815]
73 Define d_T(i, y_i, k, y_k) = 1 if Z_k ∈ T, y_i ≠ y_k, and Z_i = n(Z_k), and 0 otherwise (10), to count whether example Z_i having label y_i causes example Z_k to be a leave-one-out error in T. [sent-473, score-0.815]
74 Define c_W(i, y_i, k, y_k) = 1 if Z_i ∈ W and y_i ≠ y_k, and 0 otherwise (11), to count whether example Z_k having label y_k causes example Z_i to be a leave-one-out error in W if it has label y_i. [sent-474, score-0.895]
75 Define d_W(i, y_i, k, y_k) = 1 if Z_k ∈ W, y_i ≠ y_k, and Z_i = n(Z_k), and 0 otherwise (12), to count whether example Z_i having label y_i causes example Z_k to be a leave-one-out error in W if it has label y_k. [sent-475, score-0.895]
76 Define n_T(Z) to be the nearest neighbor of example Z in T. [sent-476, score-0.221]
77 Define h(k, y_k) = 1 if Z_k ∈ W and y_k ≠ n_T(Z_k).y, and 0 otherwise, to count whether example Z_k having label y_k causes example Z_k to be misclassified by its nearest training example. [sent-478, score-0.482]
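These definitions translate directly into code; a minimal sketch whose signatures mirror (9)-(12). Here in_T marks training examples, nn gives each example's nearest neighbor in T ∪ W − {Z}, and nn_T_label gives the label of its nearest training example, as in the forest sketch above; the exact form of h follows the reconstruction given here and is an assumption.

    def c_T(b, yb, k, y, in_T):            # child Z_b becomes a leave-one-out error in T
        return 1 if in_T[b] and yb != y else 0

    def d_T(b, yb, k, y, in_T, nn):        # parent Z_k becomes a leave-one-out error in T
        return 1 if in_T[k] and yb != y and nn[k] == b else 0

    def c_W(b, yb, k, y, in_T):            # child Z_b becomes a leave-one-out error in W
        return 1 if (not in_T[b]) and yb != y else 0

    def d_W(b, yb, k, y, in_T, nn):        # parent Z_k becomes a leave-one-out error in W
        return 1 if (not in_T[k]) and yb != y and nn[k] == b else 0

    def h(k, y, in_T, nn_T_label):         # working example Z_k misclassified by its
        return 1 if (not in_T[k]) and y != nn_T_label[k] else 0   # nearest training example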
78 Then e_{ijky} = max [ h(k, y) + ∑_{b∈B(k)} e_{i_b, j_b, b, y_b} ], where the maximum is over choices of (i_b, j_b, y_b) for each b ∈ B(k) with e_{i_b, j_b, b, y_b} ≠ −1, [sent-483, score-0.867]
79 subject to ∑_{b∈B(k)} [i_b + c_T(b, y_b, k, y) + d_T(b, y_b, k, y)] = i and ∑_{b∈B(k)} [j_b + c_W(b, y_b, k, y) + d_W(b, y_b, k, y)] = j. (13) [sent-485, score-1.021]
80 Let A(i,j) be the subset of assignments in {0, 1}^w that have i leave-one-out errors on training examples and j leave-one-out errors on working examples. [sent-486, score-0.718]
81 Let v_{ij} be the maximum, over assignments a ∈ A(i, j), of the number of working examples misclassified by their nearest training examples, that is, the maximum error over assignments that have i leave-one-out errors on training examples and j leave-one-out errors on working examples. [sent-489, score-0.786]
82 The value v_{ij} is obtained from the entries e_{i_b, j_b, b, y_b} at the roots b ∈ B of F, subject to ∑_{b∈B} i_b = i and ∑_{b∈B} j_b = j. Let u_{ij} be the probability that j or more leave-one-out errors are in W(S) for a random size-t subset S of {1, …, t+w}. [sent-495, score-0.275]
83 (14) For a given δ, the value max vi j (15) ui j ≥δ bounds the number of working examples misclassified by their training examples, with probability at least 1-δ. [sent-500, score-0.394]
84 Note that the recurrence for each term e_{ijky} depends on terms e_{ijby} for all b ∈ B(k). [sent-501, score-0.592]
85 Compute terms e_{ijky} in that order, to ensure that each term is computed prior to computing any term that depends on it. [sent-503, score-0.266]
86 3 Example of Computing Values e_{ijky} Consider a small example to demonstrate the recurrence for e_{ijky}. [sent-507, score-0.712]
87 So compute terms e_{ijky} for k = 0, then k = 2, then k = 1. [sent-515, score-0.374]
88 Since example 0 is in T and has output y_0 = 0, e_{0000} = 0, meaning that, in the single-node subtree consisting of node 0, there are no leave-one-out errors in T or in W, and there are no examples in W misclassified by nearest neighbors in T. [sent-517, score-0.411]
89 If y_2 = 0, then example 2 is misclassified by its nearest neighbor in T, which is example 1. [sent-521, score-0.221]
90 Each pair of terms with one from each child node can produce a term for node 1. [sent-528, score-0.266]
91 Relationships between node 1 and each child node contribute to the term. [sent-530, score-0.266]
92 For the relationship between node 1 and node 0, the values are n = 0, y_n = 0, k = 1, and y_k = 1. [sent-531, score-0.347]
93 For the relationship between node 1 and node 2, the values are n = 2, y_n = 0, k = 1, and y_k = 1. [sent-538, score-0.347]
94 The value e_{2111} = 1 means that, in the subtree of F rooted at node 1, that is, in F, it is possible to have two leave-one-out errors in T, one leave-one-out error in W, and one example in W misclassified by its nearest neighbor in T. [sent-551, score-0.567]
95 The relationship between node 1 and node 2 changes because now y_2 = 1, so n = 2, y_n = 1, k = 1, and y_k = 1. [sent-554, score-0.347]
96 The value is e_{2011} = h(1, 1) + e_{0000} + e_{0021} = 0 + 0 + 0 = 0, which means that it is possible to have two leave-one-out errors in T, zero leave-one-out errors in W, and zero examples in W misclassified by nearest neighbors in T. [sent-565, score-0.347]
97 This completes the computation of e_{ijky} values for this problem. [sent-567, score-0.374]
98 Computing values e_{ijky} and v_{ij} directly from the recurrences is inefficient. [sent-570, score-0.374]
99 To compute each slice, iterate through children Z_b of Z_k, using a slice-sized array prev to store the accumulation over terms from children before Z_b and a slice-sized array next to store the accumulation of terms from children up to and including Z_b. [sent-575, score-0.216]
100 In other words, when the iteration begins for child Z_{b*}, prev_{ij} is the value that e_{ijky} would have if the subtree in F rooted at k had only the children preceding Z_{b*}. [sent-576, score-0.669]
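A compact sketch of this child-by-child accumulation, following the reconstruction of recurrence (13) above. The data structures and names are illustrative assumptions: children[k] is B(k), in_T marks training examples, label_T holds training labels, nn[k] is n(Z_k), nn_T_label[k] is the label of Z_k's nearest training example, and e[b][yb] is a dictionary mapping (i, j) to the best count in the subtree rooted at b with Z_b labeled yb (missing keys stand for the −1 entries); the dictionaries prev and nxt play the roles of the slice arrays prev and next described here.

    def combine_children(k, y, children, e, in_T, label_T, nn, nn_T_label):
        # Compute the slice e_{i,j,k,y} of recurrence (13) by folding in one child
        # at a time.
        hk = 1 if (not in_T[k]) and y != nn_T_label[k] else 0    # h(k, y)
        prev = {(0, 0): hk}                                      # before any child
        for b in children[k]:
            labels_b = [label_T[b]] if in_T[b] else [0, 1]       # T children keep their label
            nxt = {}
            for (i1, j1), count1 in prev.items():
                for yb in labels_b:
                    for (i2, j2), count2 in e[b][yb].items():
                        cT = 1 if in_T[b] and yb != y else 0                 # child errs in T
                        dT = 1 if in_T[k] and yb != y and nn[k] == b else 0  # parent errs in T
                        cW = 1 if (not in_T[b]) and yb != y else 0           # child errs in W
                        dW = 1 if (not in_T[k]) and yb != y and nn[k] == b else 0
                        key = (i1 + i2 + cT + dT, j1 + j2 + cW + dW)
                        val = count1 + count2
                        if val > nxt.get(key, -1):
                            nxt[key] = val
            prev = nxt
        return prev      # prev[(i, j)] is e_{ijky}; absent keys are the -1 entries

For a training example Z_k, the caller would pass its actual label as y, leaving the entries for the other label undefined, as in the leaf initialization described above.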
wordName wordTfidf (topN-words)
[('zk', 0.277), ('jky', 0.266), ('assignments', 0.249), ('working', 0.191), ('yk', 0.189), ('bax', 0.177), ('lter', 0.175), ('assignment', 0.162), ('allejas', 0.152), ('ikely', 0.152), ('orst', 0.152), ('ssignment', 0.152), ('dw', 0.145), ('id', 0.145), ('tie', 0.144), ('permutations', 0.135), ('jb', 0.133), ('ound', 0.129), ('ps', 0.127), ('nearest', 0.123), ('rror', 0.116), ('yb', 0.113), ('cw', 0.111), ('subsets', 0.11), ('child', 0.108), ('ei', 0.108), ('ct', 0.104), ('neighbor', 0.098), ('partitions', 0.097), ('breaking', 0.087), ('zb', 0.087), ('lm', 0.086), ('parent', 0.085), ('errors', 0.084), ('virtual', 0.083), ('bound', 0.082), ('lters', 0.079), ('node', 0.079), ('vc', 0.078), ('inputs', 0.077), ('recurrences', 0.076), ('zt', 0.072), ('recurrence', 0.072), ('scoring', 0.072), ('misclassi', 0.072), ('children', 0.072), ('sequences', 0.07), ('zi', 0.069), ('subtree', 0.069), ('error', 0.068), ('poly', 0.065), ('replacement', 0.064), ('complete', 0.064), ('dt', 0.061), ('bounds', 0.061), ('ib', 0.058), ('byb', 0.057), ('callejas', 0.057), ('eib', 0.057), ('examples', 0.056), ('training', 0.054), ('storage', 0.049), ('transductive', 0.049), ('lhs', 0.048), ('label', 0.048), ('rooted', 0.046), ('zm', 0.044), ('lh', 0.043), ('yt', 0.043), ('invert', 0.04), ('ec', 0.04), ('uniform', 0.039), ('scores', 0.039), ('entries', 0.039), ('augusto', 0.038), ('catoni', 0.038), ('hypergeometric', 0.038), ('jby', 0.038), ('markets', 0.038), ('tail', 0.037), ('frequencies', 0.037), ('filters', 0.037), ('completing', 0.037), ('stores', 0.037), ('introduces', 0.037), ('drawn', 0.037), ('sampled', 0.037), ('causing', 0.036), ('xt', 0.036), ('likely', 0.036), ('count', 0.035), ('accept', 0.035), ('fraction', 0.034), ('rejecting', 0.033), ('causes', 0.033), ('ers', 0.033), ('rejects', 0.032), ('leaf', 0.032), ('vi', 0.032), ('yi', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
2 0.15057912 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
3 0.086737901 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
Author: Ilya Shpitser, Judea Pearl
Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification
4 0.068644032 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
Author: Elena Marchiori
Abstract: In supervised learning, a training set consisting of labeled instances is used by a learning algorithm for generating a model (classifier) that is subsequently employed for deciding the class label of new instances (for generalization). Characteristics of the training set, such as presence of noisy instances and size, influence the learning algorithm and affect generalization performance. This paper introduces a new network-based representation of a training set, called hit miss network (HMN), which provides a compact description of the nearest neighbor relation over pairs of instances from each pair of classes. We show that structural properties of HMN’s correspond to properties of training points related to the one nearest neighbor (1-NN) decision rule, such as being border or central point. This motivates us to use HMN’s for improving the performance of a 1-NN classifier by removing instances from the training set (instance selection). We introduce three new HMN-based algorithms for instance selection. HMN-C, which removes instances without affecting accuracy of 1-NN on the original training set, HMN-E, based on a more aggressive storage reduction, and HMN-EI, which applies iteratively HMN-E. Their performance is assessed on 22 data sets with different characteristics, such as input dimension, cardinality, class balance, number of classes, noise content, and presence of redundant variables. Results of experiments on these data sets show that accuracy of 1-NN classifier increases significantly when HMN-EI is applied. Comparison with state-of-the-art editing algorithms for instance selection on these data sets indicates best generalization performance of HMN-EI and no significant difference in storage requirements. In general, these results indicate that HMN’s provide a powerful graph-based representation of a training set, which can be successfully applied for performing noise and redundance reduction in instance-based learning. Keywords: graph-based training set repre
5 0.066493981 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
6 0.056319766 56 jmlr-2008-Magic Moments for Structured Output Prediction
7 0.05366857 52 jmlr-2008-Learning from Multiple Sources
8 0.051937547 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
9 0.050400313 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
10 0.045719232 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
11 0.044086963 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
12 0.043679524 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
13 0.043465015 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
14 0.041426934 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs (Special Topic on Causality)
15 0.041071679 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
16 0.040798116 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
17 0.040125843 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
18 0.039182134 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
19 0.038180992 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
20 0.037703164 87 jmlr-2008-Stationary Features and Cat Detection
topicId topicWeight
[(0, 0.206), (1, 0.012), (2, 0.06), (3, -0.075), (4, -0.08), (5, 0.012), (6, 0.094), (7, -0.117), (8, -0.056), (9, -0.218), (10, -0.041), (11, 0.154), (12, -0.09), (13, -0.055), (14, -0.107), (15, 0.037), (16, 0.241), (17, 0.097), (18, -0.203), (19, -0.003), (20, -0.124), (21, 0.005), (22, -0.003), (23, 0.132), (24, -0.05), (25, 0.173), (26, -0.093), (27, -0.161), (28, -0.057), (29, -0.056), (30, -0.021), (31, 0.032), (32, 0.102), (33, -0.073), (34, -0.263), (35, -0.03), (36, 0.06), (37, -0.023), (38, -0.187), (39, 0.102), (40, -0.117), (41, 0.171), (42, 0.019), (43, 0.087), (44, 0.023), (45, 0.143), (46, 0.138), (47, -0.031), (48, -0.039), (49, -0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.96108192 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
2 0.59322429 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
3 0.42350113 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
Author: Elena Marchiori
Abstract: In supervised learning, a training set consisting of labeled instances is used by a learning algorithm for generating a model (classifier) that is subsequently employed for deciding the class label of new instances (for generalization). Characteristics of the training set, such as presence of noisy instances and size, influence the learning algorithm and affect generalization performance. This paper introduces a new network-based representation of a training set, called hit miss network (HMN), which provides a compact description of the nearest neighbor relation over pairs of instances from each pair of classes. We show that structural properties of HMN’s correspond to properties of training points related to the one nearest neighbor (1-NN) decision rule, such as being border or central point. This motivates us to use HMN’s for improving the performance of a 1-NN classifier by removing instances from the training set (instance selection). We introduce three new HMN-based algorithms for instance selection. HMN-C, which removes instances without affecting accuracy of 1-NN on the original training set, HMN-E, based on a more aggressive storage reduction, and HMN-EI, which applies iteratively HMN-E. Their performance is assessed on 22 data sets with different characteristics, such as input dimension, cardinality, class balance, number of classes, noise content, and presence of redundant variables. Results of experiments on these data sets show that accuracy of 1-NN classifier increases significantly when HMN-EI is applied. Comparison with state-of-the-art editing algorithms for instance selection on these data sets indicates best generalization performance of HMN-EI and no significant difference in storage requirements. In general, these results indicate that HMN’s provide a powerful graph-based representation of a training set, which can be successfully applied for performing noise and redundance reduction in instance-based learning. Keywords: graph-based training set repre
4 0.33809331 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
Author: Ilya Shpitser, Judea Pearl
Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification
5 0.29679504 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
6 0.27632681 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
7 0.27580497 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
8 0.26130748 52 jmlr-2008-Learning from Multiple Sources
9 0.24816246 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
10 0.23216745 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
11 0.22748388 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
12 0.20615032 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
13 0.20132184 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
14 0.19497925 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
15 0.19474067 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
16 0.19321197 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
17 0.17829096 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
18 0.17827876 57 jmlr-2008-Manifold Learning: The Price of Normalization
19 0.17266141 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
20 0.17133942 9 jmlr-2008-Active Learning by Spherical Subdivision
topicId topicWeight
[(0, 0.02), (11, 0.015), (31, 0.553), (40, 0.027), (54, 0.037), (58, 0.031), (66, 0.037), (76, 0.014), (88, 0.062), (92, 0.058), (94, 0.033), (99, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.92465478 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
2 0.82906711 46 jmlr-2008-LIBLINEAR: A Library for Large Linear Classification (Machine Learning Open Source Software Paper)
Author: Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin
Abstract: LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets. Keywords: large-scale linear classification, logistic regression, support vector machines, open source, machine learning
3 0.40682483 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
4 0.26933569 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
Author: Ilya Shpitser, Judea Pearl
Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification
5 0.26112106 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
6 0.2569406 56 jmlr-2008-Magic Moments for Structured Output Prediction
7 0.25079006 86 jmlr-2008-SimpleMKL
9 0.24846675 7 jmlr-2008-A Tutorial on Conformal Prediction
10 0.24717429 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
11 0.2465217 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
12 0.24413629 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
13 0.24143824 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
14 0.2408399 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
15 0.23569183 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
16 0.2343343 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
17 0.2334694 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
18 0.23311338 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
19 0.23273936 58 jmlr-2008-Max-margin Classification of Data with Absent Features
20 0.23044395 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons