jmlr jmlr2005 jmlr2005-50 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mario Marchand, Marina Sokolova
Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory
Reference: text
sentIndex sentText sentNum sentScore
1 We will see that the proposed learning algorithm for the DLM with data-dependent features effectively compresses the training data into a small subset of examples called the compression set. [sent-35, score-0.41]
2 The proposed bound will apply to any compression set-dependent distribution of messages (see the definition in Section 4) and allows for the message set to be of variable size (in contrast with the sample compression bound of Littlestone and Warmuth (1986) that requires fixed size). [sent-37, score-0.764]
3 Else If h_r(x) then b_r, Else b_{r+1}, where each b_i ∈ {0, 1} defines the output of f(x) if and only if h_i is the first feature to be satisfied on x (i.e., the smallest i for which h_i(x) = 1). [sent-49, score-0.313]
4 To compute a conjunction of the h_i's, we simply place in f the negation of each h_i with b_i = 0 for i = 1, . . . , r. [sent-57, score-0.327]
5 With this prescription, given any example x, we always have that sgn(∑_{i=1}^{r} w_i h_i(x) − θ) = 2b_k − 1, where k is the smallest integer for which h_k(x) = 1 in the DLM, or k = r + 1 if h_i(x) = 0 ∀i ∈ {1, . . . , r}. [sent-76, score-0.336]
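To make the evaluation rule above concrete, here is a minimal sketch (in Python, with illustrative names, not the authors' code) of how a DLM of this form classifies an example: the output is the b_i attached to the first feature h_i satisfied by x, or the default b_{r+1} if none is satisfied.

```python
def evaluate_dlm(features, default, x):
    """Evaluate a decision list machine on example x.

    `features` is the ordered list R = [(h_1, b_1), ..., (h_r, b_r)], where
    each h_i is a callable returning 0 or 1 and each b_i is the output bit
    attached to it; `default` is b_{r+1}.
    """
    for h, b in features:
        if h(x) == 1:      # first satisfied feature determines the output
            return b
    return default         # no feature satisfied: return b_{r+1}
```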
6 For a given set S′ = P′ ∪ N′ of examples (where P′ ⊆ P and N′ ⊆ N) and a given set H of features, consider only the features h_i ∈ H which either have h_i(x) = 0 for all x ∈ P′ or h_i(x) = 0 for all x ∈ N′. [sent-83, score-0.505]
7 Let Qi be the subset of examples on which hi = 1 (our constraint on the choice of hi implies that Qi contains only examples having the same class label). [sent-84, score-0.34]
8 We therefore define the usefulness U_i of feature h_i by U_i ≝ max{|P_i| − p_n|N_i|, |N_i| − p_p|P_i|}, where p_n denotes the penalty of making an error on a negative example whereas p_p denotes the penalty of making an error on a positive example. [sent-99, score-0.361]
9 Indeed, whenever we add to a DLM a feature h_i for which P_i and N_i are both non-empty, the output b_i associated with h_i will be 1 if |P_i| − p_n|N_i| ≥ |N_i| − p_p|P_i|, or 0 otherwise. [sent-100, score-0.408]
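The greedy scoring step just described can be sketched as follows (a hypothetical helper, not the paper's BuildDLM code): the two arguments of the max correspond to using the feature with output bit 1 or 0, and the larger of the two also fixes b_i.

```python
def usefulness(P_i, N_i, p_p, p_n):
    """Usefulness U_i = max(|P_i| - p_n*|N_i|, |N_i| - p_p*|P_i|).

    P_i / N_i are the positive / negative examples on which h_i = 1, and
    p_p / p_n are the error penalties described above.
    Returns (U_i, b_i), where b_i is the output bit attached to h_i.
    """
    score_as_positive = len(P_i) - p_n * len(N_i)   # use h_i with b_i = 1
    score_as_negative = len(N_i) - p_p * len(P_i)   # use h_i with b_i = 0
    if score_as_positive >= score_as_negative:
        return score_as_positive, 1
    return score_as_negative, 0
```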
10 Output: A decision list f consisting of an ordered set R = {(h_i, b_i)}_{i=1}^{r} of features h_i with their corresponding output values b_i, and a default value b_{r+1}. [sent-113, score-0.393]
11 Hence, a ball is a feature that covers the examples that are located inside the ball; whereas a hole covers the examples that are located outside. [sent-166, score-0.311]
12 Hence, each ball and hole is identified by only two training examples: a center c and a border b that identifies the radius with d(c, b). [sent-174, score-0.507]
13 To avoid having examples directly on the decision surface of the DLM, the radius ρ of a ball of center c will always be given by ρ = d(c, b) − ε for some training example b chosen for the border and some fixed and very small positive value ε. [sent-185, score-0.485]
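The ball and hole features described here can be sketched as follows (illustrative Python; the handling of ε for holes is an assumption, since this excerpt only states the ball convention ρ = d(c, b) − ε).

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def make_ball(center, border, eps=1e-9, dist=euclidean):
    """Ball of center c: covers x iff d(x, c) <= d(c, border) - eps."""
    radius = dist(center, border) - eps
    return lambda x: 1 if dist(x, center) <= radius else 0

def make_hole(center, border, eps=1e-9, dist=euclidean):
    """Hole of center c: covers x iff it lies *outside* the ball of the same
    center; the sign of the eps adjustment here is assumed, not taken from
    the paper."""
    radius = dist(center, border) - eps
    return lambda x: 1 if dist(x, center) > radius else 0
```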
14 We have not chosen to assign the radius values “in between” two training examples since this would have required three examples per ball and hole and would have substantially decreased the tightness of our generalization error bound without providing a significant increase of discriminative power. [sent-187, score-0.352]
15 Whenever a half-space h^c_{a,b} is chosen to be appended to the DLM, we must also provide an output value b which will be the output of the DLM on example x when h^c_{a,b} is the first feature of the DLM having h^c_{a,b}(x) = 1. [sent-203, score-0.383]
16 Hence, in this section, we provide a general risk bound that depends on the number of examples that are used in the final classifier and the size of the information message needed to identify the final classifier from the compression set. [sent-211, score-0.506]
17 Later, we apply this general risk bound to DLMs by making appropriate choices for a compression set-dependent distribution of messages. [sent-215, score-0.359]
18 This implies that there exists a reconstruction function R , associated to A, that outputs a classifier R (σ, zi ) when given an arbitrary compression set zi and message string σ chosen from the set M (zi ) of all distinct messages that can be supplied to R with the compression set zi . [sent-224, score-2.007]
19 It is only when such an R exists that the classifier returned by A(S) is always identified by a compression set zi and a message string σ. [sent-225, score-0.834]
20 Given a training set S, the compression set z_i is defined by a vector i of indices such that i ≝ (i_1, i_2, . . . , i_{|i|}). [sent-226, score-0.737]
21 In contrast, we will see in the next section that the reconstruction function for DLMs needs both a compression set and a message string. [sent-238, score-0.407]
22 We seek a tight risk bound for arbitrary reconstruction functions that holds uniformly for all compression sets and message strings. [sent-239, score-0.51]
23 Recall that classifier A(zm ) is described entirely in terms of a compression set zi ⊂ zm and a message string σ ∈ M (zi ). [sent-253, score-0.876]
24 Let M (zi ) be the set of all messages σ that can be attached to compression set zi . [sent-255, score-0.737]
25 We will now stratify P_{Z_ī|Z_i}(R(R(σ, Z_i)) > ε) in terms of the errors that R(σ, Z_i) can make on the training examples that are not used for the compression set. [sent-260, score-0.319]
26 Given a training sample zm and a compression set zi , we denote by Rzi ( f ) the vector of indices pointing to the examples in zi which are misclassified by f . [sent-262, score-1.153]
27 Given any compression set z_i, let us now use any function P_{M(z_i)}(σ) which has the property that ∑_{σ∈M(z_i)} P_{M(z_i)}(σ) ≤ 1 (Equation 7) and can, therefore, be interpreted as a compression set-dependent distribution of messages when it sums to one. [sent-266, score-0.993]
28 However, a reconstruction function will generally need less additional information when it is constrained to produce a classifier making no errors with the compression set. [sent-273, score-0.319]
29 Hence, the above risk bound will generally be smaller for sample-compression learning algorithms that always return a classifier making no errors on the compression set. [sent-274, score-0.359]
30 Finally, note that the risk bound is small for classifiers making a small number |j| of training errors, having a small compression set size |i|, and having a message string σ with a large prior “probability” P_{M(Z_i)}(σ). [sent-276, score-0.568]
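The exact expression of Theorem 1 is not reproduced in this excerpt, so the snippet below is only a schematic of the qualitative behaviour just described: a union-bound style penalty that grows with the compression set size |i|, the number of training errors |j|, and −ln P_M(σ), which the actual bound then converts into a risk value over the m − |i| − |j| remaining examples.

```python
import math

def log_choose(n, k):
    """ln C(n, k), computed via log-gamma for numerical stability."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def schematic_penalty(m, i_size, j_size, log_msg_prior, delta):
    """Schematic union-bound penalty (NOT the exact bound of Theorem 1):
    one term per choice of compression set, per choice of error points,
    per message, plus the confidence parameter delta."""
    return (log_choose(m, i_size)              # which |i| examples are kept
            + log_choose(m - i_size, j_size)   # which |j| examples are errors
            - log_msg_prior                    # -ln P_M(sigma)
            + math.log(1.0 / delta))
```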
31 Comparison with Other Risk Bounds. Although the risk bound of Theorem 1 is basically a sample compression bound, it nevertheless applies to a much broader class of learning algorithms than just sample compression learning algorithms. [sent-279, score-0.643]
32 Indeed the risk bound depends on two complementary sources of information used to identify the classifier: the sample compression set zi and the message string σ. [sent-280, score-0.937]
33 In fact, the bound still holds when the sample compression set vanishes and when the classifier h = R (σ) is described entirely in terms of a message string σ. [sent-281, score-0.466]
34 Comparison with Data-Independent Bounds. The risk bound of Theorem 1 can be qualified as “data-dependent” when the learning algorithm is searching among a class of functions (classifiers) described in terms of a subset z_i of the training set. [sent-284, score-0.526]
35 Consequently, the bound of Theorem 1 is a generalization of the Occam’s razor bound to the case where the classifiers are identified by two complementary sources of information: the message string σ and the compression set zi . [sent-293, score-0.89]
36 The proposed bound is obtained by using a union bound over the possible compression subsets of the training set and over the possible messages σ ∈ M (zi ). [sent-294, score-0.424]
37 The ln(m + 1) terms are absent from the Littlestone and Warmuth compression bounds because their bounds hold uniformly for all compression sets of a fixed size |i| and for all configurations of training error points of a fixed amount |j|. [sent-313, score-0.585]
38 The bound of Theorem 1 holds uniformly for all compression sets of arbitrary sizes and for all configurations of training error points of an arbitrary amount. [sent-316, score-0.311]
39 When the classifier is allowed to make training errors, the bound of Equation 14 is tighter than the lossy compression bound of Theorem 5. [sent-326, score-0.339]
40 Consequently, the bound of Theorem 1 is tighter than the above-mentioned sample compression bounds for three reasons. [sent-328, score-0.307]
41 Third, in contrast with the other sample compression bounds, the bound of Theorem 1 is valid for any a priori defined sample compression-dependent distribution of messages PM (zi ) (σ). [sent-331, score-0.369]
42 Indeed, we feel that it is important to allow the set of possible messages and the message set size to depend on the sample compression zi since the class labels of the compression set examples give information about the set of possible data-dependent features that can be constructed from zi . [sent-333, score-1.603]
43 Indeed, it is conceivable that for some zi , very little extra information may be needed to identify the classifier whereas for some other zi , more information may be needed. [sent-334, score-0.792]
44 Consider, for example, the case where the compression set consists of two examples that are used by the reconstruction function R to obtain a single-ball classifier. [sent-335, score-0.332]
45 For the reconstruction function of the set covering machine (described in the next section), a ball border must be a positive example whereas both positive and negative examples are allowed for ball centers. [sent-336, score-0.484]
46 In that case, if the two examples in the compression set have a positive label, the reconstruction function needs a message string of at least one bit that indicates which example is the ball center. [sent-337, score-0.613]
47 If the two examples have opposite class labels, then the negative example is necessarily the ball center and no message at all is needed to reconstruct the classifier. [sent-338, score-0.351]
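This case analysis is small enough to write down directly (a sketch with hypothetical names): the message length needed to identify the ball center depends only on the labels of the two compression set examples.

```python
def center_message_bits(label_a, label_b):
    """Bits needed to identify the ball center among two compression set
    examples for the SCM reconstruction described above.  Labels are +1/-1;
    two negative examples cannot occur, since a ball border must be positive."""
    if label_a == +1 and label_b == +1:
        return 1   # both positive: one bit says which example is the center
    return 0       # opposite labels: the negative example must be the center
```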
48 More generally, the set of messages that we use for all types of DLMs proposed in this paper depends on some properties of zi like its number n(zi ) of negative examples. [sent-339, score-0.481]
49 Without such a dependency on z_i, the set of possible messages M could be unnecessarily large and would then loosen the risk bound. [sent-340, score-0.556]
50 Comparison with the Set Covering Machine Risk Bound. The risk bound for the set covering machine (SCM) (Marchand and Shawe-Taylor, 2001, 2002) is not a general-purpose sample compression risk bound like the one provided by Theorem 1. [sent-342, score-0.495]
51 For the case of data-dependent balls and holes, each feature is identified by a training example called a center (xc , yc ), and another training example called a border (xb , yb ). [sent-349, score-0.477]
52 In this case, given a compression set z_i, we need to specify the examples in z_i that are used as a border point without being used as a center. [sent-351, score-1.29]
53 As explained by Marchand and Shawe-Taylor (2001), no additional amount of information is required to pair each center with its border point whenever the reconstruction function R is constrained to produce a classifier that always correctly classifies the compression set. [sent-352, score-0.6]
54 In that case, each message σ ∈ M (zi ) just needs to specify the positive examples that are a border point without being a center. [sent-354, score-0.353]
55 Let n(zi ) and p(zi ) be, respectively, the number of negative and the number of positive examples in compression set zi . [sent-355, score-0.688]
56 Let b(σ) be the number of border point examples specified in message σ and let ζ(a) be defined by Equation 9. [sent-356, score-0.324]
57 We can then use P_{M(Z_i)}(σ) = ζ(b(σ)) · (p(z_i) choose b(σ))^{−1} (Equation 15) since, in that case, we have for any compression set z_i: ∑_{σ∈M(z_i)} P_{M(z_i)}(σ) = ∑_{b=0}^{p(z_i)} ζ(b) ∑_{σ: b(σ)=b} (p(z_i) choose b)^{−1} ≤ 1. [sent-357, score-0.652]
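As a sanity check of the inequality above, the following sketch evaluates this message distribution numerically. The form of ζ is an assumption (Equation 9 is not reproduced in this excerpt); any ζ with ∑_{a≥0} ζ(a) ≤ 1 would do, and ζ(a) = (6/π²)(a+1)^{-2} is a common choice in this line of work.

```python
from math import comb, pi

def zeta(a):
    # Assumed form; the only property used is that zeta sums to at most 1.
    return (6.0 / pi ** 2) / (a + 1) ** 2

def scm_message_prior(p_zi, b_sigma):
    """P_M(sigma) = zeta(b(sigma)) / C(p(z_i), b(sigma)) for a message listing
    which of the p(z_i) positive examples are border-only points."""
    return zeta(b_sigma) / comb(p_zi, b_sigma)

# For each b there are exactly C(p_zi, b) messages, so the total prior mass
# is sum_b zeta(b) <= 1, as required.
p_zi = 5
total = sum(comb(p_zi, b) * scm_message_prior(p_zi, b) for b in range(p_zi + 1))
assert total <= 1.0 + 1e-12
```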
58 DLMs Containing Only Balls. Even in this simplest case, the compression set z_i alone does not contain enough information to identify a DLM classifier (the hypothesis). [sent-371, score-0.652]
59 Recall that, in this case, zi contains ball centers and border points. [sent-373, score-0.721]
60 Moreover, each center can only be the center of one ball since the center is removed from the data when a ball is added to the DLM. [sent-375, score-0.444]
61 But a center can be the border of a previous ball in the DLM and a border can be the border of more than one ball (since the border of a ball is not removed from the data when that ball is added to the DLM). [sent-376, score-1.186]
62 Hence, σ needs to specify the border points in zi that are a border without being the center of another ball. [sent-377, score-0.861]
63 Since we expect that most of the compression sets will contain roughly the same number of centers and borders, we assign, to each example of zi , an equal a priori probability to be a center or a border. [sent-379, score-0.783]
64 Once σ1 is specified, the centers and borders of z_i are identified. [sent-383, score-0.445]
65 But to identify each ball we need to pair each center with a border point (which could possibly be the center of another ball). [sent-384, score-0.44]
66 For this operation, recall that the border and the center of each ball must have opposite class labels. [sent-385, score-0.381]
67 Let n(zi ) and p(zi ) be, respectively, the number of negative and the number of positive examples in compression set zi . [sent-387, score-0.688]
68 Since a border point can be used for more than one ball and a center can also be used as a border, we assign an equal probability of 1/n(z_i) to each negative example of z_i to be paired with x. [sent-389, score-0.794]
69 Similarly, we assign an equal probability of 1/p(zi ) to each positive example to be paired with a negative center of zi . [sent-390, score-0.518]
70 Let c p (zi ) and cn (zi ) be, respectively, the number of positive centers and negative centers in zi (this is known once σ1 is specified). [sent-391, score-0.494]
71 For an arbitrary compression set z_i, we thus assign the following a priori probability to each possible pairing information string σ2: P_2(σ2|σ1) = (1/n(z_i))^{c_p(z_i)} · (1/p(z_i))^{c_n(z_i)} ∀σ2, whenever n(z_i) ≠ 0 and p(z_i) ≠ 0. [sent-392, score-0.763]
72 However, since the border and center of each ball must have opposite labels, this condition is the same as |i| = 0. [sent-394, score-0.381]
73 For an arbitrary compression set z_i, we assign an equal a priori probability to each possible ball ordering by using P_3(σ3|σ2, σ1) = 1/r(z_i)! [sent-401, score-0.791]
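Putting the three parts of the message together, the prior probability of a complete message σ = (σ1, σ2, σ3) for a DLM of balls factorizes as P_1(σ1)·P_2(σ2|σ1)·P_3(σ3|σ2, σ1); a sketch follows (illustrative argument names, assuming n(z_i) and p(z_i) are both nonzero).

```python
from math import factorial

def dlm_balls_message_prior(i_size, n_zi, p_zi, c_p, c_n, r):
    """Prior of a complete message for a DLM of balls, following the three
    factors described above.  i_size = |i|, n_zi / p_zi = number of negative /
    positive compression set examples, c_p / c_n = number of positive /
    negative centers, r = number of balls."""
    p1 = 2.0 ** (-i_size)                            # center-or-border bit per example
    p2 = (1.0 / n_zi) ** c_p * (1.0 / p_zi) ** c_n   # pairing of each center with a border
    p3 = 1.0 / factorial(r)                          # ordering of the r balls
    return p1 * p2 * p3
```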
74 Indeed, in the worst case, each pair of (distinct) examples taken from the compression set z_i could be used for two holes: giving a total of |i|(|i| − 1) features. [sent-410, score-0.688]
75 The first part σ1 of the (complete) message string σ will specify the number r(zi ) of features present in compression set zi . [sent-411, score-0.93]
76 For this, we give an equal probability of 1/|i| to each example in z_i of being chosen (whenever |i| ≠ 0). [sent-422, score-0.396]
77 Constrained DLMs Containing Only Balls. A constrained DLM is a DLM that has the property of correctly classifying each example of its compression set z_i, with the exception of the compression set examples whose output is determined by the default value. [sent-427, score-0.989]
78 Every training example (x, y) ∈ Qi that is either a border point of a previous feature (ball or hole) in the DLM or a center of a previous hole in the DLM must have y = bi and thus be correctly classified by hi . [sent-430, score-0.641]
79 As before, we use a string σ1, with the same probability P_1(σ1) = 2^{−|i|} ∀σ1, to specify whether each example of the compression set z_i is a center or a border point. [sent-435, score-1.011]
80 ∀σ2, where r(z_i) denotes the number of balls for z_i (information given by σ1). [sent-440, score-0.519]
81 But now, since each feature was constrained to correctly classify the examples of zi that it covers (and which were not covered by the features above), we do not need any additional information to specify the border for each center. [sent-441, score-0.821]
82 Given a compression set z_i, let P and N denote, respectively, the set of positive and the set of negative examples in z_i. [sent-443, score-1.084]
83 If center c is positive, then its border b is given by argmin_{x∈N} d(c, x), and we remove from P (to find the border of the other balls) the center c and all other positive examples covered by that feature and used by the previous features. [sent-445, score-0.654]
84 If center c is negative, then its border b is given by argmin_{x∈P} d(c, x), and we remove from N the center c and all other negative examples covered by that feature and used by the previous features. [sent-446, score-0.477]
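A sketch of this reconstruction step for a constrained DLM of balls (illustrative Python; the bookkeeping of which covered examples are removed is a simplification of the two rules above).

```python
def reconstruct_constrained_balls(centers, P, N, dist, eps=1e-9):
    """Recover the border of each ball from the compression set alone.

    `centers` is the ordered list of (point, label) pairs recovered from the
    message sigma (labels +1/-1); P and N are the positive and negative
    compression set examples.  The border of each center is its nearest
    opposite-class example still available; covered same-class examples and
    the center itself are then removed before handling the next center.
    """
    P, N = set(P), set(N)
    balls = []
    for c, y in centers:
        opposite = N if y == +1 else P
        border = min(opposite, key=lambda x: dist(c, x))   # argmin over the opposite class
        radius = dist(c, border) - eps                     # same convention as any ball
        balls.append((c, radius, y))
        same = P if y == +1 else N
        covered = {x for x in same if dist(c, x) <= radius}
        same.difference_update(covered | {c})
    return balls
```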
85 As before, we use a string σ1 to specify the number r(z_i) of features present in compression set z_i. [sent-451, score-0.819]
86 Here, however, we only need to specify the center of each feature, since, as we will see below, no additional information is needed to find the border of each feature when the DLM is constrained to correctly classify each example in z_i. [sent-455, score-0.77]
87 Given a compression set zi , let P and N denote, respectively, the set of positive and the set of negative examples in zi . [sent-458, score-1.084]
88 If the feature is a ball with a positive center c, then its border is given by argmin_{x∈N} d(c, x), and we remove from P the center c and all other positive examples covered by that feature and used by the previous features. [sent-460, score-0.617]
89 If the feature is a hole with a positive center c, then its border is given by argmax_{x∈P−{c}} d(c, x), and we remove from N all the negative examples covered by that feature and used by the previous features. [sent-461, score-0.535]
90 If the feature is a ball with a negative center c, then its border is given by argmin_{x∈P} d(c, x), and we remove from N the center c and all other negative examples covered by that feature and used by the previous features. [sent-462, score-0.617]
91 If the feature is a hole with a negative center c, then its border is given by argmax_{x∈N−{c}} d(c, x), and we remove from P all the positive examples covered by that feature and used by the previous features. [sent-463, score-0.535]
92 The first part of the message will be a string σ1 that specifies the number r(zi ) of half-spaces used in the compression set zi . [sent-467, score-0.834]
93 As before, let p(zi ) and n(zi ) denote, respectively, the number of positive examples and the number of negative examples in the compression set zi . [sent-468, score-0.724]
94 Let P(zi ) and N(zi ) denote, respectively, the set of positive examples and the set of negative examples in the compression set zi . [sent-469, score-0.724]
95 Therefore, for the string σ1 that specifies the number r(zi ) of half-spaces used in the compression set zi , we could assign the same probability to each number between zero and |i|p(zi )n(zi ). [sent-474, score-0.763]
96 Finally, as for the other constrained DLMs, we do not need any additional message string to identify the threshold point xc ∈ zi for each w of the DLM. [sent-483, score-0.682]
97 Let P and N denote, respectively, the set of positive and the set of negative examples in zi . [sent-485, score-0.432]
98 As before, we use a string σ1 to specify the number r(z_i) of half-spaces present in compression set z_i. [sent-493, score-0.752]
99 For this, we give an equal probability of 1/|i| to each example in z_i of being chosen (when |i| ≠ 0). [sent-498, score-0.396]
100 We have proposed a general risk bound that depends on the number of examples that are used in the final classifier and the size of the information message needed to identify the final classifier from the compression set. [sent-847, score-0.506]
wordName wordTfidf (topN-words)
[('dlm', 0.565), ('zi', 0.396), ('dlms', 0.282), ('compression', 0.256), ('border', 0.177), ('scm', 0.134), ('hi', 0.134), ('balls', 0.123), ('marchand', 0.113), ('message', 0.111), ('err', 0.11), ('hc', 0.106), ('ball', 0.099), ('hole', 0.099), ('hsp', 0.099), ('archand', 0.092), ('holes', 0.092), ('pm', 0.088), ('messages', 0.085), ('eatures', 0.085), ('ecision', 0.085), ('ists', 0.085), ('okolova', 0.085), ('center', 0.082), ('br', 0.079), ('risk', 0.075), ('string', 0.071), ('ependent', 0.071), ('features', 0.067), ('ni', 0.06), ('bi', 0.059), ('def', 0.058), ('xc', 0.057), ('credit', 0.053), ('pi', 0.052), ('builddlm', 0.049), ('haberman', 0.049), ('msfrombound', 0.049), ('msfromcv', 0.049), ('centers', 0.049), ('ln', 0.049), ('lists', 0.047), ('xa', 0.046), ('xb', 0.044), ('er', 0.044), ('zm', 0.042), ('breastw', 0.042), ('bupa', 0.042), ('prunedlm', 0.042), ('pzm', 0.042), ('pima', 0.042), ('mario', 0.041), ('decision', 0.041), ('feature', 0.041), ('pn', 0.04), ('reconstruction', 0.04), ('assign', 0.04), ('qi', 0.04), ('examples', 0.036), ('bh', 0.036), ('disjunction', 0.036), ('votes', 0.036), ('pzi', 0.035), ('sokolova', 0.035), ('list', 0.033), ('covering', 0.033), ('glass', 0.032), ('littlestone', 0.031), ('covered', 0.03), ('specify', 0.029), ('remove', 0.029), ('bound', 0.028), ('argminx', 0.028), ('dlmb', 0.028), ('scms', 0.028), ('training', 0.027), ('rivest', 0.026), ('warmuth', 0.026), ('hk', 0.025), ('sequentially', 0.025), ('threshold', 0.024), ('classi', 0.024), ('penalty', 0.024), ('appended', 0.024), ('compressing', 0.024), ('append', 0.024), ('specialized', 0.024), ('opposite', 0.023), ('bounds', 0.023), ('radius', 0.023), ('constrained', 0.023), ('correctly', 0.022), ('heart', 0.022), ('wi', 0.022), ('dhagat', 0.021), ('dlmbh', 0.021), ('dlmhsp', 0.021), ('qk', 0.021), ('rzi', 0.021), ('sgn', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
Author: Mario Marchand, Marina Sokolova
Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory
2 0.11395214 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
3 0.10023837 67 jmlr-2005-Stability of Randomized Learning Algorithms
Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil
Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.
4 0.090010971 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
5 0.086542323 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
6 0.075165272 71 jmlr-2005-Variational Message Passing
8 0.041830737 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
9 0.04007059 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
10 0.039376352 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
11 0.038579445 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
12 0.037113097 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
13 0.034584198 39 jmlr-2005-Information Bottleneck for Gaussian Variables
14 0.031763289 11 jmlr-2005-Algorithmic Stability and Meta-Learning
15 0.030977966 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
16 0.030301763 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
17 0.029235182 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
18 0.027666701 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
19 0.027611533 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
20 0.025742203 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
topicId topicWeight
[(0, 0.197), (1, -0.152), (2, -0.033), (3, 0.169), (4, -0.084), (5, -0.103), (6, 0.069), (7, 0.039), (8, 0.037), (9, -0.289), (10, 0.091), (11, 0.097), (12, 0.041), (13, 0.048), (14, -0.239), (15, 0.287), (16, -0.244), (17, -0.114), (18, 0.174), (19, 0.035), (20, 0.114), (21, -0.089), (22, -0.152), (23, -0.028), (24, -0.006), (25, 0.126), (26, 0.012), (27, -0.022), (28, 0.006), (29, -0.053), (30, 0.097), (31, -0.083), (32, -0.025), (33, -0.035), (34, 0.004), (35, 0.038), (36, -0.121), (37, -0.029), (38, 0.043), (39, 0.018), (40, 0.092), (41, 0.014), (42, 0.032), (43, -0.253), (44, -0.11), (45, -0.083), (46, 0.071), (47, -0.107), (48, -0.163), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.97038639 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
Author: Mario Marchand, Marina Sokolova
Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory
2 0.4465045 67 jmlr-2005-Stability of Randomized Learning Algorithms
Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil
Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.
3 0.41035858 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
Author: Dmitry Rusakov, Dan Geiger
Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)
4 0.32895565 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
5 0.29035372 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
6 0.25974467 36 jmlr-2005-Gaussian Processes for Ordinal Regression
7 0.23779762 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
9 0.20553641 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
10 0.16516969 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
11 0.14330968 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
12 0.14258724 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
13 0.13730007 3 jmlr-2005-A Classification Framework for Anomaly Detection
14 0.13556589 39 jmlr-2005-Information Bottleneck for Gaussian Variables
15 0.12990831 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
16 0.12847753 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
17 0.12106089 11 jmlr-2005-Algorithmic Stability and Meta-Learning
18 0.11779577 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
19 0.11022265 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
20 0.10418882 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
topicId topicWeight
[(17, 0.017), (19, 0.015), (36, 0.023), (37, 0.014), (43, 0.016), (47, 0.012), (52, 0.697), (59, 0.01), (70, 0.014), (88, 0.055), (92, 0.012), (94, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.9797157 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
Author: Mario Marchand, Marina Sokolova
Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory
2 0.97169316 12 jmlr-2005-An MDP-Based Recommender System
Author: Guy Shani, David Heckerman, Ronen I. Brafman
Abstract: Typical recommender systems adopt a static view of the recommendation process and treat it as a prediction problem. We argue that it is more appropriate to view the problem of generating recommendations as a sequential optimization problem and, consequently, that Markov decision processes (MDPs) provide a more appropriate model for recommender systems. MDPs introduce two benefits: they take into account the long-term effects of each recommendation and the expected value of each recommendation. To succeed in practice, an MDP-based recommender system must employ a strong initial model, must be solvable quickly, and should not consume too much memory. In this paper, we describe our particular MDP model, its initialization using a predictive model, the solution and update algorithm, and its actual performance on a commercial site. We also describe the particular predictive model we used which outperforms previous models. Our system is one of a small number of commercially deployed recommender systems. As far as we know, it is the first to report experimental analysis conducted on a real commercial site. These results validate the commercial value of recommender systems, and in particular, of our MDP-based approach. Keywords: recommender systems, Markov decision processes, learning, commercial applications
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
4 0.90650016 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
Author: Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins
Abstract: This paper presents an active learning method which reduces the labeling effort of domain experts in multi-class classification problems. Active learning is applied in conjunction with support vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images. Most previous work on active learning with support vector machines only deals with two class problems. In this paper, we propose an active learning approach “breaking ties” for multiclass support vector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly less labeled images to reach a given accuracy than the approach of labeling the least certain test example and random sampling. It can also be applied in batch mode resulting in an accuracy comparable to labeling one image at a time and retraining. Keywords: active learning, support vector machine, plankton recognition, probabilistic output, multi-class support vector machine
5 0.76662332 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Abstract: O(m3 ) Standard SVM training has time and O(m2 ) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and realworld data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability
6 0.76295775 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
7 0.72313666 8 jmlr-2005-Active Coevolutionary Learning of Deterministic Finite Automata
8 0.69333053 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
9 0.6830579 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.67087811 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
11 0.64523679 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
12 0.64455289 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
13 0.64069748 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
14 0.62149292 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
16 0.60043234 39 jmlr-2005-Information Bottleneck for Gaussian Variables
17 0.59948903 25 jmlr-2005-Denoising Source Separation
18 0.59760809 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
19 0.59686172 54 jmlr-2005-Managing Diversity in Regression Ensembles
20 0.59490305 67 jmlr-2005-Stability of Randomized Learning Algorithms