jmlr jmlr2010 jmlr2010-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
Reference: text
sentIndex sentText sentNum sentScore
1 The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. [sent-8, score-0.866]
2 Data postprocessing includes the integration, filtering, evaluation, and explanation of acquired knowledge. [sent-16, score-0.539]
3 To introduce the reader to some of the concepts used in this paper, we start with a simple illustrative example of an explanation for a model’s prediction (see Fig. [sent-18, score-0.685]
4 In our example, the contributions of the three feature values can be interpreted as follows. [sent-23, score-0.266]
5 This is a trivial example, but providing the end-user with such an explanation on top of a prediction makes the prediction easier to understand and to trust. [sent-28, score-0.585]
6 The explanation in Fig. 1 is specific to Naive Bayes, but can we design an explanation method which works for any type of classifier? [sent-34, score-0.497]
7 Figure 1: An instance from the well-known Titanic data set with the Naive Bayes model’s prediction and an explanation in the form of contributions of individual feature values. [sent-36, score-0.933]
8 1 Related Work Before addressing general explanation methods, we list a few model-specific methods to emphasize two things. [sent-42, score-0.497]
9 And second, providing an explanation in the form of contributions of feature values is a common approach. [sent-44, score-0.763]
10 Note that many more model-specific explanation methods exist and this is far from being a complete reference. [sent-45, score-0.497]
11 Nomograms are a way of visualizing contributions of feature values and were applied to Naive Bayes (Možina et al. [sent-49, score-0.312]
12 The explanation and interpretation of artificial neural networks, which are arguably one of the least transparent models, has also received a lot of attention, especially in the form of rule extraction (Towell and Shavlik, 1993; Andrews et al. [sent-57, score-0.528]
13 So, why do we even need a general explanation method? [sent-59, score-0.497]
14 It is not difficult to think of a reasonable scenario where a general explanation method would be useful. [sent-60, score-0.497]
15 For example, imagine a user using a classifier and a corresponding explanation method. [sent-61, score-0.497]
16 The user then has to invest time and effort into adapting to the new explanation method. [sent-63, score-0.497]
17 This can be avoided by using a general explanation method. [sent-64, score-0.497]
18 Overall, a good general explanation method reduces the dependence between the user-end and the underlying machine learning methods, which makes work with machine learning models more user-friendly. [sent-65, score-0.497]
19 An effective and efficient general explanation method would also be a useful tool for comparing how a model predicts different instances and how different models predict the same instance. [sent-67, score-0.599]
20 As far as the authors are aware, there exist two other general explanation methods for explaining a model’s prediction: the work by Robnik-Šikonja and Kononenko (2008) and the work by Lemaire et al. [sent-68, score-0.603]
21 While there are several differences between the two methods, both explain a prediction with contributions of feature values and both use the same basic approach. [sent-70, score-0.354]
22 A feature value’s contribution is defined as the difference between the model’s initial prediction and its average prediction across perturbations of the corresponding feature. [sent-71, score-0.373]
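To make the basic approach of these prior methods concrete, below is a minimal sketch of a one-feature-at-a-time contribution computed by perturbing a single feature. The `predict_proba` interface and the uniform averaging over each feature's values are our assumptions for illustration, not the exact procedure of the cited methods.

```python
import numpy as np

def single_feature_contributions(predict_proba, x, feature_values, target_class):
    """Contribution of each feature value as prior methods define it:
    the model's prediction for x minus its average prediction when that
    single feature is perturbed over all of its possible values."""
    p_x = predict_proba(x)[target_class]
    contributions = []
    for i, values in enumerate(feature_values):
        perturbed = []
        for v in values:              # replace only feature i, keep the rest of x fixed
            x_pert = list(x)
            x_pert[i] = v
            perturbed.append(predict_proba(x_pert)[target_class])
        contributions.append(p_x - np.mean(perturbed))
    return contributions
```

Because each feature is perturbed in isolation, any feature whose effect only shows up jointly with other features (as in the two-feature example the text refers to) receives a misleading contribution, which is exactly the shortcoming the power-set view developed later addresses.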
23 This is, of course, an incorrect explanation of how these two values contribute to the person’s decision. [sent-85, score-0.588]
24 To summarize, we have existing general explanation methods, which sacrifice a part of their effectiveness for efficiency, and we know that generating effective contributions requires observing the power set of all features, which is far from efficient. [sent-89, score-0.639]
25 First, we provide a rigorous theoretical analysis of our explanation method and link it with known concepts from game theory, thus formalizing some of its desirable properties. [sent-91, score-0.801]
26 Section 2 introduces some basic concepts from classification and coalitional game theory. [sent-94, score-0.521]
27 Preliminaries First, we introduce some basic concepts from classification and coalitional game theory, which are used in the formal description of our explanation method. [sent-99, score-1.018]
28 1 Classification In machine learning classification is a form of supervised learning where the objective is to predict the class label for unlabelled input instances, each described by feature values from a feature space. [sent-101, score-0.277]
29 × An , where each feature Ai is a finite set of feature values. [sent-109, score-0.248]
30 2 Coalitional Game Theory The following concepts from coalitional game theory are used in the formalization of our method, starting with the definition of a coalitional game. [sent-123, score-0.738]
31 Definition 3 A coalitional form game is a tuple ⟨N, v⟩, where N = {1, 2, . . . , n} is a finite set of n players and v : 2^N → R is a characteristic function with v(∅) = 0. [sent-124, score-0.421]
32 We start with a description of the intuition behind the method and then link it with coalitional game theory. [sent-156, score-0.421]
33 Our goal is to explain how the given feature values contribute to the prediction difference between the classifier’s prediction for this instance and the expected prediction if no feature values are given (that is, if all feature values are “ignored”). [sent-169, score-0.726]
34 , 2009) we used a different definition: ∆(S) = fc*(S) − fc*(∅), where fc*(W) is obtained by retraining the classifier only on the features in W and re-classifying the instance (similar to the wrapper approach in feature selection; Kohavi and John, 1997). [sent-176, score-0.322]
35 The expression ∆(S) is the difference between the expected prediction when we know only those values of x whose features are in S, and the expected prediction when no feature values are known. [sent-179, score-0.358]
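In expectation notation, the definition just described can be written as follows (an illustrative formalization; the paper averages over the feature space to compute these expectations, and the exact averaging scheme is given in the full text):

```latex
\Delta(S) \;=\; \mathbb{E}\bigl[f_c(X) \mid X_S = x_S\bigr] \;-\; \mathbb{E}\bigl[f_c(X)\bigr],
\qquad S \subseteq N,
```

where f_c is the classifier’s output for the class of interest, x is the instance being explained, and conditioning on X_S = x_S fixes the features in S to their values in x.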
36 The main shortcoming of existing general explanation methods is that they do not take into account all the potential dependencies and interactions between feature values. [sent-183, score-0.689]
37 To avoid this issue, we implicitly define interactions by defining that each prediction difference ∆(S) is composed of 2^|S| contributions of interactions (that is, each subset of feature values might contribute something): ∆(S) = ∑_{W⊆S} I(W), S ⊆ N. (3) [sent-184, score-0.535]
38 Now we distribute the interaction contributions among the n feature values. [sent-187, score-0.311]
39 For each interaction, the involved feature values can be treated as equally responsible for the interaction, as the interaction would otherwise not exist. [sent-188, score-0.259]
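Written out, the two steps just described amount to recovering the interaction terms from the ∆ values and then splitting each interaction equally among the feature values it involves (a sketch consistent with the surrounding text; the notation is ours):

```latex
I(W) \;=\; \Delta(W) \;-\; \sum_{Q \subset W} I(Q), \quad I(\emptyset) = 0,
\qquad\qquad
\varphi_i \;=\; \sum_{W \subseteq N,\; i \in W} \frac{I(W)}{|W|}, \quad i = 1, \dots, n .
```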
40 , n}, ⟨N, ∆⟩ is a coalitional form game and ϕ(∆) = (ϕ1, ϕ2, . . . , ϕn) is its Shapley value. [sent-201, score-0.421]
41 Proof Following the definition of ∆ we get that ∆(∅) = 0, so the explanation of the classifier’s prediction can be treated as a coalitional form game ⟨N, ∆⟩. [sent-205, score-1.006]
42 Now we provide an elementary proof that the contributions of individual feature values correspond to the Shapley value for the game ⟨N, ∆⟩. [sent-206, score-0.507]
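For completeness, the Shapley value referred to in the proof assigns the i-th feature value the contribution (standard game-theoretic definition):

```latex
\varphi_i(\Delta) \;=\; \sum_{S \subseteq N \setminus \{i\}}
\frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}
\Bigl(\Delta\bigl(S \cup \{i\}\bigr) - \Delta(S)\Bigr),
\qquad i = 1, \dots, n .
```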
43 So, the explanation method can be interpreted as follows. [sent-239, score-0.497]
44 We divide this change amongst the feature values in a way that is fair to their contributions across all possible sub-coalitions. [sent-241, score-0.295]
45 Axioms 1 to 3 and their interpretation in the context of our explanation method are of particular interest. [sent-243, score-0.497]
46 (2) - the sum of all n contributions in an instance’s explanation is equal to the difference in prediction ∆(N). [sent-245, score-0.727]
47 According to the 2nd axiom, if two feature values have an identical influence on the prediction, they are assigned contributions of equal size. [sent-247, score-0.288]
48 The 3rd axiom says that if a feature has no influence on the prediction it is assigned a contribution of 0. [sent-248, score-0.426]
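Stated for the game ⟨N, ∆⟩, the three properties just described are the standard efficiency, symmetry, and dummy axioms of the Shapley value:

```latex
\text{(efficiency)}\quad \sum_{i=1}^{n} \varphi_i = \Delta(N), \qquad
\text{(symmetry)}\quad \Delta(S \cup \{i\}) = \Delta(S \cup \{j\}) \;\;\forall S \subseteq N \setminus \{i, j\}
\;\Rightarrow\; \varphi_i = \varphi_j, \qquad
\text{(dummy)}\quad \Delta(S \cup \{i\}) = \Delta(S) \;\;\forall S \subseteq N \setminus \{i\}
\;\Rightarrow\; \varphi_i = 0 .
```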
49 When viewed together, these properties ensure that any effect the features might have on the classifier’s output will be reflected in the generated contributions, which effectively deals with the issues of previous general explanation methods. [sent-249, score-0.555]
50 1 An Illustrative Example In the introduction we used a simple boolean logic example to illustrate the shortcomings of existing general explanation methods. [sent-252, score-0.497]
51 With the same example we illustrate how our explanation method works. [sent-255, score-0.497]
52 Intuitively, ∆(S) is the difference between the classifier’s expected prediction if only values of features in S are known and the expected prediction if no values are known. [sent-260, score-0.234]
53 When observed together, the two features contribute less than their individual contributions would suggest, which results in a negative interaction: I({1, 2}) = ∆({1, 2}) − (I({1}) + I({2})) = −1/4. [sent-266, score-0.282]
54 The generated contributions reveal that both features contribute the same amount towards the prediction being 1 and the contributions sum up to the initial difference between the prediction for this instance and the prior belief. [sent-268, score-0.658]
55 To avoid this and still achieve an efficient explanation method, we extend the sampling algorithm in the following way. [sent-289, score-0.497]
56 The optimal (minimal) number of samples we need for the entire explanation is: mmin(1 − α, e) = n · Z^2_{1−α} · σ^2 / e^2, where σ^2 = (1/n) ∑_{i=1}^{n} σ_i^2. [sent-328, score-0.531]
57 When explaining an instance, the sampling process has to be repeated for each of the n feature values. [sent-331, score-0.23]
58 Therefore, for a given error and confidence level, the time complexity of the explanation is O(n · T(A)), where function T(A) describes the instance classification time of the model on A. [sent-332, score-0.571]
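The following is a sketch of a standard Monte Carlo estimator of this kind, which approximates each feature's contribution by averaging the marginal effect of revealing that feature over random permutations of the features and reference instances drawn from the data. It is our illustrative reconstruction under those assumptions, not the authors' exact algorithm, which additionally uses quasi-random sampling and distributes the mmin samples across features according to the estimated variances σ_i^2.

```python
import random

def approx_contributions(predict, x, data, samples_per_feature, target_class=1):
    """Monte Carlo approximation of the contributions of the feature values of x.

    predict(z) -> vector of class probabilities; data -> list of instances used
    to fill in the values of the 'unknown' features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        total = 0.0
        for _ in range(samples_per_feature):
            order = list(range(n))
            random.shuffle(order)                 # random permutation gives a random preceding subset S
            S = set(order[:order.index(i)])       # features revealed before feature i
            y = random.choice(data)               # reference instance for the unrevealed features
            with_i    = [x[j] if (j in S or j == i) else y[j] for j in range(n)]
            without_i = [x[j] if j in S else y[j] for j in range(n)]
            total += predict(with_i)[target_class] - predict(without_i)[target_class]
        phi[i] = total / samples_per_feature
    return phi
```

Each sample costs two classifications, so for a fixed error and confidence level the running time grows as O(n · T(A)), matching the complexity stated above.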
59 We use a variety of different classifiers both to illustrate that it is indeed a general explanation method and to investigate how the method behaves with different types of classification models. [sent-336, score-0.497]
60 All experiments were done on an off-the-shelf laptop computer (2GHz dual core CPU, 2GB RAM), the explanation method is a straightforward Java implementation of the equations presented in this paper, and the classifiers were imported from the Weka machine learning software (http://www. [sent-338, score-0.497]
61 The first 8 data sets are synthetic data sets, designed specifically for testing explanation methods (see Robnik-Šikonja and Kononenko, 2008; Štrumbelj et al. [sent-415, score-0.497]
62 Those interested in a more detailed description of this data set and how our previous explanation method is successfully applied in practice can refer to our previous work (Štrumbelj et al. [sent-420, score-0.497]
63 For each data set we use half of the instances for training and half for testing the explanation method. [sent-427, score-0.538]
64 For each data set/classifier pair, we train the classifier on the training set and use both the explanation method and its approximation on each test instance. [sent-428, score-0.497]
65 When over 10000 samples are drawn, all the contributions across all features and all test instances are very close to the actual contributions. [sent-441, score-0.304]
66 And although the explanations would not tell us much about the concepts behind the data set (we conclude from the model’s performance that its knowledge is useless), they would reveal what the model has learned, which is the purpose of an explanation method. [sent-479, score-0.802]
67 Explaining the prediction of the ANN model for an instance from Monks2 is the most complex explanation (that is, it requires the most samples - see Fig. [sent-487, score-0.693]
68 3, which shows the time needed to provide an explanation with the desired error 99%, 0. [sent-491, score-0.497]
69 For smaller data sets (smaller in the number of features) the explanation is generated almost instantly. [sent-493, score-0.497]
70 For larger data sets, generating an explanation takes less than a minute, with the exception of bstNB on a few data sets and the ANN model on the Mushroom data set. [sent-494, score-0.526]
71 The Arrhythmia data set, with its 279 features, is an example of a data set where the explanation cannot be generated in a sensible amount of time. [sent-496, score-0.497]
72 For example, it takes more than an hour to generate an explanation for a prediction of the bstNB model. [sent-497, score-0.585]
73 The explanation method is therefore less appropriate for explaining models which are built on several hundred features or more. [sent-498, score-0.661]
74 Arguably, providing a comprehensible explanation involving a hundred or more features is a problem in its own right, and even inherently transparent models become less comprehensible with such a large number of features. [sent-499, score-0.663]
75 However, the focus of this paper is on providing an effective and general explanation method, which is computationally feasible on the majority of data sets we encounter. [sent-500, score-0.497]
76 Therefore, when considering the number of features the explanation method can still handle, we need not count irrelevant features, which are not included in the final model. [sent-502, score-0.586]
77 An additional advantage of the generated contributions is that they sum up to the difference between the model’s output prediction and the model’s expected output, given no information about the values of the features. [sent-514, score-0.23]
78 They were generated for various classification models and data sets, to show the advantage of having a general explanation method. [sent-517, score-0.497]
79 However, both NB and bstDT learn the importance of the fifth feature and explanations reveal that value 2 for the fifth feature speaks against class 1. [sent-523, score-0.424]
80 The explanations reveal that DT predicts this animal is a bird, because it has feathers. [sent-528, score-0.239]
81 These first two pairs of examples illustrate how the explanations reflect what the model has learnt and how we can compare explanations from different classifiers. [sent-532, score-0.281]
82 Some examples of underfitting and overfitting are actually desirable as they allow us to inspect if the explanation method reveals what the classifier has (or has not) learned. [sent-534, score-0.497]
83 6(a) shows the explanation of the logREG model’s prediction for an instance from the Xor data set. [sent-536, score-0.63]
84 The logREG model cannot learn the xor concept of this data set (for this data set the class label is the odd parity bit for the first three feature values) and the explanation is appropriate. [sent-541, score-0.65]
85 7(a) is an explanation for ANN’s prediction for the introductory instance from the Titanic data set (see Fig. [sent-545, score-0.63]
86 7(b)) is very similar to the inherent explanation (taking into account that a logarithm is applied in the inherent explanation). [sent-548, score-0.553]
87 8 is an explanation for RF’s prediction regarding whether breast cancer will (class = 1) or will not (class = 2) recur for this patient. [sent-552, score-0.616]
88 According to RF it is more likely that cancer will not recur and the explanation indicates that this is mostly due to a low number of positive lymph nodes (nLymph). [sent-553, score-0.559]
89 Conclusion In the introductory section, we asked if an efficient and effective general explanation method for classifiers’ predictions can be made. [sent-558, score-0.539]
90 Using only the input and output of a classifier we decompose the changes in its prediction into contributions of individual feature values. [sent-560, score-0.391]
91 These contributions correspond to known concepts from coalitional game theory. [sent-561, score-0.663]
92 Unlike existing methods, the resulting theoretical properties of the proposed method guarantee that no matter which concepts the classifier learns, the generated contributions will reveal the influence of feature values. [sent-562, score-0.416]
93 As we show [...] Figure 6 ((a) logREG model; (b) SVM model): The left hand side explanation indicates that the feature values have no significant influence on the logREG model on the Xor data set. [sent-564, score-0.708]
94 The right hand side explanation shows how SVM overfits the Random data set. [sent-565, score-0.497]
95 The left hand side explanation is for the ANN model. [sent-567, score-0.497]
96 The right hand side explanation is for the NB model. [sent-568, score-0.497]
97 Figure 8: An explanation of the RF model’s prediction for a patient from the Oncology data set. [sent-569, score-0.585]
98 It would also be interesting to explore the possibility of applying the same principles to the explanation of regression models. [sent-574, score-0.497]
99 Polynomial calculation of the Shapley value based on sampling. [sent-593, score-0.253]
100 Explaining instance classifications with interactions of subsets of feature values. [sent-663, score-0.237]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
2 0.29846248 48 jmlr-2010-How to Explain Individual Classification Decisions
Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller
Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity
3 0.092948891 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
4 0.080508441 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. On the other hand with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
5 0.0549003 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
Author: Franz Pernkopf, Jeff A. Bilmes
Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O(N^(k+1)) score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature
6 0.054179519 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
7 0.052261017 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
8 0.045748249 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
9 0.045135744 22 jmlr-2010-Classification Using Geometric Level Sets
10 0.04392533 63 jmlr-2010-Learning Instance-Specific Predictive Models
11 0.042186461 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
12 0.039534505 40 jmlr-2010-Fast and Scalable Local Kernel Machines
15 0.034556858 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?
16 0.032934658 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project
17 0.032231066 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
18 0.031082543 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
19 0.030780617 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
20 0.030604335 37 jmlr-2010-Evolving Static Representations for Task Transfer
simIndex simValue paperId paperTitle
same-paper 1 0.93877023 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
2 0.89552051 48 jmlr-2010-How to Explain Individual Classification Decisions
Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller
Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity
3 0.33170691 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
4 0.24637002 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
Author: Franz Pernkopf, Jeff A. Bilmes
Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O N k+1 score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature
5 0.24129307 63 jmlr-2010-Learning Instance-Specific Predictive Models
Author: Shyam Visweswaran, Gregory F. Cooper
Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including nave Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging
6 0.23651934 22 jmlr-2010-Classification Using Geometric Level Sets
7 0.21954158 40 jmlr-2010-Fast and Scalable Local Kernel Machines
8 0.21842802 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
9 0.20897084 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
10 0.19832005 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
11 0.19195838 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
13 0.18845725 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
15 0.18469489 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
16 0.18043327 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
17 0.17819057 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
18 0.16804537 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
19 0.16735049 94 jmlr-2010-Quadratic Programming Feature Selection
20 0.16625433 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
simIndex simValue paperId paperTitle
1 0.80534667 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
Author: Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studený
Abstract: The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method. Keywords: conditional independence inference, linear programming approach
same-paper 2 0.71931678 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
Author: Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol
Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations. Keywords: deep learning, unsupervised feature learning, deep belief networks, autoencoders, denoising
4 0.52238089 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
Author: Choon Hui Teo, S.V.N. Vishwanathan, Alex J. Smola, Quoc V. Le
Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization
5 0.52130318 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
6 0.51245725 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.50909692 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
8 0.5074954 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
9 0.50684071 48 jmlr-2010-How to Explain Individual Classification Decisions
10 0.5053103 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
11 0.5049184 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
12 0.50426692 102 jmlr-2010-Semi-Supervised Novelty Detection
13 0.50404119 63 jmlr-2010-Learning Instance-Specific Predictive Models
14 0.50245702 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
15 0.50203258 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
16 0.5016349 22 jmlr-2010-Classification Using Geometric Level Sets
17 0.49984345 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
19 0.49685609 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
20 0.49684879 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming