jmlr jmlr2010 jmlr2010-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
Reference: text
sentIndex sentText sentNum sentScore
1 The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. [sent-8, score-0.866]
2 Data postprocessing includes the integration, filtering, evaluation, and explanation of acquired knowledge. [sent-16, score-0.539]
3 To introduce the reader to some of the concepts used in this paper, we start with a simple illustrative example of an explanation for a model’s prediction (see Fig. [sent-18, score-0.685]
4 In our example, the contributions of the three feature values can be interpreted as follows. [sent-23, score-0.266]
5 This is a trivial example, but providing the end-user with such an explanation on top of a prediction makes the prediction easier to understand and to trust. [sent-28, score-0.585]
6 The explanation in Fig. 1 is specific to Naive Bayes, but can we design an explanation method which works for any type of classifier? [sent-34, score-0.497]
7 Figure 1: An instance from the well-known Titanic data set with the Naive Bayes model’s prediction and an explanation in the form of contributions of individual feature values. [sent-36, score-0.933]
8 1 Related Work Before addressing general explanation methods, we list a few model-specific methods to emphasize two things. [sent-42, score-0.497]
9 And second, providing an explanation in the form of contributions of feature values is a common approach. [sent-44, score-0.763]
10 Note that many more model-specific explanation methods exist and this is far from being a complete reference. [sent-45, score-0.497]
11 Nomograms are a way of visualizing contributions of feature values and were applied to Naive Bayes (Možina et al. [sent-49, score-0.312]
12 The explanation and interpretation of artificial neural networks, which are arguably one of the least transparent models, has also received a lot of attention, especially in the form of rule extraction (Towell and Shavlik, 1993; Andrews et al. [sent-57, score-0.528]
13 So, why do we even need a general explanation method? [sent-59, score-0.497]
14 It is not difficult to think of a reasonable scenario where a general explanation method would be useful. [sent-60, score-0.497]
15 For example, imagine a user using a classifier and a corresponding explanation method. [sent-61, score-0.497]
16 The user then has to invest time and effort into adapting to the new explanation method. [sent-63, score-0.497]
17 This can be avoided by using a general explanation method. [sent-64, score-0.497]
18 Overall, a good general explanation method reduces the dependence between the user-end and the underlying machine learning methods, which makes work with machine learning models more user-friendly. [sent-65, score-0.497]
19 An effective and efficient general explanation method would also be a useful tool for comparing how a model predicts different instances and how different models predict the same instance. [sent-67, score-0.599]
20 As far as the authors are aware, there exist two other general explanation methods for explaining a model’s prediction: the work by Robnik-Šikonja and Kononenko (2008) and the work by Lemaire et al. [sent-68, score-0.603]
21 While there are several differences between the two methods, both explain a prediction with contributions of feature values and both use the same basic approach. [sent-70, score-0.354]
22 A feature value’s contribution is defined as the difference between the model’s initial prediction and its average prediction across perturbations of the corresponding feature. [sent-71, score-0.373]
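To make the basic approach of these prior methods concrete, below is a minimal sketch of a one-feature-at-a-time contribution computed by perturbing a single feature. The `predict_proba` interface and the uniform averaging over each feature's values are our assumptions for illustration, not the exact procedure of the cited methods.

```python
import numpy as np

def single_feature_contributions(predict_proba, x, feature_values, target_class):
    """Contribution of each feature value as prior methods define it:
    the model's prediction for x minus its average prediction when that
    single feature is perturbed over all of its possible values."""
    p_x = predict_proba(x)[target_class]
    contributions = []
    for i, values in enumerate(feature_values):
        perturbed = []
        for v in values:              # replace only feature i, keep the rest of x fixed
            x_pert = list(x)
            x_pert[i] = v
            perturbed.append(predict_proba(x_pert)[target_class])
        contributions.append(p_x - np.mean(perturbed))
    return contributions
```

Because each feature is perturbed in isolation, any feature whose effect only shows up jointly with other features (as in the two-feature example the text refers to) receives a misleading contribution, which is exactly the shortcoming the power-set view developed later addresses.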
23 This is, of course, an incorrect explanation of how these two values contribute to the person’s decision. [sent-85, score-0.588]
24 To summarize, we have existing general explanation methods, which sacrifice a part of their effectiveness for efficiency, and we know that generating effective contributions requires observing the power set of all features, which is far from efficient. [sent-89, score-0.639]
25 First, we provide a rigorous theoretical analysis of our explanation method and link it with known concepts from game theory, thus formalizing some of its desirable properties. [sent-91, score-0.801]
26 Section 2 introduces some basic concepts from classification and coalitional game theory. [sent-94, score-0.521]
27 Preliminaries First, we introduce some basic concepts from classification and coalitional game theory, which are used in the formal description of our explanation method. [sent-99, score-1.018]
28 1 Classification In machine learning classification is a form of supervised learning where the objective is to predict the class label for unlabelled input instances, each described by feature values from a feature space. [sent-101, score-0.277]
29 × An , where each feature Ai is a finite set of feature values. [sent-109, score-0.248]
30 2 Coalitional Game Theory The following concepts from coalitional game theory are used in the formalization of our method, starting with the definition of a coalitional game. [sent-123, score-0.738]
31 Definition 3 A coalitional form game is a tuple ⟨N, v⟩, where N = {1, 2, . . . , n} is a finite set of n players and v : 2^N → R is a characteristic function with v(∅) = 0. [sent-124, score-0.421]
32 We start with a description of the intuition behind the method and then link it with coalitional game theory. [sent-156, score-0.421]
33 Our goal is to explain how the given feature values contribute to the prediction difference between the classifier’s prediction for this instance and the expected prediction if no feature values are given (that is, if all feature values are “ignored”). [sent-169, score-0.726]
34 , 2009) we used a different definition: ∆(S) = fc*(S) − fc*(∅), where fc*(W) is obtained by retraining the classifier only on the features in W and re-classifying the instance (similar to the wrapper approach in feature selection; Kohavi and John, 1997). [sent-176, score-0.322]
35 The expression ∆(S) is the difference between the expected prediction when we know only those values of x whose features are in S, and the expected prediction when no feature values are known. [sent-179, score-0.358]
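In expectation notation, the definition just described can be written as follows (an illustrative formalization; the paper averages over the feature space to compute these expectations, and the exact averaging scheme is given in the full text):

```latex
\Delta(S) \;=\; \mathbb{E}\bigl[f_c(X) \mid X_S = x_S\bigr] \;-\; \mathbb{E}\bigl[f_c(X)\bigr],
\qquad S \subseteq N,
```

where f_c is the classifier’s output for the class of interest, x is the instance being explained, and conditioning on X_S = x_S fixes the features in S to their values in x.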
36 The main shortcoming of existing general explanation methods is that they do not take into account all the potential dependencies and interactions between feature values. [sent-183, score-0.689]
37 To avoid this issue, we implicitly define interactions by defining that each prediction difference ∆(S) is composed of 2^|S| contributions of interactions (that is, each subset of feature values might contribute something): ∆(S) = ∑_{W⊆S} I(W), S ⊆ N. (3) [sent-184, score-0.535]
38 Now we distribute the interaction contributions among the n feature values. [sent-187, score-0.311]
39 For each interaction, the involved feature values can be treated as equally responsible for the interaction, as the interaction would otherwise not exist. [sent-188, score-0.259]
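Written out, the two steps just described amount to recovering the interaction terms from the ∆ values and then splitting each interaction equally among the feature values it involves (a sketch consistent with the surrounding text; the notation is ours):

```latex
I(W) \;=\; \Delta(W) \;-\; \sum_{Q \subset W} I(Q), \quad I(\emptyset) = 0,
\qquad\qquad
\varphi_i \;=\; \sum_{W \subseteq N,\; i \in W} \frac{I(W)}{|W|}, \quad i = 1, \dots, n .
```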
40 , n}, ⟨N, ∆⟩ is a coalitional form game and ϕ(∆) = (ϕ1, ϕ2, . . . , ϕn) is its Shapley value. [sent-201, score-0.421]
41 Proof Following the definition of ∆ we get that ∆(∅) = 0, so the explanation of the classifier’s prediction can be treated as a coalitional form game ⟨N, ∆⟩. [sent-205, score-1.006]
42 Now we provide an elementary proof that the contributions of individual feature values correspond to the Shapley value for the game ⟨N, ∆⟩. [sent-206, score-0.507]
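For completeness, the Shapley value referred to in the proof assigns the i-th feature value the contribution (standard game-theoretic definition):

```latex
\varphi_i(\Delta) \;=\; \sum_{S \subseteq N \setminus \{i\}}
\frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}
\Bigl(\Delta\bigl(S \cup \{i\}\bigr) - \Delta(S)\Bigr),
\qquad i = 1, \dots, n .
```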
43 So, the explanation method can be interpreted as follows. [sent-239, score-0.497]
44 We divide this change amongst the feature values in a way that is fair to their contributions across all possible sub-coalitions. [sent-241, score-0.295]
45 Axioms 1 to 3 and their interpretation in the context of our explanation method are of particular interest. [sent-243, score-0.497]
46 (2) - the sum of all n contributions in an instance’s explanation is equal to the difference in prediction ∆(N). [sent-245, score-0.727]
47 According to the 2nd axiom, if two feature values have an identical influence on the prediction, they are assigned contributions of equal size. [sent-247, score-0.288]
48 The 3rd axiom says that if a feature has no influence on the prediction it is assigned a contribution of 0. [sent-248, score-0.426]
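Stated for the game ⟨N, ∆⟩, the three properties just described are the standard efficiency, symmetry, and dummy axioms of the Shapley value:

```latex
\text{(efficiency)}\quad \sum_{i=1}^{n} \varphi_i = \Delta(N), \qquad
\text{(symmetry)}\quad \Delta(S \cup \{i\}) = \Delta(S \cup \{j\}) \;\;\forall S \subseteq N \setminus \{i, j\}
\;\Rightarrow\; \varphi_i = \varphi_j, \qquad
\text{(dummy)}\quad \Delta(S \cup \{i\}) = \Delta(S) \;\;\forall S \subseteq N \setminus \{i\}
\;\Rightarrow\; \varphi_i = 0 .
```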
49 When viewed together, these properties ensure that any effect the features might have on the classifier’s output will be reflected in the generated contributions, which effectively deals with the issues of previous general explanation methods. [sent-249, score-0.555]
50 1 An Illustrative Example In the introduction we used a simple boolean logic example to illustrate the shortcomings of existing general explanation methods. [sent-252, score-0.497]
51 With the same example we illustrate how our explanation method works. [sent-255, score-0.497]
52 Intuitively, ∆(S) is the difference between the classifier’s expected prediction if only values of features in S are known and the expected prediction if no values are known. [sent-260, score-0.234]
53 When observed together, the two features contribute less than their individual contributions would suggest, which results in a negative interaction: I({1, 2}) = ∆({1, 2}) − (I({1}) + I({2})) = −1/4. [sent-266, score-0.282]
54 The generated contributions reveal that both features contribute the same amount towards the prediction being 1 and the contributions sum up to the initial difference between the prediction for this instance and the prior belief. [sent-268, score-0.658]
55 To avoid this and still achieve an efficient explanation method, we extend the sampling algorithm in the following way. [sent-289, score-0.497]
56 The optimal (minimal) number of samples we need for the entire explanation is: mmin(1 − α, e) = n · Z^2_{1−α} · σ^2 / e^2, where σ^2 = (1/n) ∑_{i=1}^{n} σ_i^2. [sent-328, score-0.531]
57 When explaining an instance, the sampling process has to be repeated for each of the n feature values. [sent-331, score-0.23]
58 Therefore, for a given error and confidence level, the time complexity of the explanation is O(n · T(A)), where function T(A) describes the instance classification time of the model on A. [sent-332, score-0.571]
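The following is a sketch of a standard Monte Carlo estimator of this kind, which approximates each feature's contribution by averaging the marginal effect of revealing that feature over random permutations of the features and reference instances drawn from the data. It is our illustrative reconstruction under those assumptions, not the authors' exact algorithm, which additionally uses quasi-random sampling and distributes the mmin samples across features according to the estimated variances σ_i^2.

```python
import random

def approx_contributions(predict, x, data, samples_per_feature, target_class=1):
    """Monte Carlo approximation of the contributions of the feature values of x.

    predict(z) -> vector of class probabilities; data -> list of instances used
    to fill in the values of the 'unknown' features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        total = 0.0
        for _ in range(samples_per_feature):
            order = list(range(n))
            random.shuffle(order)                 # random permutation gives a random preceding subset S
            S = set(order[:order.index(i)])       # features revealed before feature i
            y = random.choice(data)               # reference instance for the unrevealed features
            with_i    = [x[j] if (j in S or j == i) else y[j] for j in range(n)]
            without_i = [x[j] if j in S else y[j] for j in range(n)]
            total += predict(with_i)[target_class] - predict(without_i)[target_class]
        phi[i] = total / samples_per_feature
    return phi
```

Each sample costs two classifications, so for a fixed error and confidence level the running time grows as O(n · T(A)), matching the complexity stated above.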
59 We use a variety of different classifiers both to illustrate that it is indeed a general explanation method and to investigate how the method behaves with different types of classification models. [sent-336, score-0.497]
60 All experiments were done on an off-the-shelf laptop computer (2GHz dual core CPU, 2GB RAM), the explanation method is a straightforward Java implementation of the equations presented in this paper, and the classifiers were imported from the Weka machine learning software (http://www. [sent-338, score-0.497]
61 The first 8 data sets are synthetic data sets, designed specifically for testing explanation methods (see Robnik-Šikonja and Kononenko, 2008; Štrumbelj et al. [sent-415, score-0.497]
62 Those interested in a more detailed description of this data set and how our previous explanation method is successfully applied in practice can refer to our previous work (Štrumbelj et al. [sent-420, score-0.497]
63 For each data set we use half of the instances for training and half for testing the explanation method. [sent-427, score-0.538]
64 For each data set/classifier pair, we train the classifier on the training set and use both the explanation method and its approximation on each test instance. [sent-428, score-0.497]
65 When over 10000 samples are drawn, all the contributions across all features and all test instances are very close to the actual contributions. [sent-441, score-0.304]
66 And although the explanations would not tell us much about the concepts behind the data set (we conclude from the model’s performance that its knowledge is useless), they would reveal what the model has learned, which is the purpose of an explanation method. [sent-479, score-0.802]
67 Explaining the prediction of the ANN model for an instance from Monks2 is the most complex explanation (that is, it requires the most samples - see Fig. [sent-487, score-0.693]
68 3, which shows the time needed to provide an explanation with the desired error 99%, 0. [sent-491, score-0.497]
69 For smaller data sets (smaller in the number of features) the explanation is generated almost instantly. [sent-493, score-0.497]
70 For larger data sets, generating an explanation takes less than a minute, with the exception of bstNB on a few data sets and the ANN model on the Mushroom data set. [sent-494, score-0.526]
71 The Arrhythmia data set, with its 279 features, is an example of a data set where the explanation cannot be generated in a sensible amount of time. [sent-496, score-0.497]
72 For example, it takes more than an hour to generate an explanation for a prediction of the bstNB model. [sent-497, score-0.585]
73 The explanation method is therefore less appropriate for explaining models which are built on several hundred features or more. [sent-498, score-0.661]
74 Arguably, providing a comprehensible explanation involving a hundred or more features is a problem in its own right, and even inherently transparent models become less comprehensible with such a large number of features. [sent-499, score-0.663]
75 However, the focus of this paper is on providing an effective and general explanation method, which is computationally feasible on the majority of data sets we encounter. [sent-500, score-0.497]
76 Therefore, when considering the number of features the explanation method can still handle, we need not count irrelevant features, which are not included in the final model. [sent-502, score-0.586]
77 An additional advantage of the generated contributions is that they sum up to the difference between the model’s output prediction and the model’s expected output, given no information about the values of the features. [sent-514, score-0.23]
78 They were generated for various classification models and data sets, to show the advantage of having a general explanation method. [sent-517, score-0.497]
79 However, both NB and bstDT learn the importance of the fifth feature and explanations reveal that value 2 for the fifth feature speaks against class 1. [sent-523, score-0.424]
80 The explanations reveal that DT predicts this animal is a bird, because it has feathers. [sent-528, score-0.239]
81 These first two pairs of examples illustrate how the explanations reflect what the model has learnt and how we can compare explanations from different classifiers. [sent-532, score-0.281]
82 Some examples of underfitting and overfitting are actually desirable as they allow us to inspect if the explanation method reveals what the classifier has (or has not) learned. [sent-534, score-0.497]
83 6(a) shows the explanation of the logREG model’s prediction for an instance from the Xor data set. [sent-536, score-0.63]
84 The logREG model cannot learn the xor concept of this data set (for this data set the class label is the odd parity bit for the first three feature values) and the explanation is appropriate. [sent-541, score-0.65]
85 7(a) is an explanation for ANN’s prediction for the introductory instance from the Titanic data set (see Fig. [sent-545, score-0.63]
86 7(b)) is very similar to the inherent explanation (taking into account that a logarithm is applied in the inherent explanation). [sent-548, score-0.553]
87 8 is an explanation for RF’s prediction regarding whether breast cancer will (class = 1) or will not (class = 2) recur for this patient. [sent-552, score-0.616]
88 According to RF it is more likely that cancer will not recur and the explanation indicates that this is mostly due to a low number of positive lymph nodes (nLymph). [sent-553, score-0.559]
89 Conclusion In the introductory section, we asked if an efficient and effective general explanation method for classifiers’ predictions can be made. [sent-558, score-0.539]
90 Using only the input and output of a classifier we decompose the changes in its prediction into contributions of individual feature values. [sent-560, score-0.391]
91 These contributions correspond to known concepts from coalitional game theory. [sent-561, score-0.663]
92 Unlike existing methods, the resulting theoretical properties of the proposed method guarantee that no matter which concepts the classifier learns, the generated contributions will reveal the influence of feature values. [sent-562, score-0.416]
93 As we show [...] Figure 6 ((a) logREG model; (b) SVM model): The left hand side explanation indicates that the feature values have no significant influence on the logREG model on the Xor data set. [sent-564, score-0.708]
94 The right hand side explanation shows how SVM overfits the Random data set. [sent-565, score-0.497]
95 The left hand side explanation is for the ANN model. [sent-567, score-0.497]
96 The right hand side explanation is for the NB model. [sent-568, score-0.497]
97 Figure 8: An explanation of the RF model’s prediction for a patient from the Oncology data set. [sent-569, score-0.585]
98 It would also be interesting to explore the possibility of applying the same principles to the explanation of regression models. [sent-574, score-0.497]
99 Polynomial calculation of the Shapley value based on sampling. [sent-593, score-0.253]
100 Explaining instance classifications with interactions of subsets of feature values. [sent-663, score-0.237]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
2 0.29846248 48 jmlr-2010-How to Explain Individual Classification Decisions
Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller
Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity
3 0.092948891 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
4 0.080508441 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. On the other hand with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
5 0.0549003 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
Author: Franz Pernkopf, Jeff A. Bilmes
Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O(N^(k+1)) score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature
6 0.054179519 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
7 0.052261017 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
8 0.045748249 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
9 0.045135744 22 jmlr-2010-Classification Using Geometric Level Sets
10 0.04392533 63 jmlr-2010-Learning Instance-Specific Predictive Models
11 0.042186461 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
12 0.039534505 40 jmlr-2010-Fast and Scalable Local Kernel Machines
15 0.034556858 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?
16 0.032934658 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project
17 0.032231066 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
18 0.031082543 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
19 0.030780617 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
20 0.030604335 37 jmlr-2010-Evolving Static Representations for Task Transfer
simIndex simValue paperId paperTitle
same-paper 1 0.93877023 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
2 0.89552051 48 jmlr-2010-How to Explain Individual Classification Decisions
Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller
Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity
3 0.33170691 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
4 0.24637002 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
Author: Franz Pernkopf, Jeff A. Bilmes
Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O N k+1 score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature
5 0.24129307 63 jmlr-2010-Learning Instance-Specific Predictive Models
Author: Shyam Visweswaran, Gregory F. Cooper
Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including nave Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging
6 0.23651934 22 jmlr-2010-Classification Using Geometric Level Sets
7 0.21954158 40 jmlr-2010-Fast and Scalable Local Kernel Machines
8 0.21842802 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
9 0.20897084 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
10 0.19832005 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
11 0.19195838 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
13 0.18845725 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
15 0.18469489 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
16 0.18043327 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
17 0.17819057 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
18 0.16804537 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
19 0.16735049 94 jmlr-2010-Quadratic Programming Feature Selection
20 0.16625433 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
simIndex simValue paperId paperTitle
1 0.80534667 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
Author: Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studený
Abstract: The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method. Keywords: conditional independence inference, linear programming approach
same-paper 2 0.71931678 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
Author: Erik Štrumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
Author: Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol
Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations. Keywords: deep learning, unsupervised feature learning, deep belief networks, autoencoders, denoising
4 0.52238089 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
Author: Choon Hui Teo, S.V.N. Vishwanathan, Alex J. Smola, Quoc V. Le
Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization
5 0.52130318 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
6 0.51245725 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.50909692 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
8 0.5074954 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
9 0.50684071 48 jmlr-2010-How to Explain Individual Classification Decisions
10 0.5053103 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
11 0.5049184 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
12 0.50426692 102 jmlr-2010-Semi-Supervised Novelty Detection
13 0.50404119 63 jmlr-2010-Learning Instance-Specific Predictive Models
14 0.50245702 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
15 0.50203258 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
16 0.5016349 22 jmlr-2010-Classification Using Geometric Level Sets
17 0.49984345 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
19 0.49685609 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
20 0.49684879 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming