jmlr jmlr2010 jmlr2010-48 knowledge-graph by maker-knowledge-mining

48 jmlr-2010-How to Explain Individual Classification Decisions


Source: pdf

Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller

Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question of what the most likely label of a given unseen data point is. However, most methods will provide no answer as to why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only methods that are currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows one to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper proposes a simple framework that provides local explanation vectors applicable to any classification method in order to help understanding prediction results for single data instances. [sent-46, score-0.589]

2 The local explanation yields the features relevant for the prediction at the very points of interest in the data space, and is able to spot local peculiarities that are neglected in the global view, for example, due to cancellation effects. [sent-47, score-0.664]

3 The paper is organized as follows: We define local explanation vectors as class probability gradients in Section 2 and give an illustration for Gaussian Process Classification (GPC). [sent-48, score-0.687]

4 In Section 4 we apply our methodology to learn distinguishing properties of Iris flowers by estimating explanation vectors for a k-NN classifier applied to the classic Iris data set. [sent-51, score-0.466]

5 In Section 6 we focus on a more real-world application scenario where the proposed explanation capabilities prove useful in drug discovery: Human experts regularly decide how to modify existing lead compounds in order to obtain new compounds with improved properties. [sent-53, score-0.747]

6 Our automatically generated explanations match with chemical domain knowledge about toxifying functional groups of the compounds in question. [sent-55, score-0.371]

7 Definitions of Explanation Vectors In this Section we will give definitions for our approach of local explanation vectors in the classification setting. [sent-58, score-0.557]

8 For the Bayes classifier we define the explanation vector of a data point x0 to be the derivative with respect to x at x = x0 of the conditional probability of Y = g∗(x0) given X = x, or formally, Definition 1: ζ(x0) := ∂/∂x P(Y = g∗(x0) | X = x) |x=x0. [sent-81, score-0.415]
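
A minimal sketch (not from the paper) of Definition 1 on a toy problem where the Bayes posterior is available in closed form: two Gaussian classes with identity covariance and equal priors, so P(Y = 1 | X = x) can be evaluated exactly and the explanation vector can be approximated by central finite differences. All names and parameter values below are illustrative assumptions.

```python
import numpy as np

# Toy Bayes classifier: two Gaussian classes N(mu0, I) and N(mu1, I) with equal priors.
mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])

def p1(x):
    """Posterior P(Y=1 | X=x); for equal covariances this is a logistic function of x."""
    a = np.exp(-0.5 * np.sum((x - mu1) ** 2))
    b = np.exp(-0.5 * np.sum((x - mu0) ** 2))
    return a / (a + b)

def bayes_explanation(x0, eps=1e-5):
    """Definition 1: gradient of P(Y = g*(x0) | X = x) at x = x0, via central differences."""
    g_star = 1 if p1(x0) >= 0.5 else 0                     # Bayes label at x0
    prob = (lambda x: p1(x)) if g_star == 1 else (lambda x: 1.0 - p1(x))
    grad = np.zeros_like(x0)
    for d in range(x0.size):
        e = np.zeros_like(x0); e[d] = eps
        grad[d] = (prob(x0 + e) - prob(x0 - e)) / (2 * eps)
    return grad

x0 = np.array([0.3, 0.5])
print(bayes_explanation(x0))   # points towards the region where the predicted class becomes more probable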

9 Ignoring the orientations of the explanation vectors, ζ forms a continuously changing (orientation-less) vector field along which the class labels change. [sent-92, score-0.415]

10 Our explanation vector fits well to classifiers where the conditional distribution P(Y = c | X = x) is usually not completely flat in some regions. [sent-95, score-0.415]

11 In the case of deterministic classifiers, despite this issue, Parzen window estimators with appropriate widths (Section 3) can provide meaningful explanation vectors for many samples in practice (see also Section 8). [sent-96, score-0.566]

12 In the case of binary classification we directly define local explanation vectors as local gradients of the probability function p(x) = P(Y = 1 | X = x) of the learned model for the positive class. [sent-97, score-0.778]

13 Given data {(x1, y1), . . . , (xn, yn)} ∈ ℜd × {−1, +1}, the explanation vector for a classified test point x0 is the local gradient of p at x0: Definition 2: ηp(x0) := ∇p(x)|x=x0. [sent-101, score-0.559]
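
Assuming the learned model exposes class probabilities through a scikit-learn-style predict_proba (an assumption, not something stated in the summary), a minimal sketch of Definition 2 approximates ηp(x0) by central finite differences:

```python
import numpy as np

def explanation_vector(model, x0, eps=1e-4):
    """Sketch of Definition 2: eta_p(x0) = gradient of p(x) = P(Y=1 | X=x) at x0.

    `model` is assumed to expose a scikit-learn style predict_proba; the gradient is
    approximated coordinate-wise by central finite differences.
    """
    x0 = np.asarray(x0, dtype=float)
    def p(x):
        return model.predict_proba(x.reshape(1, -1))[0, 1]   # probability of the positive class
    eta = np.zeros_like(x0)
    for d in range(x0.size):
        e = np.zeros_like(x0); e[d] = eps
        eta[d] = (p(x0 + e) - p(x0 - e)) / (2 * eps)
    return eta
```

For differentiable models one would normally prefer an analytic or automatic gradient; finite differences are used here only to keep the sketch model-agnostic.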

14 By this definition the explanation η is again a d-dimensional vector just like the test point x0 is. [sent-102, score-0.415]

15 [Figure 1: Explaining simple object classification with Gaussian Processes; panels (c) local explanation vectors, (d) direction of explanation vectors.] Panel (a) of Figure 1 shows the training data of a simple object classification task and panel (b) shows the model learned using GPC. [sent-192, score-1.022]

16 Panel (c) shows the probability gradient of the model together with the local gradient explanation vectors. [sent-195, score-0.612]

17 Now, instead of trying to explain g∗, to which we have no access, we will define explanation vectors that help us understand the classifier g on the test data points. [sent-230, score-0.505]

18 For ĝ it is straightforward to define explanation vectors: Definition 3: ζ̂(z) := ∂/∂x p̂(y = ĝ(z) | x)|x=z = [ (Σ_{i∈I_ĝ(z)} k(z − xi)) (Σ_{i=1}^n k(z − xi)(z − xi)) − (Σ_{i=1}^n k(z − xi)) (Σ_{i∈I_ĝ(z)} k(z − xi)(z − xi)) ] / [ σ² (Σ_{i=1}^n k(z − xi))² ]. [sent-242, score-0.663]
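
A hedged numpy sketch of Definition 3 as reconstructed above: the gradient of the Gaussian-kernel Parzen window estimate of P(y = ĝ(z) | x), obtained by the quotient rule. The function names, and the way ĝ(z) is obtained here (as the Parzen prediction at z), are assumptions for illustration.

```python
import numpy as np

def parzen_explanation(X, y_hat, z, sigma):
    """Sketch of Definition 3: gradient at x = z of the Parzen window estimate of
    P(y = g_hat(z) | x), with a Gaussian kernel of width sigma.

    X     : (n, d) training inputs
    y_hat : (n,)  labels assigned by the classifier we want to explain
    z     : (d,)  query point
    """
    diff = z - X                                                   # (n, d) differences z - x_i
    k = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / sigma ** 2)      # kernel weights k(z - x_i)

    # Parzen window prediction at z: the class with the largest kernel mass
    classes = np.unique(y_hat)
    g_z = max(classes, key=lambda c: k[y_hat == c].sum())
    in_class = (y_hat == g_z)

    A, B = k[in_class].sum(), k.sum()                              # numerator / denominator of the estimate
    dA = -(k[in_class][:, None] * diff[in_class]).sum(axis=0) / sigma ** 2
    dB = -(k[:, None] * diff).sum(axis=0) / sigma ** 2
    return (dA * B - A * dB) / B ** 2                              # quotient rule
```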

19 In the following, we will present a more general method for estimating explanation vectors. [sent-250, score-0.415]

20 ... explanation vectors using Definition 3. [sent-251, score-0.466]

21 In order to estimate explanation vectors we mimic the classification results with a Parzen window classifier. [sent-269, score-0.566]
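
One plausible way to “mimic the classification results” is to pick the kernel width whose Parzen window classifier best reproduces the labels assigned by the black-box classifier. The grid of widths and the leave-one-out disagreement criterion below are assumptions, not the paper's exact procedure.

```python
import numpy as np

def fit_parzen_width(X, y_hat, widths=np.logspace(-2, 1, 20)):
    """Pick the Gaussian kernel width whose Parzen window classifier best mimics the
    labels y_hat assigned by the classifier we want to explain (leave-one-out)."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    classes = np.unique(y_hat)
    best_sigma, best_err = None, np.inf
    for sigma in widths:
        K = np.exp(-0.5 * sq_dists / sigma ** 2)
        np.fill_diagonal(K, 0.0)                  # leave-one-out: a point does not vote for itself
        # predicted label = class with the largest kernel mass among the remaining points
        preds = np.array([max(classes, key=lambda c: K[i, y_hat == c].sum()) for i in range(n)])
        err = np.mean(preds != y_hat)
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma, best_err
```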

22 Since the explanation vectors live in the input space we can visualize them with scatter plots of the initially measured features. [sent-272, score-0.466]

23 The blue markers correspond to explanation vectors of Iris setosa and the red markers correspond to those of Iris virginica (both class 1). [sent-276, score-0.583]

24 Figure 3: Scatter plots of the explanation vectors for the test data. [sent-282, score-0.466]

25 Shown are all explanation vectors for both classes: class 1 containing Iris setosa (shown in blue) and Iris virginica (shown in red) versus class 0 containing only the species Iris versicolor (shown in green). [sent-283, score-0.641]

26 Explaining USPS Digit Classification by Support Vector Machine We now apply the framework of estimating explanation vectors to a high dimensional data set, the USPS digits. [sent-286, score-0.466]

27 For each digit from left to right: (i) explanation vector (with black being negative, white being positive), (ii) the original digit, (iii-end) artificial digits along the explanation vector towards the other class. [sent-293, score-0.946]

28 For each digit from left to right: (i) explanation vector (with black being negative, white being positive), (ii) the original digit, (iii-end) artificial digits along the explanation vector towards the other class. [sent-295, score-0.946]

29 For each example we display from left to right: (i) the explanation vector, (ii) the original digit, (iii-end) artificial digits along the explanation vector towards the other class. [sent-299, score-0.884]

30 6 These artificial digits should help to understand and interpret the explanation vector. [sent-300, score-0.469]

31 Thus the parts of the lines that are missing show up in the explanation vector: if the dark parts (which correspond to the missing lines) are added to the “two” digit then it will be classified as an “eight”. [sent-303, score-0.477]

32 A similar explanation holds for the middle example framed in red in the same Figure. [sent-305, score-0.451]

33 This is reflected in the explanation vector by white spots/lines. [sent-307, score-0.415]

34 However, its explanation vector shows nicely which parts have to be added and which have to be removed. [sent-309, score-0.415]

35 The explanation vectors again tell us how the “eights” have to change to become classified as “twos”. [sent-311, score-0.466]

36 On the test set the explanation vectors are not as pronounced as on the training set. [sent-314, score-0.495]

37 Again the explanation vector shows us how to edit the image of the “two” to transform it into an “eight”, that is, exactly which parts of the digit were important for the classification result. [sent-317, score-0.477]

38 For several other “twos” the explanation vectors do not directly lead to the “eights” but weight the different parts of the digits that were relevant for the classification. [sent-318, score-0.52]

39 Figure 5 (right panel): Similarly to the training data, we see that these explanation vectors also do not bring all “eights” to “twos”. [sent-319, score-0.495]

40 Their explanation vectors mainly suggest removing most of the “eights” (black pixels) and adding some black in the lower part (the light parts, which look like a white shadow). [sent-320, score-0.466]

41 Overall, the explanation vectors tell us how to edit our example digits to change the assigned class label. [sent-321, score-0.52]

42 Explaining Mutagenicity Classification by Gaussian Processes In the following Section we describe an application of our local gradient explanation methodology to a complex real world data set. [sent-324, score-0.559]

43 For the sake of simplicity, no intermediate updates were performed, that is, artificial digits were generated by taking equal-sized steps in the direction given by the original explanation vector calculated for the original digit. [sent-326, score-0.469]
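
A small sketch of the morphing procedure as described: step repeatedly along the fixed, normalized explanation vector of the original digit, with no intermediate gradient updates. The step size and number of steps are illustrative assumptions.

```python
import numpy as np

def morph_along_explanation(x0, eta, n_steps=8, step=0.5):
    """Generate artificial digits by taking equal-sized steps from x0 along the (fixed)
    explanation vector eta; no intermediate updates of the explanation are performed."""
    direction = eta / (np.linalg.norm(eta) + 1e-12)
    frames = [x0 + i * step * direction for i in range(n_steps + 1)]
    return [np.clip(f, 0.0, 1.0) for f in frames]      # keep pixel values in a valid range
```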

44 Together with the prediction we calculated the explanation vector (as introduced in Definition 2) for each test point. [sent-348, score-0.447]

45 In Figures 7 and 8 we show the distribution of the local importance of selected features across the test set: For each input feature we generate a histogram of local importance values, as indicated by its corresponding entry in the explanation vector of each of the 4512 test compounds. [sent-350, score-0.662]
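
A sketch of how such histograms could be produced from a matrix of explanation vectors; matplotlib and the argument names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_local_importance(E, feature_names, features_to_show, bins=40):
    """E is an (n_test, d) matrix whose rows are explanation vectors; for each selected
    feature, plot a histogram of its local importance values across the test set."""
    for name in features_to_show:
        j = feature_names.index(name)
        plt.figure()
        plt.hist(E[:, j], bins=bins, density=True)
        plt.xlabel("local gradient  %s" % name)
        plt.ylabel("relative frequency")
    plt.show()
```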

46 Modifying the test compounds by adding toxicophores will increase the probability of being mutagenic as predicted by the GPC model while adding detoxicophores will decrease this predicted probability. [sent-359, score-0.445]

47 [Figure 6: Receiver operating characteristic curve of the GPC model for mutagenicity prediction.] We have seen that the conclusions drawn from our explanation vectors agree with established knowledge about toxicophores and detoxicophores. [sent-376, score-0.73]

48 While this is reassuring, such a sanity check required existing knowledge about which compounds are toxicophores and detoxicophores and which are not. [sent-377, score-0.375]

49 Thus it is interesting to ask whether we also could have discovered that knowledge from the explanation vectors. [sent-378, score-0.415]

50 In the following paragraph we will discuss steroids as an example of an important compound class for which the meaning of features differs from this global trend, so that local explanation vectors are needed to correctly identify relevant features. [sent-384, score-0.592]

51 Figure 9 displays the difference in relevance of epoxide (a) and aliphatic nitrosamine (c) substructures for the predicted mutagenicity of steroids and non-steroid compounds. [sent-385, score-0.512]

52-56 [Figure 7 and Figure 8 residue: histogram panels of local gradient vs. relative frequency for the features DR17:nArNH2, DR17:nRNNOx (aliphatic nitrosamine), DR17:nArNNOx (aromatic nitrosamine), X:AliphaticHalide (aliphatic halide) and DR17:nArCOOH (aromatic carboxylic acid). Captions of Figures 7 and 8: Distribution of local importance of selected features across the test set of 4512 compounds.]

57 In contrast, almost all epoxide containing steroids exhibit gradients just below zero. [sent-522, score-0.375]

58 This peculiarity in chemical space is clearly exhibited by the local explanation given by our approach. [sent-525, score-0.583]

59 In conclusion, we can learn from the explanation vectors that: • Toxicophores tend to make compounds mutagenic (class 1). [sent-528, score-0.679]

60-61 [Figure 9 residue: panel (b), local gradient histogram of the epoxide feature DR17:nOxiranes, random compounds vs. the rest. Caption of Figure 9: The local distribution of feature importance to steroids and random non-steroid compounds significantly differs for two known toxicophores.]

62 The small local gradients found for the steroids (shown in blue) indicate that the presence of each toxicophore is irrelevant to the molecule's toxicity. [sent-581, score-0.391]

63 Our notion of explanation is not related to the prediction error, but only to the label provided by the prediction algorithm. [sent-585, score-0.479]

64 The explanation vector proposed here is similar in spirit to sensitivity analysis, which is common in various areas of information science. [sent-587, score-0.415]

65 Our framework of explanation vectors considers a different view. [sent-598, score-0.466]

66 The explanation vectors are used to extract sensitive features that are relevant to the prediction results, rather than detecting/eliminating the influential samples. [sent-600, score-0.533]

67 In recent decades, explanation of results by expert systems has been an important topic in the Artificial Intelligence community. [sent-601, score-0.415]

68 Especially for expert systems based on Bayesian belief networks, such explanation is crucial in practical use. [sent-602, score-0.415]

69 There the influence is evaluated by removing a set of variables (features) from the evidence and the explanation is constructed from those variables that affect inference (relevant variables). [sent-605, score-0.415]

70 This line of research is more connected to our work, because explanation can depend on the assigned values of the evidence E, and is thus local. [sent-611, score-0.415]

71 Discussion We have shown that our methods for calculating / estimating explanation vectors are useful in a variety of situations. [sent-627, score-0.466]

72 However, the explanation ζ(x) for the center point of the middle cluster is the zero vector, because at that point p(Y = 1|X = x) is maximal. [sent-638, score-0.415]

73 Actually, the (normalized) explanation vector is derived from the following optimization problem for finding the locally most influential direction: argmax_{‖ε‖=1} {p(Y = g∗(x0) | X = x0 + ε) − p(Y = g∗(x0) | X = x0)}. [sent-640, score-0.451]
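
A hedged numerical check of this optimization view: among many random unit directions, the one that most increases the predicted class probability for a small step should roughly align with the normalized explanation vector. The probability function prob, the step size and the random-search setup below are assumptions.

```python
import numpy as np

def most_influential_direction(prob, x0, eps=1e-3, n_candidates=2000, seed=0):
    """Approximate argmax over unit vectors u of prob(x0 + eps*u) - prob(x0) by random
    search; for small eps the winner should line up with the normalized gradient."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_candidates, x0.size))
    U /= np.linalg.norm(U, axis=1, keepdims=True)          # project candidates onto the unit sphere
    gains = np.array([prob(x0 + eps * u) for u in U]) - prob(x0)
    return U[np.argmax(gains)]
```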

74 In the example data set with three clusters, the explanation vector is constant along the second dimension. [sent-642, score-0.415]

75 However, if the conditional distribution P(Y = 1 | X = x) is flat in some regions, no meaningful explanation can be obtained by the gradient-based approach with the remedy mentioned above. [sent-647, score-0.415]

76 Practically, by using Parzen window estimators with larger widths, the explanation vector can capture coarse structures of the classifier at the points that are not so far from the borders. [sent-648, score-0.515]

77 When using the local gradient of the model prediction directly as in Definition 2 and Section 6, the explanation follows the given model precisely by definition. [sent-656, score-0.591]

78 In that case the explanation vectors will be different, which makes sense, since they should explain the classifier at hand, even if its estimated labels were not all correct. [sent-658, score-0.505]

79 On the other hand, if the different classifiers agree on all labels, the explanation will be exactly equal. [sent-659, score-0.415]

80 When one queries the model in an area of the feature space where predictions are negative and approaches the boundaries of the space populated with training data, explanation vectors will point away from any training data and therefore also away from areas of positive prediction. [sent-664, score-0.554]

81 This behavior can be observed in Figure 1(d), where unit length vectors indicate the direction of explanation vectors. [sent-665, score-0.466]

82 4 Stationarity of the Data Since explanation vectors are defined as local gradients of the model prediction (see Definition 2), no assumption on the data is made: The local gradients follow the predictive model in any case. [sent-670, score-0.94]

83 If, however, the model to be explained assumes stationarity of the data, the explanation vectors will inherit this limitation and reflect any shortcomings of the model (e. [sent-671, score-0.466]

84 Our method for estimating explanation vectors, on the other hand, assumes stationarity of the data. [sent-674, score-0.415]

85 In a nutshell, the estimated explanations are local gradients that characterize how a data point has to be moved to change its predicted label. [sent-685, score-0.337]

86 Even local peculiarities in chemical space (the extraordinary behavior of steroids) were discovered using the local explanations given by our approach. [sent-690, score-0.375]

87 Thus using our explanation framework in computational biology (see Sonnenburg et al. [sent-692, score-0.415]

88 In the following we present the derivation of direct local gradients and illustrate aspects like the effect of different kernel functions, outliers and local non-linearities. [sent-702, score-0.385]

89 Furthermore we present the derivation of explanation vectors based on the Parzen window estimation and illustrate how the quality of the fit of the Parzen window approximation affects the quality of the estimated explanation vectors. [sent-703, score-1.242]

90 Since the explanation is derived directly from the respective model, it is interesting to investigate its accuracy depending on different model parameters and in instructive scenarios. [sent-708, score-0.415]

91 [Figure 11, panel (d): rational quadratic explanation. Caption: The effect of different kernel functions on the local gradient explanations.] For other parameter values the rational quadratic kernel leads to results similar to those of the RBF kernel function used in Figure 1. [sent-786, score-0.765]

92 [Figure 12, panel (c): outlier explanation. Caption: The effect of outliers on the local gradient explanations.] • Local gradients are sensitive to outliers in the same way as the model which they try to explain. [sent-834, score-0.939]

93 Here a single outlier deforms the model, and with it the explanation that may be extracted from it. [sent-835, score-0.463]

94 • Thus the local gradient of a point near an outlier may not reflect a true explanation of the features important in reality. [sent-837, score-0.642]

95 To compensate for the effect of outliers on the local gradients of points in the affected region, we propose to use a sliding window method to smooth the gradients around each point of interest. [sent-840, score-0.494]
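
A minimal sketch of such a smoothing step, assuming a fixed neighbourhood radius: each point's explanation vector is replaced by the average over the explanation vectors of its neighbours.

```python
import numpy as np

def smoothed_explanations(X, E, radius):
    """For each point, replace its explanation vector by the mean explanation vector of
    all points within `radius` (a simple sliding-window smoother in input space)."""
    E_smooth = np.empty_like(E)
    for i, x in enumerate(X):
        mask = np.linalg.norm(X - x, axis=1) <= radius
        E_smooth[i] = E[mask].mean(axis=0)
    return E_smooth
```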

96 [Figure 13, panel (c): locally non-linear explanation. Caption: The effect of local non-linearity on the local gradient explanations.] The effect of locally non-linear class boundaries in the data is shown in Figure 13, again for GPC with an RBF kernel. [sent-885, score-0.838]

97 First we give the derivation to obtain the explanation vector and second we examine how the explanation varies with the goodness of fit of the Parzen window method. [sent-891, score-0.93]

98 Fit by Parzen window: In our estimation framework the quality of the local gradients depends on the approximation of the classifier we want to explain by Parzen windows, for which we can calculate the explanation vectors as given by Definition 3. [sent-898, score-0.726]
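
A simple sanity check of that approximation quality, under assumed helper names: the fraction of evaluation points on which the Parzen window classifier with width σ reproduces the labels of the model being explained. Sweeping σ makes over- and under-smoothing visible.

```python
import numpy as np

def parzen_agreement(X, y_hat, X_eval, y_eval_hat, sigma):
    """Fraction of evaluation points on which the Parzen window classifier (width sigma),
    fitted to the labels y_hat of the model being explained, reproduces that model's labels."""
    classes = np.unique(y_hat)
    agree = 0
    for x, y in zip(X_eval, y_eval_hat):
        k = np.exp(-0.5 * np.sum((x - X) ** 2, axis=1) / sigma ** 2)
        pred = max(classes, key=lambda c: k[y_hat == c].sum())
        agree += (pred == y)
    return agree / len(X_eval)
```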

99 [Figure 14: The goodness of fit of the Parzen window approximation affects the quality of the estimated explanation vectors.] • The SVM model was trained with C = 10 using an RBF kernel of width σ = 0. [sent-970, score-0.657]

100 • For too large a window width, as in Subfigure 14(d), the approximation fails to obtain local gradients which closely follow the model. [sent-978, score-0.382]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('explanation', 0.415), ('baehrens', 0.198), ('gpc', 0.169), ('ansen', 0.163), ('armeling', 0.163), ('awanabe', 0.163), ('chroeter', 0.163), ('ecisions', 0.163), ('xplain', 0.163), ('iris', 0.162), ('parzen', 0.161), ('eights', 0.151), ('toxicophores', 0.151), ('compounds', 0.143), ('steroids', 0.14), ('ndividual', 0.139), ('uller', 0.139), ('gradients', 0.13), ('schroeter', 0.128), ('explanations', 0.116), ('var', 0.113), ('epoxide', 0.105), ('twos', 0.105), ('window', 0.1), ('schwaighofer', 0.099), ('aliphatic', 0.093), ('ames', 0.093), ('aromatic', 0.093), ('nitrosamine', 0.093), ('local', 0.091), ('ow', 0.087), ('detoxicophores', 0.081), ('hansen', 0.081), ('heinrich', 0.081), ('laak', 0.081), ('mutagenicity', 0.081), ('steroid', 0.081), ('ig', 0.081), ('lassification', 0.078), ('frequency', 0.078), ('ic', 0.077), ('chemical', 0.077), ('mutagenic', 0.07), ('obrezanova', 0.07), ('setosa', 0.07), ('ter', 0.07), ('uential', 0.07), ('classi', 0.07), ('digit', 0.062), ('panel', 0.061), ('width', 0.061), ('petal', 0.06), ('versicolor', 0.058), ('digits', 0.054), ('explaining', 0.054), ('gradient', 0.053), ('vectors', 0.051), ('kld', 0.05), ('sub', 0.048), ('mika', 0.048), ('outlier', 0.048), ('ganzer', 0.047), ('kazius', 0.047), ('virginica', 0.047), ('drug', 0.046), ('er', 0.045), ('sugiyama', 0.045), ('outliers', 0.043), ('kononenko', 0.041), ('qsar', 0.04), ('explain', 0.039), ('usps', 0.039), ('berlin', 0.039), ('ks', 0.036), ('locally', 0.036), ('framed', 0.036), ('features', 0.035), ('rbf', 0.035), ('harmeling', 0.035), ('hypotenuse', 0.035), ('kawanabe', 0.035), ('motoaki', 0.035), ('noxiranes', 0.035), ('nrnnox', 0.035), ('ragon', 0.035), ('timon', 0.035), ('toxifying', 0.035), ('relative', 0.034), ('svm', 0.032), ('uence', 0.032), ('prediction', 0.032), ('xi', 0.031), ('tu', 0.031), ('kernel', 0.03), ('feature', 0.03), ('strumbelj', 0.03), ('molecules', 0.03), ('erfc', 0.03), ('sepal', 0.03), ('training', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 48 jmlr-2010-How to Explain Individual Classification Decisions

Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller

Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question of what the most likely label of a given unseen data point is. However, most methods will provide no answer as to why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only methods that are currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows one to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity

2 0.29846248 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

Author: Erik Štrumbelj, Igor Kononenko

Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization

3 0.080294468 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

Author: Markus Ojala, Gemma C. Garriga

Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing

4 0.066164121 22 jmlr-2010-Classification Using Geometric Level Sets

Author: Kush R. Varshney, Alan S. Willsky

Abstract: A variational level set method is developed for the supervised classification problem. Nonlinear classifier decision boundaries are obtained by minimizing an energy functional that is composed of an empirical risk term with a margin-based loss and a geometric regularization term new to machine learning: the surface area of the decision boundary. This geometric level set classifier is analyzed in terms of consistency and complexity through the calculation of its ε-entropy. For multicategory classification, an efficient scheme is developed using a logarithmic number of decision functions in the number of classes rather than the typical linear number of decision functions. Geometric level set classification yields performance results on benchmark data sets that are competitive with well-established methods. Keywords: level set methods, nonlinear classification, geometric regularization, consistency, complexity

5 0.063927397 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation

Author: Gavin C. Cawley, Nicola L. C. Talbot

Abstract: Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a nonnegligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds. Keywords: model selection, performance evaluation, bias-variance trade-off, selection bias, overfitting

6 0.052634057 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

7 0.051570628 40 jmlr-2010-Fast and Scalable Local Kernel Machines

8 0.043808334 15 jmlr-2010-Approximate Tree Kernels

9 0.04338361 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

10 0.042971544 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

11 0.041653126 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

12 0.040034998 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

13 0.03884808 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

14 0.036699612 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?

15 0.036652882 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

16 0.034113526 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines

17 0.033578847 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting

18 0.033443432 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox

19 0.033183075 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

20 0.033142634 44 jmlr-2010-Graph Kernels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.181), (1, 0.056), (2, -0.05), (3, 0.142), (4, -0.003), (5, 0.228), (6, 0.007), (7, 0.013), (8, -0.057), (9, -0.022), (10, 0.326), (11, 0.071), (12, -0.183), (13, 0.083), (14, 0.377), (15, -0.144), (16, -0.038), (17, 0.097), (18, 0.091), (19, 0.223), (20, 0.097), (21, -0.176), (22, -0.025), (23, -0.083), (24, 0.2), (25, 0.085), (26, -0.041), (27, -0.03), (28, -0.006), (29, 0.034), (30, 0.01), (31, 0.066), (32, 0.077), (33, 0.011), (34, 0.026), (35, -0.059), (36, 0.083), (37, 0.042), (38, -0.09), (39, 0.029), (40, 0.03), (41, -0.006), (42, -0.001), (43, 0.09), (44, 0.03), (45, -0.003), (46, 0.057), (47, 0.011), (48, 0.009), (49, -0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92380846 48 jmlr-2010-How to Explain Individual Classification Decisions

Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller

Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity

2 0.9141891 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

Author: Erik Štrumbelj, Igor Kononenko

Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization

3 0.31352901 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

Author: Markus Ojala, Gemma C. Garriga

Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing

4 0.26549792 22 jmlr-2010-Classification Using Geometric Level Sets

Author: Kush R. Varshney, Alan S. Willsky

Abstract: A variational level set method is developed for the supervised classification problem. Nonlinear classifier decision boundaries are obtained by minimizing an energy functional that is composed of an empirical risk term with a margin-based loss and a geometric regularization term new to machine learning: the surface area of the decision boundary. This geometric level set classifier is analyzed in terms of consistency and complexity through the calculation of its ε-entropy. For multicategory classification, an efficient scheme is developed using a logarithmic number of decision functions in the number of classes rather than the typical linear number of decision functions. Geometric level set classification yields performance results on benchmark data sets that are competitive with well-established methods. Keywords: level set methods, nonlinear classification, geometric regularization, consistency, complexity

5 0.23635381 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

Author: Qiang Wu, Justin Guinney, Mauro Maggioni, Sayan Mukherjee

Abstract: The problems of dimension reduction and inference of statistical dependence are addressed by the modeling framework of learning gradients. The models we propose hold for Euclidean spaces as well as the manifold setting. The central quantity in this approach is an estimate of the gradient of the regression or classification function. Two quadratic forms are constructed from gradient estimates: the gradient outer product and gradient based diffusion maps. The first quantity can be used for supervised dimension reduction on manifolds as well as inference of a graphical model encoding dependencies that are predictive of a response variable. The second quantity can be used for nonlinear projections that incorporate both the geometric structure of the manifold as well as variation of the response variable on the manifold. We relate the gradient outer product to standard statistical quantities such as covariances and provide a simple and precise comparison of a variety of supervised dimensionality reduction methods. We provide rates of convergence for both inference of informative directions as well as inference of a graphical model of variable dependencies. Keywords: gradient estimates, manifold learning, graphical models, inverse regression, dimension reduction, gradient diffusion maps

6 0.23510101 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

7 0.2345089 40 jmlr-2010-Fast and Scalable Local Kernel Machines

8 0.21333365 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

9 0.20535792 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation

10 0.2044647 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

11 0.19734055 63 jmlr-2010-Learning Instance-Specific Predictive Models

12 0.19362146 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

13 0.19324176 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

14 0.18887386 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines

15 0.18819694 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis

16 0.18779802 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

17 0.18364576 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

18 0.18251635 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

19 0.17934155 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

20 0.17644285 61 jmlr-2010-Learning From Crowds


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.012), (4, 0.013), (8, 0.012), (21, 0.014), (24, 0.016), (32, 0.044), (33, 0.013), (36, 0.038), (37, 0.047), (71, 0.447), (75, 0.139), (81, 0.018), (85, 0.091), (96, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67958063 48 jmlr-2010-How to Explain Individual Classification Decisions

Author: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Müller

Abstract: After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method. Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity

2 0.52962625 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

3 0.37259677 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

Author: Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le

Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization ∗. Also at Canberra Research Laboratory, NICTA. c 2010 Choon Hui Teo, S.V. N. Vishwanthan, Alex J. Smola and Quoc V. Le. T EO , V ISHWANATHAN , S MOLA AND L E

4 0.36802238 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

5 0.36779627 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

Author: Gal Chechik, Varun Sharma, Uri Shalit, Samy Bengio

Abstract: Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large data sets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a data set with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. For large, web scale, data sets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale data set, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale data sets that could not be handled before. Keywords: large scale, metric learning, image similarity, online learning ∗. Varun Sharma and Uri Shalit contributed equally to this work. †. Also at ICNC, The Hebrew University of Jerusalem, 91904, Israel. c 2010 Gal Chechik, Varun Sharma, Uri Shalit

6 0.36779562 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

7 0.36627367 63 jmlr-2010-Learning Instance-Specific Predictive Models

8 0.36482075 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

9 0.36481851 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

10 0.36469078 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

11 0.36433363 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

12 0.36210734 102 jmlr-2010-Semi-Supervised Novelty Detection

13 0.36113542 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

14 0.36073396 22 jmlr-2010-Classification Using Geometric Level Sets

15 0.36031333 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

16 0.35939837 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

17 0.35926193 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

18 0.3585315 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

19 0.35815468 107 jmlr-2010-Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

20 0.35802656 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes