jmlr jmlr2006 jmlr2006-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Janez Demšar
Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams. Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests
Reference: text
sentIndex sentText sentNum sentScore
1 Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests [sent-8, score-0.698]
2 Their message has been taken up by the community, and the overly confident paired t-tests over cross validation folds are giving way to the McNemar test and 5×2 cross validation. [sent-17, score-0.31]
3 We observed that many otherwise excellent and innovative machine learning papers end by drawing conclusions from a matrix of, for instance, McNemar’s tests comparing all pairs of classifiers, as if tests for multiple comparisons, such as ANOVA and the Friedman test, were yet to be invented. [sent-21, score-0.911]
4 The core of the paper is the study of the statistical tests that could be (or already are) used for comparing two or more classifiers on multiple data sets. [sent-22, score-0.52]
5 Although some of the tests are quite common in machine learning literature, many researchers seem ignorant about what the tests actually measure and which circumstances they are suitable for. [sent-29, score-0.622]
6 The tests used have, however, long been rather naive and unverified. [sent-34, score-0.371]
7 He examines five statistical tests and concludes the analysis by recommending the newly crafted 5×2cv t-test that overcomes the problem of underestimated variance and the consequently elevated Type I error of the more traditional paired t-test over folds of the usual k-fold cross validation. [sent-39, score-0.568]
8 As Salzberg himself notes, binomial testing lacks the power of the better non-parametric tests and the Bonferroni correction is overly radical. [sent-48, score-0.477]
9 Finally, for comparison of classifiers over multiple data sets, Hull (1994) was, to the best of our knowledge, the first to use non-parametric tests for comparing classifiers in information retrieval and assessment of relevance of documents (see also Schütze et al.). [sent-52, score-0.519]
10 There is a fundamental difference between the tests used to assess the difference between two classifiers on a single data set and the differences over multiple data sets. [sent-92, score-0.573]
11 Since these samples are usually related, a lot of care is needed in designing the statistical procedures and tests that avoid problems with biased estimations of variance. [sent-94, score-0.397]
12 Furthermore, the problem of correct statistical tests for comparing classifiers on a single data set is not related to the comparison on multiple data sets in the sense that we would first have to solve the former problem in order to tackle the latter. [sent-98, score-0.599]
13 Since running the algorithms on multiple data sets naturally gives a sample of independent measurements, such comparisons are even simpler than comparisons on a single data set. [sent-99, score-0.371]
14 Comparisons of Two Classifiers  In the discussion of the tests for comparisons of two classifiers over multiple data sets we will make two points. [sent-103, score-0.526]
15 Another, even more rarely used test is the sign test which is weaker than the Wilcoxon test but also has its distinct merits. [sent-106, score-0.288]
16 PAIRED T-TEST  A common way to test whether the difference between two classifiers’ results over various data sets is non-random is to compute a paired t-test, which checks whether the average difference in their performance over the data sets is significantly different from zero. [sent-121, score-0.283]
17 Ironically, the Kolmogorov-Smirnov and similar tests for testing the normality of distributions have little power on small samples, that is, they are unlikely to detect abnormalities and warn against using the t-test. [sent-136, score-0.466]
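Where the paired t-test over data sets is described above, a minimal sketch may make the computation concrete. This is not code from the paper; the accuracy values are invented placeholders, and SciPy's `ttest_rel` is shown only as a cross-check of the hand-computed statistic.

```python
# Hedged sketch: paired t-test over N data sets (placeholder accuracies, not the paper's tables).
import numpy as np
from scipy import stats

acc_a = np.array([0.85, 0.79, 0.91, 0.66, 0.73, 0.88, 0.70, 0.94])  # classifier A, one value per data set
acc_b = np.array([0.83, 0.80, 0.89, 0.64, 0.75, 0.86, 0.71, 0.92])  # classifier B

d = acc_a - acc_b                                   # per-data-set differences
n = len(d)
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))    # t statistic with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)     # two-sided p value

t_check, p_check = stats.ttest_rel(acc_a, acc_b)    # SciPy's built-in paired t-test, same result
print(t_stat, p_value, t_check, p_check)
```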
18 [Table residue: running head "Statistical Comparisons of Classifiers over Multiple Data Sets" and the data-set column of a results table (adult (sample), breast cancer, breast cancer wisconsin, cmc, ionosphere, iris, liver disorders, lung cancer, lymphography, mushroom, primary tumor, rheum, voting, wine); the classifier columns, starting with C4.5, are truncated in extraction.] [sent-138, score-0.483]
19 The differences are ranked according to their absolute values; average ranks are assigned in case of ties. [sent-195, score-0.295]
20 Let R+ be the sum of ranks for the data sets on which the second algorithm outperformed the first, and R− the sum of ranks for the opposite. [sent-196, score-0.424]
21 The ranks are assigned from the lowest to the highest absolute difference, and the equal differences (0. [sent-214, score-0.295]
22 The sum of ranks for the positive differences is R+ = 93, and the sum of ranks for the negative differences equals R− = 12. [sent-217, score-0.295] [sent-219, score-0.295]
24 The Wilcoxon signed ranks test is more sensible than the t-test. [sent-225, score-0.284]
25 The Wilcoxon test assumes continuous differences di , therefore they should not be rounded to, say, one or two decimals since this would decrease the power of the test due to a high number of ties. [sent-229, score-0.395]
26 When the assumptions of the paired t-test are met, the Wilcoxon signed-ranks test is less powerful than the paired t-test. [sent-230, score-0.344]
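A small sketch of the Wilcoxon signed-ranks procedure as described above: rank the absolute differences (average ranks for ties), sum the ranks of the positive and negative differences, and use the usual normal approximation for larger N. The data are placeholders, and splitting zero differences evenly between the two sums is an assumption of this sketch; SciPy's `wilcoxon` is included only as a cross-check and handles zeros slightly differently by default.

```python
# Hedged sketch of the Wilcoxon signed-ranks test (placeholder data, not the paper's tables).
import numpy as np
from scipy import stats

acc_a = np.array([0.85, 0.79, 0.91, 0.66, 0.73, 0.88, 0.70, 0.94, 0.61, 0.77])
acc_b = np.array([0.83, 0.80, 0.89, 0.64, 0.75, 0.86, 0.70, 0.92, 0.63, 0.74])
d = acc_a - acc_b

ranks = stats.rankdata(np.abs(d))                     # average ranks in case of ties
r_plus = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()
r_minus = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()
T = min(r_plus, r_minus)

n = len(d)
# Normal approximation, commonly used when N is not very small:
z = (T - n * (n + 1) / 4) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
p_value = 2 * stats.norm.sf(abs(z))

w_stat, p_check = stats.wilcoxon(acc_a, acc_b)        # SciPy's built-in version
print(r_plus, r_minus, T, z, p_value, w_stat, p_check)
```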
27 This test does not assume any commensurability of scores or differences nor does it assume normal distributions, and is thus applicable to any data (as long as the observations, i.e. the results on the different data sets, are independent). [sent-253, score-0.321]
28 Some authors prefer to count only the significant wins and losses, where the significance is determined using a statistical test on each data set, for instance Dietterich’s 5×2cv. [sent-258, score-0.281]
29 This would be a valid argument if statistical tests could distinguish between the random and non-random differences. [sent-260, score-0.343]
30 However, statistical tests only measure how improbable the obtained experimental result would be if the null hypothesis were correct, which is not the same as the (im)probability of the null-hypothesis itself. [sent-261, score-0.393]
31 Contrary to popular belief, counting only the significant wins and losses therefore does not make the tests more reliable but rather less so, since it draws an arbitrary threshold of p < 0.05 between what counts and what does not. [sent-265, score-0.38]
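The sign test mentioned earlier uses only the direction of each difference: under the null hypothesis each classifier wins on a data set with probability 1/2, so the number of wins follows a binomial distribution. A hedged sketch follows; the win/loss counts are illustrative rather than taken from the paper, and dropping ties is an assumption of the sketch.

```python
# Hedged sketch of the sign test on win/loss counts (illustrative numbers, ties dropped).
from math import comb

wins, losses = 11, 3                 # placeholder counts of data sets won by each classifier
n = wins + losses
k = max(wins, losses)

# Two-sided p value under Binomial(n, 0.5): probability of a split at least this uneven.
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n)
print(p_value)
```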
32 Comparisons of Multiple Classifiers  None of the above tests was designed for reasoning about the means of multiple random variables. [sent-268, score-0.37]
33 When so many tests are made, a certain proportion of the null hypotheses is rejected due to random chance, so listing them makes little sense. [sent-271, score-0.514]
34 Statistics offers more powerful specialized procedures for testing the significance of differences between multiple means. [sent-275, score-0.3]
35 If the between-classifiers variability is significantly larger than the error variability, we can reject the null-hypothesis and conclude that there are some differences between the classifiers. [sent-284, score-0.282]
36 To make pairwise comparisons between the classifiers, the corresponding differences in performances are divided by the standard error and compared with the critical value. [sent-288, score-0.392]
37 The two procedures are thus similar to a t-test, except that the critical values tabulated by Tukey and Dunnett are higher to ensure that there is at most 5 % chance that one of the pairwise differences will be erroneously found significant. [sent-289, score-0.306]
38 We will not describe ANOVA and its post-hoc tests in more detail due to our reservations about the parametric tests and, especially, since these tests are well known and described in the statistical literature (Zar, 1998; Sheskin, 2000). [sent-298, score-1.017]
39 It ranks the algorithms for each data set separately, the best performing algorithm getting the rank of 1, the second best rank 2, and so on. [sent-311, score-0.316]
40 The Friedman test compares the average ranks of the algorithms, R_j = (1/N) ∑_i r_i^j, where r_i^j is the rank of the j-th of k algorithms on the i-th of N data sets. [sent-317, score-0.284]
41 The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference CD = q_α √(k(k+1)/(6N)). [sent-330, score-0.298]
42 The critical values q_α are based on the Studentized range statistic divided by √2 (Table 5: critical values for post-hoc tests after the Friedman test, part (a)). [sent-371, score-0.517]
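The critical difference above can be computed directly; a short sketch follows. The value of q_α has to come from a table of the Studentized range statistic divided by √2 (such as the paper's Table 5(a)); the 2.569 used below is an assumed illustrative value for α = 0.05 and four classifiers and should be checked against such a table before use.

```python
# Hedged sketch: Nemenyi critical difference CD = q_alpha * sqrt(k(k+1) / (6N)).
import math

def nemenyi_cd(k, n, q_alpha):
    """k = number of classifiers, n = number of data sets, q_alpha taken from a table."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

cd = nemenyi_cd(k=4, n=14, q_alpha=2.569)   # q_alpha is an assumed illustrative value
print(cd)  # two classifiers differ significantly if their average ranks differ by at least cd
```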
43 When all classifiers are compared with a control classifier, we can, instead of the Nemenyi test, use one of the general procedures for controlling the family-wise error in multiple hypothesis testing, such as the Bonferroni correction or similar procedures. [sent-372, score-0.295]
44 Although these methods are generally conservative and can have little power, they are in this specific case more powerful than the Nemenyi test, since the latter adjusts the critical value for making k(k − 1)/2 comparisons while when comparing with a control we only make k − 1 comparisons. [sent-373, score-0.318]
45 The tests differ in the way they adjust the value of α to compensate for multiple comparisons. [sent-376, score-0.37]
46 Although we here use these procedures only as post-hoc tests for the Friedman test, they can be used generally for controlling the family-wise error when multiple hypotheses of possibly various types are tested. [sent-406, score-0.486]
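For comparisons with a control, the rank differences are divided by the standard error √(k(k+1)/(6N)) and converted to p values, which a step procedure such as Holm's then compares against adjusted α levels. The sketch below uses invented average ranks and is an illustration of the general scheme rather than the paper's exact computation.

```python
# Hedged sketch: comparing k - 1 classifiers with a control, then Holm's step-down procedure.
import math
from scipy import stats

k, n = 4, 14                                       # classifiers, data sets (placeholders)
rank_control = 1.6                                 # invented average rank of the control
rank_others = {"A": 2.3, "B": 2.9, "C": 3.2}       # invented average ranks of the others

se = math.sqrt(k * (k + 1) / (6.0 * n))            # standard error of a rank difference
p_vals = {name: 2 * stats.norm.sf(abs(r - rank_control) / se)
          for name, r in rank_others.items()}

def holm(p_values, alpha=0.05):
    """Step-down: compare the smallest p with alpha/m, the next with alpha/(m-1), and so on."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m, reject = len(p_values), [False] * len(p_values)
    for step, i in enumerate(order):
        if p_values[i] < alpha / (m - step):
            reject[i] = True
        else:
            break                                   # the first non-rejection stops the procedure
    return reject

print(p_vals, holm(list(p_vals.values())))
```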
47 The Friedman test checks whether the measured average ranks are significantly different from the mean rank R_j = 2.5 expected under the null hypothesis. [sent-434, score-0.324]
48 [Table residue: running head and the row labels of the table used in the Friedman example (adult (sample), breast cancer, breast cancer wisconsin, cmc, ionosphere, iris, liver disorders, lung cancer, lymphography, mushroom, primary tumor, rheum, voting, wine, average rank); the classifier columns, starting with C4.5, are truncated in extraction.] [sent-446, score-0.523]
49 The ranks in the parentheses are used in computation of the Friedman test and would usually not be published in an actual paper. [sent-523, score-0.284]
50 For the other tests we have to compute and order the corresponding statistics and p values. [sent-574, score-0.311]
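A sketch of the rank-based computation discussed in this example: rank the classifiers within each data set (rank 1 for the best, average ranks for ties), average the ranks, and compute the Friedman statistic. The accuracy matrix is a placeholder, not the paper's table, and SciPy's `friedmanchisquare` is shown only as a cross-check (it applies a tie correction, so the two numbers can differ slightly when ties occur).

```python
# Hedged sketch: per-data-set ranks, average ranks and the Friedman test (placeholder accuracies).
import numpy as np
from scipy import stats

acc = np.array([                       # rows: data sets, columns: classifiers (invented values)
    [0.82, 0.84, 0.83, 0.80],
    [0.71, 0.70, 0.74, 0.69],
    [0.90, 0.91, 0.91, 0.88],
    [0.65, 0.66, 0.64, 0.62],
    [0.78, 0.80, 0.79, 0.77],
    [0.88, 0.87, 0.89, 0.85],
])
n, k = acc.shape

ranks = np.apply_along_axis(stats.rankdata, 1, -acc)   # rank 1 = best accuracy, ties averaged
avg_ranks = ranks.mean(axis=0)

# Standard Friedman statistic, chi-square distributed with k - 1 degrees of freedom:
chi2_f = 12.0 * n / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4.0)
p_value = stats.chi2.sf(chi2_f, df=k - 1)

stat_check, p_check = stats.friedmanchisquare(*acc.T)  # one measurement vector per classifier
print(avg_ranks, chi2_f, p_value, stat_check, p_check)
```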
51 GRAPHICAL PRESENTATION OF RESULTS  When multiple classifiers are compared, the results of the post-hoc tests can be visually represented with a simple diagram. [sent-620, score-0.37]
52 Similar graphs for the other post-hoc tests would need to plot a different adjusted critical interval for each classifier and specify the procedure used for testing and the corresponding order of comparisons, which could easily become confusing. [sent-628, score-0.471]
53 Figure 1: Visualization of post-hoc tests for data from Table 6. [sent-1521, score-0.359]
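A minimal matplotlib sketch in the spirit of such a diagram: classifiers are placed on a rank axis at their average ranks, and a bar marks the length of the critical difference. The numbers are placeholders, and this simplified version omits the grouping lines that connect classifiers whose ranks differ by less than CD.

```python
# Hedged sketch of a simplified CD-style diagram (placeholder ranks, not the paper's Figure 1).
import matplotlib.pyplot as plt

avg_ranks = {"clf A": 1.8, "clf B": 2.2, "clf C": 3.1, "clf D": 3.9}  # invented average ranks
cd = 1.25                                                             # invented critical difference

fig, ax = plt.subplots(figsize=(6, 2))
for i, (name, r) in enumerate(sorted(avg_ranks.items(), key=lambda kv: kv[1])):
    ax.plot([r, r], [0, 0.2 + 0.1 * i], color="black")                # tick up from the rank axis
    ax.text(r, 0.25 + 0.1 * i, f"{name} ({r})", ha="center", fontsize=8)
ax.plot([1.0, 1.0 + cd], [-0.15, -0.15], color="black", linewidth=3)  # bar showing the CD length
ax.text(1.0 + cd / 2, -0.3, "CD", ha="center", fontsize=8)
ax.axhline(0, color="black", linewidth=1)                             # the rank axis itself
ax.set_xlim(0.5, len(avg_ranks) + 0.5)
ax.set_ylim(-0.4, 0.8)
ax.get_yaxis().set_visible(False)
for side in ("left", "right", "top"):
    ax.spines[side].set_visible(False)
plt.tight_layout()
plt.show()
```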
54 Experimental Setup  We examined the behaviour of the studied tests through the experiments in which we repeatedly compared the learning algorithms on sets of ten randomly drawn data sets and recorded the p values returned by the tests. [sent-2130, score-0.359]
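The protocol just described can be sketched as follows for the unbiased case (no bias toward data sets with large differences): repeatedly draw ten items at random, run a test, and record the p value. The pool of per-data-set differences below is synthetic, standing in for the real collection of data sets and classifiers used in the paper.

```python
# Hedged sketch of the resampling protocol (synthetic differences, unbiased selection only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pool = rng.normal(loc=0.01, scale=0.03, size=40)    # synthetic per-data-set score differences

n_repeats, sample_size, alpha = 1000, 10, 0.05
p_values = []
for _ in range(n_repeats):
    d = rng.choice(pool, size=sample_size, replace=False)   # draw ten "data sets" at random
    p_values.append(stats.wilcoxon(d).pvalue)               # record the p value of one test
p_values = np.asarray(p_values)

print(np.mean(p_values), np.mean(p_values < alpha))  # average p and fraction of rejections
```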
55 Since the statistical tests which we use are theoretically well understood, we do not need to test the tests but the compliance of the real-world data to their assumptions. [sent-2158, score-0.798]
56 As already mentioned in the description of the t-test, the tests like the Kolmogorov-Smirnov test of normality are unreliable on small samples where they are very unlikely to detect abnormalities. [sent-2162, score-0.46]
57 And even if we did have suitable tests at our disposal, they would only compute the degree of (ab)normality of the distribution, non-homogeneity of variance, etc., and not the sample’s suitability for the t-test. [sent-2163, score-0.366]
58 MEASURES OF POWER AND REPLICABILITY  Formally, the power of a statistical test is defined as the probability that the test will (correctly) reject a false null-hypothesis. [sent-2169, score-0.397]
59 The two measures for assessing the power of the tests lead to two related measures of replicability. [sent-2176, score-0.363]
60 The disadvantage of this measure is that a statistical test will show a low replicability when the difference between the classifiers is marginally significant. [sent-2183, score-0.35]
61 When comparing two tests of different power, the one with results closer to the chosen α will usually be deemed as less reliable. [sent-2184, score-0.381]
62 To allow for comparisons with Bouckaert’s R(e), we define the replicability with respect to the variance of p as R(p) = 1 − 2·var(p) = 1 − 2·∑_i (p_i − p̄)² / (n − 1). [sent-2188, score-0.385]
63 A problem with this measure of replicability when used in our experimental procedure is that when the bias k increases, the variability of the data set selection decreases and so does the variance of p. [sent-2189, score-0.379]
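The variance-based replicability just defined is straightforward to compute; a small helper is sketched below. The p values passed in are placeholders.

```python
# Hedged sketch: R(p) = 1 - 2 * var(p), using the (n - 1)-denominator variance from the text.
import numpy as np

def replicability_p(p_values):
    p = np.asarray(p_values, dtype=float)
    return 1.0 - 2.0 * p.var(ddof=1)

print(replicability_p([0.03, 0.04, 0.06, 0.02, 0.05]))   # placeholder p values
```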
64 Comparisons of Two Classifiers  We have tested four statistics for comparisons of two classifiers: the paired t-test on absolute and on relative differences, the Wilcoxon test and the sign test. [sent-2198, score-0.295]
65 The graphs on the left hand side of Figure 3 show the average p values returned by the tests as a function of the bias k when comparing C4. [sent-2200, score-0.381]
66 To demonstrate the relation between power (as we measure it) and Bouckaert’s measure of replicability, we have added a right-hand axis that shows the R(e) corresponding to the number of rejected hypotheses. [sent-2207, score-0.47]
67 Therefore, at k = 0 the tests reflect the number of rejections of the null-hypothesis on a completely random selection of data sets from our collection. [sent-2210, score-0.417]
68 The numbers below the diagonal show the average p values and the related replicability R(p), and the numbers above the diagonal represent the number of experiments in which the null-hypothesis was rejected at α = 5% and the related R(e). [sent-2221, score-0.363]
69 The table again shows that the Wilcoxon test almost always returns lower p values than other tests and more often rejects the null hypothesis. [sent-2222, score-0.454]
70 R(e), on the other hand, again prefers other tests with p values farther from the critical 0.05. [sent-2224, score-0.421]
71 Overall, it is known that parametric tests are more likely to reject the null-hypothesis than the non-parametric unless their assumptions are violated. [sent-2226, score-0.52]
72 Figure 3: Power of statistical tests for comparison of two classifiers. [sent-5936, score-0.374]
73 Figure 4: Replicability of tests for comparison of two classifiers: variance-based R(p) (left) and Bouckaert’s R(e) (right). [sent-10070, score-0.342]
74 [Table residue: panel (b), paired t-test on relative differences, over the classifiers c45, c45-m, c45-cf, tree, bayes, disc-bayes, knn; the numeric entries are truncated in extraction.] [sent-10197, score-0.326]
75 [Table residue: panel (c), Wilcoxon signed-ranks test, over the same classifiers.] [sent-10260, score-0.315]
76 Altogether, replicability seems somewhat smaller than the replicability of the tests for comparisons of two classifiers. [sent-11624, score-0.863]
77 Therefore, as common sense would suggest, when comparing multiple classifiers, it is even more important to conduct the tests on as many data sets as possible. [sent-11625, score-0.488]
78 Figure 7 compares post hoc tests for comparisons with a control classifier, using the same two ways of selecting data sets as in Figure 6. [sent-11635, score-0.467]
79 When the differences are large, the power of all tests is comparable, while when they are smaller the number of rejections for the parametric test seems to lag behind (we have observed this same pattern on other combinations of algorithms). [sent-11636, score-0.676]
80 The order of the non-parametric tests is as expected from the theory, although it is interesting to note that the Holm and Hochberg tests give practically equal results. [sent-11637, score-0.622]
81 Figure 6: Power of statistical tests for comparison of multiple classifiers. [sent-12884, score-0.433]
82 These experiments again seem to favour the non-parametric tests over the parametric ones although not always as convincingly as in the case of comparisons of two classifiers. [sent-12888, score-0.501]
83 Due to its theoretical and practical advantages (ease of computation and interpretation, and the ability to present the overall performance of classifiers in the form of ranks instead of dubious averages), the Friedman test should be preferred over ANOVA. [sent-12889, score-0.38]
84 The corresponding non-parametric post-hoc tests give similar results, so it is upon the researcher to decide whether the slightly more powerful Hommel test is worth the complexity of its calculation as compared to the much simpler Holm test. [sent-12890, score-0.472]
85 Figure 7: Power of statistical tests for comparison of multiple classifiers with a control. [sent-15492, score-0.433]
86 There is, however, no gold standard for making such comparisons, and the tests performed often have dubious statistical foundations and lead to unwarranted and unverified conclusions. [sent-15498, score-0.451]
87 The problems with the multiple data set tests are quite different, even in a sense complementary: the measurements from different data sets are usually incommensurate, and the normality of their distributions and the homogeneity of variance is questionable at best. [sent-15501, score-0.604]
88 Based on the well known statistical properties of the tests and our knowledge of the machine learning data, we concluded that the non-parametric tests should be preferred over the parametric ones. [sent-15504, score-0.706]
89 We varied the differences between the classifiers by biasing the selection of data sets, and measured the likelihood of rejection of the null-hypothesis and the replicability of the test. [sent-15506, score-0.41]
90 We have indeed found that the non-parametric tests are more likely to reject the null-hypothesis, which hints at the presence of outliers or violations of assumptions of the parametric tests and confirms our theoretical misgivings about them. [sent-15507, score-0.831]
91 The empirical analysis also shows that replicability of the tests might be a problem, thus the actual experiments should be conducted on as many data sets as possible. [sent-15508, score-0.581]
92 They are safer than parametric tests since they do not assume normal distributions or homogeneity of variance. [sent-15515, score-0.393]
93 Empirical results suggest that they are also stronger than the other tests studied. [sent-15517, score-0.311]
94 There is an alternative opinion among statisticians that significance tests should not be performed at all since they are often misused, either due to misinterpretation or by putting too much stress on their results (Cohen, 1994; Schmidt, 1996; Harlow and Mulaik, 1997). [sent-15521, score-0.311]
95 Our stance is that statistical tests provide certain reassurance about the validity and non-randomness of the published results. [sent-15522, score-0.343]
96 On the other hand, statistical tests should not be the deciding factor for or against publishing the work. [sent-15524, score-0.343]
97 Evaluating the replicability of significance tests for comparing learning algorithms. [sent-15577, score-0.603]
98 Approximate statistical tests for comparing supervised classification learning algorithms. [sent-15607, score-0.413]
99 A comparison of alternative tests of significance for the problem of m rankings. [sent-15639, score-0.342]
100 A sharper Bonferroni procedure for multiple tests of significance. [sent-15655, score-0.37]