jmlr jmlr2013 jmlr2013-11 jmlr2013-11-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness
NOVEL TIE∗ • Semi-Interleaved HITON-PC (without symmetry correction) with α = 0.05 and max-k = 3 Extended from was used for identification of Markov boundaries. Procedure IGS-Lex was used for generatStatnikov and ing data sets from the embedded distributions. Criteria Independence and Predictivity were Aliferis (2010a) used for verifying Markov boundaries in simulated and real data, respectively. See Appendix D for details. iTIE∗ • α = 0.05, max-k = 3. Novel method STOCHASTIC MARKOV BOUNDARY DISCOVERY • ♯ of runs = 5, 000, α = 0.05, K = 0.7 KIAMB • ♯ of runs = 5, 000, α = 0.05, K = 0.8 Pe˜ a et al. (2007) n • ♯ of runs = 5, 000, α = 0.05, K = 0.9 • l = 7, K = 10 • l = 5, 000, K = 10 EGS-CMIM • l = 7, K = 50 • l = 5, 000, K = 50 • l = 7, δ = 0.015 • l = 5000, δ = 0.015 Liu et al. (2010b) EGS-NCMIGS • l = 7, K = 10 • l = 5000, K = 10 • l = 7, K = 50 • l = 5000, K = 50 VARIABLE GROUPING-BASED MARKOV BOUNDARY DISCOVERY • ♯ of Markov boundaries = 30,t = 15 • ♯ of Markov boundaries = 5000,t = 15 EGSG Liu et al. (2010) • ♯ of Markov boundaries = 30,t = 10 • ♯ of Markov boundaries = 5000,t = 10 • ♯ of Markov boundaries = 30,t = 5 • ♯ of Markov boundaries = 5000,t = 5 RESAMPLING-BASED VARIABLE SELECTION • w/o statistical comparison of classification performance estimates Ein-Dor et al. Resampling+ (2005); Michiels • with statistical comparison at significance level = 0.05 RFE et al. (2005); All configurations used 5,000 bootstrap samples and a reduction coefficient of 1.2. Statistical Roepman et al. comparison of classification performance estimates was performed using permutation-based (2006); Statnikov testing (with 10,000 permutations) for weighted accuracy (Good, 2000) and DeLong’s test and Aliferis (DeLong et al., 1988) for AUC. (2010a) • w/o statistical comparison of classification performance estimates Resampling+ • with statistical comparison at significance level α = 0.05 UAF All configurations used 5,000 bootstrap samples and a reduction coefficient of 1.2. The same tests as in Resampling+RFE were used for statistical comparisons. ITERATIVE REMOVAL FOR VARIABLE SELECTION AND MARKOV BOUNDARY DISCOVERY • max-k = 3, α = 0.05 IR-HITON-PC Natsoulis et al. This method runs Semi-Interleaved HITON-PC without symmetry correction. The same tests (2005); Statnikov as in Resampling+RFE were used for statistical comparisons. and Aliferis • w/o statistical comparison of classification performance estimates (2010a) IR-SPLR • with statistical comparison at significance level α = 0.05 The regularization coefficient λ for each SPLR model was determined by holdout validation in training data. The same tests as in Resampling+RFE were used for statistical comparisons. Table 9: Parameterizations of methods for discovery of multiple Markov boundaries and variable sets. Parameter settings that have been recommended by the authors of prior methods are underlined. 552 Method TIE* iTIE* EGS-NCMIGS EGS-CMIM 553 EGSG Resampling+RFE Resampling+UAF IR-HITON-PC IR-SPLR Table 10: Results obtained in simulated data set T IED. “MB” stands for “Markov boundary”, and “VS” stands for “variable set”. The 95% interval for weighted accuracy denotes the range in which weighted accuracies of 95% of the extracted Markov boundaries/variable sets fell. Classification performance of the MAP-BN classifier in the same data sample was 0.966 weighted accuracy. Highlighted in bold are results that are statistically comparable to the MAP-BN classification performance. D ISCOVERY OF M ULTIPLE M ARKOV B OUNDARIES KIAMB max-k = 3, D = 0.05 max-k = 3, D = 0.05 Number of runs = 5000, D = 0.05, K = 0.7 Number of runs = 5000, D = 0.05, K = 0.8 Number of runs = 5000, D = 0.05, K = 0.9 l = 7, G = 0.015 l = 7, K = 10 l = 7, K = 50 l = 5000, G = 0.015 l = 5000, K = 10 l = 5000, K = 50 l = 7, K = 10 l = 7, K = 50 l = 5000, K = 10 l = 5000, K = 50 Number of Markov boundaries = 30, t = 5 Number of Markov boundaries = 30, t = 10 Number of Markov boundaries = 30, t = 15 Number of Markov boundaries = 5,000, t = 5 Number of Markov boundaries = 5,000, t = 10 Number of Markov boundaries = 5,000, t = 15 without statistical comparison with statistical comparison (D = 0.05) without statistical comparison with statistical comparison (D = 0.05) max-k = 3, D = 0.05 without statistical comparison with statistical comparison (D = 0.05) I. II. III. IV. V. VI. Number Average size Number Average Average Weighted accuracy over of distinct of extracted of true proportion false all extracted MBs or VSs MBs or distinct MBs MBs of false negative rate VSs or VSs identified positives Average 95% Interval exactly 0.000 0.000 72 5.0 72 0.938 0.965 0.951 0.000 0.000 72 5.0 72 0.938 0.965 0.951 0.000 0.400 377 2.8 0 0.727 0.479 0.946 0.000 0.400 377 2.8 0 0.727 0.479 0.946 0.000 0.400 377 2.8 0 0.727 0.479 0.946 0.286 0.000 6 7.0 0 0.963 0.965 0.964 6 10.0 0 0.500 0.000 0.963 0.965 0.964 6 21.0 0 0.762 0.000 0.941 0.937 0.943 0.469 0.267 24 7.3 0 0.843 0.967 0.954 20 10.0 0 0.610 0.220 0.954 0.970 0.964 9 21.0 0 0.762 0.000 0.944 0.937 0.954 6 10.0 0 0.500 0.000 0.963 0.965 0.963 6 21.0 0 0.762 0.000 0.939 0.937 0.942 20 10.0 0 0.595 0.190 0.951 0.969 0.963 9 21.0 0 0.762 0.000 0.943 0.937 0.954 30 7.0 0 0.476 0.267 0.840 0.605 0.968 30 7.0 0 0.548 0.367 0.722 0.379 0.962 30 7.0 0 0.548 0.367 0.722 0.379 0.962 1,997 7.0 0 0.286 0.000 0.863 0.620 0.965 3,027 7.0 0 0.286 0.000 0.774 0.500 0.965 3,027 7.0 0 0.286 0.000 0.774 0.500 0.965 1,374 14.9 1 0.397 0.058 0.932 0.979 0.955 0.171 0.378 188 4.9 0 0.930 0.917 0.967 184 20.8 0 0.752 0.000 0.934 0.966 0.953 0.592 0.347 19 8.4 0 0.930 0.917 0.938 0.083 0.200 3 4.3 1 0.946 0.936 0.965 1 26.0 0 0.808 0.000 0.958 0.958 0.958 0.706 0.000 1 17.0 0 0.959 0.959 0.959 Method TIE* iTIE* KIAMB EGS-CMIM 554 EGSG Resampling+RFE Resampling+UAF IR-HITON-PC IR-SPLR Table 11: Results obtained in simulated data set T IED1000. “MB” stands for “Markov boundary”, and “VS” stands for “variable set”. The 95% interval for weighted accuracy denotes the range in which weighted accuracies of 95% of the extracted Markov boundaries/variable sets fell. Classification performance of the MAP-BN classifier in the same data sample was 0.972 weighted accuracy. Highlighted in bold are results that are statistically comparable to the MAP-BN classification performance. S TATNIKOV, LYTKIN , L EMEIRE AND A LIFERIS EGS-NCMIGS max-k = 3, D = 0.05 max-k = 3, D = 0.05 Number of runs = 5000, D = 0.05, K = 0.7 Number of runs = 5000, D = 0.05, K = 0.8 Number of runs = 5000, D = 0.05, K = 0.9 l = 7, G = 0.015 l = 7, K = 10 l = 7, K = 50 l = 5000, G = 0.015 l = 5000, K = 10 l = 5000, K = 50 l = 7, K = 10 l = 7, K = 50 l = 5000, K = 10 l = 5000, K = 50 Number of Markov boundaries = 30, t = 5 Number of Markov boundaries = 30, t = 10 Number of Markov boundaries = 30, t = 15 Number of Markov boundaries = 5,000, t = 5 Number of Markov boundaries = 5,000, t = 10 Number of Markov boundaries = 5,000, t = 15 without statistical comparison with statistical comparison (D = 0.05) without statistical comparison with statistical comparison (D = 0.05) max-k = 3, D = 0.05 without statistical comparison with statistical comparison (D = 0.05) I. II. III. IV. V. VI. Number of Average Number Average Average false Weighted accuracy over distinct size of of true proportion negative rate all extracted MBs or VSs MBs or extracted MBs of false VSs distinct identified positives Average 95% Interval MBs or VSs exactly 0.000 0.000 72 5.0 72 0.952 0.960 0.957 0.000 0.000 72 5.0 72 0.952 0.960 0.957 0.000 0.400 349 2.8 0 0.722 0.450 0.959 0.000 0.400 349 2.8 0 0.722 0.450 0.959 0.000 0.400 349 2.8 0 0.722 0.450 0.959 0.286 0.000 6 7.0 0 0.953 0.952 0.956 6 10.0 0 0.500 0.000 0.967 0.969 0.968 6 50.0 0 0.900 0.000 0.877 0.866 0.887 0.648 0.508 995 8.0 0 0.950 0.968 0.960 990 10.0 0 0.747 0.494 0.952 0.968 0.961 950 50.0 0 0.949 0.494 0.868 0.857 0.882 6 10.0 0 0.500 0.000 0.965 0.968 0.967 6 50.0 0 0.900 0.000 0.904 0.895 0.915 990 10.0 0 0.676 0.353 0.953 0.967 0.961 950 50.0 0 0.935 0.353 0.897 0.885 0.910 30 67.0 0 0.958 0.440 0.688 0.383 0.850 30 67.0 0 0.977 0.693 0.485 0.253 0.769 30 67.0 0 0.984 0.780 0.422 0.246 0.739 5,000 67.0 0 0.927 0.028 0.662 0.441 0.850 5,000 67.0 0 0.944 0.250 0.476 0.248 0.780 5,000 67.0 0 0.953 0.369 0.406 0.247 0.710 2,492 16.7 2 0.434 0.039 0.951 0.931 0.968 0.225 0.336 214 6.0 0 0.947 0.917 0.964 1,207 28.7 0 0.721 0.000 0.952 0.935 0.964 0.577 0.367 12 7.8 0 0.949 0.931 0.959 0.100 0.100 2 5.0 1 0.958 0.959 0.958 1 30.0 0 0.833 0.000 0.949 0.949 0.949 0.833 0.000 1 30.0 0 0.949 0.949 0.949 D ISCOVERY OF M ULTIPLE M ARKOV B OUNDARIES 1 Pareto frontier ATIE TIE* KIAMB EGS-NCMIGS EGS-CMIM EGSG Resampling+RFE Resampling+UAF IR-HITON-PC IR-SPLR Average false negative rate 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Average proportion of false positives 0.9 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Average proportion of false positives 0.9 1 Average false negative rate 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Figure 19: Results for average false negative rate and average proportion of false positives obtained in T IED (left) and T IED1000 (right) data sets. Results of TIE∗ and iTIE∗ were identical. 555 Domain ♯ of samples ♯ of variables Response type Data type Infant Mortality clinical 5,337 86 Discrete Ohsumed Text 5,000 14,373 Continuous Holdout ACPJ Etiology Text 15,779 28,228 Continuous Holdout Lymphoma Gene Expression 227 7,399 Continuous 10-fold Gisette Digit recognition 7,000 5,000 Death within the first year Relevant to neonatal diseases Relevant to etiology 3-year survival:dead vs. alive 4 vs. 9 CV design Holdout Continuous Holdout Dexter Text 600 19,999 Continuous 10-fold Sylva Ecology 14,394 216 Continuous Holdout Ovarian Cancer Proteomics 216 2,190 Continuous 10-fold Thrombin Drug discovery 2,543 139,351 Discrete Holdout KDD Cup 2001 Breast Cancer Hiva Gene Expression Drug discovery 286 4,229 17,816 1,617 Relevant to corporate acquisitions Ponderosa vs. rest Cancer vs. normal Binding to thrombin ER+vs. ERActivity to HIV AIDS infection NIPS 2003 Feature Selection Challenge Guyon et al. (2006) NIPS 2003 Feature Selection Challenge Guyon et al. (2006) WCCI 2006 Perf. Prediction Challenge Conrads et al. (2004) Continuous Discrete 10-fold Holdout Nova Text 1,929 16,969 Discrete Holdout Bankruptcy Financial 7,063 147 Political topics vs. religious Personal bankruptcy Continuous Holdout Wang et al. (2005) WCCI 2006 Perf.Prediction Challenge WCCI 2006 Perf. Prediction Chanllenge Foster and Stine (2004) Table 12: Real data sets used in the experiments. Mani and Cooper (1999) Joachims (2002) Aphinyanaphongs et al. (2006) Rosenwald et al. (2002) S TATNIKOV, LYTKIN , L EMEIRE AND A LIFERIS 556 Name D ISCOVERY OF M ULTIPLE M ARKOV B OUNDARIES E.5 On Computation of Performance Criteria in Experiments with Real Data In order to rank all methods based on a given performance criterion, the average value of this criterion was first computed over all evaluation data sets for each method. The methods were then ordered from best to worst performing according to these averages. The best performer was assigned rank 1 and designated as the current “reference method”. Performance of the next unranked method in the ordered list was compared to performance of the reference method using permutation-based testing at significance level 5% and with 10,000 permutations of the vectors of criterion values computed on each data set. If performance of the two methods was found to be statistically comparable, the unranked method received the same rank as the reference method. Otherwise, the next lowest rank was assigned to the unranked method and this method was designated as the new reference method. This process was repeated until each method was assigned a rank. E.6 Additional Discussion of the Results of Experiments with Real Data KIAMB produced some of the more compact Markov boundaries with the average PV of 1% and ranked second out of 11 by that criterion. Small sizes of the extracted Markov boundaries were, to a large extent, due to KIAMB’s sample inefficiency resulting in inability to perform some of the required tests of independence as discussed in Appendix C. As a result, classification performance of Markov boundaries extracted by KIAMB was lower than of most other methods with KIAMB ranking 4 out of 5 by AUC. Consequently, KIAMB ranked 5 out of 15 on the (PV, AUC) criterion. Although, KIAMB was parameterized to produce 5,000 Markov boundaries, only about 30% of them were distinct, which means that 70% of computational time was spent on repeated retrieval of the same Markov boundaries. EGS-NCMIGS with the alternative stopping criterion produced the smallest Markov boundaries at the expense of a significant reduction in AUC (∼ 9% below TIE∗ ). Parameterizations of EGSNCMIGS with the alternative stopping criterion ranked first and second out of 11 by PV and fourth out of 5 by AUC. Overall, performance of EGS-NCMIGS and EGS-CMIM varied widely depending on parameterization. Ranks of these methods ranged from 2 to 11 out of 15 on the (PV, AUC) criterion. EGSG showed an overall poor performance, ranking between 10 and 15 out of 15 on (PV, AUC). Markov boundaries extracted by EGSG were larger than Markov boundaries identified by many other methods and had the lowest average classification performance. Resampling+RFE and Resampling+UAF extracted variable sets that were the largest in comparison with other methods, but that also had highest classification performance. Resampling+RFE and Resampling+UAF ranked between 9 and 11 out of 11 by PV and between 1 and 3 by AUC. Notably, variable sets extracted by Resampling+UAF had an average PV between 24% and 41%, depending on parameterization. Resampling+RFE extracted more compact variable sets than Resampling+UAF in every data set, with the average PV between 5% and 17%. Due to poor performance on the PV criterion, Resampling+RFE and Resampling+UAF ranked in the mid to poor range on the combined (PV, AUC) criterion, scoring between 7 and 11 out of 15. Iterative removal methods IR-HITON-PC and IR-SPLR extracted small numbers of Markov boundaries/variable sets and ranked between 5 and 6 out of 6 by that criterion. IR-HITON-PC produced more compact Markov boundaries than the variable sets of IR-SPLR. Markov boundaries extracted by IR-HITON-PC had an average PV of 2.3%, which was significantly smaller than the 11%-15% average PV of IR-SPLR. IR-HITON-PC method ranked 5 out of 11 by PV while IR-SPLR 557 S TATNIKOV, LYTKIN , L EMEIRE AND A LIFERIS methods ranked 9 and 10 by the same criterion. Despite the smaller average size of the extracted Markov boundaries, IR-HITON-PC ranked on par with IR-SPLR (parameterized with statistical comparison) by AUC, scoring third out of 5. Among all parameterizations of iterative removal methods, IR-SPLR without statistical comparison produced the largest variable sets, which helped this method reach a higher average classification performance and rank second out of 5 by AUC. Higher average PV of variable sets extracted by IR-SPLR caused these methods to rank 9 and 11 out of 15 on the combined (PV, AUC) criterion. IR-HITON-PC ranked sixth on the same criterion as a result of moderate ranks on PV and AUC. 558 559 Ohsumed N S AUC 1 14,373 0.857 2,497 37 0.776 250 7 0.651 133 7 0.650 58 7 0.648 6 4 0.584 1 10 0.691 1 50 0.828 4,999 4 0.564 4,991 10 0.693 4,951 50 0.830 1 10 0.696 1 50 0.843 4,991 10 0.687 4,951 50 0.841 30 70 0.653 30 70 0.634 30 70 0.602 5,000 70 0.649 5,000 70 0.624 5,000 70 0.606 4,942 3,889 0.846 5,000 914 0.836 2,533 10,722 0.855 4,925 7,690 0.864 2 40 0.778 1 176 0.829 3 122 0.728 ACPJ_ Etiology N S AUC 1 28,228 0.938 5,330 18 0.908 1,354 9 0.884 830 9 0.883 414 9 0.884 6 3 0.743 3 10 0.780 3 35 0.842 4,999 4 0.770 4,991 10 0.785 4,981 31 0.843 2 10 0.915 1 32 0.917 4,991 10 0.842 4,982 31 0.857 30 84 0.840 30 84 0.835 30 84 0.792 5,000 84 0.837 5,000 84 0.822 5,000 84 0.780 5,000 2,441 0.924 5,000 308 0.864 4,963 3,883 0.929 5,000 1,600 0.918 4 22 0.875 4 123 0.885 5 26 0.844 Lymphoma N S AUC 1 7,399 0.659 4,533 16 0.635 88 3 0.562 50 3 0.561 23 3 0.561 7 3 0.591 5 10 0.615 3 50 0.662 4,992 3 0.574 4,981 10 0.600 4,947 50 0.653 6 10 0.577 4 50 0.608 4,970 10 0.581 4,942 50 0.613 30 58 0.600 30 58 0.616 30 58 0.607 5,000 58 0.604 5,000 58 0.617 5,000 58 0.609 4,919 1,293 0.634 4,962 45 0.587 4,215 2,546 0.647 4,895 195 0.600 12 10 0.593 16 456 0.577 139 47 0.572 Gisette N S AUC 1 5,000 0.997 227 54 0.990 5,000 8 0.871 5,000 8 0.871 5,000 8 0.871 7 3 0.913 7 10 0.952 5 50 0.986 4,999 5 0.920 4,994 10 0.953 4,957 50 0.987 7 10 0.956 5 50 0.987 4,992 10 0.963 4,957 50 0.987 30 35 0.959 30 35 0.946 30 35 0.936 5,000 35 0.961 5,000 35 0.950 5,000 35 0.941 4,948 697 0.997 5,000 134 0.995 5,000 1,673 0.999 5,000 1,088 0.998 3 64 0.990 1 466 0.996 1 261 0.996 (Continued on the next page) Table 13: Results showing the number of distinct Markov boundaries or variable sets (N) extracted by each method, their average size in terms of the number of variables (S) and average classification performance (AUC) in each of 13 real data sets. The row labeled “All variables” shows performance of the entire set of variables available in each data set. D ISCOVERY OF M ULTIPLE M ARKOV B OUNDARIES Infant_ Mortality N S AUC All variables 1 86 0.821 TIE* 41 4 0.825 max-k = 3, D = 0.05 67 4 0.753 Number of runs = 5000, D = 0.05, K = 0.7 KIAMB 39 4 0.752 Number of runs = 5000, D = 0.05, K = 0.8 17 4 0.752 Number of runs = 5000, D = 0.05, K = 0.9 6 4 0.809 l = 7, G = 0.015 l = 7, K = 10 3 10 0.874 l = 7, K = 50 1 50 0.821 EGS-NCMIGS 84 4 0.806 l = 5000, G = 0.015 l = 5000, K = 10 77 10 0.862 l = 5000, K = 50 39 50 0.822 l = 7, K = 10 2 10 0.865 l = 7, K = 50 1 50 0.829 EGS-CMIM l = 5000, K = 10 77 10 0.863 l = 5000, K = 50 38 50 0.827 Number of Markov boundaries = 30, t = 5 30 12 0.634 Number of Markov boundaries = 30, t = 10 30 12 0.568 Number of Markov boundaries = 30, t = 15 30 12 0.552 EGSG Number of Markov boundaries = 5,000, t = 5 991 12 0.631 Number of Markov boundaries = 5,000, t = 10 3,576 12 0.587 Number of Markov boundaries = 5,000, t = 15 4,272 12 0.556 without statistical comparison 4,230 17 0.825 Resampling+RFE 3,222 9 0.814 with statistical comparison (D = 0.05) without statistical comparison 4,868 26 0.859 Resampling+UAF 3,141 15 0.777 with statistical comparison (D = 0.05) IR-HITON-PC 1 5 0.857 max-k = 3, D = 0.05 without statistical comparison 1 8 0.835 IR-SPLR 1 2 0.828 with statistical comparison (D = 0.05) Method (Continued from the previous page) Method All variables TIE* Dexter S 19,999 17 5 5 5 4 10 50 5 10 50 10 50 10 50 76 76 76 76 76 76 2,097 96 15,491 14,064 20 425 149 AUC 0.979 0.959 0.882 0.884 0.887 0.839 0.927 0.971 0.840 0.927 0.970 0.942 0.979 0.943 0.979 0.857 0.791 0.749 0.854 0.787 0.746 0.976 0.956 0.976 0.972 0.958 0.974 0.940 Sylva Ovarian_ Cancer N S AUC N S AUC 1 216 0.998 1 2,190 0.998 1,483 27 0.996 223 7 0.973 4,429 8 0.949 285 4 0.925 4,384 8 0.948 180 4 0.927 4,385 8 0.947 106 4 0.928 4 5 0.960 6 5 0.951 2 10 0.988 6 10 0.974 1 50 0.998 3 50 0.986 213 5 0.954 2,188 6 0.956 207 10 0.987 2,183 10 0.971 167 50 0.998 2,144 50 0.988 2 10 0.991 5 10 0.976 1 50 0.997 3 50 0.991 207 10 0.992 2,182 10 0.973 167 50 0.998 2,144 50 0.992 30 12 0.803 30 12 0.953 30 12 0.810 30 12 0.940 30 12 0.744 30 12 0.930 4,997 12 0.792 4,878 12 0.951 5,000 12 0.803 4,990 12 0.936 5,000 12 0.752 4,996 12 0.927 4,976 19 0.998 4,951 142 0.983 3,549 12 0.998 2,601 5 0.926 3,944 44 0.998 4,372 424 0.980 2,842 23 0.998 972 13 0.955 1 24 0.997 2 7 0.962 1 36 0.999 2 709 0.986 1 36 0.999 11 115 0.943 (Continued on the next page) Table 14: Thrombin N S AUC 1 139,351 0.927 298 11 0.813 4,936 6 0.771 4,900 6 0.774 4,854 6 0.774 7 3 0.781 7 10 0.854 7 12 0.760 4,999 4 0.779 4,996 10 0.858 4,997 14 0.764 7 10 0.799 7 12 0.711 4,999 10 0.856 5,000 14 0.720 30 29 0.776 30 29 0.817 30 29 0.757 5,000 29 0.758 5,000 29 0.815 5,000 29 0.749 5,000 14,996 0.912 5,000 216 0.861 4,998 74,521 0.933 5,000 25,217 0.916 1 13 0.844 4 245 0.876 28 144 0.749 S TATNIKOV, LYTKIN , L EMEIRE AND A LIFERIS 560 max-k = 3, D = 0.05 Number of runs = 5000, D = 0.05, K = 0.7 KIAMB Number of runs = 5000, D = 0.05, K = 0.8 Number of runs = 5000, D = 0.05, K = 0.9 l = 7, G = 0.015 l = 7, K = 10 l = 7, K = 50 EGS-NCMIGS l = 5000, G = 0.015 l = 5000, K = 10 l = 5000, K = 50 l = 7, K = 10 l = 7, K = 50 EGS-CMIM l = 5000, K = 10 l = 5000, K = 50 Number of Markov boundaries = 30, t = 5 Number of Markov boundaries = 30, t = 10 Number of Markov boundaries = 30, t = 15 EGSG Number of Markov boundaries = 5,000, t = 5 Number of Markov boundaries = 5,000, t = 10 Number of Markov boundaries = 5,000, t = 15 without statistical comparison Resampling+RFE with statistical comparison (D = 0.05) without statistical comparison Resampling+UAF with statistical comparison (D = 0.05) IR-HITON-PC max-k = 3, D = 0.05 without statistical comparison IR-SPLR with statistical comparison (D = 0.05) N 1 4,791 299 193 120 6 5 4 4,998 4,991 4,951 5 3 4,991 4,951 30 30 30 5,000 5,000 5,000 5,000 4,998 3,561 4,992 1 1 3 (Continued from the previous two pages) 561 Table 15: Hiva N S AUC 1 1,617 0.716 246 8 0.712 876 7 0.735 439 7 0.754 172 7 0.755 7 3 0.661 7 10 0.750 7 50 0.668 1,616 4 0.696 1,610 10 0.760 1,570 50 0.658 5 10 0.713 5 50 0.727 1,608 10 0.724 1,577 50 0.729 30 17 0.705 30 17 0.660 30 17 0.633 5,000 17 0.701 5,000 17 0.652 5,000 17 0.638 4,938 220 0.679 4,598 13 0.646 4,587 309 0.685 4,250 36 0.663 23 6 0.673 10 129 0.661 7 22 0.694 N 1 3,751 130 57 23 5 2 1 4,998 4,991 4,951 3 1 4,992 4,951 30 30 30 5,000 5,000 5,000 4,948 5,000 645 3,185 2 1 1 Nova S 16,969 41 6 6 6 4 10 50 5 10 50 10 50 10 50 89 89 89 89 89 89 5,305 1,261 13,950 12,503 49 10,289 10,289 AUC 0.981 0.922 0.759 0.764 0.771 0.730 0.815 0.849 0.735 0.780 0.846 0.818 0.886 0.780 0.897 0.751 0.722 0.687 0.751 0.720 0.682 0.982 0.966 0.981 0.978 0.920 0.981 0.981 Bankruptcy N S AUC 1 147 0.940 1,478 14 0.923 3,810 6 0.839 3,713 5 0.836 3,681 5 0.838 7 4 0.698 6 10 0.936 3 50 0.953 145 4 0.721 139 10 0.936 101 50 0.953 3 10 0.913 1 50 0.954 139 10 0.901 99 50 0.954 30 9 0.815 30 9 0.754 30 9 0.752 4,373 9 0.819 4,856 9 0.786 4,905 9 0.787 4,949 66 0.940 4,972 38 0.946 4,379 79 0.952 628 47 0.946 3 15 0.910 1 69 0.956 1 69 0.956 D ISCOVERY OF M ULTIPLE M ARKOV B OUNDARIES Breast_ Cancer N S AUC All variables 1 17,816 0.914 TIE* 1,011 10 0.906 max-k = 3, D = 0.05 418 4 0.873 Number of runs = 5000, D = 0.05, K = 0.7 KIAMB 255 4 0.874 Number of runs = 5000, D = 0.05, K = 0.8 136 4 0.879 Number of runs = 5000, D = 0.05, K = 0.9 6 4 0.922 l = 7, G = 0.015 l = 7, K = 10 6 10 0.926 l = 7, K = 50 6 50 0.928 EGS-NCMIGS 4,994 4 0.911 l = 5000, G = 0.015 l = 5000, K = 10 4,985 10 0.926 l = 5000, K = 50 4,973 50 0.927 l = 7, K = 10 7 10 0.914 l = 7, K = 50 6 50 0.902 EGS-CMIM l = 5000, K = 10 4,978 10 0.906 l = 5000, K = 50 4,966 50 0.907 Number of Markov boundaries = 30, t = 5 30 205 0.893 Number of Markov boundaries = 30, t = 10 30 205 0.886 Number of Markov boundaries = 30, t = 15 30 205 0.890 EGSG Number of Markov boundaries = 5,000, t = 5 5,000 205 0.892 Number of Markov boundaries = 5,000, t = 10 5,000 205 0.888 Number of Markov boundaries = 5,000, t = 15 5,000 205 0.889 without statistical comparison 4,848 1,067 0.901 Resampling+RFE 2,922 10 0.894 with statistical comparison (D = 0.05) without statistical comparison 4,365 3,359 0.905 Resampling+UAF 1,295 42 0.917 with statistical comparison (D = 0.05) IR-HITON-PC 12 9 0.890 max-k = 3, D = 0.05 without statistical comparison 47 159 0.892 IR-SPLR 54 27 0.880 with statistical comparison (D = 0.05) Method S TATNIKOV, LYTKIN , L EMEIRE AND A LIFERIS C. F. Aliferis, I. Tsamardinos, and A. Statnikov. Large-scale feature selection using markov blanket induction for the prediction of protein-drug binding. Technical Report DSL 02-06, 2002. C. F. Aliferis, I. Tsamardinos, and A. Statnikov. Hiton: a novel markov blanket algorithm for optimal variable selection. AMIA 2003 Annual Symposium Proceedings, pages 21–25, 2003a. C. F. Aliferis, I. Tsamardinos, A. Statnikov, and L. E. Brown. Causal explorer: a causal probabilistic network learning toolkit for biomedical discovery. Proceedings of the 2003 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS), 2003b. C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos. Local causal and markov blanket induction for causal discovery and feature selection for classification. part i: Algorithms and empirical evaluation. Journal of Machine Learning Research, 11:171–234, 2010a. C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos. Local causal and markov blanket induction for causal discovery and feature selection for classification. part ii: Analysis and extensions. Journal of Machine Learning Research, 11:235–284, 2010b. T. W. Anderson. An Introduction to Multivariate Statistical Analysis, volume 3rd of Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, N.J, 2003. Y. Aphinyanaphongs, A. Statnikov, and C. F. Aliferis. A comparison of citation metrics to machine learning filters for the identification of high quality medline documents. J.Am.Med.Inform.Assoc., 13(4):446–455, 2006. L. Breiman. Statistical modeling: the two cultures. Statistical Science, 16(3):199–215, 2001. L. E. Brown, I. Tsamardinos, and D. Hardin. To feature space and back: Identifying top-weighted features in polynomial support vector machine models. Intelligent Data Analysis, 16(4), 2012. T. P. Conrads, V. A. Fusaro, S. Ross, D. Johann, V. Rajapakse, B. A. Hitt, S. M. Steinberg, E. C. Kohn, D. A. Fishman, G. Whitely, J. C. Barrett, L. A. Liotta, III Petricoin, E. F., and T. D. Veenstra. High-resolution serum proteomic features for ovarian cancer detection. Endocr.Relat Cancer, 11(2):163–178, 2004. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley New York, 1991. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other KernelBased Learning Methods. Cambridge University Press, Cambridge, 2000. E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3): 837–845, 1988. E. Dougherty and M. Brun. On the number of close-to-optimal feature sets. Cancer Informatics, 2: 189–196, 2006. 562 D ISCOVERY OF M ULTIPLE M ARKOV B OUNDARIES L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171–178, 2005. L. Ein-Dor, O. Zuk, and E. Domany. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc.Natl.Acad.Sci.U.S.A, 103(15):5923–5928, 2006. R. E. Fan, P. H. Chen, and C. J. Lin. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6(1889):1918, 2005. T. Fawcett. Roc graphs: Notes and practical considerations for researchers. Technical Report, HPL-2003-4, HP Laboratories, 2003. F. Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5:1531–1555, 2004. D. P. Foster and R. A. Stine. Variable selecion in data mining: Building a predictive model for bankruptcy. Journal of the American Statistical Association, 99(466):303–314, 2004. C.N. Glymour and G.F. Copper. Computation, Causation and Discovery. AAAI Press, Menlo Park, Calif, 1991. P. I. Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, volume 2nd of Springer series in statistics. Springer, New York, 2000. A. Gopnik and L. Schulz. Causal Learning: Psychology, Philosophy, and Computation. Oxford University Press, Oxford, 2007. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(1):1157–1182, 2003. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1):389–422, 2002. I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh. Feature Extraction: Foundations and Applications. Studies in fuzziness and soft computing. Springer-Verlag, Berlin, 2006. B. Hammer and K. Gersmann. A note on the universal approximation capability of support vector machines. Neural Processing Letters, 17(1):43–53, 2003. M. Hollander and D. Wolfe. Nonparametric statistical methods, volume 2nd of Wiley Series in Probability and Statistics. Wiley, New York, NY, USA, 1999. T. Joachims. Learning to Classify Text Using Support Vector Machines. Kluwer international series in engineering and computer science. Kluwer Academic Publishers, Boston, 2002. R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2): 273–324, 1997. J. Lemeire. Learning Causal Models of Multivariate Systems and the Value of It for the Performance Modeling of Computer Programs. PhD thesis, 2007. 563 S TATNIKOV, LYTKIN , L EMEIRE AND A LIFERIS J. Lemeire, S. Meganck, and F. Cartella. Robust independence-based causal structure learning in absence of adjacency faithfulness. Proccedings of the Fifth European Workshop on Probabilistic Graphical Models (PGM 2010), 2010. H. Liu, L. Liu, and H. Zhang. Ensemble gene selection by grouping for microarray data classification. J.Biomed.Inform., 43(1):81–87, 2010. H. Liu, L. Liu, and H. Zhang. Ensemble gene selection for cancer classification. Pattern Recognition, 43(8):2763–2772, 2010b. S. Mani and G. F. Cooper. A study in causal discovery from population-based infant birth and death records. Proceedings of the AMIA Annual Fall Symposium, 319, 1999. S. Mani and G. F. Cooper. Causal discovery using a bayesian local causal discovery algorithm. Medinfo 2004, 11(Pt 1):731–735, 2004. S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365(9458):488–492, 2005. G. Natsoulis, Ghaoui L. El, G. R. Lanckriet, A. M. Tolley, F. Leroy, S. Dunlea, B. P. Eynon, C. I. Pearson, S. Tugendreich, and K. Jarnagin. Classification of a large microarray data set: algorithm comparison and analysis of drug signatures. Genome Res., 15(5):724–736, 2005. R. E. Neapolitan. Learning Bayesian Networks. Prentice Hall series in artificial intelligence. Pearson Prentice Hall, Upper Saddle River, NJ, 2004. J. Pe˜ a, R. Nilsson, J. Bj¨ rkegren, and J. Tegn´ r. Towards scalable and data efficient learning of n o e markov boundaries. International Journal of Approximate Reasoning, 45(2):211 –232, 2007. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. The Morgan Kaufmann series in representation and reasoning. Morgan Kaufmann Publishers, San Mateo, California, 1988. J. Pearl. Causality: Models, Reasoning, and Inference, volume 2nd. Cambridge University Press, Cambridge, U.K, 2009. J. P. Pellet and A. Elisseeff. Using markov blankets for causal structure learning. The Journal of Machine Learning Research, 9:1295–1342, 2008. A. Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8:143–195, 1999. J. Ramsey, J. Zhang, and P. Spirtes. Adjacency-faithfulness and conservative causal inference. Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-2006), pages 401–408, 2006. T. Richardson and P. Spirtes. Automated Discovery of Linear Feedback Models. MIT Press, Menlo Park, CA, 1999. T. S. Richardson and P. Spirtes. Ancestral graph markov models. Annals of Statistics, 30(4):962– 1030, 2002. 564 D ISCOVERY OF M ULTIPLE M ARKOV B OUNDARIES P. Roepman, P. Kemmeren, L. F. Wessels, P. J. Slootweg, and F. C. Holstege. Multiple robust signatures for detecting lymph node metastasis in head and neck cancer. Cancer Res., 66(4): 2361–2366, 2006. A. Rosenwald, G. Wright, W. C. Chan, J. M. Connors, E. Campo, R. I. Fisher, R. D. Gascoyne, H. K. Muller-Hermelink, E. B. Smeland, J. M. Giltnane, E. M. Hurt, H. Zhao, L. Averett, L. Yang, W. H. Wilson, E. S. Jaffe, R. Simon, R. D. Klausner, J. Powell, P. L. Duffey, D. L. Longo, T. C. Greiner, D. D. Weisenburger, W. G. Sanger, B. J. Dave, J. C. Lynch, J. Vose, J. O. Armitage, E. Montserrat, A. Lopez-Guillermo, T. M. Grogan, T. P. Miller, M. LeBlanc, G. Ott, S. Kvaloy, J. Delabie, H. Holte, P. Krajci, T. Stokke, and L. M. Staudt. The use of molecular profiling to predict survival after chemotherapy for diffuse large-b-cell lymphoma. N.Engl.J Med., 346(25): 1937–1947, 2002. F. Scarselli and A. Chung Tsoi. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11(1):15–37, 1998. B. Sch¨ lkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods: Support Vector o Learning. MIT Press, Cambridge, Mass, 1999. J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004. R. L. Somorjai, B. Dolenko, and R. Baumgartner. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics, 19 (12):1484–1491, 2003. P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search, volume 2nd of Adaptive computation and machine learning. MIT Press, Cambridge, Mass, 2000. A. Statnikov and C. F. Aliferis. Analysis and computational dissection of molecular signature multiplicity. PLoS Computational Biology, 6(5):e1000790, 2010a. A. Statnikov and C. F. Aliferis. TIED: An Artificially Simulated Dataset with Multiple Markov Boundaries. Journal of Machine Learning Research Workshop and Conference Proceedings, Volume 6: Causality: Objectives and Assessment (NIPS 2008), 6:249–256, 2010b. A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631–643, 2005. A. Statnikov, I. Tsamardinos, L. E. Brown, and C. F. Aliferis. Causal Explorer: A Matlab Library of Algorithms for Causal Discovery and Variable Selection for Classification. In Challenges in Machine Learning. Volume 2: Causation and Prediction Challenge. Edited by Guyon, I. and Aliferis, C. F. and Cooper, G. F. and Elisseeff, A. and Pellet, J. P. and Spirtes, P. and Statnikov, A. Microtome Publishing, Bookline, Massachusetts, 2010. I. Tsamardinos and C. F. Aliferis. Towards principled feature selection: relevancy, filters and wrappers. Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics(AI and Stats), 2003. 565 S TATNIKOV, LYTKIN , L EMEIRE AND A LIFERIS I. Tsamardinos and L. E. Brown. Markov blanket-based variable selection in feature space. Technical Report DSL-08-01, 2008. I. Tsamardinos, C. F. Aliferis, and A. Statnikov. Algorithms for large scale markov blanket discovery. Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference (FLAIRS), pages 376–381, 2003a. I. Tsamardinos, C. F. Aliferis, and A. Statnikov. Time and sample efficient discovery of markov blankets and direct causal relations. Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining (KDD), pages 673–678, 2003b. V. N. Vapnik. Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York, 1998. Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov, M. Timmermans, M. E. Meijer-van Gelder, J. Yu, T. Jatkoe, E. M. Berns, D. Atkins, and J. A. Foekens. Geneexpression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365(9460):671–679, 2005. S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. M. Kaufmann Publishers, San Mateo, Calif, 1991. 566