jmlr jmlr2012 jmlr2012-114 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ioannis Tsamardinos, Sofia Triantafillou, Vincenzo Lagani
Abstract: We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed algorithms make predictions that are better correlated with the sample estimates of the unknown parameters on test data; this is particularly the case when the number of commonly measured variables is low. The enabling idea behind the methods is to induce one or all causal models that are simultaneously consistent with (fit) all available data sets and prior knowledge and reason with them. This allows constraints stemming from causal assumptions (e.g., Causal Markov Condition, Faithfulness) to propagate. Several methods have been developed based on this idea, for which we propose the unifying name Integrative Causal Analysis (INCA). A contrived example is presented demonstrating the theoretical potential to develop more general methods for co-analyzing heterogeneous data sets. The computational experiments with the novel methods provide evidence that causally-inspired assumptions such as Faithfulness often hold to a good degree of approximation in many real systems and could be exploited for statistical inference. Code, scripts, and data are available at www.mensxmachina.org. Keywords: integrative causal analysis, causal discovery, Bayesian networks, maximal ancestral graphs, structural equation models, causality, statistical matching, data fusion
Reference: text
sentIndex sentText sentNum sentScore
1 The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. [sent-9, score-0.258]
2 The enabling idea behind the methods is to induce one or all causal models that are simultaneously consistent with (fit) all available data sets and prior knowledge and reason with them. [sent-13, score-0.258]
3 Keywords: integrative causal analysis, causal discovery, Bayesian networks, maximal ancestral graphs, structural equation models, causality, statistical matching, data fusion 1. [sent-23, score-0.71]
4 One approach to allow the co-analysis of heterogeneous data sets in the context of prior knowledge is to try to induce one or all causal models that are simultaneously consistent with all available data sets and pieces of knowledge. [sent-31, score-0.258]
5 The use of causal models may allow additional inferences than what is possible with noncausal models. [sent-34, score-0.293]
6 Two of the most common causal assumptions in the literature are the Causal Markov Condition and the Faithfulness Condition (Spirtes et al. [sent-37, score-0.258]
7 , 2001); intuitively, these conditions assume that the observed dependencies and independencies in the data are due to the causal structure of the observed system and not due to accidental properties of the distribution parameters (Spirtes et al. [sent-38, score-0.386]
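The intuition above (observed dependencies and independencies reflect the causal structure rather than accidental parameter choices) can be sketched numerically. The chain below and all coefficients are hypothetical, not taken from the paper's data: under Faithfulness, a linear-Gaussian chain X → Y → Z entails exactly one conditional independence, X ⊥ Z | Y, while X and Z remain marginally dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical linear-Gaussian chain X -> Y -> Z.
X = rng.normal(size=n)
Y = 0.8 * X + rng.normal(size=n)
Z = 0.7 * Y + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after regressing out the linear effect of c."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

# The chain entails X _||_ Z | Y, and Faithfulness forbids any other independence.
print(abs(np.corrcoef(X, Z)[0, 1]) > 0.1)   # marginally dependent
print(abs(partial_corr(X, Z, Y)) < 0.05)    # conditionally independent given Y
```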
8 The idea of inducing causal models from several data sets has already appeared in several prior works. [sent-41, score-0.258]
9 Methods for inducing causal models from samples measured under different experimental conditions are described in Cooper and Yoo (1999), Tian and Pearl (2001), Claassen and Heskes (2010), Eberhardt (2008); Eberhardt et al. [sent-42, score-0.258]
10 In Tillman (2009) and Tsamardinos and Borboudakis (2010) approaches that induce causal models from data sets defined over semantically similar variables (e. [sent-48, score-0.326]
11 Methods for inducing causal models in the context of prior knowledge also exist (Angelopoulos and Cussens, 2008; Borboudakis et al. [sent-51, score-0.258]
12 The methods are able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. [sent-60, score-0.317]
13 The proposed algorithms make numerous predictions that range in the thousands for large data sets; the predictions are highly accurate, significantly more accurate than predictions made at random. [sent-65, score-0.381]
14 In addition, when linear causal relations and Gaussian error terms are assumed, the algorithms successfully predict the strength of the linear correlation between Y and Z. [sent-67, score-0.487]
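A minimal sketch of this kind of prediction under assumed linearity and Gaussian error terms: if an (assumed) structure Y ← X → Z holds and Y and Z are never jointly measured, the correlation ρYZ factorizes as ρYX · ρXZ, so it can be predicted from two data sets that each share only X. The variable names, coefficients, and data split below are illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical linear-Gaussian system in which X causes both Y and Z.
X = rng.normal(size=n)
Y = 0.9 * X + rng.normal(size=n, scale=0.5)
Z = -0.6 * X + rng.normal(size=n, scale=0.8)

# Pretend data set D1 measures only (X, Y) and D2 only (X, Z):
# Y and Z are never jointly observed by either "study".
r_yx = np.corrcoef(Y[:n//2], X[:n//2])[0, 1]   # estimated from D1
r_xz = np.corrcoef(X[n//2:], Z[n//2:])[0, 1]   # estimated from D2
r_yz_pred = r_yx * r_xz                        # entailed by Y <- X -> Z

# Oracle check using the full joint sample (never available in practice).
r_yz_sample = np.corrcoef(Y, Z)[0, 1]
print(round(r_yz_pred, 2), round(r_yz_sample, 2))
```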
15 First, the results provide ample statistical evidence that some of the typical assumptions employed in causal modeling hold abundantly (at least to a good level of approximation) in a wide range of domains and lead to accurate inferences. [sent-78, score-0.258]
16 To obtain the results the causal semantics are not employed per se, that is, we do not predict the effects of experiments and manipulations. [sent-79, score-0.313]
17 In other words, one could view the assumptions made by the causal models as constraints or priors on probability distributions encountered in Nature without any reference to causal semantics. [sent-80, score-0.516]
18 temporal measurements) could potentially one day enable the automated large-scale integrative analysis of a large part of available data and knowledge to construct causal models. [sent-89, score-0.34]
19 The rest of this document is organized as follows: Section 2 briefly presents background on causal modeling with Maximal Ancestral Graphs. [sent-90, score-0.258]
20 Modeling Causality with Maximal Ancestral Graphs Maximal Ancestral Graphs (MAGs) is a type of graphical model that represents causal relations among a set of measured (observed) variables O as well as probabilistic properties, such as conditional independencies (independence model). [sent-99, score-0.381]
21 The probabilistic properties of MAGs can be developed without any reference to their causal semantics; nevertheless, we also briefly discuss their causal interpretation. [sent-100, score-0.516]
22 The causal semantics of an edge A → B imply that A is probabilistically causing B, that is, an (appropriate) manipulation of A results in a change of the distribution of B. [sent-102, score-0.258]
23 A simple graphical transformation for a MAG G faithful to a distribution P with independence model J(P) exists that provides a unique MAG G[L that represents the causal ancestral relations and the independence model J(P)[L after marginalizing out variables in L. [sent-156, score-0.589]
24 Different MAGs encode different causal information, but may share the same independence models and thus are statistically indistinguishable based on these models alone. [sent-164, score-0.316]
25 If a path p is discriminating for a vertex V in both graphs, V is a collider on the path on one graph if and only if it is a collider on the path on the other. [sent-175, score-0.247]
26 For example, if X ◦−◦ Y ◦→ W, then ⟨X, Y, W⟩ is a possible ancestral path from X to W, but not a possible ancestral path from W to X. [sent-178, score-0.254]
27 This work is inspired by the following scenario: there exists an unknown causal mechanism over variables V, represented by a faithful CBN P , G . [sent-187, score-0.394]
28 Study 2 is a Randomized Controlled Trial where the levels of C for a subject are randomly decided and enforced by the experimenter, aiming at identifying a causal relation with cancer. [sent-225, score-0.258]
29 Prior knowledge provides a piece of causal knowledge but the raw data are not available. [sent-228, score-0.258]
30 • Prior Knowledge: A doctor establishes a causal relation between the use of Contraceptives (variable B) and the development of Thrombosis (variable A), that is, “B causes A” denoted as B A. [sent-239, score-0.258]
31 We use a double arrow to denote a causal relation without reference to the context of other variables. [sent-244, score-0.258]
32 This is to avoid confusion with the use of a single arrow → in most causal models (e. [sent-245, score-0.258]
33 , Causal Bayesian Networks) that denotes a direct causal relation (or inducing path, see Richardson and Spirtes 2002), where direct causality is defined in the context of the rest of the variables in the model. [sent-247, score-0.326]
34 Figure 3: (a) Assumed unknown causal structure. [sent-248, score-0.258]
35 We now show informally the reasoning for an integrative causal analysis of the above studies and prior knowledge and compare against independent analysis of the studies. [sent-259, score-0.34]
36 Figure 3(a) shows the presumed true, unknown, causal structure. [sent-260, score-0.258]
37 Figure 3(b-c) shows the causal model induced (asymptotically) by an independent analysis of the data of Study 1 and Study 2 respectively using existing algorithms, such as FCI (Spirtes et al. [sent-261, score-0.258]
38 Notice that it removes any causal link into C since the value of C only depends on the result of the randomization. [sent-264, score-0.258]
39 Figure 3(d) shows the causal model that can be inferred by co-analyzing both studies together. [sent-265, score-0.258]
40 By INCA of Study 1 and 2 it is now additionally inferred that B and C are correlated but C does not cause B: if C were causing B, we would have found the variables dependent in Study 2 (the randomization procedure would not have eliminated the causal link C → B). [sent-266, score-0.326]
41 The dashed edges denote statistical indistinguishability about the existence of the edge, that is, there exist a consistent causal model with all data and knowledge having the edge, and one without the edge. [sent-274, score-0.258]
42 Based on the input data it is possible to induce with existing causal analysis algorithms, such as FCI, the following PAGs from each data set respectively: P1 : X ◦−◦ Y ◦−◦ W and P2 : X ◦−◦ Z ◦−◦ W. [sent-285, score-0.258]
43 We next develop the theory for their causal co-analysis. [sent-289, score-0.258]
44 A PCG focuses on representing the possible causal pair-wise relations among each pair of variables X and Y in O. [sent-345, score-0.326]
45 Notice that FTR requires 10 dependencies and 2 independencies to be identified, while MTR requires 4 dependencies and 2 independencies, and TR requires 2 dependencies to be found. [sent-452, score-0.274]
46 • Measuring Performance: The ground truth for the presence of a predicted correlation is not known. [sent-473, score-0.268]
47 Even though there were few or no predictions for a couple of data sets, there are typically hundreds or thousands of predictions for each data set. [sent-715, score-0.254]
48 The accuracy of the predictions for all dependencies in the model, named Structural Accuracy because it scores all the dependencies implied by the structure of the model, is defined in a similar fashion to Acc (Definition 11) but based on p∗ instead of p: SAccR(t) = #{p∗ ≤ t, p ∈ MiR}/|MiR|. [sent-771, score-0.316]
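Read literally, accuracy definitions of this form amount to counting how many predicted dependencies have test p-values at or below the threshold t. A hypothetical helper, not the authors' implementation:

```python
import numpy as np

def acc_at_threshold(p_values, t):
    """Fraction of predicted dependencies whose test p-value is <= t.

    A direct reading of the Acc / SAcc style of definition: each prediction
    is counted as correct if the dependence is significant at level t.
    """
    p = np.asarray(p_values, dtype=float)
    return np.mean(p <= t)

# Four predicted dependencies with (made-up) p-values; three pass at t = 0.05.
print(acc_at_threshold([0.001, 0.03, 0.2, 0.04], 0.05))  # 0.75
```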
49 1 Summary, Interpretation, and Conclusions The results show that both the FTR and MTR rules correctly predict all the dependencies (conditional and unconditional) implied by the models involving the two variables never measured together. [sent-792, score-0.232]
50 The rules of path analysis (Wright, 1934) dictate that the correlation between two variables, for example, ρXY equals the sum of the contribution of every d-connecting path (conditioned on the 5. [sent-853, score-0.268]
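Wright's rule can be checked on a toy standardized linear SEM: the correlation between two variables equals the sum of the products of coefficients along each connecting path. The structure below (a direct edge X → Y plus the path X → M → Y) and its coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
a, b, c = 0.5, 0.3, 0.4   # hypothetical standardized path coefficients

# X -> M (coefficient a); X -> Y direct (b) and M -> Y (c).
X = rng.normal(size=n)
M = a * X + rng.normal(size=n, scale=np.sqrt(1 - a**2))       # keep Var(M) = 1
eY = np.sqrt(1 - b**2 - c**2 - 2 * a * b * c)                  # keep Var(Y) = 1
Y = b * X + c * M + rng.normal(size=n, scale=eY)

# Wright (1934): rho_XY = (direct path) + (path through M).
rho_paths = b + a * c
rho_sample = np.corrcoef(X, Y)[0, 1]
print(round(rho_paths, 2), round(rho_sample, 2))
```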
51 We then apply Algorithm 4 and predict the strength of correlation r̂YZ for various pairs of variables; we compare the predictions with the sample correlation rYZ as estimated in Dt. [sent-887, score-0.492]
52 In this example, the distribution of the sample correlation r of two variables for sample size 70 when the true correlation is ρ ∈ {0.6, …}. [sent-927, score-0.340]
53 Figure 14: Predicted vs sample correlations over all data sets, grouped by the mean absolute values of the denominators used in their computation: predictions computed based on large correlations have reduced bias. [sent-1047, score-0.265]
54 In contrast, in the case where all variables are jointly measured and the distribution is Faithful the set of statistically indistinguishable causal graphs is completely determined by the independence model (again, also assuming linearity and Gaussian error terms). [sent-1086, score-0.384]
55 Table 5 shows the correlation between ˆ predicted and sample estimates for all methods and data sets. [sent-1195, score-0.236]
56 In comparison to FTR-S, we note the following: • When predictions are based on only 2 common variables, statistical matching based on the CIA (SMRQ ) is unreliable in several data sets and particularly the text categorization ones: the correlation of predicted vs. [sent-1306, score-0.409]
57 In general, SMR tends to predict a zero correlation between the two variables Y and Z: the point-clouds in Figures 17, 18, 19 and 20 are vertically oriented around zero. [sent-1309, score-0.259]
58 • When predictions are based on larger sets of common variables statistical matching based on the CIA (SMRG ) is more successful. [sent-1314, score-0.241]
59 FTR-S predictions are better correlated with the sample estimates of the unknown parameters, particularly when the number of common variables is low; we thus recommend that FTR-S be preferred over existing statistical matching alternatives making the CIA in such cases. [sent-1325, score-0.241]
60 [Figure 17: Predicted vs Sample Correlations for SMRQ, SMRG, FTR-S; panels (a) ACPJ-Etiology, (b) Breast-Cancer, (c) C&C — scatter-plot axis residue omitted] [sent-1358, score-0.236]
63 The most common distributional assumption adopted by statistical matching techniques for continuous variables is multivariate normality. [sent-1363, score-0.250]
64 [Figure 18: Predicted vs Sample Correlations for SMRQ, SMRG, FTR-S; panels (a) Compactiv, (b) Insurance, (c) Lymphoma — axis residue omitted] [sent-1396, score-0.236]
67 [Figure 19, panels (a) Ohsumed, (b) Ovarian, and Figure 20, panels (a) p53, (b) All predictions: Predicted vs Sample Correlations for SMRQ, SMRG, FTR-S — axis residue omitted]
70 The unknown quantity in the problem is parameter ρYZ. [sent-1455, score-0.363]
71 The columns of the table present the total number of randomly chosen quadruples (1000 × the number of chunks, except for the Wine data set), the number of predictions made by MNR on these random quadruples, the accuracies AccMNR and AccFTR at threshold t = 0. [sent-1472, score-0.272]
72 The final column presents the ratio of the number of predictions by the FTR rule over the expected number of predictions made by the MNR rule on all possible quadruples. [sent-1475, score-0.358]
73 To examine whether the predictions of MNR rule overlap with those of FTR, we applied the MNR rule on the quadruples where FTR makes a prediction. [sent-1480, score-0.304]
74 [Table residue: ratio column "#FTR predictions / #expected MNR predictions on all quads"] [sent-1503, score-0.254]
75 The columns are: the data set name, the total number of randomly sampled quadruples (1000 × the number of chunks, except for the Wine data set), the number of predictions made by MNR on those, the accuracies AccMNR and AccFTR at threshold t = 0. [sent-1513, score-0.272]
76 The final column presents the ratio of the number of predictions by the FTR rule over the expected number of predictions made by the MNR rule on all possible quadruples. [sent-1515, score-0.358]
77 Data Set | #FTR predictions | #MNR predictions restricted to cases FTR makes a prediction: Breast-Cancer 1833/32; C&C 99241/10640; Compactiv 135/28; Insurance-C 1839/15; Lymphoma 7712/681; Ovarian 539165/59327; p53 46647/413; Wine 4/1 (the "% common predictions" column is truncated at "0." in the source). [sent-1517, score-0.381]
78 In this approach, one attempts to identify one or all causal models that are consistent with all available data and pieces of prior knowledge, and reason with them. [sent-1572, score-0.258]
79 As a proof-of-concept, we identify the simplest scenario where the INCA idea provides testable predictions: specifically, it predicts the presence and strength of an unconditional dependence and a chain-like causal structure (entailing several additional conditional dependencies). [sent-1576, score-0.379]
80 The empirical results show that FTR and MTR are able to accurately predict the presence and strength of unconditional dependencies, as well as all the conditional dependencies entailed by the causal model. [sent-1578, score-0.507]
81 These predictions are better than chance and cannot be explained by the transitivity of dependencies often holding in Nature. [sent-1579, score-0.253]
82 Against typical statistical matching algorithms, FTR-S’s predictions are better correlated with sample estimates particularly when the number of common variables is low. [sent-1580, score-0.241]
83 Inducing causal models from observational data has been long debated (Pearl, 2000; Spirtes et al. [sent-1581, score-0.258]
84 In our experiments, we do not employ the causal semantics of the models to predict the effect of manipulations but their ability to represent independencies, based on the assumption of Faithfulness. [sent-1583, score-0.313]
85 While this is not a direct proof in favor of the causal semantics of the models, we do note that both Faithfulness and MAGs have been inspired by theories of probabilistic causality. [sent-1585, score-0.258]
86 2 Supplementary Tables Table 10 presents the performance of the algorithms as measured by the Mean Absolute Error (MAE) of the predictions r̂YZ and the sample-estimates rYZ: (1/N) · ∑_i |r̂_i − r_i|, where N is the total number of predictions of an algorithm. [sent-1696, score-0.254]
87 Table 11 presents the performance of the algorithms as measured by the Mean Relative Absolute Error (MRAE) of the predictions r̂YZ and the sample-estimates rYZ: (1/N) · ∑_i |r̂_i − r_i| / |r_i|, where N is the total number of predictions of an algorithm. [sent-1705, score-0.254]
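The two error measures can be sketched directly from the formulas above; the helper names and sample numbers are hypothetical. Note how MRAE explodes whenever a sample estimate is near zero, which is the behaviour the following sentences describe for the Ovarian data set:

```python
import numpy as np

def mae(pred, actual):
    """Mean Absolute Error: (1/N) * sum_i |pred_i - actual_i|."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return np.mean(np.abs(pred - actual))

def mrae(pred, actual):
    """Mean Relative Absolute Error: (1/N) * sum_i |pred_i - actual_i| / |actual_i|.

    Undefined (and numerically explosive) when any actual_i is near zero.
    """
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return np.mean(np.abs(pred - actual) / np.abs(actual))

print(round(mae([0.5, -0.2], [0.4, -0.1]), 6))   # 0.1
print(mrae([0.02, 0.5], [0.001, 0.5]) > 9)       # near-zero actual dominates
```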
88 For example, SMR on the Ovarian data set has a high MRAE (on the order of 10^9) despite a correlation between predictions and sample-estimates of 0. [sent-1707, score-0.263]
89 For the Ovarian data set the SMRG rule provides predictions for cases with near-zero sample-estimated rYZ, and these predictions generate extremely high MRAE values. [sent-1918, score-0.306]
90 Figure 25: Accuracy at threshold t for data sets Wine, p53. [sent-2310, score-0.302]
92 Figure 26: Structural accuracy at threshold t for data sets Delicious, Dexter, Gisette and Hiva for different rules; further panels show structural accuracy for Infant-Mortality and Insurance-C (axis residue omitted). [sent-2374, score-0.381]
93 Local causal and Markov blanket induction for causal discovery and feature selection for classification, part II: analysis and extensions. [sent-2545, score-0.516]
94 A constraint-based approach to incorporate prior knowledge in causal models. [sent-2567, score-0.258]
95 Predicting causal effects in large-scale u systems from observational data. [sent-2655, score-0.258]
96 Causal discovery using a Bayesian local causal discovery algorithm. [sent-2659, score-0.258]
97 Finding latent causes in causal networks: an efficient approach based on Markov blankets. [sent-2687, score-0.258]
98 A linear non-Gaussian acyclic model for causal discovery. [sent-2705, score-0.258]
99 The possibility of integrative causal analysis: Learning from different datasets and studies. [sent-2743, score-0.34]
100 On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. [sent-2768, score-0.326]
wordName wordTfidf (topN-words)
[('ftr', 0.396), ('causal', 0.258), ('mags', 0.242), ('ry', 0.206), ('mtr', 0.2), ('agani', 0.154), ('ntegrative', 0.154), ('owards', 0.154), ('riantafillou', 0.154), ('samardinos', 0.154), ('smrq', 0.149), ('mnr', 0.144), ('smrg', 0.144), ('correlation', 0.136), ('predictions', 0.127), ('bonferroni', 0.118), ('ovarian', 0.114), ('mag', 0.113), ('ausal', 0.109), ('acc', 0.109), ('ohsumed', 0.108), ('acpj', 0.103), ('predicted', 0.1), ('lymphoma', 0.099), ('compactiv', 0.098), ('smr', 0.087), ('faithfulness', 0.084), ('integrative', 0.082), ('pags', 0.082), ('ancestral', 0.079), ('guess', 0.079), ('delicious', 0.077), ('nalysis', 0.077), ('insurance', 0.075), ('dependencies', 0.073), ('quadruples', 0.073), ('threshold', 0.072), ('hiva', 0.072), ('trianta', 0.072), ('wine', 0.07), ('correlations', 0.069), ('faithful', 0.068), ('variables', 0.068), ('llou', 0.067), ('nova', 0.067), ('spirtes', 0.064), ('cia', 0.063), ('bibtex', 0.062), ('inca', 0.062), ('infant', 0.062), ('tsamardinos', 0.059), ('independence', 0.058), ('dexter', 0.057), ('covtype', 0.057), ('tillman', 0.057), ('independencies', 0.055), ('oi', 0.055), ('predict', 0.055), ('mortality', 0.053), ('transitivity', 0.053), ('rule', 0.052), ('quadruplette', 0.051), ('rxy', 0.051), ('unconditional', 0.051), ('path', 0.048), ('gisette', 0.048), ('xz', 0.047), ('matching', 0.046), ('accr', 0.046), ('rxz', 0.046), ('xy', 0.044), ('mir', 0.044), ('accuracy', 0.043), ('breast', 0.042), ('graph', 0.041), ('yes', 0.04), ('pag', 0.04), ('strength', 0.038), ('preprocessing', 0.038), ('mb', 0.037), ('february', 0.037), ('download', 0.036), ('borboudakis', 0.036), ('ryw', 0.036), ('rzw', 0.036), ('correction', 0.036), ('read', 0.036), ('rules', 0.036), ('inferences', 0.035), ('xx', 0.035), ('structural', 0.033), ('ang', 0.033), ('presence', 0.032), ('freely', 0.032), ('alarm', 0.032), ('mrae', 0.031), ('orazio', 0.031), ('sacc', 0.031), ('collider', 0.031), ('pcg', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
Author: Ioannis Tsamardinos, Sofia Triantafillou, Vincenzo Lagani
Abstract: We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed algorithms make predictions that are better correlated with the sample estimates of the unknown parameters on test data ; this is particularly the case when the number of commonly measured variables is low. The enabling idea behind the methods is to induce one or all causal models that are simultaneously consistent with (fit) all available data sets and prior knowledge and reason with them. This allows constraints stemming from causal assumptions (e.g., Causal Markov Condition, Faithfulness) to propagate. Several methods have been developed based on this idea, for which we propose the unifying name Integrative Causal Analysis (INCA). A contrived example is presented demonstrating the theoretical potential to develop more general methods for co-analyzing heterogeneous data sets. The computational experiments with the novel methods provide evidence that causallyinspired assumptions such as Faithfulness often hold to a good degree of approximation in many real systems and could be exploited for statistical inference. Code, scripts, and data are available at www.mensxmachina.org. Keywords: integrative causal analysis, causal discovery, Bayesian networks, maximal ancestral graphs, structural equation models, causality, statistical matching, data fusion
2 0.13941629 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
Author: Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan
Abstract: Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data in such a way that both the propensity and the treatment effect become more homogeneous within each resultant partition. Causal inference at different levels can be made on the basis of CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal effects can be conveniently assessed within each. Besides, a feasible way of predicting individual causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also established. We evaluate the performance of proposed methods with simulations and illustrate their use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings. Keywords: CART, causal inference, confounding, interaction, observational study, personalized medicine, recursive partitioning
3 0.11236013 24 jmlr-2012-Causal Bounds and Observable Constraints for Non-deterministic Models
Author: Roland R. Ramsahai
Abstract: Conditional independence relations involving latent variables do not necessarily imply observable independences. They may imply inequality constraints on observable parameters and causal bounds, which can be used for falsification and identification. The literature on computing such constraints often involve a deterministic underlying data generating process in a counterfactual framework. If an analyst is ignorant of the nature of the underlying mechanisms then they may wish to use a model which allows the underlying mechanisms to be probabilistic. A method of computation for a weaker model without any determinism is given here and demonstrated for the instrumental variable model, though applicable to other models. The approach is based on the analysis of mappings with convex polytopes in a decision theoretic framework and can be implemented in readily available polyhedral computation software. Well known constraints and bounds are replicated in a probabilistic model and novel ones are computed for instrumental variable models without non-deterministic versions of the randomization, exclusion restriction and monotonicity assumptions respectively. Keywords: instrumental variables, instrumental inequality, causal bounds, convex polytope, latent variables, directed acyclic graph
4 0.086299367 25 jmlr-2012-Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs
Author: Alain Hauser, Peter Bühlmann
Abstract: The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study. Keywords: causal inference, interventions, graphical model, Markov equivalence, greedy equivalence search
5 0.081364669 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
Author: Antti Hyttinen, Frederick Eberhardt, Patrik O. Hoyer
Abstract: Identifying cause-effect relationships between variables of interest is a central problem in science. Given a set of experiments we describe a procedure that identifies linear models that may contain cycles and latent variables. We provide a detailed description of the model family, full proofs of the necessary and sufficient conditions for identifiability, a search algorithm that is complete, and a discussion of what can be done when the identifiability conditions are not satisfied. The algorithm is comprehensively tested in simulations, comparing it to competing algorithms in the literature. Furthermore, we adapt the procedure to the problem of cellular network inference, applying it to the biologically realistic data of the DREAM challenges. The paper provides a full theoretical foundation for the causal discovery procedure first presented by Eberhardt et al. (2010) and Hyttinen et al. (2010). Keywords: causality, graphical models, randomized experiments, structural equation models, latent variables, latent confounders, cycles
6 0.049435023 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
7 0.04939758 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
8 0.042594194 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
9 0.041415934 82 jmlr-2012-On the Necessity of Irrelevant Variables
10 0.035727467 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
11 0.034534045 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
12 0.034212388 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
13 0.034177978 44 jmlr-2012-Feature Selection via Dependence Maximization
14 0.033360522 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
15 0.030678602 80 jmlr-2012-On Ranking and Generalization Bounds
16 0.029668402 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
17 0.027033664 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
18 0.026908953 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
19 0.026784968 20 jmlr-2012-Analysis of a Random Forests Model
20 0.026197255 4 jmlr-2012-A Kernel Two-Sample Test
topicId topicWeight
[(0, -0.139), (1, 0.073), (2, 0.107), (3, -0.139), (4, 0.109), (5, 0.141), (6, -0.076), (7, 0.27), (8, 0.24), (9, 0.056), (10, -0.03), (11, -0.027), (12, 0.025), (13, 0.126), (14, 0.04), (15, -0.069), (16, 0.051), (17, 0.018), (18, -0.019), (19, -0.006), (20, -0.033), (21, 0.054), (22, 0.028), (23, -0.075), (24, -0.02), (25, 0.059), (26, -0.018), (27, 0.063), (28, 0.071), (29, 0.055), (30, -0.028), (31, -0.018), (32, 0.037), (33, 0.018), (34, 0.046), (35, 0.016), (36, -0.014), (37, -0.057), (38, 0.044), (39, 0.007), (40, -0.019), (41, -0.021), (42, 0.011), (43, -0.002), (44, 0.0), (45, -0.064), (46, -0.025), (47, -0.015), (48, -0.059), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.9269985 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
Author: Ioannis Tsamardinos, Sofia Triantafillou, Vincenzo Lagani
Abstract: We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed algorithms make predictions that are better correlated with the sample estimates of the unknown parameters on test data; this is particularly the case when the number of commonly measured variables is low. The enabling idea behind the methods is to induce one or all causal models that are simultaneously consistent with (fit) all available data sets and prior knowledge and reason with them. This allows constraints stemming from causal assumptions (e.g., Causal Markov Condition, Faithfulness) to propagate. Several methods have been developed based on this idea, for which we propose the unifying name Integrative Causal Analysis (INCA). A contrived example is presented demonstrating the theoretical potential to develop more general methods for co-analyzing heterogeneous data sets. The computational experiments with the novel methods provide evidence that causally-inspired assumptions such as Faithfulness often hold to a good degree of approximation in many real systems and could be exploited for statistical inference. Code, scripts, and data are available at www.mensxmachina.org. Keywords: integrative causal analysis, causal discovery, Bayesian networks, maximal ancestral graphs, structural equation models, causality, statistical matching, data fusion
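As a toy illustration of why inference about never-jointly-measured variables is possible at all (this is an algebraic aside, not one of the paper's INCA algorithms): if Y and Z are never observed together but both are measured alongside a shared variable X in separate data sets, positive semi-definiteness of the joint correlation matrix of (X, Y, Z) already bounds the unobserved corr(Y, Z).

```python
import numpy as np

def corr_bounds(r_xy: float, r_xz: float) -> tuple[float, float]:
    """Bounds on corr(Y, Z) implied by positive semi-definiteness of the
    3x3 correlation matrix of (X, Y, Z), when corr(X, Y) and corr(X, Z)
    are estimated on two separate data sets that share only X."""
    center = r_xy * r_xz
    slack = np.sqrt((1.0 - r_xy**2) * (1.0 - r_xz**2))
    return center - slack, center + slack

# strong dependence on the shared variable X narrows the feasible interval
lo, hi = corr_bounds(0.9, 0.9)   # -> (0.62, 1.00)
# with no shared dependence, corr(Y, Z) is completely unconstrained
lo0, hi0 = corr_bounds(0.0, 0.0)  # -> (-1.0, 1.0)
```

The interval tightens as dependence on the commonly measured variable grows; INCA-style methods layer causal assumptions (Causal Markov Condition, Faithfulness) on top of such purely algebraic constraints to make sharper predictions.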
2 0.85264337 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
Author: Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan
Abstract: Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data in such a way that both the propensity and the treatment effect become more homogeneous within each resultant partition. Causal inference at different levels can be made on the basis of CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal effects can be conveniently assessed within each. Besides, a feasible way of predicting individual causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also established. We evaluate the performance of proposed methods with simulations and illustrate their use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings. Keywords: CART, causal inference, confounding, interaction, observational study, personalized medicine, recursive partitioning
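The aggregation step the abstract describes (assess effects within strata, then combine them into an ACE) can be sketched generically. This is a plain stratification estimator on hypothetical data, not the paper's CIT procedure; the stratum labels would come from a method like CIT.

```python
import numpy as np

def stratified_ace(strata, treat, outcome) -> float:
    """Average causal effect estimated by averaging within-stratum
    treated-vs-control mean differences, weighted by stratum size.
    Assumes every stratum contains both treated and control units."""
    strata, treat, outcome = map(np.asarray, (strata, treat, outcome))
    ace, n = 0.0, len(outcome)
    for s in np.unique(strata):
        m = strata == s
        effect = (outcome[m & (treat == 1)].mean()
                  - outcome[m & (treat == 0)].mean())
        ace += (m.sum() / n) * effect  # weight by stratum proportion
    return ace

# two strata with different within-stratum effects (2 and 5), equal sizes
strata = [0, 0, 0, 0, 1, 1, 1, 1]
treat = [1, 1, 0, 0, 1, 0, 0, 0]
outcome = [3, 3, 1, 1, 5, 0, 0, 0]
ace = stratified_ace(strata, treat, outcome)  # -> 3.5
```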
3 0.84472913 24 jmlr-2012-Causal Bounds and Observable Constraints for Non-deterministic Models
Author: Roland R. Ramsahai
Abstract: Conditional independence relations involving latent variables do not necessarily imply observable independences. They may imply inequality constraints on observable parameters and causal bounds, which can be used for falsification and identification. The literature on computing such constraints often involve a deterministic underlying data generating process in a counterfactual framework. If an analyst is ignorant of the nature of the underlying mechanisms then they may wish to use a model which allows the underlying mechanisms to be probabilistic. A method of computation for a weaker model without any determinism is given here and demonstrated for the instrumental variable model, though applicable to other models. The approach is based on the analysis of mappings with convex polytopes in a decision theoretic framework and can be implemented in readily available polyhedral computation software. Well known constraints and bounds are replicated in a probabilistic model and novel ones are computed for instrumental variable models without non-deterministic versions of the randomization, exclusion restriction and monotonicity assumptions respectively. Keywords: instrumental variables, instrumental inequality, causal bounds, convex polytope, latent variables, directed acyclic graph
Author: Alain Hauser, Peter Bühlmann
Abstract: The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study. Keywords: causal inference, interventions, graphical model, Markov equivalence, greedy equivalence search
5 0.48147413 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
Author: Antti Hyttinen, Frederick Eberhardt, Patrik O. Hoyer
Abstract: Identifying cause-effect relationships between variables of interest is a central problem in science. Given a set of experiments we describe a procedure that identifies linear models that may contain cycles and latent variables. We provide a detailed description of the model family, full proofs of the necessary and sufficient conditions for identifiability, a search algorithm that is complete, and a discussion of what can be done when the identifiability conditions are not satisfied. The algorithm is comprehensively tested in simulations, comparing it to competing algorithms in the literature. Furthermore, we adapt the procedure to the problem of cellular network inference, applying it to the biologically realistic data of the DREAM challenges. The paper provides a full theoretical foundation for the causal discovery procedure first presented by Eberhardt et al. (2010) and Hyttinen et al. (2010). Keywords: causality, graphical models, randomized experiments, structural equation models, latent variables, latent confounders, cycles
6 0.31293663 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
7 0.30894431 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
8 0.30614311 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
9 0.24435626 44 jmlr-2012-Feature Selection via Dependence Maximization
10 0.23497441 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
11 0.2278865 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
12 0.22530878 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
13 0.22119039 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
14 0.21405815 4 jmlr-2012-A Kernel Two-Sample Test
15 0.21179122 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
16 0.20491752 20 jmlr-2012-Analysis of a Random Forests Model
17 0.19565549 82 jmlr-2012-On the Necessity of Irrelevant Variables
18 0.19457851 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
19 0.19153786 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
20 0.17523359 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
topicId topicWeight
[(7, 0.012), (21, 0.031), (26, 0.029), (29, 0.074), (35, 0.019), (49, 0.019), (56, 0.019), (57, 0.017), (60, 0.443), (69, 0.019), (75, 0.058), (77, 0.016), (79, 0.023), (92, 0.065), (96, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.70139837 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
Author: Ioannis Tsamardinos, Sofia Triantafillou, Vincenzo Lagani
Abstract: We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed algorithms make predictions that are better correlated with the sample estimates of the unknown parameters on test data; this is particularly the case when the number of commonly measured variables is low. The enabling idea behind the methods is to induce one or all causal models that are simultaneously consistent with (fit) all available data sets and prior knowledge and reason with them. This allows constraints stemming from causal assumptions (e.g., Causal Markov Condition, Faithfulness) to propagate. Several methods have been developed based on this idea, for which we propose the unifying name Integrative Causal Analysis (INCA). A contrived example is presented demonstrating the theoretical potential to develop more general methods for co-analyzing heterogeneous data sets. The computational experiments with the novel methods provide evidence that causally-inspired assumptions such as Faithfulness often hold to a good degree of approximation in many real systems and could be exploited for statistical inference. Code, scripts, and data are available at www.mensxmachina.org. Keywords: integrative causal analysis, causal discovery, Bayesian networks, maximal ancestral graphs, structural equation models, causality, statistical matching, data fusion
2 0.29135233 4 jmlr-2012-A Kernel Two-Sample Test
Author: Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests. Keywords: kernel methods, two-sample test, uniform convergence bounds, schema matching, integral probability metric, hypothesis testing
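The quadratic-time statistic the abstract describes can be sketched directly. This is a minimal biased estimator of MMD² with a Gaussian kernel (the bandwidth `sigma` and sample sizes are arbitrary choices for illustration), not the authors' released code:

```python
import numpy as np

def mmd2_biased(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased (V-statistic) estimate of the squared maximum mean
    discrepancy between samples X (m x d) and Y (n x d), using a
    Gaussian kernel; quadratic time in the sample sizes."""
    def gram(A, B):
        # squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
        d2 = (np.sum(A**2, axis=1)[:, None]
              + np.sum(B**2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * sigma**2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

rng = np.random.default_rng(0)
# same distribution: MMD^2 near zero; shifted distribution: clearly larger
same = mmd2_biased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd2_biased(rng.normal(size=(200, 2)),
                   rng.normal(2.0, 1.0, size=(200, 2)))
```

A full test would compare the statistic against a threshold from the bounds or the asymptotic null distribution; the sketch only computes the statistic itself.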
3 0.27762797 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
4 0.27523634 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
Author: Antti Hyttinen, Frederick Eberhardt, Patrik O. Hoyer
Abstract: Identifying cause-effect relationships between variables of interest is a central problem in science. Given a set of experiments we describe a procedure that identifies linear models that may contain cycles and latent variables. We provide a detailed description of the model family, full proofs of the necessary and sufficient conditions for identifiability, a search algorithm that is complete, and a discussion of what can be done when the identifiability conditions are not satisfied. The algorithm is comprehensively tested in simulations, comparing it to competing algorithms in the literature. Furthermore, we adapt the procedure to the problem of cellular network inference, applying it to the biologically realistic data of the DREAM challenges. The paper provides a full theoretical foundation for the causal discovery procedure first presented by Eberhardt et al. (2010) and Hyttinen et al. (2010). Keywords: causality, graphical models, randomized experiments, structural equation models, latent variables, latent confounders, cycles
Author: Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján
Abstract: We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: “what are the implicit statistical assumptions of feature selection criteria based on mutual information?”. To answer this, we adopt a different strategy than is usual in the feature selection literature—instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature ‘relevancy’ and ‘redundancy’, our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples. Keywords: feature selection, mutual information, conditional likelihood
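The JMI criterion that the abstract's empirical study favours can be sketched with a plug-in estimator for discrete variables: score each candidate feature X_k by the sum over already-selected features X_j of I((X_k, X_j); Y), treating each pair as one joint variable. This is a hypothetical small-sample illustration, not the authors' implementation:

```python
from collections import Counter
import numpy as np

def mutual_info(xs, ys) -> float:
    """Plug-in (empirical) mutual information I(X; Y) in nats,
    for two equal-length sequences of discrete values."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * np.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def jmi_score(xk, selected, y) -> float:
    """JMI score of candidate feature xk given already-selected
    features: sum_j I((X_k, X_j); Y), pairing the two features
    into a single joint discrete variable."""
    return sum(mutual_info(list(zip(xk, xj)), y) for xj in selected)

# XOR example: y carries no marginal information about either parent,
# but the pair is fully informative, which JMI detects
xs = [0, 0, 1, 1]
zs = [0, 1, 0, 1]
ys = [0, 1, 1, 0]  # y = x XOR z
```

On this example `mutual_info(zs, ys)` is zero while `jmi_score(zs, [xs], ys)` equals log 2, illustrating why pairwise-joint criteria can rank complementary features that marginal relevance scores miss.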
6 0.26580173 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
7 0.26512265 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
8 0.2645362 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
9 0.2643792 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
10 0.26409855 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
11 0.2634441 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
12 0.2625165 82 jmlr-2012-On the Necessity of Irrelevant Variables
13 0.26132143 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
14 0.26106006 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
15 0.26081288 100 jmlr-2012-Robust Kernel Density Estimation
16 0.25962082 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
17 0.25934264 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
18 0.2586858 109 jmlr-2012-Stability of Density-Based Clustering
19 0.25836551 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
20 0.25758904 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection