jmlr jmlr2008 jmlr2008-93 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jean-Philippe Pellet, André Elisseeff
Abstract: We show how a generic feature-selection algorithm returning strongly relevant variables can be turned into a causal structure-learning algorithm. We prove this under the Faithfulness assumption for the data distribution. In a causal graph, the strongly relevant variables for a node X are its parents, children, and children’s parents (or spouses), also known as the Markov blanket of X. Identifying the spouses leads to the detection of the V-structure patterns and thus to causal orientations. Repeating the task for all variables yields a valid partially oriented causal graph. We first show an efficient way to identify the spouse links. We then perform several experiments in the continuous domain using the Recursive Feature Elimination feature-selection algorithm with Support Vector Regression and empirically verify the intuition of this direct (but computationally expensive) approach. Within the same framework, we then devise a fast and consistent algorithm, Total Conditioning (TC), and a variant, TCbw , with an explicit backward feature-selection heuristic, for Gaussian data. After running a series of comparative experiments on five artificial networks, we argue that Markov blanket algorithms such as TC/TCbw or Grow-Shrink scale better than the reference PC algorithm and provide higher structural accuracy. Keywords: causal structure learning, feature selection, Markov blanket, partial correlation, statistical test of conditional independence
Reference: text
sentIndex sentText sentNum sentScore
1 Data Analytics Group, IBM Zurich Research Laboratory, Säumerstraße 4, CH–8803 Rüschlikon. Editor: David Maxwell Chickering. Abstract: We show how a generic feature-selection algorithm returning strongly relevant variables can be turned into a causal structure-learning algorithm. [sent-5, score-0.345]
2 In a causal graph, the strongly relevant variables for a node X are its parents, children, and children’s parents (or spouses), also known as the Markov blanket of X. [sent-7, score-0.614]
, 2001) is a multivariate data-analysis approach that aims to build a directed acyclic graph (DAG) showing direct causal relations among the variables of interest of a given system. [sent-17, score-0.352]
4 Building the causal graph is a difficult task, subject to a series of assumptions, and provably correct algorithms have an exponential worst-case complexity. [sent-20, score-0.323]
5 Identifying the exact causal graph is in general impossible. [sent-21, score-0.323]
6 , the GES algorithm, Chickering, 2002); the constraint-based algorithms look for dependencies and conditional dependencies in the data and build the causal graph accordingly. [sent-30, score-0.323]
7 Feature selection and causal structure learning are related by a common concept: the Markov blanket of a variable X is the smallest set Mb(X) containing all variables carrying information about X that cannot be obtained from any other variable. [sent-47, score-0.501]
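Written out (a standard formulation in the spirit of the text, not a verbatim quote of the paper's definition), the Markov blanket is the minimal shielding set:

```latex
% Mb(X): the smallest subset of the other variables that renders X
% conditionally independent of everything else.
\[
  \mathrm{Mb}(X) \;=\; \text{the minimal } M \subseteq V \setminus \{X\}
  \;\text{ such that }\;
  X \;\perp\!\!\!\perp\; V \setminus \bigl(M \cup \{X\}\bigr) \;\big|\; M .
\]
```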
8 The feature-selection task and the causal graph construction task can both be stated to some extent as Markov blanket identification tasks. [sent-50, score-0.551]
9 Several algorithms identifying the Markov blanket of a single variable with techniques inspired from causal structure learning have been proposed as the optimal solution to the feature-selection problem in the case of a faithful distribution. [sent-52, score-0.541]
10 Tsamardinos and Aliferis (2003) show that for faithful distributions, the Markov blanket of a variable Y is exactly the set of strongly relevant features, and prove its uniqueness. [sent-53, score-0.371]
If we assume that FS returns the Markov blanket of the variables, we can show how to turn this approximate result, called the moral graph (Lauritzen and Spiegelhalter, 1988), into a provably correct PDAG depicting the causal structure. [sent-79, score-0.659]
12 This paper contributes a generic algorithm to build a causal graph which clearly separates the Markov blanket identification and the needed local adjustments, an efficient algorithm to perform those adjustments, and two fast instances of the generic algorithm for multivariate Gaussian data sets. [sent-81, score-0.605]
In Section 3, we make the link from the outcome of a feature-selection algorithm to a causal graph by detailing the needed local adjustments, and we describe an efficient way to perform them. [sent-83, score-0.386]
14 This is expressed in the definition by a change in the probability distribution of the target between conditioning on all other variables, X\i , and also including Xi in the conditioning set. [sent-120, score-0.348]
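The strong-relevance definition alluded to here (a restatement in the spirit of Kohavi and John's notion, on which the paper builds; not a quote) can be written as:

```latex
% X_i is strongly relevant to Y iff dropping it from the full
% conditioning set changes the conditional distribution of Y.
\[
  X_i \text{ strongly relevant to } Y
  \;\Longleftrightarrow\;
  P\bigl(Y \mid X_{\setminus i},\, X_i\bigr)
  \;\neq\;
  P\bigl(Y \mid X_{\setminus i}\bigr),
\]
% where X_{\setminus i} denotes all predictors except X_i, and the
% inequality must hold on assignments of positive probability.
```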
15 In a causal graph represented by a DAG, we want to represent direct causal relations with arcs between pairs of variables. [sent-140, score-0.818]
16 It may for instance be unintuitive that whereas conditioning on a node W on a directed path X → W → Y blocks the path from X to Y , conditioning on a common child Z of two variables X,Y in X → Z ← Y connects them. [sent-152, score-0.348]
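This explaining-away effect is easy to reproduce numerically. A minimal sketch in Python (illustrative, not from the paper), simulating the collider X → Z ← Y with Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Collider X -> Z <- Y: X and Y are independent causes of Z.
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z = x + y + 0.5 * rng.standard_normal(n)

def residual(a, b):
    """Residual of a after linear regression on b (regress b out of a)."""
    return a - np.polyval(np.polyfit(b, a, 1), b)

print(np.corrcoef(x, y)[0, 1])                             # ~ 0.0: marginally independent
print(np.corrcoef(residual(x, z), residual(y, z))[0, 1])   # ~ -0.8: dependent given Z
```

Conditioning on the common child Z (here, partialling it out) turns two marginally independent variables into strongly dependent ones, exactly as described above.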
17 (4) Using (4), we can theoretically determine all adjacencies of the causal graph with conditionalindependence tests, but we cannot orient the edges. [sent-178, score-0.377]
Z is then identified as an unshielded collider if conditioning on it creates a dependency between X and Y: X → Z ← Y ⟺ ∃S ⊆ V \ {X,Y,Z} : (X ⊥ Y | S) and ¬(X ⊥ Y | S ∪ {Z}), and ∀S ⊆ V \ {X,Z} : ¬(X ⊥ Z | S), and ∀S ⊆ V \ {Y,Z} : ¬(Y ⊥ Z | S). [sent-185, score-0.326]
Note that there is no guarantee that all links can be oriented into causal arcs.6 [sent-187, score-0.435]
20 Partially oriented graphs returned by structure-learning algorithms represent observationally equivalent classes of causal graphs (Pearl, 2000, p. [sent-194, score-0.339]
Formally, if we combine (3), (4) and (5), we find, for a perfect causal map G (using the symbol “⇝” to denote direct causation and “→” to denote an arc in the graph): X,Y adjacent in G ⟺ (X ⇝ Y or Y ⇝ X), and X → Z ← Y ⟺ X ⇝ Z ⇜ Y. (6) [sent-197, score-0.385]
It is sometimes possible to orient further arcs in a graph by looking at already-oriented arcs and propagating constraints that prevent the creation of cycles and of additional V-structures beyond those already detected. [sent-198, score-0.577]
Causal Network Construction Based on Feature Selection. We have looked at the ideal outcome of feature selection in (1) and how to read a causal graph in (6). [sent-201, score-0.323]
Property 7 (Total conditioning) In the context of a faithful causal graph G , we have: ∀X,Y ∈ V : X ∈ Mb(Y) ⟺ ¬(X ⊥ Y | V \ {X,Y}). [sent-208, score-0.363]
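For multivariate Gaussian data, Property 7 has a convenient computational form: every full-order partial correlation ρ(X_i, X_j | V \ {X_i, X_j}) can be read off the inverse covariance (precision) matrix at once. A sketch using standard identities (not the paper's code):

```python
import numpy as np

def full_order_partial_corr(data):
    r"""All rho(X_i, X_j | V \ {X_i, X_j}) at once, via the precision matrix
    Omega = Sigma^{-1}: rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj).
    By Property 7, nonzero entries mark Markov blanket membership; in
    practice each entry is tested statistically, not compared to exact zero.

    data: (n_samples, n_vars) array; returns the (n_vars, n_vars) matrix."""
    omega = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(omega))
    pcorr = -omega / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr
```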
This links feature selection and causal structure learning in the sense that FY = Mb(Y), the Faithfulness assumption guaranteeing the uniqueness of Mb(Y). [sent-213, score-0.396]
26 However, Markov blankets alone do not fully specify a causal graph. [sent-214, score-0.34]
27 This graph is close to the original causal graph in that it contains all arcs as undirected links, and additionally links spouses together, and is called the moral graph of the original directed graph (Lauritzen and Spiegelhalter, 1988, p. [sent-220, score-1.113]
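For reference, moralizing a known DAG is a one-pass operation; a sketch, assuming a hypothetical encoding where `parents` maps each node to its (sortable) parent set:

```python
from itertools import combinations

def moralize(parents):
    """Moral graph of a DAG (Lauritzen & Spiegelhalter): keep every arc as
    an undirected link and 'marry' the parents (spouses) of each node.

    parents: dict node -> set of parents; returns a set of frozenset edges."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))           # original arcs, undirected
        for p, q in combinations(sorted(pa), 2):       # spouse links
            edges.add(frozenset((p, q)))
    return edges
```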
The extra step needed to transform this graph into a causal PDAG is the deletion of the spouse links and the orientation of the arcs, a task which we call “resolving the Markov blankets.” [sent-222, score-0.778]
29 The full algorithm first finds the Markov blanket for each variable, and performs further conditional-independence tests around each variable to infer the structure locally. [sent-224, score-0.365]
30 It first removes the possible spouse links between linked variables X and Y by looking for a d-separating set around X and Y . [sent-233, score-0.344]
31 In a second pass, it orients the arcs whenever it finds that conditioning on a middle node creates a dependency. [sent-234, score-0.41]
While the GS approach considerably reduces the number of tests to be performed with respect to a large subset search, it is possible to perform fewer tests while still correctly identifying the structure and orienting the arcs, and to decrease the average conditioning set size. [sent-239, score-0.376]
33 A helpful observation is that orientation and removal of the spouse links can be done together in a single pass. [sent-240, score-0.396]
34 We know, as discussed in the previous section, that only arcs in V-structures can be oriented: fortunately, V-structures are exactly spotted when we identify a spouse link to be removed. [sent-241, score-0.477]
35 Assuming that the information about the Markov blanket is correct, only triangles can hide spouse links and V-structures. [sent-249, score-0.572]
Second, for each connected pair X − Y in a triangle, decisions about possible spouse links and arc orientation are made together and are thus faster. [sent-250, score-0.446]
37 Pseudocode for the proposed search algorithm is listed in Algorithm 2, where the notation G \XY denotes the moral graph G where all direct links involving X or Y have been removed. [sent-251, score-0.398]
38 Suppose that G is the moral graph of the DAG representing the causal structure of a faithful data set. [sent-254, score-0.471]
39 Lemma 9 In the context of a faithful causal graph, the set Z that has the Collider Set property for a given pair (X,Y ) exists if and only if X is neither a direct cause nor a direct effect of Y . [sent-257, score-0.342]
40 The algorithm loops over all triangle links and performs a collider set search for each of them. [sent-261, score-0.348]
41 Those nodes must be blocked by appropriate conditioning on the boundary of X or Y as determined by the base conditioning set at line 6. [sent-268, score-0.39]
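A hedged sketch of this loop (our own simplification of Algorithm 2, with `ci_independent` standing in for the conditional-independence test, and the base conditioning set computed roughly as described above):

```python
from itertools import combinations

def resolve_markov_blankets(moral, ci_independent):
    """Sketch of the spouse-removal pass (a simplification, not Algorithm 2).

    moral:           dict node -> set of neighbours (comparable node labels)
    ci_independent:  callable (x, y, cond_set) -> bool, the CI test

    Only links inside triangles can hide spouses. For each such pair we
    search for a d-separating set of the form base + subset of the common
    neighbours; detected colliders are returned for deferred orientation
    (see the ordering caveat discussed later in the text)."""
    v_structures = []
    for x in sorted(moral):
        for y in sorted(moral[x]):
            if y <= x:
                continue
            tri = moral[x] & moral[y]
            if not tri:
                continue                       # not a triangle link
            base = (moral[x] | moral[y]) - tri - {x, y}
            found = False
            for k in range(len(tri) + 1):
                for sub in combinations(sorted(tri), k):
                    if ci_independent(x, y, base | set(sub)):
                        # x - y was a spouse link; the common neighbours
                        # excluded from the separating set are colliders.
                        moral[x].discard(y)
                        moral[y].discard(x)
                        for z in tri - set(sub):
                            v_structures.append((x, z, y))
                        found = True
                        break
                if found:
                    break
    return v_structures
```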
Panel (iii) shows the spouse-link and orientation information yielded by the collider set search for the linked pair X − Y. [sent-284, score-0.503]
43 But actually in this example, all (perfect) tests containing V in the conditioning set will yield dependence, because it is a descendant of the collider Z and thus opens a path by definition of d-separation. [sent-291, score-0.434]
This in turn allows us to identify the link X − Y as a spouse link and determine (line 25) that the set Tri(X − Y) \ SXY = {Z} is the set of all direct effects of X and Y; that is, it fulfills the Collider Set property. [sent-294, score-0.347]
Figure 2: Another sample local causal structure (i) and corresponding moral graph (ii); panels (iii) and (iv) show intermediate and final stages. [sent-295, score-0.431]
For some structures, arcs must be removed and oriented in an order such that all spouse links are removed before proceeding to orientation. [sent-298, score-0.605]
We may thus remove the link X − Y, and considering that Tri(X − Y) = {W, Z}, we could want to orient X → Z ← Y and X → W ← Y (leaving the spouse link W − Y to be removed later). [sent-301, score-0.372]
This would be wrong, precisely because W − Y is a spouse link, and thus the orientation X → W ← Y is not allowed if one of the links to be oriented does not actually exist in the original graph. [sent-302, score-0.435]
49 After all spouse links have been removed, the orientations are done at line 33 only when both links to be oriented still exist, thus ensuring the existence of the V-structure X → Z ← Y . [sent-304, score-0.535]
1. Find the conjectured Markov blanket of each variable with feature selection and build the moral graph; [sent-318, score-0.365]
2. Remove spouse links and orient V-structures using collider sets; 3. Propagate the orientation constraints to obtain the maximally oriented PDAG. [sent-319, score-0.564]
52 As a conditional-independence test at lines 9 and 17 of the collider set search in Algorithm 2, we can use the distribution-free Recursive Median (RM) algorithm proposed by Margaritis (2005) to detect the V-structure and remove the spouse links, or a z-test as used in Scheines et al. [sent-376, score-0.388]
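A sketch of the z-test variant for Gaussian data, based on Fisher's z-transform of the sample partial correlation (a standard construction; the function name and defaults are ours, not the paper's exact implementation):

```python
import numpy as np
from scipy import stats

def fisher_z_test(data, i, j, cond, alpha=0.05):
    """CI test X_i _||_ X_j | X_cond via Fisher's z-transform of the
    sample partial correlation, computed from the precision matrix of
    the relevant submatrix. Returns True when independence is NOT
    rejected at level alpha.

    data: (n_samples, n_vars) array; i, j: column indices; cond: index list."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)                # numerical safety
    z = 0.5 * np.log((1 + r) / (1 - r))
    stat = np.sqrt(data.shape[0] - len(cond) - 3) * abs(z)
    p_value = 2 * stats.norm.sf(stat)
    return p_value >= alpha
```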
Then, our feature-selection step (Algorithm 6) gives the Markov blanket for each node, and the collider set search (Algorithm 2) then takes care of identifying the V-structures and removing the spouse links. [sent-399, score-0.616]
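Putting the pieces together, an end-to-end sketch of the three-phase recipe, reusing the illustrative helpers sketched above (a simplification under the same assumptions, not the paper's exact Algorithms 2 and 6):

```python
def total_conditioning(data, alpha=0.05):
    """End-to-end TC sketch: (1) moral graph via Property 7, (2) spouse
    removal and V-structure detection, (3) constraint propagation.
    Reuses fisher_z_test, resolve_markov_blankets, and meek_propagate
    as defined in the earlier sketches."""
    p = data.shape[1]
    ci = lambda x, y, s: fisher_z_test(data, x, y, list(s), alpha)

    # Phase 1: moral graph via total conditioning (Property 7).
    moral = {i: set() for i in range(p)}
    for i in range(p):
        for j in range(i + 1, p):
            rest = [k for k in range(p) if k not in (i, j)]
            if not ci(i, j, rest):            # dependent given all the rest
                moral[i].add(j)
                moral[j].add(i)

    # Phase 2: remove spouse links; collect V-structure candidates.
    v_structs = resolve_markov_blankets(moral, ci)

    # Orient a V-structure only when both of its links survived spouse
    # removal, per the ordering caveat discussed earlier in the text.
    directed = set()
    for (x, z, y) in v_structs:
        if z in moral[x] and z in moral[y]:
            directed.add((x, z))
            directed.add((y, z))
    undirected = {frozenset((i, j)) for i in moral for j in moral[i]
                  if (i, j) not in directed and (j, i) not in directed}

    # Phase 3: propagate orientation constraints (maximal PDAG).
    return meek_propagate(directed, undirected)
```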
It is worth discussing, however, which is more desirable for the particular problem at hand: missing causal links or adding extra causal links. [sent-446, score-0.766]
Table 1: Number of tests and size of the conditioning sets (noted |Z|) as performed by various algorithms to recover the local network structure of the networks, given perfect Markov blanket information. [sent-579, score-0.566]
First, we see that we get results similar to those in Table 1 as far as the number of tests and the size of the conditioning sets are concerned: CS is faster and consistently performs fewer tests with smaller conditioning sets, which increases the power of the tests. [sent-584, score-0.536]
57 However, that is sometimes balanced out by the fact that CS relies on a single series of tests both to remove spouse links and to orient (possibly multiple) V-structures at the same time, thus leading to a greater penalty if the outcome of a test is wrong with respect to the initial graph. [sent-585, score-0.506]
58 GS(1), which checks not only triangle links but all links to try to orient them, makes more orientation mistakes, especially on the Carpo network. [sent-587, score-0.41]
Table 2: Number of tests, size of the conditioning sets (noted |Z|), and structural errors as returned by GS(1) and CS to recover the local network structure of the networks, given perfect Markov blanket information. [sent-728, score-0.524]
1. This compares the true Markov blanket of each variable with the identified Markov blanket as returned by Algorithm 5; [sent-742, score-0.485]
3a. After removal of the spouse links using the Recursive Median (RM) algorithm (Margaritis, 2005) to check for conditional independence in the continuous domain; [sent-745, score-0.37]
3b. Alternatively, after removal of the spouse links using a test on Fisher’s z-transform of partial correlation; [sent-746, score-0.344]
4a. After removal of the spouse links using RM and after maximal orientation; [sent-747, score-0.344]
4b. After removal of the spouse links using partial correlation tests and after maximal orientation. [sent-749, score-0.497]
First, it allows the collider set search to also be distribution-free, in the sense that if “distribution-free feature selection” can be performed efficiently and consistently in the first phase, applying a subsequent collider set search imposes no further assumptions on the distribution. [sent-754, score-0.392]
66 What we can read from the results is that, generally, the selected Markov blankets contain all variables from the true Markov blanket plus one or two additional variables. [sent-759, score-0.324]
67 The symmetry check requiring Y to be part of Mb(X) and X to be part of Mb(Y ) to add a link between X and Y fulfills its purpose, as even in the case n = 200 where on average about 73 variables enter wrong Markov blankets, only 4 extra links are added in the moral graph. [sent-765, score-0.382]
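This symmetry (AND) check is cheap to state; a sketch, assuming a hypothetical `mb` dict mapping each node to its conjectured Markov blanket:

```python
def symmetric_moral_graph(mb):
    """AND-rule moral graph: link X - Y only when Y is in Mb(X) AND X is
    in Mb(Y); asymmetric (spurious) detections are dropped, as described
    in the text. mb: dict node -> set of conjectured blanket members."""
    return {frozenset((x, y)) for x in mb for y in mb[x] if x in mb[y]}
```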
68 After the collider set search, the number of missing and extra arcs can both either increase or decrease. [sent-767, score-0.514]
69 If the number of missing links increases, it is because the collider set search found d-separation too often while variables were actually dependent. [sent-768, score-0.415]
70 If it decreases, it means that the missing arcs in the moral graph were spouse links, as their absence is not penalized in the PDAG any more. [sent-769, score-0.668]
71 If the number of extra arcs increases, then the collider set search failed to identify spouse links; if it decreases, then the collider set search also removed through appropriate conditioning links that were not spouse links (which in turn possibly led to wrong orientations). [sent-770, score-1.521]
[Table 3 body: missing, extra, and reversed arcs and their totals for the PDAG/RM, CPDAG/RM, and CPDAG/PC stages, at n = 100, 200, 300, 400, 500; numeric entries omitted.]
Table 3: Structural errors at various stages of the RFE-based approach, showing the missing, extra and reversed arcs with respect to the original graph. [sent-1032, score-0.399]
76 In the collider set search, if a wrong spouse link is removed, it is because a wrong V-structure has been identified, so that the absence of an arc will be linked to the wrong orientation of the falsely recognized V-structure. [sent-1036, score-0.523]
77 For the PDAGs obtained using z-tests, the number of missing arcs always decreases with respect to the moral graph, and so does the number of extra links for n ≥ 200. [sent-1038, score-0.608]
78 We find that the more general RM test seems to return independence too often, as for n ≥ 200 more links are missing in the PDAG than in the moral graph. [sent-1039, score-0.353]
79 The structural errors, like before, are missing, extra, and reversed arcs in the returned CPDAG with respect to the generating graph. [sent-1053, score-0.342]
80 The numbers of extra and missing links seem to clearly decrease on average for all algorithms with an increasing number of samples, except for Bach-Jordan, which sometimes has the tendency to add more links when more data points are available. [sent-1071, score-0.43]
81 Note that TC, TCbw , and GS would also spend a long time on this cluster if all neighbors of variable 27 were its parents, because they would contain a lot of extra spouse links to be checked with an exponential number of combinations. [sent-1106, score-0.432]
82 We note that the extra links added by GS seem to allow it to obtain a better directionality accuracy than in our first series of experiments, where it was given the exact moral graph as input. [sent-1135, score-0.433]
Figure 12: Alarm: (a) run times and (b) number of statistical tests as a function of the sample size n. [sent-1144, score-0.796]
Figure 14: Insurance: (a) run times and (b) number of statistical tests as a function of the sample size n. [sent-1148, score-0.992]
Figure 16: Hailfinder: (a) run times and (b) number of statistical tests as a function of the sample size n. [sent-1152, score-0.992]
Figure 18: Carpo: (a) run times and (b) number of statistical tests as a function of the sample size n. [sent-1156, score-0.962]
Figure 20: Diabetes: (a) run times and (b) number of statistical tests as a function of the sample size n. [sent-1160, score-0.992]
The exponential growth in PC can be seen when nodes have a high degree, be it in parents or children; in TC and GS, it is due to large fully-connected triangle structures and to spouse links coming from the Markov blanket-construction step. [sent-1169, score-0.421]
Conclusion. Causal discovery and feature selection are closely related: optimal feature selection discovers Markov blankets as sets of strongly relevant features, and causal discovery discovers Markov blankets as direct causes, direct effects, and common causes of direct effects. [sent-1186, score-0.758]
90 By performing perfect feature selection on each variable, we get the undirected moral graph as an approximation of the causal graph. [sent-1187, score-0.49]
91 Proof It follows from Definition 5 that a path π is blocked when either at least one collider (or one of its descendants) is not in the conditioning set S, or when at least one non-collider is in S. [sent-1208, score-0.359]
Property 7 (Total conditioning) In the context of a faithful causal graph G , we have: ∀X,Y ∈ V : X ∈ Mb(Y) ⟺ ¬(X ⊥ Y | V \ {X,Y}). [sent-1227, score-0.363]
Proof This is a direct consequence of Corollary 14, where points (i) and (ii) lead to the definition of the Markov blanket of Y as (i) all its causes and effects, and (ii) the other direct causes of its effects. [sent-1228, score-0.344]
94 Lemma 9 In the context of a faithful causal graph, the set Z that has the Collider Set property for a given pair (X,Y ) exists if and only if X is neither a direct cause nor a direct effect of Y , and is unique when it exists. [sent-1241, score-0.342]
This holds because, as shown by Lemma 15, Z is a direct child of X and Y, and conditioning on it opens a path, no matter what the conditioning set is. [sent-1253, score-0.349]
96 Thus, all spouse links to be removed are in the moral graph, and, by the definition of spouse, each spouse link between X and Y corresponds to at least one unshielded collider for the pair (X,Y ). [sent-1260, score-0.873]
If some SXY exists, then the link between X and Y is a spouse link by definition of a moral graph, which implies that X and Y have a nonempty set of common effects Z. [sent-1264, score-0.426]
98 It is then enough to show that all d-connecting paths between X and Y that are not due to conditioning on a collider or collider’s descendant go through the base conditioning set as determined at line 6. [sent-1268, score-0.486]
99 The collider set search then examines each link X − Y part of a triangle, and by Lemma 15, we know that if the search for a set Z that has the Collider Set property succeeds, there must be no link between X and Y . [sent-1280, score-0.352]
100 Time and sample efficient discovery of Markov blankets and direct causal relations. [sent-1553, score-0.369]
wordName wordTfidf (topN-words)
[('tc', 0.314), ('causal', 0.244), ('blanket', 0.228), ('gs', 0.228), ('arcs', 0.222), ('tcbw', 0.196), ('spouse', 0.192), ('lankets', 0.19), ('iz', 0.178), ('collider', 0.166), ('ta', 0.166), ('conditioning', 0.16), ('sxy', 0.155), ('links', 0.152), ('pc', 0.147), ('arkov', 0.144), ('ellet', 0.143), ('lisseeff', 0.143), ('mb', 0.127), ('tri', 0.125), ('tructure', 0.121), ('sing', 0.116), ('tests', 0.108), ('markov', 0.108), ('moral', 0.108), ('blankets', 0.096), ('bw', 0.089), ('ausal', 0.088), ('reversed', 0.086), ('pdag', 0.081), ('graph', 0.079), ('rfe', 0.076), ('missing', 0.067), ('hail', 0.066), ('link', 0.063), ('predictors', 0.062), ('bd', 0.062), ('extra', 0.059), ('alarm', 0.058), ('dag', 0.055), ('faithfulness', 0.055), ('nder', 0.055), ('orient', 0.054), ('otal', 0.053), ('orientation', 0.052), ('xy', 0.052), ('insurance', 0.05), ('arc', 0.05), ('carpo', 0.05), ('esolve', 0.048), ('earning', 0.047), ('spouses', 0.045), ('svr', 0.045), ('rc', 0.045), ('correlation', 0.045), ('parents', 0.04), ('faithful', 0.04), ('relevant', 0.04), ('oriented', 0.039), ('pearl', 0.039), ('tsamardinos', 0.038), ('network', 0.037), ('nodes', 0.037), ('margaritis', 0.036), ('ndep', 0.036), ('ond', 0.036), ('directionality', 0.035), ('fy', 0.035), ('rs', 0.034), ('strongly', 0.034), ('structural', 0.034), ('blocked', 0.033), ('perfect', 0.033), ('errors', 0.032), ('children', 0.032), ('search', 0.03), ('differentiated', 0.03), ('nilsson', 0.03), ('ollider', 0.03), ('pellet', 0.03), ('rro', 0.03), ('tim', 0.03), ('direct', 0.029), ('causation', 0.029), ('diabetes', 0.029), ('variable', 0.029), ('causes', 0.029), ('target', 0.028), ('node', 0.028), ('graphs', 0.028), ('ii', 0.028), ('spirtes', 0.028), ('aliferis', 0.027), ('generic', 0.027), ('sw', 0.027), ('eature', 0.027), ('colliders', 0.027), ('independence', 0.026), ('undirected', 0.026), ('ct', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
2 0.17831527 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until none subset can be decomposed further. In our approach, search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
3 0.15637997 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
Author: Jiji Zhang
Abstract: Causal reasoning is primarily concerned with what would happen to a system under external interventions. In particular, we are often interested in predicting the probability distribution of some random variables that would result if some other variables were forced to take certain values. One prominent approach to tackling this problem is based on causal Bayesian networks, using directed acyclic graphs as causal diagrams to relate post-intervention probabilities to pre-intervention probabilities that are estimable from observational data. However, such causal diagrams are seldom fully testable given observational data. In consequence, many causal discovery algorithms based on data-mining can only output an equivalence class of causal diagrams (rather than a single one). This paper is concerned with causal reasoning given an equivalence class of causal diagrams, represented by a (partial) ancestral graph. We present two main results. The first result extends Pearl (1995)’s celebrated do-calculus to the context of ancestral graphs. In the second result, we focus on a key component of Pearl’s calculus—the property of invariance under interventions, and give stronger graphical conditions for this property than those implied by the first result. The second result also improves the earlier, similar results due to Spirtes et al. (1993). Keywords: ancestral graphs, causal Bayesian network, do-calculus, intervention
4 0.15546399 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
Author: Ilya Shpitser, Judea Pearl
Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification
Author: Yang-Bo He, Zhi Geng
Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph. Keywords: active learning, causal networks, directed acyclic graphs, intervention, Markov equivalence class, optimal design, structural learning
6 0.12979254 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
7 0.11749471 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
8 0.11255187 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
9 0.099544168 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
10 0.056517143 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.042663451 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
12 0.042344309 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
14 0.036723621 54 jmlr-2008-Learning to Select Features using their Properties
15 0.034146369 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
16 0.033111036 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
17 0.032835126 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
18 0.032803677 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
19 0.032210231 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
20 0.030564021 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
topicId topicWeight
[(0, 0.234), (1, 0.353), (2, 0.029), (3, 0.019), (4, 0.006), (5, -0.002), (6, 0.038), (7, 0.052), (8, -0.043), (9, -0.012), (10, -0.016), (11, 0.005), (12, 0.056), (13, -0.231), (14, 0.09), (15, -0.026), (16, 0.025), (17, -0.004), (18, -0.034), (19, 0.046), (20, 0.073), (21, 0.015), (22, -0.015), (23, 0.102), (24, -0.021), (25, -0.044), (26, -0.058), (27, 0.083), (28, 0.043), (29, 0.073), (30, -0.032), (31, -0.013), (32, 0.032), (33, 0.008), (34, 0.118), (35, -0.027), (36, -0.017), (37, 0.018), (38, -0.07), (39, -0.036), (40, 0.075), (41, 0.097), (42, -0.09), (43, 0.032), (44, 0.093), (45, -0.157), (46, 0.048), (47, -0.084), (48, 0.12), (49, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.9384163 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
Author: Jean-Philippe Pellet, André Elisseeff
Abstract: We show how a generic feature-selection algorithm returning strongly relevant variables can be turned into a causal structure-learning algorithm. We prove this under the Faithfulness assumption for the data distribution. In a causal graph, the strongly relevant variables for a node X are its parents, children, and children’s parents (or spouses), also known as the Markov blanket of X. Identifying the spouses leads to the detection of the V-structure patterns and thus to causal orientations. Repeating the task for all variables yields a valid partially oriented causal graph. We first show an efficient way to identify the spouse links. We then perform several experiments in the continuous domain using the Recursive Feature Elimination feature-selection algorithm with Support Vector Regression and empirically verify the intuition of this direct (but computationally expensive) approach. Within the same framework, we then devise a fast and consistent algorithm, Total Conditioning (TC), and a variant, TCbw , with an explicit backward feature-selection heuristics, for Gaussian data. After running a series of comparative experiments on five artificial networks, we argue that Markov blanket algorithms such as TC/TCbw or Grow-Shrink scale better than the reference PC algorithm and provides higher structural accuracy. Keywords: causal structure learning, feature selection, Markov blanket, partial correlation, statistical test of conditional independence
2 0.71770203 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
Author: Ilya Shpitser, Judea Pearl
Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification
3 0.70016229 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
Author: Jiji Zhang
Abstract: Causal reasoning is primarily concerned with what would happen to a system under external interventions. In particular, we are often interested in predicting the probability distribution of some random variables that would result if some other variables were forced to take certain values. One prominent approach to tackling this problem is based on causal Bayesian networks, using directed acyclic graphs as causal diagrams to relate post-intervention probabilities to pre-intervention probabilities that are estimable from observational data. However, such causal diagrams are seldom fully testable given observational data. In consequence, many causal discovery algorithms based on data-mining can only output an equivalence class of causal diagrams (rather than a single one). This paper is concerned with causal reasoning given an equivalence class of causal diagrams, represented by a (partial) ancestral graph. We present two main results. The first result extends Pearl (1995)’s celebrated do-calculus to the context of ancestral graphs. In the second result, we focus on a key component of Pearl’s calculus—the property of invariance under interventions, and give stronger graphical conditions for this property than those implied by the first result. The second result also improves the earlier, similar results due to Spirtes et al. (1993). Keywords: ancestral graphs, causal Bayesian network, do-calculus, intervention
4 0.55229574 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
Author: Tianjiao Chu, Clark Glymour
Abstract: Pointwise consistent, feasible procedures for estimating contemporaneous linear causal structure from time series data have been developed using multiple conditional independence tests, but no such procedures are available for non-linear systems. We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. Several variants of the procedure are also discussed. Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices
5 0.53212321 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm improves sensitively accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains on the directed acyclic graphs (DAG) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γm n ), where γm < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m and sparse structures ˜ allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; significance level of the tests controls its sparseness, enabling to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still enables to significantly outperform other heuristic searches. Keywords: subset Bayesian networks, structure learning, optimal search, super-structure, connected
6 0.49078065 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
7 0.4769426 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
9 0.33297122 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
10 0.2853334 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.27562574 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
12 0.2493906 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
13 0.24660474 54 jmlr-2008-Learning to Select Features using their Properties
14 0.23625754 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
16 0.23257722 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
17 0.20900981 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
18 0.19983217 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
19 0.19535594 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
20 0.194675 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
topicId topicWeight
[(0, 0.046), (5, 0.019), (30, 0.405), (31, 0.016), (40, 0.037), (54, 0.026), (58, 0.035), (61, 0.024), (66, 0.052), (76, 0.056), (87, 0.015), (88, 0.054), (92, 0.043), (94, 0.039), (99, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.77880406 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
2 0.32271701 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
3 0.29330155 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
4 0.29041171 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
5 0.28293139 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
6 0.28107136 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.2792654 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
8 0.27899975 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
9 0.27342927 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
10 0.27132517 56 jmlr-2008-Magic Moments for Structured Output Prediction
11 0.27110061 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.27055264 57 jmlr-2008-Manifold Learning: The Price of Normalization
14 0.26860869 83 jmlr-2008-Robust Submodular Observation Selection
15 0.26810479 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
16 0.26789403 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
17 0.26612687 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
18 0.26500839 86 jmlr-2008-SimpleMKL
19 0.26381198 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
20 0.26294243 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection