jmlr jmlr2009 jmlr2009-11 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Raanan Yehezkel, Boaz Lerner
Abstract: We propose the recursive autonomy identification (RAI) algorithm for constraint-based (CB) Bayesian network structure learning. The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. The sequence of operations is performed recursively for each autonomous substructure while simultaneously increasing the order of the CI test. While other CB algorithms d-separate structures and then direct the resulted undirected graph, the RAI algorithm combines the two processes from the outset and along the procedure. By this means and due to structure decomposition, learning a structure using RAI requires a smaller number of CI tests of high orders. This reduces the complexity and run-time of the algorithm and increases the accuracy by diminishing the curse-of-dimensionality. When the RAI algorithm learned structures from databases representing synthetic problems, known networks and natural problems, it demonstrated superiority with respect to computational complexity, run-time, structural correctness and classification accuracy over the PC, Three Phase Dependency Analysis, Optimal Reinsertion, greedy search, Greedy Equivalence Search, Sparse Candidate, and Max-Min Hill-Climbing algorithms. Keywords: Bayesian networks, constraint-based structure learning
Reference: text
sentIndex sentText sentNum sentScore
1 The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. [sent-7, score-0.229]
2 The structure is a directed acyclic graph (DAG) that is composed of nodes representing domain variables and edges connecting these nodes. [sent-16, score-0.31]
3 An edge manifests dependence between the nodes connected by the edge, while the absence of an edge demonstrates independence between the nodes. [sent-17, score-0.24]
4 In the second stage, most CB algorithms direct edges by employing orientation rules in two consecutive steps: finding and directing V-structures and directing additional edges inductively (Pearl, 2000). [sent-72, score-0.231]
5 We propose the recursive autonomy identification (RAI) algorithm, which is a CB model that learns the structure of a BN by sequential application of CI tests, edge direction and structure decomposition into autonomous sub-structures that comply with the Markov property (i. [sent-86, score-0.328]
6 , tests employing small conditions sets) before those of high order, the RAI algorithm performs more reliable tests first, and thereby obviates the need to perform less reliable tests later. [sent-94, score-0.433]
7 By directing edges while testing conditional independence, the RAI algorithm can consider parentchild relations so as to rule out nodes from condition sets and thereby to avoid unnecessary CI tests and to perform tests using smaller condition sets. [sent-95, score-0.555]
8 By decomposing the graph into autonomous sub-structures, further elimination of both the number of CI tests and size of condition sets is obtained. [sent-97, score-0.287]
9 The structure G (V, E) is a DAG composed of V, a set of nodes representing the domain variables X, and E, a set of directed edges connecting the nodes. [sent-110, score-0.242]
10 A directed edge Xi → X j connects a child node X j to its parent node Xi . [sent-111, score-0.238]
11 We also make use of the term partially directed graph, that is, a graph that may have both directed and undirected edges and has at most one edge between any pair of nodes (Meek, 1995). [sent-120, score-0.397]
12 We use this term while learning a graph starting from a complete undirected graph and removing and directing edges until uncovering a graph representing a family of Markov equivalent structures (pattern) of the true underlying BN2 (Pearl, 2000; Spirtes et al. [sent-121, score-0.337]
13 Thereafter, we ⊥ define d-separation resolution with the aim to evaluate d-separation for different sizes of condition sets, d-separation resolution of a graph, an exogenous cause to a graph and an autonomous substructure. [sent-127, score-0.362]
14 Definition 1 – d-separation resolution: The resolution of a d-separation relation between a pair of non-adjacent nodes in a graph is the size of the smallest condition set that d-separates the two nodes. [sent-130, score-0.269]
15 Definition 2 – d-separation resolution of a graph: The d-separation resolution of a graph is the highest d-separation resolution in the graph. [sent-132, score-0.317]
16 Two nodes in a graph that are connected by an edge are adjacent. [sent-140, score-0.236]
17 Eliminating relation 8 by adding the edge X3 → X6 , we form a graph having a d-separation resolution of 1 (Figure 2b). [sent-143, score-0.223]
18 By further adding edges to the graph, eliminating relations of resolution 1, we form a graph having a d-separation resolution of 0 (Figure 2c) that encodes only relation 1. [sent-144, score-0.307]
19 Given a structure G , any two non-adjacent nodes in an autonomous sub-structure G A in G are d-separated given nodes either included in the sub-structure G A or exogenous causes to G A . [sent-150, score-0.327]
20 , they are either parents of nodes in G A or not adjacent to them; see Definition 3), G A is said to be autonomous in G given nodes X1 and X2 . [sent-154, score-0.299]
21 Since X and Y are contained in the sub-structure G A (VA , EA ), which is autonomous given the set of nodes Vex , then, following the definition of an autonomous sub-structure, all parents of the nodes in VA — and specifically Pa(Y )—are members in set {VA ∪ Vex }. [sent-164, score-0.358]
22 CI testing of order n between nodes X and Y is performed by thresholding the value of a criterion that measures the dependence between the nodes conditioned on a set of n nodes (i. [sent-169, score-0.308]
23 In the second step, also known as the inductive stage, edges are continually directed until no more edges can be directed, while assuring that no new V-structures and no directed cycles are created. [sent-186, score-0.234]
24 Decomposition also decreases the number and length of paths between nodes that are CI-tested, thereby diminishing, respectively, the number of CI tests and the sizes of condition sets used in these tests. [sent-188, score-0.256]
25 Following decomposition, the RAI algorithm identifies ancestor and descendant sub-structures; the former are autonomous, and the latter are autonomous given nodes of the former. [sent-191, score-0.314]
26 Previous knowledge includes Gstart , a structure having a dseparation resolution of n − 1, and G ex , a set of structures each having possible exogenous causes to Gstart . [sent-199, score-0.222]
27 , edges directed from nodes in Gstart into nodes that are not in Gstart or G ex ), and these will be / ignored by the RAI. [sent-203, score-0.339]
28 Given a structure Gstart having d-separation resolution n − 1, the RAI algorithm seeks independences between adjacent nodes conditioned on sets of size n and removes the edges corresponding to these independences. [sent-206, score-0.359]
29 This order is partial, since not all the edges can be directed; thus, edges that cannot be directed connect nodes of equal topological order. [sent-210, score-0.315]
30 Using this partial topological ordering, the algorithm decomposes the structure into ancestor and descendent autonomous sub-structures so as to reduce the complexity of the successive stages. [sent-211, score-0.245]
31 Second, all edges pointing towards nodes of the descendant sub-structure are temporarily removed (together with the descendant sub-structure itself), and the remaining clusters of connected nodes are identified as ancestor sub-structures. [sent-217, score-0.475]
32 The descendent sub-structure is autonomous, given nodes of higher topological order composing the ancestor sub-structures. [sent-218, score-0.234]
33 This is achieved by removing edges corresponding to independences between nodes in G ex and nodes in Gstart conditioned on sets of size n of nodes that are either exogenous to, or within, Gstart . [sent-225, score-0.497]
34 Similarly, in Stage B1, the algorithm tests for CI of order n between nodes in Gstart given sets of size n of nodes that are either exogenous to, or within, Gstart , and removes edges corresponding to independences. [sent-226, score-0.469]
35 When testing independence between X and Y , conditioned on the potential parents of node X, those nodes in the condition set that are exogenous to Gstart are X’s parents whereas those nodes that are in Gstart are either its parents or adjacents. [sent-228, score-0.472]
36 In Stage B3, the algorithm groups in a descendant sub-structure all the nodes having the lowest topological order in the derived partially directed structure, and following the temporary removal of these nodes, it defines in Stage B4 separate ancestor sub-structures. [sent-230, score-0.328]
37 Due to the topological order, every edge from a node X in an ancestor sub-structure to a node Z in the descendant substructure is directed as X → Z. [sent-231, score-0.379]
38 Thus, every ancestor sub-structure contains all the potential parents of its nodes, that is, it is autonomous (or if some potential parents are exogenous, then the sub-structure is autonomous given the set of exogenous nodes). [sent-233, score-0.331]
39 The descendant sub-structure is, by definition, autonomous given nodes of ancestor sub-structures. [sent-234, score-0.295]
40 Proposition 1 showed that we can identify all the conditional independences between nodes of an autonomous sub-structure. [sent-235, score-0.214]
41 Hence, every ancestor and descendant sub-structure can be processed independently in Stages C and D, respectively, so as to identify conditional independences of increasing orders in each recursive call of the algorithm. [sent-236, score-0.27]
42 Similarly, Stage D is a recursive call for the RAI algorithm for learning the descendant sub-structure with order n + 1, while assuming that the ancestor sub-structures have been fully learned (having d-separation resolution of n + 1). [sent-238, score-0.295]
43 , checking marginal independence), which yields the removal of the edges between node X1 and nodes X3 , X4 and X5 (Figure 5c). [sent-282, score-0.216]
44 The nodes having the lowest topological order (X2 , X6 , X7 ) are grouped into a descendant sub-structure GD (Stage B3), while the remaining nodes form two unconnected ancestor sub-structures, GA1 and GA2 (Stage B4)(Figure 5e). [sent-284, score-0.361]
45 Note that after decomposition, every edge between a node, Xi , in an ancestor sub-structure, and a node, X j , in a descendant sub-structure is a directed edge Xi → X j . [sent-285, score-0.328]
46 The set of all edges from an ancestor sub-structure to the descendant sub-structure is illustrated in Figure 5e by a wide arrow connecting the sub-structures. [sent-286, score-0.213]
47 We refer here to structures learned during algorithm execution and do not consider the empty graph that naturally has the lowest d-separation resolution (i. [sent-314, score-0.23]
48 This graph, having all nodes marginally independent of each other, will be found by the RAI algorithm immediately after the first iteration for graph resolution 0. [sent-317, score-0.266]
49 Missing or wrongly identifying a V-structure affect the orientation of other edges in the graph during the inductive stage and subsequent stages. [sent-325, score-0.228]
50 This knowledge permits avoiding some of the tests and decreases the size of conditions sets of some other tests (see Lemma 1). [sent-340, score-0.276]
51 Suppose that following CI tests of some order both the PC and RAI algorithms identify a triplet of nodes in which two non-adjacent nodes, X and Y , are adjacent to a third node, Z, that is, X – Z – Y . [sent-346, score-0.234]
52 The directed arm would also contribute to fewer CI tests and tests with smaller condition sets during CI testing with higher orders (e. [sent-351, score-0.407]
53 Now, suppose another example in which after removing all edges due to reliable CI tests using condition set sizes lower than or equal to n, the algorithm identifies the V-structure X → Z ← Y (Figure 7a). [sent-355, score-0.252]
54 Decomposition into ancestor and descendent sub-structures is followed by three levels of learning (Figure 4), that is, removing and directing edges 1) of ancestor sub-structures, 2) between ancestor and descendent sub-structures, and 3) of the descendent sub-structure. [sent-385, score-0.427]
55 In Stage A2 (Figure 5h), we direct the edge X2 → X6 (as X1 ⊥ X6 | X2 and thus X2 cannot be ⊥ a collider between X1 and X6 ) and edge X2 → X7 (as X1 ⊥ X7 | X2 and thus X2 cannot be a collider ⊥ between X1 and X7 ), and in Stage B (Figure 5i) we direct the edge X6 → X7 . [sent-394, score-0.216]
56 The direction of these edges could not be assured without removing first the above edges, since the (redundant) edges pointing onto X6 and X7 would have allowed wrong edge direction, that is, X6 → X2 and X7 → X2 . [sent-395, score-0.218]
57 In the worst case, the RAI algorithm will neither direct any edges nor decompose the structure and will thus identify the entire structure as a descendant sub-structure, calling Stages D and B1 iteratively while skipping all other stages. [sent-402, score-0.22]
58 Given the maximal number of possible parents k and the number of nodes n, the number of CI tests is bounded by (Spirtes et al. [sent-404, score-0.282]
59 This means that in most cases some edges are directed and the structure is decomposed; hence, the number of CI tests is much smaller than that of the worst case. [sent-409, score-0.284]
60 In Stage B, the algorithm tests for independence between every pair of nodes with an empty condition set, that is, X ⊥ Y | 0 (marginal independence), ⊥ / removes the redundant edges and directs the remaining edges as possible. [sent-424, score-0.421]
61 In the resulting structure, all the edges between independent nodes have been removed and no false conditional independences are entailed. [sent-425, score-0.228]
62 Since the data entail only independences of zero order, further recursive calls with m ≥ 1 will not find independences with condition sets of size m, and thus no edges will be removed, leaving the graph unchanged. [sent-431, score-0.358]
63 Note that decomposition into ancestor, GA , and descendant, GD , sub-structures occurs after identification of all nodes having the lowest topological order, such that every edge from a node X in GA to a node Y in GD is directed, X → Y . [sent-444, score-0.312]
64 In the case that the sub-structure is a descendant sub-structure, SXY contains nodes from the ancestor sub-structures and the descendant sub-structure. [sent-446, score-0.306]
65 Completing CI testing with a specific graph resolution n in the RAI algorithm and interrupting the AFCI at any stage of CI testing are analogous. [sent-455, score-0.266]
66 Furthermore, Spirtes (2001) proves that interrupting the algorithm at any stage is also possible during edge direction, that is, once an edge is directed, the algorithm never changes that direction. [sent-456, score-0.238]
67 Since directing edges by the AFCI algorithm after interruption yields a correct (although less informative) graph (Spirtes, 2001), also the direction of edges by the RAI algorithm yields a correct graph. [sent-460, score-0.279]
68 Since the true graph is known for these structures, we could assume that all CI tests were correct and compare the numbers of CI tests required by the algorithms to learn the true independence relationships. [sent-543, score-0.344]
69 We selected from a large number of randomly generated graphs 3,000 graphs that were restricted by a maximal fan-in value of 3; that is, every node in such a graph has 3 parents at most and at least one node in the graph has 3 parents. [sent-551, score-0.278]
70 Thus, the structures can theoretically be learned by employing CI tests of order 3 and below and should not use tests of orders higher than 3. [sent-553, score-0.381]
71 Both Figure 9a and Figure 9b show that the number of CI tests employed by the RAI algorithm increases more slowly with the graph size compared to that of the PC algorithm and that this advantage is much more significant for the redundant (and more costly) CI tests of order 4. [sent-559, score-0.382]
72 Figure 10 shows the average number and percentage of CI tests saved using the RAI algorithm compared to the PC algorithm for different condition set sizes and graph sizes. [sent-561, score-0.266]
73 The figure shows that the percentage of CI tests saved using the RAI algorithm increases with both graph and condition set sizes. [sent-563, score-0.247]
74 For example, the saving in CI tests when using the RAI algorithm instead of the PC algorithm for learning a graph having 15 nodes and using condition sets of size 4 is above 70% (Figure 10b). [sent-564, score-0.362]
75 4, we will demonstrate the RAI quality of requiring relatively fewer tests of high orders than of low orders for graphs of larger sizes for real, rather than synthetic, data. [sent-566, score-0.228]
76 An extra direction (ED) error is due to edge direction that appears in the learned graph but not in the true graph, whereas a missing direction (MD) error is due to edge direction that 1548 BAYESIAN N ETWORK S TRUCTURE L EARNING BY R ECURSIVE AUTONOMY I DENTIFICATION 5 −1. [sent-591, score-0.239]
77 Finally, a reversed direction (RD) error is due to edge direction in the learned graph that is opposite to the edge direction in the true graph. [sent-601, score-0.239]
78 The RAI algorithm reduces the number of CI tests of orders 1 and above required by the PC algorithm and those of orders 2 and above required by the TPDA algorithm. [sent-704, score-0.266]
79 Moreover, the RAI algorithm completely avoids the use of CI tests of orders 4 and above and almost completely avoids CI tests of order 3 compared to both the PC and TPDA algorithms. [sent-705, score-0.34]
80 The RAI algorithm requires 46% less CI tests than the PC algorithm and 14% more CI tests (of order 1) than the TPDA algorithm. [sent-708, score-0.314]
81 We find that for all networks the RAI algorithm performs fewer calls for statistical tests than all other algorithms. [sent-1323, score-0.229]
82 On average over all networks, the RAI algorithm performs only 53% of the calls for statistical tests performed by the MMHC algorithm, which is the algorithm that required the fewest calls of all algorithms examined in Tsamardinos et al. [sent-1324, score-0.299]
83 53 Table 5: Number of statistical calls performed by each algorithm normalized by the number of statistical calls performed by the MMHC algorithm for the nineteen networks detailed in Table 2. [sent-1427, score-0.223]
84 1, we further analyzed the complexity of RAI (as measured by the numbers of CI tests performed) according to the CI test orders and the graph size. [sent-1433, score-0.274]
85 We examined the numbers of tests as performed for different orders for the Child, Insurance, Alarm and Hailfinder networks and their tiled networks. [sent-1435, score-0.251]
86 Figure 17 shows the cumulative percentage of CI tests for a specific order out of the total number of CI tests performed for each network. [sent-1438, score-0.276]
87 The figure demonstrates that the percentages of CI tests performed decrease with the CI test order and become small for orders higher than the max fan-in of the network (see Table 2). [sent-1439, score-0.23]
88 This is due to a faster increase of the number of low-order CI tests compared with the number of high-order CI tests as the graph size increases for all networks except for Hailfinder. [sent-1441, score-0.365]
89 This led to an increase in the percentage of high-order CI tests and a decrease in CI tests of order 0 when comparing the Hailfinder network to its tiled versions. [sent-1443, score-0.326]
90 Thus, we can conclude that as the graph size increases, the RAI algorithm requires relatively fewer CI tests of high orders, especially of orders higher than the max fan-in, than tests of low orders. [sent-1449, score-0.408]
91 1559 Percentage from the total # CI tests Percentage from the total # CI tests Y EHEZKEL AND L ERNER 1 0. [sent-1461, score-0.276]
92 4 0 1 2 3 CI test order 3 (b) Percentage from the total # CI tests Percentage from the total # CI tests (a) 2 CI test order 4 5 1 0. [sent-1482, score-0.322]
93 65 0 (c) 1 2 3 CI test order 4 5 (d) Figure 17: Cumulative percentages of CI tests out of the total numbers of tests for increasing orders as performed by the RAI algorithm for the (a) Child, (b) Insurance, (c) Alarm, and (d) Hailfinder networks including their tiled networks. [sent-1489, score-0.41]
94 Table 7 shows the average number and percentage of CI tests reduced by the RAI algorithm compared to the PC algorithm for different CI test orders and each database. [sent-1525, score-0.244]
95 A 100% cut in CI tests for a specific order means that RAI does not need any of the CI tests employed by the PC algorithm for this order (e. [sent-1527, score-0.295]
96 It can be seen that for almost all databases examined, the RAI algorithm avoids most of the CI tests of orders two and above that are required by the PC algorithm (e. [sent-1530, score-0.29]
97 63 Table 7: Average number (and percentage) of CI tests reduced by the RAI algorithm compared to the PC algorithm for different databases and CI test orders and the cut (%) in the total CI test run-time. [sent-1626, score-0.336]
98 Thereby, the RAI algorithm deals with less potential parents for the nodes on a tested edge and thus uses smaller condition sets that enable the performance of fewer CI tests of higher orders. [sent-1927, score-0.395]
99 By introducing orientation rules through edge direction in early stages of the algorithm and following CI tests of lower orders, the graph “backbone” is established using the most reliable CI tests. [sent-1929, score-0.355]
100 Relying on this “backbone” and its directed edges in later stages obviates the need for unnecessary CI tests and enables RAI to be less complex and sensitive to errors. [sent-1930, score-0.282]
wordName wordTfidf (topN-words)
[('rai', 0.761), ('mmhc', 0.197), ('ci', 0.178), ('tpda', 0.162), ('pc', 0.161), ('gstart', 0.143), ('tests', 0.138), ('tsamardinos', 0.121), ('ges', 0.097), ('shd', 0.097), ('nodes', 0.096), ('autonomy', 0.092), ('alarm', 0.091), ('ehezkel', 0.085), ('erner', 0.085), ('resolution', 0.083), ('spirtes', 0.078), ('cmi', 0.077), ('edges', 0.073), ('bdeu', 0.073), ('edge', 0.072), ('descendant', 0.07), ('ancestor', 0.07), ('dentification', 0.069), ('ecursive', 0.069), ('tructure', 0.069), ('databases', 0.069), ('graph', 0.068), ('cb', 0.064), ('nineteen', 0.062), ('etwork', 0.062), ('sc', 0.06), ('independences', 0.059), ('autonomous', 0.059), ('stage', 0.056), ('calls', 0.051), ('parents', 0.048), ('exogenous', 0.047), ('node', 0.047), ('gall', 0.046), ('orders', 0.045), ('directed', 0.044), ('heckerman', 0.041), ('cheng', 0.04), ('descendent', 0.039), ('hail', 0.039), ('nbc', 0.039), ('correctness', 0.038), ('vex', 0.035), ('structures', 0.033), ('cooper', 0.032), ('insurance', 0.032), ('bn', 0.031), ('orientation', 0.031), ('ex', 0.03), ('bayesian', 0.03), ('gd', 0.03), ('dash', 0.03), ('druzdzel', 0.03), ('explorer', 0.03), ('gs', 0.029), ('herskovits', 0.029), ('structure', 0.029), ('topological', 0.029), ('pearl', 0.028), ('ee', 0.028), ('dag', 0.028), ('child', 0.028), ('database', 0.028), ('stages', 0.027), ('directing', 0.027), ('learned', 0.027), ('gtrue', 0.026), ('tiled', 0.026), ('causal', 0.026), ('recursive', 0.026), ('thresholds', 0.026), ('va', 0.025), ('nder', 0.025), ('structural', 0.025), ('pa', 0.025), ('threshold', 0.024), ('network', 0.024), ('nursery', 0.024), ('bnc', 0.023), ('corral', 0.023), ('test', 0.023), ('condition', 0.022), ('networks', 0.021), ('examined', 0.021), ('earning', 0.021), ('fx', 0.021), ('decomposition', 0.021), ('testing', 0.02), ('chess', 0.02), ('afci', 0.019), ('shuttle', 0.019), ('sxy', 0.019), ('yehezkel', 0.019), ('algorithm', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification
Author: Raanan Yehezkel, Boaz Lerner
Abstract: We propose the recursive autonomy identification (RAI) algorithm for constraint-based (CB) Bayesian network structure learning. The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. The sequence of operations is performed recursively for each autonomous substructure while simultaneously increasing the order of the CI test. While other CB algorithms d-separate structures and then direct the resulted undirected graph, the RAI algorithm combines the two processes from the outset and along the procedure. By this means and due to structure decomposition, learning a structure using RAI requires a smaller number of CI tests of high orders. This reduces the complexity and run-time of the algorithm and increases the accuracy by diminishing the curse-of-dimensionality. When the RAI algorithm learned structures from databases representing synthetic problems, known networks and natural problems, it demonstrated superiority with respect to computational complexity, run-time, structural correctness and classification accuracy over the PC, Three Phase Dependency Analysis, Optimal Reinsertion, greedy search, Greedy Equivalence Search, Sparse Candidate, and Max-Min Hill-Climbing algorithms. Keywords: Bayesian networks, constraint-based structure learning
Author: Junning Li, Z. Jane Wang
Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton
3 0.057687383 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
Author: Ricardo Silva, Zoubin Ghahramani
Abstract: Directed acyclic graphs (DAGs) have been widely used as a representation of conditional independence in machine learning and statistics. Moreover, hidden or latent variables are often an important component of graphical models. However, DAG models suffer from an important limitation: the family of DAGs is not closed under marginalization of hidden variables. This means that in general we cannot use a DAG to represent the independencies over a subset of variables in a larger DAG. Directed mixed graphs (DMGs) are a representation that includes DAGs as a special case, and overcomes this limitation. This paper introduces algorithms for performing Bayesian inference in Gaussian and probit DMG models. An important requirement for inference is the specification of the distribution over parameters of the models. We introduce a new distribution for covariance matrices of Gaussian DMGs. We discuss and illustrate how several Bayesian machine learning tasks can benefit from the principle presented here: the power to model dependencies that are generated from hidden variables, but without necessarily modeling such variables explicitly. Keywords: graphical models, structural equation models, Bayesian inference, Markov chain Monte Carlo, latent variable models 1. Contribution The introduction of graphical models (Pearl, 1988; Lauritzen, 1996; Jordan, 1998) changed the way multivariate statistical inference is performed. Graphical models provide a suitable language to decompose many complex real-world processes through conditional independence constraints. Different families of independence models exist. The directed acyclic graph (DAG) family is a particularly powerful representation. Besides providing a language for encoding causal statements (Spirtes et al., 2000; Pearl, 2000), it is in a more general sense a family that allows for non-monotonic independence constraints: that is, models where some independencies can be destroyed by conditioning on new information (also known as the “explaining away” effect — Pearl, 1988), a feature to be expected in many real problems. ∗. Part of this work was done while RS was at the Gatsby Computational Neuroscience Unit, UCL, and at the Statistical Laboratory, University of Cambridge. †. Also affiliated with the Machine Learning Department, Carnegie Mellon University. c 2009 Ricardo Silva and Zoubin Ghahramani. S ILVA AND G HAHRAMANI Y1 Y2 Y4 Y3 Y5 Y6 Y2 Y1 Y4 (a) Y3 Y5 (b) Y2 Y1 Y4 Y3 Y5 (c) Figure 1: Consider the DAG in (a). Suppose we want to represent the marginal dependencies and independencies that result after marginalizing out Y6 . The simplest resulting DAG (i.e., the one with fewest edges) is depicted in (b). However, notice that this graph does not encode some of the independencies of the original model. For instance, Y3 and Y4 are no longer marginally independent in the modified DAGs. A different family of graphical models, encoded with more than one type of edge (directed and bi-directed), is the focus of this paper. The graph in (c) depicts the solution using this “mixed” representation. However, DAG independence models have an undesirable feature: they are not closed under marginalization, as we will illustrate. Consider the regression problem where we want to learn the effect of a cocktail of two drugs for blood pressure, while controlling for a chemotherapy treatment of liver cancer. We refer to Y1 , Y2 as the dosage for the blood pressure drugs, Y3 as a measure of chemotherapy dosage, Y4 as blood pressure, and Y5 as an indicator of liver status. Moreover, let Y6 be an hidden physiological factor that affects both blood pressure and liver status. It is assumed that the DAG corresponding to this setup is given by Figure 1(a). In this problem, predictions concerning Y6 are irrelevant: what we care is the marginal for {Y1 , . . . ,Y5 }. Ideally, we want to take such irrelevant hidden variables out of the loop. Yet the set of dependencies within the marginal for {Y1 , . . . ,Y5 } cannot be efficiently represented as a DAG model. If we remove the edge Y3 → Y4 from Figure 1(b), one can verify this will imply a model where Y3 and Y4 are independent given Y5 , which is not true in our original model. To avoid introducing unwanted independence constraints, a DAG such as the one in Figure 1(b) will be necessary. Notice that in general this will call for extra dependencies that did not exist originally (such as Y3 and Y4 now being marginally dependent). Not only learning from data will be more difficult due to the extra dependencies, but specifying prior knowledge on the parameters becomes less intuitive and therefore more error prone. In general, it will be the case that variables of interest have hidden common causes. This puts the researcher using DAGs in a difficult position: if she models only the marginal comprising the variables of interest, the DAG representation might not be suitable anymore. If she includes all hidden variables for the sake of having the desirable set of independencies, extra assumptions about hidden variables will have to be taken into account. In this sense, the DAG representation is flawed. There is a need for a richer family of graphical models, for which mixed graphs are an answer. Directed mixed graphs (DMGs) are graphs with directed and bi-directed edges. In particular, acyclic directed mixed graphs (ADMGs) have no directed cycle, that is, no sequence of directed edges X → · · · → X that starts and ends on the same node. Such a representation encodes a set of conditional independencies among random variables, which can be read off a graph by using a 1188 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Y1 Y2 Y3 Y1 (a) Y2 Y3 Y1 (b) Y2 Y3 Y1 Y2 (c) Y3 (d) Figure 2: Different examples of directed mixed graphs. The graph in (b) is cyclic, while all others are acyclic. A subgraph of two variables where both edges Y1 → Y2 and Y1 ↔ Y2 are present is sometimes known as a “bow pattern” (Pearl, 2000) due to its shape. Y1 Y1 Y2 H1 Y4 Y3 H2 Y1 Y2 Y4 Y3 Y2 Y3 Y4 Figure 3: After marginalizing variables H1 and H2 from the DAG on the left, one possible DMG representation of the same dependencies is shown by the graph in the middle. Notice that there are multiple DMGs within a same Markov equivalence class, that is, encoding the same set of conditional independencies (Richardson and Spirtes, 2002). The two last graphs above are on the same class. criterion known as m-separation, a natural extension of the d-separation criterion used for directed acyclic graphs (Richardson, 2003). In a ADMG, two adjacent nodes might be connected by up to two edges, where in this case one has to be bi-directed and the other directed. A cyclic model can in principle allow for two directed edges of opposite directions. Figure 2 provides a few examples of DMGs. The appeal of this graphical family lies on the representation of the marginal independence structure among a set of observed variables, assuming they are part of a larger DAG structure that includes hidden variables. This is illustrated in Figure 3.1 More details on DMGs are given in Sections 2 and 8. In our blood pressure\liver status multiple regression problem, the suitable directed mixed graph is depicted in Figure 1(c). The contribution of this paper is how to perform Bayesian inference on two different families of mixed graph models: Gaussian and probit. Markov chain Monte Carlo (MCMC) and variational approximations will be discussed. Current Bayesian inference approaches for DMG models have limitations, as discussed in Section 2, despite the fact that such models are widely used in several sciences. The rest of the paper is organized as follows. Section 3 describes a special case of Gaussian mixed graph models, where only bi-directed edges are allowed. Priors and a Monte Carlo algorithm are described. This case will be a building block for subsequent sections, such as Section 4, where 1. Notice that it is not necessarily the case that the probability model itself is closed under marginalization. This will happen to some models, including the Gaussian model treated in this paper. But the basic claim of closure concerns the graph, that is, the representation of independence constraints. 1189 S ILVA AND G HAHRAMANI Gaussian DMG models are treated. Section 5 covers a type of discrete distribution for binary and ordinal data that is Markov with respect to an acyclic DMG. In Section 6 we discuss more sophisticated algorithms that are useful for scaling up Bayesian learning to higher-dimensional problems. Section 7 presents several empirical studies. Since the use of mixed graph models in machine learning applications is still in its early stages, we briefly describe in Section 8 a variety of possible uses of such graphs in machine learning applications. 2. Basics of DMGs, Gaussian Models and Related Work In this section, we describe the Gaussian DMG model and how it complements latent variable models. At the end of the section, we also discuss a few alternative approaches for the Bayesian inference problem introduced in this paper. 2.1 Notation and Terminology In what follows, we will use standard notions from the graphical modeling literature, such as vertex (node), edge, parent, child, ancestor, descendant, DAG, undirected graph, induced subgraph, Markov condition and d-separation. Refer to Pearl (1988) and Lauritzen (1996) for the standard definitions if needed. Less standard definitions will be given explicitly when appropriate. A useful notion is that of m-separation (Richardson, 2003) for reading off which independencies are entailed by a DMG representation. This can be reduced to d-separation (Pearl, 1988) by the following trick: for each bi-directed edge Yi ↔ Y j , introduce a new hidden variable Xi j and the edges Xi j → Yi and Xi j → Y j . Remove then all bi-directed edges and apply d-separation to the resulting directed graph. As usual, we will refer to vertices (nodes) in a graph and the corresponding random variables in a distribution interchangeably. Data points are represented by vectors with an upper index, such as ( j) Y(1) , Y(2) , . . . , Y(n) . The variable corresponding to node Yi in data point Y( j) is represented by Yi . 2.2 Gaussian Parameterization The origins of mixed graph models can be traced back to Sewall Wright (Wright, 1921), who used special cases of mixed graph representations in genetic studies. Generalizing Wright’s approach, many scientific fields such as psychology, social sciences and econometrics use linear mixed graph models under the name of structural equation models (Bollen, 1989). Only recently the graphical and parametrical aspects of mixed graph models have been given a thorough theoretical treatment (Richardson and Spirtes, 2002; Richardson, 2003; Kang and Tian, 2005; Drton and Richardson, 2008a). In practice, many structural equation models today are Gaussian models. We will work under this assumption unless stated otherwise. For a DMG G with a set of vertices Y, a standard parameterization of the Gaussian model is given as follows. For each variable Yi with a (possibly empty) parent set {Yi1 , ...,Yik }, we define a “structural equation” Yi = αi + bi1Yi1 + bi2Yi2 + · · · + bikYik + εi where εi is a Gaussian random variable with zero mean and variance vii . Notice that this parameterization allows for cyclic models. Unlike in standard Gaussian DAG models, the error terms {εi } are not necessarily mutually independent. Independence is asserted by the graphical structure: given two vertices Yi and Y j , 1190 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS the respective error terms εi and ε j are marginally independent if Yi and Y j are not connected by a bi-directed edge. By this parameterization, each directed edge Yi ← Y j in the graph corresponds to a parameter bi j . Each bi-directed edge Yi ↔ Y j in the graph is associated with a covariance parameter vi j , the covariance of εi and ε j . Each vertex Y j in the graph is associated with variance parameter v j j , the variance of ε j . Algebraically, let B be a m × m matrix, m being the number of observed variables. This matrix is such that (B)i j = bi j if Yi ← Y j exists in the graph, and 0 otherwise. Let V be a m × m matrix, where (V)i j = vi j if i = j or if Yi ↔ Y j is in the graph, and 0 otherwise. Let Y be the column vector of observed variables, α the column vector of intercept parameters, and ε be the corresponding vector of error terms. The set of structural equations can be given in matrix form as Y = BY + α + ε ⇒ Y = (I − B)−1 (ε + α) ⇒ Σ(Θ) = (I − B)−1 V(I − B)−T (1) where A−T is the transpose of A−1 and Σ(Θ) is the implied covariance matrix of the model, Θ ≡ {B, V, α}. 2.2.1 C OMPLETENESS OF PARAMETERIZATION AND A NCESTRAL G RAPHS An important class of ADMGs is the directed ancestral graph. Richardson and Spirtes (2002) provide the definition and a thorough account of the Markov properties of ancestral graphs. One of the reasons for the name “ancestral graph” is due to one of its main properties: if there is a directed path Yi → · · · → Y j , that is, if Yi is an ancestor of Y j , then there is no bi-directed edge Yi ↔ Y j . Thus directed ancestral graphs are ADMGs with this constraint.2 In particular, they show that any Gaussian distribution that is Markov with respect to a given ADMG can be represented by some Gaussian ancestral graph model that is parameterized as above. For the ancestral graph family, the given parameterization is complete: that is, for each Markov equivalence class, it is always possible to choose an ancestral graph where the resulting parameterization imposes no further constraints on the distribution besides the independence constraints of the class. Since the methods described in this paper apply to general DMG models, they also apply to directed ancestral graphs. In principle, it is possible to define and parameterize a Gaussian DAG model that entails exactly the same independence constraints encoded in an directed ancestral graph. One possibility, as hinted in the previous Section, is to replace each bi-directed edge Yi ↔ Y j by a new path Yi ← Xi j → Y j . Variables {Xi j } are “ancillary” hidden variables, in the sense that they are introduced for the sake of obtaining the same independence constraints of an ancestral graph. Standard Bayesian methodology can then be applied to perform inference in this Gaussian DAG model. However, this parameterization might have undesirable consequences, as discussed in Section 8.6 of Richardson and Spirtes (2002). Moreover, when Markov chain Monte Carlo algorithms are applied to compute posteriors, the “ancillary” hidden variables will have to be integrated out numerically. The resulting Markov chain can suffer from substantial autocorrelation when compared to a model with no ancillary variables. We illustrate this behavior in Section 7. Further constraints beyond independence constraints are certainly desirable depending on the context. For instance, general ADMGs that are not ancestral graphs may impose other constraints (Richardson and Spirtes, 2002), and such graphs can still be sensible models of, for example, the 2. Notice this rules out the possibility of having both edges Yi → Y j and Yi ↔ Y j in the same ancestral graph. 1191 S ILVA AND G HAHRAMANI causal processes for the problem at hand. When many observed variables are confounded by a same hidden common cause, models based on factor analysis are appropriate (Silva et al., 2006). However, it is useful to be able to build upon independence models that are known to have a complete parameterization. In any case, even the latent variables in any model might have dependencies that arise from other latent variables that were marginalized, and a latent variable ADMG model will be necessary. When it comes to solving a problem, it is up to the modeler (or learning algorithm) to decide if some set of latent variables should be included, or if they should be implicit, living their hidden life through the marginals. Richardson and Spirtes (2002) provide further details on the advantages of a complete parameterization. Drton and Richardson (2004) provide an algorithm for fitting Gaussian ancestral graph models by maximum likelihood. 2.3 Bayesian Inference The literature on Bayesian structural equation models is extensive. Scheines et al. (1999) describe one of the first approaches, including ways of testings such models. Lee (2007) provides details on many recent advances. Standard Bayesian approaches for Gaussian DMG models rely on either attempting to reduce the problem to inference with DAG models, or on using rejection sampling. In an application described by Dunson et al. (2005), the “ancillary latent” trick is employed, and Gibbs sampling for Gaussian DAG models is used. This parameterization has the disadvantages mentioned in the previous section. Scheines et al. (1999) use the complete parameterization, with a single parameter corresponding to each bi-directed edge. However, the global constraint of positive-definiteness in the covariance matrix is enforced only by rejection sampling, which might be inefficient in models with moderate covariance values. The prior is setup in an indirect way. A Gaussian density function is independently defined for each error covariance vi j . The actual prior, however, is the result of multiplying all of such functions and the indicator function that discards non-positive definite matrices, which is then renormalized. In contrast, the Bayesian approach delineated in the next sections uses the complete parameterization, does not appeal to rejection sampling, makes use of a family of priors which we believe is the natural choice for the problem, and leads to convenient ways of computing marginal likelihoods for model selection. We will also see that empirically they lead to much better behaved Markov chain Monte Carlo samplers when compared to DAGs with ancillary latent variables. 3. Gaussian Models of Marginal Independence This section concerns priors and sampling algorithms for zero-mean Gaussian models that are Markov with respect to a bi-directed graph, that is, a DMG with no directed edges. Focusing on bi-directed graphs simplifies the presentation, while providing a convenient starting point to solve the full DMG case in the sequel. Concerning the notation: the distribution we introduce in this section is a distribution over covariance matrices. In the interest of generality, we will refer to the random matrix as Σ. In the context of the previous section, Σ ≡ Σ(Θ) = V, since we are assuming B = 0, α = 0. 1192 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 3.1 Priors Gaussian bi-directed graph models are sometimes called covariance graph models. Covariance graphs are models of marginal independence: each edge corresponds to a single parameter in the covariance matrix (the corresponding covariance); the absence of an edge Yi ↔ Y j is a statement that σYiY j = 0, σXY being the covariance of random variables X and Y . More precisely, if Σ is a random covariance matrix generated by a covariance model, a distribution of Σ is the distribution over the (non-repeated) entries corresponding to variances and covariances of adjacent nodes.3 In a model with a fully connected bi-directed graph, this reduces to a space of unrestricted covariance matrices. A common distribution for covariance matrices is the inverse Wishart IW (δ, U). In this paper, we adopt the following inverse Wishart parameterization: 1 p(Σ) ∝ |Σ|−(δ+2m)/2 exp − tr(Σ−1 U) , Σ positive definite, 2 p(·) being the density function, tr(·) the trace function, and m the number of variables (nodes) in our model.4 We will overload the symbol p(·) wherever it is clear from the context which density function we are referring to. It is assumed that δ > 0 and U is positive definite. Following Atay-Kayis and Massam (2005), let M + (G ) be the cone of positive definite matrices such that, for a given bi-directed graph G and Σ ∈ M + (G ), σi j = 0 if nodes Yi and Y j are not adjacent in G . It is convenient to choose a distribution that is conjugate to the Gaussian likelihood function, since one can use the same algorithms for performing inference both in the prior and posterior. In a zero-mean Gaussian model, the likelihood function for a fixed data set D = {Y(1) , Y(2) , . . . , Y(n) } is defined by the sufficient statistic S = ∑n (Y(d) )(Y(d) )T as follows: d=1 1 2 L (Σ; D ) = (2π)−nm/2 |Σ|−n/2 exp − tr(Σ−1 S) . (2) We extend the inverse Wishart distribution to the case of constrained covariance matrices in order to preserve conjugacy. This define the following distribution: p(Σ) = 1 1 |Σ|−(δ+2m)/2 exp − tr(Σ−1 U) , Σ ∈ M + (G ) IG (δ, U) 2 (3) which is basically a re-scaled inverse Wishart prior with a different support and, consequently, different normalizing constant IG (δ, U). An analogous concept exists for undirected graphs, where Σ−1 ∈ M + (G ) is given a Wishart-like prior: the “G -Wishart” distribution (Atay-Kayis and Massam, 2005). We call the distribution with density function defined as in Equation (3) the G -Inverse Wishart distribution (G -IW ). It will be the basis of our framework. There are no analytical formulas for the normalizing constant. 3. As such, the density function for Σ is defined with respect to the Lebesgue measure of the non-zero, independent elements of this matrix. 4. We adopt this non-standard parameterization of the inverse Wishart because it provides a more convenient reparameterization used in the sequel. Notice this is the parameterization used by Brown et al. (1993) and Atay-Kayis and Massam (2005), which developed other distributions for covariance matrices. 1193 S ILVA AND G HAHRAMANI 3.2 The Normalizing Constant We now derive a Monte Carlo procedure to compute IG (δ, U). In the sequel, this will be adapted into an importance sampler to compute functionals of a G -IW distribution. The core ideas are also used in a Gibbs sampler to obtain samples from its posterior. The normalizing constant is essential for model selection of covariance graphs. By combining the likelihood equation (2) with the prior (3), we obtain the joint nm p(D , Σ | G ) = (2π)− 2 IG (δ, U)−1 × |Σ|− δ+2m+n 2 1 exp − tr[Σ−1 (S + U)] 2 where we make the dependency on the graphical structure G explicit. By the definition of IG , integrating Σ out of the above equation implies the following marginal likelihood: p(D | G ) = 1 IG (δ + n, S + U) nm IG (δ, U) (2π) 2 from which a posterior P (G | D ) can be easily derived as a function of quantities of the type IG (·, ·). The normalizing constant IG (δ, U) is given by the following integral:5 IG (δ, U) = Z M + (G ) |Σ|− δ+2m 2 1 exp − tr(Σ−1 U) dΣ. 2 (4) The space M + (G ) can be described as the space of positive definite matrices conditioned on the event that each matrix has zero entries corresponding to non-adjacent nodes in graph G . We will reduce the integral (4) to an integral over random variables we know how to sample from. The given approach follows the framework of Atay-Kayis and Massam (2005) using the techniques of Drton and Richardson (2003). Atay-Kayis and Massam (2005) show how to compute the marginal likelihood of nondecomposable undirected models by reparameterizing the precision matrix through the Cholesky decomposition. The zero entries in the inverse covariance matrix of this model correspond to constraints in this parameterization, where part of the parameters can be sampled independently and the remaining parameters calculated from the independent ones. We will follow a similar framework but with a different decomposition. It turns out that the Cholesky decomposition does not provide an easy reduction of (4) to an integral over canonical, easy to sample from, distributions. We can, however, use Bartlett’s decomposition to achieve this reduction. 3.2.1 BARTLETT ’ S D ECOMPOSITION Before proceeding, we will need a special notation for describing sets of indices and submatrices. Let {i} represent the set of indices {1, 2, . . . , i}. Let Σi,{i−1} be the row vector containing the covariance between Yi and all elements of {Y1 ,Y2 , . . . ,Yi−1 }. Let Σ{i−1},{i−1} be the marginal covariance matrix of {Y1 ,Y2 , . . . ,Yi−1 }. Let σii be the variance of Yi . Define the mapping Σ → Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm }, 5. Notice this integral is always finite for any choice of δ > 0 and positive definite U, since it is no greater than the normalizing constant of the inverse Wishart. 1194 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS such that Bi is a row vector with i − 1 entries, γi is a scalar, and γ1 = σ11 , Bi = Σi,{i−1} Σ−1 i > 1, {i−1},{i−1} , −1 γi = σii.{i−1},{i−1} ≡ σii − Σi,{i−1} Σ{i−1},{i−1} Σ{i−1},i , i > 1. (5) The set Φ provides a parameterization of Σ, in the sense that the mapping (5) is bijective. Given that σ11 = γ1 , the inverse mapping is defined recursively by Σi,{i−1} = Bi Σ{i−1},{i−1} , i > 1, σii = γi + Bi Σ{i−1},i , i > 1. (6) We call the set Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm } the Bartlett parameters of Σ, since the decomposition (6) is sometimes known as Bartlett’s decomposition (Brown et al., 1993). For a random inverse Wishart matrix, Bartlett’s decomposition allows the definition of its density function by the joint density of {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm }. Define U{i−1},{i−1} , U{i−1},i and uii.{i−1},{i−1} in a way analogous to the Σ definitions. The next lemma follows directly from Lemma 1 of Brown et al. (1993): Lemma 1 Suppose Σ is distributed as IW (δ, U). Then the distribution of the corresponding Bartlett parameters Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bm , γm } is given by: 1. γi is independent of Φ\{γi , Bi } 2. γi ∼ IG((δ + i − 1)/2, uii.{i−1,i−1} /2), where IG(α, β) is the inverse gamma distribution −1 3. Bi | γi ∼ N(U−1 {i−1},{i−1} U{i−1},i , γi U{i−1},{i−1} ), where N(M, C) is a multivariate Gaussian −1 distribution and U−1 {i−1},{i−1} ≡ (U{i−1},{i−1} ) . 3.2.2 BARTLETT ’ S D ECOMPOSITION OF M ARGINAL I NDEPENDENCE M ODELS What is interesting about Bartlett’s decomposition is that it provides a simple parameterization of the inverse Wishart distribution with variation independent parameters. This decomposition allows the derivation of new distributions. For instance, Brown et al. (1993) derive a “Generalized Inverted Wishart” distribution that allows one to define different degrees of freedom for different submatrices of an inverse Wishart random matrix. For our purposes, Bartlett’s decomposition can be used to reparameterize the G -IW distribution. For that, one needs to express the independent elements of Σ in the space of Bartlett parameters. The original reparameterization maps Σ to Φ ≡ {γ1 , B2 , γ2 , B3 , γ3 , . . . , Bd , γd }. To impose the constraint that Yi and Y j are uncorrelated, for i > j, is to set Bi Σ{i−1},{i−1} j = σYiY j (Φ) = 0. For a fixed Σ{i−1},{i−1} , this implies a constraint on (Bi ) j ≡ βi j . Following the terminology used by Richardson and Spirtes (2002), let a spouse of node Y in a mixed graph be any node adjacent to Y by a bi-directed edge. The set of spouses of Yi is denoted by sp(i). The set of spouses of Yi according to order Y1 ,Y2 , . . . ,Ym is defined by sp≺ (i) ≡ sp(i) ∩ {Y1 , . . . ,Yi−1 }. The set of non-spouses of Yi is denoted by nsp(i). Analogously, nsp≺ (i) ≡ {Y1 , . . . ,Yi−1 }\sp≺ (i). Let Bi,sp≺ (i) be the subvector of Bi corresponding to the the respective spouses of Yi . Define Bi,nsp≺ (i) analogously. 1195 S ILVA AND G HAHRAMANI Given the constraint Bi Σ{i−1},nsp≺ (i) = 0, it follows that Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) + Bi,nsp≺ (i) Σnsp≺ (i),nsp≺ (i) = 0 ⇒ Bi,nsp≺ (i) = −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . nsp (7) Identity (7) was originally derived by Drton and Richardson (2003). A property inherited from the original decomposition for unconstrained matrices is that Bi,sp≺ (i) is functionally independent of Σ{i−1},{i−1} . From (7), we obtain that the free Bartlett parameters of Σ are ΦG ≡ {γ1 , B2,sp≺ (2) , γ2 , B3,sp≺ (3) , γ3 , . . . , Bm,sp≺ (m) , γm }. Notice that, according to (5), Φ corresponds to the set of parameters of a fully connected, zeromean, Gaussian DAG model. In such a DAG, Yi is a child of {Y1 , . . . ,Yi−1 }, and Yi = Bi Yi−1 + ζ j , ζ j ∼ N(0, γ j ) where Yi−1 is the (i − 1) × 1 vector corresponding to {Y1 , . . . ,Yi−1 }. As discussed by Drton and Richardson (2003), this interpretation along with Equation (7) implies Yi = Bi,sp≺ (i) Zi + ζ j (8) where the entries in Zi are the corresponding residuals of the regression of sp≺ (i) on nsp≺ (i). The next step in solving integral (4) is to find the Jacobian J(ΦG ) of the transformation Σ → ΦG . This is given by the following Lemma: Lemma 2 The determinant of the Jacobian for the change of variable Σ → ΦG is m |J(ΦG )| = ∏ |Ri | = i=2 m−1 1 γm−i m ∏i=2 |Σnsp≺(i) ,nsp≺(i) | i=1 i ∏ where Ri ≡ Σsp≺(i) ,sp≺(i) − Σsp≺(i) ,nsp≺(i) Σ−1 ≺(i) ,nsp≺(i) Σnsp≺(i) ,sp≺(i) , that is, the covariance matrix of the nsp / respective residual Zi (as parameterized by ΦG ). If nsp≺(i) = 0, Ri is defined as Σsp≺(i) ,sp≺(i) and |Σnsp≺(i) ,nsp≺(i) | is defined as 1. The proof of this Lemma is in Appendix C. A special case is the Jacobian of the unconstrained covariance matrix (i.e., when the graph has no missing edges): m−1 |J(Φ)| = ∏ γm−i . i (9) i=1 Now that we have the Jacobian, the distribution over Bartlett’s parameters given by Lemma 1, and the identities of Drton and Richardson (2003) given in Equation (7), we have all we need to provide a Monte Carlo algorithm to compute the normalizing constant of a G -IW with parameters (δ, U). Let Σ(ΦG ) be the implied covariance matrix given by our set of parameters ΦG . We start from the integral in (4), and rewrite it as a function of ΦG . This can be expressed by substituting Σ for Σ(ΦG ) and multiplying the integrand by the determinant of the Jacobian. Notice that the parameters in Σ(ΦG ) are variation independent: that is, their joint range is given by the product of 1196 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS their individual ranges (positive reals for the γ variables and the real line for the β coefficients). This range will replace the original M + (G ) space, which we omit below for simplicity of notation: IG (δ, U) = Z |J(ΦG )||Σ(ΦG )|− δ+2m 2 1 exp − tr(Σ(ΦG )−1 U) dΦG . 2 We now multiply and divide the above expression by the normalizing constant of an inverse Wishart (δ, U), which we denote by IIW (δ, U): IG (δ, U) = IIW (δ, U) Z −1 |J(ΦG )| × IIW (δ, U)|Σ(ΦG )|− δ+2m 2 1 exp − tr(Σ(ΦG )−1 U) dΦG . (10) 2 The expression −1 IIW (δ, U)|Σ|− δ+2m 2 1 exp − tr(Σ−1 U) 2 corresponds to the density function of an inverse Wishart Σ. Lemma 1 allows us to rewrite the inverse Wishart density function as the density of Bartlett parameters, but this is assuming no independence constraints. We can easily reuse the result of Lemma 1 as follows: 1. write the density of the inverse Wishart as the product of gamma-normal densities given in Lemma 1; 2. this expression contains the original Jacobian determinant |J(Φ)|. We have to remove it, since we are plugging in our own Jacobian determinant. Hence, we divide the reparameterized density by the expression in Equation (9). This ratio |J(ΦG )|/|J(Φ)| can be rewritten as m |J(ΦG )| |Ri | 1 = ∏ m−i = m |J(Φ)| ∏i=2 |Σnsp≺(i) ,nsp≺(i) | i=1 γi / where |Σnsp≺(i) ,nsp≺(i) | ≡ 1 if nsp≺ (i) = 0; 3. substitute each vector Bi,nsp≺ (i) , which is not a free parameter, by the corresponding expression −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . nsp This substitution takes place into the original factors given by Bartlett’s decomposition, as introduced in Lemma 1: −(i−1)/2 p(Bi , γi ) = (2π)−(i−1)/2 γi |U{i−1},{i−1} |1/2 1 × exp − (BiT − Mi )T U{i−1},{i−1} (BiT − Mi ) 2γi (uii.{i−1},{i−1} /2)(δ+i−1)/2 −( δ+i−1 +1) 1 exp − uii.{i−1},{i−1} × γi 2 Γ((δ + i − 1)/2) 2γi where Mi ≡ U−1 {i−1},{i−1} U{i−1},i . Plugging in this in (10) results in 1197 (11) S ILVA AND G HAHRAMANI IG (δ, U) = IIW (δ, U) Z 1 ∏m |Σnsp≺(i) ,nsp≺(i) | i=2 m × p(γ1 ) ∏ p(Bi , γi ) dΦG . i=2 However, after substitution, each factor p(Bi , γi ) is not in general a density function for {Bi,sp≺ (i) , γi } and will include also parameters {B j,sp≺ ( j) , γ j }, j < i. Because of the non-linear relationships that link Bartlett parameters in a marginal independence model, we cannot expect to reduce this expression to a tractable distribution we can easily sample from. Instead, we rewrite each original density factor p(Bi , γi ) such that it includes all information about Bi,sp≺ (i) and γi within a canonical density function. That is, factorize p(Bi , γi ) as p(Bi , γi |Φi−1 ) = pb (Bi,sp≺ (i) |γi , Φi−1 )pg (γi |Φi−1 ) × fi (Φi−1 ) (12) where we absorb any occurrence of Bi,sp≺ (i) within the sampling distribution and factorize the remaining dependence on previous parameters Φi−1 ≡ {γ1 , γ2 , B2,sp≺ (2) , . . . , γi−1 , Bi−1,sp≺ (i−1) } into a separate function.6 We derive the functions pb (·), pg (·) and fi (·) in Appendix A. The result is as follows. The density pb (Bi,sp≺ (i) |γi , Φi−1 ) is the density of a Gaussian N(Ki mi , γi Ki ) such that mi = (Uss − Ai Uns )Msp≺ (i) + (Usn − Ai Unn )Mnsp≺ (i) , K−1 = Uss − Ai Uns − Usn AT + Ai Unn AT , i i i (13) Ai = Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) nsp where Uss Usn Uns Unn ≡ Usp≺ (i),sp≺ (i) Usp≺ (i),nsp≺ (i) Unsp≺ (i),sp≺ (i) Unsp≺ (i),nsp≺ (i) . (14) The density pg (γi |Φi−1 ) is the density of an inverse gamma IG(g1 , g2 ) such that δ + i − 1 + #nsp≺ (i) , 2 uii.{i−1},{i−1} + Ui = , 2 TU = Mi {i−1},{i−1} Mi − mT Ki mi . i g1 = g2 Ui where uii.{i−1},{i−1} was originally defined in Section 3.2.1. Finally, (i−1)−#sp≺ (i) 2 |Ki |1/2 |U{i−1},{i−1} |1/2 fi (Φi−1 ) ≡ (2π)− (uii.{i−1},{i−1} /2)(δ+i−1)/2 Γ((δ + i − 1 + #nsp≺ (i))/2) . × Γ((δ + i − 1)/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 6. A simpler decomposition was employed by Silva and Ghahramani (2006) (notice however that paper used an incorrect expression for the Jacobian). The following derivation, however, can be adapted with almost no modification to define a Gibbs sampling algorithm, as we show in the sequel. 1198 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS / Density function pb (Bi,sp≺ (i) |·, ·) and determinant |Ki |1/2 are defined to be 1 if sp≺ (i) = 0. Ui TU / / is defined to be zero if nsp≺ (i) = 0, and Ui = Mi {i−1},{i−1} Mi if sp≺ (i) = 0. The original normalizing constant integral is the expected value of a function of ΦG over a factorized inverse gamma-normal distribution. The density function of this distribution is given below: m pI(δ,U) (ΦG ) = m ∏ pg (γi |Φi−1 ) ∏ pb (Bi,sp i=1 ≺ (i) |γi , Φi−1 ) . i=2 We summarize the main result of this section through the following theorem: Theorem 3 Let f (X) p(X) be the expected value of f (X) where X is a random vector with density p(X). The normalizing constant of a G -Inverse Wishart with parameters (δ, U) is given by m IG (δ, U) = IIW (δ, U) × fi (Φi−1 ) ∏ |Σnsp i=1 ≺(i) ,nsp≺(i) | . pI(δ,U) (ΦG ) This can be further simplified to m IG (δ, U) = fi′ (Φi−1 ) ∏ |Σnsp i=1 ≺(i) ,nsp≺(i) | (15) pI(δ,U) (ΦG ) where fi′ (Φi−1 ) ≡ (2π) #sp≺ (i) 2 |Ki (Φi−1 )|1/2 Γ((δ + i − 1 + #nsp≺ (i))/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 which, as expected, reduces IG (δ, U) to IIW (δ, U) when the graph is complete. (1) (2) A Monte Carlo estimate of IG (δ, U) is then given from (15) by obtaining samples {ΦG , ΦG , (M) . . . , ΦG } according to pI(δ,U) (·) and computing: (s) fi′ (Φi−1 ) 1 M m IG (δ, U) ≈ ∑∏ M s=1 i=1 |Σnsp ,nsp (Φ(s) )| i−1 ≺(i) ≺(i) where here we emphasize that Σnsp≺(i) ,nsp≺(i) is a function of ΦG as given by (6). 3.3 General Monte Carlo Computation If Y follows a Gaussian N(0, Σ) where Σ is given a G -IW (δ, U) prior, then from a sample D = {Y(1) , . . . , Y(n) } with sufficient statistic S = ∑n (Y(d) )(Y(d) )T , the posterior distribution for Σ d=1 given S will be a G -IW (δ + n, U + S). In order to obtain samples from the posterior or to compute its functionals, one can adapt the algorithm for computing normalizing constants. We describe an importance sampler for computing functionals, followed by a Gibbs sampling algorithm that also provides samples from the posterior. 1199 S ILVA AND G HAHRAMANI Algorithm S AMPLE GIW-1 Input: a m × m matrix U, scalar δ, bi-directed graph G , an ordering ≺ 1. Let Σ be a m × m matrix 2. Define functions sp≺ (·), nsp≺ (·) according to G and ordering ≺ 3. Sample σ11 from IG(δ/2, u11 /2) 4. For i = 2, 3, . . . , m 5. Sample γi ∼ IG((δ + i − 1 + #nsp≺ (i))/2, (uii.{i−1},{i−1} + Ui )/2) 6. Sample Bi,sp≺ (i) ∼ N(Ki mi , γi Ki ) 7. Set Bi,nsp≺(i) = −Bi,sp≺ (i) Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) nsp 8. Set ΣT {i−1},i = Σi,{i−1} = Bi Σ{i−1},{i−1} 9. Set σii = γi + Bi Σi,{i−1} 10. Set w = ∏m fi′ (Φi−1 )/|Σnsp≺ (i),nsp≺ (i) | i=1 11. Return (w, Σ). Figure 4: A procedure for generating an importance sample Σ and importance weight w for computing functionals of a G -Inverse Wishart distribution. Variables {Mi , mi , Ki , Ui } and function fi′ (Φi−1 ) are defined in Section 3.2.2. 3.3.1 T HE I MPORTANCE S AMPLER One way of computing functionals of the G -IW distribution, that is, functions of the type g(δ, U; G ) ≡ Z M + (G ) g(Σ)p(Σ | δ, U, G ) dΣ is through the numerical average g(δ, U; G ) ≈ ∑M ws g(Σ(s) ) s=1 , ∑M ws s=1 where weights {w1 , w2 , . . . , wM } and samples {Σ(1) , Σ(2) , . . . , Σ(M) } are generated by an importance sampler. The procedure for computing normalizing constants can be readily adapted for this task using pI(δ,U) (·) as the importance distribution and the corresponding weights from the remainder factors. The sampling algorithm is shown in Figure 4. 3.3.2 T HE G IBBS S AMPLER While the importance sampler can be useful to compute functionals of the distribution, we will need a Markov chain Monte Carlo procedure to sample from the posterior. In the Gibbs sampling 1200 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Algorithm S AMPLE GIW-2 Input: a m × m matrix U, scalar δ, bi-directed graph G , a m × m matrix Σstart 1. Let Σ be a copy of Σstart 2. Define functions sp(·), nsp(·) according to G 3. For i = 1, 2, 3, . . . , m 4. Sample γi ∼ IG((δ + (m − 1) + #nsp(i))/2, (uii.{\i},{\i} + U\i )/2) 5. Sample Bi,sp(i) from a N(K\i m\i , γi K\i ) 6. Set Bi,nsp(i) = −Bi,sp(i) Σsp(i),nsp(i) Σ−1 nsp(i),nsp(i) 7. Set ΣT = Σi,{\i} = Bi Σ{\i},{\i} {\i},i 8. Set σii = γi + Bi Σi,{\i} 9. Return Σ. Figure 5: A procedure for generating a sampled Σ within a Gibbs sampling procedure. procedure, we sample the whole i-th row of Σ, for each 1 ≤ i ≤ m, by conditioning on the remaining independent entries of the covariance matrix as obtained on the previous Markov chain iteration. The conditional densities required by the Gibbs sampler can be derived from (12), which for a particular ordering ≺ implies m p(Σ; δ, U, G ) ∝ pg (γ1 ) ∏ pb (Bi,sp≺ (i) |γi , Φi−1 )pg (γi |Φi−1 ) fi (Φi−1 ). i=2 By an abuse of notation, we used Σ in the left-hand side and the Bartlett parameters in the righ-hand side. The conditional density of {Bm,sp≺ (m) , γm } given all other parameters is therefore p(Bm,sp≺ (m) , γm |ΦG \{Bm,sp≺ (m) , γm }) = pb (Bm,sp≺ (m) |γm , Φm−1 )pg (γm |Φm−1 ) from which we can reconstruct a new sample of the m-th row/column of Σ after sampling {Bm,sp≺ (m) , γm }. Sampling other rows can be done by redefining a new order where the corresponding target variable is the last one. More precisely: let {\i} denote the set {1, 2, . . . , i − 1, i + 1, . . . , m}. The Gibbs algorithm is analogous to the previous algorithms. Instead of sp≺ (i) and nsp≺ (i), we refer to the original sp(i) and nsp(i). Matrices Σ{\i},{\i} and U{\i},{\i} are defined by deleting the respective i-th row and i-th columns. Row vector Σi,{\i} and scalar uii.{\i} are defined accordingly, as well as any other vector and matrix originally required in the marginal likelihood/importance sampling procedure. The algorithm is described in Figure 5. The procedure can be interpreted as calling a modification of the importance sampler with a dynamic ordering ≺i which, at every step, moves Yi to the end of the global ordering ≺. 1201 S ILVA AND G HAHRAMANI 3.4 Remarks The importance sampler suffers from the usual shortcomings in high-dimensional problems, where a few very large weights dominate the procedure (MacKay, 1998). This can result in unstable estimates of functionals of the posterior and the normalizing constant. The stability of the importance sampler is not a simple function of the number of variables in the domain. For large but sparse graphs, the number of parameters might be small. For large but fairly dense graphs, the importance distribution might be a good match to the actual distribution since there are few constraints. In Section 7, we performe some experiments to evaluate the sampler. When used to compute functionals, the Gibbs sampler is more computationally demanding considering the cost per step, but we expect it to be more robust in high-dimensional problems. In problems that require repeated calculations of functionals (such as the variational optimization procedure of Section 4.3), it might be interesting to run a few preliminary comparisons between the estimates of the two samplers, and choose the (cheaper) importance sampler if the estimates are reasonably close. Na¨vely, the Gibbs sampler costs O(m4 ) per iteration, since for each step we have to invert the ı matrix Σnsp{\i},nsp{\i} , which is of size O(m) for sparse graphs. However, this inversion can cost much less than O(m3 ) if sparse matrix inversion methods are used. Still, the importance sampler can be even more optimized by using the methods of Section 6. 4. Gaussian Directed Mixed Graph Models As discussed in Section 2, Gaussian directed mixed graph models are parameterized by the set with parameters Θ = {V, B, α}. Our prior takes the form p(Θ) = p(B)p(α)p(V). We assign priors for the parameters of directed edges (non-zero entries of matrix B) in a standard way: each parameter bi j is given a Gaussian N(cBj , sBj ) prior, where all parameters are marginally independent in the i i prior, that is, p(B) = ∏i j p(bi j ). The prior for intercept parameters α is analogous, with αi being a Gaussian N(cα , sα ). i i Recall from Equation (1) that the implied covariance of the model is given by the matrix Σ(Θ) = (I − B)−1 V(I − B)−T . Similarly, we have the implied mean vector µ(Θ) ≡ (I − B)−1 α. The likelihood function for data set D = {Y(1) , Y(2) , . . . , Y(n) } is defined as L (Θ; D ) = |Σ(Θ)|−n/2 ∏n exp − 1 (Y(d) − µ(Θ))T Σ(Θ)−1 (Y(d) − µ(Θ)) d=1 2 = |(I − B)−1 ||V||(I − B)−T | −n/2 exp − 1 tr(V−1 (I − B)S(I − B)T ) 2 , where now S ≡ ∑n (Y(d) − µ(Θ))(Y(d) − µ(Θ))T . d=1 Given a prior G -IW (δ, U) for V, it immediately follows that the posterior distribution of V given the data and other parameters is V | {B, α, D } ∼ G -IW (δ + n, U + (I − B)S(I − B)T ). Therefore it can be sampled using the results from the previous section. Notice this holds even if the directed mixed graph G is cyclic. Sampling αi given {D , Θ\{αi }} can also be done easily for both cyclic and acyclic models: the ′ ′ ′ posterior is given by a normal N(cα /sα , 1/sα ) where i i i 1202 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS sα i cα i ′ ≡ ′ ≡ 1 + n(V−1 )ii , sα i n m m cα (d) (d) −1 −1 i , − ∑ bt pt Ypt α − n ∑ (V )it αt + ∑ ∑ (V )it Yt si pt d=1 t=1 t=1,t=i with pt being an index running over the parents of Yt in G . However, sampling the non-zero entries of B results in two different cases depending whether G is cyclic or not. We deal with them separately. 4.1 Sampling from the Posterior: Acyclic Case The acyclic case is simplified by the fact that I − B can be rearranged in a way it becomes lower triangular, with each diagonal element being 1. This implies the identity |(I−B)−1 ||V||(I−B)−T | = |V|, with the resulting log-likelihood being a quadratic function of the non-zero elements of B. Since the prior for coefficient bi j is Gaussian, its posterior given the data and all other parameters will be ′ ′ ′ the Gaussian N(cbj /sbj , 1/sbj ) where i i i ′ sbj i ′ cbj i ≡ ≡ n 1 (d) + (V−1 )ii ∑ (Y j )2 , b si j d=1 m n cbj i (d) (d) (d) + ∑ Y j ∑ (V−1 )it Yt − ∑ bt pt Ypt − αt . b si j d=1 t=1 pt ,(t,pt )=(i, j) (16) As before, pt runs over the indices of the parents of Yt in G . Notice that in the innermost (d) summation we exclude bi jY j . We can then sample bi j accordingly. It is important to notice that, in practice, better mixing behavior can be obtained by sampling the coefficients (and intercepts) jointly. The joint distribution is Gaussian and can be obtained in a way similar to the above derivation. The derivation of the componentwise conditionals is nevertheless useful in the algorithm for cyclic networks. 4.2 Sampling from the Posterior: Cyclic Case Cyclic directed graph models have an interpretation in terms of causal systems in equilibrium. The simultaneous presence of directed paths Yi → · · · → Y j and Y j → · · · → Yi can be used to parameterize instantaneous causal effects in a feedback loop (Spirtes, 1995). This model appears also in the structural equation modeling literature (Bollen, 1989). In terms of cyclic graphs as families of conditional independence constraints, methods for reading off constraints in linear systems also exist (Spirtes et al., 2000). The computational difficulty in the cyclic case is that the determinant |I − B| is no longer a constant, but a multilinear function of coefficients {bi j }. Because bi j will appear outside the exponential term, its posterior will no longer be Gaussian. From the definition of the implied covariance matrix Σ(Θ), it follows that |Σ(Θ)|−n/2 = (|I − B||V|−1 |I − B|)n/2 . As a function of coefficient bi j , |I − B| = (−1)i+ j+1Ci j bi j + k=m ∑ k=1,k= j 1203 (−1)i+k+1Cik bik , S ILVA AND G HAHRAMANI where Ci j is the determinant of respective co-factor of I − B, bik ≡ 0 if there is no edge Yi ← Yk , and bii ≡ −1. The resulting density function of bi j given D and Θ\{bi j } is ′ p(bi j |Θ\{bi j }, D ) ∝ |bi j − κi j | exp − n ′ (bi j − cbj /sbj )2 i i ′ 2sbj i , where κi j ≡ Ci−1 j k=m ∑ (−1)k− j+1Cik bik k=1,k= j ′ ′ and {cbj , sbj } are defined as in Equation (16). Standard algorithms such as Metropolis-Hastings can i i be applied to sample from this posterior within a Gibbs procedure. 4.3 Marginal Likelihood: A Variational Monte Carlo Approach While model selection of bi-directed graphs can be performed using a simple Monte Carlo procedure as seen in the previous Section, the same is not true in the full Gaussian DMG case. Approaches such as nested sampling (Skilling, 2006) can in principle be adapted to deal with the full case. For problems where there are many possible candidates to be evaluated, such a computationally demanding sampling procedure might be undesirable (at least for an initial ranking of graphical structures). As an alternative, we describe an approximation procedure for the marginal likelihood p(D |G ) by combining variational bounds (Jordan et al., 1998) with the G -Inverse Wishart samplers, and therefore avoiding a Markov chain over the joint model of coefficients and error covariances. This is described for acyclic DMGs only. We adopt the following approximation in our variational approach, accounting also for possible latent variables X: n p(V, B, α, X|D ) ≈ q(V)q(B, α) ∏ q(X(d) ) ≡ q(V)q(B, α)q(X) d=1 with q(B, α) being a multivariate Gaussian density of the non-zero elements of B and α. Function q(X(d) ) is also a Gaussian density, and function q(V) is a G -Inverse Wishart density. From Jensen’s inequality, we obtain the following lower-bound (Beal, 2003, p. 47): ln p(D |G ) = ln p(Y, X|V, B, α)p(V, B, α) dX dB dV dα ≥ ln p(Y, X|V, B, α) q(V)q(B,α)q(X) + ln p(V)/q(V) q(V) + ln p(B, α)/q(B, α) q(B,α) − ln q(X) q(X) R (17) where this lower bound can be optimized with respect to functions q(V), q(B), q(X). This can be done by iterative coordinate ascent, maximizing the bound with respect to a single q(·) function at a time. The update of q(V) is given by qnew (V) = pG -IW (δ + d, U + (I − B)S(I − B)T 1204 q(X)q(B,α) ) BAYESIAN L EARNING WITH M IXED G RAPH M ODELS where pG -IW (·) is the density function for a G -Inverse Wishart, and S is the empirical second moment matrix summed over the completed data set (X, Y) (hence the expectation over q(X)) centered at µ(Θ). The updates for q(B, α) and q(X) are tedious but straightforward derivations, and described in Appendix B. The relevant fact about these updates is that they are functions of V−1 q(V) . Fortunately, we pay a relatively small cost to obtain these inverses using the Monte Carlo sampler of Figure 4: from the Bartlett parameters, define a lower triangular m × m matrix B (by placing on the ith line the row vector Bi , followed by zeroes) and a diagonal matrix Γ from the respective vector of γi ’s. The matrix V−1 can be computed from (I − B )T Γ−1 (I − B ), and the relevant expectation computed according to the importance sampling procedure. For problems of moderate dimensionality,7 the importance sampler might not be recommended, but the Gibbs sampler can be used. At the last iteration of the variational maximization, the (importance or posterior) samples from q(V) can then be used to compute the required averages in (17), obtaining a bound on the marginal log-likelihood of the model. Notice that the expectation ln p(V)/q(V) q(V) contains the entropy of q(V), which will require the computation of G -inverse Wishart normalizing constants. For large problems, the cost of this approximation might still be prohibitive. An option is to partially parameterize V in terms of ancillary latents and another submatrix distributed as a G -inverse Wishart, but details on how to best do this partition are left as future work (this approximation will be worse but less computationally expensive if ancillary latents are independent of the coefficient parameters in the variational density function q(·)). Laplace approximations might be an alternative, which have been successfully applied to undirected non-decomposable models (Roverato, 2002). We emphasize that the results present in this section are alternatives that did not exist before in previous approaches for learning mixed graph structures through variational methods (e.g., Silva and Scheines, 2006). It is true that the variational approximation for marginal likelihoods will tend to underfit the data, that is, generate models simpler than the true model in simulations. Despite the bias introduced by the method, this is less of a problem for large data sets (Beal and Ghahramani, 2006) and the method has been shown to be useful in model selection applications (Silva and Scheines, 2006), being consistently better than standard scores such as BIC when hidden variables are present (Beal and Ghahramani, 2006). An application in prediction using the variational posterior instead of MCMC samples is discussed by Silva and Ghahramani (2006). It is relevant to explore other approaches for marginal likelihood evaluation of DMG models using alternative methods such as annealed importance sampling (Neal, 2001) and nested sampling (Skilling, 2006), but it is unrealistic to expect that such methods can be used to evaluate a large number of candidate models. A pre-selection by approximations such as variational methods might be essential. 5. Discrete Models: The Probit Case Constructing a discrete mixed graph parameterization is not as easy as in the Gaussian case. Advances in this area are described by Drton and Richardson (2008a), where a complete parameterization of binary bi-directed graph models is given. In our Bayesian context, inference with the mixed graph discrete models of Drton and Richardson would not to be any computationally easier than the case for Markov random fields, which has been labeled as doubly-intractable (Murray et al., 2006). 7. We observed a high ratio of the highest importance weight divided by the median weight in problems with dimensionality as low as 15 nodes. However, notice that in practice the error covariance matrix V has a block diagonal structure, and only the size of the largest block is relevant. This is explained in more detail in Section 6. 1205 S ILVA AND G HAHRAMANI Instead, in this paper we will focus on a class of discrete models that has been widely used in practice: the probit model (Bartholomew and Knott, 1999). This model is essentially a projection of a Gaussian distribution into a discrete space. It also allows us to build on the machinery developed in the previous sections. We will describe the parameterization of the model for acyclic DMGs, and then proceed to describe algorithms for sampling from the posterior distribution. 5.1 Parameterizing Models of Observable Independencies A probit model for the conditional probability of discrete variable Yi given a set of variables {Yi1 , ..., Yik } can be described by the two following relationships: P (Yi = vi l Yi⋆ = αi + bi1Yi1 + bi2Yi2 + · · · + bikYik + εi | Yi⋆ ) = 1(τi ≤ Yi⋆ < τi ) l−1 l (18) where P (·) is the probability mass function of a given random variable, as given by the context, and 1(·) is the indicator function. Yi assumes values in {vi , vi , . . . , vi }. Thresholds {τi = −∞ < τi < 1 1 0 1 κ(i) τi < · · · < τi = ∞} are used to define the mapping from continuous Yi⋆ to discrete Yi . This model 2 κ(i) has a sensible interpretation for ordinal and binary values as the discretization of some underlying latent variable (UV) Yi⋆ . Such a UV is a conditionally Gaussian random variable, which follows by assuming normality of the error term εi . This formulation, however, is not appropriate for general discrete variables, which are out of the scope of this paper. Albert and Chib (1993) describe alternative Bayesian treatments of discrete distributions not discussed here. Given this binary/ordinal regression formulation, the natural step is how to define a graphical model accordingly. As a matter of fact, the common practice does not strictly follow the probit regression model. Consider the following example: for a given graph G , a respective graphical representation of a probit model can be built by first replicating G as a graph G ⋆ , where each vertex Yi is relabeled as Yi⋆ . Those vertices represent continuous underlying latent variables (UVs). To each vertex Yi⋆ in G ⋆ , we then add a single child Yi . We call this the Type-I UV model. Although there are arguments for this approach (see, for instance, the arguments by Webb and Forster (2006) concerning stability to ordinal encoding), this is a violation of the original modeling assumption as embodied by G : if the given graph is a statement of conditional independence constraints, it is expected that such independencies will be present in the actual model. The Type-I formulation does not fulfill this basic premise: by construction there are no conditional independence constraints among the set of variables Y (the marginal independencies are preserved, though). This is illustrated by Figure 6(b), where the conditional independence of Y1 and Y3 given Y2 disappears. An alternative is illustrated in Figure 6(c). Starting from the original graph G (as in Figure 6(a)), the probit graph model G ⋆ shown in the Figure is built from G by the following algorithm: 1. add to empty graph G ⋆ the vertices Y of G , and for each Yi ∈ Y, add a respective UV Yi⋆ and the edge Yi⋆ → Yi ; 2. for each Yi → Y j in G , add edge Yi → Y j⋆ to G ⋆ ; 3. for each Yi ↔ Y j in G , add edge Yi⋆ ↔ Y j⋆ to G ⋆ ; We call this the Type-II UV model, which has the following property (the proof is in Appendix C): 1206 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Y* 1 Y1 Y2 (a) Y2 Y* 2 Y3 Y* 4 Y4 Y1 Y3 Y* 1 Y* 3 (b) Y2 Y* 2 Y* 4 Y4 Y1 Y3 Y* 3 Y4 (c) Figure 6: The model in (a) has at least two main representations as a probit network. In (b), the original structure is given to the underlying variables, with observed variables being children of their respective latents. In (c), the underlying variable inherits the parents of the original variable and the underlying latents of the spouses. Theorem 4 Suppose G is acyclic with vertex set Y. Yi and Y j are m-separated given Z ⊆ Y\{Yi ,Y j } in G if and only if Yi and Y j are m-separated given Z in G ⋆ . The parameterization of the Type-II UV model follows from the definition of probit regression: the conditional distribution Yi given its parents in {Yi1 , ...,Yik } in G is given as in Equation (18), while the error terms {ε1 , ε2 , . . . , εm } follow the multivariate Gaussian N(0, V). The entry corresponding to the covariance of εi and ε j is assumed to be zero if there is no bi-directed edge Yi ↔ Y j in G . In what follows, we discuss algorithms for Type-II models. The approach here described can be easily adapted to cover Type-I models. We say that Type-II models are models of observable independencies, since independencies hold even after marginalizing all UVs. 5.2 Algorithm As before, we provide a Gibbs sampling scheme to sample parameters Θ = {α, B, V, T } from the posterior distribution given data set D = {Y(1) , Y(2) , . . . , Y(n) }. The set T = {Ti } is the set of threshold parameters, Ti = {τi = −∞ < τi < τi < · · · < τi = ∞} for each random variable Yi with 2 1 0 κ(i) κ(i) different values. We will not discuss priors and algorithms for sampling T given the other parameters: this can be done by standard approaches (e.g., Albert and Chib, 1993).8 For the purposes of the Gibbs procedure, we augment the data set with the underlying variables ⋆ = {Y⋆(1) , Y⋆(2) , . . . , Y⋆(n) } at each sampling step. D From the set of structural equations Y⋆(d) = α + BY(d) + ε it follows that the conditional distribution of Y⋆(d) given the D ∪ Θ is a truncated Gaussian with mean α + BY(d) and covariance matrix V. The truncation levels are given by the thresholds and (d) ⋆(d) observed data Y(d) : for each Yi = vi , the range for Yi becomes [τi , τi ). Sampling from a l l−1 l truncated Gaussian is a standard procedure. We used the algorithm of Kotecha and Djuric (1999) in our implementation. To sample V from its conditional, we will rely on the following result. 8. In Section 7, we perform experiments with binary data only. In this case, the thresholds are set to fixed values: {τi = −∞, τi = 0, τi = ∞} for all 0 ≤ i ≤ m. 0 1 2 1207 S ILVA AND G HAHRAMANI Proposition 5 Let G be an acyclic DMG, and (α, B, V, T ) be the respective set of parameters that defines the probit model. For a fixed (α, B, T ), there is a bijective function fBαT (·) mapping Y⋆ to ε. This is not true in general if G is cyclic. Proof: If the graph is acyclic, this follows directly by recursively solving the model equations, starting from those corresponding to Y j⋆ vertices with no parents. This results in ε = Y⋆ − α − BY, as expected. For cyclic graphs, the following model provides a counter-example. Let the graph be Y1⋆ → Y1 → Y2⋆ → Y2 → Y1⋆ . Let the model be Y1⋆ = Y2 + ε1 ,Y2⋆ = Y1 + ε2 , that is, b12 = b21 = 1 and α = 0. Let the variables be binary, with a threshold at zero (Yi = 1 if and only if Yi⋆ ≥ 0). Then the two instantiations (Y1⋆ = −0.8,Y2⋆ = 0), (Y1⋆ = 0.2,Y2⋆ = 1) imply the same pair (ε1 = −0.8, ε2 = 0). The negative result for discrete models with cycles is the reason why such models are out of the scope of the paper. ⋆ Let Dε = {ε(1) , . . . , ε(n) }, where ε(d) = fBαT (y(d)⋆ ). Due to this bijection (and the determinism mapping Y⋆ to Y), the density p(V | Θ\V, D , D ⋆ ) = p(V | Θ\V, D ⋆ ) = p(V | Θ\V, y(1)⋆ , . . . , y(d)⋆ ) is equivalent to p(V | Θ\V, D ⋆ ) = = ∝ ∝ ⋆ p(V | α, B, T , D ⋆ , Dε ) ⋆) p(V | α, B, T , Dε ⋆ p(V | α, B, T )p(Dε | α, B, T , V) n (d) | V). p(V) ∏d=1 p(ε For the given data set D ∪ D ⋆ , define S⋆ as the sum of (Y⋆(d) − α − BY(d) )(Y⋆(d) − α − BY(d) )T over all d ∈ {1, 2, . . . , n}. Since p(ε | V) is normal with zero mean and covariance matrix V, the posterior for V given all other parameters and variables is V | {Θ\V, D , D ⋆ } ∼ G -IW (δ + n, U + S⋆ ). Sampling B and α is analogous to the Gaussian case, except that we have to consider that the left-hand side of the structural equations now refer to Y⋆ . We give the explicit conditional for αi , with the conditional for bi j being similarly adapted from Section 4. The posterior for αi is given by a normal N((s′ )−1 m′ , s′ ) where i i i sα i ′ cα i ′ = = 1 + n(V−1 )ii , sα i m n m cα (d) ⋆(d) i − n ∑ (V−1 )it αt + ∑ ∑ (V−1 )it Yt . − ∑ bt pt Ypt sα pt i d=1 t=1 t=1,t=i 5.3 A Note on Identifiability The scale of the underlying latent variables in the probit model is arbitrary. As such, it has been often suggested that such latents should have constant (e.g., unity) variance (Pitt et al., 2006). There are two usual arguments for fixing the variance: improving the interpretability of the model, and improving the mixing of the Markov chain. The interpretability argument is not particularly appealing within the Bayesian setting with proper priors, such as the one proposed in this paper: the posterior distribution of the parameters is well-defined by the prior uncertainty and the data. 1208 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS The goal of improving the mixing of the chain might be important: if some parameters can assume arbitrary values and still allow for the same model over the observables, then fixing such parameters may help sampling by eliminating largely flat regions from the posterior (which will happen for large data sets and broad priors). In practice, however, scaling UVs might not be advantageous. In some cases it might increase the computational cost of each sampling step, while sampling from the non-scaled model might work just fine. Many MCMC algorithms work well on highly unidentifiable models such as multilayer perceptrons (Neal, 1996). In our experiments, we do not use any scaling. 5.4 Remarks It is clear that the given approach can be generalized to other generalized linear models by changing the link function that maps underlying latent variables (UVs) to observables. For instance, a model containing discrete and continuous variables can be constructed by using the identity link function instead of probit for the continuous variables. Notice that the continuous variables will not necessarily be marginally Gaussian if some of its parents are discrete. Other link functions will have different parameters besides thresholds, such as in multivalued (“polychotomous”) discrete distributions. A Bayesian account of Gaussian copula models is given by Pitt et al. (2006), to which a DMG-based family could in principle be defined. For continuous, marginally non-Gaussian, variables joined by a Gaussian copula, it is possible that all link functions are invertible. In this case, it is easier in principle to define cyclic models through Type-I UV models (e.g., Figure 6(b)) while preserving the observable independencies. It is important to point out that Type-II probit models with Markov equivalent graphs will not, in general, be likelihood equivalent. A simple example is given by the two-node graphs Y1 → Y2 and Y1 ↔ Y2 : if Y1 is binary, then the marginal for Y2 in the first case is equivalent to having an underlying latent variable that follows a mixture of two Gaussians. While some of these issues can be solved by adopting a mixture of Gaussians marginal independence model to account for bi-directed edges (Silva and Ghahramani, 2009), details need to be worked out. When the goal of model selection is to find causal structures (Spirtes et al., 2000), the usual guarantees of search methods based on Markov equivalence classes do not hold. However, it remains to be seen whether the parametric constraints implied by the Type-II formulation will allow for other consistent approaches for causal discovery, as shown in the case of non-linearities with additive noise (Hoyer et al., 2008). 6. Scaling Up: Factorizations and Perfect Sequences Each Monte Carlo sampling step for the given mixed graph models is theoretically tractable, but not necessarily practical when the dimensionality m of the data is high. By using clever factorizations of the graph and ordering of the variables, it is possible to sometimes scale to high-dimensional problems. In this section, we describe approaches to minimize the run-time of the marginal likelihood computation for bi-directed graphs, which is also important for computing variational bounds for DMG models. We start, however, with a discussion on factorizations of the posterior density for coefficient parameters B. The context is the Gibbs sampler for acyclic models. 1209 S ILVA AND G HAHRAMANI Y2 Y1 Y3 b31 Y3 Y1 Y2 Y3 Y4 Y5 Y1 b32 Y4 b54 Y5 (a) b21 Y4 b32 b43 Y6 Y2 b53 (b) (c) Figure 7: The coefficients b31 and b32 , represented as nodes in (a), become dependent after conditioning on Y. However, they are still independent of b54 . This a general property of DAG models. In DMG models, a sequence of bi-directed edges will connect extra coefficients. In graph (b), coefficients b21 , b32 and b43 will all be dependent given Y. Coefficients into nodes in different districts will still be independent. The graph in (c) has districts {Y1 ,Y2 ,Y3 ,Y4 } and {Y5 ,Y6 }. 6.1 Factorizations Our prior for coefficients {bi j } is fully factorized. In directed acyclic graphs, this is particularly advantageous: coefficients corresponding to edges into different nodes are independent in the posterior.9 One can then jointly sample a whole set of {bi j } coefficients with same i index, with no concern for the other coefficients. Figure 7(a) illustrates this factorization. This means that, in Equation (16), the summation over t does not go over all variables, but only for t = i. This also follows from the fact that (V)−1 = 0 unless i = t, since V is diagonal. it In ADMGs, however, this is not true anymore. For any pair of vertices linked by a path of bi-directed edges, for example, Yi ↔ Yi+1 ↔ · · · ↔ Yt , one will have in general that (V)−1 = 0. This it can be shown by using the graphical properties of the model when conditioning on some arbitrary datapoint Y: Proposition 6 Let G be an acyclic DMG with vertex set Y, and G ′ the DMG obtained by augmenting G with a vertex for each parameter bi j and a respective edge bi j → Yi . Then if there is a bi-directed path Yi ↔ · · · ↔ Yt in G , {bi j , btv } are not m-separated given Y in G ′ . Proof: The joint model for {Y, B} with independent priors on the non-zero entries of B is Markov with respect to G ′ . The sequence of bi-directed edges between Yi and Yt implies a path between bi j and btv where every vertex but the endpoints is a collider in this path. Since every collider is in Y, this path is active. This Proposition is illustrated by Figure 7(b). The practical implication is as follows: mconnection means that there is no further graphical property that would entail (V)−1 = 0 (i.e., only it particular cancellations on the expression of the inverse, unlikely to happen in practice, would happen to generate such zeroes). 9. Sampling in Gaussian DAG models is still necessary if the model includes latent variables (Dunson et al., 2005). 1210 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Consider the maximal sets of vertices in an ADMG such that each pair of elements in this set is connected by a path of bi-directed edges. Following Richardson (2003), we call this a district.10 It follows that is not possible in general to factorize the posterior of B beyond the set of districts of G . Figure 7(c) illustrates a factorization. Fortunately, for many DMG models with both directed and bi-directed edges found in practical applications (e.g., Bollen, 1989), the maximum district size tends to be considerably smaller than the dimensionality of the problem. 6.2 Perfect Sequences It is still important to speed up marginal likelihood (or variational bound) computations for models with districts of moderate size, particularly if many models are to be evaluated. Without loss of generality, assume our graph G is a bi-directed graph with a single district, since the problem can be trivially separated into the disjoint bi-directed components. We will consider the case where the bi-directed graph is sparse: otherwise there is little to be gained by exploring the graphical structure. In that case, we will assume that the largest number of spouses of any node in G is bounded by a constant κ that is independent of the total number of nodes, m. The goal is to derive algorithms that are of lower complexity in m than the original algorithms. The bottleneck of our procedure is the computation of the Σ−1 ≺ (i),nsp≺ (i) matrices, required nsp in the mapping between independent and dependent Bartlett parameters (Equation 7), as well as computing the determinants |Σnsp≺ (i),nsp≺ (i) |. Since in sparse districts nsp≺ (i) grows linearly with m, the cost of a na¨ve algorithm for a single sampling step is O(m3 ) per node. Iterating over all nodes ı implies a cost of O(m4 ) for a Monte Carlo sweep. Therefore, our goal is to find a procedure by which such mappings can be computed in less than O(m3 ) time. The general framework is reusing previous inverses and determinants instead of performing full matrix inversion and determinant calculation for each Yi . The difficulty on applying low-rank updates when we traverse the covariance matrix according to ≺ is that the sets of non-spouses nsp≺ (i) and nsp≺ (i + 1) might differ arbitrarily. We want sensible orderings where such sets vary slowly and allow for efficient low-rank updates, if any. The foundation of many scaling-up procedures for graphical models is the graph decomposition by clique separators (Tarjan, 1985), usually defined for undirected graphs. The definition for bi-directed graphs is analogous. Such a decomposition identifies overlapping prime subgraphs {GP(1) , GP(2) , . . . , GP(k) } of the original graph G . A prime graph is a graph that cannot be partitioned into a triple (Y′ , S, Y′′ ) of non-empty sets such that S is a complete separator (i.e., S is a clique and removing S disconnects the graph). Notice that a clique is also a prime subgraph. The prime components of a graph can be ordered in a perfect sequence {YP(1) , . . . , YP(k) } of subsets of Y (Roverato, 2002; Lauritzen, 1996). Define H j ≡ YP(1) ∪ · · · ∪ YP( j) as the history of the perfect sequence up to the j-th subgraph. Let R j ≡ YP( j) \H j−1 be the residual of this history (with R1 ≡ YP(1) ), and S j ≡ H j−1 ∩ YP( j) the separator. In a perfect sequence, the triple (H j−1 \S j , S j , R j ) forms a decomposition of the subgraph of G induced by the vertex set H j . Surprisingly, although bi-directed and undirected graph models have very different Markov properties (in undirected models, conditioning removes dependencies; in bi-directed models, it adds dependencies), perfect prime graph sequences prove to be also useful, but in an entirely different 10. Kang and Tian (2005) call such structures c-components and reserve the word “district” to refer to the function mapping a vertex to its respective c-component, as originally introduced by Richardson (2003). We choose to overload the word and call “district” both the structure and the mapping. 1211 S ILVA AND G HAHRAMANI Y7 Y3 Y1 Y5 Y2 Y6 V1 = {Y1 ,Y2 ,Y3 } V2 = {Y4 ,Y5 } V3 = {Y6 } Y4 Figure 8: On the left, we have a bi-directed graph of 7 vertices arranged and ordered such that nodes are numbered by a depth-first numbering starting from “root” Y7 , with {Y1 ,Y2 ,Y4 ,Y6 } 3 being leaves. Vertices {Y1 ,Y2 , . . . ,Y6 } can be partitioned as the union ∪t=1 Vt , as illustrated on the right. way. The next subsection describes the use of prime graph decompositions in a particularly interesting class of bi-directed graphs: the decomposable case. The general case is treated in the sequel. 6.2.1 D ECOMPOSABLE M ODELS In a recursively decomposable graph, all prime subgraphs are cliques. We will assume that any perfect sequence in this case contains all and only the (maximal) cliques of the graph. The resulting decomposition can be interpreted as a hypergraph where nodes are the maximal cliques of the original graph, and edges correspond to the separators. In the statistics literature, a decomposable model is defined as a model that is Markov with respect to a recursively decomposable undirected graph (Lauritzen, 1996). Its widespread presence on applications of Markov random fields is due to nice computational properties, with tree-structured distributions being a particular case. Our definition of bi-directed decomposable models is analogous: a model Markov with respect to a recursively decomposable bi-directed graph. Given the residual sequence {R1 , R2 , . . . , Rk } obtained through a perfect sequence of maximal cliques of G , we define a perfect ordering ≺ by numbering nodes in Rt before nodes in R1 , . . . , Rt−1 , 1 < t ≤ k and ordering nodes according to this numbering.11 Any ordering that satisfies this restriction is a perfect ordering. Such an ordering has the following property. Theorem 7 Let G be a recursively decomposable bi-directed graph such that the indexing of its vertices Y = {Y1 ,Y2 , . . . ,Ym } follows a perfect ordering ≺. Then for each 1 < i ≤ m, the set K(i) {Y1 ,Y2 , . . . ,Yi−1 } can be partitioned as ∪t=1 Vt such that: 1. each Vt induces a connected subgraph of G , and for each Yt ∈ Vt and Yt ′ ∈ Vt ′ , t = t ′ , Yt is not adjacent to Yt ′ in G ; 11. Lauritzen (1996) describes other uses of perfect sequences in undirected graphs. Notice that the notion of perfect numbering described by Lauritzen (1996) is not equivalent to our notion of perfect ordering, which is always derived from a perfect sequence of prime graphs. 1212 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 2. for each {Yp ,Yq } ⊆ Vt , if Yp is a spouse of Yi , and Yq is a non-spouse of Yi , then p > q; The proof is in Appendix C. This result is easier to visualize in trees. One can take as a perfect ordering some depth-first ordering for a given choice of root. Then for each vertex Yi , the set {Y1 ,Y2 , . . . ,Yi−1 } is partitioned according to the different branches “rooted” at Yi . The starting point of each branch is a spouse of Yi , and all other vertices are non-spouses of Yi . The ordering result then follows directly from the definition of depth-first traversal, as illustrated in Figure 8. Let Σ be the covariance matrix of a bi-directed decomposable model with graph G , where Σ follows a G -inverse Wishart distribution. Let ≺ be a perfect ordering for G . By the construction of Bartlett’s decomposition, mapping between parameters is given by Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) , nsp the computational bottleneck being the inversion. Notice this corresponds to the multiple regression coefficients of sp≺ (i) on nsp≺ (i). But according to Theorem 7, using a perfect ordering implies that within each Vs for a fixed Yi , all preceding non-spouses of Yi are ordered before the preceding spouses. Elements {Yp ,Yq } in different Vs are marginally independent given {Y1 , . . . ,Yi−1 }\{Yp ,Yq }. This implies that the regression coefficient of spouse Yp on non-spouse Yq will be zero if Yp and Yq are on different components Vs , and will be identical to the previously computed B p,q if they are in the same component. Splitting the set {Y1 ,Y2 , . . .Yi−1 } into preceding spouses Ysp≺ (i) and nonspouses Ynsp≺ (i) , we have Ysp≺ (i) = Bsp≺ (i),sp≺ (i) Ysp≺ (i) + Bsp≺ (i),nsp≺ (i) Ynsp≺ (i) + εsp≺ (i) ⇒ Ysp≺ (i) = (I − Bsp≺ (i),sp≺ (i) )−1 (Bsp≺ (i),nsp≺ (i) Ynsp≺ (i) + εsp≺ (i) ) where each ε j is an independent Gaussian with variance γ j , and each element (p, q) in Bsp≺ (i),nsp≺ (i) corresponds to the known (i.e., previously computed) regression coefficient of the spouse Yp on the non-spouse Yq . Matrix Bsp≺ (i),sp≺ (i) is defined analogously. Hence, the regression coefficients of Ysp≺ (i) on Ynsp≺ (i) are given by Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) = (I − Bsp≺ (i),sp≺ (i) )−1 Bsp≺ (i),nsp≺ (i) . nsp (19) No inversion of Σnsp≺ (i),nsp≺ (i) is ever necessary. Moreover, the determinant |Σnsp≺ (i),nsp≺ (i) | is given by ∏{q s.t. Yq ∈nsp≺ (i)} γq , since all non-spouses precede the spouses (which means their marginal covariance matrix is given by the previously computed Bartlett parameters). Hence, calculating Bi,nsp≺ (i) for all 1 ≤ i ≤ m according to a perfect ordering has as a bottleneck the inversion (of a triangular matrix) and multiplication in Equation (19), with a cost of O(κ2 +mκ2 ), κ being the maximum number of spouses for any given node. The cost of the remaining operations for the i-th stage in the importance sampler is O(κ3 ). As a function of m, the cost of the parameter sampling step falls from O(m3 ) to O(m). The cost of computing the weights is dominated by the computation of Ki from Equation (13), which is O(κ3 + κm2 ) = O(m2 ). Figure 9 illustrates the derivation of the new ordering in a tree-structured model. 6.2.2 N ON - DECOMPOSABLE M ODELS In a non-decomposable model, some prime graphs YP(t) will no longer be cliques. In what follows, we once again assume that ≺ is a perfect ordering. Unlike in the decomposable case, the product 1213 S ILVA AND G HAHRAMANI YA YC YD YA YC YC YB YC YC YC YD YB (a) (b) Figure 9: The tree-structured (i.e., cycle-free) bi-directed graph in (a) has as maximal cliques the adjacent pairs. Such cliques can be ordered in a perfect sequence as shown in (b), where rectangles indicate the separators. Notice that R1 = {YA ,YC }, R2 = {YB }, R3 = {YD }. One possible perfect ordering is {YD ,YB ,YC ,YA }. Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) does not simplify in general. Instead we will focus only on fast methods nsp to compute Σ−1 ≺ (i),nsp≺ (i) . nsp As we shall see, the function of the perfect sequence is now to provide a sensible choice of which inverse submatrices {Σ−1 }, W ⊆ Y, to cache and reuse when computing Σ−1 ≺ (i),nsp≺ (i) . W,W nsp The same can be done to compute determinants |Σnsp≺ (i),nsp≺ (i) |. A simple way of reusing the results from the previous section is by triangulating the nondecomposable graph G , transforming it into a decomposable one, G ′ , which is then used to generate the perfect sequence. We need to distinguish between the “true” spouses of a node Yi in G and the artificial spouses in G ′ that result from the extra edges added. Let nsp≺G ′ (i) be the non-spouses of Yi in G ′ that precede it according to ≺: by construction, these are also non-spouses of Yi in G . Let sp∆≺G ′ (i) be the spouses of Yi in G ′ that are not spouses of Yi in G . That is, the set of preceding non-spouses of Yi in G is given by nsp≺ (i) = nsp≺G ′ (i) ∪ sp∆≺G ′ (i). Recall that the inverse of a partitioned matrix can be given by the following identity: A B C D −1 = A−1 + A−1 B(D − CA−1 B)−1 CA−1 −A−1 B(D − CA−1 B)−1 −(D − CA−1 B)−1 CA−1 (D − CA−1 B)−1 . (20) In order to compute Σ−1 ≺ (i),nsp≺ (i) , we consider its partitioned version nsp Σ−1 ≺ (i),nsp≺ (i) nsp = Σnsp≺G ′ (i),nsp≺G ′ (i) Σnsp≺G ′ (i),sp∆≺G ′ (i) Σsp∆≺G ′ (i),nsp≺G ′ (i) Σsp∆≺G ′ (i),sp∆≺G ′ (i) −1 . (21) Let κnsp be the maximum number of non-spouses among all Yi within any prime subgraph induced by YP(t) . By using relation (20), where we assume for now that we know A−1 ≡ Σ−1 ′ (i),nsp ′ (i) , the cost of computing (21) is O(m2 κnsp ) + O(κ3 ) = O(m2 κnsp ) (the cost of comnsp nsp ≺G ≺G puting D − CA−1 B is O(m2 κnsp ) + O(κ2 ) = O(m2 κnsp ), while the cost of inverting it is O(κ3 )). nsp nsp Treating κnsp as a constant, this reduces the complexity of sampling the i-th row of Σ from O(m3 ) to O(m2 ). A similar procedure applies to the computation of the determinant |Σnsp≺ (i),nsp≺ (i) |, using in this case the relationship (26). 1214 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS The advantage of using the perfect sequence is to allow for the computation of all A−1 ≡ at a total cost, across all nodes, of O(m3 ): each set nsp≺G ′ (i) is guaranteed to Σ−1 ′ (i),nsp ′ (i) nsp≺G ≺G be equal to {Y1 ,Y2 , . . . ,Ylns } where Ylns is the last non-spouse of Yi in G ′ that antecedes Yi . This follows from the result in the previous section, since all non-spouses of a node in a decomposable graph precede its spouses. Therefore, if we store the inverse covariance matrices for {Y1 ,Y2 , . . . ,Yi }, 1 ≤ i ≤ m, we can obtain the required matrices A−1 . This requires the storage of O(m) matrices, and each matrix can be obtained by the previous one by a low-rank update (20) with a O(m2 ) cost. Arbitrary orderings do not guarantee such an incremental pattern and, hence, no efficient lowrank updates. Notice that depending on the problem, many of such inverse matrices can be dynamically removed from memory if they are not used by any node placed after a particular position. 6.3 Remarks In Gaussian undirected models, the problem of covariance matrix sampling can also be reduced to sampling within each prime graph at the cost of O(|P |4 ), |P | being the size of the largest prime component (Atay-Kayis and Massam, 2005). Since both κ and κnsp are O(|P |), our procedure costs O(m2 |P |2 + |P |4 ) per prime graph, plus a cost of O(m2 ) per node to compute the importance weights. Considering a number of m/|P | prime graphs and |P | < m, the total cost is O(m3 |P |), down from O(m4 ). For undirected models, the corresponding cost by sampling step using the perfect ordering decomposition is O(m|P |3 ). The higher-order dependency on m in bi-directed models is to be expected, since the Markov blanket of any node Yt in a connected bi-directed graph is V\{Yt }. It is clear that inference with a given bi-directed graph model will never scale at the same rate of a undirected model with the same adjacencies, but this does not justify adopting an undirected representation if it is ill-suited to the problem at hand. One has also to consider that in problems with directed and bi-directed edges, the actual maximum district size might be much smaller than the number of variables. For large problems, however, further approximation schemes will be necessary. Drton and Richardson (2008b) describe some reduction techniques for transforming bi-directed edges into directed edges such that the resulting Gaussian model remains the same. As future work, such methods could be adapted to the G -inverse Wishart sampling procedures and combined with the ordering techniques developed here into a single framework. It will also be interesting to develop similar schemes for the Gibbs sampler. 7. Experiments We now evaluate the advantages of the Gaussian and probit models in Bayesian inference on real problems. 7.1 Industrialization and Democratization Study Bollen (1989) describes a structural equation model of political and democratization factors within nations. “Democratization” and “industrialization” levels are abstract notions, but nevertheless of clearly observable impact. They are tied to empirical observations through different sets of indicators. For instance, an indicator of industrialization level is the gross national product. Hence, democratization and industrialization levels are here defined as scalar latent variables never observed directly, while the observed data is composed of indicators. In this model, there is a total of three indicators of industrialization, and four indicators of democratization. Democratization is 1215 S ILVA AND G HAHRAMANI Y1 Y2 Y3 Industrialization 1960 Democratization 1960 Y4 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Y5 Y6 Democratization 1965 Y7 Y8 Y9 Y 10 Y 11 Gross national product (GNP) 1960 Energy consumption per capita 1960 Percentage of labor force in industry 1960 Freedom of press 1960 Freedom of opposition 1960 Fairness of elections 1960 Elective nature of legislative body 1960 Freedom of press 1965 Freedom of opposition 1965 Fairness of elections 1965 Elective nature of legislative body 1965 Figure 10: A directed mixed graph representing dependencies between 11 observed political and economical indicators and three latent concepts (shaded nodes) (Dunson et al., 2005; Bollen, 1989). measured in a longitudinal study, where data was collected in two years (1960 and 1965). The indicators of democratization are pooled expert opinions summarized in an ordinal number scaled from 1 to 10. Following Bollen, we will treat the model as multivariate Gaussian, which provides an excellent fit (a p-value greater than 0.3 using a chi-square test) for a sample of 75 countries. The corresponding mixed graph is depicted in Figure 10, along with a description of all indicators. The graph is taken from Bollen (1989). Other hidden common causes affect the democratization indicators over time, but the nature of such hidden variables is irrelevant to the problem at hand: that is, the bi-directed edges are motivated by unmeasured causes of variability in the observed indicators that exist over time. For instance, the records of freedom of press in 1960 (Y4 ) and 1965 (Y8 ) co-vary due to other unmeasured factors not accounted by democratization factors. 1216 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS Factor scores: countries in the latent space 5 0 Democratization 10 Dem1960 Dem1965 1 5 9 13 18 23 28 33 38 43 48 53 58 63 68 73 Country (ordered by industrialization factor) Figure 11: An embedding of 75 countries in a two-dimensional latent space: democratization level in 1960 and 1965. Boxplots of the Bayesian posterior distribution of the projection in the two dimensions are depicted in the vertical axis. Countries are arranged in the horizontal axis by the increasing order of their posterior expected industrialization level. Figure adapted from Dunson et al. (2005). An example of Bayesian inference application is shown in Figure 11. Boxplots of the posterior values of Democratization Level 1960 and Democratization Level 1965 are generated. Dunson et al. (2005) use this information to, for instance, find clusters of countries in the latent space. An example of a cluster is the one formed by the bottom 16 countries in the industrialization level ranking: the growing trend of democratization levels after the first 16 countries is interrupted. This type of analysis might provide new insights to a polical scientist, for example, by revealing particular characteristics for such a group of nations. 7.1.1 E VALUATING THE MCMC A LGORITHM FOR D IFFERENT M ODELS In our analysis, we fix to unity the coefficients corresponding to the edges Industrialization 1960 → Y1 , Democratization 1960 → Y4 and Democratization 1965 → Y8 , since the scale and sign of the latent variables is arbitrary. The intercept terms of the equations for Y1 ,Y4 and Y8 are set to zero, since the mean of the latents is also arbitrary. The resulting model is identifiable. We apply the Gibbs sampling procedure to three different models. The Gaussian DMG model as described in this paper, and two modified DAG models. The first DAG model is the one described by Dunson et al. (2005), where each bi-directed edge is substituted by an “ancillary” latent (as mentioned in Section 2.3). For instance, the pathway corresponding to Y4 ↔ Y8 is substituted by the chain Y4 ← D48 → Y8 , where D48 is unobserved. Dunson et al. further assume that all covariances due to such ancillary latents are positive. As such, the coefficients from Di j into {Yi ,Y j } are set 1217 S ILVA AND G HAHRAMANI Industrialization1960 → Democratization1965 Democratization1960 → Democratization1965 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 2 Density 1 0 0.0 0.0 0.5 1.0 Density 1.0 0.5 Density 3 1.5 1.5 4 2.0 Industrialization1960 → Democratization1960 -0.5 N = 50000 Bandwidth = 0.02317 0.0 0.5 1.0 1.5 2.0 2.5 -0.5 N = 50000 Bandwidth = 0.01972 (a) 0.0 0.5 1.0 1.5 2.0 2.5 N = 50000 Bandwidth = 0.009335 (b) (c) Figure 12: Posterior distribution of parameters associated with the respective edges in the industrialization/democratization domain. Smoothed posterior obtained using the output of our Gibbs sampler and the DENSITY function of R 2.6.0. Democratization1960 → Y5 0.8 5 0.6 0.4 Density 0.2 0.0 -1 0.0 0.0 0 0.5 0.5 1 1.0 2 1.0 1.5 3 2.0 4 2.5 2.0 1.5 Y7 ↔ Y11 Y7 ↔ Y11 3.0 Industrialization1960 → Democratization1960 0 1000 2000 3000 Iteration 4000 5000 0 1000 2000 3000 4000 5000 0 Iteration 1000 2000 3000 Iteration 4000 5000 -1 0 1 2 3 4 5 N = 50000 Bandwidth = 0.04786 Figure 13: The first three plots show the initial 5,000 iterations of a run of the Gibbs sampling algorithm for the DMG model for three different parameters associated with edges in the graph. The last plot depicts the posterior distribution the error covariance associated with the edge Y7 ↔ Y11 (smoothed with the kernel density estimator from the statistical software R). to unity, with the variance of Di j corresponding to the residual covariance of {Yi ,Y j } given their parents. Means of ancillary latents are fixed at zero. However, even for covariance matrices with positive covariances, this parameterization is not complete. This result is evident from the fact that the variances of Yi and Y j will both be larger than their covariance, which is not true of covariance matrices in general. For this particular problem, however, this extra restriction provides no measurable difference in terms of fitness. It does serve as a reminder, however, that “intuitive” parameterizations might hide undesirable constraints. The second DAG model is an extension of the DAG model suggested by Dunson et al., the only difference being that the coefficients corresponding to edges Di j → Yi , i < j, are free to vary (instead of being fixed to 1). In general, there are Gaussian DMG models that cannot be parameterized this way (Richardson and Spirtes, 2002). Notice also that because of chains such as Democratization 1960 → Y4 ↔ Y8 ← Democratization 1965, the set of independence constraints in this graph can only be represented by a DAG if we include the ancillary latents Di j . That is, there is no DAG with 1218 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS 50000 MCMC comparison: DMG vs. unconstrained DAG 30000 0 10000 20000 Effective Sample Size 30000 20000 10000 0 Effective Sample Size DMG DAG 40000 DMG posDAG 40000 50000 MCMC comparison: DMG vs. positive covariance DAG 1 4 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 1 4 7 Parameter 11 15 19 23 27 31 35 39 43 47 51 55 59 63 Parameter Figure 14: Comparison of the effective sample size of the MCMC algorithm applied to the three models, DMG, DAG with positive covariances (posDAG) and general DAG, as explained in the main text. The horizontal axis is the boxplot for each independent entry of the observed covariance matrix, 66 in total. The boxplots are obtained from 80 independent chains initialized randomly, where each chain runs for 50,000 iterations. exactly the same set of independence constraints as the given DMG, unless ancillary latent variables are added. We study the behavior of the MCMC algorithm for these three models.12 It turns out that the mixing properties of the chain are considerably affected by the choice of model. Recall that, in the Gibbs sampling algorithm for the DMG model, a whole row of the error covariance matrix is sampled jointly conditioning on the other parameters. For the DAG models all entries of the error covariance matrix are independent and can be sampled jointly, but this requires conditioning on the ancillary latents, which do not exist in the DMG model and have to be sampled only in the DAG case. For the majority of the covariance entries, the MCMC procedure mixed quite well, as illustrated in Figure 13. Notice how about 12% of the sampled DMG error covariances for Y7 ↔ Y11 were under zero, which could raise suspicion over the assumption of positive covariances. Autocorrelation is 12. A few technical notes: we used the priors suggested in Dunson et al. (2005), except that we changed the confidence in the prior of the covariance of the error terms V to be smaller (in order to minimize the influence of the priors in the models, since in this particular problem the DMG and DAG models are nearly likelihood equivalent but not posterior distribution equivalent − the priors belong to different families). We used 1 degree of freedom in our G -Inverse Wishart, with the matrix parameter being the expected value of Dunson et al.’s prior. For the DAG models, we also used the G -inverse Wishart prior for the error terms, but where all error terms are independent. For the DAG model with a free coefficient per ancillary latent, we assigned a standard Gaussian prior to such coefficients. The chains were initialized randomly by sampling standard Gaussians for the coefficients and latent variables. Error covariance matrices were initialized to diagonal matrices with diagonal entries sampled uniformly in [1, 2]. Coefficient parameters were sampled jointly given the error covariance matrix and latent variables. Latent variables were also sampled jointly, given the parameters. 1219 S ILVA AND G HAHRAMANI Comparison of average effective sample size 30000 DMG 0 0 10000 20000 20000 10000 DMG 30000 40000 40000 Comparison of average effective sample size 0 10000 20000 30000 40000 0 positive covariance DAG 10000 20000 30000 40000 unconstrained DAG Figure 15: Comparison of the effective sample size of the MCMC algorithm applied to the three models. Here we plot the average effective sample sizes over 80 trials of 50,000 samples for each of the 66 entries of the covariance matrix. Points over the line indicate parameters where the DMG approach performed better. essentially zero for most parameters at a lag of 50. The degree of autocorrelation, however, varied significantly between the DMG model and each DAG model. The chains for the DMG model mixed considerably better. To summarize such behavior, we calculated the effective sample size of the samples obtained from several chains. The parameters of interest in this comparison are the independent entries in the 11 × 11 dimensional observed covariance matrix. This is a total of 66 parameters. The effective sample size statistics were obtained by 80 independent chains of 50,000 samples each, for the three models. For each chain and each parameter, we compute the desired statistic using the EFFECTIVE S IZE function implemented in the R package CODA, freely available in the Internet. Results are summarized by boxplots in Figure 14. Parameters are ordered in the x-axis following the upper triangular covariance matrix, scanning it in the order {σY1Y1 , σY1Y2 , . . . , σY1Y11 , σY2Y2 , . . . , σY11Y11 }. White boxplots correspond to the distribution of effective sample size statistics with the DMG model across the 80 independent chains. Gray boxplots correspond to the two DAG variants. There is no significant difference between the behaviour of the Gibbs sampling procedure for the two DAG models. The procedure with the DMG model is clearly better behaved. As a summary statistic, the average effective sample size over 80 trials was steadly larger in the DMG outcome than in the positive DAG outcome (61 out of 66 parameters) and unconstrained DAG (59 out of 66). The comparison of averages is illustrated by Figure 15. By caching the sufficient statistics of the data and factorizing the sampling procedure according to the districts of the graph, the running time for generating 50,000 samples out of the DMG model was of 34 seconds in a dual core Pentium IV 2.0 GHz. Depending on the connectivity of the bidirected components of the graph and on the implementation of matrix inversion, sampling from the 1220 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS DAG model might be faster than sampling from the DMG. In this particular study, sampling from the DAG models was substantially slower, an approximate average of 60 seconds for both variants. This can be explained by the fact that sampling latent variables is very expensive, especially considering that in the given DAG models all ancillary latents become dependent when conditioning on the data. To summarize, the DMG approach allowed for a complete parameterization with significantly better mixing properties, while still resulting in a faster MCMC procedure. 7.2 Structure Learning Applications When trying to find a point estimate of graphical structures (i.e., returning a single graph that explains the data well), simple approaches such as testing for marginal independencies are reasonable learning algorithms under the Gaussian assumption. The Bayesian approach, however, allows one to compute odds and distributions over graphs and graph statistics, for example, the joint probability of small substructures (Friedman and Koller, 2003). Moreover, it is not clear how the independence test procedure controls for the predictive ability of the model, which is not a straightforward function of the edges that are selected due to the quantitative aspects of the dependencies. We evaluate our Bayesian model selection contribution, focusing on the Monte Carlo sampler for bi-directed models. Jones et al. (2005) propose the following priors for graphs: P (G |β) = β|E| (1 − β)0.5m(m−1)−|E| where β is a hyperparameter, |E| is the number of edges in G , and m is the number of nodes. As suggested by Jones et al., we choose β = 0.5/(m − 1), which puts more mass on graphs with O(m) edges than the uniform prior. We start with a brief synthetic study to compare the approach against a simple but effective approach based on the BIC approximation.13 An experiment with gene expression data closes this subsection. 7.2.1 S YNTHETIC S TUDIES As a sanity check for the procedure, we generate synthetic 10-dimensional Gaussian data from models that are Markov with respect to a bi-directed graph. One hundred data sets of 50 datapoints each are generated, each coming from a different model.14 We initially find a structure by marginal independence tests using the Fisher’s Z statistic at a 0.05 level. From this starting point, we perform two searches: one using the BIC score, and the other using the marginal likelihood with a G -IW prior.15 Given the best model for each procedure, we evaluate the predictive log-likelihood on a test set of 2,000 points which are independently sampled for each of the 100 models. 13. The BIC approach is an asymptotically consistent score for selecting the maximum a posteriori Gaussian bi-directed graph model (Richardson and Spirtes, 2002). 14. The details of the simulated data are as follows: we start with DAG with no edges, with observed nodes {Y1 ,Y2 , . . . ,Y10 } and hidden nodes {X1 , X2 , X3 , X4 }. Each individual edge Xi → Y j is added with probability 0.35, and no other edges are allowed. We reject graphs with fewer than 10 edges. All coefficient parameters are sampled from a standard Gaussian, and variances from an uniform distribution in [0, 1]. The model over Y corresponds to a bi-directed graph, where the edge Yi ↔ Y j exists if and only if Yi and Y j have a common latent parent Xk in the DAG. We then store 50 samples for the Y variables in a data set. The procedure is repeated 100 times with different parameters and graphical structures each time. The average number of edges in the resulting simulation was of 18.4 edges per graph. 15. In both cases, we center the data at the empirical mean of the training set and assume the data to have been generated from a zero-mean Gaussian. The G -Inverse Wishart is an empirical prior: a diagonal matrix with the training variance 1221 S ILVA AND G HAHRAMANI MLE Difference (GIW Greedy - BIC Greedy) 0.0008 Density 0.0000 0.0000 0.0002 0.0002 0.0004 0.0006 0.0006 0.0004 Density 0.0008 0.0010 0.0010 0.0012 0.0012 Difference (GIW Greedy - BIC Greedy) -500 0 500 1000 1500 2000 -500 N = 100 Bandwidth = 125.5 0 500 1000 1500 2000 N = 100 Bandwidth = 116.1 (a) (b) Figure 16: The differente in predictive log-likelihood with models learned with the G -IW prior and the best BIC models found by greedy search. Although the difference per point is small, it reflects a persistent advantage of the full Bayesian approach. Figure (a) shows the estimated density of the distribution of differences when predicting points using the Bayesian predictive log-likelihood. Since the BIC search method does not atempt to maximize the finite sample posterior distribution, we provide Figure (b) for completeness: in this case, the predictive log-likelihood for the BIC model was calculated using the maximum likelihood estimator. The difference hardly changes, and the fully Bayesian model still wins (density estimates produced by the DENSITY(·) function of R 2.6.0.). The average difference in log-likelihood prediction16 between the structure learned with the Bayesian prior and the BIC-induced model is depicted in Figure 16(a). This is computed by conditioning on the learned structures (fully Bayesian vs. BIC maximum a posteriori graphs) and marginalizing over the posterior of the parameters. The parameter priors are those used for the structure learning step. This might be unfair for the BIC procedure, since it is not designed to maximize the finite sample posterior: hence we also show in Figure 16(b) the results obtained when the predictions given the BIC model are obtained by using the maximum likelihood estimators of the of each variable used as the diagonal. The number of degrees of freedom is set to 1. The search is a standard greedy procedure: we evaluate the marginal log-likelihood or BIC score for each graph that differs from the current candidate by one edge (i.e., graphs with one more or one fewer edge) and pick the one with the highest score. We stop when no improvement is possible. 16. In terms of incorrect edge additions and deletions, the procedures behave about the same: an average of one third of the edges is missed, and 7% of edges are incorrectly added (individual percentages are with respect to total number of possible mistakes in each graph). Unlike BIC, however, our procedure allows for different trade-offs by using different priors. It should also be pointed out that counting edge errors is just one possible measure. A more global quantitative score such as predictive log-likelihood takes into account, indirectly, the magnitude of the errors—although it is not a direct measure of model fitness. 1222 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS parameters. The average difference in the first case is 400.07, and only slightly less for the second case (389.63). Most of the mass of the difference distribution is positive (85 out of 100 for the first case, 89 out of 100 in the second case), which passes a sign test at a 0.05 level. The difference is still relatively small, suggesting that informative prior knowledge might be necessary to produce substantially better predictions. 7.2.2 G ENE E XPRESSION A NALYSIS To illustrate the use of Bayesian model selection approaches, we analyse the gene expression data previously studied by Drton and Perlman (2007), also as Gaussian bi-directed models. As before, our goal will be to compare the predictive power of models learned by greedy search with BIC and greedy search with the Bayesian posterior. The data consists of 13 gene expression measurements from a metabolic network. A total of 118 points is available. Using all data, the BIC-induced graph has 39 edges, while the finite sample posterior graph had 44. The same procedure used in the synthetic studies, for initializing graphs and choosing priors and centering the data, was applied in this case with two choices of degrees of freedom δ for the G -IW prior: δ = 1 and δ = 5. Preliminary experiments where 90% of the samples are assigned to the training set showed a negligible difference between methods. We then generate 10 random splits of the data, 50% of them assigned to the training set. Predictive results using the MCMC method for evaluating the Bayesian predictions (with half a million samples) are shown in Table 1. The BIC graphs are by definition the same in the three sets of evaluation, but parameters are learned in three different ways (maximum likelihood point estimation and Bayesian averaging with two different priors). There is a steady advantage for the Bayesian approach, although a small one. Notice that using Bayesian averaging over parameters given the BIC graph improves prediction when compared to using the maximum likelihood point estimate, despite the simplistic choice of prior in this study. Notice also that the cases where the Monte Carlo method has small or no advantage over the BIC method were the ones where the maximum likelihood estimators produced their best results. 7.2.3 R EMARKS The procedure based on the sampler is doable for reasonably sized problem on the order of a few dozen variables in desktop machines. Further improvements are necessary for larger problems. One aspect that was not explored here was re-using previous computations when calculating the probability of a new candidate, in a way similar to the local updates in DAG models (Chickering, 2002). How to combine local updates with the ordering-based improved sampler of Section 6 is left as future research. Several practical variations can also be implemented, such as vetoing the inclusion of edges associated with high p-values in the respective independence tests. Such tabu lists can significantly shrink the search space. It is important to evaluate how the Monte Carlo procedure for computing normalizing constants behaves in practice. For all practical purposes, the procedure is an importance sampler and as such is not guaranteed to work within a reasonable amount of time for problems of high dimensionality (MacKay, 1998). We can, however, exploit the nature of the problem for our benefit. Notice that the procedure depends upon a choice of ordering ≺ for the variables. Different orderings correspond in general to different importance distributions. We can play with this feature to choose an suitable ordering. Consider the following algorithm for choosing an ordering given a bi-directed graph G : 1223 S ILVA AND G HAHRAMANI Folder 1 2 3 4 5 6 7 8 9 10 MLE BIC -6578.44 -6392.67 -8194.54 -6284.00 -9428.00 -7111.45 -6411.43 -6350.44 -6374.31 -7247.82 δ=1 BIC MC -6382.45 -6308.19 -6284.64 -6277.94 -6567.89 -6433.51 -6265.16 -6285.77 -6473.93 -6400.51 -6573.85 -6572.74 -6329.53 -6317.18 -6319.87 -6295.19 -6307.13 -6308.21 -6584.96 -6468.51 δ=5 BIC MC -6342.82 -6296.14 -6279.54 -6285.26 -6553.88 -6452.15 -6252.54 -6258.42 -6483.43 -6469.45 -6528.76 -6513.02 -6313.05 -6309.18 -6299.53 -6297.80 -6297.47 -6304.25 -6528.61 -6444.55 Table 1: Results for the 10 random splits of the gene expression data, with 50% of the points assigned to the training set. The first column shows the predictive log-likelihood for the graph learned with the BIC criterion and parameters fit by maximum likelihood. The next two columns show predictive log-likelihood results for the graphs learned with BIC and the Monte Carlo (MC) marginal likelihood method using a G -IW prior with degrees of freedom δ = 1. The last two columns are the results of a prior where δ = 5. Best results in bold. 1. Let ≺ be an empty queue. 2. Let G ′ be the graph complement of G , that is, the graph where {Yi ,Y j } are neighbors if and only if they are not adjacent in G . 3. Let C be an arbitrary maximum clique of G ′ . Add all elements of C to the end of ≺ in any arbitrary order. 4. For each pair {Yi ,Y j }, not intersecting C , such that the path Yi ↔ Yk ↔ Y j exists in G and Yk ∈ C , add the edge Yi ↔ Y j to G . 5. Remove all elements Yk ∈ C from G , including any edge into Yk . 6. Iterate Steps 2-5 until G is an empty graph. The resulting queue ≺ is an ordering that attempts to maximize the number of variables that are marginally independent given their common predecessors. This is just one possibility to simplify the importance sampling distribution: perfect orderings and the approaches for simplifying maximum likelihood estimation described by Drton and Richardson (2008b) could be adapted to provide even better orderings, but we leave this as future work.17 17. In our actual implementation used in the experiments in this Section, we implemented an even simpler approach: instead of finding maximum cliques, we start to build a clique from a particular node, “greedily” adding other nodes to the clique according to the column order of the data set. Each node generates a candidate clique, and we pick an arbitrary clique of maximal size to be our new set C . 1224 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS −13720 −13722 −13728 −13726 −13724 −13722 −13724 −13726 −13728 Marginal likelihood −13720 −13718 Gene data, optimized ordering −13718 Gene data, default ordering 0 2 × 104 4 × 104 6 × 104 8 × 104 105 0 2 × 104 4 × 104 6 × 104 8 × 104 Sample Synthetic data (123 parameters) Synthetic data (658 parameters) 4 × 104 6 × 104 8 × 104 −48625 −48620 −48615 2 × 104 −48630 Marginal likelihood Arbitrary ordering Optimized ordering 0 −48635 −48640 −24171.5 −24172.5 −24173.5 Marginal likelihood −24170.5 −48610 Sample 105 105 Arbitrary ordering Optimized ordering 0 Sample 2 × 104 4 × 104 6 × 104 8 × 104 105 Sample Figure 17: An evaluation on the stability of the Monte Carlo normalizing function procedure. The top row depicts the marginal likelihood estimates for the gene problem using two different distributions implied by two different orderings, as explained in the text. Experiments with synthetic data are shown in the bottom, and the bottom-right figure illustrates major differences. Figure 17 illustrates the difference that a smart choice of ordering can make. The top left graph in Figure 17 depicts the progress of the marginal likelihood Monte Carlo estimator for the gene expression problem using the graph given by the hypothesis testing procedure. The model has 55 parameters. We obtain three estimates, each using a sample of 100,000 points, which allows us to observe how the estimates change at the initial stages. The variable ordering in this case is the 1225 S ILVA AND G HAHRAMANI ordering present in the original database (namely, DXPS1, DXPS2, DXPS3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1, GPPS, PPDS1 and PPDS2). The top right graph shows three runs using the optimized ordering criterion. Convergence is much faster in this case, and both samplers agree on the normalizing constant estimate. As an illustration of the power of the procedure and its limitations, we generated a synthetic sample of 1,000 training points from a graph with 25 nodes, using the same procedure of Section 7.2.1. A run of two different samplers is shown at the bottom left of Figure 17. They are seemingly well-behaved, the ratio between the largest and median weight being at the order of one hundred in the “optimally” ordered case. In contrast, the bottom right corner of Figure 17 illustrates the method with a covariance matrix of 50 random variables and 1,000 training points. Notice this is a particularly dense graph. Much bigger jumps are observed in this case and there is no clear sign of convergence at 100,000 iterations. While there is no foolproof criterion to evaluate the behavior of an importance sampler, the relationship between orderings provides a complementary technique: if the normalizing constant estimators vary substantially for a given set of random permutations of the variables, then the outcomes are arguably not to be trusted even if the respective estimators appear to have converged. Concerning the choice of priors, in this Section we exploited empirical priors. The G -Inverse Wishart matrix hyperparameter is a diagonal matrix where variance entries are the sample variances. While this adds an extra bias towards diagonal matrices, at least in our experiments we performed close to or better than other approaches—it is however not clear whether we could have done much better. It is still an open question which practical “default” hyperparameters will prove useful for the G -IW . Elicitation of subjective priors in the context of structural equation models can benefit from pre-existing work on Bayesian regression, although again practical matters might be different for the G -IW . Dunson et al. (2005) describe some limitations of default priors for structural equation models. A thorough evaluation of methods for eliciting subjective priors is out of the context of this work, but existing work on inverse Wishart elicitation provides a starting point (Al-Awadhi and Garthwaite, 1998). As in the case of the inverse Wishart, the G -Inverse Wishart has a single hyperparameter for specifying degrees of freedom, a limitation which might motivate new types of priors (Brown et al., 1993). 7.3 Discrete Data Applications We now show results on learning a discrete distribution that factorizes according to a mixed graph. Drton and Richardson (2008a) describe applications on real-world binary data modeled according to bi-directed graphs. The empirical contingency tables for two studies can be found in the corresponding technical report (Drton and Richardson, 2005). Drton and Richardson used a complete parameterization for bi-directed binary models and a maximum likelihood estimation procedure. In this section, we analyze these two data sets to illustrate the behavior of our Bayesian procedure using the probit model. Our model imposes probit constraints that are not enforced by Drton and Richardson, but it allows us to obtain Bayesian credible intervals and predictions. The graphs used in the two studies are depicted in Figure 18. The first problem is a study on the dependence between alcoholism and depression, as shown in Figure 18(a). A data point is collected for a given pair of mono-zygotic twins. For each sibling Si , it is recorded whether Si is/is not alcoholic (Ai ), and whether Si suffers/does not suffer from depression (Di ). The hypothesis encoded by the graph is that alcoholism and depression do not share a common genetic cause, despite A and 1226 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS A1 Success A2 Type of offense Prior sentence D1 D2 Drugs (a) Age (b) Figure 18: Two learning problems with discrete data. In (a), the graph shows dependencies concerning alcoholism (Ai ) and depression (Di ) symptoms for paired twins {1, 2}. In (b), a model for dependencies among features of a study on parole appeals, including the success of the parole, if the type of offense was a person offense or not, and if the offender had a dependency on drugs and was over 25 years old. All variables in these studies are binary and further details and references are provided by Drton and Richardson (2008a). A1A2D1D2 = 0001 A1A2D1D2 = 1000 0.0 0.2 0.4 0.6 0.8 1.0 N = 5000 Bandwidth = 0.003151 80 Density 40 60 0 0 0 5 20 5 10 Density 10 Density 15 20 15 25 20 30 A1A2D1D2 = 0000 0.0 0.1 0.2 0.3 0.4 0.5 N = 5000 Bandwidth = 0.002153 0.0 0.1 0.2 0.3 0.4 0.5 N = 5000 Bandwidth = 0.000712 Figure 19: Posterior distribution of some of the marginal contingency table entries for the twin model. D having some hidden (but different) genetic causes. If A and D did have genetic common causes, one would expect that the edges A1 ↔ D2 and A2 ↔ D1 would be also required. The compounded hypothesis of marginal independencies for Ai and D j , i = j, can be tested jointly by testing a bidirected model. Notice that no reference to particular genetic hidden causes of alcoholism and depression is necessary, which again illustrates the power of modeling by marginalizing out latent variables. The second study, as shown in Figure 18(b), concerns the dependencies among several variables in an application for parole. The model implies, for instance, that the success of a parole application (Success node, in the Figure) is independent of the age of the offender being under 25 (Age node). However, if it is known that the offender had a prior sentence, these two variables become dependent (through the path Success ↔ Prior sentence ↔ Age). As reported by Drton and Richardson, their 1227 S ILVA AND G HAHRAMANI Entry A1 A2 D1 D2 0000 0001 0010 0011 0100 0101 0110 0111 E[Θ|D ] 0.461 0.136 0.157 0.097 0.032 0.022 0.007 0.012 Estimates MLE 0.461 0.138 0.159 0.096 0.032 0.021 0.008 0.012 uMLE 0.482 0.134 0.154 0.085 0.025 0.015 0.012 0.017 Entry A1 A2 D1 D2 1000 1001 1010 1011 1100 1101 1110 1111 E[Θ|D ] 0.018 0.003 0.021 0.009 0.008 0.003 0.003 0.006 Estimates MLE 0.018 0.004 0.020 0.009 0.010 0.002 0.005 0.005 uMLE 0.013 0.007 0.013 0.015 0.005 0.003 0.007 0.012 Figure 20: The posterior expected value of the 16 entries in the twin study table (E[Θ|D ]). Results generated with a chain of 5,000 points. We also show the maximum likelihood estimates of Drton and Richardson (MLE) and the maximum likelihood values obtained using an unconstrained model (uMLE). Despite the probit parameterization, in this particular study there is a reasonable agreement between the Bayesian estimator and the estimator of Drton and Richardson. binary bi-directed model passes a significance test. Drton and Richardson also attempted to learn an undirected (Markov) network structure with this data, but the outcome was a fully connected graph. This is expected, since Markov networks cannot represent marginal independencies unless the graph is disconnected, which would introduce all sorts of other independencies and possibly not fit the data well. If many marginal independencies exist in the data generating process, Markov networks might be a bad choice of representation. For problems with symmetries such as the twin study, DAGs are not a natural choice either. 7.3.1 R ESULTS For the twin data problem, we used a simple prior for the covariance matrix of the underlying latent variables: a G -inverse Wishart with 1 degree of freedom and a complete covariance with a value of 2 for each element in the diagonal and 1 outside the diagonals. Thresholds are fixed at zero, since we have binary data. We present the expected posterior values of the contingency table entries in Figure 20. The outcome is essentially identical to the maximum likelihood estimates of Drton and Richardson despite the probit parameterization. Moreover, with our procedure we are able to generate Bayesian confidence intervals, as illustrated in Figure 19. The results are very stable for a chain of 1,000 points. For the parole data, we used a G -inverse Wishart prior for the covariance matrix of underlying variables Y⋆ with 1 degree of freedom and the identity matrix as hyperparameters. We compare the effective sample size of the Gibbs sampler for our DMG model and the DAG model obtained by using the ancillary latent parameterization of Section 7.1 for the underlying latent variable covariance matrix.18 Boxplots for the 16 contingency table entries of the twin network and the 32 entries of the parole study are shown in Figure 21. The setup is the same as in the democratization and 18. The priors used are as follows: the ancillary representation was given a prior with mean 1 and variance 1 for the coefficients Xi j → Y j⋆ , for j > i, and set constant to 1, if i < j. The means of the ancillary latents were fixed at 1228 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS industrialization experiment, where we run 80 independent chains and plot the distribution of the effective sample sizes to measure the mixing time. We ran a shorter chain of 2,000 points, since computing the contingency table entries is expensive. There is a substantial difference in effective sample size for the parole study. Notice that we are comparing MCMC samples for the entries in the contingency table, which in the DAG case requires integrating out not only the underlying latent variables implicit in the probit parameterization, but also the ancillary latents that account for the bi-directed edges. This hierarchy of latent variables, which does not exist in the DMG case, causes a considerable increase on autocorrelation of the chain compared to the DMG model. The standard DMG parameterization can be seen as a way of obtaining a collapsed Gibbs sampler, where the parameterization by construction reflects latent variables that were analytically marginalized. MCMC comparison: DMG vs. DAG (parole data) 800 700 MCMC comparison: DMG vs. DAG (twin data) DMG DAG 200 400 Effective Sample Size 400 300 0 100 200 Effective Sample Size 500 600 600 DMG DAG 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 Parameter 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Parameter (a) (b) Figure 21: Comparison of effective sample sizes for the twin data (a) and parole data (b). 80 independent chains of 2,000 points were obtained using the Gibbs sampling algorithm, and the respective box-plots shown above. The Markov chain with the DMG approach easily dominates the DAG one. For the parole data, the average effective sample size for the DAG was as low as 60 points. 8. Conclusion Directed mixed graph models are a generalization of directed graph models. Whenever a machine learning application requires directed graphs, one should first consider whether directed mixed graphs are a better choice of representation instead. DMGs represent conditional independencies of DAGs where hidden variables have been marginalized out. Given that in most applications it is 0. Variance parameters were given (0.5, 0.5) inverse gamma priors, which approximately matches the priors in the DMG model. 1229 S ILVA AND G HAHRAMANI Y1 Y1 Y2 Y3 Y2 Y3 Y4 Y4 (a) (b) Figure 22: In (a), a simple bi-directed chain with four random variables. In (b), the respective factor graph that is obtained from a Bartlett parameterization using the ordering ≺≡ {Y1 ,Y2 ,Y3 ,Y4 }. In this case, the factors are p(Y1 ) × p(Y2 |Y1 ) × p(Y3 |Y1 ,Y2 ) × p(Y4 |Y1 ,Y2 ,Y3 ). A different choice of ordering (e.g., the perfect ordering) could provide simpler factors on average, but the presence of a factor linked to all variables is unavoidable. unlikely that all relevant variables are known, DMGs are a natural representation to use. In this paper, we introduced priors and inference algorithms for Bayesian learning with two popular families of mixed graph models: Gaussian and probit. We discussed some implementations and approximations to scale up algorithms. We showed examples of applications with real data, and demonstrated that Bayesian inference in Gaussian and probit DMG models using MCMC can have substantially faster mixing than in comparable DAGs. It is part of the machine learning folklore that factor graphs can subsume directed networks. In an important sense, this is known not to be true: undirected and factor graphs only allow for monotonic independence models, where explaining away is ruled out. This excludes a vast number of realistic, non-monotonic, models. While factor graphs are perhaps the data structures of choice for general message-passing algorithms (e.g., Yedidia et al., 2005), they are far from being universal modeling languages for independencies. What is true is that for any distribution that is Markov with respect to a DAG or DMG there is at least one corresponding factor graph model, but this is a vacuous claim of little interest: any distribution can be represented by a single-factor model involving all variables. Some will require a factor with all variables, even under the presence of a large number of independence constraints. For instance, a factor graph corresponding to any given bi-directed chain will necessarily include a factor node adjacent to all variable nodes, as illustrated in Figure 22. When parameterizing a distribution with many marginal independencies (e.g., a bi-directed tree), the respective factor graph would be no more than an unhelpful drawing. A better strategy for solving real-world problems is to define a family of models according to the (directed/undirected/factor) graphs of choice, and let the inference algorithm decide which re-expression of the model suits the problem. This has been traditional in graphical modeling literature (Lauritzen, 1996). The strategy adopted in this paper followed this spirit. An alternative has been recently introduced by Huang and Frey (2008). This paper discusses graphical families of marginal independence constraints (essentially identical to bi-directed graphs, although other types of constraints might implicitly follow from the parameterization). Models are parameterized using a very different strategy. The idea is to parameterize cumulative distribution functions (CDFs) instead of densities or probability mass functions. A simple factorization criterion can be defined in the space of CDFs, but densities have to be computed by a novel message-passing 1230 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS scheme. The particular application discussed by Huang and Frey (2008) could in principle be approached using the Gaussian bi-directed probit model of Section 5, but the parameterization in Huang and Frey (2008) does not need to rely on Gaussian distributions. It is not clear, however, how to efficiently perform Bayesian inference in this case and which constraints are implicitly implied by the different choices of parameterization. The different perspective given by products of CDFs is novel and promising. It should point out to new directions in mixed graph modeling. The structural equation modeling literature also describes several pragmatic ways of specifying non-linearities in the structural equations (Lee, 2007). Less common is the specification of nonGaussian models for the joint density of the error terms. Silva and Ghahramani (2009) introduce a flexible mixture of Gaussians approach for models of marginal independence. There is a need on how to combine this approach with flexible families of structural equations in a computationally efficient way. Also, models with non-additive error terms remain to be explored. Current interest in estimating sparse statistical models has lead to approaches that estimate structured covariance matrices (e.g., Bickel and Levina, 2008). This development could also lead to new families of priors. In particular, different matrix decompositions have motivated different ways of specifying priors on covariance matrices. For instance, Chen and Dunson (2003) propose a modified Cholesky decomposition for the covariance matrix of random effect parameters: standard deviations are parameterized separately with a prior that puts positive mass on zero variances (effectively allowing the random effect to be neutralized). Wong et al. (2003) describe a prior for inverse correlation matrices that is uniform conditioned on the number of structural zeros. Metropolis-Hastings schemes are necessary in this case. Shrinkage methods have also been applied to the estimation of covariance matrices. A common approach, shrinkage towards a diagonal matrix (e.g., Daniels and Kass, 1999), could be generalized towards some sparse matrix corresponding to a bi-directed graph. Although shrinkage will not generate structural zeros in the resulting matrix, allowing for sparse shrinkage matrices other than the identity matrix could be interesting in prediction problems. Some approaches can exploit an ordering for the variables, which is natural in some domains such as time-series analysis. While the G -Inverse Wishart is invariant to a permutation of the variables, new types of priors that exploit a natural variable ordering should be of interest, as in the original work of Brown et al. (1993) that motivated our approach. Other directions and applications are suggested by recent papers: • learning measurement models: the industrialization and democratization problem of Section 7.1 provides an example of a measurement model. In such a family of problems, observed variables are children of latent variables, and connections from latents to observables define the measurement model. Sparsity in the measurement can be exploited to allow for more general dependencies connecting latent variables. One role of the bi-directed component is to allow for extra dependencies connecting observed variables that are not accounted by the explicit latent variables in the model. Silva et al. (2006) describes a learning algorithm for mixed graph measurement models using the “ancillary” parameterization. The natural question is which alternative optimization strategies could be used and how to scale them up; • structural and relational learning: in prediction problems where given an input vector X we have to predict an output vector Y, the dependence structure of Y given X can also lie within the directed mixed graph family. Silva et al. (2007) introduces mixed graph models within the context of relational classification, where Y are labels of different data points 1231 S ILVA AND G HAHRAMANI not independently distributed. In such a class of problems, novel kinds of parameterization are necessary since the dimensionality of the covariance matrix increases with the sample size. Structural features of the graph are used to propose different parameterizations of the dependencies, and many other alternatives are possible; • causal inference: mixed graphs have been consistently used as a language for representing causal dependencies under unmeasured confounding. Zhang (2008) describes recent advances in identifying causal effects with ancestral graphs. Algorithms for learning mixed graph structures are described by Spirtes et al. (2000) and the recent advances in parameterizing such models should result in new algorithms; Many challenges remain. For instance, more flexible models for DMG discrete models are being developed (Drton and Richardson, 2008a), but for large graphs they pose a formidable computational problem. An important question is which other less flexible, but more tractable, parameterizations could be used, and which approximation algorithms to develop. The probit family discussed here was a choice among many. The parameterization by Drton and Richardson (2008a) could be a starting point for trading-off flexibility and computational effort. And while it is true that Gaussian copula models (Pitt et al., 2006) can be adapted to generalize the approach introduced here, it remains to be seen if other copula parameterizations easily lead to DMG models. Acknowledgments We thank the anonymous reviewers for their suggestions, Kenneth Bollen for providing us with the industrialization and democratization data set, and Robert Gramacy for helpful discussions. An earlier version of this paper (Silva and Ghahramani, 2006) appeared in the proceedings of the Uncertainty in Artificial Intelligence conference. This work was funded by the Gatsby Charitable Foundation and a EPSRC grant #EP/D065704/1. Appendix A. Deriving the Sampling Distribution for the Monte Carlo Computation of Normalizing Constants We give here the details on how to derive the sampling distribution used for computing normalizing constants IG (δ, U), as described in Section 3.2.2. Let Ai ≡ Σsp≺ (i),nsp≺ (i) Σ−1 ≺ (i),nsp≺ (i) . Recall from Equation (7) that Bi,nsp≺ (i) = −Bi,sp≺ (i) Ai . nsp The original density p(Bi | γi ), as given by Lemma 1, is a multivariate Gaussian with the following kernel: 1 exp − 2γi T Bi,sp≺ (i) − Msp≺ (i) T Bi,nsp≺ (i) − Mnsp≺ (i) T Uss Usn Uns Unn T Bi,sp≺ (i) − Msp≺ (i) T Bi,nsp≺ (i) − Mnsp≺ (i) (22) where U{i−1},{i−1} in Lemma 1 was rearranged above as the partitioned matrix in (14). The pair {Msp≺ (i) , Mnsp≺ (i) } corresponds to the respective partition of the mean vector Mi . Plugging in the expression for Bi,nsp≺ (i) in (22), we obtain the modified kernel 1232 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS exp − 1 2γi T Bi,sp≺ (i) − Msp≺ (i) TB T −Ai i,sp≺ (i) − Mnsp≺ (i) T Uss Usn Uns Unn T Bi,sp≺ (i) − Msp≺ (i) TB T −Ai i,sp≺ (i) − Mnsp≺ (i) (23) which can be rewritten as pb (Bi,sp≺ (i) ; Ki mi , γi Ki ) × (2π)#sp≺ (i)/2 |γi |#sp≺ (i)/2 |Ki (Φi−1 )|1/2 1 × exp − γ−1 Ui 2 i (24) where #sp≺ (i) is the size of set sp≺ (i), pb (·; α, Σ) is the density function of a multivariate Gaussian distribution with mean α and covariance Σ, Ki (Φi−1 ) ≡ Ki to emphasize the contribution of previous parameters, and mi = (Uss − Ai Uns )Msp≺ (i) + (Usn − Ai Unn )Mnsp≺ (i) , K−1 = Uss − Ai Uns − Usn AT + Ai Unn AT , i i i Ui = MT U{i−1},{i−1} Mi − mT Ki mi . i i / If sp≺ (i) = 0, it follows that Bi = Bi,nsp≺ (i) = 0. The kernel (23) reduces to exp(−0.5Ui /γi ), / and Ui ≡ MT U{i−1},{i−1} Mi . If nsp≺ (i) = 0, then the expression for the kernel does not change i (Ui ≡ 0), and Equation (24) corresponds to the original kernel in Equation (11). Inserting the re-expressed kernel into the original function (11), we obtain pb (Bi,sp≺ (i) ; Ki mi , γi Ki )pg γi ; δ + i − 1 + #nsp≺ (i) uii.{i−1},{i−1} + Ui , 2 2 fi (Φi−1 ) where pg (·; α, β) is an inverse gamma density function and (i−1)−#sp≺ (i) 2 |Ki (Φi−1 )|1/2 |U{i−1},{i−1} |1/2 fi (Φi−1 ) ≡ (2π)− (uii.{i−1},{i−1} /2)(δ+i−1)/2 Γ((δ + i − 1 + #nsp≺ (i))/2) . × Γ((δ + i − 1)/2) ((uii.{i−1},{i−1} + Ui )/2)(δ+i−1+#nsp≺ (i))/2 Appendix B. Variational Updates for Gaussian Mixed Graph Models The variational updates for the coefficient and intercept parameters are essentially identical to their joint conditional distribution given V and X, where occurrences of V and X are substituted by expectations V−1 q(V) and X q(X) , respectively. Let Vi j be the i j-th entry of V−1 q(V) . The covariance matrix of (B, α) is the covariance matrix of the vector vec(B, α). Such vector is constructed using all (non-zero) coefficients and intercepts. We denote this covariance matrix by ΣB,α . For simplicity of notation, we will treat αi as the coefficient bi(m+1) , m being the number of variables. We will (d) also adopt the notation Ym+1 ≡ 1 in the following derivations. As an abuse of notation, let Y also 1233 S ILVA AND G HAHRAMANI refer to latent variables. In this case, if Yi and Y j refer to latent variables Xhi and Xh j , then define Yi ≡ Xhi q(X) , and YiY j ≡ Xhi Xh j q(X) . Let bi j and btv be the r-th and s-th entries of vec(B, α), respectively. The rs-th entry of the inverse matrix Σ−1 is given by Bα (Σ−1 )rs = Vit Bα n ∑ Yj (d) (d) Yv + 1(i = t)1( j = v) d=1 cbj i sbj i where bxpx ≡ 0 if no edge Yx ← Ypx exists in the graph, 1(·) is the indicator function, and cbj , sbj are i i the given prior parameters defined in Section 4. Similarly to the factorization criterion explained in Section 6, the matrix q(V) will in general be block-diagonal, and this summation can be highly simplified. Define now a vector cb analogous to the Gibbs sampling case, where m cb = r ∑ Vit t=1 n ∑ Y j Yt (d) (d) d=1 + cbj i sbj i . The variational distribution q(B, α) is then a N(ΣB,α c, ΣB,α ). The variational distribution for the latent variables will exactly the same as the Gibbs distribution, except that references to B, α, V−1 are substituted by B q(B,α) , α q(B,α) and V−1 q(V) . Appendix C. Proofs Proof of Lemma 2: Arrange the columns of the Jacobian such that their order corresponds to the sequence σ11 , σ21 , σ22 , σ31 , σ32 , σ33 , . . . , σmm , excluding the entries σi j that are identically zero by construction. Arrange the rows of the Jacobian such that their order corresponds to the sequence γ1 , β21 , γ2 , β31 , β32 , . . . , γm , excluding the entries βi j that are not in ΦG (i.e., exclude any βi j corresponding to a pair {Yi ,Y j } that is not adjacent in the graph). By the definition of Bartlett’s decomposition, Σ{i},{i} and βst are functionally independent for s > i. The same holds for Σ{i},{i} and γs . As such, ∂σi j /∂βst = 0 and ∂σi j /∂γs = 0 for s > i. This implies that J(ΦG ) is a (lower) block triangular matrix of 2m − 1 blocks: for k odd, the k-th block is the singleton ∂σii /∂γi = 1, where i = (k + 1)/2. For k even, the k-th block is the Jacobian ∂Σi,sp≺(i) /∂Bi,sp≺(i) , where i = 1 + k/2 and Σi,sp≺(i) is the vector of covariances of Yi and its preceding spouses. From the interpretation given by Equation (8), it follows that Bi,sp≺ (i) can also be defined by the regression of Yi on Zi . That is Bi,sp≺ (i) = ΣYi ,Zi Σ−1 i ≡ ΣYi ,Zi R−1 . i Zi ,Z (25) However, ΣYi ,Zi = Σi,sp≺ (i) , since Yi is independent of its non-spouses. From (25) we get Σi,sp≺ (i) = Bi,sp≺ (i) Ri , and as such the submatrix ∂Σi,sp≺(i) /∂Bi,sp≺(i) turns out to be Ri . Since the determinant of the block triangular Jacobian J(ΦG ) is given by the determinant of the blocks, this implies m |J(ΦG )| = ∏ |Ri |. i=2 By the matrix identity 1234 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS A B C D |Σ{i−1},{i−1} | = = |A||D − CA−1 B|, |Σnsp≺(i) ,nsp≺(i) ||Σsp≺(i) ,sp≺(i) − Σsp≺(i) ,nsp≺(i) Σ−1 ≺(i) ,nsp≺(i) Σnsp≺(i) ,sp≺(i) | nsp (26) ≡ i−1 |Σnsp≺(i) ,nsp≺(i) ||Ri |. Since |Σ{i−1},{i−1} | = ∏t=1 γt , the second equality holds. Proof of Theorem 4: We first describe a mapping from each path in G to a path in G ⋆ , and viceversa (such mappings are not inverse functions of each other, since the number of paths in G ⋆ is larger than in G ). By construction, all bi-directed edges in G ∗ have two UVs as endpoints, with an one-to-one mapping between each Ys⋆ ↔ Yt⋆ in G ⋆ and each Ys ↔ Yt in G . All directed edges in G ⋆ are of two types: Ys → Yt⋆ , with s = t, or Ys⋆ → Ys . Therefore, one can define an unique path P in G as a function of a path P⋆ in G ⋆ , obtained by relabeling each Y ⋆ as Y , and by collapsing any Y → Y edges that might result from this relabeling into a single vertex Y . A mapping in the opposite direction is analogous as given by the construction rule of Type-II models. A collider in a path is any vertex within a head-to-head collision in the path, that is, any vertex Yt where the preceding and the next vertex in the path are connected to Yt with an edge (directed or bi-directed) into Yt . Yi and Y j are m-separated by Z in an acyclic DMG if and only if there is no active path connecting Yi and Y j . Like in d-separation, a path is active if all of its colliders have some descendant in Z, and none of its non-colliders is in Z (Richardson, 2003). The mappings between paths P and P⋆ are such that Yt is a collider in P if and only if Yt is in P⋆ and is a collider, or Yt⋆ is in P⋆ and is a collider. Since by construction any Yt⋆ will have the same Y-descendants in G ⋆ as Yt has in G , and Z ⊂ Y, the result follows. Proof of Theorem 7: The first of the two claims of the theorem trivially holds, since connectivity is a transitive property and as such this partition will always exist (where K(i) = 1 is a possibility). We will prove the validity of the second claim by induction. Let {R1 , . . . , Rk } be the perfect sequence that generated our perfect ordering. The second claim automatically holds for all vertices in Rk , since Rk is a clique. Assume the second claim holds for the subsequence {Rl+1 , Rl+2 , . . . , Rk }. Let Yi be an element of Rl . Assume there is some non-spouse Yq of Yi in Rl ′ , and some spouse Yp of Yi in Rl ′′ , such that l < l ′ ≤ l ′′ . We will assume that both Yq and Yp belong to the same component Vt and show this leads to a contradiction. Without loss of generality, we can assume that Yq and Yp are adjacent: otherwise, the fact that Yq and Yp are in the connected set Vt will imply there is a path connecting Yq and Yp in the subgraph induced by {Rl+1 , . . . , Rk }. We can redefine {Yq ,Yp } to be the endpoints of the first edge in the path containing a non-spouse and a spouse of Yi . It will still be the case that q > p, by the induction hypothesis. Since Yp ∈ Rl ′′ , there is a separator Sl ′′ between Hl ′′ \Sl ′′ and Rl ′′ . But Yi ∈ Hl ′′ , and Yi is adjacent to Yp , which implies Yi ∈ Sl ′′ . If l ′ < l ′′ , this will also imply that Yq ∈ Sl ′′ , which is a contradiction, since Sl ′′ is a complete set. If l ′ = l ′′ , this implies that Yi and Yq are both in YP(l ′′ ) , which is also a contradiction since YP(l ′′ ) is a clique. 1235 S ILVA AND G HAHRAMANI References A. Al-Awadhi and P. Garthwaite. An elicitation method for multivariate normal distributions. Communications in Statistics - Theory and Methods, 27:1123–1142, 1998. J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679, 1993. A. Atay-Kayis and H. Massam. A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 92:317–335, 2005. D. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis. Arnold Publishers, 1999. M. Beal. Variational algorithms for approximate Bayesian inference. PhD. Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. M. Beal and Z. Ghahramani. Variational Baeysian learning of directed graphical models with hidden variables. Bayesian Analysis, 1:793–832, 2006. P. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36:2577– 2604, 2008. K. Bollen. Structural Equation Models with Latent Variables. John Wiley & Sons, 1989. P. Brown, N. Le, and J. Zidek. Inference for a covariance matrix. In P.R. Freeman, A.F.M. Smith (editors), Aspects of Uncertainty, a tribute to D. V. Lindley, pages 77–92, 1993. Z. Chen and D. Dunson. Random effects selection in linear mixed models. Biometrics, 59:762–769, 2003. D. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002. M. Daniels and R. Kass. Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association, 94:1254–1263, 1999. M. Drton and M. Perlman. Multiple testing and error control in Gaussian graphical model selection. Statistical Science, pages 430–449, 2007. M. Drton and T. Richardson. A new algorithm for maximum likelihood estimation in Gaussian models for marginal independence. Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, 2003. M. Drton and T. Richardson. Iterative conditional fitting for Gaussian ancestral graph models. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004. M. Drton and T. Richardson. Binary models for marginal independence. Department of Statistics, University of Washington, Tech. report 474, 2005. M. Drton and T. Richardson. Binary models for marginal independence. Journal of the Royal Statistical Society, Series B, 70:287–309, 2008a. 1236 BAYESIAN L EARNING WITH M IXED G RAPH M ODELS M. Drton and T. Richardson. Graphical methods for efficient likelihood inference in Gaussian covariance models. Journal of Machine Learning Research, pages 893–914, 2008b. D. Dunson, J. Palomo, and K. Bollen. Bayesian structural equation modeling. Statistical and Applied Mathematical Sciences Institute, Technical Report #2005-5, 2005. N. Friedman and D. Koller. Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Machine Learning Journal, 50:95–126, 2003. P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Sch¨ lkopf. Nonlinear causal discovery with additive o noise models. Neural Information Processing Systems, 2008. J. Huang and B. Frey. Cumulative distribution networks and the derivative-sum-product algorithm. Proceedings of 24th Conference on Uncertainty in Artificial Intelligence, 2008. B. Jones, C. Carvalho, A. Dobra, C. Hans, C. Carter, and M. West. Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20:388–400, 2005. M. Jordan. Learning in Graphical Models. MIT Press, 1998. M. Jordan, Z. Ghaharamani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. Jordan (Ed.), Learning in Graphical Models, pages 105–162, 1998. C. Kang and J. Tian. Local Markov property for models satisfying the composition axiom. Proceedings of 21st Conference on Uncertainty in Artificial Intelligence, 2005. J. Kotecha and P. Djuric. Gibbs sampling approach for the generation of truncated multivariate Gaussian random variables. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1757–1760, 1999. S. Lauritzen. Graphical Models. Oxford University Press, 1996. S.-Y. Lee. Structural Equation Modeling: a Bayesian Approach. Wiley, 2007. D. MacKay. Introduction to Monte Carlo methods. Learning in Graphical Models, pages 175–204, 1998. I. Murray, Z. Ghahramani, and D. MacKay. MCMC for doubly-intractable distributions. Proceedings of 22nd Conference on Uncertainty in Artificial Intelligence, 2006. R. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996. R. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001. J. Pearl. Probabilistic Reasoning in Expert Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000. M. Pitt, D. Chan, and R. Kohn. Efficient Bayesian inference for Gaussian copula regression models. Biometrika, 93:537–554, 2006. 1237 S ILVA AND G HAHRAMANI T. Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30:145–157, 2003. T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002. A. Roverato. Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29: 391–411, 2002. R. Scheines, R. Hoijtink, and A. Boomsma. Bayesian estimation and testing of structural equation models. Psychometrika, 64:37–52, 1999. R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. Proceedings of 22nd Conference on Uncertainty in Artificial Intelligence, 2006. R. Silva and Z. Ghahramani. Factorial mixtures of Gaussians and the marginal independence model. Artificial Intelligence & Statistics (AISTATS ’09), 2009. R. Silva and R. Scheines. Bayesian learning of measurement and structural models. 23rd International Conference on Machine Learning, 2006. R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006. R. Silva, W. Chu, and Z. Ghahramani. Hidden common cause relations in relational learning. Neural Information Processing Systems (NIPS ’07), 2007. J. Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1:833–860, 2006. P. Spirtes. Directed cyclic graphical representations of feedback models. Proceedings of 11th Conference on Uncertainty in Artificial Intelligence, 1995. P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000. R. Tarjan. Decomposition by clique separators. Discrete Mathematics, 55:221–232, 1985. E. Webb and J. Forster. Bayesian model determination for multivariate ordinal and binary data. Technical report, Southampton Statistical Sciences Research Institute, 2006. F. Wong, C. Carter, and R. Kohn. Efficient estimation of covariance selection models. Biometrika, 90:809–830, 2003. S. Wright. Correlation and causation. Journal of Agricultural Research, pages 557–585, 1921. J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005. J. Zhang. Causal reasoning with ancestral graphs. Journal of Machine Learning Research, pages 1437–1474, 2008. 1238
4 0.054763872 74 jmlr-2009-Properties of Monotonic Effects on Directed Acyclic Graphs
Author: Tyler J. VanderWeele, James M. Robins
Abstract: Various relationships are shown hold between monotonic effects and weak monotonic effects and the monotonicity of certain conditional expectations. Counterexamples are provided to show that the results do not hold under less restrictive conditions. Monotonic effects are furthermore used to relate signed edges on a causal directed acyclic graph to qualitative effect modification. The theory is applied to an example concerning the direct effect of smoking on cardiovascular disease controlling for hypercholesterolemia. Monotonicity assumptions are used to construct a test for whether there is a variable that confounds the relationship between the mediator, hypercholesterolemia, and the outcome, cardiovascular disease. Keywords: Bayesian networks, conditional expectation, covariance, directed acyclic graphs, effect modification, monotonicity
5 0.052833661 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
Author: Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras
Abstract: In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion. Keywords: Bayesian networks, consistency, scoring criterion, model selection, BIC
6 0.047886644 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
7 0.041010655 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
8 0.040506296 90 jmlr-2009-Structure Spaces
10 0.034960028 54 jmlr-2009-Markov Properties for Linear Causal Models with Correlated Errors (Special Topic on Causality)
11 0.034452941 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
12 0.032752629 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning
13 0.029964805 50 jmlr-2009-Learning When Concepts Abound
14 0.029102633 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
16 0.022961844 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
17 0.022349399 15 jmlr-2009-Cautious Collective Classification
18 0.022015402 23 jmlr-2009-Discriminative Learning Under Covariate Shift
19 0.021906357 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
20 0.020393565 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models (Machine Learning Open Source Software Paper)
topicId topicWeight
[(0, 0.113), (1, -0.06), (2, 0.083), (3, 0.16), (4, 0.025), (5, -0.069), (6, -0.09), (7, -0.048), (8, 0.087), (9, -0.017), (10, -0.102), (11, -0.008), (12, 0.095), (13, 0.037), (14, -0.003), (15, -0.067), (16, 0.042), (17, 0.041), (18, 0.005), (19, 0.002), (20, 0.069), (21, 0.086), (22, 0.136), (23, 0.006), (24, -0.211), (25, -0.017), (26, 0.093), (27, 0.189), (28, -0.099), (29, -0.01), (30, 0.07), (31, 0.045), (32, -0.159), (33, 0.278), (34, 0.041), (35, -0.013), (36, -0.072), (37, -0.121), (38, -0.082), (39, -0.142), (40, -0.051), (41, 0.085), (42, -0.179), (43, -0.111), (44, -0.1), (45, 0.001), (46, -0.348), (47, -0.03), (48, -0.07), (49, -0.211)]
simIndex simValue paperId paperTitle
same-paper 1 0.95535576 11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification
Author: Raanan Yehezkel, Boaz Lerner
Abstract: We propose the recursive autonomy identification (RAI) algorithm for constraint-based (CB) Bayesian network structure learning. The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. The sequence of operations is performed recursively for each autonomous substructure while simultaneously increasing the order of the CI test. While other CB algorithms d-separate structures and then direct the resulted undirected graph, the RAI algorithm combines the two processes from the outset and along the procedure. By this means and due to structure decomposition, learning a structure using RAI requires a smaller number of CI tests of high orders. This reduces the complexity and run-time of the algorithm and increases the accuracy by diminishing the curse-of-dimensionality. When the RAI algorithm learned structures from databases representing synthetic problems, known networks and natural problems, it demonstrated superiority with respect to computational complexity, run-time, structural correctness and classification accuracy over the PC, Three Phase Dependency Analysis, Optimal Reinsertion, greedy search, Greedy Equivalence Search, Sparse Candidate, and Max-Min Hill-Climbing algorithms. Keywords: Bayesian networks, constraint-based structure learning
2 0.34387547 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
Author: Babak Shahbaba, Radford Neal
Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification
Author: Facundo Bromberg, Dimitris Margaritis
Abstract: We address the problem of improving the reliability of independence-based causal discovery algorithms that results from the execution of statistical independence tests on small data sets, which typically have low reliability. We model the problem as a knowledge base containing a set of independence facts that are related through Pearl’s well-known axioms. Statistical tests on finite data sets may result in errors in these tests and inconsistencies in the knowledge base. We resolve these inconsistencies through the use of an instance of the class of defeasible logics called argumentation, augmented with a preference function, that is used to reason about and possibly correct errors in these tests. This results in a more robust conditional independence test, called an argumentative independence test. Our experimental evaluation shows clear positive improvements in the accuracy of argumentative over purely statistical tests. We also demonstrate significant improvements on the accuracy of causal structure discovery from the outcomes of independence tests both on sampled data from randomly generated causal models and on real-world data sets. Keywords: independence-based causal discovery, causal Bayesian networks, structure learning, argumentation, reliability improvement 1. Introduction and Motivation Directed graphical models, also called Bayesian networks, can be used to represent the probability distribution of a domain. This makes them a useful and important tool for machine learning where a common task is inference, that is, predicting the probability distribution of a variable of interest given some other knowledge, usually in the form of values of other variables in the domain. An additional use of Bayesian networks comes by augmenting them with causal semantics that represent cause and effect relationships in the domain. The resulting networks are called causal. An important problem is inferring the structure of these networks, a process that is sometimes called causal discovery, which can provide insights into the underlying data generation process. Two major classes of algorithms exist for learning the structure of Bayesian networks. One class contains so-called score-based methods, which learn the structure by conducting a search in the space of all structures in an attempt to find the structure of maximum score. This score is usually penalized log-likelihood, for example, the Bayesian Information Criterion (BIC) or the (equivalent) Minimum Description Length (MDL). A second class of algorithms works by exploiting the fact that a causal Bayesian network implies the existence of a set of conditional independence statements between sets of domain variables. Algorithms in this class use the outcomes of a number of condic 2009 Facundo Bromberg and Dimitris Margaritis. B ROMBERG AND M ARGARITIS tional independences to constrain the set of possible structures consistent with these to a singleton (if possible) and infer that structure as the only possible one. As such they are called constraintbased or independence-based algorithms. In this paper we address open problems related to the latter class of algorithms. It is well-known that independence-based algorithms have several shortcomings. A major one has to do with the effect that unreliable independence information has on the their output. In general such independence information comes from two sources: (a) a domain expert that can provide his or her opinion on the validity of certain conditional independences among some of the variables, sometimes with a degree of confidence attached to them, and/or (b) statistical tests of independence, conducted on data gathered from the domain. As expert information is often costly and difficult to obtain, (b) is the most commonly used option in practice. A problem that occurs frequently however is that the data set available may be small. This may happen for various reasons: lack of subjects to observe (e.g., in medical domains), an expensive data-gathering process, privacy concerns and others. Unfortunately, the reliability of statistical tests significantly diminishes on small data sets. For example, Cochran (1954) recommends that Pearson’s χ 2 independence test be deemed unreliable if more than 20% of the cells of the test’s contingency table have an expected count of less than 5 data points. Unreliable tests, besides producing errors in the resulting causal model structure, may also produce cascading errors due to the way that independence-based algorithms work: their operation, including which test to evaluate next, typically depends on the outcomes of previous ones. Thus a single error in a statistical test can be propagated by the subsequent choices of tests to be performed by the algorithm, and finally when the edges are oriented. Therefore, an error in a previous test may have large (negative) consequences in the resulting structure, a property that is called instability in Spirtes et al. (2000). One possible method for addressing the effect of multiple errors in the construction of a causal model through multiple independence tests is the Bonferroni correction (Hochberg, 1988; Abdi, 2007), which works by dividing the type I error probability α of each test by the number of such tests evaluated during the entire execution of the causal learning algorithm. As a result, the collective type I error probability (of all tests evaluated) is α, that is, 0.05 typically. However, this may make the detection of true dependences harder, as now larger data sets would be required to reach the adjusted confidence threshold of each test. The types of adjustments that may be appropriate for each case to tests that may be dependent is an open problem and the subject of current research in statistics (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001; Storey, 2002). In this paper we present and evaluate a number of methods for increasing the reliability of independence tests for small data sets. A result of this is the improvement in reliability of independencebased causal discovery algorithms that use these data sets, as we demonstrate in our experiments. We model this setting as a knowledge base whose contents are propositions representing conditional independences that may contain errors. Our main insight is to recognize that the outcomes of independence tests are not themselves independent but are constrained by the outcomes of other tests through Pearl’s well-known properties of the conditional independence relation (Pearl, 1988; Dawid, 1979). These can therefore be seen as integrity constraints that can correct certain inconsistent test outcomes, choosing instead the outcome that can be inferred by tests that do not result in contradictions. We illustrate this by an example. 302 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Example 1. Consider an independence-based knowledge base that contains the following propositions, obtained through statistical tests on data. ({0}⊥ ⊥{1} | {2}) (1) ({0}⊥ {3} | {2}) ⊥ (2) ({0}⊥ ⊥{3} | {1, 2}) (3) where (X⊥ | Z) denotes conditional independence of the set of variables X with Y conditional ⊥Y on set Z, and (X⊥ Y | Z) denotes conditional dependence. Suppose that (3) is in fact wrong. Such ⊥ an error can be avoided if there exists a constraint involving these independence propositions. For example, suppose that we also know that the following rule holds in the domain (this is an instance of an application of the Contraction and Decomposition axioms, described later in Section 2): ({0}⊥ ⊥{1} | {2}) ∧ ({0}⊥ {3} | {2}) =⇒ ({0}⊥ {3} | {1, 2}). ⊥ ⊥ (4) Rule (4), together with independence proposition (1) and dependence proposition (2), contradict independence proposition (3), resulting in an inconsistent knowledge base. If Rule (4) and propositions (1) and (2) are accepted, then proposition (3) must be rejected (and its value reversed), correcting the error in this case. The framework presented in the rest of the paper provides a principled approach for resolving such inconsistencies. The situation described in the previous example, while simple, illustrates the general idea that we will use in the rest of the paper: the set of independences and dependences used in a causal discovery algorithm form a potentially inconsistent knowledge base, and making use of general rules, derived from axioms and theorems that we know hold in the domain, helps us correct certain outcomes of statistical tests. In this way we will be able to improve the reliability of causal discovery algorithms that use them to derive causal models. To accomplish this we use the framework of argumentation, which provides a sound and elegant way of resolving inconsistencies in such knowledge bases, including ones that contain independences. The rest of the paper is organized as follows. The next section introduces our notation and definitions. Section 3 presents the argumentation framework and its extension with preferences, and describes our approach for applying it to represent and reason in knowledge bases containing independence facts that may be inconsistent. Section 4 introduces the argumentative independence test, implemented by the top-down algorithm introduced in Section 5. We then present an approximation for the top-down algorithm in Section 6 that reduces its time complexity to polynomial. We experimentally evaluate our approach in Section 7, and conclude with a summary and possible directions of future research in Section 8. Most of the proofs are presented in detail in Appendices A and B, which contain proofs for the computability (termination) and the validity (no AIT test can return a dependence and an independence result at the same time) of AIT, respectively. Note that, as our main goal in this paper is to address the problem of robust causal learning and not necessarily to advance the theory of argumentation itself, our exposition in the rest of the paper is geared toward causality theorists and practitioners. As this community may be unfamiliar with the theory and methods of the argumentation framework, we have included a self-contained discussion that covers the basic definitions and theorems of argumentation theory in some detail. 303 B ROMBERG AND M ARGARITIS 2. Notation and Preliminaries In this work we denote random variables with capitals (e.g., X,Y, Z) and sets of variables with bold capitals (e.g., X, Y, Z). In particular, we denote by V = {1, . . . , n} the set of all n variables in the domain, naming the variables by their indices in V; for instance, we refer to the third variable in V simply by 3. We assume that all variables in the domain are discrete following a multinomial distribution or are continuous following a Gaussian distribution. We denote the data set by D and its size (number of data points) by N. We use the notation (X⊥ | Z) to denote that the variables in set ⊥Y X are (jointly) independent of those in Y conditional on the values of the variables in Z, for disjoint sets of variables X, Y, and Z, while (X⊥ Y | Z) denotes conditional dependence. For the sake of ⊥ readability, we slightly abuse this notation and use (X⊥ | Z) as shorthand for ({X}⊥ } | {Z}). ⊥Y ⊥{Y A Bayesian network (BN) is a directed graphical model which represents the joint probability distribution over V. Each node in the graph represents one of the random variables in the domain. The structure of the network implicitly represents a set of conditional independences on the domain variables. Given the structure of a BN, the set of independences implied by it can be identified by a process called d-separation (Pearl, 1988); the latter follows from the local Markov property that states that each node in the network is conditionally independent of all its non-descendants in the graph given its parents. All independences identified by d-separation are implied by the model structure. If, in addition, all remaining triplets (X, Y, Z) correspond to dependencies, we say that the BN is directed graph-isomorph (abbreviated DAG-isomorph) or simply causal (as defined by Pearl, 1988). The concept of DAG-isomorphism is equivalent to a property called Faithfulness in Spirtes et al. (2000). A graph G is said to be faithful to some distribution if exactly those independences that exist in the distribution and no others are returned by the process of d-separation on G. In this paper we assume Faithfulness. For learning the structure of the Bayesian network of a domain we make use of the PC algorithm (Spirtes et al., 2000), which is only able to correctly identify the structure under the assumption of causal sufficiency. We therefore also assume causal sufficiency. A domain is causally sufficient if it does not contain any hidden or latent variables. As mentioned above, independence-based algorithms operate by conducting a series of conditional independence queries. For these we assume that an independence-query oracle exists that is able to provide such information. This approach can be viewed as an instance of the statistical query oracle theory of Kearns and Vazirani (1994). In practice such an oracle does not exist, but can be implemented approximately by a statistical test evaluated on the data set (for example, this can be Pearson’s conditional independence χ2 (chi-square) test (Agresti, 2002), Wilk’s G2 test, a mutual information test etc.). In this work we used Wilk’s G2 test (Agresti, 2002). To determine conditional independence between two variables X and Y given a set Z from data, the statistical test G2 (and many other independence tests based on hypothesis testing, for example, the χ 2 test) uses the values in the contingency table (a table containing the data point counts for each possible combination of the variables that participate in the test) to compute a test statistic. For a given value of the test statistic, the test then computes the likelihood of obtaining that or a more extreme value by chance under the null hypothesis, which in our case is that the two variables are conditionally independent. This likelihood, called the p-value of the test, is then returned. The p-value of a test equals the probability of falsely rejecting the null hypothesis (independence). Assuming that the p-value of a test is p(X,Y | Z), the statistical test concludes independence if and only if p(X,Y | Z) is greater than a threshold α, that is, (X⊥ | Z) ⇐⇒ p(X,Y | Z) > α. ⊥Y 304 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION (Symmetry) (Decomposition) (Weak Union) (Contraction) (Intersection) (Symmetry) (Composition) (Decomposition) (Intersection) (Weak Union) (Contraction) (Weak Transitivity) (Chordality) (X⊥ | Z) ⇐⇒ (Y⊥ | Z) ⊥Y ⊥X (X⊥ ∪ W | Z) =⇒ (X⊥ | Z) ∧ (X⊥ ⊥Y ⊥Y ⊥W | Z) (X⊥ ∪ W | Z) =⇒ (X⊥ | Z ∪ W) ⊥Y ⊥Y (X⊥ | Z) ∧ (X⊥ ⊥Y ⊥W | Z ∪ Y) =⇒ (X⊥ ∪ W | Z) ⊥Y (X⊥ | Z ∪ W) ∧ (X⊥ ⊥Y ⊥W | Z ∪ Y) =⇒ (X⊥ ∪ W | Z) ⊥Y (X⊥ | Z) ⇐⇒ (Y⊥ | Z) ⊥Y ⊥X (X⊥ | Z) ∧ (X⊥ ⊥Y ⊥W | Z) =⇒ (X⊥ ∪ W | Z) ⊥Y (X⊥ ∪ W | Z) =⇒ (X⊥ | Z) ∧ (X⊥ ⊥Y ⊥Y ⊥W | Z) (X⊥ | Z ∪ W) ∧ (X⊥ ⊥Y ⊥W | Z ∪ Y) =⇒ (X⊥ ∪ W | Z) ⊥Y (X⊥ ∪ W | Z) =⇒ (X⊥ | Z ∪ W) ⊥Y ⊥Y (X⊥ | Z) ∧ (X⊥ ⊥Y ⊥W | Z ∪ Y) =⇒ (X⊥ ∪ W | Z) ⊥Y (X⊥ | Z) ∧ (X⊥ | Z ∪ γ) =⇒ (X⊥ | Z) ∨ (γ⊥ | Z) ⊥Y ⊥Y ⊥γ ⊥Y (α⊥ | γ ∪ δ) ∧ (γ⊥ | α ∪ β) =⇒ (α⊥ | γ) ∨ (α⊥ | δ) ⊥β ⊥δ ⊥β ⊥β (5) (6) Common values in statistics for α are 0.05 and 0.01, corresponding to confidence thresholds (1 − α) of 0.95 and 0.99 respectively. The value 0.10 for α is also sometimes used, depending on the application, while values as low as 0.005 and 0.001 are sometimes used for structure learning. The conditional independences and dependences of a domain are connected through a set of general rules, introduced in Pearl (1988) and shown boxed in Eq. (5). These can be seen as constraints in a meta-space representing all possible independences in the domain. More specifically, let us imagine a meta-space of binary variables, each corresponding to the truth value of the independence of a triplet (X, Y | Z) (e.g., true for independence and false for dependence). Each point in this space corresponds to a conditional independence assignment to all possible triplets in the domain. In this conceptual space not all points are tenable; in particular the set of rules of Eq. (5) constrain the truth values of independences corresponding to triplets. For domains for which there exists a faithful Bayesian network a more relaxed set of properties hold, shown boxed in Eq. (6) where α, β, γ and δ correspond to single variables. In both sets of axioms, the property of Intersection holds if the probability distribution of the domain is positive, meaning that every assignment to all variables in the domain has a non-zero probability. Eq. (6) were first introduced by Dawid (1979) in a slightly different form and independently re-discovered by Pearl and Paz (1985). ´ Note that the axioms of Eq. (5) are necessarily incomplete; Studen y (1991) showed that there is no finite axiomatization of the conditional independence relation in general. The implication of this is that there may be some inconsistencies involving some set of independences and dependences that no method can detect and resolve. In the next section we describe the argumentation framework, which allows one to make beneficial use of these constraints. This is followed by its application to our problem of answering independence queries from knowledge bases that contain sets of potentially inconsistent independence propositions. 3. The Argumentation Framework There exist two major approaches for reasoning with information contained in inconsistent knowledge bases such as those containing independence statements that were described in the previous 305 B ROMBERG AND M ARGARITIS section. These two distinct approaches correspond to two different attitudes: One is to resolve the inconsistencies by removing a subset of propositions such that the resulting KB becomes consistent; this is called belief revision in the literature (G¨ rdenfors, 1992; G¨ rdenfors and Rott, 1995; Shapiro, a a 1998; Martins, 1992). A potential shortcoming (Shapiro, 1998) of belief revision stems from the fact that it removes propositions, which discards potentially valuable information. In addition, an erroneous modification of the KB (such as the removal of a proposition) may have unintended negative consequences if later more propositions are inserted in the KB. A second approach to inconsistent KBs is to allow inconsistencies but to use rules that may be possibly contained in it to deduce which truth value of a proposition query is “preferred” in some way. One instance of this approach is argumentation (Dung, 1995; Loui, 1987; Prakken, 1997; Prakken and Vreeswijk, 2002), which is a sound approach that allows inconsistencies but uses a proof procedure that is able to deduce (if possible) that one of the truth values of certain propositions is preferred over its negation. Argumentation is a reasoning model that belongs to the broader class of defeasible logics (Pollock, 1992; Prakken, 1997). Our approach uses the argumentation framework of Amgoud and Cayrol (2002) that considers preferences over arguments, extending Dung’s more fundamental framework (Dung, 1995). Preference relations give an extra level of specificity for comparing arguments, allowing a more refined form of selection between conflicting propositions. Preference-based argumentation is presented in more detail in Section 3.2. We proceed now to describe the argumentation framework. Definition 1. An argumentation framework is a pair A , R , where A is a set of arguments and R is a binary relation representing a defeasibility relationship between arguments, that is, R ⊆ A × A . (a, b) ∈ R or equivalently “a R b” reads that argument a defeats argument b. We also say that a and b are in conflict. An example of the defeat relation R is logical defeat, which occurs when an argument contradicts another logically. The elements of the argumentation framework are not propositions but arguments. Given a potentially inconsistent knowledge base K = Σ, Ψ with a set of propositions Σ and a set of inference rules Ψ, arguments are defined formally as follows. Definition 2. An argument over knowledge base Σ, Ψ is a pair (H, h) where H ⊆ Σ such that: • H is consistent, • H Ψ h, • H is minimal (with respect to set inclusion). H is called the support and h the conclusion or head of the argument. In the above definition Ψ stands for classical logical inference over the set of inference rules Ψ. Intuitively an argument (H, h) can be thought as an “if-then” rule, that is, “if H then h.” In inconsistent knowledge bases two arguments may contradict or defeat each other. The defeat relation is defined through the rebut and undercut relations, defined as follows. Definition 3. Let (H1 , h1 ), (H2 , h2 ) be two arguments. • (H1 , h1 ) rebuts (H2 , h2 ) iff h1 ≡ ¬h2 . 306 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Algorithm 1 Recursive computation of acceptable arguments: Acc R = F (A , R , S) 1: S ←− S ∪ {a ∈ A | a is defended by S} 2: if S = S then 3: return S 4: else 5: return F (A , R , S ) • (H1 , h1 ) undercuts (H2 , h2 ) iff ∃h ∈ H2 such that h ≡ ¬h1 . If (H1 , h1 ) rebuts or undercuts (H2 , h2 ) we say that (H1 , h1 ) defeats (H2 , h2 ). (The symbol “≡” stands for logical equivalence.) In other words, (H1 , h1 ) R (H2 , h2 ) if and only if (H1 , h1 ) rebuts or undercuts (H2 , h2 ). The objective of argumentation is to decide on the acceptability of a given argument. There are three possibilities: an argument can be accepted, rejected, or neither. This partitions the space of arguments A in three classes: • The class AccR of acceptable arguments. Intuitively, these are the “good” arguments. In the case of an inconsistent knowledge base, these will be inferred from the knowledge base. • The class RejR of rejected arguments. These are the arguments defeated by acceptable arguments. When applied to an inconsistent knowledge base, these will not be inferred from it. • The class AbR of arguments in abeyance. These arguments are neither accepted nor rejected. The semantics of acceptability proposed by Dung (1995) dictates that an argument should be accepted if it is not defeated, or if it is defended by acceptable arguments, that is, each of its defeaters is itself defeated by an acceptable argument. This is formalized in the following definitions. Definition 4. Let A , R be an argumentation framework, and S ⊆ A . An argument a is defended by S if and only if ∀b, if (b R a) then ∃c ∈ S such that (c R b). Dung characterizes the set of acceptable arguments by a monotonic function F , that is, F (S) ⊆ F (S ∪ T ) for some S and T . Given a set of arguments S ⊆ A as input, F returns the set of all arguments defended by S: Definition 5. Let S ⊆ A . Then F (S) = {a ∈ A | a is defended by S}. Slightly overloading our notation, we define F (∅) to contain the set of arguments that are not defeated by any argument in the framework. Definition 6. F (∅) = {a ∈ A | a is not defeated by any argument in A }. Dung proved that the set of acceptable arguments is the least fix-point of F , that is, the smallest set S such that F (S) = S. Theorem 7 (Dung 1995). Let A , R be an argumentation framework. The set of acceptable arguments AccR is the least fix-point of the function F . 307 B ROMBERG AND M ARGARITIS Dung also showed that if the argumentation framework A , R is finitary, that is, for each argument A there are finitely many arguments that defeat A, the least fix-point of function F can be obtained by iterative application of F to the empty set. We can understand this intuitively: From our semantics of acceptability it follows that all arguments in F (∅) are accepted. Also, every argument in F (F (∅)) must be acceptable as well since each of its arguments is defended by acceptable arguments. This reasoning can be applied recursively until a fix-point is reached. This happens when the arguments in S cannot be used to defend any other argument not in S, that is, no additional argument is accepted. This suggests a simple algorithm for computing the set of acceptable arguments. Algorithm 1 shows a recursive procedure for this, based on the above definition. The algorithm takes as input an argumentation framework A , R and the set S of arguments found acceptable so far, that is, S = ∅ initially. Let us illustrate these ideas with an example. Example 2. Let A , R be an argumentation framework defined by A = {a, b, c} and R = {(a, b), (b, c)}. The only argument that is not defeated is a, and therefore F (∅) = {a}. Argument b is defeated by the acceptable argument a, so b cannot be defended and is therefore rejected, that is, b ∈ RejR . Argument c, though defeated by b, is defended by (acceptable argument) a which defeats b, so c is acceptable. The set of acceptable arguments is therefore Acc R = {a, c} and the set of rejected arguments is RejR = {b}. The bottom-up approach of Algorithm 1 has the disadvantage that it requires the computation of all acceptable arguments to answer the acceptability status of a single one. In practice, and in particular in the application of argumentation to independence tests, the entire set of acceptable arguments is rarely needed. An alternative is to take a top-down approach (Amgoud and Cayrol, 2002; Dung, 1995; Toni and Kakas, 1995; Kakas and Toni, 1999) that evaluate the acceptability of some input argument by evaluating (recursively) the acceptability of its attackers. Below we present an alternative algorithm, called the top-down algorithm, for deciding the acceptability of an input argument. This algorithm is a version of the dialog tree algorithm of Amgoud and Cayrol (2002), where details unnecessary for the current exposition are not shown. This algorithm is provably equivalent to Algorithm 1 (whenever it is given the same input it is guaranteed to produce the same output), but it is considerably more efficient (as shown later in Section 5.2). We sketch the algorithm here and show a concrete version using the preference-based argumentation framework in Section 3.2. Given an input argument a, the top-down algorithm employs a goal-driven approach for answering whether a is accepted or not. Its operation is guided by the same acceptability semantics as those used for Algorithm 1. Let us denote the predicates A(a) ≡ (a ∈ Acc R ), R(a) ≡ (a ∈ RejR ), and Ab(a) ≡ (a ∈ AbR ). Then, the acceptability semantics are as follows. Algorithm 2 Top-down computation of acceptable arguments: top-down(A , R , a) 1: defeaters ← set of arguments in A that defeat a according to R . 2: for d ∈ defeaters do 3: if top-down(A , R , d) = accepted then 4: return rejected 5: return accepted 308 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION (Acceptance) A node is accepted iff it has no defeaters or all its defeaters are rejected: A(a) ⇐⇒ ∀b ∈ defeaters(a), R(b). (Rejection) A node is rejected iff at least one of its defeaters is accepted: R(a) ⇐⇒ ∃b ∈ defeaters(a), A(b). (7) (Abeyance) A node is in abeyance iff its not accepted nor rejected: Ab(a) ⇐⇒ ¬A(a) ∧ ¬R(a). The logic of these equations can be easily implemented with a recursive algorithm, shown in Algorithm 2. The algorithm, given some input argument a, loops over all defeaters of a and responds rejected if any of its defeaters is accepted (line 4). If execution reaches the end of the loop at line 5 then that means that none of its defeaters was accepted, and thus the algorithm accepts the input argument a. We can represent the execution of the top-down algorithm graphically by a tree that contains a at the root node, and all the defeaters of a node as its children. A leaf is reached when a node has no defeaters. In that case the loop contains no iterations and line 5 is reached trivially. Unfortunately, the top-down algorithm, as shown in Algorithm 2, will fail to terminate when a node is in abeyance. This is clear from the following lemma (proved formally in Appendix A but reproduced here to aid our intuition). Lemma 8. For every argument a, Ab(a) =⇒ ∃b ∈ attackers(a), Ab(b). (An attacker is a type of defeater; it is explained in detail in the next section. For the following discussion the reader can substitute “attacker” with “defeater” in the lemma above.) From this lemma we can see that, if an argument is in abeyance, its set of defeaters must contain an argument in abeyance and thus the recursive call of the top-down algorithm will never terminate, as there will always be another defeater in abeyance during each call. While there are ways to overcome this difficulty in the general case, we can prove that using the preference-based argumentation framework (presented later in the paper) and for the particular preference relation introduced for deciding on independence tests (c.f. Section 3.3), no argument can be in abeyance and thus the top-down algorithm always terminates. A formal proof of this is presented later in Section 5. We conclude the section by proving that the top-down algorithm is equivalent to the bottom-up algorithm of Algorithm 1 that is, given the same input as Algorithm 1 it is guaranteed to produce the same output. The proof assumes no argument is in abeyance. This assumption is satisfied for argumentation in independence knowledge bases (c.f. Theorem 20, Section 5). Theorem 9. Let a be an argument in the argumentation framework A , R , and let F be the set of acceptable arguments output by Algorithm 1. Assuming a is not in abeyance, top-down(A , R , a) = accepted ⇐⇒ a ∈ F (8) top-down(A , R , a) = rejected ⇐⇒ a ∈ F . / (9) 309 B ROMBERG AND M ARGARITIS Proof According to Theorem 7, the fix point of function F returned by Algorithm 1 contains the set of arguments considered acceptable by the acceptability semantics of Dung. As the topdown algorithm is a straightforward implementation of Dung’s acceptability semantics expressed by Eq. (7), the double implication of Eq. (8) must follow. To prove Eq. (9) we can prove the equivalent expression with both sides negated, that is, top-down(A , R , a) = rejected ⇐⇒ a ∈ F . Since a is not in abeyance, if the top-down algorithm does not return rejected it must return accepted. The double implication is thus equivalent to Eq. (8), which was proved true. 3.1 Argumentation in Independence Knowledge Bases We can now apply the argumentation framework to our problem of answering queries from knowledge bases that contain a number of potentially inconsistent independences and dependencies and a set of rules that express relations among them. Definition 10. An independence knowledge base (IKB) is a knowledge base Σ, Ψ such that its set of propositions Σ contains independence propositions of the form (X⊥ | Z) or (X⊥ Y | Z) for ⊥Y ⊥ X, Y and Z disjoint subsets of V, and its set of inference rules Ψ is either the general set of axioms shown in Eq. (5) or the specific set of axioms shown in Eq. (6). For IKBs, the set of arguments A is obtained in two steps. First, for each proposition σ ∈ Σ (independence or dependence) we add to A the argument ({σ}, σ). This is a valid argument according to Definition 2 since its support {σ} is (trivially) consistent, it (trivially) implies the head σ, and it is minimal (the pair (∅, σ) is not a valid argument since ∅ is equivalent to the proposition true which does not entail σ in general). We call arguments of the form ({σ}, σ) propositional arguments since they correspond to single propositions. The second step in the construction of the set of arguments A concerns rules. Based on the chosen set of axioms (general or directed) we construct an alternative, logically equivalent set of rules Ψ , each member of which is singleheaded, that is, contains a single proposition as the consequent, and decomposed, that is, each of its propositions is an independence statement over single variables (the last step is justified by the fact that typical algorithms for causal learning never produce nor require the evaluation of independence between sets). To construct the set of single-headed rules we consider, for each axiom, all possible contrapositive versions of it that have a single head. To illustrate, consider the Weak Transitivity axiom (X⊥ | Z) ∧ (X⊥ | Z ∪ γ) =⇒ (X⊥ | Z) ∨ (γ⊥ | Z) ⊥Y ⊥Y ⊥γ ⊥Y from which we obtain the following set of single-headed rules: (X⊥ | Z) ∧ (X⊥ | Z ∪ γ) ∧ (X⊥ γ | Z) ⊥Y ⊥Y ⊥ =⇒ (γ⊥ | Z) ⊥Y (X⊥ | Z) ∧ (X⊥ | Z ∪ γ) ∧ (γ⊥ Y | Z) ⊥Y ⊥Y ⊥ =⇒ (X⊥ | Z) ⊥γ (X⊥ | Z ∪ γ) ∧ (γ⊥ Y | Z) ∧ (X⊥ γ | Z) ⊥Y ⊥ ⊥ =⇒ (X⊥ Y | Z) ⊥ (X⊥ | Z) ∧ (γ⊥ Y | Z) ∧ (X⊥ γ | Z) ⊥Y ⊥ ⊥ =⇒ (X⊥ Y | Z ∪ γ). ⊥ 310 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION To obtain decomposed rules we apply the Decomposition axiom to every single-headed rule to produce only propositions over singletons. To illustrate, consider the Intersection axiom: (X⊥ | Z ∪ W) ∧ (X⊥ ⊥Y ⊥W | Z ∪ Y) =⇒ (X⊥ ∪ W | Z). ⊥Y In the above the consequent coincides with the antecedent of the Decomposition axiom, and we thus replace the Intersection axiom with a decomposed version: (X⊥ | Z ∪ W) ∧ (X⊥ ⊥Y ⊥W | Z ∪ Y) =⇒ (X⊥ | Z) ∧ (X⊥ ⊥Y ⊥W | Z). Finally, note that it is easy to show that this rule is equivalent to two single-headed rules, one implying (X⊥ | Z) and the other implying (X⊥ ⊥Y ⊥W | Z). The result of the application of the above procedures is a set of single-headed, decomposed rules Ψ . We construct, for each such rule (Φ1 ∧ Φ2 . . . ∧ Φn =⇒ ϕ) ∈ Ψ and for each subset of Σ that matches exactly the set of antecedents, that is, each subset {ϕ 1 , ϕ2 . . . , ϕn } of Σ such that Φ1 ≡ ϕ1 , Φ2 ≡ ϕ2 . . . Φn ≡ ϕn , the argument ({ϕ1 ∧ ϕ2 ∧ . . . ∧ ϕn }, ϕ), and add it to A .1 IKBs can be augmented with a set of preferences that allow one to take into account the reliability of each test when deciding on the truth value of independence queries. This is described in the next section. 3.2 Preference-based Argumentation Framework Following Amgoud and Cayrol (2002), we now refine the argumentation framework of Dung (1995) for cases where it is possible to define a preference order Π over arguments. Definition 11. A preference-based argumentation framework (PAF) is a triplet A , R , Π where A is a set of arguments, R ⊆ A × A is a binary relation representing a defeat relationship between pairs of arguments, and Π is a (partial or total) order over A . For the case of inconsistent knowledge bases, preference Π over arguments follows the preference π over their support, that is, stronger support implies a stronger argument, which is given as a partial or total order over sets of propositions. Formally: Definition 12. Let K = Σ, Ψ be a knowledge base, π be a (partial or total) order on subsets of Σ and (H, h), (H , h ) two arguments over K . Argument (H, h) is π-preferred to (H , h ) (denoted (H, h) π (H , h )) if and only if H is preferred to H with respect to π. In what follows we overload our notation by using π to denote either the ordering over arguments or over their supports. An important sub-class of preference relations is the strict and transitive preference relation, defined as follows. Definition 13. We say that preference relation π over arguments is strict if the order of arguments induced by it is strict and total, that is, for every pair of arguments a and b, a π b ⇐⇒ ¬ b π a . 1. This is equivalent to propositionalizing the set of rules, which are first-order (the rules of Eqs. (5) and (6) are universally quantified over all sets of variables, and thus are the rules in Ψ ). As this may be expensive (exponential in the number of propositions), in practice it is not implemented in this way; instead, appropriate rules are matched on the fly during the argumentation inference process. 311 B ROMBERG AND M ARGARITIS Definition 14. We say that preference relation π over arguments is transitive if, for every three arguments a, b and c, a π b ∧ b π c =⇒ a π c . The importance of the properties of strictness and transitivity will become clear later when we talk about the correctness of the argumentative independence test (defined later in Section 4). We now introduce the concept of attack relation, a combination of the concepts of defeat and preference relation. Definition 15. Let A , R , π be a PAF, and a, b ∈ A be two arguments. We say b attacks a if and only if b R a and ¬(a π b). We can see that the attack relation is a special case of the defeat relation and therefore the same conclusions apply; in particular Theorem 7, which allows us to compute the set of acceptable arguments of a PAF using Algorithm 1 or Algorithm 2. In Sections 3.3 and 4 below, we apply these ideas to construct an approximation to the independencequery oracle that is more reliable than a statistical independence test. 3.3 Preference-based Argumentation in Independence Knowledge Bases We now describe how to apply the preference-based argumentation framework of Section 3.2 to improve the reliability of conditional independence tests conducted on a (possibly small) data set. A preference-based argumentation framework has three components. The first two, namely A and R , are identical to the general argumentation framework. We now describe how to construct the third component, namely the preference order π over subsets H of Σ, in IKBs. We define it using a belief estimate ν(H) that all propositions in H are correct, H π H ⇐⇒ ν(H) > ν(H ) ∨ ν(H) = ν(H ) ∧ f (H, H ) . (10) That is, H is preferred over H if and only if its belief of correctness is higher than that of H or, in the case that these beliefs are equal, we break the tie using predicate f . For that we require that ∀H, H ⊆ A , such that H = H , f (H, H ) = ¬ f (H , H). (11) In addition, we require that f be transitive, that is, f (H, H ) ∧ f (H , H ) =⇒ f (H, H ). This implies that the preference relation π is transitive, which is a necessary condition for proving a number of important theorems in Appendix A. In our implementation we resolved ties by assuming an arbitrary order of the variables in the domain, determined at the beginning of the algorithm and maintained fixed during its entire execution. Based on this ordering, f (H, H ) resolved ties by the lexicographic order of the variables in H and H . By this definition, our f is both non-commutative and transitive. Before we define ν(H) we first show that π, as defined by Eqs. (10) and (11) and for any definition of ν(H), satisfies two important properties, namely strictness (Definition 13) and transitivity (Definition 14). We do this in the following two lemmas. Lemma 16. The preference relation for independence knowledge bases defined by Equations (10) and (11) is strict. 312 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Proof H π H ⇐⇒ ν(H) > ν(H ) ∨ ν(H) = ν(H ) ∧ f (H, H ) [By Eq. (10)] ⇐⇒ ν(H) ≥ ν(H ) ∧ ν(H) > ν(H ) ∨ f (H, H ) [Distributivity of ∨ over ∧] ⇐⇒ ¬ ν(H ) > ν(H) ∨ ν(H ) ≥ ν(H) ∧ f (H , H) [Double negation and Eq. (11)] ν(H ) > ν(H) ∨ ν(H ) ≥ ν(H) ∧ ν(H ) > ν(H) ∨ f (H , H) ⇐⇒ ¬ ⇐⇒ ¬ ν(H ) ≥ ν(H) ∧ ν(H ) > ν(H) ∨ f (H , H) ν(H ) > ν(H) ∨ ν(H ) = ν(H) ∧ ν(H ) > ν(H) ∨ f (H , H) ⇐⇒ ¬ ⇐⇒ ¬ ν(H ) > ν(H) ∨ ν(H ) = ν(H) ∧ f (H , H) ⇐⇒ ¬(H π [Common factor ν(H ) > ν(H)] H) [Again by Eq. (10)] Lemma 17. The preference relation defined by Equations (10) and (11) is transitive. Proof H π J ∧J ⇐⇒ π K ν(H) > ν(J) ∨ ν(H) = ν(J) ∧ f (H, J) ∧ ν(J) > ν(K) ∨ ν(J) = ν(K) ∧ f (J, K) ⇐⇒ [By Eq. (10)] ν(H) > ν(J) ∧ ν(J) > ν(K) [Case A] ∨ ν(H) > ν(J) ∧ ν(J) = ν(K) ∧ f (J, K) [Case B] ∨ ν(H) = ν(J) ∧ f (H, J) ∧ ν(J) > ν(K) [Case C] ∨ ν(H) = ν(J) ∧ f (H, J) ∧ ν(J) = ν(K) ∧ f (J, K) [Case D] To complete the proof we show that each of the cases A, B, C and D implies H (Case A) ν(H) > ν(J) ∧ ν(J) > ν(K) =⇒ ν(H) > ν(K) =⇒ H π π K. K. (Case B) ν(H) > ν(J) ∧ ν(J) = ν(K) ∧ f (J, K) =⇒ ν(H) > ν(K) =⇒ H π K. (Case C) ν(H) = ν(J) ∧ f (H, J) ∧ ν(J) > ν(K) =⇒ ν(H) > ν(K) =⇒ H π K. (Case D) ν(H) = ν(J) ∧ f (H, J) ∧ ν(J) = ν(K) ∧ f (J, K) =⇒ ν(H) = ν(K) ∧ f (H, K) =⇒ H due to the transitivity of predicate f . 313 π K, B ROMBERG AND M ARGARITIS We now return to the computation of ν(H). We estimate the belief ν(H) that a set of propositions H is correct by assuming independence among these propositions. 2 Overloading notation and denoting by ν(h) the probability of an individual proposition h being correct, the probability of all elements in H being correct under this assumption of independence is ν(H) = ∏ ν(h). (12) h∈H The belief that a proposition stating independence is correct can be computed in different ways, depending on the particular choice of independence oracle chosen. In this paper we use Wilk’s G 2 test, but the resulting belief can be easily adapted to any other independence oracle that produces p-values. We hope that the following discussion serves as a starting point for others to adapt it to other types of independence oracles. As discussed in Section 2, the p-value p(X,Y | Z) computed by this test is the probability of error in rejecting the null hypothesis (conditional independence in our case) and assuming that X and Y are dependent. Therefore, the probability of a test returning dependence of being correct is νD (X ⊥ | Z) = 1 − p(X,Y | Z) ⊥Y where the subscript D indicates that this expression is valid only for dependencies. Formally, the error of falsely rejecting the null hypothesis is called a type I error. To determine the preference of a test returning independence we can, in principle, use this procedure symmetrically: use the probability of error in falsely accepting the null hypothesis (again, this is conditional independence), called a type II error, which we denote by β(X,Y | Z). In this case we can define the preference of independence (X⊥ | Z) as the probability of correctly assuming independence by ⊥Y νI (X⊥ | Z) = 1 − β(X,Y | Z) ⊥Y where again the subscript I indicates that it is valid only for independences. Unfortunately value of β cannot be obtained without assumptions, because it requires the computation of the probability of the test statistic under the hypothesis of dependence, and there are typically an infinite number of dependent models. In statistical applications, the β value is commonly approximated by assuming one particular dependence model if prior knowledge about that is available. In the absence of such information however in this paper we estimate it using a heuristic function of the p-value, assuming the following heuristic constraints on β: 1 α α − 2+|Z| β(X,Y | Z) = α if p(X,Y | Z) = 0 if p(X,Y | Z) = 1 if p(X,Y | Z) = α. The first constraint (for p(X,Y | Z) = 0) corresponds to the intuition that when the p-value of the test is close to 0, the test statistic is very far from its value under the model that assumes independence, and thus we would give more preference to the “dependence” decision. The intuition for 2. The assumption of independence is a heuristic, and is made mainly due to the difficulty of determining the dependence between two or more statistical tests evaluated on the same data set. Other possible ways of defining the preference of a set of propositions are possible. The problem of dealing with multiple tests is an open problem and an area of active research in statistics; see Section 1 for a discussion. 314 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Figure 1: Preference functions νI (h) and νD (h) for statements of independence and dependence respectively, as functions of the p-value of test h. the second case (p(X,Y | Z) = 1) is reversed—when the value of the statistic is very close to the expected one under independence then independence is preferred. The value of the second case is tempered by the number of variables in the conditioning set. This reflects the practical consideration that, as the number 2 + |Z| of variables involved in the test increases, given a fixed data set, the discriminatory power of the test diminishes as |Z| → ∞. The third case causes the two functions ν I and νD to intersect at p-value α. This is due to fairness: in the absence of non-propositional arguments (i.e., in the absence of inference rules in our knowledge base), the independence decisions of the argumentation framework should match those of the purely statistical tests, that is, “dependence” if and only if (p-value ≤ α). If instead we chose a different intersection point, then the resulting change in the outcome of tests may have been simply due to bias in the independence decision that favors dependence or independence, that is, equivalent to an arbitrary change of the threshold of the statistical test, and the comparison of the statistical and the new test based on argumentation would not be a fair one. The remaining values of β are approximated by linear interpolation among the above points. The result is summarized in Fig. 1, which depicts preference functions ν D and νI with respect to the p-value of the corresponding statistical test. Let us illustrate how the preference-based argumentation can be used to resolve the inconsistencies of Example 1. Example 3. In example 1 we considered an IKB with the following propositions (0⊥ | 2) ⊥1 (13) (0⊥ 3 | 2) ⊥ (14) (0⊥ | {1, 2}) ⊥3 (15) (0⊥ | 2) ∧ (0⊥ 3 | 2) =⇒ (0⊥ 3 | {1, 2}). ⊥1 ⊥ ⊥ (16) 315 B ROMBERG AND M ARGARITIS Following the IKB construction procedure described in the previous section, propositions (13), (14) and (15) correspond to the following arguments, respectively: (0⊥ | 2) , (0⊥ | 2) ⊥1 ⊥1 (0⊥ 3 | 2) , (0⊥ 3 | 2) ⊥ ⊥ (0⊥ | {1, 2}) , (0⊥ | {1, 2}) ⊥3 ⊥3 (17) while rule (16) corresponds to the argument (0⊥ | 2), (0⊥ 3 | 2) , (0⊥ 3 | {1, 2}) . ⊥1 ⊥ ⊥ (18) Let us extend this IKB with the following preference values for its propositions and rule. Pref [(0⊥ | 2)] = 0.8 ⊥1 Pref [(0⊥ 3 | 2)] = 0.7 ⊥ Pref [(0⊥ | {1, 2})] = 0.5. ⊥3 According to Definition (12), the preference of each argument ({σ}, σ) is equal to the preference value of {σ} which is equal to the preference of σ, as it contains only a single proposition. Thus, Pref = 0.8 Pref Pref (0⊥ | 2) , (0⊥ | 2) ⊥1 ⊥1 (0⊥ 3 | 2) , (0⊥ 3 | 2) ⊥ ⊥ = 0.7 (0⊥ | {1, 2}) , (0⊥ | {1, 2}) ⊥3 ⊥3 = 0.5. The preference of argument (18) equals the preference of the set of its antecedents, which, according to Eq. (12), is equal to the product of their individual preferences, that is, Pref (0⊥ | 2), (0⊥ 3 | 2) , (0⊥ 3 | {1, 2}) ⊥1 ⊥ ⊥ = 0.8 × 0.7 = 0.56. Proposition (15) and rule (16) contradict each other logically, that is, their corresponding arguments (17) and (18) defeat each other. However, argument (18) is not attacked as its preference is 0.56 which is larger than 0.5, the preference of its defeater argument (17). Since no other argument defeats (18), it is acceptable, and (17), being attacked by an acceptable argument, must be rejected. We therefore see that using preferences the inconsistency of Example 1 has been resolved in favor of rule (16). Let us now illustrate the defend relation, that is, how an argument can be defended by some other argument. The example also illustrates an alternative resolution for the inconsistency of Example 1, this time in favor of the independence proposition (15). Example 4. Let us extend the IKB of Example 3 with two additional independence propositions and an additional rule. The new propositions and their preference are: Pref [(0⊥ | {2, 3})] = 0.9 ⊥4 Pref [(0⊥ | {2, 4})] = 0.8 ⊥3 316 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION and the new rule is: (0⊥ | {2, 3}) ∧ (0⊥ | {2, 4}) ⊥4 ⊥3 =⇒ (0⊥ | 2). ⊥3 This rule is an instance of the Intersection axiom followed by Decomposition. The corresponding arguments and preferences are: Pref (0⊥ | {2, 3}) , (0⊥ | {2, 3}) ⊥4 ⊥4 = 0.9 Pref (0⊥ | {2, 4}) , (0⊥ | {2, 4}) ⊥3 ⊥3 = 0.8 corresponding to the two propositions, and Pref (0⊥ | {2, 3}), (0⊥ | {2, 4}) , (0⊥ | 2) ⊥4 ⊥3 ⊥3 = 0.9 × 0.8 = 0.72 (19) corresponding to the rule. As in Example 3, argument (17) is attacked by argument (18). Let us represent this graphically using an arrow from argument a to argument b to denote that a attacks b, that is, Argument (18) −→ Argument (17). If the IKB was as in Example 3, (18) would had been accepted and (17) would have been rejected. However, the additional argument (19) now defeats (undercuts) (18) by logically contradicting its antecedent (0⊥ 3 | 2). Since the preference of (19), namely 0.72, is larger than that of ⊥ (18), namely 0.56, (19) attacks (18). Therefore, (19) defends all arguments that are attacked by argument (18), and in particular (17). Graphically, Argument (19) −→ Argument (18) −→ Argument (17). Note this is not sufficient for accepting (17) as it has not been proved that its defender (19) is itself acceptable. We leave the proof of this as an exercise for the reader. 4. The Argumentative Independence Test (AIT) The independence-based preference argumentation framework described in the previous section provides a semantics for the acceptance of arguments consisting of independence propositions. However, what we need is a procedure for a test of independence that, given as input a triplet σ = (X,Y | Z) responds whether X is independent or dependent of Y given Z. In other words, we need a semantics for the acceptance of propositions, not arguments. Let us consider the two propositions related to the input triplet σ = (X,Y | Z), proposition (σ = true), abbreviated σ t , and proposition (σ = false), abbreviated σf , that correspond to independence (X⊥ | Z) and ⊥Y dependence (X ⊥ | Z) of σ, respectively. The basic idea for deciding on the independence or de⊥Y pendence of input triplet σ is to define a semantics for the acceptance or rejection of propositions σ t and σf based on the acceptance or rejection of their respective propositional arguments ({σ t }, σt ) and ({σf }, σf ). Formally, (X ⊥ | Z) is accepted ⊥Y iff ({(X ⊥ | Z)}, (X ⊥ | Z)) is accepted, and ⊥Y ⊥Y (X⊥ | Z) is accepted ⊥Y iff ({(X⊥ | Z)}, (X⊥ | Z)) is accepted. ⊥Y ⊥Y 317 (20) B ROMBERG AND M ARGARITIS Based on this semantics over propositions, we decide on the dependence or independence of triplet σ as follows: σt = (X⊥ | Z) is accepted ⊥Y =⇒ (X⊥ | Z) ⊥Y σf = (X ⊥ | Z) is accepted ⊥Y =⇒ (X ⊥ | Z). ⊥Y (21) We call the test that determines independence in this manner the Argumentative Independence Test or AIT. For the above semantics to be well-defined, a triplet σ must be either independent or dependent, that is, not both or neither. For that, exactly one of the antecedents of the above implications of Eq. (21) must be true. Formally, Theorem 18. For any input triplet σ = (X,Y | Z), the argumentative independence test (AIT) defined by Eqs. (20) and (21) produces a non-ambiguous decision, that is, it decides σ evaluates to either independence or dependence, but nor both or neither. For that to happen, one and only one of its corresponding propositions σ t or σf must be accepted. A necessary condition for this is given by the following theorem. Theorem 19. Given a PAF A , R , π with a strict and transitive preference relation π, every propositional argument ({σt }, σt ) ∈ A and its negation ({σf }, σf ) satisfy ({σt }, σt ) is accepted iff ({σf }, σf ) is rejected. The above theorem is not sufficient because the propositions may still be in abeyance, but this possibility is ruled out for strict preference relations by Theorem 20, presented in the next section. The formal proofs of Theorems 18, 19 and 20 are presented in Appendix B. We now illustrate the use of AIT with an example. Example 5. We consider an extension of Example 3 to illustrate the use of the AIT to decide on the independence or dependence of input triplet (0, 3 | {1, 2}). According to Eq. (20) the decision depends on the status of the two propositional arguments: ({(0⊥ 3 | {1, 2})}, (0⊥ 3 | {1, 2})), and ⊥ ⊥ (22) ({(0⊥ | {1, 2})}, (0⊥ | {1, 2})). ⊥3 ⊥3 (23) Argument (23) is equal to argument (17) of Example 3 that was proved to be rejected in that example. Therefore, according to Theorem 19, its negated propositional argument Eq. (22) must be accepted, and we can conclude that triplet (0, 3 | {1, 2}) corresponds to a dependence, that is, we conclude that (0⊥ 3 | {1, 2}). ⊥ 5. The Top-down AIT Algorithm We now discuss in more detail the top-down algorithm which is used to implement the argumentative independence test, introduced in Section 3. We start by simplifying the recursion of Eq. (7) that determines the state (accepted, rejected, or in abeyance) of an argument a. We then explain the algorithm and analyze its computability (i.e., prove that its recursive execution is always finite) and its time complexity. To simplify the recursion Eq. (7) we use the following theorem (proved in Appendix B). 318 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Theorem 20. Let A , R , π be a PAF with a strict preference relation π. Then no argument a ∈ A is in abeyance. This theorem reduces the number of states of each argument to two, that is, an argument can be either accepted or not accepted (rejected). We will use the name of the argument a to denote the predicate “a is accepted” and its negation ¬a to denote the predicate “a is rejected.” With this notation, the above theorem, and the fact that we have extended the semantics of acceptability from the defeat to the attack relation (using preferences), the recursion of Eq. (7) can be expressed as follows a ⇐⇒ ∀b ∈ attackers(a), ¬b ¬a ⇐⇒ ∃b ∈ attackers(a), b or, equivalently, a ^ ⇐⇒ ¬b b∈attackers(a) ¬a _ ⇐⇒ b. b∈attackers(a) Finally, we notice that the second formula is logically equivalent to the first (simply negating both sides of the double implication recovers the first). Therefore, the Boolean value of the dialog tree for a can be computed by the simple expression a ^ ⇐⇒ ¬b. (24) b∈attackers(a) To illustrate, consider an attacker b of a. If b is rejected, that is, ¬b, the conjunction on the right cannot be determined without examining the other attackers of a. Only when all attackers of a are known to be rejected can the value of a be determined, that is, accepted. Instead, if b is accepted, that is, b, the state of ¬b is false and the conjunction can be immediately evaluated to false, that is, a is rejected regardless of the acceptability of any other attackers. An iterative version of the top-down algorithm is shown in Algorithm 3. We assume that the algorithm can access a global PAF A , R , π , with arguments in A defined over a knowledge base K = Σ, Ψ . Given as input a triplet t = (X,Y | Z), if the algorithm returns true (false) then we conclude that t is independent (dependent). It starts by creating a root node u for the propositional argument U of proposition t = true (lines 1–6). According to Eqs. (20) and (21), the algorithm then decides true if U is accepted (line 22). Otherwise, the algorithm returns false (line 23). This is because in this case, according to Theorem 19, the negation of propositional argument U must be accepted. Algorithm 3 is an iterative version of a tree traversal algorithm. It maintains a queue of the nodes that have not been expanded yet. A node is expanded when its children are added to the tree. In the algorithm, this is done in the loop of lines 17 to 21, which uses subroutine getAttackers of Algorithm 5 to obtain all attackers of an argument. This subroutine finds all attackers of the input argument a in a backward-chaining fashion, that is, given an argument a = (H, h), it searches for all rules in the knowledge base K whose consequent matches the negation of some proposition in the 319 B ROMBERG AND M ARGARITIS Algorithm 3 independent(triplet t). 1: f true ← proposition (t = true) /* Creates independence proposition (t = true). */ 2: Utrue ← ({ f true }, f true ) 3: utrue ← node for argument Utrue 4: utrue .parent ← nil 5: u.STATE ← nil 6: f ringe ← [u] /* Initialize with u (root). */ 7: /* Create global rejected node, denoted by ρ. */ 8: ρ ← node with no argument and state rejected 9: while f ringe = ∅ do 10: u ← dequeue( f ringe) 11: attackers ← getAttackers(u.argument) 12: if (attackers = ∅) then 13: u.STATE ← accepted 14: if sendMsg(ρ, u) = terminate then break 15: attackers ← sort attackers in decreasing order of preference. 16: /* Enqueue attackers after decomposing them. */ 17: for each A ∈ attackers do 18: a ← node for argument A 19: a.parent ← u 20: a.STATE ← nil 21: enqueue a in f ringe /* See details in text. */ 22: if (u.STATE = accepted) then return true 23: if (u.STATE = rejected) then return false Algorithm 4 sendMsg(Node c, Node p). 1: /* Try to evaluate node p given new information in c.STAT E */ 2: if p = nil then 3: if c.STATE = accepted then p.STATE ← rejected 4: else if (∀ children q of p, q.STATE = rejected) then p.STATE ← accepted 5: /* If p was successfully evaluated, try to evaluate its parent by sending message upward. */ 6: if p.STATE = nil then 7: return sendMsg(p, p.parent) 8: else 9: return continue 10: else 11: return terminate /* The root node has been evaluated. */ support H (undercutters), or the negation of its head h (rebutters). Every node maintains a threevalued state variable STATE ∈ {nil, accepted, rejected}. The nil state denotes that the value of the node is not yet known, and a node is initialized to this state when it is added to the tree. The algorithm recurses down the dialog tree until a node is found that has no attackers (line 12). Such a node is accepted in line 13, that is, the conjunction of Eq. (24) is trivially true, and its STATE is propagated upwards toward the root to the parent using subroutine sendMsg (Algorithm 4). Every 320 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Algorithm 5 Finds all attackers of input argument a in knowledge base K = Σ, Ψ : getAttackers(a = (H, h)) 1: attackers ← ∅ 2: /* Get all undercutters or rebutters of a. */ 3: for all propositions ϕ ∈ H ∪ {h} do 4: /* Get all defeaters of proposition ϕ. */ 5: for all rules (Φ1 ∧ Φ2 . . . ∧ Φn =⇒ ¬ϕ) ∈ Ψ do 6: /* Find all propositionalizations of the rule whose consequent matches ¬ϕ. */ 7: for all subsets {ϕ1 , ϕ2 . . . , ϕn } of Σ s.t. Φ1 ≡ ϕ1 , Φ2 ≡ ϕ2 . . . Φn ≡ ϕn do 8: d ← ({ϕ1 ∧ ϕ2 . . . ϕn }, ¬ϕ) /* Create defeater. */ 9: /* Is the defeater an attacker? */ 10: if ¬(a π d) then 11: attackers ← attackers ∪ {d} 12: return attackers time a node receives a message from a child, if the message is accepted, the node is rejected (line 3 of Algorithm 4), otherwise the node is accepted if all its children has been evaluated to rejected (line 4 of Algorithm 4). The subroutine sendMsg then proceeds recursively by forwarding a message to the parent whenever a node has been evaluated (line 7). If the root is reached and evaluated, the message is sent to its parent, which is nil. In this case, the subroutine returns the special keyword terminate back to the caller, indicating that the root has been evaluated and thus the main algorithm (Algorithm 3) can terminate. The caller can be either the subroutine sendMsg, in which case it pushes the returned message up the method-calling stack, or the top-down algorithm in line 14, in which case its “while” loop is terminated. An important part of the algorithm is yet underspecified, namely the order in which the attackers of a node are explored in the tree (i.e., the priority with which nodes are enqueued in line 21). This affects the order of expansion of the nodes in the dialog tree. Possible orderings are depth-first, breadth-first, iterative deepening, as well as informed searches such as best-first when a heuristic is available. In our experiments we used iterative deepening because it combines the benefits of depth-first and breadth-first search, that is, small memory requirements on the same order as depthfirst search (i.e., on the order of the maximum number of children a node can have) but also the advantage of finding the shallowest solution like breadth-first search. We also used a heuristic for enqueuing the children of a node. According to iterative deepening, the position in the queue of the children of a node is specified relative to other nodes, but not relative to each other. We therefore specified the relative order of the children according to the value of the preference function: children with higher preference are enqueued first (line 15 of the top-down algorithm), and thus, according to iterative deepening, would be dequeued first. 5.1 Computability of the Top-Down Algorithm An open question is whether the top-down algorithm is computable, that is, whether it always terminates. In this section we prove that it is. To prove this we need to show that under certain general conditions the acceptability of an argument a can always be determined, that is, that the algorithm always terminates. This is proved by the following theorem. 321 B ROMBERG AND M ARGARITIS Theorem 21. Given an arbitrary triplet t = (X,Y | Z), and a PAF A , R , π with a strict preference relation π, Algorithm 3 with input t over A , R , π terminates. The proof consists on showing that the path from the root a to any leaf is always finite. For that, the concept of an attack sequence is needed. Definition 22. An attack sequence is a sequence a1 , a2 , . . . , an of n arguments such that for every 2 ≤ i ≤ n, ai attacks ai−1 . By the manner in which the top-down algorithm constructs the dialog tree it is clear that any path from the root to a leaf is an attack sequence. It therefore suffices to show that any such sequence is finite. This is done by the following theorem. Theorem 23. Every attack sequence a1 , a2 , . . . , an in a PAF A , R , π with strict π and finite A is finite. Intuitively, if the preference relation is strict then an element can attack its predecessor in the sequence but not vice versa. Since the set of arguments A is finite, the only way for an attack sequence to be infinite is to contain a cycle. In that case, an argument would be attacking at least one of its predecessors, which cannot happen in a PAF with a strict preference relation. We present formal proofs of Theorems 21 and 23 in Appendix A. We thus arrived at the important conclusion that, under a strict preference function and a finite argument set, the state of any argument is computable. As we showed in Section 3.3, the preference function for independence knowledge bases is strict, and thus the computability of the top-down algorithm is guaranteed. 5.2 Computational Complexity of the Top-Down Algorithm Since Algorithm 3 is a tree traversal algorithm, its time complexity can be obtained by techniques contained in standard algorithmic texts, for example, Cormen et al. (2001). The actual performance depends on the tree exploration procedure. In our case we used iterative deepening due to its linear memory requirements in d, where d is the smallest depth at which the algorithm terminates. Iterative deepening has a worst-time time complexity of O(bd ), where b is an upper bound on the dialog tree branching factor. Therefore, for a constant b > 1 the execution time is exponential in d in the worst case. Furthermore, for the case of independence tests, b itself may also be exponential in n (the number of variables in the domain). This is because the inference rules of Eqs. (5) and (6) are universally quantified, and therefore their propositionalization (lines 7–11 of Algorithm 5), may result in an exponential number of rules with the same consequent (attackers). A tighter bound may be possible but, lacking such a bound, we introduce in the next section an approximate top-down algorithm, which reduces the running time to polynomial. As we show in our experiments, the use of this approximation does not appreciably affect the accuracy improvement due to argumentation. 6. The Approximate Top-Down AIT Algorithm As the top-down algorithm has a theoretically exponential running time in the worst case, we hereby present a practical polynomial-time approximation of the top-down algorithm. We make use of two approximations: (a) To address the exponential behavior due to the depth of the dialog tree we apply a cutoff depth d for the process of iterative deepening. (b) To address the potentially 322 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION exponential size of the branching factor b (which equals the maximum number of defeaters of any argument appearing in the dialog tree) we limit the number of defeaters of each node—thus bounding the number of its attackers/children—to a polynomial function of n (the domain size) during the propositionalization process of Algorithm 5 (lines 7–11). Let (H, h) be an argument and let ϕ ∈ H ∪ {h} be one of its propositions, as in line 3 of Algorithm 5. The set of attackers Σ ϕ of (H, h) consists of all rules {ϕ1 ∧ ϕ2 . . . ∧ ϕk =⇒ ¬ϕ} of Σ, for some constant upper bound k on the size of their support. If ϕ = (X, Y | Z) and ϕi = (Xi , Yi | Zi ) for all 1 ≤ i ≤ k, then our approximation generates and uses a subset of Σϕ in the dialog tree such that |X| − c ≤ |Xi | ≤ |X| + c |Y| − c ≤ |Yi | ≤ |Y| + c |Z| − c ≤ |Zi | ≤ |Z| + c (25) where | · | denotes set cardinality, and c is a user-specified integer parameter that defines the approximation. We call this the approximate top-down algorithm. The computational complexity of the approximate top-down algorithm is polynomial in n, as shown in the next section. 6.1 Test Complexity of the Approximate Top-Down Algorithm In this section we prove that the number of statistical tests required by the Approximate Top-Down algorithm is polynomial in n. As described in the previous section, the approximate algorithm generates a bounded number of attackers for each proposition in the argument corresponding to some node in the dialog tree. A bound on the number of the possible attackers can be defined by the approximation of Eq. (25). These equations dictate that the size of each possible set X i in some proposition (Xi , Yi | Zi ) of some attacker of proposition (X, Y | Z) is between |X| + c and |X| − c (inclusively). As the number of elements that can be members of X i is bounded by n (the domain size), this produces at most n2c+1 possible instantiations for set Xi . Similarly, the number of possible instantiations for Yi and Zi is also n2c+1 . Therefore, an upper bound for the number of matches to some proposition in the antecedent of an attacking rule is O(n 6c+3 ) for some constant c. As there are r rules in the rule set and up to k propositions in each rule for some constants r and k (for example, r = 5 and k = 3 for Eq. (5) and r = 8 and k = 4 for Eq. (6)), an upper bound on the number of children of a node in the dialog tree is O(rkn6c+3 ), and thus an upper bound on the number of nodes in the dialog tree of depth d is O((rk)d nd(6c+3) ). As we demonstrate in our experiments, this is a rather loose upper bound and the performance of the approximate top-down algorithm is reasonable in practice, but it does serve to show that the theoretical worst-case performance is polynomial in n. In the experiments shown in the next section we used c = 1 and d = 3. 7. Experimental Results We conducted experiments on sampled and real-world data sets for the purpose of (a) evaluating the accuracy improvement of the argumentative test (both the exact and approximate versions) over its statistical counterpart; (b) demonstrating the performance improvements that can be achieved by the approximate version compared to the exact counterpart, without significant reduction in accuracy improvement; and (c) evaluating the improvements that result by the use of the argumentative framework for causal discovery. We address these issues below. 323 B ROMBERG AND M ARGARITIS 7.1 Comparative Evaluation of Bottom-Up, Exact Top-Down, and Approximate Top-Down Argumentative Tests In this section we demonstrate that the argumentation approach, implemented either by the (exact) bottom-up or the exact top-down algorithm (Algorithm 3), improves the accuracy of independence tests on small data sets. We also show that the approximate top-down algorithm (see Section 6) has accuracy performance improvements similar to its exact counterpart but significantly better execution times (orders of magnitude), that make it more practical and usable for larger domains. As the output of the bottom-up algorithm is guaranteed to be equal to the exact top-down algorithm as Theorem 9 of Section 3, we omit accuracy results for the bottom-up algorithm here. As the exact algorithm is impractical for large domains, for the present comparison we sampled data sets from two randomly generated Bayesian networks with n = 8 nodes. The networks were generated using BNGenerator (Ide et al., 2002), a publicly available Java package, with maximum degree per node τ equal to 3 and 7 to evaluate the performance in sparsely as well as densely connected domains. A large data set D was sampled from each network and our experiments were conducted on subsets of it containing an increasing number of data points N. This was done in order to assess the accuracy on varying conditions of reliability, as the reliability of a test varies (typically increases) with the amount of data available. To reduce variance, each experiment was repeated for ten data subsets of equal size, obtained by permuting the data points of D randomly and using the first N of them as input to our algorithms. We first compare the accuracy of argumentative tests versus their purely statistical counterparts (i.e., the G2 test) on several data sets sampled from randomly generated Bayesian networks. Sampled data experiments have the advantage of a more precise estimation of the accuracy since the underlying model is known. We present experiments for two versions of the exact top-down argumentative test, one using Pearl’s general axioms of of Eq. (5), denoted AIT t -G, and another that uses Pearl’s “directed” axioms of Eq. (6), denoted AITt -D, as well as two versions of the approximate top-down argumentative test, denoted AITt -G and AITt -D respectively. We abbreviate purely statistical independence tests as SIT. We report the estimated accuracy, which, for each data set, is calculated by comparing the result of a number of conditional independence tests (SITs or AITs) on data with the true value of independence, computed by querying the underlying model for the conditional independence of the same test using d-separation. Since the number of possible tests is exponential, we estimated the independence accuracy by randomly sampling a set T of 1,000 triplets (X,Y, Z), evenly distributed among all possible conditioning set sizes m ∈ {0, . . . , n − 2}, that is, 1000/(n − 1) tests for each m. The independence or dependence value of the triplets (in the true, underlying model) were not controlled, but left to be decided randomly. This resulted in a non-uniform distribution of dependencies and independences. For instance, in the experiments shown next (n = 8, τ = 3, 7), the average proportion of independences vs. dependencies was 36.6% to 63.4% respectively for τ = 3, and 11.4% to 88.6% respectively for τ = 7. Denoting a triplet in T by t, by Itrue (t) the result of a test on t performed on the underlying model, and by Idata-Y (t) the results of performing a test on t of type Y on data, for Y equal to SIT, AITt -G, AITt -D, AITt -G, or AITt -D, the estimated accuracy of test type Y is defined as accdata = Y 1 |T | t ∈ T | Idata-Y (t) = Itrue (t) . 324 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, general axioms 100 Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 7, general axioms 100 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, approx.) − accuracy(SIT) accuracy(AIT, exact) − accuracy(SIT) 90 80 80 70 Accuracy (%) 70 Accuracy (%) Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, approx.) − accuracy(SIT) accuracy(AIT, exact) − accuracy(SIT) 90 60 50 40 60 50 40 30 30 20 20 10 10 0 0 10 100 1000 10000 10 100 1000 Data set size N (number of data points) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, directed axioms 100 Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 7, directed axioms 100 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, approx.) − accuracy(SIT) accuracy(AIT, exact) − accuracy(SIT) 90 80 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, approx.) − accuracy(SIT) accuracy(AIT, exact) − accuracy(SIT) 90 80 70 Accuracy (%) 70 Accuracy (%) 10000 Data set size N (number of data points) 60 50 40 60 50 40 30 30 20 20 10 10 0 0 10 100 1000 10000 Data set size N (number of data points) 10 100 1000 10000 Data set size N (number of data points) Figure 2: Accuracy comparison of statistical tests (SIT) vs. exact and approximate argumentative tests for domain size n = 8 and maximum degree per node τ = 3, 7. The histograms show the absolute value of the accuracy while the line curves show the difference between SIT and the argumentative tests. 95% confidence intervals are also shown for the line graphs. Top row: General axioms. Bottom row: Directed axioms. Figure 2 (top row) shows a comparison of the SIT with the exact and approximate top-down argumentative test over the general axioms for data set with increasing number of data points. The figure shows two plots for τ = 3, 7 of the mean values (over runs for ten different data sets) of accdata , SIT accdatat -G , and accdata -G (histograms) and the difference between the accuracies of the AIT tests AIT AITt and the statistical one (line graphs) for various data set sizes N. A positive value of the difference corresponds to an improvement of the argumentative test over SIT. The plots also show the statistical significance of this difference with 95% confidence intervals (error bars), that is, the interval around the mean value that has a 0.95 probability of containing the true difference. The figure demonstrates that there exist modest but consistent and statistically significant improvements in the accuracy of both the exact and approximate argumentative tests over the statistical test. We can observe improvements over the entire range of data set sizes in both cases with maximum improvements of up to 9% and 6% for the exact and approximate cases respectively (at τ = 3 and N = 600). In certain situations where the experimenter knows that the underlying distribution belongs to the class of Bayesian networks, it is appropriate to use the specific axioms of Eq. (6) instead of the general axioms of Eq. (5). The bottom row of Figure 2 presents the same comparison as the top 325 B ROMBERG AND M ARGARITIS row but for the exact and approximate argumentative tests AIT t -D and AITt -D that use the directed axioms instead of the general ones. As in the case for AIT using the general axioms, we can observe statistically significant improvements over the entire range of data set sizes in both cases. In this case however, the improvements are larger, with maximum increases in the accuracy of the exact and approximate test of up to 13% and 9% respectively (again for τ = 3 and N = 600). Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 160 data points, general axioms 100 Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 160 data points, general axioms Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 100 90 Accuracy (%) Accuracy (%) 90 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 80 70 60 50 40 80 70 60 50 40 30 30 20 20 10 10 0 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 Conditioning set size (number of variables) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 900 data points, general axioms 100 Conditioning set size (number of variables) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 900 data points, general axioms Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 100 90 Accuracy (%) Accuracy (%) 90 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 80 70 60 50 40 80 70 60 50 40 30 30 20 20 10 10 0 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 Conditioning set size (number of variables) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 5000 data points, general axioms 100 Conditioning set size (number of variables) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 5000 data points, general axioms Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 100 90 Accuracy (%) Accuracy (%) 90 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 80 70 60 50 40 80 70 60 50 40 30 30 20 20 10 10 0 0 0 1 2 3 4 5 6 0 Conditioning set size (number of variables) 1 2 3 4 5 6 Conditioning set size (number of variables) Figure 3: Accuracy comparison of SIT vs. exact (AITt -G) and approximate (AITt -G) argumentative tests over the general axioms for increasing conditioning set sizes. The six plots correspond to maximum degrees per node τ = 3, 7, and data set sizes N = 160, 900 and 5000. 326 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 160 data points, directed axioms 100 Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 160 data points, directed axioms Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 100 90 Accuracy (%) Accuracy (%) 90 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 80 70 60 50 40 80 70 60 50 40 30 30 20 20 10 10 0 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 Conditioning set size (number of variables) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 900 data points, directed axioms 100 Conditioning set size (number of variables) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 900 data points, directed axioms Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 100 90 Accuracy (%) Accuracy (%) 90 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 80 70 60 50 40 80 70 60 50 40 30 30 20 20 10 10 0 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 Conditioning set size (number of variables) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 5000 data points, directed axioms 100 Conditioning set size (number of variables) Accuracy comparison of SIT and AIT on sampled data n = 8 variables, τ = 3, N = 5000 data points, directed axioms Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 100 90 Accuracy (%) Accuracy (%) 90 Statistical test (SIT) Argumentative test (AIT, exact) Argumentative test (AIT, approx.) accuracy(AIT, exact) − accuracy(SIT) accuracy(AIT, approx) − accuracy(SIT) 80 70 60 50 40 80 70 60 50 40 30 30 20 20 10 10 0 0 0 1 2 3 4 5 6 0 Conditioning set size (number of variables) 1 2 3 4 5 6 Conditioning set size (number of variables) Figure 4: Same as Figure 3 but for AIT using the directed axioms instead of the general ones. We also evaluated the accuracy of these tests for increasing conditioning set sizes. Figures 3 and 4 show a comparison of the SIT with the exact and approximate top-down argumentative test using the general and directed axioms respectively, for accuracies over increasing conditioning set size for data sizes N = 160, 900, and 5000 (different rows). We can observe statistically significant improvements over the entire range of conditioning set sizes in all twelve graphs. Except in one case (directed axioms, N = 5000, τ = 3), all graphs show an upward trend in accuracy for increasing conditioning set size, representing a positive impact of the argumentative approach that increases with a decrease in test reliability, that is, increasing conditioning set size. 327 B ROMBERG AND M ARGARITIS We also compared the execution times of the bottom-up, exact top-down and approximate topdown algorithms on the same data sets. To run the bottom-up algorithm we generated the set of all propositional arguments possible, that is, arguments of the form ({σ}, σ), by iterating over all possible triplets (X, Y | Z), and inserted them in the knowledge base together with their preference, as described in Section 3.1. Similarly, for the set of axioms that we used in each case, that is, either the general (Eq. (5)) or the specific ones (Eq. (6)), we iterated over all possible matches of each rule, inserting the corresponding (single-headed and decomposed) instantiated rule in the knowledge base again together with its preference. The reason for including all propositional and rule-based arguments in our IKB is to allow the argumentation framework to consider all possible arguments in favor of or against an independence query. We compared the bottom-up algorithm AITb , the exact top-down algorithms AITt , and the approximate top-down algorithm AITt . For this, we measured the time it takes to discover the structure of a Bayesian networks using three versions of the PC algorithm (Spirtes et al., 2000), each using one of the three argumentative tests AITb , AITt , or AITt to conduct the independence tests. As usual, we consider two versions of each test AITb , AITt , and AITt , one that uses the general axioms of Eq. (5), that is, AITb -G, AITt -G, and AITt -G, respectively, and one that uses the specific axioms of Eq. (6) (applicable to Bayesian networks), that is, AITb -D, AITt -D, and AITt -D, respectively. The data sets used are the same as the ones used in the accuracy comparisons above. Figure 5 plots the execution time of argumentative tests AITb -G vs. AITt -G vs. AITt -G (top row) and AITb -D vs. AITt -D vs. AITt -D (bottom row) for tests that were conducted by the PC algorithm while learning the structure. Note that both the x and y-axes are plotted in log-scale. We can observe improvements in the execution time of the exact top-down algorithm over that of the bottom-up algorithm of an order of magnitude over the entire range of data set sizes in all four plots. We can also see improvement of a similar order between the exact and approximate topdown argumentative algorithms. For instance, for the general axioms and τ = 3 (top-left plot), the execution time for N = 5000 is 2749 seconds for the bottom-up against 107 seconds for the exact top-down and 15 seconds for the approximate top-down algorithm. We see even more pronounced execution time improvements when using the directed axioms (bottom row of Fig. 5). The execution-time results demonstrate that the exact top-down algorithm performs significantly better than the bottom-up algorithm, while producing the exact same output (according to Theorem 9 of Section 3). This implies a clear advantage of using the top-down over the bottom-up algorithm. Furthermore, we also saw that the approximate top-down algorithm performs similarly in terms of accuracy improvement while having polynomial worst-case execution time and in practice being several orders of magnitude faster than the exact top-down algorithm, which is exponential in the worst-case. As in the next two sections we continue our evaluation on domains significantly larger than the n = 8 variables that we examined here, it would be difficult or impractical for the exact algorithms to be employed. For these reasons in the following experiments we use the more practical approximate algorithm, which can be applied to larger domains. 7.2 Causal Discovery in Larger Domains We also conducted experiments that demonstrate the performance of the approximate top-down algorithm by (a) showing its applicability to large domains, and (b) demonstrating positive improvements in accuracy of argumentative tests on the learning of the structure of Bayesian networks, the main problem faced by causal discovery algorithms. In the following experiments we used the PC 328 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Execution time comparison of AIT algorithms on sampled data n = 8 variables, τ = 3, general axioms Bottom-up (exact) Top-down exact Top-down approximate Bottom-up (exact) Top-down exact Top-down approximate 10000 Execution time (seconds) 10000 Execution time (seconds) Execution time comparison of AIT algorithms on sampled data n = 8 variables, τ = 7, general axioms 1000 100 10 1 0.1 1000 100 10 1 10 100 1000 10000 10 100 1000 10000 Data set size N (number of data points) Execution time comparison of AIT algorithms on sampled data n = 8 variables, τ = 3, directed axioms Execution time comparison of AIT algorithms on sampled data n = 8 variables, τ = 7, directed axioms Bottom-up (exact) Top-down exact Top-down approximate Bottom-up (exact) Top-down exact Top-down approximate 10000 Execution time (seconds) 10000 Execution time (seconds) Data set size N (number of data points) 1000 100 10 1 1000 100 10 1 10 100 1000 10000 Data set size N (number of data points) 10 100 1000 10000 Data set size N (number of data points) Figure 5: Execution time comparison for the PC algorithm when it uses the bottom-up and exact top-down and approximate top-down argumentative tests to learn the structure of a Bayesian network from data sampled from Bayesian models with domain size n = 8, maximum degrees τ = 3, 7. The bars show the absolute value of the running time using a logarithmic scale. Top row: general axioms. Bottom row: directed axioms. algorithm. We compared the true structure of the underlying model to the resulting structure of the PC algorithm when it uses SITs as independence tests, denoted PC-SIT, and its output when it uses argumentative independence tests, denoted PC-AITt -D, when using the directed axioms. We evaluated the resulting networks by their ability to accurately represent the true independences in the domain, calculated by comparing the results (true or false) of a number of conditional tests conducted using d-separation on the output networks (PC-SIT or PC- AITt -D). Denoting by T this set of 2,000 triplets, by t ∈ T a triplet, by Itrue (t) the result of a test performed on the underlying model, and by IPC-Y (t) the result of performing a d-separation test on the network output by the PC algorithm using the Y test, Y equal to SIT or AITt -D, the estimated accuracy is defined as accPC = Y 1 |T | t ∈ T | IPC-Y (t) = Itrue (t) . (26) We considered data sampled from randomly generated Bayesian networks of sizes n = 24, and maximum degrees τ = 3, 7. For each network we sampled ten data sets, and, for each data set, we 329 B ROMBERG AND M ARGARITIS Accuracy comparison of SIT and AIT when used by the PC algorithm n = 24 variables, τ = 3, directed axioms 100 90 Accuracy comparison of SIT and AIT when used by the PC algorithm n = 24 variables, τ = 7, directed axioms 100 Statistical test (SIT) Argumentative test (AIT, approx.) accuracy(AIT, approx.) − accuracy(SIT) 90 70 Accuracy (%) 80 70 Accuracy (%) 80 Statistical test (SIT) Argumentative test (AIT, approx.) accuracy(AIT, approx.) − accuracy(SIT) 60 50 40 60 50 40 30 30 20 20 10 10 0 0 100 1000 10000 25000 100 Data set size N (number of data points) 1000 10000 25000 Data set size N (number of data points) Figure 6: Comparison of statistical tests (SIT) vs. approximate argumentative tests on the directed axioms (AITt -D) for data sets sampled from Bayesian models for domain size n = 24 and maximum degrees τ = 3, 7. conducted experiments on subsets of D containing an increasing number of data points. We report the average over the ten data sets of the estimated accuracy calculated using Eq. (26), for Y = SIT or AITt -D, as well as the difference between the average accuracies including the 95% confidence interval for the difference. Execution time of the PC algorithm using approximate AIT n = 24 variables, directed axioms Execution time (seconds) τ=3 τ=7 100000 10000 1000 100 10 10 100 1000 10000 100000 Data set size N (number of data points) Figure 7: Execution times for the PC algorithm using the approximate argumentative test on the directed axioms (AITt -D) on data sets sampled from Bayesian models for domain size n = 24 and maximum degrees τ = 3, 7. For the approximate AIT test we limited the depth of the dialog tree to 3 and its the branching factor as described in Section 6. Figure 6 shows a comparison of the argumentative tests AITt -D using the directed axioms with the corresponding SIT. The figure shows two plots for different values of τ of the mean values (over runs for ten different data sets) of accPC and accPC -D (histograms), the difference between these SIT AIT t 330 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION averages (line graph), and the 95% confidence intervals for the difference (error bars), for different data set sizes N. As usual, a positive value of the difference corresponds to an improvement of AITt -D over SIT. As in practically all experiments so far, we have statistically significant improvements over the entire range of data set sizes, with maximum improvements of up to 20% for τ = 3, N = 25000, and τ = 7, N = 900. The corresponding execution times for the entire PC algorithm are shown in Fig. 7. We can make two observations from this graph. One, the cost is significantly lower for sparse domains, which benefits real-world application domains that are sparse. The second observation is that the execution time scales linearly with the number of data points; this exhibits the same behavior as the use of a SIT test in PC, as each test needs to scan the data set once to compute the contingency table and relevant test statistics. In summary, these results demonstrate that the approximate argumentative test is practical for larger domains and can result in positive, statistically significant accuracy improvements when used for causal discovery. However, the cost of AIT for large data sets, although not prohibitive, can be non-negligible. Therefore the accuracy benefits of AIT vs. a SIT must be carefully weighed off the ability of the user to expend the extra computation. Note that the practicality of the approximate algorithm also depends on the parameters used (the cutoff depth of iterative deepening and the branching factor limit—see Section 6); different parameter values or alternative ways of limiting the size of the dialog tree may be needed for even larger domains. 7.3 Real-world and Benchmark Data Experiments While the sampled data set studies of the previous section have the advantage of a more controlled and systematic study of the performance of the algorithms, experiments on real-world data are necessary for a more realistic assessment. In this section we present experiments on a number of real-world and benchmark data sets obtained from the UCI machine learning repository (D. J. Newman and Merz, 1998) and the Knowledge Discovery Data repository (Hettich and Bay, 1999). As in the sampled data case of the previous section, for each data set D, we conducted experiments on subsets of D containing an increasing number of data points N to assess the performance of the independence tests on varying conditions of reliability. Again, to reduce variance we repeated each experiment ten times, each time choosing a different randomly selected data subset of equal size. Because for real-world data sets the underlying model is unknown, we could only be sure the general axioms of Eq. (5) apply. We therefore only used these axioms in this section. Also, as mentioned in the previous section, because some of the data sets have much larger domains (e.g., the alarm data set contains 37 variables), and given the exponential nature of the exact algorithms we could only perform experiments for the approximate version of the argumentative test. For these reasons, in the following experiments we only report the accuracy of AITt -G, the approximate argumentative independence test defined over the general axioms. Unfortunately, for real-world data the underlying model is typically unknown and therefore it is impossible to know the true value of any independence. We therefore approximate it by a statistical test on the entire data set, and limit the size of the data set subsets that we use up to a third of the size of the entire data set. This corresponds to the hypothetical scenario that a much smaller data set is available to the researcher, allowing us to evaluate the improvement of argumentation under these more challenging situations. Again, as in the previous two sections, for comparison we sampled 2,000 triplets and calculated the accuracy as a fraction of tests correct, where for the true value of independences and dependences we used the method just described. 331 B ROMBERG AND M ARGARITIS Accuracy comparison of SIT and AIT on real-world data Data set: cmc, general axioms 10 Accuracy comparison of SIT and AIT on real-world data Data set: letterRecognition, general axioms 10 accuracy(AIT, approx.) − accuracy(SIT) 7 Accuracy (%) 8 7 accuracy(AIT, approx.) − accuracy(SIT) 9 8 Accuracy (%) 9 6 5 4 3 6 5 4 3 2 2 1 1 0 0 -1 -1 10 100 1000 10 100 1000 Data set size N (number of data points) Accuracy comparison of SIT and AIT on real-world data Data set: alarm, general axioms 10 Accuracy comparison of SIT and AIT on real-world data Data set: nursery, general axioms 10 accuracy(AIT, approx.) − accuracy(SIT) 7 Accuracy (%) 8 7 accuracy(AIT, approx.) − accuracy(SIT) 9 8 Accuracy (%) 9 6 5 4 3 6 5 4 3 2 2 1 1 0 0 -1 -1 10 100 1000 10000 10 Data set size N (number of data points) 100 1000 10000 Data set size N (number of data points) Accuracy comparison of SIT and AIT on real-world data Data set: car, general axioms 10 Accuracy comparison of SIT and AIT on real-world data Data set: flare2, general axioms 10 accuracy(AIT, approx.) − accuracy(SIT) 8 8 7 accuracy(AIT, approx.) − accuracy(SIT) 9 7 Accuracy (%) 9 Accuracy (%) 10000 Data set size N (number of data points) 6 5 4 3 6 5 4 3 2 2 1 1 0 0 -1 -1 10 100 1000 Data set size N (number of data points) 10 100 1000 Data set size N (number of data points) Figure 8: Difference in the mean value of the accuracy AITt -G with the mean value of the accuracy of SIT for a number of real-world data sets. The error bars denote the 95% confidence interval of the difference. Figure 8 and Table 1 show the result of our comparison between the argumentative test AITt -G and statistical test SIT for real-world data sets. In the table, the best-performing method is shown in bold. The figure contains 6 plots, one for each data set, depicting the difference between the mean value of the accuracy of AITt -G and that of SIT, where as usual a positive value denotes an improvement of AITt -G over SIT. While in a few cases the average difference is negative (e.g., data set cmc, N = 40), in each case the negative value is not statistically significant as the confidence in332 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION N = 40 N = 240 N = 600 N = 1200 N = 3500 Data set car Domain size 7 Data set size 1730 SIT 80.1 80.1 AITt -G AITt -G − SIT 0.0 ± 0.7 Runtime of AITt -G (ms) 0.56 SIT 86.7 AITt -G 88.6 1.9 ± 0.6 AITt -G − SIT Runtime of AITt -G (ms) 1.37 SIT AITt -G AITt -G − SIT Runtime of AITt -G (ms) SIT AITt -G AITt -G − SIT Runtime of AITt -G (ms) SIT AITt -G AITt -G − SIT Runtime of AITt -G (ms) cmc 10 1475 77.8 77.5 -0.3 ± 0.6 1.07 84.1 84.7 0.5 ± 1.2 5.19 flare2 letterRecognition nursery 13 17 9 1067 20002 12962 77.0 47.9 83.3 77.1 47.8 83.8 0.1 ± 0.3 -0.1 ± 0.2 0.4 ± 0.1 2.61 4.19 0.88 85.5 50.7 86.1 86.9 51.0 87.2 1.3 ± 0.8 0.2 ± 0.4 1.1 ± 0.4 8.73 90.50 1.84 55.8 88.5 57.3 89.3 1.5 ± 0.5 0.8 ± 0.1 575.53 4.37 63.3 89.7 64.3 91.2 1.0 ± 0.3 1.5 ± 0.3 2008.76 14.05 73.8 94.1 76.5 95.4 2.6 ± 0.7 1.3 ± 0.3 24540.51 76.48 alarm 37 20003 76.7 76.7 0.0 ± 0.4 52.06 84.3 85.1 0.8 ± 0.3 202.05 88.6 89.8 1.2 ± 0.4 547.77 90.8 92.0 1.2 ± 0.4 1151.05 95.2 96.3 1.1 ± 0.3 3895.2 Table 1: Average accuracies (in percentage) of SIT and AITt -G, their differences (denoted AITt -G − SIT in the table), the 95% confidence interval for the difference, and the average runtime per test (in ms) for AITt -G for several real-world and benchmark data sets. For each data set the table shows these quantities for number of data points N = 40, 240, 600, 1200, 3500. The best performing algorithm ( AITt -G or SIT, with respect to accuracy) is indicated in bold. Empty cells correspond to cases where one third of the data set was smaller than the value of N in that column. terval contains a portion of the positive half-plane. The figure demonstrates a clear advantage of the argumentative approach, with all data sets reaching statistically significant positive improvements in accuracy of up to 3% and all confidence intervals covering positive values either partially or completely. The table also shows the average execution time (in ms) for the AITt -G tests evaluated. 8. Conclusion We presented a framework for addressing one of the most important problems of independencebased structure discovery algorithms, namely the problem of unreliability of statistical independence tests. Our main idea was to recognize the existence of interdependences among the outcomes of conditional independence tests—in the form of Pearl’s axiomatic characterization of the conditional independence relation—that can be seen as integrity constraints and exploited to correct unreliable statistical tests. We modeled this setting as a knowledge base containing conditional independences that are potentially inconsistent, and used the preference-based argumentation framework to reason with and resolve these inconsistencies. We presented in detail how to apply the argumentation framework to independence knowledge bases and how to compute the preference among the independence propositions. We also presented a number of algorithms, both exact and approximate, for implementing statistical testing using this framework. We analyzed the approximate algorithm 333 B ROMBERG AND M ARGARITIS and proved that is has polynomial worst-case execution time. We also experimentally verified that its accuracy improvement is close to the exact one while providing orders of magnitude faster execution, making possible its use for causal discovery in large domains. Overall, our experimental evaluation demonstrated statistically significant improvements in the accuracy of causal discovery for the overwhelming majority of sampled, benchmark and real-world data sets. Appendix A. Computability of the Argumentative Independence Test In this appendix we prove that the argumentative test terminates, a property that we call its computability. Some of the theorems and lemmas presented are not original work but adaptations of well known properties of relations. We include them to allow a complete exposition of the proof of computability, given by Theorem 21. We first introduce some notation. We denote independence propositions (e.g., (X⊥ | Z)) by σ and their negation (e.g., (X ⊥ | Z)) by ¬σ. We abbreviate ⊥Y ⊥Y their corresponding propositional arguments ({σ}, σ) and ({¬σ}, ¬σ) by a σ and a¬σ , respectively, and we will refer to a¬σ as the negation of aσ (and vice versa). Also, we use the predicates A(a), R(a), Ab(a) to denote the fact the argument a is accepted, rejected, or in abeyance, respectively. For completeness we repeat here the definition of strict and transitive preference relation. Definition 13. We say preference relation π over arguments is strict if the ordering of arguments induced by it is strict and total, that is, for every pair of arguments a and b, a π b ⇐⇒ ¬ b π a . (27) Definition 14. We say that preference relation π over arguments is transitive if, for every three arguments a, b and c, a π b ∧ b π c =⇒ a π c . Lemma 24. A strict preference relation π satisfies the condition that for every pair of arguments such that a defeats b and b defeats a, it is the case that a attacks b or b attacks a, that is, at least one of a and b attacks the other. Proof We prove by contradiction: Let us assume that a defeats b and b defeats a but neither a attacks b nor b attacks a. By definition of the attack relation (Definition 15), ¬ a attacks b =⇒ ¬ ¬(b π a) =⇒ b π a ¬ b attacks a =⇒ ¬ ¬(a π b) =⇒ a π b. and However, this is a contradiction since, by assumption, the preference ordering is strict, and therefore it cannot be true that both a π b and b π a are true at the same time. Lemma 25. A strict preference π satisfies the condition that for every pair a and b of arguments, it is not the case that both a attacks b and b attacks a, that is, there can be no mutual attack. Proof We prove by contradiction. Let us consider two mutually attacking arguments a and b. By the definition of the attack relation, and because π is a total order, we have that a attacks b =⇒ ¬(b π a) =⇒ a 334 π b ∨ a ≡π b I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION and b attacks b =⇒ ¬(a π b) =⇒ b π a ∨ b ≡π a where a ≡π b means a is equally preferable to b. However, equality of preference is not possible in a strict preference relation. Therefore it must be the case that a π b and b π a, which is a contradiction of Eq. (27), again due to strictness. We next prove that no argument is in abeyance if the preference relation over arguments is strict. For that, we first prove that an argument in abeyance is always attacked by at least another argument in abeyance. Lemma 8. For every argument a, Ab(a) =⇒ ∃b ∈ attackers(a), Ab(b). Proof By definition, an argument a is in abeyance if it is neither accepted nor rejected. Applying the definitions of acceptance and rejection and manipulating the Boolean formulae we obtain, ⇐⇒ ¬A(a) ∧ ¬R(a) ⇐⇒ ¬ ∀b ∈ attackers(a), R(b) ∧ ¬ ∃b ∈ attackers(a), A(b)) ⇐⇒ ∃b ∈ attackers(a), ¬R(b) ∧ ∀b ∈ attackers(a), ¬A(b)) ⇐⇒ ∃b ∈ attackers(a), (A(b) ∨ Ab(b)) ∧ ∀b ∈ attackers(a), ¬A(b)) ⇐⇒ Ab(a) ∃b ∈ attackers(a), Ab(b) ∧ ∀b ∈ attackers(a), ¬A(b)) =⇒ ∃b ∈ attackers(a), Ab(b). Definition 22. An attack sequence is a sequence a1 , a2 , . . . , an of n arguments such that for every 2 ≤ i ≤ n, ai attacks ai−1 . Lemma 26. Let A , R , π be a PAF with a strict and transitive preference relation π. Then, no argument can appear more than once in any attack sequence, that is, for every attack sequence a1 , a2 , . . . , an and every pair of integers i, j ∈ [1, n] such that i = j, ai = a j . Proof We first note that by definition of the attack relation, it must be the case that for any two consecutive arguments ai , ai+1 , it is true that ¬(ai π ai+1 ). Since π is strict, this is equivalent to ai+1 π ai (c.f. Eq. (27)). That is, an π an−1 π ... π a2 π a1 (28) We now assume, for contradiction, there exists an argument a that appears twice in the attack sequence at indexes i and j , that is, ∃ i , j ∈ [1, n], i = j , such that ai = a j = a . 335 B ROMBERG AND M ARGARITIS Since no argument defeats itself, it cannot attack itself, and thus the smallest possible attack sequence with a repeated argument must have at least length 3. From this fact, Eq. (28), and transitivity, there must exist an argument b = a such that a π b π a . This last fact implies that a b and b π a must hold, which contradicts strictness (Eq. (27)). π A corollary of this lemma is the following theorem. Theorem 23. Every attack sequence a1 , a2 , . . . , an in a PAF A , R , π with strict and transitive π, and finite A is finite. Proof Follows directly from Lemma 26 and the fact that A is finite. We can now prove the main result of this section in the following theorem. Theorem 21. Given an arbitrary triplet t = (X,Y | Z), and a PAF A , R , π with a strict and transitive preference relation π, and finite arguments set A , the top-down algorithm of Algorithm 3 run for input t over A , R , π terminates. Proof In the tree traversed by the top-down algorithm, any path from the root to a leaf is an attack sequence. Since for strict and transitive π, and finite A each such sequence is finite, the algorithm always terminates. Appendix B. Validity of the Argumentative Independence Test In this section we prove the property of the argumentative independence test of deciding that an input triplet (X,Y | Z) evaluates to either independence or dependence, but not both or neither. We call this property the validity of the test. We start we proving that under the assumption of a strict and transitive preference relation, no argument is in abeyance. Theorem 20. Let A , R , π be a PAF with a strict and transitive preference relation π. Then no argument a ∈ A is in abeyance. Proof Let us assume, for contradiction, that there is an argument a in abeyance. From Lemma 8, not only a has an attacker in abeyance, say argument b, but b also has an attacker in abeyance, and so on. That is, we can construct an attack sequence starting at a that contains only arguments in abeyance. Moreover, this sequence must be infinite, since the lemma assures as we always have at least one attacker in abeyance. This is in direct contradiction with Theorem 23. Corollary 27. For every argument a in a PAF A , R , π with strict and transitive π, A(a) ⇐⇒ ¬R(a). We now prove a number of lemmas that hold only for the sub-class of propositional arguments (arguments whose support contains only one proposition, equal to the head of that argument). We start with a lemma that demonstrates that it cannot be the case that an attacker of a propositional argument aσ and an attacker of its negation a¬σ do not attack each other. The former must attack the latter or vice versa. 336 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION Lemma 28. Let A , R , π be a PAF with a strict preference relation π, a σ ∈ A be a propositional argument, and a¬σ its negation. For every pair of arguments b and c that attacks a σ and a¬σ respectively, (b attacks c) ∨ (c attacks b). Proof Since aσ and a¬σ are propositional arguments, their support contains the head and only the head, and thus any defeater (i.e., rebutter or undercutter) must have as head ¬σ and σ, respectively, that is, the head of b must be ¬σ and the head of c must be σ. Thus, b rebuts (and thus defeats) c and vice versa. The lemma then follows directly from Lemma 24. Lemma 29. Let A , R , π be a PAF with a strict preference relation π, and a σ and a¬σ be a propositional argument and its negation. Then, R(aσ ) =⇒ ¬R(a¬σ ). Proof By assumption, R(aσ ). We assume, for contradiction, that R(a¬σ ). Therefore, by the definition of rejection, ∃b ∈ attackers(aσ ) such that A(b), and ∃c ∈ attackers(a¬σ ) such that A(c). By Lemma 28 b attacks c or c attacks b. In either case, an accepted argument is attacking an accepted argument, which contradicts the definition of acceptance. Lemma 30. Given a PAF A , R , π with a strict preference relation π, every propositional argument aσ ∈ A satisfies A(aσ ) =⇒ ¬A(a¬σ ) Proof We prove by contradiction. Let us assume that both a σ and a¬σ are accepted. Since aσ and a¬σ are propositional arguments, they defeat each other. Then, by Lemma 24 a σ attacks a¬σ or vice versa. In either case an accepted argument has an accepted attacker, which is a contradiction. We now prove Theorem 19 that was introduced in Section 4, reproduced here for convenience. Theorem 19. Given a PAF A , R , π with a strict and transitive preference relation π, every propositional argument aσ ∈ A and its negation a¬σ satisfy A(aσ ) ⇐⇒ R(a¬σ ). Proof The ( =⇒ ) direction follows from Lemma 30 and Theorem 20. The (⇐=) direction follows from Lemma 29 and Theorem 20. In Section 4 we defined the following semantics for deciding on the dependence or independence of an input triplet (X,Y | Z): ({(X ⊥ | Z)}, (X ⊥ | Z)) is accepted ⇐⇒ (X ⊥ | Z) is accepted =⇒ (X ⊥ | Z) ⊥Y ⊥Y ⊥Y ⊥Y ({(X⊥ | Z)}, (X⊥ | Z)) is accepted ⇐⇒ (X⊥ | Z) is accepted =⇒ (X⊥ | Z) ⊥Y ⊥Y ⊥Y ⊥Y (29) where acceptance is defined over an independence-based PAF as defined in Section 3.3. For this argumentative test of independence to be valid, its decision must be non-ambiguous, that is, it must decide either independence or dependence, but not both or neither. For that, exactly one of the antecedents of the above implications must be true. Formally: 337 B ROMBERG AND M ARGARITIS Theorem 18. For any input triplet σ = (X,Y | Z), the argumentative independence test defined by Eq. (29) produces a non-ambiguous decision, that is, it decides that σ evaluates to either independence or dependence, but not both or neither. Proof Let us denote (X⊥ | Z) by σt and (X ⊥ | Z) by σf . Since strictness and transitivity of ⊥Y ⊥Y the independence preference relation hold (proved in Section 3.3, lemmas 16 and 17 respectively), Theorems 19 and 20 hold as well. From Theorem 20 we know that neither of the propositional arguments is in abeyance. Thus, since aσt corresponds to the negation of aσf it follows from Theorem 19 that exactly one of them is accepted. References ˇ a H. Abdi. The Bonferonni and Sid´ k corrections for multiple comparisons. In Neil Salkind, editor, Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage, 2007. A. Agresti. Categorical Data Analysis. Wiley, 2nd edition, 2002. L. Amgoud and C. Cayrol. A reasoning model based on the production of acceptable arguments. Annals of Mathematics and Artificial Intelligence, 34:197–215, 2002. Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57 (1):289–300, 1995. Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188, 2001. W. G. Cochran. Some methods of strengthening the common χ 2 tests. Biometrics, 10:417–451, 1954. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2nd edition, 2001. C. L. Blake D. J. Newman, S. Hettich and C. J. Merz. UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science, 1998. A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society, 41:1–31, 1979. P. M. Dung. On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence, 77:321–357, 1995. P. G¨ rdenfors. Belief Revision. Cambridge Computer Tracts. Cambridge University Press, Cama bridge, 1992. P. G¨ rdenfors and H. Rott. Belief revision. In Gabbay, D. M., Hogger, C. J. and Robinson, J. a A., editors, Handbook of Logic in Artificial Intelligence and Logic Programming, volume 4. Clarendon Press, Oxford, 1995. 338 I MPROVING THE R ELIABILITY OF C AUSAL D ISCOVERY U SING A RGUMENTATION S. Hettich and S. D. Bay. The UCI KDD archive. Irvine, CA: University of California, Department of Information and Computer Science, 1999. Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4): 800–802, December 1988. J. S. Ide, F. G. Cozman, and F. T. Ramos. Generating random Bayesian networks with constraints on induced width. Brazilian Symposium on Artificial Intelligence, Recife, Pernambuco, Brazil, 2002. A. C. Kakas and F. Toni. Computing argumentation in logic programming. Journal of Logic and Computation, 9(4):515–562, 1999. M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994. R. P. Loui. Defeat among arguments: a system of defeasible inference. Computational Intelligence, 2:100–106, 1987. J. P. Martins. Belief revision. In Shapiro, S. C., editor, Encyclopedia of Artificial Intelligence, pages 110–116. John Wiley & Sons, New York, second edition, 1992. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 2nd edition, 1988. J. Pearl and A. Paz. GRAPHOIDS: A graph-based logic for reasoning about relevance relations. Technical Report 850038 (R-53-L), Cognitive Systems Laboratory, University of California, 1985. J. L. Pollock. How to reason defeasibly. Artificial Intelligence, 57:1–42, 1992. H. Prakken. Logical Tools for Modelling Legal Argument. A Study of Defeasible Reasoning in Law. Kluwer Law and Philosophy Library, Dordrecht, 1997. H. Prakken and G. Vreeswijk. Logics for Defeasible Argumentation, volume 4 of Handbook of Philosophical Logic. Kluwer Academic Publishers, Dordrecht, 2 edition, 2002. S. C. Shapiro. Belief revision and truth maintenance systems: An overview and a proposal. Technical Report CSE 98-10, Dept of Computer Science and Engineering, State University of New York at Buffalo, 1998. P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Adaptive Computation and Machine Learning Series. MIT Press, 2nd edition, January 2000. J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B (Methodological), 64(3):479–498, 2002. M. Studen´ . Conditional independence relations have no finite complete characterization. In Transy actions of the 11th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, volume B, pages 377–396, 1991. 339 B ROMBERG AND M ARGARITIS F. Toni and A. C. Kakas. In A. Nerode, editor, 3rd International Conference on Logic Programming and Non-monotonic Reasoning, volume 928 of Lecture Notes in Artificial Intelligence, pages 401–415. Springer Verlag, 1995. 340
4 0.30394208 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes
Author: Jan Ramon, Siegfried Nijssen
Abstract: Algorithms that list graphs such that no two listed graphs are isomorphic, are important building blocks of systems for mining and learning in graphs. Algorithms are already known that solve this problem efficiently for many classes of graphs of restricted topology, such as trees. In this article we introduce the concept of a dense augmentation schema, and introduce an algorithm that can be used to enumerate any class of graphs with polynomial delay, as long as the class of graphs can be described using a monotonic predicate operating on a dense augmentation schema. In practice this means that this is the first enumeration algorithm that can be applied theoretically efficiently in any frequent subgraph mining algorithm, and that this algorithm generalizes to situations beyond the standard frequent subgraph mining setting. Keywords: graph mining, enumeration, monotonic graph classes
5 0.278905 74 jmlr-2009-Properties of Monotonic Effects on Directed Acyclic Graphs
Author: Tyler J. VanderWeele, James M. Robins
Abstract: Various relationships are shown hold between monotonic effects and weak monotonic effects and the monotonicity of certain conditional expectations. Counterexamples are provided to show that the results do not hold under less restrictive conditions. Monotonic effects are furthermore used to relate signed edges on a causal directed acyclic graph to qualitative effect modification. The theory is applied to an example concerning the direct effect of smoking on cardiovascular disease controlling for hypercholesterolemia. Monotonicity assumptions are used to construct a test for whether there is a variable that confounds the relationship between the mediator, hypercholesterolemia, and the outcome, cardiovascular disease. Keywords: Bayesian networks, conditional expectation, covariance, directed acyclic graphs, effect modification, monotonicity
7 0.26718235 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
8 0.24466425 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
9 0.23251337 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
11 0.18204933 90 jmlr-2009-Structure Spaces
12 0.16858464 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations
13 0.16646996 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
14 0.15738299 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models (Machine Learning Open Source Software Paper)
15 0.15575536 15 jmlr-2009-Cautious Collective Classification
16 0.15483859 48 jmlr-2009-Learning Nondeterministic Classifiers
17 0.15223268 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
18 0.13785961 50 jmlr-2009-Learning When Concepts Abound
19 0.13277042 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
20 0.11973923 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
topicId topicWeight
[(8, 0.019), (11, 0.558), (26, 0.019), (38, 0.016), (52, 0.044), (55, 0.035), (58, 0.026), (66, 0.068), (68, 0.013), (90, 0.049), (96, 0.025)]
simIndex simValue paperId paperTitle
1 0.91406208 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
Author: Mathias Drton, Michael Eichler, Thomas S. Richardson
Abstract: In recursive linear models, the multivariate normal joint distribution of all variables exhibits a dependence structure induced by a recursive (or acyclic) system of linear structural equations. These linear models have a long tradition and appear in seemingly unrelated regressions, structural equation modelling, and approaches to causal inference. They are also related to Gaussian graphical models via a classical representation known as a path diagram. Despite the models’ long history, a number of problems remain open. In this paper, we address the problem of computing maximum likelihood estimates in the subclass of ‘bow-free’ recursive linear models. The term ‘bow-free’ refers to the condition that the errors for variables i and j be uncorrelated if variable i occurs in the structural equation for variable j. We introduce a new algorithm, termed Residual Iterative Conditional Fitting (RICF), that can be implemented using only least squares computations. In contrast to existing algorithms, RICF has clear convergence properties and yields exact maximum likelihood estimates after the first iteration whenever the MLE is available in closed form. Keywords: linear regression, maximum likelihood estimation, path diagram, structural equation model, recursive semi-Markov model, residual iterative conditional fitting
same-paper 2 0.87161678 11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification
Author: Raanan Yehezkel, Boaz Lerner
Abstract: We propose the recursive autonomy identification (RAI) algorithm for constraint-based (CB) Bayesian network structure learning. The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. The sequence of operations is performed recursively for each autonomous substructure while simultaneously increasing the order of the CI test. While other CB algorithms d-separate structures and then direct the resulted undirected graph, the RAI algorithm combines the two processes from the outset and along the procedure. By this means and due to structure decomposition, learning a structure using RAI requires a smaller number of CI tests of high orders. This reduces the complexity and run-time of the algorithm and increases the accuracy by diminishing the curse-of-dimensionality. When the RAI algorithm learned structures from databases representing synthetic problems, known networks and natural problems, it demonstrated superiority with respect to computational complexity, run-time, structural correctness and classification accuracy over the PC, Three Phase Dependency Analysis, Optimal Reinsertion, greedy search, Greedy Equivalence Search, Sparse Candidate, and Max-Min Hill-Climbing algorithms. Keywords: Bayesian networks, constraint-based structure learning
3 0.69135642 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We consider a general class of regularization methods which learn a vector of parameters on the basis of linear measurements. It is well known that if the regularizer is a nondecreasing function of the L2 norm, then the learned vector is a linear combination of the input data. This result, known as the representer theorem, lies at the basis of kernel-based methods in machine learning. In this paper, we prove the necessity of the above condition, in the case of differentiable regularizers. We further extend our analysis to regularization methods which learn a matrix, a problem which is motivated by the application to multi-task learning. In this context, we study a more general representer theorem, which holds for a larger class of regularizers. We provide a necessary and sufficient condition characterizing this class of matrix regularizers and we highlight some concrete examples of practical importance. Our analysis uses basic principles from matrix theory, especially the useful notion of matrix nondecreasing functions. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, regularization
4 0.66884124 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
5 0.55154383 54 jmlr-2009-Markov Properties for Linear Causal Models with Correlated Errors (Special Topic on Causality)
Author: Changsung Kang, Jin Tian
Abstract: A linear causal model with correlated errors, represented by a DAG with bi-directed edges, can be tested by the set of conditional independence relations implied by the model. A global Markov property specifies, by the d-separation criterion, the set of all conditional independence relations holding in any model associated with a graph. A local Markov property specifies a much smaller set of conditional independence relations which will imply all other conditional independence relations which hold under the global Markov property. For DAGs with bi-directed edges associated with arbitrary probability distributions, a local Markov property is given in Richardson (2003) which may invoke an exponential number of conditional independencies. In this paper, we show that for a class of linear structural equation models with correlated errors, there is a local Markov property which will invoke only a linear number of conditional independence relations. For general linear models, we provide a local Markov property that often invokes far fewer conditional independencies than that in Richardson (2003). The results have applications in testing linear structural equation models with correlated errors. Keywords: Markov properties, linear causal models, linear structural equation models, graphical models
6 0.49930915 74 jmlr-2009-Properties of Monotonic Effects on Directed Acyclic Graphs
8 0.43103734 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
9 0.36309871 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
10 0.35588285 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
11 0.34756598 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
12 0.34307432 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
13 0.33865011 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
14 0.32647413 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
15 0.3247987 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
16 0.32320383 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
17 0.32279855 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
18 0.3183386 78 jmlr-2009-Refinement of Reproducing Kernels
20 0.31690472 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning