jmlr jmlr2008 jmlr2008-35 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm noticeably improves accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains the directed acyclic graphs (DAGs) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γ_m^n), where γ_m < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m̃, and sparse structures allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders of magnitude and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; the significance level of the tests controls its sparseness, making it possible to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still significantly outperforms other heuristic searches. Keywords: Bayesian networks, structure learning, optimal search, super-structure, connected subset
Reference: text
sentIndex sentText sentNum sentScore
1 However, a recent empirical study has shown that a hybrid algorithm noticeably improves accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains the directed acyclic graphs (DAGs) considered during the search-and-score phase. [sent-11, score-0.402]
2 Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. [sent-12, score-0.403]
3 We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γ_m^n), where γ_m < 2 depends on the maximal degree m of S. [sent-13, score-0.252]
4 Maximizing a score function over the space of DAGs is a promising approach towards learning structure from data. [sent-31, score-0.233]
5 A search strategy called optimal search (OS) has been developed to find the graphs having the highest score (or global optima) in exponential time. [sent-32, score-0.542]
6 The resulting graphs are local optima and their accuracy strongly depends on the heuristic search strategy. [sent-34, score-0.272]
7 In general, given no prior knowledge, the best strategy is still a basic greedy hill climbing search (HC). [sent-35, score-0.262]
8 (2006) proposed to constrain the search space by learning a skeleton using an IT-based technique before proceeding to a restricted search. [sent-37, score-0.242]
9 This is an undirected graph that is assumed to contain the skeleton of the true graph (i. [sent-43, score-0.325]
10 A sound super-structure (that effectively contains the true skeleton) could be provided by prior knowledge or learned from data much more easily (with a higher probability) than the true skeleton itself. [sent-47, score-0.311]
11 Subsequently, we consider the problem of maximizing a score function given a super-structure, and we derive a constrained optimal search, COS, that finds a global optimum over the restricted search space. [sent-48, score-0.343]
12 Not surprisingly, our algorithm is faster than OS since the search space is smaller; more precisely, its computational complexity is proportional to the number of connected subsets of the super-structure. [sent-49, score-0.258]
13 Moreover, for a sound super-structure, learned graphs are more accurate than unconstrained optima: this is because some incorrect edges are forbidden, even if their addition to the graph improves the score. [sent-54, score-0.455]
14 The idea is to use “relaxed” independency testing to obtain an undirected graph that may contain the true skeleton with a high probability, while yet being sparse. [sent-58, score-0.299]
15 In that case, we can consider the significance level of the independency tests, α, as a tool to choose between accuracy (high values return dense but probably sound structures) and speed (low values give sparse but incomplete structures). [sent-59, score-0.236]
16 Interestingly, even for really low significance levels (α ≈ 10^-5), COS+ returns graphs that are more accurate and of a higher score than both MMHC and HC. [sent-68, score-0.384]
17 Usually, algorithms start by learning the skeleton of the graph (by propagating constraints on the neighborhood of each variable) and then edges are oriented to cope with dependencies revealed from data. [sent-88, score-0.319]
18 Moreover, their complexity is polynomial, assuming that the maximal degree of the network, that is, the maximal size of a node's neighborhood, is bounded (Kalisch and Bühlmann, 2007). [sent-91, score-0.253]
19 There exist various implementations using different empirical tricks to improve the score of the results, such as TABU list, restarting, simulated annealing, or searching with different orderings of the variables (Chickering et al. [sent-102, score-0.233]
20 • Then, from a list of possible transformations containing at least addition, withdrawal or reversal of an edge, select and apply the transformation that improves the score most while also ensuring that the graph remains acyclic. [sent-105, score-0.314]
21 • Finally, repeat the previous step until strict improvements to the score can no longer be found. [sent-106, score-0.233]
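For concreteness, here is a minimal sketch of this greedy hill climbing loop in Python. It is not the paper's implementation: `local_score(node, parents)` is a hypothetical decomposable local score (e.g., BIC-like) supplied by the caller, and no TABU list, restarts, or caching tricks are included.

```python
# Minimal hill climbing (HC) sketch: start from the empty graph, repeatedly apply
# the single-edge addition / withdrawal / reversal that most improves the score,
# and stop when no strict improvement remains.  `local_score(node, parents)` is a
# hypothetical decomposable local score supplied by the user.

def is_acyclic(nodes, edges):
    """True if the directed graph given by `edges` (set of (parent, child)) is a DAG."""
    remaining = set(nodes)
    while remaining:
        # nodes with no incoming edge from the remaining ones can be removed
        free = [v for v in remaining
                if not any(c == v and p in remaining for (p, c) in edges)]
        if not free:
            return False
        remaining -= set(free)
    return True

def graph_score(nodes, edges, local_score):
    parents = {v: tuple(sorted(p for (p, c) in edges if c == v)) for v in nodes}
    return sum(local_score(v, parents[v]) for v in nodes)

def hill_climbing(nodes, local_score):
    edges = frozenset()                         # start from the empty graph
    best = graph_score(nodes, edges, local_score)
    improved = True
    while improved:
        improved = False
        candidates = []
        for u in nodes:
            for v in nodes:
                if u == v:
                    continue
                if (u, v) in edges:             # withdrawal and reversal
                    candidates.append(edges - {(u, v)})
                    candidates.append((edges - {(u, v)}) | {(v, u)})
                elif (v, u) not in edges:       # addition
                    candidates.append(edges | {(u, v)})
        for cand in candidates:
            if not is_acyclic(nodes, cand):
                continue
            s = graph_score(nodes, cand, local_score)
            if s > best:                        # keep the best strict improvement
                best, edges, improved = s, cand, True
    return edges, best
```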
22 SC was one of the first to propose a reduction in the search space, thereby noticeably improving the score of the resulting networks without increasing the complexity too much, provided candidate parents are correctly selected. [sent-116, score-0.492]
23 The main scoring functions have been proven to be score equivalent (Chickering, 1995), that is, two equivalent DAGs (representing the same set of independencies among variables) have the same score. [sent-120, score-0.284]
24 Moreover, the authors developed efficient data structures to streamline score evaluations and easily retrieve the evaluation of their operators. [sent-129, score-0.233]
25 Based on this, the authors proposed a strategy to find an optimal graph by selecting the best parent set of a node among the subsets of nodes preceding it. [sent-133, score-0.256]
26 However, in practice, a greedy search that considers adding and withdrawing a parent is applied to select a locally optimal parent set. [sent-137, score-0.238]
27 It first learns an approximation of the skeleton of the true graph by using an IT strategy. [sent-150, score-0.244]
28 It is worth noticing that the skeleton learned in the first phase can differ from that of the final DAG, since not all edges will necessarily be added during the greedy search. [sent-154, score-0.343]
29 More precisely, all independencies entailed in a graph G are summarized by its skeleton and by its v-structures (Pearl, 1988). [sent-180, score-0.295]
30 Although there are distributions P that do not admit a faithful BN (for example the case when parents are connected to a node via a parity or XOR structure), such cases are regarded as “rare” (Meek, 1995), which justifies our hypothesis. [sent-186, score-0.233]
31 To understand how OS finds global optima in O(n·2^n) without having to explicitly check every possible DAG, we must first explain how a score function is defined. [sent-210, score-0.275]
32 Another classical attribute is score equivalence, which means that two equivalent graphs will have the same score. [sent-216, score-0.384]
33 In our study, we will use BIC; therefore our score is local and equivalent, and our task will be to find a DAG over X that maximizes the score given the data D. [sent-218, score-0.466]
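As a reference for what such a local score looks like, here is a hedged sketch of a BIC local score for complete discrete data; the data layout and the `arities` argument are assumptions, not details taken from the paper.

```python
# Hedged sketch of a local BIC score for a discrete variable Xi given a parent set:
# log-likelihood of the observed counts minus (log N / 2) * number of free parameters.
import math
from collections import Counter

def bic_local(data, i, parents, arities):
    """data: list of complete samples (tuples of ints); i: child index;
    parents: tuple of parent indices; arities[j]: number of states of variable j."""
    N = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[i]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    n_params = (arities[i] - 1) * math.prod(arities[p] for p in parents)
    return loglik - 0.5 * math.log(N) * n_params
```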
34 (2004) defined, for every node Xi and every candidate parent set A ⊆ X \ {Xi}: • The best local score on Xi: Fs(Xi, A) = max_{B⊆A} score(Xi, B, D); • The best parent set for Xi: Fp(Xi, A) = argmax_{B⊆A} score(Xi, B, D). [sent-220, score-0.365]
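These two quantities can be computed by brute force over candidate parent sets, as in the small sketch below, written against a generic `score(Xi, B)` callable; the cap on parent-set size used in practice is omitted here.

```python
# Brute-force sketch of Fs and Fp: enumerate every B ⊆ A and keep the best local score.
from itertools import chain, combinations

def subsets(A):
    A = list(A)
    return chain.from_iterable(combinations(A, r) for r in range(len(A) + 1))

def best_local(Xi, A, score):
    """Return (Fs(Xi, A), Fp(Xi, A)) for a local score function `score(Xi, B)`."""
    best_s, best_B = float("-inf"), frozenset()
    for B in subsets(A):
        s = score(Xi, frozenset(B))
        if s > best_s:
            best_s, best_B = s, frozenset(B)
    return best_s, best_B
```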
35 B⊆A From now we omit writing D when referring to the score function. [sent-221, score-0.233]
36 The parents of the last element Xi∗ are necessarily Fp(Xi∗, Bi∗), where Bj = A\{Xj}; thus its local score is Fs(Xi∗, Bi∗). [sent-235, score-0.294]
37 Moreover, the subgraph of G∗ induced when removing Xi∗ must be optimal for Bi∗ ; thus, its score is Ms (Bi∗ ). [sent-236, score-0.233]
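Putting these two observations together gives the recursion behind OS: Ms(A) = max over Xi ∈ A of [Fs(Xi, A\{Xi}) + Ms(A\{Xi})], with Ms(∅) = 0. The memoized sketch below illustrates it, reusing `best_local` from the earlier sketch; it is only an illustration of the recursion, not the paper's memory-optimized algorithm.

```python
# Sketch of the OS dynamic program: try each Xi of A as the "last" (sink) node,
# score it with Fs(Xi, A\{Xi}) and recurse on A\{Xi}.
# Assumes `best_local` from the earlier sketch is in scope.
from functools import lru_cache

def optimal_score(X, score):
    @lru_cache(maxsize=None)
    def Ms(A):
        if not A:
            return 0.0
        best = float("-inf")
        for Xi in A:
            B = A - {Xi}
            Fs_i, _ = best_local(Xi, B, score)
            best = max(best, Fs_i + Ms(B))
        return best
    return Ms(frozenset(X))
```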
38 This limitation is easily justifiable, as graphs having many parents for a node are usually strongly penalized by score functions. [sent-250, score-0.485]
39 Consequently, the total number of scores evaluated is reduced to O(n^(c+1)), which is a decisive improvement since computing a score is costly. [sent-252, score-0.466]
40 Since a parent set requires O(n) space and a graph O(n^2), we derive that the maximal memory usage with recycling is O(n^(3/2)·2^n), while the total memory usage of F in Algorithm 1 was O(n^2·2^n). [sent-260, score-0.302]
41 Such a capacity is valuable for finding which structural similarities are shared by highly probable graphs, particularly when the score criterion used is not equivalent. [sent-264, score-0.266]
42 An undirected graph S = (V, ES ) is said to be a super-structure of a DAG G = (V, EG ), if the skeleton of G, G = (V, EG ) is a subgraph of S (i. [sent-276, score-0.244]
43 In fact, the skeleton of the graph returned by MMHC is not proven to be the same as the learned one; some edges can be missing. [sent-284, score-0.357]
44 Our task is to globally maximize the score function over our reduced search space. [sent-304, score-0.312]
45 Given an undirected graph S = (X, ES), a subset A ⊆ X is said to be connected if A ≠ ∅ and the subgraph of S induced by A, SA, is connected (cf. [sent-312, score-0.261]
46 The subsets C1, ..., Cp are called the maximal connected subsets of A (cf. [sent-318, score-0.262]
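These maximal connected subsets are just the connected components of the subgraph of S induced by A. A short breadth-first-search sketch follows, with S given as a neighbor dictionary (an assumed representation, not the paper's data structure).

```python
# Maximal connected subsets of A in S: the connected components of the subgraph
# of S induced by A, found by breadth-first search over edges that stay inside A.
from collections import deque

def connected_components(A, neighbors):
    """`neighbors[v]` is the set of neighbors of v in the super-structure S."""
    unvisited = set(A)
    components = []
    while unvisited:
        start = unvisited.pop()
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbors[v]:
                if w in unvisited:          # only walk inside A
                    unvisited.remove(w)
                    comp.add(w)
                    queue.append(w)
        components.append(frozenset(comp))
    return components
```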
47 Proof: First, let us consider a subset A ∈ Con(S), its maximal connected subsets C1, ..., Cp (p > 1), and an optimal constrained DAG G∗ = (A, E∗). [sent-327, score-0.243]
48 Since G∗ is constrained by the super-structure, and following the definition of the maximal connected subsets, there cannot be edges in G∗ between any element in Ci and any element in Cj if i ≠ j. [sent-328, score-0.268]
49 Formula (7) directly follows from our previous considerations: maximizing the score over A is equivalent to maximizing it over each Ci independently, since they cannot affect each other. [sent-332, score-0.233]
50 the values of Ms for the maximal connected subsets of Bj are already computed, since these subsets are of smaller size than A. [sent-339, score-0.262]
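A compact sketch of the resulting constrained recursion is given below; it reuses `best_local` and `connected_components` from the earlier sketches and restricts the candidate parents of Xi to its neighborhood in S. This only illustrates the recursion behind Formulae (7)-(8), not the paper's memory-efficient Algorithm 2.

```python
# Constrained optimal search (COS) recursion: on an unconnected set, Ms decomposes
# over its maximal connected subsets; on a connected set, each Xi is tried as the
# sink, allowing only parents inside N(Xi) ∩ A.
# Assumes `best_local` and `connected_components` from the earlier sketches.
from functools import lru_cache

def cos_score(X, neighbors, score):
    @lru_cache(maxsize=None)
    def Ms(A):
        if not A:
            return 0.0
        comps = connected_components(A, neighbors)
        if len(comps) > 1:                                # sum over maximal connected subsets
            return sum(Ms(C) for C in comps)
        best = float("-inf")
        for Xi in A:                                      # try Xi as the last (sink) node
            cand = (A - {Xi}) & frozenset(neighbors[Xi])  # parents restricted by S
            Fs_i, _ = best_local(Xi, cand, score)
            best = max(best, Fs_i + Ms(A - {Xi}))
        return best
    return Ms(frozenset(X))
```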
51 Although set operators are used heavily in our algorithm, such operations can be efficiently implemented and considered of a negligible cost as compared to other operations, such as score calculations. [sent-348, score-0.233]
52 Still, since it depends only linearly on the size of the graphs, F can be computed for graphs of any size if their maximal degree is less than around thirty. [sent-350, score-0.254]
53 Now let us suppose that the forbidden sets were correctly defined and that the visits correctly proceeded until A+: if i = p, by hypothesis we explored every connected superset of the children of A and the search backtracks correctly. [sent-399, score-0.254]
54 For every pair (m, n) of parameters considered, we randomly generated 500 undirected graphs S, biasing their structures towards maximizing the number of connected subsets. [sent-465, score-0.241]
55 Finally, for each pair (m, n), from the maximal number R_{n,m} of connected subsets found, we calculated exp(ln(R_{n,m})/n) in order to search for an exponential behavior, as shown in Figure 5(a). [sent-468, score-0.291]
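In other words, if R_{n,m} grew like γ^n, then exp(ln(R_{n,m})/n) = R_{n,m}^(1/n) would stabilize around γ as n grows; the computation is a one-liner.

```python
import math

def growth_base(R, n):
    """exp(ln(R)/n), i.e. the n-th root of R: the empirical base of exponential growth."""
    return math.exp(math.log(R) / n)
```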
56 The weak point of our strategy is that more graphs should be considered as n increases, since the number of possible graphs also increases. [sent-472, score-0.302]
57 For each pair (n, m̃) considered, we generated 10000 random graphs and averaged the number of connected subsets to obtain R̃_{n,m̃}. [sent-499, score-0.291]
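For small graphs, the number of connected subsets can be counted by brute force, reusing `connected_components` from the earlier sketch; this is only feasible for small n, since the count itself can grow exponentially.

```python
# Brute-force count of Con(S): enumerate every non-empty subset A of the vertices
# and keep those whose induced subgraph is connected.
from itertools import combinations

def count_connected_subsets(nodes, neighbors):
    nodes = list(nodes)
    count = 0
    for r in range(1, len(nodes) + 1):
        for A in combinations(nodes, r):
            if len(connected_components(frozenset(A), neighbors)) == 1:
                count += 1
    return count
```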
58 Since the learning task consists in the maximization of the score function, a natural criterion to evaluate the quality of the learned network is its score. [sent-547, score-0.317]
59 The better the score obtained by an algorithm, the closer its score ratio is to 1. [sent-550, score-0.466]
60 We preferred to use the best score rather than the score of the true network, because the true network is rarely optimal; its score is even strongly penalized if its structure is dense and data sets are small. [sent-551, score-0.745]
61 To avoid bias of this criterion, common routines are shared among algorithms (such as the score function, the structure learning method and the hill climbing search). [sent-555, score-0.349]
62 In our case, we penalize wrongly oriented edges only by half, because we consider that errors in the skeleton are more serious than those in edge orientation. [sent-560, score-0.313]
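A hedged sketch of such a structural error ratio is given below; the exact normalization used in the paper is not fully spelled out in this excerpt, so dividing by the number of edges of the true graph is an assumption.

```python
# Structural error ratio sketch: missing and extra edges count fully, wrongly
# oriented edges count half; normalized here by the number of true edges
# (assumed convention).
def structural_error_ratio(true_edges, learned_edges):
    """Edges are directed (parent, child) tuples."""
    true_skel = {frozenset(e) for e in true_edges}
    learned_skel = {frozenset(e) for e in learned_edges}
    learned_set = set(learned_edges)
    missing = len(true_skel - learned_skel)
    extra = len(learned_skel - true_skel)
    wrong_dir = sum(1 for e in true_edges
                    if frozenset(e) in learned_skel and e not in learned_set)
    return (missing + extra + 0.5 * wrong_dir) / len(true_skel)
```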
63 The search stops as soon as the score cannot be strictly improved anymore. [sent-570, score-0.312]
64 This implies that the results will depend on the ordering of the variables; however, since the graphs considered are randomly generated, their topological ordering is also random, and on average the results found by our search should not be biased. [sent-572, score-0.336]
65 Error and Missing Ratio depending on α (n = 50, m̃ = 2.5). [sent-616, score-0.286]
66 Figure 6: Effects of d and α on the results of Method 2. [sent-618, score-0.263]
67 In the present case our criteria are: the time of execution Time, the ratio of wrong edges (missing and extra) Error, and the ratio of missing edges Miss of the learned skeleton. [sent-620, score-0.256]
68 Here again, these ratios are defined by comparing to the true skeleton and dividing by its total number of edges, nm̃/2. [sent-621, score-0.238]
69 Missing Ratio and Error Ratio of MMPC depending on n (m̃ = 2.5, d = 10000). [sent-651, score-0.442]
70 Missing Ratio and Time of MMPC depending on n. [sent-652, score-0.728]
71 Error Ratio of MMPC depending on m̃ (n = 50, d = 10000). [sent-660, score-0.286]
72 Missing Ratio of MMPC depending on m̃. [sent-686, score-0.442]
73 Time of MMPC depending on m̃. Figure 7: The criteria depending on m̃ and n, for α in [0. [sent-690, score-0.286]
74 However, we know from the previous section that MMPC will probably learn an incomplete super-structure, and that GCOS(α) will be penalized both in its score and its accuracy. [sent-710, score-0.294]
75 This is because the complexity of OS is due to the costly O(n^c) score calculations (here c = O(m̃)) and the O(2^n) steps (b) and (d), which dominate the complexity only for large n. [sent-720, score-0.311]
76 To evaluate the quality of the graphs learned depending on the algorithm used, for n = 12 and m̃ = 2 we compare the SER and the score ratio of GCOS(α) and GCOS(α)+ depending on α with those of GOS and GTCOS in Figure 9. [sent-724, score-0.422]
77 Results: On average, GTCOS has a slightly lower score than GOS (cf. [sent-730, score-0.233]
78 This important result emphasizes again that the optima of a score function with finite data are usually not the true networks: some false edges improve the score of a graph. [sent-734, score-0.63]
79 Score of COS+ in detail (n = 12, m̃ = 2); Score of COS and COS+ depending on α (n = 12, m̃ = 2). [sent-753, score-0.286]
80 Error Ratio of COS and COS+ depending on α (n = 12, m̃ = 2). Figure 9: Score and SER for COS(α) and COS(α)+. The structural constraint should be generalized whenever a sound super-structure is known: by doing so, the resulting graphs, although having a lower score, can be more accurate. [sent-767, score-0.78]
, α → 1), false edges that improve the score are learned faster than the true missing ones. [sent-786, score-0.346]
82 Interestingly, our assumption about the potential interest of a hill climbing starting from GCOS(α) is supported, since GCOS(α) has a low score while being relatively accurate. [sent-790, score-0.349]
83 Score Ratio depending on m̃ (n = 16). [sent-827, score-0.286]
84 Error Ratio depending on m̃ (n = 16). Figure 10: Evolution of Score and SER for every algorithm depending on n and m̃. [sent-833, score-0.286]
85 This is probably due to the fact that MMPC misses many true edges when the structure is dense: the greedy search would add more of the missing true edges than false ones, consequently improving the SER. [sent-846, score-0.362]
86 One efficient strategy to improve both the score and the accuracy of GCOS is to apply a post-processing unconstrained hill climbing search. [sent-849, score-0.349]
87 With such an improvement, both the score and the accuracy of GCOS(α)+ become similar for any α used, despite a relative superiority for higher significance levels. [sent-850, score-0.233]
88 Consequently, it would be interesting to compare COS and COS+ to greedy searches, to see whether they can efficiently learn more accurate graphs than other methods, while being given incomplete super-structures. [sent-851, score-0.248]
89 Time depending on m̃ (n = 14). Figure 11: Score, SER and time of every algorithm depending on n and m̃. [sent-909, score-0.286]
90 Finally, a classic hill climbing search from the empty graph is also considered (HC). [sent-915, score-0.276]
91 To obtain these tables, we ordered algorithms by score for all the 100 triplets (n, m̃, d) (the scores after averaging over the g graphs were used). [sent-922, score-0.384]
92 In other words, the super-structure penalizes the score in comparison with HC; however, starting a greedy search from a constrained optimal graph makes it possible to find better scores than HC. [sent-931, score-0.524]
93 Such parent sets could not be learned by the greedy strategy of HC in any case: for example, in the case of an XOR structure or any configuration for which HC would have to add several parents at the same time to obtain a score improvement. [sent-947, score-0.445]
94 Incidentally, the true skeleton entails 86818458 ≈ 2^26 connected subsets: if it were given as prior knowledge, Algorithm 2 could be applied. [sent-959, score-0.253]
95 SER depending on d (α = 10^-7); SER depending on α (d = 5000). Figure 14: Some detailed results on the ALARM network. [sent-1062, score-0.286]
96 Consequently, more attention should be paid to learning sound super-structures rather than the true skeleton from data. [sent-1072, score-0.273]
97 In addition, current IT methods that learn a skeleton can be used to approximate a super-structure by relaxing the independency testing. [sent-1075, score-0.248]
98 This algorithm enables a balance between speed and accuracy, as it builds a bridge between optimal search and hill climbing search. [sent-1081, score-0.268]
99 With a good set of greedy operations, a constrained optimal graph could be quickly approached using a hill climbing over orderings, without having to calculate M values. [sent-1092, score-0.325]
100 We just need to consider every candidate parent set in the neighborhood of Xi, and, for each of them separately, search for an optimal graph on each connected component of A\{Xi}. [sent-1097, score-0.296]
wordName wordTfidf (topN-words)
[('cos', 0.328), ('ser', 0.262), ('score', 0.233), ('mmpc', 0.221), ('os', 0.211), ('rati', 0.203), ('gcos', 0.187), ('dependi', 0.179), ('hc', 0.17), ('skeleton', 0.163), ('graphs', 0.151), ('errier', 0.146), ('iven', 0.138), ('uper', 0.138), ('co', 0.126), ('mmhc', 0.125), ('etwork', 0.117), ('inding', 0.117), ('tructure', 0.117), ('fb', 0.111), ('sound', 0.11), ('ng', 0.107), ('ptimal', 0.105), ('bayesian', 0.104), ('ott', 0.098), ('connected', 0.09), ('dag', 0.089), ('ml', 0.089), ('ms', 0.084), ('pc', 0.083), ('graph', 0.081), ('search', 0.079), ('edges', 0.075), ('maximal', 0.072), ('xi', 0.072), ('fs', 0.069), ('greedy', 0.067), ('climbing', 0.065), ('gos', 0.065), ('fp', 0.062), ('parents', 0.061), ('tsamardinos', 0.058), ('gmmhc', 0.057), ('ofm', 0.057), ('pit', 0.057), ('independency', 0.055), ('hill', 0.051), ('independencies', 0.051), ('subsets', 0.05), ('forbidden', 0.049), ('ssi', 0.049), ('tco', 0.049), ('tokyo', 0.049), ('pai', 0.047), ('networks', 0.047), ('dags', 0.047), ('parent', 0.046), ('network', 0.046), ('ti', 0.045), ('con', 0.043), ('faithful', 0.042), ('optima', 0.042), ('imoto', 0.041), ('myllym', 0.041), ('recycling', 0.041), ('silander', 0.041), ('structuralerror', 0.041), ('tcos', 0.041), ('cance', 0.041), ('node', 0.04), ('speed', 0.04), ('feasible', 0.04), ('nodes', 0.039), ('complexity', 0.039), ('ordering', 0.038), ('learned', 0.038), ('alarm', 0.037), ('bic', 0.037), ('ki', 0.036), ('practically', 0.036), ('missing', 0.035), ('sub', 0.035), ('unconnected', 0.034), ('formulae', 0.034), ('chickering', 0.034), ('enables', 0.033), ('structural', 0.033), ('execution', 0.033), ('koivisto', 0.033), ('neapolitan', 0.033), ('sensitively', 0.033), ('build', 0.032), ('constrained', 0.031), ('degree', 0.031), ('probably', 0.031), ('memory', 0.031), ('calculate', 0.03), ('learn', 0.03), ('topological', 0.03), ('figures', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm noticeably improves accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains the directed acyclic graphs (DAGs) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γ_m^n), where γ_m < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m̃, and sparse structures allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders of magnitude and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; the significance level of the tests controls its sparseness, making it possible to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still significantly outperforms other heuristic searches. Keywords: Bayesian networks, structure learning, optimal search, super-structure, connected subset
2 0.17975952 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until none subset can be decomposed further. In our approach, search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
3 0.15727217 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
4 0.11255187 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
Author: Jean-Philippe Pellet, André Elisseeff
Abstract: We show how a generic feature-selection algorithm returning strongly relevant variables can be turned into a causal structure-learning algorithm. We prove this under the Faithfulness assumption for the data distribution. In a causal graph, the strongly relevant variables for a node X are its parents, children, and children’s parents (or spouses), also known as the Markov blanket of X. Identifying the spouses leads to the detection of the V-structure patterns and thus to causal orientations. Repeating the task for all variables yields a valid partially oriented causal graph. We first show an efficient way to identify the spouse links. We then perform several experiments in the continuous domain using the Recursive Feature Elimination feature-selection algorithm with Support Vector Regression and empirically verify the intuition of this direct (but computationally expensive) approach. Within the same framework, we then devise a fast and consistent algorithm, Total Conditioning (TC), and a variant, TCbw , with an explicit backward feature-selection heuristics, for Gaussian data. After running a series of comparative experiments on five artificial networks, we argue that Markov blanket algorithms such as TC/TCbw or Grow-Shrink scale better than the reference PC algorithm and provides higher structural accuracy. Keywords: causal structure learning, feature selection, Markov blanket, partial correlation, statistical test of conditional independence
Author: Yang-Bo He, Zhi Geng
Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph. Keywords: active learning, causal networks, directed acyclic graphs, intervention, Markov equivalence class, optimal design, structural learning
6 0.080799758 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
7 0.069440596 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
8 0.067686915 54 jmlr-2008-Learning to Select Features using their Properties
10 0.064130776 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
11 0.060197558 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
12 0.058554024 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
13 0.051188257 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
14 0.049636342 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
15 0.048554238 56 jmlr-2008-Magic Moments for Structured Output Prediction
16 0.046454579 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.046207808 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
18 0.045412809 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
19 0.044797637 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
20 0.043320533 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
topicId topicWeight
[(0, 0.255), (1, 0.263), (2, -0.017), (3, 0.0), (4, -0.028), (5, 0.087), (6, -0.001), (7, 0.031), (8, 0.073), (9, 0.078), (10, -0.081), (11, -0.049), (12, -0.099), (13, 0.142), (14, -0.059), (15, 0.028), (16, 0.023), (17, 0.065), (18, 0.054), (19, 0.069), (20, 0.047), (21, 0.076), (22, -0.056), (23, 0.117), (24, -0.015), (25, 0.065), (26, -0.04), (27, 0.026), (28, -0.049), (29, 0.08), (30, -0.066), (31, -0.135), (32, 0.122), (33, 0.03), (34, 0.022), (35, 0.028), (36, -0.061), (37, -0.007), (38, 0.03), (39, -0.154), (40, -0.096), (41, 0.118), (42, -0.107), (43, -0.047), (44, 0.153), (45, -0.113), (46, 0.083), (47, -0.174), (48, 0.075), (49, 0.286)]
simIndex simValue paperId paperTitle
same-paper 1 0.95045948 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm noticeably improves accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains the directed acyclic graphs (DAGs) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γ_m^n), where γ_m < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m̃, and sparse structures allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders of magnitude and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; the significance level of the tests controls its sparseness, making it possible to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still significantly outperforms other heuristic searches. Keywords: Bayesian networks, structure learning, optimal search, super-structure, connected subset
2 0.57918417 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
Author: Jean-Philippe Pellet, André Elisseeff
Abstract: We show how a generic feature-selection algorithm returning strongly relevant variables can be turned into a causal structure-learning algorithm. We prove this under the Faithfulness assumption for the data distribution. In a causal graph, the strongly relevant variables for a node X are its parents, children, and children’s parents (or spouses), also known as the Markov blanket of X. Identifying the spouses leads to the detection of the V-structure patterns and thus to causal orientations. Repeating the task for all variables yields a valid partially oriented causal graph. We first show an efficient way to identify the spouse links. We then perform several experiments in the continuous domain using the Recursive Feature Elimination feature-selection algorithm with Support Vector Regression and empirically verify the intuition of this direct (but computationally expensive) approach. Within the same framework, we then devise a fast and consistent algorithm, Total Conditioning (TC), and a variant, TCbw , with an explicit backward feature-selection heuristics, for Gaussian data. After running a series of comparative experiments on five artificial networks, we argue that Markov blanket algorithms such as TC/TCbw or Grow-Shrink scale better than the reference PC algorithm and provides higher structural accuracy. Keywords: causal structure learning, feature selection, Markov blanket, partial correlation, statistical test of conditional independence
3 0.57505047 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until none subset can be decomposed further. In our approach, search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
4 0.52125549 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
Author: Shann-Ching Chen, Geoffrey J. Gordon, Robert F. Murphy
Abstract: In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as Markov random fields or factor graphs can represent arbitrary dependences among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models we can use. In this paper, we study a new class of potential functions, which we call decomposable k-way potentials, and provide efficient algorithms for computing messages from these potentials during belief propagation. We believe these new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. We discuss three instances of decomposable potentials: the associative Markov network potential, the nested junction tree, and a new type of potential which we call the voting potential. We use these potentials to classify images of protein subcellular location patterns in groups of cells. Classifying subcellular location patterns can help us answer many important questions in computational biology, including questions about how various treatments affect the synthesis and behavior of proteins and networks of proteins within a cell. Our new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy. Keywords: factor graphs, approximate inference algorithms, structured classification, protein subcellular location patterns, location proteomics
6 0.38755387 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
7 0.3766318 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
8 0.35320026 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
9 0.30290371 54 jmlr-2008-Learning to Select Features using their Properties
10 0.29209748 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs (Special Topic on Causality)
11 0.28999484 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
12 0.28497016 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
13 0.27743298 56 jmlr-2008-Magic Moments for Structured Output Prediction
14 0.26820511 58 jmlr-2008-Max-margin Classification of Data with Absent Features
15 0.25652871 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
16 0.23741794 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
17 0.22880231 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
18 0.22391616 80 jmlr-2008-Ranking Individuals by Group Comparisons
19 0.21579792 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
20 0.20905942 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
topicId topicWeight
[(0, 0.027), (5, 0.016), (31, 0.021), (40, 0.035), (54, 0.024), (58, 0.038), (61, 0.031), (66, 0.043), (76, 0.502), (88, 0.056), (92, 0.03), (94, 0.046), (99, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.92075318 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm noticeably improves accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains the directed acyclic graphs (DAGs) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γ_m^n), where γ_m < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m̃, and sparse structures allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders of magnitude and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; the significance level of the tests controls its sparseness, making it possible to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still significantly outperforms other heuristic searches. Keywords: Bayesian networks, structure learning, optimal search, super-structure, connected subset
2 0.87384236 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
Author: Xing Sun, Andrew B. Nobel
Abstract: Binary matrices, and their associated submatrices of 1s, play a central role in the study of random bipartite graphs and in core data mining problems such as frequent itemset mining (FIM). Motivated by these connections, this paper addresses several statistical questions regarding submatrices of 1s in a random binary matrix with independent Bernoulli entries. We establish a three-point concentration result, and a related probability bound, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms, which search only for submatrices of 1s, cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. 1 Keywords: frequent itemset mining, bipartite graph, biclique, submatrix of 1s, statistical significance
3 0.85859925 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
4 0.55479801 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
5 0.4572475 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
6 0.42322356 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
7 0.40937111 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
8 0.40841019 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
9 0.4080933 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
10 0.40158564 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
11 0.39765134 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
13 0.36260426 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
14 0.35940397 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
15 0.3575128 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
16 0.35622308 22 jmlr-2008-Closed Sets for Labeled Data
17 0.35357362 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
18 0.35082752 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.3467041 83 jmlr-2008-Robust Submodular Observation Selection
20 0.34497494 58 jmlr-2008-Max-margin Classification of Data with Absent Features