jmlr jmlr2008 jmlr2008-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
Reference: text
sentIndex sentText sentNum sentScore
1 In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. [sent-6, score-1.681]
2 At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. [sent-7, score-1.22]
3 We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. [sent-8, score-0.829]
4 Keywords: Bayesian networks, structure learning, model selection, bounded treewidth 1. [sent-10, score-0.81]
5 With the goal of making predictions or providing probabilistic explanations, it is desirable to learn models that generalize well and at the same time have low inference complexity or a small treewidth (Robertson and Seymour, 1987). [sent-13, score-0.796]
6 exponential in the treewidth bound (e. [sent-20, score-0.797]
7 In the context of general Bayesian networks, Bach and Jordan (2002) propose a local greedy approach that upper bounds the treewidth of the model at each step. [sent-23, score-0.81]
8 This approach, like standard structure search, does not provide guarantees on the performance of the model, but is appealing in its ability to efficiently learn Bayesian networks with an arbitrary treewidth bound. [sent-25, score-0.829]
9 First, even the best of heuristics exhibits some variance in the treewidth estimate (see, for example, Koster et al. [sent-27, score-0.778]
10 , 2001) and thus a single edge modification can result in a jump in the treewidth estimate despite the fact that adding a single edge to the network can increase the true treewidth by at most one. [sent-28, score-1.877]
11 Intuitively, to generalize well, we want to learn bounded treewidth Bayesian networks where structure modifications are globally beneficial (contribute to the score in many regions of the network). [sent-32, score-0.867]
12 In this work we propose a novel approach for efficiently learning Bayesian networks of bounded treewidth that addresses these concerns. [sent-33, score-0.813]
13 Briefly, we use a novel triangulation procedure that is treewidth-friendly: the treewidth of the triangulated graph is guaranteed to increase by at most one when an edge is added to the Bayesian network. [sent-35, score-1.455]
14 Building on the single edge triangulation, we are also able to characterize sets of edges that jointly increase the treewidth of the triangulation by at most one. [sent-36, score-1.3]
15 Finally, we learn a bounded treewidth Bayesian network by iteratively augmenting the model with such chains. [sent-38, score-0.862]
16 At the same time, we are able to guarantee that the bound on the model’s treewidth grows by at most one at each iteration. [sent-40, score-0.797]
17 We evaluate our method on several challenging real-life data sets and show that our method is able to learn richer models that generalize better on test data than a greedy variant for a range of treewidth bounds. [sent-42, score-0.828]
18 Importantly, we show that even when models with unbounded treewidth are learned, by employing global structure modification operators, we are better able to cope with the problem of local maxima in the search and learn models that generalize better. [sent-43, score-0.835]
19 2 Tree-Decompositions and Treewidth The notions of tree-decompositions (or clique trees) and treewidth were introduced by Robertson and Seymour (1987). [sent-91, score-0.838]
20 The treewidth TW(H) of an undirected graph H is the minimum treewidth over all possible tree-decompositions of H. [sent-97, score-1.66]
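To make the definition concrete, here is a small illustrative sketch (ours, not from the paper) that computes the width of a candidate tree-decomposition after verifying the standard covering and running-intersection properties; all function and variable names are our own.

```python
from collections import defaultdict, deque

def decomposition_width(bags, tree_edges, graph_edges):
    """Width of a candidate tree-decomposition of H: largest bag size minus one.

    `bags` maps a bag name to a set of vertices of H, `tree_edges` are the edges
    of the decomposition tree, and `graph_edges` the edges of H. The treewidth
    of H is the minimum of this width over all valid decompositions.
    """
    tree = defaultdict(set)
    for a, b in tree_edges:
        tree[a].add(b)
        tree[b].add(a)
    vertices = {v for e in graph_edges for v in e}
    # (1) every vertex and (2) every edge of H must be covered by some bag
    assert all(any(v in bag for bag in bags.values()) for v in vertices)
    assert all(any({u, w} <= bag for bag in bags.values()) for u, w in graph_edges)
    # (3) for every vertex, the bags containing it must form a connected subtree
    for v in vertices:
        holding = {name for name, bag in bags.items() if v in bag}
        start = next(iter(holding))
        seen, queue = {start}, deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in tree[cur] & holding:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        assert seen == holding, "running intersection property violated"
    return max(len(bag) for bag in bags.values()) - 1

# A path a-b-c admits the decomposition {a,b}--{b,c}, so its treewidth is at most 1.
print(decomposition_width({"B1": {"a", "b"}, "B2": {"b", "c"}},
                          [("B1", "B2")],
                          [("a", "b"), ("b", "c")]))  # -> 1
```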
21 An equivalent notion of treewidth can be phrased in terms of a graph that is a triangulation of H . [sent-98, score-1.118]
22 It can be easily shown (Robertson and Seymour, 1987) that the treewidth of a given triangulated graph is the size of the maximal clique of the graph minus one. [sent-110, score-1.168]
23 The treewidth of an undirected graph H is then equivalently the minimum treewidth over all possible triangulations of H . [sent-111, score-1.676]
24 For the underlying directed acyclic graph of a Bayesian network, the treewidth can be characterized via a triangulation of the moralized graph. [sent-112, score-1.293]
25 4: A moralized graph M of a directed acyclic graph G is an undirected graph that includes an edge (i— j) for every edge (i → j) in G and an edge (p—q) for every pair of edges (p → i), (q → i) in G . [sent-114, score-0.899]
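The moralization step in this definition is mechanical; the following minimal sketch (ours, assuming the DAG is given as parent lists) builds the moral graph accordingly.

```python
def moralize(parents):
    """Moral graph of a DAG given as a mapping: node -> list of its parents.

    Adds an undirected edge i--j for every arc i -> j, and an edge p--q for every
    pair of parents p, q of a common child, as in the definition above.
    """
    nodes = set(parents)
    for ps in parents.values():
        nodes.update(ps)
    moral = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:
            moral[p].add(child)          # drop edge directions
            moral[child].add(p)
        for p in ps:                     # "marry" the parents of each child
            for q in ps:
                if p != q:
                    moral[p].add(q)
    return moral

# v-structure a -> c <- b: moralization adds the edge a--b.
print(moralize({"a": [], "b": [], "c": ["a", "b"]}))
```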
26 The treewidth of a Bayesian network graph G is defined as the treewidth of its moralized graph M , and corresponds to the complexity of inference in the model. [sent-115, score-1.93]
27 It follows that the maximal clique of any moralized triangulation of G is an upper bound on the treewidth of the model, and thus its inference complexity. [sent-116, score-1.32]
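As an illustration of how such an upper bound can be computed, the sketch below runs the standard greedy min-fill elimination on an undirected graph such as the moral graph produced above; this is a generic heuristic stand-in, not the paper's dynamic treewidth-friendly update.

```python
def minfill_treewidth_bound(moral):
    """Upper bound on the treewidth of an undirected graph (e.g., a moral graph).

    Greedy min-fill elimination: repeatedly eliminate the vertex whose
    neighbourhood needs the fewest fill-in edges, connect its neighbours, and
    record the largest clique so created. Returns max clique size minus one.
    """
    g = {v: set(nbrs) for v, nbrs in moral.items()}

    def fill_cost(v):
        nbrs = list(g[v])
        return sum(1 for i, a in enumerate(nbrs)
                   for b in nbrs[i + 1:] if b not in g[a])

    bound = 0
    while g:
        v = min(g, key=fill_cost)
        nbrs = list(g[v])
        bound = max(bound, len(nbrs))    # {v} plus nbrs becomes a clique
        for i, a in enumerate(nbrs):     # add the fill-in edges
            for b in nbrs[i + 1:]:
                g[a].add(b)
                g[b].add(a)
        for a in nbrs:                   # remove v from the graph
            g[a].discard(v)
        del g[v]
    return bound

# Moral graph of the v-structure a -> c <- b is a triangle, so the bound is 2.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(minfill_treewidth_bound(triangle))  # -> 2
```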
28 Learning Bounded Treewidth Bayesian Networks: Overview Our goal is to develop an efficient algorithm for learning Bayesian networks with an arbitrary treewidth bound. [sent-118, score-0.796]
29 Thus, we cannot rely on methods such as that of Bodlaender (1996) to verify the boundedness of our network as they are super-exponential in the treewidth and are practical only for small treewidths. [sent-121, score-0.827]
30 At the heart of our method is the idea of using a dynamically maintained moralized triangulated graph to upper bound the treewidth of the current Bayesian network. [sent-127, score-1.204]
31 That is, our update is guaranteed to increase the size of the maximal clique of the triangulated graph, and hence the treewidth bound, by at most one. [sent-129, score-1.073]
32 That is, these edge sets are guaranteed to increase the treewidth of the triangulated graph by at most one when all edges in the set are added to the Bayesian network structure. [sent-134, score-1.358]
33 Figure 1: The building blocks of our method for learning Bayesian networks of bounded treewidth and how they depend on each other. [sent-138, score-0.859]
34 Appealingly, as each global modification can increase our estimate of the treewidth by at most one, if our bound on the treewidth is K, at least K such chains will be added before we even face the problem of local maxima. [sent-143, score-1.701]
35 In practice, as some chains do not increase the treewidth, many more such chains are added for a given maximum treewidth bound. [sent-144, score-0.967]
36 Once the maximal clique size reaches the treewidth bound K, we continue to add edges greedily until no more edges can be added without increasing the treewidth (Line 16). [sent-149, score-1.938]
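To make the control flow explicit, here is a skeleton (our paraphrase, not the authors' code) of the overall loop: global treewidth-friendly chains are added while the clique bound is below K, then single greedy edges. The callables `best_chain`, `best_single_edge`, `add_to_model`, and `clique_bound` are hypothetical placeholders for the components described in the paper.

```python
def learn_bounded_treewidth_network(variables, K, best_chain, best_single_edge,
                                    add_to_model, clique_bound):
    """Skeleton of the overall learning loop (our paraphrase).

    `best_chain` proposes an optimal treewidth-friendly chain for the current
    model, `best_single_edge` proposes one greedy edge that keeps the bound at
    most K, `add_to_model` applies edges and updates the dynamic moralized
    triangulation, and `clique_bound` reads the current clique bound.
    """
    model = {"variables": list(variables), "edges": set()}
    # Phase 1: global moves -- add treewidth-friendly chains, each of which can
    # raise the clique bound by at most one, while we are below the bound K.
    while clique_bound(model) < K:
        chain = best_chain(model)            # empty if no chain improves the score
        if not chain:
            break
        add_to_model(model, chain)
    # Phase 2: greedy single-edge additions that improve the score without
    # pushing the clique bound past K.
    while True:
        edge = best_single_edge(model, K)    # None if no admissible edge remains
        if edge is None:
            break
        add_to_model(model, [edge])
    return model
```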
37 1: Given a treewidth bound K, Algorithm (1) runs in time polynomial in the number of variables and K. [sent-151, score-0.797]
38 Note that we will show that our method is guaranteed to be polynomial both in the size of the graph and the treewidth bound. [sent-153, score-0.853]
39 Thus, like the greedy thin junction tree approach of Bach and Jordan (2002), it can be used to learn a Bayesian network given an arbitrary treewidth bound. [sent-154, score-1.024]
40 The treewidth of the original graph is one, while the graph augmented with (s → t) has a treewidth of two (maximal clique of size three). [sent-168, score-1.784]
41 2: The treewidth of the output graph M+_(s→t) of Algorithm (2) is greater than the treewidth of the input graph M+ by at most one. [sent-179, score-1.706]
42 It follows that the treewidth of the moralized triangulated graph cannot increase by more than one. [sent-193, score-1.206]
43 As an example, Figure 3 shows a simple case where two successive single-source edge updates increase the treewidth by two while an alternative approach increases the treewidth by only one. [sent-196, score-1.694]
44 This triangulation already includes the edge (v2—v5) and the moralizing edge (v2—v4) and thus is also a valid moralized triangulation after (v2 → v5) has been added, but has a treewidth of only two. [sent-200, score-1.765]
45 This is required to prevent moralization and triangulation edges from interacting in a way that will increase the treewidth by more than one (see Figure 5(f) for an example). [sent-224, score-1.209]
46 The original graph has a treewidth of two, while the graph augmented with (s → t) has treewidth three (maximal clique of size four). [sent-253, score-1.784]
47 7: If M+ is a valid moralized triangulation of the graph G then Algorithm (3) produces a moralized triangulation M+_(s→t) of the graph G_(s→t) ≡ G ∪ (s → t). [sent-258, score-1.046]
48 Having shown that our update produces a valid triangulation, we now prove that our edge update is indeed treewidth-friendly and that it can increase the treewidth of the moralized triangulated graph by at most one. [sent-276, score-1.407]
49 8: The treewidth of the output graph M+_(s→t) of Algorithm (3) is greater than the treewidth of the input graph M+ by at most one. [sent-278, score-1.706]
50 Thus, the only way that the treewidth of the graph can further increase is via the zig-zag edges. [sent-283, score-0.874]
51 In the simple case of two blocks that each contain two nodes (a single edge) and intersect at a single cut-vertex, a zig-zag edge can indeed increase the treewidth by one. [sent-286, score-1.034]
52 In this case, however, there is no within-block triangulation and so the overall treewidth cannot increase by more than one. [sent-287, score-1.064]
53 Multiple Edge Updates In this section we define the notion of a contaminated set, or the subset of nodes that are incident to edges added to M+ in Algorithm (3), and characterize sets of edges that are jointly guaranteed not to increase the treewidth of the triangulated graph by more than one. [sent-316, score-1.544]
54 1: We say that a node v is contaminated by the addition of the edge (s → t) to G if it is incident to an edge added to the moralized triangulated graph M+ by a call to Algorithm (3). [sent-319, score-0.935]
55 Using the separation properties of cut-vertices, one might be tempted to claim that if the contaminated sets of two edges overlap at most by a single cut-vertex then the two edges jointly increase the treewidth by at most one. [sent-323, score-1.17]
56 2: Consider the Bayesian network shown below in (a) and its triangulation (b) after (v1 → v4 ) is added, increasing the treewidth from one to two. [sent-326, score-1.092]
57 Despite the fact that the contaminated sets (solid nodes) of two edge additions overlap only by the cut-vertex v4 , (d) shows that jointly adding the two edges to the graph results in a triangulated graph with a treewidth of three. [sent-328, score-1.471]
58 In (b), (c), (d), and (e) the treewidth is increased by one; in (f) the treewidth does not change. [sent-331, score-1.556]
59 Then the addition of the edges to G does not increase the treewidth of M+ by more than one if one of the following conditions holds: • the contaminated sets of (s → t) and (u → v) are disjoint. [sent-336, score-1.03]
60 • the endpoints of each of the two edges are not in the same block and the block paths between the endpoints of the two edges do not overlap and the contaminated sets of the two edges overlap at a single cut-vertex. [sent-337, score-0.897]
61 It follows that the maximal clique size of M+, and hence the treewidth bound, cannot grow by more than one. [sent-345, score-0.861]
62 3 then adding all edges to G can increase the treewidth bound by at most one. [sent-351, score-0.954]
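As a small illustrative check (ours, not the paper's algorithm), the sketch below tests only the first, disjointness-based condition from the list above for a set of candidate edges; `contaminated_set` is a hypothetical placeholder for the per-edge contamination computed by Algorithm (3), and the second, block-path-based condition is not checked.

```python
def jointly_treewidth_friendly(candidate_edges, contaminated_set):
    """Sufficient check for adding a set of edges while raising the bound by <= 1.

    `contaminated_set(edge)` is a placeholder returning the nodes touched by the
    single-edge update of Algorithm (3) for that edge. Pairwise-disjoint
    contaminated sets correspond to the first condition above.
    """
    sets = [frozenset(contaminated_set(e)) for e in candidate_edges]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if sets[i] & sets[j]:        # any overlap violates the condition
                return False
    return True
```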
63 Furthermore, identifying the nodes that are incident to moralizing edges and edges that are part of the zig-zag block-level triangulation is easy. [sent-356, score-0.801]
64 Learning Optimal Treewidth-Friendly Chains We now want to build on the results of the previous sections to facilitate the addition of global moves that are both optimal in some sense and are guaranteed to increase the treewidth by at most one. [sent-361, score-0.799]
65 Given a treewidth-friendly chain C to be added to a Bayesian network G, we can apply the edge update of Algorithm (3) successively to every edge in C to produce a valid moralized triangulation of G ∪ C. [sent-374, score-0.904]
66 4 ensures that the resulting moralized triangulation will have treewidth at most one greater than the original moralized triangulation M+. [sent-376, score-1.658]
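A minimal sketch (ours) of how a chain is applied: `single_edge_update` stands in for Algorithm (3), `clique_bound` reads the maintained clique bound, and the assertion documents the guarantee quoted above.

```python
def add_chain(triangulation, chain, single_edge_update, clique_bound):
    """Apply the single-edge update to every edge of a treewidth-friendly chain C.

    `single_edge_update` is a placeholder for Algorithm (3) and mutates the
    maintained moralized triangulation in place; `clique_bound` reads its
    largest clique size minus one.
    """
    before = clique_bound(triangulation)
    for (s, t) in chain:
        single_edge_update(triangulation, s, t)
    after = clique_bound(triangulation)
    assert after <= before + 1, "chain was not treewidth-friendly"
    return triangulation
```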
67 With the ability to learn optimal chains with respect to a node ordering, we have completed the description of all the components of our algorithm for learning bounded treewidth Bayesian networks outlined in Algorithm (1). [sent-398, score-1.016]
68 Its efficiency is a direct consequence of our ability to learn treewidth-friendly chains in time that is polynomial both in the number of variables and in the treewidth at each iteration. [sent-399, score-0.859]
69 1: Given a treewidth bound K, Algorithm (1) runs in time polynomial in the number of variables and K. [sent-403, score-0.797]
70 Since the algorithm adds at least one edge per iteration it cannot loop for more than K · N iterations before exceeding a treewidth of K (where N is the number of variables). [sent-408, score-0.895]
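For intuition, a back-of-the-envelope derivation (ours) of the K · N figure, using the standard fact that a graph of treewidth at most K is a subgraph of a K-tree:

```latex
% Our own justification of the K*N iteration bound (not taken from the paper).
% A graph on N vertices with treewidth at most K is a subgraph of a K-tree,
% and a K-tree has exactly KN - K(K+1)/2 edges, hence
\[
  |E| \;\le\; K N - \frac{K(K+1)}{2} \;\le\; K \cdot N .
\]
% Since each iteration adds at least one edge, running for more than K*N
% iterations would force |E| > K*N and therefore a treewidth larger than K.
```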
71 Block-Shortest-Path Ordering In the previous sections we presented an algorithm for learning bounded treewidth Bayesian networks given any topological ordering of the variables. [sent-412, score-0.919]
72 We now consider how to order interchangeable blocks by taking into account that our triangulation following an edge addition (s → t) only involves variables that are in blocks along the unique path between the block containing s and the block containing t and its parents. [sent-423, score-0.862]
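The block path used here can be read off the block-cut tree; the sketch below (ours, using networkx's biconnected-component routines) recovers the blocks along the path between the blocks containing s and t, which is the quantity the block-shortest-path ordering is built from.

```python
import networkx as nx

def block_path(G, s, t):
    """Blocks along the unique path between the block containing s and the block
    containing t, read off the block-cut tree of the undirected graph G.

    Illustrative only: the ordering described above ranks nodes using such
    block-level distances; here we just recover the path for one pair (s, t).
    """
    blocks = [frozenset(b) for b in nx.biconnected_components(G)]
    cut_vertices = set(nx.articulation_points(G))
    bct = nx.Graph()                       # block-cut tree: blocks + cut vertices
    for b in blocks:
        bct.add_node(b)
        for v in b & cut_vertices:
            bct.add_edge(b, ("cut", v))

    def home_block(v):
        return next(b for b in blocks if v in b)

    path = nx.shortest_path(bct, home_block(s), home_block(t))
    return [p for p in path if isinstance(p, frozenset)]

# Two triangles sharing the cut-vertex c: the path from a to e crosses both blocks.
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")])
print(block_path(G, "a", "e"))
```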
73 , 2001), as well as using our treewidth-friendly triangulation, and take the triangulation that results in a lower treewidth. [sent-524, score-1.043]
74 5 As in our method, when the treewidth bound is reached, we continue to add edges that improve the model selection score until no edge can be found that improves the score without also increasing the treewidth bound. [sent-525, score-1.855]
75 Figure 8 shows test log-loss results for the 89 variable gene expression data set as a function of the treewidth bound. [sent-543, score-0.792]
76 Compared are our method (solid blue squares) with the Thin junction tree approach (dashed red circles), and an Aggressive greedy approach of unbounded treewidth that also uses a TABU list and random moves (dotted black). [sent-547, score-0.985]
77 Figure 9: Plot (length of chain and treewidth bound vs. iteration) showing the number of edges (in the learned chain) added during each iteration for a typical run with a treewidth bound of 10 for the 89-variable gene expression data set. [sent-548, score-1.08]
78 The graph also shows our treewidth estimate at the end of each iteration. [sent-549, score-0.853]
79 Importantly, even when the treewidth bound is increased past the saturation point our method surpasses both the thin junction tree approach of Bach and Jordan (2002) and the aggressive search strategy. [sent-555, score-1.02]
80 To qualitatively illustrate the progression of our algorithm from iteration to iteration, we plot the number of edges in the chain (solid blue squares) and treewidth estimate (dashed red) at the end of each iteration. [sent-557, score-1.005]
81 Figure 9 shows a typical run for the 89 variable gene expression data set with treewidth bound set to 10. [sent-558, score-0.811]
82 Our algorithm aggressively adds many edges (making up an optimal chain) per iteration until parts of the network reach the treewidth bound. [sent-559, score-0.946]
83 At that point (iteration 24) the algorithm resorts to adding the single best edge per iteration until no more edges can be added without increasing the treewidth or hurting the score. [sent-560, score-1.073]
84 It is also worth noting that despite their complexity, some chains do not increase the treewidth estimate and for a given treewidth bound K, we typically have more than K iterations (in this example 24 chains are added before reaching the treewidth bound). [sent-562, score-2.542]
85 The number of such iterations is still polynomially bounded as for a Bayesian network with N variables adding more than K · N edges will necessarily result in a treewidth that is greater than K. [sent-563, score-0.98]
86 Observe that our method (solid blue squares) and the greedy thin junction tree approach (dashed red circles) are both approximately linear in the treewidth bound. [sent-566, score-1.024]
87 2 The Traffic and Temperature Data Sets We now compare our method to the mutual-information based LPACJT method for learning bounded treewidth models of Chechetka and Guestrin (2008) (we compare to the better of the variants presented in that work). [sent-575, score-0.795]
88 Furthermore, as their method can only be applied to a small treewidth bound, we also limited our model to a treewidth of two. [sent-586, score-1.556]
89 In fact, Chechetka and Guestrin (2008) show that even a Chow-Liu tree does rather well on these data sets (compare this to the gene expression data set where the aggressive variant was superior even at a treewidth of four). [sent-592, score-0.888]
90 For all of our methods except the unbounded Aggressive, the treewidth bound was set to two. [sent-597, score-0.821]
91 Our small benefit over the greedy thin junction tree approach of Bach and Jordan (2002) when the treewidth bound is non-trivial (>2) grows significantly when we take advantage of the natural variable order. [sent-601, score-1.007]
92 Discussion and Future Work In this work we presented a novel method for learning Bayesian networks of bounded treewidth in time that is polynomial in both the number of variables and the treewidth bound. [sent-606, score-1.591]
93 Our method builds on an edge update algorithm that dynamically maintains a valid moralized triangulation in a way that facilitates the addition of chains that are guaranteed to increase the treewidth by at most one. [sent-607, score-1.469]
94 The graph compares our method (solid blue squares) with the greedy approach (dashed red circles), and an aggressive greedy approach of unbounded treewidth that also uses a TABU list and random moves (dotted black). [sent-611, score-1.022]
95 We showed that by using global structure modification operators, we are able to learn better models than competing methods even when the treewidth of the models learned is not constrained. [sent-614, score-0.811]
96 In addition, unlike the thin junction trees approach of Bach and Jordan (2002), we also provide a guarantee that our estimate of the treewidth bound will not increase by more than one at each iteration. [sent-616, score-0.945]
97 To our knowledge, ours is the first method for efficiently learning bounded treewidth Bayesian networks with structure modifications that are not fully greedy. [sent-618, score-0.828]
98 Karger and Srebro (2001) propose a method that is guaranteed to learn a good approximation of the optimal Markov network given a treewidth bound. [sent-620, score-0.845]
99 Other works study this question but in the context where the true distribution is assumed to have bounded treewidth (e. [sent-630, score-0.795]
100 Finally, we are most interested in exploring whether tools similar to the ones employed in this work could be used to dynamically update the bounded treewidth structure that is the approximating distribution in a variational approximate inference setting. [sent-638, score-0.844]
wordName wordTfidf (topN-words)
[('treewidth', 0.778), ('triangulation', 0.265), ('moralized', 0.175), ('triangulated', 0.157), ('block', 0.145), ('dmin', 0.126), ('edges', 0.119), ('edge', 0.117), ('contaminated', 0.112), ('bayesian', 0.102), ('node', 0.091), ('chain', 0.089), ('ould', 0.085), ('ounded', 0.085), ('reewidth', 0.085), ('ordering', 0.084), ('ot', 0.08), ('graph', 0.075), ('nodes', 0.072), ('lidan', 0.072), ('chechetka', 0.069), ('junction', 0.064), ('thin', 0.063), ('induced', 0.063), ('chains', 0.063), ('clique', 0.06), ('foreach', 0.059), ('path', 0.056), ('contamination', 0.055), ('pat', 0.054), ('triangulate', 0.053), ('etworks', 0.051), ('tree', 0.051), ('network', 0.049), ('incident', 0.049), ('bestscore', 0.048), ('blocks', 0.046), ('cm', 0.045), ('aggressive', 0.045), ('chow', 0.044), ('added', 0.042), ('paths', 0.042), ('bestchain', 0.037), ('guestrin', 0.035), ('update', 0.034), ('greedy', 0.032), ('moralizing', 0.032), ('os', 0.031), ('undirected', 0.029), ('cycle', 0.029), ('earning', 0.028), ('endpoints', 0.028), ('contaminate', 0.027), ('moralization', 0.026), ('ol', 0.026), ('robertson', 0.026), ('bach', 0.025), ('liu', 0.024), ('pai', 0.024), ('unbounded', 0.024), ('maximal', 0.023), ('topologically', 0.022), ('topological', 0.022), ('parent', 0.022), ('containing', 0.021), ('score', 0.021), ('increase', 0.021), ('bsp', 0.021), ('hapmap', 0.021), ('seymour', 0.021), ('overlap', 0.021), ('parents', 0.021), ('tabu', 0.02), ('cycles', 0.019), ('blue', 0.019), ('bound', 0.019), ('traf', 0.018), ('elidan', 0.018), ('augmented', 0.018), ('networks', 0.018), ('karger', 0.018), ('learn', 0.018), ('ordered', 0.018), ('adding', 0.017), ('solid', 0.017), ('bounded', 0.017), ('red', 0.017), ('valid', 0.016), ('bodlaender', 0.016), ('inducednodes', 0.016), ('lpacjt', 0.016), ('treewidths', 0.016), ('triangulations', 0.016), ('jordan', 0.016), ('scoring', 0.016), ('dashed', 0.015), ('structure', 0.015), ('circles', 0.015), ('gene', 0.014), ('dynamic', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
2 0.1285888 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
3 0.088144295 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs (Special Topic on Causality)
Author: Yang-Bo He, Zhi Geng
Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph. Keywords: active learning, causal networks, directed acyclic graphs, intervention, Markov equivalence class, optimal design, structural learning
4 0.080799758 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm improves sensitively accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains on the directed acyclic graphs (DAG) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γ_m^n), where γ_m < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m̃ and sparse structures allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; significance level of the tests controls its sparseness, enabling to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still enables to significantly outperform other heuristic searches. Keywords: subset Bayesian networks, structure learning, optimal search, super-structure, connected
5 0.079656057 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until none subset can be decomposed further. In our approach, search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
7 0.052724641 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
8 0.043848217 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
9 0.043601841 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
10 0.041661803 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
11 0.040829044 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
12 0.036507986 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
13 0.034262247 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
14 0.034146369 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
15 0.033272296 58 jmlr-2008-Max-margin Classification of Data with Absent Features
16 0.032291315 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
17 0.030084837 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
18 0.029801782 9 jmlr-2008-Active Learning by Spherical Subdivision
19 0.029320266 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
20 0.027512111 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
topicId topicWeight
[(0, 0.155), (1, 0.19), (2, 0.012), (3, 0.001), (4, -0.053), (5, 0.079), (6, -0.022), (7, 0.035), (8, 0.013), (9, 0.044), (10, -0.022), (11, -0.032), (12, -0.051), (13, 0.185), (14, -0.042), (15, -0.07), (16, 0.028), (17, 0.017), (18, 0.064), (19, -0.074), (20, -0.079), (21, 0.016), (22, 0.037), (23, 0.008), (24, -0.045), (25, 0.088), (26, 0.037), (27, -0.146), (28, -0.049), (29, -0.135), (30, 0.007), (31, -0.143), (32, 0.076), (33, -0.092), (34, -0.026), (35, 0.044), (36, -0.023), (37, 0.097), (38, 0.201), (39, -0.326), (40, -0.071), (41, 0.112), (42, 0.238), (43, -0.16), (44, -0.173), (45, 0.339), (46, -0.055), (47, -0.206), (48, -0.215), (49, -0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.94184703 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
2 0.38032857 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
3 0.28770947 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm improves sensitively accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains on the directed acyclic graphs (DAG) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γ_m^n), where γ_m < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m̃ and sparse structures allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; significance level of the tests controls its sparseness, enabling to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still enables to significantly outperform other heuristic searches. Keywords: subset Bayesian networks, structure learning, optimal search, super-structure, connected
Author: Shann-Ching Chen, Geoffrey J. Gordon, Robert F. Murphy
Abstract: In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as Markov random fields or factor graphs can represent arbitrary dependences among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models we can use. In this paper, we study a new class of potential functions, which we call decomposable k-way potentials, and provide efficient algorithms for computing messages from these potentials during belief propagation. We believe these new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. We discuss three instances of decomposable potentials: the associative Markov network potential, the nested junction tree, and a new type of potential which we call the voting potential. We use these potentials to classify images of protein subcellular location patterns in groups of cells. Classifying subcellular location patterns can help us answer many important questions in computational biology, including questions about how various treatments affect the synthesis and behavior of proteins and networks of proteins within a cell. Our new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy. Keywords: factor graphs, approximate inference algorithms, structured classification, protein subcellular location patterns, location proteomics
Author: Yang-Bo He, Zhi Geng
Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph. Keywords: active learning, causal networks, directed acyclic graphs, intervention, Markov equivalence class, optimal design, structural learning
6 0.25423458 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
7 0.25409365 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
8 0.23331659 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
9 0.22535227 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
10 0.21056627 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
11 0.18411495 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
12 0.16047965 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
13 0.146127 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
14 0.14340472 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
15 0.13684681 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
16 0.13105172 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
17 0.12581733 9 jmlr-2008-Active Learning by Spherical Subdivision
18 0.12448597 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
19 0.11182962 57 jmlr-2008-Manifold Learning: The Price of Normalization
20 0.10997839 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
topicId topicWeight
[(0, 0.031), (5, 0.015), (9, 0.012), (21, 0.022), (40, 0.02), (54, 0.048), (58, 0.03), (60, 0.011), (61, 0.028), (66, 0.063), (76, 0.091), (87, 0.024), (88, 0.065), (92, 0.039), (93, 0.332), (94, 0.033), (99, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.7295891 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
2 0.69713175 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
Author: Jieping Ye, Shuiwang Ji, Jianhui Chen
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming
3 0.39867997 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
Author: Xing Sun, Andrew B. Nobel
Abstract: Binary matrices, and their associated submatrices of 1s, play a central role in the study of random bipartite graphs and in core data mining problems such as frequent itemset mining (FIM). Motivated by these connections, this paper addresses several statistical questions regarding submatrices of 1s in a random binary matrix with independent Bernoulli entries. We establish a three-point concentration result, and a related probability bound, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms, which search only for submatrices of 1s, cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. 1 Keywords: frequent itemset mining, bipartite graph, biclique, submatrix of 1s, statistical significance
4 0.39255419 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm improves sensitively accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains on the directed acyclic graphs (DAG) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γ_m^n), where γ_m < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m̃ and sparse structures allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; significance level of the tests controls its sparseness, enabling to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still enables to significantly outperform other heuristic searches. Keywords: subset Bayesian networks, structure learning, optimal search, super-structure, connected
5 0.36701533 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
6 0.3606205 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
8 0.34892151 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
9 0.34734792 58 jmlr-2008-Max-margin Classification of Data with Absent Features
10 0.34422103 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
11 0.3408629 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
12 0.3386113 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
13 0.33840713 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
14 0.33142766 83 jmlr-2008-Robust Submodular Observation Selection
15 0.33084244 56 jmlr-2008-Magic Moments for Structured Output Prediction
16 0.32957134 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
17 0.32560256 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
18 0.32529277 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
19 0.32516813 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
20 0.32483211 9 jmlr-2008-Active Learning by Spherical Subdivision