jmlr jmlr2010 jmlr2010-88 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse superstructures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph. Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm. Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance. Keywords: Bayesian networks, structure learning, constrained optimal search
Reference: text
sentIndex sentText sentNum sentScore
1 To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. [sent-15, score-0.175]
2 Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. [sent-16, score-0.146]
3 Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural Hamming distance. [sent-19, score-0.193]
4 Introduction. Although structure learning is a fundamental task for building Bayesian networks (BNs), when minimizing a score function, the computational complexity often prevents us from finding optimal BN structures (Perrier et al. [sent-21, score-0.206]
5 , 2005) and a decomposable score like BDeu, the computational complexity remains exponential, and therefore, such algorithms are intractable for BNs with more than around 30 vertices given current computational capacity. [sent-27, score-0.232]
6 (2008) proposed an algorithm that can learn an optimal BN when an undirected graph is given as a structural constraint. [sent-39, score-0.139]
7 (2008) defined this undirected graph as a super-structure; the skeleton of every graph considered is compelled to be a subgraph of the super-structure. [sent-41, score-0.307]
8 This algorithm can learn optimal BNs containing up to 50 vertices when the average degree of the super-structure is around two, that is, a sparse structural constraint is assumed. [sent-42, score-0.182]
9 (1999) suggested that when the structural constraint is a directed graph (in the case of SC), an optimal search can be carried out on the cluster tree extracted from the constraint. [sent-45, score-0.385]
10 (1999) requires a directed graph-based constraint as input and the extraction of a cluster tree. [sent-47, score-0.222]
11 For the latter, a large cluster might be generated, preventing an optimal search from being carried out. [sent-48, score-0.264]
12 Thus, for the estimation of the best BN with more than a hundred vertices and sufficient data samples, the initial network may contain hundreds of small cycles, and it is impossible to check these cycles in the search process. [sent-52, score-0.16]
13 In addition, we use a different cluster decomposition that enables us to consider more complex cases. [sent-56, score-0.168]
14 If the super-structure is divided into clusters of moderate size (around 30 vertices), a constrained optimal search can be applied on each cluster. [sent-61, score-0.206]
15 Then, to find a globally optimal graph, one could consider all patterns of directions for the edges between clusters and apply a constrained optimal search on each cluster for every pattern of directions independently and return the best result found. [sent-62, score-0.54]
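The brute-force idea in the previous sentence can be sketched as follows. This is a minimal illustration only, not the paper's algorithm: `cos_on_cluster(cluster, directions)` is a hypothetical routine that runs a constrained optimal search on one cluster under a fixed orientation of the inter-cluster edges, and the sketch deliberately ignores the cross-cluster cycles that the ancestral constraints introduced next are designed to rule out.

```python
from itertools import product

def brute_force_over_directions(clusters, separating_edges, cos_on_cluster):
    """Enumerate every orientation of the edges in E^- (the separating edges),
    solve each cluster independently under that orientation, and keep the best
    total score.  Cycles spanning several clusters are NOT checked here; that is
    exactly the gap that ancestral constraints (NACSs) close in the paper."""
    best_score, best_networks = float("inf"), None
    # Each separating edge {u, v} can be oriented u -> v or v -> u.
    for orientation in product((0, 1), repeat=len(separating_edges)):
        directions = {}
        for (u, v), flip in zip(separating_edges, orientation):
            directions[frozenset((u, v))] = (v, u) if flip else (u, v)
        total, networks = 0.0, []
        for cluster in clusters:
            score, network = cos_on_cluster(cluster, directions)
            total += score
            networks.append(network)
        if total < best_score:  # smaller (negative BDeu) score is better
            best_score, best_networks = total, networks
    return best_score, best_networks
```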
16 We formalize this idea by introducing ancestral constraints; further, we derive the necessary and sufficient ancestral constraints that must be considered to find an optimal network, and we introduce a pruning technique to skip superfluous cases. [sent-63, score-0.248]
17 (1999) suggested the possibility of optimally searching acyclic subgraphs of a digraph constructed by connecting each vertex and its pre-selected candidate parents with directed edges without checking acyclicity. [sent-106, score-0.239]
18 An algorithm would proceed by converting the digraph into a cluster tree, where clusters are densely connected subgraphs. [sent-108, score-0.247]
19 Then, it would perform an optimal search on each cluster for every ordering of the vertices contained in the separators of clusters. [sent-109, score-0.428]
20 However, due to the difficulty of building a minimal cluster tree, large clusters can make the search impractical. [sent-110, score-0.299]
21 This algorithm constructs an initial directed graph by linking every vertex to its optimal parents although this might create directed cycles. [sent-114, score-0.271]
22 Then, it searches every possible case in which the direction of one edge in each directed cycle is constrained to preserve acyclicity, and it iteratively finds optimal parents under these constraints until DAGs are obtained. [sent-115, score-0.257]
23 In addition, for score functions decomposable into penalization and fitting components, optimal parents under the constraints can be computed more efficiently using a branch-and-bound technique that was originally proposed by Suzuki et al. [sent-117, score-0.246]
24 When the sample size is small, few directed cycles occur in the initial directed graph and updated graphs because information criteria tend to select a smaller parent set for each vertex in small sample data (Dojer, 2006). [sent-120, score-0.185]
25 Under the assumption that the skeleton is separated into small subgraphs, we first define ancestral constraints for each subgraph and consider an algorithm that learns an optimal BN on a subgraph under a given set of ancestral constraints. [sent-127, score-0.443]
26 We then explain the procedures that efficiently build an optimal BN on the skeleton by combining the optimal BNs computed on each subgraph under every ancestral constraint that needs to be considered. [sent-128, score-0.348]
27 1 Ancestrally Constrained Optimal Search for a Cluster. Hereafter, we assume that we are given a set of edges E^- ⊂ E such that the undirected graph G^- = (V, E \ E^-) is not connected. [sent-130, score-0.137]
28 The edges of E − are dashed and they define a cluster C (gray). [sent-132, score-0.241]
29 Definition 1 (cluster and cluster edges) Let C = (VC , EC ) be a maximal connected component of G− . [sent-134, score-0.168]
30 We refer to C as a cluster, and we call E^C ⊂ E^-, the set containing all and only the edges of E^- incident to a vertex in V_C, the set of cluster edges for C. [sent-135, score-0.514]
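As an illustrative sketch of Definition 1 (not the paper's code), assume the super-structure is given as a vertex set and a set of undirected edges represented as frozensets, with `removed_edges` playing the role of E^-:

```python
def clusters_and_cluster_edges(vertices, edges, removed_edges):
    """Return the maximal connected components of G^- = (V, E \\ E^-) together
    with, for each component C, the set E^C of edges of E^- incident to C."""
    remaining = set(edges) - set(removed_edges)
    adjacency = {v: set() for v in vertices}
    for e in remaining:
        u, v = tuple(e)
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # plain depth-first search over G^-
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        # E^C: all and only the removed edges touching this component.
        cluster_edges = {e for e in removed_edges if e & component}
        components.append((component, cluster_edges))
    return components
```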
31 In the following sections, we show that by successively considering every possible δ on E^- and learning the optimal BN on each cluster independently, we can reconstruct N_G*. [sent-138, score-0.261]
32 Definition 3 (in-vertex and out-vertex) Considering a cluster C and a TDM δ on E^C, we define V^in_{C,δ} and V^out_{C,δ} as V^in_{C,δ} = {v ∈ V \ V_C | ∃v_a ∈ V_C, ({v, v_a}, →) ∈ δ} and V^out_{C,δ} = {v ∈ V_C | ∃v_a ∈ V \ V_C, ({v, v_a}, →) ∈ δ}, respectively. [sent-140, score-0.27]
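Representing a TDM δ on the cluster edges as a map from each undirected cluster edge to its chosen direction (parent, child), the in- and out-vertices of Definition 3 can be read off directly; a small illustrative helper:

```python
def in_and_out_vertices(cluster_vertices, tdm):
    """tdm maps frozenset({u, v}) -> (parent, child) for every cluster edge.
    V^in: vertices outside the cluster that point into it under delta.
    V^out: vertices inside the cluster that point out of it under delta."""
    v_in, v_out = set(), set()
    for parent, child in tdm.values():
        if parent not in cluster_vertices and child in cluster_vertices:
            v_in.add(parent)
        elif parent in cluster_vertices and child not in cluster_vertices:
            v_out.add(parent)
    return v_in, v_out
```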
33 We say that A is a nested ancestral constraint set (NACS) if and only if for any v_a and v_b in V^in_{C,δ}, A(v_a) ⊆ A(v_b) or A(v_a) ⊇ A(v_b) holds. [sent-147, score-0.205]
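The nestedness condition is a direct pairwise inclusion check; a sketch, assuming an ACS is stored as a dict mapping each in-vertex to its set A(v) of out-vertices:

```python
def is_nacs(ancestral_constraints):
    """Return True iff every pair of constraint sets is nested, i.e. one of the
    two sets contains the other (the NACS condition above)."""
    values = list(ancestral_constraints.values())
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if not (a <= b or b <= a):
                return False
    return True
```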
34 Theorem 1 There exist δ* on E^- and NACSs A_i* for every cluster C_i of G^- coherent with a globally optimal BN N_G*. [sent-157, score-0.261]
35 From δ*, we define for every cluster C_i the unique TDM δ_i* and the ACS A_i* such that ∀v ∈ V^in_{C_i,δ_i*}, A_i*(v) = P_{π*}(v) ∩ V^out_{C_i,δ_i*}. [sent-161, score-0.217]
36 A_i* are definitely NACSs since for every v_a and v_b ∈ V^in_{C_i,δ_i*}, if v_a ≺_{π*} v_b, then by definition, A_i*(v_a) ⊆ A_i*(v_b). [sent-162, score-0.205]
37 Furthermore, each N_i* on V_i = V_{C_i} ∪ V^in_{C_i,δ_i*} is an optimal graph on V_{C_i} ∪ V^in_{C_i,δ_i*} given the constraints of δ_i* and A_i* (otherwise, we could build a DAG having a lower score than N_G*). [sent-164, score-0.203]
38 Therefore, if we independently compute an optimal BN for every cluster, for every TDM and every NACS, and return the best combination, we can build a globally optimal BN on V . [sent-165, score-0.235]
39 From Theorem 1, the ACSs to be considered are limited to NACSs only, and N_G* can be obtained by searching for the best combination of the optimal BNs obtained separately on every cluster for every NACS and every TDM. [sent-166, score-0.359]
40 First, an optimal BN and its score on every cluster for every NACS and every TDM are computed. [sent-168, score-0.482]
41 Then, by using this information, an optimal BN and its score on a cluster obtained by merging two clusters are computed for every NACS and every TDM. [sent-169, score-0.512]
42 After the repeated computation of optimal BNs and scores on merged clusters, an optimal BN and its score on a single cluster covering the super-structure are finally obtained. [sent-170, score-0.455]
43 The fundamental step involves learning an optimal BN on a cluster C for a given δ and A ; we call this algorithm ancestrally constrained optimal search (ACOS). [sent-172, score-0.365]
44 F_p(v, X \ {u*}) otherwise. Note that the in-vertices are treated differently in our algorithm; they either have no parents or fixed parents, and can only be parents of a few vertices in V_C, depending on A. [sent-190, score-0.186]
45 Thus, although the DAGs considered in the following algorithm are optimal on X ∪ V^in_{C,δ}, the score of v ∈ V^in_{C,δ} (which is fixed depending on δ) is not counted in ŝ(X) since it is the same for every DAG irrespective of whether it is optimal or not. [sent-191, score-0.26]
46 We define for every X ⊆ V_C the associate set PC(X) = X ∪ {v ∈ V^in_{C,δ} | A(v) ⊆ X}; in other words, for a given topological ordering π over X, the possible parents in X ∪ V^in_{C,δ} of w ∈ X are PC(P_π(w)) ∩ N(w). [sent-195, score-0.168]
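The associate set PC(X) simply augments X with every in-vertex whose ancestral constraint is already contained in X; a one-line illustrative sketch, with `ancestral_constraints` mapping each in-vertex to its set A(v):

```python
def associate_set(X, ancestral_constraints):
    """PC(X) = X ∪ { v in V^in_{C,delta} : A(v) ⊆ X }."""
    return set(X) | {v for v, anc in ancestral_constraints.items() if anc <= set(X)}
```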
47 Construct N ∗ , an optimal A -linear BN over VC , using Fp and ℓ, and return N ∗ and its score s(VC ). [sent-202, score-0.167]
48 Consequently, after adding the structural constraint G, the best score for v is Fs (v, PC(X \ {v}) ∩ N (v)). [sent-217, score-0.179]
49 Finally, since v cannot be a parent of any nodes in X \ {v}, the best score over this set is s(X \ {v}). [sent-218, score-0.156]
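The two preceding sentences describe a sink-based recursion over vertex subsets; the schematic below is only a sketch of that kind of dynamic program, not Algorithm 2 itself. `F_s` is the assumed best-local-score function, `neighbors` encodes the structural constraint N(·), and `associate_set` is a one-argument callable implementing PC(·) (e.g., the earlier sketch with the constraints bound); parent-set reconstruction and pruning are omitted.

```python
from functools import lru_cache

def best_subset_score(cluster_vertices, neighbors, F_s, associate_set):
    """s_hat(X): best achievable score over X, choosing one vertex v of X as a
    sink whose parents are restricted to PC(X \\ {v}) ∩ N(v)."""

    @lru_cache(maxsize=None)
    def s_hat(X):  # X is a frozenset of cluster vertices
        if not X:
            return 0.0
        best = float("inf")
        for v in X:
            rest = X - {v}
            candidates = associate_set(rest) & neighbors[v]
            best = min(best, F_s(v, frozenset(candidates)) + s_hat(rest))
        return best

    return s_hat(frozenset(cluster_vertices))
```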
50 ) (because all NACSs can be generated through orderings of V^in_{C,δ} ∪ V^out_{C,δ}), it is experimentally worse than exponential in the number of cluster edges. [sent-225, score-0.168]
51 For the cluster C and TDM δ shown in Figure 1, Figure 3 shows an optimal BN N under NACS A = {(v1 , u1 )}. [sent-227, score-0.212]
52 For a given C and δ, we define a score-and-network (SN) map S_{C,δ} as a list containing the pairs of optimal scores and networks generated by every NACS that is not pruned. [sent-236, score-0.168]
53 (b) Otherwise, learn N ∗ , an optimal Ai -linear DAG of score s∗ , using Algorithm 2. [sent-242, score-0.167]
54 The algorithm given below builds a set of SN maps SC for the merged cluster C out of SC1 and SC2 (with VC = VC1 ∪VC2 ). [sent-252, score-0.208]
55 Algorithm 4: MergeCluster Input: Clusters C1 and C2 and sets of SN maps SC1 and SC2 Output: Merged cluster C and set of SN maps SC 1. [sent-253, score-0.168]
56 If there exists an optimal A -linear network in SC,δ that has a score smaller than s∗ , restart step i with the next pair. [sent-258, score-0.167]
57 The next theorem shows that SC,δ contains an optimal BN and its score for every NACS on C. [sent-267, score-0.216]
58 Theorem 4 If for every pair of TDMs δ1 and δ2 and for every NACS, SC1 ,δ1 and SC2 ,δ2 contain pairs of an optimal BN and its score, then SC,δ constructed by Algorithm 4 contains a pair of an optimal BN and its score for every NACS. [sent-268, score-0.358]
59 Proof First, we show that for every NACS A over C, we can build an optimal A -linear BN by merging two optimal networks on C1 and C2 for some NACSs A1 and A2 . [sent-269, score-0.176]
60 To do so, for a given TDM δ, let us consider a NACS A for C, an optimal A -linear BN N ∗ of score s∗ and one of its topological orderings π∗ defined over VC (that is also in agreement with δ and A ). [sent-270, score-0.201]
61 Given an optimal A_i-linear network N_i* of S_{C_i,δ_i} and its score s_i*, let us consider the graph N′ = N_1* ∪ N_2*. [sent-275, score-0.203]
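As a small illustrative helper for this step of the argument (assuming DAGs are stored as parent maps), forming the union N′ of two networks on overlapping vertex sets and checking that it is still acyclic can be done with Kahn's topological sort:

```python
from collections import deque

def union_is_acyclic(dag1, dag2):
    """dag1, dag2: dicts mapping each vertex to the set of its parents.
    Build N' = dag1 ∪ dag2 and report whether N' is still a DAG."""
    union = {}
    for dag in (dag1, dag2):
        for child, parents in dag.items():
            union.setdefault(child, set()).update(parents)
            for p in parents:
                union.setdefault(p, set())
    # Kahn's algorithm: repeatedly remove vertices with no remaining parents.
    remaining = {v: set(ps) for v, ps in union.items()}
    children = {v: set() for v in remaining}
    for child, parents in remaining.items():
        for p in parents:
            children[p].add(child)
    queue = deque(v for v, ps in remaining.items() if not ps)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for c in children[v]:
            remaining[c].discard(v)
            if not remaining[c]:
                queue.append(c)
    return removed == len(union), union
```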
62 Algorithm 5: ClusterAssembler Input: Set of all clusters C Output: Optimal BN N_G* and its score s* 1. [sent-281, score-0.202]
63 (b) Compute the cluster C and SC by merging C1 and C2 using Algorithm 4. [sent-285, score-0.168]
64 In (a), although we do not prove that merging, at each step, the pair of clusters that implies the fewest cluster edges for the merged cluster minimizes the complexity, we decided to use this heuristic. [sent-293, score-0.528]
65 This is because the complexity depends on the number of cluster edges in Algorithm 4; therefore, it is faster to always manipulate a cluster with a small number of cluster edges. [sent-294, score-0.577]
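The merge-order heuristic described in the two sentences above can be sketched as follows; this is an assumption-laden illustration of the pair-selection rule only, not the paper's Algorithm 5:

```python
from itertools import combinations

def pick_pair_to_merge(clusters, removed_edges):
    """clusters: list of vertex sets; removed_edges: E^- as frozensets.
    Return the index pair whose merged cluster would have the fewest cluster
    edges (edges of E^- with exactly one endpoint inside the merged set)."""
    def n_cluster_edges(vertex_set):
        return sum(1 for e in removed_edges if len(e & vertex_set) == 1)

    best_pair, best_count = None, float("inf")
    for c1, c2 in combinations(range(len(clusters)), 2):
        merged = clusters[c1] | clusters[c2]
        count = n_cluster_edges(merged)
        if count < best_count:
            best_pair, best_count = (c1, c2), count
    return best_pair
```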
66 A block is a biconnected subgraph of the undirected graph, and vertices in the intersection of blocks are called cut-vertices; that is, the removal of cut-vertices separates blocks. [sent-301, score-0.178]
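Blocks and cut-vertices are standard graph notions; as an illustration (assuming the super-structure is available as an undirected networkx graph, which is not how the paper implements it), they can be obtained directly:

```python
import networkx as nx

def blocks_and_cut_vertices(super_structure: nx.Graph):
    """Return the biconnected components (blocks) and articulation points
    (cut-vertices) of an undirected super-structure."""
    blocks = [set(block) for block in nx.biconnected_components(super_structure)]
    cut_vertices = set(nx.articulation_points(super_structure))
    return blocks, cut_vertices
```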
67 Then, all edges (v, w), where w ∈ V_C, are considered as cluster edges; because only v is connected to cluster edges, no cycle can be created while merging an optimal DAG N_C* over V_C with another. [sent-305, score-0.453]
68 For every TDM δ, we learn an optimal DAG N_δ* and its score s* over V_C. [sent-307, score-0.216]
69 For example, if a candidate parent set X is set to {a_1, a_2} in Figure 4, the unique TDM δ_X for cluster edges (v, a_1), (v, a_2), (v, a_3), and (v, a_4) is indicated by arrows. [sent-312, score-0.274]
70 2 Partitioning the Super-structure into Clusters. To apply our algorithm, we need to select a set of edges E^- that separates the super-structure into small strongly connected subgraphs (clusters) having balanced numbers of vertices while minimizing the number of cluster edges for each cluster. [sent-318, score-0.449]
71 Algorithm 6: EdgeConstrainedOptimalSearch (ECOS) Input: Super-structure G = (V, E) and data D Output: Optimal constrained BN N_G* and its score s* 1. [sent-324, score-0.154]
72 Merge all clusters using Algorithm 5 to obtain N_G* and its score s*. [sent-333, score-0.202]
73 Given E^-, we define n_2 = max_{C∈C} |V_C|, the size of the largest cluster, and k = max_{C∈C} |E^C|, the largest number of cluster edges. [sent-348, score-0.168]
74 Finally, at worst, step 5 involves trying every pair of entries in two SN map sets; with K the maximum number of cluster edges of the merged clusters, the complexity might theoretically be as bad as O(q(K! [sent-355, score-0.409]
75 • Replace every vertex v_i by a small graph C_i (containing up to 20 vertices, or slightly more) and randomly connect all edges connected to v_i in G_0 to vertices in C_i; this generation step is sketched below. [sent-361, score-0.272]
76 If ECOS can select all edges between clusters for such networks while defining E − , the search should finish in reasonable time even for larger networks (up to several hundreds of vertices). [sent-362, score-0.282]
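A minimal sketch of the network-generation step described above, assuming each original vertex of G_0 is replaced by a pre-built small graph with globally unique vertex names; the function name, the representation, and the use of a fixed random seed are illustrative choices, not the exact generator used in the experiments.

```python
import random

def expand_super_structure(g0_edges, small_graphs, seed=0):
    """g0_edges: iterable of (i, j) pairs over the vertices of G_0.
    small_graphs: dict mapping each vertex i of G_0 to (vertices_i, edges_i).
    Returns the vertex and edge sets of the expanded super-structure."""
    rng = random.Random(seed)
    vertices, edges = set(), set()
    for verts, inner_edges in small_graphs.values():
        vertices |= set(verts)
        edges |= {frozenset(e) for e in inner_edges}
    # Reattach every edge of G_0 to randomly chosen vertices of the two small graphs.
    for i, j in g0_edges:
        u = rng.choice(list(small_graphs[i][0]))
        w = rng.choice(list(small_graphs[j][0]))
        edges.add(frozenset((u, w)))
    return vertices, edges
```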
77 Further, for every average degree m̃, we evaluated the maximal number of vertices n_max(m̃) feasible from the value of δ_m̃ calculated as proposed in Perrier et al. [sent-411, score-0.199]
78 The super-structures were generated in two different ways: the true skeleton was given or a skeleton was inferred by using MMPC (Tsamardinos et al. [sent-429, score-0.272]
79 For every experimental condition, the algorithms are compared both in terms of score (we used the negative BDeu score, that is, smaller values are better) and structural Hamming distance (SHD, Tsamardinos et al. [sent-676, score-0.203]
80 Figures 5 and 6 respectively show BDeu and SHD scores of ECOS, MMHC, and HC for ten data sets of Alarm10 with 10,000 samples given the true skeleton and super-structures inferred by MMPC with α = 0. [sent-678, score-0.212]
81 For the true skeleton and super-structures obtained with α = 0. [sent-684, score-0.151]
82 With regard to the structural constraint, the best results were obtained when the true skeleton was known. [sent-692, score-0.182]
83 BDeu and SHD scores for all the experiments are summarized in Tables 5 and 6; for each experimental setup, the best result is in bold and the best result without the knowledge of the true skeleton is underlined. [sent-700, score-0.187]
84 However, this should not mislead us into thinking that minimizing a score function is not a good method to learn a graph having a small SHD. [sent-708, score-0.159]
85 The best score in each case is in bold font; the best score for each data sample without the knowledge of the true skeleton is underlined. [sent-764, score-0.397]
86 The best score in each case is in bold font; the best score for each data sample without the knowledge of the true skeleton is underlined. [sent-804, score-0.397]
87 Our algorithm decomposes the super-structure into several clusters and computes optimal networks on each cluster for every ancestral constraint in order to ensure acyclic networks. [sent-846, score-0.506]
88 Experimental evaluations using extended Alarm, Insurance, and Child networks show that our algorithm outperforms state-of-the-art greedy algorithms such as MMHC and HC with a TABU search extension both in terms of the BDeu score and SHD. [sent-847, score-0.238]
89 If there exist many edges between strongly connected components in the skeleton, the clusters obtained have too many cluster edges; this is usually the main bottleneck of our algorithm and limits its feasibility. [sent-855, score-0.32]
90 We should emphasize the fact that IT approaches add false-positive edges to the skeleton according to the specified significance level when learning the super-structure. [sent-856, score-0.194]
91 Therefore, although the true skeleton may be well structured (i. [sent-857, score-0.151]
92 , the true skeleton can be clustered with a small number of cluster edges), the false-positive edges tend to be cluster edges and limit the feasibility of ECOS. [sent-859, score-0.67]
93 Although we can argue that our method has been implemented and tested practically, it is also obvious that both strategies are different, since ECOS can be applied to a skeleton that is decomposable not only into cluster trees but also into cluster graphs. [sent-865, score-0.511]
94 For example, we converted the true skeleton of Alarm5 into a cluster tree. [sent-866, score-0.319]
95 The true skeleton was decomposed into nine clusters, the largest of them containing 34 vertices and 20 cluster edges. [sent-867, score-0.401]
96 Following the experimental results, we conclude that our approach, that is, decomposing the search on every cluster, succeeded in scaling-up the constrained optimal search. [sent-870, score-0.176]
97 Definition 7 (NACS template) The NACS template of A is a finite list of pairs of positive integers {(I_1, O_1), ..., (I_l, O_l)} such that |V^out_{C,δ}| ≥ O_1 > ··· > O_l ≥ 0 and there are exactly I_k in-vertices v^in_j such that |A(v^in_j)| = O_k. [sent-877, score-0.138]
98 , v^out_3} defined by A(v^in_1) = {v^out_1, v^out_2, v^out_3}, A(v^in_6) = {v^out_1, v^out_2, v^out_3}, A(v^in_2) = {v^out_1, v^out_3}, A(v^in_4) = {v^out_3}, A(v^in_3) = {v^out_3}, and A(v^in_5) = ø. [sent-884, score-0.618]
99 Here, we arranged the AC sets by size to illustrate the fact that, due to the definition of NACS, two in-vertices having ancestral constraints of the same size actually have the same ancestral constraint. [sent-885, score-0.229]
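Following Definition 7, the template of an ACS only records, for each distinct constraint size O_k in decreasing order, how many in-vertices I_k carry a constraint of that size; an illustrative sketch, again assuming the ACS is stored as a dict from in-vertices to their sets A(v):

```python
from collections import Counter

def nacs_template(ancestral_constraints):
    """Return [(I_1, O_1), ..., (I_l, O_l)] with O_1 > ... > O_l, where I_k is the
    number of in-vertices whose constraint set has size O_k."""
    size_counts = Counter(len(anc) for anc in ancestral_constraints.values())
    return [(count, size) for size, count in sorted(size_counts.items(), reverse=True)]
```

For constraint sizes 3, 3, 2, 1, 1, 0, as in the example above, this returns [(2, 3), (1, 2), (2, 1), (1, 0)].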
100 Therefore, to prove that Algorithm 7 is correct, we simply need to prove that it considers every template in increasing order, that is, that the loop of step 3 generates the successor of a given template NT. [sent-906, score-0.211]
wordName wordTfidf (topN-words)
[('ecos', 0.453), ('nacs', 0.35), ('vc', 0.29), ('tdm', 0.205), ('mmhc', 0.197), ('cluster', 0.168), ('hc', 0.147), ('nacss', 0.145), ('perrier', 0.145), ('bn', 0.14), ('ol', 0.131), ('kojima', 0.128), ('score', 0.123), ('skeleton', 0.121), ('errier', 0.111), ('iyano', 0.111), ('moto', 0.111), ('acos', 0.103), ('acs', 0.103), ('lustered', 0.103), ('vout', 0.103), ('ancestral', 0.102), ('bdeu', 0.099), ('dag', 0.096), ('fp', 0.092), ('sc', 0.091), ('fs', 0.088), ('ptimal', 0.088), ('onstraint', 0.088), ('vci', 0.085), ('vertices', 0.082), ('template', 0.081), ('shd', 0.08), ('clusters', 0.079), ('earch', 0.079), ('tructural', 0.079), ('tsamardinos', 0.076), ('edges', 0.073), ('nmax', 0.068), ('nished', 0.067), ('cos', 0.064), ('vin', 0.057), ('nt', 0.056), ('ng', 0.055), ('subgraphs', 0.053), ('parents', 0.052), ('search', 0.052), ('va', 0.051), ('every', 0.049), ('ec', 0.049), ('il', 0.049), ('mmpc', 0.046), ('optimal', 0.044), ('alarm', 0.043), ('shrunk', 0.043), ('merged', 0.04), ('networks', 0.039), ('subgraph', 0.037), ('feasibility', 0.037), ('ss', 0.037), ('imoto', 0.037), ('ott', 0.037), ('scores', 0.036), ('graph', 0.036), ('const', 0.034), ('topological', 0.034), ('nish', 0.034), ('ordering', 0.033), ('mal', 0.033), ('parent', 0.033), ('vertex', 0.032), ('constrained', 0.031), ('block', 0.031), ('structural', 0.031), ('tabu', 0.03), ('true', 0.03), ('insurance', 0.029), ('directed', 0.029), ('tokyo', 0.029), ('sn', 0.029), ('undirected', 0.028), ('decomposable', 0.027), ('vb', 0.027), ('acyclicity', 0.026), ('ancestrally', 0.026), ('koivisto', 0.026), ('silander', 0.026), ('tdms', 0.026), ('cycles', 0.026), ('learnt', 0.026), ('constraint', 0.025), ('ten', 0.025), ('none', 0.024), ('pan', 0.024), ('ims', 0.024), ('op', 0.024), ('friedman', 0.024), ('greedy', 0.024), ('computations', 0.023), ('ai', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure
Author: Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse superstructures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph. Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm. Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance. Keywords: Bayesian networks, structure learning, constrained optimal search
2 0.055957209 63 jmlr-2010-Learning Instance-Specific Predictive Models
Author: Shyam Visweswaran, Gregory F. Cooper
Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naive Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging
Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos
Abstract: In part I of this work we introduced and evaluated the Generalized Local Learning (GLL) framework for producing local causal and Markov blanket induction algorithms. In the present second part we analyze the behavior of GLL algorithms and provide extensions to the core methods. Specifically, we investigate the empirical convergence of GLL to the true local neighborhood as a function of sample size. Moreover, we study how predictivity improves with increasing sample size. Then we investigate how sensitive are the algorithms to multiple statistical testing, especially in the presence of many irrelevant features. Next we discuss the role of the algorithm parameters and also show that Markov blanket and causal graph concepts can be used to understand deviations from optimality of state-of-the-art non-causal algorithms. The present paper also introduces the following extensions to the core GLL framework: parallel and distributed versions of GLL algorithms, versions with false discovery rate control, strategies for constructing novel heuristics for specific domains, and divide-and-conquer local-to-global learning (LGL) strategies. We test the generality of the LGL approach by deriving a novel LGL-based algorithm that compares favorably to the state-of-the-art global learning algorithms. In addition, we investigate the use of non-causal feature selection methods to facilitate global learning. Open problems and future research paths related to local and local-to-global causal learning are discussed. Keywords: local causal discovery, Markov blanket induction, feature selection, classification, causal structure learning, learning of Bayesian networks
Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos
Abstract: We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare to other local causal discovery methods and to state-of-the-art non-causal feature selection methods using real data. A second set of experimental evaluations compares the algorithms in terms of ability to induce local causal neighborhoods using simulated and resimulated data and examines the relation of predictivity with causal induction performance. Our experiments demonstrate, consistently with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate family of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. Therefore we conclude that only local causal techniques should be used when insight into causal structure is sought. In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show
5 0.050363131 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
Author: Rahul Gupta, Sunita Sarawagi, Ajit A. Diwan
Abstract: Many structured information extraction tasks employ collective graphical models that capture interinstance associativity by coupling them with various clique potentials. We propose tractable families of such potentials that are invariant under permutations of their arguments, and call them symmetric clique potentials. We present three families of symmetric potentials—MAX, SUM, and MAJORITY . We propose cluster message passing for collective inference with symmetric clique potentials, and present message computation algorithms tailored to such potentials. Our first message computation algorithm, called α-pass, is sub-quadratic in the clique size, outputs exact messages for 13 MAX , and computes 15 -approximate messages for Potts, a popular member of the SUM family. Empirically, it is upto two orders of magnitude faster than existing algorithms based on graph-cuts or belief propagation. Our second algorithm, based on Lagrangian relaxation, operates on MAJORITY potentials and provides close to exact solutions while being two orders of magnitude faster. We show that the cluster message passing framework is more principled, accurate and converges faster than competing approaches. We extend our collective inference framework to exploit associativity of more general intradomain properties of instance labelings, which opens up interesting applications in domain adaptation. Our approach leads to significant error reduction on unseen domains without incurring any overhead of model retraining. Keywords: passing graphical models, collective inference, clique potentials, cluster graphs, message
6 0.043902837 56 jmlr-2010-Introduction to Causal Inference
8 0.042499926 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
9 0.040258177 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
10 0.039087515 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
11 0.038362559 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
12 0.037830304 60 jmlr-2010-Learnability, Stability and Uniform Convergence
13 0.036206793 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
14 0.0351489 10 jmlr-2010-An Exponential Model for Infinite Rankings
15 0.034387488 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
16 0.033780586 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
17 0.032131534 44 jmlr-2010-Graph Kernels
18 0.029277751 22 jmlr-2010-Classification Using Geometric Level Sets
19 0.028906865 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
20 0.026195491 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
topicId topicWeight
[(0, -0.121), (1, 0.09), (2, 0.025), (3, -0.02), (4, -0.023), (5, -0.003), (6, -0.043), (7, 0.037), (8, -0.015), (9, 0.125), (10, -0.046), (11, 0.06), (12, 0.048), (13, 0.017), (14, -0.101), (15, -0.075), (16, -0.051), (17, 0.033), (18, 0.055), (19, -0.027), (20, 0.034), (21, -0.129), (22, -0.01), (23, -0.102), (24, -0.039), (25, -0.088), (26, -0.042), (27, -0.243), (28, 0.012), (29, -0.112), (30, 0.107), (31, -0.058), (32, 0.026), (33, -0.214), (34, 0.126), (35, -0.088), (36, 0.019), (37, -0.159), (38, 0.049), (39, 0.193), (40, 0.103), (41, 0.019), (42, 0.235), (43, -0.166), (44, 0.091), (45, 0.125), (46, -0.002), (47, 0.138), (48, -0.123), (49, -0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.96040279 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure
Author: Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse superstructures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph. Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm. Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance. Keywords: Bayesian networks, structure learning, constrained optimal search
2 0.52945179 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
Author: Rahul Gupta, Sunita Sarawagi, Ajit A. Diwan
Abstract: Many structured information extraction tasks employ collective graphical models that capture interinstance associativity by coupling them with various clique potentials. We propose tractable families of such potentials that are invariant under permutations of their arguments, and call them symmetric clique potentials. We present three families of symmetric potentials—MAX, SUM, and MAJORITY . We propose cluster message passing for collective inference with symmetric clique potentials, and present message computation algorithms tailored to such potentials. Our first message computation algorithm, called α-pass, is sub-quadratic in the clique size, outputs exact messages for 13 MAX , and computes 15 -approximate messages for Potts, a popular member of the SUM family. Empirically, it is upto two orders of magnitude faster than existing algorithms based on graph-cuts or belief propagation. Our second algorithm, based on Lagrangian relaxation, operates on MAJORITY potentials and provides close to exact solutions while being two orders of magnitude faster. We show that the cluster message passing framework is more principled, accurate and converges faster than competing approaches. We extend our collective inference framework to exploit associativity of more general intradomain properties of instance labelings, which opens up interesting applications in domain adaptation. Our approach leads to significant error reduction on unseen domains without incurring any overhead of model retraining. Keywords: passing graphical models, collective inference, clique potentials, cluster graphs, message
3 0.37460715 63 jmlr-2010-Learning Instance-Specific Predictive Models
Author: Shyam Visweswaran, Gregory F. Cooper
Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including nave Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging
4 0.34456095 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
Author: Miloš Radovanović, Alexandros Nanopoulos, Mirjana Ivanović
Abstract: Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent “popular” nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families. Keywords: nearest neighbors, curse of dimensionality, classification, semi-supervised learning, clustering
5 0.33527482 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
Author: Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studený
Abstract: The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method. Keywords: conditional independence inference, linear programming approach
6 0.30103573 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
7 0.27760559 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
8 0.23338228 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
10 0.20334494 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
11 0.1989069 60 jmlr-2010-Learnability, Stability and Uniform Convergence
12 0.19312805 10 jmlr-2010-An Exponential Model for Infinite Rankings
14 0.17540707 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
16 0.14454669 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
17 0.142498 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
18 0.14142211 40 jmlr-2010-Fast and Scalable Local Kernel Machines
19 0.13287497 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
20 0.13031143 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
topicId topicWeight
[(3, 0.012), (4, 0.021), (8, 0.031), (12, 0.487), (21, 0.016), (32, 0.035), (33, 0.021), (36, 0.053), (37, 0.036), (75, 0.119), (81, 0.022), (85, 0.036), (96, 0.016)]
simIndex simValue paperId paperTitle
1 0.70260048 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith
Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields
same-paper 2 0.64813232 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure
Author: Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse superstructures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph. Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm. Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance. Keywords: Bayesian networks, structure learning, constrained optimal search
3 0.28370753 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.28032675 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
Author: Yu Fan, Jing Xu, Christian R. Shelton
Abstract: A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time. Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental results on both synthetic networks and a network learned from a real data set on people’s life history events. We show the accuracy as well as the time efficiency of our algorithms, and compare them to other approximate algorithms: expectation propagation and Gibbs sampling. Keywords: continuous time Bayesian networks, importance sampling, approximate inference, filtering, smoothing
5 0.27929968 63 jmlr-2010-Learning Instance-Specific Predictive Models
Author: Shyam Visweswaran, Gregory F. Cooper
Abstract: This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including nave Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. Keywords: instance-specific, Bayesian network, Markov blanket, Bayesian model averaging
6 0.27844837 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
7 0.2779386 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
8 0.27749979 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
9 0.27747178 69 jmlr-2010-Lp-Nested Symmetric Distributions
10 0.27442446 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
11 0.27402923 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
12 0.27382591 56 jmlr-2010-Introduction to Causal Inference
13 0.27299076 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
14 0.27298343 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
15 0.27093375 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
16 0.27076554 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
18 0.27043802 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
19 0.27031794 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
20 0.270055 109 jmlr-2010-Stochastic Composite Likelihood