jmlr jmlr2011 jmlr2011-75 knowledge-graph by maker-knowledge-mining

75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure


Source: pdf

Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano

Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2^n) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(n^σ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(n^{σ+1} 2^n). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in the literature. Keywords: optimal Bayesian network structure, parallel algorithm

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Human Genome Center, Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan. Editor: Russ Greiner. Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. [sent-10, score-0.285]

2 This algorithm splits the search space so that the required communication between independent calculations is minimal. [sent-15, score-0.304]

3 Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0. [sent-17, score-0.361]

4 We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in the literature. [sent-19, score-0.601]

5 Keywords: optimal Bayesian network structure, parallel algorithm 1. [sent-20, score-0.266]

6 In this paper, we focus on a score-based learning algorithm and formalize it as a problem to search for an optimal structure that derives the maximal (or minimal) score using a score function defined on a structure with respect to an observed data set. [sent-27, score-0.642]

7 A score function has to be decomposed as the sum of the local score functions for each node in a network. [sent-28, score-0.442]

8 When these algorithms are applied to real problems, the main bottleneck is bound to be the memory space requirement rather than the time requirement, because these algorithms need to store the intermediate optimal structures of all combinations of node subsets during DP steps. [sent-46, score-0.459]

9 They suggested that, by using parallelization, it might be possible to search larger-scale network structures using their algorithm. [sent-60, score-0.25]

10 Therefore, they can be used to search larger networks than the current 29-node network that requires 100 GB of memory. [sent-66, score-0.282]

11 Here, we present a parallelized optimal Bayesian network search algorithm called Para-OS. [sent-71, score-0.325]

12 Our algorithm realizes direct parallelization of the DP steps in the original algorithm by splitting the search space of DP with O(n^σ) time and space overhead calculations, where σ = 1, 2, . . . [sent-74, score-0.262]

13 σ > 0 is a parameter that is used to split the search space and controls the trade-off between the number of communications (not the volume of communication) and the memory space requirement. [sent-77, score-0.301]

14 Another important feature of our algorithm is that it calculates no redundant score functions, and it can distribute the entire calculation almost equally across all processors. [sent-81, score-0.462]

15 The main operation in our algorithm is the calculation of the score function. [sent-82, score-0.33]

16 Although our algorithm has slightly greater space and time complexities, it makes it possible to realize large-scale optimal network search in practice using widely available low-cost supercomputers. [sent-86, score-0.29]

17 We confirmed that the program can run efficiently in parallel using up to 256 processors (CPU cores) with a parallelization efficiency of more than 0.74 on a current supercomputer system. [sent-89, score-0.361]

18 The program also ran acceptably using up to 512 processors, although with a lower parallelization efficiency. [sent-90, score-0.394]

19 Finally, we demonstrate the largest optimal Bayesian network search attempted thus far on a 32-node network with 256 processors using our proposed algorithm without any constraints and without an external hard disk drive. [sent-92, score-0.666]

20 Our algorithm was found to complete the optimal search including the score calculation within a week. [sent-93, score-0.474]

21 Section 2 presents an overview of the Bayesian network and the optimal search algorithm, which serves as the basis for our proposed algorithm. [sent-95, score-0.29]

22 Preliminaries In this section, we first present a brief introduction to the Bayesian network model, and then, we describe the optimal search (OS) algorithm, which is the basal algorithm that we parallelize in the proposed algorithm. [sent-101, score-0.29]

23 A score-based Bayesian network structure search or a Bayesian network estimation problem is to search for the DAG structure fitted to the observed data, in which the fitness of the structure to the given data is measured by a score function. [sent-115, score-0.871]

24 The score function is defined on a node and its parent set. [sent-116, score-0.297]

25 The scores of nodes obtained by a score function are called local scores. [sent-117, score-0.275]

26 A network score is defined simply as the sum of local scores of all nodes in a network. [sent-118, score-0.421]
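To make the decomposability concrete, here is a minimal sketch (not the authors' code): a network score is just the sum of local scores, one per node and its parent set. `network_score` and `local_score` are hypothetical names, with `local_score` standing in for a criterion such as BDe or BNRC.

```python
# Minimal sketch, not the paper's implementation: a decomposable network score
# is the sum of local scores, one per node given its parent set in the DAG.
# `local_score` is a hypothetical stand-in for a criterion such as BDe or BNRC.

def network_score(dag_parents, data, local_score):
    """dag_parents: dict mapping each node v to the frozenset of its parents."""
    return sum(local_score(v, pa, data) for v, pa in dag_parents.items())
```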

27 Therefore, the problem becomes one of finding the structure that minimizes the score function. [sent-127, score-0.249]

28 The optimal network structure search by DP can be regarded as an optimal permutation search problem. [sent-128, score-0.548]

29 That is, F(v, A) calculates the optimal choice of the parent set B ⊂ A for node v and returns its optimal local score. [sent-132, score-0.287]
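As an illustration, the following brute-force sketch reads F(v, A) as the minimum of the local score s(v, B, X) over candidate parent sets B ⊆ A; this reading is an assumption consistent with the surrounding text, and `s` is a placeholder local score function.

```python
from itertools import chain, combinations

# Hedged sketch of F(v, A): pick the parent set B ⊆ A that minimizes the local
# score s(v, B, X).  `s` is a placeholder for the paper's local score function.

def F(v, A, X, s):
    candidates = chain.from_iterable(
        combinations(sorted(A), r) for r in range(len(A) + 1))
    return min(s(v, frozenset(B), X) for B in candidates)
```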

30 Definition 2 (Optimal network score on a permutation) Let π : {1, 2, . . . , |A|} → A be a permutation of the nodes in A. [sent-134, score-0.334]

31 Given a permutation π ∈ ΠA, the optimal network score on π can be described as QA(π) = ∑_{v∈A} F(v, {u ∈ A : π^{-1}(u) < π^{-1}(v)}). [sent-138, score-0.478]

32 Definition 3 (Optimal network score) By using QA(π) defined above, we can formalize the network structure search as a problem to find the optimal permutation that gives the minimal network score: M(A) = arg min_{π∈ΠA} QA(π). [sent-139, score-0.747]

33 Here, M(A) represents the optimal permutation that derives the minimal score of the network consisting of nodes in A. [sent-140, score-0.514]

34 By applying the above equations from |A| = 0 to |A| = |V|, we obtain the optimal permutation π on V and its score QV(M(V)) in O(n · 2^n) steps. [sent-145, score-0.281]
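A minimal sketch of this subset DP is given below. It assumes the standard recursion Q(A) = min over v in A of [ F(v, A\{v}) + Q(A\{v}) ], which is one common way to realize Equations (2) and (3), and it reuses the brute-force `F` sketched above; both choices are assumptions rather than the paper's exact code.

```python
from itertools import combinations

# Hedged O(n·2^n)-style DP sketch over node subsets.  For every subset A ⊆ V it
# stores Q[A], the best score of a network over A, and M[A], the node placed
# last in the optimal permutation of A (cf. the paper's M(A) and QA(M(A))).

def optimal_order(V, X, s):
    V = list(V)
    Q = {frozenset(): 0.0}
    M = {}
    for size in range(1, len(V) + 1):
        for A in map(frozenset, combinations(V, size)):
            Q[A], M[A] = min(((F(v, A - {v}, X, s) + Q[A - {v}], v) for v in A),
                             key=lambda t: t[0])
    order, A = [], frozenset(V)          # unwind M to recover the permutation
    while A:
        order.append(M[A])
        A = A - {M[A]}
    return list(reversed(order)), Q[frozenset(V)]
```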

35 Note that in order to reconstruct the network structure, we need to keep the optimal choice of the parent set derived in Equation (1) and the optimal permutation π = M(A) in Equation (3) for all the combinations of A ⊂ V in an iterative loop for the next size of A. [sent-146, score-0.45]

36 Parallel Optimal Search Algorithm The key to parallelizing the calculation of the optimal search algorithm by DP is splitting all the combinations of nodes in a single loop of DP for F(v, A), M(A), and QA (M(A)), given above by Equations (1), (2), and (3), respectively. [sent-148, score-0.501]

37 For example, we can calculate M(B) (|B| = |A|) in the same processor that calculates M(A) such that some of M(B \ {b}) (b ∈ B) overlaps M(A \ {a}). [sent-153, score-0.295]

38 Time is mainly required to calculate the score function s(v, A, X) for all the nodes and their parent combinations. [sent-157, score-0.41]

39 Thus, our algorithm calculates s(v, A, X) equally in independent processors without any redundant calculations for this part. [sent-158, score-0.397]

40 We show that the calculation can be split by the super-combination of A, and it is the optimal separation of the combinations in terms of the number of communications required. [sent-177, score-0.387]

41 Therefore, the number of sub-combinations required to derive the combinations of length k in A is equal to the number of possible combinations that can be generated from k + σ elements, and is the binomial coefficient C(k+σ, k). [sent-183, score-0.267]

42 Therefore, S can derive A_k, and all the combinations of length k − 1 derived from elements in S are the sub-combinations required to generate the combinations in A_k. [sent-187, score-0.267]

43 If we generate all the combinations of length k taken from T , then these include all the combinations of length k from nodes in V without v1 , . [sent-193, score-0.345]

44 If we remove any S ∈ S from it, then there exist combinations that cannot be generated from another S ∈ S because S lists all the combinations except for nodes v1 , . [sent-215, score-0.277]

45 We can generate S by taking the first C(n−σ, k) combinations from the C(n, k) combinations of length k, arranged in lexicographical order. [sent-220, score-0.39]
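Purely as an illustration of taking the first C(n−σ, k) combinations in lexicographical order (the paper's actual super-combination construction is specified by its Theorems 7 and 8), note that `itertools` already enumerates combinations in that order; the function name below is hypothetical.

```python
from itertools import combinations, islice
from math import comb

# Illustration only: take the first C(n-σ, k) of the C(n, k) length-k
# combinations of {0, ..., n-1}; itertools.combinations yields them in
# lexicographical order.  The paper's own construction is in Theorems 7-8.

def first_lex_combinations(n, k, sigma):
    return list(islice(combinations(range(n), k), comb(n - sigma, k)))

print(first_lex_combinations(5, 2, 1))   # first C(4,2)=6 of the C(5,2)=10 pairs
```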

46 Theorem 7 can be used to split the search space of DP by using super-combinations, and Theorem 8 provides the number of super-combinations required in the parallel computation for a certain size of A for M(A). [sent-221, score-0.26]

47 Input: S ⊂ V : Super-combination, a ∈ N : size of the combination to be calculated, n : total number of nodes in the network, n_p : number of CPU processors (cores). [sent-243, score-0.368]

48 If the algorithm runs in parallel with hundreds of processors, the increment calculation in each processor is negligible as compared to the total amount of calculations, and thus, it does not noticeably affect the overall computation time. [sent-254, score-0.448]

49 Algorithm 2 Para-OS(V, X, s, σ, n_p) calculates the exact globally optimal structure of the Bayesian network with respect to the input data X and the local score function s with n_p processors. [sent-257, score-0.533]

50 Input: V : set of input nodes (variables) where |V| = n, X: (N × n) input data matrix, s(v, Pa, X): function V × 2^V × R^{N×n} → R that returns the local Bayesian network score for variable v with its parent set Pa ⊂ V w.r.t. the input data X. [sent-258, score-0.464]

51 In the former, each processor calculates the score function s(v, Pa, X) independently without communication, whereas the latter calculates F(v, A), M(A), and QA (M(A)) along with communications among each other to exchange the results of F(v, A), M(A), and QA (M(A)). [sent-280, score-0.581]
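A very schematic skeleton of this two-phase structure is sketched below; it is not the actual Para-OS algorithm, and `my_score_jobs`, `my_dp_jobs`, `combine_dp`, and `s` are hypothetical placeholders standing in for the paper's work assignment and DP update rules.

```python
from mpi4py import MPI

# Schematic two-phase skeleton only (NOT the real Para-OS): phase 1 computes
# local scores with no communication; phase 2 advances the DP one subset size
# at a time, exchanging the partial F/M/Q results among processors.

def two_phase_skeleton(V, X, s, my_score_jobs, my_dp_jobs, combine_dp):
    comm = MPI.COMM_WORLD
    rank, n_p = comm.Get_rank(), comm.Get_size()

    # Phase 1: embarrassingly parallel score calculation.
    local_scores = {(v, A): s(v, A, X) for (v, A) in my_score_jobs(rank, n_p, V)}

    # Phase 2: DP over subset sizes, with communication after each step.
    dp_state = {}
    for size in range(len(V) + 1):
        partial = {A: combine_dp(A, local_scores, dp_state)
                   for A in my_dp_jobs(rank, n_p, V, size)}
        for piece in comm.allgather(partial):     # exchange partial DP results
            dp_state.update(piece)
    return dp_state
```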

52 Function LQ(A, n, n_p) locates the processor index used to store the results of M(A) and QA(M(A)); LF(v, A, n, n_p) does the same for the results of F(v, A). [sent-292, score-0.257]

53 By using RLIs, the algorithm can independently and discontinuously generate the required combinations and processor indices for storing/retrieving of intermediate results. [sent-293, score-0.286]

54 We use RLIs instead of ordinal lexicographical indices because the conversion between a combination and the RLI can be calculated in linear time by preparing the index table once (Tamada et al., 2011). [sent-294, score-0.368]

55 Figure 1 shows an example of the calculation of the DP for the super-combination S in a single processor in a single loop. [sent-298, score-0.29]

56 Note that in the figure, although a super-combination is assigned to a single processor, s(v, A, X) is calculated in a different processor from one that calculates F(v, A) for the same v ∈ V and A ⊂ V \ {v}. [sent-299, score-0.317]

57 Finally, we tried to run the algorithm with as many nodes as possible on our supercomputers, as a proof of practical long-running execution realizing optimal structure learning for large networks. [sent-309, score-0.403]

58 , 2002) could search for a network structure having a better score than that of the optimal structure. [sent-335, score-0.539]

59 The calculation with a single processor for n = 24 exceeds this limit. [sent-341, score-0.29]

60 The total running times are measured for the entire execution of the program, including the input of the data from a file, output of the network to a file, and MPI initialization and finalization routine calls. [sent-343, score-0.262]

61 We measured the running times using 256 processors here. [sent-349, score-0.248]

62 During the computation, we also measured the times required for calling MPI functions to exchange required data between processors, and the times required for calculating the score function s(·). [sent-350, score-0.348]

63 As shown in the figure, σ does not affect the score calculation time. [sent-355, score-0.33]

64 “Total Time” represents the total time required for execution in seconds, “Cm Time” represents the total communication time required for calling MPI functions within the total time; “Sc Time,” the time required for score calculation; and “Mem,” the memory requirement in GiB. [sent-385, score-0.719]

65 The speedup S(n_p) is defined as S(n_p) = T(1)/T(n_p), where n_p is the number of processors and T(n_p) is the running time with n_p processors. [sent-438, score-0.306]
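For reference, a tiny helper matching this definition; parallelization efficiency is taken here to be the usual S(n_p)/n_p, which is an assumption, and the numbers in the usage comment are made up for illustration.

```python
# Speedup S(n_p) = T(1)/T(n_p); efficiency assumed to be E(n_p) = S(n_p)/n_p.

def speedup(t_1, t_np):
    return t_1 / t_np

def efficiency(t_1, t_np, n_p):
    return speedup(t_1, t_np) / n_p

# e.g., efficiency(1000.0, 5.3, 256) ≈ 0.74 -- illustrative numbers only.
```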

66 This result suggests that the redundant calculation in our proposed algorithm does not have a great effect, and the communication cost is the main cause of the inefficiency of our algorithm. [sent-504, score-0.269]

67 We measured the running times with 256 processors and σ = 3. [sent-508, score-0.248]

68 From these results, we can say that the score calculation remains dominant and the communication does not contribute significantly to the total running times for this range of n with n p = 256. [sent-514, score-0.499]

69 We confirmed that the communication times are almost constant for all the tested sample sizes and that the score calculation increased with the sample sizes, as was expected. [sent-546, score-0.423]

70 Note that the calculation of the BNRC score function is not in linear time, and it is difficult to determine the exact time complexity because it involves an iterative optimization step (Imoto et al. [sent-547, score-0.33]

71 3 Structure Search for Large Networks Finally, we tried to search for the optimal structure of a network with as many nodes as possible on the HGC system because it allows long execution for up to two weeks with 256 processor cores and has a larger memory in each computation node. [sent-550, score-0.735]

72 With the BNRC score function, we have succeeded in searching for the optimal structure of a 31-node network by using 464. [sent-552, score-0.435]

73 As described in Parviainen and Koivisto (2009), thus far, the largest network search that has been reported was for a 29-node network (Silander and Myllymäki, 2006). [sent-560, score-0.396]

74 To search for the optimal structure of an even larger network, we used the BDe network score, which is a discrete model that is much faster than the BNRC score. [sent-562, score-0.351]

75 Generally, the BDe score can be calculated 100 times faster than the BNRC score (data not shown). [sent-563, score-0.447]

76 Using the BDe score function, we successfully carried out optimal structure search for a 32-node network without any restriction using 836. [sent-564, score-0.539]

77 Thus, for n = 32 with the BDe score function, the calculation of score functions requires relatively little time (actually, it required only around 1. [sent-569, score-0.561]

78 Discussions In this paper, we have presented a parallel algorithm to search for the score-based optimal structure of Bayesian networks. [sent-573, score-0.285]

79 We confirmed the scalability of the algorithm to the number of processors through computational experiments and successfully demonstrated optimal structure search for a 32-node network with 256 processors, an improvement over the most successful result reported thus far. [sent-575, score-0.552]

80 Our algorithm overcomes the bottleneck of the previous algorithm by using a large amount of distributed memory for large-scale Bayesian network structure search. [sent-576, score-0.366]

81 Both are capable of breaking the current limitation of the network size in optimal network structure search. [sent-579, score-0.422]

82 Therefore, although the time requirement increases with a decrease in the space requirement, they mentioned that their algorithm can obtain the optimal structure for a 34-node network by massive parallelization. [sent-583, score-0.283]

83 However, our algorithm can actually search for the optimal structure of a 32-node network with 256 processors in less than a week, including score calculation, whereas Parviainen and Koivisto (2009) computed only the partial problems. [sent-587, score-0.74]

84 Their estimate from their empirical result is 4 weeks using 100 processors to obtain the optimal structure for a 31-node network. [sent-588, score-0.302]

85 In addition, their estimate ignores the parallelization overhead, which generally becomes problematic, as well as the score calculation, which requires the most time in the actual application. [sent-589, score-0.397]

86 Therefore, exploiting the modern multi-core CPUs can reduce the communications required among computation nodes and increase the amount of memory space for independent calculations without requiring improved hardware relative to current supercomputers. [sent-593, score-0.358]

87 We search the network structure G by maximizing the posterior of G for the input data matrix X. [sent-605, score-0.311]

88 In this model, the conditional probability of X_j is parameterized as P(X_j = u_k | Pa_G(X_j) = u_jl) = θ_jlk, where u_jl (l = 1, . [sent-626, score-0.326]

89 For the discrete model, the likelihood can be expressed as P(X|G, θ) = ∏_{j=1}^{n} ∏_{l=1}^{q_j} ∏_{k=1}^{r} θ_jlk^{N_jlk}, where N_jlk is the number of observations for the j-th node whose value equals u_k in the data matrix X with respect to a combination l of the parents' observations. [sent-631, score-0.712]

90 N_jl = ∑_{k=1}^{r} N_jlk, and θ denotes the set of parameters θ_jlk. [sent-632, score-0.445]
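A hedged sketch of this quantity follows: it counts N_jlk from the data and evaluates the log of the product above. The data-structure choices (`data` as a list of per-observation dicts, `parents` as a node-to-parents map, `theta` keyed by (j, l, k)) are assumptions made only for illustration.

```python
import math
from collections import Counter

# Sketch of the discrete-model log-likelihood written above:
# log P(X|G, θ) = Σ_j Σ_l Σ_k N_jlk · log θ_jlk, where N_jlk counts how often
# node j takes value k while its parents take the joint configuration l.

def discrete_log_likelihood(data, parents, theta):
    counts = Counter()
    for row in data:                                  # row: dict {node: value}
        for j, pa in parents.items():
            l = tuple(row[p] for p in pa)             # parent configuration
            counts[(j, l, row[j])] += 1               # N_jlk
    return sum(n * math.log(theta[key]) for key, n in counts.items())
```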

91 To do this efficiently, we require an algorithm that calculates a combination vector from its index and the index from its combination vector. [sent-689, score-0.292]

92 Our algorithm actually deals with the reverse lexicographical index instead of the ordinal lexicographical index; this enables us to calculate the table only once and to make it reusable. [sent-694, score-0.54]

93 Let C = {C_1, C_2, . . . , C_m} be the set of all the combinations of length k taken from n objects, arranged in lexicographical order, where m = C(n, k). [sent-703, score-0.329]

94 Let us define the reverse lexicographical index of C_i ∈ C, denoted RLI(C_i, n). [sent-705, score-0.291]

95 Algorithm 4 RLI(X, n, T ) calculates the reverse lexicographical index of the given combination X. [sent-727, score-0.44]

96 Corollary 15 Let RLI(X, n) be the reverse lexicographical index defined above for combination X = {x1 , x2 , . [sent-742, score-0.342]

97 The inverse function of RLI(X, n) recovers the i-th element x_i of the combination; Algorithm 5 RLI^{-1}(r, n, k, T) calculates the combination vector of length k from n elements, corresponding to the given reverse lexicographical index r. [sent-746, score-0.474]

98 Input: r : reverse lexicographical index of the combination to be calculated, n : total number of elements, k : length of the combination, T : index table calculated by RLITable(·). [sent-747, score-0.522]

99 The normal lexicographical order can also be calculated in a similar manner for fixed values of n and k. [sent-771, score-0.271]
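For intuition, here is a standard ranking/unranking of length-k combinations in ordinary lexicographical order. The paper's RLI instead uses the reverse lexicographical order with a precomputed table T (built once by RLITable and reusable across k, per Algorithms 4 and 5), so this sketch only illustrates the kind of index conversion involved; the function names are mine.

```python
from math import comb

# Illustrative rank/unrank for length-k combinations of {0,...,n-1} in ordinary
# lexicographical order; the paper's RLI variant (reverse order, reusable index
# table T) is defined by its own Algorithms 4 and 5.

def lex_rank(X, n):
    """Index of the sorted combination X among all C(n, len(X)) combinations."""
    r, prev, k = 0, -1, len(X)
    for i, x in enumerate(X):
        for y in range(prev + 1, x):              # combinations skipped before x
            r += comb(n - 1 - y, k - 1 - i)
        prev = x
    return r

def lex_unrank(r, n, k):
    """Inverse of lex_rank for fixed n and k."""
    X, x = [], 0
    for i in range(k):
        while comb(n - 1 - x, k - 1 - i) <= r:
            r -= comb(n - 1 - x, k - 1 - i)
            x += 1
        X.append(x)
        x += 1
    return X

assert lex_unrank(lex_rank([1, 3, 4], 6), 6, 3) == [1, 3, 4]
```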

100 Conversion between a combination vector and the lexicographical index in linear time with polynomial time preprocessing, 2011. [sent-843, score-0.297]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rli', 0.426), ('tamada', 0.263), ('qa', 0.251), ('processors', 0.201), ('lexicographical', 0.2), ('score', 0.188), ('jlk', 0.188), ('bnrc', 0.163), ('processor', 0.148), ('network', 0.146), ('calculation', 0.142), ('iyano', 0.138), ('moto', 0.138), ('dp', 0.133), ('bde', 0.128), ('imoto', 0.128), ('pag', 0.117), ('memory', 0.115), ('etworks', 0.105), ('ptimal', 0.105), ('search', 0.104), ('gib', 0.1), ('calculates', 0.098), ('koivisto', 0.096), ('combinations', 0.095), ('communication', 0.093), ('ord', 0.087), ('nodes', 0.087), ('parviainen', 0.085), ('parallelization', 0.08), ('parallel', 0.08), ('mpi', 0.077), ('supercomputer', 0.075), ('tc', 0.075), ('calculated', 0.071), ('jl', 0.069), ('bayesian', 0.067), ('ott', 0.067), ('node', 0.066), ('lq', 0.064), ('calculations', 0.064), ('store', 0.063), ('supercomputers', 0.063), ('lf', 0.062), ('structure', 0.061), ('speedup', 0.058), ('lgorithm', 0.056), ('cores', 0.053), ('permutation', 0.053), ('combination', 0.051), ('tokyo', 0.051), ('def', 0.051), ('mem', 0.05), ('communications', 0.049), ('calculate', 0.049), ('increment', 0.049), ('overhead', 0.049), ('aa', 0.048), ('running', 0.047), ('index', 0.046), ('reverse', 0.045), ('dag', 0.044), ('bottleneck', 0.044), ('parent', 0.043), ('massively', 0.043), ('riken', 0.043), ('required', 0.043), ('mod', 0.041), ('optimal', 0.04), ('execution', 0.04), ('acceptably', 0.038), ('comm', 0.038), ('cpus', 0.038), ('ricc', 0.038), ('rlitable', 0.038), ('cpu', 0.037), ('cm', 0.036), ('requirement', 0.036), ('parallelized', 0.035), ('os', 0.035), ('redundant', 0.034), ('earning', 0.034), ('length', 0.034), ('split', 0.033), ('parents', 0.033), ('loop', 0.033), ('networks', 0.032), ('compiler', 0.032), ('calling', 0.031), ('retrieve', 0.031), ('rq', 0.031), ('total', 0.029), ('limitation', 0.029), ('sc', 0.029), ('realizes', 0.029), ('myllym', 0.029), ('silander', 0.029), ('disk', 0.029), ('separation', 0.028), ('rmed', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano

Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2n ) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(nσ ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(nσ+1 2n ). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm

2 0.13824561 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

Author: Cassio P. de Campos, Qiang Ji

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique

3 0.096956827 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

Author: Benoît Patra

Abstract: Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An indepth analysis of the almost sure convergence of the DALVQ algorithm is performed. A striking result is that we prove that the multiple versions of the quantizers distributed among the processors in the parallel architecture asymptotically reach a consensus almost surely. Furthermore, we also show that these versions converge almost surely towards the same nearly optimal value for the quantization criterion. Keywords: k-means, vector quantization, distributed, asynchronous, stochastic optimization, scalability, distributed consensus

4 0.059502531 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

5 0.05233562 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P; index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning c 2011 Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar and Alan S. Willsky. C HOI , TAN , A NANDKUMAR AND W ILLSKY

6 0.040901881 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

7 0.04024617 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

8 0.037594017 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

9 0.034959294 50 jmlr-2011-LPmade: Link Prediction Made Easy

10 0.034053326 48 jmlr-2011-Kernel Analysis of Deep Networks

11 0.03385606 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

12 0.033505637 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

13 0.030269209 55 jmlr-2011-Learning Multi-modal Similarity

14 0.030196363 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

15 0.029709045 104 jmlr-2011-X-Armed Bandits

16 0.029210065 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

17 0.02899538 61 jmlr-2011-Logistic Stick-Breaking Process

18 0.028379178 35 jmlr-2011-Forest Density Estimation

19 0.028201552 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

20 0.025782483 101 jmlr-2011-Variable Sparsity Kernel Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.158), (1, -0.067), (2, -0.017), (3, -0.059), (4, -0.005), (5, 0.067), (6, 0.006), (7, 0.06), (8, 0.146), (9, 0.193), (10, -0.012), (11, -0.048), (12, 0.075), (13, 0.144), (14, -0.084), (15, -0.012), (16, 0.015), (17, 0.008), (18, -0.086), (19, 0.074), (20, 0.023), (21, -0.005), (22, -0.027), (23, 0.078), (24, 0.149), (25, -0.306), (26, -0.232), (27, -0.074), (28, 0.083), (29, 0.266), (30, -0.25), (31, 0.127), (32, 0.154), (33, -0.006), (34, 0.045), (35, -0.001), (36, -0.114), (37, -0.095), (38, -0.017), (39, -0.074), (40, 0.051), (41, 0.02), (42, 0.094), (43, -0.05), (44, -0.058), (45, 0.008), (46, -0.089), (47, -0.112), (48, -0.113), (49, 0.07)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95912111 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano

Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2n ) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(nσ ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(nσ+1 2n ). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm

2 0.61138868 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

Author: Benoît Patra

Abstract: Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An indepth analysis of the almost sure convergence of the DALVQ algorithm is performed. A striking result is that we prove that the multiple versions of the quantizers distributed among the processors in the parallel architecture asymptotically reach a consensus almost surely. Furthermore, we also show that these versions converge almost surely towards the same nearly optimal value for the quantization criterion. Keywords: k-means, vector quantization, distributed, asynchronous, stochastic optimization, scalability, distributed consensus

3 0.45685622 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

Author: Cassio P. de Campos, Qiang Ji

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique

4 0.28551707 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki

Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (ˆ fCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that ˆ fCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation c 2011 Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira and Petri Myllym¨ ki. a ¨ C ARVALHO , ROOS , O LIVEIRA AND M YLLYM AKI

5 0.27102816 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

6 0.24858803 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

7 0.21651177 48 jmlr-2011-Kernel Analysis of Deep Networks

8 0.21496195 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

9 0.20384431 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

10 0.20349278 50 jmlr-2011-LPmade: Link Prediction Made Easy

11 0.18978158 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes

12 0.18529227 83 jmlr-2011-Scikit-learn: Machine Learning in Python

13 0.18446116 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

14 0.18077961 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

15 0.17780752 12 jmlr-2011-Bayesian Co-Training

16 0.17337453 35 jmlr-2011-Forest Density Estimation

17 0.168364 54 jmlr-2011-Learning Latent Tree Graphical Models

18 0.15764597 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

19 0.15174918 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

20 0.14727898 104 jmlr-2011-X-Armed Bandits


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.036), (9, 0.012), (10, 0.681), (24, 0.028), (31, 0.038), (32, 0.014), (41, 0.016), (71, 0.021), (73, 0.021), (78, 0.032), (90, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96913797 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano

Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2n ) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(nσ ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(nσ+1 2n ). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm

2 0.91754705 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

Author: Julian J. McAuley, TibĂŠrio S. Caetano

Abstract: Maximum A Posteriori inference in graphical models is often solved via message-passing algorithms, such as the junction-tree algorithm or loopy belief-propagation. The exact solution to this problem is well-known to be exponential in the size of the maximal cliques of the triangulated model, while approximate inference is typically exponential in the size of the model’s factors. In this paper, we take advantage of the fact that many models have maximal cliques that are larger than their constituent factors, and also of the fact that many factors consist only of latent variables (i.e., they do not depend on an observation). This is a common case in a wide variety of applications that deal with grid-, tree-, and ring-structured models. In such cases, we are able to decrease the exponent of complexity for message-passing by 0.5 for both exact and approximate inference. We demonstrate that message-passing operations in such models are equivalent to some variant of matrix multiplication in the tropical semiring, for which we offer an O(N 2.5 ) expected-case solution. Keywords: graphical models, belief-propagation, tropical matrix multiplication

3 0.45655558 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

Author: Cassio P. de Campos, Qiang Ji

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique

4 0.45178345 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu

Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction ¨ u u c 2011 Ozg¨ r S¨ mer, Umut A. Acar, Alexander T. Ihler and Ramgopal R. Mettu. ¨ S UMER , ACAR , I HLER AND M ETTU

5 0.35678387 16 jmlr-2011-Clustering Algorithms for Chains

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

6 0.34186327 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

7 0.33812398 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

8 0.33157364 61 jmlr-2011-Logistic Stick-Breaking Process

9 0.32759076 59 jmlr-2011-Learning with Structured Sparsity

10 0.31972545 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

11 0.31824136 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments

12 0.31344998 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

13 0.31170955 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

14 0.30776015 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

15 0.30542937 12 jmlr-2011-Bayesian Co-Training

16 0.30296353 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

17 0.29946789 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

18 0.29894173 48 jmlr-2011-Kernel Analysis of Deep Networks

19 0.29847422 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

20 0.29600394 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization