jmlr jmlr2011 jmlr2011-54 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning
Reference: text
sentIndex sentText sentNum sentScore
1 Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139. Editor: Marina Meilă. Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. [sent-9, score-0.619]
2 Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. [sent-13, score-0.66]
3 This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. [sent-16, score-0.617]
4 We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. [sent-19, score-0.788]
5 Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning [sent-21, score-0.77]
6 There are three challenging problems in learning a model with latent variables: learning the number of latent variables; inferring the structure of how these latent variables relate to each other and to the observed variables; and estimating the parameters characterizing those relationships. [sent-27, score-0.917]
7 One class of models that has received considerable attention in the literature is the class of latent tree models, that is, graphical models Markov on trees, in which variables at some nodes represent the original (observed) variables of interest while others represent the latent variables. [sent-29, score-1.107]
8 If all variables are observed in the tree under consideration, then the well-known algorithm of Chow and Liu (1968) provides a tractable algorithm for performing maximum likelihood (ML) estimation of the tree structure. [sent-39, score-0.613]
9 This global step then identifies groups of observed nodes that are likely to be topologically close to each other in the true latent tree, thereby guiding subsequent recursive grouping or neighbor-joining (Saitou and Nei, 1987) computations. [sent-51, score-0.617]
10 For example, we can take any such latent model and add another hidden variable as a leaf node connected to only one other (hidden or observed) node. [sent-54, score-0.701]
11 (2006) considered the learning of directed latent models using so-called tetrad constraints, and there have also been attempts to tailor the learning of latent tree models in order to perform approximate inference accurately and efficiently downstream (Wang et al. [sent-69, score-0.829]
12 In contrast, we fix the state space of each hidden node, but allow the possibility that some observed nodes are internal nodes (non-leaves). [sent-72, score-0.65]
13 A tree is called a complete k-ary tree (or k-complete tree) if all its internal nodes have degree k and there exists one node (commonly referred to as the root node) that has exactly the same distance to all leaf nodes. [sent-75, score-1.12]
14 The reconstruction of latent trees has been studied extensively by the phylogenetic community where sequences of extant species are available and the unknown phylogenetic tree is to be inferred from these sequences. [sent-92, score-0.754]
15 The algorithm proceeds by recursively pairing two nodes that are the closest neighbors in the true latent tree and introducing a hidden node as the parent of the two nodes. [sent-101, score-1.159]
16 However, when we only have access to the samples at the observed nodes, it is not straightforward to construct a latent tree from a set of quartets since the quartets may not be consistent. [sent-110, score-0.719]
17 Here, we say that a set of quartets is consistent if there exists a latent tree such that all quartets agree with the tree. [sent-122, score-0.654]
18 In the subsequent two sections, we make two assumptions: firstly, that the true distribution is a latent tree and, secondly, that perfect knowledge of the information distances between observed variables is available. [sent-142, score-0.647]
19 We also discuss extensions of our algorithms for the case when the underlying model is not a tree and our goal is to learn an approximation to it using a latent tree model. [sent-146, score-0.827]
20 A latent tree is a tree with node set W := V ∪ H, the union of a set of observed nodes V (with m = |V |), and a set of latent (or hidden) nodes H. [sent-169, score-1.721]
21 samples drawn from a graphical model (distribution) p, Markov on a latent tree Tp = (W, E p ), where W = V ∪ H. [sent-206, score-0.619]
22 Our algorithms learn latent tree structures using the information distances (defined in Section 3) between pairs of observed variables, which can be estimated from samples. [sent-212, score-0.721]
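As a concrete illustration of this estimation step, the sketch below computes empirical information distances for jointly Gaussian observed variables; the excerpt defers the exact definition to Section 3, so the formula di j = −log |ρi j| and the function name used here are assumptions of this sketch rather than something taken from the paper.

    import numpy as np

    def gaussian_information_distances(X):
        # X: an (n x m) array holding n i.i.d. samples of the m observed variables.
        # Assumed definition for the Gaussian case: d_ij = -log |rho_ij|,
        # with rho_ij the empirical correlation coefficient.
        rho = np.corrcoef(X, rowvar=False)
        return -np.log(np.abs(rho) + 1e-12)   # small epsilon guards against log(0)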
23 Minimal Tree Extensions Our ultimate goal is to recover the graphical model p, that is, the latent tree structure and its parameters, given n i.i.d. samples of the observed variables. [sent-222, score-0.619]
24 However, in general, there can be multiple latent tree models which result in the same observed statistics, that is, the same joint distribution pV of the observed variables. [sent-226, score-0.636]
25 We consider the class of tree models where it is possible to recover the latent tree model uniquely and provide necessary conditions for structure identifiability, that is, the identifiability of the edge set E. [sent-227, score-0.886]
26 As an example, consider a discrete latent tree model with binary observed variables (K = 2). [sent-232, score-0.664]
27 A latent tree with the simplest structure (fewest nodes) is a tree in which all m observed binary variables are connected to one hidden variable. [sent-233, score-1.064]
28 The following conditions ensure that a latent tree does not include a redundant hidden node (Pearl, 1988): (C1) Each hidden variable has at least three neighbors (which can be either hidden or observed). [sent-244, score-1.218]
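A one-line check of condition (C1) can be written as follows; the adjacency representation and names are illustrative, and the paper's second condition, which is not reproduced in this excerpt, is not checked here.

    def satisfies_c1(adj, hidden_nodes):
        # adj maps each node of the latent tree to the set of its neighbors.
        # (C1): every hidden variable has at least three neighbors.
        return all(len(adj[h]) >= 3 for h in hidden_nodes)

    # example: one hidden node connected to three observed leaves satisfies (C1)
    adj = {'h': {1, 2, 3}, 1: {'h'}, 2: {'h'}, 3: {'h'}}
    print(satisfies_c1(adj, ['h']))   # True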
29 As illustrated in Figure 1, using marginalization operations, any non-minimal latent tree distribution can be reduced to a minimal latent tree model. [sent-256, score-1.126]
30 Definition 2 (Consistency) A latent tree reconstruction algorithm A is a map from the n observed samples xV to an estimated tree T n and an estimated tree-structured graphical model pn . [sent-264, score-0.968]
31 In the following sections, we design structurally and risk consistent algorithms for (minimal) Gaussian and symmetric discrete latent tree models, defined in (4). [sent-267, score-0.648]
32 To do so, for any three variables i, j, k ∈ V , we define Φi jk := dik − d jk to be the difference between the information distances dik and d jk . [sent-325, score-0.707]
33 Lemma 4 (Sibling Grouping) For distances di j for all i, j ∈ V on a tree T ∈ T≥3 , the following two properties on Φi jk = dik − d jk hold: (i) Φi jk = di j for all k ∈ V \ {i, j} if and only if i is a leaf node and j is its parent; (ii) −di j < Φi jk = Φi jk′ < di j for all k, k′ ∈ V \ {i, j} if and only if both i and j are leaf nodes and they have the same parent, that is, they belong to the same sibling group. [sent-327, score-1.435]
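A minimal sketch of the pairwise test behind the TestNodeRelationships procedure, assuming exact distances; the function name, tolerance, and return labels are illustrative only.

    import numpy as np

    def classify_pair(D, i, j, active, tol=1e-9):
        # D: symmetric matrix (or nested dict) of information distances.
        others = [k for k in active if k not in (i, j)]
        if not others:
            return 'undetermined'
        phi = np.array([D[i][k] - D[j][k] for k in others])   # Phi_ijk over k
        dij = D[i][j]
        if np.all(np.abs(phi - dij) < tol):
            return 'j is parent of leaf i'      # property (i)
        if np.all(np.abs(phi + dij) < tol):
            return 'i is parent of leaf j'      # property (i) with roles swapped
        if np.ptp(phi) < tol and -dij < phi[0] < dij:
            return 'leaf siblings'              # property (ii)
        return 'neither'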
34 In the following section, we describe a recursive algorithm that is based on the above TestNodeRelationships procedure to reconstruct the entire latent tree model, assuming that the true model is a latent tree and that the true distance matrix D = [di j ] is known. [sent-348, score-1.184]
35 Recursive Grouping Algorithm Given Information Distances This section is devoted to the development of the first algorithm for reconstructing latent tree models, recursive grouping (RG). [sent-354, score-0.688]
36 This newly introduced parent node corresponds to a hidden node in the original unknown latent tree. [sent-359, score-0.861]
37 We now describe how to compute the information distances in Step 6 for each new hidden node h ∈ Y and all other active nodes l ∈ Y . [sent-394, score-0.696]
38 From Lemma 4 and Proposition 3, we have that dih − d jh = dik − d jk = Φi jk and dih + d jh = di j , from which we can recover the information distances between a previously active node i ∈ Yold and its new hidden parent h ∈ Y as follows: dih = (1/2)(di j + Φi jk ). (13) [sent-396, score-1.603]
39 For any other active node l ∈ Y , we can compute dhl using a child node i ∈ C (h) as follows: dhl = dil − dih if l ∈ Yold , and dhl = dik − dih − dlk otherwise, where k ∈ C (l). (14) [sent-397, score-0.756]
40 Using Equations (13) and (14), we can infer all the information distances dhl between a newly introduced hidden node h and all other active nodes l ∈ Y . [sent-398, score-0.719]
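These updates can be written compactly as below; D is assumed to be a dictionary keyed by node pairs with both orderings stored, and is_old / child_of are illustrative stand-ins for the bookkeeping RG keeps about the active set, not an interface taken from the paper.

    def add_hidden_distances(D, h, i, j, k, active, is_old, child_of):
        # i, j: two children of the new hidden node h; k: any other observed node.
        phi = D[i, k] - D[j, k]                     # Phi_ijk
        D[i, h] = D[h, i] = 0.5 * (D[i, j] + phi)   # Eq. (13)
        D[j, h] = D[h, j] = D[i, j] - D[i, h]       # since d_ih + d_jh = d_ij
        for l in active:
            if l in (i, j, h):
                continue
            if is_old(l):                           # l was active before this step
                d = D[i, l] - D[i, h]
            else:                                   # l is itself a new hidden node
                c = child_of(l)                     # any child c in C(l)
                d = D[i, c] - D[i, h] - D[l, c]
            D[h, l] = D[l, h] = d                   # Eq. (14)
        return D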
41 Theorem 5 (Correctness and Computational Complexity of RG) If Tp ∈ T≥3 and the matrix of information distances D (between nodes in V ) is available, then RG outputs the true latent tree Tp correctly in time O(diam(Tp )m3 ). [sent-403, score-0.871]
42 Note that since we use the active set Y in the TestNodeRelationships procedure, the leaf nodes are defined with respect to Y , that is, a node is considered as a leaf node if it has only one neighbor in Y or in the set of nodes that have not yet been in an active set. [sent-419, score-1.0]
43 2 The new active set is the union of all nodes in the single-node subsets, a parent node, and a new hidden node Ynew = {1, 2, 3, h1 }. [sent-434, score-0.62]
44 In the second step, we complete the recovery of the latent tree by applying a distance-based latent tree reconstruction algorithm (such as RG or NJ) repeatedly on smaller subsets of nodes. [sent-453, score-1.136]
45 1, relate the Chow-Liu tree to the true latent tree in Section 5. [sent-455, score-0.827]
46 2, derive a simple transformation of the Chow-Liu tree to obtain the latent tree in Section 5. [sent-456, score-0.827]
47 Lemma 6 (Correspondence between TCL and MST) If pV is a Gaussian distribution or a symmetric discrete distribution, then the Chow-Liu tree in (16) reduces to the minimum spanning tree (MST) where the edge weights are the information distances di j , that is, TCL = MST(V ; D) := argminT ∑(i, j)∈T di j , where the minimum is taken over spanning trees T of V. [sent-473, score-1.091]
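Under the conditions of Lemma 6, the Chow-Liu step therefore reduces to an ordinary minimum spanning tree over the observed nodes with edge weights di j. A self-contained Kruskal sketch is given below; the assumptions that node labels are sortable and that D is indexable as D[i][j] are choices of this sketch.

    def mst_from_distances(D, nodes):
        # Kruskal's algorithm with a small union-find; nodes is a list.
        parent = {v: v for v in nodes}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]       # path halving
                v = parent[v]
            return v

        edges = sorted((D[i][j], i, j) for a, i in enumerate(nodes)
                       for j in nodes[a + 1:])
        tree = []
        for w, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj:                            # adding (i, j) creates no cycle
                parent[ri] = rj
                tree.append((i, j))
        return tree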
48 Definition 7 (Surrogate Node) Given the latent tree Tp = (W, E p ) and any node i ∈ W , the surrogate node of i with respect to V is defined as Sg(i; Tp ,V ) := argmin j∈V di j . [sent-492, score-1.176]
49 Intuitively, the surrogate node of a hidden node h ∈ H is an observed node j ∈ V that is most strongly correlated to h. [sent-493, score-0.869]
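In code the definition is a one-liner, assuming the full distance matrix (including distances involving hidden nodes) is known, which is the setting of this section; the function name is illustrative.

    def surrogates(D, W, V):
        # Sg(i) = argmin over j in V of d_ij, for every node i in W = V ∪ H.
        # An observed node is its own surrogate since d_ii = 0.
        return {i: min(V, key=lambda j: D[i][j]) for i in W}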
50 When the tree Tp and the observed vertex set V are understood from context, the surrogate node of h and the inverse surrogate set of i are abbreviated as Sg(h) and Sg−1 (i) respectively. [sent-497, score-0.714]
51 We now relate the original latent tree Tp = (W, E p ) to the Chow-Liu tree (also termed the MST) MST(V ; D) formed using the distance matrix D. [sent-498, score-0.859]
52 Lemma 8 (Properties of the MST) The MST in (17) and surrogate nodes satisfy the following properties: (i) The surrogate nodes of any two neighboring nodes in E p are neighbors in the MST, that is, for all i, j ∈ W with Sg(i) ≠ Sg( j), (i, j) ∈ E p ⇒ (Sg(i), Sg( j)) ∈ MST(V ; D). [sent-499, score-0.804]
53 More precisely, an edge-contraction operation on an edge ( j, h) ∈ V × H in the latent tree Tp is defined as the “shrinking” of ( j, h) to a single node whose label is the observed node j. [sent-507, score-1.012]
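A sketch of this shrinking operation on an adjacency-set representation of the tree (names and the toy example are illustrative):

    def contract_edge(adj, j, h):
        # Merge hidden node h into its observed neighbor j; the merged node
        # keeps the label j, as in the edge-contraction operation above.
        for nbr in adj.pop(h):
            if nbr == j:
                continue
            adj[nbr].discard(h)
            adj[nbr].add(j)
            adj[j].add(nbr)
        adj[j].discard(h)
        return adj

    # toy usage: contract the edge (5, h1) in a small latent tree
    adj = {5: {'h1'}, 'h1': {5, 'h2', 1}, 'h2': {'h1', 2}, 1: {'h1'}, 2: {'h2'}}
    contract_edge(adj, 5, 'h1')   # node 5 is now adjacent to 'h2' and 1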
54 However, this algorithm, called Chow-Liu Blind (or CLBlind), is applicable only to a subset of latent trees called blind latent tree-structured graphical models P (Tblind ). [sent-525, score-0.701]
55 4 to design the CLGrouping algorithm that produces the correct latent tree structure from the MST for all minimal latent tree models. [sent-527, score-1.149]
56 If p ∈ P (Tblind ), then its structure Tp = (W, E p ) and the distance matrix D satisfy the following properties: (i) The true latent tree Tp ∈ T≥3 and all the internal nodes are hidden, that is, V = Leaf(Tp ). [sent-528, score-0.68]
57 Theorem 9 (Correctness and Computational Complexity of CLBlind) If the distribution p ∈ P (Tblind ) is a blind tree-structured graphical model Markov on Tp and the matrix of distances D is known, then CLBlind outputs the true latent tree Tp correctly in time O(m2 log m). [sent-548, score-0.768]
58 If the first constraint is satisfied but not the second, then the blind transformation BlindTransform(MST(V ; D)) does not overestimate the number of hidden variables in the latent tree (the proof follows from Lemma 8 and is omitted). [sent-554, score-0.769]
59 4 Chow-Liu Grouping Algorithm Even though CLBlind is computationally efficient, it only succeeds in recovering latent trees for a restricted subclass of minimal latent trees. [sent-558, score-0.634]
60 The dotted lines denote surrogate mappings for the hidden nodes; for example, node 3 is the surrogate of h3 . [sent-573, score-0.74]
61 Theorem 10 (Correctness and Computational Complexity of CLRG) If Tp ∈ T≥3 is a minimal latent tree and the matrix of information distances D is available, then CLRG outputs the true latent tree Tp correctly in time O(m2 log m + |J|∆3 (MST(V ; D))). [sent-589, score-1.253]
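Structurally, CLGrouping is the loop sketched below over the internal nodes of the Chow-Liu tree / MST, with the inner latent-tree routine (RG or NJ) passed in as a callable; the interface and the toy stand-in are assumptions of this sketch, not taken from the paper.

    def cl_grouping(mst_edges, internal_nodes, learn_subtree):
        # learn_subtree(nodes) must return an edge list over `nodes` plus any
        # newly introduced (uniquely named) hidden nodes.
        edges = {frozenset(e) for e in mst_edges}
        for v in internal_nodes:
            nbrs = {u for e in edges if v in e for u in e if u != v}
            # replace the star around v by the latent subtree learned on the
            # closed neighborhood nbd[v; T] of the current tree
            edges -= {frozenset((v, u)) for u in nbrs}
            edges |= {frozenset(e) for e in learn_subtree(nbrs | {v})}
        return [tuple(e) for e in edges]

    # toy stand-in that hangs the whole neighborhood off one new hidden node
    toy = lambda nodes: [('h_' + '_'.join(map(str, sorted(nodes, key=str))), u)
                         for u in nodes]
    print(cl_grouping([(1, 2), (2, 3), (2, 4)], internal_nodes=[2],
                      learn_subtree=toy))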
62 For example, the Chow-Liu tree in Figure 5(b) is obtained from the latent tree Tp in Figure 5(a) by sequentially contracting all edges connecting an observed node to its inverse surrogate set (cf. [sent-616, score-1.187]
63 These newly introduced hidden nodes, in fact, turn out to be the inverse surrogate set of node 5, that is, Sg−1 (5) = {5, h1 , h2 }. [sent-620, score-0.637]
64 Therefore, for general (non-symmetric) discrete models, we compute MST(V ; D) (instead of the Chow-Liu tree TCL with edge weights I(Xi ; X j )), and apply RG or NJ to each internal node and its neighbors. [sent-636, score-0.624]
65 After learning the latent tree, if we find that there exists an edge (i, h) ∈ W × H with the estimated distance dih < l, then (i, h) is contracted to a single node whose label is i, that is, the hidden node h is removed and merged with node i. [sent-665, score-1.165]
66 Firstly, in the relaxed RG algorithm, we only compute Φi jk for those estimated distances di j , dik and d jk that are below a prescribed threshold τ > 0 since longer distance estimates are unreliable. [sent-688, score-0.682]
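In code, the relaxation amounts to restricting the triples over which Φ is computed; the threshold τ, the nested-dict representation of the estimated distances, and the assumption that node labels are orderable are all choices of this sketch.

    def reliable_phi(D, nodes, tau):
        # Form Phi_ijk only when all three pairwise distance estimates fall
        # below tau, since longer estimated distances are unreliable.
        phi = {}
        for i in nodes:
            for j in nodes:
                if i >= j or D[i][j] >= tau:
                    continue
                for k in nodes:
                    if k in (i, j) or D[i][k] >= tau or D[j][k] >= tau:
                        continue
                    phi[(i, j, k)] = D[i][k] - D[j][k]
        return phi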
67 Similarly, an observed node k is identified as a parent node if |dik + dk j − di j | < ε for all i and j in the same family. [sent-696, score-0.614]
68 If such an observed node does not exist for a family, then a new hidden node is introduced as the parent node for the group. [sent-697, score-0.817]
69 Similarly, for any other node k ∈ C (h), we compute dkh using all child nodes in C (h) and C (k) (if C (k) ≠ ∅) as follows: dkh = (1/|C (h)|) ∑i∈C (h) (dik − dih ) if k ∈ V , and dkh = (1/(|C (h)||C (k)|)) ∑(i, j)∈C (h)×C (k) (di j − dih − d jk ) otherwise. [sent-699, score-0.78]
70 NJ typically assumes that all observed nodes are at the leaves of the latent tree, so after learning the latent tree, we perform the post-processing step described in Section 6. [sent-728, score-0.788]
71 where T is a latent tree structure and κ(T ) is the number of free parameters, which grows linearly with the number of hidden variables because T is a tree. [sent-760, score-0.747]
72 After we compute the BIC scores for all subtrees corresponding to all internal nodes in the Chow-Liu tree, we choose the subtree that results in the highest BIC score and incorporate that subtree into the current tree model. [sent-765, score-0.63]
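The selection step can be sketched as follows; only the standard BIC trade-off (log-likelihood penalized by (κ(T)/2) log n) is taken from the text above, and the candidate interface is an assumption of this sketch.

    import math

    def bic(loglik, kappa, n):
        # BIC score: maximized log-likelihood minus (kappa / 2) * log n,
        # where kappa is the number of free parameters of the candidate tree.
        return loglik - 0.5 * kappa * math.log(n)

    def pick_best_subtree(candidates, n):
        # candidates: iterable of (subtree, loglik, kappa) triples, one per
        # internal node of the Chow-Liu tree; keep the highest-scoring one.
        return max(candidates, key=lambda c: bic(c[1], c[2], n))[0]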
73 Alternatively, if we wish to learn a latent tree with a given number of hidden nodes, we can use the BIC-based procedure mentioned in the previous paragraph to learn subtrees until the desired number of hidden nodes is introduced. [sent-771, score-1.046]
74 1 Simulations using Synthetic Data Sets In order to analyze the performances of different tree reconstruction algorithms, we generate samples from known latent tree structures with varying sample sizes and apply reconstruction algorithms. [sent-781, score-0.914]
75 The double-star has 2 hidden and 80 observed nodes, the HMM has 78 hidden and 80 observed nodes, and the 5-complete tree has 25 hidden and 81 observed nodes including the root node. [sent-793, score-1.046]
76 (iii) Error in the number of hidden variables: We compute the average number of hidden variables introduced by each method and plot the absolute difference between the average estimated hidden variables and the number of hidden variables in the true structure. [sent-803, score-0.667]
77 In general, given the same number of observed variables, a latent tree with more hidden variables or larger effective depth (see Section 2) is more difficult to recover. [sent-833, score-0.792]
78 Since the tree of a latent HMM is close to a chain, the Chow-Liu tree tends to be more accurate for the HMM than for the double star. [sent-856, score-0.827]
79 Based on the simulation results, we conclude that for a latent tree with a few hidden variables, RG is most accurate, and for a latent tree with a large diameter, CLNJ performs the best. [sent-868, score-1.255]
80 A latent tree with multiple layers of hidden variables is more difficult to recover correctly using any method, and CLNJ and CLRG outperform NJ and RG. [sent-869, score-0.724]
81 We apply our latent tree learning algorithms to model the dependency structure of monthly stock returns of 84 companies in the S&P 100 stock index. [sent-874, score-0.706]
82 NJ introduces more hidden variables than CLNJ and has lower log-likelihoods, which implies that starting from a Chow-Liu tree helps to get a better latent tree approximation. [sent-877, score-0.999]
83 6), the latent cluster model (LCM) (Lazarsfeld and Henry, 1968), and BIN, which is a greedy algorithm for learning latent trees (Harmeling and Williams, 2010). [sent-897, score-0.612]
84 Discussion and Conclusion In this paper, we proposed algorithms to learn a latent tree model from the information distances of observed variables. [sent-977, score-0.721]
85 Using simulations on synthetic data sets, we showed that RG performs well when the number of hidden variables is small, while CLGrouping performs significantly better than other algorithms when there are many hidden variables in the latent tree. [sent-981, score-0.621]
86 In addition, we introduced regularized CLGrouping, which can learn a latent tree approximation by trading off model complexity (number of hidden nodes) with data fidelity. [sent-984, score-0.703]
87 If: From the additive property of information distances in (12), if i is a leaf node and j is its parent, then dik = di j + d jk and thus Φi jk = di j for all k ≠ i, j. [sent-1007, score-1.025]
88 Since the new parent node added to a partition that does not contain an observed parent corresponds to a hidden node (in the original latent tree), a subforest of Tp is recovered at each iteration, and when |Y | ≤ 2, the entire latent tree is recovered. [sent-1020, score-0.819]
89 Figure 13: Shaded nodes indicate observed nodes and the rest indicate hidden nodes; panel (d) is the figure for the proof of Lemma 8(ii). [sent-1024, score-1.264]
90 3 Proof of Lemma 8: Properties of the MST (i) For an edge (i, j) ∈ E p such that Sg(i) ≠ Sg( j), let Vi\ j ⊂ V and V j\i ⊂ V denote observed nodes in the subtrees obtained by the removal of edge (i, j), where the former includes node i and excludes node j and vice versa (see Figure 13(c)). [sent-1030, score-0.688]
91 From (18), all the children of a hidden node h, except its surrogate Sg(h), are neighbors of its surrogate node Sg(h) in MST(V ; d). [sent-1054, score-0.761]
92 Moreover, these children of h which are not surrogates of any hidden nodes are leaf nodes in the MST. [sent-1055, score-0.617]
93 Let vr ∈ I be the internal node visited at iteration r, and let H r be all hidden nodes in the inverse surrogate set Sg−1 (vr ), that is, H r = Sg−1 (vr ) \ {vr }. [sent-1062, score-0.755]
94 Let Ar := nbd[vr ; T r−1 ], and hence Ar is the node set input to the recursive grouping routine at iteration r, and let RG(Ar , d) be the output latent tree learned by recursive grouping. [sent-1063, score-0.897]
95 Let EC(Tp ,V r ) be the tree constructed using edge contractions as follows: in the latent tree Tp , we contract edges corresponding to each node u ∈ V r and all hidden nodes in its inverse surrogate set Sg−1 (u). [sent-1075, score-1.5]
96 (33) In words, the latent tree after k iterations of CLGrouping can be constructed by contracting each surrogate node in Tp that has not been visited by CLGrouping with its inverse surrogate set. [sent-1093, score-0.973]
97 Base Step r = 0: The claim in (33) holds since V 0 = I and the input to the CLGrouping procedure is the Chow-Liu tree MST(V ; D), which is obtained by contracting all surrogate nodes and their inverse surrogate sets (see Section 5. [sent-1096, score-0.697]
98 By the definition of EC, if we contract edges with vr and the hidden nodes in its inverse surrogate set H r on the tree EC(Tp ,V r ), then we obtain EC(Tp ,V r−1 ), which is equivalent to T r−1 by the induction assumption. [sent-1104, score-0.766]
99 Since we assume that the true information distances are uniformly bounded, there exist τ > 0 and some sufficiently small λ > 0 so that if the estimates satisfy |Φ̂i jk − Φi jk | ≤ λ for all (i, j, k) ∈ J , then RG recovers the correct latent structure. [sent-1126, score-0.674]
100 In any iteration of the CLGrouping, the information distances satisfy di j ≤ γu, where γ, defined in (30), is the worst-case graph distance of any hidden node from its surrogate. [sent-1143, score-0.64]
wordName wordTfidf (topN-words)
[('tp', 0.309), ('rg', 0.307), ('mst', 0.284), ('latent', 0.277), ('tree', 0.275), ('clgrouping', 0.271), ('nodes', 0.192), ('clrg', 0.192), ('node', 0.191), ('hidden', 0.151), ('sg', 0.141), ('clnj', 0.141), ('di', 0.139), ('jk', 0.135), ('distances', 0.127), ('surrogate', 0.103), ('illsky', 0.101), ('nandkumar', 0.101), ('pv', 0.099), ('raphical', 0.096), ('ree', 0.096), ('atent', 0.096), ('bic', 0.096), ('dih', 0.096), ('nj', 0.094), ('hoi', 0.083), ('tan', 0.083), ('leaf', 0.082), ('dik', 0.077), ('tcl', 0.073), ('internal', 0.073), ('nbd', 0.067), ('ec', 0.066), ('lcm', 0.062), ('sibling', 0.06), ('grouping', 0.058), ('trees', 0.058), ('phylogenetic', 0.056), ('hmm', 0.052), ('xv', 0.052), ('parent', 0.051), ('clblind', 0.051), ('quartets', 0.051), ('ynew', 0.051), ('odels', 0.051), ('discrete', 0.049), ('recursive', 0.048), ('regclnj', 0.045), ('regclrg', 0.045), ('testnoderelationships', 0.045), ('vr', 0.045), ('subtree', 0.045), ('blind', 0.045), ('graphical', 0.044), ('observed', 0.042), ('star', 0.041), ('stock', 0.038), ('relaxed', 0.037), ('edge', 0.036), ('active', 0.035), ('pcl', 0.034), ('tblind', 0.034), ('reconstruction', 0.032), ('monthly', 0.032), ('distance', 0.032), ('firstly', 0.03), ('reconstructing', 0.03), ('sr', 0.03), ('ar', 0.029), ('crossover', 0.028), ('harmeling', 0.028), ('ko', 0.028), ('newsgroup', 0.028), ('robinson', 0.028), ('earning', 0.028), ('spanning', 0.026), ('anandkumar', 0.026), ('depth', 0.026), ('ml', 0.026), ('symmetric', 0.025), ('consistency', 0.025), ('contracting', 0.024), ('yold', 0.024), ('erd', 0.024), ('child', 0.024), ('structure', 0.023), ('samples', 0.023), ('companies', 0.023), ('dhl', 0.023), ('dkh', 0.023), ('foulds', 0.023), ('saitou', 0.023), ('structurally', 0.022), ('neighbors', 0.022), ('secondly', 0.022), ('minimal', 0.022), ('siblings', 0.022), ('chow', 0.022), ('variables', 0.021), ('correctness', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
2 0.13645746 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
3 0.13359809 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction
4 0.13164505 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
5 0.094713986 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O(m2 ntot log nmax /ε2 ) where m is the no. of training data points, nmax , ntot are the maximum no. of kernels in any group and the total no. of kernels respectively, and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability
6 0.088757373 35 jmlr-2011-Forest Density Estimation
7 0.086560637 104 jmlr-2011-X-Armed Bandits
8 0.077298492 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
9 0.075733013 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
10 0.072689131 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
11 0.068817757 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
12 0.065219767 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
13 0.058312118 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
14 0.05448696 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
15 0.053476024 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
16 0.052456122 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
17 0.05233562 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
18 0.051755484 16 jmlr-2011-Clustering Algorithms for Chains
20 0.04683847 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
topicId topicWeight
[(0, 0.243), (1, -0.138), (2, -0.019), (3, -0.021), (4, 0.052), (5, 0.051), (6, 0.046), (7, 0.078), (8, 0.241), (9, 0.362), (10, -0.088), (11, 0.141), (12, -0.161), (13, -0.057), (14, -0.029), (15, -0.11), (16, -0.079), (17, 0.049), (18, -0.02), (19, -0.051), (20, -0.126), (21, -0.002), (22, 0.075), (23, 0.003), (24, -0.025), (25, 0.087), (26, -0.071), (27, 0.024), (28, 0.114), (29, -0.078), (30, 0.124), (31, -0.081), (32, 0.033), (33, 0.001), (34, -0.092), (35, -0.09), (36, 0.03), (37, -0.037), (38, 0.016), (39, 0.012), (40, 0.093), (41, -0.065), (42, -0.003), (43, 0.009), (44, 0.044), (45, 0.007), (46, 0.001), (47, 0.019), (48, 0.062), (49, -0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.97744012 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
2 0.77129191 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu
3 0.47061431 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
Author: Piotr Zwiernik
Abstract: The standard Bayesian Information Criterion (BIC) is derived under regularity conditions which are not always satisfied in the case of graphical models with hidden variables. In this paper we derive the BIC for the binary graphical tree models where all the inner nodes of a tree represent binary hidden variables. This provides an extension of a similar formula given by Rusakov and Geiger for naive Bayes models. The main tool used in this paper is the connection between the growth behavior of marginal likelihood integrals and the real log-canonical threshold. Keywords: BIC, marginal likelihood, singular models, tree models, Bayesian networks, real log-canonical threshold
4 0.46553069 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
5 0.41895494 35 jmlr-2011-Forest Density Estimation
Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman
Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency
6 0.41262141 104 jmlr-2011-X-Armed Bandits
7 0.40461004 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
8 0.36349154 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
9 0.33488235 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
10 0.33394817 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
11 0.32835183 101 jmlr-2011-Variable Sparsity Kernel Learning
12 0.30165106 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
14 0.27720255 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
15 0.27416566 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
16 0.27134165 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
17 0.27109069 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
18 0.24332848 16 jmlr-2011-Clustering Algorithms for Chains
19 0.23655567 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
20 0.21967082 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
topicId topicWeight
[(4, 0.564), (9, 0.02), (10, 0.027), (24, 0.029), (31, 0.069), (32, 0.016), (41, 0.022), (60, 0.01), (71, 0.025), (73, 0.03), (78, 0.049), (90, 0.026), (94, 0.011)]
simIndex simValue paperId paperTitle
1 0.95160967 6 jmlr-2011-A Simpler Approach to Matrix Completion
Author: Benjamin Recht
Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing
same-paper 2 0.93805367 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
3 0.92919797 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
4 0.65794456 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
5 0.57672065 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
7 0.54551256 91 jmlr-2011-The Sample Complexity of Dictionary Learning
8 0.52438712 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
9 0.49823439 104 jmlr-2011-X-Armed Bandits
10 0.47292802 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
11 0.46375757 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
12 0.45536259 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
13 0.45324299 35 jmlr-2011-Forest Density Estimation
14 0.44696411 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
15 0.44498715 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
16 0.4427433 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
17 0.43344206 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
18 0.43213418 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
19 0.42924231 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
20 0.41955554 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing