jmlr jmlr2011 jmlr2011-54 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning
Reference: text
sentIndex sentText sentNum sentScore
1 Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139. Editor: Marina Meilă. Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. [sent-9, score-0.619]
2 Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. [sent-13, score-0.66]
3 This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. [sent-16, score-0.617]
4 We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. [sent-19, score-0.788]
5 Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning [sent-21, score-0.77]
6 There are three challenging problems in learning a model with latent variables: learning the number of latent variables; inferring the structure of how these latent variables relate to each other and to the observed variables; and estimating the parameters characterizing those relationships. [sent-27, score-0.917]
7 One class of models that has received considerable attention in the literature is the class of latent tree models, that is, graphical models Markov on trees, in which variables at some nodes represent the original (observed) variables of interest while others represent the latent variables. [sent-29, score-1.107]
8 If all variables are observed in the tree under consideration, then the well-known algorithm of Chow and Liu (1968) provides a tractable algorithm for performing maximum likelihood (ML) estimation of the tree structure. [sent-39, score-0.613]
9 This global step then identifies groups of observed nodes that are likely to be topologically close to each other in the true latent tree, thereby guiding subsequent recursive grouping or neighbor-joining (Saitou and Nei, 1987) computations. [sent-51, score-0.617]
10 For example, we can take any such latent model and add another hidden variable as a leaf node connected to only one other (hidden or observed) node. [sent-54, score-0.701]
11 (2006) considered the learning of directed latent models using so-called tetrad constraints, and there have also been attempts to tailor the learning of latent tree models in order to perform approximate inference accurately and efficiently downstream (Wang et al. [sent-69, score-0.829]
12 In contrast, we fix the state space of each hidden node, but allow the possibility that some observed nodes are internal nodes (non-leaves). [sent-72, score-0.65]
13 A tree is called a complete k-ary tree (or k-complete tree) if all its internal nodes have degree k and there exists one node (commonly referred to as the root node) that has exactly the same distance to all leaf nodes. [sent-75, score-1.12]
14 The reconstruction of latent trees has been studied extensively by the phylogenetic community where sequences of extant species are available and the unknown phylogenetic tree is to be inferred from these sequences. [sent-92, score-0.754]
15 The algorithm proceeds by recursively pairing two nodes that are the closest neighbors in the true latent tree and introducing a hidden node as the parent of the two nodes. [sent-101, score-1.159]
16 However, when we only have access to the samples at the observed nodes, it is not straightforward to construct a latent tree from a set of quartets since the quartets may not be consistent. [sent-110, score-0.719]
17 Here, we say that a set of quartets is consistent if there exists a latent tree such that all quartets agree with the tree. [sent-122, score-0.654]
18 In the subsequent two sections, we make two assumptions: firstly, that the true distribution is a latent tree and, secondly, that perfect knowledge of the information distances between observed variables is available. [sent-142, score-0.647]
19 We also discuss extensions of our algorithms for the case when the underlying model is not a tree and our goal is to learn an approximation to it using a latent tree model. [sent-146, score-0.827]
20 A latent tree is a tree with node set W := V ∪ H, the union of a set of observed nodes V (with m = |V |), and a set of latent (or hidden) nodes H. [sent-169, score-1.721]
21 samples drawn from a graphical model (distribution) p, Markov on a latent tree Tp = (W, E p ), where W = V ∪ H. [sent-206, score-0.619]
22 Our algorithms learn latent tree structures using the information distances (defined in Section 3) between pairs of observed variables, which can be estimated from samples. [sent-212, score-0.721]
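As a concrete illustration of this estimation step, the sketch below computes empirical information distances for jointly Gaussian observed variables; the excerpt defers the exact definition to Section 3, so the formula di j = −log |ρi j| and the function name used here are assumptions of this sketch rather than something taken from the paper.

    import numpy as np

    def gaussian_information_distances(X):
        # X: an (n x m) array holding n i.i.d. samples of the m observed variables.
        # Assumed definition for the Gaussian case: d_ij = -log |rho_ij|,
        # with rho_ij the empirical correlation coefficient.
        rho = np.corrcoef(X, rowvar=False)
        return -np.log(np.abs(rho) + 1e-12)   # small epsilon guards against log(0)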
23 Minimal Tree Extensions Our ultimate goal is to recover the graphical model p, that is, the latent tree structure and its parameters, given n i.i.d. samples of the observed variables. [sent-222, score-0.619]
24 However, in general, there can be multiple latent tree models which result in the same observed statistics, that is, the same joint distribution pV of the observed variables. [sent-226, score-0.636]
25 We consider the class of tree models where it is possible to recover the latent tree model uniquely and provide necessary conditions for structure identifiability, that is, the identifiability of the edge set E. [sent-227, score-0.886]
26 As an example, consider a discrete latent tree model with binary observed variables (K = 2). [sent-232, score-0.664]
27 A latent tree with the simplest structure (fewest nodes) is a tree in which all m observed binary variables are connected to one hidden variable. [sent-233, score-1.064]
28 The following conditions ensure that a latent tree does not include a redundant hidden node (Pearl, 1988): (C1) Each hidden variable has at least three neighbors (which can be either hidden or observed). [sent-244, score-1.218]
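A one-line check of condition (C1) can be written as follows; the adjacency representation and names are illustrative, and the paper's second condition, which is not reproduced in this excerpt, is not checked here.

    def satisfies_c1(adj, hidden_nodes):
        # adj maps each node of the latent tree to the set of its neighbors.
        # (C1): every hidden variable has at least three neighbors.
        return all(len(adj[h]) >= 3 for h in hidden_nodes)

    # example: one hidden node connected to three observed leaves satisfies (C1)
    adj = {'h': {1, 2, 3}, 1: {'h'}, 2: {'h'}, 3: {'h'}}
    print(satisfies_c1(adj, ['h']))   # True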
29 As illustrated in Figure 1, using marginalization operations, any non-minimal latent tree distribution can be reduced to a minimal latent tree model. [sent-256, score-1.126]
30 Definition 2 (Consistency) A latent tree reconstruction algorithm A is a map from the n observed samples xV to an estimated tree T n and an estimated tree-structured graphical model pn . [sent-264, score-0.968]
31 In the following sections, we design structurally and risk consistent algorithms for (minimal) Gaussian and symmetric discrete latent tree models, defined in (4). [sent-267, score-0.648]
32 To do so, for any three variables i, j, k ∈ V , we define Φi jk := dik − d jk to be the difference between the information distances dik and d jk . [sent-325, score-0.707]
33 Lemma 4 (Sibling Grouping) For distances di j for all i, j ∈ V on a tree T ∈ T≥3 , the following two properties on Φi jk = dik − d jk hold: (i) Φi jk = di j for all k ∈ V \ {i, j} if and only if i is a leaf node and j is its parent; (ii) −di j < Φi jk = Φi jk′ < di j for all k, k′ ∈ V \ {i, j} if and only if both i and j are leaf nodes and they have the same parent, that is, they belong to the same sibling group. [sent-327, score-1.435]
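A minimal sketch of the pairwise test behind the TestNodeRelationships procedure, assuming exact distances; the function name, tolerance, and return labels are illustrative only.

    import numpy as np

    def classify_pair(D, i, j, active, tol=1e-9):
        # D: symmetric matrix (or nested dict) of information distances.
        others = [k for k in active if k not in (i, j)]
        if not others:
            return 'undetermined'
        phi = np.array([D[i][k] - D[j][k] for k in others])   # Phi_ijk over k
        dij = D[i][j]
        if np.all(np.abs(phi - dij) < tol):
            return 'j is parent of leaf i'      # property (i)
        if np.all(np.abs(phi + dij) < tol):
            return 'i is parent of leaf j'      # property (i) with roles swapped
        if np.ptp(phi) < tol and -dij < phi[0] < dij:
            return 'leaf siblings'              # property (ii)
        return 'neither'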
34 In the following section, we describe a recursive algorithm that is based on the above TestNodeRelationships procedure to reconstruct the entire latent tree model, assuming that the true model is a latent tree and that the true distance matrix D = [di j ] is known. [sent-348, score-1.184]
35 Recursive Grouping Algorithm Given Information Distances This section is devoted to the development of the first algorithm for reconstructing latent tree models, recursive grouping (RG). [sent-354, score-0.688]
36 This newly introduced parent node corresponds to a hidden node in the original unknown latent tree. [sent-359, score-0.861]
37 We now describe how to compute the information distances in Step 6 for each new hidden node h ∈ Y and all other active nodes l ∈ Y . [sent-394, score-0.696]
38 From Lemma 4 and Proposition 3, we have that dih − d jh = dik − d jk = Φi jk and dih + d jh = di j , from which we can recover the information distances between a previously active node i ∈ Yold and its new hidden parent h ∈ Y as follows: dih = (1/2)(di j + Φi jk ). (13) [sent-396, score-1.603]
39 For any other active node l ∈ Y , we can compute dhl using a child node i ∈ C (h) as follows: dhl = dil − dih if l ∈ Yold , and dhl = dik − dih − dlk otherwise, where k ∈ C (l). (14) [sent-397, score-0.756]
40 Using Equations (13) and (14), we can infer all the information distances dhl between a newly introduced hidden node h and all other active nodes l ∈ Y . [sent-398, score-0.719]
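These updates can be written compactly as below; D is assumed to be a dictionary keyed by node pairs with both orderings stored, and is_old / child_of are illustrative stand-ins for the bookkeeping RG keeps about the active set, not an interface taken from the paper.

    def add_hidden_distances(D, h, i, j, k, active, is_old, child_of):
        # i, j: two children of the new hidden node h; k: any other observed node.
        phi = D[i, k] - D[j, k]                     # Phi_ijk
        D[i, h] = D[h, i] = 0.5 * (D[i, j] + phi)   # Eq. (13)
        D[j, h] = D[h, j] = D[i, j] - D[i, h]       # since d_ih + d_jh = d_ij
        for l in active:
            if l in (i, j, h):
                continue
            if is_old(l):                           # l was active before this step
                d = D[i, l] - D[i, h]
            else:                                   # l is itself a new hidden node
                c = child_of(l)                     # any child c in C(l)
                d = D[i, c] - D[i, h] - D[l, c]
            D[h, l] = D[l, h] = d                   # Eq. (14)
        return D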
41 Theorem 5 (Correctness and Computational Complexity of RG) If Tp ∈ T≥3 and the matrix of information distances D (between nodes in V ) is available, then RG outputs the true latent tree Tp correctly in time O(diam(Tp )m3 ). [sent-403, score-0.871]
42 Note that since we use the active set Y in the TestNodeRelationships procedure, the leaf nodes are defined with respect to Y , that is, a node is considered as a leaf node if it has only one neighbor in Y or in the set of nodes that have not yet been in an active set. [sent-419, score-1.0]
43 2 The new active set is the union of all nodes in the single-node subsets, a parent node, and a new hidden node Ynew = {1, 2, 3, h1 }. [sent-434, score-0.62]
44 In the second step, we complete the recovery of the latent tree by applying a distance-based latent tree reconstruction algorithm (such as RG or NJ) repeatedly on smaller subsets of nodes. [sent-453, score-1.136]
45 1, relate the Chow-Liu tree to the true latent tree in Section 5. [sent-455, score-0.827]
46 2, derive a simple transformation of the Chow-Liu tree to obtain the latent tree in Section 5. [sent-456, score-0.827]
47 Lemma 6 (Correspondence between TCL and MST) If pV is a Gaussian distribution or a symmetric discrete distribution, then the Chow-Liu tree in (16) reduces to the minimum spanning tree (MST) where the edge weights are the information distances di j , that is, TCL = MST(V ; D) := argminT ∑(i, j)∈T di j , where the minimum is taken over spanning trees T of V. [sent-473, score-1.091]
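Under the conditions of Lemma 6, the Chow-Liu step therefore reduces to an ordinary minimum spanning tree over the observed nodes with edge weights di j. A self-contained Kruskal sketch is given below; the assumptions that node labels are sortable and that D is indexable as D[i][j] are choices of this sketch.

    def mst_from_distances(D, nodes):
        # Kruskal's algorithm with a small union-find; nodes is a list.
        parent = {v: v for v in nodes}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]       # path halving
                v = parent[v]
            return v

        edges = sorted((D[i][j], i, j) for a, i in enumerate(nodes)
                       for j in nodes[a + 1:])
        tree = []
        for w, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj:                            # adding (i, j) creates no cycle
                parent[ri] = rj
                tree.append((i, j))
        return tree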
48 Definition 7 (Surrogate Node) Given the latent tree Tp = (W, E p ) and any node i ∈ W , the surrogate node of i with respect to V is defined as Sg(i; Tp ,V ) := argmin j∈V di j . [sent-492, score-1.176]
49 Intuitively, the surrogate node of a hidden node h ∈ H is an observed node j ∈ V that is most strongly correlated to h. [sent-493, score-0.869]
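In code the definition is a one-liner, assuming the full distance matrix (including distances involving hidden nodes) is known, which is the setting of this section; the function name is illustrative.

    def surrogates(D, W, V):
        # Sg(i) = argmin over j in V of d_ij, for every node i in W = V ∪ H.
        # An observed node is its own surrogate since d_ii = 0.
        return {i: min(V, key=lambda j: D[i][j]) for i in W}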
50 When the tree Tp and the observed vertex set V are understood from context, the surrogate node of h and the inverse surrogate set of i are abbreviated as Sg(h) and Sg−1 (i) respectively. [sent-497, score-0.714]
51 We now relate the original latent tree Tp = (W, E p ) to the Chow-Liu tree (also termed the MST) MST(V ; D) formed using the distance matrix D. [sent-498, score-0.859]
52 Lemma 8 (Properties of the MST) The MST in (17) and surrogate nodes satisfy the following properties: (i) The surrogate nodes of any two neighboring nodes in E p are neighbors in the MST, that is, for all i, j ∈ W with Sg(i) ≠ Sg( j), (i, j) ∈ E p ⇒ (Sg(i), Sg( j)) ∈ MST(V ; D). [sent-499, score-0.804]
53 More precisely, an edge-contraction operation on an edge ( j, h) ∈ V × H in the latent tree Tp is defined as the “shrinking” of ( j, h) to a single node whose label is the observed node j. [sent-507, score-1.012]
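A sketch of this shrinking operation on an adjacency-set representation of the tree (names and the toy example are illustrative):

    def contract_edge(adj, j, h):
        # Merge hidden node h into its observed neighbor j; the merged node
        # keeps the label j, as in the edge-contraction operation above.
        for nbr in adj.pop(h):
            if nbr == j:
                continue
            adj[nbr].discard(h)
            adj[nbr].add(j)
            adj[j].add(nbr)
        adj[j].discard(h)
        return adj

    # toy usage: contract the edge (5, h1) in a small latent tree
    adj = {5: {'h1'}, 'h1': {5, 'h2', 1}, 'h2': {'h1', 2}, 1: {'h1'}, 2: {'h2'}}
    contract_edge(adj, 5, 'h1')   # node 5 is now adjacent to 'h2' and 1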
54 However, this algorithm, called Chow-Liu Blind (or CLBlind), is applicable only to a subset of latent trees called blind latent tree-structured graphical models P (Tblind ). [sent-525, score-0.701]
55 4 to design the CLGrouping algorithm that produces the correct latent tree structure from the MST for all minimal latent tree models. [sent-527, score-1.149]
56 If p ∈ P (Tblind ), then its structure Tp = (W, E p ) and the distance matrix D satisfy the following properties: (i) The true latent tree Tp ∈ T≥3 and all the internal nodes are hidden, that is, V = Leaf(Tp ). [sent-528, score-0.68]
57 Theorem 9 (Correctness and Computational Complexity of CLBlind) If the distribution p ∈ P (Tblind ) is a blind tree-structured graphical model Markov on Tp and the matrix of distances D is known, then CLBlind outputs the true latent tree Tp correctly in time O(m2 log m). [sent-548, score-0.768]
58 If the first constraint is satisfied but not the second, then the blind transformation BlindTransform(MST(V ; D)) does not overestimate the number of hidden variables in the latent tree (the proof follows from Lemma 8 and is omitted). [sent-554, score-0.769]
59 4 Chow-Liu Grouping Algorithm Even though CLBlind is computationally efficient, it only succeeds in recovering latent trees for a restricted subclass of minimal latent trees. [sent-558, score-0.634]
60 The dotted lines denote surrogate mappings for the hidden nodes; for example, node 3 is the surrogate of h3 . [sent-573, score-0.74]
61 Theorem 10 (Correctness and Computational Complexity of CLRG) If Tp ∈ T≥3 is a minimal latent tree and the matrix of information distances D is available, then CLRG outputs the true latent tree Tp correctly in time O(m2 log m + |J|∆3 (MST(V ; D))). [sent-589, score-1.253]
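Structurally, CLGrouping is the loop sketched below over the internal nodes of the Chow-Liu tree / MST, with the inner latent-tree routine (RG or NJ) passed in as a callable; the interface and the toy stand-in are assumptions of this sketch, not taken from the paper.

    def cl_grouping(mst_edges, internal_nodes, learn_subtree):
        # learn_subtree(nodes) must return an edge list over `nodes` plus any
        # newly introduced (uniquely named) hidden nodes.
        edges = {frozenset(e) for e in mst_edges}
        for v in internal_nodes:
            nbrs = {u for e in edges if v in e for u in e if u != v}
            # replace the star around v by the latent subtree learned on the
            # closed neighborhood nbd[v; T] of the current tree
            edges -= {frozenset((v, u)) for u in nbrs}
            edges |= {frozenset(e) for e in learn_subtree(nbrs | {v})}
        return [tuple(e) for e in edges]

    # toy stand-in that hangs the whole neighborhood off one new hidden node
    toy = lambda nodes: [('h_' + '_'.join(map(str, sorted(nodes, key=str))), u)
                         for u in nodes]
    print(cl_grouping([(1, 2), (2, 3), (2, 4)], internal_nodes=[2],
                      learn_subtree=toy))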
62 For example, the Chow-Liu tree in Figure 5(b) is obtained from the latent tree Tp in Figure 5(a) by sequentially contracting all edges connecting an observed node to its inverse surrogate set (cf. [sent-616, score-1.187]
63 These newly introduced hidden nodes, in fact, turn out to be the inverse surrogate set of node 5, that is, Sg−1 (5) = {5, h1 , h2 }. [sent-620, score-0.637]
64 Therefore, for general (non-symmetric) discrete models, we compute MST(V ; D) (instead of the Chow-Liu tree TCL with edge weights I(Xi ; X j )), and apply RG or NJ to each internal node and its neighbors. [sent-636, score-0.624]
65 After learning the latent tree, if we find that there exists an edge (i, h) ∈ W × H with the estimated distance dih < l, then (i, h) is contracted to a single node whose label is i, that is, the hidden node h is removed and merged with node i. [sent-665, score-1.165]
66 Firstly, in the relaxed RG algorithm, we only compute Φi jk for those estimated distances di j , dik and d jk that are below a prescribed threshold τ > 0 since longer distance estimates are unreliable. [sent-688, score-0.682]
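In code, the relaxation amounts to restricting the triples over which Φ is computed; the threshold τ, the nested-dict representation of the estimated distances, and the assumption that node labels are orderable are all choices of this sketch.

    def reliable_phi(D, nodes, tau):
        # Form Phi_ijk only when all three pairwise distance estimates fall
        # below tau, since longer estimated distances are unreliable.
        phi = {}
        for i in nodes:
            for j in nodes:
                if i >= j or D[i][j] >= tau:
                    continue
                for k in nodes:
                    if k in (i, j) or D[i][k] >= tau or D[j][k] >= tau:
                        continue
                    phi[(i, j, k)] = D[i][k] - D[j][k]
        return phi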
67 Similarly, an observed node k is identified as a parent node if |dik + dk j − di j | < ε for all i and j in the same family. [sent-696, score-0.614]
68 If such an observed node does not exist for a family, then a new hidden node is introduced as the parent node for the group. [sent-697, score-0.817]
69 Similarly, for any other node k ∈ C (h), we compute dkh using all child nodes in C (h) and C (k) (if C (k) ≠ ∅) as follows: dkh = (1/|C (h)|) ∑i∈C (h) (dik − dih ) if k ∈ V , and dkh = (1/(|C (h)||C (k)|)) ∑(i, j)∈C (h)×C (k) (di j − dih − d jk ) otherwise. [sent-699, score-0.78]
70 NJ typically assumes that all observed nodes are at the leaves of the latent tree, so after learning the latent tree, we perform the post-processing step described in Section 6. [sent-728, score-0.788]
71 where T is a latent tree structure and κ(T ) is the number of free parameters, which grows linearly with the number of hidden variables because T is a tree. [sent-760, score-0.747]
72 After we compute the BIC scores for all subtrees corresponding to all internal nodes in the Chow-Liu tree, we choose the subtree that results in the highest BIC score and incorporate that subtree into the current tree model. [sent-765, score-0.63]
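The selection step can be sketched as follows; only the standard BIC trade-off (log-likelihood penalized by (κ(T)/2) log n) is taken from the text above, and the candidate interface is an assumption of this sketch.

    import math

    def bic(loglik, kappa, n):
        # BIC score: maximized log-likelihood minus (kappa / 2) * log n,
        # where kappa is the number of free parameters of the candidate tree.
        return loglik - 0.5 * kappa * math.log(n)

    def pick_best_subtree(candidates, n):
        # candidates: iterable of (subtree, loglik, kappa) triples, one per
        # internal node of the Chow-Liu tree; keep the highest-scoring one.
        return max(candidates, key=lambda c: bic(c[1], c[2], n))[0]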
73 Alternatively, if we wish to learn a latent tree with a given number of hidden nodes, we can use the BIC-based procedure mentioned in the previous paragraph to learn subtrees until the desired number of hidden nodes is introduced. [sent-771, score-1.046]
74 1 Simulations using Synthetic Data Sets In order to analyze the performances of different tree reconstruction algorithms, we generate samples from known latent tree structures with varying sample sizes and apply reconstruction algorithms. [sent-781, score-0.914]
75 The double-star has 2 hidden and 80 observed nodes, the HMM has 78 hidden and 80 observed nodes, and the 5-complete tree has 25 hidden and 81 observed nodes including the root node. [sent-793, score-1.046]
76 (iii) Error in the number of hidden variables: We compute the average number of hidden variables introduced by each method and plot the absolute difference between the average estimated hidden variables and the number of hidden variables in the true structure. [sent-803, score-0.667]
77 In general, given the same number of observed variables, a latent tree with more hidden variables or larger effective depth (see Section 2) is more difficult to recover. [sent-833, score-0.792]
78 Since the tree of a latent HMM is close to a chain, the Chow-Liu tree tends to be more accurate for the HMM than for the double star. [sent-856, score-0.827]
79 Based on the simulation results, we conclude that for a latent tree with a few hidden variables, RG is most accurate, and for a latent tree with a large diameter, CLNJ performs the best. [sent-868, score-1.255]
80 A latent tree with multiple layers of hidden variables is more difficult to recover correctly using any method, and CLNJ and CLRG outperform NJ and RG. [sent-869, score-0.724]
81 We apply our latent tree learning algorithms to model the dependency structure of monthly stock returns of 84 companies in the S&P 100 stock index. [sent-874, score-0.706]
82 NJ introduces more hidden variables than CLNJ and has lower log-likelihoods, which implies that starting from a Chow-Liu tree helps to get a better latent tree approximation. [sent-877, score-0.999]
83 6), the latent cluster model (LCM) (Lazarsfeld and Henry, 1968), and BIN, which is a greedy algorithm for learning latent trees (Harmeling and Williams, 2010). [sent-897, score-0.612]
84 Discussion and Conclusion In this paper, we proposed algorithms to learn a latent tree model from the information distances of observed variables. [sent-977, score-0.721]
85 Using simulations on synthetic data sets, we showed that RG performs well when the number of hidden variables is small, while CLGrouping performs significantly better than other algorithms when there are many hidden variables in the latent tree. [sent-981, score-0.621]
86 In addition, we introduced regularized CLGrouping, which can learn a latent tree approximation by trading off model complexity (number of hidden nodes) with data fidelity. [sent-984, score-0.703]
87 If: From the additive property of information distances in (12), if i is a leaf node and j is its parent, then dik = di j + d jk and thus Φi jk = di j for all k ≠ i, j. [sent-1007, score-1.025]
88 Since the new parent node added to a partition that does not contain an observed parent corresponds to a hidden node (in the original latent tree), a subforest of Tp is recovered at each iteration, and when |Y | ≤ 2, the entire latent tree is recovered. [sent-1020, score-0.819]
89 Figure 13: Shaded nodes indicate observed nodes and the rest indicate hidden nodes; panel (d) is the figure for the proof of Lemma 8(ii). [sent-1024, score-1.264]
90 3 Proof of Lemma 8: Properties of the MST (i) For an edge (i, j) ∈ E p such that Sg(i) ≠ Sg( j), let Vi\ j ⊂ V and V j\i ⊂ V denote observed nodes in the subtrees obtained by the removal of edge (i, j), where the former includes node i and excludes node j and vice versa (see Figure 13(c)). [sent-1030, score-0.688]
91 From (18), all the children of a hidden node h, except its surrogate Sg(h), are neighbors of its surrogate node Sg(h) in MST(V ; d). [sent-1054, score-0.761]
92 Moreover, these children of h which are not surrogates of any hidden nodes are leaf nodes in the MST. [sent-1055, score-0.617]
93 Let vr ∈ I be the internal node visited at iteration r, and let H r be all hidden nodes in the inverse surrogate set Sg−1 (vr ), that is, H r = Sg−1 (vr ) \ {vr }. [sent-1062, score-0.755]
94 Let Ar := nbd[vr ; T r−1 ], and hence Ar is the node set input to the recursive grouping routine at iteration r, and let RG(Ar , d) be the output latent tree learned by recursive grouping. [sent-1063, score-0.897]
95 Let EC(Tp ,V r ) be the tree constructed using edge contractions as follows: in the latent tree Tp , we contract edges corresponding to each node u ∈ V r and all hidden nodes in its inverse surrogate set Sg−1 (u). [sent-1075, score-1.5]
96 (33) In words, the latent tree after k iterations of CLGrouping can be constructed by contracting each surrogate node in Tp that has not been visited by CLGrouping with its inverse surrogate set. [sent-1093, score-0.973]
97 Base Step r = 0: The claim in (33) holds since V 0 = I and the input to the CLGrouping procedure is the Chow-Liu tree MST(V ; D), which is obtained by contracting all surrogate nodes and their inverse surrogate sets (see Section 5. [sent-1096, score-0.697]
98 By the definition of EC, if we contract edges with vr and the hidden nodes in its inverse surrogate set H r on the tree EC(Tp ,V r ), then we obtain EC(Tp ,V r−1 ), which is equivalent to T r−1 by the induction assumption. [sent-1104, score-0.766]
99 Since we assume that the true information distances are uniformly bounded, there exist τ > 0 and some sufficiently small λ > 0 so that if the estimates satisfy |Φ̂i jk − Φi jk | ≤ λ for all (i, j, k) ∈ J , then RG recovers the correct latent structure. [sent-1126, score-0.674]
100 In any iteration of the CLGrouping, the information distances satisfy di j ≤ γu, where γ, defined in (30), is the worst-case graph distance of any hidden node from its surrogate. [sent-1143, score-0.64]
wordName wordTfidf (topN-words)
[('tp', 0.309), ('rg', 0.307), ('mst', 0.284), ('latent', 0.277), ('tree', 0.275), ('clgrouping', 0.271), ('nodes', 0.192), ('clrg', 0.192), ('node', 0.191), ('hidden', 0.151), ('sg', 0.141), ('clnj', 0.141), ('di', 0.139), ('jk', 0.135), ('distances', 0.127), ('surrogate', 0.103), ('illsky', 0.101), ('nandkumar', 0.101), ('pv', 0.099), ('raphical', 0.096), ('ree', 0.096), ('atent', 0.096), ('bic', 0.096), ('dih', 0.096), ('nj', 0.094), ('hoi', 0.083), ('tan', 0.083), ('leaf', 0.082), ('dik', 0.077), ('tcl', 0.073), ('internal', 0.073), ('nbd', 0.067), ('ec', 0.066), ('lcm', 0.062), ('sibling', 0.06), ('grouping', 0.058), ('trees', 0.058), ('phylogenetic', 0.056), ('hmm', 0.052), ('xv', 0.052), ('parent', 0.051), ('clblind', 0.051), ('quartets', 0.051), ('ynew', 0.051), ('odels', 0.051), ('discrete', 0.049), ('recursive', 0.048), ('regclnj', 0.045), ('regclrg', 0.045), ('testnoderelationships', 0.045), ('vr', 0.045), ('subtree', 0.045), ('blind', 0.045), ('graphical', 0.044), ('observed', 0.042), ('star', 0.041), ('stock', 0.038), ('relaxed', 0.037), ('edge', 0.036), ('active', 0.035), ('pcl', 0.034), ('tblind', 0.034), ('reconstruction', 0.032), ('monthly', 0.032), ('distance', 0.032), ('firstly', 0.03), ('reconstructing', 0.03), ('sr', 0.03), ('ar', 0.029), ('crossover', 0.028), ('harmeling', 0.028), ('ko', 0.028), ('newsgroup', 0.028), ('robinson', 0.028), ('earning', 0.028), ('spanning', 0.026), ('anandkumar', 0.026), ('depth', 0.026), ('ml', 0.026), ('symmetric', 0.025), ('consistency', 0.025), ('contracting', 0.024), ('yold', 0.024), ('erd', 0.024), ('child', 0.024), ('structure', 0.023), ('samples', 0.023), ('companies', 0.023), ('dhl', 0.023), ('dkh', 0.023), ('foulds', 0.023), ('saitou', 0.023), ('structurally', 0.022), ('neighbors', 0.022), ('secondly', 0.022), ('minimal', 0.022), ('siblings', 0.022), ('chow', 0.022), ('variables', 0.021), ('correctness', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
2 0.13645746 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
3 0.13359809 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction
4 0.13164505 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
5 0.094713986 101 jmlr-2011-Variable Sparsity Kernel Learning
Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman
Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls , s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels—hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over product of simplices and has a computational complexity of O(m2 ntot log nmax /ε2 ) where m is the no. of training data points, nmax , ntot are the maximum no. of kernels in any group and the total no. of kernels respectively, and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability
6 0.088757373 35 jmlr-2011-Forest Density Estimation
7 0.086560637 104 jmlr-2011-X-Armed Bandits
8 0.077298492 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
9 0.075733013 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
10 0.072689131 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
11 0.068817757 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
12 0.065219767 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
13 0.058312118 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
14 0.05448696 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
15 0.053476024 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
16 0.052456122 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
17 0.05233562 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
18 0.051755484 16 jmlr-2011-Clustering Algorithms for Chains
20 0.04683847 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
topicId topicWeight
[(0, 0.243), (1, -0.138), (2, -0.019), (3, -0.021), (4, 0.052), (5, 0.051), (6, 0.046), (7, 0.078), (8, 0.241), (9, 0.362), (10, -0.088), (11, 0.141), (12, -0.161), (13, -0.057), (14, -0.029), (15, -0.11), (16, -0.079), (17, 0.049), (18, -0.02), (19, -0.051), (20, -0.126), (21, -0.002), (22, 0.075), (23, 0.003), (24, -0.025), (25, 0.087), (26, -0.071), (27, 0.024), (28, 0.114), (29, -0.078), (30, 0.124), (31, -0.081), (32, 0.033), (33, 0.001), (34, -0.092), (35, -0.09), (36, 0.03), (37, -0.037), (38, 0.016), (39, 0.012), (40, 0.093), (41, -0.065), (42, -0.003), (43, 0.009), (44, 0.044), (45, 0.007), (46, 0.001), (47, 0.019), (48, 0.062), (49, -0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.97744012 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
2 0.77129191 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu
3 0.47061431 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
Author: Piotr Zwiernik
Abstract: The standard Bayesian Information Criterion (BIC) is derived under regularity conditions which are not always satisfied in the case of graphical models with hidden variables. In this paper we derive the BIC for the binary graphical tree models where all the inner nodes of a tree represent binary hidden variables. This provides an extension of a similar formula given by Rusakov and Geiger for naive Bayes models. The main tool used in this paper is the connection between the growth behavior of marginal likelihood integrals and the real log-canonical threshold. Keywords: BIC, marginal likelihood, singular models, tree models, Bayesian networks, real log-canonical threshold
4 0.46553069 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
5 0.41895494 35 jmlr-2011-Forest Density Estimation
Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman
Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency
6 0.41262141 104 jmlr-2011-X-Armed Bandits
7 0.40461004 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
8 0.36349154 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
9 0.33488235 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
10 0.33394817 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
11 0.32835183 101 jmlr-2011-Variable Sparsity Kernel Learning
12 0.30165106 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
14 0.27720255 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
15 0.27416566 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
16 0.27134165 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
17 0.27109069 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
18 0.24332848 16 jmlr-2011-Clustering Algorithms for Chains
19 0.23655567 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
20 0.21967082 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
topicId topicWeight
[(4, 0.564), (9, 0.02), (10, 0.027), (24, 0.029), (31, 0.069), (32, 0.016), (41, 0.022), (60, 0.01), (71, 0.025), (73, 0.03), (78, 0.049), (90, 0.026), (94, 0.011)]
simIndex simValue paperId paperTitle
1 0.95160967 6 jmlr-2011-A Simpler Approach to Matrix Completion
Author: Benjamin Recht
Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing
same-paper 2 0.93805367 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
3 0.92919797 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
4 0.65794456 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
5 0.57672065 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
7 0.54551256 91 jmlr-2011-The Sample Complexity of Dictionary Learning
8 0.52438712 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
9 0.49823439 104 jmlr-2011-X-Armed Bandits
10 0.47292802 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
11 0.46375757 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
12 0.45536259 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
13 0.45324299 35 jmlr-2011-Forest Density Estimation
14 0.44696411 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
15 0.44498715 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
16 0.4427433 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
17 0.43344206 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
18 0.43213418 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
19 0.42924231 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
20 0.41955554 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing