jmlr jmlr2011 jmlr2011-53 knowledge-graph by maker-knowledge-mining

53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates


Source: pdf

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. [sent-13, score-0.284]

2 Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types. [sent-18, score-0.328]

3 For learning the forest structure, the ML (Chow-Liu) algorithm does not produce a consistent estimate since ML favors richer model classes and hence outputs a tree in general. [sent-39, score-0.277]

4 We provide tight bounds on the overestimation and underestimation errors, that is, the error probability that the output of the algorithm has more or fewer edges than the true model. [sent-41, score-0.597]

5 Secondly, we prove that CLThres is risk consistent, meaning that the risk under the estimated model converges to the risk of the forest projection of the underlying distribution, which may not be a forest. [sent-45, score-0.406]

6 We show that the error rate is, in fact, dominated by the rate of decay of the overestimation error probability. [sent-51, score-0.422]

7 We provide an upper bound on the error probability by using convex duality to find a surprising connection between the overestimation error rate and a semidefinite program. [sent-54, score-0.343]

8 We use the term proper forest to denote the set of disconnected, acyclic graphs. [sent-56, score-0.241]

9 The forest projection is the forest-structured graphical model that is closest in the KL-divergence sense to the true distribution. [sent-58, score-0.3]

10 The overestimation error probability is the probability that the number of edges learned exceeds the true number of edges. [sent-61, score-0.391]

11 This connection between the overestimation error rate and a semidefinite program (Vandenberghe and Boyd, 1996) shows that the overestimation error in structure learning decays faster than any polynomial in n for a fixed data dimension d. [sent-63, score-0.386]

12 Thus, the empty graph and connected trees are the extremal forest structures for learning. [sent-70, score-0.297]

13 That is, the risk of the estimated forest distribution converges to the risk of the forest projection of the true model at a rate of O_p(d log d / n^{1−γ}) for any γ > 0. [sent-73, score-0.685]

14 Finally, we use CLThres to learn forest-structured distributions given synthetic and real-world data sets and show that in the finite-sample case, there exists an inevitable trade-off between the underestimation and overestimation errors. [sent-77, score-0.437]

15 The error exponent is a quantitative measure of performance of the learning algorithm since a larger exponent implies a faster decay of the error probability. [sent-97, score-0.279]

16 However, the analysis does not readily extend to learning forest models and, furthermore, it was for the scenario where the number of variables d does not grow with the number of samples n. [sent-98, score-0.268]

17 (2011) derived consistency (and sparsistency) guarantees for learning tree and forest models. [sent-102, score-0.318]

18 We build on some of these ideas and proof techniques to identify the correct set of edges (and in particular the number of edges) in the forest model and also to provide strong theoretical guarantees of the rate of convergence of the estimated forest-structured distribution to the true one. [sent-117, score-0.394]

19 Let the set of labeled trees (connected, acyclic graphs) with d nodes be T^d and let the set of forests (acyclic graphs) with k edges and d nodes be T_k^d for 0 ≤ k ≤ d − 1. [sent-132, score-0.258]

20 We also use the notation D(T_k^d) ⊂ P(X^d) to denote the set of d-variate distributions Markov on a forest with k edges. [sent-149, score-0.241]

21 Note from our minimality assumption that Imin > 0 since all edges in the forest have positive mutual information (none of the edges are degenerate). [sent-159, score-0.506]

22 Estimate the true number of edges using the thresholding estimator: k_n := argmin_j { I(P_{e_j}) : I(P_{e_j}) ≥ ε_n , I(P_{e_{j+1}}) ≤ ε_n }. [sent-201, score-0.335]

23 Prune the tree by retaining only the top k_n edges, that is, define the estimated edge set of the forest to be E_{k_n} := {e_1, . . . , e_{k_n}}. [sent-204, score-0.549]

24 Define the estimated forest to be T_{k_n} := (V, E_{k_n}). [sent-208, score-0.241]

25 The estimated distribution is the product of the empirical node marginals over i ∈ V and the edge-wise ratios P_{i,j}(x_i, x_j)/(P_i(x_i) P_j(x_j)) over (i, j) ∈ E_{k_n}. Intuitively, CLThres first constructs a connected tree (V, E_{d−1}) via Chow-Liu (in Steps 1–3) before pruning the weak edges (with small mutual information) to obtain the final structure E_{k_n}. [sent-216, score-0.486]
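A minimal Python sketch of the CLThres-style pipeline described in sentences 22–25 is given below: it computes empirical pairwise mutual informations, builds the Chow-Liu (maximum-weight spanning) tree with Kruskal's algorithm, estimates the number of edges by thresholding the sorted mutual informations at ε_n, and retains only the top edges. The function names and the specific threshold choice ε_n = n^(−β) are illustrative assumptions for this sketch, not the paper's exact specification.

```python
import numpy as np
from itertools import combinations

def empirical_mutual_information(x, y, n_states):
    """Plug-in estimate of I(P_{x,y}) from two discrete sample vectors."""
    n = len(x)
    joint = np.zeros((n_states, n_states))
    for a, b in zip(x, y):
        joint[a, b] += 1.0
    joint /= n
    px = joint.sum(axis=1, keepdims=True)   # marginal of x (column vector)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y (row vector)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def clthres(samples, n_states, beta=0.625):
    """Sketch of a CLThres-style learner: Chow-Liu tree plus thresholding.

    samples: (n, d) integer array with entries in {0, ..., n_states - 1}.
    beta: controls the (assumed) threshold eps_n = n**(-beta).
    Returns the estimated forest edge set as a list of (i, j) pairs.
    """
    n, d = samples.shape

    # Steps 1-2: empirical mutual information for every pair of nodes.
    mi = {(i, j): empirical_mutual_information(samples[:, i], samples[:, j], n_states)
          for i, j in combinations(range(d), 2)}

    # Step 3: Chow-Liu = maximum-weight spanning tree (Kruskal + union-find).
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree_edges = []   # e_1, ..., e_{d-1}, added in decreasing order of MI
    for (i, j), w in sorted(mi.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree_edges.append(((i, j), w))
            if len(tree_edges) == d - 1:
                break

    # Step 4: estimate k_n by counting sorted MIs that clear the threshold.
    eps_n = n ** (-beta)
    k_n = sum(1 for _, w in tree_edges if w >= eps_n)

    # Step 5: prune, retaining only the top k_n edges of the Chow-Liu tree.
    return [edge for edge, _ in tree_edges[:k_n]]
```

Because Kruskal's procedure accepts edges in decreasing order of weight, the mutual informations of e_1, ..., e_{d−1} are nonincreasing, so counting those above ε_n matches the thresholding rule in sentence 22.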

26 Note that if Step 4 is omitted and kn is defined to be d − 1, then CLThres simply reduces to the Chow-Liu ML algorithm. [sent-218, score-0.245]

27 In this section, we keep d and k fixed and consider a probability model P, which is assumed to be Markov on a forest in T_k^d. [sent-230, score-0.267]

28 Recall that E_{k_n}, with cardinality k_n, is the edge set learned by CLThres. [sent-235, score-0.272]

29 Suppose the threshold ε_n satisfies condition (6). Then, if the true model T_P = (V, E_P) is a proper forest (k < d − 1), there exists a constant C_P ∈ (1, ∞) such that −C_P ≤ lim inf_{n→∞} (nε_n)^{−1} log P^n(A_n) ≤ lim sup_{n→∞} (nε_n)^{−1} log P^n(A_n) ≤ −1 (see (7) and (8)). [sent-248, score-0.382]

30 Finally, if the true model T_P = (V, E_P) is a tree (k = d − 1), then lim_{n→∞} (1/n) log P^n(A_n) < 0 (see (9)), that is, the error probability decays exponentially fast. [sent-249, score-0.327]

31 From (8), the rate of decay of the error probability for proper forests is subexponential but nonetheless can be made faster than any polynomial for an appropriate choice of ε_n. [sent-253, score-0.26]

32 Thus, the true edges will be correctly identified by CLThres, implying that, with high probability, there will not be underestimation as n → ∞. [sent-260, score-0.322]

33 Thus, (8) is a universal result for all forest distributions P ∈ D(F^d). [sent-273, score-0.241]

34 The overestimation error results from testing whether pairs of random variables are independent and our asymptotic bound for the error probability of this test does not depend on the true distribution P. [sent-283, score-0.343]

35 We state a converse (a necessary lower bound on sample complexity) in Theorem 7 by treating the unknown forest graph as a uniform random variable over all possible forests of fixed size. [sent-287, score-0.36]

36 It turns out that bounding the overestimation error poses the greatest challenge. [sent-296, score-0.249]

37 Indeed, we show that the underestimation and Chow-Liu errors have exponential decay in n. [sent-297, score-0.293]

38 However, the overestimation error has subexponential decay (≈ exp(−nεn )). [sent-298, score-0.31]

39 In this subsection, we consider the situation when the underlying unknown distribution P is not forest-structured but we wish to learn its best forest approximation. [sent-309, score-0.241]

40 To this end, we define the projection of P onto the set of forests (or forest projection) to be P̃ := argmin_Q D(P || Q), where the minimum is over all forest-structured distributions Q. [sent-310, score-0.363]
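For concreteness, the forest projection defined in sentence 40 can be restated via the standard Chow-Liu argument: for a fixed forest edge set, the reverse I-projection of P retains the true singleton and pairwise marginals, so minimizing over structures amounts to a maximum-weight spanning forest in the true mutual informations. The LaTeX snippet below is this inferred restatement, not a quotation of the paper's equation (12).

```latex
\[
  \widetilde{P}(x) \,=\, \prod_{i \in V} P_i(x_i)
    \prod_{(i,j) \in E_{\widetilde{P}}}
      \frac{P_{i,j}(x_i, x_j)}{P_i(x_i)\,P_j(x_j)},
  \qquad
  E_{\widetilde{P}} \,\in\, \operatorname*{argmax}_{E \,:\, (V,E)\ \text{is a forest}}
      \ \sum_{(i,j) \in E} I(P_{i,j}).
\]
```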

41 We will sometimes make the dependence of d and k on n explicit, that is, d = dn and k = kn . [sent-327, score-0.298]

42 That is, assumptions (A1) and (A2) ensure that there exist uniform lower bounds on the minimum mutual information and the minimum entry in the pairwise probabilities in the forest models as the size of the graph grows (see (13) and (14)). [sent-329, score-0.326]

43 Note that h_n in (16) is a function of both the number of variables d = d_n and the number of edges k = k_n in the models P^(d), since it is a sequence indexed by n. [sent-346, score-0.425]

44 In the next result, we assume (n, d, k) satisfies the scaling law in (15) and answer the following question: How does h_n in (16) depend on the number of edges k_n for a given d_n? [sent-347, score-0.425]

45 Let P1^(d) and P2^(d) be two sequences of forest-structured distributions with a common number of nodes d_n and numbers of edges k_n(P1^(d)) and k_n(P2^(d)) respectively. [sent-348, score-0.656]

46 Corollary 6 (Extremal Forests) Assume that CLThres is employed as the forest learning algorithm. [sent-349, score-0.241]

47 As n → ∞, h_n(P1^(d)) ≤ h_n(P2^(d)) whenever k_n(P1^(d)) ≥ k_n(P2^(d)), implying that h_n is maximized when the P^(d) are product distributions (i.e., k_n = 0). [sent-350, score-0.601]

48 Correspondingly, h_n is minimized when the P^(d) are tree-structured distributions (i.e., k_n = d_n − 1). [sent-352, score-0.245]

49 Furthermore, if k_n(P1^(d)) = k_n(P2^(d)), then h_n(P1^(d)) = h_n(P2^(d)). [sent-355, score-0.564]

50 We are not claiming that such a result holds for all other forest learning algorithms. [sent-357, score-0.241]

51 The intuition for this result is the following: We recall from the discussion after Theorem 3 that the overestimation error dominates the probability of error for structure learning. [sent-358, score-0.349]

52 When there are few edges in the true model (i.e., k_n is very small relative to d_n), the CLThres estimator is more likely to overestimate the number of edges than when there are many edges. [sent-362, score-0.478]

53 Thus far, our results are for a specific algorithm, CLThres, for learning the structure of Markov forest distributions. [sent-368, score-0.271]

54 This is because there are fewer forests with a small number of edges as compared to forests with a large number of edges. [sent-385, score-0.28]

55 Here P̃ is the forest projection of P defined in (12). [sent-396, score-0.268]

56 We again consider the high-dimensional setting, where the tuple of parameters (n, d_n, k_n) tends to infinity and we have a sequence of learning problems indexed by the number of data points n. [sent-422, score-0.298]

57 The order of the risk, or equivalently the rate of convergence of the estimated distribution to the forest projection, is almost linear in the number of variables d and inversely proportional to n. [sent-436, score-0.275]

58 In order to compare our estimate to the ground-truth graph, we learn the structure of distributions that are Markov on the forest shown in Figure 2. [sent-464, score-0.271]

59 We will vary β in our experiments to observe its effect on the overestimation and underestimation errors. [sent-483, score-0.437]

60 Figure 4 shows the simulated overestimation and underestimation errors for this experiment. [sent-488, score-0.437]

61 When β ≈ 0.625, we have the best tradeoff between overestimation and underestimation for this particular experimental setting. [sent-494, score-0.437]

62 Figure 4: The overestimation and underestimation errors for β ∈ (0, 1). [sent-525, score-0.464]

63 When the number of samples is very large, the overestimation error is shown to dominate the overall probability of error, and so one should choose β to be close to zero. [sent-526, score-0.346]
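The β trade-off discussed in sentences 59–64 can be explored with the clthres() sketch given after sentence 25; the snippet below is a hypothetical usage on synthetic binary data whose true forest has two edges. With ε_n = n^(−β), a larger β means a smaller threshold and hence a higher risk of overestimating the number of edges, while a smaller β means a larger threshold and a higher risk of underestimation.

```python
import numpy as np

# Hypothetical usage of the clthres() sketch above on synthetic binary data.
# Nodes (0, 1) and (2, 3) are correlated pairs, so the true forest has k = 2
# edges; the remaining nodes are independent.
rng = np.random.default_rng(0)
n, d = 2000, 8
samples = rng.integers(0, 2, size=(n, d))
samples[:, 1] = np.where(rng.random(n) < 0.9, samples[:, 0], 1 - samples[:, 0])
samples[:, 3] = np.where(rng.random(n) < 0.9, samples[:, 2], 1 - samples[:, 2])

for beta in (0.25, 0.5, 0.625, 0.75):
    edges = clthres(samples, n_states=2, beta=beta)
    print(f"beta={beta}: learned {len(edges)} edges -> {sorted(edges)}")
```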

64 When the number of data points n is large, β in (10) should be chosen to be small to ensure that the learned edge set is equal to the true one (assuming the underlying model is a forest) with high probability as the overestimation error dominates. [sent-582, score-0.302]

65 The proof of Proposition 2 (sketch) hinges on the fact that both the overestimation and underestimation errors decay to zero exponentially fast when the threshold is chosen to be I_min/2. [sent-611, score-0.527]

66 The error for learning the top k edges of the forest also decays exponentially fast (Tan et al.). [sent-613, score-0.448]

67 In other words, K_P, defined in (24), is the error exponent for learning the forest structure incorrectly, assuming the true model order k is known and Chow-Liu terminates after the addition of exactly k edges in the MWST procedure (Kruskal, 1956). [sent-627, score-0.47]

68 The first and second terms in (27) are known as the overestimation and underestimation errors respectively. [sent-638, score-0.437]

69 We will show that the underestimation error decays exponentially fast. [sent-639, score-0.349]

70 The overestimation error decays only subexponentially fast and so its rate of decay dominates the overall rate of decay of the error probability for structure learning. [sent-640, score-0.612]

71 We now bound these terms, starting with the underestimation error. [sent-643, score-0.256]

72 By the rule for choosing k_n in (3), we have an upper bound on P^n(k_n = k − 1 | B_n^c) in terms of the event that some true edge e ∈ E_P has empirical mutual information below the threshold ε_n. [sent-648, score-0.269]

73 By using the fact that I_min > 0, the exponent L(P_e; 0) > 0, and thus we can put the pieces in (28), (29) and (32) together to show that the underestimation error is upper bounded as P^n(k_n < k | B_n^c) ≤ k(k − 1)(n + 1)^{r²} exp(−n min_{e∈E_P} (L(P_e; 0) − η)) (33). [sent-659, score-0.381]

74 Hence, if k is constant, the underestimation error P^n(k_n < k | B_n^c) decays to zero exponentially fast as n → ∞; that is, the normalized logarithm of the underestimation error can be bounded as lim sup_{n→∞} (1/n) log P^n(k_n < k | B_n^c) ≤ − min_{e∈E_P} (L(P_e; 0) − η). [sent-660, score-0.702]

75 Bounding the overestimation error is harder. [sent-666, score-0.249]

76 As such, the exponent for overestimation in (38) can be approximated by a quadratically constrained quadratic program (QCQP), where z := vec(Q) − vec(P_e): M(P_e; ε_n) = min_{z ∈ R^{r²}} (1/2) zᵀ Π_e z subject to (1/2) zᵀ H_e z ≥ ε_n and zᵀ 1 = 0. [sent-688, score-0.27]
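The QCQP in sentence 76 can be explored numerically; the sketch below assumes Π_e and H_e are given symmetric positive semidefinite matrices and uses SciPy's SLSQP solver with random restarts. Because the constraint (1/2) zᵀ H_e z ≥ ε_n is nonconvex, a local solver only returns an upper bound on the true minimum; this is an illustrative sketch, not the analytical bound via convex duality used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def approx_qcqp(Pi, H, eps, n_starts=20, seed=0):
    """Locally minimize (1/2) z^T Pi z  s.t.  (1/2) z^T H z >= eps and sum(z) = 0.

    Pi, H: symmetric positive semidefinite arrays of the same shape (assumed given).
    The feasible set is nonconvex, so we keep the best of several random
    restarts of SLSQP; the returned value is an upper bound on the true minimum.
    """
    rng = np.random.default_rng(seed)
    dim = Pi.shape[0]
    constraints = [
        {"type": "ineq", "fun": lambda z: 0.5 * z @ H @ z - eps},  # (1/2) z^T H z >= eps
        {"type": "eq",   "fun": lambda z: np.sum(z)},              # z^T 1 = 0
    ]
    best = np.inf
    for _ in range(n_starts):
        z0 = rng.standard_normal(dim)
        z0 -= z0.mean()  # start on the hyperplane z^T 1 = 0
        res = minimize(lambda z: 0.5 * z @ Pi @ z, z0,
                       method="SLSQP", constraints=constraints)
        if res.success:
            best = min(best, res.fun)
    return best
```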

77 Putting (35), (36) and (48) together, we see that the overestimation error satisfies P^n(k_n > k | B_n^c) ≤ (d − k − 1)² (n + 1)^{r²} exp(−nε_n c_P (1 − η)). [sent-711, score-0.289]

78 Thus, we have consistency overall (since the underestimation, Chow-Liu and now the overestimation errors all tend to zero). [sent-713, score-0.246]

79 For the lower bound in Theorem 3, the key idea is to bound the overestimation error using a modification of the lower bound in Sanov’s theorem. [sent-722, score-0.297]

80 Using this and the fact that if |a_n − b_n| → 0 and |b_n − c_n| → 0 then |a_n − c_n| → 0 (triangle inequality), we also have lim_{n→∞} |M(P_e; ε_n) − D(Q^(n) || P_e)| = 0. [sent-738, score-0.299]

81 For the claim in (9), note that for n sufficiently large, P^n(A_n) ≥ max{(1 − η) P^n(k_n ≠ k | B_n^c), P^n(B_n)} (56), by Lemma 11 and the fact that B_n ⊆ A_n. [sent-748, score-0.245]

82 Equation (56) gives us a lower bound on the error probability in terms of the Chow-Liu error P^n(B_n) and the underestimation and overestimation errors P^n(k_n ≠ k | B_n^c). [sent-749, score-0.82]

83 If k = d − 1, the overestimation error probability is identically zero, so we only have to be concerned with the underestimation error. [sent-750, score-0.507]

84 Using a matching lower bound, which we omit, the underestimation error event satisfies P^n(k_n < k | B_n^c) = exp(−nL_P) to first order in the exponent. [sent-752, score-0.344]

85 Corollary 4 follows from the fact that the three errors, (i) the Chow-Liu error, (ii) the underestimation error, and (iii) the overestimation error, behave in exactly the same way as in Theorem 3. [sent-766, score-0.569]

86 In particular, the Chow-Liu error, that is, the error for learning the top k edges in the forest projection model P̃, decays with error exponent K_P̃. [sent-767, score-0.584]

87 The underestimation error behaves as in (34) and the overestimation error as in (50). [sent-768, score-0.525]

88 For the proof of Theorem 5: given assumptions (A1) and (A2), we claim that the underestimation exponent L_{P^(d)}, defined in (34), is uniformly bounded away from zero; that is, L := inf_{d∈N} L_{P^(d)} = inf_{d∈N} min_{e∈E_{P^(d)}} L(P_e; 0) (60) is positive. [sent-770, score-0.349]

89 Finally, we observe from (33) that if n > (3/L) log k, the underestimation error tends to zero, because (33) can be further upper bounded as P^n(k_n < k | B_n^c) ≤ (n + 1)^{r²} exp(2 log k − nL) < (n + 1)^{r²} exp((2/3) nL − nL) → 0 as n → ∞. [sent-790, score-0.403]

90 Thus, if n > (4/K) log d, the error probability associated with estimating the top k edges (event B_n) decays to zero along similar lines as in the case of the underestimation error. [sent-793, score-0.497]

91 Finally, from (49), if nεn > 2 log(d − k), then the overestimation error tends to zero. [sent-795, score-0.272]

92 By (26) and (27), these three probabilities constitute the overall error probability when learning the sequence of forest structures {EP(d) }d∈N . [sent-797, score-0.311]

93 For Corollary 6, first note that k_n ∈ {0, . . . , d_n − 1}. [sent-800, score-0.245]

94 From (49), we see that for n sufficiently large, the sequence h_n(P) := (nε_n)^{−1} log P^n(A_n) is upper bounded by −1 + [2 log(d_n − k_n − 1) + r² log(n + 1)] / (nε_n) (64). [sent-804, score-0.314]

95 Thus hn (P) = O((nεn )−1 log(dn − kn − 1)), where the implied constant is 2 by (64). [sent-806, score-0.282]

96 Equation (64) also shows that the sequence hn is monotonically decreasing in kn . [sent-810, score-0.282]

97 Let W := T_MAP((X^d)^n) be the range of the function T_MAP, that is, a forest t ∈ W if and only if there exists a sequence x^n such that T_MAP(x^n) = t. [sent-816, score-0.263]

98 Recall from Appendix B that B_n := {E_k ≠ E_P} is the event that the top k edges (in terms of mutual information) in the edge set E_{d−1} are not equal to the edges in E_P. [sent-834, score-0.336]

99 If P is not Markov on a forest, (81) holds with the forest projection P̃ in place of P, that is, lim sup_{n→∞} (1/n) log P^n(D(P̃ || P*) > δd) ≤ −δ (82). [sent-884, score-0.345]

100 Taking the normalized logarithm and the lim inf in n on both sides of (92) yields lim inf_{n→∞} (1/n) log P^n_{X,Y}((S^{(n)}_{δ,X,Y})^c) ≥ lim inf_{n→∞} [−D(Q^{(n)}_{X|Y} || P_{X|Y}) − D(Q^{(n)}_Y || P_Y)] = −δ. [sent-904, score-0.245]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pe', 0.393), ('pn', 0.352), ('clthres', 0.261), ('kn', 0.245), ('forest', 0.241), ('underestimation', 0.232), ('bc', 0.224), ('overestimation', 0.205), ('ep', 0.173), ('ekn', 0.145), ('istributions', 0.131), ('tp', 0.127), ('imin', 0.117), ('tan', 0.114), ('pi', 0.112), ('arkov', 0.111), ('orest', 0.111), ('illsky', 0.111), ('nandkumar', 0.111), ('cn', 0.106), ('vec', 0.103), ('forests', 0.095), ('edges', 0.09), ('igh', 0.086), ('imensional', 0.086), ('mutual', 0.085), ('decays', 0.073), ('px', 0.066), ('exponent', 0.065), ('kp', 0.064), ('tk', 0.061), ('decay', 0.061), ('xd', 0.06), ('risk', 0.055), ('dn', 0.053), ('uc', 0.051), ('spect', 0.051), ('fd', 0.051), ('cp', 0.051), ('qx', 0.05), ('lim', 0.045), ('event', 0.044), ('error', 0.044), ('borade', 0.044), ('iinf', 0.044), ('infomation', 0.044), ('csisz', 0.043), ('bn', 0.042), ('consistency', 0.041), ('exp', 0.04), ('pj', 0.038), ('hn', 0.037), ('markov', 0.037), ('asthma', 0.036), ('cpe', 0.036), ('dembo', 0.036), ('tmap', 0.036), ('tree', 0.036), ('rate', 0.034), ('chow', 0.033), ('anandkumar', 0.033), ('earning', 0.033), ('log', 0.032), ('yn', 0.032), ('graphical', 0.032), ('zheng', 0.031), ('tkn', 0.031), ('structure', 0.03), ('proof', 0.029), ('qn', 0.029), ('merhav', 0.029), ('wc', 0.029), ('zeitouni', 0.029), ('structurally', 0.029), ('extremal', 0.029), ('thomas', 0.028), ('wainwright', 0.027), ('edge', 0.027), ('projection', 0.027), ('trees', 0.027), ('samples', 0.027), ('inf', 0.026), ('liu', 0.026), ('probability', 0.026), ('heart', 0.026), ('qi', 0.025), ('ml', 0.025), ('qy', 0.025), ('aigner', 0.025), ('ser', 0.025), ('theorem', 0.025), ('bound', 0.024), ('tends', 0.023), ('nodes', 0.023), ('lauritzen', 0.022), ('xn', 0.022), ('optimizer', 0.022), ('lp', 0.022), ('finesso', 0.022), ('hardest', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

2 0.26085669 35 jmlr-2011-Forest Density Estimation

Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency

3 0.13164505 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning

4 0.12148633 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

Author: Botond Cseke, Tom Heskes

Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation

5 0.11566535 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

6 0.11200319 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

7 0.074008726 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

8 0.067804962 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

9 0.067602798 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

10 0.06714163 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

11 0.066254705 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

12 0.06347008 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

13 0.057465132 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

14 0.057114679 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

15 0.051166952 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

16 0.048235413 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

17 0.048192319 58 jmlr-2011-Learning from Partial Labels

18 0.044679288 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

19 0.041061562 66 jmlr-2011-Multiple Kernel Learning Algorithms

20 0.040645789 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.249), (1, -0.124), (2, -0.1), (3, 0.114), (4, 0.238), (5, -0.168), (6, -0.095), (7, -0.101), (8, 0.004), (9, 0.226), (10, -0.128), (11, 0.102), (12, -0.319), (13, -0.226), (14, -0.09), (15, -0.02), (16, 0.011), (17, 0.013), (18, 0.05), (19, 0.125), (20, 0.081), (21, 0.021), (22, 0.054), (23, -0.214), (24, -0.027), (25, 0.086), (26, 0.04), (27, -0.079), (28, -0.136), (29, 0.002), (30, -0.078), (31, 0.053), (32, 0.087), (33, 0.031), (34, 0.043), (35, 0.061), (36, 0.018), (37, 0.054), (38, -0.104), (39, -0.033), (40, -0.079), (41, 0.004), (42, 0.025), (43, 0.008), (44, -0.051), (45, 0.048), (46, 0.001), (47, -0.003), (48, -0.06), (49, -0.088)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95000005 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

2 0.80650133 35 jmlr-2011-Forest Density Estimation

Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency

3 0.35272792 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

Author: Piotr Zwiernik

Abstract: The standard Bayesian Information Criterion (BIC) is derived under regularity conditions which are not always satisfied in the case of graphical models with hidden variables. In this paper we derive the BIC for the binary graphical tree models where all the inner nodes of a tree represent binary hidden variables. This provides an extension of a similar formula given by Rusakov and Geiger for naive Bayes models. The main tool used in this paper is the connection between the growth behavior of marginal likelihood integrals and the real log-canonical threshold. Keywords: BIC, marginal likelihood, singular models, tree models, Bayesian networks, real log-canonical threshold

4 0.34750122 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

Author: Botond Cseke, Tom Heskes

Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation

5 0.3427484 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning

6 0.33291304 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

7 0.32263318 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

8 0.30375123 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

9 0.28913349 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

10 0.28207922 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

11 0.2677761 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

12 0.25427371 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

13 0.22436671 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

14 0.22073877 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

15 0.19285963 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

16 0.18365805 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

17 0.17933761 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

18 0.17566745 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

19 0.15907055 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

20 0.14924316 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.108), (6, 0.016), (9, 0.023), (10, 0.025), (24, 0.034), (31, 0.094), (32, 0.031), (41, 0.033), (65, 0.013), (70, 0.056), (71, 0.013), (73, 0.048), (78, 0.069), (81, 0.317), (90, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73594224 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

2 0.46990579 35 jmlr-2011-Forest Density Estimation

Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency

3 0.44476217 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

4 0.44327757 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning

5 0.44141793 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

6 0.43665814 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

7 0.43025783 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

8 0.4294728 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

9 0.42397681 97 jmlr-2011-Union Support Recovery in Multi-task Learning

10 0.42266235 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

11 0.42219034 6 jmlr-2011-A Simpler Approach to Matrix Completion

12 0.4214845 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

13 0.41970977 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

14 0.4181239 104 jmlr-2011-X-Armed Bandits

15 0.41690195 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

16 0.41673231 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

17 0.41527763 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

18 0.41510141 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

19 0.41504788 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

20 0.41413626 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling