jmlr jmlr2011 jmlr2011-35 knowledge-graph by maker-knowledge-mining

35 jmlr-2011-Forest Density Estimation


Source: pdf

Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Editor: Tong Zhang. Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. [sent-13, score-0.946]

2 For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. [sent-14, score-1.321]

3 We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. [sent-17, score-0.79]

4 Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. [sent-18, score-0.793]

5 Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency. [sent-20, score-0.877]

6 If the graph for P is a forest, then a simple conditioning argument shows that its density p can be written as p(x) = \prod_{(i,j)\in E} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \prod_{k=1}^{d} p(x_k), where E is the set of edges in the forest (Lauritzen, 1996). [sent-41, score-0.858]
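
To make the factorization concrete, the following minimal sketch (not taken from the paper; the function name and input conventions are illustrative assumptions) evaluates the log-density of a forest-factorized model from user-supplied bivariate and univariate marginals:

```python
import numpy as np

def forest_log_density(x, edges, pairwise, univariate):
    """Evaluate log p(x) for a forest-factorized density.

    x          : 1-D array of length d
    edges      : list of (i, j) pairs forming a forest on {0, ..., d-1}
    pairwise   : dict mapping (i, j) -> bivariate marginal p_ij(x_i, x_j)
    univariate : list of univariate marginals p_k(x_k), one per dimension
    """
    log_p = sum(np.log(univariate[k](x[k])) for k in range(len(x)))
    for i, j in edges:
        # each edge contributes the ratio p(x_i, x_j) / (p(x_i) p(x_j))
        log_p += np.log(pairwise[(i, j)](x[i], x[j]))
        log_p -= np.log(univariate[i](x[i])) + np.log(univariate[j](x[j]))
    return log_p
```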

7 We regulate the complexity of the forest by selecting the edges to include using a data splitting scheme, a simple form of cross validation. [sent-50, score-0.705]

8 In particular, we consider the family of forest structured densities that use the marginal kernel density estimates constructed on the first partition of the data, and estimate the risk of the resulting densities over a second, held out partition. [sent-51, score-1.042]

9 The optimal forest in terms of the held out risk is then obtained by finding a maximum weight spanning forest for an appropriate set of edge weights. [sent-52, score-1.448]

10 While tree and forest structured density estimation has long been recognized as a useful tool, there has been little theoretical analysis of the statistical properties of such density estimators. [sent-56, score-0.949]

11 The main contribution of this paper is an analysis of the asymptotic properties of forest density estimation in high dimensions. [sent-57, score-0.752]

12 Restricting the graph to be a forest circumvents the curse of dimensionality by requiring only univariate and bivariate marginal densities. [sent-71, score-0.948]

13 The problem of clustering the variables into small interacting sets, each supported by a tree-structured density, becomes the problem of estimating a maximum weight spanning forest with a restriction on the size of each component tree. [sent-72, score-0.714]

14 Limiting the tree size gives another way of regulating tree complexity, and it provides a larger family of forests to select from in the data splitting procedure. [sent-74, score-0.762]

15 While the problem of finding a maximum weight forest with restricted tree size may be natural, it appears not to have been studied previously. [sent-75, score-0.735]

16 While finding the exact optimum is hard, we give a practical 4-approximation algorithm for finding the optimal tree-restricted forest; that is, our algorithm outputs a forest whose weight is guaranteed to be at least (1/4) w(F^*), where w(F^*) is the weight of the optimal forest. [sent-77, score-0.784]

17 This approximation guarantee translates into excess risk bounds on the constructed forest using our previous analysis. [sent-78, score-0.679]

18 Our experimental results with this approximation algorithm show that it can be effective in practice for forest density estimation. [sent-79, score-0.731]

19 We say that a probability density function p(x) is supported by a forest F if the density can be written as p_F(x) = \prod_{(i,j)\in E(F)} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \prod_{k\in V_F} p(x_k), (1) where each p(x_i, x_j) is a bivariate density on X_i \times X_j, and each p(x_k) is a univariate density on X_k. [sent-105, score-1.4]

20 To bound the number of labeled spanning forests on d nodes, note that each such forest can be obtained by forming a labeled tree on d + 1 nodes, and then removing node d + 1. [sent-108, score-0.875]
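
A short way to make this counting argument explicit (a sketch relying on the standard Cayley formula; it is stated here for completeness rather than quoted from the paper):

```latex
% Cayley's formula: the number of labeled trees on d+1 nodes is (d+1)^{d-1}.
% Deleting node d+1 maps these trees onto the labeled forests on d nodes, so
|\mathcal{F}_d| \;\le\; (d+1)^{d-1}.
```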

21 Define the oracle forest density q^* = \arg\min_{q\in\mathcal{P}_d} D(p^* \,\|\, q), (3) where the Kullback-Leibler divergence D(p \| q) between two densities p and q is D(p \| q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx, under the convention that 0 \log(0/q) = 0 and p \log(p/0) = \infty for p > 0; see Bach and Jordan (2003). [sent-111, score-0.887]

22 There exists a forest F^* \in \mathcal{F}_d such that q^* = p^*_{F^*} = \prod_{(i,j)\in E(F^*)} \frac{p^*(x_i, x_j)}{p^*(x_i)\, p^*(x_j)} \prod_{k\in V_{F^*}} p^*(x_k), (4) where p^*(x_i, x_j) and p^*(x_i) are the bivariate and univariate marginal densities of p^*. [sent-114, score-0.955]

23 While the best forest will in fact be a spanning tree, the densities p∗ (xi , x j ) are in practice not known. [sent-120, score-0.704]

24 With these estimated marginals, we consider all forest density estimates of the form \hat{p}_F(x) = \prod_{(i,j)\in E(F)} \frac{\hat{p}_{n_1}(x_i, x_j)}{\hat{p}_{n_1}(x_i)\, \hat{p}_{n_1}(x_j)} \prod_{k\in V_F} \hat{p}_{n_1}(x_k). [sent-122, score-0.826]

25 In our setting of density estimation we will regulate the complexity of the forest by cross validating over a held out set. [sent-126, score-0.789]

26 There are several different ways to judge the quality of a forest structured density estimator. [sent-127, score-0.731]

27 Definition 5 (Structure selection consistency) An estimator \hat{q}_n \in \mathcal{P}_d supported by a forest \hat{F}_n is structure selection consistent if P(E(\hat{F}_n) \neq E(F^*)) \to 0 as n goes to infinity, where F^* is defined in (4). [sent-133, score-0.681]

28 It is important to note that risk consistency is an oracle property, in the sense that the true density p∗ (x) is not restricted to be supported by a forest; rather, the property assesses how well a given estimator q approximates the best forest density (the oracle) within a class. [sent-136, score-1.008]

29 Kernel Density Estimation for Forests: If the true density p^*(x) were known, then by Proposition 2 the density estimation problem would reduce to finding the best forest structure F_d^*, satisfying F_d^* = \arg\min_{F\in\mathcal{F}_d} R(p^*_F) = \arg\min_{F\in\mathcal{F}_d} D(p^* \,\|\, p^*_F). [sent-138, score-0.937]

30 Since the entropy term \sum_{k} H(X_k) is constant across all forests, this can be recast as the problem of finding the maximum weight spanning forest for a weighted graph, where the weight w(i, j) of the edge connecting nodes i and j is I(X_i; X_j). [sent-140, score-0.845]
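
The reduction to a maximum weight spanning forest follows from the standard decomposition below (a sketch of the intermediate step, written out from the definitions in (1) and (3); it is not quoted verbatim from the paper):

```latex
D(p^* \,\|\, p^*_F)
  \;=\; \int_{\mathcal{X}} p^*(x)\log p^*(x)\,dx
        \;-\; \sum_{(i,j)\in E(F)} I(X_i; X_j)
        \;+\; \sum_{k=1}^{d} H(X_k),
```

so minimizing the divergence over forests F is equivalent to maximizing the total mutual information carried by the chosen edges.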

31 We replace the population mutual information I(X_i; X_j) in (5) by the plug-in estimate \hat{I}_n(X_i, X_j), defined as \hat{I}_n(X_i, X_j) = \int_{\mathcal{X}_i \times \mathcal{X}_j} \hat{p}_n(x_i, x_j) \log \frac{\hat{p}_n(x_i, x_j)}{\hat{p}_n(x_i)\, \hat{p}_n(x_j)}\, dx_i\, dx_j, where \hat{p}_n(x_i, x_j) and \hat{p}_n(x_i) are bivariate and univariate kernel density estimates. [sent-147, score-1.115]
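
A minimal sketch of this plug-in computation, assuming the bivariate and univariate density estimates have already been tabulated on an evenly spaced grid over [0, 1]^2 (the Riemann-sum quadrature and the small floor eps are implementation choices, not the paper's):

```python
import numpy as np

def plugin_mutual_information(p_ij, p_i, p_j, eps=1e-12):
    """Plug-in estimate of I(X_i; X_j) from gridded kernel density estimates.

    p_ij     : (m, m) array, bivariate KDE on an m x m grid over [0, 1]^2
    p_i, p_j : (m,) arrays, univariate KDEs at the same m grid points
    """
    m = p_ij.shape[0]
    cell = (1.0 / m) ** 2                              # area of one grid cell
    ratio = p_ij / np.maximum(np.outer(p_i, p_j), eps)
    integrand = p_ij * np.log(np.maximum(ratio, eps))
    return float(np.sum(integrand) * cell)
```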

32 Given this estimated mutual information matrix \hat{M}_n = [\hat{I}_n(X_i, X_j)], we can then apply Kruskal's algorithm (equivalently, the Chow-Liu algorithm) to find the best forest structure \hat{F}_n. [sent-148, score-0.7]
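
A compact sketch of this step (a generic Kruskal implementation with union-find, not the authors' code; truncating the returned edge list after k edges gives the stage-k forest used below):

```python
def kruskal_forest(mi_matrix):
    """Kruskal's algorithm with mutual-information edge weights.

    mi_matrix : d x d symmetric matrix of estimated mutual informations.
    Returns the edges in the order in which they are added.
    """
    d = len(mi_matrix)
    parent = list(range(d))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]          # path compression
            u = parent[u]
        return u

    candidates = sorted(
        ((mi_matrix[i][j], i, j) for i in range(d) for j in range(i + 1, d)),
        reverse=True,
    )
    added = []
    for w, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:                               # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            added.append((i, j))
            if len(added) == d - 1:
                break
    return added
```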

33 Using D_2, prune the tree \hat{F}_{n_1} to find a forest \hat{F}^{(k)}_{n_1} with k edges, for 0 \le k \le d - 1. [sent-158, score-0.686]

34 In order to choose k, note that in stage k of the Chow-Liu algorithm we have an edge set E^{(k)} (in the notation of Algorithm 1) which corresponds to a forest \hat{F}^{(k)}_{n_1} with k edges, where \hat{F}^{(0)}_{n_1} is the union of d disconnected nodes. [sent-187, score-0.675]

35 For a density p_F that is supported by a forest F, we define the held-out negative log-likelihood risk as \hat{R}_{n_2}(p_F) = - \sum_{(i,j)\in E_F} \int_{\mathcal{X}_i \times \mathcal{X}_j} \hat{p}_{n_2}(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}\, dx_i\, dx_j - \sum_{k\in V_F} \int_{\mathcal{X}_k} \hat{p}_{n_2}(x_k) \log p(x_k)\, dx_k. (9) [sent-193, score-0.928]

36 The selected forest is then \hat{F}^{(\hat{k})}_{n_1}, where \hat{k} = \arg\min_{k\in\{0,\dots,d-1\}} \hat{R}_{n_2}\big(\hat{p}_{\hat{F}^{(k)}_{n_1}}\big). [sent-194, score-0.767]

37 This minimization can be efficiently carried out by iterating over the d - 1 edges in \hat{F}^{(d-1)}_{n_1}. Once \hat{k} is obtained, the final forest density estimate is given by \hat{p}_n(x) = \prod_{(i,j)\in E^{(\hat{k})}} \frac{\hat{p}_{n_1}(x_i, x_j)}{\hat{p}_{n_1}(x_i)\, \hat{p}_{n_1}(x_j)} \prod_{k} \hat{p}_{n_1}(x_k). [sent-205, score-0.996]
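
A sketch of the data-splitting selection of the forest size (the helper heldout_risk is hypothetical; it stands in for the held-out risk in (9), evaluated for the forest density built on D1 from a given edge set):

```python
def select_forest_size(ordered_edges, heldout_risk):
    """Choose the number of edges k by held-out negative log-likelihood risk.

    ordered_edges : edges in the order added by Kruskal's algorithm on D1
    heldout_risk  : callable mapping a list of edges to the held-out risk of
                    the corresponding forest density estimate
    """
    best_k, best_risk = 0, heldout_risk([])        # empty forest: product of univariate marginals
    for k in range(1, len(ordered_edges) + 1):
        risk = heldout_risk(ordered_edges[:k])
        if risk < best_risk:
            best_k, best_risk = k, risk
    return best_k, ordered_edges[:best_k]
```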

38 3.3 Building a Forest on Held-out Data: Another approach to estimating the forest structure is to estimate the marginal densities on the training set, but only build graphs on the held-out data. [sent-211, score-0.692]

39 The final forest is then \hat{F}_{n_2} = \arg\max_{F\in\mathcal{F}} \hat{w}_{n_2}(F) = \arg\min_{F\in\mathcal{F}} \hat{R}_{n_2}(\hat{p}_F). By building a forest on held-out data, we directly cross-validate over all forests. [sent-217, score-1.284]

40 Statistical Properties: In this section we present our theoretical results on risk consistency, structure selection consistency, and estimation consistency of the forest density estimate \hat{p}_n = \hat{p}_{\hat{F}^{(\hat{k})}}. [sent-219, score-0.907]

41 We first analyze the forest density estimator obtained using a fixed number of edges k < d; specifically, consider stopping the Chow-Liu algorithm in Stage 1 after k iterations. [sent-291, score-0.87]

42 Theorem 9 (Risk consistency) Let \hat{p}_{\hat{F}^{(k)}_d} be the forest density estimate with |E(\hat{F}^{(k)}_d)| = k, obtained after the first k iterations of the Chow-Liu algorithm, for some k \in \{0, \dots, d-1\}. [sent-295, score-0.751]

43 Theorem 10 Let \hat{p}_{\hat{F}^{(\hat{k})}_d} be the forest density estimate obtained using the data-dependent pruning method in Section 3. [sent-303, score-0.751]

44 This is the approach of Section 3.3, which builds the forest by running Kruskal's algorithm on the held-out data. [sent-308, score-0.678]

45 Theorem 11 Let Fn2 be the forest obtained using Kruskal’s algorithm on held-out data, and let k = |Fn2 | be the number of edges in Fn2 . [sent-309, score-0.705]

46 Then R(\hat{p}_{\hat{F}_{n_2}}) - \min_{F\in\mathcal{F}} R(\hat{p}_F) = O_P\!\left( (k^* + \hat{k}) \sqrt{\frac{\log n + \log d}{n^{\beta/(1+\beta)}}} + d \sqrt{\frac{\log n + \log d}{n^{2\beta/(1+2\beta)}}} \right), where k^* = |F^*| is the number of edges in the optimal forest F^* = \arg\min_{F\in\mathcal{F}} R(\hat{p}_F). [sent-310, score-0.969]

47 Again, we do not assume the true density p∗ (x) is consistent with a forest; rather, we are interested in comparing the estimated forest structure to the oracle forest which minimizes the risk. [sent-313, score-1.368]

48 Let \hat{F}^{(k)}_d be the estimated forest structure with the number of edges fixed at k; we want to study conditions under which P(\hat{F}^{(k)}_d = F^{(k)}_d) \to 1. [sent-318, score-0.705]

49 Let us first consider the population version of the algorithm—if the algorithm cannot recover the best forest F^{(k)}_d in this ideal case, there is no hope for stable recovery in the data version. [sent-319, score-0.644]

50 For instance, when k \ge 2 and I_e and I_{e'} are the largest two mutual information values, it is guaranteed that e and e' will both be included in the learned forest \hat{F}^{(k)}_d whether I_e > I_{e'} or I_e < I_{e'}. [sent-331, score-0.7]

51 Define the crucial set J \subset E to be a set of pairs of edges (e, e') whose relative order matters: flipping the relative order of I_e and I_{e'} changes the learned forest structure in the population Chow-Liu algorithm, with positive probability. [sent-332, score-0.739]

52 Theorem 12 (Structure selection consistency) Let F^{(k)}_d be the optimal forest within \mathcal{P}^{(k)}_d that minimizes the negative log-likelihood loss. [sent-341, score-0.726]

53 As discussed in the introduction, clustering problems motivate the goal of constructing forest structured density estimators where each connected component has a restricted number of edges. [sent-361, score-0.731]

54 Definition 15 A t-restricted forest of a graph G is a subgraph F_t such that F_t is a forest and each connected component of F_t has at most t edges. [sent-371, score-0.665]

55 Given a weight w_e assigned to each edge of G, an optimal t-restricted forest F_t^* satisfies w(F_t^*) = \max_{F\in\mathcal{F}_t(G)} w(F), where w(F) = \sum_{e\in F} w_e is the weight of a forest F and \mathcal{F}_t(G) denotes the collection of all t-restricted forests of G. [sent-377, score-1.459]

56 We show that finding a maximum weight t-restricted forest on this graph would allow us to then recover a solution to X3C by analyzing how the optimal forest must partition each of the gadgets. [sent-383, score-1.324]

57 In the first stage, a forest is greedily constructed in such a way that each node has degree no larger than t (a property that is satisfied by all t-restricted forests). [sent-386, score-0.643]

58 However, the trees in the forest may have more than t edges; hence, in the second stage, each tree in the forest is partitioned in an optimal way by removing edges, resulting in a collection of trees, each of which has size at most t. [sent-387, score-1.296]
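
A rough sketch of the two-stage scheme (stage one is a straightforward greedy with a degree cap; stage two is delegated to a hypothetical tree_partition routine standing in for the TreePartition subroutine, whose details are not reproduced here):

```python
def greedy_degree_bounded_forest(weighted_edges, d, t):
    """Stage 1: add edges heaviest-first while keeping the graph acyclic and
    every node's degree at most t (a property of all t-restricted forests)."""
    parent = list(range(d))
    degree = [0] * d

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    chosen = []
    for w, i, j in sorted(weighted_edges, reverse=True):   # (weight, i, j) triples
        if degree[i] >= t or degree[j] >= t:
            continue
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            degree[i] += 1
            degree[j] += 1
            chosen.append((i, j))
    return chosen


def approx_t_restricted_forest(weighted_edges, d, t, tree_partition):
    """Stage 2: split each tree of the stage-1 forest into trees with at most
    t edges, using a partitioning routine such as TreePartition."""
    stage_one = greedy_degree_bounded_forest(weighted_edges, d, t)
    return tree_partition(stage_one, t)
```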

59 1 Pruning Based on t-Restricted Forests For a given t, after producing an approximate maximum weight t-restricted forest Ft using D1 , we prune away edges using D2 . [sent-409, score-0.754]

60 Compute or approximate (for t ≥ 2) the optimal t-restricted forest Ft using In1 as edge weights. [sent-421, score-0.65]

61 To estimate the risk, we simply use \hat{R}_{n_2}(\hat{p}_{\hat{F}_t}) as defined in (9), and select the forest \hat{F}_{\hat{t}} according to \hat{t} = \arg\min_{t} \hat{R}_{n_2}(\hat{p}_{\hat{F}_t}). [sent-425, score-0.662]

62 Using the approximation guarantee and our previous analysis, we have that the population weights of the approximate t-restricted forest and the optimal forest satisfy the following inequality. [sent-427, score-1.254]

63 For t ≥ 2, let Ft be the forest constructed using a c-approximation algorithm, and let Ft∗ be the optimal forest; both constructed with respect to finite sample edge weights wn1 = In1 . [sent-430, score-0.65]

64 We mainly compare the forest density estimator with sparse Gaussian graphical models, fitting a multivariate Gaussian with a sparse inverse covariance matrix. [sent-435, score-0.795]

65 Since the held-out log-likelihood of the forest density estimator is indexed by the number of edges included in the tree, while the held-out log-likelihoods of the glasso and the refit glasso are indexed by a continuously varying regularization parameter, we need to find a way to calibrate them. [sent-443, score-1.19]

66 To address this issue, we plot the held-out log-likelihood of the forest density estimator as a step function indexed by the tree size. [sent-444, score-0.851]

67 The size of the forest density estimator and the sparsity level of the glasso (and the refit glasso) can then be aligned for a fair comparison. [sent-446, score-0.935]

68 We first rescale the data into [0, 1]^d and calculate the kernel density estimates on a grid of points; we choose m = 128 evaluation points x_i^{(1)} < x_i^{(2)} < \cdots < x_i^{(m)} for each dimension i, and then evaluate the bivariate and the univariate kernel density estimates on this grid. [sent-461, score-1.169]

69 There are three different kernel density estimates that we use—the bivariate kde, the univariate kde, and the marginalized bivariate kde. [sent-462, score-0.735]

70 Specifically, the bivariate kernel density estimate on (x_i, x_j) based on the observations \{(X_i^{(s)}, X_j^{(s)})\}_{s\in D_1} is defined as \hat{p}(x_i, x_j) = \frac{1}{n_1} \sum_{s\in D_1} \frac{1}{h_{2i}\, h_{2j}} K\!\left(\frac{X_i^{(s)} - x_i}{h_{2i}}\right) K\!\left(\frac{X_j^{(s)} - x_j}{h_{2j}}\right), using a product kernel. [sent-463, score-0.683]
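
A minimal sketch of this estimator evaluated on an evenly spaced grid (assuming the data have already been rescaled to [0, 1]; the Gaussian kernel and the midpoint grid are illustrative choices, not the paper's kernel or bandwidth selections):

```python
import numpy as np

def bivariate_kde_grid(xi_samples, xj_samples, h_i, h_j, m=128):
    """Product-kernel estimate of p(x_i, x_j) on an m x m grid over [0, 1]^2."""
    xi = np.asarray(xi_samples, dtype=float)
    xj = np.asarray(xj_samples, dtype=float)
    grid = (np.arange(m) + 0.5) / m                       # evaluation points in (0, 1)

    def K(u):                                             # Gaussian kernel
        return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

    Ki = K((grid[:, None] - xi[None, :]) / h_i) / h_i     # (m, n) kernel weights
    Kj = K((grid[:, None] - xj[None, :]) / h_j) / h_j     # (m, n)
    return Ki @ Kj.T / len(xi)                            # (m, m) density values
```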

71 Finally, the marginal univariate kernel density estimate \hat{p}_M(x_k), based on the observations \{X_k^{(s)}\}_{s\in D_1}, is defined by integrating the irrelevant dimension out of the bivariate kernel density estimates \hat{p}(x_j, x_k) on the unit square [0, 1]^2. [sent-478, score-0.872]

72 With the above definitions of the bivariate and univariate kernel density estimates, we consider estimating the mutual information I(X_i; X_j) in three different ways, depending on which estimates for the univariate densities are employed. [sent-480, score-0.766]
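
A sketch of the marginalization step (a simple trapezoid rule along the x_j direction of the tabulated bivariate estimate; the exact quadrature used in the paper is not reproduced here):

```python
import numpy as np

def marginalized_kde(p_jk_grid, dx):
    """Integrate x_j out of a bivariate KDE tabulated on an m x m grid.

    p_jk_grid : (m, m) array of estimated p(x_j, x_k) values; rows index x_j
    dx        : grid spacing in the x_j direction
    """
    p = np.asarray(p_jk_grid, dtype=float)
    interior = p[1:-1, :].sum(axis=0)
    return dx * (0.5 * p[0, :] + interior + 0.5 * p[-1, :])
```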

73 We see that for the Gaussian data, the refit glasso has a higher held-out log-likelihood than the forest density estimator and the glasso. [sent-525, score-0.935]

74 For very sparse models, however, the performance of the glasso is worse than that of the forest density estimator, due to the large parameter bias resulting from the ℓ1 regularization. [sent-527, score-0.891]

75 We also observe an efficiency loss in the nonparametric forest density estimator, compared to the refit glasso. [sent-528, score-0.731]

76 For the non-Gaussian data, even though the refit glasso results in a reasonable graph, the forest density estimator performs much better in terms of held-out log-likelihood risk and graph estimation accuracy. [sent-589, score-1.035]

77 Unless there is a copious amount of held-out data, held-out MST overfits on the held-out data and tends to give large graphs; in contrast, the t-restricted forest has the weakest theoretical guarantee but gives the best log-likelihood and produces sparser graphs. [sent-595, score-0.746]

78 Somewhat surprisingly though, Training-MST-with-pruning and the t-restricted forest appear to be insensitive to the held-out data size. [sent-597, score-0.678]

79 Top-left Gaussian, and top-right non-Gaussian: held-out log-likelihood plots of the forest density estimator (black step function), glasso (red stars), and refit glasso (blue circles); the vertical dashed red line indicates the size of the true graph. [sent-599, score-1.095]

80 In contrast, the held-out log-likelihood curves of the glasso and refit glasso achieve maxima when there are around 280 edges and 100 edges respectively, while their predictive estimates are still inferior to those of the tree-based kernel density estimator. [sent-613, score-0.737]

81 We estimate the full 4238 node graph using both the forest density estimator (described in Section 3.2) [sent-627, score-0.86]

82 and the Meinshausen-Bühlmann neighborhood search method as proposed in Meinshausen and Bühlmann (2006), with the regularization parameter chosen to give it about the same number of edges as the forest graph. [sent-629, score-0.705]

83 The graph estimated by the forest density method reveals one strongly connected component of more than 3000 genes, along with various isolated genes; this is consistent with the analysis in Nayak et al. [sent-633, score-0.763]

84 Conclusion We have studied forest density estimation for high dimensional data. [sent-640, score-0.752]

85 Our experimental results compared the forest density estimator to the sparse Gaussian graphical model in terms of both predictive risk and the qualitative properties of the estimated graphs for human gene expression array data. [sent-644, score-0.885]

86 Together, these results indicate that forest density estimation can be a useful tool for relaxing the normality assumption in graphical modeling. [sent-645, score-0.772]

87 The bold gray edges in the forest graph are missing from the Gaussian graph and vice versa; the thin black edges are shared by both graphs. [sent-652, score-0.864]

88 Proof of NP-hardness of t-Restricted Forest: We will reduce an instance of exact 3-cover (X3C) to an instance of finding a maximum weight t-restricted forest (t-RF). [sent-755, score-0.659]

89 If some forest F ′ is missing a nexus-junction edge, then F ′ must have weight strictly less than F, since N is larger than the sum of all of the non-nexus-junction edges. [sent-772, score-0.659]

90 7 Proof of Theorem 16 Recall that we want to show that Algorithm 2 returns a forest with weight that is at least a quarter of the weight of the optimal t-restricted forest. [sent-804, score-0.708]

91 Note that the optimal t-restricted forest Ft∗ satisfies both the constraints above, and hence the maximum weight set of edges that satisfy both the constraints above has weight at least w(Ft∗ ). [sent-806, score-0.803]

92 Recall that the first stage of Algorithm 2 greedily adds edges subject to these two constraints—the next two lemmas show that the resulting forest has weight at least (1/2) w(F_t^*). [sent-807, score-0.779]

93 Proof Let F^{**} be a maximum weight forest that obeys both constraints (a) and (b). [sent-819, score-0.659]

94 Proof Given a graph G, let F1 be the forest output by first step of Algorithm 2, and let FA be the forest outputted by the second step. [sent-826, score-1.252]

95 To prove the claim, we first show that given any tree T with edge weights and maximum degree t \ge 2, we can obtain a sub-forest F with total weight w(F) \ge (1/2) w(T), and where the number of edges in each tree of the forest F is at most t - 1. [sent-828, score-0.946]

96 Applying this procedure to each tree T in the forest F_1, we get the existence of a (t - 1)-restricted subforest F_1' \subseteq F_1 that has weight at least (1/2) w(F_1). [sent-831, score-0.775]
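
Putting the two stages together gives the stated factor of four (a restatement of the chain of inequalities implied by the surrounding lemmas, where F_1' is the (t-1)-restricted subforest above and F_A is the forest returned after the optimal partitioning step):

```latex
w(F_A) \;\ge\; w(F_1') \;\ge\; \tfrac{1}{2}\, w(F_1) \;\ge\; \tfrac{1}{2}\cdot\tfrac{1}{2}\, w(F_t^*) \;=\; \tfrac{1}{4}\, w(F_t^*).
```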

97 Recall that F ∗∗ is a maximum weight forest satisfying both (a) and (b). [sent-836, score-0.659]

98 A result of Singh and Lau (2007) implies that given any graph G with non-negative edge weights, one can find in polynomial time a forest FSL such that w(FSL ) ≥ w(F ∗∗ ) ≥ w(Ft∗ ), (33) but where the maximum degree in FSL is t + 1. [sent-837, score-0.682]

99 Now applying the procedure from the proof of Theorem 16, we get a t-restricted forest F_{SL}' whose weight is at least half of w(F_{SL}). [sent-838, score-0.659]

100 The TreePartition Subroutine: To produce the best t-restricted subforest of the forest F_1, we use a divide-and-conquer forest partition algorithm described by Lukes (1974), which we now describe in more detail. [sent-846, score-1.283]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('forest', 0.61), ('fd', 0.269), ('pf', 0.211), ('bivariate', 0.202), ('oo', 0.188), ('glasso', 0.16), ('ensity', 0.148), ('iu', 0.148), ('upta', 0.148), ('xk', 0.137), ('xi', 0.127), ('ft', 0.126), ('afferty', 0.126), ('orest', 0.126), ('density', 0.121), ('pft', 0.107), ('op', 0.105), ('wasserman', 0.105), ('univariate', 0.104), ('stimation', 0.104), ('forests', 0.101), ('qf', 0.097), ('edges', 0.095), ('mutual', 0.09), ('gadget', 0.08), ('tree', 0.076), ('dxi', 0.074), ('pd', 0.069), ('heldout', 0.068), ('xki', 0.067), ('subtree', 0.066), ('mst', 0.063), ('kernel', 0.061), ('nexus', 0.06), ('treepartition', 0.06), ('log', 0.058), ('spanning', 0.055), ('kruskal', 0.052), ('dx', 0.051), ('ie', 0.051), ('pn', 0.05), ('weight', 0.049), ('fsl', 0.047), ('risk', 0.047), ('vf', 0.046), ('estimates', 0.045), ('estimator', 0.044), ('nodes', 0.042), ('lure', 0.04), ('pfn', 0.04), ('subforest', 0.04), ('edge', 0.04), ('densities', 0.039), ('consistency', 0.038), ('held', 0.037), ('sup', 0.036), ('junction', 0.035), ('population', 0.034), ('dxk', 0.034), ('ifast', 0.034), ('imedium', 0.034), ('node', 0.033), ('graph', 0.032), ('arg', 0.032), ('ui', 0.03), ('calculate', 0.028), ('oracle', 0.027), ('qn', 0.027), ('gadgets', 0.027), ('guillou', 0.027), ('islow', 0.027), ('subfamily', 0.027), ('gin', 0.026), ('rigollet', 0.026), ('stage', 0.025), ('pm', 0.025), ('xj', 0.025), ('lemma', 0.024), ('partitions', 0.024), ('larry', 0.024), ('vert', 0.024), ('subgraph', 0.023), ('partition', 0.023), ('graphs', 0.023), ('kde', 0.023), ('cmu', 0.022), ('excess', 0.022), ('marginals', 0.021), ('estimation', 0.021), ('tan', 0.02), ('gaussian', 0.02), ('gene', 0.02), ('degf', 0.02), ('haijie', 0.02), ('nayak', 0.02), ('nolan', 0.02), ('estimate', 0.02), ('graphical', 0.02), ('microarray', 0.02), ('bandwidths', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999815 35 jmlr-2011-Forest Density Estimation

Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency

2 0.26085669 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

3 0.12173611 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding

4 0.088757373 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning

5 0.071035884 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt

Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm

6 0.066660851 14 jmlr-2011-Better Algorithms for Benign Bandits

7 0.060971323 97 jmlr-2011-Union Support Recovery in Multi-task Learning

8 0.055334948 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

9 0.053355969 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

10 0.052717019 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

11 0.046782993 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

12 0.044693451 55 jmlr-2011-Learning Multi-modal Similarity

13 0.044232868 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

14 0.042224213 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

15 0.042144116 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

16 0.041841194 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

17 0.041312996 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

18 0.040599857 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

19 0.03995372 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

20 0.037986752 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.231), (1, -0.053), (2, 0.001), (3, 0.068), (4, 0.223), (5, -0.059), (6, -0.015), (7, -0.048), (8, -0.045), (9, 0.273), (10, -0.104), (11, 0.121), (12, -0.262), (13, -0.359), (14, -0.077), (15, -0.055), (16, 0.072), (17, 0.042), (18, 0.048), (19, 0.15), (20, 0.044), (21, 0.073), (22, 0.076), (23, -0.108), (24, -0.013), (25, 0.042), (26, -0.001), (27, -0.08), (28, -0.056), (29, 0.009), (30, -0.056), (31, 0.033), (32, 0.038), (33, -0.018), (34, 0.095), (35, -0.01), (36, -0.23), (37, 0.002), (38, -0.044), (39, -0.03), (40, -0.153), (41, 0.004), (42, 0.02), (43, -0.015), (44, -0.002), (45, 0.021), (46, 0.053), (47, -0.028), (48, -0.171), (49, -0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94140112 35 jmlr-2011-Forest Density Estimation

Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency

2 0.79915977 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

3 0.41198152 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding

4 0.35213733 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning

5 0.31030282 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt

Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm

6 0.26048195 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

7 0.22358763 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

8 0.20515145 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

9 0.19182988 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

10 0.19148363 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

11 0.19140051 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

12 0.18557365 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

13 0.18226971 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

14 0.17703016 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms

15 0.1751812 97 jmlr-2011-Union Support Recovery in Multi-task Learning

16 0.1745885 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

17 0.17136854 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

18 0.16733965 14 jmlr-2011-Better Algorithms for Benign Bandits

19 0.16708601 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

20 0.16596317 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.1), (6, 0.013), (9, 0.021), (10, 0.03), (24, 0.019), (31, 0.072), (32, 0.028), (41, 0.021), (70, 0.378), (73, 0.052), (78, 0.106), (90, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79539293 35 jmlr-2011-Forest Density Estimation

Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman

Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency

2 0.79413587 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari

Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation

3 0.73836333 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Author: John Duchi, Elad Hazan, Yoram Singer

Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization

4 0.46574503 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

5 0.40755922 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

Author: Tony Jebara

Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement. Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. Furthermore, extensions to regression, graphical model structure estimation and other sparse methods are discussed. The maximum entropy optimization problems are implemented via a sequential quadratic programming method which leverages recent progress in fast SVM solvers. Fast monotonic convergence bounds are provided by bounding the MED sparsifying cost function with a quadratic function and ensuring only a constant factor runtime increase above standard independent SVM solvers. Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods. Keywords: meta-learning, support vector machines, feature selection, kernel selection, maximum entropy, large margin, Bayesian methods, variational bounds, classification, regression, Lasso, graphical model structure estimation, quadratic programming, convex programming

6 0.40748972 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

7 0.40416688 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

8 0.3991583 91 jmlr-2011-The Sample Complexity of Dictionary Learning

9 0.39791599 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

10 0.39376822 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

11 0.39284757 59 jmlr-2011-Learning with Structured Sparsity

12 0.39136168 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

13 0.39083692 12 jmlr-2011-Bayesian Co-Training

14 0.38865224 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

15 0.38067335 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

16 0.38028625 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

17 0.37992319 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

18 0.37878177 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

19 0.37834242 54 jmlr-2011-Learning Latent Tree Graphical Models

20 0.3783212 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets