nips nips2012 nips2012-7 knowledge-graph by maker-knowledge-mining

7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation


Source: pdf

Author: Cho-jui Hsieh, Arindam Banerjee, Inderjit S. Dhillon, Pradeep K. Ravikumar

Abstract: We consider the composite log-determinant optimization problem, arising from the ℓ1-regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost of solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., the solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We consider the composite log-determinant optimization problem, arising from the ℓ1-regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. [sent-14, score-0.479]

2 Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. [sent-18, score-0.497]

3 Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. [sent-19, score-0.307]

4 For the conquer step, we use the approximate solution, i.e. [sent-20, score-0.439]

5 , the solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure. [sent-22, score-0.269]

6 An important problem is that of recovering the covariance matrix, or its inverse, given the samples in a high-dimensional regime where n ≪ p, and p could number in the tens of thousands. [sent-27, score-0.272]

7 The key focus in this paper is on developing computationally efficient methods to solve this composite log-determinant optimization problem. [sent-30, score-0.094]

8 For instance, in a climate application, if we are modeling a GMRF over random variables corresponding to each Earth grid point, the number of nodes can easily number in the tens of thousands. [sent-34, score-0.291]

9 As we show, our resulting divide and conquer procedure produces overwhelming improvements in computational efficiency. [sent-39, score-0.676]

10 A major drawback of the approach of [8], however, is that often the decomposition of the thresholded sample covariance matrix can be very unbalanced — indeed, in many of our real-life examples, we found that the decomposition resulted in one giant component and several very small components. [sent-43, score-0.514]

11 In these cases, the approach in [8] is only a bit faster than directly solving the entire problem. [sent-44, score-0.094]

12 Suppose we are given a particular partitioning, and solve the sub-problems specified by the partition components. [sent-46, score-0.183]

13 However, can we use bounds on the deviation to propose a clustering criterion? [sent-48, score-0.229]

14 Based on this bound, we propose a normalized-cut spectral clustering algorithm to minimize the off-diagonal error, which is able to find a balanced partition such that Θ̄ is very close to Θ∗. [sent-50, score-0.501]

15 Interestingly, we show that this clustering criterion can also be motivated as leveraging a property more general than that in [8] of the ℓ1-regularized MLE (1). [sent-51, score-0.229]

16 For example, our algorithm can achieve an accurate solution for the climate data problem in 1 hour, whereas directly solving it takes 10 hours. [sent-54, score-0.278]

17 In section 2, we outline the standard skeleton of a divide and conquer framework for GMRF estimation. [sent-55, score-0.744]

18 The key step in such a framework is to come up with a suitable and efficient clustering criterion. [sent-56, score-0.229]

19 In Section 3, we then outline our clustering criteria. [sent-57, score-0.254]

20 Here V = {1, …, p} is the node set, and Xij is the weighted link between node i and node j. [sent-63, score-0.108]

21 We will use {Vc}_{c=1}^k to denote a disjoint partitioning of the node set V, and each Vc will be called a partition or a cluster. [sent-64, score-0.205]

22 Given a partition {Vc}_{c=1}^k, our divide and conquer algorithm first solves the GMRF problem for all node partitions to get the inverse covariance matrices {Θ^(c)}_{c=1}^k, and then uses the block-diagonal matrix Θ̄ = diag(Θ^(1), …, Θ^(k)) defined in (2). [sent-65, score-1.277]
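
As an editorial illustration of this assembly step (equation (2)), here is a minimal NumPy sketch; the function name and the representation of the partition as a list of index arrays are our own assumptions, not the authors' code:

```python
import numpy as np

def assemble_block_diagonal(theta_blocks, clusters, p):
    """Scatter the per-cluster estimates Theta^(c) into a p x p matrix that is
    block-diagonal under the given node partition {V_c} (cf. eq. (2))."""
    theta_bar = np.zeros((p, p))
    for theta_c, idx in zip(theta_blocks, clusters):
        idx = np.asarray(idx)
        theta_bar[np.ix_(idx, idx)] = theta_c  # fill the diagonal block V_c x V_c
    return theta_bar
```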

23 Notice that in our framework any sparse inverse covariance solver can be used; however, in this paper we will focus on using the state-of-the-art method QUIC [6] as the base solver, which was shown to have super-linear convergence when close to the solution. [sent-85, score-0.437]

24 The skeleton of the divide and conquer framework is quite simple and is summarized in Algorithm 1. [sent-87, score-0.719]

25 In order that Algorithm 1 be efficient, we require that Θ̄ defined in (2) should be close to the optimal solution of the original problem, Θ∗. [sent-88, score-0.086]

26 Based on this bound, we propose a spectral clustering algorithm to find an effective partitioning of the nodes. [sent-90, score-0.345]

27 Algorithm 1: Divide and Conquer method for Sparse Inverse Covariance Estimation. Input: empirical covariance matrix S, scalar λ. Output: Θ∗, the solution of (1). Step 1: obtain a partition of the nodes {Vc}_{c=1}^k; Step 2: for c = 1, …, k, solve the sub-problem on Vc to obtain Θ^(c); [sent-91, score-0.495]

28 Step 3: form Θ̄ from Θ^(1), …, Θ^(k) as in (2); Step 4: use Θ̄ as an initial point to solve the whole problem (1). [sent-97, score-0.117]
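
A hedged Python sketch of this skeleton is shown below. The callables `partition_nodes`, `solve_sub`, and `solve_full` are placeholders of ours: in the paper the base solver is QUIC and the partition comes from the clustering criterion of Section 3, but any sparse inverse covariance solver that accepts a warm start could be plugged in.

```python
import numpy as np

def divide_and_conquer(S, lam, k, partition_nodes, solve_sub, solve_full):
    """Sketch of Algorithm 1.

    S                          : p x p empirical covariance matrix
    lam                        : regularization scalar lambda
    partition_nodes(S, lam, k) -> list of index arrays {V_c}
    solve_sub(S_c, lam)        -> Theta^(c), sub-problem solution on V_c
    solve_full(S, lam, Theta0) -> Theta*, base solver warm-started at Theta0
    """
    p = S.shape[0]
    clusters = partition_nodes(S, lam, k)          # divide step
    theta_bar = np.zeros((p, p))
    for idx in clusters:                           # solve each sub-problem
        idx = np.asarray(idx)
        theta_bar[np.ix_(idx, idx)] = solve_sub(S[np.ix_(idx, idx)], lam)
    return solve_full(S, lam, theta_bar)           # conquer: warm start from Theta_bar
```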

29 For solving subproblems we can again apply a divide and conquer algorithm. [sent-101, score-0.723]

30 In this way, the initial time can be much less than O(p^3/k^2) if we use the divide and conquer algorithm hierarchically for each level. [sent-102, score-0.707]

31 3 Main Results: Clustering Criteria for GMRF. This section outlines the main contribution of this paper: coming up with suitable and efficient clustering criteria for use within the divide and conquer framework of the previous section. [sent-104, score-0.466]

32 For any λ > 0 and a given partition {Vc}_{c=1}^k, if |Sij| ≤ λ for all i, j in different partitions, then Θ∗ = Θ̄, where Θ∗ is the optimal solution of (1) and Θ̄ is as defined in (2). [sent-109, score-0.182]

33 As a consequence, if a partition {Vc}_{c=1}^k satisfies the assumption of Theorem 1, Θ̄ and Θ∗ will be the same, and the last step of Algorithm 1 is not needed anymore. [sent-110, score-0.125]

34 However, in most real examples, a perfect partitioning as in Theorem 1 does not exist, which motivates a divide and conquer framework that does not need as stringent assumptions as in Theorem 1. [sent-112, score-0.753]

35 To allow a more general relationship between Θ∗ and Θ̄, we first prove a similar property for the following generalized inverse covariance problem: Θ∗ = arg min_{Θ ≻ 0} { − log det Θ + tr(SΘ) + Σ_{ij} Λij |Θij| } = arg min_{Θ ≻ 0} f_Λ(Θ). (3) [sent-113, score-0.807]

36 Consider the dual problem of (3): max_W log det W s.t. |Wij − Sij| ≤ Λij for all i, j. (4) [sent-121, score-0.445]

37 To show that W̄ is the optimal solution of (4) with the objective function value Σ_{c=1}^k log det W̄^(c), we consider an arbitrary feasible solution Ŵ. [sent-124, score-0.575]

38 From Fischer’s inequality [2], det Ŵ ≤ Π_{c=1}^k det Ŵ^(c) for Ŵ ≻ 0. [sent-125, score-0.650]

39 Since W̄^(c) is the optimizer of the c-th block, det W̄^(c) ≥ det Ŵ^(c) for all c, which implies log det Ŵ ≤ log det W̄. [sent-126, score-1.382]

40 We start by choosing a matrix regularization weight Λ̄ such that Λ̄ij = λ if i, j are in the same cluster, and Λ̄ij = max(|Sij|, λ) if i, j are in different clusters. (5) [sent-130, score-0.105]

41 Now consider the generalized inverse covariance problem (3) with this specified Λ̄. [sent-131, score-0.336]
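
The relaxed weight matrix in (5), as reconstructed above, is straightforward to build; in this sketch `labels` is a length-p vector of cluster assignments, an assumption of ours:

```python
import numpy as np

def relaxed_weights(S, lam, labels):
    """Lambda_bar of eq. (5): lambda within clusters, max(|S_ij|, lambda)
    across clusters, so every cross-cluster entry is regularized hard enough
    to be driven to zero in the relaxed problem."""
    labels = np.asarray(labels)
    same_cluster = labels[:, None] == labels[None, :]
    return np.where(same_cluster, lam, np.maximum(np.abs(S), lam))
```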

42 For each c ∈ {1, …, k}, the subproblem has the following form: Θ^(c) = arg min_{Θ ≻ 0} { − log det Θ + tr(S^(c) Θ) + λ ‖Θ‖1 }, where S^(c) is the sample covariance matrix of cluster c. [sent-135, score-0.725]

43 Therefore, Θ̄ is the optimal solution of problem (3) with Λ̄ as the regularization parameter. [sent-136, score-0.091]

44 Based on this observation, we will now provide another view of our divide and conquer algorithm as follows. [sent-137, score-0.676]

45 Considering the dual problem of the sparse inverse covariance estimation with the weighted regularization defined in (4), Algorithm 1 can be seen to solve (4) with Λ = Λ̄ defined in (5) to get the initial point W̄, and then solve (4) with Λij = λ for all elements. [sent-138, score-0.629]

46 Therefore, we initially solve the problem with looser bound constraints to get an initial guess, and then solve the problem with tighter constraints. [sent-139, score-0.211]

47 For convenience, we use P^λ to denote the original dual problem (4) with Λij = λ for all elements, and P^Λ̄ to denote the relaxed dual problem with the different edge weights defined in (5). [sent-142, score-0.125]

48 Based on the above discussion, W∗ = (Θ∗)^{-1} is the solution of P^λ and W̄ = Θ̄^{-1} is the solution of P^Λ̄. [sent-143, score-0.114]

49 In this case ‖E‖F measures how much the off-diagonal elements exceed the threshold λ, and a good clustering algorithm should be able to find a partition that minimizes ‖E‖F. [sent-147, score-0.39]

50 If A is a positive definite matrix and there exists a γ > 0 such that ‖A^{-1}B‖2 ≤ 1 − γ, then log det(A + B) ≥ log det A − (p/(γ σmin(A))) ‖B‖F. (9) [sent-152, score-0.087]

51 Since P^Λ̄ has a more relaxed bound constraint than P^λ, W̄ may not be a feasible solution of P^λ. [sent-153, score-0.134]

52 However, we can construct a feasible solution Ŵ = W̄ − G ◦ E, where Gij = sign(W̄ij) and ◦ indicates the entrywise product of two matrices. [sent-154, score-0.102]

53 From Lemma 1 we have log det Ŵ ≥ log det W̄ − (p/(γ σmin(W̄))) ‖E‖F. [sent-156, score-0.834]

54 Since W∗ is the optimal solution of P^λ and Ŵ is a feasible solution of P^λ, log det W∗ ≥ log det Ŵ ≥ log det W̄ − (p/(γ σmin(W̄))) ‖E‖F. [sent-157, score-1.155]

55 Also, since W̄ is the optimal solution of P^Λ̄ and W∗ is a feasible solution of P^Λ̄, we have log det W∗ < log det W̄. [sent-158, score-0.891]

56 Therefore, |log det W̄ − log det W∗| < (p/(γ σmin(W̄))) ‖E‖F. [sent-159, score-0.732]

57 To establish the bound on Θ, we use the mean value theorem again with g(W) = W^{-1} = Θ and g′(W) = Θ ⊗ Θ, where ⊗ is the Kronecker product. [sent-161, score-0.087]

58 Assume the real inverse covariance matrix Θ∗ is block-diagonal, then it is easy to show that W ∗ is also block-diagonal. [sent-165, score-0.44]

59 Assume Θ∗ = Θ^bd + v ei ej^T, where Θ^bd is the block-diagonal part of Θ∗ and ei denotes the i-th standard basis vector; then from the Sherman-Morrison formula, W∗ = (Θ∗)^{-1} = (Θ^bd)^{-1} − (v / (1 + v ((Θ^bd)^{-1})ij)) θ^bd_i (θ^bd_j)^T, where θ^bd_i is the i-th column vector of (Θ^bd)^{-1}. [sent-168, score-0.171]

60 Therefore, adding one off-diagonal element to Θ^bd will introduce at most one nonzero off-diagonal block in W. [sent-169, score-0.157]
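
A quick numerical check of this claim on a toy example (our own construction: two diagonal blocks and a single rank-one off-diagonal perturbation, as in the Sherman-Morrison argument above):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 6]                                  # two diagonal blocks
p = sum(sizes)

# Random block-diagonal positive definite Theta^bd.
theta_bd = np.zeros((p, p))
start = 0
for s in sizes:
    A = rng.standard_normal((s, s))
    theta_bd[start:start + s, start:start + s] = A @ A.T + s * np.eye(s)
    start += s

# Add one off-diagonal element: Theta* = Theta^bd + v * e_i e_j^T.
i, j, v = 1, 7, 0.3                             # i in block 1, j in block 2
theta_star = theta_bd.copy()
theta_star[i, j] += v

W_bd, W_star = np.linalg.inv(theta_bd), np.linalg.inv(theta_star)
rows, cols = np.nonzero(np.abs(W_star - W_bd) > 1e-10)
# The change is confined to the single (block 1, block 2) off-diagonal block.
print(set(rows) <= set(range(4)), set(cols) <= set(range(4, 10)))  # expected: True True
```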

61 Moreover, if block (i, j) of W is already nonzero, adding more elements in block (i, j) of Θ will not introduce any more nonzero blocks in W . [sent-170, score-0.223]

62 As long as only a few entries in the off-diagonal blocks of Θ∗ are nonzero, W will be nearly block-diagonal, with only a few nonzero off-diagonal blocks. [sent-171, score-0.091]

63 Since ‖W∗ − S‖∞ ≤ λ, we are able to use the thresholded matrix S^λ to guess the clustering structure of Θ∗. [sent-172, score-0.331]

64 From (8), ideally we want to find a partition to minimize ‖E‖F. [sent-174, score-0.161]

65 To find a partition minimizing ‖E‖F, we want to find a partition {Vc}_{c=1}^k such that the sum of off-diagonal block entries of S^λ is minimized, where S^λ is defined as (S^λ)ij = max(|Sij| − λ, 0)^2 for all i ≠ j, and (S^λ)ii = 0 for all i. (10) [sent-176, score-0.316]
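
Computing the thresholded matrix in (10) is a one-liner in NumPy; the function name below is our own:

```python
import numpy as np

def thresholded_matrix(S, lam):
    """S^lambda of eq. (10): squared excess of |S_ij| over lambda off the
    diagonal, and zero on the diagonal."""
    S_lam = np.maximum(np.abs(S) - lam, 0.0) ** 2
    np.fill_diagonal(S_lam, 0.0)
    return S_lam
```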

66 Therefore, we minimize the following normalized cut objective value [10]: NCut(S^λ, {Vc}_{c=1}^k) = Σ_{c=1}^k ( Σ_{i∈Vc, j∉Vc} (S^λ)ij ) / d(Vc), where d(Vc) = Σ_{i∈Vc} Σ_{j=1}^p (S^λ)ij. [sent-178, score-0.159]
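
For reference, the normalized cut value of a given partition can be evaluated directly from this definition (a sketch under the reconstruction above; `labels` is a cluster assignment vector of our own devising):

```python
import numpy as np

def ncut_value(S_lam, labels):
    """Normalized cut of S^lambda: for each cluster, the weight of edges
    leaving the cluster divided by the cluster degree d(V_c)."""
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        cut = S_lam[np.ix_(in_c, ~in_c)].sum()   # off-cluster weight from V_c
        degree = S_lam[in_c, :].sum()            # d(V_c)
        total += cut / degree if degree > 0 else 0.0
    return total
```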

67 As shown in [10, 3], minimizing the normalized cut is equivalent to finding cluster indicators x1, …, [sent-180, score-0.216]

68 xk that solve: min_x Σ_{c=1}^k xc^T (D − S^λ) xc / (xc^T D xc) = trace(Y^T (I − D^{-1/2} S^λ D^{-1/2}) Y), (12) where D is a diagonal matrix with Dii = Σ_{j=1}^p (S^λ)ij, Y = D^{1/2} X, and X = [x1 … xk]. [sent-183, score-0.167]

69 Therefore, a common way of getting cluster indicators is to compute the leading k eigenvectors of D^{-1/2} S^λ D^{-1/2} and then run k-means on these eigenvectors. [sent-187, score-0.164]

70 The time complexity of normalized cut on S^λ mainly comes from computing the leading k eigenvectors of D^{-1/2} S^λ D^{-1/2}, which is at most O(p^3). [sent-188, score-0.161]
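
A compact sketch of this spectral step is given below; scikit-learn's KMeans is used as a stand-in for the final assignment (the authors use their own pipeline, and Graclus as a faster alternative), and the small guard for isolated nodes is our addition:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(S_lam, k, seed=0):
    """Normalized-cut relaxation: leading k eigenvectors of
    D^{-1/2} S^lambda D^{-1/2}, then k-means on their rows."""
    d = S_lam.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))    # guard isolated nodes
    M = d_inv_sqrt[:, None] * S_lam * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(M)                 # eigenvalues in ascending order
    Y = eigvecs[:, -k:]                                  # leading k eigenvectors
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y)
    return [np.where(labels == c)[0] for c in range(k)]
```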

71 Since most state-of-the-art methods for solving (1) require O(p3 ) per iteration, the cost for clustering is no more than one iteration for the original solver. [sent-189, score-0.305]

72 If S λ is sparse, as is common in real situations, we could speed up the clustering phase by using the Graclus multilevel algorithm, which is a faster heuristic to minimize normalized cut [3]. [sent-190, score-0.511]

73 4 Experimental Results. In this section, we first show that the normalized cut criterion for the thresholded matrix S^λ in (10) can capture the block-diagonal structure of the inverse covariance matrix before solving (1). [sent-191, score-0.828]

74 Using the clustering results, we show that our divide and conquer algorithm significantly reduces the time needed for solving the sparse inverse covariance estimation problem. [sent-192, score-1.366]

75 Climate: This dataset is generated from NCEP/NCAR Reanalysis data, with focus on the daily temperature at several grid points on Earth. [sent-196, score-0.134]

76 We treat each grid point as a random variable, and use daily temperature in year 2001 as features. [sent-197, score-0.107]

77 Synthetic: We generated synthetic data containing 20,000 nodes with 100 randomly generated group centers µ1, …. [sent-202, score-0.094]

78 4.1 Clustering quality on real datasets. Given a clustering partition {Vc}_{c=1}^k, we use the following “within-cluster ratio” to determine its performance on Θ∗: R({Vc}_{c=1}^k) = ( Σ_{c=1}^k Σ_{i≠j: i,j∈Vc} (Θ∗ij)^2 ) / ( Σ_{i≠j} (Θ∗ij)^2 ). (13) [sent-209, score-0.387]
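
The within-cluster ratio (13), as reconstructed above (the normalization by the total off-diagonal mass is our reading of the ratio), can be computed with a few lines of NumPy; `labels` is again a cluster assignment vector of our own:

```python
import numpy as np

def within_cluster_ratio(Theta, labels):
    """Eq. (13): fraction of the squared off-diagonal mass of Theta that
    falls inside the clusters (higher is better)."""
    labels = np.asarray(labels)
    sq = np.asarray(Theta, dtype=float) ** 2
    np.fill_diagonal(sq, 0.0)
    same = labels[:, None] == labels[None, :]
    return sq[same].sum() / sq.sum()
```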

79 We can see that our proposed clustering method, spectral clustering on S^λ, is very close to the clustering based on Θ̂ = Θ∗ ◦ Θ∗, which we cannot see before solving (1). [sent-220, score-0.505]

80 Higher values of R({Vc}_{c=1}^k) are indicative of better performance of the clustering algorithm. [sent-253, score-0.93]

81 Earlier, we presented theoretical justification for using normalized cut on the thresholded matrix S^λ. [sent-255, score-0.308]

82 Table 2 shows the within-cluster ratios (13) of the inverse covariance matrix using different clustering methods. [sent-257, score-0.667]

83 We include the following methods in our comparison: • Random partition: partition the nodes randomly into k clusters. [sent-258, score-0.172]

84 • Spectral clustering on thresholded matrix S λ : Our proposed method. [sent-260, score-0.414]

85 • Spectral clustering on Θ̂ = Θ∗ ◦ Θ∗, which is the element-wise square of Θ∗: this is the best clustering method we can conduct, as it directly optimizes the within-cluster ratio of the Θ∗ matrix. [sent-261, score-0.458]

86 We can observe in Table 2 that our proposed spectral clustering on S λ achieves almost the same performance as spectral clustering on Θ∗ ◦ Θ∗ even though we do not know Θ∗ . [sent-263, score-0.602]

87 Also, Figure 1 gives a pictorial view of how our clustering results help in recovering the sparse inverse covariance matrix at different levels. [sent-264, score-0.718]

88 We run a hierarchical 2-way clustering on the Leukemia dataset, and plot the original Θ∗ (the solution of (1)), Θ̄ with 1-level clustering, and Θ̄ with 2-level clustering. [sent-265, score-0.517]

89 We can see that although our clustering method does not look at Θ∗, the clustering result matches the nonzero pattern of Θ∗ pretty well. [sent-266, score-0.589]

90 4.2 The performance of our divide and conquer algorithm. Next, we investigate the time taken by our divide and conquer algorithm on large real and synthetic datasets. [sent-268, score-1.432]

91 Figure 1: The clustering results and the nonzero patterns of inverse covariance matrix Θ∗ on Leukemia dataset. [sent-273, score-0.727]

92 Although our clustering method does not look at Θ∗ , the clustering results match the nonzero pattern in Θ∗ pretty well. [sent-274, score-0.589]

93 • QUIC: The original QUIC, which is a state-of-the-art second-order solver for sparse inverse covariance estimation [6]. [sent-278, score-0.3]

94 • QUIC-conn: Using the decomposition method described in [8] and using QUIC to solve each smaller sub-problem. [sent-279, score-0.1]

95 We can see that on the largest synthetic dataset, DC-QUIC is more than 10 times faster than QUIC, and thus also faster than Glasso and ALM. [sent-286, score-0.141]

96 For the largest real dataset: Climate with more than 10,000 points, QUIC takes more than 10 hours to get a reasonable solution (relative error=0), while DC-QUIC-3 converges in 1 hour. [sent-287, score-0.123]

97 Moreover, on these 4 datasets, QUIC-conn using the decomposition method of [8] provides limited savings, in part because the connected components of the thresholded covariance matrix for each dataset turned out to consist of one giant component and multiple smaller components. [sent-288, score-0.499]

98 Acknowledgements We would like to thank Soumyadeep Chatterjee and Puja Das for help with the climate and stock data. [sent-290, score-0.258]

99 An inexact interior point method for ℓ1-regularized sparse covariance selection. [sent-363, score-0.244]

100 Exact covariance thresholding into connected components for large-scale graphical lasso. [sent-368, score-0.195]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('conquer', 0.439), ('vc', 0.331), ('det', 0.325), ('divide', 0.237), ('clustering', 0.229), ('quic', 0.217), ('covariance', 0.195), ('sij', 0.178), ('climate', 0.174), ('bd', 0.171), ('leukemia', 0.15), ('inverse', 0.141), ('partition', 0.125), ('gmrf', 0.116), ('thresholded', 0.114), ('nonzero', 0.091), ('cut', 0.087), ('stock', 0.084), ('ij', 0.075), ('spectral', 0.072), ('matrix', 0.071), ('block', 0.066), ('cluster', 0.063), ('solve', 0.058), ('solution', 0.057), ('theorem', 0.056), ('mle', 0.055), ('solver', 0.052), ('daily', 0.051), ('xc', 0.051), ('giant', 0.05), ('sparse', 0.049), ('solving', 0.047), ('faster', 0.047), ('max', 0.047), ('synthetic', 0.047), ('nodes', 0.047), ('feasible', 0.045), ('min', 0.045), ('tens', 0.044), ('partitioning', 0.044), ('inderjit', 0.043), ('multilevel', 0.043), ('skeleton', 0.043), ('banerjee', 0.043), ('decomposition', 0.042), ('dhillon', 0.042), ('partitions', 0.042), ('log', 0.041), ('texas', 0.04), ('pretty', 0.04), ('balanced', 0.039), ('glasso', 0.038), ('eigenvectors', 0.038), ('minimize', 0.036), ('composite', 0.036), ('node', 0.036), ('normalized', 0.036), ('linearization', 0.036), ('savings', 0.036), ('hsieh', 0.035), ('regularization', 0.034), ('real', 0.033), ('get', 0.033), ('recovering', 0.033), ('acknowledges', 0.033), ('conduct', 0.033), ('relaxed', 0.032), ('dual', 0.032), ('austin', 0.031), ('guess', 0.031), ('initial', 0.031), ('ratios', 0.031), ('bound', 0.031), ('temperature', 0.03), ('arg', 0.03), ('hierarchical', 0.03), ('estimator', 0.03), ('cuts', 0.03), ('indicators', 0.03), ('original', 0.029), ('wij', 0.029), ('estimation', 0.029), ('regularized', 0.028), ('whole', 0.028), ('dataset', 0.027), ('grid', 0.026), ('field', 0.026), ('tr', 0.026), ('gene', 0.025), ('outline', 0.025), ('arindam', 0.025), ('blockdiagonal', 0.025), ('chatterjee', 0.025), ('classication', 0.025), ('loh', 0.025), ('lution', 0.025), ('pdimensional', 0.025), ('pradeepr', 0.025), ('timal', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

Author: Cho-jui Hsieh, Arindam Banerjee, Inderjit S. Dhillon, Pradeep K. Ravikumar

Abstract: We consider the composite log-determinant optimization problem, arising from the ℓ1-regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost of solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., the solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure.

2 0.28949249 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki

Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log ε) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.

3 0.16658595 69 nips-2012-Clustering Sparse Graphs

Author: Yudong Chen, Sujay Sanghavi, Huan Xu

Abstract: We develop a new algorithm to cluster sparse unweighted graphs – i.e. partition the nodes into disjoint clusters so that there is higher density within clusters, and low across clusters. By sparsity we mean the setting where both the in-cluster and across cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized differently. We analyze our algorithm’s performance on the natural, classical and widely studied “planted partition” model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well.

4 0.15975356 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

Author: Po-ling Loh, Martin J. Wainwright

Abstract: We investigate a curious relationship between the structure of a discrete graphical model and the support of the inverse of a generalized covariance matrix. We show that for certain graph structures, the support of the inverse covariance matrix of indicator variables on the vertices of a graph reflects the conditional independence structure of the graph. Our work extends results that have previously been established only in the context of multivariate Gaussian graphical models, thereby addressing an open question about the significance of the inverse covariance matrix of a non-Gaussian distribution. Based on our population-level results, we show how the graphical Lasso may be used to recover the edge structure of certain classes of discrete graphical models, and present simulations to verify our theoretical results.

5 0.1467246 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen

Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem.

6 0.13596085 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

7 0.12950288 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

8 0.12768671 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

9 0.10840738 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

10 0.10741543 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

11 0.10732564 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

12 0.10236045 254 nips-2012-On the Sample Complexity of Robust PCA

13 0.10091404 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

14 0.099658892 291 nips-2012-Reducing statistical time-series problems to binary classification

15 0.099089831 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

16 0.098868564 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

17 0.09483508 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

18 0.093838893 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

19 0.083220616 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

20 0.08109495 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.208), (1, 0.083), (2, 0.096), (3, -0.135), (4, -0.017), (5, 0.039), (6, -0.037), (7, -0.122), (8, -0.054), (9, -0.04), (10, 0.025), (11, -0.184), (12, -0.077), (13, 0.037), (14, 0.129), (15, 0.024), (16, -0.071), (17, 0.098), (18, -0.062), (19, -0.091), (20, -0.044), (21, -0.014), (22, 0.023), (23, 0.077), (24, 0.029), (25, -0.123), (26, 0.13), (27, -0.103), (28, -0.044), (29, -0.048), (30, 0.042), (31, 0.058), (32, 0.012), (33, -0.133), (34, 0.063), (35, -0.08), (36, -0.226), (37, 0.07), (38, -0.07), (39, -0.03), (40, 0.087), (41, 0.126), (42, -0.039), (43, -0.077), (44, -0.072), (45, -0.02), (46, 0.01), (47, -0.107), (48, 0.075), (49, -0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94600356 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

Author: Cho-jui Hsieh, Arindam Banerjee, Inderjit S. Dhillon, Pradeep K. Ravikumar

Abstract: We consider the composite log-determinant optimization problem, arising from the ℓ1-regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost of solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., the solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure.

2 0.69973594 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki

Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log ε) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.

3 0.66736633 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen

Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem.

4 0.60168082 69 nips-2012-Clustering Sparse Graphs

Author: Yudong Chen, Sujay Sanghavi, Huan Xu

Abstract: We develop a new algorithm to cluster sparse unweighted graphs – i.e. partition the nodes into disjoint clusters so that there is higher density within clusters, and low across clusters. By sparsity we mean the setting where both the in-cluster and across cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized differently. We analyze our algorithm’s performance on the natural, classical and widely studied “planted partition” model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well.

5 0.59600645 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

Author: Piyush Rai, Abhishek Kumar, Hal Daume

Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method.

6 0.58657396 254 nips-2012-On the Sample Complexity of Robust PCA

7 0.58568621 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

8 0.58031005 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

9 0.5681749 26 nips-2012-A nonparametric variable clustering model

10 0.55510527 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

11 0.5465703 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

12 0.50711441 291 nips-2012-Reducing statistical time-series problems to binary classification

13 0.49667028 327 nips-2012-Structured Learning of Gaussian Graphical Models

14 0.47759643 155 nips-2012-Human memory search as a random walk in a semantic network

15 0.44866782 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters

16 0.44737333 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

17 0.4399136 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

18 0.43338931 125 nips-2012-Factoring nonnegative matrices with linear programs

19 0.42307213 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

20 0.41932833 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.402), (38, 0.162), (39, 0.014), (54, 0.019), (55, 0.014), (74, 0.027), (76, 0.177), (80, 0.064), (92, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96475255 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

Author: Michael Paul, Mark Dredze

Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors.

2 0.92399138 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

Author: Aaron Dennis, Dan Ventura

Abstract: The sum-product network (SPN) is a recently-proposed deep model consisting of a network of sum and product nodes, and has been shown to be competitive with state-of-the-art deep models on certain difficult tasks such as image completion. Designing an SPN network architecture that is suitable for the task at hand is an open question. We propose an algorithm for learning the SPN architecture from data. The idea is to cluster variables (as opposed to data instances) in order to identify variable subsets that strongly interact with one another. Nodes in the SPN network are then allocated towards explaining these interactions. Experimental evidence shows that learning the SPN architecture significantly improves its performance compared to using a previously-proposed static architecture.

3 0.9024018 282 nips-2012-Proximal Newton-type methods for convex optimization

Author: Jason Lee, Yuekai Sun, Michael Saunders

Abstract: We seek to solve convex optimization problems in composite form: minimize_{x∈R^n} f(x) := g(x) + h(x), where g is convex and continuously differentiable and h : R^n → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics.

4 0.89717913 233 nips-2012-Multiresolution Gaussian Processes

Author: David B. Dunson, Emily B. Fox

Abstract: We propose a multiresolution Gaussian process to capture long-range, nonMarkovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.

5 0.89133656 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

Author: Hyunsin Park, Sungrack Yun, Sanghyuk Park, Jongmin Kim, Chang D. Yoo

Abstract: For phoneme classification, this paper describes an acoustic model based on the variational Gaussian process dynamical system (VGPDS). The nonlinear and nonparametric acoustic model is adopted to overcome the limitations of classical hidden Markov models (HMMs) in modeling speech. The Gaussian process prior on the dynamics and emission functions respectively enable the complex dynamic structure and long-range dependency of speech to be better represented than that by an HMM. In addition, a variance constraint in the VGPDS is introduced to eliminate the sparse approximation error in the kernel matrix. The effectiveness of the proposed model is demonstrated with three experimental results, including parameter estimation and classification performance, on the synthetic and benchmark datasets.

6 0.88369024 192 nips-2012-Learning the Dependency Structure of Latent Factors

same-paper 7 0.86862004 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

8 0.79139662 12 nips-2012-A Neural Autoregressive Topic Model

9 0.77804518 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

10 0.75132638 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

11 0.70767963 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

12 0.70172268 78 nips-2012-Compressive Sensing MRI with Wavelet Tree Sparsity

13 0.69975084 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

14 0.69557858 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

15 0.69484293 166 nips-2012-Joint Modeling of a Matrix with Associated Text via Latent Binary Features

16 0.68603033 69 nips-2012-Clustering Sparse Graphs

17 0.67820412 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

18 0.67664838 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

19 0.67573094 150 nips-2012-Hierarchical spike coding of sound

20 0.67230785 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models