nips nips2010 nips2010-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rina Foygel, Mathias Drton
Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. [sent-3, score-0.276]
2 For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. [sent-4, score-0.53]
3 In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. [sent-5, score-0.474]
4 1 Introduction This paper is concerned with the problem of model selection (or structure learning) in Gaussian graphical modelling. [sent-8, score-0.191]
5 A Gaussian graphical model for a random vector X = (X1, . . . , Xp) is determined by an undirected graph G on the vertex set {1, . . . , p}. [sent-9, score-0.125]
6 The model comprises all multivariate normal distributions N (µ, Θ−1 ) whose inverse covariance matrix satisfies that Θjk = 0 when {j, k} is not an edge in G. [sent-13, score-0.24]
7 In many applications, in particular in the analysis of gene expression data, inference of the graph G is of significant interest. [sent-15, score-0.099]
8 They provide the objective to be minimized in (heuristic) searches over the space of graphs and are sometimes used to select tuning parameters in other methods such as the graphical lasso of [2]. [sent-17, score-0.297]
9 In this work we study an extended Bayesian information criterion (BIC) for Gaussian graphical models. [sent-18, score-0.297]
10 Given a sample of n independent and identically distributed observations, this criterion takes the form BICγ(E) = −2 ln(Θ̂(E)) + |E| log n + 4|E|γ log p, (1) where E is the edge set of a candidate graph and ln(Θ̂(E)) denotes the maximized log-likelihood function of the associated model. [sent-19, score-0.706]
11 (In this context an edge set comprises unordered pairs {j, k} of distinct elements in {1, . . . , p}.) [sent-20, score-0.099]
12 The criterion is indexed by a parameter γ ∈ [0, 1]; see the Bayesian interpretation of γ given in [3]. [sent-24, score-0.139]
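As a concrete illustration of criterion (1), the sketch below computes the EBIC score from the maximized log-likelihood of a candidate edge set; the function and its argument names are our own illustrative choices, not code from the paper.

```python
import numpy as np

def ebic(loglik, num_edges, n, p, gamma):
    """Extended BIC of Eq. (1): -2*loglik + |E|*log(n) + 4*|E|*gamma*log(p).

    loglik   : maximized Gaussian log-likelihood of the model with edge set E
    num_edges: |E|, the number of edges of the candidate graph
    n, p     : sample size and number of variables
    gamma    : extra penalization parameter in [0, 1]; gamma = 0 gives the
               classical BIC
    """
    return -2.0 * loglik + num_edges * np.log(n) + 4.0 * num_edges * gamma * np.log(p)
```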
13 If γ = 0, then the classical BIC of [4] is recovered, which is well known to lead to (asymptotically) consistent model selection in the setting of a fixed number of variables p and growing sample size n. [sent-25, score-0.096]
14 Consistency is understood to mean selection of the smallest true graph whose edge set we denote E0 . [sent-26, score-0.324]
15 Positive γ leads to stronger penalization of large graphs, and our main result states that the (asymptotic) consistency of an exhaustive search over a restricted model space may then also hold in a scenario where p grows moderately with n (see the Main Theorem in Section 2). [sent-27, score-0.327]
16 Our numerical work demonstrates that positive values of γ indeed lead to improved graph inference when p and n are of comparable size (Section 3). [sent-28, score-0.14]
17 The choice of the criterion in (1) is in analogy to a similar criterion for regression models that was first proposed in [5] and theoretically studied in [3, 6]. [sent-29, score-0.241]
18 Our theoretical study employs ideas from these latter two papers as well as distribution theory available for decomposable graphical models. [sent-30, score-0.317]
19 As mentioned above, we treat an exhaustive search over a restricted model space that contains all decomposable models given by an edge set of cardinality |E| ≤ q. [sent-31, score-0.36]
20 One difference from the regression treatment of [3, 6] is that we fix neither the dimension bound q nor the dimension |E0| of the smallest true model. [sent-32, score-0.138]
21 Our simulations show that the combination of EBIC and graphical lasso gives good results well beyond the realm of the assumptions made in our theoretical analysis. [sent-36, score-0.316]
22 This combination is consistent in settings where both the lasso and the exhaustive search are consistent but in light of the good theoretical properties of lasso procedures (see [7]), studying this particular combination in itself would be an interesting topic for future work. [sent-37, score-0.261]
23 2 Consistency of the extended BIC for Gaussian graphical models. 2.1 Notation and definitions. In the sequel we make no distinction between the edge set E of a graph on p nodes and the associated Gaussian graphical model. [sent-39, score-0.566]
24 We also refer to E as a set of entries in a p × p matrix, meaning the 2|E| entries indexed by (j, k) and (k, j) for each {j, k} ∈ E. [sent-41, score-0.228]
25 We use ∆ to denote the index pairs (j, j) for the diagonal entries of the matrix. [sent-42, score-0.138]
26 In other words, the non-zero entries of Θ0 are precisely the diagonal entries as well as the off-diagonal positions indexed by E0 ; note that a single edge in E0 corresponds to two positions in the matrix due to symmetry. [sent-44, score-0.335]
27 Let S = (1/n) Σi Xi XiT be the sample covariance matrix. [sent-49, score-0.098]
28 The Gaussian log-likelihood function simplifies to ln(Θ) = (n/2) [log det(Θ) − trace(SΘ)]. (2) [sent-50, score-0.282]
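The quantities in (2) are simple to evaluate numerically; the following sketch assumes the n-by-p data matrix has already been centered and uses a helper name of our own choosing.

```python
import numpy as np

def gaussian_loglik(X, Theta):
    """ln(Theta) = (n/2) * [log det(Theta) - trace(S @ Theta)], with
    S = (1/n) * X^T X the sample covariance of the centered n-by-p data X."""
    n = X.shape[0]
    S = (X.T @ X) / n
    sign, logdet = np.linalg.slogdet(Theta)   # numerically stable log-determinant
    if sign <= 0:
        return -np.inf                        # Theta should be positive definite
    return 0.5 * n * (logdet - np.trace(S @ Theta))
```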
29 Let E be the set of all decomposable models E with |E| ≤ q. [sent-63, score-0.231]
30 Then with probability tending to 1 as n → ∞, E0 = arg min{BICγ(E) : E ∈ E}. [sent-64, score-0.094]
31 That is, the extended BIC with parameter γ selects the smallest true model E0 when applied to any subset of E containing E0. [sent-65, score-0.166]
32 Firstly, in Chen and Chen’s work on the GLM case [6], a Taylor approximation to the log-likelihood function is used, and we proceed similarly when comparing the smallest true model E0 to models E which do not contain E0. [sent-67, score-0.134]
33 The technique produces a lower bound on the decrease in likelihood when the true model is replaced by a false model. [sent-68, score-0.191]
34 Then with probability tending to 1 as n → ∞, ln(Θ0) − ln(Θ̂(E)) > 2q(log p)(1 + γ0) for all E ∈ E1. [sent-72, score-0.658]
35 Secondly, Porteous [8] shows that in the case of two nested models which are both decomposable, the likelihood ratio (at the maximum likelihood estimates) follows a distribution that can be expressed exactly as a log product of Beta distributions. [sent-73, score-0.176]
36 We will use this to address the comparison between the model E0 and decomposable models E containing E0 and obtain an upper bound on the improvement in likelihood when the true model is expanded to a larger decomposable model. [sent-74, score-0.544]
37 Let E0 be the set of decomposable models E with E ⊃ E0 and |E| ≤ q. [sent-77, score-0.231]
38 Then with probability tending to 1 as n → ∞, ln(Θ̂(E)) − ln(Θ̂(E0)) < 2(1 + γ0)(|E| − |E0|) log p for all E ∈ E0 \ {E0}. [sent-78, score-0.723]
39 If E ∈ E0, then (by Theorem 2): BICγ(E) − BICγ(E0) = −2(ln(Θ̂(E)) − ln(Θ̂(E0))) + 4(1 + γ0)(|E| − |E0|) log p > 0. [sent-84, score-0.347]
40 If instead E ∈ E1, then (by Theorem 1, since |E0| ≤ q): BICγ(E) − BICγ(E0) = −2(ln(Θ̂(E)) − ln(Θ̂(E0))) + 4(1 + γ0)(|E| − |E0|) log p > 0. [sent-85, score-0.347]
41 3 Simulations In this section, we demonstrate that the EBIC with positive γ indeed leads to better model selection properties in practically relevant settings. [sent-88, score-0.107]
42 As mentioned in the introduction, we first use the graphical lasso of [2] (as implemented in the ‘glasso’ package for R) to define a small set of models to consider (details given below). [sent-91, score-0.293]
43 For each case, the average positive selection rate (PSR) and false discovery rate (FDR) are computed. [sent-94, score-0.275]
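A minimal sketch of how these two rates could be computed from the supports of the true and estimated inverse covariance matrices; the tolerance used to declare an entry non-zero is an assumption.

```python
import numpy as np

def edge_set(Theta, tol=1e-8):
    """Unordered pairs {j, k}, j < k, whose entries exceed tol in absolute value."""
    p = Theta.shape[0]
    return {(j, k) for j in range(p) for k in range(j + 1, p)
            if abs(Theta[j, k]) > tol}

def psr_fdr(Theta_hat, Theta_true):
    """Positive selection rate and false discovery rate of the estimated edges."""
    E_hat, E_true = edge_set(Theta_hat), edge_set(Theta_true)
    psr = len(E_hat & E_true) / max(len(E_true), 1)   # fraction of true edges recovered
    fdr = len(E_hat - E_true) / max(len(E_hat), 1)    # fraction of selected edges that are false
    return psr, fdr
```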
44 We recall that the graphical lasso places an ℓ1 penalty on the inverse covariance matrix: given a penalty parameter ρ ≥ 0, we obtain the estimate Θ̂ρ by maximizing the ℓ1-penalized log-likelihood in (4). [sent-95, score-0.455]
45 Figure 1: The chain (top) and the ‘double chain’ (bottom) on 6 nodes. [sent-97, score-0.146]
46 The ℓ1 penalty promotes zeros in the estimated inverse covariance matrix Θ̂ρ; increasing the penalty yields an increase in sparsity. [sent-99, score-0.232]
47 The ‘glasso path’, that is, the set of models recovered over the full range of penalties ρ ∈ [0, ∞), gives a small set of models which, roughly, include the ‘best’ models at various levels of sparsity. [sent-100, score-0.117]
48 We may therefore apply the EBIC to this manageably small set of models (without further restriction to decomposable models). [sent-101, score-0.231]
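A sketch of this tuning strategy, using scikit-learn's graphical_lasso solver as a stand-in for the R ‘glasso’ package; the penalty grid is assumed to be supplied by the caller, and scoring the penalized estimate directly with Eq. (1) is one reasonable reading of the procedure rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def ebic_glasso_path(X, rhos, gamma=0.5):
    """Fit the graphical lasso along a penalty path and return the precision
    matrix whose edge set minimizes the extended BIC (1) with parameter gamma."""
    n, p = X.shape
    S = (X.T @ X) / n                                # sample covariance (X assumed centered)
    best_score, best_Theta = np.inf, None
    for rho in rhos:
        _, Theta = graphical_lasso(S, alpha=rho)     # l1-penalized precision estimate
        num_edges = int(np.sum(np.abs(np.triu(Theta, k=1)) > 1e-8))
        sign, logdet = np.linalg.slogdet(Theta)
        loglik = 0.5 * n * (logdet - np.trace(S @ Theta))
        score = -2.0 * loglik + num_edges * np.log(n) + 4.0 * num_edges * gamma * np.log(p)
        if score < best_score:
            best_score, best_Theta = score, Theta
    return best_Theta
```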
49 Consistency results on the graphical lasso require the penalty ρ to satisfy bounds that involve measures of regularity in the unknown matrix Θ0 ; see [7]. [sent-102, score-0.283]
50 While cross-validation does not generally have consistency properties for model selection (see [9]), it is nevertheless interesting to compare our method to cross-validation. [sent-104, score-0.142]
51 3.1 Design. In our simulations, we examine the EBIC as applied to the case where the graph is a chain with node j connected to nodes j − 1, j + 1, and to the ‘double chain’, where node j is connected to nodes j − 2, j − 1, j + 1, j + 2. [sent-109, score-0.397]
52 For both the chain and the double chain, we investigate four different scaling scenarios, with the exponent κ selected from {0. [sent-111, score-0.338]
53 In the case of a chain, the true inverse covariance matrix Θ0 is tridiagonal, with all diagonal entries (Θ0)j,j set equal to 1 and the entries (Θ0)j,j+1 = (Θ0)j+1,j that are next to the main diagonal set to a common non-zero value. [sent-116, score-0.465]
54 For the double chain, Θ0 has all diagonal entries equal to 1, the entries next to the main diagonal are (Θ0)j,j+1 = (Θ0)j+1,j = 0.2, [sent-118, score-0.444]
55 and the remaining non-zero entries are (Θ0)j,j+2 = (Θ0)j+2,j, again set to a common constant. [sent-119, score-0.095]
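The two designs are easy to reproduce; in the sketch below the diagonal entries are 1 as stated, while the off-diagonal coupling values are illustrative placeholders (only the 0.2 value for the double chain comes from the text above), so treat them as assumptions.

```python
import numpy as np

def chain_precision(p, off=0.3):
    """Tridiagonal Theta0: 1 on the diagonal, `off` on the first off-diagonal.
    The default 0.3 is an illustrative choice, not a value from the paper."""
    Theta = np.eye(p)
    idx = np.arange(p - 1)
    Theta[idx, idx + 1] = Theta[idx + 1, idx] = off
    return Theta

def double_chain_precision(p, off1=0.2, off2=0.1):
    """Pentadiagonal Theta0: 1 on the diagonal, off1 on the first and off2 on
    the second off-diagonal; off2 = 0.1 is an assumed value."""
    Theta = chain_precision(p, off=off1)
    idx = np.arange(p - 2)
    Theta[idx, idx + 2] = Theta[idx + 2, idx] = off2
    return Theta

def sample_data(Theta, n, seed=0):
    """Draw n centered Gaussian observations with inverse covariance Theta."""
    rng = np.random.default_rng(seed)
    Sigma = np.linalg.inv(Theta)
    return rng.multivariate_normal(np.zeros(Theta.shape[0]), Sigma, size=n)
```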
56 We choose 100 penalty values ρ which are logarithmically evenly spaced between ρmax (the smallest value which will result in a no-edge model) and ρmax /100. [sent-123, score-0.113]
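One way to build such a grid is sketched below; it assumes that the largest absolute off-diagonal entry of the sample covariance can serve as ρmax, which is the usual no-edge threshold for the graphical lasso but should be checked against the solver actually used.

```python
import numpy as np

def penalty_grid(S, num=100):
    """Penalties logarithmically spaced from rho_max down to rho_max/100, where
    rho_max is taken as the largest absolute off-diagonal entry of the sample
    covariance S (the usual value above which the graphical lasso selects no edges)."""
    rho_max = np.max(np.abs(S - np.diag(np.diag(S))))
    return np.logspace(np.log10(rho_max), np.log10(rho_max / 100.0), num=num)
```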
57 3.2 Results. Chain graph: The results for the chain graph are displayed in Figure 2. [sent-130, score-0.275]
58 The figure shows the positive selection rate (PSR) and false discovery rate (FDR) in the four scaling scenarios. [sent-131, score-0.334]
59 We observe that, for the larger sample sizes, the recovery of the non-zero coefficients is perfect or nearly perfect for all three values of γ; however, the FDR is noticeably better for the positive values of γ, especially for higher scaling exponents κ. [sent-132, score-0.225]
60 The EBIC with γ = 0.5 or γ = 1 performs very well, while the ordinary BIC0 produces a non-trivial amount of false positives. [sent-134, score-0.114]
61 Double chain graph: The results for the double chain graph are displayed in Figure 3. [sent-136, score-0.554]
62 In each of the four scaling scenarios for this case, we see a noticeable decline in the PSR as γ increases. [sent-137, score-0.097]
63 Furthermore, the FDR for the ordinary BIC0 is again noticeably higher than for the positive values of γ, and in the scaling scenarios κ ≥ 0.9, [sent-139, score-0.246]
64 the FDR for BIC0 is actually increasing as n and p grow, suggesting that asymptotic consistency may not hold in these cases, as is supported by our theoretical results. [sent-140, score-0.16]
65 For low values of γ, we are more likely to obtain a good (high) positive selection rate. [sent-146, score-0.107]
66 For higher values of γ, we are more likely to obtain a good (low) false discovery rate. [sent-147, score-0.106]
67 Nonetheless, this method offers guaranteed asymptotic consistency for (known) values of γ depending only on n and p. [sent-152, score-0.16]
68 4 Discussion We have proposed the use of an extended Bayesian information criterion for multivariate data generated by sparse graphical models. [sent-153, score-0.297]
69 Our main result gives a specific scaling for the number of variables p, the sample size n, the bound on the number of edges q, and other technical quantities relating to the true model, which will ensure asymptotic consistency. [sent-154, score-0.407]
70 Our simulation study demonstrates the practical potential of the extended BIC, particularly as a way to tune the graphical lasso. [sent-155, score-0.225]
71 We define the relevant technical quantities explicitly, state precise assumptions on those quantities, and then give an explicit lower bound on the probability of the extended BIC recovering the model E0 exactly. [sent-162, score-0.218]
72 We do this to give an intuition for the magnitude of the sample size n necessary for a good chance of exact recovery in a given setting, but due to the proof techniques, the resulting implications about sample size are extremely conservative. [sent-163, score-0.128]
73 Figure 2: Simulation results when the true graph is a chain. [sent-171, score-0.141]
74 Second, we give a distributional result about the sample correlation when sampling from a bivariate normal distribution. [sent-174, score-0.172]
75 Suppose (X1, Y1), . . . , (Xn, Yn) are independent draws from a bivariate normal distribution with zero mean, variances equal to one and covariance ρ. [sent-179, score-0.139]
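A quick Monte Carlo check of statements of this kind can be set up as follows; the number of replications and the seed are arbitrary choices.

```python
import numpy as np

def sample_correlations(n, rho, reps=10000, seed=0):
    """Monte Carlo draws of the sample correlation of n i.i.d. pairs from a
    bivariate normal with zero mean, unit variances and covariance rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    out = np.empty(reps)
    for r in range(reps):
        xy = rng.multivariate_normal(np.zeros(2), cov, size=n)
        out[r] = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    return out
```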
76 Figure 3: Simulation results when the true graph is a ‘double chain’. [sent-192, score-0.141]
77 Non-asymptotic versions of the theorems. We assume the following two conditions, where γ0 = γ − (1 − 1/(4κ)) > 0, ǫ0, ǫ1 > 0, C ≥ σmax λmax, and κ = logn p: first, ((p + 2q) log p / n) × (λmax^2 / θ0^2) ≤ 1 / (3200 max{1 + γ0, 1 + C^2/2}); second, 2(√(1 + γ0) − 1) − (log log p + log(4√(1 + γ0)) + 1) / (2 log p) ≥ ǫ0. Theorem 3. [sent-194, score-0.484]
78 Then with probability at least 1 − p^(1−ǫ1)/√(π log p), for all E with |E| ≤ q that do not contain E0, ln(Θ0) − ln(Θ̂(E)) > 2q(log p)(1 + γ0). [sent-196, score-0.564]
79 We sketch a proof along the lines of the proof of Theorem 2 in [6], using Taylor series centered at the true Θ0 to approximate the likelihood at Θ̂(E). [sent-198, score-0.154]
80 The score and the negative Hessian of the log-likelihood function in (2) are sn(Θ) = (d/dΘ) ln(Θ) = (n/2)(Θ−1 − S) and Hn(Θ) = −(d/dΘ) sn(Θ) = (n/2) Θ−1 ⊗ Θ−1. [sent-199, score-0.351]
81 Next, observe that Θ̂(E) has support on ∆ ∪ E0 ∪ E, and that by definition of θ0, we have the lower bound |Θ̂(E) − Θ0|F ≥ θ0 in terms of the Frobenius norm. [sent-203, score-0.112]
82 By Taylor expansion, for some Θ̃ on the path from Θ0 to Θ, we have: ln(Θ) − ln(Θ0) = vec(Θ − Θ0)T sn(Θ0) − (1/2) vec(Θ − Θ0)T Hn(Θ̃) vec(Θ − Θ0). [sent-205, score-0.676]
83 Next, by (CSB) and Lemma 2, with probability at least 1 − p e^(−ǫ1 log p)/√(π log p), the following bound holds for all edges e in the complete graph (we omit the details): ((sn(Θ0))e)^2 ≤ 6 σmax^4 (2 + ǫ1) n log p. [sent-206, score-0.417]
84 Therefore, |vec(Θ − Θ0)T sn(Θ0)|^2 ≤ θ0^2 (p + 2q) × 6 σmax^4 (2 + ǫ1) n log p. [sent-210, score-0.134]
85 We conclude that ln(Θ) − ln(Θ0) ≤ √(θ0^2 (p + 2q) × 6 σmax^4 (2 + ǫ1) n log p) − (1/2) θ0^2 × (n/2)(2λmax)^(−2). [sent-212, score-0.629]
86 Combining this bound with our assumptions above, we obtain the desired result. [sent-213, score-0.145]
87 Then with probability at least 1 − p^(−ǫ0)/(4√(π log p)(1 − p^(−ǫ0))), for all decomposable models E such that E ⊋ E0 and |E| ≤ q, ln(Θ̂(E)) − ln(Θ̂(E0)) < 2(1 + γ0)(|E| − |E0|) log p. [sent-216, score-0.86]
88 By [8, 11], ln(Θ̂(E)) − ln(Θ̂(E0)) is distributed as −(n/2) log(B1 · · · Bm), where Bi ∼ Beta((n − ci)/2, 1/2) are independent random variables and the constants c1, . . . , [sent-219, score-0.629]
89 cm are bounded by 1 less than the maximal clique size of the graph given by model E, implying ci ≤ 2q for each i. [sent-222, score-0.132]
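The Beta-product representation is easy to simulate, which gives a concrete feel for the size of the likelihood gain from adding edges; the sketch below takes the constants ci as user-supplied inputs and is an illustration of the cited distributional result, not code from the paper.

```python
import numpy as np

def lr_null_samples(n, cs, reps=10000, seed=0):
    """Monte Carlo draws of -(n/2) * log(B_1 * ... * B_m) with independent
    B_i ~ Beta((n - c_i)/2, 1/2); by the cited result this is the exact
    distribution of ln(Theta_hat(E)) - ln(Theta_hat(E0)) for nested
    decomposable models.  The constants c_1, ..., c_m are supplied in `cs`
    and must satisfy c_i < n."""
    rng = np.random.default_rng(seed)
    log_prod = np.zeros(reps)
    for c in cs:
        log_prod += np.log(rng.beta((n - c) / 2.0, 0.5, size=reps))
    return -(n / 2.0) * log_prod
```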
90 Finally, combining the assumptions on n, p, q and the (CSB) inequalities, we obtain: P{ln(Θ̂(E)) − ln(Θ̂(E0)) ≥ 2(1 + γ0) m log p} ≤ (1/(4√(π log p))) e^(−(m/2)(4(1 + ǫ0/2) log p)). [sent-225, score-0.11]
91 Next, note that the number of models E with E ⊃ E0 and |E| − |E0| = m is bounded by p^(2m). [sent-226, score-0.95]
92 Taking the union bound over all choices of m and all choices of E with that given m, we obtain that the desired result holds with the desired probability. [sent-227, score-0.194]
93 For its proof apply the union bound to the statements in Theorems 3 and 4, as in the asymptotic proof given in section 2. [sent-229, score-0.203]
94 Let E be the set of subsets E of edges between the p nodes, satisfying |E| ≤ q and representing a decomposable model. [sent-232, score-0.235]
95 Then it holds with probability at least 1 − p^(−ǫ0)/(4√(π log p)(1 − p^(−ǫ0))) − p^(1−ǫ1)/√(π log p) that E0 = arg min{BICγ(E) : E ∈ E}. [sent-233, score-0.102]
96 That is, the extended BIC with parameter γ selects the smallest true model. [sent-234, score-0.166]
97 Furthermore, although we may not have the exact equality κ = logn p, we will have logn p → κ; this limit will be sufficient for the necessary inequalities to hold for sufficiently large n. [sent-237, score-0.199]
98 Extended Bayesian information criterion for model selection with large model space. [sent-248, score-0.167]
99 Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. [sent-259, score-0.101]
100 Stochastic inequalities relating a class of log-likelihood ratio statistics to their asymptotic χ2 distribution. [sent-272, score-0.151]
wordName wordTfidf (topN-words)
[('bic', 0.467), ('ebic', 0.404), ('ln', 0.282), ('fdr', 0.2), ('decomposable', 0.192), ('psr', 0.191), ('chain', 0.146), ('glasso', 0.133), ('double', 0.133), ('graphical', 0.125), ('criterion', 0.101), ('graph', 0.099), ('lasso', 0.098), ('entries', 0.095), ('tending', 0.094), ('csb', 0.093), ('vec', 0.085), ('asymptotic', 0.084), ('logn', 0.082), ('max', 0.079), ('consistency', 0.076), ('extended', 0.071), ('false', 0.07), ('sn', 0.069), ('covariance', 0.068), ('selection', 0.066), ('log', 0.065), ('exhaustive', 0.065), ('edge', 0.064), ('noticeably', 0.064), ('theorems', 0.063), ('suppose', 0.062), ('jiahua', 0.062), ('rina', 0.062), ('zehua', 0.062), ('bi', 0.062), ('chen', 0.06), ('penalty', 0.06), ('scaling', 0.059), ('desired', 0.057), ('hn', 0.055), ('drton', 0.054), ('smallest', 0.053), ('simulations', 0.048), ('crossvalidation', 0.047), ('oh', 0.047), ('graphs', 0.045), ('assumptions', 0.045), ('nonetheless', 0.045), ('ordinary', 0.044), ('oxford', 0.044), ('inverse', 0.044), ('diagonal', 0.043), ('path', 0.043), ('bound', 0.043), ('nodes', 0.043), ('edges', 0.043), ('bivariate', 0.042), ('true', 0.042), ('grow', 0.042), ('distributional', 0.041), ('positive', 0.041), ('theorem', 0.04), ('bayesian', 0.039), ('quantities', 0.039), ('models', 0.039), ('indexed', 0.038), ('scenarios', 0.038), ('criteria', 0.038), ('proof', 0.038), ('penalization', 0.037), ('moderately', 0.037), ('holds', 0.037), ('discovery', 0.036), ('likelihood', 0.036), ('inequalities', 0.035), ('main', 0.035), ('comprises', 0.035), ('kronecker', 0.035), ('cai', 0.034), ('connected', 0.033), ('chicago', 0.033), ('beta', 0.033), ('ci', 0.033), ('taylor', 0.033), ('jk', 0.032), ('relating', 0.032), ('appendix', 0.032), ('scenario', 0.032), ('gaussian', 0.032), ('proofs', 0.031), ('package', 0.031), ('rate', 0.031), ('give', 0.03), ('displayed', 0.03), ('sample', 0.03), ('tuning', 0.029), ('normal', 0.029), ('simulation', 0.029), ('recovering', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
Author: Rina Foygel, Mathias Drton
Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1
2 0.24460511 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
Author: Han Liu, Kathryn Roeder, Larry Wasserman
Abstract: A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.
3 0.15281466 217 nips-2010-Probabilistic Multi-Task Feature Selection
Author: Yu Zhang, Dit-Yan Yeung, Qian Xu
Abstract: Recently, some variants of the ℓ1 norm, particularly matrix norms such as the ℓ1,2 and ℓ1,∞ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the ℓ1,2 and ℓ1,∞ norms by considering a family of ℓ1,q norms for 1 < q ≤ ∞ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the ℓ1,q norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization (EM) algorithms to learn all model parameters, including q, automatically. Experiments have been conducted on two cancer classification applications using microarray gene expression data. 1
4 0.11041851 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
5 0.10199089 148 nips-2010-Learning Networks of Stochastic Differential Equations
Author: José Pereira, Morteza Ibrahimi, Andrea Montanari
Abstract: We consider linear models for stochastic dynamics. To any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval T . We analyze the ℓ1 -regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are uniform in the sampling rate as long as this is sufficiently high. This result substantiates the notion of a well defined ‘time complexity’ for the network inference problem. keywords: Gaussian processes, model selection and structure learning, graphical models, sparsity and feature selection.
6 0.095943496 108 nips-2010-Graph-Valued Regression
7 0.093313754 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
8 0.090129688 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
9 0.088857204 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
10 0.08658056 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
11 0.085767739 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
12 0.085435927 5 nips-2010-A Dirty Model for Multi-task Learning
13 0.079305306 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
14 0.077869408 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
15 0.076356336 117 nips-2010-Identifying graph-structured activation patterns in networks
16 0.075478539 63 nips-2010-Distributed Dual Averaging In Networks
17 0.072593346 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
18 0.071071811 162 nips-2010-Link Discovery using Graph Feature Tracking
19 0.069538996 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
20 0.069017939 66 nips-2010-Double Q-learning
topicId topicWeight
[(0, 0.213), (1, 0.025), (2, 0.121), (3, 0.131), (4, -0.006), (5, -0.097), (6, -0.021), (7, 0.044), (8, -0.081), (9, 0.022), (10, -0.11), (11, 0.048), (12, -0.11), (13, 0.034), (14, 0.003), (15, -0.042), (16, 0.015), (17, -0.045), (18, 0.036), (19, 0.094), (20, -0.079), (21, 0.001), (22, -0.111), (23, 0.163), (24, 0.041), (25, 0.082), (26, 0.056), (27, 0.025), (28, 0.091), (29, 0.138), (30, 0.035), (31, -0.068), (32, -0.078), (33, -0.024), (34, 0.035), (35, 0.125), (36, 0.049), (37, -0.057), (38, -0.033), (39, -0.029), (40, 0.094), (41, -0.081), (42, 0.077), (43, -0.022), (44, 0.039), (45, -0.072), (46, -0.126), (47, 0.013), (48, 0.022), (49, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.93631166 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
Author: Rina Foygel, Mathias Drton
Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1
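A minimal sketch of how the criterion described above can be used to pick the graphical lasso penalty, assuming the usual form of the extended BIC for graphs (log-likelihood term plus an edge-count penalty |E| log n plus an extra 4γ|E| log p term) and the standard Gaussian log-likelihood. The penalty grid, the default γ = 0.5, the use of scikit-learn's GraphicalLasso, and the shortcut of scoring the penalized precision estimate directly (rather than refitting the maximum likelihood estimate on the selected edge set) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso, empirical_covariance

def extended_bic(precision, S, n, gamma):
    """Gamma-indexed BIC score for an estimated precision matrix."""
    p = S.shape[0]
    # maximized Gaussian log-likelihood, up to an additive constant
    loglik = 0.5 * n * (np.linalg.slogdet(precision)[1] - np.trace(S @ precision))
    # edges = nonzero off-diagonal entries, counted once per unordered pair
    n_edges = (np.count_nonzero(precision) - p) // 2
    return -2.0 * loglik + n_edges * np.log(n) + 4.0 * n_edges * gamma * np.log(p)

def select_penalty(X, alphas, gamma=0.5):
    """Return the graphical lasso penalty minimizing the criterion on data X."""
    n = X.shape[0]
    S = empirical_covariance(X)
    scores = {}
    for alpha in alphas:
        model = GraphicalLasso(alpha=alpha, max_iter=500).fit(X)
        # shortcut: score the penalized estimate instead of refitting the MLE
        scores[alpha] = extended_bic(model.precision_, S, n, gamma)
    return min(scores, key=scores.get), scores
```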
2 0.83803743 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
Author: Han Liu, Kathryn Roeder, Larry Wasserman
Abstract: A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.
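A rough sketch of the stability idea, under stated assumptions: fit the graphical lasso on repeated subsamples, score each regularization level by how often edges flip in and out, and keep the least regularization whose instability stays below a threshold β. The subsample size, number of subsamples, threshold value, and the simple monotonization below are illustrative choices rather than the paper's exact recipe.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def edge_instability(X, alpha, n_subsamples=20, subsample_frac=0.8, seed=0):
    """Average edge-selection instability under random subsampling."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = int(subsample_frac * n)
    freq = np.zeros((p, p))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=b, replace=False)
        prec = GraphicalLasso(alpha=alpha, max_iter=500).fit(X[idx]).precision_
        freq += (np.abs(prec) > 1e-8).astype(float)
    freq /= n_subsamples                       # selection frequency per edge
    xi = 2.0 * freq * (1.0 - freq)             # per-edge instability
    off_diag = ~np.eye(p, dtype=bool)
    return xi[off_diag].mean()

def stars_select(X, alphas, beta=0.05):
    """Smallest penalty whose monotonized instability stays below beta."""
    alphas = sorted(alphas, reverse=True)      # most regularization first
    chosen, running_max = alphas[0], 0.0
    for alpha in alphas:
        running_max = max(running_max, edge_instability(X, alpha))
        if running_max > beta:                 # monotonized value only grows from here
            break
        chosen = alpha
    return chosen
```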
3 0.64987177 66 nips-2010-Double Q-learning
Author: Hado V. Hasselt
Abstract: In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation. 1
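The double-estimator update lends itself to a few lines of code. The sketch below assumes a tabular setting with two value tables: the table being updated selects the argmax action at the next state, while the other table supplies that action's value. The step size, discount factor, ε-greedy behaviour policy, and the omission of terminal-state handling are simplifying assumptions.

```python
import numpy as np

def double_q_update(Q_a, Q_b, s, a, r, s_next, alpha=0.1, gamma=0.95, rng=None):
    """One double Q-learning update on tabular estimates Q_a and Q_b."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < 0.5:
        best = int(np.argmax(Q_a[s_next]))         # argmax from the table being updated
        target = r + gamma * Q_b[s_next, best]     # value supplied by the other table
        Q_a[s, a] += alpha * (target - Q_a[s, a])
    else:
        best = int(np.argmax(Q_b[s_next]))
        target = r + gamma * Q_a[s_next, best]
        Q_b[s, a] += alpha * (target - Q_b[s, a])

def epsilon_greedy_action(Q_a, Q_b, s, eps, rng):
    """Behaviour policy acting on the combined estimate."""
    if rng.random() < eps:
        return int(rng.integers(Q_a.shape[1]))
    return int(np.argmax(Q_a[s] + Q_b[s]))
```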
4 0.61889803 108 nips-2010-Graph-Valued Regression
Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty
Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y. In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or Go-CART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1
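The core idea, graph-valued regression via a partition of the covariate space, can be illustrated with a deliberately simplified sketch: a single median split on one input coordinate stands in for the dyadic partitioning tree, and a separate sparse Gaussian graph is fit to the responses falling in each cell. The split rule, per-cell penalty, and edge threshold are illustrative assumptions, not the Go-CART procedure with its oracle guarantees.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def graphs_by_median_split(X, Y, split_dim=0, alpha=0.1):
    """Fit one sparse Gaussian graph on Y per cell of a single split of X."""
    threshold = np.median(X[:, split_dim])
    graphs = {}
    for name, mask in [("left", X[:, split_dim] <= threshold),
                       ("right", X[:, split_dim] > threshold)]:
        prec = GraphicalLasso(alpha=alpha, max_iter=500).fit(Y[mask]).precision_
        # adjacency of the estimated conditional-independence graph in this cell
        graphs[name] = (np.abs(prec) > 1e-8) & ~np.eye(prec.shape[0], dtype=bool)
    return threshold, graphs
```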
5 0.60159874 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
Author: Yi Sun, Jürgen Schmidhuber, Faustino J. Gomez
Abstract: We present a new way of converting a reversible finite Markov chain into a nonreversible one, with a theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition graph. Our result confirms that non-reversible chains are fundamentally better than reversible ones in terms of asymptotic performance, and suggests interesting directions for further improving MCMC. 1
6 0.58484697 148 nips-2010-Learning Networks of Stochastic Differential Equations
7 0.55253953 217 nips-2010-Probabilistic Multi-Task Feature Selection
8 0.53826803 172 nips-2010-Multi-Stage Dantzig Selector
9 0.52613455 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
10 0.51690501 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
11 0.50935125 117 nips-2010-Identifying graph-structured activation patterns in networks
12 0.50003546 204 nips-2010-Penalized Principal Component Regression on Graphs for Analysis of Subnetworks
13 0.49521482 5 nips-2010-A Dirty Model for Multi-task Learning
14 0.49298635 265 nips-2010-The LASSO risk: asymptotic results and real world examples
15 0.48148084 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
16 0.47649699 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
17 0.46831188 259 nips-2010-Subgraph Detection Using Eigenvector L1 Norms
18 0.46463323 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
19 0.46365768 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
20 0.44320008 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
topicId topicWeight
[(13, 0.042), (17, 0.03), (27, 0.054), (30, 0.074), (35, 0.034), (45, 0.181), (50, 0.091), (52, 0.029), (60, 0.068), (77, 0.058), (78, 0.022), (86, 0.185), (90, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.83126646 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
Author: Rina Foygel, Mathias Drton
Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1
2 0.76909918 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, large-scale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task, especially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of high-dimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
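A hedged sketch of the eigenbasis idea described above: project the noisy node measurements onto the low-frequency eigenvectors of the graph Laplacian, where a graph-smooth activation pattern concentrates, and discard the remaining directions. The number of retained eigenvectors k and the use of the combinatorial (unnormalized) Laplacian are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def laplacian_denoise(adjacency, y, k=10):
    """Project noisy node measurements y onto the k lowest Laplacian eigenvectors."""
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency        # combinatorial Laplacian
    _, eigvecs = np.linalg.eigh(laplacian)          # eigenvalues in ascending order
    basis = eigvecs[:, :k]                          # low-frequency (smooth) basis
    return basis @ (basis.T @ y)                    # projection estimate of the pattern
```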
3 0.76401204 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
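The memory-as-compressed-sensing picture admits a small toy experiment: under an assumed linear-network model x_t = W x_{t-1} + v s_t with a random orthogonal recurrent matrix W, the instantaneous state is a linear combination of the columns W^t v, and past inputs can be decoded by sparse regression. Network size, sequence length, sparsity level, noise scale, and the Lasso penalty below are illustrative choices, not the ensembles analyzed in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, T, k = 100, 300, 10                               # neurons, sequence length, nonzeros

W, _ = np.linalg.qr(rng.standard_normal((N, N)))     # random orthogonal recurrence
v = rng.standard_normal(N) / np.sqrt(N)              # feedforward input vector

s = np.zeros(T)                                      # sparse input history (recent first)
s[rng.choice(T, size=k, replace=False)] = rng.standard_normal(k)

# column t of A is W^t v, so the instantaneous network state is x = A @ s
A = np.empty((N, T))
col = v.copy()
for t in range(T):
    A[:, t] = col
    col = W @ col

x = A @ s + 0.01 * rng.standard_normal(N)            # noisy instantaneous state

s_hat = Lasso(alpha=0.005, fit_intercept=False, max_iter=10000).fit(A, x).coef_
recovery_error = np.linalg.norm(s_hat - s) / np.linalg.norm(s)
```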
4 0.76197785 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
5 0.75987881 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
6 0.75827152 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
7 0.75639564 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
8 0.75620222 63 nips-2010-Distributed Dual Averaging In Networks
9 0.75554657 148 nips-2010-Learning Networks of Stochastic Differential Equations
10 0.75526422 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.75263608 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
12 0.75221467 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
13 0.75092059 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
14 0.75004458 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
15 0.74989367 155 nips-2010-Learning the context of a category
16 0.74909574 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
17 0.74894428 282 nips-2010-Variable margin losses for classifier design
18 0.74834192 243 nips-2010-Smoothness, Low Noise and Fast Rates
19 0.74753988 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
20 0.74602747 280 nips-2010-Unsupervised Kernel Dimension Reduction