nips nips2010 nips2010-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tom Claassen, Tom Heskes
Abstract: A long-standing open research problem is how to use information from different experiments, including background knowledge, to infer causal relations. Recent developments have shown ways to use multiple data sets, provided they originate from identical experiments. We present the MCI-algorithm as the first method that can infer provably valid causal relations in the large sample limit from different experiments. It is fast, reliable and produces very clear and easily interpretable output. It is based on a result that shows that constraint-based causal discovery is decomposable into a candidate pair identification and subsequent elimination step that can be applied separately from different models. We test the algorithm on a variety of synthetic input model sets to assess its behavior and the quality of the output. The method shows promising signs that it can be adapted to suit causal discovery in real-world application areas as well, including large databases. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract A long-standing open research problem is how to use information from different experiments, including background knowledge, to infer causal relations. [sent-5, score-0.818]
2 We present the MCI-algorithm as the first method that can infer provably valid causal relations in the large sample limit from different experiments. [sent-7, score-1.012]
3 It is based on a result that shows that constraint-based causal discovery is decomposable into a candidate pair identification and subsequent elimination step that can be applied separately from different models. [sent-9, score-0.858]
4 The method shows promising signs that it can be adapted to suit causal discovery in real-world application areas as well, including large databases. [sent-11, score-0.858]
5 1 Introduction Discovering causal relations from observational data is an important, ubiquitous problem in science. [sent-12, score-0.963]
6 In practice, observed dependencies often differ between data sets, precisely because the experimental circumstances were not identical in different experiments, even when the causal system at the heart of it was the same. [sent-24, score-0.915]
7 The method we develop in this article shows how to distinguish between causal dependencies internal to the system under investigation and merely contextual dependencies. [sent-25, score-0.916]
8 [4] recognized the ‘local’ aspect of causal discovery from Y-structures embedded in data: it suffices to establish a certain (in)dependency pattern between variables, without having to uncover the entire graph. [sent-27, score-0.888]
9 As a result the conclusion remains valid, even when, unlike causal inference from Y-structures, the two pieces of information are taken from different models. [sent-30, score-0.818]
10 Section 4 establishes the link between conditional independence and local causal relations, which is used in section 5 to combine multiple models into a single causal graph. [sent-34, score-1.766]
11 A directed graph G is a pair ⟨V, E⟩, where V is a set of vertices or nodes and E is a set of edges between pairs of nodes, represented by arrows X → Y . [sent-38, score-0.242]
12 A directed path is a path that is traversed entirely in the direction of the arrows. [sent-43, score-0.286]
13 A directed acyclic graph (DAG) is a directed graph that does not contain a directed path from any node to itself. [sent-44, score-0.437]
14 A vertex X is an ancestor of Y (and Y is a descendant of X) if there is a directed path from X to Y in G or if X = Y . [sent-45, score-0.219]
15 For disjoint (sets of) vertices X, Y and Z in a DAG G, X is d-connected to Y conditional on Z (possibly empty), iff there exists an unblocked path π = X, . . . , Y given Z. [sent-54, score-0.256]
16 Two nodes X and Y are minimally conditionally independent given a set of nodes Z, denoted [X ⊥ Y | Z], iff X is conditionally independent of Y given a minimal set of nodes Z. [sent-62, score-0.372]
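To illustrate this definition, a minimal conditional independence can be checked against an abstract independence oracle. The sketch below is not from the paper; `indep(x, y, z)` is a hypothetical stand-in for whatever (statistical) conditional independence test is used in practice.

```python
from itertools import combinations

def is_minimal_ci(indep, x, y, z):
    """[X _||_ Y | Z]: X and Y independent given Z, but not given any proper subset of Z."""
    z = list(z)
    if not indep(x, y, set(z)):
        return False
    return all(not indep(x, y, set(sub))
               for size in range(len(z))          # all proper subsets, including the empty set
               for sub in combinations(z, size))
```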
17 A causal DAG GC is a graphical model in the form of a DAG where the arrows represent direct causal interactions between variables in a system [6]. [sent-64, score-1.71]
18 There is a causal relation X ⇒ Y , iff there is a directed path from X to Y in GC . [sent-65, score-1.066]
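As a concrete illustration of these graph definitions (a minimal sketch, not code from the paper), a DAG can be held as an adjacency dictionary and the causal relation X ⇒ Y reduces to a reachability test:

```python
def reachable(dag, x):
    """Nodes reachable from x via one or more directed edges."""
    seen, stack = set(), [x]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def is_cause(dag, x, y):
    """X => Y iff there is a directed path from X to Y in the causal DAG."""
    return y in reachable(dag, x)

def is_ancestor(dag, x, y):
    """Ancestor additionally allows X = Y, per the definition above."""
    return x == y or is_cause(dag, x, y)

# toy example: X -> Z -> Y
dag = {'X': ['Z'], 'Z': ['Y']}
assert is_cause(dag, 'X', 'Y') and not is_cause(dag, 'Y', 'X')
```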
19 We assume that the systems we consider correspond to some underlying causal DAG over a great many observed and unobserved nodes. [sent-68, score-0.862]
20 Different MAGs can represent the same distribution, but only the invariant features, common to all MAGs that can faithfully represent that distribution, carry identifiable causal information. [sent-70, score-0.851]
21 Throughout this article we also adopt the causal faithfulness assumption, which implies that all and only the conditional independence relations entailed by the causal Markov condition applied to the true causal DAG will hold in the joint probability distribution over the variables in GC . [sent-74, score-2.786]
22 By modeling this external environment explicitly as a set of unobserved (hypothetical) context nodes that causally affect the system under scrutiny we can account for this effect. [sent-83, score-0.217]
23 The external context GE of a causal DAG GC is a set of independent nodes U in combination with links from every U ∈ U to one or more nodes in GC . [sent-85, score-1.144]
24 The total causal structure of an experiment then becomes GT = {GE + GC }. [sent-86, score-0.853]
25 Figure 1 depicts a causal system in three different experiments (double lined arrows indicate direct causal relations; dashed circles represent unobserved variables). [sent-87, score-1.708]
26 The context only introduces arrows from nodes in GE to GC which can never result in a cycle; therefore the structure of an experiment GT is also a causal DAG. [sent-89, score-0.999]
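A toy generator for the total structure GT = {GE + GC} might look as follows (an illustrative sketch with hypothetical parameter names; the paper does not prescribe this code). Context nodes U only send arrows into GC, so the result is again a DAG.

```python
import random

def add_context(gc_edges, gc_nodes, n_context, max_targets=2, seed=0):
    """Edge list of G_T = {G_E + G_C}: G_C plus independent context nodes with U -> G_C links."""
    rng = random.Random(seed)
    gt_edges = list(gc_edges)                      # the causal system G_C itself is left untouched
    nodes = sorted(gc_nodes)
    for i in range(n_context):
        u = f"U{i}"                                # hypothetical context node
        for t in rng.sample(nodes, k=min(max_targets, len(nodes))):
            gt_edges.append((u, t))                # only U -> G_C arrows, so no cycles can arise
    return gt_edges
```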
27 Figure 1: A causal system GC in different experiments. In this paradigm, different experiments become variations in context of a constant causal system. [sent-91, score-1.686]
28 The goal of causal discovery from multiple models can then be stated as: “Given experiments with unknown total causal structures GT = {GE + GC }, GT′ = {GE′ + GC }, etc. [sent-92, score-1.697]
29 , which variables are connected by a directed path in GC ? [sent-94, score-0.226]
30 4 Causal relations in arbitrary context A remarkable result that, to the best of our knowledge, has not been noted before, is that a minimal conditional independence always implies the presence of a causal relation. [sent-98, score-1.138]
31 with X ⊥ Y | Z, implies that there are no causal links W ⇒ X, W ⇒ Y or W ⇒ Z for any Z ∈ Z in GC , (3) a conditional independence X ⊥ Y | Z implies the absence of (direct) causal paths X ⇒ Y or X ⇐ Y in GC between X and Y that are not mediated by nodes in Z. [sent-103, score-2.021]
32 The theorem establishes independence patterns that signify (absence of) a causal origin, independent of the (unobserved) external background. [sent-104, score-0.91]
33 Rule (1) identifies a candidate pair of causal relations from a conditional independence. [sent-105, score-1.009]
34 Rule (2) identifies the absence of causal paths from unshielded colliders in G, see also [1]. [sent-106, score-0.949]
35 The final step towards causal discovery from multiple models now takes a surprisingly simple form: Lemma 1. [sent-108, score-0.879]
36 Let X, Y and Z ∈ Z be disjoint (sets of) variables in an experiment with causal structure GT = {GE + GC }, then if there exists both: − a minimal conditional independence [X ⊥ Y | Z], − established absence of a causal path Z ⇒ X, then there is a causal link (directed path) Z ⇒ Y in GC . [sent-109, score-2.831]
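A sketch of this combination step (an illustration of Lemma 1, not the authors' code): minimal conditional independencies taken from one model are matched against non-causal relations that may have been established in a different model.

```python
def apply_lemma1(minimal_cis, non_causal):
    """
    minimal_cis: iterable of (X, Y, Z) with [X _||_ Y | Z] a minimal conditional independence
    non_causal:  set of ordered pairs (A, B) for which the causal path A => B has been ruled out
    returns:     set of inferred causal links (A, B), meaning A => B in G_C
    """
    inferred = set()
    for x, y, zs in minimal_cis:
        for z in zs:
            # rule (1) gives Z => X and/or Z => Y; ruling one out leaves the other
            if (z, x) in non_causal:
                inferred.add((z, y))
            if (z, y) in non_causal:
                inferred.add((z, x))
    return inferred
```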
37 The only prerequisite for bringing results from various sources together is that the causal system at the centre is invariant, i.e. [sent-114, score-0.845]
38 that the causal structure GC remains the same across the different experiments GT , GT′ , etc. [sent-116, score-0.836]
39 We want to use these models to convey as much about the underlying causal structure GC as possible. [sent-119, score-0.857]
40 We choose a causal CPAG as the target output model: similar in form and interpretation to a CPAG, where tails and arrowheads now represent all known (non)causal relations. [sent-120, score-0.91]
41 Ingredients for extracting this information are the rules in theorem 1, in combination with the standard properties of causal relations: acyclicity (if X ⇒ Y then not Y ⇒ X) and transitivity (if X ⇒ Y and Y ⇒ Z then X ⇒ Z). [sent-122, score-0.907]
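These two properties translate into a simple propagation step. The sketch below (not the authors' implementation) closes the set of established causal relations under transitivity and reads off non-causal relations from acyclicity.

```python
def close_relations(causal):
    """Transitive closure of X => Y relations, plus non-causal pairs implied by acyclicity."""
    causal = set(causal)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(causal):
            for (c, d) in list(causal):
                if b == c and (a, d) not in causal:
                    causal.add((a, d))            # transitivity: a => b and b => d gives a => d
                    changed = True
    non_causal = {(b, a) for (a, b) in causal}    # acyclicity: a => b rules out b => a
    return causal, non_causal

causal, non_causal = close_relations({('X', 'Z'), ('Z', 'Y')})
assert ('X', 'Y') in causal and ('Y', 'X') in non_causal
```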
42 As the causal system is assumed invariant, the established (absence of) causal relations in one model are valid in all models. [sent-123, score-1.838]
43 as learned by the extended FCI-algorithm [1, 8], from a number of different experiments GT(i) on an invariant causal system GC . [sent-127, score-0.878]
44 The output is the single causal CPAG G over the union of all nodes in the input models Pi . [sent-128, score-0.957]
45 None of these identifies a causal relation, yet despite the different (in)dependence relations, it is easily verified that the algorithm terminates after two loops with the nearly complete causal CPAG on the r.h.s. [sent-133, score-1.636]
46 A path π = X, . . . , Y in a CPAG is called a possibly directed path (or p.d. [sent-148, score-0.179]
47 path) from X to Y , if it can be converted into a directed path by changing circle marks into appropriate tails and arrowheads [6]. [sent-150, score-0.296]
48 Let X and Y be two variables in an experiment with causal structure GT = {GE + GC }, and let P[G] be the corresponding CPAG over a subset of observed nodes from GC . [sent-152, score-0.998]
49 Then the absence of a causal link X ⇒ Y is detectable from the conditional (in)dependence structure in this experiment iff there exists no p.d. path from X to Y in P[G] . [sent-153, score-1.046]
50 Finding a p.d. path from a given CPAG is straightforward: turn the graph into a {0, 1} adjacency matrix by setting all arrowheads to zero and all tails and circle marks to one, and compute the resulting reachability matrix. [sent-160, score-0.271]
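That recipe can be written down directly. The sketch below is illustrative and assumes an edge-mark encoding in which mark[i][j] is the mark at node i on the edge i *-* j ('arrow', 'tail' or 'circle'), and None when i and j are not adjacent; per Theorem 2, the absence of a causal link X ⇒ Y is then detectable exactly when reach[x, y] is False.

```python
import numpy as np

def pd_reachability(mark):
    """reach[i, j] is True iff a possibly directed (p.d.) path from i to j exists."""
    p = len(mark)
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        for j in range(p):
            # arrowheads become 0; tails and circle marks become 1
            adj[i, j] = mark[i][j] in ('tail', 'circle')
    reach = adj.copy()
    for k in range(p):                             # Warshall-style transitive closure
        reach |= np.outer(reach[:, k], reach[k, :])
    return reach
```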
51 As this will uncover all detectable ‘non-causal’ relations in a CPAG in one go, it needs to be done only once for each model, and can be aggregated into a matrix MC to make all tests for rule (2) in line 8 superfluous. [sent-161, score-0.262]
52 If we also record all other established (non)causal relations in the matrix MC as the algorithm progresses, then indirect causal relations are no longer lost when they cannot be transferred to the output graph G. [sent-162, score-1.224]
53 Let X, Y and Z be disjoint sets of variables in an experiment with causal structure GT = {GE + GC }, then for every [X ⊥ Y | Z]: − every (indirect) causal relation X ⇒ Y implies causal links Z ⇒ Y , − every (indirect) absence of causal relation X ⇒ Y implies no causal links X ⇒ Z. [sent-164, score-4.442]
54 The first makes it possible to orient indirect causal chains; the second shortens indirect non-causal links. [sent-165, score-0.918]
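Both rules are easy to mechanise. The following sketch (ad-hoc set-based bookkeeping, not the paper's pseudocode) applies them for a single minimal conditional independence [X ⊥ Y | Z] given the relations established so far:

```python
def propagate(x, y, zs, causal, non_causal):
    """Apply the two rules above for one minimal conditional independence [X _||_ Y | Z]."""
    new_causal, new_non_causal = set(), set()
    if (x, y) in causal:                               # X => Y established (possibly indirectly)
        new_causal |= {(z, y) for z in zs}             # then Z => Y for every Z in Z
    if (x, y) in non_causal:                           # X => Y ruled out
        new_non_causal |= {(x, z) for z in zs}         # then X => Z is ruled out as well
    return new_causal, new_non_causal
```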
55 As a final improvement it is worth noting that for rules (1), (5) and (6) it is only relevant to know that a node Z occurs in some Z in a minimal conditional independence relation [X ⊥ Y | Z] separating X and Y , but not what the other nodes in Z are or in what model(s) it occurred. [sent-167, score-0.286]
56 We can introduce a structure SCI to record all nodes Z that occur in some minimal conditional independency in one of the models Pi for each combination of nodes (X, Y ), before any of the rules (1), (5) or (6) is processed. [sent-168, score-0.384]
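One possible realisation of SCI (the dictionary layout and names are a choice made here; the paper only describes the structure abstractly):

```python
from collections import defaultdict

def build_sci(minimal_cis):
    """Map each unordered pair (X, Y) to all nodes Z occurring in some minimal CI [X _||_ Y | Z]."""
    sci = defaultdict(set)
    for x, y, zs in minimal_cis:
        key = tuple(sorted((x, y)))
        sci[key] |= set(zs)
    return sci
```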
57 As a result, in the repeated causal inference loop no conditional independence / m-separation tests need to be performed at all. [sent-169, score-0.909]
58 For all X, Y ∈ Pi : Rule (4) MC ← (X ⇒ Y ), if there is a causal path X ⇒ . . . ⇒ Y . [sent-175, score-0.925]
59 This can be efficiently implemented by noting that nodes connected by an edge will not be separated and that many other combinations will not have to be tested as they contain a subset for which a (minimal) conditional independency has already been established. [sent-182, score-0.203]
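A pruned search for one minimal separating set might look as follows (a sketch under assumptions: `adjacent(x, y)` and the independence oracle `indep(x, y, z)` are placeholders, not interfaces from the paper). Searching conditioning sets in order of increasing size guarantees that the first set found is minimal.

```python
from itertools import combinations

def minimal_sepset(x, y, nodes, adjacent, indep, max_size=3):
    """Return one minimal Z with X _||_ Y | Z, or None if no such set is found."""
    if adjacent(x, y):
        return None                                    # nodes connected by an edge are never separated
    others = [n for n in nodes if n not in (x, y)]
    for size in range(max_size + 1):                   # increasing size: the first hit is minimal
        for z in combinations(others, size):
            if indep(x, y, set(z)):
                return set(z)
    return None
```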
60 If a (non)causal relation is found between adjacent variables in G, or one that can be used to infer other intermediate relations (lines 13-14), then it can be marked as ‘processed’ to avoid unnecessary checks. [sent-183, score-0.197]
61 7 Experimental results We tested the MCI-algorithm on a variety of synthetic data sets to verify its validity and assess its behaviour and performance in uncovering causal information from multiple models. [sent-186, score-0.818]
62 For the generation of random causal DAGs we used a variant of [13] to control the distribution of edges over nodes in the network. [sent-187, score-0.917]
63 The random experiments in each run were generated from this causal DAG by including a random context and hidden nodes. [sent-188, score-0.841]
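For completeness, a generic generator in this spirit is sketched below; this is not the exact variant of [13] used by the authors, and the degree control is a simple assumption. The random context of each experiment can then be added with a helper like `add_context` above, and hidden nodes simulated by dropping the corresponding variables from the generated data.

```python
import random

def random_dag(p, target_degree, seed=0):
    """Random DAG on p nodes with expected degree roughly target_degree."""
    rng = random.Random(seed)
    prob = min(1.0, target_degree / max(1, p - 1))
    # edges always point from a lower to a higher index, so the graph is acyclic
    return [(i, j) for i in range(p) for j in range(i + 1, p) if rng.random() < prob]
```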
64 The generated output G and MC were verified against the true causal DAG and expressed as a percentage of the true number of (non-)causal relations. [sent-190, score-0.837]
65 The first is a common-sense method, indicated as ‘sum-FCI’, that utilizes the transitive closure of all causal relations in the input CPAGs that could have been identified by FCI in the large sample limit. [sent-192, score-0.963]
66 Figure 3: Proportion of causal relations discovered by the MCI-algorithm in different settings; (a) causal relations vs. [sent-242, score-1.926]
67 number of models; (top) identical observed nodes in input models, (bottom) only partially overlapping observed nodes. As a second benchmark we take all causal information contained in the CPAG over the union of observed variables, independent of the context, hence ‘nc-CPAG’ for ‘no context’. [sent-248, score-1.112]
68 Note that this is not really a method as it uses information directly derived from the true causal graph. [sent-249, score-0.818]
69 Performance is calculated as the proportion of uncovered relations as compared to the actual number of non/causal relations in the true causal graph over the union of observed nodes in each model set. [sent-251, score-1.317]
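The reported score is then a simple ratio; a minimal helper (illustrative, hypothetical names):

```python
def recovery_rate(found, true_relations):
    """Fraction of the true (non-)causal relations that were actually uncovered."""
    true_relations = set(true_relations)
    return len(set(found) & true_relations) / len(true_relations) if true_relations else 1.0
```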
70 In these runs the underlying causal graphs contained 12 nodes with edge degree ≤ 5. [sent-252, score-0.933]
71 Also non-causal information (c) is much easier to find than definite causal relations (a & b). [sent-255, score-0.963]
72 The apparent drop in Fig. 3(c, bottom) is only because in going to two models the number of non-causal relations in the union rises more quickly than the number of new relations that is actually found (due to lack of overlap). [sent-258, score-0.311]
73 A striking result that is clearly brought out is that adding random context actually improves the detection rate of causal relations. [sent-259, score-0.841]
74 The rationale behind this effect is that externally induced links can introduce conditional dependencies, allowing the deduction of non-causal links that are not otherwise detectable; these in turn may lead to other causal relations that can be inferred, and so on. [sent-260, score-1.142]
75 If the context is expanded further, at some point the detection rate will start to deteriorate as the causal structure will be swamped by the externally induced links (b). [sent-261, score-0.934]
76 We want to stress that for the many tens of thousands of (non)causal relations identified by the MCI-algorithm in all the runs, not a single one was found to be invalid when compared with the true causal graph. [sent-262, score-0.963]
77 For larger models it can be converted into an anytime algorithm by running over minimal conditional independencies from subsets of increasing size: at each level all uncovered causal information is valid, and, for reasonably sparse models, most will already be found at low levels. [sent-269, score-0.989]
78 For very large models an exciting possibility is to target only specific causal relations: finding the right combination of (in)dependencies is sufficient to decide whether a given relation is causal, even when there is no hope of deriving a global CPAG model. [sent-270, score-0.839]
79 Integrating our approach with recent developments in causal discovery that are not based on independence constraints [11, 12] can provide for even more detectable causal information. [sent-274, score-1.779]
80 When the method is applied to real data sets, the large sample limit no longer applies and inconsistent causal relations may result. [sent-275, score-0.963]
81 Alternatively, output might be generalized to quantities like ‘the probability of a causal relation’ based on the strength of appropriate conditional (in)dependencies in the available data. [sent-277, score-0.883]
82 A node Z ∈ Z that blocks such a trek has a directed path in GC to X and/or Y , but can unblock other paths. [sent-282, score-0.247]
83 These paths contain a trek between Z and X/Y , and must be blocked by a node Z′ ∈ Z\{Z}, which therefore also has a causal link to X or Y (possibly via Z). [sent-283, score-0.943]
84 Any directed path from W to a Z ∈ Z implies that conditioning on W is not needed when already conditioning on Z. [sent-294, score-0.205]
85 In combination with π1 or π2 , a directed path from W to X or Y in GC would make W a noncollider on an unblocked path between X and Y given Z, contrary to X ⊥ Y | Z. [sent-295, score-0.354]
86 (3) A directed path between X and Y that is not blocked by Z would mean that X ⊥ Y | Z cannot hold, see [1]. [sent-296, score-0.206]
87 Theorem 2 ‘⇐’ follows from the fact that a directed path π = X, . . . [sent-297, score-0.179]
88 , Y in the underlying causal DAG GC implies existence of a directed path in the true MAG over the observed nodes and therefore at least the existence of a p.d. path in P[G] . [sent-300, score-1.145]
89 path into a directed path in a MAG that is a member of the equivalence class P[G] . [sent-306, score-0.286]
90 A p.d. path from X to Y in P[G] implies there is at least some underlying causal DAG in which it is a causal path, and so cannot correspond to a valid, detectable absence of a causal link. [sent-309, score-2.676]
91 Lemma 2 From rule (1) in Theorem 1, [X ⊥ Y | Z] implies causal links Z ⇒ X and/or Z ⇒ Y . [sent-310, score-0.95]
92 Glymour, “Integrating locally learned causal structures with overlapping variables,” in Advances in Neural Information Processing Systems, 21, 2008. [sent-327, score-0.845]
93 Spirtes, “A theoretical study of Y structures for causal discovery,” in Proceedings of the 22nd Conference in Uncertainty in Artificial Intelligence, pp. [sent-331, score-0.818]
94 Zhang, “On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias,” Artificial Intelligence, vol. [sent-349, score-0.91]
95 Spirtes, “Detection of unfaithfulness and robust causal inference,” Minds and Machines, vol. [sent-355, score-0.818]
96 Richardson, “An algorithm for causal inference in the presence of latent variables and selection bias,” in Computation, Causation, and Discovery, pp. [sent-362, score-0.841]
97 Kerminen, “A linear non-Gaussian acyclic model for causal discovery,” Journal of Machine Learning Research, vol. [sent-368, score-0.838]
98 Schölkopf, “Nonlinear causal discovery with additive noise models,” in Advances in Neural Information Processing Systems 21 (NIPS*2008), pp. [sent-376, score-0.818]
99 Heskes, “Learning causal network structure from multiple (in)dependence models,” in Proceedings of the Fifth European Workshop on Probabilistic Graphical Models, 2010. [sent-384, score-0.836]
100 Meek, “Causal inference and causal explanation with background knowledge,” in UAI, pp. [sent-386, score-0.818]
wordName wordTfidf (topN-words)
[('causal', 0.818), ('gc', 0.26), ('cpag', 0.227), ('relations', 0.145), ('mc', 0.122), ('dag', 0.111), ('path', 0.107), ('nodes', 0.099), ('gt', 0.083), ('ge', 0.081), ('directed', 0.072), ('mci', 0.068), ('pi', 0.063), ('sci', 0.062), ('links', 0.058), ('cpags', 0.057), ('mag', 0.057), ('indirect', 0.05), ('absence', 0.05), ('spirtes', 0.05), ('rule', 0.048), ('external', 0.047), ('graph', 0.047), ('dependencies', 0.047), ('non', 0.046), ('conditional', 0.046), ('abc', 0.045), ('arrowheads', 0.045), ('trek', 0.045), ('unblocked', 0.045), ('independence', 0.045), ('discovery', 0.04), ('iff', 0.04), ('uncovered', 0.04), ('detectable', 0.039), ('ancestral', 0.037), ('transitivity', 0.037), ('minimal', 0.035), ('paths', 0.035), ('arrowhead', 0.034), ('independency', 0.034), ('unused', 0.034), ('invariant', 0.033), ('rules', 0.032), ('collider', 0.03), ('uncover', 0.03), ('valid', 0.03), ('independencies', 0.029), ('relation', 0.029), ('tails', 0.028), ('blocked', 0.027), ('overlapping', 0.027), ('system', 0.027), ('implies', 0.026), ('circle', 0.025), ('identi', 0.025), ('dependency', 0.025), ('connected', 0.024), ('arrows', 0.024), ('article', 0.024), ('observed', 0.023), ('meek', 0.023), ('variables', 0.023), ('context', 0.023), ('claassen', 0.023), ('colliders', 0.023), ('faithfulness', 0.023), ('fci', 0.023), ('mags', 0.023), ('nijmegen', 0.023), ('noncollider', 0.023), ('radboud', 0.023), ('tillman', 0.023), ('unblock', 0.023), ('unshielded', 0.023), ('ancestor', 0.022), ('netherlands', 0.022), ('unobserved', 0.021), ('models', 0.021), ('confounders', 0.02), ('dags', 0.02), ('heskes', 0.02), ('mani', 0.02), ('acyclic', 0.02), ('provably', 0.019), ('developments', 0.019), ('marks', 0.019), ('output', 0.019), ('link', 0.018), ('descendant', 0.018), ('causation', 0.018), ('glymour', 0.018), ('structure', 0.018), ('disjoint', 0.018), ('dependence', 0.017), ('concise', 0.017), ('externally', 0.017), ('hoyer', 0.017), ('experiment', 0.017), ('graphs', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 46 nips-2010-Causal discovery in multiple models from different experiments
2 0.60713726 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
Author: Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M. Mooij, Bernhard Schölkopf
Abstract: We propose a novel method for inferring whether X causes Y or vice versa from joint observations of X and Y . The basic idea is to model the observed data using probabilistic latent variable models, which incorporate the effects of unobserved noise. To this end, we consider the hypothetical effect variable to be a function of the hypothetical cause variable and an independent noise term (not necessarily additive). An important novel aspect of our work is that we do not restrict the model class, but instead put general non-parametric priors on this function and on the distribution of the cause. The causal direction can then be inferred by using standard Bayesian model selection. We evaluate our approach on synthetic data and real-world data and report encouraging results. 1
3 0.26116785 247 nips-2010-Sparse Instrumental Variables (SPIV) for Genome-Wide Studies
Author: Paul Mckeigue, Jon Krohn, Amos J. Storkey, Felix V. Agakov
Abstract: This paper describes a probabilistic framework for studying associations between multiple genotypes, biomarkers, and phenotypic traits in the presence of noise and unobserved confounders for large genetic studies. The framework builds on sparse linear methods developed for regression and modified here for inferring causal structures of richer networks with latent variables. The method is motivated by the use of genotypes as “instruments” to infer causal associations between phenotypic biomarkers and outcomes, without making the common restrictive assumptions of instrumental variable methods. The method may be used for an effective screening of potentially interesting genotype-phenotype and biomarker-phenotype associations in genome-wide studies, which may have important implications for validating biomarkers as possible proxy endpoints for early-stage clinical trials. Where the biomarkers are gene transcripts, the method can be used for fine mapping of quantitative trait loci (QTLs) detected in genetic linkage studies. The method is applied for examining effects of gene transcript levels in the liver on plasma HDL cholesterol levels for a sample of sequenced mice from a heterogeneous stock, with ∼ 105 genetic instruments and ∼ 47 × 103 gene transcripts. 1
4 0.20318606 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider multivariate regression problems involving high-dimensional predictor and response spaces. To efficiently address such problems, we propose a variable selection method, Multivariate Group Orthogonal Matching Pursuit, which extends the standard Orthogonal Matching Pursuit technique. This extension accounts for arbitrary sparsity patterns induced by domain-specific groupings over both input and output variables, while also taking advantage of the correlation that may exist between the multiple outputs. Within this framework, we then formulate the problem of inferring causal relationships over a collection of high-dimensional time series variables. When applied to time-evolving social media content, our models yield a new family of causality-based influence measures that may be seen as an alternative to the classic PageRank algorithm traditionally applied to hyperlink graphs. Theoretical guarantees, extensive simulations and empirical studies confirm the generality and value of our framework.
5 0.055198528 150 nips-2010-Learning concept graphs from text with stick-breaking priors
Author: America Chambers, Padhraic Smyth, Mark Steyvers
Abstract: We present a generative probabilistic model for learning general graph structures, which we term concept graphs, from text. Concept graphs provide a visual summary of the thematic content of a collection of documents—a task that is difficult to accomplish using only keyword search. The proposed model can learn different types of concept graph structures and is capable of utilizing partial prior knowledge about graph structure as well as labeled documents. We describe a generative model that is based on a stick-breaking process for graphs, and a Markov Chain Monte Carlo inference procedure. Experiments on simulated data show that the model can recover known graph structure when learning in both unsupervised and semi-supervised modes. We also show that the proposed model is competitive in terms of empirical log likelihood with existing structure-based topic models (hPAM and hLDA) on real-world text data sets. Finally, we illustrate the application of the model to the problem of updating Wikipedia category graphs. 1
6 0.049902219 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation
7 0.047239576 128 nips-2010-Infinite Relational Modeling of Functional Connectivity in Resting State fMRI
8 0.04159693 162 nips-2010-Link Discovery using Graph Feature Tracking
9 0.039427005 115 nips-2010-Identifying Dendritic Processing
10 0.036850695 39 nips-2010-Bayesian Action-Graph Games
11 0.034323104 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
12 0.033964433 63 nips-2010-Distributed Dual Averaging In Networks
13 0.033949472 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
14 0.033009373 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
15 0.032379542 290 nips-2010-t-logistic regression
16 0.031961016 117 nips-2010-Identifying graph-structured activation patterns in networks
17 0.031131769 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts
18 0.030338135 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
19 0.030283062 71 nips-2010-Efficient Relational Learning with Hidden Variable Detection
20 0.02955466 108 nips-2010-Graph-Valued Regression
topicId topicWeight
[(0, 0.104), (1, 0.033), (2, 0.002), (3, 0.096), (4, -0.085), (5, -0.076), (6, 0.011), (7, 0.041), (8, -0.076), (9, -0.121), (10, -0.303), (11, 0.153), (12, -0.117), (13, -0.282), (14, 0.193), (15, 0.52), (16, -0.06), (17, -0.129), (18, -0.12), (19, -0.11), (20, 0.229), (21, -0.094), (22, -0.02), (23, -0.004), (24, 0.065), (25, 0.022), (26, -0.017), (27, 0.014), (28, -0.023), (29, -0.028), (30, -0.059), (31, 0.036), (32, -0.032), (33, 0.009), (34, -0.037), (35, 0.013), (36, -0.008), (37, 0.031), (38, 0.001), (39, 0.029), (40, 0.03), (41, -0.036), (42, 0.008), (43, 0.009), (44, 0.043), (45, -0.027), (46, -0.01), (47, -0.004), (48, -0.004), (49, 0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.98326707 46 nips-2010-Causal discovery in multiple models from different experiments
2 0.88288325 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
3 0.78627342 247 nips-2010-Sparse Instrumental Variables (SPIV) for Genome-Wide Studies
4 0.60368174 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
5 0.20266114 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
Author: Danny Bickson, Carlos Guestrin
Abstract: Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, L´ vy and Gaussian distrie butions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise. 1
6 0.18579175 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
7 0.16024867 150 nips-2010-Learning concept graphs from text with stick-breaking priors
8 0.15453938 3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation
9 0.14716856 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
10 0.14424044 280 nips-2010-Unsupervised Kernel Dimension Reduction
11 0.14056168 2 nips-2010-A Bayesian Approach to Concept Drift
12 0.14004777 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
13 0.13827001 107 nips-2010-Global seismic monitoring as probabilistic inference
14 0.1354069 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
15 0.13381577 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
16 0.13368699 125 nips-2010-Inference and communication in the game of Password
17 0.13219053 204 nips-2010-Penalized Principal Component Regression on Graphs for Analysis of Subnetworks
18 0.1314932 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
19 0.13007922 115 nips-2010-Identifying Dendritic Processing
20 0.1295623 224 nips-2010-Regularized estimation of image statistics by Score Matching
topicId topicWeight
[(13, 0.035), (17, 0.013), (21, 0.309), (27, 0.066), (30, 0.066), (35, 0.016), (45, 0.163), (50, 0.042), (52, 0.039), (60, 0.046), (77, 0.051), (78, 0.013), (90, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.77822357 46 nips-2010-Causal discovery in multiple models from different experiments
2 0.70165676 166 nips-2010-Minimum Average Cost Clustering
Author: Kiyohito Nagano, Yoshinobu Kawahara, Satoru Iwata
Abstract: A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments. 1
3 0.65576363 148 nips-2010-Learning Networks of Stochastic Differential Equations
Author: José Pereira, Morteza Ibrahimi, Andrea Montanari
Abstract: We consider linear models for stochastic dynamics. To any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval T . We analyze the ℓ1 -regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are uniform in the sampling rate as long as this is sufficiently high. This result substantiates the notion of a well defined ‘time complexity’ for the network inference problem. keywords: Gaussian processes, model selection and structure learning, graphical models, sparsity and feature selection. 1 Introduction and main results Let G = (V, E) be a directed graph with weight A0 ∈ R associated to the directed edge (j, i) from ij j ∈ V to i ∈ V . To each node i ∈ V in this network is associated an independent standard Brownian motion bi and a variable xi taking values in R and evolving according to A0 xj (t) dt + dbi (t) , ij dxi (t) = j∈∂+ i where ∂+ i = {j ∈ V : (j, i) ∈ E} is the set of ‘parents’ of i. Without loss of generality we shall take V = [p] ≡ {1, . . . , p}. In words, the rate of change of xi is given by a weighted sum of the current values of its neighbors, corrupted by white noise. In matrix notation, the same system is then represented by dx(t) = A0 x(t) dt + db(t) , p (1) 0 p×p with x(t) ∈ R , b(t) a p-dimensional standard Brownian motion and A ∈ R a matrix with entries {A0 }i,j∈[p] whose sparsity pattern is given by the graph G. We assume that the linear system ij x(t) = A0 x(t) is stable (i.e. that the spectrum of A0 is contained in {z ∈ C : Re(z) < 0}). Further, ˙ we assume that x(t = 0) is in its stationary state. More precisely, x(0) is a Gaussian random variable 1 independent of b(t), distributed according to the invariant measure. Under the stability assumption, this a mild restriction, since the system converges exponentially to stationarity. A portion of time length T of the system trajectory {x(t)}t∈[0,T ] is observed and we ask under which conditions these data are sufficient to reconstruct the graph G (i.e., the sparsity pattern of A0 ). We are particularly interested in computationally efficient procedures, and in characterizing the scaling of the learning time for large networks. Can the network structure be learnt in a time scaling linearly with the number of its degrees of freedom? As an example application, chemical reactions can be conveniently modeled by systems of nonlinear stochastic differential equations, whose variables encode the densities of various chemical species [1, 2]. Complex biological networks might involve hundreds of such species [3], and learning stochastic models from data is an important (and challenging) computational task [4]. Considering one such chemical reaction network in proximity of an equilibrium point, the model (1) can be used to trace fluctuations of the species counts with respect to the equilibrium values. The network G would represent in this case the interactions between different chemical factors. Work in this area focused so-far on low-dimensional networks, i.e. on methods that are guaranteed to be correct for fixed p, as T → ∞, while we will tackle here the regime in which both p and T diverge. 
Before stating our results, it is useful to stress a few important differences with respect to classical graphical model learning problems: (i) Samples are not independent. This can (and does) increase the sample complexity. (ii) On the other hand, infinitely many samples are given as data (in fact a collection indexed by the continuous parameter t ∈ [0, T ]). Of course one can select a finite subsample, for instance at regularly spaced times {x(i η)}i=0,1,... . This raises the question as to whether the learning performances depend on the choice of the spacing η. (iii) In particular, one expects that choosing η sufficiently large as to make the configurations in the subsample approximately independent can be harmful. Indeed, the matrix A0 contains more information than the stationary distribution of the above process (1), and only the latter can be learned from independent samples. (iv) On the other hand, letting η → 0, one can produce an arbitrarily large number of distinct samples. However, samples become more dependent, and intuitively one expects that there is limited information to be harnessed from a given time interval T . Our results confirm in a detailed and quantitative way these intuitions. 1.1 Results: Regularized least squares Regularized least squares is an efficient and well-studied method for support recovery. We will discuss relations with existing literature in Section 1.3. In the present case, the algorithm reconstructs independently each row of the matrix A0 . The rth row, A0 , is estimated by solving the following convex optimization problem for Ar ∈ Rp r minimize L(Ar ; {x(t)}t∈[0,T ] ) + λ Ar 1 , (2) where the likelihood function L is defined by L(Ar ; {x(t)}t∈[0,T ] ) = 1 2T T 0 (A∗ x(t))2 dt − r 1 T T 0 (A∗ x(t)) dxr (t) . r (3) (Here and below M ∗ denotes the transpose of matrix/vector M .) To see that this likelihood function is indeed related to least squares, one can formally write xr (t) = dxr (t)/dt and complete the square ˙ for the right hand side of Eq. (3), thus getting the integral (A∗ x(t) − xr (t))2 dt − xr (t)2 dt. ˙ ˙ r The first term is a sum of square residuals, and the second is independent of A. Finally the ℓ1 regularization term in Eq. (2) has the role of shrinking to 0 a subset of the entries Aij thus effectively selecting the structure. Let S 0 be the support of row A0 , and assume |S 0 | ≤ k. We will refer to the vector sign(A0 ) as to r r the signed support of A0 (where sign(0) = 0 by convention). Let λmax (M ) and λmin (M ) stand for r 2 the maximum and minimum eigenvalue of a square matrix M respectively. Further, denote by Amin the smallest absolute value among the non-zero entries of row A0 . r When stable, the diffusion process (1) has a unique stationary measure which is Gaussian with covariance Q0 ∈ Rp×p given by the solution of Lyapunov’s equation [5] A0 Q0 + Q0 (A0 )∗ + I = 0. (4) Our guarantee for regularized least squares is stated in terms of two properties of the covariance Q0 and one assumption on ρmin (A0 ) (given a matrix M , we denote by ML,R its submatrix ML,R ≡ (Mij )i∈L,j∈R ): (a) We denote by Cmin ≡ λmin (Q0 0 ,S 0 ) the minimum eigenvalue of the restriction of Q0 to S the support S 0 and assume Cmin > 0. (b) We define the incoherence parameter α by letting |||Q0 (S 0 )C ,S 0 Q0 S 0 ,S 0 and assume α > 0. (Here ||| · |||∞ is the operator sup norm.) −1 |||∞ = 1 − α, ∗ (c) We define ρmin (A0 ) = −λmax ((A0 + A0 )/2) and assume ρmin (A0 ) > 0. Note this is a stronger form of stability assumption. 
Our main result is to show that there exists a well defined time complexity, i.e. a minimum time interval T such that, observing the system for time T enables us to reconstruct the network with high probability. This result is stated in the following theorem. Theorem 1.1. Consider the problem of learning the support S 0 of row A0 of the matrix A0 from a r sample trajectory {x(t)}t∈[0,T ] distributed according to the model (1). If T > 104 k 2 (k ρmin (A0 )−2 + A−2 ) 4pk min log , 2 α2 ρmin (A0 )Cmin δ (5) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36 log(4p/δ)/(T α2 ρmin (A0 )) . The time complexity is logarithmic in the number of variables and polynomial in the support size. Further, it is roughly inversely proportional to ρmin (A0 ), which is quite satisfying conceptually, since ρmin (A0 )−1 controls the relaxation time of the mixes. 1.2 Overview of other results So far we focused on continuous-time dynamics. While, this is useful in order to obtain elegant statements, much of the paper is in fact devoted to the analysis of the following discrete-time dynamics, with parameter η > 0: x(t) = x(t − 1) + ηA0 x(t − 1) + w(t), t ∈ N0 . (6) Here x(t) ∈ Rp is the vector collecting the dynamical variables, A0 ∈ Rp×p specifies the dynamics as above, and {w(t)}t≥0 is a sequence of i.i.d. normal vectors with covariance η Ip×p (i.e. with independent components of variance η). We assume that consecutive samples {x(t)}0≤t≤n are given and will ask under which conditions regularized least squares reconstructs the support of A0 . The parameter η has the meaning of a time-step size. The continuous-time model (1) is recovered, in a sense made precise below, by letting η → 0. Indeed we will prove reconstruction guarantees that are uniform in this limit as long as the product nη (which corresponds to the time interval T in the previous section) is kept constant. For a formal statement we refer to Theorem 3.1. Theorem 1.1 is indeed proved by carefully controlling this limit. The mathematical challenge in this problem is related to the fundamental fact that the samples {x(t)}0≤t≤n are dependent (and strongly dependent as η → 0). Discrete time models of the form (6) can arise either because the system under study evolves by discrete steps, or because we are subsampling a continuous time system modeled as in Eq. (1). Notice that in the latter case the matrices A0 appearing in Eq. (6) and (1) coincide only to the zeroth order in η. Neglecting this technical complication, the uniformity of our reconstruction guarantees as η → 0 has an appealing interpretation already mentioned above. Whenever the samples spacing is not too large, the time complexity (i.e. the product nη) is roughly independent of the spacing itself. 3 1.3 Related work A substantial amount of work has been devoted to the analysis of ℓ1 regularized least squares, and its variants [6, 7, 8, 9, 10]. The most closely related results are the one concerning high-dimensional consistency for support recovery [11, 12]. Our proof follows indeed the line of work developed in these papers, with two important challenges. First, the design matrix is in our case produced by a stochastic diffusion, and it does not necessarily satisfies the irrepresentability conditions used by these works. Second, the observations are not corrupted by i.i.d. 
noise (since successive configurations are correlated) and therefore elementary concentration inequalities are not sufficient. Learning sparse graphical models via ℓ1 regularization is also a topic with significant literature. In the Gaussian case, the graphical LASSO was proposed to reconstruct the model from i.i.d. samples [13]. In the context of binary pairwise graphical models, Ref. [11] proves high-dimensional consistency of regularized logistic regression for structural learning, under a suitable irrepresentability conditions on a modified covariance. Also this paper focuses on i.i.d. samples. Most of these proofs builds on the technique of [12]. A naive adaptation to the present case allows to prove some performance guarantee for the discrete-time setting. However the resulting bounds are not uniform as η → 0 for nη = T fixed. In particular, they do not allow to prove an analogous of our continuous time result, Theorem 1.1. A large part of our effort is devoted to producing more accurate probability estimates that capture the correct scaling for small η. Similar issues were explored in the study of stochastic differential equations, whereby one is often interested in tracking some slow degrees of freedom while ‘averaging out’ the fast ones [14]. The relevance of this time-scale separation for learning was addressed in [15]. Let us however emphasize that these works focus once more on system with a fixed (small) number of dimensions p. Finally, the related topic of learning graphical models for autoregressive processes was studied recently in [16, 17]. The convex relaxation proposed in these papers is different from the one developed here. Further, no model selection guarantee was proved in [16, 17]. 2 Illustration of the main results It might be difficult to get a clear intuition of Theorem 1.1, mainly because of conditions (a) and (b), which introduce parameters Cmin and α. The same difficulty arises with analogous results on the high-dimensional consistency of the LASSO [11, 12]. In this section we provide concrete illustration both via numerical simulations, and by checking the condition on specific classes of graphs. 2.1 Learning the laplacian of graphs with bounded degree Given a simple graph G = (V, E) on vertex set V = [p], its laplacian ∆G is the symmetric p × p matrix which is equal to the adjacency matrix of G outside the diagonal, and with entries ∆G = ii −deg(i) on the diagonal [18]. (Here deg(i) denotes the degree of vertex i.) It is well known that ∆G is negative semidefinite, with one eigenvalue equal to 0, whose multiplicity is equal to the number of connected components of G. The matrix A0 = −m I + ∆G fits into the setting of Theorem 1.1 for m > 0. The corresponding model (1.1) describes the over-damped dynamics of a network of masses connected by springs of unit strength, and connected by a spring of strength m to the origin. We obtain the following result. Theorem 2.1. Let G be a simple connected graph of maximum vertex degree k and consider the model (1.1) with A0 = −m I + ∆G where ∆G is the laplacian of G and m > 0. If k+m 5 4pk T ≥ 2 · 105 k 2 , (7) (k + m2 ) log m δ then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36(k + m)2 log(4p/δ)/(T m3 ). 
In other words, for m bounded away from 0 and ∞, regularized least squares regression correctly reconstructs the graph G from a trajectory of time length which is polynomial in the degree and logarithmic in the system size. Notice that once the graph is known, the laplacian ∆G is uniquely determined. Also, the proof technique used for this example is generalizable to other graphs as well. 4 2800 Min. # of samples for success prob. = 0.9 1 0.9 p = 16 p = 32 0.8 Probability of success p = 64 0.7 p = 128 p = 256 0.6 p = 512 0.5 0.4 0.3 0.2 0.1 0 0 50 100 150 200 250 300 T=nη 350 400 2600 2400 2200 2000 1800 1600 1400 1200 1 10 450 2 3 10 10 p Figure 1: (left) Probability of success vs. length of the observation interval nη. (right) Sample complexity for 90% probability of success vs. p. 2.2 Numerical illustrations In this section we present numerical validation of the proposed method on synthetic data. The results confirm our observations in Theorems 1.1 and 3.1, below, namely that the time complexity scales logarithmically with the number of nodes in the network p, given a constant maximum degree. Also, the time complexity is roughly independent of the sampling rate. In Fig. 1 and 2 we consider the discrete-time setting, generating data as follows. We draw A0 as a random sparse matrix in {0, 1}p×p with elements chosen independently at random with P(A0 = 1) = k/p, k = 5. The ij process xn ≡ {x(t)}0≤t≤n is then generated according to Eq. (6). We solve the regularized least 0 square problem (the cost function is given explicitly in Eq. (8) for the discrete-time case) for different values of n, the number of observations, and record if the correct support is recovered for a random row r using the optimum value of the parameter λ. An estimate of the probability of successful recovery is obtained by repeating this experiment. Note that we are estimating here an average probability of success over randomly generated matrices. The left plot in Fig.1 depicts the probability of success vs. nη for η = 0.1 and different values of p. Each curve is obtained using 211 instances, and each instance is generated using a new random matrix A0 . The right plot in Fig.1 is the corresponding curve of the sample complexity vs. p where sample complexity is defined as the minimum value of nη with probability of success of 90%. As predicted by Theorem 2.1 the curve shows the logarithmic scaling of the sample complexity with p. In Fig. 2 we turn to the continuous-time model (1). Trajectories are generated by discretizing this stochastic differential equation with step δ much smaller than the sampling rate η. We draw random matrices A0 as above and plot the probability of success for p = 16, k = 4 and different values of η, as a function of T . We used 211 instances for each curve. As predicted by Theorem 1.1, for a fixed observation interval T , the probability of success converges to some limiting value as η → 0. 3 Discrete-time model: Statement of the results Consider a system evolving in discrete time according to the model (6), and let xn ≡ {x(t)}0≤t≤n 0 be the observed portion of the trajectory. The rth row A0 is estimated by solving the following r convex optimization problem for Ar ∈ Rp minimize L(Ar ; xn ) + λ Ar 0 where L(Ar ; xn ) ≡ 0 1 2η 2 n 1 , (8) n−1 2 t=0 {xr (t + 1) − xr (t) − η A∗ x(t)} . r (9) Apart from an additive constant, the η → 0 limit of this cost function can be shown to coincide with the cost function in the continuous time case, cf. Eq. (3). 
Indeed the proof of Theorem 1.1 will amount to a more precise version of this statement. Furthermore, L(Ar ; xn ) is easily seen to be the 0 log-likelihood of Ar within model (6). 5 1 1 0.9 0.95 0.9 0.7 Probability of success Probability of success 0.8 η = 0.04 η = 0.06 0.6 η = 0.08 0.5 η = 0.1 0.4 η = 0.14 0.3 η = 0.22 η = 0.18 0.85 0.8 0.75 0.7 0.65 0.2 0.6 0.1 0 50 100 150 T=nη 200 0.55 0.04 250 0.06 0.08 0.1 0.12 η 0.14 0.16 0.18 0.2 0.22 Figure 2: (right)Probability of success vs. length of the observation interval nη for different values of η. (left) Probability of success vs. η for a fixed length of the observation interval, (nη = 150) . The process is generated for a small value of η and sampled at different rates. As before, we let S 0 be the support of row A0 , and assume |S 0 | ≤ k. Under the model (6) x(t) has r a Gaussian stationary state distribution with covariance Q0 determined by the following modified Lyapunov equation A0 Q0 + Q0 (A0 )∗ + ηA0 Q0 (A0 )∗ + I = 0 . (10) It will be clear from the context whether A0 /Q0 refers to the dynamics/stationary matrix from the continuous or discrete time system. We assume conditions (a) and (b) introduced in Section 1.1, and adopt the notations already introduced there. We use as a shorthand notation σmax ≡ σmax (I +η A0 ) where σmax (.) is the maximum singular value. Also define D ≡ 1 − σmax /η . We will assume D > 0. As in the previous section, we assume the model (6) is initiated in the stationary state. Theorem 3.1. Consider the problem of learning the support S 0 of row A0 from the discrete-time r trajectory {x(t)}0≤t≤n . If nη > 4pk 104 k 2 (kD−2 + A−2 ) min log , 2 DC 2 α δ min (11) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = (36 log(4p/δ))/(Dα2 nη). In other words the discrete-time sample complexity, n, is logarithmic in the model dimension, polynomial in the maximum network degree and inversely proportional to the time spacing between samples. The last point is particularly important. It enables us to derive the bound on the continuoustime sample complexity as the limit η → 0 of the discrete-time sample complexity. It also confirms our intuition mentioned in the Introduction: although one can produce an arbitrary large number of samples by sampling the continuous process with finer resolutions, there is limited amount of information that can be harnessed from a given time interval [0, T ]. 4 Proofs In the following we denote by X ∈ Rn×p the matrix whose (t + 1)th column corresponds to the configuration x(t), i.e. X = [x(0), x(1), . . . , x(n − 1)]. Further ∆X ∈ Rn×p is the matrix containing configuration changes, namely ∆X = [x(1) − x(0), . . . , x(n) − x(n − 1)]. Finally we write W = [w(1), . . . , w(n − 1)] for the matrix containing the Gaussian noise realization. Equivalently, The r th row of W is denoted by Wr . W = ∆X − ηA X . In order to lighten the notation, we will omit the reference to xn in the likelihood function (9) and 0 simply write L(Ar ). We define its normalized gradient and Hessian by G = −∇L(A0 ) = r 1 ∗ XWr , nη Q = ∇2 L(A0 ) = r 6 1 XX ∗ . n (12) 4.1 Discrete time In this Section we outline our prove for our main result for discrete-time dynamics, i.e., Theorem 3.1. We start by stating a set of sufficient conditions for regularized least squares to work. 
We then present a series of concentration lemmas used to prove the validity of these conditions, and finally we sketch the proof. As mentioned, the proof strategy, and in particular the following proposition, which provides a compact set of sufficient conditions for the support to be recovered correctly, is analogous to the one in [12]. A proof of this proposition can be found in the supplementary material.

Proposition 4.1. Let $\alpha, C_{\min} > 0$ be defined by
$$\lambda_{\min}\big(Q^0_{S^0,S^0}\big) \equiv C_{\min} , \qquad \big|\big|\big|Q^0_{(S^0)^C,S^0} \big(Q^0_{S^0,S^0}\big)^{-1}\big|\big|\big|_\infty \equiv 1-\alpha . \qquad (13)$$
If the following conditions hold, then the regularized least squares solution (8) correctly recovers the signed support $\mathrm{sign}(A^0_r)$:
$$\|G\|_\infty \le \frac{\lambda\alpha}{3} , \qquad \|G_{S^0}\|_\infty \le \frac{A_{\min} C_{\min}}{4k} - \lambda , \qquad (14)$$
$$\big|\big|\big|Q_{(S^0)^C,S^0} - Q^0_{(S^0)^C,S^0}\big|\big|\big|_\infty \le \frac{\alpha\, C_{\min}}{12\sqrt{k}} , \qquad \big|\big|\big|Q_{S^0,S^0} - Q^0_{S^0,S^0}\big|\big|\big|_\infty \le \frac{\alpha\, C_{\min}}{12\sqrt{k}} . \qquad (15)$$
Further, the same statement holds for the continuous-time model, provided $G$ and $Q$ are the gradient and the Hessian of the likelihood (3).

The proof of Theorem 3.1 consists in checking that, under the hypothesis (11) on the number of consecutive configurations, conditions (14)-(15) hold with high probability. Checking these conditions can in turn be regarded as concentration-of-measure statements. Indeed, if the expectation is taken with respect to a stationary trajectory, we have $\mathbb{E}\{G\} = 0$ and $\mathbb{E}\{Q\} = Q^0$.

4.1.1 Technical lemmas

In this section we state the concentration lemmas needed to prove Theorem 3.1. These are non-trivial because $G$ and $Q$ are quadratic functions of dependent random variables, namely the samples $\{x(t)\}_{0\le t\le n}$. The proofs of Proposition 4.2, Proposition 4.3 and Corollary 4.4 can be found in the supplementary material. Our first proposition implies concentration of $G$ around $0$.

Proposition 4.2. Let $S \subseteq [p]$ be any set of vertices and $\epsilon < 1/2$. If $\sigma_{\max} \equiv \sigma_{\max}(I+\eta A^0) < 1$, then
$$\mathbb{P}\big( \|G_S\|_\infty > \epsilon \big) \le 2|S|\, e^{-n(1-\sigma_{\max})\epsilon^2/4} . \qquad (16)$$

We furthermore need to bound the matrix norms as per (15) in Proposition 4.1. First we relate bounds on $|||Q_{JS} - Q^0_{JS}|||_\infty$ to bounds on $|Q_{ij} - Q^0_{ij}|$, $(i\in J,\ j\in S)$, where $J$ and $S$ are any subsets of $\{1,\dots,p\}$. We have
$$\mathbb{P}\big( |||Q_{JS} - Q^0_{JS}|||_\infty > \epsilon \big) \le |J|\,|S| \max_{i\in J,\ j\in S} \mathbb{P}\big( |Q_{ij} - Q^0_{ij}| > \epsilon/|S| \big) . \qquad (17)$$
Then we bound $|Q_{ij} - Q^0_{ij}|$ using the following proposition.

Proposition 4.3. Let $i,j \in \{1,\dots,p\}$, $\sigma_{\max} \equiv \sigma_{\max}(I+\eta A^0) < 1$, $T = \eta n > 3/D$ and $0 < \epsilon < 2/D$, where $D = (1-\sigma_{\max})/\eta$. Then
$$\mathbb{P}\big( |Q_{ij} - Q^0_{ij}| > \epsilon \big) \le 2\, e^{-\frac{n}{32\eta^2}(1-\sigma_{\max})^3 \epsilon^2} . \qquad (18)$$

Finally, the next corollary follows from Proposition 4.3 and Eq. (17).

Corollary 4.4. Let $J, S$ (with $|S| \le k$) be any two subsets of $\{1,\dots,p\}$, and let $\sigma_{\max} \equiv \sigma_{\max}(I+\eta A^0) < 1$, $\epsilon < 2k/D$ and $n\eta > 3/D$ (where $D = (1-\sigma_{\max})/\eta$). Then
$$\mathbb{P}\big( |||Q_{JS} - Q^0_{JS}|||_\infty > \epsilon \big) \le 2|J|\,k\, e^{-\frac{n}{32 k^2 \eta^2}(1-\sigma_{\max})^3 \epsilon^2} . \qquad (19)$$

4.1.2 Outline of the proof of Theorem 3.1

With these concentration bounds we can now prove Theorem 3.1. All we need to do is to compute the probability that the conditions given by Proposition 4.1 hold. From the statement of the theorem we have that the first two conditions ($\alpha, C_{\min} > 0$) of Proposition 4.1 hold. In order for the first condition on $G$ to imply the second condition on $G$, we assume that $\lambda\alpha/3 \le (A_{\min} C_{\min})/(4k) - \lambda$, which is guaranteed to hold if
$$\lambda \le \frac{A_{\min} C_{\min}}{8k} . \qquad (20)$$
We also combine the last two conditions on $Q$, obtaining
$$\big|\big|\big|Q_{[p],S^0} - Q^0_{[p],S^0}\big|\big|\big|_\infty \le \frac{\alpha\, C_{\min}}{12\sqrt{k}} , \qquad (21)$$
since $[p] = S^0 \cup (S^0)^C$.
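To complement the proof outline, the quantities $C_{\min}$ and $\alpha$ of (13) and the conditions (14)-(15) can be evaluated numerically on a simulated instance. The sketch below is an illustration, not part of the paper's proofs: it assumes that $Q^0$, the empirical $G$ and $Q$ of (12), the true support $S^0$, $A_{\min}$, $\lambda$ and $k$ are available (for instance from the simulation sketched after (12)); the helper names are made up for the example.

```python
# Illustrative check of the quantities in (13) and the conditions (14)-(15).
import numpy as np

def linf_norm(M):
    """Operator infinity-norm |||M|||_inf = max_i sum_j |M_ij|, as used in (13)-(15)."""
    return np.abs(M).sum(axis=1).max()

def check_proposition_41(G, Q, Q0, S0, A_min, lam, k):
    """Evaluate C_min and alpha of (13) and the four sufficient conditions of (14)-(15)."""
    p = Q0.shape[0]
    S0 = np.asarray(S0)
    Sc = np.setdiff1d(np.arange(p), S0)

    Q0_SS = Q0[np.ix_(S0, S0)]
    C_min = np.linalg.eigvalsh(Q0_SS).min()
    alpha = 1.0 - linf_norm(Q0[np.ix_(Sc, S0)] @ np.linalg.inv(Q0_SS))

    tol = alpha * C_min / (12.0 * np.sqrt(k))
    conditions = {
        "G_all":  np.abs(G).max() <= lam * alpha / 3.0,                       # (14), first part
        "G_S0":   np.abs(G[S0]).max() <= A_min * C_min / (4.0 * k) - lam,     # (14), second part
        "Q_ScS0": linf_norm(Q[np.ix_(Sc, S0)] - Q0[np.ix_(Sc, S0)]) <= tol,   # (15)
        "Q_S0S0": linf_norm(Q[np.ix_(S0, S0)] - Q0[np.ix_(S0, S0)]) <= tol,   # (15)
    }
    return C_min, alpha, conditions

# Example call (inputs assumed to come from an earlier simulation):
# C_min, alpha, conds = check_proposition_41(G, Q, Q0, S0=np.flatnonzero(A0[r]),
#                                            A_min=1.0, lam=0.05, k=4)
```

On small instances these conditions will often fail even when $\ell_1$-regularized least squares does recover the support, reflecting the fact that Proposition 4.1 gives sufficient rather than necessary conditions and that its constants are conservative.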
To conclude the proof of Theorem 3.1, we impose that the probability of the condition on $Q$ failing and the probability of the condition on $G$ failing are each upper bounded by $\delta/2$, using Proposition 4.2 and Corollary 4.4. It is shown in the supplementary material that this is satisfied if condition (11) holds.

4.2 Outline of the proof of Theorem 1.1

To prove Theorem 1.1 we recall that Proposition 4.1 holds provided the appropriate continuous-time expressions are used for $G$ and $Q$, namely
$$G = -\nabla \mathcal{L}(A^0_r) = \frac{1}{T}\int_0^T x(t)\, db_r(t) , \qquad Q = \nabla^2 \mathcal{L}(A^0_r) = \frac{1}{T}\int_0^T x(t)\, x(t)^*\, dt . \qquad (22)$$
These are of course random variables. In order to distinguish them from the discrete-time versions, we adopt the notation $G^n, Q^n$ for the latter. We claim that these random variables can be coupled (i.e. defined on the same probability space) in such a way that $G^n \to G$ and $Q^n \to Q$ almost surely as $n \to \infty$ for fixed $T$. Under assumption (5), it is easy to show that (11) holds for all $n > n_0$, with $n_0$ a sufficiently large constant (for a proof see the supplementary material). Therefore, by the proof of Theorem 3.1, the conditions in Proposition 4.1 hold for the gradient $G^n$ and Hessian $Q^n$ for any $n \ge n_0$, with probability larger than $1-\delta$. But by the claimed convergence $G^n \to G$ and $Q^n \to Q$, they also hold for $G$ and $Q$ with probability at least $1-\delta$, which proves the theorem.

We are left with the task of showing that the discrete- and continuous-time processes can be coupled in such a way that $G^n \to G$ and $Q^n \to Q$. With a slight abuse of notation, the state of the discrete-time system (6) will be denoted by $x(i)$, where $i \in \mathbb{N}$, and the state of the continuous-time system (1) by $x(t)$, where $t \in \mathbb{R}$. We denote by $Q^0$ the solution of (4) and by $Q^0(\eta)$ the solution of (10). It is easy to check that $Q^0(\eta) \to Q^0$ as $\eta \to 0$, by uniqueness of the stationary distribution. The initial state of the continuous-time system, $x(t=0)$, is a $\mathsf{N}(0, Q^0)$ random variable independent of $b(t)$, and the initial state of the discrete-time system is defined to be $x(i=0) = (Q^0(\eta))^{1/2} (Q^0)^{-1/2} x(t=0)$. At subsequent times, $x(i)$ and $x(t)$ are generated by the respective dynamical systems using the same matrix $A^0$ and the common randomness provided by the standard Brownian motion $\{b(t)\}_{0\le t\le T}$ in $\mathbb{R}^p$. In order to couple $x(t)$ and $x(i)$, we construct $w(i)$, the noise driving the discrete-time system, by letting $w(i) \equiv b(Ti/n) - b(T(i-1)/n)$. The almost sure convergence $G^n \to G$ and $Q^n \to Q$ then follows from the standard convergence of random walk to Brownian motion.

Acknowledgments

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211, and by a Portuguese Doctoral FCT fellowship.

References

[1] D.T. Gillespie. Stochastic simulation of chemical kinetics. Annual Review of Physical Chemistry, 58:35–55, 2007.
[2] D. Higham. Modeling and Simulating Chemical Reactions. SIAM Review, 50:347–368, 2008.
[3] N.D. Lawrence et al., editors. Learning and Inference in Computational Systems Biology. MIT Press, 2010.
[4] T. Toni, D. Welch, N. Strelkova, A. Ipsen, and M.P.H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface, 6:187–202, 2009.
[5] K. Zhou, J.C. Doyle, and K. Glover. Robust and Optimal Control. Prentice Hall, 1996.
[6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
[7] D.L. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution.
Communications on Pure and Applied Mathematics, 59(7):907–934, 2006.
[8] D.L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
[9] T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37:2109–2144, 2009.
[10] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202, 2009.
[11] M.J. Wainwright, P. Ravikumar, and J.D. Lafferty. High-dimensional graphical model selection using l1-regularized logistic regression. Advances in Neural Information Processing Systems, 19:1465, 2007.
[12] P. Zhao and B. Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
[13] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.
[14] K. Ball, T.G. Kurtz, L. Popovic, and G. Rempala. Asymptotic analysis of multiscale approximations to reaction networks. Ann. Appl. Prob., 16:1925–1961, 2006.
[15] G.A. Pavliotis and A.M. Stuart. Parameter estimation for multiscale diffusions. J. Stat. Phys., 127:741–781, 2007.
[16] J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. pages 89–116, 2010.
[17] J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 2010. Submitted.
[18] F.R.K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, 1997.
[19] P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. Annals of Statistics, 2008.
4 0.554012 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of Gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can outperform an equivalent feedforward network. However, a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternatively, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
5 0.55247921 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, large-scale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task, especially if the network is large (the pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of high-dimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
6 0.55130726 155 nips-2010-Learning the context of a category
7 0.54929209 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
8 0.54842991 63 nips-2010-Distributed Dual Averaging In Networks
9 0.54694176 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
10 0.54590601 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
11 0.54584938 280 nips-2010-Unsupervised Kernel Dimension Reduction
12 0.5453456 265 nips-2010-The LASSO risk: asymptotic results and real world examples
13 0.54507488 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
14 0.54476094 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
15 0.54418063 268 nips-2010-The Neural Costs of Optimal Control
16 0.54323334 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
17 0.54293096 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
18 0.54255974 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
19 0.54224789 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
20 0.54223627 7 nips-2010-A Family of Penalty Functions for Structured Sparsity