jmlr jmlr2008 jmlr2008-84 knowledge-graph by maker-knowledge-mining

84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models


Source: pdf

Author: Tianjiao Chu, Clark Glymour

Abstract: Pointwise consistent, feasible procedures for estimating contemporaneous linear causal structure from time series data have been developed using multiple conditional independence tests, but no such procedures are available for non-linear systems. We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. Several variants of the procedure are also discussed. Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. [sent-6, score-0.386]

2 We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. [sent-7, score-0.55]

3 Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. [sent-8, score-0.658]

4 We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. [sent-9, score-0.464]

5 Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices 1. [sent-11, score-0.828]

6 (2002), Demiralp and Hoover (2003), and Hoover (2005) generalized Swanson and Granger’s procedure to allow specification searches for contemporaneous linear systems among all partial orderings of the dependencies among the variables. [sent-15, score-0.397]

7 , 2000), which allows latent variables, or an algorithm due to Richardson and Spirtes (1999) that allows linear feedback relations, though no algorithm is available that is consistent for search for linear causal models when both latent variables and feedback may be present. [sent-21, score-0.544]

8 (Another group of causal inference algorithms, those based on model scores such as Bayesian posteriors, is unable to handle either latent variables or feedbacks, except under extra constraints (Silva et al. [sent-23, score-0.422]

9 In particular, Xia and An (1999) provides a set of sufficient conditions for the geometric ergodicity of time series generated by projection pursuit models, of which our additive non-linear model is a special case. [sent-39, score-0.358]

10 A Causal Inference Algorithm Consider a time series {X}t = {X1 , · · · , Xt , · · ·} generated from a lag T additive non-linear model. [sent-41, score-0.793]

11 These pieces of information are not generally sufficient for currently available causal inference algorithms, such as PC and FCI, to be informative: these procedures require (in the worst case) complete conditional independence information. [sent-44, score-0.484]

12 In summary: formally, a causal graph G is defined as an ordered pair ⟨V, E⟩, where V is the set of variables in G, and E the set of edges in G. [sent-50, score-0.34]

13 Proposition 3: Consider a time series {X}t = {X1 , · · · , Xt , · · ·} generated from a lag T additive non-linear model. [sent-59, score-0.793]

14 Proposition 4: Consider a time series {X}t = {X1 , · · · , Xt , · · ·} generated from a lag T additive non-linear model. [sent-68, score-0.793]

15 Because there is no observed variable between W1 and Xt1 on P, they must be adjacent on P; hence there must be a direct causal relation between Xt1 and W1 . [sent-85, score-0.36]

16 Given propositions 2, 3, and 4, we propose a three-step procedure for inference to unit causal graphs from time series data generated by additive non-linear models. [sent-95, score-0.726]

17 The output of this causal inference procedure is a Partial Ancestral Graph (PAG). [sent-96, score-0.377]

18 Roughly speaking, a PAG is a graph consisting of a list of vertices representing observed random variables, and 3 types of end points, −, ◦, and >, which are combined to form the following 4 types of edges representing causal relations between random variables. [sent-97, score-0.394]
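For readers who want to manipulate PAGs programmatically, the three end marks and the edge types can be encoded directly; below is a minimal Python sketch (the class and field names are ours, not the paper's, and the four edge types listed in the comment are the standard PAG edge types).

```python
from enum import Enum
from typing import NamedTuple

class Mark(Enum):
    TAIL = "-"      # definite non-arrowhead
    CIRCLE = "o"    # undetermined end point
    ARROW = ">"     # definite arrowhead

class PagEdge(NamedTuple):
    a: str          # variable at one end
    mark_a: Mark    # end mark at a
    b: str          # variable at the other end
    mark_b: Mark    # end mark at b

# Standard PAG edge types: X o-o Y, X o-> Y, X --> Y, X <-> Y
edge = PagEdge("Xt1", Mark.CIRCLE, "Xt2", Mark.ARROW)   # Xt1 o-> Xt2
```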

19 Below is a constraint based additive non-linear time series causal inference procedure for nonlinear time series with latent common causes. [sent-107, score-0.959]

20 The conditional independence information required by the procedure can be obtained using additive model regression based conditional independence tests mentioned in the previous section. [sent-108, score-0.471]
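As an illustration of such a test, the sketch below regresses each variable on the conditioning set with an additive model, using the pygam package (our choice of library, not the paper's), and then checks the regression residuals for correlation; this residual-correlation check is a simplified stand-in for the paper's actual test statistic.

```python
import numpy as np
from functools import reduce
from scipy.stats import pearsonr
from pygam import LinearGAM, s

def am_residuals(y, Z):
    """Residuals of an additive model regression of y on the columns of Z."""
    if Z.shape[1] == 0:                  # empty conditioning set
        return y - y.mean()
    terms = reduce(lambda t, i: t + s(i), range(1, Z.shape[1]), s(0))
    gam = LinearGAM(terms).fit(Z, y)     # one smooth term per conditioning variable
    return y - gam.predict(Z)

def am_ci_test(x1, x2, Z, alpha=0.05):
    """Crude test of X1 _||_ X2 | Z via correlation of additive-model residuals.
    Z is an (n, d) array; pass np.empty((n, 0)) for an empty conditioning set."""
    r1, r2 = am_residuals(x1, Z), am_residuals(x2, Z)
    _, pval = pearsonr(r1, r2)
    return pval > alpha                  # True -> independence not rejected
```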

21 Identify contemporary causal relations. (a) For all choices of Xt1 , Xt2 , and Xtc , determine if Xt1 is independent of Xt2 conditional on Xtc and X l . [sent-114, score-0.473]

22 (b) Treat the above conditional independence relations as if they were conditional independence relations between Xt1 and Xt2 given Xtc . [sent-115, score-0.328]

23 • Feed these conditional independencies to a causal inference algorithm allowing the presence of latent common causes, such as the FCI algorithm. [sent-116, score-0.473]

24 Derive the PAG for the contemporary causal structure among variables in Xt . [sent-117, score-0.368]

25 • For all choices of Xt1 , identify the set of possible contemporaneous direct causes of Xt1 , where Xt2 is a possible contemporaneous direct cause of Xt1 if in πt either Xt2 ◦—◦ Xt1 , or Xt2 ◦→ Xt1 , or Xt2 → Xt1 . [sent-119, score-0.896]

26 Denote by PCDC(Xt1 ) the set of possible contemporaneous direct causes of Xt1 . [sent-120, score-0.444]

27 Denote by PLDC(Xt1 ) the set of possible lagged direct causes of Xt1 . • For all choices of Xt1 , identify the set of permanent lagged predictors of Xt1 , where Xt−i, j is a permanent lagged predictor of Xt1 if for all Xtb ⊆ (Xt \ {Xt1 }), Xt−i, j and Xt1 are dependent given Xtb and X e . [sent-125, score-0.55]

28 Denote by PLP(Xt1 ) the set of permanent lagged predictors of Xt1 . (c) Add edges representing the lagged causes of each variable in Xt to π f : i. [sent-126, score-0.391]
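As a sketch of the permanent-lagged-predictor check in step (b), the loop below enumerates every conditioning subset Xtb for a generic conditional independence oracle ci_test(x, y, cond) that returns True when independence is not rejected (something like the am_ci_test above, adapted to take variable names); all names here are illustrative. The 2^(k−1) subsets per variable are also where the regression count in the next sentence comes from.

```python
from itertools import chain, combinations

def all_subsets(items):
    """All subsets of items, including the empty set."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def permanent_lagged_predictors(target, lagged_candidates, contemp_others, ci_test):
    """Return the lagged variables that stay dependent on `target` given every
    conditioning subset Xtb of the other contemporaneous variables (the lagged
    background set X^e is assumed to be folded into ci_test)."""
    plp = []
    for lagged in lagged_candidates:
        if all(not ci_test(target, lagged, set(cond))   # dependent under every Xtb
               for cond in all_subsets(contemp_others)):
            plp.append(lagged)
    return plp
```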

29 The complexity of the above procedure is primarily determined by step 1(a), where k · 2^(k−1) additive model regressions are performed to test the conditional independence relations required by the later steps. [sent-137, score-0.387]

30 Simulation Study In this section, we conduct a simple simulation study to evaluate the performance of the additive non-linear causal inference algorithm presented in Section 3. [sent-141, score-0.536]

31 In particular, we would like to see if the additive non-linear algorithm can provide a viable solution to the problem of nonlinear time series causal inference. [sent-142, score-0.679]

32 For comparison, we also apply a causal inference procedure designed for linear time series to the simulated data. [sent-143, score-0.507]

33 2002, Demiralp and Hoover 2003, Moneta 2003 and Hoover 2005 discussed efficient ways of identifying the contemporaneous causal pattern, that is, the Markov equivalence classes (MEC) of the causal graphs for contemporaneous variables assuming causal sufficiency. [sent-146, score-1.642]

34 (2004) provides a less efficient algorithm for linear time series that treats a k-dimensional lag p structural vector autoregressive model (SVAR(p)) as a linear causal model with k(p + 1) variables. [sent-149, score-0.934]

35 The linear procedure differs from the additive non-linear algorithm only in step 1: unlike the original algorithm, which uses additive regression to test conditional independence, the linear procedure uses linear regression instead. [sent-150, score-0.497]

36 The simulated data are generated from the four causal structures shown in Figure 2. [sent-154, score-0.304]

37 The chain-like contemporaneous causal structure is chosen to evaluate the ability of our algorithm to identify the direction of those contemporaneous causal relations that could not be detected using previous algorithms (Bessler et al. [sent-156, score-1.392]

38 • Polynomial lag models: Each contemporaneous variable is a linear combination of other contemporaneous variables and univariate polynomial functions of lagged variables. [sent-161, score-1.341]

39 • Linear lag models: Each contemporaneous variable is a linear combination of other contemporaneous variables and lagged variables. [sent-166, score-1.341]

40 • Trigonometric contemporaneous models: Each contemporaneous variable is a linear combination of univariate trigonometric functions of other contemporaneous variables and lagged variables. [sent-171, score-1.289]
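A minimal numpy sketch of a generator from the last of these families, the trigonometric contemporaneous models, may make the setup concrete (all coefficients are arbitrary choices of ours); the sine link on the contemporaneous variable is exactly what violates assumption C1, as noted next.

```python
import numpy as np

def simulate_trig_contemp(n=1000, burn=200, seed=0):
    """X2,t is a linear combination of a trig function of X1,t and a lagged term."""
    rng = np.random.default_rng(seed)
    x1 = np.zeros(n + burn)
    x2 = np.zeros(n + burn)
    for t in range(1, n + burn):
        x1[t] = 0.4 * x1[t - 1] + rng.normal()
        # nonlinear (sine) contemporaneous effect of X1,t on X2,t
        x2[t] = 0.8 * np.sin(x1[t]) + 0.3 * x2[t - 1] + rng.normal()
    return x1[burn:], x2[burn:]          # drop burn-in so the series is near-stationary
```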

41 Note that these models do not belong to the family of additive non-linear time series models, for they violate the assumption C1. [sent-173, score-0.321]

42 In total we have 16 data generating models, with 12 of them being additive non-linear time series models (including 4 linear time series models). [sent-174, score-0.451]

43 The upper bound Tmax of the true lag number T is set to 3 for all simulations (T is equal to 2 for the 12 data generating models based on causal structures (A), (B), and (C) in Figure 2, and 1 for the other 4 models based on causal structure (D)). [sent-177, score-0.534]

44 The edge omission error rate is defined as Eo = (Number of edge omission errors) / (Number of edges in the true PAG). [sent-191, score-0.474]

45 The edge commission error rate is defined as Ec = (Number of edge commission errors) / (…). [sent-192, score-0.49]
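Treating each PAG skeleton as a set of unordered variable pairs, both rates reduce to set arithmetic; in the sketch below the commission denominator, which is cut off in the extracted text, is assumed to be the number of pairs with no edge in the true PAG.

```python
from itertools import combinations

def edge_error_rates(vertices, true_edges, learned_edges):
    """Edge omission/commission rates; edges are unordered pairs,
    e.g. frozenset({"X1", "X2"})."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    all_pairs = {frozenset(p) for p in combinations(vertices, 2)}
    eo = len(true_edges - learned_edges) / len(true_edges)
    # The denominator below is our assumption, not taken from the paper.
    ec = len(learned_edges - true_edges) / len(all_pairs - true_edges)
    return eo, ec
```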

46 The solid lines in each pane of Figure 3 represent the average omission error rates for different time series lengths; the dotted lines represent the average commission error rates. [sent-194, score-0.523]

47 The pane labeled “Trig Contemp” gives the results for data generated from the trigonometric contemporaneous models. [sent-216, score-0.457]

48 We choose these models in the simulation study precisely because they lie outside the family of additive non-linear time series models, for they violate the functional assumption (C1) in the definition of additive non-linear time series models. [sent-217, score-0.642]

49 However, as the length of the time series increases, the average number of extra edges also increases, apparently because the data generating models are not additive non-linear time series models. [sent-219, score-0.487]

50 The panes labeled “Trig Lag” and “Poly Lag” show the results for trigonometric lag models and polynomial lag models, both of which are genuine additive non-linear time series models. [sent-221, score-1.32]

51 The additive non-linear algorithm performs very well for the trigonometric lag models, but less satisfactorily for the polynomial lag models. [sent-222, score-1.19]

52 Its performance for the polynomial lag models, however, does improve as the length of the time series increases. [sent-223, score-0.602]

53 The pane labeled “Lin Lag” provides the results for the linear lag models. [sent-225, score-0.509]

54 Given that a linear lag model is simply a linear time series model, which is a special case of an additive non-linear time series model, we expect both algorithms to perform very well, as they do. [sent-226, score-0.923]

55 This, on the one hand, suggests that the linear procedure is a good choice for linear time series causal inference; on the other hand, it implies that the additive non-linear algorithm does not suffer from overfitting. [sent-227, score-0.657]

56 We also compare the average error rates for orientation of the edges among contemporaneous variables by the additive non-linear algorithm and the linear procedure. [sent-228, score-0.592]

57 Suppose Xt and Yt are adjacent in both the learned PAG and the true PAG. An arrowhead omission error occurs if the edge is oriented as Xt ∗→ Yt in the true PAG, but as Xt ∗—Yt or Xt ∗—◦ Yt in the learned PAG. [sent-229, score-0.368]

58 Similarly, an arrowhead commission error occurs if the edge is oriented as Xt —∗ Yt or Xt ◦—∗ Yt in the true PAG, but as Xt ←∗ Yt in the learned PAG. [sent-230, score-0.331]

59 Let E be the set of edges among contemporaneous variables in the true PAG such that the pairs of variables connected by these edges are also adjacent in the learned PAG. [sent-231, score-0.464]

60 The arrowhead omission error rate is defined as Ao = (Number of arrowhead omission errors) / (∑e∈E Number of arrowheads in e). [sent-232, score-0.574]

61 The arrowhead commission error rate is defined as Ac = (Number of arrowhead commission errors) / (∑e∈E Number of non-arrowheads in e). [sent-233, score-0.554]

62 In Figure 4, the solid lines in each pane represent the average arrowhead omission error rates; the dotted lines represent the average arrowhead commission error rates. [sent-234, score-0.601]

63 (Note that in the top two panels, labeled “Trig Contemp” and “Trig Lag” respectively, the lines representing omission error and commission error for the additive non-linear algorithm overlap. [sent-236, score-0.547]

64 In the bottom two panels, labeled “Poly Lag” and “Lin Lag” respectively, the lines representing commission error for the additive non-linear algorithm and the linear algorithm overlap.) [sent-237, score-0.364]

65 There are two more scores measuring how close a learned PAG is to the true PAG: the tail omission error rate and the tail commission error rate. [sent-238, score-0.434]

66 Suppose Xt and Yt are adjacent in both the learned PAG and the true PAG. A tail omission error occurs if the edge is oriented as Xt → Yt in the true PAG, but as Xt ◦—∗ Yt in the learned PAG. [sent-239, score-0.303]

67 The tail omission error rate is defined as To = (Number of tail omission errors) / (∑e∈E Number of tails in e). [sent-243, score-0.444]
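Given the edges that are adjacent in both PAGs together with their end marks, the arrowhead rates follow from one pass over the end points; a sketch with marks encoded as '-', 'o', and '>' (our encoding):

```python
def arrowhead_error_rates(common_edges):
    """common_edges: pairs ((true_mark_x, true_mark_y), (learned_mark_x, learned_mark_y))
    for edges adjacent in both PAGs; each mark is '-', 'o', or '>'."""
    omissions = commissions = arrowheads = non_arrowheads = 0
    for true_marks, learned_marks in common_edges:
        for t_mark, l_mark in zip(true_marks, learned_marks):
            if t_mark == '>':
                arrowheads += 1
                if l_mark != '>':        # arrowhead omitted in the learned PAG
                    omissions += 1
            else:
                non_arrowheads += 1
                if l_mark == '>':        # arrowhead wrongly added
                    commissions += 1
    return omissions / arrowheads, commissions / non_arrowheads
```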

68 For example, for the data sets generated from additive non-linear models, that is, the trigonometric lag models, the polynomial lag models, and linear lag models, the additive non-linear algorithm makes no arrowhead commission errors. [sent-269, score-2.13]

69 The linear procedure performs quite well for polynomial lag models and linear lag models. [sent-270, score-0.976]

70 Our suggestion is that, for longer time series, one should always choose the additive non-linear algorithm. [sent-298, score-0.321]

71 Case Study: Ocean Climate Indices To illustrate the application of the additive non-linear causal inference algorithm for nonlinear time series, we use it to study the causal relations among some ocean climate indices. [sent-301, score-1.209]

72 ADF tests for all 4 time series reject the null hypothesis that the tested series has a unit root against the alternative that the series is stationary, with p values of the tests smaller than 0. [sent-311, score-0.38]

73 For all 4 time series, KPSS tests with lag truncation parameter set to 12 fail to reject the null hypothesis that the tested series is (trend) stationary against the unit root alternative, with p values of the tests higher than 0. [sent-314, score-0.695]
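Both stationarity checks are available in statsmodels; a minimal sketch (the lag truncation of 12 matches the text, everything else is generic):

```python
from statsmodels.tsa.stattools import adfuller, kpss

def stationarity_checks(series, kpss_lags=12):
    """ADF null: unit root; KPSS null: (trend) stationarity."""
    adf_stat, adf_pval, *_ = adfuller(series)
    kpss_stat, kpss_pval, *_ = kpss(series, regression="ct", nlags=kpss_lags)
    # Small ADF p-value and large KPSS p-value both favor stationarity.
    return adf_pval, kpss_pval
```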

74 The idea is that, if a time series satisfies the strong mixing condition, its autocorrelation should decrease rapidly as the lag increases. [sent-317, score-0.602]

75 From the plot, the autocorrelations of SOI do not decrease as quickly as those of the other indices, but they become insignificant when the lag is above 12 months. [sent-318, score-0.472]
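This check is easy to reproduce with the sample autocorrelation function and the usual approximate 95% pointwise band under a white-noise null; a sketch:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

def insignificant_acf_lags(series, max_lag=24):
    """Lags at which the sample autocorrelation falls inside an
    approximate 95% pointwise band under the white-noise null."""
    r = acf(series, nlags=max_lag, fft=True)
    band = 1.96 / np.sqrt(len(series))
    return [lag for lag in range(1, max_lag + 1) if abs(r[lag]) < band]
```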

76 We assume that the 4 indices are generated from a lag 12 additive non-linear model. [sent-319, score-0.697]

77 We first remove any linear trend from the data and then, following the causal inference procedure presented in Section 3, derive a causal structure, represented by a PAG, for the 4 ocean climate indices. [sent-324, score-0.909]
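The detrending step is a one-liner with scipy; the search call in the usage comment is a hypothetical driver standing in for the Section 3 procedure, not an actual function from the paper.

```python
import numpy as np
from scipy.signal import detrend

def prepare_indices(index_matrix):
    """Remove a linear trend from each index series (the columns) before the search."""
    return detrend(np.asarray(index_matrix, dtype=float), axis=0, type="linear")

# detrended = prepare_indices(np.column_stack([nao, wp, ao, soi]))
# pag = additive_nonlinear_search(detrended, max_lag=12)   # hypothetical driver
```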

78 Because of the relatively short length of the ocean indices data (10 observations per variable for a lag 12 model), it is worth conducting another inference on the 4 ocean indices using the linear procedure. [sent-326, score-0.837]

79 The resulting causal structure is given in Figure 8. [sent-327, score-0.304]

80 This is not surprising given the complexity of the processes represented by the ocean climate indices, and it illustrates the need for causal inference procedures that can be applied to data generated from nonlinear models. [sent-364, score-0.656]

81 Discussion Methods of causal inference, first developed in the machine learning literature, have been successfully applied to many diverse fields, including biology, medicine, and sociology (Pearl, 2000; Spirtes et al. [sent-366, score-0.304]

82 Figure 7: Causal connections among 4 ocean climate indices, using the additive non-linear algorithm. Figure 8: Causal connections among 4 ocean climate indices, using the linear procedure. This study extends the application of causal inference to nonlinear time series data. [sent-369, score-1.208]

83 We present a new procedure that combines semi-automated model search for causal structure with additive model regression. [sent-370, score-0.527]

84 The particular application is to ocean climate indices, but the component procedures have been individually applied to econometric data with some success, suggesting that the criteria for successful application of the joint procedures are statistical and causal rather than domain specific. [sent-380, score-0.618]

85 Our approach is modular, and its two main components, that is, conditional independence testing and causal model search, could be replaced by other comparable methods. [sent-381, score-0.414]

86 1 Nonstationary Nonlinear Time Series In most of this paper we assume that the nonlinear time series are stationary, only because it has been shown that for stationary nonlinear time series data satisfying certain conditions, nonparametric regression is asymptotically consistent. [sent-384, score-0.466]

87 However, to apply our algorithm to nonstationary nonlinear time series data, we must find an efficient regression method to estimate the conditional expectations and conduct conditional independence tests. [sent-386, score-0.377]

88 2 Feedback Models The original definition of additive non-linear models in Section 2 does not allow any feedback among contemporaneous variables (see condition C4). [sent-392, score-0.599]

89 The resulting definition yields an additive non-linear feedback model, which, compared to the additive non-linear model, allows feedback but not latent common causes. [sent-395, score-0.502]

90 Assuming there is no latent common cause, Xt−i, j and Xt1 are dependent conditional on Xtb and X e if and only if Xt−i, j is either a direct cause of Xt1 or a direct cause of a contemporaneous cause of Xt1 . [sent-397, score-0.725]

91 3 Score Based Search Procedure The causal inference procedure presented in Section 3 is constraint based. [sent-400, score-0.377]

92 As we mentioned in Section 3, the main advantage of this procedure and its modified version is that they can handle the presence of latent common causes or feedbacks in the contemporaneous causal structure. [sent-402, score-0.865]

93 2006 provides a maximum likelihood estimation algorithm that allows the computation of BIC scores for certain types of linear models with correlated error terms, though not for the contemporaneous causal structure of an additive non-linear model. [sent-404, score-0.86]

94 If we are willing to exclude feedbacks and latent common causes, a simple two-step score based procedure can be used to infer causal information from data generated by additive non-linear models. [sent-405, score-0.641]

95 The significant predictors of Xti in that best submodel are direct causes of Xti in causal model M. [sent-409, score-0.383]

96 The causal model with the best (lowest) BIC score is then returned as the result of the score based causal inference algorithm. [sent-411, score-0.376]
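A minimal sketch of this two-step score-based variant, with ordinary least squares standing in for additive model regression (statsmodels reports BIC directly); the subset search is exhaustive, so it is only practical for small numbers of candidate predictors.

```python
import numpy as np
import statsmodels.api as sm
from itertools import chain, combinations

def best_bic_predictors(y, X, names):
    """Exhaustive search over predictor subsets for the lowest-BIC regression of y."""
    best_cols, best_bic = (), np.inf
    col_subsets = chain.from_iterable(combinations(range(X.shape[1]), r)
                                      for r in range(X.shape[1] + 1))
    for cols in col_subsets:
        design = sm.add_constant(X[:, list(cols)]) if cols else np.ones((len(y), 1))
        bic = sm.OLS(y, design).fit().bic
        if bic < best_bic:
            best_cols, best_bic = cols, bic
    return [names[c] for c in best_cols], best_bic
```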

97 Searching for the causal structure of a vector autoregression. [sent-482, score-0.304]

98 Automatic inference of the contemporaneous causal order of a system of equations. [sent-521, score-0.71]

99 Estimation of linear, non-Gaussian causal models in the presence of confounding latent variables. [sent-529, score-0.381]

100 Impulse response functions based on a causal approach to residual orthogonalization in vector autoregressions. [sent-624, score-0.304]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lag', 0.472), ('contemporaneous', 0.365), ('causal', 0.304), ('xt', 0.288), ('additive', 0.191), ('omission', 0.183), ('commission', 0.173), ('pag', 0.17), ('lagged', 0.139), ('ocean', 0.128), ('xtd', 0.128), ('arrowhead', 0.104), ('climate', 0.1), ('series', 0.097), ('dditive', 0.091), ('lymour', 0.091), ('trig', 0.091), ('yt', 0.087), ('soi', 0.082), ('tmax', 0.082), ('nference', 0.077), ('eries', 0.077), ('latent', 0.077), ('nao', 0.073), ('wp', 0.073), ('xtb', 0.073), ('contemporary', 0.064), ('ime', 0.064), ('granger', 0.062), ('nonparametric', 0.061), ('ausal', 0.059), ('hu', 0.059), ('independence', 0.059), ('cause', 0.058), ('ao', 0.057), ('spirtes', 0.057), ('trigonometric', 0.055), ('collider', 0.055), ('aot', 0.055), ('hoover', 0.055), ('soit', 0.055), ('edge', 0.054), ('relations', 0.054), ('pags', 0.054), ('nonlinear', 0.054), ('inear', 0.052), ('conditional', 0.051), ('causes', 0.05), ('bic', 0.047), ('contemp', 0.046), ('slp', 0.046), ('xtc', 0.046), ('feedback', 0.043), ('wi', 0.043), ('fci', 0.042), ('inference', 0.041), ('zt', 0.04), ('tail', 0.039), ('poly', 0.039), ('stationary', 0.037), ('acf', 0.037), ('bessler', 0.037), ('cointegration', 0.037), ('demiralp', 0.037), ('ergodicity', 0.037), ('feedbacks', 0.037), ('karlsen', 0.037), ('moneta', 0.037), ('pane', 0.037), ('edges', 0.036), ('indices', 0.034), ('time', 0.033), ('procedure', 0.032), ('sp', 0.032), ('nonstationary', 0.032), ('tj', 0.032), ('casual', 0.031), ('oscillation', 0.031), ('xtm', 0.031), ('pc', 0.03), ('path', 0.03), ('direct', 0.029), ('procedures', 0.029), ('directed', 0.029), ('causality', 0.028), ('econometric', 0.028), ('tests', 0.028), ('autoregressive', 0.028), ('propositions', 0.028), ('gam', 0.027), ('hellinger', 0.027), ('monthly', 0.027), ('pcdc', 0.027), ('permanent', 0.027), ('pldc', 0.027), ('plp', 0.027), ('sea', 0.027), ('stheim', 0.027), ('swanson', 0.027), ('adjacent', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models

Author: Tianjiao Chu, Clark Glymour

Abstract: Pointwise consistent, feasible procedures for estimating contemporaneous linear causal structure from time series data have been developed using multiple conditional independence tests, but no such procedures are available for non-linear systems. We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. Several variants of the procedure are also discussed. Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices

2 0.21207589 20 jmlr-2008-Causal Reasoning with Ancestral Graphs    (Special Topic on Causality)

Author: Jiji Zhang

Abstract: Causal reasoning is primarily concerned with what would happen to a system under external interventions. In particular, we are often interested in predicting the probability distribution of some random variables that would result if some other variables were forced to take certain values. One prominent approach to tackling this problem is based on causal Bayesian networks, using directed acyclic graphs as causal diagrams to relate post-intervention probabilities to pre-intervention probabilities that are estimable from observational data. However, such causal diagrams are seldom fully testable given observational data. In consequence, many causal discovery algorithms based on data-mining can only output an equivalence class of causal diagrams (rather than a single one). This paper is concerned with causal reasoning given an equivalence class of causal diagrams, represented by a (partial) ancestral graph. We present two main results. The first result extends Pearl (1995)’s celebrated do-calculus to the context of ancestral graphs. In the second result, we focus on a key component of Pearl’s calculus—the property of invariance under interventions, and give stronger graphical conditions for this property than those implied by the first result. The second result also improves the earlier, similar results due to Spirtes et al. (1993). Keywords: ancestral graphs, causal Bayesian network, do-calculus, intervention

3 0.1373857 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy    (Special Topic on Causality)

Author: Ilya Shpitser, Judea Pearl

Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification

4 0.12979254 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning    (Special Topic on Causality)

Author: Jean-Philippe Pellet, André Elisseeff

Abstract: We show how a generic feature-selection algorithm returning strongly relevant variables can be turned into a causal structure-learning algorithm. We prove this under the Faithfulness assumption for the data distribution. In a causal graph, the strongly relevant variables for a node X are its parents, children, and children's parents (or spouses), also known as the Markov blanket of X. Identifying the spouses leads to the detection of the V-structure patterns and thus to causal orientations. Repeating the task for all variables yields a valid partially oriented causal graph. We first show an efficient way to identify the spouse links. We then perform several experiments in the continuous domain using the Recursive Feature Elimination feature-selection algorithm with Support Vector Regression and empirically verify the intuition of this direct (but computationally expensive) approach. Within the same framework, we then devise a fast and consistent algorithm, Total Conditioning (TC), and a variant, TCbw , with an explicit backward feature-selection heuristic, for Gaussian data. After running a series of comparative experiments on five artificial networks, we argue that Markov blanket algorithms such as TC/TCbw or Grow-Shrink scale better than the reference PC algorithm and provide higher structural accuracy. Keywords: causal structure learning, feature selection, Markov blanket, partial correlation, statistical test of conditional independence

5 0.097123563 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs    (Special Topic on Causality)

Author: Yang-Bo He, Zhi Geng

Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph. Keywords: active learning, causal networks, directed acyclic graphs, intervention, Markov equivalence class, optimal design, structural learning

6 0.079722166 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs

7 0.064845115 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

8 0.063749455 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations

9 0.06362658 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

10 0.054632463 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

11 0.054628808 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

12 0.050300039 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

13 0.049985617 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models

14 0.049675893 60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis

15 0.049456842 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

16 0.046207808 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure

17 0.037380598 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

18 0.035979375 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

19 0.034004852 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

20 0.029850235 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.189), (1, 0.262), (2, 0.006), (3, 0.032), (4, -0.003), (5, -0.057), (6, -0.036), (7, 0.033), (8, -0.165), (9, -0.149), (10, -0.051), (11, -0.021), (12, 0.064), (13, -0.473), (14, 0.128), (15, 0.005), (16, -0.037), (17, -0.013), (18, -0.053), (19, 0.023), (20, -0.079), (21, -0.019), (22, -0.003), (23, -0.007), (24, 0.056), (25, 0.011), (26, 0.1), (27, 0.042), (28, -0.058), (29, 0.066), (30, 0.124), (31, -0.072), (32, 0.018), (33, 0.023), (34, 0.017), (35, -0.052), (36, 0.06), (37, -0.005), (38, 0.075), (39, -0.014), (40, -0.037), (41, -0.096), (42, -0.026), (43, -0.068), (44, 0.04), (45, 0.099), (46, -0.017), (47, 0.003), (48, -0.079), (49, 0.135)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96623427 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models

Author: Tianjiao Chu, Clark Glymour

Abstract: Pointwise consistent, feasible procedures for estimating contemporaneous linear causal structure from time series data have been developed using multiple conditional independence tests, but no such procedures are available for non-linear systems. We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. Several variants of the procedure are also discussed. Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices

2 0.67729467 20 jmlr-2008-Causal Reasoning with Ancestral Graphs    (Special Topic on Causality)

Author: Jiji Zhang

Abstract: Causal reasoning is primarily concerned with what would happen to a system under external interventions. In particular, we are often interested in predicting the probability distribution of some random variables that would result if some other variables were forced to take certain values. One prominent approach to tackling this problem is based on causal Bayesian networks, using directed acyclic graphs as causal diagrams to relate post-intervention probabilities to pre-intervention probabilities that are estimable from observational data. However, such causal diagrams are seldom fully testable given observational data. In consequence, many causal discovery algorithms based on data-mining can only output an equivalence class of causal diagrams (rather than a single one). This paper is concerned with causal reasoning given an equivalence class of causal diagrams, represented by a (partial) ancestral graph. We present two main results. The first result extends Pearl (1995)’s celebrated do-calculus to the context of ancestral graphs. In the second result, we focus on a key component of Pearl’s calculus—the property of invariance under interventions, and give stronger graphical conditions for this property than those implied by the first result. The second result also improves the earlier, similar results due to Spirtes et al. (1993). Keywords: ancestral graphs, causal Bayesian network, do-calculus, intervention

3 0.56691074 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy    (Special Topic on Causality)

Author: Ilya Shpitser, Judea Pearl

Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification

4 0.49432376 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning    (Special Topic on Causality)

Author: Jean-Philippe Pellet, André Elisseeff

Abstract: We show how a generic feature-selection algorithm returning strongly relevant variables can be turned into a causal structure-learning algorithm. We prove this under the Faithfulness assumption for the data distribution. In a causal graph, the strongly relevant variables for a node X are its parents, children, and children's parents (or spouses), also known as the Markov blanket of X. Identifying the spouses leads to the detection of the V-structure patterns and thus to causal orientations. Repeating the task for all variables yields a valid partially oriented causal graph. We first show an efficient way to identify the spouse links. We then perform several experiments in the continuous domain using the Recursive Feature Elimination feature-selection algorithm with Support Vector Regression and empirically verify the intuition of this direct (but computationally expensive) approach. Within the same framework, we then devise a fast and consistent algorithm, Total Conditioning (TC), and a variant, TCbw , with an explicit backward feature-selection heuristic, for Gaussian data. After running a series of comparative experiments on five artificial networks, we argue that Markov blanket algorithms such as TC/TCbw or Grow-Shrink scale better than the reference PC algorithm and provide higher structural accuracy. Keywords: causal structure learning, feature selection, Markov blanket, partial correlation, statistical test of conditional independence

5 0.37068731 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

Author: Vojtěch Franc, Bogdan Savchynskyy

Abstract: The max-sum classifier predicts an n-tuple of labels from an n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Markov Random Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such a classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to three other classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-sum classifier, hidden Markov networks, support vector machines

6 0.36554632 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations

7 0.28429234 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs    (Special Topic on Causality)

8 0.26110578 60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis

9 0.25502375 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

10 0.2208053 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression

11 0.20333889 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models

12 0.19615337 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

13 0.18865506 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

14 0.15554804 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

15 0.15475713 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs

16 0.15163384 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

17 0.13737594 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

18 0.13659427 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure

19 0.12165803 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

20 0.11912626 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.023), (5, 0.018), (29, 0.455), (30, 0.019), (31, 0.021), (40, 0.018), (54, 0.034), (58, 0.048), (61, 0.024), (66, 0.038), (72, 0.013), (76, 0.038), (87, 0.013), (88, 0.065), (92, 0.027), (94, 0.039), (99, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73214716 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models

Author: Tianjiao Chu, Clark Glymour

Abstract: Pointwise consistent, feasible procedures for estimating contemporaneous linear causal structure from time series data have been developed using multiple conditional independence tests, but no such procedures are available for non-linear systems. We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. Several variants of the procedure are also discussed. Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices

2 0.23245001 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett

Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally.

3 0.2317929 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

4 0.22943816 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks

Author: Gal Elidan, Stephen Gould

Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth

5 0.22861259 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

Author: Zongming Ma, Xianchao Xie, Zhi Geng

Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning

6 0.2264498 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning    (Special Topic on Causality)

7 0.22580013 83 jmlr-2008-Robust Submodular Observation Selection

8 0.22196379 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs

9 0.22142714 26 jmlr-2008-Consistency of Trace Norm Minimization

10 0.22091845 57 jmlr-2008-Manifold Learning: The Price of Normalization

11 0.2208889 56 jmlr-2008-Magic Moments for Structured Output Prediction

12 0.22046505 9 jmlr-2008-Active Learning by Spherical Subdivision

13 0.22030157 86 jmlr-2008-SimpleMKL

14 0.22005863 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

15 0.21871534 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

16 0.21867454 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

17 0.21672745 58 jmlr-2008-Max-margin Classification of Data with Absent Features

18 0.21614634 20 jmlr-2008-Causal Reasoning with Ancestral Graphs    (Special Topic on Causality)

19 0.21608564 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

20 0.21604241 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections