nips nips2013 nips2013-134 knowledge-graph by maker-knowledge-mining

134 nips-2013-Graphical Models for Inference with Missing Data


Source: pdf

Author: Karthika Mohan, Judea Pearl, Jin Tian

Abstract: We address the problem of recoverability, i.e., deciding whether there exists a consistent estimator of a given relation Q, when data are missing not at random. We employ a formal representation called ‘Missingness Graphs’ to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we derive conditions that the graph should satisfy to ensure recoverability and devise algorithms to detect the presence of these conditions in the graph. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We address the problem of recoverability, i.e. [sent-11, score-0.268]

2 deciding whether there exists a consistent estimator of a given relation Q, when data are missing not at random. [sent-13, score-0.288]

3 We employ a formal representation called ‘Missingness Graphs’ to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. [sent-14, score-0.479]

4 Using this representation, we derive conditions that the graph should satisfy to ensure recoverability and devise algorithms to detect the presence of these conditions in the graph. [sent-15, score-0.33]

5 1 Introduction: The “missing data” problem arises when values for one or more variables are missing from recorded observations. [sent-16, score-0.265]

6 The extent of the problem is evidenced by the vast literature on missing data in such diverse fields as social science, epidemiology, statistics, biology and computer science. [sent-17, score-0.265]

7 Other areas such as data mining [7], knowledge discovery [18] and network tomography [2] are also plagued by missing data problems. [sent-22, score-0.285]

8 Secondly, performing the analysis using only complete cases and ignoring the cases with missing values can reduce the sample size, thereby substantially reducing estimation efficiency. [sent-27, score-0.23]

9 1.1 Existing Methods for Handling Missing Data: There are several methods for handling missing data, described in a rich literature of books, articles and software packages, which are briefly summarized here. [sent-31, score-0.23]

10 Of these, listwise deletion and pairwise deletion are used in approximately 96% of studies in the social and behavioral sciences [24]. [sent-32, score-0.344]

11 Listwise deletion refers to a simple method in which cases with missing values are deleted [3]. [sent-33, score-0.391]

12 Unless data are missing completely at random, listwise deletion can bias the outcome [31]. [sent-34, score-0.493]

13 Pairwise deletion (or “available case”) is a deletion method used for estimating pairwise relations among variables. [sent-36, score-0.265]

14 For example, to compute the covariance of variables X and Y , all those cases or observations in which both X and Y are observed are used, regardless of whether other variables in the dataset have missing values. [sent-37, score-0.319]
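To make the listwise/pairwise contrast concrete, here is a minimal sketch (Python with pandas); the dataset, column names and missingness rate are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy dataset: X and Y fully observed, Z missing in ~30% of records.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["X", "Y", "Z"])
df.loc[rng.random(500) < 0.3, "Z"] = np.nan

# Listwise deletion: drop every record containing any missing value.
listwise = df.dropna()
cov_listwise = listwise[["X", "Y"]].cov().loc["X", "Y"]

# Pairwise deletion: to estimate cov(X, Y), keep every record where both
# X and Y are observed, regardless of whether Z is missing.
pairwise = df[["X", "Y"]].dropna()
cov_pairwise = pairwise.cov().loc["X", "Y"]

print(len(listwise), len(pairwise))  # pairwise retains more cases
print(cov_listwise, cov_pairwise)
```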

15 It has been proven that likelihood-based inference, while ignoring the missing data mechanism, leads to unbiased estimates under the assumption of missing at random (MAR) [13]. [sent-39, score-0.479]

16 Exceptions are recent works on collaborative filtering and recommender systems which develop probabilistic models that explicitly incorporate the missing data mechanism [16, 14, 15]. [sent-41, score-0.314]

17 ML is often used in conjunction with imputation methods, which, in layman’s terms, substitute a reasonable guess for each missing value [1]. [sent-42, score-0.292]

18 A simple example is Mean Substitution, in which all missing observations of variable X are substituted with the mean of all observed values of X. [sent-43, score-0.249]
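A minimal sketch of mean substitution (the series values are invented):

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# Mean substitution: every missing observation of X is replaced with the
# mean of the observed values of X (here (2 + 4 + 5 + 7) / 4 = 4.5).
x_imputed = x.fillna(x.mean())
print(x_imputed.tolist())  # [2.0, 4.5, 4.0, 5.0, 4.5, 7.0]
```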

19 Hot-deck imputation, cold-deck imputation [17] and Multiple Imputation [26, 27] are examples of popular imputation procedures. [sent-44, score-0.124]

20 Missing data discussed so far is a special case of coarse data, namely data that contains observations made in the power set rather than the sample space of variables of interest [12]. [sent-46, score-0.073]

21 The notion of coarsening at random (CAR) was introduced in [12] and identifies the condition under which the coarsening mechanism can be ignored while drawing inferences on the distribution of variables of interest [10]. [sent-47, score-0.206]

22 For a gentle introduction to the missing data problem and the issue of testability refer to [22, 19]. [sent-51, score-0.306]

23 This paper aims to illuminate missing data problems using causal graphs [See Appendix 5. [sent-52, score-0.337]

24 The questions we pose are: Given a target relation Q to be estimated and a set of assumptions about the missingness process encoded in a graphical model, under what conditions does a consistent estimate exist and how can we elicit it from the data available? [sent-54, score-0.417]

25 Furthermore, we review the traditional taxonomy of missing data problems and cast it in graphical terms. [sent-56, score-0.312]

26 In Section 3 we define the notion of recoverability - the existence of a consistent estimate - and present graphical conditions for detecting recoverability of a given probabilistic query Q. [sent-57, score-0.603]

27 2.1 Missingness Graphs. [Figure 1: m-graphs for data that are (a) MCAR, (b) MAR, (c) & (d) MNAR; hollow and solid circles denote partially and fully observed variables respectively.] [sent-60, score-0.11]

28 Graphical models such as DAGs (Directed Acyclic Graphs) can be used for encoding as well as portraying conditional independencies and causal relations, and the graphical criterion called d-separation (refer Appendix 5. [sent-61, score-0.152]

29 Graphical Models have been used to analyze missing information in the form of missing cases (due to sample selection bias)[4]. [sent-63, score-0.46]

30 Using causal graphs, [8] analyzes missingness due to attrition (partially observed outcome) and [29] cautions against the indiscriminate use of auxiliary variables. [sent-64, score-0.403]

31 In both papers missing values are associated with one variable and interactions among several missingness mechanisms remain unexplored. [sent-65, score-0.552]

32 The need exists for a general approach capable of modeling an arbitrary data-generating process and deciding whether (and how) missingness can be outmaneuvered in every dataset generated by that process. [sent-66, score-0.292]

33 Such a general approach should allow each variable to be governed by its own missingness mechanism, and each mechanism to be triggered by other (potentially) partially observed variables in the model. [sent-67, score-0.423]

34 Let G(V, E) be the causal DAG where V = V ∪ U ∪ V ∗ ∪ R. [sent-69, score-0.068]

35 Nodes in the graph correspond to variables in the data set. [sent-71, score-0.078]

36 V is partitioned into Vo and Vm such that Vo ⊆ V is the set of variables that are observed in all records in the population and Vm ⊆ V is the set of variables that are missing in at least one record. [sent-75, score-0.319]

37 Variable X is termed fully observed if X ∈ Vo and partially observed if X ∈ Vm . [sent-76, score-0.101]

38 V ∗ is a set of all proxy variables and R is the set of all causal mechanisms that are responsible for missingness. [sent-78, score-0.157]

39 R variables may not be parents of variables in V ∪ U . [sent-79, score-0.104]

40 This graphical representation succinctly depicts both the causal relationships among variables in V and the process that accounts for missingness in some of the variables. [sent-80, score-0.428]
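As a sketch of how an m-graph could be encoded programmatically, the snippet below builds the MAR model of Figure 1(b) with networkx; the node-naming convention (R_Y for the mechanism, Y* for the proxy) is ours, not prescribed by the paper:

```python
import networkx as nx

# m-graph of Figure 1(b): X fully observed, Y partially observed,
# and the missingness of Y driven by X (X -> R_Y), i.e., MAR.
G = nx.DiGraph()
G.add_edges_from([
    ("X", "Y"),     # causal relation among substantive variables
    ("X", "R_Y"),   # missingness mechanism of Y depends on X
    ("Y", "Y*"),    # proxy Y* equals Y when R_Y = 0 ...
    ("R_Y", "Y*"),  # ... and is masked when R_Y = 1
])

V_o, V_m = {"X"}, {"Y"}          # fully / partially observed variables
R = {n for n in G if n.startswith("R_")}
proxies = {n for n in G if n.endswith("*")}

# Constraint from the definition: R variables may not be parents of
# variables in V or U (they may only point at proxies or other R's).
assert all(set(G.successors(r)) <= proxies | R for r in R)
```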

41 Missing At Random (MAR) : Data are MAR if for all data cases Y , P (R|Yobs , Ymis ) = P (R|Yobs ) where Yobs denotes the observed component of Y and Ymis , the missing component. [sent-85, score-0.268]

42 Since it is stated in terms of realized values (i.e., Yobs and Ymis ), many authors find Rubin’s definition difficult to apply in practice and prefer to work with definitions expressed in terms of independencies among variables (see [28, 11, 6, 17]). [sent-92, score-0.086]

43 Finally, if neither of these conditions holds, data are termed MNAR, as depicted in Figure 1(c) and (d). [sent-99, score-0.093]

44 Additionally, the conditional independencies that define each class are represented explicitly as separation conditions in the corresponding m-graphs. [sent-101, score-0.07]

45 Given an m-graph G, and a target relation Q defined on the variables in V , Q is said to be recoverable in G if there exists an algorithm that produces a consistent estimate of Q for every dataset D such that P (D) is (1) compatible with G and (2) strictly positive over complete cases, i.e. [sent-106, score-0.278]

46 Here we assume that the observed distribution over complete cases P (Vo , Vm , R = 0) is strictly positive, thereby rendering recoverability a property that can be ascertained exclusively from the m-graph. [sent-109, score-0.287]

47 A relation Q is recoverable in G if and only if Q can be expressed in terms of the probability P (O) where O = {R, V ∗ , Vo } is the set of observable variables in G. [sent-111, score-0.295]

48 Practically, what recoverability means is that if the data D are generated by any process compatible with G, a procedure exists that computes an estimator Q̂(D) such that, in the limit of large samples, Q̂(D) converges to Q. [sent-114, score-0.287]

49 Thus, recoverability is solely a property of G and Q, not of the data available, or of any routine chosen to analyze or process the data. [sent-116, score-0.287]

50 Let X be the treatment and Y be the outcome as depicted in the m-graph in Fig. [sent-121, score-0.092]

51 Let it be the case that we accidentally deleted the values of Y for a handful of samples, hence Y ∈ Vm . [sent-123, score-0.082]

52 Let X be the treatment and Y be the outcome as depicted in the m-graph in Fig. [sent-137, score-0.092]

53 Let it be the case that some patients who underwent treatment are not likely to report the outcome, hence the arrow X → Ry . [sent-139, score-0.067]

54 Eq. (2) permits P (X, Y ) to be recovered by listwise deletion, while Eq. [sent-153, score-0.086]
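A small simulation sketching this contrast under the MAR model of Fig. 1(b): naive listwise deletion biases P(Y), while the stratified recovery formula P(y) = Σx P(y | x, RY = 0) P(x) does not. All numeric parameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.binomial(1, 0.5, n)                        # treatment (fully observed)
y = rng.binomial(1, np.where(x == 1, 0.8, 0.3))    # outcome depends on X
# MAR: missingness of Y depends only on the fully observed X (X -> R_Y).
r_y = rng.binomial(1, np.where(x == 1, 0.6, 0.1))  # 1 = Y missing
y_star = np.where(r_y == 0, y, -1)                 # proxy: masked when R_Y = 1

# Listwise deletion: biased, because complete cases over-represent X = 0.
naive = (y_star[r_y == 0] == 1).mean()

# Recovery formula: P(y) = sum_x P(y | x, R_Y = 0) P(x).
recovered = sum(
    (y_star[(x == v) & (r_y == 0)] == 1).mean() * (x == v).mean()
    for v in (0, 1)
)
print(naive, recovered, y.mean())  # naive ~0.45 vs recovered ~ true 0.55
```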

55 In this paper we focus on recoverability under the large-sample assumption and will not deal with the shrinking-sample-size issue. [sent-155, score-0.268]

56 Fig. 1(d) depicts a study where (i) some units who underwent treatment (X = 1) did not report the outcome (Y ) and (ii) we accidentally deleted the values of treatment for a handful of cases. [sent-160, score-0.17]

57 Thus we have missing values for both X and Y which renders the dataset MNAR. [sent-161, score-0.245]

58 In the first step, we delete all cases in which X is missing and create a new data set D from which we estimate P (X). [sent-170, score-0.266]

59 We will call this sequence of deletions a deletion order. [sent-174, score-0.121]
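A sketch of this two-step, deletion-order estimator on synthetic data mimicking Fig. 1(d); each step is licensed by a d-separation in the m-graph (RX exogenous; Y independent of {RX, RY} given X). Parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.binomial(1, 0.5, n)
y = rng.binomial(1, np.where(x == 1, 0.8, 0.3))
r_x = rng.binomial(1, 0.05, n)                     # accidental deletion of X
r_y = rng.binomial(1, np.where(x == 1, 0.6, 0.1))  # treated under-report Y

# Step 1: delete cases with X missing; estimate P(X) from the rest
# (valid because R_X has no parents in the m-graph).
obs_x = r_x == 0
p_x1 = x[obs_x].mean()

# Step 2: estimate P(Y | X) from complete cases
# (valid because Y is d-separated from {R_X, R_Y} given X).
complete = obs_x & (r_y == 0)
p_y1_given_x1 = y[complete & (x == 1)].mean()

# Assemble P(X = 1, Y = 1) = P(Y = 1 | X = 1) P(X = 1).
print(p_y1_given_x1 * p_x1, ((x == 1) & (y == 1)).mean())  # both ~ 0.40
```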

60 Several features are worth noting regarding this graph-based taxonomy of missingness mechanisms. [sent-175, score-0.322]

61 Second, the assumption of MCAR allows an estimation procedure that amounts (asymptotically) to listwise deletion, while MAR dictates a procedure that amounts to listwise deletion in every stratum of Vo . [sent-177, score-0.335]

62 Applying MAR procedure to MCAR problem is safe, because all conditional independencies required for recoverability under the MAR assumption also hold in an MCAR problem, i. [sent-178, score-0.319]

63 Applying listwise deletion is likely to result in bias, because the necessary condition R⊥ o , Vm ) is violated in the graph. [sent-183, score-0.237]

64 An interesting property which evolves ⊥(V from this discussion is that recoverability of certain relations does not require RVi ⊥ i |Vo ; a subset ⊥V of Vo would suffice as shown below. [sent-184, score-0.291]

65 P (Vi ) is recoverable if ∃W ⊆ Vo such that RVi ⊥ |W . [sent-186, score-0.204]

66 w P (Vi∗ |Rvi = 0, W )P (W ) since Vi ⊥ Vi |W ⊥R It is important to note that the recoverability of P (X, Y ) in Fig. [sent-189, score-0.268]

67 1(d) was feasible despite the fact that the missingness model would not be considered Rubin’s MAR (as defined in [25]). [sent-190, score-0.292]

68 A query Q defined over variables in Vo ∪ Vm is recoverable if it is decomposable into terms of the form Qj = P (Sj |Tj ) such that Tj contains the missingness mechanism Rv = 0 of every partially observed variable V that appears in Qj . [sent-198, score-0.627]

69 Proof: If such a decomposition exists, every Qj is estimable from the data, hence the entire expression for Q is recoverable. [sent-199, score-0.069]

70 Given an m-graph G with no edges between the R variables and no latent variables as parents of R variables, a necessary and sufficient condition for recovering the joint distribution P (V ) is that no variable X be a parent of its missingness mechanism RX . [sent-214, score-0.525]

71 Moreover, when recoverable, P (V ) is given by $P(v) = P(R = 0, v) \big/ \prod_i P(R_i = 0 \mid pa^o_{r_i}, pa^m_{r_i}, R_{Pa^m_{r_i}} = 0)$ (6), where $Pa^o_{r_i} \subseteq V_o$ and $Pa^m_{r_i} \subseteq V_m$ are the parents of $R_i$. [sent-215, score-0.079]

72 ($V_i \in Pa^m_{r_i}$), then we have $R_i \perp\!\!\!\perp R_{Pa^m_{r_i}} \mid (Pa^o_{r_i} \cup Pa^m_{r_i})$. [sent-220, score-0.086]

73 Therefore, $P(R_i = 0 \mid pa^o_{r_i}, pa^m_{r_i}) = P(R_i = 0 \mid pa^o_{r_i}, pa^m_{r_i}, R_{Pa^m_{r_i}} = 0)$. (8) [sent-221, score-0.193]

74 Given strictly positive $P(R = 0, V_m, V_o)$, we have that all probabilities $P(R_i = 0 \mid pa^o_{r_i}, pa^m_{r_i}, R_{Pa^m_{r_i}} = 0)$ are strictly positive. [sent-222, score-0.194]

75 Using Equations (7) and (8), we conclude that P (V ) is recoverable as given by Eq. (6). [sent-223, score-0.334]

76 (necessity) If X is a parent of its missingness mechanism RX , then P (X) is not recoverable based on Lemmas 3 and 4 in Appendix 5. [sent-225, score-0.56]
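To see Eq. (6) at work, the sketch below evaluates it on a synthetic version of the Fig. 1(d) model, where RX has no parents and RY has the single parent X; every right-hand-side term is estimated from observed records only. The numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000
x = rng.binomial(1, 0.5, n)
y = rng.binomial(1, np.where(x == 1, 0.8, 0.3))
r_x = rng.binomial(1, 0.05, n)                     # R_X: no parents
r_y = rng.binomial(1, np.where(x == 1, 0.6, 0.1))  # R_Y: parent X

def p_joint(xv, yv):
    """Eq. (6): P(x, y) = P(R = 0, x, y) / (P(R_X = 0) P(R_Y = 0 | x, R_X = 0))."""
    # Numerator: directly estimable, since X and Y are seen on complete cases.
    num = ((r_x == 0) & (r_y == 0) & (x == xv) & (y == yv)).mean()
    # Denominator: one factor per R variable, conditioned on its parents.
    p_rx0 = (r_x == 0).mean()
    sel = (r_x == 0) & (x == xv)          # X is observable when R_X = 0
    p_ry0_given_x = (r_y[sel] == 0).mean()
    return num / (p_rx0 * p_ry0_given_x)

print(p_joint(1, 1), 0.5 * 0.8)                            # both ~ 0.40
print(sum(p_joint(a, b) for a in (0, 1) for b in (0, 1)))  # ~ 1.0
```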

77 The following theorem gives a sufficient condition for recovering the joint distribution in a Markovian model. [sent-228, score-0.087]

78 Given an m-graph with no latent variables (i.e., Markovian), the joint distribution P (V ) is recoverable if no missingness mechanism RX is a descendant of its corresponding variable X. [sent-232, score-0.551]

79 Moreover, if recoverable, then P (V ) is given by $P(v) = \prod_{i:\,V_i \in V_o} P(v_i \mid pa^o_i, pa^m_i, R_{Pa^m_i} = 0) \,\prod_{j:\,V_j \in V_m} P(v_j \mid pa^o_j, pa^m_j, R_{V_j} = 0, R_{Pa^m_j} = 0)$, where $Pa^o_i \subseteq V_o$ and $Pa^m_i \subseteq V_m$ are the parents of $V_i$. [sent-233, score-0.162]

80 An ordered factorization over a set O of ordered V variables Y1 < Y2 < · · · < Yn . [sent-236, score-0.166]

81 A sufficient condition for recoverability of a relation Q is that Q be decomposable into an ordered factorization, or a sum of such factorizations, such that every factor Qi = P (Yi |Xi ) satisfies Yi ⊥⊥ (RYi , RXi ) | Xi . [sent-247, score-0.408]

82 The proof follows from Theorem 1, noting that an ordered factorization is one specific form of decomposition. [sent-251, score-0.083]

83 Theorem 4 will allow us to confirm recoverability of certain queries Q in models such as those in Fig. [sent-252, score-0.268]

84 Firstly, the decomposition is limited to ordered factorizations, i.e. [sent-256, score-0.099]

85 Specifically, (c) can be written as ∑x1 P (X4 | x1 , RX4 = 0) P (x1 ) and can be estimated by the deletion schedule (X1 , X4 ), i.e. [sent-264, score-0.121]

86 An ordered set O will not yield an admissible decomposition if there exists a partially observed variable Vi in the order O which is not marginally independent of RVi such that all minimal separators (refer Appendix-5. [sent-273, score-0.203]

87 [Figure 3: (a) pruning in Example 7; (b) P (X, Y ) is recoverable in Example 5.] Applying Lemma 1 requires a solution to a set of disjunctive constraints which can be represented by directed constraint graphs [5]. [sent-276, score-0.224]

88 The independencies implied by minimal separators (as required by Lemma 1) are: A ⊥⊥ RA | B, B ⊥⊥ RB | ∅, C ⊥⊥ RC | {D, E}, (D ⊥⊥ RD | A or D ⊥⊥ RD | C or D ⊥⊥ RD | B), and (E ⊥⊥ RE | {B, F } or E ⊥⊥ RE | {B, D} or E ⊥⊥ RE | C). [sent-283, score-0.071]

89 The following lemma presents a simple test to determine non-admissibility by specifying the condition under which a given order can be summarily removed from the set of candidate orders that are likely to yield admissible factorizations. [sent-287, score-0.093]

90 An ordered set O will not yield an admissible decomposition if it contains a partially observed variable Vi for which there exists no set S ⊆ V that d-separates Vi from RVi . [sent-289, score-0.183]
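Lemma 2's test is easy to mechanize, since in a DAG two nodes admit some d-separating set exactly when they are non-adjacent, and the local Markov property supplies a concrete witness: the parents of whichever node is not an ancestor of the other. A sketch assuming networkx:

```python
import networkx as nx

def separating_set(G: nx.DiGraph, a, b):
    """Return a set S that d-separates a from b in DAG G, or None if no
    such set exists (which happens exactly when a and b are adjacent)."""
    if G.has_edge(a, b) or G.has_edge(b, a):
        return None
    # Local Markov property: a node is d-separated from any of its
    # non-descendants by its parents.
    if b not in nx.descendants(G, a):
        return set(G.predecessors(a))
    return set(G.predecessors(b))

# Fig. 1(c): the edge Y -> R_Y makes Y and R_Y inseparable, so by
# Lemma 2 no order containing Y yields an admissible decomposition.
G = nx.DiGraph([("X", "Y"), ("Y", "R_Y"), ("Y", "Y*"), ("R_Y", "Y*")])
print(separating_set(G, "Y", "R_Y"))  # None
print(separating_set(G, "X", "R_Y"))  # {'Y'}
```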

91 An interesting consequence of Lemma 2 is the following corollary that gives a sufficient condition under which no ordered factorization can be labeled admissible. [sent-294, score-0.113]

92 For any disjoint sets X and Y , there exists no admissible factorization for recovering the relation P (Y |X) by Theorem 4 if Y contains a partially observed variable Vi for which there exists no set S ⊆ V that d-separates Vi from RVi . [sent-296, score-0.213]

93 We formalized the notion of recoverability and showed that relations are always recoverable when data are missing at random (MCAR or MAR) and, more importantly, that in many commonly occurring problems, recoverability can be achieved even when data are missing not at random (MNAR). [sent-298, score-1.276]

94 We further presented a sufficient condition to ensure recoverability of a given relation Q (Theorem 1) and operationalized Theorem 1 using graphical criteria (Theorems 2, 3 and 4). [sent-299, score-0.37]

95 In summary, we demonstrated some of the insights and capabilities that can be gained by exploiting causal knowledge in missing data problems. [sent-300, score-0.317]

96 Out of sight, not out of mind: strategies for handling missing data. [sent-320, score-0.23]

97 Using causal diagrams to guide analysis in missing data problems. [sent-331, score-0.317]

98 Recoverability and testability of missing data: Introduction and summary of results. [sent-453, score-0.266]

99 Advances in missing data methods and implications for educational research. [sent-469, score-0.277]

100 Selection of auxiliary variables in missing data problems: Not all auxiliary variables are created equal. [sent-503, score-0.319]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ry', 0.474), ('rx', 0.414), ('missingness', 0.292), ('vo', 0.287), ('recoverability', 0.268), ('missing', 0.23), ('recoverable', 0.204), ('vm', 0.187), ('mcar', 0.182), ('rvi', 0.158), ('mar', 0.137), ('vi', 0.121), ('deletion', 0.121), ('mnar', 0.097), ('listwise', 0.086), ('causal', 0.068), ('ri', 0.065), ('pam', 0.064), ('admissible', 0.063), ('imputation', 0.062), ('independencies', 0.051), ('paoi', 0.049), ('ordered', 0.048), ('rz', 0.043), ('coarsening', 0.043), ('yobs', 0.043), ('deleted', 0.04), ('mechanism', 0.04), ('relation', 0.039), ('partially', 0.037), ('outcome', 0.037), ('testability', 0.036), ('ymis', 0.036), ('factorization', 0.035), ('factorizations', 0.035), ('variables', 0.035), ('parents', 0.034), ('graphical', 0.033), ('rubin', 0.033), ('estimable', 0.032), ('respondents', 0.032), ('taxonomy', 0.03), ('condition', 0.03), ('mechanisms', 0.03), ('depicted', 0.029), ('angeles', 0.028), ('educational', 0.028), ('los', 0.027), ('termed', 0.026), ('treatment', 0.026), ('marlin', 0.025), ('recommender', 0.025), ('graph', 0.024), ('attrition', 0.024), ('collider', 0.024), ('guilford', 0.024), ('karthika', 0.024), ('laan', 0.024), ('mgraph', 0.024), ('pao', 0.024), ('pari', 0.024), ('shopper', 0.024), ('stratum', 0.024), ('parent', 0.024), ('responsible', 0.024), ('relations', 0.023), ('yi', 0.023), ('theorem', 0.022), ('aoi', 0.021), ('gentle', 0.021), ('accidentally', 0.021), ('estimand', 0.021), ('gill', 0.021), ('mohan', 0.021), ('qj', 0.021), ('rp', 0.021), ('hence', 0.021), ('graphs', 0.02), ('recovering', 0.02), ('underwent', 0.02), ('judea', 0.02), ('separators', 0.02), ('conditions', 0.019), ('observed', 0.019), ('data', 0.019), ('pearl', 0.018), ('dictates', 0.018), ('ml', 0.017), ('tomography', 0.017), ('mcknight', 0.017), ('delete', 0.017), ('longitudinal', 0.017), ('observable', 0.017), ('decomposition', 0.016), ('social', 0.016), ('renders', 0.015), ('seattle', 0.015), ('questions', 0.015), ('joint', 0.015), ('notion', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 134 nips-2013-Graphical Models for Inference with Missing Data

Author: Karthika Mohan, Judea Pearl, Jin Tian

Abstract: We address the problem of recoverability, i.e., deciding whether there exists a consistent estimator of a given relation Q, when data are missing not at random. We employ a formal representation called ‘Missingness Graphs’ to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we derive conditions that the graph should satisfy to ensure recoverability and devise algorithms to detect the presence of these conditions in the graph. 1

2 0.065255605 337 nips-2013-Transportability from Multiple Environments with Limited Experiments

Author: Elias Bareinboim, Sanghack Lee, Vasant Honavar, Judea Pearl

Abstract: This paper considers the problem of transferring experimental findings learned from multiple heterogeneous domains to a target domain, in which only limited experiments can be performed. We reduce questions of transportability from multiple domains and with limited scope to symbolic derivations in the causal calculus, thus extending the original setting of transportability introduced in [1], which assumes only one domain with full experimental information available. We further provide different graphical and algorithmic conditions for computing the transport formula in this setting, that is, a way of fusing the observational and experimental information scattered throughout different domains to synthesize a consistent estimate of the desired effects in the target domain. We also consider the issue of minimizing the variance of the produced estimand in order to increase power. 1

3 0.059891891 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

Author: Anima Anandkumar, Daniel Hsu, Majid Janzamin, Sham M. Kakade

Abstract: Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of “higher order” expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allows for overcomplete models, and requires the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition. Keywords: Overcomplete representation, admixture models, generic identifiability, tensor decomposition.

4 0.050476771 245 nips-2013-Pass-efficient unsupervised feature selection

Author: Crystal Maung, Haim Schweitzer

Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of a small probability of failure.
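For concreteness, here is a minimal sketch of the classical in-memory pivoted-QR selection that the paper's pass-efficient algorithm is designed to reproduce, using SciPy's factorization. The data matrix and k below are illustrative placeholders; the paper's contribution, doing this in a few passes over large data, is not attempted here.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))      # rows: samples, columns: candidate features

def pivoted_qr_select(X, k):
    """Indices of k columns chosen by the Businger-Golub pivoted QR."""
    # pivoting=True greedily picks the column with the largest residual
    # norm at each step; the returned permutation records that order.
    _, _, piv = qr(X, mode='economic', pivoting=True)
    return piv[:k]

print(sorted(pivoted_qr_select(X, k=10)))
```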

5 0.037777252 252 nips-2013-Predictive PAC Learning and Process Decompositions

Author: Cosma Shalizi, Aryeh Kontorovich

Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest.

6 0.036572561 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

7 0.03599875 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

8 0.035469763 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors

9 0.034547776 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

10 0.034244541 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion

11 0.032324713 91 nips-2013-Dirty Statistical Models

12 0.03153016 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

13 0.030472638 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

14 0.028984098 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

15 0.027693901 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

16 0.027187753 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning

17 0.026132343 149 nips-2013-Latent Structured Active Learning

18 0.024999354 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks

19 0.02492502 226 nips-2013-One-shot learning by inverting a compositional causal process

20 0.024021929 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.078), (1, 0.028), (2, 0.012), (3, 0.018), (4, 0.015), (5, 0.006), (6, 0.013), (7, -0.014), (8, 0.004), (9, 0.009), (10, 0.016), (11, -0.04), (12, 0.035), (13, 0.003), (14, -0.014), (15, -0.009), (16, -0.006), (17, 0.009), (18, -0.023), (19, 0.011), (20, -0.012), (21, -0.002), (22, 0.019), (23, -0.009), (24, 0.011), (25, -0.021), (26, 0.01), (27, -0.018), (28, -0.073), (29, -0.026), (30, -0.018), (31, 0.001), (32, -0.005), (33, -0.014), (34, 0.029), (35, -0.003), (36, -0.017), (37, -0.017), (38, -0.02), (39, 0.048), (40, -0.125), (41, 0.029), (42, -0.034), (43, 0.009), (44, 0.088), (45, 0.011), (46, -0.007), (47, -0.008), (48, -0.056), (49, -0.0)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87340248 134 nips-2013-Graphical Models for Inference with Missing Data

Author: Karthika Mohan, Judea Pearl, Jin Tian

Abstract: We address the problem of recoverability, i.e., deciding whether there exists a consistent estimator of a given relation Q when data are missing not at random. We employ a formal representation called ‘Missingness Graphs’ to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we derive conditions that the graph should satisfy to ensure recoverability and devise algorithms to detect the presence of these conditions in the graph.

2 0.77358919 337 nips-2013-Transportability from Multiple Environments with Limited Experiments

Author: Elias Bareinboim, Sanghack Lee, Vasant Honavar, Judea Pearl

Abstract: This paper considers the problem of transferring experimental findings learned from multiple heterogeneous domains to a target domain, in which only limited experiments can be performed. We reduce questions of transportability from multiple domains and with limited scope to symbolic derivations in the causal calculus, thus extending the original setting of transportability introduced in [1], which assumes only one domain with full experimental information available. We further provide different graphical and algorithmic conditions for computing the transport formula in this setting, that is, a way of fusing the observational and experimental information scattered throughout different domains to synthesize a consistent estimate of the desired effects in the target domain. We also consider the issue of minimizing the variance of the produced estimand in order to increase power.

1 Motivation

Transporting and synthesizing experimental knowledge from heterogeneous settings are central to scientific discovery. Conclusions that are obtained in a laboratory setting are transported and applied elsewhere, in an environment that differs in many aspects from that of the laboratory. In data-driven sciences, experiments are conducted on disparate domains, but the intention is almost invariably to fuse the acquired knowledge and translate it into some meaningful claim about a target domain, which is usually different from any of the individual study domains. However, the conditions under which this extrapolation can be legitimized have not been formally articulated until very recently. Although the problem has been discussed in many areas of statistics, economics, and the health sciences, under rubrics such as “external validity” [2, 3], “meta-analysis” [4], “quasi-experiments” [5], and “heterogeneity” [6], these discussions are limited to verbal narratives in the form of heuristic guidelines for experimental researchers; no formal treatment of the problem has been attempted to answer the practical challenge, posed in this paper, of generalizing causal knowledge across multiple heterogeneous domains with disparate experimental data. The fields of artificial intelligence and statistics provide the theoretical underpinnings necessary for tackling transportability. First, the distinction between statistical and causal knowledge has received syntactic representation through causal diagrams [7, 8, 9], which became a popular tool for causal inference in data-driven fields. Second, the inferential machinery provided by the causal calculus (do-calculus) [7, 9, 10] is particularly suitable for handling knowledge transfer across domains. Armed with these techniques, [1] introduced a formal language for encoding differences and commonalities between domains, accompanied by necessary or sufficient conditions under which transportability of empirical findings is feasible between two domains, a source and a target; these conditions were then extended to a complete characterization of transportability in one domain with unrestricted experimental data [11]. Subsequently, these results were generalized for the settings when only limited experiments are available in the source domain [12, 13], and further for when multiple source domains with unrestricted experimental information are available [14, 15].

[∗ These authors contributed equally to this paper. The authors’ addresses are respectively eb@cs.ucla.edu, sxl439@ist.psu.edu, vhonavar@ist.psu.edu, judea@cs.ucla.edu.]
This paper broadens these discussions by introducing a more general setting in which multiple heterogeneous sources with limited and distinct experiments are available, a task that we call here “mz-transportability”.^1 More formally, the mz-transportability problem concerns the transfer of causal knowledge from a heterogeneous collection of source domains Π = {π1, ..., πn} to a target domain π∗. In each domain πi ∈ Π, experiments over a set of variables Zi can be performed, and causal knowledge gathered. In π∗, potentially different from πi, only passive observations can be collected (this constraint is weakened later on). The problem is to infer a causal relationship R in π∗ using knowledge obtained in Π. Clearly, if nothing is known about the relationship between Π and π∗, the problem is trivial; no transfer can be justified. Yet the fact that all scientific experiments are conducted with the intent of being used elsewhere (e.g., outside the lab) implies that scientific progress relies on the assumption that certain domains share common characteristics and that, owing to these commonalities, causal claims would be valid in new settings even where experiments cannot be conducted. The problem stated in this paper generalizes the one-dimensional version of transportability with limited scope and the multi-dimensional version with unlimited scope. Remarkably, while the effects of interest might not be individually transportable to the target domain from the experiments in any of the available sources, combining different pieces from the various sources may enable the estimation of the desired effects (to be shown later on). The goal of this paper is to formally understand under which conditions the target quantity is (non-parametrically) estimable from the available data.

2 Previous work and our contributions

Consider Fig. 1(a), in which the node S represents factors that produce differences between source and target populations. Assume that we conduct a randomized trial in Los Angeles (LA) and estimate the causal effect of treatment X on outcome Y for every age group Z = z, denoted by P(y|do(x), z). We now wish to generalize the results to the population of the United States (U.S.), but we find the distribution P(x, y, z) in LA to be different from the one in the U.S. (call the latter P∗(x, y, z)). In particular, the average age in the U.S. is significantly higher than that in LA. How are we to estimate the causal effect of X on Y in the U.S., denoted R = P∗(y|do(x))?^{2,3}

The selection diagram for this example (Fig. 1(a)) conveys the assumption that the only differences between the two populations are factors determining age distributions, shown as S → Z, while age-specific effects P∗(y|do(x), Z = z) are invariant across populations. Difference-generating factors are represented by a special set of variables called selection variables S (or simply S-variables), which are graphically depicted as square nodes (■). From this assumption, the overall causal effect in the U.S. can be derived as follows:

    R = Σ_z P∗(y|do(x), z) P∗(z) = Σ_z P(y|do(x), z) P∗(z).    (1)

The last line is the transport formula for R. It combines experimental results obtained in LA, P(y|do(x), z), with observational aspects of the U.S. population, P∗(z), to obtain an experimental claim P∗(y|do(x)) about the U.S. In this trivial example, the transport formula amounts to a simple re-calibration (or re-weighting) of the age-specific effects to account for the new age distribution.
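To make the re-calibration in Eq. (1) concrete, here is a minimal sketch of the computation, assuming the stratum-specific effects and the target age distribution have already been estimated. All numbers below are illustrative placeholders, not results from the paper:

```python
import numpy as np

# Age-specific effects estimated in the source RCT (illustrative numbers):
p_y_do_x_z = np.array([0.30, 0.45, 0.60])   # P(y | do(x), z) for strata z
# Target-domain age distribution estimated from observational data:
p_star_z = np.array([0.20, 0.30, 0.50])     # P*(z), must sum to 1

# Transport formula (1): re-weight the invariant stratum-specific effects
# by the target domain's covariate distribution.
r = float(np.dot(p_y_do_x_z, p_star_z))
print(f"P*(y | do(x)) = {r:.3f}")
```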
In general, however, a more involved mixture of experimental and observational findings would be necessary to obtain a bias-free estimate of the target relation R. Fig. 1(b) depicts the smallest example in which transportability is not feasible even when experiments over X in π are available. In real-world applications, it may happen that certain controlled experiments cannot be conducted in the source environment (for financial, ethical, or technical reasons), so only a limited amount of experimental information can be gathered. A natural question arises whether an investigator in possession of a limited set of experiments would still be able to estimate the desired effects in the target domain. For instance, we assume in Fig. 1(c) that experiments over Z1 are available and the target quantity is R = P∗(y|do(x)), which can be shown to be equivalent to P(y|x, do(Z1)), the conditional distribution of Y given X in the experimental study in which Z1 is randomized.^4

[Figure 1: The selection variables S are depicted as square nodes (■). (a) Selection diagram illustrating when transportability between two domains is trivially solved through simple recalibration. (b) The smallest possible selection diagram in which a causal relation is not transportable. (c) Selection diagram illustrating transportability when only experiments over {Z1} are available in the source.]

One might surmise that multiple pairwise z-transportability would be sufficient to solve the mz-transportability problem, but this is not the case. To witness, consider Fig. 2(a,b), which concerns the transport of experimental results from two sources ({πa, πb}) to infer the effect of X on Y in π∗, R = P∗(y|do(x)). In these diagrams, X may represent the treatment (e.g., cholesterol level), Z1 represents a pre-treatment variable (e.g., diet), Z2 represents an intermediate variable (e.g., biomarker), and Y represents the outcome (e.g., heart failure). We assume that experimental studies randomizing {Z1, Z2} can be conducted in both domains. A simple analysis based on [12] can show that R cannot be z-transported from either source alone, but it turns out that combining in a special way experiments from both sources allows one to determine the effect in the target. More interestingly, we consider the more stringent scenario where only certain experiments can be performed in each of the domains. For instance, assume that it is only possible to conduct experiments over {Z2} in πa and over {Z1} in πb. Obviously, R will not be z-transportable individually from these domains, but it turns out that, taking both sets of experiments into account, R = Σ_{z2} P^(a)(y|do(z2)) P^(b)(z2|x, do(Z1)), which fully uses all pieces of experimental data available. In other words, we were able to decompose R into subrelations such that each one is separately z-transportable from the source domains, and so is the desired target quantity.

^1 The machine learning literature has been concerned with discrepancies among domains almost exclusively in the context of predictive or classification tasks, as opposed to learning causal or counterfactual measures [16, 17]. Interestingly enough, recent work on anticausal learning moves towards more general modalities of learning and also leverages knowledge about the underlying data-generating structure [18, 19].
^2 We will use Px(y | z) interchangeably with P(y | do(x), z).
^3 We use the structural interpretation of causal diagrams as described in [9, pp. 205].
Interestingly, it is the case in this example that if the domains in which experiments were conducted were reversed (i.e., {Z1} randomized in πa, {Z2} in πb), it would not be possible to transport R by any method – the target relation is simply not computable from the available data (formally shown later on). This illustrates some of the subtle issues mz-transportability entails, which cannot be immediately cast in terms of previous instances of the transportability class. In the sequel, we try to better understand some of these issues, and we develop sufficient or (specific) necessary conditions for deciding special transportability for an arbitrary collection of selection diagrams and sets of experiments. We further construct an algorithm for deciding mz-transportability of joint causal effects and returning the correct transport formula whenever this is possible. We also consider issues relative to the variance of the estimand, aiming to improve sample efficiency and increase statistical power.

^4 A typical example is whether we can estimate the effect of cholesterol (X) on heart failure (Y) by experiments on diet (Z1), given that cholesterol levels cannot be randomized [20].

[Figure 2: Selection diagrams illustrating the impossibility of estimating R = P∗(y|do(x)) through individual transportability from πa and πb even when Z = {Z1, Z2} (panels (a,b) and (c,d)). If we assume, more stringently, availability of experiments Za = {Z2}, Zb = {Z1}, Z∗ = {}, a more elaborate analysis can show that R can be estimated combining different pieces from both domains.]

3 Graphical conditions for mz-transportability

The basic semantical framework in our analysis rests on structural causal models as defined in [9, pp. 205], also called data-generating models. In the structural causal framework [9, Ch. 7], actions are modifications of functional relationships, and each action do(x) on a causal model M produces a new model Mx = ⟨U, V, Fx, P(U)⟩, where Fx is obtained after replacing fX ∈ F, for every X ∈ X, with a new function that outputs a constant value x given by do(x).^5

We follow the conventions given in [9]. We denote variables by capital letters and their realized values by small letters. Similarly, sets of variables will be denoted by bold capital letters, sets of realized values by bold small letters. We use the typical graph-theoretic terminology with the corresponding abbreviations Pa(Y)_G and An(Y)_G, which denote respectively the set of observable parents and ancestors of the node set Y in G. A graph G_Y will denote the induced subgraph of G containing the nodes in Y and all arrows between such nodes. Finally, G_{X̄Z̲} stands for the edge subgraph of G where all incoming arrows into X and all outgoing arrows from Z are removed.

Key to the analysis of transportability is the notion of “identifiability,” defined below, which expresses the requirement that causal effects be computable from a combination of data P and assumptions embodied in a causal graph G.

Definition 1 (Causal Effects Identifiability (Pearl, 2000, pp. 77)). The causal effect of an action do(x) on a set of variables Y such that Y ∩ X = ∅ is said to be identifiable from P in G if Px(y) is uniquely computable from P(V) in any model that induces G.

Causal models and their induced graphs are usually associated with one particular domain (also called setting, study, population, or environment).
In ordinary transportability, this representation was extended to capture properties of two domains simultaneously. This is possible if we assume that the structural equations share the same set of arguments, though the functional forms of the equations may vary arbitrarily [11].^6

Definition 2 (Selection Diagram). Let ⟨M, M∗⟩ be a pair of structural causal models [9, pp. 205] relative to domains ⟨π, π∗⟩, sharing a causal diagram G. ⟨M, M∗⟩ is said to induce a selection diagram D if D is constructed as follows: 1. every edge in G is also an edge in D; 2. D contains an extra edge Si → Vi whenever there might exist a discrepancy fi ≠ fi∗ or P(Ui) ≠ P∗(Ui) between M and M∗.

In words, the S-variables locate the mechanisms where structural discrepancies between the two domains are suspected to take place.^7 Alternatively, the absence of a selection node pointing to a variable represents the assumption that the mechanism responsible for assigning value to that variable is identical in both domains.

^5 The results presented here are also valid in other formalisms for causality based on potential outcomes.
^6 As discussed in the reference, the assumption of no structural changes between domains can be relaxed, but some structural assumptions regarding the discrepancies between domains must still hold.
^7 Transportability assumes that enough structural knowledge about both domains is known in order to substantiate the production of their respective causal diagrams. In the absence of such knowledge, causal discovery algorithms might be used to infer the diagrams from data [8, 9].

Armed with the concepts of identifiability and selection diagrams, mz-transportability of causal effects can be defined as follows:

Definition 3 (mz-Transportability). Let D = {D^(1), ..., D^(n)} be a collection of selection diagrams relative to source domains Π = {π1, ..., πn} and target domain π∗, respectively, and let Zi (and Z∗) be the variables in which experiments can be conducted in domain πi (and π∗). Let ⟨P^i, I_z^i⟩ be the pair of observational and interventional distributions of πi, where I_z^i = ∪_{Z′⊆Zi} P^i(v|do(z′)), and, in an analogous manner, let ⟨P∗, I_z^∗⟩ be the observational and interventional distributions of π∗. The causal effect R = P∗_x(y|w) is said to be mz-transportable from Π to π∗ in D if P∗_x(y|w) is uniquely computable from ∪_{i=1,...,n} ⟨P^i, I_z^i⟩ ∪ ⟨P∗, I_z^∗⟩ in any model that induces D.

The requirement that R be uniquely computable from ⟨P∗, I_z^∗⟩ and the ⟨P^i, I_z^i⟩ from all sources has a syntactic image in the causal calculus, which is captured by the following sufficient condition.

Theorem 1. Let D = {D^(1), ..., D^(n)} be a collection of selection diagrams relative to source domains Π = {π1, ..., πn} and target domain π∗, respectively, and let Si represent the collection of S-variables in the selection diagram D^(i). Let {⟨P^i, I_z^i⟩} and ⟨P∗, I_z^∗⟩ be respectively the pairs of observational and interventional distributions in the sources Π and target π∗. The relation R = P∗(y|do(x), w) is mz-transportable from Π to π∗ in D if the expression P(y|do(x), w, S1, ..., Sn) is reducible, using the rules of the causal calculus, to an expression in which (1) do-operators that apply to subsets of I_z^i have no Si-variables, or (2) do-operators apply only to subsets of I_z^∗.
This result provides a powerful way to syntactically establish mz-transportability, but it is not immediately obvious whether a sequence of applications of the rules of the causal calculus that achieves the reduction required by the theorem exists, and, even if such a sequence exists, it is not obvious how to obtain it. For concreteness, we illustrate this result using the selection diagrams in Fig. 2(a,b).

Corollary 1. P∗(y|do(x)) is mz-transportable in Fig. 2(a,b) with Za = {Z2} and Zb = {Z1}.

Proof. The goal is to show that R = P∗(y|do(x)) is mz-transportable from {πa, πb} to π∗ using experiments conducted over {Z2} in πa and {Z1} in πb. Note that naively trying to transport R from each of the domains individually is not possible, but R can be decomposed as follows:

    P∗(y|do(x)) = P∗(y|do(x), do(Z1))                                        (2)
                = Σ_{z2} P∗(y|do(x), do(Z1), z2) P∗(z2|do(x), do(Z1))         (3)
                = Σ_{z2} P∗(y|do(x), do(Z1), do(z2)) P∗(z2|do(x), do(Z1)),    (4)

where Eq. (2) follows by rule 3 of the causal calculus since (Z1 ⊥⊥ Y | X)_{D_{X̄,Z̄1}} holds, we condition on Z2 in Eq. (3), and Eq. (4) follows by rule 2 of the causal calculus since (Z2 ⊥⊥ Y | X, Z1)_{D_{X̄,Z̄1,Z̲2}}, where D is the diagram in π∗ (despite the location of the S-nodes). Now we can rewrite the first term of Eq. (4) as indicated by the theorem (and suggested by Def. 2):

    P∗(y|do(x), do(Z1), do(z2)) = P(y|do(x), do(Z1), do(z2), Sa, Sb)    (5)
                                = P(y|do(x), do(Z1), do(z2), Sb)         (6)
                                = P(y|do(z2), Sb)                        (7)
                                = P^(a)(y|do(z2)),                       (8)

where Eq. (5) follows from the theorem (and the definition of selection diagram), Eq. (6) follows from rule 1 of the causal calculus since (Sa ⊥⊥ Y | Z1, Z2, X)_{D^(a)_{Z̄1,Z̄2,X̄}}, and Eq. (7) follows from rule 3 of the causal calculus since (Z1, X ⊥⊥ Y | Z2)_{D^(a)_{Z̄1,Z̄2,X̄}}. Note that this equation matches the syntactic goal of Theorem 1, since we have precisely do(z2) separated from Sa (and Z2 ∈ I_z^a); so we can rewrite the expression, which results in Eq. (8) by the definition of selection diagram. Finally, we can rewrite the second term of Eq. (4) as follows:

    P∗(z2|do(x), do(Z1)) = P(z2|do(x), do(Z1), Sa, Sb)    (9)
                         = P(z2|do(x), do(Z1), Sa)        (10)
                         = P(z2|x, do(Z1), Sa)            (11)
                         = P^(b)(z2|x, do(Z1)),           (12)

where Eq. (9) follows from the theorem (and the definition of selection diagram), Eq. (10) follows from rule 1 of the causal calculus since (Sb ⊥⊥ Z2 | Z1, X)_{D^(b)_{Z̄1,X̄}}, and Eq. (11) follows from rule 2 of the causal calculus since (X ⊥⊥ Z2 | Z1)_{D^(b)_{Z̄1,X̲}}. Note that this equation matches the condition of the theorem: do(Z1) is separated from Sb (i.e., experiments over Z1 can be used since they are available in πb), so we can rewrite Eq. (12) using the definition of selection diagrams, and the corollary follows.

The next condition for mz-transportability is more visible than Theorem 1 (albeit weaker), which also demonstrates the challenge of relating mz-transportability to other types of transportability.

Corollary 2. R = P∗(y|do(x)) is mz-transportable in D if there exists Zi′ ⊆ Zi such that all paths from Zi′ to Y are blocked by X, (Si ⊥⊥ Y | X, Zi′)_{D^(i)_{X̄,Z̲i′}}, and R is computable from do(Zi′).

Remarkably, randomizing Z2 when applying Corollary 1 was instrumental in yielding transportability in the previous example, despite the fact that the directed paths from Z2 to Y were not blocked by X, which suggests how different this transportability is from z-identifiability.
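The decomposition derived in Eqs. (4), (8), and (12) is itself a simple computation once the two experimental distributions have been estimated. The sketch below assumes binary variables and uses illustrative conditional-probability tables standing in for P^(a)(y|do(z2)) and P^(b)(z2|x, do(Z1)); it is only a rendering of the transport formula, not the paper's code:

```python
import numpy as np

# Illustrative conditional-probability tables; in practice these would be
# estimated from the experimental datasets of the two source domains.
# p_a[y, z2] stands in for P^(a)(y | do(z2))      (Z2 randomized in pi_a)
# p_b[z2, x] stands in for P^(b)(z2 | x, do(Z1))  (Z1 randomized in pi_b)
p_a = np.array([[0.7, 0.2],
                [0.3, 0.8]])   # rows: y in {0,1}, cols: z2 in {0,1}
p_b = np.array([[0.6, 0.1],
                [0.4, 0.9]])   # rows: z2 in {0,1}, cols: x in {0,1}

# P*(y|do(x)) = sum_z2 P^(a)(y|do(z2)) P^(b)(z2|x, do(Z1)):
# einsum sums out z2, leaving a table indexed by (y, x).
p_star_y_do_x = np.einsum('yz,zx->yx', p_a, p_b)
print(p_star_y_do_x)   # column x gives the transported P*(y | do(x))
```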
So, it is not immediately obvious how to combine the topological relations of the Zi's with X and Y in order to create a general condition for mz-transportability; the relationships between the distributions in the different domains can get relatively intricate, but we defer this discussion for now and consider a simpler case. It is usually not trivial to pursue a derivation of mz-transportability in the causal calculus, and next we show an example in which such a derivation does not even exist. Consider again the diagrams in Fig. 2(a,b), and assume that randomized experiments are available over {Z1} in πa and {Z2} in πb.

Theorem 2. P∗(y|do(x)) is not mz-transportable in Fig. 2(a,b) with Za = {Z1} and Zb = {Z2}.

Proof. Formally, we need to display two models M1, M2 such that the following relations hold (as implied by Def. 3):

    P^(a)_{M1}(Z1, X, Z2, Y) = P^(a)_{M2}(Z1, X, Z2, Y),
    P^(b)_{M1}(Z1, X, Z2, Y) = P^(b)_{M2}(Z1, X, Z2, Y),
    P^(a)_{M1}(X, Z2, Y | do(Z1)) = P^(a)_{M2}(X, Z2, Y | do(Z1)),    (13)
    P^(b)_{M1}(Z1, X, Y | do(Z2)) = P^(b)_{M2}(Z1, X, Y | do(Z2)),
    P∗_{M1}(Z1, X, Z2, Y) = P∗_{M2}(Z1, X, Z2, Y),

for all values of Z1, X, Z2, and Y, and also

    P∗_{M1}(Y | do(X)) ≠ P∗_{M2}(Y | do(X))    (14)

for some value of X and Y. Let V be the set of observable variables and U be the set of unobservable variables in D. Let us assume that all variables in U ∪ V are binary. Let U1, U2 ∈ U be the common causes of Z1 and X and Z2, respectively; let U3, U4, U5 ∈ U be random disturbances exclusive to Z1, Z2, and Y, respectively; and let U6 ∈ U be an extra random disturbance exclusive to Z2, and U7, U8 ∈ U to Y. Let Sa and Sb index the model in the following way: the tuples ⟨Sa = 1, Sb = 0⟩, ⟨Sa = 0, Sb = 1⟩, ⟨Sa = 0, Sb = 0⟩ represent domains πa, πb, and π∗, respectively. Define the two models as follows:

    M1:  Z1 = U1 ⊕ U2 ⊕ (U3 ∧ Sa)                    M2:  Z1 = U1 ⊕ U2 ⊕ (U3 ∧ Sa)
         X  = U1                                          X  = Z1 ⊕ U1
         Z2 = (X ⊕ U2 ⊕ (U4 ∧ Sa)) ∨ U6                   Z2 = (U4 ∧ Sa ∧ U6) ⊕ U6
         Y  = (Z2 ∧ U5) ⊕ (U5 ∧ U7) ⊕ (Sb ∧ U8)           Y  = (Z2 ∧ U5) ⊕ (U5 ∧ U7) ⊕ (Sb ∧ U8)

where ⊕ represents the exclusive-or function. Both models agree with respect to P(U), which is defined as P(Ui) = 1/2, i = 1, ..., 8. It is not difficult to evaluate these models and note that the constraints given in Eqs. (13) and (14) are satisfied (including positivity); the theorem follows.

4 Algorithm for computing mz-transportability

In this section, we build on previous analyses of identifiability [7, 21, 22, 23] in order to obtain a mechanical procedure to which a collection of selection diagrams and experimental data is input, and which returns a transport formula whenever it is able to produce one.

PROCEDURE TRmz(y, x, P, I, S, W, D)
INPUT: x, y: value assignments; P: local distribution relative to domain S (S = 0 indexes π∗) and active experiments I; W: weighting scheme; D: backbone of selection diagram; Si: selection nodes in πi (S0 = ∅ relative to π∗). [The following set and distributions are globally defined: Zi, P∗, P^(i)_{Zi}.]
OUTPUT: P∗_x(y) in terms of P∗, P∗_Z, P^(i)_{Zi}, or FAIL(D, C0).
 1  if x = ∅, return Σ_{V\Y} P.
 2  if V \ An(Y)_D ≠ ∅, return TRmz(y, x ∩ An(Y)_D, Σ_{V\An(Y)_D} P, I, S, W, D_{An(Y)}).
 3  set W = (V \ X) \ An(Y)_{D_X̄}. if W ≠ ∅, return TRmz(y, x ∪ w, P, I, S, W, D).
 4  if C(D \ X) = {C0, C1, ..., Ck}, return Σ_{V\{Y,X}} Π_i TRmz(ci, v \ ci, P, I, S, W, D).
 5  if C(D \ X) = {C0},
 6    if C(D) ≠ {D},
 7      if C0 ∈ C(D), return Π_{i|Vi∈C0} ( Σ_{V\V_D^(i)} P / Σ_{V\V_D^(i−1)} P ).
 8      if (∃C′) C0 ⊂ C′ ∈ C(D),
          for {i | Vi ∈ C′}, set κi = κi ∪ (v_D^(i−1) \ C′);
          return TRmz(y, x ∩ C′, Π_{i|Vi∈C′} P(Vi | V_D^(i−1) ∩ C′, κi), I, S, W, C′).
 9    else, if I = ∅, for i = 0, ..., |D|,
10      if (Si ⊥⊥ Y | X)_{D^(i)_X̄} ∧ (Zi ∩ X ≠ ∅), Ei = TRmz(y, x \ zi, P, Zi ∩ X, i, W, D \ {Zi ∩ X}).
11      if |E| > 0, return Σ_{i=1}^{|E|} w_i^(j) Ei.
12    else, FAIL(D, C0).

[Figure 3: Modified version of the identification algorithm capable of recognizing mz-transportability.]

More specifically, our algorithm is called TRmz (see Fig. 3) and is based on the C-component decomposition for identification of causal effects [22, 23] (and a version of the identification algorithm called ID). The rationale behind TRmz is to apply Tian’s factorization and decompose the target relation into smaller, more manageable sub-expressions, and then try to evaluate whether each sub-expression can be computed in the target domain. Whenever this evaluation fails, TRmz tries to use the experiments available from the target and, if possible, from the sources; this essentially implements the declarative condition delineated in Theorem 1. Next, we consider the soundness of the algorithm.

Theorem 3 (soundness). Whenever TRmz returns an expression for P∗_x(y), it is correct.

In the sequel, we demonstrate how the algorithm works through the mz-transportability of Q = P∗(y|do(x)) in Fig. 2(c,d) with Z∗ = {Z1}, Za = {Z2}, and Zb = {Z1}. Since (V \ X) \ An(Y)_{D_X̄} = {Z2}, TRmz invokes line 3 with {Z2} ∪ {X} as the interventional set. The new call triggers line 4, and C(D \ {X, Z2}) = {C0, C1, C2, C3}, where C0 = D_{Z1}, C1 = D_{Z3}, C2 = D_U, and C3 = D_{W,Y}; we invoke line 4 and try to mz-transport individually Q0 = P∗_{x,z2,z3,u,w,y}(z1), Q1 = P∗_{x,z1,z2,u,w,y}(z3), Q2 = P∗_{x,z1,z2,z3,w,y}(u), and Q3 = P∗_{x,z1,z2,z3,u}(w, y). Thus the original problem reduces to evaluating the equivalent expression Σ_{z1,z3,u,w} P∗_{x,z2,z3,u,w,y}(z1) P∗_{x,z1,z2,u,w,y}(z3) P∗_{x,z1,z2,z3,w,y}(u) P∗_{x,z1,z2,z3,u}(w, y).

First, TRmz evaluates the expression Q0 and triggers line 2, noting that all nodes can be ignored since they are not ancestors of {Z1}, which implies after line 1 that P∗_{x,z2,z3,u,w,y}(z1) = P∗(z1).

Second, TRmz evaluates the expression Q1, triggering line 2, which implies that P∗_{x,z1,z2,u,w,y}(z3) = P∗_{x,z1,z2}(z3) with induced subgraph D1 = D_{X,Z1,Z2,Z3}. TRmz goes to line 5, where in the local call C(D \ {X, Z1, Z2}) = {D_{Z3}}. It then proceeds to line 6, testing whether C(D \ {X, Z1, Z2}) is different from D1, which is false. In this call, ordinary identifiability would fail, but TRmz proceeds to line 9. The goal of this line is to test whether some experiment can help in computing Q1. In this case, πa immediately fails the test in line 10, but πb and π∗ succeed, which means experiments in these domains may eventually help; the new call is P^(i)_{x,z2}(z3) in D \ {Z1}, for i ∈ {b, ∗}, with induced graph D1′ = D_{X,Z2,Z3}. Finally, TRmz triggers line 8 since X is not part of Z3’s component in D1′ (i.e., Z3 ∈ C′ = {Z2, Z3}), then line 2 is triggered since Z2 is no longer an ancestor of Z3 in D1′, and then line 1 is triggered since the interventional set is empty in this local call, so P^(i)_{x,z1,z2}(z3) = Σ_{Z2} P^(i)_{z1}(z3 | x, Z2) P^(i)_{z1}(Z2), for i ∈ {b, ∗}.

Third, evaluating the expression Q2, TRmz goes to line 2, which implies that P∗_{x,z1,z2,z3,w,y}(u) = P∗_{x,z1,z2,z3,w}(u) with induced subgraph D2 = D_{X,Z1,Z2,Z3,W,U}.
TRmz goes to line 5, and in this local call C(D \ {X, Z1, Z2, Z3, W}) = {D_U}, and the test in line 6 succeeds, since there are more components in D. So, it triggers line 8 since W is not part of U's component in D2. The algorithm makes P∗_{x,z1,z2,z3,w}(u) = P∗_{x,z1,z2,z3}(u)_{D2}|W (and updates the working distribution); note that in this call, ordinary identifiability would fail since the nodes are in the same C-component and the test in line 6 fails. But TRmz proceeds to line 9, trying to find experiments that can help in Q2's computation. In this case, πb cannot help, but πa and π∗ perhaps can; new calls are launched for computing P^(a)_{x,z1,z3}(u)_{D2\Z2}|W relative to πa, and P∗_{x,z2,z3}(u)_{D2\Z1}|W relative to π∗, with the corresponding data structures set. In πa, the algorithm triggers line 7, which yields P^(a)_{x,z1,z3}(u)_{D2\Z2}|W = P^(a)_{z2}(u | w, z3, x, z1), and a somewhat more involved analysis for π∗ yields (after simplification)

    P∗_{x,z2,z3}(u)_{D2\Z1}|W = [ Σ_{Z2} P∗_{z1}(u | w, z3, x, Z2) P∗_{z1}(z3 | x, Z2) P∗_{z1}(Z2) ] / [ Σ_{Z2} P∗_{z1}(z3 | x, Z2) P∗_{z1}(Z2) ].

Fourth, TRmz evaluates the expression Q3 and triggers line 5, C(D \ {X, Z1, Z2, Z3, U}) = D_{W,Y}. In turn, both tests at lines 6 and 7 succeed, which makes the procedure return P∗_{x,z1,z2,z3,u}(w, y) = P∗(w | z3, x, z1, z2) P∗(y | w, x, z1, z2, z3, u).

The composition of the returns of these calls generates the following expression:

    P∗_x(y) = Σ_{z1,z3,w,u} P∗(z1)
        × [ w1^(1) Σ_{Z2} P∗_{z1}(z3 | x, Z2) P∗_{z1}(Z2) + w1^(2) Σ_{Z2} P^(b)_{z1}(z3 | x, Z2) P^(b)_{z1}(Z2) ]
        × [ w2^(1) ( Σ_{Z2} P∗_{z1}(u | w, z3, x, Z2) P∗_{z1}(z3 | x, Z2) P∗_{z1}(Z2) ) / ( Σ_{Z2} P∗_{z1}(z3 | x, Z2) P∗_{z1}(Z2) )
            + w2^(2) P^(a)_{z2}(u | w, z3, x, z1) ]
        × P∗(w | x, z1, z2, z3) P∗(y | x, z1, z2, z3, w, u)    (15)

where w_i^(k) represents the weight of each factor in estimand k (i = 1, ..., nk), and nk is the number of feasible estimands of k. Eq. (15) depicts a powerful way to estimate P∗(y|do(x)) in the target domain, and, depending on the weighting choice, a different estimand will be entailed. For instance, one might use an analogue of inverse-variance weighting, which sets the weights to the normalized inverses of the variances (i.e., w_i^(k) = σ_i^{−2} / Σ_{j=1}^{nk} σ_j^{−2}, where σ_j^2 is the variance of the jth component of estimand k). Our strategy resembles the approach taken in meta-analysis [4], albeit the latter usually disregards the intricacies of the relationships between variables, so producing a statistically less powerful estimand. Our method leverages these non-trivial and highly structured relationships, as exemplified in Eq. (15), and yields an estimand with less variance that is statistically more powerful.
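A small sketch of the inverse-variance weighting mentioned above, pooling interchangeable estimands of the same factor in Eq. (15); the estimates and variances are hypothetical stand-ins for quantities one would obtain from the different domains:

```python
import numpy as np

def inverse_variance_weights(variances):
    """Normalized inverse-variance weights: w_i = sigma_i^-2 / sum_j sigma_j^-2."""
    inv = 1.0 / np.asarray(variances, dtype=float)   # variances are sigma_j^2
    return inv / inv.sum()

# Two interchangeable estimands of the same factor (e.g., the two Q1 pieces
# recovered from pi^* and pi_b), with illustrative sampling variances.
estimates = np.array([0.42, 0.47])
variances = np.array([0.010, 0.004])

w = inverse_variance_weights(variances)
pooled = float(np.dot(w, estimates))   # the weighted factor entering Eq. (15)
print(w, pooled)
```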
5 Conclusions

In this paper, we treat a special type of transportability in which experiments can be conducted only over limited sets of variables in the sources and target domains, and the goal is to infer whether a certain effect can be estimated in the target using the information scattered throughout the domains. We provide general sufficient graphical conditions for transportability based on the causal calculus, along with a necessary condition for a specific scenario, which should be generalized for arbitrary structures. We further provide a procedure for computing transportability, that is, for generating a formula that fuses the available observational and experimental data to synthesize an estimate of the desired causal effects. Our algorithm also allows for generic weighting schemes, which generalize standard statistical procedures and lead to the construction of statistically more powerful estimands.

Acknowledgment

The work of Judea Pearl and Elias Bareinboim was supported in part by grants from NSF (IIS-1249822, IIS-1302448) and ONR (N00014-13-1-0153, N00014-10-1-0933). The work of Sanghack Lee and Vasant Honavar was partially completed while they were with the Department of Computer Science at Iowa State University. The work of Vasant Honavar, carried out while he was serving at the National Science Foundation (NSF), was supported by the NSF. The work of Sanghack Lee was supported in part by NSF grant IIS-0711356. Any opinions, findings, and conclusions contained in this article are those of the authors and do not necessarily reflect the views of the sponsors.

References
[1] J. Pearl and E. Bareinboim. Transportability of causal and statistical relations: A formal approach. In W. Burgard and D. Roth, editors, Proceedings of the Twenty-Fifth National Conference on Artificial Intelligence, pages 247–254. AAAI Press, Menlo Park, CA, 2011.
[2] D. Campbell and J. Stanley. Experimental and Quasi-Experimental Designs for Research. Wadsworth Publishing, Chicago, 1963.
[3] C. Manski. Identification for Prediction and Decision. Harvard University Press, Cambridge, Massachusetts, 2007.
[4] L. V. Hedges and I. Olkin. Statistical Methods for Meta-Analysis. Academic Press, January 1985.
[5] W.R. Shadish, T.D. Cook, and D.T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifflin, Boston, second edition, 2002.
[6] S. Morgan and C. Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research (Analytical Methods for Social Research). Cambridge University Press, New York, NY, 2007.
[7] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–710, 1995.
[8] P. Spirtes, C.N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.
[9] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000; 2nd edition, 2009.
[10] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[11] E. Bareinboim and J. Pearl. Transportability of causal effects: Completeness results. In J. Hoffmann and B. Selman, editors, Proceedings of the Twenty-Sixth National Conference on Artificial Intelligence, pages 698–704. AAAI Press, Menlo Park, CA, 2012.
[12] E. Bareinboim and J. Pearl. Causal transportability with limited experiments. In M. desJardins and M. Littman, editors, Proceedings of the Twenty-Seventh National Conference on Artificial Intelligence, pages 95–101. AAAI Press, Menlo Park, CA, 2013.
[13] S. Lee and V. Honavar. Causal transportability of experiments on controllable subsets of variables: z-transportability. In A. Nicholson and P. Smyth, editors, Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), pages 361–370. AUAI Press, 2013.
[14] E. Bareinboim and J. Pearl. Meta-transportability of causal effects: A formal approach. In C. Carvalho and P. Ravikumar, editors, Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 135–143. JMLR W&CP 31, 2013.
[15] S. Lee and V. Honavar. m-transportability: Transportability of a causal effect from multiple environments. In M. desJardins and M. Littman, editors, Proceedings of the Twenty-Seventh National Conference on Artificial Intelligence, pages 583–590. AAAI Press, Menlo Park, CA, 2013.
[16] H. Daumé III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.
[17] A.J. Storkey. When training and test sets are different: characterising learning transfer. In J. Candela, M. Sugiyama, A. Schwaighofer, and N.D. Lawrence, editors, Dataset Shift in Machine Learning, pages 3–28. MIT Press, Cambridge, MA, 2009.
[18] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1255–1262, New York, NY, USA, 2012. Omnipress.
[19] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning (ICML). JMLR: W&CP volume 28, 2013.
[20] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. In N. Freitas and K. Murphy, editors, Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI), pages 113–120. AUAI Press, 2012.
[21] M. Kuroki and M. Miyakawa. Identifiability criteria for causal effects of joint interventions. Journal of the Royal Statistical Society, 29:105–117, 1999.
[22] J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567–573. AAAI Press/The MIT Press, Menlo Park, CA, 2002.
[23] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 1219–1226. AAAI Press, Menlo Park, CA, 2006.

3 0.49866974 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

Author: Leonidas Lefakis, François Fleuret

Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision datasets, on which it outperforms both online and offline methods in a memory-constrained setting.
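As a point of reference for the reservoir idea, here is the classical uniform reservoir-sampling baseline (Algorithm R); the paper's GEEM strategy populates the reservoir with an informed, boosting-aware criterion instead, which this sketch does not implement:

```python
import random

def reservoir_sample(stream, capacity, seed=0):
    """Uniform reservoir sampling (Algorithm R): after t items, each item
    is present in the fixed-size buffer with probability capacity/t."""
    rng = random.Random(seed)
    buf = []
    for t, x in enumerate(stream):
        if len(buf) < capacity:
            buf.append(x)
        else:
            j = rng.randint(0, t)      # uniform over 0..t inclusive
            if j < capacity:
                buf[j] = x             # evict a random resident
    return buf

print(reservoir_sample(range(10_000), capacity=5))
```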

4 0.4958156 184 nips-2013-Marginals-to-Models Reducibility

Author: Tim Roughgarden, Michael Kearns

Abstract: We consider a number of classical and new computational problems regarding marginal distributions, and inference in models specifying a full joint distribution. We prove general and efficient reductions between a number of these problems, which demonstrate that algorithmic progress in inference automatically yields progress for “pure data” problems. Our main technique involves formulating the problems as linear programs, and proving that the dual separation oracle required by the ellipsoid method is provided by the target problem. This technique may be of independent interest in probabilistic inference.

5 0.49205914 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

Author: Boqing Gong, Kristen Grauman, Fei Sha

Abstract: In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. However, image data is difficult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.). We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric formulation and efficient optimization procedure that can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks.

6 0.48665112 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics

7 0.48265913 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors

8 0.4732697 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family

9 0.47088516 25 nips-2013-Adaptive Anonymity via $b$-Matching

10 0.46958289 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?

11 0.46953061 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

12 0.45124272 332 nips-2013-Tracking Time-varying Graphical Structure

13 0.44200981 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

14 0.43978003 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

15 0.43521386 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests

16 0.43479115 183 nips-2013-Mapping paradigm ontologies to and from the brain

17 0.43386412 352 nips-2013-What do row and column marginals reveal about your dataset?

18 0.42736191 306 nips-2013-Speeding up Permutation Testing in Neuroimaging

19 0.42687985 211 nips-2013-Non-Linear Domain Adaptation with Boosting

20 0.41656059 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.015), (14, 0.021), (16, 0.056), (33, 0.081), (34, 0.08), (41, 0.024), (49, 0.036), (56, 0.11), (59, 0.301), (70, 0.034), (85, 0.049), (89, 0.028), (93, 0.022), (95, 0.019), (97, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73900497 134 nips-2013-Graphical Models for Inference with Missing Data

Author: Karthika Mohan, Judea Pearl, Jin Tian

Abstract: We address the problem of recoverability, i.e., deciding whether there exists a consistent estimator of a given relation Q when data are missing not at random. We employ a formal representation called ‘Missingness Graphs’ to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we derive conditions that the graph should satisfy to ensure recoverability and devise algorithms to detect the presence of these conditions in the graph.

2 0.67080259 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks

Author: Myunghwan Kim, Jure Leskovec

Abstract: unknown-abstract

3 0.6124565 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

Author: Raja Hafiz Affandi, Emily Fox, Ben Taskar

Abstract: Determinantal point processes (DPPs) are random point processes well-suited for modeling repulsion. In machine learning, the focus of DPP-based models has been on diverse subset selection from a discrete and finite base set. This discrete setting admits an efficient sampling algorithm based on the eigendecomposition of the defining kernel matrix. Recently, there has been growing interest in using DPPs defined on continuous spaces. While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required are not tractable in general. In this paper, we present two efficient DPP sampling schemes that apply to a wide range of kernel functions: one based on low rank approximations via Nyström and random Fourier feature techniques and another based on Gibbs sampling. We demonstrate the utility of continuous DPPs in repulsive mixture modeling and synthesizing human poses spanning activity spaces.
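For background, a sketch of the standard spectral DPP sampler (Hough et al.) for a finite ground set, which is the routine such low-rank reductions ultimately invoke; the RBF kernel and sizes are illustrative, and this is not the paper's implementation:

```python
import numpy as np

def sample_dpp(L, rng):
    """Spectral sampler for a discrete DPP with PSD kernel L (Hough et al.)."""
    evals, evecs = np.linalg.eigh(L)
    # Phase 1: include eigenvector n with probability lambda_n / (1 + lambda_n).
    keep = rng.random(evals.shape) < evals / (1.0 + evals)
    V = evecs[:, keep]
    items = []
    while V.shape[1] > 0:
        # Phase 2: pick item i with probability proportional to ||V[i, :]||^2.
        p = (V ** 2).sum(axis=1)
        i = int(rng.choice(len(p), p=p / p.sum()))
        items.append(i)
        # Remove the direction that gives item i mass, zeroing row i.
        j = int(np.argmax(np.abs(V[i])))     # a column with V[i, j] != 0
        Vj = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        V -= np.outer(Vj, V[i] / Vj[i])
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)           # re-orthonormalize the basis
    return sorted(items)

rng = np.random.default_rng(0)
pts = np.linspace(0.0, 1.0, 60)
L = np.exp(-((pts[:, None] - pts[None, :]) ** 2) / 0.01)  # RBF similarity
print(sample_dpp(L, rng))    # repulsive subset of indices
```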

4 0.53694814 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits

Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos

Abstract: Thompson Sampling has been demonstrated in many complex bandit models; however, the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allows us to give a finite-time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover, our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.
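For intuition, a sketch of the Bernoulli special case, where the Jeffreys prior used in the paper reduces to Beta(1/2, 1/2); the arm means and horizon below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.3, 0.5, 0.6])   # hypothetical Bernoulli arms
K, T = len(true_means), 5000

# Jeffreys prior for the Bernoulli family is Beta(1/2, 1/2), so the
# posterior after s successes and f failures is Beta(s + 1/2, f + 1/2).
succ = np.zeros(K)
fail = np.zeros(K)

for t in range(T):
    theta = rng.beta(succ + 0.5, fail + 0.5)   # one posterior sample per arm
    a = int(np.argmax(theta))                  # play the apparently best arm
    reward = rng.random() < true_means[a]
    succ[a] += reward
    fail[a] += 1 - reward

print("pulls per arm:", (succ + fail).astype(int))
```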

5 0.5061754 252 nips-2013-Predictive PAC Learning and Process Decompositions

Author: Cosma Shalizi, Aryeh Kontorovich

Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest.

6 0.50340658 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

7 0.49995393 249 nips-2013-Polar Operators for Structured Sparse Estimation

8 0.49939358 355 nips-2013-Which Space Partitioning Tree to Use for Search?

9 0.49938199 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

10 0.49919319 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

11 0.49580213 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

12 0.49436128 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

13 0.49407142 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

14 0.49273261 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

15 0.49241179 186 nips-2013-Matrix factorization with binary components

16 0.49207082 232 nips-2013-Online PCA for Contaminated Data

17 0.4917123 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

18 0.4905211 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

19 0.49027592 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

20 0.48951682 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence