nips nips2013 nips2013-276 knowledge-graph by maker-knowledge-mining

276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

Source: pdf

Author: Boqing Gong, Kristen Grauman, Fei Sha

Abstract: In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. However, image data is difﬁcult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.) We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identiﬁed domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric formulation and efﬁcient optimization procedure that can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. [sent-8, score-0.785]

2 ) We propose an approach to automatically discover latent domains in image or video datasets. [sent-10, score-0.727]

3 By maximum distinctiveness, we require the underlying distributions of the identiﬁed domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. [sent-12, score-0.672]

4 We devise a nonparametric formulation and efﬁcient optimization procedure that can successfully discover domains among both training and test data. [sent-13, score-0.691]

5 We extensively evaluate our approach on object recognition and human activity recognition tasks. [sent-14, score-0.351]

6 While domain adaptation research largely focuses on how adaptation should proceed, there are also vital questions concerning the domains themselves: what exactly is a domain composed of? [sent-20, score-1.639]

7 For example, in speech recognition, we can organize data into speaker-speciﬁc domains where each domain contains a single speaker’s utterances. [sent-23, score-0.921]

8 For those types of data, we can neatly categorize each instance with a discrete set of semantically meaningful properties; a domain is thus naturally composed of instances of the same (subset of) properties. [sent-25, score-0.365]

9 1 Partially due to these conceptual and practical constraints, datasets for visual recognition are not deliberately collected with clearly identiﬁable domains [11, 12, 13, 14, 15]. [sent-32, score-0.844]

10 As a result, a troubling practice in visual domain adaptation research is to equate datasets with domains and study the problem of cross-dataset generalization or correcting dataset bias [16, 17, 18, 19]. [sent-34, score-1.321]

11 Thus, modeling the dataset as a single domain would necessarily blend the distinctions, potentially damaging visual discrimination. [sent-36, score-0.431]

12 Unaware of the distinction of these two views of videos, a model for the training set as a single training domain needs to account for both inter-subject and inter-view variations. [sent-39, score-0.485]

13 At the surface, we are not given any information about the domains that the datasets contain, such as the statistical properties of the domains, or even the number of domains. [sent-49, score-0.672]

14 Furthermore, the challenge cannot be construed as a traditional clustering problem; simply clustering images by their appearance is prone to reshaping datasets into per-category domains, as observed in [20] and our own empirical studies. [sent-50, score-0.619]

15 Moreover, there may be many complex factors behind the domains, making it difﬁcult to model the domains with parametric mixture models on which traditional clustering algorithms (e. [sent-51, score-0.632]

16 Our key insights are two axiomatic properties that latent domains should possess: maximum distinctiveness and maximum learnability. [sent-54, score-0.827]

17 By maximum distinctiveness, we identify domains that are maximally different in distribution from each other. [sent-55, score-0.661]

18 This ensures domains are characteristic in terms of their large inter-domain variations. [sent-56, score-0.591]

19 By maximum learnability, we identify domains from which we can derive strong discriminative models to apply to new testing data. [sent-57, score-0.674]

20 In section 2, we describe our learning methods for extracting domains with these desirable properties. [sent-58, score-0.561]

21 To our best knowledge, [20] is the ﬁrst and only work addressing latent domain discovery. [sent-61, score-0.393]

22 In section 4, we demonstrate the effectiveness of our approach on several domain adaptation tasks for object recognition and human activity recognition. [sent-63, score-0.763]

23 Moreover, we assume that each data instance comes from a latent domain zm ∈ [K] where K is the number of domains. [sent-68, score-0.433]

24 1 Maximally distinctive domains Given K, we denote the distributions of unknown domains Dk by Pk (x, y) for k ∈ [K]. [sent-72, score-1.171]

25 Instead, the marginal distribution Pk (x) is approximated 2 ˆ by the empirical distribution Pk (x) 1 ˆ Pk (x) = Mk δxm zmk , m where Mk is the number of data instances to be assigned to the domain k and δxm is an atom at xm . [sent-74, score-0.605]

26 Intuitively, we would like any two different ˆ ˆ domains Pk (x) and Pk (x) to be as distinctive as possible. [sent-78, score-0.61]

27 In the context of modeling visual data, this implies that intra-class variations between domains are often far more pronounced than interclass variations within the same domain. [sent-79, score-0.696]

28 Clearly, deﬁning domains by simply clustering the images by appearance is insufﬁcient; the inter-category and inter-pose variations will both contribute to the clustering procedure and may lead to unreasonable clusters. [sent-83, score-0.804]

29 The measure approaches zero as the number of samples tends to inﬁnity, if and only if the two domains are the same, Pk = Pk . [sent-89, score-0.561]

30 We deﬁne the total domain distinctiveness (TDD) as the sum of this quantity over all possible pairs of domains: TDD(K) = d(k, k ), (2) k=k and choose domain assignments for zm such that TDD is maximized. [sent-90, score-0.824]

31 Optimization In addition to the binary constraints on zmk , we also enforce K zmk = 1, k=1 ∀ m ∈ [M], and 1 Mk M M zmk ymc = m=1 1 ymc , M m=1 ∀ c ∈ [C], k ∈ [K] (3) where ymc is a binary indicator variable, taking the value of 1 if ym = c. [sent-92, score-0.827]

32 The ﬁrst constraint stipulates that every instance will be assigned to one domain and one domain only. [sent-93, score-0.642]

33 For example, in action recognition, if the “walking” category occurs relatively frequently in a domain corresponding to brightly lit video, we also expect it to be frequent in the darker videos. [sent-97, score-0.429]

34 1/M ≤ βmk ≤ 1/C, m = 1, 2, · · · , M, (5) k (1 − δ)/M ymc ≤ m βmk ymc ≤ (1 + δ)/M m ymc , c = 1, · · · , C, k = 1, · · · , K, m where K is the M × M kernel matrix. [sent-104, score-0.356]

35 The ﬁrst constraint stems from the (default) requirement that every domain should have at least one instance per category, namely, Mk ≥ C and every domain should at most have M instances (Mk ≤ M). [sent-105, score-0.686]

36 We assign xm to the domain k for which βmk is the maximum of βm1 , · · · , βmK . [sent-107, score-0.427]

37 2 Maximally learnable domains: determining the number of domains Given M instances, how many domains hide inside? [sent-111, score-1.122]

38 Note that the total domain distinctiveness TDD(K) increases as K increases — presumably, in the extreme case, each domain has only a few instances and their distributions would be maximally different from each other. [sent-112, score-0.874]

39 However, such tiny domains would offer insufﬁcient data to separate the categories of interest reliably. [sent-113, score-0.587]

40 Denote the cross-validation accuracy for the k-th domain by Ak . [sent-120, score-0.353]

41 k=1 For very large K such that each domain contains only a few examples, A(K) approaches the classiﬁcation accuracy using the class prior probability to classify. [sent-122, score-0.353]

42 4 In our previous work, we proposed to identify some landmark data points in the source domain which are distributed similarly to the target domain [26]. [sent-128, score-0.815]

43 We discover the underlying domains of the training datasets, each of which will be adaptable, whereas the landmarks in [26] are intentionally biased towards the single given target domain. [sent-130, score-0.763]

44 They also aim at discovering the latent domains from datasets, by modeling the data with a hierarchical distribution consisting of Gaussian mixtures. [sent-133, score-0.633]

45 However, their focus is the setting of unsupervised clustering where ours is domain discovery. [sent-139, score-0.384]

46 Multi-domain adaptation methods suppose that multiple source domains are given as input, and the learner must adapt from (some of) them to do well in testing on a novel target domain [28, 29, 10]. [sent-141, score-1.271]

47 In contrast, in the problem we tackle, the division of data into domains is not given—our algorithm must discover the latent domains. [sent-142, score-0.675]

48 4 Experimental Results We validate our approach on visual object recognition and human activity recognition tasks. [sent-145, score-0.396]

49 We ﬁrst describe our experimental settings, and then report the results of identifying latent domains and using the identiﬁed domains for adapting classiﬁers to a new mono-domain test set. [sent-146, score-1.306]

50 After that, we present and report experimental results of reshaping heterogeneous test datasets into domains matching to the identiﬁed training domains. [sent-147, score-1.091]

51 We represent images with bag-of-visual-words descriptors following previous work on domain adaptation [2, 4]. [sent-155, score-0.62]

52 Evaluation strategy The four image datasets are commonly used as distinctive domains in research in visual domain adaptation [2, 3, 4, 32]. [sent-163, score-1.357]

53 Likewise, each view in the IXMAS dataset is often taken as a domain in action recognition [33, 34, 35, 24]. [sent-164, score-0.56]

54 Similarly, in our experiments, we use a subset of these datasets (views) as source domains for training classiﬁers and the rest of the datasets (views) as target domains for testing. [sent-165, score-1.537]

55 However, the key difference is that we do not compare performance of different adaptation algorithms which assume domains are already given. [sent-166, score-0.779]

56 Instead, we evaluate the effectiveness of our approach by investigating whether its automatically identiﬁed domains improve adaptation, that is, whether recognition accuracy on the target domains can be improved by reshaping the datasets into their latent source domains. [sent-167, score-1.906]

57 5 Table 1: Oracle recognition accuracy on target domains by adapting original or identiﬁed domains S A, C D, W C, D, W Cam 0, 1 Cam 2, 3, 4 T D, W A, C A Cam 2, 3, 4 Cam 0, 1 GORIG 41. [sent-168, score-1.413]

58 3 Table 2: Adaptation recognition accuracies, using original and identiﬁed domains with different multi-source adaptation methods Latent Multi-DA A, C D, W C, D, W Cam 0, 1 Cam 2, 3, 4 Domains method D, W A, C A Cam 2, 3, 4 Cam 0, 1 ORIGINAL UNION 41. [sent-183, score-0.88]

59 The number of latent domains K is determined by the DWCV procedure (cf. [sent-212, score-0.633]

60 2 Identifying latent domains from training datasets Notation Let S = {S1 , S2 , . [sent-216, score-0.792]

61 Furthermore, let K denote the number of optimal domains discovered by our DWCV procedure and U = {U1 , U2 , . [sent-223, score-0.595]

62 , UK } the K hidden domains identiﬁed by our approach. [sent-226, score-0.561]

63 Let r(A → B) denote the recognition accuracy on the target domain B with A as the source domain. [sent-227, score-0.599]

64 Likewise, we can deﬁne GORIG where we compute the best possible accuracy for the original domains {Sj }, and GOTHER where we compute the same quantity for a competing method for identifying latent domains, proposed in [20]. [sent-230, score-0.665]

65 Note that the max operation requires that the target domains be annotated; thus the accuracies are the most optimistic estimate for all methods, and upper bounds of practical algorithms. [sent-231, score-0.712]

66 Practical utility of identiﬁed domains In practical applications of domain adaptation algorithms, however, the target domains are not annotated. [sent-236, score-1.747]

67 In the following, we examine how closely the performance of the identiﬁed domains can approximate the oracle if we employ multi-source adaptation. [sent-238, score-0.59]

68 To this end, we consider several choices of multiple-source domain adaptation methods: • UNION The most naive way is to combine all the source domains into a single dataset and adapt from this “mega” domain to the target domains. [sent-239, score-1.605]

69 • ENSEMBLE A more sophisticated strategy is to adapt each source domain to the target domain and combine the adaptation results in the form of combining multiple classiﬁers [20]. [sent-241, score-1.005]

70 9 No reshaping A B C→F Conditional reshaping X → FX , ∀X ∈ {A , B , C } 37. [sent-258, score-0.594]

71 8 • MATCHING This strategy compares the empirical (marginal) distribution of the source domains and the target domains and selects the single source domain that has the smallest difference to the target domain to adapt. [sent-268, score-2.054]

72 Note that since we compare only the marginal distributions, we do not require the target domains to be annotated. [sent-270, score-0.647]

73 Table 2 reports the averaged recognition accuracies on the target domains, using either the original datasets/domains or the identiﬁed domains as the source domains. [sent-271, score-0.872]

74 The latent domains identiﬁed by our method generally perform well, especially using MATCHING to select the single best source domain to match the target domain for adaptation. [sent-272, score-1.42]

75 3 Reshaping the test datasets So far we have been concentrating on reshaping multiple annotated datasets (for training classiﬁers) into domains for adapting to test datasets. [sent-275, score-1.309]

76 Hence, it is also instrumental to investigate whether we can reshape the test datasets into multiple domains to achieve better adaptation results. [sent-277, score-0.993]

77 However, the reshaping process for test datasets has a critical difference from reshaping training datasets. [sent-278, score-0.793]

78 Speciﬁcally, we should reshape test datasets, conditioning on the identiﬁed domains from the training datasets — the goal is to discover latent domains in the test datasets that match the domains in the training datasets as much as possible. [sent-279, score-2.369]

79 Computationally, conditional reshaping is more tractable than identifying latent domains from the training datasets. [sent-281, score-0.978]

80 Concretely, we minimize the distribution differences between the latent domains in the test datasets and the domains in the training datasets, using the kernel-based measure explained in section 2. [sent-282, score-1.393]

81 Table 3 demonstrates the beneﬁt of conditionally reshaping the test datasets, on cross-view action recognition. [sent-286, score-0.4]

82 Two baselines are included: (1) adapting from the identiﬁed domains A , B and C to the merged dataset F ; (2) adapting from the merged dataset A B C to F . [sent-291, score-0.845]

83 These are contrasted to adapting from the identiﬁed domains in the training datasets to the matched domains in F . [sent-292, score-1.353]

84 In most groups, there is a signiﬁcant improvement in recognition accuracies by conditional reshaping over no reshaping on either training or testing, and reshaping on training only. [sent-293, score-1.153]

85 4 Analysis of identiﬁed domains and the optimal number of domains It is also interesting to see which factors are dominant in the identiﬁed domains. [sent-295, score-1.157]

86 Some exemplar images are shown in Figure 1, where each row corresponds to an original dataset, and each column is an identiﬁed domain across two datasets. [sent-298, score-0.402]

87 In Domain II all the “laptop” images 1) are taken from 7 Identiﬁed Domain I Identiﬁed Domain II Identiﬁed Domain II Webcam DSLR Caltech Amazon Identiﬁed Domain I Figure 1: Exemplar images from the original and identiﬁed domains after reshaping. [sent-300, score-0.723]

88 Note that identiﬁed domains contain images from both datasets. [sent-301, score-0.642]

89 It looks like the domains in Amazon and Caltech-256 are mainly determined by the factors of object pose and appearance (color). [sent-304, score-0.752]

90 The ﬁgures on the right are from reshaping DSLR and Webcam, of which the “keyboard” images are taken in an ofﬁce environment with various lighting, object poses, and background controlled by the dataset creators [2]. [sent-305, score-0.508]

91 Figure 2 plots some intermediate results of the domain-wise cross-validation (DWCV) for determining the number of domains K to identify from the multiple training datasets. [sent-309, score-0.637]

92 Interestingly, the number favored by DWCV coincides with the number of datasets we mix, even though, as our experiments above show, the ideal domain boundaries do not coincide with the dataset boundaries. [sent-313, score-0.471]

93 5 Conclusion We introduced two domain properties, maximum distinctiveness and maximum learnability, to discover latent domains from datasets. [sent-314, score-1.19]

94 Accordingly, we proposed nonparametric approaches encouraging the extracted domains to satisfy these properties. [sent-315, score-0.561]

95 Since in each domain visual discrimination is more consistent than that in the heterogeneous datasets, better prediction performance can be achieved on the target domain. [sent-316, score-0.478]

96 The proposed approach is extensively evaluated on visual object recognition and human activity recognition tasks. [sent-317, score-0.422]

97 Our identiﬁed domains outperform not only the original datasets but also the domains discovered by [20], validating the effectiveness of our approach. [sent-318, score-1.267]

98 It may also shed light on dataset construction in the future by examining the main factors of the domains discovered from the existing datasets. [sent-319, score-0.669]

99 Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. [sent-517, score-0.348]

100 Exploiting weakly-labeled web images to improve object classiﬁcation: a domain adaptation approach. [sent-556, score-0.684]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('domains', 0.561), ('domain', 0.321), ('reshaping', 0.297), ('adaptation', 0.218), ('dwcv', 0.213), ('cam', 0.193), ('zmk', 0.16), ('mk', 0.156), ('distinctiveness', 0.142), ('datasets', 0.111), ('ymc', 0.106), ('identi', 0.103), ('recognition', 0.101), ('tdd', 0.089), ('target', 0.086), ('images', 0.081), ('xm', 0.08), ('pk', 0.078), ('adapting', 0.072), ('latent', 0.072), ('dslr', 0.071), ('webcam', 0.071), ('visual', 0.071), ('views', 0.068), ('accuracies', 0.065), ('object', 0.064), ('action', 0.063), ('reshape', 0.063), ('source', 0.059), ('appearance', 0.058), ('ers', 0.057), ('learnability', 0.056), ('lpc', 0.053), ('gong', 0.049), ('distinctive', 0.049), ('training', 0.048), ('tsang', 0.047), ('maximally', 0.046), ('amazon', 0.045), ('category', 0.045), ('instances', 0.044), ('discover', 0.042), ('classi', 0.041), ('sha', 0.041), ('grauman', 0.041), ('test', 0.04), ('zm', 0.04), ('dataset', 0.039), ('videos', 0.039), ('organize', 0.039), ('kernel', 0.038), ('tl', 0.037), ('clustering', 0.036), ('cvpr', 0.036), ('view', 0.036), ('gorig', 0.035), ('gother', 0.035), ('gours', 0.035), ('ixmas', 0.035), ('factors', 0.035), ('pose', 0.034), ('discovered', 0.034), ('matching', 0.034), ('discriminative', 0.033), ('accuracy', 0.032), ('eccv', 0.032), ('activity', 0.032), ('variations', 0.032), ('mismatched', 0.031), ('surf', 0.031), ('gretton', 0.031), ('merged', 0.031), ('characteristic', 0.03), ('actions', 0.03), ('examine', 0.029), ('ym', 0.029), ('annotated', 0.029), ('duan', 0.029), ('colorful', 0.029), ('overcoming', 0.029), ('saenko', 0.029), ('southern', 0.029), ('identify', 0.028), ('human', 0.027), ('unsupervised', 0.027), ('blitzer', 0.027), ('jegelka', 0.027), ('percentages', 0.027), ('background', 0.027), ('maximum', 0.026), ('uk', 0.026), ('categories', 0.026), ('extensively', 0.026), ('landmarks', 0.026), ('geodesic', 0.026), ('iarpa', 0.026), ('automatically', 0.026), ('ensemble', 0.026), ('testing', 0.026), ('image', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

Author: Boqing Gong, Kristen Grauman, Fei Sha

2 0.19600844 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-speciﬁc decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for speciﬁc a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a signiﬁcant improvement over the state of the art. 1

3 0.14118236 337 nips-2013-Transportability from Multiple Environments with Limited Experiments

Author: Elias Bareinboim, Sanghack Lee, Vasant Honavar, Judea Pearl

Abstract: This paper considers the problem of transferring experimental ﬁndings learned from multiple heterogeneous domains to a target domain, in which only limited experiments can be performed. We reduce questions of transportability from multiple domains and with limited scope to symbolic derivations in the causal calculus, thus extending the original setting of transportability introduced in [1], which assumes only one domain with full experimental information available. We further provide different graphical and algorithmic conditions for computing the transport formula in this setting, that is, a way of fusing the observational and experimental information scattered throughout different domains to synthesize a consistent estimate of the desired effects in the target domain. We also consider the issue of minimizing the variance of the produced estimand in order to increase power. 1 Motivation Transporting and synthesizing experimental knowledge from heterogeneous settings are central to scientiﬁc discovery. Conclusions that are obtained in a laboratory setting are transported and applied elsewhere in an environment that differs in many aspects from that of the laboratory. In data-driven sciences, experiments are conducted on disparate domains, but the intention is almost invariably to fuse the acquired knowledge, and translate it into some meaningful claim about a target domain, which is usually different than any of the individual study domains. However, the conditions under which this extrapolation can be legitimized have not been formally articulated until very recently. Although the problem has been discussed in many areas of statistics, economics, and the health sciences, under rubrics such as “external validity” [2, 3], “meta-analysis” [4], “quasi-experiments” [5], “heterogeneity” [6], these discussions are limited to verbal narratives in the form of heuristic guidelines for experimental researchers – no formal treatment of the problem has been attempted to answer the practical challenge of generalizing causal knowledge across multiple heterogeneous domains with disparate experimental data posed in this paper. The ﬁelds of artiﬁcial intelligence and statistics provide the theoretical underpinnings necessary for tackling transportability. First, the distinction between statistical and causal knowledge has received syntactic representation through causal diagrams [7, 8, 9], which became a popular tool for causal inference in data-driven ﬁelds. Second, the inferential machinery provided by the causal calculus (do-calculus) [7, 9, 10] is particularly suitable for handling knowledge transfer across domains. Armed with these techniques, [1] introduced a formal language for encoding differences and commonalities between domains accompanied with necessary or sufﬁcient conditions under which transportability of empirical ﬁndings is feasible between two domains, a source and a target; then, these conditions were extended for a complete characterization for transportability in one domain with unrestricted experimental data [11]. Subsequently, these results were generalized for the settings when ∗ These authors contributed equally to this paper. The authors’ addresses are respectively eb@cs.ucla.edu, sxl439@ist.psu.edu, vhonavar@ist.psu.edu, judea@cs.ucla.edu. 1 only limited experiments are available in the source domain [12, 13], and further for when multiple source domains with unrestricted experimental information are available [14, 15]. This paper broadens these discussions introducing a more general setting in which multiple heterogeneous sources with limited and distinct experiments are available, a task that we call here “mz-transportability”.1 More formally, the mz-transportability problem concerns with the transfer of causal knowledge from a heterogeneous collection of source domains Π = {π1 , ..., πn } to a target domain π ∗ . In each domain πi ∈ Π, experiments over a set of variables Zi can be performed, and causal knowledge gathered. In π ∗ , potentially different from πi , only passive observations can be collected (this constraint is weakened later on). The problem is to infer a causal relationship R in π ∗ using knowledge obtained in Π. Clearly, if nothing is known about the relationship between Π and π ∗ , the problem is trivial; no transfer can be justiﬁed. Yet the fact that all scientiﬁc experiments are conducted with the intent of being used elsewhere (e.g., outside the lab) implies that scientiﬁc progress relies on the assumption that certain domains share common characteristics and that, owed to these commonalities, causal claims would be valid in new settings even where experiments cannot be conducted. The problem stated in this paper generalizes the one-dimensional version of transportability with limited scope and the multiple dimensional with unlimited scope. Remarkably, while the effects of interest might not be individually transportable to the target domain from the experiments in any of the available sources, combining different pieces from the various sources may enable the estimation of the desired effects (to be shown later on). The goal of this paper is to formally understand under which conditions the target quantity is (non-parametrically) estimable from the available data. 2 Previous work and our contributions Consider Fig. 1(a) in which the node S represents factors that produce differences between source and target populations. Assume that we conduct a randomized trial in Los Angeles (LA) and estimate the causal effect of treatment X on outcome Y for every age group Z = z, denoted by P (y|do(x), z). We now wish to generalize the results to the population of the United States (U.S.), but we ﬁnd the distribution P (x, y, z) in LA to be different from the one in the U.S. (call the latter P ∗ (x, y, z)). In particular, the average age in the U.S. is signiﬁcantly higher than that in LA. How are we to estimate the causal effect of X on Y in U.S., denoted R = P ∗ (y|do(x))? 2 3 The selection diagram for this example (Fig. 1(a)) conveys the assumption that the only difference between the two populations are factors determining age distributions, shown as S → Z, while agespeciﬁc effects P ∗ (y|do(x), Z = z) are invariant across populations. Difference-generating factors are represented by a special set of variables called selection variables S (or simply S-variables), which are graphically depicted as square nodes ( ). From this assumption, the overall causal effect in the U.S. can be derived as follows: R P ∗ (y|do(x), z)P ∗ (z) = z P (y|do(x), z)P ∗ (z) = (1) z The last line is the transport formula for R. It combines experimental results obtained in LA, P (y|do(x), z), with observational aspects of the U.S. population, P ∗ (z), to obtain an experimental claim P ∗ (y|do(x)) about the U.S.. In this trivial example, the transport formula amounts to a simple re-calibration (or re-weighting) of the age-speciﬁc effects to account for the new age distribution. In general, however, a more involved mixture of experimental and observational ﬁndings would be necessary to obtain a bias-free estimate of the target relation R. Fig. 1(b) depicts the smallest example in which transportability is not feasible even when experiments over X in π are available. In real world applications, it may happen that certain controlled experiments cannot be conducted in the source environment (for ﬁnancial, ethical, or technical reasons), so only a limited amount 1 The machine learning literature has been concerned about discrepancies among domains in the context, almost exclusively, on predictive or classiﬁcation tasks as opposed to learning causal or counterfactual measures [16, 17]. Interestingly enough, recent work on anticausal learning moves towards more general modalities of learning and also leverages knowledge about the underlying data-generating structure [18, 19]. 2 We will use Px (y | z) interchangeably with P (y | do(x), z). 3 We use the structural interpretation of causal diagrams as described in [9, pp. 205]. 2 S S Z Z1 S S X Y (a) X Y (b) Z2 X Y (c) Figure 1: The selection variables S are depicted as square nodes ( ). (a) Selection diagram illustrating when transportability between two domains is trivially solved through simple recalibration. (b) The smallest possible selection diagram in which a causal relation is not transportable. (c) Selection diagram illustrating transportability when only experiments over {Z1 } are available in the source. of experimental information can be gathered. A natural question arises whether the investigator in possession of a limited set of experiments would still be able to estimate the desired effects at the target domain. For instance, we assume in Fig. 1(c) that experiments over Z1 are available and the target quantity is R = P ∗ (y|do(x)), which can be shown to be equivalent to P (y|x, do(Z1 )), the conditional distribution of Y given X in the experimental study when Z1 is randomized. 4 One might surmise that multiple pairwise z-transportability would be sufﬁcient to solve the mztransportability problem, but this is not the case. To witness, consider Fig. 2(a,b) which concerns the transport of experimental results from two sources ({πa , πb }) to infer the effect of X on Y in π ∗ , R = P ∗ (y|do(x)). In these diagrams, X may represent the treatment (e.g., cholesterol level), Z1 represents a pre-treatment variable (e.g., diet), Z2 represents an intermediate variable (e.g., biomarker), and Y represents the outcome (e.g., heart failure). We assume that experimental studies randomizing {Z1 , Z2 } can be conducted in both domains. A simple analysis based on [12] can show that R cannot be z-transported from either source alone, but it turns out that combining in a special way experiments from both sources allows one to determine the effect in the target. More interestingly, we consider the more stringent scenario where only certain experiments can be performed in each of the domains. For instance, assume that it is only possible to conduct experiments over {Z2 } in πa and over {Z1 } in πb . Obviously, R will not be z-transported individually from these domains, but it turns out that taking both sets of experiments into account, R = z2 P (a) (y|do(z2 ))P (b) (z2 |x, do(Z1 )), which fully uses all pieces of experimental data available. In other words, we were able to decompose R into subrelations such that each one is separately z-transportable from the source domains, and so is the desired target quantity. Interestingly, it is the case in this example that if the domains in which experiments were conducted were reversed (i.e., {Z1 } randomized in πa , {Z2 } in πb ), it will not be possible to transport R by any method – the target relation is simply not computable from the available data (formally shown later on). This illustrates some of the subtle issues mz-transportability entails, which cannot be immediately cast in terms of previous instances of the transportability class. In the sequel, we try to better understand some of these issues, and we develop sufﬁcient or (speciﬁc) necessary conditions for deciding special transportability for arbitrary collection of selection diagrams and set of experiments. We further construct an algorithm for deciding mz-transportability of joint causal effects and returning the correct transport formula whenever this is possible. We also consider issues relative to the variance of the estimand aiming for improving sample efﬁciency and increasing statistical power. 3 Graphical conditions for mz-transportability The basic semantical framework in our analysis rests on structural causal models as deﬁned in [9, pp. 205], also called data-generating models. In the structural causal framework [9, Ch. 7], actions are modiﬁcations of functional relationships, and each action do(x) on a causal model M produces 4 A typical example is whether we can estimate the effect of cholesterol (X) on heart failure (Y ) by experiments on diet (Z1 ) given that cholesterol levels cannot be randomized [20]. 3 Z2 Z1 Z1 X Z1 X X Z2 W Z2 Y (a) Z2 Z1 Y X Z3 W U U Y Y (b) (c) Z3 (d) Figure 2: Selection diagrams illustrating impossibility of estimating R = P ∗ (y|do(x)) through individual transportability from πa and πb even when Z = {Z1 , Z2 } (for (a, b), (c, d))). If we assume, more stringently, availability of experiments Za = {Z2 }, Zb = {Z1 }, Z∗ = {}, a more elaborated analysis can show that R can be estimated combining different pieces from both domains. a new model Mx = U, V, Fx , P (U) , where Fx is obtained after replacing fX ∈ F for every X ∈ X with a new function that outputs a constant value x given by do(x). 5 We follow the conventions given in [9]. We denote variables by capital letters and their realized values by small letters. Similarly, sets of variables will be denoted by bold capital letters, sets of realized values by bold small letters. We use the typical graph-theoretic terminology with the corresponding abbreviations P a(Y)G and An(Y)G , which will denote respectively the set of observable parents and ancestors of the node set Y in G. A graph GY will denote the induced subgraph G containing nodes in Y and all arrows between such nodes. Finally, GXZ stands for the edge subgraph of G where all incoming arrows into X and all outgoing arrows from Z are removed. Key to the analysis of transportability is the notion of “identiﬁability,” deﬁned below, which expresses the requirement that causal effects are computable from a combination of data P and assumptions embodied in a causal graph G. Deﬁnition 1 (Causal Effects Identiﬁability (Pearl, 2000, pp. 77)). The causal effect of an action do(x) on a set of variables Y such that Y ∩ X = ∅ is said to be identiﬁable from P in G if Px (y) is uniquely computable from P (V) in any model that induces G. Causal models and their induced graphs are usually associated with one particular domain (also called setting, study, population, or environment). In ordinary transportability, this representation was extended to capture properties of two domains simultaneously. This is possible if we assume that the structural equations share the same set of arguments, though the functional forms of the equations may vary arbitrarily [11]. 6 Deﬁnition 2 (Selection Diagram). Let M, M ∗ be a pair of structural causal models [9, pp. 205] relative to domains π, π ∗ , sharing a causal diagram G. M, M ∗ is said to induce a selection diagram D if D is constructed as follows: 1. Every edge in G is also an edge in D; 2. D contains an extra edge Si → Vi whenever there might exist a discrepancy fi = fi∗ or P (Ui ) = P ∗ (Ui ) between M and M ∗ . In words, the S-variables locate the mechanisms where structural discrepancies between the two domains are suspected to take place.7 Alternatively, the absence of a selection node pointing to a variable represents the assumption that the mechanism responsible for assigning value to that variable is identical in both domains. 5 The results presented here are also valid in other formalisms for causality based on potential outcomes. As discussed in the reference, the assumption of no structural changes between domains can be relaxed, but some structural assumptions regarding the discrepancies between domains must still hold. 7 Transportability assumes that enough structural knowledge about both domains is known in order to substantiate the production of their respective causal diagrams. In the absence of such knowledge, causal discovery algorithms might be used to infer the diagrams from data [8, 9]. 6 4 Armed with the concept of identiﬁability and selection diagrams, mz-transportability of causal effects can be deﬁned as follows: Deﬁnition 3 (mz-Transportability). Let D = {D(1) , ..., D(n) } be a collection of selection diagrams relative to source domains Π = {π1 , ..., πn }, and target domain π ∗ , respectively, and Zi (and Z∗ ) i be the variables in which experiments can be conducted in domain πi (and π ∗ ). Let P i , Iz be i i the pair of observational and interventional distributions of πi , where Iz = Z ⊆Zi P (v|do(z )), ∗ and in an analogous manner, P ∗ , Iz be the observational and interventional distributions of π ∗ . ∗ ∗ The causal effect R = Px (y|w) is said to be mz-transportable from Π to π ∗ in D if Px (y|w) is i ∗ uniquely computable from i=1,...,n P i , Iz ∪ P ∗ , Iz in any model that induces D. ∗ i The requirement that R is uniquely computable from P ∗ , Iz and P i , Iz from all sources has a syntactic image in the causal calculus, which is captured by the following sufﬁcient condition. Theorem 1. Let D = {D(1) , ..., D(n) } be a collection of selection diagrams relative to source domains Π = {π1 , ..., πn }, and target domain π ∗ , respectively, and Si represents the collection of i ∗ S-variables in the selection diagram D(i) . Let { P i , Iz } and P ∗ , Iz be respectively the pairs of observational and interventional distributions in the sources Π and target π ∗ . The relation R = P ∗ (y|do(x), w) is mz-transportable from Π to π ∗ in D if the expression P (y|do(x), w, S1 , ..., Sn ) is reducible, using the rules of the causal calculus, to an expression in which (1) do-operators that ∗ i apply to subsets of Iz have no Si -variables or (2) do-operators apply only to subsets of Iz . This result provides a powerful way to syntactically establish mz-transportability, but it is not immediately obvious whether a sequence of applications of the rules of causal calculus that achieves the reduction required by the theorem exists, and even if such sequence exists, it is not obvious how to obtain it. For concreteness, we illustrate this result using the selection diagrams in Fig. 2(a,b). Corollary 1. P ∗ (y|do(x)) is mz-transportable in Fig. 2(a,b) with Za = {Z2 } and Zb = {Z1 }. Proof. The goal is to show that R = P ∗ (y|do(x)) is mz-transportable from {πa , πb } to π ∗ using experiments conducted over {Z2 } in πa and {Z1 } in πb . Note that naively trying to transport R from each of the domains individually is not possible, but R can be decomposed as follows: P ∗ (y|do(x)) = P ∗ (y|do(x), do(Z1 )) (2) P ∗ (y|do(x), do(Z1 ), z2 )P ∗ (z2 |do(x), do(Z1 )) (3) P ∗ (y|do(x), do(Z1 ), do(z2 ))P ∗ (z2 |do(x), do(Z1 )), = (4) z2 = z2 where Eq. (2) follows by rule 3 of the causal calculus since (Z1 ⊥ Y |X)DX,Z holds, we con⊥ 1 dition on Z2 in Eq. (3), and Eq. (4) follows by rule 2 of the causal calculus since (Z2 ⊥ ⊥ Y |X, Z1 )DX,Z ,Z , where D is the diagram in π ∗ (despite the location of the S-nodes). 1 2 Now we can rewrite the ﬁrst term of Eq. (4) as indicated by the Theorem (and suggested by Def. 2): P ∗ (y|do(x), do(Z1 ), do(z2 )) = P (y|do(x), do(Z1 ), do(z2 ), Sa , Sb ) (5) = P (y|do(x), do(Z1 ), do(z2 ), Sb ) (6) = P (y|do(z2 ), Sb ) (7) = P (a) (y|do(z2 )), (8) where Eq. (5) follows from the theorem (and the deﬁnition of selection diagram), Eq. (6) follows , Eq. (7) follows from rule from rule 1 of the causal calculus since (Sa ⊥ Y |Z1 , Z2 , X)D(a) ⊥ Z1 ,Z2 ,X 3 of the causal calculus since (Z1 , X ⊥ Y |Z2 )D(a) ⊥ . Note that this equation matches with the Z1 ,Z2 ,X a syntactic goal of Theorem 1 since we have precisely do(z2 ) separated from Sa (and Z2 ∈ Iz ); so, we can rewrite the expression which results in Eq. (8) by the deﬁnition of selection diagram. Finally, we can rewrite the second term of Eq. (4) as follows: P ∗ (z2 |do(x), do(Z1 )) = P (z2 |do(x), do(Z1 ), Sa , Sb ) = P (z2 |do(x), do(Z1 ), Sa ) = P (z2 |x, do(Z1 ), Sa ) = P (b) (z2 |x, do(Z1 )), 5 (9) (10) (11) (12) where Eq. (9) follows from the theorem (and the deﬁnition of selection diagram), Eq. (10) follows from rule 1 of the causal calculus since (Sb ⊥ Z2 |Z1 , X)D(b) , Eq. (11) follows from rule 2 of ⊥ Z1 ,X the causal calculus since (X ⊥ Z2 |Z1 )D(b) . Note that this equation matches the condition of the ⊥ Z1 X theorem, separate do(Z1 ) from Sb (i.e., experiments over Z1 can be used since they are available in πb ), so we can rewrite Eq. (12) using the deﬁnition of selection diagrams, the corollary follows. The next condition for mz-transportability is more visible than Theorem 1 (albeit weaker), which also demonstrates the challenge of relating mz-transportability to other types of transportability. Corollary 2. R = P ∗ (y|do(x)) is mz-transportable in D if there exists Zi ⊆ Zi such that all paths from Zi to Y are blocked by X, (Si ⊥ Y|X, Zi )D(i) , and R is computable from do(Zi ). ⊥ X,Z i Remarkably, randomizing Z2 when applying Corollary 1 was instrumental to yield transportability in the previous example, despite the fact that the directed paths from Z2 to Y were not blocked by X, which suggests how different this transportability is from z-identiﬁability. So, it is not immediately obvious how to combine the topological relations of Zi ’s with X and Y in order to create a general condition for mz-transportability, the relationship between the distributions in the different domains can get relatively intricate, but we defer this discussion for now and consider a simpler case. It is not usually trivial to pursue a derivation of mz-transportability in causal calculus, and next we show an example in which such derivation does not even exist. Consider again the diagrams in Fig. 2(a,b), and assume that randomized experiments are available over {Z1 } in πa and {Z2 } in πb . Theorem 2. P ∗ (y|do(x)) is not mz-transportable in Fig. 2(a,b) with Za = {Z1 } and Zb = {Z2 }. Proof. Formally, we need to display two models M1 , M2 such that the following relations hold (as implied by Def. 3):  (a) (a)  PM1 (Z1 , X, Z2 , Y ) = PM2 (Z1 , X, Z2 , Y ),   (b)   P (Z1 , X, Z2 , Y ) = P (b) (Z1 , X, Z2 , Y ),  M1 M2 (a) (a) (13) PM1 (X, Z2 , Y |do(Z1 )) = PM2 (X, Z2 , Y |do(Z1 )),  (b)   P (Z , X, Y |do(Z )) = P (b) (Z , X, Y |do(Z )),  M 2 1 2  M2  ∗1 1 ∗ PM1 (Z1 , X, Z2 , Y ) = PM2 (Z1 , X, Z2 , Y ), for all values of Z1 , X, Z2 , and Y , and also, ∗ ∗ PM1 (Y |do(X)) = PM2 (Y |do(X)), (14) for some value of X and Y . Let V be the set of observable variables and U be the set of unobservable variables in D. Let us assume that all variables in U ∪ V are binary. Let U1 , U2 ∈ U be the common causes of Z1 and X and Z2 , respectively; let U3 , U4 , U5 ∈ U be a random disturbance exclusive to Z1 , Z2 , and Y , respectively, and U6 ∈ U be an extra random disturbance exclusive to Z2 , and U7 , U8 ∈ U to Y . Let Sa and Sb index the model in the following way: the tuples Sa = 1, Sb = 0 , Sa = 0, Sb = 1 , Sa = 0, Sb = 0 represent domains πa , πb , and π ∗ , respectively. Deﬁne the two models as follows:    Z1 = U1 ⊕ U2 ⊕ U3 ∧ Sa  Z1 = U1 ⊕ U2 ⊕ U3 ∧ Sa     X=U X = Z1 ⊕ U1 1 M1 = M2 =  Z2 = (X ⊕ U2 ⊕ (U4 ∧ Sa )) ∨ U6  Z2 = U4 ∧ Sa ∧ U6 ⊕ U6     Y = (Z2 ∧ U5 ) ⊕ (U5 ∧ U7 ) ⊕ (Sb ∧ U8 ) Y = (Z2 ∧ U5 ) ⊕ (U5 ∧ U7 ) ⊕ (Sb ∧ U8 ) where ⊕ represents the exclusive or function. Both models agree in respect to P (U), which is deﬁned as P (Ui ) = 1/2, i = 1, ..., 8. It is not difﬁcult to evaluate these models and note that the constraints given in Eqs. (13) and (14) are satisﬁed (including positivity), the theorem follows. 4 Algorithm for computing mz-transportability In this section, we build on previous analyses of identiﬁability [7, 21, 22, 23] in order to obtain a mechanic procedure in which a collection of selection diagrams and experimental data is inputted, and the procedure returns a transport formula whenever it is able to produce one. More speciﬁcally, 6 PROCEDURE TRmz (y, x, P, I, S, W, D) INPUT: x, y: value assignments; P: local distribution relative to domain S (S = 0 indexes π ∗ ) and active experiments I; W: weighting scheme; D: backbone of selection diagram; Si : selection nodes in πi (S0 = ∅ (i) relative to π ∗ ); [The following set and distributions are globally deﬁned: Zi , P ∗ , PZi .] (i) ∗ ∗ OUTPUT: Px (y) in terms of P ∗ , PZ , PZi or F AIL(D, C0 ). P 1 if x = ∅, return V\Y P. P 2 if V \ An(Y)D = ∅, return TRmz (y, x ∩ An(Y)D , V\An(Y)D P, I, S, W, DAn(Y) ). 3 set W = (V \ X) \ An(Y)DX . if W = ∅, return TRmz (y, x ∪ w, P, I, S, W, D). Q P 4 if C(D \ X) = {C0 , C1 , ..., Ck }, return V\{Y,X} i TRmz (ci , v \ ci , P, I, S, W, D). 5 if C(D \ X) = {C0 }, 6 if C(D) = {D}, Q P P 7 if C0 ∈ C(D), return i|Vi ∈C0 V\V (i) P/ V\V (i−1) P. D 8 9 10 11 12 D if (∃C )C0 ⊂ C ∈ C(D), (i−1) for {i|Vi ∈ C }, set κi = κi ∪ vD \C . Q (i−1) mz return TR (y, x ∩ C , i|Vi ∈C P(Vi |VD ∩ C , κi ), I, S, W, C ). else, if I`= ∅, for i = 0, ..., |D|, ´ if (Si ⊥ Y | X)D(i) ∧ (Zi ∩ X = ∅) , Ei = TRmz (y, x \ zi , P, Zi ∩ X, i, W, D \ {Zi ∩ X}). ⊥ PX (j) if |E| > 0, return |E| wi Ei . i=1 else, FAIL(D, C0 ). Figure 3: Modiﬁed version of identiﬁcation algorithm capable of recognizing mz-transportability. our algorithm is called TRmz (see Fig. 3), and is based on the C-component decomposition for identiﬁcation of causal effects [22, 23] (and a version of the identiﬁcation algorithm called ID). The rationale behind TRmz is to apply Tian’s factorization and decompose the target relation into smaller, more manageable sub-expressions, and then try to evaluate whether each sub-expression can be computed in the target domain. Whenever this evaluation fails, TRmz tries to use the experiments available from the target and, if possible, from the sources; this essentially implements the declarative condition delineated in Theorem 1. Next, we consider the soundness of the algorithm. ∗ Theorem 3 (soundness). Whenever TRmz returns an expression for Px (y), it is correct. In the sequel, we demonstrate how the algorithm works through the mz-transportability of Q = P ∗ (y|do(x)) in Fig. 2(c,d) with Z∗ = {Z1 }, Za = {Z2 }, and Zb = {Z1 }. Since (V \ X) \ An(Y)DX = {Z2 }, TRmz invokes line 3 with {Z2 } ∪ {X} as interventional set. The new call triggers line 4 and C(D \ {X, Z2 }) = {C0 , C1 , C2 , C3 }, where C0 = DZ1 , C1 = DZ3 , C2 = DU , and C3 = DW,Y , we invoke line 4 and try to mz-transport individ∗ ∗ ∗ ually Q0 = Px,z2 ,z3 ,u,w,y (z1 ), Q1 = Px,z1 ,z2 ,u,w,y (z3 ), Q2 = Px,z1 ,z2 ,z3 ,w,y (u), and Q3 = ∗ Px,z1 ,z2 ,z3 ,u (w, y). Thus the original problem reduces to try to evaluate the equivalent expression ∗ ∗ ∗ ∗ z1 ,z3 ,u,w Px,z2 ,z3 ,u,w,y (z1 )Px,z1 ,z2 ,u,w,y (z3 ) Px,z1 ,z2 ,z3 ,w,y (u)Px,z1 ,z2 ,z3 ,u (w, y). First, TRmz evaluates the expression Q0 and triggers line 2, noting that all nodes can be ignored ∗ since they are not ancestors of {Z1 }, which implies after line 1 that Px,z2 ,z3 ,u,w,y (z1 ) = P ∗ (z1 ). Second, TRmz evaluates the expression Q1 triggering line 2, which implies that ∗ ∗ Px,z1 ,z2 ,u,w,y (z3 ) = Px,z1 ,z2 (z3 ) with induced subgraph D1 = DX,Z1 ,Z2 ,Z3 . TRmz goes to line 5, in which in the local call C(D \ {X, Z1 , Z2 }) = {DZ3 }. Thus it proceeds to line 6 testing whether C(D \ {X, Z1 , Z2 }) is different from D1 , which is false. In this call, ordinary identiﬁability would fail, but TRmz proceeds to line 9. The goal of this line is to test whether some experiment can help for computing Q1 . In this case, πa fails immediately the test in line 10, but πb and π ∗ succeed, (i) which means experiments in these domains may eventually help; the new call is Px,z2 (z3 )D\Z1 , for mz i = {b, ∗} with induced graph D1 = DX,Z2 ,Z3 . Finally, TR triggers line 8 since X is not part of Z3 ’s components in D1 (or, Z3 ∈ C = {Z2 Z3 }), so line 2 is triggered since Z2 is no longer an ancestor of Z3 in D1 , and then line 1 is triggered since the interventional set is empty in (i) (i) ∗ this local call, so Px,z1 ,z2 (z3 ) = Z Pz1 (z3 |x, Z2 )Pz1 (Z2 ), for i = {b, ∗}. 2 7 ∗ Third, evaluating the expression Q2 , TRmz goes to line 2, which implies that Px,z1 ,z2 ,z3 ,w,y (u) = ∗ mz Px,z1 ,z2 ,z3 ,w (u) with induced subgraph D2 = DX,Z1 ,Z2 ,Z3 ,W,U . TR goes to line 5, and in this local call C(D \ {X, Z1 , Z2 , Z3 , W }) = {DU }, and the test in 6 succeed, since there are more components in D. So, it triggers line 8 since W is not part of U ’s component ∗ ∗ in D2 . The algorithm makes Px,z1 ,z2 ,z3 ,w (u) = Px,z1 ,z2 ,z3 (u)D2 |W (and update the working distribution); note that in this call, ordinary identiﬁability would fail since the nodes are in the same C-component and the test in line 6 fails. But TRmz proceeds to line 9 trying to ﬁnd experiments that can help in Q2 ’s computation. In this case, πb cannot help but πa (a) and π ∗ perhaps can, noting that new calls are launched for computing Px,z1 ,z3 (u)D2 \Z2 |W rel∗ ative to πa , and Px,z2 ,z3 (u)D2 \Z1 |W relative to π ∗ with the corresponding data structures set. (a) (a) In πa , the algorithm triggers line 7, which yields Px,z1 ,z3 (u)D2 \Z2 |W = Pz2 (u|w, z3 , x, z1 ), ∗ and a bit more involved analysis for πb yields (after simpliﬁcation) Px,z2 ,z3 (u)D2 \Z1 |W = ∗ ∗ ∗ ∗ ∗ Z Pz1 (z3 |x, Z2 )Pz1 (Z2 ) . Z Pz1 (u|w, z3 , x, Z2 )Pz1 (z3 |x, Z2 )Pz1 (Z2 ) / 2 2 mz Fourth, TR evaluates the expression Q3 and triggers line 5, C(D\{X, Z1 , Z2 , Z3 , U }) = DW,Y . ∗ In turn, both tests at lines 6 and 7 succeed, which makes the procedure to return Px,z1 ,z2 ,z3 ,u (w, y) = ∗ ∗ P (w|z3 , x, z1 , z2 )P (y|w, x, z1 , z2 , z3 , u). The composition of the return of these calls generates the following expression: ! ∗ Px (y) = X ∗ P (z1 ) z1 ,z3 ,w,u (2) w1 (1) w1 X ∗ ∗ Pz1 (z3 |x, Z2 )Pz1 (Z2 ) X (b) (b) Pz1 (z3 |x, Z2 )Pz1 (Z2 ) Z2 Z2 „X + (1) w2 « « „X ∗ ∗ ∗ ∗ ∗ Pz1 (z3 |x, Z2 )Pz1 (Z2 ) Pz1 (u|w, z3 , x, Z2 )Pz1 (z3 |x, Z2 )Pz1 (Z2 ) / Z2 Z2 ! (2) (a) + w2 Pz2 (u|w, z3 , x, z1 ) P ∗ (w|x, z1 , z2 , z3 ) P ∗ (y|x, z1 , z2 , z3 , w, u) (15) (k) where wi represents the weight for each factor in estimand k (i = 1, ..., nk ), and nk is the number of feasible estimands of k. Eq. (15) depicts a powerful way to estimate P ∗ (y|do(x)) in the target domain, and depending on weighting choice a different estimand will be entailed. For instance, one might use an analogous to inverse-variance weighting, which sets the weights for the normalized (k) nk −2 −2 2 inverse of their variances (i.e., wi = σi / j=1 σj , where σj is the variance of the jth component of estimand k). Our strategy resembles the approach taken in meta-analysis [4], albeit the latter usually disregards the intricacies of the relationships between variables, so producing a statistically less powerful estimand. Our method leverages this non-trivial and highly structured relationships, as exempliﬁed in Eq. (15), which yields an estimand with less variance and statistically more powerful. 5 Conclusions In this paper, we treat a special type of transportability in which experiments can be conducted only over limited sets of variables in the sources and target domains, and the goal is to infer whether a certain effect can be estimated in the target using the information scattered throughout the domains. We provide a general sufﬁcient graphical conditions for transportability based on the causal calculus along with a necessary condition for a speciﬁc scenario, which should be generalized for arbitrary structures. We further provide a procedure for computing transportability, that is, generate a formula for fusing the available observational and experimental data to synthesize an estimate of the desired causal effects. Our algorithm also allows for generic weighting schemes, which generalizes standard statistical procedures and leads to the construction of statistically more powerful estimands. Acknowledgment The work of Judea Pearl and Elias Bareinboim was supported in part by grants from NSF (IIS1249822, IIS-1302448), and ONR (N00014-13-1-0153, N00014-10-1-0933). The work of Sanghack Lee and Vasant Honavar was partially completed while they were with the Department of Computer Science at Iowa State University. The work of Vasant Honavar while working at the National Science Foundation (NSF) was supported by the NSF. The work of Sanghack Lee was supported in part by the grant from NSF (IIS-0711356). Any opinions, ﬁndings, and conclusions contained in this article are those of the authors and do not necessarily reﬂect the views of the sponsors. 8 References [1] J. Pearl and E. Bareinboim. Transportability of causal and statistical relations: A formal approach. In W. Burgard and D. Roth, editors, Proceedings of the Twenty-Fifth National Conference on Artiﬁcial Intelligence, pages 247–254. AAAI Press, Menlo Park, CA, 2011. [2] D. Campbell and J. Stanley. Experimental and Quasi-Experimental Designs for Research. Wadsworth Publishing, Chicago, 1963. [3] C. Manski. Identiﬁcation for Prediction and Decision. Harvard University Press, Cambridge, Massachusetts, 2007. [4] L. V. Hedges and I. Olkin. Statistical Methods for Meta-Analysis. Academic Press, January 1985. [5] W.R. Shadish, T.D. Cook, and D.T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifﬂin, Boston, second edition, 2002. [6] S. Morgan and C. Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research (Analytical Methods for Social Research). Cambridge University Press, New York, NY, 2007. [7] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–710, 1995. [8] P. Spirtes, C.N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000. [9] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. 2nd edition, 2009. [10] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. [11] E. Bareinboim and J. Pearl. Transportability of causal effects: Completeness results. In J. Hoffmann and B. Selman, editors, Proceedings of the Twenty-Sixth National Conference on Artiﬁcial Intelligence, pages 698–704. AAAI Press, Menlo Park, CA, 2012. [12] E. Bareinboim and J. Pearl. Causal transportability with limited experiments. In M. desJardins and M. Littman, editors, Proceedings of the Twenty-Seventh National Conference on Artiﬁcial Intelligence, pages 95–101, Menlo Park, CA, 2013. AAAI Press. [13] S. Lee and V. Honavar. Causal transportability of experiments on controllable subsets of variables: ztransportability. In A. Nicholson and P. Smyth, editors, Proceedings of the Twenty-Ninth Conference on Uncertainty in Artiﬁcial Intelligence (UAI), pages 361–370. AUAI Press, 2013. [14] E. Bareinboim and J. Pearl. Meta-transportability of causal effects: A formal approach. In C. Carvalho and P. Ravikumar, editors, Proceedings of the Sixteenth International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pages 135–143. JMLR W&CP; 31, 2013. [15] S. Lee and V. Honavar. m-transportability: Transportability of a causal effect from multiple environments. In M. desJardins and M. Littman, editors, Proceedings of the Twenty-Seventh National Conference on Artiﬁcial Intelligence, pages 583–590, Menlo Park, CA, 2013. AAAI Press. [16] H. Daume III and D. Marcu. Domain adaptation for statistical classiﬁers. Journal of Artiﬁcial Intelligence Research, 26:101–126, 2006. [17] A.J. Storkey. When training and test sets are different: characterising learning transfer. In J. Candela, M. Sugiyama, A. Schwaighofer, and N.D. Lawrence, editors, Dataset Shift in Machine Learning, pages 3–28. MIT Press, Cambridge, MA, 2009. [18] B. Sch¨ lkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal o learning. In J Langford and J Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1255–1262, New York, NY, USA, 2012. Omnipress. [19] K. Zhang, B. Sch¨ lkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional o shift. In Proceedings of the 30th International Conference on Machine Learning (ICML). JMLR: W&CP; volume 28, 2013. [20] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identiﬁability. In N. Freitas and K. Murphy, editors, Proceedings of the Twenty-Eighth Conference on Uncertainty in Artiﬁcial Intelligence (UAI), pages 113–120. AUAI Press, 2012. [21] M. Kuroki and M. Miyakawa. Identiﬁability criteria for causal effects of joint interventions. Journal of the Royal Statistical Society, 29:105–117, 1999. [22] J. Tian and J. Pearl. A general identiﬁcation condition for causal effects. In Proceedings of the Eighteenth National Conference on Artiﬁcial Intelligence, pages 567–573. AAAI Press/The MIT Press, Menlo Park, CA, 2002. [23] I. Shpitser and J. Pearl. Identiﬁcation of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artiﬁcial Intelligence, pages 1219–1226. AAAI Press, Menlo Park, CA, 2006. 9

4 0.10256067 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models

Author: Tuan A. Nguyen, Subbarao Kambhampati, Minh Do

Abstract: Most current planners assume complete domain models and focus on generating correct plans. Unfortunately, domain modeling is a laborious and error-prone task, thus real world agents have to plan with incomplete domain models. While domain experts cannot guarantee completeness, often they are able to circumscribe the incompleteness of the model by providing annotations as to which parts of the domain model may be incomplete. In such cases, the goal should be to synthesize plans that are robust with respect to any known incompleteness of the domain. In this paper, we ﬁrst introduce annotations expressing the knowledge of the domain incompleteness and formalize the notion of plan robustness with respect to an incomplete domain model. We then show an approach to compiling the problem of ﬁnding robust plans to the conformant probabilistic planning problem, and present experimental results with Probabilistic-FF planner. 1

5 0.091469109 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell

Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to ﬁnd the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classiﬁers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a signiﬁcant advantage results from combining visual classiﬁers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1

6 0.088709764 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

7 0.086266577 335 nips-2013-Transfer Learning in a Transductive Setting

8 0.082108483 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

9 0.07947515 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

10 0.078822054 5 nips-2013-A Deep Architecture for Matching Short Texts

11 0.07867559 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model

12 0.075710252 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

13 0.075189158 148 nips-2013-Latent Maximum Margin Clustering

14 0.070509806 75 nips-2013-Convex Two-Layer Modeling

15 0.066575505 166 nips-2013-Learning invariant representations and applications to face verification

16 0.06522423 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

17 0.063781582 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

18 0.062315315 294 nips-2013-Similarity Component Analysis

19 0.061989993 21 nips-2013-Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths

20 0.061669149 136 nips-2013-Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.163), (1, 0.041), (2, -0.096), (3, -0.043), (4, 0.107), (5, -0.072), (6, 0.008), (7, 0.011), (8, -0.053), (9, 0.044), (10, -0.131), (11, -0.006), (12, -0.004), (13, -0.058), (14, 0.008), (15, 0.051), (16, 0.028), (17, -0.015), (18, 0.012), (19, -0.082), (20, -0.052), (21, 0.088), (22, 0.062), (23, -0.001), (24, 0.004), (25, -0.015), (26, 0.051), (27, -0.092), (28, -0.05), (29, -0.034), (30, 0.033), (31, -0.024), (32, 0.022), (33, 0.038), (34, 0.036), (35, 0.003), (36, -0.047), (37, 0.025), (38, 0.1), (39, 0.146), (40, -0.155), (41, -0.02), (42, 0.011), (43, 0.052), (44, 0.059), (45, -0.096), (46, 0.054), (47, 0.028), (48, -0.058), (49, 0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97955179 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

Author: Boqing Gong, Kristen Grauman, Fei Sha

2 0.76383001 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

3 0.75574201 337 nips-2013-Transportability from Multiple Environments with Limited Experiments

Author: Elias Bareinboim, Sanghack Lee, Vasant Honavar, Judea Pearl

4 0.646873 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models

Author: Tuan A. Nguyen, Subbarao Kambhampati, Minh Do

5 0.56703764 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori

Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efﬁciently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classiﬁcation, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classiﬁcation, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classiﬁcation label and localized latent paths. 1

6 0.56042546 294 nips-2013-Similarity Component Analysis

7 0.53923261 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

8 0.53357297 183 nips-2013-Mapping paradigm ontologies to and from the brain

9 0.53108859 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

10 0.53035951 166 nips-2013-Learning invariant representations and applications to face verification

11 0.52637404 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

12 0.52356172 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

13 0.52135593 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

14 0.50702733 148 nips-2013-Latent Maximum Margin Clustering

15 0.49606019 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling

16 0.49209684 335 nips-2013-Transfer Learning in a Transductive Setting

17 0.49111187 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking

18 0.48967391 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics

19 0.48952901 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation

20 0.48859784 226 nips-2013-One-shot learning by inverting a compositional causal process

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.015), (10, 0.244), (16, 0.023), (33, 0.16), (34, 0.095), (41, 0.027), (49, 0.043), (56, 0.074), (70, 0.038), (85, 0.039), (89, 0.028), (93, 0.114), (95, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82488573 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation

Author: Boqing Gong, Kristen Grauman, Fei Sha

2 0.74702513 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers

Author: Raphael Bailly, Xavier Carreras, Ariadna Quattoni

Abstract: Finite-State Transducers (FST) are a standard tool for modeling paired inputoutput sequences and are used in numerous applications, ranging from computational biology to natural language processing. Recently Balle et al. [4] presented a spectral algorithm for learning FST from samples of aligned input-output sequences. In this paper we address the more realistic, yet challenging setting where the alignments are unknown to the learning algorithm. We frame FST learning as ﬁnding a low rank Hankel matrix satisfying constraints derived from observable statistics. Under this formulation, we provide identiﬁability results for FST distributions. Then, following previous work on rank minimization, we propose a regularized convex relaxation of this objective which is based on minimizing a nuclear norm penalty subject to linear constraints and can be solved efﬁciently. 1

3 0.72415638 158 nips-2013-Learning Multiple Models via Regularized Weighting

Author: Daniel Vainsencher, Shie Mannor, Huan Xu

Abstract: We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufﬁciently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efﬁciently. We demonstrate the robustness beneﬁts of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a ﬁxed percentage of adversarial unbounded outliers. 1

4 0.68993527 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh

Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to com√ municate only once, they are able to learn k times faster than a single player. √ That is, distributing learning to k players gives rise to a factor k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε. 1

5 0.68138802 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overﬁtting by artiﬁcially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is ﬁrst-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and ﬁnd that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classiﬁcation tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

6 0.67782682 30 nips-2013-Adaptive dropout for training deep neural networks

7 0.67731619 251 nips-2013-Predicting Parameters in Deep Learning

8 0.67434543 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

9 0.67416018 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

10 0.6733585 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation

11 0.67314827 64 nips-2013-Compete to Compute

12 0.67232823 215 nips-2013-On Decomposing the Proximal Map

13 0.66967189 183 nips-2013-Mapping paradigm ontologies to and from the brain

14 0.6695447 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

15 0.66911048 5 nips-2013-A Deep Architecture for Matching Short Texts

16 0.66826868 331 nips-2013-Top-Down Regularization of Deep Belief Networks

17 0.66557777 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

18 0.66548461 201 nips-2013-Multi-Task Bayesian Optimization

19 0.66224903 301 nips-2013-Sparse Additive Text Models with Low Rank Background

20 0.66122699 65 nips-2013-Compressive Feature Learning