nips nips2013 nips2013-358 knowledge-graph by maker-knowledge-mining

358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions


Source: pdf

Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch

Abstract: In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. We call our method q-OCSVM, as it can be used to estimate q quantiles of a highdimensional distribution. For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently. We prove the correctness of our method and present empirical results that demonstrate its superiority over existing methods. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. [sent-4, score-0.163]

2 Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. [sent-5, score-0.288]

3 We call our method q-OCSVM, as it can be used to estimate q quantiles of a highdimensional distribution. [sent-6, score-0.159]

4 For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently. [sent-7, score-0.158]

5 Furthermore, when a Gaussian kernel with a bandwidth tending to zero is used, the solution also converges to the minimum-volume set (MV-set) at level α [19], which is a subset of the input space with the smallest volume and probability mass of at least α. [sent-16, score-0.146]

6 It appears, however, that in some applications we are not actually interested in estimating a single MV-set but in estimating multiple hierarchical MV-sets, which reveal more information about the distribution. [sent-18, score-0.105]

7 For instance, in cluster analysis [5], we are interested in learning hierarchical MV-sets to construct a cluster tree of the distribution. [sent-19, score-0.105]

8 In outlier detection [6], hierarchical MV-sets can be used to classify examples as outliers at different levels of significance. [sent-20, score-0.258]

9 In statistical tests, hierarchical MV-sets are used for generalizing univariate tests to high-dimensional data [12, 4]. [sent-21, score-0.105]

10 We are thus interested in a method that generalizes the OCSVM algorithm for approximating hierarchical MV-sets. [sent-22, score-0.139]

11 Let q be the number of hierarchical MV-sets we would like to approximate. [sent-25, score-0.105]

12 In this paper we introduce a generalized version of the OCSVM algorithm for approximating hierarchical MV-sets in high-dimensional distributions. [sent-29, score-0.143]

13 As in the naive approach, approximated MV-sets in our method are represented as dense sets captured by separating hyperplanes in a reproducing kernel Hilbert space. [sent-30, score-0.358]

14 To preserve the ν-property of the solution while fulfilling the hierarchy constraint, we require the resulting hyperplanes to be parallel to one another. [sent-32, score-0.21]

15 To provide a generalized global solution, we introduce a new convex optimization program that finds all approximated MV-sets at once. [sent-33, score-0.255]

16 Furthermore, we expect our method to have better generalization ability due to the parallelism constraint imposed on the hyperplanes, which also acts as a regularization term on the solution. [sent-34, score-0.159]

17 Figure 1: An approximation of 4 hierarchical MV-sets. We call our method q-OCSVM, as it can be used by statisticians to generalize q-quantiles to high-dimensional distributions. [sent-43, score-0.169]

18 We show that our method can be solved efficiently, and provide theoretical results showing that it preserves the density assumption for each approximated set in the same sense suggested by [14]. [sent-45, score-0.191]

19 These points are well defined as the inverse of the CDF, that is, the quantile function. [sent-48, score-0.242]

20 However, it appears that generalizing quantile functions beyond one dimension is hard since the number of ways to define them grows exponentially with the dimensions [3]. [sent-50, score-0.242]

21 Furthermore, while various quantile regression methods [7, 16, 9] can be used to estimate a single quantile of a high-dimensional distribution, extending them to estimate q-quantiles is mostly non-trivial. [sent-51, score-0.484]

22 Let us first understand the exponential complexity involved in estimating a generalized quantile function in high-dimensional data. [sent-52, score-0.28]

23 When d = 1, the quantile transforms F −1 (αj ) are uniquely defined as the points xj ∈ R satisfying P (X ≤ xj ) = αj , where X is a random variable drawn from F . [sent-57, score-0.242]

24 Equivalently, F −1 (αj ) can be identified with the unique hierarchical intervals (−∞, xj ]. [sent-58, score-0.105]
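
To make the d = 1 case concrete, the following small numpy sketch (our own illustration, with placeholder data) computes the quantile transforms for several levels and checks that the corresponding intervals (−∞, xj ] are hierarchical.

```python
import numpy as np

# A minimal 1-D illustration (our own, not from the paper): for a sample from F,
# the empirical quantile transforms F^{-1}(alpha_j) are single points x_j, and the
# sets (-inf, x_j] are hierarchical by construction.
rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)          # X ~ F (here a standard normal)

alphas = [0.25, 0.5, 0.75, 0.95]          # quantile levels alpha_1 < ... < alpha_q
x = [np.quantile(sample, a) for a in alphas]

# Hierarchy: C_1 = (-inf, x_1] is contained in C_2 = (-inf, x_2], and so on.
assert all(x[j] <= x[j + 1] for j in range(len(x) - 1))

# Each interval captures (approximately) alpha_j of the probability mass.
for a, xj in zip(alphas, x):
    print(f"alpha={a:.2f}  x_j={xj:+.3f}  empirical mass={np.mean(sample <= xj):.3f}")
```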

25 Hence, we are left with 2^{d−1} possible ways of defining a generalized quantile function for the data. [sent-64, score-0.28]

26 Hypothetically, any arbitrary hierarchical sets satisfying F (Cj ) = αj can be used to define a valid generalized quantile function. [sent-65, score-0.409]

27 Motivated in this direction, Polonik [12] suggested using hierarchical MV-sets to generalize quantile functions. [sent-67, score-0.385]

28 He thus suggested that level sets can be used as approximations of the MV-sets of a distribution. [sent-71, score-0.117]

29 Since level sets are hierarchical by nature, a density estimator over X would be sufficient to construct a generalized quantile function. [sent-72, score-0.532]
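
The level-set construction behind this argument can be sketched with an off-the-shelf kernel density estimator; the threshold rule below (a sample quantile of the estimated log-densities) is our own simplification for illustration, not the exact procedure of [12], and all data and parameter values are placeholders.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Sketch of the density level-set construction (our simplification): fit a KDE, then for
# each alpha_j take the level set L_j = {x : p_hat(x) >= t_j}, with t_j chosen so that
# L_j holds roughly an alpha_j fraction of the sample.  Level sets at decreasing
# thresholds are nested, so the resulting family is hierarchical.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))                      # training sample in R^d

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_dens = kde.score_samples(X)                      # log p_hat(x_i)

alphas = [0.25, 0.5, 0.75, 0.95]
thresholds = [np.quantile(log_dens, 1.0 - a) for a in alphas]

for a, t in zip(alphas, thresholds):
    inside = np.mean(log_dens >= t)
    print(f"alpha={a:.2f}  empirical mass in level set={inside:.3f}")
```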

30 In high-dimensional data, not only is the density estimation hard, but extracting level sets of the estimated density is also not always feasible. [sent-74, score-0.128]

31 Furthermore, in high-dimensional settings, even attempting to estimate q hierarchical MV-sets of a distribution might be too optimistic an objective due to the exponential growth in the search space, which may lead to overfitted estimates, especially when the sample is relatively small. [sent-75, score-0.105]

32 For a detailed discussion about generalized quantile functions, see Serfling [15]. [sent-78, score-0.28]

33 One prominent method that uses a variant of the OCSVM algorithm for approximating level sets of a distribution was proposed by Lee and Scott [8]. [sent-79, score-0.09]

34 Their method is called nested OCSVM (NOC-SVM) and it is based on a new quadratic program that simultaneously finds a global solution of multiple nested half-space decision functions. [sent-80, score-0.328]

35 An efficient decomposition method is introduced to solve this program for large-scale problems. [sent-81, score-0.132]

36 This program uses the C-SVM formulation of the OCSVM algorithm [18], where ν is replaced by a different parameter, C ≥ 0, and incorporates nesting constraints into the dual quadratic program of each approximated function. [sent-82, score-0.315]

37 However, due to these different formulations, our method converges to predefined q-quantiles of a distribution while theirs converges to approximated sets with unpredicted probability masses. [sent-83, score-0.213]

38 The probability masses in their solution are even less trackable because the constraints imposed by the NOC-SVM program on the dual variables change the geometric interpretation of the primal variables in a non-intuitive way. [sent-84, score-0.246]

39 An improved quantile regression variant of the OCSVM algorithm that also uses “non-crossing” constraints to estimate “non-crossing” quantiles of a distribution was proposed by Takeuchi et al. [sent-85, score-0.337]

40 Recently, a greedy hierarchical MV-set estimator (HMVE) that uses OCSVMs as a basic component was introduced by Glazer et al. [sent-88, score-0.16]

41 The superiority of HMVE was shown over a density-based estimation method and over a different hierarchical MV-set estimator that was also introduced in that paper and is based on the one-class neighbor machine (OCNM) algorithm [11]. [sent-91, score-0.231]

42 However, as we shall see in experiments, it appears that approximations in this greedy approach tend to become less accurate as the required number of MV-sets increases, especially for approximated MV-sets with small α in the last iterations. [sent-92, score-0.106]

43 In contrast to the naive approach of training q OCSVMs independently, our q-OCSVM estimator preserves the ν-property of the solution and converges to a generalized global solution. [sent-93, score-0.2]

44 The OCSVM algorithm returns a function fC ∈ H that maximizes the margin between the half-space decision boundary and the origin in F, while bounding a portion of examples in X satisfying fC (x) = −1. [sent-108, score-0.116]

45 By solving the program for ν = 1 − α, we can use the OCSVM to approximate C(α). [sent-116, score-0.098]
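
A minimal sketch of this single-set approximation with scikit-learn's OneClassSVM, using placeholder data and kernel parameters of our own choosing; the last two lines also show the naive baseline of training q independent OCSVMs, whose estimated sets need not be nested.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Minimal sketch (placeholder data/parameters): approximate the MV-set C(alpha) by
# training an OCSVM with nu = 1 - alpha.  By the nu-property, roughly a (1 - alpha)
# fraction of the training points falls outside the estimated set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))

alpha = 0.9
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=1.0 - alpha).fit(X)

inside = clf.predict(X) == 1                 # +1 inside the estimated set, -1 outside
print(f"target mass alpha={alpha:.2f}, empirical mass={inside.mean():.3f}")

# Naive baseline: train q independent OCSVMs, one per level.  Nothing forces the
# resulting sets to be nested, which is the drawback the q-OCSVM program addresses.
alphas = [0.25, 0.5, 0.75, 0.95]
sets = [OneClassSVM(kernel="rbf", gamma=0.1, nu=1.0 - a).fit(X) for a in alphas]
```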

46 This program is convex, and thus a global minimum can be found in polynomial time. [sent-130, score-0.134]

47 It is important to note that even with an ideal, unbounded number of examples, this program does not necessarily converge to the exact MV-sets but to approximated MV-sets of a distribution. [sent-131, score-0.181]

48 As we shall see in Section 4, all decision functions returned by this program preserve the ν-property. [sent-132, score-0.178]

49 We argue that the stability of these approximated MV-sets benefits from the parallelism constraint imposed on the half-spaces in H, which acts as a regularizer. [sent-133, score-0.208]

50 In the following we show that our program can be solved efficiently in its dual form. [sent-134, score-0.134]

51 Using multipliers ηj,i ≥ 0, βj,i ≥ 0, the Lagrangian of this program is L(w, ξq , ρ1 , . [sent-135, score-0.098]

52 Substituting Equation (4) into Equation (3), and replacing the dot product (Φ(xi ) · Φ(xs ))F with a kernel function k(xi , xs ), we obtain the dual program: min over η of (1/2q) Σj,p Σi,s ηj,i ηp,s k(xi , xs ), s.t. Σi ηj,i = 1 and 0 ≤ ηj,i ≤ 1/(nνj ) for each j (5). [sent-140, score-0.285]

53 Similar to the formulation of the dual objective function in the original OCSVM algorithm, our dual program depends only on the η multipliers, and hence can be solved more efficiently than the primal one. [sent-143, score-0.206]

54 The resulting decision function for the j'th estimate is fCj (x) = sgn((1/q) Σi η∗i k(xi , x) − ρj ). A Gaussian kernel function k(xi , xs ) = e^{−γ||xi −xs ||²} was used in the following. [sent-144, score-0.177]

55 This efficient formulation of the decision function, which derives from the fact that parallel half-spaces share the same w, allows us to compute the outputs of all the q decision functions simultaneously. [sent-146, score-0.174]

56 As in the OCSVM algorithm, ρj are recovered by identifying points Φ(xj,i ) lying strictly on the j'th decision boundary. [sent-147, score-0.111]
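
Putting the pieces together, the following sketch solves the dual program of Eq. (5) as reconstructed above with cvxpy, recovers the ρj from boundary points, and evaluates all q decision functions from a single kernel projection. The objective constant, the bound 1/(nνj ), and the ρ-recovery rule follow our reading of Eqs. (5)–(7); the data and parameter values are placeholders, not the paper's.

```python
import numpy as np
import cvxpy as cp
from sklearn.metrics.pairwise import rbf_kernel

# Sketch of the q-OCSVM dual as read from Eq. (5); constants and recovery rule are our
# interpretation, not code from the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
n = X.shape[0]

alphas = np.array([0.25, 0.5, 0.75, 0.95])       # quantile levels alpha_1 < ... < alpha_q
nus = 1.0 - alphas                               # nu_j = 1 - alpha_j
q = len(alphas)

K = rbf_kernel(X, gamma=0.1) + 1e-8 * np.eye(n)  # Gaussian kernel matrix (jitter keeps it PSD)

eta = cp.Variable((q, n), nonneg=True)
# All q half-spaces share the same w = (1/q) sum_{j,i} eta_{j,i} Phi(x_i), so the
# objective reduces to a single quadratic form in the column sums of eta.
eta_sum = cp.sum(eta, axis=0)
objective = cp.Minimize(cp.quad_form(eta_sum, K) / (2 * q))
constraints = [cp.sum(eta, axis=1) == 1]                            # sum_i eta_{j,i} = 1
constraints += [eta[j, :] <= 1.0 / (n * nus[j]) for j in range(q)]  # 0 <= eta_{j,i} <= 1/(n nu_j)
cp.Problem(objective, constraints).solve()

eta_star = eta.value
proj = (K @ eta_star.sum(axis=0)) / q            # w . Phi(x_i) for every training point

# Recover rho_j from points lying strictly between the bounds, i.e. on the j'th boundary.
rhos = np.empty(q)
for j in range(q):
    ub = 1.0 / (n * nus[j])
    on_boundary = (eta_star[j] > 1e-6) & (eta_star[j] < ub - 1e-6)
    if not on_boundary.any():                    # heuristic fallback if no point sits on the boundary
        rhos[j] = np.quantile(proj, 1.0 - alphas[j])
    else:
        rhos[j] = proj[on_boundary].mean()

# Because the half-spaces are parallel, all q decision functions reuse the same projection.
decisions = np.sign(proj[None, :] - rhos[:, None])   # shape (q, n); +1 means inside C_j
for a, d in zip(alphas, decisions):
    print(f"alpha={a:.2f}  empirical mass={np.mean(d > 0):.3f}")
```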

57 Figure 1 shows the resulting estimates of our q-OCSVM method for 4 hierarchical MV-sets with α = 0. [sent-150, score-0.139]

58 The program we solve is different from the one in Equation (1). [sent-161, score-0.098]

59 Note that if a Gaussian kernel is used (which implies k(xi , xs ) > 0), as in our case, then X is separable. [sent-174, score-0.097]

60 , fCq are (a) approximations for the MV-sets in the same sense suggested by Schölkopf et al. [sent-189, score-0.106]

61 , fCq be the decision functions returned by the q-OCSVM estimator with parameters {α1 , . [sent-195, score-0.135]

62 Let SVoj be the set of SVs lying strictly outside Cj , and SVbj be the set of SVs lying exactly on the boundary of Cj . [sent-200, score-0.159]

63 If (|SVbj | + |SVoj |)/(|X |νj ) < 1, a slight increase in ρj will result in a decrease in the objective function, which contradicts the optimality of the hyperplane. [sent-215, score-0.084]

64 Then, a slight decrease in ρj will result in a decrease in the objective function, which contradicts the optimality of the hyperplane. [sent-217, score-0.108]

65 Hence, asymptotically, the probability of points lying exactly on the hyperplanes converges to zero (cf. [sent-219, score-0.195]

66 5 Empirical Results We extensively evaluated the effectiveness of our q-OCSVM method on a variety of real highdimensional data from the UCI repository and the 20-Newsgroup document corpus, and compared its performance to competing methods. [sent-221, score-0.184]

67 1 Experiments on the UCI Repository We first evaluated our method on datasets taken from the UCI repository 4 . [sent-223, score-0.129]

68 For each data set, we trained the reference methods to approximate hierarchical MV-sets at levels α1 = 0. [sent-239, score-0.14]

69 Since the correct MV-sets are not known for the data, the quality of the approximated MV-sets was evaluated by the coverage ratio (CR): Let α' be the empirical proportion of the approximated MV-sets that was measured with the test data. [sent-247, score-0.475]

70 A perfect MV-set approximation method would yield a coverage ratio of 1. [sent-250, score-0.244]

71 An advantage of choosing this measure for evaluation is that it gives more weight to differences between α' and α in small quantiles associated with regions of high probability mass. [sent-252, score-0.118]
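
Under our reading of the definition above, CR is the ratio α'/α, which is consistent with a perfect method scoring 1 and with small-α deviations weighing more heavily; the snippet below (our own, with made-up numbers) computes it for a batch of quantile levels.

```python
import numpy as np

# Our reading of the coverage ratio (CR): the empirical proportion alpha', measured on the
# test data, divided by the expected proportion alpha.  A perfect approximation gives CR = 1,
# and a fixed absolute deviation costs more at small alpha, as noted above.
alphas = np.array([0.05, 0.25, 0.50, 0.95])        # expected proportions alpha_j
alpha_hat = np.array([0.08, 0.27, 0.49, 0.94])     # made-up empirical proportions alpha'_j

cr = alpha_hat / alphas
print(np.round(cr, 3))                             # -> [1.6   1.08  0.98  0.989]
```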

72 Results on test data for each approximated MV-set are shown in Figure 2. [sent-253, score-0.123]

73 The left graph displays, in bars, the empirical proportion of test examples in the approximated MV-sets (α') as a function of the expected proportion (α) averaged over all 61 data sets. [sent-254, score-0.251]

74 The right graph displays the coverage ratio of test examples as a function of α averaged over all 61 data sets. [sent-255, score-0.316]

75 It can be seen that our q-OCSVM method dominates the others with the best average α' and average coverage ratio behaviors. [sent-256, score-0.244]

76 In each quantile separately, we tested the significance of the advantage of q-OCSVM over the competitors using the Wilcoxon statistical test over the absolute difference between the expected and empirical coverage ratios (|1. [sent-257, score-0.478]
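
A sketch of this per-quantile test with scipy's signed-rank implementation; the coverage-ratio values below are placeholders, and the use of |1 − CR| as the paired score is our interpretation.

```python
import numpy as np
from scipy.stats import wilcoxon

# Sketch of the per-quantile significance test (our interpretation, placeholder numbers):
# for one quantile, pair |1 - CR| of q-OCSVM against a competitor over the 61 datasets
# and apply the Wilcoxon signed-rank test.
rng = np.random.default_rng(0)
cr_qocsvm = 1.0 + rng.normal(scale=0.05, size=61)       # coverage ratios of q-OCSVM
cr_competitor = 1.0 + rng.normal(scale=0.15, size=61)   # coverage ratios of a competitor

stat, p_value = wilcoxon(np.abs(1.0 - cr_qocsvm), np.abs(1.0 - cr_competitor))
print(f"Wilcoxon statistic={stat:.1f}, p-value={p_value:.4f}")
```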

77 The superiority of our method against the three competitors was found significant, with P < 0. [sent-259, score-0.112]

78 8 In outlier detection, this measure reflects the ratio between expected and empirical false alarm rates. [sent-272, score-0.112]

79 We believe that by ignoring the fundamental hierarchical structure of MV-sets, the I-OCSVM method is more likely than ours to reach an overfitted solution. [sent-273, score-0.139]

80 The HMVE method shows a decrease in performance from the largest to the smallest α. [sent-274, score-0.082]

81 This is in contrast to q-OCSVM, which converges to a global minimum, and hence is more scalable than HMVE with respect to the number of approximated MV-sets (q). [sent-278, score-0.155]

82 Right: The coverage ratio as a function of α averaged over all datasets. [sent-320, score-0.239]

83 Interestingly, the solutions produced by the HMVE and I-OCSVM methods for the largest approximated MV-set (associated with α19 = 0. [sent-321, score-0.107]

84 Therefore, in this setup, we claim that q-OCSVM also outperforms the OCSVM algorithm in the approximation of a single MV-set, and it does so with an average coverage ratio of 0. [sent-325, score-0.21]

85 We believe this improved performance is due to the parallelism constraint imposed by the q-OCSVM method on the hyperplanes, which acts as a regularization term on the solution. [sent-328, score-0.159]

86 For instance, using the average coverage ratio over all quantiles as an optimality criterion. [sent-336, score-0.338]

87 Figure 3: The q-OCSVM, HMVE, and I-OCSVM methods were trained to estimate 19 quantiles for the distribution of the 20 categories in the 20-Newsgroup document corpus. [sent-381, score-0.155]

88 Right: The coverage ratio as a function of α averaged over all 20 categories. [sent-383, score-0.239]

89 We trained the reference methods with X to estimate 19-quantiles of a distribution, and evaluated the estimated q-quantiles with the test set. [sent-388, score-0.103]

90 Results on test data for each approximated MV-set are shown in Figure 3 in the same manner as in Figure 2. [sent-389, score-0.123]

91 Again, our q-OCSVM method dominates the others with the best average α' and average coverage ratio behaviors. [sent-391, score-0.244]

92 With P < 0.01, our method performs significantly better than the other competitors for each of the 19 quantiles separately. [sent-393, score-0.17]

93 It can be seen that the differences in coverage ratios between q-OCSVM and I-OCSVM in the largest quantile (associated with α19 = 0. [sent-394, score-0.421]

94 95) are relatively high, where the average coverage ratio for q-OCSVM is 0. [sent-395, score-0.21]

95 Recall that the solution of I-OCSVM in the largest quantile is equal to the solution of a single OCSVM algorithm trained with ν = 0. [sent-398, score-0.371]

96 These results are aligned with our conclusions from the UCI repository experiments, that the parallelism constraint, which acts as a regularizer, may lead to improved performance even for the approximation of a single MV-set. [sent-400, score-0.151]

97 6 Summary The q-OCSVM method introduced in this paper can be regarded as a generalized OCSVM, as it finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. [sent-401, score-0.326]

98 Theoretical properties of our method are analyzed, showing that it can be used to approximate a family of hierarchical MV-sets while preserving the guaranteed separation properties (ν-property), in the same sense suggested by Schölkopf et al. [sent-402, score-0.188]

99 Our q-OCSVM method is empirically evaluated on a variety of high-dimensional data from the UCI repository and the 20-Newsgroup document corpus, and its advantage is verified in this setup. [sent-404, score-0.154]

100 Learning high-density regions for a generalized Kolmogorov-Smirnov test in high-dimensional data. [sent-424, score-0.101]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ocsvm', 0.69), ('hmve', 0.365), ('quantile', 0.242), ('ocsvms', 0.203), ('coverage', 0.155), ('hierarchical', 0.105), ('hyperplanes', 0.103), ('program', 0.098), ('quantiles', 0.095), ('uci', 0.084), ('approximated', 0.083), ('glazer', 0.081), ('svbj', 0.081), ('svo', 0.081), ('svoj', 0.081), ('svs', 0.081), ('cj', 0.074), ('repository', 0.067), ('outlier', 0.057), ('lying', 0.056), ('estimator', 0.055), ('ratio', 0.055), ('decision', 0.055), ('xs', 0.054), ('fc', 0.054), ('polonik', 0.054), ('cr', 0.053), ('parallelism', 0.049), ('prede', 0.045), ('lkopf', 0.045), ('kernel', 0.043), ('sv', 0.042), ('corpus', 0.042), ('hilbert', 0.042), ('imposed', 0.041), ('competitors', 0.041), ('ser', 0.041), ('lf', 0.041), ('fcj', 0.041), ('fcq', 0.041), ('lindenbaum', 0.041), ('noc', 0.041), ('test', 0.04), ('suggested', 0.038), ('generalized', 0.038), ('reproducing', 0.037), ('superiority', 0.037), ('examples', 0.037), ('parallel', 0.037), ('sch', 0.037), ('xi', 0.037), ('equation', 0.037), ('converges', 0.036), ('cq', 0.036), ('primal', 0.036), ('dual', 0.036), ('density', 0.036), ('global', 0.036), ('hierarchy', 0.035), ('acts', 0.035), ('nested', 0.035), ('solution', 0.035), ('trained', 0.035), ('svm', 0.035), ('method', 0.034), ('separating', 0.034), ('hyperplane', 0.033), ('optimality', 0.033), ('level', 0.032), ('documents', 0.032), ('nds', 0.031), ('takeuchi', 0.031), ('proportion', 0.031), ('highdimensional', 0.03), ('outliers', 0.03), ('negation', 0.029), ('wilcoxon', 0.029), ('detection', 0.029), ('svms', 0.029), ('averaged', 0.029), ('evaluated', 0.028), ('contradicts', 0.027), ('derives', 0.027), ('asymptotically', 0.027), ('ji', 0.026), ('advantages', 0.026), ('separable', 0.026), ('sgn', 0.025), ('returned', 0.025), ('document', 0.025), ('lebesgue', 0.025), ('sets', 0.024), ('largest', 0.024), ('decrease', 0.024), ('boundary', 0.024), ('libsvm', 0.024), ('approximations', 0.023), ('outside', 0.023), ('regions', 0.023), ('drawbacks', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch

Abstract: In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. We call our method q-OCSVM, as it can be used to estimate q quantiles of a highdimensional distribution. For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently. We prove the correctness of our method and present empirical results that demonstrate its superiority over existing methods. 1

2 0.13558498 244 nips-2013-Parametric Task Learning

Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima

Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.

3 0.073823735 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros

Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements, that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset. 1

4 0.069613121 216 nips-2013-On Flat versus Hierarchical Classification in Large-Scale Taxonomies

Author: Rohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini

Abstract: We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bound targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies. 1

5 0.068144754 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

Author: Adel Javanmard, Andrea Montanari

Abstract: Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance as confidence intervals or p-values. We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a ‘de-biased’ version of regularized Mestimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. Furthermore, proofs are remarkably simple. We test our method on a diabetes prediction problem. 1

6 0.061561402 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

7 0.057926815 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

8 0.057743043 158 nips-2013-Learning Multiple Models via Regularized Weighting

9 0.053634193 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling

10 0.049482502 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

11 0.046911649 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic

12 0.045792326 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

13 0.0425089 269 nips-2013-Regression-tree Tuning in a Streaming Setting

14 0.041399028 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap

15 0.040204134 171 nips-2013-Learning with Noisy Labels

16 0.039055046 173 nips-2013-Least Informative Dimensions

17 0.038760915 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality

18 0.038234763 63 nips-2013-Cluster Trees on Manifolds

19 0.038230166 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

20 0.038156744 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.13), (1, 0.046), (2, -0.005), (3, 0.006), (4, 0.056), (5, 0.041), (6, -0.004), (7, 0.012), (8, -0.049), (9, 0.023), (10, -0.008), (11, -0.0), (12, 0.002), (13, -0.044), (14, 0.052), (15, 0.044), (16, 0.057), (17, -0.001), (18, -0.009), (19, -0.017), (20, -0.006), (21, 0.033), (22, 0.026), (23, 0.042), (24, -0.062), (25, 0.016), (26, 0.018), (27, -0.033), (28, -0.013), (29, -0.006), (30, 0.003), (31, 0.023), (32, 0.013), (33, -0.033), (34, 0.005), (35, -0.032), (36, 0.033), (37, -0.049), (38, -0.117), (39, 0.036), (40, -0.015), (41, -0.013), (42, -0.078), (43, 0.082), (44, -0.046), (45, -0.077), (46, 0.042), (47, -0.117), (48, -0.061), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88103229 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch

Abstract: In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. We call our method q-OCSVM, as it can be used to estimate q quantiles of a highdimensional distribution. For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently. We prove the correctness of our method and present empirical results that demonstrate its superiority over existing methods. 1

2 0.62505728 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

Author: Xinhua Zhang, Wee Sun Lee, Yee Whye Teh

Abstract: Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are compactly encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art. 1

3 0.61524439 244 nips-2013-Parametric Task Learning

Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima

Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.

4 0.60740519 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

Author: Qichao Que, Mikhail Belkin

Abstract: q We address the problem of estimating the ratio p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1

5 0.57390165 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

Author: Yu Zhang

Abstract: All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multitask classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the knearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods. 1

6 0.55513984 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling

7 0.51631534 158 nips-2013-Learning Multiple Models via Regularized Weighting

8 0.51221281 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

9 0.51118261 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap

10 0.51111609 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

11 0.50413507 202 nips-2013-Multiclass Total Variation Clustering

12 0.50223297 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

13 0.50108629 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

14 0.50080943 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization

15 0.49813429 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

16 0.48545644 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

17 0.48463523 63 nips-2013-Cluster Trees on Manifolds

18 0.4766764 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

19 0.47243014 76 nips-2013-Correlated random features for fast semi-supervised learning

20 0.46478233 9 nips-2013-A Kernel Test for Three-Variable Interactions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.02), (16, 0.048), (19, 0.02), (33, 0.163), (34, 0.115), (41, 0.024), (49, 0.02), (52, 0.213), (56, 0.09), (70, 0.047), (85, 0.042), (89, 0.043), (93, 0.043), (95, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86979961 337 nips-2013-Transportability from Multiple Environments with Limited Experiments

Author: Elias Bareinboim, Sanghack Lee, Vasant Honavar, Judea Pearl

Abstract: This paper considers the problem of transferring experimental findings learned from multiple heterogeneous domains to a target domain, in which only limited experiments can be performed. We reduce questions of transportability from multiple domains and with limited scope to symbolic derivations in the causal calculus, thus extending the original setting of transportability introduced in [1], which assumes only one domain with full experimental information available. We further provide different graphical and algorithmic conditions for computing the transport formula in this setting, that is, a way of fusing the observational and experimental information scattered throughout different domains to synthesize a consistent estimate of the desired effects in the target domain. We also consider the issue of minimizing the variance of the produced estimand in order to increase power. 1 Motivation Transporting and synthesizing experimental knowledge from heterogeneous settings are central to scientific discovery. Conclusions that are obtained in a laboratory setting are transported and applied elsewhere in an environment that differs in many aspects from that of the laboratory. In data-driven sciences, experiments are conducted on disparate domains, but the intention is almost invariably to fuse the acquired knowledge, and translate it into some meaningful claim about a target domain, which is usually different than any of the individual study domains. However, the conditions under which this extrapolation can be legitimized have not been formally articulated until very recently. Although the problem has been discussed in many areas of statistics, economics, and the health sciences, under rubrics such as “external validity” [2, 3], “meta-analysis” [4], “quasi-experiments” [5], “heterogeneity” [6], these discussions are limited to verbal narratives in the form of heuristic guidelines for experimental researchers – no formal treatment of the problem has been attempted to answer the practical challenge of generalizing causal knowledge across multiple heterogeneous domains with disparate experimental data posed in this paper. The fields of artificial intelligence and statistics provide the theoretical underpinnings necessary for tackling transportability. First, the distinction between statistical and causal knowledge has received syntactic representation through causal diagrams [7, 8, 9], which became a popular tool for causal inference in data-driven fields. Second, the inferential machinery provided by the causal calculus (do-calculus) [7, 9, 10] is particularly suitable for handling knowledge transfer across domains. Armed with these techniques, [1] introduced a formal language for encoding differences and commonalities between domains accompanied with necessary or sufficient conditions under which transportability of empirical findings is feasible between two domains, a source and a target; then, these conditions were extended for a complete characterization for transportability in one domain with unrestricted experimental data [11]. Subsequently, these results were generalized for the settings when ∗ These authors contributed equally to this paper. The authors’ addresses are respectively eb@cs.ucla.edu, sxl439@ist.psu.edu, vhonavar@ist.psu.edu, judea@cs.ucla.edu. 1 only limited experiments are available in the source domain [12, 13], and further for when multiple source domains with unrestricted experimental information are available [14, 15]. 
This paper broadens these discussions introducing a more general setting in which multiple heterogeneous sources with limited and distinct experiments are available, a task that we call here “mz-transportability”.1 More formally, the mz-transportability problem concerns with the transfer of causal knowledge from a heterogeneous collection of source domains Π = {π1 , ..., πn } to a target domain π ∗ . In each domain πi ∈ Π, experiments over a set of variables Zi can be performed, and causal knowledge gathered. In π ∗ , potentially different from πi , only passive observations can be collected (this constraint is weakened later on). The problem is to infer a causal relationship R in π ∗ using knowledge obtained in Π. Clearly, if nothing is known about the relationship between Π and π ∗ , the problem is trivial; no transfer can be justified. Yet the fact that all scientific experiments are conducted with the intent of being used elsewhere (e.g., outside the lab) implies that scientific progress relies on the assumption that certain domains share common characteristics and that, owed to these commonalities, causal claims would be valid in new settings even where experiments cannot be conducted. The problem stated in this paper generalizes the one-dimensional version of transportability with limited scope and the multiple dimensional with unlimited scope. Remarkably, while the effects of interest might not be individually transportable to the target domain from the experiments in any of the available sources, combining different pieces from the various sources may enable the estimation of the desired effects (to be shown later on). The goal of this paper is to formally understand under which conditions the target quantity is (non-parametrically) estimable from the available data. 2 Previous work and our contributions Consider Fig. 1(a) in which the node S represents factors that produce differences between source and target populations. Assume that we conduct a randomized trial in Los Angeles (LA) and estimate the causal effect of treatment X on outcome Y for every age group Z = z, denoted by P (y|do(x), z). We now wish to generalize the results to the population of the United States (U.S.), but we find the distribution P (x, y, z) in LA to be different from the one in the U.S. (call the latter P ∗ (x, y, z)). In particular, the average age in the U.S. is significantly higher than that in LA. How are we to estimate the causal effect of X on Y in U.S., denoted R = P ∗ (y|do(x))? 2 3 The selection diagram for this example (Fig. 1(a)) conveys the assumption that the only difference between the two populations are factors determining age distributions, shown as S → Z, while agespecific effects P ∗ (y|do(x), Z = z) are invariant across populations. Difference-generating factors are represented by a special set of variables called selection variables S (or simply S-variables), which are graphically depicted as square nodes ( ). From this assumption, the overall causal effect in the U.S. can be derived as follows: R P ∗ (y|do(x), z)P ∗ (z) = z P (y|do(x), z)P ∗ (z) = (1) z The last line is the transport formula for R. It combines experimental results obtained in LA, P (y|do(x), z), with observational aspects of the U.S. population, P ∗ (z), to obtain an experimental claim P ∗ (y|do(x)) about the U.S.. In this trivial example, the transport formula amounts to a simple re-calibration (or re-weighting) of the age-specific effects to account for the new age distribution. 
In general, however, a more involved mixture of experimental and observational findings would be necessary to obtain a bias-free estimate of the target relation R. Fig. 1(b) depicts the smallest example in which transportability is not feasible even when experiments over X in π are available. In real world applications, it may happen that certain controlled experiments cannot be conducted in the source environment (for financial, ethical, or technical reasons), so only a limited amount 1 The machine learning literature has been concerned about discrepancies among domains in the context, almost exclusively, on predictive or classification tasks as opposed to learning causal or counterfactual measures [16, 17]. Interestingly enough, recent work on anticausal learning moves towards more general modalities of learning and also leverages knowledge about the underlying data-generating structure [18, 19]. 2 We will use Px (y | z) interchangeably with P (y | do(x), z). 3 We use the structural interpretation of causal diagrams as described in [9, pp. 205]. 2 S S Z Z1 S S X Y (a) X Y (b) Z2 X Y (c) Figure 1: The selection variables S are depicted as square nodes ( ). (a) Selection diagram illustrating when transportability between two domains is trivially solved through simple recalibration. (b) The smallest possible selection diagram in which a causal relation is not transportable. (c) Selection diagram illustrating transportability when only experiments over {Z1 } are available in the source. of experimental information can be gathered. A natural question arises whether the investigator in possession of a limited set of experiments would still be able to estimate the desired effects at the target domain. For instance, we assume in Fig. 1(c) that experiments over Z1 are available and the target quantity is R = P ∗ (y|do(x)), which can be shown to be equivalent to P (y|x, do(Z1 )), the conditional distribution of Y given X in the experimental study when Z1 is randomized. 4 One might surmise that multiple pairwise z-transportability would be sufficient to solve the mztransportability problem, but this is not the case. To witness, consider Fig. 2(a,b) which concerns the transport of experimental results from two sources ({πa , πb }) to infer the effect of X on Y in π ∗ , R = P ∗ (y|do(x)). In these diagrams, X may represent the treatment (e.g., cholesterol level), Z1 represents a pre-treatment variable (e.g., diet), Z2 represents an intermediate variable (e.g., biomarker), and Y represents the outcome (e.g., heart failure). We assume that experimental studies randomizing {Z1 , Z2 } can be conducted in both domains. A simple analysis based on [12] can show that R cannot be z-transported from either source alone, but it turns out that combining in a special way experiments from both sources allows one to determine the effect in the target. More interestingly, we consider the more stringent scenario where only certain experiments can be performed in each of the domains. For instance, assume that it is only possible to conduct experiments over {Z2 } in πa and over {Z1 } in πb . Obviously, R will not be z-transported individually from these domains, but it turns out that taking both sets of experiments into account, R = z2 P (a) (y|do(z2 ))P (b) (z2 |x, do(Z1 )), which fully uses all pieces of experimental data available. In other words, we were able to decompose R into subrelations such that each one is separately z-transportable from the source domains, and so is the desired target quantity. 
Interestingly, it is the case in this example that if the domains in which experiments were conducted were reversed (i.e., {Z1 } randomized in πa , {Z2 } in πb ), it will not be possible to transport R by any method – the target relation is simply not computable from the available data (formally shown later on). This illustrates some of the subtle issues mz-transportability entails, which cannot be immediately cast in terms of previous instances of the transportability class. In the sequel, we try to better understand some of these issues, and we develop sufficient or (specific) necessary conditions for deciding special transportability for arbitrary collection of selection diagrams and set of experiments. We further construct an algorithm for deciding mz-transportability of joint causal effects and returning the correct transport formula whenever this is possible. We also consider issues relative to the variance of the estimand aiming for improving sample efficiency and increasing statistical power. 3 Graphical conditions for mz-transportability The basic semantical framework in our analysis rests on structural causal models as defined in [9, pp. 205], also called data-generating models. In the structural causal framework [9, Ch. 7], actions are modifications of functional relationships, and each action do(x) on a causal model M produces 4 A typical example is whether we can estimate the effect of cholesterol (X) on heart failure (Y ) by experiments on diet (Z1 ) given that cholesterol levels cannot be randomized [20]. 3 Z2 Z1 Z1 X Z1 X X Z2 W Z2 Y (a) Z2 Z1 Y X Z3 W U U Y Y (b) (c) Z3 (d) Figure 2: Selection diagrams illustrating impossibility of estimating R = P ∗ (y|do(x)) through individual transportability from πa and πb even when Z = {Z1 , Z2 } (for (a, b), (c, d))). If we assume, more stringently, availability of experiments Za = {Z2 }, Zb = {Z1 }, Z∗ = {}, a more elaborated analysis can show that R can be estimated combining different pieces from both domains. a new model Mx = U, V, Fx , P (U) , where Fx is obtained after replacing fX ∈ F for every X ∈ X with a new function that outputs a constant value x given by do(x). 5 We follow the conventions given in [9]. We denote variables by capital letters and their realized values by small letters. Similarly, sets of variables will be denoted by bold capital letters, sets of realized values by bold small letters. We use the typical graph-theoretic terminology with the corresponding abbreviations P a(Y)G and An(Y)G , which will denote respectively the set of observable parents and ancestors of the node set Y in G. A graph GY will denote the induced subgraph G containing nodes in Y and all arrows between such nodes. Finally, GXZ stands for the edge subgraph of G where all incoming arrows into X and all outgoing arrows from Z are removed. Key to the analysis of transportability is the notion of “identifiability,” defined below, which expresses the requirement that causal effects are computable from a combination of data P and assumptions embodied in a causal graph G. Definition 1 (Causal Effects Identifiability (Pearl, 2000, pp. 77)). The causal effect of an action do(x) on a set of variables Y such that Y ∩ X = ∅ is said to be identifiable from P in G if Px (y) is uniquely computable from P (V) in any model that induces G. Causal models and their induced graphs are usually associated with one particular domain (also called setting, study, population, or environment). 
In ordinary transportability, this representation was extended to capture properties of two domains simultaneously. This is possible if we assume that the structural equations share the same set of arguments, though the functional forms of the equations may vary arbitrarily [11]. 6 Definition 2 (Selection Diagram). Let M, M ∗ be a pair of structural causal models [9, pp. 205] relative to domains π, π ∗ , sharing a causal diagram G. M, M ∗ is said to induce a selection diagram D if D is constructed as follows: 1. Every edge in G is also an edge in D; 2. D contains an extra edge Si → Vi whenever there might exist a discrepancy fi = fi∗ or P (Ui ) = P ∗ (Ui ) between M and M ∗ . In words, the S-variables locate the mechanisms where structural discrepancies between the two domains are suspected to take place.7 Alternatively, the absence of a selection node pointing to a variable represents the assumption that the mechanism responsible for assigning value to that variable is identical in both domains. 5 The results presented here are also valid in other formalisms for causality based on potential outcomes. As discussed in the reference, the assumption of no structural changes between domains can be relaxed, but some structural assumptions regarding the discrepancies between domains must still hold. 7 Transportability assumes that enough structural knowledge about both domains is known in order to substantiate the production of their respective causal diagrams. In the absence of such knowledge, causal discovery algorithms might be used to infer the diagrams from data [8, 9]. 6 4 Armed with the concept of identifiability and selection diagrams, mz-transportability of causal effects can be defined as follows: Definition 3 (mz-Transportability). Let D = {D(1) , ..., D(n) } be a collection of selection diagrams relative to source domains Π = {π1 , ..., πn }, and target domain π ∗ , respectively, and Zi (and Z∗ ) i be the variables in which experiments can be conducted in domain πi (and π ∗ ). Let P i , Iz be i i the pair of observational and interventional distributions of πi , where Iz = Z ⊆Zi P (v|do(z )), ∗ and in an analogous manner, P ∗ , Iz be the observational and interventional distributions of π ∗ . ∗ ∗ The causal effect R = Px (y|w) is said to be mz-transportable from Π to π ∗ in D if Px (y|w) is i ∗ uniquely computable from i=1,...,n P i , Iz ∪ P ∗ , Iz in any model that induces D. ∗ i The requirement that R is uniquely computable from P ∗ , Iz and P i , Iz from all sources has a syntactic image in the causal calculus, which is captured by the following sufficient condition. Theorem 1. Let D = {D(1) , ..., D(n) } be a collection of selection diagrams relative to source domains Π = {π1 , ..., πn }, and target domain π ∗ , respectively, and Si represents the collection of i ∗ S-variables in the selection diagram D(i) . Let { P i , Iz } and P ∗ , Iz be respectively the pairs of observational and interventional distributions in the sources Π and target π ∗ . The relation R = P ∗ (y|do(x), w) is mz-transportable from Π to π ∗ in D if the expression P (y|do(x), w, S1 , ..., Sn ) is reducible, using the rules of the causal calculus, to an expression in which (1) do-operators that ∗ i apply to subsets of Iz have no Si -variables or (2) do-operators apply only to subsets of Iz . 
This result provides a powerful way to syntactically establish mz-transportability, but it is not immediately obvious whether a sequence of applications of the rules of causal calculus that achieves the reduction required by the theorem exists, and even if such sequence exists, it is not obvious how to obtain it. For concreteness, we illustrate this result using the selection diagrams in Fig. 2(a,b). Corollary 1. P ∗ (y|do(x)) is mz-transportable in Fig. 2(a,b) with Za = {Z2 } and Zb = {Z1 }. Proof. The goal is to show that R = P ∗ (y|do(x)) is mz-transportable from {πa , πb } to π ∗ using experiments conducted over {Z2 } in πa and {Z1 } in πb . Note that naively trying to transport R from each of the domains individually is not possible, but R can be decomposed as follows: P ∗ (y|do(x)) = P ∗ (y|do(x), do(Z1 )) (2) P ∗ (y|do(x), do(Z1 ), z2 )P ∗ (z2 |do(x), do(Z1 )) (3) P ∗ (y|do(x), do(Z1 ), do(z2 ))P ∗ (z2 |do(x), do(Z1 )), = (4) z2 = z2 where Eq. (2) follows by rule 3 of the causal calculus since (Z1 ⊥ Y |X)DX,Z holds, we con⊥ 1 dition on Z2 in Eq. (3), and Eq. (4) follows by rule 2 of the causal calculus since (Z2 ⊥ ⊥ Y |X, Z1 )DX,Z ,Z , where D is the diagram in π ∗ (despite the location of the S-nodes). 1 2 Now we can rewrite the first term of Eq. (4) as indicated by the Theorem (and suggested by Def. 2): P ∗ (y|do(x), do(Z1 ), do(z2 )) = P (y|do(x), do(Z1 ), do(z2 ), Sa , Sb ) (5) = P (y|do(x), do(Z1 ), do(z2 ), Sb ) (6) = P (y|do(z2 ), Sb ) (7) = P (a) (y|do(z2 )), (8) where Eq. (5) follows from the theorem (and the definition of selection diagram), Eq. (6) follows , Eq. (7) follows from rule from rule 1 of the causal calculus since (Sa ⊥ Y |Z1 , Z2 , X)D(a) ⊥ Z1 ,Z2 ,X 3 of the causal calculus since (Z1 , X ⊥ Y |Z2 )D(a) ⊥ . Note that this equation matches with the Z1 ,Z2 ,X a syntactic goal of Theorem 1 since we have precisely do(z2 ) separated from Sa (and Z2 ∈ Iz ); so, we can rewrite the expression which results in Eq. (8) by the definition of selection diagram. Finally, we can rewrite the second term of Eq. (4) as follows: P ∗ (z2 |do(x), do(Z1 )) = P (z2 |do(x), do(Z1 ), Sa , Sb ) = P (z2 |do(x), do(Z1 ), Sa ) = P (z2 |x, do(Z1 ), Sa ) = P (b) (z2 |x, do(Z1 )), 5 (9) (10) (11) (12) where Eq. (9) follows from the theorem (and the definition of selection diagram), Eq. (10) follows from rule 1 of the causal calculus since (Sb ⊥ Z2 |Z1 , X)D(b) , Eq. (11) follows from rule 2 of ⊥ Z1 ,X the causal calculus since (X ⊥ Z2 |Z1 )D(b) . Note that this equation matches the condition of the ⊥ Z1 X theorem, separate do(Z1 ) from Sb (i.e., experiments over Z1 can be used since they are available in πb ), so we can rewrite Eq. (12) using the definition of selection diagrams, the corollary follows. The next condition for mz-transportability is more visible than Theorem 1 (albeit weaker), which also demonstrates the challenge of relating mz-transportability to other types of transportability. Corollary 2. R = P ∗ (y|do(x)) is mz-transportable in D if there exists Zi ⊆ Zi such that all paths from Zi to Y are blocked by X, (Si ⊥ Y|X, Zi )D(i) , and R is computable from do(Zi ). ⊥ X,Z i Remarkably, randomizing Z2 when applying Corollary 1 was instrumental to yield transportability in the previous example, despite the fact that the directed paths from Z2 to Y were not blocked by X, which suggests how different this transportability is from z-identifiability. 
So, it is not immediately obvious how to combine the topological relations of Zi ’s with X and Y in order to create a general condition for mz-transportability, the relationship between the distributions in the different domains can get relatively intricate, but we defer this discussion for now and consider a simpler case. It is not usually trivial to pursue a derivation of mz-transportability in causal calculus, and next we show an example in which such derivation does not even exist. Consider again the diagrams in Fig. 2(a,b), and assume that randomized experiments are available over {Z1 } in πa and {Z2 } in πb . Theorem 2. P ∗ (y|do(x)) is not mz-transportable in Fig. 2(a,b) with Za = {Z1 } and Zb = {Z2 }. Proof. Formally, we need to display two models M1 , M2 such that the following relations hold (as implied by Def. 3):  (a) (a)  PM1 (Z1 , X, Z2 , Y ) = PM2 (Z1 , X, Z2 , Y ),   (b)   P (Z1 , X, Z2 , Y ) = P (b) (Z1 , X, Z2 , Y ),  M1 M2 (a) (a) (13) PM1 (X, Z2 , Y |do(Z1 )) = PM2 (X, Z2 , Y |do(Z1 )),  (b)   P (Z , X, Y |do(Z )) = P (b) (Z , X, Y |do(Z )),  M 2 1 2  M2  ∗1 1 ∗ PM1 (Z1 , X, Z2 , Y ) = PM2 (Z1 , X, Z2 , Y ), for all values of Z1 , X, Z2 , and Y , and also, ∗ ∗ PM1 (Y |do(X)) = PM2 (Y |do(X)), (14) for some value of X and Y . Let V be the set of observable variables and U be the set of unobservable variables in D. Let us assume that all variables in U ∪ V are binary. Let U1 , U2 ∈ U be the common causes of Z1 and X and Z2 , respectively; let U3 , U4 , U5 ∈ U be a random disturbance exclusive to Z1 , Z2 , and Y , respectively, and U6 ∈ U be an extra random disturbance exclusive to Z2 , and U7 , U8 ∈ U to Y . Let Sa and Sb index the model in the following way: the tuples Sa = 1, Sb = 0 , Sa = 0, Sb = 1 , Sa = 0, Sb = 0 represent domains πa , πb , and π ∗ , respectively. Define the two models as follows:    Z1 = U1 ⊕ U2 ⊕ U3 ∧ Sa  Z1 = U1 ⊕ U2 ⊕ U3 ∧ Sa     X=U X = Z1 ⊕ U1 1 M1 = M2 =  Z2 = (X ⊕ U2 ⊕ (U4 ∧ Sa )) ∨ U6  Z2 = U4 ∧ Sa ∧ U6 ⊕ U6     Y = (Z2 ∧ U5 ) ⊕ (U5 ∧ U7 ) ⊕ (Sb ∧ U8 ) Y = (Z2 ∧ U5 ) ⊕ (U5 ∧ U7 ) ⊕ (Sb ∧ U8 ) where ⊕ represents the exclusive or function. Both models agree in respect to P (U), which is defined as P (Ui ) = 1/2, i = 1, ..., 8. It is not difficult to evaluate these models and note that the constraints given in Eqs. (13) and (14) are satisfied (including positivity), the theorem follows. 4 Algorithm for computing mz-transportability In this section, we build on previous analyses of identifiability [7, 21, 22, 23] in order to obtain a mechanic procedure in which a collection of selection diagrams and experimental data is inputted, and the procedure returns a transport formula whenever it is able to produce one. More specifically, 6 PROCEDURE TRmz (y, x, P, I, S, W, D) INPUT: x, y: value assignments; P: local distribution relative to domain S (S = 0 indexes π ∗ ) and active experiments I; W: weighting scheme; D: backbone of selection diagram; Si : selection nodes in πi (S0 = ∅ (i) relative to π ∗ ); [The following set and distributions are globally defined: Zi , P ∗ , PZi .] (i) ∗ ∗ OUTPUT: Px (y) in terms of P ∗ , PZ , PZi or F AIL(D, C0 ). P 1 if x = ∅, return V\Y P. P 2 if V \ An(Y)D = ∅, return TRmz (y, x ∩ An(Y)D , V\An(Y)D P, I, S, W, DAn(Y) ). 3 set W = (V \ X) \ An(Y)DX . if W = ∅, return TRmz (y, x ∪ w, P, I, S, W, D). Q P 4 if C(D \ X) = {C0 , C1 , ..., Ck }, return V\{Y,X} i TRmz (ci , v \ ci , P, I, S, W, D). 5 if C(D \ X) = {C0 }, 6 if C(D) = {D}, Q P P 7 if C0 ∈ C(D), return i|Vi ∈C0 V\V (i) P/ V\V (i−1) P. 
4 Algorithm for computing mz-transportability

In this section, we build on previous analyses of identifiability [7, 21, 22, 23] to obtain a mechanical procedure that takes as input a collection of selection diagrams and experimental data, and returns a transport formula whenever it is able to produce one. More specifically, our algorithm, called TRmz (see Fig. 3), is based on the C-component decomposition for identification of causal effects [22, 23] (and on a version of the identification algorithm called ID).

PROCEDURE TRmz(y, x, P, I, S, W, D)
INPUT: x, y: value assignments; P: local distribution relative to domain S (S = 0 indexes π*) and active experiments I; W: weighting scheme; D: backbone of selection diagram; Si: selection nodes in πi (S0 = ∅ relative to π*). [The following set and distributions are globally defined: Zi, P*, P^(i)_{Zi}.]
OUTPUT: P*_x(y) in terms of P*, P*_Z, P^(i)_{Zi}, or FAIL(D, C0).
 1  if x = ∅, return Σ_{V\Y} P.
 2  if V \ An(Y)_D ≠ ∅, return TRmz(y, x ∩ An(Y)_D, Σ_{V\An(Y)_D} P, I, S, W, D_{An(Y)}).
 3  set W = (V \ X) \ An(Y)_{D_X̄}; if W ≠ ∅, return TRmz(y, x ∪ w, P, I, S, W, D).
 4  if C(D \ X) = {C0, C1, ..., Ck}, return Σ_{V\{Y,X}} Π_i TRmz(ci, v \ ci, P, I, S, W, D).
 5  if C(D \ X) = {C0},
 6    if C(D) ≠ {D},
 7      if C0 ∈ C(D), return Π_{i | Vi ∈ C0} Σ_{V\V_D^(i)} P / Σ_{V\V_D^(i-1)} P.
 8      if (∃C′)(C0 ⊂ C′ ∈ C(D)), for {i | Vi ∈ C′} set κi = κi ∪ v_D^(i-1) \ C′, and
        return TRmz(y, x ∩ C′, Π_{i | Vi ∈ C′} P(Vi | V_D^(i-1) ∩ C′, κi), I, S, W, C′).
 9    else, if I = ∅, for i = 0, ..., |D|,
10      if (Si ⊥⊥ Y | X)_{D_X̄^(i)} ∧ (Zi ∩ X ≠ ∅), Ei = TRmz(y, x \ zi, P, Zi ∩ X, i, W, D \ {Zi ∩ X}).
11    if |E| > 0, return Σ_{i=1}^{|E|} wi Ei.
12    else, FAIL(D, C0).

Figure 3: Modified version of the identification algorithm capable of recognizing mz-transportability.

The rationale behind TRmz is to apply Tian's factorization and decompose the target relation into smaller, more manageable sub-expressions, and then to try to evaluate whether each sub-expression can be computed in the target domain. Whenever this evaluation fails, TRmz tries to use the experiments available from the target and, if possible, from the sources; this essentially implements the declarative condition delineated in Theorem 1. Next, we consider the soundness of the algorithm.

Theorem 3 (soundness). Whenever TRmz returns an expression for P*_x(y), it is correct.

In the sequel, we demonstrate how the algorithm works through the mz-transportability of Q = P*(y|do(x)) in Fig. 2(c,d) with Z* = {Z1}, Za = {Z2}, and Zb = {Z1}. Since (V \ X) \ An(Y)_{D_X̄} = {Z2}, TRmz invokes line 3 with {Z2} ∪ {X} as the interventional set. The new call triggers line 4, with C(D \ {X, Z2}) = {C0, C1, C2, C3}, where C0 = D_{Z1}, C1 = D_{Z3}, C2 = D_U, and C3 = D_{W,Y}, and we try to mz-transport individually Q0 = P*_{x,z2,z3,u,w,y}(z1), Q1 = P*_{x,z1,z2,u,w,y}(z3), Q2 = P*_{x,z1,z2,z3,w,y}(u), and Q3 = P*_{x,z1,z2,z3,u}(w, y). Thus the original problem reduces to evaluating the equivalent expression Σ_{z1,z3,u,w} P*_{x,z2,z3,u,w,y}(z1) P*_{x,z1,z2,u,w,y}(z3) P*_{x,z1,z2,z3,w,y}(u) P*_{x,z1,z2,z3,u}(w, y).

First, TRmz evaluates the expression Q0 and triggers line 2, noting that all nodes can be ignored since they are not ancestors of {Z1}, which implies after line 1 that P*_{x,z2,z3,u,w,y}(z1) = P*(z1).

Second, TRmz evaluates the expression Q1, triggering line 2, which implies that P*_{x,z1,z2,u,w,y}(z3) = P*_{x,z1,z2}(z3) with induced subgraph D1 = D_{X,Z1,Z2,Z3}. TRmz goes to line 5, where in the local call C(D \ {X, Z1, Z2}) = {D_{Z3}}. It thus proceeds to line 6, testing whether C(D \ {X, Z1, Z2}) is different from D1, which is false. In this call ordinary identifiability would fail, but TRmz proceeds to line 9. The goal of this line is to test whether some experiment can help in computing Q1. Here πa immediately fails the test in line 10, but πb and π* succeed, which means that experiments in these domains may eventually help; the new call is P^(i)_{x,z2}(z3)_{D\Z1}, for i = {b, ∗}, with induced graph D1 = D_{X,Z2,Z3}. Finally, TRmz triggers line 8, since X is not part of Z3's component in D1 (or, Z3 ∈ C′ = {Z2, Z3}); line 2 is then triggered since Z2 is no longer an ancestor of Z3 in D1, and line 1 is triggered since the interventional set is empty in this local call, so P*_{x,z1,z2}(z3) = Σ_{Z2} P^(i)_{z1}(z3 | x, Z2) P^(i)_{z1}(Z2), for i = {b, ∗}.
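The last expression makes explicit how Q1 is assembled from quantities that are estimable under do(Z1): the conditional P_{z1}(z3 | x, Z2) and the marginal P_{z1}(Z2). The following sketch evaluates Σ_{Z2} P_{z1}(z3 | x, Z2) P_{z1}(Z2) for binary Z2 and Z3 at one fixed value of x; the numeric tables are hypothetical stand-ins for data collected under the Z1-experiment, not values taken from the paper.

import numpy as np

# P_{z1}(Z2): marginal of Z2 under the Z1-experiment (binary Z2).
p_z2 = np.array([0.6, 0.4])

# P_{z1}(Z3 | x, Z2): rows indexed by z2, columns by z3, for one fixed x.
p_z3_given_x_z2 = np.array([[0.8, 0.2],
                            [0.3, 0.7]])

# Q1 estimand: sum over Z2 of P_{z1}(z3 | x, Z2) P_{z1}(Z2).
q1 = p_z2 @ p_z3_given_x_z2
print(q1)  # a distribution over Z3, here [0.6*0.8 + 0.4*0.3, 0.6*0.2 + 0.4*0.7]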
Third, evaluating the expression Q2, TRmz goes to line 2, which implies that P*_{x,z1,z2,z3,w,y}(u) = P*_{x,z1,z2,z3,w}(u) with induced subgraph D2 = D_{X,Z1,Z2,Z3,W,U}. TRmz goes to line 5, and in this local call C(D \ {X, Z1, Z2, Z3, W}) = {D_U}; the test in line 6 succeeds, since there are more components in D, so line 8 is triggered, since W is not part of U's component in D2. The algorithm sets P*_{x,z1,z2,z3,w}(u) = P*_{x,z1,z2,z3}(u)_{D2 | W} (and updates the working distribution); note that in this call ordinary identifiability would fail, since the nodes are in the same C-component and the test in line 6 fails. But TRmz proceeds to line 9, trying to find experiments that can help in computing Q2. In this case πb cannot help, but πa and π* perhaps can; new calls are launched for computing P^(a)_{x,z1,z3}(u)_{D2\Z2 | W} relative to πa, and P*_{x,z2,z3}(u)_{D2\Z1 | W} relative to π*, with the corresponding data structures set. In πa, the algorithm triggers line 7, which yields P^(a)_{x,z1,z3}(u)_{D2\Z2 | W} = P^(a)_{z2}(u | w, z3, x, z1), and a somewhat more involved analysis for π* yields (after simplification)

  P*_{x,z2,z3}(u)_{D2\Z1 | W} = Σ_{Z2} P*_{z1}(u | w, z3, x, Z2) P*_{z1}(z3 | x, Z2) P*_{z1}(Z2) / Σ_{Z2} P*_{z1}(z3 | x, Z2) P*_{z1}(Z2).

Fourth, TRmz evaluates the expression Q3 and triggers line 5, with C(D \ {X, Z1, Z2, Z3, U}) = D_{W,Y}. In turn, both tests at lines 6 and 7 succeed, and the procedure returns P*_{x,z1,z2,z3,u}(w, y) = P*(w | z3, x, z1, z2) P*(y | w, x, z1, z2, z3, u). The composition of the returns of these calls generates the following expression:

  P*_x(y) = Σ_{z1,z3,w,u} P*(z1)
            × [ w1^(1) Σ_{Z2} P^(b)_{z1}(z3 | x, Z2) P^(b)_{z1}(Z2) + w1^(2) Σ_{Z2} P*_{z1}(z3 | x, Z2) P*_{z1}(Z2) ]
            × [ w2^(1) ( Σ_{Z2} P*_{z1}(u | w, z3, x, Z2) P*_{z1}(z3 | x, Z2) P*_{z1}(Z2) / Σ_{Z2} P*_{z1}(z3 | x, Z2) P*_{z1}(Z2) )
                + w2^(2) P^(a)_{z2}(u | w, z3, x, z1) ]
            × P*(w | x, z1, z2, z3) P*(y | x, z1, z2, z3, w, u)                                                          (15)

where w_i^(k) represents the weight for each factor in estimand k (i = 1, ..., nk), and nk is the number of feasible estimands of k. Eq. (15) depicts a powerful way to estimate P*(y|do(x)) in the target domain, and depending on the weighting choice a different estimand will be entailed. For instance, one might use an analogue of inverse-variance weighting, which sets the weights to the normalized inverses of the variances (i.e., w_i^(k) = σ_i^{-2} / Σ_{j=1}^{nk} σ_j^{-2}, where σ_j^2 is the variance of the jth component of estimand k). Our strategy resembles the approach taken in meta-analysis [4], although the latter usually disregards the intricacies of the relationships between variables, and so produces a statistically less powerful estimand. Our method leverages these non-trivial and highly structured relationships, as exemplified in Eq. (15), and yields an estimand with lower variance that is statistically more powerful.
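As a small illustration of the weighting step just described, the sketch below pools two feasible estimands of the same factor using normalized inverse-variance weights; the point estimates and variances are hypothetical numbers chosen for the example, not quantities computed in the paper.

import numpy as np

def inverse_variance_weights(variances):
    # w_i = sigma_i^{-2} / sum_j sigma_j^{-2}
    inv = 1.0 / np.asarray(variances, dtype=float)
    return inv / inv.sum()

estimates = np.array([0.42, 0.47])    # hypothetical estimates of one factor from two domains
variances = np.array([0.010, 0.025])  # hypothetical sampling variances of those estimates
w = inverse_variance_weights(variances)
pooled = float(w @ estimates)
print(w, pooled)  # the weights favor the lower-variance estimand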
5 Conclusions

In this paper, we treat a special type of transportability in which experiments can be conducted only over limited sets of variables in the sources and target domains, and the goal is to infer whether a certain effect can be estimated in the target using the information scattered throughout the domains. We provide general sufficient graphical conditions for transportability based on the causal calculus, along with a necessary condition for a specific scenario, which should be generalized to arbitrary structures. We further provide a procedure for computing transportability, that is, a procedure that generates a formula for fusing the available observational and experimental data to synthesize an estimate of the desired causal effects. Our algorithm also allows for generic weighting schemes, which generalizes standard statistical procedures and leads to the construction of statistically more powerful estimands.

Acknowledgments

The work of Judea Pearl and Elias Bareinboim was supported in part by grants from NSF (IIS-1249822, IIS-1302448) and ONR (N00014-13-1-0153, N00014-10-1-0933). The work of Sanghack Lee and Vasant Honavar was partially completed while they were with the Department of Computer Science at Iowa State University. The work of Vasant Honavar while at the National Science Foundation (NSF) was supported by the NSF. The work of Sanghack Lee was supported in part by a grant from NSF (IIS-0711356). Any opinions, findings, and conclusions contained in this article are those of the authors and do not necessarily reflect the views of the sponsors.

References

[1] J. Pearl and E. Bareinboim. Transportability of causal and statistical relations: A formal approach. In W. Burgard and D. Roth, editors, Proceedings of the Twenty-Fifth National Conference on Artificial Intelligence, pages 247–254. AAAI Press, Menlo Park, CA, 2011.
[2] D. Campbell and J. Stanley. Experimental and Quasi-Experimental Designs for Research. Wadsworth Publishing, Chicago, 1963.
[3] C. Manski. Identification for Prediction and Decision. Harvard University Press, Cambridge, MA, 2007.
[4] L. V. Hedges and I. Olkin. Statistical Methods for Meta-Analysis. Academic Press, 1985.
[5] W. R. Shadish, T. D. Cook, and D. T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifflin, Boston, second edition, 2002.
[6] S. Morgan and C. Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research (Analytical Methods for Social Research). Cambridge University Press, New York, NY, 2007.
[7] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–710, 1995.
[8] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.
[9] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000; 2nd edition, 2009.
[10] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[11] E. Bareinboim and J. Pearl. Transportability of causal effects: Completeness results. In J. Hoffmann and B. Selman, editors, Proceedings of the Twenty-Sixth National Conference on Artificial Intelligence, pages 698–704. AAAI Press, Menlo Park, CA, 2012.
[12] E. Bareinboim and J. Pearl. Causal transportability with limited experiments. In M. desJardins and M. Littman, editors, Proceedings of the Twenty-Seventh National Conference on Artificial Intelligence, pages 95–101. AAAI Press, Menlo Park, CA, 2013.
[13] S. Lee and V. Honavar. Causal transportability of experiments on controllable subsets of variables: z-transportability. In A. Nicholson and P. Smyth, editors, Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), pages 361–370. AUAI Press, 2013.
[14] E. Bareinboim and J. Pearl. Meta-transportability of causal effects: A formal approach. In C. Carvalho and P. Ravikumar, editors, Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 135–143. JMLR W&CP 31, 2013.
[15] S. Lee and V. Honavar. m-transportability: Transportability of a causal effect from multiple environments. In M. desJardins and M. Littman, editors, Proceedings of the Twenty-Seventh National Conference on Artificial Intelligence, pages 583–590. AAAI Press, Menlo Park, CA, 2013.
[16] H. Daumé III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.
[17] A. J. Storkey. When training and test sets are different: characterising learning transfer. In J. Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, editors, Dataset Shift in Machine Learning, pages 3–28. MIT Press, Cambridge, MA, 2009.
[18] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1255–1262, New York, NY, USA, 2012. Omnipress.
[19] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning (ICML). JMLR W&CP, volume 28, 2013.
[20] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. In N. de Freitas and K. Murphy, editors, Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI), pages 113–120. AUAI Press, 2012.
[21] M. Kuroki and M. Miyakawa. Identifiability criteria for causal effects of joint interventions. Journal of the Japan Statistical Society, 29:105–117, 1999.
[22] J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567–573. AAAI Press/The MIT Press, Menlo Park, CA, 2002.
[23] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 1219–1226. AAAI Press, Menlo Park, CA, 2006.

same-paper 2 0.81969178 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch

Abstract: In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. We call our method q-OCSVM, as it can be used to estimate q quantiles of a highdimensional distribution. For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently. We prove the correctness of our method and present empirical results that demonstrate its superiority over existing methods. 1

3 0.7411561 201 nips-2013-Multi-Task Bayesian Optimization

Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams

Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1

4 0.73761022 173 nips-2013-Least Informative Dimensions

Author: Fabian Sinz, Anna Stockl, Jan Grewe, Jan Benda

Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1

5 0.73530006 350 nips-2013-Wavelets on Graphs via Deep Learning

Author: Raif Rustamov, Leonidas Guibas

Abstract: An increasing number of applications require processing of signals defined on weighted graphs. While wavelets provide a flexible tool for signal processing in the classical setting of regular domains, the existing graph wavelet constructions are less flexible – they are guided solely by the structure of the underlying graph and do not take directly into consideration the particular class of signals to be processed. This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. The training is unsupervised, and is conducted similarly to the greedy pre-training of a stack of auto-encoders. After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data. 1

6 0.73152012 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

7 0.7305392 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

8 0.73037487 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

9 0.72852379 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

10 0.72846162 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

11 0.72755265 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

12 0.72690904 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

13 0.72670668 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

14 0.72637743 161 nips-2013-Learning Stochastic Inverses

15 0.72560942 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields

16 0.72449577 251 nips-2013-Predicting Parameters in Deep Learning

17 0.72425413 294 nips-2013-Similarity Component Analysis

18 0.72406667 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

19 0.72402811 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

20 0.72368413 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking