nips nips2009 nips2009-170 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arthur Gretton, Peter Spirtes, Robert E. Tillman
Abstract: The recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. However, for certain distributions, e.g. linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. We introduce weakly additive noise models, which extends this framework to cases where the additive noise model is invertible and when additive noise is not present. We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Nonlinear directed acyclic structure learning with weakly additive noise models. Peter Spirtes, Arthur Gretton, Robert E. Tillman [sent-1, score-0.921]
2 Abstract The recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. [sent-7, score-0.776]
3 However, for certain distributions, e.g. linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. [sent-10, score-0.644]
4 We introduce weakly additive noise models, which extends this framework to cases where the additive noise model is invertible and when additive noise is not present. [sent-11, score-1.895]
5 We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. [sent-12, score-0.852]
6 This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. [sent-13, score-0.589]
7 1 Introduction Learning probabilistic graphical models from data serves two primary purposes: (i) finding compact representations of probability distributions to make inference efficient and (ii) modeling unknown data generating mechanisms and predicting causal relationships. [sent-14, score-0.159]
8 Until recently, most constraint-based and score-based algorithms for learning directed graphical models from continuous data required assuming relationships between variables are linear with Gaussian noise. [sent-15, score-0.153]
9 While this assumption may be appropriate in many contexts, there are well known contexts, such as fMRI images, where variables have nonlinear dependencies and data do not tend towards Gaussianity. [sent-16, score-0.082]
10 A second major limitation of the traditional algorithms is that they cannot identify a unique structure; they reduce the set of possible structures to an equivalence class whose members entail the same Markov properties. [sent-17, score-0.084]
11 The recently proposed additive noise model [1] for structure learning addresses both limitations; by taking advantage of observed nonlinearity and non-Gaussianity, a unique directed acyclic structure can be identified in many contexts. [sent-18, score-0.801]
12 However, this approach has two limitations: (i) for certain distributions, e.g. linear Gaussians, the model is invertible and not useful for structure learning; (ii) it was originally proposed for two variables with a multivariate extension that requires enumerating all possible DAGs, which is super-exponential in the number of variables. [sent-21, score-0.08]
13 In this paper, we address the limitations of the additive noise model. [sent-22, score-0.564]
14 We introduce weakly additive noise models, which have the advantages of additive noise models, but are still useful when the additive noise model is invertible and in most cases when additive noise is not present. [sent-23, score-2.459]
15 Weakly additive noise models allow us to express greater uncertainty about the data generating mechanism, but can still identify a unique structure or a smaller equivalence class in most cases. [sent-24, score-0.706]
16 We also provide an algorithm for learning an equivalence class for such models from data that is more computationally efficient in the case of more than two variables. [sent-25, score-0.109]
17 Section 2 reviews the appropriate background; section 3 introduces weakly additive noise models; section 4 describes our learning algorithm; section 5 discusses some related research; section 6 presents some experimental results; finally, section 7 offers conclusions. [sent-26, score-0.687]
18 2 Background Let G = ⟨V, E⟩ be a directed acyclic graph (DAG), where V denotes the set of vertices and Eij ∈ E denotes a directed edge Vi → Vj . [sent-28, score-0.425]
19 The degree of Vi in G is the number of edges with an endpoint at Vi . [sent-31, score-0.057]
20 P is faithful to G if every conditional independence true in P is entailed by the Markov factorization P(V) = ∏_{Vi ∈ V} P(Vi | Pa^G_Vi). [sent-35, score-0.096]
21 A partially directed acyclic graph (PDAG) H for G is a mixed graph, i.e. [sent-36, score-0.266]
22 a graph consisting of directed and undirected edges, representing all DAGs Markov equivalent to G. [sent-38, score-0.175]
23 If Vi → Vj is a directed edge in H, then all DAGs Markov equivalent to G have this directed edge; if Vi − Vj is an undirected edge in H, then some DAGs that are Markov equivalent to G have the directed edge Vi → Vj while others have the directed edge Vi ← Vj . [sent-41, score-0.779]
24 The PC algorithm is a well known constraint-based, or conditional independence based, structure learning algorithm. [sent-42, score-0.096]
25 PC learns the correct PDAG in the large sample limit when the Markov, faithfulness, and causal sufficiency (that there are no unmeasured common causes of two or more measured variables) assumptions hold [2]. [sent-46, score-0.101]
26 The partial correlation based Fisher Z-transformation test, which assumes linear Gaussian distributions, is used for conditional independence testing with continuous variables. [sent-47, score-0.096]
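For reference, a minimal sketch of a partial-correlation Fisher Z conditional independence test of the kind PC uses with continuous data; the function name, interface, and the 0.05 level are illustrative choices for this sketch, not settings taken from the paper.

```python
import numpy as np
from scipy import stats

def fisher_z_ci_test(data, i, j, cond=(), alpha=0.05):
    """Test X_i independent of X_j given X_cond, assuming linear Gaussian data.

    data: (n, d) array of samples; i, j: column indices; cond: conditioning columns.
    Returns (p_value, independent) using the Fisher Z approximation.
    """
    n = data.shape[0]
    sub = np.corrcoef(data[:, [i, j] + list(cond)], rowvar=False)
    # Partial correlation of X_i and X_j given X_cond via the precision matrix.
    prec = np.linalg.pinv(sub)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.9999, 0.9999)
    # Fisher Z transform: approximately standard normal under independence.
    z = np.sqrt(n - len(cond) - 3) * np.arctanh(r)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value, p_value > alpha
```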
27 The recently proposed additive noise model approach to structure learning [1] assumes only that each variable can be represented as a (possibly nonlinear) function f of its parents plus additive noise ǫ with some arbitrary distribution, and that the noise components are mutually independent, i.e. P(ǫ1 , . . . , ǫn ) = ∏_{i=1}^{n} P(ǫi ). [sent-50, score-1.34]
28 Suppose X → Y is the true DAG, X = ǫX , Y = sin(πX) + ǫY , ǫX ∼ Unif(−1, 1), and ǫY ∼ Unif(−1, 1). [sent-56, score-0.138]
29 If we regress Y on X (nonparametrically), the forward model (figure 1a), and [sent-57, score-0.087]
30 regress X on Y , the backward model (figure 1b), we observe that the estimated residual ǫY is independent of X in the forward case, while the estimated residual ǫX is not independent of Y in the backward case. (Figure 1: Nonparametric regressions with data overlaid for (a) Y regressed on X, (b) X regressed on Y , (c) Z regressed on X, and (d) X regressed on Z.) [sent-73, score-0.275]
31 This provides a criterion for distinguishing X → Y from X ← Y in many cases, but there are counterexamples such as the linear Gaussian case, where the forward model is invertible so we find both ǫY ⊥ X and ǫX ⊥ Y . [sent-74, score-0.112]
32 [1, 5] show, however, that whenever f is nonlinear, the forward model is noninvertible, and when f is linear, the forward model is only invertible when ǫ is Gaussian and in a few other special cases. [sent-75, score-0.144]
33 For example, for X → Y → Z with X = ǫX , Y = X³ + ǫY , Z = Y³ + ǫZ , ǫX ∼ Unif(−1, 1), ǫY ∼ Unif(−1, 1), and ǫZ ∼ Unif(0, 1), observing only X and Z (figures 1c and 1d) causes us to reject both the forward and backward models. [sent-78, score-0.239]
34 To test whether a DAG is compatible with the data, we regress each variable on its parents and test whether the resulting residuals are mutually independent. [sent-80, score-0.087]
35 Since we do not assume linearity or Gaussianity in this framework, a sufficiently powerful nonparametric independence test must be used. [sent-85, score-0.062]
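To make the two-variable procedure concrete, the sketch below fits nonparametric regressions in both directions on the sin(πX) example and compares a biased HSIC statistic between regressor and residuals; the use of scikit-learn's KernelRidge, the RBF kernel, and the stated bandwidths and ridge penalty are assumptions of this illustration, not the regression or test used in the paper.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def rbf_gram(x, gamma=1.0):
    # Gram matrix K_ij = exp(-gamma * (x_i - x_j)^2) for a one-dimensional sample.
    return np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)

def hsic(x, y, gamma=1.0):
    # Biased empirical HSIC: (1/m^2) tr(K H L H), with H the centering matrix.
    m = len(x)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf_gram(x, gamma) @ H @ rbf_gram(y, gamma) @ H) / m ** 2

def residual_dependence(cause, effect):
    # Regress effect on cause nonparametrically, then measure dependence
    # between the putative cause and the regression residuals.
    reg = KernelRidge(kernel="rbf", alpha=1e-2, gamma=1.0)
    reg.fit(cause[:, None], effect)
    residuals = effect - reg.predict(cause[:, None])
    return hsic(cause, residuals)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 300)
y = np.sin(np.pi * x) + rng.uniform(-1.0, 1.0, 300)
# The forward (X -> Y) direction typically yields much smaller residual dependence.
print("forward :", residual_dependence(x, y))
print("backward:", residual_dependence(y, x))
```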
36 A Hilbert space HX of functions from X to R is a reproducing kernel Hilbert space (RKHS) if for some kernel k(·, ·) (the reproducing kernel for HX ), for every f (·) ∈ HX and x ∈ X , the inner product ⟨f (·), k(x, ·)⟩HX = f (x). [sent-88, score-0.192]
37 The Moore-Aronszajn theorem shows that all symmetric positive definite kernels (most popular kernels) are reproducing kernels that uniquely define corresponding RKHSs [7]. [sent-91, score-0.108]
38 Let Y be a random variable with domain Y and l(·, ·) the reproducing kernel for HY . [sent-92, score-0.078]
39 µX = EX [k(x, ·)] and CXY = EXY [(k(x, ·) − µX ) ⊗ (l(y, ·) − µY )]. If the kernels are characteristic, e.g. Gaussian kernels, then CXY = 0 if and only if X ⊥ Y . [sent-94, score-0.033]
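For reference, the standard empirical counterparts of these population quantities (a sketch of the usual estimators, not text from the paper): given m paired samples, Gram matrices K and L, and the centering matrix H = I − (1/m)11⊤,

```latex
\hat\mu_X = \frac{1}{m}\sum_{i=1}^{m} k(x_i,\cdot), \qquad
\widehat{C}_{XY} = \frac{1}{m}\sum_{i=1}^{m}
  \bigl(k(x_i,\cdot)-\hat\mu_X\bigr)\otimes\bigl(l(y_i,\cdot)-\hat\mu_Y\bigr), \qquad
\bigl\|\widehat{C}_{XY}\bigr\|_{\mathrm{HS}}^{2}
  = \frac{1}{m^{2}}\operatorname{tr}\bigl(\tilde{K}\tilde{L}\bigr),
\quad \tilde{K}=HKH,\ \tilde{L}=HLH,
```

so the squared Hilbert-Schmidt norm of the empirical cross-covariance operator, and hence an HSIC-style dependence statistic, can be computed directly from centered Gram matrices.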
40 3 Weakly additive noise models We now extend the additive noise model framework to account for cases where additive noise models are invertible and cases where additive noise may not be present. [sent-108, score-2.386]
41 ψ = ⟨Vi , Pa^G_Vi⟩ is a local additive noise model for a distribution P over V that is Markov to a DAG G = ⟨V, E⟩ if Vi = f (Pa^G_Vi) + ǫ is an additive noise model. [sent-111, score-1.128]
42 When we assume a data generating process has a weakly additive noise model representation, we assume only that there are no cases where X → Y can be written X = f (Y ) + ǫX , but not Y = f (X) + ǫY . [sent-115, score-0.72]
43 In other words, the data cannot appear as though it admits an additive noise model representation, but only in the incorrect direction. [sent-116, score-0.564]
44 This representation is still appropriate when additive noise models are invertible, and when additive noise is not present: such cases only lead to weakly additive noise models which express greater underdetermination of the true data generating process. [sent-117, score-1.898]
45 We now define the notion of distribution-equivalence for weakly additive noise models. [sent-118, score-0.687]
46 A weakly additive noise model M = ⟨G, Ψ⟩ is distribution-equivalent to N = ⟨G ′ , Ψ′ ⟩ if and only if G and G ′ are Markov equivalent and ψ ∈ Ψ if and only if ψ ∈ Ψ′ . [sent-121, score-0.687]
47 Distribution-equivalence defines what can be discovered about the true data generating mechanism using observational data. [sent-122, score-0.033]
48 We now define a new structure to partition data generating processes which instantiate distribution-equivalent weakly additive noise models. [sent-123, score-0.72]
49 A weakly additive noise partially directed acyclic graph (WAN-PDAG) for M = ⟨G, Ψ⟩ is a mixed graph H = ⟨V, E⟩ such that for {Vi , Vj } ⊆ V: [sent-126, score-0.986]
50 (1) Vi → Vj is a directed edge in H if and only if Vi → Vj is a directed edge in G and in all G ′ such that N = ⟨G ′ , Ψ′ ⟩ is distribution-equivalent to M; [sent-127, score-0.366]
51 (2) Vi − Vj is an undirected edge in H if and only if Vi → Vj is a directed edge in G and there exists a G ′ and N = ⟨G ′ , Ψ′ ⟩ distribution-equivalent to M such that Vi ← Vj is a directed edge in G ′ . We now get the following results. [sent-128, score-0.468]
52 Let M = ⟨G, Ψ⟩ be a weakly additive noise model and N = ⟨G ′ , Ψ′ ⟩ be distribution-equivalent to M. [sent-131, score-0.687]
53 (i) This is correct because of Markov equivalence [2]. [sent-139, score-0.084]
54 In the first stage, we use a kernel-based conditional dependence measure similar to HSIC [9] (see also [11, Section 2]). [sent-146, score-0.034]
55 For a conditioning variable Z with centered Gram matrix M̃ for a reproducing kernel m(·, ·), we define the conditional cross covariance CẌŸ|Z = CẌŸ − CẌZ CZZ⁻¹ CZŸ , where Ẍ = (X, Z) and Ÿ = (Y, Z), and let HXY|Z denote its squared Hilbert-Schmidt norm. [sent-148, score-0.112]
56 It follows from [9, Theorem 3] that HXY |Z = 0 if and only if X ⊥ Y |Z when kernels are characteristic. [sent-150, score-0.033]
57 [9] provides the empirical estimator ĤXY|Z = (1/m²) tr(K̃ L̃ − 2 K̃ M̃ (M̃ + ǫIN )⁻² M̃ L̃ + K̃ M̃ (M̃ + ǫIN )⁻² M̃ L̃ M̃ (M̃ + ǫIN )⁻² M̃ ). The null distribution of ĤXY|Z is unknown and difficult to derive, so we must use the permutation approach described in section 2. [sent-151, score-0.045]
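A direct numpy transcription of this estimator, as reconstructed above, for readers who want to experiment with it; the centering convention, the use of augmented Gram matrices for (X, Z) and (Y, Z), and the value of the regularizer ǫ are assumptions of this sketch rather than the paper's exact settings, and the implementation is the naive O(N³) one.

```python
import numpy as np

def center(K):
    # Centered Gram matrix H K H with H = I - (1/m) 1 1^T.
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def cond_dep_stat(K, L, M, eps=1e-3):
    """Empirical conditional dependence statistic (reconstruction sketch).

    K, L, M: m x m Gram matrices for (X, Z), (Y, Z) and Z respectively.
    eps: ridge regularizer (the value here is an arbitrary placeholder).
    """
    m = K.shape[0]
    Kt, Lt, Mt = center(K), center(L), center(M)
    # R = M~ (M~ + eps I)^{-2} M~ appears in both correction terms.
    R = Mt @ np.linalg.matrix_power(Mt + eps * np.eye(m), -2) @ Mt
    return np.trace(Kt @ Lt - 2 * Kt @ R @ Lt + Kt @ R @ Lt @ R) / m ** 2
```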
58 We thus (making analogy to the discrete case) must cluster Z and then permute elements only within clusters for the permutation test, as in [12]. [sent-153, score-0.045]
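A sketch of such a within-cluster permutation test; clustering Z with k-means, the number of clusters, and the number of permutations are illustrative choices, and `stat_fn` stands for any conditional dependence statistic computed from raw samples (for instance, a wrapper that builds the Gram matrices and calls the estimator sketched above).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_permutation_pvalue(stat_fn, x, y, z, n_clusters=4, n_perm=1000, seed=0):
    """Permutation p-value for a conditional dependence statistic stat_fn(x, y, z).

    Y is permuted only within clusters of Z, so the Y-Z relationship is roughly
    preserved while any conditional dependence between X and Y given Z is destroyed.
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(np.reshape(z, (len(z), -1)))
    observed = stat_fn(x, y, z)
    null = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = np.array(y, copy=True)
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            y_perm[idx] = y_perm[rng.permutation(idx)]
        null[b] = stat_fn(x, y_perm, z)
    return float(np.mean(null >= observed))
```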
59 This first stage is not computationally efficient, however, since each evaluation of ĤXY|Z is naively O(N³) and we need to evaluate ĤXY|Z approximately 1000 times for each permutation test. [sent-154, score-0.083]
60 We implemented the incomplete Cholesky factorization [14], which can be used to obtain an m × p matrix G, where p ≪ m, and an m × m permutation matrix P such that K ≈ P GG⊤ P ⊤ , where K is an m × m Gram matrix. [sent-156, score-0.078]
61 A clever implementation, after replacing the Gram matrices in ĤXY|Z with their incomplete Cholesky factorizations and using an appropriate equivalence to invert G⊤ G + ǫIp instead of GG⊤ + ǫIm for the M̃ terms, results in a straightforward O(mp³) operation. [sent-157, score-0.117]
62 A more stable (and faster) approach is to obtain incomplete Cholesky factorizations GX , GY , and GZ with permutation matrices PX , PY , and PZ , and then obtain the thin SVDs for HPX GX , HPY GY , and HPZ GZ , e.g. HPX GX = UX SX VX⊤ , where H is the centering matrix. [sent-159, score-0.078]
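Both speedups rest on simple low-rank linear algebra, which the short sketch below sanity-checks; the random m × p factor G merely stands in for an incomplete Cholesky factor, and the variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, eps = 500, 20, 1e-3
G = rng.standard_normal((m, p))   # stand-in for an m x p incomplete Cholesky factor
K = G @ G.T                       # low-rank approximation of an m x m Gram matrix

# Invert the small p x p system (G^T G + eps I_p) instead of the m x m one.
small_inv = np.linalg.inv(G.T @ G + eps * np.eye(p))
inv_fast = (np.eye(m) - G @ small_inv @ G.T) / eps          # Woodbury identity
inv_naive = np.linalg.inv(K + eps * np.eye(m))              # naive O(m^3) inverse
print(np.allclose(inv_fast, inv_naive))                     # True up to rounding

# Thin SVD of the centered factor, as in the more stable variant.
H = np.eye(m) - np.ones((m, m)) / m
U, S, Vt = np.linalg.svd(H @ G, full_matrices=False)        # U: m x p, S: length p
```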
63 Figure 2 shows that this method leads to a significant increase in speed when used with a permutation test for conditional independence, without significantly affecting the empirically observed type I error rate of the test. [sent-162, score-0.141]
64 kPC may find more orientations by considering submodels, e.g. [sent-173, score-0.064]
65 if all relations are linear and only one variable has a non-Gaussian noise term. [sent-175, score-0.18]
66 The basic strategy used is a “PC-style” greedy search where we look for undirected edges in the current mixed graph (starting with the PDAG resulting from the first stage) adjacent to the fewest other undirected edges. [sent-176, score-0.208]
67 If these edges can be oriented using additive noise models, we make the implied orientations, apply the extended Meek rules, and then iterate until no more edges can be oriented. [sent-177, score-0.724]
68 Let G = ⟨V, E⟩ be the resulting PDAG and ∀Vi ∈ V, let U^G_Vi denote the nodes connected to Vi in G by an undirected edge. [sent-179, score-0.047]
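A high-level sketch of this second stage as just described; `pdag`, `fits_noninvertible_anm`, and `apply_meek_rules` are hypothetical helpers standing in for the mixed-graph data structure, the regression-plus-residual-independence test, and the extended Meek rules, and the exact orientation rule in the paper's algorithm (which orients all undirected edges at the chosen vertex) may differ from this simplification.

```python
def kpc_second_stage(pdag, data, fits_noninvertible_anm, apply_meek_rules):
    """Greedy orientation of undirected edges via local additive noise models.

    pdag: mixed graph from the first (kernel-based PC) stage, assumed to provide
          undirected_edges(), undirected_degree(v), parents(v) and orient(u, v).
    fits_noninvertible_anm(v, parent_set, data) -> bool regresses v on parent_set
          and checks that the residuals are independent of the parents.
    """
    changed = True
    while changed:
        changed = False
        # Visit undirected edges adjacent to the fewest other undirected edges first.
        edges = sorted(pdag.undirected_edges(),
                       key=lambda e: pdag.undirected_degree(e[0]) +
                                     pdag.undirected_degree(e[1]))
        for u, v in edges:
            if fits_noninvertible_anm(v, pdag.parents(v) | {u}, data):
                pdag.orient(u, v)          # orient u -> v
            elif fits_noninvertible_anm(u, pdag.parents(u) | {v}, data):
                pdag.orient(v, u)          # orient v -> u
            else:
                continue
            apply_meek_rules(pdag)         # propagate the implied orientations
            changed = True
            break                          # restart the scan on the updated graph
    return pdag
```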
69 If an edge is oriented in the second stage of kPC, it is implied by a noninvertible local additive noise model. [sent-184, score-0.797]
70 If the condition at line 8 is true then ⟨Vi , Pa^G_Vi ∪ S⟩ is a noninvertible local additive noise model. [sent-186, score-0.658]
71 Suppose ψ = ⟨Vi , W⟩ is a noninvertible local additive noise model. [sent-215, score-0.658]
72 Then kPC will make all orientations implied by ψ. [sent-216, score-0.11]
73 Since ⟨Vi , Pa^G_Vi ∪ S⟩ is a noninvertible local additive noise model, line 8 is satisfied, so all edges connected to Vi are oriented. [sent-220, score-0.715]
74 Assume data is generated according to some weakly additive noise model M = ⟨G, Ψ⟩. [sent-223, score-0.687]
75 Then kPC will return the WAN-PDAG instantiated by M assuming perfect conditional independence information, Markov, faithfulness, and causal sufficiency. [sent-224, score-0.197]
76 The PC algorithm is correct and complete with respect to conditional independence [2]. [sent-226, score-0.096]
77 Orientations made with respect to additive noise models are correct by lemma 4.1 [sent-227, score-0.614]
78 and all such orientations that can be made are made by lemma 4.2. [sent-228, score-0.089]
79 The Meek rules, which are correct and complete [4], are invoked after each orientation made with respect to additive noise models so they are invoked after all such orientations are made. [sent-230, score-0.653]
80 5 Related research kPC is similar in spirit to the PC-LiNGAM structure learning algorithm [15], which assumes dependencies are linear with either Gaussian or non-Gaussian noise. [sent-231, score-0.036]
81 KCL [11] is a heuristic search for a mixed graph that uses the same kernel-based dependence measures as kPC (while not determining significance thresholds via a hypothesis test), but does not take advantage of additive noise models. [sent-233, score-0.621]
82 [16] provides a more efficient algorithm for learning additive noise models, by first finding a causal ordering after doing a series of high dimensional regressions and HSIC independence tests and then pruning the resulting DAG implied by this ordering. [sent-234, score-0.798]
83 Finally, [17] proposes a two-stage procedure for learning additive noise models from data that is similar to kPC, but requires the additive noise model assumptions in the first stage where the Markov equivalence class is identified. [sent-235, score-1.275]
84 We generated non-Gaussian noise using the same procedure as [19] and used polynomial and trigonometric functions for nonlinear dependencies. [sent-237, score-0.226]
85 We compared kPC to PC, the score-based GES with the BIC-score [20], and the ICA-based LiNGAM [19], which assumes linear dependencies and non-Gaussian noise. [sent-238, score-0.036]
86 We report precision, i.e. the proportion of directed edges in the resulting graph that are in the true DAG, and recall, i.e. [sent-241, score-0.218]
87 the proportion of directed edges in the true DAG that are in the resulting graph. [sent-243, score-0.185]
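In code these two quantities reduce to set operations over directed edges (a trivial sketch with illustrative names):

```python
def precision_recall(estimated_edges, true_edges):
    """Directed-edge precision and recall; edges are (parent, child) tuples."""
    estimated, truth = set(estimated_edges), set(true_edges)
    correct = estimated & truth
    precision = len(correct) / len(estimated) if estimated else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    return precision, recall
```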
88 kPC performs about the same as PC in precision and recall, which again is unsurprising since previous simulation results have shown that nonlinearity, but not non-Gaussianity, can significantly affect the performance of PC. [sent-248, score-0.056]
89 In the nonlinear non-Gaussian case, kPC performs slightly better than PC in precision. [sent-249, score-0.046]
90 We also ran kPC on data from an fMRI experiment that is analyzed in [21], where nonlinear dependencies can be observed. [sent-251, score-0.082]
91 kPC successfully found the same directed edges without using any background knowledge. [sent-255, score-0.212]
92 7 Conclusion We introduced weakly additive noise models, which extend the additive noise model framework to cases such as the linear Gaussian, where the additive noise model is invertible and thus unidentifiable, as well as cases where additive noise is not present. [sent-257, score-2.459]
93 The weakly additive noise framework allows us to identify a unique DAG when the additive noise model assumptions hold, and a structure that is at least as specific as a PDAG (possibly still a unique DAG) when some additive noise assumptions fail. [sent-258, score-1.815]
94 We defined equivalence classes for such models and introduced the kPC algorithm for learning these equivalence classes from data. [sent-259, score-0.193]
95 2 When simulating nonlinear data, we must be careful to ensure that variances do not blow up and result in data for which no finite sample method can show adequate performance. [sent-265, score-0.046]
96 This has the unfortunate side effect that the nonlinear data generated may be well approximated using linear methods. [sent-266, score-0.046]
97 Causal discovery of linear acyclic models with arbitrary distributions. [sent-366, score-0.106]
98 Regression by dependence minimization and its application to causal inference in additive noise models. [sent-374, score-0.665]
99 Acyclic causality discovery with additive noise: An information-theoretical perspective. [sent-379, score-0.384]
100 A linear non-Gaussian acyclic model for causal discovery. [sent-392, score-0.182]
wordName wordTfidf (topN-words)
[('kpc', 0.397), ('additive', 0.384), ('vi', 0.307), ('pavi', 0.235), ('vj', 0.224), ('hxy', 0.219), ('noise', 0.18), ('pc', 0.173), ('dag', 0.151), ('lingam', 0.137), ('directed', 0.128), ('dags', 0.126), ('pdag', 0.125), ('uvi', 0.125), ('weakly', 0.123), ('ges', 0.11), ('gz', 0.11), ('causal', 0.101), ('vk', 0.095), ('chvi', 0.094), ('noninvertible', 0.094), ('equivalence', 0.084), ('acyclic', 0.081), ('invertible', 0.08), ('gy', 0.078), ('meek', 0.069), ('cxy', 0.069), ('nif', 0.069), ('hx', 0.067), ('gx', 0.067), ('orientations', 0.064), ('janzing', 0.063), ('orient', 0.063), ('pagj', 0.063), ('independence', 0.062), ('gram', 0.059), ('edges', 0.057), ('edge', 0.055), ('regress', 0.055), ('regressed', 0.055), ('cholesky', 0.053), ('ii', 0.049), ('hsic', 0.047), ('spirtes', 0.047), ('submodels', 0.047), ('markov', 0.047), ('undirected', 0.047), ('implied', 0.046), ('nonlinear', 0.046), ('permutation', 0.045), ('gretton', 0.045), ('hilbert', 0.043), ('reproducing', 0.042), ('hyv', 0.041), ('stage', 0.038), ('dependencies', 0.036), ('kernel', 0.036), ('su', 0.036), ('sz', 0.035), ('conditional', 0.034), ('generating', 0.033), ('incomplete', 0.033), ('hs', 0.033), ('hoyer', 0.033), ('graph', 0.033), ('kernels', 0.033), ('forward', 0.032), ('rules', 0.032), ('fukumizu', 0.032), ('parents', 0.032), ('ul', 0.032), ('faithfulness', 0.031), ('gaussianity', 0.031), ('immorality', 0.031), ('lacc', 0.031), ('lifg', 0.031), ('lipl', 0.031), ('lmtg', 0.031), ('locc', 0.031), ('pagi', 0.031), ('ramsey', 0.031), ('sk', 0.031), ('precision', 0.029), ('nonlinearity', 0.028), ('scholk', 0.027), ('hanson', 0.027), ('mooij', 0.027), ('nonparametrically', 0.027), ('unsurprising', 0.027), ('background', 0.027), ('sch', 0.027), ('tests', 0.025), ('lemma', 0.025), ('pittsburgh', 0.025), ('schmidt', 0.025), ('unidenti', 0.025), ('gg', 0.025), ('pa', 0.025), ('models', 0.025), ('mixed', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models
Author: Arthur Gretton, Peter Spirtes, Robert E. Tillman
Abstract: The recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. However, for certain distributions, e.g. linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. We introduce weakly additive noise models, which extends this framework to cases where the additive noise model is invertible and when additive noise is not present. We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. 1
2 0.20885067 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperform a number of state of the art methods. 1
3 0.14643021 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
Author: Marcel V. Gerven, Botond Cseke, Robert Oostenveld, Tom Heskes
Abstract: We introduce a novel multivariate Laplace (MVL) distribution as a sparsity promoting prior for Bayesian source localization that allows the specification of constraints between and within sources. We represent the MVL distribution as a scale mixture that induces a coupling between source variances instead of their means. Approximation of the posterior marginals using expectation propagation is shown to be very efficient due to properties of the scale mixture representation. The computational bottleneck amounts to computing the diagonal elements of a sparse matrix inverse. Our approach is illustrated using a mismatch negativity paradigm for which MEG data and a structural MRI have been acquired. We show that spatial coupling leads to sources which are active over larger cortical areas as compared with an uncoupled prior. 1
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright
Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f ∗ : Rp → R that has an additive decomposition of the form f ∗ (X1 , . . . , Xp ) = ∑_{j∈S} h∗j (Xj ), where each component function h∗j lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f ∗ (X) corrupted with additive white Gaussian noise where the covariate vectors (X1 , X2 , X3 , ..., Xp ) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2 (P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫn²(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫn²(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth -order Sobolev space), then the s-dimensional estimation term takes the form s ǫn²(H) ≍ s n^{−2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.
5 0.089960158 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
Author: Han Liu, Xi Chen
Abstract: This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate models, we propose an algorithm called generalized forward regression. Both algorithms simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-ofthe-art competitors, including LASSO, a nonparametric version of LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called Foba. We also provide some theoretical justifications of specific versions of the additive forward regression. 1
6 0.077811129 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
7 0.074169427 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording
8 0.073688678 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition
9 0.071510755 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem
10 0.067732356 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
11 0.066412047 43 nips-2009-Bayesian estimation of orientation preference maps
12 0.05540045 152 nips-2009-Measuring model complexity with the prior predictive
13 0.050939761 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares
14 0.050768949 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
15 0.048830193 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
16 0.048611175 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
17 0.046164647 246 nips-2009-Time-Varying Dynamic Bayesian Networks
18 0.043179642 78 nips-2009-Efficient Moments-based Permutation Tests
19 0.041993219 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
20 0.040802088 147 nips-2009-Matrix Completion from Noisy Entries
topicId topicWeight
[(0, -0.147), (1, -0.008), (2, 0.008), (3, 0.032), (4, -0.017), (5, -0.072), (6, 0.067), (7, -0.039), (8, -0.036), (9, -0.095), (10, -0.039), (11, -0.068), (12, 0.027), (13, 0.029), (14, 0.023), (15, 0.025), (16, -0.02), (17, 0.033), (18, -0.052), (19, -0.117), (20, -0.053), (21, 0.12), (22, -0.222), (23, 0.005), (24, -0.189), (25, 0.04), (26, -0.057), (27, -0.003), (28, -0.015), (29, -0.109), (30, 0.008), (31, -0.038), (32, 0.119), (33, 0.017), (34, 0.034), (35, 0.017), (36, -0.018), (37, -0.056), (38, -0.049), (39, 0.036), (40, 0.111), (41, 0.226), (42, -0.032), (43, -0.059), (44, -0.198), (45, 0.035), (46, 0.095), (47, 0.127), (48, 0.223), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.96212798 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models
Author: Arthur Gretton, Peter Spirtes, Robert E. Tillman
Abstract: The recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. However, for certain distributions, e.g. linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. We introduce weakly additive noise models, which extends this framework to cases where the additive noise model is invertible and when additive noise is not present. We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. 1
2 0.63793898 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperform a number of state of the art methods. 1
3 0.4853451 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
Author: Marcel V. Gerven, Botond Cseke, Robert Oostenveld, Tom Heskes
Abstract: We introduce a novel multivariate Laplace (MVL) distribution as a sparsity promoting prior for Bayesian source localization that allows the specification of constraints between and within sources. We represent the MVL distribution as a scale mixture that induces a coupling between source variances instead of their means. Approximation of the posterior marginals using expectation propagation is shown to be very efficient due to properties of the scale mixture representation. The computational bottleneck amounts to computing the diagonal elements of a sparse matrix inverse. Our approach is illustrated using a mismatch negativity paradigm for which MEG data and a structural MRI have been acquired. We show that spatial coupling leads to sources which are active over larger cortical areas as compared with an uncoupled prior. 1
4 0.47752622 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition
Author: Tom Ouyang, Randall Davis
Abstract: We propose a new sketch recognition framework that combines a rich representation of low level visual appearance with a graphical model for capturing high level relationships between symbols. This joint model of appearance and context allows our framework to be less sensitive to noise and drawing variations, improving accuracy and robustness. The result is a recognizer that is better able to handle the wide range of drawing styles found in messy freehand sketches. We evaluate our work on two real-world domains, molecular diagrams and electrical circuit diagrams, and show that our combined approach significantly improves recognition performance. 1
5 0.43594006 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
Author: Han Liu, Xi Chen
Abstract: This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate models, we propose an algorithm called generalized forward regression. Both algorithms simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-ofthe-art competitors, including LASSO, a nonparametric version of LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called Foba. We also provide some theoretical justifications of specific versions of the additive forward regression. 1
6 0.42498863 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording
7 0.41902304 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem
8 0.36481002 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
9 0.36077511 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
10 0.34390366 78 nips-2009-Efficient Moments-based Permutation Tests
11 0.33991846 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares
12 0.33729252 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation
13 0.33251271 206 nips-2009-Riffled Independence for Ranked Data
14 0.32087103 43 nips-2009-Bayesian estimation of orientation preference maps
15 0.3134391 100 nips-2009-Gaussian process regression with Student-t likelihood
16 0.30835277 152 nips-2009-Measuring model complexity with the prior predictive
17 0.30259758 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
18 0.28112385 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
19 0.27880529 3 nips-2009-AUC optimization and the two-sample problem
20 0.26920485 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
topicId topicWeight
[(7, 0.364), (21, 0.01), (24, 0.063), (25, 0.048), (35, 0.091), (36, 0.064), (39, 0.063), (49, 0.019), (58, 0.083), (71, 0.045), (86, 0.054)]
simIndex simValue paperId paperTitle
1 0.89658505 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording
Author: Zhi Yang, Qi Zhao, Edward Keefer, Wentai Liu
Abstract: Studying signal and noise properties of recorded neural data is critical in developing more efficient algorithms to recover the encoded information. Important issues exist in this research including the variant spectrum spans of neural spikes that make it difficult to choose a globally optimal bandpass filter. Also, multiple sources produce aggregated noise that deviates from the conventional white Gaussian noise. In this work, the spectrum variability of spikes is addressed, based on which the concept of adaptive bandpass filter that fits the spectrum of individual spikes is proposed. Multiple noise sources have been studied through analytical models as well as empirical measurements. The dominant noise source is identified as neuron noise followed by interface noise of the electrode. This suggests that major efforts to reduce noise from electronics are not well spent. The measured noise from in vivo experiments shows a family of 1/f^x spectrum that can be reduced using noise shaping techniques. In summary, the methods of adaptive bandpass filtering and noise shaping together result in several dB signal-to-noise ratio (SNR) enhancement.
same-paper 2 0.78787935 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models
Author: Arthur Gretton, Peter Spirtes, Robert E. Tillman
Abstract: The recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. However, for certain distributions, e.g. linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. We introduce weakly additive noise models, which extends this framework to cases where the additive noise model is invertible and when additive noise is not present. We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. 1
3 0.7470355 69 nips-2009-Discrete MDL Predicts in Total Variation
Author: Marcus Hutter
Abstract: The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class need to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed. 1
4 0.74283445 95 nips-2009-Fast subtree kernels on graphs
Author: Nino Shervashidze, Karsten M. Borgwardt
Abstract: In this article, we propose fast subtree kernels on graphs. On graphs with n nodes and m edges and maximum degree d, these kernels comparing subtrees of height h can be computed in O(mh), whereas the classic subtree kernel by Ramon & Gärtner scales as O(n² 4^d h). Key to this efficiency is the observation that the Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime. 1
5 0.65899748 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, Sören Sonnenburg
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, ℓ1-norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary ℓp-norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply ℓp-norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art. 1
6 0.56392425 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds
7 0.45877337 226 nips-2009-Spatial Normalized Gamma Processes
8 0.45572805 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
9 0.45024413 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies
10 0.44505951 3 nips-2009-AUC optimization and the two-sample problem
11 0.44329244 113 nips-2009-Improving Existing Fault Recovery Policies
12 0.43909264 168 nips-2009-Non-stationary continuous dynamic Bayesian networks
13 0.43446225 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
14 0.43428147 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
15 0.43143779 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning
16 0.431427 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding
17 0.43040687 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions
18 0.4289341 237 nips-2009-Subject independent EEG-based BCI decoding
19 0.42742753 247 nips-2009-Time-rescaling methods for the estimation and assessment of non-Poisson neural encoding models
20 0.42648882 175 nips-2009-Occlusive Components Analysis