jmlr jmlr2011 jmlr2011-86 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks
Reference: text
sentIndex sentText sentNum sentScore
1 It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. [sent-7, score-0.579]
2 Linear latent variable models (or factor analysis) and linear directed acyclic graphs (DAGs) are prominent examples of models for continuous multivariate data. [sent-23, score-0.187]
3 Sparsity arises for example in gene regulation because the latent factors represent driving signals for gene regulatory sub-networks and/or transcription factors, each of which only includes/affects a limited number of genes. [sent-33, score-0.329]
4 Furthermore, linear relationships might be unrealistic for example in gene regulation, where it is generally accepted that one cannot replace the driving signal (related to concentration of a transcription factor protein in the cell nucleus) with the measured concentration of corresponding mRNA. [sent-35, score-0.22]
5 Shimizu et al. (2006) provided the important insight that every DAG has a factor model representation, that is, the connectivity matrix of a DAG gives rise to a triangular mixing matrix in the factor model. [sent-42, score-0.35]
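For intuition, the following NumPy sketch (an illustration, not the authors' code) verifies this insight on a toy linear DAG x = Bx + z: solving for x gives x = (I − B)^{-1} z, and with the variables in a causal order the implied mixing matrix is lower triangular; permuting the observed variables destroys the triangularity, which is what the ordering search later has to undo.

```python
import numpy as np

# Toy strictly lower-triangular DAG connectivity (variables already in a causal
# order): x1 -> x2, x1 -> x3, x2 -> x3.  Illustrative values, not from the paper.
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.3, 0.5, 0.0]])

# The DAG model x = B x + z implies x = (I - B)^{-1} z, i.e. a factor model
# x = C z whose square mixing matrix C is lower triangular in this ordering.
C = np.linalg.inv(np.eye(3) - B)
print(np.allclose(C, np.tril(C)))        # True

# If the observed variables arrive in a permuted order, C is triangular only
# up to a row permutation, which is what the ordering search must recover.
perm = [2, 0, 1]
C_permuted = C[perm, :]
print(np.allclose(C_permuted, np.tril(C_permuted)))  # False
```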
6 (2008) extend it to allow for latent variables, for which they use a probabilistic version of ICA to obtain the variable ordering, pruning to make the model sparse and bootstrapping for model selection. [sent-48, score-0.24]
7 In fully Bayesian sparse factor modeling, two approaches have been proposed: parametric models with bimodal sparsity promoting priors (West, 2003; Lucas et al. [sent-62, score-0.248]
8 The model proposed by West (2003) is, as far as the authors know, the first attempt to encode sparsity in a factor model explicitly in the form of a prior. [sent-66, score-0.252]
9 Starting from a training-test set partition of data {X, X⋆ }, our framework produces factor models C and DAG candidates B with and without latent variables Z that can be compared in terms of how well they fit the data using test likelihoods L . [sent-124, score-0.355]
10 The variable ordering P needed by the DAG is obtained as a byproduct of a factor model inference. [sent-125, score-0.21]
11 However, since we are interested in the factor model by itself, we will not constrain the factor loading matrix to have triangular form, but allow for sparse solutions so pruning is not needed. [sent-128, score-0.286]
12 The slab and spike prior biases towards sparsity so it makes sense to search for a permutation in parallel with factor model inference. [sent-130, score-0.543]
13 Given a set of possible variable orderings inferred by this method, we can then learn DAGs using slab and spike priors for their connectivity matrices. [sent-133, score-0.428]
14 The so-called slab and spike prior is a two-component mixture of a continuous distribution and degenerate δ-function point mass at zero. [sent-134, score-0.209]
15 Section 3 provides the full prior specification, including sparsity, latent variables and driving signals, order search, and extensions for correlated data (CSLIM) and non-linearities (SNIM). [sent-140, score-0.241]
16 The model is not a completely general linear Bayesian network because connections to latent variables are absent (see for example Silva, 2010). [sent-179, score-0.197]
17 A theorem of Kagan et al. (1973) states that when at least m − 1 latent variables are non-Gaussian, C_L is identifiable up to scale and permutation of its columns, that is, we can only identify C_L up to C_L S_f P_f, where S_f and P_f are arbitrary scaling and permutation matrices, respectively. [sent-195, score-0.28]
18 What we will show is that even if D is still identifiable, we can no longer obtain B and C uniquely unless we “tag” the model by requiring the distributions of driving signals zD and latent signals zL to differ. [sent-217, score-0.313]
19 The remaining components of the model are described as follows in five parts: sparsity, latent variables and driving signals, order search, allowing for correlated data, and allowing for non-linearities. [sent-308, score-0.243]
20 The same effect can be expected from any continuous distribution used for sparsity like Student’s t, α-stable and bimodal priors (continuous slab and spike priors, Ishwaran and Rao, 2005). [sent-318, score-0.346]
21 This has motivated the introduction of many variants over the years of so-called slab and spike priors consisting of two component mixtures of a continuous part and a δ-function at zero (Lempers, 1971; Mitchell and Beauchamp, 1988; George and McCulloch, 1993; Geweke, 1996; West, 2003). [sent-320, score-0.261]
22 Note also that q_ij | ν_j is mainly used for clarity, to make the binary indicators explicit; in practice we can work directly with c_ij | ν_j, · ∼ (1 − ν_j) δ(c_ij) + ν_j Cont(c_ij | ·) because q_ij can be marginalized out. [sent-325, score-0.182]
23 (2008) the model with a shared beta-distributed sparsity level per factor introduces the undesirable side-effect that there is strong co-variation between the elements in each column of the masking matrix. [sent-328, score-0.206]
24 (3) The distribution over η_ij expresses that we expect parsimony: either η_ij is exactly zero (implying that q_ij and c_ij are zero) or non-zero, drawn from a beta distribution favoring high values, that is, q_ij and c_ij are non-zero with high probability. [sent-335, score-0.265]
25 The slab and spike prior for B (DAG) is obtained from Equations (2), (3) and (4) by simply replacing c_ij with b_ij and q_ij with r_ij. [sent-347, score-0.322]
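To make the construction concrete, here is a minimal generative sketch of a two-level slab and spike draw for one column of a loading matrix; the column inclusion probability, the Beta hyperparameters and the Gaussian slab are illustrative placeholders, not the exact hyperpriors of Equations (2)-(4).

```python
import numpy as np

rng = np.random.default_rng(0)

def slab_and_spike_column(d, p_include=0.3, a=10.0, b=1.0, slab_std=1.0):
    """One column of a sparse loading matrix from a two-level slab and spike
    prior (hyperparameters are illustrative, not the paper's exact values)."""
    # Second level: eta_ij is exactly zero (spike) with probability
    # 1 - p_include, or drawn from a Beta favoring high values (slab).
    eta = np.where(rng.random(d) < p_include, rng.beta(a, b, size=d), 0.0)
    # First level: binary indicators q_ij ~ Bernoulli(eta_ij); the continuous
    # slab is applied only where q_ij = 1, elsewhere c_ij is exactly zero.
    q = rng.random(d) < eta
    c = np.where(q, rng.normal(0.0, slab_std, size=d), 0.0)
    return c, q, eta

c, q, eta = slab_and_spike_column(d=10)
print(c)   # mostly exact zeros with a few continuous non-zero loadings
```

The expected fraction of non-zero loadings in this sketch is roughly p_include times the Beta mean, which is how the two levels jointly control sparsity.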
26 The hyperparameters for the variance of the non-zero elements of B and C are set to get a diffuse prior distribution bounded away from zero (t_s = 2 and t_r = 1), to allow for a better separation between slab and spike components. [sent-378, score-0.209]
27 This means that we can perform inference for the factor model to obtain D while searching in parallel for a set of permutations P that are in good agreement (in a probabilistic sense) with the triangular requirement of D. [sent-415, score-0.218]
28 Such a set of orderings is found during the inference procedure of the factor model. [sent-416, score-0.235]
29 To set up the stochastic search, we need to modify the factor model slightly by introducing separate data (row) and factor (column) permutations, P and P_f, to obtain x = P^⊤ D P_f z + ε. [sent-417, score-0.196]
30 The reason for using two different permutation matrices, rather than only one like in the definition of the DAG model, is that we need to account for the permutation freedom of the factor model (see Section 2. [sent-418, score-0.289]
31 The procedure can be seen as a simple approach for generating hypotheses about good orderings, producing close to triangular versions of D, in a model where the slab and spike prior provides the required bias towards sparsity. [sent-429, score-0.309]
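As a rough illustration of what "close to triangular" can mean, the snippet below scores a candidate row ordering of a mixing matrix by the squared mass it leaves above the diagonal and, for a tiny example, finds the best ordering by brute force; the actual procedure in SLIM samples orderings stochastically during Gibbs inference instead of enumerating them.

```python
import numpy as np
from itertools import permutations

def upper_mass(D, row_order):
    """Squared mass above the diagonal after reordering the rows of D;
    zero means the ordering makes D exactly lower triangular."""
    Dp = D[list(row_order), :]
    return float(np.sum(np.triu(Dp, k=1) ** 2))

def best_row_order(D):
    """Brute-force search over row orders (feasible only for very small d);
    shown for illustration only, the paper's stochastic search avoids this."""
    d = D.shape[0]
    return min(permutations(range(d)), key=lambda p: upper_mass(D, p))

# A mixing matrix that is lower triangular up to a row permutation.
D = np.array([[0.3, 0.5, 1.0],
              [1.0, 0.0, 0.0],
              [0.8, 1.0, 0.0]])
order = best_row_order(D)
print(order, upper_mass(D, order))   # (1, 2, 0) 0.0
```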
32 We can then assume independent Gaussian process (GP) priors for each latent variable, instead of scale mixtures of Gaussians, to obtain what we have called correlated sparse linear identifiable modeling (CSLIM). [sent-438, score-0.2]
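For intuition about what a GP prior on a latent variable buys, the sketch below draws one temporally correlated factor from a zero-mean GP with a squared-exponential kernel; the kernel choice, length scale and jitter are illustrative assumptions, not necessarily the ones used in CSLIM.

```python
import numpy as np

def draw_gp_factor(t, length_scale=5.0, jitter=1e-6, seed=None):
    """One latent factor z_j(t) drawn from a zero-mean GP with a
    squared-exponential kernel (illustrative kernel and length scale)."""
    rng = np.random.default_rng(seed)
    diff = t[:, None] - t[None, :]
    K = np.exp(-0.5 * (diff / length_scale) ** 2)         # covariance matrix
    L = np.linalg.cholesky(K + jitter * np.eye(len(t)))   # jitter for stability
    return L @ rng.standard_normal(len(t))

t = np.arange(100.0)
z = draw_gp_factor(t, seed=0)   # a smooth, temporally correlated trajectory
```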
33 The main limitation of the model in Equation (10) is that if the true ordering of the variables is unknown, the exhaustive enumeration of P is needed. [sent-472, score-0.176]
34 In principle, an ordering search procedure for the non-linear model only requires the latent variables z to have Gaussian process priors as well. [sent-474, score-0.343]
35 Formally, the Bayesian model selection yardstick, the marginal likelihood for model M, p(X|M) = ∫∫ p(X|Θ, Z) p(Θ|M) p(Z|M) dΘ dZ, can be obtained by marginalizing the joint over the parameters Θ and latent variables Z. [sent-479, score-0.258]
36 For the full DAG model in Equation (1), we will not average over permutations P but rather calculate the test likelihood for a number of candidates P(1) , . [sent-488, score-0.191]
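As a hedged sketch (not the estimator actually used in the paper, which also has to integrate out the latent variables for the test points), the test log-likelihood of a pure DAG candidate can be approximated from posterior samples by a simple Monte Carlo average of the predictive density:

```python
import numpy as np
from scipy.stats import norm

def dag_test_log_likelihood(X_test, B_samples, sigma_samples):
    """Monte Carlo estimate of the test log-likelihood of a pure DAG
    x = B x + z with Gaussian noise (illustrative; SLIM uses heavier-tailed
    driving signals and a separate treatment of latent variables).
    X_test: (d, N) held-out data; B_samples / sigma_samples: posterior draws
    of the connectivity matrix and per-variable noise standard deviations."""
    d, _ = X_test.shape
    logps = []
    for B, sigma in zip(B_samples, sigma_samples):
        Z = (np.eye(d) - B) @ X_test            # implied driving signals
        # For an acyclic B with zero diagonal, |det(I - B)| = 1, so no
        # Jacobian correction is needed in the change of variables.
        logps.append(norm.logpdf(Z, scale=sigma[:, None]).sum())
    logps = np.asarray(logps)
    # log of the posterior-averaged likelihood, computed stably
    return np.logaddexp.reduce(logps) - np.log(len(logps))
```

Each ordering candidate, and the factor model, would then be compared by its estimated test log-likelihood on the held-out data.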
37 The average over the extensive variables associated with the test points p(Z⋆|·) is slightly more complicated because naively drawing samples from p(Z⋆|·) results in an estimator with high variance for ψ_i ≪ υ_jn. [sent-497, score-0.209]
38 , υ_(d+m)n) with elements υ_jn; Θ is sampled directly from the prior, accordingly. [sent-505, score-0.209]
39 [Equation and graphical-model residue for the (SNIM) specification: a GP prior with kernel k_{υ_i,x}(·) over y_i1, . . . , y_iN, with plates over j = 1 : d + m for z_jn, υ_jn and over i = 1 : d for x_in, y_in, ψ_i, υ_i], where we have omitted P and the hyperparameters in the graphical model. [sent-528, score-0.418]
40 The latent variables/driving signals z_jn and the mixing/connectivity matrices with elements c_ij or b_ij are modeled independently. [sent-530, score-0.4]
41 Variables υ_jn are variances if z_jn uses a scale mixture of Gaussians representation, or length scales in the GP prior case. [sent-533, score-0.418]
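As a concrete instance of the scale-mixture case, the heavy-tailed Laplace factors used elsewhere in the paper can be drawn by mixing a Gaussian over an exponentially distributed variance; this is a standard construction shown purely for illustration.

```python
import numpy as np

def laplace_via_scale_mixture(n, b=1.0, seed=None):
    """Laplace(0, b) samples via a scale mixture of Gaussians:
    v ~ Exponential(mean 2*b**2), then z | v ~ N(0, v)."""
    rng = np.random.default_rng(seed)
    v = rng.exponential(scale=2.0 * b ** 2, size=n)   # mixing variances
    return rng.normal(0.0, np.sqrt(v))

z = laplace_via_scale_mixture(100_000, seed=0)
print(np.mean(np.abs(z)))   # close to b = 1.0, the mean |z| of Laplace(0, b)
```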
42 1 Proposed Workflow We propose the workflow shown in Figure 1 to integrate all elements of SLIM, namely factor model and DAG inference, stochastic order search and model selection using predictive densities. [sent-538, score-0.211]
43 Select the top m_top candidates according to their frequency during inference in step 2. [sent-551, score-0.203]
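This selection step can be as simple as a frequency count; the sketch below assumes the sampler has recorded the ordering visited at every iteration (the variable names are hypothetical).

```python
from collections import Counter

def top_orderings(visited_orderings, m_top=10):
    """Rank the orderings visited during factor-model inference by frequency
    and keep the m_top most frequent ones as DAG candidates."""
    counts = Counter(map(tuple, visited_orderings))
    return [order for order, _ in counts.most_common(m_top)]

# Hypothetical sampler trace over 3 variables:
visited = [(1, 0, 2), (1, 0, 2), (0, 1, 2), (1, 0, 2), (2, 0, 1)]
print(top_orderings(visited, m_top=2))   # [(1, 0, 2), (0, 1, 2)]
```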
44 2 Computational Cost The cost of running the linear DAG with latent variables or the factor model is roughly the same, that is, O(N_s d^2 N), where N_s is the total number of samples including the burn-in period. [sent-577, score-0.233]
45 For the reanalysis of flow cytometry data using our models, quantitative model comparison favors the DAG with latent variables rather than the standard factor model or the pure DAG which was the paradigm used in the structure learning approach of Sachs et al. [sent-600, score-0.279]
46 As a summary statistic we use median values everywhere, and Laplace distributions for the latent factors if not stated otherwise. [sent-609, score-0.185]
47 [Figure residue: axis labels and tick marks for correct ordering rate plots.]
48 (a,c) Total correct ordering rates where DENSE is our factor model without sparsity prior and DS corresponds to DENSE but using the deterministic ordering search used in LiNGAM. [sent-665, score-0.428]
49 The crosses and horizontal lines correspond to LiNGAM while the triangles are accumulated correct orderings across candidates used by SLIM. [sent-668, score-0.208]
50 1 Order Search With this experiment we want to quantify the impact of using sparsity, stochastic ordering search and more than one ordering candidate, that is, m_top = 10 in total. [sent-671, score-0.291]
51 We have the following abbreviations for this experiment: DENSE is our factor model without the sparsity prior, that is, assuming that p(r_ij = 1) = 1 a priori. [sent-673, score-0.206]
52 (iii) Using more than one ordering candidate considerably improves the total correct ordering rate, for example, by almost 30% for d = 5, N = 200 and 35% for d = 10, N = 500. [sent-679, score-0.223]
53 (iv) The number of accumulated correct orderings found saturates as the number of candidates used increases, suggesting that further increasing m_top will not considerably change the overall results. [sent-680, score-0.277]
54 (v) The number of correct orderings tends to accumulate on the first candidate when N increases since the uncertainty of the estimation of the parameters in the factor model decreases accordingly. [sent-681, score-0.283]
55 even when using the same single candidate ordering search proposed by Shimizu et al. [sent-769, score-0.178]
56 (ix) In some cases the difference between SLIM and LiNGAM is very large, for example, for d = 10 using two candidates and N = 1000 is enough to obtain as many correct orderings as LiNGAM with N = 5000. [sent-771, score-0.208]
57 (c) First 50 (out of 92) ordering candidates produced by our method during inference and their frequency; the first m_top candidates were used to learn DAGs. [sent-844, score-0.383]
58 Figures 9(a), 9(b) and 9(c) show averaged performance measures respectively as ROC curves and the proportion of links reversed in the estimated model due to ordering errors. [sent-903, score-0.245]
59 We kept 20% of the data to compute the predictive densities to then select between all estimated DAG candidates and the factor model. [sent-928, score-0.207]
60 The true DAG rates tend to be larger than for factor models because it is more likely that the latter is confused with a DAG due to estimation errors or closeness to a DAG representation, than a DAG being confused with a factor model which is naturally more general. [sent-940, score-0.196]
61 Our ordering search procedure was able to find the right ordering 78 out of 100 times. [sent-1008, score-0.222]
62 (b,c,d) Median error, likelihood and test likelihood for all possible orderings and 10 independent repetitions. [sent-1030, score-0.225]
63 Note that when the error is zero in (b) the likelihoods are larger with respect to the remaining orderings in (c) and (d). [sent-1032, score-0.194]
64 Since we do not yet have an ordering search procedure for non-linear DAGs, we perform DAG inference for all possible orderings and data sets. [sent-1038, score-0.293]
65 Second, we need to validate that the likelihood is able to select the model with the lowest error and the correct ordering among all possible candidates, so we can use it in practice. [sent-1040, score-0.28]
66 These three non-linear DAG search algorithms have the great advantage of not requiring exhaustive enumeration of the orderings, as our method and others available in the literature do. [sent-1068, score-0.202]
67 Our model was able to find the right ordering, that is, duration → interval in all cases when the test likelihood was used but only 7 times with the training likelihood due to the proximity of the densities, see Figure 13(c). [sent-1076, score-0.223]
68 Focusing only on the observational data, we want to test all the possibilities of our model on this data set, namely, standard factor models, pure DAGs, DAGs with latent variables, non-linear DAGs and quantitative model comparison using test likelihoods. [sent-1139, score-0.279]
69 SLIM in Figure 15(c) finds a network almost equal to the one in Figure 15(b) apart from one reversed link, plc → pip3. [sent-1145, score-0.196]
70 In addition, there is just one false positive, the pair {jnk, p38}, even with a dedicated latent variable in the factor model mixing matrix shown in Figure 16(b), thus we cannot attribute such a false positive to estimation errors. [sent-1148, score-0.367]
71 A total of 211 ordering candidates were produced during the inference, out of approximately 10^7 possible, and only m_top = 10 of them were used in the structure search step. [sent-1149, score-0.336]
72 We also examined the estimated factor model in Figure 16(b) and we found that several factors could correspond respectively to three unmeasured proteins, namely pi3k in factors 9 and 11, m3 (mapkkk, mek4/7) and m4 (mapkkk, mek3/6) in factor 7, ras in factors 4 and 6. [sent-1154, score-0.349]
73 However the number [sentence truncated by page-break and figure residue: running header, estimated signaling networks over raf, mek, plc, pip2, pip3, erk, akt, pka, pkc, p38, jnk and latent nodes l1, l2; recoverable panel test log-likelihoods: (c) log L_DAG = −4.10e3, (b) log L_DAG = −4.30e3, (d) log L_DAG = −3.[truncated]].
76 Estimated structure using SLIM: (b) using the true ordering, (c) obtaining the ordering from the stochastic search, (d) top DAG with 2 latent variables and (e) the runner-up (in test likelihood). [sent-1164, score-0.201]
77 The poor performance of LiNGAM is difficult to explain but the large amount of reversed links may be due to the FastICA based deterministic ordering search procedure. [sent-1170, score-0.243]
78 The results obtained by the DAG with 2 a priori assumed latent variables are shown in Figures 15(d) and 15(e), corresponding to the first and second DAG candidates in terms of test likelihoods. [sent-1172, score-0.203]
79 [Figure residue: panels showing the estimated network (raf, mek, plc, pip2, pip3, erk, akt, pka, pkc, p38, jnk), connection magnitudes, densities, structure errors and test log-likelihoods for the factor model (FM) and DAG; the quoted log L_FM value is truncated.]
81 (c) Test likelihoods for the best DAG (dashed) and the factor model (solid). [sent-1179, score-0.198]
82 (d) Test likelihoods (squares) and structure errors (circles) including reversed links for all candidates. [sent-1180, score-0.187]
83 It is very interesting to see how, due to the link between pik3 and ras, which our model cannot represent directly, the second inferred latent variable is detecting signals pointing towards pip2 and plc. [sent-1184, score-0.235]
84 Comparing Figures 15(c) and 15(e) reveals two differences in the observed part, a false negative pip3 → plc and a new true (reversed) positive mek → pka. [sent-1186, score-0.194]
85 This candidate is particularly interesting because the first latent variable captures the connectivity of pik3 while connecting itself to plc due to the lack of connectivity between pip3 and plc. [sent-1187, score-0.34]
86 Moreover, the second latent variable resembles ras and captures the link between pik3 and ras as a link from itself to pip3. [sent-1188, score-0.196]
87 In terms of median test likelihoods, the model in Figure 15(d) is only marginally better than the factor model in Figure 16(b) and in turn marginally worse than the DAG in Figure 15(e). [sent-1190, score-0.203]
88 In particular there are only two differences, plc → pip2 and jnk → p38 are missing, meaning that at least in this case there are no false positives in the non-linear DAG. [sent-1192, score-0.194]
89 The mixing matrix obtained by our unrestricted model in Figure 17(c) is clearly denser than the other two, suggesting that there are different ways of connecting genes and transcription factors and still reconstruct the transcription factor activities given the observed gene expression data. [sent-1225, score-0.328]
90 The key ingredients for our Bayesian models are slab and spike priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities (test likelihoods) to perform the comparison. [sent-1287, score-0.354]
91 A set of candidate orderings is produced by stochastic search during the factor model inference. [sent-1288, score-0.327]
92 We also show that the DAG with latent variables can be fully identifiable and that SLIM can be extended to the non-linear case (SNIM - Sparse Non-linear Identifiable Multivariate modeling), if the ordering of the variables is provided or can be tested by exhaustive enumeration. [sent-1292, score-0.242]
93 We believe that the priors that give rise to sparse models in the fully Bayesian inference setting, like the two-level slab (continuous) and spike (point-mass at zero) priors used, are very powerful tools for simultaneous model and parameter inference. [sent-1300, score-0.438]
94 Although the posterior distributions for slab and spike priors will be non-convex, it is our experience that inference with blocked Gibbs sampling actually has very good convergence properties. [sent-1302, score-0.304]
95 In the two-level approach, one uses a hierarchy of two slab and spike priors. [sent-1303, score-0.209]
96 Instead of letting this parameter be controlled by a single Beta-distribution (one level approach) we have a slab and spike distribution on it with a Beta-distributed slab component biased towards one. [sent-1307, score-0.331]
97 Directly specifying the distributions of the latent variables in order to obtain identifiability in the general DAG with latent variables requires having different distributions for the driving signals of the observed variables and latent variables. [sent-1317, score-0.456]
98 The main problem is that the non-linear DAG has no equivalent factor model representation so we cannot directly exploit the permutation candidates we find in SLIM. [sent-1326, score-0.296]
99 However, as long as the non-linearities are weak, one might in principle use the permutation candidates found in a factor model, that is, the linear effects will determine the correct ordering of the variables. [sent-1327, score-0.339]
100 Sampling from the conditional distributions for τ_ij can be done using τ_ij^{-1} | d_ij, t_s, t_r ∼ Gamma(t_s + 1/2, t_r + d_ij^2 / (2ψ_i)). [sent-1382, score-0.209]
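Below is a sketch of this Gibbs step, under the assumption that the Gamma above is parameterized by shape and rate (NumPy's sampler takes shape and scale, hence the reciprocal); the hyperparameter values t_s = 2 and t_r = 1 are the ones quoted earlier in the text.

```python
import numpy as np

def sample_tau_inverse(d_ij, psi_i, t_s=2.0, t_r=1.0, rng=None):
    """Gibbs draw of tau_ij^{-1} | d_ij, t_s, t_r ~ Gamma(t_s + 1/2,
    t_r + d_ij**2 / (2 * psi_i)), with Gamma in shape/rate form (assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    shape = t_s + 0.5
    rate = t_r + d_ij ** 2 / (2.0 * psi_i)
    return rng.gamma(shape, 1.0 / rate)     # numpy uses shape and scale

tau_ij = 1.0 / sample_tau_inverse(d_ij=0.7, psi_i=0.1)
```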
wordName wordTfidf (topN-words)
[('dag', 0.388), ('slim', 0.374), ('lingam', 0.259), ('jn', 0.209), ('dags', 0.151), ('dentifiable', 0.145), ('enao', 0.145), ('inther', 0.145), ('odeling', 0.145), ('ultivariate', 0.145), ('snim', 0.125), ('slab', 0.122), ('orderings', 0.117), ('latent', 0.112), ('cslim', 0.111), ('wald', 0.095), ('inear', 0.095), ('candidates', 0.091), ('sachs', 0.09), ('ordering', 0.089), ('spike', 0.087), ('sparsity', 0.085), ('driving', 0.085), ('parse', 0.085), ('permutation', 0.084), ('plc', 0.083), ('likelihoods', 0.077), ('factor', 0.075), ('reversed', 0.074), ('jnk', 0.069), ('mek', 0.069), ('mtop', 0.069), ('nca', 0.069), ('pkc', 0.069), ('qi', 0.069), ('duration', 0.069), ('cl', 0.064), ('hoyer', 0.063), ('raf', 0.062), ('transcription', 0.06), ('ql', 0.058), ('age', 0.056), ('akt', 0.055), ('erk', 0.055), ('henao', 0.055), ('pka', 0.055), ('regulondb', 0.055), ('identi', 0.054), ('laplace', 0.054), ('triangular', 0.054), ('likelihood', 0.054), ('zl', 0.053), ('shimizu', 0.053), ('ik', 0.053), ('priors', 0.052), ('connectivity', 0.05), ('mixing', 0.05), ('ground', 0.049), ('kagan', 0.048), ('kao', 0.048), ('ldag', 0.048), ('pnl', 0.048), ('pf', 0.048), ('gamma', 0.048), ('hyv', 0.047), ('model', 0.046), ('dense', 0.045), ('sr', 0.045), ('candidate', 0.045), ('bayesian', 0.045), ('search', 0.044), ('ci', 0.044), ('inference', 0.043), ('carvalho', 0.042), ('causal', 0.042), ('false', 0.042), ('interventional', 0.042), ('mmiss', 0.042), ('nan', 0.042), ('ras', 0.042), ('permuted', 0.041), ('exhaustive', 0.041), ('figures', 0.041), ('densities', 0.041), ('ss', 0.04), ('gp', 0.04), ('ability', 0.04), ('network', 0.039), ('beta', 0.039), ('cd', 0.039), ('rinen', 0.038), ('parsimonious', 0.037), ('factors', 0.037), ('median', 0.036), ('zd', 0.036), ('links', 0.036), ('sparse', 0.036), ('hik', 0.035), ('skeleton', 0.035), ('signals', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks
2 0.11924003 23 jmlr-2011-DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model
Author: Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, Kenneth Bollen
Abstract: Structural equation models and Bayesian networks have been widely used to analyze causal relations between continuous variables. In such frameworks, linear acyclic models are typically used to model the data-generating process of variables. Recently, it was shown that use of non-Gaussianity identifies the full structure of a linear acyclic model, that is, a causal ordering of variables and their connection strengths, without using any prior knowledge on the network structure, which is not the case with conventional methods. However, existing estimation methods are based on iterative search algorithms and may not converge to a correct solution in a finite number of steps. In this paper, we propose a new direct method to estimate a causal ordering and connection strengths based on non-Gaussianity. In contrast to the previous methods, our algorithm requires no algorithmic parameters and is guaranteed to converge to the right solution within a small fixed number of steps if the data strictly follows the model, that is, if all the model assumptions are met and the sample size is infinite. Keywords: structural equation models, Bayesian networks, independent component analysis, non-Gaussianity, causal discovery
3 0.1176575 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
Author: Vianney Perchet
Abstract: We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but receives instead random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved in average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Voronoï diagram but instead of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal, as well as the usual external, regret at stage n by O(n^{-1/3}), which is known to be optimal. Keywords: repeated games, on-line learning, regret, partial monitoring, calibration, Voronoï and Laguerre diagrams
4 0.087876283 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
5 0.082256161 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
6 0.065389842 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
7 0.065002955 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
8 0.060250901 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
9 0.059502531 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
10 0.057887964 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
11 0.05448696 54 jmlr-2011-Learning Latent Tree Graphical Models
12 0.051435776 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
13 0.049859911 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
14 0.045479376 59 jmlr-2011-Learning with Structured Sparsity
15 0.044859927 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
16 0.042874172 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
17 0.04127159 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
18 0.040746972 12 jmlr-2011-Bayesian Co-Training
19 0.040289246 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
20 0.038278703 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
topicId topicWeight
[(0, 0.227), (1, -0.101), (2, -0.073), (3, 0.043), (4, -0.028), (5, 0.007), (6, 0.053), (7, 0.082), (8, 0.045), (9, 0.101), (10, 0.105), (11, -0.054), (12, 0.127), (13, 0.151), (14, 0.021), (15, -0.151), (16, -0.048), (17, -0.036), (18, -0.064), (19, 0.223), (20, 0.116), (21, 0.468), (22, -0.101), (23, -0.049), (24, 0.047), (25, -0.074), (26, 0.005), (27, 0.0), (28, 0.112), (29, -0.117), (30, -0.042), (31, -0.048), (32, -0.129), (33, 0.044), (34, 0.064), (35, 0.047), (36, 0.031), (37, -0.017), (38, 0.014), (39, 0.037), (40, -0.015), (41, 0.009), (42, 0.048), (43, -0.028), (44, 0.024), (45, 0.01), (46, 0.028), (47, -0.01), (48, 0.025), (49, -0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.92539436 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks
2 0.79929012 23 jmlr-2011-DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model
Author: Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, Kenneth Bollen
Abstract: Structural equation models and Bayesian networks have been widely used to analyze causal relations between continuous variables. In such frameworks, linear acyclic models are typically used to model the data-generating process of variables. Recently, it was shown that use of non-Gaussianity identifies the full structure of a linear acyclic model, that is, a causal ordering of variables and their connection strengths, without using any prior knowledge on the network structure, which is not the case with conventional methods. However, existing estimation methods are based on iterative search algorithms and may not converge to a correct solution in a finite number of steps. In this paper, we propose a new direct method to estimate a causal ordering and connection strengths based on non-Gaussianity. In contrast to the previous methods, our algorithm requires no algorithmic parameters and is guaranteed to converge to the right solution within a small fixed number of steps if the data strictly follows the model, that is, if all the model assumptions are met and the sample size is infinite. Keywords: structural equation models, Bayesian networks, independent component analysis, non-Gaussianity, causal discovery
3 0.40920749 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
Author: Vianney Perchet
Abstract: We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but receives instead random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved in average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Voronoï diagram but instead of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal, as well as the usual external, regret at stage n by O(n^{-1/3}), which is known to be optimal. Keywords: repeated games, on-line learning, regret, partial monitoring, calibration, Voronoï and Laguerre diagrams
4 0.33612904 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
5 0.31676614 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
Author: Daniil Ryabko
Abstract: A sequence x1 , . . . , xn , . . . of discrete-valued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure µ belongs to an arbitrary but known class C of process measures. The non-realizable case is when µ is completely arbitrary, but the prediction performance is measured with respect to a given set C of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. We show that if the quality of prediction is measured using the total variation distance, then these problems coincide, while if it is measured using the expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists it can be obtained as a Bayes mixture over a countable subset of C . We also obtain several characterization of those sets C for which solutions to the considered problems exist. As an illustration to the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes. It should be emphasized that the framework is completely general: the processes measures considered are not required to be i.i.d., mixing, stationary, or to belong to any parametric family. Keywords: sequence prediction, time series, online prediction, realizable sequence prediction, non-realizable sequence prediction
6 0.31439266 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
7 0.28954512 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
8 0.2656053 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
9 0.26514542 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
10 0.24804257 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
11 0.24242866 12 jmlr-2011-Bayesian Co-Training
12 0.23207656 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
13 0.22062558 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
14 0.21872739 54 jmlr-2011-Learning Latent Tree Graphical Models
15 0.21420731 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
16 0.20951551 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
17 0.20596044 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
18 0.20560445 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
19 0.19193421 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
20 0.18893671 61 jmlr-2011-Logistic Stick-Breaking Process
topicId topicWeight
[(4, 0.043), (6, 0.014), (9, 0.035), (10, 0.045), (24, 0.059), (31, 0.08), (32, 0.024), (41, 0.068), (49, 0.349), (60, 0.018), (65, 0.01), (70, 0.018), (71, 0.017), (73, 0.043), (78, 0.055), (90, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.70658779 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks
2 0.37589756 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel
Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP . The key insight is to relate ranking criteria as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the ’margin’ is for Support Vector Machines for classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, and additive models based on primal-dual kernel machines, and for sparse models is indicated. Given n observations, the present method solves a quadratic program existing of O (n) constraints and O (n) unknowns, where most existing risk minimization approaches to ranking problems typically result in algorithms with O (n2 ) constraints or unknowns. We specify the MINLIP method for three different cases: the first one concerns the preference learning problem. Secondly it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets. Keywords: support vector machines, preference learning, ranking models, ordinal regression, survival analysis c
3 0.36094162 48 jmlr-2011-Kernel Analysis of Deep Networks
Author: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller
Abstract: When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed layer after layer. Keywords: deep networks, kernel principal component analysis, representations
4 0.35993114 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
5 0.3579343 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
6 0.35716549 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
7 0.35645196 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
8 0.3561123 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
9 0.35460868 91 jmlr-2011-The Sample Complexity of Dictionary Learning
10 0.3545495 16 jmlr-2011-Clustering Algorithms for Chains
11 0.35265663 12 jmlr-2011-Bayesian Co-Training
12 0.35115951 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
13 0.35081857 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
14 0.34899303 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
16 0.34674415 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
17 0.3453021 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
18 0.34503686 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
19 0.34462276 66 jmlr-2011-Multiple Kernel Learning Algorithms
20 0.34419549 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models