jmlr jmlr2013 jmlr2013-15 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
Reference: text
sentIndex sentText sentNum sentScore
1 We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. [sent-10, score-0.142]
2 Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation 1. [sent-13, score-0.175]
3 Introduction Canonical correlation analysis (CCA), originally introduced by Hotelling (1936), extracts linear components that capture correlations between two multivariate random variables or data sets. [sent-14, score-0.165]
4 While the analysis part of Browne (1979) is limited to the special case of CCA, the generic IBFA model not only describes the correlations between the data sets but also provides components explaining the linear structure within each of the data sets. [sent-27, score-0.147]
5 If the analysis focuses on only the correlating components, or equivalently the latent variables shared by both data sets, the solution becomes equivalent to CCA. [sent-29, score-0.259]
6 Using the term CCA emphasizes finding the correlations and shared components, whereas IBFA emphasizes the decomposition into shared and data source-specific components. [sent-35, score-0.203]
7 We build on earlier work (2011), and in particular provide two efficient inference algorithms, a variational approximation and a Gibbs sampler, that automatically learn the structure of the model that is, in the general case, unidentifiable. [sent-37, score-0.183]
8 The model is solved as a generic factor analysis (FA) model with a specific group-wise sparsity prior for the factor loadings or projections, and an additional constraint tying the residual variances within each group to be the same. [sent-38, score-0.155]
9 At the core of the generative process is an unobserved latent variable z ∈ RK×1 , which is transformed via linear mappings to the observation spaces to represent the two multivariate random variables x(1) ∈ RD1 ×1 and x(2) ∈ RD2 ×1 . [sent-51, score-0.145]
10 1 Inter-battery Factor Analysis In the latent variable model studied in this work, z ∼ N(0, I), z(m) ∼ N(0, I), x(m) ∼ N(A(m) z + B(m) z(m), Σ(m)) (2), following the probabilistic interpretation of inter-battery factor analysis by Browne (1979). [sent-61, score-0.16]
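To make the generative process of (2) concrete, here is a minimal NumPy sketch of drawing data from it; all sizes, the random mappings, and the isotropic noise level are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper): two views with
# K shared and K_m view-specific components.
D = [50, 40]        # view dimensions D_1, D_2
K, Km = 2, [1, 1]   # number of shared and view-specific components
N = 100             # number of samples

# Linear mappings A^(m), B^(m) and residual noise; drawn randomly for illustration.
A = [rng.normal(size=(D[m], K)) for m in range(2)]
B = [rng.normal(size=(D[m], Km[m])) for m in range(2)]
sigma = [0.1, 0.1]  # residual std per view, i.e., Sigma^(m) = sigma_m^2 * I

# The generative process of (2): shared z, view-specific z^(m), then the views.
Z = rng.normal(size=(N, K))
X = []
for m in range(2):
    Zm = rng.normal(size=(N, Km[m]))
    X.append(Z @ A[m].T + Zm @ B[m].T + sigma[m] * rng.normal(size=(N, D[m])))
```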
11 The shared latent variables z capture the variation common to both data sets, and they are transformed to the observation 1. [sent-67, score-0.233]
12 The shaded nodes x(m) denote the two observed random variables, and the latent variables z capture the correlations between them. [sent-70, score-0.152]
13 The variation specific to each view is modeled with view-specific latent variables z(m) . [sent-71, score-0.15]
14 The remaining variation is modeled by the latent variables z(m) ∈ RKm ×1 specific to each data set, transformed to the observation space by another linear mapping B(m) z(m) , where B(m) ∈ RDm ×Km . [sent-74, score-0.15]
15 The process starts by integrating out the view-specific latent variables z(m) , to reach a model that has explicit components only for the shared variation similarly to how CCA only explains the correlations. [sent-83, score-0.343]
16 The latent representation of this model is simpler, only containing z instead of three separate sets of latent variables, but the diagonal covariance of the IBFA model is replaced with B(m) B(m)T + Σ(m). [sent-85, score-0.328]
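For reference, the marginalized model described here can be written out as follows; this is a sketch of the standard Gaussian marginalization over z^(m), using only the notation introduced above.

```latex
x^{(m)} \sim \mathcal{N}\!\left( A^{(m)} z,\; B^{(m)} {B^{(m)}}^{\top} + \Sigma^{(m)} \right)
```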
17 The model comes with three separate sets of latent variables with component numbers K, K1 and K2 . [sent-105, score-0.198]
18 However, individual components can be moved between these sets without influencing the likelihood of the observed data; removal of a shared component can always be compensated by introducing two view-specific components, one for each data set, that have the same latent variables. [sent-106, score-0.273]
19 This is done either by explicitly parameterizing the noise through a covariance matrix Ψ(m) as in (3) or by the separate view-specific components B(m) z(m) as in (2). [sent-113, score-0.148]
20 It may be possible to identify these components as view-specific in a post-processing step to reach an interpretation similar to that of CCA, but directly modeling the view-specific variation as separate components has obvious advantages. [sent-117, score-0.213]
21 In particular, such models are likely to misinterpret strong view-specific variation as a shared effect, since they have no means of explaining it otherwise. [sent-124, score-0.139]
22 Inference For learning the IBFA model we need to infer both the latent signals z and z(m) and the linear projections A(m) and B(m) from data. [sent-127, score-0.165]
23 The posterior of the model then becomes one where the number of components is automatically selected by pushing the αk(m) of unnecessary components towards infinity. [sent-141, score-0.208]
24 This is due to the ARD prior updates being more efficient in the variational framework, whereas the latter is easier to extend, as demonstrated by Klami and Kaski (2007) by using the Bayesian CCA as part of a non-parametric hierarchical model. [sent-156, score-0.187]
25 Hence the direct Bayesian treatment of CCA needs to resort to either using very strong priors (for example, favoring diagonal covariance matrices and hence regularizing the model towards Bayesian PCA), or it will end up doing inference over a very wide posterior. [sent-162, score-0.153]
26 A drawback of the model, as discussed above, is that it requires learning three separate sets of components and that the solution is unidentifiable with respect to allocating components to the three groups. [sent-172, score-0.18]
27 We solve the BIBFA model by doing inference directly for (5) and learn the structure of W by imposing group-wise sparsity for the components (columns of W), which results in the model automatically converging to a solution that matches (4) (up to an arbitrary re-ordering of the columns). [sent-187, score-0.25]
28 Earlier work (2010) introduced a similar sparsity constraint for learning factorized latent spaces; our approach can be seen as a Bayesian realization of the same idea, applied to canonical correlation analysis. [sent-192, score-0.207]
29 Similarly to how ARD has earlier been used to choose the number of components, the group-wise ARD makes unnecessary components wk(m) inactive for each of the views separately. [sent-196, score-0.153]
30 The components needed for modeling the shared response will have a small αk(m) (that is, large variance) for both views, whereas the view-specific components will have a small αk(m) for the active view and a large one for the inactive one. [sent-197, score-0.27]
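As an illustration of this interpretation, the sketch below classifies components from a learned matrix of ARD precisions; the array layout and the activity threshold are assumptions made for the example, not the paper's exact post-processing rule.

```python
import numpy as np

def classify_components(alpha, threshold=1e3):
    """Classify components from ARD precisions alpha of shape (2, K).

    alpha[m, k] is the precision of component k in view m; a small value
    means large variance, i.e., the component is active in that view.
    The threshold is an illustrative choice: inactive precisions are
    typically pushed orders of magnitude above the active ones.
    """
    active = alpha < threshold
    labels = []
    for k in range(alpha.shape[1]):
        if active[0, k] and active[1, k]:
            labels.append("shared")
        elif active[0, k] or active[1, k]:
            labels.append("view-specific")
        else:
            labels.append("inactive")
    return labels

# Example: two shared components, one specific to each view, two inactive ones.
alpha = np.array([[1.0, 2.0, 1.5, 1e6, 1e6, 1e6],
                  [0.8, 1.2, 1e6, 1.1, 1e6, 1e6]])
print(classify_components(alpha))
```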
31 Finally, the model still automatically selects the total number of components by making both views inactive for unnecessary components. [sent-198, score-0.193]
32 Similarly, the explorative data analysis experiment illustrated in Figures 8 and 9 is invariant to the actual components and only relies on the total amount of contribution each feature has on the shared variation. [sent-208, score-0.153]
33 The ARD prior efficiently pushes the variance of inactive components towards zero, and hence selecting the threshold is often easy in practice. [sent-211, score-0.145]
34 As discussed earlier, the model is unidentifiable with respect to linear transformations of y, and only the prior p(y) is influenced by the allocation of the components into shared and view-specific ones. [sent-220, score-0.24]
35 The prior, in turn, assumes independent latent variables, implying that the optimal solution will result in an R that makes the latent variables of the posterior approximation also maximally independent. [sent-226, score-0.26]
36 We prefer to use the term independence, as it better fits the notion of assuming independent latent variables and may be the more accurate term in extensions; for other priors and inference algorithms independence need not equal orthogonality. [sent-229, score-0.191]
37 This matches how the classical CCA solution is identified; the latent variables are assumed orthogonal, instead of assuming orthogonal projections (as in PCA). [sent-238, score-0.192]
38 That property also allows deriving a more efficient algorithm for optimizing the variational approximation, following the idea of parameter-expanded variational Bayes (Qi and Jaakkola, 2007; Luttinen and Ilin, 2010). [sent-239, score-0.184]
39 In summary, the above model formulation with the associated variational approximation provides a fully Bayesian treatment for the IBFA model. [sent-243, score-0.151]
40 Here we briefly discuss possible alternatives, and derive one practical implementation that uses sampling-based inference instead of the variational approximation described above. [sent-253, score-0.143]
41 As general properties, such priors allow continuous inference procedures that are often efficient, but it is not always trivial to separate low-activity components from inactive ones for interpretative purposes. [sent-261, score-0.193]
42 Efficient inference is done in the factor analysis model for x = [x(1); x(2)] with a group-wise sparsity prior for the projection matrix: y ∼ N(0, I), x ∼ N(Wy, Σ), W(m) ∼ ARD(α0, β0) (7). [sent-284, score-0.166]
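The correspondence between the joint formulation (7) and the IBFA blocks can be sketched as below; the block sizes are illustrative, and the zero blocks are exactly the structure that the group-wise sparsity prior is meant to recover.

```python
import numpy as np

rng = np.random.default_rng(1)
D1, D2, K, K1, K2 = 50, 40, 2, 1, 1  # illustrative sizes

A1, A2 = rng.normal(size=(D1, K)), rng.normal(size=(D2, K))
B1, B2 = rng.normal(size=(D1, K1)), rng.normal(size=(D2, K2))

# Stacking the views as x = [x^(1); x^(2)] and y = [z; z^(1); z^(2)],
# the IBFA projections arrange into a (D1 + D2) x (K + K1 + K2) matrix
# whose columns are active either in both views (shared) or in one view only.
W = np.block([
    [A1, B1,                 np.zeros((D1, K2))],
    [A2, np.zeros((D2, K1)), B2                ],
])
print(W.shape)  # (90, 4)
```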
43 An alternative inference scheme replaces the above ARD prior with the group-wise spike-and-slab prior of (6) and draws samples from the posterior using Gibbs sampling. [sent-288, score-0.173]
44 The full model is specified as z ∼ N(0, I), x(m) ∼ N(A(m) z, Ψ(m)), A(m) ∼ ARD(α0, β0), Ψ(m) ∼ IW(S0, ν0) (8), and inference follows the variational updates provided by Wang (2007). [sent-292, score-0.209]
45 We initialize the model by sampling the latent variables from the prior, and recommend running the algorithm multiple times and choosing the solution with the best variational lower bound. [sent-295, score-0.268]
46 However, since the ARD prior (or the spike-and-slab prior for the Gibbs sampler variant) automatically shuts down components that are not needed, the parameter can safely be set large enough; the only drawback of using too large a K is increased computation time. [sent-298, score-0.208]
47 Illustration In this section we demonstrate the BIBFA model on artificial data, in order to illustrate the factorization into shared and data set-specific components, as well as to show that the inference procedures converge to the correct solution. [sent-302, score-0.174]
48 The results are illustrated primarily from the point-of-view of the variational inference solution; the variational approximation is easier to visualize and compare with alternative methods. [sent-304, score-0.235]
49 1 Artificial Example First, we validate the model on artificial data drawn from a model from the same model family, with parameters set up so that it contains all types of components (view-specific and shared components). [sent-307, score-0.273]
50 The latent signals y were manually constructed to produce components that can be visually matched with the true ones for intuitive assessment of the results. [sent-308, score-0.166]
51 Also the α(m) parameters, controlling the activity of each latent component in both views, were manually specified. [sent-309, score-0.139]
52 The left column of Figure 3 illustrates the data generation, showing the four latent components, two of which are shared between the two views. [sent-311, score-0.179]
53 We generated N = 100 samples with D1 = 50 and D2 = 40 dimensions, and applied the BIBFA model with K = 6 components to show that it learns the correct latent components and automatically discards the excess ones. [sent-312, score-0.276]
54 The results of the variational inference are shown in the middle column of Figure 3; the Gibbs sampler produces virtually indistinguishable results. [sent-313, score-0.187]
55 The learned matrix of α-values (and the corresponding elements in W) reveals that the model extracted exactly four components, correctly identifying two of them as shared components and two as view-specific ones (one for each data set). [sent-314, score-0.193]
56 The actual latent components also correspond to the ones used for generating the data. [sent-315, score-0.166]
57 We also see how the model is invariant to the sign of y, but that it gives the actual components instead of a linear combination of those, demonstrating that the variational approximation indeed resolves the rotational ambiguity that would remain, for instance, in the maximum likelihood solution. [sent-317, score-0.202]
58 We see that the true parameter values fall nicely within the posterior and both the variational approximation and the Gibbs sampler provide almost the same posterior. [sent-320, score-0.164]
59 We generated data with 4 true correlating components drawn from the prior, drawing N independent samples of D1 = D2 dimensions, and measured the performance by comparing the average of the four largest correlations ρk, normalized by the ground truth correlations. [sent-332, score-0.147]
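A minimal sketch of this performance measure follows; whether the normalization is applied to the averages or component-wise is an assumption of the example, as the text leaves it open.

```python
import numpy as np

def normalized_correlation_score(rho_est, rho_true, k=4):
    """Average of the k largest estimated correlations, normalized by the
    average of the k largest ground-truth correlations (one possible reading
    of the measure described in the text)."""
    est = np.sort(np.asarray(rho_est))[::-1][:k]
    true = np.sort(np.asarray(rho_true))[::-1][:k]
    return est.mean() / true.mean()
```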
60 We compare the two variants of the Bayesian CCA, denoting by BCCA a model parameterized with full covariance matrices (8) and by BIBFA the fully factorized model (7), with both classical CCA and a regularized CCA (RCCA). [sent-336, score-0.166]
61 The left column shows four components in the generated data, the first two being shared between the two views and the last two being specific to just one view. [sent-346, score-0.208]
62 Components 5 and 6 are shared, revealed by non-zero variance for both views, components 2 and 4 are the two view-specific components, and the unnecessary components 1 and 3 have been suppressed to the prior in the sense that their mean and variance match those of the prior. [sent-349, score-0.187]
63 The small lines depict one standard deviation, revealing that the model is more confident on its predictions for the shared components, due to more data (D1 + D2 features compared to just D1 or D2 ) available for inferring them. [sent-350, score-0.146]
64 To further illustrate the behavior of BIBFA, we plot in Figure 6 the estimated number of shared and view-specific components for both the variational and Gibbs sampling variants. [sent-367, score-0.245]
65 We see that both inference algorithms are conservative in the sense that for very small sample sizes they miss some of the components, using the residual noise term to model the variation that cannot be reliably explained with such limited data. [sent-369, score-0.148]
66 For a reasonable number of samples, both the variational approximation (VB) and the spike-and-slab sampler (Gibbs) learn the correct number of both shared components (red lines) and total components (black lines) for all three dimensionalities (subplots). [sent-409, score-0.381]
67 The solid lines correspond to the results of the variational approximation, averaged over 10 random initializations, whereas the dashed line shows the mean of the posterior samples for the Gibbs sampler and the shaded region covers the values between the 5% and 95% quantiles. [sent-411, score-0.164]
68 1 Modifying the Generative Model Since the latent variable model is described through a generative process, it is straightforward to change the distributional assumptions in the model to arrive at alternatives designed for specific purposes. [sent-420, score-0.206]
69 That is, the model automatically learns topics that are shared between the views as well as topics specific to each view, using a hierarchical Dirichlet process (HDP; Teh et al.). [sent-440, score-0.2]
70 They introduced sparsity priors and associated variational approximations for Bayesian PCA and the full IBFA model, but did not provide empirical experiments with the latter. [sent-444, score-0.145]
71 A related approach (2009) used an element-wise ARD prior to obtain sparsity, though the method is actually not a proper CCA model since it does not model view-specific variation at all. [sent-446, score-0.162]
72 Later work (2008) extended the probabilistic formulation to create Gaussian process latent variable models (GP-LVM) for modeling dependencies between two data sets. [sent-452, score-0.201]
73 More recent work (2012) extended the approach to a Bayesian multi-view model that uses group-wise sparsity to identify shared and view-specific latent manifolds for a GP-LVM model, using an ARD prior very similar to the one used by Virtanen et al. [sent-457, score-0.294]
74 Other work (2010) presented clustering models that capture the dependencies between two views with the cluster structure while modeling view-specific variation with another set of clusters. [sent-462, score-0.152]
75 However, the exact definition of the additive noise in the factorization is crucial; BIBFA includes explicit components for modeling view-specific variation (or they are modeled with full covariance matrices as in the earlier Bayesian CCA solutions). [sent-473, score-0.183]
76 Subsequent work (2011) extended CMFs to a localized factor model (LMF) that allows separate latent variables z(m) for the views and models them as a linear combination of global latent profiles u. [sent-477, score-0.346]
77 By iteratively treating the view-specific and shared components as the explanatory factors they can learn the maximum likelihood solution of IBFA (and hence CCA) through eigen-decompositions, but their general formulation also applies to other data analysis scenarios. [sent-481, score-0.193]
78 Archambeau and Bach (2009) already mention that the generative model directly generalizes to more than two views, but they do not show that their inference solution would provide meaningful results for multiple views. [sent-483, score-0.142]
79 Other authors (2011), in turn, created a hierarchical topic trajectory model (HTTM) by using CCA as the observation model in a hidden Markov model (HMM). [sent-504, score-0.142]
80 Direct inspection of the weights in the shared components then reveals the cancer-associated genes; a high weight implies an association between the copy number and gene expression, relating the gene to the cancer under study. [sent-551, score-0.238]
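A sketch of the kind of weight-based gene ranking described here; combining the two views by summing absolute weights over the shared components is an illustrative choice, not necessarily the exact statistic used in the experiments.

```python
import numpy as np

def rank_genes_by_shared_weight(W1, W2, shared_idx):
    """Rank genes by their total weight in the shared components.

    W1, W2: projection matrices of the two views (genes x components),
            with rows assumed to be aligned to the same genes.
    shared_idx: indices of components identified as shared.
    Returns gene indices sorted from highest to lowest combined weight.
    """
    score = (np.abs(W1[:, shared_idx]).sum(axis=1)
             + np.abs(W2[:, shared_idx]).sum(axis=1))
    return np.argsort(score)[::-1]
```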
81 We repeated their experimental setup to obtain results directly comparable with their study, and measured the performance by the same measure, the area under curve (AUC) for retrieving known cancer genes (37 out of 4247 genes in Pollack, and 47 out of 7363 genes in Hyman). [sent-554, score-0.187]
82 We ran the BIBFA model for Kc between 5 and 60 components and chose the model with the best variational lower bound, resulting in Kc = 15 for Hyman and Kc = 40 for Pollack. [sent-555, score-0.242]
83 The BIBFA model ranks the genes based on the weight of that gene in both W(1) and W(2) , and finds the cancer genes with better accuracy than any of the methods studied in the recent comparison by Lahti et al. [sent-576, score-0.203]
84 In a sense, the model can be seen as a combination of purely unsupervised and supervised learning; both of the views can be considered as supervising the other view, yet the model is (here) defined as a generative description of the whole data collection. [sent-590, score-0.165]
85 In other words, the model squeezes the prediction through the shared latent variables z, the lower-dimensional representation capturing all the information flow from one data set to another. [sent-609, score-0.238]
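A sketch of this prediction path using point estimates of the loadings and isotropic view-2 noise; the full variational treatment uses posterior expectations instead, so this is only meant to show how the prediction flows through the shared components.

```python
import numpy as np

def predict_view1_from_view2(x2, W1, W2, tau2):
    """Posterior-mean prediction of x^(1) given x^(2).

    W1: (D1, K) loadings of view 1, W2: (D2, K) loadings of view 2,
    tau2: residual precision of view 2 (isotropic noise assumed here).
    Components whose loadings in W2 are (near) zero keep their prior mean of
    zero, so only the shared components carry information from x^(2) to x^(1).
    """
    K = W2.shape[1]
    Sigma_y = np.linalg.inv(np.eye(K) + tau2 * W2.T @ W2)  # cov of y | x^(2)
    m_y = tau2 * Sigma_y @ W2.T @ x2                        # mean of y | x^(2)
    return W1 @ m_y
```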
86 A similar observation also holds for the Gibbs sampler variant using the spike-and-slab prior (6); again the mean prediction only depends on the shared components. [sent-611, score-0.174]
87 When making predictions for new samples, we need to store all of the posterior samples and then run the sampler again for each of them to estimate the latent variables and the predicted X(1). [sent-613, score-0.187]
88 Due to this extra computational overhead for the sampler, we will next demonstrate the BIBFA in multi-label prediction tasks using only the variational inference variant. [sent-614, score-0.143]
89 We compare the BIBFA model (7) with both classical CCA and the standard Bayesian CCA with full covariance matrices (8), using the variational approximation of Wang (2007). [sent-624, score-0.196]
90 For BCCA we set the maximal number of components to K = min(D1 , D2 , 50), and for classical CCA we chose the number of components by 10-fold cross-validation within the training set. [sent-625, score-0.167]
91 Discussion In this paper we have reviewed work on probabilistic and Bayesian canonical correlation analysis, with a particular focus on the extensions made possible by the probabilistic interpretation. [sent-697, score-0.157]
92 The key is to make a low-rank assumption for the noise specific to each data set, which results in re-formulation of CCA as a more complex latent variable model called inter-battery factor analysis (IBFA) in the statistics literature (Tucker, 1958). [sent-704, score-0.158]
93 The computational difficulties stemming from introducing the extra latent variables are solved by clever use of a group-wise sparsity assumption. [sent-706, score-0.143]
94 Instead of explicitly instantiating several latent variables, we re-cast the IBFA model as a straightforward joint factor analysis model with a specific prior driving the components group-wise sparse, showing how the resulting model is equivalent to IBFA. [sent-707, score-0.287]
95 We use a mean-field variational approximation to approximate the posterior, with the factorization Q(Θ) = q(Y) ∏_{m=1}^{2} q(W(m)) q(α(m)) q(τm), where Θ denotes all of the parameters and latent variables. [sent-717, score-0.188]
96 For the latent variables we further assume column-wise independence (that is, the latent variables of observations are independent) and for projections row-wise independence (that is, each component is independent). [sent-718, score-0.283]
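Written out with these additional independence assumptions, the approximation factorizes as sketched below, with y_n denoting the latent variables of sample n and w_d^(m) the d-th row of W^(m); this is a sketch of the stated structure, not a new derivation.

```latex
Q(\Theta) \;=\; \prod_{n=1}^{N} q(\mathbf{y}_n)\;
\prod_{m=1}^{2} \left( q\big(\boldsymbol{\alpha}^{(m)}\big)\, q(\tau_m)
\prod_{d=1}^{D_m} q\big(\mathbf{w}_d^{(m)}\big) \right).
```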
97 BIBFA with Spike-and-slab Prior For inference with the spike-and-slab prior (6) we use Gibbs sampling, closely following the updates given for the element-wise sparse FA model by Knowles and Ghahramani (2011). [sent-758, score-0.164]
98 Below we show the sampling equations for yn as an example; the updates for α(m) and τ(m) can be easily modified from the variational updates given in Appendix A. [sent-773, score-0.168]
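As a concrete sketch of such an update, the standard factor-analysis conditional for y_n given the current projections and residual precisions is Gaussian and can be sampled as below; the per-dimension precision vector is an illustrative parameterization (in BIBFA the precisions are tied within each view).

```python
import numpy as np

def sample_yn(x_n, W, tau, rng):
    """Draw y_n from its conditional posterior given W and the noise precisions.

    x_n: concatenated observation [x^(1); x^(2)], length D1 + D2
    W:   current projection matrix, shape (D1 + D2, K)
    tau: residual precisions per dimension, length D1 + D2 (Sigma^-1 = diag(tau))
    The conditional is N(mean, cov) with cov = (I + W' diag(tau) W)^-1.
    """
    K = W.shape[1]
    cov = np.linalg.inv(np.eye(K) + W.T @ (tau[:, None] * W))
    mean = cov @ (W.T @ (tau * x_n))
    return rng.multivariate_normal(mean, cov)
```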
99 A Gaussian process latent variable model formulation of canonical correlation analysis. [sent-963, score-0.238]
100 Learning shared latent structure for image synthesis and robotic imitation. [sent-1029, score-0.179]
wordName wordTfidf (topN-words)
[('cca', 0.616), ('bibfa', 0.5), ('ibfa', 0.188), ('klami', 0.171), ('bcca', 0.159), ('bayesian', 0.137), ('ard', 0.129), ('anonical', 0.108), ('aski', 0.108), ('irtanen', 0.108), ('lami', 0.108), ('orrelation', 0.108), ('latent', 0.096), ('variational', 0.092), ('dm', 0.091), ('arto', 0.085), ('shared', 0.083), ('virtanen', 0.08), ('components', 0.07), ('kaski', 0.065), ('nalysis', 0.064), ('samuel', 0.064), ('rcca', 0.057), ('views', 0.055), ('genes', 0.054), ('inference', 0.051), ('prior', 0.047), ('lahti', 0.045), ('sampler', 0.044), ('browne', 0.044), ('canonical', 0.044), ('gibbs', 0.041), ('model', 0.04), ('kc', 0.04), ('correlating', 0.04), ('huopaniemi', 0.04), ('pint', 0.04), ('seppo', 0.04), ('archambeau', 0.039), ('correlation', 0.039), ('correlations', 0.037), ('covariance', 0.037), ('variation', 0.035), ('gene', 0.03), ('gamma', 0.03), ('generative', 0.03), ('projections', 0.029), ('editors', 0.029), ('pca', 0.029), ('inactive', 0.028), ('sparsity', 0.028), ('posterior', 0.028), ('classical', 0.027), ('rotation', 0.027), ('extensions', 0.026), ('updates', 0.026), ('priors', 0.025), ('bach', 0.025), ('cancer', 0.025), ('brain', 0.025), ('fujiwara', 0.024), ('yn', 0.024), ('component', 0.024), ('fmri', 0.024), ('yk', 0.024), ('probabilistic', 0.024), ('chromosome', 0.023), ('cmf', 0.023), ('damianou', 0.023), ('hyman', 0.023), ('mlknn', 0.023), ('pollack', 0.023), ('rdm', 0.023), ('tsoumakas', 0.023), ('viinikanoja', 0.023), ('dr', 0.023), ('revealing', 0.023), ('hierarchical', 0.022), ('dimensionalities', 0.022), ('aalto', 0.022), ('knowles', 0.022), ('noise', 0.022), ('ghahramani', 0.022), ('dependencies', 0.022), ('variants', 0.022), ('solution', 0.021), ('rt', 0.021), ('models', 0.021), ('yyt', 0.02), ('multilabel', 0.019), ('daum', 0.019), ('ilin', 0.019), ('leen', 0.019), ('rai', 0.019), ('modeling', 0.019), ('variables', 0.019), ('activity', 0.019), ('formulation', 0.019), ('zoubin', 0.019), ('separate', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
2 0.10151307 108 jmlr-2013-Stochastic Variational Inference
Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics
3 0.10038967 121 jmlr-2013-Variational Inference in Nonconjugate Models
Author: Chong Wang, David M. Blei
Abstract: Mean-field variational methods are widely used for approximate posterior inference in many probabilistic models. In a typical application, mean-field methods approximately compute the posterior with a coordinate-ascent optimization algorithm. When the model is conditionally conjugate, the coordinate updates are easily derived and in closed form. However, many models of interest—like the correlated topic model and Bayesian logistic regression—are nonconjugate. In these models, mean-field methods cannot be directly applied and practitioners have had to develop variational algorithms on a case-by-case basis. In this paper, we develop two generic methods for nonconjugate models, Laplace variational inference and delta method variational inference. Our methods have several advantages: they allow for easily derived variational algorithms with a wide class of nonconjugate models; they extend and unify some of the existing algorithms that have been derived for specific models; and they work well on real-world data sets. We studied our methods on the correlated topic model, Bayesian logistic regression, and hierarchical Bayesian logistic regression. Keywords: variational inference, nonconjugate models, Laplace approximations, the multivariate delta method
4 0.080803059 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont
Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. Keywords: group feature selection, generalized spike-and-slab priors, expectation propagation, sparse linear model, approximate inference, sequential experimental design, signal reconstruction
5 0.06604667 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
Author: Edward Challis, David Barber
Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes
6 0.058981847 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
7 0.049351331 90 jmlr-2013-Quasi-Newton Method: A New Direction
8 0.041095052 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
9 0.039343495 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
10 0.038666923 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
11 0.038281973 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
12 0.037857197 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
13 0.034724794 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
14 0.03087328 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
15 0.029626206 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
16 0.027017657 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
17 0.025915144 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
18 0.025758201 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
19 0.025565907 120 jmlr-2013-Variational Algorithms for Marginal MAP
20 0.024644485 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
topicId topicWeight
[(0, -0.16), (1, -0.204), (2, -0.001), (3, -0.025), (4, -0.037), (5, 0.014), (6, 0.027), (7, -0.005), (8, 0.042), (9, 0.0), (10, 0.059), (11, -0.042), (12, -0.117), (13, -0.061), (14, 0.017), (15, -0.008), (16, 0.025), (17, -0.044), (18, -0.013), (19, 0.021), (20, -0.005), (21, 0.038), (22, -0.103), (23, 0.079), (24, 0.041), (25, -0.022), (26, -0.053), (27, -0.074), (28, -0.046), (29, -0.037), (30, 0.08), (31, 0.0), (32, 0.008), (33, -0.074), (34, -0.031), (35, -0.071), (36, 0.114), (37, 0.035), (38, -0.054), (39, -0.108), (40, 0.019), (41, 0.076), (42, 0.044), (43, 0.007), (44, 0.059), (45, 0.127), (46, -0.024), (47, -0.218), (48, 0.061), (49, -0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.92326468 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
2 0.54542971 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont
Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. Keywords: group feature selection, generalized spike-and-slab priors, expectation propagation, sparse linear model, approximate inference, sequential experimental design, signal reconstruction
4 0.4276723 108 jmlr-2013-Stochastic Variational Inference
Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics
5 0.42729998 121 jmlr-2013-Variational Inference in Nonconjugate Models
Author: Chong Wang, David M. Blei
Abstract: Mean-field variational methods are widely used for approximate posterior inference in many probabilistic models. In a typical application, mean-field methods approximately compute the posterior with a coordinate-ascent optimization algorithm. When the model is conditionally conjugate, the coordinate updates are easily derived and in closed form. However, many models of interest—like the correlated topic model and Bayesian logistic regression—are nonconjugate. In these models, mean-field methods cannot be directly applied and practitioners have had to develop variational algorithms on a case-by-case basis. In this paper, we develop two generic methods for nonconjugate models, Laplace variational inference and delta method variational inference. Our methods have several advantages: they allow for easily derived variational algorithms with a wide class of nonconjugate models; they extend and unify some of the existing algorithms that have been derived for specific models; and they work well on real-world data sets. We studied our methods on the correlated topic model, Bayesian logistic regression, and hierarchical Bayesian logistic regression. Keywords: variational inference, nonconjugate models, Laplace approximations, the multivariate delta method
6 0.42683929 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
7 0.40720302 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
8 0.40010145 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
9 0.35614353 90 jmlr-2013-Quasi-Newton Method: A New Direction
10 0.33994901 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
11 0.33559704 68 jmlr-2013-Machine Learning with Operational Costs
12 0.31909004 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
13 0.31863523 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
14 0.31536204 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
15 0.31253812 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
16 0.28455889 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
17 0.28123078 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
18 0.28020343 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java
19 0.2767573 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
20 0.25828332 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
topicId topicWeight
[(0, 0.025), (5, 0.568), (6, 0.034), (10, 0.069), (20, 0.023), (23, 0.026), (44, 0.012), (53, 0.011), (68, 0.024), (70, 0.017), (75, 0.046), (85, 0.019), (87, 0.017), (93, 0.021)]
simIndex simValue paperId paperTitle
1 0.99796104 8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
2 0.99649334 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyzes developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
3 0.99585444 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java
Author: Niya Wang, Fan Meng, Li Chen, Subha Madhavan, Robert Clarke, Eric P. Hoffman, Jianhua Xuan, Yue Wang
Abstract: We describe a R-Java CAM (convex analysis of mixtures) package that provides comprehensive analytic functions and a graphic user interface (GUI) for blindly separating mixed nonnegative sources. This open-source multiplatform software implements recent and classic algorithms in the literature including Chan et al. (2008), Wang et al. (2010), Chen et al. (2011a) and Chen et al. (2011b). The CAM package offers several attractive features: (1) instead of using proprietary MATLAB, its analytic functions are written in R, which makes the codes more portable and easier to modify; (2) besides producing and plotting results in R, it also provides a Java GUI for automatic progress update and convenient visual monitoring; (3) multi-thread interactions between the R and Java modules are driven and integrated by a Java GUI, assuring that the whole CAM software runs responsively; (4) the package offers a simple mechanism to allow others to plug-in additional R-functions. Keywords: convex analysis of mixtures, blind source separation, affinity propagation clustering, compartment modeling, information-based model selection
4 0.99554569 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
same-paper 5 0.99062574 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
7 0.97244489 114 jmlr-2013-The Rate of Convergence of AdaBoost
8 0.92466915 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
9 0.91545367 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
10 0.91295254 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
11 0.90850967 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
12 0.90626788 73 jmlr-2013-Multicategory Large-Margin Unified Machines
13 0.9025051 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
14 0.90088862 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
15 0.88798743 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
16 0.88333356 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
17 0.87884569 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
18 0.87590629 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
19 0.8754822 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
20 0.87341166 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion