nips nips2011 nips2011-258 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract We propose a new sparse Bayesian model for multi-task regression and classification. [sent-6, score-0.18]
2 The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. [sent-7, score-0.389]
3 We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. [sent-8, score-0.544]
4 We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. [sent-9, score-0.388]
5 Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. [sent-10, score-0.316]
6 While capturing correlations between the labels seems appealing, in practice it is difficult, as it rapidly leads to numerical problems when estimating the correlations. [sent-15, score-0.142]
7 A naive solution is to learn a model for each task separately and to make predictions using the independent models. [sent-16, score-0.114]
8 If the model is able to capture the task relatedness, its generalisation capabilities are expected to increase drastically. [sent-18, score-0.155]
9 This motivated the introduction of the multi-task learning paradigm that exploits the correlations amongst multiple tasks by learning them simultaneously rather than individually [12]. [sent-19, score-0.347]
10 More recently, the abundant literature on multi-task learning demonstrated that performance indeed improves when the tasks are related [6, 31, 2, 14, 13]. [sent-20, score-0.155]
11 In this setting, the output of all tasks is observed.1 While it is straightforward to show that the maximum likelihood estimate of W would be the same as when considering uncorrelated noise, imposing any prior on W would lead to a different solution. [sent-24, score-0.278]
12 In the second setting, the goal is to learn from a set of observed tasks and to generalise to a new task. [sent-26, score-0.2]
13 This approach views the multi-task learning problem as a transfer learning problem, where it is assumed that the various tasks belong in some sense to the same environment and share common properties [23, 5]. [sent-27, score-0.155]
14 A recent trend in multi-task learning is to consider sparse solutions to facilitate the interpretation. [sent-29, score-0.096]
15 Many formulate the sparse multi-task learning problem in a (relaxed) convex optimization framework [5, 22, 35, 23]. [sent-30, score-0.096]
16 Alternatively, one can adopt a Bayesian approach to sparsity in the context of multi-task learning [29, 21]. [sent-32, score-0.222]
17 The main advantage of the Bayesian formalism is that it enables us to learn the degree of sparsity supported by the data and does not require the user to specify the type of penalisation in advance. [sent-33, score-0.306]
18 This is similar in spirit to the approach taken by [18], where tasks are related through a shared kernel matrix. [sent-35, score-0.155]
19 We will consider a matrix-variate prior to simultaneously model task correlations and group sparsity in W. [sent-36, score-0.545]
20 A matrix-variate Gaussian prior was used in [35] in a maximum likelihood setting to capture task correlations and feature correlations. [sent-37, score-0.331]
21 While we are also interested in task correlations, we will consider matrix-variate Gaussian scale mixture priors centred at zero to drive entire blocks of W to zero. [sent-38, score-0.333]
22 The Bayesian group LASSO proposed in [30] is a special case. [sent-39, score-0.112]
23 Group sparsity [34] is especially useful in the presence of categorical features, which are in general represented as groups of “dummy” variables. [sent-40, score-0.222]
24 Finally, we will allow the covariance to be of low-rank so that we can deal with problems involving a very large number of tasks. [sent-41, score-0.154]
25 2 Matrix-variate Gaussian prior Before starting our discussion of the model, we introduce the matrix-variate Gaussian as it plays a key role in our work. [sent-42, score-0.133]
26 For a matrix W ∈ R^{P×D}, the matrix-variate Gaussian density [16] with mean matrix M ∈ R^{P×D}, row covariance Ω ∈ R^{D×D} and column covariance Σ ∈ R^{P×P} is given by $\mathcal{N}(\mathbf{M}, \boldsymbol{\Omega}, \boldsymbol{\Sigma}) \propto e^{-\frac{1}{2}\,\mathrm{vec}(\mathbf{W}-\mathbf{M})^\top(\boldsymbol{\Omega}\otimes\boldsymbol{\Sigma})^{-1}\mathrm{vec}(\mathbf{W}-\mathbf{M})} \propto e^{-\frac{1}{2}\,\mathrm{tr}\{\boldsymbol{\Omega}^{-1}(\mathbf{W}-\mathbf{M})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{W}-\mathbf{M})\}}$. (1) [sent-43, score-0.42]
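The two equivalent forms above can be checked numerically. Below is a minimal sketch (not from the paper): the helper name, the test dimensions and the SciPy cross-check are assumptions, but the trace form and the vec/Kronecker form are exactly the ones in the density above.

```python
import numpy as np
from numpy.linalg import slogdet, solve
from scipy.stats import multivariate_normal

def matrix_normal_logpdf(W, M, Omega, Sigma):
    """Log-density of the matrix-variate Gaussian N(M, Omega, Sigma) for a
    P x D matrix W, with column covariance Sigma (P x P) and row covariance
    Omega (D x D). Uses the trace form, which never builds the PD x PD
    Kronecker covariance."""
    P, D = W.shape
    R = W - M
    quad = np.trace(solve(Omega, R.T) @ solve(Sigma, R))   # tr{Omega^-1 R^T Sigma^-1 R}
    _, logdet_Omega = slogdet(Omega)
    _, logdet_Sigma = slogdet(Sigma)
    return -0.5 * (quad + P * logdet_Omega + D * logdet_Sigma
                   + P * D * np.log(2.0 * np.pi))

# Cross-check against the vectorised form with covariance Omega ⊗ Sigma.
rng = np.random.default_rng(0)
P, D = 3, 4
A = rng.standard_normal((P, P)); Sigma = A @ A.T + np.eye(P)
B = rng.standard_normal((D, D)); Omega = B @ B.T + np.eye(D)
W, M = rng.standard_normal((P, D)), np.zeros((P, D))
vecW = W.flatten(order="F")                      # vec(W): columns stacked
dense = multivariate_normal(mean=np.zeros(P * D), cov=np.kron(Omega, Sigma))
assert np.isclose(matrix_normal_logpdf(W, M, Omega, Sigma), dense.logpdf(vecW))
```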
27 While this introduces a scale ambiguity between Σ and Ω (easily removed by means of a prior), the use of a matrix-variate formulation is appealing as it makes explicit the structure of vec(W), the vector formed by the concatenation of the columns of W. [sent-45, score-0.082]
28 This structure is reflected in its covariance matrix, which is not an arbitrary full covariance matrix but is obtained by computing the Kronecker product of the row and the column covariance matrices. [sent-46, score-0.364]
29 It is interesting to compare a matrix-variate prior for W in (1) with the classical multi-level approach to multiple regression from statistics (see e. [sent-47, score-0.214]
30 In a standard multi-level model, the rows of W are drawn iid from a multivariate Gaussian with mean m and covariance S, and m is further drawn from a zero-mean Gaussian with covariance R. [sent-50, score-0.308]
31 Integrating out m then leads to a Gaussian-distributed vec(W) with mean zero and a covariance matrix whose diagonal blocks are equal to S + R and whose off-diagonal blocks are all equal to R. [sent-51, score-0.21]
32 Hence, the standard multi-level model assumes a very different covariance structure than the one based on (2) and incidentally cannot learn correlated and anti-correlated tasks simultaneously. [sent-52, score-0.354]
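A small sketch of this contrast (an illustration, not code from the paper; all values are made up, and the vector below stacks the rows of W rather than its columns for readability): the multi-level marginal covariance has S + R on the diagonal blocks and the same R on every off-diagonal block, whereas the matrix-variate covariance has blocks Σ[p, p′]·Ω, whose sign can differ across task pairs.

```python
import numpy as np

P, D = 3, 2   # P tasks (rows of W), D features
rng = np.random.default_rng(1)

def random_spd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + np.eye(d)

# Standard multi-level model: rows w_p | m ~ N(m, S) iid, m ~ N(0, R).
# Marginalising m, the stacked-rows covariance is block diag(S + R) plus R
# on every off-diagonal block, irrespective of the pair of tasks.
S, R = random_spd(D), random_spd(D)
cov_multilevel = np.kron(np.eye(P), S) + np.kron(np.ones((P, P)), R)

# Matrix-variate Gaussian N(0, Omega, Sigma): block (p, p') is Sigma[p, p'] * Omega,
# so tasks can be positively or negatively correlated depending on the pair.
Sigma = np.array([[ 1.0,  0.7, -0.7],
                  [ 0.7,  1.0, -0.5],
                  [-0.7, -0.5,  1.0]])   # task (column) covariance
Omega = random_spd(D)                     # feature (row) covariance
cov_matrix_variate = np.kron(Sigma, Omega)

print(cov_multilevel.round(2))
print(cov_matrix_variate.round(2))
```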
33 3 A general family of group sparsity inducing priors We seek a solution for which the expectation of W is sparse. [sent-53, score-0.502]
34 A straightforward way to induce sparsity, which would be equivalent to ℓ1-regularisation on blocks of W, is to consider a Laplace prior (or double exponential). [sent-56, score-0.136]
35 Although applicable in a penalised likelihood framework, the Laplace prior would be computationally hard in a Bayesian setting as it is not conjugate to the Gaussian likelihood. [sent-57, score-0.123]
36 Hence, naively using this prior would prevent us from computing the posterior in closed form, even in a variational setting. [sent-58, score-0.364]
37 Figure 1: Graphical model for sparse Bayesian multiple regression (when excluding the dashed arrow) and sparse Bayesian multiple classification (when considering all arrows). [Node labels: τ, V, σ2, Zi, Wi, yn, tn, ω, χ, φ, γi, Ωi, υ, λ; plates N and Q.] [sent-60, score-0.543]
38 A sparsity inducing prior for Wi can then be constructed by choosing a suitable hyperprior for γi . [sent-64, score-0.38]
39 The effective prior is then a symmetric matrix-variate generalised hyperbolic distribution: $p(\mathbf{W}_i) \propto \dfrac{K_{\omega+\frac{PD_i}{2}}\big(\sqrt{\chi\,(\phi+\mathrm{tr}\{\boldsymbol{\Omega}_i^{-1}\mathbf{W}_i^\top\boldsymbol{\Sigma}^{-1}\mathbf{W}_i\})}\big)}{\big(\sqrt{(\phi+\mathrm{tr}\{\boldsymbol{\Omega}_i^{-1}\mathbf{W}_i^\top\boldsymbol{\Sigma}^{-1}\mathbf{W}_i\})/\chi}\big)^{\omega+\frac{PD_i}{2}}}$. [sent-66, score-0.131]
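A hedged sampling sketch of this construction: draw γi from a generalised inverse Gaussian mixing density and then draw Wi from a zero-mean matrix-variate Gaussian whose row covariance is scaled by 1/γi. The exact role of γi (scale of the covariance versus the precision) and the mapping onto SciPy's geninvgauss parameters are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(omega, chi, phi, rng):
    """gamma ~ GIG(omega, chi, phi), density ∝ g^(omega-1) exp(-(chi/g + phi*g)/2).
    Mapped onto scipy.stats.geninvgauss(p, b, scale): p = omega,
    b = sqrt(chi*phi), scale = sqrt(chi/phi)."""
    return geninvgauss.rvs(p=omega, b=np.sqrt(chi * phi),
                           scale=np.sqrt(chi / phi), random_state=rng)

def sample_block(gamma, Omega_i, Sigma, rng):
    """W_i | gamma drawn from a zero-mean matrix-variate Gaussian with row
    covariance Omega_i / gamma and column covariance Sigma, so that a large
    gamma shrinks the whole block of weights towards zero."""
    P, D_i = Sigma.shape[0], Omega_i.shape[0]
    L_sigma = np.linalg.cholesky(Sigma)
    L_omega = np.linalg.cholesky(Omega_i / gamma)
    E = rng.standard_normal((P, D_i))
    return L_sigma @ E @ L_omega.T        # Cov[vec(W_i)] = (Omega_i/gamma) ⊗ Sigma

rng = np.random.default_rng(0)
P, D_i = 4, 3
Sigma, Omega_i = np.eye(P), np.eye(D_i)
gamma_i = sample_gig(omega=1.0, chi=1.0, phi=1.0, rng=rng)
W_i = sample_block(gamma_i, Omega_i, Sigma, rng)
print(gamma_i, np.abs(W_i).mean())
```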
40 Several of the multivariate equivalents have recently been used as priors to induce sparsity in the Bayesian paradigm, both in the context of supervised [19, 11] and unsupervised linear Gaussian models [4]. [sent-69, score-0.31]
41 4 Sparse Bayesian multiple regression We view {Wi}, {Ωi} (i = 1, . . . , Q) and {γ1, . . . , γQ} as latent variables. [sent-70, score-0.137]
42 We further introduce a latent projection matrix V ∈ R^{P×K} and a set of latent matrices {Zi} (i = 1, . . . , Q) to make a low-rank approximation of the column covariance Σ as explained below. [sent-83, score-0.372]
43 Note also that Ωi captures the correlations between the rows of group i. [sent-84, score-0.177]
44 Thus, the probabilistic model induces sparsity in the blocks of W, while taking correlations between the task parameters into account through the random matrix Σ ≈ VV⊤ + τ IP. [sent-108, score-0.508]
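A small sketch of why the low-rank parameterisation helps with many tasks (the sizes and τ below are illustrative assumptions): storing V (P × K) and τ, and multiplying by Σ ≈ VV⊤ + τ IP implicitly, keeps the cost linear in P instead of quadratic.

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, tau = 1000, 10, 0.1                 # many tasks, small rank
V = rng.standard_normal((P, K))

def sigma_matvec(v):
    """Multiply Sigma ≈ V V^T + tau * I_P by a vector without forming the
    P x P matrix: O(P K) work instead of O(P^2)."""
    return V @ (V.T @ v) + tau * v

x = rng.standard_normal(P)
assert np.allclose(sigma_matvec(x), (V @ V.T + tau * np.eye(P)) @ x)
```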
45 The latent variables Z = {W, V, Z, Ω, Γ} are inferred by variational EM [27], while the hyperparameters ϑ = {σ2, τ, υ, λ, ω, χ, φ} are estimated by type II ML [8, 25]. [sent-110, score-0.406]
46 Using variational inference is motivated by the fact that deterministic approximate inference schemes converge faster than traditional sampling methods such as Markov chain Monte Carlo (MCMC), and their convergence can easily be monitored. [sent-111, score-0.21]
47 Learning the hyperparameters by type II ML is preferred here to placing vague priors over them, although the latter would also be a valid option. [sent-112, score-0.203]
48 In order to find a tractable solution, we assume that the variational posterior q(Z) = q(W, V, Z, Ω, Γ) factorises as q(W)q(V)q(Z)q(Ω)q(Γ) given the data D = {(yn, xn)}, n = 1, . . . , N [7]. [sent-113, score-0.338]
49 The variational EM, combined with the type II ML estimation of the hyperparameters, cycles through the following two steps until convergence: 1. [sent-114, score-0.325]
50 Update of the approximate posterior of the latent variables and parameters for fixed hyperparameters. [sent-115, score-0.158]
51 The posteriors of the other latent matrices have the same form. [sent-117, score-0.125]
52 2. Update of the hyperparameters for fixed variational posteriors: $\boldsymbol{\vartheta} \leftarrow \arg\max_{\boldsymbol{\vartheta}} \langle \ln p(\mathcal{D}, \mathcal{Z} \mid \boldsymbol{\vartheta}) \rangle_{q(\mathcal{Z})}$. [sent-119, score-0.348]
53 The convergence can be checked by monitoring the variational lower bound, which monotonically increases during the optimisation. [sent-121, score-0.21]
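The cycle described above can be written as a small driver loop. This is only a skeleton under assumed interfaces: e_step, m_step and lower_bound stand in for the model-specific closed-form updates and the bound of the paper, which are not reproduced here.

```python
def variational_em(data, q, hyper, e_step, m_step, lower_bound,
                   max_iter=200, tol=1e-5):
    """Variational EM with type II ML hyperparameter estimation:
    1. update the factorised posteriors q(W)q(V)q(Z)q(Omega)q(Gamma);
    2. update the hyperparameters for fixed posteriors;
    monitor the variational lower bound, which must increase monotonically."""
    bound_old = -float("inf")
    for _ in range(max_iter):
        q = e_step(data, q, hyper)            # step 1: approximate posteriors
        hyper = m_step(data, q, hyper)        # step 2: type II ML updates
        bound = lower_bound(data, q, hyper)
        if bound < bound_old - 1e-8:
            raise RuntimeError("lower bound decreased: a bug in an update")
        if bound - bound_old < tol:
            break
        bound_old = bound
    return q, hyper
```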
54 Next, we give the explicit expressions of the variational EM steps and of the hyperparameter updates, while the expression of the variational bound is given in Supplemental Appendix D. [sent-122, score-0.483]
55 1 Variational E step (mean field) Assuming a factorised posterior enables us to compute it in closed form, as the priors are each conjugate to the Gaussian likelihood. [sent-124, score-0.227]
56 When D > N , we can use the Woodbury identity for a matrix inversion of complexity O(N 3 ) per iteration. [sent-128, score-0.102]
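A sketch of the trick being referred to, for the special case of a ridge-like posterior mean with a diagonal prior precision (the function name, the diagonal assumption and the test sizes are assumptions, not the paper's E-step): the Woodbury identity turns the D × D solve into an N × N one.

```python
import numpy as np

def posterior_mean_woodbury(X, y, a_diag, sigma2):
    """Computes (A + X^T X / sigma2)^{-1} X^T y / sigma2 for diagonal A via the
    Woodbury identity, i.e. m = A^{-1} X^T (sigma2 I_N + X A^{-1} X^T)^{-1} y,
    so only an N x N system is solved (useful when D > N)."""
    N, _ = X.shape
    Ainv = 1.0 / a_diag                              # A^{-1}, diagonal
    K = sigma2 * np.eye(N) + (X * Ainv) @ X.T        # N x N matrix
    alpha = np.linalg.solve(K, y)
    return Ainv * (X.T @ alpha)

# Check against the direct D x D computation on random data.
rng = np.random.default_rng(0)
N, D, sigma2 = 20, 200, 0.5
X, y = rng.standard_normal((N, D)), rng.standard_normal(N)
a_diag = rng.uniform(0.5, 2.0, size=D)
direct = np.linalg.solve(np.diag(a_diag) + X.T @ X / sigma2, X.T @ y / sigma2)
assert np.allclose(direct, posterior_mean_woodbury(X, y, a_diag, sigma2))
```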
57 2 Hyperparameter updates To learn the degree of sparsity from data, we optimise the hyperparameters. [sent-130, score-0.33]
58 , by line search: $\omega:\ Q \ln\sqrt{\phi/\chi} - Q\,\frac{d \ln K_\omega(\sqrt{\chi\phi})}{d\omega} + \sum_i \langle \ln \gamma_i \rangle = 0$, (10) $\chi:\ -\frac{Q\omega}{\chi} + \frac{Q}{2}\sqrt{\phi/\chi}\,R_\omega(\sqrt{\chi\phi}) - \frac{1}{2}\sum_i \langle \gamma_i^{-1} \rangle = 0$, (11) $\phi:\ \frac{Q}{2}\sqrt{\chi/\phi}\,R_\omega(\sqrt{\chi\phi}) - \frac{1}{2}\sum_i \langle \gamma_i \rangle = 0$, (12) where $R_\omega(\cdot) = K_{\omega+1}(\cdot)/K_\omega(\cdot)$ is a ratio of modified Bessel functions of the second kind. [sent-134, score-0.186]
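A numerical sketch of the machinery these conditions require (the reconstruction of (10)-(12) above is itself inferred from the fragments, so the residual below is an assumption): the Bessel ratio Rω and the derivative of ln Kω with respect to the order can be evaluated with scipy.special.kv, and each one-dimensional condition can be solved by bracketing and Brent's method.

```python
import numpy as np
from scipy.special import kv              # modified Bessel function K_nu(x)
from scipy.optimize import brentq

def bessel_ratio(omega, x):
    """R_omega(x) = K_{omega+1}(x) / K_omega(x)."""
    return kv(omega + 1.0, x) / kv(omega, x)

def dlnK_domega(omega, x, eps=1e-5):
    """Numerical derivative of ln K_omega(x) w.r.t. the order omega (needed for
    the omega condition); SciPy provides no closed form for this derivative."""
    return (np.log(kv(omega + eps, x)) - np.log(kv(omega - eps, x))) / (2 * eps)

def solve_phi(omega, chi, sum_gamma, Q, lo=1e-6, hi=1e3):
    """Solve one stationarity condition in phi by root finding; the residual
    follows the reconstructed form of (12) and is therefore an assumption."""
    residual = lambda phi: (Q * np.sqrt(chi / phi)
                            * bessel_ratio(omega, np.sqrt(chi * phi)) - sum_gamma)
    return brentq(residual, lo, hi)

print(dlnK_domega(1.0, 2.0))
print(solve_phi(omega=1.0, chi=2.0, sum_gamma=5.0, Q=10))
```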
59 When considering special cases of the mixing density, such as the Gamma or the inverse Gamma, simplified updates are obtained and no numerical differentiation is required. [sent-138, score-0.123]
60 Due to space constraints, we omit the type II ML updates for the other hyperparameters. [sent-139, score-0.102]
61 5 Sparse Bayesian multiple classification We restrict ourselves to multiple binary classifiers and consider a probit model in which the likelihood is derived from the Gaussian cumulative distribution function. [sent-143, score-0.198]
62 A probit model is equivalent to a Gaussian noise and a step function likelihood [1]. [sent-144, score-0.092]
63 Let tn ∈ RP be the class label vectors, with tnp ∈ {−1, +1} for all n. [sent-145, score-0.265]
64 The likelihood is replaced by $y_n \mid \mathbf{W}, x_n \sim \mathcal{N}(\mathbf{W} x_n, \sigma^2 I_P)$, $\ t_n \mid y_n \sim \prod_p I(t_{np}\, y_{np})$, (13) where I(z) = 1 for z ≥ 0 and 0 otherwise. [sent-146, score-0.34]
65 We further assume the variational posterior q(Y) is a product of truncated Gaussians (see Supplemental Appendix B): $q(\mathbf{Y}) \propto \prod_n \prod_p I(t_{np}\, y_{np})\, \mathcal{N}(\nu_{np}, 1) = \prod_{t_{np}=+1} \mathcal{N}_+(\nu_{np}, 1) \prod_{t_{np}=-1} \mathcal{N}_-(\nu_{np}, 1)$, (14) where ν_{np} is the pth entry of ν_n = M_W x_n. [sent-150, score-0.828]
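A sketch of the moments implied by (14) (standard truncated-Gaussian identities; the assumption here is that these posterior means are what the unobserved targets Y are replaced by in the otherwise unchanged updates):

```python
import numpy as np
from scipy.stats import norm

def truncated_gaussian_mean(nu, t):
    """Posterior mean of y ~ N(nu, 1) restricted to the half-line t*y >= 0,
    i.e. of q(y) ∝ I(t*y) N(y | nu, 1) as in (14):
    E[y] = nu + t * phi(nu) / Phi(t * nu)."""
    return nu + t * norm.pdf(nu) / norm.cdf(t * nu)

rng = np.random.default_rng(0)
nu = rng.standard_normal((5, 3))           # nu_n = M_W x_n for 5 points, 3 tasks
t = rng.choice([-1.0, 1.0], size=(5, 3))   # observed labels t_np in {-1, +1}
Y_bar = truncated_gaussian_mean(nu, t)
# The mean is always pulled towards the observed half-line:
assert np.all(t * (Y_bar - nu) >= 0)
```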
66 The other variational and hyperparameter updates are unchanged, except that Y is replaced by the matrix ν±. [sent-151, score-0.367]
67 Based on the variational approximation we propose the following classification rule: $\hat{\mathbf{t}}_* = \arg\max_{\mathbf{t}_*} P(\mathbf{t}_* \mid \mathbf{T}) \approx \arg\max_{\mathbf{t}_*} \prod_p \int \mathcal{N}_{t_{*p}}(\nu_{*p}, 1)\, dy_{*p} = \arg\max_{\mathbf{t}_*} \prod_p \Phi(t_{*p}\, \nu_{*p})$, (15) where ν_* = M_W x_*. [sent-157, score-0.21]
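Since (15) factorises over tasks, the rule reduces to thresholding ν at zero for each task, with Φ(|ν|) as the implied confidence. A small sketch, assuming M_W is the variational posterior mean of W:

```python
import numpy as np
from scipy.stats import norm

def predict_labels(M_W, X_star):
    """Classification rule (15): for each test point and task p, choose
    t in {-1, +1} maximising Phi(t * nu_p) with nu_star = M_W x_star,
    which is simply the sign of nu; Phi(|nu|) is the winning probability."""
    nu = X_star @ M_W.T                    # (N_star, P) matrix of nu values
    labels = np.where(nu >= 0, 1, -1)
    confidence = norm.cdf(np.abs(nu))
    return labels, confidence
```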
68 [Figure 2 panels: estimated and true task covariance; performance versus training set size for SPBMRC, ordinary least squares and prediction with the ground truth W; sparsity pattern over the feature index; SBMR estimated, OLS estimated and true weight matrices.] [sent-160, score-0.522]
69 Figure 2: Results for the ground truth data set. [sent-164, score-0.244]
70 Top right: estimated and true Σ (top), true underlying sparsity pattern (middle) and inverse of the posterior mean of {γi }i showing that the sparsity is correctly captured (bottom). [sent-166, score-0.581]
71 Bottom diagrams: Hinton diagram of true W (bottom), ordinary least squares learnt W (middle) and the sparse Bayesian multi-task learnt W (top). [sent-167, score-0.333]
72 The ordinary least squares learnt W contains many non-zero elements. [sent-168, score-0.156]
73 6 A model study with ground truth data To understand the properties of the model, we study a regression problem with known parameters. [sent-169, score-0.16]
74 Figure 2 shows the results for 5 tasks and 50 features. [sent-170, score-0.155]
75 the covariance for vec(W) has 1’s on the diagonal and ±. [sent-179, score-0.154]
76 The first three tasks and the last two tasks are positively correlated. [sent-181, score-0.31]
77 It can be observed that, compared to ordinary least squares, the proposed model performs better and converges faster to the optimal performance as the data set size increases. [sent-187, score-0.112]
78 Note also that both Σ and the sparsity pattern are correctly identified. [sent-188, score-0.222]
79 6 Table 1: Performance (with standard deviation) of classification tasks on Yeast and Scene data sets in terms of accuracy and AUC. [sent-189, score-0.207]
80 LR: Bayesian logistic regression; Pooling: pooling all data and learning a single model; Xue: the matrix stick-breaking process based multi-task learning model proposed in [33]. [sent-190, score-0.209]
81 We evaluate all methods for the classification task using two metrics: (1) overall accuracy at a threshold of zero and (2) the average area under the curve (AUC). [sent-247, score-0.121]
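A sketch of the two metrics (scores ν and labels in {−1, +1} are assumed, and scikit-learn's roc_auc_score is used for convenience):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(nu, t):
    """nu: (N, P) real-valued scores (e.g. M_W x_n); t: (N, P) labels in {-1, +1}.
    Returns (1) overall accuracy at a threshold of zero and (2) the AUC
    averaged over the P tasks."""
    accuracy = np.mean((nu >= 0) == (t > 0))
    aucs = [roc_auc_score(t[:, p] > 0, nu[:, p]) for p in range(t.shape[1])]
    return accuracy, float(np.mean(aucs))
```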
82 We also study how the performance varies with different K on a tuning set, and observe no significant differences between the different values of K (not shown in the paper). [sent-250, score-0.142]
83 The proposed models (Laplace, Student-t, ARD) significantly outperform the Bayesian logistic regression approach that learns each task separately. [sent-252, score-0.28]
84 This observation agrees with the previous work [6, 31, 2, 5] demonstrating that the multi-task approach is beneficial over the naive approach of learning tasks separately. [sent-253, score-0.155]
85 The advantage of using hierarchical priors is particularly evident in a low data regime. [sent-256, score-0.088]
86 Figure 3 shows that the proposed Bayesian methods perform well overall, but that their performance is not significantly impacted when the amount of data is small. [sent-259, score-0.108]
87 8 Conclusion In this work we proposed a Bayesian multi-task learning model able to capture correlations between tasks and to learn the sparsity pattern of the data features simultaneously. [sent-261, score-0.598]
88 We further proposed a low-rank approximation of the covariance to handle a very large number of tasks. [sent-262, score-0.191]
89 Combining low rank and sparsity at the same time has been a long-standing open issue in machine learning. [sent-263, score-0.222]
90 Figure 3: Model comparisons in terms of classification accuracy and AUC on the Scene data set for K = 10 (x-axis: number of training samples). [sent-275, score-0.09]
91 Results for Bayesian logistic regression (BLR), Model-1 and Model-2 are obtained by measuring Figure 2 in [29] with a ruler; no error bars are given there. [sent-277, score-0.136]
92 The proposed model combines sparsity and low rank in a different manner than [10], where the sum of a sparse and a low-rank matrix is considered. [sent-278, score-0.411]
93 By considering a matrix-variate Gaussian scale mixture prior, we extended the Bayesian group LASSO to a more general family of group sparsity inducing priors. [sent-279, score-0.647]
94 This suggests the extension of current Bayesian methodology to learn structured sparsity from data in the future. [sent-280, score-0.267]
95 A possible extension is to consider the graphical LASSO to learn sparse precision matrices Ω−1 and Σ−1. [sent-281, score-0.213]
96 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-297, score-0.248]
97 Learning incoherent sparse and low-rank patterns from multiple tasks. [sent-389, score-0.149]
98 Some matrix-variate distribution theory: Notational considerations and a bayesian application. [sent-395, score-0.223]
99 Sharp thresholds for high-dimensional and noisy sparsity recovery using l1 -constrained quadratic programming (lasso). [sent-498, score-0.222]
100 Model selection and estimation in regression with grouped variables. [sent-509, score-0.084]
wordName wordTfidf (topN-words)
[('bayesian', 0.223), ('sparsity', 0.222), ('variational', 0.21), ('tnp', 0.204), ('yeast', 0.188), ('rp', 0.188), ('ip', 0.168), ('tasks', 0.155), ('covariance', 0.154), ('wi', 0.147), ('auc', 0.135), ('laplace', 0.129), ('vec', 0.126), ('blr', 0.122), ('wxn', 0.122), ('scene', 0.12), ('mw', 0.117), ('ard', 0.117), ('xue', 0.112), ('correlations', 0.102), ('yn', 0.1), ('sparse', 0.096), ('np', 0.09), ('priors', 0.088), ('regression', 0.084), ('gaussian', 0.083), ('ynp', 0.082), ('learnt', 0.081), ('inducing', 0.081), ('latent', 0.081), ('multitask', 0.079), ('supplemental', 0.079), ('posterior', 0.077), ('prior', 0.077), ('hyperparameters', 0.076), ('group', 0.075), ('ordinary', 0.075), ('classi', 0.075), ('di', 0.074), ('vv', 0.072), ('performances', 0.071), ('task', 0.069), ('residual', 0.068), ('zi', 0.068), ('ml', 0.066), ('em', 0.065), ('pooling', 0.064), ('updates', 0.063), ('factorised', 0.062), ('hern', 0.062), ('ln', 0.062), ('tn', 0.061), ('inverse', 0.06), ('blocks', 0.059), ('appendix', 0.059), ('archambeau', 0.059), ('lasso', 0.057), ('ik', 0.056), ('dq', 0.056), ('matrix', 0.056), ('generalised', 0.054), ('rai', 0.054), ('multiple', 0.053), ('tr', 0.053), ('logistic', 0.052), ('accuracy', 0.052), ('xn', 0.051), ('lr', 0.05), ('generalisation', 0.049), ('jmlr', 0.047), ('likelihood', 0.046), ('evgeniou', 0.046), ('inversion', 0.046), ('probit', 0.046), ('learn', 0.045), ('ii', 0.044), ('posteriors', 0.044), ('student', 0.044), ('scale', 0.042), ('appealing', 0.04), ('gamma', 0.04), ('predictive', 0.04), ('type', 0.039), ('mixture', 0.039), ('targets', 0.039), ('na', 0.039), ('truth', 0.039), ('comparisons', 0.038), ('learns', 0.038), ('hyperparameter', 0.038), ('ground', 0.037), ('paradigm', 0.037), ('sigkdd', 0.037), ('proposed', 0.037), ('capture', 0.037), ('graphical', 0.036), ('family', 0.036), ('precision', 0.036), ('centred', 0.036), ('integrative', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
2 0.20662457 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
Author: Jun Zhu, Ning Chen, Eric P. Xing
Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.
3 0.19571473 301 nips-2011-Variational Gaussian Process Dynamical Systems
Author: Neil D. Lawrence, Michalis K. Titsias, Andreas Damianou
Abstract: High dimensional time series are endemic in applications of machine learning such as robotics (sensor data), computational biology (gene expression data), vision (video sequences) and graphics (motion capture data). Practical nonlinear probabilistic approaches to this data are required. In this paper we introduce the variational Gaussian process dynamical system. Our work builds on recent variational approximations for Gaussian process latent variable models to allow for nonlinear dimensionality reduction simultaneously with learning a dynamical prior in the latent space. The approach also allows for the appropriate dimensionality of the latent space to be automatically determined. We demonstrate the model on a human motion capture data set and a series of high resolution video sequences. 1
4 0.19147687 104 nips-2011-Generalized Beta Mixtures of Gaussians
Author: Artin Armagan, Merlise Clyde, David B. Dunson
Abstract: In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. 1
5 0.16099989 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
6 0.14482781 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
7 0.14428516 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
8 0.14118513 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
9 0.14068533 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
10 0.13523716 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
11 0.13273986 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
12 0.12688105 261 nips-2011-Sparse Filtering
13 0.12195583 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
14 0.12189191 259 nips-2011-Sparse Estimation with Structured Dictionaries
15 0.12083802 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
16 0.11513156 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
17 0.11421989 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
18 0.10941751 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis
19 0.10885424 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
20 0.1063251 213 nips-2011-Phase transition in the family of p-resistances
topicId topicWeight
[(0, 0.337), (1, 0.103), (2, -0.014), (3, -0.119), (4, -0.102), (5, -0.085), (6, 0.12), (7, 0.024), (8, 0.052), (9, 0.216), (10, -0.075), (11, -0.153), (12, 0.007), (13, -0.008), (14, -0.102), (15, 0.056), (16, -0.044), (17, -0.044), (18, 0.17), (19, -0.144), (20, 0.043), (21, -0.0), (22, 0.055), (23, 0.141), (24, 0.036), (25, -0.069), (26, 0.135), (27, 0.062), (28, 0.015), (29, -0.053), (30, 0.026), (31, -0.008), (32, -0.033), (33, -0.056), (34, 0.057), (35, -0.057), (36, -0.038), (37, 0.044), (38, -0.057), (39, -0.093), (40, -0.075), (41, 0.06), (42, 0.04), (43, 0.058), (44, -0.08), (45, -0.053), (46, -0.061), (47, -0.004), (48, 0.036), (49, -0.113)]
simIndex simValue paperId paperTitle
same-paper 1 0.97084242 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
2 0.81871057 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
Author: Oliver Stegle, Christoph Lippert, Joris M. Mooij, Neil D. Lawrence, Karsten M. Borgwardt
Abstract: Inference in matrix-variate Gaussian models has major applications for multioutput prediction and joint learning of row and column covariances from matrixvariate data. Here, we discuss an approach for efficient inference in such models that explicitly account for iid observation noise. Computational tractability can be retained by exploiting the Kronecker product between row and column covariance matrices. Using this framework, we show how to generalize the Graphical Lasso in order to learn a sparse inverse covariance between features while accounting for a low-rank confounding covariance between samples. We show practical utility on applications to biology, where we model covariances with more than 100,000 dimensions. We find greater accuracy in recovering biological network structures and are able to better reconstruct the confounders. 1
3 0.78620869 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
Author: Miguel Lázaro-gredilla, Michalis K. Titsias
Abstract: We introduce a variational Bayesian inference algorithm which can be widely applied to sparse linear models. The algorithm is based on the spike and slab prior which, from a Bayesian perspective, is the golden standard for sparse inference. We apply the method to a general multi-task and multiple kernel learning model in which a common set of Gaussian process functions is linearly combined with task-specific sparse weights, thus inducing relation between tasks. This model unifies several sparse linear models, such as generalized linear models, sparse factor analysis and matrix factorization with missing values, so that the variational algorithm can be applied to all these cases. We demonstrate our approach in multioutput Gaussian process regression, multi-class classification, image processing applications and collaborative filtering. 1
4 0.78174299 104 nips-2011-Generalized Beta Mixtures of Gaussians
Author: Artin Armagan, Merlise Clyde, David B. Dunson
Abstract: In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. 1
5 0.76168388 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
Author: Jun Zhu, Ning Chen, Eric P. Xing
Abstract: Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes’ theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the largemargin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.
6 0.71988922 301 nips-2011-Variational Gaussian Process Dynamical Systems
7 0.65242898 240 nips-2011-Robust Multi-Class Gaussian Process Classification
8 0.63394111 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
9 0.63201088 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
10 0.62736911 217 nips-2011-Practical Variational Inference for Neural Networks
11 0.60584486 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
12 0.59937149 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
13 0.58198249 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
14 0.56362891 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
15 0.55308664 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
16 0.55016267 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
17 0.55009365 14 nips-2011-A concave regularization technique for sparse mixture models
18 0.54190397 277 nips-2011-Submodular Multi-Label Learning
19 0.53701919 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction
20 0.53521335 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
topicId topicWeight
[(0, 0.024), (4, 0.062), (20, 0.033), (25, 0.016), (26, 0.026), (29, 0.096), (31, 0.118), (33, 0.025), (43, 0.104), (45, 0.12), (57, 0.05), (65, 0.026), (74, 0.099), (83, 0.053), (89, 0.014), (99, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.93438107 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
2 0.89887989 276 nips-2011-Structured sparse coding via lateral inhibition
Author: Arthur D. Szlam, Karol Gregor, Yann L. Cun
Abstract: This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified. 1
3 0.89331543 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
Author: Jun-ichiro Hirayama, Aapo Hyvärinen
Abstract: Components estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. Here, we propose a principled probabilistic model to model the energycorrelations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones like in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization which effectively reduces energy correlation. Our new twostage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with synthetic dataset, natural images and brain signals. 1
4 0.89315057 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
Author: Alyson K. Fletcher, Sundeep Rangan, Lav R. Varshney, Aniruddha Bhargava
Abstract: Many functional descriptions of spiking neurons assume a cascade structure where inputs are passed through an initial linear filtering stage that produces a lowdimensional signal that drives subsequent nonlinear stages. This paper presents a novel and systematic parameter estimation procedure for such models and applies the method to two neural estimation problems: (i) compressed-sensing based neural mapping from multi-neuron excitation, and (ii) estimation of neural receptive fields in sensory neurons. The proposed estimation algorithm models the neurons via a graphical model and then estimates the parameters in the model using a recently-developed generalized approximate message passing (GAMP) method. The GAMP method is based on Gaussian approximations of loopy belief propagation. In the neural connectivity problem, the GAMP-based method is shown to be computational efficient, provides a more exact modeling of the sparsity, can incorporate nonlinearities in the output and significantly outperforms previous compressed-sensing methods. For the receptive field estimation, the GAMP method can also exploit inherent structured sparsity in the linear weights. The method is validated on estimation of linear nonlinear Poisson (LNP) cascade models for receptive fields of salamander retinal ganglion cells. 1
5 0.88930136 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
6 0.88719308 186 nips-2011-Noise Thresholds for Spectral Clustering
7 0.88161713 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
8 0.87978083 68 nips-2011-Demixed Principal Component Analysis
9 0.87757337 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
10 0.87750328 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
11 0.87719053 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
12 0.87708271 102 nips-2011-Generalised Coupled Tensor Factorisation
13 0.87676191 180 nips-2011-Multiple Instance Filtering
14 0.87626195 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
15 0.87608504 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
16 0.87526304 66 nips-2011-Crowdclustering
17 0.87510079 301 nips-2011-Variational Gaussian Process Dynamical Systems
18 0.87459457 75 nips-2011-Dynamical segmentation of single trials from population neural data
19 0.87406439 219 nips-2011-Predicting response time and error rates in visual search
20 0.8720004 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning