nips nips2013 nips2013-153 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. [sent-5, score-0.822]
2 Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. [sent-8, score-0.512]
3 Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. [sent-10, score-0.514]
4 The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. [sent-12, score-0.434]
5 The sparsity assumption can be introduced by carrying out Bayesian inference under a sparsity enforcing prior for the model coefficients [2, 3], or by minimizing a loss function penalized by some sparse regularizer [4, 5]. [sent-19, score-0.237]
6 Among the priors that enforce sparsity, the horseshoe has some attractive properties that are very convenient for the scenario described [3]. [sent-20, score-0.436]
7 The estimation of the coefficients under the sparsity assumption can be improved by introducing dependencies in the process of determining which coefficients are zero [6, 7]. [sent-22, score-0.306]
8 An extreme case of these dependencies appears in group feature selection methods in which groups of coefficients are considered to be jointly equal or different from zero [8, 9]. [sent-23, score-0.428]
9 Here, we propose a model based on the horseshoe prior that induces the dependencies in the feature selection process from the training data. [sent-25, score-0.871]
10 These dependencies are expressed by a correlation matrix that is specified by O(d) parameters. [sent-26, score-0.269]
11 To improve the estimation process we assume a multi-task learning setting, where several learning tasks share feature selection dependencies. [sent-29, score-0.389]
12 1 Traditionally, methods for multi-task learning under the sparsity assumption have considered common relevant and irrelevant features among tasks [8, 10, 11, 12, 13, 14]. [sent-31, score-0.392]
13 The tasks used for induction can have, besides different model coefficients, different relevant features. [sent-34, score-0.263]
14 The model described here is most related to the method for sparse coding introduced in [16], where spike-and-slab priors [2] are considered for multi-task linear regression under the sparsity assumption and dependencies in the feature selection process are specified by a Boltzmann machine. [sent-36, score-0.581]
15 These experiments also illustrate the benefits of the proposed model for inducing dependencies in the feature selection process. [sent-44, score-0.434]
16 Specifically, the dependencies obtained are suitable for the multi-task learning problems considered. [sent-45, score-0.202]
17 The rest of the paper is organized as follows: Section 2 describes the proposed model for learning feature selection dependencies. [sent-49, score-0.232]
18 2 A Model for Learning Feature Selection Dependencies We describe a linear regression model that can be used for learning dependencies in the process of identifying relevant features or attributes for prediction. [sent-53, score-0.416]
19 This inductive bias can be naturally incorporated into the model using a horseshoe sparsity enforcing prior for w [3]. [sent-72, score-0.508]
20 Figure 1 (left) and (middle) show a comparison of the horseshoe with other priors from the literature. [sent-92, score-0.389]
21 The horseshoe has an infinitely tall spike at the origin which favors coefficients with small values, and has heavy tails which favor coefficients that take values that significantly differ from zero. [sent-93, score-0.518]
22 Then, the posterior mean for wj is (1 − κj)yj, where κj is a random shrinkage coefficient that can be interpreted as the amount of weight placed at the origin [3]. [sent-95, score-0.309]
23 It is from the shape of this figure that the horseshoe takes its name. [sent-97, score-0.352]
24 The horseshoe is therefore very convenient for the sparse inducing scenario described before. [sent-99, score-0.432]
25 (right) Prior density of the shrinkage parameter κj for the horseshoe prior. [sent-109, score-0.405]
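To make the shrinkage behaviour concrete, the following minimal Python sketch (an illustration written for this summary, not the authors' code) samples the shrinkage coefficient κj = 1/(1 + λj^2) implied by a half-Cauchy scale λj with τ = σ = 1 and checks that its density piles up near 0 (little shrinkage, relevant coefficients) and near 1 (complete shrinkage, irrelevant coefficients), which is the horseshoe shape referred to above.

import numpy as np

rng = np.random.default_rng(0)
lam = np.abs(rng.standard_cauchy(200000))        # half-Cauchy C+(0, 1) scales of the horseshoe
kappa = 1.0 / (1.0 + lam ** 2)                   # random shrinkage coefficient (tau = sigma = 1)
density, _ = np.histogram(kappa, bins=10, range=(0.0, 1.0), density=True)
print(np.round(density, 2))                      # largest values in the first and last bins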
26 A limitation of the horseshoe is that it does not consider dependencies in the feature selection process. [sent-110, score-0.755]
27 Specifically, the fact that one feature is actually relevant for prediction has no impact at all on the prior relevance or irrelevance of other features. [sent-111, score-0.246]
28 We now describe how to introduce these dependencies in the horseshoe. [sent-112, score-0.202]
29 p(w|ρ^2, γ^2) = ∏_{j=1}^d ∫ N(wj|0, uj^2/vj^2) N(uj|0, ρ^2) N(vj|0, γ^2) duj dvj, (3) where uj and vj are latent variables introduced for each dimension j. [sent-115, score-0.504]
30 Furthermore, τ has been incorporated into the prior for uj and vj using τ^2 = ρ^2/γ^2. [sent-117, score-0.513]
31 The latent variables uj and vj can be interpreted as indicators of the relevance or irrelevance of feature j. [sent-118, score-0.588]
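The role of the latent variables can be checked numerically: the ratio |uj|/|vj| of two independent zero-mean Gaussians is half-Cauchy with scale ρ/γ, so sampling wj ~ N(0, uj^2/vj^2) recovers a horseshoe-distributed coefficient with τ = ρ/γ. A minimal sketch (the values of ρ and γ below are illustrative, not taken from the paper):

import numpy as np

rng = np.random.default_rng(0)
rho, gamma, n = 1.0, 2.0, 200000
u = rng.normal(0.0, rho, size=n)                 # u_j ~ N(0, rho^2)
v = rng.normal(0.0, gamma, size=n)               # v_j ~ N(0, gamma^2)
scale = np.abs(u) / np.abs(v)                    # half-Cauchy with scale tau = rho / gamma
w = rng.normal(0.0, scale)                       # w_j | u_j, v_j ~ N(0, u_j^2 / v_j^2)
lam = np.abs(rng.standard_cauchy(n)) * (rho / gamma)
print(np.median(scale), np.median(lam))          # both medians are close to rho / gamma = 0.5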
32 A simple way of introducing dependencies in the feature selection process is to consider correlations among variables uj and vj, with j = 1, . . . , d. [sent-121, score-0.941]
33 where u = (u1, . . . , ud)^T and v = (v1, . . . , vd)^T, C is a correlation matrix that specifies the dependencies in the feature selection process, and ρ^2 and γ^2 act as regularization parameters that control the level of sparsity. [sent-131, score-0.47]
34 where ∆ is a diagonal matrix with entries a1, . . . , ad; D is a diagonal matrix whose entries are all equal to some small positive constant (this matrix guarantees that C^{-1} exists); the products by ∆ ensure that the entries of C are in the range (−1, 1); and P is a d × m matrix of real entries which specifies the correlation structure of C. [sent-144, score-0.232]
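The exact parameterization of (5) is not fully recoverable from the extracted text; the sketch below only illustrates the ingredients mentioned above: a d × m matrix P carrying the correlation structure, a small diagonal D that keeps the matrix invertible, and diagonal scalings ∆ that bring the entries into (−1, 1). The choice of ∆ as the inverse square root of the diagonal is an assumption made here so that C has unit diagonal.

import numpy as np

def build_correlation(P, delta=1e-3):
    # Low-rank-plus-diagonal construction (sketch; the paper's (5) may differ in details).
    d = P.shape[0]
    M = P @ P.T + delta * np.eye(d)        # D = delta * I guarantees that the inverse exists
    s = 1.0 / np.sqrt(np.diag(M))          # assumed normalizer
    C = (s[:, None] * M) * s[None, :]      # unit diagonal, off-diagonal entries in (-1, 1)
    return C

rng = np.random.default_rng(0)
C = build_correlation(rng.standard_normal((8, 2)))   # O(d * m) free parameters in P
print(np.allclose(np.diag(C), 1.0))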
35 Based on the formulation of the previous section, the joint probability distribution of y and z is: p(y, z|X, σ^2, ρ^2, γ^2, C) = N(y|Xw, σ^2 I) N(u|0, ρ^2 C) N(v|0, γ^2 C) ∏_{j=1}^d N(wj|0, uj^2/vj^2). (6) [sent-151, score-0.208]
36 All the factors in (6) are Gaussian, except the ones corresponding to the prior for wj given uj and vj, N(wj|0, uj^2/vj^2). [sent-154, score-0.697]
37 The numerator of (7) is the joint distribution (6) and the denominator is simply a normalization constant (the model evidence) which can be used for Bayesian model selection [19]. [sent-159, score-0.264]
38 (8) Similarly, one can marginalize (7) with respect to w to obtain a posterior distribution for u and v which can be useful to identify the most relevant or irrelevant features. [sent-161, score-0.289]
39 Ideally, however, one should also infer C, the correlation matrix that describes the dependencies in the feature selection process, and compute a posterior distribution for it. [sent-162, score-0.515]
40 where f(·) corresponds to the likelihood N(y|Xw, σ^2 I), and each gj(·) to the prior for wj given uj and vj, N(wj|0, uj^2/vj^2). [sent-192, score-1.068]
41 A simple assumption is that all these tasks share a common dependency structure C for the feature selection process, although the model coefficients and the actual relevant features may differ between tasks. [sent-201, score-0.542]
42 This assumption is less restrictive than assuming jointly relevant and irrelevant features across tasks and can be incorporated into the learning process using the described model with few modifications. [sent-202, score-0.458]
43 Assume there are K learning tasks available for induction and that each task k = 1, . . . , K. [sent-204, score-0.203]
44 Assume for the model coefficients of each task a horseshoe prior as the one specified in (4) with a shared correlation matrix C, but with task-specific hyper-parameters ρk^2 and γk^2. [sent-210, score-0.203]
45 Denote by uk and vk the vectors of latent Gaussian variables of the prior for task k. [sent-211, score-0.203]
46 Similarly, let zk = (wk^T, uk^T, vk^T)^T be the vector of latent variables of task k. [sent-212, score-0.204]
47 Then, the joint posterior distribution of the latent variables of the different tasks factorizes as follows: p({zk}_{k=1}^K | {Xk, yk, τk^2, ρk^2, σk^2}_{k=1}^K, C) = ∏_{k=1}^K p(yk, zk|Xk, σk^2, ρk^2, γk^2, C) / p(yk|Xk, σk^2, ρk^2, γk^2, C), (9) where each factor in the r.h.s. [sent-213, score-0.415]
48 In summary, if there is a method to approximate the required quantities for learning a single task using the model proposed, implementing a multi-task learning method that assumes shared feature selection dependencies but task-dependent hyper-parameters is straightforward. [sent-228, score-0.628]
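Because the posterior in (9) factorizes across tasks given C, the multi-task method can be sketched as a loop that reuses a single-task routine with task-specific hyper-parameters. In the Python sketch below, single_task_ep is a hypothetical stand-in for the EP procedure of Section 3; only the structure is illustrated, not the authors' implementation.

def multi_task_evidence(tasks, C, single_task_ep):
    # tasks: list of dicts with keys X, y, sigma2, rho2, gamma2 (task-specific values).
    # single_task_ep: hypothetical routine returning (approx. posterior, approx. log evidence).
    posteriors, log_Z = [], 0.0
    for t in tasks:
        q_k, log_Z_k = single_task_ep(t["X"], t["y"], t["sigma2"], t["rho2"], t["gamma2"], C)
        posteriors.append(q_k)
        log_Z += log_Z_k        # per-task evidences add up because (9) factorizes given C
    return posteriors, log_Z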
49 3 Approximate Inference Expectation propagation (EP) [20] is used to approximate the posterior distribution and the evidence of the model described in Section 2. [sent-229, score-0.211]
50 Up to a normalization constant this distribution can be written as p(z|X, y, σ^2, ρ^2, γ^2) ∝ f(w) hu(u) hv(v) ∏_{j=1}^d gj(z), (10) where the factors in the r.h.s. [sent-234, score-0.45]
51 Note that all factors except the gj's are Gaussian. [sent-238, score-0.425]
52 EP approximates (10) by a distribution q(z) ∝ f(w) hu(u) hv(v) ∏_{j=1}^d ˜gj(z), which is obtained by replacing each non-Gaussian factor gj in (10) with an approximate factor ˜gj that is Gaussian but need not be normalized. [sent-239, score-1.23]
53 EP iteratively updates each ˜gj until convergence by first computing q\j ∝ q/˜gj and then minimizing the Kullback-Leibler (KL) divergence between gj q\j and q^new, KL(gj q\j || q^new), with respect to q^new. [sent-241, score-0.832]
54 The new approximate factor is obtained as ˜gj^new = sj q^new/q\j, where sj is the normalization constant of gj q\j. [sent-242, score-0.995]
55 This update rule ensures that ˜gj looks similar to gj in regions of high posterior probability in terms of q\j [20]. [sent-243, score-0.841]
56 Minimizing the KL divergence is a convex problem whose optimum is found by matching the means and the covariance matrices between gj q\j and q^new. [sent-244, score-0.398]
57 Importantly, gj, and therefore ˜gj, depend only on wj, uj and vj, so a three-dimensional quadrature will suffice. [sent-248, score-1.442]
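The moment-matching step can be illustrated in one dimension (this is a generic EP update written for this summary, not the paper's three-dimensional update over wj, uj and vj): combine a non-Gaussian factor with a Gaussian cavity, obtain the normalizer and the first two moments by quadrature on a grid, and read off the updated approximate factor as a ratio of Gaussians in natural parameters.

import numpy as np

def moment_match_1d(factor, cav_mean, cav_var, width=10.0, n_grid=2001):
    # One EP moment-matching step in 1d via grid quadrature (illustrative sketch only).
    s = np.sqrt(cav_var)
    x = np.linspace(cav_mean - width * s, cav_mean + width * s, n_grid)
    dx = x[1] - x[0]
    cavity = np.exp(-0.5 * (x - cav_mean) ** 2 / cav_var) / np.sqrt(2 * np.pi * cav_var)
    tilted = factor(x) * cavity
    Z = np.sum(tilted) * dx                            # normalizer, the analogue of s_j
    mean = np.sum(x * tilted) * dx / Z                 # matched mean
    var = np.sum((x - mean) ** 2 * tilted) * dx / Z    # matched variance
    # Updated approximate factor = N(mean, var) / cavity, in natural parameters.
    return Z, mean, var, (mean / var - cav_mean / cav_var, 1.0 / var - 1.0 / cav_var)

print(moment_match_1d(lambda x: 1.0 / (1.0 + x ** 2), 0.0, 1.0))   # heavy-tailed example factor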
58 Assume that q\j(wj, uj, vj) = N(wj|mj, ηj) N(uj|0, νj) N(vj|0, ξj), i.e., [sent-250, score-0.434]
59 q\j factorizes with respect to wj, uj and vj, and the mean of uj and vj is zero. [sent-252, score-1.122]
60 Since gj is symmetric with respect to uj and vj then E[uj ] = E[vj ] = E[uj vj ] = E[uj wj ] = E[vj wj ] = 0 under gj q \j . [sent-253, score-1.831]
61 Thus, if the initial approximate factors ˜gj factorize with respect to wj, uj and vj, and have zero mean with respect to uj and vj, any updated factor will also satisfy these properties and q\j will have the assumed form. [sent-254, score-1.583]
62 The crucial point here is that the dependencies introduced by gj do not lead to correlations that need to be tracked under a Gaussian approximation. [sent-255, score-0.648]
63 In this situation, the integral of gj q \j with respect to wj is given by the convolution of two Gaussians and the integral of the result with respect to uj and vj can be simplified using arguments similar to those employed to obtain (3). [sent-256, score-1.086]
64 Therefore, each update of gj requires five quadratures: one to evaluate sj and four to evaluate its derivatives. [sent-259, score-0.467]
65 Instead of sequentially updating each ˜gj, we follow [7] and update these factors in parallel. [sent-260, score-0.425]
66 For this, we compute all q\j at the same time and update each ˜gj. [sent-261, score-0.398]
67 These can be efficiently obtained using the low-rank structure of the covariance matrix of q that results from the fact that all the ˜gj's are factorizing univariate Gaussians and from the assumed form for C in (5). [sent-263, score-0.425]
68 Lastly, we damp the update of each ˜gj as follows: ˜gj = (˜gj^new)^α (˜gj^old)^(1−α), where ˜gj^new and ˜gj^old respectively denote the new and the old ˜gj, and α ∈ [0, 1] is a parameter that controls the amount of damping. [sent-265, score-2.048]
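For unnormalized Gaussian factors, the damped update above reduces to a convex combination of natural parameters, since products and powers of exponential-family factors act linearly on them. A minimal sketch of this generic rule (not tied to the paper's specific factors):

def damp_gaussian_factor(nat_new, nat_old, alpha=0.5):
    # Damped EP update: (new factor)^alpha * (old factor)^(1 - alpha).
    # nat = (eta, lam), for a factor proportional to exp(eta * x - 0.5 * lam * x ** 2).
    eta_new, lam_new = nat_new
    eta_old, lam_old = nat_old
    return (alpha * eta_new + (1 - alpha) * eta_old,
            alpha * lam_new + (1 - alpha) * lam_old)

print(damp_gaussian_factor((2.0, 1.0), (0.0, 4.0), alpha=0.5))   # alpha = 1 is the undamped update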
69 Similarly, the model evidence in (7) can be approximated by ˜Z, the normalization constant of q: ˜Z = ∫ f(w) hu(u) hv(v) ∏_{j=1}^d ˜gj(z) dw du dv. [sent-268, score-0.497]
70 Specifically, once EP has converged, the gradient of log ˜Z with respect to the natural parameters of the ˜gj's is zero, so the dependence of the ˜gj's on these hyper-parameters can be ignored when computing gradients [21]. [sent-270, score-0.468]
71 The first one, HSST , is a particular case of HSDep that is obtained when each task is learnt independently and correlations in the feature selection process are ignored (i. [sent-278, score-0.359]
72 A multi-task learning model, HSMT , which assumes common relevant and irrelevant features among tasks is also considered. [sent-281, score-0.346]
73 It assumes a horseshoe prior in which the scale parameters λj in (2) are shared among tasks, i.e., [sent-283, score-0.406]
74 each feature is either relevant or irrelevant in all tasks. [sent-285, score-0.292]
75 SSMT considers a spike-and-slab prior for joint feature selection across all tasks, instead of a horseshoe prior. [sent-287, score-0.633]
76 This model assumes shared relevant and irrelevant features among tasks. [sent-291, score-0.283]
77 However, some tasks are allowed to have specific relevant features. [sent-292, score-0.202]
78 This model, BM, uses spike-and-slab priors for feature selection and specifies dependencies in this process using a Boltzmann machine. [sent-296, score-0.471]
79 In each task, the entries of Xk are sampled from a standard Gaussian distribution and the model coefficients, wk, are all set to zero except for the i-th group of 8 consecutive coefficients, with i chosen randomly for each task from the set {1, 2, . . . }. [sent-302, score-0.261]
80 Thus, in each task there are only 8 relevant features for prediction. [sent-307, score-0.231]
81 Given each Xk and each wk, the targets yk are obtained using (1) with σk^2 = 0. [sent-308, score-0.257]
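The synthetic setup can be reproduced along the following lines; the number of tasks, the dimensionality, the sample size, the noise level and the values of the nonzero coefficients are illustrative assumptions, since they are not fully specified in the extracted text.

import numpy as np

def make_synthetic_tasks(K=4, d=64, n=64, group_size=8, noise_std=0.5, seed=0):
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(K):
        X = rng.standard_normal((n, d))                  # entries of X_k are standard Gaussian
        w = np.zeros(d)
        i = rng.integers(d // group_size)                # one randomly chosen group per task
        w[i * group_size:(i + 1) * group_size] = rng.standard_normal(group_size)  # assumed values
        y = X @ w + noise_std * rng.standard_normal(n)   # targets from the linear model (1)
        tasks.append((X, y, w))
    return tasks

tasks = make_synthetic_tasks()   # each task has 8 relevant features in a different group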
82 Denote by ŵk the estimate of the model coefficients for task k (this is the posterior mean except in BM and DM). [sent-331, score-0.269]
83 The reconstruction error is measured as ||ŵk − wk||_2 / ||wk||_2, where ||·||_2 is the ℓ2-norm and wk are the exact coefficients of task k. [sent-332, score-0.377]
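The reconstruction error used above is the relative ℓ2 error between the estimated and the exact coefficients; a minimal helper:

import numpy as np

def reconstruction_error(w_hat, w_true):
    return np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)   # ||w_hat - w||_2 / ||w||_2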
84 BM performs worse than HSDep because the greedy MAP estimation of the sparsity patterns of each task is sometimes trapped in sub-optimal solutions. [sent-336, score-0.193]
85 The poor results of HSMT , SSMT and DM are due to the assumption made by these models of all tasks sharing relevant features, which is not satisfied. [sent-337, score-0.202]
86 Thus, within each block the corresponding latent variables uj and vj are strongly correlated, indicating jointly relevant or irrelevant features. [sent-340, score-0.737]
87 Black squares are groups of jointly relevant / irrelevant features. [sent-345, score-0.233]
88 Furthermore, since the 7 tasks contain images of different digits they are expected to have different relevant features. [sent-359, score-0.302]
89 Given Xk and wk, the targets yk are generated using (1) with σk^2 = 0. [sent-360, score-0.257]
90 The objective is to reconstruct wk from Xk and yk for each task k. [sent-362, score-0.282]
91 The second best result corresponds to HSMT , probably due to background pixels which are irrelevant in all the tasks and to the heavy-tails of the horseshoe prior. [sent-370, score-0.59]
92 DM performs poorly probably because of the inferior shrinking properties of the ℓ1 norm compared to the horseshoe [3]. [sent-372, score-0.352]
93 Finally, Figure 4 (left, bottom) shows in gray scale the average correlations in absolute value induced by HSDep for the selection process of each pixel of the image with respect to the selection of a particular pixel which is displayed in green. [sent-378, score-0.511]
94 Correlations are high to avoid the selection of background pixels and to select pixels that actually correspond to the digits 3 and 5. [sent-379, score-0.254]
95 (left, bottom) Average absolute value correlation in a gray scale (white=0 and black=1) between the latent variables uj and vj corresponding to the pixel displayed in green and the variables uj and vj corresponding to all the other pixels of the image. [sent-394, score-1.162]
96 5 Conclusions and Future Work We have described a linear sparse model for learning dependencies in the feature selection process. [sent-397, score-0.467]
97 The model can be used in a multi-task learning setting with several tasks available for induction that need not share relevant features, but only dependencies in the feature selection process. [sent-398, score-0.702]
98 Our experiments also show that the proposed model is able to induce relevant feature selection dependencies from the training data alone. [sent-403, score-0.542]
99 Joint covariate selection and joint subspace selection for multiple classification problems. [sent-491, score-0.26]
100 Exploiting statistical dependencies in sparse representations for signal recovery. [sent-547, score-0.235]
wordName wordTfidf (topN-words)
[('gj', 0.398), ('horseshoe', 0.352), ('hsdep', 0.273), ('uj', 0.233), ('dependencies', 0.202), ('vj', 0.201), ('ssmt', 0.189), ('wj', 0.182), ('cients', 0.162), ('hsmt', 0.147), ('hsst', 0.147), ('coef', 0.147), ('hern', 0.136), ('ep', 0.127), ('selection', 0.117), ('wk', 0.114), ('relevant', 0.108), ('zmt', 0.105), ('bm', 0.102), ('irrelevant', 0.1), ('dm', 0.1), ('tasks', 0.094), ('yk', 0.089), ('feature', 0.084), ('xk', 0.08), ('task', 0.079), ('reconstruction', 0.07), ('sj', 0.069), ('hv', 0.068), ('propagation', 0.056), ('zk', 0.055), ('targets', 0.054), ('xw', 0.054), ('prior', 0.054), ('shrinkage', 0.053), ('xnew', 0.051), ('ynew', 0.051), ('images', 0.051), ('digits', 0.049), ('correlations', 0.048), ('tall', 0.048), ('scenario', 0.047), ('laplace', 0.047), ('pixel', 0.047), ('sparsity', 0.046), ('hu', 0.045), ('posterior', 0.045), ('latent', 0.045), ('pixels', 0.044), ('features', 0.044), ('jos', 0.044), ('evidence', 0.043), ('trapped', 0.041), ('correlation', 0.04), ('dirty', 0.039), ('student', 0.039), ('infosys', 0.037), ('gray', 0.037), ('entries', 0.037), ('priors', 0.037), ('factorizes', 0.036), ('mj', 0.036), ('approximate', 0.036), ('share', 0.036), ('respect', 0.036), ('editors', 0.035), ('nitely', 0.034), ('boltzmann', 0.034), ('denominator', 0.034), ('tails', 0.034), ('gradient', 0.034), ('sparse', 0.033), ('reconstructed', 0.033), ('gaussian', 0.032), ('displayed', 0.031), ('process', 0.031), ('model', 0.031), ('quadrature', 0.03), ('induction', 0.03), ('supports', 0.03), ('synthetic', 0.03), ('origin', 0.029), ('bayesian', 0.029), ('expectation', 0.029), ('old', 0.029), ('miguel', 0.028), ('differ', 0.028), ('bottom', 0.028), ('factors', 0.027), ('spike', 0.027), ('matrix', 0.027), ('estimation', 0.027), ('inference', 0.027), ('joint', 0.026), ('alternatives', 0.026), ('normalization', 0.025), ('variables', 0.025), ('incorporated', 0.025), ('jointly', 0.025), ('df', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
2 0.088515438 65 nips-2013-Compressive Feature Learning
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
3 0.083865471 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
Author: Edoardo M. Airoldi, Thiago B. Costa, Stanley H. Chan
Abstract: Non-parametric approaches for analyzing network data based on exchangeable graph models (ExGM) have recently gained interest. The key object that defines an ExGM is often referred to as a graphon. This non-parametric perspective on network modeling poses challenging questions on how to make inference on the graphon underlying observed network data. In this paper, we propose a computationally efficient procedure to estimate a graphon from a set of observed networks generated from it. This procedure is based on a stochastic blockmodel approximation (SBA) of the graphon. We show that, by approximating the graphon with a stochastic block model, the graphon can be consistently estimated, that is, the estimation error vanishes as the size of the graph approaches infinity.
4 0.082981363 193 nips-2013-Mixed Optimization for Smooth Functions
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T^2). In this work, we consider a new setup for optimizing smooth functions, termed as Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T). 1
5 0.080699869 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
Author: Ryosuke Matsushita, Toshiyuki Tanaka
Abstract: We study the problem of reconstructing low-rank matrices from their noisy observations. We formulate the problem in the Bayesian framework, which allows us to exploit structural properties of matrices in addition to low-rankedness, such as sparsity. We propose an efficient approximate message passing algorithm, derived from the belief propagation algorithm, to perform the Bayesian inference for matrix reconstruction. We have also successfully applied the proposed algorithm to a clustering problem, by reformulating it as a low-rank matrix reconstruction problem with an additional structural property. Numerical experiments show that the proposed algorithm outperforms Lloyd’s K-means algorithm. 1
6 0.080463082 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
7 0.079541139 173 nips-2013-Least Informative Dimensions
8 0.078455061 318 nips-2013-Structured Learning via Logistic Regression
9 0.078184135 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
10 0.077109009 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series
11 0.076944642 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
12 0.073407874 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
13 0.073197111 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
14 0.06974455 168 nips-2013-Learning to Pass Expectation Propagation Messages
15 0.067691036 232 nips-2013-Online PCA for Contaminated Data
16 0.067476861 350 nips-2013-Wavelets on Graphs via Deep Learning
17 0.065974049 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
18 0.064531326 158 nips-2013-Learning Multiple Models via Regularized Weighting
19 0.061815523 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
20 0.061513808 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
topicId topicWeight
[(0, 0.195), (1, 0.067), (2, -0.021), (3, 0.011), (4, -0.007), (5, 0.057), (6, -0.006), (7, 0.045), (8, 0.014), (9, 0.005), (10, 0.016), (11, 0.055), (12, -0.065), (13, -0.047), (14, -0.05), (15, 0.029), (16, -0.022), (17, -0.019), (18, -0.075), (19, -0.022), (20, -0.014), (21, 0.016), (22, -0.013), (23, 0.022), (24, 0.025), (25, 0.008), (26, 0.044), (27, 0.012), (28, 0.031), (29, -0.089), (30, 0.039), (31, 0.086), (32, 0.014), (33, 0.077), (34, -0.035), (35, 0.029), (36, -0.018), (37, 0.019), (38, 0.012), (39, 0.065), (40, -0.073), (41, -0.012), (42, 0.045), (43, -0.025), (44, 0.056), (45, 0.036), (46, -0.009), (47, 0.104), (48, -0.018), (49, -0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.92693454 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
2 0.65707242 167 nips-2013-Learning the Local Statistics of Optical Flow
Author: Dan Rosenbaum, Daniel Zoran, Yair Weiss
Abstract: Motivated by recent progress in natural image statistics, we use newly available datasets with ground truth optical flow to learn the local statistics of optical flow and compare the learned models to prior models assumed by computer vision researchers. We find that a Gaussian mixture model (GMM) with 64 components provides a significantly better model for local flow statistics when compared to commonly used models. We investigate the source of the GMM’s success and show it is related to an explicit representation of flow boundaries. We also learn a model that jointly models the local intensity pattern and the local optical flow. In accordance with the assumptions often made in computer vision, the model learns that flow boundaries are more likely at intensity boundaries. However, when evaluated on a large dataset, this dependency is very weak and the benefit of conditioning flow estimation on the local intensity pattern is marginal. 1
3 0.64344627 245 nips-2013-Pass-efficient unsupervised feature selection
Author: Crystal Maung, Haim Schweitzer
Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1
4 0.63406879 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
5 0.63214993 327 nips-2013-The Randomized Dependence Coefficient
Author: David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf
Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper. 1
6 0.62895411 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
7 0.61569834 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series
8 0.60608619 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
9 0.59773374 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
10 0.57614046 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
11 0.5727787 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs
12 0.57268661 337 nips-2013-Transportability from Multiple Environments with Limited Experiments
13 0.56693995 318 nips-2013-Structured Learning via Logistic Regression
14 0.56643701 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
15 0.56062496 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
16 0.5586378 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
17 0.55853122 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
18 0.55576557 160 nips-2013-Learning Stochastic Feedforward Neural Networks
19 0.54967242 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
20 0.54847258 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
topicId topicWeight
[(16, 0.039), (33, 0.169), (34, 0.127), (41, 0.054), (49, 0.033), (56, 0.108), (70, 0.028), (71, 0.189), (85, 0.057), (89, 0.021), (93, 0.072), (95, 0.02)]
simIndex simValue paperId paperTitle
1 0.90668744 327 nips-2013-The Randomized Dependence Coefficient
Author: David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf
Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper. 1
2 0.86236817 226 nips-2013-One-shot learning by inverting a compositional causal process
Author: Brenden M. Lake, Ruslan Salakhutdinov, Josh Tenenbaum
Abstract: People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also tested the model on another conceptual task, generating new examples, by using a “visual Turing test” to show that our model produces human-like performance. 1
same-paper 3 0.85594308 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
4 0.79600966 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
5 0.79515368 99 nips-2013-Dropout Training as Adaptive Regularization
Author: Stefan Wager, Sida Wang, Percy Liang
Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1
6 0.7929135 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
7 0.79227 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
8 0.79176635 5 nips-2013-A Deep Architecture for Matching Short Texts
9 0.79070425 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
10 0.78988212 251 nips-2013-Predicting Parameters in Deep Learning
11 0.78683704 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
12 0.78663087 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
13 0.78628248 294 nips-2013-Similarity Component Analysis
14 0.78589404 173 nips-2013-Least Informative Dimensions
15 0.78564346 318 nips-2013-Structured Learning via Logistic Regression
16 0.78563744 301 nips-2013-Sparse Additive Text Models with Low Rank Background
17 0.7856108 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
18 0.78535056 350 nips-2013-Wavelets on Graphs via Deep Learning
19 0.78524983 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
20 0.78520882 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization