nips nips2011 nips2011-104 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Artin Armagan, Merlise Clyde, David B. Dunson
Abstract: In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. [sent-11, score-0.745]
2 In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. [sent-12, score-0.74]
3 We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. [sent-13, score-1.003]
4 This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. [sent-14, score-0.134]
5 We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. [sent-15, score-0.509]
6 1 Introduction Penalized likelihood estimation has evolved into a major area of research, with ℓ1 [22] and other regularization penalties now used routinely in a rich variety of domains. [sent-16, score-0.136]
7 Often minimizing a loss function subject to a regularization penalty leads to an estimator that has a Bayesian interpretation as the mode of a posterior distribution [8, 11, 1, 2], with different prior distributions inducing different penalties. [sent-17, score-0.347]
8 For example, it is well known that Gaussian priors induce ℓ2 penalties, while double exponential priors induce ℓ1 penalties [8, 19, 13, 1]. [sent-18, score-1.044]
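The correspondence between priors and penalties is easy to check numerically. The sketch below is an illustration written for this summary (not code from the paper): it verifies that the negative log-density of a standard normal prior equals the ridge (ℓ2) penalty up to an additive constant, and that of a double exponential (Laplace) prior equals the lasso (ℓ1) penalty.

```python
import numpy as np
from scipy.stats import norm, laplace

# Negative log-priors act as penalties in MAP estimation:
# Gaussian prior -> squared (ridge) penalty, Laplace prior -> absolute-value (lasso) penalty.
beta = np.linspace(-3, 3, 7)

ridge_penalty = -norm.logpdf(beta, scale=1.0)      # beta**2 / 2 + const
lasso_penalty = -laplace.logpdf(beta, scale=1.0)   # |beta| + const

print(np.allclose(ridge_penalty - ridge_penalty.min(), beta**2 / 2))   # True
print(np.allclose(lasso_penalty - lasso_penalty.min(), np.abs(beta)))  # True
```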
9 Viewing massive-dimensional parameter learning and prediction problems from a Bayesian perspective naturally leads one to design new priors that have substantial advantages over the simple normal or double exponential choices and that induce rich new families of penalties. [sent-19, score-0.763]
10 For example, in high-dimensional settings it is often appealing to have a prior that is concentrated at zero, favoring strong shrinkage of small signals and potentially a sparse estimator, while having heavy tails to avoid over-shrinkage of the larger signals. [sent-20, score-0.717]
11 This phenomenon has motivated a rich variety of new priors such as the normal-exponential-gamma, the horseshoe and the generalized double Pareto [11, 14, 1, 6, 20, 7, 12, 2]. [sent-22, score-0.807]
12 An alternative and widely applied Bayesian framework relies on variable selection priors and Bayesian model selection/averaging [18, 9, 16, 15]. [sent-23, score-0.367]
13 Under such approaches the prior is a mixture of a mass at zero, corresponding to the coefficients to be set equal to zero and hence excluded from the model, and a continuous distribution, providing a prior for the size of the non-zero signals. [sent-24, score-0.388]
14 Some recently proposed continuous shrinkage priors may be considered competitors to the conventional mixture priors [15, 6, 7, 12] yielding computationally attractive alternatives to Bayesian model averaging. [sent-28, score-1.012]
15 The ones represented as scale mixtures of Gaussians allow conjugate block updating of the regression coefficients in linear models and hence lead to substantial improvements in Markov chain Monte Carlo (MCMC) efficiency through more rapid mixing and convergence rates. [sent-30, score-0.407]
16 Under certain conditions these will also yield sparse estimates, if desired, via maximum a posteriori (MAP) estimation and approximate inferences via variational approaches [17, 24, 5, 8, 11, 1, 2]. [sent-31, score-0.259]
17 The class of priors that we consider in this paper encompasses many interesting priors as special cases and reveals interesting connections among different hierarchical formulations. [sent-32, score-0.987]
18 Exploiting an equivalent conjugate hierarchy of this class of priors, we develop a class of variational Bayes approximations that can scale up to truly massive data sets. [sent-33, score-0.684]
19 This conjugate hierarchy also allows for conjugate modeling of some previously proposed priors which have some rather complex yet advantageous forms and facilitates straightforward computation via Gibbs sampling. [sent-34, score-0.83]
20 We also argue intuitively that by adjusting a global shrinkage parameter that controls the overall sparsity level, we may control the number of non-zero parameters to be estimated, enhancing results, if there is an underlying sparse structure. [sent-35, score-0.4]
21 This global shrinkage parameter is inherent to the structure of the priors we discuss as in [6, 7] with close connections to the conventional variable selection priors. [sent-36, score-0.688]
22 2 Background We provide a brief background on shrinkage priors focusing primarily on the priors studied by [6, 7] and [11, 12] as well as the Strawderman-Berger (SB) prior [7]. [sent-37, score-1.121]
23 These priors possess some very appealing properties in contrast to the double exponential prior which leads to the Bayesian lasso [19, 13]. [sent-38, score-0.855]
24 They may be much heavier-tailed, biasing large signals less drastically while shrinking noiselike signals heavily towards zero. [sent-39, score-0.209]
25 In particular, the priors by [6, 7], along with the Strawderman-Berger prior [7], have a very interesting and intuitive representation, later given in (2), yet are not formed in a conjugate manner, potentially leading to analytical and computational complexity. [sent-40, score-0.744]
26 [6, 7] propose a useful class of priors for the estimation of multiple means. [sent-41, score-0.441]
27 The independent hierarchical prior for θj is given by θj | τj ∼ N(0, τj), τj^{1/2} ∼ C+(0, φ^{1/2}), (1) for j = 1, . . . , [sent-43, score-0.249]
28 p, where N(µ, ν) denotes a normal distribution with mean µ and variance ν and C+(0, s) denotes a half-Cauchy distribution on ℝ+ with scale parameter s. [sent-46, score-0.338]
29 With an appropriate transformation ρj = 1/(1 + τj), this hierarchy can also be represented as θj | ρj ∼ N(0, 1/ρj − 1), π(ρj | φ) ∝ ρj^{−1/2} (1 − ρj)^{−1/2} / {1 + (φ − 1)ρj}. (2) [sent-47, score-0.183]
30 A special case where φ = 1 leads to ρj ∼ B(1/2, 1/2) (a beta distribution), which is where the name of the prior, horseshoe (HS), arises [6, 7]. [sent-48, score-0.375]
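The hierarchy in (1)-(2) is straightforward to simulate. The sketch below is my own illustration (not the authors' code): with φ = 1 it draws τj^{1/2} from a standard half-Cauchy, forms θj and the shrinkage coefficients ρj = 1/(1 + τj), and checks that the ρj behave like Beta(1/2, 1/2) draws, as the horseshoe representation states.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000

lam = stats.halfcauchy.rvs(size=n, random_state=rng)   # tau_j^{1/2} ~ C+(0, 1), i.e. phi = 1
tau = lam**2
theta = rng.normal(0.0, np.sqrt(tau))                  # marginally horseshoe-distributed draws
rho = 1.0 / (1.0 + tau)                                # shrinkage coefficients

# Kolmogorov-Smirnov comparison against Beta(1/2, 1/2); the statistic should be tiny.
print(stats.kstest(rho, stats.beta(0.5, 0.5).cdf).statistic)
```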
31 Here ρj s are referred to as the shrinkage coefficients as they determine the magnitude with which θj s are pulled toward zero. [sent-49, score-0.234]
32 A prior of the form ρj ∼ B(1/2, 1/2) is natural to consider in the estimation of a signal θj as this yields a very desirable behavior both at the tails and in the neighborhood of zero. [sent-50, score-0.412]
33 That is, the resulting prior has heavy-tails as well as being unbounded at zero which creates a strong pull towards zero for those values close to zero. [sent-51, score-0.264]
34 [7] further discuss priors of the form ρj ∼ B(a, b) for a > 0, b > 0 to elaborate more on their focus on the choice a = b = 1/2. [sent-52, score-0.367]
35 [7] refer to the prior of the form ρj ∼ B(1, 1/2) as the Strawderman-Berger prior due to [21] and [4]. [sent-54, score-0.306]
36 The same hierarchical prior is also referred to as the quasi-Cauchy prior in [16]. [sent-55, score-0.402]
37 Hence, the tail behavior of the Strawderman-Berger prior remains similar to that of the horseshoe (when φ = 1), while the behavior around the origin changes. [sent-56, score-0.539]
38 The hierarchy in (2) is much more intuitive than the one in (1) as it explicitly reveals the behavior of the resulting marginal prior on θj . [sent-57, score-0.498]
39 This intuitive representation makes these hierarchical priors interesting despite their relatively complex forms. [sent-58, score-0.463]
40 On the other hand, what the prior in (1) or (2) lacks is a more trivial hierarchy that yields recognizable conditional posteriors in linear models. [sent-59, score-0.385]
41 [11, 12] consider the normal-exponential-gamma (NEG) and normal-gamma (NG) priors, respectively, which are formed in a conjugate manner yet lack the intuition that the Strawderman-Berger and horseshoe priors provide in terms of the behavior of the density around the origin and at the tails. [sent-60, score-1.297]
42 Hence the implementation of these priors may be more user-friendly but they are very implicit in how they behave. [sent-61, score-0.367]
43 In fact, we may unite these two distinct hierarchical formulations under the same class of priors through a generalized beta distribution and the proposed equivalence of hierarchies in the following section. [sent-63, score-0.963]
44 This is rather important for comparing the behavior of priors emerging from different hierarchical formulations. [sent-64, score-0.518]
45 Furthermore, this equivalence in the hierarchies will allow for a straightforward Gibbs sampling update in posterior inference, as well as making variational approximations possible in linear models. [sent-65, score-0.421]
46 3 Equivalence of Hierarchies via a Generalized Beta Distribution In this section we propose a generalization of the beta distribution to form a flexible class of scale mixtures of normals with very appealing behavior. [sent-66, score-0.585]
47 We then formulate our hierarchical prior in a conjugate manner and reveal similarities and connections to the priors given in [16, 11, 12, 6, 7]. [sent-67, score-0.852]
48 As the name generalized beta has previously been used, we refer to our generalization as the three-parameter beta (TPB) distribution. [sent-68, score-0.438]
49 The three-parameter beta (TPB) distribution for a random variable X is defined by the density function f(x; a, b, φ) = {Γ(a + b)/(Γ(a)Γ(b))} φ^b x^{b−1} (1 − x)^{a−1} {1 + (φ − 1)x}^{−(a+b)}, (3) for 0 < x < 1, a > 0, b > 0 and φ > 0, and is denoted by TPB(a, b, φ). [sent-73, score-0.323]
50 The kth moment of the TPB distribution is given by E(X^k) = φ^b {Γ(a + b)Γ(b + k)/(Γ(b)Γ(a + b + k))} 2F1(a + b, b + k; a + b + k; 1 − φ), (4) where 2F1 denotes the hypergeometric function. [sent-75, score-0.189]
51 In fact it can be shown that the TPB is a subclass of the Gauss hypergeometric (GH) distribution proposed in [3] and of the compound confluent hypergeometric (CCH) distribution proposed in [10]. [sent-76, score-0.280]
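The density (3) and the moment formula (4) are simple to check numerically. The sketch below is written from the formulas above (not taken from the paper) and uses an arbitrary parameter setting; it confirms that the density integrates to one and that the hypergeometric expression matches direct numerical integration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as G, hyp2f1

def tpb_pdf(x, a, b, phi):
    # Density (3) of the three-parameter beta distribution.
    c = G(a + b) / (G(a) * G(b))
    return c * phi**b * x**(b - 1) * (1 - x)**(a - 1) * (1 + (phi - 1) * x)**(-(a + b))

def tpb_moment(k, a, b, phi):
    # kth moment as in (4).
    return phi**b * G(a + b) * G(b + k) / (G(b) * G(a + b + k)) \
        * hyp2f1(a + b, b + k, a + b + k, 1 - phi)

a, b, phi, k = 2.0, 1.5, 2.0, 2
total, _ = quad(tpb_pdf, 0, 1, args=(a, b, phi))
m_num, _ = quad(lambda x: x**k * tpb_pdf(x, a, b, phi), 0, 1)
print(round(total, 6), round(m_num, 6), round(tpb_moment(k, a, b, phi), 6))
# expect: 1.0, and the last two numbers agreeing
```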
52 [20] considered an alternative special case of the CCH distribution for the shrinkage coefficients, ρj , by letting ν = r = 1 in (6). [sent-80, score-0.318]
53 TPB and HB generalize the beta distribution in two distinct directions, with one practical advantage of the TPB being that it allows for a straightforward conjugate hierarchy leading to potentially substantial analytical and computational gains. [sent-82, score-0.592]
54 3 Now we move onto the hierarchical modeling of a flexible class of shrinkage priors for the estimation of a potentially sparse p-vector. [sent-83, score-0.823]
55 Now we define a shrinkage prior that is obtained by mixing a normal distribution over its scale parameter with the TPB distribution. [sent-88, score-0.622]
56 The TPB normal scale mixture representation for the distribution of a random variable θj is given by θj | ρj ∼ N(0, 1/ρj − 1), ρj ∼ TPB(a, b, φ), (7) where a > 0, b > 0 and φ > 0. [sent-90, score-0.242]
57 Note that the special case for a = b = 1/2 in Figure 1(a) gives the horseshoe prior. [sent-93, score-0.187]
58 For a fixed value of φ, smaller a values yield a density on θj that is more peaked at zero, while smaller values of b yield a density on θj that is heavier tailed. [sent-95, score-0.186]
59 That said, the density assigned in the neighborhood of θj = 0 increases while making the overall density lighter-tailed. [sent-97, score-0.22]
60 We next propose the equivalence of three hierarchical representations revealing a wide class of priors encompassing many of those mentioned earlier. [sent-98, score-0.728]
61 2) θj ∼ N(0, τj), π(τj) = {Γ(a + b)/(Γ(a)Γ(b))} φ^{−a} τj^{a−1} (1 + τj/φ)^{−(a+b)}, which implies that τj/φ ∼ β′(a, b), the inverted beta distribution with parameters a and b. [sent-101, score-0.23]
62 The equivalence given in Proposition 1 is significant as it makes the work in Section 4 possible under the TPB normal scale mixtures as well as further revealing connections among previously proposed shrinkage priors. [sent-102, score-0.713]
63 It provides a rich class of priors leading to great flexibility in terms of the induced shrinkage and makes it clear that this new class of priors can be considered simultaneous extensions to the work by [11, 12] and [6, 7]. [sent-103, score-1.088]
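A quick Monte Carlo check of this equivalence (my own illustration, with arbitrarily chosen a, b and φ): drawing τj through the inverted beta representation of Proposition 1 and transforming to ρj = 1/(1 + τj) should reproduce the TPB(a, b, φ) moments in (4).

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as G, hyp2f1

rng = np.random.default_rng(1)
a, b, phi = 3.0, 0.5, 2.0

x = stats.beta.rvs(a, b, size=500_000, random_state=rng)
tau = phi * x / (1.0 - x)            # tau_j / phi ~ inverted beta(a, b)
rho = 1.0 / (1.0 + tau)              # should follow TPB(a, b, phi)

def tpb_moment(k):
    return phi**b * G(a + b) * G(b + k) / (G(b) * G(a + b + k)) \
        * hyp2f1(a + b, b + k, a + b + k, 1 - phi)

print(rho.mean(), tpb_moment(1))     # should be close
print((rho**2).mean(), tpb_moment(2))
```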
64 It is worth mentioning that the hierarchical prior(s) given in Proposition 1 are different than the approach taken by [12] in how we handle the mixing. [sent-104, score-0.136]
65 In particular, the first hierarchy presented in Proposition 1 is identical to the NG prior up to the first stage mixing. [sent-105, score-0.336]
66 φ acts as a global shrinkage parameter in the hierarchy. [sent-107, score-0.27]
67 By doing so, they forfeit a complete conjugate structure and an explicit control over the tail behavior of π(θj ). [sent-109, score-0.241]
68 An interesting, yet expected, observation on Proposition 1 is that a half-Cauchy prior can be represented as a scale mixture of gamma distributions. [sent-114, score-0.342]
69 This makes sense as τj^{1/2} | λj has a half-normal distribution and the mixing distribution on the precision parameter is gamma with shape parameter 1/2. [sent-117, score-0.196]
70 [7] further place a half-Cauchy prior on φ^{1/2} to complete the hierarchy. [sent-118, score-0.153]
71 The aforementioned observation helps us formulate the complete hierarchy proposed in [7] in a conjugate manner. [sent-119, score-0.323]
72 Hence disregarding the different treatments of the higher-level hyper-parameters, we have shown that the class of priors given in Definition 1 unites the priors in [16, 11, 12, 6, 7] under one family and reveals their close connections through the equivalence of hierarchies given in Proposition 1. [sent-123, score-1.027]
73 The first hierarchy in Proposition 1 makes much of the work possible in the following sections. [sent-124, score-0.183]
74 We place the hierarchical prior given in Proposition 1 on each βj. [sent-166, score-0.249]
75 φ is used as a global shrinkage parameter common to all βj , and may be inferred using the data. [sent-169, score-0.27]
76 Thus we follow the hierarchy by letting φ ∼ G(1/2, ω), ω ∼ G(1/2, 1), which implies φ^{1/2} ∼ C+(0, 1) and is identical to what was used in [7] at this level of the hierarchy. [sent-170, score-0.225]
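This gamma-gamma construction of the half-Cauchy is easy to confirm by simulation. The sketch below is an illustration (not the authors' code) and assumes G(shape, rate) denotes a rate-parameterized gamma distribution; it compares draws of φ^{1/2} against C+(0, 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500_000

omega = rng.gamma(shape=0.5, scale=1.0, size=n)          # omega ~ G(1/2, 1)
phi = rng.gamma(shape=0.5, scale=1.0 / omega, size=n)    # phi | omega ~ G(1/2, omega)

# Kolmogorov-Smirnov distance to the standard half-Cauchy; should be close to zero.
print(stats.kstest(np.sqrt(phi), stats.halfcauchy.cdf).statistic)
```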
77 However, we do not believe at this level in the hierarchy the choice of the prior will have a huge impact on the results. [sent-171, score-0.336]
78 Although treating φ as unknown may be reasonable, when there exists some prior knowledge, it is appropriate to fix a φ value to reflect our prior belief in terms of underlying sparsity of the coefficient vector. [sent-172, score-0.349]
79 Note also that here we form the dependence on the error variance at a lower level of hierarchy rather than forming it in the prior of φ as done in [7]. [sent-174, score-0.336]
80 If we let a = b = 1/2, we will have formulated the hierarchical prior given in [7] in a completely conjugate manner. [sent-175, score-0.389]
81 Under a normal likelihood, an efficient Gibbs sampler may be obtained as the fully conditional posteriors can be extracted: β | y, X, σ^2, τ1, . . . , τp. [sent-177, score-0.135]
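A generic sketch of the conjugate block update for β is given below. This is my own implementation of the standard normal-linear-model update, assuming y ∼ N(Xβ, σ^2 I) and βj ∼ N(0, τj); the paper's exact hierarchy may place the error variance differently, so treat it as illustrative rather than the authors' sampler.

```python
import numpy as np

def sample_beta(y, X, sigma2, tau, rng):
    """One draw of beta from its full conditional N(m, A^{-1}),
    with A = X'X / sigma2 + diag(1/tau) and m = A^{-1} X'y / sigma2."""
    p = X.shape[1]
    A = X.T @ X / sigma2 + np.diag(1.0 / tau)       # posterior precision
    mean = np.linalg.solve(A, X.T @ y / sigma2)     # posterior mean
    L = np.linalg.cholesky(A)
    z = rng.standard_normal(p)
    return mean + np.linalg.solve(L.T, z)           # adds N(0, A^{-1}) noise

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.standard_normal(50)
print(sample_beta(y, X, sigma2=1.0, tau=np.ones(5), rng=rng))
```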
82 5 As an alternative to MCMC and Laplace approximations [23], a lower-bound on marginal likelihoods may be obtained via variational methods [17] yielding approximate posterior distributions on the model parameters. [sent-187, score-0.321]
83 This is accomplished in a similar manner to [8] by obtaining the joint MAP estimates of the error variance and the regression coefficients, having taken the expectation of τj^{−1} with respect to its conditional posterior distribution, using the second hierarchy given in Proposition 1. [sent-198, score-0.391]
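A rough sketch of this EM-style MAP idea is given below. It is my own simplification (not the authors' algorithm): σ^2 is held fixed, the E-step expectation E[τj^{−1} | βj] is computed by numerical integration on a log-spaced grid, and the M-step is then a ridge solve with adaptive weights.

```python
import numpy as np
from scipy.integrate import trapezoid

a, b, phi, sigma2 = 0.5, 0.5, 1.0, 1.0
grid = np.logspace(-12, 8, 4001)                 # integration grid for tau
log_grid = np.log(grid)

def e_inv_tau(beta_j):
    # p(tau | beta_j) is proportional to N(beta_j; 0, tau) * tau^{a-1} (1 + tau/phi)^{-(a+b)}.
    f = grid**(a - 1.5) * np.exp(-beta_j**2 / (2 * grid)) * (1 + grid / phi)**(-(a + b))
    num = trapezoid(f, log_grid)                 # integral of f(tau)/tau dtau
    den = trapezoid(f * grid, log_grid)          # integral of f(tau) dtau
    return num / den

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.r_[3.0, -2.0, np.zeros(p - 2)]
y = X @ beta_true + rng.standard_normal(n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start at least squares
for _ in range(30):
    w = np.array([e_inv_tau(bj) for bj in beta])                             # E-step
    beta = np.linalg.solve(X.T @ X / sigma2 + np.diag(w), X.T @ y / sigma2)  # M-step
print(np.round(beta, 2))                         # noise coefficients end up shrunk near zero
```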
84 b < 1 is a good choice as it will keep the tails of the marginal density on βj heavy. [sent-201, score-0.297]
85 Figure 2 (a) and (b) give the prior densities on ρj for b = 1/2, φ = 1 and a = {1/2, 1, 3/2} and the resulting marginal prior densities on βj . [sent-207, score-0.491]
86 These marginal densities are given by π(βj) = {1/(√2 π^{3/2})} e^{βj^2/2} Γ(0, βj^2/2) for a = 1/2; π(βj) = 1/√(2π) − (|βj|/2) e^{βj^2/2} + (βj/2) e^{βj^2/2} Erf(βj/√2) for a = 1; and π(βj) = {√2/π^{3/2}} {1 − (1/2) βj^2 e^{βj^2/2} Γ(0, βj^2/2)} for a = 3/2, where Erf(·) denotes the error function and Γ(0, ·) the upper incomplete gamma function. [sent-208, score-0.129]
87 Figure 2 clearly illustrates that while all three cases have very similar tail behavior, their behavior around the origin differ drastically. [sent-210, score-0.144]
88 5 Experiments Throughout this section we use the Jeffreys’ prior on the error precision by setting c0 = d0 = 0. [sent-218, score-0.153]
89 We obtain the estimate of the regression coefficients, β, using the variational Bayes procedure and measure the performance by the model error, which is calculated as (β∗ − β̂)′ C (β∗ − β̂). [sent-231, score-0.169]
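For concreteness, a minimal implementation of this metric, with C assumed here to be the covariance (Gram) matrix of the predictors:

```python
import numpy as np

def model_error(beta_hat, beta_star, C):
    # (beta* - beta_hat)' C (beta* - beta_hat)
    d = beta_star - beta_hat
    return float(d @ C @ d)

# e.g. with C estimated from a design matrix X:  model_error(b_hat, b_star, X.T @ X / len(X))
```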
90 The boxplots in Figure 3(a) and (b) correspond to different (a, b, φ) values where C+ signifies that φ is treated as unknown with a half-Cauchy prior as given earlier in Section 4. [sent-233, score-0.153]
91 It is worth mentioning that we attain a clearly superior performance compared to the lasso, particularly in the second case, despite the fact that the estimator resulting from the variational Bayes procedure is not a thresholding rule. [sent-235, score-0.205]
92 This is due to the fact that Case 2 involves a much sparser underlying setup on average than Case 1 and that the lighter tails attained by setting b = 1 lead to stronger shrinkage. [sent-237, score-0.166]
93 99 placing much more density in the neighborhood of ρj = 1 (total shrinkage). [sent-250, score-0.127]
94 Since we are adjusting the global shrinkage parameter, φ, a priori, and it is chosen such that P(ρj > 0. [sent-255, score-0.305]
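This kind of a priori calibration can be carried out numerically. The sketch below is illustrative (the threshold, target probability and (a, b) values are my own choices, not taken from the paper): it solves for the φ that places a desired amount of TPB prior mass on strong shrinkage.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.special import gamma as G

def tpb_pdf(x, a, b, phi):
    c = G(a + b) / (G(a) * G(b))
    return c * phi**b * x**(b - 1) * (1 - x)**(a - 1) * (1 + (phi - 1) * x)**(-(a + b))

def prob_rho_above(t, a, b, phi):
    val, _ = quad(tpb_pdf, t, 1, args=(a, b, phi))
    return val

a, b, t, target = 1.0, 0.5, 0.5, 0.9
phi_star = brentq(lambda phi: prob_rho_above(t, a, b, phi) - target, 1e-3, 1e2)
print(phi_star, prob_rho_above(t, a, b, phi_star))   # calibrated phi and the achieved mass
```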
95 Figure 4 gives the posterior means attained by sampling and the variational approximation. [sent-263, score-0.202]
96 6 Discussion We conclude that the proposed hierarchical prior formulation constitutes a useful encompassing framework in understanding the behavior of different scale mixtures of normals and connecting them under a broader family of hierarchical priors. [sent-294, score-0.734]
97 While ℓ1 regularization, namely the lasso, arising from a double exponential prior in the Bayesian framework, yields certain computational advantages, it demonstrates much inferior estimation performance relative to the more carefully formulated scale mixtures of normals. [sent-295, score-0.565]
98 The proposed equivalence of the hierarchies in Proposition 1 makes computation much easier for the TPB scale mixtures of normals. [sent-296, score-0.359]
99 These choices guarantee that the resulting prior has a kink at zero, which is essential for sparse estimation, and has heavy tails to avoid unnecessary bias in large signals (recall that a choice of b = 1/2 will yield Cauchy-like tails). [sent-298, score-0.458]
100 A robust generalized Bayes estimator and confidence region for a multivariate normal mean. [sent-323, score-0.188]
wordName wordTfidf (topN-words)
[('priors', 0.367), ('tpb', 0.364), ('shrinkage', 0.234), ('beta', 0.188), ('horseshoe', 0.187), ('hierarchy', 0.183), ('prior', 0.153), ('double', 0.141), ('conjugate', 0.14), ('tails', 0.131), ('variational', 0.125), ('proposition', 0.122), ('mixtures', 0.116), ('pb', 0.115), ('hypergeometric', 0.098), ('hierarchical', 0.096), ('equivalence', 0.096), ('density', 0.093), ('cch', 0.091), ('pbn', 0.091), ('signals', 0.087), ('normal', 0.086), ('duke', 0.083), ('bayes', 0.083), ('gig', 0.08), ('polson', 0.08), ('hierarchies', 0.077), ('posterior', 0.077), ('gamma', 0.075), ('encompassing', 0.074), ('durham', 0.074), ('normals', 0.074), ('marginal', 0.073), ('scale', 0.07), ('bayesian', 0.068), ('predictors', 0.064), ('generalized', 0.062), ('gibbs', 0.062), ('armagan', 0.061), ('artin', 0.061), ('clyde', 0.061), ('neg', 0.061), ('strawdermanberger', 0.061), ('tpbn', 0.061), ('revealing', 0.06), ('sb', 0.06), ('appealing', 0.06), ('densities', 0.056), ('behavior', 0.055), ('gh', 0.053), ('carvalho', 0.053), ('erf', 0.053), ('lasso', 0.053), ('dunson', 0.053), ('sparse', 0.052), ('connections', 0.051), ('massive', 0.05), ('rich', 0.05), ('nc', 0.049), ('posteriors', 0.049), ('denotes', 0.049), ('penalties', 0.047), ('cients', 0.047), ('exponential', 0.046), ('approximations', 0.046), ('tail', 0.046), ('manner', 0.045), ('association', 0.044), ('pareto', 0.044), ('regression', 0.044), ('mixture', 0.044), ('posteriori', 0.043), ('sparsity', 0.043), ('coef', 0.043), ('origin', 0.043), ('distribution', 0.042), ('letting', 0.042), ('ka', 0.042), ('xb', 0.04), ('hb', 0.04), ('mentioning', 0.04), ('estimator', 0.04), ('estimation', 0.039), ('exible', 0.039), ('recommend', 0.039), ('analytical', 0.039), ('zero', 0.038), ('induce', 0.038), ('encompasses', 0.037), ('mixing', 0.037), ('priori', 0.036), ('global', 0.036), ('class', 0.035), ('environmental', 0.035), ('adjusting', 0.035), ('towards', 0.035), ('leads', 0.035), ('health', 0.034), ('neighborhood', 0.034), ('reveals', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000014 104 nips-2011-Generalized Beta Mixtures of Gaussians
Author: Artin Armagan, Merlise Clyde, David B. Dunson
Abstract: In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. 1
2 0.19147687 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
3 0.11506099 221 nips-2011-Priors over Recurrent Continuous Time Processes
Author: Ardavan Saeedi, Alexandre Bouchard-côté
Abstract: We introduce the Gamma-Exponential Process (GEP), a prior over a large family of continuous time stochastic processes. A hierarchical version of this prior (HGEP; the Hierarchical GEP) yields a useful model for analyzing complex time series. Models based on HGEPs display many attractive properties: conjugacy, exchangeability and closed-form predictive distribution for the waiting times, and exact Gibbs updates for the time scale parameters. After establishing these properties, we show how posterior inference can be carried efficiently using Particle MCMC methods [1]. This yields a MCMC algorithm that can resample entire sequences atomically while avoiding the complications of introducing slice and stick auxiliary variables of the beam sampler [2]. We applied our model to the problem of estimating the disease progression in multiple sclerosis [3], and to RNA evolutionary modeling [4]. In both domains, we found that our model outperformed the standard rate matrix estimation approach. 1
4 0.098097913 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
5 0.095833562 285 nips-2011-The Kernel Beta Process
Author: Lu Ren, Yingjian Wang, Lawrence Carin, David B. Dunson
Abstract: A new L´ vy process prior is proposed for an uncountable collection of covariatee dependent feature-learning measures; the model is called the kernel beta process (KBP). Available covariates are handled efficiently via the kernel construction, with covariates assumed observed with each data sample (“customer”), and latent covariates learned for each feature (“dish”). Each customer selects dishes from an infinite buffet, in a manner analogous to the beta process, with the added constraint that a customer first decides probabilistically whether to “consider” a dish, based on the distance in covariate space between the customer and dish. If a customer does consider a particular dish, that dish is then selected probabilistically as in the beta process. The beta process is recovered as a limiting case of the KBP. An efficient Gibbs sampler is developed for computations, and state-of-the-art results are presented for image processing and music analysis tasks. 1
6 0.092886321 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
7 0.08480455 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
8 0.084650733 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
9 0.080608964 301 nips-2011-Variational Gaussian Process Dynamical Systems
10 0.080273598 156 nips-2011-Learning to Learn with Compound HD Models
11 0.080192953 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
12 0.07798098 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
13 0.076361075 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
14 0.075284526 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
15 0.072874196 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data
16 0.072305381 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
17 0.071009926 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
18 0.069677323 131 nips-2011-Inference in continuous-time change-point models
19 0.069296136 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
20 0.067832723 55 nips-2011-Collective Graphical Models
topicId topicWeight
[(0, 0.211), (1, 0.055), (2, 0.018), (3, -0.062), (4, -0.073), (5, -0.116), (6, 0.066), (7, 0.02), (8, 0.043), (9, 0.106), (10, -0.055), (11, -0.075), (12, 0.032), (13, -0.07), (14, -0.081), (15, 0.054), (16, 0.044), (17, -0.089), (18, 0.064), (19, -0.047), (20, 0.133), (21, 0.051), (22, 0.034), (23, 0.047), (24, -0.027), (25, 0.017), (26, 0.125), (27, 0.064), (28, -0.012), (29, 0.045), (30, 0.097), (31, 0.072), (32, -0.1), (33, -0.017), (34, 0.049), (35, -0.034), (36, -0.028), (37, 0.033), (38, -0.006), (39, 0.02), (40, -0.094), (41, -0.009), (42, -0.099), (43, 0.024), (44, -0.041), (45, 0.001), (46, 0.002), (47, -0.009), (48, 0.039), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.96798587 104 nips-2011-Generalized Beta Mixtures of Gaussians
Author: Artin Armagan, Merlise Clyde, David B. Dunson
Abstract: In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. 1
2 0.74776858 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
Author: Miguel Lázaro-gredilla, Michalis K. Titsias
Abstract: We introduce a variational Bayesian inference algorithm which can be widely applied to sparse linear models. The algorithm is based on the spike and slab prior which, from a Bayesian perspective, is the golden standard for sparse inference. We apply the method to a general multi-task and multiple kernel learning model in which a common set of Gaussian process functions is linearly combined with task-specific sparse weights, thus inducing relation between tasks. This model unifies several sparse linear models, such as generalized linear models, sparse factor analysis and matrix factorization with missing values, so that the variational algorithm can be applied to all these cases. We demonstrate our approach in multioutput Gaussian process regression, multi-class classification, image processing applications and collaborative filtering. 1
3 0.74537528 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
4 0.73368949 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
Author: Onur Dikmen, Cédric Févotte
Abstract: In this paper we describe a maximum likelihood approach for dictionary learning in the multiplicative exponential noise model. This model is prevalent in audio signal processing where it underlies a generative composite model of the power spectrogram. Maximum joint likelihood estimation of the dictionary and expansion coefficients leads to a nonnegative matrix factorization problem where the Itakura-Saito divergence is used. The optimality of this approach is in question because the number of parameters (which include the expansion coefficients) grows with the number of observations. In this paper we describe a variational procedure for optimization of the marginal likelihood, i.e., the likelihood of the dictionary where the activation coefficients have been integrated out (given a specific prior). We compare the output of both maximum joint likelihood estimation (i.e., standard Itakura-Saito NMF) and maximum marginal likelihood estimation (MMLE) on real and synthetical datasets. The MMLE approach is shown to embed automatic model order selection, akin to automatic relevance determination.
5 0.65702736 221 nips-2011-Priors over Recurrent Continuous Time Processes
Author: Ardavan Saeedi, Alexandre Bouchard-côté
Abstract: We introduce the Gamma-Exponential Process (GEP), a prior over a large family of continuous time stochastic processes. A hierarchical version of this prior (HGEP; the Hierarchical GEP) yields a useful model for analyzing complex time series. Models based on HGEPs display many attractive properties: conjugacy, exchangeability and closed-form predictive distribution for the waiting times, and exact Gibbs updates for the time scale parameters. After establishing these properties, we show how posterior inference can be carried efficiently using Particle MCMC methods [1]. This yields a MCMC algorithm that can resample entire sequences atomically while avoiding the complications of introducing slice and stick auxiliary variables of the beam sampler [2]. We applied our model to the problem of estimating the disease progression in multiple sclerosis [3], and to RNA evolutionary modeling [4]. In both domains, we found that our model outperformed the standard rate matrix estimation approach. 1
6 0.63815773 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
7 0.63771814 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
8 0.63609374 285 nips-2011-The Kernel Beta Process
9 0.63600016 14 nips-2011-A concave regularization technique for sparse mixture models
10 0.61207902 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
11 0.57569546 131 nips-2011-Inference in continuous-time change-point models
12 0.55859768 217 nips-2011-Practical Variational Inference for Neural Networks
13 0.55303049 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction
14 0.5497365 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
15 0.54688412 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
16 0.52980781 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
17 0.52517265 55 nips-2011-Collective Graphical Models
18 0.51801556 301 nips-2011-Variational Gaussian Process Dynamical Systems
19 0.51440686 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
20 0.49801543 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
topicId topicWeight
[(0, 0.016), (4, 0.026), (20, 0.024), (26, 0.028), (31, 0.063), (33, 0.01), (43, 0.066), (45, 0.077), (57, 0.025), (74, 0.546), (83, 0.031), (99, 0.021)]
simIndex simValue paperId paperTitle
1 0.95758986 218 nips-2011-Predicting Dynamic Difficulty
Author: Olana Missura, Thomas Gärtner
Abstract: Motivated by applications in electronic games as well as teaching systems, we investigate the problem of dynamic difficulty adjustment. The task here is to repeatedly find a game difficulty setting that is neither ‘too easy’ and bores the player, nor ‘too difficult’ and overburdens the player. The contributions of this paper are (i) the formulation of difficulty adjustment as an online learning problem on partially ordered sets, (ii) an exponential update algorithm for dynamic difficulty adjustment, (iii) a bound on the number of wrong difficulty settings relative to the best static setting chosen in hindsight, and (iv) an empirical investigation of the algorithm when playing against adversaries. 1
same-paper 2 0.94560111 104 nips-2011-Generalized Beta Mixtures of Gaussians
Author: Artin Armagan, Merlise Clyde, David B. Dunson
Abstract: In recent years, a rich variety of shrinkage priors have been proposed that have great promise in addressing massive regression problems. In general, these new priors can be expressed as scale mixtures of normals, but have more complex forms and better properties than traditional Cauchy and double exponential priors. We first propose a new class of normal scale mixtures through a novel generalized beta distribution that encompasses many interesting priors as special cases. This encompassing framework should prove useful in comparing competing priors, considering properties and revealing close connections. We then develop a class of variational Bayes approximations through the new hierarchy presented that will scale more efficiently to the types of truly massive data sets that are now encountered routinely. 1
3 0.93846506 259 nips-2011-Sparse Estimation with Structured Dictionaries
Author: David P. Wipf
Abstract: In the vast majority of recent work on sparse estimation algorithms, performance has been evaluated using ideal or quasi-ideal dictionaries (e.g., random Gaussian or Fourier) characterized by unit ℓ2 norm, incoherent columns or features. But in reality, these types of dictionaries represent only a subset of the dictionaries that are actually used in practice (largely restricted to idealized compressive sensing applications). In contrast, herein sparse estimation is considered in the context of structured dictionaries possibly exhibiting high coherence between arbitrary groups of columns and/or rows. Sparse penalized regression models are analyzed with the purpose of finding, to the extent possible, regimes of dictionary invariant performance. In particular, a Type II Bayesian estimator with a dictionarydependent sparsity penalty is shown to have a number of desirable invariance properties leading to provable advantages over more conventional penalties such as the ℓ1 norm, especially in areas where existing theoretical recovery guarantees no longer hold. This can translate into improved performance in applications such as model selection with correlated features, source localization, and compressive sensing with constrained measurement directions. 1
4 0.82585472 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies
Author: Viren Jain, Srinivas C. Turaga, K Briggman, Moritz N. Helmstaedter, Winfried Denk, H. S. Seung
Abstract: An agglomerative clustering algorithm merges the most similar pair of clusters at every iteration. The function that evaluates similarity is traditionally handdesigned, but there has been recent interest in supervised or semisupervised settings in which ground-truth clustered data is available for training. Here we show how to train a similarity function by regarding it as the action-value function of a reinforcement learning problem. We apply this general method to segment images by clustering superpixels, an application that we call Learning to Agglomerate Superpixel Hierarchies (LASH). When applied to a challenging dataset of brain images from serial electron microscopy, LASH dramatically improved segmentation accuracy when clustering supervoxels generated by state of the boundary detection algorithms. The naive strategy of directly training only supervoxel similarities and applying single linkage clustering produced less improvement. 1
5 0.74983114 68 nips-2011-Demixed Principal Component Analysis
Author: Wieland Brendel, Ranulfo Romo, Christian K. Machens
Abstract: In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response. 1
6 0.55687696 196 nips-2011-On Strategy Stitching in Large Extensive Form Multiplayer Games
7 0.55104655 276 nips-2011-Structured sparse coding via lateral inhibition
8 0.52732623 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
9 0.51931787 285 nips-2011-The Kernel Beta Process
10 0.51553476 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
11 0.50966638 158 nips-2011-Learning unbelievable probabilities
12 0.50309032 265 nips-2011-Sparse recovery by thresholded non-negative least squares
13 0.50124639 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
14 0.49945334 258 nips-2011-Sparse Bayesian Multi-Task Learning
15 0.49206558 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data
16 0.49192578 186 nips-2011-Noise Thresholds for Spectral Clustering
17 0.48500797 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
18 0.48284575 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
20 0.4802731 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data