nips nips2011 nips2011-188 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David A. Knowles, Tom Minka
Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP), which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability.
Reference: text
sentIndex sentText sentNum sentScore
1 We demonstrate NCVMP on logistic binary and multinomial regression. [sent-6, score-0.209]
2 In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability. [sent-7, score-0.761]
3 1 Introduction Variational Message Passing [20] is a message passing implementation of the mean-field approximation [1, 2], also known as variational Bayes (VB). [sent-8, score-0.519]
4 In conjugate exponential models this is avoided due to the closure of exponential family distributions under multiplication. [sent-11, score-0.201]
5 The first is how to fit the variational parameters when the VB free form updates are not viable. [sent-14, score-0.27]
6 We contribute to this line of work by proposing and evaluating a new bound for the useful softmax factor, which is tighter than other commonly used bounds whilst maintaining computational tractability. [sent-18, score-0.377]
7 We also demonstrate, in agreement with [19] and [14], that for univariate expectations such as required for logistic regression, carefully designed quadrature methods can be effective. [sent-19, score-0.307]
8 To maintain modularity one is effectively constrained to use exponential family bounds (e.g. quadratic bounds). [sent-21, score-0.184]
9 We propose a novel message passing algorithm, which we call Non-conjugate Variational Message Passing (NCVMP), which generalises VMP and gives a recipe for calculating messages out of any factor. [sent-29, score-0.411]
10 Section 5 describes the binary logistic and multinomial softmax regression models, and implementation options with and without NCVMP. [sent-34, score-0.424]
11 2 Mean-field approximation: Our aim is to approximate some model p(x), represented as a factor graph p(x) = ∏_a f_a(x_a), where factor f_a is a function of all x ∈ x_a. [sent-36, score-0.443]
12 The mean-field approximation assumes a fully-factorised variational posterior q(x) = ∏_i q_i(x_i), where q_i(x_i) is an approximation to the marginal distribution of x_i (note however x_i might be vector valued, e. [sent-37, score-0.997]
13 The variational approximation q(x) is chosen to minimise the Kullback-Leibler divergence KL(q||p), given by KL(q||p) = ∫ q(x) log [q(x)/p(x)] dx = −H[q(x)] − ∫ q(x) log p(x) dx. [sent-40, score-0.433]
14 It can be shown [1] that if the functions q_i(x_i) are unconstrained then minimising this functional can be achieved by coordinate descent, setting q_i(x_i) ∝ exp ⟨log p(x)⟩_{¬q_i(x_i)} iteratively for each i, where ⟨·⟩_{¬q_i(x_i)} denotes expectation with respect to all factors of q except q_i(x_i). [sent-42, score-0.464]
15 The variational distribution q(x) factorises into approximate factors f̃_a(x_a). [sent-47, score-0.426]
16 f̃_a(x_a) = ∏_{x_i ∈ x_a} m_{a→i}(x_i), where the message from factor a to variable i is m_{a→i}(x_i) = exp ⟨log f_a(x_a)⟩_{¬q_i(x_i)}. [sent-50, score-0.657]
17 The message from variable i to factor a is the current variational posterior of x_i, denoted q_i(x_i), i.e. [sent-51, score-0.828]
18 m_{i→a}(x_i) = q_i(x_i) = ∏_{a∈N(i)} m_{a→i}(x_i), where N(i) is the set of factors connected to variable i. [sent-53, score-0.222]
19 For conjugate-exponential models the messages to a particular variable x_i will all be in the same exponential family. [sent-54, score-0.266]
20 Thus calculating qi (xi ) simply involves summing sufficient statistics. [sent-55, score-0.21]
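In natural-parameter form this summation is trivial: each incoming exponential-family message contributes its natural parameters, and the variational posterior is just their sum. A minimal sketch for a Gaussian variable receiving two conjugate messages (all values are hypothetical):

```python
import numpy as np

def gaussian_natural(mean, var):
    """Natural parameters (mean/var, -1/(2*var)) of a Gaussian message."""
    return np.array([mean / var, -0.5 / var])

def gaussian_moments(theta):
    """Recover (mean, var) from natural parameters."""
    var = -0.5 / theta[1]
    return theta[0] * var, var

# Two incoming Gaussian messages to the same variable: a prior factor
# and a conjugate likelihood factor (illustrative values).
msg_prior = gaussian_natural(mean=0.0, var=10.0)
msg_lik   = gaussian_natural(mean=2.0, var=1.0)

# VMP posterior update: sum the natural parameters of all incoming messages.
theta_q = msg_prior + msg_lik
print(gaussian_moments(theta_q))   # posterior mean ~1.82, variance ~0.91
```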
21 If, however, our model is not conjugate-exponential, there will be a variable xi which receives incoming messages which are in different exponential families, or which are not even exponential family distributions at all. [sent-56, score-0.365]
22 Thus qi (xi ) will be some more complex distribution. [sent-57, score-0.189]
23 allows modular implementation and combining of deterministic and stochastic factors; (2) NCVMP ensures the gradients of the approximate KL divergence implied by the message match the gradients of the true KL. [sent-67, score-0.377]
24 We use the following notation: variable xi has current variational posterior qi (xi ; θi ), where θi is the vector of natural parameters of the exponential family distribution qi . [sent-69, score-0.865]
25 Each factor f_a which is a neighbour of x_i sends a message m_{a→i}(x_i; φ_{a→i}) to x_i, where m_{a→i} is in the same exponential family as q_i, i.e. [sent-70, score-0.844]
26 m_{a→i}(x_i; φ) = exp(φ^T u(x_i) − κ(φ)) and q_i(x_i; θ) = exp(θ^T u(x_i) − κ(θ)), where u(·) are the sufficient statistics and κ(·) is the log partition function. [sent-72, score-0.252]
27 It is straightforward to show that C(θ) = cov(u(x)|θ) so if the exponential family qi is identifiable, C will be symmetric positive definite, and therefore invertible. [sent-76, score-0.26]
28 The factor f_a contributes a term S_a(θ_i) = ∫ q_i(x_i; θ_i) ⟨log f_a(x)⟩_{¬q_i(x_i)} dx_i to the KL divergence, where we have only made the dependence on θ_i explicit: this term is also a function of the variational parameters of the other variables neighbouring f_a. [sent-77, score-0.98]
29 Firstly define the function S̃_a(θ; φ) := ∫ q_i(x_i; θ) log m_{a→i}(x_i; φ) dx_i, (2) which is an approximation to the function S_a(θ). [sent-83, score-0.273]
30 The update in Algorithm 1, Line 7 for θ_i is minimising an approximate local KL divergence for x_i: θ_i := arg min_θ { −H[q_i(x_i; θ)] − Σ_{a∈N(i)} S̃_a(θ; φ_{a→i}) }, (5) where H[·] denotes the entropy. [sent-88, score-0.212]
31 The update at variable xi for θi combines all these approximations from factors involving xi to get an approximation to the local KL, and then moves θi to the minimum of this approximation. [sent-97, score-0.288]
32 If ⟨log f_a(x)⟩_{¬q_i(x_i)}, as a function of x_i, can be written µ^T u(x_i) − c where c is a constant, then the NCVMP message m_{a→i}(x_i; φ_{a→i}) will be the standard VMP message m_{a→i}(x_i; µ). [sent-100, score-0.669]
33 To see this note that ⟨log f_a(x)⟩_{¬q_i(x_i)} = µ^T u(x_i) − c ⇒ S_a(θ) = µ^T ⟨u(x_i)⟩_θ − c, where µ is the expected natural statistic under the messages from the variables connected to f_a other than x_i; since ∂⟨u(x_i)⟩_θ/∂θ = C(θ), we have φ_{a→i} := C(θ)^{-1} ∂S_a(θ)/∂θ = µ. [sent-102, score-0.542]
34 The update for θ_i in Algorithm 1, Line 7 is the same as for VMP, and Theorem 2 shows that for conjugate factors the messages sent to the variables are the same as for VMP. [sent-104, score-0.253]
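As a quick numerical sanity check of this conjugate case, the sketch below takes a univariate Gaussian q, a conjugate Gaussian observation factor, evaluates S_a(θ) = ⟨log f_a⟩_q as a function of the natural parameters, and verifies that C(θ)^{-1} ∂S_a/∂θ recovers the factor's own natural-parameter contribution µ, i.e. the standard VMP message. The parameter values are arbitrary and the finite-difference gradient is only for illustration.

```python
import numpy as np

# Sufficient statistics u(x) = (x, x^2); natural parameters theta = (m/v, -1/(2v)).
def moments(theta):
    v = -0.5 / theta[1]
    return theta[0] * v, v

def expected_u(theta):
    m, v = moments(theta)
    return np.array([m, v + m**2])             # <x>, <x^2>

def cov_u(theta):
    m, v = moments(theta)                      # C(theta) = cov(u(x)) for a Gaussian
    return np.array([[v,       2*m*v],
                     [2*m*v,   2*v**2 + 4*m**2*v]])

# Conjugate factor f(x) = N(y; x, s2): log f = (y/s2) x - x^2/(2 s2) + const,
# so its natural-parameter contribution is mu = (y/s2, -1/(2 s2)).
y, s2 = 1.5, 0.5
mu = np.array([y / s2, -0.5 / s2])

def S(theta):                                  # S_a(theta) = <log f>_q up to a constant
    return mu @ expected_u(theta)

theta = np.array([0.3, -0.25])                 # current q (arbitrary choice)
eps = 1e-6
grad = np.array([(S(theta + eps*np.eye(2)[i]) - S(theta - eps*np.eye(2)[i])) / (2*eps)
                 for i in range(2)])

phi = np.linalg.solve(cov_u(theta), grad)      # NCVMP message: C(theta)^{-1} dS/dtheta
print(phi, mu)                                 # the two should agree up to numerical error
```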
35 NCVMP can alternatively be derived by assuming the incoming messages to xi are fixed apart from ma→i (xi ; φ) and calculating a fixed point update for ma→i (xi ; φ). [sent-106, score-0.274]
36 Gaussian variational distribution: Here we describe the NCVMP updates for a Gaussian variational distribution q(x) = N(x; m, v) and approximate factor f̃(x; m_f, v_f). [sent-110, score-0.719]
37 For a Gaussian q the message parameters follow from the derivatives of S_a: 1/v_f = −2 ∂S_a(m, v)/∂v and m_f/v_f = m/v_f + ∂S_a(m, v)/∂m. (7) 5 Logistic regression models: We illustrate NCVMP on Bayesian binary and multinomial logistic regression. [sent-113, score-0.463]
38 5.1 Binary logistic regression: For logistic regression we require the following factor: f(s, x) = σ(x)^s (1 − σ(x))^{1−s}, where we assume s is observed. [sent-119, score-0.346]
39 There are two problems: we cannot analytically compute expectations with respect to x, and we need to optimise the variational parameters. [sent-121, score-0.362]
40 An alternative proposed in [18] is to bound the integral: ⟨log f(s, x)⟩_q ≥ sm − a²v/2 − log(1 + e^{m+(1−2a)v/2}), (9) where m, v are the mean and variance of q(x) and a is a variational parameter which can be optimised using the fixed point iteration a := σ(m + (1−2a)v/2). [sent-125, score-0.403]
41 This bound is not conjugate to a Gaussian, but we can calculate the NCVMP message, which has parameters 1/v_f = a(1−a) and m_f/v_f = m/v_f + s − a, where we have assumed a has been optimised. [sent-127, score-0.288]
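A minimal sketch of this bound and the resulting message; the fixed-point iteration for a is written with the sign that is consistent with the message parameters above, and the test values are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tilted_bound(m, v, s, a):
    """Lower bound (9) on <log sigma(x)^s (1 - sigma(x))^(1-s)> for x ~ N(m, v)."""
    return s*m - 0.5*a**2*v - np.log1p(np.exp(m + (1 - 2*a)*v/2))

def optimise_a(m, v, iters=100):
    a = 0.5
    for _ in range(iters):                  # fixed point a = sigma(m + (1-2a)v/2)
        a = sigmoid(m + (1 - 2*a)*v/2)
    return a

def tilted_message(m, v, s):
    """NCVMP message (m_f, v_f) for the tilted bound, assuming a has been optimised."""
    a = optimise_a(m, v)
    prec_f = a*(1 - a)                      # 1/v_f = a(1 - a)
    mtp_f = m*prec_f + s - a                # m_f/v_f = m/v_f + s - a
    return mtp_f/prec_f, 1.0/prec_f

m, v, s = 0.5, 2.0, 1
print(tilted_bound(m, v, s, optimise_a(m, v)), tilted_message(m, v, s))
```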
42 The NCVMP message based on the exact expectations (evaluated by quadrature) has parameters 1/v_f = (⟨xσ(x)⟩_q − m⟨σ(x)⟩_q)/v and m_f/v_f = m/v_f + s − ⟨σ(x)⟩_q. [sent-129, score-0.416]
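Where the expectations are evaluated numerically, a short Gauss-Hermite sketch suffices; the node count and input values below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gauss_hermite_expectations(m, v, n=32):
    """<sigma(x)>_q and <x sigma(x)>_q for q = N(m, v) by Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n)
    x = m + np.sqrt(2*v) * t
    w = w / np.sqrt(np.pi)
    return np.sum(w * sigmoid(x)), np.sum(w * x * sigmoid(x))

def quadrature_message(m, v, s):
    Es, Exs = gauss_hermite_expectations(m, v)
    prec_f = (Exs - m*Es) / v               # 1/v_f = (<x sigma(x)> - m <sigma(x)>)/v
    mtp_f = m*prec_f + s - Es               # m_f/v_f = m/v_f + s - <sigma(x)>
    return mtp_f/prec_f, 1.0/prec_f

print(quadrature_message(m=0.5, v=2.0, s=1))
```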
43 5.2 Multinomial softmax regression: Consider the softmax factor f(x, p) = ∏_{k=1}^K δ(p_k − σ_k(x)), where the x_k are real valued and p is a probability vector with current Dirichlet variational posterior q(p) = Dir(p; d). [sent-132, score-0.739]
44 Let the incoming message from x be q(x) = ∏_{k=1}^K N(x_k; m_k, v_k). [sent-136, score-0.248]
45 The approach used by [3] is a linear Taylor expansion of the log, which is accurate for small variances v: ⟨log Σ_i e^{x_i}⟩ ≤ log ⟨Σ_i e^{x_i}⟩ = log Σ_i e^{m_i + v_i/2}, (10) which we refer to as the “log” bound. [sent-138, score-0.318]
46 Another bound was proposed by [5]: log Σ_{k=1}^K e^{x_k} ≤ a + Σ_{k=1}^K log(1 + e^{x_k − a}), (11) where a is a new variational parameter. [sent-140, score-0.486]
47 Combining with (8) we get the “quadratic bound” on the integrand, with K + 1 variational parameters. [sent-141, score-0.24]
48 Inspired by the univariate “tilted” bound in Equation 9 we propose the multivariate tilted bound: ⟨log Σ_i e^{x_i}⟩ ≤ (1/2) Σ_j a_j² v_j + log Σ_i e^{m_i + (1−2a_i)v_i/2}. (12) Setting a_k = 0 for all k we recover Equation 10 (hence this is the “tilted” version). [sent-144, score-0.793]
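As a rough check of how these bounds behave, the sketch below evaluates the log bound (10) and the multivariate tilted bound (12) for a random Gaussian q and compares them with a Monte Carlo estimate of ⟨log Σ_k e^{x_k}⟩. The quadratic bound (11) is left out because it also needs the univariate quadratic bound (8), which is not reproduced here, and the damped fixed-point update used to tighten the a_k is a convenient choice rather than the paper's optimiser.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
K = 10
m = rng.normal(size=K)
v = np.full(K, 1.0)

# Monte Carlo estimate of E[log sum_k exp(x_k)], x_k ~ N(m_k, v_k) independent
x = rng.normal(m, np.sqrt(v), size=(200000, K))
truth = logsumexp(x, axis=1).mean()

# Log bound (10): E[log sum exp x] <= log sum_k exp(m_k + v_k/2)
log_bound = logsumexp(m + v/2)

# Tilted bound (12): <= 0.5 sum_j a_j^2 v_j + log sum_k exp(m_k + (1 - 2 a_k) v_k / 2)
def tilted(a):
    return 0.5*np.sum(a**2 * v) + logsumexp(m + (1 - 2*a)*v/2)

# tighten the a_k with damped fixed-point sweeps a = softmax(m + (1-2a)v/2)
a = np.full(K, 1.0/K)
for _ in range(200):
    z = m + (1 - 2*a)*v/2
    a = 0.5*(a + np.exp(z - logsumexp(z)))

print(truth, log_bound, tilted(a))   # the tilted bound should be the tighter upper bound
```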
49 For the softmax factor quadrature is not viable because of the high dimensionality of the integrals. [sent-147, score-0.299]
50 From Equation 7 the NCVMP messages using the tilted bound have natural parameters 1/v_kf and m_kf/v_kf given in terms of the Dirichlet parameters d (expressions omitted). [sent-148, score-0.654]
51 As an alternative we suggest choosing whether to send the message resulting from the quadratic bound or tilted bound depending on which is currently the tightest, referred to as the “adaptive” method. [sent-151, score-0.912]
52 Finally we consider a simple Taylor series expansion of the integrand around the mean of x, denoted “Taylor”, and the multivariate quadratic bound of [4], denoted “Bohning” (see the Supplementary material for details). [sent-152, score-0.274]
53 We will see that for both binary logistic and multinomial softmax models achieving conjugate updates by being constrained to quadratic bounds is sub-optimal, in terms of estimates of variational parameters, marginal likelihood estimation, and predictive performance. [sent-154, score-1.016]
54 NCVMP gives the freedom to choose a wider class of bounds, or even use efficient quadrature methods in the univariate case, while maintaining simplicity and modularity. [sent-155, score-0.219]
55 6.1 The logistic factor: We first test the logistic factor methods of Section 5. [sent-157, score-0.298]
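A self-contained sketch of a test in this spirit, assuming a single Gaussian prior π(x) = N(µ0, v0) multiplied by one logistic factor σ(x) (i.e. s = 1): it runs the NCVMP updates with the quadrature message from Section 5.1 and compares the resulting Gaussian with the posterior moments obtained by direct numerical integration. All settings (prior parameters, quadrature order, iteration count) are illustrative, and damping as discussed later could be added if the plain iteration oscillates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_message(m, v, n=40):
    """NCVMP message (precision, mean*precision) for the factor sigma(x), s = 1."""
    t, w = np.polynomial.hermite.hermgauss(n)
    x = m + np.sqrt(2*v) * t
    w = w / np.sqrt(np.pi)
    Es, Exs = np.sum(w*sigmoid(x)), np.sum(w*x*sigmoid(x))
    prec_f = (Exs - m*Es) / v
    return prec_f, m*prec_f + 1.0 - Es

mu0, v0 = -2.0, 4.0                       # Gaussian prior (illustrative)
m, v = mu0, v0                            # initialise q(x) = N(m, v) at the prior
for _ in range(100):                      # NCVMP iterations
    prec_f, mtp_f = logistic_message(m, v)
    prec = 1.0/v0 + prec_f                # sum natural parameters of the two messages
    m, v = (mu0/v0 + mtp_f)/prec, 1.0/prec

# reference posterior moments by numerical integration of N(x; mu0, v0) * sigma(x)
xs = np.linspace(mu0 - 12*np.sqrt(v0), mu0 + 12*np.sqrt(v0), 20001)
p = np.exp(-(xs - mu0)**2/(2*v0)) * sigmoid(xs)
p /= p.sum()
true_m = np.sum(p*xs)
true_v = np.sum(p*(xs - true_m)**2)

print("NCVMP:", m, v)
print("exact:", true_m, true_v)           # the Gaussian VB fit typically underestimates the variance slightly
```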
56 The quadratic bound has the largest errors for the posterior mean, and the posterior variance is severely underestimated. [sent-160, score-0.376]
57 Using the tilted bound with NCVMP gives more robust estimates of the variance than the quadratic bound as the prior mean changes. [sent-162, score-0.815]
58 However, both the quadratic and tilted bounds underestimate the variance as the prior variance increases. [sent-163, score-0.736]
61 Figure 1: (a) Posterior mean and variance estimates of σ(x)π(x) with varying prior π(x), comparing NCVMP quad, NCVMP tilted and VMP quadratic; axes show the error in posterior variance against the prior mean and the prior variance. [sent-187, score-0.3]
62 (b) Log likelihood of the true regression coefficients under the approximate posterior for 10 synthetic logistic regression datasets. [sent-190, score-0.4]
63 6.2 Binary logistic regression: We generated ten synthetic logistic regression datasets with N = 30 data points and P = 8 covariates. [sent-193, score-0.411]
64 We evaluated the results in terms of the log likelihood of the true regression coefficients under the approximate posterior, a measure which penalises poorly estimated posterior variances. [sent-194, score-0.249]
65 Figure 1(b) compares the performance of non-conjugate VMP using quadrature and VMP using the quadratic bound. [sent-195, score-0.226]
66 For four of the ten datasets the quadratic bound finds very poor solutions. [sent-196, score-0.225]
67 6.3 Softmax bounds: To have some idea how the various bounds for the softmax integral E_q[log Σ_{k=1}^K e^{x_k}] compare empirically, we calculated the relative absolute error on 100 random distributions q(x) = ∏_k N(x_k; m_k, v). [sent-200, score-0.366]
68 As expected the tilted bound (12) dominates the log bound (10), since it is a generalisation. [sent-206, score-0.653]
69 As K is increased the relative error made using the quadratic bound increases, whereas both the log and the tilted bound get tighter. [sent-207, score-0.771]
70 In agreement with [5] we find the strength of the quadratic bound (11) is in the high variance case, and Bohning’s bound [4] is very loose under all conditions. [sent-208, score-0.341]
71 Both the log and tilted bound are extremely accurate for variances v < 1. [sent-209, score-0.57]
72 In fact, the log and tilted bounds are asymptotically optimal as v → 0. [sent-210, score-0.54]
73 These results show that even when quadrature is not an option, much tighter bounds can be found if the constraint of requiring quadratic bounds imposed by VMP is relaxed. [sent-213, score-0.362]
74 For the remainder of the paper we consider only the quadratic, log and tilted bounds. [sent-214, score-0.487]
75 Figure 2: Log10 of the relative absolute error in approximating E_q[log Σ_k e^{x_k}] for the quadratic, log, tilted, Bohning and Taylor bounds, plotted against the number of classes K and against the input variance v. [sent-215, score-0.725]
76 However, we see that the methods using the tilted bound perform best, closely followed by the log bound. [sent-221, score-0.57]
77 Although the quadratic bound has comparable performance for small N < 200 it performs poorly with larger N due to its weakness at small variances. [sent-222, score-0.201]
78 The log bound performed almost as well as the tilted bound at recovering coefficients, but took many more iterations to converge. [sent-224, score-0.274]
79 The extra flexibility of the tilted bound allows faster convergence, analogous to parameter expansion [16]. [sent-225, score-0.507]
80 For small N the tilted bound, log bound and adaptive method converge rapidly, but as N increases the quadratic bound starts to converge much more slowly, as do the tilted and adaptive methods to a lesser extent. [sent-226, score-1.261]
81 “Adaptive” converges fastest because the quadratic bound gives good initial updates at high variance, and the tilted bound takes over once the variance decreases. [sent-227, score-0.795]
82 For all but very large noise values the tilted bound performs best. [sent-229, score-0.507]
84 Figure 3: Left: root mean squared error of inferred regression coefficients; right: iterations to convergence; shown for the adaptive, tilted, quadratic and log methods, with panels (a) varying sample size and (b) varying noise level. [sent-244, score-0.285]
85 Table: marginal likelihood, predictive likelihood and predictive error on the Iris, Glass and Thyroid datasets for each method (values omitted here apart from the first quadratic entry, −65 ± 3.). [sent-248, score-0.21]
86 Adaptive and tilted use NCVMP, quadratic and probit use VMP. [sent-306, score-0.617]
87 Here we have also included “Probit”, corresponding to a Bayesian multinomial probit regression model, estimated using VMP, and similar in setup to [6], except that we use EP to approximate the predictive distribution, rather than sampling. [sent-308, score-0.318]
88 On all three datasets the marginal likelihood calculated using the tilted or adaptive bounds is optimal out of the logistic models (“Probit” has a different underlying model, so differences in marginal likelihood are confounded by the Bayes factor). [sent-309, score-0.821]
89 In terms of predictive performance the quadratic bound seems to be slightly worse across the datasets, with the performance of the other methods varying between datasets. [sent-310, score-0.275]
90 We did not compare to the log bound since it is dominated by the tilted bound and is considerably slower to converge. [sent-311, score-0.653]
91 Indeed, for some models we have found convergence to be a problem, which can be alleviated by damping: if the NCVMP message is m_{f→i}(x_i) then send the message m_{f→i}(x_i)^{1−α} m^{old}_{f→i}(x_i)^α, where m^{old}_{f→i}(x_i) was the previous message sent to i and 0 ≤ α < 1 is a damping factor. [sent-313, score-0.795]
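Because the messages are in exponential-family form, this geometric damping is just a convex combination of natural parameters. A minimal sketch with hypothetical message values:

```python
import numpy as np

def damped_message(phi_new, phi_old, alpha=0.5):
    """Damped NCVMP message: m_new^(1-alpha) * m_old^alpha, in natural parameters."""
    return (1 - alpha) * np.asarray(phi_new) + alpha * np.asarray(phi_old)

# e.g. Gaussian messages stored as natural parameters (mean*precision, -precision/2)
phi_old = np.array([0.8, -0.40])
phi_new = np.array([1.6, -0.25])
print(damped_message(phi_new, phi_old, alpha=0.5))
```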
92 We have introduced Non-conjugate Variational Message Passing, which extends variational Bayes to non-conjugate models while maintaining the convenient message passing framework of VMP and allowing freedom to choose the most accurate available method to approximate required expectations. [sent-315, score-0.585]
93 We have shown NCVMP to be of practical use for fitting Bayesian binary and multinomial logistic models. [sent-317, score-0.209]
94 We derived a new bound for the softmax integral which is tighter than other commonly used bounds, but has variational parameters that are still simple to optimise. [sent-318, score-0.502]
95 Tightness of the bound is valuable both in terms of better approximating the posterior and giving a closer approximation to the marginal likelihood, which may be of interest for model selection. [sent-319, score-0.207]
96 Efficient bounds for the softmax and applications to approximate inference in hybrid models. [sent-342, score-0.228]
97 Variational bayesian multinomial probit regression with gaussian process priors. [sent-347, score-0.278]
98 Approximate riemannian conjugate gradient learning for fixed-form variational bayes. [sent-355, score-0.329]
99 A variational approach to bayesian logistic regression models and their extensions. [sent-374, score-0.448]
100 Building blocks for variational bayesian learning of latent variable models. [sent-444, score-0.275]
wordName wordTfidf (topN-words)
[('ncvmp', 0.483), ('tilted', 0.424), ('vmp', 0.366), ('variational', 0.24), ('qi', 0.189), ('message', 0.181), ('sa', 0.155), ('softmax', 0.149), ('fa', 0.127), ('quadratic', 0.118), ('xi', 0.117), ('messages', 0.108), ('quadrature', 0.108), ('logistic', 0.107), ('multinomial', 0.102), ('kl', 0.095), ('vf', 0.094), ('conjugate', 0.089), ('bound', 0.083), ('passing', 0.077), ('probit', 0.075), ('vb', 0.072), ('regression', 0.066), ('log', 0.063), ('modularity', 0.06), ('posterior', 0.059), ('xa', 0.058), ('variance', 0.057), ('bounds', 0.053), ('bohning', 0.05), ('exi', 0.05), ('exk', 0.05), ('integrand', 0.05), ('vkf', 0.05), ('univariate', 0.05), ('predictive', 0.049), ('mf', 0.047), ('ma', 0.046), ('wrt', 0.046), ('damping', 0.046), ('divergence', 0.046), ('marginal', 0.044), ('expectations', 0.042), ('factor', 0.042), ('exponential', 0.041), ('synthetic', 0.041), ('raiko', 0.04), ('mk', 0.039), ('maintaining', 0.038), ('dxi', 0.036), ('likelihood', 0.035), ('bayesian', 0.035), ('optimise', 0.034), ('xk', 0.034), ('taylor', 0.033), ('exl', 0.033), ('mold', 0.033), ('tornio', 0.033), ('adaptive', 0.033), ('modular', 0.033), ('factors', 0.033), ('uci', 0.032), ('ak', 0.031), ('bayes', 0.03), ('updates', 0.03), ('tighter', 0.03), ('family', 0.03), ('emi', 0.029), ('cents', 0.029), ('iris', 0.029), ('honkela', 0.029), ('neighbouring', 0.029), ('thyroid', 0.029), ('gradients', 0.029), ('incoming', 0.028), ('prior', 0.027), ('coef', 0.027), ('approximate', 0.026), ('varying', 0.025), ('marlin', 0.025), ('dk', 0.025), ('datasets', 0.024), ('glass', 0.024), ('khan', 0.024), ('whilst', 0.024), ('recipe', 0.024), ('freedom', 0.023), ('mean', 0.023), ('send', 0.023), ('compromise', 0.023), ('minimising', 0.023), ('sent', 0.023), ('eld', 0.023), ('calculated', 0.022), ('optimisation', 0.022), ('tightness', 0.022), ('calculate', 0.022), ('cients', 0.021), ('approximation', 0.021), ('calculating', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
Author: David A. Knowles, Tom Minka
Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability. 1
2 0.11513156 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
3 0.10495485 301 nips-2011-Variational Gaussian Process Dynamical Systems
Author: Neil D. Lawrence, Michalis K. Titsias, Andreas Damianou
Abstract: High dimensional time series are endemic in applications of machine learning such as robotics (sensor data), computational biology (gene expression data), vision (video sequences) and graphics (motion capture data). Practical nonlinear probabilistic approaches to this data are required. In this paper we introduce the variational Gaussian process dynamical system. Our work builds on recent variational approximations for Gaussian process latent variable models to allow for nonlinear dimensionality reduction simultaneously with learning a dynamical prior in the latent space. The approach also allows for the appropriate dimensionality of the latent space to be automatically determined. We demonstrate the model on a human motion capture data set and a series of high resolution video sequences. 1
4 0.09939272 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
Author: Siwei Lyu
Abstract: When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a unified framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators. 1
5 0.095946096 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1
6 0.089536302 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
7 0.089222759 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
8 0.08480455 104 nips-2011-Generalized Beta Mixtures of Gaussians
9 0.081065714 306 nips-2011-t-divergence Based Approximate Inference
10 0.070730262 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
11 0.069548741 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
12 0.065235905 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
13 0.064574659 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
14 0.064201333 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
15 0.061728075 302 nips-2011-Variational Learning for Recurrent Spiking Networks
16 0.060255453 158 nips-2011-Learning unbelievable probabilities
17 0.058948703 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
18 0.055890113 156 nips-2011-Learning to Learn with Compound HD Models
19 0.05587586 100 nips-2011-Gaussian Process Training with Input Noise
20 0.055191129 102 nips-2011-Generalised Coupled Tensor Factorisation
topicId topicWeight
[(0, 0.155), (1, 0.0), (2, 0.027), (3, -0.053), (4, -0.052), (5, -0.107), (6, 0.035), (7, -0.089), (8, 0.011), (9, 0.039), (10, -0.079), (11, -0.113), (12, -0.029), (13, -0.059), (14, -0.062), (15, 0.036), (16, -0.012), (17, -0.021), (18, 0.041), (19, 0.026), (20, -0.017), (21, 0.003), (22, 0.156), (23, 0.009), (24, 0.099), (25, 0.012), (26, 0.116), (27, 0.037), (28, -0.007), (29, 0.061), (30, -0.009), (31, 0.032), (32, -0.049), (33, -0.047), (34, 0.035), (35, 0.055), (36, -0.107), (37, 0.111), (38, -0.035), (39, 0.054), (40, -0.143), (41, 0.114), (42, -0.05), (43, -0.126), (44, -0.028), (45, -0.053), (46, 0.096), (47, 0.037), (48, 0.01), (49, 0.114)]
simIndex simValue paperId paperTitle
same-paper 1 0.92757422 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
Author: David A. Knowles, Tom Minka
Abstract: Variational Message Passing (VMP) is an algorithmic implementation of the Variational Bayes (VB) method which applies only in the special case of conjugate exponential family models. We propose an extension to VMP, which we refer to as Non-conjugate Variational Message Passing (NCVMP) which aims to alleviate this restriction while maintaining modularity, allowing choice in how expectations are calculated, and integrating into an existing message-passing framework: Infer.NET. We demonstrate NCVMP on logistic binary and multinomial regression. In the multinomial case we introduce a novel variational bound for the softmax factor which is tighter than other commonly used bounds whilst maintaining computational tractability. 1
2 0.74659741 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
Author: Onur Dikmen, Cédric Févotte
Abstract: In this paper we describe a maximum likelihood approach for dictionary learning in the multiplicative exponential noise model. This model is prevalent in audio signal processing where it underlies a generative composite model of the power spectrogram. Maximum joint likelihood estimation of the dictionary and expansion coefficients leads to a nonnegative matrix factorization problem where the Itakura-Saito divergence is used. The optimality of this approach is in question because the number of parameters (which include the expansion coefficients) grows with the number of observations. In this paper we describe a variational procedure for optimization of the marginal likelihood, i.e., the likelihood of the dictionary where the activation coefficients have been integrated out (given a specific prior). We compare the output of both maximum joint likelihood estimation (i.e., standard Itakura-Saito NMF) and maximum marginal likelihood estimation (MMLE) on real and synthetical datasets. The MMLE approach is shown to embed automatic model order selection, akin to automatic relevance determination.
3 0.61331409 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
Author: Miguel Lázaro-gredilla, Michalis K. Titsias
Abstract: We introduce a variational Bayesian inference algorithm which can be widely applied to sparse linear models. The algorithm is based on the spike and slab prior which, from a Bayesian perspective, is the golden standard for sparse inference. We apply the method to a general multi-task and multiple kernel learning model in which a common set of Gaussian process functions is linearly combined with task-specific sparse weights, thus inducing relation between tasks. This model unifies several sparse linear models, such as generalized linear models, sparse factor analysis and matrix factorization with missing values, so that the variational algorithm can be applied to all these cases. We demonstrate our approach in multioutput Gaussian process regression, multi-class classification, image processing applications and collaborative filtering. 1
4 0.59458655 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
Author: Siwei Lyu
Abstract: When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a unified framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators. 1
5 0.58485359 306 nips-2011-t-divergence Based Approximate Inference
Author: Nan Ding, Yuan Qi, S.v.n. Vishwanathan
Abstract: Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student’s t-prior. 1
6 0.58160824 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
7 0.57361889 104 nips-2011-Generalized Beta Mixtures of Gaussians
8 0.5644815 217 nips-2011-Practical Variational Inference for Neural Networks
9 0.51897639 240 nips-2011-Robust Multi-Class Gaussian Process Classification
10 0.45987764 225 nips-2011-Probabilistic amplitude and frequency demodulation
11 0.45291287 158 nips-2011-Learning unbelievable probabilities
12 0.45155513 301 nips-2011-Variational Gaussian Process Dynamical Systems
13 0.45059589 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
14 0.4498322 258 nips-2011-Sparse Bayesian Multi-Task Learning
15 0.42986703 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
16 0.42071342 14 nips-2011-A concave regularization technique for sparse mixture models
17 0.4097451 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
18 0.40856871 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
19 0.40456989 69 nips-2011-Differentially Private M-Estimators
20 0.40032133 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
topicId topicWeight
[(0, 0.013), (4, 0.02), (20, 0.02), (26, 0.032), (31, 0.101), (33, 0.343), (43, 0.098), (45, 0.085), (57, 0.061), (74, 0.06), (83, 0.039), (99, 0.037)]
simIndex simValue paperId paperTitle
1 0.97859466 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes
Author: Yee W. Teh, Charles Blundell, Lloyd Elliott
Abstract: We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data. 1
2 0.96039212 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
Author: Bogdan Alexe, Viviana Petrescu, Vittorio Ferrari
Abstract: We present a computationally efficient technique to compute the distance of highdimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18] and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average. 1
3 0.87628597 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models
Author: Skander Mensi, Richard Naud, Wulfram Gerstner
Abstract: Variability in single neuron models is typically implemented either by a stochastic Leaky-Integrate-and-Fire model or by a model of the Generalized Linear Model (GLM) family. We use analytical and numerical methods to relate state-of-theart models from both schools of thought. First we find the analytical expressions relating the subthreshold voltage from the Adaptive Exponential Integrate-andFire model (AdEx) to the Spike-Response Model with escape noise (SRM as an example of a GLM). Then we calculate numerically the link-function that provides the firing probability given a deterministic membrane potential. We find a mathematical expression for this link-function and test the ability of the GLM to predict the firing probability of a neuron receiving complex stimulation. Comparing the prediction performance of various link-functions, we find that a GLM with an exponential link-function provides an excellent approximation to the Adaptive Exponential Integrate-and-Fire with colored-noise input. These results help to understand the relationship between the different approaches to stochastic neuron models. 1 Motivation When it comes to modeling the intrinsic variability in simple neuron models, we can distinguish two traditional approaches. One approach is inspired by the stochastic Leaky Integrate-and-Fire (LIF) hypothesis of Stein (1967) [1], where a noise term is added to the system of differential equations implementing the leaky integration to a threshold. There are multiple versions of such a stochastic LIF [2]. How the noise affects the firing probability is also a function of the parameters of the neuron model. Therefore, it is important to take into account the refinements of simple neuron models in terms of subthreshold resonance [3, 4], spike-triggered adaptation [5, 6] and non-linear spike 1 initiation [7, 5]. All these improvements are encompassed by the Adaptive Exponential Integrateand-Fire model (AdEx [8, 9]). The other approach is to start with some deterministic dynamics for the the state of the neuron (for instance the instantaneous distance from the membrane potential to the threshold) and link the probability intensity of emitting a spike with a non-linear function of the state variable. Under some conditions, this type of model is part of a greater class of statistical models called Generalized Linear Models (GLM [10]). As a single neuron model, the Spike Response Model (SRM) with escape noise is a GLM in which the state variable is explicitly the distance between a deterministic voltage and the threshold. The original SRM could account for subthreshold resonance, refractory effects and spike-frequency adaptation [11]. Mathematically similar models were developed independently in the study of the visual system [12] where spike-frequency adaptation has also been modeled [13]. Recently, this approach has retained increased attention since the probabilistic framework can be linked with the Bayesian theory of neural systems [14] and because Bayesian inference can be applied to the population of neurons [15]. In this paper, we investigate the similarity and differences between the state-of-the-art GLM and the stochastic AdEx. The motivation behind this work is to relate the traditional threshold neuron models to Bayesian theory. Our results extend the work of Plesser and Gerstner (2000) [16] since we include the non-linearity for spike initiation and spike-frequency adaptation. We also provide relationships between the parameters of the AdEx and the equivalent GLM. 
These precise relationships can be used to relate analog implementations of threshold models [17] to the probabilistic models used in the Bayesian approach. The paper is organized as follows: We first describe the expressions relating the SRM state-variable to the parameters of the AdEx (Sect. 3.1) in the subthreshold regime. Then, we use numerical methods to find the non-linear link-function that models the firing probability (Sect. 3.2). We find a functional form for the SRM link-function that best describes the firing probability of a stochastic AdEx. We then compare the performance of this link-function with the often used exponential or linear-rectifier link-functions (also called half-wave linear rectifier) in terms of predicting the firing probability of an AdEx under complex stimulus (Sect. 3.3). We find that the exponential linkfunction yields almost perfect prediction. Finally, we explore the relations between the statistic of the noise and the sharpness of the non-linearity for spike initiation with the parameters of the SRM. 2 Presentation of the Models In this section we present the general formula for the stochastic AdEx model (Sect. 2.1) and the SRM (Sect 2.2). 2.1 The Stochastic Adaptive Exponential Integrate-and-Fire Model The voltage dynamics of the stochastic AdEx is given by: V −Θ ˙ τm V = El − V + ∆T exp − Rw + RI + R (1) ∆T τw w = a(V − El ) − w ˙ (2) where τm is the membrane time constant, El the reverse potential, R the membrane resistance, Θ is the threshold, ∆T is the shape factor and I(t) the input current which is chosen to be an Ornstein−Θ Uhlenbeck process with correlation time-constant of 5 ms. The exponential term ∆T exp( V∆T ) is a non-linear function responsible for the emission of spikes and is a diffusive white noise with standard deviation σ (i.e. ∼ N (0, σ)). Note that the diffusive white-noise does not imply white noise fluctuations of the voltage V (t), the probability distribution of V (t) will depend on ∆T and Θ. The second variable, w, describes the subthreshold as well as the spike-triggered adaptation both ˆ parametrized by the coupling strength a and the time constant τw . Each time tj the voltage goes to infinity, we assumed that a spike is emitted. Then the voltage is reset to a fixed value Vr and w is increased by a constant value b. 2.2 The Generalized Linear Model In the SRM, The voltage V (t) is given by the convolution of the injected current I(t) with the membrane filter κ(t) plus the additional kernel η(t) that acts after each spikes (here we split the 2 spike-triggered kernel in two η(t) = ηv (t) + ηw (t) for reasons that will become clear later): V (t) = ˆ ˆ ηv (t − tj ) + ηw (t − tj ) El + [κ ∗ I](t) + (3) ˆ {tj } ˆ Then at each time tj a spike is emitted which results in a change of voltage described by η(t) = ηv (t) + ηw (t). Given the deterministic voltage, (Eq. 3) a spike is emitted according to the firing intensity λ(V ): λ(t) = f (V (t)) (4) where f (·) is an arbitrary function called the link-function. Then the firing behavior of the SRM depends on the choice of the link-function and its parameters. The most common link-function used to model single neuron activities are the linear-rectifier and the exponential function. 3 Mapping In order to map the stochastic AdEx to the SRM we follow a two-step procedure. First we derive the filter κ(t) and the kernels ηv (t) and ηw (t) analytically as a function of AdEx parameters. Second, we derive the link-function of the SRM from the stochastic spike emission of the AdEx. 
Figure 1: Mapping of the subthreshold dynamics of an AdEx to an equivalent SRM. A. Membrane filter κ(t) for three different sets of parameters of the AdEx leading to over-damped, critically damped and under-damped cases (upper, middle and lower panel, respectively). B. Spike-Triggered η(t) (black), ηv (t) (light gray) and ηw (gray) for the three cases. C. Example of voltage trace produced when an AdEx is stimulated with a step of colored noise (black). The corresponding voltage from a SRM stimulated with the same current and where we forced the spikes to match those of the AdEx (red). D. Error in the subthreshold voltage (VAdEx − VGLM ) as a function of the mean voltage of the AdEx, for the three different cases: over-, critically and under-damped (light gray, gray and black, respectively) with ∆T = 1 mV. Red line represents the voltage threshold Θ. E. Root Mean Square Error (RMSE) ratio for the three cases with ∆T = 1 mV. The RMSE ratio is the RMSE between the deterministic VSRM and the stochastic VAdEx divided by the RMSE between repetitions of the stochastic AdEx voltage. The error bar shows a single standard deviation as the RMSE ratio is averaged accross multiple value of σ. 3.1 Subthreshold voltage dynamics We start by assuming that the non-linearity for spike initiation does not affect the mean subthreshold voltage of the stochastic AdEx (see Figure 1 D). This assumption is motivated by the small ∆T 3 observed in in-vitro recordings (from 0.5 to 2 mV [8, 9]) which suggest that the subthreshold dynamics are mainly linear except very close to Θ. Also, we expect that the non-linear link-function will capture some of the dynamics due to the non-linearity for spike initiation. Thus it is possible to rewrite the deterministic subthreshold part of the AdEx (Eq. 1-2 without and without ∆T exp((V − Θ)/∆T )) using matrices: ˙ x = Ax (5) with x = V w and A = − τ1 m a τw − gl1m τ − τ1 w (6) In this form, the dynamics of the deterministic AdEx voltage is a damped oscillator with a driving force. Depending on the eigenvalues of A the system could be over-damped, critically damped or under-damped. The filter κ(t) of the GLM is given by the impulse response of the system of coupled differential equations of the AdEx, described by Eq. 5 and 6. In other words, one has to derive the response of the system when stimulating with a Dirac-delta function. The type of damping gives three different qualitative shapes of the kernel κ(t), which are summarized in Table 3.1 and Figure 1 A. Since the three different filters also affect the nature of the stochastic voltage fluctuations, we will keep the distinction between over-damped, critically damped and under-damped scenarios throughout the paper. This means that our approach is valid for at least 3 types of diffusive voltage-noise (i.e. the white noise in Eq. 1 filtered by 3 different membrane filters κ(t)). To complete the description of the deterministic voltage, we need an expression for the spiketriggered kernels. The voltage reset at each spike brings a spike-triggered jump in voltage of magˆ nitude ∆ = Vr − V (t). This perturbation is superposed to the current fluctuations due to I(t) and can be mediated by a Delta-diract pulse of current. Thus we can write the voltage reset kernel by: ηv (t) = ∆ ∆ [δ ∗ κ] (t) = κ(t) κ(0) κ(0) (7) where δ(t) is the Dirac-delta function. The shape of this kernel depends on κ(t) and can be computed from Table 3.1 (see Figure 1 B). Finally, the AdEx mediates spike-frequency adaptation by the jump of the second variables w. 
From Eq. 2 we can see that this produces a current wspike (t) = b exp (−t/τw ) that can cumulate over subsequent spikes. The effect of this current on voltage is then given by the convolution of wspike (t) with the membrane filter κ(t). Thus in the SRM framework the spike-frequency adaptation is taken into account by: ηw (t) = [wspike ∗ κ](t) (8) Again the precise form of ηw (t) depends on κ(t) and can be computed from Table 3.1 (see Figure 1 B). At this point, we would like to verify our assumption that the non-linearity for spike emission can be neglected. Fig. 1 C and D shows that the error between the voltage from Eq. 3 and the voltage from the stochastic AdEx is generally small. Moreover, we see that the main contribution to the voltage prediction error is due to the mismatch close to the spikes. However the non-linearity for spike initiation may change the probability distribution of the voltage fluctuations, which in turn influences the probability of spiking. This will influence the choice of the link-function, as we will see in the next section. 3.2 Spike Generation Using κ(t), ηv (t) and ηw (t), we must relate the spiking probability of the stochastic AdEx as a function of its deterministic voltage. According to [2] the probability of spiking in time bin dt given the deterministic voltage V (t) is given by: p(V ) = prob{spike in [t, t + dt]} = 1 − exp (−f (V (t))dt) (9) where f (·) gives the firing intensity as a function of the deterministic V (t) (Eq. 3). Thus to extract the link-function f we have to compute the probability of spiking given V (t) for our SRM. To do so we apply the method proposed by Jolivet et al. (2004) [18], where the probability of spiking is simply given by the distribution of the deterministic voltage estimated at the spike times divided by the distribution of the SRM voltage when there is no spike (see figure 2 A). One can numerically compute these two quantities for our models using N repetitions of the same stimulus. 4 Table 1: Analytical expressions for the membrane filter κ(t) in terms of the parameters of the AdEx for over-, critically-, and under-damped cases. Membrane Filter: κ(t) over-damped if: (τm + τw )2 > 4τm τw (gl +a) gl κ(t) = k1 eλ1 t + k2 eλ2 t λ1 = 1 2τm τw (−(τm + τw ) + critically-damped if: (τm + τw )2 = 4τm τw (gl +a) gl κ(t) = (αt + β)eλt λ= under-damped if: (τm + τw )2 < 4τm τw (gl +a) gl κ(t) = (k1 cos (ωt) + k2 sin (ωt)) eλt −(τm +τw ) 2τm τw λ= −(τm +τw ) 2τm τw (τm + τw )2 − 4 τm τw (gl + a) gl λ2 = 1 2τm τw (−(τm + τw ) − α= τm −τw 2Cτm τw ω= τw −τm 2τm τw 2 − a g l τm τw (τm + τw )2 − 4 τm τw (gl + a) gl k1 = −(1+(τm λ2 )) Cτm (λ1 −λ2 ) k2 = 1+(τm λ1 ) Cτm (λ1 −λ2 ) β= 1 C k1 = k2 = 1 C −(1+τm λ) Cωτm The standard deviation σ of the noise and the parameter ∆T of the AdEx non-linearity may affect the shape of the link-function. We thus extract p(V ) for different σ and ∆T (Fig. 2 B). Then using visual heuristics and previous knowledge about the potential analytical expression of the link-funtion, we try to find a simple analytical function that captures p(V ) for a large range of combinations of σ and ∆T . We observed that the log(− log(p)) is close to linear in most studied conditions Fig. 2 B suggesting the following two distributions of p(V ): V − VT (10) p(V ) = 1 − exp − exp ∆V V − VT p(V ) = exp − exp − (11) ∆V Once we have p(V ), we can use Eq. 4 to obtain the equivalent SRM link-function, which leads to: −1 f (V ) = log (1 − p(V )) (12) dt Then the two potential link-functions of the SRM can be derived from Eq. 
10 and Eq. 11 (respectively): V − VT f (V ) = λ0 exp (13) ∆V V − VT (14) f (V ) = −λ0 log 1 − exp − exp − ∆V 1 with λ0 = dt , VT the threshold of the SRM and ∆V the sharpness of the link-function (i.e. the parameters that governs the degree of the stochasticity). Note that the exact value of λ0 has no importance since it is redundant with VT . Eq. 13 is the standard exponential link-function, but we call Eq. 14 the log-exp-exp link-function. 3.3 Prediction The next point is to evaluate the fit quality of each link-function. To do this, we first estimate the parameters VT and ∆V of the GLM link-function that maximize the likelihood of observing a spike 5 Figure 2: SRM link-function. A. Histogram of the SRM voltage at the AdEx firing times (red) and at non-firing times (gray). The ratio of the two distributions gives p(V ) (Eq. 9, dashed lines). Inset, zoom to see the voltage histogram evaluated at the firing time (red). B. log(− log(p)) as a function of the SRM voltage for three different noise levels σ = 0.07, 0.14, 0.18 nA (pale gray, gray, black dots, respectively) and ∆T = 1 mV. The line is a linear fit corresponding to the log-exp-exp linkfunction and the dashed line corresponds to a fit with the exponential link-function. C. Same data and labeling scheme as B, but plotting f (V ) according to Eq. 12. The lines are produced with Eq. 14 with parameters fitted as described in B. and the dashed lines are produced with Eq. 13. Inset, same plot but on a semi-log(y) axis. train generated with an AdEx. Second we look at the predictive power of the resulting SRM in terms of Peri-Stimulus Time Histogram (PSTH). In other words we ask how close the spike trains generated with a GLM are from the spike train generated with a stochastic AdEx when both models are stimulated with the same input current. For any GLM with link-function f (V ) ≡ f (t|I, θ) and parameters θ regulating the shape of κ(t), ˆ ηv (t) and ηw (t), the Negative Log-Likelihood (NLL) of observing a spike-train {t} is given by: NLL = − log(f (t|I, θ)) − f (t|I, θ) (15) t ˆ t It has been shown that the negative log-likelihood is convex in the parameters if f is convex and logconcave [19]. It is easy to show that a linear-rectifier link-function, the exponential link-function and the log-exp-exp link-function all satisfy these conditions. This allows efficient estimation of ˆ ˆ the optimal parameters VT and ∆V using a simple gradient descent. One can thus estimate from a single AdEx spike train the optimal parameters of a given link-function, which is more efficient than the method used in Sect. 3.2. The minimal NLL resulting from the gradient descent gives an estimation of the fit quality. A better estimate of the fit quality is given by the distance between the PSTHs in response to stimuli not used for parameter fitting . Let ν1 (t) be the PSTH of the AdEx, and ν2 (t) be the PSTH of the fitted SRM, 6 Figure 3: PSTH prediction. A. Injected current. B. Voltage traces produced by an AdEx (black) and the equivalent SRM (red), when stimulated with the current in A. C. Raster plot for 20 realizations of AdEx (black tick marks) and equivalent SRM (red tick marks). D. PSTH of the AdEx (black) and the SRM (red) obtained by averaging 10,000 repetitions. E. Optimal log-likelihood for the three cases of the AdEx, using three different link-functions, a linear-rectifier (light gray), an exponential link-function (gray) and the link-function defined by Eq. 
14 (dark gray), these values are obtained by averaging over 40 different combinations σ and ∆T (see Fig. 4). Error bars are one standard deviation, the stars denote a significant difference, two-sample t-test with α = 0.01. F. same as E. but for Md (Eq. 16). then we use Md ∈ [0, 1] as a measure of match: Md = 2 2 (ν1 (t) − ν2 (t)) dt ν1 (t)2 dt + ν2 (t)2 dt (16) Md = 1 means that it is impossible to differentiate the SRM from the AdEx in terms of their PSTHs, whereas a Md of 0 means that the two PSTHs are completely different. Thus Md is a normalized similarity measure between two PSTHs. In practice, Md is estimated from the smoothed (boxcar average of 1 ms half-width) averaged spike train of 1 000 repetitions for each models. We use both the NLL and Md to quantify the fit quality for each of the three damping cases and each of the three link-functions. Figure 3 shows the match between the stochastic AdEx used as a reference and the derived GLM when both are stimulated with the same input current (Fig. 3 A). The resulting voltage traces are almost identical (Fig. 3 B) and both models predict almost the same spike trains and so the same PSTHs (Fig. 3 C and D). More quantitalively, we see on Fig. 3 E and F, that the linear-rectifier fits significantly worse than both the exponential and log-exp-exp link-functions, both in terms of NLL and of Md . The exponential link-function performs as well as the log-exp-exp link-function, with a spike train similarity measure Md being almost 1 for both. Finally the likelihood-based method described above gives us the opportunity to look at the relationship between the AdEx parameters σ and ∆T that governs its spike emission and the parameters VT and ∆V of the link-function (Fig. 4). We observe that an increase of the noise level produces a flatter link-function (greater ∆V ) while an increase in ∆T also produces an increase in ∆V and VT (note that Fig. 4 shows ∆V and VT for the exponential link-function only, but equivalent results are obtained with the log-exp-exp link-function). 4 Discussion In Sect. 3.3 we have shown that it is possible to predict with almost perfect accuracy the PSTH of a stochastic AdEx model using an appropriate set of parameters in the SRM. Moreover, since 7 Figure 4: Influence of the AdEx parameters on the parameters of the exponential link-function. A. VT as a function of ∆T and σ. B. ∆V as a function of ∆T and σ. the subthreshold voltage of the AdEx also gives a good match with the deterministic voltage of the SRM, we expect that the AdEx and the SRM will not differ in higher moments of the spike train probability distributions beyond the PSTH. We therefore conclude that diffusive noise models of the type of Eq. 1-2 are equivalent to GLM of the type of Eq. 3-4. Once combined with similar results on other types of stochastic LIF (e.g. correlated noise), we could bridge the gap between the literature on GLM and the literature on diffusive noise models. Another noteworthy observation pertains to the nature of the link-function. The link-function has been hypothesized to be a linear-rectifier, an exponential, a sigmoidal or a Gaussian [16]. We have observed that for the AdEx the link-function follows Eq. 14 that we called the log-exp-exp linkfunction. Although the link-function is log-exp-exp for most of the AdEx parameters, the exponential link-function gives an equivalently good prediction of the PSTH. 
Figure 3: PSTH prediction. A. Injected current. B. Voltage traces produced by an AdEx (black) and the equivalent SRM (red) when stimulated with the current in A. C. Raster plot for 20 realizations of the AdEx (black tick marks) and the equivalent SRM (red tick marks). D. PSTH of the AdEx (black) and the SRM (red) obtained by averaging 10,000 repetitions. E. Optimal log-likelihood for the three cases of the AdEx, using three different link-functions: a linear-rectifier (light gray), an exponential link-function (gray) and the link-function defined by Eq. 14 (dark gray); these values are obtained by averaging over 40 different combinations of σ and ∆T (see Fig. 4). Error bars are one standard deviation; the stars denote a significant difference (two-sample t-test with α = 0.01). F. Same as E, but for Md (Eq. 16).

Figure 3 shows the match between the stochastic AdEx used as a reference and the derived GLM when both are stimulated with the same input current (Fig. 3 A). The resulting voltage traces are almost identical (Fig. 3 B) and both models predict almost the same spike trains and hence the same PSTHs (Fig. 3 C and D). More quantitatively, we see in Fig. 3 E and F that the linear-rectifier fits significantly worse than both the exponential and log-exp-exp link-functions, both in terms of NLL and of Md. The exponential link-function performs as well as the log-exp-exp link-function, with a spike-train similarity measure Md close to 1 for both.

Finally, the likelihood-based method described above gives us the opportunity to look at the relationship between the AdEx parameters σ and ∆T that govern its spike emission and the parameters VT and ∆V of the link-function (Fig. 4). We observe that an increase of the noise level produces a flatter link-function (greater ∆V), while an increase in ∆T also produces an increase in ∆V and VT (note that Fig. 4 shows ∆V and VT for the exponential link-function only, but equivalent results are obtained with the log-exp-exp link-function).

Figure 4: Influence of the AdEx parameters on the parameters of the exponential link-function. A. VT as a function of ∆T and σ. B. ∆V as a function of ∆T and σ.

4 Discussion

In Sect. 3.3 we have shown that it is possible to predict with almost perfect accuracy the PSTH of a stochastic AdEx model using an appropriate set of parameters in the SRM. Moreover, since the subthreshold voltage of the AdEx also gives a good match with the deterministic voltage of the SRM, we expect that the AdEx and the SRM will not differ in higher moments of the spike-train probability distributions beyond the PSTH. We therefore conclude that diffusive noise models of the type of Eq. 1-2 are equivalent to GLMs of the type of Eq. 3-4. Once combined with similar results on other types of stochastic LIF models (e.g. with correlated noise), we could bridge the gap between the literature on GLMs and the literature on diffusive noise models.

Another noteworthy observation pertains to the nature of the link-function. The link-function has been hypothesized to be a linear-rectifier, an exponential, a sigmoidal or a Gaussian [16]. We have observed that for the AdEx the link-function follows Eq. 14, which we called the log-exp-exp link-function. Although the link-function is log-exp-exp for most of the AdEx parameters, the exponential link-function gives an equivalently good prediction of the PSTH. This can be explained by the fact that the difference between the log-exp-exp and exponential link-functions arises mainly at low voltage (i.e. far from the threshold), where the probability of emitting a spike is very low (Figure 2 C, below −50 mV). Therefore, even if the exponential link-function overestimates the firing probability at these low voltages, it rarely produces extra spikes. At voltages closer to the threshold, where most of the spikes are emitted, the two link-functions behave almost identically and hence produce the same PSTH.

The Gaussian link-function can be seen as lying in between the exponential link-function and the log-exp-exp link-function in Fig. 2. This means that the work of Plesser and Gerstner (2000) [16] is in agreement with the results presented here. The importance of the time-derivative of the voltage stressed by Plesser and Gerstner (leading to a two-dimensional link-function f(V, V̇)) was not studied here, in order to remain consistent with the typical usage of GLMs in neural systems [14].

Finally, we restricted our study to an exponential non-linearity for spike initiation and did not consider other cases such as the Quadratic Integrate-and-Fire (QIF, [5]) or other polynomial functional shapes. We overlooked these cases for two reasons. First, there is ample evidence that the non-linearity in neurons (estimated from in-vitro recordings of pyramidal neurons) is well approximated by a single exponential [9]. Second, the exponential non-linearity of the AdEx only affects the subthreshold voltage at high voltage (close to threshold) and can thus be neglected when deriving the filters κ(t) and η(t). Polynomial non-linearities, on the other hand, affect a larger range of the subthreshold voltage, so that it would be difficult to justify the linearization of the subthreshold dynamics that is essential to the method presented here.
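As a practical footnote to the PSTH comparison of Sect. 3.3, repeated GLM spike trains of the kind averaged into ν2(t) could be generated with an escape-noise sampler such as the one sketched below. This is not the authors' code: it assumes the standard escape-noise relation p = 1 − exp(−f·dt) for the per-bin spike probability, and every parameter value and kernel shape is invented purely for illustration.

```python
import numpy as np

def simulate_glm(V_sub, eta, dt, link, rng):
    # Escape-noise simulation: in each bin a spike is drawn with probability
    # 1 - exp(-f(V) * dt); after a spike, the spike-triggered kernel eta is
    # added to the future voltage (reset / adaptation).
    V = V_sub.copy()
    spike_times = []
    for i in range(V.size):
        if rng.random() < 1.0 - np.exp(-link(V[i]) * dt):
            spike_times.append(i * dt)
            j = min(V.size, i + 1 + eta.size)
            V[i + 1:j] += eta[:j - i - 1]
    return np.asarray(spike_times)

# Illustrative usage, all numbers made up:
dt = 1e-4
rng = np.random.default_rng(0)
V_sub = -65.0 + 10.0 * np.sin(np.linspace(0.0, 20.0, 20_000))   # toy input-driven voltage (mV)
eta = -15.0 * np.exp(-np.arange(0.0, 0.05, dt) / 0.01)          # toy after-spike kernel (mV)
link = lambda V: 1000.0 * np.exp((V + 50.0) / 4.0)              # exponential link, toy parameters
trains = [simulate_glm(V_sub, eta, dt, link, rng) for _ in range(50)]
```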
References

[1] R. B. Stein, “Some models of neuronal variability,” Biophys J, vol. 7, no. 1, pp. 37–68, 1967.
[2] W. Gerstner and W. Kistler, Spiking Neuron Models. Cambridge University Press, 2002.
[3] E. Izhikevich, “Resonate-and-fire neurons,” Neural Networks, vol. 14, pp. 883–894, 2001.
[4] M. J. E. Richardson, N. Brunel, and V. Hakim, “From subthreshold to firing-rate resonance,” Journal of Neurophysiology, vol. 89, pp. 2538–2554, 2003.
[5] E. Izhikevich, “Simple model of spiking neurons,” IEEE Transactions on Neural Networks, vol. 14, pp. 1569–1572, 2003.
[6] S. Mensi, R. Naud, M. Avermann, C. C. H. Petersen, and W. Gerstner, “Parameter extraction and classification of three neuron types reveals two different adaptation mechanisms,” under review.
[7] N. Fourcaud-Trocme, D. Hansel, C. V. Vreeswijk, and N. Brunel, “How spike generation mechanisms determine the neuronal response to fluctuating inputs,” Journal of Neuroscience, vol. 23, no. 37, pp. 11628–11640, 2003.
[8] R. Brette and W. Gerstner, “Adaptive exponential integrate-and-fire model as an effective description of neuronal activity,” Journal of Neurophysiology, vol. 94, pp. 3637–3642, 2005.
[9] L. Badel, W. Gerstner, and M. Richardson, “Dependence of the spike-triggered average voltage on membrane response properties,” Neurocomputing, vol. 69, pp. 1062–1065, 2007.
[10] P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed. Chapman & Hall/CRC, 1998, vol. 37.
[11] W. Gerstner, J. van Hemmen, and J. Cowan, “What matters in neuronal locking?” Neural Computation, vol. 8, pp. 1653–1676, 1996.
[12] D. Hubel and T. Wiesel, “Receptive fields and functional architecture of monkey striate cortex,” Journal of Physiology, vol. 195, pp. 215–243, 1968.
[13] J. Pillow, L. Paninski, V. Uzzell, E. Simoncelli, and E. Chichilnisky, “Prediction and decoding of retinal ganglion cell responses with a probabilistic spiking model,” Journal of Neuroscience, vol. 25, no. 47, pp. 11003–11013, 2005.
[14] K. Doya, S. Ishii, A. Pouget, and R. P. N. Rao, Bayesian Brain: Probabilistic Approaches to Neural Coding. The MIT Press, 2007.
[15] S. Gerwinn, J. H. Macke, M. Seeger, and M. Bethge, “Bayesian inference for spiking neuron models with a sparsity prior,” in Advances in Neural Information Processing Systems, 2007.
[16] H. Plesser and W. Gerstner, “Noise in integrate-and-fire neurons: From stochastic input to escape rates,” Neural Computation, vol. 12, pp. 367–384, 2000.
[17] J. Schemmel, J. Fieres, and K. Meier, “Wafer-scale integration of analog neural networks,” in IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), June 2008, pp. 431–438.
[18] R. Jolivet, T. Lewis, and W. Gerstner, “Generalized integrate-and-fire models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy,” Journal of Neurophysiology, vol. 92, pp. 959–976, 2004.
[19] L. Paninski, “Maximum likelihood estimation of cascade point-process neural encoding models,” Network: Computation in Neural Systems, vol. 15, pp. 243–262, 2004.
4 0.86885285 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
Author: Bin Zhao, Fei Li, Eric P. Xing
Abstract: Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale datasets such as ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, becomes available. As human cognition of the complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define the loss function and select a common set of features for related categories. An efficient optimization method based on proximal approximation and an accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. 1
5 0.82255763 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
Author: Joseph Keshet, David A. McAllester
Abstract: We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample. 1
same-paper 6 0.80154639 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
7 0.77502865 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
8 0.75553703 165 nips-2011-Matrix Completion for Multi-label Image Classification
9 0.6861527 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
10 0.67278516 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
11 0.67000717 227 nips-2011-Pylon Model for Semantic Segmentation
12 0.66160989 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
13 0.64400613 66 nips-2011-Crowdclustering
14 0.63912833 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
15 0.63862735 168 nips-2011-Maximum Margin Multi-Instance Learning
16 0.62719977 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
17 0.62436289 154 nips-2011-Learning person-object interactions for action recognition in still images
18 0.62049997 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
19 0.61516505 35 nips-2011-An ideal observer model for identifying the reference frame of objects
20 0.61498386 55 nips-2011-Collective Graphical Models