nips nips2005 nips2005-202 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. [sent-10, score-1.396]
2 We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases. [sent-11, score-0.665]
3 1 Introduction Probabilistic methods have become well-established in the analysis of learning algorithms over the past decade, drawing largely on classical Gaussian statistical theory [21, 2, 28]. [sent-12, score-0.067]
4 More recently, variational Bayes and ensemble learning methods [22, 13] have been proposed. [sent-13, score-0.74]
5 In addition to the evidence and VB methods, variational methods based on convex bounding have been proposed for dealing with non-gaussian latent variables [18, 14]. [sent-14, score-1.127]
6 We concentrate here on the theory of the linear model, with direct application to ICA [14], factor analysis [2], mixture models [13], kernel regression [30, 11, 32], and linearization approaches to nonlinear models [15]. [sent-15, score-0.107]
7 In Mackay’s evidence framework, “hierarchical priors” are employed on the latent variables, using Gamma priors on the inverse variances, which has the effect of making the marginal prior of the latent variable the non-Gaussian Student’s t [30]. [sent-17, score-0.438]
8 Based on Mackay’s framework, Tipping proposed the Relevance Vector Machine (RVM) [30] for estimation of sparse solutions in the kernel regression problem. [sent-18, score-0.138]
9 A relationship between the evidence framework and ensemble/VB methods has been noted in [22, 6] for the particular case of the RVM with t hyperprior. [sent-19, score-0.13]
10 Figueiredo [11] proposed EM algorithms based on hyperprior representations of the Laplacian and Jeffreys priors. [sent-20, score-0.195]
11 In [14], Girolami employed the convex variational framework of [16] to derive a different type of variational EM algorithm using a convex variational representation of the Laplacian prior. [sent-21, score-2.274]
12 [32] demonstrated the equivalence between the variational approach of [16, 14] and the evidence based RVM for the case of t priors, and thus via [6], the equivalence of the convex variational method and the ensemble/VB methods for the particular case of the t prior. [sent-23, score-1.626]
13 In this paper we consider these methods from a unifying viewpoint, deriving algorithms in more general form and establishing a more general relationship among the methods than has previously been shown. [sent-24, score-0.18]
14 In §2, we define the model and estimation problems we shall be concerned with, and in §3 we discuss criteria for variational representations. [sent-25, score-0.759]
15 2 The Bayesian linear model Throughout we shall consider the following model, y = Ax + ν, (1) where A ∈ R^{m×n}, x ∼ p(x) = ∏_i p(x_i), and ν ∼ N(0, Σ_ν), with x and ν independent. [sent-27, score-0.077]
16 The important thing to note for our purposes is that the xi are non-Gaussian. [sent-28, score-0.08]
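As a concrete reference point for the model above, the following sketch draws a small synthetic instance of (1) with i.i.d. Laplacian (hence non-Gaussian) components; the dimensions, the Laplace prior, and the noise level are illustrative assumptions, not values taken from the paper. Later sketches in this section assume arrays shaped like A and y here.

```python
import numpy as np

# A tiny instantiation of model (1), y = A x + nu, with i.i.d. Laplacian
# (hence non-Gaussian) components x_i.  All sizes and scales are arbitrary
# illustrative choices.
rng = np.random.default_rng(0)
m, n = 10, 25
A = rng.standard_normal((m, n))
x = rng.laplace(loc=0.0, scale=1.0, size=n)          # p(x_i) = (1/2) exp(-|x_i|)
Sigma_nu = 0.01 * np.eye(m)
nu = rng.multivariate_normal(np.zeros(m), Sigma_nu)  # nu ~ N(0, Sigma_nu)
y = A @ x + nu
```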
17 We consider two types of variational representation of the non-Gaussian priors p(xi ), which we shall call convex type and integral type. [sent-29, score-1.273]
18 In the convex type of variational representation, the density is represented as a supremum over Gaussian functions of varying scale, p(x) = sup_{ξ>0} N(x; 0, ξ^{-1}) ϕ(ξ). (2) [sent-30, score-1.035]
19 The essential property of “concavity in x²” leading to this representation was used in [29, 17, 16, 18, 6] to represent the Logistic link function. [sent-31, score-0.063]
20 A convex type representation of the Laplace density was applied to learning overcomplete representations in [14]. [sent-32, score-0.62]
21 In the integral type of representation, the density p(x) is represented as an integral over the scale parameter of the density, with respect to some positive measure µ: p(x) = ∫_0^∞ N(x; 0, ξ^{-1}) dµ(ξ). (3) [sent-33, score-0.679]
22 Such representations with a general kernel are referred to as scale mixtures [19]. [sent-34, score-0.3]
23 Gaussian scale mixtures were discussed in the examples of Dempster, Laird, and Rubin’s original EM paper [9], and treated more extensively in [10]. [sent-35, score-0.235]
24 The integral representation has been used, sometimes implicitly, for kernel-based estimation [30, 11] and ICA [20]. [sent-36, score-0.319]
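To make the integral representation (3) concrete, the sketch below numerically checks one classical scale mixture: mixing zero-mean Gaussians over an exponential density on the variance reproduces the standard Laplace density. The particular mixing measure is an assumption chosen for illustration (the standard exponential-variance construction), not a formula quoted from this paper.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm, expon, laplace

# Check of form (3) for the standard Laplace density: integrate N(x; 0, v)
# against the exponential variance density (1/2) exp(-v/2) and compare with
# p(x) = (1/2) exp(-|x|).
def laplace_as_gsm(x):
    integrand = lambda v: norm.pdf(x, loc=0.0, scale=np.sqrt(v)) * expon.pdf(v, scale=2.0)
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

for x in (0.25, 0.5, 1.0, 2.0):
    print(f"x={x:4.2f}  mixture={laplace_as_gsm(x):.6f}  laplace.pdf={laplace.pdf(x):.6f}")
```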
25 The distinction between MAP estimation of components and estimation of hyperparameters has been discussed in [23] and [30] for the case of Gamma distributed inverse variance. [sent-37, score-0.254]
26 ξ̂ = arg max_ξ p(ξ|y). (5) The following section discusses the criteria for and relationship between the two types of variational representation. [sent-39, score-0.64]
27 In §4, we discuss algorithms for each problem based on the two types of variational representations, and determine when these are equivalent. [sent-40, score-0.594]
28 We also discuss the approximation of the likelihood p(y; A) using the ensemble learning or VB method, which approximates the posterior p(x, ξ|y) by a factorial density q(x|y)q(ξ|y). [sent-41, score-0.375]
29 We show that the ensemble method is equivalent to the hyperparameter MAP method. [sent-42, score-0.295]
30 3 Variational representations of super-Gaussian densities In this section we discuss the criteria for the convex and integral type representations. [sent-43, score-0.803]
31 3.1 Convex variational bounds We wish to determine when a symmetric, unimodal density p(x) can be represented in the form (2) for some function ϕ(ξ). [sent-45, score-0.652]
32 Equivalently, when −log p(x) = −sup_{ξ>0} log[ N(x; 0, ξ^{-1}) ϕ(ξ) ] = inf_{ξ>0} [ ½ x² ξ − log( ξ^{1/2} ϕ(ξ) ) ] for all x > 0. [sent-46, score-0.368]
33 The last formula says that −log p(√x) is the concave conjugate of (the closure of the convex hull of) the function log( ξ^{1/2} ϕ(ξ) ) [27, §12]. [sent-47, score-0.644]
34 This is possible if and only if −log p(√x) is closed, increasing, and concave on (0, ∞). [sent-48, score-0.248]
35 A symmetric probability density p(x) ≡ exp(−g(x²)) can be represented in the convex variational form p(x) = sup_{ξ>0} N(x; 0, ξ^{-1}) ϕ(ξ) if and only if g(x) ≡ −log p(√x) is increasing and concave on (0, ∞). [sent-51, score-1.276]
36 In this case we can use the function ϕ(ξ) = √(2π/ξ) exp( g*(ξ/2) ), where g* is the concave conjugate of g. [sent-52, score-0.214]
37 Examples of densities satisfying this criterion include: (i) Generalized Gaussian ∝ exp(−|x|^β), 0 < β ≤ 2, (ii) Logistic ∝ 1/cosh²(x/2), (iii) Student’s t ∝ (1 + x²/ν)^{−(ν+1)/2}, ν > 0, and (iv) symmetric α-stable densities (having characteristic function exp(−|ω|^α), 0 < α ≤ 2). [sent-53, score-0.18]
38 A symmetric probability density p(x) is strongly super-Gaussian if p(√x) is log-convex on (0, ∞), and strongly sub-Gaussian if p(√x) is log-concave on (0, ∞). [sent-56, score-0.206]
39 This condition is equivalent to f(x) = g(x²) with g concave. [sent-59, score-0.045]
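As a worked illustration of the convex representation and the conjugate construction of ϕ, the sketch below evaluates the supremum in (2) on a grid for the Laplace density, for which g(x) = −log p(√x) = √x + log 2 and the concave conjugate g*(y) = −1/(4y) − log 2 can be computed by hand; these closed forms are our own derivation, assumed here only for the demonstration.

```python
import numpy as np
from scipy.stats import norm, laplace

# Convex variational form (2) for the Laplace density p(x) = (1/2) exp(-|x|),
# using phi(xi) = sqrt(2*pi/xi) * exp(g*(xi/2)) with g*(y) = -1/(4y) - log 2
# (a hand-worked example, not a formula copied from the paper).
def phi(xi):
    g_star = -1.0 / (2.0 * xi) - np.log(2.0)   # g*(xi/2)
    return np.sqrt(2.0 * np.pi / xi) * np.exp(g_star)

xi_grid = np.logspace(-3.0, 3.0, 20001)        # grid over the scale parameter
for xv in (0.25, 0.5, 1.0, 2.0, 4.0):
    bound = np.max(norm.pdf(xv, loc=0.0, scale=1.0 / np.sqrt(xi_grid)) * phi(xi_grid))
    print(f"x={xv:4.2f}  sup over xi = {bound:.6f}  p(x) = {laplace.pdf(xv):.6f}")
```

The printed supremum should match p(x) = (1/2) exp(−|x|) to within the resolution of the ξ grid.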
40 3.2 Scale mixtures We now wish to determine when a probability density p(x) can be represented in the form (3) for some µ(ξ) non-decreasing on (0, ∞). [sent-64, score-0.226]
41 A fundamental result dealing with integral representations was given by Bernstein and Widder (see [31]). [sent-65, score-0.297]
42 A function f(x) is completely monotonic on (a, b) if (−1)^n f^{(n)}(x) ≥ 0 for all x in (a, b) and n = 0, 1, 2, … [sent-68, score-0.262]
43 That is, f (x) is completely monotonic if it is positive, decreasing, convex, and so on. [sent-72, score-0.262]
44 A necessary and sufficient condition that p(x) should be completely monotonic on (0, ∞) is that p(x) = ∫_0^∞ e^{−tx} dα(t), where α(t) is non-decreasing on (0, ∞). [sent-75, score-0.262]
45 Thus for p(x) to be a Gaussian scale mixture, p(x) = e^{−f(x)} = e^{−g(x²)} = ∫_0^∞ e^{−½ t x²} dα(t), a necessary and sufficient condition is that p(√x) = e^{−g(x)} be completely monotonic for 0 < x < ∞, and we have the following (see also [19, 1]): Theorem 3. [sent-76, score-0.404]
46 A function p(x) can be represented as a Gaussian scale mixture if and only if p(√x) is completely monotonic on (0, ∞). [sent-77, score-0.479]
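A quick numerical probe of the complete-monotonicity criterion in Theorem 3 is to spot-check the sign pattern (−1)^n q^{(n)}(x) ≥ 0 for the first few derivatives of q(x) = p(√x). The densities, parameter value, and evaluation point below are illustrative choices, and passing the check is only a necessary-condition sanity test, not a proof.

```python
import sympy as sp

x = sp.symbols('x', positive=True)

# q(x) = p(sqrt(x)) up to normalization for two densities named in the text;
# nu = 3 and the test point x = 1/2 are arbitrary choices.
candidates = {
    "Student-t (nu = 3)": (1 + x / 3) ** (-2),
    "Laplace": sp.exp(-sp.sqrt(x)),
}

for name, q in candidates.items():
    d = q
    ok = True
    for n in range(5):                                   # derivative orders 0..4
        val = float((((-1) ** n) * d.subs(x, sp.Rational(1, 2))).evalf())
        ok = ok and (val >= -1e-12)
        d = sp.diff(d, x)
    print(f"{name}: (-1)^n q^(n)(1/2) >= 0 for n = 0..4 -> {ok}")
```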
47 3.3 Relationship between convex and integral type representations We now consider the relationship between the convex and integral types of variational representation. [sent-79, score-0.318]
48 We have seen that p(x) can be represented in the form (2) if and only if p(x) is symmetric and g(x) is increasing and concave on (0, ∞). [sent-81, score-0.232]
49 And we have seen that p(x) can be represented in the form (3) if and only if p(√x) = exp(−g(x)) is completely monotonic. [sent-82, score-0.161]
50 We shall now consider whether complete monotonicity of p(√x) implies the concavity of g(x) = −log p(√x), that is, whether representability in the integral form implies representability in the convex form. [sent-83, score-1.018]
51 Complete monotonicity of a function q(x) implies that q ≥ 0, q′ ≤ 0, q″ ≥ 0, etc. [sent-84, score-0.065]
52 For example, if p(√x) is completely monotonic, then (d²/dx²) p(√x) = (d²/dx²) e^{−g(x)} = e^{−g(x)} [ g′(x)² − g″(x) ] ≥ 0. [sent-85, score-0.113]
53 Thus if g″ ≤ 0, then p(√x) is convex, but the converse does not necessarily hold. [sent-86, score-0.123]
54 That is, concavity of g does not follow from convexity of p(√x), as the latter only requires that g″ ≤ g′². [sent-87, score-0.104]
55 Concavity of g does follow, however, from the complete monotonicity of p(√x). [sent-88, score-0.108]
56 Thus completely monotonic functions, being scale mixtures of the log-convex function e^{−x} by Theorem 2, are also log-convex. [sent-94, score-0.97]
57 We thus see that any function representable in the integral variational form (3) is also representable in the convex variational form (2). [sent-95, score-1.637]
58 5] establishes the equivalence between q(x) and g′(x) = −(d/dx) log q(x) in terms of complete monotonicity. [sent-100, score-0.142]
59 If g(x) > 0, then e^{−u g(x)} is completely monotonic for every u > 0, if and only if g′(x) is completely monotonic. [sent-102, score-0.045]
60 In particular, it holds that q(x) ≡ p(√x) = exp(−g(x)) is log-convex only if g″(x) ≤ 0. [sent-103, score-0.288]
61 If g is increasing and concave for x > 0, then p(x) admits the convex type of variational representation (2). [sent-105, score-1.119]
62 If, in addition, g′(x) is completely monotonic, then p(x) also admits the Gaussian scale mixture representation (3). [sent-107, score-0.281]
63 4.1 MAP estimation of components Consider first the MAP estimate of the latent variables (4). [sent-109, score-0.273]
64 4.1.1 Component MAP – Integral case Following [10]¹, consider an EM algorithm to estimate x when the p(x_i) are independent Gaussian scale mixtures as in (3). [sent-112, score-0.245]
65 Differentiating inside the integral gives p′(x) = (d/dx) ∫_0^∞ p(x|ξ) p(ξ) dξ = −∫_0^∞ ξ x p(x, ξ) dξ = −x p(x) ∫_0^∞ ξ p(ξ|x) dξ. [sent-113, score-0.324]
66 ¹ In [10], the x_i in (1) are actually estimated as non-random parameters, with the noise ν being non-Gaussian, but the underlying theory is essentially the same. [sent-114, score-0.08]
67 E(ξ_i | x_i) = −p′(x_i) / ( x_i p(x_i) ) ≡ f(x_i)/x_i. (6) The EM algorithm alternates setting ξ̂_i to the posterior mean E(ξ_i|x_i) = f(x_i)/x_i, and setting x to minimize −log p(y|x) p(x|ξ̂) = ½ x^T A^T Σ_ν^{-1} A x − y^T Σ_ν^{-1} A x + ½ x^T Λ̂ x + const, (7) where Λ̂ = diag(ξ̂). [sent-116, score-0.312]
68 We thus see that this algorithm is equivalent to the MAP algorithm derived in §4. [sent-125, score-0.08]
69 That is, for direct MAP estimation of latent variable x, the EM Gaussian scale mixture method and the variational bounding method yield the same algorithm. [sent-128, score-0.947]
70 This algorithm has also been derived in the image restoration literature [12] as the “half-quadratic” algorithm, and it is the basis for the FOCUSS algorithms derived in [26, 25]. [sent-129, score-0.149]
71 The regression algorithm given in [11] for the particular cases of Laplacian and Jeffreys priors is based on the theory in §4.1, [sent-130, score-0.105]
72 and is in fact equivalent to the FOCUSS algorithm derived in [26]. [sent-132, score-0.08]
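A minimal sketch of the component-MAP EM iteration of §4.1, specialized to a Laplace prior p(x_i) ∝ exp(−|x_i|) and isotropic noise Σ_ν = σ²I (both assumptions made for illustration): the E-step sets ξ̂_i = f(x_i)/x_i = 1/|x_i| and the M-step solves the weighted ridge system implied by (7), which is exactly the reweighted form of the FOCUSS/half-quadratic iteration mentioned above.

```python
import numpy as np

def map_x_laplace(A, y, sigma2=0.01, n_iter=100, eps=1e-8):
    # E-step: hat{xi}_i = E(xi_i | x_i) = f(x_i)/x_i with f(x) = -d/dx log p(x);
    #         for the Laplace prior f(x) = sign(x), so hat{xi}_i = 1/|x_i|.
    # M-step: minimize (7) with Sigma_nu = sigma2 * I, i.e. solve
    #         (A^T A / sigma2 + diag(hat{xi})) x = A^T y / sigma2.
    x = A.T @ y                        # crude initialization
    for _ in range(n_iter):
        xi = 1.0 / (np.abs(x) + eps)   # eps guards against division by zero
        H = A.T @ A / sigma2 + np.diag(xi)
        x = np.linalg.solve(H, A.T @ y / sigma2)
    return x
```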
73 4.2 MAP estimate of variational parameters Now consider MAP estimation of the (random) variational hyperparameters ξ. [sent-134, score-1.265]
74 4.2.1 Hyperparameter MAP – Integral case Consider an EM algorithm to find the MAP estimate of the hyperparameters ξ in the integral representation (Gaussian scale mixture) case, where the latent variables x are hidden. [sent-137, score-0.697]
75 The function to be minimized over ξ is then −⟨ log p(x|ξ) p(ξ) ⟩_x = Σ_i [ ½ ⟨x_i²⟩ ξ_i − log( √ξ_i p(ξ_i) ) ] + const. (9) [sent-139, score-0.216]
76 If we define h(ξ) ≡ log( √ξ p(ξ) ), and assume that this function is concave, then the optimal value of ξ is given by ξ_i = h*′( ½ ⟨x_i²⟩ ). [sent-140, score-0.108]
77 Alternative algorithms result from using this method to find the MAP estimate of different functions of the scale random variable ξ. [sent-142, score-0.176]
78 4.2.2 Hyperparameter MAP – Convex case In the convex representation, the ξ parameters do not actually represent a probabilistic quantity, but rather arise as parameters in a variational inequality. [sent-145, score-0.816]
79 Specifically, we write p(y) = ∫ p(y, x) dx = ∫ max_ξ p(y|x) p(x|ξ) ϕ(ξ) dx ≥ max_ξ ∫ p(y|x) p(x|ξ) ϕ(ξ) dx = max_ξ N(y; 0, A Λ A^T + Σ_ν) ϕ(ξ). [sent-146, score-0.432]
80 Now we define the function p̃(y; ξ) ≡ N(y; 0, A Λ A^T + Σ_ν) ϕ(ξ) and try to find ξ̂ = arg max_ξ p̃(y; ξ). [sent-147, score-0.246]
81 We maximize p̃ by EM, marginalizing over x: p̃(y; ξ) = ∫ p(y|x) p(x|ξ) ϕ(ξ) dx. [sent-148, score-0.123]
82 Although p̃ is not a true probability density function, the proof of convergence for EM does not assume unit normalization. [sent-152, score-0.076]
83 For any approximating posterior q(z|y) over the hidden variables z = (x, ξ), log p(y) = ∫ q(z|y) log[ p(y, z) / q(z|y) ] dz + D( q(z|y) ‖ p(z|y) ) ≡ −F(q) + D( q(z|y) ‖ p(z|y) ). The term F(q) is commonly called the variational free energy [29, 24]. [sent-156, score-0.744]
84 Minimizing the F over q is equivalent to minimizing D over q. [sent-157, score-0.045]
85 The posterior approximating distribution is taken to be factorial, q(z|y) = q(x, ξ|y) = q(x|y)q(ξ|y) . [sent-158, score-0.044]
86 For fixed q(ξ|y), the free energy F is given by −∫∫ q(x|y) q(ξ|y) log p(x, ξ|y) dξ dx + ∫ q(x|y) log q(x|y) dx + const = D( q(x|y) ‖ exp⟨ log p(x, ξ|y) ⟩_ξ ) + const, where the constants do not depend on q(x|y). [sent-159, score-0.339]
87 The minimum of the KL divergence in (10) is attained if and only if q(x|y) ∝ exp⟨ log p(x, ξ|y) ⟩_ξ ∝ p(y|x) exp⟨ log p(x|ξ) ⟩_ξ almost surely. [sent-161, score-0.364]
88 An identical derivation yields the optimal q(ξ|y) ∝ exp⟨ log p(x, ξ|y) ⟩_x ∝ p(ξ) exp⟨ log p(x|ξ) ⟩_x when q(x|y) is fixed. [sent-162, score-0.364]
89 The ensemble (or VB) algorithm consists of alternately updating the parameters of these approximating marginal distributions. [sent-163, score-0.257]
90 In the linear model with Gaussian scale mixture latent variables, the complete likelihood is again, p(y, x, ξ) = p(y|x)p(x|ξ)p(ξ) . [sent-164, score-0.357]
91 The optimal approximate posteriors are given by q(x|y) = N(x; µ_{x|y}, Σ_{x|y}) and q(ξ_i|y) = p( ξ_i | x_i = ⟨x_i²⟩^{1/2} ), where, letting Λ = diag(⟨ξ⟩)^{-1}, the posterior moments are given by µ_{x|y} ≡ Λ A^T (A Λ A^T + Σ_ν)^{-1} y and Σ_{x|y} ≡ (A^T Σ_ν^{-1} A + Λ^{-1})^{-1} = Λ − Λ A^T (A Λ A^T + Σ_ν)^{-1} A Λ. [sent-165, score-0.124]
92 The only relevant fact about q(ξ|y) that we need is ⟨ξ⟩, for which we have, using (6), ⟨ξ_i⟩ = ∫ ξ_i q(ξ_i|y) dξ_i = ∫ ξ_i p( ξ_i | x_i = ⟨x_i²⟩^{1/2} ) dξ_i = f(σ_i)/σ_i, where σ_i² = E(x_i² | y; ⟨ξ⟩). [sent-166, score-0.08]
93 We thus see that the ensemble learning algorithm is equivalent to the approximate hyperparameter MAP algorithm of §4.2. [sent-167, score-0.295]
94 Note also that this shows that the VB methods can be applied to any Gaussian scale mixture density, using only the form of the latent variable prior p(x), without needing the marginal hyperprior p(ξ) in closed form. [sent-170, score-0.447]
95 This is particularly important in the case of the Generalized Gaussian and Logistic densities, whose scale parameter densities are α-Stable and Kolmogorov [1] respectively. [sent-171, score-0.17]
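For concreteness, here is a sketch of the ensemble/VB (equivalently, approximate hyperparameter MAP) iteration in the same illustrative Laplace-prior, isotropic-noise setting assumed earlier: it alternates the posterior-moment formulas above with the update ⟨ξ_i⟩ = f(σ_i)/σ_i = 1/σ_i, and only the form of p(x) is needed, never the hyperprior p(ξ) in closed form.

```python
import numpy as np

def vb_linear_laplace(A, y, sigma2=0.01, n_iter=50):
    m, n = A.shape
    xi = np.ones(n)                                     # <xi_i>, arbitrary initialization
    for _ in range(n_iter):
        Lam = np.diag(1.0 / xi)                         # Lambda = diag(<xi>)^{-1}
        K = A @ Lam @ A.T + sigma2 * np.eye(m)          # A Lambda A^T + Sigma_nu
        mu = Lam @ A.T @ np.linalg.solve(K, y)          # mu_{x|y}
        Sigma = Lam - Lam @ A.T @ np.linalg.solve(K, A @ Lam)   # Sigma_{x|y}
        s = np.sqrt(mu**2 + np.diag(Sigma))             # sigma_i = E(x_i^2 | y)^{1/2}
        xi = 1.0 / s                                    # <xi_i> = f(sigma_i)/sigma_i, f = sign
    return mu, Sigma, xi
```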
96 5 Conclusion In this paper, we have discussed criteria for variational representations of non-Gaussian latent variables, and derived general variational EM algorithms based on these representations. [sent-172, score-1.462]
97 We have shown a general equivalence between the two representations in MAP estimation taking hyperparameters as hidden, and we have shown the general equivalence between the variational convex approximate MAP estimate of hyperparameters and the ensemble learning or VB method. [sent-173, score-1.61]
98 The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. [sent-199, score-0.614]
99 A variational method for learning sparse and overcomplete representations. [sent-272, score-0.617]
100 A variational approach to Bayesian logistic regression models and their extensions. [sent-290, score-0.625]
wordName wordTfidf (topN-words)
[('variational', 0.528), ('convex', 0.288), ('integral', 0.201), ('ensemble', 0.178), ('em', 0.172), ('map', 0.164), ('monotonic', 0.149), ('latent', 0.145), ('concave', 0.14), ('vb', 0.13), ('dx', 0.123), ('hyperparameters', 0.113), ('completely', 0.113), ('log', 0.108), ('concavity', 0.104), ('mixtures', 0.102), ('scale', 0.102), ('equivalence', 0.099), ('representations', 0.096), ('xk', 0.092), ('focuss', 0.086), ('wipf', 0.086), ('rvm', 0.086), ('palmer', 0.086), ('xi', 0.08), ('shall', 0.077), ('density', 0.076), ('exp', 0.074), ('hyperparameter', 0.072), ('dempster', 0.069), ('laird', 0.069), ('densities', 0.068), ('bayesian', 0.067), ('mixture', 0.067), ('criteria', 0.066), ('hyperprior', 0.066), ('jeffrey', 0.066), ('representability', 0.066), ('xtat', 0.066), ('priors', 0.065), ('monotonicity', 0.065), ('representation', 0.063), ('max', 0.062), ('arg', 0.061), ('gaussian', 0.058), ('logistic', 0.057), ('jaakkola', 0.056), ('laplacian', 0.055), ('estimation', 0.055), ('graphical', 0.054), ('theorem', 0.053), ('editor', 0.053), ('rao', 0.053), ('type', 0.051), ('evidence', 0.05), ('bounding', 0.05), ('admits', 0.049), ('represented', 0.048), ('relationship', 0.046), ('overcomplete', 0.046), ('restoration', 0.046), ('alternately', 0.046), ('bernstein', 0.046), ('xp', 0.046), ('representable', 0.046), ('equivalent', 0.045), ('sup', 0.044), ('posterior', 0.044), ('factorial', 0.044), ('student', 0.044), ('symmetric', 0.044), ('sparse', 0.043), ('relevance', 0.043), ('strongly', 0.043), ('complete', 0.043), ('diag', 0.042), ('estimate', 0.041), ('regression', 0.04), ('kluwer', 0.04), ('tx', 0.04), ('laplace', 0.039), ('mackay', 0.039), ('gamma', 0.037), ('ica', 0.036), ('derived', 0.035), ('methods', 0.034), ('ghahramani', 0.034), ('press', 0.034), ('unifying', 0.033), ('algorithms', 0.033), ('discuss', 0.033), ('intelligence', 0.033), ('marginal', 0.033), ('handling', 0.032), ('incomplete', 0.032), ('bayes', 0.032), ('jordan', 0.032), ('variables', 0.032), ('princeton', 0.031), ('discussed', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
2 0.38121998 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
3 0.20152898 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1
4 0.19139709 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
5 0.15922461 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang
Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efficient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classification data sets show that the proposed approach is promising.
6 0.14605609 58 nips-2005-Divergences, surrogate loss functions and experimental design
7 0.12456849 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
8 0.1197136 50 nips-2005-Convex Neural Networks
9 0.11874361 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
10 0.11100769 30 nips-2005-Assessing Approximations for Gaussian Process Classification
11 0.10845663 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
12 0.10662714 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
13 0.10507421 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
14 0.1025109 47 nips-2005-Consistency of one-class SVM and related algorithms
15 0.10044789 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
16 0.098580308 105 nips-2005-Large-Scale Multiclass Transduction
17 0.097500212 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
18 0.096473396 80 nips-2005-Gaussian Process Dynamical Models
19 0.095188387 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
20 0.091390952 95 nips-2005-Improved risk tail bounds for on-line algorithms
topicId topicWeight
[(0, 0.282), (1, 0.126), (2, -0.061), (3, 0.028), (4, 0.238), (5, -0.313), (6, -0.152), (7, -0.118), (8, -0.144), (9, -0.233), (10, 0.039), (11, -0.064), (12, -0.073), (13, 0.076), (14, -0.054), (15, -0.235), (16, -0.003), (17, -0.096), (18, -0.032), (19, 0.01), (20, 0.125), (21, 0.031), (22, -0.039), (23, -0.086), (24, 0.016), (25, -0.118), (26, -0.085), (27, 0.06), (28, 0.033), (29, -0.025), (30, 0.102), (31, 0.064), (32, -0.057), (33, 0.074), (34, -0.043), (35, -0.026), (36, 0.06), (37, -0.079), (38, -0.007), (39, -0.046), (40, -0.077), (41, 0.003), (42, 0.02), (43, 0.053), (44, -0.049), (45, -0.0), (46, -0.003), (47, -0.108), (48, 0.094), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.97526103 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
2 0.89636987 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
3 0.65900016 52 nips-2005-Correlated Topic Models
Author: John D. Lafferty, David M. Blei
Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1
4 0.64660311 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting
Author: Martin J. Wainwright
Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1
5 0.58275044 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
Author: Baback Moghaddam, Yair Weiss, Shai Avidan
Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials. 1
6 0.47089553 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
7 0.43534914 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
8 0.43407944 30 nips-2005-Assessing Approximations for Gaussian Process Classification
9 0.43297634 58 nips-2005-Divergences, surrogate loss functions and experimental design
10 0.43193036 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
11 0.42127731 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
12 0.37346593 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
13 0.37303931 105 nips-2005-Large-Scale Multiclass Transduction
14 0.36021447 4 nips-2005-A Bayesian Spatial Scan Statistic
15 0.35020736 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
16 0.34392491 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
17 0.34238914 35 nips-2005-Bayesian model learning in human visual perception
18 0.33887318 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
19 0.33557773 33 nips-2005-Bayesian Sets
20 0.33545494 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
topicId topicWeight
[(3, 0.052), (10, 0.011), (27, 0.025), (31, 0.043), (34, 0.091), (55, 0.015), (69, 0.025), (73, 0.023), (88, 0.081), (91, 0.532)]
simIndex simValue paperId paperTitle
1 0.9817518 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
Author: Ricky Der, Daniel D. Lee
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.
same-paper 2 0.97490191 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
3 0.94002354 118 nips-2005-Learning in Silicon: Timing is Everything
Author: John V. Arthur, Kwabena Boahen
Abstract: We describe a neuromorphic chip that uses binary synapses with spike timing-dependent plasticity (STDP) to learn stimulated patterns of activity and to compensate for variability in excitability. Specifically, STDP preferentially potentiates (turns on) synapses that project from excitable neurons, which spike early, to lethargic neurons, which spike late. The additional excitatory synaptic current makes lethargic neurons spike earlier, thereby causing neurons that belong to the same pattern to spike in synchrony. Once learned, an entire pattern can be recalled by stimulating a subset.
4 0.90870792 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequences of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
5 0.68979847 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
6 0.64430612 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
7 0.61135423 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
8 0.60824907 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
9 0.60666466 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
10 0.6023699 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
11 0.58773017 54 nips-2005-Data-Driven Online to Batch Conversions
12 0.57660937 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
13 0.57216054 30 nips-2005-Assessing Approximations for Gaussian Process Classification
14 0.56354165 181 nips-2005-Spiking Inputs to a Winner-take-all Network
15 0.55768716 52 nips-2005-Correlated Topic Models
16 0.55473113 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
17 0.54095769 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
18 0.53329515 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
19 0.5277704 46 nips-2005-Consensus Propagation
20 0.52728909 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation