nips nips2011 nips2011-228 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yichuan Zhang, Charles A. Sutton
Abstract: The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. In this paper we propose MCMC samplers that make use of quasiNewton approximations, which approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. On several real world datasets, we show that the quasi-Newton sampler is more effective than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. [sent-8, score-0.097]
2 An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. [sent-9, score-0.066]
3 In this paper we propose MCMC samplers that make use of quasiNewton approximations, which approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. [sent-10, score-0.234]
4 A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. [sent-11, score-0.153]
5 We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. [sent-12, score-0.075]
6 On several real world datasets, we show that the quasi-Newton sampler is more effective than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives. [sent-13, score-0.087]
7 1 Introduction The design of effective approximate inference methods for continuous variables often requires considering the curvature of the target distribution. [sent-14, score-0.065]
8 This is especially true of Markov chain Monte Carlo (MCMC) methods. [sent-15, score-0.097]
9 For example, it is well known that the Gibbs sampler mixes extremely poorly on distributions that are strongly correlated. [sent-16, score-0.06]
10 In a similar way, the performance of a random walk Metropolis-Hastings algorithm is sensitive to the variance of the proposal distribution. [sent-17, score-0.087]
11 Many samplers can be improved by incorporating second-order information about the target distribution. [sent-18, score-0.153]
12 For example, several authors have used a Metropolis-Hastings algorithm in which the Hessian is used to form a covariance for a Gaussian proposal [3, 11]. [sent-19, score-0.112]
13 Unfortunately, second derivatives can be inconvenient or infeasible to obtain and the quadratic cost of manipulating a d × d Hessian matrix can also be prohibitive. [sent-21, score-0.069]
14 However, samplers that depend on the history of previous samples must be carefully designed in order to guarantee the chain converges to the target distribution. [sent-23, score-0.349]
15 The key point is that we use a limited memory approximation, in which only a small window of previous samples are used to the approximate the Hessian. [sent-26, score-0.136]
16 This makes it straightforward to show that our samplers are valid, because the samples are distributed as a order-k Markov chain. [sent-27, score-0.176]
17 Second, by taking advantage 1 of the special structure in the Hessian approximation, the samplers require only linear time and linear space in the dimensionality of the problem. [sent-28, score-0.115]
18 On several logistic regression data sets, we show that the quasi-Newton samplers produce samples of higher quality than standard HMC, but with significantly less computation time than methods that require higher-order derivatives. [sent-31, score-0.206]
19 Let x be a random variable on state space X = Rd with a target probability distribution π(x) ∝ exp(L(x)) and p be a Gaussian random variable on P = Rd with density p(p) = N (p|0, M) where M is the covariance matrix. [sent-34, score-0.089]
20 In general, Hamiltonian Monte Carlo (HMC) defines a stationary Markov chain on the augmented state space X × P with invariant distribution p(x, p) = π(x)p(p). [sent-35, score-0.151]
21 The sampler is defined using a Hamiltonian function, which up to a constant is the negative log density of (x, p), given as follows: 1 H(x, p) = − L(x) + pT M−1 p. [sent-36, score-0.06]
22 (1) 2 In an analogy to physical systems, the first term on the RHS is called the potential energy, the second term is called the kinetic energy, the state x is called the position variable, and p the momentum variable. [sent-37, score-0.08]
23 Finally, we will call the covariance M the mass matrix. [sent-38, score-0.068]
24 The most common mass matrix is the identity matrix I. [sent-39, score-0.099]
25 (2) One common approximation to this dynamical system is given by the leapfrog algorithm. [sent-43, score-0.154]
26 One single iteration of leapfrog algorithm is given by the recursive formula p(τ + ) = p(τ ) + 2 2 x L(x(τ )), x(τ + ) = x(τ ) + M−1 p(τ + ), 2 (3) (4) (5) p(τ + ) = p(τ + ) + x L(x(τ + )), 2 2 where is the step size and τ is a discrete time variable. [sent-44, score-0.153]
27 The leapfrog algorithm is initialised by the current sample, that is (x(0), p(0)) = (x, p). [sent-45, score-0.125]
28 After L leapfrog steps (3)-(5), the final state (x(L ), p(L )) is used as the proposal (x∗ , p∗ ) in Metropolis-Hastings correction with acceptance probability min[1, exp(H(x, p) − H(x∗ , p∗ ))]. [sent-46, score-0.282]
29 The step size and the number of leapfrog steps L are two parameters of HMC. [sent-47, score-0.125]
30 In the context of HMC, this transformation is equivalent to changing mass matrix M. [sent-54, score-0.071]
31 This is because the Hamiltonian dynamics of the system (Ax, p) with mass matrix M are isomorphic to the dynamics on (x, AT p), which is equivalent to defining the state as (x, p) and using the mass matrix M = AT MA. [sent-55, score-0.214]
32 For more general functions without a constant Hessian, this argument suggests the idea of employing a mass matrix M(x) that is a function of the position. [sent-60, score-0.071]
33 Consider minimising the function f : Rd → R, quasi-Newton methods search for the minimum of f (x) by generating a sequence of iterates xk+1 = xk −αk Hk f (xk ) where Hk is an approximation to the inverse Hessian at xk , which is computed from the previous function values and gradients. [sent-65, score-0.282]
34 xk , the L-BFGS approximation Hk+1 is yk sT sk yT Hk+1 = (I − T k )Hk (I − T k ) + sk sT (7) k sk yk sk yk where sk = xk+1 − xk and yk = fk+1 − fk . [sent-70, score-1.339]
35 It can be seen that the BFGS formula (7) is a rank-two update to the previous Hessian approximation Hk . [sent-74, score-0.077]
36 In contrast to optimization methods, most sampling methods need a factorized form of Hk to draw samples from N (0, Hk ). [sent-77, score-0.079]
37 The matrix vector product Sk+1 z can be k computed by a sequence of inner products Sk+1 z = i=k−m−1 (I − pi qT )Sk−m z, in time O(md). [sent-86, score-0.068]
38 Instead, we first sort the previous samples {xi } in ascending order with respect to L(x), and then check if there are any adjacent pairs (xi , xi+1 ) such that the resulting si and yi have sT yi ≤ 0. [sent-90, score-0.121]
39 , previous iterates of an optimization algorithm, or previous samples of an MCMC chain, in principle the BFGS equations could be used to generate a Hessian approximation from any set of points X = {x1 . [sent-95, score-0.155]
40 This function first sorts x ∈ X by L(xi ), then computes si = xi+1 − xi and yi = L(xi+1 ) − L(xi ), then filters xi as described above so that sT yi > 0 ∀i, and finally computes the Hessian approximation Hk using the recursion (8)–(11). [sent-100, score-0.209]
41 The previous samples provide information about the target distribution, so it is reasonable to use them to adapt the kernel. [sent-104, score-0.136]
42 However, naively tuning the sampling parameters using all previous samples may lead to an invalid chain, that is, a chain that does not have π as its invariant distribution. [sent-105, score-0.25]
43 Our samplers will use a simple solution to this problem. [sent-106, score-0.115]
44 Rather than adapting the kernel using all of the previous samples in the Markov chain, we will adapt using a limited window of K previous samples. [sent-107, score-0.188]
45 The chain as a whole will then be an order K Markov chain. [sent-108, score-0.097]
46 It is easiest to analyze this chain by converting it into a first-order Markov chain over an enlarged space. [sent-109, score-0.217]
47 Specifically, we build a Markov chain in a K-fold product space X K with the stationary distribution p(x1:K ) = i=1:K π(xi ). [sent-110, score-0.097]
48 We denote a state of this chain by xt−K+1 , xt−K+2 , . [sent-111, score-0.123]
49 We use the short-hand (t) (t) (t) notation x1:K\i for the subset of x1:K excluding the xi . [sent-115, score-0.058]
50 (t) Our samplers will then update one component of x1:K per iteration, in a Gibbs-like fashion. [sent-116, score-0.115]
51 We (t) define a transition kernel Ti that only updates the ith component of x1:K , that is: (t) (t) (t) Ti (x1:K , x1:K ) = δ(x1:K\i , x1:K\i )B(xi , xi |x1:K\i ), (12) where B(xi , xi |x1:K\i ) is called the base kernel that is a MCMC kernel in X and adapts with (t) x1:K\i . [sent-117, score-0.294]
52 If B leaves π(xi ) invariant for all fixed values of x1:K\i , it is straightforward to show that (t) Ti leaves p invariant. [sent-118, score-0.128]
53 Then, the sampler as a whole updates each of the components xi in sequence, so that the method as a whole is described by the kernel T (x1:K , x1:K ) = T1 ◦ T2 . [sent-119, score-0.184]
54 Because the each kernel Ti leaves p(x1:K ) invariant, the composition kernel T also leaves p(x1:K ) invariant. [sent-123, score-0.192]
55 Such an adaptive scheme is equivalent to using an ensemble of K chains and changing the kernel of each chain with the state of the others. [sent-124, score-0.305]
56 To simplify the analysis of the validity of the chain, we assume the base kernel B is irreducible in one iteration. [sent-127, score-0.125]
57 The intuition is to fit the Gaussian proposal distribution to the target distribution, so that points in a high probability region are more likely to be proposed. [sent-131, score-0.125]
58 Specifically, the proposal distribution of M H B FGS is defined as (t) (t) (t) q(x |x1:K ) = N (x ; µ, Σ), where the proposal mean µ = µ(x1:K ) and covariance Σ = Σ(x1:K ) depend on the state of all K chains. [sent-133, score-0.225]
59 One simple choice is to use one of the samples (t) (t) in the window as the mean, e. [sent-135, score-0.085]
60 The proposal x of T1 is accepted with probability (t) (t) α(x1 , x ) = min 1, (t) q(x1 |x2:K , x ) π(x ) (t) π(x1 ) q(x |x1 , x2:K ) (t) (t) (t) . [sent-140, score-0.087]
61 Because the Gaussian proposal (t) (t) q(A|x1 , x2:K ) has positive probability for all A ∈ X , the M-H kernel is irreducible within one iteration. [sent-142, score-0.155]
62 Because the M-H algorithm with acceptance ratio defined as (14) leaves π(x) invariant, M H B FGS is a valid method that leaves p(x1:K ) invariant. [sent-143, score-0.147]
63 Although M H B FGS is simple and intuitive, in preliminary experiments we have found that M H B FGS sampler may converge slowly in high 4 Algorithm 1 H MC B FGS (t) (t) (t) Input: Current memory (x1 , x2 , . [sent-144, score-0.091]
64 In general Metropolis Hastings with a Gaussian proposal can suffer from random walk behavior, even if the true Hessian is used. [sent-154, score-0.087]
65 The high-level idea is to start with the M H B FGS algorithm, but to replace the Gaussian proposal with a simulation of Hamiltonian dynamics. [sent-158, score-0.087]
66 However, we will need to be a bit careful in order to ensure that the Hamiltonian is separable, because otherwise we would need to employ a generalized leapfrog integrator [5] which is significantly more expensive. [sent-159, score-0.125]
67 The new samples in H MC B FGS are generated as follows. [sent-160, score-0.061]
68 First we sample a new value of the momentum (t) variable p ∼ N (0, BBFGS (x1:K\i )). [sent-163, score-0.074]
69 It is important that when constructing the BFGS approximation, (t) we not use the value xi that we are currently resampling. [sent-164, score-0.058]
70 Then we simulate the Hamiltonian (t) dynamics starting at the point (xi , p) using the leapfrog method (3)–(5). [sent-165, score-0.148]
71 Finally, the proposal is accepted with probability min[1, exp(H(xi , p) − H(x∗ , p∗ )], for H in (15) and p∗ is discarded after M-H correction. [sent-167, score-0.087]
72 This procedure is an instance of the general ECA scheme described above, with base kernel ˆ B(xi , pi , xi , pi |x1:K\i )dpi dpi . [sent-169, score-0.232]
73 B(xi , xi |x1:K\i ) = ˆ where B(xi , pi , xi , pi |x1:K\i ) is a standard HMC kernel with mass matrix BBFGS (x1:K\i ) that inˆ cludes sampling pi . [sent-170, score-0.371]
74 The Hamiltonian energy function of B given by (15) is separable, that means xi only appear in potential energy. [sent-171, score-0.077]
75 It is easy to see that B is a valid kernel in X , so as a ECA method, H MC B FGS leaves p(x1:K ) = i π(xi ) invariant. [sent-172, score-0.117]
76 So despite the auxiliary variables, it is easiest to establish validity in the original space. [sent-176, score-0.06]
77 But, being an ECA method, H MC B FGS has the disadvantage that the larger the number of chains K, the updates are “spread across” the chains, so that each chain gets a small number of updates during a fixed amount of computation time. [sent-178, score-0.205]
78 Girolami and Calderhead adopted a generalised leapfrog method that is a reversible and volume-preserving approximation to nonseparable Hamiltonian. [sent-182, score-0.154]
79 An early example of ECA is adaptive direction sampling (ADS) [4], in which each sample is taken along a random direction that is chosen based on the samples from a set of chains. [sent-187, score-0.13]
80 However, the validity of ADS can be established only when the size of ensemble is greater than the number of dimensions, otherwise the samples are trapped in a subspace. [sent-188, score-0.135]
81 These methods must be designed carefully because if the kernel is adapted with the full sampling history in a naive way, the sampler can be invalid [13]. [sent-191, score-0.168]
82 A well known example of a correct adaptive algorithm is the Adaptive Metropolis [6] algorithm, which adapts the Gaussian proposal of a Metropolis Hastings algorithm based on the empirical covariance of previous samples in a way that maintains ergodicity. [sent-192, score-0.224]
83 Being a valid method, the adaptation of kernel must keep decreasing over time. [sent-193, score-0.067]
84 In practice, the parameters of the kernel in many diminishing adaptive methods converge to a single value over the entire state space. [sent-194, score-0.103]
85 This could be problematic if we want the sampler to adapt to local characteristics of the target distribution, e. [sent-195, score-0.115]
86 We compare H MC B FGS to the standard HMC which uses identity mass matrix and RMHMC which requires computing the Hessian matrix. [sent-200, score-0.071]
87 For H MC B FGS, we heuristically chose the number of ensemble chains K to be slightly higher than d/2. [sent-212, score-0.105]
88 1 Our implementation was based on the Matlab code of RMHMC of Girolami and Calderhead and checked against the original Matlab version 6 For each method, we drew 5000 samples after 1000 burn-in samples. [sent-213, score-0.061]
89 The convergence speed is measured by effective sample size (ESS) [5], which summaries the amount of autocorrelation across different lags over all dimensions2 . [sent-214, score-0.136]
90 Because H MC B FGS displays more correlation within individual chain than across chains, we calculate the ESS separately for individual chains in the ensemble and the overall ESS is simply the sum of that from individual chains. [sent-216, score-0.202]
91 00 107 Table 1: Performance of MCMC samplers on Bayesian logistic regression, as measured by Effective Sample Size (ESS). [sent-221, score-0.145]
92 ES/s is the number of effective samples per second Dataset Australian German Heart Pima Ripley D 15 25 14 8 7 N 690 1000 532 270 250 HMC 396 255 1054 591 1396 H MC B FGS 818 397 2009 1383 2745 RMHMC 18 3 54 118 344 Table 2: Effective samples per second on Bayesian logistic regression. [sent-224, score-0.179]
93 RMHMC achieves the highest minimum and mean and maximum ESS and that are all very close to the total number of samples 5000. [sent-227, score-0.061]
94 For each chain we draw 8000 samples with 1000 burn-in. [sent-238, score-0.158]
95 It is clear that H MC B FGS demonstrates remarkably less autocorrelation than HMC. [sent-243, score-0.089]
96 The results suggest that BFGS approximation dramatically reduces the sample autocorrelation with a small increase of computational overhead on this dataset. [sent-245, score-0.156]
97 Using an ensemble of K = 5 chains, the samples from H MC B FGS are less correlated than HMC along the largest eigenvalue direction (Figure 2). [sent-247, score-0.098]
98 2 We use the code from [5] to compute ESS of samples 7 ESS HMC H MC B FGS Min Mean Max Time (s) ES/h 3 9 25 35743 1 26 438 5371 37387 42 Table 3: Performance of MCMC samplers on Bayesian CRFs, as measured by Effective Sample Size (ESS). [sent-248, score-0.176]
99 On the other hand, H MC B FGS is more effective than the state-of-the-art sampler on several real world data sets. [sent-259, score-0.087]
100 Another potential issue, the asymptotic independence between the chains in ECA methods may lead to poor Hessian approximations. [sent-262, score-0.068]
wordName wordTfidf (topN-words)
[('fgs', 0.506), ('hmc', 0.375), ('hamiltonian', 0.256), ('mc', 0.253), ('rmhmc', 0.237), ('bfgs', 0.23), ('hk', 0.197), ('ess', 0.182), ('sk', 0.17), ('hessian', 0.152), ('leapfrog', 0.125), ('mcmc', 0.116), ('samplers', 0.115), ('eca', 0.111), ('xk', 0.104), ('chain', 0.097), ('autocorrelation', 0.089), ('monte', 0.089), ('proposal', 0.087), ('girolami', 0.084), ('calderhead', 0.079), ('carlo', 0.07), ('chains', 0.068), ('bbfgs', 0.063), ('metropolis', 0.063), ('yk', 0.063), ('samples', 0.061), ('sampler', 0.06), ('xi', 0.058), ('leaps', 0.056), ('momentum', 0.054), ('st', 0.052), ('leaves', 0.05), ('hbfgs', 0.047), ('bk', 0.047), ('kernel', 0.046), ('mass', 0.043), ('pi', 0.04), ('hastings', 0.038), ('target', 0.038), ('ensemble', 0.037), ('validity', 0.037), ('ti', 0.036), ('riemannian', 0.036), ('markov', 0.034), ('crf', 0.032), ('barthelme', 0.032), ('quasinewton', 0.032), ('adaptive', 0.031), ('doi', 0.031), ('memory', 0.031), ('logistic', 0.03), ('approximation', 0.029), ('invariant', 0.028), ('bayesian', 0.028), ('dpi', 0.028), ('formula', 0.028), ('matrix', 0.028), ('md', 0.028), ('effective', 0.027), ('acceptance', 0.026), ('state', 0.026), ('roberts', 0.026), ('invalid', 0.026), ('manifold', 0.025), ('iterates', 0.025), ('covariance', 0.025), ('recursion', 0.024), ('window', 0.024), ('infeasible', 0.023), ('dynamics', 0.023), ('rhs', 0.023), ('unif', 0.023), ('easiest', 0.023), ('irreducible', 0.022), ('lag', 0.022), ('valid', 0.021), ('ck', 0.021), ('updates', 0.02), ('sample', 0.02), ('yi', 0.02), ('base', 0.02), ('neal', 0.02), ('approximations', 0.02), ('previous', 0.02), ('roy', 0.019), ('ads', 0.019), ('energy', 0.019), ('gaussian', 0.019), ('tj', 0.019), ('pt', 0.019), ('derivatives', 0.018), ('overhead', 0.018), ('history', 0.018), ('correction', 0.018), ('sampling', 0.018), ('german', 0.017), ('xt', 0.017), ('adapt', 0.017), ('pk', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
Author: Yichuan Zhang, Charles A. Sutton
Abstract: The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. In this paper we propose MCMC samplers that make use of quasiNewton approximations, which approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. On several real world datasets, we show that the quasi-Newton sampler is more effective than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives. 1
2 0.15586004 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
Author: David Wingate, Noah Goodman, Andreas Stuhlmueller, Jeffrey M. Siskind
Abstract: Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient based methods, and provenance tracking, which enables efficient construction of global proposals. 1
3 0.08960247 229 nips-2011-Query-Aware MCMC
Author: Michael L. Wick, Andrew McCallum
Abstract: Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models. 1
4 0.076163821 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
5 0.07452099 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
Author: Trung T. Pham, Tat-jun Chin, Jin Yu, David Suter
Abstract: Multi-structure model fitting has traditionally taken a two-stage approach: First, sample a (large) number of model hypotheses, then select the subset of hypotheses that optimise a joint fitting and model selection criterion. This disjoint two-stage approach is arguably suboptimal and inefficient — if the random sampling did not retrieve a good set of hypotheses, the optimised outcome will not represent a good fit. To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and yields superior computational efficiency over previous two-stage methods. 1
6 0.065899678 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
7 0.064968728 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
8 0.060859274 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
9 0.058920447 302 nips-2011-Variational Learning for Recurrent Spiking Networks
10 0.057890303 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
11 0.054950513 4 nips-2011-A Convergence Analysis of Log-Linear Training
12 0.05093218 168 nips-2011-Maximum Margin Multi-Instance Learning
13 0.050842606 73 nips-2011-Divide-and-Conquer Matrix Factorization
14 0.048392657 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
15 0.047944371 101 nips-2011-Gaussian process modulated renewal processes
16 0.044564493 33 nips-2011-An Exact Algorithm for F-Measure Maximization
17 0.044294842 131 nips-2011-Inference in continuous-time change-point models
18 0.043759599 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
19 0.043151818 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
20 0.042537723 221 nips-2011-Priors over Recurrent Continuous Time Processes
topicId topicWeight
[(0, 0.132), (1, -0.006), (2, 0.017), (3, -0.036), (4, -0.058), (5, -0.024), (6, -0.04), (7, -0.046), (8, 0.017), (9, 0.019), (10, -0.002), (11, -0.078), (12, 0.054), (13, -0.003), (14, -0.02), (15, -0.027), (16, -0.023), (17, -0.002), (18, -0.059), (19, 0.08), (20, -0.013), (21, 0.086), (22, 0.076), (23, -0.091), (24, -0.054), (25, 0.045), (26, 0.011), (27, -0.014), (28, 0.035), (29, -0.075), (30, -0.154), (31, 0.018), (32, -0.119), (33, -0.045), (34, -0.097), (35, 0.052), (36, -0.061), (37, -0.036), (38, 0.07), (39, -0.083), (40, -0.071), (41, -0.032), (42, -0.13), (43, -0.047), (44, -0.019), (45, -0.016), (46, -0.045), (47, -0.044), (48, -0.061), (49, 0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.91381669 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
Author: Yichuan Zhang, Charles A. Sutton
Abstract: The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. In this paper we propose MCMC samplers that make use of quasiNewton approximations, which approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. On several real world datasets, we show that the quasi-Newton sampler is more effective than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives. 1
2 0.71890706 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
Author: David Wingate, Noah Goodman, Andreas Stuhlmueller, Jeffrey M. Siskind
Abstract: Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient based methods, and provenance tracking, which enables efficient construction of global proposals. 1
3 0.57878536 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
Author: Trung T. Pham, Tat-jun Chin, Jin Yu, David Suter
Abstract: Multi-structure model fitting has traditionally taken a two-stage approach: First, sample a (large) number of model hypotheses, then select the subset of hypotheses that optimise a joint fitting and model selection criterion. This disjoint two-stage approach is arguably suboptimal and inefficient — if the random sampling did not retrieve a good set of hypotheses, the optimised outcome will not represent a good fit. To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and yields superior computational efficiency over previous two-stage methods. 1
4 0.55267417 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
5 0.54461044 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
6 0.54102892 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
7 0.52822661 229 nips-2011-Query-Aware MCMC
8 0.47128141 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
9 0.46049383 55 nips-2011-Collective Graphical Models
10 0.43854031 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
11 0.41042626 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
12 0.40497676 260 nips-2011-Sparse Features for PCA-Like Linear Regression
13 0.39918563 201 nips-2011-On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference
14 0.39617899 73 nips-2011-Divide-and-Conquer Matrix Factorization
15 0.39478427 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
16 0.38939148 171 nips-2011-Metric Learning with Multiple Kernels
17 0.38262507 197 nips-2011-On Tracking The Partition Function
18 0.37961197 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
19 0.37564582 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
20 0.37270993 131 nips-2011-Inference in continuous-time change-point models
topicId topicWeight
[(0, 0.02), (4, 0.024), (20, 0.024), (26, 0.065), (31, 0.095), (33, 0.013), (43, 0.052), (45, 0.092), (57, 0.03), (63, 0.321), (74, 0.043), (83, 0.052), (99, 0.055)]
simIndex simValue paperId paperTitle
1 0.78349423 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm
Author: Julie Dethier, Paul Nuyujukian, Chris Eliasmith, Terrence C. Stewart, Shauki A. Elasaad, Krishna V. Shenoy, Kwabena A. Boahen
Abstract: Motor prostheses aim to restore function to disabled patients. Despite compelling proof of concept systems, barriers to clinical translation remain. One challenge is to develop a low-power, fully-implantable system that dissipates only minimal power so as not to damage tissue. To this end, we implemented a Kalman-filter based decoder via a spiking neural network (SNN) and tested it in brain-machine interface (BMI) experiments with a rhesus monkey. The Kalman filter was trained to predict the arm’s velocity and mapped on to the SNN using the Neural Engineering Framework (NEF). A 2,000-neuron embedded Matlab SNN implementation runs in real-time and its closed-loop performance is quite comparable to that of the standard Kalman filter. The success of this closed-loop decoder holds promise for hardware SNN implementations of statistical signal processing algorithms on neuromorphic chips, which may offer power savings necessary to overcome a major obstacle to the successful clinical translation of neural motor prostheses. ∗ Present: Research Fellow F.R.S.-FNRS, Systmod Unit, University of Liege, Belgium. 1 1 Cortically-controlled motor prostheses: the challenge Motor prostheses aim to restore function for severely disabled patients by translating neural signals from the brain into useful control signals for prosthetic limbs or computer cursors. Several proof of concept demonstrations have shown encouraging results, but barriers to clinical translation still remain. One example is the development of a fully-implantable system that meets power dissipation constraints, but is still powerful enough to perform complex operations. A recently reported closedloop cortically-controlled motor prosthesis is capable of producing quick, accurate, and robust computer cursor movements by decoding neural signals (threshold-crossings) from a 96-electrode array in rhesus macaque premotor/motor cortex [1]-[4]. This, and previous designs (e.g., [5]), employ versions of the Kalman filter, ubiquitous in statistical signal processing. Such a filter and its variants are the state-of-the-art decoder for brain-machine interfaces (BMIs) in humans [5] and monkeys [2]. While these recent advances are encouraging, clinical translation of such BMIs requires fullyimplanted systems, which in turn impose severe power dissipation constraints. Even though it is an open, actively-debated question as to how much of the neural prosthetic system must be implanted, we note that there are no reports to date demonstrating a fully implantable 100-channel wireless transmission system, motivating performing decoding within the implanted chip. This computation is constrained by a stringent power budget: A 6 × 6mm2 implant must dissipate less than 10mW to avoid heating the brain by more than 1◦ C [6], which is believed to be important for long term cell health. With this power budget, current approaches can not scale to higher electrode densities or to substantially more computer-intensive decode/control algorithms. The feasibility of mapping a Kalman-filter based decoder algorithm [1]-[4] on to a spiking neural network (SNN) has been explored off-line (open-loop). In these off-line tests, the SNN’s performance virtually matched that of the standard implementation [7]. These simulations provide confidence that this algorithm—and others similar to it—could be implemented using an ultra-low-power approach potentially capable of meeting the severe power constraints set by clinical translation. This neuromorphic approach uses very-large-scale integrated systems containing microelectronic analog circuits to morph neural systems into silicon chips [8, 9]. These neuromorphic circuits may yield tremendous power savings—50nW per silicon neuron [10]—over digital circuits because they use physical operations to perform mathematical computations (analog approach). When implemented on a chip designed using the neuromorphic approach, a 2,000-neuron SNN network can consume as little as 100µW. Demonstrating this approach’s feasibility in a closed-loop system running in real-time is a key, non-incremental step in the development of a fully implantable decoding chip, and is necessary before proceeding with fabricating and implanting the chip. As noise, delay, and over-fitting play a more important role in the closed-loop setting, it is not obvious that the SNN’s stellar open-loop performance will hold up. In addition, performance criteria are different in the closed-loop and openloop settings (e.g., time per target vs. root mean squared error). Therefore, a SNN of a different size may be required to meet the desired specifications. Here we present results and assess the performance and viability of the SNN Kalman-filter based decoder in real-time, closed-loop tests, with the monkey performing a center-out-and-back target acquisition task. To achieve closed-loop operation, we developed an embedded Matlab implementation that ran a 2,000-neuron version of the SNN in real-time on a PC. We achieved almost a 50-fold speed-up by performing part of the computation in a lower-dimensional space defined by the formal method we used to map the Kalman filter on to the SNN. This shortcut allowed us to run a larger SNN in real-time than would otherwise be possible. 2 Spiking neural network mapping of control theory algorithms As reported in [11], a formal methodology, called the Neural Engineering Framework (NEF), has been developed to map control-theory algorithms onto a computational fabric consisting of a highly heterogeneous population of spiking neurons simply by programming the strengths of their connections. These artificial neurons are characterized by a nonlinear multi-dimensional-vector-to-spikerate function—ai (x(t)) for the ith neuron—with parameters (preferred direction, maximum firing rate, and spiking-threshold) drawn randomly from a wide distribution (standard deviation ≈ mean). 2 Spike rate (spikes/s) Representation ˆ x → ai (x) → x = ∑i ai (x)φix ˜ ai (x) = G(αi φix · x + Jibias ) 400 Transformation y = Ax → b j (Aˆ ) x Aˆ = ∑i ai (x)Aφix x x(t) B' y(t) A' 200 0 −1 Dynamics ˙ x = Ax → x = h ∗ A x A = τA + I 0 Stimulus x 1 bk(t) y(t) B' h(t) x(t) A' aj(t) Figure 1: NEF’s three principles. Representation. 1D tuning curves of a population of 50 leaky integrate-and-fire neurons. The neurons’ tuning curves map control variables (x) to spike rates (ai (x)); this nonlinear transformation is inverted by linear weighted decoding. G() is the neurons’ nonlinear current-to-spike-rate function. Transformation. SNN with populations bk (t) and a j (t) representing y(t) and x(t). Feedforward and recurrent weights are determined by B and A , as described next. Dynamics. The system’s dynamics is captured in a neurally plausible fashion by replacing integration with the synapses’ spike response, h(t), and replacing the matrices with A = τA + I and B = τB to compensate. The neural engineering approach to configuring SNNs to perform arbitrary computations is underlined by three principles (Figure 1) [11]-[14]: Representation is defined by nonlinear encoding of x(t) as a spike rate, ai (x(t))—represented by the neuron tuning curve—combined with optimal weighted linear decoding of ai (x(t)) to recover ˆ an estimate of x(t), x(t) = ∑i ai (x(t))φix , where φix are the decoding weights. Transformation is performed by using alternate decoding weights in the decoding operation to map transformations of x(t) directly into transformations of ai (x(t)). For example, y(t) = Ax(t) is represented by the spike rates b j (Aˆ (t)), where unit j’s input is computed directly from unit i’s x output using Aˆ (t) = ∑i ai (x(t))Aφix , an alternative linear weighting. x Dynamics brings the first two principles together and adds the time dimension to the circuit. This principle aims at reuniting the control-theory and neural levels by modifying the matrices to render the system neurally plausible, thereby permitting the synapses’ spike response, h(t), (i.e., impulse ˙ response) to capture the system’s dynamics. For example, for h(t) = τ −1 e−t/τ , x = Ax(t) is realized by replacing A with A = τA + I. This so-called neurally plausible matrix yields an equivalent dynamical system: x(t) = h(t) ∗ A x(t), where convolution replaces integration. The nonlinear encoding process—from a multi-dimensional stimulus, x(t), to a one-dimensional soma current, Ji (x(t)), to a firing rate, ai (x(t))—is specified as: ai (x(t)) = G(Ji (x(t))). (1) Here G is the neurons’ nonlinear current-to-spike-rate function, which is given by G(Ji (x)) = τ ref − τ RC ln (1 − Jth /Ji (x)) −1 , (2) for the leaky integrate-and-fire model (LIF). The LIF neuron has two behavioral regimes: subthreshold and super-threshold. The sub-threshold regime is described by an RC circuit with time constant τ RC . When the sub-threshold soma voltage reaches the threshold, Vth , the neuron emits a spike δ (t −tn ). After this spike, the neuron is reset and rests for τ ref seconds (absolute refractory period) before it resumes integrating. Jth = Vth /R is the minimum input current that produces spiking. Ignoring the soma’s RC time-constant when specifying the SNN’s dynamics are reasonable because the neurons cross threshold at a rate that is proportional to their input current, which thus sets the spike rate instantaneously, without any filtering [11]. The conversion from a multi-dimensional stimulus, x(t), to a one-dimensional soma current, Ji , is ˜ performed by assigning to the neuron a preferred direction, φix , in the stimulus space and taking the dot-product: ˜ Ji (x(t)) = αi φix · x(t) + Jibias , (3) 3 where αi is a gain or conversion factor, and Jibias is a bias current that accounts for background ˜ activity. For a 1D space, φix is either +1 or −1 (drawn randomly), for ON and OFF neurons, respectively. The resulting tuning curves are illustrated in Figure 1, left. The linear decoding process is characterized by the synapses’ spike response, h(t) (i.e., post-synaptic currents), and the decoding weights, φix , which are obtained by minimizing the mean square error. A single noise term, η, takes into account all sources of noise, which have the effect of introducing uncertainty into the decoding process. Hence, the transmitted firing rate can be written as ai (x(t)) + ηi , where ai (x(t)) represents the noiseless set of tuning curves and ηi is a random variable picked from a zero-mean Gaussian distribution with variance σ 2 . Consequently, the mean square error can be written as [11]: E = 1 ˆ [x(t) − x(t)]2 2 x,η,t = 2 1 2 x(t) − ∑ (ai (x(t)) + ηi ) φix i (4) x,η,t where · x,η denotes integration over the range of x and η, the expected noise. We assume that the noise is independent and has the same variance for each neuron [11], which yields: E= where σ2 1 2 2 x(t) − ∑ ai (x(t))φix i x,t 1 + σ 2 ∑(φix )2 , 2 i (5) is the noise variance ηi η j . This expression is minimized by: N φix = ∑ Γ−1 ϒ j , ij (6) j with Γi j = ai (x)a j (x) x + σ 2 δi j , where δ is the Kronecker delta function matrix, and ϒ j = xa j (x) x [11]. One consequence of modeling noise in the neural representation is that the matrix Γ is invertible despite the use of a highly overcomplete representation. In a noiseless representation, Γ is generally singular because, due to the large number of neurons, there is a high probability of having two neurons with similar tuning curves leading to two similar rows in Γ. 3 Kalman-filter based cortical decoder In the 1960’s, Kalman described a method that uses linear filtering to track the state of a dynamical system throughout time using a model of the dynamics of the system as well as noisy measurements [15]. The model dynamics gives an estimate of the state of the system at the next time step. This estimate is then corrected using the observations (i.e., measurements) at this time step. The relative weights for these two pieces of information are given by the Kalman gain, K [15, 16]. Whereas the Kalman gain is updated at each iteration, the state and observation matrices (defined below)—and corresponding noise matrices—are supposed constant. In the case of prosthetic applications, the system’s state vector is the cursor’s kinematics, xt = y [veltx , velt , 1], where the constant 1 allows for a fixed offset compensation. The measurement vector, yt , is the neural spike rate (spike counts in each time step) of 192 channels of neural threshold crossings. The system’s dynamics is modeled by: xt yt = Axt−1 + wt , = Cxt + qt , (7) (8) where A is the state matrix, C is the observation matrix, and wt and qt are additive, Gaussian noise sources with wt ∼ N (0, W) and qt ∼ N (0, Q). The model parameters (A, C, W and Q) are fit with training data by correlating the observed hand kinematics with the simultaneously measured neural signals (Figure 2). For an efficient decoding, we derived the steady-state update equation by replacing the adaptive Kalman gain by its steady-state formulation: K = (I + WCQ−1 C)−1 W CT Q−1 . This yields the following estimate of the system’s state: xt = (I − KC)Axt−1 + Kyt = MDT xt−1 + MDT yt , x y 4 (9) a Velocity (cm/s) Neuron 10 c 150 5 100 b 50 20 0 −20 0 0 x−velocity y−velocity 2000 4000 6000 8000 Time (ms) 10000 12000 1cm 14000 Trials: 0034-0049 Figure 2: Neural and kinematic measurements (monkey J, 2011-04-16, 16 continuous trials) used to fit the standard Kalman filter model. a. The 192 cortical recordings fed as input to fit the Kalman filter’s matrices (color code refers to the number of threshold crossings observed in each 50ms bin). b. Hand x- and y-velocity measurements correlated with the neural data to obtain the Kalman filter’s matrices. c. Cursor kinematics of 16 continuous trials under direct hand control. where MDT = (I − KC)A and MDT = K are the discrete time (DT) Kalman matrices. The steadyx y state formulation improves efficiency with little loss in accuracy because the optimal Kalman gain rapidly converges (typically less than 100 iterations). Indeed, in neural applications under both open-loop and closed-loop conditions, the difference between the full Kalman filter and its steadystate implementation falls to within 1% in a few seconds [17]. This simplifying assumption reduces the execution time for decoding a typical neuronal firing rate signal approximately seven-fold [17], a critical speed-up for real-time applications. 4 Kalman filter with a spiking neural network To implement the Kalman filter with a SNN by applying the NEF, we first convert Equation 9 from DT to continuous time (CT), and then replace the CT matrices with neurally plausible ones, which yields: x(t) = h(t) ∗ A x(t) + B y(t) , (10) where A = τMCT + I, B = τMCT , with MCT = MDT − I /∆t and MCT = MDT /∆t, the CT x y x x y y Kalman matrices, and ∆t = 50ms, the discrete time step; τ is the synaptic time-constant. The jth neuron’s input current (see Equation 3) is computed from the system’s current state, x(t), which is computed from estimates of the system’s previous state (ˆ (t) = ∑i ai (t)φix ) and current x y input (ˆ (t) = ∑k bk (t)φk ) using Equation 10. This yields: y ˜x J j (x(t)) = α j φ j · x(t) + J bias j ˜x ˆ ˆ = α j φ j · h(t) ∗ A x(t) + B y(t) ˜x = α j φ j · h(t) ∗ A + J bias j ∑ ai (t)φix + B ∑ bk (t)φky i + J bias j (11) k This last equation can be written in a neural network form: J j (x(t)) = h(t) ∗ ∑ ω ji ai (t) + ∑ ω jk bk (t) i + J bias j (12) k y ˜x ˜x where ω ji = α j φ j A φix and ω jk = α j φ j B φk are the recurrent and feedforward weights, respectively. 5 Efficient implementation of the SNN In this section, we describe the two distinct steps carried out when implementing the SNN: creating and running the network. The first step has no computational constraints whereas the second must be very efficient in order to be successfully deployed in the closed-loop experimental setting. 5 x ( 1000 x ( = 1000 1000 = 1000 x 1000 b 1000 x 1000 1000 a Figure 3: Computing a 1000-neuron pool’s recurrent connections. a. Using connection weights requires multiplying a 1000×1000 matrix by a 1000 ×1 vector. b. Operating in the lower-dimensional state space requires multiplying a 1 × 1000 vector by a 1000 × 1 vector to get the decoded state, multiplying this state by a component of the A matrix to update it, and multiplying the updated state by a 1000 × 1 vector to re-encode it as firing rates, which are then used to update the soma current for every neuron. Network creation: This step generates, for a specified number of neurons composing the network, x ˜x the gain α j , bias current J bias , preferred direction φ j , and decoding weight φ j for each neuron. The j ˜x preferred directions φ j are drawn randomly from a uniform distribution over the unit sphere. The maximum firing rate, max G(J j (x)), and the normalized x-axis intercept, G(J j (x)) = 0, are drawn randomly from a uniform distribution on [200, 400] Hz and [-1, 1], respectively. From these two specifications, α j and J bias are computed using Equation 2 and Equation 3. The decoding weights j x φ j are computed by minimizing the mean square error (Equation 6). For efficient implementation, we used two 1D integrators (i.e., two recurrent neuron pools, with each pool representing a scalar) rather than a single 3D integrator (i.e., one recurrent neuron pool, with the pool representing a 3D vector by itself) [13]. The constant 1 is fed to the 1D integrators as an input, rather than continuously integrated as part of the state vector. We also replaced the bk (t) units’ spike rates (Figure 1, middle) with the 192 neural measurements (spike counts in 50ms bins), y which is equivalent to choosing φk from a standard basis (i.e., a unit vector with 1 at the kth position and 0 everywhere else) [7]. Network simulation: This step runs the simulation to update the soma current for every neuron, based on input spikes. The soma voltage is then updated following RC circuit dynamics. Gaussian noise is normally added at this step, the rest of the simulation being noiseless. Neurons with soma voltage above threshold generate a spike and enter their refractory period. The neuron firing rates are decoded using the linear decoding weights to get the updated states values, x and y-velocity. These values are smoothed with a filter identical to h(t), but with τ set to 5ms instead of 20ms to avoid introducing significant delay. Then the simulation step starts over again. In order to ensure rapid execution of the simulation step, neuron interactions are not updated dix rectly using the connection matrix (Equation 12), but rather indirectly with the decoding matrix φ j , ˜x dynamics matrix A , and preferred direction matrix φ j (Equation 11). To see why this is more efficient, suppose we have 1000 neurons in the a population for each of the state vector’s two scalars. Computing the recurrent connections using connection weights requires multiplying a 1000 × 1000 matrix by a 1000-dimensional vector (Figure 3a). This requires 106 multiplications and about 106 sums. Decoding each scalar (i.e., ∑i ai (t)φix ), however, requires only 1000 multiplications and 1000 sums. The decoded state vector is then updated by multiplying it by the (diagonal) A matrix, another 2 products and 1 sum. The updated state vector is then encoded by multiplying it with the neurons’ preferred direction vectors, another 1000 multiplications per scalar (Figure 3b). The resulting total of about 3000 operations is nearly three orders of magnitude fewer than using the connection weights to compute the identical transformation. To measure the speedup, we simulated a 2,000-neuron network on a computer running Matlab 2011a (Intel Core i7, 2.7-GHz, Mac OS X Lion). Although the exact run-times depend on the computing hardware and software, the run-time reduction factor should remain approximately constant across platforms. For each reported result, we ran the simulation 10 times to obtain a reliable estimate of the execution time. The run-time for neuron interactions using the recurrent connection weights was 9.9ms and dropped to 2.7µs in the lower-dimensional space, approximately a 3,500-fold speedup. Only the recurrent interactions benefit from the speedup, the execution time for the rest of the operations remaining constant. The run-time for a 50ms network simulation using the recurrent connec6 Table 1: Model parameters Symbol max G(J j (x)) G(J j (x)) = 0 J bias j αj ˜x φj Range 200-400 Hz −1 to 1 Satisfies first two Satisfies first two ˜x φj = 1 Description Maximum firing rate Normalized x-axis intercept Bias current Gain factor Preferred-direction vector σ2 τ RC j τ ref j τ PSC j 0.1 20 ms 1 ms 20 ms Gaussian noise variance RC time constant Refractory period PSC time constant tion weights was 0.94s and dropped to 0.0198s in the lower-dimensional space, a 47-fold speedup. These results demonstrate the efficiency the lower-dimensional space offers, which made the closedloop application of SNNs possible. 6 Closed-loop implementation An adult male rhesus macaque (monkey J) was trained to perform a center-out-and-back reaching task for juice rewards to one of eight targets, with a 500ms hold time (Figure 4a) [1]. All animal protocols and procedures were approved by the Stanford Institutional Animal Care and Use Committee. Hand position was measured using a Polaris optical tracking system at 60Hz (Northern Digital Inc.). Neural data were recorded from two 96-electrode silicon arrays (Blackrock Microsystems) implanted in the dorsal pre-motor and motor cortex. These recordings (-4.5 RMS threshold crossing applied to each electrode’s signal) yielded tuned activity for the direction and speed of arm movements. As detailed in [1], a standard Kalman filter model was fit by correlating the observed hand kinematics with the simultaneously measured neural signals, while the monkey moved his arm to acquire virtual targets (Figure 2). The resulting model was used in a closed-loop system to control an on-screen cursor in real-time (Figure 4a, Decoder block). A steady-state version of this model serves as the standard against which the SNN implementation’s performance is compared. We built a SNN using the NEF methodology based on derived Kalman filter parameters mentioned above. This SNN was then simulated on an xPC Target (Mathworks) x86 system (Dell T3400, Intel Core 2 Duo E8600, 3.33GHz). It ran in closed-loop, replacing the standard Kalman filter as the decoder block in Figure 4a. The parameter values listed in Table 1 were used for the SNN implementation. We ensured that the time constants τiRC ,τiref , and τiPSC were smaller than the implementation’s time step (50ms). Noise was not explicitly added. It arose naturally from the fluctuations produced by representing a scalar with filtered spike trains, which has been shown to have effects similar to Gaussian noise [11]. For the purpose of computing the linear decoding weights (i.e., Γ), we modeled the resulting noise as Gaussian with a variance of 0.1. A 2,000-neuron version of the SNN-based decoder was tested in a closed-loop system, the largest network our embedded MatLab implementation could run in real-time. There were 1206 trials total among which 301 (center-outs only) were performed with the SNN and 302 with the standard (steady-state) Kalman filter. The block structure was randomized and interleaved, so that there is no behavioral bias present in the findings. 100 trials under hand control are used as a baseline comparison. Success corresponds to a target acquisition under 1500ms, with 500ms hold time. Success rates were higher than 99% on all blocks for the SNN implementation and 100% for the standard Kalman filter. The average time to acquire the target was slightly slower for the SNN (Figure 5b)—711ms vs. 661ms, respectively—we believe this could be improved by using more neurons in the SNN.1 The average distance to target (Figure 5a) and the average velocity of the cursor (Figure 5c) are very similar. 1 Off-line, the SNN performed better as we increased the number of neurons [7]. 7 a Neural Spikes b c BMI: Kalman decoder BMI: SNN decoder Decoder Cursor Velocity 1cm 1cm Trials: 2056-2071 Trials: 1748-1763 5 0 0 400 Time after Target Onset (ms) 800 Target acquisition time histogram 40 Mean cursor velocity 50 Standard Kalman filter 40 20 Hand 30 30 Spiking Neural Network 20 10 0 c Cursor Velocity (cm/s) b Mean distance to target 10 Percent of Trials (%) a Distance to Target (cm) Figure 4: Experimental setup and results. a. Data are recorded from two 96-channel silicon electrode arrays implanted in dorsal pre-motor and motor cortex of an adult male monkey performing a centerout-and-back reach task for juice rewards to one of eight targets with a 500ms hold time. b. BMI position kinematics of 16 continuous trials for the standard Kalman filter implementation. c. BMI position kinematics of 16 continuous trials for the SNN implementation. 10 0 500 1000 Target Acquire Time (ms) 1500 0 0 200 400 600 800 Time after Target Onset (ms) 1000 Figure 5: SNN (red) performance compared to standard Kalman filter (blue) (hand trials are shown for reference (yellow)). The SNN achieves similar results—success rates are higher than 99% on all blocks—as the standard Kalman filter implementation. a. Plot of distance to target vs. time both after target onset for different control modalities. The thicker traces represent the average time when the cursor first enters the acceptance window until successfully entering for the 500ms hold time. b. Histogram of target acquisition time. c. Plot of mean cursor velocity vs. time. 7 Conclusions and future work The SNN’s performance was quite comparable to that produced by a standard Kalman filter implementation. The 2,000-neuron network had success rates higher than 99% on all blocks, with mean distance to target, target acquisition time, and mean cursor velocity curves very similar to the ones obtained with the standard implementation. Future work will explore whether these results extend to additional animals. As the Kalman filter and its variants are the state-of-the-art in cortically-controlled motor prostheses [1]-[5], these simulations provide confidence that similar levels of performance can be attained with a neuromorphic system, which can potentially overcome the power constraints set by clinical applications. Our ultimate goal is to develop an ultra-low-power neuromorphic chip for prosthetic applications on to which control theory algorithms can be mapped using the NEF. As our next step in this direction, we will begin exploring this mapping with Neurogrid, a hardware platform with sixteen programmable neuromorphic chips that can simulate up to a million spiking neurons in real-time [9]. However, bandwidth limitations prevent Neurogrid from realizing random connectivity patterns. It can only connect each neuron to thousands of others if neighboring neurons share common inputs — just as they do in the cortex. Such columnar organization may be possible with NEF-generated networks if preferred directions vectors are assigned topographically rather than randomly. Implementing this constraint effectively is a subject of ongoing research. Acknowledgment This work was supported in part by the Belgian American Education Foundation(J. Dethier), Stanford NIH Medical Scientist Training Program (MSTP) and Soros Fellowship (P. Nuyujukian), DARPA Revolutionizing Prosthetics program (N66001-06-C-8005, K. V. Shenoy), and two NIH Director’s Pioneer Awards (DP1-OD006409, K. V. Shenoy; DPI-OD000965, K. Boahen). 8 References [1] V. Gilja, Towards clinically viable neural prosthetic systems, Ph.D. Thesis, Department of Computer Science, Stanford University, 2010, pp 19–22 and pp 57–73. [2] V. Gilja, P. Nuyujukian, C.A. Chestek, J.P. Cunningham, J.M. Fan, B.M. Yu, S.I. Ryu, and K.V. Shenoy, A high-performance continuous cortically-controlled prosthesis enabled by feedback control design, 2010 Neuroscience Meeting Planner, San Diego, CA: Society for Neuroscience, 2010. [3] P. Nuyujukian, V. Gilja, C.A. Chestek, J.P. Cunningham, J.M. Fan, B.M. Yu, S.I. Ryu, and K.V. Shenoy, Generalization and robustness of a continuous cortically-controlled prosthesis enabled by feedback control design, 2010 Neuroscience Meeting Planner, San Diego, CA: Society for Neuroscience, 2010. [4] V. Gilja, C.A. Chestek, I. Diester, J.M. Henderson, K. Deisseroth, and K.V. Shenoy, Challenges and opportunities for next-generation intra-cortically based neural prostheses, IEEE Transactions on Biomedical Engineering, 2011, in press. [5] S.P. Kim, J.D. Simeral, L.R. Hochberg, J.P. Donoghue, and M.J. Black, Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia, Journal of Neural Engineering, vol. 5, 2008, pp 455–476. [6] S. Kim, P. Tathireddy, R.A. Normann, and F. Solzbacher, Thermal impact of an active 3-D microelectrode array implanted in the brain, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 15, 2007, pp 493–501. [7] J. Dethier, V. Gilja, P. Nuyujukian, S.A. Elassaad, K.V. Shenoy, and K. Boahen, Spiking neural network decoder for brain-machine interfaces, IEEE Engineering in Medicine & Biology Society Conference on Neural Engineering, Cancun, Mexico, 2011, pp 396–399. [8] K. Boahen, Neuromorphic microchips, Scientific American, vol. 292(5), 2005, pp 56–63. [9] R. Silver, K. Boahen, S. Grillner, N. Kopell, and K.L. Olsen, Neurotech for neuroscience: unifying concepts, organizing principles, and emerging tools, Journal of Neuroscience, vol. 27(44), 2007, pp 11807– 11819. [10] J.V. Arthur and K. Boahen, Silicon neuron design: the dynamical systems approach, IEEE Transactions on Circuits and Systems, vol. 58(5), 2011, pp 1034-1043. [11] C. Eliasmith and C.H. Anderson, Neural engineering: computation, representation, and dynamics in neurobiological systems, MIT Press, Cambridge, MA; 2003. [12] C. Eliasmith, A unified approach to building and controlling spiking attractor networks, Neural Computation, vol. 17, 2005, pp 1276–1314. [13] R. Singh and C. Eliasmith, Higher-dimensional neurons explain the tuning and dynamics of working memory cells, The Journal of Neuroscience, vol. 26(14), 2006, pp 3667–3678. [14] C. Eliasmith, How to build a brain: from function to implementation, Synthese, vol. 159(3), 2007, pp 373–388. [15] R.E. Kalman, A new approach to linear filtering and prediction problems, Transactions of the ASME– Journal of Basic Engineering, vol. 82(Series D), 1960, pp 35–45. [16] G. Welsh and G. Bishop, An introduction to the Kalman Filter, University of North Carolina at Chapel Hill Chapel Hill NC, vol. 95(TR 95-041), 1995, pp 1–16. [17] W.Q. Malik, W. Truccolo, E.N. Brown, and L.R. Hochberg, Efficient decoding with steady-state Kalman filter in neural interface systems, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 19(1), 2011, pp 25–34. 9
same-paper 2 0.73790443 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
Author: Yichuan Zhang, Charles A. Sutton
Abstract: The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. In this paper we propose MCMC samplers that make use of quasiNewton approximations, which approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. On several real world datasets, we show that the quasi-Newton sampler is more effective than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives. 1
3 0.67517012 221 nips-2011-Priors over Recurrent Continuous Time Processes
Author: Ardavan Saeedi, Alexandre Bouchard-côté
Abstract: We introduce the Gamma-Exponential Process (GEP), a prior over a large family of continuous time stochastic processes. A hierarchical version of this prior (HGEP; the Hierarchical GEP) yields a useful model for analyzing complex time series. Models based on HGEPs display many attractive properties: conjugacy, exchangeability and closed-form predictive distribution for the waiting times, and exact Gibbs updates for the time scale parameters. After establishing these properties, we show how posterior inference can be carried efficiently using Particle MCMC methods [1]. This yields a MCMC algorithm that can resample entire sequences atomically while avoiding the complications of introducing slice and stick auxiliary variables of the beam sampler [2]. We applied our model to the problem of estimating the disease progression in multiple sclerosis [3], and to RNA evolutionary modeling [4]. In both domains, we found that our model outperformed the standard rate matrix estimation approach. 1
4 0.49537584 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
Author: David Wingate, Noah Goodman, Andreas Stuhlmueller, Jeffrey M. Siskind
Abstract: Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient based methods, and provenance tracking, which enables efficient construction of global proposals. 1
5 0.47584313 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
Author: Feng Yan, Yuan Qi
Abstract: For many real-world applications, we often need to select correlated variables— such as genetic variations and imaging features associated with Alzheimer’s disease—in a high dimensional space. The correlation between variables presents a challenge to classical variable selection methods. To address this challenge, the elastic net has been developed and successfully applied to many applications. Despite its great success, the elastic net does not exploit the correlation information embedded in the data to select correlated variables. To overcome this limitation, we present a novel hybrid model, EigenNet, that uses the eigenstructures of data to guide variable selection. Specifically, it integrates a sparse conditional classification model with a generative model capturing variable correlations in a principled Bayesian framework. We develop an efficient active-set algorithm to estimate the model via evidence maximization. Experimental results on synthetic data and imaging genetics data demonstrate the superior predictive performance of the EigenNet over the lasso, the elastic net, and the automatic relevance determination. 1
6 0.47470301 229 nips-2011-Query-Aware MCMC
7 0.47227642 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
8 0.47114447 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
9 0.47082627 249 nips-2011-Sequence learning with hidden units in spiking neural networks
10 0.47036105 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
11 0.4702788 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
12 0.46984896 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach
13 0.46924067 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
14 0.46890393 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
15 0.46784091 22 nips-2011-Active Ranking using Pairwise Comparisons
16 0.46755356 180 nips-2011-Multiple Instance Filtering
17 0.46753252 258 nips-2011-Sparse Bayesian Multi-Task Learning
18 0.46713305 102 nips-2011-Generalised Coupled Tensor Factorisation
19 0.46674883 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
20 0.46667367 271 nips-2011-Statistical Tests for Optimization Efficiency