347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

Source: pdf

Author: David Balduzzi, Michel Besserve

Abstract: This paper suggests a learning-theoretic perspective on how synaptic plasticity benefits global brain functioning. We introduce a model, the selectron, that (i) arises as the fast time constant limit of leaky integrate-and-fire neurons equipped with spike-timing dependent plasticity (STDP) and (ii) is amenable to theoretical analysis. We show that the selectron encodes reward estimates into spikes and that an error bound on spikes is controlled by a spiking margin and the sum of synaptic weights. Moreover, the efficacy of spikes (their usefulness to other reward maximizing selectrons) also depends on total synaptic strength. Finally, based on our analysis, we propose a regularized version of STDP, and show the regularization improves the robustness of neuronal learning when faced with multiple stimuli.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This paper suggests a learning-theoretic perspective on how synaptic plasticity benefits global brain functioning. [sent-7, score-0.392]

2 We introduce a model, the selectron, that (i) arises as the fast time constant limit of leaky integrate-and-fire neurons equipped with spike-timing dependent plasticity (STDP) and (ii) is amenable to theoretical analysis. [sent-8, score-0.443]

3 We show that the selectron encodes reward estimates into spikes and that an error bound on spikes is controlled by a spiking margin and the sum of synaptic weights. [sent-9, score-1.773]

4 Moreover, the efficacy of spikes (their usefulness to other reward maximizing selectrons) also depends on total synaptic strength. [sent-10, score-0.781]

5 For biological networks however, the currently known neural plasticity mechanisms use a very restricted set of data – largely consisting of spikes and diffuse neuromodulatory signals. [sent-14, score-0.62]

6 The selectron is a model derived from leaky integrate-and-fire neurons equipped with spike-timing dependent plasticity that is amenable to learning-theoretic analysis. [sent-21, score-0.894]

7 We state a constrained reward maximization problem which implies that selectrons encode empirical reward estimates into spikes. [sent-24, score-0.44]

8 Our first result, section §3, is that the selectron arises as the fast time constant limit of well-established models of neuronal spiking and plasticity, suggesting that cortical neurons may also be encoding reward estimates into their spike trains. [sent-25, score-1.015]

9 First, what guarantees can be provided on spikes being reliable predictors of global (neuromodulatory) outcomes? [sent-27, score-0.35]

10 Second, what guarantees can be provided on the usefulness of spikes to other neurons? [sent-28, score-0.35]

11 Both bounds are controlled by the sum of synaptic weights $\|w\|_1$, thereby justifying the constraint introduced in §2. [sent-30, score-0.288]

12 Spike-timing dependent plasticity and its implications for the neural code have been intensively studied in recent years. [sent-35, score-0.192]

13 An information-theoretic perspective on synaptic homeostasis and metabolic cost, complementing the results in this paper, can be found in [12, 13]. [sent-39, score-0.294]

14 Simulations combining synaptic renormalization with burst-STDP can be found in [14]. [sent-40, score-0.288]

15 2 The selectron. We introduce the selectron, which can be considered a biologically motivated adaptation of the perceptron, see §3. [sent-44, score-0.561]

16 The mechanism governing whether or not the selectron spikes is a Heaviside function acting on a weighted sum of synaptic inputs; our contribution is to propose a new reward function and corresponding learning rule. [sent-45, score-1.318]

17 Let X denote the set of N -dimensional {0, 1}-valued vectors forming synaptic inputs to a selectron, and Y = {0, 1} the set of outputs. [sent-47, score-0.296]

18 A selectron spikes according to $y = f_w(x) := H(w^\top x - \vartheta)$ (1), where $H(z) := \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{else} \end{cases}$ is the Heaviside function and $w$ is a $[0, 1] \subset \mathbb{R}$ valued $N$-vector specifying the selectron’s synaptic weights. [sent-48, score-1.426]

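19 Define reward function $R(x, f_w, \nu) = \underbrace{\nu(x)}_{\text{neuromodulators}} \cdot \underbrace{(w^\top x - \vartheta)}_{\text{margin}} \cdot \underbrace{f_w(x)}_{\text{selectivity}} = \begin{cases} \nu(x) \cdot (w^\top x - \vartheta) & \text{if } y = 1 \\ 0 & \text{else} \end{cases}$ (2). [sent-54, score-0.764]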

20 The third term gates rewards according to whether or not the selectron spikes. [sent-59, score-0.564]

21 The reward is thus selected: neuromodulatory signals are ignored by the selectron’s reward function when it does not spike, enabling specialization. [sent-60, score-0.483]

22 The selectron solves the following optimization problem: maximize over $w$: $\hat{R}_n := \sum_{i=1}^{n} \nu(x^{(i)}) \cdot (w^\top x^{(i)} - \vartheta) \cdot f_w(x^{(i)})$ (3), subject to: $\|w\|_1 \le \omega$ [sent-62, score-0.793]
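As a minimal sketch of the spiking rule and the empirical reward in (3), the snippet below evaluates both on binary inputs. The threshold name `theta` is an assumed reading of the symbol that appears garbled in the margin term, and the function names and numeric values are hypothetical, chosen only for illustration.

```python
def margin(w, x, theta):
    """Weighted sum of binary inputs minus the (assumed) threshold theta."""
    return sum(wj * xj for wj, xj in zip(w, x)) - theta


def spike(w, x, theta):
    """Heaviside output: 1 if the margin is strictly positive, else 0."""
    return 1 if margin(w, x, theta) > 0 else 0


def reward(w, x, nu, theta):
    """Neuromodulatory signal times margin, gated by the output spike."""
    return nu * margin(w, x, theta) * spike(w, x, theta)


def empirical_reward(w, samples, theta):
    """Sum of rewards over (input, neuromodulatory signal) pairs."""
    return sum(reward(w, x, nu, theta) for x, nu in samples)


# Hypothetical weights and threshold, for illustration only.
w = [0.5, 0.5, 0.0]
theta = 0.6
samples = [([1, 1, 0], 1), ([1, 0, 0], 1), ([1, 1, 0], -1)]
total = empirical_reward(w, samples, theta)
```

The second sample contributes nothing because the selectron stays silent on it, which is the gating ("selectivity") the text describes. The constraint on the sum of weights is not enforced here.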

23 Optimization problem (3) ensures that selectrons spike for inputs that, on the basis of their empirical sample, reliably lead to neuromodulatory rewards. [sent-66, score-0.415]

24 We postpone discussion of how to impose the constraint to §6, and focus on reward maximization here. [sent-69, score-0.205]

25 Although $f_w(x)$ is not continuous, the reward function is a continuous function of $w$ and is differentiable everywhere except for the “corner” where $w^\top x - \vartheta = 0$. [sent-72, score-0.43]

26 We therefore apply gradient ascent by computing the derivative of (3) with respect to synaptic weights to obtain online learning rule $\Delta w_j = \alpha \cdot \nu(x) \cdot x_j \cdot f_w(x) = \begin{cases} \alpha \cdot \nu(x) & \text{if } x_j = 1 \text{ and } y = 1 \\ 0 & \text{else} \end{cases}$ (4), where update factor $\alpha$ controls the learning rate. [sent-73, score-0.755]

27 The learning rule is selective: regardless of the neuromodulatory signal, synapse $w_{jk}$ is updated only if there is both an input $x_j = 1$ and output spike $y = f_w(x) = 1$. [sent-74, score-0.718]
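The selective update in (4) can be sketched as a single online step. This is an illustration under stated assumptions rather than the authors' implementation: `theta` is an assumed name for the firing threshold, the numeric values are hypothetical, and the constraint on the sum of weights is deliberately left out.

```python
def selectron_update(w, x, nu, theta, alpha):
    """One online step: each weight moves by alpha * nu * x_j * y."""
    y = 1 if sum(wj * xj for wj, xj in zip(w, x)) - theta > 0 else 0
    return [wj + alpha * nu * xj * y for wj, xj in zip(w, x)]


# Hypothetical values, for illustration only.
w0 = [0.5, 0.5, 0.2]
after_reward = selectron_update(w0, [1, 1, 0], nu=1, theta=0.6, alpha=0.1)
after_silence = selectron_update(w0, [1, 0, 0], nu=1, theta=0.6, alpha=0.1)
```

Only synapses with an active input change, and only when the selectron spikes; a silent selectron leaves its weights untouched whatever the neuromodulatory signal.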

28 The selectron is not guaranteed to ﬁnd a global optimum. [sent-75, score-0.537]

29 It is prone to initial condition dependent local optima because rewards depend on output spikes in learning rule (4). [sent-76, score-0.452]

30 The reward function reduces to $R(x, f_w) = (w^\top x - \vartheta) \cdot f_w(x)$. [sent-80, score-0.686]

31 Imposing the constraint yields a more interesting solution where the selectron finds a weight vector summing to $\omega$ [sent-82, score-0.568]

32 which balances (i) frequent spikes and (ii) high margins. [sent-83, score-0.35]

33 Interestingly, in this setting the constraint provides a lever for controlling (lower bounding) rewards per spike n o reward per spike = b R P (fw (x) = 1) c1 · b R . [sent-91, score-0.571]

34 inputs are unrealistic, note that recent neurophysiological evidence suggests neuronal ﬁring – even of nearby neurons – is uncorrelated [18]. [sent-101, score-0.21]

35 3 Relation to leaky integrate-and-fire neurons equipped with STDP. The literature contains an enormous variety of neuronal models, which vary dramatically in sophistication and the extent to which they incorporate the details of the underlying biochemical processes. [sent-103, score-0.244]

36 Similarly, there is a large menagerie of models of synaptic plasticity [19]. [sent-104, score-0.392]

37 Suppose neuron $n_k$ last output a spike at time $t_k$ and receives input spikes at times $t_j$ from neuron $n_j$. [sent-107, score-0.969]

38 Neuron $n_k$ spikes or not according to the Heaviside function applied to the membrane potential $M_w$: $f_w(t) = H(M_w(t) - \vartheta)$ where $M_w(t) = \eta(t - t_k) + \sum_{t_j \le t} w_{jk} \cdot \epsilon(t - t_j)$ at time $t \ge t_k$. [sent-108, score-1.005]

39 Input and output spikes add $\epsilon(t - t_j) = K\left[e^{(t_j - t)/\tau_m} - e^{(t_j - t)/\tau_s}\right]$ and $\eta(t - t_k) = \vartheta\left[K_1 e^{(t_k - t)/\tau_m} - K_2\left(e^{(t_k - t)/\tau_m} - e^{(t_k - t)/\tau_s}\right)\right]$ to the membrane potential for $t_j \le t$ and $t_k \le t$ respectively. [sent-109, score-0.904]

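40 The original STDP update rule [5] is $\Delta w_{jk} = \begin{cases} \alpha_+ \cdot e^{(t_j - t_k)/\tau_+} & \text{if } t_j \le t_k \\ -\alpha_- \cdot e^{(t_k - t_j)/\tau_-} & \text{else} \end{cases}$ (5), where $\tau_+$ and $\tau_-$ are time constants. [sent-111, score-0.42]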

41 STDP potentiates input synapses that spike prior to output spikes and depotentiates input synapses that spike subsequent to output spikes. [sent-112, score-0.89]
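The pairwise STDP rule described above can be sketched as a function of one input spike time and one output spike time. The exponential form and the negative sign on the depression branch follow the standard statement of the rule and are an assumption here, since the displayed equation is garbled; all parameter values are hypothetical.

```python
import math


def stdp_update(t_pre, t_post, alpha_plus, alpha_minus, tau_plus, tau_minus):
    """Weight change for one (input spike, output spike) pair.

    Input at or before the output spike: potentiation, decaying with the gap.
    Input after the output spike: depression, decaying with the gap.
    """
    if t_pre <= t_post:
        return alpha_plus * math.exp((t_pre - t_post) / tau_plus)
    return -alpha_minus * math.exp((t_post - t_pre) / tau_minus)


# Hypothetical parameters (times in milliseconds), for illustration only.
params = dict(alpha_plus=0.03, alpha_minus=0.025, tau_plus=16.8, tau_minus=33.7)
potentiation = stdp_update(t_pre=10.0, t_post=15.0, **params)
depression = stdp_update(t_pre=20.0, t_post=15.0, **params)
```

Both branches shrink in magnitude as the gap between the two spikes grows, so near-coincident spikes dominate the update.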

42 Theorem 2 (the selectron is the fast time constant limit of SRM + STDP). [sent-113, score-0.565]

43 0, the SRM transforms into a selectron with ⇣ ⌘ X fw (t) = H Mw (t) # where Mw = wjk · tk (t). [sent-115, score-0.923]

44 Finally, STDP arises as gradient ascent on a reward function whose limit is the unsupervised setting of reward function (2). [sent-117, score-0.376]

45 Theorem 2 shows that STDP implicitly maximizes a time-discounted analog of the reward function in (3). [sent-118, score-0.21]

46 We expect many models of reward-modulated synaptic plasticity to be analytically tractable in the fast time constant limit. [sent-119, score-0.392]

47 An important property shared by STDP and the selectron is that synaptic (de)potentiation is gated by output spikes; see §A.1 for a comparison with the perceptron, which does not gate synaptic learning. [sent-120, score-0.794]

48 4 An error bound. Maximizing reward function (3) implies that selectrons encode reward estimates into their spikes. [sent-121, score-0.748]

49 Indeed, it recursively justifies incorporating spikes into the reward function via the margin $(w^\top x - \vartheta)$, which only makes sense if upstream spikes predict reward. [sent-122, score-0.946]

50 It is therefore crucial to provide guarantees on the quality of spikes as estimators. [sent-124, score-0.35]

51 Cortical learning may be analogous to boosting: individual neurons have access to a tiny fraction of the total brain state, and so are weak learners; and in the fast time constant limit, neurons are essentially aggregators. [sent-126, score-0.254]

52 The goal is to show how the margin and constraint on synaptic weights improve error bounds. [sent-129, score-0.335]

53 A selectron incurs a 0/1 loss if a spike is followed by negative neuromodulatory feedback: $l(x, f_w, \nu) = \begin{cases} 1 & \text{if } y = 1 \text{ and } \nu(x) = -1 \\ 0 & \text{else} \end{cases}$ (6). [sent-131, score-1.364]
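A small sketch of the 0/1 loss just described, averaged over a sample: a loss of 1 is charged only when the selectron spikes and the neuromodulatory feedback is negative. The function names are hypothetical, and treating any negative value of the feedback as "negative" is an assumption.

```python
def zero_one_loss(y, nu):
    """1 if a spike (y == 1) is followed by negative feedback, else 0."""
    return 1 if (y == 1 and nu < 0) else 0


def empirical_error(spikes, feedback):
    """Mean 0/1 loss over paired spike outputs and feedback signals."""
    losses = [zero_one_loss(y, nu) for y, nu in zip(spikes, feedback)]
    return sum(losses) / len(losses)


# Four hypothetical trials: only the second is a spike followed by punishment.
error = empirical_error([1, 1, 0, 0], [1, -1, -1, 1])
```

Silent trials never incur a loss, which mirrors the way the reward function ignores neuromodulatory signals when the selectron does not spike.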

54 The 0/1 loss fails to take the estimates (spikes) of other selectrons into account and is difficult to optimize, so we also introduce the hinge loss: $h^\kappa(x, f_w, \nu) := \left(\kappa - (w^\top x - \vartheta) \cdot \nu(x)\right)_+ \cdot f_w(x)$, where $(x)_+ := \begin{cases} x & \text{if } x \ge 0 \\ 0 & \text{else} \end{cases}$ (7). [sent-132, score-0.604]

55 An alternate 0/1 loss penalizes a selectron if it (i) fires when it shouldn’t, i. [sent-135, score-0.537]

56 For any selectron nk , let S k = {nk } [ {nj : nj ! [sent-143, score-0.707]

57 For all  1, with probability at least 1 , p ⇥ ⇤ 1 X  (i) 8(N + 1) log(n + 1) + 1 p E l(x, fw , ⌫)  h x , fw , ⌫(x(i) ) +! [sent-145, score-0.512]

58 Second, it shows the capacity term depends on the number of synapses $N$ and the constraint $\omega$ [sent-153, score-0.187]

59 on synaptic weights, rather than the capacity of S k – which can be very large. [sent-154, score-0.292]

60 The hinge loss is difﬁcult to optimize directly since gating with output spikes fw (x) renders it discontinuous. [sent-155, score-0.606]

61 2 E l(x, fw , ⌫)  p 2 # b Rn x(i) , fw , ⌫(x(i) ) + ! [sent-157, score-0.512]

62 5 A bound on the efficacy of inter-neuronal communication. Even if a neuron’s spikes perfectly predict positive neuromodulatory signals, the spikes only matter to the extent they affect other neurons in cortex. [sent-162, score-0.95]

63 In this section we quantify the effect of one selectron’s spikes on another selectron’s expected reward. [sent-165, score-0.35]

64 The efficacy of spikes from selectron $n_j$ on selectron $n_k$ is $\frac{\delta R^k}{\delta x_j} := \frac{E[R^k \mid x_j = 1] - E[R^k \mid x_j = 0]}{1 - 0}$, i. [sent-170, score-1.63]

65 the expected contribution of spikes from selectron nj to selectron nk ’s expected reward, relative to not spiking. [sent-172, score-1.594]
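The efficacy described above, the expected reward of one selectron when an upstream selectron spikes relative to when it does not, can be estimated from samples as a difference of conditional means. This is a sketch under stated assumptions: the function name and the sample values are hypothetical, and plain sample averages stand in for the expectations.

```python
from statistics import fmean


def empirical_efficacy(xj_values, rewards):
    """Mean reward when x_j = 1 minus mean reward when x_j = 0."""
    with_spike = [r for xj, r in zip(xj_values, rewards) if xj == 1]
    without_spike = [r for xj, r in zip(xj_values, rewards) if xj == 0]
    if not with_spike or not without_spike:
        raise ValueError("need at least one sample with and without a spike")
    return fmean(with_spike) - fmean(without_spike)


# Hypothetical samples: upstream spike indicator and downstream reward.
efficacy = empirical_efficacy([1, 1, 0, 0], [0.4, 0.2, 0.1, 0.1])
```

A value near zero would indicate that spikes from the upstream selectron make little difference to the downstream selectron's expected reward.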

66 The notation is intended to suggest an analogy with differentiation – the inﬁnitesimal difference made by spikes on a single synapse. [sent-173, score-0.35]

67 In other words, if spikes from nj make no difference to the expected reward of nk . [sent-175, score-0.694]

68 The following theorem relies on the assumption that the average contribution of neuromodulators is higher after nj spikes than after it does not spike (i. [sent-176, score-0.608]

69 Let $p_j := E[Y^j]$ denote the frequency of spikes from neuron $n_j$. [sent-182, score-0.582]

70 First, the guarantee improves as co-spiking by nj and nk increases. [sent-186, score-0.199]

71 However, the denominators imply that increasing the frequency of nj ’s spikes worsens the guarantee, insofar as nj is not correlated with nk . [sent-187, score-0.668]

72 Similarly, from the third term, increasing nk ’s spikes worsens the guarantee if they do not correlate with nj . [sent-188, score-0.551]

73 An immediate corollary of Theorem 4 is that Hebbian learning rules, such as STDP and the selectron learning rule (4), improve the efﬁcacy of spikes. [sent-189, score-0.578]

74 However, it also shows that naively increasing the frequency of spikes carries a cost. [sent-190, score-0.389]

75 on synaptic strength can be used as a lever to improve guarantees on efﬁcacy. [sent-195, score-0.298]

76 The 1st term in (9) suggests that pruning weak synapses increases the efﬁcacy of spikes, and so may aid learning in populations of selectrons or neurons. [sent-197, score-0.262]

77 Observe that regularizing optimization problem (3) yields maximize: w learning rule: n X R x(i) , fw , ⌫(x(i) ) i=1 2 wj = ↵ · ⌫(x) · xj · fw (x) (kwk1 ! [sent-202, score-0.626]

78 · wj (12) incorporates synaptic renormalization directly into the update. [sent-204, score-0.366]

79 However, (12) requires continuously re-evaluating the sum of synaptic weights. [sent-205, score-0.257]

80 We therefore decouple learning into an online reward maximization phase and an ofﬂine regularization phase which resets the synaptic weights. [sent-206, score-0.572]

81 It has recently been proposed that a function of NREM sleep may be to regulate synaptic weights [28]. [sent-208, score-0.352]

82 Indeed, neurophysiological evidence suggests that average cortical ﬁring rates increase during wakefulness and decrease during sleep, possibly reﬂecting synaptic strengths [29, 30]. [sent-209, score-0.304]

83 Experimental evidence also points to a net increase in dendritic spines (synapses) during waking and a net decrease during sleep [31]. [sent-210, score-0.193]

84 We then performed 700 trials (350 classical and 350 regularized) exposing the neuron to a new pattern for 20 seconds and observed performance under classical and regularized STDP. [sent-215, score-0.296]

85 In the ofﬂine phase, modify synapses once per second according to ⇢ · 3 wj · (! [sent-231, score-0.199]

86 Since synapses are frequently renormalized ofﬂine we incorporate a weak exploratory (potentiation) bias during the online phase which helps avoid local minima. [sent-239, score-0.228]

87 Since computing the sum of synaptic weights is non-physiological, we draw on Theorem 1 and use the neuron’s firing rate when responding to uncorrelated inputs as a proxy for $\|w\|_1$. [sent-241, score-0.296]

88 Thus, in the ofﬂine phase, synapses receive inputs generated as in the online phase but without repeated patterns. [sent-242, score-0.243]

89 Motivated by Remark 4, we introduce bias $(\frac{3}{2} - w_j)$ in the offline phase to ensure weaker synapses are downscaled more than strong synapses. [sent-244, score-0.288]

90 Regularized STDP alternates between 2 seconds online and 4 seconds ofﬂine, which sufﬁces to renormalize synaptic strengths. [sent-248, score-0.332]

91 The frequency of the offline phase could be reduced by decreasing the update factors $\alpha_\pm$, presenting stimuli less frequently (than 7 times per second), or adding inhibitory neurons to the system. [sent-249, score-0.212]

92 A summary of results is presented in the table below: accuracy quantiﬁes the fraction of spikes that co-occur with each pattern. [sent-251, score-0.35]

93 It should be noted that regularized neurons were not only online for 20 seconds but also ofﬂine – and exposed to Poisson noise – for 40 seconds. [sent-253, score-0.239]

94 7 Discussion The selectron provides a bridge between a particular model of spiking neurons – the Spike Response Model [20] with the original spike-timing dependent plasticity rule [5] – and models that are amenable to learning-theoretic analysis. [sent-265, score-0.92]

95 Our hope is that the selectron and related models lead to an improved understanding of the principles underlying learning in cortex. [sent-266, score-0.537]

96 The selectron is an interesting model in its own right: it embeds reward estimates into spikes and maximizes a margin that improves error bounds. [sent-268, score-1.137]

97 It imposes a constraint on synaptic weights that: concentrates rewards/spike, tightens error bounds and improves guarantees on spiking efﬁcacy. [sent-269, score-0.375]

98 [5] Song S, Miller KD, Abbott LF: Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. [sent-284, score-0.257]

99 [28] Tononi G, Cirelli C: Sleep function and synaptic homeostasis. [sent-335, score-0.257]

100 [29] Vyazovskiy VV, Cirelli C, Pfister-Genskow M, Faraguna U, Tononi G: Molecular and electrophysiological evidence for net synaptic potentiation in wake and depression in sleep. [sent-337, score-0.345]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('selectron', 0.537), ('spikes', 0.35), ('stdp', 0.298), ('synaptic', 0.257), ('fw', 0.256), ('reward', 0.174), ('spike', 0.149), ('neuromodulatory', 0.135), ('plasticity', 0.135), ('synapses', 0.121), ('neurons', 0.115), ('cacy', 0.102), ('sleep', 0.095), ('ine', 0.092), ('nk', 0.092), ('selectrons', 0.092), ('tononi', 0.092), ('nj', 0.078), ('wj', 0.078), ('tk', 0.077), ('neuron', 0.075), ('tj', 0.073), ('srm', 0.068), ('potentiation', 0.062), ('cirelli', 0.061), ('mw', 0.059), ('phase', 0.058), ('spiking', 0.058), ('neuronal', 0.056), ('balduzzi', 0.054), ('wjk', 0.053), ('perceptron', 0.051), ('leaky', 0.05), ('synapse', 0.048), ('margin', 0.047), ('cortical', 0.047), ('faraguna', 0.046), ('waking', 0.046), ('regularized', 0.045), ('ring', 0.045), ('classical', 0.044), ('boosting', 0.041), ('rule', 0.041), ('lever', 0.041), ('pj', 0.04), ('frequency', 0.039), ('inputs', 0.039), ('tweak', 0.037), ('maass', 0.037), ('heaviside', 0.037), ('gerstner', 0.037), ('metabolic', 0.037), ('trials', 0.037), ('analog', 0.036), ('xj', 0.036), ('capacity', 0.035), ('dependent', 0.034), ('rk', 0.033), ('mpi', 0.032), ('constraint', 0.031), ('abbott', 0.031), ('feedback', 0.031), ('besserve', 0.031), ('depotentiation', 0.031), ('downscaled', 0.031), ('hedonistic', 0.031), ('neuromodulators', 0.031), ('olcese', 0.031), ('renormalization', 0.031), ('runaway', 0.031), ('vyazovskiy', 0.031), ('worsens', 0.031), ('exposed', 0.029), ('improves', 0.029), ('limit', 0.028), ('learners', 0.027), ('masquelier', 0.027), ('wc', 0.027), ('fusi', 0.027), ('bernoulli', 0.027), ('rewards', 0.027), ('membrane', 0.027), ('accuracies', 0.026), ('else', 0.026), ('pattern', 0.026), ('schapire', 0.026), ('net', 0.026), ('seconds', 0.025), ('remark', 0.025), ('online', 0.025), ('legenstein', 0.025), ('upstream', 0.025), ('pruning', 0.025), ('biologically', 0.024), ('weak', 0.024), ('freund', 0.023), ('intensively', 0.023), ('perceptrons', 0.023), ('rosenblatt', 0.023), ('equipped', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

Author: David Balduzzi, Michel Besserve

2 0.29079732 190 nips-2012-Learning optimal spike-based representations

Author: Ralph Bourdoukan, David Barrett, Sophie Deneve, Christian K. Machens

Abstract: How can neural networks learn to represent information optimally? We answer this question by deriving spiking dynamics and learning dynamics directly from a measure of network performance. We ﬁnd that a network of integrate-and-ﬁre neurons undergoing Hebbian plasticity can learn an optimal spike-based representation for a linear decoder. The learning rule acts to minimise the membrane potential magnitude, which can be interpreted as a representation error after learning. In this way, learning reduces the representation error and drives the network into a robust, balanced regime. The network becomes balanced because small representation errors correspond to small membrane potentials, which in turn results from a balance of excitation and inhibition. The representation is robust because neurons become self-correcting, only spiking if the representation error exceeds a threshold. Altogether, these results suggest that several observed features of cortical dynamics, such as excitatory-inhibitory balance, integrate-and-ﬁre dynamics and Hebbian plasticity, are signatures of a robust, optimal spike-based code. A central question in neuroscience is to understand how populations of neurons represent information and how they learn to do so. Usually, learning and information representation are treated as two different functions. From the outset, this separation seems like a good idea, as it reduces the problem into two smaller, more manageable chunks. Our approach, however, is to study these together. This allows us to treat learning and information representation as two sides of a single mechanism, operating at two different timescales. Experimental work has given us several clues about the regime in which real networks operate in the brain. 
Some of the most prominent observations are: (a) high trial-to-trial variability: a neuron responds differently to repeated, identical inputs [1, 2]; (b) asynchronous firing at the network level: spike trains of different neurons are at most very weakly correlated [3, 4, 5]; (c) tight balance of excitation and inhibition: every excitatory input is met by an inhibitory input of equal or greater size [6, 7, 8]; and (d) spike-timing-dependent plasticity (STDP): the strength of synapses changes as a function of presynaptic and postsynaptic spike times [9]. Previously, it has been shown that observations (a)–(c) can be understood as signatures of an optimal, spike-based code [10, 11]. The essential idea is to derive spiking dynamics from the assumption that neurons only fire if their spike improves information representation. Information in a network may originate from several possible sources: external sensory input, external neural network input, or alternatively, it may originate within the network itself as a memory, or as a computation. Whatever the source, this initial assumption leads directly to the conclusion that a network of integrate-and-fire neurons can optimally represent a signal while exhibiting properties (a)–(c). A major problem with this framework is that network connectivity must be completely specified a priori, and requires the tuning of $N^2$ parameters, where $N$ is the number of neurons in the network. Although this is feasible mathematically, it is unclear how a real network could tune itself into this optimal regime. In this work, we solve this problem using a simple synaptic learning rule. The key insight is that the plasticity rule can be derived from the same basic principle as the spiking rule in the earlier work: namely, that any change should improve information representation. (∗ Authors contributed equally.)
Surprisingly, this can be achieved with a local, Hebbian learning rule, where synaptic plasticity is proportional to the product of presynaptic ﬁring rates with post-synaptic membrane potentials. Spiking and synaptic plasticity then work hand in hand towards the same goal: the spiking of a neuron decreases the representation error on a fast time scale, thereby giving rise to the actual population representation; synaptic plasticity decreases the representation error on a slower time scale, thereby improving or maintaining the population representation. For a large set of initial connectivities and spiking dynamics, neural networks are driven into a balanced regime, where excitation and inhibition cancel each other and where spike trains are asynchronous and irregular. Furthermore, the learning rule that we derive reproduces the main features of STDP (property (d) above). In this way, a network can learn to represent information optimally, with synaptic, neural and network dynamics consistent with those observed experimentally. 1 Derivation of the learning rule for a single neuron We begin by deriving a learning rule for a single neuron with an autapse (a self-connection) (Fig. 1A). Our approach is to derive synaptic dynamics for the autapse and spiking dynamics for the neuron such that the neuron learns to optimally represent a time-varying input signal. We will derive a learning rule for networks of neurons later, after we have developed the fundamental concepts for the single neuron case. Our ﬁrst step is to derive optimal spiking dynamics for the neuron, so that we have a target for our learning rule. We do this by making two simple assumptions [11]. 
First, we assume that the neuron can provide an estimate or read-out $\hat{x}(t)$ of a time-dependent signal $x(t)$ by filtering its spike train $o(t)$ as follows: $\dot{\hat{x}}(t) = -\hat{x}(t) + \Gamma o(t)$ (1), where $\Gamma$ is a fixed read-out weight, which we will refer to as the neuron’s “output kernel”, and the spike train can be written as $o(t) = \sum_i \delta(t - t_i)$, where $\{t_i\}$ are the spike times. Next, we assume that the neuron only produces a spike if that spike improves the read-out, where we measure the read-out performance through a simple squared-error loss function: $L(t) = (x(t) - \hat{x}(t))^2$ (2). With these two assumptions, we can now derive optimal spiking dynamics. First, we observe that if the neuron produces an additional spike at time $t$, the read-out increases by $\Gamma$, and the loss function becomes $L(t \mid \text{spike}) = (x(t) - (\hat{x}(t) + \Gamma))^2$. This allows us to restate our spiking rule as follows: the neuron should only produce a spike if $L(t \mid \text{no spike}) > L(t \mid \text{spike})$, or $(x(t) - \hat{x}(t))^2 > (x(t) - (\hat{x}(t) + \Gamma))^2$. Now, expanding both sides of this inequality, defining $V(t) \equiv \Gamma(x(t) - \hat{x}(t))$ and defining $T \equiv \Gamma^2/2$, we find that the neuron should only spike if: $V(t) > T$ (3). We interpret $V(t)$ to be the membrane potential of the neuron, and we interpret $T$ as the spike threshold. This interpretation allows us to understand the membrane potential functionally: the voltage is proportional to a prediction error, the difference between the read-out $\hat{x}(t)$ and the actual signal $x(t)$. A spike is an error reduction mechanism: the neuron only spikes if the error exceeds the spike threshold. This is a greedy minimisation, in that the neuron fires a spike whenever that action decreases $L(t)$ without considering the future impact of that spike. Importantly, the neuron does not require direct access to the loss function $L(t)$. To determine the membrane potential dynamics, we take the derivative of the voltage, which gives us $\dot{V} = \Gamma(\dot{x} - \dot{\hat{x}})$.
(Here, and in the following, we will drop the time index for notational brevity.) Now, using Eqn. (1) we obtain $\dot{V} = \Gamma\dot{x} - \Gamma(-\hat{x} + \Gamma o) = -\Gamma(x - \hat{x}) + \Gamma(x + \dot{x}) - \Gamma^2 o$, so that: $\dot{V} = -V + \Gamma c - \Gamma^2 o$ (4), where $c = x + \dot{x}$ is the neural input. This corresponds exactly to the dynamics of a leaky integrate-and-fire neuron with an inhibitory autapse [footnote 1] of strength $\Gamma^2$, and a feedforward connection strength $\Gamma$. The dynamics and connectivity guarantee that a neuron spikes at just the right times to optimise the loss function (Fig. 1B). In addition, it is especially robust to noise of different forms, because of its error-correcting nature. If $x$ is constant in time, the voltage will rise up to the threshold $T$ at which point a spike is fired, adding a delta function to the spike train $o$ at time $t$, thereby producing a read-out $\hat{x}$ that is closer to $x$ and causing an instantaneous drop in the voltage through the autapse, by an amount $\Gamma^2 = 2T$, effectively resetting the voltage to $V = -T$. We now have a target for learning: we know the connection strength that a neuron must have at the end of learning if it is to represent information optimally, for a linear read-out. We can use this target to derive synaptic dynamics that can learn an optimal representation from experience. Specifically, we consider an integrate-and-fire neuron with some arbitrary autapse strength $\omega$. The dynamics of this neuron are given by $\dot{V} = -V + \Gamma c - \omega o$ (5). This neuron will not produce the correct spike train for representing $x$ through a linear read-out (Eqn. (1)) unless $\omega = \Gamma^2$. Our goal is to derive a dynamical equation for the synapse $\omega$ so that the spike train becomes optimal. We do this by quantifying the loss that we are incurring by using the suboptimal strength, and then deriving a learning rule that minimises this loss with respect to $\omega$. The loss function underlying the spiking dynamics determined by Eqn. (5) can be found by reversing the previous membrane potential analysis.
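The membrane dynamics with an autaptic reset can be explored with a short forward-Euler simulation. This is a minimal sketch under stated assumptions: the signal is held constant (so the neural input equals the signal), each spike is applied as an instantaneous jump, and the step size, duration and parameter values are hypothetical.

```python
def simulate_autapse_neuron(x, gamma, omega, dt=1e-3, steps=5000):
    """Euler-integrate dV/dt = -V + gamma*c - omega*o for a constant signal x.

    Returns the final read-out, the final voltage and the spike count.
    """
    threshold = gamma ** 2 / 2.0  # T = Gamma^2 / 2
    c = x                         # c = x + dx/dt, and dx/dt = 0 here
    v = gamma * x                 # V = Gamma * (x - x_hat) with x_hat = 0
    x_hat = 0.0
    n_spikes = 0
    for _ in range(steps):
        v += dt * (-v + gamma * c)   # leak plus feedforward drive
        x_hat += dt * (-x_hat)       # read-out decays between spikes
        if v > threshold:
            v -= omega               # autaptic reset
            x_hat += gamma           # each spike adds Gamma to the read-out
            n_spikes += 1
    return x_hat, v, n_spikes


# With the optimal reset omega = Gamma^2 the read-out should track the signal.
x_hat, v, n_spikes = simulate_autapse_neuron(x=1.0, gamma=0.1, omega=0.1 ** 2)
```

With the reset set to the square of the read-out weight, the voltage stays within one threshold of zero after an initial catch-up, so the read-out error stays within half a read-out weight. Other reset strengths break that correspondence, which is what the learning rule is meant to correct.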
First, we integrate the differential equation for $V$, assuming that $\omega$ changes on time scales much slower than the membrane potential. We obtain the following (formal) solution: $V = \Gamma x - \omega\bar{o}$ (6), where $\bar{o}$ is determined by $\dot{\bar{o}} = -\bar{o} + o$. The solution to this latter equation is $\bar{o} = h * o$, a convolution of the spike train with the exponential kernel $h(\tau) = \theta(\tau)\exp(-\tau)$. As such, it is analogous to the instantaneous firing rate of the neuron. Now, using Eqn. (6), and rewriting the read-out as $\hat{x} = \Gamma\bar{o}$, we obtain the loss incurred by the sub-optimal neuron, $L = (x - \hat{x})^2 = \frac{1}{\Gamma^2}\left[V^2 + 2(\omega - \Gamma^2)V\bar{o} + (\omega - \Gamma^2)^2\bar{o}^2\right]$ (7). We observe that the last two terms of Eqn. (7) will vanish whenever $\omega = \Gamma^2$, i.e., when the optimal reset has been found. We can therefore simplify the problem by defining an alternative loss function, $L_V = \frac{1}{2}V^2$ (8), which has the same minimum as the original loss ($V = 0$ or $x = \hat{x}$, compare Eqn. (2)), but yields a simpler learning algorithm. We can now calculate how changes to $\omega$ affect $L_V$: $\frac{\partial L_V}{\partial \omega} = V\frac{\partial V}{\partial \omega} = -V\bar{o} - V\omega\frac{\partial \bar{o}}{\partial \omega}$ (9). We can ignore the last term in this equation (as we will show below). Finally, using simple gradient descent, we obtain a simple Hebbian-like synaptic plasticity rule: $\tau\dot{\omega} = -\frac{\partial L_V}{\partial \omega} = V\bar{o}$ (10), where $\tau$ is the learning time constant. (Footnote 1: This contribution of the autapse can also be interpreted as the reset of an integrate-and-fire neuron. Later, when we generalise to networks of neurons, we shall employ this interpretation.) This synaptic learning rule is capable of learning the synaptic weight $\omega$ that minimises the difference between $x$ and $\hat{x}$ (Fig. 1B). During learning, the synaptic weight changes in proportion to the post-synaptic voltage $V$ and the pre-synaptic firing rate $\bar{o}$ (Fig. 1C). As such, this is a Hebbian learning rule. Of course, in this single neuron case, the pre-synaptic neuron and post-synaptic neuron are the same neuron. The synaptic weight gradually approaches its optimal value $\Gamma^2$.
However, it never completely stabilises, because learning never stops as long as neurons are spiking. Instead, the synapse oscillates closely about the optimal value (Fig. 1D). This is also a “greedy” learning rule, similar to the spiking rule, in that it seeks to minimise the error at each instant in time, without regard for the future impact of those changes. To demonstrate that the second term in Eqn. (9) can be neglected we note that the equations for $V$, $\bar{o}$, and $\omega$ define a system of coupled differential equations that can be solved analytically by integrating between spikes. This results in a simple recurrence relation for changes in $\omega$ from the $i$th to the $(i + 1)$th spike, $\omega_{i+1} = \omega_i + \frac{\omega_i(\omega_i - 2T)}{\tau(T - \Gamma c - \omega_i)}$ (11). This iterative equation has a single stable fixed point at $\omega = 2T = \Gamma^2$, proving that the neuron’s autaptic weight or reset will approach the optimal solution. 2 Learning in a homogeneous network. We now generalise our learning rule derivation to a network of $N$ identical, homogeneously connected neurons. This generalisation is reasonably straightforward because many characteristics of the single neuron case are shared by a network of identical neurons. We will return to the more general case of heterogeneously connected neurons in the next section. We begin by deriving optimal spiking dynamics, as in the single neuron case. This provides a target for learning, which we can then use to derive synaptic dynamics. As before, we want our network to produce spikes that optimally represent a variable $x$ for a linear read-out. We assume that the read-out $\hat{x}$ is provided by summing and filtering the spike trains of all the neurons in the network: $\dot{\hat{x}} = -\hat{x} + \mathbf{\Gamma}\mathbf{o}$ (12) [footnote 2], where the row vector $\mathbf{\Gamma} = (\Gamma, \ldots, \Gamma)$ contains the read-out weights of the neurons and the column vector $\mathbf{o} = (o_1, \ldots, o_N)$ their spike trains.
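The recurrence relation for the autaptic weight from one spike to the next can be iterated numerically to watch it settle at twice the threshold. This is a sketch under stated assumptions: the input is held constant, the parameter values are hypothetical, and the drive is taken to exceed the threshold so that the denominator stays negative.

```python
def iterate_autapse_weight(omega, gamma, c, tau, n_spikes):
    """Apply the spike-to-spike recurrence for omega a fixed number of times."""
    threshold = gamma ** 2 / 2.0  # T = Gamma^2 / 2
    for _ in range(n_spikes):
        omega += omega * (omega - 2.0 * threshold) / (
            tau * (threshold - gamma * c - omega)
        )
    return omega


# Hypothetical parameters: 2T = Gamma^2 = 0.01 is the predicted fixed point.
from_above = iterate_autapse_weight(0.05, gamma=0.1, c=1.0, tau=1.0, n_spikes=500)
from_below = iterate_autapse_weight(0.002, gamma=0.1, c=1.0, tau=1.0, n_spikes=500)
```

Starting above or below the fixed point, the weight moves monotonically towards it for these values; a weight of exactly zero is also a fixed point of the recurrence, but an unstable one.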
Here, we have used identical read-out weights for each neuron, because this indirectly leads to homogeneous connectivity, as we will demonstrate. Next, we assume that a neuron only spikes if that spike reduces a loss function. This spiking rule is similar to the single neuron spiking rule except that this time there is some ambiguity about which neuron should spike to represent a signal. Indeed, there are many different spike patterns that provide exactly the same estimate $\hat{x}$. For example, one neuron could fire regularly at a high rate (exactly like our previous single neuron example) while all others are silent. To avoid this firing rate ambiguity, we use a modified loss function that selects, amongst all equivalent solutions, those with the smallest neural firing rates. We do this by adding a ‘metabolic cost’ term to our loss function, so that high firing rates are penalised:

$$L = (x - \hat{x})^2 + \mu \|\bar{\mathbf{o}}\|^2, \qquad (13)$$

where $\mu$ is a small positive constant that controls the cost-accuracy trade-off, akin to a regularisation parameter. Each neuron in the optimal network will seek to reduce this loss function by firing a spike. Specifically, the $i$th neuron will spike whenever $L(\text{no spike in } i) > L(\text{spike in } i)$. This leads to the following spiking rule for the $i$th neuron:

$$V_i > T_i, \qquad (14)$$

where $V_i \equiv \Gamma(x - \hat{x}) - \mu \bar{o}_i$ and $T_i \equiv \Gamma^2/2 + \mu/2$. We can naturally interpret $V_i$ as the membrane potential of the $i$th neuron and $T_i$ as the spiking threshold of that neuron. As before, we can now derive membrane potential dynamics:

$$\dot{\mathbf{V}} = -\mathbf{V} + \boldsymbol{\Gamma}^T c - (\boldsymbol{\Gamma}^T \boldsymbol{\Gamma} + \mu \mathbf{I})\,\mathbf{o}, \qquad (15)$$

where $\mathbf{I}$ is the identity matrix and $\boldsymbol{\Gamma}^T \boldsymbol{\Gamma} + \mu \mathbf{I}$ is the network connectivity. We can interpret the self-connection terms $\{\Gamma^2 + \mu\}$ as voltage resets that decrease the voltage of any neuron that spikes.

Footnote 2: The read-out weights must scale as $\Gamma \sim 1/N$ so that firing rates are not unrealistically small in large networks. We can see this by calculating the average firing rate $\sum_{i=1}^{N} \bar{o}_i / N \approx x/(\Gamma N) \sim O(N/N) \sim O(1)$.
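The greedy spiking rule of Eqn. (14) can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: it assumes forward-Euler time stepping, a constant signal $x$, at most one spike per time step, and ties broken by the lowest neuron index. The function name and all parameter values are hypothetical choices for illustration.

```python
def simulate_homogeneous_network(n_neurons=20, x=1.0, gamma=0.1, mu=1e-4,
                                 dt=1e-3, n_steps=20000):
    """Greedy spiking rule V_i > T_i for a constant signal x.

    Assumptions of this sketch: forward-Euler stepping, at most one spike
    per time step, ties broken by the lowest neuron index.
    """
    threshold = gamma ** 2 / 2 + mu / 2   # T_i = Gamma^2/2 + mu/2
    x_hat = 0.0                           # linear read-out of the spikes
    o_bar = [0.0] * n_neurons             # exponentially filtered spike trains
    counts = [0] * n_neurons              # number of spikes per neuron

    for _ in range(n_steps):
        # Leak of the read-out and of the filtered spike trains.
        x_hat -= dt * x_hat
        o_bar = [r - dt * r for r in o_bar]

        # Membrane potentials V_i = Gamma * (x - x_hat) - mu * o_bar_i.
        v = [gamma * (x - x_hat) - mu * r for r in o_bar]

        # Only the neuron with the largest potential is allowed to spike.
        i = max(range(n_neurons), key=lambda k: v[k])
        if v[i] > threshold:
            x_hat += gamma      # the spike moves the read-out by Gamma
            o_bar[i] += 1.0     # and raises that neuron's filtered rate
            counts[i] += 1
    return x_hat, counts


x_hat, counts = simulate_homogeneous_network()
print("read-out:", round(x_hat, 3), "spikes per neuron:", counts)
```

In this sketch the read-out stays within about one read-out weight of the signal, and the metabolic term spreads the spikes across the population, because the neuron that fired least recently has the smallest $\bar{o}_i$ and hence the largest membrane potential.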
This optimal network is equivalent to a network of identical integrate-and-fire neurons with homogeneous inhibitory connectivity. The network has some interesting dynamical properties. The voltages of all the neurons are largely synchronous, all increasing to the spiking threshold at about the same time (footnote 3; Fig. 1F). Nonetheless, neural spiking is asynchronous. The first neuron to spike will reset itself by $\Gamma^2 + \mu$, and it will inhibit all the other neurons in the network by $\Gamma^2$. This mechanism prevents neurons from spiking synchronously. The population as a whole acts similarly to the single neuron in our previous example. Each neuron fires regularly, even if a different neuron fires in every integration cycle.

Footnote 3: The first neuron to spike will be random if there is some membrane potential noise.

[Figure 1 graphic not reproduced. Panels (A) to (F); panel titles "start of learning" and "end of learning"; axis labels include time, neuron, V, O, ω and D.]

Figure 1: Learning in a single neuron and a homogeneous network. (A) A single neuron represents an input signal $x$ by producing an output $\hat{x}$. (B) During learning, the single neuron output $\hat{x}$ (solid red line, top panel) converges towards the input $x$ (blue). Similarly, for a homogeneous network the output $\hat{x}$ (dashed red line, top panel) converges towards $x$. Connectivity also converges towards optimal connectivity in both the single neuron case (solid black line, middle panel) and the homogeneous network case (dashed black line, middle panel), as quantified by $D = \max_{i,j}\left( |\Omega_{ij} - \Omega^{\mathrm{opt}}_{ij}|^2 / |\Omega^{\mathrm{opt}}_{ij}|^2 \right)$ at each point in time. Consequently, the membrane potential reset (bottom panel) converges towards the optimal reset (green line, bottom panel). Spikes are indicated by blue vertical marks, and are produced when the membrane potential reaches threshold (bottom panel). Here, we have rescaled time, as indicated, for clarity. (C) Our learning rule dictates that the autapse $\omega$ in our single neuron (bottom panel) changes in proportion to the membrane potential (top panel) and the firing rate (middle panel). (D) At the end of learning, the reset $\omega$ fluctuates weakly about the optimal value. (E) For a homogeneous network, neurons spike regularly at the start of learning, as shown in this raster plot. Membrane potentials of different neurons are weakly correlated. (F) At the end of learning, spiking is very irregular and membrane potentials become more synchronous.

The design of this optimal network requires the tuning of $N(N-1)$ synaptic parameters. How can an arbitrary network of integrate-and-fire neurons learn this optimum? As before, we address this question by using the optimal network as a target for learning. We start with an arbitrarily connected network of integrate-and-fire neurons:

$$\dot{\mathbf{V}} = -\mathbf{V} + \boldsymbol{\Gamma}^T c - \boldsymbol{\Omega}\mathbf{o}, \qquad (16)$$

where $\boldsymbol{\Omega}$ is a matrix of connectivity weights, which includes the resets of the individual neurons. Assuming that learning occurs on a slow time scale, we can rewrite this equation as

$$\mathbf{V} = \boldsymbol{\Gamma}^T x - \boldsymbol{\Omega}\bar{\mathbf{o}}. \qquad (17)$$

Now, repeating the arguments from the single neuron derivation, we modify the loss function to obtain an online learning rule. Specifically, we set $L_V = \|\mathbf{V}\|^2/2$, and calculate the gradient:

$$\frac{\partial L_V}{\partial \Omega_{ij}} = \sum_k V_k \frac{\partial V_k}{\partial \Omega_{ij}} = -\sum_k V_k \delta_{ki}\,\bar{o}_j - \sum_{kl} V_k \Omega_{kl} \frac{\partial \bar{o}_l}{\partial \Omega_{ij}}. \qquad (18)$$

We can simplify this equation considerably by observing that the contribution of the second summation is largely averaged out under a wide variety of realistic conditions (footnote 4). Therefore, it can be neglected, and we obtain the following local learning rule:

$$\tau \dot{\Omega}_{ij} = -\frac{\partial L_V}{\partial \Omega_{ij}} = V_i \bar{o}_j. \qquad (19)$$

This is a Hebbian plasticity rule, whereby connectivity changes in proportion to the presynaptic firing rate $\bar{o}_j$ and post-synaptic membrane potential $V_i$. We assume that the neural thresholds are set to a constant $T$ and that the neural resets are set to their optimal values $-T$. In the previous section we demonstrated that these resets can be obtained by a Hebbian plasticity rule (Eqn. (10)). This learning rule minimises the difference between the read-out and the signal, by approaching the optimal recurrent connection strengths for the network (Fig. 1B). As in the single neuron case, learning does not stop, so the connection strengths fluctuate close to their optimal value. During learning, network activity becomes progressively more asynchronous as it progresses towards optimal connectivity (Fig. 1E, F).

3 Learning in the general case

Now that we have developed the fundamental concepts underlying our learning rule, we can derive a learning rule for the more general case of a network of $N$ arbitrarily connected leaky integrate-and-fire neurons. Our goal is to understand how such networks can learn to optimally represent a $J$-dimensional signal $\mathbf{x} = (x_1, \ldots, x_J)$, using the read-out equation $\dot{\hat{\mathbf{x}}} = -\hat{\mathbf{x}} + \boldsymbol{\Gamma}\mathbf{o}$. We consider a network with the following membrane potential dynamics:

$$\dot{\mathbf{V}} = -\mathbf{V} + \boldsymbol{\Gamma}^T \mathbf{c} - \boldsymbol{\Omega}\mathbf{o}, \qquad (20)$$

where $\mathbf{c}$ is a $J$-dimensional input. We assume that this input is related to the signal according to $\mathbf{c} = \dot{\mathbf{x}} + \mathbf{x}$. This assumption can be relaxed by treating the input as the control for an arbitrary linear dynamical system, in which case the signal represented by the network is the output of such a computation [11]. However, this further generalisation is beyond the scope of this work. As before, we need to identify the optimal recurrent connectivity so that we have a target for learning. Most generally, the optimal recurrent connectivity is $\boldsymbol{\Omega}^{\mathrm{opt}} \equiv \boldsymbol{\Gamma}^T \boldsymbol{\Gamma} + \mu \mathbf{I}$.
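As a small worked example of the target connectivity, the sketch below builds $\boldsymbol{\Gamma}^T \boldsymbol{\Gamma} + \mu \mathbf{I}$ from a read-out matrix stored as a nested list. The function name and the storage layout (`gamma[k][i]` is the weight of neuron `i` on signal dimension `k`) are assumptions made for illustration.

```python
def optimal_connectivity(gamma, mu):
    """Return Omega_opt = Gamma^T Gamma + mu * I as a nested list.

    gamma is a J x N nested list; gamma[k][i] is the read-out weight of
    neuron i on signal dimension k (an assumed storage layout).
    """
    n_dims = len(gamma)
    n_neurons = len(gamma[0])
    omega = [[0.0] * n_neurons for _ in range(n_neurons)]
    for i in range(n_neurons):
        for j in range(n_neurons):
            # Overlap between the read-out weights of neurons i and j.
            omega[i][j] = sum(gamma[k][i] * gamma[k][j] for k in range(n_dims))
        omega[i][i] += mu   # metabolic cost only enters the self-connections
    return omega


# Two signal dimensions, two neurons.
print(optimal_connectivity([[1.0, 2.0], [0.0, 1.0]], mu=0.5))
```

Neurons with similar read-out weights receive strong mutual connections in this matrix, while neurons with orthogonal weights are left unconnected.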
The output kernels of the individual neurons, $\boldsymbol{\Gamma}_i$, are given by the columns of $\boldsymbol{\Gamma}$, and their spiking thresholds by $T_i \equiv \|\boldsymbol{\Gamma}_i\|^2/2 + \mu/2$. With these connections and thresholds, we find that a network of integrate-and-fire neurons will produce spike trains in such a way that the loss function $L = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 + \mu \|\bar{\mathbf{o}}\|^2$ is minimised, where the read-out is given by $\hat{\mathbf{x}} = \boldsymbol{\Gamma}\bar{\mathbf{o}}$. We can show this by prescribing a greedy (footnote 5) spike rule: a spike is fired by neuron $i$ whenever $L(\text{no spike in } i) > L(\text{spike in } i)$ [11]. The resulting spike generation rule is

$$V_i > T_i, \qquad (21)$$

where $V_i \equiv \boldsymbol{\Gamma}_i^T(\mathbf{x} - \hat{\mathbf{x}}) - \mu \bar{o}_i$ is interpreted as the membrane potential.

Footnote 4: From the definition of the membrane potential we can see that $V_k \sim O(1/N)$ because $\Gamma \sim 1/N$. Therefore, the size of the first term in Eqn. (18) is $\sum_k V_k \delta_{ki} \bar{o}_j = V_i \bar{o}_j \sim O(1/N)$. Therefore, the second term can be ignored if $\sum_{kl} V_k \Omega_{kl}\, \partial \bar{o}_l / \partial \Omega_{ij} \ll O(1/N)$. This happens if $\Omega_{kl} \ll O(1/N^2)$, as at the start of learning. It also happens towards the end of learning if the terms $\{\Omega_{kl}\, \partial \bar{o}_l / \partial \Omega_{ij}\}$ are weakly correlated with zero mean, or if the membrane potentials $\{V_i\}$ are weakly correlated with zero mean.

Footnote 5: Despite being greedy, this spiking rule can generate firing rates that are practically identical to the optimal solutions: we checked this numerically in a large ensemble of networks with randomly chosen kernels.

[Figure 2 graphic not reproduced. Panels (A) to (F); panel titles "start of learning" and "end of learning"; axis labels include time, neuron, V, L, D, F, E-I input, ISI Δt and P(Δt).]

Figure 2: Learning in a heterogeneous network. (A) A network of neurons represents an input signal $\mathbf{x}$ by producing an output $\hat{\mathbf{x}}$. (B) During learning, the loss $L$ decreases (top panel).
The difference between the connection strengths and the optimal strengths also decreases (middle panel), as quantified by the mean difference (solid line), given by $D = \|\boldsymbol{\Omega} - \boldsymbol{\Omega}^{\mathrm{opt}}\|^2 / \|\boldsymbol{\Omega}^{\mathrm{opt}}\|^2$, and the maximum difference (dashed line), given by $\max_{i,j}\left( |\Omega_{ij} - \Omega^{\mathrm{opt}}_{ij}|^2 / |\Omega^{\mathrm{opt}}_{ij}|^2 \right)$. The mean population firing rate (solid line, bottom panel) also converges towards the optimal firing rate (dashed line, bottom panel). (C, E) Before learning, a raster plot of population spiking shows that neurons produce bursts of spikes (upper panel). The network output $\hat{\mathbf{x}}$ (red line, middle panel) fails to represent $\mathbf{x}$ (blue line, middle panel). The excitatory input (red, bottom left panel) and inhibitory input (green, bottom left panel) to a randomly selected neuron are not tightly balanced. Furthermore, a histogram of interspike intervals shows that spiking activity is not Poisson, as indicated by the red line that represents a best-fit exponential distribution. (D, F) At the end of learning, spiking activity is irregular and Poisson-like, excitatory and inhibitory input is tightly balanced and $\hat{\mathbf{x}}$ matches $\mathbf{x}$.

How can we learn this optimal connection matrix? As before, we can derive a learning rule by minimising the cost function $L_V = \|\mathbf{V}\|^2/2$. This leads to a Hebbian learning rule with the same form as before:

$$\tau \dot{\Omega}_{ij} = V_i \bar{o}_j. \qquad (22)$$

Again, we assume that the neural resets are given by $-T_i$. Furthermore, in order for this learning rule to work, we must assume that the network input explores all possible directions in the $J$-dimensional input space (since the kernels $\boldsymbol{\Gamma}_i$ can point in any of these directions). The learning performance does not critically depend on how the input variable space is sampled as long as the exploration is extensive. In our simulations, we randomly sample the input $\mathbf{c}$ from a Gaussian white noise distribution at every time step for the entire duration of the learning.
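The learning rule of Eqn. (22) and the mean connectivity difference $D$ from the Figure 2 caption can be sketched as two small helpers. This is a hedged illustration rather than the authors' implementation: it assumes a forward-Euler discretisation of the rule and the Frobenius norm for $D$, and the function names are hypothetical.

```python
def hebbian_update(omega, v, o_bar, tau, dt):
    """One forward-Euler step of tau * dOmega_ij/dt = V_i * o_bar_j.

    omega is an N x N nested list, v the post-synaptic membrane potentials
    and o_bar the filtered pre-synaptic spike trains.
    """
    rate = dt / tau
    return [[omega[i][j] + rate * v[i] * o_bar[j]
             for j in range(len(o_bar))]
            for i in range(len(v))]


def connectivity_error(omega, omega_opt):
    """Normalised squared distance ||Omega - Omega_opt||^2 / ||Omega_opt||^2,
    assuming the Frobenius norm."""
    num = sum((a - b) ** 2
              for row, row_opt in zip(omega, omega_opt)
              for a, b in zip(row, row_opt))
    den = sum(b ** 2 for row_opt in omega_opt for b in row_opt)
    return num / den


omega = [[0.0, 0.0], [0.0, 0.0]]
omega = hebbian_update(omega, v=[1.0, 2.0], o_bar=[3.0, 4.0], tau=10.0, dt=0.1)
print(omega)
print(connectivity_error(omega, [[1.0, 0.0], [0.0, 1.0]]))
```

The update is local in the sense used in the text: the change in $\Omega_{ij}$ depends only on the post-synaptic potential $V_i$ and the pre-synaptic filtered spike train $\bar{o}_j$.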
We find that this learning rule decreases the loss function $L$, thereby approaching optimal network connectivity and producing optimal firing rates for our linear decoder (Fig. 2B). In this example, we have chosen connectivity that is initially much too weak at the start of learning. Consequently, the initial network behaviour is similar to a collection of unconnected single neurons that ignore each other. Spike trains are not Poisson-like, firing rates are excessively large, excitatory and inhibitory input is unbalanced and the decoded variable $\hat{\mathbf{x}}$ is highly unreliable (Fig. 2C, E). As a result of learning, the network becomes tightly balanced and the spike trains become asynchronous, irregular and Poisson-like with much lower rates (Fig. 2D, F). However, despite this apparent variability, the population representation is extremely precise, only limited by the metabolic cost and the discrete nature of a spike. This learnt representation is far more precise than a rate code with independent Poisson spike trains [11]. In particular, shuffling the spike trains in response to identical inputs drastically degrades this precision.

4 Conclusions and Discussion

In population coding, large trial-to-trial spike train variability is usually interpreted as noise [2]. We show here that a deterministic network of leaky integrate-and-fire neurons with a simple Hebbian plasticity rule can self-organise into a regime where information is represented far more precisely than in noisy rate codes, while appearing to have noisy Poisson-like spiking dynamics. Our learning rule (Eqn. (22)) has the basic properties of STDP. Specifically, a presynaptic spike occurring immediately before a post-synaptic spike will potentiate a synapse, because membrane potentials are positive immediately before a postsynaptic spike.
Furthermore, a presynaptic spike occurring immediately after a post-synaptic spike will depress a synapse, because membrane potentials are always negative immediately after a postsynaptic spike. This is similar in spirit to the STDP rule proposed in [12], but different to classical STDP, which depends on post-synaptic spike times [9]. This learning rule can also be understood as a mechanism for generating a tight balance between excitatory and inhibitory input. We can see this by observing that membrane potentials after learning can be interpreted as representation errors (projected onto the read-out kernels). Therefore, learning acts to minimise the magnitude of membrane potentials. Excitatory and inhibitory input must be balanced if membrane potentials are small, so we can equate balance with optimal information representation. Previous work has shown that the balanced regime produces (quasi-)chaotic network dynamics, thereby accounting for much observed cortical spike train variability [13, 14, 4]. Moreover, the STDP rule has been known to produce a balanced regime [16, 17]. Additionally, recent theoretical studies have suggested that the balanced regime plays an integral role in network computation [15, 13]. In this work, we have connected these mechanisms and functions, to conclude that learning this balance is equivalent to the development of an optimal spike-based population code, and that this learning can be achieved using a simple Hebbian learning rule.

Acknowledgements

We are grateful for generous funding from the Emmy-Noether grant of the Deutsche Forschungsgemeinschaft (CKM) and the Chaire d'excellence of the Agence Nationale de la Recherche (CKM, DB), as well as a James McDonnell Foundation Award (SD) and EU grants BACS FP6-IST-027140, BIND MECT-CT-20095-024831, and ERC FP7-PREDSPIKE (SD).

References

[1] Tolhurst D, Movshon J, and Dean A (1982) The statistical reliability of signals in single neurons in cat and monkey visual cortex.
Vision Res 23: 775–785.
[2] Shadlen MN, Newsome WT (1998) The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci 18(10): 3870–3896.
[3] Zohary E, Newsome WT (1994) Correlated neuronal discharge rate and its implication for psychophysical performance. Nature 370: 140–143.
[4] Renart A, de la Rocha J, Bartho P, Hollender L, Parga N, Reyes A, Harris KD (2010) The asynchronous state in cortical circuits. Science 327: 587–590.
[5] Ecker AS, Berens P, Keliris GA, Bethge M, Logothetis NK, Tolias AS (2010) Decorrelated neuronal firing in cortical microcircuits. Science 327: 584–587.
[6] Okun M, Lampl I (2008) Instantaneous correlation of excitation and inhibition during ongoing and sensory-evoked activities. Nat Neurosci 11: 535–537.
[7] Shu Y, Hasenstaub A, McCormick DA (2003) Turning on and off recurrent balanced cortical activity. Nature 423: 288–293.
[8] Gentet LJ, Avermann M, Matyas F, Staiger JF, Petersen CCH (2010) Membrane potential dynamics of GABAergic neurons in the barrel cortex of behaving mice. Neuron 65: 422–435.
[9] Caporale N, Dan Y (2008) Spike-timing-dependent plasticity: a Hebbian learning rule. Annu Rev Neurosci 31: 25–46.
[10] Boerlin M, Deneve S (2011) Spike-based population coding and working memory. PLoS Comput Biol 7: e1001080.
[11] Boerlin M, Machens CK, Deneve S (2012) Predictive coding of dynamic variables in balanced spiking networks. Under review.
[12] Clopath C, Büsing L, Vasilaki E, Gerstner W (2010) Connectivity reflects coding: a model of voltage-based STDP with homeostasis. Nat Neurosci 13(3): 344–352.
[13] van Vreeswijk C, Sompolinsky H (1998) Chaotic balanced state in a model of cortical circuits. Neural Comput 10(6): 1321–1371.
[14] Brunel N (2000) Dynamics of sparsely connected networks of excitatory and inhibitory neurons. J Comput Neurosci 8: 183–208.
[15] Vogels TP, Rajan K, Abbott LF (2005) Neural network dynamics. Annu Rev Neurosci 28: 357–376.
[16] Vogels TP, Sprekeler H, Zenke F, Clopath C, Gerstner W (2011) Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks. Science 334(6062): 1569–1573.
[17] Song S, Miller KD, Abbott LF (2000) Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat Neurosci 3(9): 919–926.

3 0.19596612 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

Author: Jaldert Rombouts, Pieter Roelfsema, Sander M. Bohte

Abstract: A key function of brains is undoubtedly the abstraction and maintenance of information from the environment for later use. Neurons in association cortex play an important role in this process: by learning these neurons become tuned to relevant features and represent the information that is required later as a persistent elevation of their activity [1]. It is however not well known how such neurons acquire these task-relevant working memories. Here we introduce a biologically plausible learning scheme grounded in Reinforcement Learning (RL) theory [2] that explains how neurons become selective for relevant information by trial and error learning. The model has memory units which learn useful internal state representations to solve working memory tasks by transforming partially observable Markov decision problems (POMDP) into MDPs. We propose that synaptic plasticity is guided by a combination of attentional feedback signals from the action selection stage to earlier processing levels and a globally released neuromodulatory signal. Feedback signals interact with feedforward signals to form synaptic tags at those connections that are responsible for the stimulus-response mapping. The neuromodulatory signal interacts with tagged synapses to determine the sign and strength of plasticity. The learning scheme is generic because it can train networks in different tasks, simply by varying inputs and rewards. It explains how neurons in association cortex learn to 1) temporarily store task-relevant information in non-linear stimulus-response mapping tasks [1, 3, 4] and 2) learn to optimally integrate probabilistic evidence for perceptual decision making [5, 6]. 1

4 0.18155684 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

Author: Stefan Habenschuss, Johannes Bill, Bernhard Nessler

Abstract: Recent spiking network models of Bayesian inference and unsupervised learning frequently assume either inputs to arrive in a special format or employ complex computations in neuronal activation functions and synaptic plasticity rules. Here we show in a rigorous mathematical treatment how homeostatic processes, which have previously received little attention in this context, can overcome common theoretical limitations and facilitate the neural implementation and performance of existing models. In particular, we show that homeostatic plasticity can be understood as the enforcement of a ’balancing’ posterior constraint during probabilistic inference and learning with Expectation Maximization. We link homeostatic dynamics to the theory of variational inference, and show that nontrivial terms, which typically appear during probabilistic inference in a large class of models, drop out. We demonstrate the feasibility of our approach in a spiking WinnerTake-All architecture of Bayesian inference and learning. Finally, we sketch how the mathematical framework can be extended to richer recurrent network architectures. Altogether, our theory provides a novel perspective on the interplay of homeostatic processes and synaptic plasticity in cortical microcircuits, and points to an essential role of homeostasis during inference and learning in spiking networks. 1

5 0.17809425 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter

Author: Dmitri B. Chklovskii, Daniel Soudry

Abstract: We test the hypothesis that the neuronal spike generation mechanism is an analog-to-digital (AD) converter encoding rectified low-pass filtered summed synaptic currents into a spike train linearly decodable in postsynaptic neurons. Faithful encoding of an analog waveform by a binary signal requires that the spike generation mechanism has a sampling rate exceeding the Nyquist rate of the analog signal. Such oversampling is consistent with the experimental observation that the precision of the spike-generation mechanism is an order of magnitude greater than the cut-off frequency of low-pass filtering in dendrites. Additional improvement in the coding accuracy may be achieved by noise-shaping, a technique used in signal processing. If noise-shaping were used in neurons, it would reduce coding error relative to a Poisson spike generator for frequencies below Nyquist by introducing correlations into spike times. By using experimental data from three different classes of neurons, we demonstrate that biological neurons utilize noise-shaping. Therefore, the spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter.

The nature of the neural spike code remains a central problem in neuroscience [1-3]. In particular, no consensus exists on whether information is encoded in firing rates [4, 5] or individual spike timing [6, 7]. On the single-neuron level, evidence exists to support both points of view. On the one hand, post-synaptic currents are low-pass-filtered by dendrites with the cut-off frequency of approximately 30Hz [8], Figure 1B, providing ammunition for the firing rate camp: if the signal reaching the soma is slowly varying, why would precise spike timing be necessary? On the other hand, the ability of the spike-generation mechanism to encode harmonics of the injected current up to about 300Hz [9, 10], Figure 1B, points at its exquisite temporal precision [11].
Yet, in view of the slow variation of the somatic current, such precision may seem gratuitous and puzzling. The timescale mismatch between gradual variation of the somatic current and high precision of spike generation has been addressed previously. Existing explanations often rely on the population nature of the neural code [10, 12]. Although this is a distinct possibility, the question remains whether invoking population coding is necessary. Other possible explanations for the timescale mismatch include the possibility that some synaptic currents (for example, GABAergic) may be generated by synapses proximal to the soma and therefore not subject to low-pass filtering, or that the high frequency harmonics are so strong in the pre-synaptic spike that despite attenuation, their trace is still present. Although in some cases these explanations could apply, for the majority of synaptic inputs to typical neurons there is a glaring mismatch. The perceived mismatch between the time scales of somatic currents and the spike-generation mechanism can be resolved naturally if one views spike trains as digitally encoding analog somatic currents [13-15], Figure 1A. Although somatic currents vary slowly, information that could be communicated by their analog amplitude far exceeds that of binary signals, such as all-or-none spikes, of the same sampling rate. Therefore, faithful digital encoding requires the sampling rate of the digital signal to be much higher than the cut-off frequency of the analog signal, so-called oversampling. Although the spike generation mechanism operates in continuous time, the high temporal precision of the spike-generation mechanism may be viewed as a manifestation of oversampling, which is needed for the digital encoding of the analog signal.
Therefore, the extra order of magnitude in temporal precision available to the spike-generation mechanism relative to somatic current, Figure 1B, is necessary to faithfully encode the amplitude of the analog signal, thus potentially reconciling the firing rate and the spike timing points of view [13-15].

Figure 1. Hybrid digital-analog operation of neuronal circuits. A. Post-synaptic currents are low-pass filtered and summed in dendrites (black) to produce a somatic current (blue). This analog signal is converted by the spike generation mechanism into a sequence of all-or-none spikes (green), a digital signal. Spikes propagate along an axon and are chemically transduced across synapses (gray) into post-synaptic currents (black), whose amplitude reflects synaptic weights, thus converting the digital signal back to analog. B. Frequency response function for dendrites (blue, adapted from [8]) and for the spike generation mechanism (green, adapted from [9]). Note the one order of magnitude gap between the cut-off frequencies. C. Amplitude of the summed postsynaptic currents depends strongly on spike timing. If the blue spike arrives just 5ms later, as shown in red, the EPSCs sum to a value already 20% less. Therefore, the extra precision of the digital signal may be used to communicate the amplitude of the analog signal.

In signal processing, efficient AD conversion combines the principle of oversampling with that of noise-shaping, which utilizes correlations in the digital signal to allow more accurate encoding of the analog amplitude. This is exemplified by a family of AD converters called sigma-delta modulators [16], of which the basic one is analogous to an integrate-and-fire (IF) neuron [13-15]. The analogy between the basic sigma-delta modulator and the IF neuron led to the suggestion that neurons also use noise-shaping to encode the incoming analog current waveform in the digital spike train [13].
However, the hypothesis of noise-shaping AD conversion has never been tested experimentally in biological neurons. In this paper, by analyzing existing experimental datasets, we demonstrate that noise-shaping is present in three different classes of neurons from vertebrates and invertebrates. This lends support to the view that neurons act as oversampling and noise-shaping AD converters and accounts for the mismatch between the slowly varying somatic currents and precise spike timing. Moreover, we show that the degree of noise-shaping in biological neurons exceeds that used by basic sigma-delta modulators or IF neurons and propose viewing more complicated models in the noise-shaping framework. This paper is organized as follows: We review the principles of oversampling and noise-shaping in Section 2. In Section 3, we present experimental evidence for noise-shaping AD conversion in neurons. In Section 4 we argue that rectification of somatic currents may improve energy efficiency and/or implement de-noising.

2. Oversampling and noise-shaping in AD converters

To understand how oversampling can lead to more accurate encoding of the analog signal amplitude in a digital form, we first consider a Poisson spike encoder, whose rate of spiking is modulated by the signal amplitude, Figure 2A. Such an AD converter samples an analog signal at discrete time points and generates a spike with a probability given by the (normalized) signal amplitude. Because of the binary nature of spike trains, the resulting spike train encodes the signal with a large error even when the sampling is done at Nyquist rate, i.e. the lowest rate for alias-free sampling. To reduce the encoding error a Poisson encoder can sample at frequencies, $f_s$, higher than Nyquist, $f_N$ (hence the term oversampling), Figure 2B. When combined with decoding by low-pass filtering (down to Nyquist) on the receiving end, this leads to a reduction of the error, which can be estimated as follows.
The number of samples over a Nyquist half-period ($1/2f_N$) is given by the oversampling ratio, $R$. As the normalized signal amplitude, $x$, stays roughly constant over the Nyquist half-period, it can be encoded by spikes generated with a fixed probability, $x$. For a Poisson process the variance in the number of spikes is equal to the mean, $xR$. Therefore, the mean relative error of the signal decoded by averaging over the Nyquist half-period is

$$\frac{\sqrt{xR}}{xR} = \frac{1}{\sqrt{xR}}, \qquad (1)$$

indicating that oversampling reduces transmission error. However, the weak dependence of the error on the oversampling frequency indicates diminishing returns on the investment in oversampling and motivates one to search for other ways to lower the error.

Figure 2. Oversampling and noise-shaping in AD conversion. A. Analog somatic current (blue) and its digital code (green). The difference between the green and the blue curves is encoding error. B. Digital output of oversampling Poisson encoder over one Nyquist half-period. C. Error power spectrum of a Nyquist (dark green) and oversampled (light green) Poisson encoder. Although the total error power is the same, the fraction surviving low-pass filtering during decoding (solid green) is smaller in the oversampled case. D. Basic sigma-delta modulator. E. Signal at the output of the integrator. F. Digital output of the sigma-delta modulator over one Nyquist period. G. Error power spectrum of the sigma-delta modulator (brown) is shifted to higher frequencies and low-pass filtered during decoding. The remaining error power (solid brown) is smaller than for the Poisson encoder.

To reduce encoding error beyond the ½ power of the oversampling ratio, the principle of noise-shaping was put forward [17]. To illustrate noise-shaping consider a basic AD converter called sigma-delta [18], Figure 2D. In the basic sigma-delta modulator, the previous quantized signal is fed back and subtracted from the incoming signal and then the difference is integrated in time.
Rather than quantizing the input signal, as would be done in the Poisson encoder, the sigma-delta modulator quantizes the integral of the difference between the incoming analog signal and the previous quantized signal, Figure 2F. One can see that, in the oversampling regime, the quantization error of the basic sigma-delta modulator is significantly less than that of the Poisson encoder. As the variance in the number of spikes over the Nyquist period is less than one, the mean relative error of the signal is at most $1/(xR)$, which is better than the Poisson encoder. To gain additional insight and understand the origin of the term noise-shaping, we repeat the above analysis in the Fourier domain. First, the Poisson encoder has a flat power spectrum up to the sampling frequency, Figure 2C. Oversampling preserves the total error power but extends the frequency range, resulting in lower error power below Nyquist. Second, a more detailed analysis of the basic sigma-delta modulator, where the dynamics is linearized by replacing the quantization device with a random noise injection [19], shows that the quantization noise is effectively differentiated. Taking the derivative in time is equivalent to multiplying the power spectrum of the quantization noise by frequency squared. Such reduction of noise power at low frequencies is an example of noise shaping, Figure 2G. Under the additional assumption of white quantization noise, such analysis yields an error estimate, Eq. (2), which for $R \gg 1$ is significantly better performance than for the Poisson encoder, Eq. (1). As mentioned previously, the basic sigma-delta modulator, Figure 2D, in the continuous-time regime is nothing other than an IF neuron [13, 20, 21]. In the IF neuron, quantization is implemented by the spike generation mechanism and the negative feedback corresponds to the after-spike reset. Note that resetting the integrator to zero is strictly equivalent to subtraction only for continuous-time operation.
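The contrast between the two encoders can be sketched in a few lines for a constant amplitude. This is an illustrative discrete-time toy, not the analysis used in the paper: it assumes a constant normalized amplitude `x` between 0 and 1, a unit threshold, decoding by plain averaging over the window, and a Bernoulli approximation of the Poisson encoder. All function names are hypothetical.

```python
import random


def poisson_encode(x, n_samples, rng):
    """Spike with probability x at each sample (Bernoulli approximation)."""
    return [1 if rng.random() < x else 0 for _ in range(n_samples)]


def sigma_delta_encode(x, n_samples):
    """First-order modulator: integrate the input, spike at threshold 1,
    and subtract the threshold (the fed-back quantized output)."""
    integrator = 0.0
    spikes = []
    for _ in range(n_samples):
        integrator += x
        if integrator >= 1.0:
            spikes.append(1)
            integrator -= 1.0   # subtraction rather than reset to zero
        else:
            spikes.append(0)
    return spikes


def decode(spikes):
    """Estimate the amplitude by averaging the spikes over the window."""
    return sum(spikes) / len(spikes)


def rms_poisson_error(x, n_samples, n_trials=200, seed=0):
    """Root-mean-square decoding error of the Poisson encoder over trials."""
    rng = random.Random(seed)
    sq = [(decode(poisson_encode(x, n_samples, rng)) - x) ** 2
          for _ in range(n_trials)]
    return (sum(sq) / n_trials) ** 0.5


x, n = 0.3, 1000
print("sigma-delta error:", abs(decode(sigma_delta_encode(x, n)) - x))
print("Poisson RMS error:", rms_poisson_error(x, n))
```

Because the integrator never holds more than one threshold's worth of uncoded signal, the sigma-delta estimate is off by at most one spike over the window, whereas the Poisson error shrinks only with the square root of the number of samples.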
In discrete-time computer simulations, the integrator value may exceed the threshold, and, therefore, subtraction of the threshold value rather than reset must be used. Next, motivated by the ΣΔ-IF analogy, we look for signs of noise-shaping AD conversion in real neurons.

3. Experimental evidence of noise-shaping AD conversion in real neurons

In order to determine whether noise-shaping AD conversion takes place in biological neurons, we analyzed three experimental datasets, where spike trains were generated by time-varying somatic currents: 1) rat somatosensory cortex L5 pyramidal neurons [9], 2) mouse olfactory mitral cells [22, 23], and 3) fruit fly olfactory receptor neurons [24]. In the first two datasets, the current was injected through an electrode in whole-cell patch clamp mode, while in the third, the recording was extracellular and the intrinsic somatic current could be measured because the glial compartment included only one active neuron.

Testing the noise-shaping AD conversion hypothesis is complicated by the fact that encoded and decoded signals are hard to measure accurately. First, as somatic current is rectified by the spike-generation mechanism, only its super-threshold component can be encoded faithfully, making it hard to know exactly what is being encoded. Second, decoding in the dendrites is not accessible in these single-neuron recordings.

In view of these difficulties, we start by simply computing the power spectrum of the reconstruction error obtained by subtracting a scaled and shifted, but otherwise unaltered, spike train from the somatic current. The scaling factor was determined by the total weight of the decoding linear filter and the shift was optimized to maximize information capacity, see below. At frequencies below 20 Hz the error contains significantly lower power than the input signal, Figure 3, indicating that the spike generation mechanism may be viewed as an AD converter.
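The error-spectrum comparison described above can be tried on synthetic data. The following is only a sketch under stated assumptions: the "somatic current" is a slow sinusoid normalized to lie between 0 and 1, the spike train is subtracted with unit scaling and no shift, the threshold is 1, and the spectrum is a plain discrete Fourier transform. None of these choices or function names come from the paper's analysis of the recorded datasets.

```python
import cmath
import math
import random


def sigma_delta_encode(signal, threshold=1.0):
    """Integrate the input and subtract the threshold after each spike."""
    integrator = 0.0
    spikes = []
    for value in signal:
        integrator += value
        if integrator >= threshold:
            spikes.append(1)
            integrator -= threshold
        else:
            spikes.append(0)
    return spikes


def bernoulli_encode(signal, rng):
    """Poisson-like encoder: spike with probability equal to the input."""
    return [1 if rng.random() < value else 0 for value in signal]


def power_spectrum(samples, n_bins):
    """Power in DFT bins 1..n_bins (bin 0, the mean, is skipped)."""
    n = len(samples)
    spectrum = []
    for k in range(1, n_bins + 1):
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def low_frequency_error_power(signal, spikes, n_bins=16):
    """Total power of (signal - spikes) in the lowest n_bins frequency bins."""
    error = [x - s for x, s in zip(signal, spikes)]
    return sum(power_spectrum(error, n_bins))


if __name__ == "__main__":
    n = 512
    signal = [0.45 + 0.3 * math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
    rng = random.Random(0)
    print("Bernoulli:  ",
          low_frequency_error_power(signal, bernoulli_encode(signal, rng)))
    print("sigma-delta:",
          low_frequency_error_power(signal, sigma_delta_encode(signal)))
```

In this toy setting the low-frequency error power of the subtractive integrate-and-fire encoder should come out well below that of the Bernoulli encoder, which is the signature of noise-shaping that the analysis looks for in real neurons.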
Furthermore, the error power spectrum of the biological neuron is below that of the Poisson encoder, thus indicating the presence of noise-shaping. For dataset 3 we also plot the error power spectrum of the IF neuron, the threshold of which is chosen to generate the same number of spikes as the biological neuron.

Figure 3. Evidence of noise-shaping. Power spectra (spectral power, a.u., vs. frequency, Hz) of the somatic current (blue), the difference between the somatic current and the digital spike train of the biological neuron (black), of the Poisson encoder (green), and of the IF neuron (red). Left: dataset 1; right: dataset 3.

Although the simple analysis presented above indicates noise-shaping, subtracting the spike train from the input signal, Figure 3, does not accurately quantify the error when decoding involves additional filtering. An example of such additional encoding/decoding is predictive coding, which will be discussed below [25]. To take such a decoding filter into account, we computed a decoded waveform by convolving the spike train with the optimal linear filter, which predicts the somatic current from the spike train with the least mean squared error.

Our linear decoding analysis lends additional support to the noise-shaping AD conversion hypothesis [13-15]. First, the optimal linear filter shape is similar to unitary post-synaptic currents, Figure 4B, thus supporting the view that dendrites reconstruct the somatic current of the presynaptic neuron by low-pass filtering the spike train in accordance with the noise-shaping principle [13]. Second, we found that linear decoding using an optimal filter accounts for 60-80% of the somatic current variance. Naturally, such prediction works better for neurons in the suprathreshold regime, i.e. with high firing rates, an issue to which we return in Section 4. To avoid complications associated with rectification, for now we focused on neurons in the suprathreshold regime, which we identified by checking that the relationship between predicted and actual current is close to linear.

Figure 4. Linear decoding of experimentally recorded spike trains. A. Waveform of somatic current (blue), resulting spike train (black), and the linearly decoded waveform (red) from dataset 1. B. Top: Optimal linear filter for the trace in A, which is representative of the other datasets as well. Bottom: Typical EPSPs have a shape similar to the decoding filter (adapted from [26]). C-D. Power spectra (spectral power, a.u., vs. frequency, Hz) of the somatic current (blue), the decoding error of the biological neuron (black), the Poisson encoder (green), and the IF neuron (red) for dataset 1 (C) and dataset 3 (D).

Next, we analyzed the spectral distribution of the reconstruction error calculated by subtracting the decoded spike train, i.e. the spike train convolved with the computed optimal linear filter, from the somatic current. We found that at low frequencies the error power is significantly lower than in the input signal, Figure 4C,D. This observation confirms that signals below the dendritic cut-off frequency of 20-30 Hz can be efficiently communicated using spike trains. To quantify the effect of noise-shaping we computed the information capacity of different encoders from S(f) and N(f), the power spectra of the somatic current and the encoding error, respectively, with the sum computed only over the frequencies for which S(f) > N(f).
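The exact capacity expression did not survive in this copy of the text. A common choice that is consistent with the description (a sum restricted to frequencies where S(f) > N(f), readable as an area on a semi-logarithmic plot) is the sum of log2(S(f)/N(f)) times the frequency resolution. The sketch below implements that assumed form; the function name and the `df` parameter are illustrative, not the authors'.

```python
import math


def information_capacity(signal_power, noise_power, df=1.0):
    """Assumed form: sum of log2(S/N) * df over bins where S > N.

    signal_power and noise_power are power spectra sampled on the same
    frequency grid; df is the spacing of that grid.
    """
    total = 0.0
    for s, n in zip(signal_power, noise_power):
        # Bins with zero noise power are skipped to avoid dividing by zero.
        if n > 0 and s > n:
            total += math.log2(s / n) * df
    return total


if __name__ == "__main__":
    # Toy spectra: the third bin has S < N and does not contribute.
    print(information_capacity([8.0, 4.0, 1.0], [1.0, 1.0, 2.0]))
```

Under this form, lowering the error spectrum at frequencies where the signal has power raises the capacity, so an encoder with stronger noise-shaping scores higher.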
Because the plots in Figure 4C,D use a semi-logarithmic scale, the information capacity can be estimated from the area between the somatic current (blue) power spectrum and an error power spectrum. We find that the biological spike generation mechanism has higher information capacity than the Poisson encoder and IF neurons. Therefore, neurons act as AD converters with stronger noise-shaping than IF neurons.

We now return to the predictive nature of the spike generation mechanism. Given the causal nature of the spike generation mechanism, it is surprising that the optimal filters for all three datasets carry most of their weight following a spike, Figure 4B. This indicates that the spike generation mechanism is capable of making predictions, which are possible in these experiments because somatic currents are temporally correlated. We note that these observations make delay-free reconstruction of the signal possible, thus allowing fast operation of neural circuits [27].

The predictive nature of the encoder can be captured by a ΣΔ modulator embedded in a predictive coding feedback loop [28], Figure 5A. We verified by simulation that such a nested architecture generates a similar optimal linear filter with most of its weight in the time following a spike, Figure 5A right. Of course, such prediction is only possible for correlated inputs, implying that the shape of the optimal linear filter depends on the statistics of the inputs. The role of predictive coding is to reduce the dynamic range of the signal that enters the ΣΔ modulator, thus avoiding overloading. A possible biological implementation for such integrating feedback could be Ca2+ concentration and Ca2+-dependent potassium channels [25, 29].

Figure 5. Enhanced ΣΔ modulators. A. ΣΔ modulator combined with a predictive coder. In such a device, the optimal decoding filter computed for correlated inputs has most of its weight following a spike, similar to experimental measurements, Figure 4B. B. A second-order ΣΔ modulator possesses stronger noise-shaping properties. Because such a circuit contains an internal state variable, it generates a non-periodic spike train in response to a constant input. The bottom trace shows a typical result of a simulation. Black: spikes; blue: input current.

4. Possible reasons for current rectification: energy efficiency and de-noising

We have shown that at high firing rates biological neurons encode somatic current into a linearly decodable spike train. However, at low firing rates linear decoding cannot faithfully reproduce the somatic current because of rectification in the spike generation mechanism. If the objective of spike generation is faithful AD conversion, why would such rectification exist? We see two potential reasons: energy efficiency and de-noising.

It is widely believed that minimizing metabolic costs is an important consideration in brain design and operation [30, 31]. Moreover, spikes are known to consume a significant fraction of the metabolic budget [30, 32], placing a premium on their total number. Thus, we can postulate that neuronal spike trains find a trade-off between the mean squared error in the decoded spike train relative to the input signal and the total number of spikes, as expressed by the cost function of Eq. (3), defined over a time interval T, where x is the analog input signal, s is the binary spike sequence composed of zeros and ones, and w is the linear filter. To demonstrate how solving Eq. (3) would lead to thresholding, let us consider a simplified version, Eq. (4), taken over a Nyquist period, during which the input signal stays constant, with quantities normalized by w. Minimizing such a cost function reduces to choosing the lowest-lying parabola for a given input, Figure 6A. Therefore, thresholding is a natural outcome of minimizing a cost function combining the decoding error and the energy cost, Eq. (3).
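The thresholding argument can be made concrete with a small sketch. The equations themselves are missing from this copy, so the code assumes a simplified cost of the form (x - N)^2 + lam * N, where x is the constant normalized input over one Nyquist period, N is the number of spikes, and lam is a hypothetical per-spike energy cost. Each N defines one parabola in x, and the optimal N is the one whose parabola lies lowest.

```python
def cost(x, n_spikes, lam):
    """Assumed simplified cost: squared decoding error plus a per-spike penalty."""
    return (x - n_spikes) ** 2 + lam * n_spikes


def optimal_spike_count(x, lam, max_spikes=10):
    """Pick the spike count whose parabola is lowest at input x."""
    return min(range(max_spikes + 1), key=lambda n: cost(x, n, lam))


if __name__ == "__main__":
    lam = 0.5
    # Under this cost the first spike becomes worthwhile once x > (1 + lam) / 2.
    for x in (0.2, 0.7, 0.8, 1.2, 1.8):
        print(x, optimal_spike_count(x, lam))
```

Under this assumed cost, no spike is emitted until the input exceeds (1 + lam) / 2, so a larger energy penalty pushes the threshold higher, which is the rectification behaviour the argument describes.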
In addition to energy efficiency, there may be a computational reason for thresholding somatic current in neurons. To illustrate this point, we note that the cost function in Eq. (3) for continuous variables, $s_t$, may be viewed as a non-negative version of the L1-norm regularized linear regression called LASSO [33], which is commonly used for de-noising of sparse and Laplacian signals [34]. Such a cost function can be minimized by iteratively applying a gradient descent step and a shrinkage step [35], the latter being equivalent to thresholding (one-sided in the case of non-negative variables), Figure 6B,C. Therefore, neurons may be encoding a de-noised input signal.

Figure 6. Possible reasons for rectification in neurons. A. Cost function combining encoding error squared with metabolic expense vs. input signal for different values of the spike number N, Eq. (4). Note that the optimal number of spikes jumps from zero to one as a function of input. B. Estimating the most probable "clean" signal value for a continuous non-negative Laplacian signal and Gaussian noise, Eq. (3) (while setting w = 1). The parabolas (red) illustrate the quadratic log-likelihood term in (3) for different values of the measurement, s, while the linear function (blue) reflects the linear log-prior term in (3). C. The minimum of the combined cost function in B is at zero if s is below the threshold, and grows linearly with s above it.

5. Discussion

In this paper, we demonstrated that the neuronal spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter, which encodes a rectified low-pass filtered somatic current as a digital spike train. Rectification by the spike generation mechanism may subserve both energy efficiency and de-noising. As the degree of noise-shaping in biological neurons exceeds that in IF neurons, or the basic ΣΔ modulator, we suggest that neurons should be modeled by more advanced ΣΔ modulators, e.g. Figure 5B.
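The shrinkage step invoked in the de-noising argument of Section 4 can be sketched for the scalar case with w = 1. The code assumes the combined cost 0.5 * (s - x)^2 + lam * x over x >= 0, where s is the noisy measurement and lam is a hypothetical weight on the linear log-prior term; the paper's own symbols for these quantities are not recoverable here. The closed-form minimizer is the one-sided soft threshold, and a brute-force grid search is included only as a check.

```python
def shrink_nonnegative(s, lam):
    """One-sided soft threshold: minimizer of 0.5*(s - x)**2 + lam*x over x >= 0."""
    return max(s - lam, 0.0)


def brute_force_minimizer(s, lam, x_max=5.0, grid=10001):
    """Grid search over [0, x_max] for the same cost, used only as a sanity check."""
    best_x, best_cost = 0.0, float("inf")
    for i in range(grid):
        x = i * x_max / (grid - 1)
        value = 0.5 * (s - x) ** 2 + lam * x
        if value < best_cost:
            best_x, best_cost = x, value
    return best_x


if __name__ == "__main__":
    lam = 0.5
    for s in (0.3, 0.5, 1.5, 3.0):
        print(s, shrink_nonnegative(s, lam), brute_force_minimizer(s, lam))
```

The estimate is zero for measurements below lam and grows linearly above it, matching the shape described for Figure 6C.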
Interestingly, ΣΔ modulators can also be viewed as coders with error prediction feedback [19].

Many publications have studied various aspects of spike generation in neurons, yet we believe that the framework [13-15] we adopt is different, and we discuss its relationship to some of these studies. Our framework is different from previous proposals to cast neurons as predictors [36, 37] because a different quantity is being predicted. The possibility of perfect decoding from a spike train with infinite temporal precision has been proven in [38]. Here, we are concerned with the more practical issue of how the reconstruction error scales with the oversampling ratio. Also, we consider linear decoding, which sets our work apart from [39]. Finally, previous experiments addressing noise-shaping [40] studied the power spectrum of the spike train rather than that of the encoding error.

Our work is aimed at understanding biological and computational principles of spike generation and decoding and is not meant as a substitute for the existing phenomenological spike-generation models [41], which allow efficient fitting of parameters and prediction of spike trains [42]. Yet, the theoretical framework [13-15] we adopt may assist in building better models of spike generation for a given somatic current waveform. First, having interpreted spike generation as AD conversion, we can draw on the rich experience in signal processing to attack the problem. Second, this framework suggests a natural metric to compare the performance of different spike generation models in the high firing rate regime: the mean squared error between the injected current waveform and the filtered version of the spike train produced by a model, provided the total number of spikes is the same as in the experimental data. The AD conversion framework adds justification to the previously proposed spike distance obtained by subtracting low-pass filtered spike trains [43].
As the framework [13-15] we adopt relies on viewing neuronal computation as an analog-digital hybrid, which requires AD and DA conversion at every step, one may wonder about the reason for such a hybrid scheme. Starting with the early days of computers, the analog mode has been known to be advantageous for computation. For example, performing addition of many variables in one step is possible in the analog mode simply by Kirchhoff's law, but would require hundreds of logical gates in the digital mode [44]. However, the analog mode is vulnerable to noise build-up over many stages of computation and is inferior in precisely communicating information over long distances under a limited energy budget [30, 31]. While early analog computers were displaced by their digital counterparts, evolution combined analog and digital modes into a computational hybrid [44], thus necessitating efficient AD and DA conversion, which was the focus of the present study.

We are grateful to L. Abbott, S. Druckmann, D. Golomb, T. Hu, J. Magee, N. Spruston, B. Theilman for helpful discussions and comments on the manuscript, to X.-J. Wang, D. McCormick, K. Nagel, R. Wilson, K. Padmanabhan, N. Urban, S. Tripathy, H. Koendgen, and M. Giugliano for sharing their data. The work of D.S. was partially supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

References

1. Ferster, D. and N. Spruston, Cracking the neural code. Science, 1995. 270: p. 756-7.
2. Panzeri, S., et al., Sensory neural codes using multiplexed temporal scales. Trends Neurosci, 2010. 33(3): p. 111-20.
3. Stevens, C.F. and A. Zador, Neural coding: The enigma of the brain. Curr Biol, 1995. 5(12): p. 1370-1.
4. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96.
5. Shadlen, M.N. and W.T. Newsome, Noise, neural codes and cortical organization. Curr Opin Neurobiol, 1994. 4(4): p. 569-79.
6. Singer, W. and C.M. Gray, Visual feature integration and the temporal correlation hypothesis. Annu Rev Neurosci, 1995. 18: p. 555-86.
7. Meister, M., Multineuronal codes in retinal signaling. Proc Natl Acad Sci U S A, 1996. 93(2): p. 609-14.
8. Cook, E.P., et al., Dendrite-to-soma input/output function of continuous time-varying signals in hippocampal CA1 pyramidal neurons. J Neurophysiol, 2007. 98(5): p. 2943-55.
9. Kondgen, H., et al., The dynamical response properties of neocortical neurons to temporally modulated noisy inputs in vitro. Cereb Cortex, 2008. 18(9): p. 2086-97.
10. Tchumatchenko, T., et al., Ultrafast population encoding by cortical neurons. J Neurosci, 2011. 31(34): p. 12171-9.
11. Mainen, Z.F. and T.J. Sejnowski, Reliability of spike timing in neocortical neurons. Science, 1995. 268(5216): p. 1503-6.
12. Mar, D.J., et al., Noise shaping in populations of coupled model neurons. Proc Natl Acad Sci U S A, 1999. 96(18): p. 10450-5.
13. Shin, J., Adaptive noise shaping neural spike encoding and decoding. Neurocomputing, 2001. 38-40: p. 369-381.
14. Shin, J., The noise shaping neural coding hypothesis: a brief history and physiological implications. Neurocomputing, 2002. 44: p. 167-175.
15. Shin, J.H., Adaptation in spiking neurons based on the noise shaping neural coding hypothesis. Neural Networks, 2001. 14(6-7): p. 907-919.
16. Schreier, R. and G.C. Temes, Understanding delta-sigma data converters. 2005, Piscataway, NJ: IEEE Press, Wiley. xii, 446 p.
17. Candy, J.C., A use of limit cycle oscillations to obtain robust analog-to-digital converters. IEEE Trans. Commun, 1974. COM-22: p. 298-305.
18. Inose, H., Y. Yasuda, and J. Murakami, A telemetering system by code modulation - Δ-Σ modulation. IRE Trans. Space Elect. Telemetry, 1962. SET-8: p. 204-209.
19. Spang, H.A. and P.M. Schultheiss, Reduction of quantizing noise by use of feedback. IRE Trans. Commun. Sys., 1962: p. 373-380.
20. Hovin, M., et al., Delta-Sigma modulation in single neurons, in IEEE International Symposium on Circuits and Systems. 2002.
21. Cheung, K.F. and P.Y.H. Tang, Sigma-Delta Modulation Neural Networks. Proc. IEEE Int Conf Neural Networks, 1993: p. 489-493.
22. Padmanabhan, K. and N. Urban, Intrinsic biophysical diversity decorrelates neuronal firing while increasing information content. Nat Neurosci, 2010. 13: p. 1276-82.
23. Urban, N. and S. Tripathy, Neuroscience: Circuits drive cell diversity. Nature, 2012. 488(7411): p. 289-90.
24. Nagel, K.I. and R.I. Wilson, personal communication.
25. Shin, J., C. Koch, and R. Douglas, Adaptive neural coding dependent on the time-varying statistics of the somatic input current. Neural Comp, 1999. 11: p. 1893-913.
26. Magee, J.C. and E.P. Cook, Somatic EPSP amplitude is independent of synapse location in hippocampal pyramidal neurons. Nat Neurosci, 2000. 3(9): p. 895-903.
27. Thorpe, S., D. Fize, and C. Marlot, Speed of processing in the human visual system. Nature, 1996. 381(6582): p. 520-2.
28. Tewksbury, S.K. and R.W. Hallock, Oversampled, linear predictive and noise-shaping coders of order N>1. IEEE Trans Circuits & Sys, 1978. CAS-25: p. 436-47.
29. Wang, X.J., et al., Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J Neurophysiol, 2003. 89(6): p. 3279-93.
30. Attwell, D. and S.B. Laughlin, An energy budget for signaling in the grey matter of the brain. J Cereb Blood Flow Metab, 2001. 21(10): p. 1133-45.
31. Laughlin, S.B. and T.J. Sejnowski, Communication in neuronal networks. Science, 2003. 301(5641): p. 1870-4.
32. Lennie, P., The cost of cortical computation. Curr Biol, 2003. 13(6): p. 493-7.
33. Tibshirani, R., Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological, 1996. 58(1): p. 267-288.
34. Chen, S.S.B., D.L. Donoho, and M.A. Saunders, Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 1998. 20(1): p. 33-61.
35. Elad, M., et al., Wide-angle view at iterated shrinkage algorithms. P Soc Photo-Opt Ins, 2007. 6701: p. 70102.
36. Deneve, S., Bayesian spiking neurons I: inference. Neural Comp, 2008. 20: p. 91.
37. Yu, A.J., Optimal Change-Detection and Spiking Neurons, in NIPS, B. Scholkopf, J. Platt, and T. Hofmann, Editors. 2006.
38. Lazar, A. and L. Toth, Perfect Recovery and Sensitivity Analysis of Time Encoded Bandlimited Signals. IEEE Transactions on Circuits and Systems, 2004. 51(10).
39. Pfister, J.P., P. Dayan, and M. Lengyel, Synapses with short-term plasticity are optimal estimators of presynaptic membrane potentials. Nat Neurosci, 2010. 13(10): p. 1271-5.
40. Chacron, M.J., et al., Experimental and theoretical demonstration of noise shaping by interspike interval correlations. Fluctuations and Noise in Biological, Biophysical, and Biomedical Systems III, 2005. 5841: p. 150-163.
41. Pillow, J., Likelihood-based approaches to modeling the neural code, in Bayesian Brain: Probabilistic Approaches to Neural Coding, K. Doya, et al., Editors. 2007, MIT Press.
42. Jolivet, R., et al., A benchmark test for a quantitative assessment of simple neuron models. J Neurosci Methods, 2008. 169(2): p. 417-24.
43. van Rossum, M.C., A novel spike distance. Neural Comput, 2001. 13(4): p. 751-63.
44. Sarpeshkar, R., Analog versus digital: extrapolating from electronics to neurobiology. Neural Computation, 1998. 10(7): p. 1601-38.

6 0.17419466 150 nips-2012-Hierarchical spike coding of sound

7 0.14812076 362 nips-2012-Waveform Driven Plasticity in BiFeO3 Memristive Devices: Model and Implementation

8 0.14316285 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

9 0.11181638 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

10 0.11164521 195 nips-2012-Learning visual motion in recurrent neural networks

11 0.10971095 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

12 0.087077983 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

13 0.0829546 224 nips-2012-Multi-scale Hyper-time Hardware Emulation of Human Motor Nervous System Based on Spiking Neurons using FPGA

14 0.079982452 322 nips-2012-Spiking and saturating dendrites differentially expand single neuron computation capacity

15 0.073767923 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing

16 0.073120549 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

17 0.064361311 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

18 0.062100261 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

19 0.058176056 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

20 0.05757897 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.136), (1, -0.051), (2, -0.104), (3, 0.148), (4, -0.055), (5, 0.336), (6, -0.002), (7, 0.082), (8, -0.025), (9, 0.147), (10, 0.013), (11, 0.013), (12, 0.076), (13, 0.028), (14, 0.025), (15, 0.055), (16, -0.06), (17, 0.027), (18, -0.024), (19, 0.005), (20, -0.056), (21, -0.125), (22, 0.002), (23, -0.004), (24, -0.026), (25, 0.039), (26, 0.014), (27, -0.103), (28, -0.089), (29, -0.048), (30, 0.022), (31, -0.058), (32, 0.091), (33, 0.026), (34, -0.103), (35, -0.002), (36, -0.01), (37, 0.028), (38, -0.133), (39, -0.118), (40, -0.041), (41, -0.003), (42, -0.048), (43, -0.002), (44, 0.003), (45, -0.026), (46, 0.037), (47, -0.043), (48, 0.011), (49, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9630183 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

Author: David Balduzzi, Michel Besserve

2 0.87379092 362 nips-2012-Waveform Driven Plasticity in BiFeO3 Memristive Devices: Model and Implementation

Author: Christian Mayr, Paul Stärke, Johannes Partzsch, Love Cederstroem, Rene Schüffny, Yao Shuai, Nan Du, Heidemarie Schmidt

Abstract: Memristive devices have recently been proposed as efficient implementations of plastic synapses in neuromorphic systems. The plasticity in these memristive devices, i.e. their resistance change, is defined by the applied waveforms. This behavior resembles biological synapses, whose plasticity is also triggered by mechanisms that are determined by local waveforms. However, learning in memristive devices has so far been approached mostly on a pragmatic technological level. The focus seems to be on finding any waveform that achieves spike-timing-dependent plasticity (STDP), without regard to the biological veracity of said waveforms or to further important forms of plasticity. Bridging this gap, we make use of a plasticity model driven by neuron waveforms that explains a large number of experimental observations and adapt it to the characteristics of the recently introduced BiFeO3 memristive material. Based on this approach, we show STDP for the first time for this material, with learning window replication superior to previous memristor-based STDP implementations. We also demonstrate in measurements that it is possible to overlay short and long term plasticity at a memristive device in the form of the well-known triplet plasticity. To the best of our knowledge, this is the first implementation of triplet plasticity on any physical memristive device.

3 0.86075872 190 nips-2012-Learning optimal spike-based representations

Author: Ralph Bourdoukan, David Barrett, Sophie Deneve, Christian K. Machens

4 0.82146931 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter

Author: Dmitri B. Chklovskii, Daniel Soudry

5 0.81579113 322 nips-2012-Spiking and saturating dendrites differentially expand single neuron computation capacity

Author: Romain Cazé, Mark Humphries, Boris S. Gutkin

Abstract: The integration of excitatory inputs in dendrites is non-linear: multiple excitatory inputs can produce a local depolarization departing from the arithmetic sum of each input's response taken separately. If this depolarization is bigger than the arithmetic sum, the dendrite is spiking; if the depolarization is smaller, the dendrite is saturating. Decomposing a dendritic tree into independent dendritic spiking units greatly extends its computational capacity, as the neuron then maps onto a two layer neural network, enabling it to compute linearly non-separable Boolean functions (lnBFs). How can these lnBFs be implemented by dendritic architectures in practice? And can saturating dendrites equally expand computational capacity? To address these questions we use a binary neuron model and Boolean algebra. First, we confirm that spiking dendrites enable a neuron to compute lnBFs using an architecture based on the disjunctive normal form (DNF). Second, we prove that saturating dendrites as well as spiking dendrites enable a neuron to compute lnBFs using an architecture based on the conjunctive normal form (CNF). Contrary to a DNF-based architecture, in a CNF-based architecture, dendritic unit tunings do not imply the neuron tuning, as has been observed experimentally. Third, we show that one cannot use a DNF-based architecture with saturating dendrites. Consequently, we show that an important family of lnBFs implemented with a CNF-architecture can require an exponential number of saturating dendritic units, whereas the same family implemented with either a DNF-architecture or a CNF-architecture always requires a linear number of spiking dendritic units. This minimization could explain why a neuron spends energetic resources to make its dendrites spike.

6 0.76618683 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

7 0.68922174 224 nips-2012-Multi-scale Hyper-time Hardware Emulation of Human Motor Nervous System Based on Spiking Neurons using FPGA

8 0.62268496 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

9 0.59101254 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

10 0.48906067 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

11 0.4888438 39 nips-2012-Analog readout for optical reservoir computers

12 0.48323742 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing

13 0.42389786 150 nips-2012-Hierarchical spike coding of sound

14 0.42225578 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

15 0.39167345 195 nips-2012-Learning visual motion in recurrent neural networks

16 0.32358703 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

17 0.31562212 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

18 0.30310979 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

19 0.28564507 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

20 0.28230357 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (6, 0.265), (11, 0.014), (17, 0.028), (21, 0.051), (38, 0.121), (42, 0.029), (54, 0.032), (55, 0.028), (61, 0.055), (74, 0.032), (76, 0.075), (80, 0.079), (92, 0.032), (94, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78100598 347 nips-2012-Towards a learning-theoretic analysis of spike-timing dependent plasticity

Author: David Balduzzi, Michel Besserve

2 0.58575052 190 nips-2012-Learning optimal spike-based representations

Author: Ralph Bourdoukan, David Barrett, Sophie Deneve, Christian K. Machens

3 0.57102096 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples $n$ for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form inter-task covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks $T$. We use these to study the asymptotic learning behaviour for large $n$. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, $\rho$, is near its maximal value $\rho = 1$. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by "collective learning" even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks.

1 Introduction and motivation

Gaussian processes (GPs) [1] have been popular in the NIPS community for a number of years now, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can use a GP prior when learning a function from data. In line with growing interest in multi-task or transfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see e.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different choices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions on whether the functions to be learned are related to a smaller number of latent functions or have free-form inter-task correlations; for a recent review see [9].
Given this interest in multi-task GPs, one would like to quantify the benefits that they bring compared to single-task learning. PAC-style bounds for classification [2, 3, 10] in more general multi-task scenarios exist, but there has been little work on average case analysis. The basic question in this setting is: how does the Bayes error on a given task depend on the number of training examples for all tasks, when averaged over all data sets of the given size? For a single regression task, this learning curve has become relatively well understood since the late 1990s, with a number of bounds and approximations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already two-task GP regression is much more difficult to analyse, and progress was made only very recently at NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of these bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge of the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve approximations that apply to an arbitrary number $T$ of tasks, and that can be evaluated explicitly without recourse to sampling.

We begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression problem in a convenient feature space form, where individual training examples enter additively. This requires the introduction of a non-trivial tensor structure combining feature space components and tasks. Considering the change in error when adding an example for some task leads to partial differential equations linking the Bayes errors for all tasks. Solving these using the method of characteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In Sec. 4 we discuss some of its predictions.
The approximation correctly delineates the limits of pure transfer learning, when all examples are from tasks other than the one of interest. Next we compare with numerical simulations for some two-task scenarios, finding good qualitative agreement. These results also highlight a surprising feature, namely that asymptotically the relatedness between tasks can become much less useful. We analyse this effect in some detail, showing that it is most extreme for learning of smooth functions. Finally we discuss the case of many tasks, where there is an unexpected separation of the learning curves into a fast initial error decay arising from "collective learning", and a much slower final part where tasks are learned almost independently.

2 GP regression and Bayes error

We consider GP regression for $T$ functions $f_\tau(x)$, $\tau = 1, 2, \ldots, T$. These functions have to be learned from $n$ training examples $(x_\ell, \tau_\ell, y_\ell)$, $\ell = 1, \ldots, n$. Here $x_\ell$ is the training input, $\tau_\ell \in \{1, \ldots, T\}$ denotes which task the example relates to, and $y_\ell$ is the corresponding training output. We assume that the latter is given by the target function value $f_{\tau_\ell}(x_\ell)$ corrupted by i.i.d. additive Gaussian noise with zero mean and variance $\sigma^2_{\tau_\ell}$. This setup allows the noise level $\sigma^2_\tau$ to depend on the task. In GP regression the prior over the functions $f_\tau(x)$ is a Gaussian process. This means that for any set of inputs $x_\ell$ and task labels $\tau_\ell$, the function values $\{f_{\tau_\ell}(x_\ell)\}$ have a joint Gaussian distribution. As is common we assume this to have zero mean, so the multi-task GP is fully specified by the covariances $\langle f_\tau(x) f_{\tau'}(x') \rangle = C(\tau, x, \tau', x')$. For this covariance we take the flexible form from [5], $\langle f_\tau(x) f_{\tau'}(x') \rangle = D_{\tau\tau'} C(x, x')$. Here $C(x, x')$ determines the covariance between function values at different input points, encoding "spatial" behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix $D$ is a free-form inter-task covariance matrix.
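The product form of the prior covariance described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `se_kernel` and `multitask_cov`, the lengthscale and the example values of $\rho$ are hypothetical choices, not taken from the paper.

```python
import numpy as np

def se_kernel(xa, xb, lengthscale=0.1):
    """Squared exponential kernel C(x, x') for one-dimensional inputs (assumed choice)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-diff**2 / (2.0 * lengthscale**2))

def multitask_cov(xa, ta, xb, tb, D, kernel=se_kernel):
    """Prior covariance <f_ta(xa) f_tb(xb)> = D[ta, tb] * C(xa, xb)."""
    return D[np.ix_(ta, tb)] * kernel(xa, xb)

# Hypothetical two-task example with inter-task correlation rho.
rng = np.random.default_rng(0)
rho = 0.5
D = np.array([[1.0, rho], [rho, 1.0]])
x = rng.uniform(0.0, 1.0, size=8)      # inputs x_l
tasks = rng.integers(0, 2, size=8)     # task labels tau_l (0-based here)
prior_cov = multitask_cov(x, tasks, x, tasks, D)
```

Each entry of `prior_cov` is the input-space covariance scaled by the relevant entry of $D$, so examples from different tasks are coupled only through $\rho$.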
One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point $x$ for task $\tau$, one would predict as output the mean of $f_\tau(x)$ over the (Gaussian) posterior, which is $y^T K^{-1} k_\tau(x)$. Here $K$ is the $n \times n$ Gram matrix with entries $K_{\ell m} = D_{\tau_\ell \tau_m} C(x_\ell, x_m) + \sigma^2_{\tau_\ell} \delta_{\ell m}$, while $k_\tau(x)$ is a vector with the $n$ entries $k_{\tau,\ell} = D_{\tau_\ell \tau} C(x_\ell, x)$. The error bar would be taken as the square root of the posterior variance of $f_\tau(x)$, which is

$$V_\tau(x) = D_{\tau\tau} C(x, x) - k_\tau^T(x) K^{-1} k_\tau(x) \tag{1}$$

The learning curve for task $\tau$ is defined as the mean-squared prediction error, averaged over the location of test input $x$ and over all data sets with a specified number of examples for each task, say $n_1$ for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs $y_\ell$ are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error $\hat\epsilon_\tau$ is the Bayes error, and is given by the average posterior variance [1], i.e. $\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x$. To obtain the learning curve this is averaged over the location of the training inputs $x_\ell$: $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$. This average presents the main challenge for learning curve prediction because the training inputs feature in a highly nonlinear way in $V_\tau(x)$. Note that the training outputs, on the other hand, do not appear in the posterior variance $V_\tau(x)$ and so do not need to be averaged over.

We now want to write the Bayes error $\hat\epsilon_\tau$ in a form convenient for performing, at least approximately, the averages required for the learning curve. Assume that all training inputs $x_\ell$, and also the test input $x$, are drawn from the same distribution $P(x)$. One can decompose the input-dependent part of the covariance function into eigenfunctions relative to $P(x)$, according to $C(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$.
The eigenfunctions are defined by the condition $\langle C(x, x') \phi_i(x') \rangle_{x'} = \lambda_i \phi_i(x)$ and can be chosen to be orthonormal with respect to $P(x)$, $\langle \phi_i(x) \phi_j(x) \rangle_x = \delta_{ij}$. The sum over $i$ here is in general infinite (unless the covariance function is degenerate, as e.g. for the dot product kernel $C(x, x') = x \cdot x'$). To make the algebra below as simple as possible, we let the eigenvalues $\lambda_i$ be arranged in decreasing order and truncate the sum to the finite range $i = 1, \ldots, M$; $M$ is then some large effective feature space dimension and can be taken to infinity at the end.

In terms of the above eigenfunction decomposition, the Gram matrix has elements

$$K_{\ell m} = D_{\tau_\ell \tau_m} \sum_i \lambda_i \phi_i(x_\ell) \phi_i(x_m) + \sigma^2_{\tau_\ell} \delta_{\ell m} = \sum_{i,\tau,j,\tau'} \delta_{\tau_\ell,\tau}\, \phi_i(x_\ell)\, \lambda_i \delta_{ij} D_{\tau\tau'}\, \phi_j(x_m)\, \delta_{\tau',\tau_m} + \sigma^2_{\tau_\ell} \delta_{\ell m}$$

or in matrix form $K = \Psi L \Psi^T + \Sigma$ where $\Sigma$ is the diagonal matrix from the noise variances and

$$\Psi_{\ell,i\tau} = \delta_{\tau_\ell,\tau}\, \phi_i(x_\ell), \qquad L_{i\tau,j\tau'} = \lambda_i \delta_{ij} D_{\tau\tau'} \tag{2}$$

Here $\Psi$ has its second index ranging over $M$ (number of kernel eigenvalues) times $T$ (number of tasks) values; $L$ is a square matrix of this size. In Kronecker (tensor) product notation, $L = D \otimes \Lambda$ if we define $\Lambda$ as the diagonal matrix with entries $\lambda_i \delta_{ij}$. The Kronecker product is convenient for the simplifications below; we will use that for generic square matrices, $(A \otimes B)(A' \otimes B') = (AA') \otimes (BB')$, $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$, and $\mathrm{tr}\,(A \otimes B) = (\mathrm{tr}\,A)(\mathrm{tr}\,B)$. In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, $L$ can then be viewed as consisting of $T \times T$ blocks, each of which is proportional to $\Lambda$. To calculate the Bayes error, we need to average the posterior variance $V_\tau(x)$ over the test input $x$. The first term in (1) then becomes $D_{\tau\tau} \langle C(x, x) \rangle = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$.
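The tensor form $K = \Psi L \Psi^T + \Sigma$ and the Kronecker identities quoted above can be checked numerically. The following is a minimal sketch, assuming a degenerate kernel with $M = 4$ features whose values at the training inputs are stood in for by random numbers; all variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, M, n = 3, 4, 6                        # tasks, kernel eigenvalues, examples
lam = np.array([1.0, 0.5, 0.25, 0.125])  # assumed eigenvalues lambda_i
Lam = np.diag(lam)
A = rng.normal(size=(T, T))
D = A @ A.T + np.eye(T)                  # an arbitrary positive definite D
sigma2 = np.array([0.1, 0.2, 0.3])       # noise variance per task
tasks = rng.integers(0, T, size=n)       # task labels tau_l (0-based)
Phi = rng.normal(size=(n, M))            # stand-in for phi_i(x_l)

# Gram matrix written directly: D[tau_l, tau_m] * C(x_l, x_m) + noise.
Sigma = np.diag(sigma2[tasks])
K_direct = D[np.ix_(tasks, tasks)] * ((Phi * lam) @ Phi.T) + Sigma

# Tensor form: combined index (tau, i) is stored at position tau * M + i,
# which matches the block layout of np.kron(D, Lam).
Psi = np.zeros((n, T * M))
for l in range(n):
    Psi[l, tasks[l] * M:(tasks[l] + 1) * M] = Phi[l]
L = np.kron(D, Lam)
K_tensor = Psi @ L @ Psi.T + Sigma

print(np.allclose(K_direct, K_tensor))
print(np.allclose(np.linalg.inv(L), np.kron(np.linalg.inv(D), np.linalg.inv(Lam))))
print(np.isclose(np.trace(L), np.trace(D) * np.trace(Lam)))
```

Storing the task index as the slow index is just one convention; it is chosen here so that `np.kron(D, Lam)` reproduces the picture of $T \times T$ blocks each proportional to $\Lambda$.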
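In the second one, we need to average

$$\langle k_{\tau,\ell}(x)\, k_{\tau,m}(x) \rangle_x = D_{\tau\tau_\ell} \langle C(x_\ell, x) C(x, x_m) \rangle_x D_{\tau_m\tau} = D_{\tau\tau_\ell} \sum_{ij} \lambda_i \lambda_j \phi_i(x_\ell) \langle \phi_i(x) \phi_j(x) \rangle_x \phi_j(x_m) D_{\tau_m\tau} = \sum_{i,\tau',j,\tau''} D_{\tau\tau'} \Psi_{\ell,i\tau'}\, \lambda_i \lambda_j \delta_{ij}\, \Psi_{m,j\tau''} D_{\tau''\tau}$$

In matrix form this is $\langle k_\tau(x) k_\tau^T(x) \rangle_x = \Psi [(D e_\tau e_\tau^T D) \otimes \Lambda^2] \Psi^T = \Psi M_\tau \Psi^T$. Here the last equality defines $M_\tau$, and we have denoted by $e_\tau$ the $T$-dimensional vector with $\tau$-th component equal to one and all others zero. Multiplying by the inverse Gram matrix $K^{-1}$ and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task $\tau$

$$\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, \Psi M_\tau \Psi^T (\Psi L \Psi^T + \Sigma)^{-1}$$

Applying the Woodbury identity and re-arranging yields

$$\hat\epsilon_\tau = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau \Psi^T \Sigma^{-1} \Psi (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau L^{-1} [I - (I + L \Psi^T \Sigma^{-1} \Psi)^{-1}]$$

But

$$\mathrm{tr}\, M_\tau L^{-1} = \mathrm{tr}\, \{[(D e_\tau e_\tau^T D) \otimes \Lambda^2][D \otimes \Lambda]^{-1}\} = \mathrm{tr}\, \{[D e_\tau e_\tau^T] \otimes \Lambda\} = e_\tau^T D e_\tau\, \mathrm{tr}\,\Lambda = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$$

so the first and second terms in the expression for $\hat\epsilon_\tau$ cancel and one has

$$\hat\epsilon_\tau = \mathrm{tr}\, M_\tau L^{-1} (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, L^{-1} M_\tau L^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$$
$$= \mathrm{tr}\, [D \otimes \Lambda]^{-1} [(D e_\tau e_\tau^T D) \otimes \Lambda^2] [D \otimes \Lambda]^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, [e_\tau e_\tau^T \otimes I] (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$$

The matrix in square brackets in the last line is just a projector $P_\tau$ onto task $\tau$; thought of as a matrix of $T \times T$ blocks (each of size $M \times M$), this has an identity matrix in the $(\tau, \tau)$ block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task $\tau$,

$$\hat\epsilon_\tau = \mathrm{tr}\, P_\tau (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} \tag{3}$$

Because $\Sigma$ is diagonal and given the definition (2) of $\Psi$, the matrix $\Psi^T \Sigma^{-1} \Psi$ is a sum of contributions from the individual training examples $\ell = 1, \ldots, n$. This will be important for deriving the learning curve approximation below. We note in passing that, because $\sum_\tau P_\tau = I$, the sum of the Bayes errors on all tasks is $\sum_\tau \hat\epsilon_\tau = \mathrm{tr}\, (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$, in close analogy to the corresponding expression for the single-task case [13].

3 Learning curve prediction

To obtain the learning curve $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$, we now need to carry out the average $\langle \ldots \rangle$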
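over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix $\mathcal{G} = (L^{-1} + \Psi^T \Sigma^{-1} \Psi + \sum_\tau v_\tau P_\tau)^{-1}$ with auxiliary parameters $v_\tau$ that will be set back to zero at the end. One can then ask how $G = \langle \mathcal{G} \rangle$ and hence $\epsilon_{\tau'} = \mathrm{tr}\, P_{\tau'} G$ changes with the number $n_\tau$ of training points for task $\tau$. Adding an example at position $x$ for task $\tau$ increases $\Psi^T \Sigma^{-1} \Psi$ by $\sigma_\tau^{-2} \phi_\tau \phi_\tau^T$, where $\phi_\tau$ has elements $(\phi_\tau)_{i\tau'} = \phi_i(x) \delta_{\tau\tau'}$. Evaluating the difference $(\mathcal{G}^{-1} + \sigma_\tau^{-2} \phi_\tau \phi_\tau^T)^{-1} - \mathcal{G}$ with the help of the Woodbury identity and approximating it with a derivative gives

$$\frac{\partial \mathcal{G}}{\partial n_\tau} = -\frac{\mathcal{G} \phi_\tau \phi_\tau^T \mathcal{G}}{\sigma_\tau^2 + \phi_\tau^T \mathcal{G} \phi_\tau}$$

This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately we get

$$\frac{\partial G}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \mathrm{tr}\, P_\tau G}\, \frac{\partial G}{\partial v_\tau} \tag{4}$$

Here we have exploited for the average over $x$ that the matrix $\langle \phi_\tau \phi_\tau^T \rangle_x$ has $(i, \tau'), (j, \tau'')$-entry $\langle \phi_i(x) \phi_j(x) \rangle_x \delta_{\tau\tau'} \delta_{\tau\tau''} = \delta_{ij} \delta_{\tau\tau'} \delta_{\tau\tau''}$, hence simply $\langle \phi_\tau \phi_\tau^T \rangle_x = P_\tau$. We have also used the auxiliary parameters to rewrite $-\langle \mathcal{G} P_\tau \mathcal{G} \rangle = \partial \langle \mathcal{G} \rangle / \partial v_\tau = \partial G / \partial v_\tau$. Finally, multiplying (4) by $P_{\tau'}$ and taking the trace gives the set of quasi-linear partial differential equations

$$\frac{\partial \epsilon_{\tau'}}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \epsilon_\tau}\, \frac{\partial \epsilon_{\tau'}}{\partial v_\tau} \tag{5}$$

The remaining task is now to find the functions $\epsilon_\tau(n_1, \ldots, n_T, v_1, \ldots, v_T)$ by solving these differential equations. We initially attempted to do this by tracking the $\epsilon_\tau$ as examples are added one task at a time, but the derivation is laborious already for $T = 2$ and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a $2T$-dimensional surface in the $3T$-dimensional space $(n_1, \ldots, n_T, v_1, \ldots, v_T, \epsilon_1, \ldots, \epsilon_T)$, which is specified by the $T$ functions $\epsilon_\tau(\ldots)$. A small change $(\delta n_1, \ldots, \delta n_T, \delta v_1, \ldots, \delta v_T, \delta\epsilon_1, \ldots$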
$, \delta\epsilon_T)$ in all $3T$ coordinates is tangential to this surface if it obeys the $T$ constraints (one for each $\tau'$)

$$\delta\epsilon_{\tau'} = \sum_\tau \left( \frac{\partial \epsilon_{\tau'}}{\partial n_\tau}\, \delta n_\tau + \frac{\partial \epsilon_{\tau'}}{\partial v_\tau}\, \delta v_\tau \right)$$

From (5), one sees that this condition is satisfied whenever $\delta\epsilon_\tau = 0$ and $\delta n_\tau = -\delta v_\tau (\sigma_\tau^2 + \epsilon_\tau)$. It follows that all the characteristic curves given by $\epsilon_\tau(t) = \epsilon_{\tau,0} = \mathrm{const.}$, $v_\tau(t) = v_{\tau,0}(1 - t)$, $n_\tau(t) = v_{\tau,0}(\sigma_\tau^2 + \epsilon_{\tau,0})\, t$ for $t \in [0, 1]$ are tangential to the solution surface for all $t$, so lie within this surface if the initial point at $t = 0$ does. Because at $t = 0$ there are no training examples ($n_\tau(0) = 0$), this initial condition is satisfied by setting

$$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \Big)^{-1}$$

Because $\epsilon_\tau(t)$ is constant along the characteristic curve, we get by equating the values at $t = 0$ and $t = 1$

$$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \Big)^{-1} = \epsilon_\tau(\{n_{\tau'} = v_{\tau',0}(\sigma_{\tau'}^2 + \epsilon_{\tau',0})\}, \{v_{\tau'} = 0\})$$

Expressing $v_{\tau',0}$ in terms of $n_{\tau'}$ gives then

$$\epsilon_\tau = \mathrm{tr}\, P_\tau \Big( L^{-1} + \sum_{\tau'} \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} P_{\tau'} \Big)^{-1} \tag{6}$$

This is our main result: a closed set of $T$ self-consistency equations for the average Bayes errors $\epsilon_\tau$. Given $L$ as defined by the eigenvalues $\lambda_i$ of the covariance function, the noise levels $\sigma_\tau^2$ and the number of examples $n_\tau$ for each task, it is straightforward to solve these equations numerically to find the average Bayes error $\epsilon_\tau$ for each task. The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of $M \times M$ blocks of size $T \times T$ (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues $\lambda_i$. Explicitly, because $L^{-1} = D^{-1} \otimes \Lambda^{-1}$, one has

$$\epsilon_\tau = \sum_i \Big[ \Big( \lambda_i^{-1} D^{-1} + \mathrm{diag}\Big( \Big\{ \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} \Big\} \Big) \Big)^{-1} \Big]_{\tau\tau} \tag{7}$$

4 Results and discussion

We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where $D$ is diagonal. Here one recovers $T$ separate equations for the individual tasks as expected, which have the same form as for single-task learning [13].
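The self-consistency equations (7) can be solved numerically in several ways; one possible sketch is a plain fixed-point iteration started from the prior value of the error. The solver choice, the function name `multitask_learning_curve`, the tolerances and the example eigenvalues below are assumptions made for illustration, not the authors' procedure. The code uses the algebraically equivalent form $(\lambda_i^{-1} D^{-1} + N)^{-1} = \lambda_i D (I + \lambda_i N D)^{-1}$ so that a singular $D$ (fully correlated tasks) can also be handled.

```python
import numpy as np

def multitask_learning_curve(lam, D, n, sigma2, max_iter=10000, tol=1e-12):
    """Fixed-point iteration for the average Bayes errors eps_tau in eq. (7).

    lam: kernel eigenvalues, D: T x T inter-task covariance,
    n: number of examples per task, sigma2: noise variance per task.
    """
    lam = np.asarray(lam, dtype=float)
    D = np.asarray(D, dtype=float)
    n = np.asarray(n, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    eye = np.eye(D.shape[0])
    # Without data the error is the prior variance D_tautau * tr(Lambda).
    eps = np.diag(D) * lam.sum()
    for _ in range(max_iter):
        w = n / (sigma2 + eps)                       # diagonal of N
        # One T x T block per eigenvalue: lam_i * D (I + lam_i N D)^-1.
        B = eye + lam[:, None, None] * (w[:, None] * D)
        blocks = lam[:, None, None] * (D @ np.linalg.inv(B))
        new_eps = np.einsum('itt->t', blocks)        # sum over i of diagonal entries
        if np.max(np.abs(new_eps - eps)) < tol:
            return new_eps
        eps = new_eps
    return eps

# Hypothetical two-task example: exponentially decaying eigenvalues with tr(Lambda) = 1.
lam = 0.5 ** np.arange(1, 21)
lam = lam / lam.sum()
rho = np.sqrt(0.5)
D = np.array([[1.0, rho], [rho, 1.0]])
eps = multitask_learning_curve(lam, D, n=[25, 75], sigma2=[0.05, 0.05])
print(eps)
```

Because the right-hand side of (7) increases with each $\epsilon_{\tau'}$, iterating downwards from the prior value produces a monotonically decreasing sequence; a Newton-type solver would be a reasonable alternative if faster convergence is needed.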
4.1 Pure transfer learning

Consider now the case of pure transfer learning, where one is learning a task of interest (say $\tau = 1$) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first $T_0$ tasks, $n_1 = \ldots = n_{T_0} = 0$, but a large number of examples for the remaining $T_1 = T - T_0$ tasks. Denote $E = D^{-1}$ and write this in block form as

$$E = \begin{pmatrix} E_{00} & E_{01} \\ E_{01}^T & E_{11} \end{pmatrix}$$

Now multiply by $\lambda_i^{-1}$ and add in the lower right block a diagonal matrix $N = \mathrm{diag}(\{n_\tau / (\sigma_\tau^2 + \epsilon_\tau)\}_{\tau = T_0 + 1, \ldots, T})$. The matrix inverse in (7) then has top left block $\lambda_i [E_{00}^{-1} + E_{00}^{-1} E_{01} (\lambda_i N + E_{11} - E_{01}^T E_{00}^{-1} E_{01})^{-1} E_{01}^T E_{00}^{-1}]$. As the number of examples for the last $T_1$ tasks grows, so do all (diagonal) elements of $N$. In the limit only the term $\lambda_i E_{00}^{-1}$ survives, and summing over $i$ gives $\epsilon_1 = \mathrm{tr}\,\Lambda\, (E_{00}^{-1})_{11} = \langle C(x, x) \rangle (E_{00}^{-1})_{11}$. The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last $T_1$ tasks $f_\tau(x)$ ($\tau = T_0 + 1, \ldots, T$) have been learned perfectly, the posterior over the first $T_0$ functions is, by standard Gaussian conditioning, a GP with covariance $C(x, x') (E_{00})^{-1}$. Averaging the posterior variance of $f_1(x)$ then gives the Bayes error on task 1 as $\epsilon_1 = \langle C(x, x) \rangle (E_{00}^{-1})_{11}$, as found earlier. This analysis can be extended to the case where there are some examples available also for the first $T_0$ tasks. One finds for the generalization errors on these tasks the prediction (7) with $D^{-1}$ replaced by $E_{00}$. This is again in line with the above form of the GP posterior after perfect learning of the remaining $T_1$ tasks.

4.2 Two tasks

We next analyse how well the approximation (7) does in predicting multi-task learning curves for $T = 2$ tasks.
Here we have the work of Chai [21] as a baseline, and as there we choose

$$D = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions $f_1(x)$ and $f_2(x)$ to unit variance; the degree of correlation of the tasks is controlled by $\rho$. We fix $\pi_2 = n_2 / n$ and plot learning curves against $n$. In numerical simulations we ensure integer values of $n_1$ and $n_2$ by setting $n_2 = \lfloor n \pi_2 \rfloor$, $n_1 = n - n_2$; for evaluation of (7) we use directly $n_2 = n \pi_2$, $n_1 = n(1 - \pi_2)$. For simplicity we consider equal noise levels $\sigma_1^2 = \sigma_2^2 = \sigma^2$. As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel $C(x, x') = \exp[-(x - x')^2 / (2 l^2)]$ with lengthscale $l$, and one-dimensional inputs $x$ with a Gaussian distribution $N(0, 1/12)$. The kernel eigenvalues $\lambda_i$ are known explicitly from [22] and decay exponentially with $i$.

[Figure 1 (three panels, each plotting $\epsilon_1$ against $n$, with log-log insets): Average Bayes error for task 1 for two-task GP regression with kernel lengthscale $l = 0.01$, noise level $\sigma^2 = 0.05$ and a fraction $\pi_2 = 0.75$ of examples for task 2. Solid lines: numerical simulations; dashed lines: approximation (7). Task correlation $\rho^2 = 0, 0.25, 0.5, 0.75, 1$ from top to bottom. Left: SE covariance function, Gaussian input distribution. Middle: SE covariance, uniform inputs. Right: OU covariance, uniform inputs. Log-log plots (insets) show tendency of asymptotic uselessness, i.e. bunching of the $\rho < 1$ curves towards the one for $\rho = 0$; this effect is strongest for learning of smooth functions (left and middle).]

Figure 1(left) compares numerically simulated learning curves with the predictions for $\epsilon_1$, the average Bayes error on task 1, from (7). Five pairs of curves are shown, for $\rho^2 = 0, 0.25, 0.5, 0.75, 1$.
Note that the two extreme values represent single-task limits, where examples from task 2 are either ignored ($\rho = 0$) or effectively treated as being from task 1 ($\rho = 1$). Our predictions lie generally below the true learning curves, but qualitatively represent the trends well, in particular the variation with $\rho^2$. The curves for the different $\rho^2$ values are fairly evenly spaced vertically for small number of examples, $n$, corresponding to a linear dependence on $\rho^2$. As $n$ increases, however, the learning curves for $\rho < 1$ start to bunch together and separate from the one for the fully correlated case ($\rho = 1$). The approximation (7) correctly captures this behaviour, which is discussed in more detail below. Figure 1(middle) has analogous results for the case of inputs $x$ uniformly distributed on the interval $[0, 1]$; the $\lambda_i$ here decay exponentially with $i^2$ [17]. Quantitative agreement between simulations and predictions is better for this case. The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance $\langle V_\tau(x) \rangle$; but for a uniform input distribution this variation will be weak except near the ends of the input range $[0, 1]$. Figure 1(right) displays similar results for an OU kernel $C(x, x') = \exp(-|x - x'| / l)$, showing that our predictions also work well when learning rough (nowhere differentiable) functions.

4.3 Asymptotic uselessness

The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples $n$ is large, the learning curves seem to bunch towards the curve for $\rho = 0$, where task 2 examples are ignored, except when the two tasks are fully correlated ($\rho = 1$). We now study this effect. When the number of examples for all tasks becomes large, the Bayes errors $\epsilon_\tau$ will become small and eventually be negligible compared to the noise variances $\sigma_\tau^2$ in (7).
One then has an explicit prediction for each $\epsilon_\tau$, without solving $T$ self-consistency equations. If we write, for $T$ tasks, $n_\tau = n \pi_\tau$ with $\pi_\tau$ the fraction of examples for task $\tau$, and set $\gamma_\tau = \pi_\tau / \sigma_\tau^2$, then for large $n$

$$\epsilon_\tau = \sum_i \big[ (\lambda_i^{-1} D^{-1} + n \Gamma)^{-1} \big]_{\tau\tau} = \sum_i \big( \Gamma^{-1/2} [\lambda_i^{-1} (\Gamma^{1/2} D \Gamma^{1/2})^{-1} + n I]^{-1} \Gamma^{-1/2} \big)_{\tau\tau} \tag{8}$$

where $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_T)$. Using an eigendecomposition of the symmetric matrix $\Gamma^{1/2} D \Gamma^{1/2} = \sum_{a=1}^T \delta_a v_a v_a^T$, one then shows in a few lines that (8) can be written as

$$\epsilon_\tau \approx \gamma_\tau^{-1} \sum_a (v_{a,\tau})^2\, \delta_a\, g(n \delta_a) \tag{9}$$

where $g(h) = \mathrm{tr}\, (\Lambda^{-1} + h)^{-1} = \sum_i (\lambda_i^{-1} + h)^{-1}$ and $v_{a,\tau}$ is the $\tau$-th component of the $a$-th eigenvector $v_a$. This is the general asymptotic form of our prediction for the average Bayes error for task $\tau$.

[Figure 2 (two panels): Left: Bayes error (parameters as in Fig. 1(left), with $n = 500$) vs $\rho^2$. To focus on the error reduction with $\rho$, $r = [\epsilon_1(\rho) - \epsilon_1(1)] / [\epsilon_1(0) - \epsilon_1(1)]$ is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger $n$, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: Analogous results for rough functions (parameters as in Fig. 1(right)). Right: Learning curve for many-task learning ($T = 200$, parameters otherwise as in Fig. 1(left) except $\rho^2 = 0.8$). Notice the bend around $\epsilon_1 = 1 - \rho = 0.106$. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order $\tau = 1, \ldots, T$ rather than randomly); dashed line: predictions from (7). Inset: Predictions for $T = 1000$, with asymptotic forms $\epsilon = 1 - \rho + \rho \tilde\epsilon$ and $\epsilon = (1 - \rho) \bar\epsilon$ for the two learning stages shown as solid lines.]

To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order $r$.
The kernel eigenvalues $\lambda_i$ then decay as $i^{-(2r+2)}$ for large $i$ (see footnote 1), and using arguments from [17] one deduces that $g(h) \sim h^{-\alpha}$ for large $h$, with $\alpha = (2r + 1)/(2r + 2)$. In (9) we can then write, for large $n$, $g(n \delta_a) \approx (\delta_a / \gamma_\tau)^{-\alpha} g(n \gamma_\tau)$ and hence

$$\epsilon_\tau \approx g(n \gamma_\tau) \Big\{ \sum_a (v_{a,\tau})^2 (\delta_a / \gamma_\tau)^{1 - \alpha} \Big\} \tag{10}$$

When there is only a single task, $\delta_1 = \gamma_1$ and this expression reduces to $\epsilon_1 = g(n \gamma_1) = g(n_1 / \sigma_1^2)$. Thus $g(n \gamma_\tau) = g(n_\tau / \sigma_\tau^2)$ is the error we would get by ignoring all examples from tasks other than $\tau$, and the term in $\{\ldots\}$ in (10) gives the "multi-task gain", i.e. the factor by which the error is reduced because of examples from other tasks. (The absolute error reduction always vanishes trivially for $n \to \infty$, along with the errors themselves.) One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to $r \to \infty$ and hence $\alpha \to 1$, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues $\delta_a$ of $\Gamma^{1/2} D \Gamma^{1/2}$ will then be zero. Fig. 2(left) shows this effect in action, plotting Bayes error against $\rho^2$ for the two-task setting of Fig. 1(left) with $n = 500$. Our predictions capture the nonlinear dependence on $\rho^2$ quite well, though the effect is somewhat weaker in the simulations. For larger $n$ the predictions approach a curve that is constant for $\rho < 1$, signifying negligible improvement from multi-task learning except at $\rho = 1$. It is worth contrasting this with the lower bound from [21], which is linear in $\rho^2$. While this provides a very good approximation to the learning curves for moderate $n$ [21], our results here show that asymptotically this bound can become very loose. When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in $\rho^2$ (see Fig. 2(left, inset) for the OU case, which has $r = 1$).
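The multi-task gain, i.e. the factor in braces in (10), is straightforward to evaluate from an eigendecomposition of $\Gamma^{1/2} D \Gamma^{1/2}$. The sketch below is illustrative only; the function name `multitask_gain` and the parameter values are assumptions. For two tasks with equal $\gamma_\tau$ the gain reduces to $\tfrac{1}{2}[(1+\rho)^{1-\alpha} + (1-\rho)^{1-\alpha}]$, which provides a simple check.

```python
import numpy as np

def multitask_gain(D, gamma, alpha):
    """Asymptotic multi-task gain per task: sum_a v_{a,tau}^2 (delta_a / gamma_tau)^(1 - alpha)."""
    D = np.asarray(D, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    root = np.sqrt(gamma)
    S = root[:, None] * D * root[None, :]          # Gamma^{1/2} D Gamma^{1/2}
    delta, V = np.linalg.eigh(S)                   # columns of V are the eigenvectors v_a
    delta = np.clip(delta, 0.0, None)              # guard against tiny negative round-off
    ratio = delta[None, :] / gamma[:, None]        # delta_a / gamma_tau
    return np.sum(V**2 * ratio ** (1.0 - alpha), axis=1)

# Hypothetical two-task example with equal example fractions and noise levels.
rho = 0.5
D = np.array([[1.0, rho], [rho, 1.0]])
gamma = np.array([0.5, 0.5]) / 0.05                # gamma_tau = pi_tau / sigma_tau^2
for alpha in (0.5, 0.75, 0.99):
    print(alpha, multitask_gain(D, gamma, alpha))
```

As $\alpha$ approaches one the printed gains move towards unity for any $\rho < 1$, which is the asymptotic uselessness discussed above.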
A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next.

Footnote 1: See the discussion of Sacks-Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized.

4.4 Many tasks

We assume as for the two-task case that all inter-task correlations, $D_{\tau,\tau'}$ with $\tau \neq \tau'$, are equal to $\rho$, while $D_{\tau,\tau} = 1$. This setup was used e.g. in [23], and can be interpreted as each task having a component proportional to $\sqrt{\rho}$ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number $n_\tau = n/T$ of examples for each task, and that all noise levels are the same, $\sigma_\tau^2 = \sigma^2$. Then also all Bayes errors $\epsilon_\tau = \epsilon$ will be the same. Carrying out the matrix inverses in (7) explicitly, one can then write this equation as

$$\epsilon = g_T(n / (\sigma^2 + \epsilon), \rho) \tag{11}$$

where $g_T(h, \rho)$ is related to the single-task function $g(h)$ from above by

$$g_T(h, \rho) = \frac{T - 1}{T} (1 - \rho)\, g(h (1 - \rho)/T) + \Big( \rho + \frac{1 - \rho}{T} \Big)\, g(h [\rho + (1 - \rho)/T]) \tag{12}$$

Now consider the limit $T \to \infty$ of many tasks. If $n$ and hence $h = n / (\sigma^2 + \epsilon)$ is kept fixed, $g_T(h, \rho) \to (1 - \rho) + \rho\, g(h \rho)$; here we have taken $g(0) = 1$ which corresponds to $\mathrm{tr}\,\Lambda = \langle C(x, x) \rangle_x = 1$ as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form $\epsilon = (1 - \rho) + \rho \tilde\epsilon$, where $\tilde\epsilon$ decays from one to zero with increasing $n$ as for a single task, but with an effective noise level $\tilde\sigma^2 = (1 - \rho + \sigma^2)/\rho$. Remarkably, then, even though here $n/T \to 0$ so that for most tasks no examples have been seen, the Bayes error for each task decreases by "collective learning" to a plateau of height $1 - \rho$. The remaining decay of $\epsilon$ to zero happens only once $n$ becomes of order $T$. Here one can show, by taking $T \to \infty$ at fixed $h/T$ in (12) and inserting into (11), that $\epsilon = (1 - \rho) \bar\epsilon$ where $\bar\epsilon$ again decays as for a single task but with an effective number of examples $\bar n = n/T$ and effective noise level $\bar\sigma^2 = \sigma^2 / (1 - \rho)$.
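Equations (11) and (12) involve only the scalar function $g(h)$, so the many-task learning curve is cheap to evaluate. The sketch below is one way to do this under stated assumptions: the fixed-point iteration, the function names (`g`, `g_T`, `many_task_error`) and the exponentially decaying example eigenvalues are illustrative choices rather than the authors' code.

```python
import numpy as np

def g(h, lam):
    """Single-task function g(h) = sum_i 1 / (1/lambda_i + h)."""
    return np.sum(1.0 / (1.0 / lam + h))

def g_T(h, rho, T, lam):
    """Many-task function of eq. (12) for equal inter-task correlations rho."""
    shared = rho + (1.0 - rho) / T
    return ((T - 1.0) / T) * (1.0 - rho) * g(h * (1.0 - rho) / T, lam) \
        + shared * g(h * shared, lam)

def many_task_error(n, sigma2, rho, T, lam, max_iter=10000, tol=1e-12):
    """Solve eps = g_T(n / (sigma2 + eps), rho) from eq. (11) by fixed-point iteration."""
    eps = g_T(0.0, rho, T, lam)              # value without any data
    for _ in range(max_iter):
        new_eps = g_T(n / (sigma2 + eps), rho, T, lam)
        if abs(new_eps - eps) < tol:
            return new_eps
        eps = new_eps
    return eps

# Hypothetical example: eigenvalues normalized so that tr(Lambda) = 1.
lam = 0.5 ** np.arange(1, 21)
lam = lam / lam.sum()
rho, sigma2, T = 0.8, 0.05, 10**6
for n in (10, 10**2, 10**4, 10**6, 10**8):
    print(n, many_task_error(n, sigma2, rho, T, lam))
```

With $T$ this large the printed errors first settle near the plateau $1 - \rho$ while $n \ll T$, and only decay further once $n$ becomes comparable to $T$, in line with the two-stage picture described above.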
This final stage of learning therefore happens only when each task has seen a considerable number of examples $n/T$. Fig. 2(right) validates these predictions against simulations, for a number of tasks ($T = 200$) that is in the same ballpark as in the many-tasks application example of [24]. The inset for $T = 1000$ shows clearly how the two learning curve stages separate as $T$ becomes larger. Finally we can come back to the multi-task gain in the asymptotic stage of learning. For GP priors with sample functions with derivatives up to order $r$ as before, the function $\bar\epsilon$ from above will decay as $(\bar n / \bar\sigma^2)^{-\alpha}$; since $\epsilon = (1 - \rho) \bar\epsilon$ and $\bar\sigma^2 = \sigma^2 / (1 - \rho)$, the Bayes error $\epsilon$ is then proportional to $(1 - \rho)^{1 - \alpha}$. This multi-task gain again approaches unity for $\rho < 1$ for smooth functions ($\alpha = (2r + 1)/(2r + 2) \to 1$). Interestingly, for rough functions ($\alpha < 1$), the multi-task gain decreases for small $\rho^2$ as $1 - (1 - \alpha) \sqrt{\rho^2}$ and so always lies below a linear dependence on $\rho^2$ initially. This shows that a linear-in-$\rho^2$ lower error bound cannot generally apply to $T > 2$ tasks, and indeed one can verify that the derivation in [21] does not extend to this case.

5 Conclusion

We have derived an approximate prediction (7) for learning curves in multi-task GP regression, valid for arbitrary inter-task correlation matrices $D$. This can be evaluated explicitly knowing only the kernel eigenvalues, without sampling or recourse to single-task learning curves. The approximation shows that pure transfer learning has a simple lower error bound, and provides a good qualitative account of numerically simulated learning curves. Because it can be used to study the asymptotic behaviour for large training sets, it allowed us to show that multi-task learning can become asymptotically useless: when learning smooth functions it reduces the asymptotic Bayes error only if tasks are fully correlated.
For the limit of many tasks we found that, remarkably, some initial "collective learning" is possible even when most tasks have not seen examples. A much slower second learning stage then requires many examples per task. The asymptotic regime of this also showed explicitly that a lower error bound that is linear in $\rho^2$, the square of the inter-task correlation, is applicable only to the two-task setting $T = 2$. In future work it would be interesting to use our general result to investigate in more detail the consequences of specific choices for the inter-task correlations $D$, e.g. to represent a lower-dimensional latent factor structure. One could also try to deploy similar approximation methods to study the case of model mismatch, where the inter-task correlations $D$ would have to be learned from data. More challenging, but worthwhile, would be an extension to multi-task covariance functions where task and input-space correlations do not factorize.

References

[1] C K I Williams and C Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[2] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
[3] S Ben-David and R S Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Mach. Learn., 73(3):273–287, December 2008.
[4] Y W Teh, M Seeger, and M I Jordan. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, pages 333–340. Society for Artificial Intelligence and Statistics, 2005.
[5] E V Bonilla, F V Agakov, and C K I Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS). Omni Press, 2007.
[6] E V Bonilla, K M A Chai, and C K I Williams. Multi-task Gaussian process prediction. In J C Platt, D Koller, Y Singer, and S Roweis, editors, NIPS 20, pages 153–160, Cambridge, MA, 2008. MIT Press.
[7] M Alvarez and N D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, NIPS 21, pages 57–64, Cambridge, MA, 2009. MIT Press.
[8] G Leen, J Peltonen, and S Kaski. Focused multi-task learning using Gaussian processes. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 310–325. Springer Berlin, Heidelberg, 2011.
[9] M A Álvarez, L Rosasco, and N D Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012.
[10] A Maurer. Bounds for linear multi-task learning. J. Mach. Learn. Res., 7:117–139, 2006.
[11] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[12] G F Trecate, C K I Williams, and M Opper. Finite-dimensional approximation of Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 218–224, Cambridge, MA, 1999. MIT Press.
[13] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, NIPS 11, pages 344–350, Cambridge, MA, 1999. MIT Press.
[14] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, NIPS 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[15] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[16] D Malzahn and M Opper. Statistical mechanics of learning: a variational approach for real data. Phys. Rev. Lett., 89:108302, 2002.
[17] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.
[18] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[19] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. Springer Berlin, Heidelberg, 2005.
[20] M Urry and P Sollich. Exact learning curves for Gaussian process regression on large random graphs. In J Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, NIPS 23, pages 2316–2324, Cambridge, MA, 2010. MIT Press.
[21] K M A Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, NIPS 22, pages 279–287, 2009.
[22] H Zhu, C K I Williams, R J Rohwer, and M Morciniec. Gaussian regression and optimal finite dimensional linear models. In C M Bishop, editor, Neural Networks and Machine Learning. Springer, 1998.
[23] E Rodner and J Denzler. One-shot learning of object categories using dependent Gaussian processes. In Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 232–241. Springer Berlin, Heidelberg, 2010.
[24] T Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), pages 233–241. Morgan Kaufmann, 1998.

4 0.54161221 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

Author: Jake Bouvrie, Jean-Jacques Slotine

Abstract: To learn reliable rules that can generalize to novel situations, the brain must be capable of imposing some form of regularization. Here we suggest, through theoretical and computational arguments, that the combination of noise with synchronization provides a plausible mechanism for regularization in the nervous system. The functional role of regularization is considered in a general context in which coupled computational systems receive inputs corrupted by correlated noise. Noise on the inputs is shown to impose regularization, and when synchronization upstream induces time-varying correlations across noise variables, the degree of regularization can be calibrated over time. The resulting qualitative behavior matches experimental data from visual cortex.

5 0.53970236 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

Author: Brett Vintch, Andrew Zaharia, J Movshon, Eero P. Simoncelli

Abstract: Many visual and auditory neurons have response properties that are well explained by pooling the rectified responses of a set of spatially shifted linear filters. These filters cannot be estimated using spike-triggered averaging (STA). Subspace methods such as spike-triggered covariance (STC) can recover multiple filters, but require substantial amounts of data, and recover an orthogonal basis for the subspace in which the filters reside rather than the filters themselves. Here, we assume a linear-nonlinear–linear-nonlinear (LN-LN) cascade model in which the first linear stage is a set of shifted (‘convolutional’) copies of a common filter, and the first nonlinear stage consists of rectifying scalar nonlinearities that are identical for all filter outputs. We refer to these initial LN elements as the ‘subunits’ of the receptive field. The second linear stage then computes a weighted sum of the responses of the rectified subunits. We present a method for directly fitting this model to spike data, and apply it to both simulated and real neuronal data from primate V1. The subunit model significantly outperforms STA and STC in terms of cross-validated accuracy and efficiency.

6 0.53863853 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

7 0.53593379 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

8 0.53571177 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

9 0.53569263 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

10 0.53315449 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

11 0.53257483 23 nips-2012-A lattice filter model of the visual pathway

12 0.53230637 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

13 0.53178865 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

14 0.53123516 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

15 0.53087425 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

16 0.53075033 2 nips-2012-3D Social Saliency from Head-mounted Cameras

17 0.53047884 65 nips-2012-Cardinality Restricted Boltzmann Machines

18 0.53037578 227 nips-2012-Multiclass Learning with Simplex Coding

19 0.52984345 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

20 0.52921069 22 nips-2012-A latent factor model for highly multi-relational data