nips nips2004 nips2004-135 knowledge-graph by maker-knowledge-mining

135 nips-2004-On-Chip Compensation of Device-Mismatch Effects in Analog VLSI Neural Networks


Source: pdf

Author: Miguel Figueroa, Seth Bridges, Chris Diorio

Abstract: Device mismatch in VLSI degrades the accuracy of analog arithmetic circuits and lowers the learning performance of large-scale neural networks implemented in this technology. We show compact, low-power on-chip calibration techniques that compensate for device mismatch. Our techniques enable large-scale analog VLSI neural networks with learning performance on the order of 10 bits. We demonstrate our techniques on a 64-synapse linear perceptron learning with the Least-Mean-Squares (LMS) algorithm, and fabricated in a 0.35µm CMOS process. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Device mismatch in VLSI degrades the accuracy of analog arithmetic circuits and lowers the learning performance of large-scale neural networks implemented in this technology. [sent-5, score-0.461]

2 We show compact, low-power on-chip calibration techniques that compensate for device mismatch. [sent-6, score-0.267]

3 Our techniques enable large-scale analog VLSI neural networks with learning performance on the order of 10 bits. [sent-7, score-0.296]

4 We demonstrate our techniques on a 64-synapse linear perceptron learning with the Least-Mean-Squares (LMS) algorithm, and fabricated in a 0.35µm CMOS process. [sent-8, score-0.297]

5 More specifically, analog VLSI neural networks perform their computation using the physical properties of transistors with orders of magnitude less power and die area than their digital counterparts. [sent-13, score-0.427]

6 Despite the promise of analog VLSI, an important factor has prevented the success of large-scale neural networks using this technology: device mismatch. [sent-15, score-0.324]

7 Although it is possible to combat some of these effects using careful design techniques, they come at the cost of increased power and area, making an analog solution less attractive. [sent-20, score-0.239]

8 (a) The output z of the perceptron is the inner product between the input and weight vectors. [sent-24, score-0.37]

9 The LMS algorithm updates the weights based on the inputs and an error signal e. [sent-25, score-0.29]

10 (b) The synapse stores the weight in an analog memory cell. [sent-26, score-0.617]

11 A Gilbert multiplier computes the product between the input and the weight and outputs a differential current. [sent-27, score-0.343]

12 We have built a 64-synapse analog neural network with a learning performance of 10 bits, representing an improvement of more than one order of magnitude over that of traditional analog designs, with a modest increase in power and die area. [sent-29, score-0.518]

13 We achieve this performance by locally calibrating the critical analog blocks after circuit fabrication, using a combination of one-time (or periodic) calibration and continuous calibration driven by the same feedback as the network’s learning algorithm. [sent-32, score-0.465]

14 Figure 1(a) shows our system architecture, a linear perceptron with scalar output that computes the function $z(i) = b\,w_0(i) + \sum_{j=1}^{N} x_j(i)\,w_j(i)$ (1), where $i$ represents time, $z(i)$ is the output, $x_j(i)$ are the inputs, $w_j(i)$ are the synaptic weights, and $b$ is a constant bias input. [sent-36, score-0.47]

15 After each presentation of the input, the LMS algorithm updates the weights using the learning rule $w_j(i+1) = w_j(i) + \eta\,x_j(i)\,e(i)$ for $i = 0, 1, \ldots$ [sent-39, score-0.294]
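To make Eqn. (1) and the LMS rule concrete, here is a minimal software sketch of a linear perceptron with a bias synapse trained by LMS, assuming ideal, mismatch-free arithmetic; the learning rate, vector length, and random data are illustrative choices, not parameters of the chip.

```python
# Minimal LMS perceptron sketch (ideal arithmetic, no device mismatch).
import numpy as np

rng = np.random.default_rng(0)
n_syn, n_steps, eta, b = 64, 5000, 0.01, 1.0

w_true = rng.uniform(-1.0, 1.0, n_syn)   # reference weights to be learned
w = np.zeros(n_syn)                      # synaptic weights w_j
w0 = 0.0                                 # bias weight

for i in range(n_steps):
    x = rng.uniform(-1.0, 1.0, n_syn)    # zero-mean inputs x_j(i)
    target = w_true @ x                  # response of a fixed reference perceptron
    z = b * w0 + w @ x                   # Eqn (1): perceptron output z(i)
    e = target - z                       # error signal e(i)
    w += eta * x * e                     # LMS update of the synaptic weights
    w0 += eta * b * e                    # the bias synapse sees the constant input b

print("final RMS weight error:", np.sqrt(np.mean((w - w_true) ** 2)))
```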

16 We store the synaptic weights in a memory cell that implements nonvolatile analog storage with linear updates. [sent-45, score-0.548]

17 A circuit transforms the single-ended voltage output of the memory cell ($V_w$) into a differential voltage signal ($V_w^+$, $V_w^-$), with a constant common mode. [sent-46, score-0.828]

18 A Gilbert multiplier computes the 4-quadrant product between this signal and the input (also represented as a differential voltage $V_x^+$, $V_x^-$). [sent-47, score-0.486]

19 The output is a differential analog current pair ($I_o^+$, $I_o^-$), which we sum across all synapses by connecting them to common wires. [sent-48, score-0.476]

20 (a) Our multiplier maximizes the linearity of $x_i$, achieving a linear range of 600mV differential. [sent-54, score-0.277]

21 Gain mismatch is 2:1 and offset mismatch is up to 200mV. [sent-55, score-0.422]

22 (b) Our multiplier maximizes weight range at the cost of weight linearity (1V single-ended, 2V differential). [sent-56, score-0.441]

23 The gain variation is lower, but the offset mismatch exceeds 60% of the range. [sent-57, score-0.294]

24 We then transform (off-chip in our current implementation) the resulting analog error signal using a pulse-density modulation (PDM) representation [1]. [sent-59, score-0.293]

25 The performance of the perceptron is highly sensitive to the resolution of the error signal; therefore the PDM representation is a good match for it. [sent-63, score-0.354]

26 The LMS block in the synapse takes the error and input values and computes update pulses (also using PDM) according to Eqn. [sent-64, score-0.43]
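The text leaves the pulse-density modulation itself to reference [1]; the sketch below is one plausible first-order (sigma-delta style) encoder, shown only to make the representation concrete. The two-line inc/dec scheme and the accumulator loop are assumptions, not the circuit in [1].

```python
# Hypothetical pulse-density encoder: a value in [-1, 1] becomes the net density
# of pulses on an increment line and a decrement line.
def pdm_encode(value, n_slots):
    """Return (inc, dec) lists with one 0/1 entry per time slot."""
    inc, dec, acc = [], [], 0.0
    for _ in range(n_slots):
        acc += value
        if acc >= 1.0:
            inc.append(1); dec.append(0); acc -= 1.0
        elif acc <= -1.0:
            inc.append(0); dec.append(1); acc += 1.0
        else:
            inc.append(0); dec.append(0)
    return inc, dec

inc, dec = pdm_encode(0.3, 1000)
print((sum(inc) - sum(dec)) / 1000)   # recovers ~0.3 as a pulse density
```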

27 In the rest of this section, we analyze the effects of device mismatch on the performance of the major blocks, discuss their impact on overall system performance, and present the techniques that we developed to deal with them. [sent-66, score-0.312]

28 We illustrate with experimental results taken from a silicon implementation of the perceptron in a 0.35µm CMOS process. [sent-67, score-0.297]

29 A Gilbert multiplier implements a nonlinear function of the product between two differential voltages. [sent-71, score-0.22]

30 Device mismatch in the multiplier has two main effects: First, it creates offsets in the inputs. [sent-72, score-0.518]

31 Second, mismatch across the entire perceptron creates variations in the offsets, gain, and linearity of the product. [sent-73, score-0.545]

32 Therefore, the adaptation compensates for mild nonlinearities in the weights as long as $f_j^w$ remains a monotonic odd function [2]. [sent-77, score-0.227]

33 Consequently, we sized the transistors in the Gilbert multiplier to maximize the linearity of $f_j^x$, but paid less attention (in order to minimize size and power) to $f_j^w$. [sent-78, score-0.477]
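A rough behavioral stand-in for one synapse's multiplier, with the nonlinearities $f_j^x$ and $f_j^w$ approximated as tanh compressions and with per-synapse gain and offset mismatch. The tanh form and the numeric spreads are assumptions loosely matched to the figures quoted above (a 2:1 gain spread, offsets of tens to hundreds of mV), not extracted device models.

```python
# Behavioral model of a mismatched Gilbert multiplier (all values illustrative).
import numpy as np

def make_synapse(rng, v_lin_x=0.3, v_lin_w=1.0):
    gain = rng.uniform(0.5, 1.0)        # roughly 2:1 gain spread across synapses
    off_x = rng.normal(0.0, 0.03)       # input offset, in volts
    off_w = rng.normal(0.0, 0.15)       # weight-path offset, in volts
    def multiply(x, w):
        fx = np.tanh((x + off_x) / v_lin_x)   # f^x: sized for input linearity
        fw = np.tanh((w + off_w) / v_lin_w)   # f^w: wider range, less linear
        return gain * fx * fw                 # differential output current (a.u.)
    return multiply

rng = np.random.default_rng(1)
synapses = [make_synapse(rng) for _ in range(4)]
print([round(s(0.1, 0.5), 4) for s in synapses])   # same (x, w), different outputs
```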

34 Figure 3: A simple PDM analog memory cell. (b) Measured weight updates. [sent-82, score-0.518]

35 (a) We store each weight as nonvolatile analog charge on the floating gate FG. [sent-83, score-0.55]

36 (b) Memory updates as a function of the increment and decrement pulse densities for 8 synapses. [sent-85, score-0.255]

37 The updates show excellent linearity (10 bits), but also poor matching both within a synapse and between synapses. [sent-86, score-0.452]

38 The gain mismatch is about 2:1, but the LMS algorithm naturally absorbs it into the learned weight value. [sent-88, score-0.318]

39 Figure 2(b) shows the multiplier output as a function of the single-ended weight value $V_w$. [sent-90, score-0.297]

40 The linearity is visibly worse in this case, but the LMS algorithm compensates for it. [sent-91, score-0.238]

41 Because of the added mismatch in the single-ended to differential converter, the weights present an offset of up to ±300mV, or 30% of the weight range. [sent-94, score-0.432]

42 Consequently, we sacrifice weight linearity to increase the weight range. [sent-97, score-0.289]

43 The offsets are small (up to 100mV), but because of the restricted input range (to maximize linearity), they are large enough to dramatically affect the learning performance of the perceptron. [sent-99, score-0.225]

44 Our solution was to use the bias synapse w0 to compensate for the accumulated input offset. [sent-100, score-0.352]

45 Assuming that the multiplier is linear, offsets translate into nonzero-mean inputs, which a bias synapse trained with LMS can remove as demonstrated in [4]. [sent-101, score-0.602]
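A small simulation of this argument, assuming perfectly linear multipliers: constant per-synapse input offsets make the effective inputs nonzero-mean, and a bias synapse trained with the same LMS rule converges to $w_0 \approx -\sum_j w_j \cdot \mathrm{offset}_j$, cancelling their aggregate effect. The offset magnitudes are illustrative.

```python
# Bias synapse absorbing constant input offsets (linear-multiplier assumption).
import numpy as np

rng = np.random.default_rng(2)
n_syn, eta, b = 8, 0.02, 1.0
offsets = rng.normal(0.0, 0.1, n_syn)    # fixed per-synapse input offsets
w_true = rng.uniform(-1.0, 1.0, n_syn)
w, w0 = np.zeros(n_syn), 0.0

for i in range(20000):
    x = rng.uniform(-1.0, 1.0, n_syn)
    target = w_true @ x                  # offset-free reference response
    z = b * w0 + w @ (x + offsets)       # the hardware sees offset inputs
    e = target - z
    w += eta * (x + offsets) * e
    w0 += eta * b * e

print("learned bias weight w0:      ", round(w0, 4))
print("-sum_j w_j*offset_j (target):", round(float(-np.sum(w * offsets)), 4))
```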

46 To guarantee sufficient gain, we provide a stronger bias current to the multiplier in the bias synapse. [sent-102, score-0.258]

47 A synapse transistor [5] is a silicon device that provides compact, accurate, nonvolatile analog storage as charge on its floating gate. [sent-104, score-0.927]

48 Fowler-Nordheim tunneling adds charge to the floating gate and hot-electron injection removes charge. [sent-105, score-0.545]

49 Because of these properties, synapse transistors have been a popular choice for weight storage in recent silicon learning systems [6, 7]. [sent-107, score-0.454]

50 This is because their dynamics are exponential with respect to their control variables (floating-gate voltage, tunneling voltage and injection drain current), which naturally lead to weight-dependent nonlinear update rules. [sent-109, score-0.528]
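To see why exponential device dynamics distort the learning rule, contrast a raw floating-gate update, whose magnitude depends on the stored weight, with a pinned-gate update like the calibrated cell described next, whose magnitude depends only on the pulse counts. The exponential constants and step sizes are arbitrary illustrative values, not device parameters.

```python
# Weight-dependent vs. weight-independent updates (illustrative constants only).
import numpy as np

def raw_update(w, n_inc, n_dec, k=2.0, step=1e-3):
    # exponential tunneling/injection characteristics: the charge moved per pulse
    # depends on the current weight, so the effective learning rule is nonlinear
    return w + step * (n_inc * np.exp(-k * w) - n_dec * np.exp(k * w))

def pinned_update(w, n_inc, n_dec, step=1e-3):
    # with the floating-gate voltage held constant, updates are linear in the counts
    return w + step * (n_inc - n_dec)

for w0 in (0.0, 0.5, 1.0):
    print(w0, round(raw_update(w0, 10, 0) - w0, 5), round(pinned_update(w0, 10, 0) - w0, 5))
```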

51 This is an important problem because the learning performance of the perceptron is strongly dependent on the accuracy of the weight updates; therefore distortions in the learning rule will degrade performance. [sent-110, score-0.266]

52 Our memory cell, shown in Figure 3(a) and based on the work presented in [8], solves this problem: we store the analog weight as charge on the floating gate FG of synapse transistor M1. [sent-112, score-0.778]

53 Pulses on Pdec and Pinc activate tunneling and injection and add or remove charge from the floating gate, respectively. [sent-113, score-0.432]

54 (a) We first match the tunneling rate across all synapses by locally changing the voltage at the floating gate FGdec . [sent-117, score-0.683]

55 Then, we modify the injection rate to match the local tunneling rate using the floating gate FGinc . [sent-118, score-0.495]

56 (b) The calibrated updates are symmetric and uniform within 9-10 bits. [sent-119, score-0.239]

57 An amplifier sets the floating-gate voltage to the global voltage Vbias. [sent-120, score-0.362]

58 Because the floating-gate voltage is constant, and so are the pulse widths and amplitudes, the magnitude of the updates depends on the density of the pulses Pinc and Pdec. [sent-122, score-0.463]

59 Figure 3(b) shows the magnitude of the weight updates as a function of the density of pulses in Pinc (positive slopes) and Pdec (negative slopes) for 8 synapses. [sent-124, score-0.29]

60 Figure 3(b) highlights an important problem caused by device mismatch: the strengths of tunneling and injection are poorly balanced within a synapse (the slopes show up to a 4:1 mismatch). [sent-128, score-0.681]
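A toy model of the uncalibrated cell makes this asymmetry visible: per-synapse increment and decrement gains drawn with up to a 4:1 spread leave a nonzero residual even when equal numbers of increment and decrement pulses are applied. The gain values are illustrative, not measured.

```python
# Uncalibrated PDM memory cell with mismatched increment/decrement strengths.
import numpy as np

class PdmMemoryCell:
    def __init__(self, rng, base_step=1e-3):
        self.w = 0.0
        self.inc_gain = base_step * rng.uniform(0.5, 2.0)   # injection strength
        self.dec_gain = base_step * rng.uniform(0.5, 2.0)   # tunneling strength

    def apply_pulses(self, n_inc, n_dec):
        self.w += self.inc_gain * n_inc - self.dec_gain * n_dec

rng = np.random.default_rng(3)
cells = [PdmMemoryCell(rng) for _ in range(8)]
for c in cells:
    c.apply_pulses(n_inc=100, n_dec=100)       # nominally a zero net update
print(["%+.4f" % c.w for c in cells])          # nonzero residuals expose the asymmetry
```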

61 The nonuniformity between learning rates across the perceptron changes Eqn. [sent-132, score-0.311]

62 Therefore, learning rate mismatch does not affect the accuracy of the learned weights, but it does slow down convergence because we need to scale all learning rates globally to limit the value of the maximal rate. [sent-138, score-0.255]
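A quick numerical check of this claim, with illustrative rates and step counts: LMS with mismatched per-synapse learning rates reaches the same weights as the uniform-rate case, but takes longer once the rates are scaled so the largest one stays stable.

```python
# Learning-rate mismatch: same learned weights, slower convergence.
import numpy as np

def run_lms(etas, n_steps, seed=4):
    rng = np.random.default_rng(seed)
    w_true = rng.uniform(-1.0, 1.0, etas.size)
    w = np.zeros(etas.size)
    for _ in range(n_steps):
        x = rng.uniform(-1.0, 1.0, etas.size)
        e = w_true @ x - w @ x
        w += etas * x * e                       # per-synapse learning rates eta_j
    return np.sqrt(np.mean((w - w_true) ** 2))

n = 16
uniform = np.full(n, 0.02)
mismatched = uniform * np.random.default_rng(5).uniform(0.25, 1.0, n)  # up to 4:1 spread
for steps in (1000, 20000):
    print(steps, "uniform:", run_lms(uniform, steps), "mismatched:", run_lms(mismatched, steps))
```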

63 We modified the design of the memory cell to incorporate local calibration mechanisms that achieve this goal. [sent-140, score-0.296]

64 We set the voltage at FGdec by first tunneling using the global line erase dec, and then injecting on transistor M3 by lowering the local line set dec to equalize the tunneling rates across all synapses. [sent-144, score-0.922]

65 To compare the tunneling rates, we issue a fixed number of pulses at Pdec and compare the memory cell outputs using a double-sampling comparator (off-chip in the current implementation). [sent-145, score-0.515]
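A software sketch of this calibration loop, under the assumption that the local correction can be modeled as a single scalar (on the chip it is charge on FGdec) and that the comparator only reports whether the measured rate is above or below the target; the step size and tolerance are made-up values.

```python
# Iterative rate-matching loop mimicking the comparator-based calibration.
def calibrate_decrement_rate(measure_rate, target_rate, step=0.01, tol=3e-3,
                             max_iters=1000):
    """measure_rate(correction) -> weight change observed for a fixed pulse burst."""
    correction = 0.0
    for _ in range(max_iters):
        measured = measure_rate(correction)
        if abs(measured - target_rate) < tol:
            break
        # comparator decision: rate too strong -> back off, too weak -> strengthen
        correction += step if measured < target_rate else -step
    return correction

# Toy cell whose decrement rate depends linearly on the local correction.
corr = calibrate_decrement_rate(lambda c: 0.4 + 0.5 * c, target_rate=0.25)
print("correction:", round(corr, 3), "matched rate:", round(0.4 + 0.5 * corr, 3))
```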

66 (b) RMS error for a single synapse with a constant input and reference, including a calibrated memory cell with symmetric updates, a simple synapse with asymmetric updates, and a simulated ideal synapse. [sent-150, score-0.699]

67 We control the current limit with the voltage at the new floating gate FGinc : we first remove electrons from the floating gate using the global line erase inc. [sent-152, score-0.443]

68 Then we inject on transistor M4 by lowering the local line set inc to match the injection rates across all synapses. [sent-153, score-0.4]

69 Figure 4(b) shows the tunneling and injection rates after calibration as a function of the density of pulses Pinc and Pdec. [sent-156, score-0.606]

70 From Figure 4(b), it is clear that the update rates are now symmetric and uniform across all synapses (they match within 9-10 bits). [sent-158, score-0.285]

71 This optimization would result in approximately a 25% reduction in memory cell area (6% reduction in total synapse area), but would also cause an increase of more than 200% in convergence time, as illustrated in Section 4. [sent-160, score-0.417]

72 To validate our design, we first trained a single synapse with a DC input to learn a constant reference. [sent-172, score-0.254]

73 Because the input is constant, the linearity and offsets in the input signal do not affect the learning performance; therefore this experiment tests the resolution of the feedback path (LMS circuit and memory cell) isolated from the analog multipliers. [sent-173, score-0.938]

74 Figure 5(b) shows the evolution of the RMS value of the error for a synapse using the original and calibrated memory cells. [sent-175, score-0.453]

75 The resolution of the pulse-density modulators is about 8 bits, which limits the resolution of the error signal. [sent-176, score-0.229]

76 We also show the RMS error for a simulated (ideal) synapse learning from the same error. [sent-177, score-0.286]

77 The RMS error of the calibrated synapse converges to about 0. [sent-179, score-0.38]

78 (a) Asymmetric learning rates and multiplier offsets limit the output resolution to around 3 bits. [sent-184, score-0.566]

79 Symmetric learning rates and a bias synapse bring the resolution up to more than 10 bits, and uniform updates reduce convergence time. [sent-185, score-0.547]

80 (b) Synapse 4 shows a larger mismatch than synapse 1 and therefore it deviates from its theoretical target value to compensate. [sent-186, score-0.395]

81 The bias synapse in the VLSI perceptron converges to a value that compensates for offsets in the inputs xi to the multipliers. [sent-187, score-0.801]

82 Expressed relative to the 2µA output range, this error corresponds to an output resolution of about 13 bits. [sent-189, score-0.365]

83 The difference with the simulated synapse is due to the discrete weight updates in the PDM memory cell. [sent-190, score-0.555]

84 To test our techniques in a larger-scale system, we fabricated a 64-synapse linear perceptron in a 0.35µm CMOS process. [sent-194, score-0.481]

85 We used random zero-mean inputs selected from a uniform distribution over the entire input range, and trained the network using the response from a simulated perceptron with ideal multipliers and fixed weights as a reference. [sent-200, score-0.393]

86 The error settles to 10µA RMS, which corresponds to an output resolution of about 3 bits for a full range of 128µA differential. [sent-202, score-0.3]

87 Calibrating the synapses for symmetric learning rates only improves the RMS error to 5µA (4 bits), but the error introduced by the multiplier offsets still dominates the residual error. [sent-203, score-0.614]

88 Introducing the bias synapse and keeping the learning rates symmetric (but nonuniform across the perceptron) compensates for the offsets and brings the error down to 60nA RMS, corresponding to an output resolution better than 10 bits. [sent-204, score-0.892]
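The quoted bit figures are consistent with reading resolution as $\log_2(\text{full range}/\text{RMS error})$; the exact convention the authors use is not stated, so the snippet below only roughly reproduces the 3-bit, 4-bit, and better-than-10-bit numbers from the surrounding sentences.

```python
# Relating RMS errors to output resolution in bits (assumed convention).
import math

def resolution_bits(full_range, rms_error):
    return math.log2(full_range / rms_error)

print(resolution_bits(128e-6, 10e-6))   # ~3.7 bits: uncalibrated network
print(resolution_bits(128e-6, 5e-6))    # ~4.7 bits: symmetric learning rates only
print(resolution_bits(128e-6, 60e-9))   # ~11.1 bits: bias synapse + symmetric rates
```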

89 Further calibrating the synapses to achieve uniform, symmetric learning rates maintains the same learning performance, but reduces convergence time to less than one half, as predicted by the analysis in Section 3. [sent-205, score-0.238]

90 A simulated software perceptron with ideal multipliers and LMS updates that uses an error signal of the same resolution as our experiments gives an upper bound of just under 12 bits for the learning performance. [sent-207, score-0.688]

91 Figure 6(b) depicts the evolution of selected weights in the silicon perceptron with on-chip compensation and the software version. [sent-209, score-0.427]

92 The graph shows that synapse 1 in our VLSI implementation suffers from little mismatch, and therefore its weight virtually converges to the theoretical value given by the software implementation. [sent-210, score-0.368]

93 Because the PDM updates are discrete, the weight shows a larger oscillation around its target value than the software version. [sent-211, score-0.228]

94 The bias weight in the software perceptron converges to zero because the inputs have zero mean. [sent-213, score-0.257]

95 In the VLSI perceptron, input offsets in the multipliers create nonzero-mean inputs; therefore the bias synapse converges to a value that compensates for the aggregated effect of the offsets. [sent-214, score-0.654]

96 Device mismatch prevents analog VLSI neural networks from delivering good learning performance for large-scale applications. [sent-217, score-0.427]

97 We identified the key effects of mismatch and presented on-chip compensation techniques. [sent-218, score-0.238]

98 Combining these techniques with careful circuit design enables an improvement of more than one order of magnitude in accuracy compared to traditional analog designs, at the cost of an off-line calibration phase and a modest increase in die area and power. [sent-220, score-0.509]

99 We illustrated our techniques with a 64-synapse analog-VLSI linear perceptron that adapts using the LMS algorithm. [sent-221, score-0.235]

100 Mead, “A high-resolution nonvolatile analog memory cell,” in IEEE Intl. [sent-277, score-0.384]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lms', 0.409), ('tunneling', 0.217), ('synapse', 0.213), ('analog', 0.208), ('perceptron', 0.184), ('offsets', 0.184), ('mismatch', 0.182), ('voltage', 0.181), ('pdm', 0.16), ('oating', 0.155), ('multiplier', 0.152), ('pdec', 0.142), ('pinc', 0.142), ('vlsi', 0.142), ('injection', 0.13), ('linearity', 0.125), ('memory', 0.114), ('updates', 0.114), ('silicon', 0.113), ('gate', 0.113), ('diorio', 0.107), ('rms', 0.102), ('bits', 0.102), ('pulses', 0.094), ('resolution', 0.094), ('calibration', 0.092), ('cell', 0.09), ('circuit', 0.087), ('charge', 0.085), ('calibrated', 0.085), ('synapses', 0.083), ('weight', 0.082), ('device', 0.079), ('gilbert', 0.077), ('compensates', 0.077), ('transistor', 0.077), ('fj', 0.077), ('pulse', 0.074), ('rates', 0.073), ('fgdec', 0.071), ('vw', 0.071), ('pe', 0.071), ('die', 0.071), ('wj', 0.069), ('differential', 0.068), ('output', 0.063), ('cmos', 0.062), ('fabricated', 0.062), ('nonvolatile', 0.062), ('offset', 0.058), ('compensation', 0.056), ('gain', 0.054), ('across', 0.054), ('fginc', 0.053), ('figueroa', 0.053), ('hasler', 0.053), ('bias', 0.053), ('techniques', 0.051), ('inputs', 0.049), ('mead', 0.046), ('transistors', 0.046), ('compensate', 0.045), ('multipliers', 0.045), ('px', 0.045), ('signal', 0.044), ('asymmetric', 0.043), ('slopes', 0.042), ('fg', 0.042), ('calibrating', 0.042), ('weights', 0.042), ('block', 0.041), ('error', 0.041), ('converges', 0.041), ('input', 0.041), ('adaptive', 0.04), ('symmetric', 0.04), ('networks', 0.037), ('concepci', 0.036), ('decrement', 0.036), ('equalize', 0.036), ('erase', 0.036), ('fabrication', 0.036), ('hsu', 0.036), ('oatinggate', 0.036), ('seth', 0.036), ('visibly', 0.036), ('vout', 0.036), ('match', 0.035), ('digital', 0.034), ('circuits', 0.034), ('simulated', 0.032), ('synaptic', 0.032), ('software', 0.032), ('power', 0.031), ('portable', 0.031), ('increment', 0.031), ('lowering', 0.031), ('nonlinearities', 0.031), ('io', 0.031), ('minch', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 135 nips-2004-On-Chip Compensation of Device-Mismatch Effects in Analog VLSI Neural Networks

Author: Miguel Figueroa, Seth Bridges, Chris Diorio

Abstract: Device mismatch in VLSI degrades the accuracy of analog arithmetic circuits and lowers the learning performance of large-scale neural networks implemented in this technology. We show compact, low-power on-chip calibration techniques that compensate for device mismatch. Our techniques enable large-scale analog VLSI neural networks with learning performance on the order of 10 bits. We demonstrate our techniques on a 64-synapse linear perceptron learning with the Least-Mean-Squares (LMS) algorithm, and fabricated in a 0.35µm CMOS process. 1

2 0.24245723 176 nips-2004-Sub-Microwatt Analog VLSI Support Vector Machine for Pattern Classification and Sequence Estimation

Author: Shantanu Chakrabartty, Gert Cauwenberghs

Abstract: An analog system-on-chip for kernel-based pattern classification and sequence estimation is presented. State transition probabilities conditioned on input data are generated by an integrated support vector machine. Dot product based kernels and support vector coefficients are implemented in analog programmable floating gate translinear circuits, and probabilities are propagated and normalized using sub-threshold current-mode circuits. A 14-input, 24-state, and 720-support vector forward decoding kernel machine is integrated on a 3mm×3mm chip in 0.5µm CMOS technology. Experiments with the processor trained for speaker verification and phoneme sequence estimation demonstrate real-time recognition accuracy at par with floating-point software, at sub-microwatt power. 1

3 0.11102097 157 nips-2004-Saliency-Driven Image Acuity Modulation on a Reconfigurable Array of Spiking Silicon Neurons

Author: R. J. Vogelstein, Udayan Mallik, Eugenio Culurciello, Gert Cauwenberghs, Ralph Etienne-Cummings

Abstract: We have constructed a system that uses an array of 9,600 spiking silicon neurons, a fast microcontroller, and digital memory, to implement a reconfigurable network of integrate-and-fire neurons. The system is designed for rapid prototyping of spiking neural networks that require high-throughput communication with external address-event hardware. Arbitrary network topologies can be implemented by selectively routing address-events to specific internal or external targets according to a memory-based projective field mapping. The utility and versatility of the system is demonstrated by configuring it as a three-stage network that accepts input from an address-event imager, detects salient regions of the image, and performs spatial acuity modulation around a high-resolution fovea that is centered on the location of highest salience. 1

4 0.10110552 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

Author: Felix Schürmann, Karlheinz Meier, Johannes Schemmel

Abstract: Computation without stable states is a computing paradigm different from Turing’s and has been demonstrated for various types of simulated neural networks. This publication transfers this to a hardware implemented neural network. Results of a software implementation are reproduced showing that the performance peaks when the network exhibits dynamics at the edge of chaos. The liquid computing approach seems well suited for operating analog computing devices such as the used VLSI neural network. 1

5 0.085514002 118 nips-2004-Methods for Estimating the Computational Power and Generalization Capability of Neural Microcircuits

Author: Wolfgang Maass, Robert A. Legenstein, Nils Bertschinger

Abstract: What makes a neural microcircuit computationally powerful? Or more precisely, which measurable quantities could explain why one microcircuit C is better suited for a particular family of computational tasks than another microcircuit C ? We propose in this article quantitative measures for evaluating the computational power and generalization capability of a neural microcircuit, and apply them to generic neural microcircuit models drawn from different distributions. We validate the proposed measures by comparing their prediction with direct evaluations of the computational performance of these microcircuit models. This procedure is applied first to microcircuit models that differ with regard to the spatial range of synaptic connections and with regard to the scale of synaptic efficacies in the circuit, and then to microcircuit models that differ with regard to the level of background input currents and the level of noise on the membrane potential of neurons. In this case the proposed method allows us to quantify differences in the computational power and generalization capability of circuits in different dynamic regimes (UP- and DOWN-states) that have been demonstrated through intracellular recordings in vivo. 1

6 0.06909056 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

7 0.06186974 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning

8 0.05491665 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

9 0.052090343 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

10 0.050892469 112 nips-2004-Maximising Sensitivity in a Spiking Network

11 0.048446756 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography

12 0.045945797 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

13 0.045942977 145 nips-2004-Parametric Embedding for Class Visualization

14 0.045477699 153 nips-2004-Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity

15 0.044910248 28 nips-2004-Bayesian inference in spiking neurons

16 0.043519784 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

17 0.041495945 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern

18 0.041392095 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

19 0.040062424 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

20 0.03963143 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.131), (1, -0.089), (2, -0.03), (3, 0.012), (4, -0.022), (5, 0.01), (6, 0.071), (7, -0.002), (8, 0.022), (9, -0.065), (10, -0.026), (11, -0.104), (12, -0.085), (13, 0.023), (14, -0.075), (15, -0.094), (16, -0.144), (17, -0.1), (18, -0.05), (19, -0.243), (20, -0.037), (21, 0.024), (22, 0.03), (23, -0.045), (24, -0.121), (25, -0.113), (26, -0.152), (27, 0.104), (28, -0.145), (29, -0.214), (30, -0.047), (31, -0.131), (32, -0.063), (33, 0.192), (34, 0.065), (35, 0.126), (36, -0.037), (37, 0.02), (38, 0.011), (39, 0.029), (40, 0.002), (41, 0.073), (42, 0.03), (43, -0.107), (44, -0.023), (45, 0.026), (46, 0.149), (47, -0.102), (48, 0.092), (49, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96348894 135 nips-2004-On-Chip Compensation of Device-Mismatch Effects in Analog VLSI Neural Networks

Author: Miguel Figueroa, Seth Bridges, Chris Diorio

Abstract: Device mismatch in VLSI degrades the accuracy of analog arithmetic circuits and lowers the learning performance of large-scale neural networks implemented in this technology. We show compact, low-power on-chip calibration techniques that compensate for device mismatch. Our techniques enable large-scale analog VLSI neural networks with learning performance on the order of 10 bits. We demonstrate our techniques on a 64-synapse linear perceptron learning with the Least-Mean-Squares (LMS) algorithm, and fabricated in a 0.35µm CMOS process. 1

2 0.84424692 176 nips-2004-Sub-Microwatt Analog VLSI Support Vector Machine for Pattern Classification and Sequence Estimation

Author: Shantanu Chakrabartty, Gert Cauwenberghs

Abstract: An analog system-on-chip for kernel-based pattern classification and sequence estimation is presented. State transition probabilities conditioned on input data are generated by an integrated support vector machine. Dot product based kernels and support vector coefficients are implemented in analog programmable floating gate translinear circuits, and probabilities are propagated and normalized using sub-threshold current-mode circuits. A 14-input, 24-state, and 720-support vector forward decoding kernel machine is integrated on a 3mm×3mm chip in 0.5µm CMOS technology. Experiments with the processor trained for speaker verification and phoneme sequence estimation demonstrate real-time recognition accuracy at par with floating-point software, at sub-microwatt power. 1

3 0.58597267 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning

Author: Constanze Hofstoetter, Manuel Gil, Kynan Eng, Giacomo Indiveri, Matti Mintz, Jörg Kramer, Paul F. Verschure

Abstract: We present a biophysically constrained cerebellar model of classical conditioning, implemented using a neuromorphic analog VLSI (aVLSI) chip. Like its biological counterpart, our cerebellar model is able to control adaptive behavior by predicting the precise timing of events. Here we describe the functionality of the chip and present its learning performance, as evaluated in simulated conditioning experiments at the circuit level and in behavioral experiments using a mobile robot. We show that this aVLSI model supports the acquisition and extinction of adaptively timed conditioned responses under real-world conditions with ultra-low power consumption.

4 0.55020487 157 nips-2004-Saliency-Driven Image Acuity Modulation on a Reconfigurable Array of Spiking Silicon Neurons

Author: R. J. Vogelstein, Udayan Mallik, Eugenio Culurciello, Gert Cauwenberghs, Ralph Etienne-Cummings

Abstract: We have constructed a system that uses an array of 9,600 spiking silicon neurons, a fast microcontroller, and digital memory, to implement a reconfigurable network of integrate-and-fire neurons. The system is designed for rapid prototyping of spiking neural networks that require high-throughput communication with external address-event hardware. Arbitrary network topologies can be implemented by selectively routing address-events to specific internal or external targets according to a memory-based projective field mapping. The utility and versatility of the system is demonstrated by configuring it as a three-stage network that accepts input from an address-event imager, detects salient regions of the image, and performs spatial acuity modulation around a high-resolution fovea that is centered on the location of highest salience. 1

5 0.53354228 118 nips-2004-Methods for Estimating the Computational Power and Generalization Capability of Neural Microcircuits

Author: Wolfgang Maass, Robert A. Legenstein, Nils Bertschinger

Abstract: What makes a neural microcircuit computationally powerful? Or more precisely, which measurable quantities could explain why one microcircuit C is better suited for a particular family of computational tasks than another microcircuit C ? We propose in this article quantitative measures for evaluating the computational power and generalization capability of a neural microcircuit, and apply them to generic neural microcircuit models drawn from different distributions. We validate the proposed measures by comparing their prediction with direct evaluations of the computational performance of these microcircuit models. This procedure is applied first to microcircuit models that differ with regard to the spatial range of synaptic connections and with regard to the scale of synaptic efficacies in the circuit, and then to microcircuit models that differ with regard to the level of background input currents and the level of noise on the membrane potential of neurons. In this case the proposed method allows us to quantify differences in the computational power and generalization capability of circuits in different dynamic regimes (UP- and DOWN-states) that have been demonstrated through intracellular recordings in vivo. 1

6 0.43103611 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

7 0.33256269 128 nips-2004-Neural Network Computation by In Vitro Transcriptional Circuits

8 0.2944653 35 nips-2004-Chemosensory Processing in a Spiking Model of the Olfactory Bulb: Chemotopic Convergence and Center Surround Inhibition

9 0.24226142 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

10 0.24100536 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography

11 0.22142814 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

12 0.20807157 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

13 0.20784509 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

14 0.20276318 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

15 0.19380909 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

16 0.17773995 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models

17 0.17469938 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

18 0.17353888 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process

19 0.1733785 127 nips-2004-Neighbourhood Components Analysis

20 0.16944987 112 nips-2004-Maximising Sensitivity in a Spiking Network


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.393), (13, 0.164), (15, 0.094), (26, 0.043), (31, 0.018), (33, 0.093), (35, 0.017), (39, 0.02), (50, 0.032), (71, 0.011), (76, 0.011), (94, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82652622 135 nips-2004-On-Chip Compensation of Device-Mismatch Effects in Analog VLSI Neural Networks

Author: Miguel Figueroa, Seth Bridges, Chris Diorio

Abstract: Device mismatch in VLSI degrades the accuracy of analog arithmetic circuits and lowers the learning performance of large-scale neural networks implemented in this technology. We show compact, low-power on-chip calibration techniques that compensate for device mismatch. Our techniques enable large-scale analog VLSI neural networks with learning performance on the order of 10 bits. We demonstrate our techniques on a 64-synapse linear perceptron learning with the Least-Mean-Squares (LMS) algorithm, and fabricated in a 0.35µm CMOS process. 1

2 0.73973 112 nips-2004-Maximising Sensitivity in a Spiking Network

Author: Anthony J. Bell, Lucas C. Parra

Abstract: We use unsupervised probabilistic machine learning ideas to try to explain the kinds of learning observed in real neurons, the goal being to connect abstract principles of self-organisation to known biophysical processes. For example, we would like to explain Spike TimingDependent Plasticity (see [5,6] and Figure 3A), in terms of information theory. Starting out, we explore the optimisation of a network sensitivity measure related to maximising the mutual information between input spike timings and output spike timings. Our derivations are analogous to those in ICA, except that the sensitivity of output timings to input timings is maximised, rather than the sensitivity of output ‘firing rates’ to inputs. ICA and related approaches have been successful in explaining the learning of many properties of early visual receptive fields in rate coding models, and we are hoping for similar gains in understanding of spike coding in networks, and how this is supported, in principled probabilistic ways, by cellular biophysical processes. For now, in our initial simulations, we show that our derived rule can learn synaptic weights which can unmix, or demultiplex, mixed spike trains. That is, it can recover independent point processes embedded in distributed correlated input spike trains, using an adaptive single-layer feedforward spiking network. 1 Maximising Sensitivity. In this section, we will follow the structure of the ICA derivation [4] in developing the spiking theory. We cannot claim, as before, that this gives us an information maximisation algorithm, for reasons that we will delay addressing until Section 3. But for now, to first develop our approach, we will explore an interim objective function called sensitivity which we define as the log Jacobian of how input spike timings affect output spike timings. 1.1 How to maximise the effect of one spike timing on another. Consider a spike in neuron j at time tl that has an effect on the timing of another spike in neuron i at time tk . The neurons are connected by a weight wij . We use i and j to index neurons, and k and l to index spikes, but sometimes for convenience we will use spike indices in place of neuron indices. For example, wkl , the weight between an input spike l and an output spike k, is naturally understood to be just the corresponding wij . dtk dtl threshold potential du u(t) R(t) resting potential tk output spikes tl input spikes Figure 1: Firing time tk is determined by the time of threshold crossing. A change of an input spike time dtl affects, via a change of the membrane potential du the time of the output spike by dtk . In the simplest version of the Spike Response Model [7], spike l has an effect on spike k that depends on the time-course of the evoked EPSP or IPSP, which we write as R kl (tk − tl ). In general, this Rkl models both synaptic and dendritic linear responses to an input spike, and thus models synapse type and location. For learning, we need only consider the value of this function when an output spike, k, occurs. In this model, depicted in Figure 1, a neuron adds up its spiking inputs until its membrane potential, ui (t), reaches threshold at time tk . This threshold we will often, again for convenience, write as uk ≡ ui (tk , {tl }), and it is given by a sum over spikes l: uk = wkl Rkl (tk − tl ) . (1) l To maximise timing sensitivity, we need to determine the effect of a small change in the input firing time tl on the output firing time tk . (A related problem is tackled in [2].) 
When tl is changed by a small amount dtl the membrane potential will change as a result. This change in the membrane potential leads to a change in the time of threshold crossing dt k . The contribution to the membrane potential, du, due to dtl is (∂uk /∂tl )dtl , and the change in du corresponding to a change dtk is (∂uk /∂tk )dtk . We can relate these two effects by noting that the total change of the membrane potential du has to vanish because u k is defined as the potential at threshold. ie: du = ∂uk ∂uk dtk + dtl = 0 . ∂tk ∂tl (2) This is the total differential of the function uk = u(tk , {tl }), and is a special case of the implicit function theorem. Rearranging this: dtk ∂uk =− dtl ∂tl ∂uk ˙ = −wkl Rkl /uk . ˙ ∂tk (3) Now, to connect with the standard ICA derivation [4], recall the ‘rate’ (or sigmoidal) neuron, for which yi = gi (ui ) and ui = j wij xj . For this neuron, the output dependence on input is ∂yi /∂xj = wij gi while the learning gradient is: ∂yi ∂ 1 log − fi (ui )xj = ∂wij ∂xj wij (4) where the ‘score functions’, fi , are defined in terms of a density estimate on the summed ∂ ∂ inputs: fi (ui ) = ∂ui log gi = ∂ui log p(ui ). ˆ The analogous learning gradient for the spiking case, from (3), is: ˙ j(a)Rka ∂ dtk 1 log − a . = ∂wij dtl wij uk ˙ (5) where j(a) = 1 if spike a came from neuron j, and 0 otherwise. Comparing the two cases in (4) and (5), we see that the input variable xj has become the temporal derivative of the sum of the EPSPs coming from synapse j, and the output variable (or score function) fi (ui ) has become u−1 , the inverse of the temporal derivative ˙k of the membrane potential at threshold. It is intriguing (A) to see this quantity appear as analogous to the score function in the ICA likelihood model, and, (B) to speculate that experiments could show that this‘ voltage slope at threshold’ is a hidden factor in STDP data, explaining some of the scatter in Figure 3A. In other words, an STDP datapoint should lie on a 2-surface in a 3D space of {∆w, ∆t, uk }. Incidentally, uk shows up in any ˙ ˙ learning rule optimising an objective function involving output spike timings. 1.2 How to maximise the effect of N spike timings on N other ones. Now we deal with the case of a ‘square’ single-layer feedforward mapping between spike timings. There can be several input and output neurons, but here we ignore which neurons are spiking, and just look at how the input timings affect the output timings. This is captured in a Jacobian matrix of all timing dependencies we call T. The entries of this matrix are Tkl ≡ ∂tk /∂tl . A multivariate version of the sensitivity measure introduced in the previous section is the log of the absolute determinant of the timing matrix, ie: log |T|. The full derivation for the gradient W log |T| is in the Appendix. Here, we again draw out the analogy between Square ICA [4] and this gradient, as follows. Square ICA with a network y = g(Wx) is: ∆W ∝ W log |J| = W−1 − f (u)xT (6) where the Jacobian J has entries ∂yi /∂xj and the score functions are now, fi (u) = ∂ − ∂ui log p(u) for the general likelihood case, with p(u) = i gi being the special case of ˆ ˆ ICA. We will now split the gradient in (6) according to the chain rule: W log |J| = [ J log |J|] ⊗ [ W J] j(l) − fk (u)xj wkl J−T ⊗ Jkl i(k) = (7) . (8) In this equation, i(k) = δik and j(l) = δjl . The righthand term is a 4-tensor with entries ∂Jkl /∂wij , and ⊗ is defined as A ⊗ Bij = kl Akl Bklij . 
We write the gradient this way to preserve, in the second term, the independent structure of the 1 → 1 gradient term in (4), and to separate a difficult derivation into two easy parts. The structure of (8) holds up when we move to the spiking case, giving:

\nabla_W \log |T| = [\nabla_T \log |T|] \otimes [\nabla_W T]   (9)
                 = T^{-T} \otimes \left[ T_{kl}\, i(k) \left( \frac{j(l)}{w_{kl}} - \frac{\sum_a j(a)\, \dot{R}_{ka}}{\dot{u}_k} \right) \right]   (10)

where i(k) is now defined as being 1 if spike k occurred in neuron i, and 0 otherwise; j(l) and j(a) are analogously defined. Because the T matrix is much bigger than the J matrix, and because its entries are more complex, here the similarity ends. When (10) is evaluated for a single weight influencing a single spike coupling (see the Appendix for the full derivation), it yields:

\Delta w_{kl} \propto \frac{\partial \log |T|}{\partial w_{kl}} = \frac{T_{kl}}{w_{kl}} \left( [T^{-1}]_{lk} - 1 \right) .   (11)

This is a non-local update involving a matrix inverse at each step. In the ICA case of (6), such an inverse was removed by the Natural Gradient transform (see [1]), but in the spike-timing case this has turned out not to be possible, because of the additional asymmetry introduced into the T matrix (as opposed to the J matrix) by the \dot{R}_{kl} term in (3).

2 Results.

Nonetheless, this learning rule can be simulated. It requires running the network for a while to generate spikes (and a corresponding T matrix), and then, for each input/output spike coupling, the corresponding synapse is updated according to (11). When this is done and the weights learn, it is clear that something has been sacrificed by ignoring the issue of which neurons are producing the spikes. Specifically, the network will often put all the output spikes on one output neuron, with the rates of the others falling to zero. It is happy to do this if a large \log |T| can thereby be achieved, because we have not included this 'which neuron' information in the objective. We will address these and other problems in Section 3, but now we report on our simulation results on demultiplexing.

2.1 Demultiplexing spike trains.

An interesting possibility in the brain is that 'patterns' are embedded in spatially distributed spike timings that are input to neurons. Several patterns could be embedded in single input trains. This is called multiplexing. To extract and propagate these patterns, the neurons must demultiplex these inputs using their threshold nonlinearity. Demultiplexing is the 'point process' analog of the unmixing of independent inputs in ICA. We have been able to robustly achieve demultiplexing, as we now report.

We simulated a feedforward network with 3 integrate-and-fire neurons and inputs from 3 presynaptic neurons. Learning followed (11), where we replace the inverse by the pseudo-inverse computed on the spikes generated during 0.5 s. The pseudo-inverse is necessary because, even though on average the learning matches the number of output spikes to the number of input spikes, the matrix T is still not usually square and so its actual inverse cannot be taken. In addition, in these simulations, an additional term is introduced in the learning to make sure that all the output neurons fire with equal probability. This partially counters the ignoring of the 'which neuron' information, which we explained above. Assuming a Poisson spike count n_i for the i-th output neuron with equal firing rates, it is easy to derive an approximate term that will control the spike counts, \sum_i (\bar{n}_i - n_i). The target firing rates \bar{n}_i were set to match the "source" spike train in this example. The network learns to demultiplex mixed spike trains, as shown in Figure 2.
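As a concrete illustration of how rule (11), the pseudo-inverse, and the spike-count-balancing term might be wired together in such a simulation, here is a sketch of one batch update. This is our own sketch, not the authors' code: the bookkeeping of which neuron produced which spike, the additive form of the balancing term, and the gains eta and gamma are all assumptions.

```python
import numpy as np

def update_weights(W, T, couplings, counts, target_counts, eta=0.01, gamma=0.1):
    """One batch update of rule (11), with the matrix inverse replaced by the
    pseudo-inverse of the (generally non-square) timing matrix T, plus an
    assumed additive term pushing output spike counts toward their targets.
    couplings: tuples (k, l, i, j) meaning output spike k (fired by neuron i)
    is coupled to input spike l (from presynaptic neuron j)."""
    T_pinv = np.linalg.pinv(T)          # T is K x L spikes, usually not square
    dW = np.zeros_like(W)
    for (k, l, i, j) in couplings:
        dW[i, j] += (T[k, l] / W[i, j]) * (T_pinv[l, k] - 1.0)   # Eq. (11)
    dW += gamma * (target_counts - counts)[:, None]              # spike-count control
    return W + eta * dW

# Toy example: 4 output spikes, 5 input spikes, 3 output and 3 input neurons.
rng = np.random.default_rng(0)
W = rng.uniform(0.2, 1.0, size=(3, 3))
T = 0.1 * rng.normal(size=(4, 5))
out_neuron, in_neuron = [0, 1, 2, 0], [0, 1, 2, 0, 1]
couplings = [(k, l, out_neuron[k], in_neuron[l]) for k in range(4) for l in range(5)]
counts, target_counts = np.array([2.0, 1.0, 1.0]), np.array([4/3, 4/3, 4/3])
W = update_weights(W, T, couplings, counts, target_counts)
```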
This demultiplexing is a robust property of learning using (11) with this new spike-controlling term.

Figure 2: Unmixed spike trains. The input (top left) are 3 spike trains which are a mixture of three independent Poisson processes (bottom left). The network unmixes the spike trains to approximately recover the original (center left). In this example, 19 spikes correspond to the original, with 4 deletions and 2 insertions. The two panels at the right show the mixing (top) and the synaptic weight matrix after training (bottom).

Finally, what about the spike-timing dependence of the observed learning? Does it match experimental results? The comparison is made in Figure 3, and the answer is no. There is a timing-dependent transition between depression and potentiation in our result in Figure 3B, but it is not a sharp transition like the experimental result in Figure 3A. In addition, it does not transition at zero (i.e. when t_k - t_l = 0), but at a time offset by the rise time of the EPSPs. In earlier experiments, in which we transformed the gradient in (11) by an approximate inverse Hessian to get an approximate Natural Gradient method, a sharp transition did emerge in simulations. However, the approximate inverse Hessian was singular, and we had to de-emphasise this result. It does suggest, however, that if the Natural Gradient transform can be usefully done on some variant of this learning rule, it may well be what accounts for the sharp transition effect of STDP.

Figure 3: Dependence of synaptic modification on pre/post inter-spike interval. Left (A): From Froemke & Dan, Nature (2002) [6]: dependence of synaptic modification on pre/post inter-spike interval in cat L2/3 visual cortical pyramidal cells in slice, under naturalistic spike trains. Each point represents one experiment. Right (B): According to Equation (11). Each point corresponds to a spike pair among approximately 100 input and 100 output spikes.

3 Discussion

Although these derivations started out smoothly, the reader possibly shares the authors' frustration at the approximations involved here. Why isn't this simple, like ICA? Why don't we just have a nice maximum spikelihood model, i.e. a density-estimation algorithm for multivariate point processes, as ICA was a model in continuous space? We are going to be explicit about the problems now, and will propose a direction where the solution may lie. The over-riding problem is that we are unable to claim that in maximising \log |T| we are maximising the mutual information between inputs and outputs, because:

1. The Invertibility Problem. Algorithms such as ICA which maximise log Jacobians can only be called Infomax algorithms if the network transformation is both deterministic and invertible. The Spike Response Model is deterministic, but it is not invertible in general. When it is not invertible, the key formula (considering here vectors of input and output timings, t_in and t_out) is transformed from simple to complex, i.e.:

p(t_{out}) = \frac{p(t_{in})}{|T|}   becomes   p(t_{out}) = \int_{\text{solns } t_{in}} \frac{p(t_{in})}{|T|}\, dt_{in} .   (12)

Thus, when not invertible, we need to know the Jacobians of all the inputs that could have caused an output (called here 'solns'), something we simply don't know.

2. The 'Which Neuron' Problem. Instead of maximising the mutual information I(t_out, t_in), we should be maximising I(ti_out, ti_in), where the vector ti is the timing vector, t, with the vector, i, of corresponding neuron indices, concatenated. Thus, 'who spiked?' should be included in the analysis, as it is part of the information.

3. The Predictive Information Problem. In ICA, since there was no time involved, we did not have to worry about mutual information over time between inputs and outputs. But in the spiking model, output spikes may well have (predictive) mutual information with future input spikes, as well as the usual (causal) mutual information with past input spikes. The former has been entirely missing from our analysis so far.

These temporal and spatial information dependencies missing from our analysis so far are thrown into a different light by a single empirical observation, which is that Spike Timing-Dependent Plasticity is not just a feedforward computation like the Spike Response Model. Specifically, there must be at least a statistical, if not a causal, relation between a real synapse's plasticity and its neuron's output spike timings, for Figure 3B to look like it does. It seems we have to confront the need for both a 'memory' (or reconstruction) model, such as the T we have thus far dealt with, in which output spikes talk about past inputs, and a 'prediction' model, in which they talk about future inputs.

This is most easily understood from the point of view of Barber & Agakov's variational Infomax algorithm [3]. They argue for optimising a lower bound on the mutual information, which, for our neurons, would be expressed using an inverse model \hat{p}, as follows:

\tilde{I}(ti_{in}, ti_{out}) = H(ti_{in}) - \left\langle -\log \hat{p}(ti_{in} \mid ti_{out}) \right\rangle_{p(ti_{in}, ti_{out})} \le I(ti_{in}, ti_{out})   (13)

In a feedforward model, H(ti_in) may be disregarded in taking gradients, leading us to the optimisation of a 'memory-prediction' model \hat{p}(ti_in | ti_out), related to something supposedly happening in dendrites, somas and at synapses. In trying to guess what this might be, it would be nice if the math worked out. We need a square Jacobian matrix, T, so that |T| = \hat{p}(ti_in | ti_out) can be our memory/prediction model. Now let's rename our feedforward timing Jacobian T ('up the dendritic trees') as \overrightarrow{T}, and let's fantasise that there is some, as yet unspecified, feedback Jacobian \overleftarrow{T} ('down the dendritic trees'), which covers electrotonic influences as they spread from soma to synapse, and which \overrightarrow{T} can be combined with by some operation '\otimes' to make things square. Imagine further that doing this yields a memory/prediction model on the inputs. Then the T we are looking for is \overrightarrow{T} \otimes \overleftarrow{T}, and the memory-prediction model is: \hat{p}(ti_in | ti_out) = |\overrightarrow{T} \otimes \overleftarrow{T}|.

Ideally, the entries of \overrightarrow{T} should be as before, i.e. \overrightarrow{T}_{kl} = \partial t_k / \partial t_l. What should the entries of \overleftarrow{T} be? Becoming just one step more concrete, suppose \overleftarrow{T} had entries \overleftarrow{T}_{lk} = \partial c_l / \partial t_k, where c_l is some, as yet unspecified, value, or process, occurring at an input synapse when spike l comes in. What seems clear is that \otimes should combine the correctly tensorised forms of \overrightarrow{T} and \overleftarrow{T} (giving them each 4 indices ijkl), so that T = \overrightarrow{T} \otimes \overleftarrow{T} sums over the spikes k and l to give an I × J matrix, where I is the number of output neurons and J the number of input neurons.
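The bound in (13) can be sanity-checked numerically. The toy below uses ordinary discrete variables and made-up probabilities rather than spike-timing vectors, purely to illustrate the Barber & Agakov inequality: for any conditional model \hat{p}, H(X) plus the expected log of \hat{p}(X|Y) never exceeds I(X;Y), with equality when \hat{p} is the true posterior.

```python
import numpy as np

# Made-up joint distribution p(x, y) over small discrete alphabets.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_x = -(p_x * np.log(p_x)).sum()
I_xy = (p_xy * np.log(p_xy / np.outer(p_x, p_y))).sum()

def lower_bound(q_x_given_y):
    """H(X) + <log q(X|Y)>, the variational bound corresponding to Eq. (13)."""
    return H_x + (p_xy * np.log(q_x_given_y)).sum()

q_true = p_xy / p_y                       # the true posterior p(x|y): bound is tight
q_crude = np.full_like(p_xy, 1.0 / 3.0)   # a deliberately poor inverse model
print(I_xy, lower_bound(q_true), lower_bound(q_crude))
```

Running this prints the mutual information, a bound that matches it exactly, and a visibly smaller bound for the poor model, which is the property the memory-prediction proposal relies on.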
Then our quantity, T, would represent all dependencies of input neuronal activity on output activity, summed over spikes.

Further, we imagine that \overleftarrow{T} contains reverse (feedback) electrotonic transforms from soma to synapse, \overleftarrow{R}_{lk}, that are somehow symmetrically related to the feedforward spike responses from synapse to soma, which we now rename \overrightarrow{R}_{kl}. Thinking for a moment in terms of somatic k and synaptic l, voltages V, currents I and linear cable theory, the synapse-to-soma transform \overrightarrow{R}_{kl} would be related to an impedance in V_k = I_l \overrightarrow{Z}_{kl}, while the soma-to-synapse transform \overleftarrow{R}_{lk} would be related to an admittance in I_l = V_k \overleftarrow{Y}_{lk} [8]. The symmetry in these equations is that \overrightarrow{Z}_{kl} is just the inverse conjugate of \overleftarrow{Y}_{lk}.

Finally, then, what is c_l? And what is its relation to the calcium concentration, [Ca^{2+}]_l, at a synapse when spike l comes in? These questions naturally follow from considering the experimental data, since it is known that the calcium level at synapses is the critical integrating factor in determining whether potentiation or depression occurs [5].

4 Appendix: Gradient of log |T| for the full Spike Response Model.

Here we give full details of the gradient for Gerstner's Spike Response Model [7]. This is a general model for which Integrate-and-Fire is a special case. In this model, the effect of a presynaptic spike at time t_l on the membrane potential at time t is described by a postsynaptic potential or spike response, which may also depend on the time that has passed since the last output spike t_{k-1}; hence the spike response is written as R(t - t_{k-1}, t - t_l). This response is weighted by the synaptic strength w_l. Excitatory or inhibitory synapses are determined by the sign of w_l. Refractoriness is incorporated by adding a hyperpolarizing contribution (spike-afterpotential) to the membrane potential in response to the last preceding spike, \eta(t - t_{k-1}). The membrane potential as a function of time is therefore given by

u(t) = \eta(t - t_{k-1}) + \sum_l w_l R(t - t_{k-1}, t - t_l) .   (14)

We have ignored here potential contributions from external currents, which can easily be included without modifying the following derivations. The output firing times t_k are defined as the times for which u(t) reaches the firing threshold from below. We consider a dynamic threshold, \vartheta(t - t_{k-1}), which may depend on the time since the last spike t_{k-1}; the output spike times are then defined implicitly by:

t = t_k :\quad u(t) = \vartheta(t - t_{k-1}) \quad \text{and} \quad \frac{du(t)}{dt} > 0 .   (15)

For this more general model, T_{kl} is given by

T_{kl} = \frac{dt_k}{dt_l} = - \left( \frac{\partial u}{\partial t_k} - \frac{\partial \vartheta}{\partial t_k} \right)^{-1} \frac{\partial u}{\partial t_l} = \frac{w_{kl}\, \dot{R}(t_k - t_{k-1}, t_k - t_l)}{\dot{u}(t_k) - \dot{\vartheta}(t_k - t_{k-1})} ,   (16)

where \dot{R}(s, t), \dot{u}(t), and \dot{\vartheta}(t) are derivatives with respect to t. The dependence of T_{kl} on t_{k-1} should be implicitly assumed; it has been omitted to simplify the notation.

Now we compute the derivative of \log |T| with respect to w_{kl}. For any matrix T we have \partial \log |T| / \partial T_{ab} = [T^{-1}]_{ba}. Therefore:

\frac{\partial \log |T|}{\partial w_{kl}} = \sum_{ab} \frac{\partial \log |T|}{\partial T_{ab}} \frac{\partial T_{ab}}{\partial w_{kl}} = \sum_{ab} [T^{-1}]_{ba} \frac{\partial T_{ab}}{\partial w_{kl}} .   (17)

Utilising the Kronecker delta \delta_{ab} = (1 if a = b, else 0), the derivative of (16) with respect to w_{kl} gives:

\frac{\partial T_{ab}}{\partial w_{kl}} = \frac{\partial}{\partial w_{kl}} \left[ \frac{w_{ab}\, \dot{R}(t_a - t_{a-1}, t_a - t_b)}{\dot{\eta}(t_a - t_{a-1}) + \sum_c w_{ac} \dot{R}(t_a - t_{a-1}, t_a - t_c) - \dot{\vartheta}(t_a - t_{a-1})} \right]
 = \delta_{ak}\delta_{bl}\, \frac{\dot{R}(t_a - t_{a-1}, t_a - t_b)}{\dot{u}(t_a) - \dot{\vartheta}(t_a - t_{a-1})} - \delta_{ak}\, \frac{w_{ab}\, \dot{R}(t_a - t_{a-1}, t_a - t_b)\, \dot{R}(t_a - t_{a-1}, t_a - t_l)}{\left( \dot{u}(t_a) - \dot{\vartheta}(t_a - t_{a-1}) \right)^2}
 = \delta_{ak}\, T_{ab} \left( \frac{\delta_{bl}}{w_{ab}} - \frac{T_{al}}{w_{al}} \right) .   (18)

Therefore:

\frac{\partial \log |T|}{\partial w_{kl}} = \sum_{ab} [T^{-1}]_{ba}\, \delta_{ak}\, T_{ab} \left( \frac{\delta_{bl}}{w_{ab}} - \frac{T_{al}}{w_{al}} \right)
 = \frac{T_{kl}}{w_{kl}} [T^{-1}]_{lk} - \frac{T_{kl}}{w_{kl}} \sum_b [T^{-1}]_{bk} T_{kb}   (19)
 = \frac{T_{kl}}{w_{kl}} \left( [T^{-1}]_{lk} - 1 \right) .   (20)

Acknowledgments

We are grateful for inspirational discussions with Nihat Ay, Michael Eisele, Hong Hui Yu, Jim Crutchfield, Jeff Beck, Surya Ganguli, Sophie Denève, David Barber, Fabian Theis, Tony Zador and Arunava Banerjee. AJB thanks all RNI colleagues for many such discussions.

References

[1] Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Computation, 10, 251-276.
[2] Banerjee, A. 2001. On the phase-space dynamics of systems of spiking neurons. Neural Computation, 13, 161-225.
[3] Barber, D. & Agakov, F. 2003. The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems 16, MIT Press.
[4] Bell, A.J. & Sejnowski, T.J. 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.
[5] Dan, Y. & Poo, M.-m. 2004. Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23-30.
[6] Froemke, R.C. & Dan, Y. 2002. Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433-438.
[7] Gerstner, W. & Kistler, W.M. 2002. Spiking Neuron Models. Cambridge University Press.
[8] Zador, A.M., Agmon-Snir, H. & Segev, I. 1995. The morphoelectrotonic transform: a graphical approach to dendritic function. Journal of Neuroscience, 15(3), 1669-1682.

3 0.70842612 57 nips-2004-Economic Properties of Social Networks

Author: Sham M. Kakade, Michael Kearns, Luis E. Ortiz, Robin Pemantle, Siddharth Suri

Abstract: We examine the marriage of recent probabilistic generative models for social networks with classical frameworks from mathematical economics. We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. Our findings are a mixture of formal analysis, simulation, and experiments on an international trade data set from the United Nations. 1

4 0.50206202 176 nips-2004-Sub-Microwatt Analog VLSI Support Vector Machine for Pattern Classification and Sequence Estimation

Author: Shantanu Chakrabartty, Gert Cauwenberghs

Abstract: An analog system-on-chip for kernel-based pattern classification and sequence estimation is presented. State transition probabilities conditioned on input data are generated by an integrated support vector machine. Dot product based kernels and support vector coefficients are implemented in analog programmable floating gate translinear circuits, and probabilities are propagated and normalized using sub-threshold current-mode circuits. A 14-input, 24-state, and 720-support vector forward decoding kernel machine is integrated on a 3mm×3mm chip in 0.5µm CMOS technology. Experiments with the processor trained for speaker verification and phoneme sequence estimation demonstrate real-time recognition accuracy at par with floating-point software, at sub-microwatt power. 1

5 0.4785895 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification

Author: Tong Zhang

Abstract: We consider the problem of deriving class-size independent generalization bounds for some regularized discriminative multi-category classification methods. In particular, we obtain an expected generalization bound for a standard formulation of multi-category support vector machines. Based on the theoretical result, we argue that the formulation over-penalizes misclassification error, which in theory may lead to poor generalization performance. A remedy, based on a generalization of multi-category logistic regression (conditional maximum entropy), is then proposed, and its theoretical properties are examined. 1

6 0.47522646 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

7 0.47380611 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech

8 0.47029531 138 nips-2004-Online Bounds for Bayesian Algorithms

9 0.46238419 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

10 0.46158057 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

11 0.45411924 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

12 0.44265059 28 nips-2004-Bayesian inference in spiking neurons

13 0.44043297 131 nips-2004-Non-Local Manifold Tangent Learning

14 0.43431187 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

15 0.43266785 163 nips-2004-Semi-parametric Exponential Family PCA

16 0.43184778 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

17 0.4314777 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

18 0.43082061 116 nips-2004-Message Errors in Belief Propagation

19 0.43065354 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

20 0.43064398 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems