nips nips2002 nips2002-91 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: This paper introduces the Field-Programmable Learning Array, a new paradigm for rapid prototyping of learning primitives and machine-learning algorithms in silicon. The FPLA is a mixed-signal counterpart to the all-digital Field-Programmable Gate Array in that it enables rapid prototyping of algorithms in hardware. Unlike the FPGA, the FPLA is targeted directly for machine learning by providing local, parallel, online analog learning using floating-gate MOS synapse transistors. We present a prototype FPLA chip comprising an array of reconfigurable computational blocks and local interconnect. We demonstrate the viability of this architecture by mapping several learning circuits onto the prototype chip.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper introduces the Field-Programmable Learning Array, a new paradigm for rapid prototyping of learning primitives and machine-learning algorithms in silicon. [sent-3, score-0.328]
2 The FPLA is a mixed-signal counterpart to the all-digital Field-Programmable Gate Array in that it enables rapid prototyping of algorithms in hardware. [sent-4, score-0.178]
3 Unlike the FPGA, the FPLA is targeted directly for machine learning by providing local, parallel, online analog learning using floating-gate MOS synapse transistors. [sent-5, score-0.143]
4 We present a prototype FPLA chip comprising an array of reconfigurable computational blocks and local interconnect. [sent-6, score-0.381]
5 We demonstrate the viability of this architecture by mapping several learning circuits onto the prototype chip. [sent-7, score-0.24]
6 Such algorithms, when implemented in VLSI, can leverage the inherent parallelism offered by the millions of transistors on a single silicon die. [sent-10, score-0.208]
7 Depending on the design technique, hardware implementations of learning algorithms can realize significant performance increases over standard computers in terms of speed or power consumption. [sent-11, score-0.215]
8 Despite the benefits of implementing machine-learning algorithms in VLSI, several issues have kept hardware implementations from penetrating mainstream machine learning. [sent-12, score-0.161]
9 First, many previous hardware systems were not scalable due to the size of many primary components such as digital multipliers or digital-to-analog converters[2, 3]. [sent-13, score-0.172]
10 Second, many systems such as [4] have inflexible circuit topologies, allowing them to be used for only very specific problems. [sent-14, score-0.263]
11 Third, many hardware learning systems did not comprise a complete solution with on-chip learning [5] and often required external weight updates[3, 6]. [sent-15, score-0.173]
12 In addition to these problems of scalability and inflexibility, perhaps the biggest impediment to implementing learning in VLSI is that designing VLSI chips is a time-consuming and error-prone process. [sent-16, score-0.087]
13 To date, VLSI learning implementations have required detailed knowledge of analog and digital circuit design. [sent-17, score-0.479]
14 This prerequisite knowledge impedes hardware development by novices; indeed, the design process can challenge even the most experienced circuit designer. [sent-18, score-0.51]
15 Because we make extensive use of floating-gate synapse transistors [1] in our learning circuits to enable local adaptation, the design process becomes even more difficult due to slow and inaccurate simulation of these devices. [sent-19, score-0.343]
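Because SPICE-level models of synapse transistors simulate slowly and inaccurately, a behavioral abstraction is often enough for algorithm-level exploration before committing a design to silicon. The Python sketch below is one such abstraction, offered only as a hedged illustration: the exponential voltage dependence is qualitative and the constants are invented, not parameters of the devices in [1].

```python
import math

class FloatingGateSynapse:
    """Behavioral stand-in for a floating-gate synapse: a nonvolatile weight
    nudged by injection and tunneling events (not a device-physics model)."""

    def __init__(self, w0=0.0):
        self.w = w0  # abstracts the charge stored on the floating gate

    def adapt(self, v_inj=0.0, v_tun=0.0, k_inj=1e-3, k_tun=1e-3):
        # Tunneling and injection pull the stored weight in opposite directions;
        # both grow roughly exponentially with their control voltages.
        self.w += k_tun * math.exp(v_tun) - k_inj * math.exp(v_inj)
        return self.w

syn = FloatingGateSynapse()
syn.adapt(v_tun=1.0)  # a tunneling event nudges the weight upward in this sign convention
```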
16 A reconfigurable learning system would solve these problems by allowing rapid prototyping and flexibility in learning system hardware. [sent-20, score-0.194]
17 When combined with a simple user interface, a reconfigurable learning system can enable anyone with a machine-learning background to express his/her ideas in hardware. [sent-24, score-0.104]
18 In this paper, we propose a mixed analog-digital Field-Programmable Learning Array (FPLA), a reconfigurable system for rapid prototyping of machine-learning algorithms in hardware. [sent-25, score-0.183]
19 The FPLA enables the design cycle shown in Figure 1(a) in which the designer expresses a machine-learning problem as an algorithm, compiles that representation into an FPLA configuration, and prototypes the algorithm in an FPLA. [sent-26, score-0.111]
20 The FPLA is similar in concept to all-digital Field-Programmable Gate Arrays (FPGA), in that they both enable reconfigurable computation and prototyping using arrays of simple elements and reconfigurable wiring. [sent-27, score-0.205]
21 Unlike previous reconfigurable hardware learning solutions [3, 4, 6, 7], the FPLA is a general-purpose prototyping tool and does not target one specific architecture. [sent-28, score-0.23]
22 Moreover, our FPLA supports on-chip adaptation and enables rapid prototyping of a large class of learning algorithms. [sent-29, score-0.245]
23 We have implemented a prototype core for an FPLA. [sent-30, score-0.152]
24 Our chip comprises a small (2 × 2) array of Programmable Learning Blocks (PLBs) as well as a simple interconnect structure to allow the PLBs to communicate in an all-to-all fashion. [sent-31, score-0.465]
25 Our results show that this prototype system achieves its design goal of enabling rapid prototyping of floating-gate learning circuits by implementing learning circuits known in the literature as well as new circuits prototyped for the first time. [sent-32, score-0.719]
26 Section 3 shows results from our test chip of the prototype design. [sent-35, score-0.215]
27 The first two properties are dimensions of the FPLA design space, where tradeoffs between them result in varying levels of flexibility and functionality at the cost of area and power. [sent-39, score-0.089]
28 Likewise, to develop new learning algorithms in silicon, the PLBs should provide lower-level functions such as current mirrors, differential pairs, and current sources. [sent-42, score-0.12]
29 The design compiler transforms this code into an FPLA configuration, which is then downloaded to the chip. [sent-47, score-0.101]
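As a rough illustration of this flow, the sketch below treats an FPLA configuration as a set of pass-gate bits and "compiles" a tiny netlist of primitive-to-line connections into those bits. The primitive, terminal, and line names are hypothetical; the paper does not specify the compiler's input format.

```python
# Hypothetical netlist: each entry ties a primitive terminal inside a PLB to
# one of that PLB's uncommitted lines.
netlist = [
    ("plb0", "fg_synapse0.drain", "line0"),
    ("plb0", "current_mirror.in", "line0"),
    ("plb0", "diff_pair.out+", "line1"),
]

def compile_to_config(netlist):
    """Map netlist connections to the SRAM switch bits that must be set."""
    config = {}  # (plb, terminal, line) -> True means "close this pass gate"
    for plb, terminal, line in netlist:
        config[(plb, terminal, line)] = True
    return config

bitstream = compile_to_config(netlist)  # then shifted into the chip's SRAM
```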
30 The architecture comprises an array of Programmable Learning Blocks (PLBs), a flexible interconnect, and support circuitry on the periphery. [sent-50, score-0.221]
31 Local interconnect enables efficient, low-cost communication between adjacent PLBs. [sent-51, score-0.216]
32 Global interconnect enables distant PLBs to communicate, albeit at a higher cost. [sent-52, score-0.216]
33 The global interconnect must be sparse because of area constraints in VLSI chips, but flexible enough to allow a wide range of PLB connectivity. [sent-54, score-0.192]
34 Local connectivity is critical to enable the creation of complex learning primitives from combinations of PLBs and the implementation of large classes of machine-learning algorithms that exhibit strong local computation. [sent-55, score-0.205]
35 The adaptive properties of floating-gate transistors can overcome these intrinsic accuracy limitations [9], thereby enabling mixed analog-digital computation to obtain the best combination of power, area, scalability, and performance. [sent-58, score-0.123]
36 A user interface for an FPLA comprises two different components: a design compilation and configuration tool, and a chip interface that provides both digital and analog I/O. [sent-59, score-0.547]
37 An FPLA design compiler allows a user to compile an abstract expression of an algorithm into an FPLA configuration. [sent-60, score-0.135]
38 The chip interface provides digital I/O to interface with standard computers and surrounding digital circuitry, as well as analog I/O to interface with signals from sensors such as vision chips and implantable devices. [sent-63, score-0.488]
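A host-side harness for such an interface might look like the sketch below; the Fpla class and its methods are assumptions made for illustration, not the prototype's actual programming interface.

```python
class Fpla:
    """Assumed host-side wrapper around the chip's digital and analog I/O."""

    def __init__(self):
        self.switch_bits = {}
        self.analog_pads = {}

    def program(self, config):
        # In hardware this would shift the SRAM switch bits in through the
        # digital programming port.
        self.switch_bits = dict(config)

    def drive(self, pad, volts):
        # Stand-in for setting a voltage source or DAC on an analog pad.
        self.analog_pads[pad] = volts

    def read(self, pad):
        # Stand-in for sampling an analog pad; a real harness would measure
        # the chip's response here instead of echoing the driven value.
        return self.analog_pads.get(pad, 0.0)

chip = Fpla()
chip.program({("plb0", "diff_pair.out+", "line1"): True})
chip.drive("pad2", 0.5)
print(chip.read("pad2"))
```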
39 2 Prototype Chip As a first step in designing an FPLA, we built a prototype focusing on the PLB design and local interconnect. [sent-65, score-0.162]
40 Our design comprises a 2 × 2 array of PLBs interconnected in an all-to-all fashion. [sent-66, score-0.198]
41 The system I/O comprises digital input for programming and bidirectional analog input/output for system operation. [sent-67, score-0.234]
42 We show the prototype FPLA architecture and chip micrograph in Figure 2. [sent-68, score-0.26]
43 The FPLA included two pFET PLBs and two nFET PLBs, each containing 8 uncommitted lines, 4 I/O blocks, and the computational primitives described below. [sent-71, score-0.154]
44 Our prototype FPLA comprises 4 PLBs that contain simple analog functional primitives. [sent-73, score-0.218]
45 A set of interconnect switches connects the PLBs in an all-to-all fashion. [sent-74, score-0.221]
46 The chip photograph shows the four PLBs, inter-PLB blocks, and programming circuitry. [sent-76, score-0.16]
47 Each of the four PLBs comprises computational circuitry and a large switching matrix built of pass-gates controlled by SRAM. [sent-81, score-0.105]
48 The computational primitives that compose the PLBs are two floating-gate transistors, a differential pair, a current mirror, a diode-connected transistor, a bias current source, three transistors with configurable length and width, and two configurable capacitors. [sent-83, score-0.304]
49 These circuit primitives can be wired into arbitrary configurations simply by changing the state of the PLB switch matrix. [sent-84, score-0.376]
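A software model of a single PLB can mirror this organization: a fixed inventory of primitives plus SRAM-controlled pass gates joining primitive terminals to uncommitted lines. The sketch below is illustrative only; the terminal names are assumptions and the inventory is a subset of the primitives listed above.

```python
PLB_PRIMITIVES = {                # assumed terminal names
    "fg0": ("gate", "source", "drain"),          # floating-gate transistor
    "fg1": ("gate", "source", "drain"),
    "diff_pair": ("in+", "in-", "out+", "out-"),
    "mirror": ("in", "out"),
    "diode": ("anode", "cathode"),
    "ibias": ("out",),
}
N_LINES = 8  # uncommitted lines per PLB

class Plb:
    def __init__(self):
        # Each entry models one SRAM-controlled pass gate between a primitive
        # terminal and an uncommitted line.
        self.switch = set()

    def connect(self, prim, terminal, line):
        assert terminal in PLB_PRIMITIVES[prim] and 0 <= line < N_LINES
        self.switch.add((prim, terminal, line))

plb = Plb()
plb.connect("fg0", "drain", 0)
plb.connect("mirror", "in", 0)  # the synapse drain and mirror input now share line 0
```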
50 When deciding what functions to place in the PLBs, our starting point was the decomposition of known primitives [10, 11] for silicon learning as well as standard analog primitives such as those in Mead’s book on silicon neural systems [12]. [sent-85, score-0.439]
51 The circuits included in our PLBs are the most common subcircuits found when decomposing these primitives. [sent-86, score-0.1]
52 However, more useful circuits require resources from multiple PLBs. [sent-88, score-0.1]
53 Inter-PLB blocks provide local connectivity between PLBs; each inter-PLB block is an array of SRAM pass-gate switches that can connect an uncommitted line in one PLB to an uncommitted line in another PLB. [sent-89, score-0.257]
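In the same spirit, one inter-PLB block can be modeled as a matrix of pass-gate bits joining the uncommitted lines of two neighboring PLBs; this is an assumption-level abstraction, not the chip's actual switch topology.

```python
class InterPlbBlock:
    """One SRAM pass-gate array joining the uncommitted lines of two PLBs."""

    def __init__(self, n_lines=8):
        self.gate = [[False] * n_lines for _ in range(n_lines)]

    def route(self, line_a, line_b):
        # Close the pass gate joining line `line_a` of PLB A to line `line_b` of PLB B.
        self.gate[line_a][line_b] = True

    def connected(self, line_a, line_b):
        return self.gate[line_a][line_b]

block = InterPlbBlock()
block.route(0, 3)  # e.g. carry a signal from PLB A line 0 onto PLB B line 3
```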
54 To interface with the external world, there are four I/O connections per PLB, each of which can be configured in one of two ways: as a bare connection to the pad for voltage inputs or current outputs, or as a voltage output through a unity-gain buffer. [sent-91, score-0.095]
55 3 Implementing Machine-Learning Primitives To show the correct functionality of our chip, we implemented various circuits from the literature as well as new circuits developed entirely in the FPLA. [sent-93, score-0.232]
56 Figure 3: (a) Schematic of the correlational-learning circuit described by Shon and Hsu in [11]. [sent-98, score-0.263]
57 (b) Schematic of the same circuit as implemented in the FPLA. [sent-99, score-0.295]
58 (c) Experimental results comparing the performance of the custom circuit against the reconfigurable circuit. [sent-100, score-0.314]
59 3.1 Correlational-Learning Primitive As a first test of our chip, we implemented the correlational-learning circuit described by Shon and Hsu in [11]. [sent-104, score-0.295]
60 This circuit learns the conditional probability of one binary event given another. [sent-105, score-0.286]
61 We show the original circuit in Figure 3(a), and the FPLA implementation in Figure 3(b). [sent-106, score-0.263]
62 We implemented this circuit using primitives from two PLBs. [sent-107, score-0.408]
63 Figure 3(c) compares the results from the custom chip to the results from the FPLA. [sent-109, score-0.191]
64 SPICE simulations confirm that the interconnect switches have a negligible effect on circuit performance. [sent-114, score-0.484]
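As a software reference for what this primitive computes, the sketch below estimates the conditional probability of one binary event given another with a running-average update. It models the learning target rather than the circuit itself, and the learning rate is an illustrative stand-in for the injection/tunneling dynamics.

```python
import random

class CorrelationalLearner:
    """Software reference for the learning target of the circuit in [11]:
    an estimate of P(B = 1 | A = 1), updated only on trials where A occurs."""

    def __init__(self, p0=0.5, rate=0.05):
        self.p = p0       # plays the role of the weight stored on the floating gate
        self.rate = rate  # abstracts the adaptation rate (illustrative)

    def update(self, a, b):
        if a:  # adapt only when the conditioning event occurs
            self.p += self.rate * (b - self.p)
        return self.p

learner = CorrelationalLearner()
for _ in range(2000):
    a = random.random() < 0.5
    b = a and (random.random() < 0.8)  # true conditional probability is 0.8
    learner.update(a, b)
print(round(learner.p, 2))  # settles near 0.8
```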
65 3.2 Regression-Learning Primitive The regression-learning circuit described in this section is a new hardware learning primitive first implemented in the FPLA. [sent-116, score-0.473]
66 The circuit performs regression learning on a set of 2-D input data. [sent-117, score-0.306]
67 It comprises two correlational-learning circuits like the one shown in Figure 4(a) to encode a differential weight. [sent-118, score-0.248]
68 Each circuit learns one of the two weight components (the positive and negative parts, respectively), such that their difference gives the signed weight (Equation 2). The circuit operates as follows. [sent-119, score-0.549]
69 We apply a zero-mean input signal, encoded as a varying current plus a DC bias current, to the two inputs of the circuit. [sent-120, score-0.079]
70 The differential output current of each circuit represents the product of its stored weight with the input current. [sent-121, score-0.46]
71 The difference of those output currents (Equations 3 and 4) represents the product of the current input and the total weight stored on the floating gates. [sent-122, score-0.153]
72 This circuit is one-half of the regression-learning circuit and learns the positive weight component. [sent-128, score-0.591]
73 The other half of the circuit is identical but represents the negative weight component. [sent-129, score-0.329]
74 This data is taken from the FPLA configured as the circuit on the left. [sent-132, score-0.263]
75 The circuit was shown 388 data points with a slope of 0. [sent-133, score-0.288]
76 The circuit predicts an output, and the computer running the test compares that predicted output with the target and feeds an error signal back to the chip. [sent-141, score-0.307]
77 Based on the error signal, the circuit adapts the weight. [sent-142, score-0.285]
78 Results from this circuit are shown in Figure 4(b). [sent-145, score-0.263]
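The behavioral sketch below mimics this host-in-the-loop regression loop, assuming an LMS-style update split across the two weight components; the target slope and learning rate are illustrative values, not those used in the on-chip experiment.

```python
import random

class DifferentialWeight:
    """Behavioral sketch of the regression primitive: two stored components
    whose difference is the effective signed weight."""

    def __init__(self):
        self.w_pos, self.w_neg = 0.5, 0.5

    @property
    def w(self):
        return self.w_pos - self.w_neg

    def output(self, x):
        return self.w * x  # the difference of the two output currents

    def adapt(self, x, error, rate=0.05):
        # LMS-style update, split across the components so each only grows,
        # mimicking the error feedback supplied by the host computer.
        step = rate * error * x
        self.w_pos += max(step, 0.0)
        self.w_neg += max(-step, 0.0)

true_slope = 0.6  # illustrative target slope, not the on-chip value
unit = DifferentialWeight()
for _ in range(388):  # same number of presentations as the on-chip experiment
    x = random.uniform(-1.0, 1.0)          # zero-mean input
    unit.adapt(x, true_slope * x - unit.output(x))
print(round(unit.w, 1))  # approaches the target slope
```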
79 3.3 Clustering Primitive We tested a new clustering primitive that is based on the adaptive bump circuit introduced in [10]. [sent-147, score-0.354]
80 The circuit performs two functions: 1) computes the similarity between an input and a stored value, and 2) adapts the stored value to decrease its distance to the input. [sent-148, score-0.402]
81 This adaptive bump circuit exhibits improved adaptation over previous versions [10, 13] due to the inclusion of the autonulling differential pair[14], shown in Figure 5(a) (top). [sent-149, score-0.405]
82 The autonulling differential pair ensures that the adaptation process increases the similarity between the stored mean and the input. [sent-150, score-0.176]
83 The data in Figure 5(b) shows the clustering primitive adapting to an input that is initially distant from the stored value. [sent-151, score-0.148]
84 The result of this adaptation is that over time, the circuit learns to produce a maximal output response at the present input. [sent-152, score-0.355]
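The following sketch captures the clustering primitive's two behaviors in software: a bump-shaped similarity around a stored mean, and adaptation of that mean toward the input. The Gaussian similarity and the constants are convenient stand-ins rather than a model of the circuit's response.

```python
import math

class BumpUnit:
    """Behavioral sketch of the clustering primitive: similarity peaks at a
    stored mean, and adaptation moves that mean toward the input."""

    def __init__(self, mu=0.0, width=0.2, rate=0.05):
        self.mu, self.width, self.rate = mu, width, rate

    def similarity(self, x):
        # A Gaussian stand-in for the circuit's bump-shaped response.
        return math.exp(-((x - self.mu) / self.width) ** 2)

    def adapt(self, x):
        # Decrease the distance between the stored value and the input.
        self.mu += self.rate * (x - self.mu)

unit = BumpUnit(mu=0.0)
for _ in range(100):
    unit.adapt(0.8)  # repeatedly present an input that is initially distant
print(round(unit.similarity(0.8), 2))  # the peak response has moved to the input
```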
85 Instead of waiting several months for chip fabrication, we were able to produce experimental results from a chip in under four hours. [sent-155, score-0.28]
86 Also, the results are a more accurate model of actual circuit behavior than a SPICE simulation. [sent-156, score-0.263]
87 This circuit can: 1) compute the similarity between the stored value and the input, and 2) adapt the stored value to decrease its distance to the input. [sent-159, score-0.379]
88 This plot shows that circuit adaptation moves the circuit’s peak response toward the presented input. [sent-161, score-0.31]
89 4 Future Work The chip that we developed is effective for prototyping single learning primitives, but is too small for solving real machine-learning problems. [sent-163, score-0.279]
90 An FPLA whose target is machine-learning algorithms requires PLBs that comprise higher-level functions, such as the primitives presented in the previous section. [sent-164, score-0.174]
91 First, to reduce the size of the PLBs, we will increase the ratio of computational circuitry to switching circuitry by replacing the low-level functions such as current mirrors and synapse transistors with higher-level primitives such as those mentioned in the previous section. [sent-166, score-0.335]
92 Second, we will increase the number of PLBs in the design, which will require an efficient and scalable global interconnect structure. [sent-167, score-0.225]
93 Finally, we have begun work on the design compiler, a software tool that maps machine-learning algorithms to an FPLA configuration. [sent-170, score-0.106]
94 5 Conclusions Because of the match between the parallelism offered by hardware and the parallelism in machine-learning algorithms, mixed analog-digital VLSI is a promising substrate for machine-learning implementations. [sent-171, score-0.212]
95 To overcome these limitations, we have proposed Field-Programmable Learning Arrays, a viable reconfigurable architecture for prototyping machine-learning algorithms in hardware. [sent-173, score-0.164]
96 FPLAs combine elements of FPGAs, analog VLSI, and on-chip learning to provide a scalable and cost-effective solution for learning in silicon. [sent-174, score-0.154]
97 Our results show that our prototype core and interconnect can effectively implement existing learning primitives and assist in the development of new circuits. [sent-175, score-0.445]
98 Gulak, “A CMOS field programmable analog array,” IEEE Journal of Solid-State Circuits, vol. [sent-217, score-0.129]
99 Paulos, “An analog VLSI neural network with on-chip learning,” IEEE Journal of Solid-State Circuits, vol. [sent-222, score-0.081]
100 Diorio, “A silicon primitive for competitive learning,” in Advances in Neural Information Processing Systems 13 (T. [sent-244, score-0.123]
wordName wordTfidf (topN-words)
[('fpla', 0.52), ('plb', 0.438), ('plbs', 0.342), ('circuit', 0.263), ('interconnect', 0.192), ('vlsi', 0.153), ('recon', 0.141), ('chip', 0.14), ('gurable', 0.134), ('prototyping', 0.119), ('primitives', 0.113), ('nfet', 0.11), ('circuits', 0.1), ('pfet', 0.095), ('hardware', 0.091), ('analog', 0.081), ('hsu', 0.081), ('prototype', 0.075), ('array', 0.071), ('primitive', 0.067), ('diorio', 0.065), ('design', 0.065), ('transistors', 0.064), ('comprises', 0.062), ('stored', 0.058), ('silicon', 0.056), ('shon', 0.055), ('blocks', 0.053), ('custom', 0.051), ('enable', 0.05), ('digital', 0.048), ('figueroa', 0.048), ('programmable', 0.048), ('adaptation', 0.047), ('architecture', 0.045), ('guration', 0.045), ('core', 0.045), ('interface', 0.045), ('differential', 0.044), ('inter', 0.043), ('circuitry', 0.043), ('fpga', 0.041), ('machinelearning', 0.041), ('uncommitted', 0.041), ('con', 0.04), ('implementations', 0.039), ('arrays', 0.036), ('chips', 0.036), ('compiler', 0.036), ('adc', 0.036), ('dac', 0.036), ('parallelism', 0.036), ('rapid', 0.035), ('cmos', 0.035), ('fabricated', 0.035), ('user', 0.034), ('na', 0.033), ('oating', 0.033), ('scalable', 0.033), ('implemented', 0.032), ('implementing', 0.031), ('enabling', 0.03), ('mixed', 0.029), ('exible', 0.029), ('switches', 0.029), ('current', 0.028), ('autonulling', 0.027), ('compilation', 0.027), ('compose', 0.027), ('gured', 0.027), ('spice', 0.027), ('sram', 0.027), ('decoder', 0.026), ('gate', 0.026), ('vb', 0.026), ('slope', 0.025), ('enables', 0.024), ('prototyped', 0.024), ('tradeoffs', 0.024), ('bridges', 0.024), ('bump', 0.024), ('mos', 0.024), ('input', 0.023), ('learns', 0.023), ('synapse', 0.022), ('weight', 0.022), ('local', 0.022), ('tsmc', 0.022), ('designer', 0.022), ('mirrors', 0.022), ('output', 0.022), ('ev', 0.02), ('comprise', 0.02), ('mead', 0.02), ('comprising', 0.02), ('hour', 0.02), ('offered', 0.02), ('gb', 0.02), ('learning', 0.02), ('programming', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 91 nips-2002-Field-Programmable Learning Arrays
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: This paper introduces the Field-Programmable Learning Array, a new paradigm for rapid prototyping of learning primitives and machine-learning algorithms in silicon. The FPLA is a mixed-signal counterpart to the all-digital Field-Programmable Gate Array in that it enables rapid prototyping of algorithms in hardware. Unlike the FPGA, the FPLA is targeted directly for machine learning by providing local, parallel, online analog learning using floating-gate MOS synapse transistors. We present a prototype FPLA chip comprising an array of reconfigurable computational blocks and local interconnect. We demonstrate the viability of this architecture by mapping several learning circuits onto the prototype chip.
2 0.19926825 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians.
3 0.1632921 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip
Author: R. Carmona, F. Jiménez-garrido, R. Dominguez-castro, S. Espejo, A. Rodriguez-vázquez
Abstract: A bio-inspired model for an analog programmable array processor (APAP), based on studies on the vertebrate retina, has permitted the realization of complex programmable spatio-temporal dynamics in VLSI. This model mimics the way in which images are processed in the visual pathway, rendering a feasible alternative for the implementation of early vision applications in standard technologies. A prototype chip has been designed and fabricated in a 0.5µm standard CMOS process. Computing power per area and power consumption is amongst the highest reported for a single chip. Design challenges, trade-offs and some experimental results are presented in this paper.
4 0.13807893 154 nips-2002-Neuromorphic Bisable VLSI Synapses with Spike-Timing-Dependent Plasticity
Author: Giacomo Indiveri
Abstract: We present analog neuromorphic circuits for implementing bistable synapses with spike-timing-dependent plasticity (STDP) properties. In these types of synapses, the short-term dynamics of the synaptic efficacies are governed by the relative timing of the pre- and post-synaptic spikes, while on long time scales the efficacies tend asymptotically to either a potentiated state or to a depressed one. We fabricated a prototype VLSI chip containing a network of integrate and fire neurons interconnected via bistable STDP synapses. Test results from this chip demonstrate the synapse’s STDP learning properties, and its long-term bistable characteristics.
5 0.12581611 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
Author: Shih-Chii Liu, Malte Boegershausen, Pascal Suter
Abstract: We describe a model of short-term synaptic depression that is derived from a silicon circuit implementation. The dynamics of this circuit model are similar to the dynamics of some present theoretical models of shortterm depression except that the recovery dynamics of the variable describing the depression is nonlinear and it also depends on the presynaptic frequency. The equations describing the steady-state and transient responses of this synaptic model fit the experimental results obtained from a fabricated silicon network consisting of leaky integrate-and-fire neurons and different types of synapses. We also show experimental data demonstrating the possible computational roles of depression. One possible role of a depressing synapse is that the input can quickly bring the neuron up to threshold when the membrane potential is close to the resting potential.
6 0.093626708 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain
7 0.093542464 4 nips-2002-A Differential Semantics for Jointree Algorithms
8 0.075400233 200 nips-2002-Topographic Map Formation by Silicon Growth Cones
9 0.065780163 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
10 0.05669013 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
11 0.050752338 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
12 0.044656206 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model
13 0.035593275 18 nips-2002-Adaptation and Unsupervised Learning
14 0.034828544 66 nips-2002-Developing Topography and Ocular Dominance Using Two aVLSI Vision Sensors and a Neurotrophic Model of Plasticity
15 0.032285865 128 nips-2002-Learning a Forward Model of a Reflex
16 0.031461988 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
17 0.030013813 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
18 0.027551023 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
19 0.027492426 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
20 0.027365409 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
topicId topicWeight
[(0, -0.092), (1, 0.072), (2, -0.018), (3, -0.041), (4, 0.037), (5, 0.22), (6, 0.097), (7, -0.024), (8, 0.009), (9, 0.011), (10, 0.085), (11, 0.023), (12, -0.055), (13, 0.021), (14, 0.142), (15, 0.013), (16, 0.092), (17, -0.106), (18, 0.157), (19, -0.169), (20, 0.313), (21, 0.009), (22, -0.086), (23, 0.034), (24, -0.037), (25, 0.007), (26, 0.076), (27, -0.069), (28, 0.052), (29, 0.108), (30, -0.119), (31, -0.035), (32, -0.148), (33, 0.04), (34, 0.008), (35, 0.016), (36, 0.016), (37, -0.117), (38, -0.052), (39, 0.039), (40, 0.036), (41, -0.105), (42, -0.006), (43, -0.036), (44, 0.022), (45, -0.014), (46, 0.018), (47, -0.017), (48, 0.041), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.96793735 91 nips-2002-Field-Programmable Learning Arrays
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: This paper introduces the Field-Programmable Learning Array, a new paradigm for rapid prototyping of learning primitives and machinelearning algorithms in silicon. The FPLA is a mixed-signal counterpart to the all-digital Field-Programmable Gate Array in that it enables rapid prototyping of algorithms in hardware. Unlike the FPGA, the FPLA is targeted directly for machine learning by providing local, parallel, online analog learning using floating-gate MOS synapse transistors. We present a prototype FPLA chip comprising an array of reconfigurable computational blocks and local interconnect. We demonstrate the viability of this architecture by mapping several learning circuits onto the prototype chip.
2 0.85026693 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians. 1 I n trod u cti on Many system-on-a-chip applications, such as data compression and signal processing, use online adaptation to improve or tune performance. These applications can benefit from the low-power compact design that analog VLSI learning systems can offer. Analog VLSI learning systems can benefit immensely from flexible learning algorithms that take advantage of silicon device physics for compact layout, and that are capable of a variety of learning tasks. One learning paradigm that encompasses a wide variety of learning tasks is density estimation, learning the probability distribution over the input data. A silicon density estimator can provide a basic template for VLSI systems for feature extraction, classification, adaptive vector quantization, and more. In this paper, we describe the bump mixture model, a statistical model that describes the probability distribution function of analog variables using low-level transistor equations. We intend the bump mixture model to be the silicon version of mixture of Gaussians [1], one of the most widely used statistical methods for modeling the probability distribution of a collection of data. Mixtures of Gaussians appear in many contexts from radial basis functions [1] to hidden Markov models [2]. In the bump mixture model, probability computations derive from translinear circuits [3] and learning derives from floating-gate device equations [4]. The bump mixture model can perform different functions such as quantization, probability estimation, and classification. In addition this VLSI mixture model can implement multiple learning algorithms using different peripheral circuitry. Because the equations for system operation and learning derive from natural transistor behavior, we can build large bump mixture model with millions of parameters on a single chip. We have fabricated a bump mixture model, and tested it on clustering, classification, and vector quantization of handwritten digits. The results show that the fabricated system performs comparably to mixtures of Gaussians trained with the E-M algorithm [1]. Our work builds upon several trends of research in the VLSI community. The results in this paper are complement recent work on probability propagation in analog VLSI [5-7]. These previous systems, intended for decoding applications in communication systems, model special forms of probability distributions over discrete variables, and do not incorporate learning. In contrast, the bump mixture model performs inference and learning on probability distributions over continuous variables. The bump mixture model significantly extends previous results on floating-gate circuits [4]. 
Our system is a fully realized floating-gate learning algorithm that can be used for vector quantization, probability estimation, clustering, and classification. Finally, the mixture model’s architecture is similar to many previous VLSI vector quantizers [8, 9]. We can view the bump mixture model as a VLSI vector quantizer with well-defined probabilistic semantics. Computations such as probability estimation and maximum-likelihood classification have a natural statistical interpretation under the mixture model. In addition, because we rely on floating-gate devices, the mixture model does not require a refresh mechanism unlike previous learning VLSI quantizers. 2 T h e ad ap ti ve b u mp ci rcu i t The adaptive bump circuit [4], depicted in Fig.1(a-b), forms the basis of the bump mixture model. This circuit is slightly different from previous versions reported in the literature. Nevertheless, the high level functionality remains the same; the adaptive bump circuit computes the similarity between a stored variable and an input, and adapts to increase the similarity between the stored variable and input. Fig.1(a) shows the computation portion of the circuit. The bump circuit takes as input, a differential voltage signal (+Vin, −Vin) around a DC bias, and computes the similarity between Vin and a stored value, µ. We represent the stored memory µ as a voltage: µ= Vw- − Vw+ 2 (1) where Vw+ and Vw− are the gate-offset voltages stored on capacitors C1 and C2. Because C1 and C2 isolate the gates of transistors M1 and M2 respectively, these transistors are floating-gate devices. Consequently, the stored voltages Vw+ and Vw− are nonvolatile. We can express the floating-gate voltages Vfg1 and Vfg2 as Vfg1 =Vin +Vw+ and Vfg2 =Vw− −Vin, and the output of the bump circuit as [10]: I out = Ib cosh 2 ( ( 4κ / SU ) (V t fg 1 − V fg 2 ) ) = Ib cosh ( ( 8κ / SU t )(Vin − µ ) ) 2 (2) where Ib is the bias current, κ is the gate-coupling coefficient, Ut is the thermal voltage, and S depends on the transistor sizes. Fig.1(b) shows Iout for three different stored values of µ. As the data show, different µ’s shift the location of the peak response of the circuit. Vw+ V fg1 V in V fg2 Vb M1 −V in M2 I out Vw− C1 C2 V ca sc V2 V1 Vb V tun M6 V fg1 V2 V1 V in j (a) (b) bump circuit's transfer function for three µ's 10 Iout (nA) µ2 µ1 µ3 6 4 2 0 -0.4 -0.2 V fg2 M3 M4 V inj 8 V tun M5 0 V in (c) 0.2 0.4 Figure 1. (a-b) The adaptive bump circuit. (a) The original bump circuit augmented by capacitors C1 and C2, and cascode transistors (driven by Vcasc). (b) The adaptation subcircuit. M3 and M4 control injection on the floating-gates and M5 and M6 control tunneling. (b) Measured output current of a bump circuit for three programmed memories. Fig.1(b) shows the circuit that implements learning in the adaptive bump circuit. We implement learning through Fowler-Nordheim tunneling [11] on tunneling junctions M5-M6 and hot electron injection [12] on the floating-gate transistors M3-M4. Transistor M3 and M5 control injection and tunneling on M1’s floating-gate. Transistors M4 and M6 control injection and tunneling on M2’s floating-gate. We activate tunneling and injection by a high Vtun and low Vinj respectively. In the adaptive bump circuit, both processes increase the similarity between Vin and µ. In addition, the magnitude of the update does not depend on the sign of (Vin − µ) because the differential input provides common-mode rejection to the input differential pair. 
The similarity function, as seen in Fig.1(b), has a Gaussian-like shape. Consequently, we can equate the output current of the bump circuit with the probability of the input under a distribution parameterized by mean µ: P (Vin | µ ) = I out (3) In addition, increasing the similarity between Vin and µ is equivalent to increasing P(Vin |µ). Consequently, the adaptive bump circuit adapts to maximize the likelihood of the present input under the circuit’s probability distribution. 3 T h e b u mp mi xtu re mod el We now describe the computations and learning rule implemented by the bump mixture model. A mixture model is a general class of statistical models that approximates the probability of an analog input as the weighted sum of probability of the input under several simple distributions. The bump mixture model comprises a set of Gaussian-like probability density functions, each parameterized by a mean vector, µi. Denoting the j th dimension of the mean of the ith density as µij, we express the probability of an input vector x as: P ( x ) = (1/ N ) i P ( x | i ) = (1/ N ) i (∏ P ( x j j | µij ) ) (4) where N is the number of densities in the model and i denotes the ith density. P(x|i) is the product of one-dimensional densities P(xj|µij) that depend on the j th dimension of the ith mean, µij. We derive each one-dimensional probability distribution from the output current of a single bump circuit. The bump mixture model makes two assumptions: (1) the component densities are equally likely, and (2) within each component density, the input dimensions are independent and have equal variance. Despite these restrictions, this mixture model can, in principle, approximate any probability density function [1]. The bump mixture model adapts all µi to maximize the likelihood of the training data. Learning in the bump mixture model is based on the E-M algorithm, the standard algorithm for training Gaussian mixture models. The E-M algorithm comprises two steps. The E-step computes the conditional probability of each density given the input, P(i|x). The M-step updates the parameters of each distribution to increase the likelihood of the data, using P(i|x) to scale the magnitude of each parameter update. In the online setting, the learning rule is: ∆µij = η P (i | x ) ∂ log P ( x j | µij ) ∂µij =η P( x | i) k P( x | k) ∂ log P ( x j | µij ) ∂µij (5) where η is a learning rate and k denotes component densities. Because the adaptive bump circuit already adapts to increase the likelihood of the present input, we approximate E-M by modulating injection and tunneling in the adaptive bump circuit by the conditional probability: ∆µij = η P ( i | x ) f ( x j − µ ij ) (6) where f() is the parameter update implemented by the bump circuit. We can modulate the learning update in (6) with other competitive factors instead of the conditional probability to implement a variety of learning rules such as online K-means. 4 S i l i con i mp l emen tati on We now describe a VLSI system that implements the silicon mixture model. The high level organization of the system detailed in Fig.2, is similar to VLSI vector quantization systems. The heart of the mixture model is a matrix of adaptive bump circuits where the ith row of bump circuits corresponds to the ith component density. In addition, the periphery of the matrix comprises a set of inhibitory circuits for performing probability estimation, inference, quantization, and generating feedback for learning. We send each dimension of an input x down a single column. 
Unity-gain inverting amplifiers (not pictured) at the boundary of the matrix convert each single ended voltage input into a differential signal. Each bump circuit computes a current that represents (P(xj|µij))σ, where σ is the common variance of the one-dimensional densities. The mixture model computes P(x|i) along the ith row and inhibitory circuits perform inference, estimation, or quantization. We utilize translinear devices [3] to perform all of these computations. Translinear devices, such as the subthreshold MOSFET and bipolar transistor, exhibit an exponential relationship between the gate-voltage and source current. This property allows us to establish a power-law relationship between currents and probabilities (i.e. a linear relationship between gate voltages and log-probabilities). x1 x2 xn Vtun,Vinj P(x|µ11) P(x|µ12) Inh() P(x|µ1n) Output P(x|µ1) µ P(x|µ21) P(x|µ22) P(x|µ2n) Inh() P(x|µ2) µ Figure 2. Bump mixture model architecture. The system comprises a matrix of adaptive bump circuits where each row computes the probability P(x|µi). Inhibitory circuits transform the output of each row into system outputs. Spike generators also transform inhibitory circuit outputs into rate-coded feedback for learning. We compute the multiplication of the probabilities in each row of Fig.2 as addition in the log domain using the circuit in Fig.3 (a). This circuit first converts each bump circuit’s current into a voltage using a diode (e.g. M1). M2’s capacitive divider computes Vavg as the average of the scalar log probabilities, logP(xj|µij): Vavg = (σ / N ) j log P ( x j | µ ij ) (7) where σ is the variance, N is the number of input dimensions, and voltages are in units of κ/Ut (Ut is the thermal voltage and κ is the transistor-gate coupling coefficient). Transistors M2- M5 mirror Vavg to the gate of M5. We define the drain voltage of M5 as log P(x|i) (up to an additive constant) and compute: log ( P ( x | i ) ) = (C1 +C2 ) C1 Vavg = (C1 +C2 )σ C1 N j ( ) log P ( x j | µ ij ) + k (8) where k is a constant dependent on Vg (the control gate voltage on M5), and C1 and C2 are capacitances. From eq.8 we can derive the variance as: σ = NC1 / ( C1 + C2 ) (9) The system computes different output functions and feedback signals for learning by operating on the log probabilities of eq.8. Fig.3(b) demonstrates a circuit that computes P(i|x) for each distribution. The circuit is a k-input differential pair where the bias transistor M0 normalizes currents representing the probabilities P(x|i) at the ith leg. Fig.3(c) demonstrates a circuit that computes P(x). The ith transistor exponentiates logP(x|i), and a single wire sums the currents. We can also apply other inhibitory circuits to the log probabilities such as winner-take-all circuits (WTA) [13] and resistive networks [14]. In our fabricated chip, we implemented probability estimation,conditional probability computation, and WTA. The WTA outputs the index of the most likely component distribution for the present input, and can be used to implement vector quantization and to produce feedback for an online K-means learning rule. At each synapse, the system combines a feedback signal, such as the conditional probability P(i|x), computed at the matrix periphery, with the adaptive bump circuit to implement learning. We trigger adaptation at each bump circuit by a rate-coded spike signal generated from the inhibitory circuit’s current outputs. We generate this spike train with a current-to-spike converter based on Lazzaro’s low-powered spiking neuron [15]. 
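The translinear computations of eqs.7-9 and the read-out circuits of Fig.3(b-c) can be summarized in software as below. This is our own illustration rather than a netlist-level model: the capacitance values are placeholders used only to evaluate eq.9, and the normalization mirrors the differential pair's function, not its device equations.

```python
import numpy as np

C1, C2 = 1.0, 6.0                        # placeholder capacitances (arbitrary units)
N = 7                                    # number of input dimensions
sigma = N * C1 / (C1 + C2)               # eq.9: variance set by the capacitive divider

def row_log_prob(log_p_xj):
    """eqs.7-8: Vavg averages the per-dimension log-probabilities; the divider rescales it."""
    v_avg = (sigma / N) * np.sum(log_p_xj)
    return ((C1 + C2) / C1) * v_avg      # equals sum_j log P(xj|muij), up to the constant k

def inference(log_p_rows):
    """Fig.3(b-c): normalize exponentials to get P(i|x); average them to get P(x)."""
    p_rows = np.exp(log_p_rows - log_p_rows.max())        # max subtraction for numerical safety
    p_i_given_x = p_rows / p_rows.sum()
    p_x = np.exp(log_p_rows).sum() / len(log_p_rows)
    return p_i_given_x, p_x

log_p_xj = np.log(np.full(N, 0.5))                        # example per-dimension likelihoods
log_p_rows = np.array([row_log_prob(log_p_xj + d) for d in (-1.0, 0.0, 0.5, -0.2)])
print(inference(log_p_rows))
```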
This rate-coded signal toggles Vtun and Vinj at each bump circuit. Consequently, adaptation is proportional to the frequency of the spike train, which is in turn a linear function of the inhibitory feedback signal. The alternative to the rate code would be to transform the inhibitory circuit’s output directly into analog Vtun and Vinj signals. Because injection and tunneling are highly nonlinear functions of Vinj and Vtun respectively, implementing updates that are linear in the inhibitory feedback signal is quite difficult using this approach.

Figure 3. (a) Circuit for computing log P(x|i). (b) Circuit for computing P(i|x); the current through the ith leg represents P(i|x). (c) Circuit for computing P(x).

5 Experimental results and conclusions

We fabricated an 8 x 8 mixture model (8 probability distribution functions with 8 dimensions each) in a TSMC 0.35µm CMOS process available through MOSIS, and tested the chip on synthetic data and a handwritten digits dataset. In our tests, we found that due to a design error, one of the input dimensions coupled to the other inputs. Consequently, we held that input fixed throughout the tests, effectively reducing the input to 7 dimensions. In addition, we found that the learning rule in eq.6 produced poor performance because the variance of the bump distributions was too large. Consequently, in our learning experiments, we used the hard winner-take-all circuit to control adaptation, resulting in a K-means learning rule.

We trained the chip to perform different tasks on handwritten digits from the MNIST dataset [16]. To prepare the data, we first performed PCA to reduce the 784-pixel images to seven-dimensional vectors, and then sent the data on-chip. We first tested the circuit on clustering handwritten digits. We trained the chip on 1000 examples of each of the digits 1-8. Fig.4(a) shows reconstructions of the eight means before and after training. We compute each reconstruction by multiplying the means by the seven principal eigenvectors of the dataset. The data show that the means diverge to associate with different digits. The chip learns to associate most digits with a single probability distribution. The lone exception is digit 5, which doesn’t clearly associate with one distribution. We speculate that the reason is that 3’s, 5’s, and 8’s are very similar in our training data’s seven-dimensional representation. Gaussian mixture models trained with the E-M algorithm also demonstrate similar results, recovering only seven out of the eight digits.

We next evaluated the same learned means on vector quantization of a set of test digits (4400 examples of each digit). We compare the chip’s learned means with means learned by the batch E-M algorithm on mixtures of Gaussians (with σ=0.01), a mismatch E-M algorithm that models chip nonidealities, and a non-adaptive baseline quantizer. The purpose of the mismatch E-M algorithm was to assess the effect of nonuniform injection and tunneling strengths in floating-gate transistors. Because tunneling and injection magnitudes can vary by a large amount on different floating-gate transistors, the adaptive bump circuits can learn a mean that is somewhat off-center. We measured the offset of each bump circuit when adapting to a constant input and constructed the mismatch E-M algorithm by altering the learned means during the M-step by the measured offset.
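Because the learning experiments ended up using the hard winner-take-all feedback, the software analogue of the chip's training procedure is online K-means: the conditional probability in eq.6 is replaced by a one-hot winner signal. The sketch below shows this rule on synthetic data; the step size and data are assumptions for illustration only.

```python
import numpy as np

def kmeans_step(x, means, eta=0.05):
    """Online K-means: only the winning row (closest mean) adapts toward the input."""
    winner = np.argmin(np.sum((x - means) ** 2, axis=1))   # hard WTA over the rows
    means[winner] += eta * (x - means[winner])
    return means, winner

rng = np.random.default_rng(1)
means = rng.uniform(-0.2, 0.2, size=(8, 7))   # 8 component means, 7-dimensional inputs
for _ in range(5000):
    x = rng.normal(0.0, 0.1, size=7)
    means, _ = kmeans_step(x, means)
```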
We constructed the baseline quantizer by selecting, at random, an example of each digit for the quantizer codebook. For each quantizer, we computed the average squared reconstruction error on the digits’ seven-dimensional representation when we represent each test digit by the closest mean.

Figure 4. (a) Reconstruction of chip means before and after training with handwritten digits. (b) Comparison of average quantization error on unseen handwritten digits, for the chip’s learned means and mixture models trained by standard algorithms. (c) Plot of the probability of unseen examples of 7’s and 9’s under two bump mixture models, each trained solely on one digit.

The results in Fig.4(b) show that for most of the digits the chip’s learned means perform as well as the E-M algorithm, and better than the baseline quantizer in all cases. The one digit where the chip’s performance is far from the E-M algorithm’s is the digit “1”. Upon examination of the E-M algorithm’s results, we found that it associated two means with the digit “1”, whereas the chip allocated two means for the digit “3”. Over all the digits, the E-M algorithm exhibited a quantization error of 9.98, the mismatch E-M algorithm gave a quantization error of 10.9, the chip’s error was 11.6, and the baseline quantizer’s error was 15.97. The data show that mismatch is a significant factor in the difference between the bump mixture model’s performance and the E-M algorithm’s performance in quantization tasks.

Finally, we use the mixture model to classify handwritten digits. If we train a separate mixture model for each class of data, we can classify an input by comparing the probabilities of the input under each model. In our experiment, we train two separate mixture models: one on examples of the digit 7, and the other on examples of the digit 9. We then apply both mixtures to a set of unseen examples of digits 7 and 9, and record the probability score of each unseen example under each mixture model. We plot the resulting data in Fig.4(c). Each axis represents the probability under a different class. The data show that the model probabilities provide a good metric for classification. Assigning each test example to the class model that outputs the highest probability results in an accuracy of 87% on 2000 unseen digits. Additional software experiments show that mixtures of Gaussians (σ=0.01) trained by the batch E-M algorithm provide an accuracy of 92.39% on this task.

Our test results show that the bump mixture model’s performance on several learning tasks is comparable to standard mixtures of Gaussians trained by E-M. These experiments give further evidence that floating-gate circuits can be used to build effective learning systems even though their learning rules derive from silicon physics instead of statistical methods. The bump mixture model also represents a basic building block that we can use to build more complex silicon probability models over analog variables. This work can be extended in several ways. We can build distributions that have parameterized covariances in addition to means. In addition, we can build more complex, adaptive probability distributions in silicon by combining the bump mixture model with silicon probability models over discrete variables [5-7] and spike-based floating-gate learning circuits [4].
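For reference, the two evaluation metrics used above, average squared quantization error under closest-mean assignment and two-model maximum-probability classification, can be expressed in a few lines of NumPy. The data, codebook, and stand-in log-probability models below are synthetic placeholders; only the computation pattern is intended to match the protocol described in the text.

```python
import numpy as np

def quantization_error(X, codebook):
    """Average squared error when each vector is replaced by its closest codebook mean."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)   # (n_samples, n_means)
    return d2.min(axis=1).mean()

def classify(X, log_prob_model7, log_prob_model9):
    """Assign each example to the class whose mixture model gives the higher probability."""
    return np.where(log_prob_model7(X) >= log_prob_model9(X), 7, 9)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 7))            # stand-in for PCA-reduced test digits
codebook = rng.normal(size=(8, 7))       # stand-in for the learned means
print("avg squared quantization error:", quantization_error(X, codebook))

# Stand-in log-probability models built from two codebook entries (illustration only)
log_p7 = lambda Z: -((Z - codebook[0]) ** 2).sum(axis=1)
log_p9 = lambda Z: -((Z - codebook[1]) ** 2).sum(axis=1)
labels = classify(X, log_p7, log_p9)
print("fraction assigned to class 7:", (labels == 7).mean())
```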
Acknowledgments

This work was supported by NSF under grants BES 9720353 and ECS 9733425, and Packard Foundation and Sloan Fellowships.

References
[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press, 1995.
[2] L. R. Rabiner,
3 0.75481117 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip
Author: R. Carmona, F. Jiménez-garrido, R. Dominguez-castro, S. Espejo, A. Rodriguez-vázquez
Abstract: A bio-inspired model for an analog programmable array processor (APAP), based on studies on the vertebrate retina, has permitted the realization of complex programmable spatio-temporal dynamics in VLSI. This model mimics the way in which images are processed in the visual pathway, rendering a feasible alternative for the implementation of early vision applications in standard technologies. A prototype chip has been designed and fabricated in a 0.5µm standard CMOS process. Computing power per area and power consumption is amongst the highest reported for a single chip. Design challenges, trade-offs and some experimental results are presented in this paper. 1
4 0.59689403 4 nips-2002-A Differential Semantics for Jointree Algorithms
Author: James D. Park, Adnan Darwiche
Abstract: A new approach to inference in belief networks has been recently proposed, which is based on an algebraic representation of belief networks using multi-linear functions. According to this approach, the key computational question is that of representing multi-linear functions compactly, since inference reduces to a simple process of evaluating and differentiating such functions. We show here that mainstream inference algorithms based on jointrees are a special case of this approach in a very precise sense. We use this result to prove new properties of jointree algorithms, and then discuss some of its practical and theoretical implications. 1
5 0.51179188 154 nips-2002-Neuromorphic Bistable VLSI Synapses with Spike-Timing-Dependent Plasticity
Author: Giacomo Indiveri
Abstract: We present analog neuromorphic circuits for implementing bistable synapses with spike-timing-dependent plasticity (STDP) properties. In these types of synapses, the short-term dynamics of the synaptic efficacies are governed by the relative timing of the pre- and post-synaptic spikes, while on long time scales the efficacies tend asymptotically to either a potentiated state or to a depressed one. We fabricated a prototype VLSI chip containing a network of integrate and fire neurons interconnected via bistable STDP synapses. Test results from this chip demonstrate the synapse’s STDP learning properties, and its long-term bistable characteristics.
6 0.4444547 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
7 0.37645042 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model
8 0.36348668 200 nips-2002-Topographic Map Formation by Silicon Growth Cones
9 0.26879928 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
10 0.26376325 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain
11 0.26020479 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
12 0.19928363 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution
13 0.18568528 18 nips-2002-Adaptation and Unsupervised Learning
14 0.17813917 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
15 0.16828907 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
16 0.16086547 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
17 0.14713316 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
18 0.13694982 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory
19 0.13626578 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
20 0.13204324 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques
topicId topicWeight
[(6, 0.036), (11, 0.015), (23, 0.025), (42, 0.05), (51, 0.017), (54, 0.061), (55, 0.03), (57, 0.012), (67, 0.019), (68, 0.017), (74, 0.052), (83, 0.498), (87, 0.011), (92, 0.026), (98, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.85966212 91 nips-2002-Field-Programmable Learning Arrays
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: This paper introduces the Field-Programmable Learning Array, a new paradigm for rapid prototyping of learning primitives and machinelearning algorithms in silicon. The FPLA is a mixed-signal counterpart to the all-digital Field-Programmable Gate Array in that it enables rapid prototyping of algorithms in hardware. Unlike the FPGA, the FPLA is targeted directly for machine learning by providing local, parallel, online analog learning using floating-gate MOS synapse transistors. We present a prototype FPLA chip comprising an array of reconfigurable computational blocks and local interconnect. We demonstrate the viability of this architecture by mapping several learning circuits onto the prototype chip.
2 0.85420609 200 nips-2002-Topographic Map Formation by Silicon Growth Cones
Author: Brian Taba, Kwabena A. Boahen
Abstract: We describe a self-configuring neuromorphic chip that uses a model of activity-dependent axon remodeling to automatically wire topographic maps based solely on input correlations. Axons are guided by growth cones, which are modeled in analog VLSI for the first time. Growth cones migrate up neurotropin gradients, which are represented by charge diffusing in transistor channels. Virtual axons move by rerouting address-events. We refined an initially gross topographic projection by simulating retinal wave input. 1 Neuromorphic Systems Neuromorphic engineers are attempting to match the computational efficiency of biological systems by morphing neurocircuitry into silicon circuits [1]. One of the most detailed implementations to date is the silicon retina described in [2] . This chip comprises thirteen different cell types, each of which must be individually and painstakingly wired. While this circuit-level approach has been very successful in sensory systems, it is less helpful when modeling largely unelucidated and exceedingly plastic higher processing centers in cortex. Instead of an explicit blueprint for every cortical area, what is needed is a developmental rule that can wire complex circuits from minimal specifications. One candidate is the famous
3 0.65973049 168 nips-2002-Real-Time Monitoring of Complex Industrial Processes with Particle Filters
Author: Rubén Morales-menéndez, Nando D. Freitas, David Poole
Abstract: This paper discusses the application of particle filtering algorithms to fault diagnosis in complex industrial processes. We consider two ubiquitous processes: an industrial dryer and a level tank. For these applications, we compared three particle filtering variants: standard particle filtering, Rao-Blackwellised particle filtering and a version of RaoBlackwellised particle filtering that does one-step look-ahead to select good sampling regions. We show that the overhead of the extra processing per particle of the more sophisticated methods is more than compensated by the decrease in error and variance.
4 0.58618557 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.
5 0.46395361 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians.

1 Introduction

Many system-on-a-chip applications, such as data compression and signal processing, use online adaptation to improve or tune performance. These applications can benefit from the low-power compact design that analog VLSI learning systems can offer. Analog VLSI learning systems can benefit immensely from flexible learning algorithms that take advantage of silicon device physics for compact layout, and that are capable of a variety of learning tasks. One learning paradigm that encompasses a wide variety of learning tasks is density estimation, learning the probability distribution over the input data. A silicon density estimator can provide a basic template for VLSI systems for feature extraction, classification, adaptive vector quantization, and more. In this paper, we describe the bump mixture model, a statistical model that describes the probability distribution function of analog variables using low-level transistor equations. We intend the bump mixture model to be the silicon version of mixture of Gaussians [1], one of the most widely used statistical methods for modeling the probability distribution of a collection of data. Mixtures of Gaussians appear in many contexts from radial basis functions [1] to hidden Markov models [2]. In the bump mixture model, probability computations derive from translinear circuits [3] and learning derives from floating-gate device equations [4]. The bump mixture model can perform different functions such as quantization, probability estimation, and classification. In addition, this VLSI mixture model can implement multiple learning algorithms using different peripheral circuitry. Because the equations for system operation and learning derive from natural transistor behavior, we can build large bump mixture models with millions of parameters on a single chip. We have fabricated a bump mixture model, and tested it on clustering, classification, and vector quantization of handwritten digits. The results show that the fabricated system performs comparably to mixtures of Gaussians trained with the E-M algorithm [1]. Our work builds upon several trends of research in the VLSI community. The results in this paper complement recent work on probability propagation in analog VLSI [5-7]. These previous systems, intended for decoding applications in communication systems, model special forms of probability distributions over discrete variables, and do not incorporate learning. In contrast, the bump mixture model performs inference and learning on probability distributions over continuous variables. The bump mixture model significantly extends previous results on floating-gate circuits [4].
6 0.36094239 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip
7 0.3487314 154 nips-2002-Neuromorphic Bistable VLSI Synapses with Spike-Timing-Dependent Plasticity
8 0.33299103 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain
9 0.3278887 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
10 0.2630209 4 nips-2002-A Differential Semantics for Jointree Algorithms
11 0.25456712 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
12 0.24778004 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
13 0.24621612 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
14 0.23256195 55 nips-2002-Combining Features for BCI
15 0.22352079 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
16 0.22274382 3 nips-2002-A Convergent Form of Approximate Policy Iteration
17 0.2214967 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model
18 0.22130834 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
19 0.22113568 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
20 0.22010857 169 nips-2002-Real-Time Particle Filters