nips nips2002 nips2002-4 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James D. Park, Adnan Darwiche
Abstract: A new approach to inference in belief networks has been recently proposed, which is based on an algebraic representation of belief networks using multi–linear functions. According to this approach, the key computational question is that of representing multi–linear functions compactly, since inference reduces to a simple process of evaluating and differentiating such functions. We show here that mainstream inference algorithms based on jointrees are a special case of this approach in a very precise sense. We use this result to prove new properties of jointree algorithms, and then discuss some of its practical and theoretical implications.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract A new approach to inference in belief networks has been recently proposed, which is based on an algebraic representation of belief networks using multi–linear functions. [sent-4, score-0.404]
2 We use this result to prove new properties of jointree algorithms, and then discuss some of its practical and theoretical implications. [sent-7, score-0.608]
3 1 Introduction It was recently shown that the probability distribution of a belief network can be represented using a multi–linear function, and that most probabilistic queries of interest can be retrieved directly from the partial derivatives of this function [2]. [sent-8, score-0.506]
4 Although the multi–linear function has an exponential number of terms, it can be represented using a small arithmetic circuit in certain situations [3]. [sent-9, score-0.656]
5 Once a belief network is represented as an arithmetic circuit, probabilistic inference is then performed by evaluating and differentiating the circuit, using a very simple procedure which resembles back–propagation in neural networks. [sent-10, score-0.739]
6 Specifically, we show that each jointree is an implicit representation of an arithmetic circuit; that the inward–pass in jointree propagation evaluates this circuit; and that the outward–pass differentiates the circuit. [sent-12, score-1.725]
7 Using these results, we prove new useful properties of jointree propagation. [sent-13, score-0.608]
8 We also suggest a new interpretation of the process of factoring graphical models into jointrees, as a process of factoring exponentially–sized multi–linear functions into arithmetic circuits of smaller size. [sent-14, score-0.634]
9 1 For example, it was shown recently that real–world belief networks with treewidth up to 60 can be compiled into arithmetic circuits with a few thousand nodes [3]. [sent-15, score-0.595]
[Figure 1: A small belief network over variables A, B and C in which A is the parent of both B and C, shown with its CPTs: one table listing θb|a, θb̄|a, θb|ā, θb̄|ā over the true/false combinations of A and B, and one table listing θc|a, θc̄|a, θc|ā, θc̄|ā over the true/false combinations of A and C.]
12 Sections 2 and 3 are dedicated to a review of inference approaches based on arithmetic circuits and jointrees. [sent-29, score-0.455]
13 Section 4 shows that the jointree approach is a special case of the arithmetic–circuit approach, and discusses some practical implications of this finding. [sent-30, score-0.582]
14 2 A belief network is a representational factorization of a probability distribution, not a computational one. [sent-36, score-0.294]
15 Mainstream algorithms for inference in belief networks operate on the network to generate a computational factorization, allowing one to answer queries in time which is linear in the factorization size. [sent-38, score-0.391]
16 One of the most influential computational factorizations of belief networks is the jointree [14, 8, 6]. [sent-39, score-0.799]
17 Standard jointree factorizations are structure–based: their size depends only on the network topology and is invariant to local CPT structure. [sent-40, score-0.772]
18 We next discuss one of the latest proposals in this direction, which calls for using arithmetic circuits as a computational factorization of belief networks [2]. [sent-42, score-0.64]
19 This proposal is based on viewing each belief network as a multi–linear function, which can be represented compactly using an arithmetic circuit. [sent-43, score-0.635]
20 First, evidence indicators: for each variable X in the network, we have a variable λx for each value x of X. [sent-45, score-0.286]
21 Second, network parameters: for each variable X and its parents U in the network, we have a variable θx|u for each value x of X and instantiation u of U. [sent-46, score-0.298]
22 The multi–linear function has a term for each instantiation of the network variables, which is constructed by multiplying all evidence indicators and network parameters that are consistent with that instantiation. [sent-47, score-0.469]
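To make the construction concrete, here is a small brute-force sketch for a network of the same shape as Figure 1 (A with children B and C); the numeric CPT values below are illustrative placeholders, not the figure's actual entries.

```python
from itertools import product

# Hypothetical parameters theta for a network A -> B, A -> C over true/false
# values, written as 'a'/'~a' etc. The numbers are made up for the example.
theta = {
    ('A', 'a'): 0.3, ('A', '~a'): 0.7,
    ('B', 'b', 'a'): 0.4, ('B', '~b', 'a'): 0.6,
    ('B', 'b', '~a'): 0.1, ('B', '~b', '~a'): 0.9,
    ('C', 'c', 'a'): 0.8, ('C', '~c', 'a'): 0.2,
    ('C', 'c', '~a'): 0.5, ('C', '~c', '~a'): 0.5,
}

def network_polynomial(lam):
    """One term per instantiation (a, b, c): the product of the evidence
    indicators and network parameters consistent with that instantiation."""
    f = 0.0
    for a, b, c in product(('a', '~a'), ('b', '~b'), ('c', '~c')):
        f += (lam[a] * lam[b] * lam[c] *
              theta[('A', a)] * theta[('B', b, a)] * theta[('C', c, a)])
    return f
```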
[Figure 2: On the left, an arithmetic circuit which computes the function λa λb θa θb|a + λa λb̄ θa θb̄|a + λā λb θā θb|ā + λā λb̄ θā θb̄|ā. The circuit is a DAG, where leaf nodes represent function variables and internal nodes represent arithmetic operations. On the right, a jointree whose clusters are labeled with the evidence and CP tables assigned to them (e.g., λA θA θB|A θC|A and λC λE θE|C).]
25 Given the network polynomial f , we can answer any query with respect to the belief network. [sent-60, score-0.295]
26 For the network in Figure 1, we can compute the probability of evidence e = b, c̄ by evaluating its polynomial above under λa = 1, λā = 1, λb = 1, λb̄ = 0 and λc = 0, λc̄ = 1. [sent-63, score-0.285]
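Continuing the sketch above (with its made-up parameters), this evaluation looks as follows; the indicators of values compatible with e = b, c̄ are set to 1 and the rest to 0:

```python
# A is unobserved, so both of its indicators remain 1.
lam = {'a': 1, '~a': 1, 'b': 1, '~b': 0, 'c': 0, '~c': 1}
print(network_polynomial(lam))   # Pr(e) = 0.3*0.4*0.2 + 0.7*0.1*0.5 = 0.059
```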
27 This algebraic representation of belief networks is attractive as it allows us to obtain answers to a large number of probabilistic queries directly from the derivatives of the network polynomial [2]. [sent-66, score-0.423]
28 The network polynomial has an exponential number of terms, yet one can represent it compactly in certain cases using an arithmetic circuit; see Figure 2. [sent-70, score-0.55]
29 The (first) partial derivatives of an arithmetic circuit can all be computed simultaneously in time linear in the circuit size [2, 12]. [sent-71, score-1.002]
30 The procedure resembles the back–propagation algorithm for neural networks as it evaluates the circuit in a single upward–pass, and then differentiates it through a single downward–pass. [sent-72, score-0.325]
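The following is a minimal sketch of these two passes (one possible rendering, not the implementation of [2, 12]): an upward pass computes the value of every node, and a downward pass accumulates ∂f/∂v for every node v, treating '+' and '*' nodes the way back-propagation treats sums and products. The node names and parameter values in the example circuit are chosen for readability and are not taken from the paper.

```python
def evaluate(circuit, output, inputs):
    """Upward pass: value of every node reachable from the output node.
    A node is ('leaf', name), ('+', children) or ('*', children)."""
    val = {}
    def ev(n):
        if n not in val:
            kind, arg = circuit[n]
            if kind == 'leaf':
                val[n] = inputs[arg]
            elif kind == '+':
                val[n] = sum(ev(c) for c in arg)
            else:  # '*'
                v = 1.0
                for c in arg:
                    v *= ev(c)
                val[n] = v
        return val[n]
    ev(output)
    return val

def differentiate(circuit, output, val):
    """Downward pass: d(output)/d(node) for every evaluated node."""
    order, seen = [], set()          # reverse DFS postorder = parents first
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        kind, arg = circuit[n]
        if kind != 'leaf':
            for c in arg:
                visit(c)
        order.append(n)
    visit(output)
    dr = {n: 0.0 for n in val}
    dr[output] = 1.0
    for n in reversed(order):
        kind, arg = circuit[n]
        if kind == 'leaf':
            continue
        for i, c in enumerate(arg):
            if kind == '+':
                dr[c] += dr[n]
            else:                    # '*': scale by the other children's values
                other = 1.0
                for k, c2 in enumerate(arg):
                    if k != i:
                        other *= val[c2]
                dr[c] += dr[n] * other
    return dr

# One circuit that computes the polynomial shown in Figure 2 (network A -> B).
circuit = {
    'f':   ('+', ['ma', 'mna']),
    'ma':  ('*', ['la', 'ta', 'sa']),   'mna':  ('*', ['lna', 'tna', 'sna']),
    'sa':  ('+', ['ba', 'nba']),        'sna':  ('+', ['bna', 'nbna']),
    'ba':  ('*', ['lb', 'tb_a']),       'nba':  ('*', ['lnb', 'tnb_a']),
    'bna': ('*', ['lb', 'tb_na']),      'nbna': ('*', ['lnb', 'tnb_na']),
}
for leaf in ('la', 'lna', 'lb', 'lnb', 'ta', 'tna',
             'tb_a', 'tnb_a', 'tb_na', 'tnb_na'):
    circuit[leaf] = ('leaf', leaf)

inputs = {'la': 1, 'lna': 1, 'lb': 1, 'lnb': 0,               # evidence B = b
          'ta': 0.3, 'tna': 0.7, 'tb_a': 0.4, 'tnb_a': 0.6,   # made-up values
          'tb_na': 0.1, 'tnb_na': 0.9}
val = evaluate(circuit, 'f', inputs)
dr = differentiate(circuit, 'f', val)    # all partial derivatives in one pass
```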
31 The main computational question is then that of generating the smallest arithmetic circuit that computes the network polynomial. [sent-73, score-0.807]
32 A structure–based approach for this has been given in [2], which is guaranteed to generate a circuit whose size is bounded by O(n exp(w)), where n is the number of nodes in the network and w is its treewidth. [sent-74, score-0.491]
33 A more recent approach, however, which exploits local structure, has been presented in [3] and was shown experimentally to generate small arithmetic circuits for networks whose treewidth is up to 60. [sent-75, score-0.509]
34 As we show in the rest of this paper, the process of factoring a belief network into a jointree is yet another method for generating an arithmetic circuit for the network. [sent-76, score-1.561]
35 Specifically, we show that the jointree structure is an implicit representation of such a circuit, and that jointree propagation corresponds to circuit evaluation and differentiation. [sent-77, score-1.575]
36 Moreover, the difference between Shenoy–Shafer and Hugin propagation turns out to be a difference in the numeric scheme used for circuit differentiation [11]. [sent-78, score-0.411]
37 3 Join tree Algorithms We now review jointree algorithms, which are quite influential in graphical models. [sent-79, score-0.635]
38 A jointree for a belief network B is a pair (T, L), where T is a tree and L is a function that assigns labels to nodes in T. [sent-81, score-0.671]
39 We will refer to the nodes of a jointree (and sometimes their labels) as clusters. [sent-84, score-0.671]
40 We will also refer to the edges of a jointree (and sometimes their labels) as separators. [sent-85, score-0.582]
41 Jointree algorithms start by constructing a jointree for a given belief network [14, 8, 6]. [sent-87, score-0.826]
42 The conditional probability table (CPT or CP table) of each variable X with parents U, denoted θX|U, is assigned to a cluster that contains X and U. [sent-89, score-0.371]
43 In addition, an evidence table over variable X, denoted λX , is assigned to a cluster that contains X. [sent-90, score-0.451]
44 Evidence e is entered into a jointree by initializing evidence tables as follows: we set λX (x) to 1 if x is consistent with evidence e, and we set λX (x) to 0 otherwise. [sent-92, score-0.933]
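A direct sketch of this initialization; the dict-based representation of evidence tables is only for illustration:

```python
def enter_evidence(evidence_tables, e):
    """Set lambda_X(x) to 1 if x is consistent with the evidence e, else 0.
    evidence_tables maps each variable X to a dict over its values;
    e maps the observed variables to their observed values."""
    for X, table in evidence_tables.items():
        for x in table:
            table[x] = 1.0 if X not in e or e[X] == x else 0.0
```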
45 Given some evidence e, a jointree algorithm propagates messages between clusters. [sent-93, score-0.781]
46 After passing two messages per edge in the jointree, one can compute the marginals Pr(C, e) for every cluster C. [sent-94, score-0.296]
47 A cluster is then selected as the root and message propagation proceeds in two phases, inward and outward. [sent-98, score-0.547]
48 Cluster i sends a message to cluster j only when it has received messages from all its other neighbors k. [sent-101, score-0.369]
49 A message from cluster i to cluster j is a table Mij defined as follows: Mij = Σ_{C\S} φi Π_{k≠j} Mki, where C are the variables of cluster i, S are the variables of separator ij, and φi is the multiplication of all evidence and CP tables assigned to cluster i. [sent-102, score-1.42]
50 Once message propagation is finished, we have Pr(C, e) = φi Π_k Mki, where C are the variables of cluster i. [sent-103, score-0.446]
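The message equation above can be rendered with tables stored as dicts over instantiations; the sketch below is one such rendering (not the paper's code), and its table helpers are reused by the Hugin sketch further down.

```python
from itertools import product

def table(vars, fn, domains):
    """Build a table over `vars`, filling each row from a function of an
    instantiation given as a dict {variable: value}."""
    rows = {vals: fn(dict(zip(vars, vals)))
            for vals in product(*(domains[v] for v in vars))}
    return (tuple(vars), rows)

def multiply(t1, t2, domains):
    """Pointwise product of two tables, over the union of their variables."""
    (v1, r1), (v2, r2) = t1, t2
    union = tuple(dict.fromkeys(v1 + v2))
    return table(union,
                 lambda inst: r1[tuple(inst[v] for v in v1)] *
                              r2[tuple(inst[v] for v in v2)],
                 domains)

def marginalize(t, keep):
    """Sum a table down to the variables in `keep`."""
    v, r = t
    kept = tuple(x for x in v if x in keep)
    rows = {}
    for vals, p in r.items():
        key = tuple(vals[v.index(x)] for x in kept)
        rows[key] = rows.get(key, 0.0) + p
    return (kept, rows)

def ss_message(phi_i, messages_from_others, sep_vars, domains):
    """Shenoy-Shafer: M_ij = sum over C\\S of (phi_i times the messages
    cluster i received from all neighbors other than j)."""
    t = phi_i
    for m in messages_from_others:
        t = multiply(t, m, domains)
    return marginalize(t, sep_vars)

def cluster_marginal(phi_i, all_messages, domains):
    """Pr(C, e) = phi_i times the messages from all neighbors of cluster i."""
    t = phi_i
    for m in all_messages:
        t = multiply(t, m, domains)
    return t
```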
51 Hugin propagation proceeds similarly to Shenoy–Shafer by entering evidence; selecting a cluster as root; and propagating messages in two phases, inward and outward [8]. [sent-104, score-0.566]
52 It also maintains a table Φi with each cluster i, initialized to the multiplication of all CPTs and evidence tables assigned to cluster i, and a table Φij with each separator ij, initialized to unity. [sent-107, score-0.753]
53 Cluster i passes a message to neighboring cluster j only when i has received messages from all its other neighbors k. [sent-108, score-0.369]
54 When cluster i is ready to send a message to cluster j, it does the following. First, it saves the current separator table Φij as Φij^old. [sent-109, score-0.486]
55 Second, it computes a new separator table Φij = Σ_{C\S} Φi, where C are the variables of cluster i and S are the variables of separator ij. [sent-111, score-0.621]
56 Third, it computes a message to cluster j: Mij = Φij / Φij^old. [sent-112, score-0.339]
57 Finally, it multiplies the computed message into the table of cluster j: Φj = Φj Mij. [sent-113, score-0.388]
58 After the inward and outward passes of Hugin propagation are completed, we have Pr(C, e) = Φi, where C are the variables of cluster i. [sent-114, score-0.418]
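A corresponding sketch of a single Hugin message step, reusing the table helpers from the Shenoy-Shafer sketch above; division by a zero entry of the old separator table is simply skipped here, and the old separator table is assumed to use the same variable order as the new one.

```python
def hugin_send(Phi, Phi_sep, i, j, sep_vars, domains):
    """Send a Hugin message from cluster i to cluster j: save the old
    separator table, marginalize a new one from Phi[i], divide new by old,
    and multiply the resulting message into Phi[j]."""
    old_vars, old_rows = Phi_sep[(i, j)]      # assumed same variable order
    new = marginalize(Phi[i], sep_vars)
    new_vars, new_rows = new
    message = (new_vars,
               {k: (new_rows[k] / old_rows[k]) if old_rows.get(k, 0.0) else 0.0
                for k in new_rows})
    Phi_sep[(i, j)] = new
    Phi[j] = multiply(Phi[j], message, domains)
```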
59 4 Join trees as arithmetic circuits We now show that every jointree (together with a root cluster and a particular assignment of evidence and CP tables to clusters) corresponds precisely to an arithmetic circuit that computes the network polynomial. [sent-118, score-2.283]
60 Moreover, the circuit has an output addition node, an addition node for each separator instantiation, a multiplication node for each cluster instantiation, and leaf nodes for the evidence indicators and network parameters; the structure of the jointree dictates how these nodes are connected into a circuit. [sent-124, score-0.671]
61 The arithmetic circuit in Figure 2 is embedded in the jointree A − AB, with cluster A as the root, and with tables λA, θA assigned to cluster A and tables λB and θB|A assigned to cluster AB. [sent-125, score-2.135]
62 Theorem 1 The circuit embedded in a jointree computes the network polynomial. [sent-126, score-1.062]
63 Therefore, by constructing a jointree one is generating a compact representation of the network polynomial in terms of an arithmetic circuit. [sent-127, score-1.103]
64 We are now ready to state our basic results on the differential semantics of jointree propagation, but we need some notational conventions first. [sent-128, score-0.621]
65 In the following three theorems: f denotes the circuit embedded in a jointree or its (unique) output node; s denotes a separator instantiation or the addition node generated by that instantiation; and c denotes a cluster instantiation or the multiplication node generated by that instantiation. [sent-129, score-1.642]
66 Moreover, the value that a circuit node v takes under evidence e is denoted v(e). [sent-130, score-0.548]
67 Recall that a circuit (or network polynomial) is evaluated under evidence e by setting each input λx to 1 if x is consistent with e; and to 0 otherwise. [sent-131, score-0.528]
68 Finally, recall that ∂f/∂v represents the derivative of the circuit output with respect to node v. [sent-132, score-0.405]
69 Theorem 2 The messages produced using Shenoy–Shafer propagation on a jointree under evidence e have the following semantics. [sent-134, score-0.898]
70 4 Given a root cluster, one can direct the jointree by having arrows point away from the root, which also defines a parent/child relationship between clusters and separators. [sent-139, score-0.662]
71 Theorem 3 If evidence table λX is assigned to cluster i with variables C, then ∂f(e)/∂λx = Σ_{C\X} ( Π_j Mji · Π_{ψ≠λX} ψ ) (x),   (1) where ψ ranges over all evidence and CP tables assigned to cluster i. [sent-140, score-0.897]
72 Moreover, if CPT θX|U is assigned to cluster i with variables C, then ∂f(e)/∂θx|u = Σ_{C\XU} ( Π_j Mji · Π_{ψ≠θX|U} ψ ) (xu),   (2) where ψ ranges over all evidence and CP tables assigned to cluster i. [sent-141, score-0.732]
73 Therefore, even though Shenoy–Shafer propagation does not fully differentiate the embedded circuit, the differentiation process can be completed through local computations after propagation has finished. [sent-142, score-0.298]
74 We now discuss some applications of the partial derivatives with respect to evidence indicators λx and network parameters θx|u. [sent-143, score-0.321]
75 Suppose jointree propagation has been performed using evidence e, which gives us access directly to the probability of e. [sent-145, score-0.825]
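For instance, since the marginal Pr(C, e) is available at every cluster after propagation, summing any one cluster's marginal table over its instantiations yields Pr(e); a one-line sketch in the table representation used above:

```python
def prob_of_evidence(cluster_marginal_table):
    """Pr(e) = sum over instantiations c of Pr(c, e), for any single cluster."""
    _, rows = cluster_marginal_table
    return sum(rows.values())
```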
76 Again, Hugin propagation does not compute derivatives with respect to input nodes λx and θx|u. [sent-157, score-0.407]
77 Even for addition and multiplication nodes, it only retains derivatives multiplied by values. [sent-158, score-0.287]
78 Hence, if we want to recover the derivative with respect to, say, multiplication node c, we must know the value of this node, and it must be different from zero. [sent-159, score-0.392]
79 In such a case, we have ∂f(e)/∂c = Φi(c)/c(e), where Φi is the table associated with the cluster i that generates node c. [sent-160, score-0.33]
80 But such quantities will be useful for obtaining derivatives only if the values of such input nodes are not zero. [sent-162, score-0.325]
81 Hence, Shenoy–Shafer propagation is more informative than Hugin propagation as far as the computation of derivatives is concerned. [sent-163, score-0.435]
82 Hence, our results seem to suggest the first general approach for computing this derivative using standard jointree propagation. [sent-167, score-0.615]
83 Jointree propagation gives exact results only when infinite precision arithmetic is used. [sent-169, score-0.479]
84 In practice, however, finite precision floating–point arithmetic is typically used, in which case the differential semantics of jointree propagation can be used to bound the rounding error in the computed probability of evidence. [sent-170, score-1.14]
85 These results have been useful in unifying the circuit approach presented in [2] with jointree approaches, and in uncovering more properties of jointree propagation. [sent-173, score-1.458]
86 Specifically, the perspective we are promoting here is that probability distributions defined by graphical models should be viewed as multi–linear functions, and the construction of jointrees should be viewed as a process of constructing arithmetic circuits that compute these functions. [sent-175, score-0.529]
87 That is, the fundamental object being factored is a multi–linear function, and the fundamental result of the factorization is an arithmetic circuit. [sent-176, score-0.412]
88 A graphical model is a useful abstraction of the multi–linear function, and a jointree is a useful structure for embedding the arithmetic circuit. [sent-177, score-0.997]
89 This view of factoring is useful since it allows us to cast the factoring problem in more refined terms, which puts us in a better position to exploit the local structure of graphical models in the factorization process. [sent-178, score-0.29]
90 Any jointree for the network will have a cluster of size n, leading to O(exp(n)) complexity. [sent-185, score-0.88]
91 There is, however, a circuit of O(n) size here, since the network polynomial can be easily factored as f = (1/2)^n Π_{i=1}^{n} (λxi + λx̄i). [sent-186, score-0.453]
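A tiny sketch of the contrast: the factored form is evaluated with O(n) operations, whereas the expanded polynomial has 2^n terms.

```python
def f_factored(lam, n):
    """Evaluate f = (1/2)^n * prod_i (lambda_xi + lambda_xbar_i) in O(n)."""
    value = 0.5 ** n
    for i in range(n):
        value *= lam[(i, True)] + lam[(i, False)]
    return value

# With no evidence every indicator is 1, so f = (1/2)^n * 2^n = 1:
n = 60
lam = {(i, b): 1.0 for i in range(n) for b in (True, False)}
assert abs(f_factored(lam, n) - 1.0) < 1e-9
```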
92 2 Hence, in the presence of local structure, it appears more promising to factor the graphical model into the more refined arithmetic circuit since not every arithmetic circuit can be embedded in a jointree. [sent-187, score-1.429]
93 First, the multi–linear function of a belief network is “encoded” using a propositional theory, which is expressive enough to capture the form of the multi–linear function in addition to constraints on its variables. [sent-189, score-0.268]
94 An arithmetic circuit is finally extracted from that form. [sent-191, score-0.656]
95 The method was able to generate relatively small arithmetic circuits for a significant suite of real–world belief networks with treewidths up to 60. [sent-192, score-0.59]
96 Moreover, each of these representations can be expanded in linear time into an arithmetic circuit that satisfies some strong properties [4]. [sent-195, score-0.656]
97 Hence, such representations are special cases of arithmetic circuits as well. [sent-196, score-0.423]
98 We finally note that the relationship between multi–linear functions (polynomials in general) and arithmetic circuits is a classical subject of algebraic complexity theory [15]. [sent-197, score-0.461]
99 In this field of complexity, computational problems are expressed as polynomials, and a central question is that of determining the size of the smallest arithmetic circuit that computes a given polynomial, leading to the notion of circuit complexity. [sent-198, score-0.993]
100 Using this notion, it is then meaningful to talk about the circuit complexity of a graphical model: the size of the smallest arithmetic circuit that computes the multi–linear function induced by the model. [sent-199, score-1.046]
wordName wordTfidf (topN-words)
[('jointree', 0.582), ('arithmetic', 0.362), ('circuit', 0.294), ('multi', 0.194), ('cluster', 0.19), ('shenoy', 0.146), ('belief', 0.136), ('hugin', 0.132), ('shafer', 0.132), ('evidence', 0.126), ('propagation', 0.117), ('separator', 0.115), ('network', 0.108), ('message', 0.106), ('tables', 0.099), ('ativ', 0.093), ('instantiation', 0.092), ('nodes', 0.089), ('mij', 0.084), ('outward', 0.084), ('di', 0.083), ('deriv', 0.081), ('mji', 0.079), ('factoring', 0.079), ('inward', 0.078), ('node', 0.078), ('messages', 0.073), ('pr', 0.062), ('multiplication', 0.062), ('circuits', 0.061), ('false', 0.059), ('root', 0.056), ('erentiation', 0.053), ('factorizations', 0.053), ('jointrees', 0.053), ('graphical', 0.053), ('ij', 0.053), ('cpt', 0.053), ('cp', 0.051), ('polynomial', 0.051), ('factorization', 0.05), ('erential', 0.049), ('ev', 0.049), ('assigned', 0.047), ('mainstream', 0.046), ('parents', 0.046), ('computes', 0.043), ('cpts', 0.042), ('ariable', 0.04), ('erentiates', 0.04), ('rounding', 0.04), ('semantics', 0.039), ('table', 0.039), ('algebraic', 0.038), ('embedded', 0.035), ('alues', 0.035), ('indicators', 0.035), ('uai', 0.034), ('queries', 0.034), ('instantiations', 0.034), ('pass', 0.033), ('variables', 0.033), ('derivative', 0.033), ('inference', 0.032), ('angeles', 0.032), ('networks', 0.031), ('equals', 0.03), ('compactly', 0.029), ('local', 0.029), ('xu', 0.028), ('es', 0.027), ('partial', 0.027), ('aluating', 0.026), ('ariables', 0.026), ('bda', 0.026), ('echnical', 0.026), ('elief', 0.026), ('erentiating', 0.026), ('ery', 0.026), ('esian', 0.026), ('mki', 0.026), ('prov', 0.026), ('qbda', 0.026), ('separators', 0.026), ('treewidth', 0.026), ('variable', 0.026), ('los', 0.025), ('derivatives', 0.025), ('clusters', 0.024), ('children', 0.024), ('propagating', 0.024), ('addition', 0.024), ('denoted', 0.023), ('cally', 0.023), ('hav', 0.023), ('join', 0.023), ('ucla', 0.023), ('true', 0.023), ('letters', 0.023), ('compatible', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 4 nips-2002-A Differential Semantics for Jointree Algorithms
Author: James D. Park, Adnan Darwiche
Abstract: A new approach to inference in belief networks has been recently proposed, which is based on an algebraic representation of belief networks using multi–linear functions. According to this approach, the key computational question is that of representing multi–linear functions compactly, since inference reduces to a simple process of evaluating and differentiating such functions. We show here that mainstream inference algorithms based on jointrees are a special case of this approach in a very precise sense. We use this result to prove new properties of jointree algorithms, and then discuss some of its practical and theoretical implications.
2 0.12987286 94 nips-2002-Fractional Belief Propagation
Author: Wim Wiegerinck, Tom Heskes
Abstract: We consider loopy belief propagation for approximate inference in probabilistic graphical models. A limitation of the standard algorithm is that clique marginals are computed as if there were no loops in the graph. To overcome this limitation, we introduce fractional belief propagation. Fractional belief propagation is formulated in terms of a family of approximate free energies, which includes the Bethe free energy and the naive mean-field free as special cases. Using the linear response correction of the clique marginals, the scale parameters can be tuned. Simulation results illustrate the potential merits of the approach.
3 0.11722355 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
Author: Tom Heskes
Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1
4 0.10503758 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians.
5 0.10069097 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
Author: Shih-Chii Liu, Malte Boegershausen, Pascal Suter
Abstract: We describe a model of short-term synaptic depression that is derived from a silicon circuit implementation. The dynamics of this circuit model are similar to the dynamics of some present theoretical models of shortterm depression except that the recovery dynamics of the variable describing the depression is nonlinear and it also depends on the presynaptic frequency. The equations describing the steady-state and transient responses of this synaptic model fit the experimental results obtained from a fabricated silicon network consisting of leaky integrate-and-fire neurons and different types of synapses. We also show experimental data demonstrating the possible computational roles of depression. One possible role of a depressing synapse is that the input can quickly bring the neuron up to threshold when the membrane potential is close to the resting potential.
6 0.093542464 91 nips-2002-Field-Programmable Learning Arrays
7 0.090613224 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
8 0.073672555 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
9 0.071772099 152 nips-2002-Nash Propagation for Loopy Graphical Games
10 0.071195349 154 nips-2002-Neuromorphic Bisable VLSI Synapses with Spike-Timing-Dependent Plasticity
11 0.070985243 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
12 0.066251419 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
13 0.06574028 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
14 0.063128084 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
15 0.061782654 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip
16 0.058676317 27 nips-2002-An Impossibility Theorem for Clustering
17 0.052517164 124 nips-2002-Learning Graphical Models with Mercer Kernels
18 0.051662136 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
19 0.051652201 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
20 0.050509918 32 nips-2002-Approximate Inference and Protein-Folding
topicId topicWeight
[(0, -0.131), (1, 0.006), (2, -0.075), (3, -0.003), (4, -0.029), (5, 0.198), (6, -0.008), (7, 0.089), (8, -0.149), (9, -0.018), (10, -0.013), (11, 0.083), (12, -0.095), (13, -0.004), (14, 0.085), (15, -0.089), (16, 0.101), (17, -0.011), (18, 0.095), (19, -0.02), (20, 0.159), (21, 0.026), (22, -0.047), (23, 0.038), (24, 0.034), (25, 0.001), (26, 0.045), (27, -0.016), (28, 0.036), (29, 0.105), (30, -0.193), (31, -0.031), (32, -0.021), (33, 0.122), (34, 0.096), (35, -0.023), (36, 0.023), (37, -0.049), (38, -0.034), (39, 0.082), (40, 0.022), (41, -0.072), (42, -0.076), (43, -0.012), (44, -0.006), (45, -0.048), (46, -0.005), (47, -0.014), (48, 0.06), (49, -0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.96102899 4 nips-2002-A Differential Semantics for Jointree Algorithms
Author: James D. Park, Adnan Darwiche
Abstract: A new approach to inference in belief networks has been recently proposed, which is based on an algebraic representation of belief networks using multi–linear functions. According to this approach, the key computational question is that of representing multi–linear functions compactly, since inference reduces to a simple process of evaluating and differentiating such functions. We show here that mainstream inference algorithms based on jointrees are a special case of this approach in a very precise sense. We use this result to prove new properties of jointree algorithms, and then discuss some of its practical and theoretical implications.
2 0.66454655 91 nips-2002-Field-Programmable Learning Arrays
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: This paper introduces the Field-Programmable Learning Array, a new paradigm for rapid prototyping of learning primitives and machinelearning algorithms in silicon. The FPLA is a mixed-signal counterpart to the all-digital Field-Programmable Gate Array in that it enables rapid prototyping of algorithms in hardware. Unlike the FPGA, the FPLA is targeted directly for machine learning by providing local, parallel, online analog learning using floating-gate MOS synapse transistors. We present a prototype FPLA chip comprising an array of reconfigurable computational blocks and local interconnect. We demonstrate the viability of this architecture by mapping several learning circuits onto the prototype chip.
3 0.63762534 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians. 1 I n trod u cti on Many system-on-a-chip applications, such as data compression and signal processing, use online adaptation to improve or tune performance. These applications can benefit from the low-power compact design that analog VLSI learning systems can offer. Analog VLSI learning systems can benefit immensely from flexible learning algorithms that take advantage of silicon device physics for compact layout, and that are capable of a variety of learning tasks. One learning paradigm that encompasses a wide variety of learning tasks is density estimation, learning the probability distribution over the input data. A silicon density estimator can provide a basic template for VLSI systems for feature extraction, classification, adaptive vector quantization, and more. In this paper, we describe the bump mixture model, a statistical model that describes the probability distribution function of analog variables using low-level transistor equations. We intend the bump mixture model to be the silicon version of mixture of Gaussians [1], one of the most widely used statistical methods for modeling the probability distribution of a collection of data. Mixtures of Gaussians appear in many contexts from radial basis functions [1] to hidden Markov models [2]. In the bump mixture model, probability computations derive from translinear circuits [3] and learning derives from floating-gate device equations [4]. The bump mixture model can perform different functions such as quantization, probability estimation, and classification. In addition this VLSI mixture model can implement multiple learning algorithms using different peripheral circuitry. Because the equations for system operation and learning derive from natural transistor behavior, we can build large bump mixture model with millions of parameters on a single chip. We have fabricated a bump mixture model, and tested it on clustering, classification, and vector quantization of handwritten digits. The results show that the fabricated system performs comparably to mixtures of Gaussians trained with the E-M algorithm [1]. Our work builds upon several trends of research in the VLSI community. The results in this paper are complement recent work on probability propagation in analog VLSI [5-7]. These previous systems, intended for decoding applications in communication systems, model special forms of probability distributions over discrete variables, and do not incorporate learning. In contrast, the bump mixture model performs inference and learning on probability distributions over continuous variables. The bump mixture model significantly extends previous results on floating-gate circuits [4]. 
Our system is a fully realized floating-gate learning algorithm that can be used for vector quantization, probability estimation, clustering, and classification. Finally, the mixture model’s architecture is similar to many previous VLSI vector quantizers [8, 9]. We can view the bump mixture model as a VLSI vector quantizer with well-defined probabilistic semantics. Computations such as probability estimation and maximum-likelihood classification have a natural statistical interpretation under the mixture model. In addition, because we rely on floating-gate devices, the mixture model does not require a refresh mechanism unlike previous learning VLSI quantizers. 2 T h e ad ap ti ve b u mp ci rcu i t The adaptive bump circuit [4], depicted in Fig.1(a-b), forms the basis of the bump mixture model. This circuit is slightly different from previous versions reported in the literature. Nevertheless, the high level functionality remains the same; the adaptive bump circuit computes the similarity between a stored variable and an input, and adapts to increase the similarity between the stored variable and input. Fig.1(a) shows the computation portion of the circuit. The bump circuit takes as input, a differential voltage signal (+Vin, −Vin) around a DC bias, and computes the similarity between Vin and a stored value, µ. We represent the stored memory µ as a voltage: µ= Vw- − Vw+ 2 (1) where Vw+ and Vw− are the gate-offset voltages stored on capacitors C1 and C2. Because C1 and C2 isolate the gates of transistors M1 and M2 respectively, these transistors are floating-gate devices. Consequently, the stored voltages Vw+ and Vw− are nonvolatile. We can express the floating-gate voltages Vfg1 and Vfg2 as Vfg1 =Vin +Vw+ and Vfg2 =Vw− −Vin, and the output of the bump circuit as [10]: I out = Ib cosh 2 ( ( 4κ / SU ) (V t fg 1 − V fg 2 ) ) = Ib cosh ( ( 8κ / SU t )(Vin − µ ) ) 2 (2) where Ib is the bias current, κ is the gate-coupling coefficient, Ut is the thermal voltage, and S depends on the transistor sizes. Fig.1(b) shows Iout for three different stored values of µ. As the data show, different µ’s shift the location of the peak response of the circuit. Vw+ V fg1 V in V fg2 Vb M1 −V in M2 I out Vw− C1 C2 V ca sc V2 V1 Vb V tun M6 V fg1 V2 V1 V in j (a) (b) bump circuit's transfer function for three µ's 10 Iout (nA) µ2 µ1 µ3 6 4 2 0 -0.4 -0.2 V fg2 M3 M4 V inj 8 V tun M5 0 V in (c) 0.2 0.4 Figure 1. (a-b) The adaptive bump circuit. (a) The original bump circuit augmented by capacitors C1 and C2, and cascode transistors (driven by Vcasc). (b) The adaptation subcircuit. M3 and M4 control injection on the floating-gates and M5 and M6 control tunneling. (b) Measured output current of a bump circuit for three programmed memories. Fig.1(b) shows the circuit that implements learning in the adaptive bump circuit. We implement learning through Fowler-Nordheim tunneling [11] on tunneling junctions M5-M6 and hot electron injection [12] on the floating-gate transistors M3-M4. Transistor M3 and M5 control injection and tunneling on M1’s floating-gate. Transistors M4 and M6 control injection and tunneling on M2’s floating-gate. We activate tunneling and injection by a high Vtun and low Vinj respectively. In the adaptive bump circuit, both processes increase the similarity between Vin and µ. In addition, the magnitude of the update does not depend on the sign of (Vin − µ) because the differential input provides common-mode rejection to the input differential pair. 
The similarity function, as seen in Fig.1(b), has a Gaussian-like shape. Consequently, we can equate the output current of the bump circuit with the probability of the input under a distribution parameterized by mean µ: P (Vin | µ ) = I out (3) In addition, increasing the similarity between Vin and µ is equivalent to increasing P(Vin |µ). Consequently, the adaptive bump circuit adapts to maximize the likelihood of the present input under the circuit’s probability distribution. 3 T h e b u mp mi xtu re mod el We now describe the computations and learning rule implemented by the bump mixture model. A mixture model is a general class of statistical models that approximates the probability of an analog input as the weighted sum of probability of the input under several simple distributions. The bump mixture model comprises a set of Gaussian-like probability density functions, each parameterized by a mean vector, µi. Denoting the j th dimension of the mean of the ith density as µij, we express the probability of an input vector x as: P ( x ) = (1/ N ) i P ( x | i ) = (1/ N ) i (∏ P ( x j j | µij ) ) (4) where N is the number of densities in the model and i denotes the ith density. P(x|i) is the product of one-dimensional densities P(xj|µij) that depend on the j th dimension of the ith mean, µij. We derive each one-dimensional probability distribution from the output current of a single bump circuit. The bump mixture model makes two assumptions: (1) the component densities are equally likely, and (2) within each component density, the input dimensions are independent and have equal variance. Despite these restrictions, this mixture model can, in principle, approximate any probability density function [1]. The bump mixture model adapts all µi to maximize the likelihood of the training data. Learning in the bump mixture model is based on the E-M algorithm, the standard algorithm for training Gaussian mixture models. The E-M algorithm comprises two steps. The E-step computes the conditional probability of each density given the input, P(i|x). The M-step updates the parameters of each distribution to increase the likelihood of the data, using P(i|x) to scale the magnitude of each parameter update. In the online setting, the learning rule is: ∆µij = η P (i | x ) ∂ log P ( x j | µij ) ∂µij =η P( x | i) k P( x | k) ∂ log P ( x j | µij ) ∂µij (5) where η is a learning rate and k denotes component densities. Because the adaptive bump circuit already adapts to increase the likelihood of the present input, we approximate E-M by modulating injection and tunneling in the adaptive bump circuit by the conditional probability: ∆µij = η P ( i | x ) f ( x j − µ ij ) (6) where f() is the parameter update implemented by the bump circuit. We can modulate the learning update in (6) with other competitive factors instead of the conditional probability to implement a variety of learning rules such as online K-means. 4 S i l i con i mp l emen tati on We now describe a VLSI system that implements the silicon mixture model. The high level organization of the system detailed in Fig.2, is similar to VLSI vector quantization systems. The heart of the mixture model is a matrix of adaptive bump circuits where the ith row of bump circuits corresponds to the ith component density. In addition, the periphery of the matrix comprises a set of inhibitory circuits for performing probability estimation, inference, quantization, and generating feedback for learning. We send each dimension of an input x down a single column. 
Unity-gain inverting amplifiers (not pictured) at the boundary of the matrix convert each single-ended voltage input into a differential signal. Each bump circuit computes a current that represents (P(xj|µij))^σ, where σ is the common variance of the one-dimensional densities. The mixture model computes P(x|i) along the ith row, and inhibitory circuits perform inference, estimation, or quantization. We utilize translinear devices [3] to perform all of these computations. Translinear devices, such as the subthreshold MOSFET and the bipolar transistor, exhibit an exponential relationship between the gate voltage and the source current. This property allows us to establish a power-law relationship between currents and probabilities (i.e., a linear relationship between gate voltages and log-probabilities).

Figure 2. Bump mixture model architecture. The system comprises a matrix of adaptive bump circuits where each row computes the probability P(x|µi). Inhibitory circuits transform the output of each row into system outputs. Spike generators also transform inhibitory circuit outputs into rate-coded feedback for learning.

We compute the multiplication of the probabilities in each row of Fig.2 as addition in the log domain, using the circuit in Fig.3(a). This circuit first converts each bump circuit’s current into a voltage using a diode (e.g. M1). M2’s capacitive divider computes Vavg as the average of the scalar log probabilities, log P(xj|µij):

Vavg = (σ/N) Σj log P(xj | µij)    (7)

where σ is the variance, N is the number of input dimensions, and voltages are in units of κ/Ut (Ut is the thermal voltage and κ is the transistor-gate coupling coefficient). Transistors M2-M5 mirror Vavg to the gate of M5. We define the drain voltage of M5 as log P(x|i) (up to an additive constant) and compute:

log P(x | i) = ((C1 + C2)/C1) Vavg = ((C1 + C2)σ/(C1 N)) Σj log P(xj | µij) + k    (8)

where k is a constant dependent on Vg (the control gate voltage on M5), and C1 and C2 are capacitances. From eq.8 we can derive the variance as:

σ = N C1 / (C1 + C2)    (9)

The system computes different output functions and feedback signals for learning by operating on the log probabilities of eq.8. Fig.3(b) shows a circuit that computes P(i|x) for each distribution. The circuit is a k-input differential pair in which the bias transistor M0 normalizes the currents representing the probabilities P(x|i) at the ith leg. Fig.3(c) shows a circuit that computes P(x). The ith transistor exponentiates log P(x|i), and a single wire sums the currents. We can also apply other inhibitory circuits to the log probabilities, such as winner-take-all (WTA) circuits [13] and resistive networks [14]. In our fabricated chip, we implemented probability estimation, conditional probability computation, and WTA. The WTA outputs the index of the most likely component distribution for the present input, and can be used to implement vector quantization and to produce feedback for an online K-means learning rule. At each synapse, the system combines a feedback signal, such as the conditional probability P(i|x) computed at the matrix periphery, with the adaptive bump circuit to implement learning. We trigger adaptation at each bump circuit by a rate-coded spike signal generated from the inhibitory circuit’s current outputs. We generate this spike train with a current-to-spike converter based on Lazzaro’s low-power spiking neuron [15].
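Looking back at eqs. (7)-(9), the way the capacitor ratio sets the effective variance can be checked with a short numerical sketch. The capacitance values and the per-dimension log-probabilities below are assumed purely for illustration, and the additive constant k of eq. (8) is dropped.

import numpy as np

# Assumed, illustrative values; only the capacitor ratio matters.
C1, C2 = 1.0, 6.0
N = 7                                    # input dimensions used in the experiments
log_p_scalar = np.log(np.array([0.80, 0.55, 0.90, 0.70, 0.60, 0.85, 0.75]))  # log P(x_j|mu_ij)

sigma = N * C1 / (C1 + C2)               # eq. (9): variance set by the capacitor ratio
v_avg = (sigma / N) * log_p_scalar.sum()            # eq. (7)
log_p_row = ((C1 + C2) / C1) * v_avg                # eq. (8) with k = 0

# The divider gain sigma/N and the output gain (C1+C2)/C1 cancel, so the row
# output is simply the sum of the per-dimension log-probabilities.
assert np.isclose(log_p_row, log_p_scalar.sum())
print(f"sigma = {sigma:.2f}, log P(x|i) = {log_p_row:.3f}")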
This rate-coded signal toggles Vtun and Vinj at each bump circuit. Consequently, adaptation is proportional to the frequency of the spike train, which is in turn a linear function of the inhibitory feedback signal. The alternative to the rate code would be to transform the inhibitory circuit’s output directly into analog Vtun and Vinj signals. Because injection and tunneling are highly nonlinear functions of Vinj and Vtun respectively, implementing updates that are linear in the inhibitory feedback signal is quite difficult using this approach.

Figure 3. (a) Circuit for computing log P(x|i). (b) Circuit for computing P(i|x). The current through the ith leg represents P(i|x). (c) Circuit for computing P(x).

5 Experimental results and conclusions

We fabricated an 8 x 8 mixture model (8 probability distribution functions with 8 dimensions each) in a TSMC 0.35µm CMOS process available through MOSIS, and tested the chip on synthetic data and a handwritten digits dataset. In our tests, we found that, due to a design error, one of the input dimensions coupled to the other inputs. Consequently, we held that input fixed throughout the tests, effectively reducing the input to 7 dimensions. In addition, we found that the learning rule in eq.6 produced poor performance because the variance of the bump distributions was too large. Consequently, in our learning experiments, we used the hard winner-take-all circuit to control adaptation, resulting in a K-means learning rule.

We trained the chip to perform different tasks on handwritten digits from the MNIST dataset [16]. To prepare the data, we first performed PCA to reduce the 784-pixel images to seven-dimensional vectors, and then sent the data on-chip. We first tested the circuit on clustering handwritten digits. We trained the chip on 1000 examples of each of the digits 1-8. Fig.4(a) shows reconstructions of the eight means before and after training. We compute each reconstruction by multiplying the means by the seven principal eigenvectors of the dataset. The data show that the means diverge to associate with different digits. The chip learns to associate most digits with a single probability distribution. The lone exception is digit 5, which does not clearly associate with one distribution. We speculate that the reason is that 3’s, 5’s, and 8’s are very similar in our training data’s seven-dimensional representation. Gaussian mixture models trained with the E-M algorithm also demonstrate similar results, recovering only seven out of the eight digits.

We next evaluated the same learned means on vector quantization of a set of test digits (4400 examples of each digit). We compare the chip’s learned means with means learned by the batch E-M algorithm on mixtures of Gaussians (with σ=0.01), a mismatch E-M algorithm that models chip nonidealities, and a non-adaptive baseline quantizer. The purpose of the mismatch E-M algorithm was to assess the effect of nonuniform injection and tunneling strengths in floating-gate transistors. Because tunneling and injection magnitudes can vary by a large amount on different floating-gate transistors, the adaptive bump circuits can learn a mean that is somewhat off-center. We measured the offset of each bump circuit when adapting to a constant input and constructed the mismatch E-M algorithm by altering the learned means during the M-step by the measured offset.
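For completeness, the hard-WTA rule used in these learning experiments replaces the soft responsibility P(i|x) of eq. (6) with an indicator on the winning row, which amounts to an online K-means step. A minimal sketch under the same Gaussian stand-in as before (the function and variable names are illustrative, not the chip's internals):

import numpy as np

def wta_kmeans_update(x, mu, eta=0.05):
    # Only the closest mean (the WTA winner) adapts toward the input x.
    winner = int(np.argmin(np.sum((x - mu) ** 2, axis=1)))
    mu = mu.copy()
    mu[winner] += eta * (x - mu[winner])
    return mu, winner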
We constructed the baseline quantizer by selecting, at random, an example of each digit for the quantizer codebook. For each quantizer, we computed the reconstruction error on the digits’ seven-dimensional representation when we represent each test digit by the closest mean.

Figure 4. (a) Reconstruction of chip means before and after training with handwritten digits. (b) Comparison of average quantization error on unseen handwritten digits, for the chip’s learned means and mixture models trained by standard algorithms. (c) Plot of probability of unseen examples of 7’s and 9’s under two bump mixture models trained solely on each digit.

The results in Fig.4(b) show that for most of the digits the chip’s learned means perform as well as the E-M algorithm, and better than the baseline quantizer in all cases. The one digit where the chip’s performance is far from the E-M algorithm is the digit “1”. Upon examination of the E-M algorithm’s results, we found that it associated two means with the digit “1”, whereas the chip allocated two means to the digit “3”. Over all the digits, the E-M algorithm exhibited a quantization error of 9.98, the mismatch E-M algorithm gave a quantization error of 10.9, the chip’s error was 11.6, and the baseline quantizer’s error was 15.97. The data show that mismatch is a significant factor in the difference between the bump mixture model’s performance and the E-M algorithm’s performance in quantization tasks.

Finally, we use the mixture model to classify handwritten digits. If we train a separate mixture model for each class of data, we can classify an input by comparing the probabilities of the input under each model. In our experiment, we train two separate mixture models: one on examples of the digit 7, and the other on examples of the digit 9. We then apply both mixtures to a set of unseen examples of digits 7 and 9, and record the probability score of each unseen example under each mixture model. We plot the resulting data in Fig.4(c). Each axis represents the probability under a different class. The data show that the model probabilities provide a good metric for classification. Assigning each test example to the class model that outputs the highest probability results in an accuracy of 87% on 2000 unseen digits. Additional software experiments show that mixtures of Gaussians (σ=0.01) trained by the batch E-M algorithm provide an accuracy of 92.39% on this task.

Our test results show that the bump mixture model’s performance on several learning tasks is comparable to standard mixtures of Gaussians trained by E-M. These experiments give further evidence that floating-gate circuits can be used to build effective learning systems even though their learning rules derive from silicon physics instead of statistical methods. The bump mixture model also represents a basic building block that we can use to build more complex silicon probability models over analog variables. This work can be extended in several ways. We can build distributions that have parameterized covariances in addition to means. In addition, we can build more complex, adaptive probability distributions in silicon by combining the bump mixture model with silicon probability models over discrete variables [5-7] and spike-based floating-gate learning circuits [4].
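The classification procedure just described (train one mixture per class and assign a test input to the class whose model gives it the higher probability) can be summarized in a few lines. This is a software sketch under the same equal-prior, equal-variance Gaussian assumptions as above; the means below are random placeholders, not the chip's learned values.

import numpy as np

def log_likelihood(x, mu, sigma2=0.01):
    # log P(x) under an equal-prior, equal-variance Gaussian mixture (eq. 4 in log form,
    # dropping the normalization constant, which is the same for both classes).
    log_p = -0.5 * np.sum((x - mu) ** 2, axis=1) / sigma2
    m = log_p.max()
    return m + np.log(np.exp(log_p - m).sum()) - np.log(len(mu))   # log-sum-exp minus log N

def classify(x, mu_7, mu_9):
    # Assign x to the class whose mixture assigns it the higher likelihood.
    return 7 if log_likelihood(x, mu_7) > log_likelihood(x, mu_9) else 9

rng = np.random.default_rng(1)
mu_7 = rng.normal(-0.1, 0.2, size=(8, 7))   # placeholder means for the "7" model
mu_9 = rng.normal(+0.1, 0.2, size=(8, 7))   # placeholder means for the "9" model
x = rng.normal(-0.1, 0.2, size=7)           # a test input in the seven-dimensional PCA space
print("predicted class:", classify(x, mu_7, mu_9))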
Acknowledgments

This work was supported by NSF under grants BES 9720353 and ECS 9733425, and Packard Foundation and Sloan Fellowships.

References
[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press, 1995.
[2] L. R. Rabiner,
4 0.49737924 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip
Author: R. Carmona, F. Jiménez-garrido, R. Dominguez-castro, S. Espejo, A. Rodriguez-vázquez
Abstract: A bio-inspired model for an analog programmable array processor (APAP), based on studies on the vertebrate retina, has permitted the realization of complex programmable spatio-temporal dynamics in VLSI. This model mimics the way in which images are processed in the visual pathway, rendering a feasible alternative for the implementation of early vision applications in standard technologies. A prototype chip has been designed and fabricated in a 0.5µm standard CMOS process. Computing power per area and power consumption are amongst the highest reported for a single chip. Design challenges, trade-offs and some experimental results are presented in this paper. 1
5 0.48378167 94 nips-2002-Fractional Belief Propagation
Author: Wim Wiegerinck, Tom Heskes
Abstract: We consider loopy belief propagation for approximate inference in probabilistic graphical models. A limitation of the standard algorithm is that clique marginals are computed as if there were no loops in the graph. To overcome this limitation, we introduce fractional belief propagation. Fractional belief propagation is formulated in terms of a family of approximate free energies, which includes the Bethe free energy and the naive mean-field free energy as special cases. Using the linear response correction of the clique marginals, the scale parameters can be tuned. Simulation results illustrate the potential merits of the approach.
6 0.45759624 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
7 0.3659963 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
8 0.33488235 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
9 0.32646713 152 nips-2002-Nash Propagation for Loopy Graphical Games
10 0.31559724 154 nips-2002-Neuromorphic Bisable VLSI Synapses with Spike-Timing-Dependent Plasticity
11 0.30226043 32 nips-2002-Approximate Inference and Protein-Folding
12 0.30202159 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model
13 0.2927416 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
14 0.28607464 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
15 0.28447753 205 nips-2002-Value-Directed Compression of POMDPs
16 0.27041346 27 nips-2002-An Impossibility Theorem for Clustering
17 0.26944184 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
18 0.26503718 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
19 0.26423532 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
20 0.2568121 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
topicId topicWeight
[(11, 0.02), (14, 0.02), (23, 0.01), (24, 0.37), (42, 0.035), (54, 0.081), (55, 0.022), (57, 0.044), (67, 0.017), (68, 0.07), (74, 0.069), (92, 0.034), (98, 0.102)]
simIndex simValue paperId paperTitle
same-paper 1 0.73902833 4 nips-2002-A Differential Semantics for Jointree Algorithms
Author: James D. Park, Adnan Darwiche
Abstract: A new approach to inference in belief networks has been recently proposed, which is based on an algebraic representation of belief networks using multi–linear functions. According to this approach, the key computational question is that of representing multi–linear functions compactly, since inference reduces to a simple process of ev aluating and differentiating such functions. W e show here that mainstream inference algorithms based on jointrees are a special case of this approach in a v ery precise sense. W e use this result to prov e new properties of jointree algorithms, and then discuss some of its practical and theoretical implications. 1
2 0.61221349 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex
Author: Christian K. Machens, Michael Wehr, Anthony M. Zador
Abstract: How do cortical neurons represent the acoustic environment? This question is often addressed by probing with simple stimuli such as clicks or tone pips. Such stimuli have the advantage of yielding easily interpreted answers, but have the disadvantage that they may fail to uncover complex or higher-order neuronal response properties. Here we adopt an alternative approach, probing neuronal responses with complex acoustic stimuli, including animal vocalizations and music. We have used in vivo whole cell methods in the rat auditory cortex to record subthreshold membrane potential fluctuations elicited by these stimuli. Whole cell recording reveals the total synaptic input to a neuron from all the other neurons in the circuit, instead of just its output—a sparse binary spike train—as in conventional single unit physiological recordings. Whole cell recording thus provides a much richer source of information about the neuron’s response. Many neurons responded robustly and reliably to the complex stimuli in our ensemble. Here we analyze the linear component—the spectrotemporal receptive field (STRF)—of the transformation from the sound (as represented by its time-varying spectrogram) to the neuron’s membrane potential. We find that the STRF has a rich dynamical structure, including excitatory regions positioned in general accord with the prediction of the simple tuning curve. We also find that in many cases, much of the neuron’s response, although deterministically related to the stimulus, cannot be predicted by the linear component, indicating the presence of as-yet-uncharacterized nonlinear response properties.
3 0.5180735 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
4 0.42970967 43 nips-2002-Binary Coding in Auditory Cortex
Author: Michael R. Deweese, Anthony M. Zador
Abstract: Cortical neurons have been reported to use both rate and temporal codes. Here we describe a novel mode in which each neuron generates exactly 0 or 1 action potentials, but not more, in response to a stimulus. We used cell-attached recording, which ensured single-unit isolation, to record responses in rat auditory cortex to brief tone pips. Surprisingly, the majority of neurons exhibited binary behavior with few multi-spike responses; several dramatic examples consisted of exactly one spike on 100% of trials, with no trial-to-trial variability in spike count. Many neurons were tuned to stimulus frequency. Since individual trials yielded at most one spike for most neurons, the information about stimulus frequency was encoded in the population, and would not have been accessible to later stages of processing that only had access to the activity of a single unit. These binary units allow a more efficient population code than is possible with conventional rate coding units, and are consistent with a model of cortical processing in which synchronous packets of spikes propagate stably from one neuronal population to the next. 1 Binary coding in auditory cortex We recorded responses of neurons in the auditory cortex of anesthetized rats to pure-tone pips of different frequencies [1, 2]. Each pip was presented repeatedly, allowing us to assess the variability of the neural response to multiple presentations of each stimulus. We first recorded multi-unit activity with conventional tungsten electrodes (Fig. 1a). The number of spikes in response to each pip fluctuated markedly from one trial to the next (Fig. 1e), as though governed by a random mechanism such as that generating the ticks of a Geiger counter. Highly variable responses such as these, which are at least as variable as a Poisson process, are the norm in the cortex [3-7], and have contributed to the widely held view that cortical spike trains are so noisy that only the average firing rate can be used to encode stimuli. Because we were recording the activity of an unknown number of neurons, we could not be sure whether the strong trial-to-trial fluctuations reflected the underlying variability of the single units. We therefore used an alternative technique, cell- a b Single-unit recording method 5mV Multi-unit 1sec Raw cellattached voltage 10 kHz c Single-unit . . . . .. .. ... . . .... . ... . Identified spikes Threshold e 28 kHz d Single-unit 80 120 160 200 Time (msec) N = 29 tones 3 2 1 Poisson N = 11 tones ry 40 4 na bi 38 kHz 0 Response variance/mean (spikes/trial) High-pass filtered 0 0 1 2 3 Mean response (spikes/trial) Figure 1: Multi-unit spiking activity was highly variable, but single units obeyed binomial statistics. a Multi-unit spike rasters from a conventional tungsten electrode recording showed high trial-to-trial variability in response to ten repetitions of the same 50 msec pure tone stimulus (bottom). Darker hash marks indicate spike times within the response period, which were used in the variability analysis. b Spikes recorded in cell-attached mode were easily identified from the raw voltage trace (top) by applying a high-pass filter (bottom) and thresholding (dark gray line). Spike times (black squares) were assigned to the peaks of suprathreshold segments. c Spike rasters from a cell-attached recording of single-unit responses to 25 repetitions of the same tone consisted of exactly one well-timed spike per trial (latency standard deviation = 1.0 msec), unlike the multi-unit responses (Fig. 1a). 
Under the Poisson assumption, this would have been highly unlikely (P ~ 10 -11). d The same neuron as in Fig. 1c responds with lower probability to repeated presentations of a different tone, but there are still no multi-spike responses. e We quantified response variability for each tone by dividing the variance in spike count by the mean spike count across all trials for that tone. Response variability for multi-unit tungsten recording (open triangles) was high for each of the 29 tones (out of 32) that elicited at least one spike on one trial. All but one point lie above one (horizontal gray line), which is the value produced by a Poisson process with any constant or time varying event rate. Single unit responses recorded in cell-attached mode were far less variable (filled circles). Ninety one percent (10/11) of the tones that elicited at least one spike from this neuron produced no multi-spike responses in 25 trials; the corresponding points fall on the diagonal line between (0,1) and (1,0), which provides a strict lower bound on the variability for any response set with a mean between 0 and 1. No point lies above one. attached recording with a patch pipette [8, 9], in order to ensure single unit isolation (Fig. 1b). This recording mode minimizes both of the main sources of error in spike detection: failure to detect a spike in the unit under observation (false negatives), and contamination by spikes from nearby neurons (false positives). It also differs from conventional extracellular recording methods in its selection bias: With cell- attached recording neurons are selected solely on the basis of the experimenter’s ability to form a seal, rather than on the basis of neuronal activity and responsiveness to stimuli as in conventional methods. Surprisingly, single unit responses were far more orderly than suggested by the multi-unit recordings; responses typically consisted of either 0 or 1 spikes per trial, and not more (Fig. 1c-e). In the most dramatic examples, each presentation of the same tone pip elicited exactly one spike (Fig. 1c). In most cases, however, some presentations failed to elicit a spike (Fig. 1d). Although low-variability responses have recently been observed in the cortex [10, 11] and elsewhere [12, 13], the binary behavior described here has not previously been reported for cortical neurons. a 1.4 N = 3055 response sets b 1.2 1 Poisson 28 kHz - 100 msec 0.8 0.6 0.4 0.2 0 0 ry na bi Response variance/mean (spikes/trial) The majority of the neurons (59%) in our study for which statistical significance could be assessed (at the p<0.001 significance level; see Fig. 2, caption) showed noisy binary behavior—“binary” because neurons produced either 0 or 1 spikes, and “noisy” because some stimuli elicited both single spikes and failures. In a substantial fraction of neurons, however, the responses showed more variability. We found no correlation between neuronal variability and cortical layer (inferred from the depth of the recording electrode), cortical area (inside vs. outside of area A1) or depth of anesthesia. Moreover, the binary mode of spiking was not due to the brevity (25 msec) of the stimuli; responses that were binary for short tones were comparably binary when longer (100 msec) tones were used (Fig. 2b). Not assessable Not significant Significant (p<0.001) 0.2 0.4 0.6 0.8 1 1.2 Mean response (spikes/trial) 28 kHz - 25 msec 1.4 0 40 80 120 160 Time (msec) 200 Figure 2: Half of the neuronal population exhibited binary firing behavior. 
a Of the 3055 sets of responses to 25 msec tones, 2588 (gray points) could not be assessed for significance at the p<0.001 level, 225 (open circles) were not significantly binary, and 242 were significantly binary (black points; see Identification methods for group statistics below). All points were jittered slightly so that overlying points could be seen in the figure. 2165 response sets contained no multi-spike responses; the corresponding points fell on the line from [0,1] to [1,0]. b The binary nature of single unit responses was insensitive to tone duration, even for frequencies that elicited the largest responses. Twenty additional spike rasters from the same neuron (and tone frequency) as in Fig. 1c contain no multi-spike responses whether in response to 100 msec tones (above) or 25 msec tones (below). Across the population, binary responses were as prevalent for 100 msec tones as for 25 msec tones (see Identification methods for group statistics). In many neurons, binary responses showed high temporal precision, with latencies sometimes exhibiting standard deviations as low as 1 msec (Fig. 3; see also Fig. 1c), comparable to previous observations in the auditory cortex [14], and only slightly more precise than in monkey visual area MT [5]. High temporal precision was positively correlated with high response probability (Fig. 3). a b N = (44 cells)x(32 tones) 14 N = 32 tones 12 30 Jitter (msec) Jitter (msec) 40 10 8 6 20 10 4 2 0 0 0 0.2 0.4 0.6 0.8 Mean response (spikes/trial) 1 0 0.4 0.8 1.2 1.6 Mean response (spikes/trial) 2 Figure 3: Trial-to-trial variability in latency of response to repeated presentations of the same tone decreased with increasing response probability. a Scatter plot of standard deviation of latency vs. mean response for 25 presentations each of 32 tones for a different neuron as in Figs. 1 and 2 (gray line is best linear fit). Rasters from 25 repeated presentations of a low response tone (upper left inset, which corresponds to left-most data point) display much more variable latencies than rasters from a high response tone (lower right inset; corresponds to right-most data point). b The negative correlation between latency variability and response size was present on average across the population of 44 neurons described in Identification methods for group statistics (linear fit, gray). The low trial-to-trial variability ruled out the possibility that the firing statistics could be accounted for by a simple rate-modulated Poisson process (Fig. 4a1,a2). In other systems, low variability has sometimes been modeled as a Poisson process followed by a post-spike refractory period [10, 12]. In our system, however, the range in latencies of evoked binary responses was often much greater than the refractory period, which could not have been longer than the 2 msec inter-spike intervals observed during epochs of spontaneous spiking, indicating that binary spiking did not result from any intrinsic property of the spike generating mechanism (Fig. 4a3). Moreover, a single stimulus-evoked spike could suppress subsequent spikes for as long as hundreds of milliseconds (e.g. Figs. 1d,4d), supporting the idea that binary spiking arises through a circuit-level, rather than a single-neuron, mechanism. Indeed, the fact that this suppression is observed even in the cortex of awake animals [15] suggests that binary spiking is not a special property of the anesthetized state. It seems surprising that binary spiking in the cortex has not previously been remarked upon. 
In the auditory cortex the explanation may be in part technical: Because firing rates in the auditory cortex tend to be low, multi-unit recording is often used to maximize the total amount of data collected. Moreover, our use of cell-attached recording minimizes the usual bias toward responsive or active neurons. Such explanations are not, however, likely to account for the failure to observe binary spiking in the visual cortex, where spike count statistics have been scrutinized more closely [3-7]. One possibility is that this reflects a fundamental difference between the auditory and visual systems. An alternative interpretation— a1 b Response probability 100 spikes/s 2 kHz Poisson simulation c 100 200 300 400 Time (msec) 500 20 Ratio of pool sizes a2 0 16 12 8 4 0 a3 Poisson with refractory period 0 40 80 120 160 200 Time (msec) d Response probability PSTH 0.2 0.4 0.6 0.8 1 Mean spike count per neuron 1 0.8 N = 32 tones 0.6 0.4 0.2 0 2.0 3.8 7.1 13.2 24.9 46.7 Tone frequency (kHz) Figure 4: a The lack of multi-spike responses elicited by the neuron shown in Fig. 3a were not due to an absolute refractory period since the range of latencies for many tones, like that shown here, was much greater than any reasonable estimate for the neuron’s refractory period. (a1) Experimentally recorded responses. (a2) Using the smoothed post stimulus time histogram (PSTH; bottom) from the set of responses in Fig. 4a, we generated rasters under the assumption of Poisson firing. In this representative example, four double-spike responses (arrows at left) were produced in 25 trials. (a3) We then generated rasters assuming that the neuron fired according to a Poisson process subject to a hard refractory period of 2 msec. Even with a refractory period, this representative example includes one triple- and three double-spike responses. The minimum interspike-interval during spontaneous firing events was less than two msec for five of our neurons, so 2 msec is a conservative upper bound for the refractory period. b. Spontaneous activity is reduced following high-probability responses. The PSTH (top; 0.25 msec bins) of the combined responses from the 25% (8/32) of tones that elicited the largest responses from the same neuron as in Figs. 3a and 4a illustrates a preclusion of spontaneous and evoked activity for over 200 msec following stimulation. The PSTHs from progressively less responsive groups of tones show progressively less preclusion following stimulation. c Fewer noisy binary neurons need to be pooled to achieve the same “signal-to-noise ratio” (SNR; see ref. [24]) as a collection of Poisson neurons. The ratio of the number of Poisson to binary neurons required to achieve the same SNR is plotted against the mean number of spikes elicited per neuron following stimulation; here we have defined the SNR to be the ratio of the mean spike count to the standard deviation of the spike count. d Spike probability tuning curve for the same neuron as in Figs. 1c-e and 2b fit to a Gaussian in tone frequency. and one that we favor—is that the difference rests not in the sensory modality, but instead in the difference between the stimuli used. In this view, the binary responses may not be limited to the auditory cortex; neurons in visual and other sensory cortices might exhibit similar responses to the appropriate stimuli. For example, the tone pips we used might be the auditory analog of a brief flash of light, rather than the oriented moving edges or gratings usually used to probe the primary visual cortex. 
Conversely, auditory stimuli analogous to edges or gratings [16, 17] may be more likely to elicit conventional, rate-modulated Poisson responses in the auditory cortex. Indeed, there may be a continuum between binary and Poisson modes. Thus, even in conventional rate-modulated responses, the first spike is often privileged in that it carries most of the information in the spike train [5, 14, 18]. The first spike may be particularly important as a means of rapidly signaling stimulus transients. Binary responses suggest a mode that complements conventional rate coding. In the simplest rate-coding model, a stimulus parameter (such as the frequency of a tone) governs only the rate at which a neuron generates spikes, but not the detailed positions of the spikes; the actual spike train itself is an instantiation of a random process (such as a Poisson process). By contrast, in the binomial model, the stimulus parameter (frequency) is encoded as the probability of firing (Fig. 4d). Binary coding has implications for cortical computation. In the rate coding model, stimulus encoding is “ergodic”: a stimulus parameter can be read out either by observing the activity of one neuron for a long time, or a population for a short time. By contrast, in the binary model the stimulus value can be decoded only by observing a neuronal population, so that there is no benefit to integrating over long time periods (cf. ref. [19]). One advantage of binary encoding is that it allows the population to signal quickly; the most compact message a neuron can send is one spike [20]. Binary coding is also more efficient in the context of population coding, as quantified by the signal-to-noise ratio (Fig. 4c). The precise organization of both spike number and time we have observed suggests that cortical activity consists, at least under some conditions, of packets of spikes synchronized across populations of neurons. Theoretical work [21-23] has shown how such packets can propagate stably from one population to the next, but only if neurons within each population fire at most one spike per packet; otherwise, the number of spikes per packet—and hence the width of each packet—grows at each propagation step. Interestingly, one prediction of stable propagation models is that spike probability should be related to timing precision, a prediction born out by our observations (Fig. 3). The role of these packets in computation remains an open question. 2 Identification methods for group statistics We recorded responses to 32 different 25 msec tones from each of 175 neurons from the auditory cortices of 16 Sprague-Dawley rats; each tone was repeated between 5 and 75 times (mean = 19). Thus our ensemble consisted of 32x175=5600 response sets, with between 5 and 75 samples in each set. Of these, 3055 response sets contained at least one spike on at least on trial. For each response set, we tested the hypothesis that the observed variability was significantly lower than expected from the null hypothesis of a Poisson process. The ability to assess significance depended on two parameters: the sample size (5-75) and the firing probability. Intuitively, the dependence on firing probability arises because at low firing rates most responses produce only trials with 0 or 1 spikes under both the Poisson and binary models; only at high firing rates do the two models make different predictions, since in that case the Poisson model includes many trials with 2 or even 3 spikes while the binary model generates only solitary spikes (see Fig. 4a1,a2). 
Using a stringent significance criterion of p<0.001, 467 response sets had a sufficient number of repeats to assess significance, given the observed firing probability. Of these, half (242/467=52%) were significantly less variable than expected by chance, five hundred-fold higher than the 467/1000=0.467 response sets expected, based on the 0.001 significance criterion, to yield a binary response set. Seventy-two neurons had at least one response set for which significance could be assessed, and of these, 49 neurons (49/72=68%) had at least one significantly sub-Poisson response set. Of this population of 49 neurons, five achieved low variability through repeatable bursty behavior (e.g., every spike count was either 0 or 3, but not 1 or 2) and were excluded from further analysis. The remaining 44 neurons formed the basis for the group statistics analyses shown in Figs. 2a and 3b. Nine of these neurons were subjected to an additional protocol consisting of at least 10 presentations each of 100 msec tones and 25 msec tones of all 32 frequencies. Of the 100 msec stimulation response sets, 44 were found to be significantly sub-Poisson at the p<0.05 level, in good agreement with the 43 found to be significant among the responses to 25 msec tones. 3 Bibliography 1. Kilgard, M.P. and M.M. Merzenich, Cortical map reorganization enabled by nucleus basalis activity. Science, 1998. 279(5357): p. 1714-8. 2. Sally, S.L. and J.B. Kelly, Organization of auditory cortex in the albino rat: sound frequency. J Neurophysiol, 1988. 59(5): p. 1627-38. 3. Softky, W.R. and C. Koch, The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J Neurosci, 1993. 13(1): p. 334-50. 4. Stevens, C.F. and A.M. Zador, Input synchrony and the irregular firing of cortical neurons. Nat Neurosci, 1998. 1(3): p. 210-7. 5. Buracas, G.T., A.M. Zador, M.R. DeWeese, and T.D. Albright, Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 1998. 20(5): p. 959-69. 6. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96. 7. Tolhurst, D.J., J.A. Movshon, and A.F. Dean, The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res, 1983. 23(8): p. 775-85. 8. Otmakhov, N., A.M. Shirke, and R. Malinow, Measuring the impact of probabilistic transmission on neuronal output. Neuron, 1993. 10(6): p. 1101-11. 9. Friedrich, R.W. and G. Laurent, Dynamic optimization of odor representations by slow temporal patterning of mitral cell activity. Science, 2001. 291(5505): p. 889-94. 10. Kara, P., P. Reinagel, and R.C. Reid, Low response variability in simultaneously recorded retinal, thalamic, and cortical neurons. Neuron, 2000. 27(3): p. 635-46. 11. Gur, M., A. Beylin, and D.M. Snodderly, Response variability of neurons in primary visual cortex (V1) of alert monkeys. J Neurosci, 1997. 17(8): p. 2914-20. 12. Berry, M.J., D.K. Warland, and M. Meister, The structure and precision of retinal spike trains. Proc Natl Acad Sci U S A, 1997. 94(10): p. 5411-6. 13. de Ruyter van Steveninck, R.R., G.D. Lewen, S.P. Strong, R. Koberle, and W. Bialek, Reproducibility and variability in neural spike trains. Science, 1997. 275(5307): p. 1805-8. 14. Heil, P., Auditory cortical onset responses revisited. I. First-spike timing. J Neurophysiol, 1997. 77(5): p. 2616-41. 15. Lu, T., L. Liang, and X. 
Wang, Temporal and rate representations of timevarying signals in the auditory cortex of awake primates. Nat Neurosci, 2001. 4(11): p. 1131-8. 16. Kowalski, N., D.A. Depireux, and S.A. Shamma, Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra. J Neurophysiol, 1996. 76(5): p. 350323. 17. deCharms, R.C., D.T. Blake, and M.M. Merzenich, Optimizing sound features for cortical neurons. Science, 1998. 280(5368): p. 1439-43. 18. Panzeri, S., R.S. Petersen, S.R. Schultz, M. Lebedev, and M.E. Diamond, The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 2001. 29(3): p. 769-77. 19. Britten, K.H., M.N. Shadlen, W.T. Newsome, and J.A. Movshon, The analysis of visual motion: a comparison of neuronal and psychophysical performance. J Neurosci, 1992. 12(12): p. 4745-65. 20. Delorme, A. and S.J. Thorpe, Face identification using one spike per neuron: resistance to image degradations. Neural Netw, 2001. 14(6-7): p. 795-803. 21. Diesmann, M., M.O. Gewaltig, and A. Aertsen, Stable propagation of synchronous spiking in cortical neural networks. Nature, 1999. 402(6761): p. 529-33. 22. Marsalek, P., C. Koch, and J. Maunsell, On the relationship between synaptic input and spike output jitter in individual neurons. Proc Natl Acad Sci U S A, 1997. 94(2): p. 735-40. 23. Kistler, W.M. and W. Gerstner, Stable propagation of activity pulses in populations of spiking neurons. Neural Comp., 2002. 14: p. 987-997. 24. Zohary, E., M.N. Shadlen, and W.T. Newsome, Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 1994. 370(6485): p. 140-3. 25. Abbott, L.F. and P. Dayan, The effect of correlated variability on the accuracy of a population code. Neural Comput, 1999. 11(1): p. 91-101.
5 0.40785614 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer
Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1
6 0.40089187 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
7 0.40088117 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
8 0.40048316 135 nips-2002-Learning with Multiple Labels
9 0.39809647 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
10 0.39801127 76 nips-2002-Dynamical Constraints on Computing with Spike Timing in the Cortex
11 0.39556745 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
12 0.39403722 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
13 0.39092642 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution
14 0.39014459 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
15 0.3893199 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule
16 0.38841918 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
17 0.38636297 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
18 0.38626584 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
19 0.38556686 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
20 0.38473129 148 nips-2002-Morton-Style Factorial Coding of Color in Primary Visual Cortex