nips nips2002 nips2002-154 knowledge-graph by maker-knowledge-mining

154 nips-2002-Neuromorphic Bistable VLSI Synapses with Spike-Timing-Dependent Plasticity


Source: pdf

Author: Giacomo Indiveri

Abstract: We present analog neuromorphic circuits for implementing bistable synapses with spike-timing-dependent plasticity (STDP) properties. In these types of synapses, the short-term dynamics of the synaptic efficacies are governed by the relative timing of the pre- and post-synaptic spikes, while on long time scales the efficacies tend asymptotically to either a potentiated state or to a depressed one. We fabricated a prototype VLSI chip containing a network of integrate and fire neurons interconnected via bistable STDP synapses. Test results from this chip demonstrate the synapse’s STDP learning properties, and its long-term bistable characteristics.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present analog neuromorphic circuits for implementing bistable synapses with spike-timing-dependent plasticity (STDP) properties. [sent-4, score-0.823]

2 In these types of synapses, the short-term dynamics of the synaptic efficacies are governed by the relative timing of the pre- and post-synaptic spikes, while on long time scales the efficacies tend asymptotically to either a potentiated state or to a depressed one. [sent-5, score-0.561]

3 We fabricated a prototype VLSI chip containing a network of integrate and fire neurons interconnected via bistable STDP synapses. [sent-6, score-0.543]

4 Test results from this chip demonstrate the synapse’s STDP learning properties, and its long-term bistable characteristics. [sent-7, score-0.269]

5 1 Introduction: Most artificial neural network algorithms based on Hebbian learning use correlations of mean rate signals to increase the synaptic efficacies between connected neurons. [sent-8, score-0.361]

6 To prevent uncontrolled growth of synaptic efficacies, these algorithms usually also incorporate weight normalization constraints, which are often not biophysically realistic. [sent-9, score-0.289]

7 Recently an alternative class of competitive Hebbian learning algorithms has been proposed based on a spike-timing-dependent plasticity (STDP) mechanism [1]. [sent-10, score-0.05]

8 It has been argued that the STDP mechanism can automatically, and in a biologically plausible way, balance the strengths of synaptic efficacies, thus preserving the benefits of both weight normalization and correlation based learning rules [16]. [sent-11, score-0.289]

9 In STDP the precise timing of the spikes generated by the neurons plays an important role. [sent-12, score-0.321]

10 If a pre-synaptic spike arrives at the synaptic terminal before a post-synaptic spike is emitted, within a critical time window, the synaptic efficacy is increased. [sent-13, score-0.803]

11 Conversely, if the post-synaptic spike is emitted shortly before the pre-synaptic one arrives, the synaptic efficacy is decreased. [sent-14, score-0.425]
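
As a behavioral aside, the rule described in the last two sentences can be sketched in a few lines of Python. This is a functional illustration only, not a model of the circuit; the amplitudes and time constants below are assumed for illustration and are not taken from the chip.

```python
import math

# Behavioral sketch of the STDP rule described above: a pre-before-post pair
# potentiates the synapse, a post-before-pre pair depresses it, both within a
# finite time window (modeled here as exponential). Amplitudes and time
# constants are assumed illustrative values, not chip parameters.
A_PLUS, A_MINUS = 0.05, 0.05        # assumed update amplitudes (arbitrary units)
TAU_PLUS, TAU_MINUS = 10e-3, 10e-3  # assumed time windows (seconds)

def stdp_update(t_pre, t_post):
    """Weight change produced by a single pre/post spike pair."""
    dt = t_post - t_pre
    if dt > 0:     # pre-synaptic spike arrives before the post-synaptic one
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    if dt < 0:     # post-synaptic spike is emitted before the pre-synaptic one
        return -A_MINUS * math.exp(dt / TAU_MINUS)
    return 0.0

print(stdp_update(0.000, 0.002))   # causal pair -> positive change
print(stdp_update(0.002, 0.000))   # anti-causal pair -> negative change
```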

12 While mean rate Hebbian learning algorithms are difficult to implement using analog circuits, spike-based learning rules map directly onto VLSI [4, 6, 7]. [sent-15, score-0.16]

13 In this paper we present compact analog circuits that, combined with neuromorphic integrate-and-fire (I&F) neurons and synaptic circuits with realistic dynamics [8, 12, 11], implement STDP learning on short time scales and tend asymptotically to one of two possible states on long time scales. [sent-16, score-1.301]

14 The circuits required to implement STDP are described in Section 2. [sent-17, score-0.245]

15 The circuits that implement bistability are described in Section 3. [sent-18, score-0.439]

16 The network of I&F neurons used to measure the properties of the bistable STDP synapse is described in Section 4. [sent-19, score-0.471]

17 Long-term storage of synaptic efficacies: The circuits that drive the synaptic efficacy to one of two possible states on long time scales were implemented to cope with the problem of long-term storage of analog values in CMOS technology. [sent-20, score-1.138]

18 Conventional VLSI capacitors, the devices typically used as memory elements, are not ideal, in that they slowly lose the charge they are supposed to store, due to leakage currents. [sent-21, score-0.116]

19 Several solutions have been proposed for long term storage of synaptic efficacies in analog VLSI neural networks. [sent-22, score-0.487]

20 One of the first suggestions was to use the same method used for dynamic RAM: to periodically refresh the stored value. [sent-23, score-0.059]

21 This, however, involves discretization of the analog value into N discrete levels, a method for comparing the measured voltage to the N levels, and a clocked circuit to periodically refresh the value on the capacitor. [sent-24, score-0.444]
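
For illustration, a software analogue of this refresh scheme might look as follows; the number of levels, the voltage range, and the leak rate are hypothetical values chosen only to make the sketch runnable, not properties of any actual circuit.

```python
# Sketch of the DRAM-style refresh idea: snap a slowly leaking analog value
# back to the nearest of N discrete levels at every refresh tick.
# N_LEVELS, V_MAX and the leak rate are hypothetical, not chip parameters.
N_LEVELS = 16
V_MAX = 3.3

def refresh(v):
    """Quantize v to the nearest of N_LEVELS levels in [0, V_MAX]."""
    step = V_MAX / (N_LEVELS - 1)
    return round(v / step) * step

v = 1.0
for _ in range(100):
    v -= 0.001          # leakage between refresh cycles
    v = refresh(v)      # periodic clocked refresh restores the stored level
print(v)                # the value is held at its quantized level
```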

22 An alternative solution is to use analog-to-digital converters (ADC), an off-chip RAM and digital-to-analog converters (DAC), but this approach requires, in addition to discretization of the value into N states, bulky ADC and DAC circuits. [sent-25, score-0.092]

23 A more recent suggestion is to use floating-gate devices [5]. [sent-26, score-0.047]

24 These devices can store very precise analog values for an indefinite amount of time using standard CMOS technology [13], but for spike-based learning rules they would require a control circuit (and thus large area) per synapse. [sent-27, score-0.388]

25 To implement dense arrays of neurons with large numbers of dendritic inputs the synaptic circuits should be as compact as possible. [sent-28, score-0.725]

26 Bistable synapses: An alternative approach that uses a very small amount of area per synapse is to use bistable synapses. [sent-29, score-0.39]

27 The assumption that on long time scales the synaptic efficacy can only assume two values is not too severe for networks of neurons with large numbers of synapses. [sent-31, score-0.564]

28 It has been argued that biological synapses may indeed be discrete on long time scales. [sent-32, score-0.177]

29 Also from a theoretical perspective it has been shown that the performance of associative networks is not necessarily degraded if the dynamic range of the synaptic efficacy is reduced even to the extreme (two stable states), provided that the transitions between stable states are stochastic [2]. [sent-34, score-0.315]
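
A minimal sketch of the idea referenced here, a two-state synapse with stochastic transitions, is given below; the transition probabilities are arbitrary illustrative numbers, not values taken from [2].

```python
import random

# Two-state (binary) synapse with stochastic transitions: on each candidate
# learning event the synapse switches state only with some small probability.
# P_UP and P_DOWN are arbitrary illustrative values.
random.seed(1)
P_UP, P_DOWN = 0.1, 0.1

def update_binary_synapse(state, potentiate):
    """state is 0 (depressed) or 1 (potentiated); potentiate indicates which
    transition the current activity pattern is pushing toward."""
    if potentiate and state == 0 and random.random() < P_UP:
        return 1
    if not potentiate and state == 1 and random.random() < P_DOWN:
        return 0
    return state

s = 0
for _ in range(50):
    s = update_binary_synapse(s, potentiate=True)
print(s)   # most likely 1 after many potentiating events
```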

30 More recently, Bofill and Murray proposed circuits for implementing STDP within a framework of pulse-based neural network circuits [4]. [sent-37, score-0.482]

31 However, besides lacking the long-term bistability properties, their synaptic circuits require digital control signals that cannot be easily generated within the framework of neuromorphic networks of I&F neurons [8, 12]. [sent-38, score-0.995]

32 Figure 1: Synaptic efficacy STDP circuit. [sent-39, score-0.106]

33 2 The STDP circuits: The circuit required to implement STDP in a network of I&F neurons is shown in Fig. [sent-40, score-0.61]

34 This circuit increases or decreases the analog voltage Vw0 , depending on the relative timing of the pulses pre and /post. [sent-42, score-0.676]

35 The voltage Vw0 is then used to set the strength of synaptic circuits with realistic dynamics, of the type described in [11]. [sent-43, score-0.6]

36 The pre- and post-synaptic pulses pre and /post are generated by compact, low-power I&F neurons, of the type described in [9]. [sent-44, score-0.197]
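
For readers unfamiliar with I&F neurons, a minimal software model is sketched below; it is not the silicon circuit of [9], and all constants are assumed illustrative values.

```python
# Minimal leaky integrate-and-fire (I&F) neuron model: the membrane voltage
# integrates an input drive, leaks toward rest, and emits a spike (and is
# reset) when it crosses threshold. Constants are illustrative only.
V_REST, V_THRESH, V_RESET = 0.0, 1.0, 0.0
TAU_M = 20e-3        # membrane time constant (s), assumed
DT = 1e-4            # integration time step (s)

def simulate_lif(i_in, n_steps):
    """Integrate a constant input drive; return the list of spike times."""
    v, spikes = V_REST, []
    for k in range(n_steps):
        v += DT * (-(v - V_REST) / TAU_M + i_in)   # leaky integration
        if v >= V_THRESH:                          # threshold crossing
            spikes.append(k * DT)
            v = V_RESET                            # reset after the spike
    return spikes

print(simulate_lif(i_in=60.0, n_steps=2000))       # regular firing
```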

37 1 is fully symmetric: upon the arrival of a pre-synaptic pulse pre a waveform Vpot (t) (for potentiating Vw0 ) is generated. [sent-46, score-0.133]

38 The pre- and post-synaptic pulses are also used to switch on two gates (M8 and M5), which allow the currents Idep and Ipot to flow, as long as the pulses are high, either increasing or decreasing the weight. [sent-49, score-0.278]

39 The bias voltages Vp on transistor M6 and Vd on M7 set an upper bound for the maximum amount of current that can be injected into or removed from the capacitor Cw. [sent-50, score-0.07]

40 The change in synaptic efficacy is then: ∆Vw0 = (Ipot(tpost)/Cp) ∆tspk if tpre < tpost, and ∆Vw0 = −(Idep(tpre)/Cd) ∆tspk if tpost < tpre (3), where ∆tspk is the pre- and post-synaptic spike width, Cp is the parasitic capacitance of node Vpot and Cd the one of node Vdep (not shown in Fig. [sent-52, score-0.882]
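
A numerical sketch of Eq. (3) is given below. It assumes that the potentiating and depressing current amplitudes decay linearly after the triggering spike, consistent with the linearly decreasing Vdep trace described later; the peak current, decay slope, capacitances and pulse width are assumed values, not extracted chip parameters.

```python
# Sketch of Eq. (3): the weight change is set by the current (Ipot or Idep)
# still flowing when the second spike of the pair arrives, integrated over the
# spike pulse width. A linear decay of the current amplitude is assumed here;
# all numerical values are illustrative, not measured from the chip.
I_MAX = 1e-9        # peak current right after the triggering spike (A), assumed
DECAY = 1e-7        # current decay slope (A/s), assumed
C_P = C_D = 1e-12   # parasitic capacitances of the Vpot / Vdep nodes (F), assumed
DT_SPK = 1e-6       # pre-/post-synaptic spike (pulse) width (s), assumed

def current(dt):
    """Current amplitude a time dt after the triggering spike (linear decay)."""
    return max(I_MAX - DECAY * dt, 0.0)

def delta_vw0(t_pre, t_post):
    """Change in Vw0 for one spike pair, following Eq. (3)."""
    if t_pre < t_post:    # causal pair: potentiation
        return current(t_post - t_pre) * DT_SPK / C_P
    return -current(t_pre - t_post) * DT_SPK / C_D   # anti-causal: depression

print(delta_vw0(0.0, 2e-3))   # positive change in Vw0
print(delta_vw0(2e-3, 0.0))   # negative change in Vw0
```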

41 2(a) we plot experimental data showing how ∆Vw0 changes as a function of ∆t = tpre − tpost for different values of Vtd and Vtp . [sent-55, score-0.282]

42 Figure 2: Changes in synaptic efficacy, as a function of the difference between pre- and post-synaptic spike emission times ∆t = tpre − tpost. [sent-61, score-0.506]

43 Figure 3: Changes in Vw0, in response to a sequence of pre-synaptic spikes (top trace). [sent-67, score-0.162]

44 The middle trace shows how the signal Vdep , triggered by the post-synaptic neuron, decreases linearly with time. [sent-68, score-0.034]

45 The bottom trace shows the series of digital pulses pre, generated with every pre-synaptic spike. [sent-69, score-0.125]

46 As there are four independent control biases, it is possible to set the maximum amplitude and temporal window of influence independently for positive and negative changes in V w0 . [sent-71, score-0.035]

47 Unlike the biological experiments, in our VLSI setup it is possible to evaluate the effect of multiple pulses on the synaptic efficacy, for very long successive stimulation sessions, monitoring all the internal state variables and signals involved in the process. [sent-74, score-0.463]

48 3 we show the effect of multiple pre-synaptic spikes, succeeding a post-synaptic one, plotting a trace of the voltage Vw0, together with the (Figure 4: Bistability circuit.) [sent-76, score-0.141]

49 Depending on Vw0 − Vthr , the comparator drives Vw0 to either Vhigh or Vlow . [sent-77, score-0.127]

50 The rate at which the circuit drives Vw0 toward the asymptote is controlled by Vleak and imposed by transistors M 2 and M 4. [sent-78, score-0.4]

51 “internal” signal Vdep, generated by the post-synaptic spike, and the pulses pre, generated by the pre-synaptic neuron. [sent-79, score-0.091]

52 Note how the change in Vw0 is positive when the post-synaptic spike follows a pre-synaptic one, at t = 0. [sent-80, score-0.132]

53 5ms, and is negative when a series of pre-synaptic spikes follows the post-synaptic one. [sent-81, score-0.056]

54 The effect of subsequent pre pulses following the first post-/pre-synaptic pair is additive, and decreases with time as in Fig. [sent-82, score-0.197]

55 As expected, the anti-causal relationship between pre- and post-synaptic neurons has the net effect of decreasing the synaptic efficacy. [sent-84, score-0.454]

56 3 The bistability circuit: The bistability circuit, shown in Fig. [sent-85, score-0.547]

57 4, drives the voltage Vw0 toward one of two possible states: Vhigh (if Vw0 > Vthr ), or Vlow (if Vw0 < Vthr ). [sent-86, score-0.217]

58 The signal Vthr is a threshold voltage that can be set externally. [sent-87, score-0.151]

59 The circuit comprises a comparator and a mixed-mode analog/digital leakage circuit. [sent-88, score-0.228]

60 The comparator is a five transistor transconductance amplifier [13] that can be designed using minimum feature-size transistors. [sent-89, score-0.065]

61 The leakage circuit contains two gates that act as digital switches (M5, M6) and four transistors that set the two stable-state asymptotes Vhigh and Vlow and that, together with the bias voltage Vleak, determine the rate at which Vw0 approaches the asymptotes. [sent-90, score-0.426]

62 The bistability circuit drives Vw0 in two different ways, depending on how large the distance is between Vw0 itself and the asymptote. [sent-91, score-0.443]

63 If |Vw0 − Vas| > 4UT the bistability circuit drives Vw0 toward Vas linearly, where Vas represents either Vlow or Vhigh, depending on the sign of (Vw0 − Vthr): Vw0(t) = Vw0(0) + (Ileak/Cw) t if Vw0 > Vthr, and Vw0(t) = Vw0(0) − (Ileak/Cw) t if Vw0 < Vthr (4), where Cw is the capacitor of Fig. [sent-92, score-0.532]

64 1 and Ileak = I0 exp((κVleak − Vlow)/UT). As Vw0 gets close to the asymptote and |Vw0 − Vas| < 4UT, transistors M2 or M4 of Fig. [sent-93, score-0.131]
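
A discrete-time sketch of the bistability drive is given below. The linear regime follows Eq. (4); the behaviour within 4UT of the asymptote is approximated here by a simple exponential settling, which is an assumption, since the exact near-asymptote expression is not reproduced in this summary. All constants are illustrative.

```python
import math

# Sketch of the bistability drive on Vw0: far from the selected asymptote the
# weight capacitor is charged/discharged by a constant Ileak (linear drive,
# Eq. (4)); within ~4*UT of the asymptote an exponential settling is assumed.
# All constants are illustrative, not chip parameters.
V_HIGH, V_LOW, V_THR = 2.1, 0.0, 1.25   # asymptotes and threshold (V), assumed
U_T = 0.025                              # thermal voltage (V)
I_LEAK = 1e-12                           # leakage current set by Vleak (A), assumed
C_W = 1e-12                              # weight capacitor (F), assumed
TAU_NEAR = 0.05                          # assumed settling constant near Vas (s)

def bistable_step(v_w0, dt):
    """Advance Vw0 by one time step dt under the bistability drive alone."""
    v_as = V_HIGH if v_w0 > V_THR else V_LOW   # comparator selects the asymptote
    if abs(v_w0 - v_as) > 4 * U_T:             # linear regime of Eq. (4)
        slope = I_LEAK / C_W
        return v_w0 + slope * dt if v_as > v_w0 else v_w0 - slope * dt
    return v_as + (v_w0 - v_as) * math.exp(-dt / TAU_NEAR)  # assumed settling

v = 1.4                                   # starts above threshold
for _ in range(1000):
    v = bistable_step(v, 1e-3)
print(v)                                  # driven toward V_HIGH
```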

65 Transition of Vw0 from below threshold to above threshold (Vthr = 1. [sent-97, score-0.088]

66 25V and pre- and postsynaptic neurons stimulated in a way to increase Vw0 . [sent-99, score-0.239]

67 Figure 6: Network of leaky I&F neurons with bistable STDP excitatory synapses and inhibitory synapses. [sent-100, score-0.549]

68 The large circles symbolize I&F neurons, the small empty ones bistable STDP excitatory synapses, and the small bars non-plastic inhibitory synapses. [sent-101, score-0.233]

69 If the STDP short-term dynamics drive Vw0 above threshold we say that long-term potentiation (LTP) has been induced. [sent-104, score-0.223]

70 Conversely, if the short-term dynamics drive Vw0 below threshold, we say that long-term depression (LTD) has been induced. [sent-105, score-0.135]

71 5 we show how the synaptic efficacy Vw0 changes upon induction of LTP, while stimulating the pre- and post-synaptic neurons with uniformly distributed spike trains. [sent-107, score-0.588]

72 The asymptote Vlow was set to zero, and Vhigh to 2. [sent-108, score-0.069]

73 The pre- and post-synaptic neurons were injected with constant DC currents so as to increase Vw0, on average. [sent-110, score-0.238]

74 As shown, the two asymptotes Vlow and Vhigh act as two attractors, or stable equilibrium points, whereas the threshold voltage Vthr acts as an unstable equilibrium point. [sent-111, score-0.18]

75 If the synaptic efficacy is below threshold, the short-term dynamics have to fight against the long-term bistability effect in order to increase Vw0. [sent-112, score-0.583]

76 But as soon as Vw0 crosses the threshold, the bistability circuit switches, the effects of the short-term dynamics are reinforced by the asymptotic drive, and Vw0 is quickly driven toward Vhigh . [sent-113, score-0.49]
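
The interplay described in the last two sentences can be illustrated with a toy simulation in which random STDP "kicks", biased toward potentiation, compete with a bistable drift; all numbers are illustrative assumptions, not chip measurements.

```python
import random

# Toy illustration of LTP induction: random STDP "kicks" (positive on average,
# mimicking stimulation that increases Vw0 on average) compete with a bistable
# drift that pulls toward V_LOW below threshold and toward V_HIGH above it.
# All numbers are illustrative, not chip measurements.
random.seed(0)
V_HIGH, V_LOW, V_THR = 2.1, 0.0, 1.25
DRIFT = 0.005          # bistable drift per step toward the selected asymptote (V)
v = 0.8                # start in the depressed basin

for step in range(2000):
    v += random.gauss(0.008, 0.03)            # STDP kicks, positive on average
    v_as = V_HIGH if v > V_THR else V_LOW     # comparator picks the asymptote
    v += DRIFT if v_as > v else -DRIFT        # drift opposes LTP below threshold,
                                              # reinforces it above threshold
    v = min(max(v, V_LOW), V_HIGH)            # clamp to the two rails
print(v)   # typically ends near V_HIGH once the threshold has been crossed
```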

77 4 A network of integrate-and-fire neurons: The prototype chip that we used to test the bistable STDP circuits presented in this paper contains a symmetric network of leaky I&F neurons [9] (see Fig. [sent-114, score-0.979]

78 (a) Changes in V w0 for low synaptic efficacy values (Vhigh = 2. [sent-117, score-0.289]

79 (b) Changes in Vw0 for high synaptic efficacy values (Vhigh = 3. [sent-119, score-0.289]

80 6V ) and with bistability asymptotic drive (Vleak = 0. [sent-120, score-0.306]

81 2, 3, and 5 was obtained by injecting currents in the neurons labeled I1 and O1 and by measuring the signals from the excitatory synapse on O1. [sent-123, score-0.326]

82 7 we show the membrane potential of I1, O1, and the synaptic efficacy Vw0 of the corresponding synapse, in two different conditions. [sent-125, score-0.289]

83 Figure 7(a) shows the changes in Vw0 when both neurons are stimulated but no asymptotic drive is used. [sent-126, score-0.353]

84 As shown Vw0 strongly depends on the spike patterns of the pre- and post-synaptic neurons. [sent-127, score-0.099]

85 Figure 7(b) shows a scenario in which only neuron I1 is stimulated, but in which the weight Vw0 is close to its high asymptote (Vhigh = 3. [sent-128, score-0.069]

86 6V) and in which there is a long-term asymptotic drive (Vleak = 0. [sent-129, score-0.112]

87 Even though the synaptic weight always stays in its potentiated state, the firing rate of O1 is not as regular as that of its afferent neuron. [sent-131, score-0.321]

88 5 Discussion and future work: The STDP circuits presented here introduce a source of variability in the spike timing of the I&F neurons that could be exploited for creating VLSI networks of neurons with stochastic dynamics and for implementing spike-based stochastic learning mechanisms [2]. [sent-133, score-0.814]

89 of Poisson distributed spike trains) and on their precise spike-timing in order to induce LTP or LTD only to a small specific sub-set of the synapses stimulated. [sent-136, score-0.258]

90 In future experiments we will characterize the properties of the bistable STDP synapse in response to Poisson distributed spike trains, and measure transition probabilities as functions of input statistics and circuit parameters. [sent-137, score-0.523]

91 We presented compact neuromorphic circuits for implementing bistable STDP synapses in VLSI networks of I&F neurons, and showed data from a prototype chip. [sent-138, score-0.747]

92 We demonstrated how these types of synapses can either store their LTP or LTD state over the long term, or switch state depending on the precise timing of the pre- and post-synaptic spikes. [sent-139, score-0.282]

93 In the near future, we plan to use the simple network of I&F neurons of Fig. [sent-140, score-0.206]

94 6, present on the prototype chip, to analyze the effect of bistable STDP plasticity at a network level. [sent-141, score-0.338]

95 In the long term, we plan to design a larger chip with these circuits to implement a reconfigurable network of O(100) I&F neurons and O(1000) synapses, and to use it as a real-time tool for investigating the computational properties of competitive networks and selective attention models. [sent-142, score-0.757]

96 Some of the ideas that led to the design and implementation of the circuits presented were inspired by the Telluride Workshop on Neuromorphic Engineering (http://www. [sent-144, score-0.204]

97 Asymmetric Hebbian learning, spike timing and neural response variability. [sent-152, score-0.165]

98 A synaptic model of memory: Long term potentiation in the hippocampus. [sent-166, score-0.333]

99 Modeling selective attention using a neuromorphic analog VLSI device. [sent-211, score-0.205]

100 Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. [sent-257, score-0.322]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('stdp', 0.354), ('synaptic', 0.289), ('cacy', 0.272), ('vthr', 0.223), ('bistable', 0.206), ('vhigh', 0.204), ('circuits', 0.204), ('bistability', 0.194), ('vlow', 0.167), ('neurons', 0.165), ('circuit', 0.159), ('vleak', 0.149), ('vlsi', 0.142), ('vdep', 0.13), ('tpost', 0.129), ('synapses', 0.125), ('analog', 0.119), ('tpre', 0.118), ('cacies', 0.113), ('voltage', 0.107), ('pre', 0.106), ('spike', 0.099), ('cw', 0.096), ('idep', 0.093), ('ipot', 0.093), ('vpot', 0.093), ('pulses', 0.091), ('ef', 0.088), ('neuromorphic', 0.086), ('drive', 0.079), ('asymptote', 0.069), ('leakage', 0.069), ('timing', 0.066), ('hebbian', 0.066), ('comparator', 0.065), ('vas', 0.065), ('chip', 0.063), ('transistors', 0.062), ('drives', 0.062), ('synapse', 0.059), ('dynamics', 0.056), ('ileak', 0.056), ('tspk', 0.056), ('vtd', 0.056), ('spikes', 0.056), ('ltp', 0.055), ('vd', 0.052), ('long', 0.052), ('plasticity', 0.05), ('vdd', 0.048), ('vtp', 0.048), ('toward', 0.048), ('devices', 0.047), ('vp', 0.047), ('threshold', 0.044), ('indiveri', 0.044), ('potentiation', 0.044), ('ms', 0.044), ('currents', 0.044), ('liu', 0.044), ('capacitor', 0.041), ('stimulated', 0.041), ('ltd', 0.041), ('prototype', 0.041), ('network', 0.041), ('implement', 0.041), ('fusi', 0.037), ('giacomo', 0.037), ('emitted', 0.037), ('changes', 0.035), ('trace', 0.034), ('governed', 0.034), ('precise', 0.034), ('implementing', 0.033), ('postsynaptic', 0.033), ('asymptotic', 0.033), ('delbruck', 0.032), ('refresh', 0.032), ('zurich', 0.032), ('adc', 0.032), ('dac', 0.032), ('potentiated', 0.032), ('scales', 0.032), ('signals', 0.031), ('quadrant', 0.029), ('injected', 0.029), ('asymptotes', 0.029), ('converters', 0.029), ('store', 0.029), ('depending', 0.028), ('periodically', 0.027), ('leak', 0.027), ('pulse', 0.027), ('arrives', 0.027), ('storage', 0.027), ('excitatory', 0.027), ('integrate', 0.027), ('compact', 0.026), ('networks', 0.026), ('leaky', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 154 nips-2002-Neuromorphic Bistable VLSI Synapses with Spike-Timing-Dependent Plasticity

Author: Giacomo Indiveri

Abstract: We present analog neuromorphic circuits for implementing bistable synapses with spike-timing-dependent plasticity (STDP) properties. In these types of synapses, the short-term dynamics of the synaptic efficacies are governed by the relative timing of the pre- and post-synaptic spikes, while on long time scales the efficacies tend asymptotically to either a potentiated state or to a depressed one. We fabricated a prototype VLSI chip containing a network of integrate and fire neurons interconnected via bistable STDP synapses. Test results from this chip demonstrate the synapse’s STDP learning properties, and its long-term bistable characteristics.

2 0.3222813 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain

Author: R. J. Vogelstein, Francesco Tenore, Ralf Philipp, Miriam S. Adlerstein, David H. Goldberg, Gert Cauwenberghs

Abstract: Address-event representation (AER), originally proposed as a means to communicate sparse neural events between neuromorphic chips, has proven efficient in implementing large-scale networks with arbitrary, configurable synaptic connectivity. In this work, we further extend the functionality of AER to implement arbitrary, configurable synaptic plasticity in the address domain. As proof of concept, we implement a biologically inspired form of spike timing-dependent plasticity (STDP) based on relative timing of events in an AER framework. Experimental results from an analog VLSI integrate-and-fire network demonstrate address domain learning in a task that requires neurons to group correlated inputs.

3 0.27036667 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics

Author: Shih-Chii Liu, Malte Boegershausen, Pascal Suter

Abstract: We describe a model of short-term synaptic depression that is derived from a silicon circuit implementation. The dynamics of this circuit model are similar to the dynamics of some present theoretical models of shortterm depression except that the recovery dynamics of the variable describing the depression is nonlinear and it also depends on the presynaptic frequency. The equations describing the steady-state and transient responses of this synaptic model fit the experimental results obtained from a fabricated silicon network consisting of leaky integrate-and-fire neurons and different types of synapses. We also show experimental data demonstrating the possible computational roles of depression. One possible role of a depressing synapse is that the input can quickly bring the neuron up to threshold when the membrane potential is close to the resting potential.

4 0.21911564 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model

Author: Luk Chong Yeung, Brian S. Blais, Leon N. Cooper, Harel Z. Shouval

Abstract: A unified, biophysically motivated Calcium-Dependent Learning model has been shown to account for various rate-based and spike time-dependent paradigms for inducing synaptic plasticity. Here, we investigate the properties of this model for a multi-synapse neuron that receives inputs with different spike-train statistics. In addition, we present a physiological form of metaplasticity, an activity-driven regulation mechanism, that is essential for the robustness of the model. A neuron thus implemented develops stable and selective receptive fields, given various input statistics 1

5 0.18144783 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule

Author: Michael Eisele, Kenneth D. Miller

Abstract: Cortical synaptic plasticity depends on the relative timing of pre- and postsynaptic spikes and also on the temporal pattern of presynaptic spikes and of postsynaptic spikes. We study the hypothesis that cortical synaptic plasticity does not associate individual spikes, but rather whole firing episodes, and depends only on when these episodes start and how long they last, but as little as possible on the timing of individual spikes. Here we present the mathematical background for such a study. Standard methods from hidden Markov models are used to define what “firing episodes” are. Estimating the probability of being in such an episode requires not only the knowledge of past spikes, but also of future spikes. We show how to construct a causal learning rule, which depends only on past spikes, but associates pre- and postsynaptic firing episodes as if it also knew future spikes. We also show that this learning rule agrees with some features of synaptic plasticity in superficial layers of rat visual cortex (Froemke and Dan, Nature 416:433, 2002).

6 0.16690192 76 nips-2002-Dynamical Constraints on Computing with Spike Timing in the Cortex

7 0.1658064 129 nips-2002-Learning in Spiking Neural Assemblies

8 0.16224514 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits

9 0.1575131 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip

10 0.14877781 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon

11 0.13807893 91 nips-2002-Field-Programmable Learning Arrays

12 0.12828897 171 nips-2002-Reconstructing Stimulus-Driven Neural Networks from Spike Times

13 0.11921906 71 nips-2002-Dopamine Induced Bistability Enhances Signal Processing in Spiny Neurons

14 0.11262815 200 nips-2002-Topographic Map Formation by Silicon Growth Cones

15 0.099106841 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model

16 0.097648889 43 nips-2002-Binary Coding in Auditory Cortex

17 0.090043411 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

18 0.083200425 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex

19 0.082378007 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution

20 0.075663336 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.157), (1, 0.308), (2, 0.014), (3, -0.166), (4, 0.074), (5, 0.342), (6, 0.198), (7, -0.015), (8, -0.015), (9, -0.057), (10, 0.097), (11, -0.036), (12, -0.047), (13, 0.023), (14, 0.15), (15, 0.0), (16, -0.045), (17, -0.088), (18, 0.078), (19, -0.032), (20, 0.098), (21, -0.038), (22, -0.04), (23, 0.056), (24, -0.015), (25, 0.025), (26, -0.001), (27, -0.038), (28, 0.045), (29, -0.05), (30, 0.033), (31, -0.007), (32, 0.003), (33, -0.016), (34, -0.048), (35, -0.04), (36, 0.063), (37, -0.018), (38, 0.047), (39, -0.047), (40, -0.002), (41, 0.043), (42, 0.0), (43, -0.028), (44, 0.036), (45, -0.102), (46, 0.008), (47, 0.007), (48, -0.092), (49, -0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97451663 154 nips-2002-Neuromorphic Bistable VLSI Synapses with Spike-Timing-Dependent Plasticity

Author: Giacomo Indiveri

Abstract: We present analog neuromorphic circuits for implementing bistable synapses with spike-timing-dependent plasticity (STDP) properties. In these types of synapses, the short-term dynamics of the synaptic efficacies are governed by the relative timing of the pre- and post-synaptic spikes, while on long time scales the efficacies tend asymptotically to either a potentiated state or to a depressed one. We fabricated a prototype VLSI chip containing a network of integrate and fire neurons interconnected via bistable STDP synapses. Test results from this chip demonstrate the synapse’s STDP learning properties, and its long-term bistable characteristics.

2 0.86244524 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics

Author: Shih-Chii Liu, Malte Boegershausen, Pascal Suter

Abstract: We describe a model of short-term synaptic depression that is derived from a silicon circuit implementation. The dynamics of this circuit model are similar to the dynamics of some present theoretical models of shortterm depression except that the recovery dynamics of the variable describing the depression is nonlinear and it also depends on the presynaptic frequency. The equations describing the steady-state and transient responses of this synaptic model fit the experimental results obtained from a fabricated silicon network consisting of leaky integrate-and-fire neurons and different types of synapses. We also show experimental data demonstrating the possible computational roles of depression. One possible role of a depressing synapse is that the input can quickly bring the neuron up to threshold when the membrane potential is close to the resting potential.

3 0.8355155 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain

Author: R. J. Vogelstein, Francesco Tenore, Ralf Philipp, Miriam S. Adlerstein, David H. Goldberg, Gert Cauwenberghs

Abstract: Address-event representation (AER), originally proposed as a means to communicate sparse neural events between neuromorphic chips, has proven efficient in implementing large-scale networks with arbitrary, configurable synaptic connectivity. In this work, we further extend the functionality of AER to implement arbitrary, configurable synaptic plasticity in the address domain. As proof of concept, we implement a biologically inspired form of spike timing-dependent plasticity (STDP) based on relative timing of events in an AER framework. Experimental results from an analog VLSI integrate-and-fire network demonstrate address domain learning in a task that requires neurons to group correlated inputs.

4 0.76155978 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model

Author: Luk Chong Yeung, Brian S. Blais, Leon N. Cooper, Harel Z. Shouval

Abstract: A unified, biophysically motivated Calcium-Dependent Learning model has been shown to account for various rate-based and spike time-dependent paradigms for inducing synaptic plasticity. Here, we investigate the properties of this model for a multi-synapse neuron that receives inputs with different spike-train statistics. In addition, we present a physiological form of metaplasticity, an activity-driven regulation mechanism, that is essential for the robustness of the model. A neuron thus implemented develops stable and selective receptive fields, given various input statistics 1

5 0.6506331 200 nips-2002-Topographic Map Formation by Silicon Growth Cones

Author: Brian Taba, Kwabena A. Boahen

Abstract: We describe a self-configuring neuromorphic chip that uses a model of activity-dependent axon remodeling to automatically wire topographic maps based solely on input correlations. Axons are guided by growth cones, which are modeled in analog VLSI for the first time. Growth cones migrate up neurotropin gradients, which are represented by charge diffusing in transistor channels. Virtual axons move by rerouting address-events. We refined an initially gross topographic projection by simulating retinal wave input.

6 0.58607405 91 nips-2002-Field-Programmable Learning Arrays

7 0.57949656 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip

8 0.55250889 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule

9 0.52539337 71 nips-2002-Dopamine Induced Bistability Enhances Signal Processing in Spiny Neurons

10 0.51494503 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits

11 0.50276494 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon

12 0.48770878 129 nips-2002-Learning in Spiking Neural Assemblies

13 0.43651512 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model

14 0.38547045 76 nips-2002-Dynamical Constraints on Computing with Spike Timing in the Cortex

15 0.32015854 171 nips-2002-Reconstructing Stimulus-Driven Neural Networks from Spike Times

16 0.30356708 4 nips-2002-A Differential Semantics for Jointree Algorithms

17 0.29748493 66 nips-2002-Developing Topography and Ocular Dominance Using Two aVLSI Vision Sensors and a Neurotrophic Model of Plasticity

18 0.2455357 43 nips-2002-Binary Coding in Auditory Cortex

19 0.23394163 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach

20 0.23001797 128 nips-2002-Learning a Forward Model of a Reflex


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.015), (3, 0.37), (23, 0.017), (42, 0.037), (54, 0.071), (55, 0.022), (67, 0.027), (68, 0.056), (74, 0.054), (83, 0.084), (92, 0.017), (95, 0.023), (98, 0.117)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84550321 154 nips-2002-Neuromorphic Bistable VLSI Synapses with Spike-Timing-Dependent Plasticity

Author: Giacomo Indiveri

Abstract: We present analog neuromorphic circuits for implementing bistable synapses with spike-timing-dependent plasticity (STDP) properties. In these types of synapses, the short-term dynamics of the synaptic efficacies are governed by the relative timing of the pre- and post-synaptic spikes, while on long time scales the efficacies tend asymptotically to either a potentiated state or to a depressed one. We fabricated a prototype VLSI chip containing a network of integrate and fire neurons interconnected via bistable STDP synapses. Test results from this chip demonstrate the synapse’s STDP learning properties, and its long-term bistable characteristics.

2 0.76845956 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

3 0.73314947 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes

Author: Anat Levin, Assaf Zomet, Yair Weiss

Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1

4 0.51079398 120 nips-2002-Kernel Design Using Boosting

Author: Koby Crammer, Joseph Keshet, Yoram Singer

Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → . An explicit way to describe K is via a mapping φ : X → H from X to an inner-products space H such that K(x, x ) = φ(x)·φ(x ). Given a kernel operator and a finite set of instances S = {xi , yi }m , the kernel i=1 matrix (a.k.a the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki,j = K(xi , xj ). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix.   The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denote Kj . The goal thereafter is to find a weighted combination of kernels ˆ K(x, x ) = j αj Kj (x, x ) that is similar, in a sense that will be defined shortly, to ˆ the target kernel, K ∼ K . Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices < K, K >F = m K(xi , xj )K (xi , xj ). 
Given this definition, they defined the i,j=1 kernel-similarity, or alignment, to be the above inner-product normalized by the norm of ˆ ˆ ˆ ˆ ˆ each kernel, A(S, K, K ) = < K, K >F / < K, K >F < K , K >F , where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “flattened” into a vector of dimension m2 . Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi ∈ {−1, +1} is the label of the instance xi , Cristianini et al. used the outer-product of y as the the target kernel, ˆ K = yy T . Therefore, an optimal alignment is achieved if K(xi , xj ) = yi yj . Clearly, if such a kernel is used for classifying instances from X , then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting, f (x) = sign(y i K(xi , x)) where (xi , yi ) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that the sign(K(x i , xj )) is equal to yi yj but its magnitude, |K(xi , xj )|, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the products of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi , yi ) | xi ∈ X , yi ∈ {−1, +1}, i = 1, . . . , m} . We are also given a set of unlabelled m ˜ ˜ examples S = {˜i }i=1 . If such a set is not provided we can simply use the labelled inx ˜ ˜ stances (without the labels themselves) as the set S. The set S is used for constructing the ˆ primitive kernels that are combined to constitute the learned kernel K. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned ˆ kernel K. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. 
[8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empir- ˜ ˜ Input: Labelled and unlabelled sets of examples: S = {(xi , yi )}m ; S = {˜i }m x i=1 i=1 Initialize: K ← 0 (all zeros matrix) For t = 1, 2, . . . , T : • Calculate distribution over pairs 1 ≤ i, j ≤ m: Dt (i, j) = exp(−yi yj K(xi , xj )) 1/(1 + exp(−yi yj K(xi , xj ))) ExpLoss LogLoss ˜ • Call base-kernel-learner with (Dt , S, S) and receive Kt • Calculate: + − St = {(i, j) | yi yj Kt (xi , xj ) > 0} ; St = {(i, j) | yi yj Kt (xi , xj ) < 0} + Wt = (i,j)∈S + Dt (i, j)|Kt (xi , xj )| ; Wt− = (i,j)∈S − Dt (i, j)|Kt (xi , xj )| t t 1 2 + Wt − Wt • Set: αt = ln ; K ← K + α t Kt . Return: kernel operator K : X × X →   Figure 1: The skeleton of the boosting algorithm for kernels. ical classification error. Given the prediction of a classifier f on an instance x and a label y ∈ {−1, +1} the ExpLoss and the LogLoss are defined as, ExpLoss(f (x), y) = exp(−yf (x)) LogLoss(f (x), y) = log(1 + exp(−yf (x))) . Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f (x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strongT hypothesis is of the form f (x) = t=1 αt ht (x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good of weights αt . Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1 , x2 )) = y1 y2 . Furthermore, we would like to make the magnitude of K(x, x ) to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x 1 , y1 ) and (x2 , y2 ), ExpLoss(K(x1 , x2 ), y1 y2 ) = exp(−y1 y2 K(x1 , x2 )) LogLoss(K(x1 , x2 ), y1 y2 ) = log(1 + exp(−y1 y2 K(x1 , x2 ))) . Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances. The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification algorithm. The pseudo code of the booster is given in Fig. 1. The pseudo-code is an adaptation the to problem of kernel design of the sequentialupdate algorithm from [8]. As with other boosting algorithm, the base-learner, which in our case is charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels. 
3 Learning Base Kernels The base kernel learner is provided with a training set S and a distribution D t over a pairs ˜ of instances from the training set. It is also provided with a set of unlabelled examples S. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar products over vectors in n . We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products Kw = wwT between a vector w ∈ n and itself.     Using this definition we get, Kw (xi , xj ) = (xi ·w)(xj ·w) . Input: A distribution Dt . Labelled and unlabelled sets: ˜ ˜ Therefore, the similarity beS = {(xi , yi )}m ; S = {˜i }m . x i=1 i=1 tween two instances xi and Compute : xj is high iff both xi and xj • Calculate: ˜ are similar (w.r.t the standard A ∈ m×m , Ai,r = xi · xr ˜ inner-product) to a third vecm×m B∈ , Bi,j = Dt (i, j)yi yj tor w. Analogously, if both ˜ ˜ K ∈ m×m , Kr,s = xr · xs ˜ ˜ xi and xj seem to be dissim• Find the generalized eigenvector v ∈ m for ilar to the vector w then they the problem AT BAv = λKv which attains are similar to each other. Dethe largest eigenvalue λ spite the restrictive form of • Set: w = ( r vr xr )/ ˜ ˜ r vr xr . the inner-products, this famt ily is still too rich for our setReturn: Kernel operator Kw = ww . ting and we further impose two restrictions on the inner Figure 2: The base kernel learning algorithm. products. First, we assume ˜ that w is restricted to a linear combination of vectors from S. Second, since scaling of the base kernels is performed by the boosted, we constrain the norm of w to be 1. The m ˜ resulting class of kernels is therefore, C = {Kw = wwT | w = r=1 βr xr , w = 1} . ˜ In the boosting process we need to choose a specific base-kernel K w from C. We therefore need to devise a notion of how good a candidate for base kernel is given a labelled set S and a distribution function Dt . In this work we use the simplest version suggested by Collins et al. This version can been viewed as a linear approximation on the loss function. We define the score of a kernel Kw w.r.t to the current distribution Dt to be,         Score(Kw ) = Dt (i, j)yi yj Kw (xi , xj ) . (1) i,j The higher the value of the score is, the better Kw fits the training data. Note that if Dt (i, j) = 1/m2 (as is D0 ) then Score(Kw ) is proportional to the alignment since w = 1. Under mild assumptions the score can also provide a lower bound of the loss function. To see that let c be the derivative of the loss function at margin zero, c = Loss (0) . If all the √ training examples xi ∈ S lies in a ball of radius c, we get that Loss(Kw (xi , xj ), yi yj ) ≥ 1 − cKw (xi , xj )yi yj ≥ 0, and therefore, i,j Dt (i, j)Loss(Kw (xi , xj ), yi yj ) ≥ 1 − c Dt (i, j)Kw (xi , xj )yi yj . i,j Using the explicit form of Kw in the Score function (Eq. (1)) we get, Score(Kw ) = i,j D(i, j)yi yj (w·xi )(w·xj ) . Further developing the above equation using the constraint that w = m ˜ r=1 βr xr we get, ˜ Score(Kw ) = βs βr r,s i,j D(i, j)yi yj (xi · xr ) (xj · xs ) . 
˜ ˜ To compute efficiently the base kernel score without an explicit enumeration we exploit the fact that if the initial distribution D0 is symmetric (D0 (i, j) = D0 (j, i)) then all the distributions generated along the run of the boosting process, D t , are also symmetric. We ˜ now define a matrix A ∈ m×m where Ai,r = xi · xr and a symmetric matrix B ∈ m×m ˜ with Bi,j = Dt (i, j)yi yj . Simple algebraic manipulations yield that the score function can be written as the following quadratic form, Score(β) = β T (AT BA)β , where β is m dimensional column vector. Note that since B is symmetric so is A T BA. Finding a ˜ good base kernel is equivalent to finding a vector β which maximizes this quadratic form 2 m ˜ under the norm equality constraint w = ˜ 2 = β T Kβ = 1 where Kr,s = r=1 βr xr xr · xs . Finding the maximum of Score(β) subject to the norm constraint is a well known ˜ ˜ maximization problem known as the generalized eigen vector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix AT BA is positive semidefinite. Assuming that the matrix K is invertible, the the vector β which maximizes the quadratic form is proportional the eigenvector of K −1 AT BA which is associated with the m ˜ generalized largest eigenvalue. Denoting this vector by v we get that w ∝ ˜ r=1 vr xr . m ˜ m ˜ Adding the norm constraint we get that w = ( r=1 vr xr )/ ˜ vr xr . The skeleton ˜ r=1 of the algorithm for finding a base kernels is given in Fig. 3. To conclude the description of the kernel learning algorithm we describe how to the extend the algorithm to be employed with general kernel functions.     Kernelizing the Kernel: As described above, we assumed that the standard scalarproduct constitutes the template for the class of base-kernels C. However, since the proce˜ dure for choosing a base kernel depends on S and S only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → , where κ(xi , xj ) = φ(xi ) · φ(xj ). Using a general kernel function κ we can not compute however the vector w explicitly. We therefore need to show that the norm of w, and evaluation Kw on any two examples can still be performed efficiently.   First note that given the vector v we can compute the norm of w as follows, T w 2 = vr xr ˜ vs xr ˜ r s = vr vs κ(˜r , xs ) . x ˜ r,s Next, given two vectors xi and xj the value of their inner-product is, Kw (xi , xj ) = vr vs κ(xi , xr )κ(xj , xs ) . ˜ ˜ r,s Therefore, although we cannot compute the vector w explicitly we can still compute its norm and evaluate any of the kernels from the class C. 4 Experiments Synthetic data: We generated binary-labelled data using as input space the vectors in 100 . The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, N (µ, ∆ ∆−1 ) with the following parameters,   µ=y 0.03 0.03 1 ∆= √ 2 1 −1 1 1 = 0.1 0 0 0.01 . That is, the label of each examples determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 8 0.2 6 50 50 100 100 150 150 200 200 4 2 0 0 −2 −4 −6 250 250 −0.2 −8 −0.2 0 0.2 −8 −6 −4 −2 0 2 4 6 8 300 20 40 60 80 100 120 140 160 180 200 300 20 40 60 80 100 120 140 160 180 Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). 
For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right). altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05. We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. On each experiment we first learned a linear classier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand-side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added a Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not expoed to the test set noise, the learned kernel reflects better the structure of the feature space which makes the learned kernel more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel. The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen direction (x-axis) and shows rather little deviation along the second eigen direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure since both the first and second features are mandatory for correctly classifying the data. The two plots on the right hand side of Fig. 3 use a gray level color-map to designate the value of the inner-product between each pairs instances, one from training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant the original inner-products are noisy and do not exhibit any structure. 
In contrast, the inner-products using the learned kernel yields in a 2 × 2 block matrix indicating that the inner-products between instances sharing the same label obtain large positive values. Similarly, for instances of opposite 200 1 12 Regular Kernel Learned Kernel 0.8 17 0.7 16 0.5 0.4 0.3 Test Error % 8 0.6 Regular Kernel Learned Kernel 18 10 Test Error % Averaged Cumulative Error % 19 Regular Kernel Learned Kernel 0.9 6 4 15 14 13 12 0.2 11 2 0.1 10 0 0 10 1 10 2 10 Round 3 10 4 10 0 2 4 6 Epochs 8 10 9 2 4 6 Epochs 8 10 Figure 4: The online training error (left), test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set. labels the inner products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on the average before converging to a hyperplane which perfectly separates the training data while using the Perceptron algorithm with learned kernel required a single example to reach a perfect separation on all 100 random training sets. USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7, 291 training examples and 2, 007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scopre of this short paper. We thus constraint on the binary error results for the 45 binary problem described above. For the original kernel we chose a RBF kernel with σ = 1 which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] and with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity which can underscore differences in the kernels. We ran the kernel learning al˜ gorithm with LogLoss and ExpLoss, using bith the training set and the test test as S. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss, when using the trainin set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits slower rate of convergence with the test data, we choose a a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF a kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built ˜ using boosting with the LogLoss and S was the training data. 
5 Discussion

In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution differs from that of the training data. For example, in speech recognition tasks the training data is often clean and well recorded, while the test data often passes through a noisy channel that distorts the signal. An interesting and challenging question that stems from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we note that alternative approaches to the kernel design problem have been devised in parallel and independently; see [13, 14] for further details.

Acknowledgements: Special thanks to Cyril Goutte and to John Shawe-Taylor for pointing out the connection to the generalized eigenvector problem. Thanks also to the anonymous reviewers for constructive comments.

References

[1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
[4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002.
[5] N. Cristianini, A. Elisseeff, J. Shawe-Taylor, and J. Kandola. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001.
[6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000.
[8] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253–285, 2002.
[9] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[10] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
[12] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. Müller, G. Rätsch, and A. J. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999.
[13] O. Bousquet and D. J. L. Herrmann. On the complexity of learning the kernel matrix. In Advances in Neural Information Processing Systems 15, 2002.
[14] C. S. Ong, A. J. Smola, and R. C. Williamson. Superkernels. In Advances in Neural Information Processing Systems 15, 2002.

5 0.4449392 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics

Author: Shih-Chii Liu, Malte Boegershausen, Pascal Suter

Abstract: We describe a model of short-term synaptic depression that is derived from a silicon circuit implementation. The dynamics of this circuit model are similar to the dynamics of some present theoretical models of short-term depression, except that the recovery dynamics of the variable describing the depression are nonlinear and also depend on the presynaptic frequency. The equations describing the steady-state and transient responses of this synaptic model fit the experimental results obtained from a fabricated silicon network consisting of leaky integrate-and-fire neurons and different types of synapses. We also show experimental data demonstrating the possible computational roles of depression. One possible role of a depressing synapse is that the input can quickly bring the neuron up to threshold when the membrane potential is close to the resting potential.
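For readers unfamiliar with short-term depression, the sketch below shows a generic depleting-resource model (Tsodyks–Markram style). It is not the circuit model described in this abstract: the paper's recovery dynamics are nonlinear and frequency-dependent, whereas this placeholder uses simple linear recovery, and all parameter names and values here are assumptions.

```python
import numpy as np

def depressing_synapse(spike_times, tau_rec=0.2, U=0.5, t_end=1.0, dt=1e-4):
    # Generic short-term depression: a resource variable x recovers toward 1
    # with time constant tau_rec (linear recovery, unlike the circuit model),
    # and each presynaptic spike consumes a fraction U of the available resource.
    times = np.arange(0.0, t_end, dt)
    x = np.ones_like(times)        # available synaptic resources
    drive = np.zeros_like(times)   # synaptic efficacy delivered at each spike
    spike_steps = set(np.round(np.asarray(spike_times) / dt).astype(int))
    for k in range(1, len(times)):
        x[k] = x[k - 1] + dt * (1.0 - x[k - 1]) / tau_rec   # recovery
        if k in spike_steps:
            drive[k] = U * x[k]    # transmitted efficacy shrinks with repeated spikes
            x[k] -= U * x[k]       # depression: resources consumed by the spike
    return times, x, drive

# Example: a short high-frequency burst followed by an isolated spike
# t, x, drive = depressing_synapse([0.10, 0.12, 0.14, 0.50])
```

The qualitative behaviour, successive spikes in a burst delivering progressively weaker drive, is the property the abstract refers to when discussing the computational roles of depression.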

6 0.43659887 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain

7 0.41541505 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip

8 0.40935391 168 nips-2002-Real-Time Monitoring of Complex Industrial Processes with Particle Filters

9 0.40127498 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model

10 0.39946848 200 nips-2002-Topographic Map Formation by Silicon Growth Cones

11 0.39260072 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule

12 0.3921681 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon

13 0.38966897 91 nips-2002-Field-Programmable Learning Arrays

14 0.38850191 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

15 0.38825962 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits

16 0.38361073 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

17 0.3834374 140 nips-2002-Margin Analysis of the LVQ Algorithm

18 0.38319337 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

19 0.37993056 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach

20 0.3781192 45 nips-2002-Boosted Dyadic Kernel Discriminants