nips nips2005 nips2005-165 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong
Abstract: In this paper, we aim at analyzing the characteristics of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation for the membrane potential density function with synaptic depression, and obtain formulas for analytically computing the instantaneous firing-rate response. Through a technical analysis, we arrive at several significant conclusions: the background inputs play an important role in information processing and act as a switch between temporal integration and coincidence detection; the role of synapses can be regarded as a spatio-temporal filter; the spatial distribution of synapses and the spatio-temporal relation of inputs are important for neural information processing; and the instantaneous input frequency can affect the response amplitude and phase delay.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we aim at analyzing the characteristics of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. [sent-13, score-0.316]
2 We have derived an evolution equation for the membrane potential density function with synaptic depression, and obtain formulas for analytically computing the instantaneous firing-rate response. [sent-14, score-0.437]
3 Through a technical analysis, we arrive at several significant conclusions: the background inputs play an important role in information processing and act as a switch between temporal integration and coincidence detection. [sent-15, score-0.249]
4 The role of synapses can be regarded as a spatio-temporal filter; the spatial distribution of synapses and the spatio-temporal relation of inputs are important for neural information processing. [sent-16, score-0.216]
5 The instantaneous input frequency can affect the response amplitude and phase delay. [sent-17, score-0.265]
6 1 Introduction Noise has an important impact on information processing of the nervous system in vivo. [sent-18, score-0.013]
7 It is of significance for us to study the stimulus-response behavior of neuronal populations, especially their responses to transients or time-dependent inputs in a noisy environment, viz. [sent-19, score-0.143]
8 given this stochastic environment, the neuronal output is typically characterized by the instantaneous firing rate. [sent-20, score-0.122]
9 Moreover, it has recently been revealed that synapses play a more active role in information processing [5-7]. [sent-22, score-0.093]
10 The synapses are highly dynamic and show use-dependent plasticity over a wide range of time scales. [sent-23, score-0.085]
11 Synaptic short-term depression is one of the most common expressions of plasticity. [sent-24, score-0.086]
12 At synapses with this type of modulation, pre-synaptic activity produces a decrease in synaptic efficacy. [sent-25, score-0.085]
13 The present work is concerned with investigating the collective dynamics of a neuronal population with synaptic depression and the instantaneous response to time-dependent inputs. [sent-26, score-0.444]
14 First, we deduce a one-dimensional Fokker-Planck (FP) equation by reducing the high-dimensional FP equations. [sent-27, score-0.052]
15 Then, we derive the stationary solution and the instantaneous firing-rate response from it. [sent-28, score-0.226]
16 The population density based on the integrate-and-fire neuronal model is low-dimensional and thus can be computed efficiently, although the approach could be generalized to other neuron models. [sent-32, score-0.161]
17 It is completely characterized by its membrane potential below threshold. [sent-33, score-0.054]
18 Details of the generation of an action potential above the threshold are ignored. [sent-34, score-0.018]
19 Synaptic and external inputs are summed until the membrane potential reaches a threshold, where a spike is emitted. [sent-35, score-0.083]
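A minimal sketch, in Python, of the integrate-and-fire mechanics just described (sum inputs, cross threshold, emit spike, reset); the time constant, threshold, and drive below are illustrative placeholders, not the paper's parameters.

```python
import numpy as np

def lif_spike_times(input_current, tau_m=15e-3, v_th=1.0, v_reset=0.0, dt=1e-4):
    """Euler integration of a leaky integrate-and-fire neuron.

    Inputs are summed into the membrane potential; when v crosses v_th a
    spike time is recorded and v is reset (the action potential itself,
    as in the model above, is not simulated).
    """
    v, spikes = 0.0, []
    for i, I in enumerate(input_current):
        v += dt * (-v + I) / tau_m      # leaky integration of the summed input
        if v >= v_th:                   # threshold crossing -> emit a spike
            spikes.append(i * dt)
            v = v_reset
    return spikes

# Example: constant suprathreshold drive produces regular firing
spikes = lif_spike_times(np.full(10000, 1.5))
```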
20 Jk (t) = A Dk (t), where A is a constant representing the absolute synaptic efficacy, corresponding to the maximal postsynaptic response obtained if all the synaptic resources are released at once, and Dk (t) evolves according to a complex dynamical rule. [sent-39, score-0.321]
21 We use the phenomenological model by Tsodyks & Markram [7] to simulate short-term synaptic depression:
$$\frac{dD_k(t)}{dt} = \frac{1 - D_k(t)}{\tau_d} - U_k D_k(t)\,\delta(t - t_{sp}^k), \qquad (2)$$
where Dk is a `depression' variable, Dk ∈ [0, 1], τd is the recovery time constant, and Uk is a constant determining the step decrease in Dk. [sent-40, score-0.211]
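A hedged sketch of equation (2): between presynaptic spikes the depression variable D recovers toward 1 with time constant τd, and at each spike it steps down by U·D. The spike times and constants are illustrative.

```python
import numpy as np

def depression_trace(spike_times, tau_d=1.0, U=0.5, dt=1e-3, T=5.0):
    """Integrate dD/dt = (1 - D)/tau_d - U * D * sum_i delta(t - t_i)."""
    t = np.arange(0.0, T, dt)
    D = np.ones_like(t)                 # fully recovered resources at t = 0
    spike_steps = {round(s / dt) for s in spike_times}
    for i in range(1, len(t)):
        D[i] = D[i-1] + dt * (1.0 - D[i-1]) / tau_d   # exponential recovery
        if i in spike_steps:
            D[i] -= U * D[i]            # step decrease at a presynaptic spike
    return t, D

t, D = depression_trace(spike_times=[0.5, 0.6, 0.7, 0.8, 3.0])
```

A rapid spike train drives D down toward its depressed steady state, which is the mechanism behind the `depression' response amplitude discussed later.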
22 The boundary conditions of equation (9) are $p_v(t, 1) = 0$, $\int_0^1 p_v(t, v)\,dv = 1$, and $r(t) = J_v(t, 0)$ (14). [sent-43, score-0.668]
23 2.2 Stationary Solution and Response Analysis. When the system is in the stationary state, $\partial p_v/\partial t = 0$, $dm_k/dt = 0$, $d\sigma_k/dt = 0$, and $p_v(t, v) = p^0(v)$, $r(t) = r_0$, $m_k(t) = m_k^0$, $\sigma_k(t) = \sigma_k^0$, $\nu_k(t) = \nu_k^0$. [sent-44, score-0.455]
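The paper's own stationary-rate expression, equation (15), did not survive extraction here. As a hedged stand-in, the sketch below computes the stationary firing rate of an LIF neuron driven by Gaussian white noise via the standard Siegert-type formula; the mean and noise arguments play the roles of $\tilde{K}_v^0$ and $Q_v^0$ only up to the paper's normalization, and the parameter values are illustrative.

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad

def stationary_rate(mu, sigma, tau_m=0.015, v_th=1.0, v_reset=0.0):
    """Standard Siegert formula for the stationary firing rate r0 of an LIF
    neuron with Gaussian white-noise input of mean mu and noise sigma:
    1/r0 = tau_m * sqrt(pi) * int_{(v_reset-mu)/sigma}^{(v_th-mu)/sigma}
           exp(u^2) (1 + erf(u)) du."""
    lo = (v_reset - mu) / sigma
    hi = (v_th - mu) / sigma
    integral, _ = quad(lambda u: np.exp(u**2) * (1.0 + erf(u)), lo, hi)
    return 1.0 / (tau_m * np.sqrt(np.pi) * integral)

print(stationary_rate(mu=0.8, sigma=0.3))   # subthreshold mean, noise-driven firing
```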
24 Then $m_k$ and $\sigma_k$ take the corresponding stationary forms, and time-dependent inputs are written perturbatively as $\nu_k(t) = \nu_k^0 (1 + \varepsilon_k \nu_k^1(t))$ (16). [sent-48, score-0.092]
25 For oscillatory inputs $\nu_k^1(t) = e^{j\omega t}$, the output has the same frequency and takes the forms $p^1(t, v) = p^1_\omega(v)\, e^{j\omega t}$, $r_1(t) = r^1_\omega\, e^{j\omega t}$. [sent-54, score-0.035]
26 For inputs that vary on a slow enough time scale, $p^1$ and $r_1$ satisfy the quasi-static (low-$\omega$) limit. [sent-59, score-0.062]
27 Equation (25) gives the normalization of the perturbative densities, $\int_0^1 p_n\,dv = 1$. In general $Q_v^1(t)$ is small compared with $\tilde{K}_v^1(t)$, so that $F_0 = f_0(t, v) \approx \tilde{K}_v^1(t)\, p^0$ (26). From (23), (25) and (26) we obtain equation (27), which expresses $r_1/\tilde{K}_v^1(t)$ through integrals of the Gaussian factors $\exp[-(v - \tilde{K}_v^0)^2/Q_v^0]$ against the stationary density $p^0$, together with a $j\omega$ term. [sent-62, score-0.19]
28 The remaining terms of equation (27) are double integrals of the same Gaussian kernels against the cumulative stationary density. In the limit of high-frequency inputs, i.e., [sent-63, score-0.035]
29 as $\omega \to \infty$, the response reduces to equation (30), with amplitude set by $r_0/Q_v^0$ together with the factors $\tilde{K}_v^1(t)$, $Q_v^1(t)$, and $(1 - \tilde{K}_v^0)$. 3 Discussion. In equation (15), $\tilde{K}_v^0$ reflects the average intensity of the background inputs and $Q_v^0$ reflects the intensity of the background noise. [sent-68, score-0.342]
30 In the low input-frequency regime, from (27), we see that increasing the input frequency $\omega$ [sent-70, score-0.123]
31 increases both the response amplitude and the phase delay. [sent-71, score-0.173]
32 However, in the high input-frequency limit, from (30), we see that increasing the input frequency $\omega$ [sent-72, score-0.123]
33 decreases both the response amplitude and the phase delay. [sent-73, score-0.173]
34 Moreover, from (27) and (30), we see that the stationary background firing rate $r_0$ plays an important part in the response to fluctuating inputs. [sent-74, score-0.302]
35 The instantaneous response $r_1$ increases monotonically with the background firing rate $r_0$; but $r_0$ is itself a function of the background noise $Q_v^0$. In equation (27), $r_1/\tilde{K}_v^1$ reflects the response amplitude, and in equation (30), $r_0/Q_v^0$ reflects the response amplitude. [sent-75, score-0.771]
36 Figure 1 (A) and (B) show how $r_1/\tilde{K}_v^1$ and $r_0/Q_v^0$ change with the variables $Q_v^0$ and $\tilde{K}_v^0$, respectively. [sent-76, score-0.02]
37 In the subthreshold regime ($\tilde{K}_v^0 < 1$), they increase monotonically with $Q_v^0$ when $\tilde{K}_v^0$ is held constant. [sent-77, score-0.064]
38 However, in the suprathreshold regime ($\tilde{K}_v^0 > 1$), they decrease monotonically with $Q_v^0$ when $\tilde{K}_v^0$ is held constant. [sent-78, score-0.07]
39 When the inputs are held fixed, if the instantaneous response amplitude increases, then we can regard the role of the neurons as more like coincidence detection than temporal integration. [sent-79, score-0.304]
40 From this viewpoint, it suggests that the background inputs play an important role in information processing and act as a switch between temporal integration and coincidence detection. [sent-80, score-0.249]
41 In equation (16), if the inputs take the oscillatory form $\nu_k^1(t) = e^{j\omega t}$, [sent-81, score-0.135]
42 then according to (19) the response takes an oscillatory form with a phase delay (equation (32)). Figure 1: Response amplitude versus $Q_v^0$ and $\tilde{K}_v^0$. [sent-82, score-0.041]
43 (A) $r_1/\tilde{K}_v^1$ (for equation (27)) changes with $Q_v^0$ and $\tilde{K}_v^0$. [sent-83, score-0.072]
44 (B) $r_0/Q_v^0$ (for equation (30)) changes with $Q_v^0$ and $\tilde{K}_v^0$. [sent-84, score-0.072]
45 where $\theta_m = \arctan\!\big(\omega \tau_d / (1 + \tau_d U_k \nu_k^0)\big)$ is the phase delay. [sent-88, score-0.025]
46 The minus sign shows that it is a `depression' response amplitude. [sent-90, score-0.077]
47 The phase delay increases with the input frequency $\omega$. [sent-91, score-0.106]
48 The `depression' response amplitude decreases with the input frequency $\omega$. [sent-93, score-0.187]
49 Equations (15)-(18), (12), (19), (27), (30) and (32) show us the point of view that the synapses can be regarded as a time-dependent external field which acts on the neuronal population through the time-dependent mean and variance. [sent-95, score-0.205]
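A small sketch of this "time-dependent mean and variance" view: under the standard diffusion approximation for Poisson inputs, the depressed weights $J_k(t) = A D_k(t)$ set both moments. The function and its arguments are illustrative, not the paper's notation.

```python
import numpy as np

def synaptic_mean_variance(J, rates):
    """Mean and variance of the summed synaptic input under the diffusion
    approximation, for Poisson inputs with instantaneous rates `rates` and
    effective weights J_k(t) = A * D_k(t)."""
    J, rates = np.asarray(J), np.asarray(rates)
    mean = np.sum(J * rates)        # m(t)      = sum_k J_k(t)   * nu_k(t)
    var = np.sum(J**2 * rates)      # sigma^2(t) = sum_k J_k(t)^2 * nu_k(t)
    return mean, var

# Example: two depressed input lines with different effective weights
m, s2 = synaptic_mean_variance(J=[0.5 * 0.8, 0.5 * 0.3], rates=[100.0, 100.0])
```

Because $D_k$ depends nonlinearly on the input rates, the mean computed for combined inputs generally differs from the sum of the means of the parts, which is the non-additivity ($m_k \neq m_{k1} + m_{k2}$) noted below.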
50 We assume the inputs are composed of two parts, viz. [sent-96, score-0.062]
51 However, in general $m_k \neq m_{k1} + m_{k2}$; this suggests that the spatial distribution of synapses and inputs is important for neural information processing. [sent-99, score-0.221]
52 In conclusion, the role of synapses can be regarded as a spatio-temporal filter. [sent-100, score-0.12]
53 Figure 2 shows the results of simulating a network of 2000 neurons together with the analytic solution of equations (15) and (27) under different conditions. [sent-101, score-0.152]
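A sketch of the kind of direct simulation summarized in Figure 2: a population of LIF neurons driven through depressing synapses by Poisson inputs, with the instantaneous population rate read off from spike counts per time bin. The values of $\tau_v$, $\tau_d$, $A$, and $U$ follow the Figure 2 caption below; the per-neuron synapse count interprets the caption's $N = 30$, and the background drive, input rate, and unit threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_pop, T, dt = 2000, 2.0, 1e-3               # population size, duration (s), step (s)
tau_v, tau_d, U, A = 0.015, 1.0, 0.5, 0.5    # caption values
n_syn, nu_in = 30, 5.0                       # synapses per neuron, input rate (Hz) -- assumed
mu0, v_th = 0.9, 1.0                         # background drive (assumed), threshold

steps = int(T / dt)
v = np.zeros(N_pop)                          # membrane potentials
D = np.ones((N_pop, n_syn))                  # one depression variable per synapse
rate = np.zeros(steps)

for i in range(steps):
    pre = rng.random((N_pop, n_syn)) < nu_in * dt        # Poisson input spikes
    v += dt * (mu0 - v) / tau_v + A * (D * pre).sum(axis=1)  # depressed kicks
    D += dt * (1.0 - D) / tau_d              # recovery toward 1 between spikes
    D[pre] *= (1.0 - U)                      # step decrease at each input spike
    fired = v >= v_th
    rate[i] = fired.sum() / (N_pop * dt)     # instantaneous population rate (Hz)
    v[fired] = 0.0                           # reset after a spike
```

Plotting `rate` against time would reproduce the thin line of a Figure-2-style comparison; the analytic curve would come from the formulas above.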
54 4 Summary In this paper, we deal with a model of integrate-and-fire neurons with synaptic current dynamics and synaptic depression. [sent-102, score-0.26]
55 In Section 2, first, using the membrane potential equation (1) and combining the synaptic depression equation (2), we derive the evolution equation (4) of the joint distribution density function. [sent-103, score-0.421]
56 Then, we give an approach to reduce the evolution equation of the high-dimensional function to one dimension, obtaining equation (9). [sent-104, score-0.164]
57 Finally, we give the stationary solution and the instantaneous firing-rate response to time-dependent random fluctuating inputs. [sent-105, score-0.265]
58 This paper investigates only the IF neuronal model without internal connections. [sent-107, score-0.051]
59 The approach can also be extended to other models, such as non-linear IF neuronal models of sparsely connected networks of excitatory and inhibitory neurons. [sent-108, score-0.051]
60 Figure 2: Simulation of a network of 2000 neurons (thin solid line) and the analytic solution (thick solid line) for equations (15) and (27), with $\tau_v = 15$ ms, $\tau_d = 1$ s, $A = 0.5$, $U_k = 0.5$, $N = 30$. [sent-109, score-0.152]
61 The horizontal axis is time (0-2 s), and the vertical axis is the firing rate. [sent-112, score-0.058]
wordName wordTfidf (topN-words)
[('kv', 0.668), ('qv', 0.407), ('dk', 0.312), ('pv', 0.301), ('dv', 0.156), ('uk', 0.131), ('mk', 0.092), ('jv', 0.092), ('pd', 0.088), ('depression', 0.086), ('synaptic', 0.086), ('se', 0.083), ('pk', 0.083), ('va', 0.079), ('response', 0.077), ('instantaneous', 0.071), ('synapses', 0.067), ('inputs', 0.062), ('dt', 0.062), ('ddk', 0.06), ('xidian', 0.06), ('equation', 0.052), ('neuronal', 0.051), ('background', 0.049), ('adk', 0.045), ('djv', 0.045), ('kdk', 0.045), ('tsp', 0.045), ('ej', 0.044), ('amplitude', 0.041), ('china', 0.04), ('uctuation', 0.039), ('population', 0.039), ('membrane', 0.036), ('frequency', 0.035), ('dynamics', 0.034), ('get', 0.034), ('tsodyks', 0.033), ('re', 0.032), ('stationary', 0.032), ('delay', 0.03), ('regime', 0.03), ('auk', 0.03), ('dmk', 0.03), ('fourcaud', 0.03), ('transients', 0.03), ('markram', 0.03), ('coincidence', 0.03), ('intelligent', 0.03), ('hz', 0.029), ('exp', 0.029), ('regarded', 0.027), ('neuron', 0.026), ('perturbative', 0.026), ('role', 0.026), ('evolution', 0.026), ('analytic', 0.026), ('phase', 0.025), ('jk', 0.024), ('dd', 0.024), ('erf', 0.024), ('brunel', 0.024), ('neurons', 0.022), ('monotonically', 0.022), ('know', 0.021), ('oscillatory', 0.021), ('powers', 0.021), ('external', 0.021), ('ects', 0.02), ('act', 0.02), ('firing', 0.02), ('changes', 0.02), ('neocortical', 0.019), ('fp', 0.019), ('potential', 0.018), ('play', 0.018), ('decrease', 0.018), ('cacy', 0.018), ('plasticity', 0.018), ('dn', 0.017), ('temporal', 0.016), ('input', 0.016), ('switch', 0.015), ('substituting', 0.015), ('fn', 0.015), ('expansion', 0.014), ('rate', 0.014), ('boundary', 0.014), ('tan', 0.013), ('processing', 0.013), ('institute', 0.013), ('intensity', 0.013), ('axis', 0.013), ('density', 0.013), ('nitions', 0.013), ('environment', 0.012), ('subthreshold', 0.012), ('pawelzik', 0.012), ('destexhe', 0.012), ('gong', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression
Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong
Abstract: In this paper, we aim at analyzing the characteristics of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation for the membrane potential density function with synaptic depression, and obtain formulas for analytically computing the instantaneous firing-rate response. Through a technical analysis, we arrive at several significant conclusions: the background inputs play an important role in information processing and act as a switch between temporal integration and coincidence detection; the role of synapses can be regarded as a spatio-temporal filter; the spatial distribution of synapses and the spatio-temporal relation of inputs are important for neural information processing; and the instantaneous input frequency can affect the response amplitude and phase delay.
2 0.10373981 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
Author: Frank Wood, Stefan Roth, Michael J. Black
Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1
3 0.079150319 161 nips-2005-Radial Basis Function Network for Multi-task Learning
Author: Xuejun Liao, Lawrence Carin
Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1
4 0.060623378 106 nips-2005-Large-scale biophysical parameter estimation in single neurons via constrained linear regression
Author: Misha Ahrens, Liam Paninski, Quentin J. Huys
Abstract: Our understanding of the input-output function of single cells has been substantially advanced by biophysically accurate multi-compartmental models. The large number of parameters needing hand tuning in these models has, however, somewhat hampered their applicability and interpretability. Here we propose a simple and well-founded method for automatic estimation of many of these key parameters: 1) the spatial distribution of channel densities on the cell’s membrane; 2) the spatiotemporal pattern of synaptic input; 3) the channels’ reversal potentials; 4) the intercompartmental conductances; and 5) the noise level in each compartment. We assume experimental access to: a) the spatiotemporal voltage signal in the dendrite (or some contiguous subpart thereof, e.g. via voltage sensitive imaging techniques), b) an approximate kinetic description of the channels and synapses present in each compartment, and c) the morphology of the part of the neuron under investigation. The key observation is that, given data a)-c), all of the parameters 1)-4) may be simultaneously inferred by a version of constrained linear regression; this regression, in turn, is efficiently solved using standard algorithms, without any “local minima” problems despite the large number of parameters and complex dynamics. The noise level 5) may also be estimated by standard techniques. We demonstrate the method’s accuracy on several model datasets, and describe techniques for quantifying the uncertainty in our estimates. 1
5 0.060175985 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine
Author: Yael Niv, Nathaniel D. Daw, Peter Dayan
Abstract: Reinforcement learning models have long promised to unify computational, psychological and neural accounts of appetitively conditioned behavior. However, the bulk of data on animal conditioning comes from free-operant experiments measuring how fast animals will work for reinforcement. Existing reinforcement learning (RL) models are silent about these tasks, because they lack any notion of vigor. They thus fail to address the simple observation that hungrier animals will work harder for food, as well as stranger facts such as their sometimes greater productivity even when working for irrelevant outcomes such as water. Here, we develop an RL framework for free-operant behavior, suggesting that subjects choose how vigorously to perform selected actions by optimally balancing the costs and benefits of quick responding. Motivational states such as hunger shift these factors, skewing the tradeoff. This accounts normatively for the effects of motivation on response rates, as well as many other classic findings. Finally, we suggest that tonic levels of dopamine may be involved in the computation linking motivational state to optimal responding, thereby explaining the complex vigor-related effects of pharmacological manipulation of dopamine. 1
6 0.059640102 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity
7 0.058481853 188 nips-2005-Temporally changing synaptic plasticity
8 0.058366388 181 nips-2005-Spiking Inputs to a Winner-take-all Network
9 0.058036726 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
10 0.047829047 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects
11 0.047122359 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
12 0.041761894 118 nips-2005-Learning in Silicon: Timing is Everything
13 0.039242294 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
14 0.034291573 22 nips-2005-An Analog Visual Pre-Processing Processor Employing Cyclic Line Access in Only-Nearest-Neighbor-Interconnects Architecture
15 0.033833787 134 nips-2005-Neural mechanisms of contrast dependent receptive field size in V1
16 0.033329625 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
17 0.032858919 98 nips-2005-Infinite latent feature models and the Indian buffet process
18 0.032130644 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
19 0.03210124 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity
20 0.031632185 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks
topicId topicWeight
[(0, 0.076), (1, -0.115), (2, -0.024), (3, -0.033), (4, 0.005), (5, -0.012), (6, -0.005), (7, -0.015), (8, 0.013), (9, -0.004), (10, 0.009), (11, -0.031), (12, 0.015), (13, -0.039), (14, -0.02), (15, 0.038), (16, 0.002), (17, -0.02), (18, 0.026), (19, -0.044), (20, 0.028), (21, 0.036), (22, 0.018), (23, 0.076), (24, -0.039), (25, 0.067), (26, -0.01), (27, 0.054), (28, 0.077), (29, -0.103), (30, -0.033), (31, -0.069), (32, 0.048), (33, -0.21), (34, 0.067), (35, 0.014), (36, 0.069), (37, 0.008), (38, -0.095), (39, 0.07), (40, -0.051), (41, -0.007), (42, 0.011), (43, -0.037), (44, -0.114), (45, -0.105), (46, 0.066), (47, -0.0), (48, 0.149), (49, -0.203)]
simIndex simValue paperId paperTitle
same-paper 1 0.97055995 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression
Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong
Abstract: In this paper, we aim at analyzing the characteristics of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation for the membrane potential density function with synaptic depression, and obtain formulas for analytically computing the instantaneous firing-rate response. Through a technical analysis, we arrive at several significant conclusions: the background inputs play an important role in information processing and act as a switch between temporal integration and coincidence detection; the role of synapses can be regarded as a spatio-temporal filter; the spatial distribution of synapses and the spatio-temporal relation of inputs are important for neural information processing; and the instantaneous input frequency can affect the response amplitude and phase delay.
2 0.54782909 161 nips-2005-Radial Basis Function Network for Multi-task Learning
Author: Xuejun Liao, Lawrence Carin
Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1
3 0.4580856 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
Author: Anna Levina, Michael Herrmann
Abstract: There is experimental evidence that cortical neurons show avalanche activity with the intensity of firing events being distributed as a power-law. We present a biologically plausible extension of a neural network which exhibits a power-law avalanche distribution for a wide range of connectivity parameters. 1
4 0.37284017 106 nips-2005-Large-scale biophysical parameter estimation in single neurons via constrained linear regression
Author: Misha Ahrens, Liam Paninski, Quentin J. Huys
Abstract: Our understanding of the input-output function of single cells has been substantially advanced by biophysically accurate multi-compartmental models. The large number of parameters needing hand tuning in these models has, however, somewhat hampered their applicability and interpretability. Here we propose a simple and well-founded method for automatic estimation of many of these key parameters: 1) the spatial distribution of channel densities on the cell’s membrane; 2) the spatiotemporal pattern of synaptic input; 3) the channels’ reversal potentials; 4) the intercompartmental conductances; and 5) the noise level in each compartment. We assume experimental access to: a) the spatiotemporal voltage signal in the dendrite (or some contiguous subpart thereof, e.g. via voltage sensitive imaging techniques), b) an approximate kinetic description of the channels and synapses present in each compartment, and c) the morphology of the part of the neuron under investigation. The key observation is that, given data a)-c), all of the parameters 1)-4) may be simultaneously inferred by a version of constrained linear regression; this regression, in turn, is efficiently solved using standard algorithms, without any “local minima” problems despite the large number of parameters and complex dynamics. The noise level 5) may also be estimated by standard techniques. We demonstrate the method’s accuracy on several model datasets, and describe techniques for quantifying the uncertainty in our estimates. 1
5 0.35904875 40 nips-2005-CMOL CrossNets: Possible Neuromorphic Nanoelectronic Circuits
Author: Jung Hoon Lee, Xiaolong Ma, Konstantin K. Likharev
Abstract: Hybrid “CMOL” integrated circuits, combining CMOS subsystem with nanowire crossbars and simple two-terminal nanodevices, promise to extend the exponential Moore-Law development of microelectronics into the sub-10-nm range. We are developing neuromorphic network (“CrossNet”) architectures for this future technology, in which neural cell bodies are implemented in CMOS, nanowires are used as axons and dendrites, while nanodevices (bistable latching switches) are used as elementary synapses. We have shown how CrossNets may be trained to perform pattern recovery and classification despite the limitations imposed by the CMOL hardware. Preliminary estimates have shown that CMOL CrossNets may be extremely dense (~10 7 cells per cm2) and operate approximately a million times faster than biological neural networks, at manageable power consumption. In Conclusion, we discuss in brief possible short-term and long-term applications of the emerging technology. 1 Introduction: CMOL Circuits Recent results [1, 2] indicate that the current VLSI paradigm based on CMOS technology can be hardly extended beyond the 10-nm frontier: in this range the sensitivity of parameters (most importantly, the gate voltage threshold) of silicon field-effect transistors to inevitable fabrication spreads grows exponentially. This sensitivity will probably send the fabrication facilities costs skyrocketing, and may lead to the end of Moore’s Law some time during the next decade. There is a growing consensus that the impending Moore’s Law crisis may be preempted by a radical paradigm shift from the purely CMOS technology to hybrid CMOS/nanodevice circuits, e.g., those of “CMOL” variety (Fig. 1). Such circuits (see, e.g., Ref. 3 for their recent review) would combine a level of advanced CMOS devices fabricated by the lithographic patterning, and two-layer nanowire crossbar formed, e.g., by nanoimprint, with nanowires connected by simple, similar, two-terminal nanodevices at each crosspoint. For such devices, molecular single-electron latching switches [4] are presently the leading candidates, in particular because they may be fabricated using the self-assembled monolayer (SAM) technique which already gave reproducible results for simpler molecular devices [5]. (a) nanodevices nanowiring and nanodevices interface pins upper wiring level of CMOS stack (b) βFCMOS Fnano α Fig. 1. CMOL circuit: (a) schematic side view, and (b) top-view zoom-in on several adjacent interface pins. (For clarity, only two adjacent nanodevices are shown.) In order to overcome the CMOS/nanodevice interface problems pertinent to earlier proposals of hybrid circuits [6], in CMOL the interface is provided by pins that are distributed all over the circuit area, on the top of the CMOS stack. This allows to use advanced techniques of nanowire patterning (like nanoimprint) which do not have nanoscale accuracy of layer alignment [3]. The vital feature of this interface is the tilt, by angle α = arcsin(Fnano/βFCMOS), of the nanowire crossbar relative to the square arrays of interface pins (Fig. 1b). Here Fnano is the nanowiring half-pitch, FCMOS is the half-pitch of the CMOS subsystem, and β is a dimensionless factor larger than 1 that depends on the CMOS cell complexity. Figure 1b shows that this tilt allows the CMOS subsystem to address each nanodevice even if Fnano << βFCMOS. 
By now, it has been shown that CMOL circuits can combine high performance with high defect tolerance (which is necessary for any circuit using nanodevices) for several digital applications. In particular, CMOL circuits with defect rates below a few percent would enable terabit-scale memories [7], while the performance of FPGA-like CMOL circuits may be several hundred times above that of overcome purely CMOL FPGA (implemented with the same FCMOS), at acceptable power dissipation and defect tolerance above 20% [8]. In addition, the very structure of CMOL circuits makes them uniquely suitable for the implementation of more complex, mixed-signal information processing systems, including ultradense and ultrafast neuromorphic networks. The objective of this paper is to describe in brief the current status of our work on the development of so-called Distributed Crossbar Networks (“CrossNets”) that could provide high performance despite the limitations imposed by CMOL hardware. A more detailed description of our earlier results may be found in Ref. 9. 2 Synapses The central device of CrossNet is a two-terminal latching switch [3, 4] (Fig. 2a) which is a combination of two single-electron devices, a transistor and a trap [3]. The device may be naturally implemented as a single organic molecule (Fig. 2b). Qualitatively, the device operates as follows: if voltage V = Vj – Vk applied between the external electrodes (in CMOL, nanowires) is low, the trap island has no net electric charge, and the single-electron transistor is closed. If voltage V approaches certain threshold value V+ > 0, an additional electron is inserted into the trap island, and its field lifts the Coulomb blockade of the single-electron transistor, thus connecting the nanowires. The switch state may be reset (e.g., wires disconnected) by applying a lower voltage V < V- < V+. Due to the random character of single-electron tunneling [2], the quantitative description of the switch is by necessity probabilistic: actually, V determines only the rates Γ↑↓ of device switching between its ON and OFF states. The rates, in turn, determine the dynamics of probability p to have the transistor opened (i.e. wires connected): dp/dt = Γ↑(1 - p) - Γ↓p. (1) The theory of single-electron tunneling [2] shows that, in a good approximation, the rates may be presented as Γ↑↓ = Γ0 exp{±e(V - S)/kBT} , (2) (a) single-electron trap tunnel junction Vj Vk single-electron transistor (b) O clipping group O N C R diimide acceptor groups O O C N R R O OPE wires O N R R N O O R O N R R = hexyl N O O R R O N C R R R Fig. 2. (a) Schematics and (b) possible molecular implementation of the two-terminal single-electron latching switch where Γ0 and S are constants depending on physical parameters of the latching switches. Note that despite the random character of switching, the strong nonlinearity of Eq. (2) allows to limit the degree of the device “fuzziness”. 3 CrossNets Figure 3a shows the generic structure of a CrossNet. CMOS-implemented somatic cells (within the Fire Rate model, just nonlinear differential amplifiers, see Fig. 3b,c) apply their output voltages to “axonic” nanowires. If the latching switch, working as an elementary synapse, on the crosspoint of an axonic wire with the perpendicular “dendritic” wire is open, some current flows into the latter wire, charging it. 
Since such currents are injected into each dendritic wire through several (many) open synapses, their addition provides a natural passive analog summation of signals from the corresponding somas, typical for all neural networks. Examining Fig. 3a, please note the open-circuit terminations of axonic and dendritic lines at the borders of the somatic cells; due to these terminations the somas do not communicate directly (but only via synapses). The network shown on Fig. 3 is evidently feedforward; recurrent networks are achieved in the evident way by doubling the number of synapses and nanowires per somatic cell (Fig. 3c). Moreover, using dual-rail (bipolar) representation of the signal, and hence doubling the number of nanowires and elementary synapses once again, one gets a CrossNet with somas coupled by compact 4-switch groups [9]. Using Eqs. (1) and (2), it is straightforward to show that that the average synaptic weight wjk of the group obeys the “quasi-Hebbian” rule: d w jk = −4Γ0 sinh (γ S ) sinh (γ V j ) sinh (γ Vk ) . dt (3) (a) - +soma j (b) RL + -- jk+ RL (c) jk- RL + -- -+soma k RL Fig. 3. (a) Generic structure of the simplest, (feedforward, non-Hebbian) CrossNet. Red lines show “axonic”, and blue lines “dendritic” nanowires. Gray squares are interfaces between nanowires and CMOS-based somas (b, c). Signs show the dendrite input polarities. Green circles denote molecular latching switches forming elementary synapses. Bold red and blue points are open-circuit terminations of the nanowires, that do not allow somas to interact in bypass of synapses In the simplest cases (e.g., quasi-Hopfield networks with finite connectivity), the tri-level synaptic weights of the generic CrossNets are quite satisfactory, leading to just a very modest (~30%) network capacity loss. However, some applications (in particular, pattern classification) may require a larger number of weight quantization levels L (e.g., L ≈ 30 for a 1% fidelity [9]). This may be achieved by using compact square arrays (e.g., 4×4) of latching switches (Fig. 4). Various species of CrossNets [9] differ also by the way the somatic cells are distributed around the synaptic field. Figure 5 shows feedforward versions of two CrossNet types most explored so far: the so-called FlossBar and InBar. The former network is more natural for the implementation of multilayered perceptrons (MLP), while the latter system is preferable for recurrent network implementations and also allows a simpler CMOS design of somatic cells. The most important advantage of CrossNets over the hardware neural networks suggested earlier is that these networks allow to achieve enormous density combined with large cell connectivity M >> 1 in quasi-2D electronic circuits. 4 CrossNet training CrossNet training faces several hardware-imposed challenges: (i) The synaptic weight contribution provided by the elementary latching switch is binary, so that for most applications the multi-switch synapses (Fig. 4) are necessary. (ii) The only way to adjust any particular synaptic weight is to turn ON or OFF the corresponding latching switch(es). This is only possible to do by applying certain voltage V = Vj – Vk between the two corresponding nanowires. At this procedure, other nanodevices attached to the same wires should not be disturbed. (iii) As stated above, synapse state switching is a statistical progress, so that the degree of its “fuzziness” should be carefully controlled. 
(a) Vj (b) V w – A/2 i=1 i=1 2 2 … … n n Vj V w+ A/2 i' = 1 RL 2 … i' = 1 n RS ±(V t –A/2) 2 … RS n ±(V t +A/2) Fig. 4. Composite synapse for providing L = 2n2+1 discrete levels of the weight in (a) operation and (b) weight adjustment modes. The dark-gray rectangles are resistive metallic strips at soma/nanowire interfaces (a) (b) Fig. 5. Two main CrossNet species: (a) FlossBar and (b) InBar, in the generic (feedforward, non-Hebbian, ternary-weight) case for the connectivity parameter M = 9. Only the nanowires and nanodevices coupling one cell (indicated with red dashed lines) to M post-synaptic cells (blue dashed lines) are shown; actually all the cells are similarly coupled We have shown that these challenges may be met using (at least) the following training methods [9]: (i) Synaptic weight import. This procedure is started with training of a homomorphic “precursor” artificial neural network with continuous synaptic weighs wjk, implemented in software, using one of established methods (e.g., error backpropagation). Then the synaptic weights wjk are transferred to the CrossNet, with some “clipping” (rounding) due to the binary nature of elementary synaptic weights. To accomplish the transfer, pairs of somatic cells are sequentially selected via CMOS-level wiring. Using the flexibility of CMOS circuitry, these cells are reconfigured to apply external voltages ±VW to the axonic and dendritic nanowires leading to a particular synapse, while all other nanowires are grounded. The voltage level V W is selected so that it does not switch the synapses attached to only one of the selected nanowires, while voltage 2VW applied to the synapse at the crosspoint of the selected wires is sufficient for its reliable switching. (In the composite synapses with quasi-continuous weights (Fig. 4), only a part of the corresponding switches is turned ON or OFF.) (ii) Error backpropagation. The synaptic weight import procedure is straightforward when wjk may be simply calculated, e.g., for the Hopfield-type networks. However, for very large CrossNets used, e.g., as pattern classifiers the precursor network training may take an impracticably long time. In this case the direct training of a CrossNet may become necessary. We have developed two methods of such training, both based on “Hebbian” synapses consisting of 4 elementary synapses (latching switches) whose average weight dynamics obeys Eq. (3). This quasi-Hebbian rule may be used to implement the backpropagation algorithm either using a periodic time-multiplexing [9] or in a continuous fashion, using the simultaneous propagation of signals and errors along the same dual-rail channels. As a result, presently we may state that CrossNets may be taught to perform virtually all major functions demonstrated earlier with the usual neural networks, including the corrupted pattern restoration in the recurrent quasi-Hopfield mode and pattern classification in the feedforward MLP mode [11]. 5 C r o s s N e t p e r f o r m an c e e s t i m a t e s The significance of this result may be only appreciated in the context of unparalleled physical parameters of CMOL CrossNets. The only fundamental limitation on the half-pitch Fnano (Fig. 1) comes from quantum-mechanical tunneling between nanowires. If the wires are separated by vacuum, the corresponding specific leakage conductance becomes uncomfortably large (~10-12 Ω-1m-1) only at Fnano = 1.5 nm; however, since realistic insulation materials (SiO2, etc.) 
provide somewhat lower tunnel barriers, let us use a more conservative value Fnano= 3 nm. Note that this value corresponds to 1012 elementary synapses per cm2, so that for 4M = 104 and n = 4 the areal density of neural cells is close to 2×107 cm-2. Both numbers are higher than those for the human cerebral cortex, despite the fact that the quasi-2D CMOL circuits have to compete with quasi-3D cerebral cortex. With the typical specific capacitance of 3×10-10 F/m = 0.3 aF/nm, this gives nanowire capacitance C0 ≈ 1 aF per working elementary synapse, because the corresponding segment has length 4Fnano. The CrossNet operation speed is determined mostly by the time constant τ0 of dendrite nanowire capacitance recharging through resistances of open nanodevices. Since both the relevant conductance and capacitance increase similarly with M and n, τ0 ≈ R0C0. The possibilities of reduction of R0, and hence τ0, are limited mostly by acceptable power dissipation per unit area, that is close to Vs2/(2Fnano)2R0. For room-temperature operation, the voltage scale V0 ≈ Vt should be of the order of at least 30 kBT/e ≈ 1 V to avoid thermally-induced errors [9]. With our number for Fnano, and a relatively high but acceptable power consumption of 100 W/cm2, we get R0 ≈ 1010Ω (which is a very realistic value for single-molecule single-electron devices like one shown in Fig. 3). With this number, τ0 is as small as ~10 ns. This means that the CrossNet speed may be approximately six orders of magnitude (!) higher than that of the biological neural networks. Even scaling R0 up by a factor of 100 to bring power consumption to a more comfortable level of 1 W/cm2, would still leave us at least a four-orders-of-magnitude speed advantage. 6 D i s c u s s i on: P o s s i bl e a p p l i c at i o n s These estimates make us believe that that CMOL CrossNet chips may revolutionize the neuromorphic network applications. Let us start with the example of relatively small (1-cm2-scale) chips used for recognition of a face in a crowd [11]. The most difficult feature of such recognition is the search for face location, i.e. optimal placement of a face on the image relative to the panel providing input for the processing network. The enormous density and speed of CMOL hardware gives a possibility to time-and-space multiplex this task (Fig. 6). In this approach, the full image (say, formed by CMOS photodetectors on the same chip) is divided into P rectangular panels of h×w pixels, corresponding to the expected size and approximate shape of a single face. A CMOS-implemented communication channel passes input data from each panel to the corresponding CMOL neural network, providing its shift in time, say using the TV scanning pattern (red line in Fig. 6). The standard methods of image classification require the network to have just a few hidden layers, so that the time interval Δt necessary for each mapping position may be so short that the total pattern recognition time T = hwΔt may be acceptable even for online face recognition. w h image network input Fig. 6. Scan mapping of the input image on CMOL CrossNet inputs. Red lines show the possible time sequence of image pixels sent to a certain input of the network processing image from the upper-left panel of the pattern Indeed, let us consider a 4-Megapixel image partitioned into 4K 32×32-pixel panels (h = w = 32). This panel will require an MLP net with several (say, four) layers with 1K cells each in order to compare the panel image with ~10 3 stored faces. 
With the feasible 4-nm nanowire half-pitch, and 65-level synapses (sufficient for better than 99% fidelity [9]), each interlayer crossbar would require chip area about (4K×64 nm)2 = 64×64 μm2, fitting 4×4K of them on a ~0.6 cm2 chip. (The CMOS somatic-layer and communication-system overheads are negligible.) With the acceptable power consumption of the order of 10 W/cm2, the input-to-output signal propagation in such a network will take only about 50 ns, so that Δt may be of the order of 100 ns and the total time T = hwΔt of processing one frame of the order of 100 microseconds, much shorter than the typical TV frame time of ~10 milliseconds. The remaining two-orders-of-magnitude time gap may be used, for example, for double-checking the results via stopping the scan mapping (Fig. 6) at the most promising position. (For this, a simple feedback from the recognition output to the mapping communication system is necessary.) It is instructive to compare the estimated CMOL chip speed with that of the implementation of a similar parallel network ensemble on a CMOS signal processor (say, also combined on the same chip with an array of CMOS photodetectors). Even assuming an extremely high performance of 30 billion additions/multiplications per second, we would need ~4×4K×1K×(4K)2/(30×109) ≈ 104 seconds ~ 3 hours per frame, evidently incompatible with the online image stream processing. Let us finish with a brief (and much more speculative) discussion of possible long-term prospects of CMOL CrossNets. Eventually, large-scale (~30×30 cm2) CMOL circuits may become available. According to the estimates given in the previous section, the integration scale of such a system (in terms of both neural cells and synapses) will be comparable with that of the human cerebral cortex. Equipped with a set of broadband sensor/actuator interfaces, such (necessarily, hierarchical) system may be capable, after a period of initial supervised training, of further self-training in the process of interaction with environment, with the speed several orders of magnitude higher than that of its biological prototypes. Needless to say, the successful development of such self-developing systems would have a major impact not only on all information technologies, but also on the society as a whole. Acknowledgments This work has been supported in part by the AFOSR, MARCO (via FENA Center), and NSF. Valuable contributions made by Simon Fölling, Özgür Türel and Ibrahim Muckra, as well as useful discussions with P. Adams, J. Barhen, D. Hammerstrom, V. Protopopescu, T. Sejnowski, and D. Strukov are gratefully acknowledged. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] Frank, D. J. et al. (2001) Device scaling limits of Si MOSFETs and their application dependencies. Proc. IEEE 89(3): 259-288. Likharev, K. K. (2003) Electronics below 10 nm, in J. Greer et al. (eds.), Nano and Giga Challenges in Microelectronics, pp. 27-68. Amsterdam: Elsevier. Likharev, K. K. and Strukov, D. B. (2005) CMOL: Devices, circuits, and architectures, in G. Cuniberti et al. (eds.), Introducing Molecular Electronics, Ch. 16. Springer, Berlin. Fölling, S., Türel, Ö. & Likharev, K. K. (2001) Single-electron latching switches as nanoscale synapses, in Proc. of the 2001 Int. Joint Conf. on Neural Networks, pp. 216-221. Mount Royal, NJ: Int. Neural Network Society. Wang, W. et al. (2003) Mechanism of electron conduction in self-assembled alkanethiol monolayer devices. Phys. Rev. B 68(3): 035416 1-8. Stan M. et al. 
(2003) Molecular electronics: From devices and interconnect to circuits and architecture, Proc. IEEE 91(11): 1940-1957. Strukov, D. B. & Likharev, K. K. (2005) Prospects for terabit-scale nanoelectronic memories. Nanotechnology 16(1): 137-148. Strukov, D. B. & Likharev, K. K. (2005) CMOL FPGA: A reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices. Nanotechnology 16(6): 888-900. Türel, Ö. et al. (2004) Neuromorphic architectures for nanoelectronic circuits”, Int. J. of Circuit Theory and Appl. 32(5): 277-302. See, e.g., Hertz J. et al. (1991) Introduction to the Theory of Neural Computation. Cambridge, MA: Perseus. Lee, J. H. & Likharev, K. K. (2005) CrossNets as pattern classifiers. Lecture Notes in Computer Sciences 3575: 434-441.
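As a hedged illustration of Eqs. (1)-(2) in the CrossNets entry above, the following sketch evolves the ON probability of a latching switch under a constant applied voltage; Γ0, S, and the thermal factor are illustrative constants, not device values.

```python
import numpy as np

def switch_probability(V, t, p0=0.0, Gamma0=1.0, S=0.5, gamma=5.0):
    """ON-state probability of a latching switch under constant bias V, from
    dp/dt = G_up (1 - p) - G_dn p with G_up/dn = Gamma0 exp(+/- gamma (V - S))
    (Eqs. (1)-(2) above); gamma stands in for e/(k_B T). Solved in closed form."""
    g_up = Gamma0 * np.exp(+gamma * (V - S))
    g_dn = Gamma0 * np.exp(-gamma * (V - S))
    p_inf = g_up / (g_up + g_dn)         # stationary ON probability
    return p_inf + (p0 - p_inf) * np.exp(-(g_up + g_dn) * np.asarray(t))

t = np.linspace(0.0, 2.0, 200)
p_on = switch_probability(V=1.0, t=t)    # bias above threshold: switch latches ON
```

The strong exponential dependence of the rates on V is what limits the "fuzziness" of the device despite its stochastic switching.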
6 0.34968212 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
7 0.34245908 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine
8 0.33831695 188 nips-2005-Temporally changing synaptic plasticity
9 0.33103815 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
10 0.31180018 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity
11 0.29783905 184 nips-2005-Structured Prediction via the Extragradient Method
12 0.24465926 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
13 0.24118514 203 nips-2005-Visual Encoding with Jittering Eyes
14 0.23521173 118 nips-2005-Learning in Silicon: Timing is Everything
15 0.23468046 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions
16 0.22909921 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
17 0.19905749 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
18 0.19408609 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity
19 0.19077471 25 nips-2005-An aVLSI Cricket Ear Model
20 0.18937972 121 nips-2005-Location-based activity recognition
topicId topicWeight
[(3, 0.017), (10, 0.561), (27, 0.019), (31, 0.023), (34, 0.029), (55, 0.013), (57, 0.056), (60, 0.017), (69, 0.022), (71, 0.012), (77, 0.011), (88, 0.038), (91, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.9736048 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression
Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong
Abstract: In this paper, we aim at analyzing the characteristics of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation for the membrane potential density function with synaptic depression, and obtain formulas for analytically computing the instantaneous firing-rate response. Through a technical analysis, we arrive at several significant conclusions: the background inputs play an important role in information processing and act as a switch between temporal integration and coincidence detection; the role of synapses can be regarded as a spatio-temporal filter; the spatial distribution of synapses and the spatio-temporal relation of inputs are important for neural information processing; and the instantaneous input frequency can affect the response amplitude and phase delay.
2 0.87954545 75 nips-2005-Fixing two weaknesses of the Spectral Method
Author: Kevin Lang
Abstract: We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. The first is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of flow-based rounding. The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see figure 4(b)) or as a basis for divide-and-conquer algorithms. These balance problems, which occur even though the spectral method’s quotient-style objective function does encourage balance, can be fixed with a stricter balance constraint that turns the spectral mathematical program into an SDP that can be solved for million-node graphs by a method of Burer and Monteiro. 1 Background Graph partitioning is the NP-hard problem of finding a small graph cut subject to the constraint that neither side of the resulting partitioning of the nodes is “too small”. We will be dealing with several versions: the graph bisection problem, which requires perfect 1:1 balance; the β-balanced cut problem (with β a fraction such as 1/3), which requires at least β : (1 − β) balance; and the quotient cut problem, which requires the small side to be large enough to “pay for” the edges in the cut. The quotient cut metric is c/min(a, b), where c is the cutsize and a and b are the sizes of the two sides of the cut. All of the well-known variants of the quotient cut metric (e.g. normalized cut [15]) have similar behavior with respect to the issues discussed in this paper. The spectral method for graph partitioning was introduced in 1973 by Fiedler and Donath & Hoffman [6]. In the mid-1980’s Alon & Milman [1] proved that spectral cuts can be at worst quadratically bad; in the mid 1990’s Guattery & Miller [10] proved that this analysis is tight by exhibiting a family of n-node graphs whose spectral bisections cut O(n^{2/3}) edges versus the optimal O(n^{1/3}) edges. On the other hand, Spielman & Teng [16] have proved stronger performance guarantees for the special case of spacelike graphs. The spectral method can be derived by relaxing a quadratic integer program which encodes the graph bisection problem (see section 3.1). The solution to this relaxation is the “Fiedler vector”, or second smallest eigenvector of the graph’s discrete Laplacian matrix, whose elements xi can be interpreted as an embedding of the graph on the line. [Figure 1: The spectral embedding hides the best solution from hyperplane rounding. (A) Graph with nearly balanced 8-cut; (B) Spectral Embedding; (C) Notional Flow-based Embedding.] To obtain a specific cut, one must apply a “rounding method” to this embedding. The hyperplane rounding method chooses one of the n − 1 cuts which separate the nodes whose xi values lie above and below some split value x̂. 2 Using flow to find cuts that are hidden from hyperplane rounding Theorists have long known that the spectral method cannot distinguish between deep cuts and long paths, and that this confusion can cause it to cut a graph in the wrong direction thereby producing the spectral method’s worst-case behavior [10]. In this section we will show by example that even when the spectral method is not fooled into cutting in the wrong direction, the resulting embedding can hide the best cuts from the hyperplane rounding method.
This is a possible explanation for the frequently made empirical observation (see e.g. [12]) that hyperplane roundings of spectral embeddings are noisy and therefore benefit from cleanup with a local search method such as Fiduccia-Matheyses [8]. Consider the graph in figure 1(a), which has a near-bisection cutting 8 edges. For this graph the spectral method produces the embedding shown in figure 1(b), and recommends that we make a vertical cut (across the horizontal dimension which is based on the Fiedler vector). This is correct in a generalized sense, but it is obvious that no hyperplane (or vertical line in this picture) can possibly extract the optimal 8-edge cut. Some insight into why spectral embeddings tend to have this problem can be obtained from the spectral method’s electrical interpretation. In this view the graph is represented by a resistor network [7]. Current flowing in this network causes voltage drops across the resistors, thus determining the nodes’ voltages and hence their positions. When current flows through a long series of resistors, it induces a progressive voltage drop. This is what causes the excessive length of the embeddings of the horizontal girder-like structures which are blocking all vertical hyperplane cuts in figure 1(b). If the embedding method were somehow not based on current, but rather on flow, which does not distinguish between a pipe and a series of pipes, then the long girders could retract into the two sides of the embedding, as suggested by figure 1(c), and the best cut would be revealed. Because theoretical flow-like embedding methods such as [14] are currently not practical, we point out that in cases like figure 1(b), where the spectral method has not chosen an incorrect direction for the cut, one can use an S-T max flow problem with the flow running in the recommended direction (horizontally for this embedding) to extract the good cut even though it is hidden from all hyperplanes. We currently use two different flow-based rounding methods. A method called MQI looks for quotient cuts, and is already described in [13]. Another method, that we shall call Midflow, looks for β-balanced cuts. The input to Midflow is a graph and an ordering of its nodes (obtained e.g. from a spectral embedding or from the projection of any embedding onto a line). We divide the graph’s nodes into 3 sets F, L, and U. The sets F and L respectively contain the first βn and last βn nodes in the ordering, and U contains the remaining U = n − 2βn nodes, which are “up for grabs”. [Figure 2: A typical example (see section 2.1) where flow-based rounding beats hyperplane rounding, even when the hyperplane cuts are improved with Fiduccia-Matheyses search. Note that for this spacelike graph, the best quotient cuts have reasonably good balance.] We set up an S-T max flow problem with one node for every graph node plus 2 new nodes for the source and sink. For each graph edge there are two arcs, one in each direction, with unit capacity.
Finally, the nodes in F are pinned to the source and the nodes in L are pinned to sink by infinite capacity arcs. This max-flow problem can be solved by a good implementation of the push-relabel algorithm (such as Goldberg and Cherkassky’s hi pr [4]) in time that empirically is nearly linear with a very good constant factor. Figure 6 shows that solving a MidFlow problem with hi pr can be 1000 times cheaper than finding a spectral embedding with ARPACK. When the goal is finding good β-balanced cuts, MidFlow rounding is strictly more powerful than hyperplane rounding; from a given node ordering hyperplane rounding chooses the best of U + 1 candidate cuts, while MidFlow rounding chooses the best of 2^U candidates, including all of those considered by hyperplane rounding. [Similarly, MQI rounding is strictly more powerful than hyperplane rounding for the task of finding good quotient cuts.] 2.1 A concrete example The plot in figure 2 shows a number of cuts in a 324,800 node nearly planar graph derived from a 700x464 pixel downward-looking view of some clouds over some mountains.1 The y-axis of the plot is quotient cut score; smaller values are better. We note in passing that the commonly used split point x̂ = 0 does not yield the best hyperplane cut. Our main point is that the two cuts generated by MidFlow rounding of the Fiedler vector (with β = 1/3 and β = 1/4) are nearly twice as good as the best hyperplane cut. Even after the best hyperplane cut has been improved by taking the best result of 100 runs of a version of Fiduccia-Matheyses local search, it is still much worse than the cuts obtained by flow-based rounding. 1 The graph’s edges are unweighted but are chosen by a randomized rule which is more likely to include an edge between two neighboring pixels if they have a similar grey value. Good cuts in the graph tend to run along discontinuities in the image, as one would expect.
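As a hedged illustration of the spectral pipeline discussed in this entry, the sketch below computes a Fiedler vector with scipy and applies plain hyperplane (sweep) rounding under the quotient-cut score. This is only the baseline the entry argues MidFlow rounding improves on; the flow-based step would replace the sweep with an S-T max-flow computation. The toy ring graph is illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

def fiedler_sweep_cut(adj):
    """Spectral embedding + hyperplane (sweep) rounding: order nodes by the
    Fiedler vector and return the best of the n-1 split cuts under the
    quotient-cut score cut / min(a, b)."""
    A = sp.csr_matrix(adj, dtype=float)
    L = laplacian(A)
    _, vecs = eigsh(L, k=2, which='SM')   # second-smallest eigenvector = Fiedler
    order = np.argsort(vecs[:, 1])
    n = A.shape[0]
    left = np.zeros(n, dtype=bool)
    best_score, best_cut = np.inf, None
    for i in range(n - 1):
        left[order[i]] = True
        cut = A[left][:, ~left].sum()     # weight of edges crossing the split
        score = cut / min(i + 1, n - i - 1)
        if score < best_score:
            best_score, best_cut = score, left.copy()
    return best_score, best_cut

# Example: a 6-node ring; the best sweep cut removes 2 edges
ring = sp.diags([1, 1], [1, -1], shape=(6, 6)).toarray()
ring[0, 5] = ring[5, 0] = 1
print(fiedler_sweep_cut(ring))
```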
3 0.86001074 156 nips-2005-Prediction and Change Detection
Author: Mark Steyvers, Scott Brown
Abstract: We measure the ability of human observers to predict the next datum in a sequence that is generated by a simple statistical process undergoing change at random points in time. Accurate performance in this task requires the identification of changepoints. We assess individual differences between observers both empirically, and using two kinds of models: a Bayesian approach for change detection and a family of cognitively plausible fast and frugal models. Some individuals detect too many changes and hence perform sub-optimally due to excess variability. Other individuals do not detect enough changes, and perform sub-optimally because they fail to notice short-term temporal trends.

1 Introduction

Decision-making often requires a rapid response to change. For example, stock analysts need to quickly detect changes in the market in order to adjust investment strategies. Coaches need to track changes in a player's performance in order to adjust strategy. When tracking changes, there are costs involved when either more or fewer changes are detected than actually occurred. For example, when using an overly conservative change detection criterion, a stock analyst might miss important short-term trends and interpret them as random fluctuations instead. On the other hand, a change may also be detected too readily. For example, in basketball, a player who makes a series of consecutive baskets is often identified as a "hot hand" player whose underlying ability is perceived to have suddenly increased [1,2]. This might lead to sub-optimal passing strategies, based on random fluctuations. We are interested in explaining individual differences in a sequential prediction task. Observers are shown stimuli generated from a simple statistical process with the task of predicting the next datum in the sequence. The latent parameters of the statistical process change discretely at random points in time. Performance in this task depends on the accurate detection of those changepoints, as well as inference about future outcomes based on the outcomes that followed the most recent inferred changepoint. There is much prior research in statistics on the problem of identifying changepoints [3,4,5]. In this paper, we adopt a Bayesian approach to the changepoint identification problem and develop a simple inference procedure to predict the next datum in a sequence. The Bayesian model serves as an ideal observer model and is useful to characterize the ways in which individuals deviate from optimality. The plan of the paper is as follows. We first introduce the sequential prediction task and discuss a Bayesian analysis of this prediction problem. We then discuss the results from a few individuals in this prediction task and show how the Bayesian approach can capture individual differences with a single "twitchiness" parameter that describes how readily changes are perceived in random sequences. We will show that some individuals are too twitchy: their performance is too variable because they base their predictions on too little of the recent data. Other individuals are not twitchy enough, and they fail to capture fast changes in the data. We also show how behavior can be explained with a set of fast and frugal models [6]. These are cognitively realistic models that operate under plausible computational constraints.

2 A prediction task with multiple changepoints

In the prediction task, stimuli are presented sequentially and the task is to predict the next stimulus in the sequence.
After t trials, the observer has been presented with stimuli y1, y2, …, yt and the task is to make a prediction about yt+1. After the prediction is made, the actual outcome yt+1 is revealed and the next trial proceeds to the prediction of yt+2. This procedure starts with y1 and is repeated for T trials. The observations yt are D-dimensional vectors with elements sampled from binomial distributions. The parameters of those distributions change discretely at random points in time such that the mean increases or decreases after a changepoint. This generates a sequence of observation vectors, y1, y2, …, yT, where each yt = {yt,1 … yt,D}. Each of the yt,d is sampled from a binomial distribution Bin(θt,d, K), so 0 ≤ yt,d ≤ K. The parameter vector θt = {θt,1 … θt,D} changes depending on the locations of the changepoints. At each time step, xt is a binary indicator for the occurrence of a changepoint at time t+1. The parameter α determines the probability of a change occurring in the sequence. The generative model is specified by the following algorithm (a code sketch follows after Table 1):

1. For d = 1..D, sample θ1,d from a Uniform(0,1) distribution.
2. For t = 2..T:
   (a) Sample xt-1 from a Bernoulli(α) distribution.
   (b) If xt-1 = 0, then θt = θt-1; else for d = 1..D sample θt,d from a Uniform(0,1) distribution.
   (c) For d = 1..D, sample yt,d from a Bin(θt,d, K) distribution.

Table 1 shows some data generated from the changepoint model with T=20, α=.1, and D=1. In the prediction task, y will be observed, but x and θ are not.

Table 1: Example data
t: 1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20
x: 0    0    0    1    0    0    1    0    0    0    0    0    1    0    1    0    0    0    0    0
θ: .68  .68  .68  .68  .48  .48  .48  .74  .74  .74  .74  .74  .74  .19  .19  .87  .87  .87  .87  .87
y: 9    7    8    7    4    4    4    9    8    3    6    7    8    2    1    8    9    9    8    8
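A minimal sketch of this generative process (numpy assumed; the variable names are ours, not the paper's):

```python
import numpy as np

def generate_sequence(T=20, D=1, K=10, alpha=0.1, rng=None):
    rng = rng or np.random.default_rng()
    theta = np.empty((T, D))
    x = np.zeros(T, dtype=int)       # x[t-1] = 1 signals a change at trial t+1
    y = np.empty((T, D), dtype=int)

    theta[0] = rng.uniform(0, 1, size=D)
    y[0] = rng.binomial(K, theta[0])
    for t in range(1, T):
        x[t - 1] = rng.binomial(1, alpha)     # Bernoulli(alpha)
        theta[t] = rng.uniform(0, 1, size=D) if x[t - 1] else theta[t - 1]
        y[t] = rng.binomial(K, theta[t])      # Bin(theta_{t,d}, K)
    return x, theta, y
```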
3 A Bayesian prediction model

In both our Bayesian and fast-and-frugal analyses, the prediction task is decomposed into two inference procedures. First, the changepoint locations are identified. This is followed by predictive inference for the next outcome based on the most recent changepoint locations. Several Bayesian approaches have been developed for changepoint problems involving single or multiple changepoints [3,5]. We apply a Markov Chain Monte Carlo (MCMC) analysis to approximate the joint posterior distribution over changepoint assignments x while integrating out θ. Gibbs sampling will be used to sample from this posterior marginal distribution. The samples can then be used to predict the next outcome in the sequence.

3.1 Inference for changepoint assignments

To apply Gibbs sampling, we evaluate the conditional probability of assigning a changepoint at time i, given all other changepoint assignments and the current α value. By integrating out θ, the conditional probability is

  P(x_i | x_−i, y, α) = ∫ P(x_i, θ, α | x_−i, y) dθ   (1)

where x_−i represents all changepoint assignments except x_i. This can be simplified by considering the location of the most recent changepoint preceding and following time i and the outcomes occurring between these locations. Let n^L_i be the number of time steps from the last changepoint up to and including the current time step i, such that x_{i−n^L_i} = 1 and x_{i−n^L_i+j} = 0 for 0 < j < n^L_i. Similarly, let n^R_i be the number of time steps that follow time step i up to the next changepoint, such that x_{i+n^R_i} = 1 and x_{i+n^R_i−j} = 0 for 0 < j < n^R_i. Let y^L_i = Σ_{i−n^L_i < k ≤ i} y_k and y^R_i = Σ_{i < k ≤ i+n^R_i} y_k. The update equation for the changepoint assignment can then be simplified to

  P(x_i = m | x_−i) ∝
    (1 − α) ∏_{j=1..D} Γ(1 + y^L_{i,j} + y^R_{i,j}) Γ(1 + K·n^L_i + K·n^R_i − y^L_{i,j} − y^R_{i,j}) / Γ(2 + K·n^L_i + K·n^R_i),   for m = 0
    α ∏_{j=1..D} Γ(1 + y^L_{i,j}) Γ(1 + K·n^L_i − y^L_{i,j}) Γ(1 + y^R_{i,j}) Γ(1 + K·n^R_i − y^R_{i,j}) / [Γ(2 + K·n^L_i) Γ(2 + K·n^R_i)],   for m = 1   (2)

We initialize the Gibbs sampler by sampling each x_t from a Bernoulli(α) distribution. All changepoint assignments are then updated sequentially by the Gibbs sampling equation above. The sampler is run for M iterations, after which one set of changepoint assignments is saved. The Gibbs sampler is then restarted multiple times until S samples have been collected. Although we could have included an update equation for α, in this analysis we treat α as a known constant. This will be useful when characterizing the differences between human observers in terms of differences in α.

3.2 Predictive inference

The next latent parameter value θ_{t+1} and outcome y_{t+1} can be predicted on the basis of observed outcomes that occurred after the last inferred changepoint:

  θ_{t+1,j} = (1/(t − t*)) Σ_{i=t*+1..t} y_{i,j}/K,   y_{t+1,j} = round(θ_{t+1,j} K)   (3)

where t* is the location of the most recent changepoint. By considering multiple Gibbs samples, we get a distribution over outcomes y_{t+1}. We base the model predictions on the mean of this distribution.

3.3 Illustration of model performance

Figure 1 illustrates the performance of the model on a one-dimensional sequence (D=1) generated from the changepoint model with T=160, α=0.05, and K=10. The Gibbs sampler was run for M=30 iterations and S=200 samples were collected. The top panel shows the actual changepoints (triangles) and the distribution of changepoint assignments averaged over samples. The bottom panel shows the observed data y (thin lines) as well as the θ values in the generative model (rescaled between 0 and 10). At locations with large changes between observations, the marginal changepoint probability is quite high. At other locations, the true change in the mean is very small, and the model is less likely to put in a changepoint. The lower right panel shows the distribution over predicted θ_{t+1} values.

[Figure 1. Results of model simulation; panels plot x_t, y_t, and the predicted θ_{t+1} distribution over 160 trials.]

4 Prediction experiment

We tested performance of 9 human observers in the prediction task. The observers included the authors, a visitor, and one student who were aware of the statistical nature of the task, as well as naïve students. The observers were seated in front of an LCD touch screen displaying a two-dimensional grid of 11 x 11 buttons. The changepoint model was used to generate a sequence of T=1500 stimuli for two binomial variables y1 and y2 (D=2, K=10). The change probability α was set to 0.1. The two variables y1 and y2 specified the two-dimensional button location. The same sequence was used for all observers. On each trial, the observer touched a button on the grid displayed on the touch screen. Following each button press, the button corresponding to the next {y1, y2} outcome in the sequence was highlighted. Observers were instructed to press the button that best predicted the next location of the highlighted button. The 1500 trials were divided into three blocks of 500 trials. Breaks were allowed between blocks. The whole experiment lasted between 15 and 30 minutes.
Figure 2 shows the first 50 trials from the third block of the experiment. The top and bottom panels show the actual outcomes for the y1 and y2 button grid coordinates as well as the predictions for two observers (SB and MY). The figure shows that at trial 15, the y1 and y2 coordinates show a large shift, followed by an immediate shift in observer MY's predictions (on trial 16). Observer SB waits until trial 17 to make a shift.

[Figure 2. Trial-by-trial predictions from two observers; panels plot outcomes and the SB and MY predictions over the first 50 trials.]

4.1 Task error

We assessed prediction performance by comparing the prediction with the actual outcome in the sequence. Task error was measured by normalized city-block distance

  task error = (1/(T−1)) Σ_{t=2..T} ( |y_{t,1} − yO_{t,1}| + |y_{t,2} − yO_{t,2}| )   (4)

where yO represents the observer's prediction. Note that the very first trial is excluded from this calculation. Even though more suitable probabilistic measures for prediction error could have been adopted, we wanted to allow comparison of observers' performance with both probabilistic and non-probabilistic models. (A code sketch of this metric appears at the end of this section.) Task error ranged from 2.8 (for participant MY) to 3.3 (for ML). We also assessed the performance of five models; their task errors ranged from 2.78 to 3.20. The Bayesian models (Section 3) had the lowest task errors, just below 2.8. This fits with our definition of the Bayesian models as "ideal observer" models: their task error is lower than any other model's and any human observer's task error. The fast and frugal models (Section 5) had task errors ranging from 2.85 to 3.20.

5 Modeling Results

We will refer to the models with the following letter codes: B = Bayesian model, LB = limited Bayesian model, FF1..3 = fast and frugal models 1..3. We assessed model fit by comparing the model's prediction against the human observers' predictions, again using a normalized city-block distance

  model error = (1/(T−1)) Σ_{t=2..T} ( |yM_{t,1} − yO_{t,1}| + |yM_{t,2} − yO_{t,2}| )   (5)

where yM represents the model's prediction. The model error for each individual observer is shown in Figure 3. It is important to note that because each model is associated with a set of free parameters, the parameters optimized for task error and model error are different. For Figure 3, the parameters were optimized to minimize Equation (5) for each individual observer, showing the extent to which these models can capture the performance of individual observers, not necessarily providing the best task performance.

[Figure 3. Model error for each model (B, LB, FF1, FF2, FF3) and each individual observer (MY, MS, MM, EJ, PH, NP, DN, SB, ML).]

5.1 Bayesian prediction models

At each trial t, the model was provided with the sequence of all previous outcomes. The Gibbs sampling and inference procedures from Eqs. (2) and (3) were applied with M=30 iterations and S=200 samples. The change probability α was a free parameter. In the full Bayesian model, the whole sequence of observations up to the current trial is available for prediction, leading to a memory requirement of up to T=1500 trials – a psychologically unreasonable assumption. We therefore also simulated a limited Bayesian model (LB) where the observed sequence was truncated to the last 10 outcomes. The LB model showed almost no decrement in task performance compared to the full Bayesian model. Figure 3 also shows that it fit human data quite well.
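The normalized city-block metric of Eqs. (4) and (5) is straightforward to compute; a minimal sketch (numpy assumed; array names are ours):

```python
import numpy as np

def cityblock_error(pred, y):
    """pred: predictions for each trial, shape (T, 2); y: actual outcomes,
    shape (T, 2). Trial 1 is excluded because no prediction precedes it."""
    T = y.shape[0]
    return np.abs(pred[1:] - y[1:]).sum() / (T - 1)
```

The same function computes task error (pred = an observer's button presses) and model error (pred = a model's predictions against the observer's presses).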
5.2 Individual Differences

The right-hand panel of Figure 4 plots each observer's task error as a function of the mean city-block distance between their subsequent button presses. This shows a clear U-shaped function. Observers with very variable predictions (e.g., ML and DN) had large average changes between successive button pushes, and also had large task error: these observers were too "twitchy". Observers with very small average button changes (e.g., SB and NP) were not twitchy enough, and also had large task error. Observers in the middle had the lowest task error (e.g., MS and MY). The left-hand panel of Figure 4 shows the same data, but with the x-axis based on the Bayesian model fits. Instead of using mean button change distance to index twitchiness (as in the right-hand panel), the left-hand panel uses the estimated α parameters from the Bayesian model. A similar U-shaped pattern is observed: individuals with too large or too small α estimates have large task errors.(1)

(1) Error bars indicate bootstrapped 95% confidence intervals.

[Figure 4. Task error vs. "twitchiness". The left-hand panel indexes twitchiness using estimated α parameters from the Bayesian model fits; the right-hand panel uses mean distance between successive predictions.]

5.3 Fast-and-Frugal (FF) prediction models

These models perform the prediction task using simple heuristics that are cognitively plausible. The FF models keep a short memory of previous stimulus values and make predictions using the same two-step process as the Bayesian model. First, a decision is made as to whether the latent parameter θ has changed. Second, remembered stimulus values that occurred after the most recently detected changepoint are used to generate the next prediction. A simple heuristic is used to detect changepoints: if the distance between the most recent observation and prediction is greater than some threshold amount, a change is inferred. We defined the distance between a prediction (p) and an observation (y) as the difference between the log-likelihoods of y assuming θ=p and θ=y. Thus, if fB(·|θ,K) is the binomial density with parameters θ and K, the distance between observation y and prediction p is defined as d(y,p) = log(fB(y|y,K)) − log(fB(y|p,K)). A changepoint at time step t+1 is inferred whenever d(y_t,p_t) > C. The parameter C governs the twitchiness of the model predictions. If C is large, only very dramatic changepoints will be detected, and the model will be too conservative. If C is small, the model will be too twitchy, and will detect changepoints on the basis of small random fluctuations. Predictions are based on the most recent M observations, which are kept in memory, unless a changepoint has been detected, in which case only those observations occurring after the changepoint are used for prediction. The prediction for time step t+1 is simply the mean of these observations, say p. Human observers were reticent to make predictions very close to the boundaries. This was modeled by allowing the FF model to change its prediction for the next time step, y_{t+1}, towards the mean prediction (0.5). This change reflects a two-way bet: if the probability of a change occurring is α, the best guess will be 0.5 if that change occurs, or the mean p if the change does not occur. Thus, the prediction made is actually y_{t+1} = (α/2) + (1−α)p (see the sketch below).
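A minimal sketch of one step of this FF1 heuristic (scipy assumed; names are ours, and the memory bookkeeping after a detected change is simplified to keeping only the triggering observation). Here alpha_hat stands in for the running estimate of α described below:

```python
from scipy.stats import binom

def ff_predict(history, p_prev, C, M, K, alpha_hat):
    """history: remembered observations in [0, K]; p_prev: the previous
    prediction as a proportion in [0, 1]. Returns the next prediction."""
    y = history[-1]
    # d(y, p) = log fB(y | y/K, K) - log fB(y | p, K)
    d = binom.logpmf(y, K, y / K) - binom.logpmf(y, K, p_prev)
    if d > C:                      # changepoint inferred: drop old memory
        window = [y]
    else:
        window = history[-M:]      # bounded memory of M observations
    p = sum(window) / (K * len(window))
    # Two-way bet toward the middle of the response range:
    return alpha_hat / 2 + (1 - alpha_hat) * p
```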
Note that we do not allow perfect knowledge of the probability of a changepoint, α. Instead, an estimated value of α is used, based on the number of changepoints detected in the data series up to time t. The FF model nests two simpler FF models that are psychologically interesting. If the twitchiness threshold parameter C becomes arbitrarily large, the model never detects a change and instead becomes a continuous running-average model. Predictions from this model are simply a boxcar smooth of the data. Alternatively, if we assume no memory, the model must base each prediction on only the previous stimulus (i.e., M=1). Above, in Figure 3, we labeled the complete FF model as FF1, the boxcar model as FF2, and the memoryless model as FF3. Figure 3 showed that the complete FF model (FF1) fit the data from all observers significantly better than either the boxcar model (FF2) or the memoryless model (FF3). Exceptions were observers PH, DN and ML, for whom all three FF models fit equally well. This result suggests that our observers were (mostly) doing more than just keeping a running average of the data, or using only the most recent observation. The FF1 model fit the data about as well as the Bayesian models for all observers except MY and MS. Note that, in general, the FF1 and Bayesian model fits are very good: the average city-block distance between the human data and the model prediction is around 0.75 (out of 10) buttons on both the x- and y-axes.

6 Conclusion

We used an online prediction task to study changepoint detection. Human observers had to predict the next observation in stochastic sequences containing random changepoints. We showed that some observers are too "twitchy": they perform poorly on the prediction task because they see changes where only random fluctuation exists. Other observers are not twitchy enough, and they perform poorly because they fail to see small changes. We developed a Bayesian changepoint detection model that performed the task optimally, and also provided a good fit to human data when sub-optimal parameter settings were used. Finally, we developed a fast-and-frugal model that showed how participants may be able to perform well at the task using minimal information and simple decision heuristics.

Acknowledgments

We thank Eric-Jan Wagenmakers and Mike Yi for useful discussions related to this work. This work was supported in part by a grant from the US Air Force Office of Scientific Research (AFOSR grant number FA9550-04-1-0317).

References

[1] Gilovich, T., Vallone, R., & Tversky, A. (1985). The hot hand in basketball: on the misperception of random sequences. Cognitive Psychology, 17, 295-314.
[2] Albright, S.C. (1993a). A statistical analysis of hitting streaks in baseball. Journal of the American Statistical Association, 88, 1175-1183.
[3] Stephens, D.A. (1994). Bayesian retrospective multiple changepoint identification. Applied Statistics, 43(1), 159-178.
[4] Carlin, B.P., Gelfand, A.E., & Smith, A.F.M. (1992). Hierarchical Bayesian analysis of changepoint problems. Applied Statistics, 41(2), 389-405.
[5] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4), 711-732.
[6] Gigerenzer, G., & Goldstein, D.G. (1996). Reasoning the fast and frugal way: models of bounded rationality. Psychological Review, 103, 650-669.
4 0.85983753 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
Author: Benjamin Van Roy
Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective.

1 Preliminaries

Consider a discrete-time communicating Markov decision process (MDP) with a finite state space S = {1, . . . , |S|}. At each state x ∈ S, there is a finite set Ux of admissible actions. If the current state is x and an action u ∈ Ux is selected, a cost of gu(x) is incurred, and the system transitions to a state y ∈ S with probability pxy(u). For any x ∈ S and u ∈ Ux, Σ_{y∈S} pxy(u) = 1. Costs are discounted at a rate of α ∈ (0, 1) per period. Each instance of such an MDP is defined by a quintuple (S, U, g, p, α). A (stationary deterministic) policy is a mapping µ that assigns an action u ∈ Ux to each state x ∈ S. If actions are selected based on a policy µ, the state follows a Markov process with transition matrix Pµ, where each (x, y)th entry is equal to pxy(µ(x)). The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. Each policy µ is associated with a cost-to-go function Jµ ∈ ℝ^{|S|}, defined by Jµ = Σ_{t=0..∞} α^t Pµ^t gµ = (I − αPµ)^{−1} gµ, where, with some abuse of notation, gµ(x) = g_{µ(x)}(x) for each x ∈ S. A policy µ is said to be greedy with respect to a function J if µ(x) ∈ argmin_{u∈Ux} (gu(x) + α Σ_{y∈S} pxy(u) J(y)) for all x ∈ S. The optimal cost-to-go function J* ∈ ℝ^{|S|} is defined by J*(x) = min_µ Jµ(x), for all x ∈ S. A policy µ* is said to be optimal if Jµ* = J*. It is well known that an optimal policy exists. Further, a policy µ* is optimal if and only if it is greedy with respect to J*. Hence, given the optimal cost-to-go function, optimal actions can be computed by minimizing the right-hand side of the above inclusion. Value iteration generates a sequence J^ℓ converging to J* according to J^{ℓ+1} = T J^ℓ, where T is the dynamic programming operator, defined by (T J)(x) = min_{u∈Ux} (gu(x) + α Σ_{y∈S} pxy(u) J(y)), for all x ∈ S and J ∈ ℝ^{|S|}. This sequence converges to J* for any initialization of J^0.

2 Approximate Value Iteration

The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space S into a manageable number K of disjoint subsets S1, . . . , SK and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation: all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix Φ ∈ ℝ^{|S|×K} such that each kth column is an indicator function for the kth partition Sk. Hence, for any r ∈ ℝ^K, k, and x ∈ Sk, (Φr)(x) = rk. In this paper, we study variations of value iteration, each of which computes a vector r so that Φr approximates J*.
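For concreteness, here is a minimal numpy sketch (our names, not the paper's) of the projected iteration Φr ← Ππ T Φr for the state-aggregation approximator. It simplifies by assuming every action is admissible at every state, and that the weight vector has support intersecting every partition:

```python
import numpy as np

def projected_value_iteration(P, g, phi, pi, alpha, K, iters=1000):
    """P[u]: |S|x|S| transition matrix for action u; g[u]: cost vector;
    phi: partition index k for each state; pi: projection weights."""
    num_actions = len(P)
    r = np.zeros(K)
    for _ in range(iters):
        J = r[phi]                      # (Phi r)(x) = r_k for x in S_k
        TJ = np.min([g[u] + alpha * P[u] @ J for u in range(num_actions)],
                    axis=0)             # dynamic programming operator T
        for k in range(K):              # pi-weighted average per partition
            mask = (phi == k)           # assumes pi[mask].sum() > 0
            r[k] = pi[mask] @ TJ[mask] / pi[mask].sum()
    return r
```

Because Ππ T is a maximum-norm contraction in this aggregation setting (as discussed next), the iteration converges to the fixed point Φr̃ regardless of initialization.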
The use of such a policy µr, which is greedy with respect to Φr, is justified by the following result (see [10] for a proof):

Theorem 1: If µ is a greedy policy with respect to a function J̃ ∈ ℝ^{|S|}, then
  ‖Jµ − J*‖∞ ≤ (2α/(1 − α)) ‖J* − J̃‖∞.

One common way of approximating a function J ∈ ℝ^{|S|} with a function of the form Φr involves projection with respect to a weighted Euclidean norm ‖·‖_{2,π}: ‖J‖_{2,π} = (Σ_{x∈S} π(x) J²(x))^{1/2}. Here, π ∈ ℝ₊^{|S|} is a vector of weights that assign relative emphasis among states. The projection Ππ J is the function Φr that attains the minimum of ‖J − Φr‖_{2,π}; if there are multiple functions Φr that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm ‖Φr‖_{2,π}. Note that in our context, where each kth column of Φ represents an indicator function for the kth partition, for any π, J, and x ∈ Sk, (Ππ J)(x) = Σ_{y∈Sk} π(y) J(y) / Σ_{y∈Sk} π(y). Approximate value iteration begins with a function Φr^(0) and generates a sequence according to Φr^(ℓ+1) = Ππ T Φr^(ℓ). It is well known that the dynamic programming operator T is a contraction mapping with respect to the maximum norm. Further, Ππ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general Φ, but is true in our context in which columns of Φ are indicator functions for partitions.) It follows that the composition Ππ T is a contraction mapping. By the contraction mapping theorem, Ππ T has a unique fixed point Φr̃, which is the limit of the sequence Φr^(ℓ). Further, the following result holds:

Theorem 2: For any MDP, partition, and weights π with support intersecting every partition, if Φr̃ = Ππ T Φr̃ then
  ‖Φr̃ − J*‖∞ ≤ (2/(1 − α)) min_{r∈ℝ^K} ‖J* − Φr‖∞
and
  (1 − α) ‖Jµr̃ − J*‖∞ ≤ (4α/(1 − α)) min_{r∈ℝ^K} ‖J* − Φr‖∞.

The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. Note that Jµr̃(x) ≥ J*(x) for all x, so the left-hand side of the performance loss bound is the maximal increase in cost-to-go, normalized by 1 − α. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients 1, α, α², . . ., which sum to 1/(1 − α). Our motivation of the normalizing constant begs the question of whether, for fixed MDP parameters (S, U, g, p) and fixed Φ, min_r ‖J* − Φr‖∞ also grows with 1/(1 − α). It turns out that min_r ‖J* − Φr‖∞ = O(1). To see why, note that for any µ,
  Jµ = (I − αPµ)^{−1} gµ = λµ/(1 − α) + hµ,
where λµ(x) is the expected average cost if the process starts in state x and is controlled by policy µ,
  λµ = lim_{τ→∞} (1/τ) Σ_{t=0..τ−1} Pµ^t gµ,
and hµ is the discounted differential cost function hµ = (I − αPµ)^{−1}(gµ − λµ). Both λµ and hµ converge to finite vectors as α approaches 1 [3]. For an optimal policy µ*, lim_{α↑1} λµ*(x) does not depend on x (in our context of a communicating MDP). Since constant functions lie in the range of Φ,
  lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖∞ ≤ lim_{α↑1} ‖hµ*‖∞ < ∞.
The performance loss bound still exhibits an undesirable dependence on α through the coefficient 4α/(1 − α). In most relevant contexts, α is close to 1; a representative value might be 0.99. Consequently, 4α/(1 − α) can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by 1 the vector with every component equal to 1.
Theorem 3: For any δ > 0, α ∈ (0, 1), and ∆ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that min_{r∈ℝ^K} ‖J* − Φr‖∞ = ∆ and, if Φr̃ = Ππ T Φr̃ with π = 1,
  (1 − α) ‖Jµr̃ − J*‖∞ ≥ (4α/(1 − α)) min_{r∈ℝ^K} ‖J* − Φr‖∞ − δ.

This theorem is established through an example in [22]. The choice of uniform weights (π = 1) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. Based on Theorems 2 and 3, one might expect that there exist MDP parameters (S, U, g, p) and a partition such that, with π = 1,
  (1 − α) ‖Jµr̃ − J*‖∞ = Θ( (1/(1 − α)) min_{r∈ℝ^K} ‖J* − Φr‖∞ ).
In other words, that the performance loss is both lower and upper bounded by 1/(1 − α) times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space. However, as the following theorem establishes, the coefficient multiplying min_{r∈ℝ^K} ‖J* − Φr‖∞ can grow arbitrarily large as α increases, keeping all else fixed.

Theorem 4: For any L and ∆ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖∞ = ∆ and, if Φr̃ = Ππ T Φr̃ with π = 1,
  lim inf_{α↑1} (1 − α)(Jµr̃(x) − J*(x)) ≥ L lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖∞,
for all x ∈ S.

This theorem is also established through an example [22]. For any µ and x,
  lim_{α↑1} ((1 − α)Jµ(x) − λµ(x)) = lim_{α↑1} (1 − α)hµ(x) = 0.
Combined with Theorem 4, this yields the following corollary.

Corollary 1: For any L and ∆ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖∞ = ∆ and, if Φr̃ = Ππ T Φr̃ with π = 1,
  lim inf_{α↑1} (λµr̃(x) − λµ*(x)) ≥ L lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖∞,
for all x ∈ S.

3 Using the Invariant Distribution

In the previous section, we considered an approximation Φr̃ that solves Φr̃ = Ππ T Φr̃ for some arbitrary pre-selected weights π. We now turn to consider use of an invariant state distribution πr̃ of Pµr̃ as the weight vector.(1) This leads to a circular definition: the weights are used in defining r̃, and now we are defining the weights in terms of r̃. What we are really after here is a vector r̃ that satisfies Ππr̃ T Φr̃ = Φr̃. The following theorem captures the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full-length version of this paper [22].)

Theorem 5: For any MDP and partition, if Φr̃ = Ππr̃ T Φr̃ and πr̃ has support intersecting every partition, then
  (1 − α) πr̃ᵀ (Jµr̃ − J*) ≤ 2α min_{r∈ℝ^K} ‖J* − Φr‖∞.

When α is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is the omission of a factor of 1 − α from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions. A relation can be based on the fact that for all µ, lim_{α↑1} ‖(1 − α)Jµ − λµ‖∞ = 0, as explained in Section 2. In particular, based on this, we have
  lim_{α↑1} (1 − α) ‖Jµ − J*‖∞ = |λµ − λ*| = λµ − λ* = lim_{α↑1} πᵀ(Jµ − J*),
for all policies µ and probability distributions π. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as α approaches 1. Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5.

Corollary 2: For all MDP parameters (S, U, g, p) and partitions, if Φr̃ = Ππr̃ T Φr̃ and lim inf_{α↑1} Σ_{x∈Sk} πr̃(x) > 0 for all k, then
  lim sup_{α↑1} ‖λµr̃ − λµ*‖∞ ≤ 2 lim_{α↑1} min_{r∈ℝ^K} ‖J* − Φr‖∞.
The comparison suggests that solving Φr̃ = Ππr̃ T Φr̃ is strongly preferable to solving Φr̃ = Ππ T Φr̃ with π = 1.

(1) By an invariant state distribution of a transition matrix P, we mean any probability distribution π such that πᵀP = πᵀ. In the event that Pµr̃ has multiple invariant distributions, πr̃ denotes an arbitrary choice.

4 Exploration

If a vector r̃ solves Φr̃ = Ππr̃ T Φr̃ and the support of πr̃ intersects every partition, Theorem 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation Ππr̃ T Φr̃ = Φr̃ does not have a solution. It is easy to produce examples of this; though no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of πr̃ intersect every partition. To address these shortcomings, we introduce stochastic policies. A stochastic policy µ maps state-action pairs to probabilities. For each x ∈ S and u ∈ Ux, µ(x, u) is the probability of taking action u when in state x. Hence, µ(x, u) ≥ 0 for all x ∈ S and u ∈ Ux, and Σ_{u∈Ux} µ(x, u) = 1 for all x ∈ S. Given a scalar ε > 0 and a function J, the ε-greedy Boltzmann exploration policy with respect to J is defined by
  µ(x, u) = e^{−(Tu J)(x)(|Ux|−1)/(εe)} / Σ_{u'∈Ux} e^{−(Tu' J)(x)(|Ux|−1)/(εe)}
(see the sketch at the end of this section). For any ε > 0 and r, let µ^ε_r denote the ε-greedy Boltzmann exploration policy with respect to Φr. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration:
  (T^ε J)(x) = Σ_{u∈Ux} e^{−(Tu J)(x)(|Ux|−1)/(εe)} (Tu J)(x) / Σ_{u'∈Ux} e^{−(Tu' J)(x)(|Ux|−1)/(εe)}.
As ε approaches 0, ε-greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all r, x, and J, lim_{ε↓0} µ^ε_r(x, µr(x)) = 1 and lim_{ε↓0} T^ε J = T J. These are immediate consequences of the following result (see [4] for a proof).

Lemma 1: For any n and v ∈ ℝⁿ,
  min_i v_i + ε ≥ Σ_i e^{−v_i(n−1)/(εe)} v_i / Σ_i e^{−v_i(n−1)/(εe)} ≥ min_i v_i.

Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each ε-greedy Boltzmann exploration policy µ^ε_r, and the support of this distribution is S. Let π^ε_r denote this distribution. We consider a vector r̃ that solves Φr̃ = Ππ^ε_r̃ T^ε Φr̃. For any ε > 0, there exists a solution to this equation (this is an immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].)

Theorem 6: For any MDP, partition, and ε > 0, if Φr̃ = Ππ^ε_r̃ T^ε Φr̃ then
  (1 − α)(π^ε_r̃)ᵀ(Jµ^ε_r̃ − J*) ≤ 2α min_{r∈ℝ^K} ‖J* − Φr‖∞ + ε.
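A minimal sketch of the ε-greedy Boltzmann exploration policy above (numpy assumed; names are ours). The array TuJ holds (Tu J)(x) for each admissible action u at a fixed state x; smaller one-step costs receive higher probability:

```python
import numpy as np

def boltzmann_policy(TuJ, eps):
    n = len(TuJ)                      # |Ux|
    z = -np.asarray(TuJ) * (n - 1) / (eps * np.e)
    z -= z.max()                      # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum()                # action probabilities mu(x, .)
```

By Lemma 1, the expected one-step cost under these probabilities exceeds the greedy minimum by at most ε, which is exactly the slack appearing in Theorem 6.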
5 Computation: TD(0)

Though computation is not a focus of this paper, we offer a brief discussion here. First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve Φr̃ = Ππ T Φr̃. It requires an ability to sample a sequence of states x^(0), x^(1), x^(2), . . ., each independent and identically distributed according to π. Also required is a way to efficiently compute (T Φr)(x) = min_{u∈Ux} (gu(x) + α Σ_{y∈S} pxy(u)(Φr)(y)), for any given x and r. This is typically possible when the action set Ux and the support of px·(u) (i.e., the set of states that can follow x if action u is selected) are not too large. The algorithm generates a sequence of vectors r^(ℓ) according to
  r^(ℓ+1) = r^(ℓ) + γℓ φ(x^(ℓ)) ( (T Φr^(ℓ))(x^(ℓ)) − (Φr^(ℓ))(x^(ℓ)) ),
where γℓ is a step size and φ(x) denotes the column vector made up of components from the xth row of Φ. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, r^(ℓ) converges to a vector r̃ that solves Φr̃ = Ππ T Φr̃. The equation Φr̃ = Ππ T Φr̃ may have no solution. Further, the requirement that states are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving Φr̃ = Ππ^ε_r̃ T^ε Φr̃. The algorithm requires simulation of a trajectory x0, x1, x2, . . . of the MDP, with each action ut ∈ Uxt generated by the ε-greedy Boltzmann exploration policy with respect to Φr^(t). The sequence of vectors r^(t) is generated according to
  r^(t+1) = r^(t) + γt φ(xt) ( (T^ε Φr^(t))(xt) − (Φr^(t))(xt) ).
Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies Φr̃ = Ππ^ε_r̃ T^ε Φr̃. Whether such an algorithm converges, and whether there are other algorithms that can effectively solve Φr̃ = Ππ^ε_r̃ T^ε Φr̃ for broad classes of relevant problems, remain open issues.

6 Extensions and Open Issues

Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound. This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14]. The temporal-difference learning algorithm presented in Section 5 is a version of TD(0), which is a special case of TD(λ), parameterized by λ ∈ [0, 1]. It is not known whether the results of this paper can be extended to the general case of λ ∈ [0, 1]. Prior research has suggested that larger values of λ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1 − α) as λ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error. Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue.
In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of 1/(1 − α) can similarly be eliminated from these bounds is an open issue. Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes
  Φr̃ = Ππ^ε_r̃ (T^ε Φr̃ − λ̃1).
The variables include an estimate λ̃ ∈ ℝ of the minimal average cost λ* ∈ ℝ and an approximation Φr̃ of the optimal differential cost function h*. The discount factor α is set to 1 in computing an ε-greedy Boltzmann exploration policy as well as T^ε. There is an average-cost version of temporal-difference learning for which any limit of convergence (λ̃, r̃) satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as α approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective:

Theorem 7: For any communicating MDP with an average-cost objective, partition, and ε > 0, if Φr̃ = Ππ^ε_r̃ (T^ε Φr̃ − λ̃1) then
  λµ^ε_r̃ − λ* ≤ 2 min_{r∈ℝ^K} ‖h* − Φr‖∞ + ε.

Here, λµ^ε_r̃ ∈ ℝ denotes the average cost under policy µ^ε_r̃, which is well defined because the process is irreducible under an ε-greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left- and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is λµ^ε_r̃ − λ*. The limit of min_{r∈ℝ^K} ‖J* − Φr‖∞ on the right-hand side is min_{r∈ℝ^K} ‖h* − Φr‖∞. (This follows from the analysis of [3].)

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637. The author's understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full-length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers.

References

[1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270-279, 1994.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719-726, 1962.
[4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000.
[5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850-865, 2003.
[7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995.
[8] G. J. Gordon. Stable function approximation in dynamic programming.
In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995.
[9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185-1201, 1994.
[10] S. P. Singh and R. C. Yee. An upper-bound on the loss from approximate optimal-value functions. Machine Learning, 1994.
[11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
[13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995.
[14] R. S. Sutton. Generalization in reinforcement learning: successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press.
[15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185-202, 1994.
[16] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59-94, 1996.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.
[18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press.
[19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997.
[20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799-1808, 1999.
[21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2-3):179-191, 2002.
[22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at www.stanford.edu/~bvr/psfiles/aggregation.pdf, 2005.
[23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989.
[24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.
5 0.77124876 110 nips-2005-Learning Depth from Single Monocular Images
Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng
Abstract: We consider the task of depth estimation from a single monocular image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured outdoor environments which include forests, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a discriminatively-trained Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models both depths at individual points as well as the relation between depths at different points. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. 1
6 0.45603466 53 nips-2005-Cyclic Equilibria in Markov Games
7 0.44961268 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
8 0.43941751 177 nips-2005-Size Regularized Cut for Data Clustering
9 0.42058465 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
10 0.42001522 46 nips-2005-Consensus Propagation
11 0.39846775 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
12 0.39367041 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects
13 0.39074349 153 nips-2005-Policy-Gradient Methods for Planning
14 0.3866362 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account
15 0.38345596 34 nips-2005-Bayesian Surprise Attracts Human Attention
16 0.38186991 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
17 0.37748617 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine
18 0.37217602 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception
19 0.36635265 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
20 0.36310059 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity