nips nips2005 nips2005-128 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Naoki Masuda, Shun-ichi Amari
Abstract: There is a long-standing controversy over the site of cerebellar motor learning. Different theories and experimental results suggest that either the cerebellar flocculus or the brainstem learns the task and stores the memory. With a dynamical system approach, we clarify the mechanism of transferring the memory generated in the flocculus to the brainstem and that of so-called savings phenomena. The brainstem learning must comply with a sort of Hebbian rule depending on Purkinje-cell activities. In contrast to earlier numerical models, our model is simple but it accommodates explanations and predictions of experimental situations as qualitative features of trajectories in the phase space of synaptic weights, without fine parameter tuning. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract There is a long-standing controversy over the site of cerebellar motor learning. [sent-5, score-0.502]
2 Different theories and experimental results suggest that either the cerebellar flocculus or the brainstem learns the task and stores the memory. [sent-6, score-0.555]
3 With a dynamical system approach, we clarify the mechanism of transferring the memory generated in the flocculus to the brainstem and that of so-called savings phenomena. [sent-7, score-0.543]
4 The brainstem learning must comply with a sort of Hebbian rule depending on Purkinje-cell activities. [sent-8, score-0.26]
5 In contrast to earlier numerical models, our model is simple but it accommodates explanations and predictions of experimental situations as qualitative features of trajectories in the phase space of synaptic weights, without fine parameter tuning. [sent-9, score-0.154]
6 1 Introduction The cerebellum is involved in various types of motor learning. [sent-10, score-0.194]
7 As shown in Fig. 1, the cerebellum is composed of the cerebellar cortex and the cerebellar nuclei (we depict the vestibular nucleus VN in Fig. 1). [sent-12, score-1.152]
8 There are two main pathways linking external input from the mossy fibers (mf ) to motor outputs, which originate from the cerebellar nuclei. [sent-14, score-0.605]
9 The pathway that relays the mossy fibers directly to the cerebellar nuclei is called the direct pathway. [sent-15, score-0.938]
10 Each nucleus cell receives about 10^4 mossy fiber synapses. [sent-16, score-0.185]
11 The pathway involving the mossy fibers, the granule cells (gr), the parallel fibers (pl), and the Purkinje cells (Pr) in the flocculo-nodular lobes of the cerebellar cortex, is called the indirect pathway. [sent-17, score-0.983]
12 Because the Purkinje cells, which are the sole source of output from the cerebellar cortex, are GABAergic, firing rates of the nuclei are suppressed when this pathway is active. [sent-18, score-0.79]
13 The indirect pathway also includes recurrent collaterals terminating on various types of inhibitory cells. [sent-19, score-0.331]
14 Another anatomical feature of the indirect pathway is that the climbing fibers (Cm in Fig. 1) project onto the Purkinje cells. [sent-20, score-0.376]
15 Taking into account the huge mass of intermediate computational units in the indirect pathway, or the granule cells, Marr conjectured that the cerebellum operates as a perceptron with high computational power [8]. [sent-22, score-0.24]
16 The climbing fibers were thought to induce long-term potentiation (LTP) of pl-Pr synapses to reinforce the signal transduction. [sent-23, score-0.124]
17 Albus claimed that long-term depression (LTD) rather than LTP should occur so that the Purkinje cells inhibit the nuclei [2]. [sent-24, score-0.288]
18 The climbing fibers were thought to serve as teaching lines that convey error-correcting signals. [sent-25, score-0.116]
19 Figure 1: Architecture of the VOR model (labels: mf, gr, pl, Pr, Cm, IO, VN; x = Au, y = wx = wAu, z = vu − y, e = ru − z). [sent-26, score-0.129]
20 The vestibulo-ocular reflex (VOR) is a standard benchmark for exploring synaptic substrates of cerebellar motor learning. [sent-27, score-0.543]
21 When a subject wears a prism, adaptation of the VOR gain occurs for image stabilization. [sent-30, score-0.14]
22 However, the cerebellum is not the only site of convergence of visual and vestibular signals. [sent-32, score-0.189]
23 The learning scheme depending only on the indirect pathway is called the flocculus hypothesis. [sent-33, score-0.336]
24 An alternative is the brainstem hypothesis in which synaptic plasticity is assumed to occur in the direct pathway (mf → VN) [12]. [sent-34, score-0.688]
25 This idea is supported by experimental evidence that flocculus shutdown after 3 days of VOR adaptation does not impair the motor memory [7]. [sent-35, score-0.313]
26 Moreover, in other experiments, plasticity of the Purkinje cells in response to vestibular inputs, as required in the flocculus hypothesis, does occur, but in the direction opposite to that predicted by the flocculus hypothesis [5, 12]. [sent-36, score-0.407]
27 Also, LTP of the mf-VN synapses, which is necessary to implement the brainstem hypothesis [3], has been suggested in experiments [14]. [sent-37, score-0.321]
28 Relative contributions of the flocculus mechanism and the brainstem mechanism to motor learning remain elusive [3, 5, 9]. [sent-38, score-0.429]
29 The same controversy exists regarding the mechanism of associative eyelid conditioning [9, 10, 11]. [sent-39, score-0.102]
30 Many of the experiments in favor of the flocculus hypothesis are concerned with short-term learning, whereas plasticity involving the vestibular nuclei is suggested to be functional in the long term. [sent-41, score-0.474]
31 Short-term motor memory in the flocculus may eventually be transferred to the brainstem. [sent-42, score-0.338]
32 Medina and Mauk proposed a numerical model and examined what types of brainstem learning rules are compatible with memory transfer [10]. [sent-44, score-0.513]
33 They concluded that the brainstem plasticity should be driven by coincident activities of the Purkinje cells and the mossy fibers. [sent-45, score-0.496]
34 The necessity of a Hebbian type of learning in the direct pathway is also supported by another numerical model [13]. [sent-46, score-0.341]
35 We propose a much simpler model to understand the essential mechanism of memory transfer without fine parameter manipulations. [sent-47, score-0.275]
36 Another goal of this work is to explain savings of learning. [sent-48, score-0.177]
37 The flocculus may be responsible for daily rapid learning and forgetting, and the brainstem may underlie gradual memory consolidation [11]. [sent-56, score-0.413]
38 As shown in Fig. 1, let us denote by u ∈ R^m the external input to the mossy fibers. [sent-59, score-0.137]
39 It is propagated to the granule cells via synaptic connectivity represented by an n by m matrix A, where presumably n ≫ m. [sent-60, score-0.25]
40 The output of the granule cells, or x ≡ Au ∈ R^n, is received by the Purkinje-cell layer. [sent-61, score-0.086]
41 The direct pathway (mf → VN) is defined by a plastic connection matrix v ∈ R^{1×m}. [sent-64, score-0.282]
42 The output to the VOR actuator is given by z = vu − y = vu − wAu, which is the output of the sole neuron of the cerebellar nuclei. [sent-65, score-0.574]
43 This form of z takes into account that the contribution of the indirect pathway is inhibitory and that of the direct pathway is excitatory. [sent-66, score-0.553]
44 The animal learns to adapt z as close as possible to the desirable motor output ru. [sent-67, score-0.146]
45 For a large (resp. small) desirable gain r, the correct direction of synaptic changes is the decrease (resp. increase) of w. [sent-69, score-0.195]
46 The learning error e ≡ ru − z is carried by the climbing fibers and projects onto the Purkinje cell, which enables supervised learning [6]. [sent-72, score-0.255]
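As a concrete illustration, the signal flow just described can be written out numerically; the sketch below uses the scalar reduction implicit in the paper's later analysis (all values are illustrative, not the authors' simulation settings):

    # Signal flow of the VOR model (scalar reduction: m = n = 1).
    A = 1.0      # mossy fiber -> granule cell gain
    u = 1.0      # mossy fiber (vestibular) input
    w = 0.5      # plastic pl-Pr weight (indirect pathway)
    v = 1.5      # plastic mf-VN weight (direct pathway)
    r = 1.5      # desired VOR gain

    x = A * u        # granule / parallel fiber activity
    y = w * x        # Purkinje-cell output, inhibitory on the nucleus
    z = v * u - y    # nucleus output: direct excitation minus Pr inhibition
    e = r * u - z    # learning error carried by the climbing fibers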
47 Eq. (1) is modified to ẇ = −η1(ru − vu + wAu)Au + η2Au − η3w, (2) where η2 and η3 are rates of memory decay satisfying η2, η3 ≪ η1. [sent-86, score-0.24]
48 With the synaptic strengths in this null condition denoted by (w, v) = (w0, v0), we obtain r0u = v0u − w0Au. [sent-89, score-0.097]
49 Eq. (2) then results in ẇ = −η1(ru − vu + wAu)Au − η3(w − w0). [sent-94, score-0.103]
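A forward-Euler sketch of this flocculus rule, with v held fixed; the rates, initial values, and step size are illustrative choices satisfying η3 ≪ η1:

    # Indirect-pathway (flocculus) learning with slow decay toward w0.
    A, u, r = 1.0, 1.0, 1.5
    eta1, eta3 = 1.0, 0.01          # fast learning, slow decay
    w0, v = 0.5, 1.5                # w0 sustains the null gain; v fixed here
    w, dt = w0, 0.01
    for _ in range(20000):
        e = r * u - (v * u - w * A * u)      # e = ru - vu + wAu
        w += dt * (-eta1 * e * A * u - eta3 * (w - w0))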
50 Numerical models suggest that LTP in the nuclei should be driven by y [10, 11]. [sent-98, score-0.227]
51 However, the mechanism and the specificity underlying plasticity of v are not well understood [9]. [sent-99, score-0.131]
52 In parallel to the learning rule of w, we assume a subtractive normalization term −η5 u [10]. [sent-101, score-0.155]
53 3 Analysis of Memory Transfer Let us examine a couple of learning rules in the direct pathway to identify robust learning mechanisms. [sent-107, score-0.322]
54 1 Supervised learning Although the climbing fibers carrying e send excitatory collaterals to the cerebellar nuclei, supervised learning there has very little experimental support [5]. [sent-109, score-0.553]
55 Here we show that supervised learning in the direct pathway is theoretically unlikely. [sent-110, score-0.322]
56 Eq. (5) becomes v̇ = η4(ru − vu + wAu)u − η5u − η6v. (6) [sent-114, score-0.103]
57 Substituting the null condition into Eq. (6) yields v̇ = η4(ru − vu + wAu)u − η6(v − v0). [sent-119, score-0.103]
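The coupled (w, v) dynamics under supervised learning in both pathways can then be traced in the w-v phase plane; a sketch with illustrative rates, the direct pathway deliberately slower:

    # Supervised learning in both pathways; traj records the phase-plane path.
    A, u, r = 1.0, 1.0, 1.5
    eta1, eta3 = 1.0, 0.01
    eta4, eta6 = 0.05, 0.005        # slower plasticity in the direct pathway
    w0, v0 = 0.5, 1.5               # null condition: r0*u = v0*u - w0*A*u (r0 = 1)
    w, v, dt = w0, v0, 0.01
    traj = []
    for _ in range(200000):
        e = r * u - (v * u - w * A * u)
        w += dt * (-eta1 * e * A * u - eta3 * (w - w0))
        v += dt * (eta4 * e * u - eta6 * (v - v0))
        traj.append((w, v))         # fast LTD of w, then slow transfer to v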
58 At the equilibrium (Eq. (11)), w∗ − w0 = −η1η6Au²(r − r0)/(η1η6A²u² + η3η4u² + η3η6) and v∗ − v0 = η3η4u²(r − r0)/(η1η6A²u² + η3η4u² + η3η6); hence LTD of w and LTP of v are expected for adaptation to a larger gain (r > r0), and LTP of w and LTD of v are expected for r < r0. [sent-124, score-0.109]
59 The results are consistent with both the flocculus hypothesis and the brainstem hypothesis as far as the direction of learning is concerned [5]. [sent-125, score-0.338]
60 LTD (resp. LTP) of w first occurs to decrease the learning error. [sent-128, score-0.092]
61 Then, the motor memory stored in w is gradually transferred to v by LTP (resp. LTD) of v. [sent-129, score-0.378]
62 In the long run, the memory is stored mainly in v, not in w. [sent-132, score-0.137]
63 However, the memory transfer based on supervised learning has fundamental deficiencies. [sent-133, score-0.305]
64 This may underlie the fact that LTD of w was not followed by partial LTP in the numerical simulations in [10]. [sent-139, score-0.11]
65 Even if the position of (w ∗ , v ∗ ) happens to support LTD of w and LTP of v, memory transfer takes a long time. [sent-140, score-0.26]
66 Similar calculations show that this rule also realizes memory transfer only in an unreliable manner. [sent-144, score-0.271]
67 2 Purkinje cell-dependent learning Results of numerical studies support the idea that v should be subject to a type of Hebbian learning depending on two afferents to the vestibular nuclei, namely, u and y [10, 11, 13]. [sent-152, score-0.244]
68 Since LTP should logically occur when y is small and u is large, we set F = (ymax − y)u, where ymax is the maximum firing rate of the Purkinje cell. [sent-154, score-0.093]
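A sketch of this Purkinje cell-dependent rule, assuming the direct-pathway update v̇ = η4(ymax − y)u − η5u − η6v, i.e., F = (ymax − y)u combined with the subtractive normalization and decay terms introduced earlier (the exact form of Eq. (15) is not reproduced here; all rates are illustrative and chosen so the null condition is stationary):

    # Hebbian, Purkinje cell-dependent plasticity of the direct pathway:
    # LTP of v when Purkinje firing y is low while mossy input u is high.
    A, u, r = 1.0, 1.0, 1.5
    eta1, eta3 = 1.0, 0.01
    eta4, eta5, eta6 = 0.02, 0.015, 0.01    # balanced at the null condition
    ymax = 2.0                              # maximum Purkinje firing rate
    w0, dt = 0.5, 0.01
    w, v = w0, 1.5
    for _ in range(200000):
        y = w * A * u                       # Purkinje output
        e = r * u - (v * u - y)
        w += dt * (-eta1 * e * A * u - eta3 * (w - w0))
        v += dt * (eta4 * (ymax - y) * u - eta5 * u - eta6 * v)
        # v slowly takes over the memory as w relaxes back toward w0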
69 In contrast to the supervised and Hebbian learning rules, this learning is robust against parameter changes since the positions and the slopes of the two nullclines are apart from each other. [sent-174, score-0.227]
70 Owing to this property, in the long term, the memory is transferred more rapidly along the w-nullcline than for the other two learning rules. [sent-175, score-0.248]
71 An important property of Eq. (15) is that |v∗ − v0| ≫ |w∗ − w0| holds, which means efficient memory transfer from w to v. [sent-177, score-0.235]
72 To show this, we mimic a situation of savings by periodically alternating the task period and the rest period. [sent-185, score-0.245]
73 During the rest of the day (20 hours), the dark condition is simulated by giving no teaching signal to the model. [sent-188, score-0.2]
74 The numerical results are consistent with the savings found in other reported experiments [7] and models [11]; the animal forgets much of the acquired gain in the dark, while a small fraction is transferred each day to the cerebellar nuclei. [sent-192, score-0.783]
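A sketch of this protocol: each simulated day consists of 4 hours of adaptation followed by 20 hours of darkness, during which the teaching (η1) term is switched off; the rates are illustrative and not calibrated to real time constants:

    # Savings: alternating task (light) and rest (dark) periods.
    A, u, r = 1.0, 1.0, 1.5
    eta1, eta3 = 1e-3, 1e-5
    eta4, eta5, eta6 = 2e-5, 1.5e-5, 1e-5
    ymax, w0 = 2.0, 0.5
    w, v, dt = w0, 1.5, 1.0                   # one step = one second
    steps_per_hour = 3600
    gains = []
    for day in range(7):
        for step in range(24 * steps_per_hour):
            training = step < 4 * steps_per_hour
            y = w * A * u
            e = (r * u - (v * u - y)) if training else 0.0
            w += dt * (-eta1 * e * A * u - eta3 * (w - w0))
            v += dt * (eta4 * (ymax - y) * u - eta5 * u - eta6 * v)
        gains.append(v * u - w * A * u)       # gain retained at day's end
    print(gains)                              # a small fraction saved per day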
75 The results also suggest that v is really responsible for savings and that its plasticity needs guidance from the short-term learning of w. [sent-195, score-0.329]
76 The memory transfer occurs even in the dark condition, as indicated by the increase (resp. decrease) of v. [sent-196, score-0.328]
77 This happens because the decay of the short-term memory of w drives the learning of v for some time even after the daily training has finished. [sent-200, score-0.245]
78 For the indirect pathway, a dark condition defines an off-task period during which w gradually loses its associations. [sent-201, score-0.256]
79 The VOR adaptation with this rule (Eq. (18)) is shown in Fig. 4. [sent-205, score-0.087]
80 Long-term retention of the acquired gain is now impossible, whereas the short-term learning, or the adaptation within a day, deteriorates little. [sent-208, score-0.109]
81 Since savings do not occur, the ultimate learning error is larger than when v is plastic. [sent-209, score-0.212]
82 However, if w is fixed and v is plastic, the VOR gain is not adaptive, since y does not carry teaching signals any longer. [sent-210, score-0.099]
83 In this case, we must implement supervised learning of v for learning to occur. [sent-211, score-0.105]
84 5 Discussion Our model explains how the flocculus and the brainstem cooperate in motor learning. [sent-213, score-0.314]
85 Presumably, the indirect pathway involving the flocculus is computationally powerful because of a huge number of intermediate granule cells, but its memory is of short-term nature. [sent-214, score-0.524]
86 The direct pathway, which relays the mossy fibers to the cerebellar nuclei while bypassing the cerebellar cortex, is likely to have less computational power but stores motor memory for a long period. [sent-215, score-1.223]
87 A part of the motor memory is expected to be passed from the flocculus to the nuclei. [sent-216, score-0.262]
88 This happens in a robust manner if the direct pathway is equipped with the learning rule dependent on correlation between the Purkinje-cell firing and the mossy-fiber firing. [sent-217, score-0.348]
89 Exploring whether associative LTP/LTD in the cerebellar nuclei really exists will be a subject of future experimental work. [sent-218, score-0.603]
90 Figure 3: Numerical simulations of savings with the Purkinje cell-dependent learning rule. [sent-229, score-0.243]
91 (B) and (D) show trajectories in the w-v space (thin solid lines) together with the nullclines (thick solid lines) and e = 0 (thick dotted lines). [sent-238, score-0.111]
92 Figure 4: Numerical simulations of savings with fixed v (time axis in hours). [sent-244, score-0.208]
93 In the earlier models [10, 11], quantitative meanings were given to the equilibrium synaptic weights. [sent-249, score-0.115]
94 For example, the earlier arguments negating the flocculus hypothesis are based on the fact that the plasticity of the flocculus (w) responding to vestibular inputs occurs but in the direction opposite to the expectation of the flocculus hypothesis [5, 12]. [sent-253, score-0.38]
95 Then, partial LTP ensues as the motor memory is transferred to the nuclei. [sent-257, score-0.338]
96 Figure 2(B) predicts that, in this case, LTP in the indirect pathway is gradually transferred to LTD in the direct pathway. [sent-259, score-0.453]
97 Rapid, synaptically driven increases in the intrinsic excitability of cerebellar deep nuclear neurons. [sent-270, score-0.414]
98 Roles of cerebellar cortex and nuclei in motor learning: contradictions or clues? [sent-336, score-0.697]
99 Simulations of cerebellar motor learning: computational analysis of plasticity at the mossy fiber to deep nucleus synapse. [sent-343, score-0.769]
100 Long-term potentiation in the interpositus and vestibular nuclei in the rat. [sent-390, score-0.35]
wordName wordTfidf (topN-words)
[('cerebellar', 0.343), ('occulus', 0.343), ('ltp', 0.287), ('ltd', 0.286), ('vor', 0.24), ('pathway', 0.216), ('nuclei', 0.206), ('purkinje', 0.206), ('brainstem', 0.189), ('savings', 0.177), ('au', 0.143), ('mossy', 0.137), ('memory', 0.137), ('bers', 0.136), ('motor', 0.125), ('vestibular', 0.12), ('wau', 0.12), ('vu', 0.103), ('transfer', 0.098), ('plasticity', 0.091), ('granule', 0.086), ('nullclines', 0.086), ('indirect', 0.085), ('transferred', 0.076), ('synaptic', 0.075), ('mf', 0.075), ('climbing', 0.075), ('ru', 0.075), ('cerebellum', 0.069), ('ymax', 0.069), ('hr', 0.068), ('ber', 0.063), ('dark', 0.062), ('gain', 0.058), ('cells', 0.058), ('hypothesis', 0.057), ('numerical', 0.054), ('day', 0.054), ('medina', 0.051), ('subtractive', 0.051), ('adaptation', 0.051), ('ex', 0.05), ('hebbian', 0.049), ('nucleus', 0.048), ('period', 0.047), ('teaching', 0.041), ('gradually', 0.04), ('equilibrium', 0.04), ('mechanism', 0.04), ('rule', 0.036), ('direct', 0.036), ('changes', 0.036), ('supervised', 0.035), ('learning', 0.035), ('controversy', 0.034), ('equilibrum', 0.034), ('masuda', 0.034), ('nagao', 0.034), ('saitama', 0.034), ('vestibuloocular', 0.034), ('wako', 0.034), ('normalization', 0.033), ('presumably', 0.031), ('simulations', 0.031), ('occurs', 0.031), ('io', 0.03), ('gr', 0.03), ('plastic', 0.03), ('amari', 0.03), ('collaterals', 0.03), ('riken', 0.03), ('associative', 0.028), ('daily', 0.027), ('experiences', 0.027), ('gains', 0.027), ('slope', 0.026), ('really', 0.026), ('decrease', 0.026), ('sole', 0.025), ('underlie', 0.025), ('excitability', 0.025), ('deep', 0.025), ('synapses', 0.025), ('trajectories', 0.025), ('happens', 0.025), ('occur', 0.024), ('opposite', 0.024), ('wx', 0.024), ('potentiation', 0.024), ('pl', 0.024), ('cortex', 0.023), ('brain', 0.023), ('stores', 0.023), ('manifold', 0.022), ('condition', 0.022), ('driven', 0.021), ('rest', 0.021), ('animal', 0.021), ('thick', 0.021), ('drives', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning
Author: Naoki Masuda, Shun-ichi Amari
Abstract: There is a long-standing controversy over the site of cerebellar motor learning. Different theories and experimental results suggest that either the cerebellar flocculus or the brainstem learns the task and stores the memory. With a dynamical system approach, we clarify the mechanism of transferring the memory generated in the flocculus to the brainstem and that of so-called savings phenomena. The brainstem learning must comply with a sort of Hebbian rule depending on Purkinje-cell activities. In contrast to earlier numerical models, our model is simple but it accommodates explanations and predictions of experimental situations as qualitative features of trajectories in the phase space of synaptic weights, without fine parameter tuning. 1
2 0.098547764 188 nips-2005-Temporally changing synaptic plasticity
Author: Minija Tamosiunaite, Bernd Porr, Florentin Wörgötter
Abstract: Recent experimental results suggest that dendritic and back-propagating spikes can influence synaptic plasticity in different ways [1]. In this study we investigate how these signals could temporally interact at dendrites leading to changing plasticity properties at local synapse clusters. Similar to a previous study [2], we employ a differential Hebbian plasticity rule to emulate spike-timing dependent plasticity. We use dendritic (D-) and back-propagating (BP-) spikes as post-synaptic signals in the learning rule and investigate how their interaction will influence plasticity. We will analyze a situation where synapse plasticity characteristics change in the course of time, depending on the type of post-synaptic activity momentarily elicited. Starting with weak synapses, which only elicit local D-spikes, a slow, unspecific growth process is induced. As soon as the soma begins to spike this process is replaced by fast synaptic changes as the consequence of the much stronger and sharper BP-spike, which now dominates the plasticity rule. This way a winner-take-all-mechanism emerges in a two-stage process, enhancing the best-correlated inputs. These results suggest that synaptic plasticity is a temporal changing process by which the computational properties of dendrites or complete neurons can be substantially augmented. 1
3 0.085870788 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. The inferred motor programs can be used directly for digit classification, but they can also be used in other ways. By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. We can also use the motor programs as additional, highly informative outputs which reduce overfitting when training a feed-forward classifier. 1 Overview The idea that patterns can be recognized by figuring out how they were generated has been around for at least half a century [1, 2] and one of the first proposed applications was the recognition of handwriting using a generative model that involved pairs of opposing springs [3, 4]. The “analysis-by-synthesis” approach is attractive because the true generative model should provide the most natural way to characterize a class of patterns. The handwritten 2’s in figure 1, for example, are very variable when viewed as pixels but they have very similar motor programs. Despite its obvious merits, analysis-by-synthesis has had few successes, partly because it is computationally expensive to invert non-linear generative models and partly because the underlying parameters of the generative model are unknown for most large data sets. For example, the only source of information about how the MNIST digits were drawn is the images themselves. We describe a simple generative model in which a pen is controlled by two pairs of opposing springs whose stiffnesses are specified by a motor program. If the sequence of stiffnesses is specified correctly, the model can produce images which look very like the MNIST digits. Using a separate network for each digit class, we show that backpropagation can be used to learn a “recognition” network that maps images to the motor programs required to produce them. An interesting aspect of this learning is that the network creates its own training data, so it does not require the training images to be labelled with motor programs. Each recognition network starts with a single example of a motor program and grows an “island of competence” around this example, progressively extending the region over which it can map small changes in the image to the corresponding small changes in the motor program (see figure 2). Figure 1: An MNIST image of a 2 and the additional images that can be generated by inferring the motor program and then adding random noise to it. The pixels are very different, but they are all clearly twos. Fairly good digit recognition can be achieved by using the 10 recognition networks to find 10 motor programs for a test image and then scoring each motor program by its squared error in reconstructing the image. The 10 scores are then fed into a softmax classifier. Recognition can be improved by using PCA to model the distribution of motor trajectories for each class and using the distance of a motor trajectory from the relevant PCA hyperplane as an additional score. Each recognition network is solving a difficult global search problem in which the correct motor program must be found by a single, “open-loop” pass through the network. 
More accurate recognition can be achieved by using this open-loop global search to initialize an iterative, closed-loop local search which uses the error in the reconstructed image to revise the motor program. This requires reconstruction errors in pixel space to be mapped to corrections in the space of spring stiffnesses. We cannot backpropagate errors through the generative model because it is just a hand-coded computer program. So we learn “generative” networks, one per digit class, that emulate the generator. After learning, backpropagation through these generative networks is used to convert pixel reconstruction errors into stiffness corrections. Our final system gives 1.82% error on the MNIST test set which is similar to the 1.7% achieved by a very different generative approach [5] but worse than the 1.53% produced by the best backpropagation networks or the 1.4% produced by support vector machines [6]. It is much worse than the 0.4% produced by convolutional neural networks that use cleverly enhanced training sets [7]. Recognition of test images is quite slow because it uses ten different recognition networks followed by iterative local search. There is, however, a much more efficient way to make use of our ability to extract motor programs. They can be treated as additional output labels when using backpropagation to train a single, multilayer, discriminative neural network. These additional labels act as a very informative regularizer that reduces the error rate from 1.53% to 1.27% in a network with two hidden layers of 500 units each. This is a new method of improving performance that can be used in conjunction with other tricks such as preprocessing the images, enhancing the training set or using convolutional neural nets [8, 7]. 2 A simple generative model for drawing digits The generative model uses two pairs of opposing springs at right angles. One end of each spring is attached to a frictionless horizontal or vertical rail that is 39 pixels from the center of the image. The other end is attached to a “pen” that has significant mass. The springs themselves are weightless and have zero rest length. The pen starts at the equilibrium position defined by the initial stiffnesses of the four springs. It then follows a trajectory that is determined by the stiffness of each spring at each of the 16 subsequent time steps in the motor program. The mass is large compared with the rate at which the stiffnesses change, so the system is typically far from equilibrium as it follows the smooth trajectory. On each time step, the momentum is multiplied by 0.9 to simulate viscosity. A coarse-grain trajectory is computed by using one step of forward integration for each time step in the motor program, so it contains 17 points. The code is at www.cs.toronto.edu/∼ hinton/code. Figure 2: The training data for each class-specific recognition network is produced by adding noise to motor programs that are inferred from MNIST images using the current parameters of the recognition network. To initiate this process, the biases of the output units are set by hand so that they represent a prototypical motor program for the class. Given a coarse-grain trajectory, we need a way of assigning an intensity to each pixel. We tried various methods until we hand-evolved one that was able to reproduce the MNIST images fairly accurately, but we suspect that many other methods would be just as good. 
For each point on the coarse trajectory, we share two units of ink between the the four closest pixels using bilinear interpolation. We also use linear interpolation to add three fine-grain trajectory points between every pair of coarse-grain points. These fine-grain points also contribute ink to the pixels using bilinear interpolation, but the amount of ink they contribute is zero if they are less than one pixel apart and rises linearly to the same amount as the coarse-grain points if they are more than two pixels apart. This generates a thin skeleton with a fairly uniform ink density. To flesh-out the skeleton, we use two “ink parameters”, a a a a a, b, to specify a 3 × 3 kernel of the form b(1 + a)[ 12 , a , 12 ; a , 1 − a, a ; 12 , a , 12 ] which 6 6 6 6 is convolved with the image four times. Finally, the pixel intensities are clipped to lie in the interval [0,1]. The matlab code is at www.cs.toronto.edu/∼ hinton/code. The values of 2a and b/1.5 are additional, logistic outputs of the recognition networks1 . 3 Training the recognition networks The obvious way to learn a recognition network is to use a training set in which the inputs are images and the target outputs are the motor programs that were used to generate those images. If we knew the distribution over motor programs for a given digit class, we could easily produce such a set by running the generator. Unfortunately, the distribution over motor programs is exactly what we want to learn from the data, so we need a way to train 1 We can add all sorts of parameters to the hand-coded generative model and then get the recognition networks to learn to extract the appropriate values for each image. The global mass and viscosity as well as the spacing of the rails that hold the springs can be learned. We can even implement affinelike transformations by attaching the four springs to endpoints whose eight coordinates are given by the recognition networks. These extra parameters make the learning slower and, for the normalized digits, they do not improve discrimination, probably because they help the wrong digit models as much as the right one. the recognition network without knowing this distribution in advance. Generating scribbles from random motor programs will not work because the capacity of the network will be wasted on irrelevant images that are far from the real data. Figure 2 shows how a single, prototype motor program can be used to initialize a learning process that creates its own training data. The prototype consists of a sequence of 4 × 17 spring stiffnesses that are used to set the biases on 68 of the 70 logistic output units of the recognition net. If the weights coming from the 400 hidden units are initially very small, the recognition net will then output a motor program that is a close approximation to the prototype, whatever the input image. Some random noise is then added to this motor program and it is used to generate a training image. So initially, all of the generated training images are very similar to the one produced by the prototype. The recognition net will therefore devote its capacity to modeling the way in which small changes in these images map to small changes in the motor program. Images in the MNIST training set that are close to the prototype will then be given their correct motor programs. This will tend to stretch the distribution of motor programs produced by the network along the directions that correspond to the manifold on which the digits lie. 
As time goes by, the generated training set will expand along the manifold for that digit class until all of the MNIST training images of that class are well modelled by the recognition network. It takes about 10 hours in matlab on a 3 GHz Xeon to train each recognition network. We use minibatches of size 100, momentum of 0.9, and adaptive learning rates on each connection that increase additively when the sign of the gradient agrees with the sign of the previous weight change and decrease multiplicatively when the signs disagree [9]. The net is generating its own training data, so the objective function is always changing which makes it inadvisable to use optimization methods that go as far as they can in a carefully chosen direction. Figures 3 and 4 show some examples of how well the recognition nets perform after training. Nearly all models achieve an average squared pixel error of less than 15 per image on their validation set (pixel intensities are between 0 and 1 with a preponderance of extreme values). The inferred motor programs are clearly good enough to capture the diverse handwriting styles in the data. They are not good enough, however, to give classification performance comparable to the state-of-the-art on the MNIST database. So we added a series of enhancements to the basic system to improve the classification accuracy. 4 Enhancements to the basic system Extra strokes in ones and sevens. One limitation of the basic system is that it draws digits using only a single stroke (i.e. the trajectory is a single, unbroken curve). But when people draw digits, they often add extra strokes to them. Two of the most common examples are the dash at the bottom of ones, and the dash through the middle of sevens (see examples in figure 5). About 2.2% of ones and 13% of sevens in the MNIST training set are dashed and not modelling the dashes reduces classification accuracy significantly. We model dashed ones and sevens by augmenting their basic motor programs with another motor program to draw the dash. For example, a dashed seven is generated by first drawing an ordinary seven using the motor program computed by the seven model, and then drawing the dash with a motor program computed by a separate neural network that models only dashes. Dashes in ones and sevens are modeled with two different networks. Their training proceeds the same way as with the other models, except now there are only 50 hidden units and the training set contains only the dashed cases of the digit. (Separating these cases from the rest of the MNIST training set is easy because they can be quickly spotted by looking at the difference between the images and their reconstructions by the dashless digit model.) The net takes the entire image of a digit as input, and computes the motor program for just the dash. When reconstructing an unlabelled image as say, a seven, we compute both Figure 3: Examples of validation set images reconstructed by their corresponding model. In each case the original image is on the left and the reconstruction is on the right. Superimposed on the original image is the pen trajectory. the dashed and dashless versions of seven and pick the one with the lower squared pixel error to be that image’s reconstruction as a seven. Figure 5 shows examples of images reconstructed using the extra stroke. Local search. When reconstructing an image in its own class, a digit model often produces a sensible, overall approximation of the image. 
However, some of the finer details of the reconstruction may be slightly wrong and need to be fixed up by an iterative local search that adjusts the motor program to reduce the reconstruction error. We first approximate the graphics model with a neural network that contains a single hidden layer of 500 logistic units. We train one such generative network for each of the ten digits and for the dashed version of ones and sevens (for a total of 12 nets). The motor programs used for training are obtained by adding noise to the motor programs inferred from the training data by the relevant, fully trained recognition network. The images produced from these motor programs by the graphics model are used as the targets for the supervised learning of each generative network. Given these targets, the weight updates are computed in the same way as for the recognition networks. Figure 4: To model 4’s we use a single smooth trajectory, but turn off the ink for timesteps 9 and 10. For images in which the pen does not need to leave the paper, the recognition net finds a trajectory in which points 8 and 11 are close together so that points 9 and 10 are not needed. For 5’s we leave the top until last and turn off the ink for timesteps 13 and 14. Figure 5: Examples of dashed ones and sevens reconstructed using a second stroke. The pen trajectory for the dash is shown in blue, superimposed on the original image. Initial squared pixel error = 33.8 10 iterations, error = 15.2 20 iterations, error = 10.5 30 iterations, error = 9.3 Figure 6: An example of how local search improves the detailed registration of the trajectory found by the correct model. After 30 iterations, the squared pixel error is less than a third of its initial value. Once the generative network is trained, we can use it to iteratively improve the initial motor program computed by the recognition network for an image. The main steps in one iteration are: 1) compute the error between the image and the reconstruction generated from the current motor program by the graphics model; 2) backpropagate the reconstruction error through the generative network to calculate its gradient with respect to the motor program; 3) compute a new motor program by taking a step along the direction of steepest descent plus 0.5 times the previous step. Figure 6 shows an example of how local search improves the reconstruction by the correct model. Local search is usually less effective at improving the fits of the wrong models, so it eliminates about 20% of the classification errors on the validation set. PCA model of the image residuals. The sum of squared pixel errors is not the best way of comparing an image with its reconstruction, because it treats the residual pixel errors as independent and zero-mean Gaussian distributed, which they are not. By modelling the structure in the residual vectors, we can get a better estimate of the conditional probability of the image given the motor program. For each digit class, we construct a PCA model of the image residual vectors for the training images. Then, given a test image, we project the image residual vector produced by each inferred motor program onto the relevant PCA hyperplane and compute the squared distance between the residual and its projection. This gives ten scores for the image that measure the quality of its reconstructions by the digit models. We don’t discard the old sum of squared pixel errors as they are still useful for classifying most images correctly. 
Instead, all twenty scores are used as inputs to the classifier, which decides how to combine both types of scores to achieve high classification accuracy. PCA model of trajectories. Classifying an image by comparing its reconstruction errors for the different digit models tacitly relies on the assumption that the incorrect models will reconstruct the image poorly. Since the models have only been trained on images in their Squared error = 24.9, Shape prior score = 31.5 Squared error = 15.0, Shape prior score = 104.2 Figure 7: Reconstruction of a two image by the two model (left box) and by the three model (right box), with the pen trajectory superimposed on the original image. The three model sharply bends the bottom of its trajectory to better explain the ink, but the trajectory prior for three penalizes it with a high score. The two model has a higher squared error, but a much lower prior score, which allows the classifier to correctly label the image. own class, they often do reconstruct images from other classes poorly, but occasionally they fit an image from another class well. For example, figure 7 shows how the three model reconstructs a two image better than the two model by generating a highly contorted three. This problem becomes even more pronounced with local search which sometimes contorts the wrong model to fit the image really well. The solution is to learn a PCA model of the trajectories that a digit model infers from images in its own class. Given a test image, the trajectory computed by each digit model is scored by its squared distance from the relevant PCA hyperplane. These 10 “prior” scores are then given to the classifier along with the 20 “likelihood” scores described above. The prior scores eliminate many classification mistakes such as the one in figure 7. 5 Classification results To classify a test image, we apply multinomial logistic regression to the 30 scores – i.e. we use a neural network with no hidden units, 10 softmax output units and a cross-entropy error. The net is trained by gradient descent using the scores for the validation set images. To illustrate the gain in classification accuracy achieved by the enhancements explained above, table 1 gives the percent error on the validation set as each enhancement is added to the system. Together, the enhancements almost halve the number of mistakes. Enhancements None 1 1, 2 1, 2, 3 1, 2, 3, 4 Validation set % error 4.43 3.84 3.01 2.67 2.28 Test set % error 1.82 Table 1: The gain in classification accuracy on the validation set as the following enhancements are added: 1) extra stroke for dashed ones and sevens, 2) local search, 3) PCA model of image residual, and 4) PCA trajectory prior. To avoid using the test set for model selection, the performance on the official test set was only measured for the final system. 6 Discussion After training a single neural network to output both the class label and the motor program for all classes (as described in section 1) we tried ignoring the label output and classifying the test images by using the cost, under 10 different PCA models, of the trajectory defined by the inferred motor program. Each PCA model was fitted to the trajectories extracted from the training images for a given class. This gave 1.80% errors which is as good as the 1.82% we got using the 10 separate recognition networks and local search. 
This is quite surprising because the motor programs produced by the single network were simplified to make them all have the same dimensionality and they produced significantly poorer reconstructions. By only using the 10 digit-specific recognition nets to create the motor programs for the training data, we get much faster recognition of test data because at test time we can use a single recognition network for all classes. It also means we do not need to trade-off prior scores against image residual scores because there is only one image residual. The ability to extract motor programs could also be used to enhance the training set. [7] shows that error rates can be halved by using smooth vector distortion fields to create extra training data. They argue that these fields simulate “uncontrolled oscillations of the hand muscles dampened by inertia”. Motor noise may be better modelled by adding noise to an actual motor program as shown in figure 1. Notice that this produces a wide variety of non-blurry images and it can also change the topology. The techniques we have used for extracting motor programs from digit images may be applicable to speech. There are excellent generative models that can produce almost perfect speech if they are given the right formant parameters [10]. Using one of these generative models we may be able to train a large number of specialized recognition networks to extract formant parameters from speech without requiring labeled training data. Once this has been done, labeled data would be available for training a single feed-forward network that could recover accurate formant parameters which could be used for real-time recognition. Acknowledgements We thank Steve Isard, David MacKay and Allan Jepson for helpful discussions. This research was funded by NSERC, CFI and OIT. GEH is a fellow of the Canadian Institute for Advanced Research and holds a Canada Research Chair in machine learning. References [1] D. M. MacKay. Mindlike behaviour in artefacts. British Journal for Philosophy of Science, 2:105–121, 1951. [2] M. Halle and K. Stevens. Speech recognition: A model and a program for research. IRE Transactions on Information Theory, IT-8 (2):155–159, 1962. [3] Murray Eden. Handwriting and pattern recognition. IRE Transactions on Information Theory, IT-8 (2):160–166, 1962. [4] J.M. Hollerbach. An oscillation theory of handwriting. Biological Cybernetics, 39:139–156, 1981. [5] G. Mayraz and G. E. Hinton. Recognizing hand-written digits using hierarchical products of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:189–197, 2001. [6] D. Decoste and B. Schoelkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002. [7] Patrice Y. Simard, Dave Steinkraus, and John Platt. Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recogntion (ICDAR), IEEE Computer Society, Los Alamitos, pages 958–962, 2003. [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. [9] A. Jacobs R. Increased Rates of Convergence Through Learning Rate Adaptation. Technical Report: UM-CS-1987-117. University of Massachusetts, Amherst, MA, 1987. [10] W. Holmes, J. Holmes, and M. Judd. Extension of the bandwith of the jsru parallel-formant synthesizer for high quality synthesis of male and female speech. In Proceedings of ICASSP 90 (1), pages 313–316, 1990.
4 0.062717497 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy
Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1
5 0.057466175 136 nips-2005-Noise and the two-thirds power Law
Author: Uri Maoz, Elon Portugaly, Tamar Flash, Yair Weiss
Abstract: The two-thirds power law, an empirical law stating an inverse non-linear relationship between the tangential hand speed and the curvature of its trajectory during curved motion, is widely acknowledged to be an invariant of upper-limb movement. It has also been shown to exist in eyemotion, locomotion and was even demonstrated in motion perception and prediction. This ubiquity has fostered various attempts to uncover the origins of this empirical relationship. In these it was generally attributed either to smoothness in hand- or joint-space or to the result of mechanisms that damp noise inherent in the motor system to produce the smooth trajectories evident in healthy human motion. We show here that white Gaussian noise also obeys this power-law. Analysis of signal and noise combinations shows that trajectories that were synthetically created not to comply with the power-law are transformed to power-law compliant ones after combination with low levels of noise. Furthermore, there exist colored noise types that drive non-power-law trajectories to power-law compliance and are not affected by smoothing. These results suggest caution when running experiments aimed at verifying the power-law or assuming its underlying existence without proper analysis of the noise. Our results could also suggest that the power-law might be derived not from smoothness or smoothness-inducing mechanisms operating on the noise inherent in our motor system but rather from the correlated noise which is inherent in this motor system. 1
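As a quick numerical illustration of this abstract's central claim, one can smooth a white-Gaussian-noise trajectory and regress log-speed on log-curvature; a slope near −1/3 indicates compliance with the two-thirds power law. The filter width and fitting procedure below are illustrative choices, not the authors' methods:

    import numpy as np

    # Speed-curvature power-law check on filtered 2-D Gaussian noise.
    rng = np.random.default_rng(0)
    noise = rng.normal(size=(20000, 2))
    k = np.hanning(51); k /= k.sum()                 # low-pass filter
    traj = np.column_stack([np.convolve(noise[:, i], k, mode='same')
                            for i in range(2)])
    d1 = np.gradient(traj, axis=0)                   # velocity
    d2 = np.gradient(d1, axis=0)                     # acceleration
    speed = np.hypot(d1[:, 0], d1[:, 1])
    curv = (np.abs(d1[:, 0] * d2[:, 1] - d1[:, 1] * d2[:, 0])
            / np.maximum(speed, 1e-12) ** 3)
    mask = (speed > 1e-8) & (curv > 1e-8)
    slope, _ = np.polyfit(np.log(curv[mask]), np.log(speed[mask]), 1)
    print(slope)                                     # close to -1/3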
6 0.05642328 118 nips-2005-Learning in Silicon: Timing is Everything
7 0.045730684 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception
8 0.045611877 135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI
9 0.0444437 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects
10 0.044435538 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity
11 0.044257876 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface
12 0.041092593 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
13 0.03949786 106 nips-2005-Large-scale biophysical parameter estimation in single neurons via constrained linear regression
14 0.039412174 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects
15 0.035022948 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
16 0.034296039 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
17 0.033921186 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity
18 0.033495016 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account
19 0.033291563 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
20 0.032641917 181 nips-2005-Spiking Inputs to a Winner-take-all Network
topicId topicWeight
[(0, 0.098), (1, -0.104), (2, -0.015), (3, 0.004), (4, 0.007), (5, 0.022), (6, -0.003), (7, -0.049), (8, 0.012), (9, 0.017), (10, -0.008), (11, -0.005), (12, 0.017), (13, -0.071), (14, 0.067), (15, -0.056), (16, -0.082), (17, 0.026), (18, 0.018), (19, 0.018), (20, 0.013), (21, -0.041), (22, -0.041), (23, 0.017), (24, -0.0), (25, 0.066), (26, -0.052), (27, 0.056), (28, 0.041), (29, -0.006), (30, -0.153), (31, -0.005), (32, -0.044), (33, -0.084), (34, 0.032), (35, 0.041), (36, -0.125), (37, 0.022), (38, 0.057), (39, 0.151), (40, -0.146), (41, -0.1), (42, 0.123), (43, -0.034), (44, 0.024), (45, 0.166), (46, 0.049), (47, 0.005), (48, 0.035), (49, 0.107)]
simIndex simValue paperId paperTitle
same-paper 1 0.94534707 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning
Author: Naoki Masuda, Shun-ichi Amari
Abstract: There is a long-standing controversy over the site of cerebellar motor learning. Different theories and experimental results suggest that either the cerebellar flocculus or the brainstem learns the task and stores the memory. With a dynamical system approach, we clarify the mechanism of transferring the memory generated in the flocculus to the brainstem and that of so-called savings phenomena. The brainstem learning must comply with a sort of Hebbian rule depending on Purkinje-cell activities. In contrast to earlier numerical models, our model is simple but it accommodates explanations and predictions of experimental situations as qualitative features of trajectories in the phase space of synaptic weights, without fine parameter tuning. 1
2 0.51566088 40 nips-2005-CMOL CrossNets: Possible Neuromorphic Nanoelectronic Circuits
Author: Jung Hoon Lee, Xiaolong Ma, Konstantin K. Likharev
Abstract: Hybrid “CMOL” integrated circuits, combining a CMOS subsystem with nanowire crossbars and simple two-terminal nanodevices, promise to extend the exponential Moore-Law development of microelectronics into the sub-10-nm range. We are developing neuromorphic network (“CrossNet”) architectures for this future technology, in which neural cell bodies are implemented in CMOS, nanowires are used as axons and dendrites, while nanodevices (bistable latching switches) are used as elementary synapses. We have shown how CrossNets may be trained to perform pattern recovery and classification despite the limitations imposed by the CMOL hardware. Preliminary estimates have shown that CMOL CrossNets may be extremely dense (~10^7 cells per cm^2) and operate approximately a million times faster than biological neural networks, at manageable power consumption. In conclusion, we discuss in brief possible short-term and long-term applications of the emerging technology. 1 Introduction: CMOL Circuits Recent results [1, 2] indicate that the current VLSI paradigm based on CMOS technology can hardly be extended beyond the 10-nm frontier: in this range the sensitivity of parameters (most importantly, the gate voltage threshold) of silicon field-effect transistors to inevitable fabrication spreads grows exponentially. This sensitivity will probably send fabrication facility costs skyrocketing, and may lead to the end of Moore’s Law some time during the next decade. There is a growing consensus that the impending Moore’s Law crisis may be preempted by a radical paradigm shift from the purely CMOS technology to hybrid CMOS/nanodevice circuits, e.g., those of the “CMOL” variety (Fig. 1). Such circuits (see, e.g., Ref. 3 for their recent review) would combine a level of advanced CMOS devices fabricated by lithographic patterning, and a two-layer nanowire crossbar formed, e.g., by nanoimprint, with nanowires connected by simple, similar, two-terminal nanodevices at each crosspoint. For such devices, molecular single-electron latching switches [4] are presently the leading candidates, in particular because they may be fabricated using the self-assembled monolayer (SAM) technique, which has already given reproducible results for simpler molecular devices [5]. Fig. 1. CMOL circuit: (a) schematic side view, and (b) top-view zoom-in on several adjacent interface pins. (For clarity, only two adjacent nanodevices are shown.) In order to overcome the CMOS/nanodevice interface problems pertinent to earlier proposals of hybrid circuits [6], in CMOL the interface is provided by pins that are distributed all over the circuit area, on top of the CMOS stack. This allows the use of advanced techniques of nanowire patterning (like nanoimprint) which do not have nanoscale accuracy of layer alignment [3]. The vital feature of this interface is the tilt, by angle α = arcsin(Fnano/βFCMOS), of the nanowire crossbar relative to the square arrays of interface pins (Fig. 1b). Here Fnano is the nanowiring half-pitch, FCMOS is the half-pitch of the CMOS subsystem, and β is a dimensionless factor larger than 1 that depends on the CMOS cell complexity. Figure 1b shows that this tilt allows the CMOS subsystem to address each nanodevice even if Fnano << βFCMOS.
By now, it has been shown that CMOL circuits can combine high performance with high defect tolerance (which is necessary for any circuit using nanodevices) for several digital applications. In particular, CMOL circuits with defect rates below a few percent would enable terabit-scale memories [7], while the performance of FPGA-like CMOL circuits may be several hundred times above that of purely CMOS FPGAs (implemented with the same FCMOS), at acceptable power dissipation and defect tolerance above 20% [8]. In addition, the very structure of CMOL circuits makes them uniquely suitable for the implementation of more complex, mixed-signal information processing systems, including ultradense and ultrafast neuromorphic networks. The objective of this paper is to describe in brief the current status of our work on the development of so-called Distributed Crossbar Networks (“CrossNets”) that could provide high performance despite the limitations imposed by CMOL hardware. A more detailed description of our earlier results may be found in Ref. 9. 2 Synapses The central device of CrossNet is a two-terminal latching switch [3, 4] (Fig. 2a), which is a combination of two single-electron devices, a transistor and a trap [3]. The device may be naturally implemented as a single organic molecule (Fig. 2b). Qualitatively, the device operates as follows: if voltage V = Vj – Vk applied between the external electrodes (in CMOL, nanowires) is low, the trap island has no net electric charge, and the single-electron transistor is closed. If voltage V approaches a certain threshold value V+ > 0, an additional electron is inserted into the trap island, and its field lifts the Coulomb blockade of the single-electron transistor, thus connecting the nanowires. The switch state may be reset (e.g., wires disconnected) by applying a lower voltage V < V- < V+. Due to the random character of single-electron tunneling [2], the quantitative description of the switch is by necessity probabilistic: actually, V determines only the rates Γ↑↓ of device switching between its ON and OFF states. The rates, in turn, determine the dynamics of the probability p that the transistor is open (i.e., wires connected): dp/dt = Γ↑(1 − p) − Γ↓p, (1) where, by the theory of single-electron tunneling [2], the rates may in a good approximation be presented as Γ↑↓ = Γ0 exp{±e(V − S)/kBT}, (2) with Γ0 and S constants depending on physical parameters of the latching switches. Fig. 2. (a) Schematics and (b) possible molecular implementation of the two-terminal single-electron latching switch. Note that despite the random character of switching, the strong nonlinearity of Eq. (2) makes it possible to limit the degree of the device “fuzziness”. 3 CrossNets Figure 3a shows the generic structure of a CrossNet. CMOS-implemented somatic cells (within the Fire Rate model, just nonlinear differential amplifiers, see Fig. 3b,c) apply their output voltages to “axonic” nanowires. If the latching switch, working as an elementary synapse, on the crosspoint of an axonic wire with the perpendicular “dendritic” wire is open, some current flows into the latter wire, charging it.
Since such currents are injected into each dendritic wire through several (many) open synapses, their addition provides a natural passive analog summation of signals from the corresponding somas, typical for all neural networks. Examining Fig. 3a, note the open-circuit terminations of the axonic and dendritic lines at the borders of the somatic cells; due to these terminations the somas do not communicate directly (but only via synapses). The network shown in Fig. 3 is evidently feedforward; recurrent networks are obtained simply by doubling the number of synapses and nanowires per somatic cell (Fig. 3c). Moreover, using a dual-rail (bipolar) representation of the signal, and hence doubling the number of nanowires and elementary synapses once again, one gets a CrossNet with somas coupled by compact 4-switch groups [9]. Using Eqs. (1) and (2), it is straightforward to show that the average synaptic weight wjk of the group obeys the “quasi-Hebbian” rule: dwjk/dt = −4Γ0 sinh(γS) sinh(γVj) sinh(γVk). (3) [Fig. 3. (a) Generic structure of the simplest (feedforward, non-Hebbian) CrossNet. Red lines show “axonic” and blue lines “dendritic” nanowires. Gray squares are interfaces between nanowires and CMOS-based somas (b, c). Signs show the dendrite input polarities. Green circles denote molecular latching switches forming elementary synapses. Bold red and blue points are open-circuit terminations of the nanowires, which do not allow somas to interact in bypass of the synapses.] In the simplest cases (e.g., quasi-Hopfield networks with finite connectivity), the tri-level synaptic weights of the generic CrossNets are quite satisfactory, leading to just a very modest (~30%) network capacity loss. However, some applications (in particular, pattern classification) may require a larger number of weight quantization levels L (e.g., L ≈ 30 for 1% fidelity [9]). This may be achieved by using compact square arrays (e.g., 4×4) of latching switches (Fig. 4). Various species of CrossNets [9] also differ by the way the somatic cells are distributed around the synaptic field. Figure 5 shows feedforward versions of the two CrossNet types most explored so far: the so-called FlossBar and InBar. The former network is more natural for the implementation of multilayered perceptrons (MLP), while the latter system is preferable for recurrent network implementations and also allows a simpler CMOS design of somatic cells. The most important advantage of CrossNets over the hardware neural networks suggested earlier is that these networks make it possible to achieve enormous density combined with large cell connectivity M >> 1 in quasi-2D electronic circuits. 4 CrossNet training CrossNet training faces several hardware-imposed challenges: (i) The synaptic weight contribution provided by the elementary latching switch is binary, so that for most applications the multi-switch synapses (Fig. 4) are necessary. (ii) The only way to adjust any particular synaptic weight is to turn ON or OFF the corresponding latching switch(es). This is only possible by applying a certain voltage V = Vj – Vk between the two corresponding nanowires; during this procedure, other nanodevices attached to the same wires should not be disturbed. (iii) As stated above, synapse state switching is a statistical process, so that the degree of its “fuzziness” should be carefully controlled.
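As an aside on Eq. (3): the sketch below tabulates the sign of the average weight change for the four sign combinations of the pre- and post-synaptic voltages, showing the Hebbian character of the rule (same-sign activity strengthens the weight, opposite-sign activity weakens it). The constants γ, Γ0, and S are illustrative assumptions.

```python
import math
from itertools import product

def dw_dt(v_j, v_k, gamma=5.0, gamma0=1.0, s=-0.2):
    """Eq. (3): dw_jk/dt = -4*Gamma0*sinh(gamma*S)*sinh(gamma*Vj)*sinh(gamma*Vk)."""
    return (-4.0 * gamma0 * math.sinh(gamma * s)
            * math.sinh(gamma * v_j) * math.sinh(gamma * v_k))

# With S < 0 the prefactor -4*Gamma0*sinh(gamma*S) is positive, so the weight
# grows when Vj and Vk have the same sign and decays when they differ.
for v_j, v_k in product((+0.1, -0.1), repeat=2):
    print(f"Vj={v_j:+.1f}, Vk={v_k:+.1f} -> dw/dt = {dw_dt(v_j, v_k):+.3f}")
```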
[Fig. 4. Composite synapse providing L = 2n²+1 discrete levels of the weight, in (a) operation and (b) weight-adjustment modes. The dark-gray rectangles are resistive metallic strips at the soma/nanowire interfaces.] [Fig. 5. Two main CrossNet species: (a) FlossBar and (b) InBar, in the generic (feedforward, non-Hebbian, ternary-weight) case for the connectivity parameter M = 9. Only the nanowires and nanodevices coupling one cell (red dashed lines) to M post-synaptic cells (blue dashed lines) are shown; actually all the cells are similarly coupled.] We have shown that these challenges may be met using (at least) the following training methods [9]: (i) Synaptic weight import. This procedure starts with training a homomorphic “precursor” artificial neural network with continuous synaptic weights wjk, implemented in software, using one of the established methods (e.g., error backpropagation). Then the synaptic weights wjk are transferred to the CrossNet, with some “clipping” (rounding) due to the binary nature of elementary synaptic weights. To accomplish the transfer, pairs of somatic cells are sequentially selected via CMOS-level wiring. Using the flexibility of CMOS circuitry, these cells are reconfigured to apply external voltages ±VW to the axonic and dendritic nanowires leading to a particular synapse, while all other nanowires are grounded. The voltage level VW is selected so that it does not switch the synapses attached to only one of the selected nanowires, while the voltage 2VW applied to the synapse at the crosspoint of the selected wires is sufficient for its reliable switching. (In the composite synapses with quasi-continuous weights (Fig. 4), only a part of the corresponding switches is turned ON or OFF.) (ii) Error backpropagation. The synaptic weight import procedure is straightforward when wjk may be simply calculated, e.g., for Hopfield-type networks. However, for very large CrossNets used, e.g., as pattern classifiers, the precursor network training may take an impracticably long time. In this case the direct training of a CrossNet may become necessary. We have developed two methods of such training, both based on “Hebbian” synapses consisting of 4 elementary synapses (latching switches) whose average weight dynamics obeys Eq. (3). This quasi-Hebbian rule may be used to implement the backpropagation algorithm either using periodic time-multiplexing [9] or in a continuous fashion, using the simultaneous propagation of signals and errors along the same dual-rail channels. As a result, presently we may state that CrossNets may be taught to perform virtually all major functions demonstrated earlier with the usual neural networks, including corrupted pattern restoration in the recurrent quasi-Hopfield mode and pattern classification in the feedforward MLP mode [11]. 5 CrossNet performance estimates The significance of this result may only be appreciated in the context of the unparalleled physical parameters of CMOL CrossNets. The only fundamental limitation on the half-pitch Fnano (Fig. 1) comes from quantum-mechanical tunneling between nanowires. If the wires are separated by vacuum, the corresponding specific leakage conductance becomes uncomfortably large (~10⁻¹² Ω⁻¹m⁻¹) only at Fnano = 1.5 nm; however, since realistic insulation materials (SiO2, etc.)
provide somewhat lower tunnel barriers, let us use a more conservative value Fnano = 3 nm. Note that this value corresponds to 10¹² elementary synapses per cm², so that for 4M = 10⁴ and n = 4 the areal density of neural cells is close to 2×10⁷ cm⁻². Both numbers are higher than those for the human cerebral cortex, despite the fact that the quasi-2D CMOL circuits have to compete with the quasi-3D cerebral cortex. With the typical specific capacitance of 3×10⁻¹⁰ F/m = 0.3 aF/nm, this gives nanowire capacitance C0 ≈ 1 aF per working elementary synapse, because the corresponding segment has length 4Fnano. The CrossNet operation speed is determined mostly by the time constant τ0 of dendrite nanowire capacitance recharging through the resistances of open nanodevices. Since both the relevant conductance and capacitance increase similarly with M and n, τ0 ≈ R0C0. The possibilities of reducing R0, and hence τ0, are limited mostly by the acceptable power dissipation per unit area, which is close to V0²/[(2Fnano)²R0]. For room-temperature operation, the voltage scale V0 ≈ Vt should be of the order of at least 30 kBT/e ≈ 1 V to avoid thermally-induced errors [9]. With our number for Fnano, and a relatively high but acceptable power consumption of 100 W/cm², we get R0 ≈ 10¹⁰ Ω (which is a very realistic value for single-molecule single-electron devices like the one shown in Fig. 2). With this number, τ0 is as small as ~10 ns. This means that the CrossNet speed may be approximately six orders of magnitude (!) higher than that of biological neural networks. Even scaling R0 up by a factor of 100, to bring power consumption to a more comfortable level of 1 W/cm², would still leave us at least a four-orders-of-magnitude speed advantage. 6 Discussion: Possible applications These estimates make us believe that CMOL CrossNet chips may revolutionize neuromorphic network applications. Let us start with the example of relatively small (1-cm²-scale) chips used for recognition of a face in a crowd [11]. The most difficult feature of such recognition is the search for the face location, i.e. the optimal placement of a face on the image relative to the panel providing input for the processing network. The enormous density and speed of CMOL hardware make it possible to time-and-space-multiplex this task (Fig. 6). In this approach, the full image (say, formed by CMOS photodetectors on the same chip) is divided into P rectangular panels of h×w pixels, corresponding to the expected size and approximate shape of a single face. A CMOS-implemented communication channel passes input data from each panel to the corresponding CMOL neural network, providing its shift in time, say using the TV scanning pattern (red line in Fig. 6). The standard methods of image classification require the network to have just a few hidden layers, so that the time interval Δt necessary for each mapping position may be so short that the total pattern recognition time T = hwΔt may be acceptable even for online face recognition. [Fig. 6. Scan mapping of the input image on CMOL CrossNet inputs. Red lines show the possible time sequence of image pixels sent to a certain input of the network processing the image from the upper-left panel of the pattern.] Indeed, let us consider a 4-Megapixel image partitioned into 4K 32×32-pixel panels (h = w = 32). Each such panel will require an MLP net with several (say, four) layers with 1K cells each in order to compare the panel image with ~10³ stored faces.
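The back-of-the-envelope numbers above are easy to reproduce; the sketch below recomputes the nanowire time constant, the speed ratio relative to biology, and the frame-processing time T = h·w·Δt from the stated inputs (with Δt ≈ 100 ns, as estimated in the next paragraph). The ~10-ms biological time scale is the rough figure implied by the text; everything here is order-of-magnitude only.

```python
import math

# Order-of-magnitude CrossNet performance estimates, using the figures quoted above.
R0 = 1e10            # open-switch resistance, ohms
C0 = 1e-18           # dendritic nanowire capacitance per synapse segment (~1 aF)
tau0 = R0 * C0       # recharging time constant tau0 ~ R0 * C0
print(f"tau0 ~ {tau0 * 1e9:.0f} ns")                         # -> 10 ns

bio = 1e-2           # ~10 ms characteristic time of biological neural networks
print(f"speed advantage ~ 10^{math.log10(bio / tau0):.0f}")  # -> 10^6

h = w = 32           # panel size in pixels
dt = 100e-9          # ~100 ns per mapping position
print(f"frame time T = h*w*dt ~ {h * w * dt * 1e6:.0f} us")  # -> ~102 us
```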
With the feasible 4-nm nanowire half-pitch, and 65-level synapses (sufficient for better than 99% fidelity [9]), each interlayer crossbar would require a chip area of about (1K×64 nm)² = 64×64 μm², fitting 4×4K of them on a ~0.6 cm² chip. (The CMOS somatic-layer and communication-system overheads are negligible.) With an acceptable power consumption of the order of 10 W/cm², the input-to-output signal propagation in such a network will take only about 50 ns, so that Δt may be of the order of 100 ns and the total time T = hwΔt of processing one frame of the order of 100 microseconds, much shorter than the typical TV frame time of ~10 milliseconds. The remaining two-orders-of-magnitude time gap may be used, for example, for double-checking the results by stopping the scan mapping (Fig. 6) at the most promising position. (For this, a simple feedback from the recognition output to the mapping communication system is necessary.) It is instructive to compare the estimated CMOL chip speed with that of the implementation of a similar parallel network ensemble on a CMOS signal processor (say, also combined on the same chip with an array of CMOS photodetectors). Even assuming an extremely high performance of 30 billion additions/multiplications per second, we would need ~4×4K×1K×(4K)²/(30×10⁹) ≈ 10⁴ seconds ≈ 3 hours per frame, evidently incompatible with online image-stream processing. Let us finish with a brief (and much more speculative) discussion of possible long-term prospects of CMOL CrossNets. Eventually, large-scale (~30×30 cm²) CMOL circuits may become available. According to the estimates given in the previous section, the integration scale of such a system (in terms of both neural cells and synapses) will be comparable with that of the human cerebral cortex. Equipped with a set of broadband sensor/actuator interfaces, such a (necessarily hierarchical) system may be capable, after a period of initial supervised training, of further self-training in the process of interaction with its environment, at a speed several orders of magnitude higher than that of its biological prototypes. Needless to say, the successful development of such self-developing systems would have a major impact not only on all information technologies, but also on society as a whole. Acknowledgments This work has been supported in part by the AFOSR, MARCO (via FENA Center), and NSF. Valuable contributions made by Simon Fölling, Özgür Türel and Ibrahim Muckra, as well as useful discussions with P. Adams, J. Barhen, D. Hammerstrom, V. Protopopescu, T. Sejnowski, and D. Strukov are gratefully acknowledged. References
[1] Frank, D. J. et al. (2001) Device scaling limits of Si MOSFETs and their application dependencies. Proc. IEEE 89(3): 259-288.
[2] Likharev, K. K. (2003) Electronics below 10 nm, in J. Greer et al. (eds.), Nano and Giga Challenges in Microelectronics, pp. 27-68. Amsterdam: Elsevier.
[3] Likharev, K. K. & Strukov, D. B. (2005) CMOL: Devices, circuits, and architectures, in G. Cuniberti et al. (eds.), Introducing Molecular Electronics, Ch. 16. Berlin: Springer.
[4] Fölling, S., Türel, Ö. & Likharev, K. K. (2001) Single-electron latching switches as nanoscale synapses, in Proc. of the 2001 Int. Joint Conf. on Neural Networks, pp. 216-221. Mount Royal, NJ: Int. Neural Network Society.
[5] Wang, W. et al. (2003) Mechanism of electron conduction in self-assembled alkanethiol monolayer devices. Phys. Rev. B 68(3): 035416 1-8.
[6] Stan, M. et al.
(2003) Molecular electronics: From devices and interconnect to circuits and architecture. Proc. IEEE 91(11): 1940-1957.
[7] Strukov, D. B. & Likharev, K. K. (2005) Prospects for terabit-scale nanoelectronic memories. Nanotechnology 16(1): 137-148.
[8] Strukov, D. B. & Likharev, K. K. (2005) CMOL FPGA: A reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices. Nanotechnology 16(6): 888-900.
[9] Türel, Ö. et al. (2004) Neuromorphic architectures for nanoelectronic circuits. Int. J. of Circuit Theory and Appl. 32(5): 277-302.
[10] See, e.g., Hertz, J. et al. (1991) Introduction to the Theory of Neural Computation. Cambridge, MA: Perseus.
[11] Lee, J. H. & Likharev, K. K. (2005) CrossNets as pattern classifiers. Lecture Notes in Computer Science 3575: 434-441.
3 0.50496274 136 nips-2005-Noise and the two-thirds power Law
Author: Uri Maoz, Elon Portugaly, Tamar Flash, Yair Weiss
Abstract: The two-thirds power law, an empirical law stating an inverse non-linear relationship between the tangential hand speed and the curvature of its trajectory during curved motion, is widely acknowledged to be an invariant of upper-limb movement. It has also been shown to exist in eye motion and locomotion, and it has even been demonstrated in motion perception and prediction. This ubiquity has fostered various attempts to uncover the origins of this empirical relationship. In these attempts it was generally attributed either to smoothness in hand- or joint-space, or to mechanisms that damp noise inherent in the motor system to produce the smooth trajectories evident in healthy human motion. We show here that white Gaussian noise also obeys this power law. Analysis of signal and noise combinations shows that trajectories that were synthetically created not to comply with the power law are transformed into power-law-compliant ones after combination with low levels of noise. Furthermore, there exist colored noise types that drive non-power-law trajectories to power-law compliance and are not affected by smoothing. These results suggest caution when running experiments aimed at verifying the power law, or when assuming its underlying existence, without proper analysis of the noise. Our results could also suggest that the power law might derive not from smoothness or smoothness-inducing mechanisms operating on the noise inherent in our motor system, but rather from the correlated noise which is inherent in this motor system. 1
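As a quick illustration of the kind of analysis the abstract describes, the sketch below generates a smoothed random planar trajectory, computes tangential speed and curvature by finite differences, and fits the power-law exponent by linear regression in log-log space (the two-thirds law corresponds to v ∝ κ^(−1/3)). The smoothing window and trajectory length are arbitrary demo choices, not the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# A planar trajectory made of white Gaussian noise, lightly smoothed so that
# finite-difference derivatives are well behaved.
n, win = 20000, 9
kernel = np.ones(win) / win
x = np.convolve(rng.standard_normal(n), kernel, mode="valid")
y = np.convolve(rng.standard_normal(n), kernel, mode="valid")

dx, dy = np.gradient(x), np.gradient(y)        # velocity (unit time step)
ddx, ddy = np.gradient(dx), np.gradient(dy)    # acceleration

v = np.hypot(dx, dy)                                             # tangential speed
kappa = np.abs(dx * ddy - dy * ddx) / np.maximum(v, 1e-12) ** 3  # curvature

mask = (v > 1e-8) & (kappa > 1e-8)
beta, _ = np.polyfit(np.log(kappa[mask]), np.log(v[mask]), 1)
print(f"fitted exponent: {beta:.3f} (the two-thirds law predicts -1/3)")
```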
4 0.49954379 188 nips-2005-Temporally changing synaptic plasticity
Author: Minija Tamosiunaite, Bernd Porr, Florentin Wörgötter
Abstract: Recent experimental results suggest that dendritic and back-propagating spikes can influence synaptic plasticity in different ways [1]. In this study we investigate how these signals could temporally interact at dendrites, leading to changing plasticity properties at local synapse clusters. Similar to a previous study [2], we employ a differential Hebbian plasticity rule to emulate spike-timing-dependent plasticity. We use dendritic (D-) and back-propagating (BP-) spikes as post-synaptic signals in the learning rule and investigate how their interaction will influence plasticity. We analyze a situation where synapse plasticity characteristics change in the course of time, depending on the type of post-synaptic activity momentarily elicited. Starting with weak synapses, which only elicit local D-spikes, a slow, unspecific growth process is induced. As soon as the soma begins to spike, this process is replaced by fast synaptic changes as a consequence of the much stronger and sharper BP-spike, which now dominates the plasticity rule. In this way a winner-take-all mechanism emerges in a two-stage process, enhancing the best-correlated inputs. These results suggest that synaptic plasticity is a temporally changing process by which the computational properties of dendrites or complete neurons can be substantially augmented. 1
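The differential Hebbian rule the abstract refers to changes a weight in proportion to the presynaptic trace times the time derivative of the postsynaptic signal, dw/dt ∝ u(t)·v′(t). The sketch below applies such a rule with two hypothetical postsynaptic waveforms, a broad "D-spike" and a sharp "BP-spike", to show that the sharper signal produces larger weight changes; all kernel shapes and constants are illustrative assumptions, not the authors' parameters.

```python
import numpy as np

dt = 0.1  # ms
t = np.arange(0.0, 100.0, dt)

def alpha_kernel(t, t0, tau):
    """Alpha-function waveform starting at t0 with time constant tau."""
    s = np.clip(t - t0, 0.0, None)
    return (s / tau) * np.exp(1.0 - s / tau)

def weight_change(pre, post, mu=1.0):
    """Differential Hebbian rule: dw/dt = mu * u(t) * dv/dt, integrated over time."""
    return mu * np.sum(pre * np.gradient(post, dt)) * dt

pre = alpha_kernel(t, t0=40.0, tau=5.0)             # presynaptic trace
d_spike = 0.3 * alpha_kernel(t, t0=45.0, tau=20.0)  # broad, weak dendritic spike
bp_spike = 1.0 * alpha_kernel(t, t0=45.0, tau=2.0)  # sharp, strong BP spike

print(f"dw from D-spike : {weight_change(pre, d_spike):+.4f}")
print(f"dw from BP-spike: {weight_change(pre, bp_spike):+.4f}")
```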
5 0.49677107 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. The inferred motor programs can be used directly for digit classification, but they can also be used in other ways. By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. We can also use the motor programs as additional, highly informative outputs which reduce overfitting when training a feed-forward classifier. 1 Overview The idea that patterns can be recognized by figuring out how they were generated has been around for at least half a century [1, 2] and one of the first proposed applications was the recognition of handwriting using a generative model that involved pairs of opposing springs [3, 4]. The “analysis-by-synthesis” approach is attractive because the true generative model should provide the most natural way to characterize a class of patterns. The handwritten 2’s in figure 1, for example, are very variable when viewed as pixels but they have very similar motor programs. Despite its obvious merits, analysis-by-synthesis has had few successes, partly because it is computationally expensive to invert non-linear generative models and partly because the underlying parameters of the generative model are unknown for most large data sets. For example, the only source of information about how the MNIST digits were drawn is the images themselves. We describe a simple generative model in which a pen is controlled by two pairs of opposing springs whose stiffnesses are specified by a motor program. If the sequence of stiffnesses is specified correctly, the model can produce images which look very like the MNIST digits. Using a separate network for each digit class, we show that backpropagation can be used to learn a “recognition” network that maps images to the motor programs required to produce them. An interesting aspect of this learning is that the network creates its own training data, so it does not require the training images to be labelled with motor programs. Each recognition network starts with a single example of a motor program and grows an “island of competence” around this example, progressively extending the region over which it can map small changes in the image to the corresponding small changes in the motor program (see figure 2). Figure 1: An MNIST image of a 2 and the additional images that can be generated by inferring the motor program and then adding random noise to it. The pixels are very different, but they are all clearly twos. Fairly good digit recognition can be achieved by using the 10 recognition networks to find 10 motor programs for a test image and then scoring each motor program by its squared error in reconstructing the image. The 10 scores are then fed into a softmax classifier. Recognition can be improved by using PCA to model the distribution of motor trajectories for each class and using the distance of a motor trajectory from the relevant PCA hyperplane as an additional score. Each recognition network is solving a difficult global search problem in which the correct motor program must be found by a single, “open-loop” pass through the network. 
More accurate recognition can be achieved by using this open-loop global search to initialize an iterative, closed-loop local search which uses the error in the reconstructed image to revise the motor program. This requires reconstruction errors in pixel space to be mapped to corrections in the space of spring stiffnesses. We cannot backpropagate errors through the generative model because it is just a hand-coded computer program. So we learn “generative” networks, one per digit class, that emulate the generator. After learning, backpropagation through these generative networks is used to convert pixel reconstruction errors into stiffness corrections. Our final system gives 1.82% error on the MNIST test set which is similar to the 1.7% achieved by a very different generative approach [5] but worse than the 1.53% produced by the best backpropagation networks or the 1.4% produced by support vector machines [6]. It is much worse than the 0.4% produced by convolutional neural networks that use cleverly enhanced training sets [7]. Recognition of test images is quite slow because it uses ten different recognition networks followed by iterative local search. There is, however, a much more efficient way to make use of our ability to extract motor programs. They can be treated as additional output labels when using backpropagation to train a single, multilayer, discriminative neural network. These additional labels act as a very informative regularizer that reduces the error rate from 1.53% to 1.27% in a network with two hidden layers of 500 units each. This is a new method of improving performance that can be used in conjunction with other tricks such as preprocessing the images, enhancing the training set or using convolutional neural nets [8, 7]. 2 A simple generative model for drawing digits The generative model uses two pairs of opposing springs at right angles. One end of each spring is attached to a frictionless horizontal or vertical rail that is 39 pixels from the center of the image. The other end is attached to a “pen” that has significant mass. The springs themselves are weightless and have zero rest length. The pen starts at the equilibrium position defined by the initial stiffnesses of the four springs. It then follows a trajectory that is determined by the stiffness of each spring at each of the 16 subsequent time steps in the motor program. The mass is large compared with the rate at which the stiffnesses change, so the system is typically far from equilibrium as it follows the smooth trajectory. On each time step, the momentum is multiplied by 0.9 to simulate viscosity. A coarse-grain trajectory is computed by using one step of forward integration for each time step in the motor program, so it contains 17 points. The code is at www.cs.toronto.edu/∼ hinton/code. Figure 2: The training data for each class-specific recognition network is produced by adding noise to motor programs that are inferred from MNIST images using the current parameters of the recognition network. To initiate this process, the biases of the output units are set by hand so that they represent a prototypical motor program for the class. Given a coarse-grain trajectory, we need a way of assigning an intensity to each pixel. We tried various methods until we hand-evolved one that was able to reproduce the MNIST images fairly accurately, but we suspect that many other methods would be just as good. 
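The pen dynamics of Section 2 are simple enough to simulate directly. The sketch below implements one reading of the description: four zero-rest-length springs run from the pen to frictionless rails at ±39 pixels, each spring pulls only along its own axis, momentum is scaled by 0.9 each step, and one forward-integration step is taken per motor-program time step, yielding 17 trajectory points. The pen mass and the random stiffness values are illustrative assumptions. The ink model that turns such a trajectory into pixels is described next.

```python
import numpy as np

RAIL = 39.0   # rails at +/-39 pixels from the image centre
MASS = 30.0   # pen mass (illustrative; the paper only says it is "significant")

def pen_trajectory(stiffness: np.ndarray) -> np.ndarray:
    """stiffness: (16, 4) array of spring constants (left, right, down, up)
    per time step. Returns the (17, 2) coarse-grain pen trajectory."""
    k_l, k_r, k_d, k_u = stiffness[0]
    # Start at the equilibrium position set by the initial stiffnesses.
    pos = np.array([RAIL * (k_r - k_l) / (k_r + k_l),
                    RAIL * (k_u - k_d) / (k_u + k_d)])
    vel = np.zeros(2)
    points = [pos.copy()]
    for k_l, k_r, k_d, k_u in stiffness:
        force = np.array([k_l * (-RAIL - pos[0]) + k_r * (RAIL - pos[0]),
                          k_d * (-RAIL - pos[1]) + k_u * (RAIL - pos[1])])
        vel = 0.9 * vel + force / MASS   # viscosity: momentum scaled by 0.9
        pos = pos + vel                  # one forward-integration step
        points.append(pos.copy())
    return np.array(points)

rng = np.random.default_rng(1)
motor_program = rng.uniform(0.5, 2.0, size=(16, 4))  # random stiffness schedule
print(pen_trajectory(motor_program).shape)           # (17, 2)
```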
For each point on the coarse trajectory, we share two units of ink between the four closest pixels using bilinear interpolation. We also use linear interpolation to add three fine-grain trajectory points between every pair of coarse-grain points. These fine-grain points also contribute ink to the pixels using bilinear interpolation, but the amount of ink they contribute is zero if they are less than one pixel apart and rises linearly to the same amount as the coarse-grain points if they are more than two pixels apart. This generates a thin skeleton with a fairly uniform ink density. To flesh out the skeleton, we use two "ink parameters", a and b, to specify a 3×3 kernel of the form b(1+a)·[a/12, a/6, a/12; a/6, 1−a, a/6; a/12, a/6, a/12], which is convolved with the image four times. Finally, the pixel intensities are clipped to lie in the interval [0,1]. The matlab code is at www.cs.toronto.edu/∼hinton/code. The values of 2a and b/1.5 are additional, logistic outputs of the recognition networks.¹ 3 Training the recognition networks The obvious way to learn a recognition network is to use a training set in which the inputs are images and the target outputs are the motor programs that were used to generate those images. If we knew the distribution over motor programs for a given digit class, we could easily produce such a set by running the generator. Unfortunately, the distribution over motor programs is exactly what we want to learn from the data, so we need a way to train the recognition network without knowing this distribution in advance. Generating scribbles from random motor programs will not work because the capacity of the network will be wasted on irrelevant images that are far from the real data. [Footnote 1: We can add all sorts of parameters to the hand-coded generative model and then get the recognition networks to learn to extract the appropriate values for each image. The global mass and viscosity as well as the spacing of the rails that hold the springs can be learned. We can even implement affine-like transformations by attaching the four springs to endpoints whose eight coordinates are given by the recognition networks. These extra parameters make the learning slower and, for the normalized digits, they do not improve discrimination, probably because they help the wrong digit models as much as the right one.]
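A sketch of the flesh-out step just described, assuming the kernel reconstruction above (note that the bracketed entries sum to one, so b(1+a) sets the overall ink gain); the values of a and b are arbitrary demo choices.

```python
import numpy as np
from scipy.signal import convolve2d

def ink_kernel(a: float, b: float) -> np.ndarray:
    """3x3 flesh-out kernel b(1+a)*[a/12 a/6 a/12; a/6 1-a a/6; a/12 a/6 a/12]."""
    k = np.array([[a / 12, a / 6, a / 12],
                  [a / 6, 1 - a, a / 6],
                  [a / 12, a / 6, a / 12]])
    return b * (1 + a) * k

def flesh_out(skeleton: np.ndarray, a: float = 0.4, b: float = 0.9) -> np.ndarray:
    """Convolve the thin ink skeleton with the kernel four times, then clip to [0,1]."""
    image = skeleton
    for _ in range(4):
        image = convolve2d(image, ink_kernel(a, b), mode="same")
    return np.clip(image, 0.0, 1.0)

# Demo: thicken a one-pixel-wide diagonal "stroke" on a 28x28 canvas.
canvas = np.zeros((28, 28))
canvas[np.arange(4, 24), np.arange(4, 24)] = 1.0
print(flesh_out(canvas).max())
```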
As time goes by, the generated training set will expand along the manifold for that digit class until all of the MNIST training images of that class are well modelled by the recognition network. It takes about 10 hours in matlab on a 3 GHz Xeon to train each recognition network. We use minibatches of size 100, momentum of 0.9, and adaptive learning rates on each connection that increase additively when the sign of the gradient agrees with the sign of the previous weight change and decrease multiplicatively when the signs disagree [9]. The net is generating its own training data, so the objective function is always changing, which makes it inadvisable to use optimization methods that go as far as they can in a carefully chosen direction. Figures 3 and 4 show some examples of how well the recognition nets perform after training. [Figure 3: Examples of validation set images reconstructed by their corresponding model. In each case the original image is on the left and the reconstruction is on the right. Superimposed on the original image is the pen trajectory.] Nearly all models achieve an average squared pixel error of less than 15 per image on their validation set (pixel intensities are between 0 and 1 with a preponderance of extreme values). The inferred motor programs are clearly good enough to capture the diverse handwriting styles in the data. They are not good enough, however, to give classification performance comparable to the state-of-the-art on the MNIST database. So we added a series of enhancements to the basic system to improve the classification accuracy. 4 Enhancements to the basic system Extra strokes in ones and sevens. One limitation of the basic system is that it draws digits using only a single stroke (i.e. the trajectory is a single, unbroken curve). But when people draw digits, they often add extra strokes to them. Two of the most common examples are the dash at the bottom of ones, and the dash through the middle of sevens (see examples in figure 5). About 2.2% of ones and 13% of sevens in the MNIST training set are dashed, and not modelling the dashes reduces classification accuracy significantly. We model dashed ones and sevens by augmenting their basic motor programs with another motor program to draw the dash. For example, a dashed seven is generated by first drawing an ordinary seven using the motor program computed by the seven model, and then drawing the dash with a motor program computed by a separate neural network that models only dashes. Dashes in ones and sevens are modeled with two different networks. Their training proceeds the same way as with the other models, except now there are only 50 hidden units and the training set contains only the dashed cases of the digit. (Separating these cases from the rest of the MNIST training set is easy because they can be quickly spotted by looking at the difference between the images and their reconstructions by the dashless digit model.) The net takes the entire image of a digit as input, and computes the motor program for just the dash. When reconstructing an unlabelled image as, say, a seven, we compute both the dashed and dashless versions of seven and pick the one with the lower squared pixel error to be that image's reconstruction as a seven. Figure 5 shows examples of images reconstructed using the extra stroke. Local search. When reconstructing an image in its own class, a digit model often produces a sensible, overall approximation of the image.
However, some of the finer details of the reconstruction may be slightly wrong and need to be fixed up by an iterative local search that adjusts the motor program to reduce the reconstruction error. We first approximate the graphics model with a neural network that contains a single hidden layer of 500 logistic units. We train one such generative network for each of the ten digits and for the dashed versions of ones and sevens (for a total of 12 nets). The motor programs used for training are obtained by adding noise to the motor programs inferred from the training data by the relevant, fully trained recognition network. The images produced from these motor programs by the graphics model are used as the targets for the supervised learning of each generative network. Given these targets, the weight updates are computed in the same way as for the recognition networks. [Figure 4: To model 4's we use a single smooth trajectory, but turn off the ink for timesteps 9 and 10. For images in which the pen does not need to leave the paper, the recognition net finds a trajectory in which points 8 and 11 are close together so that points 9 and 10 are not needed. For 5's we leave the top until last and turn off the ink for timesteps 13 and 14.] [Figure 5: Examples of dashed ones and sevens reconstructed using a second stroke. The pen trajectory for the dash is shown in blue, superimposed on the original image.] Once the generative network is trained, we can use it to iteratively improve the initial motor program computed by the recognition network for an image. The main steps in one iteration are: 1) compute the error between the image and the reconstruction generated from the current motor program by the graphics model; 2) backpropagate the reconstruction error through the generative network to calculate its gradient with respect to the motor program; 3) compute a new motor program by taking a step along the direction of steepest descent plus 0.5 times the previous step. [Figure 6: An example of how local search improves the detailed registration of the trajectory found by the correct model; the initial squared pixel error of 33.8 falls to 15.2 after 10 iterations, 10.5 after 20, and 9.3 after 30, less than a third of its initial value.] Figure 6 shows an example of how local search improves the reconstruction by the correct model. Local search is usually less effective at improving the fits of the wrong models, so it eliminates about 20% of the classification errors on the validation set. PCA model of the image residuals. The sum of squared pixel errors is not the best way of comparing an image with its reconstruction, because it treats the residual pixel errors as independent and zero-mean Gaussian distributed, which they are not. By modelling the structure in the residual vectors, we can get a better estimate of the conditional probability of the image given the motor program. For each digit class, we construct a PCA model of the image residual vectors for the training images. Then, given a test image, we project the image residual vector produced by each inferred motor program onto the relevant PCA hyperplane and compute the squared distance between the residual and its projection. This gives ten scores for the image that measure the quality of its reconstructions by the digit models. We don't discard the old sum of squared pixel errors as they are still useful for classifying most images correctly. Instead, all twenty scores are used as inputs to the classifier, which decides how to combine both types of scores to achieve high classification accuracy.
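A sketch of the local-search loop in steps 1)-3), using a generic differentiable emulator of the graphics model in place of the learned generative network; the emulator, learning rate, and dimensions here are stand-ins, with only the 0.5-times-previous-step rule taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "generative network": motor program (70,) -> image (784,),
# a fixed random one-layer sigmoid model with a hand-coded gradient.
W = rng.normal(scale=0.1, size=(784, 70))

def generate(m):
    return 1.0 / (1.0 + np.exp(-W @ m))          # emulated graphics model

def grad_wrt_motor(m, target):
    """Gradient of the squared pixel error through the emulator (step 2)."""
    img = generate(m)
    return W.T @ (2.0 * (img - target) * img * (1.0 - img))

def local_search(m, target, lr=0.05, n_iter=30):
    step = np.zeros_like(m)
    for _ in range(n_iter):
        g = grad_wrt_motor(m, target)            # steps 1) and 2)
        step = -lr * g + 0.5 * step              # step 3): descent + 0.5 * previous step
        m = m + step
    return m, float(np.sum((generate(m) - target) ** 2))

target = generate(rng.normal(size=70))           # a reachable target image
m0 = rng.normal(size=70)                         # initial motor program
m_fit, final_err = local_search(m0, target)
print(f"final squared pixel error: {final_err:.2f}")
```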
PCA model of trajectories. Classifying an image by comparing its reconstruction errors for the different digit models tacitly relies on the assumption that the incorrect models will reconstruct the image poorly. Since the models have only been trained on images in their own class, they often do reconstruct images from other classes poorly, but occasionally they fit an image from another class well. For example, figure 7 shows how the three model reconstructs a two image better than the two model by generating a highly contorted three. [Figure 7: Reconstruction of a two image by the two model (left box, squared error 24.9, shape prior score 31.5) and by the three model (right box, squared error 15.0, shape prior score 104.2), with the pen trajectory superimposed on the original image. The three model sharply bends the bottom of its trajectory to better explain the ink, but the trajectory prior for three penalizes it with a high score. The two model has a higher squared error, but a much lower prior score, which allows the classifier to correctly label the image.] This problem becomes even more pronounced with local search, which sometimes contorts the wrong model to fit the image really well. The solution is to learn a PCA model of the trajectories that a digit model infers from images in its own class. Given a test image, the trajectory computed by each digit model is scored by its squared distance from the relevant PCA hyperplane. These 10 "prior" scores are then given to the classifier along with the 20 "likelihood" scores described above. The prior scores eliminate many classification mistakes such as the one in figure 7. 5 Classification results To classify a test image, we apply multinomial logistic regression to the 30 scores, i.e. we use a neural network with no hidden units, 10 softmax output units and a cross-entropy error. The net is trained by gradient descent using the scores for the validation set images. To illustrate the gain in classification accuracy achieved by the enhancements explained above, table 1 gives the percent error on the validation set as each enhancement is added to the system. Together, the enhancements almost halve the number of mistakes.

Table 1: The gain in classification accuracy as the following enhancements are added: 1) extra stroke for dashed ones and sevens, 2) local search, 3) PCA model of image residual, and 4) PCA trajectory prior. To avoid using the test set for model selection, the performance on the official test set was only measured for the final system.

  Enhancements            | None | 1    | 1, 2 | 1, 2, 3 | 1, 2, 3, 4
  Validation set % error  | 4.43 | 3.84 | 3.01 | 2.67    | 2.28
  Test set % error        |  -   |  -   |  -   |  -      | 1.82

6 Discussion After training a single neural network to output both the class label and the motor program for all classes (as described in section 1) we tried ignoring the label output and classifying the test images by using the cost, under 10 different PCA models, of the trajectory defined by the inferred motor program. Each PCA model was fitted to the trajectories extracted from the training images for a given class. This gave 1.80% errors, which is as good as the 1.82% we got using the 10 separate recognition networks and local search.
This is quite surprising because the motor programs produced by the single network were simplified to make them all have the same dimensionality, and they produced significantly poorer reconstructions. By only using the 10 digit-specific recognition nets to create the motor programs for the training data, we get much faster recognition of test data because at test time we can use a single recognition network for all classes. It also means we do not need to trade off prior scores against image residual scores because there is only one image residual. The ability to extract motor programs could also be used to enhance the training set. [7] shows that error rates can be halved by using smooth vector distortion fields to create extra training data. They argue that these fields simulate "uncontrolled oscillations of the hand muscles dampened by inertia". Motor noise may be better modelled by adding noise to an actual motor program as shown in figure 1. Notice that this produces a wide variety of non-blurry images and it can also change the topology. The techniques we have used for extracting motor programs from digit images may be applicable to speech. There are excellent generative models that can produce almost perfect speech if they are given the right formant parameters [10]. Using one of these generative models we may be able to train a large number of specialized recognition networks to extract formant parameters from speech without requiring labeled training data. Once this has been done, labeled data would be available for training a single feed-forward network that could recover accurate formant parameters which could be used for real-time recognition. Acknowledgements We thank Steve Isard, David MacKay and Allan Jepson for helpful discussions. This research was funded by NSERC, CFI and OIT. GEH is a fellow of the Canadian Institute for Advanced Research and holds a Canada Research Chair in machine learning. References
[1] D. M. MacKay. Mindlike behaviour in artefacts. British Journal for Philosophy of Science, 2:105–121, 1951.
[2] M. Halle and K. Stevens. Speech recognition: A model and a program for research. IRE Transactions on Information Theory, IT-8(2):155–159, 1962.
[3] Murray Eden. Handwriting and pattern recognition. IRE Transactions on Information Theory, IT-8(2):160–166, 1962.
[4] J. M. Hollerbach. An oscillation theory of handwriting. Biological Cybernetics, 39:139–156, 1981.
[5] G. Mayraz and G. E. Hinton. Recognizing hand-written digits using hierarchical products of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:189–197, 2001.
[6] D. Decoste and B. Schoelkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002.
[7] Patrice Y. Simard, Dave Steinkraus, and John Platt. Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, Los Alamitos, pages 958–962, 2003.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[9] R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Technical Report UM-CS-1987-117, University of Massachusetts, Amherst, MA, 1987.
[10] W. Holmes, J. Holmes, and M. Judd. Extension of the bandwidth of the JSRU parallel-formant synthesizer for high quality synthesis of male and female speech. In Proceedings of ICASSP 90 (1), pages 313–316, 1990.
6 0.29878864 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
7 0.29082173 106 nips-2005-Large-scale biophysical parameter estimation in single neurons via constrained linear regression
8 0.27983981 176 nips-2005-Silicon growth cones map silicon retina
9 0.27609503 135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI
10 0.27039599 54 nips-2005-Data-Driven Online to Batch Conversions
11 0.2560854 167 nips-2005-Robust design of biological experiments
12 0.25494632 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects
13 0.25153214 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
14 0.24210145 118 nips-2005-Learning in Silicon: Timing is Everything
16 0.23258321 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
17 0.23219164 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
18 0.23052271 76 nips-2005-From Batch to Transductive Online Learning
19 0.22511804 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
20 0.22472119 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
topicId topicWeight
[(0, 0.454), (3, 0.02), (10, 0.028), (27, 0.018), (31, 0.037), (34, 0.03), (39, 0.017), (41, 0.011), (55, 0.038), (57, 0.045), (60, 0.023), (69, 0.048), (73, 0.022), (88, 0.066), (91, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.83895183 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning
2 0.48150951 85 nips-2005-Generalization to Unseen Cases
Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri
Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a datadependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1
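The abstract's key quantity, the missing mass (the total probability of outcomes not seen in the sample), has a classic Good-Turing estimator: the fraction of the sample made up of items seen exactly once. The sketch below compares that estimator with the true missing mass on synthetic data; it illustrates the quantity being bounded, not the paper's new bound.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# A skewed discrete distribution over 1000 outcomes, sampled 500 times.
p = rng.dirichlet(np.full(1000, 0.05))
sample = rng.choice(1000, size=500, p=p)

counts = Counter(sample.tolist())
good_turing = sum(1 for c in counts.values() if c == 1) / len(sample)
true_missing = p[[k for k in range(1000) if k not in counts]].sum()

print(f"Good-Turing estimate of missing mass: {good_turing:.3f}")
print(f"True missing mass:                    {true_missing:.3f}")
```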
3 0.24477689 181 nips-2005-Spiking Inputs to a Winner-take-all Network
Author: Matthias Oster, Shih-Chii Liu
Abstract: Recurrent networks that perform a winner-take-all computation have been studied extensively. Although some of these studies include spiking networks, they consider only analog input rates. We present results of this winner-take-all computation on a network of integrate-and-fire neurons which receives spike trains as inputs. We show how we can configure the connectivity in the network so that the winner is selected after a pre-determined number of input spikes. We discuss spiking inputs with both regular frequencies and Poisson-distributed rates. The robustness of the computation was tested by implementing the winner-take-all network on an analog VLSI array of 64 integrate-and-fire neurons which have an innate variance in their operating parameters. 1
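As a schematic reading of the mechanism the abstract outlines (a winner chosen after a pre-determined number of input spikes), the sketch below implements a bare-bones spike-count winner-take-all: each competing unit counts its input spikes, and the first to reach the threshold count suppresses the rest. This is an illustration of the computation, not the authors' network or VLSI circuit.

```python
import numpy as np

def spike_count_wta(spike_times, n_spikes):
    """Return the index of the unit that first accumulates n_spikes input spikes;
    reaching threshold inhibits (here: simply stops) all other units."""
    threshold_times = [sorted(times)[n_spikes - 1] if len(times) >= n_spikes
                       else np.inf for times in spike_times]
    return int(np.argmin(threshold_times))

rng = np.random.default_rng(0)
rates = [20.0, 50.0, 35.0]  # Poisson input rates (Hz) for three competing units
trains = [list(np.cumsum(rng.exponential(1.0 / r, size=100))) for r in rates]
print("winner:", spike_count_wta(trains, n_spikes=10))  # usually the 50-Hz unit
```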
4 0.24009821 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection
Author: Long Zhu, Alan L. Yuille
Abstract: We describe a hierarchical compositional system for detecting deformable objects in images. Objects are represented by graphical models. The algorithm uses a hierarchical tree where the root of the tree corresponds to the full object and lower-level elements of the tree correspond to simpler features. The algorithm proceeds by passing simple messages up and down the tree. The method works rapidly, in under a second, on 320 × 240 images. We demonstrate the approach on detecting cats, horses, and hands. The method works in the presence of background clutter and occlusions. Our approach is contrasted with more traditional methods such as dynamic programming and belief propagation. 1
5 0.23781817 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy
Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1
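A minimal generative sketch of the model family this abstract describes: low-dimensional latent dynamics (linear here, for simplicity; the paper uses a non-linear recurrent system) driving spiking observations through a point-process (Poisson) output. All dimensions and parameters are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

T, dim_z, n_neurons = 200, 3, 96
A = 0.95 * np.linalg.qr(rng.normal(size=(dim_z, dim_z)))[0]  # stable latent dynamics
C = rng.normal(scale=0.5, size=(n_neurons, dim_z))           # latent-to-neuron loading
b = np.log(5.0 / 100.0)    # baseline log-rate: ~5 Hz in 10-ms bins

z = np.zeros(dim_z)
spikes = np.zeros((T, n_neurons), dtype=int)
for t in range(T):
    z = A @ z + 0.1 * rng.normal(size=dim_z)   # recurrent latent trajectory + noise
    rate = np.exp(C @ z + b)                   # point-process intensity per bin
    spikes[t] = rng.poisson(rate)              # spiking output
print("mean spike count per bin:", spikes.mean())
```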
7 0.2281044 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
8 0.2261087 30 nips-2005-Assessing Approximations for Gaussian Process Classification
9 0.2254889 136 nips-2005-Noise and the two-thirds power Law
10 0.22542652 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
11 0.22539529 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
12 0.22468592 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
13 0.22459796 188 nips-2005-Temporally changing synaptic plasticity
14 0.22431827 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
15 0.22400357 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
16 0.22288871 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
17 0.22254758 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
18 0.22243157 48 nips-2005-Context as Filtering
19 0.22241944 109 nips-2005-Learning Cue-Invariant Visual Responses
20 0.22144875 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts