nips nips2005 nips2005-67 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy
Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movement-initiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a low-dimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. [sent-8, score-0.364]
2 Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. [sent-9, score-0.175]
3 For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movement-initiation cue is believed to be involved in motor planning. [sent-10, score-1.002]
4 We show that the dynamics underlying this activity can be captured by a low-dimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. [sent-11, score-0.726]
5 We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. [sent-12, score-0.477]
6 These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. [sent-13, score-0.175]
7 1 Introduction At present, the best view of the activity of a neural circuit is provided by multiple-electrode extracellular recording technologies, which allow us to simultaneously measure spike trains from up to a few hundred neurons in one or more brain areas during each trial. [sent-14, score-0.576]
8 While the resulting data provide an extensive picture of neural spiking, their use in characterizing the fine timescale dynamics of a neural circuit is complicated by at least two factors. [sent-15, score-0.277]
9 First, extracellularly captured action potentials provide only an occasional view of the process from which they are generated, forcing us to interpolate the evolution of the circuit between the spikes. [sent-16, score-0.178]
10 Second, the circuit activity may evolve quite differently on different trials that are otherwise experimentally identical. [sent-17, score-0.345]
11 There is little alternative to this approach when recordings are made one neuron at a time, even when the dynamics of the system are the subject of study. [sent-19, score-0.218]
12 This is especially important during cognitive processing such as decision making or motor planning. [sent-22, score-0.214]
13 An alternative approach is to adopt latent variable methods and to identify a hidden dynamical system that can summarize and explain the simultaneously-recorded spike trains. [sent-25, score-0.683]
14 The central idea is that the responses of different neurons reflect different views of a common dynamical process in the network, whose effective dimensionality is much smaller than the total number of neurons in the network. [sent-26, score-0.369]
15 While the underlying state trajectory may be slightly different on each trial, the commonalities among these trajectories can be captured by the network’s parameters, which are shared across trials. [sent-27, score-0.443]
16 These parameters define how the network evolves over time, as well as how the observed spike trains relate to the network’s state at each time point. [sent-28, score-0.332]
17 Dimensionality reduction in a latent dynamical model is crucial and yields benefits beyond simple noise elimination. [sent-29, score-0.429]
18 The trajectory of the ball may not be identical in each sequence, and so simply averaging the sequences together would provide little information about the dynamics. [sent-32, score-0.143]
19 Independently smoothing the dynamics of each pixel might identify a dynamical process; however, correctly rejecting noise might be difficult, and in any case this would yield an inefficient and opaque representation of the underlying physical process. [sent-33, score-0.468]
20 By contrast, a hidden dynamical system account could capture the video sequence data using a low-dimensional latent variable that represented only the ball’s position and momentum over time, with dynamical rules that captured the physics of ballistics and elastic collision. [sent-34, score-0.857]
21 The first is to obtain a low-dimensional summary of the dynamical trajectory in any one trial. [sent-37, score-0.298]
22 In the video sequence example, predicting the loudness of the sound on impact might be easy given the estimate of the ball’s trajectory (and thus its speed), but would be difficult from the raw pixel trajectories, even if denoised. [sent-39, score-0.175]
23 In the neural case, behavioral variables such as reaction time might similarly be most easily predicted from the reconstructed trajectory. [sent-40, score-0.141]
24 In the neural case this would involve identifying the structure of dynamics available to the circuit: the number and relationship of attractors, appearance of oscillatory limit cycles and so on. [sent-43, score-0.242]
25 The use of latent variable models with hidden dynamics for neural data has, thus far, been limited. [sent-44, score-0.465]
26 In [1], [2], small groups of neurons in the frontal cortex were modeled using hidden Markov models, in which the latent dynamical system is assumed to transition between a set of discrete states. [sent-45, score-0.609]
27 In [3], a state space model with linear hidden dynamics and point-process outputs was applied to simulated data. [sent-46, score-0.272]
28 However, these restricted latent models cannot capture the richness of dynamics that recurrent networks exhibit. [sent-47, score-0.423]
29 If such systems are relevant to real neural data, we must seek to identify hidden models capable of reflecting this range of behaviors. [sent-49, score-0.153]
30 In this work, we consider a latent variable model having (1) hidden underlying recurrent structure with continuous-valued states, and (2) Poisson-distributed output spike counts (conditioned on the state), as described in Section 2. [sent-50, score-0.604]
31 Evidence of motor preparation in PMd is given in Section 5. [sent-53, score-0.404]
32 In Section 6, we characterize the neural dynamics of motor preparation on a trial-by-trial basis. [sent-54, score-0.579]
33 Models of this class have long been used, albeit generally without stochastic perturbation, to describe the dynamics of neuronal responses (e. [sent-60, score-0.17]
34 The RNN is chosen for the range of dynamics it can exhibit, including convergence to point or surface attractors, oscillatory limit cycles, or chaotic evolution; but each node is simply an abstract dimension of latent space which may couple to many or all of the observed neurons. [sent-65, score-0.466]
35 The output distribution is given by a generalized linear model that describes the relationship between all nodes in the state x_t and the spike count y_t^i ∈ IR of neuron i ∈ {1, . [sent-66, score-0.429]
36 , q} in the t-th time bin: y_t^i | x_t ∼ Poisson( h(c_i · x_t + d_i) Δ ), (4) where c_i ∈ IR^{p×1} and d_i ∈ IR are constants, h is a link function mapping IR → IR+, and Δ ∈ IR is the time bin width. [sent-69, score-0.636]
37 We collect the spike counts from all q simultaneously recorded physical neurons into a vector y_t ∈ IR^{q×1}, whose i-th element is y_t^i. [sent-70, score-0.417]
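To make the generative structure concrete, the following is a minimal Python/NumPy sketch of one plausible instantiation, not the authors' implementation. The leaky-tanh recurrent update (with tanh standing in for the node nonlinearity g), the parameter names W, tau and Q_chol, and the default link h = exp are assumptions filling in for the state equations (1)-(2), which are not reproduced in this extract; only the Poisson count model follows equation (4).

```python
import numpy as np

def simulate_trial(T, W, Q_chol, C, d, h=np.exp, dt=0.02, tau=0.1, rng=None):
    """Sample one trial from a latent recurrent-network + Poisson-count model.

    W: (p, p) recurrent weights; Q_chol: Cholesky factor of the process noise;
    C: (q, p) output loadings (rows are c_i); d: (q,) output offsets.
    The state update is an assumed leaky-tanh form standing in for (1)-(2);
    the spike counts follow equation (4).
    """
    rng = np.random.default_rng() if rng is None else rng
    p, q = W.shape[0], C.shape[0]
    x = np.zeros((T, p))
    y = np.zeros((T, q), dtype=int)
    for t in range(1, T):
        drift = (-x[t - 1] + W @ np.tanh(x[t - 1])) * dt / tau      # recurrent dynamics
        x[t] = x[t - 1] + drift + Q_chol @ rng.standard_normal(p)   # additive Gaussian noise
    for t in range(T):
        rate = h(C @ x[t] + d) * dt                                 # eq. (4): Poisson mean per bin
        y[t] = rng.poisson(rate)
    return x, y
```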
38 3 Inference and Learning The Expectation-Maximization (EM) algorithm [5] was used to iteratively (1) infer the underlying hidden state trajectories (i. [sent-72, score-0.306]
39 The difference from EKS arises in the measurement update step of the forward pass: P(x_t | {y}_1^t) ∝ P(y_t | x_t) P(x_t | {y}_1^{t-1}). (5) [sent-80, score-0.384]
40 Because P(y_t | x_t) is a product of Poissons rather than a Gaussian, the filtered state posterior P(x_t | {y}_1^t) cannot be easily computed. [sent-81, score-0.236]
41 Instead, as in [3], we approximated this posterior with a Gaussian centered at the mode of log P(x_t | {y}_1^t) and whose covariance is given by the negative inverse Hessian of the log posterior at that mode. [sent-82, score-0.134]
42 Certain choices of h, including e^z and log(1 + e^z), lead to a log posterior that is strictly concave in x_t. [sent-83, score-0.284]
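A minimal sketch of this measurement update, assuming the link h(z) = e^z so that the gradient and Hessian of the Poisson log-likelihood are available in closed form and the log posterior is strictly concave. The names (m_pred, P_pred, etc.) are illustrative rather than taken from the authors' code.

```python
import numpy as np

def laplace_update(m_pred, P_pred, y, C, d, dt, n_iter=20, tol=1e-8):
    """Gaussian (Laplace) approximation to the filtered posterior P(x_t | y_1..t).

    m_pred, P_pred: predicted (one-step-ahead) mean and covariance of x_t;
    y: spike counts observed in this bin; rows of C are c_i, d holds d_i.
    Returns the posterior mode and the negative inverse Hessian at that mode.
    """
    P_inv = np.linalg.inv(P_pred)
    x = m_pred.copy()
    for _ in range(n_iter):
        rate = np.exp(C @ x + d) * dt                    # Poisson means under h = exp
        grad = C.T @ (y - rate) - P_inv @ (x - m_pred)   # gradient of the log posterior
        hess = -C.T @ (rate[:, None] * C) - P_inv        # Hessian (negative definite)
        step = np.linalg.solve(hess, grad)               # Newton step: x <- x - H^{-1} g
        x = x - step
        if np.max(np.abs(step)) < tol:
            break
    rate = np.exp(C @ x + d) * dt
    hess = -C.T @ (rate[:, None] * C) - P_inv
    return x, np.linalg.inv(-hess)
```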
43 Because the posterior state distributions are approximated as Gaussians in the E-step, the above expectation is a Gaussian integral that involves non-linear functions g and h and cannot be computed analytically in general. [sent-87, score-0.197]
44 Unfortunately, this exponential mapping would distort the relationship between perturbations in the latent state (whose size is set by the covariance matrix Q) and the resulting fluctuations in firing rates. [sent-93, score-0.32]
45 The corresponding Gaussian integrals must then be approximated by quadrature methods. [sent-97, score-0.165]
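One standard instance of such a quadrature approximation is Gauss-Hermite integration. For expectations of the link output, the integral is effectively one-dimensional, because c_i · x_t + d_i is itself Gaussian under the approximate posterior; integrals involving the state nonlinearity g may require multidimensional rules. A sketch under these assumptions:

```python
import numpy as np

def expected_link(c, d, mu, Sigma, h, n_nodes=20):
    """E[h(c . x + d)] for x ~ N(mu, Sigma), via Gauss-Hermite quadrature.

    The projection z = c . x + d is a scalar Gaussian, so the q-dimensional
    expectation reduces to a 1-D integral that the quadrature rule handles for
    any smooth link h, e.g. h(z) = log(1 + exp(z)).
    """
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)  # probabilists' rule
    m = c @ mu + d                 # mean of the scalar projection
    s = np.sqrt(c @ Sigma @ c)     # its standard deviation
    z = m + s * nodes
    return (weights @ h(z)) / np.sqrt(2 * np.pi)
```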
46 As long as the variances of the posterior state distribution did not diverge, the output distributions described by the learned model closely approximated those of the actual model that generated the simulated data. [sent-101, score-0.162]
47 After a pseudo-randomly chosen delay period of 200, 750, or 1000 ms, the target increased in size, serving as the go cue, and the monkey reached to the target. [sent-104, score-0.766]
48 (Figure 1 panel labels: Delay Activity, Peri-Movement Activity; 200 ms scale bar.) Figure 1: Delayed reach task and average action potential (spike) emission rate from one representative unit. [sent-108, score-0.208]
49 Vertical dashed lines indicate peripheral reach target onset (left) and movement onset (right). [sent-110, score-0.483]
50 5 Motor preparation in PMd Motor preparation is often studied using the “instructed delay” behavioral paradigm, as described in Section 4, where a variable-length “planning” period temporally separates an instruction stimulus from a go cue [8]–[13]. [sent-111, score-0.851]
51 Longer delay periods typically lead to shorter reaction times (RT, defined as time between go cue and movement onset), and this has been interpreted as evidence for a motor preparation process that takes time [11], [12], [14], [15]. [sent-112, score-1.095]
52 In this view, the delay period allows for motor preparation to complete prior to the go cue, thus shortening the RT. [sent-113, score-0.926]
53 Evidence for motor preparation at the neural level is taken from PMd (and, to a lesser degree, M1), where neurons show sustained activity during the delay period (Figure 1, delay activity) [8]–[10]. [sent-114, score-1.293]
54 A number of findings support the hypothesis that such activity is related to motor preparation. [sent-115, score-0.371]
55 First, delay period activity typically shows tuning for the instruction (i. [sent-116, score-0.578]
56 , location of reach target; note that the PMd neuron in Figure 1 has greater delay activity before leftward than before rightward reaches), consistent with the idea that something specific is being prepared [8], [9], [11], [13]. [sent-118, score-0.503]
57 Second, in the absence of a delay period, a brief burst of similarly-tuned activity is observed during the RT interval, consistent with the idea that motor preparation is taking place at that time [12]. [sent-119, score-0.841]
58 Third, we have recently reported that firing rates across trials to the same reach target become more consistent as the delay period progresses [16]. [sent-120, score-0.761]
59 Averaged across 14 single- and 33 multi-neuron units, we found that this Normalized Variance (NV) declined 24% (t-test, p < 10^-10) from 200 ms before target onset to the median time of the go cue. [sent-122, score-0.622]
60 This decline spanned ∼119 ms just after target onset and appears to, at least roughly, track the time-course of motor preparation. [sent-123, score-0.658]
61 The NV may be interpreted as a signature of the approximate degree of motor preparation yet to be accomplished. [sent-124, score-0.404]
62 Shortly after target onset, firing rates are frequently far from their mean. [sent-125, score-0.162]
63 If the go cue arrives then, it will take time to correct these “errors” and RTs will therefore be longer. [sent-126, score-0.369]
64 By the time the NV has completed its decline, firing rates are consistently near their mean (which we presume is near an “optimal” configuration for the impending reach), and RTs will be shorter if the go cue arrives then. [sent-127, score-0.492]
65 This interpretation assumes that there is a limit on how quickly firing rates can converge to their ideal values (a limit on how quickly the NV can drop) such that a decline during the delay period saves time later. [sent-128, score-0.61]
66 The NV was found to be lower at the time of the go cue for trials with shorter RTs than those with longer RTs [16]. [sent-129, score-0.469]
67 The above data strongly suggest that the network underlying motor preparation exhibits rich dynamics. [sent-130, score-0.503]
68 Activity is initially variable across trials, but appears to settle during the delay period. [sent-131, score-0.316]
69 Because the RNN (1)-(2) is capable of exhibiting such dynamics and may underlie motor preparation, we sought to identify such a dynamical system in delay activity. [sent-132, score-0.896]
70 However, it provides little insight into the course of motor planning on a single trial. [sent-134, score-0.256]
71 A gradual fall in trial-to-trial variance might reflect a gradual convergence on each trial, or might reflect rapid transitions that occur at different times on different trials. [sent-135, score-0.162]
72 We fit the dynamical system model (1)–(4) with three latent dimensions (p = 3) to training data, consisting of delay activity preceding 70 reaches to the same target (30°). [sent-140, score-1.004]
73 Spike counts were taken in non-overlapping Δ = 20 ms bins at 20 ms time steps from 50 ms after target onset to 50 ms after the go cue. [sent-141, score-1.052]
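A small sketch of this binning step, with hypothetical variable names; spike times are assumed to be expressed in seconds and already aligned per trial, so that t_start and t_stop correspond to 50 ms after target onset and 50 ms after the go cue.

```python
import numpy as np

def bin_spikes(spike_times, t_start, t_stop, bin_width=0.02):
    """Count spikes per unit in non-overlapping bins of width bin_width seconds.

    spike_times: list of 1-D arrays, one per unit, of spike times in seconds.
    Returns an integer array of shape (n_bins, n_units).
    """
    edges = np.arange(t_start, t_stop + bin_width, bin_width)
    counts = [np.histogram(st, bins=edges)[0] for st in spike_times]
    return np.stack(counts, axis=1)
```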
74 Then, the fitted model parameters were used to infer the latent space trajectories for 146 test trials, which are plotted in Figure 2. [sent-142, score-0.335]
75 Despite the trial-to-trial variability in the delay period neural responses, the state evolves along a characteristic path on each trial. [sent-143, score-0.521]
76 One might have expected the neural variability across trials to cause the state trajectory to evolve in markedly different ways on different trials. [sent-144, score-0.424]
77 Even with this characteristic structure, however, the state trajectories are not all identical. [sent-145, score-0.183]
78 This presumably reflects the fact that the motor planning process is internally regulated, and its time course may differ from trial to trial, even when the presented stimulus (in this case, the reach target) is identical. [sent-146, score-0.438]
79 How these timecourses differ from trial to trial would have been obscured had we combined the neural data across trials, as with the NV in Section 5. [sent-147, score-0.286]
80 Is this low-dimensional description of the system dynamics adequate to describe the firing of all 131 recorded units? [sent-148, score-0.223]
81 (Figure 2 caption) Dots indicate 50 ms after target onset (blue) and 50 ms after the go cue (green). [sent-150, score-0.828]
82 The radius of the green dots is logarithmically related to delay period length (200, 750, or 1000 ms). [sent-151, score-0.38]
83 We imputed trial-by-trial inhomogeneous firing rates using the output relationship from (4): λ_t^i = h(c_i · x_t + d_i), (8) where λ_t^i is the imputed firing rate of the i-th unit at the t-th time bin. [sent-152, score-0.564]
84 Figure 3 shows the imputed firing rates for 15 representative units overlaid with empirical firing rates obtained by directly averaging raw spike counts across the same test trials. [sent-153, score-0.646]
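A sketch of the comparison underlying Figure 3: impute a per-trial rate for unit i from its inferred trajectory via (8), average the imputed rates across trials, and set them against the empirical rate obtained from the raw counts. The array layout and the choice h = exp are assumptions.

```python
import numpy as np

def compare_rates(X_trials, Y_trials, c_i, d_i, dt=0.02, h=np.exp):
    """Return (mean imputed rate, empirical rate) in spikes/s for one unit.

    X_trials: (n_trials, T, p) inferred latent trajectories;
    Y_trials: (n_trials, T) raw spike counts for unit i;
    c_i, d_i: the unit's output parameters from equations (4)/(8).
    """
    imputed = h(X_trials @ c_i + d_i)          # eq. (8), per trial and time bin (spikes/s)
    empirical = Y_trials.mean(axis=0) / dt     # empirical rate from raw counts
    return imputed.mean(axis=0), empirical
```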
85 If the imputed firing rates truly reflect the rate functions underlying the observed spikes, then the mean behavior of the imputed firing rates should track the empirical firing rates. [sent-154, score-0.567]
86 On the other hand, if the latent system were inadequate to describe the activity, we should expect to see dynamical features in the empirical firing that could not be captured by the imputed firing rates. [sent-155, score-0.71]
87 The strong agreement observed in Figure 3 and across all 131 units suggests that this simple dynamical system is indeed capable of capturing significant components of the dynamics of this neural circuit. [sent-156, score-0.596]
88 We can view the dynamical system approach adopted in this work as a form of non-linear dynamical embedding of point-process data. [sent-157, score-0.29]
89 Figure 2 effectively represents a three-dimensional manifold in the space of firing rates along which the dynamics unfold. [sent-159, score-0.209]
90 Beyond the agreement of imputed means demonstrated by Figure 3, we would like to directly test the fit of the model to the neural spike data. [sent-160, score-0.359]
91 Unfortunately, current goodness-of-fit methods for spike trains, such as those based on time-rescaling [17], cannot be applied directly to latent variable models. [sent-161, score-0.356]
92 The difficulty arises because the average trajectory obtained from marginalizing over the latent variables in the system (by which we might hope to rescale the inter-spike intervals) is not designed to provide an accurate estimate of the trial-by-trial firing rate functions. [sent-162, score-0.393]
93 Instead, each trial must be described by a distinct trajectory in latent space, which can only be inferred after observing the spike trains themselves. [sent-163, score-0.625]
94 We are currently exploring extensions to the standard methods which infer latent trajectories using a subset of recorded neurons, and then test the quality of firing-rate predictions for the remaining neurons. [sent-165, score-0.376]
95 In addition, we plan to compare models of different latent dimensionalities; here, the latent space was arbitrarily chosen to be three-dimensional. [sent-166, score-0.44]
96 To validate the learned latent space and inferred trajectories, we would also like to relate these results to trial-by-trial behavior. [sent-167, score-0.26]
97 In particular, given the evidence from Section 5, how “settled” the activity is at the time of the go cue should be predictive of RT. [sent-168, score-0.487]
98 (Figure 3 axes: firing rate in spikes/s against time relative to target onset in ms.) Figure 3: Imputed trial-by-trial firing rates (blue) and empirical firing rates (red). [sent-169, score-0.385]
99 Gray vertical line indicates the time of the go cue. [sent-170, score-0.173]
100 For clarity, only test trials with delay periods of 1000 ms (44 trials) are plotted for each unit. [sent-172, score-0.487]
wordName wordTfidf (topN-words)
[('delay', 0.249), ('ring', 0.248), ('latent', 0.22), ('motor', 0.214), ('dynamical', 0.209), ('nv', 0.204), ('preparation', 0.19), ('imputed', 0.182), ('pmd', 0.182), ('activity', 0.157), ('cue', 0.157), ('irp', 0.156), ('onset', 0.148), ('ms', 0.147), ('go', 0.142), ('spike', 0.136), ('dynamics', 0.134), ('period', 0.131), ('xt', 0.128), ('ir', 0.116), ('trajectories', 0.115), ('trials', 0.091), ('rts', 0.091), ('trajectory', 0.089), ('trial', 0.089), ('target', 0.087), ('eks', 0.078), ('rnn', 0.078), ('neurophysiol', 0.077), ('attractors', 0.077), ('rates', 0.075), ('hidden', 0.07), ('recurrent', 0.069), ('state', 0.068), ('across', 0.067), ('integrals', 0.066), ('neurons', 0.062), ('decline', 0.062), ('circuit', 0.061), ('reach', 0.061), ('yt', 0.061), ('ci', 0.06), ('ez', 0.058), ('counts', 0.056), ('units', 0.055), ('ball', 0.054), ('approximated', 0.054), ('underlying', 0.053), ('behav', 0.052), ('churchland', 0.052), ('riehle', 0.052), ('captured', 0.051), ('trains', 0.051), ('video', 0.05), ('di', 0.049), ('system', 0.048), ('shorter', 0.048), ('ect', 0.046), ('network', 0.046), ('comput', 0.045), ('maneesh', 0.045), ('chaotic', 0.045), ('gradual', 0.045), ('quadrature', 0.045), ('settling', 0.045), ('stanford', 0.045), ('capable', 0.042), ('planning', 0.042), ('psth', 0.041), ('gat', 0.041), ('instruction', 0.041), ('recorded', 0.041), ('neural', 0.041), ('posterior', 0.04), ('inferred', 0.04), ('peripheral', 0.039), ('tth', 0.039), ('arrives', 0.039), ('neuron', 0.036), ('might', 0.036), ('oscillatory', 0.036), ('evolve', 0.036), ('responses', 0.036), ('ts', 0.035), ('analytically', 0.035), ('brain', 0.035), ('reaches', 0.034), ('evolution', 0.033), ('view', 0.033), ('newton', 0.033), ('reaction', 0.033), ('variability', 0.032), ('la', 0.032), ('instructed', 0.032), ('delayed', 0.032), ('perturbations', 0.032), ('presumably', 0.032), ('tishby', 0.032), ('time', 0.031), ('limit', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy
Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1
2 0.21138854 80 nips-2005-Gaussian Process Dynamical Models
Author: Jack Wang, Aaron Hertzmann, David M. Blei
Abstract: This paper introduces Gaussian Process Dynamical Models (GPDM) for nonlinear time series analysis. A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. We marginalize out the model parameters in closed-form, using Gaussian Process (GP) priors for both the dynamics and the observation mappings. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We demonstrate the approach on human motion capture data in which each pose is 62-dimensional. Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces. Webpage: http://www.dgp.toronto.edu/∼ jmwang/gpdm/ 1
3 0.1766357 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
Author: Marton G. Danoczy, Richard H. R. Hahnloser
Abstract: Neurons can have rapidly changing spike train statistics dictated by the underlying network excitability or behavioural state of an animal. To estimate the time course of such state dynamics from single- or multiple neuron recordings, we have developed an algorithm that maximizes the likelihood of observed spike trains by optimizing the state lifetimes and the state-conditional interspike-interval (ISI) distributions. Our nonparametric algorithm is free of time-binning and spike-counting problems and has the computational complexity of a Mixed-state Markov Model operating on a state sequence of length equal to the total number of recorded spikes. As an example, we fit a two-state model to paired recordings of premotor neurons in the sleeping songbird. We find that the two state-conditional ISI functions are highly similar to the ones measured during waking and singing, respectively. 1
4 0.17432828 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. The inferred motor programs can be used directly for digit classification, but they can also be used in other ways. By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. We can also use the motor programs as additional, highly informative outputs which reduce overfitting when training a feed-forward classifier. 1 Overview The idea that patterns can be recognized by figuring out how they were generated has been around for at least half a century [1, 2] and one of the first proposed applications was the recognition of handwriting using a generative model that involved pairs of opposing springs [3, 4]. The “analysis-by-synthesis” approach is attractive because the true generative model should provide the most natural way to characterize a class of patterns. The handwritten 2’s in figure 1, for example, are very variable when viewed as pixels but they have very similar motor programs. Despite its obvious merits, analysis-by-synthesis has had few successes, partly because it is computationally expensive to invert non-linear generative models and partly because the underlying parameters of the generative model are unknown for most large data sets. For example, the only source of information about how the MNIST digits were drawn is the images themselves. We describe a simple generative model in which a pen is controlled by two pairs of opposing springs whose stiffnesses are specified by a motor program. If the sequence of stiffnesses is specified correctly, the model can produce images which look very like the MNIST digits. Using a separate network for each digit class, we show that backpropagation can be used to learn a “recognition” network that maps images to the motor programs required to produce them. An interesting aspect of this learning is that the network creates its own training data, so it does not require the training images to be labelled with motor programs. Each recognition network starts with a single example of a motor program and grows an “island of competence” around this example, progressively extending the region over which it can map small changes in the image to the corresponding small changes in the motor program (see figure 2). Figure 1: An MNIST image of a 2 and the additional images that can be generated by inferring the motor program and then adding random noise to it. The pixels are very different, but they are all clearly twos. Fairly good digit recognition can be achieved by using the 10 recognition networks to find 10 motor programs for a test image and then scoring each motor program by its squared error in reconstructing the image. The 10 scores are then fed into a softmax classifier. Recognition can be improved by using PCA to model the distribution of motor trajectories for each class and using the distance of a motor trajectory from the relevant PCA hyperplane as an additional score. Each recognition network is solving a difficult global search problem in which the correct motor program must be found by a single, “open-loop” pass through the network. 
More accurate recognition can be achieved by using this open-loop global search to initialize an iterative, closed-loop local search which uses the error in the reconstructed image to revise the motor program. This requires reconstruction errors in pixel space to be mapped to corrections in the space of spring stiffnesses. We cannot backpropagate errors through the generative model because it is just a hand-coded computer program. So we learn “generative” networks, one per digit class, that emulate the generator. After learning, backpropagation through these generative networks is used to convert pixel reconstruction errors into stiffness corrections. Our final system gives 1.82% error on the MNIST test set which is similar to the 1.7% achieved by a very different generative approach [5] but worse than the 1.53% produced by the best backpropagation networks or the 1.4% produced by support vector machines [6]. It is much worse than the 0.4% produced by convolutional neural networks that use cleverly enhanced training sets [7]. Recognition of test images is quite slow because it uses ten different recognition networks followed by iterative local search. There is, however, a much more efficient way to make use of our ability to extract motor programs. They can be treated as additional output labels when using backpropagation to train a single, multilayer, discriminative neural network. These additional labels act as a very informative regularizer that reduces the error rate from 1.53% to 1.27% in a network with two hidden layers of 500 units each. This is a new method of improving performance that can be used in conjunction with other tricks such as preprocessing the images, enhancing the training set or using convolutional neural nets [8, 7]. 2 A simple generative model for drawing digits The generative model uses two pairs of opposing springs at right angles. One end of each spring is attached to a frictionless horizontal or vertical rail that is 39 pixels from the center of the image. The other end is attached to a “pen” that has significant mass. The springs themselves are weightless and have zero rest length. The pen starts at the equilibrium position defined by the initial stiffnesses of the four springs. It then follows a trajectory that is determined by the stiffness of each spring at each of the 16 subsequent time steps in the motor program. The mass is large compared with the rate at which the stiffnesses change, so the system is typically far from equilibrium as it follows the smooth trajectory. On each time step, the momentum is multiplied by 0.9 to simulate viscosity. A coarse-grain trajectory is computed by using one step of forward integration for each time step in the motor program, so it contains 17 points. The code is at www.cs.toronto.edu/∼ hinton/code. Figure 2: The training data for each class-specific recognition network is produced by adding noise to motor programs that are inferred from MNIST images using the current parameters of the recognition network. To initiate this process, the biases of the output units are set by hand so that they represent a prototypical motor program for the class. Given a coarse-grain trajectory, we need a way of assigning an intensity to each pixel. We tried various methods until we hand-evolved one that was able to reproduce the MNIST images fairly accurately, but we suspect that many other methods would be just as good. 
For each point on the coarse trajectory, we share two units of ink between the the four closest pixels using bilinear interpolation. We also use linear interpolation to add three fine-grain trajectory points between every pair of coarse-grain points. These fine-grain points also contribute ink to the pixels using bilinear interpolation, but the amount of ink they contribute is zero if they are less than one pixel apart and rises linearly to the same amount as the coarse-grain points if they are more than two pixels apart. This generates a thin skeleton with a fairly uniform ink density. To flesh-out the skeleton, we use two “ink parameters”, a a a a a, b, to specify a 3 × 3 kernel of the form b(1 + a)[ 12 , a , 12 ; a , 1 − a, a ; 12 , a , 12 ] which 6 6 6 6 is convolved with the image four times. Finally, the pixel intensities are clipped to lie in the interval [0,1]. The matlab code is at www.cs.toronto.edu/∼ hinton/code. The values of 2a and b/1.5 are additional, logistic outputs of the recognition networks1 . 3 Training the recognition networks The obvious way to learn a recognition network is to use a training set in which the inputs are images and the target outputs are the motor programs that were used to generate those images. If we knew the distribution over motor programs for a given digit class, we could easily produce such a set by running the generator. Unfortunately, the distribution over motor programs is exactly what we want to learn from the data, so we need a way to train 1 We can add all sorts of parameters to the hand-coded generative model and then get the recognition networks to learn to extract the appropriate values for each image. The global mass and viscosity as well as the spacing of the rails that hold the springs can be learned. We can even implement affinelike transformations by attaching the four springs to endpoints whose eight coordinates are given by the recognition networks. These extra parameters make the learning slower and, for the normalized digits, they do not improve discrimination, probably because they help the wrong digit models as much as the right one. the recognition network without knowing this distribution in advance. Generating scribbles from random motor programs will not work because the capacity of the network will be wasted on irrelevant images that are far from the real data. Figure 2 shows how a single, prototype motor program can be used to initialize a learning process that creates its own training data. The prototype consists of a sequence of 4 × 17 spring stiffnesses that are used to set the biases on 68 of the 70 logistic output units of the recognition net. If the weights coming from the 400 hidden units are initially very small, the recognition net will then output a motor program that is a close approximation to the prototype, whatever the input image. Some random noise is then added to this motor program and it is used to generate a training image. So initially, all of the generated training images are very similar to the one produced by the prototype. The recognition net will therefore devote its capacity to modeling the way in which small changes in these images map to small changes in the motor program. Images in the MNIST training set that are close to the prototype will then be given their correct motor programs. This will tend to stretch the distribution of motor programs produced by the network along the directions that correspond to the manifold on which the digits lie. 
As time goes by, the generated training set will expand along the manifold for that digit class until all of the MNIST training images of that class are well modelled by the recognition network. It takes about 10 hours in matlab on a 3 GHz Xeon to train each recognition network. We use minibatches of size 100, momentum of 0.9, and adaptive learning rates on each connection that increase additively when the sign of the gradient agrees with the sign of the previous weight change and decrease multiplicatively when the signs disagree [9]. The net is generating its own training data, so the objective function is always changing which makes it inadvisable to use optimization methods that go as far as they can in a carefully chosen direction. Figures 3 and 4 show some examples of how well the recognition nets perform after training. Nearly all models achieve an average squared pixel error of less than 15 per image on their validation set (pixel intensities are between 0 and 1 with a preponderance of extreme values). The inferred motor programs are clearly good enough to capture the diverse handwriting styles in the data. They are not good enough, however, to give classification performance comparable to the state-of-the-art on the MNIST database. So we added a series of enhancements to the basic system to improve the classification accuracy. 4 Enhancements to the basic system Extra strokes in ones and sevens. One limitation of the basic system is that it draws digits using only a single stroke (i.e. the trajectory is a single, unbroken curve). But when people draw digits, they often add extra strokes to them. Two of the most common examples are the dash at the bottom of ones, and the dash through the middle of sevens (see examples in figure 5). About 2.2% of ones and 13% of sevens in the MNIST training set are dashed and not modelling the dashes reduces classification accuracy significantly. We model dashed ones and sevens by augmenting their basic motor programs with another motor program to draw the dash. For example, a dashed seven is generated by first drawing an ordinary seven using the motor program computed by the seven model, and then drawing the dash with a motor program computed by a separate neural network that models only dashes. Dashes in ones and sevens are modeled with two different networks. Their training proceeds the same way as with the other models, except now there are only 50 hidden units and the training set contains only the dashed cases of the digit. (Separating these cases from the rest of the MNIST training set is easy because they can be quickly spotted by looking at the difference between the images and their reconstructions by the dashless digit model.) The net takes the entire image of a digit as input, and computes the motor program for just the dash. When reconstructing an unlabelled image as say, a seven, we compute both Figure 3: Examples of validation set images reconstructed by their corresponding model. In each case the original image is on the left and the reconstruction is on the right. Superimposed on the original image is the pen trajectory. the dashed and dashless versions of seven and pick the one with the lower squared pixel error to be that image’s reconstruction as a seven. Figure 5 shows examples of images reconstructed using the extra stroke. Local search. When reconstructing an image in its own class, a digit model often produces a sensible, overall approximation of the image. 
However, some of the finer details of the reconstruction may be slightly wrong and need to be fixed up by an iterative local search that adjusts the motor program to reduce the reconstruction error. We first approximate the graphics model with a neural network that contains a single hidden layer of 500 logistic units. We train one such generative network for each of the ten digits and for the dashed version of ones and sevens (for a total of 12 nets). The motor programs used for training are obtained by adding noise to the motor programs inferred from the training data by the relevant, fully trained recognition network. The images produced from these motor programs by the graphics model are used as the targets for the supervised learning of each generative network. Given these targets, the weight updates are computed in the same way as for the recognition networks. Figure 4: To model 4’s we use a single smooth trajectory, but turn off the ink for timesteps 9 and 10. For images in which the pen does not need to leave the paper, the recognition net finds a trajectory in which points 8 and 11 are close together so that points 9 and 10 are not needed. For 5’s we leave the top until last and turn off the ink for timesteps 13 and 14. Figure 5: Examples of dashed ones and sevens reconstructed using a second stroke. The pen trajectory for the dash is shown in blue, superimposed on the original image. Initial squared pixel error = 33.8 10 iterations, error = 15.2 20 iterations, error = 10.5 30 iterations, error = 9.3 Figure 6: An example of how local search improves the detailed registration of the trajectory found by the correct model. After 30 iterations, the squared pixel error is less than a third of its initial value. Once the generative network is trained, we can use it to iteratively improve the initial motor program computed by the recognition network for an image. The main steps in one iteration are: 1) compute the error between the image and the reconstruction generated from the current motor program by the graphics model; 2) backpropagate the reconstruction error through the generative network to calculate its gradient with respect to the motor program; 3) compute a new motor program by taking a step along the direction of steepest descent plus 0.5 times the previous step. Figure 6 shows an example of how local search improves the reconstruction by the correct model. Local search is usually less effective at improving the fits of the wrong models, so it eliminates about 20% of the classification errors on the validation set. PCA model of the image residuals. The sum of squared pixel errors is not the best way of comparing an image with its reconstruction, because it treats the residual pixel errors as independent and zero-mean Gaussian distributed, which they are not. By modelling the structure in the residual vectors, we can get a better estimate of the conditional probability of the image given the motor program. For each digit class, we construct a PCA model of the image residual vectors for the training images. Then, given a test image, we project the image residual vector produced by each inferred motor program onto the relevant PCA hyperplane and compute the squared distance between the residual and its projection. This gives ten scores for the image that measure the quality of its reconstructions by the digit models. We don’t discard the old sum of squared pixel errors as they are still useful for classifying most images correctly. 
Instead, all twenty scores are used as inputs to the classifier, which decides how to combine both types of scores to achieve high classification accuracy. PCA model of trajectories. Classifying an image by comparing its reconstruction errors for the different digit models tacitly relies on the assumption that the incorrect models will reconstruct the image poorly. Since the models have only been trained on images in their Squared error = 24.9, Shape prior score = 31.5 Squared error = 15.0, Shape prior score = 104.2 Figure 7: Reconstruction of a two image by the two model (left box) and by the three model (right box), with the pen trajectory superimposed on the original image. The three model sharply bends the bottom of its trajectory to better explain the ink, but the trajectory prior for three penalizes it with a high score. The two model has a higher squared error, but a much lower prior score, which allows the classifier to correctly label the image. own class, they often do reconstruct images from other classes poorly, but occasionally they fit an image from another class well. For example, figure 7 shows how the three model reconstructs a two image better than the two model by generating a highly contorted three. This problem becomes even more pronounced with local search which sometimes contorts the wrong model to fit the image really well. The solution is to learn a PCA model of the trajectories that a digit model infers from images in its own class. Given a test image, the trajectory computed by each digit model is scored by its squared distance from the relevant PCA hyperplane. These 10 “prior” scores are then given to the classifier along with the 20 “likelihood” scores described above. The prior scores eliminate many classification mistakes such as the one in figure 7. 5 Classification results To classify a test image, we apply multinomial logistic regression to the 30 scores – i.e. we use a neural network with no hidden units, 10 softmax output units and a cross-entropy error. The net is trained by gradient descent using the scores for the validation set images. To illustrate the gain in classification accuracy achieved by the enhancements explained above, table 1 gives the percent error on the validation set as each enhancement is added to the system. Together, the enhancements almost halve the number of mistakes. Enhancements None 1 1, 2 1, 2, 3 1, 2, 3, 4 Validation set % error 4.43 3.84 3.01 2.67 2.28 Test set % error 1.82 Table 1: The gain in classification accuracy on the validation set as the following enhancements are added: 1) extra stroke for dashed ones and sevens, 2) local search, 3) PCA model of image residual, and 4) PCA trajectory prior. To avoid using the test set for model selection, the performance on the official test set was only measured for the final system. 6 Discussion After training a single neural network to output both the class label and the motor program for all classes (as described in section 1) we tried ignoring the label output and classifying the test images by using the cost, under 10 different PCA models, of the trajectory defined by the inferred motor program. Each PCA model was fitted to the trajectories extracted from the training images for a given class. This gave 1.80% errors which is as good as the 1.82% we got using the 10 separate recognition networks and local search. 
This is quite surprising because the motor programs produced by the single network were simplified to make them all have the same dimensionality and they produced significantly poorer reconstructions. By only using the 10 digit-specific recognition nets to create the motor programs for the training data, we get much faster recognition of test data because at test time we can use a single recognition network for all classes. It also means we do not need to trade-off prior scores against image residual scores because there is only one image residual. The ability to extract motor programs could also be used to enhance the training set. [7] shows that error rates can be halved by using smooth vector distortion fields to create extra training data. They argue that these fields simulate “uncontrolled oscillations of the hand muscles dampened by inertia”. Motor noise may be better modelled by adding noise to an actual motor program as shown in figure 1. Notice that this produces a wide variety of non-blurry images and it can also change the topology. The techniques we have used for extracting motor programs from digit images may be applicable to speech. There are excellent generative models that can produce almost perfect speech if they are given the right formant parameters [10]. Using one of these generative models we may be able to train a large number of specialized recognition networks to extract formant parameters from speech without requiring labeled training data. Once this has been done, labeled data would be available for training a single feed-forward network that could recover accurate formant parameters which could be used for real-time recognition. Acknowledgements We thank Steve Isard, David MacKay and Allan Jepson for helpful discussions. This research was funded by NSERC, CFI and OIT. GEH is a fellow of the Canadian Institute for Advanced Research and holds a Canada Research Chair in machine learning. References [1] D. M. MacKay. Mindlike behaviour in artefacts. British Journal for Philosophy of Science, 2:105–121, 1951. [2] M. Halle and K. Stevens. Speech recognition: A model and a program for research. IRE Transactions on Information Theory, IT-8 (2):155–159, 1962. [3] Murray Eden. Handwriting and pattern recognition. IRE Transactions on Information Theory, IT-8 (2):160–166, 1962. [4] J.M. Hollerbach. An oscillation theory of handwriting. Biological Cybernetics, 39:139–156, 1981. [5] G. Mayraz and G. E. Hinton. Recognizing hand-written digits using hierarchical products of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:189–197, 2001. [6] D. Decoste and B. Schoelkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002. [7] Patrice Y. Simard, Dave Steinkraus, and John Platt. Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recogntion (ICDAR), IEEE Computer Society, Los Alamitos, pages 958–962, 2003. [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. [9] A. Jacobs R. Increased Rates of Convergence Through Learning Rate Adaptation. Technical Report: UM-CS-1987-117. University of Massachusetts, Amherst, MA, 1987. [10] W. Holmes, J. Holmes, and M. Judd. Extension of the bandwith of the jsru parallel-formant synthesizer for high quality synthesis of male and female speech. In Proceedings of ICASSP 90 (1), pages 313–316, 1990.
5 0.16807225 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
Author: Wolfgang Maass, Prashant Joshi, Eduardo D. Sontag
Abstract: The network topology of neurons in the brain exhibits an abundance of feedback connections, but the computational function of these feedback connections is largely unknown. We present a computational theory that characterizes the gain in computational power achieved through feedback in dynamical systems with fading memory. It implies that many such systems acquire through feedback universal computational capabilities for analog computing with a non-fading memory. In particular, we show that feedback enables such systems to process time-varying input streams in diverse ways according to rules that are implemented through internal states of the dynamical system. In contrast to previous attractor-based computational models for neural networks, these flexible internal states are high-dimensional attractors of the circuit dynamics, that still allow the circuit state to absorb new information from online input streams. In this way one arrives at novel models for working memory, integration of evidence, and reward expectation in cortical circuits. We show that they are applicable to circuits of conductance-based Hodgkin-Huxley (HH) neurons with high levels of noise that reflect experimental data on invivo conditions. 1
6 0.16495249 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity
7 0.14702007 181 nips-2005-Spiking Inputs to a Winner-take-all Network
8 0.14525203 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
9 0.14467013 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
10 0.1426394 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
11 0.13686177 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects
12 0.13681872 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
13 0.13458265 136 nips-2005-Noise and the two-thirds power Law
14 0.13265112 141 nips-2005-Norepinephrine and Neural Interrupts
15 0.12733486 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models
16 0.110456 118 nips-2005-Learning in Silicon: Timing is Everything
17 0.10683209 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
18 0.10436795 188 nips-2005-Temporally changing synaptic plasticity
19 0.10320608 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
20 0.098726138 50 nips-2005-Convex Neural Networks
topicId topicWeight
[(0, 0.284), (1, -0.288), (2, 0.017), (3, 0.039), (4, 0.082), (5, -0.143), (6, 0.146), (7, -0.104), (8, -0.003), (9, 0.102), (10, 0.055), (11, 0.076), (12, -0.174), (13, -0.137), (14, 0.164), (15, -0.039), (16, 0.041), (17, -0.011), (18, 0.123), (19, 0.019), (20, 0.164), (21, 0.016), (22, -0.017), (23, 0.083), (24, 0.068), (25, 0.063), (26, 0.042), (27, 0.058), (28, 0.004), (29, -0.02), (30, -0.024), (31, -0.066), (32, 0.149), (33, 0.114), (34, -0.06), (35, -0.048), (36, -0.062), (37, -0.014), (38, 0.033), (39, 0.004), (40, -0.083), (41, 0.033), (42, 0.072), (43, 0.019), (44, 0.094), (45, 0.01), (46, -0.001), (47, 0.037), (48, -0.075), (49, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.9737345 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy
Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1
2 0.58593994 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
Author: Marton G. Danoczy, Richard H. R. Hahnloser
Abstract: Neurons can have rapidly changing spike train statistics dictated by the underlying network excitability or behavioural state of an animal. To estimate the time course of such state dynamics from single- or multiple neuron recordings, we have developed an algorithm that maximizes the likelihood of observed spike trains by optimizing the state lifetimes and the state-conditional interspike-interval (ISI) distributions. Our nonparametric algorithm is free of time-binning and spike-counting problems and has the computational complexity of a Mixed-state Markov Model operating on a state sequence of length equal to the total number of recorded spikes. As an example, we fit a two-state model to paired recordings of premotor neurons in the sleeping songbird. We find that the two state-conditional ISI functions are highly similar to the ones measured during waking and singing, respectively. 1
3 0.58380193 80 nips-2005-Gaussian Process Dynamical Models
Author: Jack Wang, Aaron Hertzmann, David M. Blei
Abstract: This paper introduces Gaussian Process Dynamical Models (GPDM) for nonlinear time series analysis. A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. We marginalize out the model parameters in closed-form, using Gaussian Process (GP) priors for both the dynamics and the observation mappings. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We demonstrate the approach on human motion capture data in which each pose is 62-dimensional. Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces. Webpage: http://www.dgp.toronto.edu/∼ jmwang/gpdm/ 1
4 0.5264346 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
Author: Wolfgang Maass, Prashant Joshi, Eduardo D. Sontag
Abstract: The network topology of neurons in the brain exhibits an abundance of feedback connections, but the computational function of these feedback connections is largely unknown. We present a computational theory that characterizes the gain in computational power achieved through feedback in dynamical systems with fading memory. It implies that many such systems acquire through feedback universal computational capabilities for analog computing with a non-fading memory. In particular, we show that feedback enables such systems to process time-varying input streams in diverse ways according to rules that are implemented through internal states of the dynamical system. In contrast to previous attractor-based computational models for neural networks, these flexible internal states are high-dimensional attractors of the circuit dynamics, that still allow the circuit state to absorb new information from online input streams. In this way one arrives at novel models for working memory, integration of evidence, and reward expectation in cortical circuits. We show that they are applicable to circuits of conductance-based Hodgkin-Huxley (HH) neurons with high levels of noise that reflect experimental data on invivo conditions. 1
5 0.51502776 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
Author: Frank Wood, Stefan Roth, Michael J. Black
Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1
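A minimal sketch of the general model form described above, an energy given by learned functions of 1-D projections of the population activity, using a placeholder piecewise-constant parameterisation; it omits the contrastive-divergence learning and the search over marginal directions.

```python
import numpy as np

class MarginalEnergyModel:
    """Unnormalised density p(x) ~ exp(-E(x)) with E(x) = sum_k f_k(w_k . x),
    where each f_k is a piecewise-constant potential over bins of the projection."""
    def __init__(self, directions, bin_edges, potentials):
        self.W = np.asarray(directions)      # (K, D) projection directions
        self.edges = np.asarray(bin_edges)   # (K, B+1) bin edges per direction
        self.phi = np.asarray(potentials)    # (K, B) potential value per bin

    def energy(self, x):
        total = 0.0
        for k in range(len(self.W)):
            s = float(self.W[k] @ x)
            b = np.clip(np.searchsorted(self.edges[k], s) - 1, 0, self.phi.shape[1] - 1)
            total += self.phi[k, b]
        return total

    def unnormalised_log_prob(self, x):
        return -self.energy(x)

# Placeholder usage on a binary spike-count vector for 10 neurons and 3 random directions.
rng = np.random.default_rng(0)
D, K, B = 10, 3, 8
model = MarginalEnergyModel(directions=rng.normal(size=(K, D)),
                            bin_edges=np.tile(np.linspace(-4, 4, B + 1), (K, 1)),
                            potentials=rng.uniform(0, 1, size=(K, B)))
print(model.unnormalised_log_prob(rng.integers(0, 2, size=D)))
```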
6 0.51113194 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
7 0.50836486 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
8 0.50673014 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits
9 0.49049217 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models
10 0.48117182 136 nips-2005-Noise and the two-thirds power Law
11 0.48106775 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
12 0.46328741 141 nips-2005-Norepinephrine and Neural Interrupts
13 0.46316046 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
14 0.44841599 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects
15 0.44678843 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity
16 0.43855667 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning
17 0.41622156 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
18 0.39427945 146 nips-2005-On the Accuracy of Bounded Rationality: How Far from Optimal Is Fast and Frugal?
19 0.39324468 181 nips-2005-Spiking Inputs to a Winner-take-all Network
20 0.37982684 118 nips-2005-Learning in Silicon: Timing is Everything
topicId topicWeight
[(3, 0.021), (10, 0.053), (11, 0.012), (27, 0.041), (31, 0.059), (34, 0.06), (39, 0.033), (41, 0.011), (44, 0.012), (50, 0.016), (55, 0.031), (57, 0.037), (60, 0.021), (63, 0.224), (65, 0.016), (69, 0.071), (73, 0.024), (88, 0.126), (91, 0.063)]
simIndex simValue paperId paperTitle
1 0.92352796 59 nips-2005-Dual-Tree Fast Gauss Transforms
Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray
Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finitedifference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1 Fast Gaussian Summation Kernel summations are fundamental in both statistics/learning and computational physics. NR e This paper will focus on the common form G(xq ) = −||xq −xr ||2 2h2 i.e. where the ker- r=1 nel is the Gaussian kernel with scaling parameter, or bandwidth h, there are NR reference points xr , and we desire the sum for NQ different query points xq . Such kernel summations appear in a wide array of statistical/learning methods [5], perhaps most obviously in kernel density estimation [11], the most widely used distribution-free method for the fundamental task of density estimation, which will be our main example. Understanding kernel summation algorithms from a recently developed unified perspective [5] begins with the picture of Figure 1, then separately considers the discrete and continuous aspects. Discrete/geometric aspect. In terms of discrete algorithmic structure, the dual-tree framework of [5], in the context of kernel summation, generalizes all of the well-known algorithms. 1 It was applied to the problem of kernel density estimation in [7] using a simple 1 These include the Barnes-Hut algorithm [2], the Fast Multipole Method [8], Appel’s algorithm [1], and the WSPD [4]: the dual-tree method is a node-node algorithm (considers query regions rather than points), is fully recursive, can use distribution-sensitive data structures such as kd-trees, and is bichromatic (can specialize for differing query and reference sets). Figure 1: The basic idea is to approximate the kernel sum contribution of some subset of the reference points XR , lying in some compact region of space R with centroid xR , to a query point. In more efficient schemes a query region is considered, i.e. the approximate contribution is made to an entire subset of the query points XQ lying in some region of space Q, with centroid xQ . finite-difference approximation, which is tantamount to a centroid approximation. Partially by avoiding series expansions, which depend explicitly on the dimension, the result was the fastest such algorithm for general dimension, when operating at the optimal bandwidth. Unfortunately, when performing cross-validation to determine the (initially unknown) optimal bandwidth, both suboptimally small and large bandwidths must be evaluated. 
The finite-difference-based dual-tree method tends to be efficient at or below the optimal bandwidth, and at very large bandwidths, but for intermediately-large bandwidths it suffers. Continuous/approximation aspect. This motivates investigating a multipole-like series approximation which is appropriate for the Gaussian kernel, as introduced by [9], which can be shown the generalize the centroid approximation. We define the Hermite functions 2 hn (t) by hn (t) = e−t Hn (t), where the Hermite polynomials Hn (t) are defined by the 2 2 Rodrigues formula: Hn (t) = (−1)n et Dn e−t , t ∈ R1 . After scaling and shifting the argument t appropriately, then taking the product of univariate functions for each dimension, we obtain the multivariate Hermite expansion NR G(xq ) = e −||xq −xr ||2 2h2 NR = r=1 r=1 α≥0 1 α! xr − xR √ 2h2 α hα xq − xR √ 2h2 (1) where we’ve adopted the usual multi-index notation as in [9]. This can be re-written as NR G(xq ) = e r=1 −||xq −xr ||2 2h2 NR = r=1 α≥0 1 hα α! xr − xQ √ 2h2 xq − xQ √ 2h2 α (2) to express the sum as a Taylor (local) expansion about a nearby representative centroid xQ in the query region. We will be using both types of expansions simultaneously. Since series approximations only hold locally, Greengard and Rokhlin [8] showed that it is useful to think in terms of a set of three ‘translation operators’ for converting between expansions centered at different points, in order to create their celebrated hierarchical algorithm. This was done in the context of the Coulombic kernel, but the Gaussian kernel has importantly different mathematical properties. The original Fast Gauss Transform (FGT) [9] was based on a flat grid, and thus provided only one operator (“H2L” of the next section), with an associated error bound (which was unfortunately incorrect). The Improved Fast Gauss Transform (IFGT) [14] was based on a flat set of clusters and provided no operators with a rearranged series approximation, which intended to be more favorable in higher dimensions but had an incorrect error bound. We will show the derivations of all the translation operators and associated error bounds needed to obtain, for the first time, a hierarchical algorithm for the Gaussian kernel. 2 Translation Operators and Error Bounds The first operator converts a multipole expansion of a reference node to form a local expansion centered at the centroid of the query node, and is our main approximation workhorse. Lemma 2.1. Hermite-to-local (H2L) translation operator for Gaussian kernel (as presented in Lemma 2.2 in [9, 10]): Given a reference node XR , a query node XQ , and the Hermite expansion centered at a centroid xR of XR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 , the Taylor expansion of the Hermite expansion at the centroid xQ of the query node XQ is given by G(xq ) = Bβ β≥0 xq −xQ √ 2h2 β where Bβ = (−1)|β| β! Aα hα+β α≥0 xQ −xR √ 2h2 . Proof. (sketch) The proof consists of replacing the Hermite function portion of the expansion with its Taylor series. NR Note that we can rewrite G(xq ) = α≥0 r=1 1 α! xr −xR √ 2h2 α hα xq −xR √ 2h2 by interchanging the summation order, such that the term in the brackets depends only on the reference points, and can thus be computed indepedent of any query location – we will call such terms Hermite moments. The next operator allows the efficient pre-computation of the Hermite moments in the reference tree in a bottom-up fashion from its children. Lemma 2.2. 
Hermite-to-Hermite (H2H) translation operator for Gaussian kernel: Given the Hermite expansion centered at a centroid xR′ in a reference node XR′ : xq −x G(xq ) = A′ hα √2hR′ , this same Hermite expansion shifted to a new locaα 2 α≥0 tion xR of the parent node of XR is given by G(xq ) = Aγ hγ γ≥0 Aγ = 0≤α≤γ 1 ′ (γ−α)! Aα xR′ −xR √ 2h2 xq −xR √ 2h2 where γ−α . Proof. We simply replace the Hermite function part of the expansion by a new Taylor series, as follows: « x q − x R′ √ 2h2 α≥0 „ « X ′ X 1 „ x R − x R′ « β xq − xR √ √ = Aα (−1)|β| hα+β β! 2h2 2h2 α≥0 β≥0 „ «β « „ X X ′ 1 x R − x R′ xq − xR |β| √ √ (−1) hα+β = Aα β! 2h2 2h2 α≥0 β≥0 „ «β „ « X X ′ 1 x R′ − x R xq − xR √ √ Aα = hα+β β! 2h2 2h2 α≥0 β≥0 3 2 «γ−α „ « „ X X 1 x R′ − x R q ′ 5 hγ x√− xR 4 √ = Aα (γ − α)! 2h2 2h2 γ≥0 0≤α≤γ G(xq ) = where γ = α + β. X A′ hα α „ The next operator acts as a “clean-up” routine in a hierarchical algorithm. Since we can approximate at different scales in the query tree, we must somehow combine all the approximations at the end of the computation. By performing a breadth-first traversal of the query tree, the L2L operator shifts a node’s local expansion to the centroid of each child. Lemma 2.3. Local-to-local (L2L) translation operator for Gaussian kernel: Given a Taylor expansion centered at a centroid xQ′ of a query node XQ′ : G(xq ) = xq −xQ′ √ 2h2 Bβ β≥0 β , the Taylor expansion obtained by shift- ing this expansion to the new centroid xQ of the child node XQ is G(xq ) = α≥0 β≥α β! α!(β−α)! Bβ β−α xQ −xQ′ √ 2h2 xq −xQ √ 2h2 α . Proof. Applying the multinomial theorem to to expand about the new center xQ yields: G(xq ) = X Bβ β≥0 = „ XX β≥0 α≤β xq − xQ′ √ 2h2 Bβ «β β! α!(β − α)! „ xQ − xQ′ √ 2h2 «β−α „ xq − xQ √ 2h2 «α . whose summation order can be interchanged to achieve the result. Because the Hermite and the Taylor expansion are truncated after taking pD terms, we incur an error in approximation. The original error bounds for the Gaussian kernel in [9, 10] were wrong and corrections were shown in [3]. Here, we will present all necessary three error bounds incurred in performing translation operators. We note that these error bounds place limits on the size of the query node and the reference node. 2 Lemma 2.4. Error Bound for Truncating an Hermite Expansion (as presented in [3]): Suppose we are given an Hermite expansion of a reference node XR about its centroid xR : G(xq ) = Aα hα α≥0 xq −xR √ 2h2 NR where Aα = r=1 1 α! xr −xR √ 2h2 α . For any query point xq , the error due to truncating the series after the first pD term is |ǫM (p)| ≤ rp )k rp √ p! NR (1−r)D D−1 k=0 D k (1 − D−k where ∀xr ∈ XR satisfies ||xr − xR ||∞ < rh for r < 1. Proof. (sketch) We expand the Hermite expansion as a product of one-dimensional Hermite functions, and utilize a bound on one-dimensional Hermite functions due to [13]: n −x2 1 2 √ 2 e 2 , n ≥ 0, x ∈ R1 . n! |hn (x)| ≤ n! Lemma 2.5. Error Bound for Truncating a Taylor Expansion Converted from an Hermite Expansion of Infinite Order: Suppose we are given the following Taylor expansion about the centroid xQ of a query node G(xq ) = Bβ β≥0 2 xq −xQ √ 2h2 β where `Strainn[12] proposed the interesting idea of using Stirling’s formula (for any non-negative integer ´ ≤ n!) to lift the node size constraint; one might imagine that this could allow approxin: n+1 e mation of larger regions in a tree-based algorithm. Unfortunately, the error bounds developed in [12] were also incorrect. 
We have derived the three necessary corrected error bounds based on the techniques in [3]. However, due to space, and because using these bounds actually degraded performance slightly, we do not include those lemmas here. (−1)|β| β! Bβ = Aα hα+β α≥0 xQ −xR √ 2h2 and Aα ’s are the coefficients of the Hermite ex- pansion centered at the reference node centroid xR . Then, truncating the series after pD terms satisfies the error bound |ǫL (p)| ≤ NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k where ||xq − xQ ||∞ < rh for r < 1, ∀xq ∈ XQ . Proof. Taylor expansion of the Hermite function yields e −||xq −xr ||2 2h2 Use e „ «„ «β X (−1)|β| X 1 „ xr − xR «α xq − xQ xQ − xR √ √ √ hα+β = β! α! 2h2 2h2 2h2 α≥0 β≥0 «α „ «„ «β „ X (−1)|β| X 1 xR − xr xQ − xR xq − xQ |α| √ √ √ = (−1) hα+β β! α! 2h2 2h2 2h2 β≥0 α≥0 «„ «β „ X (−1)|β| xq − xQ xQ − xr √ √ = hβ β! 2h2 2h2 β≥0 −||xq −xr ||2 2h2 D = i=1 (up (xqi , xri , xQi ) + vp (xqi , xri , xQi )) for 1 ≤ i ≤ D, where «„ «n „ X (−1)ni xqi − xQi i xQi − xri √ √ hni ni ! 2h2 2h2 ni =0 „ «„ «ni ∞ X (−1)ni xqi − xQi xQi − xri √ √ hni vp (xqi , xri , xQi ) = . ni ! 2h2 2h2 ni =p p−1 up (xqi , xri , xQi ) = 1−r p 1−r These univariate functions respectively satisfy up (xqi , xri , xQi ) ≤ 1 rp vp (xqi , xri , xQi ) ≤ √p! 1−r , for 1 ≤ i ≤ D, achieving the multivariate bound. and Lemma 2.6. Error Bound for Truncating a Taylor Expansion Converted from an Already Truncated Hermite Expansion: A truncated Hermite expansion centered about xq −xR the centroid xR of a reference node G(xq ) = Aα hα √2h2 has the following α < rh, and a reference node XR for which ||xr − xR ||∞ < rh for r < 1 , ∀xq ∈ XQ , ∀xr ∈ XR . 2 Proof. We define upi = up (xqi , xri , xQi , xRi ), vpi = vp (xqi , xri , xQi , xRi ), wpi = wp (xqi , xri , xQi , xRi ) for 1 ≤ i ≤ D: upi = „ «„ «ni p−1 X X (−1)ni p−1 1 „ xR − xr «nj xqi − xQi xQi − xRi i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =0 j vpi = „ «„ «n ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =p nj ! 2h2 2h2 2h2 ni =0 j p−1 wpi = „ «„ «n ∞ ∞ X (−1)ni X 1 „ xR − xr «nj xQi − xRi xqi − xQi i i i √ √ √ (−1)nj hni +nj ni ! n =0 nj ! 2h2 2h2 2h2 ni =p j Note that e −||xq −xr ||2 2h2 D = i=1 (upi + vpi + wpi ) for 1 ≤ i ≤ D. Using the bound for Hermite functions and the property of geometric series, we obtain the following upper bounds: p−1 p−1 upi ≤ X X (2r)ni (2r)nj = ni =0 nj =0 „ 1 − (2r)p ) 1 − 2r «2 „ «„ « p−1 ∞ 1 X X 1 1 − (2r)p (2r)p vpi ≤ √ (2r)ni (2r)nj = √ 1 − 2r 1 − 2r p! n =0 n =p p! i 1 wpi ≤ √ p! j ∞ ∞ X X ni =p nj 1 (2r)ni (2r)nj = √ p! =0 „ 1 1 − 2r «„ (2r)p 1 − 2r « Therefore, ˛ ˛ ! «D−k „ D D−1 ˛ −||xq −xr ||2 ˛ Y X D ((2r)p )(2 − (2r)p ) ˛ ˛ −2D 2 2h √ − upi ˛ ≤ (1 − 2r) ((1 − (2r)p )2 )k ˛e ˛ ˛ k p! i=1 k=0 ˛ ˛ ˛ « ˛ „ „ « D−1 “ ” X D X ˛ xq − xQ β ˛ ((2r)p )(2 − (2r)p ) D−k NR p 2 k ˛≤ ˛G(xq ) − √ ((1 − (2r) ) ) √ Cβ ˛ ˛ 2D (1 − 2r) k p! 2h2 ˛ ˛ k=0 β < 2h, pDH = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if Q.maxside < 2h, pDL = the smallest p ≥ 1 such that NR (1−r)D D−1 k=0 D k (1 − rp )k rp √ p! D−k < ǫGmin . Q if max(Q.maxside,R.maxside) < h, pH2L = the smallest p ≥ 1 such that NR (1−2r)2D D−1 k=0 D k ((1 − (2r)p )2 )k ((2r)p )(2−(2r)p ) √ p! D−k < ǫGmin . Q cDH = pD NQ . cDL = pD NR . cH2L = DpD+1 . cDirect = DNQ NR . DH DL H2L if no Hermite coefficient of order pDH exists for XR , Compute it. cDH = cDH + pD NR . DH if no Hermite coefficient of order pH2L exists for XR , Compute it. cH2L = cH2L + pD NR . 
H2L c = min(cDH , cDL , cH2L , cDirect ). if c = cDH < ∞, (Direct Hermite) Evaluate each xq at the Hermite series of order pDH centered about xR of XR using Equation 1. if c = cDL < ∞, (Direct Local) Accumulate each xr ∈ XR as the Taylor series of order pDL about the center xQ of XQ using Equation 2. if c = cH2L < ∞, (Hermite-to-Local) Convert the Hermite series of order pH2L centered about xR of XR to the Taylor series of the same order centered about xQ of XQ using Lemma 2.1. if c = cDirect , Update Gmin and Gmax in Q and all its children. return. if leaf(Q) and leaf(R), Perform the naive algorithm on every pair of points in Q and R. else DFGT(Q.left, R.left). DFGT(Q.left, R.right). DFGT(Q.right, R.left). DFGT(Q.right, R.right). ˛ ˛ ˛b ˛ For the FGT, note that the algorithm only ensures: ˛G(xq ) − Gtrue (xq )˛ ≤ τ . Therefore, we first set τ = ǫ, halving τ until the error tolerance ǫ was met. For the IFGT, which has multiple parameters that must be tweaked simultaneously, an automatic scheme was created, based on the recommendations given in the paper and software documentation: For D = 2, use p = 8; for D = 3, √ use p = 6; set ρx = 2.5; start with K = N and double K until the error tolerance is met. When this failed to meet the tolerance, we resorted to additional trial and error by hand. The costs of parameter selection for these methods in both computer and human time is not included in the table. 4 Algorithm \ scale Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH Naive FGT IFGT DFD DFGT DFGTH 0.001 0.01 0.1 1 10 100 sj2-50000-2 (astronomy: positions), D = 2, N = 50000, h∗ = 0.00139506 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 3.892312 2.01846 0.319538 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.837724 1.087066 1.658592 6.018158 62.077669 151.590062 0.849935 1.11567 4.599235 72.435177 18.450387 2.777454 0.846294 1.10654 1.683913 6.265131 5.063365 1.036626 ∗ = 0.0016911 colors50k (astronomy: colors), D = 2, N = 50000, h 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM > 2×Naive > 2×Naive 0.475281 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1.095838 1.469454 2.802112 30.294007 280.633106 81.373053 1.099828 1.983888 29.231309 285.719266 12.886239 5.336602 1.081216 1.47692 2.855083 24.598749 7.142465 1.78648 ∗ edsgc-radec-rnd (astronomy: angles), D = 2, N = 50000, h = 0.00466204 301.696 301.696 301.696 301.696 301.696 301.696 out of RAM out of RAM out of RAM 2.859245 1.768738 0.210799 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.812462 1.083528 1.682261 5.860172 63.849361 357.099354 0.84023 1.120015 4.346061 73.036687 21.652047 3.424304 0.821672 1.104545 1.737799 6.037217 5.7398 1.883216 ∗ mockgalaxy-D-1M-rnd (cosmology: positions), D = 3, N = 50000, h = 0.000768201 354.868751 354.868751 354.868751 354.868751 354.868751 354.868751 out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 0.70054 0.701547 0.761524 0.843451 1.086608 42.022605 0.73007 0.733638 0.799711 0.999316 50.619588 125.059911 0.724004 0.719951 0.789002 0.877564 1.265064 22.6106 ∗ bio5-rnd (biology: drug activity), D = 5, N = 50000, h = 0.000567161 364.439228 364.439228 364.439228 364.439228 364.439228 364.439228 out of RAM out of RAM out of RAM out of RAM out of RAM out of RAM > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 2.249868 2.4958865 4.70948 
12.065697 94.345003 412.39142 > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive > 2×Naive 1000 301.696 0.183616 7.576783 1.551019 2.532401 0.68471 301.696 0.114430 7.55986 3.604753 3.5638 0.627554 301.696 0.059664 7.585585 0.743045 1.977302 0.436596 354.868751 > 2×Naive > 2×Naive 383.12048 109.353701 87.488392 364.439228 out of RAM > 2×Naive 107.675935 > 2×Naive > 2×Naive Discussion. The experiments indicate that the DFGTH method is able to achieve reasonable performance across all bandwidth scales. Unfortunately none of the series approximation-based methods do well on the 5-dimensional data, as expected, highlighting the main weakness of the approach presented. Pursuing corrections to the error bounds necessary to use the intriguing series form of [14] may allow an increase in dimensionality. References [1] A. W. Appel. An Efficient Program for Many-Body Simulations. SIAM Journal on Scientific and Statistical Computing, 6(1):85–103, 1985. [2] J. Barnes and P. Hut. A Hierarchical O(N logN ) Force-Calculation Algorithm. Nature, 324, 1986. [3] B. Baxter and G. Roussos. A new error estimate of the fast gauss transform. SIAM Journal on Scientific Computing, 24(1):257–259, 2002. [4] P. Callahan and S. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM, 62(1):67–90, January 1995. [5] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001. [6] A. G. Gray. Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003. [7] A. G. Gray and A. W. Moore. Rapid Evaluation of Multiple Density Models. In Artificial Intelligence and Statistics 2003, 2003. [8] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 1987. [9] L. Greengard and J. Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991. [10] L. Greengard and X. Sun. A new version of the fast gauss transform. Documenta Mathematica, Extra Volume ICM(III):575– 584, 1998. [11] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986. [12] J. Strain. The fast gauss transform with variable scales. SIAM Journal on Scientific and Statistical Computing, 12:1131– 1139, 1991. [13] O. Sz´ sz. On the relative extrema of the hermite orthogonal functions. J. Indian Math. Soc., 15:129–134, 1951. a [14] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis. Improved fast gauss transform and efficient kernel density estimation. International Conference on Computer Vision, 2003.
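As a concrete illustration of the Hermite expansion in Equation 1 (not of the full dual-tree algorithm or its error-controlled recursion), the sketch below compares a naive one-dimensional Gaussian summation with a single-centre truncated Hermite approximation. It is only accurate when the reference points lie within a fraction of a bandwidth of their centroid, in line with the node-size constraints in the lemmas above; the point counts and bandwidth are placeholder choices.

```python
import math
import numpy as np

def hermite_functions(t, p):
    """h_n(t) = exp(-t^2) H_n(t) for n = 0..p-1, via the Hermite recurrence."""
    H = np.empty((p, len(t)))
    H[0] = 1.0
    if p > 1:
        H[1] = 2.0 * t
    for n in range(2, p):
        H[n] = 2.0 * t * H[n - 1] - 2.0 * (n - 1) * H[n - 2]
    return np.exp(-t ** 2) * H

def gauss_sum_naive(xq, xr, h):
    return np.exp(-(xq[:, None] - xr[None, :]) ** 2 / (2.0 * h ** 2)).sum(axis=1)

def gauss_sum_hermite(xq, xr, h, p=10):
    """Single-centre truncated Hermite expansion about the reference centroid."""
    x_centroid = xr.mean()
    s = (xr - x_centroid) / np.sqrt(2.0 * h ** 2)                           # scaled offsets
    A = np.array([(s ** n).sum() / math.factorial(n) for n in range(p)])    # Hermite moments
    t = (xq - x_centroid) / np.sqrt(2.0 * h ** 2)
    return A @ hermite_functions(t, p)

rng = np.random.default_rng(1)
xr = rng.normal(0.0, 0.1, 500)        # tight reference cluster relative to the bandwidth
xq = np.linspace(-1.0, 1.0, 5)
h = 0.5
print(gauss_sum_naive(xq, xr, h))
print(gauss_sum_hermite(xq, xr, h))   # should agree closely for this cluster size
```

The dual-tree method extends this idea hierarchically, choosing per node-pair between direct evaluation, Hermite evaluation, local (Taylor) accumulation and Hermite-to-local translation, as described above.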
same-paper 2 0.85591769 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy
Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1
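A generative sketch of the model class described in the abstract above: low-dimensional recurrent latent dynamics driving point-process (here, per-bin Poisson) spike outputs. The specific dynamics map, dimensions and rates are placeholder assumptions for illustration; they are not the parameters estimated from the PMd recordings.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim_latent, n_neurons, dt = 200, 3, 96, 0.01   # 10 ms bins, 96-electrode-style population

A = 0.95 * np.eye(dim_latent) + 0.05 * rng.normal(size=(dim_latent, dim_latent))
C = rng.normal(0.0, 1.0, (n_neurons, dim_latent))  # loading of the latent state onto each neuron
d = np.full(n_neurons, np.log(20.0 * dt))          # baseline of roughly 20 spikes/s

x = np.zeros(dim_latent)
latents = np.zeros((T, dim_latent))
spikes = np.zeros((T, n_neurons), dtype=int)
for t in range(T):
    x = np.tanh(A @ x) + 0.1 * rng.normal(size=dim_latent)   # nonlinear, recurrent latent dynamics
    rate = np.exp(C @ x + d)                                   # per-bin intensity of each neuron
    latents[t] = x
    spikes[t] = rng.poisson(rate)
```

Fitting such a model would then amount to inferring both the trial-by-trial latent trajectories and the parameters (here A, C, d) from the observed spike trains, which is what the latent-variable methods described above address.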
3 0.62223423 136 nips-2005-Noise and the two-thirds power Law
Author: Uri Maoz, Elon Portugaly, Tamar Flash, Yair Weiss
Abstract: The two-thirds power law, an empirical law stating an inverse non-linear relationship between the tangential hand speed and the curvature of its trajectory during curved motion, is widely acknowledged to be an invariant of upper-limb movement. It has also been shown to exist in eye motion and locomotion, and was even demonstrated in motion perception and prediction. This ubiquity has fostered various attempts to uncover the origins of this empirical relationship. In these, it was generally attributed either to smoothness in hand- or joint-space, or to the result of mechanisms that damp noise inherent in the motor system to produce the smooth trajectories evident in healthy human motion. We show here that white Gaussian noise also obeys this power-law. Analysis of signal and noise combinations shows that trajectories that were synthetically created not to comply with the power-law are transformed to power-law compliant ones after combination with low levels of noise. Furthermore, there exist colored noise types that drive non-power-law trajectories to power-law compliance and are not affected by smoothing. These results suggest caution when running experiments aimed at verifying the power-law or assuming its underlying existence without proper analysis of the noise. Our results could also suggest that the power-law might be derived not from smoothness or smoothness-inducing mechanisms operating on the noise inherent in our motor system, but rather from the correlated noise which is inherent in this motor system. 1
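A minimal sketch of the kind of check this abstract calls for: estimate the speed-curvature exponent of a sampled 2-D trajectory and apply it to low-pass-filtered white Gaussian noise. The smoothing window, sampling step and thresholds are placeholder assumptions; the two-thirds power law corresponds to an exponent near 1/3.

```python
import numpy as np

def speed_curvature_exponent(x, y, dt=0.01):
    """Fit beta in log v = a - beta * log kappa from a sampled planar trajectory."""
    dx, dy = np.gradient(x, dt), np.gradient(y, dt)
    ddx, ddy = np.gradient(dx, dt), np.gradient(dy, dt)
    v = np.hypot(dx, dy)                                              # tangential speed
    kappa = np.abs(dx * ddy - dy * ddx) / np.maximum(v, 1e-12) ** 3   # curvature
    ok = (v > 1e-9) & (kappa > 1e-9)
    slope, _ = np.polyfit(np.log(kappa[ok]), np.log(v[ok]), 1)
    return -slope

def smooth(z, width=25):
    window = np.hanning(width)
    return np.convolve(z, window / window.sum(), mode="same")

rng = np.random.default_rng(0)
x = smooth(rng.normal(size=2000))
y = smooth(rng.normal(size=2000))
print(speed_curvature_exponent(x, y))   # for such noise trajectories this often lands near 1/3
```

The same routine applied to trajectories synthesised to violate the power law, with and without added noise, would reproduce the kind of analysis the abstract advocates.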
4 0.62012875 30 nips-2005-Assessing Approximations for Gaussian Process Classification
Author: Malte Kuss, Carl E. Rasmussen
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. 
In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi , xi )|i = 1, . . . , m}. Let fi = f (xi ) and f = [f1 , . . . , fm ] be shorthand for the values of the latent function and y = [y1 , . . . , ym ] and X = [x1 , . . . , xm ] collect the class labels and inputs respectively. Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factories: m m p(yi |fi ) = p(y|f ) = i=1 Φ(yi fi ), i=1 and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x |θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encodes assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes: m p(f |D, θ) = N (f |0, K) Φ(yi fi ), p(D|θ) i=1 where p(D|θ) = p(y|f )p(f |X, θ)df , denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, or predictions can be computed analytically, so approximations are needed. 2 Approximate Bayesian Inference For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f |D, θ) ≈ q(f |D, θ) = N (f |m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP) which are two alternative approaches to finding parameters m and A of the Gaussian q(f |D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi |D, θ) by the marginals of the approximation N (fi |mi , Aii ). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions. For a test input x∗ the fi 1 0.16 0.14 0.8 0.6 0.1 fj p(y|f) p(f|y) 0.12 Likelihood p(y|f) Prior p(f) Posterior p(f|y) Laplace q(f|y) EP q(f|y) 0.08 0.4 0.06 0.04 0.2 0.02 0 −4 0 4 8 0 f . (a) (b) Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N (f |0, 52 ) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high dimensional zeromean Gaussian prior as an ellipse. 
The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin. approximate predictive latent and class probabilities are: 2 q(f∗ |D, θ, x∗ ) = N (µ∗ , σ∗ ), and 2 q(y∗ = 1|D, x∗ ) = Φ(µ∗ / 1 + σ∗ ), 2 where µ∗ = k∗ K−1 m and σ∗ = k(x∗ , x∗ )−k∗ (K−1 − K−1 AK−1 )k∗ , where the vector k∗ = [k(x1 , x∗ ), . . . , k(xm , x∗ )] collects covariances between x∗ and training inputs X. MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic O(m3 ) as common among non-sparse GP models due to inversions m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon. 3 Structural Properties of the Posterior and its Approximations Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N (f |0, K) centred at the origin. Each likelihood term p(yi |fi ) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2]. Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) for the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative for the posterior distribution, while the mean moves to the mass of the posterior making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware of making a Gaussian approximation to a non-Gaussian posterior. 
From the properties of the posterior it can be expected that Laplace’s method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be underestimated systematically, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below. 4 Experiments In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form: k(x, x |θ) = σ 2 exp − 1 x − x 2 2 / 2 , (1) such that θ = [σ, ]. We refer to σ 2 as the signal variance and to as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function and we believe the conclusions about the accuracy of approximations to be independent of this choice, since it relies on arguments which are independent of the form of the covariance function. As measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let p∗ = p(y∗ = 1|D, θ, x∗ ) be the model’s prediction, then we average: I(p∗ , yi ) = i yi +1 2 log2 (p∗ ) + i 1−yi 2 log2 (1 − p∗ ) + H i (2) over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2. 
[Figure 2 panels (1a)-(4c): contour plots over log lengthscale log(l) and log magnitude log(σf), showing the log marginal likelihood and the information about test targets in bits; numeric contour labels omitted.]
Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively.
[Figure 3 panels (a)-(c): histograms of MCMC samples together with the Laplace and EP approximations to p(f|D); axis tick labels omitted.]
Figure 3: Panel (a) and (b) show two marginal distributions p(fi|D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples.
For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace's method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well (footnote 1), given that the marginal likelihood comes as a 767- respectively 200-dimensional integral.
The EP predictions contain as much information about the test cases as the MCMC predictions and significantly more than for LA. Note that for small signal variances (roughly ln(σ 2 ) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results. For the USPS 3’s vs. 5’s we now inspect the marginal distributions p(fi |D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln( ) = 2.85 which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f |D, θ). For LA and EP the approximate marginals are q(fi |D, θ) = N (fi |mi , Aii ) where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1% but LA gives a wrong p(yi = 1|D, θ) ≈ 18%. In the experiment we saw that the marginal distributions of the posterior often agree very 1 Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale. well with a Gaussian approximation. This seems to contradict the description given in the previous section were we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767 dimensional Gaussian N (x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues are 1. We then truncate this distribution such that all xi ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi . The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property, that for a truncated high-dimensional Gaussian, resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian – however, the Laplace approximation centred on the origin would be completely inappropriate. In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well known benchmark problems. Each data set is randomly split into 10 folds of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. 
For GPC we implement model selection by ML-II hyperparameter estimation, reporting results given the θ that maximised the respective approximate marginal likelihoods p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and ℓ are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8]. Results are summarised in Table 1. Comparing Laplace's method to EP, the latter proves to be more accurate both in terms of error rate and information. While the error rates are relatively similar, the predictive distribution obtained by EP proves to be more informative about the test targets. Note that for GPC the error rate only depends on the sign of the mean µ∗ of the approximated posterior over latent functions and not the entire posterior predictive distribution. As is to be expected, the length of the mean vector m shows much larger values for the EP approximations. Comparing EP and SVMs the results are mixed. For the Crabs data set all methods show the same error rate but the information content of the predictive distributions differs dramatically. For some test cases the SVM predicts the wrong class with large certainty.
5 Summary & Conclusions
Our experiments reveal serious differences between Laplace's method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small amplitude, although the sign of the mean function will be mostly correct. As an effect LA gives over-conservative predictive probabilities, and diminished information about the test labels. This effect has been shown empirically on several real world examples. Large resulting discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note that the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2. For this particular task, we have seen that the sign of the latent function tends to be correct (at least at the training locations).

Data Set      m     n    | Laplace E%  I      |m|     | EP E%   I      |m|       | SVM E%  I
Ionosphere    351   34   | 8.84        0.591  49.96   | 7.99    0.661  124.94    | 5.69    0.681
Wisconsin     683   9    | 3.21        0.804  62.62   | 3.21    0.805  84.95     | 3.21    0.795
Pima Indians  768   8    | 22.77       0.252  29.05   | 22.63   0.253  47.49     | 23.01   0.232
Crabs         200   7    | 2.0         0.682  112.34  | 2.0     0.908  2552.97   | 2.0     0.047
Sonar         208   60   | 15.36       0.439  26.86   | 13.85   0.537  15678.55  | 11.14   0.567
USPS 3 vs 5   1540  256  | 2.27        0.849  163.05  | 2.21    0.902  22011.70  | 2.01    0.918

Table 1: Results for benchmark data sets. The first three columns give the name of the data set, the number of observations m and the dimension of inputs n. For Laplace's method and EP the table reports the average error rate E%, the average information I (2) and the average length |m| of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label.
The EP approximation has shown to give results very close to MCMC both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar for different tasks. Using AIS we were able to show the accuracy of marginal likelihood estimates, which to the best of our knowledge has never been done before. In summary, we found that EP is the method of choice for approximate inference in binary GPC models, when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2]. Acknowledgements Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST2002-506778. This publication only reflects the authors’ views. References [1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996. [2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005. [3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998. [4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001. [5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998. [6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001. [7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. CUP, 2003. [8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. The MIT Press, 2000.
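To make the Laplace step discussed above concrete, here is a minimal sketch of Newton's method for finding the posterior mode in GP probit classification (following the standard formulation in [3]). It covers only the mode-finding part of LA, not the EP updates, marginal-likelihood estimates or MCMC comparisons reported in the paper; the kernel, data and iteration count below are placeholders.

```python
import numpy as np
from scipy.stats import norm

def laplace_gpc_mode(K, y, n_iter=30):
    """Newton iterations for the mode of the GP probit-classification posterior."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        z = y * f
        ratio = norm.pdf(z) / np.maximum(norm.cdf(z), 1e-12)
        grad = y * ratio                        # d/df   log Phi(y f)
        W = ratio ** 2 + z * ratio              # -d^2/df^2 log Phi(y f), positive
        sqW = np.sqrt(W)
        B = np.eye(n) + sqW[:, None] * K * sqW[None, :]
        L = np.linalg.cholesky(B)
        b = W * f + grad
        a = b - sqW * np.linalg.solve(L.T, np.linalg.solve(L, sqW * (K @ b)))
        f = K @ a                                # updated mode estimate
    return f

# Placeholder usage with a squared-exponential kernel on random 2-D inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2) + 1e-6 * np.eye(40)
f_hat = laplace_gpc_mode(K, y)
print(np.mean(np.sign(f_hat) == y))              # training agreement of the latent mean's sign
```

Replacing this mode-matching step with moment matching of the site marginals gives the EP alternative that the experiments above favour.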
5 0.61366498 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
Author: Frank Wood, Stefan Roth, Michael J. Black
Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1
6 0.61022621 181 nips-2005-Spiking Inputs to a Winner-take-all Network
7 0.60646093 48 nips-2005-Context as Filtering
8 0.60631198 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
9 0.6033802 45 nips-2005-Conditional Visual Tracking in Kernel Space
10 0.60293388 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
11 0.6028899 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
12 0.59904909 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
13 0.59872293 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
14 0.59795749 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
15 0.59421706 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
16 0.59111488 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
17 0.59092981 109 nips-2005-Learning Cue-Invariant Visual Responses
18 0.58976281 23 nips-2005-An Application of Markov Random Fields to Range Sensing
19 0.58779776 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks
20 0.58749413 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex