nips nips2002 nips2002-187 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lavi Shpigelman, Yoram Singer, Rony Paz, Eilon Vaadia
Abstract: Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this paper is the construction of biologically-motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach using the Spikernel and various standard kernels for the task of predicting hand movement velocities from cortical recordings. In all of our experiments all the kernels we tested outperform the standard scalar product used in regression with the Spikernel consistently achieving the best performance. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. [sent-8, score-0.209]
2 The focus of this paper is the construction of biologically-motivated kernels for cortical activities. [sent-9, score-0.401]
3 The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. [sent-10, score-0.337]
4 We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. [sent-11, score-0.287]
5 We demonstrate the merits of our modeling approach using the Spikernel and various standard kernels for the task of predicting hand movement velocities from cortical recordings. [sent-12, score-0.786]
6 In all of our experiments, all the kernels we tested outperform the standard scalar product used in regression, with the Spikernel consistently achieving the best performance. [sent-13, score-0.307]
7 1 Introduction Neuronal activity in primary motor cortex (MI) during multi-joint arm reaching movements in 2D and 3-D [1, 2] and drawing movements [3] has been used extensively as a test bed for gaining understanding of neural computations in the brain. [sent-14, score-0.66]
8 The tuning curve approach models the average firing rate of a cortical unit as a function of some external variable, like the frequency of an auditory stimulus or the direction of a planned movement. [sent-16, score-0.465]
9 Many studies of motor cortical areas [4, 2, 5, 3, 6] showed that while single units are broadly tuned to movement direction, a relatively small population of cells (tens to hundreds) carries enough information to allow for accurate prediction. [sent-17, score-0.951]
10 Such broad tuning can be found in many parts of the nervous system, suggesting that computation by distributed populations of cells is a general cortical feature. [sent-18, score-0.37]
11 The population-vector method [4, 2] describes each cell’s firing rate as the dot product between that cell’s preferred direction and the direction of hand movement. [sent-19, score-0.161]
12 The vector sum of preferred directions, weighted by the measured firing rates, is used both as a way of understanding what the cortical units encode and as a means for estimating the velocity vector. [sent-20, score-0.474]
13 Some studies [10, 11, 12] support the notion that neurons can associate and dissociate rapidly to functional groups in the process of performing a computational task. [sent-22, score-0.125]
14 The concepts of simultaneous encoding of multiple parameters and dynamic representation in neuronal populations could together explain some of the conundrums in motor system physiology. [sent-23, score-0.292]
15 Advances in computing power and recent developments of physiological recording methods allow recording of ever growing numbers of cortical units that can be used for real-time analysis and modeling. [sent-25, score-0.531]
16 Current attempts at predicting movement from cortical activity rely on modeling techniques such as cosine-tuning estimation (pop. vector) [18], linear regression [15, 19] and artificial neural nets [15] (though this study reports getting better results by linear regression). [sent-27, score-0.665] [sent-28, score-0.12]
18 A major deficiency of standard approaches is their poor ability to extract the relevant information from monitored brain activity in an efficient manner, which would allow reducing the number of recorded channels and the recording time. [sent-29, score-0.249]
19 In Sec. 3 we introduce and explain the main mathematical tool that we use, namely the kernel operator. [sent-34, score-0.22]
20 In Sec. 4 we discuss the design and implementation of a biologically-motivated kernel for neural activities. [sent-36, score-0.22]
21 2 Problem setting Consider the case where we monitor instantaneous spike rates from cortical units during physical motor behavior of a subject. [sent-40, score-0.844]
22 Our goal is to learn a predictive model of some behavior parameter with the cortical activity as the input. [sent-41, score-0.359]
23 Formally speaking, the input is a sequence of instantaneous firing rates from the recorded cortical units, consisting of a number of samples altogether. [sent-42, score-0.518]
24 We use dedicated notation for sequences of firing rates and for the length of a sequence. [sent-43, score-0.212]
25 Some parameter of the movement is the quantity that we would like to predict (e.g., the hand movement velocity). [sent-56, score-0.267]
26 Our goal is to learn an approximation mapping from the neural firing rates to the movement parameter. [sent-59, score-0.291]
27 In general, information about movement can be found in neural activity both before and after the time of movement itself. [sent-60, score-0.555]
28 3 Kernel methods for regression A major mathematical notion employed in this paper is the kernel operator. [sent-64, score-0.34]
29 Kernel operators allow algorithms whose interface to the data is limited to scalar products to employ complicated pre-mappings of the data into feature spaces by use of kernels. [sent-65, score-0.211]
30 Formally, a kernel is an inner-product operator defined with respect to some arbitrary vector space. [sent-66, score-0.257]
31 Given a kernel operator we can use it to perform various statistical learning tasks. [sent-68, score-0.257]
32 One such task is support vector regression (SVR) [20] which attempts to find a regression function for target values that is linear if observed in the (typically very large) feature space mapped by the kernel. [sent-69, score-0.316]
33 To estimate a linear (linear in the feature space) regression function f(x) = w·φ(x) + b implemented by the kernel with precision ε, one minimizes (1/2)‖w‖² + C Σᵢ (ξᵢ + ξᵢ*) subject to yᵢ − w·φ(xᵢ) − b ≤ ε + ξᵢ, w·φ(xᵢ) + b − yᵢ ≤ ε + ξᵢ*, and ξᵢ, ξᵢ* ≥ 0 (the standard ε-insensitive SVR objective). [sent-75, score-0.416]
34 By switching to the dual of this optimization problem, it is possible to incorporate the kernel function, achieving a mapping that may not be feasible by explicitly calculating the (possibly infinite-dimensional) feature vectors. [sent-81, score-0.296]
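As a minimal sketch of how such a kernel-based SVR could be set up in practice, the snippet below passes a precomputed Gram matrix to an off-the-shelf solver; scikit-learn's SVR with a precomputed kernel, the toy data, and the linear kernel baseline are assumptions of this sketch rather than the paper's implementation.

```python
# Minimal sketch: epsilon-SVR with a user-supplied (precomputed) kernel matrix.
# scikit-learn and the toy data below are assumptions of this sketch; the paper
# does not prescribe a particular SVR implementation.
import numpy as np
from sklearn.svm import SVR

def linear_kernel(X, Z):
    """Standard scalar product between row vectors (the regression baseline)."""
    return X @ Z.T

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 40))                   # e.g. flattened spike-count windows
y_train = X_train[:, 0] + 0.1 * rng.normal(size=100)   # toy target (e.g. one velocity component)
X_test = rng.normal(size=(20, 40))

K_train = linear_kernel(X_train, X_train)   # Gram matrix on the training windows
K_test = linear_kernel(X_test, X_train)     # kernel between test and training windows

svr = SVR(kernel="precomputed", C=1.0, epsilon=0.1)
svr.fit(K_train, y_train)
y_pred = svr.predict(K_test)
```

Any kernel, including the Spikernel, could be substituted for linear_kernel here; only Gram matrices between training and test windows are ever needed, so the (possibly infinite-dimensional) feature vectors are never formed explicitly.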
35 4 Spikernels The quality of SVM learning is highly dependent on how the data is embedded in the feature space via the kernel operator. [sent-83, score-0.296]
36 For this reason, several studies have been devoted lately to developing new kernels [22, 23, 24]. [sent-84, score-0.167]
37 In fact, for classification problems, a good kernel would render the work of the classification algorithm trivial. [sent-85, score-0.22]
38 With this in mind, we develop a kernel for neural spiking activity. [sent-86, score-0.22]
39 1 Motivation Our goal in developing a kernel for spike trains is to map similar patterns to nearby areas of the feature space. [sent-88, score-0.585]
40 Current methods for predicting response variables from neural activities use standard linear regression techniques (see for instance [15]) or even replace the time pattern with mean firing rates. [sent-89, score-0.194]
41 In the description of our kernel we attempt to capture some well-accepted notions of similarity between spike trains. [sent-92, score-0.389]
42 We make the following assumptions regarding similarities between spike patterns (Figure 1: Illustrative examples of pattern similarities; three panels compare a Pattern A and a Pattern B as firing rate versus time, with the time of interest marked). [sent-93, score-0.169]
43 Middle: patterns with large bin-by-bin differences that can be eliminated with some time warping. [sent-95, score-0.12]
44 Right: patterns whose suffix (time of interest) is similar and prefix is different. [sent-96, score-0.12]
45 The most commonly made assumption is that similar firing patterns may have small differences in a bin-by-bin comparison. [sent-97, score-0.12]
46 This type of variation is due to inherent noise of any physical system, but also to responses to external factors that were not recorded and are not directly related to the task performed. [sent-98, score-0.149]
47 In Fig. 1 we show an example of two patterns that are bin-wise similar though clearly not identical. [sent-100, score-0.12]
48 A cortical population may display highly specific patterns to represent specific information. [sent-101, score-0.471]
49 It is conceivable that some features of external stimuli are represented by population dynamics that would be best described as ’temporal’ coding. [sent-102, score-0.166]
50 Two patterns may be quite different in a simple bin-wise comparison but if they are aligned by some non-linear time distortion or shifting, the similarity becomes apparent. [sent-103, score-0.12]
51 An illustration of such patterns is given in the middle plots of Fig. 1. [sent-104, score-0.12]
52 In comparing patterns we would like to induce a higher score when the time-shifts are small. [sent-106, score-0.12]
53 Patterns that are associated with identical values of an external stimulus at a given time may be similar at that time but very different at earlier times, when the values of the external stimulus for these patterns are no longer similar (as illustrated on the right-hand side of Fig. 1). [sent-107, score-0.286]
54 2 Kernel definition We describe the kernel by specifying the features that make up the feature space. [sent-110, score-0.296]
55 Patterns that are piece-wise similar to a given feature pattern contribute to that feature coordinate with a weight that decays as the sample-by-sample differences between the sequences grow large. [sent-130, score-0.19]
56 3 Efficient kernel calculation The definition of the feature space given above does not translate directly into an efficient procedure for computing the kernel. [sent-134, score-0.259]
57 Based on ideas from [24] we developed an indirect method for evaluating the kernel through a recursion which can be performed efficiently using dynamic programming. [sent-137, score-0.287]
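The exact Spikernel recursion is garbled in this extraction; as a hedged illustration of the kind of dynamic program involved, the sketch below evaluates a gap-weighted subsequence kernel over sequences of population spike-count vectors, in the spirit of the string-kernel recursion that [24] builds on. The Gaussian bin similarity, the decay factor lam, and the subsequence length n are illustrative choices, not the paper's parameters.

```python
# Gap-weighted subsequence kernel over spike-count sequences, computed by
# dynamic programming (a sketch in the spirit of [24], not the exact Spikernel).
import numpy as np

def bin_similarity(u, v, sigma=1.0):
    # Similarity between two population spike-count vectors (one time bin each);
    # the Gaussian form and sigma are illustrative assumptions.
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def subsequence_kernel(S, T, n=3, lam=0.5, sigma=1.0):
    """S, T: arrays of shape (length, q) holding q simultaneously recorded units.

    n is the subsequence length and lam the decay penalizing gaps (time warping).
    """
    ls, lt = len(S), len(T)
    # Kp[i, a, b] corresponds to K'_i(S[:a], T[:b]) in the string-kernel notation.
    Kp = np.zeros((n, ls + 1, lt + 1))
    Kp[0] = 1.0
    for i in range(1, n):
        Kpp = np.zeros((ls + 1, lt + 1))  # auxiliary K''_i(S[:a], T[:b])
        for a in range(i, ls + 1):
            for b in range(1, lt + 1):
                Kpp[a, b] = (lam * Kpp[a, b - 1]
                             + lam ** 2 * bin_similarity(S[a - 1], T[b - 1], sigma)
                             * Kp[i - 1, a - 1, b - 1])
                Kp[i, a, b] = lam * Kp[i, a - 1, b] + Kpp[a, b]
    value = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            value += (lam ** 2 * bin_similarity(S[a - 1], T[b - 1], sigma)
                      * Kp[n - 1, a - 1, b - 1])
    return value

# Toy usage: two short sequences of 4-unit spike counts.
rng = np.random.default_rng(0)
S = rng.poisson(3.0, size=(8, 4)).astype(float)
T = rng.poisson(3.0, size=(10, 4)).astype(float)
print(subsequence_kernel(S, T))
```

The recursion touches each pair of time bins once per subsequence length, so a single kernel evaluation costs on the order of n·|S|·|T| bin comparisons rather than an explicit sum over the combinatorially many subsequence features.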
58 Since a linear combination of kernels is also a kernel, we can define our kernel to be a weighted sum of such kernels. [sent-176, score-0.353]
59 The kernel summation can be interpreted as a concatenation of the feature vectors that these kernels represent. [sent-177, score-0.483]
60 Weighted summation corresponds to concatenating the feature vectors after first multiplying them by the square roots of the weights. [sent-178, score-0.13]
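A small numeric check of this statement, using two toy explicit feature maps (a linear map and the degree-2 monomials of a homogeneous polynomial kernel, both assumptions of this sketch): the weighted sum of the two kernels equals the kernel obtained by concatenating the feature maps after scaling each by the square root of its weight.

```python
# Weighted kernel sum == kernel of concatenated, square-root-scaled feature maps.
import numpy as np

rng = np.random.default_rng(1)
x, z = rng.normal(size=5), rng.normal(size=5)

phi1 = lambda v: v                        # linear feature map
phi2 = lambda v: np.outer(v, v).ravel()   # degree-2 monomials: <phi2(x), phi2(z)> = (x @ z) ** 2
w1, w2 = 0.3, 1.7

k_sum = w1 * (phi1(x) @ phi1(z)) + w2 * (phi2(x) @ phi2(z))

phi_cat = lambda v: np.concatenate([np.sqrt(w1) * phi1(v), np.sqrt(w2) * phi2(v)])
k_cat = phi_cat(x) @ phi_cat(z)

assert np.isclose(k_sum, k_cat)
```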
61 Different choices of the term comparing two rate values result in kernels that differ in the way those rates are compared. [sent-179, score-0.168]
62 Say we assign it to be the squared norm of the difference between the two rate values; the resulting integral in the kernel recursion introduces a gain. [sent-180, score-0.287]
63 This gain results in a kernel whose computation is numerically unstable. [sent-183, score-0.22]
64 5 Experimental results Data collection: The data used in this work was recorded from the primary motor cortex of a rhesus (Macaca mulatta) monkey. [sent-188, score-0.449]
65 The monkey sat in a dark chamber, and 8 electrodes were introduced into each hemisphere. [sent-192, score-0.126]
66 The data used in this report includes 31 single units and 16 multi-unit channels (MUA) that were recorded in one session by 16 microelectrodes. [sent-194, score-0.192]
67 The monkey used two planar-movement manipulanda to control two cursors (X and + shapes) on the screen to perform a center-out task. [sent-195, score-0.177]
68 Each trial began when the monkey centered both cursors on a central circle. [sent-196, score-0.177]
69 Then one of eight targets appeared at a distance of 4 cm from the origin, and the monkey had to move and reach the target in less than 2 s to receive a liquid reward. [sent-202, score-0.126]
70 At the end of each session, we examined the activity of neurons evoked by passive manipulation of the limbs and applied intracortical microstimulation (ICMS) to evoke movements. [sent-203, score-0.221]
71 The data presented here was recorded in penetration sites where ICMS evoked shoulder and elbow movements. [sent-204, score-0.164]
72 Data preprocessing and modeling: The movements and spike data were preprocessed to create a labeled corpus. [sent-207, score-0.286]
73 We used only the data from trials on which the monkey succeeded in the movement task and examined only the right hand movements. [sent-208, score-0.398]
74 We partitioned the movement and spike trains into fixed-width bins to get the spike counts and average hand movement velocities in each segment. [sent-209, score-0.949]
75 We then normalized the spike counts to achieve a zero mean and a unit variance for each cortical unit; the labeled example for each time step consisted of the spike counts as the input and the hand movement velocity as the target. [sent-210, score-0.57]
76 Specifically, we used a window (e.g., 10 segments) of spike counts from all of the recorded cortical units as the input sequence. [sent-213, score-0.635]
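A sketch of this preprocessing pipeline: bin each unit's spike times into fixed-width bins, z-score the counts per unit, and collect a sliding window of the most recent bins from all units as the input sequence for each prediction step. The bin width, window length, and random toy spike trains below are placeholders, not the paper's exact values.

```python
# Sketch of the preprocessing: binning, per-unit z-scoring, sliding windows.
import numpy as np

def bin_spike_counts(spike_times, n_bins, bin_width):
    """spike_times: list (one entry per unit) of arrays of spike times in seconds."""
    edges = np.arange(n_bins + 1) * bin_width
    return np.stack([np.histogram(t, bins=edges)[0] for t in spike_times])  # (units, bins)

def zscore_per_unit(counts):
    mu = counts.mean(axis=1, keepdims=True)
    sd = counts.std(axis=1, keepdims=True) + 1e-8
    return (counts - mu) / sd

def sliding_windows(counts, window):
    """Yield one (window, units) spike-count sequence per prediction time step."""
    units, bins = counts.shape
    for t in range(window, bins + 1):
        yield counts[:, t - window:t].T

# Toy usage: 47 channels (31 single units + 16 multi-unit channels, as above),
# 60 s of random spikes, a placeholder 100 ms bin width, and a 10-bin window.
rng = np.random.default_rng(2)
spikes = [np.sort(rng.uniform(0.0, 60.0, size=rng.integers(100, 400))) for _ in range(47)]
counts = zscore_per_unit(bin_spike_counts(spikes, n_bins=600, bin_width=0.1))
windows = list(sliding_windows(counts, window=10))
```

Each element of windows, paired with the corresponding binned hand velocity, would form one labeled example for the regression.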
77 In our experiments the number of cortical units was held fixed; hence the matrix of spike counts for each example is of size (number of units) by (window length). [sent-214, score-0.35]
78 Each kernel employs a few parameters, and the SVM regression setup requires setting two more parameters (the trade-off constant C and the precision ε). [sent-215, score-0.577]
79 Overall we had a limited amount of clean cortical recordings, of which we used the first portion as our validation set for tuning the parameters. [sent-219, score-0.304]
80 The kernels that we tested are the exponential kernel, the homogeneous polynomial kernel, the standard scalar product kernel (which boils down to a linear regression), and the Spikernel. [sent-221, score-0.847]
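Hedged definitions of the baseline kernels in this comparison are sketched below; the exact parameterizations (polynomial degree, the width of the exponential kernel, and indeed whether the paper's exponential kernel is exp(x·z/σ²) or a Gaussian form) are tuned on the validation set and are not recoverable from this extraction, so the forms shown are common choices only.

```python
# Common forms of the baseline kernels (illustrative parameter defaults).
import numpy as np

def scalar_product_kernel(x, z):
    # Plugging this into SVR reduces to linear regression on the inputs.
    return float(x @ z)

def homogeneous_polynomial_kernel(x, z, degree=2):
    return float(x @ z) ** degree

def exponential_kernel(x, z, sigma=1.0):
    # One common reading of an "exponential" kernel; the paper's exact form is garbled here.
    return float(np.exp((x @ z) / sigma ** 2))
```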
81 We found that predicting the up-down velocity signal was more difficult than predicting the left-right signal. [sent-230, score-0.148]
82 This may be the result of sampling a population of cortical units that are tuned more to the left-right directions. [sent-231, score-0.433]
83 The linear regression method (scalar-product kernel) came in last. [sent-234, score-0.12]
84 It seems that re-mapping the data, whether by standard kernels or by the Spikernel, allows for better prediction models. [sent-235, score-0.18]
85 The ordering of the kernels by their mean score is consistent when looking at per-test results, except for the exponential kernel, which is outperformed by linear regression in some of the tests. [sent-236, score-0.473]
86 B – Mean correlation coefficient values for each kernel type: the Spikernel outperforms the other kernels in all folds. [sent-270, score-0.264]
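As a small sketch of the evaluation measure referred to above, the correlation coefficient between recorded and predicted velocity traces on held-out data can be computed as follows (variable names and values are illustrative):

```python
# Correlation coefficient between recorded and predicted velocity traces.
import numpy as np

def correlation_coefficient(y_true, y_pred):
    return np.corrcoef(y_true, y_pred)[0, 1]

recorded = np.array([0.10, 0.40, -0.20, 0.30, 0.00])
predicted = np.array([0.15, 0.35, -0.10, 0.25, 0.05])
print(correlation_coefficient(recorded, predicted))
```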
87 On the data we collected, all the kernels we devised outperform the standard scalar product that is used in linear regression. [sent-272, score-0.187]
88 Furthermore, the Spikernel, a biologically motivated kernel operator for spike counts, outperforms all the other kernels. [sent-273, score-0.494]
89 Primate motor cortex and free arm movements to visual targets in three-dimensional space. [sent-286, score-0.452]
90 Spatial coding of movements: A hypothesis concerning the coding of movement direction by motor cortical populations. [sent-298, score-1.066]
91 Motor cortical representation of speed and direction during reaching. [sent-303, score-0.311]
92 Cortical ensemble activity increasingly predicts behavior outcomes during learning of a motor task. [sent-308, score-0.296]
93 Relationship of cerebellar Purkinje cell simple spike discharge to movement kinematics in the monkey. [sent-311, score-0.442]
94 On the relationship between joint angular velocity and motor cortical discharge during reaching. [sent-320, score-0.579]
95 Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. [sent-330, score-0.265]
96 Spike synchronization and rate modulation differentially involved in motor cortical function. [sent-343, score-0.508]
97 Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. [sent-346, score-0.405]
98 Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. [sent-361, score-0.364]
99 Work toward real-time control of a cortical neural prosthesis. [sent-367, score-0.319]
100 Classes of kernels for machine learning: A statistical perspective. [sent-397, score-0.133]
wordName wordTfidf (topN-words)
[('spikernel', 0.384), ('cortical', 0.268), ('movement', 0.232), ('kernel', 0.22), ('motor', 0.205), ('spike', 0.169), ('qp', 0.16), ('kernels', 0.133), ('ring', 0.133), ('monkey', 0.126), ('patterns', 0.12), ('regression', 0.12), ('spikernels', 0.118), ('movements', 0.117), ('fec', 0.094), ('tb', 0.094), ('activity', 0.091), ('neuronal', 0.087), ('population', 0.083), ('external', 0.083), ('bx', 0.082), ('units', 0.082), ('arm', 0.078), ('laubach', 0.077), ('nicolelis', 0.077), ('feature', 0.076), ('predicting', 0.074), ('vaadia', 0.07), ('counts', 0.068), ('recursive', 0.068), ('recursion', 0.067), ('recorded', 0.066), ('miguel', 0.066), ('populations', 0.066), ('velocity', 0.065), ('andrew', 0.065), ('instantaneous', 0.061), ('rates', 0.059), ('georgopoulus', 0.059), ('ggg', 0.059), ('icms', 0.059), ('johan', 0.059), ('patterna', 0.059), ('patternb', 0.059), ('penetration', 0.059), ('schwartz', 0.059), ('neurons', 0.056), ('scalar', 0.054), ('concatenation', 0.054), ('cortex', 0.052), ('bergman', 0.051), ('svr', 0.051), ('chapin', 0.051), ('cursors', 0.051), ('kettner', 0.051), ('toward', 0.051), ('neurophysiology', 0.05), ('care', 0.049), ('sequence', 0.048), ('db', 0.048), ('da', 0.047), ('vx', 0.047), ('moran', 0.047), ('allow', 0.047), ('nature', 0.045), ('recording', 0.045), ('ev', 0.044), ('session', 0.044), ('ia', 0.044), ('coefficient', 0.044), ('yd', 0.044), ('plug', 0.044), ('developments', 0.044), ('direction', 0.043), ('coding', 0.043), ('contribute', 0.042), ('discharge', 0.041), ('hand', 0.04), ('evoked', 0.039), ('lodhi', 0.039), ('velocities', 0.039), ('calculation', 0.039), ('scatter', 0.037), ('manipulations', 0.037), ('av', 0.037), ('operator', 0.037), ('coordinate', 0.037), ('daniel', 0.036), ('tuning', 0.036), ('denote', 0.035), ('rate', 0.035), ('sequences', 0.035), ('groups', 0.035), ('coded', 0.035), ('manipulation', 0.035), ('studies', 0.034), ('operators', 0.034), ('algebraic', 0.034), ('mark', 0.034), ('pre', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
Author: Lavi Shpigelman, Yoram Singer, Rony Paz, Eilon Vaadia
Abstract: Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this paper is the construction of biologically-motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach using the Spikernel and various standard kernels for the task of predicting hand movement velocities from cortical recordings. In all of our experiments all the kernels we tested outperform the standard scalar product used in regression with the Spikernel consistently achieving the best performance. 1
2 0.23964022 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter
Author: W Wu, M. J. Black, Y. Gao, M. Serruya, A. Shaikhouni, J. P. Donoghue, Elie Bienenstock
Abstract: The direct neural control of external devices such as computer displays or prosthetic limbs requires the accurate decoding of neural activity representing continuous movement. We develop a real-time control system using the spiking activity of approximately 40 neurons recorded with an electrode array implanted in the arm area of primary motor cortex. In contrast to previous work, we develop a control-theoretic approach that explicitly models the motion of the hand and the probabilistic relationship between this motion and the mean firing rates of the cells in 70 bins. We focus on a realistic cursor control task in which the subject must move a cursor to “hit” randomly placed targets on a computer monitor. Encoding and decoding of the neural data is achieved with a Kalman filter which has a number of advantages over previous linear filtering techniques. In particular, the Kalman filter reconstructions of hand trajectories in off-line experiments are more accurate than previously reported results and the model provides insights into the nature of the neural coding of movement. ¨ ©§
3 0.1814855 120 nips-2002-Kernel Design Using Boosting
Author: Koby Crammer, Joseph Keshet, Yoram Singer
Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → . An explicit way to describe K is via a mapping φ : X → H from X to an inner-products space H such that K(x, x ) = φ(x)·φ(x ). Given a kernel operator and a finite set of instances S = {xi , yi }m , the kernel i=1 matrix (a.k.a the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki,j = K(xi , xj ). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix. The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denote Kj . The goal thereafter is to find a weighted combination of kernels ˆ K(x, x ) = j αj Kj (x, x ) that is similar, in a sense that will be defined shortly, to ˆ the target kernel, K ∼ K . Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices < K, K >F = m K(xi , xj )K (xi , xj ). 
Given this definition, they defined the i,j=1 kernel-similarity, or alignment, to be the above inner-product normalized by the norm of ˆ ˆ ˆ ˆ ˆ each kernel, A(S, K, K ) = < K, K >F / < K, K >F < K , K >F , where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “flattened” into a vector of dimension m2 . Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi ∈ {−1, +1} is the label of the instance xi , Cristianini et al. used the outer-product of y as the the target kernel, ˆ K = yy T . Therefore, an optimal alignment is achieved if K(xi , xj ) = yi yj . Clearly, if such a kernel is used for classifying instances from X , then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting, f (x) = sign(y i K(xi , x)) where (xi , yi ) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that the sign(K(x i , xj )) is equal to yi yj but its magnitude, |K(xi , xj )|, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the products of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi , yi ) | xi ∈ X , yi ∈ {−1, +1}, i = 1, . . . , m} . We are also given a set of unlabelled m ˜ ˜ examples S = {˜i }i=1 . If such a set is not provided we can simply use the labelled inx ˜ ˜ stances (without the labels themselves) as the set S. The set S is used for constructing the ˆ primitive kernels that are combined to constitute the learned kernel K. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned ˆ kernel K. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. 
[8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empir- ˜ ˜ Input: Labelled and unlabelled sets of examples: S = {(xi , yi )}m ; S = {˜i }m x i=1 i=1 Initialize: K ← 0 (all zeros matrix) For t = 1, 2, . . . , T : • Calculate distribution over pairs 1 ≤ i, j ≤ m: Dt (i, j) = exp(−yi yj K(xi , xj )) 1/(1 + exp(−yi yj K(xi , xj ))) ExpLoss LogLoss ˜ • Call base-kernel-learner with (Dt , S, S) and receive Kt • Calculate: + − St = {(i, j) | yi yj Kt (xi , xj ) > 0} ; St = {(i, j) | yi yj Kt (xi , xj ) < 0} + Wt = (i,j)∈S + Dt (i, j)|Kt (xi , xj )| ; Wt− = (i,j)∈S − Dt (i, j)|Kt (xi , xj )| t t 1 2 + Wt − Wt • Set: αt = ln ; K ← K + α t Kt . Return: kernel operator K : X × X → Figure 1: The skeleton of the boosting algorithm for kernels. ical classification error. Given the prediction of a classifier f on an instance x and a label y ∈ {−1, +1} the ExpLoss and the LogLoss are defined as, ExpLoss(f (x), y) = exp(−yf (x)) LogLoss(f (x), y) = log(1 + exp(−yf (x))) . Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f (x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strongT hypothesis is of the form f (x) = t=1 αt ht (x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good of weights αt . Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1 , x2 )) = y1 y2 . Furthermore, we would like to make the magnitude of K(x, x ) to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x 1 , y1 ) and (x2 , y2 ), ExpLoss(K(x1 , x2 ), y1 y2 ) = exp(−y1 y2 K(x1 , x2 )) LogLoss(K(x1 , x2 ), y1 y2 ) = log(1 + exp(−y1 y2 K(x1 , x2 ))) . Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances. The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification algorithm. The pseudo code of the booster is given in Fig. 1. The pseudo-code is an adaptation the to problem of kernel design of the sequentialupdate algorithm from [8]. As with other boosting algorithm, the base-learner, which in our case is charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels. 
3 Learning Base Kernels The base kernel learner is provided with a training set S and a distribution D t over a pairs ˜ of instances from the training set. It is also provided with a set of unlabelled examples S. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar products over vectors in n . We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products Kw = wwT between a vector w ∈ n and itself. Using this definition we get, Kw (xi , xj ) = (xi ·w)(xj ·w) . Input: A distribution Dt . Labelled and unlabelled sets: ˜ ˜ Therefore, the similarity beS = {(xi , yi )}m ; S = {˜i }m . x i=1 i=1 tween two instances xi and Compute : xj is high iff both xi and xj • Calculate: ˜ are similar (w.r.t the standard A ∈ m×m , Ai,r = xi · xr ˜ inner-product) to a third vecm×m B∈ , Bi,j = Dt (i, j)yi yj tor w. Analogously, if both ˜ ˜ K ∈ m×m , Kr,s = xr · xs ˜ ˜ xi and xj seem to be dissim• Find the generalized eigenvector v ∈ m for ilar to the vector w then they the problem AT BAv = λKv which attains are similar to each other. Dethe largest eigenvalue λ spite the restrictive form of • Set: w = ( r vr xr )/ ˜ ˜ r vr xr . the inner-products, this famt ily is still too rich for our setReturn: Kernel operator Kw = ww . ting and we further impose two restrictions on the inner Figure 2: The base kernel learning algorithm. products. First, we assume ˜ that w is restricted to a linear combination of vectors from S. Second, since scaling of the base kernels is performed by the boosted, we constrain the norm of w to be 1. The m ˜ resulting class of kernels is therefore, C = {Kw = wwT | w = r=1 βr xr , w = 1} . ˜ In the boosting process we need to choose a specific base-kernel K w from C. We therefore need to devise a notion of how good a candidate for base kernel is given a labelled set S and a distribution function Dt . In this work we use the simplest version suggested by Collins et al. This version can been viewed as a linear approximation on the loss function. We define the score of a kernel Kw w.r.t to the current distribution Dt to be, Score(Kw ) = Dt (i, j)yi yj Kw (xi , xj ) . (1) i,j The higher the value of the score is, the better Kw fits the training data. Note that if Dt (i, j) = 1/m2 (as is D0 ) then Score(Kw ) is proportional to the alignment since w = 1. Under mild assumptions the score can also provide a lower bound of the loss function. To see that let c be the derivative of the loss function at margin zero, c = Loss (0) . If all the √ training examples xi ∈ S lies in a ball of radius c, we get that Loss(Kw (xi , xj ), yi yj ) ≥ 1 − cKw (xi , xj )yi yj ≥ 0, and therefore, i,j Dt (i, j)Loss(Kw (xi , xj ), yi yj ) ≥ 1 − c Dt (i, j)Kw (xi , xj )yi yj . i,j Using the explicit form of Kw in the Score function (Eq. (1)) we get, Score(Kw ) = i,j D(i, j)yi yj (w·xi )(w·xj ) . Further developing the above equation using the constraint that w = m ˜ r=1 βr xr we get, ˜ Score(Kw ) = βs βr r,s i,j D(i, j)yi yj (xi · xr ) (xj · xs ) . 
˜ ˜ To compute efficiently the base kernel score without an explicit enumeration we exploit the fact that if the initial distribution D0 is symmetric (D0 (i, j) = D0 (j, i)) then all the distributions generated along the run of the boosting process, D t , are also symmetric. We ˜ now define a matrix A ∈ m×m where Ai,r = xi · xr and a symmetric matrix B ∈ m×m ˜ with Bi,j = Dt (i, j)yi yj . Simple algebraic manipulations yield that the score function can be written as the following quadratic form, Score(β) = β T (AT BA)β , where β is m dimensional column vector. Note that since B is symmetric so is A T BA. Finding a ˜ good base kernel is equivalent to finding a vector β which maximizes this quadratic form 2 m ˜ under the norm equality constraint w = ˜ 2 = β T Kβ = 1 where Kr,s = r=1 βr xr xr · xs . Finding the maximum of Score(β) subject to the norm constraint is a well known ˜ ˜ maximization problem known as the generalized eigen vector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix AT BA is positive semidefinite. Assuming that the matrix K is invertible, the the vector β which maximizes the quadratic form is proportional the eigenvector of K −1 AT BA which is associated with the m ˜ generalized largest eigenvalue. Denoting this vector by v we get that w ∝ ˜ r=1 vr xr . m ˜ m ˜ Adding the norm constraint we get that w = ( r=1 vr xr )/ ˜ vr xr . The skeleton ˜ r=1 of the algorithm for finding a base kernels is given in Fig. 3. To conclude the description of the kernel learning algorithm we describe how to the extend the algorithm to be employed with general kernel functions. Kernelizing the Kernel: As described above, we assumed that the standard scalarproduct constitutes the template for the class of base-kernels C. However, since the proce˜ dure for choosing a base kernel depends on S and S only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → , where κ(xi , xj ) = φ(xi ) · φ(xj ). Using a general kernel function κ we can not compute however the vector w explicitly. We therefore need to show that the norm of w, and evaluation Kw on any two examples can still be performed efficiently. First note that given the vector v we can compute the norm of w as follows, T w 2 = vr xr ˜ vs xr ˜ r s = vr vs κ(˜r , xs ) . x ˜ r,s Next, given two vectors xi and xj the value of their inner-product is, Kw (xi , xj ) = vr vs κ(xi , xr )κ(xj , xs ) . ˜ ˜ r,s Therefore, although we cannot compute the vector w explicitly we can still compute its norm and evaluate any of the kernels from the class C. 4 Experiments Synthetic data: We generated binary-labelled data using as input space the vectors in 100 . The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, N (µ, ∆ ∆−1 ) with the following parameters, µ=y 0.03 0.03 1 ∆= √ 2 1 −1 1 1 = 0.1 0 0 0.01 . That is, the label of each examples determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 8 0.2 6 50 50 100 100 150 150 200 200 4 2 0 0 −2 −4 −6 250 250 −0.2 −8 −0.2 0 0.2 −8 −6 −4 −2 0 2 4 6 8 300 20 40 60 80 100 120 140 160 180 200 300 20 40 60 80 100 120 140 160 180 Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). 
For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right). altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05. We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. On each experiment we first learned a linear classier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand-side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added a Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not expoed to the test set noise, the learned kernel reflects better the structure of the feature space which makes the learned kernel more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel. The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen direction (x-axis) and shows rather little deviation along the second eigen direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure since both the first and second features are mandatory for correctly classifying the data. The two plots on the right hand side of Fig. 3 use a gray level color-map to designate the value of the inner-product between each pairs instances, one from training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant the original inner-products are noisy and do not exhibit any structure. 
In contrast, the inner-products using the learned kernel yields in a 2 × 2 block matrix indicating that the inner-products between instances sharing the same label obtain large positive values. Similarly, for instances of opposite 200 1 12 Regular Kernel Learned Kernel 0.8 17 0.7 16 0.5 0.4 0.3 Test Error % 8 0.6 Regular Kernel Learned Kernel 18 10 Test Error % Averaged Cumulative Error % 19 Regular Kernel Learned Kernel 0.9 6 4 15 14 13 12 0.2 11 2 0.1 10 0 0 10 1 10 2 10 Round 3 10 4 10 0 2 4 6 Epochs 8 10 9 2 4 6 Epochs 8 10 Figure 4: The online training error (left), test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set. labels the inner products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on the average before converging to a hyperplane which perfectly separates the training data while using the Perceptron algorithm with learned kernel required a single example to reach a perfect separation on all 100 random training sets. USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7, 291 training examples and 2, 007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scopre of this short paper. We thus constraint on the binary error results for the 45 binary problem described above. For the original kernel we chose a RBF kernel with σ = 1 which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] and with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity which can underscore differences in the kernels. We ran the kernel learning al˜ gorithm with LogLoss and ExpLoss, using bith the training set and the test test as S. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss, when using the trainin set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits slower rate of convergence with the test data, we choose a a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF a kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built ˜ using boosting with the LogLoss and S was the training data. 
In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first ˜ was build using the training instances as the templates constituing S while the second used the test instances. Although the differenece between the two versions is not as significant as the difference on the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances. 6 4.5 4 5 Learned Kernel (Test) Learned Kernel (Train) 3.5 4 3 2 3 2.5 2 1.5 1 1 0.5 0 0 1 2 3 Base Kernel 4 5 6 0 0 1 2 3 Learned Kernel (Train) 4 5 Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel with training instances. Right: a similar scatter plot for a learned kernel only constructed from training instances (x-axis) and test instances. 5 Discussion In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different than the the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challanging question that stem from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note alternative approaches to the kernel design problem has been devised in parallel and independently. See [13, 14] for further details. Acknowledgements: Special thanks to Cyril Goutte and to John Show-Taylor for pointing the connection to the generalized eigen vector problem. Thanks also to the anonymous reviewers for constructive comments. References [1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. [2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. [3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. [4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for svm protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002. [5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandla. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001. [6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002. [7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000. [8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 47(2/3):253–285, 2002. [9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. [10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985. [11] F. Rosenblatt. 
The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. [12] B. Sch¨ lkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. M¨ ller, G. R¨ tsch, and A.J. Smola. Input o u a space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999. [13] O. Bosquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002. [14] C.S. Ong, A.J. Smola, and R.C. Williamson. Superkenels. NIPS, 2002.
4 0.17964296 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
Author: Emanuel Todorov, Michael I. Jordan
Abstract: Behavioral goals are achieved reliably and repeatedly with movements rarely reproducible in their detail. Here we offer an explanation: we show that not only are variability and goal achievement compatible, but indeed that allowing variability in redundant dimensions is the optimal control strategy in the face of uncertainty. The optimal feedback control laws for typical motor tasks obey a “minimal intervention” principle: deviations from the average trajectory are only corrected when they interfere with the task goals. The resulting behavior exhibits task-constrained variability, as well as synergetic coupling among actuators—which is another unexplained empirical phenomenon.
5 0.1762421 43 nips-2002-Binary Coding in Auditory Cortex
Author: Michael R. Deweese, Anthony M. Zador
Abstract: Cortical neurons have been reported to use both rate and temporal codes. Here we describe a novel mode in which each neuron generates exactly 0 or 1 action potentials, but not more, in response to a stimulus. We used cell-attached recording, which ensured single-unit isolation, to record responses in rat auditory cortex to brief tone pips. Surprisingly, the majority of neurons exhibited binary behavior with few multi-spike responses; several dramatic examples consisted of exactly one spike on 100% of trials, with no trial-to-trial variability in spike count. Many neurons were tuned to stimulus frequency. Since individual trials yielded at most one spike for most neurons, the information about stimulus frequency was encoded in the population, and would not have been accessible to later stages of processing that only had access to the activity of a single unit. These binary units allow a more efficient population code than is possible with conventional rate coding units, and are consistent with a model of cortical processing in which synchronous packets of spikes propagate stably from one neuronal population to the next. 1 Binary coding in auditory cortex We recorded responses of neurons in the auditory cortex of anesthetized rats to pure-tone pips of different frequencies [1, 2]. Each pip was presented repeatedly, allowing us to assess the variability of the neural response to multiple presentations of each stimulus. We first recorded multi-unit activity with conventional tungsten electrodes (Fig. 1a). The number of spikes in response to each pip fluctuated markedly from one trial to the next (Fig. 1e), as though governed by a random mechanism such as that generating the ticks of a Geiger counter. Highly variable responses such as these, which are at least as variable as a Poisson process, are the norm in the cortex [3-7], and have contributed to the widely held view that cortical spike trains are so noisy that only the average firing rate can be used to encode stimuli. Because we were recording the activity of an unknown number of neurons, we could not be sure whether the strong trial-to-trial fluctuations reflected the underlying variability of the single units. We therefore used an alternative technique, cell- a b Single-unit recording method 5mV Multi-unit 1sec Raw cellattached voltage 10 kHz c Single-unit . . . . .. .. ... . . .... . ... . Identified spikes Threshold e 28 kHz d Single-unit 80 120 160 200 Time (msec) N = 29 tones 3 2 1 Poisson N = 11 tones ry 40 4 na bi 38 kHz 0 Response variance/mean (spikes/trial) High-pass filtered 0 0 1 2 3 Mean response (spikes/trial) Figure 1: Multi-unit spiking activity was highly variable, but single units obeyed binomial statistics. a Multi-unit spike rasters from a conventional tungsten electrode recording showed high trial-to-trial variability in response to ten repetitions of the same 50 msec pure tone stimulus (bottom). Darker hash marks indicate spike times within the response period, which were used in the variability analysis. b Spikes recorded in cell-attached mode were easily identified from the raw voltage trace (top) by applying a high-pass filter (bottom) and thresholding (dark gray line). Spike times (black squares) were assigned to the peaks of suprathreshold segments. c Spike rasters from a cell-attached recording of single-unit responses to 25 repetitions of the same tone consisted of exactly one well-timed spike per trial (latency standard deviation = 1.0 msec), unlike the multi-unit responses (Fig. 1a). 
Under the Poisson assumption, this would have been highly unlikely (P ~ 10 -11). d The same neuron as in Fig. 1c responds with lower probability to repeated presentations of a different tone, but there are still no multi-spike responses. e We quantified response variability for each tone by dividing the variance in spike count by the mean spike count across all trials for that tone. Response variability for multi-unit tungsten recording (open triangles) was high for each of the 29 tones (out of 32) that elicited at least one spike on one trial. All but one point lie above one (horizontal gray line), which is the value produced by a Poisson process with any constant or time varying event rate. Single unit responses recorded in cell-attached mode were far less variable (filled circles). Ninety one percent (10/11) of the tones that elicited at least one spike from this neuron produced no multi-spike responses in 25 trials; the corresponding points fall on the diagonal line between (0,1) and (1,0), which provides a strict lower bound on the variability for any response set with a mean between 0 and 1. No point lies above one. attached recording with a patch pipette [8, 9], in order to ensure single unit isolation (Fig. 1b). This recording mode minimizes both of the main sources of error in spike detection: failure to detect a spike in the unit under observation (false negatives), and contamination by spikes from nearby neurons (false positives). It also differs from conventional extracellular recording methods in its selection bias: With cell- attached recording neurons are selected solely on the basis of the experimenter’s ability to form a seal, rather than on the basis of neuronal activity and responsiveness to stimuli as in conventional methods. Surprisingly, single unit responses were far more orderly than suggested by the multi-unit recordings; responses typically consisted of either 0 or 1 spikes per trial, and not more (Fig. 1c-e). In the most dramatic examples, each presentation of the same tone pip elicited exactly one spike (Fig. 1c). In most cases, however, some presentations failed to elicit a spike (Fig. 1d). Although low-variability responses have recently been observed in the cortex [10, 11] and elsewhere [12, 13], the binary behavior described here has not previously been reported for cortical neurons. a 1.4 N = 3055 response sets b 1.2 1 Poisson 28 kHz - 100 msec 0.8 0.6 0.4 0.2 0 0 ry na bi Response variance/mean (spikes/trial) The majority of the neurons (59%) in our study for which statistical significance could be assessed (at the p<0.001 significance level; see Fig. 2, caption) showed noisy binary behavior—“binary” because neurons produced either 0 or 1 spikes, and “noisy” because some stimuli elicited both single spikes and failures. In a substantial fraction of neurons, however, the responses showed more variability. We found no correlation between neuronal variability and cortical layer (inferred from the depth of the recording electrode), cortical area (inside vs. outside of area A1) or depth of anesthesia. Moreover, the binary mode of spiking was not due to the brevity (25 msec) of the stimuli; responses that were binary for short tones were comparably binary when longer (100 msec) tones were used (Fig. 2b). Not assessable Not significant Significant (p<0.001) 0.2 0.4 0.6 0.8 1 1.2 Mean response (spikes/trial) 28 kHz - 25 msec 1.4 0 40 80 120 160 Time (msec) 200 Figure 2: Half of the neuronal population exhibited binary firing behavior. 
Figure 2: Half of the neuronal population exhibited binary firing behavior. a Of the 3055 sets of responses to 25 msec tones, 2588 (gray points) could not be assessed for significance at the p<0.001 level, 225 (open circles) were not significantly binary, and 242 were significantly binary (black points; see Identification methods for group statistics below). All points were jittered slightly so that overlying points could be seen in the figure. 2165 response sets contained no multi-spike responses; the corresponding points fell on the line from (0,1) to (1,0). b The binary nature of single-unit responses was insensitive to tone duration, even for frequencies that elicited the largest responses. Twenty additional spike rasters from the same neuron (and tone frequency) as in Fig. 1c contain no multi-spike responses, whether in response to 100 msec tones (above) or 25 msec tones (below). Across the population, binary responses were as prevalent for 100 msec tones as for 25 msec tones (see Identification methods for group statistics).
In many neurons, binary responses showed high temporal precision, with latencies sometimes exhibiting standard deviations as low as 1 msec (Fig. 3; see also Fig. 1c), comparable to previous observations in the auditory cortex [14], and only slightly more precise than in monkey visual area MT [5]. High temporal precision was positively correlated with high response probability (Fig. 3).
Figure 3: Trial-to-trial variability in the latency of the response to repeated presentations of the same tone decreased with increasing response probability. a Scatter plot of the standard deviation of latency vs. mean response for 25 presentations each of 32 tones, for a different neuron than the one shown in Figs. 1 and 2 (gray line is the best linear fit). Rasters from 25 repeated presentations of a low-response tone (upper left inset, which corresponds to the left-most data point) display much more variable latencies than rasters from a high-response tone (lower right inset, which corresponds to the right-most data point). b The negative correlation between latency variability and response size was present on average across the population of 44 neurons described in Identification methods for group statistics (linear fit, gray).
The low trial-to-trial variability ruled out the possibility that the firing statistics could be accounted for by a simple rate-modulated Poisson process (Fig. 4a1,a2). In other systems, low variability has sometimes been modeled as a Poisson process followed by a post-spike refractory period [10, 12]. In our system, however, the range in latencies of evoked binary responses was often much greater than the refractory period, which could not have been longer than the 2 msec inter-spike intervals observed during epochs of spontaneous spiking, indicating that binary spiking did not result from any intrinsic property of the spike-generating mechanism (Fig. 4a3). Moreover, a single stimulus-evoked spike could suppress subsequent spikes for as long as hundreds of milliseconds (e.g., Figs. 1d, 4b), supporting the idea that binary spiking arises through a circuit-level, rather than a single-neuron, mechanism. Indeed, the fact that this suppression is observed even in the cortex of awake animals [15] suggests that binary spiking is not a special property of the anesthetized state. It seems surprising that binary spiking in the cortex has not previously been remarked upon.
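The argument above can be checked numerically: drawing surrogate rasters from a smoothed PSTH under a Poisson assumption, with or without a hard refractory period, routinely produces multi-spike trials, which is the comparison made for Fig. 4a2 and 4a3 below. The following sketch is only one plausible reading of that procedure, not the authors' code; the PSTH shape, bin size, spontaneous rate, and trial count are all invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical smoothed PSTH (spikes/s) in 1 ms bins: a low spontaneous rate plus
    # a brief evoked bump, tuned so the mean evoked count is a bit below one spike.
    dt = 0.001                                    # 1 ms bins
    t = np.arange(0.0, 0.200, dt)                 # 200 ms analysis window
    psth = 1.0 + 50.0 * np.exp(-0.5 * ((t - 0.030) / 0.005) ** 2)

    def simulate_counts(psth, dt, n_trials, refractory_s=0.0):
        """Spike counts per trial for an inhomogeneous Poisson process (Bernoulli
        approximation per bin), optionally discarding spikes that violate a hard
        refractory period."""
        counts = []
        for _ in range(n_trials):
            spikes = t[rng.random(t.size) < psth * dt]
            if refractory_s > 0.0 and spikes.size > 1:
                kept = [spikes[0]]
                for s in spikes[1:]:
                    if s - kept[-1] >= refractory_s:
                        kept.append(s)
                spikes = np.asarray(kept)
            counts.append(spikes.size)
        return np.asarray(counts)

    for label, refractory in [("Poisson", 0.0), ("Poisson + 2 ms refractory", 0.002)]:
        counts = simulate_counts(psth, dt, n_trials=25, refractory_s=refractory)
        print(f"{label}: {np.sum(counts >= 2)} of 25 trials with 2 or more spikes "
              f"(mean count {counts.mean():.2f})")

Even a handful of double-spike trials out of 25 is enough to distinguish these models from the strictly 0-or-1 responses reported here.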
In the auditory cortex the explanation may be in part technical: because firing rates in the auditory cortex tend to be low, multi-unit recording is often used to maximize the total amount of data collected. Moreover, our use of cell-attached recording minimizes the usual bias toward responsive or active neurons. Such explanations are not, however, likely to account for the failure to observe binary spiking in the visual cortex, where spike count statistics have been scrutinized more closely [3-7]. One possibility is that this reflects a fundamental difference between the auditory and visual systems. An alternative interpretation—and one that we favor—is that the difference rests not in the sensory modality, but instead in the difference between the stimuli used. In this view, the binary responses may not be limited to the auditory cortex; neurons in visual and other sensory cortices might exhibit similar responses to the appropriate stimuli. For example, the tone pips we used might be the auditory analog of a brief flash of light, rather than the oriented moving edges or gratings usually used to probe the primary visual cortex.
Figure 4: a The lack of multi-spike responses elicited by the neuron shown in Fig. 3a was not due to an absolute refractory period, since the range of latencies for many tones, like that shown here, was much greater than any reasonable estimate of the neuron's refractory period. (a1) Experimentally recorded responses. (a2) Using the smoothed post-stimulus time histogram (PSTH; bottom) from the set of responses in Fig. 4a, we generated rasters under the assumption of Poisson firing. In this representative example, four double-spike responses (arrows at left) were produced in 25 trials. (a3) We then generated rasters assuming that the neuron fired according to a Poisson process subject to a hard refractory period of 2 msec. Even with a refractory period, this representative example includes one triple- and three double-spike responses. The minimum inter-spike interval during spontaneous firing events was less than 2 msec for five of our neurons, so 2 msec is a conservative upper bound for the refractory period. b Spontaneous activity is reduced following high-probability responses. The PSTH (top; 0.25 msec bins) of the combined responses from the 25% (8/32) of tones that elicited the largest responses from the same neuron as in Figs. 3a and 4a illustrates a preclusion of spontaneous and evoked activity for over 200 msec following stimulation. The PSTHs from progressively less responsive groups of tones show progressively less preclusion following stimulation. c Fewer noisy binary neurons need to be pooled to achieve the same "signal-to-noise ratio" (SNR; see ref. [24]) as a collection of Poisson neurons. The ratio of the number of Poisson to binary neurons required to achieve the same SNR is plotted against the mean number of spikes elicited per neuron following stimulation; here we have defined the SNR to be the ratio of the mean spike count to the standard deviation of the spike count (a rough numerical sketch of this comparison follows below). d Spike probability tuning curve for the same neuron as in Figs. 1c-e and 2b, fit to a Gaussian in tone frequency.
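As a rough numerical companion to Fig. 4c: if one assumes a pool of independent, identical neurons and defines the SNR of the pooled spike count as its mean divided by its standard deviation (the definition given in the caption), then equating pooled SNRs for Poisson neurons (variance = mean) and binary neurons (Bernoulli variance = m(1 - m)) gives a pool-size ratio of 1/(1 - m) at mean spike count m per neuron. The independence assumption and this closed form are a simplification made here for illustration; the paper's own calculation follows ref. [24] and may differ in detail.

    import numpy as np

    def pooled_snr(n_neurons, mean_per_neuron, var_per_neuron):
        """SNR (mean/std) of the summed spike count of n independent, identical neurons."""
        return (n_neurons * mean_per_neuron) / np.sqrt(n_neurons * var_per_neuron)

    for m in [0.2, 0.5, 0.8, 0.95]:
        ratio = 1.0 / (1.0 - m)          # analytic pool-size ratio under the assumptions above
        n_binary = 100                   # arbitrary binary pool size for a numerical check
        n_poisson = ratio * n_binary
        snr_binary = pooled_snr(n_binary, m, m * (1.0 - m))   # Bernoulli variance
        snr_poisson = pooled_snr(n_poisson, m, m)             # Poisson variance
        print(f"mean {m:.2f} spikes/neuron: N_Poisson/N_binary = {ratio:5.2f} "
              f"(SNR check: binary {snr_binary:.2f} vs Poisson {snr_poisson:.2f})")

The qualitative behavior matches Fig. 4c: the advantage of binary neurons grows steeply as the per-neuron response probability approaches one.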
Conversely, auditory stimuli analogous to edges or gratings [16, 17] may be more likely to elicit conventional, rate-modulated Poisson responses in the auditory cortex. Indeed, there may be a continuum between binary and Poisson modes. Thus, even in conventional rate-modulated responses, the first spike is often privileged in that it carries most of the information in the spike train [5, 14, 18]. The first spike may be particularly important as a means of rapidly signaling stimulus transients.
Binary responses suggest a mode that complements conventional rate coding. In the simplest rate-coding model, a stimulus parameter (such as the frequency of a tone) governs only the rate at which a neuron generates spikes, but not the detailed positions of the spikes; the actual spike train itself is an instantiation of a random process (such as a Poisson process). By contrast, in the binomial model, the stimulus parameter (frequency) is encoded as the probability of firing (Fig. 4d).
Binary coding has implications for cortical computation. In the rate-coding model, stimulus encoding is "ergodic": a stimulus parameter can be read out either by observing the activity of one neuron for a long time, or a population for a short time. By contrast, in the binary model the stimulus value can be decoded only by observing a neuronal population, so that there is no benefit to integrating over long time periods (cf. ref. [19]). One advantage of binary encoding is that it allows the population to signal quickly; the most compact message a neuron can send is one spike [20]. Binary coding is also more efficient in the context of population coding, as quantified by the signal-to-noise ratio (Fig. 4c).
The precise organization of both spike number and spike timing that we have observed suggests that cortical activity consists, at least under some conditions, of packets of spikes synchronized across populations of neurons. Theoretical work [21-23] has shown how such packets can propagate stably from one population to the next, but only if neurons within each population fire at most one spike per packet; otherwise, the number of spikes per packet—and hence the width of each packet—grows at each propagation step. Interestingly, one prediction of stable propagation models is that spike probability should be related to timing precision, a prediction borne out by our observations (Fig. 3). The role of these packets in computation remains an open question.
2 Identification methods for group statistics
We recorded responses to 32 different 25 msec tones from each of 175 neurons from the auditory cortices of 16 Sprague-Dawley rats; each tone was repeated between 5 and 75 times (mean = 19). Thus our ensemble consisted of 32 x 175 = 5600 response sets, with between 5 and 75 samples in each set. Of these, 3055 response sets contained at least one spike on at least one trial. For each response set, we tested the hypothesis that the observed variability was significantly lower than expected under the null hypothesis of a Poisson process. The ability to assess significance depended on two parameters: the sample size (5-75) and the firing probability. Intuitively, the dependence on firing probability arises because at low firing rates most responses produce only trials with 0 or 1 spikes under both the Poisson and binary models; only at high firing rates do the two models make different predictions, since in that case the Poisson model includes many trials with 2 or even 3 spikes while the binary model generates only solitary spikes (see Fig. 4a1,a2).
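The text above does not spell out the exact test statistic, so the sketch below should be read as one plausible implementation rather than the authors' procedure: it uses a Monte Carlo comparison of the observed variance/mean ratio against surrogate response sets drawn from a Poisson distribution with the same mean. It also illustrates why assessability depends on both the number of trials and the firing probability; all example response sets are invented.

    import numpy as np

    rng = np.random.default_rng(1)

    def sub_poisson_p_value(counts, n_surrogates=100_000):
        """Monte Carlo p-value for the hypothesis that the observed spike counts are less
        variable (smaller variance/mean ratio) than a Poisson process with the same mean."""
        counts = np.asarray(counts, dtype=float)
        n, mean = counts.size, counts.mean()
        observed_ratio = counts.var() / mean
        surrogates = rng.poisson(mean, size=(n_surrogates, n))
        surr_mean = surrogates.mean(axis=1)
        valid = surr_mean > 0                      # drop the rare all-zero surrogate sets
        surr_ratio = surrogates.var(axis=1)[valid] / surr_mean[valid]
        return float(np.mean(surr_ratio <= observed_ratio))

    # One spike on every one of 25 trials: far less variable than Poisson (cf. P ~ 10^-11).
    print(sub_poisson_p_value([1] * 25))
    # The same perfectly binary behavior over only 5 trials cannot reach p < 0.001,
    # and neither can a low-probability response set.
    print(sub_poisson_p_value([1] * 5))
    print(sub_poisson_p_value([1, 0, 0, 0, 0] * 5))

With 25 trials and a mean near one spike per trial the Poisson null is easily rejected, whereas with few trials or a low firing probability even a perfectly binary response set cannot reach the p<0.001 criterion; this is the situation of the "not assessable" response sets in Fig. 2a.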
Using a stringent significance criterion of p<0.001, 467 response sets had a sufficient number of repeats to assess significance, given the observed firing probability. Of these, half (242/467 = 52%) were significantly less variable than expected by chance, roughly five hundred times the 467 x 0.001 = 0.467 response sets that would be expected to reach significance by chance under the 0.001 criterion. Seventy-two neurons had at least one response set for which significance could be assessed, and of these, 49 neurons (49/72 = 68%) had at least one significantly sub-Poisson response set. Of this population of 49 neurons, five achieved low variability through repeatable bursty behavior (e.g., every spike count was either 0 or 3, but not 1 or 2) and were excluded from further analysis. The remaining 44 neurons formed the basis for the group statistics analyses shown in Figs. 2a and 3b. Nine of these neurons were subjected to an additional protocol consisting of at least 10 presentations each of 100 msec tones and 25 msec tones at all 32 frequencies. Of the 100 msec stimulation response sets, 44 were found to be significantly sub-Poisson at the p<0.05 level, in good agreement with the 43 found to be significant among the responses to 25 msec tones.
3 Bibliography
1. Kilgard, M.P. and M.M. Merzenich, Cortical map reorganization enabled by nucleus basalis activity. Science, 1998. 279(5357): p. 1714-8.
2. Sally, S.L. and J.B. Kelly, Organization of auditory cortex in the albino rat: sound frequency. J Neurophysiol, 1988. 59(5): p. 1627-38.
3. Softky, W.R. and C. Koch, The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J Neurosci, 1993. 13(1): p. 334-50.
4. Stevens, C.F. and A.M. Zador, Input synchrony and the irregular firing of cortical neurons. Nat Neurosci, 1998. 1(3): p. 210-7.
5. Buracas, G.T., A.M. Zador, M.R. DeWeese, and T.D. Albright, Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 1998. 20(5): p. 959-69.
6. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96.
7. Tolhurst, D.J., J.A. Movshon, and A.F. Dean, The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res, 1983. 23(8): p. 775-85.
8. Otmakhov, N., A.M. Shirke, and R. Malinow, Measuring the impact of probabilistic transmission on neuronal output. Neuron, 1993. 10(6): p. 1101-11.
9. Friedrich, R.W. and G. Laurent, Dynamic optimization of odor representations by slow temporal patterning of mitral cell activity. Science, 2001. 291(5505): p. 889-94.
10. Kara, P., P. Reinagel, and R.C. Reid, Low response variability in simultaneously recorded retinal, thalamic, and cortical neurons. Neuron, 2000. 27(3): p. 635-46.
11. Gur, M., A. Beylin, and D.M. Snodderly, Response variability of neurons in primary visual cortex (V1) of alert monkeys. J Neurosci, 1997. 17(8): p. 2914-20.
12. Berry, M.J., D.K. Warland, and M. Meister, The structure and precision of retinal spike trains. Proc Natl Acad Sci U S A, 1997. 94(10): p. 5411-6.
13. de Ruyter van Steveninck, R.R., G.D. Lewen, S.P. Strong, R. Koberle, and W. Bialek, Reproducibility and variability in neural spike trains. Science, 1997. 275(5307): p. 1805-8.
14. Heil, P., Auditory cortical onset responses revisited. I. First-spike timing. J Neurophysiol, 1997. 77(5): p. 2616-41.
15. Lu, T., L. Liang, and X. Wang, Temporal and rate representations of time-varying signals in the auditory cortex of awake primates. Nat Neurosci, 2001. 4(11): p. 1131-8.
16. Kowalski, N., D.A. Depireux, and S.A. Shamma, Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra. J Neurophysiol, 1996. 76(5): p. 3503-23.
17. deCharms, R.C., D.T. Blake, and M.M. Merzenich, Optimizing sound features for cortical neurons. Science, 1998. 280(5368): p. 1439-43.
18. Panzeri, S., R.S. Petersen, S.R. Schultz, M. Lebedev, and M.E. Diamond, The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 2001. 29(3): p. 769-77.
19. Britten, K.H., M.N. Shadlen, W.T. Newsome, and J.A. Movshon, The analysis of visual motion: a comparison of neuronal and psychophysical performance. J Neurosci, 1992. 12(12): p. 4745-65.
20. Delorme, A. and S.J. Thorpe, Face identification using one spike per neuron: resistance to image degradations. Neural Netw, 2001. 14(6-7): p. 795-803.
21. Diesmann, M., M.O. Gewaltig, and A. Aertsen, Stable propagation of synchronous spiking in cortical neural networks. Nature, 1999. 402(6761): p. 529-33.
22. Marsalek, P., C. Koch, and J. Maunsell, On the relationship between synaptic input and spike output jitter in individual neurons. Proc Natl Acad Sci U S A, 1997. 94(2): p. 735-40.
23. Kistler, W.M. and W. Gerstner, Stable propagation of activity pulses in populations of spiking neurons. Neural Comput, 2002. 14: p. 987-997.
24. Zohary, E., M.N. Shadlen, and W.T. Newsome, Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 1994. 370(6485): p. 140-3.
25. Abbott, L.F. and P. Dayan, The effect of correlated variability on the accuracy of a population code. Neural Comput, 1999. 11(1): p. 91-101.
6 0.16474199 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
7 0.16108668 76 nips-2002-Dynamical Constraints on Computing with Spike Timing in the Cortex
8 0.15417539 103 nips-2002-How Linear are Auditory Cortical Responses?
9 0.15166822 156 nips-2002-On the Complexity of Learning the Kernel Matrix
10 0.13957152 55 nips-2002-Combining Features for BCI
11 0.13642691 171 nips-2002-Reconstructing Stimulus-Driven Neural Networks from Spike Times
12 0.13530904 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
13 0.13210399 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex
14 0.13172452 119 nips-2002-Kernel Dependency Estimation
15 0.13112201 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule
16 0.13002202 106 nips-2002-Hyperkernels
17 0.12960166 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
18 0.12789284 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
19 0.1270064 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
20 0.12083242 26 nips-2002-An Estimation-Theoretic Framework for the Presentation of Multiple Stimuli
topicId topicWeight
[(0, -0.331), (1, 0.135), (2, 0.126), (3, -0.175), (4, -0.076), (5, -0.127), (6, 0.094), (7, 0.099), (8, 0.099), (9, 0.102), (10, -0.079), (11, 0.089), (12, 0.099), (13, -0.031), (14, -0.095), (15, -0.035), (16, -0.023), (17, -0.015), (18, -0.128), (19, -0.006), (20, -0.054), (21, -0.059), (22, -0.044), (23, -0.008), (24, 0.024), (25, -0.044), (26, 0.169), (27, -0.148), (28, 0.053), (29, 0.032), (30, -0.145), (31, -0.008), (32, 0.058), (33, 0.025), (34, 0.01), (35, -0.018), (36, -0.062), (37, -0.118), (38, -0.017), (39, 0.106), (40, 0.021), (41, 0.094), (42, 0.009), (43, -0.079), (44, -0.074), (45, -0.007), (46, -0.032), (47, -0.03), (48, -0.036), (49, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.95577049 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
Author: Lavi Shpigelman, Yoram Singer, Rony Paz, Eilon Vaadia
Abstract: Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this paper is the construction of biologically-motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach using the Spikernel and various standard kernels for the task of predicting hand movement velocities from cortical recordings. In all of our experiments all the kernels we tested outperform the standard scalar product used in regression with the Spikernel consistently achieving the best performance. 1
2 0.72453147 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter
Author: W Wu, M. J. Black, Y. Gao, M. Serruya, A. Shaikhouni, J. P. Donoghue, Elie Bienenstock
Abstract: The direct neural control of external devices such as computer displays or prosthetic limbs requires the accurate decoding of neural activity representing continuous movement. We develop a real-time control system using the spiking activity of approximately 40 neurons recorded with an electrode array implanted in the arm area of primary motor cortex. In contrast to previous work, we develop a control-theoretic approach that explicitly models the motion of the hand and the probabilistic relationship between this motion and the mean firing rates of the cells in 70 bins. We focus on a realistic cursor control task in which the subject must move a cursor to “hit” randomly placed targets on a computer monitor. Encoding and decoding of the neural data is achieved with a Kalman filter which has a number of advantages over previous linear filtering techniques. In particular, the Kalman filter reconstructions of hand trajectories in off-line experiments are more accurate than previously reported results and the model provides insights into the nature of the neural coding of movement.
3 0.67096543 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
Author: Emanuel Todorov, Michael I. Jordan
Abstract: Behavioral goals are achieved reliably and repeatedly with movements rarely reproducible in their detail. Here we offer an explanation: we show that not only are variability and goal achievement compatible, but indeed that allowing variability in redundant dimensions is the optimal control strategy in the face of uncertainty. The optimal feedback control laws for typical motor tasks obey a “minimal intervention” principle: deviations from the average trajectory are only corrected when they interfere with the task goals. The resulting behavior exhibits task-constrained variability, as well as synergetic coupling among actuators—which is another unexplained empirical phenomenon.
4 0.53792542 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
Author: Auke J. Ijspeert, Jun Nakanishi, Stefan Schaal
Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest to learn a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly, and, due to the properties of well formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognize and classify previously learned movement skills. Evaluations in simulations and on an actual 30 degree-of-freedom humanoid robot exemplify the feasibility and robustness of our approach. 1
5 0.52864653 167 nips-2002-Rational Kernels
Author: Corinna Cortes, Patrick Haffner, Mehryar Mohri
Abstract: We introduce a general family of kernels based on weighted transducers or rational relations, rational kernels, that can be used for analysis of variable-length sequences or more generally weighted automata, in applications such as computational biology or speech recognition. We show that rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers and a general single-source shortest-distance algorithm. We also describe several general families of positive definite symmetric rational kernels. These general kernels can be combined with Support Vector Machines to form efficient and powerful techniques for spoken-dialog classification: highly complex kernels become easy to design and implement and lead to substantial improvements in the classification accuracy. We also show that the string kernels considered in applications to computational biology are all specific instances of rational kernels.
6 0.51689178 106 nips-2002-Hyperkernels
7 0.49852738 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
8 0.4868792 120 nips-2002-Kernel Design Using Boosting
9 0.4826059 156 nips-2002-On the Complexity of Learning the Kernel Matrix
10 0.48101404 43 nips-2002-Binary Coding in Auditory Cortex
11 0.47763619 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques
12 0.47239229 55 nips-2002-Combining Features for BCI
13 0.47033733 119 nips-2002-Kernel Dependency Estimation
14 0.46275771 113 nips-2002-Information Diffusion Kernels
15 0.45957562 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
16 0.4558571 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
17 0.44936588 171 nips-2002-Reconstructing Stimulus-Driven Neural Networks from Spike Times
18 0.44215578 76 nips-2002-Dynamical Constraints on Computing with Spike Timing in the Cortex
19 0.44191498 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
20 0.43981275 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
topicId topicWeight
[(3, 0.011), (11, 0.027), (23, 0.069), (42, 0.046), (54, 0.114), (55, 0.056), (57, 0.278), (64, 0.018), (67, 0.017), (68, 0.043), (74, 0.068), (92, 0.03), (98, 0.148)]
simIndex simValue paperId paperTitle
1 0.9221468 94 nips-2002-Fractional Belief Propagation
Author: Wim Wiegerinck, Tom Heskes
Abstract: We consider loopy belief propagation for approximate inference in probabilistic graphical models. A limitation of the standard algorithm is that clique marginals are computed as if there were no loops in the graph. To overcome this limitation, we introduce fractional belief propagation. Fractional belief propagation is formulated in terms of a family of approximate free energies, which includes the Bethe free energy and the naive mean-field free energy as special cases. Using the linear response correction of the clique marginals, the scale parameters can be tuned. Simulation results illustrate the potential merits of the approach.
same-paper 2 0.85408664 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
Author: Lavi Shpigelman, Yoram Singer, Rony Paz, Eilon Vaadia
Abstract: Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this paper is the construction of biologically-motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach using the Spikernel and various standard kernels for the task of predicting hand movement velocities from cortical recordings. In all of our experiments all the kernels we tested outperform the standard scalar product used in regression with the Spikernel consistently achieving the best performance. 1
3 0.76851773 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and achieves performance close to the case when each training instance is given a single correct label. In contrast, naive methods degrade rapidly as more ambiguity is introduced into the labels. 1
4 0.71518409 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
Author: Tom Heskes
Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1
5 0.63842177 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
Author: Auke J. Ijspeert, Jun Nakanishi, Stefan Schaal
Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest to learn a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly, and, due to the properties of well formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognize and classify previously learned movement skills. Evaluations in simulations and on an actual 30 degree-of-freedom humanoid robot exemplify the feasibility and robustness of our approach. 1
6 0.63442516 148 nips-2002-Morton-Style Factorial Coding of Color in Primary Visual Cortex
7 0.63388336 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution
8 0.63249344 4 nips-2002-A Differential Semantics for Jointree Algorithms
9 0.63047802 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
10 0.62869763 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
11 0.62772679 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories
12 0.61953902 43 nips-2002-Binary Coding in Auditory Cortex
13 0.61910594 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter
14 0.6181128 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
15 0.61688775 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
16 0.61041826 55 nips-2002-Combining Features for BCI
17 0.60930318 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
18 0.60651451 169 nips-2002-Real-Time Particle Filters
19 0.60612154 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
20 0.60563958 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits