nips nips2005 nips2005-129 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Frank Wood, Stefan Roth, Michael J. Black
Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. [sent-4, score-0.784]
2 No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. [sent-5, score-0.367]
3 To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. [sent-6, score-0.605]
4 The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. [sent-7, score-0.64]
5 We evaluate the method using real data recorded from a population of motor cortical neurons. [sent-8, score-0.38]
6 In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. [sent-9, score-0.675]
7 Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. [sent-11, score-0.924]
8 1 Introduction Modeling population activity is central to many problems in the analysis of neural data. [sent-12, score-0.409]
9 Current multi-electrode technology, however, allows the activity of tens or hundreds of cells to be recorded simultaneously along with with complex natural stimuli or behavior. [sent-14, score-0.357]
10 Probabilistic modeling of this data is challenging due to its high-dimensional nature and the correlated firing activity of neural populations. [sent-15, score-0.438]
11 One can view the problem as one of learning the joint probability P (s, r) of a stimulus or behavior s and the firing activity of a neural population r. [sent-16, score-0.512]
12 The neural activity may be in the form of firing rates or spike times. [sent-17, score-0.349]
13 Here we focus the latter more challenging problem of representing a multivariate probability distribution over spike times. [sent-18, score-0.174]
14 On the other hand no parametric models currently exist that capture the one-sided, skewed nature of typical correlated neural data. [sent-21, score-0.308]
15 We do, however, have sufficient data to model the marginal statistics of the data. [sent-22, score-0.179]
16 With that observation we draw on the FRAME model developed by Zhu and Mumford for image texture synthesis [1] to represent neural population activity. [sent-23, score-0.301]
17 The FRAME model represents P (s, r) in terms of its marginal histograms. [sent-24, score-0.179]
18 The joint is represented by a Gibbs model that combines functions of these marginals and we exploit the method of [2] to automatically choose the optimal marginal directions. [sent-26, score-0.347]
19 To learn the parameters of the model we exploit the technique of contrastive divergence [3, 4] which has been used previously to learn the parameters of Product-of-Experts (PoE) models [5]. [sent-27, score-0.582]
20 We observe that the FRAME model can be viewed as a Product of Experts where the experts are functions of the marginal histograms. [sent-28, score-0.396]
21 The resulting model is more flexible than the standard PoE formulation and allows us to model more complex, skewed distributions observed in neural data. [sent-29, score-0.23]
22 We train and test the model on real data recorded from a monkey performing a motor control task; details of the task and the neural data are described in the following section. [sent-30, score-0.344]
23 We learn a variety of probabilistic models including full Gaussian, independent Gaussian, product of t-distributions [4], independent non-parametric, and the FRAME model. [sent-31, score-0.144]
24 We evaluate the log likelihood of test data under the different models and show that the complete FRAME model outperforms the other methods (note that “complete” here means the model uses the same number of marginal directions as there are dimensions in the data). [sent-32, score-0.539]
25 The use of energy-based models such as FRAME for modeling neural data appears novel and promising, and the results reported here are easily extended to other cortical areas. [sent-33, score-0.266]
26 There is a need in the community for such probabilistic models of multi-variate spiking processes. [sent-34, score-0.347]
27 For example Bell and Para [6] formulate a simple model of correlated spiking but acknowledge that what they would really like, and do not have, is what they call a “maximum spikelihood” model. [sent-35, score-0.34]
28 This neural modeling problem represents a new application of energy-based models and consequently suggests extensions of the basic methods. [sent-36, score-0.22]
29 Finally, there is a need for rich probabilistic models of this type in the Bayesian decoding of neural activity [7]. [sent-37, score-0.518]
30 2 Methods The data used in this study consists of simultaneously recorded spike times from a population of M1 motor neurons recorded in monkeys trained to perform a manual tracking task [8, 9]. [sent-38, score-0.508]
31 The task involved moving a 2D manipulandum so that a cursor controlled by the manipulandum came into contact with a target. [sent-40, score-0.222]
32 Several papers [9, 11, 10] have reported successfully decoding the cursor kinematics from this data using firing rates estimated from binned spike counts. [sent-42, score-0.384]
33 The activity of a population of cells was recorded at a rate of 30kHz then sorted using an automated spike sorting method; from this we randomly selected five cells with which to demonstrate our method. [sent-43, score-0.712]
34 , ti,k ] is a vector of time intervals ti,k that represents the spiking activity of a single cell i at timestep k. [sent-48, score-0.579]
35 These intervals are the elapsed time between the time at timestep k and the time at each of j past spikes. [sent-49, score-0.121]
36 , rN,k ] be a vector concatenation of N such spiking activity representations. [sent-53, score-0.43]
37 Let sk = [xk , yk ] be the position of the manipulandum at each timestep. [sent-54, score-0.203]
38 Hand position at time k, sk = [xk , yk ], is regularly sampled every 50ms. [sent-56, score-0.129]
39 Spiking activity (shown as vertical bars) is retained at full data acquisition precision (30khz). [sent-57, score-0.188]
40 Sections of spike trains from four cells are shown. [sent-58, score-0.195]
41 training data consists of 4000 points Rk , sk sampled at 50ms intervals with a history of 3 past spikes (J = 3) per neuron. [sent-60, score-0.255]
42 Various empirical marginals of the data (shown in Fig 2) illustrate that the data are not well fit by canonical symmetric parametric distributions because the data is asymmetric and skewed. [sent-62, score-0.152]
43 For such data traditional parametric models may not work well so instead we apply the FRAME model of [1] to this modeling problem. [sent-63, score-0.263]
44 FRAME is a semi-parametric energy based model of the following form: Let dk = [sk , Rk ], where sk and Rk are defined as above. [sent-64, score-0.454]
45 In this e T view λT φ(ωe dk ) is an energy function computed over a projection of the data. [sent-71, score-0.477]
46 Models of e this form are constrained maximum entropy models, and in this case by adjusting λ e the model marginal projection onto ωe is constrained to be identical (ideally) to the empirical marginal over the same projection. [sent-72, score-0.566]
47 Distributions of this form are called Gibbs or energy-based distributions as T T e λe φ(ωe dk ) is analogous to the energy in a Boltzmann distribution. [sent-76, score-0.357]
48 Minimizing the Figure 2: Histograms of various projections of single cell data. [sent-77, score-0.144]
49 d) 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ¸ * 2987432100000000 Figure 3: (left) Illustration of the projection and weighting of a single point d: Here, the data point d is projected onto the projection direction ω. [sent-87, score-0.398]
50 (right) Illustration of the projection and binning of d: The upper plot shows the empirical marginal (in dotted gray) as obtained from the projection illustrated in the left figure. [sent-89, score-0.454]
51 The function φ(·) takes a real valued projection and produces a vector of fixed length with a single 1 in the bin that is mapped to that range of the projection. [sent-90, score-0.211]
52 This discretization of the projection is indicated by the spacing of the downward pointing arrows. [sent-91, score-0.16]
53 This process is repeated for each of the projection directions in the model. [sent-93, score-0.24]
54 this energy is equivalent to maximizing the log likelihood. [sent-95, score-0.125]
55 Our model is parameterized by Θ = {{λe , ωe } : 1 < e < E} where E is the total number of projections (or “experts”). [sent-96, score-0.127]
56 We use gradient ascent on the log likelihood to train the λ e ’s. [sent-97, score-0.203]
57 1 Learning the λ’s Standard gradient ascent becomes intractable for large numbers of cells because computing the partition function and its gradient becomes intractable. [sent-100, score-0.281]
58 The gradient of the log probability with respect to λ1. [sent-101, score-0.122]
59 (2) Θλ log P (dk ) = [ ∂λ1 ∂λE Besides not being able to normalize the distribution, the right hand term of the partial ∂ log Z(Θ) ∂ log P (dk ) T = φ(ωe dk ) − ∂λe ∂λe typically has no closed-form solution and is very hard to compute. [sent-107, score-0.518]
60 Contrastive divergence [4] is an efficient learning algorithm for energy based models that approximates the gradient as ∂ log P (dk ) ∂ log P (dk ) ∂ log P (dk ) − ≈ (3) ∂λe ∂λe ∂λe P0 Pm Θ m where P 0 is the training data and PΘ are samples drawn according to the model. [sent-109, score-0.568]
61 The superscript indicates that we use m regular Metropolis sampling steps [13] to draw samples from the model for contrastive divergence training (m = 50 in our experiments). [sent-111, score-0.511]
62 Maximizing the log probability of training data is equivalent to minimizing the Kullback Leibler (KL) divergence between the model and the true distribution. [sent-113, score-0.304]
63 Contrastive divergence attempts to minimize the difference in KL divergence between the model one step towards equilibrium and the training data. [sent-114, score-0.377]
64 Intuitively this means that the contrastive divergence opposes any tendency for the model to diverge from the true distribution. [sent-115, score-0.471]
65 2 Learning the ω’s Because φ(·) is not differentiable, we turn to the feature pursuit method of [2] to learn the projection directions ω1. [sent-117, score-0.313]
66 This approach involves successively searching for a new projection in a direction where a model with the new projection would differ maximally from the model without. [sent-120, score-0.41]
67 Their approach involves approximating the expected projection using a Parzen window method with Gaussian kernels. [sent-121, score-0.16]
68 It may be as effective (and certainly more efficient) to simply guess a large number of projections and estimate the marginal KL-divergence for them all, selecting the largest as the new projection. [sent-128, score-0.216]
69 The per-expert normalization is Ze = (b) s(b) e−λe e b (b) se is where b indexes the elements of λe and the width of the bth bin of the eth histogramming function. [sent-132, score-0.132]
70 This is not possible when the number of experts exceeds or is smaller than the dimensionality of the data. [sent-137, score-0.217]
71 The test data consists of the spiking activity of 5 cells and x, y position behavioral variables as illustrated in Fig. [sent-139, score-0.562]
72 Each row shows normalized 2-d histograms of samples projected onto a plane. [sent-142, score-0.154]
73 We show results for complete models of the joint neuronal response of 5 real motor cortex cells plus x, y hand kinematics (3 past spikes for each cell plus 2 behavior variables equals a 17 dimension dataset). [sent-145, score-0.64]
74 A complete model has the same number of experts as dimensions. [sent-146, score-0.316]
75 Each model was trained on 4000 points and the log likelihood was computed using 1000 distinct test points. [sent-149, score-0.164]
76 4 we show histograms of samples drawn from a full covariance Gaussian and energy-based models with two times more projection directions than the data dimensionality. [sent-151, score-0.378]
77 These figures illustrate the modeling power of our approach in that it represents the irregularities common to real neural data better than Gaussian and other symmetric distributions. [sent-152, score-0.158]
78 Note that the model using random marginal directions does not model the data as well as one using optimized directions; this is not surprising. [sent-153, score-0.304]
79 It may well be the case, however, that with many more random directions such a model would perform significantly better. [sent-154, score-0.125]
80 4 Discussion In this work we demonstrated an approach for using Gibbs distributions to model the joint spiking activity of a population of cells and an associated behavior. [sent-156, score-0.836]
81 We developed a novel application of contrastive divergence for learning a FRAME model which can be viewed as a semi-parametric Product-of-Experts model. [sent-157, score-0.471]
82 We showed that our model outperformed other models in representing complex monkey motor cortical spiking data. [sent-158, score-0.559]
83 Previous methods for probabilistically modeling spiking process have focused on modeling the firing rates of a population in terms of a conditional intensity function (firing rate conditioned on various correlates and previous spiking) [15, 16, 17, 18, 19]. [sent-159, score-0.596]
84 Here we take a more direct approach of modeling the joint probability using energy-based models and exploit contrastive divergence for learning Information theoretic analysis of spiking populations calls for modeling high dimensional joint and conditional distributions. [sent-161, score-1.105]
85 Given an experimental design with a relatively low dimension stimulus, where the entropy of that stimulus can be accurately computed, our models are applicable without modification. [sent-164, score-0.153]
86 Decoding algorithms that exploits these joint models by maximizing the likelihood of the observed firing activity over an entire data set remain to be developed. [sent-167, score-0.362]
87 Note that it may be possible to produce more accurate models of the un-normalized joint probability by increasing the number of marginal constraints. [sent-168, score-0.262]
88 We have begun exploring more flexible and asymmetric experts which would offer advantages over discrete histogramming inherent to the FRAME model. [sent-171, score-0.298]
89 Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Comp. [sent-195, score-0.497]
90 Hinton, “Energy-based models for sparse overcomplete representations,” JMLR, vol. [sent-204, score-0.127]
91 Parra, “Maximising sensitivity in a spiking network,” in Advances in NIPS, vol. [sent-215, score-0.242]
92 Dayan, “Probabilistic computation in spiking populations,” in Advances in NIPS, vol. [sent-225, score-0.242]
93 Donoghue, “Robustness of neuroprosthetic decoding algorithms,” Biological Cybernetics, vol. [sent-233, score-0.164]
94 Donoghue, “Neural decoding of cursor motion using a Kalman filter,” in Advances in NIPS, vol. [sent-258, score-0.238]
95 Donoghue, “Probabilistic inference of arm motion from neural activity in motor cortex,” Advances in NIPS, vol. [sent-268, score-0.392]
96 Donoghue, “A quantitative comparison of linear and non-linear models of motor cortical activity for the encoding and decoding of arm motions,” in First International IEEE/EMBS Conference on Neural Engineering, pp. [sent-313, score-0.603]
97 Brown, “A point process framework for relating neural spiking activity to spiking history,” J. [sent-328, score-0.733]
98 Nirenberg, “Synergy, redundancy, and independence in population codes, revisited,” J. [sent-335, score-0.16]
99 Latham, “Decoding neuronal spike trains: How important are correlations? [sent-342, score-0.147]
100 Young, “Objective assessment of the functional role of spike train correlations using information measures,” Visual Cognition, vol. [sent-354, score-0.142]
wordName wordTfidf (topN-words)
[('contrastive', 0.28), ('frame', 0.27), ('dk', 0.265), ('spiking', 0.242), ('donoghue', 0.218), ('experts', 0.217), ('activity', 0.188), ('decoding', 0.164), ('projection', 0.16), ('population', 0.16), ('divergence', 0.146), ('marginal', 0.134), ('gibbs', 0.116), ('fellows', 0.108), ('ring', 0.107), ('spike', 0.1), ('motor', 0.1), ('modeling', 0.097), ('cells', 0.095), ('poe', 0.093), ('pot', 0.093), ('serruya', 0.093), ('sk', 0.092), ('projections', 0.082), ('bienenstock', 0.081), ('histogramming', 0.081), ('directions', 0.08), ('histograms', 0.076), ('manipulandum', 0.074), ('cursor', 0.074), ('recorded', 0.074), ('log', 0.073), ('pursuit', 0.073), ('rk', 0.071), ('wu', 0.069), ('gao', 0.069), ('joint', 0.066), ('overcomplete', 0.065), ('monkey', 0.064), ('hatsopoulos', 0.062), ('ig', 0.062), ('mumford', 0.062), ('nirenberg', 0.062), ('truccolo', 0.062), ('ze', 0.062), ('cell', 0.062), ('models', 0.062), ('hinton', 0.061), ('neural', 0.061), ('parametric', 0.059), ('datum', 0.059), ('entropy', 0.054), ('osindero', 0.054), ('latham', 0.054), ('complete', 0.054), ('partition', 0.053), ('correlated', 0.053), ('marginals', 0.053), ('energy', 0.052), ('black', 0.052), ('bin', 0.051), ('zhu', 0.051), ('cybernetics', 0.05), ('gradient', 0.049), ('welling', 0.049), ('intervals', 0.049), ('exploit', 0.049), ('neuronal', 0.047), ('likelihood', 0.046), ('kinematics', 0.046), ('cortical', 0.046), ('model', 0.045), ('xk', 0.044), ('paninski', 0.043), ('bell', 0.043), ('rf', 0.043), ('arm', 0.043), ('probabilistic', 0.043), ('correlations', 0.042), ('gaussian', 0.041), ('training', 0.04), ('spikes', 0.04), ('distributions', 0.04), ('fp', 0.039), ('skewed', 0.039), ('product', 0.039), ('challenging', 0.039), ('projected', 0.039), ('onto', 0.039), ('timestep', 0.038), ('stimulus', 0.037), ('position', 0.037), ('learned', 0.035), ('texture', 0.035), ('ascent', 0.035), ('minimax', 0.035), ('multivariate', 0.035), ('roth', 0.034), ('hand', 0.034), ('past', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
Author: Frank Wood, Stefan Roth, Michael J. Black
Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1
2 0.1426394 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy
Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1
3 0.11963617 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
Author: Edward Meeds, Simon Osindero
Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective ‘gating network’ for the different experts. 1
4 0.11426666 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity
Author: Robert A. Legenstein, Wolfgang Maass
Abstract: We investigate under what conditions a neuron can learn by experimentally supported rules for spike timing dependent plasticity (STDP) to predict the arrival times of strong “teacher inputs” to the same neuron. It turns out that in contrast to the famous Perceptron Convergence Theorem, which predicts convergence of the perceptron learning rule for a simplified neuron model whenever a stable solution exists, no equally strong convergence guarantee can be given for spiking neurons with STDP. But we derive a criterion on the statistical dependency structure of input spike trains which characterizes exactly when learning with STDP will converge on average for a simple model of a spiking neuron. This criterion is reminiscent of the linear separability criterion of the Perceptron Convergence Theorem, but it applies here to the rows of a correlation matrix related to the spike inputs. In addition we show through computer simulations for more realistic neuron models that the resulting analytically predicted positive learning results not only hold for the common interpretation of STDP where STDP changes the weights of synapses, but also for a more realistic interpretation suggested by experimental data where STDP modulates the initial release probability of dynamic synapses. 1
5 0.10721247 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
Author: Renaud Jolivet, Alexander Rauch, Hans-rudolf Lüscher, Wulfram Gerstner
Abstract: Integrate-and-Fire-type models are usually criticized because of their simplicity. On the other hand, the Integrate-and-Fire model is the basis of most of the theoretical studies on spiking neuron models. Here, we develop a sequential procedure to quantitatively evaluate an equivalent Integrate-and-Fire-type model based on intracellular recordings of cortical pyramidal neurons. We find that the resulting effective model is sufficient to predict the spike train of the real pyramidal neuron with high accuracy. In in vivo-like regimes, predicted and recorded traces are almost indistinguishable and a significant part of the spikes can be predicted at the correct timing. Slow processes like spike-frequency adaptation are shown to be a key feature in this context since they are necessary for the model to connect between different driving regimes. 1
6 0.10633906 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions
7 0.10373981 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression
8 0.10319043 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
9 0.10210005 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
10 0.10188967 181 nips-2005-Spiking Inputs to a Winner-take-all Network
11 0.094944037 45 nips-2005-Conditional Visual Tracking in Kernel Space
12 0.094176777 30 nips-2005-Assessing Approximations for Gaussian Process Classification
13 0.091895349 126 nips-2005-Metric Learning by Collapsing Classes
14 0.091869533 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
15 0.091440119 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
16 0.091242783 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
17 0.089253232 136 nips-2005-Noise and the two-thirds power Law
18 0.088282302 109 nips-2005-Learning Cue-Invariant Visual Responses
19 0.083530158 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits
20 0.081170313 141 nips-2005-Norepinephrine and Neural Interrupts
topicId topicWeight
[(0, 0.253), (1, -0.159), (2, -0.051), (3, 0.05), (4, 0.05), (5, -0.106), (6, 0.006), (7, -0.125), (8, 0.016), (9, 0.015), (10, 0.072), (11, -0.064), (12, 0.074), (13, -0.013), (14, 0.087), (15, 0.063), (16, 0.1), (17, -0.013), (18, -0.062), (19, -0.018), (20, -0.023), (21, 0.024), (22, -0.034), (23, 0.094), (24, 0.089), (25, 0.117), (26, 0.064), (27, 0.02), (28, 0.052), (29, -0.084), (30, 0.0), (31, -0.09), (32, 0.191), (33, -0.043), (34, 0.143), (35, -0.092), (36, 0.101), (37, -0.031), (38, -0.019), (39, 0.229), (40, -0.093), (41, -0.048), (42, -0.046), (43, -0.017), (44, -0.048), (45, -0.007), (46, -0.066), (47, -0.002), (48, -0.114), (49, -0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.94912016 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
Author: Frank Wood, Stefan Roth, Michael J. Black
Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1
2 0.63626522 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
Author: Keiji Miura, Masato Okada, Shun-ichi Amari
Abstract: We considered a gamma distribution of interspike intervals as a statistical model for neuronal spike generation. The model parameters consist of a time-dependent firing rate and a shape parameter that characterizes spiking irregularities of individual neurons. Because the environment changes with time, observed data are generated from the time-dependent firing rate, which is an unknown function. A statistical model with an unknown function is called a semiparametric model, which is one of the unsolved problem in statistics and is generally very difficult to solve. We used a novel method of estimating functions in information geometry to estimate the shape parameter without estimating the unknown function. We analytically obtained an optimal estimating function for the shape parameter independent of the functional form of the firing rate. This estimation is efficient without Fisher information loss and better than maximum likelihood estimation. 1
3 0.52418673 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy
Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reflecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target specification and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1
4 0.52308071 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
Author: Kristina Klinkner, Cosma Shalizi, Marcelo Camperi
Abstract: Most nervous systems encode information about stimuli in the responding activity of large neuronal networks. This activity often manifests itself as dynamically coordinated sequences of action potentials. Since multiple electrode recordings are now a standard tool in neuroscience research, it is important to have a measure of such network-wide behavioral coordination and information sharing, applicable to multiple neural spike train data. We propose a new statistic, informational coherence, which measures how much better one unit can be predicted by knowing the dynamical state of another. We argue informational coherence is a measure of association and shared information which is superior to traditional pairwise measures of synchronization and correlation. To find the dynamical states, we use a recently-introduced algorithm which reconstructs effective state spaces from stochastic time series. We then extend the pairwise measure to a multivariate analysis of the network by estimating the network multi-information. We illustrate our method by testing it on a detailed model of the transition from gamma to beta rhythms. Much of the most important information in neural systems is shared over multiple neurons or cortical areas, in such forms as population codes and distributed representations [1]. On behavioral time scales, neural information is stored in temporal patterns of activity as opposed to static markers; therefore, as information is shared between neurons or brain regions, it is physically instantiated as coordination between entire sequences of neural spikes. Furthermore, neural systems and regions of the brain often require coordinated neural activity to perform important functions; acting in concert requires multiple neurons or cortical areas to share information [2]. Thus, if we want to measure the dynamic network-wide behavior of neurons and test hypotheses about them, we need reliable, practical methods to detect and quantify behavioral coordination and the associated information sharing across multiple neural units. These would be especially useful in testing ideas about how particular forms of coordination relate to distributed coding (e.g., that of [3]). Current techniques to analyze relations among spike trains handle only pairs of neurons, so we further need a method which is extendible to analyze the coordination in the network, system, or region as a whole. Here we propose a new measure of behavioral coordination and information sharing, informational coherence, based on the notion of dynamical state. Section 1 argues that coordinated behavior in neural systems is often not captured by exist- ing measures of synchronization or correlation, and that something sensitive to nonlinear, stochastic, predictive relationships is needed. Section 2 defines informational coherence as the (normalized) mutual information between the dynamical states of two systems and explains how looking at the states, rather than just observables, fulfills the needs laid out in Section 1. Since we rarely know the right states a prori, Section 2.1 briefly describes how we reconstruct effective state spaces from data. Section 2.2 gives some details about how we calculate the informational coherence and approximate the global information stored in the network. Section 3 applies our method to a model system (a biophysically detailed conductance-based model) comparing our results to those of more familiar second-order statistics. In the interest of space, we omit proofs and a full discussion of the existing literature, giving only minimal references here; proofs and references will appear in a longer paper now in preparation. 1 Synchrony or Coherence? Most hypotheses which involve the idea that information sharing is reflected in coordinated activity across neural units invoke a very specific notion of coordinated activity, namely strict synchrony: the units should be doing exactly the same thing (e.g., spiking) at exactly the same time. Investigators then measure coordination by measuring how close the units come to being strictly synchronized (e.g., variance in spike times). From an informational point of view, there is no reason to favor strict synchrony over other kinds of coordination. One neuron consistently spiking 50 ms after another is just as informative a relationship as two simultaneously spiking, but such stable phase relations are missed by strict-synchrony approaches. Indeed, whatever the exact nature of the neural code, it uses temporally extended patterns of activity, and so information sharing should be reflected in coordination of those patterns, rather than just the instantaneous activity. There are three common ways of going beyond strict synchrony: cross-correlation and related second-order statistics, mutual information, and topological generalized synchrony. The cross-correlation function (the normalized covariance function; this includes, for present purposes, the joint peristimulus time histogram [2]), is one of the most widespread measures of synchronization. It can be efficiently calculated from observable series; it handles statistical as well as deterministic relationships between processes; by incorporating variable lags, it reduces the problem of phase locking. Fourier transformation of the covariance function γXY (h) yields the cross-spectrum FXY (ν), which in turn gives the 2 spectral coherence cXY (ν) = FXY (ν)/FX (ν)FY (ν), a normalized correlation between the Fourier components of X and Y . Integrated over frequencies, the spectral coherence measures, essentially, the degree of linear cross-predictability of the two series. ([4] applies spectral coherence to coordinated neural activity.) However, such second-order statistics only handle linear relationships. Since neural processes are known to be strongly nonlinear, there is little reason to think these statistics adequately measure coordination and synchrony in neural systems. Mutual information is attractive because it handles both nonlinear and stochastic relationships and has a very natural and appealing interpretation. Unfortunately, it often seems to fail in practice, being disappointingly small even between signals which are known to be tightly coupled [5]. The major reason is that the neural codes use distinct patterns of activity over time, rather than many different instantaneous actions, and the usual approach misses these extended patterns. Consider two neurons, one of which drives the other to spike 50 ms after it does, the driving neuron spiking once every 500 ms. These are very tightly coordinated, but whether the first neuron spiked at time t conveys little information about what the second neuron is doing at t — it’s not spiking, but it’s not spiking most of the time anyway. Mutual information calculated from the direct observations conflates the “no spike” of the second neuron preparing to fire with its just-sitting-around “no spike”. Here, mutual information could find the coordination if we used a 50 ms lag, but that won’t work in general. Take two rate-coding neurons with base-line firing rates of 1 Hz, and suppose that a stimulus excites one to 10 Hz and suppresses the other to 0.1 Hz. The spiking rates thus share a lot of information, but whether the one neuron spiked at t is uninformative about what the other neuron did then, and lagging won’t help. Generalized synchrony is based on the idea of establishing relationships between the states of the various units. “State” here is taken in the sense of physics, dynamics and control theory: the state at time t is a variable which fixes the distribution of observables at all times ≥ t, rendering the past of the system irrelevant [6]. Knowing the state allows us to predict, as well as possible, how the system will evolve, and how it will respond to external forces [7]. Two coupled systems are said to exhibit generalized synchrony if the state of one system is given by a mapping from the state of the other. Applications to data employ statespace reconstruction [8]: if the state x ∈ X evolves according to smooth, d-dimensional deterministic dynamics, and we observe a generic function y = f (x), then the space Y of time-delay vectors [y(t), y(t − τ ), ...y(t − (k − 1)τ )] is diffeomorphic to X if k > 2d, for generic choices of lag τ . The various versions of generalized synchrony differ on how, precisely, to quantify the mappings between reconstructed state spaces, but they all appear to be empirically equivalent to one another and to notions of phase synchronization based on Hilbert transforms [5]. Thus all of these measures accommodate nonlinear relationships, and are potentially very flexible. Unfortunately, there is essentially no reason to believe that neural systems have deterministic dynamics at experimentally-accessible levels of detail, much less that there are deterministic relationships among such states for different units. What we want, then, but none of these alternatives provides, is a quantity which measures predictive relationships among states, but allows those relationships to be nonlinear and stochastic. The next section introduces just such a measure, which we call “informational coherence”. 2 States and Informational Coherence There are alternatives to calculating the “surface” mutual information between the sequences of observations themselves (which, as described, fails to capture coordination). If we know that the units are phase oscillators, or rate coders, we can estimate their instantaneous phase or rate and, by calculating the mutual information between those variables, see how coordinated the units’ patterns of activity are. However, phases and rates do not exhaust the repertoire of neural patterns and a more general, common scheme is desirable. The most general notion of “pattern of activity” is simply that of the dynamical state of the system, in the sense mentioned above. We now formalize this. Assuming the usual notation for Shannon information [9], the information content of a state variable X is H[X] and the mutual information between X and Y is I[X; Y ]. As is well-known, I[X; Y ] ≤ min H[X], H[Y ]. We use this to normalize the mutual state information to a 0 − 1 scale, and this is the informational coherence (IC). ψ(X, Y ) = I[X; Y ] , with 0/0 = 0 . min H[X], H[Y ] (1) ψ can be interpreted as follows. I[X; Y ] is the Kullback-Leibler divergence between the joint distribution of X and Y , and the product of their marginal distributions [9], indicating the error involved in ignoring the dependence between X and Y . The mutual information between predictive, dynamical states thus gauges the error involved in assuming the two systems are independent, i.e., how much predictions could improve by taking into account the dependence. Hence it measures the amount of dynamically-relevant information shared between the two systems. ψ simply normalizes this value, and indicates the degree to which two systems have coordinated patterns of behavior (cf. [10], although this only uses directly observable quantities). 2.1 Reconstruction and Estimation of Effective State Spaces As mentioned, the state space of a deterministic dynamical system can be reconstructed from a sequence of observations. This is the main tool of experimental nonlinear dynamics [8]; but the assumption of determinism is crucial and false, for almost any interesting neural system. While classical state-space reconstruction won’t work on stochastic processes, such processes do have state-space representations [11], and, in the special case of discretevalued, discrete-time series, there are ways to reconstruct the state space. Here we use the CSSR algorithm, introduced in [12] (code available at http://bactra.org/CSSR). This produces causal state models, which are stochastic automata capable of statistically-optimal nonlinear prediction; the state of the machine is a minimal sufficient statistic for the future of the observable process[13].1 The basic idea is to form a set of states which should be (1) Markovian, (2) sufficient statistics for the next observable, and (3) have deterministic transitions (in the automata-theory sense). The algorithm begins with a minimal, one-state, IID model, and checks whether these properties hold, by means of hypothesis tests. If they fail, the model is modified, generally but not always by adding more states, and the new model is checked again. Each state of the model corresponds to a distinct distribution over future events, i.e., to a statistical pattern of behavior. Under mild conditions, which do not involve prior knowledge of the state space, CSSR converges in probability to the unique causal state model of the data-generating process [12]. In practice, CSSR is quite fast (linear in the data size), and generalizes at least as well as training hidden Markov models with the EM algorithm and using cross-validation for selection, the standard heuristic [12]. One advantage of the causal state approach (which it shares with classical state-space reconstruction) is that state estimation is greatly simplified. In the general case of nonlinear state estimation, it is necessary to know not just the form of the stochastic dynamics in the state space and the observation function, but also their precise parametric values and the distribution of observation and driving noises. Estimating the state from the observable time series then becomes a computationally-intensive application of Bayes’s Rule [17]. Due to the way causal states are built as statistics of the data, with probability 1 there is a finite time, t, at which the causal state at time t is certain. This is not just with some degree of belief or confidence: because of the way the states are constructed, it is impossible for the process to be in any other state at that time. Once the causal state has been established, it can be updated recursively, i.e., the causal state at time t + 1 is an explicit function of the causal state at time t and the observation at t + 1. The causal state model can be automatically converted, therefore, into a finite-state transducer which reads in an observation time series and outputs the corresponding series of states [18, 13]. (Our implementation of CSSR filters its training data automatically.) The result is a new time series of states, from which all non-predictive components have been filtered out. 2.2 Estimating the Coherence Our algorithm for estimating the matrix of informational coherences is as follows. For each unit, we reconstruct the causal state model, and filter the observable time series to produce a series of causal states. Then, for each pair of neurons, we construct a joint histogram of 1 Causal state models have the same expressive power as observable operator models [14] or predictive state representations [7], and greater power than variable-length Markov models [15, 16]. a b Figure 1: Rastergrams of neuronal spike-times in the network. Excitatory, pyramidal neurons (numbers 1 to 1000) are shown in green, inhibitory interneurons (numbers 1001 to 1300) in red. During the first 10 seconds (a), the current connections among the pyramidal cells are suppressed and a gamma rhythm emerges (left). At t = 10s, those connections become active, leading to a beta rhythm (b, right). the state distribution, estimate the mutual information between the states, and normalize by the single-unit state informations. This gives a symmetric matrix of ψ values. Even if two systems are independent, their estimated IC will, on average, be positive, because, while they should have zero mutual information, the empirical estimate of mutual information is non-negative. Thus, the significance of IC values must be assessed against the null hypothesis of system independence. The easiest way to do so is to take the reconstructed state models for the two systems and run them forward, independently of one another, to generate a large number of simulated state sequences; from these calculate values of the IC. This procedure will approximate the sampling distribution of the IC under a null model which preserves the dynamics of each system, but not their interaction. We can then find p-values as usual. We omit them here to save space. 2.3 Approximating the Network Multi-Information There is broad agreement [2] that analyses of networks should not just be an analysis of pairs of neurons, averaged over pairs. Ideally, an analysis of information sharing in a network would look at the over-all structure of statistical dependence between the various units, reflected in the complete joint probability distribution P of the states. This would then allow us, for instance, to calculate the n-fold multi-information, I[X1 , X2 , . . . Xn ] ≡ D(P ||Q), the Kullback-Leibler divergence between the joint distribution P and the product of marginal distributions Q, analogous to the pairwise mutual information [19]. Calculated over the predictive states, the multi-information would give the total amount of shared dynamical information in the system. Just as we normalized the mutual information I[X1 , X2 ] by its maximum possible value, min H[X1 ], H[X2 ], we normalize the multiinformation by its maximum, which is the smallest sum of n − 1 marginal entropies: I[X1 ; X2 ; . . . Xn ] ≤ min k H[Xn ] i=k Unfortunately, P is a distribution over a very high dimensional space and so, hard to estimate well without strong parametric constraints. We thus consider approximations. The lowest-order approximation treats all the units as independent; this is the distribution Q. One step up are tree distributions, where the global distribution is a function of the joint distributions of pairs of units. Not every pair of units needs to enter into such a distribution, though every unit must be part of some pair. Graphically, a tree distribution corresponds to a spanning tree, with edges linking units whose interactions enter into the global probability, and conversely spanning trees determine tree distributions. Writing ET for the set of pairs (i, j) and abbreviating X1 = x1 , X2 = x2 , . . . Xn = xn by X = x, one has n T (X = x) = (i,j)∈ET T (Xi = xi , Xj = xj ) T (Xi = xi ) T (Xi = xi )T (Xj = xj ) i=1 (2) where the marginal distributions T (Xi ) and the pair distributions T (Xi , Xj ) are estimated by the empirical marginal and pair distributions. We must now pick edges ET so that T best approximates the true global distribution P . A natural approach is to minimize D(P ||T ), the divergence between P and its tree approximation. Chow and Liu [20] showed that the maximum-weight spanning tree gives the divergence-minimizing distribution, taking an edge’s weight to be the mutual information between the variables it links. There are three advantages to using the Chow-Liu approximation. (1) Estimating T from empirical probabilities gives a consistent maximum likelihood estimator of the ideal ChowLiu tree [20], with reasonable rates of convergence, so T can be reliably known even if P cannot. (2) There are efficient algorithms for constructing maximum-weight spanning trees, such as Prim’s algorithm [21, sec. 23.2], which runs in time O(n2 + n log n). Thus, the approximation is computationally tractable. (3) The KL divergence of the Chow-Liu distribution from Q gives a lower bound on the network multi-information; that bound is just the sum of the mutual informations along the edges in the tree: I[X1 ; X2 ; . . . Xn ] ≥ D(T ||Q) = I[Xi ; Xj ] (3) (i,j)∈ET Even if we knew P exactly, Eq. 3 would be useful as an alternative to calculating D(P ||Q) directly, evaluating log P (x)/Q(x) for all the exponentially-many configurations x. It is natural to seek higher-order approximations to P , e.g., using three-way interactions not decomposable into pairwise interactions [22, 19]. But it is hard to do so effectively, because finding the optimal approximation to P when such interactions are allowed is NP [23], and analytical formulas like Eq. 3 generally do not exist [19]. We therefore confine ourselves to the Chow-Liu approximation here. 3 Example: A Model of Gamma and Beta Rhythms We use simulated data as a test case, instead of empirical multiple electrode recordings, which allows us to try the method on a system of over 1000 neurons and compare the measure against expected results. The model, taken from [24], was originally designed to study episodes of gamma (30–80Hz) and beta (12–30Hz) oscillations in the mammalian nervous system, which often occur successively with a spontaneous transition between them. More concretely, the rhythms studied were those displayed by in vitro hippocampal (CA1) slice preparations and by in vivo neocortical EEGs. The model contains two neuron populations: excitatory (AMPA) pyramidal neurons and inhibitory (GABAA ) interneurons, defined by conductance-based Hodgkin-Huxley-style equations. Simulations were carried out in a network of 1000 pyramidal cells and 300 interneurons. Each cell was modeled as a one-compartment neuron with all-to-all coupling, endowed with the basic sodium and potassium spiking currents, an external applied current, and some Gaussian input noise. The first 10 seconds of the simulation correspond to the gamma rhythm, in which only a group of neurons is made to spike via a linearly increasing applied current. The beta rhythm a b c d Figure 2: Heat-maps of coordination for the network, as measured by zero-lag cross-correlation (top row) and informational coherence (bottom), contrasting the gamma rhythm (left column) with the beta (right). Colors run from red (no coordination) through yellow to pale cream (maximum). (subsequent 10 seconds) is obtained by activating pyramidal-pyramidal recurrent connections (potentiated by Hebbian preprocessing as a result of synchrony during the gamma rhythm) and a slow outward after-hyper-polarization (AHP) current (the M-current), suppressed during gamma due to the metabotropic activation used in the generation of the rhythm. During the beta rhythm, pyramidal cells, silent during gamma rhythm, fire on a subset of interneurons cycles (Fig. 1). Fig. 2 compares zero-lag cross-correlation, a second-order method of quantifying coordination, with the informational coherence calculated from the reconstructed states. (In this simulation, we could have calculated the actual states of the model neurons directly, rather than reconstructing them, but for purposes of testing our method we did not.) Crosscorrelation finds some of the relationships visible in Fig. 1, but is confused by, for instance, the phase shifts between pyramidal cells. (Surface mutual information, not shown, gives similar results.) Informational coherence, however, has no trouble recognizing the two populations as effectively coordinated blocks. The presence of dynamical noise, problematic for ordinary state reconstruction, is not an issue. The average IC is 0.411 (or 0.797 if the inactive, low-numbered neurons are excluded). The tree estimate of the global informational multi-information is 3243.7 bits, with a global coherence of 0.777. The right half of Fig. 2 repeats this analysis for the beta rhythm; in this stage, the average IC is 0.614, and the tree estimate of the global multi-information is 7377.7 bits, though the estimated global coherence falls very slightly to 0.742. This is because low-numbered neurons which were quiescent before are now active, contributing to the global information, but the over-all pattern is somewhat weaker and more noisy (as can be seen from Fig. 1b.) So, as expected, the total information content is higher, but the overall coordination across the network is lower. 4 Conclusion Informational coherence provides a measure of neural information sharing and coordinated activity which accommodates nonlinear, stochastic relationships between extended patterns of spiking. It is robust to dynamical noise and leads to a genuinely multivariate measure of global coordination across networks or regions. Applied to data from multi-electrode recordings, it should be a valuable tool in evaluating hypotheses about distributed neural representation and function. Acknowledgments Thanks to R. Haslinger, E. Ionides and S. Page; and for support to the Santa Fe Institute (under grants from Intel, the NSF and the MacArthur Foundation, and DARPA agreement F30602-00-2-0583), the Clare Booth Luce Foundation (KLK) and the James S. McDonnell Foundation (CRS). References [1] L. F. Abbott and T. J. Sejnowski, eds. Neural Codes and Distributed Representations. MIT Press, 1998. [2] E. N. Brown, R. E. Kass, and P. P. Mitra. Nature Neuroscience, 7:456–461, 2004. [3] D. H. Ballard, Z. Zhang, and R. P. N. Rao. In R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki, eds., Probabilistic Models of the Brain, pp. 273–284, MIT Press, 2002. [4] D. R. Brillinger and A. E. P. Villa. In D. R. Brillinger, L. T. Fernholz, and S. Morgenthaler, eds., The Practice of Data Analysis, pp. 77–92. Princeton U.P., 1997. [5] R. Quian Quiroga et al. Physical Review E, 65:041903, 2002. [6] R. F. Streater. Statistical Dynamics. Imperial College Press, London. [7] M. L. Littman, R. S. Sutton, and S. Singh. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14, pp. 1555–1561. MIT Press, 2002. [8] H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge U.P., 1997. [9] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991. [10] M. Palus et al. Physical Review E, 63:046211, 2001. [11] F. B. Knight. Annals of Probability, 3:573–596, 1975. [12] C. R. Shalizi and K. L. Shalizi. In M. Chickering and J. Halpern, eds., Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference, pp. 504–511. AUAI Press, 2004. [13] C. R. Shalizi and J. P. Crutchfield. Journal of Statistical Physics, 104:817–819, 2001. [14] H. Jaeger. Neural Computation, 12:1371–1398, 2000. [15] D. Ron, Y. Singer, and N. Tishby. Machine Learning, 25:117–149, 1996. [16] P. B¨ hlmann and A. J. Wyner. Annals of Statistics, 27:480–513, 1999. u [17] N. U. Ahmed. Linear and Nonlinear Filtering for Scientists and Engineers. World Scientific, 1998. [18] D. R. Upper. PhD thesis, University of California, Berkeley, 1997. [19] E. Schneidman, S. Still, M. J. Berry, and W. Bialek. Physical Review Letters, 91:238701, 2003. [20] C. K. Chow and C. N. Liu. IEEE Transactions on Information Theory, IT-14:462–467, 1968. [21] T. H. Cormen et al. Introduction to Algorithms. 2nd ed. MIT Press, 2001. [22] S. Amari. IEEE Transacttions on Information Theory, 47:1701–1711, 2001. [23] S. Kirshner, P. Smyth, and A. Robertson. Tech. Rep. 04-04, UC Irvine, Information and Computer Science, 2004. [24] M. S. Olufsen et al. Journal of Computational Neuroscience, 14:33–54, 2003.
5 0.50700444 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression
Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong
Abstract: In this paper, we aim at analyzing the characteristic of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation of the membrane potential density function with synaptic depression, and obtain the formulas for analytic computing the response of instantaneous re rate. Through a technical analysis, we arrive at several signi cant conclusions: The background inputs play an important role in information processing and act as a switch betwee temporal integration and coincidence detection. the role of synapses can be regarded as a spatio-temporal lter; it is important in neural information processing for the spatial distribution of synapses and the spatial and temporal relation of inputs. The instantaneous input frequency can affect the response amplitude and phase delay. 1
6 0.50347292 136 nips-2005-Noise and the two-thirds power Law
7 0.50259423 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
8 0.49421465 203 nips-2005-Visual Encoding with Jittering Eyes
9 0.47203729 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
10 0.45590621 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions
11 0.45540595 30 nips-2005-Assessing Approximations for Gaussian Process Classification
12 0.43703863 133 nips-2005-Nested sampling for Potts models
13 0.43026888 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
14 0.42673221 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
15 0.39892232 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
16 0.39060104 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
17 0.38811147 99 nips-2005-Integrate-and-Fire models with adaptation are good enough
18 0.38580051 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
19 0.36096862 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
20 0.35877606 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
topicId topicWeight
[(3, 0.035), (10, 0.057), (11, 0.015), (27, 0.039), (31, 0.039), (34, 0.084), (39, 0.026), (44, 0.012), (50, 0.245), (55, 0.026), (57, 0.065), (60, 0.028), (65, 0.011), (69, 0.047), (73, 0.028), (77, 0.022), (88, 0.1), (91, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.81058383 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions
Author: Frank Wood, Stefan Roth, Michael J. Black
Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1
2 0.80635834 126 nips-2005-Metric Learning by Collapsing Classes
Author: Amir Globerson, Sam T. Roweis
Abstract: We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance. 1 Supervised Learning of Metrics The problem of learning a distance measure (metric) over an input space is of fundamental importance in machine learning [10, 9], both supervised and unsupervised. When such measures are learned directly from the available data, they can be used to improve learning algorithms which rely on distance computations such as nearest neighbour classification [5], supervised kernel machines (such as GPs or SVMs) and even unsupervised clustering algorithms [10]. Good similarity measures may also provide insight into the underlying structure of data (e.g. inter-protein distances), and may aid in building better data visualizations via embedding. In fact, there is a close link between distance learning and feature extraction since whenever we construct a feature ´Üµ for an input using a simple distance funcspace , we can measure distances between ܽ ܾ ¾ tion (e.g. Euclidean) ´Ü½ µ ´Ü¾ µ℄ in feature space. Thus by fixing , any feature extraction algorithm may be considered a metric learning method. Perhaps the simplest illustration of this approach is when the ´Üµ is a linear projection of Ü ¾ Ö so that ´Üµ Ï Ü. The Euclidean distance between ´Ü½ µ and ´Ü¾ µ is then the Mahalanobis distance ´Ü½ µ ´Ü¾ µ ¾ ´Ü½ ܾ µÌ ´Ü½ ܾ µ, where Ï Ì Ï is a positive semidefinite matrix. Much of the recent work on metric learning has indeed focused on learning Mahalanobis distances, i.e. learning the matrix . This is also the goal of the current work. A common approach to learning metrics is to assume some knowledge in the form of equiv- alence relations, i.e. which points should be close and which should be far (without specifying their exact distances). In the classification setting there is a natural equivalence relation, namely whether two points are in the same class or not. One of the classical statistical methods which uses this idea for the Mahalanobis distance is Fisher’s Linear Discriminant Analysis (see e.g. [6]). Other more recent methods are [10, 9, 5] which seek to minimize various separation criteria between the classes under the new metric. In this work, we present a novel approach to learning such a metric. Our approach, the Maximally Collapsing Metric Learning algorithm (MCML), relies on the simple geometric intuition that if all points in the same class could be mapped into a single location in feature space and all points in other classes mapped to other locations, this would result in an ideal approximation of our equivalence relation. Our algorithm approximates this scenario via a stochastic selection rule, as in Neighborhood Component Analysis (NCA) [5]. However, unlike NCA, the optimization problem is convex and thus our method is completely specified by our objective function. Different initialization and optimization techniques may affect the speed of obtaining the solution but the final solution itself is unique. We also show that our method approximates the local covariance structure of the data, as opposed to Linear Discriminant Analysis methods which use only global covariance structure. 2 The Approach of Collapsing Classes Given a set of Ò labeled examples ´Ü Ý µ, where Ü Ý ¾ ½ , we seek a space. We focus on Mahalanobis form metrics similarity measure between two points in ´Ü where Ü ´Ü µ ¾ Ü µÌ Ö and ´Ü Ü µ (1) is a positive semidefinite (PSD) matrix. Intuitively, what we want from a good metric is that it makes elements of in the same class look close whereas those in different classes appear far. Our approach starts with the ideal case when this is true in the most optimistic sense: same class points are at zero distance, and different class points are infinitely far. Alternatively this can be viewed as mapping Ü via a linear projection Ï Ü ( Ï Ì Ï ), such that all points in the same class are mapped into the same point. This intuition is related to the analysis of spectral clustering [8], where the ideal case analysis of the algorithm results in all same cluster points being mapped to a single point. To learn a metric which approximates the ideal geometric setup described above, we introduce, for each training point, a conditional distribution over other points (as in [5]). such that Specifically, for each Ü we define a conditional distribution over points Ô ´ µ ½ È (2) If all points in the same class were mapped to a single point and infinitely far from points in different classes, we would have the ideal “bi-level” distribution: Ô¼ ´ µ » Ò ½ ¼ Ý Ý Ý Ý (3) Furthermore, under very mild conditions, any set of points which achieves the above distribution must have the desired geometry. In particular, assume there are at least Ö · ¾ points in each class, where Ö rank ℄ (note that Ö Ö). Then Ô ´ µ Ô¼ ´ µ ( ) implies that under all points in the same class will be mapped to a single point, infinitely far from other class points 1 . 1 Proof sketch: The infinite separation between points of different classes follows simply from Thus it is natural to seek a matrix such that Ô ´ µ is as close as possible to Ô¼ ´ Since we are trying to match distributions, we minimize the KL divergence ÃÄ Ô¼ Ô℄: ÑÒ ÃÄ Ô¼ ´ µ Ô ´ µ℄ ¾ ÈË ×Ø µ. (4) The crucial property of this optimization problem is that it is convex in the matrix . To see this, first note that any convex linear combination of feasible solutions « ¼ · ´½ «µ ½ s.t. ¼ « ½ is still a feasible solution, since the set of PSD matrices is convex. Next, we can show that ´ µ alway has a greater cost than either of the endpoints. To do this, we rewrite the objective function ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ in the form 2 : È ´ µ ÐÓ Ý Ý Ô´ µ · Ý ÐÓ Ý where we assumed for simplicity that classes are equi-probable, yielding a multiplicative constant. To see why ´ µ is convex, first note that ´Ü Ü µÌ ´Ü Ü µ is linear is a ÐÓ ÜÔ function of affine functions of in , and thus convex. The function ÐÓ and is therefore also convex (see [4], page 74). È 2.1 Convex Duality Since our optimization problem is convex, it has an equivalent convex dual. Specifically, the convex dual of Eq. (4) is the following entropy maximization problem: À Ô´ Ñ Ü Ô´ µ where Ú Ü ×Ø µ℄ Ô¼ ´ Ú ÚÌ ℄ µ Ô´ µ Ú ÚÌ ℄ Ü , À ¡℄ is the entropy function and we require È Ô´ µ ¼ ½ (5) . To prove this duality we start with the proposed dual and obtain the original problem in Equation 4 as its dual. Write the Lagrangian for the above problem (where is PSD) 3 Ä´Ô ¬µ À ´Ô´ µµ Ì Ö´ ´ Ô¼ ´ Ú ÚÌ ℄ Ô Ú ÚÌ ℄µµµ ¬ ´ Ô´ µ ½µ The dual function is defined as ´ ¬ µ Ñ ÒÔ Ä´Ô ¬ µ. To derive it, we first solve for the minimizing Ô by setting the derivative of Ä´Ô ¬ µ w.r.t. Ô´ µ equal to zero. Ì ¼ ½ · ÐÓ Ô´ µ · Ì Ö ´ Ú Ú µ ¬ µ Ô´ µ ¬ ½ Ì Ö´ Ú ÚÌ µ Plugging solution È Ô thisThe dual to Ä Ô is¬ towe get ¬ . problem maximize È Ì Ö Ú Ú . ¬ , yielding ¬ ´ ´ µ ´ µ ½ Now note that Ì Ö´ ÐÓ Ú ÚÌ µ ÚÌ Ú ´ µ ´ ̵ ´ µ Ì Ö´ ¬ µ. È Ô¼ Ú ÚÌ ℄µ · Ȭ We can do this analytically w.r.t. , so we can write Ý Ý ÐÓ which is minus our original target function. Since ´ µ should be maximized, and we have the desired duality result (identifying with ). ¼ » µ ¼ when Ý Ý . For a given point Ü , all the points in its class satisfy Ô´ µ ½. Due to the structure of Ô´ µ in Equation 2, and because it is obeyed for all points in ܼ × class, this implies that all the points in that class are equidistant from each other. However, it is easy to show that the maximum number of different equidistant points (also known as the equilateral dimension [1]) in Ö dimensions is Ö · ½. Since by assumption we have at least Ö · ¾ points in the class of Ü , and maps points into Ö , it follows that all points are identical. 2 À Ô¼ ´ µ℄. Up to an additive constant 3 We consider the equivalent problem of minimizing minus entropy. Ô¼ ´ È 2.1.1 Relation to covariance based and embedding methods The convex dual derived above reveals an interesting relation to covariance based learning methods. The sufficient statistics used by the algorithm are a set of Ò “spread” matrices. Each matrix is of the form Ô¼ ´ µ Ú ÚÌ ℄. The algorithm tries to find a maximum entropy distribution which matches these matrices when averaged over the sample. This should be contrasted with the covariance matrices used in metric learning such as Fisher’s Discriminant Analysis. The latter uses the within and between class covariance matrices. The within covariance matrix is similar to the covariance matrix used here, but is calculated with respect to the class means, whereas here it is calculated separately for every point, and is centered on this point. This highlights the fact that MCML is not based on Gaussian assumptions where it is indeed sufficient to calculate a single class covariance. Our method can also be thought of as a supervised version of the Stochastic Neighbour Embedding algorithm [7] in which the “target” distribution is Ô¼ (determined by the class labels) and the embedding points are not completely free but are instead constrained to be of the form Ï Ü . 2.2 Optimizing the Convex Objective Since the optimization problem in Equation 4 is convex, it is guaranteed to have only a single minimum which is the globally optimal solution4 . It can be optimized using any appropriate numerical convex optimization machinery; all methods will yield the same solution although some may be faster than others. One standard approach is to use interior point Newton methods. However, these algorithms require the Hessian to be calculated, which would require Ç´ µ resources, and could be prohibitive in our case. Instead, we have experimented with using a first order gradient method, specifically the projected gradient approach as in [10]. At each iteration we take a small step in the direction of the negative gradient of the objective function5, followed by a projection back onto the PSD cone. This projection is performed simply by taking the eigen-decomposition of and removing the components with negative eigenvalues. The algorithm is summarized below: Ý µ, Ò Input: Set of labeled data points ´Ü Output: PSD metric which optimally collapses classes. ½ Initialization: Initialize ¼ to some PSD matrix (randomly or using some initialization heuristic). Iterate: È Ø ¯ ÔØ where Ü Ô Ü ¯ Calculate the eigen-decomposition of È È Ù ÙÌ , then set Ø Ø Ø ¯ Set Ø·½ ´ µ ´ ´ ¼´ µ µ ´ µµ´ µ´Ü Ü µÌ ·½ ·½ ·½ Ñ Ü´ ¼µÙ ÙÌ Of course in principle it is possible to optimize over the dual instead of the primal but in our case, if the training data consists of Ò points in Ö-dimensional space then the primal has only Ç´Ö¾ ¾µ variables while the dual has Ç´Ò¾ µ so it will almost always be more efficient to operate on the primal directly. One exception to this case may be the kernel version (Section 4) where the primal is also of size Ç´Ò¾ µ. 4 When the data can be exactly collapsed into single class points, there will be multiple solutions at infinity. However, this is very unlikely to happen in real data. 5 In the experiments, we used an Armijo like step size rule, as described in [3]. 3 Low Dimensional Projections for Feature Extraction The Mahalanobis distance under a metric can be interpreted as a linear projection of the original inputs by the square root of , followed by Euclidean distance in the projected space. Matrices which have less than full rank correspond to Mahalanobis distances based on low dimensional projections. Such metrics and the induced distances can be advantageous for several reasons [5]. First, low dimensional projections can substantially reduce the storage and computational requirements of a supervised method since only the projections of the training points must be stored and the manipulations at test time all occur in the lower dimensional feature space. Second, low dimensional projections re-represent the inputs, allowing for a supervised embedding or visualization of the original data. If we consider matrices with rank at most Õ , we can always represent them in the form Ï Ì Ï for some projection matrix Ï of size Õ ¢ Ö. This corresponds to projecting the original data into a Õ -dimensional space specified by the rows of Ï . However, rank constraints on a matrix are not convex [4], and hence the rank constrained problem is not convex and is likely to have local minima which make the optimization difficult and illdefined since it becomes sensitive to initial conditions and choice of optimization method. Luckily, there is an alternative approach to obtaining low dimensional projections, which does specify a unique solution by sequentially solving two globally tractable problems. This is the approach we follow here. First we solve for a (potentially) full rank metric using the convex program outlined above, and then obtain a low rank projection from it via spectral decomposition. This is done by diagonalizing into the form Ö Ú ÚÌ where ½ and Ú are the corre¾ Ö are eigenvalues of ½ sponding eigenvectors. To obtain a low rank projection we constrain the sum above to Õ include only the Õ terms corresponding to the Õ largest eigenvalues: Õ Ú ÚÌ . ½ The resulting projection is uniquely defined (up to an irrelevant unitary transformation) as Ô Ì Ì Ï ´ ÚÕ ℄. ½ Õ µ Ú½ È È Ô In general, the projection returned by this approach is not guaranteed to be the same as the projection corresponding to minimizing our objective function subject to a rank constraint on unless the optimal metric is of rank less than or equal to Õ . However, as we show in the experimental results, it is often the case that for practical problems the optimal has an eigen-spectrum which is rapidly decaying, so that many of its eigenvalues are indeed very small, suggesting the low rank solution will be close to optimal. 4 Learning Metrics with Kernels It is interesting to consider the case where Ü are mapped into a high dimensional feature space ´Ü µ and a Mahalanobis distance is sought in this space. We focus on the case where dot products in the feature space may be expressed via a kernel function, such that ´Ü µ ¡ ´Ü µ ´Ü Ü µ for some kernel . We now show how our method can be changed to accommodate this setting, so that optimization depends only on dot products. Consider the regularized target function: Ê ´ µ ÃÄ Ô¼ ´ µ Ô´ µ℄ · Ì Ö´ µ (6) where the regularizing factor is equivalent to the Frobenius norm of the projection matrix Ï since Ì Ö´ µ Ï ¾. Deriving w.r.t. Ï we obtain Ï Í , where Í is some matrix which specifies Ï as a linear combination of sample points, and the Ø row of the matrix Ì Í Ì Í . Defining the PSD matrix is Ü . Thus is given by Í Ì Í , we can recast our optimization as looking for a PSD matrix , where the Mahalanobis distance is ´Ü Ü µÌ Ì ´Ü Ü µ ´ µÌ ´ µ, where we define Ü. This is exactly our original distance, with Ü replaced by , which depends only on dot products in space. The regularization term also depends solely on the dot products since Ì Ö´ µ Ì Ö´ Ì µ Ì Ö´ Ì µ Ì Ö´Ã µ, where à is the kernel matrix given Ì . Note that the trace is a linear function of , keeping the problem convex. by à Thus, as long as dot products can be represented via kernels, the optimization can be carried out without explicitly using the high dimensional space. To obtain a low dimensional solution, we follow the approach in Section 3: obtain a decomposition Î Ì Î 6 , and take the projection matrix to be the first Õ rows of ¼ Î . Ì , and thus Ì Ì As a first step, we calculate a matrix such that . Since is a correlation matrix for the rows of it can be shown (as in Kernel PCA) that its (left) eigenvectors are linear combinations of the rows of . Denoting by Î « the eigenvector matrix, we obtain, after some algebra, that « Ã Ì «. We conclude that « is an eigenvector of the matrix Ã Ì . Denote by « the matrix whose rows are orthonormal eigenvectors of Ã Ì . Then Î can be shown to be orthonormal if we set ¼ « . The final projection will then be ¼ Î Ü « . Low dimensional Î projections will be obtained by keeping only the first Õ components of this projection. 5 Experimental Results We compared our method to several metric learning algorithms on a supervised classification task. Training data was first used to learn a metric over the input space. Then this metric was used in a 1-nearest-neighbor algorithm to classify a test set. The datasets we investigated were taken from the UCI repository and have been used previously in evaluating supervised methods for metric learning [10, 5]. To these we added the USPS handwritten digits (downsampled to 8x8 pixels) and the YALE faces [2] (downsampled to 31x22). The algorithms used in the comparative evaluation were ¯ ¯ ¯ Fisher’s Linear Discriminant Analysis (LDA), which projects on the eigenvectors of ËϽ Ë where ËÏ Ë are the within and between class covariance matrices. The method of Xing et al [10] which minimizes the mean within class distance, while keeping the mean between class distance larger than one. Principal Component Analysis (PCA). There are several possibilities for scaling the PCA projections. We tested several, and report results of the empirically superior one (PCAW), which scales the projection components so that the covariance matrix after projection is the identity. PCAW often performs poorly on high dimensions, but globally outperforms all other variants. We also evaluated the kernel version of MCML with an RBF kernel (denoted by KMCML)7 . Since all methods allow projections to lower dimensions we compared performance for different projection dimensions 8 . The out-of sample performance results (based on 40 random splits of the data taking 70% for training and 30% for testing9 ) are shown in Figure 1. It can be seen that when used in a simple nearest-neighbour classifier, the metric learned by MCML almost always performs as well as, or significantly better than those learned by all other methods, across most dimensions. Furthermore, the kernel version of MCML outperforms the linear one on most datasets. Where Î is orthonormal, and the eigenvalues in are sorted in decreasing order. The regularization parameter and the width of the RBF kernel were chosen using 5 fold crossvalidation. KMCML was only evaluated for datasets with less than 1000 training points. 8 To obtain low dimensional mappings we used the approach outlined in Section 3. 9 Except for the larger datasets where 1000 random samples were used for training. 6 7 Wine 0.2 0.1 2 4 6 8 10 Projection Dimension 0.5 Error Rate Error Rate Balance Ion MCML PCAW LDA XING KMCML 0.3 0.25 0.4 0.2 0.15 0.3 0.1 0.2 0.05 10 20 30 Projection Dimension Soybean−small 0.1 1 2 3 Protein 4 Spam 0.6 0.4 0.25 0.5 0.4 0 5 10 15 0.2 0.3 0.2 0.15 20 5 10 Yale7 15 20 0.1 10 20 30 40 50 40 50 Digits Housing 0.6 0.4 0.4 0.3 0.4 0.35 0.2 0.3 0.2 0.1 0.25 10 20 30 40 50 5 10 15 10 20 30 Figure 1: Classification error rate on several UCI datasets, USPS digits and YALE faces, for different projection dimensions. Algorithms are our Maximally Collapsing Metric Learning (MCML), Xing et.al.[10], PCA with whitening transformation (PCAW) and Fisher’s Discriminant Analysis (LDA). Standard errors of the means shown on curves. No results given for XING on YALE and KMCML on Digits and Spam due to the data size. 5.1 Comparison to non convex procedures The methods in the previous comparison are all well defined, in the sense that they are not susceptible to local minima in the optimization. They also have the added advantage of obtaining projections to all dimensions using one optimization run. Below, we also compare the MCML results to the results of two non-convex procedures. The first is the Non Convex variant of MCML (NMCML): The objective function of MCML can be optimized w.r.t the projection matrix Ï , where Ï Ì Ï . Although this is no longer a convex problem, it is not constrained and is thus easier to optimize. The second non convex method is Neighbourhood Components Analysis (NCA) [5], which attempts to directly minimize the error incurred by a nearest neighbor classifier. For both methods we optimized the matrix Ï by restarting the optimization separately for each size of Ï . Minimization was performed using a conjugate gradient algorithm, initialized by LDA or randomly. Figure 2 shows results on a subset of the UCI datasets. It can be seen that the performance of NMCML is similar to that of MCML, although it is less stable, possibly due to local minima, and both methods usually outperform NCA. The inset in each figure shows the spectrum of the MCML matrix , revealing that it often drops quickly after a few dimensions. This illustrates the effectiveness of our two stage optimization procedure, and suggests its low dimensional solutions are close to optimal. 6 Discussion and Extensions We have presented an algorithm for learning maximally collapsing metrics (MCML), based on the intuition of collapsing classes into single points. MCML assumes that each class Balance Error Rate Wine MCML NMCML NCA 0.2 Soybean 0.4 0.1 0.3 0.1 0.05 0.2 0 2 4 6 8 10 Projection Dimension 0.1 1 0 2 Protein 3 4 5 10 Ion 15 20 Housing 0.4 0.6 0.3 0.5 0.35 0.2 0.4 0.3 0.3 0.25 5 10 15 20 0.1 10 20 30 5 10 15 Figure 2: Classification error for non convex procedures, and the MCML method. Eigen-spectra for the MCML solution are shown in the inset. may be collapsed to a single point, at least approximately, and thus is only suitable for unimodal class distributions (or for simply connected sets if kernelization is used). However, if points belonging to a single class appear in several disconnected clusters in input (or feature) space, it is unlikely that MCML could collapse the class into a single point. It is possible that using a mixture of distributions, an EM-like algorithm can be constructed to accommodate this scenario. The method can also be used to learn low dimensional projections of the input space. We showed that it performs well, even across a range of projection dimensions, and consistently outperforms existing methods. Finally, we have shown how the method can be extended to projections in high dimensional feature spaces using the kernel trick. The resulting nonlinear method was shown to improve classification results over the linear version. References Ò [1] N. Alon and P. Pudlak. Equilateral sets in ÐÔ . Geom. Funct. Anal., 13(3), 2003. [2] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In ECCV (1), 1996. [3] D.P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transaction on Automatic Control, 21(2):174–184, 1976. [4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004. [5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004. [6] T. Hastie, R. Tibshirani, and J.H. Friedman. The elements of statistical learning: data mining, inference, and prediction. New York: Springer-Verlag, 2001. [7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), 2002. [8] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2001. [9] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proc. of ECCV, 2002. [10] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2004.
3 0.76126945 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
4 0.60344714 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul
Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1
5 0.59045267 184 nips-2005-Structured Prediction via the Extragradient Method
Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1
6 0.58008283 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
7 0.56507784 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
8 0.56396806 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
9 0.56021827 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
10 0.55986899 45 nips-2005-Conditional Visual Tracking in Kernel Space
11 0.55854601 106 nips-2005-Large-scale biophysical parameter estimation in single neurons via constrained linear regression
12 0.5577997 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection
13 0.55487829 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise
14 0.55082619 114 nips-2005-Learning Rankings via Convex Hull Separation
15 0.5493077 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
16 0.54882628 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
17 0.54861647 34 nips-2005-Bayesian Surprise Attracts Human Attention
18 0.5481723 50 nips-2005-Convex Neural Networks
19 0.54674268 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects
20 0.54207933 177 nips-2005-Size Regularized Cut for Data Clustering