nips2003-166 — Reconstructing MEG Sources with Unknown Correlations (knowledge-graph by maker-knowledge-mining)
Source: pdf
Author: Maneesh Sahani, Srikantan S. Nagarajan
Abstract: Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. [sent-8, score-1.659]
2 However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. [sent-9, score-0.225]
3 Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. [sent-10, score-0.319]
4 Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. [sent-11, score-1.6]
5 1 Introduction The brain’s neuronal activity generates weak magnetic fields (10 fT – 1 pT). [sent-12, score-0.225]
6 Magnetoencephalography (MEG) is a non-invasive technique for detecting and characterising these magnetic fields. [sent-13, score-0.148]
7 MEG sensors use super-conducting quantum interference devices (SQUIDs) to measure the changes in the brain’s magnetic field on a millisecond time-scale. [sent-14, score-0.229]
8 When combined with electromagnetic source localisation, magnetic source imaging (MSI) becomes a functional brain imaging method that allows us to characterise macroscopic dynamic neural information processing. [sent-15, score-1.41]
9 In the past decade, the development of MSI source reconstruction algorithms has progressed significantly [1]. [sent-16, score-0.502]
10 With parametric methods, a few current dipoles of unknown location and moment are assumed to represent the sources of activity in the brain. [sent-18, score-0.472]
11 In this case, solving the inverse problem requires a non-linear optimisation to estimate the position and magnitude of an unknown number of dipoles. [sent-19, score-0.164]
12 With imaging methods, a grid of voxels is used to represent the entire brain volume. [sent-20, score-0.342]
13 The inverse problem is then to recover whole-brain activation images, represented by the time-dependent moment and magnitude of an elementary dipole source located at each voxel. [sent-21, score-0.912]
14 However, the ill-posed nature of the problem leads to non-unique solutions which must be distinguished by prior information, usually in the form of assumptions regarding the correlation between the sources. [sent-23, score-0.216]
15 Our formulation makes no assumptions about the correlation of the sources; instead, we estimate the extent of the correlation by an evidence optimisation procedure within a variational Bayesian framework [3]. [sent-25, score-0.56]
16 1 MEG imaging Many standard MEG devices measure the radial gradient of the magnetic field at a number, db , of sensor locations (typically arranged on a segment of a sphere). [sent-27, score-0.521]
17 Measurements made at a single time can be formed into a db -dimensional vector b; an experiment yields a series of N such samples, giving a db × N data matrix B. [sent-28, score-0.164]
18 The component we seek to isolate is stimulus- or event-related, and is presumably contributed to by significant activity at a relatively small number of locations in the brain. [sent-30, score-0.217]
19 This signal is corrupted by thermal noise at the sensors, and by widespread spontaneous, unrelated brain activity. [sent-31, score-0.314]
20 For our purposes, these are both sources of noise, whose distributions are approximately normal [2] (in the case of the unrelated brain activity, the normality results from the fact that any one sensor sees the sum of effects from a large number of locations). [sent-32, score-0.639]
21 The covariance matrix of this noise, Ψ, can be measured approximately by accumulating sensor readings in a quiescent state; simulations suggest that the techniques presented here are reasonably tolerant to mis-estimation of the noise level. [sent-33, score-0.249]
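For concreteness, the quiescent-state estimate of Ψ might be computed roughly as in the following minimal numpy sketch; the function name and the `quiet` array (pre-stimulus sensor samples) are hypothetical, not taken from the paper.

    import numpy as np

    def estimate_noise_covariance(quiet):
        # quiet: d_b x N array of quiescent-state sensor readings (hypothetical input)
        quiet = quiet - quiet.mean(axis=1, keepdims=True)   # remove per-sensor mean
        return quiet @ quiet.T / quiet.shape[1]             # empirical covariance, d_b x d_b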
22 Measurements are also affected by other forms of interference associated with experimental electronics or bio-magnetic activity external to the brain. [sent-34, score-0.158]
23 We will not here treat such interference explicitly, instead assuming that major sources have been removed by preprocessing the measured data. [sent-35, score-0.325]
24 For simplicity, we assume a spherical volume conductor model, which permits analytical calculation of L independent of the tissue conductivity [2], and which is reasonably accurate for most brain regions [1]. [sent-39, score-0.199]
25 (Non-uniform volume conduction properties of the brain and surrounding tissues can be explicitly accounted for by elaborating the lead-field matrix calculation, but they do not otherwise affect the analysis presented below. [sent-40, score-0.263]
26 ) In the simple model, only the two tangential components of the current dipole which fall orthogonal to the radial direction contribute to b, and so the source vector s has a dimension ds which is twice the number of voxels dv . [sent-41, score-0.884]
27 The source matrix S associated with the N field measurements has dimensions ds × N . [sent-42, score-0.668]
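To make the dimensions concrete, here is a minimal numpy sketch of the forward model b = Ls + n with these shapes. Only the voxel and sensor counts are borrowed from the simulations described later; the lead-field values and noise level are placeholders, not a real head model.

    import numpy as np

    rng = np.random.default_rng(0)
    d_v = 956            # candidate voxels (1 cm hemispherical grid, as in the simulations below)
    d_s = 2 * d_v        # two tangential dipole components per voxel
    d_b = 74             # sensors, e.g. two arrays of 37 gradiometer coils
    N = 1000             # time samples

    L = rng.standard_normal((d_b, d_s))    # placeholder lead-field matrix
    S = np.zeros((d_s, N))                 # d_s x N source matrix
    Psi = 0.01 * np.eye(d_b)               # placeholder sensor-noise covariance
    noise = np.linalg.cholesky(Psi) @ rng.standard_normal((d_b, N))
    B = L @ S + noise                      # d_b x N data matrix, b = L s + n at each sample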
28 Instead, attempts at source recovery depend, either implicitly or explicitly, on the application of prior knowledge about the source distribution. [sent-44, score-0.929]
29 Most existing methods constrain the source locations and/or activities in various ways: based on anatomical or fMRI data; by maximum entropy, minimum L1 norm, weighted-minimum L2 norm or maximum smoothness priors; or to achieve optimal resolution [1]. [sent-45, score-0.576]
30 Most of these constraints can be formulated as priors for maximum a posteriori estimation of the sources (although the original statements do not always make such priors explicit). [sent-46, score-0.336]
31 In addition, some studies have also included temporal constraints on sources such as smoothness or phase-locking between sources [5]. [sent-47, score-0.564]
32 The optimal estimate (in a least-squares sense) is given by the Wiener filter: F = ⟨bb'⟩⁻¹⟨bs'⟩ = ⟨bb'⟩⁻¹⟨(Ls + n)s'⟩ = ⟨bb'⟩⁻¹ L⟨ss'⟩, (2) where n ∼ N(0, Ψ) is a noise vector uncorrelated with s; the filter therefore requires knowledge of the source correlation matrix ⟨ss'⟩. [sent-49, score-1.294]
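A short sketch of how the Wiener filter of (2) could be applied, assuming the lead field L, the noise covariance Ψ and a known source correlation matrix ⟨ss'⟩ are available; the function and variable names are ours, not the paper's.

    import numpy as np

    def wiener_reconstruct(B, L, Psi, C_ss):
        # Linear least-squares source estimate s_hat = C_ss L' (L C_ss L' + Psi)^{-1} b,
        # i.e. equation (2) with <bb'> = L C_ss L' + Psi and <bs'> = L C_ss for b = L s + n.
        C_bb = L @ C_ss @ L.T + Psi                # model sensor correlation <bb'>
        F = np.linalg.solve(C_bb, L @ C_ss)        # F = <bb'>^{-1} L <ss'>, shape d_b x d_s
        return F.T @ B                             # d_s x N estimated source time-courses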
33 If the orientation of each source dipole is known or estimated independently (so that s contains only one magnitude at each location), then the source correlation matrix ⟨ss'⟩ reduces to a diagonal matrix of gain factors. [sent-51, score-1.502]
34 For the beamformer, these factors are chosen to give a unit “loop gain” for each source i. [sent-52, score-0.431]
35 It can be shown that the beamformer only yields accurate results when the number of active sources is few [7]. [sent-55, score-0.554]
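For comparison, a textbook scalar minimum-variance beamformer with the unit "loop gain" normalisation mentioned above could be sketched as follows; this is not necessarily the exact vector beamformer of [15] used later in the simulations, and the function name is ours.

    import numpy as np

    def unit_gain_beamformer(B, L):
        # Scalar minimum-variance weights: w_i = C^{-1} l_i / (l_i' C^{-1} l_i),
        # where C is the measured sensor covariance and l_i the i-th column of the lead field.
        C = np.cov(B)                              # measured data covariance, d_b x d_b
        Cinv_L = np.linalg.solve(C, L)             # C^{-1} L
        gains = np.sum(L * Cinv_L, axis=0)         # l_i' C^{-1} l_i for each source i
        W_bf = Cinv_L / gains                      # unit loop gain: w_i' l_i = 1
        return W_bf.T @ B                          # beamformer source estimates, d_s x N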
36 A related algorithm using Multiple Signal Classification (MUSIC) also assumes sparsity and linear independence in the time-series of the sources [1]. [sent-58, score-0.327]
37 Minimum-norm methods can also be viewed as making specific assumptions about the source correlation matrix [8]. [sent-59, score-0.711]
38 In this paper, we present a novel approach to source reconstruction. [sent-63, score-0.464]
39 Our technique shares with many of the methods described above the assumption of sparsity in source activation. [sent-64, score-0.514]
40 However, it dispenses entirely with the assumption of source independence. [sent-65, score-0.431]
41 Instead, we estimate the source correlation matrix from the data by hyperparameter optimisation. [sent-66, score-0.747]
42 1 The sources are not really expected to have the Gaussian amplitude distribution that this construction implies. [sent-68, score-0.34]
43 Instead, the assumption forms a convenient fiction, making it easy to estimate the source correlation matrix. [sent-69, score-0.62]
44 We show in simulations below that estimation in this framework can indeed yield accurate estimates of the correlation matrix even for non-normally distributed source activity. [sent-70, score-0.724]
45 Once the correlation matrix has been established, estimation using the Wiener filter of (2) provides the best linear estimate of source activity (and would be the exact maximum a posteriori estimate if the sources really were normally distributed). [sent-71, score-1.081]
46 1 This formulation is similar to that used in weighted minimum-norm methods, although there the weights W are fixed, implying a pre-determined source correlation matrix. [sent-72, score-0.62]
47 The model of (3) parameterises the source correlation in a general way, subject to a maximum rank of dz . [sent-73, score-0.77]
48 This rank constraint does not by itself favour sparsity in the source distribution, and could easily be chosen to be equal to ds . [sent-74, score-0.6]
49 Instead, the sparsity emerges from a hyperparameter optimisation similar to the automatic relevance determination (ARD) of MacKay and Neal [11] (see also [12, 13]). [sent-75, score-0.236]
50 We now add a hyperprior on W under which the expected power of both tangential components at the vth voxel is determined by a hyperparameter αv . [sent-77, score-0.262]
51 For notational convenience we collect the αv into a vector α and introduce a ds × dv indicator matrix J, with Jiv = 1 if the ith source is located in the vth voxel and 0 otherwise. [sent-78, score-0.863]
52 Thus, each column of J contains exactly two unit entries, one for each tangential component of the corresponding voxel dipole. [sent-79, score-0.14]
53 Finally, we introduce a ds × ds diagonal matrix A with Aii = (Jα)i . [sent-80, score-0.312]
54 Equation (4) gives the resulting prior on W: each row corresponding to a source component in the vth voxel receives an independent zero-mean Gaussian prior with precision Aii. Thus each αv sets a prior distribution on the length of the two rows in the weight matrix corresponding to source components at the vth voxel. [sent-82, score-0.554]
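A small sketch of how J and A can be constructed, assuming (purely for illustration) that sources 2v and 2v+1 are the two tangential components of voxel v; the function name is ours.

    import numpy as np

    def build_J_and_A(alpha):
        # alpha: length-d_v vector of per-voxel hyperparameters.
        # J is the d_s x d_v indicator matrix with J[i, v] = 1 if source i lies in voxel v;
        # A is diagonal with A_ii = (J alpha)_i.
        alpha = np.asarray(alpha, dtype=float)
        d_v = alpha.size
        J = np.zeros((2 * d_v, d_v))
        J[2 * np.arange(d_v), np.arange(d_v)] = 1.0       # first tangential component
        J[2 * np.arange(d_v) + 1, np.arange(d_v)] = 1.0   # second tangential component
        A = np.diag(J @ alpha)                            # A_ii = alpha of the voxel holding source i
        return J, A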
55 As in the original ARD models, optimisation of the marginal likelihood or evidence, P (B | α, L, Ψ), with respect to the αv results in a number of the hyperparameters diverging to infinity. [sent-83, score-0.169]
56 This imposes a zero-centred delta-function prior on the corresponding row of W , in turn forcing the corresponding source power to vanish. [sent-84, score-0.431]
57 As it is, the optimisation is only approximate, but has been found to yield good maxima in a factor analysis model very similar to the one we consider here [12]. [sent-92, score-0.128]
58 where the normal distribution on Z implies a normal distribution on each column z; the distribution on W is normal on vec(W) (see footnote 2); 1 is a vector of ones; and the diag[·] operator returns the main diagonal of its argument as a vector. [sent-95, score-0.219]
59 It has the additional benefit of simplifying both the notational and computational complexities of the updates (for the latter, it reduces the complexity of the inversion needed to calculate Σw from (ds dz )3 to d2 ). [sent-98, score-0.179]
60 The use of the matrix inversion lemma in (6b) exploits the diagonality of A to reduce the computational complexity of the algorithm with respect to ds. [sent-100, score-0.188]
61 The formulae of (6) are easily implemented and recover an estimate of W , and thus the source correlation matrix, by iteration. [sent-101, score-0.62]
62 The source activities can then be estimated by use of the Wiener filter (2). [sent-102, score-0.474]
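The final reconstruction step might therefore look like the following sketch, which forms the implied source correlation WW' (assuming unit-variance pre-sources z, so that ⟨ss'⟩ = WW') and applies the Wiener filter of (2); the update equations (6) themselves are not reproduced here, and the function name is ours.

    import numpy as np

    def reconstruct_from_W(B, L, Psi, W):
        # Rows of W whose hyperparameters diverged have been driven to zero and contribute nothing.
        C_ss = W @ W.T                                    # implied <ss'>, rank <= d_z
        C_bb = L @ C_ss @ L.T + Psi                       # model sensor correlation
        S_hat = C_ss @ L.T @ np.linalg.solve(C_bb, B)     # Wiener filter of equation (2)
        return S_hat, C_ss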
63 Note that the measured data enter into the estimation procedure only through their correlation BB'. [sent-104, score-0.189]
64 In other words, the hyperparameter optimisation stage of our algorithm is only being used to model the data correlation, not their amplitudes. [sent-105, score-0.191]
65 As a result, the effects of incorrectly assuming a Gaussian source amplitude distribution can be expected to remain relatively benign. [sent-106, score-0.489]
66 4 Simulations Simulation studies provide an important tool for evaluating source recovery algorithms, in that they provide “sensor” data sets for which the correct answer (i.e. the generating source configuration) is known. [sent-107, score-0.524]
67 1 Methods We simulated 100 1-s-long epochs of evoked response data. [sent-112, score-0.174]
68 The sensor configuration was taken from a real experiment: two sensor arrays, with 37 gradiometer coils each. (Footnote 2 concerns the vec operator and the Kronecker product ⊗.) [sent-113, score-0.28]
69 Candidate source dipoles were located on a grid with 1 cm spacing within a hemispherical brain volume with a radius of 8 cm, to give a total of 956 possible source locations. [sent-116, score-1.165]
70 Significant (above background) evoked activity was simulated at 5 of these locations (see figure 1a), with random dipole orientations. [sent-117, score-0.588]
71 The evoked waveforms were similar in form to the evoked responses seen in many areas of the brain (see figure 2a), and were strongly correlated between the five sites (figure 3a). [sent-118, score-0.484]
72 The dipole orientation at each site was chosen randomly in the plane parallel to the sensor tangent. [sent-122, score-0.337]
73 Note that the amplitude distribution of these sources is strongly non-Gaussian; we will see, however, that they can be recovered successfully by the present technique despite its assumption of normality. [sent-123, score-0.378]
74 The simulated sensor recordings were corrupted by noise from two sources, both with Gaussian distribution. [sent-124, score-0.28]
75 Background activity in the brain was simulated with equal power at every point on the grid of candidate sources, with a root-mean-square (RMS) amplitude 1. [sent-125, score-0.439]
76 Although this background activity was uncorrelated between brain locations, it resulted in correlated disturbances at the magnetic sensors. [sent-127, score-0.541]
77 Thermal noise in the sensors was uncorrelated, and had a similar magnitude (at the sensors) to that of the background noise. [sent-128, score-0.153]
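The two noise sources could be generated along these lines (a sketch; the RMS levels are placeholders rather than the exact values used in the simulations, and the function name is ours).

    import numpy as np

    def simulate_noise(L, N, background_rms=1.0, thermal_rms=1.0, seed=0):
        # Background brain activity: equal power at every candidate grid point, uncorrelated
        # across locations, but correlated at the sensors after projection through L.
        # Thermal noise: uncorrelated across sensors.
        rng = np.random.default_rng(seed)
        d_b, d_s = L.shape
        background = background_rms * rng.standard_normal((d_s, N))
        thermal = thermal_rms * rng.standard_normal((d_b, N))
        return L @ background + thermal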
78 The novel Bayesian estimation technique was applied to the raw simulated sensor trace rather than to epoch-averaged data. [sent-129, score-0.279]
79 While in this simulation the evoked activity was identical in each trial, determining the correlation matrix from unaveraged data should, in the more general case, make single-trial reconstructions more accurate. [sent-130, score-0.49]
80 Once reconstructed, the source timecourses were averaged, and are shown in figure 2. [sent-131, score-0.476]
81 The number of pre-sources dz, a free parameter in the algorithm, was set to 10. [sent-132, score-0.15]
82 For comparison, we also reconstructed sources using the vector minimum-variance adaptive beamformer approach [15]. [sent-134, score-0.659]
83 Note that this technique, along with many other existing reconstruction methods, assumes that sources at different locations are uncorrelated and so it should not be expected to perform well under the conditions of our simulation. [sent-135, score-0.529]
84 2 Results Figure 1 shows the source locations and powers reconstructed by the novel Bayesian approach developed here (b) and by the beamformer (c). [sent-137, score-0.97]
85 The Bayesian approach identified the correct number of sources, at the correct locations and with approximately correct relative powers. [sent-138, score-0.18]
86 By contrast, the beamformer approach, which assumes uncorrelated sources, entirely failed to locate the sources of activity. [sent-139, score-0.628]
87 Figure 2b shows the average evoked-response reconstruction at each of the identified source locations (with the simulated waveforms shown in panel a). [sent-140, score-0.776]
88 The time-courses estimated by the vector beamformer are shown in figure 2c. [sent-142, score-0.272]
89 As beamformer localisation proved to be unreliable, the time-courses shown are the reconstructions at the positions of the correct (simulated) sources. [sent-143, score-0.377]
90 Nonetheless, the strong correlations in the sources have corrupted the reconstructions. [sent-144, score-0.399]
91 Note that the only difference between the time-courses shown in figure 2b and c is premultiplication by the estimated source correlation matrix in b. [sent-145, score-0.684]
92 Finally, figure 3 shows the correlation coefficient matrices for the dipole amplitude timecourses of the active sources shown in figure 2. [sent-146, score-0.771]
93 We see that the Bayesian approach finds a reasonable approximation to the correct correlation structure. [sent-147, score-0.215]
94 Again, however, the beamformer is unable to accurately characterise the correlation matrix. [sent-148, score-0.494]
95 The two traces for each location show the dipole components in two orthogonal directions. [sent-153, score-0.233]
96 Correlations were computed between epochaveraged dipole amplitude time-courses at each location. [sent-155, score-0.255]
97 (a) simulated sources; (b) sources reconstructed by our novel algorithm; (c) sources reconstructed by beamforming. [sent-157, score-0.902]
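The correlation matrices of figure 3 can be computed as in this sketch, where the dipole amplitude at a voxel is taken to be the norm of its two tangential components (our interpretation; the array and function names are hypothetical).

    import numpy as np

    def dipole_amplitude_correlations(S_avg, voxels):
        # S_avg: d_s x N epoch-averaged source estimate; voxels: indices of the active locations.
        # Sources 2v and 2v+1 are assumed to be the two tangential components of voxel v,
        # matching the ordering used in the earlier sketches.
        amps = [np.sqrt(S_avg[2 * v] ** 2 + S_avg[2 * v + 1] ** 2) for v in voxels]
        return np.corrcoef(np.asarray(amps))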
98 5 Conclusions We have demonstrated a novel evidence-optimisation approach to the location and reconstruction of dipole sources contributing to MEG measurements. [sent-158, score-0.619]
99 Unlike existing methods, this new technique does not assume a correlation structure for the sources, instead estimating it from the data. [sent-159, score-0.227]
100 As such, this approach holds great promise for high fidelity imaging of correlated magnetic activity in the brain. [sent-160, score-0.375]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations
Author: Maneesh Sahani, Srikantan S. Nagarajan
Abstract: Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. 1
2 0.26190323 182 nips-2003-Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron
Author: Sung C. Jun, Barak A. Pearlmutter
Abstract: We describe a system that localizes a single dipole to reasonable accuracy from noisy magnetoencephalographic (MEG) measurements in real time. At its core is a multilayer perceptron (MLP) trained to map sensor signals and head position to dipole location. Including head position overcomes the previous need to retrain the MLP for each subject and session. The training dataset was generated by mapping randomly chosen dipoles and head positions through an analytic model and adding noise from real MEG recordings. After training, a localization took 0.7 ms with an average error of 0.90 cm. A few iterations of a Levenberg-Marquardt routine using the MLP’s output as its initial guess took 15 ms and improved the accuracy to 0.53 cm, only slightly above the statistical limits on accuracy imposed by the noise. We applied these methods to localize single dipole sources from MEG components isolated by blind source separation and compared the estimated locations to those generated by standard manually-assisted commercial software. 1
3 0.2252634 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki
Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K−means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1 −norm is unique with probability of one, which can be obtained using a linear programming algorithm. The comparisons of the l1 −norm solution and the l0 −norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data using the wavelet packets transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1
4 0.15385117 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
Author: Brian J. Fischer, Charles H. Anderson
Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.
5 0.093887165 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
Author: David Barber, Felix V. Agakov
Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analagous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1
6 0.08172331 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
7 0.078866512 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects
8 0.078589223 5 nips-2003-A Classification-based Cocktail-party Processor
9 0.07070265 55 nips-2003-Distributed Optimization in Adaptive Networks
10 0.061296578 92 nips-2003-Information Bottleneck for Gaussian Variables
11 0.057822093 155 nips-2003-Perspectives on Sparse Bayesian Learning
12 0.056128848 73 nips-2003-Feature Selection in Clustering Problems
13 0.054405596 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring
14 0.05255552 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
15 0.05233065 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles
16 0.052159466 43 nips-2003-Bounded Invariance and the Formation of Place Fields
17 0.051545825 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
18 0.050988141 193 nips-2003-Variational Linear Response
19 0.050348591 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
20 0.050156508 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
topicId topicWeight
[(0, -0.182), (1, 0.012), (2, 0.108), (3, 0.023), (4, -0.066), (5, 0.085), (6, 0.223), (7, -0.077), (8, -0.116), (9, 0.076), (10, 0.089), (11, -0.061), (12, 0.151), (13, 0.063), (14, -0.237), (15, 0.084), (16, -0.206), (17, 0.027), (18, -0.019), (19, -0.048), (20, 0.182), (21, 0.082), (22, -0.044), (23, 0.108), (24, 0.083), (25, -0.13), (26, -0.172), (27, 0.095), (28, -0.011), (29, -0.083), (30, -0.008), (31, -0.165), (32, 0.066), (33, 0.101), (34, -0.044), (35, 0.086), (36, 0.041), (37, -0.025), (38, 0.172), (39, -0.05), (40, 0.07), (41, 0.004), (42, 0.081), (43, -0.084), (44, 0.025), (45, -0.007), (46, 0.037), (47, 0.068), (48, 0.01), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.98163593 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations
Author: Maneesh Sahani, Srikantan S. Nagarajan
Abstract: Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. 1
2 0.85993624 182 nips-2003-Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron
Author: Sung C. Jun, Barak A. Pearlmutter
Abstract: We describe a system that localizes a single dipole to reasonable accuracy from noisy magnetoencephalographic (MEG) measurements in real time. At its core is a multilayer perceptron (MLP) trained to map sensor signals and head position to dipole location. Including head position overcomes the previous need to retrain the MLP for each subject and session. The training dataset was generated by mapping randomly chosen dipoles and head positions through an analytic model and adding noise from real MEG recordings. After training, a localization took 0.7 ms with an average error of 0.90 cm. A few iterations of a Levenberg-Marquardt routine using the MLP’s output as its initial guess took 15 ms and improved the accuracy to 0.53 cm, only slightly above the statistical limits on accuracy imposed by the noise. We applied these methods to localize single dipole sources from MEG components isolated by blind source separation and compared the estimated locations to those generated by standard manually-assisted commercial software. 1
3 0.74794203 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki
Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K−means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1 −norm is unique with probability of one, which can be obtained using a linear programming algorithm. The comparisons of the l1 −norm solution and the l0 −norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data using the wavelet packets transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1
4 0.60522205 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
Author: Brian J. Fischer, Charles H. Anderson
Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.
5 0.40539467 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring
Author: Cynthia Archer, Todd K. Leen, António M. Baptista
Abstract: As part of an environmental observation and forecasting system, sensors deployed in the Columbia RIver Estuary (CORIE) gather information on physical dynamics and changes in estuary habitat. Of these, salinity sensors are particularly susceptible to biofouling, which gradually degrades sensor response and corrupts critical data. Automatic fault detectors have the capability to identify bio-fouling early and minimize data loss. Complicating the development of discriminatory classifiers is the scarcity of bio-fouling onset examples and the variability of the bio-fouling signature. To solve these problems, we take a novelty detection approach that incorporates a parameterized bio-fouling model. These detectors identify the occurrence of bio-fouling, and its onset time as reliably as human experts. Real-time detectors installed during the summer of 2001 produced no false alarms, yet detected all episodes of sensor degradation before the field staff scheduled these sensors for cleaning. From this initial deployment through February 2003, our bio-fouling detectors have essentially doubled the amount of useful data coming from the CORIE sensors. 1
6 0.35729018 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
7 0.34030965 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
8 0.30967394 55 nips-2003-Distributed Optimization in Adaptive Networks
9 0.30504239 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects
10 0.28922629 5 nips-2003-A Classification-based Cocktail-party Processor
11 0.28477141 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
12 0.27991664 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
13 0.27589038 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
14 0.26830846 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models
15 0.25657406 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples
16 0.25485572 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface
17 0.25110513 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class
18 0.23746015 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
19 0.23497057 92 nips-2003-Information Bottleneck for Gaussian Variables
20 0.22754158 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing
topicId topicWeight
[(0, 0.023), (11, 0.013), (30, 0.019), (35, 0.031), (53, 0.62), (71, 0.03), (76, 0.049), (85, 0.044), (91, 0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.99404705 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations
Author: Maneesh Sahani, Srikantan S. Nagarajan
Abstract: Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. 1
2 0.98921192 111 nips-2003-Learning the k in k-means
Author: Greg Hamerly, Charles Elkan
Abstract: When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity. 1 Introduction and related work Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated. Center-based clustering algorithms (in particular k-means and Gaussian expectationmaximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center. In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm 0.9 4 0.8 3 0.7 2 0.6 1 0.5 0 0.4 −1 0.3 −2 −3 0.2 0.1 −0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 −4 −3 −2 −1 0 1 2 3 Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster. is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity. Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. 
The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C) − p log n 2 where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data. With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition. 2 The Gaussian-means (G-means) algorithm The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯}). x 2: C ← kmeans(C, X). 3: Let {xi |class(xi ) = j} be the set of datapoints assigned to center cj . 4: Use a statistical test to detect if each {xi |class(xi ) = j} follow a Gaussian distribution (at confidence level α). 5: If the data look Gaussian, keep cj . Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added. to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints. 
2.1 Testing clusters for Gaussian fit To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0 : The data around the center are sampled from a Gaussian. • H1 : The data around the center are not sampled from a Gaussian. If we accept the null hypothesis H0 , then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1 , then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i) ), where F is the N (0, 1) cumulative distribution function. Then the statistic is A2 (Z) = − 1 n n (2i − 1) [log(zi ) + log(1 − zn+1−i )] − n (1) i=1 Stephens [17] showed that for the case where µ and σ are estimated from the data (as in clustering), we must correct the statistic according to A2 (Z) ∗ = A2 (Z)(1 + 4/n − 25/(n2 )) (2) Given a subset of data X in d dimensions that belongs to center c, the hypothesis test proceeds as follows: 1. Choose a significance level α for the test. 2. Initialize two centers, called “children” of c. See the text for good ways to do this. 3. Run k-means on these two centers in X. This can be run to completion, or to some early stopping point if desired. Let c1 , c2 be the child centers chosen by k-means. 4. Let v = c1 − c2 be a d-dimensional vector that connects the two centers. This is the direction that k-means believes to be important for clustering. Then project X onto v: xi = xi , v /||v||2 . X is a 1-dimensional representation of the data projected onto v. Transform X so that it has mean 0 and variance 1. 5. Let zi = F (x(i) ). If A2 (Z) is in the range of non-critical values at confidence ∗ level α, then accept H0 , keep the original center, and discard {c1 , c2 }. Otherwise, reject H0 and keep {c1 , c2 } in place of the original center. A primary contribution of this work is simplifying the test for Gaussian fit by projecting the data to one dimension where the test is simple to apply. The authors of [5] also use this approach for online dimensionality reduction during clustering. The one-dimensional representation of the data allows us to consider only the data along the direction that kmeans has found to be important for separating the data. This is related to the problem of projection pursuit [7], where here k-means searches for a direction in which the data appears non-Gaussian. We must choose the significance level of the test, α, which is the desired probability of making a Type I error (i.e. incorrectly rejecting H0 ). It is appropriate to use a Bonferroni adjustment to reduce the chance of making Type I errors over multiple tests. For example, if we want a 0.01 chance of making a Type I error in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme. In our tests, we always use α = 0.0001. We consider two ways to initialize the two child centers. Both approaches initialize with c ± m, where c is a center and m is chosen. 
The first method chooses m as a random d-dimensional vector such that ||m|| is small compared to the distortion of the data. A second method finds the main principal component s of the data (having eigenvalue λ), and chooses m = s 2λ/π. This deterministic method places the two centers in their expected locations under H0 . The principal component calculations require O(nd2 + d3 ) time and O(d2 ) space, but since we only want the main principal component, we can use fast methods like the power method, which takes time that is at most linear in the ratio of the two largest eigenvalues [4]. In this paper we use principal-component-based splitting. 2.2 An example Figure 2 shows a run of the G-means algorithm on a synthetic dataset with two true clusters and 1000 points, using α = 0.0001. The critical value for the Anderson-Darling test is 1.8692 for this confidence level. Starting with one center, after one iteration of G-means, we have 2 centers and the A2 statistic is 38.103. This is much larger than the critical value, ∗ so we reject H0 and accept this split. On the next iteration, we split each new center and repeat the statistical test. The A2 values for the two splits are 0.386 and 0.496, both of ∗ which are well below the critical value. Therefore we accept H0 for both tests, and discard these splits. Thus G-means gives a final answer of k = 2. 2.3 Statistical power Figure 3 shows the power of the Anderson-Darling test, as compared to the BIC. Lower is better for both plots. We run 1000 tests for each data point plotted for both plots. In the left 14 14 14 13 13 13 12 12 12 11 11 11 10 10 10 9 9 9 8 8 8 7 7 7 6 6 6 5 5 4 4 0 2 4 6 8 10 12 5 4 0 2 4 6 8 10 12 0 2 4 6 8 10 12 Figure 2: An example of running G-means for three iterations on a 2-dimensional dataset with two true clusters and 1000 points. Starting with one center (left plot), G-means splits into two centers (middle). The test for normality is significant, so G-means rejects H0 and keeps the split. After splitting each center again (right), the test values are not significant, so G-means accepts H0 for both tests and does not accept these splits. The middle plot is the G-means answer. See the text for further details. 1 1 G-means X-means 0.8 P(Type II error) P(Type I error) 0.8 G-means X-means 0.6 0.4 0.2 0.6 0.4 0.2 0 0 0 30 60 90 120 150 number of datapoints 180 210 0 30 60 90 120 150 number of datapoints 180 210 Figure 3: A comparison of the power of the Anderson-Darling test versus the BIC. For the AD test we fix the significance level (α = 0.0001), while the BIC’s significance level depends on n. The left plot shows the probability of incorrectly splitting (Type I error) one true 2-d cluster that is 5% elliptical. The right plot shows the probability of incorrectly not splitting two true clusters separated by 5σ (Type II error). Both plots are functions of n. Both plots show that the BIC overfits (splits clusters) when n is small. plot, for each test we generate n datapoints from a single true Gaussian distribution, and then plot the frequency with which BIC and G-means will choose k = 2 rather than k = 1 (i.e. commit a Type I error). BIC tends to overfit by choosing too many centers when the data is not strictly spherical, while G-means does not. This is consistent with the tests of real-world data in the next section. While G-means commits more Type II errors when n is small, this prevents it from overfitting the data. The BIC can be considered a likelihood ratio test, but with a significance level that cannot be fixed. 
The significance level instead varies depending on n and ∆k (the change in the number of model parameters between two models). As n or ∆k decrease, the significance level increases (the BIC becomes weaker as a statistical test) [10]. Figure 3 shows this effect for varying n. In [11] the authors show that penalty-based methods require problemspecific tuning and don’t generalize as well as other methods, such as cross validation. 3 Experiments Table 1 shows the results from running G-means and X-means on many large synthetic. On synthetic datasets with spherically distributed clusters, G-means and X-means do equally Table 1: Results for many synthetic datasets. We report distortion relative to the optimum distortion for the correct clustering (closer to one is better), and time is reported relative to k-means run with the correct k. For BIC, larger values are better, but it is clear that finding the correct clustering does not always coincide with finding a larger BIC. Items with a star are where X-means always chose the largest number of centers we allowed. dataset synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 synthetic k=5 synthetic k=20 synthetic k=80 d 2 k found 9.1± 9.9 18.1± 3.2 20.1± 0.6 70.5±11.6 80.0± 0.2 171.7±23.7 5.0± 0.0 *20.0± 0.0 20.0± 0.1 *80.0± 0.0 80.2± 0.5 229.2±36.8 5.0± 0.0 *20.0± 0.0 20.0± 0.0 *80.0± 0.0 80.0± 0.0 171.5±10.9 method G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means G-means X-means 2 2 8 8 8 32 32 32 BIC(×104 ) -0.19±2.70 0.70±0.93 0.21±0.18 14.83±3.50 1.84±0.12 40.16±6.59 -0.74±0.16 -2.28±0.20 -0.18±0.17 14.36±0.21 1.45±0.20 52.28±9.26 -3.36±0.21 -27.92±0.22 -2.73±0.22 -11.13±0.23 -1.10±0.16 11.78±2.74 distortion(× optimal) 0.89± 0.23 0.37± 0.12 0.99± 0.01 9.45±28.02 1.00± 0.01 48.49±70.04 1.00± 0.00 0.47± 0.03 0.99± 0.00 0.47± 0.01 0.99± 0.00 0.57± 0.06 1.00± 0.00 0.76± 0.00 1.00± 0.00 0.76± 0.01 1.00± 0.00 0.84± 0.01 7 7 6 6 5 5 4 4 3 3 2 2 1 time(× k-means) 13.2 2.8 2.1 1.2 2.2 1.8 4.6 11.0 2.6 4.0 2.9 6.5 4.4 29.9 2.3 21.2 2.8 53.3 1 0 0 2 4 6 8 10 12 0 0 2 4 6 8 10 12 Figure 4: 2-d synthetic dataset with 5 true clusters. On the left, G-means correctly chooses 5 centers and deals well with non-spherical data. On the right, the BIC causes X-means to overfit the data, choosing 20 unevenly distributed clusters. well at finding the correct k and maximizing the BIC statistic, so we don’t show these results here. Most real-world data is not spherical, however. The synthetic datasets used here each have 5000 datapoints in d = 2/8/32 dimensions. The true ks are 5, 20, and 80. For each synthetic dataset type, we generate 30 datasets with the true center means chosen uniformly randomly from the unit hypercube, and choosing σ so that no two clusters are closer than 3σ apart. Each cluster is also given a transformation to make it non-spherical, by multiplying the data by a randomly chosen scaling and rotation matrix. We run G-means starting with one center. We allow X-means to search between 2 and 4k centers (where here k is the true number of clusters). The G-means algorithm clearly does better at finding the correct k on non-spherical data. Its results are closer to the true distortions and the correct ks. The BIC statistic that X-means uses has been formulated to maximize the likelihood for spherically-distributed data. Thus it overestimates the number of true clusters in non-spherical data. 
This is especially evident when the number of points per cluster is small, as in datasets with 80 true clusters. 1 2 2 3 3 4 4 Digit 0 1 Digit 0 5 5 6 6 7 7 8 8 9 9 5 10 15 20 25 30 Cluster 10 20 30 40 50 60 Cluster Figure 5: NIST and Pendigits datasets: correspondence between each digit (row) and each cluster (column) found by G-means. G-means did not have the labels, yet it found meaningful clusters corresponding with the labels. Because of this overestimation, X-means often hits our limit of 4k centers. Figure 4 shows an example of overfitting on a dataset with 5 true clusters. X-means chooses k = 20 while G-means finds all 5 true cluster centers. Also of note is that X-means does not distribute centers evenly among clusters; some clusters receive one center, but others receive many. G-means runs faster than X-means for 8 and 32 dimensions, which we expect, since the kd-tree structures which make X-means fast in low dimensions take time exponential in d, making them slow for more than 8 to 12 dimensions. All our code is written in Matlab; X-means is written in C. 3.1 Discovering true clusters in labeled data We tested these algorithms on two real-world datasets for handwritten digit recognition: the NIST dataset [12] and the Pendigits dataset [2]. The goal is to cluster the data without knowledge of the labels and measure how well the clustering captures the true labels. Both datasets have 10 true classes (digits 0-9). NIST has 60000 training examples and 784 dimensions (28×28 pixels). We use 6000 randomly chosen examples and we reduce the dimension to 50 by random projection (following [3]). The Pendigits dataset has 7984 examples and 16 dimensions; we did not change the data in any way. We cluster each dataset with G-means and X-means, and measure performance by comparing the cluster labels Lc with the true labels Lt . We define the partition quality (PQ) as kt kc kt 2 2 pq = i=1 j=1 p(i, j) i=1 p(i) where kt is the true number of classes, and kc is the number of clusters found by the algorithm. This metric is maximized when Lc induces the same partition of the data as Lt ; in other words, when all points in each cluster have the same true label, and the estimated k is the true k. The p(i, j) term is the frequency-based probability that a datapoint will be labeled i by Lt and j by Lc . This quality is normalized by the sum of true probabilities, squared. This statistic is related to the Rand statistic for comparing partitions [8]. For the NIST dataset, G-means finds 31 clusters in 30 seconds with a PQ score of 0.177. X-means finds 715 clusters in 4149 seconds, and 369 of these clusters contain only one point, indicating an overestimation problem with the BIC. X-means receives a PQ score of 0.024. For the Pendigits dataset, G-means finds 69 clusters in 30 seconds, with a PQ score of 0.196; X-means finds 235 clusters in 287 seconds, with a PQ score of 0.057. Figure 5 shows Hinton diagrams of the G-means clusterings of both datasets, showing that G-means succeeds at identifying the true clusters concisely, without aid of the labels. The confusions between different digits in the NIST dataset (seen in the off-diagonal elements) are common for other researchers using more sophisticated techniques, see [3]. 4 Discussion and conclusions We have introduced the new G-means algorithm for learning k based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. 
4 Discussion and conclusions

We have introduced the new G-means algorithm for learning k, based on a statistical test for determining whether datapoints are a random sample from a Gaussian distribution with arbitrary dimension and covariance matrix. The splitting step uses dimension reduction and a powerful test for Gaussian fitness. G-means uses this statistical test as a wrapper around k-means to discover the number of clusters automatically. The only parameter supplied to the algorithm is the significance level of the statistical test, which can easily be set in a standard way. The G-means algorithm takes linear time and space (plus the cost of the splitting heuristic and test) in the number of datapoints and the dimension, since k-means is itself linear in time and space. Empirically, the G-means algorithm works well at finding the correct number of clusters and the locations of genuine cluster centers, and we have shown that it remains effective in moderately high dimensions.

Clustering in high dimensions has been an open problem for many years. Recent research has shown that it may be preferable to use dimensionality-reduction techniques before clustering, and then to apply a low-dimensional clustering algorithm such as k-means, rather than clustering in the high dimension directly. In [3] the author shows that a simple, inexpensive linear projection preserves many properties of the data (such as cluster distances) while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data; our work is a step in this direction. Additionally, recent image segmentation algorithms such as normalized cut [16, 13] are based on eigenvector computations on distance matrices. These “spectral” clustering algorithms still use k-means as a post-processing step to find the actual segmentation, and they require k to be specified. Thus we expect G-means to be useful in combination with spectral clustering.
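To make the wrapper structure described above concrete, here is a sketch of a single split decision. The specifics are assumptions for illustration only: the dimension reduction is taken to be a projection onto the vector joining two child centers found by 2-means, and scipy's Anderson-Darling statistic stands in for the Gaussian-fitness test; none of these choices, nor the use of scikit-learn and scipy, come from the paper's own Matlab implementation.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def should_split(cluster_points, alpha_index=2):
    """Decide whether one cluster should be split in two.

    Fit two child centers with k-means, project the cluster's points onto
    the vector joining the children (the dimension-reduction step), and
    run an Anderson-Darling normality test on the 1-d projection.
    alpha_index picks one of scipy's tabulated significance levels;
    index 2 corresponds to the 5% level."""
    km = KMeans(n_clusters=2, n_init=10).fit(cluster_points)
    c1, c2 = km.cluster_centers_
    v = c1 - c2
    proj = cluster_points @ v / (v @ v)          # 1-d projection of every point
    proj = (proj - proj.mean()) / proj.std()     # standardise before testing
    result = anderson(proj, dist='norm')
    # a statistic above the critical value rejects Gaussianity, so we split
    return result.statistic > result.critical_values[alpha_index]
```

The surrounding loop would start k-means with one center, apply should_split to every current cluster, keep any accepted child centers, and repeat until no cluster is split; that loop is straightforward and omitted here.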
References

[1] Horst Bischof, Aleš Leonardis, and Alexander Selb. MDL principle for robust vector quantisation. Pattern Analysis and Applications, 2:59–72, 1999.
[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] Sanjoy Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[4] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, 1997.
[5] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.
[6] Fredrik Farnstrom, James Lewis, and Charles Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000.
[7] Peter J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, June 1985.
[8] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[10] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995.
[11] Michael J. Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. In Computational Learning Theory (COLT), pages 21–30, 1995.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2002.
[14] Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
[15] Peter Sand and Andrew Moore. Repairing faulty mixture models using density estimation. In Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[17] M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, September 1974.
3 0.98245126 45 nips-2003-Circuit Optimization Predicts Dynamic Networks for Chemosensory Orientation in Nematode C. elegans
Author: Nathan A. Dunn, John S. Conery, Shawn R. Lockery
Abstract: The connectivity of the nervous system of the nematode Caenorhabditis elegans has been described completely, but the analysis of the neuronal basis of behavior in this system is just beginning. Here, we used an optimization algorithm to search for patterns of connectivity sufficient to compute the sensorimotor transformation underlying C. elegans chemotaxis, a simple form of spatial orientation behavior in which turning probability is modulated by the rate of change of chemical concentration. Optimization produced differentiator networks with inhibitory feedback among all neurons. Further analysis showed that feedback regulates the latency between sensory input and behavior. Common patterns of connectivity between the model and biological networks suggest new functions for previously identified connections in the C. elegans nervous system. 1
4 0.98135346 92 nips-2003-Information Bottleneck for Gaussian Variables
Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss
Abstract: The problem of extracting the relevant aspects of data was addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another - relevance - variable. An interesting question addressed in the current work is the extension of these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters. We give a formal definition of the general continuous IB problem and obtain an analytic solution for the optimal representation for the important case of multivariate Gaussian variables. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized correlation matrix Σ_{x|y} Σ_x^{-1}, which is also the basis obtained in Canonical Correlation Analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeoff - the information curve - in terms of the eigenvalue spectrum. 1
5 0.976412 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
Author: Haifeng Li, Tao Jiang, Keshu Zhang
Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminant analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw. Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1
6 0.96675307 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class
7 0.85771292 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
8 0.82935786 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
9 0.82736146 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
10 0.79816127 66 nips-2003-Extreme Components Analysis
11 0.79773617 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface
12 0.79076129 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
13 0.77389073 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
14 0.76500976 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
15 0.76014972 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
16 0.75960833 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
17 0.75741625 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
18 0.75723672 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data
19 0.75198615 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
20 0.74856889 185 nips-2003-The Doubly Balanced Network of Spiking Neurons: A Memory Model with High Capacity