nips nips2003 nips2003-179 knowledge-graph by maker-knowledge-mining

179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation


Source: pdf

Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki

Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K-means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1-norm is unique with probability one, and can be obtained using a linear programming algorithm. Comparisons of the l1-norm solution and the l0-norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data with the wavelet packet transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found, which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 An overcomplete basis matrix is estimated by using the K-means method. [sent-2, score-0.357]

2 We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1-norm is unique with probability one, and can be obtained using a linear programming algorithm. [sent-3, score-0.736]

3 Comparisons of the l1-norm solution and the l0-norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). [sent-4, score-0.919]

4 Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. [sent-5, score-0.505]

5 Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data with the wavelet packet transformation. [sent-6, score-0.717]

6 Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found, which are related to tasks. [sent-8, score-1.539]

7 1 Introduction Sparse representation or sparse coding of signals has received a great deal of attention in recent years. [sent-10, score-0.369]

8 For instance, sparse representation of signals using large-scale linear programming under given overcomplete bases has been studied. [sent-11, score-0.582]

9 Also, in [2], a sparse image coding approach using the wavelet pyramid architecture was presented. [sent-14, score-0.453]

10 Sparse representation can be used in blind source separation [3][4]. [sent-15, score-0.595]

11 In [3], a two-stage approach was proposed: the first stage estimates the mixing matrix using a clustering algorithm, and the second estimates the source matrix. [sent-16, score-0.641]

12 In our opinion, there are still three fundamental problems related to sparse representation of signals and BSS which need to be studied further: 1) detailed recoverability analysis; 2) high dimensionality of the observed data; 3) the overcomplete case, in which the number of sources is unknown. [sent-17, score-0.782]

13 All basis vectors are normalized to unit 2-norm, and any n basis vectors are linearly independent. [sent-24, score-0.3]

14 Section 2 analyzes the sparse representation of a data matrix. [sent-26, score-0.323]

15 Section 3 presents the comparison of the l0-norm solution and the l1-norm solution. [sent-27, score-0.546]

16 Section 4 discusses blind source separation via sparse representation. [sent-28, score-0.772]

17 2 Sparse representation of data matrix In this section, we discuss sparse representation of the data matrix X using the two-stage approach proposed in [3]. [sent-31, score-0.645]

18 At first, we apply an algorithm based on the K-means clustering method to find a suboptimal basis matrix composed of the cluster centers of the normalized, known data vectors, as in [3]. [sent-32, score-0.292]
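
As an illustration of this first stage, here is a minimal sketch assuming scikit-learn's KMeans; the unit-norm normalization and the folding of antipodal data vectors are our own implementation choices, not details quoted from [3].

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_basis_kmeans(X, m_bar, random_state=0):
    """Estimate an overcomplete basis as the cluster centers of the
    normalized data (column) vectors, the first stage of the two-stage
    approach. X is the n x N data matrix; m_bar is the number of
    clusters, which may overestimate the true number of basis vectors."""
    norms = np.linalg.norm(X, axis=0)
    cols = X[:, norms > 1e-12] / norms[norms > 1e-12]   # unit 2-norm columns
    # Fold antipodal points together so x and -x land in the same cluster
    # (the sign of a basis vector is a scaling ambiguity).
    cols = cols * np.sign(cols[0] + 1e-12)
    km = KMeans(n_clusters=m_bar, n_init=10, random_state=random_state)
    km.fit(cols.T)
    B = km.cluster_centers_.T
    return B / np.linalg.norm(B, axis=0)   # re-normalize the centers
```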

19 With this kind of cluster-center basis matrix, the corresponding coefficient matrix estimated by the linear programming algorithm presented in this section can become very sparse. [sent-33, score-0.36]

20 For a given basis matrix B in (1), the coefficient matrix can be found by solving the following optimization problem, as in many existing references: [sent-39, score-0.321]

21 $\min \sum_{i=1}^{m}\sum_{j=1}^{N}|s_{ij}|$, subject to $BS = X$. (2) It is not difficult to prove that the linear programming problem (2) is equivalent to the following set of N smaller-scale linear programming problems: $\min \sum_{i=1}^{m}|s_{ij}|$, subject to $Bs_j = x(j)$, $j = 1, \cdots, N$. (3) [sent-42, score-0.254]

22 By setting $S = U - V$, where $U = [u_{ij}]_{m\times N} \ge 0$ and $V = [v_{ij}]_{m\times N} \ge 0$, (3) can be converted to the following standard linear programming problems with non-negativity constraints: $\min \sum_{i=1}^{m}(u_{ij} + v_{ij})$, subject to $[B, -B][u_j^T, v_j^T]^T = x(j)$, $u_j \ge 0$, $v_j \ge 0$, where $j = 1, \cdots, N$. (4) [sent-43, score-0.246]
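
The standard-form problems (4) decouple across the N columns, so each can be handed to an off-the-shelf LP solver. A minimal sketch, assuming scipy.optimize.linprog (the solver choice is ours, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def l1_sparse_coefficients(B, X):
    """Solve min sum_i |s_ij| s.t. B s_j = x(j) for every column j of X,
    via the standard-form LP (4) with variables [u_j; v_j] >= 0 and
    s_j = u_j - v_j."""
    n, m = B.shape
    S = np.zeros((m, X.shape[1]))
    c = np.ones(2 * m)             # objective: sum(u_j) + sum(v_j)
    A_eq = np.hstack([B, -B])      # constraint [B, -B][u_j; v_j] = x(j)
    for j, x in enumerate(X.T):
        res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None), method="highs")
        if not res.success:
            raise RuntimeError(f"LP failed on column {j}: {res.message}")
        S[:, j] = res.x[:m] - res.x[m:]   # s_j = u_j - v_j
    return S
```

Since the N problems are independent, they can also be solved in parallel.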

23 Theorem 1 For almost all bases $B \in R^{n\times m}$, the sparse solution (l1-norm solution) of (1) is unique. [sent-44, score-0.404]

24 That is, the set of bases B, under which the sparse solution of (1) is not unique, is of measure zero. [sent-45, score-0.404]

25 It follows from Theorem 1 that for any given basis, there exists a unique sparse solution of (2) with probability one. [sent-47, score-0.431]

26 3 Comparison of the l0 norm solution and l1 norm solution Usually, the l0 norm $J_0(S) = \sum_{i=1}^{m}\sum_{j=1}^{N}|s_{ij}|^0$ (the number of nonzero entries of S) is used as a sparsity measure of S, since it ensures the sparsest solution. [sent-48, score-1.162]

27 Under this measure, the sparse solution is obtained by solving the problem $\min \sum_{i=1}^{m}\sum_{j=1}^{N}|s_{ij}|^0$, subject to $BS = X$. [sent-49, score-0.456]

28 This implies that the l0-norm solution may have only one nonzero entry in order for the equivalence to hold. [sent-54, score-0.294]

29 Next, we also discuss the equivalence of the l0-norm solution and the l1-norm solution, but from the viewpoint of probability. [sent-55, score-0.727]

30 $(P_1)\ \min \sum_{i=1}^{m}|s_i|$, subject to $As = x$, where $A \in R^{n\times m}$ and $x \in R^n$ are a known basis matrix and a data vector, respectively, and $s \in R^m$, $n \le m$. [sent-57, score-0.221]

31 Suppose that $s^{0*}$ is a solution of $(P_0)$, and $s^{1*}$ is a solution of $(P_1)$. [sent-58, score-0.22]

32 Theorem 2 The solution of $(P_0)$ is not robust to additive noise of the model, while the solution of $(P_1)$ is robust to additive noise, at least to some degree. [sent-59, score-0.22]

33 Although problem $(P_0)$ provides the sparsest solution, solving $(P_0)$ directly is not an efficient way to find it. [sent-60, score-0.21]

34 The reasons are: 1) if $\|s^{0*}\|_0 = n$, then the solution of $(P_0)$ is generally not unique; 2) no effective algorithm for solving the optimization problem $(P_0)$ exists to date (it has been proved that problem $(P_0)$ is NP-hard); 3) the solution of $(P_0)$ is not robust to noise. [sent-61, score-0.262]

35 From the facts above, a natural question arises: under what condition is the solution of $(P_1)$ one of the sparsest solutions, that is, a solution with the same number of nonzero entries as the solution of $(P_0)$? [sent-64, score-0.571]

36 Theorem 3 For the optimization problems $(P_0)$ and $(P_1)$, suppose that $A \in R^{n\times m}$ is selected randomly, $x \in R^n$ is generated by $As^*$, $l = \|s^*\|_0 < n$, and that all nonzero entries of $s^*$ are also selected randomly. [sent-68, score-0.274]
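
On a toy instance, the relationship between the two problems can be checked directly: the l0 solution is found by brute force over supports (exponential in m, consistent with $(P_0)$ being NP-hard) and compared with the l1 solution. The helper below and the instance sizes are illustrative assumptions; `l1_sparse_coefficients` refers to the LP sketch above.

```python
import itertools
import numpy as np

def l0_brute_force(A, x, tol=1e-8):
    """Exhaustively search for the sparsest s with As = x (tiny m only)."""
    n, m = A.shape
    for k in range(1, m + 1):
        for support in itertools.combinations(range(m), k):
            sub = A[:, support]
            s_sub = np.linalg.lstsq(sub, x, rcond=None)[0]
            if np.linalg.norm(sub @ s_sub - x) < tol:
                s = np.zeros(m)
                s[list(support)] = s_sub
                return s
    return None

rng = np.random.default_rng(0)
n, m, l = 6, 10, 2
A = rng.standard_normal((n, m))
s_true = np.zeros(m)
s_true[rng.choice(m, size=l, replace=False)] = rng.standard_normal(l)
x = A @ s_true
s0 = l0_brute_force(A, x)
# With l much smaller than n, the l1 solution (e.g. from the LP sketch
# above) typically coincides with s0 and with s_true, as Theorem 3 predicts.
```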

37 4 Blind source separation based on sparse representation In this section, we discuss blind source separation based on sparse representation of mixture signals. [sent-81, score-1.567]

38 The proposed approach is also suitable for the case in which the number of sensors is less than or equal to the number of sources, while the number of sources is unknown. [sent-82, score-0.293]

39 The task of blind source separation is to recover the sources using only the observable data matrix X. [sent-84, score-0.79]

40 The first step is to estimate the mixing matrix using the clustering Algorithm 1. [sent-86, score-0.358]

41 If the mixing matrix is estimated correctly, and a source vector $s^*$ satisfies $\|s^*\|_0 = l < n$, then by Theorem 3, $s^*$ is the l0-norm solution of (6) with probability one. [sent-87, score-0.973]

42 And if the source vector is sufficiently sparse, e. [sent-88, score-0.257]

43 Considering that the source number is generally unknown, we denote the estimated mixing matrix $\bar{A} = [\tilde{A}, \hat{A}] \in R^{n\times\bar{m}}$ ($\bar{m} > m$). [sent-91, score-0.616]

44 We introduce the following optimization problem $(\bar{P}_1)$ and denote its solution $\bar{s} = [\tilde{s}^T, \hat{s}^T]^T \in R^{\bar{m}}$: $(\bar{P}_1)\ \min \sum_{i=1}^{\bar{m}}|\bar{s}_i|$, subject to $\bar{A}\bar{s} = x$. [sent-92, score-0.214]

45 Theorem 4 Suppose that the sub-matrix $\tilde{A}$ (of the estimated mixing matrix $\bar{A}$) is sufficiently close to the true mixing matrix A, neglecting scaling and permutation ambiguities, and that a source vector is sufficiently sparse. [sent-94, score-0.914]

46 Then the source vector s can be recovered with high probability (close to one) by solving $(\bar{P}_1)$. [sent-95, score-0.359]

47 That is, $\tilde{s}$ is sufficiently close to the original source vector, and $\hat{s}$ is close to the zero vector. [sent-96, score-0.257]
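
A small sketch illustrating this claim: append extra random columns to a true mixing matrix to mimic the overestimated $\bar{A}$, solve the extended l1 problem, and check that the coefficients on the extra columns come out near zero. The sizes and the reuse of `l1_sparse_coefficients` from the LP sketch above are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, extra, l = 8, 10, 2, 2
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)                 # true mixing matrix
A_extra = rng.standard_normal((n, extra))
A_extra /= np.linalg.norm(A_extra, axis=0)
A_bar = np.hstack([A, A_extra])                # overestimated mixing matrix

s = np.zeros(m)
s[rng.choice(m, size=l, replace=False)] = rng.standard_normal(l)
x = A @ s                                      # sufficiently sparse source

# The first m entries should approximate s; the last `extra` entries
# (the hat-s part of the solution) should be close to zero.
s_bar = l1_sparse_coefficients(A_bar, x[:, None])[:, 0]
```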

48 To partially illustrate Theorem 4, we have performed two simulation experiments in which the mixing matrix is assumed to be estimated correctly. [sent-97, score-0.402]

49 Fig. 1 shows the probabilities that a source vector can be recovered correctly in different cases, estimated in the two simulations. [sent-99, score-0.425]

50 In the first simulation, n and m are fixed at 10 and 15, respectively; l denotes the number of nonzero entries of the source vector and ranges from 1 to 15. [sent-100, score-0.428]

51 We can see that the source can be estimated correctly when l = 1, 2, and the probability is greater than 0. [sent-104, score-0.411]

52 In the second simulation experiment, all original source vectors have 5 nonzero entries, that is, l = 5 and m = 15. [sent-106, score-0.464]

53 As in the first simulation, the probabilities of correctly estimated source vectors are estimated through 3000 stochastic experiments and shown in Fig. 1. [sent-108, score-0.493]

54 It is evident that when n ≥ 10, the source can be estimated correctly with probability higher than 0. [sent-110, score-0.411]
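
The recoverability curves of Fig. 1 can be approximated by a Monte Carlo sketch along the following lines; we use far fewer trials than the paper's 3000 for speed, and the recovery tolerance is an assumption.

```python
import numpy as np

def recovery_probability(n, m, l, trials=300, tol=1e-6, seed=0):
    """Fraction of random l-sparse sources recovered by the l1 solution,
    mirroring the paper's stochastic experiments (see the LP sketch above
    for l1_sparse_coefficients)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        A = rng.standard_normal((n, m))
        A /= np.linalg.norm(A, axis=0)
        s = np.zeros(m)
        s[rng.choice(m, size=l, replace=False)] = rng.standard_normal(l)
        s_hat = l1_sparse_coefficients(A, (A @ s)[:, None])[:, 0]
        hits += np.linalg.norm(s_hat - s) <= tol * max(1.0, np.linalg.norm(s))
    return hits / trials

# e.g. fix n = 10, m = 15 and sweep l = 1..15, as in the first simulation.
```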

55 Figure 1: (a) the probability curve that the source vectors are estimated correctly, as a function of l, obtained in the first simulation; (b) the probability curve that the source vectors are estimated correctly, as a function of n, obtained in the second simulation. [sent-120, score-0.922]

56 In order to estimate the mixing matrix correctly, the sources should be sufficiently sparse. [sent-121, score-0.484]

57 Thus sparseness of the sources plays an important role not only in estimating the sources but also in estimating the mixing matrix. [sent-122, score-0.51]

58 However, if the sources are not sufficiently sparse in reality, we can preprocess them with a wavelet packet transformation. [sent-123, score-0.702]

59 In the following, a blind separation algorithm based on preprocessing is presented for dense sources. [sent-124, score-0.295]

60 Transform the n time-domain signals (the n rows of X) to time-frequency signals by a wavelet packet transformation, and make sure that the n wavelet packet trees have the same structure. [sent-126, score-0.668]

61 Select those nodes of the wavelet packet trees whose coefficients are as sparse as possible. [sent-128, score-0.542]

62 Based on these coefficient vectors, estimate the mixing matrix $\bar{A} \in R^{n\times\bar{m}}$ using Algorithm 1 presented in Section 2. [sent-130, score-0.354]

63 Based on the estimated mixing matrix $\bar{A}$ and the coefficients of all nodes obtained in step 1, estimate the coefficients of all the nodes of the sources' wavelet packet trees by solving the set of linear programming problems (4). [sent-132, score-0.961]

64 End We have successfully separated speech sources in a number of simulations in the overcomplete case. [sent-135, score-0.312]
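
A sketch of steps 1-2 of this preprocessing algorithm, assuming PyWavelets; the wavelet name, the l1/l2 sparsity score, and the number of retained nodes are our choices, not the paper's.

```python
import numpy as np
import pywt

def wavelet_packet_features(X, wavelet="db4", level=7, n_nodes=8):
    """Step 1: transform each of the n observed signals (rows of X) into
    a wavelet packet tree with an identical structure. Step 2: keep the
    nodes whose stacked coefficients look sparsest (low l1/l2 ratio) and
    concatenate them; these n-dimensional coefficient vectors feed the
    clustering (Algorithm 1)."""
    trees = [pywt.WaveletPacket(data=row, wavelet=wavelet, maxlevel=level)
             for row in X]
    paths = [node.path for node in trees[0].get_level(level, order="natural")]
    blocks = {p: np.vstack([t[p].data for t in trees]) for p in paths}

    def l1_l2(C):                      # lower value = sparser coefficients
        return np.sum(np.abs(C)) / (np.linalg.norm(C) + 1e-12)

    sparsest = sorted(paths, key=lambda p: l1_l2(blocks[p]))[:n_nodes]
    return np.hstack([blocks[p] for p in sparsest]), trees, sparsest
```

For step 3, one would solve the LP (4) on the coefficient block of every node and then invert each source's packet tree, e.g. by filling a fresh pywt.WaveletPacket(data=None, ...) node by node and calling reconstruct().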

65 Remark 2: A challenging problem in the algorithm above is to estimate the mixing matrix as precisely as possible. [sent-139, score-0.324]

66 In our many simulations on BSS of speech mixtures, we use a 7-level wavelet packet transformation for preprocessing. [sent-140, score-0.292]

67 When the K-means clustering method is used for estimating the mixing matrix, the number of clusters (the number of columns of the estimated mixing matrix) should be set greater than the source number, even if the source number is known. [sent-141, score-0.989]

68 In this way, the estimated matrix will contain a submatrix very close to the original mixing matrix. [sent-142, score-0.359]
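
In simulations where the true mixing matrix is known, this claim can be verified with a small matching utility; the greedy matching on absolute cosines below is our own choice.

```python
import numpy as np

def match_columns(A_true, A_est):
    """Greedily match each true mixing column to its closest estimated
    column up to sign, assuming unit-norm columns; returns the matched
    index and |cosine| per true column, so one can check that the
    overestimated matrix contains a submatrix close to the true one."""
    C = np.abs(A_true.T @ A_est)      # |cosine| similarities
    matches, used = [], set()
    for i in range(A_true.shape[1]):
        for j in np.argsort(-C[i]):
            if j not in used:
                used.add(j)
                matches.append((int(j), float(C[i, j])))
                break
    return matches
```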

69 From Theorem 4, we can estimate the source using the overestimated mixing matrix. [sent-143, score-0.473]

70 However, compared with ICA, sparse representation has two important advantages: 1) the sources are not assumed to be mutually independent as in ICA, nor even stationary; 2) the number of sources can be larger than the number of sensors. [sent-148, score-0.74]

71 We believe that sparse representation is a complementary and very promising approach in the analysis of EEG. [sent-149, score-0.36]

72 Here we present the results of testing the usefulness of sparse representation in the analysis of EEG data based on temporal synchronization between components. [sent-150, score-0.881]

73 EEG was filtered off-line in the 1-70 Hz range, trials with artifacts were rejected by visual inspection, and a data set including 20 trials with correct responses and 20 trials with incorrect responses was selected for analysis (1 trial = 2176 points). [sent-159, score-0.538]

74 Using the sparse representation algorithm proposed in this paper, we decomposed the EEG signals X into 20 components. [sent-161, score-0.369]

75 Denote by S the 20×87040-dimensional component matrix, which contains 20 trials with correct responses and 20 trials with incorrect responses. [sent-162, score-0.491]

76 According to modern brain theories, dynamics of synchronization of rhythmic activities in distinct neural networks plays a very important role in interactions between them. [sent-179, score-0.578]

77 Thus, phase synchronization was analyzed in a pair of almost independent components $(s_1, s_{14})$ (correlation $R_{s_1,s_{14}} \approx 0$). [sent-180, score-0.681]

78 The synchronization index is defined by $SI(f, t) = \max(SPLV(f, t) - S_{sur}, 0)$, where $SPLV(f, t)$ is a single-trial phase-locking value at frequency f and time t, smoothed by a window of length 99, and $S_{sur}$ is a surrogate-based significance threshold. [sent-183, score-0.632]
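
A hedged sketch of this index for a single frequency band, using Hilbert-transform phases of band-passed components; the Butterworth filter and the externally supplied surrogate threshold s_sur are our assumptions, since the extracted text does not specify the exact time-frequency decomposition or surrogate procedure.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def sync_index(s1, s2, fs, band, s_sur, win=99):
    """Single-trial SI(t) = max(SPLV(t) - Ssur, 0) for one band:
    band-pass both components, take instantaneous Hilbert phases, and
    smooth the phase-locking value with a length-99 window as in the text."""
    lo, hi = band
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    phi1 = np.angle(hilbert(filtfilt(b, a, s1)))
    phi2 = np.angle(hilbert(filtfilt(b, a, s2)))
    z = np.exp(1j * (phi1 - phi2))     # unit phasors of the phase difference
    splv = np.abs(np.convolve(z, np.ones(win) / win, mode="same"))
    return np.maximum(splv - s_sur, 0.0)
```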

79 Though only ten of the 40 trials are presented due to page space, 32 of the 40 trials show similar characteristics. [sent-188, score-0.405]

80 In Fig. 3(a), two averaged synchronization index curves are presented, obtained by averaging the synchronization index SI over the range 1-15 Hz and across 20 trials, separately for correct and incorrect responses. [sent-190, score-1.271]

81 Note the time variations of the averaged synchronization index and its higher values for correct responses, especially in the beginning and the end of the trial (preparation and response periods). [sent-191, score-0.674]

82 To test the significance of the time and correctness effects, the synchronization index was averaged again over each 128 time points. [sent-192, score-0.636]
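
A small sketch of this two-level averaging (mean over trials and the 1-15 Hz bins, then block means over consecutive 128-point windows); the array layout is an assumption.

```python
import numpy as np

def averaged_si(si, block=128):
    """si: array of shape (n_trials, n_freqs, n_times) holding single-trial
    synchronization indices for one response class (layout assumed).
    Returns the trial- and frequency-averaged curve and its block means."""
    curve = si.mean(axis=(0, 1))
    n_blocks = curve.size // block
    blocks = curve[: n_blocks * block].reshape(n_blocks, block).mean(axis=1)
    return curve, blocks
```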

83 Thus, the phase synchronization between the two analyzed components was sensitive both to changes in brain activity induced by time-varying task demands and to correctness-related variations in the brain state. [sent-197, score-0.802]

84 The higher synchronization for correct responses could be related to higher integration of brain systems required for effective information processing. [sent-198, score-0.609]

85 A substantial part of synchronization between raw EEG channels can be explained by volume conduction effects. [sent-201, score-0.49]

86 Large cortical areas may work as stable unified oscillating systems, and this may account for another large part of the synchronization in raw EEG. [sent-202, score-0.49]

87 This kind of strong synchronization may render invisible the synchronization appearing only for brief periods, which is of special interest in brain research. [sent-203, score-1.124]

88 To study temporally appearing synchronization, components related to the activity of more or less unified brain sources should be separated from EEG. [sent-204, score-0.358]

89 Our first results of applying sparse representation to real EEG data support that it can help reveal brief periods of synchronization between brain “sources”. [sent-205, score-0.948]

90 2nd row: mean synchronization index averaged across frequencies in the range 1-15 Hz, for the same trials as in the 1st row. [sent-229, score-0.735]

91 6 Concluding remarks In this paper, sparse representation of data matrices and its application to blind source separation were analyzed based on the two-stage approach presented in [3]. [sent-232, score-0.714]

92 The curves show mean values of the synchronization index averaged over the range 1-15 Hz and across 20 trials. [sent-241, score-0.637]

93 Black curves are for trials with correct responses; red dotted curves refer to trials with incorrect responses. [sent-242, score-0.421]

94 The l1 norm is used as a sparsity measure, whereas the l0-norm sparsity measure is considered for comparison and for recoverability analysis of BSS. [sent-244, score-0.489]

95 This kind of construct employing sparse representation can be used in BSS as in [3], especially in cases where there are fewer sensors than sources, the source number is unknown, and the sources are not completely independent. [sent-246, score-0.966]

96 Lastly, an application example for the analysis of phase synchrony in real EEG data supports the validity and performance of the proposed approach. [sent-247, score-0.212]

97 Since the components separated by sparse representation are not constrained by the condition of complete independence, they can perhaps be used in the analysis of brain synchrony more effectively than components separated by general ICA algorithms based on the independence principle. [sent-248, score-0.738]

98 (2001) Learning sparse image codes using a wavelet pyramid architecture. [sent-261, score-0.453]

99 (2000) Blind source separation by sparse decomposition in a signal dictionary. [sent-270, score-0.616]

100 (1999) Blind source separation of more sources than mixtures using overcomplete representations. [sent-284, score-0.668]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('synchronization', 0.49), ('eeg', 0.276), ('source', 0.257), ('sparse', 0.25), ('norm', 0.218), ('mixing', 0.19), ('wavelet', 0.169), ('sources', 0.16), ('blind', 0.156), ('recoverability', 0.14), ('trials', 0.125), ('packets', 0.123), ('synchrony', 0.122), ('nonzero', 0.114), ('overcomplete', 0.113), ('solution', 0.11), ('separation', 0.109), ('matrix', 0.108), ('coef', 0.098), ('si', 0.093), ('bss', 0.092), ('rn', 0.091), ('brain', 0.088), ('sij', 0.077), ('basis', 0.075), ('representation', 0.073), ('hz', 0.072), ('sparsest', 0.07), ('subject', 0.066), ('correctly', 0.064), ('averaged', 0.064), ('estimated', 0.061), ('bs', 0.061), ('theorem', 0.058), ('entries', 0.057), ('incorrect', 0.057), ('programming', 0.056), ('index', 0.056), ('saitama', 0.055), ('suf', 0.055), ('rs', 0.054), ('ica', 0.054), ('phase', 0.053), ('remarks', 0.051), ('vectors', 0.05), ('ciently', 0.049), ('correlation', 0.047), ('sparsity', 0.047), ('periods', 0.047), ('lv', 0.047), ('memorized', 0.047), ('ssur', 0.047), ('signals', 0.046), ('components', 0.045), ('cients', 0.044), ('bases', 0.044), ('simulation', 0.043), ('recovered', 0.043), ('unique', 0.042), ('lewicki', 0.041), ('uij', 0.041), ('donoho', 0.041), ('separated', 0.039), ('min', 0.038), ('selected', 0.038), ('concluding', 0.038), ('equivalence', 0.038), ('trees', 0.038), ('analyzed', 0.038), ('analysis', 0.037), ('median', 0.037), ('sensors', 0.036), ('vij', 0.034), ('pyramid', 0.034), ('factorization', 0.034), ('clustering', 0.034), ('response', 0.033), ('discuss', 0.033), ('frequency', 0.033), ('rx', 0.032), ('entry', 0.032), ('generally', 0.032), ('correct', 0.031), ('rm', 0.031), ('usefulness', 0.031), ('presented', 0.03), ('solving', 0.03), ('kind', 0.03), ('probability', 0.029), ('mixtures', 0.029), ('refers', 0.029), ('suppose', 0.027), ('sejnowski', 0.027), ('sp', 0.027), ('curves', 0.027), ('correctness', 0.026), ('appearing', 0.026), ('vj', 0.026), ('estimate', 0.026), ('suboptimal', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki

Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K-means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1-norm is unique with probability one, and can be obtained using a linear programming algorithm. Comparisons of the l1-norm solution and the l0-norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data with the wavelet packet transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found, which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1

2 0.2252634 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations

Author: Maneesh Sahani, Srikantan S. Nagarajan

Abstract: Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. 1

3 0.13988417 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface

Author: Yu Zhou, Steven G. Mason, Gary E. Birch

Abstract: This paper presents an energy normalization transform as a method to reduce system errors in the LF-ASD brain-computer interface. The energy normalization transform has two major benefits to the system performance. First, it can increase class separation between the active and idle EEG data. Second, it can desensitize the system to the signal amplitude variability. For four subjects in the study, the benefits resulted in the performance improvement of the LF-ASD in the range from 7.7% to 18.9%, while for the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance did not change notably with normalization. 1

4 0.10683445 119 nips-2003-Local Phase Coherence and the Perception of Blur

Author: Zhou Wang, Eero P. Simoncelli

Abstract: unknown-abstract

5 0.09433122 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl

Author: Brian J. Fischer, Charles H. Anderson

Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.

6 0.087859847 128 nips-2003-Minimax Embeddings

7 0.079520725 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class

8 0.077159084 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

9 0.072257482 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data

10 0.072015882 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

11 0.071806341 73 nips-2003-Feature Selection in Clustering Problems

12 0.071397424 107 nips-2003-Learning Spectral Clustering

13 0.07136222 92 nips-2003-Information Bottleneck for Gaussian Variables

14 0.071024083 193 nips-2003-Variational Linear Response

15 0.069339424 182 nips-2003-Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron

16 0.065843999 155 nips-2003-Perspectives on Sparse Bayesian Learning

17 0.063712128 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples

18 0.062681362 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

19 0.061808933 1 nips-2003-1-norm Support Vector Machines

20 0.059215758 5 nips-2003-A Classification-based Cocktail-party Processor


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.201), (1, -0.036), (2, 0.064), (3, 0.009), (4, -0.053), (5, 0.114), (6, 0.148), (7, -0.049), (8, -0.103), (9, 0.046), (10, 0.097), (11, -0.115), (12, 0.088), (13, 0.033), (14, -0.23), (15, 0.097), (16, -0.189), (17, 0.031), (18, -0.017), (19, -0.033), (20, 0.11), (21, 0.002), (22, 0.067), (23, 0.022), (24, 0.031), (25, -0.03), (26, -0.165), (27, 0.063), (28, 0.009), (29, -0.045), (30, -0.025), (31, -0.029), (32, -0.035), (33, -0.044), (34, -0.11), (35, 0.019), (36, 0.015), (37, -0.101), (38, -0.033), (39, -0.057), (40, 0.16), (41, 0.137), (42, 0.123), (43, -0.045), (44, -0.081), (45, -0.059), (46, -0.076), (47, -0.027), (48, 0.034), (49, -0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96662343 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki

Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K-means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1-norm is unique with probability one, and can be obtained using a linear programming algorithm. Comparisons of the l1-norm solution and the l0-norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data with the wavelet packet transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found, which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1

2 0.7775439 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations

Author: Maneesh Sahani, Srikantan S. Nagarajan

Abstract: Existing source location and recovery algorithms used in magnetoencephalographic imaging generally assume that the source activity at different brain locations is independent or that the correlation structure is known. However, electrophysiological recordings of local field potentials show strong correlations in aggregate activity over significant distances. Indeed, it seems very likely that stimulus-evoked activity would follow strongly correlated time-courses in different brain areas. Here, we present, and validate through simulations, a new approach to source reconstruction in which the correlation between sources is modelled and estimated explicitly by variational Bayesian methods, facilitating accurate recovery of source locations and the time-courses of their activation. 1

3 0.57787234 89 nips-2003-Impact of an Energy Normalization Transform on the Performance of the LF-ASD Brain Computer Interface

Author: Yu Zhou, Steven G. Mason, Gary E. Birch

Abstract: This paper presents an energy normalization transform as a method to reduce system errors in the LF-ASD brain-computer interface. The energy normalization transform has two major benefits to the system performance. First, it can increase class separation between the active and idle EEG data. Second, it can desensitize the system to the signal amplitude variability. For four subjects in the study, the benefits resulted in the performance improvement of the LF-ASD in the range from 7.7% to 18.9%, while for the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance did not change notably with normalization. 1 In trod u ction In an effort to provide alternative communication channels for people who suffer from severe loss of motor function, several researchers have worked over the past two decades to develop a direct Brain-Computer Interface (BCI). Since electroencephalographic (EEG) signal has good time resolution and is non-invasive, it is commonly used for data source of a BCI. A BCI system converts the input EEG into control signals, which are then used to control devices like computers, environmental control system and neuro-prostheses. Mason and Birch [1] proposed the Low-Frequency Asynchronous Switch Design (LF-ASD) as a BCI which detected imagined voluntary movement-related potentials (IVMRPs) in spontaneous EEG. The principle signal processing components of the LF-ASD are shown in Figure 1. sIN Feature Extractor sLPF LPF Feature Classifier sFE sFC Figure 1: The original LF-ASD design. The input to the low-pass filter (LPF), denoted as SIN in Figure 1, are six bipolar EEG signals recorded from F1-FC1, Fz-FCz, F2-FC2, FC1-C1, FCz-Cz and FC2-C2 sampled at 128 Hz. The cutoff frequency of the LPF implemented by Mason and Birch was 4 Hz. The Feature Extractor of the LF-ASD extracts custom features related to IVMRPs. The Feature Classifier implements a one-nearest-neighbor (1NN) classifier, which determines if the input signals are related to a user state of voluntary movement or passive (idle) observation. The LF-ASD was able to achieve True Positive (TP) values in the range of 44%-81%, with the corresponding False Positive (FP) values around 1% [1]. Although encouraging, the current error rates of the LF-ASD are insufficient for real-world applications. This paper proposes a method to improve the system performance. 2 Design and Rationale The improved design of the LF-ASD with the Energy Normalization Transform (ENT) is provided in Figure 2. SIN ENT SN SNLPF LPF Feature Extractor SNFE SNFC Feature Classifier Figure 2: The improved LF-ASD with the Energy Normalization Transform. The design of the Feature Extractor and Feature Classifier were the same as shown in Figure 1. The Energy Normalization Transform (ENT) is implemented as S N (n ) = S s=( w s= − ( N ∑ w IN S IN −1) / 2 N −1) / 2 (n ) 2 (n − s) w N where W N (normalization window size) is the only parameter in the equation. The optimal parameter value was obtained by exhaustive search for the best class separation between active and idle EEG data. The method of obtaining the active and idle EEG data is provided in Section 3.1. The idea to use energy normalization to improve the LF-ASD design was based primarily on an observation that high frequency power decreases significantly around movement. For example, Jasper and Penfield [3] and Pfurtscheller et al, [4] reported EEG power decrease in the mu (8-12 Hz) and beta rhythm (18-26 Hz) when people are involved in motor related activity. 
Also Mason [5] found that the power in the frequency components greater than 4Hz decreased significantly during movement-related potential periods, while power in the frequency components less than 4Hz did not. Thus energy normalization, which would increase the low frequency power level, would strengthen the 0-4 Hz features used in the LF-ASD and hence reduce errors. In addition, as a side benefit, it can automatically adjust the mean scale of the input signal and desensitize the system to change in EEG power, which is known to vary over time [2]. Therefore, it was postulated that the addition of ENT into the improved design would have two major benefits. First, it can increase the EEG power around motor potentials, consequently increasing the class separation and feature strength. Second, it can desensitize the system to amplitude variance of the input signal. In addition, since the system components of the modified LF-ASD after the ENT were the same as in the original design, a major concern was whether or not the ENT distorted the features used by the LF-ASD. Since the features used by the LFASD are generated from the 0-4 Hz band, if the ENT does not distort the phase and magnitude spectrum in this specific band, it would not distort the features related to movement potential detection in the application. 3 3.1 Evaluation Test data Two types of EEG data were pre-recorded from five able-bodied individuals as shown in Figure 3. Active Data Type and Idle Data Type. Active Data was recorded during repeated right index finger flexions alternating with periods of no motor activity; Idle Data was recorded during extended periods of passive observation. Figure 3: Data Definition of M1, M2, Idle1 and Idle2. Observation windows centered at the time of the finger switch activations (as shown in Figure 4) were imposed in the active data to separate data related to movements from data during periods of idleness. For purpose of this study, data in the front part of the observation window was defined as M1 and data in the rear part of the window was defined as M2. Data falling out of the observation window was defined as Idle2. All the data in the Idle Data Type was defined as Idle1 for comparison with Idle2. Figure 4: Ensemble Average of EEG centered on finger activations. Figure 5: Density distribution of Idle1, Idle2, M1 and M2. It was noted, in terms of the density distribution of active and idle data, the separation between M2 and Idle2 was the largest and Idle1 and Idle2 were nearly identical (see Figure 5). For the study, M2 and Idle2 were chosen to represent the active and idle data classes and the separation between M2 and Idle2 data was defined by the difference of means (DOM) scaled by the amplitude range of Idle2. 3.2 Optimal parameter determination The optimal combination of normalization window size, W N, and observation window size, W O was selected to be that which achieved the maximal DOM value. This was determined by exhaustive search, and discussed in Section 4.1. 3.3 Effect of ENT on the Low Pass Filter output As mentioned previously, it was postulated that the ENT had two major impacts: increasing the class separation between active and idle EEG and desensitizing the system to the signal amplitude variance. The hypothesis was evaluated by comparing characteristics of SNLPF and SLPF in Figure 1 and Figure 2. DOM was applied to measure the increased class separation. The signal with the larger DOM meant larger class separation. 
In addition, the signal with smaller standard deviation may result in a more stable feature set. 3.4 Effect of ENT on the LF-ASD output The performances of the original and improved designs were evaluated by comparing the signal characteristics of SNFC in Figure 2 to SFC in Figure 1. A Receiver Operating Characteristic Curve (ROC Curve) [6] was generated for the original and improved designs. The ROC Curve characterizes the system performance over a range of TP vs. FP values. The larger area under ROC Curve indicates better system performance. In real applications, a BCI with high-level FP rates could cause frustration for subjects. Therefore, in this work only the LF-ASD performance when the FP values are less than 1% were studied. 4 4.1 Results Optimal normalization window size (WN) The method to choose optimal WN was an exhaustive search for maximal DOM between active and idle classes. This method was possibly dependent on the observation window size (W O). However, as shown in Figure 6a, the optimal WN was found to be independent of WO. Experimentally, the W O values were selected in the range of 50-60 samples, which corresponded to largest DOM between nonnormalized active and idle data. The optimal WN was obtained by exhaustive search for the largest DOM through normalized active and idle data. The DOM vs. WN profile for Subject 1 is shown in Figure 6b. a) b) Figure 6: Optimal parameter determination for Subject 1 in Channel 1 a) DOM vs. WO; b) DOM vs. WN. When using ENT, a small W N value may cause distortion to the feature set used by the LF-ASD. Thus, the optimal W N was not selected in this range (< 40 samples). When W N is greater than 200, the ENT has lost its capability to increase class separation and the DOM curve gradually goes towards the best separation without normalization. Thus, the optimal W N should correspond to the maximal DOM value when W N is in the range from 40 to 200. In Figure 6b, the optimal WN is around 51. 4.2 Effect of ENT on the Low Pass Filter output With ENT, the standard deviation of the low frequency EEG signal decreased from around 1.90 to 1.30 over the six channels and over the five subjects. This change resulted in more stable feature sets. Thus, the ENT desensitizes the system to input signal variance. a) b) Figure 7: Density distribution of the active vs. idle class without (a) and with (b) ENT, for Subject 1 in Channel 1. As shown in Figure 7, by increasing the EEG power around motor potentials, ENT can increase class separations between active and idle EEG data. The class separation in (frontal) Channels 1-3 across all subjects increased consistently with the proposed ENT. The same was true for (midline) Channels 4-6, for all subjects except Subject 5, whose DOM in channel 5-6 decreased by 2.3% and 3.4% respectively with normalization. That is consistent with the fact that his EEG power in Channels 4-6 does not decrease. On average, across all five subjects, DOM increases with normalization to about 28.8%, 26.4%, 39.4%, 20.5%, 17.8% and 22.5% over six channels respectively. In addition, the magnitude and phase spectrums of the EEG signal before and after ENT is provided in Figure 8. The ENT has no visible distortion to the signal in the low frequency band (0-4 Hz) used by the LF-ASD. Therefore, the ENT does not distort the features used by the LF-ASD. (a) (b) Figure 8: Magnitude and phase spectrum of the EEG signal before and after ENT. 
4.3 Effect of ENT on the LF-ASD output The two major benefits of the ENT to the low frequency EEG data result in the performance improvement of the LF-ASD. Subject 1’s ROC Curves with and without ENT is shown in Figure 9, where the ROC-Curve with ENT of optimal parameter value is above the ROC Curve without ENT. This indicates that the improved LF-ASD performs better. Table I compares the system performance with and without ENT in terms of TP with corresponding FP at 1% across all the 5 subjects. Figure 9: The ROC Curves (in the section of interest) of Subject 1 with different WN values and the corresponding ROC Curve without ENT. Table I: Performance of the LF-ASD with and without LF-ASD in terms of the True Positive rate with corresponding False Positive at 1%. Subject 1 Subject 2 Subject 3 Subject 4 Subject 5 TP without ENT 66.1% 82.7% 79.7% 79.3% 90.5% TP with ENT 85.0% 90.4% 88.0% 87.8% 88.7% Performance Improvement 18.9% 7.7% 8.3% 8.5% -1.8% For 4 out of 5 subjects, corresponding with the FP at 1%, the improved system with ENT increased the TP value by 7.7%, 8.3%, 8.5% and 18.9% respectively. Thus, for these subjects, the range of TP with FP at 1% was improved from 66.1%-82.7% to 85.0%-90.4% with ENT. For the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance remained around 90% with ENT. In addition, this evaluation is conservative. Since the codebook in the Feature Classifier and the parameters in the Feature Extractor of the LF-ASD were derived from nonnormalized EEG, they work in favor of the non-normalized EEG. Therefore, if the parameters and the codebook of the modified LF-ASD are generated from the normalized EEG in the future, the modified LF-ASD may show better performance than this evaluation. 5 Conclusion The evaluation with data from five able-bodied subjects indicates that the proposed system with Energy Normalization Transform (ENT) has better performance than the original. This study has verified the original hypotheses that the improved design with ENT might have two major benefits: increased the class separation between active and idle EEG and desensitized the system performance to input amplitude variance. As a side benefit, the ENT can also make the design less sensitive to the mean input scale. In the broad band, the Energy Normalization Transform is a non-linear transform. However, it has no visible distortion to the signal in the 0-4 Hz band. Therefore, it does not distort the features used by the LF-ASD. For 4 out of 5 subjects, with the corresponding False Positive rate at 1%, the proposed transform increased the system performance by 7.7%, 8.3%, 8.5% and 18.9% respectively in terms of True Positive rate. Thus, the overall performance of the LF-ASD for these subjects was improved from 66.1%-82.7% to 85.0%-90.4%. For the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance did not change notably with normalization. In the future with the codebook derived from the normalized data, the performance could be further improved. References [1] Mason, S. G. and Birch, G. E., (2000) A Brain-Controlled Switch for Asynchronous Control Applications. IEEE Trans Biomed Eng, 47(10):1297-1307. [2] Vaughan, T. M., Wolpaw, J. R., and Donchin, E. (1996) EEG-Based Communication: Prospects and Problems. IEEE Trans Reh Eng, 4(4):425-430. [3] Jasper, H. and Penfield, W. (1949) Electrocortiograms in man: Effect of voluntary movement upon the electrical activity of the precentral gyrus. 
5 Conclusion

The evaluation with data from five able-bodied subjects indicates that the proposed system with the Energy Normalization Transform (ENT) performs better than the original. This study verified the original hypotheses that the improved design with ENT has two major benefits: it increases the class separation between active and idle EEG, and it desensitizes the system performance to input amplitude variance. As a side benefit, the ENT also makes the design less sensitive to the mean input scale.

In the broad band, the Energy Normalization Transform is a non-linear transform; however, it introduces no visible distortion to the signal in the 0-4 Hz band and therefore does not distort the features used by the LF-ASD. For 4 out of 5 subjects, at a False Positive rate of 1%, the proposed transform increased the True Positive rate by 7.7%, 8.3%, 8.5% and 18.9% respectively. Thus, the overall performance of the LF-ASD for these subjects was improved from 66.1%-82.7% to 85.0%-90.4%. For the fifth subject, who had the highest non-normalized accuracy of 90.5%, the performance did not change notably with normalization. In the future, with the codebook derived from normalized data, the performance could be further improved.

References

[1] Mason, S. G. and Birch, G. E. (2000) A Brain-Controlled Switch for Asynchronous Control Applications. IEEE Trans Biomed Eng, 47(10):1297-1307.
[2] Vaughan, T. M., Wolpaw, J. R., and Donchin, E. (1996) EEG-Based Communication: Prospects and Problems. IEEE Trans Reh Eng, 4(4):425-430.
[3] Jasper, H. and Penfield, W. (1949) Electrocorticograms in man: Effect of voluntary movement upon the electrical activity of the precentral gyrus. Arch. Psychiat. Nervenkr., 183:163-174.
[4] Pfurtscheller, G., Neuper, C., and Flotzinger, D. (1997) EEG-based discrimination between imagination of right and left hand movement. Electroencephalography and Clinical Neurophysiology, 103:642-651.
[5] Mason, S. G. (1997) Detection of single trial index finger flexions from continuous, spatiotemporal EEG. PhD Thesis, UBC, January.
[6] Green, D. M. and Swets, J. A. (1996) Signal Detection Theory and Psychophysics. New York: John Wiley and Sons, Inc.

4 0.53202391 182 nips-2003-Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron

Author: Sung C. Jun, Barak A. Pearlmutter

Abstract: We describe a system that localizes a single dipole to reasonable accuracy from noisy magnetoencephalographic (MEG) measurements in real time. At its core is a multilayer perceptron (MLP) trained to map sensor signals and head position to dipole location. Including head position overcomes the previous need to retrain the MLP for each subject and session. The training dataset was generated by mapping randomly chosen dipoles and head positions through an analytic model and adding noise from real MEG recordings. After training, a localization took 0.7 ms with an average error of 0.90 cm. A few iterations of a Levenberg-Marquardt routine using the MLP’s output as its initial guess took 15 ms and improved the accuracy to 0.53 cm, only slightly above the statistical limits on accuracy imposed by the noise. We applied these methods to localize single dipole sources from MEG components isolated by blind source separation and compared the estimated locations to those generated by standard manually-assisted commercial software. 1

5 0.47892389 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl

Author: Brian J. Fischer, Charles H. Anderson

Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive fields [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we first formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.

6 0.47877103 123 nips-2003-Markov Models for Automated ECG Interval Analysis

7 0.44526684 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class

8 0.41065618 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

9 0.40712756 128 nips-2003-Minimax Embeddings

10 0.39838889 119 nips-2003-Local Phase Coherence and the Perception of Blur

11 0.34208488 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples

12 0.33655363 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

13 0.31946826 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects

14 0.317054 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

15 0.31621286 155 nips-2003-Perspectives on Sparse Bayesian Learning

16 0.30764073 107 nips-2003-Learning Spectral Clustering

17 0.29641351 71 nips-2003-Fast Embedding of Sparse Similarity Graphs

18 0.29318643 73 nips-2003-Feature Selection in Clustering Problems

19 0.2781989 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

20 0.27411646 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.038), (11, 0.023), (18, 0.225), (29, 0.023), (30, 0.033), (35, 0.062), (53, 0.172), (71, 0.052), (76, 0.058), (85, 0.08), (91, 0.11), (99, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91331369 128 nips-2003-Minimax Embeddings

Author: Matthew Brand

Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is "decorated" with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show that almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points.

1 Nonlinear dimensionality reduction (NLDR)

Spectral NLDR methods are graph embedding problems where a set of N points X = [x_1, · · · , x_N] ∈ R^{D×N}, sampled from a low-dimensional manifold in an ambient space R^D, is reparameterized by imposing a neighborhood graph G on X and embedding the graph with minimal distortion in a "parameterization" space R^d, d < D. Typically the graph is sparse and local, with edges connecting points to their immediate neighbors. The embedding must keep these edges short or preserve their length (for isometry) or angles (for conformality).

The graph-embedding problem was first introduced as a least-squares problem by Tutte [1], and as an eigenvalue problem by Fiedler [2]. The use of sparse graphs to generate metrics for least-squares problems has been studied intensely in the following three decades (see [3]). Modern NLDR methods use graph constraints to generate a metric in a space of embeddings R^N. Eigenvalue decomposition (EVD) gives the directions of least or greatest variance under this metric; typically a subset of d extremal eigenvectors gives the embedding of the N points in the R^d parameterization space. This includes the IsoMap family [4], the locally linear embedding (LLE) family [5,6], and Laplacian methods [7,8]. Using similar methods, the Automatic Alignment [6] and Charting [9] algorithms embed local subspaces instead of points, and by combining subspace projections they obtain continuous maps between R^D and R^d.

This paper introduces a general algebraic framework for computing optimal embeddings directly from graph constraints. The aforementioned methods can be recovered as special cases. The framework also suggests some new methods with very attractive properties, including continuous maps, reduced computational complexity, and control over the degree of conformality/isometry in the desired map. It also eliminates a solution instability that is intrinsic to EVD-based approaches; a perturbational analysis quantifies the instability.

2 Minimax theorem for graph embeddings

We begin with a neighborhood graph specified by a nondiagonal weighted adjacency matrix M ∈ R^{N×N} that has the data-reproducing property XM = X (this can be relaxed to XM ≈ X in practice). The graph-embedding and NLDR literatures offer various constructions of M, each appropriate to different sets of assumptions about the original embedding and its sampling X (e.g., isometry, local linearity, noiseless samples, regular sampling, etc.).
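The barycentric rule described in the next paragraph gives one concrete instance of such an M. A minimal sketch (the function name and the choice of k are illustrative; columns of X are points, matching the X ∈ R^{D×N} convention above):

```python
import numpy as np
from scipy.spatial import cKDTree

def barycentric_M(X, k=8):
    """Sketch of a barycentric connectivity matrix: each point is the
    average of its k nearest neighbours, so M_ij = 1/k on graph edges."""
    N = X.shape[1]                               # X is D x N, columns are points
    _, idx = cKDTree(X.T).query(X.T, k=k + 1)    # first hit is the point itself
    M = np.zeros((N, N))
    for j in range(N):
        M[idx[j, 1:], j] = 1.0 / k               # column j mixes j's neighbours
    return M
```

With this orientation, column j of XM averages the neighbours of x_j, so XM ≈ X on a densely sampled manifold.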
Typically M_ij ≠ 0 if points i, j are nearby on the intrinsic manifold, and |M_ij| is small or zero otherwise. Each point is taken to be a linear or convex combination of its neighbors, and thus M specifies manifold connectivity in the sense that any nondegenerate embedding Y that satisfies YM ≈ Y with small residual ‖YM − Y‖_F will preserve this connectivity and the structure of local neighborhoods. For example, in barycentric embeddings, each point is the average of its neighbors, and thus M_ij = 1/k if vertex i is connected to vertex j (of degree k). We will also consider three optional constraints on the embedding:

1. A null-space restriction, where the solution must be outside the column-space of C ∈ R^{N×M}, M < N. For example, it is common to stipulate that the solution Y be centered, i.e., YC = 0 for C = 1, the constant vector.

2. A basis restriction, where the solution must be a linear combination of the rows of a basis Z ∈ R^{K×N}, K ≤ N. This can be thought of as information placed at the vertices of the graph that serves as example inputs for a target NLDR function. We will use this to construct dimension-reducing radial basis function networks.

3. A metric Σ ∈ R^{N×N} that determines how error is distributed over the points. For example, it might be important that boundary points have less error. We assume that Σ is symmetric positive definite and has factorization Σ = AA^T (e.g., A could be a Cholesky factor of Σ).

In most settings, the optional matrices will default to the identity matrix. In this context, we define the per-dimension embedding error of row-vector y_i ∈ rows(Y) to be

    E_M(y_i) = max_{D ∈ R^{M×N}} ‖(y_i(M + CD) − y_i)A‖ / ‖y_i A‖,   y_i ∈ range(Z),   (1)

where D is a matrix constructed by an adversary to maximize the error. The optimizing y_i is a vector inside the subspace spanned by the rows of Z and outside the subspace spanned by the columns of C, for which the reconstruction residual y_i M − y_i has smallest norm w.r.t. the metric Σ. The following theorem identifies the optimal embedding Y for any choice of M, Z, C, Σ.

Minimax solution: Let Q ∈ S^{K×P} be a column-orthonormal basis of the left null-space of ZC (so that Q^T ZC = 0), with P = K − rank(C). Let B ∈ R^{P×P} be a square factor satisfying B^T B = Q^T ZΣZ^T Q, e.g., a Cholesky factor (or the "R" factor in a QR-decomposition of (Q^T ZA)^T). Compute the left singular vectors U ∈ S^{P×P} of U diag(s)V^T = B^{−T} Q^T Z(I − M)A, with singular values s = [s_1, · · · , s_P] ordered s_1 ≤ s_2 ≤ · · · ≤ s_P. Using the leading columns U_{1:d} of U, set Y = U_{1:d}^T B^{−T} Q^T Z.

Theorem 1. Y is the optimal (minimax) embedding in R^d with error ‖[s_1, · · · , s_d]‖_2:

    Y = U_{1:d}^T B^{−T} Q^T Z = arg min_{Y ∈ R^{d×N}} ∑_{y_i ∈ rows(Y)} E_M(y_i)^2,   with E_M(y_i) = s_i.   (2)

Appendix A develops the proof and other error measures that are minimized. Local NLDR techniques are easily expressed in this framework. When Z = A = I, C = [], and M reproduces X through linear combinations with M1 = 1, we recover LLE [5]. When Z = I, C = [], I − M is the normalized graph Laplacian, and A is a diagonal matrix of vertex degrees, we recover Laplacian eigenmaps [7]. When further Z = X, we recover locality preserving projections [8].
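A numpy sketch of the minimax solution above. The transpose conventions follow the Appendix A identities (L = QB^{-1}U, Y = L^T Z) as reconstructed here, so treat them as an assumption; the sketch also assumes a nonempty constraint C (e.g., the centering column C = np.ones((N, 1))) and a positive definite Q^T ZΣZ^T Q:

```python
import numpy as np
from scipy.linalg import cholesky, null_space, svd

def minimax_embedding(M, Z, C, A, d):
    """Sketch of Theorem 1: project out the rows of ZC, whiten by the
    metric Q'Z Sigma Z'Q, and embed with an ordinary SVD."""
    N = M.shape[0]
    Q = null_space((Z @ C).T)                   # K x P basis with Q'ZC = 0
    B = cholesky(Q.T @ Z @ A @ A.T @ Z.T @ Q)   # upper triangular, B'B = Q'Z Sigma Z'Q
    G = np.linalg.solve(B.T, Q.T @ Z @ (np.eye(N) - M) @ A)
    U, s, Vt = svd(G, full_matrices=False)
    idx = np.argsort(s)                         # the theorem orders s ascending
    Y = U[:, idx[:d]].T @ np.linalg.solve(B.T, Q.T @ Z)
    return Y, s[idx[:d]]                        # d x N embedding and its errors
```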
3 Analysis and generalization of charting

The minimax construction of charting [9] takes some development, but offers an interesting insight into the above-mentioned methods. Recall that charting first solves for a set of local affine subspace axes S_1 ∈ R^{D×d}, S_2, · · · at offsets µ_1 ∈ R^D, µ_2, · · · that best cover the data and vary smoothly over the manifold. Each subspace offers a chart, a local parameterization of the data by projection onto the local axes. Charting then constructs a weighted mixture of affine projections that merges the charts into a global parameterization. If the data manifold is curved, each projection will assign a point a slightly different embedding, so the error is measured as the variance of these proposed embeddings about their mean. This maximizes consistency and tends to produce isometric embeddings; [9] discusses ways to explicitly optimize the isometry of the embedding.

Under the assumption of isometry, the charting error is equivalent to the sum-squared displacements of an embedded point relative to its immediate neighbors (summed over all neighborhoods). To construct the same error criterion in the minimax setting, let x_{i−k}, · · · , x_i, · · · , x_{i+k} denote the points in the ith neighborhood and let the columns of V_i ∈ R^{(2k+1)×d} be an orthonormal basis of the rows of the local parameterization S_i^T [x_{i−k}, · · · , x_i, · · · , x_{i+k}]. Then a nonzero reparameterization will satisfy [y_{i−k}, · · · , y_{i+k}] V_i V_i^T = [y_{i−k}, · · · , y_{i+k}] if and only if it preserves the relative positions of the points in the local parameterization. Conversely, any relative displacements of the points are isolated by the formula [y_{i−k}, · · · , y_{i+k}](I − V_i V_i^T). Minimizing the Frobenius norm of this expression is thus equivalent to minimizing the local error in charting. We sum these constraints over all neighborhoods to obtain the constraint matrix M = I − ∑_i F_i (I − V_i V_i^T) F_i^T, where (F_i)_{kj} = 1 iff the jth point of the ith neighborhood is the kth point of the dataset. Because V_i V_i^T and (I − V_i V_i^T) are complementary, it follows that the error criterion of any local NLDR method (e.g., LLE, Laplacian eigenmaps, etc.) must measure the projection of the embedding onto some subspace of (I − V_i V_i^T).

To construct a continuous map, charting uses an overcomplete radial basis function (RBF) representation Z = [z(x_1), z(x_2), · · · , z(x_N)], where z(x) is a vector that stacks z_1(x), z_2(x), etc., with

    z_m(x) = (p_m(x) / ∑_m p_m(x)) · [K_m(x − µ_m); 1],   (3)
    p_m(x) = N(x | µ_m, Σ_m) ∝ e^{−(x−µ_m)^T Σ_m^{−1} (x−µ_m)/2},   (4)

and K_m is any local linear dimensionality reducer, typically S_m itself. Each column of Z contains many "views" of the same point that are combined to give its low-dimensional embedding. Finally, we set C = 1, which forces the embedding of the full data to be centered. Applying the minimax solution to these constraints yields the RBF network mixing matrix, f(x) = U_{1:d}^T B^{−T} Q^T z(x). Theorem 1 guarantees that the resulting embedding is least-squares optimal w.r.t. Z, M, C, A at the datapoints f(x_i), and because f(·) is an affine transform of z(·), it smoothly interpolates the embedding between points.
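A sketch of the feature map of eqs. (3)-(4). The trailing 1 appended to each chart's projection is reconstructed from the garbled source, and the Gaussian determinant factors are dropped (exact only when the Σ_m share a determinant), so both are assumptions:

```python
import numpy as np

def rbf_features(x, mus, Sigmas, Ks):
    """Sketch of eqs. (3)-(4): Gaussian responsibilities weight each
    chart's local projection, with a trailing 1 so the global affine
    map can carry per-chart offsets."""
    logp = np.array([-0.5 * (x - mu) @ np.linalg.solve(S, x - mu)
                     for mu, S in zip(mus, Sigmas)])
    p = np.exp(logp - logp.max())            # subtract max for stability
    p /= p.sum()                             # responsibilities p_m(x)
    parts = [pm * np.append(K @ (x - mu), 1.0)
             for pm, K, mu in zip(p, Ks, mus)]
    return np.concatenate(parts)             # stacked feature vector z(x)
```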
There are some interesting variants:

One-shot charting: If we set the local dimensionality reducers to the identity matrix (all K_m = I), then the minimax method jointly optimizes the local dimensionality reduction to charts and the global coordination of the charts (under any choice of M). This requires that rows(Z) ≤ N for a fully determined solution.

Discrete isometric charting: If Z = I, then we directly obtain a discrete isometric embedding of the data rather than a continuous map, making this a local equivalent of IsoMap.

Reduced basis charting: Let Z be constructed using just a small number of kernels randomly placed on the data manifold, such that rows(Z) ≪ N. Then the size of the SVD problem is substantially reduced.

Fig. 1: Minimax and generalized EVD solutions for a kernel eigenmap of a non-developable swiss roll (panels: generalized EVD and minimax SVD, with upper-right and lower-left corner details inset). Points are connected into a grid which ideally should be regular. The EVD solution shows substantial degradation; the insets detail corners where the EVD solution crosses itself repeatedly. The border compression is characteristic of Laplacian constraints.

4 Numerical advantage of minimax method

Note that the minimax method projects the constraint matrix M into a subspace derived from C and Z and decomposes it there. This suppresses unwanted degrees of freedom (DOFs) admitted by the problem constraints, for example the trivial R^0 embedding where all points are mapped to a single point y_i = N^{−1/2}. The R^0 embedding serves as a translational DOF in the solution. LLE- and eigenmap-based methods construct M to have a constant null-space so that the translational DOF will be isolated in the EVD as a null eigenvalue paired with a constant eigenvector, which is then discarded. However, section 4.1 shows that this construction makes the EVD increasingly unstable as problem size grows and/or the data becomes increasingly amenable to low-residual embeddings, ultimately causing solution collapse. As the next paragraph demonstrates, the problem is exacerbated when embedding w.r.t. a basis Z (via the equivalent generalized eigenproblem), partly because the eigenvector associated with the unwanted DOF can have arbitrary structure. In all cases the problem can be averted by using the minimax formulation with C = 1 to suppress the DOF.

A 2D plane was embedded in 3D with a curl, a twist, and 2.5% Gaussian noise, then regularly sampled at 900 points. We computed a kernelized Laplacian eigenmap using 70 random points as RBF centers, i.e., a continuous map using M derived from the graph Laplacian and Z constructed as above. The map was computed both via the minimax (SVD) method and via the equivalent generalized eigenproblem, where the translational degree of freedom must be removed by discarding an eigenvector from the solution. The two solutions are algebraically equivalent in every other regard. A variety of eigensolvers were tried; we took the best result.

Fig. 2: Excess energy in the eigenspectrum indicates that the translational DOF has contaminated many eigenvectors. If the EVD had successfully isolated the unwanted DOF, then its remaining eigenvalues should be identical to those derived from the minimax solution. The graph at left shows the difference in the eigenspectra. The graph at right shows the EVD solution's deviation from the translational vector y_0 = 1 · N^{−1/2} ≈ 0.0333. If the numerics were perfect the line would be flat, but in practice the deviation is significant enough (roughly 1% of the diameter of the embedding) to noticeably perturb points in figure 1.
Figure 1 shows that the EVD solution exhibits many defects, particularly a folding-over of the manifold at the top and bottom edges and at the corners. Figure 2 shows that the noisiness of the EVD solution is due largely to mutual contamination of numerically unstable eigenvectors.

4.1 Numerical instability of eigen-methods

The following theorem uses tools of matrix perturbation theory to show that as the problem size increases, the desired and unwanted eigenvectors become increasingly wobbly and gradually contaminate each other, leading to degraded solutions. More precisely, the low-order eigenvalues are ill-conditioned and exhibit multiplicities that may be true (due to noiseless samples from low-curvature manifolds) or false (due to numerical noise). Although in many cases some post-hoc algebra can "filter" the unwanted components out of the contaminated eigensolution, it is not hard to construct cases where the eigenvectors cannot be cleanly separated. The minimax formulation is immune to this problem because it explicitly suppresses the gratuitous component(s) before matrix decomposition.

Theorem 2. For any finite numerical precision, as the number of points N increases, the Frobenius norm of the numerical noise in the null eigenvector v_0 can grow as O(N^{3/2}), and the eigenvalue problem can approach a false multiplicity at a rate as fast as O(N^{3/2}), at which point the eigenvectors of interest (embedding and translational) are mutually contaminated and/or have an indeterminate eigenvalue ordering.

Please see appendix B for the proof. This theorem essentially lower-bounds an upper bound on error; examples can be constructed in which the problem is worse. For example, it can be shown analytically that when embedding points drawn from the simple curve x_i = [a, cos πa]^T, a ∈ [0, 1], with K = 2 neighbors, instabilities cannot be bounded better than O(N^{5/2}); empirically we see eigenvector mixing with N < 100 points, and we see it grow at the rate ≈ O(N^4) in many different eigensolvers. At very large scales, more pernicious instabilities set in; e.g., by N = 20000 points, the solution begins to fold over. Although the algebraic multiplicity and instability of the eigenproblem is conceptually a minor oversight in the algorithmic realizations of eigenfunction embeddings, as theorem 2 shows, the consequences are eventually fatal.
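A rough empirical check of the scaling argument behind Theorem 2, on a K-nearest-neighbour ring graph; the graph construction and the noise magnitude eps are illustrative assumptions:

```python
import numpy as np

def gap_vs_noise(N, K=2, eps=1e-8, seed=0):
    """Compare the eigengap of I - W to the norm of a small sparse
    perturbation E; their ratio grows with N as Theorem 2 predicts."""
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for i in range(N):                       # K neighbours on each side of a ring
        for k in range(1, K + 1):
            W[i, (i + k) % N] = W[(i + k) % N, i] = 1.0 / (2 * K)
    lam = np.sort(np.linalg.eigvalsh(np.eye(N) - W))
    gap = lam[1] - lam[0]                    # lambda_1 - lambda_0 (lambda_0 = 0)
    E = eps * rng.standard_normal((N, N)) * (W != 0)   # noise on the sparsity pattern
    return np.linalg.norm(E), gap            # ||E||_F grows, gap shrinks
```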
5 Summary

One of the most appealing aspects of the spectral NLDR literature is that algorithms are usually motivated from analyses of linear operators on smooth differentiable manifolds, e.g., [7]. Understandably, these analyses rely on assumptions (e.g., smoothness, isometry, or noiseless sampling) that make it difficult to predict what algorithmic realizations will do when real, noisy data violates these assumptions. The minimax embedding theorem provides a complete algebraic characterization of this discrete NLDR problem, and provides a solution that recovers numerically robustified versions of almost all known algorithms. It offers a principled way of constructing new algorithms with clear optimality properties and good numerical conditioning, notably the construction of a continuous NLDR map (an RBF network) in a one-shot optimization (SVD). We have also shown how to cast several local NLDR principles in this framework, and how to upgrade these methods to give continuous maps. Working in the opposite direction, we sketched the minimax formulation of isometric charting and showed that its constraint matrix contains a superset of all the algebraic constraints used in local NLDR techniques.

References

1. W. T. Tutte. How to draw a graph. Proc. London Mathematical Society, 13:743-768, 1963.
2. Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czech. Math. Journal, 25:619-633, 1975.
3. Fan R. K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.
4. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, December 22, 2000.
5. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, December 22, 2000.
6. Yee Whye Teh and Sam T. Roweis. Automatic alignment of hidden representations. In Proc. NIPS-15, 2003.
7. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. In Advances in Neural Information Processing Systems, volume 14, 2002.
8. Xiaofei He and Partha Niyogi. Locality preserving projections. Technical Report TR-2002-09, University of Chicago Computer Science, October 2002.
9. Matthew Brand. Charting a manifold. In Advances in Neural Information Processing Systems, volume 15, 2003.
10. G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, 1990.

A Proof of minimax embedding theorem (1)

The burden of this proof is carried by the supporting lemmas below. To emphasize the proof strategy, we give the proof first; the supporting lemmas follow.

Proof. Setting y_i = l_i^T Z, we will solve for l_i ∈ columns(L). Writing the error in terms of l_i,

    E_M(l_i) = max_{K ∈ R^{M×N}} ‖l_i^T Z(I − M − CK)A‖ / ‖l_i^T ZA‖ = max_{K ∈ R^{M×N}} ‖l_i^T Z(I − M)A − l_i^T ZCKA‖ / ‖l_i^T ZA‖.   (5)

The term l_i^T ZCKA produces infinite error unless l_i^T ZC = 0, so we accept this as a constraint and seek

    min_{l_i^T ZC = 0} ‖l_i^T Z(I − M)A‖ / ‖l_i^T ZA‖.   (6)

By lemma 1, that orthogonality is satisfied by solving the problem in the space orthogonal to ZC; the basis for this space is given by the columns of Q = null((ZC)^T). By lemma 2, the denominator of the error specifies the metric in solution space to be ZAA^T Z^T; when the problem is projected into the space orthogonal to ZC, it becomes Q^T (ZAA^T Z^T) Q. Nesting the "orthogonally-constrained-SVD" construction of lemma 1 inside the "SVD-under-a-metric" lemma 2, we obtain a solution that uses the correct metric in the orthogonal space:

    B^T B = Q^T ZAA^T Z^T Q,   (7)
    U diag(s) V^T = B^{−T} {Q^T (Z(I − M)A)},   (8)
    L = QB^{−1}U,   (9)

where the braces indicate the nesting of the lemmas. By the "best-projection" lemma (#3), if we order the singular values by ascending magnitude,

    L_{1:d} = arg min_{J ∈ R^{N×d}} ∑_{j_i ∈ cols(J)} (‖j^T Z(I − M)A‖ / ‖j‖_{ZΣZ^T})^2.   (10)

The proof is completed by making the substitutions L^T Z → Y and ‖x^T A‖ → ‖x‖_Σ (for Σ = AA^T), and leaving off the final square root operation, to obtain

    (Y^T)_{1:d} = arg min_{J ∈ R^{N×d}} ∑_{j_i ∈ cols(J)} ‖j^T (I − M)‖_Σ^2 / ‖j‖_Σ^2.   (11)

Lemma 1. Orthogonally constrained SVD: The left singular vectors L of matrix M under the constraint U^T C = 0 are calculated as Q = null(C^T), U diag(s)V^T = SVD(Q^T M), L = QU.

Proof. First observe that L is orthogonal to C: by definition, the null-space basis satisfies Q^T C = 0, thus L^T C = U^T Q^T C = 0. Let J be an orthonormal basis for C, with J^T J = I and Q^T J = 0.
Then L diag(s)V^T = QQ^T M = (I − JJ^T)M, the orthogonal projector of C applied to M, proving that the SVD captures the component of M that is orthogonal to C.

Lemma 2. SVD with respect to a metric: The vectors l_i ∈ L, v_i ∈ V that diagonalize matrix M with respect to a positive definite column-space metric Σ are calculated as B^T B = Σ, U diag(s)V^T = SVD(B^{−T} M), L = B^{−1}U; they satisfy ‖l_i^T M‖ / ‖l_i‖_Σ = s_i and extremize this form for the extremal singular values s_min, s_max.

Proof. By construction, L and V diagonalize M:

    L^T MV = (B^{−1}U)^T MV = U^T (B^{−T} M)V = diag(s)   (12)

and B^T U diag(s)V^T = M. Forming the Gram matrices of both sides of the last line, we obtain the identity V diag(s)^2 V^T = M^T B^{−1} B^{−T} M = M^T Σ^{−1} M, which demonstrates that the s_i ∈ s are the singular values of M w.r.t. the column-space metric Σ. Finally, L is orthonormal w.r.t. the metric Σ, because ‖L‖_Σ^2 = L^T ΣL = U^T B^{−T} B^T B B^{−1} U = I. Consequently,

    ‖l_i^T M‖ / ‖l_i‖_Σ = ‖s_i v_i^T‖ / 1 = s_i,   (13)

and by the Courant-Hilbert theorem,

    s_max = max_l ‖l^T M‖ / ‖l‖_Σ;   s_min = min_l ‖l^T M‖ / ‖l‖_Σ.   (14)

Lemma 3. Best projection: Taking L and s from lemma 2, let the columns of L and the elements of s be sorted so that s_1 ≥ s_2 ≥ · · · ≥ s_N. Then for any dimensionality 1 ≤ d ≤ N,

    L_{1:d} = [l_1, · · · , l_d] = arg max_{J ∈ R^{N×d}} ‖J^T M‖_{(J^T ΣJ)^{−1}}   (15)
    = arg max_{J ∈ R^{N×d}, J^T ΣJ = I} ‖J^T M‖_F   (16)
    = arg max_{J ∈ R^{N×d}} ∑_{j_i ∈ cols(J)} (‖j^T M‖ / ‖j‖_Σ)^2,   (17)

with the optimum value of all right-hand sides being (∑_{i=1}^d s_i^2)^{1/2}. If the sort order is reversed, the minimum of this form is obtained.

Proof. By the Eckart-Young-Mirsky theorem, if U^T MV = diag(s) with the singular values sorted in descending order, then U_{1:d} = [u_1, · · · , u_d] = arg max_{U ∈ S^{N×d}} ‖U^T M‖_F. We first extend this to a non-orthonormal basis J under a Mahalanobis norm:

    max_{J ∈ R^{N×d}} ‖J^T M‖_{(J^T J)^{−1}} = max_{U ∈ S^{N×d}} ‖U^T M‖_F,   (18)

because ‖J^T M‖^2_{(J^T J)^{−1}} = trace(M^T J(J^T J)^{−1} J^T M) = trace(M^T JJ^+ (JJ^+)^T M) = ‖(JJ^+)M‖_F^2 = ‖UU^T M‖_F^2 = ‖U^T M‖_F^2, since JJ^+ is a (symmetric) orthogonal projector having binary eigenvalues λ ∈ {0, 1}, and is therefore the Gram matrix of a thin orthogonal matrix. We then impose a metric Σ on the column-space of J to obtain the first criterion (equation 15), which asks what maximizes the variance in J^T M while minimizing the norm of J w.r.t. the metric Σ. Here it suffices to substitute in the leading (resp., trailing) columns of L and verify that the norm is maximized (resp., minimized). Expanding,

    ‖L_{1:d}^T M‖^2_{(L_{1:d}^T ΣL_{1:d})^{−1}} = trace((L_{1:d}^T M)^T (L_{1:d}^T ΣL_{1:d})^{−1} (L_{1:d}^T M)) = trace((L_{1:d}^T M)^T I (L_{1:d}^T M)) = trace((diag(s_{1:d})V_{1:d}^T)^T (diag(s_{1:d})V_{1:d}^T)) = ‖s_{1:d}‖^2.

Again, by the Eckart-Young-Mirsky theorem, these are the maximal variance-preserving projections, so the first criterion is indeed maximized by setting J to the columns of L corresponding to the largest values in s. Criterion #2 restates the first criterion with the set of candidates for J restricted to (the hyperelliptical manifold of) matrices that reduce the metric on the norm to the identity matrix (thereby recovering the Frobenius norm). Criterion #3 merely expands the above trace into individual singular values. Note that the numerator and denominator can have different metrics because they are norms in different spaces, possibly of different dimension. Finally, that the trailing d eigenvectors minimize these criteria follows directly from the fact that the leading N − d singular values account for the maximal part of the variance.
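Lemma 2 translates directly into a few lines of numpy. A minimal sketch, assuming Σ is symmetric positive definite (the function name is illustrative):

```python
import numpy as np
from scipy.linalg import cholesky, svd

def svd_with_metric(M, Sigma):
    """Sketch of Lemma 2: diagonalize M w.r.t. a positive definite
    column-space metric Sigma via an ordinary SVD of B^{-T} M."""
    B = cholesky(Sigma)                        # upper triangular, B'B = Sigma
    U, s, Vt = svd(np.linalg.solve(B.T, M), full_matrices=False)
    L = np.linalg.solve(B, U)                  # L = B^{-1} U
    return L, s, Vt.T                          # ||l_i' M|| / ||l_i||_Sigma = s_i
```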
B Proof of instability theorem (2)

Proof. When generated from a sparse graph with average degree K, the weighted connectivity matrix W is sparse and has O(NK) entries. Since the graph vertices represent samples from a smooth manifold, increasing the sampling density N does not change the distribution of magnitudes in W. Consider a perturbation of the nonzero values in W, e.g., W → W' = W + E, due to numerical noise E created by finite machine precision. By the weak law of large numbers, the Frobenius norm of the sparse perturbation grows as ‖E‖_F ∼ O(√N). However, the t-th smallest nonzero eigenvalue λ_t(W) behaves as λ_t(W) = v_t^T W v_t ∼ O(N^{−1}), because the elements of the corresponding eigenvector v_t grow as O(N^{−1/2}) and only K of those elements are multiplied by nonzero values to form each element of W v_t. In sum, the perturbation ‖E‖_F grows while the eigenvalue λ_t(W) shrinks.

In linear embedding algorithms, the eigengap of interest is λ_gap = λ_1 − λ_0. The tail eigenvalue λ_0 = 0 by construction, but it is possible that λ_0 > 0 with numerical error; thus λ_gap ≤ λ_1. Combining these facts, the ratio between the perturbation and the eigengap grows as ‖E‖_F / λ_gap ∼ O(N^{3/2}) or faster.

Now consider the shifted eigenproblem I − W, with leading (maximal) eigenvalues 1 − λ_0 ≥ 1 − λ_1 ≥ · · · and unchanged eigenvectors. From matrix perturbation theory [10, thm. V.2.8], when W is perturbed to W' = W + E, the change in the leading eigenvalue from 1 − λ_0 to 1 − λ_0' is bounded as |λ_0 − λ_0'| ≤ √2 ‖E‖_F, and similarly 1 − λ_1' ≤ 1 − λ_1 + √2 ‖E‖_F. Thus λ_gap' ≥ λ_gap − 2√2 ‖E‖_F. Since ‖E‖_F / λ_gap ∼ O(N^{3/2}), the right-hand side of the gap bound goes negative at a supralinear rate, implying that the eigenvalue ordering eventually becomes unstable, with the possibility of the first and second eigenvalue/vector pairs being swapped.

Mutual contamination of the eigenvectors happens well before: under general (dense) conditions, the change in the eigenvector v_0 is bounded as ‖v_0' − v_0‖ ≤ 4‖E‖_F / (|λ_0 − λ_1| − √2 ‖E‖_F) [10, thm. V.2.8]. (This bound is often tight enough to serve as a good approximation.) Specializing this to the sparse embedding matrix, we find that the bound weakens to

    ‖v_0' − 1 · N^{−1/2}‖ ∼ O(√N) / (O(N^{−1}) − O(√N)) > O(√N) / O(N^{−1}) = O(N^{3/2}).
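To observe the contamination described in the last paragraph numerically, one can measure how far the computed null eigenvector drifts from the ideal translational vector 1 · N^{−1/2}. A minimal sketch; the dense eigh solver and the sign convention are assumptions:

```python
import numpy as np

def null_vector_drift(W):
    """Deviation of the numerically computed null eigenvector of I - W
    from the ideal translational vector (cf. the right panel of Fig. 2)."""
    N = W.shape[0]
    lam, V = np.linalg.eigh(np.eye(N) - W)     # ascending eigenvalues
    v0 = V[:, 0] * np.sign(V[:, 0].sum())      # fix the arbitrary sign
    return np.linalg.norm(v0 - np.ones(N) / np.sqrt(N))
```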

same-paper 2 0.8485781 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki

Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K−means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1 −norm is unique with probability of one, which can be obtained using a linear programming algorithm. The comparisons of the l1 −norm solution and the l0 −norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data using the wavelet packets transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1
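The minimum-l1-norm coefficient recovery described in the abstract is a linear program. A minimal sketch via the standard u − v splitting (basis pursuit); the function name is illustrative, and the estimated mixing (basis) matrix A is assumed given:

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(A, x):
    """Solve min ||s||_1 subject to A s = x by splitting s = u - v with
    u, v >= 0, which turns the l1 objective into a linear program."""
    m, n = A.shape
    c = np.ones(2 * n)                          # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])                   # A u - A v = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v                                # sparse coefficient vector s
```

Applied column by column to the observed data matrix, this recovers the sparse coefficient matrix for a fixed overcomplete basis.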

3 0.72101974 107 nips-2003-Learning Spectral Clustering

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1

4 0.71943933 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

Author: Thomas Natschläger, Wolfgang Maass

Abstract: We employ an efficient method using Bayesian and linear classifiers for analyzing the dynamics of information in high-dimensional states of generic cortical microcircuit models. It is shown that such recurrent circuits of spiking neurons have an inherent capability to carry out rapid computations on complex spike patterns, merging information contained in the order of spike arrival with previously acquired context information. 1

5 0.71623099 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

Author: Amos J. Storkey

Abstract: Discrete Fourier transforms and other related Fourier methods have been practically implementable due to the fast Fourier transform (FFT). However, there are many situations where doing fast Fourier transforms without complete data would be desirable. In this paper it is recognised that formulating the FFT algorithm as a belief network allows suitable priors to be set for the Fourier coefficients. Furthermore, efficient generalised belief propagation methods between clusters of four nodes enable the Fourier coefficients to be inferred and the missing data to be estimated in near O(n log n) time, where n is the total of the given and missing data points. This method is compared with a number of common approaches such as setting the missing data to zero or interpolating it. It is tested on generated data and for a Fourier analysis of a damaged audio signal. 1

6 0.70955044 78 nips-2003-Gaussian Processes in Reinforcement Learning

7 0.70863843 66 nips-2003-Extreme Components Analysis

8 0.70806533 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

9 0.70786124 126 nips-2003-Measure Based Regularization

10 0.70769149 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

11 0.70504647 73 nips-2003-Feature Selection in Clustering Problems

12 0.70432723 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels

13 0.70364588 30 nips-2003-Approximability of Probability Distributions

14 0.70289052 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

15 0.70221579 113 nips-2003-Learning with Local and Global Consistency

16 0.70211083 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

17 0.70204574 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing

18 0.70188797 115 nips-2003-Linear Dependent Dimensionality Reduction

19 0.70188779 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

20 0.69993514 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data