nips nips2000 nips2000-61 knowledge-graph by maker-knowledge-mining

61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets


Source: pdf

Author: Ulrik Kjems, Lars Kai Hansen, Stephen C. Strother

Abstract: We demonstrate that statistical analysis of ill-posed data sets is subject to a bias which can be observed when projecting independent test set examples onto a basis defined by the training examples. Because the training examples in an ill-posed data set do not fully span the signal space, the observed training set variances in each basis vector will be too high compared to the average variance of the test set projections onto the same basis vectors. On the basis of this understanding we introduce the Generalizable Singular Value Decomposition (GenSVD) as a means to reduce this bias by re-estimating the singular values obtained in a conventional Singular Value Decomposition, allowing for a generalization performance increase of a subsequent statistical model. We demonstrate that the algorithm successfully corrects bias in a data set from a functional PET activation study of the human brain.

1 Ill-posed Data Sets

An ill-posed data set has more dimensions in each example than there are examples. Such data sets occur in many fields of research, typically in connection with image measurements. The associated statistical problem is that of extracting structure from the observed high-dimensional vectors in the presence of noise. The statistical analysis can be done either supervised (i.e. modelling with target values: classification, regression) or unsupervised (modelling with no target values: clustering, PCA, ICA). In both types of analysis the ill-posedness may lead to immediate problems if one tries to apply conventional statistical methods of analysis; for example, the empirical covariance matrix is prohibitively large and will be rank-deficient. A common approach is to use Singular Value Decomposition (SVD) or the analogous Principal Component Analysis (PCA) to reduce the dimensionality of the data. Let the N observed I-dimensional samples x_j, j = 1, ..., N, be collected in the data matrix X = [x_1 ... x_N] of size I x N, I > N. The SVD theorem states that such a matrix can be decomposed as X = U Λ V^T (1), where U is a matrix of the same size as X with orthogonal basis vectors spanning the space of X, so that U^T U = I_{NxN}. The square matrix Λ contains the singular values on the diagonal, Λ = diag(λ_1, ..., λ_N), which are ordered and positive, λ_1 ≥ λ_2 ≥ ... ≥ λ_N ≥ 0, and V is N x N and orthogonal, V^T V = I_N. If there is a mean value significantly different from zero it may at times be advantageous to perform the above analysis on mean-subtracted data, i.e. X - X̄ = U Λ V^T, where the columns of X̄ all contain the mean vector x̄ = Σ_j x_j / N. Each observation x_j can be expressed in coordinates in the basis defined by the vectors of U with no loss of information [Lautrup et al., 1995]. A change of basis is obtained by q_j = U^T x_j, i.e. the orthogonal basis rotation Q = [q_1 ... q_N] = U^T X = U^T U Λ V^T = Λ V^T (2). Since Q is only N x N and N « I, Q is a compact representation of the data. Having now N examples of dimension N, we have reduced the problem to a marginally ill-posed one. To further reduce the dimensionality, it is common to retain only a subset of the coordinates, e.g. the top P coordinates (P < N), and the supervised or unsupervised model can be formed in this smaller but now well-posed space. So far we have considered the procedure for modelling from a training set. Our hope is that the statistical description generalizes well to new examples, proving that it is a good description of the generating process.

The model should, in other words, be able to perform well on a new example x*, and in the above framework this would mean that predictions based on q* = U^T x* should generalize well. We will show in the following that, in general, the distribution of the test set projection q* is quite different from the statistics of the projections of the training examples q_j. It has been noted in previous work [Hansen and Larsen, 1996, Roweis, 1998, Hansen et al., 1999] that PCA/SVD of ill-posed data does not by itself represent a probabilistic model where we can assign a likelihood to a new test data point, and procedures have been proposed which make this possible. In [Bishop, 1999] PCA has been considered in a Bayesian framework, but this does not address the significant bias of the variance in training set projections in ill-posed data sets. In [Jackson, 1991] an asymptotic expression is given for the bias of eigen-values in a sample covariance matrix, but this expression is valid only in the well-posed case and is not applicable for ill-posed data.

1.1 Example

Let the signal source be an I-dimensional multivariate Gaussian distribution N(0, Σ) with a covariance matrix whose first K eigen-values equal σ² and whose last I - K are zero, so that the covariance matrix has the decomposition Σ = σ² Y D Y^T, D = diag(1, ..., 1, 0, ..., 0), Y^T Y = I (3). Our N samples of the distribution are collected in the matrix X = [x_ij] with the SVD X = U Λ V^T (4), Λ = diag(λ_1, ..., λ_N), and the representation of the N examples in the N basis vector coordinates defined by U is Q = [q_ij] = U^T X = Λ V^T. The total variance per training example is (1/N) Σ_{i,j} x_ij² = (1/N) Tr(X^T X) = (1/N) Tr(V Λ U^T U Λ V^T) = (1/N) Tr(V Λ² V^T) = (1/N) Tr(V V^T Λ²) = (1/N) Tr(Λ²) = (1/N) Σ_i λ_i² (5). Note that this variance is the same in the U-basis coordinates, (1/N) Σ_{i,j} q_ij².
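
To make the decomposition and the change of basis concrete, here is a minimal NumPy sketch (not part of the paper; the dimensions I = 1000, N = 20 are made up for illustration) that computes the SVD of an ill-posed data matrix, forms the compact representation Q = Λ V^T of Eq. (2), and projects a held-out example onto the training basis as q* = U^T x*.

    import numpy as np

    rng = np.random.default_rng(0)
    I, N = 1000, 20                       # far more dimensions than examples (ill-posed)
    X = rng.normal(size=(I, N))           # training data, one example per column

    # Economy-size SVD: U is I x N, s holds the N singular values, Vt is N x N
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Compact N x N representation of the training set, Eq. (2): Q = U^T X = Lambda V^T
    Q = np.diag(s) @ Vt
    assert np.allclose(Q, U.T @ X)

    # A new test example is represented by its projection onto the training basis
    x_star = rng.normal(size=I)
    q_star = U.T @ x_star                 # q* = U^T x*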

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We demonstrate that statistical analysis of ill-posed data sets is subject to a bias which can be observed when projecting independent test set examples onto a basis defined by the training examples. [sent-9, score-0.5]

2 Because the training examples in an ill-posed data set do not fully span the signal space, the observed training set variances in each basis vector will be too high compared to the average variance of the test set projections onto the same basis vectors. [sent-10, score-1.12]

3 We demonstrate that the algorithm successfully corrects bias in a data set from a functional PET activation study of the human brain. [sent-12, score-0.152]

4 1 Ill-posed Data Sets An ill-posed data set has more dimensions in each example than there are examples. [sent-13, score-0.056]

5 Such data sets occur in many fields of research typically in connection with image measurements. [sent-14, score-0.028]

6 The associated statistical problem is that of extracting structure from the observed high-dimensional vectors in the presence of noise. [sent-15, score-0.044]

7 The statistical analysis can be done either supervised (i.e. [sent-16, score-0.041]

8 modelling with target values: classification, regression) or unsupervised (modelling with no target values: clustering, PCA, ICA). [sent-17, score-0.12]

9 In both types of analysis the ill-posedness may lead to immediate problems if one tries to apply conventional statistical methods of analysis, for example the empirical covariance matrix is prohibitively large and will be rank-deficient. [sent-18, score-0.21]
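
As a small illustration of this rank deficiency (a sketch with made-up sizes, not from the paper), the empirical covariance of N = 20 examples in I = 1000 dimensions is a 1000 x 1000 matrix of rank at most 20:

    import numpy as np

    rng = np.random.default_rng(0)
    I, N = 1000, 20
    X = rng.normal(size=(I, N))                  # N examples of I dimensions
    Sigma_hat = X @ X.T / N                      # I x I empirical covariance matrix
    print(np.linalg.matrix_rank(Sigma_hat))      # at most N = 20: rank-deficient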

10 A common approach is to use Singular Value Decomposition (SVD) or the analogous Principal Component Analysis (PCA) to reduce the dimensionality of the data. [sent-19, score-0.027]

11 Let the N observed I-dimensional samples x_j, j = 1, ..., N, be collected in the data matrix X = [x_1 ... x_N] of size I x N, I > N. [sent-20, score-0.073]

12 The SVD theorem states that such a matrix can be decomposed as X = U Λ V^T (1), where U is a matrix of the same size as X with orthogonal basis vectors spanning the space of X, so that U^T U = I_{NxN}. [sent-25, score-0.327]

13 The square matrix Λ contains the singular values on the diagonal, Λ = diag(λ_1, ..., λ_N). [sent-26, score-0.218]

14 The singular values are ordered and positive, λ_1 ≥ λ_2 ≥ ... ≥ λ_N ≥ 0, and V is N x N and orthogonal, V^T V = I_N. [sent-32, score-0.053]

15 If there is a mean value significantly different from zero it may at times be advantageous to perform the above analysis on mean-subtracted data, i.e. [sent-33, score-0.029]

16 X - X̄ = U Λ V^T, where the columns of X̄ all contain the mean vector x̄ = Σ_j x_j / N. [sent-35, score-0.029]

17 Each observation x_j can be expressed in coordinates in the basis defined by the vectors of U with no loss of information [Lautrup et al., 1995]. [sent-36, score-0.178]

18 A change of basis is obtained by q_j = U^T x_j, i.e. the orthogonal basis rotation Q = [q_1 ... q_N] = U^T X = Λ V^T (2). [sent-38, score-0.385]

19 Having now N examples of dimension N, we have reduced the problem to a marginally ill-posed one. [sent-43, score-0.091]

20 To further reduce the dimensionality, it is common to retain only a subset of the coordinates, e.g. [sent-44, score-0.027]

21 the top P coordinates (P < N) and the supervised or unsupervised model can be formed in this smaller but now well-posed space. [sent-46, score-0.161]

22 So far we have considered the procedure for modelling from a training set. [sent-47, score-0.141]

23 Our hope is that the statistical description generalizes well to new examples, proving that it is a good description of the generating process. [sent-48, score-0.091]

24 The model should, in other words, be able to perform well on a new example x*, and in the above framework this would mean that predictions based on q* = U^T x* should generalize well. [sent-49, score-0.029]

25 We will show in the following, that in general, the distribution of the test set projection q* is quite different from the statistics of the projections of the training examples qj. [sent-50, score-0.43]

26 It has been noted in previous work [Hansen and Larsen, 1996, Roweis, 1998, Hansen et al. [sent-51, score-0.032]

27 , 1999] that PCA/SVD of ill-posed data does not by itself represent a probabilistic model where we can assign a likelihood to a new test data point, and procedures have been proposed which make this possible. [sent-52, score-0.128]

28 In [Bishop, 1999] PCA has been considered in a Bayesian framework, but does not address the significant bias of the variance in training set projections in ill-posed data sets. [sent-53, score-0.432]

29 In [Jackson, 1991] an asymptotic expression is given for the bias of eigen-values in a sample covariance matrix, but this expression is valid only in the well-posed case and is not applicable for ill-posed data. [sent-54, score-0.229]

30 1.1 Example. Let the signal source be an I-dimensional multivariate Gaussian distribution N(0, Σ) with a covariance matrix whose first K eigen-values equal σ² and whose last I - K are zero, so that the covariance matrix has the decomposition Σ = σ² Y D Y^T, D = diag(1, ..., 1, 0, ..., 0), Y^T Y = I (3). [sent-56, score-0.404]

31 Our N samples of the distribution are collected in the matrix X = [x_ij] with the SVD X = U Λ V^T (4), Λ = diag(λ_1, ..., λ_N). [sent-62, score-0.177]

32 The representation of the N examples in the N basis vector coordinates defined by U is Q = [q_ij] = U^T X = Λ V^T. [sent-65, score-0.269]

33 The total variance per training example is (1/N) Σ_{i,j} x_ij² = (1/N) Tr(X^T X) = (1/N) Tr(V Λ U^T U Λ V^T) = (1/N) Tr(V Λ² V^T) = (1/N) Tr(V V^T Λ²) = (1/N) Tr(Λ²) = (1/N) Σ_i λ_i² (5). Note that this variance is the same in the U-basis coordinates, (1/N) Σ_{i,j} q_ij². [sent-66, score-0.325]
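
Equation (5) is easy to check numerically; the following sketch (my own, with arbitrary sizes, not from the paper) confirms that the total variance per example computed from the raw data, from the singular values, and from the U-basis coordinates coincide.

    import numpy as np

    rng = np.random.default_rng(1)
    I, N = 500, 10
    X = rng.normal(size=(I, N))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Q = U.T @ X

    var_raw   = (X ** 2).sum() / N        # (1/N) sum_ij x_ij^2
    var_sv    = (s ** 2).sum() / N        # (1/N) sum_i lambda_i^2
    var_basis = (Q ** 2).sum() / N        # the same variance in the U-basis coordinates
    print(var_raw, var_sv, var_basis)     # all three agree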

34 The training set variance is (K/N) σ² on average per coordinate, compared to σ² for the test examples. [sent-73, score-0.295]

35 From a modelling point of view, the variance from the test example tells us the true story, so the training set variance should be regarded as biased. [sent-75, score-0.445]

36 This suggests that the training set singular values should be corrected for this bias, in the above example by re-estimating the training set projections as Q̂ = √(N/K) Q. [sent-76, score-0.502]
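
The bias and the rescaling can be reproduced numerically. The sketch below is my own illustration (the values K = 200, N = 20, σ = 1 and the choice of Y as identity columns are assumptions, not taken from the paper): it draws training data from the rank-K Gaussian of Eq. (3), compares the average per-coordinate variance of training projections with that of test projections onto the same basis, and applies the √(N/K) rescaling.

    import numpy as np

    rng = np.random.default_rng(2)
    I, K, N, sigma = 1000, 200, 20, 1.0    # rank-K signal, N training examples

    def sample(n):
        # Signal lives in the first K coordinates (Y taken as identity columns)
        return np.vstack([sigma * rng.normal(size=(K, n)), np.zeros((I - K, n))])

    X = sample(N)                          # training set
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    train_var = (s ** 2).sum() / N ** 2    # average training variance per coordinate
    Q_test = U.T @ sample(1000)            # test projections onto the training basis
    test_var = (Q_test ** 2).mean()        # average test variance per coordinate
    print(train_var)                       # close to (K / N) * sigma^2 = 10
    print(test_var)                        # close to sigma^2 = 1

    Q_hat = np.sqrt(N / K) * np.diag(s) @ Vt    # rescaled training projections
    print((Q_hat ** 2).sum() / N ** 2)          # now close to sigma^2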

37 In the more general case we do not know K, and the true covariance may have an arbitrary eigen-spectrum. [sent-77, score-0.064]

38 The GenSVD algorithm below is a more general algorithm for correcting for the training set bias. [sent-78, score-0.108]

39 2 The GenSVD Algorithm. The data matrix consists of N statistically independent samples X = [x_1 ... x_N]. [sent-79, score-0.127]

40 X is of size I x N, and each column of X is assumed multivariate Gaussian, x_j ~ N(0, Σ), and is ill-posed with rank Σ > N. [sent-82, score-0.027]

41 With the SVD X = U_0 Λ_0 V_0^T, we now make the approximation that U_0 contains an actual subset of the true eigen-vectors of Σ (9), where we have collected the remaining (unspanned by X) eigen-vectors and values in U_⊥. [sent-83, score-0.109]

42 The unknown 'true' eigen-values corresponding to the observed eigen-vectors are collected in Λ = diag(λ_1, ..., λ_N). [sent-87, score-0.122]

43 It should be noted that a direct estimation of Σ using Σ̂ = (1/N) X X^T yields Σ̂ = (1/N) U_0 Λ_0 V_0^T V_0 Λ_0 U_0^T = (1/N) U_0 Λ_0² U_0^T, i.e. [sent-91, score-0.032]

44 The distribution of test samples x* inside the space spanned by U_0 is given in (10). The problem is that U_0 and the examples x_j are not independent, so U_0^T x_j is biased, e.g. [sent-94, score-0.274]

45 the SVD estimate Λ_0² of Λ² assigns all variance to lie within U_0. [sent-96, score-0.102]

46 The GenSVD algorithm bypasses this problem by, for each example, computing a basis on all other examples, estimating the variances in Λ² in a leave-one-out manner. [sent-97, score-0.152]
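
A direct (naive) rendering of this leave-one-out idea is sketched below. This is my reading of the procedure rather than the authors' listing, and the 1/N scaling of the re-estimate is an assumption: for each example a basis B_{-j} is computed from the remaining examples, the held-out example is projected through it into the fixed U_0 coordinates, and the variances λ̂_i² are taken from these projections.

    import numpy as np

    def gensvd_naive(X):
        """Leave-one-out re-estimation of the singular values (illustrative sketch)."""
        I, N = X.shape
        U0, s0, V0t = np.linalg.svd(X, full_matrices=False)
        Z = np.zeros((N, N))
        for j in range(N):
            X_mj = np.delete(X, j, axis=1)                       # all other examples
            B, _, _ = np.linalg.svd(X_mj, full_matrices=False)   # basis B_{-j}, size I x (N-1)
            # z_j = U0^T B_{-j} B_{-j}^T x_j, distributed like the projection of a test point
            Z[:, j] = U0.T @ (B @ (B.T @ X[:, j]))
        lam_hat = np.sqrt((Z ** 2).sum(axis=1) / N)              # assumed: lambda_hat_i^2 = (1/N) sum_j z_ij^2
        return U0, lam_hat, V0t

A subsequent model would then use diag(λ̂) in place of Λ_0 when forming the training coordinates.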

47 Since B_{-j} and x_j are independent, B_{-j}^T x_j has the same distribution as the projection of a test example x*, B_{-j}^T x*. [sent-99, score-0.16]

48 Now, since span B_{-j} = span X_{-j} and span U_0 = span [X_{-j} x_j], we have that span B_{-j} ⊆ span U_0, so we see that z_j and U_0^T B_{-j} B_{-j}^T x* are identically distributed. [sent-101, score-0.416]

49 This means that z_j has the covariance U_0^T B_{-j} B_{-j}^T Σ B_{-j} B_{-j}^T U_0, and using Eq. [sent-102, score-0.09]

50 (9) and that U_⊥^T B_{-j} = 0 (since U_⊥^T U_0 = 0) we get (12). We note that this distribution is degenerate because the covariance is of rank N - 1. [sent-103, score-0.141]

51 (13) and that the determinant is approximated by ...; the above expression is maximized when ... [sent-105, score-0.076]

52 Using Eq. (11) directly to compute an SVD of the matrix X_{-j} for each example is computationally demanding. [sent-112, score-0.131]

53 It is possible to compute z_j in a more efficient two-level procedure with the following algorithm: compute U_0 Λ_0 V_0^T = svd(X) and Q_0 = [q_j] = Λ_0 V_0^T. (Footnote: since z_j is degenerate, we define the likelihood over the space where z_j occurs, i.e. [sent-113, score-0.033]
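
The extracted text of the efficient procedure is incomplete, so the sketch below is a plausible reconstruction rather than the authors' listing: since span X_{-j} ⊆ span U_0, each leave-one-out basis can be computed from the small N x N coordinate matrix Q_0 = Λ_0 V_0^T instead of the I x N data matrix, which yields the same z_j as the naive version above at a fraction of the cost.

    import numpy as np

    def gensvd_two_level(X):
        """Two-level sketch: all per-example SVDs are done on the N x N matrix Q0."""
        I, N = X.shape
        U0, s0, V0t = np.linalg.svd(X, full_matrices=False)
        Q0 = np.diag(s0) @ V0t                        # Q0 = Lambda_0 V0^T = U0^T X
        Z = np.zeros((N, N))
        for j in range(N):
            Q_mj = np.delete(Q0, j, axis=1)           # coordinates of the other examples
            C, _, _ = np.linalg.svd(Q_mj, full_matrices=False)   # N x (N-1); B_{-j} = U0 C
            Z[:, j] = C @ (C.T @ Q0[:, j])            # equals U0^T B_{-j} B_{-j}^T x_j
        lam_hat = np.sqrt((Z ** 2).sum(axis=1) / N)
        return U0, lam_hat, V0t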

54 λ̂_i² = (1/N) Σ_j z_ij². If the data has a mean value that we wish to remove prior to the SVD, it is important that this is done within the GenSVD algorithm. [sent-121, score-0.057]

55 Consider a centered matrix X_c = X - X̄, where X̄ contains the mean x̄ replicated in all N columns. [sent-122, score-0.128]

56 The signal space in X_c is now corrupted because each centered example will contain a component of all examples, which means the 'stripping' of signal components not spanned by other examples no longer works: x_j is no longer distributed like x*. [sent-123, score-0.317]

57 This suggests the alternative algorithm for data with removal of the mean component: compute U_0 Λ_0 V_0^T, then foreach j = 1, ..., N ... [sent-124, score-0.118]
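
The listing of this mean-removal variant is largely lost in the extraction. One way to respect the point above, removing for each left-out example only the mean of the remaining N - 1 examples so that x_j stays distributed like a test point, is sketched below; the per-fold mean and the 1/(N-1) scaling are my assumptions, not recovered from the paper.

    import numpy as np

    def gensvd_centered(X):
        """Leave-one-out variance re-estimation with per-fold mean removal (sketch)."""
        I, N = X.shape
        xbar = X.mean(axis=1, keepdims=True)
        U0, s0, V0t = np.linalg.svd(X - xbar, full_matrices=False)
        Z = np.zeros((N, N))
        for j in range(N):
            X_mj = np.delete(X, j, axis=1)
            xbar_mj = X_mj.mean(axis=1, keepdims=True)      # mean of the other N - 1 examples
            B, _, _ = np.linalg.svd(X_mj - xbar_mj, full_matrices=False)
            B = B[:, : N - 2]                               # centering X_mj reduces its rank by one
            # Centre the left-out example with the leave-one-out mean, then project
            Z[:, j] = U0.T @ (B @ (B.T @ (X[:, j] - xbar_mj[:, 0])))
        lam_hat = np.sqrt((Z ** 2).sum(axis=1) / (N - 1))   # assumed 1/(N-1) scaling
        return U0, lam_hat, V0t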

58 Finally, note that it is possible to leave out more than one example at a time if the data is independent only in blocks, i.e. [sent-128, score-0.056]

59 Example With PET Scans. We compared the performance of GenSVD to conventional SVD on a functional [15O] water PET activation study of the human brain. [sent-132, score-0.077]

60 The study consisted of 18 subjects, who were scanned four times while tracing a star-shaped maze with a joystick with visual feedback, in total 72 scans of dimension ≈ 25000 spatial voxels. [sent-133, score-0.152]

61 After the second scan, the visual feedback was mirrored, and the subject accommodated to and learned the new control environment during the last two scans. [sent-134, score-0.068]

62 Voxels inside the brain mask were arranged in the data matrix with one scan per column. [sent-136, score-0.536]

63 Figure 1 shows the results of an SVD decomposition compared to GenSVD. [sent-137, score-0.07]

64 Each marker represents one scan and the glyphs indicate scan number out of the four (circle-square-star-triangle). [sent-138, score-0.57]

65 The ellipses indicate the mean and covariances of the projections in each scan number. [sent-139, score-0.454]

66 The 32 scans from eight subjects were used as a training set and 40 scans from the remaining 10 subjects for testing. [sent-140, score-0.504]

67 The training set projections are filled markers, test-set projections onto the basis defined by the training set are open markers (i.e. [sent-141, score-0.626]
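
The kind of train/test comparison shown in the figure can be reproduced on synthetic data (a self-contained sketch loosely mimicking the 32 training / 40 test split, not the actual PET scans): project the held-out examples onto the basis of the training examples and compare, per component, the spread of the training projections with that of the test projections.

    import numpy as np

    rng = np.random.default_rng(3)
    I, K, N_train, N_test = 2000, 300, 32, 40

    def draw(n):
        # Rank-K Gaussian signal, as in the toy example above
        return np.vstack([rng.normal(size=(K, n)), np.zeros((I - K, n))])

    X_train, X_test = draw(N_train), draw(N_test)
    U, s, Vt = np.linalg.svd(X_train, full_matrices=False)

    train_std = s / np.sqrt(N_train)              # per-component spread of training projections
    test_std = (U.T @ X_test).std(axis=1)         # spread of test projections on the same basis

    # The training spread systematically exceeds the test spread on every component
    print(np.round(train_std[:5], 2))
    print(np.round(test_std[:5], 2))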

68 We see that there is a clear difference in variance between the training and test examples, which is corrected quite well by GenSVD. [sent-144, score-0.154]

69 The lower plot in Figure 1 shows the singular values for the PET data set. [sent-145, score-0.176]

70 We see that GenSVD estimates are much closer to the actual test projection standard deviations than the SVD singular values. [sent-146, score-0.28]

71 3 Conclusion We have demonstrated that projection of ill-posed data sets onto a basis defined by the same examples introduces a significant bias on the observed variance when comparing to projections of test examples onto the same basis. [sent-147, score-0.939]

72 The GenSVD algorithm has been presented as a tool for correcting for this bias using a leave-one-out re-estimation scheme, and a computationally efficient implementation has been proposed. [sent-148, score-0.136]

73 We have demonstrated that the method works well on an ill-posed real-world data set, where the distribution of the GenSVD-corrected training set projections matched the distribution of the observed test set projections far better than the uncorrected training projections did. [sent-149, score-0.63]

74 This allows a generalization performance increase of a subsequent statistical model, in the case of both supervised and unsupervised models. [sent-150, score-0.087]

75 Acknowledgments This work was supported partly by the Human Brain Project grant P20 MH57180, the Danish Research councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center (CONNECT) and the Technology Center Through Highly Oriented Research (THOR). [sent-151, score-0.026]

76 , editors, Proceedings of Workshop on Supercomputing in Brain Research: From Tomography to Neural Networks, HLRZ, KFA Jülich, Germany, pages 137-148. [sent-204, score-0.052]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('svd', 0.332), ('zj', 0.315), ('gensvd', 0.305), ('scan', 0.285), ('uo', 0.214), ('hansen', 0.21), ('tr', 0.167), ('uj', 0.158), ('scans', 0.152), ('singular', 0.148), ('xj', 0.14), ('projections', 0.14), ('span', 0.13), ('qj', 0.124), ('generalizable', 0.122), ('pet', 0.119), ('larsen', 0.105), ('basis', 0.104), ('variance', 0.102), ('bias', 0.095), ('diag', 0.093), ('jackson', 0.091), ('lautrup', 0.091), ('strother', 0.091), ('examples', 0.091), ('collected', 0.078), ('coordinates', 0.074), ('modelling', 0.074), ('test', 0.072), ('matrix', 0.07), ('decomposition', 0.07), ('training', 0.067), ('bishop', 0.066), ('roweis', 0.066), ('covariance', 0.064), ('aovot', 0.061), ('foreach', 0.061), ('juo', 0.061), ('prom', 0.061), ('soli', 0.061), ('svarer', 0.061), ('uoaovot', 0.061), ('projection', 0.06), ('onto', 0.056), ('pca', 0.056), ('orthogonal', 0.053), ('tomography', 0.052), ('corrected', 0.052), ('danish', 0.052), ('markers', 0.052), ('mirror', 0.052), ('subjects', 0.051), ('conventional', 0.048), ('variances', 0.048), ('zij', 0.048), ('xc', 0.048), ('unsupervised', 0.046), ('observed', 0.044), ('degenerate', 0.044), ('denmark', 0.044), ('brain', 0.043), ('inside', 0.043), ('correcting', 0.041), ('determinant', 0.041), ('mask', 0.041), ('supervised', 0.041), ('spanned', 0.039), ('jj', 0.039), ('trace', 0.039), ('signal', 0.039), ('al', 0.038), ('subject', 0.038), ('lx', 0.037), ('expression', 0.035), ('principal', 0.033), ('compute', 0.033), ('rank', 0.033), ('block', 0.033), ('noted', 0.032), ('remaining', 0.031), ('decomposed', 0.03), ('feedback', 0.03), ('samples', 0.029), ('mean', 0.029), ('centered', 0.029), ('human', 0.029), ('data', 0.028), ('example', 0.028), ('kearns', 0.028), ('average', 0.028), ('reduce', 0.027), ('editors', 0.027), ('multivariate', 0.027), ('means', 0.026), ('component', 0.026), ('per', 0.026), ('identically', 0.026), ('councils', 0.026), ('lars', 0.026), ('thor', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

Author: Ulrik Kjems, Lars Kai Hansen, Stephen C. Strother


2 0.11368646 121 nips-2000-Sparse Kernel Principal Component Analysis

Author: Michael E. Tipping

Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1

3 0.098946437 27 nips-2000-Automatic Choice of Dimensionality for PCA

Author: Thomas P. Minka

Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Steifel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.

4 0.082767822 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

Author: Pedro A. d. F. R. Højen-Sørensen, Ole Winther, Lars Kai Hansen

Abstract: We propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case whereas linear response theory-which gives an improved estimate of covariances-is very efficient. The examples given are for sources without temporal correlations. However, this derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly.

5 0.081899442 82 nips-2000-Learning and Tracking Cyclic Human Motion

Author: Dirk Ormoneit, Hedvig Sidenbladh, Michael J. Black, Trevor Hastie

Abstract: We present methods for learning and tracking human motion in video. We estimate a statistical model of typical activities from a large set of 3D periodic human motion data by segmenting these data automatically into

6 0.074802965 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

7 0.06510587 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

8 0.063346356 51 nips-2000-Factored Semi-Tied Covariance Matrices

9 0.059627742 120 nips-2000-Sparse Greedy Gaussian Process Regression

10 0.059134591 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

11 0.055658385 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes

12 0.054755904 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

13 0.053133089 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

14 0.051546194 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

15 0.050587405 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

16 0.048681185 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

17 0.047082577 74 nips-2000-Kernel Expansions with Unlabeled Examples

18 0.04604787 16 nips-2000-Active Inference in Concept Learning

19 0.045957565 6 nips-2000-A Neural Probabilistic Language Model

20 0.044959817 146 nips-2000-What Can a Single Neuron Compute?


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.179), (1, -0.011), (2, 0.05), (3, 0.058), (4, 0.028), (5, 0.121), (6, -0.049), (7, -0.004), (8, -0.006), (9, -0.047), (10, -0.116), (11, -0.019), (12, -0.001), (13, 0.02), (14, -0.076), (15, 0.004), (16, -0.064), (17, 0.021), (18, -0.001), (19, 0.088), (20, 0.089), (21, 0.065), (22, 0.009), (23, -0.18), (24, -0.078), (25, 0.093), (26, -0.016), (27, 0.124), (28, -0.19), (29, 0.103), (30, 0.095), (31, -0.088), (32, -0.031), (33, -0.004), (34, -0.06), (35, 0.135), (36, 0.059), (37, -0.152), (38, -0.099), (39, -0.03), (40, -0.064), (41, 0.058), (42, -0.017), (43, -0.072), (44, 0.163), (45, -0.039), (46, -0.109), (47, -0.128), (48, -0.117), (49, -0.201)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94894713 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

Author: Ulrik Kjems, Lars Kai Hansen, Stephen C. Strother


2 0.54831374 27 nips-2000-Automatic Choice of Dimensionality for PCA

Author: Thomas P. Minka

Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Steifel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.

3 0.48654658 121 nips-2000-Sparse Kernel Principal Component Analysis

Author: Michael E. Tipping

Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1

4 0.43292499 82 nips-2000-Learning and Tracking Cyclic Human Motion

Author: Dirk Ormoneit, Hedvig Sidenbladh, Michael J. Black, Trevor Hastie

Abstract: We present methods for learning and tracking human motion in video. We estimate a statistical model of typical activities from a large set of 3D periodic human motion data by segmenting these data automatically into

5 0.42816344 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

Author: Pedro A. d. F. R. Højen-Sørensen, Ole Winther, Lars Kai Hansen

Abstract: We propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case whereas linear response theory-which gives an improved estimate of covariances-is very efficient. The examples given are for sources without temporal correlations. However, this derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly.

6 0.40818653 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

7 0.36903131 16 nips-2000-Active Inference in Concept Learning

8 0.29095152 57 nips-2000-Four-legged Walking Gait Control Using a Neuromorphic Chip Interfaced to a Support Vector Learning Algorithm

9 0.27886719 120 nips-2000-Sparse Greedy Gaussian Process Regression

10 0.26973411 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

11 0.26622751 51 nips-2000-Factored Semi-Tied Covariance Matrices

12 0.25968748 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

13 0.2590057 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes

14 0.24887367 22 nips-2000-Algorithms for Non-negative Matrix Factorization

15 0.2459985 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

16 0.2402339 127 nips-2000-Structure Learning in Human Causal Induction

17 0.23851334 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

18 0.23630361 133 nips-2000-The Kernel Gibbs Sampler

19 0.23588438 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach

20 0.22791143 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.02), (17, 0.162), (33, 0.034), (50, 0.308), (54, 0.024), (55, 0.018), (60, 0.021), (62, 0.038), (65, 0.029), (67, 0.05), (75, 0.014), (76, 0.055), (79, 0.044), (81, 0.014), (90, 0.052), (97, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83053672 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

Author: Ulrik Kjems, Lars Kai Hansen, Stephen C. Strother


2 0.51741999 122 nips-2000-Sparse Representation for Gaussian Process Models

Author: Lehel Csató, Manfred Opper

Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.

3 0.51695234 4 nips-2000-A Linear Programming Approach to Novelty Detection

Author: Colin Campbell, Kristin P. Bennett

Abstract: Novelty detection involves modeling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish between new instances similar to members of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution, i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are x_i (with i = 1, ..., m) and the mapping is x_i → φ(x_i); then in the span of {φ(x_i)} we can expand a vector w = Σ_j α_j φ(x_j). Hence we can define separating hyperplanes in feature space by w · φ(x_i) + b = 0. We will refer to w · φ(x_i) + b as the margin, which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function (1), where z is a new data point. The data appears in the form of an inner product in feature space so we can implicitly define feature space by our choice of kernel function (2). A number of choices for the kernel are possible, for example RBF kernels (3). With the given kernel the decision function is therefore given by (4). One approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3, 12]. This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11]. The effect of outliers is reduced by using slack variables e_i to allow for datapoints outside the sphere, and the task is to minimise the volume of the sphere and the number of datapoints outside, i.e. min R² + λ Σ_i e_i subject to (x_i - a) · (x_i - a) ≤ R² + e_i, e_i ≥ 0 (5). Since the data appears in the form of inner products kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Schölkopf et al. [7].
Suppose we restricted our attention to RBF kernels (3); then the data lies on the surface of a hypersphere in feature space since φ(x) · φ(x) = K(x, x) = 1. The objective is therefore to separate off the surface region containing data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of W(α) = (1/2) Σ_{i,k=1}^m α_i α_k K(x_i, x_k) subject to 0 ≤ α_i ≤ C, Σ_{i=1}^m α_i = 1 (6). However, the origin plays a special role in this model. As the authors point out [9] this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution we instead try and attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm. For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, f(z) = 0, of some nonlinear function. In feature space, f(z) = Σ_i α_i K(z, x_i) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e. Σ_i f(x_i). This is achieved by minimising (7) subject to Σ_{j=1}^m α_j K(x_i, x_j) + b ≥ 0 (8) and Σ_{i=1}^m α_i = 1, α_i ≥ 0 (9). The bias b is just treated as an additional parameter in the minimisation process, though unrestricted in sign. The added constraints (9) on α bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with α ≠ 0 exists. Other constraints on the class of functions are possible, e.g. ||α||_1 = 1 with no restriction on the sign of α_i. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin in analogy to the usual approach used with support vector machines. In this case we minimise (10) subject to Σ_{j=1}^m α_j K(x_i, x_j) + b ≥ -e_i, e_i ≥ 0 (11) and constraints (9). The parameter λ controls the extent of margin errors (larger λ means fewer outliers are ignored: λ → ∞ corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory.
Column generation algorithms incrementally add and drop columns, each corresponding to a single kernel function, until optimality is reached. Such approaches have been successfully applied to other support vector problems [6, 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments. Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and σ

4 0.51333749 74 nips-2000-Kernel Expansions with Unlabeled Examples

Author: Martin Szummer, Tommi Jaakkola

Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.

5 0.50854039 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

Author: Christopher K. I. Williams

Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed. 1

6 0.50735909 133 nips-2000-The Kernel Gibbs Sampler

7 0.50597095 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

8 0.50429899 130 nips-2000-Text Classification using String Kernels

9 0.50347924 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

10 0.50115347 79 nips-2000-Learning Segmentation by Random Walks

11 0.49825555 60 nips-2000-Gaussianization

12 0.49679652 51 nips-2000-Factored Semi-Tied Covariance Matrices

13 0.49677327 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

14 0.49588042 36 nips-2000-Constrained Independent Component Analysis

15 0.49463105 37 nips-2000-Convergence of Large Margin Separable Linear Classification

16 0.49214354 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

17 0.4917669 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes

18 0.4906047 56 nips-2000-Foundations for a Circuit Complexity Theory of Sensory Processing

19 0.48954082 111 nips-2000-Regularized Winnow Methods

20 0.48761949 146 nips-2000-What Can a Single Neuron Compute?