nips nips2000 nips2000-61 nips2000-61-reference knowledge-graph by maker-knowledge-mining

61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets


Source: pdf

Author: Ulrik Kjems, Lars Kai Hansen, Stephen C. Strother

Abstract: We demonstrate that statistical analysis of ill-posed data sets is subject to a bias, which can be observed when projecting independent test set examples onto a basis defined by the training examples. Because the training examples in an ill-posed data set do not fully span the signal space, the observed training set variances along each basis vector will be too high compared to the average variance of the test set projections onto the same basis vectors. On the basis of this understanding we introduce the Generalizable Singular Value Decomposition (GenSVD) as a means to reduce this bias by re-estimating the singular values obtained in a conventional Singular Value Decomposition, allowing an increase in the generalization performance of a subsequent statistical model. We demonstrate that the algorithm successfully corrects bias in a data set from a functional PET activation study of the human brain.

1 Ill-posed Data Sets

An ill-posed data set has more dimensions in each example than there are examples. Such data sets occur in many fields of research, typically in connection with image measurements. The associated statistical problem is that of extracting structure from the observed high-dimensional vectors in the presence of noise. The statistical analysis can be done either supervised (i.e. modelling with target values: classification, regression) or unsupervised (modelling with no target values: clustering, PCA, ICA). In both types of analysis the ill-posedness may lead to immediate problems if one tries to apply conventional statistical methods of analysis; for example, the empirical covariance matrix is prohibitively large and will be rank-deficient.

A common approach is to use Singular Value Decomposition (SVD) or the analogous Principal Component Analysis (PCA) to reduce the dimensionality of the data. Let the N observed I-dimensional samples x_j, j = 1, ..., N, be collected in the data matrix X = [x_1 ... x_N] of size I × N, with I > N. The SVD theorem states that such a matrix can be decomposed as

    X = U Λ V^T    (1)

where U is a matrix of the same size as X with orthogonal basis vectors spanning the space of X, so that U^T U = I_{N×N}. The square matrix Λ contains the singular values in the diagonal, Λ = diag(λ_1, ..., λ_N), which are ordered and positive, λ_1 ≥ λ_2 ≥ ... ≥ λ_N ≥ 0, and V is N × N and orthogonal, V^T V = I_N. If there is a mean value significantly different from zero it may at times be advantageous to perform the above analysis on mean-subtracted data, i.e. X − X̄ = U Λ V^T, where the columns of X̄ all contain the mean vector x̄ = Σ_j x_j / N.

Each observation x_j can be expressed in coordinates in the basis defined by the vectors of U with no loss of information [Lautrup et al., 1995]. A change of basis is obtained by q_j = U^T x_j, i.e. the orthogonal basis rotation

    Q = [q_1 ... q_N] = U^T X = U^T U Λ V^T = Λ V^T.    (2)

Since Q is only N × N and N ≪ I, Q is a compact representation of the data. Having now N examples of dimension N, we have reduced the problem to a marginally ill-posed one. To further reduce the dimensionality, it is common to retain only a subset of the coordinates, e.g. the top P coordinates (P < N), and the supervised or unsupervised model can be formed in this smaller but now well-posed space.

So far we have considered the procedure for modelling from a training set. Our hope is that the statistical description generalizes well to new examples, proving that it is a good description of the generating process.
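As a concrete illustration of equations (1) and (2), the following minimal NumPy sketch (the sizes I, N, P and the white-noise data are illustrative assumptions, not values from the paper) computes the thin SVD, forms the compact N × N representation Q = U^T X, and retains the top P coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ill-posed data set: I-dimensional examples, N << I of them.
I, N = 1000, 20
X = rng.standard_normal((I, N))            # data matrix X = [x_1 ... x_N], I x N

# Thin SVD (eq. 1): X = U @ diag(s) @ Vt, with U of size I x N and U^T U = I_N.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Change of basis (eq. 2): Q = U^T X = diag(s) @ Vt, an N x N representation
# of the same N examples with no loss of information.
Q = U.T @ X
assert np.allclose(Q, np.diag(s) @ Vt)

# Further reduce dimensionality by keeping only the top P coordinates; a
# supervised or unsupervised model is then fitted in this well-posed space.
P = 5
Q_reduced = Q[:P, :]                       # P x N representation
print(Q.shape, Q_reduced.shape)
```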
The model should, in other words, be able to perform well on a new example x*, and in the above framework this means that predictions based on q* = U^T x* should generalize well. We will show in the following that, in general, the distribution of the test set projection q* is quite different from the statistics of the projections of the training examples q_j.

It has been noted in previous work [Hansen and Larsen, 1996, Roweis, 1998, Hansen et al., 1999] that PCA/SVD of ill-posed data does not by itself represent a probabilistic model in which we can assign a likelihood to a new test data point, and procedures have been proposed which make this possible. In [Bishop, 1999] PCA is considered in a Bayesian framework, but this does not address the significant bias of the variance in training set projections in ill-posed data sets. In [Jackson, 1991] an asymptotic expression is given for the bias of the eigenvalues of a sample covariance matrix, but this expression is valid only in the well-posed case and is not applicable to ill-posed data.

1.1 Example

Let the signal source be an I-dimensional multivariate Gaussian distribution N(0, Σ) with a covariance matrix whose first K eigenvalues equal σ² and whose last I − K are zero, so that the covariance matrix has the decomposition

    Σ = σ² Y D Y^T,    D = diag(1, ..., 1, 0, ..., 0),    Y^T Y = I.    (3)

Our N samples of the distribution are collected in the matrix X = [x_ij] with the SVD

    X = U Λ V^T,    Λ = diag(λ_1, ..., λ_N),    (4)

and the representation of the N examples in the N basis vector coordinates defined by U is Q = [q_ij] = U^T X = Λ V^T. The total variance per training example is

    (1/N) Σ_{i,j} x_ij² = (1/N) Tr(X^T X) = (1/N) Tr(V Λ U^T U Λ V^T) = (1/N) Tr(V Λ² V^T) = (1/N) Tr(V V^T Λ²) = (1/N) Tr(Λ²) = (1/N) Σ_i λ_i².    (5)

Note that this variance is the same in the U-basis coordinates: (1/N) Σ_{i,j} q_ij² = (1/N) Tr(Q Q^T) = (1/N) Tr(Λ V^T V Λ) = (1/N) Tr(Λ²).
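This example can be reproduced numerically with a short NumPy sketch (the dimensions I, N, K and the value of σ are illustrative assumptions); it checks equation (5) and also exposes the bias discussed above: independent test examples project onto the training basis U with markedly lower variance than the training examples, because the N training examples span only an N-dimensional subspace of the K-dimensional signal space when N < K.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions: I-dimensional Gaussian source whose covariance has
# K nonzero eigenvalues equal to sigma^2 (eq. 3), sampled N times.
I, N, K, sigma = 500, 40, 100, 1.0

Y, _ = np.linalg.qr(rng.standard_normal((I, K)))    # orthonormal columns, Y^T Y = I_K

def sample(n):
    """Draw n samples x = sigma * Y z with z ~ N(0, I_K), i.e. x ~ N(0, Sigma)."""
    return sigma * Y @ rng.standard_normal((K, n))

X = sample(N)                                       # training data matrix, I x N
U, lam, Vt = np.linalg.svd(X, full_matrices=False)  # X = U Lambda V^T (eq. 4)

# Equation (5): total variance per training example is (1/N) sum_i lambda_i^2,
# and it is unchanged by the rotation into the U-basis coordinates Q = U^T X.
Q = U.T @ X
print((X**2).sum() / N, (lam**2).sum() / N, (Q**2).sum() / N)   # all three agree

# The bias: fresh test examples project onto the training basis U with much
# lower variance, since the training examples do not fully span the signal space.
X_test = sample(1000)
Q_test = U.T @ X_test
print("train variance per example:", (Q**2).sum() / N)
print("test variance per example: ", (Q_test**2).sum() / X_test.shape[1])
```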


reference text

[Bishop, 1999] Bishop, C. (1999). Bayesian PCA. In Kearns, M. S., Solla, S. A., and Cohn, D. A., editors, Advances in Neural Information Processing Systems, volume 11. The MIT Press.

[Hansen et al., 1999] Hansen, L., Larsen, J., Nielsen, F., Strother, S., Rostrup, E., Savoy, R., Lange, N., Sidtis, J., Svarer, C., and Paulson, O. (1999). Generalizable patterns in neuroimaging: How many principal components? NeuroImage, 9:534-544.

[Hansen and Larsen, 1996] Hansen, L. K. and Larsen, J. (1996). Unsupervised learning and generalization. In Proceedings of IEEE International Conference on Neural Networks, pages 25-30.

[Jackson, 1991] Jackson, J. E. (1991). A User's Guide to Principal Components. Wiley Series on Probability and Statistics, John Wiley and Sons.

[Lautrup et al., 1995] Lautrup, B., Hansen, L. K., Law, I., Mørch, N., Svarer, C., and Strother, S. (1995). Massive weight sharing: A cure for extremely ill-posed problems. In Hermann, H. J., Wolf, D. E., and Pöppel, E. P., editors, Proceedings of Workshop on Supercomputing in Brain Research: From Tomography to Neural Networks, HLRZ, KFA Jülich, Germany, pages 137-148. World Scientific.

[Roweis, 1998] Roweis, S. (1998). EM algorithms for PCA and SPCA. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press.

[Figure: "Conventional SVD" scatter plot of projections; remaining plot content not recoverable from the extraction.]