nips nips2000 nips2000-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas P. Minka
Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.
Reference: text
sentIndex sentText sentNum sentScore
1 Automatic choice of dimensionality for PCA Thomas P. Minka [sent-1, score-0.171]
2 Abstract A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. [sent-4, score-0.216]
3 By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. [sent-5, score-0.455]
4 The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. [sent-6, score-0.155]
5 The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. [sent-7, score-0.163]
6 But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. [sent-8, score-0.171]
7 In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster. [sent-9, score-0.038]
8 1 Introduction Recovering the intrinsic dimensionality of a data set is a classic and fundamental problem in data analysis. [sent-10, score-0.292]
9 A popular method for doing this is PCA or localized PCA. [sent-11, score-0.091]
10 Modeling the data manifold with localized PCA dates back to [4]. [sent-12, score-0.245]
11 However, the task of dimensionality selection has not been solved in a satisfactory way. [sent-14, score-0.309]
12 On the one hand we have crude methods based on eigenvalue thresholding [4], which are very fast; on the other, iterative methods [1] which require excessive computing time. [sent-15, score-0.112]
13 This paper resolves the situation by deriving a method which is both accurate and fast. [sent-16, score-0.129]
14 It is an application of Bayesian model selection to the probabilistic PCA model developed by [12, 15]. [sent-17, score-0.329]
15 The new method operates exclusively on the eigenvalues of the data covariance matrix. [sent-18, score-0.316]
16 In the local PCA context, these would be the eigenvalues of the local responsibility-weighted covariance matrix, as defined by [14]. [sent-19, score-0.176]
17 The method can be used to fit different PCA models to different classes, for use in Bayesian classification [11]. [sent-20, score-0.046]
18 The PCA model is that a d-dimensional vector x was generated from a smaller k-dimensional vector w by a linear transformation (H, m) plus a noise vector e: x = Hw + m + e. [sent-22, score-0.206]
19 Both the noise and the principal component vector w are assumed spherical Gaussian: w ~ N(0, I), e ~ N(0, vI) (1). The observation x is therefore Gaussian itself: p(x|H, m, v) ~ N(m, HH^T + vI) (2). The goal of PCA is to estimate the basis vectors H and the noise variance v from a data set D = {x_1, ..., x_N}. [sent-23, score-0.393]
20 The probability of the data set is p(D|H,m,v) = (2\pi)^{-Nd/2} |HH^T + vI|^{-N/2} \exp(-\tfrac{1}{2}\mathrm{tr}((HH^T + vI)^{-1} S)) (3), where S = \sum_i (x_i - m)(x_i - m)^T (4). As shown by [15], the maximum-likelihood estimates are \hat{m} = \tfrac{1}{N}\sum_i x_i and [sent-27, score-0.039]
21 \hat{H} = U(\Lambda - \hat{v}I)^{1/2}R with \hat{v} = \sum_{j=k+1}^{d} \lambda_j / (d-k) (5), where the orthogonal matrix U contains the top k eigenvectors of S/N, the diagonal matrix \Lambda contains the corresponding eigenvalues, and R is an arbitrary orthogonal matrix. [sent-28, score-0.433]
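The maximum-likelihood solution above depends on the data only through its mean and the eigen-decomposition of S/N. Below is a minimal numpy sketch of equation (5); the function name and the convention that columns of X are observations are my own choices, not the paper's code.

```python
import numpy as np

def ppca_mle(X, k):
    """Maximum-likelihood estimates for probabilistic PCA, following eq. (5).
    X is a d x N matrix whose columns are the observations; k < d is the
    retained dimensionality.  A sketch, not the author's implementation."""
    d, N = X.shape
    m = X.mean(axis=1, keepdims=True)            # sample mean (eq. 5)
    Xc = X - m
    evals, evecs = np.linalg.eigh(Xc @ Xc.T / N) # eigen-decomposition of S/N
    order = np.argsort(evals)[::-1]              # sort eigenvalues descending
    evals, evecs = evals[order], evecs[:, order]
    U, lam = evecs[:, :k], evals[:k]             # top-k eigenvectors / eigenvalues
    v = evals[k:].sum() / (d - k)                # noise variance: mean discarded eigenvalue
    H = U @ np.diag(np.sqrt(np.maximum(lam - v, 0.0)))  # one choice of H (R = I)
    return m, H, v
```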
22 3 Bayesian model selection Bayesian model selection scores models according to the probability they assign the observed data [9, 8]. [sent-29, score-0.48]
23 It automatically encodes a preference for simpler, more constrained models, as illustrated in figure 1. [sent-31, score-0.033]
24 Simple models only fit a small fraction of data sets, but they assign correspondingly higher probability to those data sets. [sent-32, score-0.156]
25 The probability of the data given the model is computed by integrating over the unknown parameter values in that model: p(D|M) = \int p(D|\theta)\, p(\theta|M)\, d\theta (6). [sent-34, score-0.14]
26 [Figure 1: Why Bayesian model selection prefers simpler models. The sketch plots p(D|M) against data sets D: the constrained model wins on the data sets it can fit, the flexible model wins elsewhere.] This quantity is called the evidence for model M. [sent-35, score-0.821]
27 A useful property of Bayesian model selection is that it is guaranteed to select the true model, if it is among the candidates, as the size of the dataset grows to infinity. [sent-36, score-0.231]
28 3.1 The evidence for probabilistic PCA For the PCA model, we want to select the subspace dimensionality k. [sent-38, score-0.325]
29 To do this, we compute the probability of the data for each possible dimensionality and pick the maximum. [sent-39, score-0.253]
30 For a given dimensionality, this requires integrating over all PCA parameters (m, H, v) . [sent-40, score-0.041]
31 First we need to define a prior density for these parameters. [sent-41, score-0.126]
32 Assuming there is no information other than the data D, the prior should be as noninformative as possible. [sent-42, score-0.19]
33 Let H be decomposed just as in (5): H = ULR with U^T U = I_k (9), where L is diagonal with diagonal elements l_1, ..., l_k. The orthogonal matrix U is the basis, L is the scaling (corrected for noise), and R is a rotation within the subspace (which will turn out to be irrelevant). [sent-44, score-0.323]
34 For a noninformative prior, \alpha should be small, making the prior diffuse. [sent-46, score-0.151]
35 Combining the likelihood with the prior gives p(D|k) = c_k \int |HH^T + vI|^{-n/2} \exp(-\tfrac{1}{2}\mathrm{tr}((HH^T + vI)^{-1}(S + \alpha I)))\, dU\, dL\, dv (14), with n = N + 1 + \alpha (15). The constant c_k includes N^{-d/2} and the normalizing terms for p(U), p(L), and p(v) (given in [10]); only p(U) will matter in the end. [sent-48, score-0.115]
36 In this formula R has already been integrated out; the likelihood does not involve R, so we just get a multiplicative factor of \int p(R)\, dR = 1. [sent-49, score-0.069]
37 3.2 Laplace approximation Laplace's method is a powerful method for approximating integrals in Bayesian statistics [8]: \int f(\theta)\, d\theta \approx f(\hat{\theta})\, (2\pi)^{\mathrm{rows}(A)/2} |A|^{-1/2} (16), where A = -\frac{d^2 \log f(\theta)}{d\theta^2}\big|_{\theta=\hat{\theta}} and \hat{\theta} maximizes f (17). The key to getting a good approximation is choosing a good parameterization for \theta = (U, L, v). [sent-51, score-0.287]
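As a reminder of how (16)-(17) work in general, here is a small sketch of Laplace's method for a generic integrand, assuming the caller supplies the maximizer and the negative Hessian of the log-integrand; it is not the specialized PCA evidence derived in the paper. For a Gaussian integrand the approximation is exact, which gives a quick sanity check.

```python
import numpy as np

def laplace_log_integral(log_f, theta_hat, A):
    """Laplace approximation to the log of  integral f(theta) d theta  (eq. 16).
    log_f: callable returning log f(theta); theta_hat: maximizer of f;
    A: negative Hessian of log f at theta_hat.  A generic sketch only."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    rows = A.shape[0]
    _, logdet = np.linalg.slogdet(A)
    return log_f(theta_hat) + 0.5 * rows * np.log(2 * np.pi) - 0.5 * logdet

# Sanity check: the integral of exp(-x^2/2) is sqrt(2*pi), and Laplace is exact here.
approx = np.exp(laplace_log_integral(lambda x: -0.5 * x**2, 0.0, 1.0))
assert np.isclose(approx, np.sqrt(2 * np.pi))
```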
38 The maximum is at \hat{l}_i = (N\lambda_i + \alpha)/(N - 1 + \alpha) and \hat{v} = N\sum_{j=k+1}^{d}\lambda_j / (n(d-k) - 2), with second derivatives \frac{d^2 \log f(\theta)}{(dl_i')^2}\big|_{\theta=\hat{\theta}} = -(N - 1 + \alpha)/2 (18) and \frac{d^2 \log f(\theta)}{(dv')^2}\big|_{\theta=\hat{\theta}} = -(n(d-k) - 2)/2 (19). The matrix U is an orthogonal k-frame and therefore lives on the Stiefel manifold [7], which is defined by condition (9). [sent-54, score-0.285]
39 The dimension of the manifold is m = dk - k(k + 1) /2, since we are imposing k(k + 1)/2 constraints on a d x k matrix. [sent-55, score-0.128]
40 As a function of U, the integrand is simply p(U|D, L, v) \propto \exp(-\tfrac{1}{2}\mathrm{tr}((L^{-1} - v^{-1}I)\, U^T S U)) (23). The density is maximized when U contains the top k eigenvectors of S. [sent-59, score-0.212]
41 However, the density is unchanged if we negate any column of U. [sent-60, score-0.051]
42 This means that there are actually 2^k different maxima, and we need to apply Laplace's method to each. [sent-61, score-0.046]
43 Fortunately, these maxima are identical, so we can simply multiply (16) by 2^k to get the integral over the whole manifold. [sent-62, score-0.082]
44 If we set U_d to the eigenvectors of S, U_d^T S U_d = N\Lambda (24), then we just need to apply Laplace's method at Z = 0. [sent-63, score-0.102]
45 As shown in [10], if we define the estimated eigenvalue matrix \hat{\Lambda} = \mathrm{diag}(\hat{L}, \hat{v} I_{d-k}) (25), then the second differential at Z = 0 simplifies to d^2 \log f(\theta)|_{Z=0} = -\sum_{i=1}^{k} \sum_{j=i+1}^{d} (\hat{\lambda}_j^{-1} - \hat{\lambda}_i^{-1})(\lambda_i - \lambda_j)\, N\, dz_{ij}^2 (26). [sent-64, score-0.094]
46 There are no cross derivatives; the Hessian matrix A_Z is diagonal. [sent-70, score-0.058]
47 Its diagonal elements are (\hat{\lambda}_j^{-1} - \hat{\lambda}_i^{-1})(\lambda_i - \lambda_j) N (27). Laplace's method requires this to be nonsingular, so we must have k < N. [sent-72, score-0.046]
48 The cross-derivatives between the parameters are all zero: \frac{d^2 \log f(\theta)}{dl_i\, dZ}\big|_{\theta=\hat{\theta}} = \frac{d^2 \log f(\theta)}{dv\, dZ}\big|_{\theta=\hat{\theta}} = \frac{d^2 \log f(\theta)}{dl_i\, dv}\big|_{\theta=\hat{\theta}} = 0 (28), so A is block diagonal and |A| = |A_Z||A_L||A_v|. [sent-73, score-0.063]
49 Given the eigenvalues, the cost of computing p(D|k) is O(min(d, N)k), which is less than one loop over the data matrix. [sent-78, score-0.039]
50 A simplification of Laplace's method is the BIC approximation [8]. [sent-79, score-0.083]
51 This approximation drops all terms which do not grow with N, which in this case leaves only p(D|k) \approx (\prod_{j=1}^{k} \lambda_j)^{-N/2}\, v^{-N(d-k)/2}\, N^{-(m+k)/2} (32). BIC is compared to Laplace in section 4. [sent-80, score-0.037]
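Because (32) involves only the covariance eigenvalues, N, and the manifold dimension m = dk - k(k+1)/2, dimensionality selection by BIC is a few lines of code. Below is a sketch that maximizes the log of (32) over k; terms independent of k are irrelevant to the argmax, and the function name is my own.

```python
import numpy as np

def bic_dimensionality(evals, N):
    """Choose k by maximizing the BIC approximation of eq. (32).
    evals: eigenvalues of the sample covariance, sorted in decreasing order.
    N: number of observations.  A sketch of the stated formula."""
    d = len(evals)
    best_k, best_score = None, -np.inf
    for k in range(1, min(d, N)):                # k < N is required (see text)
        v = evals[k:].sum() / (d - k)            # ML noise variance
        m = d * k - k * (k + 1) / 2              # Stiefel manifold dimension
        score = (-N / 2 * np.log(evals[:k]).sum()
                 - N * (d - k) / 2 * np.log(v)
                 - (m + k) / 2 * np.log(N))
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```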
52 4 Results To test the performance of various algorithms for model selection, we sample data from a known model and see how often the correct dimensionality is recovered. [sent-81, score-0.374]
53 The seven estimators implemented and tested in this study are Laplace's method (30), BIC (32), the two methods of [13] (called RR-N and RR-U), the algorithm in [3] (ER), the ARD algorithm of [1], and 5-fold cross-validation (CV). [sent-82, score-0.135]
54 For cross-validation, the log-probability assigned to the held-out data is the scoring function. [sent-83, score-0.039]
55 ER is the most similar to this paper, since it performs Bayesian model selection on the same model, but uses a different kind of approximation combined with explicit numerical integration. [sent-84, score-0.235]
56 RR-N and RR-U are maximum likelihood techniques on models slightly different than probabilistic PCA; the details are in [10]. [sent-85, score-0.071]
57 ARD is an iterative estimation algorithm for H which sets columns to zero unless they are supported by the data. [sent-86, score-0.046]
58 The number of nonzero columns at convergence is the estimate of dimensionality. [sent-87, score-0.081]
59 Most of these estimators work exclusively from the eigenvalues of the sample covariance matrix. [sent-88, score-0.32]
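Computing those eigenvalues is itself inexpensive even when d is large: for d > N one can diagonalize the N x N Gram matrix, whose nonzero eigenvalues match those of the d x d sample covariance. A small sketch of this standard shortcut follows; it is not taken from any of the cited implementations.

```python
import numpy as np

def covariance_eigenvalues(X):
    """Eigenvalues of the sample covariance S/N for a d x N data matrix X,
    sorted in decreasing order.  Uses the Gram matrix when d > N, since the
    nonzero spectra coincide; the remaining eigenvalues are zero."""
    d, N = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    if d <= N:
        evals = np.linalg.eigvalsh(Xc @ Xc.T / N)[::-1]
    else:
        evals = np.linalg.eigvalsh(Xc.T @ Xc / N)[::-1]
        evals = np.concatenate([evals, np.zeros(d - N)])
    return np.clip(evals, 0.0, None)             # guard against tiny negatives
```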
60 The exceptions are RR-U, cross-validation, and ARD; the latter two require diagonalizing a series of different matrices constructed from the data. [sent-89, score-0.033]
61 In our implementation, the algorithms are ordered from fastest to slowest as RR-N, BIC, Laplace, cross-validation, RR-U, ARD, and ER (ER is slowest because of the numerical integrations required). [sent-90, score-0.152]
62 The first experiment tests the data-rich case where N >> d. [sent-91, score-0.116]
63 The data is generated from a 10-dimensional Gaussian distribution with 5 "signal" dimensions and 5 noise dimensions. [sent-92, score-0.184]
64 The eigenvalues of the true covariance matrix are: signal 10, 8, 6, 4, 2; noise 1 (x5); N = 100. The number of times the correct dimensionality (k = 5) was chosen over 60 replications is shown at right. [sent-93, score-0.564]
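For concreteness, here is a sketch of how this replication protocol can be reproduced for the first experiment, sampling from a zero-mean Gaussian with the stated spectrum. It reuses the helper functions sketched earlier, with the BIC scorer standing in for whichever estimator is under test; it is not the paper's simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

# First experiment: 5 signal eigenvalues (10, 8, 6, 4, 2), 5 noise eigenvalues of 1.
true_evals = np.array([10.0, 8.0, 6.0, 4.0, 2.0] + [1.0] * 5)
d, N, n_reps, true_k = len(true_evals), 100, 60, 5

hits = 0
for _ in range(n_reps):
    # Rows are independent with the prescribed variances (a rotation would not matter).
    X = rng.standard_normal((d, N)) * np.sqrt(true_evals)[:, None]
    evals = covariance_eigenvalues(X)          # helper sketched above
    hits += (bic_dimensionality(evals, N) == true_k)
print(f"correct dimensionality recovered in {hits} of {n_reps} replications")
```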
65 [Bar chart comparing ER, Laplace, CV, BIC, ARD, RRN, and RRU on the first experiment.] The second experiment tests the case of sparse data and low noise: signal 10, 8, 6, 4, 2; noise 0.1 (x10); N = 10. [sent-96, score-0.155]
66 The results over 60 replications are shown at right. [sent-97, score-0.115]
67 Cross-validation also fails, because it doesn't have enough data to work with. [sent-99, score-0.039]
68 The third experiment tests the case of high noise dimensionality: signal 10, 8, 6, 4, 2; noise 0.25 (x95); N = 60. [sent-100, score-0.224]
69 The ER algorithm was not run in this case because of its excessive computation time for large d. [sent-101, score-0.076]
70 The final experiment tests the robustness to having a non-Gaussian data distribution within the subspace. [sent-102, score-0.155]
71 We start with four sound fragments of 100 samples each. [sent-103, score-0.06]
72 To make things especially non-Gaussian, the values in the third fragment are squared and the values in the fourth fragment are cubed. [sent-104, score-0.11]
73 All fragments are standardized to zero mean and unit variance. [sent-105, score-0.06]
74 Gaussian noise in 20 dimensions is added to get: signal = the 4 sounds; noise 0.5 (x20); N = 100. [sent-106, score-0.145]
75 The results over 60 replications of the noise (the signals were constant) are reported at right. [sent-107, score-0.223]
76 [Bar charts comparing CV, Laplace, ARD, BIC, RRN, RRU, and ER on the remaining experiments.] 5 Discussion Bayesian model selection has been shown to provide excellent performance when the assumed model is correct or partially correct. [sent-108, score-0.302]
77 The evaluation criterion was the number of times the correct dimensionality was chosen. [sent-109, score-0.215]
78 It would also be useful to evaluate the trained model with respect to its performance on new data within an applied setting. [sent-110, score-0.099]
79 In this case, Bayesian model averaging is more appropriate, and it is conceivable that a method like ARD, which encompasses a soft blend between different dimensionalities, might perform better by this criterion than selecting one dimensionality. [sent-111, score-0.106]
80 It is important to remember that these estimators are for density estimation, i.e. [sent-112, score-0.173]
81 accurate representation of the data, and are not necessarily appropriate for other purposes like reducing computation or extracting salient features. [sent-114, score-0.05]
82 For example, on a database of 301 face images the Laplace evidence picked 120 dimensions, which is far more than one would use for feature extraction. [sent-115, score-0.077]
83 (This result also suggests that probabilistic PCA is not a good generative model for face images.) [sent-116, score-0.165]
84 Inferring the eigenvalues of covariance matrices from limited, noisy data. [sent-130, score-0.176]
85 The EM algorithm for mixtures of factor analyzers. [sent-152, score-0.036]
86 Model order selection for the singular value decomposition and the discrete Karhunen-Loeve transform using a Bayesian approach. [sent-216, score-0.138]
wordName wordTfidf (topN-words)
[('laplace', 0.442), ('ard', 0.263), ('pca', 0.227), ('hht', 0.191), ('bic', 0.179), ('er', 0.173), ('dimensionality', 0.171), ('cv', 0.165), ('dlk', 0.153), ('bayesian', 0.149), ('selection', 0.138), ('manifold', 0.128), ('eigenvalues', 0.119), ('replications', 0.115), ('rru', 0.115), ('noise', 0.108), ('tr', 0.105), ('orthogonal', 0.099), ('aj', 0.093), ('estimators', 0.089), ('ck', 0.089), ('trans', 0.089), ('media', 0.083), ('parameterization', 0.078), ('excessive', 0.076), ('noninformative', 0.076), ('rrn', 0.076), ('slowest', 0.076), ('wins', 0.076), ('prior', 0.075), ('tests', 0.074), ('probabilistic', 0.071), ('vi', 0.07), ('principal', 0.07), ('az', 0.066), ('pea', 0.064), ('rj', 0.064), ('flexible', 0.064), ('diagonal', 0.063), ('http', 0.062), ('model', 0.06), ('fragments', 0.06), ('matrix', 0.058), ('covariance', 0.057), ('signal', 0.057), ('eigenvectors', 0.056), ('fragment', 0.055), ('exclusively', 0.055), ('moghaddam', 0.055), ('tipping', 0.052), ('exp', 0.052), ('density', 0.051), ('accurate', 0.05), ('maxima', 0.049), ('method', 0.046), ('columns', 0.046), ('localized', 0.045), ('assign', 0.045), ('correct', 0.044), ('evidence', 0.043), ('pick', 0.043), ('intrinsic', 0.043), ('choosing', 0.043), ('experiment', 0.042), ('integrating', 0.041), ('matter', 0.04), ('ghahramani', 0.04), ('ps', 0.04), ('subspace', 0.04), ('data', 0.039), ('ex', 0.039), ('lab', 0.038), ('plus', 0.038), ('dimensions', 0.037), ('approximation', 0.037), ('xi', 0.036), ('eigenvalue', 0.036), ('factor', 0.036), ('automatic', 0.035), ('estimate', 0.035), ('face', 0.034), ('ac', 0.034), ('component', 0.033), ('guaranteed', 0.033), ('derivatives', 0.033), ('get', 0.033), ('dates', 0.033), ('bregler', 0.033), ('corrected', 0.033), ('correspondingly', 0.033), ('diagonalizing', 0.033), ('dimensionalities', 0.033), ('everson', 0.033), ('iz', 0.033), ('minka', 0.033), ('nonsingular', 0.033), ('preference', 0.033), ('remember', 0.033), ('resolves', 0.033), ('sharpness', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 27 nips-2000-Automatic Choice of Dimensionality for PCA
Author: Thomas P. Minka
Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Steifel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.
2 0.2004189 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
3 0.14912815 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1
4 0.11793606 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
Author: Michael S. Gray, Terrence J. Sejnowski, Javier R. Movellan
Abstract: We examine eight different techniques for developing visual representations in machine vision tasks. In particular we compare different versions of principal component and independent component analysis in combination with stepwise regression methods for variable selection. We found that local methods, based on the statistics of image patches, consistently outperformed global methods based on the statistics of entire images. This result is consistent with previous work on emotion and facial expression recognition. In addition, the use of a stepwise regression technique for selecting variables and regions of interest substantially boosted performance. 1
5 0.10649859 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.
6 0.098946437 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
7 0.096899994 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
8 0.094352186 133 nips-2000-The Kernel Gibbs Sampler
9 0.090057373 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
10 0.088717632 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
11 0.084688626 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
12 0.081977375 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
13 0.079130389 54 nips-2000-Feature Selection for SVMs
14 0.077690758 51 nips-2000-Factored Semi-Tied Covariance Matrices
15 0.075084537 82 nips-2000-Learning and Tracking Cyclic Human Motion
16 0.074151769 35 nips-2000-Computing with Finite and Infinite Networks
17 0.073730126 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
18 0.072462663 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
19 0.071637735 120 nips-2000-Sparse Greedy Gaussian Process Regression
20 0.070425831 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
topicId topicWeight
[(0, 0.255), (1, 0.008), (2, 0.099), (3, 0.083), (4, 0.101), (5, 0.162), (6, -0.087), (7, -0.002), (8, -0.008), (9, -0.177), (10, -0.058), (11, -0.013), (12, 0.004), (13, 0.001), (14, -0.087), (15, 0.069), (16, -0.042), (17, -0.027), (18, -0.137), (19, 0.079), (20, 0.031), (21, 0.138), (22, -0.107), (23, -0.147), (24, -0.122), (25, 0.126), (26, 0.156), (27, 0.071), (28, -0.082), (29, 0.089), (30, 0.016), (31, 0.012), (32, 0.04), (33, -0.014), (34, -0.116), (35, -0.028), (36, 0.073), (37, -0.179), (38, 0.01), (39, -0.001), (40, -0.146), (41, 0.074), (42, 0.145), (43, 0.004), (44, -0.002), (45, -0.025), (46, -0.012), (47, 0.09), (48, 0.022), (49, -0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.94620496 27 nips-2000-Automatic Choice of Dimensionality for PCA
Author: Thomas P. Minka
Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Steifel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.
2 0.62019527 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
Author: Ulrik Kjems, Lars Kai Hansen, Stephen C. Strother
Abstract: We demonstrate that statistical analysis of ill-posed data sets is subject to a bias, which can be observed when projecting independent test set examples onto a basis defined by the training examples. Because the training examples in an ill-posed data set do not fully span the signal space the observed training set variances in each basis vector will be too high compared to the average variance of the test set projections onto the same basis vectors. On basis of this understanding we introduce the Generalizable Singular Value Decomposition (GenSVD) as a means to reduce this bias by re-estimation of the singular values obtained in a conventional Singular Value Decomposition, allowing for a generalization performance increase of a subsequent statistical model. We demonstrate that the algorithm succesfully corrects bias in a data set from a functional PET activation study of the human brain. 1 Ill-posed Data Sets An ill-posed data set has more dimensions in each example than there are examples. Such data sets occur in many fields of research typically in connection with image measurements. The associated statistical problem is that of extracting structure from the observed high-dimensional vectors in the presence of noise. The statistical analysis can be done either supervised (Le. modelling with target values: classification, regresssion) or unsupervised (modelling with no target values: clustering, PCA, ICA). In both types of analysis the ill-posedness may lead to immediate problems if one tries to apply conventional statistical methods of analysis, for example the empirical covariance matrix is prohibitively large and will be rank-deficient. A common approach is to use Singular Value Decomposition (SVD) or the analogue Principal Component Analysis (PCA) to reduce the dimensionality of the data. Let the N observed i-dimensional samples Xj, j = L .N, collected in the data matrix X = [Xl ... XN] of size I x N, I> N . The SVD-theorem states that such a matrix can be decomposed as (1) where U is a matrix of the same size as X with orthogonal basis vectors spanning the space of X, so that UTU = INxN. The square matrix A contains the singular values in the diagonal, A = diag( AI, ... , >w), which are ordered and positive Al ~ A2 ~ ... ~ AN ~ 0, and V is N x N and orthogonal V TV = IN. If there is a mean value significantly different from zero it may at times be advantageous to perform the above analysis on mean-subtracted data, i.e. X - X = U A V T where columns of X all contain the mean vector x = Lj xj/N. Each observation Xj can be expressed in coordinates in the basis defined by the vectors of U with no loss of information[Lautrup et al., 1995]. A change of basis is obtained by qj = U T Xj as the orthogonal basis rotation Q = [ql ... qN] = U T X = UTUAV T = AVT . (2) Since Q is only N x Nand N « I, Q is a compact representation of the data. Having now N examples of N dimension we have reduced the problem to a marginally illposed one. To further reduce the dimensionality, it is common to retain only a subset of the coordinates, e.g. the top P coordinates (P < N) and the supervised or unsupervised model can be formed in this smaller but now well-posed space. So far we have considered the procedure for modelling from a training set. Our hope is that the statistical description generalizes well to new examples proving that is is a good description of the generating process. 
The model should, in other words, be able to perform well on a new example, x*, and in the above framework this would mean the predictions based on q* = U T x* should generalize well. We will show in the following, that in general, the distribution of the test set projection q* is quite different from the statistics of the projections of the training examples qj. It has been noted in previous work [Hansen and Larsen, 1996, Roweis, 1998, Hansen et al., 1999] that PCA/SVD of ill-posed data does not by itself represent a probabilistic model where we can assign a likelihood to a new test data point, and procedures have been proposed which make this possible. In [Bishop, 1999] PCA has been considered in a Bayesian framework, but does not address the significant bias of the variance in training set projections in ill-posed data sets. In [Jackson, 1991] an asymptotic expression is given for the bias of eigen-values in a sample covariance matrix, but this expression is valid only in the well-posed case and is not applicable for ill-posed data. 1.1 Example Let the signal source be I-dimensional multivariate Gaussian distribution N(O,~) with a covariance matrix where the first K eigen-values equal u 2 and the last 1- K are zero, so that the covariance matrix has the decomposition ~=u2YDyT, D=diag(1, ... ,1,0, ... ,0), yTY=I (3) Our N samples of the distribution are collected in the matrix X = [Xij] with the SVD (4) A = diag(Al, ... , AN) and the representation ofthe N examples in the N basis vector coordinates defined by U is Q = [%] = U T X = A V T. The total variance per training example is ~ LX;j ~Tr(XTX) = ~Tr(VAUTUAVT) = ~Tr(VA2VT) i,j = ~ Tr(VVT A2) = ~ Tr(A2) = ~L A; i (5) Note that this variance is the same in the U-basis coordinates: 1 L...J 2 N '
3 0.6150589 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
4 0.49868029 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1
5 0.49303702 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian
Abstract: How should we decide among competing explanations of a cognitive process given limited observations? The problem of model selection is at the heart of progress in cognitive science. In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. Finally, adequacy of MDL is demonstrated in two areas of cognitive modeling. 1 Model Selection and Model Complexity The development and testing of computational models of cognitive processing are a central focus in cognitive science. A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. This enterprise of model selection is challenging because of the competing goals that must be satisfied. Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncracies in the particular data set (i.e., noise) over and above the underlying regularities of interest. Such models are considered complex, in that the inherent flexibility in the model enables it to fit diverse patterns of data. As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. They do not assume a single structure in the data. Rather, the model contains multiple structures; each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. In contrast, simple models, frequently with few parameters, assume a specific structure in the data, which will manifest itself as a narrow range of similar data patterns. Only when one of these patterns occurs will the model fit the data well. The problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. To achieve this goal, the selection method must be sensitive to the complexity of a model. There are at least two independent dimensions of model complexity. They are the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. For instance, it seems unlikely that two one-parameter models, y = ex and y = x 9, are equally complex in their ability to fit data. The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability. The trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition. 
We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling. 1.1 Minimum Description Length The central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1]. It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family f with parameters e assigns the likelihood f(yle) to a given set of observed data y . The full form of the MDL measure for such a model family is given below. MDL = -In! (yISA) + ~ln( ; )+ In f dS.jdetl(S) where SA is the parameter that maximizes the likelihood, k is the number of parameters in the model, N is the sample size and I(e) is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first, maximized log likelihood term is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. In particular, the third term captures the effects of complexity due to functional form, reflected through I(e). We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper. MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density [2] . Hence in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection. 2 Differential Geometric Interpretation of MDL From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which
6 0.46762973 92 nips-2000-Occam's Razor
7 0.4607951 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
8 0.4495796 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
9 0.44356117 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
10 0.43297812 51 nips-2000-Factored Semi-Tied Covariance Matrices
11 0.38630161 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
12 0.38191271 133 nips-2000-The Kernel Gibbs Sampler
13 0.37121865 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
14 0.37026489 82 nips-2000-Learning and Tracking Cyclic Human Motion
15 0.35938621 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
16 0.3407397 18 nips-2000-Active Support Vector Machine Classification
18 0.31412792 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
19 0.30397275 49 nips-2000-Explaining Away in Weight Space
20 0.29899094 35 nips-2000-Computing with Finite and Infinite Networks
topicId topicWeight
[(10, 0.027), (17, 0.164), (32, 0.015), (33, 0.031), (54, 0.015), (55, 0.018), (62, 0.034), (65, 0.019), (67, 0.054), (75, 0.01), (76, 0.06), (79, 0.423), (81, 0.01), (90, 0.035), (97, 0.013)]
simIndex simValue paperId paperTitle
1 0.96871239 29 nips-2000-Bayes Networks on Ice: Robotic Search for Antarctic Meteorites
Author: Liam Pedersen, Dimitrios Apostolopoulos, William Whittaker
Abstract: A Bayes network based classifier for distinguishing terrestrial rocks from meteorites is implemented onboard the Nomad robot. Equipped with a camera, spectrometer and eddy current sensor, this robot searched the ice sheets of Antarctica and autonomously made the first robotic identification of a meteorite, in January 2000 at the Elephant Moraine. This paper discusses rock classification from a robotic platform, and describes the system onboard Nomad. 1
same-paper 2 0.89689124 27 nips-2000-Automatic Choice of Dimensionality for PCA
Author: Thomas P. Minka
Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Steifel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.
3 0.86382049 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
Author: Simon Tong, Daphne Koller
Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.
4 0.58115798 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1
5 0.56707412 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.
6 0.52551204 133 nips-2000-The Kernel Gibbs Sampler
7 0.51533473 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
8 0.48730329 94 nips-2000-On Reversing Jensen's Inequality
9 0.48622766 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
10 0.47628057 4 nips-2000-A Linear Programming Approach to Novelty Detection
11 0.47488317 127 nips-2000-Structure Learning in Human Causal Induction
12 0.47179374 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
13 0.45871305 82 nips-2000-Learning and Tracking Cyclic Human Motion
14 0.45800573 57 nips-2000-Four-legged Walking Gait Control Using a Neuromorphic Chip Interfaced to a Support Vector Learning Algorithm
15 0.45601243 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates
16 0.45340815 74 nips-2000-Kernel Expansions with Unlabeled Examples
17 0.4527601 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
18 0.44693881 122 nips-2000-Sparse Representation for Gaussian Process Models
19 0.44691148 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
20 0.44583878 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling