nips nips2000 nips2000-46 knowledge-graph by maker-knowledge-mining

46 nips-2000-Ensemble Learning and Linear Response Theory for ICA


Source: pdf

Author: Pedro A. d. F. R. Højen-Sørensen, Ole Winther, Lars Kai Hansen

Abstract: We propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case, whereas linear response theory, which gives an improved estimate of covariances, is very efficient. The examples given are for sources without temporal correlations. However, the derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. [sent-10, score-0.553]

2 We apply it to both discrete and continuous sources. [sent-11, score-0.044]

3 For the continuous source the underdetermined (overcomplete) case is studied. [sent-12, score-0.382]

4 The naive mean-field approach fails in this case, whereas linear response theory, which gives an improved estimate of covariances, is very efficient. [sent-13, score-0.293]

5 The examples given are for sources without temporal correlations. [sent-14, score-0.216]

6 However, this derivation can easily be extended to treat temporal correlations. [sent-15, score-0.126]

7 Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly. [sent-16, score-0.24]

8 1 Introduction Reconstruction of statistically independent source signals from linear mixtures is an active research field. [sent-17, score-0.359]

9 The source separation problem has a Bayesian formulation, see e.g. [sent-21, score-0.343]

10 [2, 3], for which there has been some recent progress based on ensemble learning [4]. [sent-23, score-0.216]

11 In the Bayesian framework, the covariances of the sources are needed in order to estimate the mixing matrix and the noise level. [sent-24, score-0.346]

12 Unfortunately, ensemble learning using factorized trial distributions only treats self-interactions correctly and trivially predicts ⟨S_i S_j⟩ − ⟨S_i⟩⟨S_j⟩ = 0 for i ≠ j. [sent-25, score-0.272]

13 This naive mean-field (NMF) approximation first introduced in the neural computing context by Ref. [sent-26, score-0.062]

14 [5] for Boltzmann machine learning may completely fail in some cases [6]. [sent-27, score-0.062]

15 Recently, Kappen and Rodriguez [6] introduced an efficient learning algorithm for Boltzmann Machines based on linear response (LR) theory. [sent-28, score-0.179]

16 LR theory gives a recipe for computing an improved approximation to the covariances directly from the solution to the NMF equations [7]. [sent-29, score-0.23]

17 Ensemble learning has been applied in many contexts within neural computation, e.g. [sent-30, score-0.029]

18 for sigmoid belief networks [8], where advanced mean field methods such as LR theory or TAP [9] may also be applicable. [sent-32, score-0.294]

19 In this paper, we show how LR theory can be applied to independent component analysis (ICA). [sent-33, score-0.155]

20 We observe that NMF may fail for high noise levels and binary sources and for the underdetermined continuous case. [sent-35, score-0.39]

21 In these cases the NMF approach ignores one of the sources and consequently overestimates the noise. [sent-36, score-0.19]

22 The LR approach on the other hand succeeds in all cases studied. [sent-37, score-0.064]

23 The derivation of the mean-field equations is kept completely general and is thus valid for a general source prior (without temporal correlations). [sent-38, score-0.536]

24 We show that the mean-field framework may be used to propose ICA algorithms for which the source prior is only defined implicitly. [sent-40, score-0.38]

25 Following [10], we consider a collection of N temporal measurements, X = {X_dt}, where X_dt denotes the measurement at the d-th sensor at time t. [sent-42, score-0.11]

26 Similarly, let S = {S_mt} denote a collection of M mutually independent sources, where S_m· [sent-43, score-0.201]

27 is the mth source which in general may have temporal correlations. [sent-44, score-0.356]

28 The measured signals X are assumed to be an instantaneous linear mixing of the sources corrupted by additive Gaussian noise Γ, that is, X = AS + Γ, (1) where A is the mixing matrix. [sent-45, score-0.362]

29 Furthermore, to simplify this exposition the noise is assumed to be iid Gaussian with variance σ². [sent-46, score-0.09]
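
As a concrete illustration of the generative model in eq. (1), here is a minimal numpy sketch that draws binary sources and produces noisy mixtures; the particular dimensions (M = 2 sources, D = 2 sensors, N = 500 samples), the mixing matrix and the noise level are arbitrary illustrative choices, not values taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

M, D, N = 2, 2, 500                         # sources, sensors, time points (arbitrary)
sigma = 0.3                                 # noise standard deviation (arbitrary)

A = rng.normal(size=(D, M))                 # mixing matrix
S = rng.choice([-1.0, 1.0], size=(M, N))    # binary (+/-1) sources
Gamma = sigma * rng.normal(size=(D, N))     # iid Gaussian noise

X = A @ S + Gamma                           # eq. (1): X = A S + Gamma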

30 The likelihood of the parameters is then given by [sent-47, score-0.037]

31 P(X|A, σ²) = ∫ dS P(X|A, σ², S) P(S), (2) where P(S) is the prior on the sources which might include temporal correlations. [sent-48, score-0.29]

32 We will, however, throughout this paper assume that the sources are temporally uncorrelated. [sent-49, score-0.134]

33 We choose to estimate the mixing matrix A and the noise level σ² by Maximum Likelihood (ML-II). [sent-50, score-0.18]

34 The saddle point of P(X|A, σ²) is attained at ∂ log P(X|A, σ²)/∂A = 0 and ∂ log P(X|A, σ²)/∂σ² = 0, giving A = X⟨S⟩^T (⟨SS^T⟩)^{-1} (3) and σ² = 1/(DN) Tr⟨(X − AS)^T (X − AS)⟩, (4) where ⟨·⟩ denotes the average over the source posterior. [sent-51, score-0.028]
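
The ML-II updates (3) and (4) translate directly into a few lines of numpy. The sketch below assumes the posterior moments ⟨S⟩ (an M x N matrix) and ⟨SS^T⟩ (an M x M matrix, summed over time) have already been supplied by the mean-field step described next; the function and variable names are mine, not the paper's.

import numpy as np

def ml2_updates(X, S_mean, SS_mean):
    """ML-II saddle-point updates, eqs. (3)-(4).

    X        : D x N data matrix
    S_mean   : M x N posterior means <S>
    SS_mean  : M x M second moment <S S^T>, summed over time
    """
    D, N = X.shape
    A = X @ S_mean.T @ np.linalg.inv(SS_mean)            # eq. (3)
    # eq. (4): sigma^2 = Tr <(X - A S)^T (X - A S)> / (D N)
    sigma2 = (np.trace(X.T @ X)
              - 2.0 * np.trace(S_mean @ X.T @ A)
              + np.trace(A.T @ A @ SS_mean)) / (D * N)
    return A, sigma2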

35 3 Mean field theory First, we derive mean field equations using ensemble learning. [sent-53, score-0.676]

36 Secondly, using linear response theory, we obtain improved estimates of the off-diagonal terms of ⟨SS^T⟩ which are needed for estimating A and σ². [sent-54, score-0.183]

37 The following derivation is performed for an arbitrary source prior. [sent-55, score-0.318]

38 3.1 Ensemble learning We adopt a standard ensemble learning approach and approximate the source posterior P(S|X, A, σ²) = P(X|A, σ², S) P(S) / P(X|A, σ²) (5) in a family of product distributions Q(S) = ∏_mt Q(S_mt). [sent-57, score-0.245]

39 We therefore parameterize the Gaussian as P(X|A, σ², S) = P(X|J, h, S) = C e^{½ Tr(S^T J S) + Tr(h^T S)}, (7) where J = −A^T A/σ² is the M × M interaction matrix and h = A^T X/σ² has the same dimensions as the source matrix S. [sent-60, score-0.332]
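
A small sketch of the reparameterization in eq. (7): given the current A, σ² and the data X, the interaction matrix J and the external field h are formed as follows (a direct transcription of the definitions in the text).

import numpy as np

def natural_parameters(A, sigma2, X):
    """Interaction matrix and external field of eq. (7)."""
    J = -(A.T @ A) / sigma2      # M x M interaction matrix, J = -A^T A / sigma^2
    h = (A.T @ X) / sigma2       # M x N external field, h = A^T X / sigma^2
    return J, h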

40 Note that h acts as an external field from which we can obtain all moments of the sources. [sent-61, score-0.241]

41 This is a property that we will make use of in the next section when we derive the linear response corrections. [sent-62, score-0.188]

42 The Kullback-Leibler divergence between the optimal product distribution Q(S) and the true source posterior is given by [sent-63, score-0.303]

43 KL = log P(X|A, σ²) − log P̃(X|A, σ²), where log P̃(X|A, σ²) = ∑_mt log ∫ dS P(S) e^{½ λ_mt S² + γ_mt S} + ½ ∑_mt (J_mm − λ_mt)⟨S²_mt⟩ + ½ Tr ⟨S⟩^T (J − diag(J)) ⟨S⟩ + Tr (h − γ)^T ⟨S⟩ + ln C (9) is the naive mean field approximation to the likelihood and diag(J) is the diagonal matrix of J. [sent-65, score-0.565]

44 The saddle points define the mean field equations: ∂KL/∂⟨S⟩ = 0 gives γ = h + (J − diag(J))⟨S⟩ (10), and ∂KL/∂⟨S²_mt⟩ = 0 gives λ_mt = J_mm (11). The remaining two equations depend explicitly on the source prior P(S): ∂KL/∂γ_mt = 0 gives [sent-66, score-0.667]

45 ⟨S_mt⟩ = ∂/∂γ_mt log ∫ dS_mt P(S_mt) e^{½ λ_mt S_mt² + γ_mt S_mt} ≡ f(γ_mt, λ_mt), (12) and ∂KL/∂λ_mt = 0 gives ⟨S²_mt⟩ = 2 ∂/∂λ_mt log ∫ dS_mt P(S_mt) e^{½ λ_mt S_mt² + γ_mt S_mt}. (13) [sent-67, score-0.096]

46 In section 4, we calculate f(γ_mt, λ_mt) for some of the prior distributions found in the ICA literature. [sent-69, score-0.074]
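
The naive mean-field equations (10)-(12) can be iterated to a fixed point. The sketch below assumes a user-supplied function f(gamma, lam) implementing eq. (12) for the chosen prior (one concrete choice is given further down); the damping factor and iteration count are my additions for numerical stability, not part of the paper's prescription.

import numpy as np

def nmf_fixed_point(J, h, f, n_iter=200, damp=0.5):
    """Iterate the naive mean-field equations (10)-(12).

    J : M x M interaction matrix, h : M x N external field,
    f : callable f(gamma, lam) returning <S>, eq. (12).
    Returns the estimated posterior means <S> (M x N).
    """
    M, N = h.shape
    lam = np.tile(np.diag(J)[:, None], (1, N))                  # eq. (11): lambda_mt = J_mm
    S_mean = np.zeros((M, N))
    for _ in range(n_iter):
        gamma = h + (J - np.diag(np.diag(J))) @ S_mean          # eq. (10)
        S_mean = damp * S_mean + (1.0 - damp) * f(gamma, lam)   # damped eq. (12)
    return S_mean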

47 3.2 Linear response theory As mentioned already, h acts as an external field. [sent-71, score-0.255]

48 This makes it possible to calculate the means and covariances as derivatives of log P(X|J, h), i.e. [sent-72, score-0.089]

49 ⟨S_mt⟩ = ∂ log P(X|J, h)/∂h_mt (14) and χ^{tt'}_{mm'} = ⟨S_mt S_m't'⟩ − ⟨S_mt⟩⟨S_m't'⟩ = ∂² log P(X|J, h)/∂h_mt ∂h_m't' = ∂⟨S_mt⟩/∂h_m't'. (15) [sent-74, score-0.478]

50 To derive an equation for χ^{tt'}_{mm'}, we use eqs. [sent-77, score-0.028]

51 (10), (11) and (12) to get χ^{tt'}_{mm'} = (∂f(γ_mt, λ_mt)/∂γ_mt) ∂γ_mt/∂h_m't' = (∂f(γ_mt, λ_mt)/∂γ_mt) (δ_{mm'} δ_{tt'} + ∑_{m''≠m} J_mm'' χ^{tt'}_{m''m'}). (16) [sent-78, score-0.056]

52 Figure 1: Binary source recovery for low noise level (M = 2, D = 2). [sent-116, score-0.447]

53 Shows from left to right: +/− the column vectors of the true A (with the observations superimposed), the estimated A (NMF), and the estimated A (LR). [sent-117, score-0.137]

54 Figure 2: Binary source recovery for low noise level (M = 2, D = 2). Shows the dynamics of the fixed-point iterations. [sent-145, score-0.478]

55 From left to right: +/− the column vectors of A (NMF); +/− the column vectors of A (LR); the variance σ² (solid: NMF, dashed: LR, thick dash-dotted: the true empirical noise variance). [sent-146, score-0.179]

56 We now see that the χ-matrix factorizes in time: χ^{tt'}_{mm'} = δ_{tt'} χ^t_{mm'}. [sent-147, score-0.028]

57 This is a direct consequence of the fact that the model has no temporal correlations. [sent-148, score-0.082]

58 The above equation is linear and may straightforwardly be solved to yield χ^t_{mm'} = [(Λ_t − J)^{-1}]_{mm'}, (17) where we have defined the diagonal matrix Λ_t = diag((∂f(γ_1t, λ_1t)/∂γ_1t)^{-1} + J_11, …, (∂f(γ_Mt, λ_Mt)/∂γ_Mt)^{-1} + J_MM). [sent-149, score-0.192]
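
Once the fixed point is reached, the linear response estimate (17) only requires ∂f/∂γ at the solution and one matrix inversion per time slice. In this sketch the derivative is obtained by a central finite difference, which is a stand-in for the analytic derivative one would normally use.

import numpy as np

def lr_covariance(J, gamma_t, lam_t, f, eps=1e-6):
    """Linear response covariance chi^t = (Lambda_t - J)^{-1}, eq. (17),
    for a single time slice t.  gamma_t and lam_t are length-M vectors."""
    df = (f(gamma_t + eps, lam_t) - f(gamma_t - eps, lam_t)) / (2.0 * eps)
    Lambda_t = np.diag(1.0 / df + np.diag(J))   # Lambda_mt = (df/dgamma)^{-1} + J_mm
    return np.linalg.inv(Lambda_t - J)          # chi^t_{mm'}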

59 At this point it is appropriate to explain why linear response theory is more precise than using the factorized distribution, which predicts χ^t_{mm'} = 0 for the non-diagonal terms. [sent-154, score-0.362]

60 Here, we give an argument that can be found in Parisi's book on statistical field theory [7]: Let us assume that the approximate and exact distributions are close in some sense, i.e. [sent-155, score-0.201]

61 Mean field theory gives a lower bound on the log-likelihood since KL, eq. [sent-157, score-0.201]

62 Consequently, the linear term vanishes in the expansion of the log-likelihood: log P(X|A, σ²) = log P̃(X|A, σ²) + O(ε²). [sent-159, score-0.046]

63 It is therefore more precise to obtain moments of the variables through derivatives of the approximate log-likelihood, i.e. [sent-160, score-0.093]

64 Using the TAP value of λ_mt in f(γ_mt, λ_mt), and likewise in the definition of Λ_t above, we get the TAP equations χ^t_{mm} = ∂f(γ_mt, λ_mt)/∂γ_mt = [(Λ_t − J)^{-1}]_{mm}. [sent-166, score-0.185]

65 [9]. The TAP equation for λ_mt is obtained by solving this self-consistency condition for λ_mt at each m and t. [sent-167, score-0.123]

66 Figure 3: Binary source recovery for high noise level (M = 2, D = 2). [sent-194, score-0.447]

67 Shows from left to right: +/− the column vectors of the true A (with the observations superimposed), the estimated A (NMF), and the estimated A (LR). [sent-195, score-0.137]

68 Figure 6: Overcomplete continuous source recovery with M = 3 and D = 2. [sent-251, score-0.431]

69 Note that the initial iteration step for A is very large. [sent-253, score-0.031]

70 4.2 Continuous source To give a tractable example which illustrates the improvement by LR, we consider the Gaussian prior P(S_mt) ∝ exp(−S_mt²/2). [sent-255, score-0.074]

71 Since we have a factorized distribution, ensemble learning predicts ⟨S_mt S_m't'⟩ − ⟨S_mt⟩⟨S_m't'⟩ = δ_{mm'} δ_{tt'} (⟨S²_mt⟩ − ⟨S_mt⟩²). [sent-260, score-0.326]

72 For the popular choice of prior P(S_mt) = 1/(π cosh S_mt) [1], it is not possible to derive f(γ_mt, λ_mt) analytically. [sent-267, score-0.112]

73 Mean field equations for negative kurtosis can be obtained using the prior P(S_mt) ∝ exp(−(S_mt − μ)²/2) + exp(−(S_mt + μ)²/2) [1], leading to a closed-form expression for f(γ_mt, λ_mt). Figures 5 and 6 show simulations using this source prior with μ = 1 in an overcomplete setting with D = 2 and M = 3. [sent-272, score-0.804]

74 Note that μ = 1 yields a unimodal source distribution, which is hence qualitatively different from the bimodal prior considered in the binary case. [sent-273, score-0.401]
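
For this mixture-of-Gaussians prior, f in eq. (12) can be obtained in closed form by completing the square in each mixture component; the integral converges because λ_mt = J_mm is negative. The expression below is my own derivation under that convention, offered as an illustration rather than a formula quoted from the paper, and it plugs directly into the fixed-point sketch given earlier.

import numpy as np

def f_mixture(gamma, lam, mu=1.0):
    """f(gamma, lambda) of eq. (12) for the prior
    P(s) proportional to exp(-(s - mu)^2 / 2) + exp(-(s + mu)^2 / 2),
    assuming lam < 1 so the Gaussian integrals converge."""
    a = 1.0 - lam
    return (gamma + mu * np.tanh(gamma * mu / a)) / a

Setting mu = 0 recovers a unit-variance Gaussian prior, for which the same calculation gives f(gamma, lam) = gamma / (1 - lam).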

75 In the overcomplete setting the NMF approach fails to recover the true sources. [sent-274, score-0.27]

76 See [13] for further discussion of the overcomplete case. [sent-275, score-0.193]

77 5 Conclusion We have presented a general ICA mean field framework based upon ensemble learning and linear response theory. [sent-276, score-0.586]

78 The naive mean-field approach (pure ensemble learning) fails in some cases and we speculate that it is incapable of handling the overcomplete case (more sources than sensors). [sent-277, score-0.624]

79 Linear response theory, on the other hand, succeeds in all the examples studied. [sent-278, score-0.168]

80 There are two directions in which we plan to extend this work: (1) to sources with temporal correlations and (2) to source models defined not by a parametric source prior, but directly in terms of the function f, which defines the mean field equations. [sent-279, score-0.952]

81 Starting directly from the f-function makes it possible to test a whole range of implicitly defined source priors. [sent-280, score-0.274]

82 A detailed analysis of a large selection of constrained and unconstrained source priors as well as comparisons of LR and the TAP approach can be found in [14]. [sent-281, score-0.274]

83 Acknowledgments PHS wishes to thank Mike Jordan for stimulating discussions on the mean field and variational methods. [sent-282, score-0.216]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('amt', 0.383), ('smt', 0.288), ('xia', 0.288), ('source', 0.274), ('nmf', 0.249), ('lr', 0.205), ('overcomplete', 0.193), ('ensemble', 0.187), ('mt', 0.143), ('sources', 0.134), ('field', 0.127), ('mm', 0.123), ('diag', 0.117), ('winther', 0.11), ('response', 0.104), ('lea', 0.1), ('fbmt', 0.096), ('jmm', 0.096), ('mtl', 0.096), ('okl', 0.096), ('tap', 0.09), ('ts', 0.087), ('hansen', 0.083), ('xij', 0.082), ('recovery', 0.082), ('temporal', 0.082), ('theory', 0.074), ('prior', 0.074), ('tr', 0.073), ('separation', 0.069), ('girolami', 0.064), ('inp', 0.064), ('oamt', 0.064), ('ott', 0.064), ('parisi', 0.064), ('sst', 0.064), ('succeeds', 0.064), ('underdetermined', 0.064), ('noise', 0.062), ('naive', 0.062), ('equations', 0.062), ('covariances', 0.061), ('mean', 0.061), ('mixing', 0.06), ('factorized', 0.056), ('tt', 0.056), ('lund', 0.055), ('xdt', 0.055), ('predicts', 0.054), ('binary', 0.053), ('superimposed', 0.05), ('ml', 0.048), ('fails', 0.048), ('rodriguez', 0.046), ('lh', 0.046), ('denmark', 0.046), ('linear', 0.046), ('continuous', 0.044), ('derivation', 0.044), ('column', 0.044), ('kappen', 0.043), ('component', 0.042), ('boltzmann', 0.042), ('acts', 0.041), ('opper', 0.041), ('independent', 0.039), ('derive', 0.038), ('blind', 0.037), ('moments', 0.037), ('likelihood', 0.037), ('external', 0.036), ('ia', 0.034), ('fail', 0.033), ('improved', 0.033), ('framework', 0.032), ('sigmoid', 0.032), ('estimated', 0.032), ('ica', 0.031), ('iteration', 0.031), ('gaussian', 0.03), ('level', 0.029), ('matrix', 0.029), ('true', 0.029), ('learning', 0.029), ('consequently', 0.028), ('precise', 0.028), ('derivatives', 0.028), ('collection', 0.028), ('exposition', 0.028), ('overestimates', 0.028), ('alt', 0.028), ('saddlepoint', 0.028), ('wishes', 0.028), ('factorizes', 0.028), ('ole', 0.028), ('solvegatan', 0.028), ('councils', 0.028), ('danish', 0.028), ('fokoue', 0.028), ('hts', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

Author: Pedro A. d. F. R. Højen-Sørensen, Ole Winther, Lars Kai Hansen

Abstract: We propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case, whereas linear response theory, which gives an improved estimate of covariances, is very efficient. The examples given are for sources without temporal correlations. However, the derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly.

2 0.16463351 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.

3 0.12220111 96 nips-2000-One Microphone Source Separation

Author: Sam T. Roweis

Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting (

4 0.11321498 13 nips-2000-A Tighter Bound for Graphical Models

Author: Martijn A. R. Leisink, Hilbert J. Kappen

Abstract: We present a method to bound the partition function of a Boltzmann machine neural network with any odd order polynomial. This is a direct extension of the mean field bound, which is first order. We show that the third order bound is strictly better than mean field. Additionally we show the rough outline how this bound is applicable to sigmoid belief networks. Numerical experiments indicate that an error reduction of a factor two is easily reached in the region where expansion based approximations are useful. 1

5 0.11248412 114 nips-2000-Second Order Approximations for Probability Models

Author: Hilbert J. Kappen, Wim Wiegerinck

Abstract: In this paper, we derive a second order mean field theory for directed graphical probability models. By using an information theoretic argument it is shown how this can be done in the absence of a partition function. This method is a direct generalisation of the well-known TAP approximation for Boltzmann Machines. In a numerical example, it is shown that the method greatly improves the first order mean field approximation. For a restricted class of graphical models, so-called single overlap graphs, the second order method has comparable complexity to the first order method. For sigmoid belief networks, the method is shown to be particularly fast and effective.

6 0.097412214 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

7 0.084832773 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

8 0.082767822 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

9 0.080242999 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks

10 0.075983375 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

11 0.073666923 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

12 0.066991739 27 nips-2000-Automatic Choice of Dimensionality for PCA

13 0.066069633 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

14 0.064137265 35 nips-2000-Computing with Finite and Infinite Networks

15 0.063356102 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

16 0.061315026 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

17 0.059899032 122 nips-2000-Sparse Representation for Gaussian Process Models

18 0.058512814 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

19 0.05385166 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code

20 0.049985081 22 nips-2000-Algorithms for Non-negative Matrix Factorization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.196), (1, -0.054), (2, 0.1), (3, -0.027), (4, 0.089), (5, -0.043), (6, -0.132), (7, -0.004), (8, 0.039), (9, -0.067), (10, -0.079), (11, -0.06), (12, -0.195), (13, 0.042), (14, -0.093), (15, -0.07), (16, -0.068), (17, -0.205), (18, 0.002), (19, -0.023), (20, -0.026), (21, -0.056), (22, 0.144), (23, -0.092), (24, 0.151), (25, 0.011), (26, -0.024), (27, -0.013), (28, -0.24), (29, 0.188), (30, 0.079), (31, -0.065), (32, 0.026), (33, 0.136), (34, -0.062), (35, 0.051), (36, 0.07), (37, -0.148), (38, -0.101), (39, -0.131), (40, -0.08), (41, -0.158), (42, -0.15), (43, 0.086), (44, -0.052), (45, 0.048), (46, 0.015), (47, -0.113), (48, -0.014), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95884246 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

Author: Pedro A. d. F. R. Højen-Sørensen, Ole Winther, Lars Kai Hansen

Abstract: We propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case, whereas linear response theory, which gives an improved estimate of covariances, is very efficient. The examples given are for sources without temporal correlations. However, the derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly.

2 0.54224718 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.

3 0.5301308 96 nips-2000-One Microphone Source Separation

Author: Sam T. Roweis

Abstract: Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting (

4 0.41664219 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

Author: Ulrik Kjems, Lars Kai Hansen, Stephen C. Strother

Abstract: We demonstrate that statistical analysis of ill-posed data sets is subject to a bias, which can be observed when projecting independent test set examples onto a basis defined by the training examples. Because the training examples in an ill-posed data set do not fully span the signal space, the observed training set variances in each basis vector will be too high compared to the average variance of the test set projections onto the same basis vectors. On the basis of this understanding we introduce the Generalizable Singular Value Decomposition (GenSVD) as a means to reduce this bias by re-estimation of the singular values obtained in a conventional Singular Value Decomposition, allowing for a generalization performance increase of a subsequent statistical model. We demonstrate that the algorithm successfully corrects bias in a data set from a functional PET activation study of the human brain. 1 Ill-posed Data Sets An ill-posed data set has more dimensions in each example than there are examples. Such data sets occur in many fields of research, typically in connection with image measurements. The associated statistical problem is that of extracting structure from the observed high-dimensional vectors in the presence of noise. The statistical analysis can be done either supervised (i.e. modelling with target values: classification, regression) or unsupervised (modelling with no target values: clustering, PCA, ICA). In both types of analysis the ill-posedness may lead to immediate problems if one tries to apply conventional statistical methods of analysis; for example, the empirical covariance matrix is prohibitively large and will be rank-deficient. A common approach is to use Singular Value Decomposition (SVD) or the analogue Principal Component Analysis (PCA) to reduce the dimensionality of the data. Let the N observed I-dimensional samples x_j, j = 1..N, be collected in the data matrix X = [x_1 ... x_N] of size I × N, I > N. The SVD theorem states that such a matrix can be decomposed as X = UΛV^T, (1) where U is a matrix of the same size as X with orthogonal basis vectors spanning the space of X, so that U^T U = I_{N×N}. The square matrix Λ contains the singular values in the diagonal, Λ = diag(λ_1, ..., λ_N), which are ordered and positive, λ_1 ≥ λ_2 ≥ ... ≥ λ_N ≥ 0, and V is N × N and orthogonal, V^T V = I_N. If there is a mean value significantly different from zero it may at times be advantageous to perform the above analysis on mean-subtracted data, i.e. X − X̄ = UΛV^T, where the columns of X̄ all contain the mean vector x̄ = Σ_j x_j / N. Each observation x_j can be expressed in coordinates in the basis defined by the vectors of U with no loss of information [Lautrup et al., 1995]. A change of basis is obtained by q_j = U^T x_j as the orthogonal basis rotation Q = [q_1 ... q_N] = U^T X = U^T UΛV^T = ΛV^T. (2) Since Q is only N × N and N ≪ I, Q is a compact representation of the data. Having now N examples of N dimensions, we have reduced the problem to a marginally ill-posed one. To further reduce the dimensionality, it is common to retain only a subset of the coordinates, e.g. the top P coordinates (P < N), and the supervised or unsupervised model can be formed in this smaller but now well-posed space. So far we have considered the procedure for modelling from a training set. Our hope is that the statistical description generalizes well to new examples, proving that it is a good description of the generating process.
The model should, in other words, be able to perform well on a new example, x*, and in the above framework this would mean the predictions based on q* = U^T x* should generalize well. We will show in the following that, in general, the distribution of the test set projection q* is quite different from the statistics of the projections of the training examples q_j. It has been noted in previous work [Hansen and Larsen, 1996, Roweis, 1998, Hansen et al., 1999] that PCA/SVD of ill-posed data does not by itself represent a probabilistic model where we can assign a likelihood to a new test data point, and procedures have been proposed which make this possible. In [Bishop, 1999] PCA has been considered in a Bayesian framework, but this does not address the significant bias of the variance in training set projections in ill-posed data sets. In [Jackson, 1991] an asymptotic expression is given for the bias of eigen-values in a sample covariance matrix, but this expression is valid only in the well-posed case and is not applicable for ill-posed data. 1.1 Example Let the signal source be an I-dimensional multivariate Gaussian distribution N(0, Σ) with a covariance matrix where the first K eigen-values equal σ² and the last I − K are zero, so that the covariance matrix has the decomposition Σ = σ² Y D Y^T, D = diag(1, ..., 1, 0, ..., 0), Y^T Y = I. (3) Our N samples of the distribution are collected in the matrix X = [X_ij] with the SVD X = UΛV^T, (4) Λ = diag(λ_1, ..., λ_N), and the representation of the N examples in the N basis vector coordinates defined by U is Q = [q_1 ... q_N] = U^T X = ΛV^T. The total variance per training example is (1/N) Σ_ij X_ij² = (1/N) Tr(X^T X) = (1/N) Tr(VΛU^T UΛV^T) = (1/N) Tr(VΛ²V^T) = (1/N) Tr(VV^T Λ²) = (1/N) Tr(Λ²) = (1/N) Σ_i λ_i². (5) Note that this variance is the same in the U-basis coordinates.
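
The projection bias described above is easy to reproduce numerically; the sketch below (my own toy setup, not the paper's PET data) projects a training set and an independent test set onto the training-set basis U and compares the per-component variances.

import numpy as np

rng = np.random.default_rng(1)
I, N = 100, 20                                   # more dimensions than examples (toy values)
X_train = rng.normal(size=(I, N))                # ill-posed training set
X_test = rng.normal(size=(I, N))                 # independent test set

U, lam, Vt = np.linalg.svd(X_train, full_matrices=False)
var_train = ((U.T @ X_train) ** 2).mean(axis=1)  # variance of training projections q_j
var_test = ((U.T @ X_test) ** 2).mean(axis=1)    # variance of test projections q*
# var_train is systematically larger than var_test, which is the bias GenSVD re-estimates.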

5 0.36572093 114 nips-2000-Second Order Approximations for Probability Models

Author: Hilbert J. Kappen, Wim Wiegerinck

Abstract: In this paper, we derive a second order mean field theory for directed graphical probability models. By using an information theoretic argument it is shown how this can be done in the absence of a partition function. This method is a direct generalisation of the well-known TAP approximation for Boltzmann Machines. In a numerical example, it is shown that the method greatly improves the first order mean field approximation. For a restricted class of graphical models, so-called single overlap graphs, the second order method has comparable complexity to the first order method. For sigmoid belief networks, the method is shown to be particularly fast and effective.

6 0.36527556 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

7 0.35953298 13 nips-2000-A Tighter Bound for Graphical Models

8 0.35465094 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks

9 0.34842405 22 nips-2000-Algorithms for Non-negative Matrix Factorization

10 0.29157287 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

11 0.28396362 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

12 0.27277222 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

13 0.27035698 126 nips-2000-Stagewise Processing in Error-correcting Codes and Image Restoration

14 0.26698849 27 nips-2000-Automatic Choice of Dimensionality for PCA

15 0.25426286 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

16 0.24554634 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes

17 0.24501084 122 nips-2000-Sparse Representation for Gaussian Process Models

18 0.24312434 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

19 0.24178348 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

20 0.23242398 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.013), (17, 0.101), (26, 0.012), (32, 0.027), (33, 0.041), (36, 0.018), (54, 0.023), (55, 0.026), (60, 0.336), (62, 0.034), (65, 0.025), (67, 0.088), (76, 0.061), (79, 0.025), (81, 0.029), (90, 0.024), (91, 0.032), (97, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80281878 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

Author: Pedro A. d. F. R. Højen-Sørensen, Ole Winther, Lars Kai Hansen

Abstract: We propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case, whereas linear response theory, which gives an improved estimate of covariances, is very efficient. The examples given are for sources without temporal correlations. However, the derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly.

2 0.6825766 146 nips-2000-What Can a Single Neuron Compute?

Author: Blaise Agüera y Arcas, Adrienne L. Fairhall, William Bialek

Abstract: In this paper we formulate a description of the computation performed by a neuron as a combination of dimensional reduction and nonlinearity. We implement this description for the Hodgkin-Huxley model, identify the most relevant dimensions and find the nonlinearity. A two dimensional description already captures a significant fraction of the information that spikes carry about dynamic inputs. This description also shows that computation in the Hodgkin-Huxley model is more complex than a simple integrate-and-fire or perceptron model. 1

3 0.42010134 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

4 0.41954839 13 nips-2000-A Tighter Bound for Graphical Models

Author: Martijn A. R. Leisink, Hilbert J. Kappen

Abstract: We present a method to bound the partition function of a Boltzmann machine neural network with any odd order polynomial. This is a direct extension of the mean field bound, which is first order. We show that the third order bound is strictly better than mean field. Additionally we show the rough outline how this bound is applicable to sigmoid belief networks. Numerical experiments indicate that an error reduction of a factor two is easily reached in the region where expansion based approximations are useful. 1

5 0.41902393 122 nips-2000-Sparse Representation for Gaussian Process Models

Author: Lehel Csató, Manfred Opper

Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.

6 0.41652119 134 nips-2000-The Kernel Trick for Distances

7 0.413984 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

8 0.4063139 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

9 0.40629464 37 nips-2000-Convergence of Large Margin Separable Linear Classification

10 0.4019514 74 nips-2000-Kernel Expansions with Unlabeled Examples

11 0.40109268 85 nips-2000-Mixtures of Gaussian Processes

12 0.40079945 60 nips-2000-Gaussianization

13 0.39781365 94 nips-2000-On Reversing Jensen's Inequality

14 0.39568713 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

15 0.39506263 79 nips-2000-Learning Segmentation by Random Walks

16 0.39437088 49 nips-2000-Explaining Away in Weight Space

17 0.39422181 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

18 0.39344907 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account

19 0.39230257 21 nips-2000-Algorithmic Stability and Generalization Performance

20 0.39126647 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets