nips nips2001 nips2001-165 knowledge-graph by maker-knowledge-mining

165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA


Source: pdf

Author: Magnus Rattray, Gleb Basalyga

Abstract: We study the dynamics of a Hebbian ICA algorithm extracting a single non-Gaussian component from a high-dimensional Gaussian background. For both on-line and batch learning we find that a surprisingly large number of examples are required to avoid trapping in a sub-optimal state close to the initial conditions. To extract a skewed signal the number of examples required grows rapidly with the data dimension, and an even larger number is required to extract a symmetrical signal with non-zero kurtosis.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We study the dynamics of a Hebbian ICA algorithm extracting a single non-Gaussian component from a high-dimensional Gaussian background. [sent-8, score-0.187]

2 For both on-line and batch learning we find that a surprisingly large number of examples are required to avoid trapping in a sub-optimal state close to the initial conditions. [sent-9, score-0.826]

3 To extract a skewed signal the number of examples required grows rapidly with the dimension of the data, and an even larger number is required to extract a symmetrical signal with non-zero kurtosis. [sent-10, score-0.634]

4 1 Introduction: Independent component analysis (ICA) is a statistical modelling technique which has attracted a significant amount of research interest in recent years (for a review, see Hyvärinen, 1999). [sent-11, score-0.073]

5 A number of neural learning algorithms have been applied to this problem, as detailed in the aforementioned review. [sent-13, score-0.067]

6 Theoretical studies of ICA algorithms have mainly focussed on asymptotic stability and efficiency, using the established results of stochastic approximation theory. [sent-14, score-0.095]

7 However, in practice the transient stages of learning will often be more significant in determining the success of an algorithm. [sent-15, score-0.176]

8 In this paper a Hebbian ICA algorithm is analysed in both on-line and batch mode, highlighting the critical importance of the transient dynamics. [sent-16, score-0.322]

9 We find that a surprisingly large number of training examples are required in order to avoid trapping in a sub-optimal state close to the initial conditions. [sent-17, score-0.625]

10 To detect a skewed signal the number of examples required grows rapidly with the dimension N of the data, while an even larger number is required for a symmetric signal with non-zero kurtosis. [sent-18, score-0.63]

11 In addition, for on-line learning we show that the maximal initial learning rate which allows successful learning is unusually low, decreasing with the data dimension, and is lower for a symmetric signal than for a skewed signal. [sent-19, score-0.7]

12 In order to obtain a tractable model, we consider the limit of high-dimensional data and study an idealised data set in which a single non-Gaussian source is mixed into a large number of Gaussian sources. [sent-21, score-0.211]

13 In that work a solution to the dynamics of the on-line algorithm was obtained in closed form over an initial number of learning iterations, and a simple solution to the asymptotic dynamics under the optimal learning rate decay was also obtained. [sent-23, score-0.615]

14 However, it was noted there that modelling the dynamics on that timescale is not always appropriate, because the algorithm typically requires much longer in order to escape from a class of metastable states close to the initial conditions. [sent-24, score-1.081]

15 In order to elucidate this effect in greater detail we focus here on the simplest case of a single non-Gaussian source and we will limit our analysis to the dynamics close to the initial conditions. [sent-25, score-0.578]

16 In recent years a number of on-line learning algorithms, including back-propagation and Sanger’s PCA algorithm, have been studied using techniques from statistical mechanics (see, for example, Biehl (1994); Biehl and Schwarze (1995); Saad and Solla (1995) and contributions in Saad (1998)). [sent-26, score-0.1]

17 These analyses exploited the "self-averaging" property of certain macroscopic variables in order to obtain ordinary differential equations describing the deterministic evolution of these quantities over time in the limit of large input dimension. [sent-27, score-0.119]

18 In the present case the appropriate macroscopic quantity does not self-average and fluctuations have to be considered even in the limit of large input dimension. [sent-28, score-0.103]

19 In this case it is more natural to model the on-line learning dynamics as a diffusion process (see, for example Gardiner, 1985). [sent-29, score-0.405]

20 2 Data Model: In order to apply the Hebbian ICA algorithm we must first sphere the data, i.e. linearly transform it so that it has zero mean and an identity covariance matrix. [sent-30, score-0.079]

21 This can be achieved by standard transformations in a batch setting; for on-line learning an adaptive sphering algorithm, such as the one introduced by Cardoso and Laheld (1996), could be used. [sent-32, score-0.247]
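
As a concrete illustration of the batch sphering step mentioned above, the following minimal sketch whitens a data matrix with a standard eigendecomposition of the sample covariance. This is not the paper's code; the function and variable names are illustrative, and an adaptive on-line alternative (in the spirit of Cardoso and Laheld, 1996) would instead update the whitening matrix incrementally.

```python
import numpy as np

def sphere(X):
    """Whiten data so it has zero mean and (approximately) identity covariance.

    X : array of shape (n_samples, n_dims).
    Returns the sphered data and the whitening matrix.
    """
    Xc = X - X.mean(axis=0)                   # remove the sample mean
    C = np.cov(Xc, rowvar=False)              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # symmetric eigendecomposition
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T  # C^{-1/2}
    return Xc @ W, W
```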

22 Each data point is generated from a noiseless linear mixture of sources, which are decomposed into a single non-Gaussian source and uncorrelated Gaussian components. [sent-35, score-0.241]

23 We will consider both the on-line case, in which a new IID example is presented to the algorithm at each time step and then discarded, and also the batch case, in which a finite set of examples is available to the algorithm. [sent-38, score-0.27]
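
A minimal sketch of the idealised data model described in the last two sentences: a single non-Gaussian source mixed along a unit direction A into an N-dimensional Gaussian background, sphered by construction. The same generator can feed an on-line stream (one call per iteration) or a finite batch. The specific source choices (a skewed binary source and a uniform source, echoing the simulations described later) and all parameter values are assumptions for illustration.

```python
import numpy as np

def make_example(A, rng, source="skewed"):
    """Generate one data point x = s*A + Gaussian background (unit covariance overall).

    A : unit-norm mixing vector of the single non-Gaussian source (length N).
    The source has zero mean and unit variance, so the mixture is sphered by construction.
    """
    N = len(A)
    if source == "skewed":
        # binary source with zero mean, unit variance and non-zero skewness
        s = rng.choice([-0.5, 2.0], p=[0.8, 0.2])
    else:
        # uniform source: symmetric, unit variance, non-zero (negative) kurtosis
        s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0))
    noise = rng.standard_normal(N)
    noise -= (noise @ A) * A      # keep the Gaussian part orthogonal to the source direction
    return s * A + noise
```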

24 3 On-line learning [sent-49, score-0.265]

25 This assumes zero correlation between the weight vector and the current example, which is true for on-line learning but is only strictly true for the first iteration of batch learning (see section 4). [sent-54, score-0.35]

26 In the algorithm described below we impose a normalisation constraint on the weight vector such that it has unit length. [sent-55, score-0.119]

27 V A simple Hebbian (or anti-Hebbian) learning rule was studied by Hyv¨ rinen and Oja a (1998), who showed it to have a remarkably simple stability condition. [sent-57, score-0.303]

28 We will consider the deflationary form in which a single source is learned at a time. [sent-58, score-0.165]

29 Maximising some such measure (simple examples would be skewness or kurtosis) leads to the following simple algorithm (see Hyvärinen and Oja, 1998, for details). [sent-61, score-0.319]

30 The change in the weight vector at each time step is given by a Hebbian update of the form Δw ∝ η φ(w·x) x, followed by normalisation such that the weight vector has unit length (eqn. 5). Here, η is the learning rate and φ is some non-linear function which we will take to be at least three times differentiable. [sent-62, score-0.321]

31 An even function, e.g. a quadratic, is appropriate for detecting asymmetric signals while a more common choice is an odd function, e.g. [sent-64, score-0.422]

32 a cubic or hyperbolic tangent, which can be used to detect symmetric non-Gaussian signals. [sent-65, score-0.095]

33 In the latter case the sign of the update has to be chosen in order to ensure stability of the correct solution, as described by Hyvärinen and Oja (1998), either adaptively or a priori, and similarly in the case of an even non-linearity. [sent-66, score-0.243]
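
A runnable sketch of the deflationary one-unit rule described in the last few sentences: a Hebbian update with a non-linearity phi followed by renormalisation to unit length, tracking the overlap with the true source direction. The exact constants, sign handling, and learning-rate schedule are assumptions for illustration, not the authors' implementation; make_example is the data-model helper sketched above.

```python
import numpy as np

def online_hebbian_ica(A, eta, n_steps, phi, rng, sign=+1.0):
    """One-unit Hebbian ICA: dw = sign * eta * phi(w.x) * x, then renormalise |w| = 1.

    A    : true unit-norm mixing vector (used only to monitor the overlap R = w.A).
    phi  : non-linearity, e.g. lambda y: y**2 (even, for skewed sources)
           or lambda y: y**3 (odd, for symmetric sources with non-zero kurtosis).
    sign : +1 or -1, chosen so that the correct solution is stable.
    """
    N = len(A)
    w = rng.standard_normal(N)
    w /= np.linalg.norm(w)          # random initial direction: overlap is O(1/sqrt(N))
    overlaps = np.empty(n_steps)
    for t in range(n_steps):
        x = make_example(A, rng)    # fresh IID example (see the data-model sketch above)
        y = w @ x
        w += sign * eta * phi(y) * x
        w /= np.linalg.norm(w)      # normalisation constraint |w| = 1
        overlaps[t] = w @ A         # overlap with the true source direction
    return w, overlaps
```

Running this with a small learning rate and large N and plotting the recorded overlaps makes the long initial plateau analysed in the paper directly visible.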

34 We can write the above algorithm as eqn. (6). [sent-70, score-0.231]

35 For large N (two different scalings of the learning rate will be considered below) we can expand this out to obtain a weight-decay form of the normalisation, eqn. (7). [sent-71, score-0.22]

36 Taking the dot-product of eqn. (7) with the source direction gives the following update increment for the overlap, eqn. (8), where we used the normalisation constraint. [sent-72, score-0.099]

37 Below we calculate the mean and variance of this increment for two different scalings of the learning rate. [sent-74, score-0.124]

38 After an appropriate rescaling these expressions depend only on the overlap and on statistics of the non-Gaussian source distribution. [sent-76, score-0.165]

39 3.1 Dynamics close to the initial conditions [sent-78, score-0.255]

40 If the entries in the weight vector and the source's mixing vector are initially of similar order then one would expect an initial overlap of order 1/√N. [sent-79, score-0.174]

41 This is the typical case if we consider a random and uncorrelated choice for the mixing vector and the initial entries in the weight vector. [sent-80, score-0.189]

42 Larger initial values of the overlap could only be obtained with some prior knowledge of the mixing matrix, which we will not assume. [sent-81, score-0.166]

43 The discussion below is therefore restricted to describing the dynamics close to the initial conditions. [sent-83, score-0.367]

44 For an account of the transient dynamics far from the initial conditions and the asymptotic dynamics close to an optimal solution, see Rattray (2002). [sent-84, score-0.713]

45 1 If the signal is asymmetrical then an even non-linearity can be used; for example, a quadratic is a common choice. [sent-87, score-0.317]

46 The appropriate (maximal) scaling for the learning rate decreases with the dimension N, and we write the learning rate in terms of an O(1) scaled learning-rate parameter. [sent-89, score-0.365]

47 Figure 1: Close to the initial conditions (where the overlap is small) the learning dynamics is equivalent to diffusion in a polynomial potential. [sent-92, score-0.566]

48 For asymmetrical source distributions we can use an even non-linearity in which case the potential is cubic, as shown on the left. [sent-93, score-0.481]

49 For symmetrical source distributions with non-zero kurtosis we should use an odd non-linearity in which case the potential is quartic, as shown on the right. [sent-94, score-0.781]

50 The dynamics is initially confined in a metastable state near the origin. [sent-96, score-0.447]

51 In this case the system can be described by a Fokker-Planck equation for large N (see, for example, Gardiner, 1985) with a characteristic timescale that grows with N. [sent-100, score-0.138]

52 The system is locally equivalent to a diffusion in the cubic potential of eqn. (11), [sent-101, score-0.253]

53 with a diffusion coefficient which is independent of the state variable. [sent-102, score-0.184]

54 The shape of this potential is shown on the left of fig. 1. [sent-103, score-0.179]

55 A potential barrier must be overcome to escape a metastable state close to the initial conditions. [sent-105, score-0.986]

56 2 If the signal is symmetrical, or only weakly asymmetrical, it will be necessary to use an odd non-linearity; for example, cubic or hyperbolic-tangent functions are popular choices. [sent-108, score-0.413]

57 In this case a lower learning rate is required in order to achieve successful separation. [sent-109, score-0.313]

58 The appropriate scaling for the learning rate is lower in this case, and we again write the learning rate in terms of an O(1) scaled learning-rate parameter. [sent-110, score-0.395]

59 The mean and variance of the increment are given by eqns. (12) and (13), where the coefficient appearing there is the fourth cumulant of the source distribution (measuring kurtosis) and brackets denote averages over the source distribution. [sent-113, score-0.285]

60 Again the system can be described by a Fokker-Planck equation for large N, but in this case the timescale for learning is slower than in the asymmetrical case by a further factor that grows with N. [sent-114, score-0.452]

61 The system is locally equivalent to diffusion in the quartic potential of eqn. (14), with an associated diffusion coefficient. [sent-115, score-0.434]

62 We have assumed a sign condition on the non-linearity and the source statistics, which is a necessary condition for successful learning. [sent-116, score-0.068]

63 In the case of a cubic non-linearity this is also the condition for stability of the optimal fixed point, although in general these two conditions may not be equivalent (Rattray, 2002). [sent-117, score-0.165]

64 The shape of this potential is shown on the right of fig. 1. [sent-118, score-0.146]

65 Again a potential barrier must be overcome to escape a metastable state close to the initial conditions. [sent-119, score-1.019]
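
The exact potentials of eqns. (11) and (14) are lost in this extraction. As an illustrative stand-in only, generic cubic and quartic potentials with a local minimum at the origin and a finite barrier have the form below; the coefficients a, b, c > 0 are placeholders, not the paper's expressions.

```latex
U_{\mathrm{cubic}}(x)   = \tfrac{a}{2}x^{2} - \tfrac{b}{3}x^{3},
\qquad \Delta U = \frac{a^{3}}{6b^{2}} \quad\text{(barrier at } x = a/b\text{)},
\\[4pt]
U_{\mathrm{quartic}}(x) = \tfrac{a}{2}x^{2} - \tfrac{c}{4}x^{4},
\qquad \Delta U = \frac{a^{2}}{4c} \quad\text{(barriers at } x = \pm\sqrt{a/c}\text{)}.
```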

66 Escape times from a metastable state at the origin: for a large scaled learning rate the dynamics of the overlap corresponds to an Ornstein-Uhlenbeck process with a Gaussian stationary distribution of fixed unit variance. [sent-124, score-0.35]

67 Thus, the weight vector remains trapped near the initial conditions if one chooses the scaled learning rate too large initially. [sent-125, score-0.097]

68 As the scaled learning rate is reduced the dynamics remains localised close to the origin at first, but the potential barrier confining the dynamics is reduced. [sent-126, score-0.687]

69 The timescale for escape (mean first passage time) for a large barrier is mainly determined by the effective size of the barrier (see, for example, Gardiner, 1985). [sent-127, score-0.6]

70 Separate expressions are obtained for the even and odd non-linearity cases, in the unit of time natural to the diffusion process. [sent-128, score-0.297]

71 Eqn. (15) gives the escape time in terms of the potential barrier, the diffusion coefficient, and the unit of time of the diffusion process. [sent-131, score-1.072]

72 For the two cases considered above we obtain the corresponding escape times for the even and odd non-linearities; the constants of proportionality depend on the shape of the potential and not on N. [sent-132, score-0.73]
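
Eqn. (15) itself is garbled here, but the mean first passage time over a barrier has the standard Kramers/Arrhenius form (see Gardiner, 1985). The sketch below uses generic symbols, ΔU for the barrier height and D for the diffusion coefficient, and is not the paper's exact expression.

```latex
t_{\mathrm{escape}} \;\propto\; \exp\!\left(\frac{\Delta U}{D}\right),
```

With the illustrative cubic and quartic potentials sketched earlier, ΔU would be a^3/(6b^2) and a^2/(4c) respectively.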

73 As the learning rate parameter is reduced so the timescale for escape is also reduced. [sent-133, score-0.566]

74 Notice that the escape time is shortest when the relevant cumulants (skewness or kurtosis) are large, suggesting that deflationary ICA algorithms will tend to find these signals first. [sent-135, score-0.323]

75 Firstly, the initial learning rate must be below a maximum that decreases with N in order to avoid trapping close to the initial conditions. [sent-137, score-0.791]

76 Secondly, the number of iterations required to escape the initial transient grows rapidly with N, resulting in an extremely slow initial stage of learning for large N. [sent-138, score-0.769]

77 The most extreme case is for symmetric source distributions with non-zero kurtosis, for which the largest number of learning iterations is required. [sent-139, score-0.296]

78 In fig. 2 we show results of learning with an asymmetric source (top) and uniform source (bottom) for different scaled learning rates. [sent-141, score-0.581]

79 As the learning rate is increased (left to right) we observe that the dynamics becomes increasingly stochastic, with the potential barrier becoming increasingly significant (potential maxima are shown as dashed lines). [sent-142, score-0.726]

80 For the largest value of the learning rate the algorithm becomes trapped close to the initial conditions for the whole simulation time. [sent-143, score-0.424]

81 From the time axis we observe that the learning timescale is longer for the symmetric signal than for the asymmetrical signal, as predicted by our theory. [sent-144, score-0.617]

82 In the top row we show results for a binary, asymmetrical source with non-zero skewness. [sent-156, score-0.484]

83 In the bottom row we show results for a uniformly distributed source. [sent-157, score-0.197]

84 Each row shows learning with the same initial conditions and data but with different scaled learning rates, increasing from left to right. [sent-158, score-0.38]

85 4 Batch learning: The batch version of eqn. [sent-161, score-0.247]

86 (5) for sufficiently small learning rates can be written as a single update summed over the training set, eqn. (17), where the number of training examples enters. [sent-162, score-0.067]
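
A sketch of the batch counterpart of the on-line rule: for small learning rates the single-step update is the on-line increment accumulated (here averaged) over the finite training set, followed by renormalisation. The averaging convention and the helper names are assumptions for illustration, not the paper's eqn. (17).

```python
import numpy as np

def batch_hebbian_step(w, X, eta, phi, sign=+1.0):
    """One batch update: accumulate the Hebbian increment over all P examples.

    X : array of shape (P, N) holding the (sphered) training set.
    """
    Y = X @ w                                            # projections w.x for every example
    dw = sign * eta * (phi(Y)[:, None] * X).mean(axis=0)
    w = w + dw
    return w / np.linalg.norm(w)                         # renormalise |w| = 1
```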

87 Here we argue that such an update requires at least the same order of examples as in the on-line case, in order to be successful. [sent-163, score-0.149]

88 Less data will result in a low signal-to-noise ratio initially and the possibility of trapping in a sub-optimal fixed point close to the initial conditions. [sent-164, score-0.442]

89 For example, with an asymmetric signal and quadratic non-linearity a minimum number of examples growing with N is required initially, while for a symmetric signal and odd non-linearity an even larger number is required. [sent-166, score-0.699]

90 We have carried out simulations of batch learning which confirm that a relatively low percentage of runs in which the initial increment was incorrect result in successful learning, compared to typical performance. [sent-167, score-0.451]

91 The overlap is small initially, and we can therefore expand the right-hand side of eqn. (18) in orders of the initial overlap for large N. [sent-170, score-0.296]

92 The increment at the first iteration is a sum over randomly sampled terms, and the central limit theorem states that for a large training set the distribution from which the initial increment is sampled will be Gaussian, with mean and variance given to leading order by eqns. (19) and (20). [sent-171, score-0.245]

93 Eqns. (19) and (20) give the mean and variance of the initial increment; notice that one of the terms disappears in the case of an asymmetrical non-linearity, which is why we have left both terms in the expression. [sent-172, score-0.531]

94 The algorithm will be likely to fail when the fluctuation in the initial increment is of the same order as (or greater than) its mean. [sent-174, score-0.278]

95 Given the size of this standard deviation initially, we see that this is true unless sufficiently many examples are used: a threshold growing with N in the case of an even non-linearity and asymmetric signal, and a larger threshold in the case of an odd non-linearity and a signal with non-zero kurtosis. [sent-175, score-0.519]
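
The argument above can be checked numerically: sample the first-iteration increment of the overlap many times for a fixed initialisation and compare its mean with its standard deviation; when the standard deviation dominates, the sign of the first step is essentially random and the batch run is prone to trapping. The sketch below is a rough illustration using the assumed helpers from the earlier sketches (make_example, batch_hebbian_step), not the paper's eqns. (19)-(20).

```python
import numpy as np

def initial_increment_snr(A, P, eta, phi, rng, n_trials=200):
    """Monte Carlo estimate of the mean and standard deviation of the first
    batch increment of the overlap R = w.A, for one fixed random initialisation."""
    N = len(A)
    w = rng.standard_normal(N)
    w /= np.linalg.norm(w)                    # fixed initial weight vector
    increments = []
    for _ in range(n_trials):
        X = np.stack([make_example(A, rng) for _ in range(P)])  # fresh training set of P examples
        w_new = batch_hebbian_step(w, X, eta, phi)
        increments.append(w_new @ A - w @ A)  # change in the overlap after one step
    increments = np.asarray(increments)
    return increments.mean(), increments.std()
```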

96 We expect these results to be necessary but not necessarily sufficient for successful learning, since we have only shown that this order of examples is the minimum required to avoid a low signal-to-noise ratio in the first learning iteration. [sent-176, score-0.349]

97 A complete treatment of the batch learning problem would require much more sophisticated formulations such as the mean-field theory of Wong et al. [sent-177, score-0.247]

98 5 Conclusions and future work: In both the batch and on-line Hebbian ICA algorithm we find that a surprisingly large number of examples is required to avoid a sub-optimal fixed point close to the initial conditions. [sent-179, score-0.629]

99 We expect similar scaling laws to apply in the case of small numbers of non-Gaussian sources. [sent-180, score-0.089]

100 Analysis of the square demixing problem appears to be much more challenging, as in this case there may be no simple macroscopic description of the system for large N. [sent-181, score-0.122]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('odd', 0.297), ('escape', 0.292), ('ica', 0.212), ('asymmetrical', 0.201), ('diffusion', 0.184), ('batch', 0.18), ('hyv', 0.172), ('barrier', 0.17), ('init', 0.165), ('metastable', 0.165), ('source', 0.165), ('dynamics', 0.154), ('rinen', 0.143), ('timescale', 0.138), ('rattray', 0.132), ('trapping', 0.132), ('initial', 0.119), ('signal', 0.116), ('potential', 0.115), ('hebbian', 0.112), ('transient', 0.109), ('kurtosis', 0.105), ('biehl', 0.099), ('gardiner', 0.099), ('symmetrical', 0.099), ('utt', 0.099), ('utvv', 0.099), ('wtt', 0.099), ('initially', 0.097), ('saad', 0.097), ('close', 0.094), ('skewness', 0.086), ('normalisation', 0.086), ('nd', 0.076), ('macroscopic', 0.073), ('cardoso', 0.073), ('oja', 0.073), ('rate', 0.069), ('increment', 0.069), ('cubic', 0.069), ('successful', 0.068), ('learning', 0.067), ('ationary', 0.066), ('laheld', 0.066), ('quartic', 0.066), ('var', 0.065), ('asymmetric', 0.064), ('symmetric', 0.064), ('required', 0.063), ('skewed', 0.063), ('magnus', 0.057), ('scalings', 0.057), ('manchester', 0.057), ('examples', 0.057), ('stability', 0.054), ('scaled', 0.053), ('cumulant', 0.049), ('laws', 0.049), ('demixing', 0.049), ('wong', 0.049), ('avoid', 0.048), ('mixing', 0.047), ('order', 0.046), ('nonlinearity', 0.042), ('conditions', 0.042), ('asymptotic', 0.041), ('xed', 0.04), ('modelling', 0.04), ('maxima', 0.04), ('increasingly', 0.04), ('scaling', 0.04), ('uncorrelated', 0.039), ('remarkably', 0.039), ('bell', 0.039), ('sources', 0.037), ('brackets', 0.037), ('iteration', 0.036), ('coef', 0.036), ('surprisingly', 0.035), ('leading', 0.034), ('fail', 0.034), ('averages', 0.034), ('expand', 0.034), ('projection', 0.033), ('overcome', 0.033), ('years', 0.033), ('algorithm', 0.033), ('con', 0.032), ('row', 0.032), ('ning', 0.031), ('amari', 0.031), ('entries', 0.031), ('observe', 0.031), ('signals', 0.031), ('shape', 0.031), ('state', 0.031), ('detect', 0.031), ('appropriate', 0.03), ('decay', 0.03), ('overlap', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA

Author: Magnus Rattray, Gleb Basalyga

Abstract: We study the dynamics of a Hebbian ICA algorithm extracting a single non-Gaussian component from a high-dimensional Gaussian background. For both on-line and batch learning we find that a surprisingly large number of examples are required to avoid trapping in a sub-optimal state close to the initial conditions. To extract a skewed signal the number of examples required grows rapidly with the data dimension, and an even larger number is required to extract a symmetrical signal with non-zero kurtosis.

2 0.18709452 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

Author: Roland Vollgraf, Klaus Obermayer

Abstract: We present a new method for the blind separation of sources, which do not fulfill the independence assumption. In contrast to standard methods we consider groups of neighboring samples (

3 0.10203461 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

Author: Stefan Harmeling, Andreas Ziehe, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: In kernel based learning the data is mapped to a kernel feature space of a dimension that corresponds to the number of training data points. In practice, however, the data forms a smaller submanifold in feature space, a fact that has been used e.g. by reduced set techniques for SVMs. We propose a new mathematical construction that permits to adapt to the intrinsic dimension and to find an orthonormal basis of this submanifold. In doing so, computations get much simpler and more important our theoretical framework allows to derive elegant kernelized blind source separation (BSS) algorithms for arbitrary invertible nonlinear mixings. Experiments demonstrate the good performance and high computational efficiency of our kTDSEP algorithm for the problem of nonlinear BSS.

4 0.094435781 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

Author: Michael Zibulevsky, Pavel Kisilev, Yehoshua Y. Zeevi, Barak A. Pearlmutter

Abstract: We consider a problem of blind source separation from a set of instantaneous linear mixtures, where the mixing matrix is unknown. It was discovered recently, that exploiting the sparsity of sources in an appropriate representation according to some signal dictionary, dramatically improves the quality of separation. In this work we use the property of multi scale transforms, such as wavelet or wavelet packets, to decompose signals into sets of local features with various degrees of sparsity. We use this intrinsic property for selecting the best (most sparse) subsets of features for further separation. The performance of the algorithm is verified on noise-free and noisy data. Experiments with simulated signals, musical sounds and images demonstrate significant improvement of separation quality over previously reported results. 1

5 0.09316913 71 nips-2001-Estimating the Reliability of ICA Projections

Author: Frank C. Meinecke, Andreas Ziehe, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: When applying unsupervised learning techniques like ICA or temporal decorrelation, a key question is whether the discovered projections are reliable. In other words: can we give error bars or can we assess the quality of our separation? We use resampling methods to tackle these questions and show experimentally that our proposed variance estimations are strongly correlated to the separation error. We demonstrate that this reliability estimation can be used to choose the appropriate ICA-model, to enhance significantly the separation performance, and, most important, to mark the components that have a actual physical meaning. Application to 49-channel-data from an magneto encephalography (MEG) experiment underlines the usefulness of our approach. 1

6 0.085315257 61 nips-2001-Distribution of Mutual Information

7 0.081277572 139 nips-2001-Online Learning with Kernels

8 0.074660443 72 nips-2001-Exact differential equation population dynamics for integrate-and-fire neurons

9 0.073439933 25 nips-2001-Active Learning in the Drug Discovery Process

10 0.064801656 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections

11 0.06213643 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

12 0.06005495 164 nips-2001-Sampling Techniques for Kernel Methods

13 0.05935486 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

14 0.059218977 21 nips-2001-A Variational Approach to Learning Curves

15 0.059048928 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

16 0.05678314 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network

17 0.056027766 43 nips-2001-Bayesian time series classification

18 0.055685811 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

19 0.054801118 74 nips-2001-Face Recognition Using Kernel Methods

20 0.0539722 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.185), (1, -0.047), (2, -0.017), (3, -0.032), (4, 0.045), (5, -0.015), (6, 0.039), (7, 0.04), (8, 0.121), (9, -0.219), (10, -0.121), (11, -0.1), (12, -0.001), (13, -0.058), (14, 0.066), (15, 0.017), (16, 0.023), (17, -0.009), (18, 0.044), (19, -0.088), (20, 0.003), (21, 0.025), (22, -0.063), (23, 0.01), (24, -0.005), (25, 0.072), (26, 0.007), (27, -0.098), (28, 0.037), (29, 0.059), (30, -0.066), (31, 0.018), (32, 0.061), (33, 0.069), (34, -0.012), (35, -0.107), (36, 0.055), (37, 0.044), (38, 0.006), (39, 0.125), (40, -0.078), (41, 0.087), (42, 0.019), (43, 0.086), (44, 0.187), (45, 0.026), (46, 0.152), (47, -0.156), (48, -0.055), (49, -0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94145429 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA

Author: Magnus Rattray, Gleb Basalyga

Abstract: We study the dynamics of a Hebbian ICA algorithm extracting a single non-Gaussian component from a high-dimensional Gaussian background. For both on-line and batch learning we find that a surprisingly large number of examples are required to avoid trapping in a sub-optimal state close to the initial conditions. To extract a skewed signal the number of examples required grows rapidly with the data dimension, and an even larger number is required to extract a symmetrical signal with non-zero kurtosis.

2 0.61015218 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

Author: Roland Vollgraf, Klaus Obermayer

Abstract: We present a new method for the blind separation of sources, which do not fulfill the independence assumption. In contrast to standard methods we consider groups of neighboring samples (

3 0.59145916 71 nips-2001-Estimating the Reliability of ICA Projections

Author: Frank C. Meinecke, Andreas Ziehe, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: When applying unsupervised learning techniques like ICA or temporal decorrelation, a key question is whether the discovered projections are reliable. In other words: can we give error bars or can we assess the quality of our separation? We use resampling methods to tackle these questions and show experimentally that our proposed variance estimations are strongly correlated to the separation error. We demonstrate that this reliability estimation can be used to choose the appropriate ICA-model, to enhance significantly the separation performance, and, most important, to mark the components that have a actual physical meaning. Application to 49-channel-data from an magneto encephalography (MEG) experiment underlines the usefulness of our approach. 1

4 0.50277501 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

Author: Michael Zibulevsky, Pavel Kisilev, Yehoshua Y. Zeevi, Barak A. Pearlmutter

Abstract: We consider a problem of blind source separation from a set of instantaneous linear mixtures, where the mixing matrix is unknown. It was discovered recently, that exploiting the sparsity of sources in an appropriate representation according to some signal dictionary, dramatically improves the quality of separation. In this work we use the property of multi scale transforms, such as wavelet or wavelet packets, to decompose signals into sets of local features with various degrees of sparsity. We use this intrinsic property for selecting the best (most sparse) subsets of features for further separation. The performance of the algorithm is verified on noise-free and noisy data. Experiments with simulated signals, musical sounds and images demonstrate significant improvement of separation quality over previously reported results. 1

5 0.49023819 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

Author: Stefan Harmeling, Andreas Ziehe, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: In kernel based learning the data is mapped to a kernel feature space of a dimension that corresponds to the number of training data points. In practice, however, the data forms a smaller submanifold in feature space, a fact that has been used e.g. by reduced set techniques for SVMs. We propose a new mathematical construction that permits to adapt to the intrinsic dimension and to find an orthonormal basis of this submanifold. In doing so, computations get much simpler and more important our theoretical framework allows to derive elegant kernelized blind source separation (BSS) algorithms for arbitrary invertible nonlinear mixings. Experiments demonstrate the good performance and high computational efficiency of our kTDSEP algorithm for the problem of nonlinear BSS.

6 0.44699222 185 nips-2001-The Method of Quantum Clustering

7 0.4437803 25 nips-2001-Active Learning in the Drug Discovery Process

8 0.42050833 61 nips-2001-Distribution of Mutual Information

9 0.40173197 166 nips-2001-Self-regulation Mechanism of Temporally Asymmetric Hebbian Plasticity

10 0.37976155 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

11 0.37308973 68 nips-2001-Entropy and Inference, Revisited

12 0.35831422 57 nips-2001-Correlation Codes in Neuronal Populations

13 0.34898618 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections

14 0.34839237 139 nips-2001-Online Learning with Kernels

15 0.34468508 76 nips-2001-Fast Parameter Estimation Using Green's Functions

16 0.33307964 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures

17 0.33282271 21 nips-2001-A Variational Approach to Learning Curves

18 0.32391152 23 nips-2001-A theory of neural integration in the head-direction system

19 0.3182697 114 nips-2001-Learning from Infinite Data in Finite Time

20 0.31806323 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.032), (19, 0.013), (27, 0.649), (30, 0.04), (38, 0.01), (59, 0.021), (72, 0.053), (79, 0.021), (91, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99546552 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation

Author: Anand Rangarajan, Alan L. Yuille

Abstract: Bayesian belief propagation in graphical models has been recently shown to have very close ties to inference methods based in statistical physics. After Yedidia et al. demonstrated that belief propagation fixed points correspond to extrema of the so-called Bethe free energy, Yuille derived a double loop algorithm that is guaranteed to converge to a local minimum of the Bethe free energy. Yuille’s algorithm is based on a certain decomposition of the Bethe free energy and he mentions that other decompositions are possible and may even be fruitful. In the present work, we begin with the Bethe free energy and show that it has a principled interpretation as pairwise mutual information minimization and marginal entropy maximization (MIME). Next, we construct a family of free energy functions from a spectrum of decompositions of the original Bethe free energy. For each free energy in this family, we develop a new algorithm that is guaranteed to converge to a local minimum. Preliminary computer simulations are in agreement with this theoretical development. 1

same-paper 2 0.98767138 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA

Author: Magnus Rattray, Gleb Basalyga

Abstract: We study the dynamics of a Hebbian ICA algorithm extracting a single non-Gaussian component from a high-dimensional Gaussian background. For both on-line and batch learning we find that a surprisingly large number of examples are required to avoid trapping in a sub-optimal state close to the initial conditions. To extract a skewed signal the number of examples required grows rapidly with the data dimension, and an even larger number is required to extract a symmetrical signal with non-zero kurtosis.

3 0.98663056 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

4 0.9833613 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

Author: Andrew Y. Ng, Michael I. Jordan

Abstract: We compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widelyheld belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better. This stems from the observation- which is borne out in repeated experiments- that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster. 1

5 0.98182029 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

Author: Mikhail Belkin, Partha Niyogi

Abstract: Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering. Several applications are considered. In many areas of artificial intelligence, information retrieval and data mining, one is often confronted with intrinsically low dimensional data lying in a very high dimensional space. For example, gray scale n x n images of a fixed object taken with a moving camera yield data points in R^(n^2). However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera - in fact the space has the natural structure of a manifold embedded in R^(n^2). While there is a large body of work on dimensionality reduction in general, most existing approaches do not explicitly take into account the structure of the manifold on which the data may possibly reside. Recently, there has been some interest (Tenenbaum et al, 2000; Roweis and Saul, 2000) in the problem of developing low dimensional representations of data in this particular context. In this paper, we present a new algorithm and an accompanying framework of analysis for geometrically motivated dimensionality reduction. The core algorithm is very simple, has a few local computations and one sparse eigenvalue problem. The solution reflects the intrinsic geometric structure of the manifold. The justification comes from the role of the Laplacian operator in providing an optimal embedding. The Laplacian of the graph obtained from the data points may be viewed as an approximation to the Laplace-Beltrami operator defined on the manifold. The embedding maps for the data come from approximations to a natural map that is defined on the entire manifold. The framework of analysis presented here makes this connection explicit. While this connection is known to geometers and specialists in spectral graph theory (for example, see [1, 2]) to the best of our knowledge we do not know of any application to data representation yet. The connection of the Laplacian to the heat kernel enables us to choose the weights of the graph in a principled manner. The locality preserving character of the Laplacian Eigenmap algorithm makes it relatively insensitive to outliers and noise. A byproduct of this is that the algorithm implicitly emphasizes the natural clusters in the data. Connections to spectral clustering algorithms developed in learning and computer vision (see Shi and Malik, 1997) become very clear. Following the discussion of Roweis and Saul (2000), and Tenenbaum et al (2000), we note that the biological perceptual apparatus is confronted with high dimensional stimuli from which it must recover low dimensional structure. One might argue that if the approach to recovering such low-dimensional structure is inherently local, then a natural clustering will emerge and thus might serve as the basis for the development of categories in biological perception. 1 The Algorithm: Given k points X1, ..., Xk in R^l, we construct a weighted graph with k nodes, one for each point, and the set of edges connecting neighboring points to each other. Step 1.
[Constructing the Graph] We put an edge between nodes i and j if Xi and Xj are

6 0.98021936 23 nips-2001-A theory of neural integration in the head-direction system

7 0.96875095 47 nips-2001-Causal Categorization with Bayes Nets

8 0.88620162 98 nips-2001-Information Geometrical Framework for Analyzing Belief Propagation Decoder

9 0.85691231 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

10 0.82394207 137 nips-2001-On the Convergence of Leveraging

11 0.77024758 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

12 0.76392645 97 nips-2001-Information-Geometrical Significance of Sparsity in Gallager Codes

13 0.76196045 8 nips-2001-A General Greedy Approximation Algorithm with Applications

14 0.75994182 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

15 0.75959831 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

16 0.74897212 114 nips-2001-Learning from Infinite Data in Finite Time

17 0.74799055 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

18 0.74436712 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules

19 0.74104238 154 nips-2001-Products of Gaussians

20 0.73914403 31 nips-2001-Algorithmic Luckiness