
24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

Source: pdf

Author: Oren Shriki, Haim Sompolinsky, Daniel D. Lee

Abstract: The principle of maximizing mutual information is applied to learning overcomplete and recurrent representations. The underlying model consists of a network of input units driving a larger number of output units with recurrent interactions. In the limit of zero noise, the network is deterministic and the mutual information can be related to the entropy of the output units. Maximizing this entropy with respect to both the feedforward connections as well as the recurrent interactions results in simple learning rules for both sets of parameters. The conventional independent components (ICA) learning algorithm can be recovered as a special case where there is an equal number of output units and no recurrent connections. The application of these new learning rules is illustrated on a simple two-dimensional input example.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Lee Bell Laboratories Lucent Technologies Murray Hill, NJ 07974 Abstract The principle of maximizing mutual information is applied to learning overcomplete and recurrent representations. [sent-2, score-1.107]

2 The underlying model consists of a network of input units driving a larger number of output units with recurrent interactions. [sent-3, score-0.878]

3 In the limit of zero noise, the network is deterministic and the mutual information can be related to the entropy of the output units. [sent-4, score-0.473]

4 Maximizing this entropy with respect to both the feedforward connections as well as the recurrent interactions results in simple learning rules for both sets of parameters. [sent-5, score-1.137]

5 The conventional independent components (ICA) learning algorithm can be recovered as a special case where there is an equal number of output units and no recurrent connections. [sent-6, score-0.741]

6 The application of these new learning rules is illustrated on a simple two-dimensional input example. [sent-7, score-0.302]

7 1 Introduction Many unsupervised learning algorithms such as principal component analysis, vector quantization, self-organizing feature maps, and others use the principle of minimizing reconstruction error to learn appropriate features from multivariate data [1, 2]. [sent-8, score-0.188]

8 Independent components analysis (ICA) can similarly be understood as maximizing the likelihood of the data under a non-Gaussian generative model, and thus is related to minimizing a reconstruction cost [3, 4, 5]. [sent-9, score-0.301]

9 On the other hand, the same ICA algorithm can also be derived without regard to a particular generative model by maximizing the mutual information between the data and a nonlinearly transformed version of the data [6]. [sent-10, score-0.354]

10 This principle of information maximization has also been previously applied to explain optimal properties for single units, linear networks, and symplectic transformations [7, 8, 9]. [sent-11, score-0.188]

11 In these proceedings, we show how the principle of maximizing mutual information can be generalized to overcomplete as well as recurrent representations. [sent-12, score-1.06]

12 In the limit of zero noise, we derive gradient descent learning rules for both the feedforward and recurrent weights. [sent-13, score-1.062]

13 Finally, we show the application of these learning rules to some simple illustrative examples. [sent-14, score-0.204]

14 Figure 1: Network diagram of an overcomplete, recurrent representation with N input variables and M output variables. [sent-15, score-0.686]

15 x are input data which influence the output signals s through feedforward connections W. [sent-16, score-0.546]

16 The signals s also interact with each other through the recurrent interactions K. [sent-17, score-0.712]

17 2 Information Maximization The "Infomax" formulation of ICA considers the problem of maximizing the mutual information between N-dimensional data observations {x} which are input to a network resulting in N-dimensional output signals {s} [6]. [sent-18, score-0.605]

18 Here, we consider the general problem where the signals s are M-dimensional with M ≥ N. [sent-19, score-0.068]

19 Thus, the representation is overcomplete because there are more signal components than data components. [sent-20, score-0.244]

20 We also consider the situation where a signal component $s_i$ can influence another component $s_j$ through a recurrent interaction $K_{ji}$. [sent-21, score-0.576]

21 1 with the feedforward connections described by the M × N matrix W and the recurrent connections by the M × M matrix K. [sent-23, score-0.898]

22 The network response s is a deterministic function of the input x: $s_i = g\left( \sum_j W_{ij} x_j + \sum_k K_{ik} s_k \right)$, (1) where g is some nonlinear squashing function. [sent-24, score-0.192]
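Because the response appears on both sides of s = g(Wx + Ks) whenever K is nonzero, it has to be solved for rather than evaluated directly. The following is a minimal sketch, assuming g = tanh (the nonlinearity used in the examples section) and a plain fixed-point iteration, which is not necessarily how the authors computed s. The function name `network_response`, the matrices and the input vector are hypothetical illustration values.

```python
import numpy as np

def network_response(W, K, x, n_iter=200, tol=1e-12):
    """Solve s = tanh(W x + K s) by plain fixed-point iteration (assumed solver)."""
    s = np.tanh(W @ x)  # starting guess: the response without recurrence
    for _ in range(n_iter):
        s_new = np.tanh(W @ x + K @ s)
        if np.max(np.abs(s_new - s)) < tol:
            return s_new
        s = s_new
    return s

# Hypothetical example: N = 2 inputs driving M = 3 outputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
K = 0.1 * (np.ones((3, 3)) - np.eye(3))  # weak uniform coupling, no self-coupling
x = np.array([0.3, -0.5])

s = network_response(W, K, x)
# How well the returned s satisfies the implicit equation.
residual = np.max(np.abs(s - np.tanh(W @ x + K @ s)))
```

The iteration is only guaranteed to converge when the recurrent coupling is weak enough for the map to be a contraction; stronger couplings would need a damped or Newton-type solver.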

23 In this case, the mutual information between the inputs x and outputs s is functionally only dependent on the entropy of the outputs: $I(s, x) = H(s) - H(s|x) \sim H(s)$. (2) [sent-25, score-0.316]

24 The distribution of s is an N-dimensional manifold embedded in an M-dimensional vector space and nominally has a negatively divergent entropy. [sent-26, score-0.051]

25 However, as shown in Appendix 1, the probability density of s can be related to the input distribution via the relation: $P(s) \propto P(x) / \sqrt{\det(X^T X)}$ (3) [sent-27, score-0.098]

26 where the susceptibility (or Jacobian) matrix X is defined as: $X_{ij} = \partial s_i / \partial x_j$. (4) [sent-28, score-0.155]

27 This result can be understood in terms of the singular value decomposition (SVD) of the matrix X. [sent-29, score-0.081]

28 The transformation performed by X can be decomposed into a series of three transformations: an orthogonal transformation that rotates the axes, a diagonal transformation that scales each axis, followed by another orthogonal transformation. [sent-30, score-0.217]

29 A volume element in the input space is mapped onto a volume element in the output space, and its volume change is described by the diagonal scaling operation. [sent-31, score-0.406]

30 Thus, the relationship between the probability distribution in the input and output spaces includes the proportionality factor, $\sqrt{\det(X^T X)}$, as formally derived in Appendix 1. [sent-33, score-0.204]
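The SVD argument can be checked numerically: the volume-change factor sqrt(det(X^T X)) should equal the product of the singular values of X, and for a square matrix it should reduce to |det(X)|. The sketch below uses random stand-in matrices, not susceptibilities from a trained network, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in M x N susceptibility matrix with M = 3, N = 2 (random, for illustration).
X = rng.normal(size=(3, 2))

# The diagonal part of the SVD carries the volume change.
singular_values = np.linalg.svd(X, compute_uv=False)
volume_from_svd = np.prod(singular_values)
volume_from_gram = np.sqrt(np.linalg.det(X.T @ X))

# Square case (M = N): the same factor is the absolute Jacobian determinant.
A = rng.normal(size=(2, 2))
square_from_gram = np.sqrt(np.linalg.det(A.T @ A))
square_from_det = abs(np.linalg.det(A))
```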

31 We now get the following expression for the entropy of the outputs: $H(s) \sim -\int dx\, P(x) \log\left( \frac{P(x)}{\sqrt{\det(X^T X)}} \right) = \frac{1}{2} \langle \log\det(X^T X) \rangle + H(x)$, (5) where the brackets indicate averaging over the input distribution. [sent-34, score-0.318]

32 3 Learning rules From Eq. [sent-35, score-0.157]

33 (5), we see that minimizing the following cost function: $E = -\frac{1}{2} \mathrm{Tr} \langle \log(X^T X) \rangle$, (6) is equivalent to maximizing the mutual information. [sent-36, score-0.332]

34 We first note that the susceptibility X satisfies the following recursion relation: $X_{ij} = g'_i \left( W_{ij} + \sum_k K_{ik} X_{kj} \right)$, (7) where $G_{ij} = \delta_{ij} g'_i$. [sent-37, score-0.223]

35 $\Phi_{ij}$, where $\Phi^{-1} \equiv G^{-1} - K$, can be interpreted as the sensitivity in the recurrent network of the ith unit's output to changes in the total input of the jth unit. [sent-41, score-0.073]
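Differentiating s = g(Wx + Ks) with respect to x gives X = GW + GKX, which rearranges to X = (G^{-1} - K)^{-1} W with G = diag(g'). The sketch below compares that closed form against finite differences of the network response. It assumes g = tanh and a fixed-point solver; the helper names `solve_response` and `susceptibility` and all numerical values are hypothetical.

```python
import numpy as np

def solve_response(W, K, x, n_iter=500):
    """Fixed-point solution of s = tanh(W x + K s) (assumed solver, weak K only)."""
    s = np.tanh(W @ x)
    for _ in range(n_iter):
        s = np.tanh(W @ x + K @ s)
    return s

def susceptibility(W, K, x):
    """Closed form X = Phi W with Phi = (G^{-1} - K)^{-1} and G = diag(g')."""
    s = solve_response(W, K, x)
    g_prime = 1.0 - s**2                      # derivative of tanh at the fixed point
    Phi = np.linalg.inv(np.diag(1.0 / g_prime) - K)
    return Phi @ W, Phi

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 2))
K = 0.1 * (np.ones((3, 3)) - np.eye(3))
x = np.array([0.2, 0.4])

X_closed, Phi = susceptibility(W, K, x)

# Central finite differences of s with respect to each input component.
eps = 1e-6
X_numeric = np.zeros_like(X_closed)
for j in range(x.size):
    dx = np.zeros_like(x)
    dx[j] = eps
    X_numeric[:, j] = (solve_response(W, K, x + dx)
                       - solve_response(W, K, x - dx)) / (2 * eps)
```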

36 We next derive the learning rules for the network parameters using gradient descent, as shown in detail in Appendix 2. [sent-42, score-0.358]

37 The resulting expression for the learning rule for the feedforward weights is: $\Delta W = -\eta \frac{\partial E}{\partial W} = \eta \langle \Gamma^T + \Phi^T \gamma x^T \rangle$ (9) where $\eta$ is the learning rate, the matrix $\Gamma$ is defined as $\Gamma = (X^T X)^{-1} X^T \Phi$ (10) [sent-43, score-0.523]

38 and the vector $\gamma$ is given by $\gamma_i = (X \Gamma)_{ii} \, g''_i / (g'_i)^3$. (11) Multiplying the gradient in Eq. [sent-44, score-0.058]

39 (9) by the matrix $(W W^T)$ yields an expression analogous to the "natural" gradient learning rule [10]: $\Delta W = \eta W \left( I + \langle X^T \gamma x^T \rangle \right)$. (12) [sent-45, score-0.319]

40 Similarly, the learning rule for the recurrent interactions is $\Delta K = -\eta \frac{\partial E}{\partial K} = \eta \langle (X \Gamma)^T + \Phi^T \gamma s^T \rangle$. (13) [sent-46, score-0.775]

41 In the case when there are equal numbers of input and output units, M = N, and there are no recurrent interactions, K = 0, most of the previous expressions simplify. [sent-47, score-0.686]

42 The susceptibility matrix X is diagonal, $\Phi = G$, and $\Gamma = W^{-1}$. [sent-48, score-0.155]

43 (9) for the learning rule for W results in the update rule: $\Delta W = \eta \left[ (W^T)^{-1} + \langle z x^T \rangle \right]$, (14) where $z_i = g''_i / g'_i$. [sent-50, score-0.131]

44 Thus, the well-known Infomax ICA learning rule is recovered as a special case of Eq. [sent-51, score-0.163]
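The feedforward rule can be sanity-checked in the simplest setting, K = 0, where Φ = G and X = GW. The sketch below is a check under stated assumptions rather than the authors' implementation: it takes g = tanh, implements the averaged update Γ^T + Φ^T γ x^T with Γ = (X^T X)^{-1} X^T Φ and γ_i = (XΓ)_ii g''_i / (g'_i)^3, compares it with a numerical gradient of E = -½⟨log det(X^T X)⟩, and confirms that it coincides with (W^T)^{-1} + ⟨z x^T⟩ for square W. All function names and data are hypothetical.

```python
import numpy as np

def cost(W, xs):
    """E = -1/2 <log det(X^T X)> for K = 0, so that X = G W with g = tanh."""
    total = 0.0
    for x in xs:
        s = np.tanh(W @ x)
        X = (1.0 - s**2)[:, None] * W                 # G W, with G = diag(g')
        total += -0.5 * np.linalg.slogdet(X.T @ X)[1]
    return total / len(xs)

def feedforward_update(W, xs):
    """Sample average of Gamma^T + Phi^T gamma x^T, with Phi = G because K = 0."""
    update = np.zeros_like(W)
    for x in xs:
        s = np.tanh(W @ x)
        g1 = 1.0 - s**2                               # g'
        g2 = -2.0 * s * g1                            # g''
        X = g1[:, None] * W
        Gamma = np.linalg.solve(X.T @ X, X.T) * g1[None, :]   # (X^T X)^{-1} X^T G
        gamma = np.diag(X @ Gamma) * g2 / g1**3
        update += Gamma.T + np.outer(g1 * gamma, x)   # G gamma x^T
    return update / len(xs)

def infomax_update(W, xs):
    """Square-case rule (W^T)^{-1} + <z x^T>, with z = g''/g' = -2 s for tanh."""
    S = np.tanh(xs @ W.T)
    return np.linalg.inv(W.T) + (-2.0 * S).T @ xs / len(xs)

def numeric_descent_direction(W, xs, eps=1e-6):
    """Minus the central-difference gradient of the cost."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        dW = np.zeros_like(W)
        dW[idx] = eps
        grad[idx] = (cost(W + dW, xs) - cost(W - dW, xs)) / (2 * eps)
    return -grad

rng = np.random.default_rng(3)
xs = rng.uniform(-1.0, 1.0, size=(20, 2))             # hypothetical input samples
W_over = 0.5 * rng.normal(size=(3, 2))                # overcomplete: M = 3, N = 2
W_square = np.eye(2) + 0.3 * rng.normal(size=(2, 2))  # square: M = N = 2

update_over = feedforward_update(W_over, xs)
numeric_over = numeric_descent_direction(W_over, xs)
update_square = feedforward_update(W_square, xs)
infomax_square = infomax_update(W_square, xs)
```

The learning rate η is left out, so both functions return the direction of the update only.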

45 Figure 2, panels (a), (b) and (c): Results of fitting 3 filters to a 2-dimensional hexagon distribution with 10000 sample points. [sent-53, score-0.132]

46 4 Examples We now apply the preceding learning algorithms to a simple two-dimensional (N = 2) input example. [sent-54, score-0.145]

47 Each input point is generated by a linear combination of three (two-dimensional) unit vectors with angles of 0°, 120° and 240°. The coefficients are taken from a uniform distribution on the unit interval. [sent-55, score-0.164]

48 The resulting distribution has the shape of a unit hexagon, which is slightly more dense close to the origin than at the boundaries. [sent-56, score-0.033]
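The input distribution described above is straightforward to reproduce. The following sketch draws samples as uniform-coefficient combinations of three unit vectors at 0°, 120° and 240°; the function name `sample_hexagon` and the random seed are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_hexagon(n_samples, rng):
    """Draw points x = sum_k c_k u_k with c_k uniform on [0, 1]."""
    angles = np.deg2rad([0.0, 120.0, 240.0])
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)
    coefficients = rng.uniform(0.0, 1.0, size=(n_samples, 3))
    return coefficients @ directions

rng = np.random.default_rng(0)
samples = sample_hexagon(10000, rng)
radii = np.linalg.norm(samples, axis=1)   # distance of each sample from the origin
```

Since the three unit vectors sum to zero, the samples are centered on the origin, and every point lies within unit distance of it.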

49 Samples of the input distribution are shown in Fig. [sent-57, score-0.098]

50 We fix the sigmoidal nonlinearity to be g(x) = tanh(x). [sent-60, score-0.041]

51 1 Feedforward weights A set of M = 3 overcomplete filters for W are learned by applying the update rule in Eq. [sent-62, score-0.48]

52 (9) to random normalized initial conditions while keeping the recurrent interactions fixed at K = 0. [sent-63, score-0.644]

53 The lengths of the rows of W were constrained to be identical so that the filters are projections along certain directions in the two-dimensional space. [sent-64, score-0.132]
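The text does not say how the equal-length constraint was enforced. One simple possibility, shown here purely as an assumption, is to rescale every row of W to a common length after each gradient step; `renormalize_rows` and the target length are hypothetical.

```python
import numpy as np

def renormalize_rows(W, length=1.0):
    """Rescale each row of W to the same Euclidean length (assumed projection step)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return length * W / norms

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 2))               # hypothetical filters after an update
W_constrained = renormalize_rows(W, length=2.0)
row_lengths = np.linalg.norm(W_constrained, axis=1)
```

Rescaling changes only the length of each filter, so the learned directions are left untouched.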

54 Examples of the resulting learned filters are shown by plotting the rows of W as vectors in Fig. [sent-66, score-0.225]

55 If the lengths of the rows of W are left unconstrained, slight deviations from these solutions occur, but relative orientation differences of 60° or 120° between the various filters are preserved. [sent-69, score-0.132]

56 2 Recurrent interactions To investigate the effect of recurrent interactions on the representation, we fixed the feedforward weights in W to point in the directions shown in Fig. [sent-71, score-1.021]

57 2(a), and learned the optimal recurrent interactions K using Eq. [sent-72, score-0.676]

58 Depending upon the length of the rows of W which scaled the input patterns, different optimal values are seen for the recurrent connections. [sent-74, score-0.69]

59 3 by plotting the value of the cost function against the strength of the uniform recurrent interaction. [sent-76, score-0.641]

60 For small scaled inputs, the optimal recurrent strength is negative, which effectively amplifies the output signals since the 3 signals are negatively correlated. [sent-77, score-0.907]

61 With large scaled inputs, the optimal recurrent strength is positive, which tends to decrease the outputs. [sent-78, score-0.584]

62 Thus, in this example, optimizing the recurrent connections performs gain control on the inputs. [sent-79, score-0.558]
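A scan of the cost over a uniform recurrent strength can be sketched as follows. Several ingredients are assumptions rather than details from the source: the uniform interaction is taken to be K = k(11^T - I) with no self-coupling, the filters are placed at 0°, 120° and 240°, the response is found by fixed-point iteration, and the range of k is kept small so that the iteration converges. No claim is made here about where the minimum falls.

```python
import numpy as np

def recurrent_cost(W, k, xs, n_iter=100):
    """E = -1/2 <log det(X^T X)> for K = k * (ones - identity) and g = tanh."""
    M = W.shape[0]
    K = k * (np.ones((M, M)) - np.eye(M))     # assumed form of the uniform coupling
    total = 0.0
    for x in xs:
        s = np.tanh(W @ x)
        for _ in range(n_iter):               # fixed-point iteration for s
            s = np.tanh(W @ x + K @ s)
        Phi = np.linalg.inv(np.diag(1.0 / (1.0 - s**2)) - K)
        X = Phi @ W                           # susceptibility X = Phi W
        total += -0.5 * np.linalg.slogdet(X.T @ X)[1]
    return total / len(xs)

def feedforward_cost(W, xs):
    """Reference value without recurrence, where X = G W."""
    total = 0.0
    for x in xs:
        X = (1.0 - np.tanh(W @ x)**2)[:, None] * W
        total += -0.5 * np.linalg.slogdet(X.T @ X)[1]
    return total / len(xs)

rng = np.random.default_rng(0)
angles = np.deg2rad([0.0, 120.0, 240.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
xs = rng.uniform(0.0, 1.0, size=(100, 3)) @ directions   # hexagon samples

row_length = 1.0                              # hypothetical input scaling
W = row_length * directions                   # hypothetical filter directions
strengths = np.linspace(-0.3, 0.3, 7)
costs = np.array([recurrent_cost(W, k, xs) for k in strengths])
cost_at_zero = recurrent_cost(W, 0.0, xs)
baseline = feedforward_cost(W, xs)
```

Repeating the scan for different values of `row_length` is one way to explore how the preferred strength depends on the input scale.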

63 Figure 3: Effect of adding recurrent interactions to the representation. [sent-93, score-0.644]

64 The cost function is plotted as a function of the recurrent interaction strength, for two different input scaling parameters. [sent-94, score-0.625]

65 5 Discussion The learned feedforward weights are similar to the results of another ICA model that can learn overcomplete representations [11]. [sent-95, score-0.526]

66 Our algorithm, however, does not need to perform approximate inference on a generative model. [sent-96, score-0.065]

67 Instead, it directly maximizes the mutual information between the outputs and inputs of a nonlinear network. [sent-97, score-0.269]

68 Our method also has the advantage of being able to learn recurrent connections that can enhance the representational power of the network. [sent-98, score-0.558]

69 We also note that this approach can be easily generalized to undercomplete representations by simply changing the order of the matrix product in the cost function. [sent-99, score-0.129]

70 Possible extensions of this work would be to optimize the nonlinearity that is used, or to adaptively change the number of output units to best match the input distribution. [sent-101, score-0.319]

71 6 Appendix 1: Relationship between input and output distributions In general, the relation between the input and output distributions is given by $P(s) = \int dx\, P(x) P(s|x)$. (15) [sent-103, score-0.489]

72 Since we use a deterministic mapping, the conditional distribution of the response given the input is given by $P(s|x) = \delta(s - g(Wx + Ks))$. [sent-105, score-0.148]

73 By adding independent Gaussian noise to the responses of the output units and considering the limit where the variance of the noise goes to zero, we can write this term as $P(s|x) = \lim_{\Delta \to 0} \frac{1}{(2\pi\Delta^2)^{N/2}} e^{-\frac{1}{2\Delta^2} \| s - g(Wx + Ks) \|^2}$. (16) [sent-106, score-0.273]

74 The output space can be partitioned into those points which belong to the image of the input space, and those which do not. [sent-108, score-0.204]

75 For points outside the image of the input space, P(s) = 0. [sent-109, score-0.098]

76 For small $\Delta$, we can expand $g(Wx + Ks) - s \approx X \delta x$, where X is P(s|x) (17) The expression in the square brackets is a delta function in x around $x_0$. [sent-112, score-0.132]

77 (15) we finally get $P(s) = \frac{P(x)\, O(s)}{\sqrt{\det(X^T X)}}$ (18) where the characteristic function O(s) is 1 if s belongs to the image of the input space and is zero otherwise. [sent-114, score-0.139]

78 Note that for the case when X is a square matrix (M = N), this expression reduces to the relation $P(s) = P(x) / |\det(X)|$. [sent-115, score-0.211]

79 7 Appendix 2: Derivation of the learning rules To derive the appropriate learning rules, we need to calculate the derivatives of E with respect to some set of parameters A. [sent-116, score-0.42]

80 In general, these derivatives are obtained from the expression: 7. [sent-117, score-0.077]
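The expression itself did not survive extraction. As a hedged reconstruction from the cost function E = -½⟨log det(X^T X)⟩ alone, and not a copy of the original equation, the identity d log det A = Tr(A^{-1} dA) gives the following general form. Here λ is a stand-in for any parameter (an entry of W or K), and X is the susceptibility matrix.

```latex
\begin{align}
\frac{\partial E}{\partial \lambda}
  &= -\frac{1}{2} \left\langle \mathrm{Tr}\!\left[ (X^T X)^{-1}
     \frac{\partial (X^T X)}{\partial \lambda} \right] \right\rangle \\
  &= -\left\langle \mathrm{Tr}\!\left[ (X^T X)^{-1} X^T
     \frac{\partial X}{\partial \lambda} \right] \right\rangle
\end{align}
```

The second line uses the symmetry of $X^T X$, so the two terms of the product rule contribute equally. The remaining work in each subsection is then to compute $\partial X / \partial \lambda$ for the parameter in question.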

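81 1 Feedforward weights In order to derive the learning rule for the weights W, we first calculate $\frac{\partial X_{ab}}{\partial W_{lm}} = \sum_e \left( \frac{\partial \Phi_{ae}}{\partial W_{lm}} W_{eb} + \Phi_{ae} \frac{\partial W_{eb}}{\partial W_{lm}} \right) = \Phi_{al} \delta_{bm} + \sum_e \frac{\partial \Phi_{ae}}{\partial W_{lm}} W_{eb}$. (20) From the definition of $\Phi$, we see that: [sent-118, score-0.423]

82 $\frac{\partial \Phi_{ae}}{\partial W_{lm}} = -\sum_{i,j} \Phi_{ai} \frac{\partial \Phi^{-1}_{ij}}{\partial W_{lm}} \Phi_{je}$ (21) and $\frac{\partial G^{-1}_{ij}}{\partial W_{lm}} = -\frac{\delta_{ij}}{(g'_i)^2} \frac{\partial g'_i}{\partial W_{lm}} = -\delta_{ij} \frac{g''_i}{(g'_i)^3} \frac{\partial s_i}{\partial W_{lm}}$, (22) where $g''_i \equiv g''\left( \sum_j W_{ij} x_j + \sum_k K_{ik} s_k \right)$. [sent-121, score-0.058]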

83 The derivatives of s also satisfy a recursion relation similar to Eq. [sent-122, score-0.217]

84 (19) and taking the trace, we get the gradient descent rule in Eq. [sent-124, score-0.252]

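85 2 Recurrent interactions To derive the learning rules for the recurrent weights K, we first calculate the derivatives of $X_{ab}$ with respect to $K_{lm}$: $\frac{\partial X_{ab}}{\partial K_{lm}} = \sum_e \frac{\partial \Phi_{ae}}{\partial K_{lm}} W_{eb} = -\sum_{e,i,j} \Phi_{ai} \frac{\partial \Phi^{-1}_{ij}}{\partial K_{lm}} \Phi_{je} W_{eb}$. (25) [sent-127, score-1.066]

86 From the definition of $\Phi$, we obtain: $\frac{\partial \Phi^{-1}_{ij}}{\partial K_{lm}} = -\frac{\delta_{ij}}{(g'_i)^2} \frac{\partial g'_i}{\partial K_{lm}} - \delta_{il} \delta_{jm}$. [sent-128, score-0.101]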

87 (26) The derivatives of g' are obtained from the following relations: (27) and (28) which results from a recursion relation similar to Eq. [sent-129, score-0.217]

88 Finally, after combining these results and calculating the trace, we get the gradient descent learning rule in Eq. [sent-131, score-0.299]

89 Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. [sent-141, score-0.031]

90 An information maximization approach to blind separation and blind deconvolution. [sent-150, score-0.299]

91 Local synaptic learning rules suffice to maximize mutual information in a linear network. [sent-158, score-0.399]

92 Statistical independence and novelty detection with information preserving nonlinear maps. [sent-162, score-0.035]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('recurrent', 0.482), ('owl', 0.354), ('overcomplete', 0.244), ('slx', 0.177), ('feedforward', 0.166), ('interactions', 0.162), ('mutual', 0.16), ('rules', 0.157), ('appendix', 0.152), ('ks', 0.122), ('xtx', 0.12), ('osi', 0.106), ('susceptibility', 0.106), ('output', 0.106), ('ae', 0.102), ('input', 0.098), ('maximizing', 0.094), ('lk', 0.086), ('rule', 0.084), ('blind', 0.083), ('expression', 0.081), ('relation', 0.081), ('wx', 0.077), ('derivatives', 0.077), ('connections', 0.076), ('units', 0.074), ('filters', 0.071), ('kiks', 0.071), ('oxab', 0.071), ('sejnowsld', 0.071), ('ica', 0.069), ('descent', 0.069), ('signals', 0.068), ('maximization', 0.067), ('generative', 0.065), ('bell', 0.062), ('rows', 0.061), ('lucent', 0.061), ('dxp', 0.061), ('hexagon', 0.061), ('iwi', 0.061), ('parra', 0.061), ('plotting', 0.061), ('wijxj', 0.061), ('recursion', 0.059), ('gradient', 0.058), ('ij', 0.058), ('strength', 0.053), ('derive', 0.052), ('infomax', 0.051), ('laboratories', 0.051), ('brackets', 0.051), ('negatively', 0.051), ('deterministic', 0.05), ('tj', 0.05), ('weights', 0.049), ('matrix', 0.049), ('scaled', 0.049), ('det', 0.048), ('learning', 0.047), ('entropy', 0.047), ('xij', 0.045), ('trace', 0.045), ('web', 0.045), ('principle', 0.045), ('cost', 0.045), ('network', 0.044), ('diagonal', 0.044), ('gi', 0.043), ('transformations', 0.041), ('technologies', 0.041), ('nonlinearity', 0.041), ('get', 0.041), ('calculate', 0.04), ('outputs', 0.038), ('transformation', 0.037), ('lea', 0.037), ('inputs', 0.036), ('information', 0.035), ('representations', 0.035), ('unit', 0.033), ('minimizing', 0.033), ('learned', 0.032), ('influence', 0.032), ('reconstruction', 0.032), ('understood', 0.032), ('recovered', 0.032), ('volume', 0.032), ('lj', 0.031), ('limit', 0.031), ('component', 0.031), ('noise', 0.031), ('orthogonal', 0.031), ('separation', 0.031), ('element', 0.031), ('jolliffe', 0.03), ('amplifies', 0.03), ('binational', 0.03), ('comprehensive', 0.03), ('gw', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

Author: Oren Shriki, Haim Sompolinsky, Daniel D. Lee

2 0.22082956 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

Author: Zhaoping Li, Peter Dayan

Abstract: Stimulus arrays are inevitably presented at different positions on the retina in visual tasks, even those that nominally require fixation. In particular, this applies to many perceptual learning tasks. We show that perceptual inference or discrimination in the face of positional variance has a structurally different quality from inference about fixed position stimuli, involving a particular, quadratic, non-linearity rather than a purely linear discrimination. We show the advantage taking this non-linearity into account has for discrimination, and suggest it as a role for recurrent connections in area V1, by demonstrating the superior discrimination performance of a recurrent network. We propose that learning the feedforward and recurrent neural connections for these tasks corresponds to the fast and slow components of learning observed in perceptual learning tasks.

3 0.14976561 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account

Author: Gal Chechik, Naftali Tishby

Abstract: The paradigm of Hebbian learning has recently received a novel interpretation with the discovery of synaptic plasticity that depends on the relative timing of pre and post synaptic spikes. This paper derives a temporally dependent learning rule from the basic principle of mutual information maximization and studies its relation to the experimentally observed plasticity. We find that a supervised spike-dependent learning rule sharing similar structure with the experimentally observed plasticity increases mutual information to a stable near optimal level. Moreover, the analysis reveals how the temporal structure of time-dependent learning rules is determined by the temporal filter applied by neurons over their inputs. These results suggest experimental prediction as to the dependency of the learning rule on neuronal biophysical parameters.

4 0.099233523 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

Author: Un-Min Bae, Soo-Young Lee

Abstract: We present an algorithm which compensates for the mismatches between characteristics of real-world problems and assumptions of independent component analysis algorithm. To provide additional information to the ICA network, we incorporate top-down selective attention. An MLP classifier is added to the separated signal channel and the error of the classifier is backpropagated to the ICA network. This backpropagation process results in estimation of expected ICA output signal for the top-down attention. Then, the unmixing matrix is retrained according to a new cost function representing the backpropagated error as well as independence. It modifies the density of recovered signals to the density appropriate for classification. For noisy speech signal recorded in real environments, the algorithm improved the recognition performance and showed robustness against parametric changes.

5 0.097412214 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

Author: Pedro A. d. F. R. Højen-Sørensen, Ole Winther, Lars Kai Hansen

Abstract: We propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case whereas linear response theory, which gives an improved estimate of covariances, is very efficient. The examples given are for sources without temporal correlations. However, this derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly.

6 0.096179098 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

7 0.095745109 147 nips-2000-Who Does What? A Novel Algorithm to Determine Function Localization

8 0.093119435 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

9 0.09215676 49 nips-2000-Explaining Away in Weight Space

10 0.072407641 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

11 0.071514361 121 nips-2000-Sparse Kernel Principal Component Analysis

12 0.070432566 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

13 0.067276277 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

14 0.064879306 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

15 0.063598476 22 nips-2000-Algorithms for Non-negative Matrix Factorization

16 0.063017793 124 nips-2000-Spike-Timing-Dependent Learning for Oscillatory Networks

17 0.061240438 36 nips-2000-Constrained Independent Component Analysis

18 0.061000805 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code

19 0.059803277 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

20 0.058025423 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.223), (1, -0.106), (2, -0.024), (3, 0.018), (4, 0.05), (5, 0.053), (6, -0.059), (7, -0.066), (8, 0.041), (9, -0.036), (10, -0.172), (11, -0.168), (12, 0.042), (13, 0.011), (14, 0.086), (15, -0.148), (16, -0.165), (17, -0.078), (18, 0.299), (19, 0.173), (20, 0.112), (21, -0.037), (22, 0.104), (23, -0.066), (24, 0.164), (25, 0.094), (26, 0.042), (27, 0.001), (28, -0.131), (29, -0.113), (30, 0.081), (31, 0.063), (32, 0.185), (33, -0.006), (34, 0.036), (35, -0.131), (36, -0.004), (37, 0.05), (38, 0.012), (39, 0.033), (40, 0.064), (41, -0.185), (42, -0.053), (43, -0.075), (44, 0.086), (45, -0.196), (46, -0.013), (47, 0.041), (48, 0.009), (49, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96272904 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

Author: Oren Shriki, Haim Sompolinsky, Daniel D. Lee

2 0.77668613 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

Author: Zhaoping Li, Peter Dayan

3 0.55488908 147 nips-2000-Who Does What? A Novel Algorithm to Determine Function Localization

Author: Ranit Aharonov-Barki, Isaac Meilijson, Eytan Ruppin

Abstract: We introduce a novel algorithm, termed PPA (Performance Prediction Algorithm), that quantitatively measures the contributions of elements of a neural system to the tasks it performs. The algorithm identifies the neurons or areas which participate in a cognitive or behavioral task, given data about performance decrease in a small set of lesions. It also allows the accurate prediction of performances due to multi-element lesions. The effectiveness of the new algorithm is demonstrated in two models of recurrent neural networks with complex interactions among the elements. The algorithm is scalable and applicable to the analysis of large neural networks. Given the recent advances in reversible inactivation techniques, it has the potential to significantly contribute to the understanding of the organization of biological nervous systems, and to shed light on the long-lasting debate about local versus distributed computation in the brain.

4 0.49065912 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account

Author: Gal Chechik, Naftali Tishby

5 0.39611515 49 nips-2000-Explaining Away in Weight Space

Author: Peter Dayan, Sham Kakade

Abstract: Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as backward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.

6 0.37371042 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

7 0.36598229 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

8 0.33336455 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

9 0.30934119 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

10 0.30023885 44 nips-2000-Efficient Learning of Linear Perceptrons

11 0.27627346 22 nips-2000-Algorithms for Non-negative Matrix Factorization

12 0.2646153 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

13 0.26198941 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

14 0.25594395 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

15 0.25369358 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

16 0.24609096 34 nips-2000-Competition and Arbors in Ocular Dominance

17 0.23440482 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

18 0.22323509 35 nips-2000-Computing with Finite and Infinite Networks

19 0.22103566 124 nips-2000-Spike-Timing-Dependent Learning for Oscillatory Networks

20 0.22012882 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.028), (17, 0.097), (33, 0.03), (54, 0.01), (55, 0.019), (62, 0.03), (65, 0.019), (67, 0.552), (76, 0.036), (79, 0.012), (81, 0.021), (90, 0.031), (97, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98705417 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

Author: Ilya Nemenman, William Bialek

Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory ('smoothness scale') can be determined self consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space (the 'Occam factor') that discriminates against models with more parameters [1, 2]. This works remarkably well for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT).
In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]? What happens if our learning problem is strongly atypical of the prior distribution? Following Ref. [5], if N i.i.d. samples $\{x_i\}$, i = 1 ... N, are observed, then the probability that a particular density Q(x) gave rise to these data is given by $P[Q(x)|\{x_i\}] = \frac{P[Q(x)] \prod_{i=1}^{N} Q(x_i)}{\int [dQ(x)]\, P[Q(x)] \prod_{i=1}^{N} Q(x_i)}$, (1) where P[Q(x)] encodes our a priori expectations of Q. Specifying this prior on a space of functions defines a QFT, and the optimal least square estimator is then $Q_{\mathrm{est}}(x|\{x_i\}) = \frac{\langle Q(x) Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(0)}}{\langle Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(0)}}$, (2) where $\langle \cdots \rangle^{(0)}$ means averaging with respect to the prior. Since $Q(x) \geq 0$, it is convenient to define an unconstrained field $\phi(x)$, $Q(x) = (1/\ell_0) \exp[-\phi(x)]$. Other definitions are also possible [6], but we think that most of our results do not depend on this choice. The next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning. We want the prior $P[\phi]$ to make sense as a continuous theory, independent of discretization of x on small scales. We also require that when we estimate the distribution Q(x) the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales.
For x in one dimension, a minimal choice is

$P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell^{2\eta-1}}{2} \int dx \left( \frac{\partial^{\eta} \phi}{\partial x^{\eta}} \right)^2 \right] \delta\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right]$, (3)

where $\eta > 1/2$, Z is the normalization constant, and the $\delta$-function enforces normalization of Q. We refer to $\ell$ and $\eta$ as the smoothness scale and the exponent, respectively.

In [5] this theory was solved for large N and $\eta = 1$:

$\left\langle \prod_{i=1}^{N} Q(x_i) \right\rangle^{(0)} \approx \ldots$ (4)
$S_{\rm eff} = \ldots$ (5)
$\ell\, \partial_x^2 \phi_{\rm cl}(x) + \ldots$ (6)

where $\phi_{\rm cl}$ is the 'classical' (maximum likelihood, saddle point) solution. In the effective action [Eq. (5)], it is the square root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite N, that the mean value of $\phi_{\rm cl}$ converges to the negative logarithm of the target distribution P(x) very quickly, and that the variance of fluctuations $\psi(x) \equiv \phi(x) - [-\log \ell_0 P(x)]$ falls off as $\sim 1/\sqrt{\ell N P(x)}$. Finally, it was speculated that if the actual $\ell$ is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale $\ell^*$.

At first glance the theory seems to look almost exactly like a Gaussian Process [4]. This impression is produced by a Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them.
In the semi-classical limit of large N, Gaussianity is restored approximately, but the classical solution is extremely non-trivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature.

The algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine x to a box $0 \le x \le L$ with periodic boundary conditions. The boundary value problem Eq. (6) is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always $10^{-5}$). The independent variable $x \in [0, 1]$ is discretized in equal steps [$10^4$ for Figs. (1.a-2.b), and $10^5$ for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of $\phi_{\rm cl}(x)$ is [5] $\delta x \sim \ldots$ (7)
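The relaxation (Newton) step described above can be sketched on a periodic grid. This is an illustration under stated assumptions, not the authors' implementation. It assumes that Eq. (6) has the form $\ell\, \partial_x^2 \phi_{\rm cl} + (N/\ell_0) e^{-\phi_{\rm cl}} = \sum_j \delta(x - x_j)$, which agrees with the 'exponential term' mentioned in the text but cannot be read off the text here. The function name `solve_classical`, the choice $\ell_0 = L$, the binning of samples into grid cells, the backtracking rule, and all numerical values are likewise assumptions.

```python
import numpy as np

def solve_classical(samples, ell, L=1.0, M=200, tol=1e-8, max_iter=100):
    """Damped Newton iteration for an assumed discretized form of Eq. (6).

    Residual on the grid (periodic in x):
        F(phi) = ell * Lap(phi) + (N / ell0) * exp(-phi) - counts / h
    """
    samples = np.asarray(samples, dtype=float)
    N = len(samples)
    h = L / M
    ell0 = L                      # assumption: ell_0 set to the box size
    c = N / ell0
    # Grid version of sum_j delta(x - x_j): samples per cell divided by h.
    idx = np.floor(samples / h).astype(int) % M
    source = np.bincount(idx, minlength=M) / h
    # Periodic second-difference matrix.
    eye = np.eye(M)
    lap = (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1) - 2.0 * eye) / h**2

    def residual(phi):
        return ell * (lap @ phi) + c * np.exp(-phi) - source

    phi = np.zeros(M)             # initial guess: uniform density
    F = residual(phi)
    for _ in range(max_iter):
        if np.max(np.abs(F)) < tol:
            break
        J = ell * lap - np.diag(c * np.exp(-phi))   # Jacobian of F
        step = np.linalg.solve(J, -F)
        t = 1.0
        while t > 1e-8:           # backtrack until the residual norm decreases
            trial = phi + t * step
            F_trial = residual(trial)
            if np.linalg.norm(F_trial) < np.linalg.norm(F):
                break
            t *= 0.5
        phi, F = trial, F_trial
    Q = np.exp(-phi) / ell0
    return phi, Q, h, np.max(np.abs(F))

# Illustrative data only: 100 samples from a wrapped Gaussian on [0, 1).
rng = np.random.default_rng(0)
samples = (0.5 + 0.1 * rng.standard_normal(100)) % 1.0
phi_cl, Q, h, res = solve_classical(samples, ell=0.05)
print("max residual:", res, " integral of Q:", Q.sum() * h)
```

Under the assumed equation, summing the residual over the periodic grid removes the derivative term, so a converged solution gives a density Q that integrates to one without any separate normalization step. A dense Jacobian is used only for brevity; at the grid sizes quoted in the text a banded or sparse solver would be the natural choice.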

same-paper 2 0.97801846 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

Author: Oren Shriki, Haim Sompolinsky, Daniel D. Lee

3 0.84612811 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

4 0.73943216 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account

Author: Gal Chechik, Naftali Tishby

5 0.66260028 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

Author: Sumio Watanabe

Abstract: Algebraic geometry is essential to learning theory. In hierarchical learning machines such as layered neural networks and Gaussian mixtures, the asymptotic normality does not hold, since Fisher information matrices are singular. In this paper, the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities, and two different problems are studied. (1) If the prior is positive, then the stochastic complexity is far smaller than BIC, resulting in a smaller generalization error than regular statistical models, even when the true distribution is not contained in the parametric model. (2) If Jeffreys' prior, which is coordinate free and equal to zero at singularities, is employed, then the stochastic complexity has the same form as BIC. It is useful for model selection, but not for generalization.

6 0.59271008 146 nips-2000-What Can a Single Neuron Compute?

7 0.58711743 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

8 0.57413733 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

9 0.56836301 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

10 0.56711149 44 nips-2000-Efficient Learning of Linear Perceptrons

11 0.56562835 21 nips-2000-Algorithmic Stability and Generalization Performance

12 0.55978715 124 nips-2000-Spike-Timing-Dependent Learning for Oscillatory Networks

13 0.55597669 22 nips-2000-Algorithms for Non-negative Matrix Factorization

14 0.55061376 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

15 0.54648304 79 nips-2000-Learning Segmentation by Random Walks

16 0.54402548 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

17 0.54396391 125 nips-2000-Stability and Noise in Biochemical Switches

18 0.54157811 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

19 0.53612161 49 nips-2000-Explaining Away in Weight Space

20 0.53562939 111 nips-2000-Regularized Winnow Methods